Table of Contents
- cs.CL [Total: 22]
- cs.CV [Total: 65]
- cs.CR [Total: 1]
- cs.CE [Total: 1]
- eess.IV [Total: 2]
- cs.RO [Total: 5]
- cs.CY [Total: 1]
- cs.AI [Total: 9]
- cs.LG [Total: 6]
- cs.GR [Total: 3]
cs.CL
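[1] Automatic Construction of a Legal Citation Graph from 100 Million Ukrainian Court Decisions: Large-Scale Extraction, Topological Analysis, and Ontology-Driven Clustering cs.CL | cs.DL | cs.IR
Volodymyr Ovcharov
TL;DR: The paper extracts 502 million citation edges from 100.7 million Ukrainian court decisions to build the first large-scale legal citation graph. Topological analysis and community detection show that citation structure reveals legal domain boundaries without supervision and predicts future legislative importance with near-perfect accuracy. Citation features also detect legislative regime changes and the effect of shocks such as the 2022 invasion.
Details
Motivation: Existing studies of legal citation graphs are limited in scale and lack methods for building and analyzing such graphs automatically from large judicial corpora. The paper uses the complete Ukrainian registry of court decisions to construct a large-scale citation graph automatically and to examine its topology, its ability to cluster legal domains, and its value for predicting legislative importance.
Result: On a 200-decision validation sample, citation extraction precision is 1.00 (95% Wilson CI: [0.982, 1.000]). Citation features predict the top-1000 articles with AUC = 0.9984, substantially outperforming a naive frequency baseline (P@1000 = 0.655). Community detection yields modularity Q = 0.44-0.55 and temporal stability NMI = 0.83-0.86.
Insight: The contributions are: (1) the first large-scale citation graph extracted from more than 100 million decisions; (2) a legal ontology built automatically by Louvain community detection on the co-citation projection; (3) citation features that predict legislative importance with high accuracy and detect regime changes; and (4) use of the citation-derived ontology as the domain layer of a workflow memory system for LLM-assisted legal analysis.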
Abstract: Half a billion citation edges extracted from 100.7 million Ukrainian court decisions reveal that judicial citation structure encodes legal domain boundaries without supervision and predicts future legislative importance with near-perfect accuracy. We construct the first large-scale citation graph from the complete EDRSR registry (99.5 million full texts, 1.1 TB), extracting 502 million citation links across six types via regex on commodity hardware in approximately 5 hours, with precision of 1.00 on a 200-decision validation sample (95% Wilson CI: [0.982, 1.000]). Three principal findings emerge. (1) The degree distribution follows a power law (alpha = 1.57 +/- 0.008), placing the Ukrainian court network near the EU Court of Justice and below the US Supreme Court, with hub articles cited by millions of decisions. (2) Louvain community detection on the co-citation projection recovers legal domain boundaries (civil, criminal, administrative, commercial) with modularity Q = 0.44-0.55 and temporal stability (NMI = 0.83-0.86 across periods), constituting an automatically constructed legal ontology grounded in judicial practice. (3) Citation features predict top-1000 articles with AUC = 0.9984, substantially outperforming a naive frequency baseline (P@1000 = 0.655); temporal dynamics detect legislative regime changes as phase transitions and the 2022 invasion as a citation entropy spike (H: 11.02 -> 13.49) with emergent wartime legislation nodes. The citation-derived ontology is operationalized as the domain layer of a workflow memory system for LLM-assisted legal analysis, connecting to the ontology-controlled paradigm. The extraction pipeline, analysis code, and aggregated statistics are released as open data.
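The Wilson interval quoted for the validation sample can be illustrated with a minimal sketch. It assumes that a precision of 1.00 corresponds to 200 correct items out of 200 and uses z = 1.96; neither detail is stated in the abstract, and the function name `wilson_interval` is illustrative. Under those assumptions the lower bound comes out near 0.98, in the same range as the reported interval; the exact third decimal depends on the z value and interval variant the author used.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Two-sided Wilson score interval for a binomial proportion."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1.0 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the boundaries.
    return max(0.0, center - half_width), min(1.0, center + half_width)


# Assumed reading of the validation result: 200 correct out of 200.
low, high = wilson_interval(200, 200)
print(f"95% Wilson CI: [{low:.3f}, {high:.3f}]")
```

Unlike the normal-approximation interval, the Wilson interval stays informative when the observed proportion is exactly 1, which is why it suits a sample with no observed errors.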
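[2] Capability Conditioned Scaffolding for Professional Human LLM Collaboration cs.CL
Sen Yang, Yinglei Ma
TL;DR: The paper proposes Capability Conditioned Scaffolding, a framework for professional human-LLM collaboration. It partitions a user's expertise into strong, mixed, and weak domains and adjusts the LLM's intervention behavior according to a structured capability profile, addressing the Professional Domain Drift that arises when users differ in their ability to evaluate model output. A pilot evaluation on multiple MMLU subsets and four LLM substrates shows intervention behavior consistent with the capability profile, including categorical inversion when profiles are swapped and selective activation in mixed-domain risk zones.
Details
Motivation: Existing LLM personalization methods mainly adapt to user preferences and style and do not account for differences in a user's evaluation capacity across domains of expertise. This can lead to Professional Domain Drift, in which users over-rely on AI-generated reasoning in domains they cannot reliably evaluate, undermining the reliability of the collaboration.
Result: In a pilot evaluation on multiple MMLU subsets and four LLM substrates, the method shows intervention behavior consistent with the capability profile, including categorical inversion under profile swapping and selective activation in mixed-domain risk zones. The results suggest that capability-aware scaffolding can make professional human-AI collaboration more reliable.
Insight: The contribution is a typed framework that partitions expertise into strong, mixed, and weak domains and conditions LLM intervention behavior on a structured capability profile, moving beyond stylistic personalization toward more reliable professional human-AI collaboration.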
Abstract: Large language model personalization typically adapts outputs to user preferences and style but does not account for differences in user evaluation capacity across domains of expertise. This limitation can encourage Professional Domain Drift, where users rely on AI generated reasoning in domains they cannot reliably evaluate. We introduce Capability Conditioned Scaffolding, a typed framework that partitions expertise into strong, mixed, and weak domains and conditions intervention behavior on structured capability profiles. A pilot evaluation across multiple MMLU subsets and four LLM substrates shows consistent profile conditioned intervention behavior, including categorical inversion under profile swapping and selective activation in mixed domain risk zones. These findings suggest that capability aware scaffolding can support more reliable professional human AI collaboration beyond stylistic personalization.
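[3] Neural Activation Patterns Across Language Model Architectures: A Comprehensive Analysis of Cognitive Task Performance cs.CL | cs.LG
Mahdi Naser-Moghadasi, Faezeh Ghaderi
TL;DR: The paper systematically analyzes neural activation patterns of six LLM architectures on twelve categories of cognitive tasks. By measuring final activation values, attention entropy, and sparsity patterns, it reveals fundamental differences in how encoder and decoder architectures process different cognitive tasks.
Details
Motivation: To understand how neural activation patterns differ across LLM architectures when they process cognitive tasks, and to inform model selection and optimization.
Result: Across 144 task-model combinations, mathematical reasoning produces the highest attention entropy in every architecture, and decoder models exhibit significantly higher sparsity than encoder models.
Insight: The contribution is a systematic measurement of attention entropy, sparsity, and related metrics that exposes architectural differences in cognitive task processing and provides empirical grounding for model selection in big data applications.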
Abstract: This paper presents a comprehensive analysis of neural activation patterns across six distinct large language model (LLM) architectures, examining their performance on twelve cognitive task categories. Through systematic measurement of final activation values, attention entropy, and sparsity patterns, we reveal fundamental differences in how encoder and decoder architectures process diverse cognitive tasks. Our analysis of 144 task-model combinations demonstrates that mathematical reasoning consistently produces the highest attention entropy across all architectures, while decoder models exhibit significantly higher sparsity patterns compared to encoder models. The findings provide critical insights into the computational characteristics of modern language models and their task-specific neural behaviors, with implications for model selection and optimization in big data applications.
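The two metrics named in this abstract can be sketched as follows. The abstract does not give the authors' exact definitions, so this assumes the common ones: attention entropy as the Shannon entropy (in bits) of one attention distribution, and sparsity as the fraction of activations with near-zero magnitude. The function names and the threshold are illustrative.

```python
import math


def attention_entropy(weights):
    """Shannon entropy in bits of one attention distribution (weights sum to 1)."""
    return -sum(w * math.log2(w) for w in weights if w > 0.0)


def sparsity(activations, threshold=1e-6):
    """Fraction of activations whose magnitude is at or below the threshold."""
    if not activations:
        raise ValueError("activations must be non-empty")
    near_zero = sum(1 for a in activations if abs(a) <= threshold)
    return near_zero / len(activations)


# Hypothetical attention rows: one diffuse, one focused on a single token.
diffuse = [0.25, 0.25, 0.25, 0.25]
focused = [1.0, 0.0, 0.0, 0.0]
print(attention_entropy(diffuse), attention_entropy(focused))
print(sparsity([0.0, 0.0, 1.3, -0.7]))
```

Higher entropy means attention is spread over more tokens, so the reported finding would correspond to mathematical reasoning inputs producing more diffuse attention rows.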
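[4] Reasoning Models Don’t Just Think Longer, They Move Differently cs.CL | cs.LG | stat.ML
Anders Gjølbye, Lars Kai Hansen, Sanmi Koyejo
TL;DR: The paper shows that when reasoning-trained language models generate chains of thought, their hidden-state trajectories not only grow longer with problem difficulty but also differ in geometry. After residualizing out the effect of length, reasoning models show more direct trajectories with more uniform curvature in the code domain, while the effect is weaker in mathematics and Boolean satisfiability.
Details
Motivation: Prior work observes that reasoning models generate more tokens on harder problems but does not distinguish between computing for more steps and following a structurally different internal trajectory. The paper analyzes hidden-state trajectories to determine whether reasoning training changes the model's internal path.
Result: Across competitive programming, mathematics, and Boolean satisfiability, length-corrected trajectory geometry is systematically coupled to problem difficulty. The effect is clearest in the code domain, where harder problems show more direct trajectories and less heterogeneous local curvature in reasoning-trained models than in matched instruction-tuned baselines. The coupling is weaker, but still present, in mathematics and Boolean satisfiability.
Insight: The contribution is to establish length correction as a prerequisite for trajectory analysis and to show that reasoning training is associated with distinct trajectory geometry, with an effect that depends on the domain. This offers a new perspective on the internal mechanisms of reasoning models.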
Abstract: Reasoning-trained language models often spend more tokens on harder problems, but longer chains of thought do not show whether a model is merely computing for more steps or following a different internal trajectory. We study this distinction through hidden-state trajectories during chain-of-thought generation across competitive programming, mathematics, and Boolean satisfiability. Raw trajectory geometry is strongly shaped by generation length: longer generations mechanically alter path statistics, so difficulty-dependent comparisons are misleading without adjustment. After residualizing trajectory statistics on length, difficulty remains systematically coupled to corrected trajectory geometry across all domains studied. The clearest reasoning-specific separation appears in the code domain, where harder problems show more direct corrected trajectories and less heterogeneous local curvature in reasoning-trained models than in matched instruction-tuned baselines. Corrected difficulty-geometry coupling is weaker, but still present, in mathematics and Boolean satisfiability. Prompt-stage linear probes do not mirror the code-domain separation, and behavioral annotations show that stronger corrected coupling co-occurs with strategy shifts and uncertainty monitoring. Together, these findings establish length correction as a prerequisite for generation-time trajectory analysis and show that reasoning training can be associated with distinct corrected trajectory geometry, with the strength of the effect depending on the domain.
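Residualizing a trajectory statistic on length can be sketched as an ordinary least squares fit followed by taking the residuals. This is a minimal illustration under the assumption of a single linear regressor (raw generation length); the abstract does not say which regressors or transforms the authors use, and the data below are made up.

```python
def residualize(values, lengths):
    """Residuals of the least squares fit values ~ intercept + slope * lengths."""
    n = len(values)
    if n != len(lengths) or n < 2:
        raise ValueError("need two equally sized sequences with at least 2 items")
    mean_x = sum(lengths) / n
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in lengths)
    if sxx == 0:
        raise ValueError("lengths must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(lengths, values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # What remains after removing the part of the statistic explained by length.
    return [y - (intercept + slope * x) for x, y in zip(lengths, values)]


# Hypothetical data: generation length in tokens and a raw path statistic.
lengths = [100, 200, 300, 400]
path_stat = [10.5, 20.0, 30.5, 40.0]
corrected = residualize(path_stat, lengths)
print(corrected)
```

The corrected values are uncorrelated with length by construction, so any remaining association with difficulty cannot be a mechanical consequence of longer generations.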
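[5] FINESSE-Bench: A Hierarchical Benchmark Suite for Financial Domain Knowledge and Technical Analysis in Large Language Models cs.CL
Dmitry Stanishevskii, Nini Kamkia, Alexey Khoroshilov, Dmitry Zmitrovich, Denis Kokosinskii
TL;DR: The paper presents FINESSE-Bench, a benchmark suite for hierarchical evaluation of the financial competence of large language models. It comprises eight specialized benchmarks with 3,993 questions in total, covering difficulty levels from foundational financial knowledge to expert-level reasoning.
Details
Motivation: Existing financial benchmarks such as FinQA and FinanceBench focus on question answering over financial reports or on task breadth. They do not provide an explicit hierarchy of professional difficulty or evaluate the transition from foundational knowledge to expert reasoning.
Result: The suite combines exam-oriented datasets inspired by professional certifications (CFA, CMT, CFTe), applied trading task collections, and a Russian-language olympiad benchmark. It is designed to evaluate domain breadth, performance degradation as difficulty increases, the ability to solve computational tasks, and model behavior in specialized financial domains.
Insight: The contribution is a hierarchical, multi-difficulty evaluation of financial competence, together with a unified evaluation protocol (multiple choice, numerical answers, and short open-ended answers) and an automated scoring scheme based on the LLM-as-judge paradigm, enabling more substantive evaluation of professionally relevant financial competence in LLMs.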
Abstract: Large language models (LLMs) are increasingly being applied to financial analysis, reporting, investment decision support, risk management, compliance, and professional training. However, robust evaluation of their domain competence in finance remains incomplete. Widely used open benchmarks such as FinQA, ConvFinQA, and TAT-QA have played an important role in advancing financial question answering and numerical reasoning, but they focus primarily on question answering over financial reports and do not provide an explicit hierarchy of professional difficulty. Broader resources, including FinanceBench, PIXIU, FinBen, and FLaME, expand the coverage of financial tasks, yet the problem of evaluating the transition from foundational knowledge to expert-level financial reasoning remains open. In this work, we present FINESSE-Bench, a suite of eight specialized benchmarks comprising 3,993 questions for hierarchical evaluation of financial competencies in LLMs. FINESSE-Bench combines exam-oriented datasets inspired by professional certifications (CFA-like Levels 1-3, CMT-like Level 2, and CFTe-like Level 1), applied trading task collections, and a Russian-language olympiad benchmark. This design enables evaluation of domain breadth, performance degradation as difficulty increases, the ability to solve computational tasks, and model behavior in specialized financial domains. We also describe a unified evaluation protocol covering multiple-choice questions, numerical answers, and short open-ended responses, together with an automated scoring scheme for freeform answers based on the LLM-as-judge paradigm. FINESSE-Bench is intended both as a complement to existing open financial benchmarks and as a tool for more substantive evaluation of professionally relevant financial competencies in large language models.
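[6] Process Rewards with Learned Reliability cs.CL | cs.AI | cs.LG
Jinyuan Li, Langlin Huang, Chengsong Huang, Shaoyang Xu, Donghong Cai
TL;DR: The paper proposes BetaPRM, a distributional process reward model that predicts both a step-level success probability and the reliability of that prediction. By modeling supervision with a Beta-Binomial likelihood, BetaPRM learns how reliable each reward signal is, so downstream applications can separate reliable predictions from uncertain ones. As an application, the authors introduce Adaptive Computation Allocation (ACA) for PRM-guided Best-of-N reasoning. Experiments show that BetaPRM improves Best-of-N selection while preserving step-level error detection, and that ACA reduces token usage while improving accuracy on four benchmarks.
Details
Motivation: Current process reward models (PRMs) usually output a single reward score per reasoning step. Downstream methods must treat these imperfect step-level predictions as reliable decision signals, with no indication of when they should be trusted.
Result: Experiments on four backbones and four reasoning benchmarks show that BetaPRM improves PRM-guided Best-of-N selection while preserving standard step-level error detection. Built on this signal, ACA improves the accuracy-token tradeoff over fixed-budget Best-of-16, reducing token usage by up to 33.57% while improving final-answer accuracy.
Insight: The contribution is to extend PRMs from point estimates to distributional estimates: a Beta-Binomial likelihood models the reliability of each step-success probability and yields a reliability signal. Downstream applications such as adaptive computation allocation can then allocate compute dynamically and prioritize uncertain candidate prefixes, improving both efficiency and accuracy.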
Abstract: Process Reward Models (PRMs) provide step-level feedback for reasoning, but current PRMs usually output only a single reward score for each step. Downstream methods must therefore treat imperfect step-level reward predictions as reliable decision signals, with no indication of when these predictions should be trusted. We propose BetaPRM, a distributional PRM that predicts both a step-level success probability and the reliability of that prediction. Given step-success supervision from Monte Carlo continuations, BetaPRM learns a Beta belief that explains the observed number of successful continuations through a Beta-Binomial likelihood, rather than regressing to the finite-sample success ratio as a point target. This learned reliability signal indicates when a step reward should be trusted, enabling downstream applications to distinguish reliable rewards from uncertain ones. As one application, we introduce Adaptive Computation Allocation (ACA) for PRM-guided Best-of-N reasoning. ACA uses the learned reliability signal to stop when a high-reward solution is reliable and to spend additional computation on uncertain candidate prefixes. Experiments across four backbones and four reasoning benchmarks show that BetaPRM improves PRM-guided Best-of-N selection while preserving standard step-level error detection. Built on this signal, ACA improves the accuracy–token tradeoff over fixed-budget Best-of-16, reducing token usage by up to 33.57% while improving final-answer accuracy.
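The Beta-Binomial likelihood mentioned in this abstract can be written down in a few lines. The sketch below assumes a standard (alpha, beta) parametrization and uses the Beta variance as a stand-in for the reliability signal; how BetaPRM actually parametrizes its belief and defines reliability is not specified in the abstract, and all function names are illustrative.

```python
import math


def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def beta_binomial_log_likelihood(k, n, alpha, beta):
    """Log-probability of k successes in n continuations under Beta(alpha, beta)."""
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_choose + log_beta(k + alpha, n - k + beta) - log_beta(alpha, beta)


def beta_mean_and_variance(alpha, beta):
    """Mean (step reward) and variance (lower means more reliable) of the belief."""
    total = alpha + beta
    return alpha / total, alpha * beta / (total * total * (total + 1.0))


# Two hypothetical beliefs with the same mean reward but different concentration.
for alpha, beta in [(3.0, 1.0), (30.0, 10.0)]:
    mean, var = beta_mean_and_variance(alpha, beta)
    nll = -beta_binomial_log_likelihood(6, 8, alpha, beta)
    print(f"alpha={alpha} beta={beta} mean={mean:.2f} var={var:.4f} nll={nll:.3f}")
```

Both beliefs predict the same success probability of 0.75, but the more concentrated one has a much smaller variance, which is the kind of distinction a single point-valued reward cannot express.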
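[7] Measuring Maximum Activations in Open Large Language Models cs.CL
Luxuan Chen, Han Tian, Xinran Chen, Rui Kong, Fang Wang
TL;DR: The paper systematically measures maximum activation magnitudes in modern open LLMs and studies how they vary across model families, architectures (dense, MoE, vision-language), training stages, and instruction-tuned variants. Maximum activations span nearly four orders of magnitude at comparable parameter counts and do not scale monotonically with model size; MoE models have markedly lower peaks than dense models, and the residual stream usually carries the global maximum.
Details
Motivation: The dynamic range of activations is a first-order constraint for low-bit quantization, activation scaling, and stable LLM inference. Prior studies were based on pre-2024 LLaMA-style models, while the open-model ecosystem has changed substantially since then. The paper therefore re-measures how large activations get in modern open LLMs to guide deployment.
Result: Under a unified pipeline (5,000 multi-domain samples, family-specific tokenization, and hooks on embeddings, hidden states, attention, MLP/MoE, SwiGLU gates, and final norm), the authors measure 27 checkpoints from 8 open families. Gemma3-27B-it reaches a maximum of about 7 x 10^5, while Qwen3.5 and MoE checkpoints stay in the 10^2 to 10^3 range; MoE checkpoints have peaks 14.0-23.4x lower than dense models of matched scale. A lightweight INT-8 sanity check shows that the measured maxima co-vary with low-bit reconstruction error via activation-scale selection.
Insight: The contribution is a systematic measurement of maximum activations across diverse post-LLaMA open LLM families, showing that the maxima are tied to family, architecture, and training stage rather than determined by size alone. The core takeaway is that maximum activation magnitude is a model property that should be measured before low-bit deployment and reported alongside open-weight releases, since it is key prior knowledge for quantization and deployment.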
Abstract: The dynamic range of activations is a first-order constraint for low-bit quantization, activation scaling, and stable LLM inference. Prior work characterized outlier features and massive activations on pre-2024 LLaMA-style models, and the downstream activation-quantization stack inherits that picture without revisiting it for the post-LLaMA open-model boom. We ask the deployment-oriented question: how large can activations get in modern open LLMs, and how does this magnitude vary across families, generations, and training stages? Under a unified pipeline (5,000-sample multi-domain corpus, family-specific tokenization, identical hooks across embeddings, hidden states, attention, MLP/MoE, SwiGLU gates, and final norm), we measure global and layerwise maxima on 27 checkpoints from 8 open families spanning dense, MoE, vision-language, intermediate-training, and instruction-tuned variants. We find that (i) global maxima span nearly four orders of magnitude at comparable parameter counts, with Qwen3.5 and MoE checkpoints in the 10^2 to 10^3 range and Gemma3-27B-it reaching ~7 x 10^5; (ii) cross-family and cross-generation comparisons break simple monotonic scaling; and (iii) MoE checkpoints exhibit 14.0-23.4x lower peaks than matched-scale dense counterparts, while the residual stream carries the global maximum in 22/24 checkpoints. A lightweight INT-8 sanity check shows that measured maxima co-vary with low-bit reconstruction error via activation-scale selection. We conclude that maximum activation magnitude is a model property tied to family, architecture, and training stage, not a simple byproduct of size, and should be measured and reported alongside any open-weight release before low-bit deployment. The code is publicly available at https://github.com/clx1415926/Max_act_llm.
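The link between maximum activation and low-bit reconstruction error can be illustrated with a toy example. The sketch assumes symmetric per-tensor absmax INT8 quantization, which is one common choice and not necessarily the scheme used in the paper's sanity check; the activation values are made up.

```python
def quantize_int8_absmax(values):
    """Symmetric per-tensor INT8 quantization; returns (integer codes, scale)."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def roundtrip_mse(values, count):
    """Mean squared error of quantize/dequantize over the first `count` values."""
    codes, scale = quantize_int8_absmax(values)
    restored = [c * scale for c in codes]
    return sum((v - r) ** 2 for v, r in zip(values[:count], restored[:count])) / count


# Hypothetical activations: ordinary values, then the same values plus one outlier.
ordinary = [0.1 * i for i in range(-10, 11)]
with_outlier = ordinary + [1000.0]

mse_clean = roundtrip_mse(ordinary, len(ordinary))
mse_outlier = roundtrip_mse(with_outlier, len(ordinary))
print(mse_clean, mse_outlier)
```

A single large activation stretches the scale so far that every ordinary value rounds to zero, which is why the measured maximum matters for choosing an activation scale.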
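[8] MHGraphBench: Knowledge Graph-Grounded Benchmarking of Mental Health Knowledge in Large Language Models cs.CL
Weixin Liu, Congning Ni, Shelagh A. Mulvaney, Susannah L. Rose, Murat Kantarcioglu
TL;DR: The paper presents MHGraphBench, a knowledge-graph-grounded benchmark of mental health knowledge that evaluates LLMs on entity recognition, relation judgment, and two-hop reasoning in the mental health domain. The benchmark is built from the PrimeKG knowledge graph, comprises nine task families, and is used to evaluate 15 closed- and open-source LLMs.
Details
Motivation: LLMs are increasingly used in the mental health domain, yet it remains unclear how well they capture the relevant biomedical knowledge and how reliably they apply it to clinically salient structured judgments. A dedicated benchmark is needed to evaluate this.
Result: Leading models achieve near-ceiling performance on entity typing and on the small relation-typing subset but still struggle with relation prediction and two-hop reasoning, revealing a persistent recognition-to-judgment gap. Output-format reliability also substantially affects measured performance under constrained multiple-choice settings.
Insight: The contribution is a knowledge-graph-grounded benchmark of mental health knowledge that exposes a recognition-to-judgment gap in LLMs. The study underlines the importance of response validity in benchmark-based evaluation and notes that results should be read as agreement with the knowledge graph under a specific interface, not as a direct assessment of real-world clinical safety.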
Abstract: Large language models (LLMs) are increasingly used in the mental health domain, yet it remains unclear how well they capture related biomedical knowledge and how reliably they apply it to clinically salient structured judgments. Here, we present a knowledge-graph (KG)-grounded benchmark for assessing LLMs on mental-health entity recognition, relation judgment, and two-hop reasoning. The benchmark is derived from PrimeKG and comprises nine task families with KG-supported answers and controlled negative options. Experiments across 15 closed- and open-source LLMs reveal a persistent recognition-to-judgment gap: leading models achieve near-ceiling performance on entity typing and on the small relation-typing subset, yet they still struggle with relation prediction and two-hop reasoning. Additionally, short KG-derived snippets benefit some models but degrade performance for others. Moreover, output-format reliability can substantially influence measured performance under constrained multiple-choice settings, highlighting the critical role of response validity in benchmark-based evaluation. MHGraphBench should therefore be interpreted as evaluating agreement with a curated mental-health slice of PrimeKG under a constrained multiple-choice interface, rather than as a direct assessment of real-world clinical safety.
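[9] Calibrating LLMs with Semantic-level Reward cs.CL | cs.LG
Fengfei Yu, Ruijia Niu, Dongxia Wu, Yian Ma, Rose Yu
TL;DR: The paper proposes CSR (Calibration with Semantic Reward), a framework for improving uncertainty calibration in large language models. Standard reinforcement learning with verifiable rewards (RLVR) uses a binary correctness reward that degrades calibration, and recent methods based on verbalized confidence are inconsistent at the semantic level. CSR calibrates the model directly in semantic space by combining the correctness reward with a novel semantic calibration reward that encourages semantic agreement among correct outputs and discourages spurious consistency among incorrect ones.
Details
Motivation: As LLMs are deployed in consequential settings such as medical question answering and legal reasoning, estimating how likely their outputs are to be correct is essential for safe and reliable use, which requires well-calibrated uncertainty. Existing methods fall short: RLVR degrades calibration, and verbalized-confidence rewards are inconsistent at the semantic level.
Result: Experiments with three model families on HotpotQA (in-distribution) and on TriviaQA, MSMARCO, and NQ-Open (out-of-distribution) show that CSR consistently achieves lower expected calibration error (ECE) and higher AUROC than verbalized-confidence baselines in nearly all settings. ECE drops by up to 40% and AUROC improves by up to 31%, and the calibration behavior generalizes robustly across all four evaluation settings.
Insight: The core contribution is a framework that calibrates directly in semantic space, using a novel semantic calibration reward in place of a verbalized confidence interface. This avoids confidence inconsistencies caused by surface-level textual variation with identical meaning. Moving the calibration target from the token level to the semantic level is a promising direction that may make uncertainty estimates more robust and reliable in tasks such as open-domain question answering.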
Abstract: As large language models (LLMs) are deployed in consequential settings such as medical question answering and legal reasoning, the ability to estimate when their outputs are likely to be correct is essential for safe and reliable use, requiring well-calibrated uncertainty. Standard reinforcement learning with verifiable rewards (RLVR) trains models with a binary correctness reward that is indifferent to confidence, providing no penalty for confident but wrong predictions and thereby degrading calibration. Recent work addresses this by training models to produce verbalized confidence scores alongside answers and rewarding agreement with correctness. However, verbalized confidence is calibrated at the token level and thus exhibits inconsistency across textual variations with the same semantic meaning. We propose Calibration with Semantic Reward (CSR), a framework that calibrates language models directly in semantic space without a verbalized confidence interface. CSR combines the correctness reward with a novel semantic calibration reward that encourages exploitation among correct rollouts by promoting semantic agreement, and exploration among incorrect ones by discouraging spurious consistency. Experiments across three model families on HotpotQA (in-distribution) and TriviaQA, MSMARCO, and NQ-Open (out-of-distribution) show that CSR consistently achieves lower ECE and higher AUROC than verbalized-confidence baselines across nearly all settings, reducing ECE by up to 40% and improving AUROC by up to 31%, with calibration behavior generalizing robustly across all four evaluation settings.
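Expected calibration error, the metric reported in this abstract, can be sketched as follows. The example assumes equal-width confidence bins, which is the usual definition; the number of bins and how CSR derives a confidence value for each answer are not given in the abstract, and the inputs here are made up.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width bins: weighted mean of |confidence - accuracy| per bin."""
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("need two non-empty sequences of equal length")
    bins = [[] for _ in range(n_bins)]
    for conf, is_correct in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[index].append((conf, 1.0 if is_correct else 0.0))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(a for _, a in members) / len(members)
        ece += (len(members) / total) * abs(avg_conf - accuracy)
    return ece


# Hypothetical model that is always fully confident but right only half the time.
print(expected_calibration_error([1.0, 1.0, 1.0, 1.0], [True, True, False, False]))
```

A binary correctness reward gives this overconfident model no incentive to lower its confidence, which is the calibration problem the abstract attributes to standard RLVR.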
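[10] Syntax Without Semantics: Teaching Large Language Models to Code in an Unseen Language cs.CL | cs.LG
Vinayshekhar Bannihatti Kumar, Disha Makhija, Manoj Ghuhan Arivazhagan, Rashmi Gangadharaiah
TL;DR: The paper studies LLM code generation in an unseen programming language by introducing PyLang, a new minimal imperative language absent from pretraining corpora. Fine-tuning quickly teaches the model the new language's syntax but fails to transfer semantic competence, so performance on PyLang consistently lags well behind Python, revealing an implementation fidelity gap.
Details
Motivation: To determine whether the strong performance of LLMs on code generation benchmarks transfers to programming languages never seen in pretraining, and thereby to understand the nature and limits of their code generation ability.
Result: On 352 problems, fine-tuned Qwen3 models perform up to 19% worse on PyLang than on Python, and none of the interventions tried (multi-task learning, preference tuning, and others) closes the gap. Frontier models select the same algorithm as in Python 80% of the time yet cannot turn it into a working PyLang implementation. CKA analysis shows that fine-tuned models have nearly identical internal representations across languages (CKA > 0.97) but diverge at the output stage.
Insight: The core contribution is identifying the implementation fidelity gap in LLM code generation: models have language-agnostic algorithmic understanding but cannot express it in an unfamiliar language. This highlights the need for training methods that decouple reasoning from language-specific realization.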
Abstract: Large language models (LLMs) achieve high pass rates on code generation benchmarks, yet whether they can transfer this ability to languages absent from pretraining remains poorly understood. We introduce PyLang, a minimal imperative language absent from all pretraining corpora, and evaluate frontier models zero-shot and fine-tuned Qwen3 (4B, 8B, 32B) on 352 problems. We find that fine-tuning quickly teaches syntax but fails to transfer semantic competence: Python outperforms PyLang by up to 19% across all configurations, and no intervention (multi-task learning, preference tuning, code infilling, or latent-space objectives) closes the gap. An LLM judge reveals that frontier models select an identical algorithm to Python 80% of the time, yet cannot translate it into a working PyLang implementation, and CKA analysis confirms that fine-tuned models converge to nearly identical internal representations across languages (CKA > 0.97) while diverging at the output stage. We term this the implementation fidelity gap: models possess language-agnostic algorithmic understanding but cannot express it in an unfamiliar language. Our findings highlight the need for training methods that decouple reasoning from language-specific realization.
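[11] PSD: Pushing the Pareto Frontier of Diffusion LLMs via Parallel Speculative Decoding cs.CL
Shengyin Sun, Yiming Li, Renxi Liu, Xinqi Li, Hui-Ling Zhen
TL;DR: The paper proposes Parallel Speculative Decoding (PSD), a training-free framework for improving the inference efficiency of diffusion large language models (dLLMs). Using the confidence scores from a single forward pass, PSD adaptively selects positions to unmask and builds multi-depth speculative drafts, then preserves generation quality through batched verification and hierarchical acceptance. PSD achieves a favorable trade-off between efficiency and quality, reaching up to 5.5x tokens per forward pass on reasoning and code generation tasks with accuracy comparable to greedy decoding.
Details
Motivation: Diffusion LLMs generate text by iterative denoising. Although all masked positions can be predicted in parallel within each step, the large number of denoising iterations still makes inference expensive. Existing methods reduce the cost either spatially, by unmasking multiple tokens per step, or temporally, by collapsing several steps into one verification call, which leaves room for improving both at once.
Result: On three dLLMs across reasoning and code generation tasks, PSD reaches up to 5.5x tokens per forward pass while keeping accuracy comparable to greedy decoding, a favorable trade-off between efficiency and quality.
Insight: The contribution is a training-free framework that improves the spatial and temporal axes jointly: confidence scores from a single forward pass drive adaptive selection of positions to unmask and construction of multi-depth drafts, and batched verification with hierarchical acceptance keeps the result consistent. This offers an efficient, quality-preserving way to accelerate diffusion model inference.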
Abstract: Diffusion large language models (dLLMs) generate text by iteratively denoising masked token sequences. Although dLLMs can predict all masked positions in parallel within each step, the large number of denoising iterations still makes inference expensive. This cost can be reduced spatially by unmasking multiple tokens per step, or temporally by collapsing multiple denoising steps into one verification call. We propose Parallel Speculative Decoding (PSD), a training-free framework that jointly improves inference along both axes. Using the confidence scores from a single forward pass, PSD selects positions to unmask via a configurable, adaptive unmasking policy and constructs multi-depth speculative drafts without extra model calls. A final batched verification pass then applies hierarchical acceptance, keeping the deepest draft that remains consistent with the updated predictions. Experiments on three dLLMs across reasoning and code generation tasks show that PSD achieves favorable trade-offs between inference efficiency and generation quality, reaching up to 5.5x tokens per forward pass with accuracy comparable to greedy decoding.
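[12] ASRU: Activation Steering Meets Reinforcement Unlearning for Multimodal Large Language Models cs.CL | cs.AI
Jiahui Guang, Yingjie Zhu, Cuiyun Gao, Haiyan Wang, Jing Li
TL;DR: The paper proposes ASRU, a framework for unlearning sensitive cross-modal information in multimodal large language models (MLLMs). It combines activation steering with reinforcement unlearning to forget target knowledge while optimizing generation quality.
Details
Motivation: Existing machine unlearning (MU) methods usually evaluate unlearning only through output deviations and overlook generation quality after unlearning. This easily leads to hallucinated or rigid responses, harming the usability and safety of the model.
Result: Experiments on Qwen3-VL show that ASRU, using only a small amount of retained supervision data, significantly improves unlearning effectiveness (+24.6% on average) and generation quality (5.8x on average) while effectively preserving model utility.
Insight: The contribution is to treat generation quality as a core evaluation objective and to design a controllable multimodal unlearning framework: activation redirection induces initial refusal behavior, and a customized reward function then optimizes fine-grained refusal boundaries, giving a better trade-off between unlearning and model utility.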
Abstract: Multimodal large language models (MLLMs) may memorize sensitive cross-modal information during pretraining, making machine unlearning (MU) crucial. Existing methods typically evaluate unlearning effectiveness based on output deviations, while overlooking the generation quality after unlearning. This can easily lead to hallucinated or rigid responses, thereby affecting the usability and safety of the unlearned model. To address this issue, we propose ASRU, a controllable multimodal unlearning framework that incorporates generation quality as a core evaluation objective. ASRU first induces initial refusal behavior through activation redirection, and then optimizes fine-grained refusal boundaries using a customized reward function, thereby achieving a better trade-off between target knowledge unlearning and model utility. Experiments on Qwen3-VL show that ASRU significantly improves both unlearning effectiveness (+24.6% on average) and generation quality (5.8x on average) while effectively preserving model utility, using only a small amount of retained supervision data.
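[13] VCG-Bench: Towards A Unified Visual-Centric Benchmark for Structured Generation and Editing cs.CL | cs.CV
Xiaoyan Su, Peijie Dong, Zhenheng Tang, Song Tang, Yuyao Zhai
TL;DR: The paper presents VCG-Bench, a unified visual-centric benchmark for structured diagram generation and editing. It introduces a Diagram-as-Code paradigm that uses mxGraph XML for precise diagram generation and editing, and includes a diverse multi-domain dataset, task definitions, and a multi-dimensional evaluation protocol.
Details
Motivation: Current vision-language models fall short on the structured, controllable diagram tasks that professional workflows require. Existing methods rely mainly on pixel-based synthesis, which limits editability and fidelity.
Result: Experiments show that current state-of-the-art vision-language models struggle with structural fidelity and instruction compliance, reflecting limitations in their vision and reasoning capabilities.
Insight: The core contribution is the Diagram-as-Code paradigm with symbolic logic, which uses mxGraph XML as an intermediate representation, together with a unified benchmark with multi-dimensional metrics for systematically evaluating models on structured visual tasks.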
Abstract: Despite the rapid advancements in Vision-Language Models (VLMs), a critical gap remains in their ability to handle structured, controllable diagrammatic tasks essential for professional workflows. Existing methods predominantly rely on pixel-based synthesis, which operates in probabilistic pixel spaces and is inherently limited in editability and fidelity. Instead, we propose a new Diagram-as-Code paradigm with symbolic logic that leverages mxGraph Extensible Markup Language (XML) for precise diagram generation and editing. We present VCG-Bench, a unified benchmark for visual-centric mxGraph tasks. VCG-Bench comprises: (1) a taxonomized dataset of 1,449 diverse diagrams spanning 6 domains and 15 sub-domains, (2) a paradigm definition that integrates Generation (Vision-to-Code) and Editability (Code-to-Code), and (3) a tailored evaluation protocol employing multi-dimensional metrics such as mxGraph Execution Success Rate and Style Consistency Score (SCS), among others. Experimental results highlight the challenges faced by current State-of-the-Art (SOTA) VLMs in structured fidelity and instruction compliance, reflecting their vision and reasoning capabilities.
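[14] SMMBench: A Benchmark for Source-Distributed Multimodal Agent Memory cs.CL
Huacan Chai, Yukai Wang, Yingxuan Yang, Dan Peng, Yuanyi Song
TL;DR: The paper introduces SMMBench, a benchmark that evaluates whether multimodal agents can retrieve, align, and compose multimodal evidence scattered across multiple independent sources (conversations, images, documents, and so on) instead of reasoning only within a single pre-assembled context. The benchmark contains 1877 samples and evaluates four core capabilities: cross-source multimodal reasoning, conflict resolution, preference reasoning, and memory-grounded action prediction. Experiments show that current mainstream systems still struggle with these capabilities.
Details
Motivation: Existing benchmarks mainly evaluate multimodal memory reasoning within pre-assembled contexts and under-evaluate whether agents can use evidence distributed across independent sources. The authors argue that source-distributed memory composition is an important and under-examined bottleneck in multimodal agent memory.
Result: Experiments on representative memory-style and retrieval-based baselines show that current systems still perform poorly on these core capabilities, indicating that source-distributed multimodal memory is an important and still under-evaluated challenge.
Insight: The contribution is to identify and formalize source-distributed multimodal memory composition as a problem and to build a dedicated benchmark for it. The work shifts evaluation from a single curated context to a more realistic and more complex setting of distributed information sources, pointing to a new direction for research on memory systems for multimodal agents.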
Abstract: Existing benchmarks for multimodal memory reasoning largely evaluate systems within pre-assembled contexts, but under-evaluate whether agents can use evidence distributed across independently originated sources. We argue that source-distributed memory composition is an important and under-examined bottleneck in multimodal agent memory, especially when relevant evidence is fragmented across heterogeneous artifacts such as conversations, profiles, screenshots, tables, images, and documents. To address this gap, we introduce the Source-distributed Multimodal Memory Benchmark (SMMBench), which measures whether agents can retrieve, align, and compose multimodal evidence scattered across multiple sources rather than reason within a single curated context. SMMBench evaluates four core capabilities: (1) cross-source multimodal reasoning; (2) conflict resolution; (3) preference reasoning; (4) memory-grounded action prediction. The benchmark contains 1877 samples grounded in 264 sources. Experiments on representative memory-style and retrieval-based baselines show that current systems still struggle on these capabilities, positioning source-distributed multimodal memory as an important and still under-evaluated challenge for multimodal agents. Our data are available at https://huggingface.co/datasets/HuacanChai/SMMBench.
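[15] ForMaT: Dataset for Visually-Grounded Multilingual PDF Translation cs.CL
Michał Ciesiółka, Dawid Wiśniewski, Adrian Charkiewicz, Kamil Guttmann
TL;DR: The paper introduces ForMaT, a parallel corpus of 3,956 PDF documents across 15 language pairs, designed for multimodal machine translation and preserving the layout metadata of the original documents. To ensure diversity of visual structure, the authors apply K-Medoids sampling over 45 geometric features, focusing on visually diverse PDFs that contain complex elements such as nested tables and formulas. The dataset is intended as a benchmark for layout-aware translation models that integrate visual and textual context for high-fidelity document reconstruction.
Details
Motivation: Current machine translation systems struggle to keep text linked to its visual context (spatial layout and geometric structure) when processing PDF documents, so format and structure are lost in the translated document. ForMaT addresses this by providing a dedicated benchmark for layout-aware multimodal translation.
Result: The evaluation shows that existing MT systems perform poorly on spatial grounding and geometric synchronization on ForMaT and often break the link between text and its visual context. The dataset is proposed as a new benchmark to drive translation models capable of high-fidelity document reconstruction.
Insight: The core contribution is a parallel PDF corpus built specifically for multimodal translation with an emphasis on layout fidelity. K-Medoids sampling over geometric features ensures diversity of visual structure, giving a new and more challenging benchmark for evaluating and developing translation models that must understand document layout.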
Abstract: We present ForMaT (Format-Preserving Multilingual Translation), a parallel corpus of 3,956 PDFs across 15 language pairs that preserves original layout metadata and is intended for multimodal machine translation. To ensure structural diversity in the dataset, we employ K-Medoids sampling over 45 geometric features, capturing complex elements like nested tables and formulas to focus only on visually diverse PDF documents. Our evaluation reveals that current MT systems struggle with spatial grounding and geometric synchronization, often losing the link between text and its visual context. ForMaT provides a benchmark for developing layout-aware translation models that integrate visual and textual context for high-fidelity document reconstruction.
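[16] Reference-Free Reinforcement Learning Fine-Tuning for MT: A Seq2Seq Perspective cs.CL | cs.AI
Ernesto Garcia-Estrada, Carlos Escolano, José A. R. Fonallosa
TL;DR: The paper proposes a reference-free reinforcement learning fine-tuning method based on Group Relative Policy Optimization (GRPO) for improving encoder-decoder Seq2Seq machine translation models. The method uses a hybrid reward that combines LaBSE and COMET-Kiwi, requires no parallel data, and is validated on 13 typologically diverse languages.
Details
Motivation: Production machine translation systems rely mainly on encoder-decoder Seq2Seq models, but reinforcement learning fine-tuning methods have mostly targeted decoder-only large language models, and encoder-decoder architectures have not been studied systematically.
Result: On NLLB-200, the method yields consistent improvements on all 13 languages, up to +5.03 chrF++ for Traditional Chinese. Without any target-language data, it is competitive with 3-epoch supervised fine-tuning on morphologically complex languages.
Insight: The contribution is to combine GRPO with a hybrid reference-free reward in a way that suits encoder-decoder architectures. Gains are largest where baseline performance is weakest and reward discriminability is highest, which is exactly where parallel data is scarcest, so the approach has practical value.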
Abstract: Production machine translation relies overwhelmingly on encoder-decoder Seq2Seq models, yet reinforcement learning approaches to MT fine-tuning have largely targeted decoder-only LLMs at ≥7B parameters, with limited systematic study of encoder-decoder architectures. We apply Group Relative Policy Optimization to NLLB-200 (600M and 1.3B) using a hybrid reference-free reward (LaBSE and COMET-Kiwi) that requires no parallel data at fine-tuning time, evaluating across 13 typologically diverse languages. GRPO yields consistent improvements on all 13 languages, up to +5.03 chrF++ for Traditional Chinese, and, without any target-language data, competes with 3-epoch supervised fine-tuning on morphologically complex languages. We identify a consistent empirical pattern in which gains are largest where baseline performance is weakest and reward discriminability is highest, making this approach most effective precisely where parallel data is scarcest, and replicate this pattern across English and Spanish source languages.
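The group-relative advantage at the heart of GRPO can be sketched in a few lines. The example assumes that the two reference-free scores are already computed and combines them with an equal-weight average; the abstract does not say how LaBSE and COMET-Kiwi are weighted, so `hybrid_reward` and its `weight` parameter are hypothetical, and the scores below are made up.

```python
import statistics


def hybrid_reward(labse_score, comet_kiwi_score, weight=0.5):
    """Hypothetical convex combination of two precomputed reference-free scores."""
    return weight * labse_score + (1.0 - weight) * comet_kiwi_score


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward by the mean and standard deviation of its group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Made-up (LaBSE, COMET-Kiwi) scores for four sampled translations of one source.
scores = [(0.82, 0.71), (0.90, 0.80), (0.60, 0.55), (0.85, 0.78)]
rewards = [hybrid_reward(labse, kiwi) for labse, kiwi in scores]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Because each advantage is relative to the other samples for the same source sentence, a group in which the reward cannot tell candidates apart yields near-zero advantages, which is consistent with the abstract's observation that gains track reward discriminability.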
[17] Linked Multi-Model Data on Russian Domestic and Foreign Policy Speeches cs.CLPDF
Daria Blinova, Gayathri Emuru, Rakesh Emuru, Kushagradheer Shridheer Srivastava, Mina Rulis
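TL;DR: This paper introduces a linked dataset of multimodal political communications from the Russian government, addressing the limited availability of social text and image data for authoritarian political contexts. The dataset comprises two large corpora of official speeches delivered over multiple decades by senior officials in the Kremlin and the Russian Ministry of Foreign Affairs. Each speech comes with Russian and English texts, associated images and captions where available, and harmonized metadata such as dates, speakers, geolocations, and official government content tags. Unique identifiers link images to speeches and align the Russian and English versions of the same communication. The dataset is further augmented with topical annotations generated via transformer-based multimodal topic modeling and validated by an expert. It supports multimodal, multilingual, temporal, and/or spatial analyses of (authoritarian) political communication and offers a valuable testbed for social science research and large language model (LLM) applications in political domains.
Details
Motivation: To address the persistent shortage of social text and image data for authoritarian political contexts and to provide data for related research.
Result: The authors build a large linked dataset that is multimodal, multilingual, annotated, and accompanied by metadata. It supports multi-dimensional analysis of political communication and can serve as a testbed for LLMs in political domains.
Insight: Unique identifiers provide precise linking and alignment of multimodal data (text and images), and transformer-based multimodal topic modeling combined with expert validation yields high-quality topical annotations, increasing the dataset's usability and research value.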
Abstract: This paper introduces a dataset of interlinked multimodal political communications from the Russian government, addressing persistent deficiencies in the availability of social text- and image-based data for authoritarian politics contexts. The dataset comprises two large corpora of official speeches delivered by senior actors within the Kremlin and the Russian Ministry of Foreign Affairs over multiple decades. For each speech, we provide Russian- and English-language texts, associated images and captions where available, and harmonized metadata including (e.g.) dates, speakers, (geo)locations, and official government content tags. Unique identifiers link images to speeches and align Russian and English versions of the same communication texts. We further augment these linked datasets with validated topical annotations for both speech texts and speech images, which are generated via transformer-based multimodal topic modeling and refined by a Russian politics expert. The resulting data resources support multimodal, multilingual, temporal, and/or spatial analyses of (authoritarian) political communication and offer a valuable testbed for social science research and large language model (LLM) applications in political domains.
[18] Ontology for Policing: Conceptual Knowledge Learning for Semantic Understanding and Reasoning in Law Enforcement Reports cs.CL | cs.AI | cs.LOPDF
Anita Srbinovska, Jansen Orfan, Adrian Martin, Ernest Fokoué
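TL;DR: This paper proposes a framework based on symbolic methods for converting the narrative text of law enforcement reports into evidence-linked facts. The goal is to recover incident details from unstructured text and build temporal graphs with time cues and domain axioms.
Details
Motivation: Narratives in law enforcement reports contain many incident facts expressed in natural language, and reading them manually is inefficient. The paper aims to extract these facts automatically to support review, police training, and investigations.
Result: Evaluated on 450 property crime reports, 54.1% of the extracted events had a confidence score of at least 0.80, and 93.7% were mapped through the PropBank–VerbNet–WordNet semantic path. Agreement reached 100% on incident initiation, stolen items, and temporal cues, but was lower for the interpretation of forced entry.
Insight: The contributions include a symbolic approach that combines semantic parsing, predicate-to-ontology mapping, and reasoning, and the construction of temporal graphs to support semantic understanding and reasoning in the law enforcement domain. The approach appears practically useful for structuring unstructured law enforcement data.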
Abstract: Law enforcement reports contain structured fields and written narratives. However, many incident facts that are needed for review, police training, and investigations are in natural language and require manual reading. We propose a framework using symbolic methods for converting narratives into evidence-linked facts. Our objective is to measure the value of narratives by recovering incident details from the unstructured text alone and building temporal graphs with time cues and domain axioms. We achieve this through redaction of personal identifiers, semantic parsing, predicate-to-ontology mapping, and reasoning. We evaluate the symbolic approach on 450 property crime reports and a short human review. Of the events extracted by the system, 54.1% had a confidence score of at least 0.80 and 93.7% were mapped through the PropBank–VerbNet–WordNet semantic path. Agreement reached 100% on incident initiation, stolen items, and temporal cues, and was lower for the interpretation of forced entry.
[19] Multi-Level Contextual Token Relation Modeling for Machine-Generated Text Detection cs.CLPDF
Chenwang Wu, Yiuming Cheung, Bo Han, Shuhai Zhang, Defu Lian
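TL;DR: This paper proposes a multi-level contextual token relation modeling framework for machine-generated text detection. It analyzes existing metric-based methods from a unified perspective and identifies a core challenge: token-level detection scores are easily biased by the randomness of the generation process. It then theoretically derives the multi-hop transitions of these scores and models their local and global relations. A lightweight Markov-informed calibration module refines the aggregation of local evidence, and a rule-support reasoning module based on contextual statistics captures global logic, yielding broad and substantial improvements across a range of real-world scenarios.
Details
Motivation: Machine-generated texts (for example, disinformation and phishing content) pose risks and call for reliable detection. Metric-based methods are more practical than complex model-based ones, but their designs vary widely and their token-level scores are sensitive to generation randomness, so a unified framework is needed to analyze and address this core challenge.
Result: Extensive experiments show broad and substantial improvements across real-world scenarios, including cross-LLM and cross-domain settings, with low computational overhead.
Insight: The novelty lies in analyzing metric-based detection methods within a unified framework, theoretically deriving the multi-hop transitions of token-level scores, and modeling local and global contextual relations through Markov-informed calibration and rule-support reasoning, respectively. The resulting multi-level inference framework balances performance and efficiency.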
Abstract: Machine-generated texts (MGTs) pose risks such as disinformation and phishing, underscoring the need for reliable detection. Metric-based methods, which extract statistically distinguishable features of MGTs, are often more practical than complex model-based methods that are prone to overfitting. Given their diverse designs, we first place representative metric-based methods within a unified framework, enabling a clear assessment of their advantages and limitations. Our analysis identifies a core challenge across these methods: the token-level detection score is easily biased by the inherent randomness of the MGTs generation process. Then, we theoretically derive the multi-hop transitions of the token-level detection score and explore their local and global relations. Based on these findings, we propose a multi-level contextual token relation modeling framework for MGT detection. Specifically, for local relations, we model them through a lightweight Markov-informed calibration module that refines token-level evidence before aggregation. For global relations, we introduce a rule-support reasoning module that uses explicit logical rules derived from contextual score statistics. Finally, we combine the local calibrated score and the global rule-support reasoning signal in a joint multi-level inference framework. Extensive experiments show broad and substantial improvements across various real-world scenarios, including cross-LLM and cross-domain settings, with low computational overhead.
[20] Can Vision Language Models Be Adaptive in Mathematics Education? A Learner Model-based Rubric Study cs.CL | cs.AIPDF
Jie Gao, Yongan Yu, Junzhu Su, Yiran Lin, Adam K. Dube
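TL;DR: This study examines whether vision language models (VLMs) can adapt their instruction to a learner model in mathematics education. Drawing on the learner model from the adaptive learning framework, the authors propose a rubric covering three dimensions (cognitive, motivational, and complexity) and additionally evaluate the correctness and quality of model responses. Experiments show that existing VLMs differ in adaptivity and struggle to consistently produce learner model-based instructional responses when learner information is limited.
Details
Motivation: Adaptive learning technologies adjust the instructional process to learners' performance and are critical for building effective learning tools. VLMs are already used in mathematics education, but it is unclear whether they can adapt mathematical instruction to different learner profiles, and a systematic evaluation framework is lacking.
Result: Experiments reveal measurable differences in adaptivity across models and show that current VLMs struggle to consistently produce learner model-based instructional responses, especially with limited learner information.
Insight: The novelty lies in turning the learner model from the adaptive learning framework into a concrete rubric (cognitive, motivational, complexity), which gives a systematic way to evaluate VLM adaptivity in mathematics education and may help advance personalized educational technology.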
Abstract: Adaptive learning refers to educational technologies that track learners’ learning progress and adapt the instructional process based on individual learners’ learning performance. It is increasingly recognized as critical for developing an effective learning support tool. Vision language models (VLMs) have seen adoption in mathematics education, and students have been using them as learning aids for personalized instruction. However, it is unknown whether VLMs have the ability to adapt to different learner profiles when providing mathematical instructions. Current VLMs lack a systematic evaluation framework for this adaptivity to different learner profiles in mathematics tutoring tasks. To address this gap, we draw on the learner model from the adaptive learning framework (Shute and Towle, 2018) and propose a learner model-based rubric. Our rubric formalizes adaptivity assessment into three aspects: cognitive aspects, motivational aspects, and complexity. We also evaluate two additional dimensions of VLM responses: correctness (of answers and solutions) and quality (of the response itself). Our experimental results show measurable differences in adaptivity across models and also reveal that current VLMs struggle to consistently produce learner model-based instructional responses, especially when receiving limited learner information.
[21] SGR: A Stepwise Reasoning Framework for LLMs with External Subgraph Generation cs.CLPDF
Xin Zhang, Yang Cao, Baoxing Wu, Kai Song, Siying Li
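TL;DR: This paper proposes SGR, a framework that enhances the reasoning of large language models (LLMs) through external subgraph generation. It builds query-relevant subgraphs from external knowledge bases and uses their semantic structure to support multi-step reasoning, improving reasoning accuracy and factual reliability on complex tasks.
Details
Motivation: LLMs remain limited in complex settings that demand deep reasoning and logical inference, and their generation process can introduce irrelevant, noisy, or factually inconsistent content, so a method is needed to make their reasoning more reliable and accurate.
Result: Experiments on several benchmark datasets show that SGR achieves consistent improvements over competitive baselines, demonstrating its value for improving reasoning accuracy and factual reliability.
Insight: The novelty lies in combining external structured knowledge (subgraphs) with the stepwise reasoning process of LLMs: query-specific subgraph generation steers the model toward relevant entities and relations, which reduces noise and makes the reasoning more interpretable.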
Abstract: Large Language Models (LLMs) have demonstrated strong capabilities across diverse NLP applications, such as translation, text generation, and question answering. Nevertheless, they remain limited in complex settings that demand deep reasoning and logical inference. Since these models are trained on large-scale text corpora, their generation process may still introduce irrelevant, noisy, or factually inconsistent content. To mitigate this problem, we introduce SGR, a stepwise framework that enhances LLM reasoning through external subgraph generation. SGR builds query-specific subgraphs from external knowledge bases and uses their semantic structure to support multi-step inference. By grounding intermediate reasoning steps in structured external knowledge, the framework helps the model concentrate on relevant entities, relations, and supporting evidence. In particular, SGR first constructs a subgraph tailored to the input question. It then guides the model to reason progressively over the generated structure and combines multiple reasoning trajectories to obtain the final prediction. Experimental results across several benchmark datasets show that SGR achieves consistent improvements over competitive baselines, highlighting its value for improving both reasoning accuracy and factual reliability.
[22] Argus: Evidence Assembly for Scalable Deep Research Agents cs.CL | cs.AI | cs.IRPDF
Zhen Zhang, Liangcai Su, Zhuo Chen, Xiang Lin, Haotian Xu
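TL;DR: This paper proposes Argus, an evidence assembly framework for scalable deep research agents. Two components, a Searcher and a Navigator, cooperate to treat a deep research task as assembling a jigsaw from complementary evidence pieces rather than brute-forcing the whole answer in parallel. The Navigator maintains a shared evidence graph, directs Searchers to gather missing evidence, and reasons over the completed graph to produce a source-traced final answer.
Details
Motivation: Existing deep research agents (for example, parallel search methods) often collect the same evidence repeatedly instead of complementary pieces, which yields diminishing returns and pushes the aggregation context toward the model's limit. The paper aims to reduce redundancy in evidence collection and improve the efficiency and scalability of research tasks.
Result: Built on a 35B-A3B MoE backbone, Argus gains an average of 5.5 points with a single Searcher and 12.7 points with 8 parallel Searchers across eight benchmarks. With 64 Searchers it reaches 86.2 on BrowseComp, surpassing every proprietary agent the authors benchmark, while the Navigator's reasoning context stays under 21.5K tokens.
Insight: The novelty lies in reframing deep research as an evidence assembly problem and introducing an agent architecture with a division of labor (Searchers collect evidence; the Navigator manages the evidence graph and coordinates the work). The Navigator is trained with reinforcement learning to verify, dispatch, and synthesize, independently of the Searcher, so it supports one or many parallel Searchers without retraining and balances compute efficiency against answer quality.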
Abstract: Deep research agents have achieved remarkable progress on complex information-seeking tasks. Even long ReAct-style rollouts explore only a single trajectory, while recent state-of-the-art systems scale inference-time compute via parallel search and aggregation. Yet deep research answers are composed of complementary pieces of evidence, which parallel rollouts often duplicate rather than complete, yielding diminishing returns while pushing the aggregation context toward the model’s limit. We propose Argus, an agentic system in which a Searcher and a Navigator cooperate to treat deep research as assembling a jigsaw from complementary evidence pieces, rather than brute-forcing the whole answer in parallel. The Searcher collects evidence traces for a given sub-query through ReAct-style interaction. The Navigator maintains a shared evidence graph, verifying which pieces are still missing, dispatching Searchers to gather them, and reasoning over the completed graph to produce a source-traced final answer. We train the Navigator with reinforcement learning to verify, dispatch, and synthesize, while independently training the Searcher to remain a standard ReAct agent. The resulting Navigator supports rollouts with a single Searcher or many in parallel without retraining. With both Searcher and Navigator built on a 35B-A3B MoE backbone, Argus gains 5.5 points with a single Searcher and 12.7 points with 8 parallel Searchers, averaged over eight benchmarks. With 64 Searchers it reaches 86.2 on BrowseComp, surpassing every proprietary agent we benchmark, while the Navigator’s reasoning context stays under 21.5K tokens.
cs.CV [Back]
[23] ReactiveGWM: Steering NPC in Reactive Game World Models cs.CVPDF
Zeqing Wang, Danze Chen, Zhaohu Xing, Zizhao Tong, Yinhan Zhang
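TL;DR: ReactiveGWM is a reactive game world model that synthesizes dynamic interactions between the player and the non-player character (NPC). It decouples player controls from NPC behaviors: player actions are injected into the diffusion backbone through a lightweight additive bias, while high-level NPC strategies (such as Offense, Control, and Defense) are steered through cross-attention modules. The model supports zero-shot strategy transfer, so the learned modules can be plugged directly into off-the-shelf world models of different games without domain-specific retraining. Evaluation on two Street Fighter games shows that ReactiveGWM maintains fine-grained player controllability while achieving robust, prompt-aligned NPC strategy adherence.
Details
Motivation: Existing game world models simulate environments from a subjective, player-centric perspective and treat the NPC merely as background pixels. They cannot capture player-NPC interactions and lack the physical understanding needed to model action-induced NPC reactions, so they behave more like passive video renderers than real simulation engines.
Result: Evaluated on two Street Fighter games, ReactiveGWM maintains fine-grained player controllability and achieves robust, prompt-aligned NPC strategy adherence.
Insight: The novelty lies in explicitly decoupling player controls from NPC behaviors, injecting player actions through a lightweight additive bias, and using cross-attention modules to learn a game-agnostic representation of interactive logic. This enables zero-shot strategy transfer, unlocking steerable NPC interactions in different games without domain-specific retraining.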
Abstract: Current game world models simulate environments from a subjective, player-centric perspective. However, by treating the Non-Player Character (NPC) merely as background pixels, these models cannot capture interactions between the player and NPC. In that sense, they act as passive video renderers rather than real simulation engines, lacking the physical understanding needed to model action-induced NPC reactions. We introduce ReactiveGWM, a reactive game world model that synthesizes dynamic interactions between the player and NPC. Instead of entangling all interaction dynamics, ReactiveGWM explicitly decouples player controls from NPC behaviors. Player actions are injected into the diffusion backbone via a lightweight additive bias, while high-level NPC responses (e.g., Offense, Control, Defense) are grounded through cross-attention modules. Crucially, these modules learn a game-agnostic representation of interactive logic. This enables zero-shot strategy transfer: our learned modules can be plugged directly into off-the-shelf, unannotated world models of different games. This instantly unlocks steerable NPC interactions without any domain-specific retraining. Evaluated on two Street Fighter games, ReactiveGWM maintains fine-grained player controllability while achieving robust, prompt-aligned NPC strategy adherence, paving the way for scalable, strategy-rich interaction with the NPC.
[24] Deep Pre-Alignment for VLMs cs.CVPDF
Tianyu Yu, Kechen Fang, Zihao Wan, Kaidong Zhang, Yicheng Zhang
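TL;DR: This paper proposes Deep Pre-Alignment (DPA), an architecture that replaces the standard ViT encoder with a small VLM acting as a perceiver, so that visual features are deeply aligned before they enter the LLM. This addresses the problem that visual features in VLMs remain distant from the text space in the LLM's initial layers, which wastes model depth. DPA improves the average over 8 multimodal benchmarks by 1.9 points at the 4B parameter scale and by 3.0 points at the 32B scale, reduces language capability forgetting by 32.9%, and behaves consistently across LLM families such as Qwen3 and LLaMA 3.2.
Details
Motivation: In existing VLM architectures, visual features remain distant from the text space in the initial layers of the LLM, so the model spends critical depth on superficial modality alignment rather than deep understanding and complex reasoning, which limits performance.
Result: At the 4B parameter scale, DPA outperforms baselines by an average of 1.9 points across 8 multimodal benchmarks; at the 32B scale the gain widens to 3.0 points. Across 3 text benchmarks, language capability forgetting is reduced by 32.9%. The gains are consistent across LLM families such as Qwen3 and LLaMA 3.2, and the approach requires only a modular replacement of the visual encoder with marginal computational overhead.
Insight: The novelty lies in using a small VLM as a perceiver in place of the standard ViT encoder, completing deep alignment before visual features enter the LLM. This frees the LLM's deeper layers for complex reasoning, reduces language capability forgetting, and offers a seamless upgrade path.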
Abstract: Most Vision Language Models (VLMs) directly map outputs from ViT encoders to the LLM via a lightweight projector. While effective, recent analysis suggests this architecture suffers from an alignment challenge: visual features remain distant from the text space in the initial layers of the LLM, forcing the model to waste critical depth (Zhang et al., 2024; Artzy and Schwartz, 2024) on superficial modality alignment rather than deep understanding and complex reasoning. In this work, we propose Deep Pre-Alignment (DPA), a novel architecture that replaces the standard ViT encoder with a small VLM as perceiver, ensuring visual features are deeply aligned with the text space of the target large language model. Comprehensive experiments demonstrate the effectiveness of DPA. On the 4B parameter scale, DPA outperforms baselines by 1.9 points across 8 multimodal benchmarks, with gains widening to 3.0 points at the 32B scale. Moreover, by offloading alignment to the perceiver, DPA achieves a 32.9% reduction in language capability forgetting over 3 text benchmarks. We further demonstrate that these gains are consistent across different LLM families including Qwen3 and LLaMA 3.2, highlighting the generality of our approach. Beyond performance, DPA also offers a seamless upgrade path for current VLM development, requiring only a modular replacement for the visual encoder with marginal computation overhead.
[25] COPRA: Conditional Parameter Adaptation with Reinforcement Learning for Video Anomaly Detection cs.CVPDF
Darryl Cherian Jacob, Xinyu Liu, Kai Wang, Pan He
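TL;DR: This paper proposes COPRA, a framework that uses reinforcement learning for conditional parameter adaptation, addressing the mismatch in data distribution and model configuration between training and inference when vision-language models are applied to video anomaly detection. The method generates input-specific parameter updates for each video segment to dynamically adapt a frozen VLM. It performs strongly on standard VAD benchmarks and generalizes to unseen tasks such as video question answering and dense captioning.
Details
Motivation: Existing VLM-based VAD methods suffer from two fundamental mismatches. First, they rely on static post-training adaptation, which limits generalization under distribution shifts such as unseen environments or anomaly types. Second, they train on sparse frames from long videos but run inference on densely sampled short segments, creating an inconsistency between training and testing.
Result: On standard VAD benchmarks, COPRA consistently outperforms static baselines in both in-domain and cross-domain settings and generalizes to unseen tasks such as multiple-choice video question answering and dense captioning, showing its effectiveness as a scalable, adaptive, and context-aware video understanding framework.
Insight: The novelty is a conditional parameter adaptation framework that uses reinforcement learning to generate parameter updates specific to each input, instead of fixed prompts or shared parameter updates. This dynamically adapts a frozen VLM, addresses the train-inference mismatch, and enables cross-task generalization.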
Abstract: Vision-language models (VLMs) have shown strong performance in video anomaly detection (VAD) while providing interpretable predictions. However, existing VLM-based VAD methods suffer from a fundamental mismatch between training and inference in both data distribution and model configuration. First, most approaches rely on static post-training adaptation, limiting generalization under distribution shifts such as unseen environments or anomaly types. Second, they train VLMs on sparse frames from long videos, but perform inference on densely sampled short segments, creating inconsistencies between training and testing. To address these limitations, we propose COPRA, a conditional parameter adaptation framework for VLM-based VAD. Instead of fixed prompts or shared parameter updates, COPRA generates input-specific parameter updates to dynamically adapt a frozen VLM for each video segment during both training and inference. Experiments show strong performance on standard VAD benchmarks, consistently outperforming static baselines in both in-domain and cross-domain settings. Moreover, COPRA generalizes beyond VAD to unseen tasks such as multiple-choice Video Question Answering and Dense Captioning. These results highlight COPRA as an effective weight-space generation framework for scalable, adaptive, and context-aware video understanding. The code will be released at https://github.com/THE-MALT-LAB/COPRA
[26] Multimodal Object Detection Under Sparse Forest-Canopy Occlusion cs.CVPDF
Nitik Jain, Mangal Kothari
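TL;DR: This paper proposes a multimodal object detection pipeline that combines LiDAR, visible–thermal image fusion, and Airborne Optical Sectioning (AOS) to address human detection under sparse forest-canopy occlusion. Experiments show that LiDAR penetration is limited, whereas visible–thermal fusion and AOS improve target visibility; a fine-tuned YOLOv5 reaches a mean average precision of about 0.83 on the FLIR dataset.
Details
Motivation: Sparse, structured, and viewpoint-dependent occlusion beneath forest canopy makes human detection in remote sensing very difficult. Existing methods struggle to detect reliably, and multimodal fusion is needed to overcome the limits of any single sensor.
Result: The fine-tuned YOLOv5 achieves a mean average precision (mAP) of about 0.83 on the top three classes of the Teledyne FLIR thermal dataset. The tested LiDAR configuration provides limited penetration for object-level detection, visible–thermal fusion improves target visibility in low-contrast scenes, and AOS enhances ground-plane detection in synthetic forest imagery.
Insight: The contribution is a multimodal proof-of-concept pipeline that integrates LiDAR evaluation, visible–thermal image fusion (multi-scale transform and sparse representation), and AOS synthetic-aperture imaging. It establishes a baseline for UAV-deployable search-and-rescue and surveillance systems and highlights the need for dedicated forest datasets and real-time multimodal integration.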
Abstract: Reliable detection of humans beneath forest canopy remains a difficult remote-sensing challenge due to sparse, structured, and viewpoint-dependent occlusion. This paper presents a multimodal proof-of-concept pipeline that integrates three complementary approaches: (i) experimental evaluation of LiDAR returns through vegetation to assess the feasibility of active sensing, (ii) visible–thermal image fusion using a multi-scale transform and sparse-representation framework to enhance human saliency, and (iii) synthetic-aperture image formation via Airborne Optical Sectioning (AOS) to suppress canopy clutter. A YOLOv5 detector is fine-tuned on the Teledyne FLIR thermal dataset and evaluated on thermal and fused imagery. Results show that the tested terrestrial LiDAR configuration provides limited penetration for object-level detection, while visible–thermal fusion improves target visibility in low-contrast scenes and AOS enhances ground-plane detection in synthetic forest imagery. The fine-tuned YOLOv5 achieves a mean average precision of approximately 0.83 on the top three FLIR classes. These findings establish an initial baseline for UAV-deployable search-and-rescue and surveillance systems operating in forested environments, and motivate future work on dedicated forest datasets and real-time multimodal integration.
[27] Minerva-Ego: Spatiotemporal Hints for Egocentric Video Understanding cs.CV | cs.LGPDF
Arsha Nagrani, Jasper Uijilings, Shyamal Buch, Tobias Weyand, Sudheendra Vijayanarasimhan
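TL;DR: This paper presents Minerva-Ego, a benchmark for evaluating complex multi-step reasoning in egocentric video understanding. It contains multimodal questions and spatiotemporally dense, human-annotated reasoning traces. Experiments show that state-of-the-art models still fall well short of human performance, and that hints about 'where' and 'when' to look substantially improve model performance.
Details
Motivation: Existing video reasoning benchmarks evaluate only the final output (such as the answer to a question) without assessing intermediate reasoning steps, and most are confined to the text domain. Analyzing the reasoning process in egocentric video understanding in depth calls for a complex multi-step reasoning benchmark with spatiotemporal annotations.
Result: On Minerva-Ego, state-of-the-art models show a large gap to human performance. Providing spatiotemporal hints (that is, 'where' and 'when' to look) substantially improves model performance.
Insight: The contribution is a benchmark with spatiotemporally dense reasoning traces and object mask annotations, together with the finding that spatiotemporal hints play a key role in improving egocentric video reasoning models.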
Abstract: Video reasoning models are a core component of egocentric and embodied agents. However, standard benchmarks for assessing models provide only evaluation of the output (e.g. the answer to a question), without evaluation of intermediate reasoning steps, and most provide answers only in the text domain. We introduce Minerva-Ego, a benchmark for evaluating complex egocentric visual reasoning. We extend recent high-quality video data sources recorded from egocentric / embodied settings with a set of challenging, multi-step multimodal questions and spatiotemporally-dense human-annotated reasoning traces. Benchmarking experiments show that state-of-the-art models still have a large gap to human performance. To investigate this gap in detail, we annotate each reasoning trace in the dataset with the objects of interest required to solve the question, as spatiotemporal mask annotations. Through extensive evaluations, we identify that prompting frontier models with hints of ‘where’ and ‘when’ to look yields substantial improvements in performance. Minerva-Ego can be downloaded at https://github.com/google-deepmind/neptune.
[28] PanoWorld: Geometry-Consistent Panoramic Video World Modeling cs.CV | cs.AIPDF
Le Jiang, Xiangyu Bai, Bishoy Galoaa, Shayda Moezzi, Caleb James Lee
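TL;DR: PanoWorld is a panoramic video world model that generates geometry-consistent panoramic video from a single image and a text caption. It introduces depth consistency and trajectory consistency losses, along with spherical-geometry-aware conditional encoding, to address the depth, correspondence, and motion inconsistencies of existing methods.
Details
Motivation: Existing panoramic video generation methods optimize primarily for visual realism and do not explicitly constrain the underlying 3D scene state, producing outputs with inconsistent depth, broken correspondences, and implausible motion. The paper reframes panoramic video generation as a geometry- and dynamics-consistent latent state modeling problem.
Result: Experiments show that PanoWorld improves geometric consistency over prior panoramic generation methods while maintaining competitive visual realism. Stratified evaluation on the PanoGeo dataset supports its effectiveness.
Insight: The contributions include treating panoramic video generation as a geometric modeling problem, introducing lightweight regularizers (depth consistency and trajectory consistency losses), and building PanoGeo, a unified geometry-aware dataset with consistent depth, trajectory, and prompt annotations. Together these support the holistic spatial understanding required by embodied AI applications.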
Abstract: We present PanoWorld, a panoramic video world model that generates geometry-consistent 360° video from a single image and a caption. Existing panoramic video methods optimize primarily for visual realism and do not explicitly constrain the underlying 3D scene state, producing outputs that appear plausible yet exhibit inconsistent depth, broken correspondences, and implausible motion across the spherical surface. We address this gap by framing panoramic video generation as a geometry- and dynamics-consistent latent state modeling problem rather than pure visual synthesis. Building on a pre-trained perspective video world model, we introduce two lightweight regularizers: a depth consistency loss against pseudo ground-truth panoramic depth, and a trajectory consistency loss that supervises the 3D world-frame positions of tracked points across time. We further apply spherical-geometry-aware adaptation to the conditioning and positional encoding. We additionally introduce PanoGeo, a unified geometry-aware panoramic video dataset with consistent depth, trajectory, and prompt annotations across diverse real and synthetic sources, used for both training and stratified evaluation. Experiments show that PanoWorld improves geometric consistency over prior panoramic generation methods while maintaining competitive visual realism, establishing that panoramic video generation must be treated as a geometric modeling problem to support the holistic spatial understanding requirements of embodied AI applications. Code is available at https://github.com/ostadabbas/PanoWorld.
[29] ELDOR: A Dataset and Benchmark for Illegal Gold Mining in the Amazon Rainforest cs.CVPDF
Kangning Cui, Surendra Bohara, Suraj Prasai, Zishan Shao, Wei Tang
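TL;DR: This paper introduces ELDOR, a large-scale UAV imagery benchmark for monitoring environmental damage from illegal gold mining in the Amazon rainforest. The dataset contains manually annotated orthomosaic imagery covering over 2,500 hectares, with pixel-level semantic labels for mining activities and surrounding ecological structures. On this basis the authors establish four benchmark tasks (semantic segmentation, segmentation-derived recognition, direct multi-label classification, and classification with vision-language models) and evaluate a range of models, finding that existing methods still struggle with rare, small-scale mining structures and fine-grained recovery classes.
Details
Motivation: Illegal gold mining causes severe environmental damage in the Amazon rainforest, but existing satellite imagery is hard to use for monitoring at fine spatial scales and often misses small mining structures and subtle land-cover changes.
Result: On ELDOR, the authors evaluate generic and remote-sensing-specific segmentation models, vision foundation model-related segmentation methods, direct multi-label classification methods, and vision-language models. Existing methods perform poorly on rare small-scale mining structures and fine-grained recovery classes, pointing to the need for context-aware and multimodal modeling.
Insight: The novelty lies in building the first large-scale, finely annotated UAV imagery benchmark for illegal mining in the rainforest, with a systematically defined multi-task evaluation framework. By pairing high-resolution UAV data with a unified multi-task benchmark, the work provides a data resource and evaluation standard for environmental monitoring and indicates that future models need better handling of small objects and context.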
Abstract: Illegal gold mining in the Amazon rainforest causes deforestation, water contamination, and long-term ecosystem disruption, yet remains difficult to monitor at fine spatial scales. Satellite imagery supports large-scale observation, but often misses small mining-related structures and subtle land-cover transitions, especially under frequent cloud cover. We introduce ELDOR, a large-scale UAV benchmark for monitoring environmental and landscape disturbance from illegal gold mining in the rainforest. ELDOR contains manually annotated orthomosaic imagery covering over 2,500 hectares, with pixel-level semantic labels for both mining-related activities and surrounding ecological structures. With this unified annotation source, we establish four benchmark tasks: semantic segmentation, segmentation-derived recognition, direct multi-label classification, and class-presence recognition with vision-language models. Across these tasks, we compare generic and remote-sensing-specific segmentation models, vision foundation model-related segmentation methods, direct multi-label classification methods, and vision-language models under a controlled closed-set protocol. Results show that current methods still struggle with rare small-scale mining structures and fine-grained recovery classes, suggesting the need for context-aware and multimodal modeling. To support domain analysis and practical use, we further build an interactive explorer for domain experts that provides a unified interface for data exploration and model inference.
[30] MorphoHELM: A Comprehensive Benchmark for Evaluating Representations for Microscopy-Based Morphology Assays cs.CVPDF
Emre Hayir, Lorin Crawford, Alex X. Lu
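TL;DR: MorphoHELM is a comprehensive open-source benchmark for evaluating representation extraction methods for microscopy images, focused on Cell Painting, the most widely used morphological profiling assay. It consolidates and extends existing evaluation standards, evaluates each task under varying degrees of batch effects (technical noise) to quantify how a method's ability to detect biological signal degrades as noise increases, and finds that existing models still fall short of classic computer vision analytic strategies in general-purpose use.
Details
Motivation: Evaluation of microscopy image representation extraction methods is fragmented: different models are evaluated on different tasks and datasets with custom pipelines, which makes fair comparison difficult.
Result: On Cell Painting datasets, the benchmark evaluates the widest range of methods to date. No existing model outperforms classic computer vision analytic strategies across all settings; the latter remain the strongest general-purpose representations.
Insight: The novelty is a unified evaluation framework that accounts for varying degrees of batch effects, reveals trade-offs between methods in detecting different kinds of biological signal, and highlights the continued advantage of classic methods in general-purpose use.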
Abstract: Microscopy images contain rich information about how cells respond to perturbations, making them essential to applications like drug screening. To quantify images, researchers often use representation extraction methods, and recent years have seen a proliferation of deep learning methods. While measuring the quality of these representations is essential, evaluation remains fragmented, with each proposed model evaluated on different tasks and datasets, using custom pipelines and metrics, making it difficult to fairly compare models. Here, we introduce MorphoHELM, a comprehensive open benchmark for evaluating feature extraction methods for Cell Painting, the most widely-used morphological profiling assay. MorphoHELM consolidates evaluation standards in the field, extends and corrects them to be more robust, and evaluates on the widest range of methods to date. A defining feature of the benchmark is that each task is evaluated at different degrees of batch effects (or technical noise), directly quantifying how the ability of methods to detect biological signal degrades as noise increases. Together, these properties enable MorphoHELM to detect trade-offs between methods, and we demonstrate that models that excel at certain kinds of biological signal are weaker at others. We show that no existing model outperforms classic computer vision analytic strategies across all settings, which remain the strongest general use-case representations. All datasets, code, and evaluation tools are publicly available at https://github.com/microsoft/MorphoHELM.
[31] U-SEG: Uncertainty in SEGmentation – A systematic multi-variable exploration cs.CVPDF
Michael Smith, Frank P. Ferrie
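TL;DR: This paper systematically studies several key variables in uncertainty estimation for segmentation, including datasets, backbones, and downstream tasks, and examines how uncertainty estimates perform and apply in different scenarios. It finds that more complex tasks such as panoptic segmentation usually lead to worse uncertainty estimation performance, that time series samples are useful in some configurations but costly, that sample diversity performs best on the calibration task, and that deterministic methods suffice for some tasks while ensembles can bring significant improvements under the right conditions.
Details
Motivation: The study explores under-studied questions at the intersection of uncertainty estimation and segmentation, in particular the variables that affect the quality of uncertainty estimates, in order to help identify and handle prediction errors in practical scenarios.
Result: Large-scale experiments show that panoptic segmentation usually results in worse uncertainty estimation performance, and that high variance across datasets and backbones indicates generalization is not guaranteed. Time series samples are useful in specific configurations but often not worth the cost. Sample diversity is most promising for calibration but otherwise fails to beat simpler alternatives. A deterministic approach is adequate for some downstream tasks, while ensembles allow significant improvements when deployment conditions are right.
Insight: The contribution is a systematic framework for exploring how multiple variables affect uncertainty estimation in segmentation. It shows the important role of task complexity, dataset differences, and backbone choice, and offers new findings on the practical value of sample diversity and ensembles.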
Abstract: In this study, we explore in depth a few under-studied topics at the intersection of uncertainty estimation and segmentation. Prior work has shown that the quality of uncertainty estimates can be very sensitive to a range of variables. As one of the main uses of uncertainty estimation is to help identify and deal with prediction errors in practical scenarios, any factors that affect this must be clearly identified. For example, do more challenging domains or different datasets and architectures result in worse performance when using uncertainty estimates? Can prior frames in a video sequence in fact provide useful uncertainty estimates comparable to other approaches? Is it possible to combine uncertainty estimation approaches, taking advantage of sample diversity, to get better estimates? Finally, when might it make sense to use an ensemble-based uncertainty estimate over a deterministic network? We address these questions by creating a framework for and executing a large scale study across many variables such as datasets, backbones, and downstream tasks, for both semantic and panoptic segmentation. We find that a) the more challenging task of panoptic segmentation usually results in worse performance while high performance variance between datasets and backbones indicates that generalization is not guaranteed, b) time series samples can be useful for specific configurations, but in many cases are not worth the cost, c) sample diversity shows the most promise in the downstream task of calibration, but otherwise fails to beat simpler alternatives, d) a deterministic approach is adequate for some downstream tasks, but ensembles allow for significant improvements if the right conditions can be achieved in deployment.
[32] MR2-ByteTrack: CNN and Transformer-based Video Object Detection for AI-augmented Embedded Vision Sensor Nodes cs.CV | cs.AI | eess.IVPDF
Luca Bompani, Manuele Rusci, Luca Benini, Daniele Palossi, Francesco Conti
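TL;DR: This paper proposes MR2-ByteTrack, a video object detection method designed for microcontroller (MCU)-based embedded vision nodes. It lowers computational cost by alternating between full- and low-resolution inference, links detections across frames with ByteTrack, and corrects misclassifications with the Rescore algorithm, which aggregates confidence scores over multiple frames. Experiments show that the method substantially reduces computation and energy use while maintaining accuracy, enabling the first real-time Transformer-based video object detection on an MCU-class device.
Details
Motivation: Smart vision sensors often cannot rely on cloud computing because of bandwidth, latency, and privacy constraints, while local MCUs have too little memory and compute to run conventional video object detection methods that require feature storage or multi-frame buffering.
Result: On ImageNetVID, the CNN-based models reach an mAP of up to 49.0 and the Transformer-based model reaches 48.7, with multiply-accumulate operations reduced by up to 53% and 32%, respectively. Deployed on GAP9, an ultra-low-power RISC-V multicore MCU, the method saves up to 55% energy compared with processing only full-resolution images.
Insight: The novelty lies in combining multi-resolution inference, ByteTrack tracking, and the Rescore algorithm based on probability union rules into a lightweight video object detection framework. Its architecture-agnostic design (it applies to both CNNs and Transformers) offers a path to efficient video processing on embedded devices.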
Abstract: Modern smart vision sensors need on-device intelligence to process video streams, as cloud computing is often impractical due to bandwidth, latency, and privacy constraints. However, these sensory systems typically rely on ultra-low-power microcontrollers (MCUs) with limited memory and compute, making conventional video object detection methods, which require feature storage or multi-frame buffering, unfeasible. To address this challenge, we introduce Multi-Resolution Rescored ByteTrack (MR2-ByteTrack), a Video Object Detection (VOD) method tailored for MCU-based embedded vision nodes. MR2-ByteTrack reduces computational cost by alternating between full- and low-resolution inference, while linking detections across frames via ByteTrack and correcting misclassifications through the Rescore algorithm, which applies probability union rules to aggregate detection confidence scores across frames. We apply our approach to both a CNN-based detector and a Transformer-based model, demonstrating its generality across architectures with fundamentally different spatial processing. Experiments on ImageNetVID demonstrate that MR2-ByteTrack maintains accuracy, achieving mAP scores of up to 49.0 for the CNN-based models and 48.7 for the Transformer, while reducing multiply-accumulate operations by as much as 53% for the CNNs and 32% for the Transformer. When deployed on GAP9, an ultra-low-power RISC-V multicore MCU, our method yields up to 55% energy savings compared to processing only full-resolution images, enabling the first real-time Transformer-based VOD on an MCU-class embedded vision node. Code available at https://github.com/Bomps4/Multi_Resolution_Rescored_ByteTrack/tree/IEEE_Access
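The "probability union" idea behind Rescore can be sketched as follows. This is an illustrative reading rather than the authors' algorithm: treating per-frame scores as independent probabilities, fusing them per class, and the helper names `union_score` and `rescore_track` are all assumptions; the linked repository is the authoritative source for the actual method.

```python
def union_score(scores):
    """Union of independent events: 1 - prod(1 - p_i)."""
    miss = 1.0
    for p in scores:
        if not 0.0 <= p <= 1.0:
            raise ValueError("scores must lie in [0, 1]")
        miss *= 1.0 - p
    return 1.0 - miss

def rescore_track(detections):
    """Fuse per-frame (label, score) pairs of one track into a single label.

    Scores are grouped by class and fused with the union rule; the class
    with the highest fused score wins (a hypothetical decision rule).
    """
    per_class = {}
    for label, score in detections:
        per_class.setdefault(label, []).append(score)
    fused = {label: union_score(s) for label, s in per_class.items()}
    best = max(fused, key=fused.get)
    return best, fused[best]

# Hypothetical track: the middle frame is misclassified as "cat".
track = [("dog", 0.6), ("cat", 0.7), ("dog", 0.6)]
label, confidence = rescore_track(track)
```

In this toy track the two moderate "dog" scores fuse to a higher confidence than the single "cat" score, so the isolated misclassification is outvoted without buffering any frames or features.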
[33] Video Models Can Reason with Verifiable Rewards cs.CVPDF
Tinghui Zhu, Sheng Zhang, James Y. Huang, Selena Song, Xiaofei Wen
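TL;DR: This paper proposes VideoRLVR, a method that optimizes video diffusion models with verifiable rewards to improve reasoning on tasks that must satisfy explicit spatial, temporal, or logical constraints. Using an SDE-GRPO optimization framework, dense decomposed rewards, and an Early-Step Focus strategy, it outperforms supervised fine-tuning baselines and existing video generation models on verifiable reasoning benchmarks such as Maze, FlowFree, and Sokoban.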
Details
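Motivation: Current video diffusion models are optimized mainly for perceptual realism and temporal coherence and fall short on verifiable reasoning tasks that require satisfying explicit constraints. The paper brings reinforcement learning with verifiable rewards (RLVR) from language models to video generation to make rule-consistent visual reasoning more reliable.
Result: On Maze, FlowFree, and Sokoban, three procedurally generated domains with objective success criteria, VideoRLVR consistently outperforms supervised fine-tuning baselines, with dense decomposed rewards proving especially effective in low-success-rate settings. The RL-optimized model also outperforms the evaluated proprietary and open-source video generation models on these verifiable reasoning benchmarks and on out-of-domain benchmarks.
Insight: The contributions are the formulation of video reasoning as the generation of verifiable visual trajectories, the SDE-GRPO optimization backbone, dense decomposed rewards, and the Early-Step Focus strategy, which restricts policy optimization to the early denoising phase and cuts training latency by about 40% while preserving performance. Together they offer a practical path from perceptual imitation toward reliable rule-consistent reasoning in video models.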
Abstract: Video diffusion models have made rapid progress in perceptual realism and temporal coherence, but they remain primarily optimized for plausible generation rather than verifiable reasoning. This limitation is especially pronounced in tasks where generated videos must satisfy explicit spatial, temporal, or logical constraints. Inspired by the role of reinforcement learning with verifiable rewards (RLVR) in reasoning-oriented language models, we introduce VideoRLVR, a practical recipe for optimizing video diffusion models with rule-based feedback. VideoRLVR formulates video reasoning as the generation of verifiable visual trajectories and consists of an SDE-GRPO optimization backbone, dense decomposed rewards, and an Early-Step Focus strategy for efficient training. The Early-Step Focus strategy restricts policy optimization to the early denoising phase, reducing training latency by about 40% while preserving performance. We evaluate VideoRLVR on Maze, FlowFree, and Sokoban, three procedurally generated domains with objective success criteria. Across these tasks, VideoRLVR consistently improves over supervised fine-tuning baselines, with dense decomposed rewards proving especially important in low-success-rate settings. Our RL-optimized model also outperforms the evaluated proprietary and open-source video generation models on these verifiable reasoning benchmarks and out-of-domain benchmarks. These results suggest that verifiable RL can move video models beyond perceptual imitation toward more reliable rule-consistent visual reasoning.
[34] Entity-Centric World Models: Interaction-Aware Masking for Causal Video Prediction cs.CVPDF
Santosh Kumar Paidi
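TL;DR: This paper proposes IA-JEPA, an entity-centric world model that uses a self-supervised, motion-centric masking strategy to prioritize learning physical interactions, addressing the lack of causal-dynamics understanding in conventional JEPA models for video prediction. It substantially improves accuracy on causal reasoning tasks in the CLEVRER benchmark and generalizes to real-world actions and zero-shot physical puzzles.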
Details
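Motivation: Learning predictive world models from unlabelled video is a foundational challenge in AI. Existing Joint Embedding Predictive Architectures (JEPA) excel at semantic classification but are often physics-blind and fail to capture the causal dynamics needed for downstream reasoning. The author attributes this to standard patch-based masking, which prioritizes visual texture over rare but informative kinematic events.
Result: On CLEVRER causal reasoning tasks, IA-JEPA reaches 14.26% accuracy, well above the 3.22% of standard patch-masked baselines. It also induces a higher-entropy, more discriminative latent space (+10% entropy gain) that linearizes physical energy (R²=0.43), and it generalizes to Something-Something V2 and PHYRE-Lite.
Insight: The core idea is an interaction-aware masking strategy that, in a self-supervised way, masks entities engaged in collisions or momentum transfers, forcing the model to reconstruct latent trajectories rather than static background features. This breaks the "static bias" of standard self-supervision and provides a scalable, fully self-supervised path toward foundational world models that internalize the causal structure of the physical world.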
Abstract: Learning predictive world models from unlabelled video is a foundational challenge in artificial intelligence. While Joint Embedding Predictive Architectures (JEPA) have set new benchmarks in semantic classification, they often remain physics-blind, failing to capture the causal dynamics necessary for downstream reasoning. We hypothesize that this stems from standard patch-based masking strategies, which prioritize visual texture over rare but informative kinematic events. We propose Interaction-Aware JEPA (IA-JEPA), which utilizes a self-supervised motion-centric masking strategy to prioritize physical interactions. By specifically targeting entities engaged in collisions or momentum transfers, we force the architecture to reconstruct latent trajectories rather than static background features. Evaluated on the CLEVRER benchmark, IA-JEPA achieves 14.26% accuracy on causal reasoning tasks, a significant lead over the 3.22% achieved by standard patch-masked baselines. Crucially, we demonstrate that IA-JEPA breaks the “static bias” of standard self-supervision by inducing a higher-entropy, more discriminative latent space (+10% entropy gain) that linearizes physical energy ($R^2=0.43$). We show that this interaction bias generalizes to real-world human actions (Something-Something V2) and zero-shot physical puzzles (PHYRE-Lite). Our results provide a scalable, fully self-supervised path toward building foundational world models that begin to internalize the causal structure of the physical world.
[35] EgoExo-WM: Unlocking Exo Video for Ego World Models cs.CVPDF
Danny Tran, Roberto Martín-Martín, Kristen Grauman
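TL;DR: This paper proposes EgoExo-WM, which uses abundant third-person (exocentric) video to train first-person (egocentric) world models. It extracts structured body pose from exocentric video as the action representation and converts the exocentric video to an egocentric view using a human kinematics prior, unlocking in-the-wild exocentric data for egocentric world model training. Experiments show that training whole-body action-conditioned egocentric world models on the converted data significantly improves prediction quality and downstream planning performance.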
Details
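Motivation: Egocentric world models are promising for letting agents predict and plan, but they are limited by the scarcity of egocentric training data and by the inherent partial observability of human physical actions in that data. Exocentric video is abundant and shows body pose clearly, but it is not directly aligned with the agent's action space and is not egocentric.
Result: Egocentric world models trained on data converted by the proposed method show significant gains in both prediction quality and downstream planning, where the task is to infer the sequence of body poses needed to reach a visual goal state.
Insight: The core contribution is a method for converting exocentric video into a form usable for egocentric world model training by extracting body pose as the action representation and transforming the viewpoint. This makes abundant in-the-wild video usable for building stronger egocentric world models, with applications in robot planning and augmented-reality guidance.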
Abstract: Egocentric world models present a promising direction for enabling agents to predict and plan, but their performance is constrained by the limited availability of egocentric training data and its inherent partial observability of humans’ physical actions. In contrast, exocentric video is abundant and reveals body poses well, but lacks direct alignment with an agent’s action space – and is not egocentric. We propose a method to bridge this gap by extracting structured body pose from exocentric video as a representation of action and transforming the exocentric video to egocentric video, informed by a human kinematics prior. This process unlocks the integration of in-the-wild exocentric data for egocentric world model training. We show that training whole-body action-conditioned egocentric world models with our converted data significantly improves both prediction quality and downstream planning performance, where we infer the sequence of body poses needed to achieve a visual goal state. Our approach paves the way to enlist arbitrary in-the-wild videos for building powerful egocentric world models, furthering applications in robot planning and augmented-reality guidance.
[36] DiffVAS: Diffusion-Guided Visual Active Search in Partially Observable Environments cs.CV | cs.AIPDF
Anindya Sarkar, Srikumar Sastry, Aleksis Pirinen, Nathan Jacobs, Yevgeniy Vorobeychik
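TL;DR: This paper proposes DiffVAS, a visual active search method for partially observable environments. It uses a diffusion model to reconstruct the entire geospatial area from sequentially observed partial glimpses and pairs it with a reinforcement-learning-based planning module, so that it can search for multiple kinds of target objects simultaneously according to task requirements.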
Details
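Motivation: Existing visual active search methods usually assume that the entire search space is known upfront and learn policies tailored to specific targets. This is unrealistic in real partially observable settings with a restricted field of view and high acquisition costs, and it prevents searching for multiple target categories at once.
Result: Extensive experiments show that DiffVAS significantly surpasses state-of-the-art methods on several datasets when searching for diverse objects in partially observable environments.
Insight: The core idea is to use a diffusion model for environment reconstruction and combine it with target-conditioned reinforcement learning for planning, which enables effective reasoning about unobserved regions and simultaneous multi-target search, and makes the policy more deployable in real-world scenarios.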
Abstract: Visual active search (VAS) has been introduced as a modeling framework that leverages visual cues to direct aerial (e.g., UAV-based) exploration and pinpoint areas of interest within extensive geospatial regions. Potential applications of VAS include detecting hotspots for rare wildlife poaching, aiding search-and-rescue missions, and uncovering illegal trafficking of weapons, among other uses. Previous VAS approaches assume that the entire search space is known upfront, which is often unrealistic due to constraints such as a restricted field of view and high acquisition costs, and they typically learn policies tailored to specific target objects, which limits their ability to search for multiple target categories simultaneously. In this work, we propose DiffVAS, a target-conditioned policy that searches for diverse objects simultaneously according to task requirements in partially observable environments, which advances the deployment of visual active search policies in real-world applications. DiffVAS leverages a diffusion model to reconstruct the entire geospatial area from sequentially observed partial glimpses, which enables a target-conditioned reinforcement learning-based planning module to effectively reason and guide subsequent search steps. Extensive experiments demonstrate that DiffVAS excels in searching diverse objects in partially observable environments, significantly surpassing state-of-the-art methods on several datasets.
[37] When Does Sparse MoE Help in Vision? The Role of Backbone Compute Leverage in Sparse Routing cs.CV | cs.LGPDF
Libo Sun, Po-wei Harn, Peixiong He, Xiao Qin
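TL;DR: This paper studies when sparse Mixture-of-Experts networks help in vision classification. Experiments on multiple datasets show that sparse MoE improves accuracy only when a sufficiently large fraction of total compute is routed; at ImageNet scale, multi-expert routing is additionally required. Controlled experiments isolate these factors and identify batch-axis dispatch as the dominant failure mode in CNNs.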
Details
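Motivation: Sparse MoE networks promise better accuracy-compute trade-offs, but practical vision deployments are hindered by expert collapse and limited end-to-end efficiency gains. The paper asks under which conditions sparse top-k routing with hard capacity constraints actually helps in vision classification.
Result: Under multi-seed protocols on four benchmarks (CIFAR-10/100, Tiny-ImageNet, ImageNet-1K), positive accuracy gaps appear only when the routed fraction ρ of total compute is large enough. At ImageNet scale this is necessary but not sufficient: multi-expert routing is also required. An ImageNet-1K ablation that varies only top-k reverses the gap from positive to negative across all five seeds.
Insight: The key finding is a compute-leverage pattern: the benefit of sparse routing depends on the fraction of total compute that is routed. The study also identifies batch-axis dispatch as the dominant failure mode in per-sample CNN settings and proposes a Soft MoE variant that applies softmax over experts rather than over the batch to mitigate it.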
Abstract: Mixture-of-Experts (MoE) networks promise favorable accuracy-compute trade-offs, yet practical vision deployments are hindered by expert collapse and limited end-to-end efficiency gains. We study when sparse top-$k$ routing with hard capacity constraints helps in vision classification, evaluated under multi-seed protocols on four benchmarks (CIFAR-10/100, Tiny-ImageNet, ImageNet-1K). We observe a compute-leverage pattern: positive accuracy gaps require a substantial fraction $ρ$ of total FLOPs to be routed; at ImageNet scale this is necessary but not sufficient, as multi-expert routing ($k \geq 2$) is additionally required. Two controlled experiments isolate these factors. A hidden-size sweep on CIFAR-10 yields both predicted sign reversals across standard and depthwise backbones, ruling out backbone family as the active variable. An ImageNet-1K ablation that varies only top-$k$ (holding architecture, initialization, and $ρ$ fixed) reverses the gap from positive to negative across all five seeds. A per-sample variant of Soft MoE that softmaxes over experts rather than the batch rescues CIFAR-100 above the dense baseline, identifying batch-axis dispatch as the dominant failure mode in per-sample CNN settings. Code and aggregate results: https://github.com/libophd/sparse-moe-vision-rho.
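To make "sparse top-k routing with hard capacity constraints" and the routed fraction ρ concrete, here is a minimal sketch in plain Python. It is a generic illustration, not the authors' implementation: the function names, the first-come-first-served overflow policy, and the FLOP figures passed to `routed_fraction` are all assumptions.

```python
def route_top_k(gate_scores, k, capacity):
    """Assign each sample to its top-k experts, subject to a hard capacity.

    gate_scores: list with one list of per-expert scores per sample.
    Returns (assignments, dropped): assignments[e] holds the sample indices
    accepted by expert e; dropped holds (sample, expert) pairs that overflowed.
    """
    num_experts = len(gate_scores[0])
    assignments = [[] for _ in range(num_experts)]
    dropped = []
    for i, scores in enumerate(gate_scores):
        # Experts ranked by gate score, highest first; keep the best k.
        top = sorted(range(num_experts), key=lambda e: scores[e], reverse=True)[:k]
        for e in top:
            if len(assignments[e]) < capacity:
                assignments[e].append(i)
            else:
                dropped.append((i, e))  # expert is full: hard capacity constraint
    return assignments, dropped


def routed_fraction(routed_flops, total_flops):
    """rho: the share of total FLOPs that passes through routed (expert) layers."""
    return routed_flops / total_flops


gates = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1], [0.5, 0.1, 0.4]]
top1, dropped1 = route_top_k(gates, k=1, capacity=2)
top2, dropped2 = route_top_k(gates, k=2, capacity=2)
```

In this toy batch, k=1 leaves the third sample with no expert at all once expert 0 is full, whereas k=2 still gives it a second expert. This only illustrates the mechanics; it does not reproduce the paper's accuracy results.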
[38] Self-Prompting Diffusion Transformer for Open-Vocabulary Scene Text Editing via In-Context Learning cs.CVPDF
Hongxi Li, Tong Wang, Chengjing Wu, Tianbao Liu, Jiangtao Yao
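TL;DR: This paper proposes a self-prompting scene text editing method that builds style and glyph prompts directly from the original image, without additional style or glyph encoders. It uses a two-stage training strategy: a diffusion transformer is first trained on large-scale self-supervised data and then fine-tuned on a small set of paired images. By exploiting the in-context learning ability of the multi-modal diffusion transformer, it achieves open-vocabulary, style-consistent text editing.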
Details
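Motivation: Existing methods rely only on image background information and ignore the visual details of the target region, discarding the style of the original text and effectively reducing the task to text rendering. In addition, the conditions imposed by a pre-trained glyph encoder limit the range of editable text.
Result: Experiments on multiple languages show state-of-the-art performance in both text accuracy and style consistency.
Insight: The contribution is a self-prompting mechanism that builds prompts directly from the original image, avoiding the limits of extra encoders, combined with two-stage training and the in-context learning ability of the diffusion transformer to enable open-vocabulary editing. In essence, it pairs self-supervised pre-training with fine-tuning on a small amount of paired data to generate text that preserves style.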
Abstract: Scene text editing aims to modify text in a target region of an image while preserving surrounding background style and texture. Existing methods rely solely on image background information while neglecting the visual details of target regions, which discards stylistic features in the original text and essentially degrades the task to text rendering. Moreover, the conditions imposed by a pre-trained glyph encoder limit the scope of editable text. To address these issues, this paper proposes a self-prompting scene text editing method that constructs style and glyph prompts directly from the original image, without introducing additional style or glyph encoders. We employ a two-stage training strategy: the diffusion transformer is first trained on large-scale self-supervised data and then refined using a small set of paired images. By leveraging the in-context learning capability of the Multi-Modal Diffusion Transformer (MM-DiT), it achieves open-vocabulary and style-consistent text editing. Experimental results on various languages demonstrate that our method achieves state-of-the-art performance in both text accuracy and style consistency. Our project page: https://hongxiii.github.io/mstedit
[39] AnyAct: Towards Human Reenactment of Character Motion From Video cs.CV | cs.GRPDF
Liuhan Chen, Lei Zhong, Jiewei Wang, Qin Shuai, Li Yuan
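TL;DR: This paper proposes AnyAct, which generates an initial human reenactment animation directly from a monocular video of a non-human character. Its central idea is to use sparse local articulated motion cues as a bridge that turns the dynamics in the character video into an editable human performance for downstream animation authoring.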
Details
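Motivation: Existing video-based motion capture methods are largely confined to human-centric structural spaces, and motion retargeting methods typically require structured 3D source motion and a known source topology. The paper tackles generating a human reenactment directly from non-human character video to overcome these limits.
Result: On a newly constructed benchmark covering diverse non-human character videos, AnyAct produces high-fidelity initial human reenactments that preserve the essential dynamics of the characters in the reference videos. Ablation studies further validate its core designs.
Insight: The contribution is to formulate character-video-driven human reenactment as conditional human motion generation from transferable sparse local 2D articulated motion. The key designs are human-motion-only supervision via augmented 3D-to-2D projection, progressive 3D-to-2D training to alleviate conditioning ambiguity, and global-local motion decoupling for reliable local motion control.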
Abstract: We study the problem of directly deriving an initial human reenactment from a monocular video of a non-human character. Our goal is not to reconstruct the source character itself but to reinterpret its motion as a plausible and editable human performance for downstream animation authoring. This task is challenging because existing video-based motion capture methods are largely restricted to human-centric structural spaces, while motion retargeting methods typically require structured 3D source motions and known source topologies. Our key insight is that sparse local articulated motion cues can preserve essential dynamics across large structural differences, providing a stable bridge from character video to human reenactment. Based on this observation, we propose AnyAct, which formulates character-video-driven human reenactment as conditional human motion generation from transferable sparse local 2D articulated motion. To make this practical, we introduce three key designs: human-motion-only supervision via augmented 3D-to-2D projection, progressive 3D-to-2D training to alleviate conditioning ambiguity, and global-local motion decoupling for reliable local motion control. We further construct a benchmark primarily covering diverse non-human character videos. Experiments on the benchmark show that AnyAct produces high-fidelity initial human reenactments that preserve the essential dynamics of the characters in reference videos, and further ablation studies validate the effectiveness of its core designs.
[40] Tuning-free Instruction-based Video Editing Via Structural Noise Initialization and Guidance cs.CV | cs.AIPDF
Song Wu, Xinyu Chen, Qian Wang, Liang Li, Zili Yi
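TL;DR: This paper proposes a tuning-free, instruction-based video editing framework. A Structural Noise Initialization Strategy (SNIS) assigns higher noise levels to edited regions to facilitate content change and lower noise levels to unedited regions to maintain consistency. A Noise Guidance Mechanism (NGM) uses the video prior in the generative model and the rich information in the noisy latent to guide denoising, preserving unedited content and overall visual coherence.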
Details
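Motivation: Existing tuning-free video editing methods underuse the rich information in the noisy latent, which leads to unsatisfactory results. The paper aims for higher-quality instruction-driven video editing.
Result: Experiments show that the method achieves state-of-the-art (SOTA) visual quality and performance.
Insight: The contribution is to approach editing from the noisy-latent perspective with a structural noise initialization strategy and a noise guidance mechanism, which balance content change in edited regions against preservation of unedited regions and improve the coherence and quality of the edits.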
Abstract: Video editing poses a significant challenge. While a series of tuning-free methods circumvent the need for extensive data collection and model training, they often underutilize the rich information embedded within noisy latent, leading to unsatisfactory results. To address this, we propose a tuning-free, instruction-based video editing framework. We approach video editing from the perspective of noisy latent: we design a Structural Noise Initialization Strategy (SNIS) to secure a superior editing starting point by assigning higher noise levels to edited regions (to facilitate content change) and lower noise levels to unedited regions (to maintain content consistency). We introduce a Noise Guidance Mechanism (NGM), which leverages the video prior in the generative model and effectively integrates rich information within the noisy latent to guide the denoising process, thereby preserving unedited content and overall visual coherence. Experiments show that our proposed method achieves better visual quality and state-of-the-art performance.
[41] Learning Dynamic Structural Specialization for Underwater Salient Object Detection cs.CVPDF
Lin Hong, Chenhui Wang, Linan Deng, Yuning Cui, Yu Zhang
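TL;DR: This paper proposes DSS-USOD, a new underwater salient object detection method built on dynamic structural specialization. It extracts a shared base representation from a single underwater image, decomposes it into boundary-sensitive and region-coherent structural features, and dynamically coordinates their contributions according to local structural context. Experiments show superior performance on benchmark datasets, and deployment on an underwater robot validates its practicality.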
Details
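Motivation: Existing underwater salient object detection methods suffer from underwater image degradation, which often causes inaccurate object localization, fragmented salient regions, and coarse boundary prediction. The paper aims to address these problems.
Result: Extensive experiments on benchmark datasets show that DSS-USOD achieves superior performance.
Insight: The contribution is a dynamic structural specialization framework that decomposes the shared representation, introduces a spatial coordination module, and adds cooperative structural supervision to adaptively balance boundary precision and region coherence under degraded underwater conditions.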
Abstract: Underwater salient object detection (USOD) has attracted increasing attention for underwater visual scene understanding and vision-guided robotic applications. However, existing USOD methods still struggle with underwater image degradations, which often lead to inaccurate object localization, fragmented salient regions, and coarse boundary prediction. To address these challenges, this paper proposes DSS-USOD, a novel RGB-based USOD method built upon dynamic structural specialization. DSS-USOD extracts a shared base representation from a single underwater image, decomposes it into boundary-sensitive and region-coherent structural features, and dynamically coordinates their contributions according to local structural context. Specifically, the extracted shared base representation is decomposed into a boundary-sensitive branch for modeling fine-grained boundary details and a region-coherent branch for capturing region-level structural consistency. A spatial coordination module is then introduced to adaptively regulate the relative contributions of the two branches according to local structural context. Moreover, cooperative structural supervision is introduced to promote branch specialization and stabilize spatial coordination, enabling DSS-USOD to better balance boundary precision and region coherence under degraded underwater conditions. Extensive experiments show that DSS-USOD achieves superior performance on benchmark datasets. Finally, real-world deployment on an underwater robot validates the practical effectiveness of DSS-USOD for underwater object inspection.
[42] MI-CXR: A Benchmark for Longitudinal Reasoning over Multi-Interval Chest X-rays cs.CVPDF
Sunghwan Steve Cho, Yunseok Han, Jaeyoung Do
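TL;DR: This paper introduces MI-CXR, a benchmark for evaluating reasoning over multi-interval, multi-visit longitudinal chest X-ray sequences. It comprises three task families: Temporal Event Localization, Interval-wise Change Reasoning, and Global Trajectory Summarization. Across 14 state-of-the-art vision-language models, average accuracy is only 29.3%, modestly above random guessing, which exposes major limitations of current models in long-horizon temporal reasoning.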
Details
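Motivation: Most existing medical VQA benchmarks focus on single images or short-horizon image pairs and offer no standardized evaluation of long-horizon reasoning about how disease evolves across a multi-visit timeline.
Result: Across 14 SOTA vision-language models evaluated on MI-CXR, average accuracy is 29.3%, only modestly above random guessing (20%). Diagnostic probing shows that models produce locally plausible interval descriptions but fail to enforce temporal constraints or compose evidence into globally consistent decisions.
Insight: The contribution is a standardized medical vision benchmark focused on multi-interval longitudinal reasoning, with three complementary, clinically grounded task families. It systematically exposes key weaknesses of current VLMs in long-horizon temporal constraints and evidence composition, and gives future work a clear evaluation target.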
Abstract: Longitudinal chest X-ray (CXR) interpretation requires reasoning over disease evolution across multiple patient visits, yet most existing medical VQA benchmarks focus on single images or short-horizon image pairs. We introduce MI-CXR, a benchmark for standardized evaluation of Multi-Interval longitudinal reasoning over multi-visit CXR sequences, without requiring free-form report generation or additional clinical context. MI-CXR comprises five-way multiple-choice questions over five-visit patient timelines and instantiates three complementary task families: Temporal Event Localization, Interval-wise Change Reasoning, and Global Trajectory Summarization, which assess clinically grounded visual reasoning over time. Evaluating 14 state-of-the-art vision-language models (VLMs) shows low overall performance, with an average accuracy of 29.3%, only modestly above random guessing. Using stage-wise diagnostic probing, we find that models often produce locally plausible interval descriptions but fail to enforce temporal constraints or compose evidence into globally consistent decisions over the full timeline. These findings reveal key limitations of current VLMs and establish MI-CXR as a principled benchmark for longitudinal medical reasoning. The benchmark is available at https://github.com/AIDASLab/MI-CXR
[43] RoiMAM: Region-of-Interest Medical Attention Model for Efficient Vision-Language Understanding cs.CVPDF
Jiayan Yang, Zhuoyu Wu, Wenqi Fang
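TL;DR: This paper proposes RoiMAM, an efficient vision-language model for medical visual question answering. A training-free region-of-interest generation module and a text prompt enhancer focus the model on lesion-relevant regions and supply modality-specific context, yielding more accurate diagnosis at a much smaller model size.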
Details
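Motivation: Existing medical vision-language models typically depend on large architectures and closed-set answers, which limits their efficiency and clinical applicability. The paper aims to overcome these shortcomings with a more efficient and more accurate model.
Result: Compared with the widely used MedVInT-TD model, RoiMAM is more than 80% smaller while improving accuracy by about 2% on SLAKE and about 4.6% on PMC-VQA.
Insight: The contribution is the combination of a training-free ROI generation module with Semantic Selective Suppression, which focuses on key regions, and a Text Prompt Enhancer that provides context without adding trainable parameters, improving both parameter efficiency and performance.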
Abstract: Vision-Language Models (VLMs) facilitate medical visual question answering (MedVQA) by jointly interpreting images and text. However, existing models typically depend on large architectures and closed-set answers, which limits their efficiency and potential clinical applicability. To overcome these shortcomings, we introduce RoiMAM, an efficient VLM. It integrates a training-free ROI Generation Module with Semantic Selective Suppression to focus on lesion-relevant regions, alongside a Text Prompt Enhancer module that provides modality-specific context without introducing training parameters. Compared to the widely used MedVInT-TD model, our design achieves efficient and accurate diagnosis at less than 20% of the model size, while improving accuracy by approximately 2% on SLAKE and 4.6% on PMC-VQA.
[44] Unsupervised 3D Human Pose Estimation via Conditional Multi-view Ancestral Sampling cs.CVPDF
Ryohei Goto, Takuya Fujihashi, Shunsuke Saruwatari, Fumio Okura
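TL;DR: This paper proposes a single-view 3D human pose estimation method that needs no 3D supervision. It relies on the 2D diffusion prior of a motion diffusion model (MDM) pre-trained on large 2D human pose datasets and lifts 2D poses to 3D with a newly proposed technique, conditional multi-view ancestral sampling (cMAS).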
Details
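Motivation: The paper addresses how to estimate accurate 3D human pose from a single image when 3D annotations are unavailable, especially for extreme poses.
Result: Experiments on the Yoga dataset show better cross-domain performance than state-of-the-art supervised and unsupervised 3D pose estimation methods, particularly on extreme human poses for which no 3D supervision exists.
Insight: The contribution is to extend multi-view ancestral sampling of diffusion models and condition it for 2D-to-3D pose lifting: the 3D pose is optimized so that its multi-view projections follow the manifold of the 2D diffusion model while matching the given 2D pose and human anatomical constraints, which enables 3D reconstruction without 3D supervision.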
Abstract: We propose a method of estimating a 3D human pose from a single view without 3D supervision. The key to our method is to leverage the 2D diffusion priors of motion diffusion models (MDMs) pre-trained on large 2D human pose datasets. Specifically, we extend multi-view ancestral sampling of diffusion models to the task of 2D-3D lifting of human pose. To this end, we newly propose a conditional multi-view ancestral sampling (cMAS) that optimizes the 3D pose such that its multi-view projections follow the manifold in 2D MDM noise space, while conditioning the 3D pose to match the given 2D poses and anatomical constraints of humans. Experiments on the Yoga dataset demonstrate that our method achieves better cross-domain performance compared to state-of-the-art supervised and unsupervised 3D pose estimation methods, including extreme human poses where 3D supervision is unavailable. Code is available at: https://github.com/asaa0001/c-MAS.
[45] AGC: Adaptive Geodesic Correction for Adversarial Robustness on Vision-Language Models cs.CVPDF
Zhiwei Li, Jiacheng Xue, Weining Wang, Ajian Liu, Xingyu Gao
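TL;DR: This paper proposes Adaptive Geodesic Correction (AGC), a training-free defense that improves the robustness of vision-language models such as CLIP to adversarial perturbations. It identifies a reliable augmentation as a geometric anchor and corrects the input feature toward that anchor with an adaptive step size, improving robustness while preserving clean accuracy.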
Details
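Motivation: Vision-language models such as CLIP excel at zero-shot transfer but remain vulnerable to imperceptible adversarial perturbations. Existing test-time defenses usually rely on gradient-based optimization, which incurs high computational overhead, so an efficient defense that needs no parameter updates is required.
Result: Across eight fine-grained datasets and three CLIP backbones, AGC improves average robust accuracy by 44.4% over the state-of-the-art baseline while reducing inference latency by 10×.
Insight: The key observation is that data augmentations are not equally effective for CLIP robustness: specific augmentations provide robust geometric cues that align with the correct class semantics in the hyperspherical feature space. AGC exploits this geometric property through adaptive correction to obtain an efficient defense, offering a new paradigm for robust multimodal deployment.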
Abstract: Vision-language models like CLIP have demonstrated remarkable zero-shot transfer capabilities. However, their susceptibility to imperceptible adversarial perturbations remains a critical security concern. While test-time defenses offer a pragmatic solution for deployed models, existing approaches typically rely on gradient-based optimization during inference, incurring significant computational overhead. In this paper, we revisit the role of data augmentation in CLIP robustness and observe that augmentations are not equally effective: specific augmentations consistently provide robust geometric cues that align with correct class semantics in the hyperspherical feature space. Based on this, we propose Adaptive Geodesic Correction (AGC), a training-free defense mechanism that requires no parameter updates. AGC identifies a reliable augmentation as a geometric anchor and corrects the input feature towards it, utilizing an adaptive step size to balance robustness against clean accuracy preservation. AGC achieves superior performance across eight fine-grained datasets and three CLIP backbones, improving average robust accuracy by 44.4% over the state-of-the-art baseline while delivering a 10× reduction in inference latency. Our findings reveal a fundamental geometric property of CLIP features, offering a highly efficient and effective paradigm for robust multimodal deployment.
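The abstract describes correcting a feature toward an anchor in a hyperspherical feature space but does not give the update rule or how the step size is adapted. One natural way to move along a geodesic on the unit sphere is spherical linear interpolation, sketched below. Treating AGC's correction as slerp is an assumption made for illustration, `step` is left as a plain parameter because the adaptive rule is not specified here, and the function names are hypothetical.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def geodesic_step(feature, anchor, step):
    """Move `feature` a fraction `step` in [0, 1] along the great-circle
    arc toward `anchor` (spherical linear interpolation on the unit sphere)."""
    f, a = normalize(feature), normalize(anchor)
    cos_t = max(-1.0, min(1.0, sum(x * y for x, y in zip(f, a))))
    theta = math.acos(cos_t)  # geodesic distance between the two unit vectors
    sin_t = math.sin(theta)
    if sin_t < 1e-8:
        # Nearly identical or antipodal: the arc is degenerate, leave unchanged.
        return f
    w_f = math.sin((1.0 - step) * theta) / sin_t
    w_a = math.sin(step * theta) / sin_t
    return normalize([w_f * x + w_a * y for x, y in zip(f, a)])


# Halfway between two orthogonal unit vectors.
corrected = geodesic_step([1.0, 0.0], [0.0, 1.0], step=0.5)
```

With `step=0` the feature is unchanged and with `step=1` it lands on the anchor, so a small step keeps the result close to the original feature while a large step trusts the anchor more, which matches the trade-off between clean accuracy and robustness that the abstract attributes to the step size.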
[46] Neutral-Reference Prompting for Vision-Language Models cs.CV | cs.LGPDF
Senmao Tian, Xiang Wei, Shunli Zhang
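TL;DR: This paper proposes NeRP, a plug-and-play prompting correction strategy for the base-new trade-off in efficient transfer learning of vision-language models (VLMs). It uses neutral text prompts and reference images to measure the pre-trained model's class-wise prior preferences and combines them with the sample likelihood to obtain a surrogate score. Without modifying model parameters, it corrects prior-dominated mispredictions and improves recognition of new classes without sacrificing base-class performance.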
Details
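Motivation: VLMs often show asymmetric confusion on downstream data: samples of class A are systematically mispredicted as class B, while the reverse confusion (B to A) rarely occurs. For new classes, this pretraining-induced bias persists and harms generalization, whereas existing work often simply attributes the base-new trade-off to overfitting on known classes.
Result: Extensive experiments across multiple backbones and 15 few-shot and cross-domain benchmarks show that NeRP substantially improves accuracy on unseen classes while preserving prediction performance on known classes.
Insight: The contribution is to expose the effect of asymmetric confusion on VLM generalization and to propose a prompting correction strategy that requires no parameter tuning and corrects predictions by combining prior preferences with the sample likelihood, offering a new angle on the base-new trade-off.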
Abstract: Efficient transfer learning of vision-language models (VLMs) commonly suffers from a Base-New Trade-off (BNT): improving performance on unseen (new) classes often degrades accuracy on known (base) classes. Addressing how to boost recognition of unseen classes without sacrificing known-class performance remains a central challenge. Existing work often simplistically attributes the BNT to overfitting on known classes. We observe an interesting phenomenon: VLMs frequently exhibit asymmetric confusion on certain downstream data, i.e., samples of class A are systematically mispredicted as class B, while the reverse confusion (B to A) rarely occurs. For known classes, this kind of bias can be mitigated by tuning using a cross-entropy loss, but for unseen classes, such pretraining-induced bias persists and harms generalization. Motivated by this, we propose NeRP, a plug-and-play prompting correction strategy that improves discrimination on unseen classes without modifying model parameters. NeRP leverages neutral text prompts and reference images to measure class-wise prior preferences along the pre-trained inter-class geometry, and combines them with the sample likelihood to obtain the model’s surrogate score. If, for a given sample, the prior strongly favors the current prediction while the observed evidence is clearly insufficient, we perform a local flip between easily confusable class pairs, thereby correcting prior-dominated mispredictions. Extensive experiments across multiple backbones and 15 few-shot and cross-domain benchmarks show that NeRP substantially improves accuracy on unseen classes while preserving known-class prediction performance.
[47] LRCP: Low-Rank Compressibility Guided Visual Token Pruning for Efficient LVLMs cs.CVPDF
Hongyu Lu, Feng Zhang, Wenwei Jin, Huanling Hu, Tianjun Shi
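TL;DR: This paper proposes LRCP, a training-free visual token compression framework for efficient large vision-language models (LVLMs). Based on the observation that visual token representations have a pronounced low-rank structure, it estimates the dominant low-rank subspace with principal component analysis (PCA), scores each token by its projection residual with respect to that subspace, and retains the tokens that the low-rank background explains poorly, sharply reducing the token count while preserving model performance.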
Details
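Motivation: Existing attention-score-based methods may introduce positional bias, and representation-based methods overlook the global structure of the visual token set when reducing visual redundancy. The paper revisits visual token compression from the perspective of low-rank compressibility to reduce LVLM inference cost more effectively.
Result: On image understanding, LRCP preserves 94.7% of the original performance with an 88.9% token reduction; on video understanding, it preserves 97.8% of the average accuracy with an 87.5% token reduction.
Insight: The contribution is the first systematic study of visual token compression from the low-rank compressibility perspective, together with a token-importance score based on the PCA projection residual. The method is training-free, captures the global structure of the visual tokens, and avoids the limitations of existing methods.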
Abstract: Large vision-language models (LVLMs) achieve strong multimodal understanding, but their inference cost grows rapidly with the number of visual tokens, especially for high-resolution images and long videos. Existing attention-based methods estimate token importance from attention scores, which may introduce positional bias, while representation-based methods reduce visual redundancy based on feature relations or reconstruction errors, overlooking the global structure of the visual token set. In this paper, we revisit visual token compression from the perspective of low-rank compressibility. Across models and datasets, we observe that visual token representations exhibit a pronounced low-rank structure, with a dominant subspace that remains stable even after a large fraction of tokens is randomly removed. Motivated by this finding, we propose LRCP, a training-free compression framework that first estimates the dominant low-rank subspace of visual tokens via PCA, and then scores each token by its projection residual onto this subspace, retaining tokens that are poorly explained by the low-rank background. Extensive experiments show that LRCP achieves superior results, preserving 94.7% of the original image-understanding performance with an 88.9% token reduction and 97.8% of the average video-understanding accuracy with an 87.5% token reduction.
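The two steps named in the abstract (estimate the dominant subspace via PCA, then score each token by its projection residual) can be sketched with NumPy as follows. This is a minimal illustration rather than the authors' code: the function names, the mean-centering step, the use of SVD to obtain the principal directions, and the `rank` and `keep` parameters are assumptions.

```python
import numpy as np


def lowrank_residual_scores(tokens, rank):
    """Score each token by how poorly the top-`rank` PCA subspace explains it.

    tokens: (n_tokens, dim) array of visual token features.
    Returns an (n_tokens,) array of residual norms; larger means less redundant.
    """
    centered = tokens - tokens.mean(axis=0, keepdims=True)
    # Rows of vt are the principal directions, ordered by singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:rank]                          # (rank, dim)
    projected = centered @ basis.T @ basis     # projection onto the subspace
    return np.linalg.norm(centered - projected, axis=1)


def prune_tokens(tokens, rank, keep):
    """Return the indices (in original order) of the `keep` highest-residual tokens."""
    scores = lowrank_residual_scores(tokens, rank)
    return np.sort(np.argsort(scores)[-keep:])


# Four tokens lie on a line (the low-rank "background"); the last one does not.
tokens = np.array([[-4.0, 0.0], [-2.0, 0.0], [2.0, 0.0], [4.0, 0.0], [0.0, 1.0]])
kept = prune_tokens(tokens, rank=1, keep=1)
```

In the toy example the off-line token has the largest residual and is the one retained. An 88.9% token reduction, as reported for images, would correspond to setting `keep` to roughly 11% of the token count.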
[48] Latent Video Prediction Learns Better World Models cs.CV | cs.AIPDF
Ali J Alrasheed, Aryan Yazdan Parast, Basim Azam, James Bailey, Naveed Akhtar
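TL;DR: This paper presents the first systematic evaluation of the robustness of four frontier video foundation models (V-JEPA 2.1, V-JEPA 2, VideoPrism, VideoMAEv2) as world models. It finds that latent-prediction models show a consistent and distinct advantage across five axes: feature discriminability, corruption robustness, fine-grained discrimination, occlusion robustness, and sensitivity to temporal direction.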
Details
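Motivation: Self-supervised video models are often treated as world models, yet their evaluation relies mainly on top-1 accuracy on clean benchmarks, which limits understanding of their potential as world models. A systematic study across multiple robustness axes is therefore needed.
Result: Latent-prediction models perform better on all five robustness axes; for example, they degrade more gracefully under pixel corruption and preserve class structure, not just geometric stability, under occlusion. A frozen V-JEPA 2 backbone with a lightweight attentive probe even outperforms a fully fine-tuned VideoMAE and a supervised TimeSformer on corruption and occlusion robustness.
Insight: The contribution is the first systematic, multi-axis robustness evaluation of video world models, with concrete evidence for the advantage of latent prediction in building robust world models: such models better encode physical contact cues and the direction of time, and the advantage carries over to downstream tasks.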
Abstract: Self-supervised video models are increasingly framed as world models, yet their evaluation remains largely confined to a single top-1 accuracy score on clean benchmarks. This leaves a major gap in comprehending their potential as world models. We present the first systematic study addressing this gap, analyzing four matched-capacity frontier video foundation models, V-JEPA 2.1, V-JEPA 2, VideoPrism, and VideoMAEv2, across five robustness axes relevant to their deployment as video world models: feature discriminability, corruption robustness, fine-grained discrimination, occlusion robustness, and sensitivity to temporal direction. Our evaluations establish that across all five axes, latent-prediction models form a distinct and consistent profile. They degrade more gracefully under pixel corruption, preserve usable class structure rather than mere geometric stability under occlusion, capture fine-grained physical contact cues without reconstructing pixels, and uniquely encode the arrow of time. These advantages can even survive task adaptation: a frozen V-JEPA 2 backbone with a lightweight attentive probe outperforms a fully fine-tuned VideoMAE and a supervised TimeSformer on corruption and occlusion robustness. Our extensive results offer concrete new evidence in favor of latent prediction for robust world modeling.
[49] MaTe: Images Are All You Need for Material Transfer via Diffusion Transformer cs.CVPDF
Nisha Huang, Henglin Liu, Yizhou Lin, Kaer Huang, Chubin Chen
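TL;DR: This paper proposes MaTe, a training-free, zero-shot material transfer framework based on a diffusion Transformer. It processes inputs in a shared latent space through multi-modal attention and achieves high-quality material generation without textual guidance or auxiliary networks.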
Details
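Motivation: Existing diffusion-based material transfer methods rely on image fine-tuning or complex architectures with auxiliary networks and suffer from text dependency, high computational cost, and feature misalignment. MaTe aims to simplify the architecture and remove these limitations.
Result: Under a zero-shot, training-free paradigm, MaTe outperforms existing SOTA methods in both visual quality and efficiency while preserving precise detail alignment, and it significantly simplifies inference prerequisites.
Insight: The contribution is unified processing in a shared latent space through token-level image integration and multi-modal attention, with no need for adapters, ControlNet, inversion sampling, or model fine-tuning, giving a lighter and more efficient material transfer approach for diffusion models.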
Abstract: Recent diffusion-based methods for material transfer rely on image fine-tuning or complex architectures with assistive networks, but face challenges including text dependency, extra computational costs, and feature misalignment. To address these limitations, we propose MaTe, a streamlined diffusion framework that eliminates textual guidance and reference networks. MaTe integrates input images at the token level, enabling unified processing via multi-modal attention in a shared latent space. This design removes the need for additional adapters, ControlNet, inversion sampling, or model fine-tuning. Extensive experiments demonstrate that MaTe achieves high-quality material generation under a zero-shot, training-free paradigm. It outperforms state-of-the-art methods in both visual quality and efficiency while preserving precise detail alignment, significantly simplifying inference prerequisites.
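[50] VLMs Trace Without Tracking: Diagnosing Failures in Visual Path Following cs.CV | cs.AI
Hyesoo Hong, Minsoo Kim, Wonje Jeung, Sangyeon Yoon, Dongjae Jeon
TL;DR: This paper studies how vision-language models (VLMs) fail at visual path following, specifically the line tracing task. Even state-of-the-art VLMs frequently lose the target path and follow visually similar nearby distractor paths instead, especially in controlled tasks with locally similar competitors. The failures stem from local competition, and standard remedies such as model scaling, reasoning, or explicit instructions do not remove the bottleneck.
Details
Motivation: Although VLMs perform strongly on multimodal benchmarks, they may still lack robust control over basic visual operations. This study diagnoses VLM failures on line tracing, a basic visual task, to probe the limits of their visual reasoning.
Result: On the controlled tracing tasks, even state-of-the-art VLMs fail frequently, especially when locally similar distractor paths are present. Further tests show that the same path-switching failure persists in more complex real-world scenes such as tangled cables and metro maps.
Insight: The contribution is a suite of controlled tasks that isolates and diagnoses a specific failure mode of VLMs in visual path following, identifying local competition as the key bottleneck. Viewed objectively, the study offers a new diagnostic tool and perspective on the boundaries of VLM visual reasoning, and highlights the fragility of current models on low-level visual operations, as opposed to semantic understanding alone.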
Abstract: Vision-language models (VLMs) achieve strong performance on multimodal benchmarks, but may still lack robust control over basic visual operations. We study line tracing, where a model must follow a selected visual path through successive local continuations. To isolate this ability, we design controlled tracing tasks that introduce nearby competitors while reducing semantic and topological ambiguity such as crossings and overlaps. Across these tasks, even state-of-the-art VLMs frequently lose the target path and switch to nearby alternatives, especially when those alternatives look locally similar to the target. Behavioral interventions and internal analyses indicate that these failures arise from local competition: nearby similar distractors pull the model away from the true continuation. Standard remedies do not remove this bottleneck: model-size scaling provides only limited gains, reasoning partially compensates through costly substitute strategies, and explicit tracing instructions fail to recover stable path following. Finally, tests on tangled-cable scenes and metro maps with richer visual complexity show that the same path-switching failure persists beyond our controlled settings.
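[51] EntropyScan: Towards Model-level Backdoor Detection in LVLMs via Visual Attention Entropy cs.CV
Xuanyu Ge, Zhongqi Wang, Jie Zhang, Shiguang Shan, Xilin Chen
TL;DR: This paper proposes EntropyScan, a lightweight, trigger-agnostic method for model-level backdoor detection in large vision-language models (LVLMs). It identifies backdoored models by quantifying structural anomalies in visual attention distributions on benign samples: it takes the visual attention distributions from the initial LLM layers and applies Tsallis entropy to capture the disruption of multimodal alignment caused by backdoor injection.
Details
Motivation: Existing defenses focus on the sample level and depend on knowledge of the training data or triggers, while deciding whether a given model is backdoored is a critical but unexplored task. The paper aims to fill this gap with model-level backdoor detection that needs no trigger knowledge.
Result: Extensive experiments across two LVLM architectures and three advanced attack scenarios show that EntropyScan reaches an average F1 score of 98.5% and an AUC of 96.6%, delivering effective model-level backdoor detection.
Insight: The paper is the first to address model-level backdoor detection for LVLMs and proposes anomalies in visual attention entropy on benign samples as the detection signal. Viewed objectively, using cross-modal alignment disruption as an intrinsic feature, together with reference-anchored Z-score normalization, offers a new model-level analysis perspective for backdoor defense that requires no trigger knowledge.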
Abstract: Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities across various tasks, yet they remain vulnerable to backdoor attacks. Existing defense methods predominantly focus on sample-level defense, which relies on the knowledge of training data or triggers. However, identifying whether a given model is backdoored remains a critical but unexplored task. To fill this gap, we propose EntropyScan, a lightweight and trigger-agnostic method for model-level backdoor detection in LVLMs. We first observe that backdoor injection disrupts the cross-modal alignment, resulting in pronounced structural anomalies in visual attention allocation on benign samples. Based on this insight, EntropyScan detects backdoored models by quantifying such attention deviations. Specifically, it extracts visual attention distributions from the initial layers of the Large Language Model (LLM) and applies Tsallis entropy to capture these structural distortions. By employing a reference-anchored Z-score normalization on a small set of benign samples, it effectively identifies the backdoored model. Extensive experiments across two LVLM architectures and three advanced attack scenarios show that EntropyScan achieves an F1 score of 98.5% on average and an AUC of 96.6%. Our code will be publicly available soon.
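The two statistical ingredients named in the abstract, Tsallis entropy and a reference-anchored Z-score, can be illustrated with a minimal sketch. This is not the authors' implementation: the entropic index q = 2, the toy attention vectors, and the reference values are all assumptions chosen for illustration, and the digest does not say how attention is extracted or which threshold is used.

```python
import math
from statistics import mean, stdev


def tsallis_entropy(p, q=2.0):
    """Tsallis entropy S_q(p) = (1 - sum_i p_i^q) / (q - 1).

    As q -> 1 this recovers Shannon entropy. q = 2.0 is an assumed default.
    """
    total = sum(p)
    p = [x / total for x in p]  # normalise attention weights to a distribution
    if abs(q - 1.0) < 1e-9:
        return -sum(x * math.log(x) for x in p if x > 0)
    return (1.0 - sum(x ** q for x in p)) / (q - 1.0)


def reference_z_score(value, reference_values, eps=1e-12):
    """Z-score of one model's statistic against statistics from clean reference runs."""
    mu = mean(reference_values)
    sigma = stdev(reference_values)
    return (value - mu) / (sigma + eps)


# Hypothetical attention over four visual tokens.
uniform = [0.25, 0.25, 0.25, 0.25]   # attention spread evenly
peaked = [0.97, 0.01, 0.01, 0.01]    # attention collapsed onto one token

# Hypothetical entropies measured on clean reference models.
clean_reference = [0.70, 0.72, 0.74, 0.76]

z = reference_z_score(tsallis_entropy(peaked), clean_reference)
print(tsallis_entropy(uniform), tsallis_entropy(peaked), z)
```

A model whose entropy statistic sits many standard deviations from the clean reference would be flagged; which direction counts as anomalous is left open here, so comparing `abs(z)` against a threshold is the safer reading.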
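[52] 3D Segmentation Using Viewpoint-Dependent Spatial Relationships cs.CV
Ayaka Nanri, Klara Reichard, Mert Kiray, Federico Tombari, Benjamin Busam
TL;DR: To address the ambiguity of observer-dependent spatial relations (such as "left/right" and "front/behind") in 3D scene understanding, this paper introduces a viewpoint-aware 3D referring segmentation dataset with 220k samples, scalable to tens of millions of samples through dense viewpoint sampling. It shows that existing 3D large multimodal models struggle with viewpoint-dependent spatial instructions in a zero-shot setting, and introduces a viewpoint representation that encodes camera poses and feeds the observation viewpoint into the model, which markedly improves segmentation accuracy.
Details
Motivation: Existing 3D referring segmentation methods usually do not explicitly represent the observer viewpoint, so viewpoint-dependent spatial relations (such as "left/right") are ambiguous and hard to evaluate, which limits natural language 3D scene understanding.
Result: On the proposed viewpoint-aware dataset, zero-shot evaluation shows that existing 3D large multimodal models struggle with viewpoint-dependent spatial instructions. With the viewpoint representation, segmentation accuracy on viewpoint-dependent relations improves, with mIoU rising from 0.30 to 0.47.
Insight: The contributions are the first large-scale viewpoint-aware 3D referring segmentation dataset and a viewpoint representation that explicitly encodes camera poses and conditions 3D large multimodal models on the observation viewpoint, effectively resolving the ambiguity of viewpoint-dependent spatial relations.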
Abstract: Recent advances in 3D datasets and multimodal models have greatly improved natural language 3D scene understanding. However, most 3D referring segmentation methods do not explicitly represent the observer viewpoint, making spatial relations such as “left,” “right,” “front,” and “behind” ambiguous and difficult to evaluate. We introduce a viewpoint-aware 3D referring segmentation dataset containing 220k benchmark samples, and scalable to tens of millions of viewpoint-conditioned samples through dense viewpoint sampling. In this dataset, target objects can only be identified through observer-centric spatial relations, making viewpoint-conditioned grounding necessary. We construct the benchmark by leveraging camera poses to automatically annotate observer-centric relations (left/right, front/behind) together with viewpoint-independent relations (above/under). Using this benchmark, we evaluate several existing 3D large multimodal models in a zero-shot setting and find that current models struggle with viewpoint-dependent spatial instructions. We further study how explicit viewpoint information can be incorporated into 3D large multimodal models. We introduce a viewpoint representation that encodes camera poses and conditions the model on the observation viewpoint, improving segmentation accuracy on viewpoint-dependent relations and increasing mIoU from 0.30 to 0.47 compared to a model without viewpoint conditioning. The dataset, code, and trained models will be made publicly available upon acceptance.
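The idea of annotating observer-centric relations from camera poses can be sketched as follows. This is a hedged illustration, not the dataset's annotation code: it assumes a world-to-camera rotation matrix, a camera convention in which the camera x axis points to the observer's right, and a simple comparison of x coordinates; the function names and example scene are hypothetical.

```python
def world_to_camera(point, rotation, cam_pos):
    """Map a world point into camera coordinates: p_cam = R (p_world - c)."""
    d = [point[i] - cam_pos[i] for i in range(3)]
    return [sum(rotation[r][c] * d[c] for c in range(3)) for r in range(3)]


def horizontal_relation(a_world, b_world, rotation, cam_pos):
    """Return 'left' if object A appears left of object B from this viewpoint."""
    ax = world_to_camera(a_world, rotation, cam_pos)[0]
    bx = world_to_camera(b_world, rotation, cam_pos)[0]
    return "left" if ax < bx else "right"


# Two hypothetical object centres in world coordinates.
chair = [-1.0, 0.0, 5.0]
table = [1.0, 0.0, 5.0]

# Viewpoint 1: camera at the origin looking along +z (identity rotation).
identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
# Viewpoint 2: camera on the far side of the objects, turned 180 degrees about the y axis.
flipped = [[-1, 0, 0], [0, 1, 0], [0, 0, -1]]

print(horizontal_relation(chair, table, identity, [0.0, 0.0, 0.0]))
print(horizontal_relation(chair, table, flipped, [0.0, 0.0, 10.0]))
```

The same pair of objects gets opposite labels from the two viewpoints, which is the ambiguity the abstract describes when the observer viewpoint is not represented.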
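[53] DiLA: Disentangled Latent Action World Models cs.CV | cs.AI | cs.RO
Tianqiu Zhang, Muyang Lyu, Yufan Zhang, Fang Fang, Si Wu
TL;DR: DiLA is a disentangled latent action world model that uses content-structure disentanglement to resolve the trade-off between action abstraction and generation fidelity in latent action models. It uses the predictive bottleneck of latent action learning as the driving force for disentanglement, distilling spatial layouts into a structure pathway while offloading visual details to a separate content pathway for generation.
Details
Motivation: Latent action models that learn world models from unlabeled video face a fundamental trade-off between action abstraction and generation fidelity. Existing methods usually sidestep it with two-stage training or by restricting predictions to optical flow.
Result: DiLA achieves superior results in video generation quality, action transfer, visual planning, and manifold interpretability, unifying high-level action abstraction with high-fidelity generation in self-supervised world model learning.
Insight: The key idea is that disentanglement and latent action learning co-evolve, with the predictive bottleneck driving content-structure separation. Viewed objectively, this synergy suggests a new way to obtain a semantically structured, continuous latent action space and high-fidelity generation at the same time.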
Abstract: Latent Action Models (LAMs) enable the learning of world models from unlabeled video by inferring abstract actions between consecutive frames. However, LAMs face a fundamental trade-off between action abstraction and generation fidelity. Existing methods typically circumvent this issue by using two-stage training with pre-trained world models or by limiting predictions to optical flow. In this paper, we introduce DiLA, a novel Disentangled Latent Action world model that aims to resolve this trade-off via content-structure disentanglement. Our key insight is that disentanglement and latent action learning are co-evolving: the predictive bottleneck inherent in latent action learning serves as a driving force for disentanglement, compelling the model to distill spatial layouts into the structure pathway while offloading visual details to a separate content pathway for generation. This synergy yields a continuous, semantically structured latent action space without compromising generative quality. DiLA achieves superior results in video generation quality, action transfer, visual planning, and manifold interpretability. These findings establish DiLA as a unified framework that simultaneously achieves high-level action abstraction and high-fidelity generation, advancing the frontier of self-supervised world model learning.
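[54] BiomedAP: A Vision-Informed Dual-Anchor Framework with Gated Cross-Modal Fusion for Robust Medical Vision-Language Adaptation cs.CV | cs.AI
Huanyang Tong, Kai Liu, Fangjun Kuang, Huiling Chen
TL;DR: This paper proposes BiomedAP, a vision-informed dual-anchor framework that uses gated cross-modal fusion to make medical vision-language models more robust. The gated fusion mechanism enables layer-wise interaction between modalities to suppress noisy textual cues, and a dual-anchor constraint regularizes learnable prompts toward stable semantic centroids derived from expert templates and visual prototypes. Across 11 benchmarks, BiomedAP surpasses baselines in both few-shot accuracy and robustness under prompt perturbations.
Details
Motivation: Existing biomedical vision-language models face a critical bottleneck in few-shot medical diagnosis: fragility to prompt variations. Conventional adaptation frameworks typically optimize visual and textual prompts as independent streams and rely on idealized "golden prompts". In real settings, where clinical descriptions are noisy and heterogeneous, this modality isolation leads to unstable cross-modal alignment.
Result: Extensive experiments on 11 benchmarks show that BiomedAP consistently surpasses baselines, achieving competitive few-shot accuracy and markedly better robustness under prompt perturbations.
Insight: Contributions include: 1) a gated cross-modal fusion mechanism that enables layer-wise cross-modal interaction and dynamically regulates noisy textual cues; 2) a dual-anchor constraint that builds stable semantic centroids from expert templates (High Anchors) and few-shot visual prototypes (Low Anchors) to regularize learnable prompts. Viewed objectively, the framework addresses prompt sensitivity in medical settings through synergistic alignment and suggests a new approach to cross-modal adaptation in noisy environments.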
Abstract: Biomedical Vision–Language Models (VLMs) have shown remarkable promise in few-shot medical diagnosis but face a critical bottleneck: fragility to prompt variations. Existing adaptation frameworks typically optimize visual and textual prompts as independent streams, relying on ideal “Golden Prompts”. In clinical reality, where descriptions are often noisy and heterogeneous, this modality isolation leads to unstable cross-modal alignment. To address this, we propose BiomedAP, a vision-informed dual-anchor framework with gated cross-modal fusion. BiomedAP enforces synergistic alignment through two mechanisms: (1) Gated Cross-Modal Fusion, which enables layer-wise interaction between modalities, acting as a dynamic noise regulator to suppress irrelevant textual cues; and (2) a Dual-Anchor Constraint that regularizes learnable prompts toward stable semantic centroids derived from both expert templates (High Anchors) and few-shot visual prototypes (Low Anchors). Extensive experiments across 11 benchmarks demonstrate that BiomedAP consistently surpasses baselines, achieving competitive few-shot accuracy and markedly enhanced robustness under prompt perturbations. Our code is available at: https://github.com/tongdiedie/BiomedAP. Keywords: Vision-Language Models; Prompt Learning; Parameter-Efficient Fine-Tuning; Few-shot Learning
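[55] Semi-MedRef: Semi-Supervised Medical Referring Image Segmentation with Cross-Modal Alignment cs.CV | cs.LG
Yuchen Li, Zhen Zhao, Yi Liu, Luping Zhou
TL;DR: This paper proposes Semi-MedRef, a semi-supervised learning framework for medical referring image segmentation (MRIS) that targets the high cost of annotation. Built on a teacher-student architecture, it combines three components that preserve cross-modal alignment (T-PatchMix, PosAug, and ITCL) to keep images and text descriptions consistent under strong augmentation. Experiments on QaTa-COV19 and MosMedData+ show that it outperforms both fully supervised and semi-supervised baselines.
Details
Motivation: Medical referring image segmentation requires pixel-level annotations aligned with text descriptions, which makes annotation costly. Existing semi-supervised methods struggle to keep cross-modal alignment reliable under strong augmentation, and multi-modal perturbations such as CutMix remain underexplored.
Result: On QaTa-COV19 and MosMedData+, Semi-MedRef outperforms fully supervised and semi-supervised baselines at every label ratio, achieving state-of-the-art performance.
Insight: Contributions include T-PatchMix (a cross-modal CutMix-style augmentation that synchronizes patch mixing with referring expressions through position-constrained and probability-driven rules), PosAug (position-aware text augmentation), and ITCL (a position-guided image-text contrastive learning module). Together these designs preserve cross-modal alignment between medical images and positional language.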
Abstract: Medical referring image segmentation (MRIS) requires pixel-level masks aligned with textual descriptions of anatomical locations, making annotation costly in low-label regimes. Semi-supervised learning (SSL) can mitigate this burden by leveraging unlabeled data, but its success hinges on maintaining reliable image-text alignment under perturbations. Most existing SSL-based referred segmentation methods use either independent or simplistic multi-modal perturbations (e.g., left-right flips), without fully addressing cross-modal alignment under strong augmentation, while CutMix, highly effective in single-modal SSL, remains underexplored in multi-modal settings due to its tendency to disrupt image-text coherence. We propose Semi-MedRef, a teacher-student SSL framework designed to explicitly maintain consistency between medical images and positional language through three alignment-preserving components: T-PatchMix, a cross-modal CutMix-style augmentation that synchronizes patch mixing with referring expressions via position-constrained and probability-driven rules; PosAug, a position-aware text augmentation that masks or fuzzes anatomical phrases; and ITCL, a position-guided image-text contrastive learning module, which leverages positional pseudo-labels to construct soft anatomical positives and strengthen medically grounded cross-modal alignment. Experiments on QaTa-COV19 and MosMedData+ demonstrate that Semi-MedRef consistently outperforms both fully supervised and semi-supervised baselines across all label regimes.
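[56] Attribute-Grounded Selective Reasoning for Artwork Emotion Understanding with Multimodal Large Language Models cs.CV
Cheng Zhang, Yuer Liu, Zhiyu Zhou, Hongxia Xie, Wen-Huang Cheng
TL;DR: To address "attribute flooding" in multimodal large language models for artwork emotion understanding, this paper proposes an attribute-grounded selective reasoning framework. It extends the EmoArt dataset with human-annotated emotionally salient attributes and develops FAB-G, a multi-agent framework that first predicts attribute-level salience and then performs emotion analysis on the retained cues. Experiments show gains in emotion, arousal, and valence prediction, more compact explanations, and salience selection that transfers across datasets.
Details
Motivation: When explaining artwork emotion, multimodal large language models tend to enumerate every visible formal attribute without identifying the cues that actually support the affective judgment, a phenomenon termed "attribute flooding".
Result: On the extended EmoArt dataset, FAB-G yields consistent gains in emotion, arousal, and valence prediction, agrees more strongly with human-marked salient attributes under Dice and Tversky metrics, and produces more compact final explanations than prompting-based baselines. Cross-dataset evaluation indicates that its attribute-grounded salience selection transfers.
Insight: The paper formalizes artwork emotion understanding as attribute-grounded selective reasoning and introduces human-annotated, instance-level salience supervision to separate attributes that are merely present from those that are emotionally salient. It proposes FAB-G, a supervised multi-agent framework that guides reasoning through a formal-attribute bottleneck, predicting salience first and then constraining downstream analysis, for more focused and interpretable emotion understanding.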
Abstract: Multimodal large language models (MLLMs) can produce fluent artwork emotion explanations, but they often suffer from attribute flooding: they enumerate many visible formal attributes without identifying which cues actually support the affective judgment. We therefore formulate artwork emotion understanding as Attribute-Grounded Selective Reasoning (AGSR), where predefined formal attributes serve as evidence units and only emotionally operative attributes should enter the final interpretation. To make this problem measurable, we extend EmoArt, originally introduced at ACM MM 2025 as a 132,664-artwork resource with content, formal-attribute, valence-arousal, and emotion annotations, by adding a 1,400-artwork human salience extension annotated by 15 art-trained annotators. This extension provides instance-level supervision for distinguishing attributes that are merely present from those that are emotionally salient. We further propose FAB-G (Formal-Attribute Bottleneck-Guided reasoning), a supervised multi-agent framework that first predicts attribute-level salience and then constrains downstream emotional analysis to the retained cues. Experiments show that FAB-G yields consistent gains in emotion, arousal, and valence prediction, achieves stronger agreement with human-marked salient attributes under Dice and Tversky metrics, and produces substantially more compact final explanations than prompting-based baselines. Cross-dataset evaluation further suggests that attribute-grounded salience selection transfers beyond the source distribution of EmoArt, while also revealing attribute-specific boundary cases. The dataset and project page are available at https://zhiliangzhang.github.io/EmoArt-130k/
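The Dice and Tversky metrics mentioned above have standard set-based definitions, sketched below for comparing predicted salient attributes with human-marked ones. The attribute names and the Tversky weights (alpha for false positives, beta for false negatives) are illustrative assumptions; the digest does not state which weights the paper uses.

```python
def dice(pred, gold):
    """Dice coefficient: 2|A ∩ B| / (|A| + |B|)."""
    if not pred and not gold:
        return 1.0
    return 2 * len(pred & gold) / (len(pred) + len(gold))


def tversky(pred, gold, alpha=0.5, beta=0.5):
    """Tversky index: TP / (TP + alpha * FP + beta * FN).

    alpha = beta = 0.5 reduces to Dice; a larger alpha penalises
    over-selection (attribute flooding) more heavily.
    """
    tp = len(pred & gold)
    fp = len(pred - gold)
    fn = len(gold - pred)
    denom = tp + alpha * fp + beta * fn
    return 1.0 if denom == 0 else tp / denom


# Hypothetical attribute sets for one artwork.
predicted = {"color", "brushwork", "composition", "line"}
human_salient = {"color", "composition", "light"}

print(dice(predicted, human_salient))
print(tversky(predicted, human_salient, alpha=0.7, beta=0.3))
```

With two false positives and one false negative, weighting false positives more heavily lowers the score below Dice, which is why an asymmetric Tversky setting is a natural fit for measuring over-enumeration.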
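[57] Reversing the Flow: Generation-to-Understanding Synergy in Large Multimodal Models cs.CV
Yujun Tong, Dongliang Chang, Zijin Yin, Xintong Liu, Yuanchen Fang
TL;DR: This paper proposes a Generation-to-Understanding (G2U) synergy framework that reverses the current one-directional pattern in multimodal models, where understanding guides generation, by using visual generation as an explicit intermediate reasoning step to strengthen multimodal understanding.
Details
Motivation: In current unified multimodal models, understanding typically guides generation in one direction, while how and why generation can support understanding is rarely studied. The paper aims to explore and realize this reverse enhancement of understanding by generation.
Result: A comprehensive evaluation on twelve benchmarks shows that this reversed information flow consistently improves multimodal understanding. It also shows that generative fidelity bounds perceptual gain and that distinct families of edit prompts govern transfer efficiency.
Insight: The novelty is to treat visual generation (such as detail enhancement and context expansion) as self-generated visual thoughts that are fed back to refine perception, without retraining or external tools. Viewed objectively, the work exposes the lack of true self-reflection in current large models and suggests that imagination is the beginning of understanding, not its end.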
Abstract: The long-standing goal of multimodal AI is to build unified models in which visual understanding and visual generation mutually enhance one another. Although recent works such as BAGEL and BLIP3o have achieved remarkable progress, in practice this unification remains one-directional: understanding routinely guides generation, yet how and why generation can support understanding is rarely investigated. We revisit this asymmetry and propose Generation-to-Understanding (G2U) synergy, where visual generation becomes an explicit intermediate reasoning step. Our framework enables a model to perform controlled generative acts, such as detail enhancement, context expansion or structural visualisation, to produce self-generated visual thoughts, which are then fed back into the model to refine perception without retraining or external tools. Through a comprehensive evaluation on twelve benchmarks, this reversed information flow consistently improves multimodal understanding. We show that generative fidelity bounds perceptual gain and that distinct families of edit prompts govern transfer efficiency. We further analyse whether models can decide what to imagine. While they can produce plausible edits, these self-generated visual thoughts lack stable task alignment, revealing that current large multimodal models fall short of true self-reflection. This work exposes a missing mechanism in unified cognition and suggests that imagination is not the end of understanding but its beginning.
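[58] GRASP: Learning to Ground Social Reasoning in Multi-Person Non-Verbal Interactions cs.CV | cs.AI
Junho Kim, Xu Cao, Houze Yang, Bikram Boote, Ana Jojic
TL;DR: This paper introduces GRASP, a large-scale social reasoning dataset that connects high-level social question answering with fine-grained gaze and deictic gesture events. The dataset contains over 46K videos, 749 hours of footage, and 290K question-answer pairs across 16 categories covering gaze, gesture, and their joint reasoning, along with GRASP-Bench for evaluation. The authors also propose Social Grounding Reward (SGR), a learning signal that uses social events to encourage models to reason about the participants in an interaction. Experiments show that SGR improves GRASP-Bench performance while maintaining zero-shot performance on related social video QA benchmarks.
Details
Motivation: Current multimodal large language models (MLLMs) struggle to identify who interacts with whom in multi-person videos and cannot interpret social interactions that hinge on subtle non-verbal cues. A dataset linking high-level social QA to fine-grained non-verbal events such as gaze and gestures is therefore needed to advance social reasoning in models.
Result: On GRASP-Bench, models trained with Social Grounding Reward (SGR) perform better while maintaining zero-shot performance on related social video QA benchmarks (such as Social-IQ and Causal-VidQA), indicating that the method is effective and generalizes.
Insight: The contribution is GRASP, the first large-scale social reasoning dataset with fine-grained annotation, which builds questions from identity-consistent gaze trajectories, deictic gestures, and their compositions, bridging high-level social QA and low-level non-verbal events. The proposed SGR reward uses social events as a learning signal to direct the model toward the participants in an interaction, a novel training strategy.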
Abstract: Understanding social interactions requires reasoning over subtle non-verbal cues, yet current multimodal large language models (MLLMs) often fail to identify who interacts with whom in multi-person videos. We introduce GRASP, a large-scale social reasoning dataset that connects high-level social QA with fine-grained gaze and deictic gesture events. GRASP contains 290K question–answer pairs over 46K videos totaling 749 hours, organized by a 16-category taxonomy spanning gaze, gesture, and joint gaze–gesture reasoning, together with GRASP-Bench for evaluation. Unlike prior resources that focus on either isolated cues or high-level social QA, GRASP builds questions from identity-consistent gaze trajectories, deictic gestures, and their joint compositions into social events. Moreover, we propose Social Grounding Reward (SGR), a learning signal that uses these social events to encourage models to reason about the participants involved in each interaction. Experiments show that SGR improves performance on GRASP-Bench while maintaining zero-shot performance on related social video QA benchmarks.
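[59] UAM: A Dual-Stream Perspective on Forgetting in VLA Training cs.CV | cs.AI
Jianke Zhang, Yuanfei Luo, Yucheng Hu, Xiaoyu Chen, Yanjiang Guo
TL;DR: This paper proposes the Unified Action Model (UAM) to address the degradation of a pretrained vision-language model's (VLM's) multimodal competence when a vision-language-action (VLA) model is fine-tuned on action data, a side effect called the "embodiment tax". UAM uses a dual-stream architecture with a Dorsal Expert modeled on the dorsal pathway of biological vision, trained with a mid-level reasoning objective that predicts visual dynamics, which preserves the VLM's semantic capability while improving generalization on action tasks.
Details
Motivation: VLA models are currently built by fine-tuning a pretrained VLM, but this systematically erodes the VLM's original multimodal competence (the "embodiment tax"). The paper asks whether VLAs must forget semantic capability and, inspired by the two-stream organization of biological vision (a ventral recognition pathway separate from a dorsal visuomotor control pathway), seeks to remove this bottleneck through architectural design.
Result: On a range of manipulation tasks probing out-of-distribution generalization (unseen objects, novel object-target compositions, and instruction variation), UAM achieves the highest average success rate among baselines. It also retains over 95% of the underlying VLM's multimodal capability without parameter freezing, gradient stopping, or auxiliary vision-language co-training.
Insight: The core innovation is an architectural dual-stream design (the Dorsal Expert) that separates semantic understanding from visual control instead of relying on frozen weights or data replay to enforce semantic retention. This suggests that semantic preservation can emerge naturally from structural separation and that a VLM's semantic capability can transfer effectively to semantic generalization in action tasks.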
Abstract: Vision–language–action (VLA) models are typically built by fine-tuning a pretrained vision–language model (VLM) on action data. However, we show that this standard recipe systematically erodes the VLM’s multimodal competence, a side effect we call the embodiment tax. But do VLAs have to forget? Inspired by the two-stream organization of biological vision, we trace this degradation to a structural bottleneck: current VLAs ask a single encoder to support both language-grounded semantics and control-relevant visual features, whereas biological vision separates recognition and visuomotor control into distinct pathways. Building on this view, we propose the Unified Action Model (UAM), which adds a parallel Dorsal Expert, an analog of the brain’s dorsal pathway. To make the Dorsal Expert an effective second pathway and reduce the control-learning burden on the VLM, we initialize it from a pretrained generative model and train it with a mid-level reasoning objective that predicts visual dynamics. This design allows us to train the whole VLA end-to-end on action data alone: with no parameter freezing, no gradient stopping, and no auxiliary VL co-training, UAM retains over 95% of the underlying VLM’s multimodal capability and at the same time achieves the highest average success rate among baselines on a variety of manipulation tasks that probe out-of-distribution generalization, including unseen objects, novel object–target compositions, and instruction variation. Together, these results suggest that semantic preservation in VLAs can emerge from architectural separation itself, rather than being enforced by frozen weights or auxiliary data replay, and that this preserved semantic capability can naturally transfer from VLMs to semantic generalization in actions.
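[60] Cross-Modal Registration Between 3D and 2D Fingerprints via Pose-Aware Unwrapping and Point-Cloud Fusion cs.CV
Xiongjun Guan, Jianjiang Feng, Jie Zhou
TL;DR: This paper presents a unified framework for 3D fingerprint preprocessing and cross-modal registration with contactless and contact-based 2D fingerprints. It has four core components: a nonparametric visualization and unwrapping method, a point-cloud fusion pipeline, an ellipse-based pose normalization method, and a pose-aware cross-modal registration strategy. Experiments show ridge-level 3D registration accuracy and improved compatibility between 3D fingerprints and existing 2D fingerprint systems.
Details
Motivation: 3D fingerprints preserve full finger geometry and local ridge structure and avoid the deformation introduced by contact-based capture, but they are hard to integrate with mainstream 2D fingerprint systems. The paper targets the intermediate stage between 3D acquisition and cross-modal matching, namely preprocessing 3D fingerprints and registering them to 2D fingerprints.
Result: Experiments on a self-collected multimodal fingerprint database of 150 fingers show that the 3D fusion error is concentrated around 0.09 mm, contactless 2D-3D registration reaches ridge-scale projection accuracy, and pose-aware unwrapping improves genuine matching scores over generic 3D unwrapping.
Insight: The novelty lies in a nonparametric 3D fingerprint unwrapping method that does not depend on a global finger-shape model, and a complete preprocessing framework that combines pose normalization with point-cloud fusion. The pose-aware cross-modal registration strategy is key to improving compatibility between 3D fingerprints and heterogeneous 2D fingerprint systems, supporting 3D fingerprints as a geometric bridge across modalities.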
Abstract: Three-dimensional (3D) fingerprints preserve global finger geometry and local ridge structure while avoiding contact-induced deformation, but they remain difficult to integrate with legacy two-dimensional (2D) fingerprint systems. This paper addresses the intermediate stage between 3D acquisition and cross-modal matching, and presents a unified framework for 3D fingerprint preprocessing and registration across contactless and contact-based 2D modalities. The framework combines four components: 1) a nonparametric visualization and unwrapping method that converts a 3D fingerprint point cloud into a rolled-equivalent 2D representation without relying on a global finger-shape model; 2) a point-cloud fusion pipeline that registers and mosaics multiple partial 3D captures into a more complete fingerprint model; 3) an ellipse-based pose normalization method for canonical finger alignment; and 4) a pose-aware cross-modal registration strategy that improves compatibility between 3D fingerprints and both contactless and contact-based 2D fingerprints. Experiments on a self-collected multimodal fingerprint database containing 150 fingers show that the proposed framework achieves ridge-level 3D registration accuracy, robust pose estimation, and consistent gains in 2D compatibility. In particular, the 3D fusion error is concentrated around 0.09 mm, contactless 2D–3D registration reaches ridge-scale projection accuracy, and pose-aware unwrapping improves genuine matching scores relative to generic 3D unwrapping. These results support the use of 3D fingerprints as an effective geometric bridge across heterogeneous fingerprint modalities.
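[61] Embedding-perturbed Exploration Preference Optimization for Flow Models cs.CV | cs.LG
Sujie Hu, Chubin Chen, Jiashu Zhu, Jiahong Wu, Xiangxiang Chu
TL;DR: This paper proposes Embedding-perturbed Exploration Preference Optimization (E²PO), a framework that addresses the rapid decay of intra-group variance in group-based optimization methods such as GRPO, which causes the learning signal to vanish. By introducing structured perturbations at the embedding level, it maintains the discriminative signal during training, stabilizing optimization and aligning more precisely with human preferences.
Details
Motivation: Group-based reinforcement learning alignment frameworks such as GRPO have a fundamental flaw: the diversity among samples within a group decays quickly, so the variance approaches zero and the learning signal vanishes, leading to unstable training, premature policy stagnation, or reward hacking. Existing remedies, such as varying the initial noise or increasing the group size, do not solve the problem.
Result: Extensive experiments show that E²PO significantly outperforms existing state-of-the-art baselines and achieves more faithful alignment with human preferences.
Insight: The core innovation is a structured perturbation mechanism at the embedding level, instead of adjustments at the input or noise level, which keeps the variance robust throughout training and preserves the key discriminative signal. This offers a novel and effective technical path for the vanishing-variance problem in group-based optimization.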
Abstract: Recent advancements have established Reinforcement Learning (RL) as a pivotal paradigm for aligning generative models with human intent. However, group-based optimization frameworks (e.g., GRPO) face a critical limitation: the rapid decay of intra-group variance. As the distinctiveness among samples within a group diminishes, the variance approaches zero. This eliminates the very learning signal required for optimization, rendering the process unstable and forcing the policy into premature stagnation or reward hacking. Existing strategies, such as varying the initial noise or increasing group sizes, often fail to address this fundamental issue, resulting in training instability or diminishing returns. To overcome these challenges, we propose Embedding-perturbed Exploration Preference Optimization (E²PO), a novel framework that sustains optimization through embedding-level perturbation. Our method introduces structured, embedding-level perturbations within sample groups, guaranteeing a robust variance that preserves the discriminative signal throughout the training process. Extensive experiments demonstrate that our approach significantly outperforms state-of-the-art baselines, achieving a more faithful alignment with human preference.
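The vanishing-signal problem described above can be seen in a few lines using the group-normalized advantage commonly used in GRPO-style training, A_i = (r_i - mean(r)) / (std(r) + eps). The sketch below shows only that baseline computation with made-up reward values; it does not implement the paper's embedding-level perturbation, whose details are not given in this digest.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Group-normalised advantages: (r_i - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four samples drawn from the same prompt.
diverse_group = [0.2, 0.5, 0.8, 0.9]     # samples still differ
collapsed_group = [0.7, 0.7, 0.7, 0.7]   # samples have become indistinguishable

print(group_advantages(diverse_group))    # non-zero values, usable gradient signal
print(group_advantages(collapsed_group))  # all zeros, nothing left to learn from
```

Once every sample in a group earns the same reward, every advantage is zero and the policy gradient for that group disappears, which is the failure mode the proposed perturbation is meant to prevent.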
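[62] FashionChameleon: Towards Real-Time and Interactive Human-Garment Video Customization cs.CV
Quanjian Song, Yefeng Shen, Mengting Chen, Hao Sun, Jinsong Lan
TL;DR: This paper presents FashionChameleon, a real-time, interactive framework for human-garment video customization that lets users switch garments interactively during autoregressive video generation. Trained only on single-garment video data, it relies on three techniques (teacher-model in-context learning, streaming distillation, and training-free KV cache rescheduling) to preserve motion coherence, keep long-video extrapolation consistent, and reach real-time generation at up to 23.8 FPS.
Details
Motivation: Existing methods cannot support low-latency, interactive garment control, which is crucial for applications such as e-commerce and content creation. The paper studies how to achieve interactive multi-garment video customization using only single-garment video data while preserving motion coherence.
Result: FashionChameleon generates at 23.8 FPS in real time on a single GPU, 30-180x faster than existing baselines, and supports interactive customization and consistent long-video extrapolation.
Insight: Contributions include: 1) in-context learning on a single reference-garment pair with an enforced mismatch between the reference image and the garment image, which encourages the model to preserve coherence implicitly during single-garment switching; 2) streaming distillation with gradient-reweighted distribution matching distillation, which improves extrapolation consistency; 3) training-free KV cache rescheduling (garment KV refresh, historical KV withdrawal, and reference KV disentanglement), which enables multi-garment switching while preserving motion coherence. These techniques remove the need for multi-garment video data and enable efficient real-time interaction.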
Abstract: Human-centric video customization, particularly at the garment level, has shown significant commercial value. However, existing approaches cannot support low-latency and interactive garment control, which is crucial for applications such as e-commerce and content creation. This paper studies how to achieve interactive multi-garment video customization while preserving motion coherence using only single-garment video data. We present FashionChameleon, a real-time and interactive framework for human-garment customization in autoregressive video generation, where users can interactively switch garments during generation. FashionChameleon consists of three key techniques: (i) Instead of training on multi-garment video data, we train a Teacher Model with In-Context Learning on a single reference-garment pair. By retaining the image-to-video training paradigm while enforcing a mismatch between the reference and garment image, the model is encouraged to implicitly preserve coherence during single-garment switching. (ii) To achieve consistency and efficiency during generation, we introduce Streaming Distillation with In-Context Learning, which fine-tunes the model with in-context teacher forcing and improves extrapolation consistency via gradient-reweighted distribution matching distillation. (iii) To extend the model for interactive multi-garment video customization, we propose Training-Free KV Cache Rescheduling, which includes garment KV refresh, historical KV withdrawal, and reference KV disentanglement to achieve garment switching while preserving motion coherence. Our FashionChameleon uniquely supports interactive customization and consistent long-video extrapolation, while achieving real-time generation at 23.8 FPS on a single GPU, 30-180x faster than existing baselines.
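[63] Not All Tasks Quantize Equally: Fisher-Guided Quantization for Visual Geometry Transformer cs.CV
Yipu Zhang, Jintao Cheng, Weilun Feng, Jiehao Luo, Chuanguang Yang
TL;DR: Different tasks in feed-forward 3D reconstruction models such as VGGT differ markedly in their sensitivity to quantization error. This paper proposes Fisher-Guided Quantization (FGQ), which uses the Fisher information matrix to quantify sensitivity across tasks, Transformer blocks, and hidden channels and folds it into the calibration of a learnable affine transformation, significantly improving performance under 4-bit quantization.
Details
Motivation: Feed-forward 3D reconstruction models such as VGGT predict several geometry tasks (depth estimation, camera pose estimation, and point cloud reconstruction) through a shared backbone, but their billion-scale parameters impose heavy memory and compute overhead that hinders on-device deployment. Existing post-training quantization methods focus on handling heavy-tailed activation distributions and building diverse calibration datasets, and overlook how sensitivity to quantization error varies across tasks, blocks, and channels, which causes significant accuracy loss on sensitive tasks.
Result: Extensive experiments on camera pose estimation, point map reconstruction, and depth estimation show that FGQ consistently outperforms state-of-the-art quantization baselines on VGGT, with up to 39% relative improvement under 4-bit quantization.
Insight: The paper is the first to reveal that tasks in feed-forward 3D models are heterogeneous in their sensitivity to quantization error, and proposes using the Fisher information matrix to quantify these differences and integrate them into calibration. Viewed objectively, the method offers a task-aware, fine-grained approach to quantizing multi-task models that could extend to other multi-task models with a shared backbone.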
Abstract: Feed-forward 3D reconstruction models, represented by Visual Geometry Grounded Transformer (VGGT), jointly predict multiple visual geometry tasks such as depth estimation, camera pose prediction, and point cloud reconstruction in a single forward pass. They have been widely adopted in 3D vision applications, but their billion-scale parameters bring substantial memory and computation overhead, posing challenges for on-device deployment. Post-Training Quantization (PTQ) is an effective technique to reduce this overhead. Existing PTQ methods for feed-forward 3D models mainly focus on handling heavy-tailed activation distributions and constructing diverse calibration datasets. However, we observe that feed-forward 3D models predict multiple geometric attributes through a shared backbone, where different transformer blocks and hidden channels contribute distinctly to each task, resulting in substantially different sensitivities to quantization errors across tasks, blocks, and channels. Consequently, treating all tasks equally over-emphasizes insensitive tasks and causes significant accuracy loss on the sensitive ones. To address this issue, we propose Fisher-Guided Quantization (FGQ) for feed-forward 3D reconstruction models. Specifically, FGQ uses the diagonal Fisher information matrix to quantify the different sensitivities across tasks, blocks, and channels, and incorporates these sensitivities into the Learnable Affine Transformation during calibration to better preserve the channels and blocks most critical to each task. Extensive experiments across camera pose estimation, point map reconstruction, and depth estimation show that FGQ consistently outperforms state-of-the-art quantization baselines on VGGT, achieving up to 39% relative improvement under 4-bit quantization.
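A diagonal Fisher estimate is often approximated as the mean of squared per-sample gradients (the empirical Fisher). The sketch below applies that approximation to a toy two-weight linear model with a squared-error loss, using two hypothetical "tasks" that excite different channels. The model, data, and task split are assumptions for illustration only and say nothing about how FGQ computes or uses its sensitivities inside VGGT.

```python
def diag_fisher(weights, samples):
    """Empirical diagonal Fisher: mean over samples of the squared gradient.

    Toy loss per sample: 0.5 * (w . x - y)^2, whose gradient is (w . x - y) * x.
    """
    fisher = [0.0] * len(weights)
    for x, y in samples:
        residual = sum(w * xi for w, xi in zip(weights, x)) - y
        for i, xi in enumerate(x):
            grad_i = residual * xi
            fisher[i] += grad_i * grad_i
    return [f / len(samples) for f in fisher]


weights = [1.0, 1.0]

# Hypothetical calibration samples (input, target) for two tasks.
task_a = [([1.0, 0.0], 0.0), ([2.0, 0.0], 0.0)]  # only channel 0 is active
task_b = [([0.0, 1.0], 0.0)]                     # only channel 1 is active

print(diag_fisher(weights, task_a))  # channel 0 is the sensitive one for task A
print(diag_fisher(weights, task_b))  # channel 1 is the sensitive one for task B
```

Per-task estimates like these give a channel-by-channel sensitivity profile, and a task-aware calibration could weight quantization error by them so that channels critical to a sensitive task are preserved first.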
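[64] WorldAct: Activating Monolithic 3D Worlds into Interactive-Ready Object-Centric Scenes cs.CV
Jichen Hu, Jiawei Guo, Jiazhong Cen, Chen Yang, Sikuang Li
TL;DR: WorldAct is a framework that converts static generated 3D world models (such as those produced by Marble) into editable, interaction-ready, object-centric scenes. A multimodal agent guides scene decomposition and identifies actionable objects; the framework then reconstructs geometrically aligned object-level meshes for interaction and restores the background through 3D inpainting, supporting object-level editing, collision-aware manipulation, and embodied task execution.
Details
Motivation: 3D environments produced by existing generative scene synthesis systems such as Marble are typically static and monolithic, with limited editability and physical interaction. The paper addresses this to meet the need, in immersive content creation and embodied simulation, for generated worlds that can be actively modified and manipulated.
Result: Experiments show that WorldAct supports richer interaction scenarios than the original generated scenes, offering a practical path toward editable and interactive 3D world models.
Insight: The novelty is a complete pipeline that "activates" static monolithic worlds into interaction-ready scenes, combining multimodal-agent-guided scene decomposition and object identification, geometrically aligned object mesh reconstruction, and 3D background inpainting to provide object-level editability and physical interaction while preserving global scene coherence.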
Abstract: Recent 3D world modeling systems based on generative scene synthesis, such as Marble, can create coherent and explorable 3D environments, yet their outputs are typically static monolithic assets with limited editability and physical interaction. This restricts their use in immersive content creation and embodied simulation, where generated worlds must be actively modified and manipulated. To tackle this challenge, we present WorldAct, a framework that converts static generated 3D worlds into editable and interaction-ready scenes. WorldAct uses a multimodal agent to guide scene decomposition, identify actionable objects, reconstruct geometrically aligned object-level meshes for interaction, and restore the residual background via 3D inpainting. The resulting scenes support object-level editing, collision-aware manipulation, and embodied task execution while preserving global scene coherence. Experiments show that WorldAct enables richer interaction scenarios than the original generated scenes, suggesting a practical path toward editable and interactive 3D world models.
[65] GHOST: Geometry-Hierarchical Online Streaming Token Eviction for Efficient 3D Reconstruction cs.CV
Leyang Chen, Junyi Wu, Zhiteng Li, Yulun Zhang
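TL;DR: This paper proposes GHOST, a framework that addresses the memory bottleneck caused by linear KV-cache growth in streaming 3D reconstruction from long monocular video sequences. It uses the model's own 3D geometry outputs to evict redundant tokens online. With hierarchical importance scoring, a privilege mechanism that protects special tokens, and cosine-similarity-guided layer-wise budget allocation, it substantially reduces cache size and speeds up inference while preserving reconstruction quality.
Details
Motivation: Existing methods either truncate the cache, which degrades reconstruction quality, or rely on attention-score heuristics that are agnostic to 3D scene structure and fail to retain geometrically valuable tokens. The paper aims to build a training-free KV cache management framework that uses 3D geometric information to evict redundant tokens online and relieve the memory bottleneck.
Result: Experiments on multiple benchmarks show that GHOST preserves excellent reconstruction quality while cutting the KV cache by nearly half and delivering 1.75x faster inference than state-of-the-art methods.
Insight: The contributions are: 1) a hierarchical dual-level importance scoring scheme; 2) a privilege mechanism that protects special tokens from eviction; 3) a cosine-similarity-guided layer-wise budget allocation strategy. Together these let the framework use 3D geometric information for more informed token eviction, giving a better balance between memory efficiency and reconstruction quality.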
Abstract: Streaming 3D reconstruction from long monocular video sequences requires maintaining a key-value (KV) cache that grows linearly with sequence length, creating a severe memory bottleneck. Existing approaches either truncate the cache to a fixed set of anchor frames, leading to reconstruction quality degradation, or rely on attention-score heuristics that are agnostic to 3D scene structure, failing to preserve geometrically valuable tokens. To address these problems, we present GHOST (Geometry-Hierarchical Online Streaming Token Eviction), a training-free KV cache management framework that exploits the model’s own 3D geometry outputs to evict redundant tokens online. GHOST introduces three mutually reinforcing innovations: a hierarchical dual-level importance scoring scheme, a privilege mechanism that protects special tokens from eviction, and a cosine-similarity-guided layer-wise budget allocation. Experiments on various benchmarks show that GHOST preserves excellent reconstruction quality while cutting the KV cache by nearly half and delivering 1.75x faster inference compared to state-of-the-art methods. Our code is available at https://github.com/lokiniuniu/GHOST.
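The interaction between importance-based eviction and the privilege mechanism can be sketched as follows. This is a minimal illustration, not GHOST's implementation: the importance scores are taken as given (the paper derives them from 3D geometry outputs), the per-layer budget is a fixed number rather than a cosine-similarity-guided allocation, and `evict_tokens` is a hypothetical name.

```python
def evict_tokens(scores, protected, budget):
    """Return the sorted indices of tokens kept in the cache.

    scores:    importance score per cached token (higher means more valuable)
    protected: indices of special tokens that must never be evicted
    budget:    maximum number of tokens to keep (protected tokens always stay)
    """
    keep = set(protected)
    # Rank the unprotected tokens from most to least important.
    candidates = sorted(
        (i for i in range(len(scores)) if i not in keep),
        key=lambda i: scores[i],
        reverse=True,
    )
    # Fill whatever budget remains after reserving room for protected tokens.
    free_slots = max(budget - len(keep), 0)
    keep.update(candidates[:free_slots])
    return sorted(keep)


# Token 0 is a protected special token with a low score; it survives anyway.
kept = evict_tokens([0.1, 0.9, 0.5, 0.3], protected={0}, budget=2)
```

With a budget of two, the protected token takes one slot and only the highest-scoring remaining token is retained.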
[66] Are VLMs Seeing or Just Saying? Uncovering the Illusion of Visual Re-examination cs.CV | cs.CL
Chufan Shi, Cheng Yang, Yaokang Wu, Linhao Jin, Bo Shui
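TL;DR: Using VisualSwap, an image-swap probing framework, this paper studies whether vision-language models (VLMs) actually re-examine the image when they emit self-reflective statements such as “let me check the figure again” during reasoning, or merely imitate a textual pattern. Models generally fail to notice when the image is swapped and their accuracy drops sharply, indicating that these statements are mostly talk rather than genuine visual re-examination.
Details
Motivation: The goal is to determine whether the self-reflective statements VLMs produce during reasoning (such as “let me check the figure again”) actually trigger a fresh look at the visual content, or are only textual patterns learned from training data, and thereby to expose potential deficiencies in the models' visual understanding.
Result: On VS-Bench (800 image pairs drawn from MathVista and other datasets), tests on Qwen3-VL, Kimi-VL, and ERNIE-VL show that models overwhelmingly miss the image swap, with accuracy dropping by up to 60%. Thinking models are nearly 3x more vulnerable than instruct models, and scaling up model size offers no mitigation. Multi-turn user instructions restore visual grounding, but the models' self-generated reflective statements do not.
Insight: The novelty lies in the VisualSwap probing framework and the VS-Bench dataset, which quantitatively expose the “illusion” of visual re-examination in VLMs. Attention analysis shows that user instructions substantially raise attention to visual tokens whereas self-reflective statements do not, offering a new view into the models' internal mechanisms.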
Abstract: Vision-Language Models (VLMs) often produce self-reflective statements like “let me check the figure again” during reasoning. Do such statements trigger genuine visual re-examination, or are they merely learned textual patterns? We investigate this via VisualSwap, an image-swap probing framework: after a model reasons over an image, we replace it with a visually similar but semantically different one and test whether the model notices. We introduce VS-Bench, 800 image pairs curated from MathVista, MathVerse, MathVision, and MMMU-Pro. Experiments on Qwen3-VL, Kimi-VL, and ERNIE-VL reveal a striking failure: models overwhelmingly miss the swap, with accuracy dropping by up to 60%. Counterintuitively, thinking models are nearly 3x more vulnerable than their instructed counterparts, and scaling offers no mitigation. Multi-turn user instructions restore visual grounding, but self-generated reflective statements during continuous generation do not. Attention analysis explains why: user instructions substantially elevate attention to visual tokens, whereas self-reflection does not. Current VLMs tend to say rather than actually see when claiming to perform visual re-examination. Our code and dataset are available at the project page: https://visualswap.github.io
[67] SOLAR: Self-supervised Joint Learning for Symmetric Multimodal Retrieval cs.CV
Wenjie Yang, Hang Yu, Yuyu Guo, Peng Di
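TL;DR: This paper proposes SOLAR, a novel two-stage self-supervised framework for symmetric multimodal-to-multimodal (MM2MM) retrieval. It uses unlabeled web-scale image-text pairs: the first stage learns an intersection mask for each image-text pair to align shared semantics while preserving differences, and the second stage uses that mask to construct positive and negative samples for embedding learning. The paper also introduces a new benchmark with high-quality, human-verified positive and negative pairs for evaluating symmetric MM2MM retrieval.
Details
Motivation: Existing universal multimodal retrieval methods fall short on symmetric multimodal-to-multimodal retrieval because they are constrained by labeled asymmetric datasets, whereas symmetric retrieval requires queries and contexts to be interchangeable.
Result: On the proposed benchmark, in extensive experiments SOLAR outperforms ten SOTA methods and surpasses the strongest supervised vision-language model by 7.08 points, with over 50x fewer model parameters and a 5x smaller embedding dimension.
Insight: The novelty is learning symmetric retrieval from unlabeled data via self-supervision, using an intersection mask to separate and exploit both the alignment and the differences between image and text semantics. Viewed objectively, the two-stage masking strategy and the way hard-negative samples are constructed are useful references for multimodal representation learning.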
Abstract: In this work, we address the critical yet underexplored challenge of symmetric multimodal-to-multimodal (MM2MM) retrieval, where queries and contexts are interchangeable. Existing universal multimodal retrieval works struggle with this task, as they are constrained by the labeled asymmetric datasets used. We propose SOLAR (Self-supervised jOint LeArning for symmetric multimodal Retrieval), a novel two-stage self-supervised framework that leverages readily available unlabeled web-scale image-text pairs. Based on the observation that both semantic alignment and discrepancies exist between the two modalities, in the first stage, we learn the intersection mask of an image-text pair, allowing us to align the intersection while preserving the semantics of the differences. In the second stage, the learned mask is further utilized to construct positive and hard-negative samples via masking different parts of the image/text, which enables us to conduct self-supervised multimodal embedding learning. Complementing this framework, we present a new benchmark featuring high-quality human-verified positive and hard-negative pairs to evaluate symmetric MM2MM retrieval under realistic conditions, as well as the corresponding pipeline. Extensive experiments against ten SOTA methods show SOLAR surpasses the strongest supervised VLM by 7.08 points on this benchmark, with over 50x fewer model parameters and a 5x smaller embedding dimension. Code and benchmark will be available soon.
[68] On RGB-TIR Stereo Calibration under Extreme Resolution Asymmetry cs.CV
Michał Król, Michał Salamonowicz, Władysław Skarbek, Michał Tomaszewski
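TL;DR: This paper presents a practical calibration framework for RGB-thermal infrared (TIR) stereo camera systems, targeting geometric calibration when an RGB camera (2028 x 1520 px) is paired with a very low-resolution TIR camera (80 x 62 px, a pixel-count ratio of about 1:625). The method uses an active OLED screen that dynamically switches modality-specific patterns (checkerboard for TIR, ChArUco for RGB), a dedicated corner detection algorithm, and baseline-constrained bundle adjustment to obtain physically consistent calibration results.
Details
Motivation: Accurate geometric calibration of RGB-TIR stereo camera systems is essential for multimodal building envelope analysis, but it remains challenging when low-cost thermal sensors with very low spatial resolution are used.
Result: Calibration yields a stereo baseline of 32.7 mm (nominal 30 mm) with an overall reprojection error of 0.382 px. The system is validated on a thermally active building mock-up using constant-depth and per-pixel depth estimation, demonstrating consistent TIR-to-RGB projection suitable for building energy performance assessment.
Insight: The contributions are: 1) an active OLED screen that dynamically switches modality-specific patterns, providing controlled and repeatable thermal contrast on a single physical surface; 2) a dedicated corner detection algorithm combining perspective rectification, Hessian saddle-point analysis, and Mean Shift localisation, which achieves reliable checkerboard detection at very low resolution (80 x 62 px) without per-frame parameter tuning; 3) a baseline-constrained bundle adjustment that enforces physically consistent rig geometry under the planar-calibration-object degeneracy.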
Abstract: Accurate geometric calibration of RGB-thermal infrared (TIR) stereo camera systems is essential for multimodal building envelope analysis, yet remains challenging when low-cost thermal sensors with very low spatial resolution are employed. This paper presents a practical stereo calibration framework for an RGB camera (2028 x 1520 px) paired with a TIR camera operating at only 80 x 62 px - a pixel-count ratio of approximately 1:625. An active OLED screen dynamically switches modality-specific patterns (checkerboard for TIR, ChArUco for RGB) on a single physical surface, providing controlled and repeatable thermal contrast. A dedicated corner detection algorithm combining perspective rectification, Hessian saddle-point analysis, and Mean Shift localisation achieves reliable checkerboard detection at 80 x 62 px without per-frame parameter tuning. A baseline-constrained bundle adjustment enforces physically consistent rig geometry under the planar-calibration-object degeneracy, yielding a stereo baseline of 32.7 mm (nominal 30 mm) with an overall reprojection error of 0.382 px. The system is validated on a thermally active building mock-up using constant-depth and per-pixel depth estimation, demonstrating consistent TIR-to-RGB projection suitable for building energy performance assessment.
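The figures quoted in the abstract can be checked with a short piece of arithmetic. The sketch below only uses the numbers stated above; the variable names are illustrative. The exact pixel-count ratio comes out near 1:621, which is consistent with the rounded “approximately 1:625” (roughly a 25x difference per image axis), and the recovered baseline deviates from the nominal value by about 9%.

```python
# Sensor resolutions as stated in the abstract.
rgb_pixels = 2028 * 1520      # 3,082,560 pixels
tir_pixels = 80 * 62          # 4,960 pixels

# Pixel-count ratio between the RGB and TIR sensors.
pixel_ratio = rgb_pixels / tir_pixels

# Per-axis resolution ratios (width and height).
width_ratio = 2028 / 80
height_ratio = 1520 / 62

# Relative deviation of the calibrated baseline (32.7 mm) from nominal (30 mm).
baseline_deviation = (32.7 - 30.0) / 30.0

print(f"pixel-count ratio ~ 1:{pixel_ratio:.1f}")
print(f"per-axis ratios: {width_ratio:.2f} (width), {height_ratio:.2f} (height)")
print(f"baseline deviation: {baseline_deviation:.1%}")
```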
[69] Unlocking Dense Metric Depth Estimation in VLMs cs.CV
Hanxun Yu, Xuan Qu, Yuxin Wang, Jianke Zhu, Lei Ke
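TL;DR: This paper proposes DepthVLM, a framework that turns a single vision-language model (VLM) into a model that natively predicts dense geometry (dense metric depth) while retaining its multimodal capability. By attaching a lightweight depth head to the LLM backbone and training in two stages under a unified vision-text supervision paradigm, it produces full-resolution depth maps and language outputs in a single forward pass.
Details
Motivation: Existing VLMs perform well on 2D tasks but are limited in 3D understanding, mainly because their text-only supervision paradigm cannot constrain fine-grained visual perception and so hinders the recovery of dense geometry. Prior methods either distill geometry from external vision models (causing error accumulation) or rely on inefficient per-pixel queries or coarse token-level outputs.
Result: Experiments show that DepthVLM significantly outperforms existing VLMs with higher inference efficiency, surpasses leading pure vision models, and improves complex 3D spatial reasoning. The authors also introduce a unified, VLM-compatible indoor-outdoor metric depth benchmark.
Insight: The core novelty is a simple and effective framework that, by attaching a lightweight depth head and training with unified vision-text supervision, lets a VLM predict dense metric depth natively and efficiently, a step toward a truly unified foundation model. The two-stage training strategy and the unified benchmark are also useful references.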
Abstract: Vision-Language Models (VLMs) excel at 2D tasks such as grounding and captioning, yet remain limited in 3D understanding. A key limitation is their text-only supervision paradigm, which under-constrains fine-grained visual perception and prevents the recovery of dense geometry. Prior methods either distill geometry from external vision models, introducing error accumulation, or enable direct prediction with inefficient per-pixel query or coarse token-level outputs. In this paper, we propose DepthVLM, a simple yet effective framework that transforms a single VLM into a native dense geometry predictor while preserving its multimodal capability. By attaching a lightweight depth head to the LLM backbone and training under a unified vision-text supervision paradigm with a two-stage schedule, DepthVLM generates full-resolution depth maps alongside language outputs in a single forward pass. We further introduce a unified indoor-outdoor metric depth benchmark in a VLM-compatible format. Experiments show that DepthVLM significantly outperforms existing VLMs with higher inference efficiency, surpasses leading pure vision models, and improves complex 3D spatial reasoning, moving toward a truly unified foundation model. All code and checkpoints will be publicly released.
[70] A Causally Grounded Taxonomy for Image Degradation Robustness Evaluation cs.CV
Stefan Becker, Simon Weiss, Wolfgang Hübner, Michael Arens
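TL;DR: This paper proposes a causally grounded framework for classifying image degradations, aiming to unify how degradations are described and measured across research areas such as robustness evaluation, image quality assessment, and physical imaging analysis. The framework classifies degradations along two orthogonal axes (dominant causal source and perceptual effect) and adds a lightweight severity measurement layer that uses full-reference image quality metrics such as PSNR, SSIM, and LPIPS to quantify degradation strength across backend implementations, making comparisons across datasets, sources, and tasks possible.
Details
Motivation: Different research communities study image degradation with incompatible grouping schemes and backend-specific severity definitions, which makes results hard to compare.
Result: Using the proposed framework, the authors build the COCO Degradation benchmark for evaluating object detector robustness under diverse imaging conditions, demonstrating the framework's practical value.
Insight: The novelty is an interpretive representation and measurement layer that makes implicit assumptions explicit. The dual-axis abstraction over causal source and perceptual effect unifies degradation classification across fields, and the standardized measurement makes severity observable and comparable without changing existing implementations.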
Abstract: Image degradations can occur during acquisition, processing, and transmission, altering visual appearance and affecting downstream vision tasks. They are studied in several communities, including synthetic corruption benchmarks for robustness evaluation, perceptual image quality assessment, and physically grounded analyses of imaging systems or real camera failures. Although these areas address closely related phenomena, they often use incompatible grouping schemes and backend specific severity definitions, making results difficult to compare across datasets, degradation sources, and tasks. We propose a causally grounded framework for organizing and interpreting image degradations across these settings. Instead of introducing new degradations or redefining existing benchmarks, we provide an interpretive representation and measurement layer that makes implicit assumptions explicit. Each degradation is described along two orthogonal axes: its dominant causal source in the imaging pipeline (environment, sensor/optics, ISP/renderer/codec, or transfer/system), and its resulting perceptual effect. This dual axis abstraction yields a compact taxonomy spanning algorithmic corruptions, perceptual distortions, and physically motivated imaging artifacts. To address inconsistent severity semantics without changing existing implementations, we introduce a lightweight severity measurement layer. For every degradation and each native severity level of a given backend, we quantify degradation strength using full reference image quality metrics: PSNR, SSIM, and LPIPS. This makes severity observable and comparable across sources while preserving native parameterizations. We demonstrate the framework through COCO Degradation, a taxonomy aligned benchmark for evaluating object detector robustness under diverse imaging conditions.
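The severity measurement layer can be illustrated with PSNR, the simplest of the three named metrics. The sketch below assumes 8-bit images flattened to lists of pixel values; `severity_table` and its input layout are hypothetical names used only for illustration, and SSIM and LPIPS are omitted because they need windowed statistics and a learned network respectively.

```python
import math


def psnr(reference, degraded, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized images."""
    mse = sum((r - d) ** 2 for r, d in zip(reference, degraded)) / len(reference)
    if mse == 0:
        return math.inf  # identical images: no measurable degradation
    return 10.0 * math.log10(max_value ** 2 / mse)


def severity_table(reference, degraded_by_level):
    """Map each native severity level of a backend to its measured PSNR."""
    return {
        level: psnr(reference, image)
        for level, image in degraded_by_level.items()
    }


# Two native severity levels of a hypothetical noise backend.
reference = [100, 100, 100, 100]
table = severity_table(
    reference,
    {1: [102, 98, 101, 99], 2: [110, 90, 105, 95]},
)
```

Measuring every native level this way leaves the backend's own parameterization untouched while putting different backends on one comparable scale, where a lower PSNR means a stronger degradation.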
[71] Decomposed Vision-Language Alignment for Fine-Grained Open-Vocabulary Segmentation cs.CV | cs.AI
Chenhao Wang, Yingrui Ji, Yu Meng, Yao Zhu
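TL;DR: This paper proposes a decomposed vision-language alignment framework for fine-grained open-vocabulary segmentation. It factorizes a textual prompt into a concept token and multiple attribute tokens, uses a feature-gated cross-attention module so that each semantic unit interacts with the visual features independently, and aggregates per-token similarities in log-space to strengthen compositional generalization.
Details
Motivation: Existing open-vocabulary segmentation models struggle to generalize to unseen combinations of object categories and attributes, because fine-grained descriptions are usually encoded as holistic sentences containing multiple semantic units, which entangles their semantics.
Result: The method significantly improves generalization to unseen attribute-category compositions on fine-grained open-vocabulary segmentation benchmarks and can be integrated seamlessly into existing transformer-based segmentation architectures.
Insight: The novelty is explicitly decomposing the text prompt into independent semantic units for cross-modal alignment, and using feature gating together with log-space aggregation to obtain stable, interpretable compositional matching, which addresses the core challenge of fine-grained compositional generalization.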
Abstract: Open-vocabulary segmentation models often struggle to generalize to unseen combinations of object categories and attributes, because fine-grained descriptions are typically encoded as holistic sentences that entangle multiple semantic units. We propose a Decomposed Vision-Language Alignment framework that explicitly factorizes textual prompts into a concept token and multiple attribute tokens, enabling separate cross-modal interactions for each semantic unit. At the feature level, we introduce a Feature-Gated Cross-Attention module that generates attribute-specific gating maps to fuse information in a multiplicative manner, effectively enforcing compositional semantics. At the scoring level, per-token similarities are aggregated in log-space, producing a stable and interpretable compositional matching. The method can be seamlessly integrated into existing transformer-based segmentation architectures and significantly improves generalization to unseen attribute-category compositions in fine-grained open-vocabulary segmentation benchmarks.
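The scoring-level idea, aggregating per-token similarities in log-space, can be sketched as follows. This is a minimal illustration that assumes each similarity is a value in (0, 1] and that aggregation is a plain sum of logarithms; the paper's exact scoring function is not given here, and `log_space_score` is a hypothetical name.

```python
import math


def log_space_score(similarities, eps=1e-6):
    """Sum of log-similarities over the concept token and attribute tokens.

    Equivalent to the log of the product, so one poorly matched token
    lowers the whole score. eps guards against log(0).
    """
    return sum(math.log(max(s, eps)) for s in similarities)


# Region A matches concept and attribute moderately well.
score_a = log_space_score([0.6, 0.6])
# Region B matches the concept perfectly but the attribute poorly.
score_b = log_space_score([1.0, 0.3])

# An arithmetic mean would prefer B (0.65 vs 0.60); the log-space
# aggregate prefers A because every semantic unit must be satisfied.
mean_a = sum([0.6, 0.6]) / 2
mean_b = sum([1.0, 0.3]) / 2
```

The comparison shows why a multiplicative (log-additive) score behaves compositionally: a strong concept match cannot compensate for a missing attribute.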
[72] Sparse Autoencoders enable Robust and Interpretable Fine-tuning of CLIP models cs.CV
Fabian Morelli, Arnas Uselis, Ankit Sonthalia, Seong Joon Oh
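TL;DR: This paper proposes SAE-FT, a new method for robust fine-tuning of large-scale pre-trained vision-language models such as CLIP. It operates only on the model's visual representations, using a sparse autoencoder to identify semantic features and regularize their addition and removal, which prevents catastrophic forgetting and improves interpretability. The method stays computationally efficient while matching or exceeding state-of-the-art performance on ImageNet and its distribution shift benchmarks.
Details
Motivation: Fine-tuning large-scale pre-trained models such as CLIP to improve downstream performance usually harms robustness to distribution shifts. Existing methods often rely on computationally expensive text guidance, so a more efficient robust fine-tuning method that operates only on visual representations is needed.
Result: SAE-FT matches or exceeds state-of-the-art (SOTA) performance on ImageNet and its associated distribution shift benchmarks (such as ImageNet-V2 and ImageNet-R).
Insight: The novelty is using a sparse autoencoder trained on the pre-trained model to identify semantic features and using them to regularize fine-tuning, which yields robustness and interpretability in a mechanistically transparent and efficient way. This provides a direct route to analyzing semantic changes in the model.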
Abstract: Large-scale pre-trained vision-language models like CLIP demonstrate remarkable zero-shot performance across diverse tasks. However, fine-tuning these models to improve downstream performance often degrades robustness against distribution shifts. Recent approaches have attempted to mitigate this trade-off, but often rely on computationally expensive text-guidance. We propose a novel method for robust fine-tuning, SAE-FT, which operates only on the model’s visual representations. SAE-FT regularizes changes to these representations by penalizing the addition and removal of semantically meaningful features identified by a Sparse Autoencoder trained on the pre-trained model. This constraint prevents catastrophic forgetting and makes the fine-tuning process interpretable, enabling direct analysis of semantic changes. SAE-FT is both mechanistically transparent and computationally efficient, matching or exceeding state-of-the-art performance on ImageNet and its associated distribution shift benchmarks. Code is publicly available at: https://github.com/Fabian-Mor/sae-ft.
[73] Flash-GRPO: Efficient Alignment for Video Diffusion via One-Step Policy Optimization cs.CV
Xiaoxuan He, Siming Fu, Zeyue Xue, Weijie Wang, Ruizhe He
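TL;DR: This paper proposes Flash-GRPO, an efficient single-step training framework for aligning video diffusion models with human preferences, aimed at the excessive computational cost and training instability of existing methods such as GRPO. By introducing iso-temporal grouping and temporal gradient rectification, it markedly improves training efficiency while maintaining alignment quality.
Details
Motivation: Existing video diffusion alignment methods based on Group Relative Policy Optimization are extremely expensive (for example, training a 14-billion-parameter model takes hundreds of GPU days), and existing acceleration methods that sample timesteps with a sliding window harm optimization, causing instability and failing to reach full-trajectory performance.
Result: Experiments on models from 1.3 billion to 14 billion parameters validate Flash-GRPO's effectiveness: it achieves substantial training acceleration, keeps training stable, and reaches state-of-the-art alignment quality.
Insight: The core novelty has two parts. Iso-temporal grouping enforces prompt-wise temporal consistency to eliminate timestep-confounded variance, decoupling policy performance from timestep difficulty. Temporal gradient rectification neutralizes the time-dependent scaling factor to resolve inconsistent gradient magnitudes across timesteps. Together they enable efficient and stable single-step optimization.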
Abstract: Group Relative Policy Optimization has emerged as essential for aligning video diffusion models with human preferences, but faces a critical computational bottleneck: training a 14B-parameter model typically demands hundreds of GPU days per experiment. Existing efficiency methods reduce costs through sliding-window subsampling of training timesteps, but fundamentally compromise optimization, exhibiting severe instability and failing to reach full trajectory performance. We present Flash-GRPO, a single-step training framework that outperforms full trajectory training in alignment quality under low computational budgets while substantially improving training efficiency. Flash-GRPO addresses two critical challenges: iso-temporal grouping eliminates timestep-confounded variance by enforcing prompt-wise temporal consistency, decoupling policy performance from timestep difficulty; temporal gradient rectification neutralizes the time-dependent scaling factor that causes vastly inconsistent gradient magnitudes across timesteps. Experiments on 1.3B to 14B parameter models validate Flash-GRPO’s effectiveness, demonstrating substantial training acceleration with consistent stability and state-of-the-art alignment quality.
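The group-relative signal at the heart of GRPO can be sketched briefly. The code below shows only the standard within-group reward normalization; it assumes, as one reading of iso-temporal grouping, that every rollout in a group shares the same prompt and the same training timestep, so that reward differences are not confounded by timestep difficulty. It does not implement temporal gradient rectification, and the function name is illustrative.

```python
import math


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group: (r - mean) / (std + eps).

    rewards: scalar rewards of rollouts assumed to share one prompt
             and one timestep (the iso-temporal assumption above).
    """
    mean = sum(rewards) / len(rewards)
    variance = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(variance)  # population standard deviation
    return [(r - mean) / (std + eps) for r in rewards]


advantages = group_relative_advantages([1.0, 2.0, 3.0])
no_signal = group_relative_advantages([0.5, 0.5, 0.5])
```

When all rollouts in a group receive the same reward, every advantage is zero and the group contributes no gradient, which is why the composition of each group matters for what the policy can learn.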
[74] From Failure to Feedback: Group Revision Unlocks Hard Cases in Object-Level Grounding cs.CV
Yuyuan Liu, Yiping Ji, Anjie Le, Jiayuan Zhu, Jiazhen Pan
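TL;DR: This paper proposes Group-Revision, an optimization paradigm for the problem that GRPO-based reinforcement learning struggles with hard cases in object-level grounding because rewards are sparse. The method samples an initial response and generates a group of revised candidates; drawing on reward shaping, it quantifies each candidate's improvement over the initial attempt to produce informative shaping signals, which are used to refine the reward and modulate the advantage.
Details
Motivation: Existing GRPO-based reinforcement learning methods for object-level grounding rely mainly on sparse, response-level rewards. When every candidate response fails in a hard scenario, the learning signal is minimal and the model struggles to learn from failure.
Result: Across referring segmentation, reasoning segmentation, REC, and counting benchmarks, the method achieves consistent gains over prior GRPO models.
Insight: The core novelty is turning failure cases into learning feedback: group revision explores improved outcomes, and reward shaping quantifies the relative improvement as a dense learning signal, strengthening the model's ability to learn on hard cases.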
Abstract: Finetuning Large Vision-Language Models with reinforcement learning has emerged as a promising approach to enhance their capability in object-level grounding. However, existing methods, mainly based on GRPO, assign rewards at the response level. Such sparse reward, often criterion-induced, leads to minimal learning signals when all candidate responses fail in challenging scenarios. In this work, we propose a group-revision optimisation paradigm that enhances learning on hard cases. It begins with a sampled initial response and generates a set of revised candidates to explore improved grounding outcomes. Inspired by reward shaping, we introduce a consolidation process that quantifies each candidate’s improvement over the initial attempt and converts it into informative shaping signals. These signals are used to both refine the reward and modulate the advantage, amplifying the influence of high-quality revisions. Our method achieves consistent gains across referring and reasoning segmentation, REC, and counting benchmarks compared with prior GRPO-based models. Our code is available at https://github.com/yyliu01/GroupRevision.
[75] Segmentation, Detection and Explanation: A Unified Framework for CT Appearance Reasoning cs.CV
Yuyuan Liu, Can Peng, Yingyu Yang, Qianye Yang, Cheng Ouyang
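TL;DR: This paper proposes a unified framework for CT appearance reasoning that integrates language-guided visual reasoning into CT interpretation. Task-routing tokens trigger detection and segmentation heads so that visual outputs (such as masks and bounding boxes) are generated coherently with textual reasoning, and a “closer-look” mechanism progressively improves localisation accuracy and semantic clarity.
Details
Motivation: Current deep learning methods are mostly confined to image-level pattern recognition and lack explicit anatomical or contextual reasoning. Clinical workflows need several kinds of fine-grained analysis (such as anatomy detection and segmentation), but existing vision-language models usually focus on a single task and cannot meet this need.
Result: On public benchmarks (BTCV and MosMed+), the method shows consistent improvements over the state of the art (SOTA), with gains of up to 1.0% Dice on BTCV and up to 1.7% Dice on MosMed+, while also providing appearance reasoning outputs.
Insight: The contributions are: 1) a unified autoregressive framework that integrates multiple tasks (detection, segmentation, reasoning) into a vision-language model through task-routing tokens; 2) a “closer-look” mechanism that performs progressive coarse-to-fine visits to regions of interest to strengthen localisation and semantic understanding; 3) a new multimodal CT dataset that supports model training and evaluation.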
Abstract: Recent progress in deep learning has significantly advanced CT image analysis, particularly for segmentation tasks. However, these advances are largely confined to image-level pattern recognition, with most methods lacking explicit anatomical or contextual reasoning. Large vision-language models introduce linguistic context into image analysis, yet most approaches focus on a single task, which is insufficient for clinical workflows that require multiple fine-grained types of analysis, such as anatomy detection and segmentation. In this paper, we propose a unified autoregressive framework that integrates language-guided visual reasoning into CT interpretation. Our method introduces task-routing tokens that trigger detection and segmentation heads conditioned on the hidden states of a large vision-language model, enabling coherent generation of visual outputs (e.g., masks and bounding boxes) and textual reasoning. To progressively enhance localisation accuracy and semantic clarity, we further design a “closer-look” mechanism that allows the model to perform progressive coarse-to-fine visits to regions of interest under refined fields of view. To support model training and evaluation, we curated a new multimodal CT dataset containing pixel-wise masks, bounding boxes, spatial prompts, and structured descriptions for visual objects constructed through an AI-assisted annotation process with human verification. Experiments on public benchmarks demonstrate consistent improvements over the SoTA, with gains of up to 1.0% Dice on BTCV and 1.7% Dice on MosMed+, while additionally providing appearance reasoning outputs. The code and dataset will be available.
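The reported gains are stated in Dice, the standard overlap metric for segmentation. A minimal sketch of the metric on flattened binary masks follows; it is the textbook definition, not code from the paper, and the convention of returning 1.0 for two empty masks is an assumption made here.

```python
def dice_score(prediction, ground_truth):
    """Dice coefficient 2|A∩B| / (|A| + |B|) for flattened binary masks."""
    intersection = sum(1 for p, g in zip(prediction, ground_truth) if p and g)
    total = sum(1 for p in prediction if p) + sum(1 for g in ground_truth if g)
    if total == 0:
        return 1.0  # both masks empty: treated as a perfect match (assumed)
    return 2.0 * intersection / total


# The prediction covers the single true pixel plus one false positive.
score = dice_score([1, 1, 0, 0], [1, 0, 0, 0])
```

Here one overlapping pixel against mask sizes of two and one gives 2 * 1 / 3, about 0.667, so a gain of 1.0% Dice corresponds to a 0.01 change on this 0-to-1 scale.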
[76] VideoSeeker: Incentivizing Instance-level Video Understanding via Native Agentic Tool Invocation cs.CV | cs.AI | cs.HC
Yiming Zhao, Yu Zeng, Wenxuan Huang, Zhen Fang, Qing Miao
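TL;DR: This paper proposes VideoSeeker, a new paradigm for instance-level video understanding through visual prompts. By integrating agentic reasoning seamlessly with instance-level video understanding tasks, it lets the model proactively perceive and retrieve relevant video segments on demand. A fully automated data synthesis pipeline generates large-scale, high-quality data, and after supervised and reinforcement learning training the model significantly surpasses baselines and closed-source models such as GPT-4o on instance-level tasks.
Details
Motivation: Existing large vision-language models struggle with instance-level video understanding tasks that require precise spatiotemporal localization. They rely mainly on text prompts, which give imprecise spatial and temporal references, and they usually decouple visual perception from language reasoning, which limits the model's ability to proactively perceive fine-grained visual evidence.
Result: Experiments show that the model improves over baselines by +13.7% on average on instance-level video understanding tasks, surpasses strong closed-source models such as GPT-4o and Gemini-2.5-Pro, and shows effective transferability on general video understanding benchmarks.
Insight: The novelty is a paradigm for instance-level understanding through visual prompts that internalizes agentic tool-calling and proactive visual perception into the model. Viewed objectively, the fully automated data synthesis pipeline and the training recipe that combines cold-start supervision with reinforcement learning offer a reusable path for building high-performance video understanding models efficiently.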
Abstract: Large Vision-Language Models (LVLMs) have shown significant progress in video understanding, yet they face substantial challenges in tasks requiring precise spatiotemporal localization at the instance level. Existing methods primarily rely on text prompts for human-model interaction, but these prompts struggle to provide precise spatial and temporal references, resulting in poor user experience. Furthermore, current approaches typically decouple visual perception from language reasoning, centering reasoning around language rather than visual content, which limits the model’s ability to proactively perceive fine-grained visual evidence. To address these challenges, we propose VideoSeeker, a novel paradigm for instance-level video understanding through visual prompts. VideoSeeker seamlessly integrates agentic reasoning with instance-level video understanding tasks, enabling the model to proactively perceive and retrieve relevant video segments on demand. We construct a four-stage fully automated data synthesis pipeline to efficiently generate large-scale, high-quality instance-level video data. We internalize tool-calling and proactive perception capabilities into the model via cold-start supervision and RL training, building a powerful video understanding model. Experiments demonstrate that our model achieves an average improvement of +13.7% over baselines on instance-level video understanding tasks, surpassing powerful closed-source models such as GPT-4o and Gemini-2.5-Pro, while also showing effective transferability on general video understanding benchmarks. The relevant datasets and code will be released publicly.
[77] ReAlign: Generalizable Image Forgery Detection via Reasoning-Aligned Representation cs.CV
Qing Huang, Zhipei Xu, Xuanyu Zhang, Xiangyu Yu, Jian Zhang
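TL;DR: This paper proposes ReAlign, a framework that distills high-quality reasoning texts generated by a GRPO-optimized LLM into a lightweight AIGI detector to improve the generalization and semantic sensitivity of image forgery detection. It uses contrastive learning for image-text alignment, optimized jointly with a classification loss, and outperforms existing methods on several benchmarks.
Details
Motivation: Among existing image forgery detection methods, non-LLM models lack semantic understanding, while LLM-based methods are computationally heavy and insensitive to subtle visual artifacts. The paper investigates the intrinsic value of LLM-generated reasoning texts, treating them as a source of generalization and sensitivity to semantic errors, in order to build an efficient and generalizable detection system.
Result: On AIGCDetectBenchmark, AIGI-Holmes, and the newly constructed UltraSynth-10k dataset, ReAlign outperforms existing SOTA detectors in both accuracy and generalization, especially on complex, high-fidelity forgeries produced by modern generative models.
Insight: The novelty is distilling the LLM's reasoning texts as a knowledge source and aligning image and text representations through contrastive learning, so that a lightweight model inherits the LLM's semantic reasoning ability. The method offers a new way to combine the semantic strengths of LLMs with efficient detection, and its joint optimization strategy further improves performance.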
Abstract: The rise of AI-generated images (AIGIs) poses growing challenges for digital authenticity, prompting the need for efficient, generalizable image forgery detection systems. Existing methods, whether non-LLM-based or LLM-based, exhibit distinct advantages and limitations. While non-LLM-based models offer efficient low-level artifact detection, they often lack semantic understanding. Conversely, LLM-based methods provide strong semantic reasoning and explainability but are computationally intensive and less sensitive to subtle visual artifacts. Moreover, the true contribution of explanatory reasoning texts to forgery detection performance remains unclear. In this work, we investigate the intrinsic value and potential of LLM-generated reasoning texts, treating them as a source of generalization and semantic-error sensitivity. Based on these findings, we propose ReAlign, a novel framework that distills high-quality reasoning texts generated by a GRPO-optimized LLM into a lightweight AIGI detector via contrastive learning. ReAlign effectively inherits the generalization ability and semantic sensitivity of reasoning textual representations, while remaining efficient and lightweight for deployment. Moreover, ReAlign adopts a tailored joint optimization strategy that integrates contrastive loss for image-text alignment and classification loss for accurate forgery discrimination. Experimental results on AIGCDetectBenchmark, AIGI-Holmes, and our newly constructed UltraSynth-10k demonstrate that ReAlign consistently outperforms existing state-of-the-art detectors in both accuracy and generalization, particularly when facing complex, high-fidelity forgeries from modern generative models.
[78] Echo-Forcing: A Scene Memory Framework for Interactive Long Video Generation cs.CV
Mingqiang Wu, Weilun Feng, Zhefeng Zhang, Haotong Qin, Yuqi Li
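TL;DR: This paper proposes Echo-Forcing, a training-free scene memory framework that targets the core bottleneck of autoregressive video diffusion models in interactive long video generation. By decoupling historical KV states, it provides unified support for prompt switching, old scene forgetting, and historical scene recall under a bounded cache budget.
Details
Motivation: Existing training-free long-video optimization methods focus mainly on stable extension under a single prompt and struggle with interactive scenarios involving prompt switching, old scene forgetting, and historical scene recall. The core bottleneck is the functional entanglement of historical KV states, which leads to contamination from outdated backgrounds, delayed response to new prompts, and loss of long-range memory.
Result: Extensive evaluation on the VBench-Long benchmark shows that Echo-Forcing achieves the best overall performance in both long-video generation and interactive video generation settings.
Insight: The novelty lies in three core mechanisms: Hierarchical Temporal Memory, which decouples stable anchors, compressed history, and recent windows under relative RoPE; Scene Recall Frames, which compress historical scenes into spatially structured KV representations to support long-term recall; and Difference-aware Memory Decay, which adaptively forgets conflicting tokens according to the discrepancy between old and new scenes. These designs enable fine-grained management of historical memory.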
Abstract: Autoregressive video diffusion models enable open-ended generation through local attention and KV caching. However, existing training-free long-video optimization methods mainly focus on stable extension under a single prompt, making it difficult for them to handle interactive scenarios involving prompt switching, old scene forgetting, and historical scene recall. We identify the core bottleneck as the functional entanglement of historical KV states: stable anchors and recent dynamics are handled by the same cache policy, leading to outdated background contamination, delayed response to new prompts, and loss of long-range memory. To address this issue, we propose Echo-Forcing, a training-free scene memory framework specifically designed for interactive long video generation with three core mechanisms: (1) Hierarchical Temporal Memory, which decouples stable anchors, compressed history, and recent windows under relative RoPE; (2) Scene Recall Frames, which compresses historical scenes into spatially structured KV representations to support long-term recall; and (3) Difference-aware Memory Decay, which adaptively forgets conflicting tokens according to the discrepancy between old and new scenes. Based on these designs, Echo-Forcing uniformly supports smooth transitions, hard cuts, and long-range scene recall under a bounded cache budget. Extensive evaluations on VBench-Long further demonstrate that Echo-Forcing achieves the best overall performance in both long-video generation and interactive video generation settings. Our code is released at https://github.com/mingqiangWu/Echo-Forcing
[79] EndoGSim: Physics-Aware 4D Dynamic Endoscopic Scene Simulations via MLLM-Guided Gaussian Splatting cs.CV
Changjing Liu, Yiming Huang, Long Bai, Beilei Cui, Hongliang Ren
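TL;DR: This paper proposes EndoGSim, a framework that uses Gaussian Splatting guided by a multimodal large language model to achieve physics-aware reconstruction and dynamic simulation of endoscopic scenes in robot-assisted minimally invasive surgery. It combines 4D Gaussian Splatting with pre-trained segmentation and depth estimation to represent deformable tissues and tools, and optimizes physical properties with an object-wise material field and a differentiable Material Point Method. Experiments on open-source and in-house datasets confirm its superior simulation fidelity and physical accuracy.
Details
Motivation: Existing endoscopic scene reconstruction methods focus only on visual reconstruction and lack a physical description of the scene. The paper aims at physics-aware reconstruction and simulation of high-fidelity dynamic endoscopic scenes to improve downstream tasks and surgical outcomes.
Result: On open-source and in-house datasets, the framework outperforms existing state-of-the-art methods in both simulation fidelity and physical accuracy.
Insight: The contributions include using an MLLM to guide the initialization of physical properties and combining 4D Gaussian Splatting with a differentiable Material Point Method for end-to-end physics-aware reconstruction and simulation, providing a unified physical simulation solution for surgical robotics applications.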
Abstract: In robot-assisted minimally invasive surgery, high-fidelity dynamic endoscopic scene reconstruction and simulation are crucial to enhancing downstream tasks and advancing surgical outcomes. However, existing methods primarily focus on visual reconstruction, lacking physics-based descriptions of the scene required for realistic simulation. We propose a unified framework that achieves physics-aware reconstruction and physical simulation of endoscopic scenes through Multi-modal Large Language Models (MLLMs)-guided Gaussian Splatting. Our approach utilizes 4D Gaussian Splatting (4DGS) integrated with pre-trained segmentation and depth estimation to represent deformable tissues and tools. To achieve automatic inference of physical properties, we introduce an object-wise material field that initializes material parameters via MLLM and refines them through a differentiable Material Point Method (MPM) under joint supervision from rendered images and optical flow. Validated on both open-source and in-house datasets, our framework achieves superior simulation fidelity and physical accuracy compared to state-of-the-art methods, underscoring its potential to advance robot-assisted surgical applications.
[80] STABLE: Simulation-Ready Tabletop Layout Generation via a Semantics-Physics Dual System cs.CV | cs.RO
Zhen Luo, Yixuan Yang, Xudong Xu, Jinkun Hao, Zhaoyang Lyu
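TL;DR: STABLE is a semantics-physics dual system that generates simulation-ready tabletop scene layouts from task instructions. It consists of two modules, a Semantic Reasoner and a Physics Corrector, which alternate to expand the scene step by step while ensuring semantic alignment and physical plausibility.
Details
Motivation: Existing task-to-scene generation methods based on large language models are limited in 3D spatial reasoning and often produce colliding or floating objects, so they cannot generate physically plausible, simulation-ready scenes.
Result: Experiments show that STABLE generates simulation-ready tabletop scenes that strictly conform to task instructions and significantly outperforms existing methods in physical validity.
Insight: The novelty is combining the semantic reasoning of a fine-tuned LLM with a flow-based, physics-aware denoising model in a progressive generation paradigm that balances semantic alignment with physical constraints, addressing the weak spatial reasoning of LLM-only methods.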
Abstract: Generating simulation-ready tabletop scenes from task instructions is an intriguing and promising research direction in the field of Embodied AI. However, existing task-to-scene generation methods rely exclusively on large language models (LLMs) to predict scene layouts, inevitably yielding colliding or floating objects due to LLMs’ inherent limitations in 3D spatial reasoning. In this paper, we present STABLE, a semantics-physics dual system tailored for simulation-ready tabletop scene generation. STABLE consists of two complementary modules: (i) a Semantic Reasoner, a fine-tuned LLM trained on a structured tabletop scene dataset to generate coarse layouts from input task instructions, and (ii) a Physics Corrector, a physics-aware flow-based denoising model that outputs pose updates to refine layouts, ensuring the physical plausibility of scenes while preserving semantic alignment with task instructions. STABLE adopts a progressive generation paradigm: by alternating between the Semantic Reasoner and Physics Corrector, it incrementally expands the scene from task-critical objects to background objects. Experiments demonstrate that STABLE successfully generates simulation-ready tabletop scenes that strictly conform to task instructions and significantly enhances the physical validity of scenes over prior art.
[81] WeatherOcc3D: VLM-Assisted Adverse Weather Aware 3D Semantic Occupancy Prediction cs.CV
A. Enes Doruk, Abdelaziz Hussein, Hasan F. Ates
TL;DR: 本文提出了WeatherOcc3D,一个利用视觉语言模型(VLM)辅助的、针对恶劣天气的3D语义占据预测框架。该框架通过CLIP的预训练潜在空间,利用语言环境线索来指导相机和LiDAR传感器的融合,并引入一个门控策略来分解环境不确定性(可见度和光照),从而动态调整融合权重。在nuScenes数据集上的实验表明,该方法能显著提升现有模型(如OccMamba和M-CONet)在恶劣天气条件下的性能。
Details
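Motivation: Multi-modal 3D semantic occupancy prediction is not robust to environmental changes such as low light and heavy precipitation. Static fusion strategies cannot adapt when a specific sensor becomes unreliable (cameras degrade in low light; LiDAR suffers backscatter noise in rain and snow), which creates a modality trust problem.
Result: Evaluated on nuScenes. Applying the framework to the OccMamba and M-CONet architectures yields mIoU scores of 26.3 and 21.1, respectively, significantly outperforming their conventional baselines.
Insight: The novelty lies in using the semantic priors of a VLM (CLIP): weather-specific text embeddings guide and dynamically modulate multi-sensor fusion, with environmental uncertainty decomposed into visibility and illumination factors for gating. Using language as an abstract representation of environmental state to steer fusion offers a new approach to adaptive fusion for multi-modal perception in complex environments.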
Abstract: While multi-modal 3D semantic occupancy prediction typically enhances robustness by fusing camera and LiDAR inputs, its effectiveness is fundamentally constrained by environmental variability. Specifically, camera sensors suffer from severe low-light degradation, while LiDAR sensors encounter significant backscatter noise during heavy precipitation. These adverse conditions create a modality trust problem, as static fusion strategies fail to adaptively re-weight inputs when a specific sensor becomes unreliable. To address this, we propose a VLM-assisted framework leveraging the pre-trained CLIP latent space to guide multi-sensor integration via linguistic environmental cues. We utilize a parameter-efficient adapter to align weather-specific text embeddings with sensor features, coupled with a gating strategy that decomposes environmental uncertainty into two factors: visibility and illumination. This enables the model to dynamically modulate the fusion ratio - prioritizing semantic camera features in clear daylight and shifting to geometric LiDAR priors during rainy nights. Evaluations on the nuScenes dataset demonstrate the versatility of our approach, as implementing our proposed framework on the OccMamba and M-CONet architectures achieves mIoU scores of 26.3 and 21.1, respectively, significantly outperforming their traditional baselines.
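The gating idea in this abstract can be illustrated with a minimal sketch. It assumes, purely for illustration, that visibility and illumination are scalars in [0, 1] and that the camera weight is their product; the paper's gate is learned from CLIP-aligned features, and the function names here are hypothetical.

```python
def camera_weight(visibility: float, illumination: float) -> float:
    """Hypothetical gate: trust the camera only when both factors are high."""
    for name, value in (("visibility", visibility), ("illumination", illumination)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1], got {value}")
    return visibility * illumination


def fuse(camera_feat, lidar_feat, visibility, illumination):
    """Convex combination of camera and LiDAR feature vectors."""
    if len(camera_feat) != len(lidar_feat):
        raise ValueError("feature vectors must have the same length")
    w = camera_weight(visibility, illumination)
    # w near 1 favors camera semantics; w near 0 falls back to LiDAR geometry.
    return [w * c + (1.0 - w) * l for c, l in zip(camera_feat, lidar_feat)]
```

With both factors at 1.0 (clear daylight) the output equals the camera features; when either factor drops toward 0 (a rainy night), the output shifts toward the LiDAR features, matching the behavior the abstract describes.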
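[82] Registers Matter for Pixel-Space Diffusion Transformers cs.CV [PDF]
Nikita Starodubcev, Ilia Sudakov, Ilya Drobyshevskiy, Artem Babenko, Dmitry Baranchuk
TL;DR: This paper studies register tokens in pixel-space Diffusion Transformers (DiTs). Unlike Vision Transformers (ViTs), DiTs do not exhibit patch-token outliers, yet register tokens still significantly improve the convergence and generation quality of pixel-space DiTs. Analysis of intermediate representations shows that register tokens produce cleaner feature maps at high noise levels, and that recent pixel-space DiT architectures already incorporate similar mechanisms implicitly. Based on this, the authors propose a parameter-efficient dual-stream architecture that specializes processing for register tokens and improves generation quality with negligible runtime overhead.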
Details
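Motivation: ViTs are known to exhibit patch-token outliers that degrade feature map quality, a problem that register tokens effectively mitigate. As diffusion models increasingly adopt transformer architectures and move toward pixel-space training, they become closer in form to ViTs, which raises the question of whether register tokens also benefit DiTs.
Result: Register tokens significantly improve the convergence and generation quality of pixel-space DiTs. Analysis of intermediate representations shows that they produce cleaner feature maps at high noise levels, which may be key to their effectiveness. Recent pixel-space DiT architectures also implicitly incorporate register-like mechanisms, which may partly explain their strong empirical performance.
Insight: The paper shows that register tokens help pixel-space DiTs even though DiTs lack the outlier problem seen in ViTs. Its analysis of intermediate representations offers new understanding of the mechanism, and the parameter-efficient dual-stream architecture is a reusable, lightweight way to improve performance by specializing the processing of register tokens.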
Abstract: Vision Transformers (ViTs) are known to exhibit high-norm patch-token outliers that degrade feature map quality, a problem effectively mitigated by \textit{register tokens}. As diffusion models increasingly adopt transformer architectures and move toward pixel-space training, they become closer in form to ViTs, raising the question of whether register tokens are also useful for Diffusion Transformers (DiTs). In this work, we show that DiTs differ from ViTs in a key respect: they do not exhibit patch-token outliers. Interestingly, register tokens significantly improve convergence and generation quality of pixel-space DiTs. By analyzing intermediate representations, we find that register tokens produce cleaner feature maps at high noise levels, which may contribute to their effectiveness in pixel-space generation. We further observe that recent pixel-space DiT architectures implicitly incorporate register-like mechanisms, which may partially account for their strong empirical performance. Motivated by these insights, we investigate a parameter-efficient dual-stream architecture that specializes processing for register tokens and improves pixel-space generation quality with negligible runtime overhead.
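[83] MAgSeg: Segmentation of Agricultural Landscapes in High-Resolution Satellite Imagery using Multimodal Large Language Models cs.CV [PDF]
Piyush Tiwary, Utkarsh Ahuja, Depanshu Sani, Aishwarya Jayagopal, Sagar Gubbi
TL;DR: MAgSeg is a novel decoder-free multimodal large language model (MLLM) segmentation approach for complex smallholder agricultural landscapes in high-resolution satellite imagery of the Global South. A new instruction-tuning data format lets a standard MLLM learn from the global context of an image while generating text tokens for only one patch within it, which removes the need for an auxiliary vision decoder.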
Details
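Motivation: Agricultural landscape segmentation in the Global South is challenged by fragmented plots, high intra-class variance, and scarce labeled training data. Existing MLLM-based segmentation methods face critical context-length bottlenecks and a domain alignment gap in understanding satellite features.
Result: Extensive evaluation on datasets spanning three Global South countries shows that MAgSeg significantly outperforms state-of-the-art MLLM baselines, offering a scalable solution for mapping smallholder agricultural environments.
Insight: The contributions are a decoder-free MLLM segmentation architecture and a new instruction-tuning data format that let the model process high-resolution satellite imagery efficiently, learning global context while focusing on local patches, which narrows the domain alignment gap and overcomes context-length limits.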
Abstract: Agricultural landscape segmentation in the Global South is challenging as it is characterized by fragmented plots, high intra-class variance, and a scarcity of labeled training data. Recent advances in segmentation have been made by Multimodal Large Language Models (MLLMs). However, current approaches encounter critical context length bottlenecks and a domain alignment gap in understanding satellite features. We address these limitations through MAgSeg, a novel, decoder-free MLLM segmentation approach. MAgSeg is an architecturally efficient approach that enables standard MLLMs to perform segmentation of complex smallholder agricultural landscapes from high-resolution satellite imagery, without requiring auxiliary vision decoders. We introduce a novel instruction tuning data format designed to enable scalable fine-tuning and post-training on high resolution satellite imagery, which enables MAgSeg to learn from the global context of the image while generating text tokens for only a patch within the image. Extensive evaluations on datasets spanning three countries in the Global South demonstrate that MAgSeg significantly outperforms state-of-the-art MLLM baselines, offering a scalable solution to map smallholder agricultural environments.
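[84] Second-Order Multi-Level Variance Correction for Modality Competition in Multimodal Models cs.CV | cs.AI [PDF]
Yishun Lu, Wes Armour
TL;DR: This paper proposes ML-FOP-SOAP, a second-order optimization framework that addresses modality competition in autoregressive multimodal models. It uses multi-level variance correction and Fisher-Orthogonal Projection to suppress variance-induced modality conflicts, stabilizing optimization and supporting large-batch training. Experiments on Janus and Emu3 show consistent gains on both image generation and text understanding, with stable training at batch sizes up to 8192.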
Details
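Motivation: Autoregressive next-token training provides a unified framework for image generation and text understanding, but it also creates strong modality competition that destabilizes optimization and limits large-batch scaling. First-order optimizers such as AdamW are vulnerable to cross-modality gradient heterogeneity, so a more stable optimization basis is needed.
Result: Experiments on Janus and Emu3 show that, compared with AdamW, the method improves sample efficiency by up to 1.4x and accelerates wall-clock training by up to 1.5x, while supporting stable training at batch size 8192 and delivering consistent gains across multimodal tasks.
Insight: The core contribution is to use second-order preconditioning, particularly SOAP, as a stable basis for multimodal alignment, together with a multi-level variance correction framework. Fisher-Orthogonal Projection suppresses variance-induced modality conflicts, while a hierarchical folding strategy captures fine-grained variance with low micro-step overhead, yielding a robust optimizer for scaling multimodal foundation models. Linking the optimization perspective to modality competition offers a new angle on stability and scalability in multimodal training.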
Abstract: Autoregressive next-token training offers a unified formulation for image generation and text understanding, but it also creates strong modality competition that destabilizes optimization and limits large-batch scaling. We show that first-order optimizers such as AdamW are vulnerable to cross-modality gradient heterogeneity, while second-order preconditioning, particularly SOAP, provides a more stable basis for multimodal alignment. Building on this insight, we propose \emph{ML-FOP-SOAP}, a second-order optimization framework with Multi-Level Variance Correction. Our Fisher-Orthogonal Projection suppresses variance-induced modality conflicts, reducing the trade-off between visual generation and textual understanding. To make this practical under large gradient accumulation, we introduce a hierarchical folding strategy that captures fine-grained variance with low micro-step overhead. Experiments on Janus and Emu3 show consistent gains across both modalities and stable training at batch size 8192. Compared with AdamW, our method improves sample efficiency by up to $1.4\times$ and accelerates wall-clock training by up to $1.5\times$, offering a robust optimizer for scaling multimodal foundation models.
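[85] Res$^2$CLIP: Few-Shot Generalist Anomaly Detection with Residual-to-Residual Alignment cs.CV [PDF]
Xinyue Liu, Jianyuan Wang, Biao Leng, Shuo Zhang
TL;DR: This paper proposes Res$^2$CLIP, a few-shot generalist anomaly detection framework based on residual alignment, designed to address the weaknesses of existing CLIP-based methods in handling fine-grained differences and cross-category generalization. It shifts multimodal alignment entirely into a unified residual space, where residual representations eliminate normal-feature differences across regions and class-specific biases, addressing both cross-granularity mismatch and cross-category generalization degradation.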
Details
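Motivation: Few-shot generalist anomaly detection requires models to generalize to novel categories without retraining, which is difficult in real-world settings with scarce samples and rapidly changing categories. Existing CLIP-based methods face two main problems: coarse-grained unified text prompts adapt poorly to fine-grained foreground-background differences, causing cross-granularity mismatch; and fine-tuning on auxiliary datasets disrupts CLIP's inherent open-world generalization through domain shift, causing cross-category generalization degradation.
Result: Experiments on multiple datasets demonstrate the effectiveness of the architecture, but the abstract gives no quantitative results or benchmark comparisons (for example, whether it reaches SOTA).
Insight: The paper presents the first residual-to-residual alignment framework, which symmetrically bridges the visual and text modalities within CLIP's residual space. All learnable optimization is constrained to the residual domain, forcing the model to focus on relative anomaly deviations rather than class-specific features. A single unified residual space thus handles both fine-grained adaptation and generalization preservation.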
Abstract: Few-shot Generalist Anomaly Detection requires models to generalize to novel categories without retraining, posing significant challenges in real-world scenarios with scarce samples and rapidly changing categories. Existing CLIP-based methods face two major challenges: coarse-grained unified text prompts struggle to adapt to fine-grained foreground-background differences, causing cross-granularity mismatch; and fine-tuning on auxiliary datasets disrupts CLIP’s inherent open-world generalization due to domain shift, leading to cross-category generalization degradation. To address these, we propose to shift multimodal alignment entirely into a unified residual space, where residual representations naturally eliminate fine-grained normal feature differences across regions and class-specific biases, simultaneously resolving both problems. Based on this insight, Res$^2$CLIP, the first residual-to-residual alignment framework that symmetrically bridges visual and text modalities within CLIP’s residual space, is designed. The framework is developed from a residual perspective into three branches: a text prompt-based branch, a visual prompt-based branch, and a novel residual-to-residual alignment branch. All learnable optimizations are constrained within the residual domain, and the residual alignment optimization objectives are designed to force the model to focus on relative anomaly deviations rather than optimizing class-specific features. Experiments on multiple datasets demonstrate the effectiveness of our architecture. The code is available at https://github.com/hito2448/Res2CLIP.
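[86] IVGT: Implicit Visual Geometry Transformer for Neural Scene Representation cs.CV | cs.AI | cs.RO [PDF]
Yuqi Wu, Tianyu Hu, Wenzhao Zheng, Yuanhui Huang, Haowen Sun
TL;DR: IVGT is an Implicit Visual Geometry Transformer that learns a continuous neural scene representation from unposed multi-view images. It models continuous geometry in a canonical coordinate system, supports spatial queries at arbitrary 3D positions, predicts signed distance values and colors with lightweight decoders, and allows direct extraction of continuous surface geometry for rendering from arbitrary viewpoints.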
Details
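Motivation: Existing visual geometry foundation models typically predict explicit geometry by regressing pixel-aligned pointmaps, which suffers from redundancy and limited geometric continuity. This paper aims to implicitly model continuous, coherent geometry from unposed multi-view images.
Result: Trained via multi-dataset joint optimization with 2D supervision and 3D geometric regularization, IVGT generalizes well across multiple tasks, including mesh and point cloud reconstruction, novel view synthesis, depth and surface normal estimation, and camera pose estimation.
Insight: The novelty is an implicit visual geometry transformer that learns a continuous neural scene representation in a canonical coordinate system, supports continuous queries at arbitrary 3D positions, avoids the redundancy of explicit methods, and unifies modeling and rendering across tasks.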
Abstract: Reconstructing coherent 3D geometry and appearance from unposed multi-view images is a fundamental yet challenging problem in computer vision. Most existing visual geometry foundation models predict explicit geometry by regressing pixel-aligned pointmaps, often suffering from redundancy and limited geometric continuity. We propose IVGT, an Implicit Visual Geometry Transformer that implicitly models continuous and coherent geometry from pose-free multi-view images. This formulation learns a continuous neural scene representation in a canonical coordinate system and supports continuous spatial queries at any 3D position, retrieving local features to predict signed distance (SDF) values and colors using lightweight decoders. It allows direct extraction of continuous and coherent surface geometry, enabling rendering of RGB images, depth maps, and surface normal maps from arbitrary viewpoints. We train IVGT via multi-dataset joint optimization with 2D supervision and 3D geometric regularization. IVGT demonstrates generalization across scenes and achieves strong performance on various tasks, including mesh and point cloud reconstruction, novel view synthesis, depth and surface normal estimation, and camera pose estimation.
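[87] Offline Semantic Guidance for Efficient Vision-Language-Action Policy Distillation cs.CV | cs.AI [PDF]
Jin Shi, Brady Zhang, Yishun Lu
TL;DR: This paper proposes VLA-AD, a distillation framework that uses a vision-language model as an offline semantic supervisor to transfer knowledge from large Vision-Language-Action (VLA) teacher policies into lightweight student policies. It augments training with high-level semantic guidance, such as task phase anchors and multi-frame operating-direction descriptions, substantially reducing model size and increasing inference speed while preserving performance.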
Details
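Motivation: Large VLA policies are hard to use for real-time closed-loop robot manipulation because of their parameter counts and inference cost. The goal is to obtain efficient, deployable lightweight policies through knowledge distillation.
Result: On the LIBERO benchmark suites, with OpenVLA-7B as the teacher, VLA-AD produces a 158M-parameter student: a 44x reduction in model size with an average relative performance gap of only 0.27%, running at 12.5 Hz on an RTX 4090, 3.28x faster than OpenVLA-7B. With a $\pi_{0.5}$-4B teacher, the student outperforms the teacher on two suites and stays within 0.53% on libero_goal.
Insight: The novelty is to augment distillation with offline semantic guidance (task phases and operating directions) rather than relying only on low-level action imitation, which makes the student more robust and less sensitive to noisy teacher actions. The work shows the potential of vision-language models to improve the efficiency and deployability of policy distillation.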
Abstract: Billion-parameter Vision-Language-Action (VLA) policies have recently shown impressive performance in robotic manipulation, yet their size and inference cost remain major obstacles for real-time closed-loop control. We introduce \textbf{VLA-AD}, a distillation framework that uses a Vision-Language Model as an offline semantic supervisor to transfer large VLA teachers into lightweight student policies. Instead of relying only on low-level action imitation, VLA-AD augments teacher-provided 7-DoF action targets with high-level semantic guidance, including task phase anchors and multi-frame operating-direction descriptions. These auxiliary signals are used only during training: at test time, the student policy runs independently, with neither the VLA teacher nor the VLM required. We evaluate VLA-AD on three LIBERO benchmark suites. Using OpenVLA-7B as the teacher, our method produces a 158M-parameter student, yielding a $44\times$ reduction in model size while matching the teacher with only a $0.27\%$ average relative gap. The resulting policy runs at 12.5 Hz on an RTX 4090, achieving a $3.28\times$ inference speedup over OpenVLA-7B. We further show that the same semantic distillation pipeline generalizes to a different $\pi_{0.5}$-4B teacher, where the student outperforms the teacher on two suites and remains within $0.53\%$ on \texttt{libero\_goal}. Additional analysis indicates that phase-level supervision and multi-frame directional cues make the student less sensitive to noisy teacher actions, such as erroneous high-frequency gripper changes. Overall, VLA-AD demonstrates that offline semantic guidance from VLMs can substantially improve the efficiency, robustness, and deployability of VLA policy distillation.
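The size and speed figures in this abstract can be cross-checked with a few lines of arithmetic. The sketch assumes a nominal teacher size of 7.0 billion parameters, inferred only from the name "OpenVLA-7B"; the teacher's control rate is derived from the reported speedup and is not a number stated in the abstract.

```python
TEACHER_PARAMS = 7.0e9  # assumption: nominal size implied by the name "OpenVLA-7B"
STUDENT_PARAMS = 158e6  # reported student size
STUDENT_HZ = 12.5       # reported student control rate on an RTX 4090
SPEEDUP = 3.28          # reported speedup over the teacher

# Ratio of parameter counts; should be close to the reported 44x.
size_reduction = TEACHER_PARAMS / STUDENT_PARAMS

# Teacher control rate implied by the reported speedup (derived, not reported).
implied_teacher_hz = STUDENT_HZ / SPEEDUP

# Per-step latency of the student in milliseconds.
student_latency_ms = 1000.0 / STUDENT_HZ

print(f"size reduction: {size_reduction:.1f}x")
print(f"implied teacher rate: {implied_teacher_hz:.2f} Hz")
print(f"student latency: {student_latency_ms:.1f} ms per step")
```

Under these assumptions the parameter ratio comes out at roughly 44, consistent with the abstract, and the student's 12.5 Hz corresponds to 80 ms per control step.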
cs.CR [Back]
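[88] A Cross-Modal Prompt Injection Attack against Large Vision-Language Models with Image-Only Perturbation cs.CR | cs.CV [PDF]
Hao Yang, Zhuo Ma, Yang Liu, Yilong Yang, Guancheng Wang
TL;DR: This paper proposes CrossMPI, a new cross-modal prompt injection attack that steers a large vision-language model's interpretation of both textual and visual inputs by perturbing only the image. It moves the optimization target from the visual embedding space to the model's hidden state space, and uses a layer selection strategy and a distance-decremental perturbation budget assignment strategy to overcome the resulting optimization challenges.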
Details
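Motivation: Existing prompt injection attacks are limited: they either steer the interpretation of a single modality's input, or are multimodal but fail to achieve cross-modal prompt perturbation. This paper closes that gap with an attack that achieves cross-modal steering through image perturbation alone.
Result: Extensive experiments across multiple LVLMs and datasets show that the method significantly outperforms baseline approaches.
Insight: The core idea is to move the perturbation optimization target from the lower-dimensional visual embedding space to the higher-dimensional hidden state space responsible for multimodal integration. The analysis also finds that the optimal layers for LVLM prompt perturbation lie in the middle of the model rather than at the end, contrary to past experience, and the paper proposes a distance-decremental budget assignment strategy to constrain the image perturbation space.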
Abstract: Large vision-language models (LVLMs) have emerged as a powerful paradigm for multimodal intelligence, but their growing deployment also expands the attack surface of prompt injection. Despite this growing concern, existing attacks still suffer from a critical limitation: the injected prompt for one modality only steers the model’s interpretation of that singular input. Alternatively, these attacks remain multimodal but fail to achieve cross-modal prompt perturbation. To bridge this gap, we introduce a novel cross-modal prompt injection attack CrossMPI, which can steer the model’s interpretation of both textual and visual inputs via image-only prompt injection. Our design is underpinned by the following key breakthroughs. First, we turn the focus of the injected prompt perturbation optimization from the visual embedding space (typically with only $10^5$ parameters) to the model hidden state space (for multimodal information integration and with $10^7$ parameters). Then, two strategies are adopted to mitigate the optimization challenges posed by the larger parameter space. To constrain the optimized model parameter space, we introduce a layer selection strategy that identifies the layers most critical to multimodal integration. Interestingly, deviating from the past experience, our analysis reveals that the optimal layers for LVLM prompt perturbation reside in the middle of the model rather than the last. To constrain the image perturbation space, we propose a new distance-decremental perturbation budget assignment strategy that allocates budgets decrementally as the pixel distance to semantic-critical regions increases. Extensive experiments across multiple LVLMs and datasets show that our method significantly outperforms baseline approaches.
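The distance-decremental budget idea can be sketched as a per-pixel bound that shrinks with distance from a semantic-critical region. The abstract does not give the decay rule, so the linear decay, the `eps_max`, `eps_min`, and `radius` parameters, and both function names below are assumptions made for illustration only.

```python
import math


def budget_map(height, width, critical_pixels,
               eps_max=8 / 255, eps_min=1 / 255, radius=4.0):
    """Per-pixel budget that decreases with distance to the nearest critical pixel.

    The linear decay from eps_max to eps_min over `radius` pixels is a
    hypothetical choice, not the rule used in the paper.
    """
    if not critical_pixels:
        raise ValueError("at least one critical pixel is required")
    budgets = []
    for y in range(height):
        row = []
        for x in range(width):
            d = min(math.hypot(y - cy, x - cx) for cy, cx in critical_pixels)
            frac = min(d / radius, 1.0)
            row.append(eps_max - frac * (eps_max - eps_min))
        budgets.append(row)
    return budgets


def clip_perturbation(delta, budgets):
    """Clamp each perturbation value to its own [-budget, +budget] interval."""
    return [[max(-b, min(b, d)) for d, b in zip(d_row, b_row)]
            for d_row, b_row in zip(delta, budgets)]
```

A defender can read the same sketch in reverse: perturbation energy permitted by such a scheme concentrates near semantically important regions, which is where detection effort would be best spent.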
cs.CE [Back]
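[89] From Feedback Loops to Policy Updates: Reinforcement Fine-Tuning for LLM-Based Alpha Factor Discovery cs.CE | cs.AI | cs.CL [PDF]
Lingzhe Zhang, Tong Jia, Yunpeng Zhai, Zixuan Xie, Chiming Duan
TL;DR: This paper proposes QuantEvolver, a self-evolving alpha factor discovery framework that addresses context explosion, feedback drift, and search stagnation in existing large language model (LLM)-based factor generation methods. Through reinforcement fine-tuning, it converts executable quantitative evaluation into policy updates, so that a Miner LLM internalizes historical optimization experience through parameter learning and continues to generate high-quality, complementary alpha factors.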
Details
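Motivation: Existing LLM-based alpha factor generation methods rely mainly on prompt-level generation-evaluation-feedback loops for iterative optimization, which leads to context explosion, higher inference cost, information dilution, and feedback drift. In addition, the stable generation preferences of very large LLMs can yield structurally similar expressions, redundant candidates, and search stagnation.
Result: Extensive experiments on three realistic market benchmarks show that QuantEvolver consistently outperforms existing LLM-based alpha factor discovery baselines on the primary evaluation metric of each task and produces higher-quality, more complementary factor pools.
Insight: The core contribution is to bring reinforcement learning into factor discovery, replacing prompt-level feedback accumulation with parameterized policy updates so that optimization experience is internalized. A second key element is the Diversity-Complementarity Reward together with the Mined Factor Database, which promote factor pool quality and diversity and help avoid search stagnation.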
Abstract: Modern quantitative trading increasingly relies on systematic models to extract predictive signals from large-scale financial data, where alpha factor discovery plays a central role in transforming market observations into tradable signals. Recent LLM-based methods have shown promise in automating factor generation, but most of them still rely on prompt-level generation–evaluation–feedback loops for iterative optimization. As the loop becomes longer, repeatedly appended historical candidates and feedback can cause context explosion, increase inference cost, dilute useful information, and introduce feedback drift. Moreover, these methods often depend on very large LLMs whose stable generation preferences may lead to structurally similar expressions, redundant candidates, and search stagnation. To address these limitations, we propose \textsc{QuantEvolver}, a self-evolving alpha factor discovery framework based on reinforcement fine-tuning. Instead of accumulating feedback in the prompt, \textsc{QuantEvolver} converts executable quantitative evaluation into policy updates, enabling a Miner LLM to internalize historical optimization experience through parameter learning. Specifically, \textsc{QuantEvolver} constructs high-quality seed factors, builds diverse seed–time-window training tasks, generates executable Factor DSL expressions, evaluates them through Regime Backtest, and optimizes the Miner LLM with Diversity-Complementarity Reward. During training, high-quality factors are continuously accumulated in a Mined Factor Database, which serves as the final discovered factor library. Extensive experiments on three realistic market benchmarks demonstrate the effectiveness of \textsc{QuantEvolver}, which consistently improves the primary evaluation metric of each task over existing LLM-based alpha factor discovery baselines and produces higher-quality and more complementary factor pools.
eess.IV [Back]
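[90] TVRN: Invertible Neural Networks for Compression-Aware Temporal Video Rescaling eess.IV | cs.CV [PDF]
Xinmin Feng, Li Li, Dong Liu, Feng Wu
TL;DR: This paper proposes TVRN, an end-to-end framework for compression-aware video frame-rate rescaling. It uses an invertible neural network architecture that combines a Multi-Input Multi-Output Temporal Wavelet Transform with a high-frequency reconstruction module to reduce high-frequency information loss during frame-rate downscaling. A surrogate network approximates the gradients of non-differentiable lossy codecs, and a learning-to-rank strategy learns compression-aware features that improve robustness across compression levels.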
Details
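Motivation: Existing frame-rate rescaling methods typically link downscaling and upscaling only through training objectives, without fully exploiting their reciprocal nature, which causes high-frequency information loss. They also overlook the effect of lossy codecs on low-frame-rate video, limiting real-world applicability.
Result: Extensive experiments under industrial video compression settings show that TVRN outperforms existing methods in reconstruction quality.
Insight: The contributions are an invertible architecture that regularizes high-frequency information, a surrogate network that enables end-to-end training through non-differentiable codecs, and compression-aware features learned via a learning-to-rank strategy for robustness. Jointly optimizing compression awareness and frame-rate rescaling improves suitability for real deployment.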
Abstract: To fit diverse display and bandwidth constraints, high-frame-rate videos are temporally downscaled to low-frame-rate (LFR) and later upscaled, requiring joint optimization for effective frame-rate rescaling. However, existing methods typically link the two operations via training objectives, without fully exploiting their reciprocal nature, which may cause high-frequency information loss. Moreover, they overlook the impact of lossy codecs on LFR videos, limiting real-world applicability. In this work, we propose an end-to-end framework for compression-aware frame-rate rescaling, named TVRN. To regularize high-frequency information lost during frame-rate downscaling, TVRN adopts an invertible architecture that combines a Multi-Input Multi-Output Temporal Wavelet Transform with a high-frequency reconstruction module. To enable end-to-end training through non-differentiable lossy codecs, we design a surrogate network that approximates their gradients. Finally, to improve robustness under various compression levels, we extend TVRN to an asymmetric architecture by incorporating compression-aware features learned via a learning-to-rank strategy. Extensive experiments show that TVRN outperforms existing methods in reconstruction quality under industrial video compression settings. Source code is publicly available at https://github.com/fengxinmin/TVRN_public.
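To make the "invertible temporal downscaling" idea concrete, here is a minimal sketch using the simplest temporal wavelet, a two-frame Haar transform. This is a stand-in and not TVRN's Multi-Input Multi-Output transform: it only shows how pairs of frames split into a half-rate low-frequency sequence plus a high-frequency residual, and how keeping the residual makes the split exactly invertible. Frames are represented as flat lists of pixel values, and the function names are hypothetical.

```python
def haar_forward(frames):
    """Split an even-length frame list into low- and high-frequency halves."""
    if len(frames) % 2:
        raise ValueError("the number of frames must be even")
    low, high = [], []
    for a, b in zip(frames[0::2], frames[1::2]):
        low.append([(x + y) / 2 for x, y in zip(a, b)])   # temporal average
        high.append([(x - y) / 2 for x, y in zip(a, b)])  # temporal detail
    return low, high


def haar_inverse(low, high):
    """Rebuild the full-rate sequence from the two halves."""
    frames = []
    for l, h in zip(low, high):
        frames.append([x + y for x, y in zip(l, h)])  # first frame of the pair
        frames.append([x - y for x, y in zip(l, h)])  # second frame of the pair
    return frames
```

Here `low` plays the role of the low-frame-rate video. If `high` is discarded at downscaling time, the upscaler has to estimate it, which is the gap that a high-frequency reconstruction module is meant to fill.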
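[91] Degradation-Aware Blur-Segmentation of Brain Tumor eess.IV | cs.CV [PDF]
Yuchun Wang, Xiaosong Li, Gefei Liang, Yang Liu
TL;DR: This paper proposes Degradation-Aware Blur-Segmentation Net (DABSeg), a 3D multimodal MRI brain tumor segmentation network that deblurs and segments simultaneously, addressing the image blur and artifacts caused by patient motion during MRI scans in order to improve segmentation. Systematically evaluated on BraTS2020, it surpasses state-of-the-art methods under both clear and degraded conditions.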
Details
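Motivation: Existing brain tumor segmentation methods usually assume artifact-free MRI images, but unavoidable patient motion during scanning introduces blur and artifacts that degrade boundary and texture features and reduce segmentation accuracy.
Result: On BraTS2020, DABSeg achieves state-of-the-art tumor Dice scores and boundary precision under both clear and degraded conditions.
Insight: The contributions are a feature-domain motion-deblurring stem that compensates for blur and rebalances intensity; a blur-aware cross-modal cross-attention module and multi-scale residual aggregation for effective modality complementarity; and a joint loss combining weighted Dice with a clear-reference reconstruction term, with imbalanced weights on small targets to strengthen learning and stabilize predictions for small lesions and boundary regions.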
Abstract: Multimodal 3D MRI brain tumor segmentation is a pivotal step in radiotherapy target delineation, surgical planning and post-treatment assessment. Existing methods often assume artifact-free MRI images. However, inevitable patient motion during scanning introduces artifacts and blur that degrade boundary and texture features, leading to poor segmentation performance. To bridge this gap, we introduce Degradation-Aware Blur-Segmentation Net (DABSeg), a synchronous deblurring 3D multimodal MRI segmentation network that unifies blur removal and accurate segmentation. Specifically, we propose a feature-domain motion-deblurring stem to compensate for blur and rebalance intensity. Concurrently, the backbone network embeds a blur-aware cross-modal cross-attention module and multi-scale residual aggregation to yield effective modality complementarity. Notably, we optimize a joint loss that combines weighted Dice with a clear-reference reconstruction term, where imbalanced weights are applied to small targets to boost learning intensity and predictive stability for small lesions and border regions. Systematic comparisons and ablation experiments on the BraTS2020 dataset under both clear and degenerative conditions consistently demonstrate that DABSeg surpasses state-of-the-art methods in tumor Dice score and boundary precision. These results validate the effectiveness of degenerative-aware cross-task collaborative learning in improving the robustness and clinical utility of multi-modal 3D brain tumor segmentation under realistic degenerative conditions. The source code is available at https://github.com/YuchunWang24/DABSeg_ICPR
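The joint loss described in this abstract can be sketched as a class-weighted soft Dice term plus a reconstruction term against a clear reference. The abstract does not state the exact form, so the weighted average over classes, the mean-squared-error reconstruction term, and the balancing factor `lam` are all assumptions. Inputs are flat lists of voxel probabilities or intensities.

```python
def soft_dice(pred, target, eps=1e-6):
    """Soft Dice coefficient between predicted probabilities and a binary mask."""
    inter = sum(p * t for p, t in zip(pred, target))
    denom = sum(pred) + sum(target)
    return (2.0 * inter + eps) / (denom + eps)


def weighted_dice_loss(preds, targets, weights):
    """Weighted mean of (1 - Dice) over classes; larger weights stress a class."""
    if not (len(preds) == len(targets) == len(weights)):
        raise ValueError("preds, targets, and weights must have one entry per class")
    total = sum(w * (1.0 - soft_dice(p, t))
                for p, t, w in zip(preds, targets, weights))
    return total / sum(weights)


def mse(a, b):
    """Mean squared error between two equally sized intensity lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def joint_loss(preds, targets, weights, restored, clear_ref, lam=1.0):
    """Segmentation loss plus a clear-reference reconstruction term (lam is hypothetical)."""
    return weighted_dice_loss(preds, targets, weights) + lam * mse(restored, clear_ref)
```

Raising the weight of a small class makes a missed small lesion cost more than it would under equal weighting, which is the effect the abstract attributes to its imbalanced weights.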
cs.RO [Back]
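[92] PhysBrain 1.0 Technical Report cs.RO | cs.AI | cs.CL | cs.CV [PDF]
Shijie Lian, Bin Yu, Xiaopeng Lin, Changti Wu, Hang Yuan
TL;DR: PhysBrain 1.0 extracts structured physical commonsense from large-scale human egocentric video, uses it to train vision-language models (VLMs), and then transfers it to vision-language-action (VLA) robot policies. A data engine extracts scene elements, spatial dynamics, action execution, and depth-aware relations and converts them into question-answer supervision.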
Details
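Motivation: Vision-language-action models that learn physical understanding from robot trajectories alone have limited coverage, so the paper explores learning broad physical commonsense from human interaction video as a complementary route.
Result: PhysBrain 1.0 achieves SOTA results across multimodal QA and embodied control benchmarks, including ERQA, PhysBench, SimplerEnv-WidowX, LIBERO, and RoboCasa, and shows especially strong out-of-domain generalization on SimplerEnv.
Insight: The contributions are a data engine that turns human video into structured physical commonsense supervision, and a capability-preserving, language-sensitive adaptation design that transfers physical priors from VLMs to VLA policies, providing an effective bridge from multimodal understanding to robot action.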
Abstract: Vision-language-action models have advanced rapidly, but robot trajectories alone provide limited coverage for learning broad physical understanding. PhysBrain 1.0 studies a complementary route: converting large-scale human egocentric video into structured physical commonsense supervision before robot adaptation. Our data engine extracts scene elements, spatial dynamics, action execution, and depth-aware relations, then turns them into question-answer supervision for training PhysBrain VLMs. The resulting physical priors are further transferred to VLA policies through a capability-preserving and language-sensitive adaptation design. Across multimodal QA benchmarks and embodied control benchmarks, including ERQA, PhysBench, SimplerEnv-WidowX, LIBERO, and RoboCasa, PhysBrain 1.0 achieves SOTA results and shows especially strong out-of-domain performance on SimplerEnv. These results suggest that scaling physical commonsense from human interaction video can provide an effective bridge from multimodal understanding to robot action.
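[93] DualReg: Dual-Space Filtering and Reinforcement for Rigid Registration cs.RO | cs.CV [PDF]
Jiayi Li, Yuxin Yao, Qiuhang Lu, Juyong Zhang
TL;DR: This paper proposes DualReg, a dual-space rigid registration method designed for noisy, partially overlapping data and real-time processing requirements. It combines the strengths of feature-based matching (which handles large transformation differences) and local geometry-based matching (which achieves fine local alignment): a lightweight one-point RANSAC algorithm and a refinement module first filter unreliable feature correspondences, which then serve as anchor points for extracting geometric proxies, and an objective function is solved for the transformation. On KITTI, the method matches the accuracy of MAC with a 32x CPU-time speedup.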
Details
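Motivation: Rigid registration must cope with noise, partially overlapping data, and real-time requirements. Feature-based matching handles large transformations but has limited accuracy, while local geometry-based matching achieves fine alignment but depends heavily on a good initial transformation, so a new paradigm is needed to exploit the strengths of both.
Result: Experiments on KITTI confirm the method's effectiveness: it achieves a 32x CPU-time speedup over MAC with comparable accuracy.
Insight: The novelty is a dual-space registration paradigm that integrates feature matching and geometric matching through an efficient filtering mechanism (lightweight one-point RANSAC plus a refinement module). Treating the filtered correspondences as anchors and extracting geometric proxies to build the objective is a reusable way to balance robustness, accuracy, and efficiency in registration.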
Abstract: Noisy, partially overlapping data and the need for real-time processing pose major challenges for rigid registration. Considering that feature-based matching can handle large transformation differences but suffers from limited accuracy, while local geometry-based matching can achieve fine-grained local alignment but relies heavily on a good initial transformation, we propose a novel dual-space paradigm to fully leverage the strengths of both approaches. First, we introduce an efficient filtering mechanism consisting of a computationally lightweight one-point RANSAC algorithm and a subsequent refinement module to eliminate unreliable feature-based correspondences. Subsequently, we treat the filtered correspondences as anchor points, extract geometric proxies, and formulate an effective objective function with a tailored solver to estimate the transformation. Experiments verify our method’s effectiveness, as demonstrated by a 32x CPU-time speedup over MAC on KITTI with comparable accuracy. Project page: https://ustc3dv.github.io/DualReg/.
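[94] Where to Perch in a Tree: Vision-Guidance for Tree-Grasping Drones cs.RO | cs.CV [PDF]
Alex Dunnett, Leonie Bottomley, Mirko Kovac, Basaran Bahadir Kocer
TL;DR: This paper presents a vision-guided method for autonomous tree perching by drones, which uses image processing algorithms to assess the shape and structure of a tree and select the best perch location. Rather than simply picking the nearest branch, it evaluates suitability from branch width, slope, and curvature. On a dataset of more than 10,000 urban tree images, the method successfully produces a result for 76% of feasible targets.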
Details
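Motivation: The problem is how a drone should choose an ideal location when perching autonomously in a tree. Conventional approaches consider only the nearest branch, whereas this study uses vision to assess the potential of every branch and judge its suitability from multiple factors.
Result: On a dataset of more than 10,000 urban tree images collected in a subtropical and temperate monsoon climate, the method successfully produces a result for 76% of feasible targets (trees whose branches are sufficiently thick and whose perching space is at least as wide as a tendon-driven grasping claw), laying a foundation for further improvement.
Insight: The novelty is to combine algorithms from machine learning, image segmentation, and binary image morphology to evaluate perching suitability along several dimensions (branch width, slope, and curvature) rather than by distance alone. Future work could add data from depth perception and attitude sensors to obtain a generalized method.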
Abstract: This study demonstrates a method to locate an ideal perch location on a tree for vision-guided autonomous tree-perching drones. Various image processing algorithms, including those used for machine learning, image segmentation and binary image morphology, are implemented to assess the shape and structure of a tree. Rather than identifying the closest available branch, this study builds on vision methods by evaluating the potential of each branch, determining its suitability for perching based on factors such as branch width, slope (angle to the horizontal) and curvature. For a given tree-perching drone and a dataset of more than 10,000 urban tree images taken from February to October in a subtropical and temperate monsoon climate, the proposed method successfully produces a result for 76% of feasible targets. A feasible target is defined as a tree where the branch diameters are sufficiently thick and where the available perching space is at least equal to the width of a tendon-driven grasping claw. These successful preliminary results create a foundation from which a number of identified improvements and additional features can be developed to create a generalised method; this will involve the incorporation of supplementary data from depth perception and attitude sensors to enhance the branch assessment.
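[95] Hierarchical and Holistic Open-Vocabulary Functional 3D Scene Graphs for Indoor Spaces cs.RO | cs.CV [PDF]
Xinggang Hu, Chenyangguang Zhang, Alexandros Delitzas, Xiangkui Zhang, Marc Pollefeys
TL;DR: This paper proposes an open-vocabulary method for building hierarchical, holistic functional 3D scene graphs for indoor scene understanding. It extends existing benchmarks with dense tabletop objects and explicit multi-level functional relationships, and builds a robust open-vocabulary pipeline based on 2D visual grounding and 3D graph optimization to handle the challenges posed by small, dense, and similar instances.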
Details
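Motivation: Existing functional 3D scene graph benchmarks have limited coverage and lack hierarchy; they focus on large furniture and neglect dense tabletop objects and multi-level functional relationships, which limits their potential in complex indoor scenes.
Result: Extensive experiments show that the method reliably infers functional 3D scene graphs in challenging real-world scenes, further unlocking their potential for practical applications.
Insight: The contributions are: anchoring fine-grained functional edges in 2D visual evidence and associating nodes across frames using multiple cues; formulating edge association as temporal graph optimization that integrates evidence accumulation, entropy regularization, and temporal smoothing; and performing global hierarchy shaping to recover the hierarchical graph structure.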
Abstract: Functional 3D scene graphs offer a versatile and flexible representation for 3D scene understanding and robotic manipulation, defined by object nodes, interactive elements, and functional relationship edges. However, their potential remains underexplored due to the limited coverage of existing benchmarks and the overly straightforward design of previous pipelines, which primarily focus on large-scale furniture but lack hierarchical structures. Therefore, in this work, we extend the benchmark coverage by introducing dense tabletop objects and explicit multi-level functional relationships. This expansion introduces critical challenges involving small-scale, dense, and similar instances, including a lack of visual anchoring in relational reasoning, instance confusion during cross-frame fusion, and attribution uncertainty under dynamic viewpoints. To address these issues, we propose an open-vocabulary pipeline based on 2D visual grounding and 3D graph optimization. Specifically, we anchor fine-grained functional edges from 2D visual evidence, and associate nodes across frames in 3D using multiple cues. Furthermore, edge association is formulated as temporal graph optimization, integrating evidence accumulation, entropy regularization, and temporal smoothing to robustly determine the functional connections of each node. Finally, global hierarchy shaping is performed to recover the hierarchical graph structure. Extensive experiments demonstrate that the proposed method can reliably infer functional 3D scene graphs in challenging real-world scenes, thereby further unlocking their potential for practical applications.
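[96] WorldVLN: Autoregressive World Action Model for Aerial Vision-Language Navigation cs.RO | cs.CV [PDF]
Baining Zhao, Jiacheng Xu, Weicheng Feng, Xin Zhang, Zhaolu Wang
TL;DR: This paper proposes WorldVLN, an autoregressive world action model for aerial vision-language navigation (VLN). It reformulates aerial VLN as a prediction-driven world-action problem, predicting short-horizon world-state transitions and decoding them directly into executable waypoint actions, and it is optimized with a two-stage training framework.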
Details
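Motivation: Existing methods may not adequately model the long-term effect of an agent's actions on the evolution of the 3D environment. The authors therefore argue that aerial VLN should be treated as a prediction-driven world-action problem, in which the agent anticipates latent world evolution and acts according to the predicted consequences.
Result: On public outdoor and indoor benchmarks, WorldVLN consistently outperforms existing vision-language-action baselines, with success-rate gains of more than 12% and larger advantages on challenging cases, and it transfers zero-shot to real drone deployment.
Insight: The core contributions are adapting an autoregressive video backbone to predict short-horizon world-state transitions that are decoded directly into actions, and Action-aware GRPO, described as the first reinforcement learning method tailored to autoregressive world action models, which optimizes waypoint decisions through their downstream consequences.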
Abstract: Aerial vision-language navigation (VLN) requires agents to follow natural-language instructions through closed-loop perception and action in 3D environments. We argue that aerial VLN can be formulated as a prediction-driven world-action problem: the agent should anticipate latent world evolution and act according to the predicted consequences. To this end, we propose WorldVLN, the first autoregressive world action model for aerial VLN. Unlike full-sequence video-generation world models that generate an entire visual clip, WorldVLN adapts a latent autoregressive video backbone to predict short-horizon world-state transitions and directly decodes them into executable waypoint actions. After each action segment is executed, newly received observations are encoded back into the autoregressive context, enabling closed-loop world-action prediction. We further introduce a two-stage training framework that first grounds the video prior in instruction-conditioned navigation dynamics and then develops Action-aware GRPO, the first reinforcement learning method tailored to autoregressive WAMs, to optimize waypoint decisions through their downstream rollout consequences. On public outdoor and indoor benchmarks, WorldVLN consistently outperforms existing Vision-Language-Action baselines with 12%+ success-rate gains and larger advantages on challenging cases. It further transfers zero-shot to real drone deployment, suggesting that the proposed WorldVLN offers a promising route for spatial action tasks. Demos and code are available at https://embodiedcity.github.io/WorldVLN/.
cs.CY [Back]
[97] Beyond Performance Disparities: A Three-Level Audit of Representational Harm in CelebA cs.CY | cs.CVPDF
Sieun Park, Yuanmo He
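TL;DR: This paper audits the CelebA face dataset at three levels and shows how cultural biases encoded in the dataset (such as gendered double standards of ageing and beauty) pass from dataset labels into a model's feature weights and spatial attention, producing two representational harms: hyper-scrutiny of women and outright exclusion of older men.
Details
Motivation: Existing fairness research focuses mainly on categorical labels and performance disparities, and overlooks how representational harm appears in learned features and attention. The paper aims to show how cultural biases embedded in large-scale face datasets such as CelebA are reproduced in data structure and model behaviour.
Result: Hierarchical clustering shows that the 39 attributes organise into latent trait bundles consistent with cultural stereotypes; XGBoost with SHAP analysis reveals gender-specific effects (for example, adiposity reduces attractiveness only for females); Grad-CAM shows that predictions for female and younger male subgroups concentrate on mid-face cues, whereas predictions for older males drift toward peripheral cues such as hair and clothing.
Insight: The study audits representational harm systematically at three levels (dataset structure, feature weights, spatial attention). It shows that cultural double standards pass from media representation into dataset labels, feature weights, and model attention, and that fairness metrics focused only on performance disparities mask these harms, which gives fairness research a new audit framework.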
Abstract: Large-scale facial datasets like CelebA are widely used in computer vision, yet the cultural biases embedded in their labels remain underexplored. Fairness research has distinguished representational from allocational harms, but audits of computer vision datasets have mostly examined categorical labels, leaving open how such harms appear in learned features and model attention. This paper examines CelebA at three levels: dataset structure, learned feature weights, and spatial attention, focusing on how gendered double standards of ageing and beauty are encoded in the data and reproduced in model behaviour. First, hierarchical clustering of 202,599 images shows that the 39 attributes organise into latent trait bundles aligned with cultural archetypes: performative femininity (youth, makeup, adornment) and professional masculinity (ageing, facial hair, formal attire). Female faces, though more often rated attractive overall, incur steep penalties when assigned to ageing or masculine-coded clusters. Second, XGBoost with SHAP analysis reveals gender-specific effects, such as adiposity reducing attractiveness only for females. Third, Grad-CAM finds that predictions for female and younger male subgroups concentrate on mid-face cues, whereas predictions for older males drift toward peripheral cues such as hair and clothing. Older males attain the highest accuracy but the lowest average precision, indicating categorical exclusion of groups outside the dataset’s evaluative templates. Cultural double standards thus pass from media representation into dataset labels, feature weights, and model attention, producing two representational harms: hyper-scrutiny of women under a narrow evaluative template, and exclusion of older men from the scheme entirely. Fairness metrics focused on performance disparities mask both, underscoring the need to address representational harm in fairness research.
cs.AI [Back]
[98] Context Pruning for Coding Agents via Multi-Rubric Latent Reasoning cs.AI | cs.CLPDF
Jingjing Wang, Xiwen Chen, Wenhui Zhu, Huayu Li, Zhengxiao He
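TL;DR: This paper proposes LaMR (Latent Multi-Rubric), a structured pruning framework for making LLM-powered coding agents more efficient. It decomposes code relevance into two interpretable quality dimensions, semantic evidence and dependency support, weights them dynamically with a mixture-of-experts gating network, and makes the final keep-or-prune decision with a CRF layer. The method saves tokens and improves task performance on several benchmarks.
Details
Motivation: Existing learned pruners compress context with a single-objective sequence labeler that collapses all facets of code relevance into one score and one transition matrix. This creates a modeling bottleneck, because a single CRF transition prior must serve heterogeneous retention patterns such as contiguous semantic spans and sparse structural support lines.
Result: On four benchmarks (SWE-Bench Verified, SWE-QA, LCC, LongCodeQA), LaMR wins 12 of 16 head-to-head multi-turn comparisons. It saves up to 31% more tokens on multi-turn agent tasks and improves Exact Match by up to +3.5 on single-turn tasks; denoising the context frequently improves performance, and any remaining drops are marginal.
Insight: The core idea is to model code relevance along several interpretable dimensions (semantic evidence and dependency support) separately, and to derive multi-rubric supervision labels automatically from the existing training corpus via AST-based program analysis, at no extra annotation cost. This relieves the single-CRF modeling bottleneck and denoises the context effectively.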
Abstract: LLM-powered coding agents spend the majority of their token budget reading repository files, yet much of the retrieved code is irrelevant to the task at hand. Existing learned pruners compress this context with a single-objective sequence labeler, collapsing all facets of code relevance into one score and one transition matrix. We show that this formulation creates a modeling bottleneck: a single CRF transition prior must serve heterogeneous retention patterns, including contiguous semantic spans and sparse structural support lines. We propose LaMR (Latent Multi-Rubric), a structured pruning framework that decomposes code relevance into two interpretable quality dimensions, semantic evidence and dependency support, each modeled by a dedicated CRF with dimension-specific transition dynamics. A mixture-of-experts gating network dynamically weights the per-rubric emissions conditioned on the query, and a final CRF layer on the fused emissions produces the aggregate keep-or-prune decision. To supervise each dimension without additional annotation cost, we derive multi-rubric labels from the existing training corpus via AST-based program analysis, simultaneously denoising the teacher’s binary labels. By effectively filtering distracting noise, LaMR frequently matches or even outperforms unpruned full-context baselines. Experiments on four benchmarks (SWE-Bench Verified, SWE-QA, LCC, LongCodeQA) show that LaMR wins 12 of 16 head-to-head multi-turn comparisons. It saves up to 31% more tokens on multi-turn agent tasks and improves Exact Match by up to +3.5 on single-turn tasks, while performance is frequently enhanced by denoising the context, and any remaining drops are marginal.
[99] Nudging Beyond the Comfort Zone: Efficient Strategy-Guided Exploration for RLVR cs.AI | cs.CLPDF
Chanuk Lee, Sangwoo Park, Minki Kang, Sung Ju Hwang
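TL;DR: This paper proposes NudgeRL, a framework for more efficient exploration in RLVR (reinforcement learning with verifiable rewards). Its Strategy Nudging conditions rollouts on lightweight, strategy-level contexts to induce diverse reasoning trajectories; a unified objective decomposes the reward signal and adds a distillation objective that transfers discovered behaviors back to the base policy.
Details
Motivation: RLVR is limited by exploration: the policy can only improve on trajectories it has already sampled. Existing approaches either rely on computationally expensive brute-force scaling or modify the optimization objective, which gives limited control over what is explored. A structured, diversity-driven, and efficient exploration method is therefore needed.
Result: Averaged over five challenging math benchmarks, NudgeRL outperforms standard GRPO even when GRPO uses rollout budgets up to 8 times larger, and it also outperforms an oracle-guided RL baseline that relies on privileged information.
Insight: The contribution is a strategy-guided, context-conditioned exploration mechanism that generates diverse reasoning paths without expensive external supervision, together with a decomposition of the reward signal into inter-context and intra-context components and a distillation objective, which allows efficient learning from structured exploration and transfer of the discovered behaviors.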
Abstract: Reinforcement learning with verifiable rewards (RLVR) has emerged as a scalable paradigm for improving the reasoning capabilities of large language models. However, its effectiveness is fundamentally limited by exploration: the policy can only improve on trajectories it has already sampled. While increasing the number of rollouts alleviates this issue, such brute-force scaling is computationally expensive, and existing approaches that modify the optimization objective provide limited control over what is explored. In this work, we propose NudgeRL, a framework for structured and diversity-driven exploration in RLVR. Our approach introduces Strategy Nudging, which conditions each rollout on lightweight, strategy-level contexts to induce diverse reasoning trajectories without relying on expensive oracle supervision. To effectively learn from such structured exploration, we further propose a unified objective, which decomposes the reward signal into inter- and intra-context components and incorporates a distillation objective to transfer discovered behaviors back to the base policy. Empirically, NudgeRL outperforms standard GRPO with up to 8 times larger rollout budgets, while outperforming oracle-guided RL baseline on average across five challenging math benchmarks. These results demonstrate that structured, context-driven exploration can serve as an efficient and scalable alternative to both brute-force rollout scaling and feasibility-oriented methods based on privileged information. Our code is available at https://github.com/tally0818/NudgeRL.
[100] Look Before You Leap: Autonomous Exploration for LLM Agents cs.AI | cs.CLPDF
Ziang Ye, Wentao Shi, Yuxin Liu, Yu Wang, Zhengzhou Cai
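TL;DR: This paper addresses the failure of LLM-based agents in unfamiliar environments caused by premature exploitation of prior knowledge, and argues that autonomous exploration is a key capability. The authors introduce a verifiable metric, Exploration Checkpoint Coverage, to quantify exploration breadth, develop a training strategy that interleaves task execution and exploration, and propose the Explore-then-Act paradigm to improve adaptability and generalization.
Details
Motivation: LLM agents fail in unfamiliar environments because of premature exploitation, that is, acting on prior knowledge before acquiring sufficient environment-specific information. The paper argues that autonomous exploration is a critical yet underexplored capability for building adaptive agents.
Result: A systematic evaluation shows that agents trained with standard task-oriented reinforcement learning exhibit narrow, repetitive behaviors that impede downstream performance, whereas the proposed training strategy and the Explore-then-Act paradigm broaden exploration, providing empirical support for building generalizable, real-world-ready agents.
Insight: The paper formalizes and quantifies autonomous exploration (via Exploration Checkpoint Coverage) and proposes the Explore-then-Act paradigm, which decouples information gathering from task execution. Its training strategy, which interleaves rollouts optimized for task and exploration rewards, and its verifiable metric offer a reusable framework for teaching agents to explore systematically.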
Abstract: Large language model based agents often fail in unfamiliar environments due to premature exploitation: a tendency to act on prior knowledge before acquiring sufficient environment-specific information. We identify autonomous exploration as a critical yet underexplored capability for building adaptive agents. To formalize and quantify this capability, we introduce Exploration Checkpoint Coverage, a verifiable metric that measures how broadly an agent discovers key states, objects, and affordances. Our systematic evaluation reveals that agents trained with standard task-oriented reinforcement learning consistently exhibit narrow and repetitive behaviors that impede downstream performance. To address this limitation, we develop a training strategy that interleaves task-execution rollouts and exploration rollouts, with each type of rollout optimized by its corresponding verifiable reward. Building on this training strategy, we propose the Explore-then-Act paradigm, which decouples information-gathering from task execution: agents first utilize an interaction budget to acquire grounded environmental knowledge, then leverage it for task resolution. Our results demonstrate that learning to systematically explore is imperative for building generalizable and real-world-ready agents.
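As a rough illustration of a coverage-style metric such as Exploration Checkpoint Coverage, the sketch below scores a rollout by the fraction of predefined checkpoints it reaches. The abstract does not give the exact definition, so the set-based formula, the function name `checkpoint_coverage`, and the example checkpoint names are assumptions made for illustration only.

```python
from typing import Iterable, Set


def checkpoint_coverage(visited: Iterable[str], checkpoints: Set[str]) -> float:
    """Fraction of predefined checkpoints that the agent discovered.

    Hypothetical definition: |visited ∩ checkpoints| / |checkpoints|.
    """
    if not checkpoints:
        raise ValueError("at least one checkpoint is required")
    discovered = set(visited) & checkpoints
    return len(discovered) / len(checkpoints)


# Hypothetical checkpoints (key states, objects, affordances) for one environment.
checkpoints = {"open_drawer", "find_key", "read_note", "unlock_door"}

# A narrow, repetitive rollout revisits the same state and ignores the rest.
narrow_rollout = ["open_drawer", "open_drawer", "open_drawer"]
# A broader rollout reaches most checkpoints, plus one irrelevant event.
broad_rollout = ["open_drawer", "find_key", "look_at_wall", "read_note"]

narrow_score = checkpoint_coverage(narrow_rollout, checkpoints)
broad_score = checkpoint_coverage(broad_rollout, checkpoints)
```

Because repeated visits and events outside the checkpoint set do not raise the score, a metric of this shape rewards breadth rather than rollout length.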
[101] Context, Reasoning, and Hierarchy: A Cost-Performance Study of Compound LLM Agent Design in an Adversarial POMDP cs.AI | cs.CL | cs.LG | cs.MA | eess.SYPDF
Igor Bogdanov, Chung-Horng Lung, Thomas Kunz, Jie Gao, Adrian Taylor
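TL;DR: This paper presents a controlled cost-performance study of the design dimensions of compound LLM agents (context representation, deliberation, and hierarchical decomposition) in an adversarial partially observable Markov decision process (POMDP). Programmatic state abstraction yields the highest return per token; distributing deliberation tools across a hierarchy degrades performance (a "deliberation cascade"); and hierarchical decomposition without deliberation usually achieves the best absolute performance.
Details
Motivation: When deploying compound LLM agents in adversarial, partially observable sequential environments, practitioners lack guidance on which design choices actually improve performance rather than merely increasing inference cost. The paper uses controlled experiments to quantify the effect of each design dimension on cost and performance.
Result: The evaluation in CybORG CAGE-2, a cyber defense environment modeled as a POMDP, covers five model families, six models, and twelve configurations (3,475 episodes). Programmatic state abstraction improves mean return by up to 76% over raw observations and delivers the largest returns per token spent. Combining hierarchical decomposition with deliberation tools degrades performance (up to 3.4× worse mean return while using 1.8-2.7× more tokens). Hierarchical decomposition without deliberation achieves the best absolute performance for most models.
Insight: The paper contributes a systematic cost-performance analysis of compound LLM agent design and names a destructive pattern, the deliberation cascade. The resulting design principle for structured adversarial POMDPs is to invest in programmatic infrastructure and clean task decomposition rather than deeper per-agent reasoning, because these strategies can interfere when combined.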
Abstract: Deploying compound LLM agents in adversarial, partially observable sequential environments requires navigating several design dimensions: (1) what the agent sees, (2) how it reasons, and (3) how tasks are decomposed across components. Yet practitioners lack guidance on which design choices improve performance versus merely increase inference costs. We present a controlled study of compound LLM agent design in CybORG CAGE-2, a cyber defense environment modeled as a Partially Observable Markov Decision Process (POMDP). Reward is non-positive, so all configurations operate in a failure-mitigation mode. Our evaluation spans five model families, six models, and twelve configurations (3,475 episodes) with token-level cost accounting. We vary context representation (raw observations vs. a deterministic state-tracking layer with compressed history), deliberation (self-questioning, self-critique, and self-improvement tools, with optional chain-of-thought prompting), and hierarchical decomposition (monolithic ReAct vs. delegation to specialized sub-agents). We find that: (1) Programmatic state abstraction delivers the largest returns per token spent (RPTS), improving mean return by up to 76% over raw observations. (2) Distributing deliberation tools across a hierarchy degrades performance relative to hierarchy alone for all five model families, reaching up to 3.4× worse mean return while using 1.8-2.7× more tokens. We call this destructive pattern a deliberation cascade. (3) Hierarchical decomposition without deliberation achieves the best absolute performance for most models, and context engineering is generally more cost-effective than deliberation. These findings suggest a design principle for structured adversarial POMDPs: invest in programmatic infrastructure and clean task decomposition rather than deeper per-agent reasoning, as these strategies can interfere when combined.
[102] Reasoners or Translators? Contamination-aware Evaluation and Neuro-Symbolic Robustness in Tax Law cs.AI | cs.CLPDF
Parisa Kordjamshidi, Samer Aslan, Madhavan Seshadri, Leslie Barrett, Enrico Santus
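TL;DR: By systematically evaluating large language models on tax law reasoning, this paper shows that data contamination can inflate measured performance, and it builds a new test suite to assess generalization. Hybrid neuro-symbolic systems prove more reliable and robust than monolithic LLMs, suggesting that legal reasoning is inherently compositional and that neuro-symbolic frameworks offer a more reliable foundation for legal AI.
Details
Motivation: The study asks whether LLM performance on tax law reasoning reflects genuine legal reasoning ability or an artifact of data contamination, and explores more reliable approaches to automated legal reasoning.
Result: On the purpose-built test suite, which probes generalization to unseen documents via case and rule variations, hybrid neuro-symbolic systems are more reliable than monolithic LLMs, with greater robustness and better generalization to unobserved situations.
Insight: Contributions include a contamination detection protocol for rigorously assessing LLM reliability and a test suite focused on compositional generalization. The analysis points to neuro-symbolic frameworks, which translate legal text into formal representations and delegate inference to symbolic solvers, as a key direction for more robust legal AI.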
Abstract: Recent advances in large language models (LLMs) have significantly enhanced automated legal reasoning. Yet, it remains unclear whether their performance reflects genuine legal reasoning ability or artifacts of data contamination. We present a comprehensive empirical study of tax law reasoning approaches and implement a contamination detection protocol to rigorously assess LLM reliability. We show that performance can be inflated by contamination. Building on this analysis, we conduct a systematic evaluation, comparing monolithic LLMs with hybrid systems that translate statutory text into formal representations and delegate inference to symbolic solvers. We build a novel test suite designed to probe generalization to unseen documents via case and rule variations. Our findings indicate that legal reasoning is inherently compositional and that neuro-symbolic frameworks offer a more reliable and robust foundation for legal AI, as well as improved generalization to unobserved situations.
[103] Confirming Correct, Missing the Rest: LLM Tutoring Agents Struggle Where Feedback Matters Most cs.AI | cs.CLPDF
Tahreem Yasir, Wenbo Li, Sam Gilson, Sutapa Dey Tithi, Xiaoyi Tian
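TL;DR: This paper evaluates the diagnostic feedback of LLM-based tutoring agents in propositional logic. The agents excel at recognizing optimal student steps but systematically over-reject valid but suboptimal reasoning and over-validate incorrect solutions, precisely where adaptive tutoring matters most. In addition, accurate diagnosis does not reliably produce pedagogically actionable feedback, revealing a gap between diagnostic judgment and instructional effectiveness.
Details
Motivation: As LLMs are increasingly explored as conversational complements to intelligent tutoring systems (ITS), evaluating their diagnostic precision is essential. The paper tests whether LLMs can distinguish optimal, valid but suboptimal, and incorrect student solutions, a distinction central to ITS but previously untested for LLM-based tutors.
Result: On a propositional logic benchmark with 10,836 solution-feedback pairs and three feedback conditions, seven LLM feedback agents reach near-ceiling performance on optimal steps but systematically over-reject valid but suboptimal reasoning and over-validate incorrect solutions. These failures persist across models and solution contexts.
Insight: The paper builds a benchmark with knowledge-graph-derived ground truth to evaluate the diagnostic precision of LLM tutoring agents systematically, and exposes their core limitation. The results suggest that LLMs are better suited to hybrid architectures in which knowledge-graph-grounded models handle diagnosis while LLMs support open-ended scaffolding and dialogue.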
Abstract: Effective tutoring requires distinguishing optimal, valid but suboptimal, and incorrect student solutions, a distinction central to intelligent tutoring systems (ITS) but untested for LLM-based tutors. As LLMs are increasingly explored as conversational complements to ITS, evaluating their diagnostic precision is essential. We present a benchmark of seven LLM feedback agents in propositional logic using knowledge-graph-derived ground truth across 10,836 solution–feedback pairs and three feedback conditions. Models achieved near-ceiling performance on optimal steps but systematically over-rejected valid but suboptimal reasoning and over-validated incorrect solutions, precisely where adaptive tutoring matters most. These failures persisted across models regardless of solution context, suggesting architectural rather than informational limits. Moreover, accurate diagnosis did not reliably produce pedagogically actionable feedback, revealing a gap between diagnostic judgment and instructional effectiveness. Our findings suggest that LLMs are better suited for hybrid architectures where KG-grounded models handle diagnosis while LLMs support open-ended scaffolding and dialogue.
[104] Probabilistic Dating of Historical Manuscripts via Evidential Deep Regression on Visual Script Features cs.AI | cs.CVPDF
Ranjith Chodavarapu
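TL;DR: This paper proposes a probabilistic method, based on evidential deep regression, for dating historical manuscript pages from visual features alone. It poses dating as regression over a continuous year axis; an EfficientNet-B2 backbone with a Normal-Inverse-Gamma output head produces a full predictive distribution with decomposed aleatoric and epistemic uncertainty in a single forward pass. On the DIVA-HisDB benchmark the model reaches a patch-level mean absolute error of 5.4 years, with strong calibration and uncertainty decomposition.
Details
Motivation: Prior work typically aggregates centuries into discrete classes, which limits dating granularity. The paper aims for more precise year predictions through a continuous regression framework while also quantifying predictive uncertainty.
Result: On DIVA-HisDB (150 pages, 151,936 patches), test MAE is 5.4 years, well below the 50-year granularity of century-label supervision; 93% of patches fall within 5 years and 97% within 10 years. The model achieves a prediction interval coverage probability (PICP) of 92.6%, better calibrated than MC Dropout and Deep Ensembles at 5× lower inference cost. Uncertainty decomposition shows that aleatoric uncertainty is a strong predictor of dating error (Spearman ρ = 0.729).
Insight: The contribution is to cast manuscript dating as evidential deep regression, unifying continuous year prediction with uncertainty decomposition. The NIG output head and joint loss yield a well-calibrated predictive distribution in a single forward pass, and the uncertainty analysis supports interpretability and selective prediction: for example, restricting predictions to the most certain 20% of patches gives an MAE of 0.5 years.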
Abstract: We introduce a probabilistic approach for dating historical manuscript pages from visual features alone. Instead of aggregating centuries into classes as is standard in the previous literature, we pose dating as an evidential deep regression problem over a continuous year axis, allowing our neural network to output a full predictive distribution with decomposed aleatoric and epistemic uncertainty in a single forward pass. Our architecture combines an EfficientNet-B2 backbone with a Normal-Inverse-Gamma (NIG) output head trained with a joint negative-log-likelihood and evidence-regularization objective. On the DIVA-HisDB benchmark (150 pages, 3 medieval codices, 151,936 patches), our model scores a test MAE of 5.4 years, well below the 50-year century-label supervision granularity, with 93% of patches within 5 years and 97% within 10 years. Our approach achieves PICP=92.6%, the best calibration among all compared methods, in a single forward pass, outperforming MC Dropout (PICP=88.2%, 50 passes) and Deep Ensembles (PICP=79.7%, 5 models) at 5× lower inference cost. Uncertainty decomposition shows aleatoric uncertainty is a strong predictor of dating error (Spearman ρ=0.729), and selective prediction on the most certain 20% of patches yields an MAE of 0.5 years. We show that predicted uncertainty increases as image degradation worsens, spatial decomposition maps explain which script regions cause aleatoric uncertainty, and page-level aggregation reduces MAE to 4.5 years with ρ=0.905 between uncertainty and page-level error.
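The uncertainty decomposition mentioned in the abstract can be illustrated with the closed-form moments commonly used in deep evidential regression. The sketch below assumes the usual Normal-Inverse-Gamma parameterization (γ, ν, α, β), where the aleatoric term is β/(α−1) and the epistemic term is β/(ν(α−1)); whether the paper uses exactly this parameterization is an assumption, and the class name, function name, and numeric values are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class NIGOutput:
    gamma: float  # predicted mean (here: a year)
    nu: float     # virtual evidence for the mean, nu > 0
    alpha: float  # shape, alpha > 1 so that both variances are finite
    beta: float   # scale, beta > 0


def decompose_uncertainty(out: NIGOutput) -> Tuple[float, float, float]:
    """Return (prediction, aleatoric variance, epistemic variance).

    Assumed formulas: aleatoric = E[sigma^2] = beta / (alpha - 1),
    epistemic = Var[mu] = beta / (nu * (alpha - 1)).
    """
    if out.nu <= 0 or out.alpha <= 1 or out.beta <= 0:
        raise ValueError("requires nu > 0, alpha > 1, beta > 0")
    aleatoric = out.beta / (out.alpha - 1.0)
    epistemic = out.beta / (out.nu * (out.alpha - 1.0))
    return out.gamma, aleatoric, epistemic


# Hypothetical head output for one manuscript patch.
patch = NIGOutput(gamma=1250.0, nu=4.0, alpha=3.0, beta=50.0)
year, aleatoric, epistemic = decompose_uncertainty(patch)
```

Under this parameterization, more evidence (larger ν) shrinks only the epistemic term, which is what lets the two sources of uncertainty be read off separately from one forward pass.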
[105] See Before You Code: Learning Visual Priors for Spatially Aware Educational Animation Generation cs.AI | cs.CVPDF
Yuejia Li, Ke He, Junheng Li, Shutong Chen, Jingkang Xia
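TL;DR: This paper proposes OmniManim, a framework that addresses the visual defects that appear when large language models generate code for educational animations, such as element overlap, misalignment, and broken animation continuity. It achieves render-feedback-aware constrained code generation through a shared scene state, explicit visual planning, structured post-render diagnostics, and localized repair.
Details
Motivation: Code for educational animations generated by LLMs often renders with visual defects that cannot be reliably detected from the code alone and appear only after execution. This calls for render-feedback-aware constrained code generation, in which the rendered output must satisfy structured quality criteria that can be evaluated only after rendering.
Result: On EduRequire-500, OmniManim improves render quality over both single-model baselines and existing multi-agent frameworks. Systematic ablations further show that explicit visual planning, especially its coarse spatial prior, bounding-box refinement, and interpolation-aware optimization, is central to these gains.
Insight: The Vision Agent is introduced as a task-specific visual planning module: it predicts sparse keyframe layouts with coarse-to-fine bounding-box denoising and optimizes an interpolation-aware objective to reduce intermediate-frame failures induced by downstream animation interpolation. This underlines the value of visual planning before code generation for visual quality and continuity.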
Abstract: Large language models can generate executable code for educational animations, but the resulting renders often exhibit visual defects, including element overlap, misalignment, and broken animation continuity. These defects cannot be reliably detected from the code alone and become apparent only after execution. We formalize this problem as render-feedback-aware constrained code generation: given a natural language specification, the model must generate executable code whose rendered output satisfies structured quality criteria that can be evaluated only after rendering. To address this problem, we introduce OmniManim, a render-feedback-aware educational animation generation framework built around a shared scene state, explicit visual planning, structured post-render diagnostics, and localized repair. Within OmniManim, the Vision Agent is a task-specific visual planning module: it predicts sparse keyframe layouts with coarse-to-fine bounding-box denoising and optimizes an interpolation-aware objective to reduce intermediate-frame failures induced by downstream animation interpolation. We further construct two datasets, ManimLayout-1K and EduRequire-500, and provide a reproducible evaluation protocol covering executability, instructional quality, visual quality, and efficiency. On EduRequire-500, OmniManim improves measured render quality over both single-model baselines and existing multi-agent frameworks. Systematic ablation studies further verify that explicit visual planning, especially its coarse spatial prior, bounding-box refinement, and interpolation-aware optimization, is central to these gains.
[106] Deterministic Event-Graph Substrates as World Models for Counterfactual Reasoning cs.AI | cs.CV | cs.LOPDF
Fabio Rovai
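TL;DR: This paper studies event-graph substrates, a class of world models that represent agent state as an append-only log of RDF triples and answer counterfactual queries by forking the log under a structured intervention vocabulary. The model needs no learned components, supports exact counterfactual reasoning, and transfers across domains. The author formalizes the class, proves a duality between explanatory and counterfactual queries, and evaluates it on CLEVRER (through a CLEVRER-DSL interpreter) and on the twin-EventLog benchmark.
Details
Motivation: The goal is a world model that is inspectable, supports exact counterfactual reasoning, and needs no learned components, in contrast to approaches whose counterfactual answers depend on parametric models and that lack interpretability and cross-domain generalization.
Result: On the CLEVRER validation set (n=75,618), the substrate exceeds the NS-DR symbolic oracle on all four question categories (by 9.89, 20.26, 17.65, and 0.80 percentage points) and exceeds the parametric ALOE baseline on descriptive and explanatory questions while lagging on predictive and counterfactual ones. On twin-EventLog, its joint accuracy is 18.80 points higher than Llama-3.1-8B with full context.
Insight: The paper proposes deterministic event-graph substrates built on an append-only log of RDF triples, giving inspectable, exact counterfactual reasoning. The duality proof reduces explanatory and counterfactual queries to the same causal-ancestor traversal, and the model transfers across domains without learned components.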
Abstract: We study event-graph substrates: a class of world models that represent agent state as an append-only log of typed RDF triples and answer counterfactual queries by forking the log under a structured intervention vocabulary. Substrates are inspectable at the triple level, support exact counterfactuals, and transfer across domains without learned components. We formalize the class, prove a duality between explanatory and counterfactual queries that reduces both to the same causal-ancestor traversal, and evaluate a 1,400-line CLEVRER-DSL interpreter atop a domain-agnostic substrate runtime at full CLEVRER validation scale (n=75,618). The substrate exceeds the NS-DR symbolic oracle on all four per-question categories (by 9.89, 20.26, 17.65, and 0.80 percentage points), and exceeds the parametric ALOE baseline on descriptive and explanatory while lagging on predictive and counterfactual. We also introduce twin-EventLog, a 500-specification Park-canonical Smallville counterfactual benchmark on which the substrate exceeds Llama-3.1-8B with full context by 18.80 points joint accuracy.
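To make the append-only log and the forking idea more concrete, here is a deliberately simplified sketch. It is not the paper's runtime: the `Event` and `EventLog` classes, the explicit `causes` field, and the single "remove an event" intervention are all assumptions chosen for illustration. It shows an explanatory query (causal ancestors of an event) next to a counterfactual fork (drop an event and everything it caused), which both walk the same cause links.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


@dataclass(frozen=True)
class Event:
    event_id: str
    triple: Triple
    causes: Tuple[str, ...] = ()  # ids of earlier events this one depends on


@dataclass
class EventLog:
    events: List[Event] = field(default_factory=list)

    def append(self, event: Event) -> None:
        """Append only: causes must already be present in the log."""
        known = {e.event_id for e in self.events}
        if any(c not in known for c in event.causes):
            raise ValueError("an event may only cite earlier events as causes")
        self.events.append(event)

    def ancestors(self, event_id: str) -> Set[str]:
        """Explanatory query: every event that causally precedes event_id."""
        by_id = {e.event_id: e for e in self.events}
        seen: Set[str] = set()
        stack = list(by_id[event_id].causes)
        while stack:
            current = stack.pop()
            if current not in seen:
                seen.add(current)
                stack.extend(by_id[current].causes)
        return seen

    def fork_without(self, removed_id: str) -> "EventLog":
        """Counterfactual fork: copy the log minus one event and its effects."""
        forked = EventLog()
        dropped = {removed_id}
        for e in self.events:  # log order guarantees causes come first
            if e.event_id in dropped or any(c in dropped for c in e.causes):
                dropped.add(e.event_id)
            else:
                forked.events.append(e)
        return forked


log = EventLog()
log.append(Event("e1", ("ball_a", "moves", "right")))
log.append(Event("e2", ("ball_a", "hits", "ball_b"), causes=("e1",)))
log.append(Event("e3", ("ball_b", "moves", "right"), causes=("e2",)))
log.append(Event("e4", ("cube_c", "is", "stationary")))

why_e3 = log.ancestors("e3")
forked = log.fork_without("e1")
surviving_ids = [e.event_id for e in forked.events]
```

In this toy setting, an event disappears from the fork exactly when the removed event is among its causal ancestors, which loosely mirrors the duality the abstract describes. A real substrate would need richer interventions than deletion.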
cs.LG [Back]
[107] DeltaPrompts: Escaping the Zero-Delta Trap in Multimodal Distillation cs.LG | cs.AI | cs.CLPDF
Jaehun Jung, Hyunwoo Kim, Brandon Cui, Ximing Lu, David Acuna
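TL;DR: This paper proposes DeltaPrompts to address the "zero-delta trap" in multimodal distillation. It quantifies the divergence (Δ) between teacher and student answer distributions, identifies and synthesizes high-divergence prompts, and builds a dataset of 200k synthetic reasoning problems. Distilling on this dataset yields relative improvements of up to 15% across several benchmarks.
Details
Motivation: In existing distillation setups many prompts are zero-delta: the teacher and student already induce the same answer distribution, so the training signal is weak and student improvement saturates quickly. The paper aims to make distillation more efficient by actively identifying and generating high-value prompts that expose the capability gap between teacher and student.
Result: Averaged over 10 benchmarks spanning chart, document, and perception-centric reasoning, DeltaPrompts yields up to 15% relative improvement even on top of a highly optimized reasoning model (e.g., Qwen3-VL-8B-Thinking). It is effective in three settings: on-policy distillation with the target teacher-student pair, transfer to a new model family, and off-policy fine-tuning of a non-reasoning model.
Insight: Starting from the first principle that distillation minimizes distributional divergence, the paper quantifies a prompt's value as the divergence (Δ) between teacher and student answer distributions, and builds a synthesis pipeline that actively targets student failure modes to produce high-divergence prompts.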
Abstract: Distillation enables compact Vision-Language Models (VLMs) to obtain strong reasoning capabilities, yet the prompts driving this process are typically chosen via simple heuristics or aggregated from off-the-shelf datasets. We reveal a critical inefficiency in this approach: up to 69% of the prompts in standard chart / document reasoning datasets are effectively zero-delta, meaning the teacher and student already induce the exact same answer distribution. Training on these prompts provides minimal learning signal, causing student improvement to rapidly saturate regardless of data scale. To escape the zero-delta trap, we return to first principles: distillation fundamentally minimizes distributional divergence, and thus a prompt is valuable only if it exposes a functional capability gap between the teacher and student. We quantify this gap through answer divergence (Δ), demonstrating that non-zero divergence is critical for effective scaling. Building on this insight, we propose a staged synthesis pipeline that repurposes existing datasets as seeds, actively targeting student failure modes to produce better prompts. The result is DeltaPrompts, a diverse dataset of 200k synthetic, high-divergence reasoning problems. We evaluate DeltaPrompts across three distinct settings: on-policy distillation with the target teacher-student pair, transfer to a novel model family without regenerating the data, and off-policy fine-tuning of a non-reasoning model. Across all scenarios, DeltaPrompts drives substantial gains, yielding up to 15% relative improvement even on top of a highly-optimized reasoning model (e.g., Qwen3-VL-8B-Thinking), averaged over 10 benchmarks spanning chart, document and perception-centric reasoning.
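One way to picture the answer divergence Δ is to sample final answers from teacher and student for the same prompt and compare the two empirical distributions. The abstract does not say which divergence is used, so the total variation distance below, the function names, and the sampled answers are assumptions for illustration rather than the authors' definition.

```python
from collections import Counter
from typing import Dict, List


def answer_distribution(samples: List[str]) -> Dict[str, float]:
    """Empirical distribution over final answers from repeated sampling."""
    if not samples:
        raise ValueError("need at least one sampled answer")
    counts = Counter(samples)
    return {answer: n / len(samples) for answer, n in counts.items()}


def answer_divergence(teacher: List[str], student: List[str]) -> float:
    """Total variation distance between teacher and student answers.

    0.0 means both models answer identically (a zero-delta prompt);
    1.0 means their answers never coincide.
    """
    p = answer_distribution(teacher)
    q = answer_distribution(student)
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(a, 0.0) - q.get(a, 0.0)) for a in support)


# Hypothetical sampled answers for two prompts.
zero_delta = answer_divergence(["42"] * 4, ["42"] * 4)
high_delta = answer_divergence(["42"] * 4, ["42", "17", "17", "17"])
```

A filter built on such a score would keep prompts like the second one, where the student still disagrees with the teacher, and discard prompts like the first.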
[108] GRLO: Towards Generalizable Reinforcement Learning in Open-Ended Environments from Zero cs.LG | cs.AI | cs.CLPDF
Shangjian Yin, Yu Fu, Yue Dong, Zhouxing Shi
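TL;DR: This paper proposes GRLO, a general reinforcement learning recipe that learns from scratch from a small set of interactions in open-ended environments, and studies whether the conversational abilities it acquires transfer implicitly to downstream tasks such as mathematical reasoning and code generation. On Qwen3-4B-Base, GRLO raises average performance across all domains from 24.1 to 63.1 with only 5K prompts and 22.7 GPU hours, greatly reducing data and compute requirements.
Details
Motivation: RL-based post-training such as RLVR performs well on domain-specific tasks (e.g., reasoning) but requires substantial GPU compute, which hinders broad adoption. The paper explores how well RLHF learned from a small set of interactions in open-ended environments generalizes, with the aim of lowering post-training cost.
Result: On the Qwen3-4B-Base backbone, GRLO raises cross-domain average performance from 24.1 to 63.1 with only 5K prompts and 22.7 GPU hours, requiring about 46× less data and 68× less compute than a strong in-domain RLVR baseline. The resulting model is competitive with Qwen's released post-trained models, which required a much larger training cost. A subsequent in-domain RLVR stage brings only selective gains, mainly on harder competition-math benchmarks.
Insight: The paper offers a simple, efficient recipe for general RL post-training (GRLO) and shows that conversational abilities learned from a small set of open-ended interactions can transfer implicitly to multiple downstream tasks, substantially reducing the compute and data needed for post-training.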
Abstract: Post-training has become a crucial step for unlocking the capabilities of large language models, with reinforcement learning (RL) emerging as a critical paradigm. Recent RL-based post-training has increasingly split into two paradigms: reinforcement learning from human feedback (RLHF), which optimizes models using human preference signals in target domains, and reinforcement learning from verifiable rewards (RLVR), which operates in verifier-backed environments. The latter has dominated recent reasoning-oriented post-training because it delivers stronger gains and higher efficiency on domain-specific tasks (e.g., reasoning). However, although in-domain RL training achieves promising performance, it still requires a substantial amount of GPU compute, which remains a major barrier to broad adoption. In this work, we study the generalization ability of RLHF learned from scratch from a small set of interactions in open-ended environments, and investigate whether the conversational abilities it explicitly acquires can implicitly transfer to downstream tasks such as mathematical reasoning and code generation. We call this approach GRLO. Specifically, on the Qwen3-4B-Base backbone, GRLO improves the average performance across all domains from 24.1 to 63.1 with only 5K prompts and 22.7 GPU hours, requiring about 46× less data and 68× less compute than a strong in-domain RLVR baseline. The resulting model is even competitive with Qwen’s released post-trained models, which required a much larger training cost. Notably, a subsequent in-domain RLVR stage brings only selective gains, mainly on harder competition-math benchmarks. We hope GRLO offers a simple and efficient recipe for building broadly capable post-trained models. Our code and data will be available at: https://github.com/SJY8460/GRLO.
[109] VSPO: Vector-Steered Policy Optimization for Behavioral Control cs.LG | cs.CLPDF
Xuechen Zhang, Zijian Huang, Kai Yang, Weijia Zhang, Jiasi Chen
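TL;DR: This paper proposes VSPO (Vector-Steered Policy Optimization), a method for optimizing a language model so that it keeps its primary task accuracy while controlling secondary behavioral preferences in its output (such as verbosity or confidence expression). It uses a steering vector associated with the target behavior to modulate the behavior intensity of generated rollouts, which addresses the sparse reward bottleneck, and it is validated on several reasoning benchmarks.
Details
Motivation: Modern language models often need to optimize a primary accuracy objective while accommodating secondary behavioral preferences (such as verbosity, agreeableness, or level of technical expertise), but a base model may exhibit the desired behavior rarely or not at all, which creates a sparse behavioral reward bottleneck.
Result: VSPO is evaluated on several reasoning benchmarks, including MATH and MMLU-Pro, for four target behaviors: explanation expertise, confidence expression, robustness to misleading context, and response verbosity. Compared with reward shaping, teacher-trace distillation, and guidance-based baselines, it consistently improves control over the target behavior while maintaining or improving task accuracy.
Insight: The core idea is to control behavior intensity with a steering vector, which can be interpreted as on-policy latent self-distillation in which the model internalizes its steering vector. By varying steering intensity, VSPO upsamples rare behaviors and enriches rollout diversity, which alleviates the sparse reward problem and provably accelerates policy optimization. Under a bandit abstraction, VSPO provably achieves better iteration complexity than reward-shaped GRPO when the steering-induced distributions are sufficiently aligned with the target behavior.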
Abstract: Modern language models often need to optimize a primary accuracy objective while also accommodating secondary behavioral preferences, such as verbosity, agreeableness, or the level of technical expertise in its response. In practice, a base model may exhibit a desired behavior very rarely or not at all. Thus, endowing the model with a target behavior creates a sparse behavioral reward bottleneck. To address such multi-objective problems, we introduce Vector-Steered Policy Optimization (VSPO) which employs a steering vector associated with the target behavior to control the behavior intensity of the generated rollouts. VSPO is obtained by modifying GRPO to sample rollouts with varying steering intensities. This process can be interpreted as an on-policy latent self-distillation procedure where the model internalizes its steering vector. By varying steering intensities, VSPO upsamples rare behaviors and enriches rollout diversity, which alleviates the sparse reward issue and provably accelerates the policy optimization. Through comprehensive theory and experiments, we establish that VSPO has favorable properties compared to vanilla reward shaping and other alternative approaches. Specifically, under a bandit abstraction, VSPO provably achieves better iteration complexity than reward-shaped GRPO when the steering-induced distributions are sufficiently aligned with the target behavior. We evaluate VSPO across multiple reasoning benchmarks, including MATH and MMLU-Pro, for four target behaviors: explanation expertise, confidence expression, robustness to misleading context, and response verbosity. Our results show that VSPO consistently improves the control along target behavior while maintaining or improving task accuracy compared with reward shaping, teacher-trace distillation, and guidance-based baselines.
[110] GOMA: Toward Structure-Driven Multimodal Alignment from a Graph Signal Smoothing Perspective cs.LG | cs.CVPDF
Xu Wang, Xunkai Li, Yinlin Zhu, Rong-Hua Li, Guoren Wang
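TL;DR: This paper proposes GOMA (Graph-Optimized Multimodal Alignment), a structure-driven post-alignment framework for refining frozen multimodal embeddings. It treats node embeddings in a multimodal attributed graph (MAG) as graph signals and decouples three design choices, namely where messages flow, how multimodal evidence propagates, and which smoothing depth is retained, so that graph structure refines the embeddings while useful information is preserved.
Details
Motivation: Existing methods such as CLIP usually learn alignment from isolated image-text pairs and ignore the relational context among entities. Multimodal attributed graphs (MAGs) are a natural setting for refining frozen vision-language embeddings, but it is challenging to exploit graph structure while breaking modality-specific topological barriers, controlling the smoothing regime, and preventing semantic boundaries from collapsing.
Result: On seven MAG benchmarks, GOMA achieves state-of-the-art or tied state-of-the-art retrieval and remains substantially more stable than the strongest graph competitor.
Insight: The paper reframes multimodal alignment from a graph signal smoothing perspective and proposes a unified post-alignment framework. By learning modality-aware propagation operators, performing finite-step coupled smoothing, and adaptively reading out node-specific smoothing trajectories, it uses unlabeled graph context to enhance frozen embeddings without retraining the encoders.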
Abstract: Multimodal alignment is commonly learned from isolated image-text pairs via CLIP-style dual encoders, leaving the relational context among entities largely unused. Multimodal attributed graphs (MAGs), where nodes carry multimodal attributes and edges encode corpus structure, provide a natural setting for refining frozen vision-language embeddings. This refinement is challenging: visual, textual, and cross-modal relations often induce different neighborhood geometries, while unrestricted graph propagation can quickly over-smooth retrieval representations. Effectively leveraging graph context therefore requires simultaneously breaking modality-specific topological barriers, controlling the smoothing regime, and preserving informative smoothing before semantic boundaries collapse. We propose Graph-Optimized Multimodal Alignment (GOMA), a structure-driven post-alignment framework that views frozen multimodal embeddings as graph signals and addresses these requirements through a unified retrieval-oriented design. GOMA decouples three key design choices: where messages should flow, how multimodal evidence should propagate, and which smoothing depth should be retained. Concretely, it learns modality-aware propagation operators, performs finite-step coupled smoothing without diagonal cross-modal shortcuts, and adaptively reads out node-specific smoothing trajectories to preserve useful smoothing before collapse. All experiments follow a transductive MAG retrieval protocol where the graph serves only as unlabeled context and diagonal self-pair edges are removed. On seven MAG benchmarks, GOMA achieves state-of-the-art or tied state-of-the-art retrieval and remains substantially more stable than the strongest graph competitor, demonstrating that MAG structure can serve as an effective post-encoder for frozen multimodal embeddings.
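The tension the abstract describes, between useful smoothing and over-smoothing, can be seen in a plain finite-step neighbor-averaging scheme. The sketch below is not GOMA itself: it uses a fixed mixing weight `alpha` and unweighted neighbor means instead of learned modality-aware operators, and the graph, embeddings, and function names are hypothetical. It only shows why keeping the whole smoothing trajectory, and choosing a depth from it, can matter.

```python
from typing import Dict, List

Vector = List[float]
Signals = Dict[str, Vector]


def smooth_step(signals: Signals, neighbors: Dict[str, List[str]], alpha: float) -> Signals:
    """One smoothing step: mix each node's embedding with its neighbor mean."""
    smoothed: Signals = {}
    for node, x in signals.items():
        nbrs = neighbors.get(node, [])
        if not nbrs:
            smoothed[node] = list(x)  # isolated nodes are left unchanged
            continue
        mean = [sum(signals[n][d] for n in nbrs) / len(nbrs) for d in range(len(x))]
        smoothed[node] = [(1 - alpha) * xi + alpha * mi for xi, mi in zip(x, mean)]
    return smoothed


def smoothing_trajectory(signals: Signals, neighbors: Dict[str, List[str]],
                         alpha: float, steps: int) -> List[Signals]:
    """Keep every intermediate depth so a readout can choose among them."""
    trajectory = [signals]
    for _ in range(steps):
        trajectory.append(smooth_step(trajectory[-1], neighbors, alpha))
    return trajectory


# Hypothetical path graph a - b - c with 2-dimensional frozen embeddings.
neighbors = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
signals = {"a": [1.0, 0.0], "b": [0.5, 0.5], "c": [0.0, 1.0]}

trajectory = smoothing_trajectory(signals, neighbors, alpha=0.5, steps=3)
# Gap between the two end nodes in the first dimension at each depth.
gaps = [abs(s["a"][0] - s["c"][0]) for s in trajectory]
```

Here the gap between the end nodes halves at every step, so deeper smoothing steadily erases the distinction between them. A depth-aware readout would stop before that distinction is gone.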
[111] LoCO: Low-rank Compositional Rotation Fine-tuning cs.LG | cs.AI | cs.CV
An Nguyen, Jaesik Choi, Anh Tong
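TL;DR: This paper proposes LoCO (Low-rank Compositional Orthogonal fine-tuning), a new parameter-efficient fine-tuning method that constructs orthogonal transformations from low-rank skew-symmetric matrices and compositional rotation chains, so that the geometric structure of pretrained representations is preserved when adapting large-scale foundation models.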
Details
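Motivation: Existing parameter-efficient fine-tuning methods such as low-rank adaptation achieve parameter efficiency through low-rank weight updates, but they are limited in their ability to preserve the geometric structure of pretrained representations. LoCO aims to address this limitation.
Result: LoCO is validated on diffusion transformer fine-tuning, vision transformer adaptation, and language model adaptation, where it shows superior or competitive performance compared with existing orthogonal and non-orthogonal methods.
Insight: LoCO builds orthogonal transformations from low-rank skew-symmetric matrices and compositional rotation chains, and adds an approximation scheme that allows the rotations to be computed fully in parallel. This keeps orthogonality and low computational complexity while remaining practical for high-dimensional feature spaces.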
Abstract: Parameter-efficient fine-tuning (PEFT) has emerged as a critical technique for adapting large-scale foundation models across natural language processing and computer vision. While existing methods such as low-rank adaptations achieve parameter efficiency via low-rank weight updates, they are limited in their ability to preserve the geometric structure of pretrained representations. We introduce Low-rank Compositional Orthogonal fine-tuning (LoCO), a novel PEFT method that constructs orthogonal transformations through low-rank skew-symmetric matrices and compositional rotation chains. We propose an approximation scheme that enables fully parallel computation of compositional rotations, making the approach practical for high-dimensional feature spaces. Our method keeps computational complexity low while maintaining orthogonality with controlled approximation error. We validate LoCO across diverse domains, including diffusion transformer fine-tuning, vision transformer adaptation, and language model adaptation. Our method demonstrates superior or competitive performance compared to both existing orthogonal and non-orthogonal methods.
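To make the link between low-rank skew-symmetric matrices and orthogonal transformations concrete, here is a small sketch under stated assumptions. The abstract does not spell out LoCO's parameterization or its parallel approximation, so this uses a standard fact instead: the matrix exponential of a rank-2 skew-symmetric generator theta * (e2 e1^T - e1 e2^T), with e1 and e2 orthonormal, is a rotation in the plane they span, and a product of such rotations is again orthogonal. All function names are hypothetical, and the chain is composed sequentially rather than in parallel.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def orthonormal_pair(u, v):
    """Gram-Schmidt: unit vectors e1, e2 spanning the same plane as u and v."""
    nu = math.sqrt(dot(u, u))
    e1 = [x / nu for x in u]
    proj = dot(v, e1)
    w = [x - proj * y for x, y in zip(v, e1)]
    nw = math.sqrt(dot(w, w))
    e2 = [x / nw for x in w]
    return e1, e2


def plane_rotation(u, v, theta):
    """Closed form of exp(theta * (e2 e1^T - e1 e2^T)): rotate by theta in span(u, v)."""
    e1, e2 = orthonormal_pair(u, v)
    n = len(u)
    s, c = math.sin(theta), math.cos(theta)
    return [[(1.0 if i == j else 0.0)
             + s * (e2[i] * e1[j] - e1[i] * e2[j])
             + (c - 1.0) * (e1[i] * e1[j] + e2[i] * e2[j])
             for j in range(n)] for i in range(n)]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def compose(rotations):
    """Apply the rotations in order; the product of orthogonal matrices is orthogonal."""
    n = len(rotations[0])
    q = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for r in rotations:
        q = matmul(r, q)
    return q


def orthogonality_error(q):
    """Largest absolute entry of Q^T Q - I."""
    n = len(q)
    qt = [list(col) for col in zip(*q)]
    prod = matmul(qt, q)
    return max(abs(prod[i][j] - (1.0 if i == j else 0.0))
               for i in range(n) for j in range(n))


# Toy chain of three plane rotations in R^4 (illustrative values only).
chain = compose([
    plane_rotation([1.0, 2.0, 0.0, -1.0], [0.5, -1.0, 3.0, 0.0], 0.3),
    plane_rotation([0.0, 1.0, 1.0, 1.0], [2.0, 0.0, -1.0, 0.5], -1.1),
    plane_rotation([1.0, 0.0, 0.0, 1.0], [0.0, 1.0, -2.0, 0.0], 0.7),
])
err = orthogonality_error(chain)
quarter_turn = plane_rotation([1.0, 0.0], [0.0, 1.0], math.pi / 2)
```

In this sketch each rotation in R^n is described by two vectors and an angle (2n + 1 numbers) rather than a full n x n matrix, which illustrates why low-rank generators can be parameter-efficient.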
[112] MIND: Decoupling Model-Induced Label Noise via Latent Manifold Disentanglement cs.LG | cs.CV
Dayong Ren
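TL;DR: This paper proposes MIND, a theoretical framework for the model-induced label noise produced by automatic annotation with pre-trained experts and foundation models. Through latent manifold disentanglement, the framework decomposes the high-dimensional noise manifold into tractable, subspace-dependent components, making the noise identifiable without ground-truth anchor points.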
Details
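Motivation: Automatic annotation meets the needs of data-hungry applications but introduces model-induced label noise. This noise stems from the annotator's inductive biases and appears as systematic errors tightly coupled with local feature manifolds, which existing methods cannot handle effectively.
Result: In controlled-noise experiments on CIFAR-100 and structural stress tests on large-scale real-world 3D datasets (S3DIS, ScanNet), MIND significantly outperforms state-of-the-art methods and effectively corrects zero-shot hallucinations from vision-language models such as OpenSeg, showing its potential as a robust distillation framework for foundation models.
Insight: The main contribution is the Latent Decoupling Estimator (LDE), which dynamically projects samples into latent structural clusters with consistent error modes. This makes model-induced noise identifiable without relying on ground-truth label anchors and offers a new theoretical and methodological basis for handling structural label noise.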
Abstract: The paradigm of learning from automatic annotations driven by pre-trained experts and Foundation Models dominates data-hungry applications. However, it introduces a critical challenge: model-induced label noise. Unlike stochastic noise in classical robust learning, this noise stems from annotator inductive biases, manifesting as systematic errors tightly coupled with local feature manifolds. Existing methods relying on global transition matrices underfit these structural patterns, while learning instance-specific matrices remains mathematically intractable. We propose Model-Induced Noise Decoupling (MIND), a theoretically grounded framework addressing this dilemma. We demonstrate that the high-dimensional noise manifold can be decoupled into tractable, subspace-dependent components via Latent Manifold Disentanglement. Specifically, our Latent Decoupling Estimator (LDE) dynamically projects samples into latent structural clusters with consistent error modes, facilitating noise identifiability without ground-truth anchor points. To rigorously evaluate robustness, we adopt a hierarchical protocol: moving from controlled noise on CIFAR-100 to a structural stress test on large-scale real-world 3D datasets (S3DIS, ScanNet), where error patterns explicitly couple with geometric manifolds. Empirically, MIND significantly outperforms state-of-the-art methods on these complex benchmarks and effectively corrects zero-shot hallucinations from Vision-Language Models (e.g., OpenSeg), highlighting its potential as a robust distillation framework for Foundation Models.
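The claim that a global transition matrix underfits cluster-dependent error modes can be seen in a toy simulation. The sketch below is not the Latent Decoupling Estimator: it assumes cluster assignments are given and, purely for illustration, uses clean labels that MIND itself does not require. The helper names and the synthetic data are hypothetical.

```python
from collections import defaultdict


def transition_matrix(pairs, num_classes):
    """Row-normalized counts: entry [c][n] estimates P(noisy = n | clean = c)."""
    counts = [[0] * num_classes for _ in range(num_classes)]
    for clean, noisy in pairs:
        counts[clean][noisy] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix


def per_cluster_matrices(samples, num_classes):
    """One transition matrix per latent cluster; samples are (cluster, clean, noisy)."""
    grouped = defaultdict(list)
    for cluster, clean, noisy in samples:
        grouped[cluster].append((clean, noisy))
    return {c: transition_matrix(p, num_classes) for c, p in grouped.items()}


# Synthetic annotator with two opposite, systematic error modes.
samples = (
    [(0, 0, 1)] * 10 + [(0, 1, 1)] * 10    # cluster 0: everything labeled as class 1
    + [(1, 0, 0)] * 10 + [(1, 1, 0)] * 10  # cluster 1: everything labeled as class 0
)

global_t = transition_matrix([(clean, noisy) for _, clean, noisy in samples], 2)
local_t = per_cluster_matrices(samples, 2)
```

Here `global_t` is uniform (every entry 0.5) and looks like unstructured random noise, while each matrix in `local_t` is fully deterministic. Conditioning on the cluster recovers structure that the single global matrix averages away.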
cs.GR
[113] Sound Sparks Motion: Audio and Text Tuning for Video Editing cs.GR | cs.CV | cs.MM | cs.SD
AmirHossein Naghi Razlighi, Aryan Mikaeili, Ali Mahdavi-Amiri, Daniel Cohen-Or, Yiorgos Chrysanthou
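TL;DR: This paper proposes Sound Sparks Motion, a training-free framework that addresses the difficulty existing large generative video models have with motion editing. It tunes the multimodal conditioning signals of an audio-visual video generation model, optimizing only an audio latent and a residual perturbation of the text conditioning at test time, to edit specific, localized actions or state transitions in a video.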
Details
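Motivation: Large generative video models handle appearance changes well but struggle to edit specific, localized actions or state transitions in an existing clip, so a method that can edit motion effectively is needed.
Result: The method realizes motion edits that are hard to achieve with prompt-only control. Tuning is guided by feedback from a vision-language model, while regularization and perceptual-temporal constraints preserve content consistency and visual quality.
Insight: Multimodal conditioning tuning, particularly through the audio pathway, is an effective route to motion-aware video editing. Test-time tuning also serves as a lightweight probing mechanism that reveals latent motion-control directions embedded in the model's multimodal conditioning, and the learned latent controls transfer across videos.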
Abstract: Motion-centric video editing remains difficult for large generative video models, which often respond well to appearance changes but struggle to produce specific, localized actions or state transitions in an existing clip. We introduce Sound Sparks Motion, a training-free framework that enables motion editing in an audio-visual video generation model by tuning its internal multimodal conditioning signals at test time. Rather than modifying model weights, our method tunes only two lightweight variables: an audio latent derived from the source video and a residual perturbation in the text-conditioning. We find that this combination can encourage motion edits that the underlying model often struggles to realize under prompt-only control. Since there is no direct way to evaluate temporal alignment between text and motion, we guide the tuning process using a vision-language model that provides feedback indicating whether the intended motion appears in the generated video. This simple supervision yields an effective semantic objective for motion editing, while regularization and perceptual-temporal constraints help preserve content and visual quality. Beyond per-video tuning, we show that the learned latent controls are transferable across videos, suggesting that they capture reusable motion-edit directions rather than overfitting to a single example. Our results highlight multimodal conditioning tuning, particularly through the audio pathway, as a promising direction for motion-aware video editing, and suggest that test-time tuning can serve as a lightweight probing mechanism that helps reveal latent motion controls embedded in the model’s multimodal conditioning. Code and data are available via our project page: https://amirhossein-razlighi.github.io/Sound_Sparks_Motion/
[114] FFAvatar: Few-Shot, Feed-Forward, and Generalizable Avatar Reconstruction cs.GR | cs.CV | cs.LG
Thuan Hoang Nguyen, Jiahao Luo, Yinyu Nie, Hao Li, Gordon Guocheng Qian
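TL;DR: FFAvatar is a generalizable feed-forward framework that reconstructs high-quality, animatable 3D Gaussian head avatars from a few unposed portrait images in seconds. A Multi-View Query-Former fuses information from multiple source images into a unified canonical Gaussian representation, which is animated with FLAME parameters predicted end-to-end from pixels, with no offline FLAME extraction. Training follows a three-stage curriculum (large-scale monocular video pretraining, high-quality multi-view fine-tuning, and optional personalization) that combines broad generalization with high-fidelity reconstruction.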
Details
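Motivation: Traditional avatar reconstruction relies on time-consuming per-subject optimization or expensive preprocessing, which limits scalability. FFAvatar addresses this with a fast, feed-forward, generalizable framework that reconstructs animatable avatars efficiently from a few images.
Result: On the NeRSemble benchmark, FFAvatar outperforms the state-of-the-art method LAM by 5.5 PSNR, setting a new state of the art. It supports real-time deployment: reconstruction takes 2 seconds without personalization and 10 seconds with personalization, and animation runs at 49 FPS on a single NVIDIA A100 GPU.
Insight: The contributions are: 1) a Multi-View Query-Former that fuses information from multiple images into a canonical space; 2) end-to-end prediction of FLAME parameters from pixels, removing offline preprocessing; 3) a three-stage training curriculum that pairs large-scale data for generalization with a small, high-quality dataset for refinement, balancing generalization and reconstruction fidelity; 4) fast personalization that reaches maximum fidelity within 500 optimization steps.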
Abstract: Avatar reconstruction has traditionally relied on per-subject optimization that requires hours of computation or on expensive preprocessing that limits scalability. We introduce FFAvatar, a generalizable feed-forward framework that reconstructs high-quality, animatable 3D Gaussian head avatars from few-shot unposed portrait images in seconds. FFAvatar fuses information from multiple source images into a unified canonical Gaussian representation through Multi-View Query-Former, which is animated via FLAME parameters predicted end-to-end directly from pixels, eliminating the overhead of offline FLAME extraction. We further propose a three-stage training curriculum that achieves both broad generalization and high-fidelity reconstruction: (i) scalable pretraining on extensive monocular video data with over 1M identities to learn strong generalizable priors; (ii) multi-view fine-tuning on a small but high-quality dataset of 360-degree captures to enhance geometric fidelity and extreme-view awareness; and (iii) optional personalization that adapts to specific identities for maximum fidelity within 500 optimization steps. Extensive experiments demonstrate that FFAvatar sets a new standard for identity preservation, geometric consistency, and animation fidelity. On the NeRSemble benchmark, it outperforms the state-of-the-art LAM by a substantial 5.5 PSNR gain. Furthermore, FFAvatar enables real-time deployment, reconstructing avatars in 2 seconds without personalization and 10 seconds with personalization, while supporting 49 FPS animation on a single NVIDIA A100 GPU.
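For readers gauging the reported 5.5 PSNR gain, the following sketch shows the standard PSNR definition and what a given gain means in terms of mean squared error for a single image pair. It assumes the gain is measured in dB and that pixel values lie in [0, 1]; the abstract does not state the evaluation protocol, and the helper names are hypothetical.

```python
import math


def mse(reference, estimate):
    """Mean squared error between two equally sized flat pixel lists."""
    return sum((x - y) ** 2 for x, y in zip(reference, estimate)) / len(reference)


def psnr(reference, estimate, max_value=1.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(max_value^2 / MSE)."""
    err = mse(reference, estimate)
    if err == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / err)


def mse_ratio_for_gain(gain_db):
    """Factor by which MSE shrinks when PSNR rises by gain_db on the same image."""
    return 10.0 ** (gain_db / 10.0)


ratio = mse_ratio_for_gain(5.5)
```

Under these assumptions, a 5.5 dB gain on one image corresponds to roughly a 3.5-fold reduction in MSE. Benchmark numbers are usually averaged over many images, so the relation holds only approximately at the dataset level.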
[115] Evaluating Design Video Generation: Metrics for Compositional Fidelity cs.GR | cs.AI | cs.CV
Adrienne Deganutti, Dingning Cao, Jaejung Seol, Elad Hirsch, Purvanshi Mehta
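TL;DR: Design animation video generation lacks a standardized evaluation framework. This paper proposes a fully automated framework with four dimensions (layout fidelity, motion correctness, temporal quality, and content fidelity) that replaces subjective human evaluation and provides a common basis for benchmarking in the field.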
Details
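Motivation: Design animation requires specific components to animate with prescribed motion types, directions, speed, and timing, while non-animated regions must remain stable and the layout structure must be preserved. Evaluation methods built for natural video generation do not fit these constraints, and no standardized framework exists for the domain.
Result: The paper presents a fully automated evaluation framework that scores generated videos along four dimensions (layout fidelity, motion correctness, temporal quality, and content fidelity), removing the reliance on subjective human evaluation and establishing a common benchmark for progress in the field.
Insight: This is the first structured, automated, multi-dimensional evaluation framework built for design animation video generation. It emphasizes quantitative measurement of motion constraints and layout stability, giving generative models an objective basis for comparison in this specialized domain.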
Abstract: Generative video models are increasingly used in design animation tasks, yet no standardized evaluation framework exists for this domain. Unlike natural video generation, design animation imposes structured constraints: specific components shall animate with prescribed motion types, directions, speed and timing, while non-animated regions must remain stable and layout structure must be preserved. This paper provides a fully automated evaluation framework organized across four dimensions: layout fidelity, motion correctness, temporal quality, and content fidelity. This eliminates the reliance on subjective human evaluation and establishes a common basis for benchmarking progress in the field.
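One of the constraints named above, that non-animated regions must remain stable, lends itself to a simple automated check. The sketch below is a hypothetical illustration and not a metric from the paper: it assumes grayscale frames stored as nested lists and a boolean mask marking the pixels that are allowed to animate, and it reports the mean absolute frame-to-frame change outside that mask.

```python
def static_region_drift(frames, animated_mask):
    """Mean absolute per-pixel change between consecutive frames outside the mask.

    frames: list of 2D lists of pixel intensities (same shape for every frame).
    animated_mask: 2D list of booleans; True marks pixels allowed to move.
    Returns 0.0 when the non-animated region is perfectly stable.
    """
    total, count = 0.0, 0
    for prev, cur in zip(frames, frames[1:]):
        for y, row in enumerate(cur):
            for x, value in enumerate(row):
                if not animated_mask[y][x]:
                    total += abs(value - prev[y][x])
                    count += 1
    return total / count if count else 0.0


# Toy clip: only the top-left pixel changes over three frames.
frames = [
    [[0, 0], [0, 0]],
    [[1, 0], [0, 0]],
    [[2, 0], [0, 0]],
]
mask_ok = [[True, False], [False, False]]     # the moving pixel is allowed to animate
mask_none = [[False, False], [False, False]]  # nothing is allowed to animate

stable_score = static_region_drift(frames, mask_ok)
leaky_score = static_region_drift(frames, mask_none)
```

A score of zero means the background stayed fixed. When the moving pixel is not covered by the mask, the score rises, flagging motion that leaks into regions that should be static.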