cs.CL

[1] ReaComp: Compiling LLM Reasoning into Symbolic Solvers for Efficient Program Synthesis cs.CL | cs.AI

Atharva Naik, Yash Mathur, Prakam, Carolyn Rose, David Mortensen

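TL;DR: This paper proposes ReaComp, which compiles the reasoning traces of large language models (LLMs) into reusable symbolic program synthesizers for solving program synthesis tasks efficiently. The resulting solvers need no LLM calls at test time. Symbolic solver ensembles substantially improve accuracy on benchmarks such as PBEBench and SLR-Bench while sharply reducing compute cost, and they transfer zero-shot to a historical linguistics task.

Details

Motivation: LLMs are inefficient and unreliable on hard program synthesis tasks that require large combinatorial search. The goal is to compile LLM reasoning ability into efficient symbolic solvers.

Result: Symbolic solver ensembles reach 91.3% accuracy on PBEBench-Lite and 84.7% on PBEBench-Hard; the latter is 16.3 percentage points above LLMs with test-time scaling, at zero LLM inference cost. In a neuro-symbolic hybrid setting, PBEBench-Hard accuracy rises from 68.4% to 85.8% with 78% fewer tokens, and SLR-Bench hard-tier accuracy rises from 34.4% to 58.0%. On the historical linguistics task, accuracy reaches 80.1%.

Insight: The key novelty is compiling LLM reasoning traces into reusable symbolic solvers. This yields Pareto-efficiency gains through zero-token execution, complements LLM search, and offers a scalable route to domain-general solver induction.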

Abstract: LLMs can solve program synthesis tasks but remain inefficient and unreliable on hard instances requiring large combinatorial search. Given a small set of reasoning traces, we use coding agents to compile them into reusable symbolic program synthesizers over constrained DSLs. The resulting solvers require no LLM calls at test time and are strong standalone systems: symbolic solver ensembles reach 91.3% accuracy on PBEBench-Lite and 84.7% on PBEBench-Hard, outperforming LLMs with test-time scaling for the latter by +16.3 percentage points at zero LLM inference cost. They also complement LLM search, improving PBEBench-Hard accuracy from 68.4% to 85.8% while reducing reported token usage by 78%, and raising SLR-Bench hard-tier accuracy from 34.4% to 58.0% in a neuro-symbolic hybrid setting. Compared to directly using coding agents as per-instance solvers, induced solvers are substantially more Pareto-efficient, amortizing a small one-time construction cost over many zero-token executions. Finally, most solvers transfer zero-shot to a real historical linguistics task - predicting sound changes in natural language data - reaching 80.1% accuracy under ensembling and recovering some plausible linguistic rules. Together, these results show that reasoning traces can be compiled into reusable symbolic solvers that solve many tasks directly, complement LLM inference on hard cases, and provide a scalable route to domain-general solver induction. We release code and data for reproducibility.
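
To make the idea of a symbolic solver that runs with zero LLM calls more concrete, here is a minimal sketch of enumerative programming-by-example search. The toy DSL (ordered single-character replacements), the names `apply_program` and `synthesize`, and the search bound `max_rules` are illustrative assumptions; they are not the solvers or DSLs induced in the paper.

```python
from itertools import product

def apply_program(program, text):
    """Apply an ordered list of (src, dst) replacement rules to text."""
    for src, dst in program:
        text = text.replace(src, dst)
    return text

def synthesize(examples, alphabet, max_rules=2):
    """Enumerate programs by increasing length (hypothetical toy DSL).

    Returns the first program consistent with every (input, output)
    example, or None if nothing is found within max_rules rules.
    """
    rules = [(a, b) for a, b in product(alphabet, repeat=2) if a != b]
    for n in range(max_rules + 1):
        for program in product(rules, repeat=n):
            if all(apply_program(program, x) == y for x, y in examples):
                return list(program)
    return None

# Toy input/output pairs resembling a voicing sound change (made up).
examples = [("pata", "bada"), ("tap", "dab")]
program = synthesize(examples, alphabet="abdpt")
```

Once `program` is found, applying it to new inputs costs no tokens, which is the amortization argument the abstract makes; a realistic solver would need a richer DSL and pruning to handle the combinatorial search.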


[2] The Cost of Context: Mitigating Textual Bias in Multimodal Retrieval-Augmented Generation cs.CL | cs.CV | cs.LG

Hoin Jung, Xiaoqian Wang

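TL;DR: This paper studies "recorruption" in multimodal Retrieval-Augmented Generation (RAG): when external documents are introduced, even accurate context can cause a model to abandon an initially correct prediction. A mechanistic analysis traces this to systemic suppression of visual attention and a structural positional bias. To address it, the paper proposes BAIR, a parameter-free inference-time framework that restores visual saliency and suppresses textual distractors, improving diagnostic reliability on multimodal benchmarks.

Details

Motivation: When Multimodal Large Language Models (MLLMs) are combined with RAG, external documents can conceal severe failure modes at the instance level. In particular, recorruption, where accurate context causes the model to abandon a correct prediction, undermines the reliability of RAG.

Result: Across medical factuality, social fairness, and geospatial benchmarks, BAIR restores multimodal grounding and improves diagnostic reliability without retraining or fine-tuning, mitigating textual bias.

Insight: The paper identifies and formalizes recorruption, and uses attention-based diagnosis to reveal a two-fold collapse: visual blindness and positional bias. BAIR is a parameter-free, inference-time intervention that restores visual saliency and applies position-aware penalties to improve robustness, offering a new direction for making multimodal RAG systems more reliable.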

Abstract: While Multimodal Large Language Models (MLLMs) are increasingly integrated with Retrieval-Augmented Generation (RAG) to mitigate hallucinations, the introduction of external documents can conceal severe failure modes at the instance level. We identify and formalize the phenomenon of recorruption, where the introduction of even perfectly accurate “oracle” context causes a capable model to abandon an initially correct prediction. Through a mechanistic diagnosis of internal attention matrices, we show that recorruption is driven by a two-fold attentional collapse: (1) visual blindness, characterized by the systemic suppression of visual attention mass ($M_{vis}$) and sharpness ($S_{vis}$), and (2) a structural positional bias that forces the model to prioritize boundary tokens over semantic relevance. Our analysis reveals an Illusion of Success, demonstrating that many seemingly correct RAG outcomes are merely positional coincidences where the model’s textual copying bias happens to align with the ground-truth location. To address these vulnerabilities, we propose Bottleneck Attention Intervention for Recovery (BAIR), a parameter-free, inference-time framework that restores visual saliency and applies position-aware penalties to textual distractors. Across medical factuality, social fairness, and geospatial benchmarks, BAIR successfully restores multimodal grounding and improves diagnostic reliability without requiring model retraining or fine-tuning.


[3] When2Speak: A Dataset for Temporal Participation and Turn-Taking in Multi-Party Conversations for Large Language Models cs.CL | cs.AI

Vihaan Nama, Shreya Mendi, Zian Ye, Brinnae Bent

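TL;DR: This paper introduces When2Speak, a synthetic dataset for training large language models to decide when to speak (participation timing and turn-taking) in multi-party conversations. Built with a four-stage generation pipeline, it contains over 215,000 examples covering conversations with 2-6 speakers. Supervised fine-tuning on the dataset significantly outperforms zero-shot baselines, but the resulting models remain over-conservative. Reinforcement learning with asymmetric reward shaping substantially improves the Missed Intervention Rate (MIR) and recall.

Details

Motivation: LLMs are poorly calibrated for multi-party conversations, where a model must decide not only what to say but also when to speak, so as to avoid excessive interruptions and preserve conversational coherence.

Result: Across multiple model families, supervised fine-tuning (SFT) on When2Speak significantly outperforms zero-shot baselines (for example, Macro F1 increases by 60% on average across 4B+ parameter models, with a largest increase of 120%). SFT-trained models remain over-conservative, with an average MIR of 0.50. Reinforcement learning with asymmetric reward shaping reduces MIR to 0.186-0.218 and raises recall from 0.479 to 0.78-0.81.

Insight: The novelty lies in a dataset and generation pipeline focused on temporal participation, treating "when to speak" as a trainable dimension of conversational intelligence. Combining synthetic data with reinforcement learning offers a scalable route to timing decisions for LLMs in complex social interactions.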

Abstract: Large Language Models (LLMs) excel at generating contextually appropriate responses but remain poorly calibrated for multi-party conversations, where deciding when to speak is as critical as what to say. In such settings, naively responding at every turn leads to excessive interruptions and degraded conversational coherence. We introduce When2Speak, a grounded synthetic dataset and four-stage generation pipeline for learning intervention timing in group interactions. The dataset comprises over 215,000 examples derived from 16,000 conversations involving 2-6 speakers, spanning diverse conversational styles, tones, and participant dynamics, and explicitly modeling SPEAK vs. SILENT decisions at each turn. Our pipeline combines real-world grounding, structured augmentation, controlled transcript synthesis, and fine-tuning-ready supervision, and is fully open-sourced to support reproducibility and adaptation to domain-specific conversational norms. Across multiple model families, supervised fine-tuning (SFT) on When2Speak significantly outperforms zero-shot baselines (e.g., the average Macro F1 increase across 4B+ parameter models was 60%, with the largest increase being 120%). However, SFT-trained models remain systematically over-conservative, missing nearly half of warranted interventions as seen through the Missed Intervention Rate (MIR), which was on average 0.50 and is noticed even at larger model sizes. To address this limitation, we apply reinforcement learning with asymmetric reward shaping, which reduces MIR to 0.186-0.218 and increases recall from 0.479 to 0.78-0.81. Our findings establish that temporal participation is a distinct and trainable dimension of conversational intelligence, and that grounded synthetic data provides an effective and scalable pathway for enabling LLMs to participate more naturally and appropriately in multi-party interactions.
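
The SPEAK vs. SILENT framing and the reported metrics can be illustrated with a short sketch. The abstract does not give the exact formula for MIR; the code below assumes it is the fraction of warranted interventions (gold SPEAK turns) that the model leaves SILENT, which is consistent with the reported numbers (recall 0.479 alongside MIR near 0.50). The function name `speak_metrics` and the toy labels are hypothetical.

```python
def speak_metrics(gold, pred):
    """Per-turn SPEAK/SILENT metrics.

    Assumption: MIR = FN / (TP + FN), i.e. 1 - recall on the SPEAK class.
    """
    pairs = list(zip(gold, pred))
    tp = sum(1 for g, p in pairs if g == "SPEAK" and p == "SPEAK")
    fn = sum(1 for g, p in pairs if g == "SPEAK" and p == "SILENT")
    fp = sum(1 for g, p in pairs if g == "SILENT" and p == "SPEAK")
    warranted = tp + fn
    recall = tp / warranted if warranted else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    mir = fn / warranted if warranted else 0.0
    return {"precision": precision, "recall": recall, "mir": mir}

# Made-up labels for six turns of one conversation.
gold = ["SPEAK", "SPEAK", "SILENT", "SPEAK", "SILENT", "SPEAK"]
pred = ["SPEAK", "SILENT", "SILENT", "SILENT", "SPEAK", "SPEAK"]
metrics = speak_metrics(gold, pred)
```

Under this assumed definition, an over-conservative model shows up as low recall and high MIR even when precision looks acceptable, which is the failure pattern the abstract attributes to SFT-only training.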


[4] Estimating the Black-box LLM Uncertainty with Distribution-Aligned Adversarial Distillation cs.CL

Huizi Cui, Huan Ma, Qilin Wang, Yuhang Gao, Changqing Zhang

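TL;DR: This paper proposes Distribution-Aligned Adversarial Distillation (DisAAD) for estimating the uncertainty of commercial black-box large language models (LLMs) that are accessible only via APIs. A generation-discrimination architecture guides a lightweight proxy model to learn the high-quality regions of the black-box LLM's output distribution, so that the proxy can judge whether the black-box LLM "knows" the answer; uncertainty is then estimated with evidence learning. Experiments show that reliable uncertainty quantification is achievable even when the proxy is only 1% of the target LLM's size.

Details

Motivation: Hallucination is a bottleneck for deploying black-box LLMs that are accessible only via APIs. Existing uncertainty quantification methods typically rely on computationally expensive multiple sampling or on internal parameters, which prevents real-time estimation and fails to capture information implicit in the black-box reasoning process.

Result: Extensive experiments verify the effectiveness and promise of the method, indicating that a proxy model only 1% of the target LLM's size can achieve reliable uncertainty quantification.

Insight: The novelty is the DisAAD architecture, in which a generation-discrimination process lets a lightweight proxy learn the characteristics of the black-box LLM's output distribution. Combined with evidence learning, this enables efficient, real-time uncertainty quantification without access to internal parameters or expensive sampling.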

Abstract: Large language models (LLMs) have progressed rapidly in complex reasoning and question answering, yet LLM hallucination remains a central bottleneck that hinders practical deployment, especially for commercial black-box LLMs accessible only via APIs. Existing uncertainty quantification methods typically depend on computationally expensive multiple sampling or internal parameters, which prevents real-time estimation and fails to capture information implicit in the black-box reasoning process. To address this issue, we propose Distribution-Aligned Adversarial Distillation (DisAAD), which introduces a generation-discrimination architecture to guide a lightweight proxy model to learn the high-quality regions of the output distribution of the black-box LLM, thus effectively endowing it with the ability to know whether the black-box LLM knows or not. Subsequently, we use the proxy model to reproduce the specific responses of the black-box LLM and estimate the corresponding uncertainty based on evidence learning. Extensive experiments have verified the effectiveness and promise of our proposed method, indicating that even a proxy model that accounts for only 1% of the target LLM’s size can achieve reliable uncertainty quantification.


[5] Evaluation Awareness in Language Models Has Limited Effect on Behaviour cs.CL | cs.CY

Amelie Knecht, Lucas Florin, Thilo Hagendorff

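TL;DR: This paper examines whether evaluation awareness expressed in the chain of thought of large reasoning models affects their behaviour. Experiments show that both injecting and removing evaluation awareness have very limited effect on behaviour, suggesting that evaluation awareness may not pose as significant a safety risk as the existing literature assumes.

Details

Motivation: Researchers worry that evaluation awareness verbalised in the chain of thought of large reasoning models leads them to adapt their outputs strategically to perceived evaluation criteria, for example appearing safer than they actually are. Whether this effect actually exists has been unclear.

Result: On benchmarks covering safety, alignment, moral reasoning, and political opinion, injecting evaluation awareness has near-zero effect ($\omega \leq 0.06$), removing it causes only small shifts ($\omega \leq 0.12$), and spontaneously occurring evaluation awareness shifts answer distributions by at most 3.7 percentage points ($\omega \leq 0.31$).

Insight: The paper systematically quantifies the actual behavioural effect of evaluation awareness. Its analysis indicates that high evaluation-awareness rates should not be read directly as evidence of strategic behaviour or alignment tampering, which challenges assumptions in the existing literature and supports a more cautious view of model safety evaluation.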

Abstract: Large reasoning models (LRMs) sometimes note in their chain of thought (CoT) that they may be under evaluation. Researchers worry that this verbalised evaluation awareness (VEA) causes models to adapt their outputs strategically, optimising for perceived evaluation criteria, which, for instance, can make models appear safer than they actually are. However, whether VEA actually has this effect is largely unknown. We tested this across open-weight LRMs and benchmarks covering safety, alignment, moral reasoning, and political opinion. We tested this both on-policy, sampling multiple CoTs per item and comparing those that spontaneously contained VEA against those that did not, and off-policy, using model prefilling to inject evaluation-aware sentences where missing and remove them where present, with subsequent resampling. VEA has limited effect on model behaviour: injecting VEA into CoTs produces near-zero effects ($\omega \leq 0.06$), removing it causes small shifts ($\omega \leq 0.12$) and spontaneously occurring VEA shifts answer distributions by at most 3.7 percentage points ($\omega \leq 0.31$). Our findings call for caution when interpreting high VEA rates as evidence of strategic behaviour or alignment tampering. Evaluation awareness may pose a smaller safety risk than the current literature assumes.
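
The abstract reports effect sizes as ω without defining the statistic. A plausible reading, stated here as an assumption, is Cohen's w for comparing two categorical distributions, for which 0.1, 0.3, and 0.5 are the conventional small, medium, and large thresholds. The sketch below computes it for answer distributions with and without VEA; the function name `cohen_w` and the example proportions are hypothetical.

```python
import math

def cohen_w(p_ref, p_alt):
    """Cohen's w between two categorical distributions.

    Assumption: this is the omega reported in the abstract.
    p_ref is the reference distribution (e.g. answers without VEA),
    p_alt the comparison distribution (e.g. answers with VEA).
    """
    if len(p_ref) != len(p_alt):
        raise ValueError("distributions must have the same number of categories")
    return math.sqrt(
        sum((a - r) ** 2 / r for r, a in zip(p_ref, p_alt) if r > 0)
    )

# Made-up answer proportions over two options.
w = cohen_w([0.5, 0.5], [0.6, 0.4])
```

Under this reading, values such as 0.06 or 0.12 fall at or near the conventional "small" threshold, which matches the abstract's description of near-zero and small effects.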


[6] Logic-Regularized Verifier Elicits Reasoning from LLMs cs.CL | cs.AI

Xinyu Wang, Changzhi Sun, Lian Cheng, Yuanbin Wu, Dell Zhang

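TL;DR: This paper proposes LOVER, an unsupervised verifier regularized by logical rules, to strengthen the reasoning of large language models. LOVER treats the verifier as a binary latent variable, uses the model's internal activations, and enforces three logical constraints (negation consistency, intra-group consistency, and inter-group consistency). It requires no labeled data and is compatible with off-the-shelf LLMs. Experiments on 10 datasets show that LOVER significantly outperforms unsupervised baselines and reaches 95% of the supervised verifier's performance on average.

Details

Motivation: Existing verifiers depend on resource-intensive construction of supervised datasets, which is costly and limits data diversity. This paper aims to build an unsupervised verifier that needs no labeled data by regularizing it with logical rules.

Result: Experiments on 10 datasets show that LOVER significantly outperforms unsupervised baselines and reaches 95% of the supervised verifier's performance on average, approaching supervised methods.

Insight: The novelty is modeling the verifier as a binary latent variable and introducing logical rules as priors (negation, intra-group, and inter-group consistency). This enables unsupervised verification without labeled data, applies directly to any off-the-shelf LLM, and provides an efficient, scalable way to enhance reasoning.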

Abstract: Verifiers are crucial components for enhancing modern LLMs’ reasoning capability. Typical verifiers require resource-intensive supervised dataset construction, which is costly and faces limitations in data diversity. In this paper, we propose LOVER, an unsupervised verifier regularized by logical rules. LOVER treats the verifier as a binary latent variable, utilizing internal activations and enforcing three logical constraints on multiple reasoning paths: negation consistency, intra-group consistency, and inter-group consistency (grouped by the final answer). By incorporating logical rules as priors, LOVER can leverage unlabeled examples and is directly compatible with any off-the-shelf LLMs. Experiments on 10 datasets demonstrate that LOVER significantly outperforms unsupervised baselines, achieving performance comparable to the supervised verifier (reaching its 95% level on average). The source code is publicly available at https://github.com/wangxinyufighting/llm-lover.


[7] Lightweight Stylistic Consistency Profiling: Robust Detection of LLM-Generated Textual Content for Multimedia Moderation cs.CL

Siyuan Li, Aodu Wulianghai, Xi Lin, Xibin Yuan, Qinghua Mao

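TL;DR: This paper proposes LiSCP, a lightweight stylistic consistency profiling method for robust detection of LLM-generated text. It builds a consistency profile that combines discrete stylistic features with continuous semantic signals, and exploits stylistic stability across multimodal-guided paraphrased text variants to improve detection under adversarial attacks and in cross-domain settings.

Details

Motivation: As LLMs become common in content creation, distinguishing human-written from LLM-generated text has become a key task for multimedia content moderation. Existing detectors rely on statistical cues or model-specific heuristics and are vulnerable to paraphrasing and adversarial manipulation, which limits their robustness and interpretability.

Result: In experiments on real-world multimedia news and movie datasets and on conventional text domains, LiSCP performs strongly on in-domain detection and outperforms existing approaches by up to 11.79% in cross-domain settings. It is also notably robust in adversarial scenarios, including adversarial attacks and hybrid human-AI settings.

Insight: The novelty is a consistency profile that combines discrete stylistic features with continuous semantic signals and uses stylistic stability across multimodal-guided paraphrases to make detection more robust. The lightweight design and the focus on adversarial manipulation yield a more stable and interpretable detector of LLM-generated text.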

Abstract: The increasing prevalence of Large Language Models (LLMs) in content creation has made distinguishing human-written textual content from LLM-generated counterparts a critical task for multimedia moderation. Existing detectors often rely on statistical cues or model-specific heuristics, making them vulnerable to paraphrasing and adversarial manipulations, and consequently limiting their robustness and interpretability. In this work, we propose LiSCP, a novel lightweight stylistic consistency profiling method for robust detection of LLM-generated textual content, focusing on feature stability under adversarial manipulation. Our approach constructs a consistency profile that combines discrete stylistic features with continuous semantic signals, leveraging stylistic stability across multimodal-guided paraphrased text variants. Experiments spanning real-world multimedia news and movie datasets and conventional text domains demonstrate that LiSCP achieves superior performance on in-domain detection and outperforms existing approaches by up to 11.79% in cross-domain settings. Additionally, it demonstrates notable robustness under adversarial scenarios, including adversarial attacks and hybrid human-AI settings.


[8] Hallucination as an Anomaly: Dynamic Intervention via Probabilistic Circuits cs.CL | cs.AI

Erik Nielsen, Elia Cunegatti, Marcus Vukojevic, Giovanni Iacca

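TL;DR: The paper proposes PCNET, a probabilistic-circuit method for detecting hallucinations (factually incorrect generations) in large language models. PCNET treats hallucinations as geometric anomalies on the factual manifold and detects them via exact negative log-likelihood computation, without sampling, external verifiers, or weight modifications. Building on PCNET, the paper proposes the PC-LDCD decoding strategy, which intervenes only when a hallucination is detected and therefore avoids corrupting generations that were already correct.

Details

Motivation: Existing hallucination-correction methods typically apply corrections indiscriminately to every token, which often corrupts generations that were originally correct. The paper aims to identify hallucinations precisely and to intervene only when they occur.

Result: Across four LLMs ranging from 1B to 8B parameters and four benchmarks covering conversational reasoning, knowledge-intensive QA, reading comprehension, and truthfulness, PCNET achieves near-perfect hallucination detection on CoQA, SQuAD v2.0, and TriviaQA (AUROC up to 99%). On TruthfulQA, PC-LDCD obtains the highest True+Info, MC2, and MC3 scores in three out of four models compared with state-of-the-art baselines, while reducing the mean corruption rate to 53.7% and achieving a preservation rate of 79.3%.

Insight: The core novelty is modeling hallucination as a geometric anomaly and using a probabilistic circuit, whose likelihood can be computed exactly, as the density estimator for detection. This gives efficient, exact detection without external tools or model modification. The second contribution is the dynamically gated decoding strategy (PC-LDCD), which intervenes only when needed, correcting hallucinations while preserving as much of the model's originally correct output as possible.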

Abstract: One of the most critical challenges in Large Language Models is their tendency to hallucinate, i.e., produce factually incorrect responses. Existing approaches show promising results in terms of hallucination correction, but still suffer from a main limitation: they apply corrections indiscriminately to every token, corrupting also the originally correct generations. To overcome this drawback, we propose PCNET, a Probabilistic Circuit trained as a tractable density estimator over the LLM residual stream. The method detects hallucinations as geometric anomalies on the factual manifold, which is done via exact Negative Log-Likelihood computation, hence without the need for sampling, external verifiers, or weight modifications, as in existing techniques. To demonstrate its effectiveness, we exploit PCNET as a dynamic gate that distinguishes hallucinated from factual hidden states at each decoding step. This triggers our second main contribution, PC-LDCD (Probabilistic Circuit Latent Density Contrastive Decoding), only when the latent geometry deviates from factual regions, while leaving correct generations untouched. Across four LLMs, ranging from 1B to 8B models, and four benchmarks covering conversational reasoning, knowledge-intensive QA, reading comprehension, and truthfulness, PCNET achieves near-perfect hallucination detection across CoQA, SQuAD v2.0, and TriviaQA, with AUROC reaching up to 99%. Moreover, PC-LDCD obtains the highest True+Info, MC2, and MC3 scores on TruthfulQA in three out of four models, in comparison with state-of-the-art baselines, while reducing the mean corruption rate to 53.7% and achieving a preservation rate of 79.3%. Our proposed method is publicly available on GitHub.
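
The gating logic described above (intervene only when a hidden state has low likelihood under a density fitted to factual states) can be sketched as follows. For brevity, the sketch swaps in a diagonal Gaussian as the density estimator; it is a stand-in and not a probabilistic circuit. The function names, the toy 2-dimensional "hidden states", and the threshold choice (the largest NLL seen on factual states) are all assumptions for illustration.

```python
import math

def fit_diag_gaussian(states):
    """Fit per-dimension mean and variance (stand-in density estimator)."""
    n, dim = len(states), len(states[0])
    mean = [sum(s[i] for s in states) / n for i in range(dim)]
    var = [
        max(sum((s[i] - mean[i]) ** 2 for s in states) / n, 1e-6)
        for i in range(dim)
    ]
    return mean, var

def nll(state, mean, var):
    """Exact negative log-likelihood under the diagonal Gaussian."""
    return sum(
        0.5 * math.log(2 * math.pi * v) + (x - m) ** 2 / (2 * v)
        for x, m, v in zip(state, mean, var)
    )

def should_intervene(state, mean, var, threshold):
    """Gate: trigger corrective decoding only for anomalous states."""
    return nll(state, mean, var) > threshold

# Toy hidden states assumed to come from factual generations.
factual_states = [[0.0, 1.0], [0.2, 0.8], [-0.1, 1.1], [0.1, 0.9]]
mean, var = fit_diag_gaussian(factual_states)
threshold = max(nll(s, mean, var) for s in factual_states)

typical = should_intervene([0.05, 0.95], mean, var, threshold)
anomalous = should_intervene([3.0, -2.0], mean, var, threshold)
```

The design point is that states close to the factual region pass through untouched, so a correction step (PC-LDCD in the paper) would run only on the anomalous ones.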


[9] TableVista: Benchmarking Multimodal Table Reasoning under Visual and Structural Complexity cs.CL | cs.CV

Zheyuan Yang, Liqiang Shang, Junjie Chen, Xun Yang, Chenglong Xu

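TL;DR: This paper presents TableVista, a comprehensive benchmark for evaluating foundation models on multimodal table reasoning under visual and structural complexity. It contains 3,000 high-quality table reasoning problems, each expanded into 10 distinct visual variants by a multi-style rendering and transformation pipeline, for 30,000 multimodal samples in total. An extensive evaluation of 29 state-of-the-art open-source and proprietary foundation models finds that they are largely stable across rendering styles but degrade markedly on complex structural layouts and in vision-only settings, showing that current models struggle to keep their reasoning consistent when structural complexity combines with visually integrated presentation.

Details

Motivation: There has been no benchmark that comprehensively evaluates foundation models on multimodal table reasoning under visual and structural complexity. The authors introduce one with diverse visual variants and complex structural layouts to expose the limitations of existing models.

Result: 29 state-of-the-art open-source and proprietary foundation models were evaluated on TableVista. Quantitative and qualitative analysis shows that the models are largely stable across rendering styles but degrade markedly on complex structural layouts and in vision-only settings, revealing insufficient reasoning consistency when structural complexity is combined with visually integrated presentation.

Insight: The novelty is a comprehensive multimodal table reasoning benchmark whose multi-style rendering and transformation pipeline produces diverse visual variants and supports multi-dimensional evaluation. By systematically combining visual and structural complexity, the study exposes key capability gaps of current multimodal models in table understanding and informs the development of more robust and reliable models.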

Abstract: We introduce TableVista, a comprehensive benchmark for evaluating foundation models in multimodal table reasoning under visual and structural complexity. TableVista consists of 3,000 high-quality table reasoning problems, where each instance is expanded into 10 distinct visual variants through our multi-style rendering and transformation pipeline. This process encompasses diverse scenario styles, robustness perturbations, and vision-only configurations, culminating in 30,000 multimodal samples for a multi-dimensional evaluation. We conduct an extensive evaluation of 29 state-of-the-art open-source and proprietary foundation models on TableVista. Through comprehensive quantitative and qualitative analysis, we find that while evaluated models remain largely stable across diverse rendering styles, they exhibit pronounced performance degradation on complex structural layouts and vision-only settings, revealing that current models struggle to maintain reasoning consistency when structural complexity combines with visually integrated presentations. These findings highlight critical gaps in current multimodal capabilities, providing insights for advancing more robust and reliable table understanding models.


[10] From Articles to Premises: Building PrimeFacts, an Extraction Methodology and Resource for Fact-Checking Evidence cs.CL

Premtim Sahitaj, Jawan Kolanowski, Ariana Sahitaj, Veronika Solopova, Max Upravitelev

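TL;DR: This paper presents PrimeFacts, a methodology and resource for extracting fine-grained evidence from fact-checking articles. It uses large language models to rewrite anchor sentences in the articles into stand-alone, context-independent "premises", and builds a large-scale dataset of 13,106 PolitiFact articles with their associated evidence. Experiments show that the extracted evidence substantially improves cross-article evidence retrieval and claim verification.

Details

Motivation: Automated verification systems struggle to use the rich evidence and reasoning in fact-checking articles because it is presented in unstructured form. A method is needed to extract and structure this evidence for automated use.

Result: In cross-article evidence retrieval, decontextualized evidence achieves up to a 30% relative gain in Mean Reciprocal Rank over verbatim sentences. In claim verification, using the evidence for verdict prediction raises Macro-F1 by 10-20 points over the baseline. These gains are consistent across verdict granularities (2-class vs. 5-class) and model architectures.

Insight: The novelty is using in-article hyperlinks as natural anchors to locate key evidence and using LLMs to rewrite it into stand-alone premises. This provides a large-scale, structured evidence resource for automated fact-checking and shows that decontextualization improves both the retrievability of evidence and downstream task performance.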

Abstract: Fact-checking articles encode rich supporting evidence and reasoning, yet this evidence remains largely inaccessible to automated verification systems due to unstructured presentation. We introduce PrimeFacts, a methodology and resource for extracting fine-grained evidence from full fact-checking articles. We compile 13,106 PolitiFact articles with claims, verdicts, and all referenced sources, and we identify 49,718 in-article hyperlinks as natural anchors to pinpoint key evidence. Our framework leverages large language models (LLMs) to rewrite these anchor sentences into stand-alone, context-independent premises and investigates the extraction of additional implicit evidence. In evaluations on cross-article evidence retrieval and claim verification, the extracted premises substantially improve performance. Decontextualized evidence yields higher retrievability, achieving up to a 30 percent relative gain in Mean Reciprocal Rank over verbatim sentences, and using the evidence for verdict prediction raises Macro-F1 by 10-20 points over the baseline. These gains are consistent across different verdict granularities (2-class vs. 5-class) and model architectures. A qualitative analysis indicates that the decontextualized premises remain faithful to the original sources. Our work highlights the promise of reusing fact-checkers’ evidence for automation and provides a large-scale resource of structured evidence from real-world fact-checks.
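
For readers less familiar with the retrieval metric, the sketch below shows how Mean Reciprocal Rank and a relative gain of the kind reported above are computed. The function names and every number in the example are hypothetical; none of them are taken from the paper's experiments.

```python
def mean_reciprocal_rank(first_relevant_ranks):
    """MRR over queries.

    Each entry is the 1-based rank of the first relevant result,
    or None when no relevant result was retrieved (contributes 0).
    """
    reciprocal = [1.0 / r if r else 0.0 for r in first_relevant_ranks]
    return sum(reciprocal) / len(reciprocal)

def relative_gain(baseline, improved):
    """Relative improvement of `improved` over `baseline`."""
    return (improved - baseline) / baseline

# Made-up ranks for three queries under one retrieval setup.
mrr = mean_reciprocal_rank([1, 2, 4])

# Made-up MRR values showing what a 30% relative gain looks like.
gain = relative_gain(0.50, 0.65)
```

Note that a 30% relative gain is not a 30-point absolute gain: in the made-up example, MRR moves from 0.50 to 0.65.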


[11] Milestone-Guided Policy Learning for Long-Horizon Language Agents cs.CL | cs.AI

Zixuan Wang, Yuchen Yan, Hongxing Li, Teng Pan, Dingming Li

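TL;DR: This paper proposes BEACON, a milestone-guided policy learning framework that addresses credit misattribution and sample inefficiency when training long-horizon language agents. Through task decomposition, segment-level reward shaping, and dual-scale advantage estimation, it substantially improves agent performance on complex sequential decision-making tasks.

Details

Motivation: Long-horizon agentic tasks require many sequential decisions, which makes reinforcement learning difficult. The paper targets two root causes: credit misattribution (correct early actions are penalized because of terminal failures) and sample inefficiency (successful trajectories are scarce, so the learning signal is almost entirely lost).

Result: On ALFWorld, WebShop, and ScienceWorld, BEACON consistently outperforms GRPO and GiGPO. On long-horizon ALFWorld tasks in particular, BEACON reaches a 92.9% success rate, nearly double GRPO's 53.5%, and raises effective sample utilization from 23.7% to 82.0%.

Insight: The core novelty is exploiting the compositional structure of long-horizon tasks through milestone-anchored credit assignment: trajectories are partitioned at milestone boundaries, temporal reward shaping within segments credits partial progress, and dual-scale advantage estimation prevents distant failures from corrupting the evaluation of local actions. This gives an effective, general framework for training long-horizon language agents.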

Abstract: While long-horizon agentic tasks require language agents to perform dozens of sequential decisions, training such agents with reinforcement learning remains challenging. We identify two root causes: credit misattribution, where correct early actions are penalized due to terminal failures, and sample inefficiency, where scarce successful trajectories result in near-total loss of learning signal. We introduce a milestone-guided policy learning framework, BEACON, that leverages the compositional structure of long-horizon tasks to ensure precise credit assignment. BEACON partitions trajectories at milestone boundaries, applies temporal reward shaping within segments to credit partial progress, and estimates advantages at dual scales to prevent distant failures from corrupting the evaluation of local actions. On ALFWorld, WebShop, and ScienceWorld, BEACON consistently outperforms GRPO and GiGPO. Notably, on long-horizon ALFWorld tasks, BEACON achieves 92.9% success rate, nearly doubling GRPO’s 53.5%, while improving effective sample utilization from 23.7% to 82.0%. These results establish milestone-anchored credit assignment as an effective paradigm for training long-horizon language agents. Code is available at https://github.com/ZJU-REAL/BEACON.


[12] Uncovering Entity Identity Confusion in Multimodal Knowledge Editing cs.CL | cs.CV

Shu Wu, Xiaotian Ye, Xinyu Mou, Dongsheng Liu, Xiaohan Wang

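TL;DR: This paper studies a systemic failure mode that appears after multimodal knowledge editing (MKE), Entity Identity Confusion (EIC): for text-only queries, the edited model wrongly associates the original entity's identity with information about the new entity. The authors build the diagnostic benchmark EC-Bench and show that EIC arises because existing methods fail to distinguish image-entity binding from entity-entity relational knowledge, so the model over-relies on the latter as a shortcut. They also explore mitigating EIC by constraining edits to the image-entity processing stage, and propose principled requirements for faithful MKE.

Details

Motivation: The behavioural patterns of models after multimodal knowledge editing are underexplored. The paper sets out to identify and analyze Entity Identity Confusion, a systemic failure of edited models.

Result: Analysis on the constructed diagnostic benchmark EC-Bench shows that constraining edits to the image-entity processing stage substantially reduces EIC, providing methodological guidance for future research.

Insight: The paper is the first to systematically identify and formalize Entity Identity Confusion in MKE, and it traces the root cause to conflating image-entity binding with entity-entity relational knowledge. The study underlines the importance of distinguishing types of knowledge representation in multimodal editing and offers a new perspective for designing more faithful editing methods.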

Abstract: Multimodal knowledge editing (MKE) aims to correct the internal knowledge of large vision-language models after deployment, yet the behavioral patterns of post-edit models remain underexplored. In this paper, we identify a systemic failure mode in edited models, termed Entity Identity Confusion (EIC): edited models exhibit an absurd behavior where text-only queries about the original entity’s identity unexpectedly return information about the new entity. To rigorously investigate EIC, we construct EC-Bench, a diagnostic benchmark that directly probes how image-entity bindings shift before and after editing. Our analysis reveals that EIC stems from existing methods failing to distinguish between Image-Entity (I-E) binding and Entity-Entity (E-E) relational knowledge in the model, causing models to overfit E-E associations as a shortcut: the image is still perceived as the original entity, with the new entity’s name serving only as a spurious identity label. We further explore potential mitigation strategies, showing that constraining edits to the model’s I-E processing stage encourages edits to act more faithfully on I-E binding, thereby substantially reducing EIC. Based on these findings, we discuss principled desiderata for faithful MKE and provide methodological guidance for future research.


[13] MemReranker: Reasoning-Aware Reranking for Agent Memory Retrieval cs.CL

Chunyu Li, Jingyi Kang, Ding Chen, Mengyuan Zhang, Jiajun Shen

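TL;DR: MemReranker is a family of reasoning-aware reranking models for agent memory retrieval, built through multi-stage LLM knowledge distillation. It targets a weakness of conventional rerankers, which lack reasoning ability and therefore return results that are semantically relevant but do not contain the needed information.

Details

Motivation: In agent memory systems, conventional rerankers based on semantic similarity fail to retrieve the key information for complex queries involving temporal constraints, causal reasoning, and coreference resolution.

Result: On the memory retrieval benchmark, MemReranker-0.6B substantially outperforms BGE-Reranker and matches open-source 4B/8B models and GPT-4o-mini on key metrics. MemReranker-4B reaches 0.737 MAP, with several metrics on par with Gemini-3-Flash, at only 10-20% of the inference latency of large models. On finance and healthcare vertical-domain benchmarks, its generalization is on par with mainstream large-parameter rerankers.

Insight: The contributions are multi-teacher pairwise comparisons that generate calibrated soft labels, BCE pointwise distillation that establishes well-distributed scores, and InfoNCE contrastive learning that strengthens hard-sample discrimination. Training on general corpora combined with memory-specific multi-turn dialogue data improves performance in complex reasoning scenarios.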

Abstract: In agent memory systems, the reranking model serves as the critical bridge connecting user queries with long-term memory. Most systems adopt the “retrieve-then-rerank” two-stage paradigm, but generic reranking models rely on semantic similarity matching and lack genuine reasoning capabilities, leading to a problem where recalled results are semantically highly relevant yet do not contain the key information needed to answer the question. This deficiency manifests in memory scenarios as three specific problems. First, relevance scores are miscalibrated, making threshold-based filtering difficult. Second, ranking degrades when facing temporal constraints, causal reasoning, and other complex queries. Third, the model cannot leverage dialogue context for semantic disambiguation. This report introduces MemReranker, a reranking model family (0.6B/4B) built on Qwen3-Reranker through multi-stage LLM knowledge distillation. Multi-teacher pairwise comparisons generate calibrated soft labels, BCE pointwise distillation establishes well-distributed scores, and InfoNCE contrastive learning enhances hard-sample discrimination. Training data combines general corpora with memory-specific multi-turn dialogue data covering temporal constraints, causal reasoning, and coreference resolution. On the memory retrieval benchmark, MemReranker-0.6B substantially outperforms BGE-Reranker and matches open-source 4B/8B models as well as GPT-4o-mini on key metrics. MemReranker-4B further achieves 0.737 MAP, with several metrics on par with Gemini-3-Flash, while maintaining inference latency at only 10–20% of large models. On finance and healthcare vertical-domain benchmarks, the models preserve generalization capabilities on par with mainstream large-parameter rerankers.
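
Two of the training objectives named above have standard textbook forms, sketched here for a single query. The abstract does not specify the exact formulations, the temperature, or how the stages are combined, so the functions below (`bce_soft`, `info_nce`) and all input values are illustrative assumptions rather than the authors' implementation.

```python
import math

def bce_soft(logit, soft_label):
    """Pointwise binary cross-entropy against a soft label in [0, 1].

    The soft label stands in for a calibrated teacher score (assumed).
    """
    p = 1.0 / (1.0 + math.exp(-logit))
    eps = 1e-12  # guards against log(0)
    return -(soft_label * math.log(p + eps)
             + (1.0 - soft_label) * math.log(1.0 - p + eps))

def info_nce(pos_logit, neg_logits, temperature=1.0):
    """InfoNCE: negative log-softmax of the positive among all candidates."""
    logits = [pos_logit / temperature] + [n / temperature for n in neg_logits]
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[0]

# Made-up scores: one positive memory and two hard negatives.
easy = info_nce(5.0, [0.0, 0.0])
hard = info_nce(0.5, [0.4, 0.3])
pointwise = bce_soft(0.0, 0.5)
```

The contrast between `easy` and `hard` shows why a contrastive term helps with hard samples: the loss stays large until the positive is clearly separated from near-miss negatives, whereas the pointwise term only pulls each score toward its own soft label.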


[14] HNC: Leveraging Hard Negative Captions towards Models with Fine-Grained Visual-Linguistic Comprehension Capabilities cs.CL | cs.AI | cs.CV

Esra Dönmez, Pascal Tilli, Hsiu-Yu Yang, Thang Vu, Carina Silberer

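TL;DR: This paper proposes Hard Negative Captions (HNC), an automatically generated dataset for strengthening fine-grained cross-modal comprehension in Image-Text-Matching (ITM) training of vision-language (VL) models. By introducing challenging negative captions, models become more robust on zero-shot diagnostic tasks and under noisy visual input, and provide a better initialization for fine-tuning on downstream tasks.

Details

Motivation: Web-collected image-text pairs are only weakly associated, so models fail to reach a fine-grained understanding of the combined semantics of the two modalities. A method is needed to improve fine-grained cross-modal comprehension.

Result: Training on HNC improves models' zero-shot ability on diagnostic tasks for detecting fine-grained cross-modal mismatches and keeps them robust under noisy visual input. HNC models also provide a comparable or better initialization for downstream fine-tuning.

Insight: The core novelty is automatically constructing the HNC dataset of hard negative captions to sharpen fine-grained semantic discrimination in ITM training. The paper also provides a manually created test set of fine-grained cross-modal mismatches with varying levels of compositional complexity for benchmarking. Training directly against model weaknesses through data augmentation is an effective and scalable way to improve comprehension.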

Abstract: Image-Text-Matching (ITM) is one of the de facto methods of learning generalized representations from a large corpus in Vision and Language (VL). However, due to the weak association between the web-collected image-text pairs, models fail to show a fine-grained understanding of the combined semantics of these modalities. To address this issue, we propose Hard Negative Captions (HNC): an automatically created dataset containing foiled hard negative captions for ITM training towards achieving fine-grained cross-modal comprehension in VL. Additionally, we provide a challenging manually-created test set for benchmarking models on a fine-grained cross-modal mismatch task with varying levels of compositional complexity. Our results show the effectiveness of training on HNC by improving the models’ zero-shot capabilities in detecting mismatches on diagnostic tasks and performing robustly under noisy visual input scenarios. Also, we demonstrate that HNC models yield a comparable or better initialization for fine-tuning


[15] A$^2$TGPO: Agentic Turn-Group Policy Optimization with Adaptive Turn-level Clipping cs.CL

Dingwei Chen, Zefang Zong, Zhipeng Ma, Leo Luo, Yang Li

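TL;DR: This paper proposes A$^2$TGPO, which addresses process credit assignment in reinforcement learning for agentic large language models based on Information Gain (IG). It introduces three improvements (turn-group normalization, variance-rescaled discounted accumulation, and adaptive turn-level clipping) that change how the IG signal is used inside the training loop.

Details

Motivation: Existing methods that use Information Gain as an intrinsic process signal face three systematic challenges: normalization across turns is distorted, advantage magnitudes drift with trajectory depth, and a fixed clipping range updates turns of very different informativeness poorly.

Result: According to the summary, experiments on multiple agentic benchmarks show that A$^2$TGPO significantly improves training efficiency and final performance while preserving trajectory diversity, reaching state-of-the-art results.

Insight: The novelty is redesigning how the IG signal is normalized, accumulated, and consumed, in particular by normalizing within (prompt, turn-index) groups and by adapting the clipping range. This provides a reusable framework for fine-grained credit assignment in reinforcement learning with intrinsic rewards.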

Abstract: Reinforcement learning for agentic large language models (LLMs) typically relies on a sparse, trajectory-level outcome reward, making it difficult to evaluate the contribution of individual tool-calls within multi-turn interactions. Existing approaches to such process credit assignment either depend on separate external process reward models that introduce additional consumption, or tree-based structural rollout that merely redistributes the outcome signal while constraining trajectory diversity. A promising alternative leverages the per-turn change in the policy’s predicted probability of the ground-truth, termed Information Gain (IG), as an intrinsic process signal without an external evaluator. However, prior work on leveraging IG signals within the RL training loop faces three systematic challenges: normalizing across turns that face heterogeneous positional contexts can distort the relative standing of individual turns, accumulating a variable number of terms causes advantage magnitudes to drift with trajectory depth, and a fixed clipping range governs policy updates identically for turns with vastly different IG signals. In this paper, we propose A$^2$TGPO (Agentic Turn-Group Policy Optimization with Adaptive Turn-level Clipping), which retains IG as the intrinsic signal but re-designs how it is normalized, accumulated, and consumed: (i) turn-group normalization: normalizes IG within each (prompt, turn-index) group so that each turn is compared only against peers at the same interaction depth; (ii) variance-rescaled discounted accumulation: divides cumulative normalized IG by square root of accumulated terms to keep advantage magnitudes comparable across turn positions; and (iii) adaptive turn-level clipping: modulates each turn’s clipping range based on its normalized IG, widening the update region for informative turns and narrowing it for uninformative ones.
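
The three components (i)-(iii) can be sketched in a few lines. The abstract fixes the overall recipe but not the details, so several choices below are assumptions: accumulation runs from the current turn to the end of the trajectory, the discount `gamma` defaults to 1, and the clipping range is modulated with a `tanh` of the normalized IG. All function names and constants are hypothetical.

```python
import math
from statistics import mean, pstdev

def turn_group_normalize(ig_by_rollout):
    """(i) Normalize IG within each (prompt, turn-index) group.

    ig_by_rollout: per-rollout IG lists for ONE prompt; lengths may differ.
    """
    normalized = [[0.0] * len(r) for r in ig_by_rollout]
    for t in range(max(len(r) for r in ig_by_rollout)):
        members = [i for i, r in enumerate(ig_by_rollout) if len(r) > t]
        group = [ig_by_rollout[i][t] for i in members]
        mu, sigma = mean(group), pstdev(group)
        for i in members:
            normalized[i][t] = (ig_by_rollout[i][t] - mu) / (sigma + 1e-8)
    return normalized

def rescaled_accumulate(norm_ig, gamma=1.0):
    """(ii) Discounted sum divided by sqrt(number of accumulated terms).

    Assumption: terms are accumulated from turn t to the final turn.
    """
    out = []
    for t in range(len(norm_ig)):
        terms = [gamma ** (k - t) * norm_ig[k] for k in range(t, len(norm_ig))]
        out.append(sum(terms) / math.sqrt(len(terms)))
    return out

def adaptive_clip(norm_ig_t, base_eps=0.2, scale=0.5):
    """(iii) Wider clip range for informative turns, narrower otherwise.

    The tanh modulation is an illustrative choice, not the paper's rule.
    """
    return base_eps * (1.0 + scale * math.tanh(norm_ig_t))

# Two made-up rollouts of the same prompt, two turns each.
rollouts = [[0.1, 0.3], [0.3, 0.1]]
normalized = turn_group_normalize(rollouts)
advantages = [rescaled_accumulate(r) for r in normalized]
```

Dividing by the square root of the term count keeps the variance of the accumulated sum roughly constant when the terms are unit-variance and weakly correlated, which is one way to read the abstract's claim that advantage magnitudes stay comparable across turn positions.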


[16] Rethinking RL for LLM Reasoning: It’s Sparse Policy Selection, Not Capability Learning cs.CL

Ömer Faruk Akgül, Rajgopal Kannan, Willie Neiswanger, Viktor Prasanna

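TL;DR: This paper challenges the conventional view of how reinforcement learning (RL) improves reasoning in large language models (LLMs), arguing that RL does not teach new strategies but only redistributes probability mass over solutions the model already has. Token-level analysis shows that RL's beneficial effect is sparse and predictable, concentrated at high-entropy decision points where the model is uncertain. Building on this, the authors propose ReasonMaxxer, a lightweight RL-free method that applies a contrastive loss only at entropy-gated decision points, sharply reducing training cost.

Details

Motivation: RL is the standard way to improve LLM reasoning, but growing evidence suggests that it does not teach new strategies and instead redistributes probability over solutions the base model already contains. The paper therefore questions whether the RL optimization loop itself is necessary and investigates what it actually does.

Result: Across three model families, six scales, and six math reasoning benchmarks (such as GSM8K and MATH), ReasonMaxxer matches or exceeds full RL performance while requiring only tens of problems and minutes of single-GPU training, reducing training cost by roughly three orders of magnitude.

Insight: The core contribution is reframing reasoning improvement as sparse policy selection rather than capability acquisition, and using the base model's own entropy to locate the key decision points. The resulting ReasonMaxxer method applies a contrastive loss only at high-entropy decision points, achieving efficient, low-cost gains and suggesting a new lightweight paradigm for fine-tuning.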

Abstract: Reinforcement learning has become the standard for improving reasoning in large language models, yet evidence increasingly suggests that RL does not teach new strategies; it redistributes probability mass over solutions the base model already contains. In this work, we ask: if RL merely steers the model toward paths it already knows, is the RL optimization loop itself necessary? Through token-level analysis across multiple model families and RL algorithms, we find that RL’s beneficial footprint is a sparse, predictable correction concentrated at high-entropy decision points where the model is uncertain which branch to take. Only 1–3% of token positions are affected, the promoted token always lies within the base model’s top-5 alternatives, and targeted corrections at those few positions causally recover a large fraction of RL’s accuracy gain, while random corrections fail. The base model’s own entropy identifies these positions without any RL-trained model, and the entire correction is low-dimensional, representable in a tiny fraction of model parameters. These findings reframe reasoning improvement as sparse policy selection, not capability acquisition. We translate this insight into ReasonMaxxer, a minimal RL-free method that applies contrastive loss only at entropy-gated decision points, using a few hundred base-model rollouts and no online generation. Across three model families, six scales, and six math reasoning benchmarks, ReasonMaxxer matches or exceeds full RL performance while requiring only tens of problems and minutes of single-GPU training, a reduction in training cost of roughly three orders of magnitude.
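
The entropy gating described above can be sketched as follows. This is a minimal illustration, not the authors’ implementation: it assumes the base model’s next-token distributions are already available as lists of probabilities, and the function names and the `fraction` parameter (set near the 1–3% figure quoted in the abstract) are hypothetical.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def entropy_gated_positions(dists, fraction=0.03):
    """Return the indices of the highest-entropy token positions.
    Keeps the top `fraction` of positions, and always at least one."""
    ents = [token_entropy(d) for d in dists]
    k = max(1, int(len(ents) * fraction))
    ranked = sorted(range(len(ents)), key=lambda i: ents[i], reverse=True)
    return sorted(ranked[:k])
```

A contrastive loss would then be applied only at the returned positions; the loss itself is not specified in the abstract and is left out of this sketch.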


[17] LatentRAG: Latent Reasoning and Retrieval for Efficient Agentic RAG cs.CL | cs.LG [PDF]

Yijia Zheng, Marcel Worring

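TL;DR: LatentRAG is a novel framework that addresses the high latency of conventional multi-step agentic RAG, which stems from autoregressively generating lengthy intermediate thoughts and subqueries. It shifts reasoning and retrieval from discrete language space to continuous latent space, producing latent tokens for thoughts and subqueries directly from hidden states in a single forward pass, and aligns them with a dense retrieval model in latent space to support end-to-end joint optimization.

Details

Motivation: Conventional single-step retrieval-augmented generation (RAG) struggles with complex questions. Multi-step agentic RAG improves performance by letting the LLM act as a search agent that iteratively generates intermediate thoughts and subqueries, but autoregressively generating this lengthy text incurs substantial latency. The paper aims to remove this latency bottleneck.

Result: Extensive experiments on seven benchmark datasets show that LatentRAG performs comparably to explicit agentic RAG methods while reducing inference latency by approximately 90%, substantially narrowing the latency gap with traditional single-step RAG.

Insight: The core innovation is moving reasoning and retrieval from explicit natural-language space to continuous latent space, replacing token-by-token autoregressive generation with latent tokens produced in a single forward pass, combined with latent-space alignment and joint optimization for efficiency. In addition, a parallel latent decoding mechanism translates latent tokens back into natural language, improving transparency and the semantic meaningfulness of the latent representations, a design that balances efficiency and interpretability.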

Abstract: Single-step retrieval-augmented generation (RAG) provides an efficient way to incorporate external information for simple question answering tasks but struggles with complex questions. Agentic RAG extends this paradigm by replacing single-step retrieval with a multi-step process, in which the large language model (LLM) acts as a search agent that generates intermediate thoughts and subqueries to iteratively interact with the retrieval system. This iterative process incurs substantial latency due to the autoregressive generation of lengthy thoughts and subqueries. To address this limitation, we propose LatentRAG, a novel framework that shifts both reasoning and retrieval from discrete language space to continuous latent space. Unlike existing explicit methods that generate natural language thoughts or subqueries token-by-token, LatentRAG produces latent tokens for thoughts and subqueries directly from the hidden states in a single forward pass. We align LLMs with dense retrieval models in the latent space, enabling retrieval over latent subquery tokens and supporting end-to-end joint optimization. To improve transparency and encourage semantically meaningful latent representations, we incorporate a parallel latent decoding mechanism that translates latent tokens back into natural language. Extensive experiments on seven benchmark datasets show that LatentRAG achieves performance comparable to explicit agentic RAG methods while reducing inference latency by approximately 90%, substantially narrowing the latency gap with traditional single-step RAG.


[18] Teaching Thinking Models to Reason with Tools: A Full-Pipeline Recipe for Tool-Integrated Reasoning cs.CL [PDF]

Qianjia Cheng, Yuchen Zhang, Zhilin Wang, Yuxin Zuo, Shunkai Zhang

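TL;DR: This paper presents a complete training recipe for tool-integrated reasoning (TIR) that injects natural tool-use ability into strong thinking models without harming their text-only reasoning. The recipe has two stages, supervised fine-tuning (SFT) and reinforcement learning with verifiable rewards (RLVR), and highlights key design choices such as data selection, control of the trajectory mix, and the choice of optimization target.

Details

Motivation: The paper addresses a paradox: tool-enabled evaluation can degrade reasoning performance even when a strong thinking model makes almost no actual tool calls. It therefore studies how to inject natural tool-use behavior without sacrificing the model’s text-only reasoning ability.

Result: Applied to Qwen3 thinking models at 4B and 30B scales, the recipe achieves state-of-the-art performance among open-source models across a wide range of benchmarks, for example 96.7% and 99.2% accuracy on AIME 2025, respectively.

Insight: The contributions include: 1) the effectiveness of TIR SFT hinges on the learnability of teacher trajectories, so problems inherently suited to tool-augmented solutions should be prioritized; 2) controlling the proportion of tool-use trajectories mitigates catastrophic forgetting of text-only reasoning; 3) optimizing for pass@k and response length rather than training loss maximizes SFT gains while preserving headroom for RL exploration; 4) a stable RLVR stage built on a suitable SFT initialization and explicit safeguards against mode collapse.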

Abstract: Tool-integrated reasoning (TIR) offers a direct way to extend thinking models beyond the limits of text-only reasoning. Paradoxically, we observe that tool-enabled evaluation can degrade reasoning performance even when the strong thinking models make almost no actual tool calls. In this paper, we investigate how to inject natural tool-use behavior into a strong thinking model without sacrificing its no-tool reasoning ability, and present a comprehensive TIR recipe. We highlight that (i) the effectiveness of TIR supervised fine-tuning (SFT) hinges on the learnability of teacher trajectories, which should prioritize problems inherently suited for tool-augmented solutions; (ii) controlling the proportion of tool-use trajectories could mitigate the catastrophic forgetting of text-only reasoning capacity; (iii) optimizing for pass@k and response length instead of training loss could maximize TIR SFT gains while preserving headroom for reinforcement learning (RL) exploration; (iv) a stable RL with verifiable rewards (RLVR) stage, built upon suitable SFT initialization and explicit safeguards against mode collapse, provides a simple yet remarkably effective solution. When applied to Qwen3 thinking models at 4B and 30B scales, our recipe yields models that achieve state-of-the-art performance in a wide range of benchmarks among open-source models, such as 96.7% and 99.2% on AIME 2025 for 4B and 30B, respectively.
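
Point (iii) refers to pass@k as a checkpoint-selection signal. For readers unfamiliar with the metric, the sketch below shows the unbiased pass@k estimator commonly used in code-generation evaluation. Whether the paper computes pass@k this way is an assumption; the abstract does not say.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased estimate of pass@k from n sampled answers, c of them correct.
    Equals 1 - C(n - c, k) / C(n, k): one minus the chance that a random
    size-k subset of the n samples contains no correct answer."""
    if k > n - c:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with `n = 4` samples of which `c = 1` is correct, `pass_at_k(4, 1, 2)` is 0.5, since half of the six possible pairs contain the correct sample.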


[19] Don’t Lose Focus: Activation Steering via Key-Orthogonal Projections cs.CL [PDF]

Haoyan Luo, Mateo Espinosa Zarlenga, Mateja Jamnik

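TL;DR: This paper proposes SKOP (Steering via Key-Orthogonal Projections), an activation steering method that addresses the loss of reasoning and retrieval performance caused by existing activation steering techniques when controlling LLM behavior. It preserves the model’s attention patterns on key focus tokens while allowing attention to be redistributed among less critical tokens, achieving behavior steering with minimal performance loss.

Details

Motivation: Existing activation steering techniques control LLM behavior by intervening in internal representations, but they often cause attention rerouting, shifting attention from contextually important tokens to less informative ones and thereby harming reasoning and retrieval. The paper aims to resolve this trade-off between steering and utility.

Result: Across multiple steering benchmarks, SKOP achieves the best steering-utility trade-off, reducing utility degradation by 5-7x while retaining over 95% of vanilla steering efficacy. In long-context retrieval settings where vanilla steering fails, SKOP maintains robust performance by avoiding attention rerouting.

Insight: The core innovation is a steering mechanism that constrains harmful attention rerouting through key-orthogonal projections. The key insight is that identifying and protecting the attention patterns on a small set of "focus tokens" the model relies on for reasoning and retrieval can substantially reduce interference with core capabilities without sacrificing steering efficacy. This offers a new approach to fine-grained control of model behavior.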

Abstract: Activation steering controls LLM behaviour towards target behaviour by intervening in internal representations, yet it often degrades reasoning and retrieval performance. We argue that a primary cause of this trade-off is attention rerouting: steering vectors alter query-key matching, shifting attention away from contextually important tokens toward less informative ones. To address this, we propose Steering via Key-Orthogonal Projections (SKOP), a steering method that constrains harmful attention rerouting without eliminating steering efficacy. SKOP achieves this by preserving attention patterns on a small set of focus tokens the model relies on for reasoning and retrieval, while allowing redistribution among less critical tail tokens. Across multiple steering benchmarks, we show that SKOP achieves the best joint steering-utility trade-off, reducing utility degradation by 5-7x while retaining over 95% of vanilla steering efficacy. Our results further suggest that, in long-context retrieval settings where vanilla steering approaches are ineffective, SKOP can maintain robust performance by avoiding attention rerouting.
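
One way to read "key-orthogonal projection" is sketched below. It rests on a simplifying assumption that is not taken from the paper: the steering shift `delta` acts on the query side, so the attention logit for a focus token with key `k` changes from `q·k` to `(q + delta)·k`. Projecting `delta` onto the orthogonal complement of the focus keys then leaves those logits untouched. The function names are hypothetical, and how SKOP selects focus tokens is not shown.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def project_out(vec, keys, tol=1e-12):
    """Remove from `vec` every component lying in the span of `keys`."""
    basis = []
    for k in keys:  # Gram-Schmidt: build an orthonormal basis of the key span
        r = list(k)
        for b in basis:
            c = dot(r, b)
            r = [ri - c * bi for ri, bi in zip(r, b)]
        norm = dot(r, r) ** 0.5
        if norm > tol:  # skip keys that are linearly dependent on earlier ones
            basis.append([ri / norm for ri in r])
    out = list(vec)
    for b in basis:
        c = dot(out, b)
        out = [oi - c * bi for oi, bi in zip(out, b)]
    return out
```

Under this reading, the projected shift can still change logits for tail tokens whose keys lie outside the focus span, which matches the abstract’s description of allowing redistribution among less critical tokens.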


[20] GATHER: Convergence-Centric Hyper-Entity Retrieval for Zero-Shot Cell-Type Annotation cs.CL | cs.IR [PDF]

Zhonghui Zhang, Feng Jiang, Shaowei Qin, Jiahao Zhao, Min Yang

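TL;DR: This paper proposes GATHER for zero-shot single-cell cell-type annotation, where queries involve many genes acting together. It designs a hyper-entity retrieval strategy based on topological convergence points in a graph: global graph traversal identifies jointly reachable nodes as high-information evidence, substantially reducing the number of LLM calls and improving annotation accuracy.

Details

Motivation: Existing knowledge-graph-based RAG methods rely on local entity expansion and iterative LLM reasoning. On queries containing tens to hundreds of genes they scale poorly, cost a lot, and fail to capture gene synergy.

Result: On the self-constructed cell-centric knowledge graph VCKG, GATHER outperforms several KG-RAG baselines (ToG, ToG-2, RoG, PoG) on two datasets (Immune and Lung), achieving the highest exact-match accuracy (27.45% and 59.64%) with only one LLM call per sample, versus 2 to 61 calls for the baselines.

Insight: The innovation is using convergence nodes as hyper-entities that compress multi-entity signals, replacing conventional local entity-wise reasoning. With global graph traversal and node- and path-importance scoring, retrieval needs no LLM involvement at all, yielding efficient and information-dense evidence extraction.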

Abstract: Zero-shot single-cell cell-type annotation aims to determine a cell’s type from a given set of expressed genes without any training. Existing knowledge-graph-based RAG approaches retrieve evidence by expanding from source entities and relying on iterative LLM reasoning. However, in this setting each query contains tens to hundreds of genes, where no single gene is decisive and the label emerges only from their collective co-occurrence. Such hyper-entity queries fundamentally challenge local, entity-wise exploration strategies, which reason from individual genes, leading to poor scalability and substantial LLM cost. We propose GATHER (Graph-Aware Traversal with Hyper-Entity Retrieval), a convergence-centric retriever tailored to hyper-entity queries. It performs global multi-source graph traversal and identifies topological convergence points – nodes jointly reachable from many input genes. These convergence nodes act as high-information hyper-entities that capture entity synergy. By incorporating node- and path-importance scoring, GATHER selects informative evidence entirely without LLM involvement during retrieval. Instantiated on a self-constructed cell-centric biological knowledge graph (VCKG), GATHER outperforms strong KG-RAG baselines (ToG, ToG-2, RoG, PoG) on two datasets (Immune and Lung), achieving the highest exact-match accuracy (27.45% and 59.64%) with only a single LLM call per sample, compared to 2–61 calls for KG-RAG baselines. Our results demonstrate that convergence nodes compress multi-entity signals into compact, high-information evidence that conveys more per item than multi-hop paths, providing an efficient global alternative to local entity-wise reasoning.
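
The idea of a topological convergence point can be illustrated with a small multi-source breadth-first search. This is a toy sketch under assumptions: the graph is a plain adjacency dict, "jointly reachable" is taken to mean reachable within `max_hops` from at least `min_support` input genes, and the node- and path-importance scoring mentioned in the abstract is omitted. All names and parameters are hypothetical.

```python
from collections import Counter, deque

def convergence_nodes(graph, sources, max_hops=2, min_support=2):
    """graph: dict mapping node -> iterable of neighbours.
    Returns {node: support} for non-source nodes reachable within
    max_hops from at least min_support of the source nodes."""
    support = Counter()
    for s in sources:
        seen = {s}
        frontier = deque([(s, 0)])
        while frontier:
            node, depth = frontier.popleft()
            if depth == max_hops:
                continue  # do not expand beyond the hop budget
            for nb in graph.get(node, ()):
                if nb not in seen:
                    seen.add(nb)
                    frontier.append((nb, depth + 1))
        for node in seen - {s}:
            support[node] += 1  # each source votes at most once per node
    src = set(sources)
    return {n: c for n, c in support.items()
            if c >= min_support and n not in src}
```

The traversal involves no LLM call, which is consistent with the abstract’s claim that evidence selection happens entirely without LLM involvement during retrieval.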


[21] STALE: Can LLM Agents Know When Their Memories Are No Longer Valid? cs.CL [PDF]

Hanxiang Chao, Yihan Bai, Rui Sheng, Tianle Li, Yushi Sun

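TL;DR: Targeting weaknesses in how LLM agents update long-term memory, this paper introduces the STALE benchmark, which evaluates an agent’s ability to revise its memory when it encounters an "implicit conflict" (new evidence that indirectly invalidates an old memory). The study finds that current frontier LLMs and specialized memory frameworks commonly retrieve updated evidence yet fail to act on it, with a best accuracy of only 55.2%. The paper also presents CUPMem, a prototype that improves memory updating through structured state consolidation and propagation-aware search.

Details

Motivation: Current memory benchmarks for LLM agents mainly test static fact retrieval and overlook the ability to revise existing beliefs when new evidence emerges, especially "implicit conflict" scenarios that can only be detected through contextual inference and commonsense reasoning.

Result: On STALE, which contains 400 expert-validated conflict scenarios (spanning over 100 everyday topics, with contexts up to 150K tokens), a systematic evaluation shows that the best model reaches only 55.2% overall accuracy. Models often fail to reject outdated assumptions embedded in a query and do not recognize how a change in one aspect of the user’s state affects related memories.

Insight: The contributions are STALE, the first memory benchmark focused on detecting "implicit conflicts", and its three-dimensional probing framework (State Resolution, Premise Resistance, Implicit Policy Adaptation). The paper identifies explicit state adjudication as a promising direction for robust agentic memory, and the CUPMem prototype supports this direction through structured state consolidation.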

Abstract: Large Language Model (LLM) agents are increasingly expected to maintain coherent, long-term personalized memory, yet current benchmarks primarily measure static fact retrieval, overlooking the ability to revise stored beliefs when new evidence emerges. We identify a critical and underexplored failure mode, Implicit Conflict: a later observation invalidates an earlier memory without explicit negation, requiring contextual inference and commonsense reasoning to detect. To rigorously evaluate this capability, we introduce STALE, a benchmark of 400 expert-validated conflict scenarios (1,200 evaluation queries across three probing dimensions) spanning over 100 everyday topics with contexts up to 150K tokens. We propose a three-dimensional probing framework that tests State Resolution (detecting that a prior belief is outdated), Premise Resistance (rejecting queries that falsely presuppose a stale state), and Implicit Policy Adaptation (proactively applying updated states in downstream behavior). A systematic evaluation of frontier LLMs and specialized memory frameworks reveals a pervasive gap between retrieving updated evidence and acting on it, with even the best evaluated model achieving only 55.2% overall accuracy. Models often accept outdated assumptions embedded in a user’s query, and they struggle to recognize when a change in one aspect of the user’s state should invalidate related memories. To establish an initial baseline for state-aware memory, we further present CUPMem, a prototype that strengthens write-time revision through structured state consolidation and propagation-aware search, suggesting that explicit state adjudication is a promising direction for robust agentic memory.


[22] UniSD: Towards a Unified Self-Distillation Framework for Large Language Models cs.CL | cs.AI | cs.LG [PDF]

Yiqiao Jin, Yiyang Wang, Lucheng Fu, Yijia Xiao, Yinyi Luo

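TL;DR: This paper proposes UniSD, a unified framework for systematically studying self-distillation in large language models. It integrates multi-teacher agreement, EMA teacher stabilization, token-level contrastive learning, feature matching, and divergence clipping to address supervision reliability, representation alignment, and training stability. Across six benchmarks and six models from three model families, UniSD reveals when self-distillation improves over static imitation, which components drive the gains, and how the components interact across tasks. Guided by these insights, the authors build the integrated UniSDfull pipeline, which improves over the base model by 5.4 points and over the strongest baseline by 2.8 points.

Details

Motivation: Self-distillation offers a promising path for adapting LLMs without relying on stronger external teachers. In autoregressive LLMs it remains challenging because self-generated trajectories are free-form, correctness is task-dependent, and plausible-looking rationales can still provide unstable or unreliable supervision. Existing methods mainly examine design choices in isolation, leaving their effectiveness, roles, and interactions unclear.

Result: Across six benchmarks and six models from three model families, the integrated UniSDfull pipeline achieves the strongest overall performance, improving over the base model by 5.4 points and over the strongest baseline by 2.8 points.

Insight: The innovation is UniSD, a unified self-distillation framework that systematically integrates complementary mechanisms (multi-teacher agreement, EMA teacher stabilization, token-level contrastive learning, feature matching, and divergence clipping) to address supervision reliability, representation alignment, and training stability. Viewed objectively, the study uses systematic experiments to reveal the effectiveness and interactions of the individual components, offering a practical and steerable approach to efficient LLM adaptation without stronger external teachers.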

Abstract: Self-distillation (SD) offers a promising path for adapting large language models (LLMs) without relying on stronger external teachers. However, SD in autoregressive LLMs remains challenging because self-generated trajectories are free-form, correctness is task-dependent, and plausible rationales can still provide unstable or unreliable supervision. Existing methods mainly examine isolated design choices, leaving their effectiveness, roles, and interactions unclear. In this paper, we propose UniSD, a unified framework to systematically study self-distillation. UniSD integrates complementary mechanisms that address supervision reliability, representation alignment, and training stability, including multi-teacher agreement, EMA teacher stabilization, token-level contrastive learning, feature matching, and divergence clipping. Across six benchmarks and six models from three model families, UniSD reveals when self-distillation improves over static imitation, which components drive the gains, and how these components interact across tasks. Guided by these insights, we construct UniSDfull, an integrated pipeline that combines complementary components and achieves the strongest overall performance, improving over the base model by +5.4 points and the strongest baseline by +2.8 points. Extensive evaluation highlights self-distillation as a practical and steerable approach for efficient LLM adaptation without stronger external teachers.
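
Two of the listed mechanisms have simple generic forms, sketched below. The EMA update is the standard exponential moving average of parameters; reading "divergence clipping" as capping each per-token divergence at a fixed value is an assumption, as are the function names and the `decay` and `cap` defaults. Parameters are represented as a flat dict of floats to keep the sketch dependency-free.

```python
def ema_update(teacher, student, decay=0.99):
    """Return new teacher parameters as an exponential moving average:
    teacher <- decay * teacher + (1 - decay) * student."""
    return {name: decay * teacher[name] + (1.0 - decay) * student[name]
            for name in teacher}

def clip_divergence(per_token_div, cap=5.0):
    """Assumed reading of divergence clipping: cap each per-token
    teacher-student divergence so outlier tokens cannot dominate the loss."""
    return [min(d, cap) for d in per_token_div]
```

A slowly moving teacher changes less from step to step than the student does, which is the usual motivation for EMA stabilization when a model supervises itself.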


[23] StraTA: Incentivizing Agentic Reinforcement Learning with Strategic Trajectory Abstraction cs.CL | cs.AI [PDF]

Xiangyuan Xue, Yifan Zhou, Zidong Wang, Shengji Tang, Philip Torr

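TL;DR: This paper proposes Strategic Trajectory Abstraction (StraTA), a framework for the exploration and credit assignment problems that purely reactive methods cause when LLMs act as interactive agents in long-horizon decision-making tasks. StraTA samples a compact strategy from the initial task state, conditions subsequent actions on it, and jointly optimizes strategy generation and action execution using hierarchical GRPO-style training, diverse strategy sampling, and critical self-judgment. Experiments on ALFWorld, WebShop, and SciWorld show that StraTA outperforms strong baselines in both sample efficiency and final performance.

Details

Motivation: Current LLM-based agents mostly take a purely reactive approach to long-horizon decision making, which weakens exploration and credit assignment over long trajectories. An explicit trajectory-level strategy is therefore needed to improve agentic reinforcement learning.

Result: StraTA reaches a 93.1% success rate on ALFWorld, an 84.2% success rate on WebShop, and a 63.5% overall score on SciWorld, where it outperforms frontier closed-source models, with consistent gains in sample efficiency and final performance across benchmarks.

Insight: The main innovation is an explicit trajectory-level strategy abstraction: a hierarchical training design jointly optimizes strategy generation and action execution, supported by diverse strategy sampling and critical self-judgment, which strengthens the agent’s planning and decision making on long-horizon tasks.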

Abstract: Large language models (LLMs) are increasingly used as interactive agents, but optimizing them for long-horizon decision making remains difficult because current methods are largely purely reactive, which weakens both exploration and credit assignment over extended trajectories. In this work, we present Strategic Trajectory Abstraction (StraTA), a simple framework that introduces an explicit trajectory-level strategy into agentic reinforcement learning (RL). StraTA samples a compact strategy from the initial task state, conditions subsequent actions on that strategy, and trains strategy generation and action execution jointly with a hierarchical GRPO-style rollout design, further enhanced by diverse strategy rollout and critical self-judgment. Experiments on ALFWorld, WebShop, and SciWorld show that StraTA consistently improves both sample efficiency and final performance over strong baselines. StraTA reaches success rates of 93.1% on ALFWorld and 84.2% on WebShop. On SciWorld, StraTA attains a 63.5% overall score, outperforming frontier closed-source models.
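
A "hierarchical GRPO-style" design could assign credit at two levels, as in the sketch below. This is a guess at the general shape rather than the paper’s algorithm: it assumes several strategies are sampled per task, several action rollouts are run per strategy, a strategy is scored by the mean return of its rollouts, and advantages are group-normalized at each level. The function names are hypothetical.

```python
import math

def group_relative(rewards, eps=1e-8):
    """GRPO-style advantage: reward standardized within its group."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]

def hierarchical_advantages(returns_by_strategy):
    """returns_by_strategy: one list of rollout returns per sampled strategy.
    Returns (strategy-level advantages, per-strategy action-level advantages)."""
    strategy_scores = [sum(rs) / len(rs) for rs in returns_by_strategy]
    strategy_adv = group_relative(strategy_scores)
    action_adv = [group_relative(rs) for rs in returns_by_strategy]
    return strategy_adv, action_adv
```

In this reading, a strategy is rewarded for making its rollouts succeed on average, while each rollout is compared only with other rollouts that followed the same strategy.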


[24] Beyond Negative Rollouts: Positive-Only Policy Optimization with Implicit Negative Gradients cs.CL [PDF]

Mingwei Xu, Hao Fang

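TL;DR: This paper proposes POPO (Positive-Only Policy Optimization), a new reinforcement learning framework for reinforcement learning with verifiable rewards (RLVR) that aims to improve LLM reasoning. It drops the explicit sampling and penalization of negative rollouts used by conventional methods and learns only from online positive rollouts, with bounded importance sampling, a siamese policy network, and momentum updates providing stable optimization.

Details

Motivation: Existing RLVR methods such as GRPO rely on group-based estimates over positive and negative samples, but negative samples may carry no gradation of failure severity, and under sparse binary rewards, penalizing a few sampled negatives is unlikely to provide a meaningful reward signal. The paper therefore targets the inefficient use of negative samples and signal sparsity.

Result: Extensive experiments were run on publicly available text LLMs such as the Qwen family across mathematical benchmarks of all difficulty levels. POPO achieves performance comparable or superior to GRPO. In particular, with Qwen-Math-7B it reaches 36.67% accuracy on AIME 2025, versus 30.00% for GRPO. Ablation and parameter-sweep studies further confirm the necessity and robustness of POPO’s components.

Insight: The claimed innovation is a policy optimization framework that uses only positive samples: through bounded importance sampling and reinforcement of positive probabilities, implicit negative gradients emerge naturally, avoiding the complexity of explicit negative sampling. Viewed objectively, the core novelty is shifting the optimization focus entirely to positive experience and stabilizing training with a siamese policy network with momentum updates plus a bounded similarity penalty in representation space, which offers a novel and efficient approach to policy optimization under sparse rewards.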

Abstract: Reinforcement learning with verifiable rewards (RLVR) has become a dominant paradigm for enhancing the reasoning ability of large language models (LLMs), owing to its deterministic verification. The community has moved rapidly from Proximal Policy Optimization (PPO) to Group Relative Policy Optimization (GRPO), which replaces complicated advantage estimation with a simple estimate over grouped positive and negative rollouts. However, we note that negative rollouts may admit no gradation of failure severity, and the combinatorial vastness makes penalizing a few sampled negatives unlikely to yield a meaningful reward signal under sparse binary rewards. In this work, we propose Positive-Only Policy Optimization (POPO), a novel RLVR framework in which learning can occur exclusively via online positive rollouts. Specifically, POPO uses bounded importance sampling over the positive rollout set, so no disjoint negative rollouts are used for gradient guidance. We show that implicit negative gradients can emerge naturally by reinforcing the positive probability via rollout redistribution. POPO further stabilizes policy optimization through two mechanisms. First, it applies a siamese policy network with a momentum-based adaptation law for stabilized policy evolution. Second, it replaces the KL-divergence with a bounded similarity penalty term in the siamese representation space. We conduct extensive experiments using publicly available, well-established text-LLM models, e.g., the Qwen family, across all-level mathematical benchmarks. Our experiments demonstrate that POPO achieves performance comparable or even superior to GRPO. Notably, POPO achieves 36.67% on AIME 2025 with Qwen-Math-7B, outperforming GRPO at 30.00%. Our ablation and sweep studies further illustrate the necessity and robustness of POPO components.


cs.CV [Back]

[25] Seeing What Shouldn’t Be There: Counterfactual GANs for Medical Image Attribution cs.CV [PDF]

Shakeeb Murtaza

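TL;DR: This paper proposes a class-oriented feature attribution method based on counterfactual explanations (CX) for medical image analysis. The method uses generative adversarial networks (GANs) with a cycle-consistency loss to generate counterfactual instances (CIs) that explain classification decisions, and validates its effectiveness on synthetic, tuberculosis, and BraTS datasets.

Details

Motivation: Most existing visualization techniques are built on discriminative models and highlight only the small set of discriminative features used in the classification decision, without accounting for all salient objects. To address this, the paper introduces counterfactual explanations to provide more complete causal reasoning and help radiologists visualize abnormalities in medical images.

Result: Experiments on three datasets (synthetic, tuberculosis, and BraTS) confirm the effectiveness of the proposed method. Baseline results are provided on BraTS, and the method shows an advantage in generating plausible counterfactual instances.

Insight: The contributions include combining GANs with a cycle-consistency loss to produce counterfactual explanations, a counterfactual instance generation method that makes explanations more believable, and a new technique for evaluating the quality of counterfactual instances, together providing a self-explanatory, analogy-based explanation framework for medical image attribution.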

Abstract: Ascription of an image gives insights into the objects that influence the classification of the whole image or its pixels towards a specific category. These insights help radiologists to visualize deformities in medical imaging. Most of the existing visualization techniques are based on discriminative models and highlight regions of the input image participating in the decision-making of a classifier. However, these approaches do not take all noticeable objects into account as their objective is to classify the input by using a minimal set of discriminative features. To overcome the issue, a counterfactual explanation (CX) based class-oriented feature attribution method is proposed. A counterfactual explanation (CX) explicates a causal reasoning process of the form: “if X had not happened, then Y would not have happened”. The method is built on generative adversarial networks (GANs) with a cyclical-consistent loss function. We evaluate our method on three datasets: synthetic, tuberculosis and BraTS. All experiments confirm the efficacy of the proposed method. This study also highlighted the limitations of existing counterfactual explanation techniques in producing plausible counterfactual instances (CIs). Accompanying CXs with believable CIs thus provides self-explanatory analogy-based explanations. To this end, a CI generation method is proposed. Also, a novel technique is used to evaluate the quality of CI. The baseline results are produced on the BraTS dataset.
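
For reference, the cycle-consistency term in its standard CycleGAN form is given below. Whether the paper uses exactly this form is an assumption; the symbols are illustrative, with $G$ mapping images from one class domain to the other (for example, abnormal to normal) and $F$ mapping back.

```latex
\mathcal{L}_{\mathrm{cyc}}(G, F)
  = \mathbb{E}_{x}\big[\lVert F(G(x)) - x \rVert_{1}\big]
  + \mathbb{E}_{y}\big[\lVert G(F(y)) - y \rVert_{1}\big]
```

Under this setup, a counterfactual instance for an input $x$ would be $G(x)$, and the pixel-wise difference between $x$ and $G(x)$ could serve as the attribution map.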


[26] ViTok-v2: Scaling Native Resolution Auto-Encoders to 5 Billion Parameters cs.CV | cs.AI | cs.LG [PDF]

Philippe Hansen-Estruch, Jiahui Chen, Vivek Ramanujan, Orr Zohar, Yan Ping

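TL;DR: This paper presents ViTok-v2, a Vision Transformer autoencoder with native-resolution support. It uses NaFlex to generalize across resolutions and aspect ratios and a novel DINOv3 perceptual loss that replaces LPIPS and GAN objectives for stable large-scale training. Trained on about 2 billion images and scaled to 5 billion parameters, it is the largest image autoencoder to date. Experiments show that ViTok-v2 matches or exceeds state-of-the-art reconstruction at 256p and outperforms all baselines at 512p and above.

Details

Motivation: Existing ViT autoencoders degrade outside their training resolutions, and their reliance on adversarial losses makes training unstable and hard to scale. ViTok-v2 aims to remove these limitations and, by improving reconstruction quality, to advance the Pareto frontier of the reconstruction-generation trade-off.

Result: ViTok-v2 matches or exceeds state-of-the-art reconstruction at 256p and outperforms all baselines at 512p and above. Joint scaling experiments with flow matching generators show that scaling both the autoencoder and the generator advances the Pareto frontier between reconstruction and generation.

Insight: The innovations include: 1) native-resolution support across resolutions and aspect ratios via NaFlex; 2) a DINOv3 perceptual loss that replaces the conventional LPIPS and GAN objectives, enabling stable large-scale training. This provides a viable technical path to building larger image autoencoders.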

Abstract: Vision Transformer (ViT) autoencoders have emerged as compelling tokenizers for images, offering improved reconstruction over convolutional tokenizers. However, existing ViT tokenizers cannot explore this landscape as performance degrades outside training resolutions, and reliance on adversarial losses prevents stable scaling. ViTok (Hansen-Estruch et al., 2025) found that the compression ratio r mediates a reconstruction-generation trade-off where lower r means better reconstructions but harder generations, so improving tokenizer reconstruction is key to more Pareto-optimal tokenizers. We introduce ViTok-v2, which addresses these limitations with native resolution support via NaFlex for generalization across resolutions and aspect ratios, and a novel DINOv3 perceptual loss that replaces both LPIPS and GAN objectives for stable training at any scale. ViTok-v2 is trained on about 2B images and scaled to 5B parameters, the largest image autoencoder to date. ViTok-v2 matches or exceeds state-of-the-art reconstruction at 256p and outperforms all baselines at 512p and above. In joint scaling experiments with flow matching generators, we show that scaling both the autoencoder and the generator advances the Pareto frontier of this trade-off.


[27] Open-SAT: LLM-Guided Query Embedding Refinement for Open-Vocabulary Object Retrieval in Satellite Imagery cs.CV | cs.AI | cs.IR [PDF]

Md Adnan Arefeen, Biplob Debnath, Ravi K. Rajendran, Murugan Sankaradas, Srimat T. Chakradhar

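TL;DR: This paper proposes Open-SAT, a training-free query embedding refinement algorithm for open-vocabulary satellite image retrieval. At inference time it uses a large language model (LLM) to refine the text query embedding by incorporating contextual information about the target object and its surroundings so that it aligns better with satellite image content, and it adopts a threshold-free retrieval mechanism. Experiments show significant retrieval gains on three public benchmarks.

Details

Motivation: In satellite imagery applications, users pose open-vocabulary natural-language queries, and existing vision-language models such as CLIP struggle to align these queries accurately with satellite image content.

Result: Experiments on three public benchmarks show that Open-SAT improves the F1 score by up to 16.04% while retrieving a comparable number of image tiles, demonstrating its effectiveness.

Insight: The innovation is an LLM-guided query embedding refinement method that runs at inference time and needs no additional training. The LLM supplies contextual information that improves generalization in open-vocabulary retrieval, and a threshold-free retrieval mechanism improves accuracy and efficiency.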

Abstract: In satellite applications, user queries often take the form of open-ended natural language, extending beyond a fixed set of predefined categories. This open-vocabulary nature poses significant challenges for retrieving relevant image tiles, as the retrieval system must generalize to a wide range of unseen objects and concepts. While vision-language models (VLMs) such as CLIP are widely used for text-image retrieval, even fine-tuned variants often struggle to accurately align such queries with satellite imagery. To address this, we propose Open-SAT, a training-free query embedding refinement algorithm that operates at inference time to improve alignment between user queries and satellite image content. Open-SAT uses VLMs to compute embeddings for image tiles, which are stored in a vector database for efficient retrieval. At query time, it leverages Large Language Models (LLMs) to refine the text embeddings by incorporating contextual information about objects of interest and their surroundings. A threshold-free retrieval mechanism further enhances accuracy and efficiency. Experimental results in three public benchmarks demonstrate that Open-SAT improves the F1 score by up to 16.04%, while retrieving a comparable number of image tiles. These results demonstrate the effectiveness of Open-SAT in open-vocabulary satellite image retrieval, leveraging LLM guidance without the need for additional training or supervision.


[28] Tamaththul3D: High-Fidelity 3D Saudi Sign Language Avatars from Monocular Video cs.CV | cs.AI [PDF]

Eyad Alghamdi, Sattam Altuuaim, Obay Ghulam, Abdulrahman Qutah, Yousef Basoodan

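TL;DR: This paper presents Tamaththul3D, a specialized pipeline for reconstructing high-fidelity 3D Saudi Sign Language avatars from monocular video, and releases the first high-quality 3D parametric annotations for the Ishara-500 Saudi Sign Language dataset.

Details

Motivation: Arabic Sign Language and its dialects serve approximately 400 million Arabic speakers worldwide, yet the field lacks high-quality 3D parametric annotations and reconstruction methods specialized for avatar generation.

Result: Tamaththul3D achieves state-of-the-art hand accuracy (up to 32% improvement over previous methods) while maintaining competitive body pose.

Insight: The main contributions are: 1) the first high-quality SMPL-X parametric annotation dataset for Saudi Sign Language; 2) a reconstruction pipeline designed for the unique articulation patterns of Arabic Sign Language, which integrates SMPLer-X, WiLoR, and MediaPipe and uses kinematic-chain-based wrist alignment with hybrid swing-twist decomposition and 2D-supervised joint optimization.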

Abstract: Arabic Sign Language (ArSL) and its dialects serve approximately 400 million Arabic speakers worldwide, yet the community lacks high-quality 3D parametric annotations and specialized reconstruction methods for avatar generation. We address this critical gap through two key contributions: First, we introduce the first high-quality 3D parametric annotations for the Ishara-500 Saudi Sign Language dataset, providing precise SMPL-X parameters for 500 culturally authentic SSL signs. Second, we present Tamaththul3D, a specialized reconstruction pipeline designed for ArSL’s unique articulation patterns. Our pipeline integrates SMPLer-X for robust body estimation, WiLoR for detailed hand refinement with automatic localization and mirroring, and MediaPipe for 2D pose supervision. Through kinematic-chain-based wrist alignment with hybrid swing-twist decomposition and 2D-supervised joint optimization, Tamaththul3D achieves state-of-the-art hand accuracy (up to 32% improvement over previous methods) while maintaining competitive body pose. Together, these 3D annotations and Tamaththul3D pipeline establish the first comprehensive framework for high-fidelity ArSL avatar reconstruction, enabling new accessibility technologies and cultural preservation efforts for the Arab Deaf community.
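
The swing-twist decomposition mentioned above splits a rotation into a twist about a chosen axis (for a wrist, typically the forearm direction) and a swing about an axis perpendicular to it. The sketch below shows the standard quaternion form, with quaternions as `(w, x, y, z)` tuples and `axis` a unit vector. The "hybrid" variant used in the paper is not described in the abstract and is not reproduced here.

```python
import math

def q_mul(a, b):
    """Hamilton product of two quaternions given as (w, x, y, z)."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return (aw * bw - ax * bx - ay * by - az * bz,
            aw * bx + ax * bw + ay * bz - az * by,
            aw * by - ax * bz + ay * bw + az * bx,
            aw * bz + ax * by - ay * bx + az * bw)

def swing_twist(q, axis):
    """Decompose unit quaternion q so that q = swing * twist,
    where twist rotates about `axis` and swing about an axis orthogonal to it."""
    w, x, y, z = q
    d = x * axis[0] + y * axis[1] + z * axis[2]  # vector part projected on axis
    tw = (w, d * axis[0], d * axis[1], d * axis[2])
    n = math.sqrt(sum(c * c for c in tw))
    if n > 1e-9:
        twist = tuple(c / n for c in tw)
    else:
        twist = (1.0, 0.0, 0.0, 0.0)  # degenerate case: treat as pure swing
    conj = (twist[0], -twist[1], -twist[2], -twist[3])
    swing = q_mul(q, conj)
    return swing, twist
```

Separating the two parts lets a pipeline constrain or correct forearm roll independently of wrist bending, which is a common reason for using this decomposition in character rigs.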


[29] LAMP: Localization Aware Multi-camera People Tracking in Metric 3D World cs.CV [PDF]

Nan Yang, Julian Straub, Fan Zhang, Richard Newcombe, Jakob Engel

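TL;DR: LAMP is a new framework for tracking 3D human motion from a multi-camera headset that handles the challenges of dynamic egocentric viewpoints by disentangling observer and target motion early. It takes two steps: first, it uses the known 6 DoF device motion and calibration to convert detected 2D keypoints into a unified 3D world coordinate frame; second, a spatio-temporal transformer fits 3D human motion directly.

Details

Motivation: Under severe egomotion, partial visibility or occlusion, and scarce training data on egocentric multi-camera headsets, existing methods designed for monocular video cannot make effective use of multi-view, calibrated, and localized input.

Result: LAMP achieves state-of-the-art results on monocular benchmarks and significantly outperforms baselines in the targeted egocentric setting.

Insight: The innovation is the two-step "lift-then-fit" framework: early motion disentanglement lets the model learn a human motion prior in world space and flexibly integrate information from multiple asynchronous, partially observing, and moving cameras. Viewed objectively, decoupling the 2D-to-3D conversion from motion modeling simplifies multi-view data fusion.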

Abstract: Tracking 3D human motion from an egocentric multi-camera headset is challenged by severe egomotion, partial visibility or occlusions, and a lack of training data. Existing methods designed for monocular video often require static or slowly moving cameras and cannot efficiently leverage multi-view, calibrated, and localized input. This makes them brittle and prone to fail on dynamic egocentric captures. We propose LAMP (Localization Aware Multi-camera People Tracking): a novel, simple framework to solve this via early disentanglement of observer and target motion. LAMP introduces a two-step process. First, we leverage the known 6 DoF device motion and calibration to convert detected 2D body keypoints from all cameras over a temporal window into a unified 3D world reference frame. Second, an end-to-end-trained spatio-temporal transformer fits 3D human motion directly to this 3D ray cloud. This “lift-then-fit” approach allows LAMP to learn and leverage a natural human motion prior in world space, and it provides an elegant framework for flexibly incorporating information from multiple temporally asynchronous, partially observing, and moving cameras. LAMP achieves state-of-the-art results on monocular benchmarks, while significantly outperforming baselines in our targeted egocentric setting.
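
The "lift" step can be pictured as back-projecting each detected keypoint into a ray expressed in world coordinates. The sketch below assumes an ideal pinhole camera with intrinsics `fx, fy, cx, cy` and a camera-to-world pose `(R_wc, t_wc)`; headset cameras often use wide-angle models, so the actual projection in the paper may differ. The function name and argument layout are hypothetical.

```python
def pixel_to_world_ray(u, v, fx, fy, cx, cy, R_wc, t_wc):
    """Back-project pixel (u, v) into a world-frame ray.
    R_wc is a 3x3 rotation (nested lists) and t_wc a length-3 translation,
    together mapping camera coordinates to world coordinates.
    Returns (origin, unit direction)."""
    d_cam = [(u - cx) / fx, (v - cy) / fy, 1.0]  # direction in the camera frame
    d_world = [sum(R_wc[i][j] * d_cam[j] for j in range(3)) for i in range(3)]
    norm = sum(c * c for c in d_world) ** 0.5
    return list(t_wc), [c / norm for c in d_world]
```

Collecting such rays from every camera and time step in a window gives a set of 3D rays in one reference frame, which corresponds to the "3D ray cloud" the transformer is fitted to.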


[30] Safety-Critical Camera Reliability Monitoring for ADAS via Degradation-Aware Uncertainty Pattern Analysis cs.CV [PDF]

Shiva Aher

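TL;DR: This paper proposes a proactive camera reliability monitoring framework that analyzes degradation-induced perception uncertainty patterns to estimate perception risk before downstream task performance observably degrades, improving ADAS safety. It introduces a Global Sensor Health Index (GSHI), which aggregates the severities of multiple degradation modes with a risk-aware multiplicative formulation, and uses a lightweight multi-task network to predict degradation type, severity, GSHI, and spatial uncertainty maps from a single RGB image.

Details

Motivation: Reliable camera input is essential for safety-critical ADAS perception, but existing monitoring methods usually detect sensor failures only after downstream performance has already degraded, so a proactive method that warns earlier is needed.

Result: In degradation experiments derived from KITTI, GSHI decreases monotonically with degradation severity, achieves a health-estimation MAE of 0.064, and provides an early-warning lead of 0.47 severity units on average before YOLOv8 detection failure. GSHI outperforms image quality assessment, detector-confidence, and clean-feature out-of-distribution detection baselines, and transfers zero-shot to real adverse-weather driving data.

Insight: The innovation is a proactive monitoring framework based on degradation-aware uncertainty analysis: it is trained with synthetic supervision to predict multiple degradation modes and defines the risk-aware GSHI metric to quantify camera health, enabling lightweight real-time monitoring without downstream task feedback and pointing to a new direction for sensor reliability monitoring in intelligent vehicles.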

Abstract: Reliable camera input is essential for safety-critical ADAS perception, but most monitoring approaches detect sensor failures only after downstream performance has degraded. We propose a proactive camera reliability monitoring framework that estimates perception risk from degradation-induced uncertainty patterns before downstream failure becomes observable. The method introduces a Global Sensor Health Index (GSHI), a continuous reliability score that aggregates per-degradation severities using a risk-aware multiplicative formulation, allowing severe single-mode failures such as lens occlusion or motion blur to dominate the health estimate. A lightweight multi-task network predicts degradation type, severity, GSHI, and spatial uncertainty maps from a single RGB image without downstream task feedback. Training uses physics- and geometry-aware synthetic supervision over twelve camera degradation modes. Experiments on KITTI-derived degradations show that GSHI decreases monotonically with severity, achieves a health-estimation MAE of 0.064, and provides positive early-warning lead time of 0.47 $\pm$ 0.25 severity units before YOLOv8 detection failure. GSHI also outperforms IQA, detector-confidence, and clean-feature OOD baselines, and transfers zero-shot to real adverse-weather driving data. These results support degradation-aware uncertainty analysis as a practical direction for proactive camera reliability monitoring in intelligent vehicles.
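
The abstract does not give the GSHI formula. One multiplicative form with the stated property, that a single severe degradation dominates the score, is a product of per-mode "health" factors, sketched below. The exact form, the optional per-mode `weights`, and the mode names are assumptions for illustration only.

```python
def gshi(severities, weights=None):
    """Assumed multiplicative health index.
    severities: dict mapping degradation mode -> severity in [0, 1].
    Returns a health score in [0, 1], where 1 means fully healthy."""
    health = 1.0
    for mode, s in severities.items():
        w = 1.0 if weights is None else weights.get(mode, 1.0)
        risk = min(1.0, max(0.0, w * s))  # clamp weighted severity to [0, 1]
        health *= 1.0 - risk
    return health
```

With an averaging rule, one mode at severity 0.9 among three otherwise healthy modes would still yield a health of 0.7, whereas the product above yields about 0.1, which is the behavior the abstract describes for failures such as lens occlusion.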


[31] EchoXFlow: A Beamspace Echocardiography Dataset for Cardiac Motion, Flow, and Function cs.CV [PDF]

Elias Stenhede, Joanna Sulkowska, Eivind Bjørkan Orstad, Henrik Schirmer, Arian Ranjbar

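TL;DR: This paper introduces EchoXFlow, a clinical echocardiography dataset that preserves ultrasound data in its native acquisition geometry (beamspace) rather than as conventional scan-converted Cartesian videos. It contains 37,125 recordings from 666 routine-care examinations, covering temporally resolved 1D, 2D, and 3D data and multiple Doppler modalities, with synchronized ECG and dense clinical annotations.

Details

Motivation: Existing public ultrasound datasets typically lack Doppler data or fuse it as RGB overlays, and their data pass through lossy vendor display processing before release. This limits opportunities to study, on a physically grounded basis, the cross-modal relationships between cardiac anatomy, myocardial motion, and blood flow.

Result: The paper reports no specific quantitative results or benchmark comparisons. It describes the content, scale, and characteristics of the EchoXFlow dataset, which is intended as a testbed for related research.

Insight: The main contribution is a multimodal echocardiography dataset that preserves the original acquisition geometry, timing, and modality relationships, supporting physically grounded cross-modal learning tasks that conventional scan-converted videos cannot support. The dataset fills a gap in the field and is a valuable resource for research on 4D vision and physically grounded multimodal learning.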

Abstract: We introduce EchoXFlow, a clinical echocardiography dataset for learning from ultrasound in its native acquisition geometry rather than from scan-converted Cartesian videos. Existing public datasets offer limited opportunities to study cross-modal relationships between cardiac anatomy, myocardial motion, and blood flow, as Doppler is typically absent or fused as RGB overlays, and acquisitions are released after lossy vendor display processing. EchoXFlow comprises 37125 recordings from 666 routine-care examinations, preserving the timing, geometry, and modality relationships needed for physically grounded echo learning. Each recording is retained as separable modality-specific streams: temporally resolved 1D, 2D, and 3D data alongside multiple Doppler modalities, paired with a synchronized ECG. Clinical annotations span guideline-based measurements to dense 2D myocardial contours and 3D left-ventricular endocardial meshes. With its associated open-source tooling, EchoXFlow enables cross-modal, acquisition-aware learning tasks that cannot be formulated from conventional scan-converted videos alone, and serves as a testbed for 4D vision and physically grounded multi-modal learning more broadly.


[32] An extremely coarse feedback signal is sufficient for learning human-aligned visual representations cs.CV

Yash Mehta, Michael F. Bonner

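TL;DR: This paper systematically studies how the granularity of the learning signal affects the alignment of neural-network visual representations with human vision. Models trained with only a very coarse classification signal (for example, distinguishing 8 categories) match or exceed models trained with fine-grained supervision (for example, 1,000-class classification) in alignment with macaque electrophysiology recordings and human fMRI responses, and they perform better on human perceptual similarity judgments.

Details

Motivation: To explore how supervisory signal granularity, from fine-grained to coarse classification, affects the alignment of artificial neural network visual representations with the primate visual system, and thereby to understand what learning signals brain-aligned models require.

Result: On an ImageNet-style dataset, classification tasks of varying granularity (2 to 64 categories), built via PCA-based splits of pretrained embeddings, were used to train convolutional and transformer networks. Models trained with only 8 coarse categories match or exceed 1,000-class models in alignment with macaque IT cortex recordings and human fMRI data. On human perceptual similarity judgments (such as the odd-one-out task), the coarsely trained models outperform all other evaluated models, including fine-grained supervised models, self-supervised models, and large vision models.

Insight: The work is the first to show systematically that extremely coarse supervision suffices to induce representations highly aligned with human vision, challenging the field's trend toward ever finer-grained supervisory signals (such as contrastive learning). Methodologically, data-driven PCA splits of pretrained embeddings give parametric control over signal granularity, pointing toward AI systems that are better aligned with human perception.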

Abstract: Artificial neural networks trained on visual tasks develop internal representations resembling those of the primate visual system, a discovery that has guided a decade of computational neuroscience. Research on building brain-aligned models has progressively embraced finer-grained supervisory signals, from object classification to contrastive self-supervised objectives that maximize distinctions among individual images, yet the role of supervisory signal granularity in brain alignment remains largely unexamined. Here we systematically investigate how the coarseness of a learning signal shapes representational alignment with human vision. We parametrically vary the level of signal granularity using a data-driven approach that partitions a set of training images into varied numbers of categories (2, 4, 8, 16, …, 64) via PCA-based splits of pretrained embeddings. We train hundreds of neural networks across convolutional and transformer architectures on these coarse classification tasks and compare their representations to macaque electrophysiology recordings and human fMRI responses. We find that networks trained to distinguish as few as 8 broad categories learn representations that match or exceed the neural alignment of models distinguishing 1,000 classes. Even more strikingly, these coarsely trained networks align more closely with human perceptual similarity judgments than all other models evaluated, including networks trained with fine-grained supervision or self-supervision as well as leading large-scale vision models. These results demonstrate that human-like visual representations emerge from remarkably coarse feedback, reframing what learning signals vision may require and opening a path toward building AI systems that are more aligned with human perception.
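
The abstract says coarse categories are produced by PCA-based splits of pretrained embeddings, without giving the procedure. A minimal sketch of one way this could work follows, under assumptions that are not from the paper: each group is bisected recursively at the median of its projection onto the first principal direction, so `depth` splits yield `2**depth` categories. The embeddings are random stand-ins, and the function names are hypothetical.

```python
import random

def first_principal_direction(points, iters=50, seed=0):
    """Approximate the top principal direction of `points` by power iteration."""
    n, dim = len(points), len(points[0])
    mean = [sum(p[d] for p in points) / n for d in range(dim)]
    centered = [[p[d] - mean[d] for d in range(dim)] for p in points]
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    for _ in range(iters):
        # One multiplication by the unnormalised covariance: w = X^T (X v).
        proj = [sum(c[d] * v[d] for d in range(dim)) for c in centered]
        w = [sum(proj[i] * centered[i][d] for i in range(n)) for d in range(dim)]
        norm = sum(x * x for x in w) ** 0.5
        if norm == 0.0:  # degenerate case: all points identical
            break
        v = [x / norm for x in w]
    return mean, v

def pca_split(indices, embeddings, depth):
    """Recursively bisect `indices` at the median first-PC score."""
    if depth == 0 or len(indices) < 2:
        return [list(indices)]
    mean, v = first_principal_direction([embeddings[i] for i in indices])
    score = {i: sum((embeddings[i][d] - mean[d]) * v[d] for d in range(len(v)))
             for i in indices}
    ordered = sorted(indices, key=score.__getitem__)
    half = len(ordered) // 2
    return (pca_split(ordered[:half], embeddings, depth - 1)
            + pca_split(ordered[half:], embeddings, depth - 1))

rng = random.Random(1)
embeddings = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(64)]
groups = pca_split(list(range(64)), embeddings, depth=3)  # 8 coarse categories
labels = {i: g for g, members in enumerate(groups) for i in members}
```

Here `labels` plays the role of the coarse training targets: every image index receives one of 8 category ids, and changing `depth` changes the granularity of the signal.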


[33] Text-to-CAD Retrieval: a Strong Baseline cs.CV

Honghu Pan, Zibo Du, Daxiang Liu, Chengliang Liu, Xiaoling Luo

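TL;DR: This paper formally introduces text-to-CAD retrieval, a new cross-modal retrieval task that retrieves semantically relevant CAD models from large-scale databases given natural language queries. Using paired text-CAD annotations from the Text2CAD dataset, the authors establish a practical benchmark. For text-based retrieval they propose a unified framework that learns multi-modal CAD embeddings from procedural sequences and geometric point clouds. It comprises a sequence encoder, a point encoder, and a text encoder. During training, a novel feature decoder reconstructs masked sequence features to encourage implicit multi-modal alignment; at inference, this auxiliary decoder is removed for efficient retrieval.

Details

Motivation: Existing CAD repositories are typically searched by filename or directory, which limits the efficiency, scalability, and accuracy of design retrieval. Text-to-CAD retrieval is a critical yet underexplored task for reusing legacy industrial designs.

Result: The authors establish a benchmark on the Text2CAD dataset and propose a unified framework that serves as a strong baseline. The abstract gives no specific quantitative results (such as accuracy) or direct comparisons with other methods, but the framework lays a foundation for downstream paradigms such as retrieval-augmented generation.

Insight: The main contributions are: 1) formally defining text-to-CAD retrieval as a new task and establishing a benchmark; 2) a unified framework that learns multi-modal CAD embeddings from procedural sequences and point clouds; 3) a novel feature decoder that reconstructs masked sequence features via cross-attention to encourage multi-modal alignment during training and is removed at inference for efficient retrieval. These ideas are relevant to both cross-modal retrieval and CAD generation tasks.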

Abstract: Text-based retrieval of Computer-Aided Design (CAD) models is a critical yet underexplored task for the reuse of legacy industrial designs. Existing CAD repositories are typically searched using filenames or directories, which limits the efficiency, scalability, and accuracy of design retrieval. In this paper, we formally introduce text-to-CAD retrieval as a new cross-modal retrieval task, aiming to retrieve semantically relevant CAD models from large-scale databases given natural language queries. Leveraging paired text-CAD annotations from the Text2CAD dataset, we establish a practical benchmark for this task. To achieve text-based retrieval, we propose a unified framework that learns multi-modal CAD embeddings from both procedural sequences and geometric point clouds. Specifically, a sequence encoder captures the construction logic of CAD models, while a point encoder extracts explicit geometric features. A text encoder is used to learn semantic representations of textual queries. During training, we introduce a novel feature decoder that reconstructs masked sequence features via cross-attention with text and point features, encouraging implicit multi-modal alignment. At inference time, we remove this auxiliary decoder to enable efficient retrieval using concatenated sequence-point features. Our framework serves as a strong baseline for text-to-CAD retrieval and lays the foundation for downstream CAD generation paradigms, such as retrieval-augmented generation. The source code will be released.
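
The inference path described in the abstract, retrieval with concatenated sequence-point features, can be illustrated with a toy ranking routine. This is a sketch under stated assumptions rather than the released implementation: the encoders are replaced by hand-written feature vectors, the text embedding is assumed to have the same dimension as the concatenated CAD embedding, and cosine similarity is an assumed scoring function.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def cad_embedding(seq_feat, point_feat):
    """Concatenate sequence and point features, then normalise."""
    return l2_normalize(list(seq_feat) + list(point_feat))

def retrieve(text_feat, gallery, top_k=2):
    """Rank gallery items by cosine similarity to the text query."""
    q = l2_normalize(text_feat)
    scored = [(sum(a * b for a, b in zip(q, emb)), name)
              for name, emb in gallery.items()]
    scored.sort(reverse=True)
    return [name for _, name in scored[:top_k]]

# Toy gallery; the feature values are made up for illustration.
gallery = {
    "bracket": cad_embedding([1.0, 0.0], [0.9, 0.1]),
    "gear": cad_embedding([0.0, 1.0], [0.1, 0.9]),
    "flange": cad_embedding([0.7, 0.7], [0.5, 0.5]),
}
query = [0.9, 0.1, 1.0, 0.0]  # stand-in for a text-encoder output
print(retrieve(query, gallery))
```

Because gallery embeddings can be computed once and cached, only the text encoder and a dot product run per query, which is the efficiency argument for dropping the auxiliary decoder at inference.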


[34] RAM-H1200: A Unified Evaluation and Dataset on Hand Radiographs for Rheumatoid Arthritis cs.CV | cs.LG

Songxiao Yang, Haolin Wang, Yao Fu, Junmu Peng, Lin Fan

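TL;DR: RAM-H1200 is a unified benchmark dataset for rheumatoid arthritis (RA) assessment. It contains 1,200 hand radiographs from six medical centers with multi-level annotations, ranging from whole-hand bone structure instance segmentation to pixel-level bone erosion (BE) masks, joint region definitions, and clinical scores based on the SvdH standard. It is designed to evaluate how well models capture anatomical structure, localized pathological changes, and clinical severity from radiographs.

Details

Motivation: Existing public datasets do not support unified multi-level analysis of hand radiographs. Annotations for quantitative bone erosion analysis are especially scarce, so these datasets cannot meet the need to analyze anatomical structure and fine-grained pathological changes together and to integrate them with clinical scoring systems.

Result: Results on the benchmark tasks show that anatomical modeling (such as whole-hand bone segmentation) already achieves strong performance, whereas pixel-level bone erosion (BE) segmentation remains a major open challenge and is far from mature.

Insight: The contribution is the first unified large-scale public benchmark integrating anatomical modeling, quantitative lesion analysis, and clinical scoring. Its pixel-level BE masks provide, for the first time, explicit spatial supervision for the extent and morphology of erosive lesions, enabling quantitative analysis beyond coarse categorical grading.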

Abstract: Rheumatoid arthritis (RA) assessment from hand radiographs requires multi-level analysis and modeling of anatomical structures and fine-grained local pathological changes. However, existing public resources do not support such unified multi-level analysis, often lacking full-hand coverage, fine-grained annotations, and consistent integration with clinical scoring systems. In particular, annotations that enable quantitative analysis of bone erosion (BE) remain scarce. RAM-H1200 contains 1,200 hand radiographs collected from six medical centers, with multi-level annotations including (i) whole-hand bone structure instance segmentation, (ii) pixel-level BE masks, (iii) SvdH-defined joint regions of interest, and (iv) joint-level SvdH scores for both BE and joint space narrowing (JSN). It is designed to evaluate whether models can jointly capture anatomical structure, localized erosive pathology, and clinically standardized RA severity from hand radiographs. The proposed BE masks enable, for the first time, quantitative BE analysis beyond coarse categorical grading by providing explicit spatial supervision for lesion extent and morphology. To our knowledge, RAM-H1200 is the first public large-scale benchmark that jointly supports whole-hand bone structure instance segmentation, pixel-level BE delineation, and clinically grounded joint-level SvdH scoring for both BE and JSN. Results across benchmark tasks show that anatomical modeling is substantially more mature than quantitative BE analysis: whole-hand bone segmentation achieves strong performance, whereas BE segmentation remains a major open challenge. By unifying anatomical structure modeling, quantitative lesion analysis, and clinically grounded SvdH scoring, RAM-H1200 provides a single benchmark for comprehensive RA analysis on hand radiographs.


[35] Leveraging Image Generators to Address Training Data Scarcity: The Gen4Regen Dataset for Forest Regeneration Mapping cs.CV | cs.AI | cs.LG | cs.RO

Gabriel Jeanson, David-Alexandre Duclos, William Larrivée-Hardy, Noé Cochet, Matěj Boxan

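TL;DR: This paper proposes using a large-scale vision-language model to generate synthetic data that addresses data scarcity and class imbalance in fine-grained semantic segmentation of forest regeneration species. By combining real data with AI-generated images, the authors build the Gen4Regen dataset and show that synthetic data significantly improves model performance, especially for underrepresented species.

Details

Motivation: Traditional forest surveys are labour-intensive and geographically constrained, while deep learning-based interpretation of UAV imagery is bottlenecked by a severe scarcity of expert-annotated images, particularly in complex, visually heterogeneous regeneration zones.

Result: In experiments on the expanded natural forest dataset WilDReF-Q-V2 and the synthetic Gen4Regen dataset, unified training improves the F1 score by more than 15 percentage points over purely supervised baselines, and some underrepresented species gain up to 30 percentage points in per-species F1.

Insight: The contribution is to use a vision-language model (Nano Banana Pro) to generate high-fidelity images together with their pixel-aligned semantic masks from prompts, acting as an agile data generator that bootstraps perception tasks in niche AI domains where expert labels are scarce. The results indicate that AI-generated data is highly complementary to real data and mitigates both data scarcity and class imbalance.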

Abstract: Sustainable forest management relies on precise species composition mapping, yet traditional ground surveys are labour-intensive and geographically constrained. While Uncrewed Aerial Vehicles (UAVs) offer scalable data collection, the transition to deep learning-based interpretation is bottlenecked by the severe scarcity of expert-annotated imagery, particularly in complex, visually heterogeneous regeneration zones. This paper addresses the dual challenges of data scarcity and extreme class imbalance in the semantic segmentation of fine-grained forest regeneration species by providing a scalable framework that reduces reliance on manual photo-interpretation for high-resolution, millimetre-level aerial imagery. Importantly, we leverage the large-scale vision-language Nano Banana Pro model to simultaneously generate high-fidelity images and their corresponding pixel-aligned semantic masks from prompts. We introduce WilDReF-Q-V2, an expansion of a natural forest dataset with 13 977 new unlabelled and 50 labelled real images, as well as the Gen4Regen dataset, featuring 2101 pairs of synthetic images and semantic masks. Our methodology integrates real-world data with AI-generated images, highlighting that AI-generated data is highly complementary to real-world data, with unified training yielding an F1 score improvement of over 15 %pt compared to purely supervised baselines. Furthermore, we demonstrate that even small quantities of prompt-generated data significantly improve performance for underrepresented species, some of which saw per-species F1 score gains of up to 30 %pt. We conclude that vision-language models can serve as agile data generators, effectively bootstrapping perception tasks for niche AI domains where expert labels are scarce or unavailable. Our datasets, source code, and models will be available at https://norlab-ulaval.github.io/gen4regen.


[36] Learning a Delighting Prior for Facial Appearance Capture in the Wild cs.CV | cs.GR

Yuxuan Han, Xin Ming, Tianxiao Li, Zhuofan Shen, Qixuan Zhang

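TL;DR: This paper proposes a new paradigm for facial appearance capture: training a powerful delighting network as a prior that constrains the optimization, which enables high-quality facial reflectance estimation from in-the-wild smartphone captures. The network is trained on the OLAT dataset and rendered Light Stage scans, with Dataset Latent Modulation (DLM) proposed to integrate these heterogeneous data sources. The result is a simple, automatic capture pipeline that outperforms existing methods by a large margin.

Details

Motivation: High-quality facial appearance capture has traditionally required costly studio equipment, while existing in-the-wild smartphone approaches based on model-based inverse rendering struggle to disentangle reflectance from unknown illumination. The paper aims to bridge this gap with a powerful delighting prior that simplifies in-the-wild capture and raises its quality.

Result: The method outperforms prior work in reflectance estimation by a large margin, and the authors use it to convert the multi-view NeRSemble dataset into NeRSemble-Scan, a large-scale collection of 4K-resolution relightable scans.

Insight: The core idea is the shift from purely model-based inverse rendering to training a powerful delighting network as a prior that constrains the optimization. Dataset Latent Modulation conditions the core network on learnable source-aware tokens, decoupling dataset-specific styles from physical delighting principles and yielding a general prior that outperforms existing proprietary models. This provides a new foundation for high-quality in-the-wild facial capture and digital human research.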

Abstract: High-quality facial appearance capture has traditionally required costly studio recording. Recent works consider an in-the-wild smartphone-based setup; however, their model-based inverse rendering paradigm struggles with the complex disentanglement of reflectance from unknown illumination. To bridge this gap, we propose to shift the paradigm into training a powerful delighting network as a prior to constrain the optimization. We leverage the OLAT dataset and the rendered Light Stage scans for training, and propose Dataset Latent Modulation (DLM) to seamlessly integrate these heterogeneous data sources. Specifically, by conditioning the core network on learnable source-aware tokens, we decouple dataset-specific styles from physical delighting principles, enabling the emergence of a delighting prior that outperforms existing proprietary models. This powerful delighting prior enables a simple and automatic appearance capture pipeline that achieves high-quality reflectance estimation from casual video inputs, outperforming prior art by a large margin. Furthermore, we leverage our appearance capture method to transform the multi-view NeRSemble dataset into NeRSemble-Scan, a large-scale collection of 4K-resolution relightable scans. By open-sourcing our model and the NeRSemble-Scan dataset, we democratize high-end facial capture and provide a new foundation for the research community to build photorealistic digital humans.


[37] AffectSeek: Agentic Affective Understanding in Long Videos under Vague User Queries cs.CV

Zhen Zhang, Yuhang Yang, Yunxiang Jiang, Yuhuan Lu, Haifeng Lu

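TL;DR: This paper introduces a new task, Vague-Query-driven video Affective Understanding (VQAU), which addresses affective understanding in long videos under vague natural-language user queries. The authors construct the VQAU-Bench benchmark and propose AffectSeek, an agentic framework that decomposes the task into intent interpretation, candidate localization, clip verification, emotion reasoning, and rationale generation in order to actively locate, verify, and explain affective moments in long videos.

Details

Motivation: Existing affective understanding research focuses mainly on recognizing emotions from images, audio, or pre-clipped video clips. This passive, clip-centered setting does not reflect real scenarios in which users express their needs through vague natural-language queries over long videos.

Result: Experiments show that VQAU remains challenging for existing affective recognition models and single-step vision-language models, while AffectSeek provides a simple yet effective agentic framework for long-video affective understanding.

Insight: The contributions are the new VQAU task with its benchmark VQAU-Bench, and the AffectSeek framework, which progressively aligns vague user intent with long-video evidence through role-specialized reasoning and cross-stage verification, enabling active, multi-step affective understanding.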

Abstract: Existing affective understanding studies have mainly focused on recognizing emotions from images, audio signals, or pre-clipped video clips, where the affective evidence is already given. This passive and clip-centered setting does not fully reflect real-world scenarios, in which users often interact with long videos and express their needs through natural-language queries. In this paper, we study Vague-Query-driven video Affective Understanding (VQAU), a new task that requires models to localize affective moments in long videos, predict their emotion categories, and generate evidence-grounded rationales under vague user queries. To support this task, we construct VQAU-Bench, a benchmark that integrates long videos, vague affective queries, temporal clip annotations, emotion labels, and rationale explanations into a unified evaluation framework. VQAU-Bench enables systematic assessment of semantic-temporal-affective alignment, affective moment localization, emotion classification, and rationale generation. To address the multi-step reasoning challenges of VQAU, we further propose AffectSeek, an agentic framework that actively seeks, verifies, and explains affective moments in long videos. AffectSeek decomposes VQAU into intent interpretation, candidate localization, clip verification, emotion reasoning, and rationale generation, and progressively aligns vague user intent with long-video evidence through role-specialized reasoning and cross-stage verification. Experiments show that VQAU remains challenging for existing affective recognition models and single-step vision-language models, while AffectSeek provides a simple yet effective framework for agentic long-video affective understanding.


[38] MUSE: Resolving Manifold Misalignment in Visual Tokenization via Topological Orthogonality cs.CV

Panqi Yang, Haodong Jing, Jiahao Chao, Tingyan Xiang, Li Lin

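TL;DR: MUSE is a visual tokenization framework based on Topological Orthogonality. By treating structure as an orthogonal bridge, it decouples optimization within Transformers and thereby resolves the trade-off in unified visual tokenization between high-fidelity pixel reconstruction and semantic abstraction.

Details

Motivation: Unified visual tokenization faces a fundamental trade-off between high-fidelity pixel reconstruction (spatial equivariance) and semantic abstraction (conceptual invariance). The authors attribute it to Manifold Misalignment: naive joint optimization induces opposing gradients, creating a zero-sum game between reconstruction and perception.

Result: Experiments show that MUSE breaks this trade-off, achieving state-of-the-art generation quality (gFID 3.08) and surpassing its teacher model InternViT-300M in linear probing (85.2% vs. 82.5%).

Insight: The core idea is to use Topological Orthogonality to decouple structural from semantic gradients: structural gradients refine the attention topology while semantic gradients update feature values. This turns destructive interference into mutual reinforcement, so that structurally aligned reconstruction enhances semantic perception.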

Abstract: Unified visual tokenization faces a fundamental trade-off between high-fidelity pixel reconstruction (spatial equivariance) and semantic abstraction (conceptual invariance). We attribute this conflict to Manifold Misalignment: naive joint optimization induces opposing gradients, creating a zero-sum game between reconstruction and perception. To address this, we propose MUSE, a framework based on Topological Orthogonality. By treating Structure as an orthogonal bridge, MUSE decouples optimization within Transformers: structural gradients refine attention topology, while semantic gradients update feature values. This turns destructive interference into Mutual Reinforcement. Experiments show that MUSE breaks the trade-off, achieving state-of-the-art generation quality (gFID 3.08) and surpassing its teacher InternViT-300M in linear probing (85.2% vs. 82.5%), demonstrating that structurally aligned reconstruction can enhance semantic perception. Code is available at https://github.com/PanqiYang1/MUSE.


[39] EGA: Adapting Frozen Encoders for Vector Search with Bounded Out-of-Distribution Degradation cs.CV | cs.AI | cs.LG

Dongfang Zhao

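TL;DR: This paper proposes Euclidean Geodesic Alignment (EGA), a residual adapter that addresses the performance drop of vector search systems built on frozen vision encoders when queries come from unseen classes. Through three principles (zero initialization, local triplet loss, and hypersphere projection), EGA induces a self-limiting dynamic in which the adapter automatically stops updating where the local geometry is already correct, maintaining seen-class performance while minimizing disturbance to unseen-class regions.

Details

Motivation: Vector search systems built on frozen encoders face queries from unseen classes at deployment, and existing adapter training fails under this distribution shift: high-capacity adapters with global contrastive losses wrongly assign unseen-class samples to seen-class clusters, dropping worst-case Label Precision by more than 40 points below the frozen baseline in the authors' tests.

Result: Across five diverse out-of-distribution (OOD) benchmarks, EGA achieves the highest worst-case Label Precision on the four primary splits and a consistent improvement on the fifth. At convergence, 96.5% of triplets produce no gradient, and the adapter design also transfers to stronger backbones beyond CLIP.

Insight: The contribution is the coupling of zero initialization, local triplet loss, and hypersphere projection, which induces a gradient-sparse, self-limiting dynamic that refines seen classes while perturbing unseen-class regions only boundedly. The accompanying analysis links gradient sparsity to bounded OOD perturbation, giving adapter design a theoretical basis for robustness under distribution shift.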

Abstract: Vector search systems built on frozen vision encoders face queries from unseen classes at deployment, yet existing adapter training collapses under this shift: high-capacity adapters with global contrastive losses silently reassign unseen-class samples to wrong seen-class clusters, dropping worst-case Label Precision by over 40 points below the frozen baseline in our tests. We propose Euclidean Geodesic Alignment (EGA), a residual adapter that couples three principles: zero initialization, local triplet loss, and hypersphere projection. These collectively induce a self-limiting dynamic: triplets that already satisfy a small margin stop producing gradients, so the adapter automatically stops updating where the local geometry is already correct. Our experiments show that at convergence 96.5% of triplets are gradient-free, leaving unseen-class regions largely untouched while still enabling full-capacity refinement of seen classes. Across five diverse out-of-distribution (OOD) benchmarks, EGA achieves the highest worst-case Label Precision on the four primary splits and a consistent improvement on the fifth. The design also transfers to stronger backbones in addition to CLIP, and we provide an analytical justification linking gradient sparsity to bounded OOD perturbation.
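
The three principles named in the abstract can be made concrete with a small numerical sketch. The code below is illustrative only and makes assumptions the abstract does not confirm: the adapter is taken to be a single linear residual `x + W x`, distances are squared Euclidean distances between unit vectors, and the margin value is arbitrary. It shows that a zero-initialised adapter reproduces the frozen (normalised) embedding, and that a triplet which already satisfies the margin has exactly zero hinge loss and therefore contributes no gradient.

```python
import math

def normalize(v):
    """Project a vector onto the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

class ResidualAdapter:
    """f(x) = normalize(x + W x), with W zero-initialised (linear form assumed)."""

    def __init__(self, dim):
        self.W = [[0.0] * dim for _ in range(dim)]

    def __call__(self, x):
        delta = [sum(w * xi for w, xi in zip(row, x)) for row in self.W]
        return normalize([xi + di for xi, di in zip(x, delta)])

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def triplet_loss(anchor, positive, negative, margin=0.1):
    """Hinge loss: exactly zero (hence zero gradient) once the margin is met."""
    return max(0.0, sq_dist(anchor, positive) - sq_dist(anchor, negative) + margin)

adapter = ResidualAdapter(dim=2)
a, p, n = (adapter(x) for x in ([1.0, 0.0], [0.9, 0.1], [0.0, 1.0]))
satisfied = triplet_loss(a, p, n)  # positive already much closer than negative
violated = triplet_loss(a, n, p)   # roles swapped: the margin is violated
print(satisfied, violated)
```

Only the violated triplet would drive an update to `W`; regions of the embedding space where every local triplet is already satisfied are left as the frozen encoder produced them.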


[40] MotionGRPO: Overcoming Low Intra-Group Diversity in GRPO-Based Egocentric Motion Recovery cs.CV

Nanjie Yao, Junlong Ren, Wenhao Shen, Hao Wang

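TL;DR: This paper proposes MotionGRPO, a framework for recovering full-body 3D human motion from head-mounted device signals. It uses reinforcement learning post-training to inject fine-grained guidance into the diffusion process, addressing the local joint reconstruction errors that existing diffusion methods incur from global distribution matching, and introduces a noise-injection strategy that increases sample diversity and stabilizes learning.

Details

Motivation: Existing diffusion-based methods for recovering full-body 3D motion from head-mounted device signals rely on global distribution matching, which leads to local joint reconstruction errors. Fine-grained guidance is needed to improve precision.

Result: Extensive experiments show that MotionGRPO achieves state-of-the-art performance with superior visual fidelity.

Insight: The contributions are: modeling diffusion sampling as a Markov decision process optimized via GRPO; a hybrid reward mechanism that combines a learned conditioned perceptual model with explicit local constraints; and a noise-injection strategy that increases intra-group sample diversity to counter vanishing gradients.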

Abstract: This paper studies full-body 3D human motion recovery from head-mounted device signals. Existing diffusion-based methods often rely on global distribution matching, leading to local joint reconstruction errors. We propose MotionGRPO, a novel framework leveraging reinforcement learning post-training to inject fine-grained guidance into the diffusion process. Technically, we model diffusion sampling as a Markov decision process optimized via Group Relative Policy Optimization (GRPO). To this end, we introduce a hybrid reward mechanism that combines a learned conditioned perceptual model for global visual plausibility and explicit constraints for local joint precision. Our key technical insight is that policy optimization in diffusion-based recovery suffers from vanishing gradients due to limited intra-group sample diversity. To address this, we further introduce a noise-injection strategy that explicitly increases sample variance and stabilizes learning. Extensive experiments demonstrate that MotionGRPO achieves state-of-the-art performance with superior visual fidelity.
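
The link between low intra-group diversity and vanishing gradients follows from how GRPO scores samples: each sample's advantage is its reward standardised against the other samples in its group. The sketch below uses the commonly cited form (reward minus group mean, divided by group standard deviation plus a small epsilon); whether MotionGRPO uses exactly this normalisation is an assumption, and the reward values are invented. The noise-injection strategy itself is not implemented here; the second group merely stands in for a more diverse set of samples.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Standardise each reward against its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Limiting case: the group collapsed to identical samples, so every
# advantage is zero and the policy-gradient term vanishes.
collapsed = [0.80, 0.80, 0.80, 0.80]

# A more diverse group yields a usable ranking signal.
diverse = [0.60, 0.75, 0.85, 1.00]

print(group_relative_advantages(collapsed))
print(group_relative_advantages(diverse))
```

Since the policy gradient is weighted by these advantages, a group with no reward spread contributes nothing to the update, which motivates explicitly increasing sample variance during rollouts.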


[41] CFE-PPAR: Compression-friendly encryption for privacy-preserving action recognition leveraging video transformers cs.CV | cs.AI | cs.CR

Haiwei Lin, Shoko Imaizumi, Hitoshi Kiya

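TL;DR: This paper proposes CFE-PPAR, a compression-friendly encryption method for privacy-preserving action recognition. Encrypted videos can still be recognized directly by a video transformer after compression, addressing the sharp performance drop that existing encryption methods suffer under video compression.

Details

Motivation: Existing encryption-based privacy-preserving action recognition methods suffer a catastrophic decrease in recognition performance and visual quality when encrypted videos are compressed; that is, they are not compression-friendly. An encryption scheme that maintains high performance under compression is needed.

Result: On the UCF101 and HMDB51 datasets under Motion-JPEG and H.264 compression, CFE-PPAR outperforms previous methods.

Insight: The contribution is the first compression-friendly encryption method for PPAR: the model parameters are transformed with the same keys used for video encryption, so the video transformer can process encrypted, compressed videos directly. This offers a new approach to making encryption compatible with compression in privacy-preserving vision tasks.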

Abstract: Privacy-preserving action recognition (PPAR) enables machines to understand human activities in videos without revealing sensitive visual content. Among the various strategies for PPAR, encryption-based methods achieve strong privacy protection while maintaining high recognition performance. However, these methods lead to a catastrophic decrease in recognition performance and visual quality when the encrypted videos are compressed. That is, the previous methods are not compression-friendly. To address these issues, in this paper, we propose the first compression-friendly encryption method for PPAR, called CFE-PPAR. In CFE-PPAR, videos encrypted with secret keys can be directly recognized by a video transformer, which uses parameters transformed by the same keys as those used for video encryption. In experiments, it is verified that CFE-PPAR outperforms previous methods on the UCF101 and HMDB51 datasets under Motion-JPEG and H.264 compression.
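
The abstract states that the transformer uses parameters transformed by the same keys as the video encryption, but it does not specify the cipher. The toy sketch below illustrates the general principle with an assumed scheme: a key-seeded permutation of the values inside one patch, matched by the same permutation of the columns of a linear patch-embedding matrix. The scheme, the function names, and the use of `random.Random` as a key stream are illustrative assumptions (and `random` is not cryptographically secure); the paper's actual transform and its handling of compression may differ.

```python
import random

def key_permutation(key, n):
    """Derive a permutation of range(n) from an integer secret key."""
    perm = list(range(n))
    random.Random(key).shuffle(perm)
    return perm

def encrypt_patch(patch, perm):
    """Shuffle the values of one flattened patch."""
    return [patch[j] for j in perm]

def transform_weights(weights, perm):
    """Apply the same permutation to the input columns of the embedding."""
    return [[row[j] for j in perm] for row in weights]

def linear_embed(weights, patch):
    """Linear patch embedding: one dot product per output channel."""
    return [sum(w * x for w, x in zip(row, patch)) for row in weights]

key = 2024
patch = [0.1, 0.5, 0.9, 0.3, 0.7, 0.2]
weights = [[1.0, -1.0, 0.5, 0.0, 2.0, 0.3],
           [0.2, 0.4, -0.6, 0.8, 1.0, -1.2]]

perm = key_permutation(key, len(patch))
plain_out = linear_embed(weights, patch)
cipher_out = linear_embed(transform_weights(weights, perm),
                          encrypt_patch(patch, perm))
print(plain_out, cipher_out)
```

The two outputs agree up to floating-point rounding, so in this toy setting the key-transformed model sees the encrypted input exactly as the original model sees the plain input, without ever decrypting it.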


[42] Adaptive Physical-Facial Representation Fusion via Subject-Invariant Cross-Modal Prompt Tuning for Video-Based Emotion Recognition cs.CV

Xiwen Luo, Jia Li, Rencheng Song, Yu Liu, Juan Cheng

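TL;DR: This paper proposes a subject-invariant cross-modal prompt-tuning framework for video-based emotion recognition. rPPG signals are converted into time-frequency representations, from which modality-complementary prompts are generated to modulate facial tokens within a frozen ViT, and a decoupled shared-specific adapter separates subject-shared from subject-specific components to improve cross-subject generalization.

Details

Motivation: Existing multimodal methods that fuse facial and rPPG features disrupt pretrained facial representations and lack explicit mechanisms to suppress subject-specific variation. The paper addresses both problems to improve the accuracy and generalization of emotion recognition.

Result: On the MAHNOB-HCI and DEAP benchmarks, the method consistently outperforms strong baselines in both recognition accuracy and generalization ability.

Insight: The contributions are cross-modal prompts generated from noise-robust rPPG time-frequency representations, which modulate the facial tokens of a frozen ViT, and a decoupled shared-specific adapter that explicitly separates subject-related variation, improving cross-subject generalization while preserving the pretrained representations.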

Abstract: Emotion recognition from facial videos enables non-contact inference of human emotional states. Although facial expressions are widely used cues, they cannot fully reflect intrinsic affective states. Remote photoplethysmography (rPPG) provides complementary physiological information, but it is highly susceptible to noise and inter-subject variability, limiting generalization to unseen individuals. Existing multimodal methods combine facial and rPPG features, yet their fusion strategies often disrupt pretrained facial representations and lack explicit mechanisms to suppress subject-specific variations. To address these issues, we propose a subject-invariant cross-modal prompt-tuning framework for video-based emotion recognition. Specifically, rPPG waveforms are transformed into noise-robust time-frequency representations (TFRs), from which modality-complementary prompts are generated to modulate facial tokens within a frozen Vision Transformer (ViT). This design enables effective cross-modal interaction while preserving the generalizable facial representations learned by the pretrained backbone. In addition, we introduce a decoupled shared-specific adapter (DSSA) into each ViT layer to explicitly separate subject-shared and subject-specific components, thereby improving cross-subject generalization. Experiments on the MAHNOB-HCI and DEAP benchmarks demonstrate that the proposed method consistently outperforms strong baselines in both recognition accuracy and generalization ability, highlighting its effectiveness for video-based emotion recognition.


[43] Closing the Loop: Unified 3D Scene Generation and Immersive Interaction via LLM-RL Coupling cs.CV | cs.GR | cs.HC | cs.LG | cs.MM

Anh H. Vo, Sungyo Lee, Phil-Joong Kim, Soo-Mi Choi, Yong-Guk Kim

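TL;DR: This paper proposes a unified framework that couples large language models with reinforcement learning to close the loop between language-driven 3D scene generation and immersive user interaction. The system first uses LLMs to build structured scene representations from natural language instructions, then optimizes spatial layouts via reinforcement learning under geometric and semantic constraints, and finally deploys the generated environments in virtual reality, where user interaction provides continuous feedback to refine the generated content.

Details

Motivation: Existing methods usually treat 3D scene generation and user interaction as separate processes, limiting the adaptability and immersive potential of interactive multimedia systems. The paper addresses this through closed-loop integration.

Result: The framework achieves state-of-the-art performance in task-based scene generation on the ALFRED benchmark. Qualitative results and user studies show consistent improvements in immersion, interaction quality, and task efficiency.

Insight: The core idea is a closed-loop design that tightly couples generation and interaction. LLM-RL coupling and an HRI-in-the-loop feedback mechanism make the system more responsive, adaptive, and realistic, which is relevant for next-generation multimedia systems.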

Abstract: Recent advances in large language models (LLMs) have significantly improved language-driven 3D content generation, but most existing approaches still treat scene generation and user interaction as separate processes, limiting the adaptability and immersive potential of interactive multimedia systems. This paper presents a unified framework that closes the loop between language-driven 3D scene generation and immersive user interaction. Given natural language instructions, the system first constructs structured scene representations using LLMs, and then optimizes spatial layouts via reinforcement learning under geometric and semantic constraints. The generated environments are deployed in a virtual reality setting to facilitate HRI-in-the-loop, where user interactions provide continuous feedback to align generated content with human perception and usability. By tightly coupling generation and interaction, the proposed framework enables more responsive, adaptive, and realistic multimedia experiences. Experiments on the ALFRED benchmark demonstrate state-of-the-art performance in task-based scene generation. Furthermore, qualitative results and user studies show consistent improvements in immersion, interaction quality, and task efficiency, highlighting the importance of closed-loop integration of generation and interaction for next-generation multimedia systems. Our project page can be found at https://proj-showcase.github.io/h3ds/.


[44] EgoEMG: A Multimodal Egocentric Dataset with Bilateral EMG and Vision for Hand Pose Estimation cs.CV

Ziheng Xi, Jiayi Yu, Yitao Wang, Yanbo Duan, Jianjiang Feng

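TL;DR: This paper introduces EgoEMG, a multimodal egocentric dataset for bimanual hand pose estimation, comprising bilateral wristband surface electromyography (sEMG), egocentric wide-angle RGB video, external RGB-D video, IMU data, and mocap-derived hand motion. It covers 41 participants performing 60 gesture classes, with more than 10 hours of recording in total. The authors also propose a benchmark with three tasks (EMG-to-pose, vision-to-pose, and EMG+vision fusion) and evaluate baseline models.

Details

Motivation: No existing dataset synchronizes EMG with egocentric vision, although the two modalities are complementary for hand sensing: EMG captures fine-grained finger articulation even under occlusion and poor lighting, while vision provides the global hand configuration.

Result: As baselines, the authors evaluate EMGFormer for EMG-to-pose and generic ResNet/ViT backbones for vision-to-pose. They further study a residual fusion architecture that improves over matched lightweight vision-only baselines.

Insight: The contribution is the first multimodal dataset with synchronized bilateral EMG and egocentric vision, together with a unified benchmark framework (including generalization split axes such as cross-gesture and cross-user), laying a foundation for future research on multimodal hand pose estimation with EMG and vision.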

Abstract: Surface electromyography (sEMG) records muscle activity during hand movement and can be decoded to recover detailed hand articulation. EMG and egocentric vision are complementary for hand sensing: EMG captures fine-grained finger articulation even under occlusion and poor lighting, while vision provides global hand configuration. However, no existing dataset synchronizes both modalities. We present EgoEMG, a multimodal egocentric dataset for bimanual hand pose estimation. EgoEMG includes bilateral wristband EMG with 16 total channels (8 per wrist) sampled at 2 kHz, 120 Hz IMU, egocentric wide-angle RGB video, external RGB-D video, and mocap-derived hand motion with wrist articulation angles. The dataset covers 41 participants performing 60 gesture classes, including 30 single-hand gestures and 30 bimanual gestures, totaling more than 10 hours of recording. We also introduce a benchmark with three tasks – EMG-to-pose, vision-to-pose, and EMG+vision fusion – under a shared joint-angle prediction target and common generalization split axes (cross-gesture, cross-user, and combined). As baselines, we evaluate EMGFormer for EMG-to-pose and generic ResNet/ViT backbones for vision-to-pose. We further study a residual fusion architecture that improves over matched lightweight vision-only baselines. Together, EgoEMG and its benchmark establish a foundation for future research on multimodal hand pose estimation with EMG and vision.


[45] TriRelVLA: Triadic Relational Structure for Generalizable Embodied Manipulation cs.CV | cs.RO

Hanyu Zhou, Chuanhao Ma, Gim Hee Lee

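TL;DR: This paper proposes TriRelVLA, a vision-language-action (VLA) framework built on a triadic relational structure to improve generalization in embodied manipulation. It constructs explicit object-hand-task triadic representations, builds a task-grounded relational graph, and performs relation-conditioned action generation, reducing reliance on visual appearance statistics and generalizing better to unseen scenes, objects, and task compositions.

Details

Motivation: Existing VLA models perform well on training-seen robotic tasks but struggle to generalize to new scenes and objects. The root cause is that their implicit visual representations entangle object appearance, background, and scene layout, making policies sensitive to visual variation. Prior work improves transferability through structured intermediate representations, but these mainly capture scene semantics rather than action-relevant relations, so action prediction still depends on appearance statistics.

Result: Experiments show strong performance on fine-tuned tasks and clear gains in cross-scene, cross-object, and cross-task generalization. The authors also introduce a real-world robotic dataset for fine-tuning.

Insight: The core observation is that manipulation actions depend on the object-hand-task triadic relational structure, from which explicit triadic representations are built as relational primitives. Task-guided cross-attention forms the nodes, a relation-aware graph transformer models their interactions, and the relational structure is compressed into a bottleneck space and projected into the LLM for action prediction. This triadic relational bottleneck reduces reliance on appearance statistics and is key to generalization.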

Abstract: Vision-language-action (VLA) models perform well on training-seen robotic tasks but struggle to generalize to unseen scenes and objects. A key limitation lies in their implicit visual representations, which entangle object appearance, background, and scene layout. This makes policies sensitive to visual variations. Prior work improves transferability through structured intermediate representations that objectify visual content. However, these representations mainly capture scene semantics instead of action-relevant relations. As a result, action prediction remains tied to appearance statistics. We observe that manipulation actions depend on the object-hand-task relational structure, which governs interactions among task requirements, robot states, and object properties. Based on this observation, we propose TriRelVLA, a triadic relational VLA framework for generalizable embodied manipulation. Our approach consists of three components: 1) We construct explicit object-hand-task triadic representations from multimodal inputs as relational primitives. 2) We build a task-grounded relational graph. Task-guided cross-attention forms nodes, and a relation-aware graph transformer models interactions among them. 3) We perform relation-conditioned action generation. The relational structure is compressed into a bottleneck space and projected into the LLM for action prediction. This triadic relational bottleneck reduces reliance on appearance statistics and enables transfer across scenes, objects, and task compositions. We further introduce a real-world robotic dataset for fine-tuning. Experiments show strong performance on fine-tuned tasks and clear gains in cross-scene, cross-object, and cross-task generalization.


[46] Ray-Aware Pointer Memory with Adaptive Updates for Streaming 3D Reconstruction cs.CV

Feifei Li, Qi Song, Chi Zhang, Rui Huang

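TL;DR: This paper proposes a ray-aware pointer memory for streaming 3D reconstruction. It represents spatial location and viewing direction in a unified memory and introduces an adaptive pointer update strategy that selectively retains informative pointers and discards redundant observations, improving long-term reconstruction stability and camera pose accuracy.

Details

Motivation: Existing feed-forward reconstruction frameworks rely mainly on appearance similarity when updating memory, which leads to redundant accumulation of observations and unstable geometry under viewpoint changes. The paper addresses these problems by jointly modeling geometric proximity and viewpoint consistency.

Result: Extensive experiments show that the proposed ray-aware memory design significantly improves long-term reconstruction stability and camera pose accuracy while maintaining efficient streaming inference.

Insight: The contribution is a pointer memory that models spatial location and ray direction jointly, with an adaptive retain-or-replace update mechanism in place of conventional fusion-based compression, providing a principled framework for scalable, drift-resistant online 3D reconstruction.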

Abstract: Dense 3D reconstruction from continuous image streams requires both accurate geometric aggregation and stable long-term memory management. Recent feed-forward reconstruction frameworks integrate observations through persistent memory representations, yet most rely primarily on appearance-based similarity when updating memory. Such appearance-driven integration often leads to redundant accumulation of observations and unstable geometry when viewpoint changes occur. In this work, we propose a ray-aware pointer memory for streaming 3D reconstruction that explicitly models both spatial location and viewing direction within a unified memory representation. Each memory pointer stores its 3D position, associated ray direction, and feature embedding, allowing the system to reason jointly about geometric proximity and viewpoint consistency. Based on this representation, we introduce an adaptive pointer update strategy that replaces traditional fusion-based memory compression with a retain-or-replace mechanism. Instead of averaging nearby observations, the system selectively retains informative pointers while discarding redundant ones, preserving distinctive geometric structures while maintaining bounded memory growth. Furthermore, the joint reasoning over spatial distance and ray-direction discrepancy enables the system to distinguish between local redundancy, novel observations, and potential loop revisits in a unified manner. When loop candidates are detected, pose refinement is triggered to enforce global geometric consistency across the reconstruction. Extensive experiments demonstrate that the proposed ray-aware memory design significantly improves long-term reconstruction stability and camera pose accuracy while maintaining efficient streaming inference. Our approach provides a principled framework for scalable and drift-resistant online 3D reconstruction from image streams.
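The retain-or-replace idea described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Pointer` fields, the `score` informativeness value, and both thresholds are hypothetical, and loop-revisit detection and pose refinement are omitted.

```python
import math
from dataclasses import dataclass


@dataclass
class Pointer:
    position: tuple  # 3D position (x, y, z)
    ray: tuple       # unit viewing-ray direction
    score: float     # hypothetical informativeness score (higher is better)


def angle_between(u, v):
    """Angle in radians between two unit vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return math.acos(max(-1.0, min(1.0, dot)))


def update_memory(memory, obs, dist_thresh=0.05, angle_thresh=math.radians(15)):
    """Retain-or-replace update; pointers are never averaged together.

    Returns "added", "replaced", or "discarded".
    """
    for i, ptr in enumerate(memory):
        close = math.dist(ptr.position, obs.position) < dist_thresh
        same_view = angle_between(ptr.ray, obs.ray) < angle_thresh
        if close and same_view:
            # Locally redundant: keep whichever pointer is more informative.
            if obs.score > ptr.score:
                memory[i] = obs
                return "replaced"
            return "discarded"
    # Novel position or novel viewing direction: store a new pointer.
    memory.append(obs)
    return "added"
```

Because a redundant observation either replaces an existing pointer or is dropped, memory grows only when a new location or a new viewing direction is seen, which is one way to keep memory growth bounded.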


[47] Jointly Learning Structured Representations and Stabilized Affinity for Human Motion Segmentation cs.CV

Xianghan Meng, Zhiyuan Huang, Zhengyu Tong, Chun-Guang Li

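TL;DR: This paper proposes TDSC, an efficient human motion segmentation method that jointly learns temporally consistent structured representations and a stabilized affinity matrix. It addresses the poor performance of existing subspace clustering methods when raw frame features do not satisfy the union-of-subspaces assumption.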

Details

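Motivation: Existing human motion segmentation methods are mostly based on subspace clustering, which assumes that high-dimensional temporal features follow a union-of-subspaces distribution. Raw frame-level features from real videos often violate this assumption, leading to unsatisfactory segmentation.

Result: Extensive experiments on five benchmark HMS datasets, using both conventional features (HoG) and recent deep features (CLIP, DINOv2), validate the effectiveness of the method.

Insight: Contributions include a coding-rate maximization regularizer that avoids representation collapse and makes the representations conform to a union-of-subspaces distribution; temporal constraints that encourage adjacent frames to be grouped together; a temporal momentum averaging mechanism that stabilizes affinity evolution; and a reparameterization strategy for efficient optimization.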

Abstract: Human Motion Segmentation (HMS), which aims to partition a video into non-overlapping segments corresponding to different human motions, has recently attracted increasing research attention. Existing HMS approaches are predominantly based on subspace clustering, which is grounded in the assumption that the distribution of high-dimensional temporal features aligns well with a Union-of-Subspaces (UoS). For videos in the real world, however, the raw frame-level features often violate the UoS assumption and yield unsatisfactory segmentation performance. To address this issue, we propose an efficient and effective approach for HMS, named Temporal Deep Self-expressive subspace Clustering (TDSC), which jointly learns temporally consistent structured representations and stabilized affinity for accurate and robust HMS. Specifically, in TDSC, we alternately learn structured representations of the input frame features and self-expressive coefficients via a properly regularized self-expressive model. A coding-rate maximization regularizer is incorporated to avoid representation collapse and to make the learned representations conform to a desired UoS distribution, while temporal constraints encourage temporally adjacent frames to be partitioned into the same groups. Moreover, we develop a temporal momentum averaging mechanism to stabilize affinity evolution and design a reparameterization strategy to enable efficient optimization. We conduct extensive experiments on five benchmark HMS datasets using both conventional (HoG) and up-to-date deep features (i.e., CLIP, DINOv2) to validate the effectiveness of our approach.
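As a rough illustration of what momentum averaging of an affinity matrix could look like, the sketch below keeps an exponential moving average of affinities built from self-expressive coefficients. The abstract does not give the exact update rule, so the symmetrization `(|C| + |C|^T) / 2` (a common choice in subspace clustering), the function names, and the `momentum` value are all assumptions.

```python
def affinity_from_coefficients(C):
    """Symmetric affinity A = (|C| + |C|^T) / 2 from a square coefficient matrix."""
    n = len(C)
    return [[(abs(C[i][j]) + abs(C[j][i])) / 2 for j in range(n)] for i in range(n)]


def momentum_update(A_avg, A_new, momentum=0.9):
    """Exponential moving average of the affinity matrix across training steps."""
    n = len(A_new)
    return [
        [momentum * A_avg[i][j] + (1 - momentum) * A_new[i][j] for j in range(n)]
        for i in range(n)
    ]
```

A higher `momentum` makes the averaged affinity change more slowly, which damps step-to-step fluctuations in the coefficients at the cost of reacting later to genuine changes.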


[48] iTRIALSPACE: Programmable Virtual Lesion Trials for Controlled Evaluation of Lung CT Models cs.CV

Fakrul Islam Tushar, Umme Hafsa Momy, Joseph Y. Lo, Geoffrey D. Rubin

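TL;DR: This paper presents iTRIALSPACE, a programmable framework for controlled evaluation of lung CT models. Through a four-stage pipeline, it composes real clinical CTs and lesion profiles into controlled virtual lesion trials, addressing the entanglement of lesion size, lobe distribution, anatomy, and acquisition context in standard static benchmarks.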

Details

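Motivation: Standard benchmarks are static retrospective datasets that confound lesion size, lobe prevalence, anatomy, and acquisition context, making it hard to determine what structurally drives model accuracy. iTRIALSPACE aims to address this limitation and enable more controlled, interpretable evaluation of model performance.

Result: In a 55,469-sample virtual lesion study, three medical vision-language models, four spatial-guidance conditions, and three clinical tasks were evaluated. Across all 13 trial modes, the FID (Fréchet Inception Distance) between synthetic and real data stays within the real-to-real baseline, and performance rankings on synthetic data correlate strongly with those on real clinical data ($\rho = 0.93$, $p < 10^{-15}$). Controlled trial modes reveal findings that fixed-distribution benchmarks cannot, such as a shortcut-driven collapse of size prediction under lobe-equalized sampling and host-to-donor variance ratios of 8.9x and 3.3x in twin-cross analysis.

Insight: The paper's novelty is a programmable, controlled evaluation framework that disentangles confounders by synthesizing controlled virtual lesion trials, allowing deeper diagnosis of model behavior. Viewed objectively, it combines real lesion profiles with anatomy-aware synthesis (e.g., ControlNet) to build an auditable evaluation infrastructure that goes beyond static retrospective benchmarks and offers a falsifiable test environment for model evaluation.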

Abstract: We introduce iTRIALSPACE, a programmable evaluation framework for controlled assessment of lung CT models. Standard benchmarks are static retrospective collections that entangle lesion size, lobe prevalence, anatomy, and acquisition context, making it difficult to determine what structurally drives model accuracy. iTRIALSPACE addresses this limitation by composing real clinical CTs and lesion profiles into controlled virtual lesion trials through a four-stage pipeline: multidataset nodule profiling, explicit trial specification, anatomy-aware mask insertion, and ControlNet-conditioned CT synthesis. The framework is built on a unified 54-attribute nodule-profile dataset spanning 13,140 annotated nodules from seven public CT sources and instantiated as 13 trial modes. We evaluate iTRIALSPACE in a 55,469-sample Virtual Lesion Study spanning three medical VLMs, four spatial-guidance conditions, and three clinical tasks. Across all 13 modes, the synthetic substrate remains within the real-to-real FID baseline, and synthetic performance rankings transfer strongly to real clinical data ($\rho = 0.93$, $p < 10^{-15}$). Controlled trial modes expose findings unavailable to fixed-distribution benchmarks, including shortcut-driven size prediction collapse under lobe-equalized sampling and host-to-donor variance ratios of 8.9x and 3.3x in twin-cross analysis. These results position iTRIALSPACE as an auditable evaluation infrastructure for controlled, falsifiable testing beyond static retrospective benchmarks.
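The reported $\rho$ is a rank correlation between synthetic and real performance rankings. Assuming it is Spearman's rho (the abstract does not name the statistic), a dependency-free sketch of the computation looks like this; the function names are illustrative and no data from the paper is reproduced.

```python
def ranks(values):
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman_rho(x, y):
    """Spearman's rho, computed as the Pearson correlation of the rank vectors."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)
```

Here `x` would hold one score per model-and-condition configuration measured on synthetic data and `y` the matching scores on real data; a value near 1 means the two settings order the configurations almost identically.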


[49] X-OmniClaw Technical Report: A Unified Mobile Agent for Multimodal Understanding and Interaction cs.CV

Xiaoming Ren, Ru Zhen, Chao Li, Yang Song, Qiuxia Hou

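TL;DR: This paper introduces X-OmniClaw, a unified mobile agent for multimodal understanding and interaction designed for the Android ecosystem. Using a unified architecture of perception, memory, and action, it combines UI states, real-world visual context, and speech input to handle complex mobile tasks, and it uses a hybrid grounding strategy and skill learning to improve interaction efficiency and task reliability.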

Details

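Motivation: Inspired by the development of OpenClaw, there is growing demand for mobile-based personal agents that can handle complex and intuitive interactions.

Result: In demonstrations across diverse scenarios, X-OmniClaw effectively improves interaction efficiency and task reliability, offering a practical architectural blueprint for next-generation mobile-native personal assistants.

Insight: Contributions include a unified multimodal ingress pipeline, multimodal memory optimization that combines runtime working memory with long-term personal memory, and a hybrid grounding strategy that combines structural XML metadata with visual perception. Behavior cloning and trajectory replay capture reusable skills for precise direct-access execution.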

Abstract: Inspired by the development of OpenClaw, there is a growing demand for mobile-based personal agents capable of handling complex and intuitive interactions. In this technical report, we introduce X-OmniClaw, a unified mobile agent designed for multimodal understanding and interaction in the Android ecosystem. This unified architecture of perception, memory, and action enables the agent to handle complex mobile tasks with high contextual awareness. Specifically, Omni Perception provides a unified multimodal ingress pipeline that integrates UI states, real-world visual contexts, and speech inputs, leveraging a temporal alignment module to decompose raw data into structured multimodal intent representations. Omni Memory leverages multimodal memory optimization to enhance personalized intelligence by integrating runtime working memory for task continuity with long-term personal memory distilled from local data, enabling highly context-aware and personalized interactions. Finally, Omni Action employs a hybrid grounding strategy that combines structural XML metadata with visual perception for robust interaction. Through Behavior Cloning and Trajectory Replay, the system captures user navigation as reusable skills, enabling precise direct-access execution. Demonstrations across diverse scenarios show that X-OmniClaw effectively enhances interaction efficiency and task reliability, providing a practical architectural blueprint for the next generation of mobile-native personal assistants.


[50] Steering Visual Generation in Unified Multimodal Models with Understanding Supervision cs.CV | cs.AI

Zeyu Liu, Zanlin Ni, Yang Yue, Cheng Da, Huan Yang

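TL;DR: This paper proposes UNO (Understanding-Oriented Post-Training), a lightweight framework that uses understanding tasks as supervisory signals to steer visual generation in unified multimodal models, strengthening the synergy between the understanding and generation components.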

Details

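Motivation: State-of-the-art multimodal models typically adopt decoupled understanding and generation components. This design weakens the connection needed for mutual enhancement, making the potential synergy hard to realize. The paper aims to explicitly restore this synergy by introducing understanding-oriented supervision.

Result: Extensive experiments on image generation and editing show that understanding tasks can act as an effective catalyst for generation and improve generation performance.

Insight: The core idea is to treat understanding tasks (captioning for semantic abstraction and visual regression for structural details) not only as standalone tasks but also as direct supervisory signals that steer generative representations. This enables effective gradient flow from understanding to generation and strengthens the coupling between the two.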

Abstract: Unified multimodal models are envisioned to bridge the gap between understanding and generation. Yet, to achieve competitive performance, state-of-the-art models adopt largely decoupled understanding and generation components. This design, while effective for individual tasks, weakens the connection required for mutual enhancement, leaving the potential synergy empirically uncertain. We propose to explicitly restore this synergy by introducing Understanding-Oriented Post-Training (UNO), a lightweight framework that treats understanding not only as a distinct task, but also as a direct supervisory signal to steer generative representations. By incorporating objectives that encode semantic abstraction (captioning) and structural details (visual regression), we enable effective gradient flow from understanding to generation. Extensive experiments on image generation and editing demonstrate that understanding can serve as an effective catalyst for generation.


[51] CXR-ContraBench: Benchmarking Negated-Option Attraction in Medical VLMs cs.CV

Zhengru Fang, Yanan Ma, Yu Guo, Senkang Hu, Yixian Zhang

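TL;DR: This paper introduces CXR-ContraBench, a benchmark for evaluating the 'negated-option attraction' failure mode of medical vision-language models in chest X-ray question answering, in which a model is drawn to a negated answer that contradicts both the image evidence and the question. The study shows that the problem is widespread and severe across models and datasets, and proposes QCCV-Neg, a deterministic repair method that requires no retraining and substantially raises accuracy on the key probe.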

Details

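Motivation: Medical vision-language models interpreting chest X-rays may select the negated answer 'no finding' even when the image shows a specific finding. This serious clinical contradiction, negated-option attraction, is the failure the paper targets.

Result: On the strict direct presence probe, MedGemma and Qwen2.5-VL reach only 31.49% and 30.21% accuracy, respectively; on the matched CheXpert training-split protocol, both models select negated options on over 62% of presence questions. The proposed QCCV-Neg raises the two models' accuracy on the direct presence probe to 96.60% and 95.32%, respectively.

Insight: The paper identifies and formalizes a new, clinically risky model failure mode and builds a dedicated diagnostic benchmark. Viewed objectively, the retraining-free deterministic repair QCCV-Neg offers an efficient and interpretable way to address logical consistency at inference time (polarity confusion in particular), and the work underlines that standard accuracy metrics can hide important inference flaws.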

Abstract: When a chest X-ray shows consolidation but the question asks which finding is present, a medical vision-language model may answer “No consolidation.” This is more than an incorrect choice: it is a polarity reversal that emits a clinical statement contradicting the image. We study this failure as negated-option attraction, where a model is drawn to a negated answer option even when it conflicts with both the visual evidence and the question. We introduce CXR-ContraBench (Chest X-Ray Contradiction Benchmark), a diagnostic benchmark spanning internal ReXVQA slices and external OpenI and CheXpert protocols. The benchmark centers on present-finding questions, where selecting “No X” despite visible X creates the main clinical risk, and uses absent-finding questions as secondary tests of whether models copy negated wording. Across CheXpert protocols, the failure is substantial and persistent. On a strict direct presence probe, MedGemma and Qwen2.5-VL reach only 31.49% and 30.21% accuracy, respectively; on a matched 135,754-record CheXpert training-split protocol, both models select negated options on over 62% of presence questions. Chain-of-thought prompting reduces some presence-side reversals but does not eliminate them and can amplify absence-side contradictions. Finally, QCCV-Neg (Question-Conditioned Consistency Verifier for Negation) deterministically repairs the measured polarity-confused subset without retraining, raising MedGemma and Qwen2.5-VL to 96.60% and 95.32% accuracy on the direct presence probe. These results show that standard accuracy can hide a clinically meaningful inference-time polarity failure. Source code and benchmark construction scripts are available at https://github.com/fangzr/cxr-contrabench-code.
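To make the measured quantity concrete, the sketch below flags answers to presence questions that use a negated option and computes the share of such selections. It only illustrates the failure being counted; it is not QCCV-Neg, whose repair rule is not described in the abstract. The prefix list and function names are assumptions.

```python
# Hypothetical surface patterns that mark an answer option as negated.
NEGATION_PREFIXES = ("no ", "without ", "absence of ")


def is_negated_option(option):
    """True if the option text starts with a negation pattern such as 'No X'."""
    return option.strip().lower().startswith(NEGATION_PREFIXES)


def negated_selection_rate(selected_options):
    """Fraction of presence questions on which a negated option was selected."""
    if not selected_options:
        return 0.0
    flagged = sum(is_negated_option(o) for o in selected_options)
    return flagged / len(selected_options)
```

The trailing space in each prefix matters: it keeps findings such as "Nodule" from being mistaken for a negated option.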


[52] ChartZero: Synthetic Priors Enable Zero Shot Chart Data Extraction cs.CV

Md Touhidul Islam, Yasir Mahmud, Sujan Kumar Saha, Mark Tehranipoor, Farimah Farahmandi

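TL;DR: This paper proposes ChartZero, a framework that uses synthetic priors for zero-shot chart data extraction. Training only on a synthetic dataset generated from simple mathematical functions avoids the real-world annotation bottleneck, and a Global Orthogonal Instance loss together with a VLM-guided legend matching strategy addresses curve fragmentation and semantic association.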

Details

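Motivation: Automatic data extraction from line charts suffers from poor generalization, curve fragmentation, and incorrect legend association, caused by stylistic diversity and the scarcity of annotated data.

Result: Evaluated on a purpose-built end-to-end reconstruction benchmark, ChartZero significantly improves generalized chart digitization without real-world supervision.

Insight: Training on synthetic data yields zero-shot generalization; the Global Orthogonal Instance loss handles intersecting curves; and an open-vocabulary vision-language model performs legend matching, avoiding reliance on spatial heuristics.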

Abstract: Automated data extraction from line charts remains fundamentally bottlenecked by extreme stylistic diversity and a severe scarcity of comprehensively annotated, real-world datasets. Current end-to-end pipelines depend heavily on costly manual annotations, crippling their ability to generalize across arbitrary aesthetics and grid layouts. Furthermore, existing models suffer from two critical failure modes during reconstruction. First, extracting thin, intersecting curves frequently causes structural fragmentation and the erasure of fine visual details, as standard architectures struggle against complex backgrounds. Second, semantic association is notoriously error-prone; current pipelines rely on rigid spatial heuristics that easily break down against the unpredictable legend placements of in-the-wild charts. Finally, measuring true progress is hindered by evaluation protocols that assess isolated sub-tasks rather than holistic, end-to-end data reconstruction. To address these foundational issues, we introduce ChartZero, a parsing framework that leverages synthetic priors to enable robust zero-shot chart data extraction. By training exclusively on a purely synthetic dataset of simple mathematical functions, our model completely bypasses the real-world annotation bottleneck. We overcome curve fragmentation via a novel Global Orthogonal Instance (GOI) loss, and replace brittle spatial rules with an open-vocabulary, Vision-Language Model (VLM)-guided legend matching strategy. Accompanied by a new metric and benchmark specifically designed for full end-to-end reconstruction, our evaluations demonstrate that ChartZero significantly advances generalized plot digitization without requiring real-world supervision. Code and dataset will be released upon acceptance.


[53] Unifying Scientific Communication: Fine-Grained Correspondence Across Scientific Media cs.CV

Megha Mariam K. M, Vineeth N. Balasubramanian, C. V. Jawahar

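TL;DR: This paper presents the Multimodal Conference Dataset (MCD), the first dataset that integrates research papers, presentation videos, explanatory videos, and slides. It is used to evaluate how well models discover fine-grained cross-format correspondences, and it reveals the strengths and limitations of current vision-language and embedding-based models on this task.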

Details

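Motivation: Scientific knowledge is increasingly communicated in multiple modalities, but materials in different formats (papers, slides, videos) lack structured connections. This makes it hard to trace how concepts, visuals, and explanations correspond and limits unified exploration and analysis of research content.

Result: A range of embedding-based and vision-language models were evaluated on the MCD benchmark. Vision-language models are robust but struggle with fine-grained alignment, while embedding-based models capture text-visual correspondences well, although equations and symbolic content form separate clusters in the embedding space.

Insight: The contribution is the first benchmark dataset that unifies multimodal scientific materials, along with a systematic evaluation of existing models that exposes their specific difficulties with fine-grained cross-modal correspondence and points to key directions for future research on multimodal scientific understanding.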

Abstract: The communication of scientific knowledge has become increasingly multimodal, spanning text, visuals, and speech through materials such as research papers, slides, and recorded presentations. These different representations collectively convey a study’s reasoning, results, and insights, offering complementary perspectives that enrich understanding. However, despite their shared purpose, such materials are rarely connected in a structured way. The absence of explicit links across formats makes it difficult to trace how concepts, visuals, and explanations correspond, limiting unified exploration and analysis of research content. To address this gap, we introduce the Multimodal Conference Dataset (MCD), the first benchmark that integrates research papers, presentation videos, explanatory videos, and slides from the same works. We evaluate a range of embedding-based and vision-language models to assess their ability to discover fine-grained cross-format correspondences, establishing the first systematic benchmark for this task. Our results show that vision-language models are robust but struggle with fine-grained alignment, while embedding-based models capture text-visual correspondences well but equations and symbolic content form distinct clusters in the embedding space. These findings highlight both the strengths and limitations of current approaches and point to key directions for future research in multimodal scientific understanding. To ensure reproducibility, we release the resources for MCD at https://github.com/meghamariamkm2002/MCD


[54] VideoRouter: Query-Adaptive Dual Routing for Efficient Long-Video Understanding cs.CV | cs.AI

Kuanwei Lin, Wenhao Zhang, Ge Li

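TL;DR: VideoRouter is a query-adaptive dual-router framework that targets the scalability bottleneck caused by overly long visual-token sequences in long-video understanding. A Semantic Router and an Image Router dynamically allocate the compute budget, compressing irrelevant frames aggressively while preserving detail on key evidence frames, which lowers compute cost while improving model performance.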

Details

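Motivation: Existing video compression methods are usually weakly query-aware or apply a fixed compression policy, which works poorly when visual evidence is unevenly distributed over time. VideoRouter aims to process long videos more efficiently through query-adaptive budgeted evidence allocation, reducing memory and latency.

Result: On VideoMME, MLVU, and LongVideoBench, VideoRouter consistently outperforms the InternVL baseline under comparable or lower budgets and achieves up to a 67.9% token reduction, improving the efficiency of long-video understanding.

Insight: Contributions include a query-adaptive dual-routing mechanism (semantic routing and image routing), a flexible policy that chooses between temporal coverage and high-resolution preservation, and router supervision using the large-scale datasets Video-QTR-10K and Video-FLR-200K, which together enable dynamic and efficient video compression.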

Abstract: Video large multimodal models increasingly face a scalability bottleneck: long videos produce excessively long visual-token sequences, which sharply increase memory and latency during inference. While existing compression methods are effective in specific settings, most are either weakly query-aware or apply a fixed compression policy across frames, proving suboptimal when visual evidence is unevenly distributed over time. To address this, we present VideoRouter, a query-adaptive dual-router framework built on InternVL for budgeted evidence allocation. The Semantic Router predicts the dominant allocation policy, choosing between broad temporal coverage and adaptive high-resolution preservation, while the Image Router uses early LLM layers to score frame relevance. This enables aggressive compression on less relevant frames while preserving detail on critical evidence frames. To train both routers, we build Video-QTR-10K for allocation-policy supervision and Video-FLR-200K for frame-relevance supervision. Experiments on VideoMME, MLVU, and LongVideoBench show that VideoRouter consistently improves over the InternVL baseline under comparable or lower budgets, achieving up to a 67.9% token reduction.
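Budgeted evidence allocation can be pictured as splitting a fixed visual-token budget across frames according to per-frame relevance. The sketch below assumes relevance scores are already available (in the paper they come from the Image Router) and uses a simple proportional rule; the rule, the function name, and the `min_tokens`/`max_tokens` limits are assumptions rather than VideoRouter's actual policy.

```python
def allocate_tokens(relevance, budget, min_tokens=4, max_tokens=256):
    """Split a visual-token budget across frames in proportion to relevance.

    Every frame gets at least min_tokens; the remainder is shared
    proportionally (rounded down) and capped at max_tokens per frame.
    """
    n = len(relevance)
    if budget < n * min_tokens:
        raise ValueError("budget too small for the per-frame minimum")
    total = sum(relevance)
    remaining = budget - n * min_tokens
    allocation = []
    for r in relevance:
        # Fall back to an even split when all relevance scores are zero.
        share = int(remaining * r / total) if total > 0 else remaining // n
        allocation.append(min(max_tokens, min_tokens + share))
    return allocation
```

With this rule, low-relevance frames collapse toward `min_tokens` (aggressive compression) while high-relevance frames keep most of the budget, and the total never exceeds `budget` because shares are rounded down.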


[55] Align3D-AD: Cross-Modal Feature Alignment and Dual-Prompt Learning for Zero-shot 3D Anomaly Detection cs.CV

Letian Bai, Xuanming Cao, Juan Du, Chengyu Tao

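TL;DR: This paper proposes Align3D-AD, a two-stage framework for zero-shot 3D anomaly detection. It maps 3D rendering features into the RGB semantic space through cross-modal feature alignment and fuses multimodal information with dual-prompt contrastive learning, enabling anomaly detection on unseen categories.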

Details

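Motivation: Existing zero-shot 3D anomaly detection methods usually project 3D data into multi-view representations that capture mainly geometric cues rather than realistic visual semantics, and then process them with vision encoders pretrained on RGB data. This creates a significant domain gap between the encoder and the projected representations.

Result: Extensive experiments on MVTec3D-AD, Eyecandies, and Real3D-AD show that Align3D-AD consistently outperforms existing zero-shot methods in both one-vs-rest and cross-dataset settings, demonstrating its generalization ability and robustness.

Insight: The main contributions are: 1) an explicit cross-modal feature alignment paradigm that maps rendering features directly into the RGB semantic space for direct semantic transfer; 2) a semantic consistency reweighting strategy that reweights local regions according to holistic semantic consistency to refine feature alignment; 3) a modality-aware dual-prompt contrastive learning framework that assigns independent prompts to RGB-aligned features and rendering features to capture complementary cross-modal semantics, with contrastive alignment strengthening the prompt representations to improve discriminability.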

Abstract: Zero-shot 3D anomaly detection aims to identify anomalies without access to training data from target categories. However, existing methods mainly rely on projecting 3D observations into multi-view representations that primarily capture geometric cues rather than realistic visual semantics and process them with vision encoders pretrained on RGB data, leading to a significant domain gap between the encoder and the projected representations. To address this issue, we propose Align3D-AD, a unified two-stage framework that leverages the RGB modality from auxiliary categories as cross-modal guidance for zero-shot 3D anomaly detection. First, we introduce a cross-modal feature alignment paradigm that maps rendering features into the RGB semantic space. Unlike prior works that implicitly rely on pretrained encoders, our method enables direct semantic transfer from RGB observations. A semantic consistency reweighting strategy is further introduced to refine feature alignment by reweighting local regions according to holistic semantic consistency. Second, we propose a modality-aware prompt learning framework with dual-prompt contrastive alignment. By assigning independent prompts to RGB-aligned and rendering features, our method captures complementary semantics across modalities, while the contrastive alignment further enhances prompt representations to improve discriminability. Extensive experiments on MVTec3D-AD, Eyecandies, and Real3D-AD demonstrate that Align3D-AD consistently outperforms existing zero-shot methods under both one-vs-rest and cross-dataset settings, highlighting its generalization capability and robustness. Code and the dataset will be made available once our paper is accepted.


[56] Training-Free Dense Hand Contact Estimation with Multi-Modal Large Language Models cs.CV

Daniel Sungho Jung, Kyoung Mu Lee

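TL;DR: This paper proposes ContactPrompt, a training-free, zero-shot method for dense hand contact estimation. It exploits the reasoning ability of multi-modal large language models (MLLMs), encodes 3D hand geometry through hand-part segmentation and a part-wise vertex-grid representation, and uses multi-stage structured contact reasoning to produce accurate dense contact predictions.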

Details

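Motivation: Dense hand contact estimation requires both high-level semantic understanding and fine-grained geometric reasoning, yet existing MLLMs struggle to encode 3D geometry directly and to predict fine-grained vertex-level contact.

Result: Without any training, the method outperforms previous supervised methods trained on large-scale dense contact datasets, achieving state-of-the-art performance.

Insight: The novelty lies in combining the semantic reasoning ability of MLLMs with structured geometric representations (hand-part segmentation and vertex grids) and bridging global semantics and local geometry through multi-stage reasoning, which suggests a new direction for training-free dense geometric tasks.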

Abstract: Dense hand contact estimation requires both high-level semantic understanding and fine-grained geometric reasoning about human interaction to accurately localize contact regions. Recently, multi-modal large language models (MLLMs) have demonstrated strong capabilities in understanding visual semantics, enabled by vision-language priors learned from large-scale data. However, leveraging MLLMs for dense hand contact estimation remains underexplored. There are two major challenges in applying MLLMs to dense hand contact estimation. First, encoding explicit 3D hand geometry is difficult, as MLLMs primarily operate on vision and language modalities. Second, capturing fine-grained vertex-level contact remains challenging, as MLLMs tend to focus on high-level semantics rather than detailed geometric reasoning. To address these challenges, we propose ContactPrompt, a training-free and zero-shot approach for dense hand contact estimation using MLLMs. To effectively encode 3D hand geometry, we introduce a detailed hand-part segmentation and a part-wise vertex-grid representation that provides structured, localized geometric information. To enable accurate and efficient dense contact prediction, we develop a multi-stage structured contact reasoning scheme with part conditioning, progressively bridging global semantics and fine-grained geometry. Our method thus leverages the reasoning capabilities of MLLMs while enabling precise dense hand contact estimation. Surprisingly, the proposed approach outperforms previous supervised methods trained on large-scale dense contact datasets without requiring any training. The code will be released.


[57] Detecting AI-Generated Videos with Spiking Neural Networks cs.CV | cs.AI

Minsuk Jang, Yujin Yang, Heeseon Kim, Minseok Son, Younghun Kim

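TL;DR: This paper proposes MAST, a spiking neural network (SNN)-based method for detecting AI-generated videos. It analyzes differences in temporal smoothness in both pixel-level residuals and semantic feature-space trajectories, and uses the event-driven, sparsely activated nature of SNNs to capture temporal artifacts concentrated at object and motion boundaries, achieving strong detection performance under cross-generator evaluation.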

Details

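Motivation: Existing AI-generated video detectors degrade sharply under cross-generator evaluation because they do not fully exploit two temporal-smoothness signals in AI videos: smoother pixel-level residuals and more compact semantic feature trajectories. In addition, conventional dense artificial neural network (ANN) backbones are poorly matched to the sparse, asynchronous structure of these signals.

Result: Under strict cross-generator evaluation on the GenVideo benchmark, MAST achieves 93.14% mean accuracy across 10 unseen generators, matching or surpassing the strongest ANN-based detectors.

Insight: The novelty lies in bringing spiking neural networks (SNNs) to AI-generated video detection for the first time, using their event-driven, sparsely activated dynamics to capture sparse, asynchronous temporal artifacts concentrated at moments of change (such as edges), and combining multi-channel temporal residuals with a frozen semantic encoder to improve cross-generator generalization.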

Abstract: Modern AI-generated videos are photorealistic at the single-frame level, leaving inter-frame dynamics as the main remaining axis for detection. Existing detectors typically handle this temporal evidence in three ways: feeding the full frame sequence to a generic temporal backbone, reducing one dominant temporal cue to fixed video-level descriptors, or comparing temporal features to real-video statistics through a detection metric. These strategies degrade sharply under cross-generator evaluation, where artifact type and timescale vary across generators. On the caption-paired benchmark GenVidBench, we identify two signatures that prior detectors do not jointly exploit: AI-generated videos exhibit smoother frame-to-frame temporal residuals at the pixel level, and more compact trajectories in the semantic feature space, indicating a temporal smoothness gap at both levels. We further observe that, when raw video is fed into a Spiking Neural Network (SNN), fake clips elicit firing predominantly at object and motion boundaries, unlike real clips, suggesting that the SNN responds to temporal artifacts localized at edges. These cues are sparse, asynchronous, and concentrated at moments of change, which makes SNNs a natural choice for this task: their event-driven, sparsely activated dynamics align with the structure of the residual signal in a way that dense ANN backbones do not. Building on this observation, we propose MAST, a detector that processes multi-channel temporal residuals with a spike-driven temporal branch alongside a frozen semantic encoder for cross-generator generalization. On the GenVideo benchmark, MAST achieves 93.14% mean accuracy across 10 unseen generators under strict cross-generator evaluation, matching or surpassing the strongest ANN-based detectors and demonstrating the practical applicability of SNNs to AI-generated video detection.
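Two ingredients named in the abstract, frame-to-frame temporal residuals and event-driven spiking, can be illustrated with a toy sketch. It uses a single textbook leaky integrate-and-fire (LIF) neuron on a scalar residual signal; MAST's actual multi-channel residuals and network are not specified here, so the function names, `threshold`, and `leak` are assumptions.

```python
def temporal_residuals(frames):
    """Mean absolute frame-to-frame difference for each consecutive pair.

    frames: list of equally sized 2D lists of grayscale intensities.
    """
    out = []
    for prev, curr in zip(frames, frames[1:]):
        diffs = [
            abs(c - p)
            for row_p, row_c in zip(prev, curr)
            for p, c in zip(row_p, row_c)
        ]
        out.append(sum(diffs) / len(diffs))
    return out


def lif_spikes(inputs, threshold=1.0, leak=0.5):
    """Leaky integrate-and-fire neuron: emit 1 when the potential reaches threshold."""
    v, spikes = 0.0, []
    for x in inputs:
        v = leak * v + x  # decay the old potential, then integrate the input
        if v >= threshold:
            spikes.append(1)
            v = 0.0  # hard reset after a spike
        else:
            spikes.append(0)
    return spikes
```

The neuron stays silent while residuals are small and fires only when change accumulates, which is the sparse, change-driven behaviour the abstract argues suits this signal.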


[58] Architecture-agnostic Lipschitz-constant Bayesian header and its application to resolve semantically proximal classification errors with vision transformers cs.CV | cs.AI

Frederik Schäfer, Luis Mandl, Lars Kälber, Tim Ricken

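TL;DR: This paper proposes an architecture-agnostic Lipschitz-constant Bayesian header (yielding LipB-ViT) that can be integrated into feature extractors such as vision transformers. It calibrates predictive uncertainty by enforcing spectral normalization on both the mean and log-variance of the variational weights, and detects corrupted labels with a scheme that fuses feature-space proximity with predictive uncertainty, outperforming existing methods under structured noise.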

Details

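Motivation: Structured label noise creates a generalization bottleneck for supervised deep learning models, especially semantically proximal classification errors, where conventional robust training methods often fail.

Result: At 15% semantically misclassified labels, the proposed adaptive arithmetic-mean fusion scheme reaches a recall above 0.93, more than 7% higher than state-of-the-art k-nearest-neighbor-based identification methods; the model is also robust at inference time under both structured and unstructured noise.

Insight: Contributions include: 1) a Bayesian header that applies spectral normalization to both the mean and log-variance of the variational weights, calibrating uncertainty and suppressing noise amplification; 2) a new metric that jointly captures uncertainty and confidence; 3) a fusion scheme combining feature-space proximity with predictive uncertainty for corrupted-label detection; 4) an analysis pipeline and quantitative metric for jointly assessing dataset quality and label noise.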

Abstract: Label noise remains a critical bottleneck for the generalization of supervised deep learning models, particularly when errors are structured rather than random. Standard robust training methods often fail in the presence of such semantically proximal classification errors. This work presents an architecture-agnostic Lipschitz-constant Bayesian header that can be integrated into feature extractors such as vision transformers, yielding the bi-Lipschitz-constrained Bayesian Vision Transformer (LipB-ViT). In contrast to conventional Bayesian layers, our approach enforces spectral normalization on both the mean and log-variance of the variational weights, which promotes calibrated predictive uncertainty and mitigates noise amplification. We further propose a novel metric to jointly capture uncertainty and confidence across misclassification rates, as well as an adaptive arithmetic-mean fusion scheme that combines feature-space proximity with predictive uncertainty to detect corrupted labels. This scheme outperforms state-of-the-art k-nearest-neighbor-based identification methods by more than 7%, reaching a recall of more than 0.93 at 15% semantically misclassified labels. Although computational costs increase due to Monte Carlo sampling, the method offers plug-and-play compatibility with pre-trained backbones and consistent hyperparameters across domains, suggesting strong utility for high-stakes applications with variable annotation reliability. The stabilized confidence estimates serve as the foundation for an analysis pipeline that jointly assesses dataset quality and label noise, yielding a second novel metric for their combined quantification. Lastly, we systematically evaluate LipB-ViT under both structured (adversarial) and unstructured noise at inference time, demonstrating its robustness in realistic high-noise and attack scenarios. We compare its performance against baseline methods.
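Spectral normalization rescales a weight matrix by its largest singular value so that the linear map it defines has Lipschitz constant at most 1. The sketch below estimates that singular value with power iteration in plain Python; in the setting described above the same operation would be applied to both the mean and the log-variance matrices. The function names and iteration count are assumptions, and this is not the authors' code.

```python
import math


def spectral_norm(W, iters=50):
    """Estimate the largest singular value of W (list of rows) by power iteration.

    Uses a fixed start vector for reproducibility; practical implementations
    start from a random vector to avoid degenerate cases.
    """
    rows, cols = len(W), len(W[0])
    v = [1.0 / math.sqrt(cols)] * cols
    sigma = 0.0
    for _ in range(iters):
        u = [sum(W[i][j] * v[j] for j in range(cols)) for i in range(rows)]  # u = W v
        sigma = math.sqrt(sum(x * x for x in u))
        if sigma == 0.0:
            return 0.0
        u = [x / sigma for x in u]
        v = [sum(W[i][j] * u[i] for i in range(rows)) for j in range(cols)]  # v = W^T u
        norm_v = math.sqrt(sum(x * x for x in v))
        v = [x / norm_v for x in v]
    return sigma


def spectrally_normalize(W):
    """Return W / sigma_max(W); W is returned unchanged if the estimate is zero."""
    sigma = spectral_norm(W)
    if sigma == 0.0:
        return [row[:] for row in W]
    return [[w / sigma for w in row] for row in W]
```

After normalization the largest singular value is 1, so perturbations of the input features cannot be amplified by this layer, which matches the abstract's point about mitigating noise amplification.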


[59] Plug-and-play Class-aware Knowledge Injection for Prompt Learning with Visual-Language Model cs.CV

Junhui Yin, Nan Pu, Xinyu Zhang, Lingfeng Yang, Lin Wu

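TL;DR: This paper proposes a plug-and-play Class-Aware Knowledge Injection (CAKI) framework that strengthens prompt learning for vision-language models (such as CLIP) on zero-shot classification. Through two components, class-specific prompt generation and query-key prompt matching, CAKI injects class-level knowledge into existing methods and improves classification on both base and novel classes.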

Details

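Motivation: Existing prompt learning methods usually learn domain-shared prompts or generate instance-specific prompts but overlook class-specific knowledge, leading to suboptimal results. Class-specific prompts provide finer-grained supervision and help prevent data from different classes being merged into one class, whereas instance-specific prompts may ignore the rich class-level information across instances, causing data from the same class to be split.

Result: Extensive experiments show that CAKI effectively improves the performance of existing methods on base and novel classes, with gains on multiple benchmarks.

Insight: The novelty is a class-aware knowledge injection mechanism: a class-level knowledge bank is built and queried by retrieval matching, so class-specific knowledge is integrated into prompt learning in a plug-and-play manner, strengthening fine-grained classification.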

Abstract: Prompt learning has become an effective and widely used technique in enhancing vision-language models (VLMs) such as CLIP for various downstream tasks, particularly in zero-shot classification within specific domains. Existing methods typically focus on either learning class-shared prompts for a given domain or generating instance-specific prompts through conditional prompt learning. While these methods have achieved promising performance, they often overlook class-specific knowledge in prompt design, leading to suboptimal outcomes. The underlying reasons are: 1) class-specific prompts offer more fine-grained supervision compared to coarse class-shared prompts, which helps prevent misclassification of data from different classes into a single class; 2) compared to class-specific prompts, instance-specific prompts neglect the richer class-level information across multiple instances, potentially causing data from the same class to be divided into multiple classes. To effectively supplement the class-specific knowledge into existing methods, we propose a plug-and-play Class-Aware Knowledge Injection (CAKI) framework. CAKI comprises two key components, i.e., class-specific prompt generation and query-key prompt matching. The former encodes class-specific knowledge into prompts from few-shot samples that belong to the same class and stores the learned prompts in a class-level knowledge bank. The latter provides a plug-and-play mechanism for each test instance to retrieve relevant class-level knowledge from the knowledge bank and inject such knowledge to refine model predictions. Extensive experiments demonstrate that our CAKI effectively improves the performance of existing methods on base and novel classes. Code is publicly available at https://github.com/yjh576/CAKI.
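Query-key prompt matching can be read as nearest-neighbour retrieval over a class-level knowledge bank. The sketch below assumes cosine similarity between a test-instance query vector and stored class keys; the similarity measure, the dictionary layout (`cls`, `key`, `prompt`), and the function names are illustrative assumptions, not CAKI's published interface.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def retrieve_prompts(query, knowledge_bank, top_k=1):
    """Return the top_k (class, prompt) pairs whose keys best match the query.

    knowledge_bank: list of dicts with "cls", "key" (vector), and "prompt".
    """
    scored = sorted(
        knowledge_bank, key=lambda e: cosine(query, e["key"]), reverse=True
    )
    return [(e["cls"], e["prompt"]) for e in scored[:top_k]]
```

The retrieved prompts would then be injected into whatever prompt-learning method is in use, which is what makes the mechanism plug-and-play: the base method only needs to supply a query vector per test instance.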


[60] Think, then Score: Decoupled Reasoning and Scoring for Video Reward Modeling cs.CV

Yuan Wang, Ouxiang Li, Yulong Xu, Borui Liao, Jiajun Liang

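TL;DR: This paper proposes DeScore, a decoupled 'think-then-score' paradigm for video reward modeling. It separates chain-of-thought generation from reward scoring and combines a discriminative cold start with dual-objective reinforcement learning, aiming to improve the generalization and training stability of video reward models.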

Details

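Motivation: Existing video reward models face a dilemma: discriminative models are prone to shortcut learning and depend on massive data, while generative models are interpretable but couple reasoning and scoring, which creates an optimization bottleneck.

Result: DeScore shows better generalization than baseline models on video preference alignment, and its decoupled design ensures that higher-quality reasoning translates directly into better model performance.

Insight: The novelty lies in decoupling reasoning and scoring into separate modules and optimizing them jointly through a two-stage training framework (discriminative cold start and dual-objective reinforcement learning), combining the reasoning strengths of generative models with the scoring efficiency of discriminative models.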

Abstract: Recent advances in generative video models are increasingly driven by post-training and test-time scaling, both of which critically depend on the quality of video reward models (RMs). An ideal reward model should predict accurate rewards that align with human preferences across diverse scenarios. However, existing paradigms face a fundamental dilemma: Discriminative RMs regress rewards directly on features extracted by multimodal large language models (MLLMs) without explicit reasoning, making them prone to shortcut learning and heavily reliant on massive data scaling for generalization. In contrast, Generative RMs with Chain-of-Thought (CoT) reasoning exhibit superior interpretability and generalization potential, as they leverage fine-grained semantic supervision to internalize the rationales behind human preferences. However, they suffer from inherent optimization bottlenecks due to the coupling of reasoning and scoring within a single autoregressive inference chain. To harness the generalization benefits of CoT reasoning while mitigating the training instability of coupled reasoning and scoring, we introduce DeScore, a training-efficient and generalizable video reward model. DeScore employs a decoupled "think-then-score" paradigm: an MLLM first generates an explicit CoT, followed by a dedicated discriminative scoring module consisting of a learnable query token and a regression head that predicts the final reward. DeScore is optimized via a two-stage framework: (1) a discriminative cold start incorporating a random mask mechanism to ensure robust scoring capabilities, and (2) a dual-objective reinforcement learning stage that independently refines CoT reasoning quality and calibrates the final reward, ensuring that higher-quality reasoning directly translates to superior model performance.


[61] Backdoor Mitigation in Object Detection via Adversarial Fine-Tuning cs.CV | cs.CR

Kealan Dunnett, Reza Arablouei, Dimity Miller, Volkan Dedeoglu, Raja Jurdak

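TL;DR: This paper proposes an adversarial fine-tuning framework for mitigating backdoor attacks on object detection models. Using detection-aware adversarial example generation and a dual-objective fine-tuning loss, it reduces attack success rates while preserving clean detection performance, given only the compromised detector and a small clean dataset.

Details

Motivation: Backdoor attacks pose a serious threat to safety-critical vision systems, but existing mitigation methods mainly target image classification, and defenses for object detection remain comparatively underexplored. Classification-oriented adversarial fine-tuning does not adapt directly to detection, where the attack space is more complex (object misclassification or disappearance) and the loss functions behave differently.

Result: Experiments on CNN- and Transformer-based detectors show that the method reduces attack success more effectively than classification-oriented baselines while preserving true detections, and maintains competitive detection accuracy on clean data.

Insight: The contributions are: (1) soft-branch minimisation, which uses a soft gate to combine misclassification and disappearance attack objectives, enabling detection-aware adversarial example generation without prior knowledge of the attack; (2) a dual-objective fine-tuning loss applied to target-matched predictions, which concentrates the defensive update on the predictions most relevant to the backdoor behaviour and so sharpens the repair signal.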

Abstract: Backdoor attacks can implant malicious behaviours into deep models while preserving performance on clean data, posing a serious threat to safety-critical vision systems. Although backdoor mitigation has been studied extensively for image classification, defenses for object detection remain comparatively underdeveloped. Adversarial fine-tuning is a common backdoor mitigation approach in classification, but adapting it to detection is nontrivial as classification-oriented adversarial generation does not match the detection attack space, where attacks may cause object misclassification or disappearance, and standard detection losses can dilute the repair signal across many predictions. We address these challenges through a detection-aware adversarial fine-tuning framework for mitigating object-detection backdoors when the defender has access only to a compromised detector and a small clean dataset, without knowing the attack objective. For adversarial generation that does not require knowledge of the attack objective, we introduce soft-branch minimisation, which uses a soft gate to combine objectives aligned with misclassification and disappearance attacks, together with a detection-aware classification-loss maximisation. For targeted repair, we introduce a dual-objective fine-tuning loss applied to target-matched predictions, concentrating the defensive update on predictions most relevant to the backdoor behaviour. Experiments across CNN- and Transformer-based detectors show that our approach more effectively reduces attack success while preserving true detections, compared with classification-oriented baselines, and maintains competitive clean detection performance.


[62] MobileEgo Anywhere: Open Infrastructure for long horizon egocentric data on commodity hardware cs.CV | cs.CL

Senthil Palanisamy, Abhishek Anand, Satpal Singh Rathor, Pratyush Patnaik, Shubhanshu Khatana

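TL;DR: This paper presents MobileEgo Anywhere, a framework for conveniently collecting long egocentric trajectories of an hour or more using commodity mobile hardware such as smartphones. It addresses the short durations of existing datasets and their inability to capture long-horizon temporal dependencies.

Details

Motivation: Existing large-scale egocentric datasets typically have short episodes (only a few minutes), which cannot support modeling the long-horizon temporal dependencies required for complex robotic task execution. This limits the development of Vision Language Action (VLA) models.

Result: The contributions are: a new dataset of 200 hours of diverse, long-form egocentric data with persistent state tracking; an open-source mobile application that lets any user record egocentric data; and a complete processing pipeline that converts raw mobile captures into standardized, training-ready formats for VLA model and foundation model research.

Insight: The main novelty is using the ubiquitous sensor suites of smartphones for high-fidelity, long-term camera pose tracking, which substantially lowers the hardware barrier of traditional robotics data collection. This democratizes the collection of long-horizon egocentric data and may help the development of generalizable robotic policies.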

Abstract: The recent advancement of Vision Language Action (VLA) models has driven a critical demand for large-scale egocentric datasets. However, existing datasets are often limited by short episode durations, typically spanning only a few minutes, which fail to capture the long-horizon temporal dependencies necessary for complex robotic task execution. To bridge this gap, we present MobileEgo Anywhere, a framework designed to facilitate the collection of robust, hour-plus egocentric trajectories using commodity mobile hardware. We leverage the ubiquitous sensor suites of modern smartphones to provide high-fidelity, long-term camera pose tracking, effectively removing the high hardware barriers associated with traditional robotics data collection. Our contributions are threefold: (1) we release a novel dataset comprising 200 hours of diverse, long-form egocentric data with persistent state tracking; (2) we open-source a mobile application that enables any user to record egocentric data; and (3) we provide a comprehensive processing pipeline to convert raw mobile captures into standardized, training-ready formats for Vision Language Action model and foundation model research. By democratizing the data collection process, this work enables the massive-scale acquisition of long-horizon data across varied global environments, accelerating the development of generalizable robotic policies.


[63] iPhoneBlur: A Difficulty-Stratified Benchmark for Consumer Device Motion Deblurring cs.CV | cs.AI

Abdullah Al Shafi, Kazi Saeed Alam

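TL;DR: This paper introduces iPhoneBlur, a difficulty-stratified benchmark of 7,400 image pairs synthesized from high-framerate iPhone 17 Pro videos, for evaluating motion deblurring models on consumer mobile devices. Samples are split into Easy, Medium, and Hard tiers through PSNR-guided adaptive temporal windowing, and each sample carries rich metadata to support research. The evaluation shows that existing models degrade by 7-9 dB on Hard samples, a gap that conventional aggregate metrics hide.

Details

Motivation: Motion deblurring on consumer devices is usually evaluated with aggregate metrics, which hide performance differences across blur difficulty and do not reflect model behavior under real deployment conditions.

Result: Six architectures are evaluated on iPhoneBlur, and performance drops consistently by 7-9 dB from the Easy to the Hard subset. The benchmark also exposes a domain gap between professional and consumer cameras, which targeted fine-tuning substantially recovers.

Insight: The novelty is a difficulty-stratified consumer-device deblurring benchmark with real-scene metadata, which makes it possible to systematically assess model reliability and failure modes on resource-constrained edge systems. The PSNR-guided stratification and the ISP-aware metadata design are useful reference points for other benchmarks.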

Abstract: Motion blur restoration on consumer mobile devices is typically evaluated using aggregate metrics that obscure performance variation across blur difficulty, masking model behavior under real deployment conditions. This work introduces iPhoneBlur, a difficulty-stratified benchmark of 7,400 image pairs synthesized from high-framerate iPhone 17 Pro videos captured in diverse real-world scenarios. Samples are partitioned into Easy, Medium, and Hard categories through PSNR-guided adaptive temporal windowing, with stratification validated by a monotonic 2.2x increase in optical flow magnitude across tiers. Each sample includes comprehensive metadata enabling investigation of ISP-aware and difficulty-adaptive restoration strategies. Spectral analysis confirms that the synthesized blur exhibits high-frequency suppression patterns consistent with authentic motion degradation. Evaluation of six architectures reveals a consistent 7-9 dB performance degradation from Easy to Hard subsets, a substantial gap entirely hidden by aggregate reporting. The benchmark further exposes a domain gap between professional and consumer cameras, which targeted fine-tuning substantially recovers. By coupling difficulty stratification with deployment-critical metadata, iPhoneBlur enables systematic assessment of model reliability and failure modes for resource-constrained edge systems.
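
The PSNR-guided temporal windowing described in this abstract can be illustrated with a minimal sketch. The abstract does not give the benchmark's actual procedure, so everything here is an assumption: frames are flat lists of pixel values, blur is synthesized by averaging a symmetric window of frames, and the window grows until the PSNR against the sharp centre frame drops to a target. The names `synthesize_blur`, `target_psnr`, and `max_half_window` are hypothetical.

```python
import math

def psnr(a, b, peak=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    mse = sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)
    return float("inf") if mse == 0 else 10.0 * math.log10(peak ** 2 / mse)

def average_frames(frames):
    """Pixel-wise mean of several frames (a simple motion-blur model)."""
    n = len(frames)
    return [sum(px) / n for px in zip(*frames)]

def synthesize_blur(frames, center, target_psnr, max_half_window):
    """Grow a symmetric window around `center` until PSNR(blur, sharp) <= target.

    Hypothetical rule: a lower target_psnr yields a wider window (harder blur).
    Returns the blurred frame and the half-window size that was used.
    """
    sharp = frames[center]
    blurred, half = list(sharp), 0
    for h in range(1, max_half_window + 1):
        lo, hi = center - h, center + h
        if lo < 0 or hi >= len(frames):
            break  # window would run off the clip
        blurred, half = average_frames(frames[lo:hi + 1]), h
        if psnr(blurred, sharp) <= target_psnr:
            break
    return blurred, half

# Toy clip: one bright pixel moving one position per frame.
clip = [[255.0 if i == t else 0.0 for i in range(9)] for t in range(9)]
easy_blur, easy_half = synthesize_blur(clip, 4, target_psnr=12.0, max_half_window=4)
hard_blur, hard_half = synthesize_blur(clip, 4, target_psnr=11.0, max_half_window=4)
```

Under this toy rule, a lower PSNR target forces a wider averaging window, which is one way a difficulty tier could map to a window length.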


[64] 4DThinker: Thinking with 4D Imagery for Dynamic Spatial Understanding cs.CV

Zhangquan Chen, Manyuan Zhang, Xinlei Yu, Xiang An, Bo Li

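TL;DR: 4DThinker is a framework that lets vision-language models (VLMs) perform 4D reasoning through dynamic latent mental imagery. It improves dynamic spatial understanding through annotation-free data generation, dynamic-imagery fine-tuning, and 4D reinforcement learning.

Details

Motivation: To address dynamic spatial reasoning from monocular video and overcome the limits of existing methods, which either verbalize reasoning entirely as text (verbose and imprecise) or rely on external geometric modules (added complexity).

Result: Across multiple dynamic spatial reasoning benchmarks, 4DThinker consistently outperforms strong baselines.

Insight: The novelty is the notion of “thinking with 4D”: dynamic latent mental imagery is realized by internally simulating how a scene evolves in the continuous hidden space, combined with annotation-free data generation and a stable reinforcement learning scheme that restricts policy gradients to text tokens.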

Abstract: Dynamic spatial reasoning from monocular video is essential for bridging visual intelligence and the physical world, yet remains challenging for vision-language models (VLMs). Prior approaches either verbalize spatial-temporal reasoning entirely as text, which is inherently verbose and imprecise for complex dynamics, or rely on external geometric modules that increase inference complexity without fostering intrinsic model capability. In this paper, we present 4DThinker, the first framework that enables VLMs to “think with 4D” through dynamic latent mental imagery, i.e., internally simulating how scenes evolve within the continuous hidden space. Specifically, we first introduce a scalable, annotation-free data generation pipeline that synthesizes 4D reasoning data from raw videos. We then propose Dynamic-Imagery Fine-Tuning (DIFT), which jointly supervises textual tokens and 4D latents to ground the model in dynamic visual semantics. Building on this, 4D Reinforcement Learning (4DRL) further tackles complex reasoning tasks via outcome-based rewards, restricting policy gradients to text tokens to ensure stable optimization. Extensive experiments across multiple dynamic spatial reasoning benchmarks demonstrate that 4DThinker consistently outperforms strong baselines and offers a new perspective toward 4D reasoning in VLMs. Our code is available at https://github.com/zhangquanchen/4DThinker.


[65] Adding Thermal Awareness to Visual Systems in Real-Time via Distilled Diffusion Models cs.CV | cs.AI

Yuchen Guo, Junli Gong, Wenjun Dong, Yiuming Cheung, Weifeng Su

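TL;DR: This paper proposes FusionProxy, a real-time image fusion module that adds thermal awareness to visual systems by distilling diffusion models. It is an independent, plug-and-play component that uses pixel-level and feature-level variance statistics of a teacher sample ensemble for supervision and alignment, achieves diffusion-level fusion quality, and integrates directly into any visual perception system without joint optimization.

Details

Motivation: To address the degraded performance and safety risks of purely RGB-based vision models in challenging scenarios such as nighttime and fog, and to overcome the prohibitive latency of existing high-fidelity fusion methods, which makes real-time deployment on edge devices impractical.

Result: The method achieves superior performance on static recognition tasks, significantly improves robustness on dynamic tasks such as closed-loop autonomous driving, and runs at real-time inference speeds on platforms ranging from high-end GPUs to commodity hardware.

Insight: The novelty is using the per-pixel variance of teacher samples to weight supervision and feature-level variance for spatial routing, which yields high-quality real-time fusion. Its modular design, which needs no joint optimization, offers a flexible and generalizable solution for all-day perception.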

Abstract: Purely RGB-based vision models often fail to provide reliable cues in challenging scenarios such as nighttime and fog, leading to degraded performance and safety risks. Infrared imaging captures heat-emitting sources and provides critical complementary information, but existing high-fidelity fusion methods suffer from prohibitive latency, rendering them impractical for real-time edge deployment. To address this, we propose FusionProxy, a real-time image fusion module designed as a fully independent, plug-and-play component with diffusion-level quality. FusionProxy exploits two complementary statistics of a teacher sample ensemble: per-pixel variance in raw image space, used to weight pixel-level supervision, and per-pixel variance inside frozen foundation backbones, used to route feature-level alignment spatially. Once trained, FusionProxy can be directly integrated into any visual perception system without joint optimization. Extensive experiments demonstrate that our method achieves superior performance on static recognition tasks and significantly enhances robustness in dynamic tasks, including closed-loop autonomous driving. Crucially, FusionProxy achieves real-time inference speeds on diverse platforms, from high-end GPUs to commodity hardware, providing a flexible and generalizable solution for all-day perception.


[66] PlotPick: AI-powered batch extraction of numerical data from scientific figures cs.CV | cs.DL

Tommy Carstensen

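TL;DR: This paper introduces PlotPick, an open-source tool based on vision-language models (VLMs) that batch-extracts structured tabular data from scientific figures, addressing the inefficiency of manual data digitisation in systematic reviews and meta-analyses.

Details

Motivation: In systematic reviews and meta-analyses, authors often report numerical data only as figures, and manual digitisation is slow and does not scale.

Result: Six VLMs from three providers are evaluated on two chart-to-table benchmarks (ChartX and PlotQA) and compared against the dedicated model DePlot. All VLMs are reported to outperform DePlot on both benchmarks: on ChartX (restricted to bar charts, line charts, box plots, and histograms; n=300), VLMs reach 88-96% recall versus 71% for DePlot; on PlotQA (n=529), VLMs reach 86-99% RMSF1 versus 94% for DePlot. The gap is largest on chart types absent from DePlot's training data, such as box plots, where DePlot reaches 24% RMSF1 and VLMs reach 83-97%.

Insight: The novelty is using general-purpose VLMs rather than a dedicated model for batch chart data extraction. They generalize better across diverse chart types, with the largest advantage on chart types not covered by the dedicated model's training data. This suggests a direction for automated data extraction tools: using large-scale pretrained VLMs to improve adaptability and performance.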

Abstract: Systematic reviews and meta-analyses frequently require numerical data that authors report only as figures, yet manual digitisation is slow and does not scale. We present PlotPick, an open-source tool that uses vision-language models (VLMs) to batch-extract structured tabular data from scientific figures. We evaluate six VLMs from three providers on two established chart-to-table benchmarks (ChartX and PlotQA) and compare against the dedicated chart-to-table model DePlot. All six VLMs outperform DePlot on both benchmarks. On ChartX (restricted to bar charts, line charts, box plots, and histograms; n=300), VLMs achieve 88-96% recall versus 71% for DePlot. On PlotQA (n=529), VLMs achieve 86-99% RMSF1 versus 94% for DePlot. The gap is largest on chart types absent from the dedicated models’ training data: on box plots, DePlot achieves 24% RMSF1 while VLMs achieve 83-97%. PlotPick is available at https://plotpick.streamlit.app.


[67] Fusion in Your Way: Aligning Image Fusion with Heterogeneous Demands via Direct Preference Optimization cs.CV

Weijian Su, Songqian Zhang, Yuqi Han, Jian Zhuang, Yongdong Huang

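TL;DR: This paper proposes DPOFusion, a framework for infrared and visible image fusion (IVIF) that uses direct preference optimization (DPO) to achieve adaptive fusion meeting the diverse demands of human and machine vision. It combines a property-aligned latent diffusion model (PALDM) with a preference-controllable latent diffusion model (PCLDM) to generate fusion results that match different preferences, and sets a new benchmark for fusion quality and task-oriented transferability.

Details

Motivation: Existing IVIF methods struggle to flexibly accommodate heterogeneous demands, and adaptive fusion aligned with the varied preferences of human and machine vision remains an open and challenging problem.

Result: Experiments show that the framework achieves precise preference alignment among humans, vision-language models, and task-driven networks, and sets a new benchmark for adaptive fusion quality and task-oriented transferability.

Insight: The novelty is bringing DPO to image fusion: PALDM generates diverse candidate results, and PCLDM is then fine-tuned via instance direct preference optimization (IDPO), giving an adaptive fusion mechanism that heterogeneous preference signals control directly.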

Abstract: As a key technique in multi-modal processing, infrared and visible image fusion (IVIF) plays a crucial role in integrating complementary spectral information for visual enhancement and downstream vision tasks. Despite remarkable progress, existing methods struggle to flexibly accommodate heterogeneous demands. Achieving adaptive fusion that aligns with various preferences from both human and machine vision remains an open and challenging problem. To address this challenge, we propose DPOFusion, a direct preference optimization (DPO) framework integrating the property-aligned latent diffusion model (PALDM) and the preference-controllable latent diffusion model (PCLDM), enabling task-guided, preference-adaptive IVIF for both human and machine vision. The PALDM leverages a latent fusion prior and a joint conditional loss to generate diverse candidate fusion results with various properties. PCLDM is subsequently fine-tuned via instance direct preference optimization (IDPO), enabling direct control of the final fusion results with heterogeneous preference signals. Experimental results demonstrate that our framework not only attains precise preference alignment among humans, vision-language models, and task-driven networks, but also sets a new benchmark for adaptive fusion quality and task-oriented transferability.


[68] RealCam: Real-Time Novel-View Video Generation with Interactive Camera Control cs.CV

Youcan Xu, Jiaxin Shi, Zhen Wang, Wensong Song, Feifei Shao

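TL;DR: RealCam is an autoregressive framework for real-time, interactive camera-controlled video generation. It achieves efficient streaming synthesis through a teacher model based on Cross-frame In-context Learning and Self-Forcing with Distribution Matching Distillation, and proposes Loop-Closed Data Augmentation to mitigate inconsistency in closed-loop trajectories.

Details

Motivation: Existing camera-controlled video generation methods rely on non-causal, full-sequence processing and rigid prefix-style concatenation. This leads to high computational latency and quadratic complexity scaling and prevents real-time handling of streaming or variable-length inputs, which limits interactive applications.

Result: Across multiple benchmarks, RealCam achieves state-of-the-art visual fidelity and temporal consistency, with inference orders of magnitude faster than existing methods, enabling truly interactive camera control.

Insight: The contributions are the Cross-frame In-context Learning paradigm, which breaks the rigid prefix bottleneck; Self-Forcing with Distribution Matching Distillation, which yields an efficient causal student model; and Loop-Closed Data Augmentation, which improves closed-loop trajectory consistency. Its autoregressive streaming architecture and causal adaptation design are a useful reference for real-time video generation.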

Abstract: Camera-controlled video-to-video (V2V) generation enables dynamic viewpoint synthesis from monocular footage, holding immense potential for interactive filmmaking and live broadcasting. However, existing implicit synthesis methods fundamentally rely on non-causal, full-sequence processing and rigid prefix-style temporal concatenation. This architectural paradigm mandates bidirectional attention, resulting in prohibitive computational latency, quadratic complexity scaling, and inherent incompatibility with real-time streaming or variable-length inputs. To overcome these limitations, we introduce RealCam, a novel autoregressive framework for interactive, real-time camera-controlled V2V generation. We first design a high-fidelity teacher model grounded in a Cross-frame In-context Learning paradigm. By interleaving source and target frames into synchronized contextual pairs, our design inherently enables length-agnostic generalization and naturally facilitates causal adaptation, breaking the rigid prefix bottleneck. We then distill this teacher into a few-step causal student via Self-Forcing with Distribution Matching Distillation, enabling efficient, on-the-fly streaming synthesis. Furthermore, to mitigate severe loop inconsistency in closed-loop trajectories, we propose Loop-Closed Data Augmentation (LoopAug), a novel paradigm that synthesizes globally consistent loop sequences from existing multiview datasets. Extensive experiments demonstrate that RealCam achieves state-of-the-art visual fidelity and temporal consistency while enabling truly interactive camera control with orders-of-magnitude faster inference than existing paradigms. Our project page is at https://xyc-fly.github.io/RealCam/.


[69] Arena as Offline Reward: Efficient Fine-Grained Preference Optimization for Diffusion Models cs.CV

Zhikai Li, Yue Zhao, Edward Zhongwei Zhang, Xuewen Liu, Jing Zhang

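TL;DR: This paper proposes ArenaPO, which uses Arena scores as offline rewards to give diffusion models fine-grained feedback, enabling efficient preference optimization without a reward model. The method combines the rich reward signal of traditional RLHF with the efficiency of DPO.

Details

Motivation: Existing direct preference optimization (DPO) methods rely on binary feedback and can only model preferences coarsely, which leads to suboptimal optimization. This paper aims to improve the efficiency and effectiveness of preference alignment for diffusion models through fine-grained feedback.

Result: Experiments on the Pick-a-Pic v2 and HPD v3 datasets show that ArenaPO consistently outperforms existing baselines.

Insight: The novelty is constructing a model Arena in which each model's capability is a Gaussian distribution, and estimating the absolute quality gap of an image pair by latent-variable inference based on a truncated normal distribution. This yields fine-grained offline rewards and avoids the cost of training a reward model.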

Abstract: Reinforcement learning from human feedback (RLHF) effectively promotes preference alignment of text-to-image (T2I) diffusion models. To improve computational efficiency, direct preference optimization (DPO), which avoids explicit reward modeling, has been widely studied. However, its reliance on binary feedback limits it to coarse-grained modeling on chosen-rejected pairs, resulting in suboptimal optimization. In this paper, we propose ArenaPO, which leverages Arena scores as offline rewards to provide refined feedback, thus achieving efficient and fine-grained optimization without a reward model. This enables ArenaPO to benefit from both the rich rewards of traditional RLHF and the efficiency of DPO. Specifically, we first construct a model Arena in which each model’s capability is represented as a Gaussian distribution, and infer these capabilities by traversing the annotated pairwise preferences. Each output image is treated as a sample from the corresponding capability distribution. Then, for an image pair, conditioned on the two capability distributions and the observed pairwise preference, the absolute quality gap is estimated using latent-variable inference based on a truncated normal distribution, which serves as fine-grained feedback during training. It does not require a reward model and can be computed offline, thus introducing no additional training overhead. We conduct ArenaPO training on the Pick-a-Pic v2 and HPD v3 datasets, showing that ArenaPO consistently outperforms existing baselines.
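
The quality-gap estimate described in this abstract can be illustrated with the textbook mean of a truncated normal. This is a sketch under stated assumptions, not the paper's estimator: the two image qualities are taken as independent Gaussian draws from the models' capability distributions, and the observed preference is taken to mean the gap is positive. The function name `expected_quality_gap` and its parameters are hypothetical.

```python
import math

def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def expected_quality_gap(mu_win, var_win, mu_lose, var_lose):
    """E[q_win - q_lose | q_win > q_lose] for independent Gaussian qualities.

    The gap d = q_win - q_lose is N(mu, sigma^2); conditioning on d > 0
    gives a truncated normal with mean mu + sigma * pdf(a) / (1 - cdf(a)),
    where a = -mu / sigma is the standardized truncation point.
    """
    mu = mu_win - mu_lose
    sigma = math.sqrt(var_win + var_lose)
    a = -mu / sigma
    return mu + sigma * normal_pdf(a) / (1.0 - normal_cdf(a))

# The stronger model winning implies a larger gap than an upset win.
expected_win = expected_quality_gap(1.0, 0.5, 0.0, 0.5)
upset_win = expected_quality_gap(0.0, 0.5, 1.0, 0.5)
```

The gap is always positive but shrinks when the winner comes from the weaker model, which is the kind of graded signal a binary chosen/rejected label cannot express. For very lopsided pairs the tail probability underflows, so a production version would need a numerically stable hazard-function computation.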


[70] Revisiting Uncertainty: On Evidential Learning for Partially Relevant Video Retrieval cs.CV | cs.IR | cs.LG | cs.MM

Jun Li, Peifeng Lai, Xuhang Lou, Jinpeng Wang, Yuting Wang

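TL;DR: This paper proposes Holmes, a hierarchical evidential learning framework for handling uncertainty in partially relevant video retrieval. It aggregates multi-granular cross-modal evidence: at the inter-video level, similarity scores are interpreted as evidential support and modeled with a Dirichlet distribution; at the intra-video level, soft query-clip alignment is obtained through flexible optimal transport with an adaptive dustbin. Together these quantify uncertainty and alleviate sparse temporal supervision.

Details

Motivation: In partially relevant video retrieval, the asymmetry between brief queries and rich video content introduces uncertainty. Vague queries cause semantic ambiguity across videos, and sparse temporal supervision within videos fails to provide sufficient matching evidence, so a method that models uncertainty explicitly is needed.

Result: Extensive experiments on partially relevant video retrieval show that Holmes outperforms state-of-the-art methods.

Insight: The contributions are: interpreting similarity scores as evidential support and modeling them with a Dirichlet distribution to quantify uncertainty; fine-grained query identification based on a three-fold principle, which in turn guides query-adaptive calibrated learning; and soft query-clip alignment via optimal transport with an adaptive dustbin, which accumulates denser evidence and suppresses spurious local responses.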

Abstract: Partially relevant video retrieval aims to retrieve untrimmed videos using text queries that describe only partial content. However, the inherent asymmetry between brief queries and rich video content inevitably introduces uncertainty into the retrieval process. In this setting, vague queries often induce semantic ambiguity across videos, a challenge that is further exacerbated by the sparse temporal supervision within videos, which fails to provide sufficient matching evidence. To address this, we propose Holmes, a hierarchical evidential learning framework that aggregates multi-granular cross-modal evidence to quantify and model uncertainty explicitly. At the inter-video level, similarity scores are interpreted as evidential support and modeled via a Dirichlet distribution. Based on the proposed three-fold principle, we perform fine-grained query identification, which then guides query-adaptive calibrated learning. At the intra-video level, to accumulate denser evidence, we formulate a soft query-clip alignment via flexible optimal transport with an adaptive dustbin, which alleviates sparse temporal supervision while suppressing spurious local responses. Extensive experiments demonstrate that Holmes outperforms state-of-the-art methods. Code is released at https://github.com/lijun2005/ICML26-Holmes.
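
The idea of treating similarity scores as Dirichlet evidence can be sketched using the standard subjective-logic formulation common in evidential deep learning. The abstract does not say how Holmes maps similarities to evidence, so the softplus mapping, the `scale` parameter, and the function name `dirichlet_opinion` are assumptions made only for illustration.

```python
import math

def softplus(x):
    return math.log1p(math.exp(x))

def dirichlet_opinion(similarities, scale=5.0):
    """Turn query-video similarities into beliefs plus an uncertainty mass.

    evidence_k = softplus(scale * sim_k)   (hypothetical mapping)
    alpha_k    = evidence_k + 1            (Dirichlet parameters)
    belief_k   = evidence_k / S, uncertainty = K / S, with S = sum(alpha)
    """
    evidence = [softplus(scale * s) for s in similarities]
    alpha = [e + 1.0 for e in evidence]
    strength = sum(alpha)
    belief = [e / strength for e in evidence]
    uncertainty = len(alpha) / strength
    expected_prob = [a / strength for a in alpha]
    return belief, uncertainty, expected_prob

# A specific query matches one video strongly; a vague one matches none well.
_, u_clear, p_clear = dirichlet_opinion([0.9, 0.1, 0.0])
b_vague, u_vague, _ = dirichlet_opinion([0.1, 0.1, 0.1])
```

In this formulation the beliefs and the uncertainty mass sum to one, so a vague query with weak evidence for every candidate ends up with a larger uncertainty than a query with one clear match.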


[71] LARGO: Low-Rank Hypernetwork for Handling Missing Modalities cs.CV

Niels Vyncke, Pooya Ashtari, Aleksandra Pižurica

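TL;DR: This paper proposes LARGO, a hypernetwork for handling missing modalities in multimodal image analysis. It operates in weight space: by modelling convolutional weights with the Canonical Polyadic (CP) tensor decomposition, it compresses the separate models otherwise needed for each missing-modality combination into a single network. Extensive experiments on the BraTS 2018 and ISLES 2022 medical imaging datasets show that it outperforms prior methods in the large majority of configurations.

Details

Motivation: Handling missing modalities is a challenge in multimodal analysis. Existing methods usually rely on complex feature-space designs that lack generality and transfer poorly. The paper instead moves to weight space and proposes a more general solution that needs no architectural or hyperparameter changes across datasets.

Result: On BraTS 2018 (4 modalities, 15 missing-modality scenarios) and ISLES 2022 (3 modalities, 7 scenarios), LARGO ranks first in 47 of 52 configurations, with average Dice improvements of +0.68% and +2.53%, respectively, over state-of-the-art baselines (mmFormer, M³AE, ShaSpec, SimMLM). A proof-of-concept experiment on avMNIST suggests the method may extend to heterogeneous non-medical modalities.

Insight: The main novelty is moving the missing-modality problem from feature space to weight space and using the CP tensor decomposition to build a hypernetwork that parameterizes the model weights for all possible missing-modality combinations, which gives strong model compression and efficient inference. This weight-space modelling offers a new and general framework for robust multimodal learning and may transfer well.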

Abstract: Addressing missing modalities is an important challenge in multimodal image analysis and often relies on complex architectures that do not transfer easily to different datasets without architectural modifications or hyperparameter tuning. While most existing methods tackle this problem in feature space by engineering representations that are robust to missing inputs, we instead operate in weight space. We propose LARGO, a hypernetwork that compresses the $2^N-1$ dedicated missing-modality models into a single network by modelling the convolutional weights using the Canonical Polyadic (CP) tensor decomposition. Extensive experimental validation on BraTS 2018 (4 modalities, 15 scenarios) and ISLES 2022 (3 modalities, 7 scenarios) shows that our method ranks first in 47 out of 52 configurations, achieving average Dice improvements of +0.68% and +2.53% over state-of-the-art baselines (mmFormer, M³AE, ShaSpec, SimMLM). A proof-of-concept experiment on avMNIST suggests that LARGO may extend beyond medical imaging to heterogeneous non-medical modalities.
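
As a rough illustration of the weight-space idea, the sketch below enumerates the $2^N-1$ availability masks and rebuilds a small weight tensor from CP factors whose per-component scales depend on the mask. The abstract does not describe how LARGO conditions the factors on the available modalities, so the linear map in `mask_to_lambda`, the three-way tensor shape, and all names are hypothetical.

```python
from itertools import product

def missing_modality_scenarios(n):
    """All non-empty availability masks over n modalities (2**n - 1 of them)."""
    return [mask for mask in product((0, 1), repeat=n) if any(mask)]

def mask_to_lambda(mask, G):
    """Hypothetical linear hypernetwork: lam[r] = sum_m mask[m] * G[m][r]."""
    rank = len(G[0])
    return [sum(mask[m] * G[m][r] for m in range(len(mask))) for r in range(rank)]

def cp_reconstruct(lam, A, B, C):
    """CP model of a 3-way tensor: W[i][j][k] = sum_r lam[r]*A[i][r]*B[j][r]*C[k][r]."""
    rank = len(lam)
    return [[[sum(lam[r] * A[i][r] * B[j][r] * C[k][r] for r in range(rank))
              for k in range(len(C))]
             for j in range(len(B))]
            for i in range(len(A))]

# Shared rank-2 factors; only the component scales change with the mask.
A = [[1.0, 0.0], [0.0, 1.0]]   # output-channel factor
B = [[1.0, 1.0]]               # input-channel factor
C = [[1.0, 2.0], [3.0, 4.0]]   # kernel-position factor
G = [[1.0, 0.0], [0.0, 1.0]]   # one row of component scales per modality

weights_by_mask = {
    mask: cp_reconstruct(mask_to_lambda(mask, G), A, B, C)
    for mask in missing_modality_scenarios(2)
}
```

The point of the sketch is the parameter sharing: one set of factors serves every scenario, and only a short scale vector is generated per mask, rather than storing a separate weight tensor for each combination.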


[72] OpenGaFF: Open-Vocabulary Gaussian Feature Field with Codebook Attention cs.CV

Kunyi Li, Michael Niemeyer, Sen Wang, Stefano Gasperini, Nassir Navab

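TL;DR: This paper presents OpenGaFF, a new framework for open-vocabulary 3D scene understanding built on 3D Gaussian Splatting. Its core is a Gaussian Feature Field that models semantics as a continuous function of Gaussian geometry and appearance. A structured codebook and a codebook-guided attention mechanism strengthen the coupling between geometry and semantics, improving both the spatial coherence of semantic predictions in 3D space and object-level semantic consistency.

Details

Motivation: To address fragmented and spatially inconsistent semantic predictions across multi-view observations in open-vocabulary 3D scene understanding with Gaussian-based representations.

Result: Extensive experiments on standard 2D and 3D open-vocabulary benchmarks show that the method consistently outperforms prior approaches, with better segmentation quality, stronger 3D semantic consistency, and a semantically interpretable learned codebook.

Insight: The main contributions are: (1) a Gaussian Feature Field that explicitly conditions semantic predictions on geometric structure, strengthening the coupling between geometry and semantics; (2) a structured codebook that serves as a set of shared semantic primitives; (3) a codebook-guided attention mechanism that retrieves language features via similarity matching between query embeddings and learned codebook entries, enabling robust open-vocabulary reasoning while reducing intra-object feature variance.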

Abstract: Understanding open-vocabulary 3D scenes with Gaussian-based representations remains challenging due to fragmented and spatially inconsistent semantic predictions across multi-view observations. In this paper, we present OpenGaFF, a novel framework for open-vocabulary 3D scene understanding built upon 3D Gaussian Splatting. At the core of our method is a Gaussian Feature Field that models semantics as a continuous function of Gaussian geometry and appearance. By explicitly conditioning semantic predictions on geometric structure, this formulation strengthens the coupling between geometry and semantics, leading to improved spatial coherence across similar structures in 3D space. To further enforce object-level semantic consistency, we introduce a structured codebook that serves as a set of shared semantic primitives. Furthermore, a codebook-guided attention mechanism is proposed to retrieve language features via similarity matching between query embeddings and learned codebook entries, enabling robust open-vocabulary reasoning while reducing intra-object feature variance. Extensive experiments on standard 2D and 3D open-vocabulary benchmarks demonstrate that our method consistently outperforms prior approaches, achieving improved segmentation quality, stronger 3D semantic consistency and a semantically interpretable codebook that provides insight into the learned representation.
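
The codebook-guided retrieval described in this abstract can be sketched as a softmax attention over codebook entries. This is a generic illustration rather than OpenGaFF's implementation: cosine similarity, the `temperature` value, and the split into `codebook_keys` and `codebook_values` are all assumptions.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def codebook_attention(query, codebook_keys, codebook_values, temperature=0.1):
    """Retrieve a language feature as a similarity-weighted mix of codebook values."""
    scores = [cosine(query, key) / temperature for key in codebook_keys]
    peak = max(scores)                       # subtract the max for stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(codebook_values[0])
    feature = [sum(w * value[d] for w, value in zip(weights, codebook_values))
               for d in range(dim)]
    return feature, weights

# Two codebook entries; the query embedding is aligned with the first one.
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
feature, weights = codebook_attention([2.0, 0.0], keys, values)
```

Because every point on an object is pulled toward the same few codebook entries, the retrieved features vary less within the object than freely regressed per-point features would, which matches the stated goal of lower intra-object variance.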


[73] Boosting Self-Supervised Tracking with Contextual Prompts and Noise Learning cs.CV

Yaozong Zheng, Qihua Liang, Bineng Zhong, Shuimu Zeng, Yuanliang Xue

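TL;DR: This paper proposes a new self-supervised tracking framework named \tracker. It introduces a dual-modal context association mechanism that combines fine-grained semantic prompts with contextual noise to drive the model toward robust tracking representations. Following the easy-to-hard learning principle, early training uses instance patch tokens (prompts) to help the model acquire tracking knowledge, and contextual noise is then injected gradually to perturb features, encouraging the tracker to learn robust representations in a more complex feature space. The mechanism is applied only during training, so inference stays efficient.

Details

Motivation: Conventional self-supervised trackers lack effective context modeling, and existing context association methods based on non-semantic queries adapt poorly to unlabeled tracking scenarios, making reliable contextual cues hard to learn. The paper addresses this with a new context association mechanism that learns robust contextual knowledge from unlabeled videos.

Result: Extensive experiments demonstrate the superiority of the method.

Insight: The novelty is a dual-modal context association mechanism that combines fine-grained semantic prompts with contextual noise learning and trains in stages following the easy-to-hard principle. It offers a new way to learn high-quality tracking representations from unlabeled videos, and because the mechanism is used only in training, inference efficiency is unaffected.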

Abstract: Learning robust contextual knowledge from unlabeled videos is essential for advancing self-supervised tracking. However, conventional self-supervised trackers lack effective context modeling, while existing context association methods based on non-semantic queries struggle to adapt to unlabeled tracking scenarios, making it difficult to learn reliable contextual cues. In this work, we propose a novel self-supervised tracking framework, named \tracker, which introduces a dual-modal context association mechanism that jointly leverages fine-grained semantic prompts and contextual noise to drive the model toward learning robust tracking representations. Adhering to the easy-to-hard learning principle, our contextual association mechanism operates in two stages. During early training, instance patch tokens (prompts) are assigned to both forward and backward tracking branches to facilitate the acquisition of tracking knowledge. As training progresses, contextual noise is gradually injected into the model to perturb features, encouraging the tracker to learn robust tracking representations in a more complex feature space. Thus, this novel contextual association mechanism enables our self-supervised model to learn high-quality tracking representations from unlabeled videos, while being applied exclusively during training to preserve efficient inference. Extensive experiments demonstrate the superiority of our method.


[74] VISD: Enhancing Video Reasoning via Structured Self-Distillation cs.CV | cs.AI

Hao Lin, Kunyang Lv, Xu Jiang, Jingqi Tian, Zhongjing Du

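TL;DR: This paper proposes VISD (Video Structured Self-Distillation), a framework that addresses two problems video large language models face on complex reasoning tasks: sparse sequence-level rewards and difficult fine-grained credit assignment over long temporal reasoning trajectories. It introduces a video-aware judge model that decomposes reasoning quality into dimensions including answer correctness, logical consistency, and spatio-temporal grounding, and uses this structured feedback to guide a teacher policy, providing dense token-level supervision for training.

Details

Motivation: Training VideoLLMs for complex reasoning is hard because reinforcement learning with verifiable rewards (RLVR) provides only sparse sequence-level supervision and cannot capture token-level contributions, while existing self-distillation methods provide dense supervision but lack structure and diagnostic specificity and are unstable when combined with reinforcement learning.

Result: Experiments on multiple benchmarks show that VISD consistently outperforms strong baselines, improving answer accuracy and spatio-temporal grounding quality. Notably, it reaches these gains while converging nearly 2x faster than the baselines in optimization steps, which substantially improves training sample efficiency.

Insight: The core novelty is a structured self-distillation framework in which a video-aware judge model produces multi-dimensional, diagnostic privileged information, together with a direction-magnitude decoupling mechanism that stably integrates dense supervision with reinforcement learning. Rollout-level advantages computed from rewards determine the update direction, while structured privileged signals modulate token-level update magnitudes. This yields semantically aligned, fine-grained credit assignment and improves both reasoning faithfulness and training efficiency. Curriculum scheduling and teacher stabilization based on an exponential moving average (EMA) further support robust optimization over long video sequences.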

Abstract: Training VideoLLMs for complex reasoning remains challenging due to sparse sequence-level rewards and the lack of fine-grained credit assignment over long, temporally grounded reasoning trajectories. While reinforcement learning with verifiable rewards (RLVR) provides reliable supervision, it fails to capture token-level contributions, leading to inefficient learning. Conversely, existing self-distillation methods offer dense supervision but lack structure and diagnostic specificity, and often interact unstably with reinforcement learning. In this work, we propose VISD, a structured self-distillation framework that introduces diagnostically meaningful privileged information for video reasoning. VISD employs a video-aware judge model to decompose reasoning quality into multiple dimensions, including answer correctness, logical consistency, and spatio-temporal grounding, and uses this structured feedback to guide a teacher policy for token-level supervision. To stably integrate dense supervision with RL, we introduce a direction-magnitude decoupling mechanism, where rollout-level advantages computed from rewards determine update direction, while structured privileged signals modulate token-level update magnitudes. This design enables semantically aligned and fine-grained credit assignment, improving both reasoning faithfulness and training efficiency. Additionally, VISD incorporates curriculum scheduling and EMA-based teacher stabilization to support robust optimization over long video sequences. Experiments on diverse benchmarks show that VISD consistently outperforms strong baselines, improving answer accuracy and spatio-temporal grounding quality. Notably, VISD reaches these gains with nearly 2x faster convergence in optimization steps, highlighting the effectiveness of structured self-supervision in improving both performance and sample efficiency for VideoLLMs.
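
The direction-magnitude decoupling described in this abstract can be sketched as follows. The abstract does not give the actual formula, so this is one hypothetical reading: the rollout-level advantage fixes the sign of every token's update, and non-negative per-token weights derived from the privileged signal redistribute the magnitude. The function name, the `floor` parameter, and the mean-one normalization are assumptions.

```python
def decoupled_token_advantages(rollout_advantage, token_signals, floor=0.1):
    """Spread one rollout-level advantage over tokens without changing its sign.

    token_signals: hypothetical per-token privileged scores (for example,
    how strongly a judge-guided teacher disagrees with the student).
    Magnitudes are floored and normalized to mean 1, so the average token
    advantage still equals the rollout advantage.
    """
    magnitudes = [max(abs(s), floor) for s in token_signals]
    mean_magnitude = sum(magnitudes) / len(magnitudes)
    return [rollout_advantage * m / mean_magnitude for m in magnitudes]

# A rollout with negative advantage: every token is pushed down, but the
# token with the largest privileged signal takes most of the blame.
token_adv = decoupled_token_advantages(-2.0, [0.5, 0.0, 1.5])
```

Keeping the sign tied to the verifiable reward means the dense signal can never flip the update direction, which is one plausible reason such a design would interact more stably with reinforcement learning than adding a separate distillation loss.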


[75] Metonymy in vision models undermines attention-based interpretability cs.CV

Ananthu Aniraj, Cassio F. Dantas, Dino Ienco, Massimiliano Mancini, Diego Marcos

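TL;DR: This paper studies metonymy in vision models: the latent representation of an object part encodes information about the whole object rather than only the corresponding image region. Experiments show that modern pretrained vision transformers violate the locality assumption and exhibit strong intra-object leakage, which undermines the faithfulness of attention-based interpretability methods. The authors establish an upper bound with a two-stage approach that prevents leakage by design and show on multiple tasks that it improves attribute-driven part discovery.

Details

Motivation: To test the locality assumption behind part-based reasoning methods, namely that a part representation encodes primarily information about its corresponding image region, and thereby assess the validity of attention-based interpretability methods.

Result: Experiments show that pretrained vision transformers have significant intra-object leakage, making attention-based interpretable methods unreliable. The two-stage approach, used as an upper bound, prevents leakage and improves attribute-driven part discovery across multiple tasks.

Insight: The novelty is revealing how metonymy in vision models harms interpretability and proposing a two-stage approach as a mitigation, which underscores the importance of preventing leakage by design in part representations. This has implications for models that rely on part-centric concepts, such as CBMs.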

Abstract: Part-based reasoning is a classical strategy to make a computer vision model directly focus on the object parts that are relevant to the downstream task. In the context of deep learning, this also serves to improve by-design interpretability, often by using part-centric attention mechanisms on top of a latent image representation provided by a standard, black-box model. This approach is based on a locality assumption: that the latent representation of an object part encodes primarily information about the corresponding image region. In this work, we test this basic assumption, measuring intra-object leakage in vision models using part-based attribute annotations. Through a comprehensive experimental evaluation, we show that modern pretrained vision transformers violate the locality assumption and exhibit a strong intra-object leakage, in which each part encodes information from the whole object, a visual metonymy that compromises the faithfulness of attention-based interpretable-by-design methods for part-based reasoning, ultimately rendering them uninterpretable. In addition, we establish an upper bound using a two-stage approach that prevents leakage by design. We then show that this inherently disentangled feature extraction improves attribute-driven part discovery on a variety of tasks, confirming the practical impact of intra-object leakage. Our results uncover a neglected issue affecting the interpretability of part-based representations, such as those in CBMs relying on part-centric concepts, highlighting that two-stage approaches offer a promising way to mitigate it.


[76] Dynamic Pondering Sparsity-aware Mixture-of-Experts Transformer for Event Stream based Visual Object Tracking cs.CV | cs.AI

Shiao Wang, Xiao Wang, Duoqing Yang, Wenhao Zhang, Bo Jiang

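TL;DR: This paper proposes a dynamic pondering, sparsity-aware Mixture-of-Experts Transformer for event stream based visual object tracking. The framework models event-density variations across multiple temporal scales and uses a three-stage Vision Transformer backbone for hierarchical multi-density feature learning. It also introduces a sparsity-aware Mixture-of-Experts module and a dynamic depth adjustment strategy to balance tracking accuracy and computational efficiency.

Details

Motivation: Existing RGB trackers perform poorly under challenging imaging conditions such as low illumination and fast motion, whereas event cameras capture pixel-wise brightness changes asynchronously and offer high dynamic range and high temporal resolution. However, existing event-based trackers often neglect the intrinsic spatial sparsity and temporal density of event data and rely on a single fixed temporal-window sampling strategy, which is suboptimal under varying motion dynamics.

Result: Extensive experiments on the FE240hz, COESOT, and EventVOT benchmarks show that the method achieves a favorable trade-off between tracking accuracy and computational efficiency.

Insight: The contributions are: a hierarchical multi-density feature learning framework that explicitly models event-density variations across multiple temporal scales; a sparsity-aware Mixture-of-Experts module that encourages expert specialization under different sparsity patterns; and a dynamic pondering strategy that adapts inference depth to tracking difficulty, balancing efficiency and accuracy.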

Abstract: Despite significant progress, RGB-based trackers remain vulnerable to challenging imaging conditions, such as low illumination and fast motion. Event cameras offer a promising alternative by asynchronously capturing pixel-wise brightness changes, providing high dynamic range and high temporal resolution. However, existing event-based trackers often neglect the intrinsic spatial sparsity and temporal density of event data, while relying on a single fixed temporal-window sampling strategy that is suboptimal under varying motion dynamics. In this paper, we propose an event sparsity-aware tracking framework that explicitly models event-density variations across multiple temporal scales. Specifically, the proposed framework progressively injects sparse, medium-density, and dense event search regions into a three-stage Vision Transformer backbone, enabling hierarchical multi-density feature learning. Furthermore, we introduce a sparsity-aware Mixture-of-Experts module to encourage expert specialization under different sparsity patterns, and design a dynamic pondering strategy to adaptively adjust the inference depth according to tracking difficulty. Extensive experiments on FE240hz, COESOT, and EventVOT demonstrate that the proposed approach achieves a favorable trade-off between tracking accuracy and computational efficiency. The source code will be released on https://github.com/Event-AHU/OpenEvTracking.
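
The dynamic pondering idea, adjusting inference depth to tracking difficulty, can be sketched as a simple early-exit loop. The abstract does not describe the paper's halting rule, so the confidence threshold, the `halting_score` callback, and the function name `ponder` are hypothetical stand-ins.

```python
def ponder(stages, features, halting_score, threshold=0.9):
    """Run backbone stages in order and stop once the halting score is high.

    stages: list of callables, each refining the features.
    halting_score: callable mapping features to a confidence in [0, 1]
    (a hypothetical stand-in for a learned halting head).
    Returns the final features and the number of stages actually executed.
    """
    depth = 0
    for stage in stages:
        features = stage(features)
        depth += 1
        if halting_score(features) >= threshold:
            break
    return features, depth

# Toy setup: the feature is a single confidence-like number, and each
# stage adds 0.4 to it. Easy frames start closer to the threshold.
toy_stages = [lambda x: x + 0.4] * 3
_, easy_depth = ponder(toy_stages, 0.6, halting_score=lambda x: x)
_, hard_depth = ponder(toy_stages, 0.0, halting_score=lambda x: x)
```

Easy frames exit after the first stage and hard frames use the full depth, so average compute tracks the difficulty of the sequence rather than the worst case.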


[77] Pest-Thinker: Learning to Think and Reason like Entomologists via Reinforcement Learning cs.CVPDF

Xueheng Li, Yu Wang, Tao Hu, Ji Huang, Ke Cao

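TL;DR: This paper proposes Pest-Thinker, a knowledge-driven reinforcement learning framework that enables multimodal large language models (MLLMs) to reason over fine-grained pest morphology the way entomologists do. The authors first construct two high-definition pest benchmarks, QFSD and AgriInsect, and perform supervised fine-tuning on synthesized Chain-of-Thought trajectories; they then apply Group Relative Policy Optimization (GRPO) with a novel feature reward that steers the model toward observable morphological evidence. Experiments show substantial gains in both in-domain and out-of-domain morphological understanding.

Details

Motivation: Pest recognition poses unique challenges, namely high inter-species complexity, large intra-species variability, and scarce expert-annotated data, which limit the direct application of existing MLLMs. The work aims to move intelligent agricultural pest analysis toward expert-level visual reasoning.

Result: Extensive experiments show that Pest-Thinker substantially improves both in-domain and out-of-domain morphological understanding, a step toward expert-level visual reasoning for intelligent agricultural pest analysis.

Insight: The contributions are: 1) high-definition pest benchmarks with expert-annotated morphological traits; 2) structured learning from synthesized Chain-of-Thought trajectories; 3) a GRPO reinforcement learning framework that combines a novel feature reward with LLM-as-a-Judge assessment, guiding the model to reason from observable morphological evidence.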

Abstract: Pest-induced crop losses pose a major threat to global food security and sustainable agricultural development. While recent advances in Multimodal Large Language Models (MLLMs) have shown strong potential for visual understanding and smart agriculture, their direct application to pest recognition remains limited due to the domain’s unique challenges such as high inter-species complexity, intra-species variability, and the scarcity of expert-annotated data. In this work, we introduce Pest-Thinker, a knowledge-driven reinforcement learning (RL) framework that enables MLLMs to reason over fine-grained pest morphology. We first construct two high-definition pest benchmarks, QFSD and AgriInsect, comprising diverse species and expert-annotated morphological traits. Leveraging these datasets, we synthesize Chain-of-Thought (CoT) reasoning trajectories to facilitate structured learning of pest-specific visual cues through Supervised Fine-Tuning (SFT). Subsequently, we employ Group Relative Policy Optimization (GRPO) with a novel feature reward that guides the model to focus on observable morphological evidence, assessed by an LLM-as-a-Judge strategy. Extensive experiments demonstrate that Pest-Thinker substantially improves both in-domain and out-of-domain morphological understanding, marking a step toward expert-level visual reasoning for intelligent agricultural pest analysis. The datasets and source code are available upon acceptance.
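The abstract names Group Relative Policy Optimization (GRPO) but does not spell out its update. As a rough illustration, the sketch below shows the group-relative advantage normalisation that GRPO is generally known for: each sampled response is scored against the mean and standard deviation of its own group. The reward values are placeholders; how Pest-Thinker's feature reward is computed by the LLM judge is not described in the abstract and is not modelled here.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Normalise rewards within each group of sampled responses.

    rewards: array of shape (num_prompts, group_size). Each row holds the
    scalar rewards of the responses sampled for one prompt (placeholder
    values here; the paper's feature reward is not reproduced).
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    mean = rewards.mean(axis=-1, keepdims=True)
    std = rewards.std(axis=-1, keepdims=True)
    # Responses better than their group's average get a positive advantage.
    return (rewards - mean) / (std + eps)
```

Because the baseline is the group mean, no separate value network is needed, which is the usual motivation for this normalisation.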


[78] Autoregressive Visual Generation Needs a Prologue cs.CV | cs.AI | cs.LGPDF

Bowen Zheng, Weijian Luo, Guang Yang, Colin Zhang, Tianyang Hu

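TL;DR: This paper proposes Prologue, which bridges the reconstruction-generation gap in autoregressive visual generation by introducing an additional set of prologue tokens. These tokens are dedicated to optimizing generation quality, while the visual tokens remain dedicated to reconstruction, so generation improves markedly without hurting reconstruction performance.

Details

Motivation: Reconstruction and generation objectives conflict in autoregressive image generation: conventional approaches struggle to optimize both at once, which limits generation quality.

Result: On ImageNet 256x256, Prologue-Base reduces gFID from 21.01 to 10.75 (without classifier-free guidance), and Prologue-Large reaches an rFID of 0.99 and a gFID of 1.46, with reconstruction quality almost unchanged. The prologue tokens exhibit semantic structure, reaching 35.88% Top-1 accuracy under linear probing.

Insight: The key idea is to decouple the generative and reconstructive representations: a separate learned generative representation (the prologue tokens) is optimized for generation while the original visual representation is left intact, which suggests a new direction for improving autoregressive generative models.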

Abstract: In this work, we propose Prologue, an approach to bridging the reconstruction-generation gap in autoregressive (AR) image generation. Instead of modifying visual tokens to satisfy both reconstruction and generation, Prologue generates a small set of prologue tokens prepended to the visual token sequence. These prologue tokens are trained exclusively with the AR cross-entropy (CE) loss, while visual tokens remain dedicated to reconstruction. This decoupled design lets us optimize generation through the AR model’s true distribution without affecting reconstruction quality, which we further formalize from an ELBO perspective. On ImageNet 256x256, Prologue-Base reduces gFID from 21.01 to 10.75 without classifier-free guidance while keeping reconstruction almost unchanged; Prologue-Large reaches a competitive rFID of 0.99 and gFID of 1.46 using a standard AR model without auxiliary semantic supervision. Interestingly, driven only by AR gradients, prologue tokens exhibit emergent semantic structure: linear probing on 16 prologue tokens reaches 35.88% Top-1, far above the 23.71% of the first 16 tokens from a standard tokenizer; resampling with fixed prologue tokens preserves a similar high-level semantic layout. Our results suggest a new direction: generation quality can be improved by introducing a separate learned generative representation while leaving the original representation intact.


[79] Retina-RAG: Retrieval-Augmented Vision-Language Modeling for Joint Retinal Diagnosis and Clinical Report Generation cs.CV | cs.AIPDF

Abdelrahman Zaian, Sheethal Bhat, Mohamed Abdalkader, Andreas Maier

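TL;DR: This paper proposes Retina-RAG, a low-cost modular framework that jointly performs diabetic retinopathy (DR) severity grading, macular edema (ME) detection, and clinical report generation. The framework decouples a high-performance retinal classifier from a parameter-efficient vision-language model adapted via LoRA, and adds a retrieval-augmented generation (RAG) module that injects ophthalmic knowledge to improve diagnostic consistency and reduce hallucinations.

Details

Motivation: Most existing automated screening systems are limited to image-level classification and lack clinically structured reporting. The paper addresses the clinical need for joint diagnosis and report generation in diabetic retinopathy screening.

Result: On a retinal disease detection dataset with captions, Retina-RAG achieves an F1-score of 0.731 for DR grading and 0.948 for ME detection, substantially outperforming zero-shot Qwen (0.096, 0.732) and MMed-RAG (0.541, 0.641). For report generation it attains ROUGE-L 0.429 and SBERT similarity 0.884, exceeding all baselines.

Insight: The contributions are: 1) a modular architecture that decouples the high-performance classifier from the parameter-efficient VLM, enabling flexible integration; 2) a RAG module that injects structured classifier outputs and curated ophthalmic knowledge to improve diagnostic consistency and reduce hallucinations; 3) a full framework that runs on a single consumer-grade GPU, showing that clinically structured retinal AI is feasible with modest compute.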

Abstract: Diabetic Retinopathy (DR) is a leading cause of preventable blindness among working-age adults worldwide, yet most automated screening systems are limited to image-level classification and lack clinically structured reporting. We propose Retina-RAG, a low-cost modular framework that jointly performs DR severity grading, macular edema (ME) detection, and report generation. The architecture decouples a high-performance retinal classifier and a parameter-efficient vision-language model (Qwen2.5-VL-7B-Instruct) adapted via Low-Rank Adaptation (LoRA), enabling flexible component integration. A retrieval-augmented generation (RAG) module injects curated ophthalmic knowledge together with structured classifier outputs at inference time to improve diagnostic consistency and reduce hallucinations. Retina-RAG achieves an F1-score of 0.731 for DR grading and 0.948 for ME detection, substantially outperforming zero-shot Qwen (0.096, 0.732) and MMed-RAG (0.541, 0.641) on a retinal disease detection dataset with captions. For report generation, Retina-RAG attains ROUGE-L 0.429 and SBERT similarity 0.884, exceeding all baselines. The full framework operates on a single consumer-grade GPU, demonstrating that clinically structured retinal AI can be achieved with modest computational resources.


[80] SuperFace: Preference-Aligned Facial Expression Estimation Beyond Pseudo Supervision cs.CVPDF

Zejian Kang, Xuanyang Xu, Wentao Yang, Kai Zheng, Yuanchen Fei

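TL;DR: This paper proposes SuperFace, a framework that uses human preference feedback to optimize facial expression estimation with ARKit blendshape coefficients, improving the perceptual realism and expressiveness of digital human animation and moving beyond pure pseudo-label supervision.

Details

Motivation: Existing methods rely on supervised learning from pseudo labels produced by capture software such as Live Link Face. These labels contain noise, bias, and missing actions, so the resulting models do not optimize for perceptual expression fidelity.

Result: Experiments show that SuperFace achieves better expression fidelity than Live Link Face supervision, supporting the effectiveness of preference-driven optimization for semantic facial action prediction.

Insight: The key idea is to shift ARKit coefficient prediction from pseudo-label imitation to human-aligned perceptual optimization, using human preference feedback to directly optimize the visual quality of rendered expressions rather than relying on flawed numerical pseudo labels.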

Abstract: Accurate facial estimation is crucial for realistic digital human animation, and ARKit blendshape coefficients offer an interpretable representation by mapping facial motions to semantic animation controls. However, learning high-quality ARKit coefficient prediction remains limited by the absence of reliable ground-truth supervision. Existing methods typically rely on capture software such as Live Link Face to provide pseudo labels, which may contain noisy activations, biased coefficient magnitudes, and missing or inaccurate facial actions. Consequently, models trained with supervised learning tend to reproduce imperfect pseudo labels rather than optimize for perceptual expression fidelity. In this paper, we propose SuperFace, a preference-driven framework that moves ARKit facial expression estimation from pseudo-label imitation toward human-aligned perceptual optimization. Instead of treating software-estimated coefficients as fixed ground truth, SuperFace uses them only as an initialization and further improves coefficient prediction through human preference feedback on rendered facial expressions. By aligning the model with perceptual judgments rather than numerical pseudo labels, SuperFace enables more visually faithful and expressive facial animation. Experiments show that SuperFace improves expression fidelity over Live Link Face supervision, demonstrating the effectiveness of preference-driven optimization for semantic facial action prediction.


[81] EA-WM: Event-Aware Generative World Model with Structured Kinematic-to-Visual Action Fields cs.CV | cs.AI | cs.ROPDF

Zhaoyang Yang, Yurun Jin, Lizhe Qi, Cong Huang, Kai Chen

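TL;DR: This paper proposes EA-WM (Event-Aware Generative World Model), which introduces Structured Kinematic-to-Visual Action Fields that project robot actions and kinematic states directly into the target camera view. Built on video diffusion models, the resulting world model preserves precise robot spatial geometry and fine-grained interaction dynamics.

Details

Motivation: Existing world-action models built on video diffusion mainly treat video generation as an auxiliary representation for policy learning and underuse action signals to guide video synthesis, so the generated rollouts often fail to preserve precise robot spatial geometry and fine-grained robot-object interaction dynamics.

Result: On the comprehensive WorldArena benchmark, EA-WM achieves state-of-the-art performance, outperforming existing baselines by a significant margin.

Insight: The core innovation is the Structured Kinematic-to-Visual Action Field, which turns low-dimensional abstract action tokens into a geometrically grounded visual representation. Event-aware bidirectional fusion blocks modulate cross-branch attention to capture object state changes and interaction dynamics, closing the loop between kinematic control and visual perception.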

Abstract: Pretrained video diffusion models provide powerful spatiotemporal generative priors, making them a natural foundation for robotic world models. While recent world-action models jointly optimize future videos and actions, they predominantly treat video generation as an auxiliary representation for policy learning. Consequently, they insufficiently explore the inverse problem: leveraging action signals to guide video synthesis, thereby often failing to preserve precise robot spatial geometry and fine-grained robot-object interaction dynamics in the generated rollouts. To bridge this gap, we present EA-WM, an Event-Aware Generative World Model that effectively closes the loop between kinematic control and visual perception. Rather than injecting joint or end-effector actions as abstract, low-dimensional tokens, EA-WM projects actions and kinematic states directly into the target camera view as Structured Kinematic-to-Visual Action Fields. To fully exploit this geometrically grounded representation, we introduce event-aware bidirectional fusion blocks that modulate cross-branch attention, capturing object state changes and interaction dynamics. Evaluated on the comprehensive WorldArena benchmark, EA-WM achieves state-of-the-art performance, outperforming existing baselines by a significant margin.
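The abstract says that actions and kinematic states are projected into the target camera view. The geometric core of any such projection is the standard pinhole camera model, sketched below under the assumption that the camera intrinsics `K` and extrinsics `R`, `t` are known (these names are illustrative, not taken from the paper). How EA-WM turns the projected points into action fields is not described in the abstract and is not reproduced here.

```python
import numpy as np

def project_points(points_world, K, R, t):
    """Project 3D points (N, 3) into pixel coordinates (N, 2).

    K: (3, 3) camera intrinsics; R: (3, 3) rotation and t: (3,) translation
    mapping world coordinates to camera coordinates. All are assumed inputs.
    """
    points_world = np.asarray(points_world, dtype=np.float64)
    cam = points_world @ R.T + t          # world frame -> camera frame
    uvw = cam @ K.T                       # camera frame -> homogeneous pixels
    return uvw[:, :2] / uvw[:, 2:3]       # perspective divide
```

For example, end-effector or joint positions obtained from forward kinematics could be passed as `points_world` to locate them in the image plane.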


[82] Bridging visual saliency and large language models for explainable deep learning in medical imaging cs.CV | cs.LGPDF

Paul Valery Nguezet, Elie Tagne Fute, Yusuf Brima, Benoit Martin Azanguezet, Marcellin Atemkeng

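TL;DR: This paper presents a multimodal explainability framework that addresses the opacity that hinders clinical adoption of deep learning models in medical imaging. The framework combines convolutional neural networks (CNNs), visual saliency attribution methods, and large language models (LLMs) to generate interpretable diagnostic reports for brain tumor classification. It has three coupled stages: extending CNN architectures to jointly optimize a classification head and a segmentation head; applying Grad-CAM and related methods to generate heatmaps that are converted into tumor masks; and mapping the masks onto a neuroanatomical atlas so that LLMs can generate radiological-style diagnostic narratives.

Details

Motivation: The opacity of deep learning models in medical imaging impedes their clinical adoption. A multimodal explainability framework that links CNN predictions to clinically actionable insights is needed to improve the transparency and clinical interpretability of AI-assisted brain tumor diagnosis.

Result: Evaluated on a dataset of 4,834 contrast-enhanced T1-weighted brain MRI images spanning three tumor classes, InceptionResNetV2 achieved the highest classification performance and Grad-CAM++ yielded the best segmentation overlap. Among the language models, Grok3 led in lexical diversity and coherence, while LLaMA achieved the highest readability score.

Insight: The contributions are: 1) a dual-output hybrid CNN formulation that jointly optimizes classification and segmentation heads for spatially richer feature learning; 2) an adaptive percentile thresholding pipeline that refines saliency heatmaps into binary tumor masks, which are mapped onto a neuroanatomical atlas to yield structured evidence; 3) the integration of visual, anatomical, and linguistic modalities, with LLMs generating technically grounded, interpretable diagnostic reports, giving a unified multimodal approach to explainability in medical imaging.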

Abstract: The opaque nature of deep learning models remains a significant barrier to their clinical adoption in medical imaging. This paper presents a multimodal explainability framework that bridges the gap between convolutional neural network (CNN) predictions and clinically actionable insights for brain tumor classification, leveraging large language models (LLMs) to deliver human-interpretable diagnostic narratives. The proposed framework operates through three coupled stages. First, nine CNN architectures are extended with a dual-output hybrid formulation that simultaneously optimises a classification head and a segmentation head, enabling spatially richer feature learning. Second, visual saliency attribution methods, namely Grad-CAM, Grad-CAM++, and ScoreCAM, are applied to generate class-discriminative heatmaps, which are subsequently refined into binary tumor masks via an adaptive percentile thresholding pipeline. Third, the resulting masks are mapped onto the Harvard-Oxford cortical atlas to translate pixel-level evidence into named neuroanatomical structures, and the extracted findings are encoded into a structured JSON file that conditions three LLMs (Grok3, Mistral, and LLaMA) to generate coherent, radiological-style diagnostic reports. Evaluated on a dataset of 4,834 contrast-enhanced T1-weighted brain MRI images spanning three tumor classes, InceptionResNetV2 achieved the highest classification performance and Grad-CAM++ yielded the best segmentation overlap. Among the language models, Grok3 led in lexical diversity and coherence, while LLaMA achieved the highest readability score. By integrating visual, anatomical, and linguistic modalities into a unified pipeline, the framework produces explanations that are technically grounded and meaningfully interpretable, advancing the transparency and clinical accountability of artificial intelligence assisted brain tumor diagnosis.
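The second stage refines saliency heatmaps into binary masks via percentile thresholding. A minimal sketch of that idea follows: the threshold is taken from each heatmap's own value distribution, so it adapts to the heatmap rather than being a fixed constant. The function name and the default percentile are assumptions; the paper's actual pipeline may include further steps that the abstract does not describe.

```python
import numpy as np

def heatmap_to_mask(heatmap, percentile=90.0):
    """Binarise a saliency heatmap by keeping its top-percentile pixels.

    percentile is an illustrative default, not a value from the paper.
    """
    heatmap = np.asarray(heatmap, dtype=np.float64)
    # The cut-off is computed per heatmap, so it tracks that map's intensity range.
    threshold = np.percentile(heatmap, percentile)
    return heatmap >= threshold
```

The resulting boolean mask could then be overlaid on an atlas to look up which named regions it covers.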


[83] Look Beyond Saliency: Low-Attention Guided Dual Encoding for Video Semantic Search cs.CVPDF

Faisal Aljehrai, Mohammed A. Alkhrashi, Alreem Almuhrij, Sarah Abuhimed, Noorh Aldossary

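TL;DR: In video semantic search over crowded scenes, visual encoders over-attend to salient foreground regions and neglect important background regions. This paper proposes an Inverse Attention Embedding mechanism that explicitly captures and highlights the overlooked background regions; combined with conventional visual embeddings, it improves semantic retrieval performance without additional training.

Details

Motivation: In video semantic search over crowded scenes, existing visual encoders tend to prioritize salient foreground regions and neglect contextually important background regions, which limits semantic retrieval performance.

Result: Initial experiments and ablation studies show promising improvements over existing approaches in recall for video semantic search in crowded environments.

Insight: The key idea is the Inverse Attention Embedding mechanism, which attends to low-attention (background) regions to complement conventional saliency-driven encoding and capture video semantics more completely. It is a simple strategy that requires no additional training.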

Abstract: Video semantic search in densely crowded scenes remains a challenging task due to visual encoders' tendency to prioritize salient foreground regions while neglecting contextually important background areas. We propose an Inverse Attention Embedding mechanism that explicitly captures and highlights these overlooked regions. By combining inverse attention embeddings with traditional visual embeddings, our method significantly enhances semantic retrieval performance without additional training. Initial experiments and ablation studies demonstrate promising improvements over existing approaches in recall for video semantic search in crowded environments.
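The abstract does not give the formula for the inverse attention embedding, so the following is only one plausible reading, written as a training-free sketch: patch features are pooled twice, once with the encoder's attention weights and once with inverted weights that favour low-attention patches, and the two vectors are concatenated. The inversion rule (`max - attention`) and the concatenation are assumptions made for illustration.

```python
import numpy as np

def dual_embedding(patch_features, attention, eps=1e-8):
    """Combine a saliency-weighted and an inverse-attention-weighted embedding.

    patch_features: (N, D) patch embeddings from a frozen visual encoder.
    attention: (N,) non-negative attention weights over the patches.
    """
    patch_features = np.asarray(patch_features, dtype=np.float64)
    attention = np.asarray(attention, dtype=np.float64)
    attn = attention / (attention.sum() + eps)
    # Hypothetical inversion: the least-attended patches get the largest weight.
    inverse = attn.max() - attn
    inverse = inverse / (inverse.sum() + eps)
    salient_vec = attn @ patch_features
    background_vec = inverse @ patch_features
    return np.concatenate([salient_vec, background_vec])
```

Because both poolings reuse features and attention that the encoder already produces, nothing is trained, which matches the training-free claim in the abstract.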


[84] ZScribbleSeg: A comprehensive segmentation framework with modeling of efficient annotation and maximization of scribble supervision cs.CVPDF

Ke Zhang, Bomin Wang, Hangqi Zhou, Xiahai Zhuang

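TL;DR: ZScribbleSeg is a comprehensive framework for medical image segmentation that improves weakly supervised segmentation by maximizing the supervision obtained from scribble annotations and by introducing spatial priors. It first investigates the principles of efficient scribble annotation, deriving efficient scribble forms via supervision maximization and randomness simulation. It then adds regularization terms that encode spatial relationships and shape constraints, using the EM algorithm to estimate class mixture ratios and correct erroneous predictions. The result is competitive performance on multiple medical image segmentation tasks.

Details

Motivation: Curating fully annotated datasets for medical image segmentation is costly, and existing scribble-based weakly supervised methods produce inaccurate segmentations because of insufficient supervision and incomplete shape information. The paper addresses both problems by modeling efficient annotation and maximizing scribble supervision.

Result: Using only scribble annotations, ZScribbleSeg achieves competitive performance on six segmentation tasks: ACDC, MSCMRseg, BTCV, MyoPS, Decathlon-BrainTumor, and Decathlon-Prostate.

Insight: The contributions are an investigation of the principles of efficient scribble annotation (via supervision maximization and randomness simulation) and EM-based estimation of class mixture ratios, which underpins the incorporation of spatial priors and shape constraints and makes better use of weak supervision to improve segmentation accuracy.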

Abstract: Curating fully annotated datasets for medical image segmentation is labour-intensive and expertise-demanding. To alleviate this problem, prior studies have explored scribble annotations for weakly supervised segmentation. Existing solutions mainly compute losses on annotated areas and generate pseudo labels by propagating annotations to adjacent regions. However, these methods often suffer from inaccurate and unrealistic segmentations due to insufficient supervision and incomplete shape information. In contrast, we first investigate the principle of good scribble annotations, which leads to efficient scribble forms via supervision maximization and randomness simulation. We further introduce regularization terms to encode the spatial relationship and the shape constraints, where the EM algorithm is utilized to estimate the mixture ratios of label classes. These ratios are critical in identifying the unlabeled pixels for each class and correcting erroneous predictions, thus the accurate estimation lays the foundation for the incorporation of spatial prior. Finally, we integrate the efficient scribble supervision with the prior into a framework, referred to as ZScribbleSeg, and apply it to multiple scenarios. Leveraging only scribble annotations, ZScribbleSeg achieves competitive performance on six segmentation tasks including ACDC, MSCMRseg, BTCV, MyoPS, Decathlon-BrainTumor and Decathlon-Prostate. Our code will be released via https://github.com/DLwbm123/ZScribbleSeg.
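The abstract states that EM is used to estimate the mixture ratios of label classes. As a hedged illustration, here is the textbook EM update for mixing proportions when the per-pixel class likelihoods are held fixed; the input `likelihoods` is an assumed stand-in (for instance, a network's per-pixel class scores), and the paper's exact formulation may differ.

```python
import numpy as np

def estimate_mixture_ratios(likelihoods, n_iters=100, tol=1e-8):
    """Estimate class mixing proportions with EM, component likelihoods fixed.

    likelihoods: (num_pixels, num_classes), entry [i, k] ~ p(pixel_i | class k).
    Every row is assumed to have at least one positive entry.
    """
    likelihoods = np.asarray(likelihoods, dtype=np.float64)
    num_classes = likelihoods.shape[1]
    ratios = np.full(num_classes, 1.0 / num_classes)
    for _ in range(n_iters):
        # E-step: posterior responsibility of each class for each pixel.
        weighted = likelihoods * ratios
        resp = weighted / weighted.sum(axis=1, keepdims=True)
        # M-step: each ratio becomes the average responsibility of its class.
        new_ratios = resp.mean(axis=0)
        converged = np.abs(new_ratios - ratios).max() < tol
        ratios = new_ratios
        if converged:
            break
    return ratios
```

The estimated ratios tell how many pixels each class should claim, which is the kind of quantity the abstract describes as critical for identifying unlabeled pixels.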


[85] Spark3R: Asymmetric Token Reduction Makes Fast Feed-Forward 3D Reconstruction cs.CVPDF

Zecheng Tang, Jiaye Fu, Qiankun Gao, Haijie Li, Yanmin Wu

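TL;DR: Spark3R is a training-free framework for accelerating feed-forward 3D reconstruction models. It uses an asymmetric token reduction strategy that compresses query tokens and key-value tokens differently (intra-group token merging for queries, lightweight token pruning for keys and values) and adaptively adjusts the key-value reduction factor across layers, substantially improving efficiency on long video inputs while maintaining reconstruction quality.

Details

Motivation: Feed-forward 3D reconstruction models based on Vision Transformers are hard to scale to video inputs with hundreds or thousands of frames because of the quadratic cost of global attention layers. Existing token-merging methods also compress query and key-value tokens uniformly, ignoring their functionally distinct roles.

Result: As a plug-and-play framework, Spark3R integrates directly into multiple pretrained models (such as VGGT, π³, and Depth-Anything-3) and achieves up to 28x speedup on 1,000-frame inputs while maintaining competitive reconstruction quality.

Insight: The key observation is that, in feed-forward 3D reconstruction models, query tokens encode view-specific geometric requests and are sensitive to compression, whereas key-value tokens represent shared scene context and tolerate aggressive compression. This motivates an asymmetric, adaptive token reduction strategy with a better quality-efficiency trade-off.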

Abstract: Feed-forward 3D reconstruction models based on Vision Transformers can directly estimate scene geometry and camera poses from a small set of input images, but scaling them to video inputs with hundreds or thousands of frames remains challenging due to the quadratic cost of global attention layers. Recent token-merging methods accelerate these models by compressing the token sequence within the global attention layers, but they apply a uniform reduction to query tokens and key-value tokens, ignoring their functionally distinct roles in 3D reconstruction. In this work, we identify a key property of feed-forward 3D reconstruction models: query tokens encode view-specific geometric requests and are sensitive to compression, while key-value tokens represent shared scene context and tolerate aggressive compression. Guided by this insight, we propose Spark3R, a training-free acceleration framework that decouples the compression of query tokens and key-value tokens by assigning distinct reduction factors, with intra-group token merging applied to query tokens and lightweight token pruning to key-value tokens. Additionally, Spark3R adaptively adjusts the key-value reduction factor across layers, further improving the quality-efficiency trade-off. As a plug-and-play framework requiring no retraining, Spark3R integrates directly into multiple pretrained feed-forward 3D reconstruction models, including VGGT, $π^3$, and Depth-Anything-3, and achieves up to $28\times$ speedup on 1,000-frame inputs while maintaining competitive reconstruction quality.
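To make the asymmetry concrete, the sketch below applies a mild reduction to queries (averaging consecutive groups) and a more aggressive one to keys and values (keeping a fraction of them), then broadcasts the merged outputs back to the original token count. The grouping by position, the key-norm pruning score, and the default factors are all assumptions chosen for illustration; the abstract does not specify how Spark3R forms groups or scores tokens.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def asymmetric_attention(q, k, v, q_group=2, kv_keep=0.25):
    """Single-head attention with separate reduction of queries and keys/values.

    q, k, v: (N, D) arrays. q_group and kv_keep are illustrative defaults.
    """
    n, d = q.shape
    assert n % q_group == 0, "token count must be divisible by q_group"
    # Queries: merge each group of consecutive tokens by averaging.
    q_merged = q.reshape(n // q_group, q_group, d).mean(axis=1)
    # Keys/values: prune, keeping the keys with the largest norm (hypothetical score).
    n_keep = max(1, int(round(k.shape[0] * kv_keep)))
    keep = np.sort(np.argsort(-np.linalg.norm(k, axis=1))[:n_keep])
    k_kept, v_kept = k[keep], v[keep]
    attn = softmax(q_merged @ k_kept.T / np.sqrt(d))
    out_merged = attn @ v_kept
    # Unmerge: every token in a group receives its group's output.
    return np.repeat(out_merged, q_group, axis=0)
```

With these defaults the attention matrix shrinks from N x N to (N/2) x (N/4), which is where the savings on long sequences would come from.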


[86] Eulerian Motion Guidance: Robust Image Animation via Bidirectional Geometric Consistency cs.CVPDF

Thong Nguyen, Khoi M. Le, Cong-Duy Nguyen, Luu Anh Tuan, See-Kiong Ng

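TL;DR: This paper proposes an image animation method based on Eulerian motion guidance. It supervises generation locally with adjacent-frame Eulerian motion fields and adds a Bidirectional Geometric Consistency mechanism that masks occluded regions to reduce artifacts, which accelerates training and improves temporal coherence.

Details

Motivation: Existing diffusion-based image animation methods typically rely on Lagrangian motion guidance, in which optical flow is estimated relative to the initial frame. This can lead to inefficient training and dynamic artifacts, which the paper addresses with more local Eulerian motion supervision.

Result: Experiments show that, compared with reference-based baselines, the method accelerates training, preserves temporal coherence better, and reduces dynamic artifacts.

Insight: The key ideas are shifting motion guidance from a Lagrangian to an Eulerian view, which enables parallelized training and bounded-error supervision, and a Bidirectional Geometric Consistency mechanism that uses a forward-backward cycle check to identify and mask occluded regions so the model does not learn incorrect warping objectives.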

Abstract: Recent advancements in image animation have utilized diffusion models to breathe life into static images. However, existing controllable frameworks typically rely on Lagrangian motion guidance, where optical flow is estimated relative to the initial frame. This paper revisits the same optical-flow primitive through a more local supervision design: we use adjacent-frame Eulerian motion fields to guide generation, where the motion signal always describes a short temporal hop. This shift enables parallelized training and provides bounded-error supervision throughout the generation process. To mitigate the drift artifacts common in adjacent frame generation, we introduce a Bidirectional Geometric Consistency mechanism, which computes a forward-backward cycle check to mathematically identify and mask occluded regions, preventing the model from learning incorrect warping objectives. Extensive experiments demonstrate that our approach accelerates training, preserves temporal coherence, and reduces dynamic artifacts compared to reference-based baselines.
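A forward-backward cycle check is a common way to detect occlusions from optical flow: following the forward flow and then the backward flow should return a visible pixel to where it started. The sketch below implements that generic check with nearest-neighbour sampling; the tolerance `tol` and the sampling scheme are assumptions, and the paper's exact criterion is not given in the abstract.

```python
import numpy as np

def occlusion_mask(flow_fw, flow_bw, tol=1.0):
    """Flag pixels whose forward and backward flows do not cancel out.

    flow_fw: (H, W, 2) flow from frame t to t+1, stored as (dx, dy).
    flow_bw: (H, W, 2) flow from frame t+1 back to frame t.
    Returns a boolean (H, W) array, True where the pixel is likely occluded.
    """
    h, w, _ = flow_fw.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Landing position of each pixel in the next frame (rounded and clamped).
    xt = np.clip(np.rint(xs + flow_fw[..., 0]).astype(int), 0, w - 1)
    yt = np.clip(np.rint(ys + flow_fw[..., 1]).astype(int), 0, h - 1)
    bw_at_target = flow_bw[yt, xt]
    # For a visible pixel the round trip is close to zero displacement.
    cycle_error = np.linalg.norm(flow_fw + bw_at_target, axis=-1)
    return cycle_error > tol
```

Pixels flagged by the mask would be excluded from the warping loss, so the model is not supervised with flow that points at content hidden in the next frame.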


[87] Render, Don’t Decode: Weight-Space World Models with Latent Structural Disentanglement cs.CV | cs.AIPDF

Roussel Desmond Nzoyem, Mauro Comi

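TL;DR: This paper proposes NOVA, a world modelling framework that represents the system state as the weights and biases of a coordinate-based implicit neural representation (INR). Video frames are produced by analytic rendering rather than by a decoder, giving compact, portable world models with zero-shot super-resolution.

Details

Motivation: Existing world models trained on unlabelled video encode raw pixels into opaque latent spaces and rely on heavy decoders, which makes them computationally expensive and hard to interpret.

Result: Validated on several challenging datasets, NOVA achieves strong controllable forecasting on a single consumer GPU at about 40M parameters.

Insight: The structured INR representation removes the decoder bottleneck and, without auxiliary losses or adversarial objectives, disentangles structural scene components such as background, foreground, and inter-frame motion. This supports independent editing of content or dynamics and points toward customisable virtual experiences.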

Abstract: Training world models on vast quantities of unlabelled videos is a critical step toward fully autonomous intelligence. However, the prevailing paradigm of encoding raw pixels into opaque latent spaces and relying on heavy decoders for reconstruction leaves these models computationally expensive and uninterpretable. We address this problem by introducing NOVA, a world modelling framework that represents the system state as the weights and biases of an auxiliary coordinate-based implicit neural representation (INR). This structured representation is analytically rendered, which eliminates the decoder bottleneck while conferring compactness, portability, and zero-shot super-resolution. Furthermore, like most latent action models, NOVA can be distilled into a context-dependent video generator via an action-matching objective. Surprisingly, without resorting to auxiliary losses or adversarial objectives, NOVA can disentangle structural scene components such as background, foreground, and inter-frame motion, enabling users to edit either content or dynamics without compromising the other. We validate our framework on several challenging datasets, achieving strong controllable forecasting while operating on a single consumer GPU at $\sim$40M parameters. Ultimately, structured representations like INRs not only enhance our understanding of latent dynamics but also pave the way for immersive and customisable virtual experiences.
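The claim that an INR state can be rendered without a decoder, and at any resolution, follows from how coordinate networks work: the frame is obtained by evaluating a small network at pixel coordinates, so the sampling grid is a free choice. The sketch below illustrates this with a hypothetical two-layer sine network; the architecture, layer sizes, and parameter layout are assumptions and not NOVA's actual INR.

```python
import numpy as np

def render_inr(params, height, width):
    """Render an RGB frame by querying a coordinate network on a pixel grid.

    params: (w1, b1, w2, b2) with shapes (2, H), (H,), (H, 3), (3,).
    These weights play the role of the world-model state in this sketch.
    """
    w1, b1, w2, b2 = params
    ys, xs = np.meshgrid(
        np.linspace(-1.0, 1.0, height), np.linspace(-1.0, 1.0, width), indexing="ij"
    )
    coords = np.stack([xs, ys], axis=-1).reshape(-1, 2)
    hidden = np.sin(coords @ w1 + b1)                   # sine activation, SIREN-style
    rgb = 1.0 / (1.0 + np.exp(-(hidden @ w2 + b2)))     # squash to (0, 1)
    return rgb.reshape(height, width, 3)
```

Calling `render_inr(params, 64, 64)` and `render_inr(params, 256, 256)` with the same `params` yields the same underlying signal sampled on a coarser or finer grid, which is the mechanism behind resolution-free rendering.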


[88] NavOne: One-Step Global Planning for Vision-Language Navigation on Top-Down Maps cs.CV | cs.AIPDF

Dijia Zhan, Jinyi Li, Chenxi Zheng, Shaoyu Huang, Yong Li

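TL;DR: This paper proposes NavOne, a new vision-language navigation (VLN) method that reformulates navigation as one-step global path planning on pre-built top-down maps. It fuses multi-modal map representations with a Top-Down Map Fuser and extends Attention Residuals for spatial-aware depth mixing, directly predicting dense path probabilities in a single forward pass. On the newly constructed R2R-TopDown dataset, NavOne achieves state-of-the-art performance among map-based VLN methods and substantially improves planning efficiency.

Details

Motivation: Existing VLN methods typically adopt an egocentric, step-by-step paradigm that is prone to error accumulation and is inefficient. Recent approaches that use pre-built environment maps often rely on incrementally updating memory graphs or scoring discrete path proposals, which restricts continuous spatial reasoning and creates discrete bottlenecks.

Result: Extensive experiments on R2R-TopDown show that NavOne achieves state-of-the-art performance among map-based VLN methods. Its planning stage is 8x faster than existing map-based baselines and 80x faster than egocentric methods, enabling efficient global navigation.

Insight: The core innovation is reformulating VLN as a one-step global path planning problem and solving it with NavOne, a unified end-to-end framework. Its technical highlights are the Top-Down Map Fuser for joint multi-modal map representation and the extended Attention Residuals mechanism for spatial-aware depth mixing, which together offer a new approach to continuous spatial reasoning and efficient global planning.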

Abstract: Existing Vision-Language Navigation (VLN) methods typically adopt an egocentric, step-by-step paradigm, which struggles with error accumulation and limits efficiency. While recent approaches attempt to leverage pre-built environment maps, they often rely on incrementally updating memory graphs or scoring discrete path proposals, which restricts continuous spatial reasoning and creates discrete bottlenecks. We propose Top-Down VLN (TD-VLN), reformulating navigation as a one-step global path planning problem on pre-built top-down maps, supported by our newly constructed R2R-TopDown dataset. To solve this, we introduce NavOne, a unified framework that directly predicts dense path probabilities over multi-modal maps in a single end-to-end forward pass. NavOne features a Top-Down Map Fuser for joint multi-modal map representation, and extends Attention Residuals for spatial-aware depth mixing. Extensive experiments on R2R-TopDown show that NavOne achieves state-of-the-art performance among map-based VLN methods, with a planning-stage speedup of 8x over existing map-based baselines and 80x over egocentric methods, enabling highly efficient global navigation.


[89] TinyBayes: Closed-Form Bayesian Inference via Jacobi Prior for Real-Time Image Classification on Edge Devices cs.CV | cs.AI | cs.LG | stat.AP | stat.MLPDF

Shouvik Sardar, Sourish Das

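TL;DR: TinyBayes is a framework for real-time image classification on edge devices, which the authors describe as the first to combine a closed-form Bayesian classifier with a mobile-grade computer vision pipeline, applied here to cocoa disease detection. The pipeline uses YOLOv8-Nano for lesion localisation, MobileNetV3-Small for feature extraction, and a Bayesian classifier based on the Jacobi prior. The total model size is within 9.5 MB, and CPU inference takes under 150 ms per image.

Details

Motivation: Cocoa Swollen Shoot Virus Disease and anthracnose cause severe yield losses for smallholder cocoa farmers in West Africa. Automated disease detection in resource-constrained settings requires models that are small, fast, and usable without internet connectivity. Existing edge-deployable systems lack uncertainty quantification, while Bayesian methods for edge devices do not target agricultural applications.

Result: TinyBayes reaches 78.7% accuracy on the Amini Cocoa Contamination Challenge dataset with a total model size within 9.5 MB and inference under 150 ms per image. Benchmarked against seven classifiers, including Random Forest, SVM, and XGBoost, Jacobi-DMR offers the best trade-off between accuracy, model size, and inference speed.

Insight: The key idea is combining a closed-form Bayesian classifier (Jacobi-DMR) with a mobile-grade vision pipeline, bringing uncertainty quantification to edge-device crop disease detection. The Jacobi prior yields non-iterative estimators via projection, and the classifier adds only 13.5 KB to the pipeline. The authors also prove asymptotic equivalence, consistency, asymptotic normality, and a bias correction for Jacobi-DMR.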

Abstract: Cocoa (Theobroma cacao) is a critical cash crop for millions of smallholder farmers in West Africa, where Cocoa Swollen Shoot Virus Disease (CSSVD) and anthracnose cause devastating yield losses. Automated disease detection from leaf images is essential for early intervention, yet deploying such systems in resource-constrained settings demands models that are small, fast, and require no internet connectivity. Existing edge-deployable plant disease systems rely on end-to-end deep learning without uncertainty quantification, while Bayesian methods for edge devices focus on hardware-level inference architectures rather than agricultural applications. We bridge this gap with TinyBayes, the first framework to combine a closed-form Bayesian classifier with a mobile-grade computer vision pipeline for crop disease detection. Our pipeline uses YOLOv8-Nano (5.9 MB) for lesion localisation, MobileNetV3-Small (3.5 MB) for feature extraction, and, for classification, the Jacobi prior, a Bayesian method that provides closed-form, non-iterative estimators via projection. The Jacobi-DMR (Distributed Multinomial Regression) classifier adds only 13.5 KB to the pipeline, bringing the total model size within 9.5 MB, while achieving 78.7% accuracy on the Amini Cocoa Contamination Challenge dataset and enabling end-to-end CPU inference under 150 ms per image. We benchmark against seven classifiers including Random Forest, SVM, Ridge, Lasso, Elastic Net, XGBoost, and Jacobi-GP, and demonstrate that the Jacobi-DMR offers the best trade-off between accuracy, model size, and inference speed for edge deployment. We have proved the asymptotic equivalence, consistency, asymptotic normality, and bias correction of Jacobi-DMR. All data and code are available at https://github.com/shouvik-sardar/TinyBayes
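The size budget quoted in the abstract can be checked with a few lines of arithmetic. The component sizes are those stated above; the conversion of 1024 KB per MB is an assumption (using 1000 changes the total only in the third decimal place).

```python
# Component sizes as reported in the abstract.
KB_PER_MB = 1024  # assumed convention; the abstract does not specify it

component_sizes_mb = {
    "YOLOv8-Nano": 5.9,
    "MobileNetV3-Small": 3.5,
    "Jacobi-DMR": 13.5 / KB_PER_MB,  # 13.5 KB expressed in MB
}

total_mb = sum(component_sizes_mb.values())
classifier_share = component_sizes_mb["Jacobi-DMR"] / total_mb
print(f"total: {total_mb:.3f} MB, classifier share: {classifier_share:.2%}")
```

The total comes to roughly 9.41 MB, consistent with the stated 9.5 MB bound, and the classifier itself accounts for well under 1% of it.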


[90] Earth-o1: A Grid-free Observation-native Atmospheric World Model cs.CVPDF

Junchao Gong, Kaiyi Xu, Wangxu Wei, Siwei Tu, Jingyi Xu

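TL;DR: This paper presents Earth-o1, a grid-free, observation-native atmospheric world model. It learns the continuous three-dimensional physical evolution of the Earth system directly from ungridded multimodal observations, avoiding the interpolation of heterogeneous data onto predefined grids that traditional frameworks require, and enables real-time forecasting and cross-sensor inference without explicit numerical solvers.

Details

Motivation: Traditional atmospheric modeling frameworks force heterogeneous observations onto predefined spatial grids, which limits the full exploitation of raw sensor data and creates severe computational bottlenecks. The paper aims to overcome these structural limitations by learning continuous atmospheric dynamics directly from raw observations.

Result: In hindcast evaluations, Earth-o1 achieves surface forecast skill comparable to the operational Integrated Forecasting System (IFS), indicating that it can match the fidelity of established physical frameworks.

Insight: The core innovation is a fully observation-native paradigm for geophysical simulation. It dispenses with gridding and explicit numerical solvers, integrates diverse sensor inputs into a unified grid-free dynamical field, and learns physical evolution directly from continuous observations, providing a scalable data-driven foundation for a digital twin of the Earth.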

Abstract: Despite the unprecedented volume of multimodal data provided by modern Earth observation systems, our ability to model atmospheric dynamics remains constrained. Traditional modeling frameworks force heterogeneous measurements into predefined spatial grids, inherently limiting the full exploitation of raw sensor data and creating severe computational bottlenecks. Here we present Earth-o1, an observation-native atmospheric world model that overcomes these structural limitations. Rather than relying on conventional atmospheric dynamical modeling systems or traditional data assimilation, Earth-o1 directly learns the continuous, three-dimensional physical evolution of the Earth system from ungridded observational data. By integrating diverse sensor inputs into a unified, grid-free dynamical field, the model autonomously advances the atmospheric state in space and time. We show that this fundamentally distinct paradigm enables direct, real-time forecasting and cross-sensor inference without the overhead of explicit numerical solvers. In hindcast evaluations, Earth-o1 achieves surface forecast skill comparable to the operational Integrated Forecasting System (IFS). These results establish that continuous, observation-driven world models – a new class of fully observation-native geophysical simulators – can match the fidelity of established physical frameworks, providing a scalable data-driven foundation for a digital twin of the Earth.


[91] SwiftI2V: Efficient High-Resolution Image-to-Video Generation via Conditional Segment-wise Generation cs.CVPDF

YaoYang Liu, Yuechen Zhang, Wenbo Li, Yufei Zhao, Rui Liu

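TL;DR: SwiftI2V is an efficient framework for high-resolution image-to-video generation. It uses a two-stage design and Conditional Segment-wise Generation to address the high memory cost and low detail fidelity of generation at 2K resolution.

Details

Motivation: Existing high-resolution image-to-video methods either incur prohibitive memory and latency costs or rely on generic video super-resolution as post-processing, which hallucinates details and drifts from the local structures of the input image.

Result: On VBench-I2V at 2K resolution, SwiftI2V achieves performance comparable to end-to-end baselines while reducing total GPU-time by 202x, enabling practical 2K video generation on a single datacenter or consumer GPU.

Insight: The core innovation is Conditional Segment-wise Generation, which synthesizes video segment by segment to bound the per-step token budget and uses bidirectional contextual interaction within each segment to improve cross-segment coherence and input fidelity, balancing efficiency and generation quality.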

Abstract: High-resolution image-to-video (I2V) generation aims to synthesize realistic temporal dynamics while preserving fine-grained appearance details of the input image. At 2K resolution, it becomes extremely challenging, and existing solutions suffer from various weaknesses: 1) end-to-end models are often prohibitively expensive in memory and latency; 2) cascading low-resolution generation with a generic video super-resolution tends to hallucinate details and drift from input-specific local structures, since the super-resolution stage is not explicitly conditioned on the input image. To this end, we propose SwiftI2V, an efficient framework tailored for high-resolution I2V. Following the widely used two-stage design, it addresses the efficiency–fidelity dilemma by first generating a low-resolution motion reference to reduce token costs and ease the modeling burden, then performing a strongly image-conditioned 2K synthesis guided by the motion to recover input-faithful details with controlled overhead. Specifically, to make generation more scalable, SwiftI2V introduces Conditional Segment-wise Generation (CSG) to synthesize videos segment-by-segment with a bounded per-step token budget, and adopts bidirectional contextual interaction within each segment to improve cross-segment coherence and input fidelity. On VBench-I2V at 2K resolution, SwiftI2V achieves performance comparable to end-to-end baselines while reducing total GPU-time by 202x. Particularly, it enables practical 2K I2V generation on a single datacenter GPU (e.g., H800) or consumer GPU (e.g., RTX 4090).
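The bounded per-step token budget behind segment-wise generation can be illustrated with a small planning helper: given how many tokens one frame costs, it splits the video into consecutive segments that each fit the budget. The function and its parameters are hypothetical; the abstract does not say how SwiftI2V sizes its segments or how conditioning is passed between them.

```python
def plan_segments(num_frames, tokens_per_frame, token_budget):
    """Split frames into consecutive [start, end) segments under a token budget.

    All three arguments are illustrative quantities, not values from the paper.
    """
    if tokens_per_frame > token_budget:
        raise ValueError("a single frame already exceeds the token budget")
    frames_per_segment = token_budget // tokens_per_frame
    return [
        (start, min(start + frames_per_segment, num_frames))
        for start in range(0, num_frames, frames_per_segment)
    ]
```

For instance, `plan_segments(10, 100, 350)` gives `[(0, 3), (3, 6), (6, 9), (9, 10)]`: peak memory is set by the three-frame segment rather than by the full ten-frame clip.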


[92] Continuous-Time Distribution Matching for Few-Step Diffusion Distillation cs.CV | cs.AIPDF

Tao Liu, Hao Yan, Mengting Chen, Taihang Hu, Zhengrong Yue

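TL;DR: This paper proposes Continuous-Time Distribution Matching (CDM), a method for accelerating few-step image generation with diffusion models. It extends the Distribution Matching Distillation (DMD) framework from optimization at discrete timesteps to continuous-time optimization, using a dynamic continuous schedule and a continuous-time alignment objective to improve visual fidelity without complex auxiliary modules.

Details

Motivation: Among few-step diffusion distillation methods, DMD relies on supervision at a few predefined discrete timesteps, and the mode-seeking nature of its reverse KL divergence tends to produce visual artifacts and over-smoothed outputs. Restoring visual quality often requires complex auxiliary modules such as GANs or reward models. The paper targets these limitations of the discrete-time formulation.

Result: Extensive experiments on different architectures, including SD3-Medium and Longcat-Image, show that CDM provides highly competitive visual fidelity for few-step image generation without relying on complex auxiliary objectives.

Insight: The paper migrates the DMD framework to continuous-time optimization for the first time, through two designs. First, a dynamic continuous schedule of random length replaces the fixed discrete schedule, so distribution matching is enforced at arbitrary points along sampling trajectories. Second, a continuous-time alignment objective performs active off-trajectory matching on latents extrapolated via the student's velocity field, improving generalization and preserving fine visual details.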

Abstract: Step distillation has become a leading technique for accelerating diffusion models, among which Distribution Matching Distillation (DMD) and Consistency Distillation are two representative paradigms. While consistency methods enforce self-consistency along the full PF-ODE trajectory to steer it toward the clean data manifold, vanilla DMD relies on sparse supervision at a few predefined discrete timesteps. This restricted discrete-time formulation, together with the mode-seeking nature of the reverse KL divergence, tends to produce visual artifacts and over-smoothed outputs, often necessitating complex auxiliary modules, such as GANs or reward models, to restore visual fidelity. In this work, we introduce Continuous-Time Distribution Matching (CDM), migrating the DMD framework from discrete anchoring to continuous optimization for the first time. CDM achieves this through two continuous-time designs. First, we replace the fixed discrete schedule with a dynamic continuous schedule of random length, so that distribution matching is enforced at arbitrary points along sampling trajectories rather than only at a few fixed anchors. Second, we propose a continuous-time alignment objective that performs active off-trajectory matching on latents extrapolated via the student's velocity field, improving generalization and preserving fine visual details. Extensive experiments on different architectures, including SD3-Medium and Longcat-Image, demonstrate that CDM provides highly competitive visual fidelity for few-step image generation without relying on complex auxiliary objectives. Code is available at https://github.com/byliutao/cdm.
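The first design, a dynamic continuous schedule of random length, can be pictured as sampling both the number of steps and the intermediate time points afresh for each training trajectory. The sketch below is one simple way to do that and is purely illustrative: the uniform distributions, the step range, and the convention that time runs from 1 (noise) to 0 (data) are assumptions, not details taken from the paper or its code.

```python
import numpy as np

def sample_continuous_schedule(rng, min_steps=2, max_steps=8):
    """Sample a decreasing time schedule with a random number of steps.

    Returns an array starting at 1.0 and ending at 0.0 with randomly placed
    interior points, so supervision is not tied to fixed anchor timesteps.
    """
    num_steps = int(rng.integers(min_steps, max_steps + 1))
    # Interior time points are drawn from a continuous range, then ordered.
    interior = np.sort(rng.uniform(0.0, 1.0, size=num_steps - 1))[::-1]
    return np.concatenate([[1.0], interior, [0.0]])
```

Over many draws the interior points cover the whole interval, in contrast to a fixed few-step schedule that always visits the same timesteps.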


[93] Reconstruction or Semantics? What Makes a Latent Space Useful for Robotic World Models cs.CV | cs.LG | cs.RO

Nilaksh, Saurav Jha, Artem Zholus, Sarath Chandar

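TL;DR: This paper systematically evaluates how different latent spaces (reconstruction encoders versus semantic encoders) perform in robotic world models based on latent diffusion modeling (LDM). Comparing six encoders under a fixed protocol on the BridgeV2 dataset, the study finds that visual fidelity alone is insufficient for selecting the best world model, and that semantic encoders such as V-JEPA 2.1 perform better on planning, downstream policy performance, and latent representation quality. It argues that semantic latent spaces are a stronger foundation for policy-relevant robotic diffusion world models.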

Details

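Motivation: As action-conditioned video diffusion models increasingly adopt latent diffusion modeling (LDM), choosing the right latent space becomes critical. The current mainstream uses autoencoder latent spaces trained for pixel reconstruction (such as VAEs), but recent work suggests that semantically aligned latent spaces from pretrained encoders may be more beneficial. This paper systematically evaluates these latent spaces for action-conditioned LDM in robotic world models.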

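Result: On the BridgeV2 dataset, world model variants are trained under a fixed protocol to compare six reconstruction and semantic encoders. Reconstruction encoders such as VAE and Cosmos achieve strong pixel-level scores, but semantic encoders such as V-JEPA 2.1, Web-DINO, and SigLIP 2 are generally better at all model scales on the other two evaluation axes (planning and downstream policy performance, and latent representation quality), with V-JEPA 2.1 the strongest overall on policy performance.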

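Insight: The paper proposes three axes for assessing robotic world model performance (visual fidelity, planning and downstream policy performance, and latent representation quality) and systematically argues that semantic latent spaces outperform traditional reconstruction latent spaces on policy-relevant tasks. Viewed objectively, the study provides an encoder-selection basis and an evaluation framework for building more effective robotic world models.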

Abstract: World model-based policy evaluation is a practical proxy for testing real-world robot control by rolling out candidate actions in action-conditioned video diffusion models. As these models increasingly adopt latent diffusion modeling (LDM), choosing the right latent space becomes critical. While the status quo uses autoencoding latent spaces like VAEs that are primarily trained for pixel reconstruction, recent work suggests benefits from pretrained encoders with representation-aligned semantic latent spaces. We systematically evaluate these latent spaces for action-conditioned LDM by comparing six reconstruction and semantic encoders to train world model variants under a fixed protocol on the BridgeV2 dataset, and show effective world model training in high-dimensional representation spaces with and without dimension compression. We then propose three axes to assess robotic world model performance: visual fidelity, planning and downstream policy performance, and latent representation quality. Our results show that visual fidelity alone is insufficient for world model selection. While reconstruction encoders like VAE and Cosmos achieve strong pixel-level scores, semantic encoders such as V-JEPA 2.1 (strongest overall on policy), Web-DINO, and SigLIP 2 generally excel across the other two axes at all model scales. Our study advocates semantic latent space as a stronger foundation for policy-relevant robotics diffusion world models.


[94] GeoStack: A Framework for Quasi-Abelian Knowledge Composition in VLMs cs.CV

Pranav Mantini, Shishir K. Shah

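TL;DR: This paper proposes GeoStack, a framework that addresses the catastrophic forgetting caused by composing multi-domain knowledge in vision-language models (VLMs). Through geometric constraints and a modular design, it composes independently trained domain experts into a unified model with constant-time inference complexity.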

Details

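Motivation: To address the catastrophic forgetting that arises when VLMs accumulate knowledge across domains or tasks, and to enable efficient long-term knowledge composition.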

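Result: In multi-domain adaptation and class-incremental learning experiments, GeoStack effectively mitigates catastrophic forgetting and achieves O(1) inference complexity.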

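Insight: The novelty lies in a modular geometric-stacking framework that protects the base model's knowledge through geometric and structural constraints on the adapter manifold and uses weight folding to achieve constant-time inference, offering a scalable solution for knowledge composition in VLMs.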

Abstract: We address the challenge of knowledge composition in Vision-Language Models (VLMs), where accumulating expertise across multiple domains or tasks typically leads to catastrophic forgetting. We introduce GeoStack (Geometric Stacking), a modular framework that allows independently trained domain experts to be composed into a unified model. By imposing geometric and structural constraints on the adapter manifold, GeoStack ensures the foundational knowledge of the base model is preserved. Furthermore, we mathematically demonstrate a weight-folding property that achieves constant-time inference complexity ($O(1)$), regardless of the number of integrated experts. Experimental results across multi-domain adaptation and class-incremental learning show that GeoStack provides an efficient mechanism for long-term knowledge composition while significantly mitigating catastrophic forgetting. Code is available at https://github.com/QuantitativeImagingLaboratory/GeoStack.
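
The constant-time claim can be illustrated with a minimal sketch. It assumes, purely for illustration, that each expert contributes an additive update to a shared linear layer; the abstract does not specify GeoStack's adapter form or its geometric constraints, and the names `matvec`, `add`, and `fold` are hypothetical.

```python
# Illustrative only: assumes each expert adds an update delta_k to a shared
# linear layer. GeoStack's actual adapter form is not given in the abstract.

def matvec(m, v):
    """Multiply matrix m (a list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def add(m1, m2):
    """Element-wise sum of two equally shaped matrices."""
    return [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(m1, m2)]

def fold(base, deltas):
    """Fold every expert update into one matrix, done once offline."""
    folded = base
    for d in deltas:
        folded = add(folded, d)
    return folded

base = [[1.0, 0.0], [0.0, 1.0]]
deltas = [[[0.5, 0.0], [0.0, 0.0]], [[0.0, 0.25], [0.0, 0.0]]]
x = [2.0, 4.0]

# Unfolded inference: one extra matrix-vector product per expert.
unfolded = matvec(base, x)
for d in deltas:
    unfolded = [u + v for u, v in zip(unfolded, matvec(d, x))]

# Folded inference: a single product, independent of the number of experts.
folded_out = matvec(fold(base, deltas), x)
```

Under this additive assumption the folded and unfolded paths give the same output, while the folded path costs the same no matter how many experts were integrated.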


[95] 3D MRI Image Pretraining via Controllable 2D Slice Navigation Task cs.CV | cs.AI

Yu Wang, Qingchao Chen

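TL;DR: This paper proposes a novel self-supervised pretraining method for 3D MRI images that converts 3D volumes into controllable 2D rendered sequences (that is, video-action sequences) and introduces an action-conditioned pretraining objective to learn anatomical and spatial representations.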

Details

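Motivation: Most existing self-supervised pretraining methods treat an MRI scan as a static aggregation of slices, patches, or volumes and do not exploit the intrinsic continuity of the 3D volume. This paper explores an intrinsic self-supervision signal, distinct from masked reconstruction, based on a controllable 2D slice navigation task.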

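Result: On representative anatomical and spatial downstream tasks, the proposed pretraining is compared against standard static-volume baselines, tokenizer-only pretraining, and dynamics variants without aligned actions. The results suggest that it provides a useful complementary interface for learning MRI representations.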

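Insight: The novelty lies in treating a 3D MRI volume as a sequence of 2D slices controlled by continuous position, orientation, and scale parameters, which yields an action-conditioned latent dynamics model pretraining framework and offers a new perspective on learning continuous spatial and structural information from large-scale unlabeled MRI data.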

Abstract: Self-supervised pretraining has become the mainstream approach for learning MRI representations from unlabeled scans. However, most existing objectives still treat each scan primarily as static aggregations of slices, patches or volumes. We ask whether there exists an intrinsic form of self-supervision signal that is different from reconstructing the masked patches, through transforming the 3D volumes into controllable 2D rendered sequences: by rendering slices at continuous positions, orientations, and scales, a 3D volume can be converted into dense video-action sequences whose controls are the action trajectories. We study this formulation with an action-conditioned pretraining objective, where a tokenizer encodes slice observations and a latent dynamics model predicts the evolution of latent features. Across representative anatomical and spatial downstream tasks, the proposed pretraining is evaluated against standard static-volume baselines, tokenizer-only pretraining, and dynamics variants without aligned actions. These results suggest that controllable MRI slice navigation provides a useful complementary pretraining interface for learning anatomical and spatial representations from large unlabeled MRI collections.


[96] MARBLE: Multi-Aspect Reward Balance for Diffusion RL cs.CV | cs.LG

Canyu Zhao, Hao Chen, Yunze Tong, Yu Qiao, Jiacheng Li

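TL;DR: This paper proposes MARBLE (Multi-Aspect Reward Balance), a framework for multi-reward optimization in reinforcement learning fine-tuning of diffusion models. Conventional weighted-sum reward aggregation suffers from a sample-level mismatch that dilutes the supervision signal. MARBLE maintains an independent advantage estimator for each reward, computes per-reward policy gradients, and harmonizes them in gradient space into a single update direction by solving a quadratic programming problem, with no manual tuning of reward weights.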

Details

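Motivation: When handling multi-dimensional evaluation criteria, existing reinforcement learning fine-tuning methods for aligning diffusion models with human preferences typically train multiple specialist models, optimize a weighted-sum reward, or fine-tune with a hand-crafted stage schedule. These approaches either fail to produce a unified model or require heavy manual tuning. The core problem is that weighted-sum reward aggregation suffers from a sample-level mismatch, which dilutes the supervision signal of specialist samples.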

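Result: On SD3.5 Medium with five rewards, MARBLE improves all five reward dimensions simultaneously, turns the worst-aligned reward's gradient cosine similarity from negative in 80% of mini-batches under weighted summation to consistently positive, and trains at 0.97x the speed of baseline training.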

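Insight: The novelty lies in a gradient-space framework for harmonizing multiple rewards, which achieves automatic reward balancing through independent advantage estimation and a quadratic program. The paper further proposes an amortized formulation and EMA smoothing, which reduce the per-step cost from K+1 backward passes to nearly that of the single-reward baseline and stabilize updates against transient fluctuations. The method avoids manual weight tuning and enables joint multi-reward optimization.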

Abstract: Reinforcement learning fine-tuning has become the dominant approach for aligning diffusion models with human preferences. However, assessing images is intrinsically a multi-dimensional task, and multiple evaluation criteria need to be optimized simultaneously. Existing practice deals with multiple rewards by training one specialist model per reward, optimizing a weighted-sum reward $R(x)=\sum_k w_k R_k(x)$, or sequentially fine-tuning with a hand-crafted stage schedule. These approaches either fail to produce a unified model that can be jointly trained on all rewards or necessitate heavy, manually tuned sequential training. We find that the failure stems from using a naive weighted-sum reward aggregation. This approach suffers from a sample-level mismatch because most rollouts are specialist samples, highly informative for certain reward dimensions but irrelevant for others; consequently, weighted summation dilutes their supervision. To address this issue, we propose MARBLE (Multi-Aspect Reward BaLancE), a gradient-space optimization framework that maintains independent advantage estimators for each reward, computes per-reward policy gradients, and harmonizes them into a single update direction without manually tuned reward weighting, by solving a Quadratic Programming problem. We further propose an amortized formulation that exploits the affine structure of the loss used in DiffusionNFT, to reduce the per-step cost from K+1 backward passes to near single-reward baseline cost, together with EMA smoothing on the balancing coefficients to stabilize updates against transient single-batch fluctuations. On SD3.5 Medium with five rewards, MARBLE improves all five reward dimensions simultaneously, turns the worst-aligned reward’s gradient cosine from negative under weighted summation in 80% of mini-batches to consistently positive, and runs at 0.97x the training speed of baseline training.
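
As a rough illustration of harmonizing per-reward gradients with a quadratic program, the sketch below solves a min-norm problem over the probability simplex with Frank-Wolfe, in the style of multiple-gradient descent. This is an assumption made for illustration: the abstract does not state MARBLE's exact QP objective or constraints, and the function names and toy gradients are hypothetical.

```python
# Illustrative min-norm QP solved with Frank-Wolfe. MARBLE's actual QP is not
# specified in the abstract; this only shows the general idea of replacing
# fixed reward weights with weights computed from the gradients themselves.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def combine(grads, alphas):
    """Weighted sum of gradient vectors."""
    dim = len(grads[0])
    return [sum(a * g[i] for a, g in zip(alphas, grads)) for i in range(dim)]

def min_norm_weights(grads, iters=200):
    """Minimize ||sum_k alpha_k g_k||^2 over the probability simplex."""
    k = len(grads)
    alphas = [1.0 / k] * k
    for _ in range(iters):
        d = combine(grads, alphas)
        # Frank-Wolfe vertex: the gradient with the smallest inner product with d.
        j = min(range(k), key=lambda i: dot(grads[i], d))
        diff = [gj - di for gj, di in zip(grads[j], d)]
        denom = dot(diff, diff)
        if denom == 0.0:
            break
        # Exact line search for the quadratic objective, clipped to [0, 1].
        step = max(0.0, min(1.0, -dot(d, diff) / denom))
        alphas = [(1.0 - step) * a for a in alphas]
        alphas[j] += step
    return alphas

# Two toy per-reward gradients that conflict with each other.
grads = [[1.0, 0.0], [-3.0, 1.0]]
naive = combine(grads, [0.5, 0.5])   # equal-weight sum
alphas = min_norm_weights(grads)
balanced = combine(grads, alphas)    # min-norm combination
```

In this toy case the equal-weight sum has a negative inner product with the first gradient, whereas the min-norm combination has a positive inner product with both, which is the kind of conflict resolution the abstract describes in terms of gradient cosine.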


[97] FreeSpec: Training-Free Long Video Generation via Singular-Spectrum Reconstruction cs.CV

Fangda Chen, Shanshan Zhao, Longrong Yang, Chuanfu Xu, Zhigang Luo

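TL;DR: This paper proposes FreeSpec, a training-free framework for long-video generation that targets the content drift, temporal inconsistency, and over-smoothed dynamics that video diffusion models exhibit when generating long videos. The method analyzes the problem from a singular-spectrum perspective: it decomposes global and local features with singular value decomposition, treats them as low-rank spectral guidance and a high-rank reconstruction basis respectively, and fuses them at the spectrum level, preserving long-range consistency while better retaining spatial details and temporal dynamics.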

Details

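Motivation: Existing training-free long-video generation methods usually improve temporal consistency by combining a global branch with a local branch, but they further decompose appearance consistency and temporal dynamics within each branch using predefined criteria. This assignment is unreliable when appearance and action progression are tightly coupled, as in camera motion and sequential motion. The paper aims to address the problems caused by this rigid feature partitioning.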

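Result: Experiments on Wan2.1 and LTX-Video show that FreeSpec improves long-video generation, especially temporal dynamics, while maintaining strong visual quality and temporal consistency.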

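Insight: The novelty lies in analyzing the temporal extension of video from a singular-spectrum perspective, showing that enlarged self-attention windows induce spectral concentration, and proposing an SVD-based spectrum-level fusion framework that avoids the rigid feature partitioning of earlier decomposition rules. Viewed objectively, using global features as low-rank spectral guidance and local features as a high-rank reconstruction basis offers a new and more flexible spectral approach to feature fusion.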

Abstract: Video diffusion models perform well in short-video synthesis, but their training-free extension to long videos often suffers from content drift, temporal inconsistency, and over-smoothed dynamics. Existing methods improve temporal consistency by combining a global branch with a local branch, but they often further decompose appearance consistency and temporal dynamics within each branch using predefined criteria. This assignment is unreliable when appearance and action progression are tightly coupled, such as in camera motion and sequential motion. We analyze the video temporal extension issue from a singular-spectrum perspective and show that enlarged self-attention windows induce spectral concentration: spectral energy becomes dominated by a few low-rank singular directions, preserving coarse structure but suppressing high-rank spatial details and motion-rich temporal variations. To mitigate this problem, we propose FreeSpec, a training-free spectral reconstruction framework for long-video generation. FreeSpec decomposes global and local features with singular value decomposition, and uses the global branch as low-rank spectral guidance and the local branch as a high-rank reconstruction basis. This spectrum-level fusion avoids the rigid feature partitioning of previous decomposition rules, preserving long-range consistency while better retaining spatial details and temporal dynamics. Experiments on Wan2.1 and LTX-Video demonstrate that FreeSpec improves long-video generation, especially for temporal dynamics, while maintaining strong visual quality and temporal consistency. Project demo: https://fdchen24.github.io/FreeSpec-Website/.


[98] Sparkle: Realizing Lively Instruction-Guided Video Background Replacement via Decoupled Guidance cs.CV | cs.AI

Ziyun Zeng, Yiqi Lin, Guoqiang Liang, Mike Zheng Shou

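TL;DR: This paper introduces the Sparkle dataset and benchmark to address the scarcity of high-quality training data for video background replacement. A data generation method with decoupled foreground and background guidance substantially improves the liveliness and temporal consistency of background replacement.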

Details

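Motivation: Current public datasets focus mainly on local editing or style transfer, whereas background replacement requires synthesizing entirely new, temporally consistent scenes. High-quality training data for this task is scarce, so existing models perform poorly.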

Result: Experiments on OpenVE-Bench and Sparkle-Bench show that the model trained on the Sparkle dataset performs substantially better than all existing baselines.

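Insight: The novelty lies in a scalable data generation pipeline that produces foreground and background guidance in a decoupled manner and applies strict quality filtering, addressing the data-quality bottleneck in background replacement.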

Abstract: In recent years, open-source efforts like Senorita-2M have propelled video editing toward natural language instruction. However, current publicly available datasets predominantly focus on local editing or style transfer, which largely preserve the original scene structure and are easier to scale. In contrast, Background Replacement, a task central to creative applications such as film production and advertising, requires synthesizing entirely new, temporally consistent scenes while maintaining accurate foreground-background interactions, making large-scale data generation significantly more challenging. Consequently, this complex task remains largely underexplored due to a scarcity of high-quality training data. This gap is evident in poorly performing state-of-the-art models, e.g., Kiwi-Edit, because the primary open-source dataset that contains this task, i.e., OpenVE-3M, frequently produces static, unnatural backgrounds. In this paper, we trace this quality degradation to a lack of precise background guidance during data synthesis. Accordingly, we design a scalable pipeline that generates foreground and background guidance in a decoupled manner with strict quality filtering. Building on this pipeline, we introduce Sparkle, a dataset of ~140K video pairs spanning five common background-change themes, alongside Sparkle-Bench, the largest evaluation benchmark tailored for background replacement to date. Experiments demonstrate that our dataset and the model trained on it achieve substantially better performance than all existing baselines on both OpenVE-Bench and Sparkle-Bench. Our proposed dataset, benchmark, and model are fully open-sourced at https://showlab.github.io/Sparkle/.


[99] MedHorizon: Towards Long-context Medical Video Understanding in the Wild cs.CV

Bodong Du, Bowen Liu, Yang Yu, Xinpeng Ding, Zhiheng Wu

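TL;DR: This paper introduces MedHorizon, an in-the-wild benchmark for evaluating long-context medical video understanding. It contains 759 hours of full-length clinical procedure videos and 1,253 evidence-grounded multiple-choice questions that jointly evaluate sparse evidence understanding and multi-hop clinical reasoning. The best existing model reaches only 41.1% accuracy on the benchmark, showing that current systems remain far from robust full-procedure understanding.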

Details

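Motivation: Existing medical multimodal large language models (MLLMs) focus mainly on image understanding and short-video analysis, whereas real clinical review often requires understanding full procedure videos. Medical procedure videos contain highly redundant anatomical views, while decisive evidence is temporally sparse, spatially subtle, and context dependent. Existing benchmarks usually assume that evidence has already been localized through images, short clips, or pre-segmented videos, leaving the retrieval-before-reasoning problem under-tested.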

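Result: Representative general-domain, medical-domain, and long-video MLLMs are evaluated on MedHorizon. The best model reaches only 41.1% accuracy, far from robust. Analysis shows that performance does not improve reliably with more input frames, and that evidence retrieval and clinical interpretation are the primary bottlenecks.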

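Insight: The novelty lies in a real-world benchmark for long medical video understanding that focuses on retrieval before reasoning. Its evidence is extremely sparse (only 0.166% of frames are evidence frames on average), forcing models to search noisy procedural streams for decisive information. Viewed objectively, the core insight is that it exposes the fundamental bottlenecks of current MLLMs in long medical video understanding: weak procedural reasoning, attention drift under redundancy, and the limited ability of generic sampling methods to balance local detail with global coverage.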

Abstract: Medical multimodal large language models (MLLMs) have advanced image understanding and short-video analysis, but real clinical review often requires full-procedure video understanding. Unlike general long videos, medical procedures contain highly redundant anatomical views, while decisive evidence is temporally sparse, spatially subtle, and context dependent. Existing benchmarks often assume this evidence has already been localized through images, short clips, or pre-segmented videos, leaving the retrieval-before-reasoning problem under-tested. We introduce MedHorizon, an in-the-wild benchmark for long-context medical video understanding. MedHorizon preserves 759 hours of full-length clinical procedures and provides 1,253 evidence-grounded multiple-choice questions that jointly evaluate sparse evidence understanding and multi-hop clinical reasoning. Its evidence is extremely sparse, with only 0.166% evidence frames on average, requiring models to search noisy procedural streams before interpreting and aggregating findings. We evaluate representative general-domain, medical-domain, and long-video MLLMs. The best model reaches only 41.1% accuracy, showing that current systems remain far from robust full-procedure understanding. Further analysis yields four key findings: performance does not scale reliably with more frames; evidence retrieval and clinical interpretation remain primary bottlenecks; these bottlenecks are rooted in weak procedural reasoning and attention drift under redundancy; and generic sampling methods only partially balance local detail with global coverage. MedHorizon provides a rigorous testbed for MLLMs that retrieve sparse evidence and reason over complete clinical workflows.


[100] DINORANKCLIP: DINOv3 Distillation and Injection for Vision-Language Pretraining with High-Order Ranking Consistency cs.CV | cs.AI | cs.LG

Shuyang Jiang, Nan Yu, Yiming Zhang, Zenghui Ding, Zhenyu Wu

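TL;DR: DINORANKCLIP is a vision-language pretraining framework that injects a frozen DINOv3 teacher to strengthen CLIP's sensitivity to local structure, and introduces a high-order Plackett-Luce ranking-consistency loss to recover the relative ordering among unmatched in-batch pairs that contrastive learning discards.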

Details

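Motivation: To address two structural weaknesses of CLIP: the symmetric InfoNCE loss discards the relative ordering among unmatched in-batch pairs, and global pooling makes the visual representation insensitive to fine-grained local structure.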

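Result: Trained on Conceptual Captions 3M under matched compute, DINORANKCLIP consistently outperforms CLIP, CyCLIP, ALIP, and RANKCLIP on fine-grained and out-of-distribution evaluations; the optimal ranking order is $R^*=3$.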

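Insight: The contributions include injecting DINOv3 knowledge through a dual-branch lightweight student and a multi-scale fusion module to strengthen local perception, and a high-order Plackett-Luce ranking model that contains CLIP and RANKCLIP as its zero-order and first-order special cases.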

Abstract: Contrastive language-image pretraining (CLIP) suffers from two structural weaknesses: the symmetric InfoNCE loss discards the relative ordering among unmatched in-batch pairs, and global pooling collapses the visual representation into a semantic bottleneck that is poorly sensitive to fine-grained local structure. RANKCLIP partially addresses the first issue with a list-wise Plackett-Luce ranking-consistency loss, but its model is strictly first-order and inherits the second weakness untouched. We propose DINORANKCLIP, a pretraining framework that addresses both jointly. Our principal contribution is injecting a frozen DINOv3 teacher into the contrastive trunk through a dual-branch lightweight student and a multi-scale fusion module with channel-spatial attention, a self-attention refiner, and a conflict-aware gate that preserves the cross-modal alignment up to first order. Complementarily, we introduce a high-order Plackett-Luce ranking model in which the per-position utility is augmented with attention-parameterised pairwise and tuple-wise transition terms; the family contains CLIP and RANKCLIP as nested zero-order and first-order special cases, and the optimal order on every benchmark is $R^*=3$. The full empirical study – order sweep, Fine-grained Probe on five datasets, four-node Modality-Gap analysis, six-variant Fusion ablation – fits in 72 hours on a single eight-GPU H100 node and trains entirely on Conceptual Captions 3M. DINORANKCLIP consistently outperforms CLIP, CyCLIP, ALIP, and RANKCLIP under matched compute, with the largest relative gains on the fine-grained and out-of-distribution evaluations that most directly stress local structural reasoning.
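
For readers unfamiliar with the Plackett-Luce model that the ranking loss builds on, the sketch below computes the standard Plackett-Luce log-probability of a ranking from per-item scores. It does not reproduce the attention-parameterised pairwise and tuple-wise transition terms of the paper's high-order model; the scores and the function name are hypothetical.

```python
import itertools
import math

def plackett_luce_log_prob(scores, ranking):
    """Log-probability of `ranking` (best item first) under standard Plackett-Luce.

    Items are drawn one at a time; at each position the chosen item competes
    only against the items that have not been placed yet.
    """
    log_p = 0.0
    remaining = list(ranking)
    for item in ranking:
        log_norm = math.log(sum(math.exp(scores[j]) for j in remaining))
        log_p += scores[item] - log_norm
        remaining.remove(item)
    return log_p

# Hypothetical similarity scores for three in-batch candidates.
scores = [2.0, 1.0, 0.0]
log_probs = {
    perm: plackett_luce_log_prob(scores, perm)
    for perm in itertools.permutations(range(len(scores)))
}
total = sum(math.exp(lp) for lp in log_probs.values())
most_likely = max(log_probs, key=log_probs.get)
```

The probabilities of all orderings sum to one, and the ordering by descending score is the most likely one. A list-wise ranking loss would take the negative of this log-probability for a reference ordering.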


[101] DPM++: Dynamic Masked Metric Learning for Occluded Person Re-identification cs.CV

Lei Tan, Yingshi Luan, Pincong Zou, Pingyang Dai, Liujuan Cao

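TL;DR: This paper proposes DPM++, a dynamic masked metric learning framework for occluded person re-identification. It learns an input-adaptive masked metric that dynamically selects reliable identity subspaces, emphasizing visibility-consistent evidence and suppressing unreliable components. Combined with a CLIP-based two-stage supervision scheme and a saliency-guided patch transfer strategy that synthesizes photo-realistic occluded samples, it outperforms previous state-of-the-art methods on both occluded and holistic person re-identification benchmarks.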

Details

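Motivation: Occlusion caused by obstacles remains unsolved in real applications, and existing methods lack a unified framework that learns robust visibility-consistent matching under realistic occlusion patterns.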

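Result: On occluded and holistic person re-identification benchmarks, DPM++ consistently outperforms previous state-of-the-art methods in both holistic and occlusion scenarios.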

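Insight: The contributions include dynamic masked metric learning, a CLIP-based two-stage supervision scheme that transfers semantic priors from the text branch into the classifier-prototype space, and a saliency-guided patch transfer strategy that synthesizes controllable, photo-realistic occluded samples, strengthening robustness to realistic partial observations.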

Abstract: Although person re-identification has made impressive progress, occlusion caused by obstacles remains an unsettled issue in real applications. The difficulty lies in the mismatch between incomplete occluded samples and holistic identity representations. Severe occlusion removes discriminative body cues and introduces interference from background clutter and occluders, making global metric learning unreliable. Existing methods mainly rely on extra pre-trained models to estimate visible parts for alignment or construct occluded samples via data augmentation, but still lack a unified framework that learns robust visibility-consistent matching under realistic occlusion patterns. In this paper, we propose DPM++, a Dynamic Masked Metric Learning framework for occluded person re-identification. DPM++ learns an input-adaptive masked metric that dynamically selects reliable identity subspaces for each occluded instance, enabling matching to emphasize visibility-consistent evidence while suppressing unreliable components. Built upon the classifier-prototype space, DPM++ introduces a CLIP-based two-stage supervision scheme, where ID-level semantic priors are learned from the text branch and transferred into the classifier-prototype space for dynamic masked matching. To strengthen the masked metric, we introduce a saliency-guided patch transfer strategy to synthesize controllable and photo-realistic occluded samples during training. Exploiting real scene priors, this strategy exposes the model to realistic partial observations and provides richer supervision than random erasing. In addition, occlusion-aware sample pairing and mask-guided optimization improve the stability and effectiveness of the framework. Experiments on occluded and holistic person re-identification benchmarks show that DPM++ consistently outperforms previous state-of-the-art methods in both holistic and occlusion scenarios.


[102] Are We Making Progress in Multimodal Domain Generalization? A Comprehensive Benchmark Study cs.CV | cs.AI | cs.LG | cs.MM

Hao Dong, Hongzhao Li, Shupan Li, Muhammad Haris Khan, Eleni Chatzi

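TL;DR: This paper introduces MMDG-Bench, the first unified and comprehensive benchmark for Multimodal Domain Generalization (MMDG), to address the difficulty of measuring real progress when evaluation protocols are inconsistent. The benchmark covers three tasks, six datasets, six modality combinations, and nine representative methods, and systematically evaluates standard accuracy, robustness to input corruption, missing-modality generalization, and model trustworthiness. Training 7,402 neural networks across 95 cross-domain tasks yields five key findings, which show that current MMDG methods deliver limited practical improvement and face many open challenges.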

Details

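Motivation: MMDG research lacks standardized evaluation, so reported gains may stem from inconsistent experimental settings rather than algorithmic progress. Existing benchmarks also focus mainly on action recognition and neglect real-world challenges such as input corruption, missing modalities, and model trustworthiness.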

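Result: Large-scale experiments on MMDG-Bench show that, under fair comparison, recent specialized MMDG methods improve only marginally over the ERM baseline; no single method consistently leads across datasets or modality combinations; a substantial gap to upper-bound performance remains; trimodal fusion does not consistently outperform the strongest bimodal configurations; and all methods degrade significantly under corruption and missing-modality scenarios, with some further compromising model trustworthiness.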

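Insight: The core contribution is MMDG-Bench, a standardized and comprehensive benchmark that unifies evaluation protocols and extends the evaluation dimensions (such as robustness and trustworthiness), giving the field a reliable tool for measuring progress. Viewed objectively, its large-scale empirical analysis exposes the limitations of current MMDG research and underlines the need for future work to address more realistic challenges and adopt more unified evaluation.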

Abstract: Despite the growing popularity of Multimodal Domain Generalization (MMDG) for enhancing model robustness, it remains unclear whether reported performance gains reflect genuine algorithmic progress or are artifacts of inconsistent evaluation protocols. Current research is fragmented, with studies varying significantly across datasets, modality configurations, and experimental settings. Furthermore, existing benchmarks focus predominantly on action recognition, often neglecting critical real-world challenges such as input corruptions, missing modalities, and model trustworthiness. This lack of standardization obscures a reliable assessment of the field’s advancement. To address this issue, we introduce MMDG-Bench, the first unified and comprehensive benchmark for MMDG, which standardizes evaluation across six datasets spanning three diverse tasks: action recognition, mechanical fault diagnosis, and sentiment analysis. MMDG-Bench encompasses six modality combinations, nine representative methods, and multiple evaluation settings. Beyond standard accuracy, it systematically assesses corruption robustness, missing-modality generalization, misclassification detection, and out-of-distribution detection. With 7,402 neural networks trained in total across 95 unique cross-domain tasks, MMDG-Bench yields five key findings: (1) under fair comparisons, recent specialized MMDG methods offer only marginal improvements over the ERM baseline; (2) no single method consistently outperforms others across datasets or modality combinations; (3) a substantial gap to upper-bound performance persists, indicating that MMDG remains far from solved; (4) trimodal fusion does not consistently outperform the strongest bimodal configurations; and (5) all evaluated methods exhibit significant degradation under corruption and missing-modality scenarios, with some methods further compromising model trustworthiness.


[103] Relit-LiVE: Relight Video by Jointly Learning Environment Video cs.CV

Weiqing Xiao, Hong Li, Xiuyu Yang, Houyuan Chen, Wenyi Li

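TL;DR: This paper presents Relit-LiVE, a novel video relighting framework that requires no prior knowledge of camera pose. It explicitly introduces raw reference images into the rendering process and jointly predicts the relit video and per-frame environment maps, producing physically consistent and temporally stable results.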

Details

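Motivation: Existing relighting methods based on video diffusion models and intrinsic decomposition often produce distorted appearances, broken materials, and temporal artifacts because intrinsic decomposition is unreliable on real-world videos. This paper aims to address these problems.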

Result: On synthetic and real-world benchmarks, Relit-LiVE consistently outperforms state-of-the-art video relighting and neural rendering methods.

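Insight: The core novelty is to recover lost scene cues by introducing raw reference images, and to use a single diffusion process that jointly predicts an environment video, which enforces geometric-illumination alignment, supports dynamic lighting and camera motion, and improves physical consistency.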

Abstract: Recent advances have shown that large-scale video diffusion models can be repurposed as neural renderers by first decomposing videos into intrinsic scene representations and then performing forward rendering under novel illumination. While promising, this paradigm fundamentally relies on accurate intrinsic decomposition, which remains highly unreliable for real-world videos and often leads to distorted appearances, broken materials, and accumulated temporal artifacts during relighting. In this work, we present Relit-LiVE, a novel video relighting framework that produces physically consistent, temporally stable results without requiring prior knowledge of camera pose. Our key insight is to explicitly introduce raw reference images into the rendering process, enabling the model to recover critical scene cues that are inevitably lost or corrupted in intrinsic representations. Furthermore, we propose a novel environment video prediction formulation that simultaneously generates relit videos and per-frame environment maps aligned with each camera viewpoint in a single diffusion process. This joint prediction enforces strong geometric-illumination alignment and naturally supports dynamic lighting and camera motion, significantly improving physical consistency in video relighting while easing the requirement of known per-frame camera pose. Extensive experiments demonstrate that Relit-LiVE consistently outperforms state-of-the-art video relighting and neural rendering methods across synthetic and real-world benchmarks. Beyond relighting, our framework naturally supports a wide range of downstream applications, including scene-level rendering, material editing, object insertion, and streaming video relighting. The project is available at https://github.com/zhuxing0/Relit-LiVE.


[104] ActCam: Zero-Shot Joint Camera and 3D Motion Control for Video Generation cs.CV | cs.AI | cs.LG

Omar El Khalifi, Thomas Rossi, Oscar Fossey, Thibault Fouque, Ulysse Mizrahi

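TL;DR: ActCam is a zero-shot video generation method that jointly controls character motion and camera trajectory: it transfers character motion from a driving video into a new scene and enables per-frame control of intrinsic and extrinsic camera parameters. Built on a pretrained diffusion model, it generates geometrically consistent video from pose and depth conditions and uses a two-phase conditioning schedule to balance structural constraints against detail generation.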

Details

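Motivation: Video generation for artistic applications requires fine-grained control over both performance (the character's motion) and cinematography (the camera trajectory). Existing methods struggle to control both consistently, and ActCam targets this joint-control challenge.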

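Result: On multiple benchmarks spanning diverse character motions and challenging viewpoint changes, ActCam improves camera adherence and motion fidelity over pose-only control and other pose-and-camera methods, and is preferred in human evaluations, especially under large viewpoint changes.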

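Insight: The novelty lies in a training-free zero-shot method that achieves joint fine-grained control of camera parameters and character motion through geometrically consistent pose and depth conditions and a two-phase conditioning schedule (early steps combine pose and sparse depth to constrain structure; later steps use pose-only guidance to refine details), which avoids over-constraining the generation.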

Abstract: For artistic applications, video generation requires fine-grained control over both performance and cinematography, i.e., the actor’s motion and the camera trajectory. We present ActCam, a zero-shot method for video generation that jointly transfers character motion from a driving video into a new scene and enables per-frame control of intrinsic and extrinsic camera parameters. ActCam builds on any pretrained image-to-video diffusion model that accepts conditioning in terms of scene depth and character pose. Given a source video with a moving character and a target camera motion, ActCam generates pose and depth conditions that remain geometrically consistent across frames. We then run a single sampling process with a two-phase conditioning schedule: early denoising steps condition on both pose and sparse depth to enforce scene structure, after which depth is dropped and pose-only guidance refines high-frequency details without over-constraining the generation. We evaluate ActCam on multiple benchmarks spanning diverse character motions and challenging viewpoint changes. We find that, compared to pose-only control and other pose and camera methods, ActCam improves camera adherence and motion fidelity, and is preferred in human evaluations, especially under large viewpoint changes. Our results highlight that careful camera-consistent conditioning and staged guidance can enable strong joint camera and motion control without training. Project page: https://elkhomar.github.io/actcam/.
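
The two-phase conditioning schedule can be pictured as a simple switch over denoising steps. The sketch below is hypothetical: the abstract gives no switch point, so `depth_fraction` and the function name are assumptions chosen for illustration, and no diffusion model is involved.

```python
def conditioning_for_step(step, total_steps, depth_fraction=0.4):
    """Return the set of conditions active at a given denoising step.

    depth_fraction is a hypothetical switch point; the abstract does not
    state when depth conditioning is dropped.
    """
    if not 0 <= step < total_steps:
        raise ValueError("step must lie in [0, total_steps)")
    switch = int(total_steps * depth_fraction)
    if step < switch:
        # Early phase: pose plus sparse depth to enforce scene structure.
        return {"pose", "sparse_depth"}
    # Late phase: pose-only guidance to refine high-frequency details.
    return {"pose"}

schedule = [conditioning_for_step(s, total_steps=10) for s in range(10)]
```

With ten steps and the assumed fraction of 0.4, steps 0 to 3 use both conditions and steps 4 to 9 use pose only.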


eess.AS

[105] WavCube: Unifying Speech Representation for Understanding and Generation via Semantic-Acoustic Joint Modeling eess.AS | cs.AI | cs.CL

Guanrou Yang, Tian Tan, Qian Chen, Zhikang Niu, Yakun Song

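TL;DR: WavCube is a compact continuous speech latent that, through joint semantic-acoustic modeling, supports speech understanding, reconstruction, and generation in a unified way. It uses a two-stage training scheme: Stage 1 trains a semantic bottleneck to filter out redundant information, and Stage 2 injects fine-grained acoustic details via end-to-end reconstruction while a semantic anchoring loss keeps the representation within its original semantic manifold.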

Details

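Motivation: Speech understanding tasks typically use semantic features extracted by self-supervised learning (SSL), whereas speech generation tasks rely on acoustic features obtained from reconstruction. This fragmentation hinders the construction of truly unified speech models. The paper aims to resolve the compatibility challenge between the two kinds of representation.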

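Result: On SUPERB, WavCube closely approaches WavLM performance despite 8x dimensional compression; its reconstruction quality is on par with existing acoustic representations; it reaches state-of-the-art zero-shot TTS performance with markedly faster training convergence; and it performs strongly on speech enhancement, separation, and voice conversion on the SUPERB-SG benchmark.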

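Insight: The core novelty is a unified compact representation (WavCube) that encodes both semantic and acoustic information, together with a two-stage training scheme that resolves two intrinsic flaws of SSL features for generative modeling (redundancy and a lack of fine-grained detail), providing a feasible technical path toward unified speech systems.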

Abstract: Integrating speech understanding and generation is a pivotal step toward building unified speech models. However, the different representations required for these two tasks currently pose significant compatibility challenges. Typically, semantics-oriented features are learned from self-supervised learning (SSL), and acoustic-oriented features from reconstruction. Such fragmented representations hinder the realization of truly unified speech systems. We present WavCube, a compact continuous latent derived from an SSL speech encoder that simultaneously supports speech understanding, reconstruction, and generation. WavCube employs a two-stage training scheme. Stage 1 trains a semantic bottleneck to filter off-manifold redundancy that makes raw SSL features intractable for diffusion. Stage 2 injects fine-grained acoustic details via end-to-end reconstruction, while a semantic anchoring loss ensures the representation remains grounded within its original semantic manifold. Comprehensive experiments show that WavCube closely approaches WavLM performance on SUPERB despite an 8x dimensional compression, attains reconstruction quality on par with existing acoustic representations, delivers state-of-the-art zero-shot TTS performance with markedly faster training convergence, and excels in speech enhancement, separation, and voice conversion tasks on the SUPERB-SG benchmark. Systematic ablations reveal that WavCube’s two-stage recipe resolves two intrinsic flaws of SSL features for generative modeling, paving the way for future unified speech systems. Codes and checkpoints are available at https://github.com/yanghaha0908/WavCube.


cs.CR

[106] Architecture Matters: Comparing RAG Systems under Knowledge Base Poisoning cs.CR | cs.CL | cs.LG

Samuel Korn

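TL;DR: This study systematically evaluates how four RAG architectures (vanilla RAG, agentic RAG, MADAM-RAG, and Recursive Language Models) differ in robustness to knowledge base poisoning. Using single-document poisoning attacks on 921 Natural Questions QA pairs, it finds that architecture choice is a key variable in adversarial robustness: attack success rates range from 81.9% to 24.4%, while clean accuracy stays at about 92%. The study also introduces a seven-category behavioral taxonomy that captures contradiction detection, hedging, and failure modes beyond binary accuracy.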

Details

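Motivation: Existing research on knowledge base poisoning of RAG systems targets mainly the basic retrieve-then-generate pipeline. Advanced architectures designed to handle conflicting retrieved information (multi-agent debate, agentic retrieval, recursive language models) have not been adequately tested against adversarially optimized contradictions.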

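Result: Under the CorruptRAG-AK adversarial attack, vanilla RAG has the highest attack success rate (81.9%) and the Recursive Language Model (RLM) the lowest (24.4%), a gap of nearly 58 percentage points. MADAM-RAG shows the highest apparent contradiction detection rate, but the LLM judge over-identifies this behavior (48.5% precision), and MADAM-RAG cannot resolve contradictions reliably, with a 41.4% non-answer rate even on clean inputs. All architectures reach about 92% accuracy on the clean baseline.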

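Insight: Architecture design is a high-impact factor in the adversarial robustness of RAG systems. Once the poisoned document is retrieved, adversarial framing rather than retrieval optimization drives most of the attack's advantage for three of the four architectures, which localizes the cross-architecture vulnerability at the content-reasoning stage. The seven-category behavioral taxonomy provides a fine-grained framework for evaluating RAG systems beyond binary accuracy.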

Abstract: Retrieval-Augmented Generation (RAG) systems are vulnerable to knowledge base poisoning, yet existing attacks have been evaluated almost exclusively against vanilla retrieve-then-generate pipelines. Architectures designed to handle conflicting retrieved information - multi-agent debate, agentic retrieval, recursive language models - remain untested against adversarially optimized contradictions. We evaluate four RAG architectures (vanilla RAG, agentic RAG, MADAM-RAG, and Recursive Language Models) under controlled single-document (N=1) poisoning on 921 Natural Questions QA pairs, comparing a clean baseline, naive injection, and CorruptRAG-AK - an adversarial attack whose meta-epistemic framing targets credibility assessment. Architecture is a high-impact variable in adversarial robustness: under CorruptRAG-AK, attack success rates range from 81.9% (vanilla) to 24.4% (RLM) - a spread of nearly 58 percentage points across architectures with comparable clean accuracy (92%). Decomposing this gap, once the poisoned document is retrieved, adversarial framing - not retrieval optimization - drives the majority of CorruptRAG-AK’s advantage for three of four architectures, localizing the cross-architecture vulnerability at the content-reasoning stage. Our MADAM-RAG reimplementation shows the highest apparent contradiction detection rate, though our LLM judge over-identifies this behavior (48.5% precision), so reported rates are upper bounds. Regardless of detection, MADAM-RAG cannot resolve contradictions reliably, producing a 41.4% non-answer rate even on clean inputs - though implementation divergences from the original may contribute. We introduce a seven-category behavioral taxonomy capturing contradiction detection, hedging, and failure modes beyond binary accuracy. Code, data, and analysis notebooks are publicly available.


eess.SP [Back]

[107] The frame-level leakage trap: rethinking evaluation protocols for intrinsic image decomposition, with source-separable uncertainty as a case study eess.SP | cs.CVPDF

Jihwan Woo

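TL;DR: The paper first exposes inconsistent evaluation protocols for intrinsic image decomposition on MPI Sintel, showing that frame-level train/test splits inflate performance through data leakage, and advocates scene-level splits as the community standard. As a case study, it presents a physics-informed decomposition model with a source-separable three-way heteroscedastic uncertainty head and verifies its channel specialization and downstream utility.

Details

Motivation: To resolve the inconsistency of intrinsic image decomposition evaluation protocols on MPI Sintel, in particular by quantifying the leakage caused by frame-level splits, and to establish the stricter scene-level split as the standard. Under the corrected protocol, the paper also explores a physics-informed decomposition method that provides source-separable uncertainty.

Result: Under the corrected scene-level protocol, the method reaches 15.98 ± 0.41 dB R_PSNR on MPI Sintel, within 0.8 dB of a 5-member Deep Ensemble at one-fifth the compute cost. It also provides source-separable uncertainty: filtering out the 75% highest-uncertainty pixels reduces reconstruction MSE on the retained pixels by 77%.

Insight: Main contributions: 1) the first quantification of how leakage in the evaluation protocol affects reported performance (frame-level splits inflate R_PSNR by 1.6–2.0 dB), underscoring the need for scene-level splits; 2) a physics-informed decomposition model (I = R ∘ S + N) with a source-separable three-way heteroscedastic uncertainty head whose channels specialize (the non-Lambertian uncertainty channel correlates with non-Lambertian residual error at r = 0.67); 3) a demonstration that the uncertainty estimates are useful downstream, for example for pixel filtering.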

Abstract: Evaluation protocols for learned intrinsic image decomposition on MPI Sintel have been inconsistent. Several prior works split the dataset by frames, which allows spatially similar frames of the same scene to appear in both train and test partitions. We quantify this leakage effect for the first time, across three architectures: a frame-level split inflates test R_PSNR by 1.6 to 2.0 dB (p < 0.01 for all three, paired t-test across 3 seeds) relative to a scene-level split, confirming an architecture-independent protocol effect. A three-point gradient (random/temporal/scene) shows the gap is continuous, and under extended training the frame-level inflation exceeds 10 dB. We advocate scene-level splits as the community standard and provide reference numbers for six representative models under this protocol. As a case study within the corrected protocol, we present a physics-informed decomposition I = R ∘ S + N with a source-separable three-way heteroscedastic uncertainty head. We empirically verify channel specialization: the non-Lambertian uncertainty channel shows r = 0.67 cross-correlation with non-Lambertian residual error, more than 4 times the texture channel’s correlation. We further demonstrate downstream utility: filtering out the 75% highest-uncertainty pixels reduces reconstruction MSE by 77% on retained pixels, whereas random filtering produces no improvement. The specialization also holds on out-of-distribution real photographs. We report negative results for a more elaborate variant combining frequency decomposition, cross-task supervision, evidential learning, contrastive loss, and test-time adaptation. Our method reaches 15.98 ± 0.41 dB R_PSNR, within 0.8 dB of a 5-member Deep Ensemble at one-fifth the cost, with the unique capability of source-separated uncertainty.
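
The difference between the two protocols can be illustrated with a small sketch. It is not the paper's code: the `(scene_id, frame_id)` layout, the function names, and the 20% test fraction are assumptions made for illustration. It shows that a scene-level split keeps every scene in exactly one partition, while a frame-level shuffle lets the same scene appear on both sides.

```python
import random
from collections import defaultdict


def scene_level_split(frames, test_fraction=0.2, seed=0):
    """Split (scene_id, frame_id) pairs so that no scene spans both partitions."""
    by_scene = defaultdict(list)
    for scene_id, frame_id in frames:
        by_scene[scene_id].append((scene_id, frame_id))
    scenes = sorted(by_scene)
    random.Random(seed).shuffle(scenes)
    n_test = max(1, round(len(scenes) * test_fraction))
    test_scenes = set(scenes[:n_test])
    train = [f for s in scenes if s not in test_scenes for f in by_scene[s]]
    test = [f for s in scenes if s in test_scenes for f in by_scene[s]]
    return train, test


def frame_level_split(frames, test_fraction=0.2, seed=0):
    """Shuffle individual frames; frames of one scene can land on both sides."""
    frames = list(frames)
    random.Random(seed).shuffle(frames)
    n_test = max(1, round(len(frames) * test_fraction))
    return frames[n_test:], frames[:n_test]


def shared_scenes(train, test):
    """Scenes that contribute frames to both partitions (the leakage source)."""
    return {s for s, _ in train} & {s for s, _ in test}


# Hypothetical dataset: 10 scenes with 20 frames each (not the real Sintel layout).
frames = [(f"scene_{s}", i) for s in range(10) for i in range(20)]

train, test = scene_level_split(frames)
print("scene-level overlap:", len(shared_scenes(train, test)))

leaky_train, leaky_test = frame_level_split(frames)
print("frame-level overlap:", len(shared_scenes(leaky_train, leaky_test)))
```

Checking `shared_scenes` before training is a cheap guard against the leakage the abstract describes.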


cs.AI [Back]

[108] PRAISE: Prefix-Based Rollout Reuse in Agentic Search Training cs.AI | cs.CL | cs.IRPDF

Erhan Zhang, Yiqun Chen, Zechun Niu, Wei Yang, Xiaochi Wei

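TL;DR: The paper proposes PRAISE, a framework that reuses prefix states from long-horizon search trajectories to construct additional training data and derive intermediate step rewards, addressing low data utilization and reward sparsity in agentic search training.

Details

Motivation: Current search-based RL methods have two core limitations: expensive long-horizon rollouts are under-utilized during training, and supervision is usually available only at the final answer, which causes severe reward sparsity.

Result: Experiments on multi-hop QA benchmarks show that PRAISE consistently improves performance over strong baselines.

Insight: The novelty lies in using a single shared model for both search policy learning and prefix answer evaluation, with no extra human annotations or separate reward model. Prefix reuse and intermediate reward derivation jointly improve data efficiency and credit assignment.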

Abstract: In agentic search, large language models (LLMs) are trained to perform multi-turn retrieval and reasoning for complex tasks such as multi-hop question answering (QA). However, current search-based Reinforcement Learning (RL) methods suffer from two core limitations: expensive long-horizon rollouts are under-utilized during training, and supervision is typically available only at the final answer, resulting in severe reward sparsity. We present Prefix-based Rollout reuse for Agentic search with Intermediate Step rEwards (PRAISE), a framework for improving both data efficiency and credit assignment in agentic search training. Given a complete search trajectory, PRAISE extracts prefix states at different search turns, elicits intermediate answers from them, and uses these prefixes both to construct additional training trajectories and to derive step-level rewards from performance differences across prefixes. Our method uses a single shared model for both search policy learning and prefix answer evaluation, enabling joint optimization without extra human annotations or a separate reward model. Experiments on multi-hop QA benchmarks show that PRAISE consistently improves performance over strong baselines.
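
One plausible reading of "step-level rewards from performance differences across prefixes" is sketched below. The abstract does not give the exact formula, so the difference rule, the function names, and the example scores are assumptions. Each search turn is credited with the change in intermediate-answer quality between consecutive prefixes.

```python
def extract_prefixes(trajectory):
    """Return every prefix of a trajectory, from zero turns up to the full rollout."""
    return [trajectory[:t] for t in range(len(trajectory) + 1)]


def step_rewards(prefix_scores):
    """Credit each turn with the change in answer quality it produced.

    prefix_scores[t] is the quality of the intermediate answer elicited after
    t search turns (index 0 means no search). The scoring function itself is
    left abstract here; any answer metric in [0, 1] would do.
    """
    return [after - before for before, after in zip(prefix_scores, prefix_scores[1:])]


# Hypothetical three-turn rollout with made-up intermediate-answer scores.
trajectory = ["search: birthplace", "search: country capital", "search: population"]
prefixes = extract_prefixes(trajectory)
scores = [0.0, 0.0, 0.5, 1.0]  # one score per prefix, including the empty one
rewards = step_rewards(scores)
print(rewards)
```

Under this difference rule the step rewards telescope: their sum equals the final score minus the no-search score, so the dense signal stays consistent with the sparse final reward.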


[109] ZAYA1-8B Technical Report cs.AI | cs.CLPDF

Robert Washbourne, Rishi Iyer, Tomas Figliolia, Henry Zheng, Ryan Lorig-Roach

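TL;DR: This report introduces ZAYA1-8B, a reasoning-focused mixture-of-experts model with 700M active and 8B total parameters. Built on Zyphra’s MoE++ architecture, it completed pretraining, midtraining, and supervised fine-tuning on a full-stack AMD platform. It matches or exceeds DeepSeek-R1-0528 on several mathematics and coding benchmarks and is competitive with larger open-weight reasoning models. Post-training uses a four-stage RL cascade, and the report introduces Markovian RSA, a test-time compute method that substantially improves results on challenging benchmarks such as AIME’25 and HMMT’25.

Details

Motivation: To build a parameter-efficient, reasoning-focused mixture-of-experts model that, with under 1B active parameters, matches or exceeds existing large reasoning models, especially in complex reasoning domains such as mathematics and coding.

Result: On several challenging mathematics and coding benchmarks, ZAYA1-8B matches or exceeds DeepSeek-R1-0528 and remains competitive with larger open-weight reasoning models. With Markovian RSA test-time compute, it reaches 91.9% on AIME’25 and 89.6% on HMMT’25, narrowing the gap to much larger models including Gemini-2.5 Pro, DeepSeek-V3.2, and GPT-5-High.

Insight: Claimed contributions: 1) an answer-preserving trimming scheme that includes reasoning data from pretraining onward; 2) a four-stage RL cascade covering math and puzzle warmup, an RLVE-Gym curriculum, math and code RL, and behavioral RL; 3) Markovian RSA, a test-time compute method that recursively aggregates parallel reasoning traces while carrying forward only bounded-length reasoning tails, improving accuracy while bounding compute overhead. Viewed objectively, the core contribution is the tight coupling of an efficient MoE architecture with a full-stack training pipeline optimized for reasoning (notably the RL curriculum and the test-time compute strategy), which yields strong reasoning at a small active-parameter count.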

Abstract: We present ZAYA1-8B, a reasoning-focused mixture-of-experts (MoE) model with 700M active and 8B total parameters, built on Zyphra’s MoE++ architecture. ZAYA1-8B’s core pretraining, midtraining, and supervised fine-tuning (SFT) were performed on a full-stack AMD compute, networking, and software platform. With under 1B active parameters, ZAYA1-8B matches or exceeds DeepSeek-R1-0528 on several challenging mathematics and coding benchmarks, and remains competitive with substantially larger open-weight reasoning models. ZAYA1-8B was trained from scratch for reasoning, with reasoning data included from pretraining onward using an answer-preserving trimming scheme. Post-training uses a four-stage RL cascade: reasoning warmup on math and puzzles; a 400-task RLVE-Gym curriculum; math and code RL with test-time compute traces and synthetic code environments built from competitive-programming references; and behavioral RL for chat and instruction following. We also introduce Markovian RSA, a test-time compute method that recursively aggregates parallel reasoning traces while carrying forward only bounded-length reasoning tails between rounds. In TTC evaluation, Markovian RSA raises ZAYA1-8B to 91.9% on AIME’25 and 89.6% on HMMT’25 while carrying forward only a 4K-token tail, narrowing the gap to much larger reasoning models including Gemini-2.5 Pro, DeepSeek-V3.2, and GPT-5-High.


[110] BALAR : A Bayesian Agentic Loop for Active Reasoning cs.AI | cs.CL | cs.LGPDF

Aymen Echarghaoui, Dongxia Wu, Emily B. Fox

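TL;DR: This paper proposes BALAR (Bayesian Agentic Loop for Active Reasoning), a task-agnostic outer-loop algorithm that enables structured multi-turn interaction between an LLM agent and a user without fine-tuning. BALAR reasons actively about missing information by maintaining a Bayesian belief over latent states, selecting clarifying questions that maximize expected mutual information, and dynamically expanding its state representation when the current one proves insufficient.

Details

Motivation: Most current systems treat dialogue reactively and lack a principled mechanism for reasoning about what information is missing and which question to ask next. The paper addresses the challenge of LLMs actively exchanging information over multiple turns in interactive tasks.

Result: On three benchmarks, AR-Bench-DC (detective cases), AR-Bench-SP (thinking puzzles), and iCraft-MD (clinical diagnosis), BALAR significantly outperforms all baselines, with accuracy higher by 14.6%, 38.5%, and 30.5% respectively, reaching state-of-the-art performance.

Insight: The novelty lies in integrating Bayesian inference and active-learning principles into a task-agnostic outer-loop framework that selects clarifying questions by maximizing expected mutual information and dynamically expands the state representation, giving LLM interactive reasoning a structured and interpretable method.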

Abstract: Large language models increasingly operate in interactive settings where solving a task requires multiple rounds of information exchange with a user. However, most current systems treat dialogue reactively and lack a principled mechanism to reason about what information is missing and which question should be asked next. We propose BALAR (Bayesian Agentic Loop for Active Reasoning), a task-agnostic outer-loop algorithm that requires no fine-tuning and enables structured multi-turn interaction between an LLM agent and a user. BALAR maintains a structured belief over latent states, selects clarifying questions by maximizing expected mutual information, and dynamically expands its state representation when the current one proves insufficient. We evaluate BALAR on three diverse benchmarks: AR-Bench-DC (detective cases), AR-Bench-SP (thinking puzzles), and iCraft-MD (clinical diagnosis). BALAR significantly outperforms all baselines across all three benchmarks, with 14.6% higher accuracy on AR-Bench-DC, 38.5% on AR-Bench-SP, and 30.5% on iCraft-MD.
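
Selecting a question by expected mutual information can be sketched for a small discrete belief. This is a generic textbook computation, not BALAR's implementation: the four-hypothesis prior, the answer-likelihood tables, and the question names are all hypothetical.

```python
import math


def entropy(p):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def expected_information_gain(prior, likelihood):
    """Mutual information between the latent state and the answer to one question.

    likelihood[a][h] is P(answer a | hypothesis h); prior[h] is the current belief.
    """
    gain = entropy(prior)
    for row in likelihood:
        p_answer = sum(l * p for l, p in zip(row, prior))
        if p_answer == 0:
            continue  # this answer cannot occur under the current belief
        posterior = [l * p / p_answer for l, p in zip(row, prior)]
        gain -= p_answer * entropy(posterior)
    return gain


# Hypothetical belief over four latent states and three candidate yes/no questions.
prior = [0.25, 0.25, 0.25, 0.25]
questions = {
    "splits_in_half": [[1, 1, 0, 0], [0, 0, 1, 1]],
    "isolates_one": [[1, 0, 0, 0], [0, 1, 1, 1]],
    "uninformative": [[0.5] * 4, [0.5] * 4],
}
best = max(questions, key=lambda q: expected_information_gain(prior, questions[q]))
print(best)
```

A question whose answer does not depend on the latent state has zero gain, which is why a reactive system that asks generic follow-ups can make no progress on the belief.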


[111] Agentic Retrieval-Augmented Generation for Financial Document Question Answering cs.AI | cs.CLPDF

Yang Shu, Yingmin Liu, Zequn Xie

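TL;DR: This paper proposes FinAgent-RAG, an agentic retrieval-augmented generation framework for the complex multi-step numerical reasoning required in financial document question answering. Using iterative retrieval-reasoning loops with self-verification, it integrates three domain-specific components (a Contrastive Financial Retriever, a Program-of-Thought reasoning module, and an Adaptive Strategy Router) and clearly outperforms existing baselines on several benchmark datasets.

Details

Motivation: Existing retrieval-augmented generation methods use a single-pass retrieve-then-generate paradigm that struggles with the compositional reasoning chains common in financial analysis and cannot meet the strict precision requirements of financial numerical reasoning.

Result: On the FinQA, ConvFinQA, and TAT-QA benchmarks, FinAgent-RAG achieves 76.81%, 78.46%, and 74.96% execution accuracy respectively, 5.62 to 9.32 percentage points above the strongest baseline, while reducing API costs by 41.3% on FinQA.

Insight: The novelty lies in bringing the agentic paradigm to RAG, using iterative loops and self-verification to make reasoning more reliable. Specific technical contributions: 1) a contrastive financial retriever trained with hard negative mining that distinguishes semantically similar but numerically distinct passages; 2) a Program-of-Thought module that generates executable code for exact arithmetic, avoiding LLM mental-arithmetic errors; 3) an adaptive strategy router that allocates compute according to question complexity, balancing cost and accuracy.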

Abstract: Financial document question answering (QA) demands complex multi-step numerical reasoning over heterogeneous evidence (structured tables, textual narratives, and footnotes) scattered across corporate filings. Existing retrieval-augmented generation (RAG) approaches adopt a single-pass retrieve-then-generate paradigm that struggles with the compositional reasoning chains prevalent in financial analysis. We propose FinAgent-RAG, an agentic RAG framework that orchestrates iterative retrieval-reasoning loops with self-verification, specifically engineered for the precision requirements of financial numerical reasoning. The framework integrates three domain-specific innovations: (1) a Contrastive Financial Retriever trained with hard negative mining to distinguish semantically similar but numerically distinct financial passages, (2) a Program-of-Thought reasoning module that generates executable Python code for precise arithmetic rather than relying on error-prone LLM-based mental computation, and (3) an Adaptive Strategy Router that dynamically allocates computational resources based on question complexity, reducing API costs by 41.3% on FinQA while preserving accuracy. Extensive experiments on three benchmark datasets (FinQA, ConvFinQA, and TAT-QA) demonstrate that FinAgent-RAG achieves 76.81%, 78.46%, and 74.96% execution accuracy respectively, outperforming the strongest baseline by 5.62–9.32 percentage points. Ablation studies, cross-backbone evaluation with four LLMs, and deployment cost analysis confirm the framework’s robustness and practical viability for financial institutions.


[112] Nonsense Helps: Prompt Space Perturbation Broadens Reasoning Exploration cs.AI | cs.CL | cs.LGPDF

Langlin Huang, Chengsong Huang, Jinyuan Li, Donghong Cai, Yuyi Yang

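TL;DR: This paper proposes Lorem Perturbation for Exploration (LoPE), a training framework that perturbs the prompt space by prepending randomly assembled Lorem Ipsum word sequences to prompts. It targets the "zero-advantage problem" that RL methods such as GRPO face on complex tasks, where the training signal vanishes when all sampled rollouts fail. Experiments show that the method effectively broadens the reasoning paths that large language models explore.

Details

Motivation: To address the "zero-advantage problem" of verifiable-reward RL methods such as GRPO on complex reasoning tasks: when all sampled rollouts fail, the relative advantage collapses to zero, the model loses its training signal, and data and compute are wasted.

Result: Experiments on 1.7B, 4B, and 7B models show that LoPE significantly outperforms resampling with the original prompts. Further analysis finds that other Latin-based random sequences with low perplexity are also effective perturbations.

Insight: The core idea is that task-irrelevant prompt-space perturbations can unlock orthogonal reasoning pathways for hard questions. Viewed objectively, the method is a simple and effective baseline that injects structured noise to break the constraint a static sampling policy places on exploration, and it includes an analysis of which perturbation sources work (such as low-perplexity pseudo-Latin text).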

Abstract: Reinforcement learning with verifiable rewards, particularly Group Relative Policy Optimization (GRPO), has significantly advanced the reasoning capabilities of Large Language Models (LLMs). However, in complex tasks, GRPO frequently suffers from the “zero-advantage problem”: when all sampled rollouts for a query fail, the relative advantage collapses to zero. Consequently, the model loses effective training signals for these questions, wasting the training data and computational budget. While simply increasing the sampling budget for these questions is a common remedy, the static sampling policy inherently constrains reasoning exploration, limiting the success rate. In this paper, we propose Lorem Perturbation for Exploration (LoPE), a simple yet effective training framework to break this exploration bottleneck. We posit that task-irrelevant prompt-space perturbations can shift the model’s output distribution enough to unlock orthogonal reasoning pathways for hard questions. Specifically, LoPE prepends sequences stochastically assembled from Lorem Ipsum vocabulary (a pseudo-Latin placeholder text) to the prompts before resampling. Experiments across 1.7B, 4B, and 7B models demonstrate that LoPE significantly outperforms resampling with the original prompts. Further analysis reveals that other Latin-based random sequences with low perplexity are also effective perturbations. Our results establish LoPE as a strong baseline for broadening exploration in LLM reinforcement learning.
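
The zero-advantage problem and the perturbation step can be sketched as follows. The advantage formula is the usual group-normalized form associated with GRPO. The vocabulary subset, the prefix length, and the blank-line separator are assumptions, since the abstract does not specify how LoPE assembles or attaches the prefix.

```python
import random
import statistics

# A small subset of Lorem Ipsum words; the actual vocabulary used by LoPE is not given.
LOREM_VOCAB = (
    "lorem ipsum dolor sit amet consectetur adipiscing elit sed do eiusmod tempor"
).split()


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout reward by the mean and std of its sampling group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def lorem_prefix(rng, n_words=8):
    """Assemble a random pseudo-Latin sequence (length is an assumed setting)."""
    return " ".join(rng.choice(LOREM_VOCAB) for _ in range(n_words))


def perturb_prompt(prompt, rng, n_words=8):
    """Prepend the random prefix to the unchanged task prompt before resampling."""
    return f"{lorem_prefix(rng, n_words)}\n\n{prompt}"


# All rollouts failed: every advantage is zero, so the query yields no gradient.
print(group_relative_advantages([0.0, 0.0, 0.0, 0.0]))

# One success in the group restores a non-zero learning signal.
print(group_relative_advantages([1.0, 0.0, 0.0, 0.0]))

print(perturb_prompt("Solve x + 1 = 2.", random.Random(0)))
```

The prefix only changes the conditioning text; the task prompt itself is left intact so that the verifiable reward still applies to the same question.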


[113] Belief Memory: Agent Memory Under Partial Observability cs.AI | cs.CLPDF

Junfeng Liao, Qizhou Wang, Jianing Zhu, Bo Du, Rui Yan

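TL;DR: This paper proposes BeliefMem, a probabilistic memory paradigm for LLM agents in partially observable environments. It targets the self-reinforcing errors that arise when deterministic memory methods store a single conclusion per observation. BeliefMem stores each observation as several candidate conclusions with probabilities, updates those probabilities with Noisy-OR rules, and surfaces all candidates with their confidence at retrieval, so the agent keeps its uncertainty and can revise its confidence as new evidence arrives.

Details

Motivation: Existing external-memory methods for LLM agents usually store each observation as a single deterministic conclusion. Under partial observability, observations are inherently partial and ambiguous, so deterministic storage introduces self-reinforcing error: the agent acts on the stored conclusion, stops considering alternatives, and reinforces that conclusion over time.

Result: Empirical evaluations on the LoCoMo and ALFWorld benchmarks show that, even with limited data, BeliefMem achieves the best average performance and clearly outperforms well-known baselines.

Insight: The novelty lies in shifting the memory paradigm from one conclusion per observation to several probabilistic candidate conclusions. The probability update preserves uncertainty, so the agent can act with high confidence on well-evidenced knowledge while still updating its confidence when new evidence arrives, which opens a new direction for agent memory under partial observability.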

Abstract: LLM agents that operate over long context depend on external memory to accumulate knowledge over time. However, existing methods typically store each observation as a single deterministic conclusion (e.g., inferring “API X failed” from temporary errors), even though such observations are inherently partial and potentially ambiguous. By committing to one conclusion and discarding uncertainty, these methods introduce self-reinforcing error: the agent acts on the stored conclusion, never revisits alternatives, and reinforces the conclusion over time. To address this issue, we propose BeliefMem, which shifts the memory paradigm from committing to a single conclusion per observation to retaining multiple candidate conclusions with their probabilities. Concretely, BeliefMem stores the candidate conclusions as separate memory entries, each carrying a probability that is updated via Noisy-OR rules as new observations arrive. At retrieval, all candidates surface together with their probabilities, keeping alternatives visible to the agent. Since each conclusion in memory retains its probability, BeliefMem preserves the uncertainty that the deterministic paradigm discards, enabling the agent to act with high confidence on well-evidenced knowledge while retaining the capacity to update its confidence when new evidence arrives. Empirical evaluations on LoCoMo and ALFWorld benchmarks show that, even with limited data, BeliefMem achieves the best average performance, remarkably outperforming well-known baselines. More broadly, such probabilistic memory produces substantial gains and explores a new direction for agent memory in partially observable environments.


[114] ReFlect: An Effective Harness System for Complex Long-Horizon LLM Reasoning cs.AI | cs.CLPDF

Fan Huang

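TL;DR: This paper presents ReFlect, a "harness" system for complex long-horizon LLM reasoning. It builds standalone error detection and recovery logic as a deterministic wrapper around the model, targeting the problem that existing reasoning paradigms (chain-of-thought, ReAct, post-hoc self-critique) let errors accumulate silently on long-horizon, multi-stage tasks and cannot recover from them.

Details

Motivation: Existing LLM reasoning paradigms rely on two assumptions that fail on long-horizon, multi-stage tasks, so errors accumulate silently across reasoning steps. The paper addresses an open question: can a reasoning system effectively detect and recover from its own failures?

Result: Controlled experiments across 6 reasoning domains show that ReFlect improves task success. Across six models of different scales, success rates range from 41% on gpt-4o-mini to 56% on Claude Sonnet 4.5, with a gain over the Direct CoT baseline for every model (from +7 pp on Qwen2.5-72B to +29 pp on Claude Sonnet 4.5). On SWE-bench, it raises patch-structural quality from 0% (Direct CoT) to between 82% (Qwen2.5-72B) and 87% (GPT-4o). The harness gain is inversely related to the model’s baseline success rate (fitted slope -1.69, r = -0.76).

Insight: The claimed contribution is a model-agnostic, training-free harness that operates entirely at inference time and improves long-horizon robustness through standalone error detection and recovery logic. Viewed objectively, the core idea is to externalize error handling and make it deterministic. The paper also exposes the limits of prompt-level self-critique (formulaic templates that flag no issues in 90% of audited reflection blocks) and a capability bottleneck of mid-sized models in populating structured state (only 15.0–18.7% pair-mean).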

Abstract: Current reasoning paradigms for LLMs include chain-of-thought, ReAct, and post-hoc self-critique. These paradigms rely on two assumptions that fail on long-horizon, multi-stage tasks. As a result, errors accumulate silently across reasoning steps, leaving an open question: can a reasoning system effectively detect and recover from its own failures? We present ReFlect, a harness system for LLM reasoning that creates standalone error detection and recovery logic as a deterministic wrapper around the model. Controlled experiments across 6 reasoning domains show that prompt-level self-critique produces formulaic templates that flag no issues in 90 of 100 audited reflection blocks, and the investigated LLMs wrongly accept a wrong answer in at least 76% of cases. Our ReFlect harness achieves task success rates ranging from 41% on gpt-4o-mini to 56% on Claude Sonnet 4.5 across six models spanning small and frontier scale, with per-model gains over Direct CoT ranging from +7 pp on Qwen2.5-72B to +29 pp on Claude Sonnet 4.5, and additionally raises SWE-bench patch-structural quality from 0% (Direct CoT) to between 82% (Qwen2.5-72B) and 87% (GPT-4o). Notably, the harness gain is inversely related to the model’s Direct CoT task success rate (the fitted slope is -1.69 with r=-0.76): each pp lost in baseline success rate is mechanically recovered by 1.69 pp of harness gain. We find that adding structured reasoning state and operators yields only 15.0–18.7% pair-mean on Llama-3.3-70B and Qwen2.5-72B because models at this scale cannot reliably populate the state that the operators require. ReFlect is model-agnostic, training-free, and operates entirely at inference time.


[115] TheraAgent: Self-Improving Therapeutic Agent for Precise and Comprehensive Treatment Planning cs.AI | cs.CLPDF

Junkai Li, Yunghwei Lai, Tianyi Zhu, Zheng Long Lee, Weizhi Ma

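TL;DR: The paper proposes TheraAgent, an agentic framework built on an iterative generate-judge-refine pipeline that produces precise, comprehensive, and safe treatment plans, replacing the one-shot generation of conventional large language models.

Details

Motivation: Existing large language models rely on one-shot output without explicit verification when formulating treatment plans, which can yield rough, incomplete, or unsafe plans and does not mirror how human experts iteratively revise them.

Result: On the HealthBench benchmark, TheraAgent achieves state-of-the-art results, leading in Accuracy and Completeness. In expert evaluations it attains an 86% win rate against physicians, with better Targeting and Harm Control.

Insight: The novelty lies in recasting treatment planning as an iterative reasoning task and embedding TheraJudge, a treatment-specific evaluation module, in the inference loop to enforce clinical standards and improve the reliability and safety of the plans.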

Abstract: Formulating a treatment plan is inherently a complex reasoning and refinement task rather than a simple generation problem. However, existing large language models (LLMs) mainly rely on one-shot output without explicit verification, which may result in rough, incomplete, and potentially unsafe treatment plans. To address these limitations, we propose TheraAgent, an agentic framework that replaces one-shot generation with an iterative generate-judge-refine pipeline. By mirroring the actual reasoning process of human experts who iteratively revise treatment plans, our framework progressively transforms coarse and incomplete drafts into precise, comprehensive, and safer therapeutic regimens. To facilitate the critical judge component, we introduce TheraJudge, a treatment-specific evaluation module integrated into the inference loop to enforce clinical standards. Experiments show TheraAgent achieves state-of-the-art results on HealthBench, leading in Accuracy and Completeness. In expert evaluations, it attains an 86% win rate against physicians, with superior Targeting and Harm Control. Moreover, the high agreement between TheraJudge and HealthBench evaluations confirms the reliability of our framework.


[116] Novelty-based Tree-of-Thought Search for LLM Reasoning and Planning cs.AI | cs.CLPDF

Leon Hamm, Zlatan Ajanovic

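TL;DR: This paper proposes a novelty-based tree-of-thought search method to improve large language models on reasoning and planning tasks. It defines a node novelty measure that scores how unique a node is within the thought paths and uses it to prune the search space, improving reasoning while lowering overall compute cost.

Details

Motivation: Chain-of-thought, tree-of-thought, and reinforcement learning methods remain brittle, fall short of human-level performance in many domains, and carry high time and token costs. Inspired by the success of width-based search in planning, the paper explores how the concept of novelty can be transferred to language domains to improve tree-of-thought reasoning.

Result: The method is tested and compared on several benchmarks in language-based planning and general reasoning. Although it adds prompts per state, pruning reduces the overall tree size and can therefore reduce the overall token cost.

Insight: The novelty lies in bringing the planning concept of novelty to language reasoning: a measurable node novelty, estimated by prompting an LLM and drawing on general knowledge embedded during pre-training, is used to prune branches and make the search more efficient.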

Abstract: Although advances such as chain-of-thought, tree-of-thought or reinforcement learning have improved the performance of LLMs in reasoning and planning tasks, they are still brittle and have not achieved human-level performance in many domains, and often suffer from high time and token costs. Inspired by the success of width-based search in planning, we explore how the concept of novelty can be transferred to language domains and how it can improve tree-of-thought reasoning. A tree of thoughts relies on building possible “paths” of consecutive ideas or thoughts. These are generated by repeatedly prompting an LLM. In our paper, a measurable concept of novelty is proposed that describes the uniqueness of a new node (thought) in comparison to nodes previously seen in the search tree. Novelty is estimated by prompting an LLM and making use of embedded general knowledge from pre-training. This metric can then be used to prune branches and reduce the scope of the search. Although this method introduces more prompts per state, the overall token cost can be reduced by pruning and reducing the overall tree size. This procedure is tested and compared using several benchmarks in language-based planning and general reasoning.


[117] Rethinking Adapter Placement: A Dominant Adaptation Module Perspective cs.AI | cs.CL | cs.LGPDF

Suoxin Zhang, Run He, Di Fang, Xiang Tan, Kaixuan Chen

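TL;DR: This paper proposes PAGE, a gradient-energy probe for LoRA adapter placement, and finds that a single shallow FFN down-projection is the dominant adaptation module. Building on this, it designs DomLoRA, which outperforms standard LoRA on several downstream tasks with only about 0.7% of standard LoRA’s trainable parameters.

Details

Motivation: Existing LoRA methods usually place adapters broadly, yet studies show that fewer adapters can maintain or even improve performance. This leaves open how to place a limited number of adapters to maximize performance.

Result: DomLoRA outperforms standard LoRA on average across downstream tasks including instruction following, mathematical reasoning, code generation, and multi-turn conversation, and it also improves other LoRA variants.

Insight: The novelty lies in the PAGE gradient-energy probe for identifying the dominant adaptation module and in the finding that this module’s layer index is architecture-dependent but task-stable, which gives a practical guideline for efficient adapter placement.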

Abstract: Low-rank adaptation (LoRA) is a widely used parameter-efficient fine-tuning method that places trainable low-rank adapters into frozen pre-trained models. Recent studies show that using fewer LoRA adapters may still maintain or even improve performance, but existing methods still distribute adapters broadly, leaving where to place a limited number of adapters to maximize performance largely open. To investigate this, we introduce PAGE (Projected Adapter Gradient Energy), a gradient-based sensitivity probe that estimates the initial trainable gradient energy available to each candidate LoRA adapter. Surprisingly, we find that PAGE is highly concentrated on a single shallow FFN down-projection across two model families and four downstream tasks. We term this module the dominant adaptation module and show that its layer index is architecture-dependent but task-stable. Motivated by this finding, we propose DomLoRA, a placement method that places a single adapter at the dominant adaptation module. With only ~0.7% of vanilla LoRA’s trainable parameters, DomLoRA outperforms it on average across various downstream tasks, including instruction following, mathematical reasoning, code generation, and multi-turn conversation. This method also improves other LoRA variants, supporting the dominant adaptation module perspective as a practical placement guideline.


[118] OPSD Compresses What RLVR Teaches: A Post-RL Compaction Stage for Reasoning Models cs.AI | cs.CLPDF

Jaehoon Kim, Dongha Lee

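TL;DR: This paper studies the role of On-Policy Self-Distillation (OPSD) in thinking-enabled mathematical reasoning and finds that it mainly compresses model outputs rather than correcting errors. Based on this, the authors propose a revised training pipeline: supervised fine-tuning (SFT), then reinforcement learning with verifiable rewards (RLVR), then OPSD for compression.

Details

Motivation: OPSD promises higher accuracy and shorter responses through hindsight supervision, but in thinking-enabled mathematical reasoning its reported gains shrink and sometimes turn negative. The paper investigates how OPSD actually behaves in this setting.

Result: In thinking-enabled mathematical reasoning, applying OPSD only to correct rollouts preserves accuracy while substantially shortening responses, whereas applying it only to incorrect rollouts damages accuracy. This supports the view that OPSD acts mainly as compression rather than correction.

Insight: The core idea is to reposition OPSD as a post-RL compaction stage and to propose the three-stage pipeline SFT → RLVR → OPSD. This suggests a way to make complex reasoning models more efficient by decoupling output compression from error correction.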

Abstract: On-Policy Self-Distillation (OPSD) has recently emerged as an alternative to Reinforcement Learning with Verifiable Rewards (RLVR), promising higher accuracy and shorter responses through token-level credit assignment from a self-teacher conditioned on privileged context. However, this promise does not carry over to thinking-enabled mathematical reasoning, where reported accuracy gains shrink and sometimes turn negative. We hypothesize that hindsight supervision can specify better token-level alternatives in short thinking-disabled outputs, but in long thinking-enabled traces it more readily identifies redundancy than supplies better replacements. To test this, we applied OPSD separately to correct and incorrect rollout groups, so that compression and correction can be observed in isolation. Our results show that in thinking-enabled mathematical reasoning, OPSD behaves most reliably as a compression mechanism rather than a correction mechanism: training only on correct rollouts preserves accuracy while substantially shortening responses, whereas training only on incorrect rollouts damages accuracy. In light of these findings, we propose a revised post-training pipeline for thinking-enabled mathematical reasoning: SFT then RLVR then OPSD.


[119] The Granularity Axis: A Micro-to-Macro Latent Direction for Social Roles in Language Models cs.AI | cs.CLPDF

Chonghan Qin, Xiachong Feng, Ziyun Song, Xiaocheng Feng, Jing Xiong

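TL;DR: The paper finds that the internal representations of large language models (LLMs) encode the granularity of social roles, a continuum from micro-level individuals to macro-level organizations and institutions. The authors define a contrast-based "Granularity Axis" and show that it is the dominant geometric direction in the role representation space, that it tracks role granularity stably, and that activation steering along it causally shifts the granularity of model outputs.

Details

Motivation: To determine whether and how LLMs encode role granularity (from individual experience to macro-level institutional reasoning) in their internal representations when prompted to take on social roles, and to understand the structure and causal effect of this encoding.

Result: In Qwen3-8B, the Granularity Axis aligns with the first principal component (PC1) of the role representation space (cosine similarity 0.972) and accounts for 52.6% of the variance. Role projections onto the axis increase monotonically across the five granularity levels, and the finding is stable across layers and prompt variants and transfers to Llama-3.1-8B-Instruct. Activation steering shifts the granularity score of responses (for example, Llama moves from 2.00 to 3.17 on a five-point macro scale).

Insight: The novelty lies in framing social role granularity as a measurable latent direction that dominates the representation space and in showing that it is structurally ordered and causally manipulable. This offers a new view of how models play social roles and a possible route to fine-grained behavioral control through representation engineering.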

Abstract: Large language models (LLMs) are routinely prompted to take on social roles ranging from individuals to institutions, yet it remains unclear whether their internal representations encode the granularity of such roles, from micro-level individual experience to macro-level organizational, institutional, or national reasoning. We show that they do. We define a contrast-based Granularity Axis as the difference between mean macro- and micro-role hidden states. In Qwen3-8B, this axis aligns with the principal axis (PC1) of the role representation space at cosine 0.972 and accounts for 52.6% of its variance, indicating that granularity is the dominant geometric axis organizing prompted social roles. We construct 75 social roles across five granularity levels and collect 91,200 role-conditioned responses over shared questions and prompt variants, then extract role-level hidden states and project them onto the axis. Role projections increase monotonically across all five levels, remain stable across layers, prompt variants, endpoint definitions, held-out splits, and score-filtered subsets, and transfer to Llama-3.1-8B-Instruct. The axis is also causally relevant: activation steering along it shifts response granularity in the predicted direction, with Llama moving from 2.00 to 3.17 on a five-point macro scale under positive steering on prompts that admit local responses. The two models differ in controllability, suggesting that steering depends on each model’s default operating regime. Overall, our findings suggest that social role granularity is not merely a stylistic surface feature, but a structured, ordered, and causally manipulable latent direction in role-conditioned language model behavior.
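
The contrast-based axis and its comparison with PC1 can be sketched on synthetic data. This is an illustration only: the 8-dimensional vectors, the noise level, and the planted direction are assumptions, and real hidden states would come from a model rather than a random generator. The 75 roles over five levels follow the abstract; the even 15-per-level split is assumed.

```python
import math
import random


def mean_vector(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def first_principal_component(rows, iterations=200):
    """Power iteration on the (unnormalized) covariance matrix of the rows."""
    mu = mean_vector(rows)
    centered = [[x - m for x, m in zip(row, mu)] for row in rows]
    v = normalize([1.0] * len(mu))
    for _ in range(iterations):
        scores = [dot(row, v) for row in centered]  # X v
        v = normalize([sum(s * row[j] for s, row in zip(scores, centered))
                       for j in range(len(mu))])    # X^T (X v)
    return v


# Synthetic role states: granularity level shifts dimension 0, the rest is noise.
rng = random.Random(0)
DIM, ROLES_PER_LEVEL = 8, 15
levels, states = [], []
for level in range(5):
    for _ in range(ROLES_PER_LEVEL):
        vec = [rng.gauss(0.0, 0.5) for _ in range(DIM)]
        vec[0] += 2.0 * level
        levels.append(level)
        states.append(vec)

micro = mean_vector([s for s, l in zip(states, levels) if l == 0])
macro = mean_vector([s for s, l in zip(states, levels) if l == 4])
axis = normalize([a - b for a, b in zip(macro, micro)])  # macro mean minus micro mean

pc1 = first_principal_component(states)
cosine = abs(dot(axis, pc1))  # the sign of a principal component is arbitrary

level_means = [
    sum(dot(s, axis) for s, l in zip(states, levels) if l == k) / ROLES_PER_LEVEL
    for k in range(5)
]
print(round(cosine, 3), [round(m, 2) for m in level_means])
```

The axis is defined from the two endpoint levels only, so the monotone ordering of the three intermediate levels is a genuine check rather than something built into the construction.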


[120] SkillOS: Learning Skill Curation for Self-Evolving Agents cs.AI | cs.CLPDF

Siru Ouyang, Jun Yan, Yanfei Chen, Rujun Han, Zifeng Wang

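TL;DR: The paper proposes SkillOS, an experience-driven RL training recipe for learning skill curation in self-evolving agents. It pairs a frozen agent executor with a trainable skill curator and, through composite rewards and training on task streams grouped by skill-relevant dependencies, lets the agent learn complex long-term skill curation policies from accumulated experience.

Details

Motivation: LLM-based agents handling streaming tasks often act as one-off problem solvers that fail to learn from past interactions. Existing approaches rely on manual skill curation, heuristic skill operations, or training for short-horizon skill operations, and they struggle to learn complex long-term curation policies from indirect and delayed feedback.

Result: Across multi-turn agentic tasks and single-turn reasoning tasks, SkillOS consistently outperforms memory-free and strong memory-based baselines in both effectiveness and efficiency. The learned skill curator generalizes across executor backbones and task domains. Analyses show that it produces more targeted skill use and that the skills in the repository evolve into more richly structured Markdown files encoding higher-level meta-skills.

Insight: The core idea is to formalize skill curation as a reinforcement learning problem with delayed feedback and to design grouped training based on task dependencies together with composite rewards. This gives a scalable framework for long-lived agents that evolve their own skill repository from experience.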

Abstract: LLM-based agents are increasingly deployed to handle streaming tasks, yet they often remain one-off problem solvers that fail to learn from past interactions. Reusable skills distilled from experience provide a natural substrate for self-evolution, where high-quality skill curation serves as the key bottleneck. Existing approaches either rely on manual skill curation, prescribe heuristic skill operations, or train for short-horizon skill operations. However, they still struggle to learn complex long-term curation policies from indirect and delayed feedback. To tackle this challenge, we propose SkillOS, an experience-driven RL training recipe for learning skill curation in self-evolving agents. SkillOS pairs a frozen agent executor that retrieves and applies skills with a trainable skill curator that updates an external SkillRepo from accumulated experience. To provide learning signals for curation, we design composite rewards and train on grouped task streams based on skill-relevant task dependencies, where earlier trajectories update the SkillRepo, and later related tasks evaluate these updates. Across multi-turn agentic tasks and single-turn reasoning tasks, SkillOS consistently outperforms memory-free and strong memory-based baselines in both effectiveness and efficiency, with the learned skill curator generalizing across different executor backbones and task domains. Further analyses show that the learned curator produces more targeted skill use, while the skills in SkillRepo evolve into more richly structured Markdown files that encode higher-level meta-skills over time.


[121] Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key cs.AI | cs.CLPDF

Tianle Wang, Zhaoyang Wang, Guangchen Lan, Xinpeng Wei, Sipeng Zhang

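TL;DR: This paper introduces ScaleLogic, a synthetic logical reasoning framework for studying how reinforcement learning (RL) improves the long-horizon reasoning of large language models (LLMs). It offers independent control over two axes of difficulty: reasoning depth (the planning horizon) and the expressiveness of the underlying logic. RL training compute T follows a power law in reasoning depth D (T ∝ D^γ), and the scaling exponent γ increases monotonically with logical expressiveness. Training on more expressive logics yields larger gains and more compute-efficient transfer on downstream mathematics and general reasoning benchmarks.

Details

Motivation: Systematic study of how RL training scales with task difficulty, especially for long-horizon reasoning, has been hampered by the lack of controlled, scalable environments. The paper uses a framework with precise control over reasoning depth and logical expressiveness to characterize how RL training of LLMs for logical reasoning scales.

Result: Experiments on ScaleLogic show a strong power-law relationship between RL training compute and reasoning depth (R² > 0.99), with the scaling exponent γ increasing monotonically with logical expressiveness from 1.04 to 2.60. On downstream mathematics and general reasoning benchmarks, more expressive training settings yield gains of up to +10.66 points and more compute-efficient transfer. The power law holds across multiple RL methods, and curriculum-based training substantially improves scaling efficiency.

Insight: The core contribution is a controlled, scalable logical reasoning framework (ScaleLogic) for quantifying how RL training of LLM reasoning scales. Viewed objectively, the key finding is that what a model is trained on (the expressiveness of the logic), not just how much it is trained, shapes downstream transfer, together with a quantitative power-law characterization of how training compute grows with task complexity.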

Abstract: Reinforcement learning (RL) has been applied to improve large language model (LLM) reasoning, yet the systematic study of how training scales with task difficulty has been hampered by the lack of controlled, scalable environments. We introduce ScaleLogic, a synthetic logical reasoning framework that offers independent control over two axes of difficulty: the depth of the required proof planning (i.e., the horizon) and the expressiveness of the underlying logic. Our proposed framework supports a wide range of logics: from simple implication-only logic (“if-then”) towards more expressive first-order reasoning with conjunction (“and”), disjunction (“or”), negation (“not”), and universal quantification (“for all”). Using this framework, we show that the RL training compute $T$ follows a power law with respect to reasoning depth $D$ ($T \propto D^{\gamma}$, $R^{2} > 0.99$), and that the scaling exponent $\gamma$ increases monotonically with logical expressiveness, from $1.04$ to $2.60$. On downstream mathematics and general reasoning benchmarks, more expressive training settings yield both larger performance gains (up to $+10.66$ points) and more compute-efficient transfer compared to less expressive settings, demonstrating that what a model is trained on, not just how much it is trained, shapes downstream transfer. We further show that the power-law relationship holds across multiple RL methods, and curriculum-based training substantially improves scaling efficiency.
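
A power law of the form T ∝ D^γ becomes a straight line in log-log space, so γ can be estimated with ordinary least squares on (log D, log T). The sketch below shows that generic procedure on synthetic data. It is not the paper's fitting code, and the depths and compute values are made up.

```python
import math


def fit_power_law(depths, compute):
    """Fit T = c * D**gamma by least squares on log T = gamma * log D + log c.

    Returns (gamma, c, r_squared), where r_squared is measured in log space.
    """
    xs = [math.log(d) for d in depths]
    ys = [math.log(t) for t in compute]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    gamma = sxy / sxx
    intercept = mean_y - gamma * mean_x
    ss_res = sum((y - (gamma * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return gamma, math.exp(intercept), 1.0 - ss_res / ss_tot


# Synthetic, noise-free data generated with gamma = 2 and prefactor 3.
depths = [2, 4, 8, 16]
compute = [3.0 * d ** 2 for d in depths]

gamma, prefactor, r_squared = fit_power_law(depths, compute)
print(f"gamma={gamma:.3f} prefactor={prefactor:.3f} R^2={r_squared:.4f}")
```

With real measurements, comparing the fitted γ across logics of different expressiveness is what the reported range of 1.04 to 2.60 summarizes.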


[122] Intelligent CCTV for Urban Design: AI-Based Analysis of Soft Infrastructure at Intersections cs.AI | cs.CV | eess.IV

Vinit Katariya, Seungjin Kim, Curtis Craig, Nichole Morris, Hamed Tabkhi

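TL;DR: This study proposes an AI and computer vision analytics framework that uses existing CCTV infrastructure to evaluate how soft interventions, such as temporary pedestrian refuges and curb extensions, affect vehicle speed and traffic safety. Using deep learning and perspective-based speed estimation, it analyzes driver behavior in Minneapolis before and after the interventions, with repeated monitoring in the first and second weeks after installation.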

Details

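Motivation: To use AI and computer vision on existing CCTV infrastructure for rapid, low-cost evaluation of soft traffic interventions, in support of evidence-based transport policy.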

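Result: At unsignalized intersections, mean and 85th-percentile speeds fell by up to 18.75% and 16.56%, respectively, and pass-through traffic decreased by up to 12.2%. Signalized intersections showed comparable reductions except at one location, with mean and 85th-percentile speeds dropping by up to 20.0% and 17.19%.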

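Insight: The novelty lies in combining deep learning with perspective-based speed estimation to perform non-intrusive traffic analysis on existing surveillance video. Viewed objectively, the method offers a scalable, low-cost tool for evaluating urban traffic interventions and demonstrates the practical value of AI in transport policy evaluation.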

Abstract: Artificial intelligence (AI) and computer vision are transforming transportation data collection. This study introduces an AI-enabled analytics framework leveraging existing CCTV infrastructure to evaluate the impact of soft interventions, such as temporary pedestrian refuges and curb extensions, on vehicle speed and safety. Using deep learning and perspective-based speed estimation, we evaluated driver behavior before and after interventions, with repeated post-installation monitoring in Week 1 and Week 2, in Minneapolis. Findings reveal that at unsignalized intersections, mean and 85th-percentile speeds fell by up to 18.75% and 16.56%, respectively, while pass-through traffic decreased by as much as 12.2%. Signalized intersections showed comparable reductions except one location, with mean and 85th-percentile speeds dropping by up to 20.0% and 17.19%. These results demonstrate the traffic-calming effectiveness of soft infrastructure and underscore the utility of AI-powered methods for rapid, low-cost, and evidence-based transport policy evaluation.
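The headline figures are percent reductions in the mean and 85th-percentile speed between a "before" and an "after" sample. As a rough sketch of that arithmetic, assuming per-vehicle speed estimates are already available, the helpers below compute both statistics; the function names and the sample speeds are hypothetical, and the linear-interpolation percentile is one common convention that the paper may or may not use.

```python
import math


def percentile(values, q):
    """Percentile with linear interpolation between order statistics (q in [0, 100])."""
    data = sorted(values)
    if not data:
        raise ValueError("percentile of an empty sample is undefined")
    pos = (len(data) - 1) * q / 100.0
    lo = math.floor(pos)
    hi = math.ceil(pos)
    if lo == hi:
        return data[lo]
    return data[lo] + (data[hi] - data[lo]) * (pos - lo)


def percent_reduction(before, after):
    """Relative drop from `before` to `after`, in percent."""
    return 100.0 * (before - after) / before


# Hypothetical per-vehicle speeds (mph) before and after an intervention.
speeds_before = [20, 25, 30, 35, 40]
speeds_after = [16, 20, 24, 28, 32]

mean_before = sum(speeds_before) / len(speeds_before)
mean_after = sum(speeds_after) / len(speeds_after)
p85_before = percentile(speeds_before, 85)
p85_after = percentile(speeds_after, 85)

mean_drop = percent_reduction(mean_before, mean_after)
p85_drop = percent_reduction(p85_before, p85_after)
print(f"mean drop: {mean_drop:.2f}%  85th-percentile drop: {p85_drop:.2f}%")
```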


[123] Large Vision-Language Models Get Lost in Attention cs.AI | cs.CV

Gongli Xi, Ye Tian, Mengyu Yang, Huahui Yi, Liang Lin

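TL;DR: Using a unified framework grounded in information theory and geometry, this paper reveals a functional decoupling between attention and feed-forward networks (FFNs) in large vision-language models: attention acts as a subspace-preserving operator focused on reconfiguration, while FFNs act as subspace-expanding operators that drive semantic innovation. Strikingly, replacing learned attention weights with predefined values (e.g., Gaussian noise) matches or even exceeds the original model's performance on a majority of datasets, exposing severe misallocation and redundancy in current attention mechanisms.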

Details

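Motivation: Despite the rapid evolution of training paradigms, the decoder backbone of large vision-language models is still built on the residual-connection Transformer architecture, so understanding the distinct roles of its internal modules is critical for understanding model mechanics and guiding architectural optimization. Existing statistical approaches lack a unified theoretical basis, and this paper aims to fill that gap.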

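Result: Experiments across multiple datasets show that replacing learned attention weights with predefined values (e.g., Gaussian noise) yields performance comparable or even superior to the original models, revealing severe inefficiency in the attention mechanisms of current state-of-the-art LVLMs.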

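Insight: The novelty lies in a unified framework grounded in information theory and geometry that quantifies the geometric and entropic nature of residual updates, revealing the functional decoupling between attention and FFNs. Viewed objectively, the finding challenges the necessity of the attention module in current Transformer architectures and offers a new theoretical basis and direction for model simplification and optimization.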

Abstract: Despite the rapid evolution of training paradigms, the decoder backbone of large vision–language models (LVLMs) remains fundamentally rooted in the residual-connection Transformer architecture. Therefore, deciphering the distinct roles of internal modules is critical for understanding model mechanics and guiding architectural optimization. While prior statistical approaches have provided valuable attribution-based insights, they often lack a unified theoretical basis. To bridge this gap, we propose a unified framework grounded in information theory and geometry to quantify the geometric and entropic nature of residual updates. Applying this unified framework reveals a fundamental functional decoupling: Attention acts as a subspace-preserving operator focused on reconfiguration, whereas FFNs serve as subspace-expanding operators driving semantic innovation. Strikingly, further experiments demonstrate that replacing learned attention weights with predefined values (e.g., Gaussian noise) yields comparable or even superior performance across a majority of datasets relative to vanilla models. These results expose severe misallocation and redundancy in current mechanisms, suggesting that state-of-the-art LVLMs effectively “get lost in attention” rather than efficiently leveraging visual context.


[124] Event-Causal RAG: A Retrieval-Augmented Generation Framework for Long Video Reasoning in Complex Scenarios cs.AI | cs.CV

Peizheng Yan, Yu Zhao, Liang Xie, Juntong Qi, Mingming Wang

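TL;DR: This paper presents Event-Causal RAG, a lightweight retrieval-augmented generation framework for reasoning over ultra-long videos. The method segments streaming video into semantically coherent events, represents each event as a structured State-Event-State graph, and merges these graphs into a global Event Knowledge Graph stored in a dual-store memory that supports both semantic matching and causal-topological retrieval. A bidirectional retrieval strategy identifies the relevant event causal chains, which are passed with the associated video evidence to a backbone video foundation model for answer generation.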

Details

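Motivation: Existing large vision-language models perform well on short- and medium-length video understanding but fall short on ultra-long or infinite video reasoning, where they struggle to maintain coherent memory over long durations and to infer causal dependencies between temporally distant events. Existing end-to-end methods are limited by the O(n²) complexity of self-attention, while existing retrieval-augmented generation methods suffer from fragmented clip-level memory, weak modeling of temporal and causal structure, and high storage and inference costs.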

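Result: On long-video understanding benchmarks, Event-Causal RAG consistently outperforms clip-based retrieval baselines and long-context video models, particularly on questions that require multi-event integration and causal inference across long temporal gaps, while also achieving better memory efficiency and robust streaming performance.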

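Insight: The novelty lies in using events rather than fixed-length clips as the basic semantic unit, representing them as structured State-Event-State graphs, and building a dual-store memory that supports both semantic and causal-topological retrieval. Viewed objectively, the method lifts video understanding from the frame/clip level to the event level and explicitly models state transitions and causal dependencies, providing a more efficient representation and retrieval mechanism for temporal reasoning over long sequences.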

Abstract: Recent large vision-language models have achieved strong performance on short- and medium-length video understanding, yet they remain inadequate for ultra-long or even infinite video reasoning, where models must preserve coherent memory over extended durations and infer causal dependencies across temporally distant events. Existing end-to-end video understanding methods are fundamentally limited by the $O(n^2)$ complexity of self-attention, while recent retrieval-augmented generation (RAG) approaches still suffer from fragmented clip-level memory, weak modeling of temporal and causal structure, and high storage and online inference costs. We present Event-Causal RAG, a lightweight retrieval-augmented framework for infinite long-video reasoning. Instead of indexing fixed-length clips, our method segments streaming videos into semantically coherent events and represents each event as a structured State-Event-State (SES) graph, capturing the event together with its surrounding state transitions. These graphs are merged into a global Event Knowledge Graph and stored in a dual-store memory that supports both semantic matching and causal-topological retrieval. On top of this memory, we design a bidirectional retrieval strategy to efficiently identify the most relevant event causal chains and provide them, together with the associated video evidence, to a backbone video foundation model for answer generation. Experiments on long-video understanding benchmarks demonstrate that Event-Causal RAG consistently outperforms strong clip-based retrieval baselines and long-context video models, particularly on questions requiring multi-event integration and causal inference across long temporal gaps, while also achieving improved memory efficiency and robust streaming performance.
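To make the State-Event-State idea concrete, the toy sketch below stores each event with its surrounding states, links events causally, and retrieves a causal chain by first matching a seed event and then walking causes backward and effects forward. Everything here is an assumption for illustration: the `SESEvent` class, the word-overlap matcher (a stand-in for embedding-based semantic matching), and the hop-limited walk are not the paper's implementation.

```python
from dataclasses import dataclass, field


@dataclass
class SESEvent:
    """One State-Event-State node: an event plus the states around it."""
    event_id: str
    pre_state: str
    event: str
    post_state: str
    causes: list = field(default_factory=list)   # ids of upstream events
    effects: list = field(default_factory=list)  # ids of downstream events


def link(graph, cause_id, effect_id):
    """Add a directed causal edge cause -> effect."""
    graph[cause_id].effects.append(effect_id)
    graph[effect_id].causes.append(cause_id)


def semantic_match(graph, query):
    """Toy semantic store: return the event id with the largest word overlap."""
    query_words = set(query.lower().split())

    def score(ev):
        text = f"{ev.pre_state} {ev.event} {ev.post_state}".lower().split()
        return len(query_words & set(text))

    return max(graph.values(), key=score).event_id


def causal_chain(graph, seed_id, max_hops=2):
    """Bidirectional walk: causes backward and effects forward from the seed."""
    def walk(attr):
        found, frontier = [], [seed_id]
        for _ in range(max_hops):
            next_frontier = []
            for eid in frontier:
                for neighbor in getattr(graph[eid], attr):
                    if neighbor != seed_id and neighbor not in found:
                        found.append(neighbor)
                        next_frontier.append(neighbor)
            frontier = next_frontier
        return found

    return list(reversed(walk("causes"))) + [seed_id] + walk("effects")


# Hypothetical three-event video.
graph = {
    "e1": SESEvent("e1", "door closed", "person opens door", "door open"),
    "e2": SESEvent("e2", "door open", "dog runs outside", "dog outside"),
    "e3": SESEvent("e3", "yard empty", "owner searches street", "owner on street"),
}
link(graph, "e1", "e2")
link(graph, "e2", "e3")

seed = semantic_match(graph, "why did the dog run outside")
chain = causal_chain(graph, seed)
print(seed, chain)
```

In the framework described above, the retrieved chain would be handed to the video model together with the matching video evidence; this sketch stops at the retrieval step.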


[125] GlazyBench: A Benchmark for Ceramic Glaze Property Prediction and Image Generation cs.AI | cs.CV

Ziyu Zhai, Siyou Li, Juexi Shao, Juntao Yu

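TL;DR: GlazyBench is the first dataset for AI-assisted ceramic glaze design. It contains 23,148 real glaze formulations and supports two tasks: predicting post-firing surface properties (such as color and transparency) from raw materials, and generating accurate visual representations of the glaze from those properties.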

Details

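Motivation: To address the costly, time-consuming trial and error that complex chemistry imposes on ceramic glaze development, and to fill the field's lack of large-scale datasets for training multimodal AI models.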

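Result: The experiments establish comprehensive baselines for property prediction (using traditional machine learning and large language models) and for image generation (using deep generative models and large multimodal models), with promising yet challenging results.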

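Insight: The work opens a new research direction in AI-assisted material design and provides a standardized benchmark dataset for systematic evaluation.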

Abstract: Developing ceramic glazes is a costly, time-consuming process of trial and error due to complex chemistry, placing a significant burden on independent artists. While recent advances in multimodal AI offer a modern solution, the field lacks the large-scale datasets required to train these models. We propose GlazyBench, the first dataset for AI-assisted glaze design. Comprising 23,148 real glaze formulations, GlazyBench supports two primary tasks: predicting post-firing surface properties, such as color and transparency, from raw materials, and generating accurate visual representations of the glaze based on these properties. We establish comprehensive baselines for property prediction using traditional machine learning and large language models, alongside image generation benchmarks using deep generative and large multimodal models. Our experiments demonstrate promising yet challenging results. GlazyBench pioneers a new research direction in AI-assisted material design, providing a standardized benchmark for systematic evaluation.


cs.CY [Back]

[126] From Review to Design: Ethical Multimodal Driver Monitoring Systems for Risk Mitigation, Incident Response, and Accountability in Automated Vehicles cs.CY | cs.CV | cs.ET

Bilal Khana, Waseem Shariff, Rory Coyne, Muhammad Ali Farooq, Peter Corcoran

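TL;DR: Taking a review-to-design perspective, this paper examines the ethical and legal challenges facing multimodal Driver Monitoring Systems (DMS) in automated vehicles and proposes a modular ethical design framework that translates high-level principles into actionable design and deployment guidance, with the aim of fostering transparent, trustworthy, and human-centered DMS development.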

Details

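Motivation: As vehicle automation increases, DMS become essential for ensuring human oversight, safety, and regulatory compliance, but their deployment raises complex ethical and legal issues including privacy, consent, data ownership, and algorithmic fairness. Existing regulatory frameworks such as the GDPR, the EU AI Act, and IEEE standards lack specific guidance for the unique risks of in-cabin sensing technologies.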

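Result: The paper reports no quantitative experiments or benchmark results. Instead, it critically reviews existing frameworks, identifies their gaps when applied to multimodal, AI-enabled in-cabin monitoring, and on that basis proposes a tailored ethical design framework.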

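Insight: The novelty lies in translating ethical principles into actionable design guidance, including user-configurable consent mechanisms, fairness-aware model development, transparency and explainability tools, and safeguards for driver emotional well-being, giving DMS development for next-generation autonomous vehicles a systematic approach to risk mitigation and accountability.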

Abstract: As vehicles transition toward higher levels of automation, Driver Monitoring Systems (DMS) have become essential for ensuring human oversight, safety, and regulatory compliance in a vehicle. These systems rely on multimodal sensing and AI-driven inference to assess driver attention, cognitive state, and readiness to take control. While technologically promising, their deployment introduces a complex set of ethical and legal challenges - ranging from privacy and consent to data ownership and algorithmic fairness. While overarching frameworks such as the GDPR, EU AI Act, and IEEE standards offer important guidance, they lack the specificity required for addressing the unique risks posed by in-cabin sensing technologies. This paper adopts a review-to-design perspective, critically examining existing regulatory instruments and ethical frameworks – such as the GDPR, the EU AI Act, and IEEE guidelines – and identifying gaps in their applicability to the distinctive risks posed by multimodal, AI-enabled in-cabin monitoring. Building on this review, we propose a modular ethical design framework tailored specifically to Driver Monitoring Systems. The framework translates high-level principles into actionable design and deployment guidance, including user-configurable consent mechanisms, fairness-aware model development, transparency and explainability tools, and safeguards for driver emotional well-being. Finally, the paper outlines a risk analysis and failure mitigation strategy, emphasizing proactive incident response and accountability mechanisms tailored to the DMS context. Together, these contributions aim to inform the development of transparent, trustworthy, and human-centered driver monitoring systems for next-generation autonomous vehicles.


cs.DB [Back]

[127] Anatomy of a Query: W5H Dimensions and FAR Patterns for Text-to-SQL Evaluation cs.DB | cs.CL

Vicki Stover Hertzberg, Eduardo Valverde, Joyce C. Ho

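TL;DR: This paper presents QUEST, a framework for evaluating and designing natural language interfaces to databases, built on two independent components: the FAR structural invariant (every query reduces to Filter, Aggregate, and Return operations) and the W5H dimensional framework (filtering criteria map to six semantic dimensions). Validated on five text-to-SQL datasets (n = 120,464), FAR conformance is universal across all domains and schema types, while W5H dimensional profiles vary substantially: healthcare queries concentrate heavily in the temporal and person-centric dimensions, and causal and mechanistic reasoning is near zero, marking a frontier challenge for machine reasoning over structured data.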

Details

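Motivation: Natural language interfaces to databases are increasingly popular, yet the theoretical foundations for evaluating and designing them remain underdeveloped, so a framework for systematically analyzing query structure is needed.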

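Result: Validated on five text-to-SQL datasets (120,464 queries in total), FAR conformance is universal across all domains and schema types. The W5H analysis shows that in healthcare queries the temporal dimension (WHEN) appears in 80.4% and the person-centric dimension (WHO) in 73.0%, far exceeding general-domain benchmarks, while causal (WHY) and mechanistic (HOW) reasoning is near zero.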

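Insight: The novelty lies in proposing the FAR structural invariant and the W5H dimensional framework as a theoretical foundation for evaluating text-to-SQL systems. Viewed objectively, the framework exposes differences in the semantic distribution of queries and identifies the limits of current systems in causal and mechanistic reasoning, pointing to a direction for future work on machine reasoning.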

Abstract: Natural language interfaces to databases have gained popularity, yet the theoretical foundations for evaluating and designing these systems remain underdeveloped. We present QUEST (Query Understanding Evaluation through Semantic Translation), a framework resting on two independently motivated components: the FAR structural invariant, which holds that every well-formed query reduces to Filter, Aggregate, and Return operations; and the W5H dimensional framework, which holds that all filtering criteria map to six semantic dimensions (Who, What, Where, When, Why, and How). Validated across five text-to-SQL datasets (n = 120,464), FAR conformance is universal across all domains and schema types, while W5H dimensional profiles vary substantially. Healthcare queries are strongly concentrated in temporal (WHEN: 80.4%) and person-centric (WHO: 73.0%) dimensions far exceeding general-domain benchmarks, and causal (WHY) and mechanistic (HOW) reasoning are near-zero everywhere, with apparent HOW exceptions reflecting quantitative aggregation rather than genuine procedural reasoning. These results identify a frontier that must be crossed for genuine machine reasoning over structured data.
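One way to picture the FAR invariant is to write a query as three explicit parts and tag each filter with a W5H dimension. The sketch below does this over in-memory rows; the `Filter` and `FARQuery` classes, the dimension tags, and the sample table are assumptions made for illustration, not the annotation scheme used in QUEST.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Filter:
    dimension: str                      # one of WHO, WHAT, WHERE, WHEN, WHY, HOW
    predicate: Callable[[dict], bool]   # row -> keep or drop


@dataclass
class FARQuery:
    filters: list                                    # Filter step
    aggregate: Optional[Callable[[list], object]]    # Aggregate step (may be absent)
    returns: list                                    # Return step: column names

    def run(self, rows):
        kept = [r for r in rows if all(f.predicate(r) for f in self.filters)]
        projected = [tuple(r[c] for c in self.returns) for r in kept]
        # Aggregate over the returned columns when an aggregate is given.
        return self.aggregate(projected) if self.aggregate else projected

    def w5h_profile(self):
        """Sorted set of W5H dimensions touched by the filters."""
        return sorted({f.dimension for f in self.filters})


# Hypothetical visits table.
visits = [
    {"visit_id": 1, "patient_id": 7, "year": 2023},
    {"visit_id": 2, "patient_id": 7, "year": 2022},
    {"visit_id": 3, "patient_id": 8, "year": 2023},
    {"visit_id": 4, "patient_id": 7, "year": 2023},
]

# "How many visits did patient 7 have in 2023?"
query = FARQuery(
    filters=[
        Filter("WHO", lambda r: r["patient_id"] == 7),
        Filter("WHEN", lambda r: r["year"] == 2023),
    ],
    aggregate=len,
    returns=["visit_id"],
)
count = query.run(visits)
profile = query.w5h_profile()
print(count, profile)
```

Note that the example question begins with "How many" yet carries no HOW filter, which mirrors the abstract's point that apparent HOW cases tend to be quantitative aggregation.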


cs.LG [Back]

[128] Internalizing Outcome Supervision into Process Supervision: A New Paradigm for Reinforcement Learning for Reasoning cs.LG | cs.AI | cs.CL

Fei Ding, Yongkang Zhang, Runhao Liu, Yuhao Liao, Zijian Zeng

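TL;DR: This paper proposes a new paradigm for reinforcement learning for reasoning that internalizes outcome supervision into process supervision. The method lets the model automatically extract process-level learning signals by identifying, correcting, and reusing failed reasoning trajectories, achieving finer-grained policy optimization under outcome-only supervision.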

Details

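Motivation: To address the difficulty of fine-grained credit assignment when reinforcement learning for reasoning relies only on sparse outcome supervision, and the high cost and poor scalability of externally constructed process supervision.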

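Result: The paper proposes a supervision-internalization method that lets the model automatically generate internal process supervision under outcome-only supervision, enabling finer-grained policy optimization and opening a new path for credit assignment in reinforcement learning for reasoning.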

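Insight: The novelty lies in reframing reinforcement learning for reasoning as the problem of internalizing outcome supervision into process supervision, so that the model generates and refines its own internal process supervision, reducing reliance on costly external supervision and making fine-grained learning more sustainable and scalable.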

Abstract: The central challenge of reinforcement learning for reasoning lies not only in the sparsity of outcome-level supervision, but more fundamentally in how to transform feedback provided only at the end of a sequence into fine-grained learning signals that can guide intermediate reasoning steps. Existing approaches either rely on outcome-level rewards for sequence-level optimization, which makes precise credit assignment difficult, or depend on externally constructed process supervision, which is costly and difficult to scale sustainably. To address this, we propose a new perspective: reinforcement learning for reasoning can be understood as the problem of internalizing outcome supervision into process supervision. From this perspective, we introduce a supervision-internalization method for reinforcement learning for reasoning, enabling the model to automatically extract process-level learning signals through identifying, correcting, and reusing failed reasoning trajectories, thereby achieving finer-grained policy optimization under outcome-only supervision. We further abstract this idea into a new training paradigm, in which the model continually generates and refines its own internal process supervision during reinforcement learning, opening a new path for fine-grained credit assignment in reinforcement learning for reasoning that differs from externally provided process supervision.


[129] Multi-Dimensional Behavioral Evaluation of Agentic Stock Prediction Systems Using LLM Judges with Closed-Loop Reinforcement Learning Feedback cs.LG | cs.AI | cs.CL | q-fin.CP

Mohammad Al Ridhawi, Mahtab Haj Ali, Hussein Al Osman

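TL;DR: This paper proposes a multi-dimensional framework for evaluating the behavior of agentic stock prediction systems. LLM judges score the system's decision traces, and closed-loop reinforcement learning feedback uses those scores to improve system performance.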

Details

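Motivation: Existing evaluation metrics for stock prediction systems (such as MAPE and directional accuracy) are too aggregated to reveal the individual quality of the system's internal sequential decisions (regime detection, pathway routing, reinforcement learning control), so a fine-grained behavioral evaluation method is needed.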

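Result: In offline backtesting, after three short fine-tuning cycles, one-day MAPE on the 2017-2025 test period fell from 0.61% to 0.54% (an 11.5% relative reduction), directional accuracy rose from 71% to 74%, and the Sharpe ratio improved by 18%. The composite behavioral score correlates with the 20-day Sharpe ratio at ρ = 0.72.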

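Insight: The novelty lies in using LLMs as judges to score system decisions at a fine grain along multiple dimensions, and in folding those scores into the SAC reward as a credit-assigned penalty term to close the reinforcement learning loop, which improves the system's deficiencies on specific behavioral dimensions (especially in high-volatility periods) while maintaining overall performance.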

Abstract: Agentic stock prediction systems make sequences of interdependent decisions (regime detection, pathway routing, reinforcement learning control) whose individual quality is hidden by aggregate metrics such as mean absolute percentage error (MAPE) or directional accuracy. We present a behavioral evaluation framework that addresses this gap. Behavioral traces logged at every autonomous decision point are grouped into five-day episodes and scored along six domain-specific dimensions (regime detection, routing, adaptation, risk calibration, strategy coherence, error recovery) by an ensemble of three large language model (LLM) judges (GPT 5.4, Claude 4.6 Opus, Gemini 3.1 Pro). Perturbation-based validation on 420 episodes yields targeted score drops of $-1.6$ to $-2.4$ on intended dimensions versus an average of $-0.32$ on the remaining five, with cross-model agreement up to Krippendorff’s $α= 0.85$. The composite behavioral score, used here only for cross-episode reporting, correlates at $ρ= 0.72$ with realized 20-day Sharpe ratio from offline backtesting. Closing the loop, the framework converts deficient per-dimension scores into a credit-assigned penalty term added to the Soft Actor-Critic (SAC) reward. Three short fine-tuning cycles, all confined to the validation period, produce on the held-out 2017-2025 test period a one-day MAPE reduction from 0.61% to 0.54% (an 11.5% relative reduction; $p<0.001$, Cohen’s $d=0.31$), a directional accuracy increase from 71% to 74%, and an 18% Sharpe ratio improvement (95% bootstrap CI [8.2%, 27.4%]), with gains concentrated in high-volatility episodes where the original system was most behaviorally deficient. Results are from offline backtesting and do not address effects specific to live deployment.
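The two aggregate metrics named in the abstract, and the quoted relative reduction, are simple to reproduce. The sketch below uses the standard textbook definitions of MAPE and directional accuracy (the paper may compute them with different conventions) and checks that a drop from 0.61% to 0.54% is indeed about 11.5% in relative terms; the price series is made up.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


def directional_accuracy(previous, actual, predicted):
    """Fraction of steps where the predicted move has the same sign as the real move."""
    hits = sum(
        1
        for p0, a, p in zip(previous, actual, predicted)
        if (a - p0) * (p - p0) > 0
    )
    return hits / len(actual)


def relative_reduction(before, after):
    """Relative drop from `before` to `after`, in percent."""
    return 100.0 * (before - after) / before


# Made-up prices: previous close, realized next close, predicted next close.
previous = [100.0, 200.0]
actual = [102.0, 195.0]
predicted = [101.0, 198.0]

error = mape(actual, predicted)
hit_rate = directional_accuracy(previous, actual, predicted)

# Relative reduction implied by the MAPE figures quoted in the abstract.
rel = relative_reduction(0.61, 0.54)
print(f"MAPE={error:.3f}%  directional accuracy={hit_rate:.2f}  relative reduction={rel:.1f}%")
```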


[130] RVPO: Risk-Sensitive Alignment via Variance Regularization cs.LG | cs.CL

Ivan Montero, Tomasz Jurczyk, Bhuwan Dhingra

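TL;DR: This paper proposes RVPO, a risk-sensitive alignment framework that penalizes reward variance to address constraint neglect in multi-objective reinforcement learning alignment, that is, to prevent a model from masking failures on other critical objectives (such as safety or formatting) with a high score on one objective.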

Details

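Motivation: Existing critic-less RLHF methods aggregate multi-objective rewards with an arithmetic mean, which invites constraint neglect: a high score on one objective can mask low scores on other critical objectives, undermining the reliability of multi-objective alignment.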

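Result: On rubric-based medical and scientific reasoning tasks (with up to 17 LLM-judged reward signals) and on tool-calling tasks with rule-based constraints, RVPO significantly improves overall scores on HealthBench (0.261 vs. 0.215 for GDPO at 14B, p < 0.001) and remains competitive on GPQA-Diamond, avoiding the late-stage degradation seen in other multi-reward methods.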

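Insight: The novelty lies in shifting the objective from “maximize sum” to “maximize consistency”, using a LogSumExp operator as a smooth variance penalty that regularizes reward variance and thereby mitigates constraint neglect. The method is effective across model scales without sacrificing general capabilities.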

Abstract: Current critic-less RLHF methods aggregate multi-objective rewards via an arithmetic mean, leaving them vulnerable to constraint neglect: high-magnitude success in one objective can numerically offset critical failures in others (e.g., safety or formatting), masking low-performing “bottleneck” rewards vital for reliable multi-objective alignment. We propose Reward-Variance Policy Optimization (RVPO), a risk-sensitive framework that penalizes inter-reward variance during advantage aggregation, shifting the objective from “maximize sum” to “maximize consistency.” We show via Taylor expansion that a LogSumExp (SoftMin) operator effectively acts as a smooth variance penalty. We evaluate RVPO on rubric-based medical and scientific reasoning with up to 17 concurrent LLM-judged reward signals (Qwen2.5-3B/7B/14B) and on tool-calling with rule-based constraints (Qwen2.5-1.5B/3B). By preventing the model from neglecting difficult constraints to exploit easier objectives, RVPO improves overall scores on HealthBench (0.261 vs. 0.215 for GDPO at 14B, $p < 0.001$) and maintains competitive accuracy on GPQA-Diamond without the late-stage degradation observed in other multi-reward methods, demonstrating that variance regularization mitigates constraint neglect across model scales without sacrificing general capabilities.
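The Taylor-expansion claim can be checked numerically. Assuming the SoftMin takes the normalized form $-\frac{1}{β}\log\big(\frac{1}{K}\sum_k e^{-β r_k}\big)$ (an assumption; the abstract does not give the exact operator), a second-order expansion in $β$ gives approximately the mean minus $\frac{β}{2}$ times the population variance of the rewards. The sketch below compares the two for a small $β$; the reward values are hypothetical.

```python
import math


def softmin(rewards, beta):
    """Normalized LogSumExp soft minimum: -(1/beta) * log(mean(exp(-beta * r)))."""
    k = len(rewards)
    m = min(rewards)
    # Factor out exp(-beta * m) for numerical stability.
    s = sum(math.exp(-beta * (r - m)) for r in rewards) / k
    return m - math.log(s) / beta


def mean_minus_variance(rewards, beta):
    """Second-order Taylor approximation: mean - (beta / 2) * population variance."""
    k = len(rewards)
    mean = sum(rewards) / k
    var = sum((r - mean) ** 2 for r in rewards) / k
    return mean - 0.5 * beta * var


# Hypothetical per-objective rewards with one low "bottleneck" reward.
rewards = [0.9, 0.8, 0.2]
beta = 0.1

exact = softmin(rewards, beta)
approx = mean_minus_variance(rewards, beta)
mean_reward = sum(rewards) / len(rewards)
print(f"mean={mean_reward:.4f} softmin={exact:.4f} approx={approx:.4f}")
```

As $β$ grows the operator moves from the arithmetic mean toward the minimum reward, which is the sense in which it shifts weight onto the worst-performing objective.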


[131] Near-Policy: Accelerating On-Policy Distillation via Asynchronous Generation and Selective Packing cs.LG | cs.CL

Miao Rang, Zhenni Bi, Hang Zhou, Kai Han, Xuechun Wang

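TL;DR: This paper proposes Near-Policy Distillation (NPD), an asynchronous method for accelerating knowledge distillation of autoregressive models. By decoupling student generation from training and introducing selective sequence packing, it mitigates distribution mismatch while substantially improving training efficiency.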

Details

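Motivation: Standard knowledge distillation for autoregressive models often suffers from distribution mismatch. Existing on-policy methods mitigate this but rely on computationally expensive reinforcement learning frameworks and are inefficient. The paper aims to provide a more efficient distillation method.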

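Result: In experiments, the NPD framework achieves an 8.1x speedup over on-policy baselines and outperforms supervised fine-tuning (SFT) by 8.09%. In addition, the method enables openPangu-Embedded-1B to reach a state-of-the-art score of 68.73% on a particular benchmark, surpassing the larger Qwen3-1.7B.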

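Insight: The core novelty is the asynchronous design that decouples generation from training, together with the Δ-IFD filtering mechanism. As a heuristic sample selection method, Δ-IFD filters out extreme out-of-distribution samples, stabilizing the optimization trajectory and keeping updates within a safe proximal learning zone, which preserves policy stability while improving efficiency.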

Abstract: Standard knowledge distillation for autoregressive models often suffers from distribution mismatch. While on-policy methods mitigate this by leveraging student-generated outputs, they rely on computationally expensive Reinforcement Learning (RL) frameworks. To improve efficiency, we propose Near-Policy Distillation (NPD), an asynchronous approach that decouples student generation from training. This reformulation enables Supervised Fine-Tuning (SFT) with sequence packing. However, asynchronous updates inevitably introduce policy lag and sample noise, which can cause the behavior to drift from near-policy toward off-policy. To counteract this without sacrificing efficiency, NPD integrates sparse student updates and the $Δ$-IFD filtering mechanism, a heuristic sample selection mechanism that empirically stabilizes the optimization trajectory. By filtering extreme out-of-distribution samples, $Δ$-IFD prevents noise from dominating the gradients, ensuring updates remain within a safe proximal learning zone. Empirically, the NPD framework achieves an 8.1x speedup over on-policy baselines and outperforms SFT by 8.09%. Crucially, by effectively narrowing the exploration space for subsequent RL, our method enables openPangu-Embedded-1B to reach a state-of-the-art score of 68.73%, outperforming the substantially larger Qwen3-1.7B. Code will be released soon.
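Sequence packing, which the decoupled formulation makes available, concatenates several short samples into one fixed-length training row so that fewer tokens are wasted on padding. The sketch below uses a generic first-fit-decreasing heuristic as a stand-in; it is not NPD's selective packing, it does not implement the $Δ$-IFD filter (whose definition the abstract does not give), and the lengths are made up.

```python
def pack_sequences(lengths, max_len):
    """Group sample indices into rows whose total length is at most max_len.

    First-fit decreasing: visit samples from longest to shortest and place each
    into the first row that still has room, opening a new row if none does.
    """
    rows = []  # each entry: [remaining_capacity, [sample indices]]
    order = sorted(range(len(lengths)), key=lambda i: -lengths[i])
    for i in order:
        n = lengths[i]
        if n > max_len:
            raise ValueError(f"sample {i} (length {n}) exceeds max_len={max_len}")
        for row in rows:
            if row[0] >= n:
                row[0] -= n
                row[1].append(i)
                break
        else:
            rows.append([max_len - n, [i]])
    return [row[1] for row in rows]


# Hypothetical token counts for five student-generated samples.
lengths = [700, 300, 512, 512, 200]
packed = pack_sequences(lengths, max_len=1024)
print(packed)
```

A real packed batch also needs attention masks or position resets so that samples sharing a row do not attend to each other; that part is omitted here.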


[132] Contrastive Identification and Generation in the Limit cs.LG | cs.AI | cs.CL | cs.DS

Xiaoyu Li, Andi Han, Jiaojiao Jiang, Junbin Gao

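TL;DR: This paper studies contrastive identification and generation in the limit, where the learner receives unordered contrastive pairs (two elements with different labels, with the labels themselves hidden) rather than the usual positive-only or fully labeled data. In the noiseless setting, it gives an exact characterization of contrastively identifiable classes, introduces the contrastive closure dimension, and analyzes the sample complexity of uniform contrastive generation. It then proves that under finite adversarial corruption, contrastive identification is more robust than identification from positive examples.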

Details

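Motivation: Classical models of identification and generation in the limit handle only positive-only or fully labeled data, yet many practical supervision signals are relational (such as contrastive pairs) rather than labels on individual examples. The paper explores identification and generation in the limit from contrastive pairs (two elements whose labels differ but are hidden), a setting closer to the contrastive supervision common in practice.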

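Result: In the noiseless setting: (1) an exact characterization of contrastively identifiable classes (a geometric refinement of Angluin's tell-tale condition); (2) a contrastive closure dimension that characterizes the sample complexity of uniform contrastive generation; (3) a proof that contrastive generation and text identification are mutually incomparable. Under finite adversarial corruption: there exist classes that a single algorithm can identify from contrastive pairs under any finite corruption budget, yet that cannot be identified from positive examples under even one corrupted observation.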

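Insight: The novelty lies in extending the in-the-limit learning framework to contrastive supervision for the first time and in introducing the common crossing graph as a unifying technical tool that encodes pairwise ambiguity, family-level generation obstructions, and corruption defects. Viewed objectively, the work connects contrastive learning with classical computational learning theory, providing a new theoretical framework for relational data and showing a potential advantage of contrastive presentations in robustness to noise.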

Abstract: In the classical identification in the limit model of Gold [1967], a stream of positive examples is presented round by round, and the learner must eventually recover the target hypothesis. Recently, Kleinberg and Mullainathan [2024] introduced generation in the limit, where the learner instead must eventually output novel elements of the target’s support. Both lines of work focus on positive-only or fully labeled data. Yet many natural supervision signals are inherently relational rather than singleton: they encode relationships between examples rather than labels of individual ones. We initiate the study of contrastive identification and generation in the limit, where the learner observes a contrastive presentation of data: a stream of unordered pairs $\{x,y\}$ satisfying $h(x)\ne h(y)$ for an unknown target binary hypothesis $h$, but which element is positive is hidden from the learner. We first present three results in the noiseless setting: an exact characterization of contrastive identifiable classes (a one-line geometric refinement of Angluin [1980]’s tell-tale condition), a combinatorial dimension called contrastive closure dimension (a contrastive analogue of the closure dimension in Raman et al. [2025]) that exactly characterizes uniform contrastive generation with tight sample complexity, and a strict hierarchy in which contrastive generation and text identification are mutually incomparable. We then prove a sharp reversal under finite adversarial corruption: there exist classes identifiable from contrastive pairs under any finite corruption budget by a single budget-independent algorithm, yet not identifiable from positive examples under even one corrupted observation. The unifying technical object is the common crossing graph, which encodes pairwise ambiguity, family-level generation obstructions, and corruption defects in a single coverage-and-incidence language.
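A contrastive pair $\{x,y\}$ rules out exactly those hypotheses that give $x$ and $y$ the same label. The toy sketch below filters a small class of threshold hypotheses on the integers by that rule; the hypothesis class and the pairs are assumptions chosen for illustration, and the code shows only the consistency check, not the paper's learners or characterizations.

```python
def consistent(h, pairs):
    """A hypothesis survives if it labels the two elements of every pair differently."""
    return all(h(x) != h(y) for x, y in pairs)


def threshold(t):
    """Binary hypothesis h_t(x) = [x >= t]."""
    return lambda x: x >= t


# Toy hypothesis class: thresholds t = 0..5.
hypotheses = {t: threshold(t) for t in range(6)}

# Contrastive presentation: unordered pairs known only to have different labels.
pairs = [(1, 4), (2, 3)]

surviving = [t for t, h in hypotheses.items() if consistent(h, pairs)]
print(surviving)

# A hypothesis and its complement agree on every contrastive pair, so pairs
# alone can never tell which side is the positive one.
h3 = hypotheses[3]
complement_ok = consistent(lambda x: not h3(x), pairs)
```

The last check illustrates why the hidden orientation matters: `complement_ok` is true whenever the original hypothesis is consistent.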


[133] E = T*H/(O+B): A Dimensionless Control Parameter for Mixture-of-Experts Ecology cs.LG | cs.AI | cs.CL | cs.CV

Qingjun Zhang

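TL;DR: This paper proposes a dimensionless control parameter, E = T*H/(O+B), that predicts whether a Mixture-of-Experts model will develop a healthy expert ecology or collapse into dead experts. Experiments totaling over 11,000 training epochs show that E >= 0.5 is sufficient to guarantee zero dead experts without handcrafted load-balancing auxiliary losses.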

Details

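Motivation: To address expert ecological health in Mixture-of-Experts models, avoiding dead experts and reducing reliance on handcrafted load-balancing losses.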

Result: Validated on CIFAR-10, CIFAR-100, TinyImageNet-200, WikiText-2, and WikiText-103, where E >= 0.5 guarantees zero dead experts.

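Insight: The novelty lies in combining four hyperparameters into a single control parameter E that serves as a unified diagnostic for MoE training, analogous to the Reynolds number in fluid dynamics. The paper also reports several new findings about expert ecology, such as that dead experts can resuscitate and that ecological health is decoupled from overfitting.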

Abstract: We introduce E = T*H/(O+B), a dimensionless control parameter that predicts whether Mixture-of-Experts (MoE) models will develop a healthy expert ecology or collapse into dead experts. E combines four hyperparameters – routing temperature T, routing entropy weight H, oracle weight O, and balance weight B – into a single quantity. Through 12 controlled experiments (8 vision, 4 language) totaling over 11,000 training epochs, we establish that E >= 0.5 alone is sufficient to guarantee zero dead experts, removing the necessity for handcrafted load-balancing auxiliary losses. We validate this cross-modally on CIFAR-10, CIFAR-100, TinyImageNet-200, WikiText-2, and WikiText-103. Six additional findings emerge: (1) dead experts can resuscitate – triggered by balance loss driving router re-exploration; (2) ortho toxicity is dataset-dependent, not universal; (3) task complexity shifts the critical E threshold; (4) model overfitting is decoupled from expert ecological health; (5) three-tier MoE spontaneously collapses into a two-tier functional structure; (6) ecological structure is temperature-invariant across a 50x range. We propose that E serves as a unified diagnostic for MoE training, analogous to the Reynolds number in fluid dynamics.
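Computing the parameter and checking it against the reported threshold is a one-line formula. The sketch below wraps it in a small helper; the function names and the hyperparameter values are hypothetical, and the 0.5 threshold is the one stated in the abstract (which also notes that task complexity can shift the critical value).

```python
def ecology_parameter(temperature, entropy_weight, oracle_weight, balance_weight):
    """E = T * H / (O + B), following the definition in the abstract."""
    denominator = oracle_weight + balance_weight
    if denominator <= 0:
        raise ValueError("O + B must be positive for E to be defined")
    return temperature * entropy_weight / denominator


def predicts_healthy_ecology(e_value, threshold=0.5):
    """Apply the E >= 0.5 criterion reported in the abstract."""
    return e_value >= threshold


# Hypothetical hyperparameter settings (not taken from the paper).
e_high = ecology_parameter(temperature=2.0, entropy_weight=0.1, oracle_weight=0.2, balance_weight=0.1)
e_low = ecology_parameter(temperature=1.0, entropy_weight=0.01, oracle_weight=0.2, balance_weight=0.1)
print(f"E={e_high:.3f} healthy={predicts_healthy_ecology(e_high)}")
print(f"E={e_low:.3f} healthy={predicts_healthy_ecology(e_low)}")
```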


[134] PairAlign: A Framework for Sequence Tokenization via Self-Alignment with Applications to Audio Tokenization cs.LG | cs.CL | cs.SD

Adhiraj Banerjee, Vipul Arora

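TL;DR: PairAlign is a framework for compact audio tokenization through sequence-level self-alignment. It treats tokenization as conditional sequence generation: an encoder maps speech to a continuous condition, and an autoregressive decoder generates tokens, learning token identity, order, length, and termination. Training on content-preserving views enforces sequence consistency and discourages many-to-one collapse, and the tokenization is refined with EMA-teacher targets, cross-paired teacher forcing, prefix corruption, likelihood contrast, and length control.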

Details

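Motivation: Existing audio tokenization methods (quantization, clustering, or codec reconstruction) typically assign tokens locally and do not directly optimize sequence consistency, compactness, length control, termination, or edit similarity. PairAlign addresses these gaps through sequence-level self-alignment, aiming for more compact and consistent audio tokenization.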

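Result: On 3-second speech, PairAlign learns compact, non-degenerate sequences with broad vocabulary usage and strong cross-view consistency. On TIMIT retrieval, it preserves edit-distance search while reducing archive token count by 55%. A continuous-sweep probe shows lower local overlap than a dense geometric tokenizer, but stronger length control and bounded edit trajectories under 100 ms shifts.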

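Insight: The novelties include treating tokenization as conditional sequence generation and optimizing sequence-level properties through self-alignment; training on content-preserving views with competing sequences as a scalable surrogate for edit-distance preservation; and combining several techniques (such as EMA-teacher targets and cross-paired teacher forcing) to improve performance. Viewed objectively, the approach resembles JEPA-style objectives but predicts the abstract target as a variable-length symbolic sequence rather than a continuous latent, offering a new angle on sequence-symbolic predictive learning.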

Abstract: Many operations on sensory data – comparison, memory, retrieval, and reasoning – are naturally expressed over discrete symbolic structures. In language this interface is given by tokens; in audio, it must be learned. Existing audio tokenizers rely on quantization, clustering, or codec reconstruction, assigning tokens locally, so sequence consistency, compactness, length control, termination, and edit similarity are rarely optimized directly. We introduce PairAlign, a framework for compact audio tokenization through sequence-level self-alignment. PairAlign treats tokenization as conditional sequence generation: an encoder maps speech to a continuous condition, and an autoregressive decoder generates tokens from BOS, learning token identity, order, length, and EOS placement. Given two content-preserving views, each view’s sequence is trained to be likely under the other’s representation, while unrelated examples provide competing sequences. This gives a scalable surrogate for edit-distance preservation while discouraging many-to-one collapse. PairAlign starts from VQ-style tokenization and refines it with EMA-teacher targets, cross-paired teacher forcing, prefix corruption, likelihood contrast, and length control. On 3-second speech, PairAlign learns compact, non-degenerate sequences with broad vocabulary usage and strong cross-view consistency. On TIMIT retrieval, it preserves edit-distance search while reducing archive token count by 55%. A continuous-sweep probe shows lower local overlap than a dense geometric tokenizer, but stronger length control and bounded edit trajectories under 100 ms shifts. PairAlign is a sequence-symbolic predictive learner: like JEPA-style objectives, it predicts an abstract target from another view as a learned variable-length symbolic sequence, not a continuous latent.
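Edit-distance search over token sequences, which the TIMIT retrieval result relies on, can be sketched with a standard Levenshtein distance and a nearest-neighbor lookup. The archive contents, token ids, and the `nearest` helper below are hypothetical; the paper's retrieval setup may differ.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (insert, delete, substitute cost 1)."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        current = [i]
        for j, y in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,             # delete x
                current[j - 1] + 1,          # insert y
                previous[j - 1] + (x != y),  # substitute or match
            ))
        previous = current
    return previous[-1]


def nearest(archive, query):
    """Return (key, distance) of the archived token sequence closest to the query."""
    key = min(archive, key=lambda k: edit_distance(archive[k], query))
    return key, edit_distance(archive[key], query)


# Hypothetical archive of tokenized utterances (token ids are made up).
archive = {
    "utt_a": [12, 7, 7, 3],
    "utt_b": [5, 5, 9],
    "utt_c": [12, 7, 3, 9],
}
query = [12, 7, 7, 3, 1]
best_key, best_distance = nearest(archive, query)
print(best_key, best_distance)
```

Shorter token sequences make each distance computation cheaper, since the dynamic program costs time proportional to the product of the two lengths.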


[135] Recursive Agent Optimization cs.LG | cs.AI | cs.CL | cs.MA

Apurva Gandhi, Satyaki Chakraborty, Xiangjun Wang, Aviral Kumar, Graham Neubig

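TL;DR: This paper proposes Recursive Agent Optimization (RAO), a reinforcement learning approach for training recursive agents, that is, agents that can recursively spawn new instances of themselves and delegate sub-tasks to them. Recursive agents implement an inference-time scaling algorithm that, via divide-and-conquer, naturally scales to longer contexts and generalizes to harder problems. Recursive agents trained with RAO outperform single-agent systems in training efficiency, scaling beyond the context window, task generalization, and wall-clock time.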

Details

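Motivation: To address agents' limited scalability and generalization when handling long contexts and hard problems in complex tasks, by using recursive delegation to implement divide-and-conquer.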

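Result: On benchmarks that the abstract does not name, recursive agents show better training efficiency, scale to tasks beyond the model's context window, generalize to problems much harder than those seen in training, and reduce wall-clock time.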

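Insight: The novelty lies in treating recursive delegation as an inference-time scaling algorithm and using reinforcement learning to teach agents when and how to delegate and communicate, so that divide-and-conquer scales naturally and improves generalization.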

Abstract: We introduce Recursive Agent Optimization (RAO), a reinforcement learning approach for training recursive agents: agents that can spawn and delegate sub-tasks to new instantiations of themselves recursively. Recursive agents implement an inference-time scaling algorithm that naturally allows agents to scale to longer contexts and generalize to more difficult problems via divide-and-conquer. RAO provides a method to train models to best take advantage of such recursive inference, teaching agents when and how to delegate and communicate. We find that recursive agents trained in this way enjoy better training efficiency, can scale to tasks that go beyond the model’s context window, generalize to tasks much harder than the ones the agent was trained on, and can enjoy reduced wall-clock time compared to single-agent systems.
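The control flow of a recursive agent can be illustrated without any model at all. In the toy sketch below, an "agent" can only look at `context_limit` tokens at once; anything larger is split in half and delegated to two new instances, whose answers are then combined. The fixed split-in-half rule is a hand-written stand-in for the delegation policy that RAO would learn, and the counting task is invented for the example.

```python
def recursive_agent(tokens, target, context_limit, trace):
    """Count occurrences of `target`, delegating whenever the task exceeds the context.

    `trace` records the size of every sub-task that was solved directly.
    """
    if context_limit < 1:
        raise ValueError("context_limit must be at least 1")
    if len(tokens) <= context_limit:
        # Leaf: the whole sub-task fits in one context window.
        trace.append(len(tokens))
        return sum(1 for t in tokens if t == target)
    # Delegate: spawn two child instances on the halves and combine their answers.
    mid = len(tokens) // 2
    left = recursive_agent(tokens[:mid], target, context_limit, trace)
    right = recursive_agent(tokens[mid:], target, context_limit, trace)
    return left + right


# A 1000-token "document" handled by agents that can each see only 64 tokens.
document = ["a", "b", "a", "c"] * 250
trace = []
answer = recursive_agent(document, "a", context_limit=64, trace=trace)
print(answer, len(trace), max(trace))
```

Because the two child calls are independent, they could run in parallel, which is one plausible source of the wall-clock savings mentioned in the abstract.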


[136] Verifier-Backed Hard Problem Generation for Mathematical Reasoning cs.LG | cs.AI | cs.CL

Yuhang Lai, Jiazhan Feng, Yee Whye Teh, Ning Miao

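TL;DR: This paper proposes VHG, a verifier-enhanced hard problem generation framework built on three-party self-play, which introduces an independent verifier to constrain the setter's reward and thereby generates valid, challenging mathematical problems.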

Details

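Motivation: Large language models are strong at solving scientific and mathematical problems but struggle to generate valid, challenging, and novel ones, which is key to advancing LLM training and enabling autonomous scientific research. Existing approaches either depend on expensive human expert involvement or adopt naive self-play paradigms that often produce invalid problems because of reward hacking.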

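Result: Experiments on indefinite integral tasks and general mathematical reasoning tasks show that VHG substantially outperforms all baseline methods by a clear margin.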

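Insight: The novelty lies in extending the conventional setter-solver duality into a three-party self-play framework with an independent verifier: the setter's reward is jointly determined by problem validity (evaluated by the verifier) and difficulty (assessed by the solver), which mitigates reward hacking and yields high-quality hard problems. Viewed objectively, building verification into the generation process offers a reusable design for controlling reward hacking and validity in similar generation tasks.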

Abstract: Large Language Models (LLMs) demonstrate strong capabilities for solving scientific and mathematical problems, yet they struggle to produce valid, challenging, and novel problems - an essential component for advancing LLM training and enabling autonomous scientific research. Existing problem generation approaches either depend on expensive human expert involvement or adopt naive self-play paradigms, which frequently yield invalid problems due to reward hacking. This work introduces VHG, a verifier-enhanced hard problem generation framework built upon three-party self-play. By integrating an independent verifier into the conventional setter-solver duality, our design constrains the setter’s reward to be jointly determined by problem validity (evaluated by the verifier) and difficulty (assessed by the solver). We instantiate two verifier variants: a Hard symbolic verifier and a Soft LLM-based verifier, with evaluations conducted on indefinite integral tasks and general mathematical reasoning tasks. Experimental results show that VHG substantially outperforms all baseline methods by a clear margin.
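For indefinite integrals, validity can be checked by differentiating the proposed antiderivative and comparing it with the integrand. The sketch below does this numerically with central differences, which is a swapped-in stand-in for the paper's Hard symbolic verifier, and then gates a setter reward on the result. The reward form (zero if invalid, otherwise one minus the solver's success rate) is an assumption; the abstract says only that validity and difficulty jointly determine the reward.

```python
import math


def is_valid_antiderivative(f, F, points, h=1e-5, tol=1e-6):
    """Numerically check F'(x) == f(x) at the sample points via central differences."""
    for x in points:
        numeric_derivative = (F(x + h) - F(x - h)) / (2 * h)
        if abs(numeric_derivative - f(x)) > tol:
            return False
    return True


def setter_reward(valid, solver_success_rate):
    """Assumed reward shape: invalid problems earn nothing; harder valid ones earn more."""
    if not valid:
        return 0.0
    return 1.0 - solver_success_rate


def integrand(x):
    return 2 * x * math.cos(x ** 2)


def claimed_answer(x):
    return math.sin(x ** 2)          # correct antiderivative of the integrand


def wrong_answer(x):
    return math.sin(x ** 2) + x      # derivative is off by 1


points = [0.3, 1.1, 2.0]
valid = is_valid_antiderivative(integrand, claimed_answer, points)
invalid = is_valid_antiderivative(integrand, wrong_answer, points)

# Hypothetical solver success rate of 25% on this problem.
reward = setter_reward(valid, solver_success_rate=0.25)
print(valid, invalid, reward)
```

Gating on validity is what blocks the reward-hacking route in which the setter earns a high difficulty score by posing an unsolvable or ill-formed problem.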


[137] From Drops to Grid: Noise-Aware Spatio-Temporal Neural Process for Rainfall Estimation cs.LG | cs.CV

Rafael Pablos Sarabia, Joachim Nyborg, Morten Birk, Ira Assent

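TL;DR: This paper proposes DropsToGrid, a Neural Process-based model that generates high-resolution, dense rainfall field estimates from sparse, noisy private weather station observations and spatial context from radar. It combines multi-scale feature extraction, temporal attention, and multi-modal fusion to produce stochastic, continuous rainfall estimates with explicitly quantified uncertainty.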

Details

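Motivation: Traditional operational rainfall measurements are biased and low-resolution, so they capture local rainfall poorly. Existing deep learning densification methods are hindered by rainfall's skewed, localized nature, by noise, and by limited spatio-temporal fusion. A method is therefore needed that effectively integrates sparse surface observations with radar data to produce accurate high-resolution rainfall maps.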

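Result: Evaluations on real-world datasets show that DropsToGrid outperforms existing operational methods and deep learning baselines at generating high-resolution rainfall maps, producing accurate estimates with well-calibrated uncertainty even when only a few stations are available and in cross-regional scenarios.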

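Insight: The novelty lies in applying the Neural Process framework to rainfall estimation, modeling rainfall's spatio-temporal dynamics and uncertainty through multi-scale feature extraction, temporal attention, and multi-modal (station and radar) fusion. Viewed objectively, the method combines probabilistic modeling with deep learning, offering an interpretable, uncertainty-quantifying approach to sparse, noisy spatio-temporal data.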

Abstract: High-resolution rainfall observations are crucial for weather forecasting, water management, and hazard mitigation. Traditional operational measurements are often biased and low-resolution, limiting their ability to capture local rainfall. Accurate high-resolution rainfall maps require integrating sparse surface observations, yet existing deep learning densification methods are hindered by rainfall’s skewed, localized nature, noise, and limited spatio-temporal fusion. We present DropsToGrid, a Neural Process-based method that generates dense rainfall fields by fusing temporal sequences from noisy, irregularly distributed private weather stations with spatial context from radar. Leveraging multi-scale feature extraction, temporal attention, and multi-modal fusion, the model produces stochastic, continuous rainfall estimates and explicitly quantifies uncertainty. Evaluations on real-world datasets demonstrate that DropsToGrid outperforms both operational and deep learning baselines, generating accurate high-resolution rainfall maps with well-calibrated uncertainty, even when only a few stations are available and in cross-regional scenarios.


[138] Towards Self-Explainable Document Visual Question Answering with Chain-of-Explanation Predictions cs.LG | cs.CV

Kjetil Indrehus, Adrian Duric, Changkyu Choi, Ali Ramezani-Kebrya

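TL;DR: This paper proposes CoExVQA, a self-explainable document visual question answering framework that achieves evidence-grounded reasoning through a chain-of-explanation design. The framework first identifies question-relevant evidence, then explicitly localizes the answer region, and finally decodes the answer from the localized region only, providing a verifiable reasoning process.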

Details

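Motivation: Existing DocVQA models entangle question-relevant evidence with answer localization and are mostly black boxes, making it hard to verify how predictions depend on visual evidence. An explainable, verifiable reasoning approach is therefore needed.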

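Result: On the PFL-DocVQA benchmark, CoExVQA achieves state-of-the-art explainable DocVQA performance, improving ANLS by 12% over current explainable baselines while providing transparent and verifiable predictions.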

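Insight: The novelty lies in decomposing reasoning into three interpretable chained steps (evidence identification, answer-region localization, and answer decoding) and forcing the model to decode the answer only from the localized region, which yields self-explainability and verifiability. This modular design helps improve both transparency and performance.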

Abstract: Document Visual Question Answering (DocVQA) requires vision-language models to reason not only about what information in a document is relevant to a question, but also where the answer is grounded on the page. Existing DocVQA models entangle question-relevant evidence and answer localization and operate largely as black boxes, offering limited means to verify how predictions depend on visual evidence. We propose CoExVQA, a self-explainable DocVQA framework with a grounded reasoning process through a chain-of-explanation design. CoExVQA first identifies question-relevant evidence, then explicitly localizes the answer region, and finally decodes the answer exclusively from the grounded region. Prediction via CoExVQA’s chain-of-explanation enables direct inspection and verification of the reasoning process across modalities. Empirical results show that restricting decoding to grounded evidence achieves SotA explainable DocVQA performance on PFL-DocVQA, improving ANLS by 12% over the current explainable baselines while providing transparent and verifiable predictions.
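
The headline number above is reported in ANLS (Average Normalized Levenshtein Similarity), the usual DocVQA metric. As a reference point, here is a minimal sketch of the standard definition: each prediction scores 1 minus its normalized edit distance to the closest ground-truth answer, and scores are zeroed once that distance reaches a threshold of 0.5. Lowercasing and whitespace stripping are assumptions borrowed from common implementations; the exact normalization used for PFL-DocVQA is not stated in the abstract.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insert, delete, substitute each cost 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, free if equal
            ))
        prev = curr
    return prev[-1]


def anls(predictions, references, threshold=0.5):
    """Average Normalized Levenshtein Similarity.

    predictions -- list of predicted answer strings
    references  -- list of lists of acceptable ground-truth answers
    """
    scores = []
    for pred, answers in zip(predictions, references):
        p = pred.strip().lower()
        best = 0.0
        for ans in answers:
            g = ans.strip().lower()
            longest = max(len(p), len(g))
            nl = levenshtein(p, g) / longest if longest else 0.0
            # Answers that are too far from the ground truth score zero.
            best = max(best, 1.0 - nl if nl < threshold else 0.0)
        scores.append(best)
    return sum(scores) / len(scores) if scores else 0.0
```

For example, the prediction "hallo" against the ground truth "hello" has edit distance 1 over length 5, so it scores 0.8 rather than 0, which is how the metric tolerates small OCR-style errors.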


[139] Hyperbolic Concept Bottleneck Models cs.LG | cs.CV

Daniel Uyterlinde, Swasti Shreya Mishra, Pascal Mettes

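TL;DR: This paper proposes Hyperbolic Concept Bottleneck Models (HypCBM), a post-hoc framework that grounds the concept bottleneck in semantic hierarchies by reformulating concept activation as asymmetric geometric containment in hyperbolic space. The method uses entailment cones as a test-time activation signal, producing sparse, hierarchy-aware activations without additional supervision, and introduces an adaptive scaling law for hierarchically faithful interventions. Experiments show that, in the sparse regimes required for human interpretability, HypCBM rivals post-hoc Euclidean models trained on 20× more data, with stronger hierarchical consistency and robustness to input corruptions.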

Details

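Motivation: Existing Concept Bottleneck Models (CBMs) embed concepts in flat Euclidean space and treat them as independent, orthogonal dimensions, which does not match the highly structured, semantically hierarchical nature of concepts. This paper aims to resolve that mismatch.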

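Result: In sparse regimes, HypCBM performs on par with post-hoc Euclidean models trained on 20× more data, while showing stronger hierarchical consistency and robustness to input corruptions.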

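Insight: The core novelty is reformulating concept activation as geometric containment in hyperbolic space, using entailment cones as a natural test-time activation signal so that sparse, hierarchy-aware activations arise without additional supervision. In addition, the proposed adaptive scaling law ensures that interventions propagate in a hierarchically consistent way. Viewed objectively, the method combines the hierarchical representation power of hyperbolic geometry with concept bottleneck models, offering a new geometric perspective on structured relations among concepts.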

Abstract: Concept Bottleneck Models (CBMs) have become a popular approach to enable interpretability in neural networks by constraining classifier inputs to a set of human-understandable concepts. While effective, current models embed concepts in flat Euclidean space, treating them as independent, orthogonal dimensions. Concepts, however, are highly structured and organized in semantic hierarchies. To resolve this mismatch, we propose Hyperbolic Concept Bottleneck Models (HypCBM), a post-hoc framework that grounds the bottleneck in this structure by reformulating concept activation as asymmetric geometric containment in hyperbolic space. Rather than treating entailment cones as a pre-training penalty, we show they encode a natural test-time activation signal: the margin of inclusion within a concept’s entailment cone yields sparse, hierarchy-aware activations without any additional supervision or learned modules. We further introduce an adaptive scaling law for hierarchically faithful interventions, propagating user corrections coherently through the concept tree. Empirically, HypCBM rivals post-hoc Euclidean models trained on 20$\times$ more data in sparse regimes required for human interpretability, with stronger hierarchical consistency and improved robustness to input corruptions.
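
To make "margin of inclusion within a concept's entailment cone" concrete, the sketch below uses the Poincaré-ball entailment cones of Ganea et al. (2018): a concept embedding `c` defines a cone with half-aperture ψ(c), and an input embedding `z` lies inside it when the exterior angle Ξ(c, z) is at most ψ(c). The abstract does not say which model of hyperbolic space or which cone parameterization HypCBM uses, so the formulas, the constant `K`, and the function names here are assumptions for illustration only.

```python
import math


def _dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def half_aperture(c, K=0.1):
    """Half-aperture psi(c) = arcsin(K * (1 - |c|^2) / |c|) of the cone at c."""
    r = math.sqrt(_dot(c, c))
    if r == 0.0:
        raise ValueError("the cone apex must not be the origin")
    # The cone is only defined where the arcsin argument is <= 1; clamp otherwise.
    return math.asin(min(1.0, K * (1.0 - r * r) / r))


def exterior_angle(c, z):
    """Angle Xi(c, z) between the cone axis at c and the geodesic from c to z."""
    cc, zz, cz = _dot(c, c), _dot(z, z), _dot(c, z)
    if cc == 0.0:
        raise ValueError("the cone apex must not be the origin")
    dist = math.sqrt(max(0.0, cc + zz - 2.0 * cz))  # Euclidean |c - z|
    if dist == 0.0:
        return 0.0  # z coincides with the apex
    num = cz * (1.0 + cc) - cc * (1.0 + zz)
    den = math.sqrt(cc) * dist * math.sqrt(max(0.0, 1.0 + cc * zz - 2.0 * cz))
    return math.acos(max(-1.0, min(1.0, num / den)))


def cone_activation(c, z, K=0.1):
    """Hypothetical concept activation: positive margin inside the cone, else 0."""
    return max(0.0, half_aperture(c, K) - exterior_angle(c, z))
```

Because the activation is exactly zero for every input outside a concept's cone, sparsity falls out of the geometry rather than from a learned gate, and the asymmetry (z inside the cone of c does not imply c inside the cone of z) is what encodes the direction of the hierarchy.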


[140] SoftSAE: Dynamic Top-K Selection for Adaptive Sparse Autoencoders cs.LG | cs.CV

Jakub Stępień, Marcin Mazur, Jacek Tabor, Przemysław Spurek

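TL;DR: This paper proposes SoftSAE, an adaptive sparse autoencoder with a Dynamic Top-K selection mechanism that adjusts the number of active features to the complexity of each input. This addresses the inability of conventional sparse autoencoders with fixed sparsity (such as TopK SAEs) to adapt to variation in the intrinsic dimensionality of the data.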

Details

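Motivation: Conventional sparse autoencoders (SAEs) are used in mechanistic interpretability to decompose polysemantic activations into sparse monosemantic features, but their fixed sparsity level (K) ignores differences in sample complexity (that is, local intrinsic dimensionality) in real data, which introduces noise for simple inputs or loses important structure for complex ones.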

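Result: Experimental results show that SoftSAE not only discovers meaningful concept features but also adaptively selects an appropriate number of features for each concept, so that the representation better matches the structure of the data and reflects the amount of information in the input.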

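Insight: The novelty lies in introducing a differentiable Soft Top-K operator that learns an input-dependent, dynamic sparsity level, letting the model adjust the number of active features to the complexity of each input. This gives sparse representation learning a more flexible approach that better fits the structure of the data manifold.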

Abstract: Sparse Autoencoders (SAEs) have become an important tool in mechanistic interpretability, helping to analyze internal representations in both Large Language Models (LLMs) and Vision Transformers (ViTs). By decomposing polysemantic activations into sparse sets of monosemantic features, SAEs aim to translate neural network computations into human-understandable concepts. However, common architectures such as TopK SAEs rely on a fixed sparsity level. They enforce the same number of active features (K) across all inputs, ignoring the varying complexity of real-world data. Natural data often lies on manifolds with varying local intrinsic dimensionality, meaning the number of relevant factors can change significantly across samples. This suggests that a fixed sparsity level is not optimal. Simple inputs may require only a few features, while more complex ones need more expressive representations. Using a constant K can therefore introduce noise in simple cases or miss important structure in more complex ones. To address this issue, we propose SoftSAE, a sparse autoencoder with a Dynamic Top-K selection mechanism. Our method uses a differentiable Soft Top-K operator to learn an input-dependent sparsity level k. This allows the model to adjust the number of active features based on the complexity of each input. As a result, the representation better matches the structure of the data, and the explanation length reflects the amount of information in the input. Experimental results confirm that SoftSAE not only finds meaningful features, but also selects the right number of features for each concept. The source code is available at: https://anonymous.4open.science/r/SoftSAE-8F71/.
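
The contrast between a fixed TopK and a differentiable relaxation can be sketched in a few lines. The abstract does not specify SoftSAE's Soft Top-K operator, so the version below is a generic stand-in: a sigmoid gate around a threshold placed midway between the k-th and (k+1)-th largest activations. The function names, the `temperature` parameter, and the integer `k` are assumptions; in particular, the learned, input-dependent k that is the paper's contribution is not reproduced here.

```python
import math


def _sigmoid(x: float) -> float:
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def hard_top_k(acts, k):
    """Fixed-sparsity TopK: keep the k largest activations, zero the rest."""
    if k <= 0:
        return [0.0] * len(acts)
    order = sorted(range(len(acts)), key=lambda i: acts[i], reverse=True)
    keep = set(order[:k])
    return [a if i in keep else 0.0 for i, a in enumerate(acts)]


def soft_top_k(acts, k, temperature=0.1):
    """Sigmoid relaxation of TopK (illustrative, not the SoftSAE operator).

    Each activation is scaled by a gate in (0, 1) that depends smoothly on its
    distance to the threshold, so gradients reach every activation.
    Requires 1 <= k < len(acts).
    """
    if not 1 <= k < len(acts):
        raise ValueError("k must satisfy 1 <= k < len(acts)")
    ranked = sorted(acts, reverse=True)
    threshold = 0.5 * (ranked[k - 1] + ranked[k])
    return [a * _sigmoid((a - threshold) / temperature) for a in acts]
```

As `temperature` shrinks, the soft gate approaches the hard selection (when activations are distinct), while larger temperatures let borderline features contribute partially. That smoothness is what makes it possible, in principle, to treat the sparsity level as something to learn rather than fix.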