Table of Contents

cs.CL

[1] How Frontier LLMs Adapt to Neurodivergence Context: A Measurement Framework for Surface vs. Structural Change in System-Prompted Responses cs.CL | cs.AI | cs.HC

Ishan Gupta, Pavlo Buryi

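TL;DR: This paper introduces NDBench, a benchmark for evaluating how frontier chat LLMs adjust their outputs when the system prompt contains neurodivergence (ND) context. The models adapt significantly under ND context, producing longer and more structured responses when given explicit instructions. The adaptation is mainly structural, and an ND persona assertion alone is not enough to suppress potentially harmful tendencies.

Details

Motivation: The study asks whether frontier LLMs adjust their outputs to ND context in the system prompt and characterizes the nature of those adjustments, in order to assess how well models adapt to ND users and what risks remain.

Result: On NDBench (576 outputs), fully instructed conditions yield longer and more structured outputs, with significantly higher token counts, more headings, and more granular steps (p < 10^-8). The adaptation is structural: heading frequency and per-step detail rise while list density changes little. ND persona assertion alone does not reduce harmful tendencies; masking-reinforcement drops by 36-44% only when explicit instructions are given. In the LLM-based harm assessment, only two dimensions (masking and reinforcement, validation quality) meet the predefined inter-judge agreement criterion (alpha >= 0.67).

Insight: The contributions are NDBench, a reproducible benchmark for systematically evaluating LLM adaptation to ND context; the finding that the adaptation is mainly structural rather than superficial; and evidence that persona assertion alone does not control harmful outputs, so explicit instructions are needed. Together these provide a methodology for future audits of ND awareness in LLMs.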

Abstract: We examine if frontier chat-based large language models (LLMs) adjust their outputs based on neurodivergence (ND) context in system prompts and describe the nature of these adjustments. Specifically, we propose NDBench, a 576-output benchmark involving two frontier models, three system prompt types (baseline, ND-profile assertion, and ND-profile assertion with explicit instructions for adjustments), four canonical ND profiles, and 24 prompts across four categories, one of which involves an adversarial masking strategy. Four trends emerge consistently from our findings. First, LLMs show significant adaptation under ND context, where fully instructed conditions yield lengthier and more structured outputs, characterized by higher token counts, more headings, and more granular steps (p < 10^-8, Holm-corrected). Second, such adaptation is largely structural in nature: although list density does not change much, there is a marked rise in the frequency of headings and per-step detail. Third, ND persona assertion alone fails to suppress potentially harmful tendencies, as masking-reinforcement decreases only in explicitly instructed cases (36-44% reduction); the reduction rate barely changes in persona assertion conditions. Moreover, reliability analysis of LLM-based harm assessment reveals that only two out of the six dimensions (masking and reinforcement, validation quality) exceed the pre-defined inter-judge agreement criterion (alpha >= 0.67) and thus can be considered primary results. NDBench is made publicly available along with its prompts, outputs, code, and other resources, forming a reproducible framework for auditing future LLMs’ adaptation to ND awareness.
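
The 576-output figure is consistent with a fully crossed design: 2 models × 3 system prompt types × 4 ND profiles × 24 prompts. A minimal sketch of enumerating such a grid follows; only the counts come from the abstract, while every label (`model_a`, `profile_1`, and so on) is a hypothetical placeholder, and full crossing is an assumed reading rather than the authors' stated procedure.

```python
from itertools import product

# Hypothetical labels; only the counts (2 models, 3 system-prompt types,
# 4 ND profiles, 24 prompts) come from the abstract.
MODELS = ["model_a", "model_b"]
CONDITIONS = ["baseline", "nd_assertion", "nd_assertion_plus_instructions"]
PROFILES = [f"profile_{i}" for i in range(1, 5)]
PROMPTS = [f"prompt_{i:02d}" for i in range(1, 25)]


def build_grid():
    """Return one record per (model, condition, profile, prompt) cell."""
    return [
        {"model": m, "condition": c, "profile": p, "prompt": q}
        for m, c, p, q in product(MODELS, CONDITIONS, PROFILES, PROMPTS)
    ]


grid = build_grid()
print(len(grid))  # 2 * 3 * 4 * 24 = 576
```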


Nhung Thi-Hong Duong, Mai Ngoc Ho, Tin Van Huynh, Kiet Van Nguyen

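TL;DR: This paper introduces ViLegalNLI, the first large-scale natural language inference dataset for Vietnamese legal text, with 42,012 premise-hypothesis pairs labeled Entailment or Non-entailment. The authors propose a semi-automatic data generation framework that uses LLMs for controlled hypothesis generation and quality validation, and they report extensive experiments.

Details

Motivation: The Vietnamese legal domain lacks a dedicated NLI benchmark. This work fills that gap to support legal text understanding, legal reasoning, and the development of reliable legal AI systems.

Result: Few-shot LLM configurations consistently achieve the best performance, and performance is significantly affected by hypothesis length, lexical overlap, and reasoning complexity. Cross-domain evaluation shows that generalizing legal inference across distinct legal fields remains difficult.

Insight: The contributions are the first Vietnamese legal NLI dataset and a semi-automatic, LLM-assisted generation and validation framework that emphasizes legal consistency and diverse reasoning patterns, providing an important benchmark for legal text understanding and inference.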

Abstract: In this article, we introduce ViLegalNLI, the first large-scale Vietnamese Natural Language Inference (NLI) dataset specifically constructed for the legal domain. The dataset consists of 42,012 premise-hypothesis pairs derived from official statutory documents and annotated with binary inference labels (Entailment and Non-entailment). It covers multiple legal domains and reflects realistic legal reasoning scenarios characterized by structured logic, conditional clauses, and domain-specific terminology. To construct ViLegalNLI, we propose a semi-automatic data generation framework that integrates large language models for controlled hypothesis generation and systematic quality validation procedures. The framework incorporates artifact mitigation strategies and cross-model validation to improve annotation reliability and ensure legal consistency. The resulting dataset captures diverse reasoning patterns, including paraphrasing, logical implication, and legally invalid inferences, thereby providing a comprehensive benchmark for Vietnamese legal inference tasks. We conduct extensive experiments on the ViLegalNLI using multilingual models, Vietnamese-specific pretrained language models, and instruction-tuned large language models. The results show that few-shot LLM configurations consistently achieve superior performance, while performance is significantly influenced by hypothesis length, lexical overlap, and reasoning complexity. Cross-domain evaluations further reveal the challenges of generalizing legal inference across distinct legal fields. Overall, ViLegalNLI establishes a foundational benchmark for Vietnamese legal NLI and supports future research in legal reasoning, statutory text understanding, and the development of reliable AI systems for legal analysis and decision support. The dataset is publicly available for research purposes.


[3] Cultural Benchmarking of LLMs in Standard and Dialectal Arabic Dialogues cs.CL | cs.AI

Muhammad Dehan Al Kautsar, Saeed Almheiri, Momina Ahsan, Bilal Elbouardi, Younes Samih

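TL;DR: To address gaps in evaluating the cultural reasoning of LLMs in Arabic, this paper introduces ArabCulture-Dialogue, a conversational dataset covering 13 Arabic-speaking countries and 12 daily-life topics in both Modern Standard Arabic (MSA) and dialects. The authors build three benchmark tasks on it and find that models generally perform worse in the dialectal setting than in MSA.

Details

Motivation: Most existing Arabic benchmarks focus on short MSA texts and overlook the cultural nuances and dialectal contexts that arise naturally in dialogue, leaving a significant gap in evaluating LLM cultural reasoning.

Result: On all three tasks (multiple-choice cultural reasoning, machine translation between MSA and dialects, and dialect-steering generation), models perform worse in the dialectal setting than in the MSA setting, so the performance gap persists.

Insight: The contribution is a culturally grounded conversational dataset that covers the dialects of many countries, together with matching benchmark tasks. It gives a new tool for evaluating LLMs in dialectal and cultural contexts and documents the performance gap between MSA and dialects.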

Abstract: There is a significant gap in evaluating cultural reasoning in LLMs using conversational datasets that capture culturally rich and dialectal contexts. Most Arabic benchmarks focus on short text snippets in Modern Standard Arabic (MSA), overlooking the cultural nuances that naturally arise in dialogues. To address this gap, we introduce ArabCulture-Dialogue, a culturally grounded conversational dataset covering 13 Arabic-speaking countries, in both MSA and each country’s respective dialect, spanning 12 daily-life topics and 54 fine-grained subtopics. We utilize the dataset to form three benchmarking tasks: (i) multiple-choice cultural reasoning, (ii) machine translation between MSA and dialects, and (iii) dialect-steering generation. Our experiments indicate that the performance gap between MSA and Arabic dialects still exists, whereby the models perform worse on all three tasks in the dialectal setup, compared to the MSA one.


[4] RSAT: Structured Attribution Makes Small Language Models Faithful Table Reasoners cs.CL | cs.AI | cs.IR | cs.LG

Jugal Gajjar, Kamalasankari Subramaniakuppusamy

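TL;DR: RSAT is a method for training small language models (SLMs, 1-8B) for table reasoning so that they produce step-by-step reasoning with cell-level citations. It has two phases: Phase 1 (SFT) teaches a structured JSON output format through supervised fine-tuning; Phase 2 (GRPO) optimizes a composite reward centered on NLI-based faithfulness, alongside citation validity and parsimony.

Details

Motivation: When a language model answers a table question, users cannot verify which cells informed which reasoning steps. The work aims to make the model's reasoning more transparent and verifiable.

Result: Across six models from two families (Qwen 2.5 and Llama 3), RSAT raises faithfulness from 0.224 with SFT alone to 0.826 (a 3.7× improvement), with near-perfect citation validity (0.992). Post-hoc attribution achieves less than 13% format success, indicating that attribution must be integrated into the reasoning process. Ablations show the faithfulness reward is essential: removing it drops faithfulness from 0.97 to 0.03.

Insight: The novelty is integrating structured attribution (cell citations) directly into the reasoning training of small language models. Two-phase training (SFT + GRPO) with an NLI-based faithfulness reward yields highly faithful, verifiable reasoning and shows that attribution cannot be retrofitted but must be trained as a core part of reasoning.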

Abstract: When a language model answers a table question, users have no way to verify which cells informed which reasoning steps. We introduce RSAT, a method that trains small language models (SLMs, 1-8B) to produce step-by-step reasoning with cell-level citations grounded in table evidence. Phase 1 (SFT) teaches a structured JSON output format from verified reasoning traces. Phase 2 (GRPO) optimizes a composite reward centered on NLI-based faithfulness, alongside citation validity and parsimony. Across six models from two families, Qwen 2.5 (1.5B/3B/7B) and Llama 3 (1B/3B/8B), RSAT improves faithfulness 3.7× over SFT alone (0.224 → 0.826), with near-perfect citation validity (0.992). Post-hoc attribution collapses below 13% format success, confirming that attribution must be integrated into reasoning, not retrofitted. Ablations show the faithfulness reward is essential: removing it drops faithfulness from 0.97 to 0.03.
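
To make "citation validity" concrete, here is a minimal sketch of one way such a check could work: parse a structured JSON output and measure the fraction of cited cells that actually exist in the table. The JSON schema (`steps`, `cells`, `answer`) and the function name are assumptions for illustration; the abstract does not specify RSAT's format or its exact metric.

```python
import json


def citation_validity(output_json: str, n_rows: int, n_cols: int) -> float:
    """Fraction of cited (row, col) cells that fall inside the table bounds.

    Assumed schema: {"steps": [{"text": str, "cells": [[row, col], ...]}], ...}
    """
    steps = json.loads(output_json)["steps"]
    cited = [tuple(cell) for step in steps for cell in step["cells"]]
    if not cited:
        return 0.0
    valid = [(r, c) for r, c in cited if 0 <= r < n_rows and 0 <= c < n_cols]
    return len(valid) / len(cited)


# Hypothetical model output for a 3-row, 4-column table.
example = json.dumps({
    "steps": [
        {"text": "Read the 2021 revenue.", "cells": [[1, 2]]},
        {"text": "Compare with 2020.", "cells": [[0, 2], [5, 2]]},
    ],
    "answer": "higher",
})
print(citation_validity(example, n_rows=3, n_cols=4))  # 2 of 3 citations are in bounds
```

A bounds check like this says only that a citation points at a real cell; whether the cell supports the step is the job of the NLI-based faithfulness reward the abstract describes.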


Jan Sobotka, Mustafa O. Karabag, Ufuk Topcu

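TL;DR: This paper investigates why LLMs fail at strategic decision-making in incomplete-information games (such as negotiation and policymaking) and identifies two fundamental gaps in their internal decision mechanisms: an observation-belief gap and a belief-action gap.

Details

Motivation: LLMs are increasingly used for strategic decision-making under incomplete information, yet the reasons for their failures are poorly understood. The paper analyzes their internal mechanisms to expose these systematic weaknesses.

Result: Experiments on the open-weight models Llama 3.1, Qwen3, and gpt-oss show that the beliefs LLMs encode internally are more accurate than their verbal reports but brittle, and that beliefs convert into actions inefficiently, so belief conditioning does not consistently raise game payoffs.

Insight: The novelty lies in analyzing LLMs' internal processes, rather than only their outputs, to reveal systematic vulnerabilities: beliefs degrade under complex reasoning, show primacy and recency biases, drift from Bayesian coherence, and are only weakly tied to actions. This argues for robust guardrails before deploying LLMs in strategic domains.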

Abstract: Large language models (LLMs) are increasingly tasked with strategic decision-making under incomplete information, such as in negotiation and policymaking. While LLMs can excel at many such tasks, they also fail in ways that are poorly understood. We shed light on these failures by uncovering two fundamental gaps in the internal mechanisms underlying the decision-making of LLMs in incomplete-information games, supported by experiments with open-weight models Llama 3.1, Qwen3, and gpt-oss. First, an observation-belief gap: LLMs encode internal beliefs about latent game states that are substantially more accurate than their own verbal reports, yet these beliefs are brittle. In particular, the belief accuracy degrades with multi-hop reasoning, exhibits primacy and recency biases, and drifts away from Bayesian coherence over extended interactions. Second, a belief-action gap: The implicit conversion of internal beliefs into actions is weaker than that of the beliefs externalized in the prompt, yet neither belief-conditioning consistently achieves higher game payoffs. These results show how analyzing LLMs’ internal processes can expose systematic vulnerabilities that warrant caution before deploying LLMs in strategic domains without robust guardrails.
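
"Bayesian coherence" here refers to how closely a belief tracks the posterior that Bayes' rule would give after each observation. The sketch below shows that reference update for a discrete latent state; the state names and probabilities are hypothetical and not taken from the paper.

```python
def bayes_update(prior, likelihood):
    """Posterior over latent states: prior[s] * likelihood[s], renormalized."""
    unnorm = {s: prior[s] * likelihood[s] for s in prior}
    total = sum(unnorm.values())
    if total == 0:
        raise ValueError("observation has zero probability under the prior")
    return {s: v / total for s, v in unnorm.items()}


# Hypothetical two-state game: is the opponent's hidden hand strong or weak?
prior = {"strong_hand": 0.5, "weak_hand": 0.5}
# P(opponent raises | state), illustrative numbers only.
likelihood_raise = {"strong_hand": 0.8, "weak_hand": 0.2}

posterior = bayes_update(prior, likelihood_raise)
print(posterior)
```

A belief extracted from a model could then be compared against this posterior at each turn, for example by absolute difference, to measure drift over a long interaction.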


[6] Retrieval-Augmented Reasoning for Chartered Accountancy cs.CL | cs.AI | cs.IR

Jatin Gupta, Akhil Sharma, Saransh Singhania, Ali Imam Abidi

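TL;DR: This paper presents CA-ThinkFlow, a parameter-efficient retrieval-augmented generation (RAG) framework aimed at the reliability of LLMs on complex, jurisdiction-specific tasks such as Indian Chartered Accountancy (CA). It pairs a 4-bit-quantized 14B reasoning model (DeepSeek-R1) with a layout-aware document extraction system, and uses basic RAG together with the model's built-in chain-of-thought (CoT) to incorporate retrieved information and generate answers.

Details

Motivation: LLMs are increasingly applied in finance, but their reliability and scalability remain limited on complex tasks that require multi-step numerical computation and advanced legal knowledge under resource constraints, such as Indian Chartered Accountancy work.

Result: On the CA-Ben benchmark, the authors report performance comparable to large proprietary models, with a Scholastic Reliability Coefficient (SRC) equal to 68.75% of that of GPT-4o and Claude 3.5 Sonnet.

Insight: The contribution is the combination of a parameter-efficient quantized reasoning model, layout-aware document extraction, and basic RAG with built-in CoT to improve performance on complex tasks in resource-constrained settings. However, the framework's core reasoning still falls short on complex regulatory text such as tax law, which points to the need for deeper domain-specific semantic understanding in future work.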

Abstract: The inception of Large Language Models (LLMs) has catalyzed AI adoption in the finance sector, yet their reliability in complex, jurisdiction-specific tasks like Indian Chartered Accountancy (CA) remains limited. The models struggle with multi-step numerical tasks that also demand advanced knowledge of legal regulations, and scaling them up is not feasible in resource-limited settings. We present CA-ThinkFlow, a parameter-efficient Retrieval-Augmented Generation (RAG) framework which operates with a 14B, 4-bit-quantized reasoning model, 14B-DeepSeek-R1, and a layout-aware Docling extraction system which maintains document structure during extraction. CA-ThinkFlow uses a basic RAG method which automatically adds retrieved information into the prompt, while it depends on the model’s built-in Chain-of-Thought (CoT) functions to create context and produce correct answers. The system operates at performance levels which match large proprietary models when tested on the multi-level CA-Ben benchmark, achieving Scholastic Reliability Coefficient (SRC) results which equal 68.75% of GPT-4o and Claude 3.5 Sonnet. The framework is highly parameter-efficient, but its essential reasoning abilities fail to process complex regulatory texts in fields such as Taxation.


[7] Are You the A-hole? A Fair, Multi-Perspective Ethical Reasoning Framework cs.CL | cs.AI | cs.CY | cs.HC

Sheza Munir, Ahanaf Rodoshi, Sumin Lee, Feiran Chang, Xujie Si

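TL;DR: This paper proposes a neuro-symbolic aggregation framework that formalizes conflict resolution as Weighted Maximum Satisfiability (MaxSAT) for aggregating natural language judgments in large-scale moral disagreement. A language model maps unstructured natural language explanations to interpretable logical predicates and confidence weights, which are encoded as soft constraints in the Z3 solver. Aggregation thus becomes an optimization task that seeks maximum consistency across conflicting testimony.

Details

Motivation: Standard aggregation methods such as majority voting treat differing opinions as noise and fail to produce logically consistent results in high-conflict domains, particularly in large-scale moral disagreement.

Result: Using the Reddit r/AmItheAsshole forum as a case study, the system produces logically coherent verdicts that diverge from popularity-based labels 62% of the time and agree with independent human evaluators 86% of the time.

Insight: The novelty is coupling neural semantic extraction with a formal solver (Z3) to enforce logical soundness and explainability when aggregating noisy human reasoning, yielding a fair, multi-perspective ethical reasoning framework.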

Abstract: Standard methods for aggregating natural language judgments, such as majority voting, often fail to produce logically consistent results when applied to high-conflict domains, treating differing opinions as noise. We propose a neuro-symbolic aggregation framework that formalizes conflict resolution through Weighted Maximum Satisfiability (MaxSAT). Our pipeline utilizes a language model to map unstructured natural language explanations into interpretable logical predicates and confidence weights. These components are then encoded as soft constraints within the Z3 solver, transforming the aggregation problem into an optimization task that seeks the maximum consistency across conflicting testimony. Using the Reddit r/AmItheAsshole forum as a case study in large-scale moral disagreement, our system generates logically coherent verdicts that diverge from popularity-based labels 62% of the time, corroborated by an 86% agreement rate with independent human evaluators. This study demonstrates the efficacy of coupling neural semantic extraction with formal solvers to enforce logical soundness and explainability in the aggregation of noisy human reasoning.
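
The weighted MaxSAT formulation can be illustrated on a toy instance. The sketch below swaps the Z3 solver named in the abstract for a brute-force search in plain Python, which is only practical for a handful of variables; the predicates, weights, and function name are hypothetical and stand in for what a language model would extract from comments.

```python
from itertools import product


def weighted_maxsat(variables, soft_clauses):
    """Brute-force weighted MaxSAT (a stand-in for a real solver such as Z3).

    soft_clauses: list of (weight, clause); a clause is a list of
    (variable, polarity) literals and is satisfied if any literal holds.
    Returns (best_assignment, satisfied_weight).
    """
    best, best_weight = None, -1.0
    for values in product([False, True], repeat=len(variables)):
        assignment = dict(zip(variables, values))
        weight = sum(
            w for w, clause in soft_clauses
            if any(assignment[v] == pol for v, pol in clause)
        )
        if weight > best_weight:
            best, best_weight = assignment, weight
    return best, best_weight


# Hypothetical predicates and confidence weights from three commenters.
variables = ["broke_promise", "at_fault"]
soft_clauses = [
    (0.9, [("broke_promise", True)]),                        # "they broke a promise"
    (0.6, [("broke_promise", False), ("at_fault", True)]),   # broke_promise implies at_fault
    (0.4, [("at_fault", False)]),                            # "they are not at fault"
]

verdict, score = weighted_maxsat(variables, soft_clauses)
print(verdict, score)
```

Here the lower-weight claim "not at fault" is overridden because satisfying it would violate a heavier, logically linked pair of claims. That is the kind of outcome that can differ from a simple vote count.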


[8] Prompt-Induced Score Variance in Zero-Shot Binary Vision-Language Safety Classification cs.CL | cs.CV

Charles Weng, Dingwen Li, Alexander Martin

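TL;DR: This paper studies the instability of single-prompt first-token probabilities in zero-shot vision-language model (VLM) safety classifiers under semantically equivalent prompt reformulation. Even when the binary label is constrained to a fixed output position, equivalent prompts can yield materially different unsafe probabilities for the same sample, and this cross-prompt variance is strongly associated with prompt-level disagreement and higher error. Across multiple multimodal safety benchmarks and VLM families, a training-free mean ensemble improves negative log-likelihood (NLL) and, in most cases, expected calibration error (ECE) over the single-prompt baseline and compares favorably with other calibration methods. The authors recommend prompt-family evaluation with mean aggregation as a standard label-free reliability baseline.

Details

Motivation: Zero-shot VLM safety classifiers typically use a single prompt's first-token probability as the decision score, but the authors find these scores unreliable under semantically equivalent prompt reformulation. The variance is substantial and undermines the classifier's reliability and robustness.

Result: On 14 dataset-model evaluation pairs, the training-free mean ensemble improves NLL on all 14 and ECE on 12 of 14 relative to the train-selected single-prompt baseline, with consistent ranking gains on AUROC and AUPRC against that baseline. Although it uses no labels, it also wins more head-to-head NLL comparisons than labeled temperature scaling, Platt scaling, and isotonic regression.

Insight: The core contribution is exposing prompt-induced score variance in zero-shot VLM safety classification and using it as a fragility diagnostic. The training-free mean ensemble is a simple, effective label-free reliability technique that can serve as a standard baseline and as a strong first stage before labeled calibration.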

Abstract: Single-prompt first-token probabilities from zero-shot vision-language model (VLM) safety classifiers are treated as decision scores, but we show they are unreliable under semantically equivalent prompt reformulation: even when the binary label is constrained to a fixed output position, equivalent prompts can induce materially different unsafe probabilities for the same sample. Across multimodal safety benchmarks and multiple VLM families, cross-prompt variance is strongly associated with prompt-level disagreement and higher error, making it a useful fragility diagnostic. A training-free mean ensemble improves NLL on all 14 dataset-model evaluation pairs and ECE on 12/14 relative to a train-selected single-prompt baseline, and wins more head-to-head NLL comparisons than labeled temperature scaling, Platt scaling, and isotonic regression applied to the same prompt. Ranking gains are consistent against the train-selected baseline on both AUROC and AUPRC, and against the full 15-prompt distribution remain consistent on AUPRC while softening on AUROC. Labeled calibration on top of the mean provides further gains when labels are available, identifying prompt averaging as a strong label-free first stage rather than a replacement for calibration. We frame this as a reliability stress test for zero-shot VLM first-token safety scores and recommend prompt-family evaluation with mean aggregation as a standard label-free reliability baseline.
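
The mean ensemble and the variance diagnostic both reduce to simple statistics over a sample's per-prompt unsafe probabilities. The sketch below assumes those probabilities have already been extracted from a VLM; the scores and function names are hypothetical, and the NLL helper is the standard binary log loss rather than the authors' evaluation code.

```python
import math
from statistics import mean, pvariance


def ensemble_scores(prompt_probs):
    """Mean and cross-prompt variance of unsafe probabilities for ONE sample."""
    return mean(prompt_probs), pvariance(prompt_probs)


def nll(prob_unsafe, label, eps=1e-12):
    """Binary negative log-likelihood; label is 1 for unsafe, 0 for safe."""
    p = min(max(prob_unsafe, eps), 1 - eps)
    return -math.log(p) if label == 1 else -math.log(1 - p)


# Hypothetical scores for the same image under three equivalent prompts.
probs = [0.9, 0.4, 0.8]
avg, var = ensemble_scores(probs)
print(avg, var, nll(avg, label=1))
```

In this toy case the averaged score is penalized less than the worst single prompt (0.4) when the sample really is unsafe, and the nonzero variance flags the sample as prompt-sensitive.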


[9] Escaping Mode Collapse in LLM Generation via Geometric Regulation cs.CL | cond-mat.dis-nn | cs.AI | nlin.CD

Xin Du, Kumiko Tanaka-Ishii

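TL;DR: This paper takes a dynamical-systems view of mode collapse in LLM generation, interpreting it as reduced state-space accessibility caused by geometric collapse of the representation space. Based on this view, the authors propose Reinforced Mode Regulation (RMR), a lightweight online intervention that applies low-rank damping to dominant self-reinforcing directions in the Transformer value cache and effectively mitigates mode collapse.

Details

Motivation: Mode collapse is pervasive in autoregressive text generation, appearing as behaviors that range from explicit looping to gradual loss of diversity and premature trajectory convergence. The authors argue that existing decoding heuristics based on symbolic constraints or probabilities alone cannot reliably solve it.

Result: Across multiple LLMs, RMR substantially reduces mode collapse and enables stable, high-quality generation at extremely low entropy rates: standard decoding typically collapses near 2.0 nats/step, whereas RMR remains stable down to 0.8 nats/step.

Insight: The core idea is to reconceptualize mode collapse as geometric collapse and to design a direct, lightweight intervention on the model's internal state space accordingly. This offers a new dynamical-systems framework for understanding mode collapse in generative models and a practical way to address it.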

Abstract: Mode collapse is a persistent challenge in generative modeling and appears in autoregressive text generation as behaviors ranging from explicit looping to gradual loss of diversity and premature trajectory convergence. We take a dynamical-systems view and reinterpret mode collapse as reduced state-space accessibility caused by geometric collapse: during generation, the model’s internal trajectory becomes confined to a low-dimensional region of its representation space. This implies mode collapse is not purely a token-level phenomenon and cannot be reliably solved by symbolic constraints or probability-only decoding heuristics. Guided by this perspective, we propose Reinforced Mode Regulation (RMR), a lightweight, online state-space intervention that regulates dominant self-reinforcing directions in the Transformer value cache (implemented as low-rank damping). Across multiple large language models, RMR substantially reduces mode collapse and enables stable, high-quality generation at extremely low entropy rates (down to 0.8 nats/step), whereas standard decoding typically collapses near 2.0 nats/step.
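
The abstract says only that RMR is "implemented as low-rank damping", so the following is a generic sketch of that operation, not the authors' algorithm: shrink a vector's components along a few unit directions and leave the rest untouched. How RMR chooses the directions and the strength is not stated there, so `directions` and `alpha` are hypothetical inputs.

```python
def damp(vector, directions, alpha=0.5):
    """Shrink the components of `vector` along each unit direction by alpha.

    Computes v - alpha * sum_k (u_k . v) u_k, a low-rank update when there
    are few directions. Directions are assumed to be orthonormal.
    """
    out = list(vector)
    for u in directions:
        coeff = sum(ui * vi for ui, vi in zip(u, vector))
        out = [oi - alpha * coeff * ui for oi, ui in zip(out, u)]
    return out


v = [4.0, 2.0, 1.0]
dominant = [[1.0, 0.0, 0.0]]  # hypothetical dominant direction
print(damp(v, dominant, alpha=0.5))  # first component halved, others unchanged
```

With `alpha=1.0` this becomes a full projection that removes the direction entirely; values between 0 and 1 attenuate it without discarding it.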


[10] Impact of Task Phrasing on Presumptions in Large Language Models cs.CL | cs.AI

Kenneth J. K. Ong

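TL;DR: This study examines how task phrasing induces presumptions in LLMs that make it hard for them to adapt when a task deviates from those assumptions. Using the iterated prisoner's dilemma as a case study, it shows that LLMs are susceptible to presumptions when making decisions, while neutral task phrasing promotes logical reasoning.

Details

Motivation: Motivated by concerns about the safety and reliability of LLMs in unpredictable real-world applications, the study investigates how task phrasing triggers presumptions that hinder adaptation to task changes.

Result: LLMs remain susceptible to presumptions when making decisions, even with reasoning steps. With neutral task phrasing, the models reason logically and show fewer presumptions.

Insight: The contribution is using the iterated prisoner's dilemma to quantify the effect of task phrasing on presumptions, showing that neutral phrasing lowers presumption risk and offering practical guidance for improving LLM robustness.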

Abstract: Concerns with the safety and reliability of applying large language models (LLMs) in unpredictable real-world applications motivate this study, which examines how task phrasing can lead to presumptions in LLMs, making it difficult for them to adapt when the task deviates from these assumptions. We investigated the impact of these presumptions on the performance of LLMs using the iterated prisoner’s dilemma as a case study. Our experiments reveal that LLMs are susceptible to presumptions when making decisions even with reasoning steps. However, when the task phrasing was neutral, the models demonstrated logical reasoning with few presumptions. These findings highlight the importance of proper task phrasing to reduce the risk of presumptions in LLMs.


[11] Structure Liberates: How Constrained Sensemaking Produces More Novel Research Output cs.CL | cs.AI

James Mooney, Zae Myung Kim, Young-Jun Lee, Dongyeop Kang

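TL;DR: The paper introduces SCISENSE, a framework that structures the ideation process of scientific discovery into eight cognitive stages, along with the SCISENSE-Traj dataset and the SCISENSE-LM models. Models trained on citation-conditioned paths that lead to known papers (Target mode) generate higher-quality, more novel, and more diverse research trajectories than models trained on free-form inference (Infer mode), and they improve the executability and quality of research artifacts produced by downstream coding agents.

Details

Motivation: Existing approaches treat the ideation phase of scientific discovery as a brief preamble and overlook its central role in research. This paper operationalizes and strengthens ideation through a structured framework grounded in sensemaking.

Result: On SCISENSE-Traj, Target-trained models improve trajectory quality by 2.0% over Infer-trained models while producing more novel and diverse outputs. Downstream, coding agents conditioned on Target trajectories produce research artifacts with higher executability and quality.

Insight: The contributions are structuring scientific ideation into explicit cognitive stages and building a large-scale, citation-conditioned dataset of research trajectories. A key finding is that moderate structural constraint (Target mode) promotes novelty and diversity more than looser supervision (Infer mode). This challenges the assumption that looser supervision promotes greater exploration and offers a principled testbed for studying how planning shapes scientific discovery.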

Abstract: Scientific discovery is an extended process of ideation–surveying prior work, forming hypotheses, and refining reasoning–yet existing approaches treat this phase as a brief preamble despite its central role in research. We introduce SCISENSE, a sensemaking-grounded framework that operationalizes ideation as a structured sequence of eight cognitive stages (Pirolli & Card, 2005). We construct SCISENSE-Traj, a 100K-scale dataset of citation-conditioned research trajectories in two modes: Target, where an LLM reconstructs the ideation path leading to a known paper from its cited works, and Infer, where the LLM proposes novel directions from the same citations. We distill these into SCISENSE-LM, a family of sensemaking LLMs spanning 3B to 70B parameters. Contrary to the assumption that looser supervision promotes greater exploration, Target-trained models achieve a 2.0% improvement in trajectory quality over Infer-trained models while also producing more novel and diverse outputs. This advantage propagates downstream: coding agents conditioned on Target trajectories produce research artifacts with higher executability and quality than those conditioned on Infer trajectories. This suggests that targeted ideation reduces cognitive burden on downstream agents, freeing them to explore more creatively. SCISENSE offers both a practical tool for augmenting LLM-driven research workflows and a principled testbed for studying how planning shapes scientific discovery.


[12] Beyond Benchmarks: MathArena as an Evaluation Platform for Mathematics with LLMs cs.CL

Jasper Dekoninck, Nikola Jovanović, Tim Gehrunger, Kári Rögnvalddson, Ivo Petrov

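TL;DR: This paper extends the original MathArena benchmark into a continuously maintained platform for evaluating LLM mathematical reasoning, addressing the narrow scope, quick saturation, and slow updates of static benchmarks. The platform covers a wider range of tasks, including proof-based competitions, research-level arXiv problems, and formal proof generation in Lean, with a clear evaluation protocol and regularly refreshed challenging benchmarks.

Details

Motivation: Static benchmarks for LLM mathematical ability are narrow in scope, quickly saturated, and slow to update, which makes it hard to compare models reliably and track long-term progress. A continuously maintained, comprehensive evaluation platform is therefore needed.

Result: On MathArena, the strongest model, GPT-5.5, reaches 98% on the 2026 USA Math Olympiad and 74% on research-level questions, indicating that frontier models can now comfortably solve extremely challenging mathematical problems.

Insight: The contribution is turning a one-off benchmark into a continuously maintained evaluation ecosystem that tracks LLM progress by integrating diverse mathematical task types and updating benchmarks over time. This platform approach is a useful model for evaluation in other domains and underlines that evaluation must evolve alongside model capability.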

Abstract: Large language models (LLMs) are becoming increasingly capable mathematical collaborators, but static benchmarks are no longer sufficient for evaluating progress: they are often narrow in scope, quickly saturated, and rarely updated. This makes it hard to compare models reliably and track progress over time. Instead, we need evaluation platforms: continuously maintained systems that run, aggregate, and analyze evaluations across many benchmarks to give a comprehensive picture of model performance within a broad domain. In this work, we build on the original MathArena benchmark by substantially broadening its scope from final-answer olympiad problems to a continuously maintained evaluation platform for mathematical reasoning with LLMs. MathArena now covers a much wider range of tasks, including proof-based competitions, research-level arXiv problems, and formal proof generation in Lean. Additionally, we maintain a clear evaluation protocol for all models and regularly design new benchmarks as model capabilities improve to ensure that MathArena remains challenging. Notably, the strongest model, GPT-5.5, now reaches 98% on the 2026 USA Math Olympiad and 74% on research-level questions, showing that frontier models can now comfortably solve extremely challenging mathematical problems. This highlights the importance of continuously maintained evaluation platforms like MathArena to track the rapid progress of LLMs in mathematical reasoning.


[13] Learning How and What to Memorize: Cognition-Inspired Two-Stage Optimization for Evolving Memory cs.CL

Derong Xu, Shuochen Liu, Pengfei Luo, Pengyue Jia, Yingyi Zhang

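TL;DR: This paper proposes MemCoE, a cognition-inspired two-stage optimization framework that learns how to organize and update long-term user memory for LLM agents. Its two stages, Memory Guideline Induction and Guideline-Aligned Memory Policy Optimization, learn how memory should be organized and what information to update, respectively. They address the sparse rewards and unstable long-horizon optimization that existing memory systems face when tracking evolving user preferences.

Details

Motivation: Existing long-term memory systems for LLM agents mainly rely on static, hand-crafted update rules, while reinforcement learning approaches get weak supervision from sparse outcome rewards and struggle to optimize long-horizon memory updates stably. Drawing on memory schema theory and the functional division between the prefrontal cortex and the hippocampus, the paper designs a more effective memory optimization framework.

Result: On three personalization memory benchmarks (covering explicit and implicit preferences and different sizes and noise levels), MemCoE shows consistent improvements over strong baselines, with favorable robustness, transferability, and efficiency.

Insight: The novelty is combining memory schema theory from cognitive science with two-stage optimization. Contrastive feedback, interpreted as textual gradients, induces a global memory guideline; that guideline then defines structured process rewards for multi-turn reinforcement learning, which stably learns a guideline-following memory evolution policy. This gives a cognition-inspired approach to designing LLM memory systems.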

Abstract: Large language model (LLM) agents require long-term user memory for consistent personalization, but limited context windows hinder tracking evolving preferences over long interactions. Existing memory systems mainly rely on static, hand-crafted update rules; although reinforcement learning (RL)-based agents learn memory updates, sparse outcome rewards provide weak supervision, resulting in unstable long-horizon optimization. Drawing on memory schema theory and the functional division between prefrontal regions and hippocampus regions, we introduce MemCoE, a cognition-inspired two-stage optimization framework that learns how memory should be organized and what information to update. In the first stage, we propose Memory Guideline Induction to optimize a global guideline via contrastive feedback interpreted as textual gradients; in the second stage, Guideline-Aligned Memory Policy Optimization uses the induced guideline to define structured process rewards and performs multi-turn RL to learn a guideline-following memory evolution policy. We evaluate on three personalization memory benchmarks, covering explicit/implicit preference and different sizes and noise, and observe consistent improvements over strong baselines with favorable robustness, transferability, and efficiency.


[14] When LLMs Stop Following Steps: A Diagnostic Study of Procedural Execution in Language Models cs.CL

Sailesh Panda, Pritam Kadasi, Abhishek Upperwal, Mayank Singh

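TL;DR: Using a controlled diagnostic benchmark, this paper studies how faithfully LLMs execute step-wise algorithms specified in the prompt. Despite strong results on reasoning benchmarks, first-answer accuracy drops markedly on arithmetic procedures that are longer or have look-back dependencies, revealing a major weakness in faithful instruction execution.

Details

Motivation: The paper asks whether final-answer accuracy on reasoning tasks really reflects an LLM's ability to faithfully execute the step-wise procedure specified in a prompt, rather than just apparent reasoning ability.

Result: Across 14 models and 55 datasets, average first-answer accuracy falls from 61% on 5-step procedures to 20% on 95-step procedures. Generation-level analysis identifies failure modes including missing answers, premature answers, self-correction after an initial error, under-executed traces, and hallucinated extra steps.

Insight: The contribution is a controllable diagnostic benchmark that systematically evaluates faithful execution of step-wise algorithms and pins down specific failure modes in complex procedures, showing that apparent reasoning ability can mask substantial weaknesses in instruction execution.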

Abstract: Large language models (LLMs) often achieve strong performance on reasoning benchmarks, but final-answer accuracy alone does not show whether they faithfully execute the procedure specified in a prompt. We study this question through a controlled diagnostic benchmark for procedural execution, where models are given a step-wise arithmetic algorithm and two numeric inputs, and must return the final computed value. The benchmark uses simple arithmetic operations but increases complexity through algorithm length and look-back dependencies over intermediate variables. Across 14 models and 55 datasets, average first-answer accuracy drops from 61% on 5-step procedures to 20% on 95-step procedures. Generation-level analysis shows that failures often involve missing answers, premature answers, self-correction after an initial error, under-executed traces, and hallucinated extra steps. These findings suggest that apparent reasoning ability can mask substantial weaknesses in faithful instruction execution.
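
A benchmark of this shape needs a procedure generator and a reference executor that supplies the ground-truth answer. The sketch below is one hypothetical construction, not the authors' code: each step combines two earlier variables chosen within a look-back window, and values are reduced modulo 1000 to stay small. The modulus, the operation set, and all names are assumptions.

```python
import random

OPS = {"+": lambda a, b: a + b, "-": lambda a, b: a - b, "*": lambda a, b: a * b}


def make_procedure(n_steps, max_lookback, seed=0):
    """Each step combines two earlier variables; x0 and x1 are the inputs."""
    if max_lookback < 1:
        raise ValueError("max_lookback must be at least 1")
    rng = random.Random(seed)
    steps = []
    for i in range(2, n_steps + 2):
        lo = max(0, i - max_lookback)  # oldest variable this step may read
        a, b = rng.randrange(lo, i), rng.randrange(lo, i)
        steps.append((i, rng.choice(sorted(OPS)), a, b))
    return steps


def execute(steps, x0, x1, modulus=1000):
    """Reference executor; returns the value of the last variable."""
    values = {0: x0, 1: x1}
    for i, op, a, b in steps:
        values[i] = OPS[op](values[a], values[b]) % modulus
    return values[steps[-1][0]]


def render(steps):
    """Plain-text form of the procedure, as it might appear in a prompt."""
    return "\n".join(f"Step {i - 1}: x{i} = x{a} {op} x{b}" for i, op, a, b in steps)


procedure = make_procedure(n_steps=5, max_lookback=3)
print(render(procedure))
print(execute(procedure, x0=7, x1=12))
```

Raising `n_steps` lengthens the procedure and raising `max_lookback` lets steps depend on older intermediate values, which are the two complexity knobs the abstract describes.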


cs.CV

[15] Learning from the Unseen: Generative Data Augmentation for Geometric-Semantic Accident Anticipation cs.CV | cs.LG

Yanchen Guan, Haicheng Liao, Chengyue Wang, Xingcheng Liu, Jiaxun Zhang

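TL;DR: This paper proposes a dual-path framework for traffic accident anticipation that uses generative data augmentation to ease data scarcity and a graph neural network to reason over dynamic spatiotemporal and semantic relations.

Details

Motivation: Accident anticipation for autonomous driving is hindered by the complexity of modeling interactions and by the scarcity of diverse, large-scale datasets.

Result: Evaluated on existing datasets and on a newly released benchmark with standardized, fine-grained annotations, the method achieves notable gains in both prediction accuracy and anticipation lead time.

Insight: The novelty is pairing generative synthesis of high-fidelity driving scenes with a semantically enriched graph neural network, so that data generation and model reasoning together relieve the data bottleneck and improve system reliability.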

Abstract: Anticipating traffic accidents is a critical yet unresolved problem for autonomous driving, hindered by the inherent complexity of modeling interactions between road users and the limited availability of diverse, large-scale datasets. To address these issues, we propose a dual-path framework. On the one hand, we employ a video synthesis pipeline that, guided by structured prompts, derives feature distributions from existing corpora and produces high-fidelity synthetic driving scenes consistent with the statistical patterns of real data. On the other hand, we design a graph neural network enriched with semantic cues, enabling dynamic reasoning over both spatial and semantic relations among participants. To validate the effectiveness of our approach, we release a new benchmark dataset containing standardized, finely annotated video sequences that cover a broad spectrum of regions, weather, and traffic conditions. Evaluations across existing datasets and our new benchmark confirm notable gains in both accuracy and anticipation lead time, highlighting the capacity of the proposed framework to mitigate current data bottlenecks and enhance the reliability of autonomous driving systems.


[16] GAFSV-Net: A Vision Framework for Online Signature Verification cs.CV | cs.CR | cs.LG

Himanshu Singhal, Suresh Sundaram

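TL;DR: GAFSV-Net is a vision framework for online signature verification. It converts signature data into six-channel asymmetric Gramian Angular Field images, processes them with a dual-branch ConvNeXt-Tiny encoder with bidirectional cross-attention, and verifies by cosine similarity. It outperforms all sequence-based baselines on DeepSignDB and BiosecurID.

Details

Motivation: Existing deep learning methods for online signature verification operate directly on raw time series, which restricts them to 1D architectures and prevents the use of pretrained 2D vision backbones.

Result: On DeepSignDB and BiosecurID, GAFSV-Net outperforms all sequence-based baselines trained under identical objectives, showing that the representational gain of 2D temporal encoding is consistent and independent of training procedure.

Insight: The novelty is representing a signature as a six-channel asymmetric Gramian Angular Field image that captures temporal co-occurrence and directional transition structure, combined with a dual-branch encoder and bidirectional cross-attention. Reusable ideas include converting 1D time series to 2D images to exploit pretrained vision models, and using complementary encodings with cross-attention to make features more discriminative.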

Abstract: Online signature verification (OSV) requires distinguishing skilled forgeries from genuine samples under high intra-class variability and with very few enrollment samples. Existing deep learning methods operate directly on raw temporal sequences, restricting them to 1D architectures and preventing the use of pretrained 2D vision backbones. We bridge this gap with GAFSV-Net, which represents each signature as a six-channel asymmetric Gramian Angular Field image: three kinematic channels (pen speed, pressure derivative, direction angle) are each encoded into complementary GASF and GADF matrices that capture pairwise temporal co-occurrence and directional transition structure respectively. A dual-branch ConvNeXt-Tiny encoder processes GASF and GADF independently, with bidirectional cross-attention enabling each branch to query discriminative patterns from the other before metric-space projection. Training uses semi-hard triplet loss with skilled-forgery hard-negative injection; verification is performed via cosine similarity against a small enrollment prototype. We evaluate on DeepSignDB and BiosecurID, outperforming all sequence-based baselines trained under identical objectives, demonstrating that the representational gain of 2D temporal encoding is consistent and independent of training procedure, with ablations characterising each design choice’s contribution.
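
The GASF and GADF matrices mentioned above follow the standard Gramian Angular Field construction: rescale the series to [-1, 1], map each value to an angle φ = arccos(x), then take cos(φ_i + φ_j) for the summation field and sin(φ_i − φ_j) for the difference field. The sketch below applies that textbook definition to one channel in plain Python. The sample values are hypothetical, and the paper's exact preprocessing (resampling, normalization range) is not given in the abstract.

```python
import math


def gramian_angular_fields(series):
    """Return (GASF, GADF) matrices for a 1-D series.

    The series is min-max scaled to [-1, 1] and mapped to angles
    phi = arccos(x); then GASF[i][j] = cos(phi_i + phi_j) and
    GADF[i][j] = sin(phi_i - phi_j).
    """
    lo, hi = min(series), max(series)
    if hi == lo:
        raise ValueError("series must not be constant")
    scaled = [2 * (x - lo) / (hi - lo) - 1 for x in series]
    # Clamp to guard against floating-point values just outside [-1, 1].
    phi = [math.acos(max(-1.0, min(1.0, x))) for x in scaled]
    gasf = [[math.cos(a + b) for b in phi] for a in phi]
    gadf = [[math.sin(a - b) for b in phi] for a in phi]
    return gasf, gadf


speed = [0.0, 0.5, 1.0, 0.25]  # hypothetical pen-speed samples
gasf, gadf = gramian_angular_fields(speed)
print(len(gasf), len(gasf[0]))  # one row and one column per time step
```

GASF is symmetric while GADF is antisymmetric, so the difference field keeps the direction of each temporal transition. Three kinematic channels with two fields each would give the six image channels the abstract describes.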


[17] MAEPose: Self-Supervised Spatiotemporal Learning for Human Pose Estimation on mmWave Video cs.CV | cs.AI

Xijia Wei, Yuan Fang, Kevin Chetty, Youngjun Cho, Nadia Bianchi-Berthouze

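TL;DR: MAEPose is a masked-autoencoding-based self-supervised spatiotemporal learning method for human pose estimation on mmWave video. It learns motion-aware general representations directly from mmWave spectrogram videos and predicts multi-frame poses with a heatmap decoder.

Details

Motivation: Existing mmWave radar pose estimation methods rely on pre-extracted intermediate representations (such as sparse point clouds or spectrogram images), discarding the rich spatiotemporal information in radar video streams. Most are also trained end-to-end with supervision and do not use unlabelled raw video streams to learn general representations.

Result: Evaluated on three datasets with leave-one-person-out cross-validation and rigorous statistical testing, MAEPose consistently outperforms state-of-the-art baselines by up to 22.1% in MPJPE (p < 0.05) and stays robust under zero-shot bystander interference, with only a 6.5% increase in error. Ablations confirm that both pre-training and the heatmap decoder contribute substantially, and modality analysis shows that Range-Doppler video as input gives better pose estimation at lower computational cost than Range-Azimuth or their fusion.

Insight: The novelty is applying masked autoencoding to mmWave video streams, learning spatiotemporal representations from raw data in a self-supervised way and avoiding the information loss and system complexity of intermediate representations. The work also shows the potential of unlabelled data for learning general representations in privacy-preserving settings, and the benefit of a specific input modality (Range-Doppler).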

Abstract: Millimetre-wave (mmWave) radar offers a more privacy-preserving alternative to RGB-based human pose estimation. However, existing methods typically rely on pre-extracted intermediate representations such as sparse point clouds or spectrogram images, where the rich spatiotemporal information naturally present in radar video streams is discarded for model learning, while such signal processing adds system complexity. In addition, existing solutions are mainly conducted in an end-to-end supervised manner without leveraging unlabelled raw video streams to learn generalized representations. In this study, we present MAEPose, a masked autoencoding-based human pose estimation approach that operates directly on mmWave spectrogram videos. MAEPose learns spatiotemporal motion-aware generalized representations from unlabelled radar video, and leverages its heatmap decoder for multi-frame pose estimation predictions. We evaluate it across three datasets based on leave-one-person-out cross-validation with rigorous statistical testing. MAEPose consistently outperforms state-of-the-art baselines by up to 22.1% in MPJPE (p < 0.05), and maintains robust accuracy under zero-shot bystander interference with only a 6.5% error increase. Ablation studies confirm that both the pre-training and the heatmap decoder contribute substantially, while modality analysis indicates that leveraging Range-Doppler video as input achieves better pose estimation performance than Range-Azimuth or their fusion, with lower computational cost.
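
MPJPE (mean per-joint position error) is the average Euclidean distance between predicted and ground-truth joint positions. A minimal sketch of the standard definition follows; the joint coordinates are made up for illustration, and any alignment or unit conventions used in the paper are not covered.

```python
import math


def mpjpe(pred, target):
    """Mean per-joint position error: average Euclidean distance over joints."""
    if len(pred) != len(target) or not pred:
        raise ValueError("pred and target must be non-empty and equal length")
    return sum(math.dist(p, t) for p, t in zip(pred, target)) / len(pred)


# Two hypothetical 3-D joints: the first is off by 5 units, the second is exact.
pred = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
target = [(0.0, 3.0, 4.0), (1.0, 0.0, 0.0)]
print(mpjpe(pred, target))  # (5.0 + 0.0) / 2 = 2.5
```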


[18] An End-to-End Decision-Aware Multi-Scale Attention-Based Model for Explainable Autonomous Driving cs.CV | cs.RO

Maryam Sadat Hosseini Azad, Shahriar Baradaran Shokouhi, Amir Abbas Hamidi Imani, Shahin Atakishiyev, Randy Goebel

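TL;DR: This paper proposes an end-to-end, decision-aware, multi-scale attention-based model for explainable autonomous driving. Driving decisions are fed into the reasoning component to provide a case-specific explanation for each decision. The model is evaluated quantitatively with the F1 score and a newly proposed Joint F1 score, and its generalization and robustness are validated on the BDD-OIA and nu-AR datasets, where it outperforms classic and SOTA models.

Details

Motivation: The black-box nature of deep learning models makes their decision processes hard to explain, which limits reliable deployment in safety-critical applications such as autonomous driving. Existing explanation methods suffer from flawed reasoning and unreliable metrics, hindering a comprehensive understanding of complex models and the development of truly reliable systems.

Result: Experiments on the BDD-OIA and nu-AR datasets show accurate and reliable performance in terms of Explainable Artificial Intelligence (XAI), with the reasoning network outperforming classic and state-of-the-art (SOTA) models.

Insight: The novelty is an end-to-end, decision-aware, multi-scale attention architecture that integrates driving decisions directly into the reasoning process to generate case-specific explanations, together with a new evaluation metric, the Joint F1 score, for quantifying explanation performance. Both aim to improve the model's explainability and reliability.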

Abstract: The application of computer vision is gradually increasing across various domains. These applications employ deep learning models with a black-box nature. Without the ability to explain the behavior of neural networks, especially their decision-making processes, it is not possible to recognize their efficiency, predict system failures, or effectively implement them in real-world applications. Due to the inevitable use of deep learning in fully automated driving systems, many methods have been proposed to explain their behavior; however, they suffer from flawed reasoning and unreliable metrics, which have prevented a comprehensive understanding of complex models in autonomous vehicles and hindered the development of truly reliable systems. In this study, we propose a multi-scale attention-based model in which driving decisions are fed into the reasoning component to provide case-specific explanations for each decision simultaneously. For quantitative evaluation of our model's performance, we employ the F1-score metric, and also propose a new metric called the Joint F1 score to demonstrate the accurate and reliable performance of the model in terms of Explainable Artificial Intelligence (XAI). In addition to the BDD-OIA dataset, the nu-AR dataset is utilized to further validate the generalization capability and robustness of the proposed network. The results demonstrate the superiority of our reasoning network over the classic and state-of-the-art models.


[19] Efficient Spatio-Temporal Vegetation Pixel Classification with Vision Transformers cs.CV

Alan Gomes, Anderson Gonçalves, Samuel Felipe dos Santos, Nathan Felipe Alves, Magna Soelma Beserra de Moura

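TL;DR: Addressing the computational challenge of identifying plant species across time in plant phenology monitoring, this paper proposes an efficient Vision Transformer (ViT)-based method for spatio-temporal vegetation pixel classification. It systematically analyzes seven key design dimensions (such as data normalization, spatial context window, and positional encoding) and validates the method on two datasets from the Brazilian Cerrado biome.

Details

Motivation: Existing methods based on multi-branch convolutional networks (CNNs) are computationally inefficient on long time series, have parameter complexity that grows linearly with sequence length, and need large spatial context windows, which makes them ill-suited to resource-constrained phenological monitoring systems.

Result: On the Serra do Cipó (aerial imagery) and Itirapina (near-surface imagery) datasets from the Brazilian Cerrado biome, the proposed ViT approach maintains competitive classification performance while substantially improving computational efficiency: floating point operations (FLOPs) drop by an order of magnitude, and parameter complexity stays constant regardless of sequence length (the CNN baseline scales linearly).

Insight: The novelty is the systematic optimization of ViT design dimensions for spatio-temporal vegetation classification (normalization, window shape, tokenization strategy, and so on), which yields a substantial efficiency gain and constant parameter complexity and offers a scalable option for resource-constrained phenological monitoring. Viewed objectively, pairing the ViT attention mechanism with the characteristics of spatio-temporal data, and validating the key design choices through ablation, is an effective route to better efficiency.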

Abstract: Plant phenology, the study of recurrent life cycle events, is essential for understanding ecosystem dynamics and their responses to climate change impacts. While Unmanned Aerial Vehicles (UAVs) and near-surface cameras enable high-resolution monitoring, identifying plant species across time remains computationally challenging. State-of-the-art approaches, specifically Multi-Temporal Convolutional Networks (CNNs), rely on rigid multi-branch architectures that scale poorly with longer time series and require large spatial context windows. In this paper, we present an extensive study on optimizing Vision Transformers (ViTs) for efficient spatio-temporal vegetation pixel classification. We conducted a comprehensive ablation study analyzing seven key design dimensions, including: (i) data normalization; (ii) spectral arrangement; (iii) boundary handling; (iv) spatial context window shape and size; (v) tokenization strategies; (vi) positional encoding; and (vii) feature aggregation strategies. Our method was evaluated on two datasets from the Brazilian Cerrado biome, Serra do Cipó (aerial imagery) and Itirapina (near-surface imagery). Experimental results demonstrate that our ViT approach offers a substantial improvement in computational efficiency while maintaining competitive classification performance. Notably, our ViT reduces Floating Point Operations (FLOPs) by an order of magnitude and maintains constant parameter complexity regardless of the time series length, whereas the CNN baseline scales linearly. Our findings confirm that ViTs are a robust, scalable solution for resource-constrained phenological monitoring systems.


[20] Online Self-Calibration Against Hallucination in Vision-Language Models cs.CV | cs.LG

Minghui Chen, Chenxu Yang, Hengjie Zhu, Dayan Wu, Zheng Lin

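TL;DR: This paper proposes OSCAR, a framework that tackles hallucination in large vision-language models through online self-calibration. It uses the model's own discriminative ability to construct preference data, refines the model iteratively using Monte Carlo Tree Search and a dual-granularity reward mechanism, and achieves SOTA performance on hallucination benchmarks.

Details

Motivation: Existing preference alignment methods rely on offline supervision from stronger models, which forces the student model to align with details beyond its perceptual capacity and creates a supervision-perception mismatch. A reliable online self-supervised learning mechanism is therefore needed.

Result: OSCAR achieves SOTA performance on hallucination benchmarks while also improving general multimodal capabilities.

Insight: The method exploits the generative-discriminative gap inside large vision-language models: it builds online self-supervised preference data with Monte Carlo Tree Search and a dual-granularity reward mechanism, then calibrates the model iteratively with Direct Preference Optimization, which avoids dependence on strong external models.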

Abstract: Large Vision-Language Models (LVLMs) often suffer from hallucinations, generating descriptions that include visual details absent from the input image. Recent preference alignment methods typically rely on supervision distilled from stronger models such as GPT. However, this offline paradigm introduces a Supervision-Perception Mismatch: the student model is forced to align with fine-grained details beyond its perceptual capacity, learning to guess rather than to see. To obtain reliable self-supervision for online learning, we identify a Generative-Discriminative Gap within LVLMs, where models exhibit higher accuracy on discriminative verification than open-ended generation. Leveraging this capability, we propose Online Self-CAlibRation (OSCAR), a framework that integrates Monte Carlo Tree Search with a Dual-Granularity Reward Mechanism to construct preference data and iteratively refines the model via Direct Preference Optimization. Extensive experiments demonstrate that OSCAR achieves state-of-the-art performance on hallucination benchmarks while improving general multimodal capabilities.


[21] Pose-Aware Diffusion for 3D Generation cs.CV

Zihan Zhou, Luxi Chen, Jingzhi Zhou, Yuhao Wan, Min Zhao

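TL;DR: The paper proposes Pose-Aware Diffusion (PAD), an end-to-end diffusion framework that synthesizes pose-aligned 3D geometry directly in the observation space, avoiding the spatial mismatches and transformation ambiguities of the conventional decoupled paradigm. It strengthens spatial supervision by unprojecting monocular depth into a partial point cloud that serves as a 3D geometric anchor, producing high-fidelity pose-aligned 3D assets, and it extends to compositional 3D scene reconstruction.

Details

Motivation: Generating pose-aligned 3D objects is difficult because of the spatial mismatches and transformation ambiguities in the conventional decoupled (canonical-then-rotate) paradigm. The paper aims to synthesize 3D geometry directly in the observation space to improve alignment accuracy.

Result: In extensive experiments, PAD outperforms state-of-the-art (SOTA) methods in geometric alignment and image-to-3D correspondence, and demonstrates robust compositional 3D scene reconstruction through a simple union of independently generated objects.

Insight: The novelty includes abandoning canonical assumptions, generating 3D geometry directly in the observation space, and injecting explicit spatial supervision by using unprojected monocular depth as a 3D geometric anchor, which intrinsically resolves pose ambiguity. Viewed objectively, the end-to-end framework and the spatial anchor design can improve both the alignment accuracy and the scalability of 3D generation.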

Abstract: Generating pose-aligned 3D objects is challenging due to the spatial mismatches and transformation ambiguities inherent in decoupled canonical-then-rotate paradigms. To this end, we introduce Pose-Aware Diffusion (PAD), a novel end-to-end diffusion framework that synthesizes 3D geometry directly within the observation space. By unprojecting monocular depth into a partial point cloud and explicitly injecting it as a 3D geometric anchor, PAD abandons canonical assumptions to enforce rigorous spatial supervision. This native generation intrinsically resolves pose ambiguity, producing high-fidelity pose-aligned assets. Extensive experiments demonstrate that PAD achieves superior geometric alignment and image-to-3D correspondence compared to state-of-the-art methods. Additionally, PAD naturally extends to compositional 3D scene reconstruction via a simple union of independently generated objects, highlighting its robust ability to preserve precise spatial layouts.


[22] RTPrune: Reading-Twice Inspired Token Pruning for Efficient DeepSeek-OCR Inference cs.CV | cs.LG

Ben Wan, Yan Feng, Zihan Tang, Weizhe Huang, Yuting Zeng

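TL;DR: This paper proposes RTPrune, a two-stage token pruning method designed to improve the inference efficiency of DeepSeek-OCR. Inspired by the two-stage attention trajectory observed during decoding, the first stage keeps high-norm tokens to capture key textual and structural information, and the second stage pairs and merges the remaining tokens using optimal transport theory. A dynamic pruning ratio adapts the method to OCR tasks. On the OmniDocBench benchmark it reaches 99.47% accuracy and 1.23× faster prefill while retaining only 84.25% of the tokens, achieving state-of-the-art performance.

Details

Motivation: Existing token pruning methods for vision-language models fail to preserve textual fidelity because of improper compression mechanisms, and the visual tokens in DeepSeek-OCR still carry redundant textual and structural information. Analysis of the decoding process reveals a two-stage trajectory in which the model first attends to high-norm tokens and then redistributes attention to the remaining ones, which motivates a targeted pruning design.

Result: On the OmniDocBench benchmark, applied to DeepSeek-OCR-Large, RTPrune achieves 99.47% accuracy and 1.23× faster prefill at 84.25% token retention, reaching state-of-the-art performance.

Insight: The novelty is a two-stage pruning strategy inspired by the model's internal two-stage reading trajectory, combined with token merging based on optimal transport theory. A dynamic pruning ratio for OCR tasks adapts to token similarity and textual density, giving a better efficiency-accuracy trade-off.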

Abstract: DeepSeek-OCR leverages visual-text compression to reduce long-text processing costs and accelerate inference, yet visual tokens remain prone to redundant textual and structural information. Moreover, current token pruning methods for conventional vision-language models (VLMs) fail to preserve textual fidelity due to improper compression mechanisms. By analyzing the decoding process of DeepSeek-OCR, we find a distinct two-stage reading trajectory: the model initially prioritizes the majority of high-norm tokens, then subsequently redistributes its attention to the remaining ones. Motivated by this insight, we propose RTPrune, a two-stage token pruning method tailored for DeepSeek-OCR. In the first stage, we prioritize high-norm visual tokens that capture salient textual and structural information. In the second stage, the remaining tokens are paired and merged based on optimal transport theory to achieve efficient feature aggregation. We further introduce a dynamic pruning ratio that adapts to token similarity and textual density for OCR tasks, enabling a better efficiency-accuracy trade-off. Extensive experiments demonstrate state-of-the-art performance, as evidenced by 99.47% accuracy and 1.23× faster prefill on OmniDocBench, achieved with 84.25% token retention when applied to DeepSeek-OCR-Large.
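
The first-stage idea of keeping high-norm visual tokens can be illustrated with a minimal sketch. This is not the RTPrune implementation: the function name `keep_high_norm_tokens` and the fixed `keep_ratio` parameter are hypothetical, the abstract's dynamic pruning ratio is replaced by a constant, and the second-stage optimal-transport merging is not shown.

```python
import math


def keep_high_norm_tokens(tokens, keep_ratio):
    """Return indices of the tokens with the largest L2 norm.

    `tokens` is a list of feature vectors. `keep_ratio` is a hypothetical
    fixed fraction in (0, 1]; the paper describes a dynamic ratio instead.
    Indices are returned in their original order so that the spatial
    ordering of the kept tokens is preserved.
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    n_keep = max(1, math.ceil(len(tokens) * keep_ratio))
    norms = [math.sqrt(sum(x * x for x in tok)) for tok in tokens]
    ranked = sorted(range(len(tokens)), key=lambda i: norms[i], reverse=True)
    return sorted(ranked[:n_keep])
```

In a two-stage scheme of this kind, the indices not returned here would be the candidates for pairing and merging in the second stage.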


[23] SIMON: Saliency-aware Integrative Multi-view Object-centric Neural Decoding cs.CV | q-bio.NC

YuSheng Lin, Ji-Hwa Tsai, Chun-Shu Wei

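TL;DR: This paper proposes SIMON, a saliency-aware multi-view framework for zero-shot EEG-to-image retrieval. It selects fixation centers through saliency-aware sampling and generates foveated views that emphasize object regions and suppress background, achieving state-of-the-art performance on the THINGS-EEG dataset.

Details

Motivation: Existing EEG-to-image retrieval methods usually assume a fixed, center-focused view, which conflicts with content-driven human attention and creates a geometric-semantic dissociation between visual features and EEG responses.

Result: On THINGS-EEG, SIMON reaches state-of-the-art performance in both intra-subject and inter-subject settings, with average Top-1 accuracy of 69.7% and 19.6% respectively, consistently outperforming recent competitive baselines.

Insight: The novelty is combining foreground segmentation and saliency prediction for saliency-aware sampling, which generates multi-view foveated views that better align visual content with EEG responses. Viewed objectively, the multi-view integration strengthens the model's robustness.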

Abstract: Recent EEG-to-image retrieval methods leverage pretrained vision encoders and foveation-inspired priors, but typically assume a fixed, center-focused view. This center bias conflicts with content-driven human attention, creating a geometric-semantic dissociation between visual features and EEG responses. We propose SIMON, a saliency-aware multi-view framework for zero-shot EEG-to-image retrieval. SIMON combines foreground segmentation and saliency prediction to select fixation centers via Saliency-Aware Sampling (SAS), then generates foveated views that emphasize informative object regions while suppressing background clutter. On THINGS-EEG, SIMON achieves state-of-the-art performance in both intra-subject and inter-subject settings, reaching an average Top-1 accuracy of 69.7% and 19.6%, respectively, consistently outperforming recent competitive baselines. Analyses across sampling granularity, EEG channel topology, and visual/brain encoder backbones further support the robustness of saliency-aware multi-view integration. Our code and models are publicly available at https://github.com/simonlink666/SIMON.


[24] Beyond Heuristics: Learnable Density Control for 3D Gaussian Splatting cs.CV

Zhenhua Ning, Xin Li, Jun Yu, Guangming Lu, Yaowei Wang

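TL;DR: This paper proposes LeGS, a framework that turns density control in 3D Gaussian Splatting from heuristic rules into a fully learnable policy. A parameterized policy network is optimized with reinforcement learning, using a reward function grounded in sensitivity analysis that quantifies the marginal contribution of individual Gaussians to reconstruction quality. The method significantly outperforms existing approaches on multiple datasets and strikes a better balance between reconstruction quality and efficiency.

Details

Motivation: Existing 3D Gaussian Splatting methods rely on heuristic density control rules that lack the flexibility to adapt to scenes with complex geometry, which limits their performance.

Result: Experiments on the Mip-NeRF 360, Tanks & Temples, and Deep Blending datasets show that LeGS significantly outperforms existing SOTA methods, with a better balance between reconstruction quality and efficiency.

Insight: The novelty is reformulating density control as a policy network optimized via reinforcement learning and designing an efficient reward function based on sensitivity analysis (reward computation drops from O(N²) to O(N)), moving from handcrafted rules to a learnable paradigm.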

Abstract: While 3D Gaussian Splatting (3DGS) has demonstrated impressive real-time rendering performance, its efficacy remains constrained by a reliance on heuristic density control. Despite numerous refinements to these handcrafted rules, such methods inherently lack the flexibility to adapt to diverse scenes with complex geometries. In this paper, we propose a paradigm shift for density control from rigid heuristics to fully learnable policies. Specifically, we introduce LeGS, a framework that reformulates density control as a parameterized policy network optimized via Reinforcement Learning (RL). Central to our approach is the tailored effective reward function grounded in sensitivity analysis, which precisely quantifies the marginal contribution of individual Gaussians to reconstruction quality. To maintain computational tractability, we derive a closed-form solution that reduces the complexity of reward calculation from $O(N^2)$ to $O(N)$. Extensive experiments on the Mip-NeRF 360, Tanks & Temples, and Deep Blending datasets demonstrate that LeGS significantly outperforms state-of-the-art methods, striking a superior balance between reconstruction quality and efficiency. The code will be released at https://github.com/AaronNZH/LeGS


[25] LIMSSR: LLM-Driven Sequence-to-Score Reasoning under Training-Time Incomplete Multimodal Observations cs.CV

Huangbiao Xu, Huanqi Wu, Xiao Ke, Yuxin Peng

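TL;DR: This paper proposes LIMSSR, a framework for incomplete multimodal learning when modalities are missing at training time. It reformulates the problem as a conditional sequence reasoning task and uses the semantic reasoning ability of large language models, through prompt-guided context-aware modality imputation and multidimensional representation fusion, to infer latent semantics from the available context without direct reconstruction. Experiments on three action quality assessment datasets show that LIMSSR significantly outperforms existing methods.

Details

Motivation: Real-world multimodal learning is often hindered by missing modalities. Existing incomplete multimodal learning (IML) methods typically rely on the unrealistic assumption that all modalities are available during training. This paper addresses the harder setting of IML under training-time incomplete observations, which rules out a "God's eye view" of complete data.

Result: Extensive experiments on three action quality assessment datasets show that LIMSSR significantly outperforms state-of-the-art baselines without relying on complete training data.

Insight: The novelty includes reformulating training-time incomplete multimodal learning as a conditional sequence reasoning task; using an LLM for prompt-guided context-aware modality imputation and multidimensional representation fusion; and introducing mask-aware dual-path aggregation to dynamically calibrate inference uncertainty and mitigate hallucinations. The authors present this as a new paradigm for data-efficient multimodal learning.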

Abstract: Real-world multimodal learning is often hindered by missing modalities. While Incomplete Multimodal Learning (IML) has gained traction, existing methods typically rely on the unrealistic assumption of full-modal availability during training to provide reconstruction supervision or cross-modal priors. This paper tackles the more challenging setting of IML under training-time incomplete observations, which precludes reliance on a "God's eye view" of complete data. We propose LIMSSR (LLM-Driven Incomplete Multimodal Sequence-to-Score Reasoning), a framework that reformulates this challenge as a conditional sequence reasoning task. LIMSSR leverages the semantic reasoning capabilities of Large Language Models via Prompt-Guided Context-Aware Modality Imputation and Multidimensional Representation Fusion to infer latent semantics from available contexts without direct reconstruction. To mitigate hallucinations, we introduce a Mask-Aware Dual-Path Aggregation to dynamically calibrate inference uncertainty. Extensive experiments on three Action Quality Assessment datasets demonstrate that LIMSSR significantly outperforms state-of-the-art baselines without relying on complete training data, establishing a new paradigm for data-efficient multimodal learning. Code is available at https://github.com/XuHuangbiao/LIMSSR.


[26] Scaling Video Understanding via Compact Latent Multi-Agent Collaboration cs.CV

Kerui Chen, Jinglu Wang, Jianrong Zhang, Ming Li, Yan Lu

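TL;DR: This paper proposes MACF (Multi-Agent Collaboration Framework), an end-to-end video understanding method. It partitions a video into segments for local agents with limited perception budgets and performs global reasoning through a latent communication protocol, addressing the limited context budget that multi-modal large language models face on long-video tasks.

Details

Motivation: Existing multi-agent methods built on rule-based preprocessing suffer from information loss, high cost, and reliance on textual intermediates in long-video understanding. This paper aims to design a scalable video understanding framework that preserves visual fidelity.

Result: Extensive experiments on diverse video understanding benchmarks show that MACF consistently outperforms state-of-the-art multi-modal large language models and multi-agent systems under identical budget constraints.

Insight: The novelty is decoupling global video complexity from per-agent perception budgets, a curriculum training strategy that progressively enforces semantic alignment, evidence summarization, and cross-agent coordination, and the use of compact, task-sufficient latent tokens in a shared embedding space for efficient, information-preserving collaboration.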

Abstract: Multi-modal large language models (MLLMs) advance vision language understanding but face inherent limitations in long-video tasks due to bounded perception context budgets. Existing agentic methods mitigate this via rule-based preprocessing, yet often suffer from information loss, high cost, and reliance on textual intermediates. We propose MACF, an end-to-end Multi-Agent Collaboration Framework that decouples per-agent perception budgets from global video complexity, enabling scalable video understanding while preserving visual fidelity. MACF partitions videos into segments for locally budgeted agents and enables holistic reasoning via an agent-native latent communication protocol. Each agent encodes partial observations into compact, task-sufficient tokens in a shared embedding space, allowing efficient and information-preserving collaboration by a central coordinator. We introduce a curriculum training strategy that progressively enforces semantic alignment, evidence summarization, and cross-agent coordination. Extensive experiments on diverse video understanding benchmarks show that MACF consistently outperforms state-of-the-art MLLMs and multi-agent systems under identical budget constraints, demonstrating the effectiveness of our latent collaboration for scalable video understanding.


[27] Learning from Compressed CT: Feature Attention Style Transfer and Structured Factorized Projections for Resource-Efficient Medical Image Analysis cs.CV | eess.IV

Shadid Yousuf, S. M. Mahbubur Rahman, Mohammed Imamul Hassan Bhuiyan

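TL;DR: This paper proposes a resource-efficient approach to medical image analysis that uses JPEG-compressed chest CT volumes for thoracic abnormality detection. It introduces the Feature Attention Style Transfer (FAST) distillation framework, which transfers activation patterns and structural relationships from high-fidelity CT representations to a spatiotemporal visual encoder operating on compressed inputs, together with Structured Factorized Projection (SFP) to reduce parameters. The resulting CT-Lite contrastive learning pipeline achieves performance on compressed inputs close to that of the uncompressed-input baseline.

Details

Motivation: Deploying medical imaging AI is hampered by the high computational complexity and resource demands of processing uncompressed volumetric data (for example in NIfTI or DICOM format). The paper explores efficient diagnosis from compressed CT data in resource-constrained settings.

Result: Experiments on the CT-RATE, NIDCH, and Rad-ChestCT datasets show that CT-Lite, operating on compressed inputs with significantly fewer parameters, achieves AUROC within 5-7% of the uncompressed-input baseline on all three.

Insight: The novelty includes the FAST framework, which extracts robust features from degraded volumes through Gram-matrix-based attention style preservation and dual-attention feature alignment, and SFP, which uses Block Tensor Train decomposition as a parameter-efficient alternative to dense projection layers and cuts projection-head parameters by almost half. Overall, the method offers a viable path for AI-based clinical evaluation under resource constraints.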

Abstract: The deployment of artificial intelligence in medical imaging is hindered by high computational complexity and resource-intensive processing of volumetric data. Although chest computed tomography (CT) volumes offer richer diagnostic information than projection radiography, their use in AI-based diagnosis remains limited due to the computational burden of processing uncompressed volumetric images (typically stored in NIfTI or DICOM format). Addressing the growing need for low-resource deployment and efficient electronic data transfer, we investigate the utilization of JPEG-compressed chest CT volumes for thoracic abnormality detection. We propose Feature Attention Style Transfer (FAST), a novel distillation framework that transfers both activation patterns and structural relationships from high-fidelity CT representations to a spatiotemporal visual encoder operating on compressed inputs. By combining Gram-matrix-based attention style preservation with dual-attention feature alignment, FAST enables robust feature extraction from degraded volumes. Furthermore, we introduce Structured Factorized Projection (SFP), leveraging Block Tensor Train decomposition as a parameter-efficient alternative to dense projection layers, reducing projection-head parameters by almost half. Our contrastive learning pipeline, CT-Lite, integrates these components with a SigLIP-based multimodal alignment objective. Experiments on CT-RATE, NIDCH, and Rad-ChestCT demonstrate that CT-Lite achieves AUROC within 5-7% of the uncompressed-input baseline across all three datasets, despite operating on compressed inputs with significantly fewer parameters, paving the way for AI-based clinical evaluation under resource constraints.


[28] From Local to Global to Mechanistic: An iERF-Centered Unified Framework for Interpreting Vision Models cs.CV

Yearim Kim, Sangyu Han, Nojun Kwak

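TL;DR: This paper proposes a framework centered on the instance-specific Effective Receptive Field (iERF) that unifies local, global, and mechanistic interpretability of vision models. Using the pointwise feature vector (PFV) and its iERF as the basic unit of analysis, it introduces Sharing Ratio Decomposition (SRD), which generates high-resolution, activation-faithful saliency maps; Concept-Anchored Feature Explanation (CAFE), which links abstract latent vectors to pixel-level evidence; and the Interlayer Concept Graph with Interlayer Concept Attribution (ICAT), which quantifies concept-to-concept influence and reveals how representations are composed.

Details

Motivation: Modern vision models are highly accurate, but explanations of where their evidence arises, what they encode, and how internal computation assembles that evidence remain fragmented. This paper aims to explain models coherently at the local, global, and mechanistic levels within one unified framework, addressing the fragmentation of interpretation methods.

Result: Experiments on ResNet50, VGG16, and ViTs show that the framework outperforms baselines in both fidelity and robustness, successfully interprets dispersed sparse autoencoder features in Transformers, and exposes dominant concept routes in correct, misclassified, and adversarial cases.

Insight: The novelty is unifying different levels of explanation around the iERF and PFV. SRD provides activation-faithful and robust saliency maps, CAFE addresses the semantic grounding of non-localized sparse latent vectors, and ICAT quantifies interlayer concept influence. The framework provides a coherent, evidence-backed map from pixels to concepts to decisions.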

Abstract: Modern vision models achieve remarkable accuracy, but explaining where evidence arises, what the model encodes, and how internal computations assemble that evidence remains fragmented. We introduce an iERF-centric framework that unifies local, global, and mechanistic interpretability around a single analysis unit: the pointwise feature vector (PFV) paired with its instance-specific Effective Receptive Field (iERF). On the local side, Sharing Ratio Decomposition (SRD) expresses each PFV as a mixture of upstream PFVs via sharing ratios and propagates iERFs to construct class-discriminative saliency maps. SRD yields high-resolution, activation-faithful explanations, is robust to targeted manipulation and noise, and remains activation-agnostic across common nonlinearities. For the global view, we introduce Concept-Anchored Feature Explanation (CAFE), which utilizes the iERF as a semantic label, grounding abstract latent vectors in verifiable pixel-level evidence. With CAFE, we address the challenge of non-localized sparse autoencoder latents, especially in Transformers, where early self-attention mixes distant context. To answer how representations are composed through depth, we propose the Interlayer Concept Graph with Interlayer Concept Attribution (ICAT), which quantifies concept-to-concept influence while isolating layer pairs; an interlayer insertion/deletion protocol identifies Integrated Gradients as the most faithful instantiation. Empirically, across ResNet50, VGG16, and ViTs, our framework outperforms baselines in both fidelity and robustness, successfully interprets dispersed SAE features, and exposes dominant concept routes in correct, misclassified, and adversarial cases. Grounded in iERFs, our approach provides a coherent, evidence-backed map from pixels to concepts to decisions.


[29] Leveraging Vision-Language Models as Weak Annotators in Active Learning cs.CV

Phuong Ngoc Nguyen, Kaito Shiku, Ryoma Bise, Seiichi Uchida, Shinnosuke Matsuo

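TL;DR: This paper proposes an active learning framework that uses vision-language models (VLMs) as weak annotators to reduce reliance on costly human annotation. It finds that in fine-grained recognition tasks VLMs are accurate on coarse-grained labels but perform poorly on fine-grained ones. The framework therefore combines fine-grained human annotations with coarse-grained VLM-generated weak labels through instance-wise label assignment, and models the systematic noise in VLM-generated labels using a small set of trusted full labels.

Details

Motivation: Active learning aims to reduce annotation cost by selectively querying informative samples under a limited labeling budget. This paper explores how VLMs can further reduce the reliance on costly human annotation within active learning.

Result: Experiments on the CUB200 and FGVC-Aircraft datasets show that the proposed framework consistently outperforms existing active learning methods under the same annotation budget.

Insight: The novelty is treating VLMs as a weak annotation source where they are reliable, namely on coarse-grained labels, combining fine-grained human annotations with coarse-grained weak labels, and modeling the systematic noise of the VLM, which improves the efficiency of active learning. Viewed objectively, it offers a low-cost way to use pre-trained VLMs to strengthen annotation strategies.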

Abstract: Active learning aims to reduce annotation cost by selectively querying informative samples for supervision under a limited labeling budget. In this work, we investigate how vision-language models (VLMs) can be leveraged to further reduce the reliance on costly human annotation within the active learning paradigm. To this end, we find that the reliability of VLMs varies significantly with label granularity in fine-grained recognition tasks: they perform poorly on fine-grained labels but can provide accurate coarse-grained labels. Leveraging this property, we propose an active learning framework that combines fine-grained human annotations with coarse-grained VLM-generated weak labels through instance-wise label assignment. We further model the systematic noise in VLM-generated labels using a small set of trusted full labels. Experiments on CUB200 and FGVC-Aircraft show that the proposed framework consistently outperforms existing active learning methods under the same annotation budget.


[30] High-Speed Vision Improves Zero-Shot Semantic Understanding of Human Actions cs.CV | cs.RO

Yongpeng Cao, Yuji Yamakawa

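TL;DR: This paper studies how high temporal resolution (high-frame-rate) video affects zero-shot semantic understanding of human actions. Using kendo as a case of rapid, fine-grained motion, it proposes a training-free pipeline that combines a pre-trained video-language model with large language model-based pairwise action comparison. Experiments show that higher temporal resolution (for example 120 Hz) significantly improves the semantic separability of actions in zero-shot settings and gives more stable, interpretable representations.

Details

Motivation: In human-robot interaction scenarios that require semantic understanding of unfamiliar or hard-to-annotate actions, collecting enough data for supervised learning is difficult, which makes zero-shot approaches a practical alternative. However, the effect of temporal resolution on zero-shot reasoning with large-scale pre-trained models, especially for rapid, fine-grained actions, remains underexplored.

Result: In controlled experiments across multiple frame rates (120 Hz, 60 Hz, and 30 Hz), quantitative evaluation with a nearest-class prototype strategy shows that high-speed video (for example 120 Hz) provides more stable and interpretable semantic representations for fast actions and significantly improves semantic separability in zero-shot settings.

Insight: The novelty is a systematic study of how temporal resolution affects zero-shot semantic understanding of actions, together with a training-free pipeline that combines a video-language model with LLM reasoning. Viewed objectively, the study underlines the importance of high temporal resolution perception for training-free action recognition and offers a new perspective on handling rapid, fine-grained actions.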

Abstract: Understanding human actions from visual observations is essential for human–robot interaction, particularly when semantic interpretation of unfamiliar or hard-to-annotate actions is required. In scenarios such as rapid and less common activities, collecting sufficient labeled data for supervised learning is challenging, making zero-shot approaches a practical alternative for semantic understanding without task-specific training. While recent advances in large-scale pretrained models enable such zero-shot reasoning, the impact of temporal resolution, especially for rapid and fine-grained motions, remains underexplored. In this study, we investigate how temporal resolution affects zero-shot semantic understanding of high-speed human actions. Using kendo as a representative case of rapid and subtle motion patterns, we propose a training-free pipeline that combines a pre-trained video-language model for semantic representation with large language model-based reasoning for pairwise action comparison. Through controlled experiments across multiple frame rates (120 Hz, 60 Hz, and 30 Hz), we show that higher temporal resolution significantly improves semantic separability in zero-shot settings. We further analyze the role of tracking-based human joint information under both full and partial observation scenarios. Quantitative evaluation using a nearest-class prototype strategy demonstrates that high-speed video provides more stable and interpretable semantic representations for fast actions. These findings highlight the importance of temporal resolution in training-free action recognition and suggest that high-speed perception can enhance semantic understanding capabilities.
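
A nearest-class prototype strategy, in its generic form, averages the embeddings of each class into a prototype and assigns a query to the class with the most similar prototype. The sketch below illustrates that generic idea only; the use of cosine similarity, the function names, and the dictionary input layout are assumptions, since the abstract does not specify them.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of equally sized vectors (the class prototype)."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def nearest_prototype(query, examples_by_class):
    """Return the label whose prototype is most similar to `query`.

    `examples_by_class` maps a label to a list of embedding vectors;
    this layout and the cosine measure are illustrative assumptions.
    """
    prototypes = {label: mean_vector(vs) for label, vs in examples_by_class.items()}
    return max(prototypes, key=lambda label: cosine(query, prototypes[label]))
```

Under such a scheme, better semantic separability shows up as queries landing closer to their own class prototype than to any other.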


[31] End-to-End Autoregressive Image Generation with 1D Semantic Tokenizer cs.CV | cs.LG

Wenda Chu, Bingliang Zhang, Jiaqi Han, Yizhuo Li, Linjie Yang

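TL;DR: This paper proposes an end-to-end autoregressive image generation framework that trains a 1D semantic tokenizer by jointly optimizing reconstruction and generation, and uses vision foundation models to improve the tokenizer. It achieves an FID of 1.48 without guidance on ImageNet 256×256 generation, a state-of-the-art result.

Details

Motivation: In conventional autoregressive image modeling, the tokenizer and the generative model are trained in separate stages, which is suboptimal. The paper improves overall generation quality through end-to-end joint optimization.

Result: On the ImageNet 256×256 image generation benchmark, the model reaches an FID of 1.48 without guidance, which is state of the art.

Insight: The novelty is an end-to-end training scheme in which the generation task directly supervises the tokenizer, along with an exploration of how vision foundation models can strengthen 1D tokenizers, giving a more efficient joint learning paradigm for autoregressive image generation.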

Abstract: Autoregressive image modeling relies on visual tokenizers to compress images into compact latent representations. We design an end-to-end training pipeline that jointly optimizes reconstruction and generation, enabling direct supervision from generation results to the tokenizer. This contrasts with prior two-stage approaches that train tokenizers and generative models separately. We further investigate leveraging vision foundation models to improve 1D tokenizers for autoregressive modeling. Our autoregressive generative model achieves strong empirical results, including a state-of-the-art FID score of 1.48 without guidance on ImageNet 256x256 generation.


[32] IdentiFace: Multi-Modal Iterative Diffusion Framework for Identifiable Suspect Face Generation in Crime Investigations cs.CV

Weichen Liu, Yixin Yang, Changsheng Chen, Alex Kot

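TL;DR: IdentiFace is a multi-modal iterative diffusion framework for generating identifiable suspect faces in crime investigations. Through a multi-modal input design and an iterative generation pipeline, it addresses the low efficiency and quality of traditional sketch workflows and the limitations of existing diffusion models in conditional ambiguity and sampling variance.

Details

Motivation: Suspect face generation remains a technical challenge in crime investigations. Traditional methods are inefficient and low in quality, while diffusion-based methods are inherently limited by the conditional ambiguity of text-to-image models and the sampling variance of one-shot generation.

Result: In comprehensive experiments on synthetic datasets and in real-world scenarios, IdentiFace outperforms existing methods, especially in identity retrieval, and shows strong potential for practical application.

Insight: The novelty includes a multi-modal input design that strengthens conditional control, an iterative generation pipeline that allows identifiable features to be adjusted, and the contribution of a facial identity loss and two task-specific datasets. Viewed objectively, the iterative and multi-modal strategies improve the identifiability and controllability of generated faces.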

Abstract: Suspect face generation remains a technical challenge in crime investigations. Traditional sketch-drawing workflows suffer from low efficiency and quality, while diffusion-based approaches still face intrinsic limitations on conditional ambiguity for text-to-image models and sampling variance for one-shot generation. We propose IdentiFace, a novel diffusion-based framework for identifiable suspect face generation, which addresses these issues through (1) multi-modal input design to strengthen conditional control, and (2) an iterative generation pipeline enabling identifiable feature adjustment. We additionally contribute a facial identity loss and two task-specific datasets. Comprehensive experiments on synthetic datasets and in real-world scenarios indicate that IdentiFace achieves superior performance over existing methods, especially in terms of identity retrieval, and shows strong potential for practical applications.


[33] Jailbreaking Vision-Language Models Through the Visual Modality cs.CV | cs.AI | cs.LG

Aharon Azulay, Jan Dubiński, Zhuoyun Li, Atharv Mittal, Yossi Gandelsman

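TL;DR: This paper explores the visual modality of vision-language models (VLMs) as an understudied attack surface and proposes four jailbreak attacks that work through the vision component: visual symbol ciphers, object replacement, text replacement, and visual analogy puzzles. It finds that text-based safety training does not automatically generalize to harmful intent conveyed visually, which creates a cross-modality alignment gap; the visual cipher, for example, succeeds far more often than an equivalent textual cipher.

Details

Motivation: The paper probes a weak point in VLM safety alignment, namely that the visual modality has been overlooked as an attack vector, and aims to expose the limits of text-based safety training against harmful content conveyed visually.

Result: Evaluation on six frontier VLMs shows that the visual attacks bypass safety alignment. For example, the visual cipher reaches a 40.9% attack success rate on Claude-Haiku-4.5, versus 10.7% for an equivalent textual cipher, which highlights the cross-modality alignment gap.

Insight: The novelty is a systematic treatment of the visual modality as the primary target of jailbreak attacks, several new visual attack methods, and empirical evidence that visual and textual safety alignment in VLMs are inconsistent. Viewed objectively, this underlines the importance of treating vision as a first-class target for safety post-training and gives a direction for more robust alignment.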

Abstract: The visual modality of vision-language models (VLMs) is an underexplored attack surface for bypassing safety alignment. We introduce four jailbreak attacks exploiting the vision component: (1) encoding harmful instructions as visual symbol sequences with a decoding legend, (2) replacing harmful objects with benign substitutes (e.g., bomb -> banana) then prompting for harmful actions using the substitute term, (3) replacing harmful text in images (e.g., on book covers) with benign words while visual context preserves the original meaning, and (4) visual analogy puzzles whose solution requires inferring a prohibited concept. Evaluating across six frontier VLMs, our visual attacks bypass safety alignment and expose a cross-modality alignment gap: text-based safety training does not automatically generalize to harmful intent conveyed visually. For example, our visual cipher achieves 40.9% attack success on Claude-Haiku-4.5 versus 10.7% for an equivalent textual cipher. To further our insight into the attack mechanism, we present preliminary interpretability and mitigation results. These findings highlight that robust VLM alignment requires treating vision as a first-class target for safety post-training.


[34] Intrinsic Gradient Suppression for Label-Noise Prompt Tuning in Vision-Language Models cs.CV

Jiayu Li, Jiaxin Qi, Sheng Zhou, Jiaqiang Huang, Xiansheng Hua

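TL;DR: This paper proposes Double-Softmax Prompt Tuning (DSPT), a hyperparameter-free method that addresses the high sensitivity of prompt tuning in vision-language models (such as CLIP) to label noise. Through sequential probabilistic normalization it induces a self-adaptive saturation zone that suppresses the large gradients produced by noisy samples while keeping informative updates, which improves robustness.

Details

Motivation: Contrastive vision-language models such as CLIP generalize remarkably well zero-shot, but their prompt tuning is very sensitive to label noise, because mislabeled samples generate disproportionately large gradients that can overwhelm the pre-trained priors. Since CLIP already provides a near-optimal initialization, adaptation should be inherently conservative, especially against the extreme gradient updates common in noisy settings.

Result: Extensive experiments on multiple noisy benchmarks show that DSPT achieves state-of-the-art robustness, outperforming methods with complex architectures and handcrafted hyperparameters.

Insight: The novelty is turning "gradient vanishing", traditionally a training bottleneck, into a principled noise-filtering mechanism: the double softmax provides intrinsic gradient suppression without extra hyperparameters. Viewed objectively, the self-adaptive saturation zone is a simple and effective design for prompt tuning under label noise.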

Abstract: Contrastive vision-language models like CLIP exhibit remarkable zero-shot generalization. However, prompt tuning remains highly sensitive to label noise, as mislabeled samples generate disproportionately large gradients that can overwhelm pre-trained priors. We argue that because CLIP already provides a near-optimal initialization, adaptation should be inherently conservative, particularly against the extreme gradient updates common in noisy settings. To this end, we propose Double-Softmax Prompt Tuning (DSPT), a hyperparameter-free method for intrinsic gradient suppression. By applying a sequential probabilistic normalization, DSPT induces a self-adaptive saturation zone that suppresses gradients from high-error noisy samples while maintaining informative updates. We also provide both theoretical analysis and empirical evidence about how this mechanism achieves adaptive suppression. This design transforms "gradient vanishing", traditionally a training bottleneck, into a principled noise-filtering shield for label-noise prompt tuning. Extensive experiments confirm that this simple, drop-in design achieves state-of-the-art robustness across various noisy benchmarks, outperforming methods with complex architectures and handcrafted hyperparameters.


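[35] CMTA: Leveraging Cross-Modal Temporal Artifacts for Generalizable AI-Generated Video Detection cs.CV | cs.MM | eess.IV

Hang Wang, Chao Shen, Chenhao Lin, Minghui Yang, Lei Zhang

TL;DR: This paper proposes CMTA, a generalizable framework for detecting AI-generated video that distinguishes real from AI-generated videos by capturing cross-modal temporal artifacts (CMTA) in the visual-textual space. It uses BLIP to generate frame-level captions, CLIP to extract visual-textual representations, and coarse-grained (GRU) and fine-grained (Transformer) branches to model temporal fluctuations in cross-modal alignment.

Details

Motivation: Existing AI-generated video detection methods focus mainly on uni-modal or spatiotemporal artifacts and overlook the rich cues in the visual-textual cross-modal space, especially the temporal stability of semantic alignment. This paper aims to exploit cross-modal temporal artifacts, a distinctive fingerprint of AI-generated videos, for detection.

Result: Extensive experiments on 40 subsets across four large-scale datasets (GenVideo, EvalCrafter, VideoPhy, and VidProM) show that CMTA sets a new state of the art (SOTA) and exhibits strong cross-generator generalization.

Insight: The novelty is being the first to identify and exploit cross-modal temporal artifacts (CMTA) in AI-generated videos: such videos follow unnaturally stable semantic trajectories governed by their input prompts, whereas real videos show natural temporal fluctuations due to semantic variation. Capturing this with joint cross-modal embedding and multi-grained temporal modeling offers a new approach to generalizable AI-generated video detection.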

Abstract: The proliferation of advanced AI video synthesis techniques poses an unprecedented challenge to digital video authenticity. Existing AI-generated video (AIGV) detection methods primarily focus on uni-modal or spatiotemporal artifacts, but they overlook the rich cues within the visual-textual cross-modal space, especially the temporal stability of semantic alignment. In this work, we identify a distinctive fingerprint in AIGVs, termed cross-modal temporal artifact (CMTA). Unlike real videos that exhibit natural temporal fluctuations in cross-modal alignment due to semantic variations, AIGVs display unnaturally stable semantic trajectories governed by given input prompts. To bridge this gap, we propose the CMTA framework, a cross-modal detection approach that captures these unique temporal artifacts through joint cross-modal embedding and multi-grained temporal modeling. Specifically, CMTA leverages BLIP to generate frame-level image captions and utilizes CLIP to extract corresponding visual-textual representations. A coarse-grained temporal modeling branch is then designed to characterize temporal fluctuations in cross-modal alignment with a GRU. In parallel, a fine-grained branch is constructed to capture intricate inter-frame variations from integrated visual-textual features with a Transformer encoder. Extensive experiments on 40 subsets across four large-scale datasets, including GenVideo, EvalCrafter, VideoPhy, and VidProM, validate that our approach sets a new state-of-the-art while exhibiting superior cross-generator generalization. Code and models of CMTA will be released at https://github.com/hwang-cs-ime/CMTA
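The intuition behind the artifact, that frame-to-caption alignment fluctuates over time in real videos but stays unnaturally flat in generated ones, can be sketched with a hand-made statistic. This is a simplified stand-in, not the paper's method: CMTA learns the temporal pattern with a GRU and a Transformer encoder, whereas the sketch below just measures the spread of frame-to-frame changes in cosine similarity. The toy 2-D vectors stand in for CLIP embeddings of frames and BLIP captions, and every name is hypothetical.

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def alignment_trajectory(frame_embs, caption_embs):
    # One visual-textual alignment score per frame.
    return [cosine(f, c) for f, c in zip(frame_embs, caption_embs)]

def temporal_fluctuation(trajectory):
    # Standard deviation of the frame-to-frame changes in alignment.
    diffs = [b - a for a, b in zip(trajectory, trajectory[1:])]
    mean = sum(diffs) / len(diffs)
    return math.sqrt(sum((d - mean) ** 2 for d in diffs) / len(diffs))

# Toy embeddings (ASSUMPTION: illustrative values, not real CLIP features).
captions = [[1.0, 0.0]] * 4
real_frames = [[1.0, 0.0], [0.8, 0.6], [1.0, 0.0], [0.6, 0.8]]
generated_frames = [[1.0, 0.05], [1.0, 0.04], [1.0, 0.05], [1.0, 0.06]]

real_score = temporal_fluctuation(alignment_trajectory(real_frames, captions))
generated_score = temporal_fluctuation(alignment_trajectory(generated_frames, captions))
```

With these toy values the generated clip yields a far smaller fluctuation score than the real one, which is the kind of signal the coarse-grained branch is described as characterizing.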


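[36] BlenderRAG: High-Fidelity 3D Object Generation via Retrieval-Augmented Code Synthesis cs.CV | cs.AI | cs.GR | cs.HC | cs.LG

Massimo Rondelli, Francesco Pivi, Maurizio Gabbrielli

TL;DR: This paper presents BlenderRAG, a retrieval-augmented generation system that automatically generates executable Blender 3D modeling code from natural-language descriptions. By retrieving from a dataset of 500 expert-validated multimodal examples (text, code, image), it substantially improves both the compilation success rate of the code and the semantic alignment of the generated objects.

Details

Motivation: State-of-the-art LLMs frequently produce syntax errors and geometrically inconsistent objects when generating Blender code from natural language. The paper aims to improve the quality and reliability of this code generation.

Result: Across four state-of-the-art LLMs, BlenderRAG raises the compilation success rate from 40.8% to 70.0% and semantic normalized alignment (CLIP similarity) from 0.41 to 0.77, and it can be deployed without fine-tuning or specialized hardware.

Insight: The main novelty is a high-quality, multimodal, expert-validated dataset combined with the retrieval-augmented generation paradigm: semantically similar examples are supplied as context to guide and constrain the LLM's code generation, which improves the syntactic correctness and geometric consistency of the generated code.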

Abstract: Automatic generation of executable Blender code from natural language remains challenging, with state-of-the-art LLMs producing frequent syntactic errors and geometrically inconsistent objects. We present BlenderRAG, a retrieval-augmented generation system that operates on a curated multimodal dataset of 500 expert-validated examples (text, code, image) across 50 object categories. By retrieving semantically similar examples during generation, BlenderRAG improves compilation success rates from 40.8% to 70.0% and semantic normalized alignment from 0.41 to 0.77 (CLIP similarity) across four state-of-the-art LLMs, without requiring fine-tuning or specialized hardware, making it immediately accessible for deployment. The dataset and code will be available at https://github.com/MaxRondelli/BlenderRAG.


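[37] UniVidX: A Unified Multimodal Framework for Versatile Video Generation via Diffusion Priors cs.CV

Houyuan Chen, Hong Li, Xianghao Kong, Tianrui Zhu, Shaocong Xu

TL;DR: UniVidX is a unified multimodal framework that builds on video diffusion model (VDM) priors for versatile video generation. It formulates pixel-aligned tasks as conditional generation in a shared multimodal space and introduces three core designs (Stochastic Condition Masking, Decoupled Gated LoRA, and Cross-Modal Self-Attention) to adapt to modality-specific distributions and promote cross-modal consistency during synthesis. The framework is instantiated in two domains: RGB videos with intrinsic maps (albedo, irradiance, normal), and RGB videos with RGBA layers.

Details

Motivation: Existing methods typically train a separate model for each multimodal graphics task, which fixes the input-output mapping and limits the modeling of cross-modal correlations. This paper aims to overcome that limitation with a unified framework that leverages VDM priors for versatile video generation.

Result: Experiments show that the two instantiations, UniVid-Intrinsic and UniVid-Alpha, achieve performance competitive with state-of-the-art (SOTA) methods across distinct tasks and generalize robustly to in-the-wild scenarios even when trained on fewer than 1,000 videos.

Insight: The main innovations are: 1) Stochastic Condition Masking (SCM) enables omni-directional conditional generation rather than fixed mappings; 2) Decoupled Gated LoRA (DGL) activates a modality-specific LoRA when that modality is the generation target, preserving the VDM's strong priors; 3) Cross-Modal Self-Attention (CMSA) shares keys and values across modalities while keeping modality-specific queries, facilitating information exchange and inter-modal alignment. Together these let a single model flexibly handle multiple multimodal video generation tasks.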

Abstract: Recent progress has shown that video diffusion models (VDMs) can be repurposed for diverse multimodal graphics tasks. However, existing methods often train separate models for each problem setting, which fixes the input-output mapping and limits the modeling of correlations across modalities. We present UniVidX, a unified multimodal framework that leverages VDM priors for versatile video generation. UniVidX formulates pixel-aligned tasks as conditional generation in a shared multimodal space, adapts to modality-specific distributions while preserving the backbone’s native priors, and promotes cross-modal consistency during synthesis. It is built on three key designs. Stochastic Condition Masking (SCM) randomly partitions modalities into clean conditions and noisy targets during training, enabling omni-directional conditional generation instead of fixed mappings. Decoupled Gated LoRA (DGL) introduces per-modality LoRAs that are activated when a modality serves as the generation target, preserving the strong priors of the VDM. Cross-Modal Self-Attention (CMSA) shares keys and values across modalities while keeping modality-specific queries, facilitating information exchange and inter-modal alignment. We instantiate UniVidX in two domains: UniVid-Intrinsic, for RGB videos and intrinsic maps including albedo, irradiance, and normal; and UniVid-Alpha, for blended RGB videos and their constituent RGBA layers. Experiments show that both models achieve performance competitive with state-of-the-art methods across distinct tasks and generalize robustly to in-the-wild scenarios, even when trained on fewer than 1,000 videos. Project page: https://houyuanchen111.github.io/UniVidX.github.io/
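Stochastic Condition Masking, as described above, randomly splits the modalities into clean conditions and noisy targets at each training step. The sketch below shows one way such a partition could be drawn. The 0.5 split probability, the rule that at least one modality must be a target, and the function name are assumptions made for illustration; the abstract does not specify them.

```python
import random

def stochastic_condition_mask(modalities, rng, p_condition=0.5):
    """Randomly partition modalities into (clean conditions, noisy targets)."""
    conditions, targets = [], []
    for name in modalities:
        if rng.random() < p_condition:
            conditions.append(name)
        else:
            targets.append(name)
    if not targets:
        # ASSUMPTION: a training step needs something to denoise,
        # so move one randomly chosen condition into the target set.
        targets.append(conditions.pop(rng.randrange(len(conditions))))
    return conditions, targets

# Modalities of the UniVid-Intrinsic instantiation named in the abstract.
modalities = ["rgb", "albedo", "irradiance", "normal"]
rng = random.Random(0)
partitions = [stochastic_condition_mask(modalities, rng) for _ in range(100)]
```

Because every subset of modalities can end up on either side, a single model sees many input-output mappings during training (for example RGB to intrinsic maps, or intrinsic maps to RGB), which is the omni-directional behavior the abstract contrasts with fixed mappings. An empty condition list corresponds to generating all modalities jointly.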


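[38] Foundation AI Models for Aerosol Optical Depth Estimation from PACE Satellite Data cs.CV

Zahid Hassan Tushar, Sanjay Purushotham

TL;DR: This paper presents the first study exploring foundation AI models for aerosol optical depth (AOD) retrieval and proposes ViTCG, a Vision Transformer with channel-wise-grouping-based spatial regression that aims to reduce retrieval bias and error. The model takes hyperspectral top-of-atmosphere radiance as input and jointly models spatial context and spectral information. Validated on PACE satellite data, it substantially lowers mean squared error relative to existing foundation models such as Prithvi and produces spatially coherent AOD fields.

Details

Motivation: Conventional physics-based AOD retrieval relies on radiative transfer models and look-up tables, is computationally expensive, and needs auxiliary meteorological data. Existing data-driven methods fail to exploit the spatial-spectral coherence of hyperspectral imagery, yielding spatially inconsistent, noise-sensitive retrievals.

Result: Validated on PACE radiance observations, ViTCG reduces mean squared error by 62% compared with state-of-the-art foundation models, including Prithvi, and produces spatially coherent AOD fields.

Insight: The novelty is applying foundation AI models to AOD retrieval for the first time and designing ViTCG, which jointly models spatial and spectral information through channel-wise-grouping spatial regression, improving retrieval accuracy and spatial consistency. Viewed objectively, the method makes effective use of the multi-dimensional features of hyperspectral data and offers a new data-driven solution for Earth observation tasks.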

Abstract: Aerosol Optical Depth (AOD) retrieval is essential for Earth observation, supporting applications from air quality monitoring to climate studies. Conventional physics-based AOD retrieval methods formulate the problem as a pixel-wise inversion, relying on radiative transfer modeling, memory-intensive look-up tables, and auxiliary meteorological data. While recent data-driven approaches have shown promise, many fail to exploit the spatial-spectral coherence of hyperspectral imagery, leading to spatially inconsistent and noise-sensitive retrievals. We present the first study exploring Foundation AI models for AOD retrieval and propose ViTCG, a Vision Transformer with Channel-wise Grouping-based spatial regression framework that reduces retrieval bias and error. ViTCG uses hyperspectral top-of-atmosphere radiance as input and jointly models spatial context and spectral information. Validation with PACE radiance observations demonstrates a 62% reduction in mean squared error compared to state-of-the-art foundation models, including Prithvi, and produces spatially coherent AOD fields.


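[39] Static and Dynamic Graph Alignment Network for Temporal Video Grounding cs.CV

Zhanjie Hu, Bolin Zhang, Jianhua Wang, Jianbo Zheng, Chenchen Yan

TL;DR: This paper proposes the Static and Dynamic Graph Alignment Network (SDGAN) for temporal video grounding (TVG), the task of localizing the temporal segment in an untrimmed video that corresponds to a natural-language query. The method builds complementary temporal graphs from static and dynamic visual features and adds a query-aware alignment mechanism and a multi-granularity training strategy to improve localization.

Details

Motivation: Existing graph convolutional network (GCN) based TVG methods face three key bottlenecks: using only static or only dynamic features gives an incomplete visual representation; building query-agnostic temporal graphs makes feature interaction inefficient; and single-granularity semantic matching leads to slow convergence and suboptimal precision.

Result: Extensive experiments on three benchmark datasets show that SDGAN achieves superior, state-of-the-art performance in complex TVG scenarios.

Insight: The innovations are: combining static and dynamic features to build complementary temporal graphs with Position-wise Nodes Alignment; obtaining query-aware visual representations through Query-Clip Contrastive Learning and Adaptive Graph Modeling; and using multi-granularity temporal proposals with a progressive easy-to-hard training strategy to bridge coarse-grained semantic localization and fine-grained boundary refinement.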

Abstract: Temporal Video Grounding (TVG) aims to localize temporal moments in an untrimmed video that semantically correspond to given natural language queries. Recently, Graph Convolutional Networks (GCN) have been widely adopted in TVG to model temporal relations among video clips and enhance contextual reasoning by constructing clip-level graphs. Despite their effectiveness, existing GCN-based TVG methods encounter three critical bottlenecks: 1) most methods construct graph nodes using either static or dynamic features alone, resulting in incomplete visual representation and overlooking complementary semantics; 2) most methods construct temporal graphs in a query-agnostic manner, leading to inefficient feature interaction within the temporal graph representation; and 3) most methods rely on single-granularity semantic matching, and training directly on the complex temporal localization task can lead to slow convergence and suboptimal precision. To address these challenges, we propose the Static and Dynamic Graph Alignment Network (SDGAN). First, SDGAN jointly exploits static and dynamic visual features to construct two complementary temporal graphs and performs Position-wise Nodes Alignment, enabling more expressive and robust visual representation. Second, SDGAN introduces Query-Clip Contrastive Learning and Adaptive Graph Modeling to explicitly align visual clips with their corresponding textual queries, yielding query-aware visual representations. Third, SDGAN incorporates multi-granularity temporal proposals within a Progressive Easy-to-Hard Training Strategy, effectively bridging coarse-grained semantic localization and fine-grained temporal boundary refinement. Extensive experiments on three benchmark datasets demonstrate that SDGAN achieves superior performance across complex TVG scenarios. Code and datasets are available at https://github.com/ZhanJieHu/SDGAN.


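[40] PhysEdit: Physically-Consistent Region-Aware Image Editing via Adaptive Spatio-Temporal Reasoning cs.CV

Guandong Li, Mengxia Ye

TL;DR: This paper presents PhysEdit, a physically consistent, region-aware image editing framework that handles heterogeneous editing instructions through adaptive spatio-temporal reasoning. It introduces two inference-time modules that require no backbone retraining: Complexity-Adaptive Reasoning Depth (CARD), which dynamically adjusts the number of reasoning steps and the reasoning length, and a Spatial Reasoning Mask (SRM), which restricts the reasoning region. Together they improve efficiency while maintaining or improving editing quality.

Details

Motivation: Existing reasoning-based editors apply one fixed inference recipe to every instruction, yet different kinds of instructions (color swaps, object insertion, physical-action edits) need different spatial coverage and reasoning depth. The paper addresses this by introducing adaptivity along the spatial and temporal axes.

Result: On the full 737-sample ImgEdit Basic-Edit Suite, PhysEdit achieves a 1.18x wall-clock speedup over a strong reasoning baseline (64.3 s vs. 76.1 s per sample), with slightly better instruction adherence (CLIP-T 0.2283 vs. 0.2266, +0.7%) and comparable identity preservation (CLIP-I 0.8246 vs. 0.8280). On appearance-level edits the speedup reaches 1.52x.

Insight: The main novelty is recasting a fixed inference recipe as a conditional-computation problem: CARD predicts edit complexity from the instruction and reference image and allocates reasoning resources adaptively, while SRM extracts an instruction-conditioned spatial prior from cross-attention to restrict the reasoning region. This adaptive spatio-temporal reasoning is the key to more efficient and better-targeted editing.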

Abstract: Image editing instructions are heterogeneous: a color swap, an object insertion, and a physical-action edit all demand different spatial coverage and different reasoning depth, yet existing reasoning-based editors apply a single fixed inference recipe to every instruction. We argue that adaptivity along both the spatial and temporal axes is the missing degree of freedom, and we present PhysEdit, an editing framework built around this principle. PhysEdit introduces two inference-time modules that compose without retraining the backbone. At its core, (1) Complexity-Adaptive Reasoning Depth (CARD) predicts edit complexity directly from the instruction and reference image and allocates the reasoning step count N_r and reasoning-token length r per sample – turning a previously fixed inference schedule into a conditional-computation problem. CARD is supported by (2) a Spatial Reasoning Mask (SRM) that extracts an instruction-conditioned spatial prior from cross-attention to confine reasoning to regions that semantically require it. On the full 737-case ImgEdit Basic-Edit Suite, PhysEdit delivers a 1.18x wall-clock speedup (64.3s vs. 76.1s per sample) over a strong reasoning baseline while slightly improving instruction adherence (CLIP-T 0.2283 vs. 0.2266, +0.7%) and matching identity preservation within noise (CLIP-I 0.8246 vs. 0.8280). The speedup is category-dependent and reaches 1.52x on appearance-level edits, validating CARD’s adaptive allocation as the principal source of efficiency gain. A 30-sample pilot with full ablations isolates the contribution of each module.


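[41] Learning Coarse-to-Fine Osteoarthritis Representations under Noisy Hierarchical Labels cs.CV

Tongxu Zhang

TL;DR: This paper uses the label hierarchy inherent in knee osteoarthritis (OA) assessment, a coarse binary OA decision and a fine-grained KL severity grade, as a representation-level supervisory prior. Using a simple dual-head model (a shared encoder plus two task-specific heads), it compares single-OA, single-KL, and dual-head training across different 3D backbones. Experiments show that dual-head supervision improves KL-related metrics on selected backbones, induces a more ordered coarse-to-fine organization of the latent representation, and strengthens the alignment of salient regions with cartilage anatomy.

Details

Motivation: Existing deep learning studies usually treat the binary OA decision and KL severity grading as separate classification problems, ignoring their natural hierarchical relationship. The paper asks whether this clinical hierarchy can serve as a supervisory prior for representation learning, improving disease representations learned under noisy labels.

Result: Across multiple 3D backbones (e.g., 3D ResNet), dual-head supervision yields backbone-dependent gains over single-task training on KL-grading metrics (e.g., accuracy). Paired statistical comparisons, latent severity-axis geometry analysis, and saliency-overlap analysis show that dual-head supervision produces a more ordered coarse-to-fine latent organization and, for responsive backbones, stronger alignment of saliency with cartilage anatomy.

Insight: The novelty is using the clinical label hierarchy as a simple dual-head supervisory prior that reshapes disease representations under noisy coarse/fine labels without a complex architecture, providing a useful inductive bias for OA diagnosis and severity grading. Viewed objectively, learning hierarchical representations through a shared encoder improves interpretability and anatomical alignment, and offers a reference for using domain knowledge to improve representation learning in medical image analysis.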

Abstract: Knee osteoarthritis (OA) assessment involves a natural but often underused label hierarchy: a coarse binary OA decision and a fine-grained Kellgren–Lawrence (KL) severity grade. Existing deep learning studies commonly treat these targets as separate classification problems, either reducing OA assessment to disease presence or directly optimizing noisy ordinal KL labels. In this work, we ask whether this clinical hierarchy can serve as a representation-level supervisory prior. Rather than introducing a complex architecture, we use a deliberately simple dual-head model with a shared encoder and two task-specific heads as a probe of hierarchical supervision. We compare single-OA, single-KL, and dual-head training across multiple 3D backbones under the same test protocol. Beyond standard classification metrics, we perform paired statistical comparisons, analyze latent severity-axis geometry, and examine saliency overlap with cartilage regions. The results show that dual-head supervision produces backbone-dependent gains, with clear improvements in KL-related metrics for selected backbones. More importantly, the gains are accompanied by a more ordered coarse-to-fine latent organization and, for responsive backbones, stronger anatomical alignment of saliency with cartilage. These findings suggest that even simple hierarchical dual-head supervision can reshape disease representations under noisy coarse/fine labels, providing a useful inductive bias for OA diagnosis and severity grading.


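[42] Unpaired Image Deraining Using Reward-Guided Self-Reinforcement Strategy cs.CV

Yinghao Chen, Yeying Jin, Xiang Chen, Yanyan Wei, Ziyang Yan

TL;DR: This paper proposes RGSUD, an unsupervised image deraining method that uses a reward-guided self-reinforcement strategy: high-quality derained results that occasionally emerge during training are used to guide optimization, so the model learns the real-world rain distribution without paired supervision.

Details

Motivation: Unsupervised deraining has attracted attention because it needs no paired supervision, but the lack of strong constraints makes the network hard to converge, especially when rain degradation is complex and diverse.

Result: Extensive experiments on multiple datasets (paired synthetic, paired real, and unpaired real images) show state-of-the-art performance on both subjective and objective image quality assessment metrics, outperforming existing unsupervised deraining methods.

Insight: The novelty is an image-quality-assessment-based dynamic reward recycling mechanism together with a self-reinforcement training stage: a self-reinforced loss and dynamically updated rewards improve the quality of synthesized pseudo-paired data and stabilize optimization. The strategy can be adapted to other unsupervised deraining methods, and the framework generalizes well across existing supervised deraining networks.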

Abstract: Unsupervised deraining has attracted attention for its ability to learn the real-world distribution of rain without paired supervision. However, the lack of strong constraints makes it difficult for the network to converge, especially with the complex diversity of rain degradation. A key motivation is that high-quality deraining results occasionally emerge during training, which can be leveraged to guide the optimization process. To overcome these challenges, we introduce RGSUD (Reward-Guided Self-Reinforcement Unsupervised Image Deraining), comprising two key stages: reward recycling and self-reinforcement (SR) training. For the former stage, we propose an Image Quality Assessment (IQA)-based dynamic reward recycling mechanism that selects optimal derained outputs during training and continuously collects high-quality derained images. In the latter stage, we incorporate these rewards into the model's optimization process, constraining the optimization space and improving alignment between derained outputs and clean images. By leveraging an IQA-based self-reinforced loss and dynamically updated rewards, we enhance the quality of synthesized pseudo-paired data and stabilize the optimization. Extensive experiments demonstrate that our method achieves SOTA performance across multiple datasets, including paired synthetic, paired real, and unpaired real images, outperforming existing unsupervised deraining approaches in both subjective and objective IQA metrics. Additionally, we show that the self-reinforcement strategy is adaptable to other unsupervised deraining methods and that our deraining framework demonstrates strong generalization across existing supervised deraining networks.


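[43] Exploring the Limits of End-to-End Feature-Affinity Propagation for Single-Point Supervised Infrared Small Target Detection cs.CV

Qiancheng Zhou, Wenhua Zhang

TL;DR: This paper proposes GSACP, an end-to-end method for single-point supervised infrared small target detection that generates point-to-mask supervision through online, in-batch feature-affinity propagation, avoiding the explicit offline pseudo-label construction of conventional pipelines. On the SIRST3 dataset it delivers high-accuracy detection at a low false-alarm rate, and systematic ablations expose the Self-Referential Propagation Drift problem and strategies for mitigating it.

Details

Motivation: Existing single-point supervised infrared small target detection methods rely on explicit offline pseudo-label construction such as multi-stage active learning or physics-driven mask generation, which is computationally costly and procedurally complex. This paper explores a minimalist online supervision paradigm that generates the supervision signal directly through in-batch feature-affinity propagation, to reduce annotation cost and simplify deployment.

Result: On SIRST3, GSACP-Final achieves a competitive 0.6674 mIoU while reducing the false-alarm rate (Fa) by 38% relative to PAL, establishing a new ultra-low false-alarm operating regime.

Insight: The novelty is an end-to-end online feature-affinity propagation supervision paradigm that avoids external label-evolution loops. The paper also theoretically analyzes Self-Referential Propagation Drift and systematically studies local EMA teacher decoupling, hard-background contrastive separation, and adaptive support geometry, providing a compact solution for deployment scenarios where false-alarm suppression is critical.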

Abstract: Single-point supervised infrared small target detection (IRSTD) drastically reduces dense annotation costs. Current state-of-the-art (SOTA) methods achieve high precision by recovering mask supervision through explicit, offline pseudo-label construction, such as multi-stage active learning and physics-driven mask generation. In this paper, we study a minimalist alternative: generating point-to-mask supervision online through in-batch, point-anchored feature-affinity propagation. We instantiate this paradigm as GSACP, an end-to-end testbed that directly supervises the detector using hard-margin feature affinity gated by local image priors, entirely eliminating external label-evolution loops. This compact design, however, exposes an optimization bottleneck. Because the affinity target is generated from the same feature representation being optimized, training forms a self-referential loop. We theoretically formalize this as Self-Referential Propagation Drift, a representation-supervision entanglement that can sharpen true boundaries or distort the feature space to satisfy its own targets. To systematically isolate these failure modes, we apply a protocolized single-variable ablation procedure spanning local EMA teacher decoupling, hard-background contrastive separation, and adaptive support geometry. On the SIRST3 dataset, GSACP-Final establishes a new ultra-low false-alarm operating regime, achieving a highly competitive 0.6674 mIoU while demonstrating a 38% relative reduction in false-positive artifacts (Fa) compared with PAL. By systematically deconstructing the end-to-end paradigm, we map its performance boundaries and show that in-batch feature propagation provides a compact alternative for deployment scenarios where false-alarm suppression is paramount.


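[44] Modeling Subjective Urban Perception with Human Gaze cs.CV | cs.HC

Lin Che, Xi Wang, Marc Pollefeys, Konrad Schindler, Martin Raubal

TL;DR: This paper proposes modeling urban perception from human gaze behavior. It builds Place Pulse-Gaze, a dataset with eye-tracking data and individual perception labels, and designs a Gaze-Guided Urban Perception Framework that systematically studies three settings for predicting subjective urban perception: gaze only, gaze fused with explicit semantic scene representations, and gaze fused with implicit richer visual representations. Experiments show that gaze alone already provides a useful predictive signal and that fusion with scene representations improves performance further.

Details

Motivation: Existing computational models mostly predict urban perception directly from street view images and ignore the perceptual process through which people form such judgments. The paper therefore brings in human gaze behavior to model subjective urban perception more accurately.

Result: In experiments on Place Pulse-Gaze, gaze behavior alone is predictive of subjective urban perception, and fusing it with explicit semantic or implicit richer visual representations further improves prediction. The paper does not state whether the results reach SOTA.

Insight: The novelty is introducing human gaze behavior into computational models of urban perception for the first time, underscoring the importance of integrating human perceptual processes into urban scene understanding and opening a direction for gaze-based multimodal urban computing. Viewed objectively, capturing subjective attention through eye-tracking data offers a new way to understand the cognitive basis of environmental evaluation.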

Abstract: Urban perception describes how people subjectively evaluate urban environments, shaping how cities are experienced and understood. Existing computational approaches primarily model urban perception directly from street view images, but largely ignore the human perceptual process through which such judgments are formed. In this paper, we introduce Place Pulse-Gaze, an urban perception dataset that augments street view images with synchronized eye-tracking recordings and individual perception labels. Based on this dataset, we propose a Gaze-Guided Urban Perception Framework to study how gaze behavior contributes to the modeling of subjective urban perception. The framework systematically investigates three complementary settings: gaze-only modeling, gaze fusion with explicit semantic scene representations, and gaze fusion with implicit richer visual representations. Experiments show that gaze alone already carries useful predictive signals for subjective urban perception, and that integrating gaze with scene representations further improves prediction under both semantic and richer visual representations. Overall, our findings highlight the importance of incorporating human perceptual processes into urban scene understanding and open a direction for gaze-guided multimodal urban computing.


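[45] Make Your LVLM KV Cache More Lightweight cs.CV | cs.AI | cs.LG

Xihao Chen, Yangyang Guo, Roger Zimmermann

TL;DR: This paper proposes LightKV, a method that addresses the large GPU memory footprint of the KV cache during inference with large vision-language models (LVLMs). Exploiting redundancy among vision-token embeddings, it performs text-prompt-guided cross-modality message passing during the prefill stage to aggregate information and compress vision tokens, substantially reducing KV cache size and computation.

Details

Motivation: Although the KV cache improves decoding efficiency in large language models (LLMs), applying it directly to LVLMs incurs substantial GPU memory overhead because of the large number of vision tokens processed during prefill, so a lightweight KV cache optimization is needed.

Result: Experiments on eight open-source LVLMs and eight public benchmark datasets (e.g., MME and SeedBench) show that, with only 55% of the original vision tokens, LightKV halves the vision-token KV cache size, cuts computation by up to 40%, and preserves general-purpose performance while significantly outperforming existing baselines.

Insight: The novelty is a prompt-aware cross-modality message passing mechanism for compressing vision tokens, which differs from earlier vision-only compression strategies: text guidance aggregates redundant information more effectively and makes the KV cache lighter. Viewed objectively, tying compression dynamically to the task prompt makes it better targeted and more efficient.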

Abstract: Key-Value (KV) cache has become a de facto component of modern Large Vision-Language Models (LVLMs) for inference. While it enhances decoding efficiency in Large Language Models (LLMs), its direct adoption in LVLMs introduces substantial GPU memory overhead due to the large number of vision tokens processed during the prefill stage. To tackle this problem, we propose LightKV, a novel approach that reduces KV cache size by exploiting the redundancy among vision-token embeddings. Guided by text prompts, LightKV employs cross-modality message passing to aggregate informative messages across vision tokens and progressively compress them during prefill. This prompt-aware guidance distinguishes our method from prior vision-only compression strategies. We evaluate LightKV on eight open-source LVLMs across eight public benchmark datasets, e.g., MME and SeedBench. Experimental results demonstrate that with only 55% of the original vision tokens, LightKV (a) halves the vision-token KV cache size, (b) reduces computation by up to 40%, and (c) preserves general-purpose performance while significantly outperforming existing baselines.


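[46] Let ViT Speak: Generative Language-Image Pre-training cs.CV

Yan Fang, Mengcheng Lan, Zilong Huang, Weixian Lei, Yunqing Zhao

TL;DR: This paper presents GenLIP, a minimalist generative language-image pretraining framework for Vision Transformers, designed for multimodal large language models. It trains a ViT with a language modeling objective to predict language tokens directly from visual tokens, without contrastive batch construction or an additional text decoder, achieving scalability and strong performance with a simplified architecture.

Details

Motivation: The goal is to better align vision encoders with the autoregressive nature of large language models, addressing the complexity of aligning visual and textual modalities in conventional multimodal pretraining and its need for extra components.

Result: After pretraining on 8B samples from Recap-DataComp-1B, GenLIP matches or surpasses strong baselines on multiple multimodal benchmarks. After continued pretraining on multi-resolution images at native aspect ratios, it improves further on detail-sensitive tasks such as OCR and chart understanding.

Insight: The claimed novelty is the minimalist design: a single Transformer jointly models visual and textual tokens, and a standard language modeling objective directly aligns the vision encoder with the autoregressive nature of LLMs. Viewed objectively, the core innovation is recasting the vision encoder as a generative language model, which simplifies multimodal alignment and may improve training efficiency and scalability.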

Abstract: In this paper, we present Generative Language-Image Pre-training (GenLIP), a minimalist generative pretraining framework for Vision Transformers (ViTs) designed for multimodal large language models (MLLMs). To better align vision encoders with the autoregressive nature of LLMs, GenLIP trains a ViT to predict language tokens directly from visual tokens using a standard language modeling objective, without contrastive batch construction or an additional text decoder. This design offers three key advantages: (1) Simplicity: a single transformer jointly models visual and textual tokens; (2) Scalability: it scales effectively with both data and model size; and (3) Performance: it achieves competitive or superior results across diverse multimodal benchmarks. Trained on 8B samples from Recap-DataComp-1B, GenLIP matches or surpasses strong baselines despite using substantially less pretraining data. After continued pretraining on multi-resolution images at native aspect ratios, GenLIP further improves on detail-sensitive tasks such as OCR and chart understanding, making it a strong foundation for vision encoders in MLLMs.


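[47] Persistent Visual Memory: Sustaining Perception for Deep Generation in LVLMs cs.CV | cs.AI

Siyuan Huang, Xiaoye Qu, Yafu Li, Tong Zhu, Zefeng He

TL;DR: This paper proposes Persistent Visual Memory (PVM), a lightweight learnable module that addresses the "Visual Signal Dilution" problem autoregressive large vision-language models (LVLMs) face in deep generation. Integrated into LVLMs as a parallel branch alongside the feed-forward network (FFN), PVM establishes a distance-agnostic retrieval pathway that supplies visual embeddings directly, sustaining visual perception. Experiments show notable gains on Qwen3-VL models with very few added parameters.

Details

Motivation: Autoregressive LVLMs perform well on multimodal tasks, but during deep generation they exhibit "Visual Signal Dilution": accumulated textual history expands the attention partition function, so visual attention decays as the generated sequence grows, weakening sustained visual perception.

Result: Extensive experiments on Qwen3-VL at the 4B and 8B scales show that PVM brings notable improvements with negligible parameter overhead, with consistent average accuracy gains on complex reasoning tasks that demand persistent visual perception. In-depth analysis further shows that PVM resists length-induced signal decay and accelerates internal prediction convergence.

Insight: The novelty is PVM, a lightweight structural module whose distance-agnostic retrieval pathway for visual embeddings structurally mitigates the signal suppression inherent to deep generation. Viewed objectively, it is an effective and efficient remedy for the decay of visual information in attention over long generated sequences, and its parallel integration requires few architectural changes and is easy to deploy.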

Abstract: While autoregressive Large Vision-Language Models (LVLMs) demonstrate remarkable proficiency in multimodal tasks, they face a “Visual Signal Dilution” phenomenon, where the accumulation of textual history expands the attention partition function, causing visual attention to decay inversely with generated sequence length. To counteract this, we propose Persistent Visual Memory (PVM), a lightweight learnable module designed to ensure sustained, on-demand visual perception. Integrated as a parallel branch alongside the Feed-Forward Network (FFN) in LVLMs, PVM establishes a distance-agnostic retrieval pathway that directly provides visual embeddings for precise visual perception, thereby structurally mitigating the signal suppression inherent to deep generation. Extensive experiments on Qwen3-VL models demonstrate that PVM brings notable improvements with negligible parameter overhead, delivering consistent average accuracy gains across both 4B and 8B scales, particularly in complex reasoning tasks that demand persistent visual perception. Furthermore, in-depth analysis reveals that PVM can resist length-induced signal decay and accelerate internal prediction convergence.
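The dilution argument in the abstract can be checked with a few lines of arithmetic. The sketch below assumes an idealized attention head in which every visual token shares one logit and every text token shares another; real attention logits vary per token, so this is a simplification for intuition, and the function name and token counts are hypothetical.

```python
import math

def visual_attention_mass(n_visual, n_text, visual_logit=0.0, text_logit=0.0):
    """Share of softmax attention that lands on visual tokens.

    The denominator is the attention partition function: it grows with
    every generated text token, while the visual numerator stays fixed.
    """
    visual = n_visual * math.exp(visual_logit)
    text = n_text * math.exp(text_logit)
    return visual / (visual + text)

# 256 visual tokens, with a growing textual history.
masses = [visual_attention_mass(256, n_text) for n_text in (0, 256, 768, 1792)]
# masses == [1.0, 0.5, 0.25, 0.125]
```

With equal logits the visual share is n_visual / (n_visual + n_text), so each doubling of the total sequence length halves the attention available to the image. This is the inverse-length decay the abstract describes, and it is why a pathway that does not compete inside the same softmax can keep visual information available late in generation.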


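[48] Posterior Augmented Flow Matching cs.CV

George Stoica, Sayak Paul, Matthew Wallingford, Vivek Ramanujan, Abhay Nori

TL;DR: This paper proposes Posterior-Augmented Flow Matching (PAFM), a method that improves flow matching (FM) for training generative models. PAFM replaces FM's sparse single-target supervision with an expectation over multiple candidate targets drawn from an approximate posterior, reducing training gradient variance and mitigating flow collapse.

Details

Motivation: In high-dimensional image generation, each FM training sample supervises only a single trajectory and intermediate point, so the training signal is sparse and high-variance. This can cause flow collapse, where the learned dynamics map diverse inputs to overly similar outputs and generalize poorly.

Result: On benchmarks including ImageNet and CC12M, PAFM improves FID50K over FM by up to 3.4 across model scales (SiT-B/2 and SiT-XL/2) and architectures (SiT and MMDiT), with negligible extra compute.

Insight: PAFM's novelty is generalizing single-target supervision to an expectation under an approximate posterior, using importance sampling to aggregate information from multiple plausible continuation trajectories. It comes with a theoretical guarantee of an unbiased estimator and substantially lower gradient variance, giving generative model training a more stable and efficient supervision mechanism.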

Abstract: Flow matching (FM) trains a time-dependent vector field that transports samples from a simple prior to a complex data distribution. However, for high-dimensional images, each training sample supervises only a single trajectory and intermediate point, yielding an extremely sparse and high-variance training signal. This under-constrained supervision can cause flow collapse, where the learned dynamics memorize specific source-target pairings, mapping diverse inputs to overly similar outputs, failing to generalize. We introduce Posterior-Augmented Flow Matching (PAFM), a theoretically grounded generalization of FM that replaces single-target supervision with an expectation over an approximate posterior of valid target completions for a given intermediate state and condition. PAFM factorizes this intractable posterior into (i) the likelihood of the intermediate under a hypothesized endpoint and (ii) the prior probability of that endpoint under the condition, and uses an importance sampling scheme to construct a mixture over multiple candidate targets. We prove that PAFM yields an unbiased estimator of the original FM objective while substantially reducing gradient variance during training by aggregating information from many plausible continuation trajectories per intermediate. Finally, we show that PAFM improves over FM by up to 3.4 FID50K across different model scales (SiT-B/2 and SiT-XL/2), different architectures (SiT and MMDiT), and in both class and text conditioned benchmarks (ImageNet and CC12M), with a negligible increase in the compute overhead. Code: https://github.com/gstoica27/PAFM.git.
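The factorization in the abstract (likelihood of the intermediate under a hypothesized endpoint, times the prior of that endpoint) can be sketched for a small set of candidate targets. The sketch makes assumptions the abstract does not confirm: a linear interpolation path x_t = (1 - t) x0 + t x1 with a standard Gaussian source x0, a fixed candidate list with given prior weights, and self-normalized weights. PAFM's actual importance sampling scheme may differ, and all names are hypothetical.

```python
import math

def posterior_weights(x_t, t, candidates, priors):
    """Normalized weights proportional to p(x_t | x1) * p(x1 | condition).

    ASSUMPTION: linear path with x0 ~ N(0, I), so p(x_t | x1) is a Gaussian
    with mean t * x1 and standard deviation (1 - t) per dimension.
    """
    sigma = 1.0 - t
    log_weights = []
    for x1, prior in zip(candidates, priors):
        sq_dist = sum((a - t * b) ** 2 for a, b in zip(x_t, x1))
        log_weights.append(math.log(prior) - sq_dist / (2.0 * sigma * sigma))
    top = max(log_weights)  # subtract the max for numerical stability
    unnormalized = [math.exp(lw - top) for lw in log_weights]
    total = sum(unnormalized)
    return [w / total for w in unnormalized]

def mixture_target(x_t, t, candidates, weights):
    """Weighted average of per-candidate velocities (x1 - x_t) / (1 - t)."""
    return [
        sum(w * (x1[i] - x_t[i]) / (1.0 - t) for w, x1 in zip(weights, candidates))
        for i in range(len(x_t))
    ]

x_t, t = [0.5, 0.5], 0.5
candidates = [[1.0, 1.0], [-1.0, -1.0]]  # hypothesized endpoints
priors = [0.5, 0.5]                      # p(x1 | condition), assumed uniform
weights = posterior_weights(x_t, t, candidates, priors)
target = mixture_target(x_t, t, candidates, weights)
```

Here the first candidate explains the intermediate point far better, so it receives most of the weight, while the second still contributes a small correction. Regressing onto `target` rather than onto a single sampled endpoint's velocity is the sense in which supervision becomes an expectation over plausible completions.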


eess.IV

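[49] Multi-frame Restoration for High-rate Lissajous Confocal Laser Endomicroscopy eess.IV | cs.CV | cs.LG

Minhee Lee, Sangyoon Lee, Jiwook Lee, Minki Hong, Kyuyoung Kim

TL;DR: Addressing the missing pixels caused by the scanning trajectory in high-frame-rate Lissajous confocal laser endomicroscopy, this paper introduces the first high-quality benchmark dataset and MIRA, a lightweight recurrent restoration framework. MIRA iteratively aggregates temporal context through feature reuse and displacement alignment, effectively filling the structured holes in the images.

Details

Motivation: In high-frame-rate Lissajous scanning endomicroscopy, incomplete sampling along the resonant trajectory leaves structured holes in the images. Solving this would enable high-speed, high-quality in vivo optical biopsy.

Result: On the proposed high-rate Lissajous CLE benchmark, MIRA surpasses both lightweight and high-complexity baselines in restoration quality while maintaining favorable computational efficiency suitable for clinical deployment.

Insight: The novelty is the first benchmark dataset with aligned supervision for training and evaluation, plus a lightweight recurrent network that aggregates multi-frame information efficiently through feature reuse and alignment, balancing performance with computational efficiency and offering a practical route to real-time clinical use.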

Abstract: Lissajous confocal laser endomicroscopy (CLE) is a promising solution for high speed in vivo optical biopsy for handheld scenarios. However, Lissajous scanning traces a resonant trajectory and samples only the visited pixels per frame; at high frame rates, many pixels remain unvisited, creating structured holes. In this work, we introduce the first benchmark for high-rate Lissajous CLE, consisting of low-quality video clips paired with high-quality reference images. The reference images are wide-FOV mosaics obtained by stitching stabilized, slow-scan frames of the same tissue, enabling temporally aligned supervision. Using this dataset, we propose MIRA, a lightweight recurrent framework for Lissajous CLE restoration that iteratively aggregates temporal context through feature reuse and displacement alignment. Our experiments demonstrate that MIRA outperforms both lightweight and high-complexity baselines in restoration quality while maintaining a favorable computational efficiency suitable for clinical deployment.


cs.HC

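[50] “What Are You Really Trying to Do?”: Co-Creating Life Goals from Everyday Computer Use cs.HC | cs.AI | cs.CL

Shardul Sapkota, Matthew Jörke, Zane Sabbagh, Omar Shaikh, Grace Wang

TL;DR: This paper presents "striving co-creation", a system that infers a user's broader life goals from unstructured everyday computer use rather than only capturing immediate behavior. Grounded in Activity Theory and Emmons' personal strivings framework, it builds a hierarchical representation of activities and provides an editing interface through which users make corrections that feed into subsequent inference.

Details

Motivation: Existing user modeling systems capture only what a user is doing at the moment, not the deeper reasons and life goals behind the behavior, which limits the depth of support they can offer. This paper addresses how to infer the "why" from everyday computer use.

Result: In a week-long field deployment with 14 participants, the co-creation process produced strivings that represent participants' long-term goals and gave users greater agency than baseline methods.

Insight: The core novelty is extending user modeling from "what" to "why" and introducing a human-AI collaborative co-creation process. The system infers from observation and also lets users correct and steer the model through an interactive editing interface, closing the loop. This makes the model more representative, gives users a stronger sense of control, and suggests new directions for human-computer interaction and personalized system design.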

Abstract: Recent advances in user modeling make it feasible to conduct open-ended inference over a person’s everyday computer use. Despite longstanding visions of systems that deeply understand our actions and the purposes they serve in our lives, existing systems only capture what a person is doing in the moment – not why they are doing it – limiting these systems to surface-level support. We introduce striving co-creation, a process for inferring broader life goals from unstructured observations of computer use. Grounded in Activity Theory and Emmons’ personal strivings framework, our system progressively constructs a hierarchical representation of a person’s activities. Crucially, strivings are difficult to fully resolve from observation alone, as the same action can be driven by many different goals. Our system therefore supports an editing interface that gives people agency over how they are understood by the system, feeding their corrections back into subsequent rounds of striving induction. In a week-long field deployment (N=14), we find that our co-creation process produces strivings that are representative of participants’ long-term goals and gives them greater agency than baseline methods.


cs.GR

[51] FieryGS: In-the-Wild Fire Synthesis with Physics-Integrated Gaussian Splatting cs.GR | cs.CV

Qianfan Shen, Ningxiao Tao, Qiyu Dai, Tianle Chen, Minghan Qin

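TL;DR: This paper proposes FieryGS, a physically based framework that integrates physically accurate, user-controllable combustion simulation and rendering into the 3D Gaussian Splatting (3DGS) pipeline to synthesize realistic, physically plausible fire in real-world 3D scenes.

Details

Motivation: Traditional CFD and graphics pipelines depend on handcrafted geometry, expert-tuned parameters, and labor-intensive workflows, which makes them hard to scale to the real world. Advanced scene modeling techniques such as 3DGS reconstruct real scenes with high fidelity but lack physical modeling of combustion. FieryGS aims to bridge this gap.

Result: Evaluated on diverse indoor and outdoor scenes, FieryGS outperforms all compared baselines in visual realism, physical fidelity, and controllability.

Insight: The novelty lies in tightly coupling physical material reasoning based on a multimodal large language model, efficient volumetric combustion simulation, and a unified renderer for fire and 3DGS. Unifying reconstruction, physical reasoning, simulation, and rendering lets the framework automatically generate controllable, realistic fire dynamics that are consistent with scene geometry and materials.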

Abstract: We consider the problem of synthesizing photorealistic, physically plausible combustion effects in in-the-wild 3D scenes. Traditional CFD and graphics pipelines can produce realistic fire effects but rely on handcrafted geometry, expert-tuned parameters, and labor-intensive workflows, limiting their scalability to the real world. Recent scene modeling advances like 3D Gaussian Splatting (3DGS) enable high-fidelity real-world scene reconstruction, yet lack physical grounding for combustion. To bridge this gap, we propose FieryGS, a physically-based framework that integrates physically-accurate and user-controllable combustion simulation and rendering within the 3DGS pipeline, enabling realistic fire synthesis for real scenes. Our approach tightly couples three key modules: (1) multimodal large-language-model-based physical material reasoning, (2) efficient volumetric combustion simulation, and (3) a unified renderer for fire and 3DGS. By unifying reconstruction, physical reasoning, simulation, and rendering, FieryGS removes manual tuning and automatically generates realistic, controllable fire dynamics consistent with scene geometry and materials. Our framework supports complex combustion phenomena – including flame propagation, smoke dispersion, and surface carbonization – with precise user control over fire intensity, airflow, ignition location and other combustion parameters. Evaluated on diverse indoor and outdoor scenes, FieryGS outperforms all comparative baselines in visual realism, physical fidelity, and controllability. Project page can be found at https://pku-vcl-geometry.github.io/FieryGS/.


cs.RO

[52] Being-H0.7: A Latent World-Action Model from Egocentric Videos cs.RO | cs.CV | cs.LG

Hao Luo, Wanpeng Zhang, Yicheng Feng, Sipeng Zheng, Haiweng Xu

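TL;DR: The paper presents Being-H0.7, a latent world-action model that brings future-aware reasoning into vision-language-action models without generating future video frames. It inserts learnable latent queries between perception and action as a compact reasoning interface and trains them with a future-informed dual-branch design, so that the model can infer future-aware, action-useful structure from current observations alone.

Details

Motivation: Because action supervision is sparse, existing vision-language-action (VLA) models tend to learn shortcut mappings rather than representations of dynamics, contact, and task progress. World-action models that add future prediction do so in pixel space, which is costly and only indirectly related to control. The paper aims to combine the predictive benefits of world models with the efficiency and deployability of direct VLA policies.

Result: Experiments on six simulation benchmarks and diverse real-world tasks show that Being-H0.7 achieves state-of-the-art or comparable performance.

Insight: The novelty is a latent world-action architecture in which a training-only posterior branch conditioned on future observations guides a deployable prior branch conditioned only on current observations. The two branches are jointly aligned in the latent reasoning space, which gives efficient, future-aware policy learning without the overhead of explicit future-frame generation.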

Abstract: Visual-Language-Action models (VLAs) have advanced generalist robot control by mapping multimodal observations and language instructions directly to actions, but sparse action supervision often encourages shortcut mappings rather than representations of dynamics, contact, and task progress. Recent world-action models introduce future prediction through video rollouts, yet pixel-space prediction is a costly and indirect substrate for control, as it may model visual details irrelevant to action generation and introduces substantial training or inference overhead. We present Being-H0.7, a latent world-action model that brings future-aware reasoning into VLA-style policies without generating future frames. Being-H0.7 inserts learnable latent queries between perception and action as a compact reasoning interface, and trains them with a future-informed dual-branch design: a deployable prior branch infers latent states from the current context, while a training-only posterior branch replaces the queries with embeddings from future observations. Jointly aligning the two branches at the latent reasoning space leads the prior branch to reason future-aware, action-useful structure from current observations alone. At inference, Being-H0.7 discards the posterior branch and performs no visual rollout. Experiments across six simulation benchmarks and diverse real-world tasks show that Being-H0.7 achieves state-of-the-art or comparable performance, combining the predictive benefits of world models with the efficiency and deployability of direct VLA policies.


[53] World Model for Robot Learning: A Comprehensive Survey cs.RO | cs.CV

Bohan Hou, Gen Li, Jindou Jia, Tuo An, Xinying Guo

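TL;DR: This survey comprehensively reviews world models for robot learning. It covers their central role as predictive representations of environment dynamics, including how they are coupled with policies, how they serve as learned simulators for reinforcement learning and evaluation, and how they have progressed from imagination-based generation to controllable, structured, foundation-scale formulations.

Details

Motivation: Research on world models is fragmented across architectures, functional roles, and embodied application domains, and lacks a systematic overview. The survey fills this gap from a robot-learning perspective.

Result: As a survey, the paper proposes no new method and reports no quantitative experiments. It systematically summarizes representative datasets, benchmarks, and evaluation protocols, and organizes the key paradigms and applications.

Insight: The novelty is the first systematic taxonomy and review of world models from a robot-learning perspective. It clarifies their functional roles in policy learning, planning, simulation, evaluation, and data generation, connects them to application domains such as navigation and autonomous driving, and identifies challenges and directions for predictive modeling in embodied agents.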

Abstract: World models, which are predictive representations of how environments evolve under actions, have become a central component of robot learning. They support policy learning, planning, simulation, evaluation, and data generation, and have advanced rapidly with the rise of foundation models and large-scale video generation. However, the literature remains fragmented across architectures, functional roles, and embodied application domains. To address this gap, we present a comprehensive review of world models from a robot-learning perspective. We examine how world models are coupled with robot policies, how they serve as learned simulators for reinforcement learning and evaluation, and how robotic video world models have progressed from imagination-based generation to controllable, structured, and foundation-scale formulations. We further connect these ideas to navigation and autonomous driving, and summarize representative datasets, benchmarks, and evaluation protocols. Overall, this survey systematically reviews the rapidly growing literature on world models for robot learning, clarifies key paradigms and applications, and highlights major challenges and future directions for predictive modeling in embodied agents. To facilitate continued access to newly emerging works, benchmarks, and resources, we will maintain and regularly update the accompanying GitHub repository alongside this survey.


[54] Lucid-XR: An Extended-Reality Data Engine for Robotic Manipulation cs.RO | cs.CV

Yajvan Ravan, Adam Rashid, Alan Yu, Kai McClennen, Gio Huh

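TL;DR: This paper introduces Lucid-XR, an extended-reality data engine that generates diverse, realistic multimodal data for training real-world robotic systems. At its core is vuer, a web-based physics simulation environment that runs directly on the XR headset and enables latency-free immersive virtual interaction without specialized equipment. The system integrates on-device physics simulation with human-to-robot pose retargeting and further amplifies the data with a physics-guided video generation pipeline. Experiments show that robot visual policies trained solely on Lucid-XR's synthetic data transfer zero-shot to unseen, cluttered, poorly lit real environments; the approach is validated on dexterous manipulation tasks involving soft materials, loosely bound particles, and rigid-body contact.

Details

Motivation: Acquiring diverse, realistic multimodal data for training robotic manipulation is costly and difficult. The paper uses extended reality to provide a scalable way to generate synthetic data.

Result: Robot visual policies trained only on synthetic data perform well in zero-shot transfer to unseen, cluttered, poorly lit real environments, which demonstrates the method's effectiveness. No specific benchmarks or quantitative comparisons with state-of-the-art methods are mentioned.

Insight: The novelty lies in combining a web-based physics simulation environment on XR headsets with on-device computation, which makes immersive data generation low-latency and easily accessible, and in amplifying the data with a physics-guided video generation pipeline steerable by natural language. Together these provide a scalable source of synthetic data for robot training.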

Abstract: We introduce Lucid-XR, a generative data engine for creating diverse and realistic-looking multi-modal data to train real-world robotic systems. At the core of Lucid-XR is vuer, a web-based physics simulation environment that runs directly on the XR headset, enabling internet-scale access to immersive, latency-free virtual interactions without requiring specialized equipment. The complete system integrates on-device physics simulation with human-to-robot pose retargeting. Data collected is further amplified by a physics-guided video generation pipeline steerable via natural language specifications. We demonstrate zero-shot transfer of robot visual policies to unseen, cluttered, and badly lit evaluation environments, after training entirely on Lucid-XR’s synthetic data. We include examples across dexterous manipulation tasks that involve soft materials, loosely bound particles, and rigid body contact. Project website: https://lucidxr.github.io


[55] MSACT: Multistage Spatial Alignment for Stable Low-Latency Fine Manipulation cs.RO | cs.CV

Xianbo Cai, Hideyuki Ichiwara, Masaki Yoshikawa, Tetsuya Ogata

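TL;DR: This paper proposes MSACT, a multistage spatial alignment method that balances low-latency control against stable visual localization in fine manipulation. Built on the ACT framework, it uses a multistage attention module to extract stable 2D attention points and adds a temporal alignment loss to predict future attention sequences, which suppresses localization drift under limited data while keeping inference latency low.

Details

Motivation: Real-world fine manipulation, especially bimanual manipulation, needs low-latency control and stable visual localization, but large-scale data collection is costly and limited demonstrations can cause localization drift. Existing methods trade off latency, expressiveness, and computational cost, and none delivers low latency and stability together.

Result: Simulated and real-world fine manipulation experiments on the ALOHA bimanual platform show improvements in task success rate, attention drift, inference latency, and robustness to visual disturbances. Under the tested conditions, the method keeps inference latency low while improving localization stability and task performance.

Insight: The contributions are a multistage spatial attention module that extracts task-relevant 2D attention points as a local spatial modality, and a self-supervised objective that suppresses drift by aligning predicted attention sequences with visual features from future frames, without keypoint annotations. Together they give the vision-to-action mapping a more stable geometric basis under limited data.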

Abstract: Real-world fine manipulation, particularly bimanual manipulation, typically requires low-latency control and stable visual localization, while collecting large-scale data is costly and limited demonstrations may lead to localization drift. Existing approaches make different trade-offs: action-chunking policies such as ACT enable low-latency execution and data efficiency but rely on dense visual features without explicit spatial consistency; generative methods such as Diffusion Policy improve expressiveness but can incur iterative sampling latency; and vision-language-action and voxel-based methods enhance generalization and geometric grounding but require higher computational cost and system complexity. We introduce a multistage spatial attention module that extracts stable 2D attention points and jointly predicts future attention sequences with a temporal alignment loss. Built upon ACT with a pretrained ResNet visual prior, a multistage attention module extracts task-relevant 2D attention points as a local spatial modality for action prediction. To maintain consistent object tracking, we introduce a self-supervised objective that aligns predicted attention sequences with visual features from future frames, suppressing drift without keypoint annotations and improving stability of the vision-to-action mapping under limited data. Experiments on simulated and real-world fine manipulation tasks, conducted on the ALOHA bimanual platform, evaluate task success, attention drift, inference latency, and robustness to visual disturbances. Results indicate improvements in localization stability and task performance while maintaining low-latency inference under the tested conditions.


cs.LG

[56] Technical Report: Activation Residual Hessian Quantization (ARHQ) for Low-Bit LLM Quantization cs.LG | cs.CL | cs.CV

YiFeng Wang, Zhun Sun, Keisuke Sakaguchi

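TL;DR: This paper proposes Activation Residual Hessian Quantization (ARHQ), a post-training weight-splitting method that mitigates error propagation in low-bit activation-weight quantization. By building an input-side residual Hessian from activation quantization residuals (G_x), it analytically identifies error-sensitive weight directions and isolates them in a high-precision low-rank branch. Experiments show that ARHQ markedly improves layer-wise signal-to-noise ratio and preserves reasoning performance on the ZebraLogic benchmark.

Details

Motivation: In low-bit large language model (LLM) quantization, quantizing activations and weights together causes errors to propagate. The paper aims to improve the accuracy of quantized models with a post-training method.

Result: On Qwen3-4B-Thinking-2507, ARHQ markedly improves layer-wise signal-to-noise ratio (SNR) and preserves downstream performance on the ZebraLogic reasoning benchmark even under aggressive quantization.

Insight: The novelty is an analytical method that builds an input-side Hessian from activation quantization residuals and uses a truncated SVD to isolate error-sensitive weight directions in a high-precision branch. This post-training weight-splitting strategy helps contain the propagation of quantization error.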

Abstract: We present Activation Residual Hessian Quantization (ARHQ), a post-training weight splitting method designed to mitigate error propagation in low-bit activation-weight quantization. By constructing an input-side residual Hessian from activation quantization residuals ($G_x$), ARHQ analytically identifies and isolates error-sensitive weight directions into a high-precision low-rank branch. This is achieved via a closed-form truncated SVD on the scaled weight matrix $W G_x^{1/2}$. Experimental results on Qwen3-4B-Thinking-2507 demonstrate that ARHQ significantly improves layer-wise SNR and preserves downstream reasoning performance on ZebraLogic even under aggressive quantization. The code is available at https://github.com/BeautMoonQ/ARHQ.
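
One way to read the role of the scaled matrix is sketched below. The notation is an assumption made for illustration, not taken from the report: the layer computes y = W x, r is the activation quantization residual, G_x is its second moment and is taken to be invertible, and the names W_hi and W_lo for the two branches are hypothetical. Under these assumptions, the Frobenius norm of the scaled matrix measures how strongly the weights amplify activation quantization error, and the Eckart-Young theorem says a rank-k truncated SVD removes the largest share of it.

```latex
% Assumed notation (not from the report): y = W x, quantizer Q,
% residual r = x - Q(x), and G_x = E[r r^T], symmetric positive definite.
\begin{align*}
\mathbb{E}\,\lVert W r \rVert_2^2
  &= \operatorname{tr}\!\left(W G_x W^\top\right)
   = \lVert W G_x^{1/2} \rVert_F^2, \\
W G_x^{1/2} &\approx U_k \Sigma_k V_k^\top
  \quad \text{(rank-$k$ truncated SVD)}, \\
W_{\mathrm{hi}} &= U_k \Sigma_k V_k^\top G_x^{-1/2}, \qquad
W_{\mathrm{lo}} = W - W_{\mathrm{hi}}, \\
\lVert W_{\mathrm{lo}} G_x^{1/2} \rVert_F^2 &= \sum_{i > k} \sigma_i^2 .
\end{align*}
```

In this reading, W_hi has rank at most k and would be kept in high precision, while W_lo is the part left for low-bit quantization; the last line is the residual error energy that the low-bit branch still sees.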


[57] Wasserstein Distributionally Robust Regret Optimization for Reinforcement Learning from Human Feedback cs.LG | cs.CL | math.OC | stat.ML

Yikai Wang, Shang Liu, Jose Blanchet

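TL;DR: This paper proposes Wasserstein distributionally robust regret optimization for reinforcement learning from human feedback (RLHF) to address over-optimization caused by inaccurate reward models. It minimizes worst-case regret rather than optimizing the worst-case value used in standard distributionally robust optimization, which reduces excessive pessimism, and it compares favorably with existing baselines both in theory and in experiments.

Details

Motivation: The reward signal learned in RLHF is only a proxy for true human utility. This objective misspecification leads to reward over-optimization, in which proxy reward keeps rising while true quality declines. Existing mitigations are computationally heavy and overly pessimistic.

Result: In experiments, DRRO mitigates over-optimization more effectively than existing baselines, whereas standard DRO is systematically over-pessimistic.

Insight: The novelty is shifting the distributionally robust framework from optimizing worst-case value to minimizing worst-case regret. This reduces pessimism and yields an optimal policy with a water-filling structure that integrates easily into PPO/GRPO-style RLHF training.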

Abstract: Reinforcement learning from human feedback (RLHF) has become a core post-training step for aligning large language models, yet the reward signal used in RLHF is only a learned proxy for true human utility. From an operations research perspective, this creates a decision problem under objective misspecification: the policy is optimized against an estimated reward, while deployment performance is determined by an unobserved objective. The resulting gap leads to reward over-optimization, or Goodharting, where proxy reward continues to improve even after true quality deteriorates. Existing mitigations address this problem through uncertainty penalties, pessimistic rewards, or conservative constraints, but they can be computationally burdensome and overly pessimistic. We propose Wasserstein distributionally robust regret optimization (DRRO) for RLHF. Instead of pessimizing worst-case value as in standard DRO, DRRO pessimizes worst-case regret relative to the best policy under the same plausible reward perturbation. We study the promptwise problem through a simplex allocation model and show that, under an $\ell_1$ ambiguity set, the inner worst-case regret admits an exact solution and the optimal policy has a water-filling structure. These results lead to a practical policy-gradient algorithm with a simple sampled-bonus interpretation and only minor changes to PPO/GRPO-style RLHF training. The framework also clarifies theoretically why DRRO is less pessimistic than DRO, and our experiments show that DRRO mitigates over-optimization more effectively than existing baselines while standard DRO is systematically over-pessimistic.
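
The difference between worst-case value and worst-case regret can be seen in a toy problem. The sketch below is only an illustration under stated assumptions: it replaces the Wasserstein ambiguity set with two hypothetical reward scenarios, uses two candidate responses, and searches a grid of policies by brute force. None of the numbers or function names come from the paper.

```python
# Toy comparison of worst-case value (DRO) and worst-case regret (DRRO).
# All rewards and scenarios below are hypothetical illustration values.

# Each scenario is one plausible reward vector for responses (A, B).
SCENARIOS = [
    (1.0, 0.4),  # scenario 1: response A is much better than B
    (0.2, 0.4),  # scenario 2: response A is worse than B
]

# Policies: probability p of choosing response A, on a grid of 21 points.
GRID = [i / 20 for i in range(21)]


def value(p, rewards):
    """Expected reward of the policy that picks A with probability p."""
    r_a, r_b = rewards
    return p * r_a + (1 - p) * r_b


def worst_case_value(p):
    """Lowest expected reward of policy p over all scenarios."""
    return min(value(p, r) for r in SCENARIOS)


def worst_case_regret(p):
    """Largest gap to the best policy, measured within the SAME scenario."""
    return max(max(r) - value(p, r) for r in SCENARIOS)


p_dro = max(GRID, key=worst_case_value)
p_drro = min(GRID, key=worst_case_regret)
print(f"DRO picks p(A) = {p_dro:.2f}, DRRO picks p(A) = {p_drro:.2f}")
```

In this toy setting the worst-case-value rule puts all mass on the safe response B, while the worst-case-regret rule keeps most of its mass on A, because giving up A entirely would cost a lot in the scenario where A is good. This mirrors, in miniature, the abstract's point that regret-based robustness is less pessimistic.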


[58] State Stream Transformer (SST) V2: Parallel Training of Nonlinear Recurrence for Latent Space Reasoning cs.LG | cs.CL

Thea Aviss

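TL;DR: The paper presents State Stream Transformer (SST) V2, an architecture that reasons in continuous latent space through an FFN-driven nonlinear recurrence at each decoder layer. A two-pass parallel training procedure resolves the recurrence's sequential dependency, and at inference the model supports continuous latent deliberation at each position. Experiments show gains on mathematical reasoning and complex question answering.

Details

Motivation: Current transformers discard the rich latent residual stream between positions and leave latent reasoning capacity unused. The paper aims for parameter-efficient reasoning in continuous latent space.

Result: After a 27B backbone is fine-tuned on only a small amount of GSM8K data, SST gains 15.15 points over the fine-tuning-matched baseline on the out-of-distribution GPQA-Diamond dataset and cuts the baseline's remaining GSM8K errors by 46%. On GPQA-Diamond, the 27B SST is more accurate than several larger open-weight and proprietary systems, including open-weight models up to 25 times larger.

Insight: The novelty lies in three parts: streaming latent states horizontally across positions at each decoder layer through an FFN-driven nonlinear recurrence, which enables exploratory reasoning in continuous latent space; a two-pass parallel training procedure that efficiently resolves the recurrence dependency; and a latent-state analysis showing that the model reasons by exploring distinct semantic basins and that early latent states predict the stability of the final answer.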

Abstract: Current transformers discard their rich latent residual stream between positions, reconstructing latent reasoning context at each new position and leaving potential reasoning capacity untapped. The State Stream Transformer (SST) V2 enables parameter-efficient reasoning in continuous latent space through an FFN-driven nonlinear recurrence at each decoder layer, where latent states are streamed horizontally across the full sequence via a learned blend. This same mechanism supports continuous latent deliberation per position at inference time, dedicating additional FLOPs to exploring abstract reasoning before committing to a token. A two-pass parallel training procedure resolves the sequential dependency of the recurrence to allow compute-efficient training. Hidden state analysis shows the state stream facilitates reasoning through exploration of distinct semantic basins in continuous latent space, where transitions at content-dependent positions move the model into a substantially different Bayesian posterior, directly influencing the latent space at future positions. We also find, via a learned probe, that at the first generated token position, the latent state already predicts whether the eventual answer will survive or break under additional latent computation for every subsequent position. Co-trained into an existing 27B backbone using only a small dataset of GSM8K examples, the SST delivers a +15.15 point gain over a fine-tuning-matched baseline on out-of-distribution GPQA-Diamond and cuts that same baseline’s remaining GSM8K errors by 46%, together showing that the reasoning improvement is attributable to the architectural mechanism rather than scale or training data. On GPQA-Diamond, the resulting 27B SST also achieves higher accuracy than several larger open-weight and proprietary systems, including open-weight models up to 25 times larger.


[59] Odysseus: Scaling VLMs to 100+ Turn Decision-Making in Games via Reinforcement Learning cs.LG | cs.AI | cs.CL

Chengshuai Shi, Wenzhe Li, Xinran Liang, Yizhou Lu, Wenjia Yang

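TL;DR: This paper proposes Odysseus, a framework that uses reinforcement learning to train vision-language models for long-horizon decision-making tasks (such as the game Super Mario Land) with 100+ turns of interaction. It overcomes the limitation that existing methods either depend on large-scale supervised fine-tuning or apply only to short horizons (20-30 turns).

Details

Motivation: In interactive decision-making tasks such as video games, existing vision-language models either depend on supervised fine-tuning on large-scale human trajectories or apply reinforcement learning only over short horizons (about 20-30 turns), so they cannot meet the demands of long-horizon, multimodal decision-making.

Result: In Super Mario Land, Odysseus achieves at least 3 times the average game progress of frontier models and improves consistently in both in-game and cross-game generalization settings while retaining general-domain capabilities.

Insight: The contributions are a PPO variant with a lightweight turn-level critic that improves training stability and sample efficiency, and the use of pretrained vision-language models as strong action priors, which reduces reliance on manual design such as action engineering. Viewed objectively, the study identifies key ingredients and offers practical guidance for making reinforcement learning stable and effective in long-horizon, multimodal settings.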

Abstract: Given the rapidly growing capabilities of vision-language models (VLMs), extending them to interactive decision-making tasks such as video games has emerged as a promising frontier. However, existing approaches either rely on large-scale supervised fine-tuning (SFT) on human trajectories or apply reinforcement learning (RL) only in relatively short-horizon settings (typically around 20–30 turns). In this work, we study RL-based training of VLMs for long-horizon decision-making in Super Mario Land, a visually grounded environment requiring 100+ turns of interaction with coordinated perception, reasoning, and action. We begin with a systematic investigation of key algorithmic components and propose an adapted variant of PPO with a lightweight turn-level critic, which substantially improves training stability and sample efficiency over critic-free methods such as GRPO and Reinforce++. We further show that pretrained VLMs provide strong action priors, significantly improving sample efficiency during RL training and reducing the need for manual design choices such as action engineering, compared to classical deep RL trained from scratch. Building on these insights, we introduce Odysseus, an open training framework for VLM agents, achieving substantial gains across multiple levels of the game and at least 3 times the average game progress of frontier models. Moreover, the trained models exhibit consistent improvements under both in-game and cross-game generalization settings, while maintaining general-domain capabilities. Overall, our results identify key ingredients for making RL stable and effective in long-horizon, multi-modal settings, and provide practical guidance for developing VLMs as embodied agents.
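
To make the idea of a turn-level critic concrete, the sketch below computes advantages with standard generalized advantage estimation (GAE), using one value estimate per turn rather than per token. This is an assumption about how such a critic could be used, not the paper's exact algorithm; the function name and default hyperparameters are hypothetical.

```python
def turn_level_gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimation with one value estimate per turn.

    rewards[t] is the reward received after turn t.
    values[t] is the critic's estimate at the start of turn t; values holds
    one extra bootstrap entry for the state after the final turn.
    """
    assert len(values) == len(rewards) + 1
    advantages = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each turn can reuse the advantage of the next turn.
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


# Hypothetical 3-turn episode with a single reward at the end.
print(turn_level_gae([0.0, 0.0, 1.0], [0.0, 0.0, 0.0, 0.0], gamma=1.0, lam=1.0))
```

In a setup like this, every token generated within a turn would typically share that turn's advantage, so the critic needs only one scalar output per turn. Whether Odysseus assigns credit in exactly this way is not stated in the abstract.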


[60] Uniform-Correct Policy Optimization: Breaking RLVR’s Indifference to Diversity cs.LG | cs.CL | stat.ML

Anamika Lochab, Bolian Li, Ruqi Zhang

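TL;DR: This paper addresses diversity collapse in reinforcement learning with verifiable rewards (RLVR) on reasoning tasks and proposes Uniform-Correct Policy Optimization (UCPO). A conditional uniformity penalty pushes the policy to spread probability mass evenly across the set of correct solutions, which markedly improves diversity and multi-sample coverage (Pass@K) while maintaining single-attempt accuracy (Pass@1).

Details

Motivation: Existing RLVR methods such as GRPO often reduce multi-sample coverage and collapse diversity while raising single-attempt accuracy. The root cause is that their objectives are indifferent to how probability is distributed among correct solutions. Combined with stochastic training dynamics, this triggers a self-reinforcing collapse in which probability mass concentrates on a few correct outputs and other valid solutions are suppressed.

Result: Across three models (1.5B-7B parameters) and five mathematical reasoning benchmarks, UCPO improves Pass@K and diversity while keeping Pass@1 competitive. For example, it achieves up to +10% absolute improvement at Pass@64 on AIME24 and up to 45% higher equation-level diversity within the correct set.

Insight: The paper formalizes the diversity collapse mechanism and, under robustness and entropy-regularized optimality criteria, shows that the Uniform-Correct Policy is uniquely optimal. The resulting UCPO method redistributes gradient signal through a conditional uniformity penalty, encouraging uniform probability within the correct set. It is a simple, effective structural remedy for diversity collapse.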

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has achieved substantial gains in single-attempt accuracy (Pass@1) on reasoning tasks, yet often suffers from reduced multi-sample coverage (Pass@K), indicating diversity collapse. We identify a structural cause for this degradation: common RLVR objectives, such as GRPO, are indifferent to how probability mass is distributed among correct solutions. Combined with stochastic training dynamics, this indifference induces a self-reinforcing collapse, in which probability mass concentrates on a narrow subset of correct outputs while alternative valid solutions are suppressed. We formalize this collapse mechanism and further characterize the optimal policy structure under two complementary criteria: robustness and entropy-regularized optimality, which identify the Uniform-Correct Policy as uniquely optimal. Motivated by this analysis, we propose Uniform-Correct Policy Optimization (UCPO), a modification to GRPO that adds a conditional uniformity penalty on the policy’s distribution over correct solutions. The penalty redistributes gradient signal toward underrepresented correct responses, encouraging uniform allocation of probability mass within the correct set. Across three models (1.5B-7B parameters) and five mathematical reasoning benchmarks, UCPO improves Pass@K and diversity while maintaining competitive Pass@1, achieving up to +10% absolute improvement on AIME24 at Pass@64 and up to 45% higher equation-level diversity within the correct set. The code is available at https://github.com/AnamikaLochab/UCPO.
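
The indifference described above can be illustrated with a small calculation. The sketch below assumes one plausible form of a conditional uniformity penalty: the KL divergence from the policy's distribution over correct responses to the uniform distribution on that set. This particular form, the function name, and the probabilities are assumptions for illustration; the paper's actual penalty may differ.

```python
import math


def conditional_uniformity_penalty(probs, correct):
    """KL divergence from the conditional distribution over correct responses
    to the uniform distribution on the correct set.

    probs[i] is the policy probability of response i and correct[i] marks
    whether it is verified correct. Returns 0.0 when mass is spread evenly.
    """
    mass = [p for p, ok in zip(probs, correct) if ok]
    total = sum(mass)
    if total == 0 or len(mass) < 2:
        return 0.0
    n = len(mass)
    # KL(q || uniform) = sum_i q_i * log(q_i * n), with q_i = mass_i / total.
    return sum((m / total) * math.log((m / total) * n) for m in mass if m > 0)


# Two hypothetical policies with the same total mass (0.8) on correct answers.
collapsed = conditional_uniformity_penalty([0.79, 0.01, 0.20], [True, True, False])
balanced = conditional_uniformity_penalty([0.40, 0.40, 0.20], [True, True, False])
print(f"collapsed: {collapsed:.3f}, balanced: {balanced:.3f}")
```

Both policies place the same probability on correct answers, so an objective that depends only on correctness cannot tell them apart. The penalty separates them: it is zero for the balanced policy and positive for the collapsed one.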


[61] ResRL: Boosting LLM Reasoning via Negative Sample Projection Residual Reinforcement Learning cs.LG | cs.CL

Zihan Lin, Xiaohan Wang, Jie Cao, Jiajun Chai, Li Wang

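TL;DR: This paper proposes ResRL (negative sample projection residual reinforcement learning), which aims to improve the reasoning ability of large language models (LLMs) without the loss of generation diversity caused by over-incentivizing positive rewards. It decouples the similar semantic distributions shared by positive and negative responses and uses the projection residuals of negative-sample hidden representations to modulate negative gradients, outperforming baselines across multiple benchmarks.

Details

Motivation: Existing reinforcement learning with verifiable rewards (RLVR) methods improve LLM reasoning but often limit generation diversity by over-incentivizing positive rewards. Methods such as Negative Sample Reinforcement (NSR) mitigate this by upweighting the penalty on negative samples, but they may suppress semantic distributions shared by positive and negative responses. The paper aims to resolve this tension: improve reasoning without losing diversity.

Result: ResRL outperforms strong baselines on average across 12 benchmarks spanning mathematics, code, agent tasks, and function calling. On mathematical reasoning in particular, it surpasses NSR by 9.4% in Avg@16 and 7.0% in Pass@128.

Insight: The novelty is projecting negative-sample hidden representations onto an SVD-based low-rank positive subspace and using the projection residuals to modulate negative gradients, which in theory decouples the semantic interference between positive and negative responses. This yields a conservative advantage reweighting that strengthens reasoning while preserving generation diversity.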

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) enhances reasoning of Large Language Models (LLMs) but usually exhibits limited generation diversity due to the over-incentivization of positive rewards. Although methods like Negative Sample Reinforcement (NSR) mitigate this issue by upweighting penalty from negative samples, they may suppress the semantic distributions shared between positive and negative responses. To boost reasoning ability without losing diversity, this paper proposes negative sample projection Residual Reinforcement Learning (ResRL) that decouples similar semantic distributions among positive and negative responses. We theoretically link Lazy Likelihood Displacement (LLD) to negative-positive head-gradient interference and derive a single-forward proxy that upper-bounds representation alignment to guide conservative advantage reweighting. ResRL then projects negative-token hidden representations onto an SVD-based low-rank positive subspace and uses projection residuals to modulate negative gradients, improving reasoning while preserving diversity and outperforming strong baselines on average across twelve benchmarks spanning Mathematics, Code, Agent Tasks, and Function Calling. Notably, ResRL surpasses NSR on mathematical reasoning by 9.4% in Avg@16 and 7.0% in Pass@128. Code is available at https://github.com/1229095296/ResRL.git.
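
A projection residual measures how much of a vector lies outside a given subspace. The sketch below shows only that computation in plain Python. It builds the subspace with Gram-Schmidt orthonormalization as a stand-in for the paper's SVD-based low-rank subspace, and the idea of using the resulting ratio as a per-token weight on the negative gradient is an assumption about how the residual might be applied. All names and vectors are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def orthonormal_basis(vectors, tol=1e-10):
    """Gram-Schmidt orthonormalization (stand-in for an SVD-based subspace)."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = math.sqrt(dot(w, w))
        if norm > tol:  # skip vectors already inside the span
            basis.append([wi / norm for wi in w])
    return basis


def residual_ratio(h, basis):
    """Fraction of h's norm that lies outside span(basis), between 0 and 1."""
    proj = [0.0] * len(h)
    for b in basis:
        c = dot(h, b)
        proj = [p + c * bi for p, bi in zip(proj, b)]
    res = [hi - pi for hi, pi in zip(h, proj)]
    norm_h = math.sqrt(dot(h, h))
    return math.sqrt(dot(res, res)) / norm_h if norm_h > 0 else 0.0


# Hypothetical hidden states from positive responses span the x-y plane.
basis = orthonormal_basis([[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]])
shared = residual_ratio([1.0, 1.0, 0.0], basis)  # lies inside the positive subspace
novel = residual_ratio([0.0, 0.0, 2.0], basis)   # orthogonal to the positive subspace
print(f"shared: {shared:.2f}, novel: {novel:.2f}")
```

Under the assumed use, a negative token whose representation lies mostly inside the positive subspace would receive a small weight, so semantics shared with correct responses are penalized less, while a token outside the subspace keeps its full penalty.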


[62] RunAgent: Interpreting Natural-Language Plans with Constraint-Guided Execution cs.LG | cs.CL | cs.MA

Arunabh Srivastava, Mohammad A. Khojastepour, Srimat Chakradhar, Sennur Ulukus

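TL;DR: This paper proposes RunAgent, a multi-agent plan execution platform that addresses the unreliability of large language models (LLMs) in executing structured workflows. It executes natural-language plans step by step under constraints and rubrics, combining the expressiveness of natural language with the determinism of programming. It dynamically selects among reasoning, tool use, and code execution, and includes error correction and context filtering.

Details

Motivation: Humans solve problems by executing targeted plans, but current LLMs remain unreliable at executing structured workflows, so a system is needed that can reliably interpret and execute natural-language plans.

Result: Evaluations on the Natural-plan and SciBench datasets show that RunAgent outperforms baseline LLMs and the state-of-the-art PlanGEN method.

Insight: The novelty is an agentic language with explicit control constructs (such as IF, GOTO, and FORALL) that bridges natural language and programming. The system also autonomously derives and validates constraints from the task description and instance, dynamically selects the execution mode, and filters the context history to improve execution correctness.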

Abstract: Humans solve problems by executing targeted plans, yet large language models (LLMs) remain unreliable for structured workflow execution. We propose RunAgent, a multi-agent plan execution platform that interprets natural-language plans while enforcing stepwise execution through constraints and rubrics. RunAgent bridges the expressiveness of natural language with the determinism of programming via an agentic language with explicit control constructs (e.g., IF, GOTO, FORALL). Beyond syntactic and semantic verification of the step output, which is performed based on the specific instruction of each step, RunAgent autonomously derives and validates constraints based on the description of the task and its instance at each step. RunAgent also dynamically selects among LLM-based reasoning, tool usage, and code generation and execution (e.g., in Python), and incorporates error correction mechanisms to ensure correctness. Finally, RunAgent filters the context history by retaining only relevant information during the execution of each step. Evaluations on the Natural-plan and SciBench datasets demonstrate that RunAgent outperforms baseline LLMs and state-of-the-art PlanGEN methods.
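
The combination of explicit control constructs and per-step constraint checks can be sketched as a tiny interpreter. Everything below is hypothetical: the step format, the `run_plan` function, and the example plan are illustrative and are not RunAgent's actual language or API. Plain Python callables stand in for LLM, tool, or code steps, and FORALL is omitted for brevity.

```python
def run_plan(plan, state, max_steps=100):
    """Execute a list of steps with a program counter.

    Step forms (hypothetical):
      ("DO", action, constraint)  run action(state), then require constraint(state)
      ("IF", predicate, target)   jump to index target when predicate(state) holds
      ("GOTO", target)            unconditional jump to index target
    """
    pc = 0
    executed = 0
    while pc < len(plan):
        if executed >= max_steps:
            raise RuntimeError("step budget exceeded")
        step = plan[pc]
        executed += 1
        kind = step[0]
        if kind == "DO":
            _, action, constraint = step
            action(state)
            # Validate the step's output before moving on.
            if not constraint(state):
                raise ValueError(f"constraint violated after step {pc}")
            pc += 1
        elif kind == "IF":
            _, predicate, target = step
            pc = target if predicate(state) else pc + 1
        elif kind == "GOTO":
            pc = step[1]
        else:
            raise ValueError(f"unknown step kind: {kind}")
    return state


# Hypothetical plan: increment a counter until it reaches 3, never exceeding 3.
plan = [
    ("DO", lambda s: s.update(count=s["count"] + 1), lambda s: s["count"] <= 3),
    ("IF", lambda s: s["count"] >= 3, 3),  # index 3 is past the end, so the plan stops
    ("GOTO", 0),
]
result = run_plan(plan, {"count": 0})
print(result)
```

The step budget guards against plans that loop forever, and the constraint attached to each DO step plays the role of the per-step validation described in the abstract.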


[63] Learning physically grounded traffic accident reconstruction from public accident reports cs.LG | cs.CV

Yanchen Guan, Haicheng Liao, Chengyue Wang, Zhenning Li

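TL;DR: This paper proposes a method that learns physically grounded accident reconstruction from public traffic accident reports and scene measurements, framing reconstruction as a parameterized multimodal learning problem. The authors build the CISS-REC dataset (6,217 real-world accident cases) and develop a reconstruction framework that grounds report semantics to road topology and participant attributes, reconstructs lane-consistent pre-impact motion, and refines collision-relevant interactions through localized geometric reasoning and temporal allocation.

Details

Motivation: Traffic accidents are usually recorded as textual reports, but physically grounded reconstruction remains difficult because detailed scene measurements and expert reconstructions are scarce, costly, and hard to scale. The paper aims to use publicly accessible reports and scene measurements for scalable, quantitatively verifiable accident reconstruction.

Result: On CISS-REC, the method outperforms representative baselines and achieves the strongest overall reconstruction fidelity, including improved accident-point accuracy and collision consistency.

Insight: The novelty is formalizing accident reconstruction as a multimodal learning problem and proposing a framework that combines semantic grounding, lane-consistent motion reconstruction, and localized geometric reasoning. The central insight is that public accident reports can serve as a scalable computational substrate for quantitatively verifiable reconstruction, with potential value for traffic safety analysis, simulation, and autonomous driving research.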

Abstract: Traffic accidents are routinely documented in textual reports, yet physically grounded accident reconstruction remains difficult because detailed scene measurements and expert reconstructions are scarce, costly and hard to scale. Here we formulate accident reconstruction from publicly accessible reports and scene measurements as a parameterized multimodal learning problem. We construct CISS-REC, a dataset of 6,217 real-world accident cases curated from the NHTSA Crash Investigation Sampling System, and develop a reconstruction framework that grounds report semantics to road topology and participant attributes, reconstructs lane-consistent pre-impact motion, and refines collision-relevant interactions through localized geometric reasoning and temporal allocation. Our method outperforms representative baselines on CISS-REC, achieving the strongest overall reconstruction fidelity, including improved accident point accuracy and collision consistency. These results show that public accident reports can serve as scalable computational substrates for quantitatively verifiable accident reconstruction, with potential value for traffic safety analysis, simulation and autonomous driving research.


math.AC

[64] Elimination Templates in Macaulay2 math.AC | cs.CV | cs.MS

Manav Batavia, Cheng Chen, Anna Natalie Chlopecki, Timothy Duff, William Huang

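TL;DR: This paper introduces the EliminationTemplates package for the Macaulay2 computer algebra system, which provides tools for constructing automatic solvers for families of zero-dimensional radical ideals that depend on algebraically independent parameters. It describes in detail how elimination templates are constructed for such families and their specialization properties, and illustrates the package's main functionality and data types with several examples, including ones from computer vision.

Details

Motivation: Parameterized polynomial systems, especially geometric problems in computer vision, call for automatically generated numerical solvers that are efficient and stable. The paper provides a systematic construction of elimination templates for families of zero-dimensional radical ideals that depend on algebraically independent parameters.

Result: The paper demonstrates the package through several examples, including computer vision applications, but the abstract reports no quantitative benchmarks or direct performance comparisons with existing methods.

Insight: The novelty is systematizing the construction of elimination templates and packaging it as usable software, which automates solver generation for parameterized polynomial systems. The approach originates in, and applies to, polynomial solving in computer vision, and it makes implementing such algorithms more convenient and reliable.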

Abstract: We introduce the package EliminationTemplates for the Macaulay2 computer algebra system, which provides tools for constructing automatic solvers for families of zero-dimensional radical ideals depending on algebraically independent parameters. This article provides a self-contained description of how elimination templates are constructed for such families and their specialization properties. Additionally, we describe the main functionality and datatypes provided by our package, and illustrate its usage on several examples, including applications from computer vision from which elimination templates originated.


cs.AR

[65] DPU or GPU for Accelerating Neural Networks Inference – Why not both? Split CNN Inference cs.AR | cs.CV

Ali Emre Oztas, Mahir Demir, James Garside, Mikel Luján

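TL;DR: This paper proposes Split CNN Inference, which partitions CNN inference between a DPU and a GPU to accelerate video and image stream processing on edge devices. The AI engines (DPU) of a Versal VCK190 process the initial CNN layers, and the remaining layers are handed to a GPU (NVIDIA RTX 2080) in an asynchronous pipeline, which reduces data transfer latency. The paper also proposes a graph neural network (GNN)-based partition index prediction method to automate CNN partitioning.

Details

Motivation: Video and image streaming on edge devices requires low latency. Prior work mainly accelerates neural network inference with a single hardware unit (a GPU, FPGA, or DPU), but combining these units can reduce latency further.

Result: On LeNet-5, ResNet18/50/101/152, VGG16, and MobileNetv2, the method improves latency by up to 2.48x over DPU-only execution and up to 3.37x over GPU-only execution. The trained GNN model reaches 96.27% accuracy in assigning layers to devices.

Insight: The novelty is partitioning CNN inference across a DPU and a GPU, combining the DPU's proximity to the data source with the GPU's compute capability, and automating the partitioning decision with a GNN. The result is a form of hardware co-optimization that lowers latency.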

Abstract: Video and image streaming on edge devices requires low latency. To address this, Neural Networks (NNs) are widely used, and prior work mainly focuses on accelerating them with single hardware units such as Graphics Processing Units (GPUs), Field Programmable Gate Arrays (FPGAs), and Deep Learning Processing Units (DPUs). However, further reductions in latency can be observed by combining these units. In this paper, partitioning CNN inference across DPU and GPU (Split CNN Inference) is proposed. The first partition, which consists of the initial CNN layers processing the input images, runs on the AI engines (DPU) of a Versal VCK190. The DPU processes the first partition near the source of the data. Pipelined asynchronously, a GPU runs the remaining layers. The GPU (NVIDIA RTX 2080) processes the second partition, with reduced data transfer between the data source (storage/camera) and the GPU. Furthermore, a Graph Neural Network (GNN)-based partition index prediction method is proposed to automate the partitioning of CNNs needed for the Split Inference. Well-established models such as LeNet-5, ResNet18/50/101/152, VGG16, and MobileNetv2 are analyzed. Results demonstrate up to 2.48x latency improvement over DPU-only execution and up to 3.37x over GPU-only execution. The trained GNN model splits the layers between the appropriate devices with 96.27% accuracy.
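
The partition index is the layer boundary at which work moves from the DPU to the GPU. The sketch below picks that index by brute force from per-layer timings. It is a simplified stand-in under stated assumptions: the paper predicts the index with a GNN, whereas this code enumerates every boundary; it models single-frame latency and ignores the overlap gained from asynchronous pipelining; and the function name and all timing values are hypothetical.

```python
def best_partition(dpu_ms, gpu_ms, transfer_ms):
    """Return (k, latency_ms): layers [0, k) run on the DPU, layers [k, n) on the GPU.

    dpu_ms[i] and gpu_ms[i] are the per-layer latencies on each device.
    transfer_ms[k] is the cost of moving the tensor that crosses boundary k
    (k = 0 sends the raw input to the GPU, k = n sends only the final output).
    """
    n = len(dpu_ms)
    assert len(gpu_ms) == n and len(transfer_ms) == n + 1
    best = None
    for k in range(n + 1):
        latency = sum(dpu_ms[:k]) + transfer_ms[k] + sum(gpu_ms[k:])
        if best is None or latency < best[1]:
            best = (k, latency)
    return best


# Hypothetical 4-layer network: early layers are cheap on the DPU and shrink
# the tensor, later layers are cheaper on the GPU.
dpu_ms = [1.0, 1.0, 4.0, 6.0]
gpu_ms = [2.0, 2.0, 1.0, 1.0]
transfer_ms = [5.0, 3.0, 0.5, 0.4, 0.1]
k, latency = best_partition(dpu_ms, gpu_ms, transfer_ms)
print(f"split after layer {k}: {latency} ms")
```

With these made-up numbers the best split runs the first two layers on the DPU, because that is where the intermediate tensor becomes small enough to transfer cheaply while the remaining layers still favor the GPU.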


cs.AI

[66] On the Role of Artificial Intelligence in Human-Machine Symbiosis cs.AI | cs.CL | cs.HC

Ching-Chun Chang, Yuchen Guo, Hanrui Wang, Timo Spinde, Isao Echizen

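TL;DR: This paper studies how to trace the functional role that AI plays in natural language generation in the context of human-machine symbiosis. It proposes a method that infers the AI's latent role (such as assistive editing or creative generation), embeds it in the generated text, and later recovers it. Experiments validate the method's ability to discriminate between roles, its robustness to perturbations, and its preservation of linguistic quality.

Details

Motivation: As AI and humans become increasingly intertwined in symbiosis, the provenance of AI-generated information is hard to define. The key question is how the AI participated in generation, yet existing methods cannot trace the AI's functional role once the content is detached from the dialogue context.

Result: In representative scenarios where AI acts as an assistive editor or as a creative generation agent, experiments show that the method discriminates between roles, resists perturbations, and preserves linguistic quality.

Insight: The novelty is embedding the AI's functional role, which is implicit in the prompt, into the probabilistic generation process and recovering the nature of its participation from the generated text. This provides a technical basis for AI ethics research on questions such as fairness and transparency.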

Abstract: The evolution of artificial intelligence (AI) has rendered the boundary between humanity and computational machinery increasingly ambiguous. In the presence of more interwoven relationships within human-machine symbiosis, the very notion of AI-generated information becomes difficult to define, as such information arises not from either humans or machines in isolation, but from their mutual shaping. Therefore, a more pertinent question lies not merely in whether AI has participated, but in how it has participated. In general, the role assumed by AI is often specified, either implicitly or explicitly, in the input prompt, yet becomes less apparent or altogether unobservable when the generated content alone is available. Once detached from the dialogue context, the functional role may no longer be traceable. This study considers the problem of tracing the functional role played by AI in natural language generation. A methodology is proposed to infer the latent role specified by the prompt, embed this role into the content during the probabilistic generation process and subsequently recover the nature of AI participation from the resulting text. Experimentation is conducted under a representative scenario in which AI acts either as an assistive agent that edits human-written content or as a creative agent that generates new content from a brief concept. The experimental results support the validity of the proposed methodology in terms of discrimination between roles, robustness against perturbations and preservation of linguistic quality. We envision that this study may contribute to future research on the ethics of AI with regard to whether AI has been used fairly, transparently and appropriately.


[67] Learn where to Click from Yourself: On-Policy Self-Distillation for GUI Grounding cs.AI | cs.CV

Yan Zhang, Daiqing Wu, Huawen Shen, Yu Zhou, Can Ma

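TL;DR: This paper presents GUI-SD, the first on-policy self-distillation (OPSD) framework tailored for GUI (graphical user interface) grounding. It builds a visually enriched privileged context (a target bounding box plus a Gaussian soft mask) and uses entropy-guided distillation, which adaptively weights tokens by digit significance and teacher confidence, to provide dense token-level supervision from a single rollout. On multiple benchmarks it outperforms GRPO-based methods and naive OPSD in both accuracy and training efficiency.

Details

Motivation: Existing reinforcement learning methods such as GRPO rely on expensive multiple rollouts and give sparse signals on hard samples in GUI grounding. The paper explores whether on-policy self-distillation is applicable to this task.

Result: Extensive experiments on six representative GUI grounding benchmarks show that GUI-SD consistently outperforms GRPO-based methods and naive on-policy self-distillation in both accuracy and training efficiency.

Insight: The novelties are: 1) a visually enriched privileged context (target box plus Gaussian soft mask) that provides informative guidance without leaking exact coordinates; 2) entropy-guided distillation that adaptively concentrates optimization on the most impactful and reliable token positions. This offers a new way to use self-distillation efficiently in vision-heavy tasks.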

Abstract: Graphical User Interface (GUI) grounding maps natural language instructions to the visual coordinates of target elements and serves as a core capability for autonomous GUI agents. Recent reinforcement learning methods (e.g., GRPO) have achieved strong performance, but they rely on expensive multiple rollouts and suffer from sparse signals on hard samples. These limitations make on-policy self-distillation (OPSD), which provides dense token-level supervision from a single rollout, a promising alternative. However, its applicability to GUI grounding remains unexplored. In this paper, we present GUI-SD, the first OPSD framework tailored for GUI grounding. First, it constructs a visually enriched privileged context for the teacher using a target bounding box and a Gaussian soft mask, providing informative guidance without leaking exact coordinates. Second, it employs entropy-guided distillation, which adaptively weights tokens based on digit significance and teacher confidence, concentrating optimization on the most impactful and reliable positions. Extensive experiments on six representative GUI grounding benchmarks show that GUI-SD consistently outperforms GRPO-based methods and naive OPSD in both accuracy and training efficiency. Code and training data are available at https://zhangyan-ucas.github.io/GUI-SD/.
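The abstract mentions a Gaussian soft mask built from the target bounding box but does not say how it is parameterized. As a rough illustration only, the sketch below builds a 2D Gaussian centred on the box, with a standard deviation tied to the box size through a hypothetical `sigma_scale` parameter. The function name, the parameter, and the axis-aligned form are all assumptions, not details from GUI-SD.

```python
import math
from typing import List, Tuple


def gaussian_soft_mask(
    width: int,
    height: int,
    box: Tuple[float, float, float, float],
    sigma_scale: float = 0.5,  # hypothetical: sigma as a fraction of box size
) -> List[List[float]]:
    """Return a height x width grid of weights in (0, 1] peaking at the box centre.

    box is (x0, y0, x1, y1) in pixel coordinates with x1 > x0 and y1 > y0.
    """
    x0, y0, x1, y1 = box
    if x1 <= x0 or y1 <= y0:
        raise ValueError("box must have positive width and height")

    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    sx, sy = (x1 - x0) * sigma_scale, (y1 - y0) * sigma_scale

    return [
        [
            # Axis-aligned Gaussian: weight decays smoothly away from the centre,
            # so the region is highlighted without a hard coordinate boundary.
            math.exp(-0.5 * (((x - cx) / sx) ** 2 + ((y - cy) / sy) ** 2))
            for x in range(width)
        ]
        for y in range(height)
    ]


if __name__ == "__main__":
    mask = gaussian_soft_mask(9, 9, (2, 2, 6, 6))
    for row in mask:
        print(" ".join(f"{v:.2f}" for v in row))
```

A soft mask like this marks where the target is without exposing sharp box edges, which matches the abstract's stated goal of guidance without leaking exact coordinates.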


cs.IR

[68] Exploring LLM biases to manipulate AI search overview cs.IR | cs.AI | cs.CL

Roman Smirnov

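TL;DR: This paper studies biases in LLM Overview systems and whether they can be exploited. A small language model is trained with reinforcement learning to rewrite search snippets so that they are more likely to be selected. The results indicate that LLM Overview systems are biased and can be manipulated through optimization, that selection is driven by comparative rather than absolute advantages among candidate sources, and that context poisoning attacks pose a safety risk.

Details

Motivation: Both the source selection and answer generation stages of LLM Overview systems may be affected by LLM biases. The study investigates whether these biases are present and how they can be exploited to manipulate search result overviews.

Result: The experiments show that LLM Overview systems have biases and that, in most cases, reinforcement learning can optimize snippet content to manipulate the results. Under a restricted setup that reflects web search constraints, the model increases the likelihood that a snippet is selected.

Insight: The novelty is using reinforcement learning to train a small model that rewrites snippets specifically to exploit LLM biases. The work finds that LLM Overview selection depends on comparative advantage and highlights the safety threat of context poisoning attacks, which is relevant to designing more robust systems.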

Abstract: Modern large language models (LLMs) are used in many business applications in general, and specifically in web search systems and applications that generate overviews of search results - LLM Overview systems. Such systems are using an LLM to select most relevant sources from search results and generate an answer to the user’s query. It is known from many studies that LLMs have different biases, in LLM Overview application both the source selection and answer generation stages may be affected by the biases of LLMs (here we are focusing mainly on the selection stage). This research is focused on investigating the presence of the biases in LLM Overview systems and on biases exploitation to manipulate LLM Overview results. Here we train a small language model using reinforcement learning to rewrite search snippets to increase their likelihood of being preferred by an LLM Overview. Our experimental setup intentionally restricts the policy to operate only on snippets and limits reward-hacking strategies, reflecting realistic constraints of web search environments. The results prove that LLM Overview systems have biases and that reinforcement learning in most of the cases can optimize snippet’s content to manipulate LLM Overview results. We also prove that LLM Overview selections are driven by comparative rather than absolute advantages among candidate sources. In addition, we examine safety aspects of LLM Overview manipulation possibilities and show that context poisoning attacks can lead to inaccurate or harmful results.


[69] LLM-Oriented Information Retrieval: A Denoising-First Perspective cs.IR | cs.AI | cs.CL

Lu Dai, Liang Sun, Fanpu Cao, Ziyang Rao, Cehao Yang

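TL;DR: This perspective paper argues that denoising is becoming the primary bottleneck in the information access pipeline for LLM-oriented information retrieval. Unlike human users, LLMs are sensitive to noise, and misleading information directly causes hallucinations and reasoning failures. The paper conceptualizes this paradigm shift through a four-stage framework (from inaccessible to unverifiable) and provides a pipeline-organized taxonomy of signal-to-noise optimization techniques covering indexing, retrieval, context engineering, verification, and agentic workflows.

Details

Motivation: Modern information retrieval is increasingly consumed by LLMs through retrieval-augmented generation and agentic search. LLMs have limited attention budgets and are especially vulnerable to noise, so raising evidence density and verifiability (that is, denoising) becomes the key challenge.

Result: As a perspective paper, it reports no quantitative experiments. It systematically organizes and classifies signal-to-noise optimization techniques and surveys information denoising work in retrieval-heavy domains such as lifelong assistants, coding agents, deep research, and multimodal understanding.

Insight: The novelty is re-examining LLM-oriented information retrieval from a "denoising-first" perspective, with a four-stage framework that conceptualizes the paradigm shift and a systematic taxonomy of techniques. It stresses that, in the LLM era, information quality (density and verifiability) matters more than traditional relevance metrics.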

Abstract: Modern information retrieval (IR) is no longer consumed primarily by humans but increasingly by large language models (LLMs) via retrieval-augmented generation (RAG) and agentic search. Unlike human users, LLMs are constrained by limited attention budgets and are uniquely vulnerable to noise; misleading or irrelevant information is no longer just a nuisance, but a direct cause of hallucinations and reasoning failures. In this perspective paper, we argue that denoising (maximizing usable evidence density and verifiability within a context window) is becoming the primary bottleneck across the full information access pipeline. We conceptualize this paradigm shift through a four-stage framework of IR challenges: from inaccessible to undiscoverable, to misaligned, and finally to unverifiable. Furthermore, we provide a pipeline-organized taxonomy of signal-to-noise optimization techniques, spanning indexing, retrieval, context engineering, verification, and agentic workflow. We also present research works on information denoising in domains that rely heavily on retrieval such as lifelong assistant, coding agent, deep research, and multimodal understanding.


cs.SD

[70] LASE: Language-Adversarial Speaker Encoding for Indic Cross-Script Identity Preservation cs.SD | cs.CL | eess.AS

Venkata Pushpak Teja Menta

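TL;DR: This paper presents LASE (Language-Adversarial Speaker Encoder), a speaker encoder for multilingual voice cloning. It targets the failure of existing encoders to identify the same speaker consistently across scripts (English, Hindi, Telugu, and Tamil), a failure that is accent-conditional. LASE adds a small projection head over a frozen WavLM-base-plus model and trains it with a supervised contrastive loss and a gradient-reversal cross-entropy loss, so that the embedding stays speaker-informative while becoming language-uninformative. Experiments show that LASE reduces the cross-script similarity gap to a level consistent with zero on both the Western-accented and Indian-accented corpora, and matches the baseline in synthetic multi-speaker diarisation with far less training data.

Details

Motivation: In multilingual voice cloning, existing speaker encoders do not treat the same speaker identically when the audio is uttered in different scripts, and the failure depends on accent. This hurts cross-script text-to-speech (TTS), and the problem is largest when a system projects a non-Indic-trained voice into Indic scripts.

Result: On a 1043-pair Western-accented corpus and a 1369-pair Indian-accented corpus, LASE's residual cross-script cosine similarity gap is consistent with zero (Delta = 0.013 Western, Delta = 0.026 Indian; both bootstrap 95% CIs include zero). On the Western-accented corpus the baselines lose 0.082 (WavLM-base-plus-sv) and 0.105 (ECAPA-TDNN), and LASE amplifies the cross-script-vs-floor margin 2.4-2.7x over both baselines. In synthetic multi-speaker diarisation, LASE matches ECAPA-TDNN on cross-script speaker recall (0.788 vs 0.789) with about 100x less training data.

Insight: The novelty is language-adversarial training (a gradient-reversal cross-entropy loss) that makes the speaker embedding insensitive to language while keeping it speaker-discriminative. Both the frozen pretrained WavLM backbone and the adversarial objective contribute to cross-script identity consistency, yielding a lightweight solution for multilingual speech processing.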

Abstract: A speaker encoder used in multilingual voice cloning should treat the same speaker identically regardless of which script the audio was uttered in. Off-the-shelf encoders do not, and the failure is accent-conditional. On a 1043-pair Western-accented voice corpus across English, Hindi, Telugu, and Tamil, WavLM-base-plus-sv loses 0.082 absolute cosine similarity when the same voice changes script and ECAPA-TDNN loses 0.105. On a 1369-pair Indian-accented voice corpus, the gap shrinks to 0.006 (WavLM-SV) and 0.044 (ECAPA-TDNN). The leak is largest where it matters most for cross-script TTS: when a system projects a non-Indic-trained voice into Indic scripts. We present LASE (Language-Adversarial Speaker Encoder), a small projection head over frozen WavLM-base-plus trained with two losses: a supervised contrastive loss over voice identity, and a gradient-reversal cross-entropy against a 4-language classifier that pushes the embedding to be language-uninformative while remaining speaker-informative. Trained on 1118 quality-gated cross-script pairs synthesised from 8 commercial multilingual voices, LASE’s residual gap is consistent with zero on both corpora (Delta = 0.013 Western, Delta = 0.026 Indian; both bootstrap 95% CIs include zero) and amplifies the cross-script-vs-floor margin 2.4-2.7x over both baselines. An ECAPA+GRL ablation shows the GRL objective improves either backbone but the WavLM choice contributes too. In synthetic multi-speaker diarisation, LASE matches ECAPA-TDNN on cross-script speaker recall (0.788 vs 0.789) with ~100x less training data. We release the r1 checkpoint, both corpora, and the bootstrap recipe.
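The cross-script gap reported above is a difference in cosine similarity for the same voice with and without a script change. The abstract does not give the exact estimator, so the sketch below assumes the simplest reading: mean cosine similarity over same-script pairs minus mean cosine similarity over cross-script pairs of the same voice. The function names and the toy embeddings are hypothetical, and the bootstrap confidence intervals from the paper are not reproduced.

```python
import math
from statistics import mean
from typing import List, Sequence, Tuple

Vector = Sequence[float]
Pair = Tuple[Vector, Vector]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("zero-norm embedding")
    return dot / (norm_a * norm_b)


def cross_script_gap(same_script: List[Pair], cross_script: List[Pair]) -> float:
    """Assumed gap: mean same-script similarity minus mean cross-script similarity.

    Every pair holds two embeddings of the SAME voice; a value near zero means
    the encoder is not sensitive to the script change.
    """
    same = mean(cosine(a, b) for a, b in same_script)
    cross = mean(cosine(a, b) for a, b in cross_script)
    return same - cross


if __name__ == "__main__":
    # Toy 2-D embeddings, for illustration only.
    same_pairs = [([1.0, 0.0], [1.0, 0.0])]
    cross_pairs = [([1.0, 0.0], [1.0, 1.0])]
    print(round(cross_script_gap(same_pairs, cross_pairs), 3))
```

Under this reading, the 0.082 and 0.105 figures for the baselines would correspond to the value returned by `cross_script_gap` on the Western-accented corpus.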


[71] MMAudioReverbs: Video-Guided Acoustic Modeling for Dereverberation and Room Impulse Response Estimation cs.SD | cs.CV | cs.LG | eess.AS

Akira Takahashi, Ryosuke Sawata, Shusuke Takahashi, Yuki Mitsufuji

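TL;DR: This paper proposes MMAudioReverbs, a unified framework built on the pretrained video-to-audio (V2A) model MMAudio for video-guided acoustic modeling. It addresses dereverberation and room impulse response (RIR) estimation without modifying the network architecture, requiring only fine-tuning on a small dataset.

Details

Motivation: Existing V2A models synthesize semantically plausible sounds from visual inputs but do not explicitly model room-acoustic effects such as reverberation or RIRs, so they offer limited control over these effects. The authors hypothesize that V2A models implicitly hold semantic knowledge of the relationship between spatial audio and visual cues, and explore using a pretrained model as a prior for physically grounded acoustic processing.

Result: Experiments show that audio and visual cues each have an advantage depending on the type of physical room acoustics, which implies that foundation V2A models can be used for physically grounded room-acoustic analysis. The abstract does not mention specific benchmarks or quantitative results (such as SOTA comparisons).

Insight: The novelty is using a pretrained V2A model as a prior and fine-tuning it to handle dereverberation and RIR estimation in one framework, with no architectural changes. This shows the potential value of visual cues in physical acoustics tasks and offers a cross-modal perspective on audio processing.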

Abstract: Although recent video-to-audio (V2A) models excelled at synthesizing semantically plausible sounds from visual inputs, they do not explicitly model room-acoustic effects such as reverberation or room impulse responses (RIRs), and thus offer limited controllability over these effects. However, we hypothesize that such V2A models implicitly have semantic knowledge of the relationship between spatial audio and the corresponding vision cues. In this paper, we revisit a V2A model for the sake of the above, and propose the way to utilize the pretrained model as prior for physically grounded room-acoustic processing. Based on one of the state-of-the-art V2A models, MMAudio, we propose MMAudioReverbs that is a unified framework dealing with i) dereverberation and ii) room impulse response (RIR) estimation without network architectural modification, and fine-tuned on a small dataset. Experimental results showed that audio and visual cues respectively have advantage depending on the type of physical room acoustics. It implies that foundation V2A models can be used for physically grounded room-acoustic analysis.


[72] MMAudio-LABEL: Audio Event Labeling via Audio Generation for Silent Video cs.SD | cs.CV

Kazuya Tateishi, Akira Takahashi, Atsuo Hiroe, Hirofumi Takeda, Shusuke Takahashi

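TL;DR: This paper proposes MMAudio-LABEL, a framework that uses a latent-based audio generation model to jointly generate audio and frame-aligned sound event labels from silent video. It addresses error accumulation in conventional post-hoc pipelines and substantially improves event detection and classification accuracy on the Greatest Hits dataset.

Details

Motivation: Multimodal generation can produce high-quality audio from silent video, but practical applications such as sound production also need explicit sound event labels (type and timing). Applying detection to the generated audio afterwards is prone to error accumulation, so a framework that jointly generates audio and event labels is needed.

Result: On the Greatest Hits dataset, the method improves onset-detection accuracy from 46.7% to 75.0% and 17-class material-classification accuracy from 40.6% to 61.0% over baselines.

Insight: The novelty is learning audio generation and event prediction jointly on top of latent representations, which makes the model more interpretable and practical and avoids the error propagation of a conventional pipeline.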

Abstract: Recent advances in multimodal generation have enabled high-quality audio generation from silent videos. Practical applications, such as sound production, demand not only the generated audio but also explicit sound event labels detailing the type and timing of sounds. One straightforward approach involves applying a standard sound event detection to the generated audio. However, this post-hoc pipeline is inherently limited, as it is prone to error accumulation. To address this limitation, we propose MMAudio-LABEL (LAtent-Based Event Labeling), an event-aware audio generation framework built on a foundational audio generation model as its backbone that jointly generates audio and frame-aligned sound event predictions from silent videos. We evaluate our method on the Greatest Hits dataset for onset detection and 17-class material classification. Our approach improves onset-detection accuracy from 46.7% to 75.0% and material-classification accuracy from 40.6% to 61.0% over baselines. These results suggest that jointly learning audio generation and event prediction enables a more interpretable and practical video-to-audio synthesis.


eess.SP

[73] TimeRFT: Stimulating Generalizable Time Series Forecasting for TSFMs via Reinforcement Finetuning eess.SP | cs.AI | cs.CV | cs.LG

Siyang Li, Yize Chen, Zijie Zhu, Yuxin Pan, Yan Guo

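TL;DR: This paper proposes TimeRFT (Time series Reinforcement Finetuning), a paradigm that addresses the loss of generalization in time series foundation models (TSFMs) on downstream forecasting tasks caused by temporal distribution shift and limited training data diversity. It combines a forecasting quality-based temporal reward mechanism with a forecasting difficulty-based data selection strategy to improve adaptability and accuracy across real-world forecasting tasks and data regimes.

Details

Motivation: When TSFMs are adapted to downstream tasks with supervised fine-tuning (SFT), the non-stationarity and uncertainty of time series data make them prone to overfitting and degraded generalization. Data availability also varies widely across tasks, so a fine-tuning method that generalizes under diverse data regimes is needed.

Result: Extensive experiments show that TimeRFT consistently outperforms SFT-based adaptation across various real-world forecasting tasks and training data regimes, improving prediction accuracy and generalization against unforeseen distribution shifts.

Insight: The novelty is bringing reinforcement learning into TSFM fine-tuning: a multi-faceted temporal reward evaluates the contribution of each prediction step to overall accuracy, and a data selection strategy picks training samples with generalizable predictive patterns and informative training signals, which improves robustness and adaptability on downstream tasks.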

Abstract: Time Series Foundation Models (TSFMs) advance generalization and data efficiency in time series forecasting by unified large-scale pretraining. But TSFMs remain lacking when adapting to specific downstream forecasting tasks for two reasons. First, the non-stationary and uncertain nature of time series data lead to inevitable temporal distribution shifts between historical training and future testing data, while current Supervised FineTuning (SFT)-based methods are prone to overfitting and may degrade generalization. Second, training data availability varies across forecasting tasks, requiring TSFMs to generalize well under diverse data regimes. To address these challenges, we introduce the Time series Reinforcement Finetuning (TimeRFT) paradigm for TSFM downstream adaptation, which consists of two task-specific training recipes: i) A forecasting quality-based temporal reward mechanism that conducts a multi-faceted evaluation of the contribution of each prediction step to overall forecasting accuracy. ii) A forecasting difficulty-based data selection strategy to identify time series samples with generalizable predictive patterns and informative training signals. Extensive experiments demonstrate TimeRFT can consistently outperform SFT-based adaptation methods across various real-world forecasting tasks and training data regimes, enhancing prediction accuracy and generalization against unforeseen distribution shifts.