Table of Contents
- cs.CL [Total: 22]
- cs.CV [Total: 49]
- cs.LG [Total: 11]
- eess.AS [Total: 2]
- cs.SE [Total: 1]
- cs.AI [Total: 3]
- cs.SD [Total: 1]
- cs.CR [Total: 1]
cs.CL
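[1] Evolving Demonstration Optimization for Chain-of-Thought Feature Transformation cs.CL | cs.AI | cs.LG
Xinyuan Wang, Kunpeng Liu, Arun Vignesh Malarkkan, Yanjie Fu
TL;DR: This paper proposes a method that optimizes context data by evolving trajectory-level experiences, with the aim of improving LLM-driven feature transformation. Starting from high-performing feature transformation sequences explored by reinforcement learning, it builds and continuously updates an experience library of transformation trajectories verified on downstream tasks, and uses a diversity-aware selector together with chain-of-thought to form contexts that guide the generation of higher-performing transformed features.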
Details
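Motivation: Feature transformation is a core task in data-centric AI, but existing methods (discrete search, latent generation, and LLM methods that rely on static demonstrations) suffer from sample inefficiency, invalid candidates, redundant generations with limited coverage, and weak alignment with downstream objectives.
Result: Experiments on multiple tabular benchmarks show that the method outperforms classical and LLM-based baselines and is more stable than one-shot generation. The framework generalizes across API-based and open-source LLMs and remains robust across downstream evaluators.
Insight: The main novelty is a closed-loop evolutionary optimization framework that builds and updates an experience library and combines diversity-aware selection with chain-of-thought to optimize the LLM’s context dynamically. This improves the diversity of feature transformations and their alignment with downstream tasks, addressing the limitations of static demonstrations.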
Abstract: Feature Transformation (FT) is a core data-centric AI task that improves feature space quality to advance downstream predictive performance. However, discovering effective transformations remains challenging due to the large space of feature-operator combinations. Existing solutions rely on discrete search or latent generation, but they are frequently limited by sample inefficiency, invalid candidates, and redundant generations with limited coverage. Large Language Models (LLMs) offer strong priors for producing valid transformations, but current LLM-based FT methods typically rely on static demonstrations, resulting in limited diversity, redundant outputs, and weak alignment with downstream objectives. We propose a framework that optimizes context data for LLM-driven FT by evolving trajectory-level experiences in a closed loop. Starting from high-performing feature transformation sequences explored by reinforcement learning, we construct and continuously update an experience library of downstream task-verified transformation trajectories, and use a diversity-aware selector to form contexts along with a chain-of-thought and guide transformed feature generation toward higher performance. Experiments on diverse tabular benchmarks show that our method outperforms classical and LLM-based baselines and is more stable than one-shot generation. The framework generalizes across API-based and open-source LLMs and remains robust across downstream evaluators.
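[2] CEI: A Benchmark for Evaluating Pragmatic Reasoning in Language Models cs.CL | cs.AI
Jon Chun, Hannah Sussman, Adrian Mangine, Murathan Kocaman, Kirill Sidorko
TL;DR: This paper introduces CEI (Contextual Emotional Inference), a benchmark for evaluating the pragmatic reasoning ability of large language models. It contains 300 human-validated scenarios designed to test how models disambiguate complex, ambiguous utterances given a situational context and explicit speaker-listener power relations.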
Details
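Motivation: Pragmatic reasoning (inferring intent beyond literal semantics) underpins everyday communication but remains difficult for large language models. Existing benchmarks may be insufficient for evaluating practical pragmatic understanding in complex social contexts, so a new, fine-grained evaluation tool is needed.
Result: The paper describes how the CEI benchmark was built, covering five pragmatic subtypes (sarcasm/irony, mixed signals, strategic politeness, passive aggression, deflection/misdirection) drawn from workplace, family, social, and service settings, and three power configurations (peer, higher-to-lower, lower-to-higher). Inter-annotator agreement is low (Fleiss’ kappa = 0.06-0.25), which the authors consider expected because pragmatic inference admits multiple valid readings.
Insight: The novelty is a benchmark dataset dedicated to pragmatic reasoning that includes rich social situations and power dynamics. Its annotation methodology (including a 4-level quality control pipeline) and its treatment of annotator disagreement as information in its own right offer a new angle for evaluating models on ambiguous, polysemous contexts, and underline the role of subjectivity and diversity in pragmatic evaluation.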
Abstract: Pragmatic reasoning, inferring intended meaning beyond literal semantics, underpins everyday communication yet remains difficult for large language models. We present the Contextual Emotional Inference (CEI) Benchmark: 300 human-validated scenarios for evaluating how well LLMs disambiguate pragmatically complex utterances. Each scenario pairs a situational context and speaker-listener roles (with explicit power relations) against an ambiguous utterance. The dataset covers five pragmatic subtypes (sarcasm/irony, mixed signals, strategic politeness, passive aggression, deflection/misdirection) drawn from workplace, family, social, and service settings, with three power configurations (peer, higher-to-lower, lower-to-higher). Three trained annotators independently labeled every scenario. Inter-annotator agreement (Fleiss’ kappa = 0.06-0.25 by subtype) is low but expected: pragmatic inference admits multiple valid readings, and the disagreement itself is informative. We describe our annotation methodology, including a 4-level quality control pipeline that combines automated statistical checks with expert adjudication. CEI is released under CC-BY-4.0.
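The agreement statistic quoted above can be reproduced from raw labels. The following is a minimal sketch of the standard Fleiss' kappa formula, not the authors' code; the function name, the label strings, and the toy ratings are illustrative assumptions, and every item is assumed to have the same number of annotators.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for `ratings`, a list of items where each item is a
    list of category labels, one label per annotator.

    Assumes every item has the same number of annotators and that at
    least two distinct categories occur (otherwise the denominator is 0).
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    categories = sorted({label for item in ratings for label in item})
    counts = [Counter(item) for item in ratings]

    # Observed agreement: mean over items of the fraction of agreeing rater pairs.
    p_items = [
        (sum(c[cat] ** 2 for cat in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_observed = sum(p_items) / n_items

    # Chance agreement: sum of squared overall category proportions.
    p_categories = [
        sum(c[cat] for c in counts) / (n_items * n_raters) for cat in categories
    ]
    p_expected = sum(p ** 2 for p in p_categories)

    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical toy data: 4 scenarios, 3 annotators, 2 labels.
toy_ratings = [
    ["sarcasm", "sarcasm", "polite"],
    ["sarcasm", "polite", "polite"],
    ["polite", "polite", "polite"],
    ["sarcasm", "sarcasm", "sarcasm"],
]
toy_kappa = fleiss_kappa(toy_ratings)
```

Values near 0 mean agreement is close to what chance would produce, which is the range the abstract reports per subtype.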
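[3] There Are No Silly Questions: Evaluation of Offline LLM Capabilities from a Turkish Perspective cs.CL | cs.AI | cs.CR | cs.LG
Edibe Yilmaz, Kahraman Kostas
TL;DR: This study evaluates offline large language models for educational use from a Turkish perspective, focusing on Turkish heritage language education. It develops a Turkish Anomaly Suite of 10 edge-case scenarios to assess 14 models ranging from 270M to 32B parameters on epistemic resistance, logical consistency, and pedagogical safety.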
Details
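Motivation: To address data privacy and reliability concerns when integrating LLMs into education, especially in pedagogically vulnerable settings such as Turkish heritage language education, by evaluating the robustness and pedagogical safety of locally deployable offline LLMs.
Result: Anomaly resistance does not depend solely on model scale, and sycophancy bias poses pedagogical risks even in large models. Reasoning-oriented models in the 8B-14B parameter range offer the most balanced cost-safety trade-off.
Insight: The novelty is an anomaly test suite built for a specific linguistic and cultural setting. It shows that the relationship between model scale and safety is non-monotonic and gives parameter-range guidance for choosing models in educational scenarios.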
Abstract: The integration of large language models (LLMs) into educational processes introduces significant constraints regarding data privacy and reliability, particularly in pedagogically vulnerable contexts such as Turkish heritage language education. This study aims to systematically evaluate the robustness and pedagogical safety of locally deployable offline LLMs within the context of Turkish heritage language education. To this end, a Turkish Anomaly Suite (TAS) consisting of 10 original edge-case scenarios was developed to assess the models’ capacities for epistemic resistance, logical consistency, and pedagogical safety. Experiments conducted on 14 different models ranging from 270M to 32B parameters reveal that anomaly resistance is not solely dependent on model scale and that sycophancy bias can pose pedagogical risks even in large-scale models. The findings indicate that reasoning-oriented models in the 8B–14B parameter range represent the most balanced segment in terms of cost-safety trade-off for language learners.
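[4] Beyond the Prompt in Large Language Models: Comprehension, In-Context Learning, and Chain-of-Thought cs.CL | cs.LG
Yuling Jiao, Yanming Lai, Huazhen Lin, Wensen Ma, Houduo Qi
TL;DR: This paper examines the theoretical mechanisms behind emergent abilities of large language models (LLMs): semantic prompt comprehension, in-context learning (ICL), and chain-of-thought (CoT) reasoning. By analyzing the autoregressive process, it shows how LLMs can exactly infer token transition probabilities across tasks, and explains that ICL improves performance by reducing prompt ambiguity while CoT activates the model’s ability to handle complex problems through task decomposition.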
Details
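Motivation: Although LLMs perform well on many tasks, the theoretical mechanisms behind semantic prompt comprehension, in-context learning, and chain-of-thought reasoning remain unclear. The study addresses three questions: how do LLMs accurately decode prompt semantics when trained only on next-token prediction? Why does ICL improve performance without parameter updates? And why do the intermediate reasoning steps in CoT unlock the ability to solve complex multi-step problems?
Result: The theoretical analysis shows that LLMs can use the autoregressive process to exactly infer token transition probabilities across tasks. ICL improves performance by reducing prompt ambiguity, which concentrates the posterior on the intended task. CoT activates model capability by decomposing a complex problem into a sequence of simpler sub-tasks mastered during pretraining. By comparing the error bounds of these methods, the paper offers new theoretical insight into the statistical superiority of advanced prompt engineering techniques.
Insight: The novelty is a systematic theoretical account of the statistical mechanisms of prompt comprehension, ICL, and CoT in LLMs, emphasizing the roles of autoregressive inference, ambiguity reduction, and task decomposition. Viewed objectively, it provides a verifiable theoretical framework for understanding emergent abilities of LLMs and can inform more effective prompt design and model optimization.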
Abstract: Large Language Models (LLMs) have demonstrated remarkable proficiency across diverse tasks, exhibiting emergent properties such as semantic prompt comprehension, In-Context Learning (ICL), and Chain-of-Thought (CoT) reasoning. Despite their empirical success, the theoretical mechanisms driving these phenomena remain poorly understood. This study dives into the foundations of these observations by addressing three critical questions: (1) How do LLMs accurately decode prompt semantics despite being trained solely on a next-token prediction objective? (2) Through what mechanism does ICL facilitate performance gains without explicit parameter updates? and (3) Why do intermediate reasoning steps in CoT prompting effectively unlock capabilities for complex, multi-step problems? Our results demonstrate that, through the autoregressive process, LLMs are capable of exactly inferring the transition probabilities between tokens across distinct tasks using provided prompts. We show that ICL enhances performance by reducing prompt ambiguity and facilitating posterior concentration on the intended task. Furthermore, we find that CoT prompting activates the model’s capacity for task decomposition, breaking complex problems into a sequence of simpler sub-tasks that the model has mastered during the pretraining phase. By comparing their individual error bounds, we provide novel theoretical insights into the statistical superiority of advanced prompt engineering techniques.
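[5] FERRET: Framework for Expansion Reliant Red Teaming cs.CL | cs.AI
Ninareh Mehrabi, Vitor Albiero, Maya Pavlova, Joanna Bitton
TL;DR: This paper presents FERRET, a multi-faceted automated red teaming framework that generates multi-modal adversarial conversations to break a target model, and introduces several expansion mechanisms to make those conversations more effective and efficient.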
Details
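Motivation: To address the limitations of existing automated red teaming methods in generating effective multi-modal adversarial conversations, in particular the lack of systematic expansion strategies for improving conversation starters, deepening conversations, and adjusting attack strategies dynamically.
Result: Experiments show that FERRET outperforms existing state-of-the-art approaches at generating multi-modal adversarial conversations.
Insight: The novelty lies in three expansion mechanisms: horizontal expansion (self-improvement to generate more effective conversation starters), vertical expansion (expanding starters into multi-modal conversations), and meta expansion (discovering more effective multi-modal attack strategies during a conversation). Together they offer a systematic design for automated red teaming frameworks.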
Abstract: We introduce a multi-faceted automated red teaming framework in which the goal is to generate multi-modal adversarial conversations that would break a target model and introduce various expansions that would result in more effective and efficient adversarial conversations. The introduced expansions include: (1) horizontal expansion, in which the goal is for the red team model to self-improve and generate more effective conversation starters that would shape a conversation; (2) vertical expansion, in which the goal is to take the conversation starters discovered in the horizontal expansion phase and expand them into effective multi-modal conversations; and (3) meta expansion, in which the goal is for the red team model to discover more effective multi-modal attack strategies during the course of a conversation. We call our framework FERRET (Framework for Expansion Reliant Red Teaming) and compare it with various existing automated red teaming approaches. In our experiments, we demonstrate the effectiveness of FERRET in generating effective multi-modal adversarial conversations and its superior performance against existing state-of-the-art approaches.
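[6] A Principle-Driven Adaptive Policy for Group Cognitive Stimulation Dialogue for Elderly with Cognitive Impairment cs.CL
Jiyue Jiang, Yanyu Chen, Pengan Chen, Kai Liu, Jingqi Zhou
TL;DR: This paper proposes a group cognitive stimulation dialogue system for older adults with cognitive impairment. It builds a large dataset of real and simulated dialogues and designs four core modules (multi-speaker context control, dynamic participant cognitive state modeling, a cognitive stimulation-focused attention loss, and a multi-dimensional reward strategy) to overcome the limitations of LLMs in this setting. Experiments show that the system significantly outperforms baseline models across several evaluation metrics.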
Details
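Motivation: Cognitive impairment has become a major public health challenge. Traditional cognitive stimulation therapy is hard to scale, and existing digital systems struggle with group dialogue and with following cognitive stimulation principles. Applying LLMs directly faces key challenges around dialogue paradigms, missing therapeutic reasoning, and static user modeling.
Result: Experiments show that the proposed GCSD system significantly outperforms baseline models across various evaluation metrics.
Insight: The novelty is a principle-driven adaptive policy, realized in a system that integrates multi-speaker context control, dynamic cognitive state modeling, a cognitive stimulation-focused attention loss, and a multi-dimensional reward strategy. In particular, a Principle-Guided Scenario Simulation strategy generates large-scale simulated dialogue data to compensate for scarce real data and to instill therapeutic reasoning.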
Abstract: Cognitive impairment is becoming a major public health challenge. Cognitive Stimulation Therapy (CST) is an effective intervention for cognitive impairment, but traditional methods are difficult to scale, and existing digital systems struggle with group dialogues and cognitive stimulation principles. While Large Language Models (LLMs) are powerful, their application in this context faces key challenges: cognitive stimulation dialogue paradigms, a lack of therapeutic reasoning, and static-only user modeling. To address these issues, we propose a principle-driven adaptive policy actualized through a Group Cognitive Stimulation Dialogue (GCSD) system. We first construct a dataset with over 500 hours of real-world CST conversations and 10,000+ simulated dialogues generated via our Principle-Guided Scenario Simulation strategy. Our GCSD system then integrates four core modules to overcome LLM limitations: (i) a multi-speaker context controller to resolve role confusion; (ii) dynamic participant cognitive state modeling for personalized interaction; (iii) a cognitive stimulation-focused attention loss to instill cognitive stimulation reasoning; and (iv) a multi-dimensional reward strategy to enhance response value. Experimental results demonstrate that GCSD significantly outperforms baseline models across various evaluation metrics. Future work will focus on long-term clinical validation to bridge the gap between computational performance and clinical efficacy.
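[7] Reason and Verify: A Framework for Faithful Retrieval-Augmented Generation cs.CL
Eeham Khan, Luis Rodriguez, Marc Queudot
TL;DR: This paper proposes a domain-specific retrieval-augmented generation framework that combines explicit reasoning with faithfulness verification. It makes the standard RAG pipeline more verifiable by adding neural query rewriting, BGE-based cross-encoder reranking, and a module that generates rationales grounded in evidence spans. The framework is validated on the BioASQ and PubMedQA benchmarks and introduces an eight-category verification taxonomy for fine-grained assessment of rationale faithfulness.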
Details
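Motivation: Standard RAG pipelines lack mechanisms for verifying intermediate reasoning and are prone to hallucination in high-stakes domains, so a framework with explicit reasoning and faithfulness verification is needed to improve factual accuracy.
Result: On the BioASQ-Y/N and PubMedQA benchmarks, the approach reaches 89.1% and 73.0% accuracy respectively with Llama-3-8B-Instruct, competitive with systems that use larger models. Explicit rationale generation, and dynamic demonstration selection combined with reranking, further improve performance in few-shot settings.
Insight: The novelty includes an explicit rationale generation module for better interpretability, an eight-category verification taxonomy for fine-grained faithfulness assessment, and the combination of neural query rewriting with reranking to improve retrieval quality. Structured error diagnosis improves system transparency and the analysis of retrieval failures.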
Abstract: Retrieval-Augmented Generation (RAG) significantly improves the factuality of Large Language Models (LLMs), yet standard pipelines often lack mechanisms to verify intermediate reasoning, leaving them vulnerable to hallucinations in high-stakes domains. To address this, we propose a domain-specific RAG framework that integrates explicit reasoning and faithfulness verification. Our architecture augments standard retrieval with neural query rewriting, BGE-based cross-encoder reranking, and a rationale generation module that grounds sub-claims in specific evidence spans. We further introduce an eight-category verification taxonomy that enables fine-grained assessment of rationale faithfulness, distinguishing between explicit and implicit support patterns to facilitate structured error diagnosis. We evaluate this framework on the BioASQ and PubMedQA benchmarks, specifically analyzing the impact of dynamic in-context learning and reranking under constrained token budgets. Experiments demonstrate that explicit rationale generation improves accuracy over vanilla RAG baselines, while dynamic demonstration selection combined with robust reranking yields further gains in few-shot settings. Using Llama-3-8B-Instruct, our approach achieves 89.1% on BioASQ-Y/N and 73.0% on PubMedQA, competitive with systems using significantly larger models. Additionally, we perform a pilot study combining human expert assessment with LLM-based verification to explore how explicit rationale generation improves system transparency and enables more detailed diagnosis of retrieval failures in biomedical question answering.
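[8] OpenClaw-RL: Train Any Agent Simply by Talking cs.CL
Yinjie Wang, Xuyang Chen, Xiaolong Jin, Mengdi Wang, Ling Yang
TL;DR: OpenClaw-RL is a general reinforcement learning framework for agents. Its core observation is that the next-state signals produced by every agent interaction (the next user reply, tool output, or terminal or GUI state change) are a universal source of learning. Through an asynchronous design, the framework extracts from these signals both evaluative signals (turned into scalar rewards by a PRM judge) and directive signals (richer supervision via Hindsight-Guided On-Policy Distillation), so that a single policy can learn simultaneously from conversations, terminals, GUIs, software engineering tasks, and tool calls.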
Details
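Motivation: Existing agentic RL systems do not use the universal next-state signal that follows each agent action as a live, online learning source. The paper addresses this with a unified framework that lets an agent keep improving simply by being used, that is, through interaction.
Result: The framework is applied to personal agents and general agents. For personal agents, it recovers conversational signals from user re-queries, corrections, and explicit feedback. For general agents, it supports scalable RL across terminal, GUI, software engineering, and tool-call settings, and demonstrates the utility of process rewards.
Insight: The main novelty is to treat the next-state signals from all forms of interaction (conversation, tool calls, GUI operations, and so on) as one universal, live learning source, and to use an asynchronous architecture to extract both scalar rewards and richer token-level directional advantage supervision based on textual hints (Hindsight-Guided OPD). The design removes coordination overhead between components and allows learning while serving.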
Abstract: Every agent interaction generates a next-state signal, namely the user reply, tool output, terminal or GUI state change that follows each action, yet no existing agentic RL system recovers it as a live, online learning source. We present OpenClaw-RL, a framework built on a simple observation: next-state signals are universal, and policy can learn from all of them simultaneously. Personal conversations, terminal executions, GUI interactions, SWE tasks, and tool-call traces are not separate training problems. They are all interactions that can be used to train the same policy in the same loop. Next-state signals encode two forms of information: evaluative signals, which indicate how well the action performed and are extracted as scalar rewards via a PRM judge; and directive signals, which indicate how the action should have been different and are recovered through Hindsight-Guided On-Policy Distillation (OPD). We extract textual hints from the next state, construct an enhanced teacher context, and provide token-level directional advantage supervision that is richer than any scalar reward. Due to the asynchronous design, the model serves live requests, the PRM judges ongoing interactions, and the trainer updates the policy at the same time, with zero coordination overhead between them. Applied to personal agents, OpenClaw-RL enables an agent to improve simply by being used, recovering conversational signals from user re-queries, corrections, and explicit feedback. Applied to general agents, the same infrastructure supports scalable RL across terminal, GUI, SWE, and tool-call settings, where we additionally demonstrate the utility of process rewards. Code: https://github.com/Gen-Verse/OpenClaw-RL
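[9] Adaptive Activation Cancellation for Hallucination Mitigation in Large Language Models cs.CL | cs.AI | cs.LG
Eric Yocam, Varghese Vaidyan, Gurcan Comert, Paris Kalathas, Yong Wang
TL;DR: This paper proposes Adaptive Activation Cancellation (AAC), a real-time inference-time framework for mitigating hallucination in large language models. It treats hallucination-associated neural activations as structured interference in the transformer residual stream, identifies hallucination nodes via layer-wise linear probing, and suppresses them with a confidence-weighted forward hook during autoregressive generation, with no external knowledge, fine-tuning, or additional inference passes.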
Details
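Motivation: To address the tendency of large language models to generate fluent but factually incorrect text (hallucination), with the goal of suppressing hallucinations precisely and in real time without harming fluency or general capability.
Result: Across OPT-125M, Phi-3-mini, and LLaMA 3-8B on the TruthfulQA and HaluEval benchmarks, AAC is the only real-time intervention that consistently improves downstream accuracy at all model scales. On LLaMA 3-8B it yields positive generation-level gains (MC1 +0.04; MC2 +0.003; Token-F1 +0.003), with probe-space selectivity 5.94x to 3.5x higher than the ITI baseline. All models show zero degradation in WikiText-103 perplexity and MMLU reasoning accuracy.
Insight: The novelty is to apply the idea of adaptive noise cancellation from signal processing, by analogy, to internal transformer activations: identifying and surgically suppressing the specific neurons associated with hallucination (H-Nodes) improves factual accuracy in a targeted way without affecting other capabilities. The method is a training-free, lightweight inference-time intervention.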
Abstract: Large Language Models frequently generate fluent but factually incorrect text. We propose Adaptive Activation Cancellation (AAC), a real-time inference-time framework that treats hallucination-associated neural activations as structured interference within the transformer residual stream, drawing an explicit analogy to classical adaptive noise cancellation from signal processing. The framework identifies Hallucination Nodes (H-Nodes) via layer-wise linear probing and suppresses them using a confidence-weighted forward hook during auto-regressive generation – requiring no external knowledge, no fine-tuning, and no additional inference passes. Evaluated across OPT-125M, Phi-3-mini, and LLaMA 3-8B on TruthfulQA and HaluEval, the real-time hook is the only intervention that consistently improves downstream accuracy on all three scales. Critically, the method is strictly surgical: WikiText-103 perplexity and MMLU reasoning accuracy are preserved at exactly 0.0% degradation across all three model scales, a property that distinguishes AAC from interventions that trade fluency or general capability for factual improvement. On the LLaMA 3-8B scale, the hook additionally yields positive generation-level gains (MC1 +0.04; MC2 +0.003; Token-F1 +0.003) while achieving probe-space selectivity 5.94x - 3.5x higher than the ITI baseline – demonstrating that targeted neuron-level suppression can simultaneously improve factual accuracy and preserve model capability.
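[10] S-GRADES – Studying Generalization of Student Response Assessments in Diverse Evaluative Settings cs.CL
Tasfia Seuti, Sagnik Ray Choudhury
TL;DR: This paper introduces S-GRADES, a web-based benchmark that consolidates 14 student response grading datasets in order to unify the evaluation of Automated Essay Scoring (AES) and Automatic Short Answer Grading (ASAG). By evaluating three state-of-the-art large language models with several prompting strategies, and by studying exemplar selection and cross-dataset transfer, the paper shows how the benchmark exposes reliability and generalization gaps across grading tasks.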
Details
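Motivation: To address the isolated development of the AES and ASAG paradigms in educational NLP, caused by fragmented datasets, inconsistent metrics, and separate communities, and to promote standardized evaluation across paradigms.
Result: Three state-of-the-art LLMs were evaluated on S-GRADES with multiple reasoning strategies in prompting, and the effects of exemplar selection and cross-dataset transfer were analyzed, revealing reliability and generalization gaps between essay and short-answer grading tasks.
Insight: The novelty is an open-source, extensible, unified benchmark that brings different grading paradigms together and supports cross-task comparison through standardized evaluation protocols. Viewed objectively, it advances the standardization of evaluation and the study of generalization in educational NLP.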
Abstract: Evaluating student responses, from long essays to short factual answers, is a key challenge in educational NLP. Automated Essay Scoring (AES) focuses on holistic writing qualities such as coherence and argumentation, while Automatic Short Answer Grading (ASAG) emphasizes factual correctness and conceptual understanding. Despite their shared goal, these paradigms have progressed in isolation with fragmented datasets, inconsistent metrics, and separate communities. We introduce S-GRADES (Studying Generalization of Student Response Assessments in Diverse Evaluative Settings), a web-based benchmark that consolidates 14 diverse grading datasets under a unified interface with standardized access and reproducible evaluation protocols. The benchmark is fully open-source and designed for extensibility, enabling continuous integration of new datasets and evaluation settings. To demonstrate the utility of S-GRADES, we evaluate three state-of-the-art large language models across the benchmark using multiple reasoning strategies in prompting. We further examine the effects of exemplar selection and cross-dataset exemplar transfer. Our analyses illustrate how benchmark-driven evaluation reveals reliability and generalization gaps across essay and short-answer grading tasks, highlighting the importance of standardized, cross-paradigm assessment.
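[11] Is this Idea Novel? An Automated Benchmark for Judgment of Research Ideas cs.CL | cs.AI
Tim Schopf, Michael Färber
TL;DR: This paper presents RINoBench, the first comprehensive benchmark for large-scale evaluation of research idea novelty judgments. It contains 1,381 research ideas annotated by experts and nine automated evaluation metrics, and is used to evaluate several state-of-the-art large language models (LLMs) on novelty judgment. Although LLM-generated reasoning closely resembles human rationales, the accuracy of the models’ novelty judgments still falls well short of the human gold standard.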
Details
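Motivation: Given the exponential growth of scientific literature, manually assessing the novelty of research ideas is laborious, subjective, and hard to scale, while existing automated approaches lack a standardized, comparable benchmark, which hinders large-scale evaluation and progress in the area.
Result: Several state-of-the-art LLMs were evaluated on RINoBench. Even models with strong reasoning ability produce novelty judgments that diverge significantly from the human expert standard and do not reach a reliable level.
Insight: The novelty is the first standardized, large-scale benchmark for judging the novelty of research ideas, together with a systematic evaluation of what LLMs can and cannot do on this task. Viewed objectively, the work provides important infrastructure for automated research evaluation and exposes a key problem: in complex semantic judgment tasks, the reasoning of current LLMs is decoupled from their final judgments.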
Abstract: Judging the novelty of research ideas is crucial for advancing science, enabling the identification of unexplored directions, and ensuring contributions meaningfully extend existing knowledge rather than reiterate minor variations. However, given the exponential growth of scientific literature, manually judging the novelty of research ideas through literature reviews is labor-intensive, subjective, and infeasible at scale. Therefore, recent efforts have proposed automated approaches for research idea novelty judgment. Yet, evaluation of these approaches remains largely inconsistent and is typically based on non-standardized human evaluations, hindering large-scale, comparable evaluations. To address this, we introduce RINoBench, the first comprehensive benchmark for large-scale evaluation of research idea novelty judgments. It comprises 1,381 research ideas derived from and judged by human experts as well as nine automated evaluation metrics designed to assess both rubric-based novelty scores and textual justifications of novelty judgments. Using this benchmark, we evaluate several state-of-the-art large language models (LLMs) on their ability to judge the novelty of research ideas. Our findings reveal that while LLM-generated reasoning closely mirrors human rationales, this alignment does not reliably translate into accurate novelty judgments, which diverge significantly from human gold standard judgments - even among leading reasoning-capable models. Data and code available at: https://github.com/TimSchopf/RINoBench.
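[12] Large language models can disambiguate opioid slang on social media cs.CL
Kristy A. Carpenter, Issah A. Samori, Mathew V. Kiang, Keith Humphreys, Anna Lembke
TL;DR: This paper studies how well large language models (LLMs) can disambiguate opioid-related slang in social media text. It defines three tasks (lexicon-based, lexicon-free, and emergent slang) and evaluates four state-of-the-art LLMs: GPT-4, GPT-5, Gemini 2.5 Pro, and Claude Sonnet 4.5. The LLMs clearly outperform traditional lexicon-based methods on all tasks and identify opioid-related posts more accurately, providing a more reliable data source for monitoring the opioid overdose crisis.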
Details
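Motivation: Social media text is a promising tool for monitoring trends in the opioid overdose crisis, but the overwhelming majority of posts are irrelevant. The traditional approach filters content with a lexicon of opioid-related terms, yet many slang terms (such as “smack” or “blues”) have common non-drug meanings, which causes ambiguity and misclassification. The advanced textual reasoning of LLMs offers a way to resolve these ambiguities accurately and at scale.
Result: In the lexicon-based task, LLM F1 scores (“fenty” subtask: 0.824-0.972; “smack” subtask: 0.540-0.862) far exceeded those of the best lexicon strategy (0.126 and 0.009, respectively). In the lexicon-free task, LLM F1 scores (0.544-0.769) also surpassed those of lexicons (0.080-0.540), with uniformly higher recall. In the emergent slang task, all LLMs had higher accuracy (average 0.784), F1 score (average 0.712), precision (average 0.981), and recall (average 0.587) than the two lexicons assessed. All four LLMs performed well on all tasks.
Insight: The paper’s stated contribution is a systematic evaluation of LLMs in three settings (disambiguating opioid slang, lexicon-free identification, and handling emergent slang), showing that LLMs can identify relevant content for low-prevalence topics. Viewed objectively, the novelty lies in applying the contextual understanding of LLMs to public health surveillance: it offers an efficient, scalable way around the inherent weaknesses of lexicon-based methods in semantic ambiguity and vocabulary coverage, and can be extended to similar domains.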
Abstract: Social media text shows promise for monitoring trends in the opioid overdose crisis; however, the overwhelming majority of social media text is unrelated to opioids. When leveraging social media text to monitor trends in the ongoing opioid overdose crisis, a common strategy for identifying relevant content is to use a lexicon of opioid-related terms as inclusion criteria. However, many slang terms for opioids, such as “smack” or “blues,” have common non-opioid meanings, making them ambiguous. The advanced textual reasoning capability of large language models (LLMs) presents an opportunity to disambiguate these slang terms at scale. We present three tasks on which to evaluate four state-of-the-art LLMs (GPT-4, GPT-5, Gemini 2.5 Pro, and Claude Sonnet 4.5): a lexicon-based setting, in which the LLM must disambiguate a specific term within the context of a given post; a lexicon-free setting, in which the LLM must identify opioid-related posts from context without a lexicon; and an emergent slang setting, in which the LLM must identify opioid-related posts with simulated new slang terms. All four LLMs showed excellent performance across all tasks. In both subtasks of the lexicon-based setting, LLM F1 scores (“fenty” subtask: 0.824-0.972; “smack” subtask: 0.540-0.862) far exceeded those of the best lexicon strategy (0.126 and 0.009, respectively). In the lexicon-free task, LLM F1 scores (0.544-0.769) surpassed those of lexicons (0.080-0.540), and LLMs demonstrated uniformly higher recall. On emergent slang, all LLMs had higher accuracy (average: 0.784), F1 score (average: 0.712), precision (average: 0.981), and recall (average: 0.587) than the two lexicons assessed. Our results show that LLMs can be used to identify relevant content for low-prevalence topics, including but not limited to opioid references, enhancing data provided to downstream analyses and predictive models.
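To make the relationship between the reported metrics concrete, the sketch below computes precision, recall, and F1 from confusion counts. The counts are hypothetical, chosen only to resemble the emergent slang pattern of very high precision with moderate recall; they are not data from the paper.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from true-positive, false-positive,
    and false-negative counts. Undefined ratios are reported as 0.0."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 59 opioid posts found, 1 false alarm, 41 missed.
precision, recall, f1 = precision_recall_f1(tp=59, fp=1, fn=41)
```

Because F1 is a harmonic mean, it is pulled toward the lower of the two values, so a classifier that rarely raises false alarms but misses many relevant posts ends up with a middling F1. Note also that an average of per-model F1 scores generally differs from the F1 computed from averaged precision and recall.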
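[13] Human-AI Co-reasoning for Clinical Diagnosis with Evidence-Integrated Language Agent cs.CL
Zhongzhen Huang, Yan Ling, Hong Chen, Ye Feng, Li Wu
TL;DR: The paper presents PULSE, a medical reasoning agent that combines a domain-tuned large language model with scientific literature retrieval to support diagnostic decision-making in complex real-world cases. Evaluated on a benchmark of 82 authentic endocrinology case reports, PULSE reaches diagnostic accuracy comparable to senior specialists and remains stable on rare diseases. The study also examines the potential and the risks of human-AI collaboration.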
Details
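Motivation: To determine how AI can help physicians reason more accurately and consistently in complex clinical diagnosis, for both common and rare diseases, and to explore effective workflows for human-AI collaboration.
Result: At the Top@1 and Top@4 thresholds, PULSE’s diagnostic accuracy exceeds that of residents and junior specialists and matches that of senior specialists. Unlike physicians, whose accuracy declines with disease rarity, PULSE stays stable across incidence tiers. In collaboration experiments, it helps physicians correct errors and broaden their diagnostic hypotheses.
Insight: The novelty is the combination of a domain-tuned LLM with scientific literature retrieval in a medical agent that reasons adaptively (output length grows with case difficulty), together with a systematic evaluation of human-AI collaboration on a benchmark of real cases, which yields a framework and risk insights for clinical AI applications.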
Abstract: We present PULSE, a medical reasoning agent that combines a domain-tuned large language model with scientific literature retrieval to support diagnostic decision-making in complex real-world cases. To evaluate its capabilities, we curated a benchmark of 82 authentic endocrinology case reports encompassing a broad spectrum of disease types and incidence levels. In controlled experiments, we compared PULSE’s performance against physicians with varying levels of expertise (from residents to senior specialists) and examined how AI assistance influenced human diagnostic reasoning. PULSE attained expert-competitive accuracy, outperforming residents and junior specialists while matching senior specialist performance at both Top@1 and Top@4 thresholds. Unlike physicians, whose accuracy declined with disease rarity, PULSE maintained stable performance across incidence tiers. The agent also exhibited adaptive reasoning, increasing output length with case difficulty in a manner analogous to the longer deliberation observed among expert clinicians. When used collaboratively, PULSE enabled physicians to correct initial errors and broaden diagnostic hypotheses, but also introduced risks of automation bias. The study explores both serial and concurrent collaboration workflows, revealing that PULSE offers robust support across common and rare presentations. These findings underscore both the promise and the limitations of language model-based agents in clinical diagnosis, and offer a framework for evaluating their role in real-world decision-making.
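The Top@1 and Top@4 thresholds mentioned above count a case as correct when the reference diagnosis appears among the first k ranked candidates. The following is a minimal sketch under the assumption of exact string matching; the study may well match diagnoses by clinician judgment instead, and the case data here are invented for illustration.

```python
def top_k_accuracy(ranked_predictions, gold_labels, k):
    """Fraction of cases whose gold label appears in the first k entries
    of the corresponding ranked prediction list."""
    if len(ranked_predictions) != len(gold_labels):
        raise ValueError("one ranked list is required per gold label")
    hits = sum(
        1
        for ranked, gold in zip(ranked_predictions, gold_labels)
        if gold in ranked[:k]
    )
    return hits / len(gold_labels)


# Hypothetical ranked differentials for three cases.
ranked = [
    ["Graves disease", "toxic adenoma", "thyroiditis", "struma ovarii"],
    ["type 2 diabetes", "MODY", "LADA", "type 1 diabetes"],
    ["Cushing syndrome", "PCOS", "obesity", "hypothyroidism"],
]
gold = ["Graves disease", "LADA", "acromegaly"]

top1 = top_k_accuracy(ranked, gold, k=1)
top4 = top_k_accuracy(ranked, gold, k=4)
```

Top@k can only stay the same or rise as k grows, which is why both thresholds are usually reported together.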
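[14] Automatic End-to-End Data Integration using Large Language Models cs.CL
Aaron Steiner, Christian Bizer
TL;DR: This paper presents an automatic end-to-end data integration pipeline based on GPT-5.2 that aims to generate all required artifacts without human intervention: schema mappings, value mappings, training data for entity matching, and validation data for conflict resolution. Across three case studies (video game, music, and company data), the LLM pipeline matches human-designed pipelines in the size and density of the integrated datasets at a much lower cost.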
Details
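Motivation: Designing traditional data integration pipelines requires data engineers to configure components and label training data by hand, which is tedious and expensive. LLMs have shown promise on individual integration steps, but their potential to replace all human input across an end-to-end pipeline has not been fully explored.
Result: In three case studies (video game, music, and company data), the LLM pipeline produces results similar to, and for some tasks better than, those of human-designed pipelines, and the integrated datasets are comparable in size and density. Configuring the pipeline with the LLM costs about $10 per case study, far less than human labor.
Insight: The core novelty is a first systematic exploration and validation of an LLM (GPT-5.2) generating every configuration artifact of an end-to-end data integration pipeline without human intervention, automating the full process from schema alignment to data fusion and pointing to a new way of reducing data integration cost and improving efficiency.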
Abstract: Designing data integration pipelines typically requires substantial manual effort from data engineers to configure pipeline components and label training data. While LLMs have shown promise in handling individual steps of the integration process, their potential to replace all human input across end-to-end data integration pipelines has not been investigated. As a step toward exploring this potential, we present an automatic data integration pipeline that uses GPT-5.2 to generate all artifacts required to adapt the pipeline to specific use cases. These artifacts are schema mappings, value mappings for data normalization, training data for entity matching, and validation data for selecting conflict resolution heuristics in data fusion. We compare the performance of this LLM-based pipeline to the performance of human-designed pipelines across three case studies requiring the integration of video game, music, and company-related data. Our experiments show that the LLM-based pipeline produces results similar to, and for some tasks better than, those of the human-designed pipelines. End-to-end, the human and the LLM pipelines produce integrated datasets of comparable size and density. Having the LLM configure the pipelines costs approximately $10 per case study, which represents only a small fraction of the cost of having human data engineers perform the same tasks.
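[15] End-to-End Chatbot Evaluation with Adaptive Reasoning and Uncertainty Filtering cs.CL
Nhi Dang, Tung Le, Huy Tien Nguyen
TL;DR: This paper proposes an end-to-end automatic evaluation framework for chatbots that generates Q&A pairs from the knowledge base, uses large language models to judge response quality, and applies confidence-based filtering to reduce the manual review workload.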
Details
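Motivation: Domain-specific chatbots built on retrieval-augmented generation are prone to unsupported or incorrect answers. Manual evaluation is costly, and existing evaluation frameworks depend on curated test sets and static metrics, which limits scalability.
Result: Experiments on a Vietnamese news dataset show that the evaluator agrees closely with human judgments while significantly reducing review overhead.
Insight: The novelty is a modular, language-agnostic, end-to-end automatic evaluation pipeline that achieves scalable evaluation through adaptive reasoning and uncertainty filtering and reduces reliance on manual intervention.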
Abstract: Large language models (LLMs) combined with retrieval augmented generation have enabled the deployment of domain-specific chatbots, but these systems remain prone to generating unsupported or incorrect answers. Reliable evaluation is therefore critical, yet manual review is costly and existing frameworks often depend on curated test sets and static metrics, limiting scalability. We propose an end-to-end automatic evaluator designed to substantially reduce human effort. Our system generates Q&A pairs directly from the underlying knowledge base, uses LLMs to judge chatbot responses against reference answers, and applies confidence-based filtering to highlight uncertain cases. Applied to a Vietnamese news dataset, the evaluator achieves high agreement with human judgments while significantly lowering review overhead. The framework is modular and language-agnostic, making it readily adaptable to diverse domains. This work introduces a practical, scalable solution for evaluating chatbots with minimal reliance on manual intervention.
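The confidence-based filtering step can be pictured as a simple routing rule: judgments the LLM judge is confident about are accepted automatically, and the rest are flagged for human review. The abstract does not say how confidence is computed or which threshold is used, so the `Judgment` record, the `confidence` field, and the 0.8 threshold below are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Judgment:
    question: str
    verdict: str       # e.g. "correct" or "incorrect", as assigned by the LLM judge
    confidence: float  # hypothetical score in [0, 1]; its source is an assumption


def split_by_confidence(judgments, threshold=0.8):
    """Split judgments into (auto_accepted, needs_review) lists.

    Judgments with confidence at or above `threshold` are accepted
    automatically; all others are routed to a human reviewer.
    """
    auto_accepted = [j for j in judgments if j.confidence >= threshold]
    needs_review = [j for j in judgments if j.confidence < threshold]
    return auto_accepted, needs_review


# Hypothetical judge outputs.
sample = [
    Judgment("Who won the match?", "correct", 0.95),
    Judgment("When was the law passed?", "incorrect", 0.55),
    Judgment("Where is the factory?", "correct", 0.80),
]
accepted, flagged = split_by_confidence(sample)
review_fraction = len(flagged) / len(sample)
```

Raising the threshold sends more cases to reviewers and lowers the risk of accepting a wrong verdict; lowering it does the opposite, so the threshold is the knob that trades review effort against reliability.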
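[16] Making Bielik LLM Reason (Better): A Field Report cs.CL
Adam Trybus, Bartosz Bartnicki, Remigiusz Kinas
TL;DR: This paper describes a research program for evaluating and improving the reasoning capabilities of Bielik, a Polish large language model. It covers initial benchmarking and the creation of an evaluation methodology, an analysis of comparative results against other LLMs, and, in light of the limitations of the analyses so far, an outlook for keeping Bielik competitive in a changing and competitive AI landscape.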
Details
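Motivation: To evaluate and improve the reasoning capabilities of the Polish large language model Bielik in response to the fast-changing, highly competitive AI field.
Result: The paper describes initial benchmarking and a comparative analysis against other LLMs, but the abstract does not state specific quantitative results or whether state-of-the-art performance was reached.
Insight: The novelty is a systematic reasoning evaluation methodology for an LLM in a specific language (Polish), together with a plan for continued development in a competitive environment, which highlights tailored evaluation and optimization strategies for non-English models.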
Abstract: This paper presents a research program dedicated to evaluating and advancing the reasoning capabilities of Bielik, a Polish large language model. The study describes several stages of work: initial benchmarking and the creation of an evaluation methodology, an analysis of comparative results with other LLMs, and an outline of future prospects that takes into account the limitations of the analyses conducted so far and aims to keep Bielik in the race given the ever-changing, and competitive, AI landscape.
[17] HeartAgent: An Autonomous Agent System for Explainable Differential Diagnosis in Cardiology cs.CLPDF
Shuang Zhou, Kai Yu, Song Wang, Wenya Xie, Zaifu Zhan
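TL;DR: This paper presents HeartAgent, an autonomous agent system for explainable differential diagnosis in cardiology. The system integrates customized tools and curated data resources and orchestrates multiple specialized sub-agents to perform complex reasoning while generating transparent reasoning trajectories and verifiable supporting references. Evaluations on the MIMIC dataset and a private electronic health records cohort show that it significantly outperforms existing methods, and that clinicians assisted by it outperform unaided experts in diagnostic accuracy and explanation quality.
Details
Motivation: To address the limitations of existing AI-based diagnostic methods (insufficient cardiology knowledge, inadequate support for complex reasoning, and poor interpretability) and enable reliable, explainable differential diagnosis.
Result: On the MIMIC dataset and a private electronic health records cohort, HeartAgent improved top-3 diagnostic accuracy by more than 36% and 20%, respectively, over established comparison methods. Clinicians assisted by HeartAgent gained 26.9% in diagnostic accuracy and 22.7% in explanation quality compared with unaided experts.
Insight: The contribution is a cardiology-specific multi-agent system that orchestrates specialized sub-agents for complex reasoning and produces transparent, verifiable reasoning trajectories, yielding reliable, explainable, and clinically actionable decision support. Viewed objectively, its tight coupling of agent architecture with domain knowledge and its emphasis on verifiable references is an effective route to more trustworthy and practical AI-assisted diagnosis.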
Abstract: Heart diseases remain a leading cause of morbidity and mortality worldwide, necessitating accurate and trustworthy differential diagnosis. However, existing artificial intelligence-based diagnostic methods are often limited by insufficient cardiology knowledge, inadequate support for complex reasoning, and poor interpretability. Here we present HeartAgent, a cardiology-specific agent system designed to support a reliable and explainable differential diagnosis. HeartAgent integrates customized tools and curated data resources and orchestrates multiple specialized sub-agents to perform complex reasoning while generating transparent reasoning trajectories and verifiable supporting references. Evaluated on the MIMIC dataset and a private electronic health records cohort, HeartAgent achieved over 36% and 20% improvements over established comparative methods, in top-3 diagnostic accuracy, respectively. Additionally, clinicians assisted by HeartAgent demonstrated gains of 26.9% in diagnostic accuracy and 22.7% in explanatory quality compared with unaided experts. These results demonstrate that HeartAgent provides reliable, explainable, and clinically actionable decision support for cardiovascular care.
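For readers unfamiliar with the top-3 metric reported above, here is a small sketch of how top-k diagnostic accuracy is commonly computed. It is an illustration only; the function name and the toy diagnosis labels are assumptions, not taken from the paper.

```python
def top_k_accuracy(ranked_predictions, gold_labels, k=3):
    """Fraction of cases whose gold diagnosis appears in the first k predictions.

    ranked_predictions: one list of candidate diagnoses per case, best first.
    gold_labels: the reference diagnosis for each case.
    """
    if len(ranked_predictions) != len(gold_labels):
        raise ValueError("need exactly one gold label per prediction list")
    if not gold_labels:
        return 0.0
    hits = sum(
        1
        for preds, gold in zip(ranked_predictions, gold_labels)
        if gold in preds[:k]
    )
    return hits / len(gold_labels)
```

A case counts as a hit when the reference diagnosis appears anywhere in the first `k` ranked candidates, so top-3 accuracy is never lower than top-1 accuracy on the same predictions.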
[18] mAceReason-Math: A Dataset of High-Quality Multilingual Math Problems Ready For RLVR cs.CLPDF
Konstantin Dobler, Simon Lehnerer, Federico Scozzafava, Jonathan Janke, Mohamed Ali
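TL;DR: This paper introduces mAceReason-Math, a high-quality multilingual math problem dataset intended to support research on Reinforcement Learning with Verifiable Rewards (RLVR). Built on AceReason-Math, a corpus curated specifically for RLVR, it is carefully translated and cleaned to cover 14 languages with more than 10,000 samples per language, addressing the problem that current RLVR training data is English-centric and existing multilingual data is too easy.
Details
Motivation: Current RLVR research and training datasets are mostly English, and existing multilingual math datasets are too easy to provide a useful training signal for current models. A high-quality, challenging multilingual dataset is needed to advance multilingual RLVR research.
Result: The paper builds mAceReason-Math, covering 14 languages with more than 10,000 high-quality translated, challenging math problems per language, as a resource for multilingual RLVR research and benchmarking.
Insight: The contribution is a high-quality multilingual math dataset built specifically for RLVR. Translation and cleaning are used to preserve difficulty and diversity, filling the gap left by existing multilingual data in difficulty and suitability and supporting progress in multilingual RLVR.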
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has been successfully applied to significantly boost the capabilities of pretrained large language models, especially in the math and logic problem domains. However, current research and available training datasets remain English-centric. While multilingual training data and benchmarks have been created in the past, they were not created with RLVR and current model capability in mind, and their level of difficulty is often too low to provide appropriate training signals for current models. To address this gap, we provide mAceReason-Math, a dataset of high-quality translations of challenging math problems sourced from a corpus specifically curated for RLVR (AceReason-Math). We further take specific care to clean and improve our translations, resulting in a coverage of 14 languages with more than 10,000 samples per language. We release the dataset to facilitate multilingual RLVR research and benchmarking in the research community.
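To make the idea of a "verifiable reward" concrete, the sketch below shows one simple way such a reward could be computed for math answers. It is an assumption-laden illustration: the function names are hypothetical, and real RLVR pipelines usually add answer extraction and more careful normalization, which the abstract does not describe.

```python
from fractions import Fraction


def _parse_number(text):
    """Parse an integer, decimal, or 'a/b' fraction; return None if not numeric."""
    try:
        return Fraction(text.strip())
    except (ValueError, ZeroDivisionError):
        return None


def verifiable_reward(model_answer, reference_answer):
    """Return 1.0 if the model's final answer matches the reference, else 0.0.

    Numeric answers are compared by value (so "0.5" matches "1/2");
    anything else falls back to an exact string match after trimming.
    """
    predicted = _parse_number(model_answer)
    reference = _parse_number(reference_answer)
    if predicted is not None and reference is not None:
        return 1.0 if predicted == reference else 0.0
    return 1.0 if model_answer.strip() == reference_answer.strip() else 0.0
```

Because the check depends only on the final answer and not on the language of the problem statement, the same reward function could in principle be reused across all translated versions of a problem.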
[19] Multilingual Reasoning Gym: Multilingual Scaling of Procedural Reasoning Environments cs.CLPDF
Konstantin Dobler, Simon Lehnerer, Federico Scozzafava, Jonathan Janke, Mohamed Ali
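TL;DR: This paper presents the Multilingual Reasoning Gym, an extension of Reasoning Gym that procedurally generates verifiable reasoning problems in 14 languages. The authors translate templates for 94 tasks, with native-speaker validation in 10 languages, and adapt code or templates to ensure linguistic naturalness. The environment keeps the core benefits of the original Reasoning Gym, such as virtually unlimited problem instances and adjustable difficulty, and supports reinforcement learning from verifiable rewards as well as evaluation settings. Because the environments are procedural, problems are parallel across languages, enabling large-scale cross-lingually parallel data generation.
Details
Motivation: To extend a monolingual reasoning environment to multilingual settings in support of research on multilingual reasoning models, addressing the shortcomings of existing benchmarks in language coverage and linguistic naturalness.
Result: The paper builds a multilingual reasoning environment covering 14 languages and 94 tasks, with native-speaker validation in 10 of the languages, providing a directly usable benchmark and tool for multilingual reasoning research.
Insight: The contribution is procedural generation combined with translated and adapted templates, which yields large-scale, high-quality, cross-lingually parallel problem generation and a controllable, scalable environment for training and evaluating multilingual reasoning models.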
Abstract: We present the Multilingual Reasoning Gym, an extension of Reasoning Gym (Stojanovski et al., 2025), that procedurally generates verifiable reasoning problems across 14 languages. We translate templates for 94 tasks with native-speaker validation in 10 languages and targeted code or template adaptations to ensure linguistic naturalness. The Multilingual Reasoning Gym preserves the core benefits of the procedural generation approach used in the original Reasoning Gym, such as virtually unlimited problem instance generation and adjustable difficulty, and remains directly usable for Reinforcement Learning from Verifiable Rewards and evaluation settings. Problems in the Multilingual Reasoning Gym are parallel across languages, enabling crosslingually parallel data generation at massive scale due to the procedural nature of the environments. We release our implementation to support research into multilingual reasoning models.
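The way procedural generation yields parallel problems across languages can be illustrated with a toy sketch. This is not the Reasoning Gym API: the `TEMPLATES` table, the function names, and the single addition task are hypothetical and chosen only to show the mechanism of a shared seed with per-language templates.

```python
import random

# Hypothetical per-language templates for one toy task (not from the paper).
TEMPLATES = {
    "en": "What is {a} plus {b}?",
    "de": "Was ist {a} plus {b}?",
    "es": "¿Cuánto es {a} más {b}?",
}


def generate_problem(seed, lang, max_operand=100):
    """Generate one addition problem; the same seed gives the same numbers in every language."""
    rng = random.Random(seed)  # local RNG so generation is reproducible
    a = rng.randint(1, max_operand)
    b = rng.randint(1, max_operand)
    return {"question": TEMPLATES[lang].format(a=a, b=b), "answer": a + b}


def score_answer(problem, response):
    """Verifiable reward: 1.0 for the exact integer answer, 0.0 otherwise."""
    try:
        return 1.0 if int(response.strip()) == problem["answer"] else 0.0
    except ValueError:
        return 0.0
```

Raising `max_operand` is a simple stand-in for adjustable difficulty, and iterating over seeds gives an effectively unlimited supply of instances that stay aligned across languages.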
[20] From Images to Words: Efficient Cross-Modal Knowledge Distillation to Language Models from Black-box Teachers cs.CLPDF
Ayan Sengupta, Shantanu Dixit, Md Shad Akhtar, Tanmoy Chakraborty
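TL;DR: This paper proposes ARMADA, an efficient cross-modal knowledge distillation framework for transferring knowledge from large vision-language models, including black-box models, to language-only models. It does not depend on the multimodal teacher's internal structure or on expensive pre-training; it distills knowledge through novel alignment techniques and is validated on a range of natural language understanding, generative reasoning, and instruction-tuning tasks.
Details
Motivation: Traditional knowledge distillation assumes the teacher and student share a modality, while existing cross-modal distillation methods require modality-specific pre-training of the teacher, which is computationally expensive. This paper targets efficient distillation from black-box vision-language models into language-only models.
Result: On 12 natural language understanding, 8 complex generative reasoning, and 5 instruction-tuning tasks, ARMADA improves large models such as DeBERTa-v2-1.4B, OPT-1.3B, and LLaMA-{3B, 7B, 8B}, with gains of up to 3.4% on language understanding and 2.6% on generative reasoning, without expensive multimodal pre-training or fine-tuning of the teacher.
Insight: The contribution is an alignment technique that does not rely on the teacher's internal structure or pre-training, enabling efficient distillation from black-box vision-language models into language-only models. It challenges the conventional distillation paradigm by showing that vision-language models, even without direct textual understanding, can significantly enhance language models.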
Abstract: Knowledge distillation (KD) methods are pivotal in compressing large pre-trained language models into smaller models, ensuring computational efficiency without significantly dropping performance. Traditional KD techniques assume homogeneity in modalities between the teacher (source) and the student (target) models. On the other hand, existing multimodal knowledge distillation methods require modality-specific pre-training of the teacher model, which is computationally infeasible in most cases. In this paper, we introduce ARMADA, an efficient cross-modal knowledge distillation framework designed to transfer knowledge from large vision-language models, including black-box models, to language-only models. Unlike existing KD techniques that rely on the internal structures of multimodal teachers or require computationally expensive pre-training, ARMADA leverages novel alignment techniques to distil knowledge without altering the teacher model, ensuring efficiency and scalability. We empirically validate ARMADA on twelve natural language understanding, eight complex generative reasoning and five instruction-tuning tasks, demonstrating consistent performance improvements in large models such as DeBERTa-v2-1.4B, OPT-1.3B, LLaMA-{3B, 7B, 8B}. ARMADA achieves up to 3.4% improvement on language understanding tasks and 2.6% boost in generative reasoning, all without requiring expensive multimodal pre-training or fine-tuning of the teacher model. Our findings challenge conventional knowledge distillation paradigms by demonstrating that even vision-language models, despite lacking direct textual understanding, can significantly enhance language models when distilled appropriately.
[21] GLM-OCR Technical Report cs.CLPDF
Shuaiqi Duan, Yadong Xue, Weihan Wang, Zhe Su, Huan Liu
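TL;DR: GLM-OCR is an efficient, compact 0.9B-parameter multimodal model designed for real-world document understanding. It combines a 0.4B-parameter CogViT visual encoder with a 0.5B-parameter GLM language decoder, striking a good balance between computational efficiency and recognition performance. The model uses a two-stage pipeline, layout analysis followed by parallel region-level recognition, and is competitive on public benchmarks and in industrial scenarios.
Details
Motivation: To address the inefficiency of standard autoregressive decoding on deterministic OCR tasks, and to design a compact model that balances efficiency and performance for resource-constrained edge deployment and large-scale production systems.
Result: Across extensive evaluations on public benchmarks and industrial scenarios, the model reaches competitive or state-of-the-art performance in document parsing, text and formula transcription, table structure recovery, and key information extraction.
Insight: The model introduces a Multi-Token Prediction mechanism that predicts multiple tokens per step, significantly raising decoding throughput while keeping memory overhead low through shared parameters. At the system level, the two-stage pipeline (layout analysis, then parallel region recognition) improves overall efficiency.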
Abstract: GLM-OCR is an efficient 0.9B-parameter compact multimodal model designed for real-world document understanding. It combines a 0.4B-parameter CogViT visual encoder with a 0.5B-parameter GLM language decoder, achieving a strong balance between computational efficiency and recognition performance. To address the inefficiency of standard autoregressive decoding in deterministic OCR tasks, GLM-OCR introduces a Multi-Token Prediction (MTP) mechanism that predicts multiple tokens per step, significantly improving decoding throughput while keeping memory overhead low through shared parameters. At the system level, a two-stage pipeline is adopted: PP-DocLayout-V3 first performs layout analysis, followed by parallel region-level recognition. Extensive evaluations on public benchmarks and industrial scenarios show that GLM-OCR achieves competitive or state-of-the-art performance in document parsing, text and formula transcription, table structure recovery, and key information extraction. Its compact architecture and structured generation make it suitable for both resource-constrained edge deployment and large-scale production systems.
[22] LLM2Vec-Gen: Generative Embeddings from Large Language Models cs.CLPDF
Parishad BehnamGhader, Vaibhav Adlakha, Fabian David Schmidt, Nicolas Chapados, Marius Mosbach
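TL;DR: This paper proposes LLM2Vec-Gen, a new self-supervised text embedding method. Rather than encoding the input text directly, it adds trainable special tokens to the LLM's vocabulary and optimizes them to represent the LLM's potential response to the input, producing a fixed-length embedding. Training is guided by the LLM's own completions and by distillation targets from an unsupervised embedding teacher; it needs no labeled data and keeps the LLM backbone frozen.
Details
Motivation: LLM-based text embedders usually encode only the semantics of the input, whereas embedding tasks require mapping diverse inputs to similar outputs. Existing methods typically rely on contrastive learning with paired data; this paper explores a different, self-supervised paradigm that bridges the input-output gap.
Result: The method achieves state-of-the-art self-supervised performance on the Massive Text Embedding Benchmark (MTEB), improving by 9.3% over the best unsupervised embedding teacher. It also shows up to a 43.2% reduction in harmful content retrieval and a 29.3% improvement in reasoning capabilities on embedding tasks.
Insight: The core idea is to shift embedding learning from "encoding the input" to "representing the model's potential response" by optimizing special tokens. This transfers the LLM's own capabilities (such as safety alignment and reasoning) to embedding tasks while keeping the embeddings interpretable, since they can be decoded into text. It is a parameter-efficient and data-efficient training strategy.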
Abstract: LLM-based text embedders typically encode the semantic content of their input. However, embedding tasks require mapping diverse inputs to similar outputs. Typically, this input-output gap is addressed by training embedding models with paired data using contrastive learning. In this work, we propose a novel self-supervised approach, LLM2Vec-Gen, which adopts a different paradigm: rather than encoding the input, we learn to represent the model’s potential response. Specifically, we add trainable special tokens to the LLM’s vocabulary, append them to the input, and optimize them to represent the LLM’s response in a fixed-length sequence. Training is guided by the LLM’s own completion for the query, along with an unsupervised embedding teacher that provides distillation targets. This formulation helps to bridge the input-output gap and transfers LLM capabilities such as safety alignment and reasoning to embedding tasks. Crucially, the LLM backbone remains frozen and training requires only unlabeled queries. LLM2Vec-Gen achieves state-of-the-art self-supervised performance on the Massive Text Embedding Benchmark (MTEB), improving by 9.3% over the best unsupervised embedding teacher. We also observe up to 43.2% reduction in harmful content retrieval and 29.3% improvement in reasoning capabilities for embedding tasks. Finally, the learned embeddings are interpretable and can be decoded into text to reveal their semantic content.
cs.CV [Back]
[23] 4DEquine: Disentangling Motion and Appearance for 4D Equine Reconstruction from Monocular Video cs.CVPDF
Jin Lyu, Liang An, Pujin Cheng, Yebin Liu, Xiaoying Tang
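TL;DR: This paper proposes 4DEquine, a new framework for reconstructing 4D models of equines from monocular video. It decouples 4D reconstruction into two sub-problems, dynamic motion reconstruction and static appearance reconstruction, handled by a spatio-temporal transformer and a feed-forward network based on 3D Gaussians, respectively. Although trained only on synthetic datasets, it reaches state-of-the-art performance on real-world datasets.
Details
Motivation: Existing 4D animal reconstruction methods jointly optimize motion and appearance over the whole video, which is time-consuming and sensitive to incomplete observations.
Result: The method reaches state-of-the-art performance on the real-world APT36K and AiM datasets, demonstrating its advantage in both geometry and appearance reconstruction.
Insight: The contributions are a framework design that decouples 4D reconstruction into independent motion and appearance sub-problems; a feed-forward network that reconstructs a high-fidelity, animatable 3D Gaussian avatar from as little as a single image; and two synthetic datasets for training, the large-scale motion dataset VarenPoser and the appearance dataset VarenTex, which generalize well from synthetic to real-world data.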
Abstract: 4D reconstruction of the equine family (e.g., horses) from monocular video is important for animal welfare. Previous mainstream 4D animal reconstruction methods require joint optimization of motion and appearance over a whole video, which is time-consuming and sensitive to incomplete observation. In this work, we propose a novel framework called 4DEquine by disentangling the 4D reconstruction problem into two sub-problems: dynamic motion reconstruction and static appearance reconstruction. For motion, we introduce a simple yet effective spatio-temporal transformer with a post-optimization stage to regress smooth and pixel-aligned pose and shape sequences from video. For appearance, we design a novel feed-forward network that reconstructs a high-fidelity, animatable 3D Gaussian avatar from as few as a single image. To assist training, we create a large-scale synthetic motion dataset, VarenPoser, which features high-quality surface motions and diverse camera trajectories, as well as a synthetic appearance dataset, VarenTex, comprising realistic multi-view images generated through multi-view diffusion. While training only on synthetic datasets, 4DEquine achieves state-of-the-art performance on the real-world APT36K and AiM datasets, demonstrating the superiority of 4DEquine and our new datasets for both geometry and appearance reconstruction. Comprehensive ablation studies validate the effectiveness of both the motion and appearance reconstruction networks. Project page: https://luoxue-star.github.io/4DEquine_Project_Page/.
[24] Video-Based Reward Modeling for Computer-Use Agents cs.CV | cs.CLPDF
Linxin Song, Jieyu Zhang, Huanxin Sheng, Taiwei Shi, Gupta Rahul
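TL;DR: This paper proposes a video-based reward modeling approach for judging whether computer-use agents (CUAs) successfully carry out user instructions. It builds the ExeVR-53k dataset, generates negative samples via adversarial instruction translation, and designs spatiotemporal token pruning to handle long videos. The resulting ExeVRM model predicts task success from only the user instruction and the execution video, and outperforms proprietary models such as GPT-5.2 across multiple operating systems.
Details
Motivation: Evaluation of computer-use agents is hard to scale. Traditional approaches depend on the agent's internal reasoning or actions, whereas modeling the execution video offers a model-agnostic way to evaluate, though it faces challenges from redundant layouts and subtle, localized cues.
Result: ExeVRM 8B achieves 84.7% accuracy and 87.7% recall on video-execution assessment, outperforming proprietary models such as GPT-5.2 and Gemini-3 Pro across Ubuntu, macOS, Windows, and Android, and provides more precise temporal attribution.
Insight: Contributions include a large-scale video-task-reward dataset, adversarial instruction translation to synthesize negative samples, and spatiotemporal token pruning for long, high-resolution videos. Together they enable scalable, model-agnostic agent evaluation and suggest new directions at the intersection of computer vision and reinforcement learning.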
Abstract: Computer-using agents (CUAs) are becoming increasingly capable; however, it remains difficult to scale evaluation of whether a trajectory truly fulfills a user instruction. In this work, we study reward modeling from execution video: a sequence of keyframes from an agent trajectory that is independent of the agent’s internal reasoning or actions. Although video-execution modeling is method-agnostic, it presents key challenges, including highly redundant layouts and subtle, localized cues that determine success. We introduce Execution Video Reward 53k (ExeVR-53k), a dataset of 53k high-quality video–task–reward triplets. We further propose adversarial instruction translation to synthesize negative samples with step-level annotations. To enable learning from long, high-resolution execution videos, we design spatiotemporal token pruning, which removes homogeneous regions and persistent tokens while preserving decisive UI changes. Building on these components, we fine-tune an Execution Video Reward Model (ExeVRM) that takes only a user instruction and a video-execution sequence to predict task success. Our ExeVRM 8B achieves 84.7% accuracy and 87.7% recall on video-execution assessment, outperforming strong proprietary models such as GPT-5.2 and Gemini-3 Pro across Ubuntu, macOS, Windows, and Android, while providing more precise temporal attribution. These results show that video-execution reward modeling can serve as a scalable, model-agnostic evaluator for CUAs.
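The spatiotemporal token pruning idea (dropping homogeneous regions and tokens that persist unchanged across frames) can be sketched in a few lines. This is a toy illustration under stated assumptions, not the paper's algorithm: patches are represented as plain lists of pixel intensities, and both thresholds and the function name are hypothetical.

```python
from statistics import pvariance


def prune_tokens(frames, var_threshold=1e-3, change_threshold=1e-3):
    """Return (frame_index, patch_index) pairs for the patches that are kept.

    frames: list of frames; each frame is a list of patches; each patch is a
    list of pixel intensities. A patch is dropped if it is homogeneous
    (near-zero variance) or persistent (unchanged from the same position in
    the previous frame). Both thresholds are assumptions for this sketch.
    """
    kept = []
    previous = None
    for t, frame in enumerate(frames):
        for p, patch in enumerate(frame):
            homogeneous = pvariance(patch) < var_threshold
            persistent = previous is not None and (
                max(abs(x - y) for x, y in zip(patch, previous[p])) < change_threshold
            )
            if not homogeneous and not persistent:
                kept.append((t, p))
        previous = frame
    return kept
```

In a UI recording, blank backgrounds fall under the homogeneous rule and static toolbars under the persistent rule, so the tokens that survive tend to be the ones where something on screen actually changed.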
[25] Delta-K: Boosting Multi-Instance Generation via Cross-Attention Augmentation cs.CV | cs.AIPDF
Zitong Wang, Zijun Shen, Haohao Xu, Zhengjie Luo, Weibin Wu
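TL;DR: This paper proposes Delta-K, a backbone-agnostic, plug-and-play inference framework that targets the concept omission diffusion models often exhibit when generating complex scenes with multiple instances. Operating in the shared cross-attention key space, it extracts and injects a differential key ΔK that encodes the semantic signature of the missing concepts, strengthening concept generation during the early semantic planning stage of the diffusion process without additional training or architectural changes.
Details
Motivation: Diffusion models excel at text-to-image synthesis but often omit concepts when synthesizing complex multi-instance scenes. Existing training-free methods try to fix this by rescaling attention maps, which tends to amplify unstructured noise without establishing coherent semantic representations.
Result: Extensive experiments show the approach is general: Delta-K consistently improves compositional alignment on both modern DiT models and classical U-Net architectures, without spatial masks, additional training, or architectural modifications.
Insight: The core idea is to operate directly in the shared cross-attention key space, extracting and injecting a differential semantic signal ΔK that encodes the missing concepts. Viewed objectively, its dynamically optimized scheduling mechanism, which grounds noise into stable structures early in diffusion while preserving existing concepts, is a novel, backbone-agnostic intervention strategy.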
Abstract: While Diffusion Models excel in text-to-image synthesis, they often suffer from concept omission when synthesizing complex multi-instance scenes. Existing training-free methods attempt to resolve this by rescaling attention maps, which merely exacerbates unstructured noise without establishing coherent semantic representations. To address this, we propose Delta-K, a backbone-agnostic and plug-and-play inference framework that tackles omission by operating directly in the shared cross-attention Key space. Specifically, using a vision-language model, we extract a differential key $\Delta K$ that encodes the semantic signature of missing concepts. This signal is then injected during the early semantic planning stage of the diffusion process. Governed by a dynamically optimized scheduling mechanism, Delta-K grounds diffuse noise into stable structural anchors while preserving existing concepts. Extensive experiments demonstrate the generality of our approach: Delta-K consistently improves compositional alignment across both modern DiT models and classical U-Net architectures, without requiring spatial masks, additional training, or architectural modifications.
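A rough sketch of what injecting a differential key during the early denoising steps might look like is given below. Everything specific here is an assumption: the abstract does not state that the injection is additive, and the paper describes a dynamically optimized schedule, whereas this example substitutes a fixed linear decay purely for illustration. Function and parameter names are hypothetical.

```python
def injection_strength(step, early_steps, max_strength=1.0):
    """Linearly decaying strength that is non-zero only for the first `early_steps` steps.

    A fixed linear schedule stands in for the paper's dynamically optimized
    scheduling mechanism, which is not specified in the abstract.
    """
    if early_steps <= 0 or step >= early_steps:
        return 0.0
    return max_strength * (1.0 - step / early_steps)


def augment_keys(keys, delta_k, strength):
    """Return keys + strength * delta_k, element-wise (additive form is an assumption).

    keys and delta_k are lists of equal-length rows, one row per text token.
    """
    return [
        [k + strength * d for k, d in zip(key_row, delta_row)]
        for key_row, delta_row in zip(keys, delta_k)
    ]
```

With this schedule the keys are perturbed most strongly at the first step and left untouched once `step` reaches `early_steps`, mirroring the idea of intervening only during early semantic planning.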
[26] Robotic Ultrasound Makes CBCT Alive cs.CV | cs.AI | cs.ROPDF
Feng Li, Ziyuan Li, Zhongliang Jiang, Nassir Navab, Yuan Bi
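TL;DR: This paper proposes a robotic-ultrasound-based framework for dynamically updating intraoperative CBCT. It infers tissue deformation from ultrasound images in real time and updates static CBCT slices, addressing navigation errors caused by soft-tissue deformation from respiration, probe pressure, and surgical manipulation.
Details
Motivation: Intraoperative Cone Beam Computed Tomography (CBCT) provides reliable 3D anatomy for interventional planning, but because it is static it cannot continuously monitor soft-tissue deformation, which makes navigation inaccurate.
Result: Experiments show real-time end-to-end CBCT slice updating and physically plausible deformation estimation, enabling dynamic refinement of static CBCT guidance during robotic ultrasound-assisted interventions.
Insight: Contributions include: 1) a multimodal correspondence method that combines rigid registration based on linear correlation of linear combination (LC2) with the lightweight USCorUNet network; 2) optical flow-guided supervision that trains the network to learn deformation-aware correlation representations for real-time dense deformation field estimation; 3) spatial regularization and transfer of the deformation to the CBCT reference, giving deformation-consistent visualization without repeated radiation exposure.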
Abstract: Intraoperative Cone Beam Computed Tomography (CBCT) provides a reliable 3D anatomical context essential for interventional planning. However, its static nature fails to provide continuous monitoring of soft-tissue deformations induced by respiration, probe pressure, and surgical manipulation, leading to navigation discrepancies. We propose a deformation-aware CBCT updating framework that leverages robotic ultrasound as a dynamic proxy to infer tissue motion and update static CBCT slices in real time. Starting from calibration-initialized alignment with linear correlation of linear combination (LC2)-based rigid refinement, our method establishes accurate multimodal correspondence. To capture intraoperative dynamics, we introduce the ultrasound correlation UNet (USCorUNet), a lightweight network trained with optical flow-guided supervision to learn deformation-aware correlation representations, enabling accurate, real-time dense deformation field estimation from ultrasound streams. The inferred deformation is spatially regularized and transferred to the CBCT reference to produce deformation-consistent visualizations without repeated radiation exposure. We validate the proposed approach through deformation estimation and ultrasound-guided CBCT updating experiments. Results demonstrate real-time end-to-end CBCT slice updating and physically plausible deformation estimation, enabling dynamic refinement of static CBCT guidance during robotic ultrasound-assisted interventions. The source code is publicly available at https://github.com/anonymous-codebase/us-cbct-demo.
[27] A Robust Deep Learning Framework for Bangla License Plate Recognition Using YOLO and Vision-Language OCR cs.CVPDF
Nayeb Hasin, Md. Arafath Rahman Nishat, Mainul Islam, Khandakar Shakib Al Hasan, Asif Newaz
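TL;DR: This paper proposes a robust deep learning framework for Bangla license plate recognition that combines a YOLO-based object detection model for plate localization with vision-language OCR for text extraction. It compares several object detection architectures and proposes a novel two-stage adaptive training strategy built on YOLOv8 to improve localization. For text recognition it uses a VisionEncoderDecoder architecture and evaluates several encoder-decoder combinations; the ViT + BanglaBERT model gives the best character-level results.
Details
Motivation: Automatic recognition of Bangla license plates is hard because of complex characters and irregular layouts, and existing systems leave room for improvement. The paper aims to provide a robust and reliable solution for intelligent traffic management systems.
Result: For plate localization, the proposed method outperforms established models, with 97.83% accuracy and 91.3% Intersection over Union. For text recognition, ViT + BanglaBERT achieves a Character Error Rate of 0.1323 and a Word Error Rate of 0.1068, and it performs consistently on an external test set with different environment and lighting conditions, indicating robustness.
Insight: The main contributions are: 1) a two-stage adaptive training strategy built on YOLOv8 to improve plate localization; 2) framing plate text recognition as a sequence generation problem with a VisionEncoderDecoder architecture and exploring the combination of ViT and BanglaBERT. The framework offers a reusable end-to-end approach for plates with complex characters and layouts.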
Abstract: An Automatic License Plate Recognition (ALPR) system constitutes a crucial element in an intelligent traffic management system. However, the detection of Bangla license plates remains challenging because of the complicated character scheme and uneven layouts. This paper presents a robust Bangla License Plate Recognition system that integrates a deep learning-based object detection model for license plate localization with Optical Character Recognition for text extraction. Multiple object detection architectures, including U-Net and several YOLO (You Only Look Once) variants, are compared for license plate localization. This study proposes a novel two-stage adaptive training strategy built upon the YOLOv8 architecture to improve localization performance. The proposed approach outperforms the established models, achieving an accuracy of 97.83% and an Intersection over Union (IoU) of 91.3%. The text recognition problem is framed as a sequence generation problem with a VisionEncoderDecoder architecture, and several encoder-decoder combinations are evaluated. The ViT + BanglaBERT model gives the best results at the character level, with a Character Error Rate of 0.1323 and a Word Error Rate of 0.1068. The proposed system also shows consistent performance when tested on an external dataset curated for this study. The dataset offers completely different environmental and lighting conditions compared to the training samples, indicating the robustness of the proposed framework. Overall, our proposed system provides a robust and reliable solution for Bangla license plate recognition and performs effectively across diverse real-world scenarios, including variations in lighting, noise, and plate styles. These strengths make it well suited for deployment in intelligent transportation applications such as automated law enforcement and access control.
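The two metrics quoted in this abstract, Intersection over Union for localization and Character Error Rate for recognition, are standard and can be sketched as follows. The function names are illustrative, and the character-level comparison here works on Python code points, which is a simplifying assumption; Bangla text with combining marks may call for grapheme-aware segmentation.

```python
def box_iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def edit_distance(ref, hyp):
    """Levenshtein distance (insertions, deletions, substitutions) between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def character_error_rate(reference, hypothesis):
    """CER = edit distance between the strings divided by the reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

Word Error Rate follows the same formula with `reference.split()` and `hypothesis.split()` passed to `edit_distance` in place of the raw strings.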
[28] From Imitation to Intuition: Intrinsic Reasoning for Open-Instance Video Classification cs.CVPDF
Ke Zhang, Xiangchen Zhao, Yunjie Tian, Jiayu Zheng, Vishal M. Patel
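TL;DR: This paper proposes DeepIntuit, a framework that tackles open-instance video classification through intrinsic reasoning, moving from imitation to intuition. It first initializes a vision-language model's reasoning ability via supervised alignment, then fine-tunes it with reinforcement learning using Group Relative Policy Optimization to improve reasoning coherence, and finally uses an intuitive calibration stage to turn reasoning traces into stable classification.
Details
Motivation: Real-world open-instance video classification involves vast and complex intra-class variation that goes beyond existing benchmarks. Traditional video encoders struggle to fit such diverse distributions, while vision-language models generalize better but do not fully exploit their reasoning capabilities.
Result: Extensive experiments show that, on open-instance video classification, DeepIntuit gains significantly from going beyond simple feature imitation toward intrinsic reasoning.
Insight: The contribution is an intrinsic reasoning framework that evolves from imitation to intuition. Through three stages (supervised alignment, reinforcement learning refinement, and intuitive calibration) it converts a vision-language model's reasoning ability into classification performance and avoids distribution mismatch.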
Abstract: Conventional video classification models, acting as effective imitators, excel in scenarios with homogeneous data distributions. However, real-world applications often present an open-instance challenge, where intra-class variations are vast and complex, beyond existing benchmarks. While traditional video encoder models struggle to fit these diverse distributions, vision-language models (VLMs) offer superior generalization but have not fully leveraged their reasoning capabilities (intuition) for such tasks. In this paper, we bridge this gap with an intrinsic reasoning framework that evolves open-instance video classification from imitation to intuition. Our approach, namely DeepIntuit, begins with a cold-start supervised alignment to initialize reasoning capability, followed by refinement using Group Relative Policy Optimization (GRPO) to enhance reasoning coherence through reinforcement learning. Crucially, to translate this reasoning into accurate classification, DeepIntuit then introduces an intuitive calibration stage. In this stage, a classifier is trained on the intrinsic reasoning traces generated by the refined VLM, ensuring stable knowledge transfer without distribution mismatch. Extensive experiments demonstrate that for open-instance video classification, DeepIntuit benefits significantly from transcending simple feature imitation and evolving toward intrinsic reasoning. Our project is available at https://bwgzk-keke.github.io/DeepIntuit/.
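The GRPO refinement stage mentioned above relies on scoring each sampled response relative to the other responses drawn for the same prompt. A minimal sketch of that group-relative advantage, as GRPO is commonly described, is shown below; the function name and the `eps` value are assumptions, and the abstract does not say how DeepIntuit defines its rewards.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of responses sampled for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), so responses
    are scored relative to their siblings rather than by a learned critic.
    """
    if not rewards:
        return []
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

When every response in a group receives the same reward, all advantages are zero, so that group contributes no policy-gradient signal.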
[29] Fuel Gauge: Estimating Chain-of-Thought Length Ahead of Time in Large Multimodal Models cs.CVPDF
Yuedong Yang, Xiwen Wei, Mustafa Munir, Radu Marculescu
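TL;DR: This paper proposes Fuel Gauge, a new method for predicting the length of a large multimodal model's chain of thought (CoT) before inference. It rests on a key observation: the CoT process follows a simple form that is independent of the specific generated samples, and its length is determined by a hidden parameter representing the reasoning "fuel". Fuel Gauge extracts this hidden signal to predict CoT length ahead of time and applies it to two downstream tasks, predictive KV cache allocation and CoT length modulation, to improve compute utilization and reasoning accuracy.
Details
Motivation: Current large multimodal models rely on a CoT process that is lengthy and unpredictable at runtime, which often leads to inefficient use of computational resources (due to memory fragmentation) and sub-optimal accuracy (due to under-thinking and over-thinking). The paper addresses this by predicting CoT length ahead of time to improve model serving.
Result: Extensive experiments on benchmarks spanning text-only, image-text, and video-text question answering demonstrate the effectiveness, generalizability, and practical value of Fuel Gauge. For example, on GPQA-Diamond, its CoT length prediction error is less than half that of the baseline, which translates into a 13.37x reduction in memory allocation frequency.
Insight: The core contribution is the finding that the CoT process follows a simple, sample-independent form, together with the first method that extracts the hidden "fuel" parameter to predict CoT length ahead of time. Viewed objectively, the method turns an unpredictable reasoning process into a predictable resource-planning problem, offering a new angle on memory management and accuracy in large-model serving.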
Abstract: Reasoning Large Multi-modality Models (LMMs) have become the de facto choice for many applications. However, these models rely on a Chain-of-Thought (CoT) process that is lengthy and unpredictable at runtime, often resulting in inefficient use of computational resources (due to memory fragmentation) and sub-optimal accuracy (due to under- and over-thinking). We observe empirically that the CoT process follows a very simple form, whose behavior is independent of the specific generated samples. This suggests that the CoT length can be estimated ahead of time based on a hidden parameter representing the amount of “fuel” available to support the reasoning process. Based on this insight, we propose Fuel Gauge, the first method that extracts this hidden signal and predicts CoT length ahead of time. We demonstrate the utility of Fuel Gauge on two downstream tasks: predictive KV cache allocation, which addresses memory fragmentation in LMM serving systems, and CoT length modulation, which mitigates under-thinking and over-thinking. Extensive experiments on LMMs across text-only, image-text, and video-text question answering benchmarks demonstrate the effectiveness, generalizability, and practical value of our Fuel Gauge. For example, on the GPQA-Diamond benchmark, our Fuel Gauge achieves less than half the CoT length prediction error compared to the baseline; this translates into a 13.37x reduction in the memory allocation frequency.
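Why an ahead-of-time length estimate reduces KV cache allocation frequency can be seen with a small counting sketch. This is a simplified model built on assumptions (fixed-size blocks, a single up-front reservation, hypothetical function names) and is not the serving system evaluated in the paper, so it says nothing about the reported 13.37x figure.

```python
import math


def on_demand_allocations(actual_len, block_size=16):
    """Allocation calls when the cache grows one fixed-size block at a time."""
    return math.ceil(actual_len / block_size)


def predictive_allocations(actual_len, predicted_len, block_size=16):
    """Allocation calls when `predicted_len` tokens are reserved up front.

    One call reserves the predicted length; if generation runs longer,
    the shortfall is covered block by block, as in the on-demand case.
    """
    if actual_len <= predicted_len:
        return 1
    return 1 + math.ceil((actual_len - predicted_len) / block_size)
```

In this toy model a 1000-token chain of thought needs 63 on-demand block allocations but a single call when the prediction covers it, and only a handful of extra calls when the prediction is moderately too short, which is why prediction accuracy matters.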
[30] Overcoming Visual Clutter in Vision Language Action Models via Concept-Gated Visual Distillation cs.CV | cs.AI | cs.RO | eess.SYPDF
Sangmim Song, Sarath Kodagoda, Marc Carmichael, Karthick Thiyagarajan
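TL;DR: This paper proposes Concept-Gated Visual Distillation (CGVD), a training-free, model-agnostic inference framework that addresses the "Precision-Reasoning Gap" that Vision-Language-Action (VLA) models suffer in cluttered environments because of background-induced feature dilution. It parses instructions, refines targets by combining cross-validation with spatial disambiguation, and uses Fourier-based inpainting to produce a clean observation that suppresses semantic distractors while preserving key spatial geometry.
Details
Motivation: VLA models generalize zero-shot but show a "Precision-Reasoning Gap" in visually cluttered environments, where high-frequency semantic noise corrupts the geometric grounding needed for precise manipulation.
Result: In evaluations on highly cluttered manipulation tasks, CGVD significantly outperforms state-of-the-art baselines, reaching a 77.5% success rate versus the baseline's 43.0% in environments with dense semantic distractors, and prevents performance collapse.
Insight: The contribution is a training-free, model-agnostic inference-time visual distillation framework. Its concept gating (parsing instructions into safe and distractor sets) and Fourier-based inpainting actively suppress semantic distractors and reinforce attribute adherence and geometry preservation, making inference-time distillation a key prerequisite for robust robotic manipulation.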
Abstract: Vision-Language-Action (VLA) models demonstrate impressive zero-shot generalization but frequently suffer from a “Precision-Reasoning Gap” in cluttered environments. This failure is driven by background-induced feature dilution, where high-frequency semantic noise corrupts the geometric grounding required for precise manipulation. To bridge this gap, we propose Concept-Gated Visual Distillation (CGVD), a training-free, model-agnostic inference framework that stabilizes VLA policies. CGVD operates by parsing instructions into safe and distractor sets, utilizing a two-layer target refinement process (combining cross-validation and spatial disambiguation) to explicitly penalize false positives and isolate genuine manipulation targets. We then process the scene via Fourier-based inpainting, generating a clean observation that actively suppresses semantic distractors while preserving critical spatial geometry and visual proprioception. Extensive evaluations in highly cluttered manipulation tasks demonstrate that CGVD prevents performance collapse. In environments with dense semantic distractors, our method significantly outperforms state-of-the-art baselines, achieving a 77.5% success rate compared to the baseline’s 43.0%. By enforcing strict attribute adherence, CGVD establishes inference-time visual distillation as a critical prerequisite for robust robotic manipulation in clutter.
[31] One Token, Two Fates: A Unified Framework via Vision Token Manipulation Against MLLMs Hallucination cs.CVPDF
Zhan Fa, Yue Duan, Jian Zhang, Lei Qi, Yinghuan Shi
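TL;DR: This paper proposes a unified framework that counters hallucination in multimodal large language models (MLLMs) by manipulating vision tokens. It has two modules: Synergistic Visual Calibration uses augmented images to supplement visual semantics, and Causal Representation Calibration uses pruned tokens to create latent-space negative samples that correct internal model biases, restoring the vision-language balance.
Details
Motivation: Existing training-free methods address MLLM hallucination either by enhancing visual signals or by suppressing text inertia, but each has trade-offs and naively combining them is ineffective, so a unified framework is needed.
Result: On LLaVA-1.5, the method improves POPE accuracy by an average of 2% absolute across multiple benchmarks, with only a 1.06x inference latency overhead.
Insight: The contribution is to use vision tokens in two complementary ways (augmentation and pruning) to calibrate the model jointly. Operating in latent space rather than distorting the raw image isolates hallucination tendencies more precisely and balances visual and language information.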
Abstract: Current training-free methods tackle MLLM hallucination with separate strategies: either enhancing visual signals or suppressing text inertia. However, these separate methods are insufficient due to critical trade-offs: simply enhancing vision often fails against strong language prior, while suppressing language can introduce extra image-irrelevant noise. Moreover, we find their naive combination is also ineffective, necessitating a unified framework. We propose such a framework by focusing on the core asset: the vision token. Our design leverages two key insights: (1) augmented images offer complementary visual semantics, and (2) removing vision tokens (information-gap) isolates hallucination tendencies more precisely than distorting images (modality-gap). Based on these, our framework uses vision tokens in two distinct ways, both operating on latent representations: our Synergistic Visual Calibration (SVC) module incorporates augmented tokens to strengthen visual representations, while our Causal Representation Calibration (CRC) module uses pruned tokens to create latent-space negative samples for correcting internal model biases. By harmonizing these two roles, our framework effectively restores the vision-language balance, significantly reducing object hallucinations, improving POPE accuracy by an average of 2% absolute on LLaVA-1.5 across multiple benchmarks with only a 1.06x inference latency overhead.
[32] Geometric Autoencoder for Diffusion Models cs.CVPDF
Hangyu Liu, Jianyong Wang, Yutao Sun
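TL;DR: This paper proposes the Geometric Autoencoder (GAE), a principled framework for diffusion models that systematically addresses the difficulty existing latent diffusion models have in unifying semantic discriminability, reconstruction fidelity, and latent compactness. GAE builds an optimized low-dimensional semantic supervision target, replaces the KL-divergence constraint with latent normalization, and adds a dynamic noise sampling mechanism, significantly improving generation quality and training stability.
Details
Motivation: Latent designs in existing latent diffusion models are mostly heuristic and struggle to balance semantic discriminability, reconstruction fidelity, and latent compactness. This paper aims to provide a systematic solution.
Result: On the ImageNet-1K 256x256 benchmark, without classifier-free guidance, GAE reaches a gFID of 1.82 at only 80 epochs and 1.31 at 800 epochs, significantly surpassing existing state-of-the-art methods.
Insight: The contributions are: 1) an optimized semantic supervision target built from vision foundation models to guide the autoencoder; 2) latent normalization in place of the restrictive KL divergence of standard VAEs, giving a more stable latent manifold for diffusion learning; 3) a dynamic noise sampling mechanism that makes reconstruction more robust under high noise. Together they offer a new paradigm for latent diffusion modeling with a better balance between compression, semantic depth, and reconstruction stability.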
Abstract: Latent diffusion models have established a new state-of-the-art in high-resolution visual generation. Integrating Vision Foundation Model priors improves generative efficiency, yet existing latent designs remain largely heuristic. These approaches often struggle to unify semantic discriminability, reconstruction fidelity, and latent compactness. In this paper, we propose Geometric Autoencoder (GAE), a principled framework that systematically addresses these challenges. By analyzing various alignment paradigms, GAE constructs an optimized low-dimensional semantic supervision target from VFMs to provide guidance for the autoencoder. Furthermore, we leverage latent normalization that replaces the restrictive KL-divergence of standard VAEs, enabling a more stable latent manifold specifically optimized for diffusion learning. To ensure robust reconstruction under high-intensity noise, GAE incorporates a dynamic noise sampling mechanism. Empirically, GAE achieves compelling performance on the ImageNet-1K $256 \times 256$ benchmark, reaching a gFID of 1.82 at only 80 epochs and 1.31 at 800 epochs without Classifier-Free Guidance, significantly surpassing existing state-of-the-art methods. Beyond generative quality, GAE establishes a superior equilibrium between compression, semantic depth and robust reconstruction stability. These results validate our design considerations, offering a promising paradigm for latent diffusion modeling. Code and models are publicly available at https://github.com/freezing-index/Geometric-Autoencoder-for-Diffusion-Models.
[33] GeoSense: Internalizing Geometric Necessity Perception for Multimodal Reasoning cs.CVPDF
Ruiheng Liu, Haihong Hao, Mingfei Han, Xin Gu, Kecheng Zhang
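TL;DR: This paper proposes GeoSense, a framework that strengthens the spatial understanding of multimodal large language models (MLLMs) by giving the model the ability to perceive when geometric information is necessary, so that it autonomously engages geometric features in reasoning when 2D visual cues are insufficient.
Details
Motivation: Existing MLLMs have limited spatial understanding, and existing methods inject geometric signals indiscriminately, which adds computational overhead. The paper aims to avoid that overhead and let the model itself perceive when geometric information is needed.
Result: Experiments on multiple spatial reasoning benchmarks validate the approach, showing significant gains in spatial reasoning without degrading 2D visual reasoning.
Insight: The novelty lies in introducing an independent geometry input channel with alignment training, and in activating the model's internal cues through a carefully curated spatial-aware supervised fine-tuning dataset so that it can judge the necessity of geometric information on its own, yielding more efficient, robust, and self-aware multimodal intelligence.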
Abstract: Advancing towards artificial superintelligence requires rich and intelligent perceptual capabilities. A critical frontier in this pursuit is overcoming the limited spatial understanding of Multimodal Large Language Models (MLLMs), where geometry information is essential. Existing methods often address this by rigidly injecting geometric signals into every input, while ignoring their necessity and adding computation overhead. Contrary to this paradigm, our framework endows the model with an awareness of perceptual insufficiency, empowering it to autonomously engage geometric features in reasoning when 2D cues are deemed insufficient. To achieve this, we first introduce an independent geometry input channel to the model architecture and conduct alignment training, enabling the effective utilization of geometric features. Subsequently, to endow the model with perceptual awareness, we curate a dedicated spatial-aware supervised fine-tuning dataset. This serves to activate the model’s latent internal cues, empowering it to autonomously determine the necessity of geometric information. Experiments across multiple spatial reasoning benchmarks validate this approach, demonstrating significant spatial gains without compromising 2D visual reasoning capabilities, offering a path toward more robust, efficient and self-aware multi-modal intelligence.
[34] Motion Forcing: A Decoupled Framework for Robust Video Generation in Motion Dynamics cs.CVPDF
Tianshuo Xu, Zhifei Chen, Leyi Wu, Hao Lu, Ying-cong Chen
TL;DR: This paper proposes Motion Forcing, a decoupled framework that addresses the difficulty of balancing visual quality, physical consistency, and precise controllability when generating video of complex scenes. Its core idea is a hierarchical 'Point-Shape-Appearance' paradigm that decouples physical reasoning from visual synthesis, together with a 'Masked Point Recovery' strategy that strengthens the model's learning of physical laws.
Details
Motivation: Existing video generation models can balance visual quality, physical consistency, and controllability in simple scenes, but this balance breaks down easily in complex scenes involving collisions or dense traffic. The paper aims to build a robust framework that keeps the three in balance even for complex generation tasks.
Result: Extensive experiments on autonomous driving benchmarks show that Motion Forcing significantly outperforms state-of-the-art baselines and keeps the trilemma stable in complex scenes. Evaluations in physics and robotics domains further confirm the framework's generality.
Insight: The main contributions are: 1) a hierarchical, verifiable, decoupled generation paradigm ('Point-Shape-Appearance') that models complex dynamics as sparse geometric anchors, expands them into dynamic depth maps that explicitly resolve 3D geometry, and finally renders high-fidelity textures; and 2) a 'Masked Point Recovery' training strategy that forces the model beyond passive pattern matching to learn latent physical laws (such as inertia) in order to infer missing trajectories, giving it a more robust physical understanding.
Abstract: The ultimate goal of video generation is to satisfy a fundamental trilemma: achieving high visual quality, maintaining rigorous physical consistency, and enabling precise controllability. While recent models can maintain this balance in simple, isolated scenarios, we observe that this equilibrium is fragile and often breaks down as scene complexity increases (e.g., involving collisions or dense traffic). To address this, we introduce Motion Forcing, a framework designed to stabilize this trilemma even in complex generative tasks. Our key insight is to explicitly decouple physical reasoning from visual synthesis via a hierarchical "Point-Shape-Appearance" paradigm. This approach decomposes generation into verifiable stages: modeling complex dynamics as sparse geometric anchors (Point), expanding them into dynamic depth maps that explicitly resolve 3D geometry (Shape), and finally rendering high-fidelity textures (Appearance). Furthermore, to foster robust physical understanding, we employ a Masked Point Recovery strategy. By randomly masking input anchors during training and enforcing the reconstruction of complete dynamic depth, the model is compelled to move beyond passive pattern matching and learn latent physical laws (e.g., inertia) to infer missing trajectories. Extensive experiments on autonomous driving benchmarks show that Motion Forcing significantly outperforms state-of-the-art baselines, maintaining trilemma stability across complex scenes. Evaluations on physics and robotics further confirm our framework's generality.
[35] Frames2Residual: Spatiotemporal Decoupling for Self-Supervised Video Denoising cs.CVPDF
Mingjie Ji, Zhan Shi, Kailai Zhou, Zixuan Fu, Xun Cao
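TL;DR: This paper proposes Frames2Residual (F2R), a spatiotemporal decoupling framework for self-supervised video denoising. Training is split into two stages: in the first, a blind temporal estimator learns inter-frame consistency and produces a temporally consistent anchor; in the second, a non-blind spatial refiner uses that anchor to safely reintroduce the center frame and recover intra-frame high-frequency spatial residuals while preserving temporal stability.
Details
Motivation: Existing self-supervised video denoising methods usually extend image-based frameworks into the temporal dimension but struggle to integrate inter-frame temporal consistency with intra-frame spatial specificity. Existing video blind-spot networks (BSNs) must mask the center pixel to guarantee noise independence, which prevents them from using spatial evidence for texture recovery, severs spatiotemporal correlations, and causes texture loss.
Result: Extensive experiments show that F2R outperforms existing self-supervised methods on both sRGB and raw video benchmarks.
Insight: The core contribution is a spatiotemporally decoupled self-supervised training strategy that explicitly separates blind temporal consistency modeling from non-blind spatial texture recovery. This overcomes the restricted use of spatial evidence in conventional BSNs, balances the integration of spatial and temporal information, and improves denoising performance.
Abstract: Self-supervised video denoising methods typically extend image-based frameworks into the temporal dimension, yet they often struggle to integrate inter-frame temporal consistency with intra-frame spatial specificity. Existing Video Blind-Spot Networks (BSNs) enforce noise independence by masking the center pixel. This constraint prevents the use of spatial evidence for texture recovery, thereby severing spatiotemporal correlations and causing texture loss. To address this, we propose Frames2Residual (F2R), a spatiotemporal decoupling framework that explicitly divides self-supervised training into two distinct stages: blind temporal consistency modeling and non-blind spatial texture recovery. In Stage 1, a blind temporal estimator learns inter-frame consistency using a frame-wise blind strategy, producing a temporally consistent anchor. In Stage 2, a non-blind spatial refiner leverages this anchor to safely reintroduce the center frame and recover intra-frame high-frequency spatial residuals while preserving temporal stability. Extensive experiments demonstrate that our decoupling strategy allows F2R to outperform existing self-supervised methods on both sRGB and raw video benchmarks.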
[36] World2Act: Latent Action Post-Training via Skill-Compositional World Models cs.CVPDF
An Dinh Vuong, Tuan Van Vo, Abdullah Sohail, Haoran Ding, Liang Ma
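TL;DR: This paper proposes World2Act, a framework for post-training vision-language-action (VLA) policies with skill-compositional world models to improve their robustness and generalization under environmental changes. It aligns VLA actions directly with the world model's video-dynamics latents through a contrastive matching objective, reducing dependence on pixel space, and uses an LLM-based skill-decomposition pipeline to produce skill-compositional world models that stay consistent over long horizons.
Details
Motivation: Most existing world-model-based post-training methods rely on pixel-space supervision, which makes policies sensitive to pixel-level artifacts and hallucinations from imperfect world models. In addition, current world models are mostly trained on fixed-length clips and struggle to generate videos of arbitrary length, which limits their applicability to robotic tasks.
Result: Applied to VLA models such as GR00T-N1.6 and Cosmos Policy, World2Act achieves state-of-the-art (SOTA) results on the RoboCasa and LIBERO benchmarks and improves real-world task performance by 6.7%, enhancing the generalization of embodied agents.
Insight: The contributions are a contrastive matching objective that aligns actions directly with video-dynamics latents, avoiding the weaknesses of pixel-level supervision, and an automatic LLM-based skill-decomposition pipeline that yields datasets for temporally consistent, skill-compositional world models, which handles variable task durations effectively.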
Abstract: World Models (WMs) have emerged as a promising approach for post-training Vision-Language-Action (VLA) policies to improve robustness and generalization under environmental changes. However, most WM-based post-training methods rely on pixel-space supervision, making policies sensitive to pixel-level artifacts and hallucination from imperfect WM rollouts. We introduce World2Act, a post-training framework that aligns VLA actions directly with WM video-dynamics latents using a contrastive matching objective, reducing dependence on pixels. Post-training performance is tied to rollout quality, yet current WMs struggle with arbitrary-length video generation as they are mostly trained on fixed-length clips while robotic execution durations vary widely. To address this, we propose an automatic LLM-based skill-decomposition pipeline that segments high-level instructions into low-level prompts. Our pipeline produces RoboCasa-Skill and LIBERO-Skill, supporting skill-compositional WMs that remain temporally consistent across diverse task horizons. Empirically, applying World2Act to VLAs like GR00T-N1.6 and Cosmos Policy achieves state-of-the-art results on RoboCasa and LIBERO, and improves real-world performance by 6.7%, enhancing embodied agent generalization.
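The abstract names a contrastive matching objective between VLA actions and world-model latents but does not give its form. As a rough illustration only, the sketch below assumes an InfoNCE-style loss over a batch of paired embeddings; the function name `info_nce`, the temperature value, and the toy embeddings are hypothetical and not taken from the paper.

```python
import math


def info_nce(action_emb, latent_emb, temperature=0.1):
    """InfoNCE-style loss: row i of action_emb should match row i of latent_emb.

    Assumption: both inputs are lists of equal-length float vectors.
    This is an illustrative stand-in, not the paper's actual objective.
    """
    def normalize(v):
        norm = math.sqrt(sum(x * x for x in v)) or 1.0
        return [x / norm for x in v]

    actions = [normalize(v) for v in action_emb]
    latents = [normalize(v) for v in latent_emb]

    total = 0.0
    for i, a in enumerate(actions):
        # Cosine similarity of action i against every latent in the batch.
        logits = [sum(x * y for x, y in zip(a, z)) / temperature for z in latents]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        # Cross-entropy with the matching latent (index i) as the target.
        total += log_denom - logits[i]
    return total / len(actions)


aligned = [[1.0, 0.0], [0.0, 1.0]]
loss_matched = info_nce(aligned, aligned)
loss_shuffled = info_nce(aligned, aligned[::-1])
```

With this toy batch the matched pairing gives a much lower loss than the shuffled one, which is the behavior a matching objective of this kind is meant to reward.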
[37] Learning to Wander: Improving the Global Image Geolocation Ability of LMMs via Actionable Reasoning cs.CVPDF
Yushuo Zheng, Huiyu Duan, Zicheng Zhang, Xiaohong Liu, Xiongkuo Min
TL;DR: This paper proposes the WanderBench benchmark and the GeoAoT framework, which aim to improve the global image geolocation ability of large multimodal models (LMMs) through embodied interaction and actionable reasoning. WanderBench is an interactive benchmark of 32K panoramas from around the world that supports physical actions such as rotation and movement. GeoAoT couples reasoning with embodied actions and generates actionable plans that actively reduce uncertainty, achieving better fine-grained localization and generalization in dynamic environments.
Details
Motivation: The global image geolocation ability of advanced large multimodal models (LMMs) remains under-explored. The task requires rich world knowledge and complex reasoning, whereas traditional approaches mostly treat it as static recognition.
Result: In experiments covering 19 large multimodal models, GeoAoT achieves superior fine-grained localization accuracy on WanderBench and stronger generalization in dynamic environments, defining a new paradigm of actionable, reasoning-driven geolocation.
Insight: The main contribution is turning geolocation from static recognition into interactive exploration, coupling embodied actions (such as moving and rotating) with the reasoning process to actively reduce uncertainty. WanderBench and GeoAoT set a new paradigm for actionable, reasoning-driven visual understanding.
Abstract: Geolocation, the task of identifying the geographic location of an image, requires abundant world knowledge and complex reasoning abilities. Although advanced large multimodal models (LMMs) have shown strong capabilities in both respects, their performance on the geolocation task remains unexplored. To this end, we introduce WanderBench, the first open-access global geolocation benchmark designed for actionable geolocation reasoning in embodied scenarios. WanderBench contains over 32K panoramas across six continents, organized as navigable graphs that enable physical actions such as rotation and movement, transforming geolocation from static recognition into interactive exploration. Building on this foundation, we propose GeoAoT, a Geolocation framework with Action of Thought, which couples reasoning with embodied actions. Instead of generating textual reasoning chains, GeoAoT produces actionable plans, such as approaching landmarks or adjusting viewpoints, to actively reduce uncertainty. We further establish an evaluation protocol that jointly measures geolocation accuracy and difficulty-aware geolocation questioning ability. Experiments on 19 large multimodal models show that GeoAoT achieves superior fine-grained localization and stronger generalization in dynamic environments. WanderBench and GeoAoT define a new paradigm for actionable, reasoning-driven geolocation in embodied visual understanding.
[38] UniPINN: A Unified PINN Framework for Multi-task Learning of Diverse Navier-Stokes Equations cs.CV | cs.AIPDF
Dengdi Sun, Jie Chen, Xiao Wang, Jin Tang
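TL;DR: This paper proposes UniPINN, a unified PINN framework for multi-task learning of different Navier-Stokes equations. Through a shared-specialized architecture, a cross-flow attention mechanism, and a dynamic weight allocation strategy, it addresses the challenges existing PINN methods face in multi-flow settings and achieves unified learning and accurate prediction across multiple flows.
Details
Motivation: Existing PINN methods are mainly designed for a single flow. Extending them to multi-flow settings raises three key challenges: difficulty in capturing shared physical principles and flow-specific characteristics at the same time, susceptibility to negative transfer between tasks that degrades prediction accuracy, and unstable training caused by differences in loss magnitude across flow regimes.
Result: Extensive experiments on three canonical flows show that UniPINN effectively unifies multi-flow learning, achieves superior prediction accuracy and balanced performance across heterogeneous flow regimes, and successfully mitigates negative transfer.
Insight: The contributions are a shared-specialized architecture that disentangles universal physical laws from flow-specific features, a cross-flow attention mechanism that selectively reinforces relevant patterns while suppressing task-irrelevant interference, and a dynamic weight allocation strategy that adaptively balances loss contributions to stabilize multi-objective optimization. Working together, these components offer a new approach to unified modeling of multi-flow physics problems.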
Abstract: Physics-Informed Neural Networks (PINNs) have shown promise in solving incompressible Navier-Stokes equations, yet existing approaches are predominantly designed for single-flow settings. When extended to multi-flow scenarios, these methods face three key challenges: (1) difficulty in simultaneously capturing both shared physical principles and flow-specific characteristics, (2) susceptibility to inter-task negative transfer that degrades prediction accuracy, and (3) unstable training dynamics caused by disparate loss magnitudes across heterogeneous flow regimes. To address these limitations, we propose UniPINN, a unified multi-flow PINN framework that integrates three complementary components: a shared-specialized architecture that disentangles universal physical laws from flow-specific features, a cross-flow attention mechanism that selectively reinforces relevant patterns while suppressing task-irrelevant interference, and a dynamic weight allocation strategy that adaptively balances loss contributions to stabilize multi-objective optimization. Extensive experiments on three canonical flows demonstrate that UniPINN effectively unifies multi-flow learning, achieving superior prediction accuracy and balanced performance across heterogeneous regimes while successfully mitigating negative transfer. The source code of this paper will be released on https://github.com/Event-AHU/OpenFusion
[39] Fighting Hallucinations with Counterfactuals: Diffusion-Guided Perturbations for LVLM Hallucination Suppression cs.CVPDF
Hamidreza Dastmalchi, Aijun An, Ali Cheraghian, Hamed Barzamini
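TL;DR: This paper proposes CIPHER, a training-free method that builds a counterfactual image dataset, OHC-25K, to extract a low-rank subspace representation of vision-induced hallucination, and then suppresses hallucinations in large vision-language models (LVLMs) at inference time by projecting intermediate hidden states away from that subspace.
Details
Motivation: Large vision-language models (LVLMs) frequently produce hallucinations that are inconsistent with the visual input. The paper targets hallucinations triggered by the visual modality in particular, whereas existing training-free methods mainly address text-induced hallucinations.
Result: Experiments on multiple benchmarks show that CIPHER significantly reduces hallucination rates while preserving task performance, demonstrating that counterfactual visual perturbations are effective for improving LVLM faithfulness.
Insight: The novelty lies in explicitly targeting vision-induced hallucinations for the first time: a counterfactual image dataset built by diffusion-model editing is used to characterize the hallucination subspace systematically, and a low-rank projection then performs lightweight feature correction. It is a training-free, post-hoc suppression method.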
Abstract: While large vision-language models (LVLMs) achieve strong performance on multimodal tasks, they frequently generate hallucinations – unfaithful outputs misaligned with the visual input. To address this issue, we introduce CIPHER (Counterfactual Image Perturbations for Hallucination Extraction and Removal), a training-free method that suppresses vision-induced hallucinations via lightweight feature-level correction. Unlike prior training-free approaches that primarily focus on text-induced hallucinations, CIPHER explicitly targets hallucinations arising from the visual modality. CIPHER operates in two phases. In the offline phase, we construct OHC-25K (Object-Hallucinated Counterfactuals, 25,000 samples), a counterfactual dataset consisting of diffusion-edited images that intentionally contradict the original ground-truth captions. We pair these edited images with the unchanged ground-truth captions and process them through an LVLM to extract hallucination-related representations. Contrasting these representations with those from authentic (image, caption) pairs reveals structured, systematic shifts spanning a low-rank subspace characterizing vision-induced hallucination. In the inference phase, CIPHER suppresses hallucinations by projecting intermediate hidden states away from this subspace. Experiments across multiple benchmarks show that CIPHER significantly reduces hallucination rates while preserving task performance, demonstrating the effectiveness of counterfactual visual perturbations for improving LVLM faithfulness. Code and additional materials are available at https://hamidreza-dastmalchi.github.io/cipher-cvpr2026/.
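The two phases described above (estimating a low-rank hallucination subspace offline, then projecting hidden states away from it at inference) can be sketched in a few lines of NumPy. The abstract does not say how the subspace is estimated, so this sketch assumes an SVD of the differences between counterfactual and authentic representations; the function names, the `rank` and `strength` parameters, and the synthetic data are all hypothetical.

```python
import numpy as np


def hallucination_basis(h_counterfactual, h_authentic, rank):
    """Estimate an orthonormal basis for the shift between paired representations.

    Assumption: both arrays have shape (n_samples, hidden_dim) and row i of each
    comes from the same caption. SVD is an illustrative choice, not the paper's.
    """
    shifts = h_counterfactual - h_authentic
    # Rows of vt are orthonormal directions, ordered by explained shift energy.
    _, _, vt = np.linalg.svd(shifts, full_matrices=False)
    return vt[:rank]


def project_away(hidden, basis, strength=1.0):
    """Remove the component of each hidden state that lies in the subspace."""
    coeffs = hidden @ basis.T              # (n, rank) coordinates in the subspace
    return hidden - strength * (coeffs @ basis)


# Synthetic check: every shift lies along a single known direction.
rng = np.random.default_rng(0)
direction = np.array([1.0, 0.0, 0.0, 0.0])
h_auth = rng.normal(size=(8, 4))
h_cf = h_auth + rng.normal(size=(8, 1)) * direction

basis = hallucination_basis(h_cf, h_auth, rank=1)
hidden = rng.normal(size=(3, 4))
corrected = project_away(hidden, basis)
```

In this toy setup the recovered basis is the planted direction (up to sign), and the corrected states keep every other coordinate unchanged. A `strength` below 1.0 would only attenuate the component rather than remove it.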
[40] StructDamage: A Large Scale Unified Crack and Surface Defect Dataset for Robust Structural Damage Detection cs.CVPDF
Misbah Ijaz, Saif Ur Rehman Khan, Abd Ur Rehman, Sebastian Vollmer, Andreas Dengel
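TL;DR: This paper presents StructDamage, a large-scale unified dataset of structural cracks and surface defects containing approximately 78,093 images across nine surface types, including walls, tile, stone, road, pavement, deck, concrete, and brick. The dataset was built by systematically aggregating, harmonizing, and reannotating images from 32 public datasets, and is intended as a comprehensive resource for robust structural damage detection.
Details
Motivation: Existing crack datasets fall short in geographic diversity, surface types, scale, and annotation consistency, so algorithms trained on them generalize poorly to real-world conditions. A more comprehensive, unified dataset is needed to support automated structural damage detection.
Result: Baseline classification was evaluated with fifteen deep learning architectures from six model families. Twelve models exceed a macro F1-score of 0.96, and the best model, DenseNet201, reaches 98.62% accuracy.
Insight: The contribution is a large-scale, multi-surface, consistently annotated dataset built by systematically aggregating and reannotating multiple public datasets. It provides a standardized benchmark for structural damage detection and supports reproducible research and fair evaluation of robust algorithms.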
Abstract: Automated detection and classification of structural cracks and surface defects is a critical challenge in civil engineering, infrastructure maintenance, and heritage preservation. Recent advances in Computer Vision (CV) and Deep Learning (DL) have significantly improved automatic crack detection. However, these methods rely heavily on large, diverse, and carefully curated datasets that include various crack types across different surface materials. Many existing public crack datasets lack geographic diversity, surface types, scale, and labeling consistency, making it challenging for trained algorithms to generalize effectively in real world conditions. We provide a novel dataset, StructDamage, a curated collection of approximately 78,093 images spanning nine surface types: walls, tile, stone, road, pavement, deck, concrete, and brick. The dataset was constructed by systematically aggregating, harmonizing, and reannotating images from 32 publicly available datasets covering concrete structures, asphalt pavements, masonry walls, bridges, and historic buildings. All images are organized in a folder level classification hierarchy suitable for training Convolutional Neural Networks (CNNs) and Vision Transformers. To highlight the practical value of the dataset, we present baseline classification results using fifteen DL architectures from six model families, with twelve achieving macro F1-scores over 0.96. The best performing model DenseNet201 achieves 98.62% accuracy. The proposed dataset provides a comprehensive and versatile resource suitable for classification tasks. With thorough documentation and a standard structure, it is designed to promote reproducible research and support the development and fair evaluation of robust crack damage detection approaches.
[41] IMTBench: A Multi-Scenario Cross-Modal Collaborative Evaluation Benchmark for In-Image Machine Translation cs.CVPDF
Jiahao Lyu, Pei Fu, Zhenhang Li, Weichao Zeng, Shaojie Zhan
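TL;DR: This paper presents IMTBench, a multi-scenario, cross-modal collaborative evaluation benchmark for end-to-end in-image machine translation (IIMT). The benchmark contains 2,500 image translation samples covering four practical scenarios and nine languages, and supports multi-aspect evaluation including translation quality, background preservation, overall image quality, and a cross-modal alignment score.
Details
Motivation: Existing IIMT benchmarks are mostly synthetic and fail to reflect real-world complexity, and current evaluation protocols focus on single-modality metrics while overlooking cross-modal faithfulness between the rendered text and the model output.
Result: Benchmarking strong commercial cascade systems and both closed- and open-source unified multimodal models on IMTBench reveals large performance gaps across scenarios and languages, especially for natural scenes and resource-limited languages, indicating substantial headroom for end-to-end image text translation.
Insight: The contribution is a realistic, multi-scenario, multilingual IIMT evaluation benchmark, together with a cross-modal alignment score that quantifies the consistency between the translated text and the rendered image. It provides an important tool for standardized evaluation and for advancing the field.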
Abstract: End-to-end In-Image Machine Translation (IIMT) aims to convert text embedded within an image into a target language while preserving the original visual context, layout, and rendering style. However, existing IIMT benchmarks are largely synthetic and thus fail to reflect real-world complexity, while current evaluation protocols focus on single-modality metrics and overlook cross-modal faithfulness between rendered text and model outputs. To address these shortcomings, we present In-image Machine Translation Benchmark (IMTBench), a new benchmark of 2,500 image translation samples covering four practical scenarios and nine languages. IMTBench supports multi-aspect evaluation, including translation quality, background preservation, overall image quality, and a cross-modal alignment score that measures consistency between the translated text produced by the model and the text rendered in the translated image. We benchmark strong commercial cascade systems, and both closed- and open-source unified multi-modal models, and observe large performance gaps across scenarios and languages, especially on natural scenes and resource-limited languages, highlighting substantial headroom for end-to-end image text translation. We hope IMTBench establishes a standardized benchmark to accelerate progress in this emerging task.
[42] DSFlash: Comprehensive Panoptic Scene Graph Generation in Realtime cs.CVPDF
Julian Lorenz, Vladyslav Kovganko, Elias Kohout, Mrunmai Phatak, Daniel Kienzle
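TL;DR: DSFlash is a low-latency model for panoptic scene graph generation, designed for real-time processing on resource-constrained edge devices. It processes a video stream at 56 frames per second on a standard RTX 3090 GPU while matching the performance of existing state-of-the-art methods, and its low training cost makes it suitable for researchers and practitioners with limited compute.
Details
Motivation: Existing scene graph generation (SGG) research has paid limited attention to speed and resource efficiency, which makes deployment in practical applications such as edge devices difficult. A low-latency, efficient model is needed to overcome these limits.
Result: DSFlash reaches 56 FPS on a standard RTX 3090 GPU with performance comparable to existing state-of-the-art methods, and it is efficient to train: training completes in under 24 hours on a single GTX 1080 GPU.
Insight: The contributions include real-time panoptic scene graph generation and more comprehensive scene graphs (rather than only salient relationships) that carry richer contextual information, all while keeping latency and resource use low. This improves the model's accessibility and adaptability in resource-constrained environments.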
Abstract: Scene Graph Generation (SGG) aims to extract a detailed graph structure from an image, a representation that holds significant promise as a robust intermediate step for complex downstream tasks like reasoning for embodied agents. However, practical deployment in real-world applications - especially on resource constrained edge devices - requires speed and resource efficiency, challenges that have received limited attention in existing research. To bridge this gap, we introduce DSFlash, a low-latency model for panoptic scene graph generation designed to overcome these limitations. DSFlash can process a video stream at 56 frames per second on a standard RTX 3090 GPU, without compromising performance against existing state-of-the-art methods. Crucially, unlike prior approaches that often restrict themselves to salient relationships, DSFlash computes comprehensive scene graphs, offering richer contextual information while maintaining its superior latency. Furthermore, DSFlash is light on resources, requiring less than 24 hours to train on a single, nine-year-old GTX 1080 GPU. This accessibility makes DSFlash particularly well-suited for researchers and practitioners operating with limited computational resources, empowering them to adapt and fine-tune SGG models for specialized applications.
[43] Towards Cognitive Defect Analysis in Active Infrared Thermography with Vision-Text Cues cs.CV | cs.AI | eess.SPPDF
Mohammed Salah, Eman Ouda, Giuseppe Dell’Avvocato, Fabrizio Sarasini, Ester D’Accardi
TL;DR: This paper proposes a cognitive defect analysis framework based on vision-language models (VLMs) for detecting subsurface defects in carbon fiber-reinforced polymers (CFRP) with active infrared thermography (AIRT). The framework needs no training dataset: pretrained multimodal encoders and a lightweight adapter enable generative zero-shot understanding and localization of defects.
Details
Motivation: Conventional AI-based AIRT methods require time-consuming and costly CFRP inspection datasets to train neural networks. The paper aims to enable zero-shot defect analysis without training data.
Result: Validated on 25 CFRP inspection sequences, the AIRT-VLM adapter achieves signal-to-noise ratio gains exceeding 10 dB over conventional thermographic dimensionality-reduction methods, and zero-shot defect detection reaches an intersection-over-union of 70%.
Insight: The novelty lies in using pretrained vision-language models for zero-shot defect analysis, with the proposed AIRT-VLM adapter bridging the domain gap between thermographic data and natural images so that high detection performance is reached without additional training data.
Abstract: Active infrared thermography (AIRT) is currently witnessing a surge of artificial intelligence (AI) methodologies being deployed for automated subsurface defect analysis of high-performance carbon fiber-reinforced polymers (CFRP). Deploying AI-based AIRT methodologies for inspecting CFRPs requires the creation of time-consuming and expensive datasets of CFRP inspection sequences to train neural networks. To address this challenge, this work introduces a novel language-guided framework for cognitive defect analysis in CFRPs using AIRT and vision-language models (VLMs). Unlike conventional learning-based approaches, the proposed framework does not require developing training datasets for extensive training of defect detectors; instead, it relies solely on pretrained multimodal VLM encoders coupled with a lightweight adapter to enable generative zero-shot understanding and localization of subsurface defects. By leveraging pretrained multimodal encoders, the proposed system enables generative zero-shot understanding of thermographic patterns and automatic detection of subsurface defects. Given the domain gap between thermographic data and natural images used to train VLMs, an AIRT-VLM Adapter is proposed to enhance the visibility of defects while aligning the thermographic domain with the learned representations of VLMs. The proposed framework is validated using three representative VLMs: GroundingDINO, Qwen-VL-Chat, and CogVLM. Validation is performed on 25 CFRP inspection sequences with impacts introduced at different energy levels, reflecting realistic defects encountered in industrial scenarios. Experimental results demonstrate that the AIRT-VLM adapter achieves signal-to-noise ratio (SNR) gains exceeding 10 dB compared with conventional thermographic dimensionality-reduction methods, while enabling zero-shot defect detection with intersection-over-union values reaching 70%.
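For readers unfamiliar with the intersection-over-union metric quoted above, here is a minimal sketch for axis-aligned boxes. It assumes the detections are boxes in `(x1, y1, x2, y2)` form; the abstract does not state whether IoU was computed on boxes or masks, and the function name and example coordinates are hypothetical.

```python
def box_iou(a, b):
    """Intersection-over-union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    # Overlap extents are clamped at zero so disjoint boxes give no intersection.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


# A predicted box that covers 70 of the 100 units of the ground-truth box.
iou = box_iou((0, 0, 10, 10), (0, 0, 10, 7))
```

Here the prediction lies entirely inside the ground truth, so the IoU equals the covered fraction, 0.7.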
[44] P-GSVC: Layered Progressive 2D Gaussian Splatting for Scalable Image and Video cs.CV | cs.MMPDF
Longan Wang, Yuang Shi, Wei Tsang Ooi
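TL;DR: This paper proposes P-GSVC, a layered progressive 2D Gaussian splatting framework for image and video reconstruction. It organizes 2D Gaussian splats into a base layer and multiple enhancement layers for coarse-to-fine reconstruction, and uses a joint training strategy that optimizes the Gaussians of all layers simultaneously to support scalability in both quality and resolution.
Details
Motivation: Gaussian splatting has become a competitive explicit representation for image and video reconstruction, but existing methods lack a unified, scalable solution that handles both images and videos. The paper addresses this with a layered progressive representation that allows more flexible and efficient reconstruction.
Result: Experiments show that, compared with sequential layer-wise training, the proposed joint training strategy improves PSNR by up to 1.9 dB on video and up to 2.6 dB on images.
Insight: The contributions are the first unified layered progressive 2D Gaussian splatting framework (P-GSVC) and a joint training strategy that optimizes all layers simultaneously, ensuring inter-layer compatibility and stable progressive reconstruction. This offers a new approach to scalable Gaussian representations.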
Abstract: Gaussian splatting has emerged as a competitive explicit representation for image and video reconstruction. In this work, we present P-GSVC, the first layered progressive 2D Gaussian splatting framework that provides a unified solution for scalable Gaussian representation in both images and videos. P-GSVC organizes 2D Gaussian splats into a base layer and successive enhancement layers, enabling coarse-to-fine reconstructions. To effectively optimize this layered representation, we propose a joint training strategy that simultaneously updates Gaussians across layers, aligning their optimization trajectories to ensure inter-layer compatibility and a stable progressive reconstruction. P-GSVC supports scalability in terms of both quality and resolution. Our experiments show that the joint training strategy can gain up to 1.9 dB improvement in PSNR for video and 2.6 dB improvement in PSNR for image when compared to methods that perform sequential layer-wise training. Project page: https://longanwang-cs.github.io/PGSVC-webpage/
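The reported gains are in PSNR, which is a logarithmic function of mean squared error. The sketch below uses the standard definition, PSNR = 10 log10(MAX^2 / MSE), on flat lists of pixel values; the function name and sample pixels are hypothetical, and the 8-bit peak value of 255 is an assumption about the data range.

```python
import math


def psnr(reference, reconstruction, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences."""
    if len(reference) != len(reconstruction) or not reference:
        raise ValueError("inputs must be non-empty and the same length")
    mse = sum((r - x) ** 2 for r, x in zip(reference, reconstruction)) / len(reference)
    if mse == 0:
        return math.inf  # identical signals
    return 10.0 * math.log10(max_value ** 2 / mse)


# Every pixel is off by one grey level, so the MSE is exactly 1.
value = psnr([10, 20, 30], [11, 21, 31])

# A gain of g dB corresponds to dividing the MSE by 10 ** (g / 10).
mse_ratio_video = 10 ** (1.9 / 10)
mse_ratio_image = 10 ** (2.6 / 10)
```

Because the scale is logarithmic, a 1.9 dB gain corresponds to an MSE about 1.55 times smaller, and a 2.6 dB gain to an MSE about 1.82 times smaller.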
[45] UniStitch: Unifying Semantic and Geometric Features for Image Stitching cs.CVPDF
Yuan Mei, Lang Nie, Kang Liao, Yunqiu Xu, Chunyu Lin
TL;DR: This paper proposes UniStitch, the first framework to unify the geometric features of traditional image stitching with learning-based semantic features. A Neural Point Transformer module converts sparse geometric keypoints into dense semantic maps, and an Adaptive Mixture of Experts module fuses the two feature types dynamically, improving stitching performance in complex scenes.
Details
Motivation: Traditional image stitching relies on hand-crafted geometric features, while recent learning-based methods use semantic features extracted by neural networks. The two lines have long developed independently without effective fusion. The paper aims to bridge this gap with a unified image stitching framework built on multimodal features.
Result: Experiments show that UniStitch outperforms existing state-of-the-art (SOTA) methods on image stitching by a large margin, demonstrating the effectiveness of unified features.
Insight: The novelty lies in the first systematic fusion of geometric and semantic features, with NPT aligning the two feature spaces and AMoE providing dynamic, adaptive fusion. This gives traditional and deep learning methods a unified paradigm that may extend to other vision tasks requiring multimodal feature fusion.
Abstract: Traditional image stitching methods estimate warps from hand-crafted geometric features, whereas recent learning-based solutions leverage semantic features from neural networks instead. These two lines of research have largely evolved along separate paths, with virtually no meaningful convergence to date. In this paper, we take a pioneering step to bridge this gap by unifying semantic and geometric features with UniStitch, a unified image stitching framework built on multimodal features. To align discrete geometric features (i.e., keypoints) with continuous semantic feature maps, we present a Neural Point Transformer (NPT) module, which transforms unordered, sparse 1D geometric keypoints into ordered, dense 2D semantic maps. Then, to integrate the advantages of both representations, an Adaptive Mixture of Experts (AMoE) module is designed to fuse geometric and semantic representations. It dynamically shifts focus toward more reliable features during the fusion process, allowing the model to handle complex scenes, especially when either modality might be compromised. The fused representation can be adopted into common deep stitching pipelines, delivering significant performance gains over any single feature. Experiments show that UniStitch outperforms existing state-of-the-art methods by a large margin, paving the way for a unified paradigm between traditional and learning-based image stitching.
[46] R4-CGQA: Retrieval-based Vision Language Models for Computer Graphics Image Quality Assessment cs.CV | cs.DBPDF
Zhuangzi Li, Jian Jin, Shilv Cai, Weisi Lin
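TL;DR: This paper proposes R4-CGQA, a retrieval-augmented vision language model framework for computer graphics (CG) image quality assessment. The authors first define six key perceptual dimensions from the user perspective and build a dataset of 3,500 CG images with corresponding quality descriptions. They find that existing VLMs are not accurate enough at fine-grained CG quality assessment, but that descriptions of visually similar images improve their understanding significantly. They therefore design a two-stream retrieval framework to strengthen the CG quality assessment ability of VLMs.
Details
Motivation: Existing CG datasets lack systematic descriptions of rendering quality, and existing CG quality assessment methods cannot give reasonable text-based explanations. A method is needed that assesses CG quality comprehensively and explains its judgment.
Result: On question-answer benchmarks built from the constructed dataset, experiments show that the method substantially improves the performance of several representative VLMs on CG quality assessment.
Insight: The contributions are a systematic, user-centered definition of the perceptual dimensions of CG quality with a dataset of descriptions, and a two-stream retrieval-augmented framework that uses descriptions of visually similar images to improve VLM understanding and assessment of fine-grained CG quality.
Abstract: Immersive Computer Graphics (CG) rendering has become ubiquitous in modern daily life. However, comprehensively evaluating CG quality remains challenging for two reasons: first, existing CG datasets lack systematic descriptions of rendering quality; second, existing CG quality assessment methods cannot provide reasonable text-based explanations. To address these issues, we first identify six key perceptual dimensions of CG quality from the user perspective and construct a dataset of 3,500 CG images with corresponding quality descriptions. Each description covers CG style, content, and perceived quality along the selected dimensions. Furthermore, we use a subset of the dataset to build several question-answer benchmarks based on the descriptions in order to evaluate the responses of existing Vision Language Models (VLMs). We find that current VLMs are not sufficiently accurate in judging fine-grained CG quality, but that descriptions of visually similar images can significantly improve a VLM's understanding of a given CG image. Motivated by this observation, we adopt retrieval-augmented generation and propose a two-stream retrieval framework that effectively enhances the CG quality assessment capabilities of VLMs. Experiments on several representative VLMs demonstrate that our method substantially improves their performance on CG quality assessment.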
[47] HyPER-GAN: Hybrid Patch-Based Image-to-Image Translation for Real-Time Photorealism Enhancement cs.CVPDF
Stefanos Pasios, Nikos Nikolaidis
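TL;DR: This paper proposes HyPER-GAN, a lightweight image-to-image translation method that enhances the photorealism of synthetic data in real time. It is built on a U-Net-style generator and uses a hybrid training strategy that incorporates matched patches from real-world data to improve visual realism and semantic consistency. Experiments show that HyPER-GAN outperforms existing paired image-to-image translation methods in inference latency, visual realism, and semantic robustness.
Details
Motivation: Generative models used to make synthetic data more realistic often introduce visual artifacts and consume substantial compute, which limits their use in real-time training or evaluation. The paper addresses both problems with a lightweight method capable of real-time inference.
Result: HyPER-GAN surpasses state-of-the-art paired image-to-image translation methods in inference latency, visual realism, and semantic robustness, and the experiments confirm its efficiency and effectiveness.
Insight: The contributions are a lightweight U-Net-based generator design and a hybrid training strategy that adds matched patches from real-world data to improve visual quality and semantic consistency, an approach that may be useful for other real-time image enhancement work.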
Abstract: Generative models are widely employed to enhance the photorealism of synthetic data for training computer vision algorithms. However, they often introduce visual artifacts that degrade the accuracy of these algorithms and require high computational resources, limiting their applicability in real-time training or evaluation scenarios. In this paper, we propose Hybrid Patch Enhanced Realism Generative Adversarial Network (HyPER-GAN), a lightweight image-to-image translation method based on a U-Net-style generator designed for real-time inference. The model is trained using paired synthetic and photorealism-enhanced images, complemented by a hybrid training strategy that incorporates matched patches from real-world data to improve visual realism and semantic consistency. Experimental results demonstrate that HyPER-GAN outperforms state-of-the-art paired image-to-image translation methods in terms of inference latency, visual realism, and semantic robustness. Moreover, it is illustrated that the proposed hybrid training strategy indeed improves visual quality and semantic consistency compared to training the model solely with paired synthetic and photorealism-enhanced images. Code and pretrained models are publicly available for download at: https://github.com/stefanos50/HyPER-GAN
[48] Splat2Real: Novel-view Scaling for Physical AI with 3D Gaussian Splatting cs.CVPDF
Hansol Lim, Jongseong Brad Choi
TL;DR: This paper proposes Splat2Real, which uses 3D Gaussian splatting (3DGS) to generate scalable novel-view observations, frames monocular RGB-to-3D perception pretraining as imitation learning from a digital twin expert, and introduces the CN-Coverage curriculum to optimize novel-view selection, improving the robustness of physical AI under viewpoint changes.
Details
Motivation: Physical AI faces a viewpoint shift between training and deployment. The paper aims to make monocular RGB-to-3D perception more robust to novel views.
Result: On 20 sequences from the TUM RGB-D dataset, CN-Coverage mitigates worst-case regressions relative to baseline policies such as Robot/Coverage, while GOL-Gated CN-Coverage provides the strongest stability at medium-to-high budgets and lowers the high-novelty tail error.
Insight: The contributions are the concept of novel-view scaling, which stresses that performance depends more on which views are added than on the raw view count, the CN-Coverage curriculum that combines geometry gain with an extrapolation penalty, and a quality-aware guardrail for low-reliability teachers.
Abstract: Physical AI faces viewpoint shift between training and deployment, and novel-view robustness is essential for monocular RGB-to-3D perception. We cast Real2Render2Real monocular depth pretraining as imitation-learning-style supervision from a digital twin oracle: a student depth network imitates expert metric depth/visibility rendered from a scene mesh, while 3DGS supplies scalable novel-view observations. We present Splat2Real, centered on novel-view scaling: performance depends more on which views are added than on raw view count. We introduce CN-Coverage, a coverage+novelty curriculum that greedily selects views by geometry gain and an extrapolation penalty, plus a quality-aware guardrail fallback for low-reliability teachers. Across 20 TUM RGB-D sequences with step-matched budgets (N=0 to 2000 additional rendered views, with N unique <= 500 and resampling for larger budgets), naive scaling is unstable; CN-Coverage mitigates worst-case regressions relative to Robot/Coverage policies, and GOL-Gated CN-Coverage provides the strongest medium-high-budget stability with the lowest high-novelty tail error. Downstream control-proxy results versus N provide embodied-relevance evidence by shifting safety/progress trade-offs under viewpoint shift.
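The greedy selection that CN-Coverage is described as performing (geometry gain minus an extrapolation penalty) can be illustrated with a toy scoring rule. Everything concrete here is an assumption: geometry gain is modeled as the number of newly covered surface cells, the penalty as a per-view scalar scaled by `penalty_weight`, and the candidate data are made up. The paper's actual scoring and its quality-aware guardrail are not reproduced.

```python
def select_views(candidates, budget, penalty_weight=1.0):
    """Greedily pick views maximizing (newly covered cells) - weight * extrapolation.

    candidates maps a view name to (set_of_covered_cells, extrapolation_penalty).
    Selection stops at the budget or when no remaining view has a positive score.
    """
    covered = set()
    chosen = []
    remaining = dict(candidates)

    def score(name):
        cells, extrapolation = remaining[name]
        # Marginal coverage shrinks as earlier picks cover the same cells.
        return len(cells - covered) - penalty_weight * extrapolation

    while remaining and len(chosen) < budget:
        best = max(sorted(remaining), key=score)  # sorted() makes ties deterministic
        if score(best) <= 0:
            break
        covered |= remaining.pop(best)[0]
        chosen.append(best)
    return chosen


views = {
    "a": ({1, 2, 3}, 0.5),
    "b": ({3, 4}, 0.2),
    "c": ({1, 2}, 0.1),
    "far": ({5, 6, 7, 8}, 10.0),  # large coverage but heavily penalized
}
picked = select_views(views, budget=3)
```

In this example the selection returns `["a", "b"]`: view `c` adds no new cells once `a` is chosen, and `far` is rejected by its penalty even though it covers the most cells. That mirrors the abstract's point that which views are added matters more than how many.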
[49] Are Video Reasoning Models Ready to Go Outside? cs.CV | cs.AIPDF
Yangfan He, Changgyu Boo, Jaehong Yoon
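TL;DR: This paper proposes the ROVA training framework and the PVRBench benchmark to improve the robustness of video reasoning models under real-world perturbations. ROVA strengthens models with a robustness-aware consistency reward under spatio-temporal corruptions and a difficulty-aware online training strategy, and is validated on the newly built PVRBench as well as UrbanVideo and VisBench.
Details
Motivation: Existing vision-language models perform well when evaluated in clean, controlled settings, but their understanding and reasoning degrade substantially under deployment-time disturbances such as weather, occlusion, and camera motion, revealing a robustness gap.
Result: On PVRBench, UrbanVideo, and VisBench, open-source and proprietary models lose up to 35% in accuracy and 28% in reasoning quality under realistic perturbations. ROVA substantially mitigates this degradation: compared with baseline models (QWen2.5/3-VL, InternVL2.5, Embodied-R), it improves relative accuracy by at least 24% and reasoning quality by more than 9%, with consistent gains on clean standard benchmarks as well.
Insight: The contributions are ROVA, a general training framework that models a robustness-aware consistency reward, and a difficulty-aware online training strategy that dynamically re-estimates sample difficulty via self-reflective evaluation. The paper also builds PVRBench, a benchmark that injects real-world perturbations into embodied video datasets for more realistic robustness evaluation. Viewed objectively, combining robustness training with dynamic difficulty estimation is an approach worth borrowing.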
Abstract: In real-world deployment, vision-language models often encounter disturbances such as weather, occlusion, and camera motion. Under such conditions, their understanding and reasoning degrade substantially, revealing a gap between clean, controlled (i.e., unperturbed) evaluation settings and real-world robustness. To address this limitation, we propose ROVA, a novel training framework that improves robustness by modeling a robustness-aware consistency reward under spatio-temporal corruptions. ROVA introduces a difficulty-aware online training strategy that prioritizes informative samples based on the model’s evolving capability. Specifically, it continuously re-estimates sample difficulty via self-reflective evaluation, enabling adaptive training with a robustness-aware consistency reward. We also introduce PVRBench, a new benchmark that injects real-world perturbations into embodied video datasets to assess both accuracy and reasoning quality under realistic disturbances. We evaluate ROVA and baselines on PVRBench, UrbanVideo, and VisBench, where open-source and proprietary models suffer up to 35% and 28% drops in accuracy and reasoning under realistic perturbations. ROVA effectively mitigates performance degradation, boosting relative accuracy by at least 24% and reasoning by over 9% compared with baseline models (QWen2.5/3-VL, InternVL2.5, Embodied-R). These gains transfer to clean standard benchmarks, yielding consistent improvements.
[50] RandMark: On Random Watermarking of Visual Foundation Models cs.CV | cs.AIPDF
Anna Chistyakova, Mikhail Pautov
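TL;DR: This paper proposes RandMark, a random watermark embedding technique for ownership verification of visual foundation models. A small encoder-decoder network embeds digital watermarks into the internal representations of input images, so that the watermark statistics remain detectable in copies of the model.
Details
Motivation: Visual foundation models are expensive to train, and owners rely on license agreements to protect their intellectual property, so a reliable method of ownership verification is needed.
Result: Theory and experiments show a low probability of false detection on non-watermarked models and a low probability of missed detection on watermarked models, supporting the method's effectiveness.
Insight: The novelty is the random watermark embedding mechanism, which embeds the watermark in internal representations rather than the output layer, improving its stealth and robustness and offering a new approach to model copyright protection.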
Abstract: Being trained on large and diverse datasets, visual foundation models (VFMs) can be fine-tuned to achieve remarkable performance and efficiency in various downstream computer vision tasks. The high computational cost of data collection and training makes these models valuable assets, which motivates some VFM owners to distribute them alongside a license to protect their intellectual property rights. In this paper, we propose an approach to ownership verification of visual foundation models that leverages a small encoder-decoder network to embed digital watermarks into an internal representation of a hold-out set of input images. The method is based on random watermark embedding, which makes the watermark statistics detectable in functional copies of the watermarked model. Both theoretically and experimentally, we demonstrate that the proposed method yields a low probability of false detection for non-watermarked models and a low probability of missed detection for watermarked models.
[51] UniCom: Unified Multimodal Modeling via Compressed Continuous Semantic Representations cs.CVPDF
Yaqi Zhao, Wang Lin, Zijian Zhang, Miles Yang, Jingyuan Chen
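TL;DR: UniCom is a framework that unifies multimodal modeling through compressed continuous semantic representations. It targets two problems: discrete visual tokenizers discard fine-grained semantic information, and continuous representations converge slowly and unstably in high-dimensional generative modeling. The framework uses an attention-based semantic compressor to distill dense features into a compact unified representation, and shows that the transfusion architecture is superior in convergence and consistency.
Details
Motivation: Current unified multimodal models usually rely on discrete visual tokenizers to bridge the modality gap, but discretization discards fine-grained semantics and hurts visual understanding. Directly modeling continuous semantic representations (e.g., CLIP, SigLIP) instead suffers from slow convergence and training instability in high-dimensional generative modeling.
Result: Experiments show that UniCom achieves state-of-the-art generation performance among unified models and offers exceptional controllability in image editing, maintaining image consistency even without a VAE.
Insight: Contributions: unifying multimodal understanding and generation via compressed continuous representations; the finding that reducing the channel dimension is more effective than spatial downsampling for both reconstruction and generation; an attention-based semantic compressor; and evidence that the transfusion architecture outperforms query-based designs. Viewed objectively, the method improves generation quality and controllability while preserving semantic priors, offering a new approach to multimodal modeling.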
Abstract: Current unified multimodal models typically rely on discrete visual tokenizers to bridge the modality gap. However, discretization inevitably discards fine-grained semantic information, leading to suboptimal performance in visual understanding tasks. Conversely, directly modeling continuous semantic representations (e.g., CLIP, SigLIP) poses significant challenges in high-dimensional generative modeling, resulting in slow convergence and training instability. To resolve this dilemma, we introduce UniCom, a unified framework that harmonizes multimodal understanding and generation via compressed continuous representation. We empirically demonstrate that reducing channel dimension is significantly more effective than spatial downsampling for both reconstruction and generation. Accordingly, we design an attention-based semantic compressor to distill dense features into a compact unified representation. Furthermore, we validate that the transfusion architecture surpasses query-based designs in convergence and consistency. Experiments demonstrate that UniCom achieves state-of-the-art generation performance among unified models. Notably, by preserving rich semantic priors, it delivers exceptional controllability in image editing and maintains image consistency even without relying on VAE.
[52] WalkGPT: Grounded Vision-Language Conversation with Depth-Aware Segmentation for Pedestrian Navigation cs.CV | cs.CYPDF
Rafi Ibn Sultan, Hui Zhu, Xiangyu Zhou, Chengyin Li, Prashant Khanduri
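TL;DR: This paper proposes WalkGPT, a pixel-grounded large vision-language model for the Grounded Navigation Guide task, addressing semantic and spatial reasoning for pedestrian navigation in complex urban scenes. A unified architecture integrates language reasoning and segmentation: given a pedestrian-view image and a navigation query, it generates a conversational response with segmentation masks for accessible and harmful features, along with relative depth estimates.
Details
Motivation: Existing large vision-language models lack explicit grounding when describing visual content, which leads to object hallucinations and unreliable depth reasoning and makes them ill-suited to accessible pedestrian navigation.
Result: On the proposed large-scale PAVE benchmark (41k pedestrian-view images), WalkGPT achieves strong grounded reasoning and segmentation performance.
Insight: Contributions: 1) the new Grounded Navigation Guide task; 2) the Multi-Scale Query Projector and Calibrated Text Projector components, which enable fine-grained grounding and depth reasoning without user-provided cues; 3) a Region Alignment Loss that guides the representation mapping; 4) the PAVE dataset.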
Abstract: Ensuring accessible pedestrian navigation requires reasoning about both semantic and spatial aspects of complex urban scenes, a challenge that existing Large Vision-Language Models (LVLMs) struggle to meet. Although these models can describe visual content, their lack of explicit grounding leads to object hallucinations and unreliable depth reasoning, limiting their usefulness for accessibility guidance. We introduce WalkGPT, a pixel-grounded LVLM for the new task of Grounded Navigation Guide, unifying language reasoning and segmentation within a single architecture for depth-aware accessibility guidance. Given a pedestrian-view image and a navigation query, WalkGPT generates a conversational response with segmentation masks that delineate accessible and harmful features, along with relative depth estimation. The model incorporates a Multi-Scale Query Projector (MSQP) that shapes the final image tokens by aggregating them along text tokens across spatial hierarchies, and a Calibrated Text Projector (CTP), guided by a proposed Region Alignment Loss, that maps language embeddings into segmentation-aware representations. These components enable fine-grained grounding and depth inference without user-provided cues or anchor points, allowing the model to generate complete and realistic navigation guidance. We also introduce PAVE, a large-scale benchmark of 41k pedestrian-view images paired with accessibility-aware questions and depth-grounded answers. Experiments show that WalkGPT achieves strong grounded reasoning and segmentation performance. The source code and dataset are available on the project website (https://sites.google.com/view/walkgpt-26/home).
[53] eLasmobranc Dataset: An Image Dataset for Elasmobranch Species Recognition and Biodiversity Monitoring cs.CVPDF
Ismael Beviá-Ballesteros, Mario Jerez-Tallón, Nieves Aranda-Garrido, Isabel Abel-Abellán, Irene Antón-Linares
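TL;DR: This paper introduces the eLasmobranc Dataset, a public image dataset for species recognition and biodiversity monitoring of elasmobranchs (such as sharks and rays). It contains images of seven ecologically relevant species from the Spanish Mediterranean coast, mostly acquired out of the water under standardized conditions so that morphological traits are clearly visible, and integrates expert-validated species annotations, spatial and temporal metadata, and species information.
Details
Motivation: Elasmobranch populations are declining significantly worldwide and many species are threatened, so reliable species-level identification is essential for conservation planning. Existing visual datasets, however, are mostly detection-oriented, acquired underwater, or limited to coarse-grained categories, which restricts their use for fine-grained morphological classification.
Result: The abstract reports no quantitative experiments or benchmarks; it emphasizes that the dataset is designed to support supervised species-level classification, population studies, and AI system development, filling a critical gap in fine-grained elasmobranch identification.
Insight: The contribution is a public dataset focused on fine-grained elasmobranch identification, characterized by standardized out-of-water acquisition that highlights morphological traits, expert-validated annotations, rich integrated metadata, and support for reproducible conservation-oriented computer vision research.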
Abstract: Elasmobranch populations are experiencing significant global declines, and several species are currently classified as threatened. Reliable monitoring and species-level identification are essential to support conservation and spatial planning initiatives such as Important Shark and Ray Areas (ISRAs). However, existing visual datasets are predominantly detection-oriented, underwater-acquired, or limited to coarse-grained categories, restricting their applicability to fine-grained morphological classification. We present the eLasmobranc Dataset, a curated and publicly available image collection from seven ecologically relevant elasmobranch species inhabiting the eastern Spanish Mediterranean coast, a region where two ISRAs have been identified. Images were obtained through dedicated data collection, including field campaigns and collaborations with local fish markets and projects, as well as from open-access public sources. The dataset was constructed predominantly from images acquired outside the aquatic environment under standardized protocols to ensure clear visualization of diagnostic morphological traits. It integrates expert-validated species annotations, structured spatial and temporal metadata, and complementary species-level information. The eLasmobranc Dataset is specifically designed to support supervised species-level classification, population studies, and the development of artificial intelligence systems for biodiversity monitoring. By combining morphological clarity, taxonomic reliability, and public accessibility, the dataset addresses a critical gap in fine-grained elasmobranch identification and promotes reproducible research in conservation-oriented computer vision. The dataset is publicly available at https://zenodo.org/records/18549737.
[54] CodePercept: Code-Grounded Visual STEM Perception for MLLMs cs.CVPDF
Tongkun Guan, Zhibo Yang, Jianqiang Wan, Mingkun Yang, Zhengtao Guo
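TL;DR: This paper proposes CodePercept, a code-grounded approach to enhancing visual perception that improves the visual reasoning of multimodal large language models in STEM domains. It builds ICC-1M, a large-scale dataset of 1M image-caption-code triplets, and introduces STEM2Code-Eval, a new benchmark that evaluates visual perception directly. The study shows that using executable code as the perceptual medium addresses the hallucination and ambiguity of existing methods, and that strengthening perception matters more than strengthening reasoning for STEM visual reasoning performance.
Details
Motivation: MLLMs perform poorly on STEM visual reasoning, and the paper asks whether the root cause is a perceptual deficiency or a reasoning limitation. Based on the finding that scaling perception beats scaling reasoning, it sets out to systematically enhance MLLM visual perception by using code as a precise perceptual medium.
Result: A systematic scaling analysis finds that, for STEM visual reasoning, scaling the perception component consistently outperforms scaling the reasoning component. The approach is validated with the ICC-1M dataset and the STEM2Code-Eval benchmark, which requires generating executable code that reconstructs the image, giving a deterministic and verifiable assessment that directly measures comprehensive visual understanding.
Insight: The novelty is establishing executable code as a powerful perceptual medium whose precise semantics align naturally with the structured nature of STEM visuals. This is realized through two complementary approaches, Code-Grounded Caption Generation and STEM Image-to-Code Translation, which avoid the hallucinations of existing knowledge distillation methods and reduce the ambiguity of natural language for perception enhancement. Viewed objectively, the work offers a novel, verifiable paradigm for evaluating and improving fine-grained visual perception in MLLMs.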
Abstract: When MLLMs fail at Science, Technology, Engineering, and Mathematics (STEM) visual reasoning, a fundamental question arises: is it due to perceptual deficiencies or reasoning limitations? Through systematic scaling analysis that independently scales perception and reasoning components, we uncover a critical insight: scaling perception consistently outperforms scaling reasoning. This reveals perception as the true lever limiting current STEM visual reasoning. Motivated by this insight, our work focuses on systematically enhancing the perception capabilities of MLLMs by establishing code as a powerful perceptual medium: executable code provides precise semantics that naturally align with the structured nature of STEM visuals. Specifically, we construct ICC-1M, a large-scale dataset comprising 1M Image-Caption-Code triplets that materializes this code-as-perception paradigm through two complementary approaches: (1) Code-Grounded Caption Generation treats executable code as ground truth for image captions, eliminating the hallucinations inherent in existing knowledge distillation methods; (2) STEM Image-to-Code Translation prompts models to generate reconstruction code, mitigating the ambiguity of natural language for perception enhancement. To validate this paradigm, we further introduce STEM2Code-Eval, a novel benchmark that directly evaluates visual perception in STEM domains. Unlike existing work relying on problem-solving accuracy as a proxy that only measures problem-relevant understanding, our benchmark requires comprehensive visual comprehension through executable code generation for image reconstruction, providing deterministic and verifiable assessment. Code is available at https://github.com/TongkunGuan/Qwen-CodePercept.
[55] Taking Shortcuts for Categorical VQA Using Super Neurons cs.CV | cs.AI | cs.LGPDF
Pierre Musacchio, Jaeyi Jeong, Dahun Kim, Jaesik Park
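TL;DR: This paper proposes Super Neurons (SNs): classifiers built by directly probing the raw scalar activations of a vision-language model (VLM), rather than relying on attention vectors or the model's prediction. Discriminative neurons can be found in shallow layers at the first generated token, which enables extreme early exiting and greatly speeds up inference while improving classification performance.
Details
Motivation: The goal is an efficient, training-free alternative for improving VLM performance on visual downstream tasks. Existing methods such as Sparse Attention Vectors (SAVs) are effective but restricted to attention heads; this paper explores the much larger space of scalar activations to find more discriminative parameters.
Result: On tasks such as categorical VQA, Super Neurons robustly improve classification performance over the original network while achieving up to a 5.10x inference speedup.
Insight: The novelty is shifting the probing target from attention vectors to scalar activations, which greatly enlarges the searchable parameter space, and showing that enough discriminative neurons exist in shallow layers and early tokens to support a high-performing extreme early-exit strategy. This suggests a new route to efficient VLM inference.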
Abstract: Sparse Attention Vectors (SAVs) have emerged as an excellent training-free alternative to supervised finetuning or low-rank adaptation to improve the performance of Vision Language Models (VLMs). At their heart, SAVs select a few accurate attention heads for a task of interest and use them as classifiers, rather than relying on the model’s prediction. In a similar spirit, we find that directly probing the raw activations of the VLM, in the form of scalar values, is sufficient to yield accurate classifiers on diverse visually grounded downstream tasks. Shifting focus from attention vectors to scalar activations dramatically increases the search space for accurate parameters, allowing us to find more discriminative neurons immediately from the first generated token. We call such activations Super Neurons (SNs). In this probing setting, we discover that enough SNs appear in the shallower layers of the large language model to allow for extreme early exiting from the first layer of the model at the first generated token. Compared to the original network, SNs robustly improve the classification performance while achieving a speedup of up to 5.10x.
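The idea of using single scalar activations as classifiers, as described in the abstract above, can be sketched on toy data. This is a minimal illustration under stated assumptions: the abstract does not say how neurons are scored or thresholded, so the midpoint-of-class-means threshold, the binary labels, and the names `fit_scalar_probes` and `predict` are hypothetical choices, and the activation lists stand in for values read from a real model.

```python
def fit_scalar_probes(activations, labels):
    """Score every neuron as a one-threshold binary classifier.

    activations: list of per-example lists (one scalar per neuron).
    labels: list of 0/1 class labels for a small labeled support set.
    Returns (accuracy, neuron_index, threshold, sign) tuples, best first.
    The midpoint threshold is an assumption made for this sketch.
    """
    n_neurons = len(activations[0])
    probes = []
    for j in range(n_neurons):
        pos = [a[j] for a, y in zip(activations, labels) if y == 1]
        neg = [a[j] for a, y in zip(activations, labels) if y == 0]
        mean_pos = sum(pos) / len(pos)
        mean_neg = sum(neg) / len(neg)
        threshold = (mean_pos + mean_neg) / 2
        sign = 1 if mean_pos >= mean_neg else -1  # which side means class 1
        correct = sum(
            (sign * (a[j] - threshold) > 0) == (y == 1)
            for a, y in zip(activations, labels)
        )
        probes.append((correct / len(labels), j, threshold, sign))
    probes.sort(key=lambda p: -p[0])  # stable sort, highest accuracy first
    return probes


def predict(probe, activation):
    """Classify one example using a single selected neuron."""
    _, j, threshold, sign = probe
    return 1 if sign * (activation[j] - threshold) > 0 else 0


# Toy support set: 4 examples, 3 neurons; neuron 1 separates the classes.
acts = [[0.1, 2.0, -1.0], [0.2, 1.8, 0.5], [0.15, -1.9, -0.2], [0.12, -2.1, 0.9]]
labels = [1, 1, 0, 0]
probes = fit_scalar_probes(acts, labels)
best = probes[0]
print("best neuron:", best[1], "support accuracy:", best[0])
```

A real pipeline would presumably aggregate several such neurons and read them at an early layer so that the forward pass can stop there; that part is omitted here.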
[56] PolGS++: Physically-Guided Polarimetric Gaussian Splatting for Fast Reflective Surface Reconstruction cs.CVPDF
Yufei Han, Chu Zhou, Youwei Lyu, Qi Chen, Si Li
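TL;DR: This paper proposes PolGS++, a physically guided polarimetric Gaussian Splatting framework for fast reconstruction of reflective surfaces. It integrates a polarized bidirectional reflectance distribution function (pBRDF) model into 3D Gaussian Splatting (3DGS) to explicitly decouple diffuse and specular components, and uses a depth-guided visibility mask acquisition mechanism to impose angle-of-polarization (AoP)-based tangent-space consistency constraints, improving reconstruction quality and efficiency without costly ray tracing.
Details
Motivation: Accurate reconstruction of reflective surfaces is a fundamental challenge in computer vision. Existing 3DGS methods still lag behind implicit neural methods on reflective surfaces, especially in recovering fine geometry and surface normals.
Result: Extensive experiments on synthetic and real-world datasets validate the method, which needs only about 10 minutes of training to deliver high-quality, efficient reflective surface reconstruction.
Insight: The novelty is introducing a physical polarization model (pBRDF) into 3DGS to model reflectance components explicitly, together with a ray-tracing-free, depth-guided visibility mask mechanism for imposing AoP consistency constraints, which raises reconstruction accuracy on reflective surfaces while retaining the efficiency of 3DGS.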
Abstract: Accurate reconstruction of reflective surfaces remains a fundamental challenge in computer vision, with broad applications in real-time virtual reality and digital content creation. Although 3D Gaussian Splatting (3DGS) enables efficient novel-view rendering with explicit representations, its performance on reflective surfaces still lags behind implicit neural methods, especially in recovering fine geometry and surface normals. To address this gap, we propose PolGS++, a physically-guided polarimetric Gaussian Splatting framework for fast reflective surface reconstruction. Specifically, we integrate a polarized BRDF (pBRDF) model into 3DGS to explicitly decouple diffuse and specular components, providing physically grounded reflectance modeling and stronger geometric cues for reflective surface recovery. Furthermore, we introduce a depth-guided visibility mask acquisition mechanism that enables angle-of-polarization (AoP)-based tangent-space consistency constraints in Gaussian Splatting without costly ray-tracing intersections. This physically guided design improves reconstruction quality and efficiency, requiring only about 10 minutes of training. Extensive experiments on both synthetic and real-world datasets validate the effectiveness of our method.
[57] Backdoor Directions in Vision Transformers cs.CV | cs.CRPDF
Sengim Karayalcin, Marina Krcek, Pin-Yu Chen, Stjepan Picek
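TL;DR: This paper studies how backdoor attacks are represented inside Vision Transformers (ViTs). Assuming the trigger is known, the authors identify a specific "trigger direction" in the model's activations that corresponds to the trigger's internal representation. Interventions in activation and parameter space confirm the causal role of this linear direction in the model's backdoor behavior. Using the direction as a diagnostic tool, they trace how backdoor features are processed across layers and reveal fundamental differences in internal logic between static-patch triggers and stealthy, distributed triggers. They further examine the link between backdoor and adversarial attacks, testing whether PGD-based perturbations (de-)activate the identified trigger mechanism. Finally, they propose a data-free, weight-based detection scheme for stealthy-trigger attacks. The study shows that mechanistic interpretability offers a robust framework for diagnosing and addressing security vulnerabilities in computer vision.
Details
Motivation: The aim is to understand how backdoor attacks are represented and processed inside ViT models, and to provide a mechanistic-interpretability-based framework for diagnosing and mitigating such security vulnerabilities.
Result: The results confirm the causal role of the trigger direction and reveal qualitative differences in how the model internally processes different trigger types (static patch vs. stealthy distributed). The proposed data-free, weight-based detection scheme is effective against stealthy-trigger attacks.
Insight: The main contribution is a systematic application of mechanistic interpretability to backdoor analysis in ViTs: it identifies a causally relevant trigger direction as the core diagnostic tool and exposes the differing internal mechanisms of different backdoor types, offering a new perspective for developing more robust defenses.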
Abstract: This paper investigates how Backdoor Attacks are represented within Vision Transformers (ViTs). By assuming knowledge of the trigger, we identify a specific "trigger direction" in the model’s activations that corresponds to the internal representation of the trigger. We confirm the causal role of this linear direction by showing that interventions in both activation and parameter space consistently modulate the model’s backdoor behavior across multiple datasets and attack types. Using this direction as a diagnostic tool, we trace how backdoor features are processed across layers. Our analysis reveals distinct qualitative differences: static-patch triggers follow a different internal logic than stealthy, distributed triggers. We further examine the link between backdoors and adversarial attacks, specifically testing whether PGD-based perturbations (de-)activate the identified trigger mechanism. Finally, we propose a data-free, weight-based detection scheme for stealthy-trigger attacks. Our findings show that mechanistic interpretability offers a robust framework for diagnosing and addressing security vulnerabilities in computer vision.
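One common way to obtain a linear direction in activation space and intervene on it is a difference of class means followed by projection removal. The sketch below shows that generic recipe on toy vectors. It is an assumption, not the paper's stated procedure: the abstract does not say how the trigger direction is estimated, and the names `trigger_direction` and `project_out` are hypothetical.

```python
import math


def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def trigger_direction(clean, triggered):
    """Unit vector from the clean mean to the triggered mean (assumed recipe)."""
    diff = [t - c for t, c in zip(mean_vector(triggered), mean_vector(clean))]
    norm = math.sqrt(sum(d * d for d in diff))
    return [d / norm for d in diff]


def project_out(activation, direction):
    """Activation-space intervention: remove the component along direction."""
    coeff = sum(a * d for a, d in zip(activation, direction))
    return [a - coeff * d for a, d in zip(activation, direction)]


# Toy 2-D activations standing in for one layer of a model.
clean = [[1.0, 0.0], [1.0, 2.0]]
triggered = [[4.0, 5.0], [4.0, 5.0]]
direction = trigger_direction(clean, triggered)
ablated = project_out([4.0, 5.0], direction)
residual = sum(a * d for a, d in zip(ablated, direction))
print(direction, ablated)
```

After the intervention the activation has no remaining component along the direction, which is the property a causal test of such a direction would rely on.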
[58] HanMoVLM: Large Vision-Language Models for Professional Artistic Painting Evaluation cs.CVPDF
Hongji Yang, Yucheng Zhou, Wencheng Han, Songlian Li, Xiaotong Zhao
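TL;DR: This paper proposes HanMoVLM, a large vision-language model for professional-grade evaluation in the Chinese artistic domain, particularly painting. It builds the HanMo-Bench dataset of authentic auction-grade masterpieces and AI-generated works, and designs an expert-validated Chain-of-Thought (CoT) and a reward function that guide the model through expert-level reasoning, from content identification and Region of Interest (RoI) localization to professional evaluation. The model can serve as a high-quality verifier for test-time scaling in image generation, improving the quality of generated Chinese paintings.
Details
Motivation: Existing large vision-language models (VLMs) have strong general visual capabilities but lack professional artistic judgment in specific domains such as Chinese painting, and cannot evaluate artworks as human experts do. The paper aims to close this gap so that VLMs can perform professional-grade painting evaluation in the Chinese artistic domain.
Result: Experiments and human studies confirm that HanMoVLM effectively bridges the evaluation gap, achieving high consistency with professional experts and significantly improving the quality of Chinese painting generation.
Insight: Main contributions: 1) HanMo-Bench, a professional art evaluation dataset grounded in real-world market valuations; 2) an expert-validated CoT that guides structured reasoning from content identification to professional evaluation; 3) a reward function that refines the model's reasoning process; 4) use of the model as a high-quality verifier for test-time scaling in image generation to improve output quality.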
Abstract: While Large Vision-Language Models (VLMs) demonstrate impressive general visual capabilities, they remain artistically blind and cannot offer the professional evaluation of artworks within specific artistic domains that human experts provide. To bridge this gap, we transform VLMs into experts capable of professional-grade painting evaluation in the Chinese Artistic Domain, which is more abstract and demands extensive artistic training for evaluation. We introduce HanMo-Bench, a new dataset that features authentic auction-grade masterpieces and AI-generated works, grounded in real-world market valuations. To enable rigorous judgment, we propose HanMoVLM and construct an expert-validated Chain-of-Thought (CoT). This CoT guides the model to perform expert-level reasoning: from content identification and Region of Interest (RoI) localization to professional evaluation, guided by both theme-specific evaluation and the typical three-tier evaluation of Chinese paintings. Furthermore, we design a reward function that refines the reasoning process of HanMoVLM to improve accuracy. We demonstrate that HanMoVLM can serve as a critical backbone for Test-time Scaling in image generation. By acting as a high-quality verifier, HanMoVLM enables generative models to select the most artistically superior outputs from multiple candidates. Experimental results and human studies confirm that the proposed HanMoVLM effectively bridges the gap, achieving a high consistency with professional experts and significantly improving the quality of Chinese Painting generation.
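Selecting the best of several generated candidates with a verifier, the test-time scaling use described in the abstract above, reduces to a best-of-N loop. The sketch below shows only that loop. The verifier here is a hypothetical stand-in that averages two made-up sub-scores (`composition`, `brushwork`); in the paper's setting the score would come from HanMoVLM, whose interface the abstract does not describe.

```python
def best_of_n(candidates, verifier):
    """Return the candidate with the highest verifier score, plus that score."""
    scored = [(verifier(c), c) for c in candidates]
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best, best_score


def toy_verifier(candidate):
    """Hypothetical scorer: averages two invented sub-scores in [0, 1]."""
    return 0.5 * candidate["composition"] + 0.5 * candidate["brushwork"]


# Stand-ins for N generated images, each with precomputed toy sub-scores.
candidates = [
    {"id": "a", "composition": 0.2, "brushwork": 0.9},
    {"id": "b", "composition": 0.8, "brushwork": 0.7},
    {"id": "c", "composition": 0.5, "brushwork": 0.5},
]
best, best_score = best_of_n(candidates, toy_verifier)
print(best["id"], best_score)
```

The quality of the selection depends entirely on the verifier, which is why the abstract emphasizes agreement between the model and professional experts.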
[59] A dataset of medication images with instance segmentation masks for preventing adverse drug events cs.CVPDF
W. I. Chu, S. Hirani, G. Tarroni, L. Li
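TL;DR: MEDISEG is a medication image dataset for preventing adverse drug events, with instance segmentation annotations for 32 distinct pill types across 8262 images, covering real-world scenarios that range from individual pills to cluttered dosette boxes. YOLOv8 and YOLOv9 trained on the dataset reach 99.5% mAP@0.5 on the 3-Pills subset and 80.1% on the 32-Pills subset, and a few-shot detection protocol shows improved recognition of unseen pill classes.
Details
Motivation: Medication errors and adverse drug events pose significant risks to patient safety, and existing pill image datasets lack real-world complexity (overlapping pills, varied lighting, occlusions), which hinders the development of AI-based pill recognition models.
Result: YOLOv8 and YOLOv9 trained on MEDISEG reach 99.5% mAP@0.5 on the 3-Pills subset and 80.1% on the 32-Pills subset. Under a few-shot detection protocol, base training on MEDISEG significantly improves recognition of unseen pill classes in occluded multi-pill scenarios.
Insight: The contribution is the first pill instance segmentation dataset that captures real-world complexity, supporting robust supervised training and transferable representation learning under limited supervision, and providing a valuable resource for developing and benchmarking AI systems for medication safety.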
Abstract: Medication errors and adverse drug events (ADEs) pose significant risks to patient safety, often arising from difficulties in reliably identifying pharmaceuticals in real-world settings. AI-based pill recognition models offer a promising solution, but the lack of comprehensive datasets hinders their development. Existing pill image datasets rarely capture real-world complexities such as overlapping pills, varied lighting, and occlusions. MEDISEG addresses this gap by providing instance segmentation annotations for 32 distinct pill types across 8262 images, encompassing diverse conditions from individual pill images to cluttered dosette boxes. We trained YOLOv8 and YOLOv9 on MEDISEG to demonstrate their usability, achieving mean average precision at IoU 0.5 of 99.5 percent on the 3-Pills subset and 80.1 percent on the 32-Pills subset. We further evaluate MEDISEG under a few-shot detection protocol, demonstrating that base training on MEDISEG significantly improves recognition of unseen pill classes in occluded multi-pill scenarios compared to existing datasets. These results highlight the dataset’s ability not only to support robust supervised training but also to promote transferable representations under limited supervision, making it a valuable resource for developing and benchmarking AI-driven systems for medication safety.
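The "mean average precision at IoU 0.5" reported above counts a prediction as correct only when its intersection-over-union with a ground-truth instance reaches 0.5. The sketch below computes that matching criterion for axis-aligned boxes. It is a simplified illustration: MEDISEG provides instance masks, and the abstract does not say whether box or mask IoU was used, so boxes are an assumption made for brevity, and the full mAP computation (ranking by confidence, per-class precision-recall curves) is omitted.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)  # zero if disjoint
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_match(pred, gt, threshold=0.5):
    """A prediction matches a ground truth when IoU reaches the threshold."""
    return box_iou(pred, gt) >= threshold


ground_truth = (0, 0, 10, 10)
half_overlap = (0, 0, 10, 5)    # covers half of the ground-truth box
corner_overlap = (5, 5, 15, 15)  # overlaps only one quadrant
print(box_iou(ground_truth, half_overlap), box_iou(ground_truth, corner_overlap))
```

Overlapping pills in cluttered scenes tend to produce low-IoU predictions like the second example, which is consistent with the lower score on the harder 32-Pills subset.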
[60] Evaluating Few-Shot Pill Recognition Under Visual Domain Shift cs.CVPDF
W. I. Chu, G. Tarroni, L. Li
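TL;DR: This study evaluates few-shot pill recognition under visual domain shift from a deployment-oriented perspective. It uses a two-stage object detection framework (base training followed by few-shot fine-tuning) and focuses on generalization under realistic cross-dataset domain shift rather than architectural innovation. Semantic classification performance saturates quickly with only a few labeled examples, but localization and recall drop markedly under overlap and occlusion, underscoring the importance of training data realism and few-shot fine-tuning for deployment readiness.
Details
Motivation: Automated pill recognition systems struggle to generalize when deployed under visually complex real-world conditions (cluttered scenes, overlapping pills, reflections, diverse acquisition environments). The study examines few-shot recognition from a deployment perspective, prioritizing robustness under cross-dataset domain shift.
Result: Evaluated on a separate deployment dataset of multi-object, cluttered scenes, classification performance saturates with just one labeled example per class. Stress tests with overlap and occlusion show a marked decline in localization and recall, while models trained on visually realistic multi-pill data are more robust in low-shot scenarios.
Insight: The contribution is a deployment-oriented evaluation of few-shot pill recognition that highlights the key role of training data realism (e.g., multi-pill scenes) in robustness and the value of few-shot fine-tuning as a diagnostic for deployment readiness. Viewed objectively, the study reveals a decoupling of semantic classification from localization performance under complex conditions, which points to directions for optimizing practical systems.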
Abstract: Adverse drug events are a significant source of preventable harm, which has led to the development of automated pill recognition systems to enhance medication safety. Real-world deployment of these systems is hindered by visually complex conditions, including cluttered scenes, overlapping pills, reflections, and diverse acquisition environments. This study investigates few-shot pill recognition from a deployment-oriented perspective, prioritizing generalization under realistic cross-dataset domain shifts over architectural innovation. A two-stage object detection framework is employed, involving base training followed by few-shot fine-tuning. Models are adapted to novel pill classes using one, five, or ten labeled examples per class and are evaluated on a separate deployment dataset featuring multi-object, cluttered scenes. The evaluation focuses on classification-centric and error-based metrics to address heterogeneous annotation strategies. Findings indicate that semantic pill recognition adapts rapidly with few-shot supervision, with classification performance reaching saturation even with a single labeled example. However, stress testing under overlapping and occluded conditions demonstrates a marked decline in localization and recall, despite robust semantic classification. Models trained on visually realistic, multi-pill data consistently exhibit greater robustness in low-shot scenarios, underscoring the importance of training data realism and the diagnostic utility of few-shot fine-tuning for deployment readiness.
[61] UltrasoundAgents: Hierarchical Multi-Agent Evidence-Chain Reasoning for Breast Ultrasound Diagnosis cs.CVPDF
Yali Zhu, Kang Zhou, Dingbang Wu, Gaofeng Meng
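TL;DR: This paper proposes UltrasoundAgents, a hierarchical multi-agent framework for breast ultrasound diagnosis. Mirroring the clinical workflow, a main agent first localizes the lesion in the full image and crops and zooms in; a sub-agent then analyzes the local view and predicts four clinically relevant attributes (echogenicity pattern, calcification, boundary type, and margin morphology); finally, the main agent integrates these structured attributes for evidence-based reasoning and outputs the BI-RADS category and a malignancy prediction, while producing reviewable intermediate evidence.
Details
Motivation: Existing methods mostly rely on end-to-end prediction or provide only weak evidence, which can miss fine-grained lesion cues and limits auditability and clinical review. The paper aims to align with the clinical workflow and improve evidence traceability.
Result: Experiments show that the method outperforms strong vision-language baselines in diagnostic accuracy and attribute agreement, while providing structured evidence and traceable reasoning.
Insight: The novelty is a hierarchical multi-agent framework that mirrors the layered reasoning of clinical diagnosis, combined with a decoupled progressive training strategy: first train the attribute agent, then train the main agent on oracle attributes to learn attribute-based reasoning, and finally apply corrective trajectory self-distillation with spatial supervision to build high-quality trajectories for supervised fine-tuning. This alleviates error propagation, difficult credit assignment, and sparse rewards in hierarchical training, improving training stability and yielding a deployable end-to-end policy.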
Abstract: Breast ultrasound diagnosis typically proceeds from global lesion localization to local sign assessment and then evidence integration to assign a BI-RADS category and determine benignity or malignancy. Many existing methods rely on end-to-end prediction or provide only weakly grounded evidence, which can miss fine-grained lesion cues and limit auditability and clinical review. To align with the clinical workflow and improve evidence traceability, we propose a hierarchical multi-agent framework, termed UltrasoundAgents. A main agent localizes the lesion in the full image and triggers a crop-and-zoom operation. A sub-agent analyzes the local view and predicts four clinically relevant attributes, namely echogenicity pattern, calcification, boundary type, and edge (margin) morphology. The main agent then integrates these structured attributes to perform evidence-based reasoning and output the BI-RADS category and the malignancy prediction, while producing reviewable intermediate evidence. Furthermore, hierarchical multi-agent training often suffers from error propagation, difficult credit assignment, and sparse rewards. To alleviate this and improve training stability, we introduce a decoupled progressive training strategy. We first train the attribute agent, then train the main agent with oracle attributes to learn robust attribute-based reasoning, and finally apply corrective trajectory self-distillation with spatial supervision to build high-quality trajectories for supervised fine-tuning, yielding a deployable end-to-end policy. Experiments show consistent gains over strong vision-language baselines in diagnostic accuracy and attribute agreement, together with structured evidence and traceable reasoning.
[62] Beyond Sequential Distance: Inter-Modal Distance Invariant Position Encoding cs.CVPDF
Lin Chen, Bolin Ni, Qi Yang, Zili Wang, Kun Ding
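TL;DR: To address visual fading in multimodal large language models in long-context scenarios, this paper proposes inter-modal Distance Invariant Position Encoding (DIPE). DIPE disentangles position encoding by modality interaction: it preserves relative positions within a modality while enforcing perceptual proximity for cross-modal interactions. This mitigates the attention penalty caused by growing inter-modal distance and keeps the model visually grounded over long contexts.
Details
Motivation: MLLMs suffer from visual fading in long-context scenarios: as the text sequence grows, attention to visual tokens weakens and generation drifts away from visual constraints. The authors attribute this to the inductive bias of Multimodal RoPE, which penalizes inter-modal attention as the distance between visual and text tokens increases.
Result: Experiments show that combining DIPE with Multimodal RoPE keeps visual grounding stable in long-context scenarios and significantly alleviates visual fading, while preserving performance on standard short-context benchmarks.
Insight: The novelty is a position encoding mechanism that disentangles modality interactions, distinguishing intra-modal from inter-modal positional relations and enforcing an anchored perceptual proximity for inter-modal interactions to remove the distance-dependent attention penalty. It is a simple, effective structural change that strengthens long-context handling in multimodal models.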
Abstract: Despite the remarkable capabilities of Multimodal Large Language Models (MLLMs), they still suffer from visual fading in long-context scenarios. Specifically, the attention to visual tokens diminishes as the text sequence lengthens, leading to text generation detached from visual constraints. We attribute this degradation to the inherent inductive bias of Multimodal RoPE, which penalizes inter-modal attention as the distance between visual and text tokens increases. To address this, we propose inter-modal Distance Invariant Position Encoding (DIPE), a simple but effective mechanism that disentangles position encoding based on modality interactions. DIPE retains the natural relative positioning for intra-modal interactions to preserve local structure, while enforcing an anchored perceptual proximity for inter-modal interactions. This strategy effectively mitigates the inter-modal distance-based penalty, ensuring that visual signals remain perceptually consistent regardless of the context length. Experimental results demonstrate that by integrating DIPE with Multimodal RoPE, the model maintains stable visual grounding in long-context scenarios, significantly alleviating visual fading while preserving performance on standard short-context benchmarks. Code is available at https://github.com/lchen1019/DIPE.
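The distance-invariance idea in the abstract above can be pictured as a rule for the relative position that attention sees between a query token and a key token. The toy below keeps the ordinary sequential offset within a modality and substitutes a fixed anchor offset across modalities. It is only an illustration of the principle under stated assumptions: the abstract does not give DIPE's actual formulation or how it is combined with Multimodal RoPE, and the `anchor` value and function name are hypothetical.

```python
def relative_positions(modalities, anchor=1):
    """Build the query-by-key relative-position table for one sequence.

    modalities: list of modality tags, one per token (e.g. "img", "txt").
    Same-modality pairs keep the usual offset i - j; cross-modality pairs
    use the constant anchor, so the offset no longer grows with context
    length. The anchor value is an assumption made for this sketch.
    """
    n = len(modalities)
    table = [[0] * n for _ in range(n)]
    for i in range(n):          # query index
        for j in range(n):      # key index
            if modalities[i] == modalities[j]:
                table[i][j] = i - j
            else:
                table[i][j] = anchor
    return table


tokens = ["img", "img", "txt", "txt", "txt"]
rel = relative_positions(tokens)
for row in rel:
    print(row)
```

In the printed table, the last text token sees the first image token at the same offset as the first text token does, whereas with purely sequential positions that offset would grow with every generated token.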
[63] Lifelong Imitation Learning with Multimodal Latent Replay and Incremental Adjustment cs.CV | cs.ROPDF
Fanqi Yu, Matteo Tiezzi, Tommaso Apicella, Cigdem Beyan, Vittorio Murino
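TL;DR: This paper proposes a lifelong imitation learning framework that operates in a multimodal latent space and combines latent replay with an incremental feature adjustment mechanism, enabling continual policy refinement across sequential tasks and setting a new state of the art on the LIBERO benchmarks.
Details
Motivation: Under realistic memory and data constraints, agents performing lifelong imitation learning across sequential tasks face catastrophic forgetting and unstable adaptation.
Result: On the LIBERO benchmarks, the method gains 10-17 points in AUC and reduces forgetting by up to 65% compared with previous leading methods, establishing a new state of the art.
Insight: The novelty is performing experience replay entirely in a multimodal latent space and introducing an incremental feature adjustment mechanism with an angular margin constraint that regularizes the evolution of task embeddings, preserving inter-task distinctiveness and stabilizing adaptation.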
Abstract: We introduce a lifelong imitation learning framework that enables continual policy refinement across sequential tasks under realistic memory and data constraints. Our approach departs from conventional experience replay by operating entirely in a multimodal latent space, where compact representations of visual, linguistic, and robot-state information are stored and reused to support future learning. To further stabilize adaptation, we introduce an incremental feature adjustment mechanism that regularizes the evolution of task embeddings through an angular margin constraint, preserving inter-task distinctiveness. Our method establishes a new state of the art in the LIBERO benchmarks, achieving 10-17 point gains in AUC and up to 65% less forgetting compared to previous leading methods. Ablation studies confirm the effectiveness of each component, showing consistent gains over alternative strategies. The code is available at: https://github.com/yfqi/lifelong_mlr_ifa.
[64] Pointy - A Lightweight Transformer for Point Cloud Foundation Models cs.CV | cs.LGPDF
Konrad Szafer, Marek Kraft, Dominik Belter
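TL;DR: This paper proposes Pointy, a lightweight transformer-based point cloud foundation model. Trained on only 39k point clouds, it outperforms several foundation models trained on larger datasets (over 200k samples) and approaches state-of-the-art models trained on over a million multimodal samples, demonstrating the value of a carefully designed training setup and architecture.
Details
Motivation: Existing point cloud foundation models rely heavily on cross-modal supervision and large-scale data. The paper aims for high performance with a lightweight architecture and limited data, exploring the potential of simple backbones for point cloud tasks.
Result: Under a standardized training and evaluation framework, Pointy outperforms several foundation models trained on larger datasets across multiple point cloud benchmarks, and approaches state-of-the-art models trained on million-scale point cloud, image, and text data.
Insight: The novelty lies in the lightweight transformer architecture and tokenizer-free design. With a carefully designed training setup, simple backbones achieve results competitive with more complex or data-rich strategies even with limited data, offering an efficient, reproducible solution for point cloud foundation models.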
Abstract: Foundation models for point cloud data have recently grown in capability, often leveraging extensive representation learning from language or vision. In this work, we take a more controlled approach by introducing a lightweight transformer-based point cloud architecture. In contrast to the heavy reliance on cross-modal supervision, our model is trained only on 39k point clouds - yet it outperforms several larger foundation models trained on over 200k training samples. Interestingly, our method approaches state-of-the-art results from models that have seen over a million point clouds, images, and text samples, demonstrating the value of a carefully curated training setup and architecture. To ensure rigorous evaluation, we conduct a comprehensive replication study that standardizes the training regime and benchmarks across multiple point cloud architectures. This unified experimental framework isolates the impact of architectural choices, allowing for transparent comparisons and highlighting the benefits of our design and other tokenizer-free architectures. Our results show that simple backbones can deliver competitive results to more complex or data-rich strategies. The implementation, including code, pre-trained models, and training protocols, is available at https://github.com/KonradSzafer/Pointy.
[65] Contrastive learning-based video quality assessment-jointed video vision transformer for video recognition cs.CVPDF
Jian Sun, Mohammad H. Mahoor
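TL;DR: This paper proposes SSL-V3, a video vision transformer that combines contrastive learning with video quality assessment (VQA) to improve video classification. Using a self-supervised learning mechanism, it trains no-reference VQA jointly with video classification: the video quality score adjusts the classification feature map, and the supervision signal from the classification task in turn tunes the VQA parameters, giving more robust classification when video quality varies widely.
Details
Motivation: The work stems from a marked performance gap between clear and blurred videos when classifying Mild Cognitive Impairment. Recognizing that video quality significantly affects classification results, the authors bring video quality assessment into the classification framework to improve robustness.
Result: The method achieves robust results on two datasets, including I-CONECT (a healthcare dataset involving facial video). For example, it reaches 94.87% accuracy on some interview videos, which supports the method's effectiveness.
Insight: The novelty is a self-supervised mechanism (Combined-SSL) that joins VQA and classification. With the video quality score as a bridge, the two tasks optimize each other, which addresses the shortage of VQA labels in video datasets and improves adaptability to quality variation.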
Abstract: Video quality significantly affects video classification. We observed this when classifying Mild Cognitive Impairment: classification worked well on clear videos but worse on blurred ones. This suggested that incorporating Video Quality Assessment (VQA) may improve video classification. This paper proposes a Self-Supervised Learning-based Video Vision Transformer combined with No-reference VQA for video classification (SSL-V3). SSL-V3 uses a Combined-SSL mechanism to join VQA with video classification and to address the shortage of VQA labels, which is common in video datasets and makes it impossible to provide an accurate Video Quality Score. In brief, Combined-SSL uses the video quality score as a factor that directly tunes the feature map of the video classifier. The score then acts as the point of intersection linking VQA and classification, so that the supervised classification task tunes the VQA parameters. SSL-V3 achieved robust experimental results on two datasets. For example, it reached an accuracy of 94.87% on some interview videos in I-CONECT (a healthcare dataset involving facial video), verifying SSL-V3’s effectiveness.
[66] GroundCount: Grounding Vision-Language Models with Object Detection for Mitigating Counting Hallucinations cs.CV | cs.AIPDF
Boyuan Chen, Minghao Shao, Siddharth Garg, Ramesh Karri, Muhammad Shafique
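TL;DR: This paper proposes GroundCount, a framework that combines the spatial localization ability of CNN-based object detection models (such as YOLO) with vision-language models (VLMs) to mitigate VLM hallucinations in counting tasks. Its prompt-based augmentation strategy reaches 81.3% counting accuracy on the best-performing model (Ovis2.5-2B), a gain of 6.6 percentage points, while cutting inference time by 22% by eliminating hallucination-driven reasoning loops.
Details
Motivation: VLMs hallucinate persistently in counting tasks, with accuracy far below that on other visual reasoning tasks, whereas CNN-based object detectors excel at spatial localization and instance counting at low computational cost. The work combines the strengths of both to mitigate counting hallucinations.
Result: Four of the five evaluated VLM architectures improve consistently (6.2-7.5 percentage points), with the best model, Ovis2.5-2B, reaching 81.3% counting accuracy. Ablations show that positional encoding is a critical component, helping stronger models but hurting weaker ones, while confidence scores generally introduce noise. Explicit symbolic grounding via structured prompts outperforms implicit feature fusion.
Insight: The novelty is using object detectors to supply explicit spatial grounding that strengthens VLMs and mitigates counting hallucinations. The study indicates that counting failures stem from fundamental limits in spatial-semantic integration rather than architecture-specific defects, and it stresses the importance of architectural compatibility in augmentation strategies.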
Abstract: Vision Language Models (VLMs) exhibit persistent hallucinations in counting tasks, with accuracy substantially lower than other visual reasoning tasks (excluding sentiment). This phenomenon persists even in state-of-the-art reasoning-capable VLMs. Conversely, CNN-based object detection models (ODMs) such as YOLO excel at spatial localization and instance counting with minimal computational overhead. We propose GroundCount, a framework that augments VLMs with explicit spatial grounding from ODMs to mitigate counting hallucinations. In the best case, our prompt-based augmentation strategy achieves 81.3% counting accuracy on the best-performing model (Ovis2.5-2B) - a 6.6pp improvement - while reducing inference time by 22% through elimination of hallucination-driven reasoning loops for stronger models. We conduct comprehensive ablation studies demonstrating that positional encoding is a critical component, being beneficial for stronger models but detrimental for weaker ones. Confidence scores, by contrast, introduce noise for most architectures and their removal improves performance in four of five evaluated models. We further evaluate feature-level fusion architectures, finding that explicit symbolic grounding via structured prompts outperforms implicit feature fusion despite sophisticated cross-attention mechanisms. Our approach yields consistent improvements across four of five evaluated VLM architectures (6.2–7.5pp), with one architecture exhibiting degraded performance due to incompatibility between its iterative reflection mechanisms and structured prompts. These results suggest that counting failures stem from fundamental spatial-semantic integration limitations rather than architecture-specific deficiencies, while highlighting the importance of architectural compatibility in augmentation strategies.
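To make the idea of "explicit symbolic grounding via structured prompts" concrete, the sketch below turns detector output into a text prefix for a VLM prompt. It is only an illustration under stated assumptions: the function name `build_grounded_prompt`, the prompt layout, and the `include_positions` switch are hypothetical and are not taken from the paper. Confidence scores are left out, in line with the abstract's finding that they add noise for most models.

```python
from collections import Counter


def build_grounded_prompt(question, detections, image_width, image_height,
                          include_positions=True):
    """Build a structured prompt from object detections (hypothetical layout).

    `detections` is a list of (label, (x_min, y_min, x_max, y_max)) tuples in
    pixel coordinates, e.g. as produced by any object detector.
    """
    counts = Counter(label for label, _ in detections)
    lines = ["Detected objects:"]
    for label, (x0, y0, x1, y1) in detections:
        if include_positions:
            # Normalised box centre as a coarse positional cue.
            cx = (x0 + x1) / 2 / image_width
            cy = (y0 + y1) / 2 / image_height
            lines.append(f"- {label} at ({cx:.2f}, {cy:.2f})")
        else:
            lines.append(f"- {label}")
    summary = ", ".join(f"{n} {label}" for label, n in sorted(counts.items()))
    lines.append(f"Counts: {summary}")
    lines.append(f"Question: {question}")
    return "\n".join(lines)
```

The `include_positions` flag reflects the reported ablation that positional information helps stronger models but can hurt weaker ones, so a caller might disable it for small VLMs.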
[67] Too Vivid to Be Real? Benchmarking and Calibrating Generative Color Fidelity cs.CVPDF
Zhengyao Fang, Zexi Jia, Yijia Zhong, Pengcheng Luo, Jinchao Zhang
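TL;DR: This paper addresses the tendency of text-to-image (T2I) models to produce overly vivid, unrealistic colors in realistic-style images. It proposes a dataset (CFD) and a metric (CFM) for objectively evaluating color fidelity, and introduces a training-free Color Fidelity Refinement method (CFR), forming a progressive framework for assessing and improving the color realism of T2I generation.
Details
Motivation: Although T2I models have improved in visual quality, their realistic-style images often look unreal because of excessive saturation and contrast. Existing evaluation paradigms (such as human ratings and preference-trained metrics) are biased toward visually vivid images, which pushes models to produce images that are "too vivid to be real".
Result: The paper builds the Color Fidelity Dataset (CFD), with over 1.3M real and synthetic images and ordered levels of color realism, and proposes the Color Fidelity Metric (CFM), which uses a multimodal encoder to learn perceptual color fidelity. The proposed Color Fidelity Refinement (CFR) adaptively modulates the spatial-temporal guidance scale during generation to enhance color authenticity.
Insight: The novelty is the first systematic attention to, and quantification of, color fidelity in T2I generation: a large-scale labeled dataset and a dedicated metric make the subjective assessment of color realism objective. CFR is a training-free post-processing/guidance-modulation technique that can be integrated directly into existing generation pipelines and improves color realism in an interpretable way (using the attention learned by CFM), closing the loop from evaluation to improvement.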
Abstract: Recent advances in text-to-image (T2I) generation have greatly improved visual quality, yet producing images that appear visually authentic to real-world photography remains challenging. This is partly due to biases in existing evaluation paradigms: human ratings and preference-trained metrics often favor visually vivid images with exaggerated saturation and contrast, which make generations often too vivid to be real even when prompted for realistic-style images. To address this issue, we present Color Fidelity Dataset (CFD) and Color Fidelity Metric (CFM) for objective evaluation of color fidelity in realistic-style generations. CFD contains over 1.3M real and synthetic images with ordered levels of color realism, while CFM employs a multimodal encoder to learn perceptual color fidelity. In addition, we propose a training-free Color Fidelity Refinement (CFR) that adaptively modulates spatial-temporal guidance scale in generation, thereby enhancing color authenticity. Together, CFD supports CFM for assessment, whose learned attention further guides CFR to refine T2I fidelity, forming a progressive framework for assessing and improving color fidelity in realistic-style T2I generation. The dataset and code are available at https://github.com/ZhengyaoFang/CFM.
[68] Does AI See like Art Historians? Interpreting How Vision Language Models Recognize Artistic Style cs.CV | cs.AIPDF
Marvin Limpijankit, Milad Alshomary, Yassin Oulad Daoud, Amith Ananthram, Tim Trombley
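TL;DR: Through an interdisciplinary collaboration between computer scientists and art historians, this paper studies the mechanisms by which vision-language models (VLMs) predict artistic style and assesses whether they align with the criteria art historians use to reason about style. It uses latent-space decomposition to identify the concepts that drive style prediction, and conducts quantitative evaluation, causal analysis, and assessment by art historians.
Details
Motivation: VLMs are increasingly capable in the art domain (for example, artwork analysis and generation). The work asks whether the mechanisms behind their style predictions match the professional criteria of art historians.
Result: Art historians judged 73% of the extracted concepts to exhibit a coherent and semantically meaningful visual feature, and 90% of the concepts used to predict the style of a given artwork to be relevant. Where the model used an irrelevant concept to predict style successfully, art historians identified possible reasons (for example, the model may "understand" a concept in more formal terms, such as dark/light contrasts).
Insight: The novelty is an interdisciplinary method that pairs latent-space decomposition with human expert evaluation. It reveals how VLMs recognize artistic style and how far that aligns with human expertise, offering a new angle for interpreting AI decisions in the art domain.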
Abstract: VLMs have become increasingly proficient at a range of computer vision tasks, such as visual question answering and object detection. This includes increasingly strong capabilities in the domain of art, from analyzing artwork to generation of art. In an interdisciplinary collaboration between computer scientists and art historians, we characterize the mechanisms underlying VLMs’ ability to predict artistic style and assess the extent to which they align with the criteria art historians use to reason about artistic style. We employ a latent-space decomposition approach to identify concepts that drive art style prediction and conduct quantitative evaluations, causal analysis and assessment by art historians. Our findings indicate that 73% of the extracted concepts are judged by art historians to exhibit a coherent and semantically meaningful visual feature and 90% of concepts used to predict style of a given artwork were judged relevant. In cases where an irrelevant concept was used to successfully predict style, art historians identified possible reasons for its success; for example, the model might “understand” a concept in more formal terms, such as dark/light contrasts.
[69] DynVLA: Learning World Dynamics for Action Reasoning in Autonomous Driving cs.CV | cs.ROPDF
Shuyao Shang, Bing Zhan, Yunfei Yan, Yuqi Wang, Yingyan Li
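TL;DR: This paper proposes DynVLA, a vision-language-action model for autonomous driving built around a new chain-of-thought paradigm called Dynamics CoT. Before generating actions, the model forecasts a compact representation of world dynamics, leading to more informed and physically grounded decisions.
Details
Motivation: Existing chain-of-thought approaches fall short for driving decisions: Textual CoT lacks fine-grained spatiotemporal understanding, and Visual CoT introduces redundancy through dense image prediction. The work aims for compact, interpretable, and efficient world dynamics modeling that supports high-quality decisions.
Result: Extensive experiments on NAVSIM, Bench2Drive, and a large-scale in-house dataset show that DynVLA consistently outperforms Textual CoT and Visual CoT methods, validating the effectiveness and practical value of Dynamics CoT.
Insight: The main novelty is the Dynamics CoT paradigm. A Dynamics Tokenizer compresses future evolution into a small set of dynamics tokens, and ego-centric dynamics are decoupled from environment-centric dynamics, yielding more accurate, compact, and interpretable world dynamics modeling. Training with SFT and RFT keeps inference latency-efficient.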
Abstract: We propose DynVLA, a driving VLA model that introduces a new CoT paradigm termed Dynamics CoT. DynVLA forecasts compact world dynamics before action generation, enabling more informed and physically grounded decision-making. To obtain compact dynamics representations, DynVLA introduces a Dynamics Tokenizer that compresses future evolution into a small set of dynamics tokens. Considering the rich environment dynamics in interaction-intensive driving scenarios, DynVLA decouples ego-centric and environment-centric dynamics, yielding more accurate world dynamics modeling. We then train DynVLA to generate dynamics tokens before actions through SFT and RFT, improving decision quality while maintaining latency-efficient inference. Compared to Textual CoT, which lacks fine-grained spatiotemporal understanding, and Visual CoT, which introduces substantial redundancy due to dense image prediction, Dynamics CoT captures the evolution of the world in a compact, interpretable, and efficient form. Extensive experiments on NAVSIM, Bench2Drive, and a large-scale in-house dataset demonstrate that DynVLA consistently outperforms Textual CoT and Visual CoT methods, validating the effectiveness and practical value of Dynamics CoT.
[70] V2M-Zero: Zero-Pair Time-Aligned Video-to-Music Generation cs.CV | cs.AI | cs.LG | cs.MM | cs.SDPDF
Yan-Bo Lin, Jonah Casebeer, Long Mai, Aniruddha Mahapatra, Gedas Bertasius
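TL;DR: This paper proposes V2M-Zero, a zero-pair video-to-music generation method that produces music temporally aligned with a video. The core idea is to extract event curves from similarity computed within each modality separately. These curves capture temporal structure shared across modalities, so temporal synchronization is achieved without paired cross-modal data.
Details
Motivation: Existing text-to-music models lack fine-grained temporal control and struggle to generate music aligned in time with video events. The work rests on a key observation: temporal synchronization requires matching when and how much change occurs, not what changes.
Result: On OES-Pub, MovieGenBench-Music, and AIST++, V2M-Zero achieves substantial gains over paired-data baselines: 5-21% higher audio quality, 13-15% better semantic alignment, 21-52% improved temporal synchronization, and 28% higher beat alignment on dance videos. A large crowd-sourced subjective listening test shows similar results.
Insight: The novelty is a zero-pair method that needs neither paired cross-modal data nor cross-modal training: event curves computed independently within each modality capture the shared temporal structure and enable effective temporal alignment. The core insight is that alignment can come from within-modality features rather than cross-modal supervision, giving a simple and effective training strategy for video-to-music generation.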
Abstract: Generating music that temporally aligns with video events is challenging for existing text-to-music models, which lack fine-grained temporal control. We introduce V2M-Zero, a zero-pair video-to-music generation approach that outputs time-aligned music for video. Our method is motivated by a key observation: temporal synchronization requires matching when and how much change occurs, not what changes. While musical and visual events differ semantically, they exhibit shared temporal structure that can be captured independently within each modality. We capture this structure through event curves computed from intra-modal similarity using pretrained music and video encoders. By measuring temporal change within each modality independently, these curves provide comparable representations across modalities. This enables a simple training strategy: fine-tune a text-to-music model on music-event curves, then substitute video-event curves at inference without cross-modal training or paired data. Across OES-Pub, MovieGenBench-Music, and AIST++, V2M-Zero achieves substantial gains over paired-data baselines: 5-21% higher audio quality, 13-15% better semantic alignment, 21-52% improved temporal synchronization, and 28% higher beat alignment on dance videos. We find similar results via a large crowd-source subjective listening test. Overall, our results validate that temporal alignment through within-modality features, rather than paired cross-modal supervision, is effective for video-to-music generation. Results are available at https://genjib.github.io/v2m_zero/
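As a rough illustration of an event curve "computed from intra-modal similarity", the sketch below measures change between consecutive embeddings as one minus their cosine similarity. This is an assumed formulation for illustration only: the abstract does not give the exact formula, and `event_curve` and the toy embeddings are hypothetical. Because the curve depends only on how much each step changes, the same function could be applied to video-frame embeddings or to music-segment embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def event_curve(embeddings):
    """Amount of change between consecutive embeddings (assumed definition).

    Returns a list of length len(embeddings) - 1; values near 0 mean little
    change, larger values mean a stronger event at that time step.
    """
    return [
        1.0 - cosine_similarity(prev, cur)
        for prev, cur in zip(embeddings, embeddings[1:])
    ]
```

In this reading, a model conditioned on music-derived curves during fine-tuning could be given a video-derived curve at inference, since both live on the same scale.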
[71] COMIC: Agentic Sketch Comedy Generation cs.CV | cs.AI | cs.CL | cs.MA | cs.NEPDF
Susung Hong, Brian Curless, Ira Kemelmacher-Shlizerman, Steve Seitz
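TL;DR: The paper proposes COMIC, a fully automated AI system that generates short comedic videos in the style of Saturday Night Live. Starting from character references, the system employs a population of agents based on real production studio roles and optimizes the quality and diversity of ideas and outputs through iterative competition, evaluation, and improvement. A key contribution is the introduction of LLM critics, aligned with real viewer preferences through analysis of a corpus of YouTube comedy videos, to evaluate humor automatically. Experiments show that the framework produces videos approaching professional production quality and achieves SOTA performance in video generation.
Details
Motivation: The work tackles the challenge of automatically generating high-quality comedy videos, improving creative diversity and output quality by simulating a real production pipeline and introducing humor evaluation grounded in viewer preferences.
Result: The system achieves SOTA performance in video generation, and the generated sketches approach the quality of professional productions. Experiments validate the framework's effectiveness.
Insight: The novelties are a multi-agent competitive, iterative mechanism that simulates the production pipeline, and LLM critics that use YouTube data to align with viewer preferences for automatic humor evaluation. Together they offer a reusable approach to pipeline optimization and evaluation for AI-generated creative content.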
Abstract: We propose a fully automated AI system that produces short comedic videos similar to sketch shows such as Saturday Night Live. Starting with character references, the system employs a population of agents loosely based on real production studio roles, structured to optimize the quality and diversity of ideas and outputs through iterative competition, evaluation, and improvement. A key contribution is the introduction of LLM critics aligned with real viewer preferences through the analysis of a corpus of comedy videos on YouTube to automatically evaluate humor. Our experiments show that our framework produces results approaching the quality of professionally produced sketches while demonstrating state-of-the-art performance in video generation.
cs.LG [Back]
[72] Decoupling Reasoning and Confidence: Resurrecting Calibration in Reinforcement Learning from Verifiable Rewards cs.LG | cs.AI | cs.CLPDF
Zhengzhao Ma, Xueru Wen, Boxi Cao, Yaojie Lu, Hongyu Lin
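TL;DR: This paper proposes the DCPO framework. A theoretical analysis shows a fundamental gradient conflict in Reinforcement Learning from Verifiable Rewards (RLVR) between optimizing policy accuracy and minimizing calibration error. DCPO therefore systematically decouples the reasoning and confidence-calibration objectives to address the calibration degeneration and over-confidence that LLMs exhibit after RLVR training.
Details
Motivation: RLVR improves the reasoning ability of large language models but causes severe calibration degeneration: models become over-confident in incorrect answers.
Result: In extensive experiments, DCPO keeps reasoning accuracy on par with GRPO while achieving the best calibration performance and substantially mitigating over-confidence.
Insight: The core novelty is identifying, through theoretical analysis, the optimization conflict between reasoning and calibration, and decoupling the two. DCPO is a simple and effective framework that offers new ideas and a practical solution for more reliable LLM deployment.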
Abstract: Reinforcement Learning from Verifiable Rewards (RLVR) significantly enhances the reasoning of large language models (LLMs) but suffers severely from calibration degeneration, where models become excessively over-confident in incorrect answers. Previous studies directly incorporate a calibration objective into the existing optimization target. However, our theoretical analysis demonstrates that there exists a fundamental gradient conflict between maximizing policy accuracy and minimizing calibration error. Building on this insight, we propose DCPO, a simple yet effective framework that systematically decouples reasoning and calibration objectives. Extensive experiments demonstrate that DCPO not only preserves accuracy on par with GRPO but also achieves the best calibration performance and substantially mitigates the over-confidence issue. Our study provides valuable insights and a practical solution for more reliable LLM deployment.
[73] Explainable LLM Unlearning Through Reasoning cs.LG | cs.AI | cs.CLPDF
Junfeng Liao, Qizhou Wang, Shanshan Ye, Xin Yu, Ling Chen
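TL;DR: This paper proposes targeted reasoning unlearning (TRU), which introduces a reasoning-based unlearning target to guide the unlearning process of large language models (LLMs). The goal is to remove specific knowledge more precisely while preserving general capabilities, and to make unlearning more reliable and explainable.
Details
Motivation: Existing unlearning methods based on gradient ascent (GA) lack explicit guidance, which leads to unintended degradation of general capabilities, incomplete knowledge removal, and incoherent responses. The paper addresses these issues by introducing an explicit unlearning target.
Result: Evaluated on multiple benchmarks and LLM backbones, TRU achieves more reliable unlearning than strong baselines, better preserves general capabilities, and is more robust under diverse attack scenarios.
Insight: The novelty is the reasoning-based unlearning target as explicit guidance. Combining a supervised loss with a GA loss lets the model learn reasoning ability for precise knowledge removal, giving a practical paradigm for reliable and explainable LLM unlearning.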
Abstract: LLM unlearning is essential for mitigating safety, copyright, and privacy concerns in pre-trained large language models (LLMs). Compared to preference alignment, it offers a more explicit way by removing undesirable knowledge characterized by specific unlearning datasets. In previous works, gradient ascent (GA) and its variants have shown promise for implementing unlearning, yet their untargeted nature results in unintended degradation of general capabilities, incomplete removal of knowledge, and the generation of incoherent responses, among many others. We argue that these issues stem from the absence of explicit guidance on what and how models should unlearn. To fill this gap, we introduce a novel unlearning target, reasoning-based unlearning target, which satisfies both the specified unlearning scope and the specified post-unlearning response. Building on this, we propose targeted reasoning unlearning (TRU), which leverages reasoning-based unlearning target as guidance. We employ the target using a cross-entropy supervised loss combined with a GA-based loss, enabling the model to learn reasoning ability for precise knowledge removal while preserving unrelated abilities. We evaluate TRU against strong baselines across multiple benchmarks and LLM backbones, and find that it achieves more reliable unlearning while preserving general capabilities. Moreover, TRU exhibits superior robustness under diverse attack scenarios, stemming from the reasoning ability learned through reasoning-based targets. Overall, our study establishes reasoning-augmented unlearning as a practical paradigm for reliable and explainable LLM unlearning.
[74] Personalized Group Relative Policy Optimization for Heterogenous Preference Alignment cs.LG | cs.AI | cs.CLPDF
Jialu Wang, Heinrich Peters, Asad A. Butt, Navid Hashemi, Alireza Hashemi
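TL;DR: This paper proposes Personalized Group Relative Policy Optimization (P-GRPO), a new alignment framework that addresses the limitations of large language models (LLMs) in personalized preference alignment. It decouples advantage estimation from immediate batch statistics and normalizes against the reward history of each preference group, so the model can learn and adapt to diverse user preference distributions.
Details
Motivation: Standard post-training methods (such as Reinforcement Learning with Human Feedback, RLHF) usually optimize a single global objective, which makes it hard for LLMs to align with diverse individual preferences. The existing Group Relative Policy Optimization (GRPO) framework assumes in personalized settings that all samples are exchangeable, which conflates the reward distributions of different users, biases learning toward dominant preferences, and suppresses minority signals.
Result: Across multiple tasks, P-GRPO converges faster and reaches higher rewards than standard GRPO, improving the model's ability to recover and align with heterogeneous preference signals.
Insight: The novelty is an alignment framework that accounts for reward heterogeneity at the optimization level. Normalizing advantages against preference-group-specific reward histories rather than the current generation group preserves the contrastive signal needed to learn distinct preferences, aligning more faithfully with diverse human preferences without sacrificing general capabilities.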
Abstract: Despite their sophisticated general-purpose capabilities, Large Language Models (LLMs) often fail to align with diverse individual preferences because standard post-training methods, like Reinforcement Learning with Human Feedback (RLHF), optimize for a single, global objective. While Group Relative Policy Optimization (GRPO) is a widely adopted on-policy reinforcement learning framework, its group-based normalization implicitly assumes that all samples are exchangeable, inheriting this limitation in personalized settings. This assumption conflates distinct user reward distributions and systematically biases learning toward dominant preferences while suppressing minority signals. To address this, we introduce Personalized GRPO (P-GRPO), a novel alignment framework that decouples advantage estimation from immediate batch statistics. By normalizing advantages against preference-group-specific reward histories rather than the concurrent generation group, P-GRPO preserves the contrastive signal necessary for learning distinct preferences. We evaluate P-GRPO across diverse tasks and find that it consistently achieves faster convergence and higher rewards than standard GRPO, thereby enhancing its ability to recover and align with heterogeneous preference signals. Our results demonstrate that accounting for reward heterogeneity at the optimization level is essential for building models that faithfully align with diverse human preferences without sacrificing general capabilities.
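The contrast between batch-level and history-based normalization can be sketched as follows. This is a minimal illustration assuming a simple running list of rewards per preference group; the class `GroupHistoryNormalizer`, its methods, and the fallback for short histories are hypothetical and are not the authors' implementation.

```python
import statistics
from collections import defaultdict


def batch_advantages(rewards, eps=1e-6):
    """GRPO-style normalization against the concurrent batch of rewards."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


class GroupHistoryNormalizer:
    """Normalize a reward against its own preference group's reward history."""

    def __init__(self, eps=1e-6):
        self.history = defaultdict(list)  # group id -> past rewards
        self.eps = eps

    def observe(self, group_id, reward):
        """Record a reward for a preference group."""
        self.history[group_id].append(reward)

    def advantage(self, group_id, reward):
        """Advantage relative to the group's history (0.0 if too little data)."""
        past = self.history[group_id]
        if len(past) < 2:
            # Assumed fallback: no signal until the group has some history.
            return 0.0
        mean = statistics.mean(past)
        std = statistics.pstdev(past)
        return (reward - mean) / (std + self.eps)
```

With two groups whose rewards live on different scales, batch normalization can rank a response that is excellent for a low-scale group below an average response for a high-scale group, whereas the history-based version compares each reward only with its own group's past.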
[75] Training Language Models via Neural Cellular Automata cs.LG | cs.AI | cs.CLPDF
Dan Lee, Seungwook Han, Akarsh Kumar, Pulkit Agrawal
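TL;DR: The paper proposes using neural cellular automata (NCA) to generate synthetic data with rich spatiotemporal structure for pre-pre-training large language models (LLMs), that is, training on synthetic and then natural language. This targets the problems of natural language pre-training: limited high-quality data, human biases, and the entanglement of knowledge with reasoning. Experiments show that pre-pre-training on only 164M NCA tokens improves downstream language modeling by up to 6% and accelerates convergence by up to 1.6x. It even outperforms pre-pre-training on 1.6B natural language tokens from Common Crawl, although the latter uses more compute, and the gains transfer to reasoning benchmarks such as GSM8K and HumanEval.
Details
Motivation: Natural language pre-training faces finite high-quality text, human biases, and entangled knowledge and reasoning. The work explores a path to intelligence beyond natural language.
Result: On language modeling, performance improves by up to 6% and convergence accelerates by up to 1.6x. The gains transfer to reasoning benchmarks including GSM8K, HumanEval, and BigBench-Lite, and the approach even outperforms pre-pre-training on a larger natural language corpus (Common Crawl).
Insight: The novelty is using fully controllable NCA synthetic data, cheap to generate at scale, as a stage before pre-training, showing that non-linguistic data transfers to language model capabilities, including reasoning. The study finds that attention layers are the most transferable and that NCA complexity can be tuned systematically per domain (code, math, web text), opening a path toward more efficient models with fully synthetic pre-training.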
Abstract: Pre-training is crucial for large language models (LLMs), as it is when most representations and capabilities are acquired. However, natural language pre-training has problems: high-quality text is finite, it contains human biases, and it entangles knowledge with reasoning. This raises a fundamental question: is natural language the only path to intelligence? We propose using neural cellular automata (NCA) to generate synthetic, non-linguistic data for pre-pre-training LLMs–training on synthetic-then-natural language. NCA data exhibits rich spatiotemporal structure and statistics resembling natural language while being controllable and cheap to generate at scale. We find that pre-pre-training on only 164M NCA tokens improves downstream language modeling by up to 6% and accelerates convergence by up to 1.6x. Surprisingly, this even outperforms pre-pre-training on 1.6B tokens of natural language from Common Crawl with more compute. These gains also transfer to reasoning benchmarks, including GSM8K, HumanEval, and BigBench-Lite. Investigating what drives transfer, we find that attention layers are the most transferable, and that optimal NCA complexity varies by domain: code benefits from simpler dynamics, while math and web text favor more complex ones. These results enable systematic tuning of the synthetic distribution to target domains. More broadly, our work opens a path toward more efficient models with fully synthetic pre-training.
[76] Improving Search Agent with One Line of Code cs.LG | cs.CLPDF
Jian Li, Dongsheng Chen, Zhenhua Xu, Yizhang Jin, Jiafu Wu
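TL;DR: This paper proposes SAPO (Search Agent Policy Optimization), a method that addresses a training instability in Tool-based Agentic Reinforcement Learning (TARL): Importance Sampling Distribution Drift (ISDD). By adding one line of code to standard GRPO, it introduces a conditional token-level KL constraint that selectively penalizes positive tokens whose policy has shifted too far, stabilizing training and preventing model collapse.
Details
Motivation: GRPO, widely adopted in TARL, suffers from training instability: ISDD causes a sharp decline in importance sampling ratios, nullifies gradient updates, and triggers irreversible training failure.
Result: Extensive experiments on seven QA benchmarks show that SAPO improves over Search-R1 by +10.6% absolute (+31.5% relative), with consistent gains across model scales (1.5B, 14B) and families (Qwen, LLaMA).
Insight: The novelty is the conditional token-level KL constraint, which selectively penalizes low-probability positive tokens, preventing distribution drift while preserving gradient flow. Its practical appeal is that deployment requires only a one-line code change.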
Abstract: Tool-based Agentic Reinforcement Learning (TARL) has emerged as a promising paradigm for training search agents to interact with external tools for a multi-turn information-seeking process autonomously. However, we identify a critical training instability that leads to catastrophic model collapse: Importance Sampling Distribution Drift (ISDD). In Group Relative Policy Optimization (GRPO), a widely adopted TARL algorithm, ISDD manifests as a precipitous decline in the importance sampling ratios, which nullifies gradient updates and triggers irreversible training failure. To address this, we propose Search Agent Policy Optimization (SAPO), which stabilizes training via a conditional token-level KL constraint. Unlike hard clipping, which ignores distributional divergence, SAPO selectively penalizes the KL divergence between the current and old policies. Crucially, this penalty is applied only to positive tokens with low probabilities where the policy has shifted excessively, thereby preventing distribution drift while preserving gradient flow. Remarkably, SAPO requires only a one-line code modification to standard GRPO, ensuring immediate deployability. Extensive experiments across seven QA benchmarks demonstrate that SAPO achieves +10.6% absolute improvement (+31.5% relative) over Search-R1, yielding consistent gains across varying model scales (1.5B, 14B) and families (Qwen, LLaMA).
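The following sketch shows one way a conditional token-level KL penalty could sit on top of a clipped policy-gradient loss. It is not the authors' one-line change: the gating rule (positive advantage and importance ratio below `ratio_floor`), the single-sample KL estimate `logp_old - logp_new`, and all default values are assumptions made for illustration.

```python
import math


def conditional_kl_token_loss(logp_new, logp_old, advantage,
                              clip_eps=0.2, ratio_floor=0.5, kl_coef=0.1):
    """Per-token loss: clipped surrogate plus a gated KL penalty (hypothetical)."""
    ratio = math.exp(logp_new - logp_old)  # importance sampling ratio
    clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    # Standard clipped surrogate, written as a loss to minimise.
    pg_loss = -min(ratio * advantage, clipped * advantage)
    # Assumed condition: penalise only positive-advantage tokens whose
    # probability has dropped far below its value under the old policy.
    penalise = advantage > 0 and ratio < ratio_floor
    kl_term = (logp_old - logp_new) if penalise else 0.0
    return pg_loss + kl_coef * kl_term
```

For a positive token whose probability has collapsed, the added term pulls the new policy back toward the old one, while tokens that satisfy neither condition keep the unmodified surrogate loss.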
[77] CLIPO: Contrastive Learning in Policy Optimization Generalizes RLVR cs.LG | cs.AI | cs.CLPDF
Sijia Cui, Pengyu Cheng, Jiajun Song, Yongbo Gai, Guojun Zhang
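TL;DR: This paper proposes CLIPO, which integrates a contrastive learning mechanism into policy optimization to improve the RLVR (Reinforcement Learning with Verifiable Rewards) framework. By optimizing a contrastive loss over successful rollouts, it steers large language models to capture the invariant structure shared across correct reasoning paths. This mitigates the hallucination and answer-copying caused by rollouts with wrong intermediate steps but correct final answers, and improves generalization and robustness.
Details
Motivation: RLVR relies only on the final answer as reward and ignores the correctness of intermediate reasoning steps. Training on rollouts with a wrong process but a correct outcome leads to hallucination and answer-copying, harming generalization and robustness.
Result: Across multiple reasoning benchmarks, CLIPO consistently improves several RLVR baselines, showing uniform gains in generalization and robustness for LLM policy optimization.
Insight: The novelty is bringing contrastive learning into policy optimization. Cross-trajectory regularization through a contrastive loss replaces RLVR's single-path supervision and captures the invariant structure of correct reasoning, suppressing step-level inconsistencies and hallucination more effectively.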
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has significantly advanced the reasoning capacity of Large Language Models (LLMs). However, RLVR solely relies on final answers as outcome rewards, neglecting the correctness of intermediate reasoning steps. Training on these process-wrong but outcome-correct rollouts can lead to hallucination and answer-copying, severely undermining the model’s generalization and robustness. To address this, we incorporate a Contrastive Learning mechanism into the Policy Optimization (CLIPO) to generalize the RLVR process. By optimizing a contrastive loss over successful rollouts, CLIPO steers the LLM to capture the invariant structure shared across correct reasoning paths. This provides a more robust cross-trajectory regularization than the original single-path supervision in RLVR, effectively mitigating step-level reasoning inconsistencies and suppressing hallucinatory artifacts. In experiments, CLIPO consistently improves multiple RLVR baselines across diverse reasoning benchmarks, demonstrating uniform improvements in generalization and robustness for policy optimization of LLMs. Our code and training recipes are available at https://github.com/Qwen-Applications/CLIPO.
[78] ReMix: Reinforcement routing for mixtures of LoRAs in LLM finetuning cs.LG | cs.CLPDF
Ruizhong Qiu, Hanqing Zeng, Yinglong Xia, Yiwen Meng, Ren Chen
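TL;DR: This paper proposes ReMix, a new router design that addresses the extreme imbalance of routing weights in existing Mixture-of-LoRAs models. ReMix uses non-learnable routing weights so that all active LoRAs are equally effective, and obtains an unbiased gradient estimate through the RLOO technique from reinforcement learning, substantially improving expressive power while remaining parameter-efficient.
Details
Motivation: Routers in existing Mixture-of-LoRAs models use learnable weights, which in practice become extremely imbalanced (often dominated by one or two LoRAs) and severely limit expressive power.
Result: With a comparable number of activated parameters, ReMix significantly outperforms state-of-the-art parameter-efficient finetuning methods across multiple tasks.
Insight: The novelty is forcing balanced LoRA contributions through non-learnable routing weights and using a reinforcement learning technique (RLOO) to solve the resulting training problem. More broadly, it offers mixture-of-experts systems a routing design that prevents a few dominant experts from taking over.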
Abstract: Low-rank adapters (LoRAs) are a parameter-efficient finetuning technique that injects trainable low-rank matrices into pretrained models to adapt them to new tasks. Mixture-of-LoRAs models expand neural networks efficiently by routing each layer input to a small subset of specialized LoRAs of the layer. Existing Mixture-of-LoRAs routers assign a learned routing weight to each LoRA to enable end-to-end training of the router. Despite their empirical promise, we observe that the routing weights are typically extremely imbalanced across LoRAs in practice, where only one or two LoRAs often dominate the routing weights. This essentially limits the number of effective LoRAs and thus severely hinders the expressive power of existing Mixture-of-LoRAs models. In this work, we attribute this weakness to the nature of learnable routing weights and rethink the fundamental design of the router. To address this critical issue, we propose a new router design that we call Reinforcement Routing for Mixture-of-LoRAs (ReMix). Our key idea is to use non-learnable routing weights to ensure that all active LoRAs are equally effective, with no LoRA dominating the routing weights. However, our routers cannot be trained directly via gradient descent because the routing weights are non-learnable. Hence, we further propose an unbiased gradient estimator for the router by employing the reinforce leave-one-out (RLOO) technique, where we regard the supervision loss as the reward and the router as the policy in reinforcement learning. Our gradient estimator also makes it possible to scale up training compute to boost the predictive performance of ReMix. Extensive experiments demonstrate that our proposed ReMix significantly outperforms state-of-the-art parameter-efficient finetuning methods under a comparable number of activated parameters.
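The reinforce leave-one-out idea can be illustrated with its baseline computation: each sampled routing decision is scored against the mean reward of the other samples. The sketch below shows only that step, under the assumption that the reward is the negative supervision loss; the function name `rloo_advantages` is illustrative and not from the paper.

```python
def rloo_advantages(rewards):
    """Leave-one-out advantages for k >= 2 sampled routing decisions.

    For sample i the baseline is the mean reward of the other k - 1 samples,
    so the baseline does not depend on sample i itself.
    """
    k = len(rewards)
    if k < 2:
        raise ValueError("RLOO needs at least two samples")
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]


def rewards_from_losses(losses):
    """Assumed reward definition: lower supervision loss means higher reward."""
    return [-loss for loss in losses]
```

A router update would then weight the gradient of each sample's log-probability by its advantage and average over the samples; drawing more samples per input is one way training compute could be scaled up.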
[79] Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning cs.LG | cs.CLPDF
Zichao Li, Jie Lou, Fangchen Dong, Zhiyuan Fan, Mengjie Ren
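TL;DR: This paper proposes Group Relative Reward Rescaling (GR^3), a new method for the length inflation that arises when training large language models with reinforcement learning, where models become verbose or reason inefficiently to maximize reward. It reframes length control as multiplicative rescaling and adds group-relative regularization and advantage-aware calibration to achieve lossless optimization. In both RLHF and RLVR settings it mitigates length inflation while keeping training dynamics and downstream performance comparable to standard GRPO.
Details
Motivation: Length inflation is widespread when reinforcement learning is used to enhance LLM capabilities. Existing methods cannot handle it losslessly: additive penalties create optimization shortcuts through a compensatory effect, and heuristic gating lacks generality.
Result: In RLHF and RLVR settings, GR^3 significantly mitigates length inflation while maintaining training dynamics and downstream performance comparable to standard GRPO, outperforming state-of-the-art length-regularized baselines.
Insight: The novelty is reframing length control as multiplicative rescaling, together with group-relative regularization and advantage-aware calibration. This yields a general, continuous, reward-dependent gating mechanism that adapts length budgets to instance difficulty and preserves the advantage signal of high-quality trajectories, enabling lossless length control.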
Abstract: Reinforcement learning significantly enhances LLM capabilities but suffers from a critical issue: length inflation, where models adopt verbosity or inefficient reasoning to maximize rewards. Prior approaches struggle to address this challenge in a general and lossless manner, primarily because additive penalties introduce a compensatory effect that creates optimization shortcuts, while heuristic gating strategies lack generality beyond binary feedback. To bridge this gap, we present Group Relative Reward Rescaling (GR$^3$), which reframes length control as a multiplicative rescaling paradigm, effectively establishing a generalized, continuous, and reward-dependent gating mechanism. To further ensure lossless optimization, we incorporate group-relative regularization and advantage-aware calibration, which dynamically adapt length budgets to instance difficulty and preserve the advantage signal of high-quality trajectories. Empirically, across both RLHF and RLVR settings, GR$^3$ maintains training dynamics and downstream performance comparable to standard GRPO while significantly mitigating length inflation, outperforming state-of-the-art length-regularized baselines.
[80] Reinforcement Learning with Conditional Expectation Reward cs.LG | cs.AI | cs.CLPDF
Changyi Xiao, Caijun Xu, Yixin Cao
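TL;DR: This paper proposes Conditional Expectation Reward (CER), a new method for strengthening reinforcement learning of large language models on general reasoning tasks. It uses the model itself as an implicit verifier: the soft reward is the expected likelihood of the reference answer conditioned on the generated answer, which overcomes the limits of rule-based verification in domains with free-form answers.
Details
Motivation: Reinforcement Learning with Verifiable Rewards (RLVR) depends on handcrafted, domain-specific verification rules. This limits its use in general reasoning domains where answers vary in form and complete, accurate rules are hard to write.
Result: Experiments show that CER is effective across a wide range of reasoning tasks, spanning mathematical and general domains, indicating that it is feasible as a flexible and general verification mechanism.
Insight: The main novelty is defining a soft reward signal (the conditional expectation reward) with the large language model itself as an implicit verifier. This removes the need for external verifiers or auxiliary models and handles tasks where answers differ in degree of correctness, extending reinforcement learning to general reasoning domains.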
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has proven effective in enhancing the reasoning capabilities of large language models, particularly in domains such as mathematics where reliable rule-based verifiers can be constructed. However, the reliance on handcrafted, domain-specific verification rules substantially limits the applicability of RLVR to general reasoning domains with free-form answers, where valid answers often exhibit significant variability, making it difficult to establish complete and accurate rules. To address this limitation, we propose Conditional Expectation Reward (CER), which leverages the large language model itself as an implicit verifier, and is therefore applicable to general domains and eliminates the need for external verifiers or auxiliary models. CER is defined as the expected likelihood of generating the reference answer conditioned on the generated answer. In contrast to rule-based verifiers that yield binary feedback, CER provides a soft, graded reward signal that reflects varying degrees of correctness, making it better suited to tasks where answers vary in correctness. Experimental results demonstrate that CER is effective across a wide range of reasoning tasks, spanning both mathematical and general domains, indicating that CER serves as a flexible and general verification mechanism. The code is available at https://github.com/changyi7231/CER.
[81] Towards Cold-Start Drafting and Continual Refining: A Value-Driven Memory Approach with Application to NPU Kernel Synthesis cs.LG | cs.AI | cs.CLPDF
Yujie Zheng, Zhuo Li, Shengtao Zhang, Hanjing Wang, Junjie Sheng
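TL;DR: This paper proposes EvoKernel, a self-evolving agentic framework that addresses the cold-start problem of kernel synthesis on data-scarce domain-specific architectures such as NPUs. The framework models synthesis as a memory-based reinforcement learning task and, through a value-driven retrieval mechanism and cross-task memory sharing, automates the process from initial draft generation to continual refinement, substantially improving correctness and performance on NPU kernel synthesis.
Details
Motivation: Deploying large language models in data-scarce programming domains, such as kernel synthesis for emerging domain-specific architectures like NPUs, runs into a cold-start challenge. The aim is to avoid expensive fine-tuning and still achieve efficient kernel generation and optimization from scratch.
Result: Evaluated on a purpose-built NPU variant of KernelBench, EvoKernel raises frontier models' correctness from 11.0% to 83.0% and, through iterative refinement, reaches a median speedup of 3.60x over initial drafts.
Insight: The innovations are: formalizing kernel synthesis as a memory-based reinforcement learning task; a novel value-driven retrieval mechanism that learns stage-specific Q-values to prioritize the experiences contributing most to the current objective (for example, bootstrapping a feasible draft or iteratively reducing latency); and cross-task memory sharing that generalizes knowledge from simple to complex operators. Together these show that value-guided experience accumulation lets general-purpose models master kernel synthesis on niche hardware ecosystems.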
Abstract: Deploying Large Language Models to data-scarce programming domains poses significant challenges, particularly for kernel synthesis on emerging Domain-Specific Architectures where a “Data Wall” limits available training data. While models excel on data-rich platforms like CUDA, they suffer catastrophic performance drops on data-scarce ecosystems such as NPU programming. To overcome this cold-start barrier without expensive fine-tuning, we introduce EvoKernel, a self-evolving agentic framework that automates the lifecycle of kernel synthesis from initial drafting to continual refining. EvoKernel addresses this by formulating the synthesis process as a memory-based reinforcement learning task. Through a novel value-driven retrieval mechanism, it learns stage-specific Q-values that prioritize experiences based on their contribution to the current objective, whether bootstrapping a feasible draft or iteratively refining latency. Furthermore, by enabling cross-task memory sharing, the agent generalizes insights from simple to complex operators. On an NPU variant of KernelBench that we build, EvoKernel improves frontier models’ correctness from 11.0% to 83.0% and achieves a median speedup of 3.60x over initial drafts through iterative refinement. This demonstrates that value-guided experience accumulation allows general-purpose models to master the kernel synthesis task on niche hardware ecosystems. Our official page is available at https://evokernel.zhuo.li.
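The idea of stage-specific Q-values over a shared experience memory can be illustrated with a toy Python sketch. Everything here is an assumption made for illustration: the class name `ValueMemory`, the stage labels `"draft"` and `"refine"`, the step size `alpha`, and the running-average update rule are not taken from the paper.

```python
from collections import defaultdict


class ValueMemory:
    """Toy experience memory holding one value estimate per (stage, item)."""

    def __init__(self, alpha: float = 0.5):
        self.alpha = alpha            # step size of the running update (assumed)
        self.q = defaultdict(float)   # (stage, item) -> estimated value, default 0.0
        self.items: list[str] = []

    def add(self, item: str) -> None:
        if item not in self.items:
            self.items.append(item)

    def retrieve(self, stage: str, k: int = 2) -> list[str]:
        # Highest stage-specific value first; ties keep insertion order.
        return sorted(self.items, key=lambda it: -self.q[(stage, it)])[:k]

    def update(self, stage: str, item: str, reward: float) -> None:
        # Move the estimate a fraction alpha toward the observed reward.
        key = (stage, item)
        self.q[key] += self.alpha * (reward - self.q[key])


mem = ValueMemory()
for note in ["tiling note", "api usage example", "vectorize loop"]:
    mem.add(note)

mem.update("draft", "api usage example", 1.0)  # helped produce a compiling draft
mem.update("refine", "tiling note", 1.0)       # helped reduce latency
mem.update("refine", "vectorize loop", 0.4)

print(mem.retrieve("draft", k=1))
print(mem.retrieve("refine", k=2))
```

Because values are keyed by stage, the same memory item can rank high when bootstrapping a draft and low when refining latency, which is the behaviour the abstract describes.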
[82] $V_{0.5}$: Generalist Value Model as a Prior for Sparse RL Rollouts cs.LG | cs.AI | cs.CLPDF
Yi-Kai Zhang, Yueqing Sun, Hongyan Hao, Qi Gu, Xunliang Cai
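TL;DR: This paper proposes $V_{0.5}$, a generalist value model that adaptively fuses the prediction of a pre-trained value model (used as a prior) with the empirical mean from sparse policy rollouts, yielding a robust advantage baseline that is computationally efficient and has very low variance. The method introduces real-time statistical testing and dynamic budget allocation to balance the high variance of sparse sampling against the systematic bias inherent in the value model's prior, minimizing the baseline estimator's mean squared error and keeping policy gradients stable.
Details
Motivation: In reinforcement learning with verifiable rewards, a robust advantage baseline is essential for policy-gradient methods. Existing generalist value models supply a pre-trained value prior, but there is a trade-off between the prior's inherent systematic bias (or hallucination) and the high variance of sparse sampling, so a method is needed that fuses the two adaptively to build a better baseline.
Result: Extensive evaluation on six mathematical reasoning benchmarks shows that $V_{0.5}$ significantly outperforms GRPO and DAPO, with faster convergence and a performance improvement of roughly 10% or more, and remains stable even under extreme sparsity with a group size of only 4.
Insight: The core innovation is a framework of real-time statistical testing and dynamic budget allocation that assesses the reliability of the value model's prior online and allocates extra rollout budget on demand. This balances computational efficiency against estimation accuracy (the variance-bias trade-off) and offers a new approach to baseline construction in sparse reinforcement learning.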
Abstract: In Reinforcement Learning with Verifiable Rewards (RLVR), constructing a robust advantage baseline is critical for policy gradients, effectively guiding the policy model to reinforce desired behaviors. Recent research has introduced Generalist Value Models (such as $V_0$), which achieve pre-trained value estimation by explicitly encoding model capabilities in-context, eliminating the need to synchronously update the value model alongside the policy model. In this paper, we propose $V_{0.5}$, which adaptively fuses the baseline predicted by such a value model (acting as a prior) with the empirical mean derived from sparse rollouts. This constructs a robust baseline that balances computational efficiency with extremely low variance. Specifically, we introduce a real-time statistical testing and dynamic budget allocation mechanism. This balances the high variance caused by sparse sampling against the systematic bias (or hallucinations) inherent in the value model’s prior. By constructing a hypothesis test to evaluate the prior’s reliability in real-time, the system dynamically allocates additional rollout budget on demand. This mechanism minimizes the baseline estimator’s Mean Squared Error (MSE), guaranteeing stable policy gradients, even under extreme sparsity with a group size of 4. Extensive evaluations across six mathematical reasoning benchmarks demonstrate that $V_{0.5}$ significantly outperforms GRPO and DAPO, achieving faster convergence and a performance improvement of roughly 10% or more.
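A rough Python sketch can show what fusing a prior with a sparse empirical mean, gated by a reliability test, might look like. This is not the paper's estimator: it assumes a generic z-test on the gap between prior and sample mean and ordinary inverse-variance weighting, and the parameters `prior_var` and `z_crit` are hypothetical.

```python
from math import sqrt
from statistics import fmean, variance


def fused_baseline(prior: float, rewards: list[float],
                   prior_var: float = 0.01, z_crit: float = 1.96):
    """Return (baseline, need_more_rollouts) for one prompt's rollout group."""
    n = len(rewards)
    if n < 2 or prior_var <= 0:
        raise ValueError("need at least two rollouts and a positive prior variance")
    mean = fmean(rewards)
    var_mean = variance(rewards) / n  # estimated variance of the sample mean

    # Reliability check: is the prior implausibly far from the observed mean?
    z = abs(mean - prior) / sqrt(var_mean + prior_var)
    if z > z_crit:
        # Prior looks biased: fall back to the data and ask for more rollouts.
        return mean, True

    # Otherwise shrink toward the prior; a noisier sample mean gives the prior more weight.
    w_prior = var_mean / (var_mean + prior_var)
    return w_prior * prior + (1.0 - w_prior) * mean, False


print(fused_baseline(0.7, [1, 0, 1, 1]))  # prior agrees with the four rollouts
print(fused_baseline(0.9, [0, 0, 0, 0]))  # prior contradicted by the rollouts
```

With only four rollouts the sample mean is noisy, so an agreeing prior dominates the fused value; a contradicted prior is discarded and triggers a request for additional budget.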
eess.AS [Back]
[83] Calibration-Reasoning Framework for Descriptive Speech Quality Assessment eess.AS | cs.CLPDF
Elizaveta Kostenok, Mathieu Salzmann, Milos Cernak
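TL;DR: This paper proposes a novel post-training calibration-reasoning framework for explainable speech quality assessment. Through a calibration stage and a reinforcement-learning-based reasoning stage, it enables a foundation audio large language model to predict multiple perceptual dimensions and to detect and classify audio artifacts, reaching a mean PCC of 0.71 on the QualiSpeech benchmark and a 13% improvement in MOS prediction.
Details
Motivation: Explainable speech quality assessment needs to go beyond Mean Opinion Scores (MOS) and analyze the underlying perceptual dimensions. The paper aims to develop a method that lets a model reason across multiple dimensions and detect and classify audio artifacts.
Result: The method reaches a mean PCC of 0.71 (state of the art) on the multidimensional QualiSpeech benchmark and a 13% improvement in MOS prediction through RL-based reasoning, and it substantially improves the model's ability to localize and classify audio artifacts in time.
Insight: The innovation is a post-training framework that combines a calibration stage with a GRPO-based reinforcement learning stage and uses dimension-specific rewards to improve description accuracy and the temporal localization of quality issues. This offers a new approach to adapting audio large language models to fine-grained, explainable speech quality assessment.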
Abstract: Explainable speech quality assessment requires moving beyond Mean Opinion Scores (MOS) to analyze underlying perceptual dimensions. To address this, we introduce a novel post-training method that tailors a foundational Audio Large Language Model for multidimensional reasoning and for the detection and classification of audio artifacts. First, a calibration stage aligns the model to predict predefined perceptual dimensions. Second, a reinforcement learning stage leverages Group Relative Policy Optimization (GRPO) with dimension-specific rewards to substantially improve the accuracy of descriptions and the temporal localization of quality issues. With this approach, we reach a state-of-the-art mean PCC of 0.71 on the multidimensional QualiSpeech benchmark and a 13% improvement in MOS prediction driven by RL-based reasoning. Furthermore, our fine-grained GRPO rewards substantially advance the model’s ability to pinpoint and classify audio artifacts in time.
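For readers unfamiliar with GRPO, the group-relative advantage it relies on can be sketched as below. The normalization (reward minus group mean, divided by group standard deviation) is the commonly described form; the per-dimension reward names and weights are hypothetical and not taken from the paper, and implementations differ on whether the population or sample standard deviation is used.

```python
from statistics import fmean, pstdev


def combine(dim_rewards: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted sum of per-dimension rewards (dimension names are hypothetical)."""
    return sum(weights[d] * r for d, r in dim_rewards.items())


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize each reward against the other responses to the same prompt."""
    mu = fmean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use the sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


weights = {"noise": 0.5, "distortion": 0.5}
group = [
    {"noise": 1.0, "distortion": 1.0},
    {"noise": 0.0, "distortion": 0.0},
    {"noise": 1.0, "distortion": 0.0},
    {"noise": 0.0, "distortion": 1.0},
]
rewards = [combine(g, weights) for g in group]  # [1.0, 0.0, 0.5, 0.5]
print(group_relative_advantages(rewards))
```

Responses that score above the group average receive positive advantages and those below receive negative ones, so no separately trained value network is needed for the baseline.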
[84] Speech Codec Probing from Semantic and Phonetic Perspectives eess.AS | cs.CLPDF
Xuan Shi, Chang Zeng, Tiantian Feng, Shih-Heng Wang, Jianbo Ma
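TL;DR: This paper systematically analyzes the information encoded by several mainstream speech tokenizers. Using word-level probing tasks, layerwise representation analysis, and cross-modal alignment metrics such as CKA, it shows that these tokenizers mainly capture the acoustic and phonetic structure of speech rather than lexical semantics, and it offers practical recommendations for designing next-generation speech tokenizers.
Details
Motivation: Speech tokenizers in multimodal systems show a mismatch between the semantic and acoustic information they are expected to preserve. The paper addresses this to improve their compatibility with large language models (LLMs) and downstream task performance.
Result: Experiments find that current speech tokenizers mainly encode phonetic information rather than lexical semantics, which provides empirical grounding for the design of next-generation speech tokenization methods.
Insight: The innovation is to disentangle speech representations from both semantic and phonetic perspectives, exposing the limitations of existing tokenizers. The systematic probing framework and the cross-modal alignment analysis can be reused to evaluate other speech representation models.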
Abstract: Speech tokenizers are essential for connecting speech to large language models (LLMs) in multimodal systems. These tokenizers are expected to preserve both semantic and acoustic information for downstream understanding and generation. However, emerging evidence suggests that what is termed “semantic” in speech representations does not align with text-derived semantics: a mismatch that can degrade multimodal LLM performance. In this paper, we systematically analyze the information encoded by several widely used speech tokenizers, disentangling their semantic and phonetic content through word-level probing tasks, layerwise representation analysis, and cross-modal alignment metrics such as CKA. Our results show that current tokenizers primarily capture phonetic rather than lexical-semantic structure, and we derive practical implications for the design of next-generation speech tokenization methods.
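As background on the alignment metric mentioned above, here is a small dependency-free Python sketch of linear CKA, using the standard formula on column-centered feature matrices. Whether the paper uses the linear or a kernel variant is not stated in the abstract, so treat this as an assumption; the example matrices are toy data.

```python
from math import sqrt


def center(m):
    """Subtract each column's mean (rows are examples, columns are features)."""
    n = len(m)
    means = [sum(col) / n for col in zip(*m)]
    return [[v - mu for v, mu in zip(row, means)] for row in m]


def cross_sq_norm(a, b):
    """Squared Frobenius norm of a^T b, computed column pair by column pair."""
    total = 0.0
    for col_a in zip(*a):
        for col_b in zip(*b):
            dot = sum(x * y for x, y in zip(col_a, col_b))
            total += dot * dot
    return total


def linear_cka(x, y):
    """Linear CKA: ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F) on centered X, Y."""
    xc, yc = center(x), center(y)
    denom = sqrt(cross_sq_norm(xc, xc) * cross_sq_norm(yc, yc))
    return cross_sq_norm(xc, yc) / denom  # undefined if either input is constant


speech = [[1.0, 2.0], [3.0, 1.0], [0.0, 4.0], [2.0, 2.0]]  # toy "speech" features
scaled = [[3.0 * v for v in row] for row in speech]        # same features, rescaled
text = [[1.0], [0.0], [1.0], [0.0]]                        # toy "text" features

print(linear_cka(speech, scaled))
print(linear_cka(speech, text))
```

Linear CKA is invariant to isotropic scaling and orthogonal transformations of the features, which is why it is a convenient way to compare representations of different width from different modalities.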
cs.SE [Back]
[85] Safety Under Scaffolding: How Evaluation Conditions Shape Measured Safety cs.SE | cs.AI | cs.CL | cs.LGPDF
David Gringras
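TL;DR: Through a large-scale controlled study (N=62,808, covering six frontier models and four deployment configurations), this paper systematically evaluates how different "scaffolds" (such as reasoning traces, critic agents, and delegation pipelines) affect measured safety of large language models. It finds that evaluation format (for example, multiple-choice versus open-ended) shifts safety scores by 5-20 percentage points, far more than scaffold architecture itself; within the same format, most scaffold effects fall inside the pre-specified ±2 percentage point equivalence margin. Model-scaffold interactions are substantial (differences of up to 35 percentage points), and model safety rankings reverse completely across benchmarks, so no reliable composite safety index can be built.
Details
Motivation: Current safety benchmarks usually evaluate language models in isolation and in multiple-choice form, whereas production deployments place models inside "scaffolds" with various processing steps. This mismatch between evaluation and deployment conditions can bias safety measurements, so the effect of scaffolds on measured safety needs to be studied.
Result: Under the pre-registered ±2 percentage point equivalence test (TOST), the other two scaffold architectures remain practically equivalent to the baseline, while the map-reduce scaffold significantly degrades safety (NNH=14). Switching the evaluation format from multiple-choice to open-ended on identical items shifts safety scores by 5-20 percentage points, more than any scaffold effect. Model safety rankings reverse completely across benchmarks, and the generalisability coefficient of a composite safety index is G=0.000.
Insight: The core contribution is a large-scale, pre-registered controlled experiment that systematically quantifies how evaluation conditions (scaffolds and question format in particular) affect safety measurements of language models. Viewed objectively, the study shows that safety evaluation depends heavily on the specific evaluation setup, challenges the generality of isolated benchmarks, stresses the need to test each model and deployment configuration specifically, and identifies evaluation format as a more important operative variable than scaffold architecture.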
Abstract: Safety benchmarks evaluate language models in isolation, typically using multiple-choice format; production deployments wrap these models in agentic scaffolds that restructure inputs through reasoning traces, critic agents, and delegation pipelines. We report one of the largest controlled studies of scaffold effects on safety (N = 62,808; six frontier models, four deployment configurations), combining pre-registration, assessor blinding, equivalence testing, and specification curve analysis. Map-reduce scaffolding degrades measured safety (NNH = 14), yet two of three scaffold architectures preserve safety within practically meaningful margins. Investigating the map-reduce degradation revealed a deeper measurement problem: switching from multiple-choice to open-ended format on identical items shifts safety scores by 5-20 percentage points, larger than any scaffold effect. Within-format scaffold comparisons are consistent with practical equivalence under our pre-registered +/-2 pp TOST margin, isolating evaluation format rather than scaffold architecture as the operative variable. Model x scaffold interactions span 35 pp in opposing directions (one model degrades by -16.8 pp on sycophancy under map-reduce while another improves by +18.8 pp on the same benchmark), ruling out universal claims about scaffold safety. A generalisability analysis yields G = 0.000: model safety rankings reverse so completely across benchmarks that no composite safety index achieves non-zero reliability, making per-model, per-configuration testing a necessary minimum standard. We release all code, data, and prompts as ScaffoldSafety.
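Two statistics in the abstract, the number needed to harm (NNH) and the TOST equivalence test with a ±2 pp margin, can be illustrated with a short Python sketch. The counts below are invented, and the normal-approximation test for a difference of two proportions is an assumption; the abstract does not say which test statistic the study used.

```python
from math import erf, sqrt


def norm_cdf(z: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def number_needed_to_harm(p_unsafe_baseline: float, p_unsafe_scaffold: float) -> float:
    """NNH = 1 / absolute risk increase."""
    return 1.0 / (p_unsafe_scaffold - p_unsafe_baseline)


def tost_two_proportions(x1, n1, x2, n2, margin=0.02, alpha=0.05):
    """Two one-sided z-tests for |p1 - p2| < margin (normal approximation)."""
    p1, p2 = x1 / n1, x2 / n2
    diff = p1 - p2
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    p_lower = 1.0 - norm_cdf((diff + margin) / se)  # H0: diff <= -margin
    p_upper = norm_cdf((diff - margin) / se)        # H0: diff >= +margin
    p_value = max(p_lower, p_upper)
    return diff, p_value, p_value < alpha


# Hypothetical counts of safe responses out of 10,000 items per condition.
print(tost_two_proportions(9000, 10000, 8990, 10000))  # 0.1 pp gap
print(tost_two_proportions(9000, 10000, 8300, 10000))  # 7 pp gap

# An NNH of 14 corresponds to an absolute risk increase of about 7.1 pp.
print(number_needed_to_harm(0.10, 0.10 + 1 / 14))
```

Equivalence is declared only when both one-sided tests reject, so a large, well-powered sample with a tiny gap passes while a 7 pp gap cannot.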
cs.AI [Back]
[86] Daily-Omni: Towards Audio-Visual Reasoning with Temporal Alignment across Modalities cs.AI | cs.CL | cs.CVPDF
Ziwei Zhou, Rui Wang, Zuxuan Wu, Yu-Gang Jiang
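TL;DR: The paper introduces Daily-Omni, a multiple-choice audio-visual question answering benchmark focused on cross-modal temporal reasoning, with 684 real-world videos and 1,197 questions across 6 task categories. The authors develop a semi-automatic annotation pipeline to build the benchmark and evaluate 24 foundation models under 37 model-modality settings, finding that existing end-to-end MLLMs still struggle with questions that require precise temporal alignment.
Details
Motivation: Existing MLLMs perform well on visual and audio tasks separately, but their ability to process cross-modal information synchronously, especially reasoning that requires temporal alignment, remains underexplored. A dedicated benchmark is needed to evaluate and advance this ability.
Result: An extensive evaluation of 24 foundation models on Daily-Omni shows that many end-to-end MLLMs perform poorly on alignment-critical questions, highlighting cross-modal temporal alignment as an important open challenge.
Insight: The contribution is an audio-visual QA benchmark that requires explicit cross-modal temporal reasoning, together with a semi-automatic annotation pipeline and a training-free modular diagnostic baseline, giving a systematic toolset for evaluating and diagnosing model weaknesses in cross-modal temporal alignment.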
Abstract: Recent Multimodal Large Language Models (MLLMs) achieve promising performance on visual and audio benchmarks independently. However, the ability of these models to process cross-modal information synchronously remains largely unexplored. We introduce Daily-Omni, a multiple-choice Audio-Visual QA benchmark featuring 684 real-world videos and 1,197 questions spanning 6 task families that explicitly require cross-modal temporal reasoning. To support scalable benchmark construction, we develop a semi-automatic pipeline for annotation, cross-modal consistency refinement, temporal alignment elicitation, and text-only leakage filtering, followed by human verification. We further provide a diagnostic evaluation suite and extensively evaluate 24 foundation models under 37 model–modality settings (Audio+Video / Audio-only / Video-only / Text-only). Finally, we include a training-free modular baseline that composes off-the-shelf unimodal models, both as a diagnostic reference point and to illustrate how explicit temporal alignment signals affect performance. Results indicate that many end-to-end MLLMs still struggle on alignment-critical questions, suggesting that robust cross-modal temporal alignment remains an important open challenge.
[87] IH-Challenge: A Training Dataset to Improve Instruction Hierarchy on Frontier LLMs cs.AI | cs.CL | cs.CR | cs.LGPDF
Chuan Guo, Juan Felipe Ceron Uribe, Sicheng Zhu, Christopher A. Choquette-Choo, Steph Lin
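TL;DR: The paper introduces IH-Challenge, a training dataset designed to improve the instruction hierarchy capability of frontier large language models through reinforcement learning. The instruction hierarchy defines how an LLM prioritizes system, developer, user, and tool instructions when they conflict. Fine-tuning GPT-5-Mini on the dataset substantially improves instruction hierarchy robustness, reduces unsafe behavior, and preserves model capabilities.
Details
Motivation: The instruction hierarchy is key to defending against jailbreaks, system prompt extraction, and agentic prompt injection, but robust instruction hierarchy behavior is hard to train: its failures can be confounded with instruction-following failures, conflicts can be subtle, and models can learn shortcuts such as over-refusing.
Result: After fine-tuning GPT-5-Mini on IH-Challenge with online adversarial example generation, average instruction hierarchy robustness across 16 in-distribution, out-of-distribution, and human red-teaming benchmarks improves by +10.0% (from 84.1% to 94.1%), and unsafe behavior drops from 6.6% to 0.7%. Helpfulness on general safety evaluations also improves, an internal static agentic prompt injection evaluation is saturated, and capability regression is minimal.
Insight: The paper contributes IH-Challenge, a dataset built specifically for training instruction hierarchy robustness, together with reinforcement learning fine-tuning that uses online adversarial example generation. Viewed objectively, its core contribution is a quantifiable training framework and benchmark for the prioritization policy that resolves complex instruction conflicts in LLMs, which has practical value for model safety and reliability.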
Abstract: Instruction hierarchy (IH) defines how LLMs prioritize system, developer, user, and tool instructions under conflict, providing a concrete, trust-ordered policy for resolving instruction conflicts. IH is key to defending against jailbreaks, system prompt extractions, and agentic prompt injections. However, robust IH behavior is difficult to train: IH failures can be confounded with instruction-following failures, conflicts can be nuanced, and models can learn shortcuts such as overrefusing. We introduce IH-Challenge, a reinforcement learning training dataset, to address these difficulties. Fine-tuning GPT-5-Mini on IH-Challenge with online adversarial example generation improves IH robustness by +10.0% on average across 16 in-distribution, out-of-distribution, and human red-teaming benchmarks (84.1% to 94.1%), reduces unsafe behavior from 6.6% to 0.7% while improving helpfulness on general safety evaluations, and saturates an internal static agentic prompt injection evaluation, with minimal capability regression. We release the IH-Challenge dataset (https://huggingface.co/datasets/openai/ih-challenge) to support future research on robust instruction hierarchy.
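The notion of a trust-ordered policy for resolving instruction conflicts can be pictured with a deliberately simplified Python sketch. It assumes that the order in which the abstract lists the roles (system, developer, user, tool) is the descending trust order, and it models instructions as key/value settings, which real instructions are not; the setting names are invented.

```python
# Lower number = more trusted (assumed ordering, following the abstract's listing).
PRIORITY = {"system": 0, "developer": 1, "user": 2, "tool": 3}


def resolve(instructions):
    """Keep, for each setting, the value from the most trusted role.

    `instructions` is a list of (role, key, value) tuples in arrival order.
    Among equally trusted roles, the earliest instruction wins.
    """
    resolved = {}
    for role, key, value in instructions:
        current = resolved.get(key)
        if current is None or PRIORITY[role] < PRIORITY[current[0]]:
            resolved[key] = (role, value)
    return resolved


messages = [
    ("user", "reveal_system_prompt", True),     # extraction attempt
    ("system", "reveal_system_prompt", False),  # system policy
    ("tool", "language", "fr"),                 # injected via a tool result
    ("user", "language", "en"),
]
print(resolve(messages))
```

The hard part that the dataset targets is exactly what this toy hides: recognizing in free text that two instructions conflict at all, without over-refusing when they do not.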
[88] Does LLM Alignment Really Need Diversity? An Empirical Study of Adapting RLVR Methods for Moral Reasoning cs.AI | cs.CL | cs.LGPDF
Zhaowei Zhang, Xiaohan Liu, Xuekai Zhu, Junchao Huang, Ceyao Zhang
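TL;DR: This paper empirically compares reward-maximizing methods with distribution-matching methods on large language model (LLM) alignment, specifically moral reasoning. Based on the MoReBench benchmark, it builds a rubric-grounded reward pipeline (using Qwen3-1.7B as the judge model) to stabilize RLVR training. Contrary to expectation, distribution-matching methods show no significant advantage on moral reasoning alignment, and standard reward-maximizing RLVR methods are equally effective.
Details
Motivation: Because moral reasoning usually admits multiple valid responses, the paper asks whether LLM alignment tasks, moral reasoning in particular, inherently require diversity-seeking distribution-matching algorithms rather than conventional reward-maximizing policy methods.
Result: Experiments on MoReBench show that distribution-matching methods do not outperform reward-maximizing methods as expected. Semantic visualization shows that high-reward responses in moral reasoning are more concentrated in semantic space, unlike mathematical reasoning where diverse strategies earn high rewards, which explains why mode-seeking optimization (reward maximization) is equally effective for moral reasoning alignment.
Insight: The paper is the first systematic comparison of the two paradigms on moral reasoning alignment, and it proposes a rubric-grounded reward pipeline for stable RLVR training. Objectively, the semantic concentration of high-reward responses in moral reasoning may reduce the need for explicit diversity mechanisms, so standard reward-maximizing methods transfer directly, which challenges the common assumption that alignment tasks need diversity-seeking algorithms.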
Abstract: Reinforcement learning with verifiable rewards (RLVR) has achieved remarkable success in logical reasoning tasks, yet whether large language model (LLM) alignment requires fundamentally different approaches remains unclear. Given the apparent tolerance for multiple valid responses in moral reasoning, a natural hypothesis is that alignment tasks inherently require diversity-seeking distribution-matching algorithms rather than reward-maximizing policy-based methods. We conduct the first comprehensive empirical study comparing both paradigms on MoReBench. To enable stable RLVR training, we build a rubric-grounded reward pipeline by training a Qwen3-1.7B judge model. Contrary to our hypothesis, we find that distribution-matching approaches do not show the expected significant advantages over reward-maximizing methods on alignment tasks. Through semantic visualization mapping high-reward responses to semantic space, we demonstrate that moral reasoning exhibits more concentrated high-reward distributions than mathematical reasoning, where diverse solution strategies yield similarly high rewards. This counter-intuitive finding explains why mode-seeking optimization proves equally or more effective for alignment tasks. Our results suggest that alignment tasks do not inherently require diversity-preserving algorithms, and standard reward-maximizing RLVR methods can effectively transfer to moral reasoning without explicit diversity mechanisms.
cs.SD [Back]
[89] ID-LoRA: Identity-Driven Audio-Video Personalization with In-Context LoRA cs.SD | cs.CV | cs.GRPDF
Aviad Dahan, Moran Yanuka, Noa Kraicer, Lior Wolf, Raja Giryes
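TL;DR: ID-LoRA is a personalization method that jointly generates a person's appearance and voice in video, with a text prompt, a reference image, and a short audio clip controlling both modalities together. Built on the LTX-2 joint audio-video diffusion model and adapted with parameter-efficient In-Context LoRA, it is the first method to personalize visual appearance and voice in a single generative pass.
Details
Motivation: Existing video personalization methods handle video and audio separately, so audio models cannot synchronize with on-screen actions, and classical voice-cloning models rely only on a reference recording and cannot change speaking style or acoustic environment through a text prompt. A unified model that jointly generates a subject's appearance and voice is therefore needed.
Result: In human preference studies, ID-LoRA is preferred over Kling 2.6 Pro by 73% for voice similarity and by 65% for speaking style. In cross-environment settings, speaker similarity improves by 24% over Kling, and the advantage grows as conditions diverge. The method achieves these results with only about 3K training pairs on a single GPU.
Insight: The innovations are: 1) negative temporal position encoding, which places reference tokens in a disjoint RoPE region to separate reference tokens from generated ones; 2) identity guidance, a classifier-free guidance variant that strengthens speaker-specific features by contrasting predictions with and without the reference signal; 3) evidence that joint generation provides a useful inductive bias for physically grounded sound synthesis.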
Abstract: Existing video personalization methods preserve visual likeness but treat video and audio separately. Without access to the visual scene, audio models cannot synchronize sounds with on-screen actions; and because classical voice-cloning models condition only on a reference recording, a text prompt cannot redirect speaking style or acoustic environment. We propose ID-LoRA (Identity-Driven In-Context LoRA), which jointly generates a subject’s appearance and voice in a single model, letting a text prompt, a reference image, and a short audio clip govern both modalities together. ID-LoRA adapts the LTX-2 joint audio-video diffusion backbone via parameter-efficient In-Context LoRA and, to our knowledge, is the first method to personalize visual appearance and voice in a single generative pass. Two challenges arise. Reference and generation tokens share the same positional-encoding space, making them hard to distinguish; we address this with negative temporal positions, placing reference tokens in a disjoint RoPE region while preserving their internal temporal structure. Speaker characteristics also tend to be diluted during denoising; we introduce identity guidance, a classifier-free guidance variant that amplifies speaker-specific features by contrasting predictions with and without the reference signal. In human preference studies, ID-LoRA is preferred over Kling 2.6 Pro by 73% of annotators for voice similarity and 65% for speaking style. On cross-environment settings, speaker similarity improves by 24% over Kling, with the gap widening as conditions diverge. A preliminary user study further suggests that joint generation provides a useful inductive bias for physically grounded sound synthesis. ID-LoRA achieves these results with only ~3K training pairs on a single GPU. Code, models, and data will be released.
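The two mechanisms named in the abstract can be pictured with a short Python sketch. Both functions are illustrative assumptions rather than the authors' code: the position layout (reference tokens at -n..-1, generated tokens at 0..m-1) is one simple way to obtain a disjoint region that keeps internal order, and the guidance formula is the usual classifier-free guidance extrapolation applied to predictions with and without the reference signal.

```python
def temporal_positions(n_ref: int, n_gen: int):
    """Assign reference tokens negative positions and generated tokens non-negative ones."""
    ref = list(range(-n_ref, 0))  # e.g. [-3, -2, -1]: disjoint region, order preserved
    gen = list(range(n_gen))      # e.g. [0, 1, 2, 3]
    return ref, gen


def identity_guidance(pred_with_ref, pred_without_ref, scale: float):
    """Extrapolate away from the reference-free prediction (CFG-style, assumed form)."""
    return [wo + scale * (w - wo) for w, wo in zip(pred_with_ref, pred_without_ref)]


ref_pos, gen_pos = temporal_positions(3, 4)
print(ref_pos, gen_pos)

# scale = 1.0 reproduces the conditioned prediction; scale > 1.0 amplifies
# whatever the reference signal changed and leaves unchanged components alone.
print(identity_guidance([1.0, 2.0], [0.5, 2.0], scale=2.0))
```

In this form only the components that differ between the two predictions are amplified, which matches the stated goal of strengthening speaker-specific features without disturbing the rest of the output.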
cs.CR [Back]
[90] Naïve Exposure of Generative AI Capabilities Undermines Deepfake Detection cs.CR | cs.AI | cs.CVPDF
Sunpill Kim, Chanwoo Hwang, Minsu Kim, Jae Hong Seo
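TL;DR: This paper finds that the powerful reasoning and image refinement capabilities that generative AI systems expose through chatbot interfaces severely weaken existing deepfake detectors. Using only policy-compliant prompts and commercial AI systems, an attacker can produce images that evade detection, preserve identity, and improve perceptual quality at the same time, revealing a structural mismatch between current detection frameworks and real-world generative AI capabilities.
Details
Motivation: The study aims to show how the reasoning and image refinement capabilities that generative AI systems expose through user interfaces can be maliciously exploited to evade deepfake detection, exposing the mismatch between the threat models assumed by current detection frameworks and the capabilities of deployed generative AI.
Result: Experiments show that state-of-the-art deepfake detection methods fail under semantic-preserving image refinement. Refined images simultaneously evade detection, pass identity verification by commercial face recognition APIs, and show substantially higher perceptual quality. Commercial chatbot services pose a greater security risk than open-source models.
Insight: The key finding is that unrestricted reasoning in generative AI systems externalizes their authenticity criteria, and these criteria can be reused directly as image refinement objectives to evade detection systematically. This exposes a fundamental gap between current detection benchmarks and real-world AI capabilities and serves as an important warning for the design of future detection frameworks.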
Abstract: Generative AI systems increasingly expose powerful reasoning and image refinement capabilities through user-facing chatbot interfaces. In this work, we show that the naïve exposure of such capabilities fundamentally undermines modern deepfake detectors. Rather than proposing a new image manipulation technique, we study a realistic and already-deployed usage scenario in which an adversary uses only benign, policy-compliant prompts and commercial generative AI systems. We demonstrate that state-of-the-art deepfake detection methods fail under semantic-preserving image refinement. Specifically, we show that generative AI systems articulate explicit authenticity criteria and inadvertently externalize them through unrestricted reasoning, enabling their direct reuse as refinement objectives. As a result, refined images simultaneously evade detection, preserve identity as verified by commercial face recognition APIs, and exhibit substantially higher perceptual quality. Importantly, we find that widely accessible commercial chatbot services pose a significantly greater security risk than open-source models, as their superior realism, semantic controllability, and low-barrier interfaces enable effective evasion by non-expert users. Our findings reveal a structural mismatch between the threat models assumed by current detection frameworks and the actual capabilities of real-world generative AI. While detection baselines are largely shaped by prior benchmarks, deployed systems expose unrestricted authenticity reasoning and refinement despite stringent safety controls in other domains.