cs.CL

[1] PaperAudit-Bench: Benchmarking Error Detection in Research Papers for Critical Automated Peer Review cs.CL

Songjun Tu, Yiwen Ma, Jiahao Lin, Qichao Zhang, Xiangyuan Lan

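TL;DR: This paper introduces PaperAudit-Bench, a benchmark for evaluating error detection in research papers, aimed at making automated peer review more critical. It comprises PaperAudit-Dataset, which covers both section-local errors and errors that require cross-section reasoning, and PaperAudit-Review, a framework that combines structured error detection with evidence-aware review generation. Experiments show that detecting such errors is difficult in long-context settings, and that integrating explicit error detection into the review workflow yields stricter, more discriminative evaluations.

Motivation: Peer reviews generated by current LLMs are fluent, but they often lack critical rigor when a paper's substantive errors are subtle and scattered. The paper addresses this by providing a stricter evaluation benchmark and method for automated peer review.

Result: Experiments on PaperAudit-Bench reveal large variability in error detectability across models and detection depths. Relative to representative automated reviewing baselines, the framework with explicit error detection produces systematically stricter and more discriminative evaluations. The dataset also supports training lightweight LLM detectors via SFT and RL, enabling effective error detection at lower computational cost.

Insight: The novelty lies in a benchmark dedicated to error detection in research papers that separates local errors from those requiring cross-section reasoning, together with an automated review framework that combines structured error detection with evidence-aware generation. This provides a new dataset and methodology for improving the critical rigor and reliability of automated peer review. Viewed objectively, both the idea of structuring error detection and integrating it into the review workflow and the systematic analysis of how hard error detection is under long contexts are worth borrowing.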

Abstract: Large language models can generate fluent peer reviews, yet their assessments often lack sufficient critical rigor when substantive issues are subtle and distributed across a paper. In this paper, we introduce PaperAudit-Bench, which consists of two components: (1) PaperAudit-Dataset, an error dataset covering both errors identifiable within individual sections and those requiring cross-section reasoning, designed for controlled evaluation under long-context settings; and (2) PaperAudit-Review, an automated review framework that integrates structured error detection with evidence-aware review generation to support critical assessment. Experiments on PaperAudit-Bench reveal large variability in error detectability across models and detection depths, highlighting the difficulty of identifying such errors under long-context settings. Relative to representative automated reviewing baselines, incorporating explicit error detection into the review workflow produces systematically stricter and more discriminative evaluations, demonstrating its suitability for peer review. Finally, we show that the dataset supports training lightweight LLM detectors via SFT and RL, enabling effective error detection at reduced computational cost.


[2] PILOT: Planning via Internalized Latent Optimization Trajectories for Large Language Models cs.CL

Haoyu Zheng, Yun Zhu, Yuqian Yuan, Bo Yuan, Wenqiao Zhang

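TL;DR: The paper proposes PILOT, a non-invasive framework that internalizes the strategic oversight of large models into compact large language models (LLMs) to improve their global planning in multi-step reasoning tasks. A lightweight hyper-network synthesizes a query-conditioned latent guidance vector that steers the model's representations at inference time, without modifying the backbone weights.

Motivation: Compact LLMs often cannot formulate global strategies in long-horizon tasks, which leads to error propagation. LLMs do show latent reasoning ability when given explicit plans from a teacher model, but relying on external guidance at runtime is impractical because of latency and availability constraints.

Result: Extensive experiments on mathematical and coding benchmarks show that PILOT effectively stabilizes reasoning trajectories and consistently outperforms strong baselines (e.g., +8.9% on MATH500) with negligible inference latency.

Insight: The novelty is a non-invasive planning method based on latent optimization trajectories: a hyper-network generates an internal latent guidance vector that implicitly steers reasoning, avoiding both weight modification and runtime dependence on external guidance, and thereby internalizing strategic capability efficiently.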

Abstract: Strategic planning is critical for multi-step reasoning, yet compact Large Language Models (LLMs) often lack the capacity to formulate global strategies, leading to error propagation in long-horizon tasks. Our analysis reveals that LLMs possess latent reasoning capabilities that can be unlocked when conditioned on explicit plans from a teacher model; however, runtime reliance on external guidance is often impractical due to latency and availability constraints. To bridge this gap, we propose PILOT (Planning via Internalized Latent Optimization Trajectories), a non-invasive framework designed to internalize the strategic oversight of large models into intrinsic Latent Guidance. Instead of altering backbone weights, PILOT employs a lightweight Hyper-Network to synthesize a query-conditioned Latent Guidance vector. This vector acts as an internal steering mechanism, guiding the model’s representations toward optimal reasoning paths. Extensive experiments on mathematical and coding benchmarks demonstrate that PILOT effectively stabilizes reasoning trajectories, consistently outperforming strong baselines (e.g., +8.9% on MATH500) with negligible inference latency.
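
The steering mechanism described in this abstract can be illustrated with a minimal sketch. It assumes the hyper-network is a single linear map and that guidance is simply added to one hidden state; the class `HyperNetwork`, the function `steer`, and all dimensions are hypothetical and are not taken from the paper.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class HyperNetwork:
    """Hypothetical one-layer hyper-network: query embedding -> guidance vector."""

    def __init__(self, query_dim, hidden_dim, seed=0):
        rng = random.Random(seed)
        # Only these weights would be trained; the backbone stays frozen.
        self.weights = [
            [rng.uniform(-0.1, 0.1) for _ in range(query_dim)]
            for _ in range(hidden_dim)
        ]

    def guidance(self, query_embedding):
        return matvec(self.weights, query_embedding)


def steer(hidden_state, guidance_vector, scale=1.0):
    """Add the guidance vector to a hidden state (assumed additive steering)."""
    return [h + scale * g for h, g in zip(hidden_state, guidance_vector)]


# Toy values standing in for a query embedding and one backbone hidden state.
query_embedding = [0.5, -1.0, 0.25]
hidden_state = [1.0, 2.0, 3.0, 4.0]

hyper = HyperNetwork(query_dim=3, hidden_dim=4)
guidance_vector = hyper.guidance(query_embedding)
steered_state = steer(hidden_state, guidance_vector)
```

Because the guidance is computed once per query and applied as a vector addition, the extra inference cost in this sketch is a single small matrix-vector product, which is in the spirit of the "negligible inference latency" claim.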


[3] Demystifying Multi-Agent Debate: The Role of Confidence and Diversity cs.CL | cs.AI

Xiaochen Zhu, Caiqi Zhang, Yizhou Chi, Tom Stafford, Nigel Collier

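TL;DR: This paper studies why multi-agent debate (MAD) often fails to improve large language model performance, and argues that vanilla MAD lacks two key mechanisms: diversity of initial viewpoints and calibrated confidence communication. The authors propose two lightweight interventions, a diversity-aware initialisation and a confidence-modulated debate protocol, to make debate more effective.

Motivation: Vanilla multi-agent debate (MAD) is computationally expensive yet often underperforms simple majority voting. Drawing on research in human deliberation and collective decision-making, the paper identifies and supplies the mechanisms missing from vanilla MAD so that debate reliably improves LLM performance on reasoning tasks.

Result: Across six reasoning-oriented QA benchmarks, the proposed methods consistently outperform vanilla MAD and majority vote.

Insight: The novelty is in formalizing insights from human deliberation (viewpoint diversity and confidence communication) and bringing them into the LLM debate framework. Specifically: 1) diversity-aware initialisation raises the prior probability that a correct hypothesis is present when debate begins; 2) a confidence-modulated update protocol lets debate drift systematically toward the correct hypothesis. These simple, principled modifications substantially improve debate effectiveness.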

Abstract: Multi-agent debate (MAD) is widely used to improve large language model (LLM) performance through test-time scaling, yet recent work shows that vanilla MAD often underperforms simple majority vote despite higher computational cost. Studies show that, under homogeneous agents and uniform belief updates, debate preserves expected correctness and therefore cannot reliably improve outcomes. Drawing on findings from human deliberation and collective decision-making, we identify two key mechanisms missing from vanilla MAD: (i) diversity of initial viewpoints and (ii) explicit, calibrated confidence communication. We propose two lightweight interventions. First, a diversity-aware initialisation that selects a more diverse pool of candidate answers, increasing the likelihood that a correct hypothesis is present at the start of debate. Second, a confidence-modulated debate protocol in which agents express calibrated confidence and condition their updates on others’ confidence. We show theoretically that diversity-aware initialisation improves the prior probability of MAD success without changing the underlying update dynamics, while confidence-modulated updates enable debate to systematically drift to the correct hypothesis. Empirically, across six reasoning-oriented QA benchmarks, our methods consistently outperform vanilla MAD and majority vote. Our results connect human deliberation with LLM-based debate and demonstrate that simple, principled modifications can substantially enhance debate effectiveness.
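
To make the two interventions concrete, the following is a simplified sketch, not the authors' protocol: it keeps a pool of distinct candidate answers as a stand-in for diversity-aware initialisation, and replaces the multi-round confidence-conditioned updates with a one-shot confidence-weighted vote. The function names and the toy answers and confidences are assumptions.

```python
from collections import Counter, defaultdict


def majority_vote(answers):
    """Baseline: return the most frequent answer."""
    return Counter(answers).most_common(1)[0][0]


def diverse_pool(candidates, k):
    """Keep up to k distinct candidate answers, in first-seen order."""
    return list(dict.fromkeys(candidates))[:k]


def confidence_weighted_vote(answers, confidences):
    """Aggregate answers, weighting each agent by its stated confidence."""
    totals = defaultdict(float)
    for answer, confidence in zip(answers, confidences):
        totals[answer] += confidence
    return max(totals, key=totals.get)


# Two weakly confident agents agree on "A"; one well-calibrated agent says "B".
answers = ["A", "A", "B"]
confidences = [0.3, 0.3, 0.9]

baseline = majority_vote(answers)                           # "A"
weighted = confidence_weighted_vote(answers, confidences)   # "B"
pool = diverse_pool(["A", "A", "B", "A", "C"], k=3)         # ["A", "B", "C"]
```

The example shows why confidence matters only if it is calibrated: the weighted vote overrides the majority precisely because the dissenting agent's confidence is assumed to be trustworthy.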


[4] HEART: A Unified Benchmark for Assessing Humans and LLMs in Emotional Support Dialogue cs.CL | cs.AI

Laya Iyer, Kriti Aggarwal, Sanmi Koyejo, Gail Heyman, Desmond C. Ong

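TL;DR: This paper introduces HEART, the first framework to directly compare humans and large language models on the same multi-turn emotional-support conversations, with responses scored on five dimensions by human raters and LLM-as-judge evaluators. Frontier models approach or surpass the average human response in empathy and consistency, while humans retain an edge in adaptive reframing, tension-naming, and nuanced tone shifts; human and LLM judge preferences agree about 80% of the time.

Motivation: Existing work lacks a clear way to compare the abilities of large language models and humans in interpersonal domains such as emotional-support dialogue. HEART fills this gap with a unified evaluation benchmark.

Result: Under HEART, several frontier models approach or exceed the average human response in perceived empathy and consistency. Human raters and LLM-as-judge evaluators agree on about 80% of pairwise comparisons, matching inter-human agreement. Humans retain advantages in adaptive reframing, tension-naming, and nuanced tone shifts in adversarial turns.

Insight: HEART reframes emotional-support dialogue as a capability axis separate from general reasoning or linguistic fluency. It offers a unified empirical basis for understanding where model-generated support aligns with or diverges from human social judgment, reveals a convergence in the evaluation criteria used by humans and models, and can be used to assess how affective conversational competence scales with model size.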

Abstract: Supportive conversation depends on skills that go beyond language fluency, including reading emotions, adjusting tone, and navigating moments of resistance, frustration, or distress. Despite rapid progress in language models, we still lack a clear way to understand how their abilities in these interpersonal domains compare to those of humans. We introduce HEART, the first-ever framework that directly compares humans and LLMs on the same multi-turn emotional-support conversations. For each dialogue history, we pair human and model responses and evaluate them through blinded human raters and an ensemble of LLM-as-judge evaluators. All assessments follow a rubric grounded in interpersonal communication science across five dimensions: Human Alignment, Empathic Responsiveness, Attunement, Resonance, and Task-Following. HEART uncovers striking behavioral patterns. Several frontier models approach or surpass the average human responses in perceived empathy and consistency. At the same time, humans maintain advantages in adaptive reframing, tension-naming, and nuanced tone shifts, particularly in adversarial turns. Human and LLM-as-judge preferences align on about 80 percent of pairwise comparisons, matching inter-human agreement, and their written rationales emphasize similar HEART dimensions. This pattern suggests an emerging convergence in the criteria used to assess supportive quality. By placing humans and models on equal footing, HEART reframes supportive dialogue as a distinct capability axis, separable from general reasoning or linguistic fluency. It provides a unified empirical foundation for understanding where model-generated support aligns with human social judgment, where it diverges, and how affective conversational competence scales with model size.


[5] OPT-Engine: Benchmarking the Limits of LLMs in Optimization Modeling via Complexity Scaling cs.CL | cs.AI | cs.LG

Yitian Chen, Cheng Cheng, Yinan Sun, Zi Ling, Dongdong Ge

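TL;DR: This paper proposes OPT-ENGINE, an extensible benchmark framework for evaluating large language models on optimization modeling, in which task difficulty can be adjusted in a controlled way through complexity scaling. The framework covers 10 canonical operations research tasks (five linear programming and five mixed-integer programming). With it, the study evaluates how robustly LLMs generalize to out-of-distribution tasks whose complexity exceeds current benchmarks, and identifies the main bottlenecks between problem interpretation and solution generation.

Motivation: Although LLMs have made notable progress in optimization modeling, the limits of their ability to automatically formulate and solve complex real-world problems remain unclear. Closing this gap requires a benchmark that systematically evaluates LLMs on tasks of scalable complexity.

Result: The empirical findings are: 1) as task complexity increases, tool-integrated reasoning with external solvers is significantly more robust, whereas pure-text reasoning quickly reaches a performance ceiling; 2) automated formulation of constraints is the primary performance bottleneck for current LLMs.

Insight: The novelty is a benchmark framework that systematically probes the limits of LLM optimization modeling through complexity scaling. Its central insights are that tool integration matters for high-complexity optimization tasks and that automated constraint formulation is the key bottleneck, which gives concrete direction for developing next-generation LLMs for advanced optimization.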

Abstract: Large Language Models (LLMs) have demonstrated impressive progress in optimization modeling, fostering a rapid expansion of new methodologies and evaluation benchmarks. However, the boundaries of their capabilities in automated formulation and problem solving remain poorly understood, particularly when extending to complex, real-world tasks. To bridge this gap, we propose OPT-ENGINE, an extensible benchmark framework designed to evaluate LLMs on optimization modeling with controllable and scalable difficulty levels. OPT-ENGINE spans 10 canonical tasks across operations research, with five Linear Programming and five Mixed-Integer Programming tasks. Utilizing OPT-ENGINE, we conduct an extensive study of LLMs’ reasoning capabilities, addressing two critical questions: (1) Does LLMs’ performance remain robust when generalizing to out-of-distribution optimization tasks that scale in complexity beyond current benchmark levels? and (2) At what stage, from problem interpretation to solution generation, do current LLMs encounter the most significant bottlenecks? Our empirical results yield two key insights: first, tool-integrated reasoning with external solvers exhibits significantly higher robustness as task complexity escalates, while pure-text reasoning reaches a ceiling; second, the automated formulation of constraints constitutes the primary performance bottleneck. These findings provide actionable guidance for developing next-generation LLMs for advanced optimization. Our code is publicly available at https://github.com/Cardinal-Operations/OPTEngine.


[6] Towards a Mechanistic Understanding of Large Reasoning Models: A Survey of Training, Inference, and Failures cs.CL

Yi Hu, Jiaqi Gu, Ruxin Wang, Zijun Yao, Hao Peng

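TL;DR: This paper surveys the mechanistic understanding of Large Reasoning Models (LRMs), organizing recent findings into three core dimensions (training dynamics, reasoning mechanisms, and unintended behaviors) to bridge the gap between black-box performance and mechanistic transparency, and discusses challenges and directions for future mechanistic research.

Motivation: Large reasoning models driven by reinforcement learning show strong reasoning ability, but their internal mechanisms remain opaque. Exploring the mechanisms behind these behaviors has become a critical research frontier, and the paper systematically organizes the relevant findings to advance mechanistic understanding.

Result: As a survey, the paper proposes no specific model or experiment. It consolidates existing work on training dynamics, reasoning mechanisms, and failure cases into a structured framework for understanding LRMs.

Insight: The novelty is in systematizing mechanistic research on LRMs along three dimensions and in stressing the need for applied interpretability, improved methodologies, and a unified theoretical framework, which together form a roadmap for future mechanistic analysis.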

Abstract: Reinforcement learning (RL) has catalyzed the emergence of Large Reasoning Models (LRMs) that have pushed reasoning capabilities to new heights. While their performance has garnered significant excitement, exploring the internal mechanisms driving these behaviors has become an equally critical research frontier. This paper provides a comprehensive survey of the mechanistic understanding of LRMs, organizing recent findings into three core dimensions: 1) training dynamics, 2) reasoning mechanisms, and 3) unintended behaviors. By synthesizing these insights, we aim to bridge the gap between black-box performance and mechanistic transparency. Finally, we discuss under-explored challenges to outline a roadmap for future mechanistic studies, including the need for applied interpretability, improved methodologies, and a unified theoretical framework.


[7] Text-to-State Mapping for Non-Resolution Reasoning: The Contradiction-Preservation Principle cs.CL | cs.AI | cs.LG

Kei Saito

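TL;DR: This paper introduces a text-to-state mapping function φ that converts natural-language input into superposition states in the Non-Resolution Reasoning (NRR) framework, and formalizes the Contradiction-Preservation Principle, which requires genuinely ambiguous expressions to keep non-zero entropy in their state representations. Using large language models as interpretation generators, the mapping is validated on 68 test sentences covering lexical, structural, and pragmatic ambiguity.

Motivation: The NRR framework aims to preserve semantic ambiguity rather than force premature collapse to a single interpretation, but how to map natural language onto the framework's mathematical structures (state spaces and operators) has remained an open question.

Result: On 68 test sentences, the proposed mapping achieves a mean Shannon entropy of H(S) = 1.087 bits for ambiguous inputs, whereas baseline single-interpretation approaches yield H(S) = 0.000.

Insight: The novelty is in formalizing the Contradiction-Preservation Principle and using existing LLMs as interpretation generators to build an algorithmic bridge between raw text and NRR's formal state space, enabling architectural collapse deferment in language model inference.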

Abstract: Non-Resolution Reasoning (NRR) provides a formal framework for maintaining semantic ambiguity rather than forcing premature interpretation collapse. While the foundational architecture establishes state spaces and operators for ambiguity-preserving computation, the critical question of how natural language maps to these mathematical structures remains open. This paper introduces the text-to-state mapping function φ that transforms linguistic input into superposition states within the NRR framework. We formalize the Contradiction-Preservation Principle, which requires that genuinely ambiguous expressions maintain non-zero entropy in their state representations, and develop extraction protocols using existing Large Language Models as interpretation generators. Empirical validation across 68 test sentences spanning lexical, structural, and pragmatic ambiguity demonstrates that our mapping achieves mean Shannon entropy H(S) = 1.087 bits for ambiguous inputs while baseline single-interpretation approaches yield H(S) = 0.000. The framework provides the missing algorithmic bridge between raw text and the formal state spaces on which NRR operators act, enabling architectural collapse deferment in language model inference.
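
The entropy figures quoted here follow from the standard Shannon entropy of a distribution over interpretations. The sketch below assumes a state is just a dictionary mapping interpretations to probabilities; this representation and the example sentence are illustrative choices, not the paper's data structures.

```python
import math


def shannon_entropy_bits(state):
    """Shannon entropy H(S) in bits of a {interpretation: probability} state."""
    return sum(-p * math.log2(p) for p in state.values() if p > 0)


def preserves_contradiction(state):
    """Contradiction-preservation check: ambiguous input keeps non-zero entropy."""
    return shannon_entropy_bits(state) > 0


# "I saw her duck": two readings kept in superposition (assumed equal weights).
ambiguous_state = {"duck = bird": 0.5, "duck = crouch": 0.5}

# A single-interpretation baseline collapses to one reading.
collapsed_state = {"duck = bird": 1.0}

h_ambiguous = shannon_entropy_bits(ambiguous_state)   # 1.0 bit
h_collapsed = shannon_entropy_bits(collapsed_state)   # 0.0 bits
```

A mean of 1.087 bits is slightly more than one bit, which is what two equally weighted readings would give; this is consistent with some test sentences carrying more than two interpretations or others carrying unequal weights.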


[8] LinguaMap: Which Layers of LLMs Speak Your Language and How to Tune Them? cs.CL | cs.AI | cs.LG

J. Ben Tamo, Daniel Carlander-Reuterfelt, Jonathan Rubin, Dezhi Hong, Mingxian Wang

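TL;DR: The paper studies language control in large language models on non-English tasks and identifies two key failure modes: the multilingual transfer bottleneck and the language consistency bottleneck. A four-scenario evaluation protocol and an extended logit lens analysis reveal a three-phase internal structure. Building on this, the authors fine-tune only the final layers responsible for language control. On Qwen-3-32B and Bloom-7.1B, tuning just 3-5% of parameters achieves over 98% language consistency across six languages without sacrificing task accuracy, at far lower compute than full fine-tuning.

Motivation: Despite multilingual pretraining, large language models still struggle with non-English tasks, especially language control (the ability to respond in the intended language). The paper sets out to identify and resolve these failure modes.

Result: On the MMLU, MGSM, and XQuAD benchmarks, the proposed selective fine-tuning achieves over 98% language consistency across six languages on Qwen-3-32B and Bloom-7.1B while preserving task accuracy. Its results are nearly identical to full fine-tuning (for example, both methods exceed 98% language consistency in all prompt scenarios), but it tunes only 3-5% of the parameters and uses far fewer computational resources.

Insight: The contributions are: 1) a systematic identification and characterization of two failure modes in LLM language control; 2) a three-phase picture of how LLMs process language internally (input alignment, task reasoning, language generation), obtained through extended logit lens and cross-lingual semantic similarity analysis; 3) the first proposal and validation of efficient multilingual adaptation based on localizing language-control layers, namely selective fine-tuning of the final layers, which is parameter-efficient without loss of performance.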

Abstract: Despite multilingual pretraining, large language models often struggle with non-English tasks, particularly in language control, the ability to respond in the intended language. We identify and characterize two key failure modes: the multilingual transfer bottleneck (correct language, incorrect task response) and the language consistency bottleneck (correct task response, wrong language). To systematically surface these issues, we design a four-scenario evaluation protocol spanning MMLU, MGSM, and XQuAD benchmarks. To probe these issues with interpretability, we extend logit lens analysis to track language probabilities layer by layer and compute cross-lingual semantic similarity of hidden states. The results reveal a three-phase internal structure: early layers align inputs into a shared semantic space, middle layers perform task reasoning, and late layers drive language-specific generation. Guided by these insights, we introduce selective fine-tuning of only the final layers responsible for language control. On Qwen-3-32B and Bloom-7.1B, this method achieves over 98 percent language consistency across six languages while fine-tuning only 3-5 percent of parameters, without sacrificing task accuracy. Importantly, this result is nearly identical to that of full-scope fine-tuning (for example, above 98 percent language consistency for both methods across all prompt scenarios) but uses a fraction of the computational resources. To the best of our knowledge, this is the first approach to leverage layer-localization of language control for efficient multilingual adaptation.
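
The layer-selection step behind "fine-tuning only 3-5 percent of parameters" can be sketched as a budget calculation. The sketch assumes that only transformer blocks are counted (embeddings and the output head are ignored) and uses a made-up 64-block model with equal block sizes; `select_trainable_layers` is a hypothetical helper, not part of the paper's code.

```python
def select_trainable_layers(layer_param_counts, budget_fraction):
    """Pick the last layers whose combined size fits within the parameter budget.

    Returns the selected layer indices and the fraction of parameters they hold.
    """
    total = sum(layer_param_counts)
    budget = budget_fraction * total
    selected = []
    used = 0
    # Walk backwards from the final layer, since late layers drive language choice.
    for index in range(len(layer_param_counts) - 1, -1, -1):
        count = layer_param_counts[index]
        if used + count > budget:
            break
        selected.append(index)
        used += count
    return sorted(selected), used / total


# Hypothetical model: 64 equally sized blocks, 5% parameter budget.
layer_param_counts = [1_000_000] * 64
trainable_layers, trainable_fraction = select_trainable_layers(layer_param_counts, 0.05)
# trainable_layers == [61, 62, 63]; trainable_fraction == 3 / 64 (about 4.7%)
```

In a real training setup, the returned indices would be the blocks left trainable while every other parameter is frozen.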


[9] TAIGR: Towards Modeling Influencer Content on Social Media via Structured, Pragmatic Inference cs.CL

Nishanth Sridhar Nakshatri, Eylon Caplan, Rajkumar Pujari, Dan Goldwasser

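TL;DR: This paper proposes TAIGR, a framework for modeling health influencer content on social media through structured, pragmatic inference. It works in three steps: identify the core recommendation (the takeaway), build an argumentation graph capturing the justification for it, and run factor graph-based probabilistic inference to validate the takeaway. The study shows that accurate validation requires modeling the content's pragmatic and argumentative structure rather than treating it as a flat collection of claims.

Motivation: Health influencers play a growing role in shaping public beliefs, but their content is often conveyed through conversational narratives and rhetorical strategies rather than explicit factual claims, so claim-centric verification methods struggle to capture its pragmatic meaning.

Result: TAIGR is evaluated on a content validation task over health influencer video transcripts; the results show that accurate validation requires modeling the pragmatic and argumentative structure of the discourse.

Insight: The novelty is in modeling influencer discourse as a structured argumentation graph and performing probabilistic inference over it, emphasizing the key role of pragmatic and argumentative structure in content validation, in contrast to conventional flat claim analysis.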

Abstract: Health influencers play a growing role in shaping public beliefs, yet their content is often conveyed through conversational narratives and rhetorical strategies rather than explicit factual claims. As a result, claim-centric verification methods struggle to capture the pragmatic meaning of influencer discourse. In this paper, we propose TAIGR (Takeaway Argumentation Inference with Grounded References), a structured framework designed to analyze influencer discourse, which operates in three stages: (1) identifying the core influencer recommendation (the takeaway); (2) constructing an argumentation graph that captures the influencer's justification for the takeaway; (3) performing factor graph-based probabilistic inference to validate the takeaway. We evaluate TAIGR on a content validation task over influencer video transcripts on health, showing that accurate validation requires modeling the discourse’s pragmatic and argumentative structure rather than treating transcripts as flat collections of claims.


[10] VERGE: Formal Refinement and Guidance Engine for Verifiable LLM Reasoning cs.CL | cs.AI

Vikash Singh, Darion Cassel, Nathaniel Weir, Nick Feng, Sam Bayless

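TL;DR: VERGE is a neurosymbolic framework that combines large language models (LLMs) with Satisfiability Modulo Theories (SMT) solvers to produce verifiable answers through iterative refinement. It decomposes LLM outputs into atomic claims, autoformalizes them into logical formulas, and checks their logical consistency with automated theorem proving, with the goal of improving the logical correctness of LLMs on reasoning tasks.

Motivation: Beyond syntactic fluency, ensuring the logical correctness of LLMs in high-stakes domains remains a fundamental challenge; the paper aims to provide verifiable reasoning.

Result: With the GPT-OSS-120B model, VERGE achieves an average performance uplift of 18.7% at convergence across several reasoning benchmarks compared to single-pass approaches.

Insight: The innovations are: multi-model consensus via formal semantic equivalence checking, which removes syntactic bias; semantic routing, which sends different claim types to symbolic solvers or to LLM ensembles for verification; and precise logical error localization via Minimal Correction Subsets (MCS), which yields actionable feedback. The framework gives formal guarantees where possible and falls back to consensus verification elsewhere, advancing trustworthy AI.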

Abstract: Despite the syntactic fluency of Large Language Models (LLMs), ensuring their logical correctness in high-stakes domains remains a fundamental challenge. We present a neurosymbolic framework that combines LLMs with SMT solvers to produce verification-guided answers through iterative refinement. Our approach decomposes LLM outputs into atomic claims, autoformalizes them into first-order logic, and verifies their logical consistency using automated theorem proving. We introduce three key innovations: (1) multi-model consensus via formal semantic equivalence checking to ensure logic-level alignment between candidates, eliminating the syntactic bias of surface-form metrics, (2) semantic routing that directs different claim types to appropriate verification strategies: symbolic solvers for logical claims and LLM ensembles for commonsense reasoning, and (3) precise logical error localization via Minimal Correction Subsets (MCS), which pinpoint the exact subset of claims to revise, transforming binary failure signals into actionable feedback. Our framework classifies claims by their logical status and aggregates multiple verification signals into a unified score with variance-based penalty. The system iteratively refines answers using structured feedback until acceptance criteria are met or convergence is achieved. This hybrid approach delivers formal guarantees where possible and consensus verification elsewhere, advancing trustworthy AI. With the GPT-OSS-120B model, VERGE demonstrates an average performance uplift of 18.7% at convergence across a set of reasoning benchmarks compared to single-pass approaches.
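
The idea of localizing errors with a correction subset can be shown on a toy example. This is only a sketch under strong assumptions: claims are propositional rather than first-order, consistency is checked by brute-force truth tables instead of an SMT solver, and only the smallest-cardinality correction subsets (a special case of minimal ones) are returned. All claim names are invented for illustration.

```python
from itertools import combinations, product


def is_satisfiable(claims, variables):
    """Brute-force check: does some truth assignment satisfy every claim?"""
    for values in product([False, True], repeat=len(variables)):
        assignment = dict(zip(variables, values))
        if all(claim(assignment) for claim in claims.values()):
            return True
    return False


def smallest_correction_subsets(claims, variables):
    """Return the smallest sets of claims whose removal restores consistency."""
    names = list(claims)
    for size in range(len(names) + 1):
        found = []
        for removed in combinations(names, size):
            kept = {name: claims[name] for name in names if name not in removed}
            if is_satisfiable(kept, variables):
                found.append(set(removed))
        if found:
            return found
    return []


variables = ["rain", "wet"]
claims = {
    "it_rained": lambda a: a["rain"],
    "rain_implies_wet": lambda a: (not a["rain"]) or a["wet"],
    "ground_is_wet": lambda a: a["wet"],
    "ground_is_dry": lambda a: not a["wet"],
}

corrections = smallest_correction_subsets(claims, variables)
# corrections == [{"ground_is_dry"}]: revising that one claim resolves every conflict.
```

Instead of a binary "inconsistent" verdict, the output names the single claim to revise, which is the kind of actionable feedback the abstract attributes to MCS-based localization.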


[11] Counterfactual Cultural Cues Reduce Medical QA Accuracy in LLMs: Identifier vs Context Effects cs.CL

Amirhossein Haji Mohammad Rezaei, Zahra Shakeri

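TL;DR: This paper introduces a counterfactual benchmark that expands 150 MedQA test items into 1650 variants by inserting culture-related identifiers, contextual cues, or their combination, and uses it to evaluate large language models on medical question answering. Cultural cues significantly reduce the diagnostic accuracy of several LLMs (including GPT-5.2 and Llama-3.1-8B), with the largest drop when identifier and context co-occur.

Motivation: Building sustainable and equitable healthcare systems requires medical language models that do not change a clinically correct diagnosis when presented with non-decisive cultural information. The paper evaluates how cultural cues affect LLM accuracy in medical QA.

Result: On the MedQA-based benchmark, cultural cues significantly affect the accuracy of the tested models (Cochran's Q test, p < 10^-14). The largest degradation occurs when identifier and context co-occur, reaching 3-7 percentage points under option-only prompting. A human-validated rubric (κ = 0.76) applied via an LLM-as-judge shows that more than half of culturally grounded explanations end in an incorrect answer.

Insight: The novelty is a counterfactual cultural benchmark that systematically measures LLM sensitivity to cultural information and links culture-referential reasoning to diagnostic failure. This provides methods and data for evaluating and mitigating culturally induced diagnostic errors.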

Abstract: Engineering sustainable and equitable healthcare requires medical language models that do not change clinically correct diagnoses when presented with non-decisive cultural information. We introduce a counterfactual benchmark that expands 150 MedQA test items into 1650 variants by inserting culture-related (i) identifier tokens, (ii) contextual cues, or (iii) their combination for three groups (Indigenous Canadian, Middle-Eastern Muslim, Southeast Asian), plus a length-matched neutral control, where a clinician verified that the gold answer remains invariant in all variants. We evaluate GPT-5.2, Llama-3.1-8B, DeepSeek-R1, and MedGemma (4B/27B) under option-only and brief-explanation prompting. Across models, cultural cues significantly affect accuracy (Cochran’s Q, $p<10^{-14}$), with the largest degradation when identifier and context co-occur (up to 3-7 percentage points under option-only prompting), while neutral edits produce smaller, non-systematic changes. A human-validated rubric ($\kappa=0.76$) applied via an LLM-as-judge shows that more than half of culturally grounded explanations end in an incorrect answer, linking culture-referential reasoning to diagnostic failure. We release prompts and augmentations to support evaluation and mitigation of culturally induced diagnostic errors.
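
Cochran's Q, the test cited for the accuracy differences, compares a binary outcome (correct or incorrect) for the same items across several conditions. The sketch below implements the standard statistic on a tiny invented table; the data are hypothetical and the p-value step is left out because it needs a chi-square distribution function.

```python
def cochran_q(outcomes):
    """Cochran's Q for a table where outcomes[i][j] is 1 if item i is correct
    under condition j, and 0 otherwise.

    Q = (k - 1) * (k * sum(C_j^2) - N^2) / (k * N - sum(R_i^2)),
    with column totals C_j, row totals R_i, and grand total N.
    """
    k = len(outcomes[0])
    column_totals = [sum(row[j] for row in outcomes) for j in range(k)]
    row_totals = [sum(row) for row in outcomes]
    grand_total = sum(row_totals)
    numerator = (k - 1) * (k * sum(c * c for c in column_totals) - grand_total ** 2)
    denominator = k * grand_total - sum(r * r for r in row_totals)
    return numerator / denominator


# Hypothetical results for 4 questions under 3 conditions:
# original, identifier only, identifier plus context.
outcomes = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 0],
    [1, 1, 1],
]

q_statistic = cochran_q(outcomes)  # 28 / 6, about 4.67
```

Under the null hypothesis of equal accuracy across conditions, Q approximately follows a chi-square distribution with k - 1 degrees of freedom. The denominator is zero when every item is answered identically under all conditions, so a real implementation would need to guard against that case.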


[12] Rewarding Intellectual Humility Learning When Not To Answer In Large Language Models cs.CL | cs.AI | cs.LG

Abha Jha, Akanksha Mahajan, Ashwath Vaithinathan Aravindan, Praveen Saravanan, Sai Sailaja Policharla

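TL;DR: This paper studies Reinforcement Learning with Verifiable Rewards (RLVR) as a training paradigm that promotes intellectual humility in large language models by rewarding the model for answering "I don't know" when uncertain, thereby reducing hallucinated and unverifiable content. Granite-3.3-2B-Instruct and Qwen-3-4B-Instruct are fine-tuned and evaluated on the MedMCQA and Hendrycks Math benchmarks to examine the effect of different reward structures.

Motivation: Large language models (LLMs) often produce hallucinated or unverifiable content, which undermines their reliability in factual domains. The paper uses a reinforcement learning paradigm that explicitly rewards abstention ("I don't know") under uncertainty to improve intellectual humility and reliability.

Result: Moderate abstention rewards (r_abs ≈ -0.25 to 0.3) consistently reduce incorrect responses on multiple-choice tasks without severe accuracy degradation, and larger models are more robust to abstention incentives. On open-ended question answering, insufficient exploration limits the approach, though supervised abstention training partially mitigates this.

Insight: The novelty is an RLVR framework with verifiable reward design: a ternary reward structure (-1, r_abs, 1) that penalizes errors, rewards abstention, and rewards correctness, offering a flexible and practical way to mitigate hallucination. Viewed objectively, combining supervised fine-tuning with reinforcement learning, so that the abstention policy is taught first, is an effective training strategy.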

Abstract: Large Language Models (LLMs) often produce hallucinated or unverifiable content, undermining their reliability in factual domains. This work investigates Reinforcement Learning with Verifiable Rewards (RLVR) as a training paradigm that explicitly rewards abstention (“I don’t know”) alongside correctness to promote intellectual humility. We fine-tune and evaluate Granite-3.3-2B-Instruct and Qwen-3-4B-Instruct on the MedMCQA and Hendrycks Math benchmarks using a ternary reward structure ($-1$, $r_{\mathrm{abs}}$, $1$) under varying abstention reward structures. We further study the effect of combining RLVR with supervised fine-tuning strategies that teach abstention prior to reinforcement learning. Our results show that moderate abstention rewards ($r_{\mathrm{abs}} \approx -0.25$ to $0.3$) consistently reduce incorrect responses without severe accuracy degradation on multiple-choice tasks, with larger models exhibiting greater robustness to abstention incentives. On open-ended question answering, we observe limitations due to insufficient exploration, which can be partially mitigated through supervised abstention training. Overall, these findings demonstrate the feasibility and flexibility of verifiable reward design as a practical approach for hallucination mitigation in language models. Reproducible code for our abstention training framework is available at https://github.com/Mystic-Slice/rl-abstention.
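
The ternary reward can be written down directly, and a short expected-value calculation shows how the abstention reward sets the confidence level below which abstaining pays off. The exact-match check and the abstention string used here are simplifying assumptions; the linked repository may verify answers differently.

```python
def ternary_reward(prediction, gold, r_abs, abstain_token="I don't know"):
    """Ternary reward: +1 if correct, -1 if wrong, r_abs if the model abstains."""
    if prediction == abstain_token:
        return r_abs
    return 1.0 if prediction == gold else -1.0


def abstention_threshold(r_abs):
    """Probability of being correct below which abstaining has higher expected reward.

    Answering with P(correct) = p yields p * 1 + (1 - p) * (-1) = 2p - 1,
    so abstaining is better whenever 2p - 1 < r_abs, i.e. p < (1 + r_abs) / 2.
    """
    return (1.0 + r_abs) / 2.0


reward_correct = ternary_reward("B", "B", r_abs=0.3)             # 1.0
reward_wrong = ternary_reward("A", "B", r_abs=0.3)               # -1.0
reward_abstain = ternary_reward("I don't know", "B", r_abs=0.3)  # 0.3

low_threshold = abstention_threshold(-0.25)   # 0.375
high_threshold = abstention_threshold(0.3)    # about 0.65
```

Under this reading, the reported range of r_abs from about -0.25 to 0.3 corresponds to a policy that should abstain when its chance of being right falls below roughly 37.5% to 65%.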


[13] Mind the Shift: Using Delta SSL Embeddings to Enhance Child ASR cs.CL | cs.SD | eess.AS

Zilai Wang, Natarajan Balaji Shankar, Kaiyuan Zhang, Zihan Wang, Abeer Alwan

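TL;DR: This paper proposes using delta SSL embeddings to improve child automatic speech recognition (ASR). The method computes the difference between embeddings from a fine-tuned SSL model and its pretrained counterpart (the delta embeddings) to capture task-specific information, then fuses them with fine-tuned features from another SSL model, significantly lowering word error rate (WER) on the MyST child speech dataset.

Motivation: Child ASR suffers from limited data and a pretraining domain mismatch, and fine-tuning SSL models shifts the representation space. The paper hypothesizes that delta embeddings (fine-tuned minus pretrained embeddings) encode task-specific information that complements fine-tuned features from another SSL model and so improves recognition.

Result: On the MyST children's corpus, delta embedding fusion yields up to a 10% relative WER reduction for HuBERT and a 4.4% reduction for W2V2, compared with fusing fine-tuned embeddings only. Fusing WavLM with delta W2V2 embeddings reaches a WER of 9.64%, a new state of the art among SSL models on MyST.

Insight: The novelty is in proposing and validating delta SSL embeddings as a source of task-specific information, and in showing feature fusion to be a viable direction for child ASR. Viewed objectively, using the difference in representations before and after fine-tuning to compensate for domain shift is a simple and effective feature-enhancement strategy.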

Abstract: Self-supervised learning (SSL) models have achieved impressive results across many speech tasks, yet child automatic speech recognition (ASR) remains challenging due to limited data and pretraining domain mismatch. Fine-tuning SSL models on child speech induces shifts in the representation space. We hypothesize that delta SSL embeddings, defined as the differences between embeddings from a fine-tuned model and those from its pretrained counterpart, encode task-specific information that complements fine-tuned features from another SSL model. We evaluate multiple fusion strategies on the MyST children's corpus using different models. Results show that delta embedding fusion with WavLM yields up to a 10 percent relative WER reduction for HuBERT and a 4.4 percent reduction for W2V2, compared to fine-tuned embedding fusion. Notably, fusing WavLM with delta W2V2 embeddings achieves a WER of 9.64, setting a new state of the art among SSL models on the MyST corpus. These findings demonstrate the effectiveness of delta embeddings and highlight feature fusion as a promising direction for advancing child ASR.
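
The definition of a delta embedding and one possible fusion can be sketched in a few lines. Concatenation is assumed here as the fusion strategy purely for illustration (the abstract says several strategies were evaluated without naming them), and every vector and WER value below is made up.

```python
def delta_embedding(finetuned, pretrained):
    """Delta embedding: fine-tuned features minus the pretrained model's features."""
    return [f - p for f, p in zip(finetuned, pretrained)]


def concat_fusion(primary_features, delta_features):
    """Fuse by concatenation (one assumed strategy among several possible)."""
    return primary_features + delta_features


def relative_wer_reduction(baseline_wer, new_wer):
    """Relative WER reduction of a new system with respect to a baseline."""
    return (baseline_wer - new_wer) / baseline_wer


# Toy frame-level features for one audio frame (hypothetical values).
w2v2_pretrained = [0.25, 0.5]
w2v2_finetuned = [0.5, 0.125]
wavlm_finetuned = [1.0, -1.0, 0.5]

delta_w2v2 = delta_embedding(w2v2_finetuned, w2v2_pretrained)  # [0.25, -0.375]
fused = concat_fusion(wavlm_finetuned, delta_w2v2)

# A drop from a hypothetical 10.0 to 9.0 WER is a 10% relative reduction.
reduction = relative_wer_reduction(10.0, 9.0)
```

The delta isolates what fine-tuning changed in one model, so fusing it with another model's fine-tuned features adds information that the second model's features would not contain on their own.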


[14] Unit-Based Agent for Semi-Cascaded Full-Duplex Dialogue Systems cs.CL | cs.HC

Haoyuan Yu, Yuxuan Chen, Minjie Cai

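TL;DR: This paper proposes a unit-based framework for semi-cascaded full-duplex dialogue systems. It decomposes complex dialogue into minimal conversational units so that the system can process each unit independently and predict when to move to the next. The system is built around a multimodal large language model, supported by modules such as voice activity detection and text-to-speech synthesis, and runs in a training-free, plug-and-play manner. Experiments on the HumDial dataset validate the framework, which ranks second on the test set of the Human-like Spoken Dialogue Systems Challenge (Track 2: Full-Duplex Interaction).

Motivation: Full-duplex voice interaction is essential for natural human-computer interaction, but existing systems struggle with complex dialogue. By decomposing dialogue into minimal units, the paper aims for more flexible and efficient full-duplex interaction and addresses the shortcomings of conventional systems in real-time responsiveness and continuity.

Result: In experiments on the HumDial dataset, the framework ranks second on the test set of the Human-like Spoken Dialogue Systems Challenge (Track 2: Full-Duplex Interaction), which supports its effectiveness, though it was not the top-ranked entry.

Insight: The novelty is a semi-cascaded architecture that decomposes dialogue into minimal units and combines a multimodal large language model with auxiliary modules to obtain a training-free, plug-and-play full-duplex system. Viewed objectively, this modular design improves flexibility and extensibility and offers a new approach to real-time dialogue systems.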

Abstract: Full-duplex voice interaction is crucial for natural human-computer interaction. We present a framework that decomposes complex dialogue into minimal conversational units, enabling the system to process each unit independently and predict when to transition to the next. This framework is instantiated as a semi-cascaded full-duplex dialogue system built around a multimodal large language model, supported by auxiliary modules such as voice activity detection (VAD) and text-to-speech (TTS) synthesis. The resulting system operates in a train-free, plug-and-play manner. Experiments on the HumDial dataset demonstrate the effectiveness of our framework, which ranks second among all teams on the test set of the Human-like Spoken Dialogue Systems Challenge (Track 2: Full-Duplex Interaction). Code is available at the GitHub repository https://github.com/yu-haoyuan/fd-badcat.


[15] Automated Benchmark Generation from Domain Guidelines Informed by Bloom’s Taxonomy cs.CL | cs.AI

Si Chen, Le Huy Khiem, Annalisa Szymanski, Ronald Metoyer, Ting Hua

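TL;DR: This paper proposes an automated benchmark generation framework informed by Bloom's Taxonomy for evaluating the open-ended question answering ability of large language models in practice-based domains. The framework turns expert guidelines into violation-based scenarios and expands them into auto-graded multiple-choice questions and multi-turn dialogues across four cognitive levels; it is validated in three applied domains: teaching, dietetics, and caregiving. The study finds that LLMs sometimes do relatively well on higher-order reasoning (Analyze) yet fail more often on lower-level items (Remember).

Motivation: Evaluating open-ended QA is hard in practice-based domains, where knowledge is procedural and grounded in professional judgment. Most existing benchmarks rely on human exam datasets, which are often unavailable in such domains.

Result: The framework produces large-scale, psychometrically informed benchmarks in three applied domains (teaching, dietetics, caregiving) and reveals differences between model and human reasoning: LLMs perform relatively better on higher-order reasoning (Analyze) but fail more often on lower-level items (Remember).

Insight: The novelty is in using Bloom's Taxonomy to automatically generate scalable, deterministic, and reproducible evaluation benchmarks from expert guidelines, in particular by using violation scenarios and cognitive-level expansion to assess contextualized reasoning. This gives domains without ready-made datasets a new evaluation method.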

Abstract: Open-ended question answering (QA) evaluates a model’s ability to perform contextualized reasoning beyond factual recall. This challenge is especially acute in practice-based domains, where knowledge is procedural and grounded in professional judgment, while most existing LLM benchmarks depend on pre-existing human exam datasets that are often unavailable in such settings. We introduce a framework for automated benchmark generation from expert-authored guidelines informed by Bloom’s Taxonomy. It converts expert practices into implicit violation-based scenarios and expands them into auto-graded multiple-choice questions (MCQs) and multi-turn dialogues across four cognitive levels, enabling deterministic, reproducible, and scalable evaluation. Applied to three applied domains: teaching, dietetics, and caregiving, we find differences between model and human-like reasoning: LLMs sometimes perform relatively better on higher-order reasoning (Analyze) but fail more frequently on lower-level items (Remember). We produce large-scale, psychometrically informed benchmarks that surface these non-intuitive model behaviors and enable evaluation of contextualized reasoning in real-world settings.


[16] SoftHateBench: Evaluating Moderation Models Against Reasoning-Driven, Policy-Compliant Hostility cs.CL

Xuanyu Su, Diana Inkpen, Nathalie Japkowicz

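TL;DR: This paper introduces SoftHateBench, a generative benchmark for evaluating how well content moderation models detect "soft hate speech": discourse that appears reasonable on the surface but uses argumentative framing and value-based appeals to steer audiences toward blaming or excluding a particular group. The benchmark integrates the Argumentum Model of Topics and Relevance Theory and generates 4,745 soft-hate instances across 7 sociocultural domains and 28 target groups. Evaluation shows that current encoder-based detectors, general-purpose LLMs, and safety models all degrade markedly on this reasoning-driven, covert hostility.

Motivation: Current social media moderation systems are mostly optimized for surface toxicity cues and may fail to detect soft hate speech, that is, covert hostility conveyed through seemingly reasonable argument. Existing benchmarks do not measure this gap systematically.

Result: Evaluation on SoftHateBench shows a consistent, marked drop in detection performance from hard to soft hate speech. Encoder-based detectors, general-purpose LLMs, and dedicated safety models alike often fail when the same stance is conveyed through subtle, reasoning-based language.

Insight: The novelty is the first systematic generative benchmark focused on detecting reasoning-driven, policy-compliant soft hate speech. Its core method combines a formal model of argument structure with pragmatic theory to generate logically coherent soft-hate variants that preserve the original stance, providing a new tool and perspective for finer-grained content safety evaluation.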

Abstract: Online hate on social media ranges from overt slurs and threats (hard hate speech) to soft hate speech: discourse that appears reasonable on the surface but uses framing and value-based arguments to steer audiences toward blaming or excluding a target group. We hypothesize that current moderation systems, largely optimized for surface toxicity cues, are not robust to this reasoning-driven hostility, yet existing benchmarks do not measure this gap systematically. We introduce SoftHateBench, a generative benchmark that produces soft-hate variants while preserving the underlying hostile standpoint. To generate soft hate, we integrate the Argumentum Model of Topics (AMT) and Relevance Theory (RT) in a unified framework: AMT provides the backbone argument structure for rewriting an explicit hateful standpoint into a seemingly neutral discussion while preserving the stance, and RT guides generation to keep the AMT chain logically coherent. The benchmark spans 7 sociocultural domains and 28 target groups, comprising 4,745 soft-hate instances. Evaluations across encoder-based detectors, general-purpose LLMs, and safety models show a consistent drop from hard to soft tiers: systems that detect explicit hostility often fail when the same stance is conveyed through subtle, reasoning-based language. Disclaimer. Contains offensive examples used solely for research.


[17] SAPO: Self-Adaptive Process Optimization Makes Small Reasoners Stronger cs.CL

Kaiyuan Chen, Guangmin Zheng, Jin Wang, Xiaobing Zhou, Xuejie Zhang

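TL;DR: This paper proposes SAPO (Self-Adaptive Process Optimization), which adaptively introduces process supervision signals to narrow the gap between reasoner and verifier, improving the reasoning ability of small language models on complex tasks such as mathematics and code. The method avoids the computational inefficiency of conventional Monte Carlo process supervision and outperforms most existing self-evolution methods on multiple benchmarks.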

Details

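Motivation: Existing self-evolution methods overlook the influence of fine-grained reasoning steps, which leads to a performance gap between reasoner and verifier, and the computational inefficiency of Monte Carlo process supervision makes the gap harder to close. Inspired by the Error-Related Negativity (ERN), in which the reasoner localizes errors after incorrect decisions and uses them to guide rapid adjustment, the paper aims to design an efficient self-adaptive process optimization method for small language models.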

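Result: Extensive experiments on two challenging task types, mathematics and code, show that SAPO outperforms most existing self-evolution methods. To further study SAPO's effect on verifier performance, the paper also introduces two new process reward model benchmarks for mathematical and coding tasks.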

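Insight: The core innovation is SAPO, which introduces process supervision signals adaptively and efficiently by actively minimizing the reasoner-verifier gap rather than relying on inefficient Monte Carlo estimation. It borrows the ERN concept from neuroscience, bringing error localization and rapid adjustment into language-model self-improvement, and offers a new way to raise small models' performance on complex reasoning tasks. The new benchmarks also give future process reward model research an evaluation tool.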

Abstract: Existing self-evolution methods overlook the influence of fine-grained reasoning steps, which leads to the reasoner-verifier gap. The computational inefficiency of Monte Carlo (MC) process supervision further exacerbates the difficulty in mitigating the gap. Motivated by the Error-Related Negativity (ERN), whereby the reasoner can localize errors following incorrect decisions and use them to guide rapid adjustments, we propose a Self-Adaptive Process Optimization (SAPO) method for self-improvement in Small Language Models (SLMs). SAPO adaptively and efficiently introduces process supervision signals by actively minimizing the reasoner-verifier gap rather than relying on inefficient MC estimations. Extensive experiments demonstrate that the proposed method outperforms most existing self-evolution methods on two challenging task types: mathematics and code. Additionally, to further investigate SAPO's impact on verifier performance, this work introduces two new benchmarks for process reward models in both mathematical and coding tasks.


[18] Beyond Speedup – Utilizing KV Cache for Sampling and Reasoning cs.CL | cs.AI | cs.LG

Zeyu Xing, Xing Li, Hui-Ling Zhen, Mingxuan Yuan, Sinno Jialin Pan

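TL;DR: This paper proposes treating the KV cache as a lightweight representation rather than only a means of accelerating autoregressive decoding. It shows that the contextual information encoded in the KV cache can be reused for downstream tasks at no extra cost, without recomputing or storing full hidden states. The paper demonstrates the effectiveness of KV cache representations in two key applications: Chain-of-Embedding and Fast/Slow Thinking Switching.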

Details

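Motivation: To explore uses of the KV cache beyond decoding speedup, using the contextual information it encodes as a free and effective representation and avoiding the overhead of recomputing or storing full hidden states for downstream tasks.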

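Result: On Chain-of-Embedding, KV cache representations achieve competitive or superior performance on Llama-3.1-8B-Instruct and Qwen2-7B-Instruct. On Fast/Slow Thinking Switching, they enable adaptive reasoning on Qwen3-8B and DeepSeek-R1-Distil-Qwen-14B, reducing token generation by up to 5.7× with minimal accuracy loss.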

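Insight: The core innovation is repositioning the KV cache as a lightweight representation that can be reused for free in sampling and reasoning tasks. This opens a new direction for representation reuse in LLM inference and shows that KV cache representations, although weaker than dedicated embeddings, are sufficient to support key applications and thereby improve efficiency.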

Abstract: KV caches, typically used only to speed up autoregressive decoding, encode contextual information that can be reused for downstream tasks at no extra cost. We propose treating the KV cache as a lightweight representation, eliminating the need to recompute or store full hidden states. Despite being weaker than dedicated embeddings, KV-derived representations are shown to be sufficient for two key applications: (i) Chain-of-Embedding, where they achieve competitive or superior performance on Llama-3.1-8B-Instruct and Qwen2-7B-Instruct; and (ii) Fast/Slow Thinking Switching, where they enable adaptive reasoning on Qwen3-8B and DeepSeek-R1-Distil-Qwen-14B, reducing token generation by up to 5.7× with minimal accuracy loss. Our findings establish KV caches as a free, effective substrate for sampling and reasoning, opening new directions for representation reuse in LLM inference. Code: https://github.com/cmd2001/ICLR2026_KV-Embedding.


[19] CE-RM: A Pointwise Generative Reward Model Optimized via Two-Stage Rollout and Unified Criteria cs.CL

Xinyu Hu, Yancheng He, Weixun Wang, Tao Feng, Li Lin

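TL;DR: This paper proposes CE-RM-4B, a pointwise generative reward model optimized via a two-stage rollout and unified criteria, for automatic evaluation of open-ended natural language generation. Using only about 5.7K high-quality data samples, the model performs strongly on a range of reward model benchmarks and delivers more effective improvements in downstream reinforcement learning practice.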

Details

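Motivation: To address the notable gap between the benchmark performance of existing LLM-as-a-Judge paradigms and their actual effectiveness in reinforcement learning practice, which the authors attribute to limitations of prior work: over-reliance on pairwise evaluation and inadequate optimization of evaluation criteria.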

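Result: CE-RM-4B achieves superior performance on diverse reward model benchmarks, especially in Best-of-N scenarios, and delivers more effective improvements in downstream RL practice.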

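Insight: The novelty lies in a dedicated two-stage rollout method and unified query-based criteria for training a pointwise generative reward model, which narrows the gap between benchmark performance and practical use.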

Abstract: Automatic evaluation is crucial yet challenging for open-ended natural language generation, especially when rule-based metrics are infeasible. Compared with traditional methods, the recent LLM-as-a-Judge paradigms enable better and more flexible evaluation, and show promise as generative reward models for reinforcement learning. However, prior work has revealed a notable gap between their seemingly impressive benchmark performance and actual effectiveness in RL practice. We attribute this issue to limitations in existing studies, including the dominance of pairwise evaluation and inadequate optimization of evaluation criteria. Therefore, we propose CE-RM-4B, a pointwise generative reward model that is trained with a dedicated two-stage rollout method and adopts unified query-based criteria. Using only about 5.7K high-quality data curated from the open-source preference dataset, our CE-RM-4B achieves superior performance on diverse reward model benchmarks, especially in Best-of-N scenarios, and delivers more effective improvements in downstream RL practice.
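
As a rough illustration of why a pointwise reward model fits Best-of-N selection, the sketch below scores each candidate independently and keeps the highest-scoring one. The `toy_score` function is a hypothetical stand-in for a reward model and is not CE-RM-4B's actual interface; the call counts assume a full round-robin for the pairwise case.

```python
from typing import Callable, Dict, Sequence


def best_of_n(candidates: Sequence[str], score: Callable[[str], float]) -> str:
    """Return the candidate with the highest pointwise score."""
    return max(candidates, key=score)


def judge_calls(n: int) -> Dict[str, int]:
    """Judge calls needed to compare n candidates under each protocol."""
    return {"pointwise": n, "pairwise_round_robin": n * (n - 1) // 2}


def toy_score(text: str) -> float:
    # Hypothetical reward: number of distinct words (illustration only).
    return float(len(set(text.split())))


candidates = ["yes", "yes it is", "it depends on the context"]
best = best_of_n(candidates, toy_score)
calls = judge_calls(8)
```

Pointwise scoring grows linearly with the number of candidates, whereas exhaustive pairwise comparison grows quadratically, which is one plausible reason pointwise models are convenient for Best-of-N.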


[20] PsychePass: Calibrating LLM Therapeutic Competence via Trajectory-Anchored Tournaments cs.CL | cs.LG

Zhuang Chen, Dazhen Wan, Zhangkai Zheng, Guanqun Bi, Xiyao Xiao

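TL;DR: This paper proposes PsychePass, a framework that calibrates the therapeutic competence of large language models through trajectory-anchored tournaments. It addresses the process drift and standard drift that unanchored evaluation methods suffer from, and uses tournament trajectories as reward signals for reinforcement learning to improve model performance.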

Details

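Motivation: Current methods for evaluating large language models in psychotherapy are unanchored: simulated counseling drifts away from its goals and static scoring standards are unstable, which makes therapeutic competence hard to assess reliably.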

Result: Extensive experiments validate the effectiveness of PsychePass and show strong consistency with human expert judgments.

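Insight: The novelty is a trajectory-anchored tournament evaluation framework that produces robust Elo ratings through precisely controlled simulated interactions and an efficient Swiss-system tournament, and converts tournament trajectories into credible reward signals for on-policy reinforcement learning, systematically calibrating and improving LLMs' therapeutic competence.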

Abstract: While large language models show promise in mental healthcare, evaluating their therapeutic competence remains challenging due to the unstructured and longitudinal nature of counseling. We argue that current evaluation paradigms suffer from an unanchored defect, leading to two forms of instability: process drift, where unsteered client simulation wanders away from specific counseling goals, and standard drift, where static pointwise scoring lacks the stability for reliable judgment. To address this, we introduce PsychePass, a unified framework that calibrates the therapeutic competence of LLMs via trajectory-anchored tournaments. We first anchor the interaction trajectory in simulation, where clients precisely control the fluid consultation process to probe multifaceted capabilities. We then anchor the battle trajectory in judgments through an efficient Swiss-system tournament, utilizing dynamic pairwise battles to yield robust Elo ratings. Beyond ranking, we demonstrate that tournament trajectories can be transformed into credible reward signals, enabling on-policy reinforcement learning to enhance LLMs' performance. Extensive experiments validate the effectiveness of PsychePass and its strong consistency with human expert judgments.
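
The Swiss-system tournament with Elo ratings mentioned above can be sketched with the standard Elo update. This is a minimal illustration, not the paper's implementation: the K-factor of 32, the 400-point scale, and the simple "pair adjacent players by rating" rule are assumptions, and a full Swiss system would also avoid rematches.

```python
from typing import Dict, List, Tuple


def expected_score(r_a: float, r_b: float) -> float:
    """Standard Elo expected score of player A against player B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0) -> Tuple[float, float]:
    """Update both ratings after one battle; score_a is 1 (win), 0.5 (draw) or 0 (loss)."""
    e_a = expected_score(r_a, r_b)
    new_a = r_a + k * (score_a - e_a)
    new_b = r_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return new_a, new_b


def swiss_pairs(ratings: Dict[str, float]) -> List[Tuple[str, str]]:
    """Simplified Swiss pairing: sort by rating and pair neighbours."""
    ranked = sorted(ratings, key=lambda name: -ratings[name])
    return [(ranked[i], ranked[i + 1]) for i in range(0, len(ranked) - 1, 2)]


# Two equally rated models; the first wins one battle.
after_win = elo_update(1000.0, 1000.0, 1.0)
pairs = swiss_pairs({"a": 1016.0, "b": 984.0, "c": 1000.0, "d": 1000.0})
```

Pairing models with similar current ratings keeps each battle informative, which is why a Swiss system needs far fewer rounds than a full round-robin.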


[21] MobileBench-OL: A Comprehensive Chinese Benchmark for Evaluating Mobile GUI Agents in Real-World Environment cs.CL | cs.AI

Qinzhuo Wu, Zhizhuo Yang, Hanhao Li, Pengzhi Gao, Wei Liu

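TL;DR: This paper proposes MobileBench-OL, a comprehensive Chinese online benchmark for evaluating mobile GUI agents in real-world environments. It contains 1080 tasks from 80 Chinese apps, measures agents' task execution, complex reasoning, and noise robustness through five subsets, and provides an automatic evaluation framework with a reset mechanism.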

Details

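Motivation: Existing online benchmarks focus mainly on agents' ability to follow task instructions while neglecting their reasoning and exploration abilities, and they ignore the random noise of real mobile environments, leaving a gap between benchmarks and real-world environments.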

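Result: Evaluating 12 leading GUI agents on MobileBench-OL shows that they still have significant room for improvement before meeting real-world requirements; human evaluation further confirms that the benchmark reliably measures agent performance in real environments.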

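Insight: The novelty is a comprehensive online benchmark spanning multiple dimensions (task execution, complex reasoning, noise robustness) together with an automatic evaluation framework with a reset mechanism that enables stable, repeatable real-world evaluation, addressing shortcomings of existing benchmarks.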

Abstract: Recent advances in mobile Graphical User Interface (GUI) agents highlight the growing need for comprehensive evaluation benchmarks. While new online benchmarks offer more realistic testing than offline ones, they tend to focus on the agents' task instruction-following ability while neglecting their reasoning and exploration ability. Moreover, these benchmarks do not consider the random noise in real-world mobile environments. This leads to a gap between benchmarks and real-world environments. To address these limitations, we propose MobileBench-OL, an online benchmark with 1080 tasks from 80 Chinese apps. It measures the task execution, complex reasoning, and noise robustness of agents through 5 subsets that cover multiple evaluation dimensions. We also provide an auto-eval framework with a reset mechanism, enabling stable and repeatable real-world benchmarking. Evaluating 12 leading GUI agents on MobileBench-OL shows significant room for improvement to meet real-world requirements. Human evaluation further confirms that MobileBench-OL can reliably measure the performance of leading GUI agents in real environments. Our data and code will be released upon acceptance.


[22] Improving Diffusion Language Model Decoding through Joint Search in Generation Order and Token Space cs.CL | cs.LG

Yangyi Shen, Tianjian Feng, Jiaqi Han, Wen Wang, Tianlang Chen

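TL;DR: This paper proposes Order-Token Search, a joint search method that improves decoding for diffusion language models. By searching jointly over generation order and token space, it explores multiple decoding trajectories and goes beyond existing decoding methods, which are confined to a single trajectory.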

Details

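Motivation: Current decoding methods for diffusion language models typically follow a single decoding trajectory, which limits exploration of trajectory space; a method that explores generation order and token values simultaneously is needed to improve decoding performance.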

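Result: On mathematical reasoning and code generation benchmarks (GSM8K, MATH500, Countdown, and HumanEval), Order-Token Search improves over the backbone model by 3.1%, 3.8%, 7.9%, and 6.8% absolute respectively, matching or surpassing the diffu-GRPO post-trained d1-LLaDA model.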

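Insight: The core innovation is a joint search framework built around a likelihood estimator that scores denoising actions, which makes stable pruning and efficient exploration of diverse trajectories possible and points to a new research direction for diffusion language model decoding.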

Abstract: Diffusion Language Models (DLMs) offer order-agnostic generation that can explore many possible decoding trajectories. However, current decoding methods commit to a single trajectory, limiting exploration in trajectory space. We introduce Order-Token Search to explore this space through jointly searching over generation order and token values. Its core is a likelihood estimator that scores denoising actions, enabling stable pruning and efficient exploration of diverse trajectories. Across mathematical reasoning and coding benchmarks, Order-Token Search consistently outperforms baselines on GSM8K, MATH500, Countdown, and HumanEval (3.1%, 3.8%, 7.9%, and 6.8% absolute over backbone), matching or surpassing diffu-GRPO post-trained d1-LLaDA. Our work establishes joint search as a key component for advancing decoding in DLMs.
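
To make the idea of searching jointly over generation order and token values concrete, here is a minimal beam-search sketch. It is not the paper's algorithm: the `score` callback is a hypothetical stand-in for the likelihood estimator that scores denoising actions, and `toy_score`, `TARGET`, and the beam width are illustrative assumptions.

```python
import math
from typing import Callable, List, Optional, Sequence, Tuple

State = Tuple[Optional[str], ...]  # None marks a still-masked position


def order_token_search(
    length: int,
    vocab: Sequence[str],
    score: Callable[[State, int, str], float],
    beam_width: int = 2,
) -> Tuple[State, float]:
    """Beam search where each action picks BOTH a masked position and a token."""
    beams: List[Tuple[State, float]] = [((None,) * length, 0.0)]
    for _ in range(length):
        expanded = {}
        for state, logp in beams:
            for pos in range(length):
                if state[pos] is not None:
                    continue  # position already unmasked
                for tok in vocab:
                    new_state = state[:pos] + (tok,) + state[pos + 1:]
                    cand = logp + score(state, pos, tok)
                    # Different orders can reach the same state; keep the best score.
                    if cand > expanded.get(new_state, -math.inf):
                        expanded[new_state] = cand
        # Prune to the highest-scoring partial trajectories.
        beams = sorted(expanded.items(), key=lambda kv: kv[1], reverse=True)[:beam_width]
    return beams[0]


TARGET = ("a", "b", "c")  # hypothetical sequence the toy scorer prefers


def toy_score(state: State, pos: int, tok: str) -> float:
    # Hypothetical log-likelihood of unmasking `pos` with `tok`.
    return math.log(0.9 if tok == TARGET[pos] else 0.1)


best_state, best_logp = order_token_search(3, ["a", "b", "c"], toy_score)
```

With a beam width of 1 this reduces to committing to a single greedy trajectory, which is the limitation the abstract attributes to current decoders.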


[23] PEARL: Plan Exploration and Adaptive Reinforcement Learning for Multihop Tool Use cs.CL

Qihao Wang, Mingzhe Lu, Jiayue Wu, Yue Hu, Yanbing Liu

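TL;DR: The paper proposes PEARL, a two-stage framework combining offline exploration with online reinforcement learning that strengthens the planning and execution abilities of large language models in complex multi-hop tool use.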

Details

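Motivation: To address weak planning, tool hallucination, erroneous parameter generation, and poor interaction robustness of large language models in complex multi-turn tool invocation.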

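Result: PEARL significantly outperforms existing methods on the ToolHop and T-Eval benchmarks, reaching a new state-of-the-art success rate of 56.5% on ToolHop while maintaining a low invocation error rate.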

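Insight: It adopts a two-stage framework (offline exploration and online reinforcement learning) and introduces a dedicated planner trained with Group Relative Policy Optimization and a carefully designed reward function that provides distinct signals for planning quality.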

Abstract: Large Language Models show great potential with external tools, but face significant challenges in complex, multi-turn tool invocation. They often exhibit weak planning, tool hallucination, and erroneous parameter generation, and struggle with robust interaction. To tackle these issues, we present PEARL, a novel framework to enhance LLM planning and execution for sophisticated tool use. PEARL adopts a two-stage approach: an offline phase where the agent explores tools to learn valid usage patterns and failure conditions, and an online reinforcement learning phase. In the online phase, a dedicated Planner is trained via Group Relative Policy Optimization (GRPO) with a carefully designed reward function that provides distinct signals for planning quality. Experiments on the ToolHop and T-Eval benchmarks show PEARL significantly outperforms existing methods, achieving a new state-of-the-art success rate of 56.5% on ToolHop while maintaining a low invocation error rate. Our work marks a key advance in addressing the complex planning challenges of tool use, contributing to the development of more robust and reliable LLM-based agents.
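
For readers unfamiliar with GRPO, its central step is normalizing each sampled plan's reward against the other plans sampled for the same query. The sketch below shows only that group-relative advantage; the reward values are made up, the use of the population standard deviation and the `eps` term are assumptions, and PEARL's actual reward function is not reproduced here.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Advantage of each sample relative to its own group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four plans sampled for one tool-use query.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
# If every plan gets the same reward, no plan is preferred.
flat = group_relative_advantages([0.5, 0.5, 0.5])
```

Because the baseline is the group mean, no separate value network is needed; plans that beat their siblings get positive advantages and the rest get negative ones.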


[24] MuVaC: A Variational Causal Framework for Multimodal Sarcasm Understanding in Dialogues cs.CL

Diandian Guo, Fangfang Yuan, Cong Cao, Xixun Lin, Chuan Zhou

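TL;DR: This paper proposes MuVaC, a variational causal inference framework that jointly optimizes Multimodal Sarcasm Detection (MSD) and Multimodal Sarcasm Explanation (MuSE) in multimodal dialogues. Mimicking human cognitive mechanisms, the framework models the causal dependency between the tasks with structural causal models, integrates multimodal features with an alignment-then-fusion approach, and enforces consistency between detection results and explanations to make its reasoning more trustworthy.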

Details

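Motivation: Existing work mostly handles multimodal sarcasm detection or explanation separately, or attempts to integrate them while overlooking their inherent causal dependency, and so cannot model the cognitive process by which humans understand sarcasm.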

Result: Experiments on public datasets demonstrate the superiority of MuVaC, offering a new perspective for understanding multimodal sarcasm.

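Insight: The novelty is being the first to jointly model MSD and MuSE from the perspective of structural causal models, defining the joint optimization objective through variational causal pathways, and improving the robustness of multimodal feature learning and reasoning with an alignment-then-fusion strategy and a consistency constraint.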

Abstract: The prevalence of sarcasm in multimodal dialogues on social platforms makes understanding the true intent behind online content a crucial yet challenging task. Comprehensive sarcasm analysis requires two key aspects: Multimodal Sarcasm Detection (MSD) and Multimodal Sarcasm Explanation (MuSE). Intuitively, the act of detection is the result of the reasoning process that explains the sarcasm. Current research predominantly focuses on addressing either MSD or MuSE as a single task. Even though some recent work has attempted to integrate these tasks, their inherent causal dependency is often overlooked. To bridge this gap, we propose MuVaC, a variational causal inference framework that mimics human cognitive mechanisms for understanding sarcasm, enabling robust multimodal feature learning to jointly optimize MSD and MuSE. Specifically, we first model MSD and MuSE from the perspective of structural causal models, establishing variational causal pathways to define the objectives for joint optimization. Next, we design an alignment-then-fusion approach to integrate multimodal features, providing robust fusion representations for sarcasm detection and explanation generation. Finally, we enhance the reasoning trustworthiness by ensuring consistency between detection results and explanations. Experimental results demonstrate the superiority of MuVaC on public datasets, offering a new perspective for understanding multimodal sarcasm.


[25] BMAM: Brain-inspired Multi-Agent Memory Framework cs.CL

Yang Li, Jiaxiang Liu, Yusong Wang, Yujie Wu, Mingkun Xu

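TL;DR: This paper proposes BMAM (Brain-inspired Multi-Agent Memory), a multi-agent memory framework inspired by brain cognition that targets information forgetting and behavioral inconsistency (termed soul erosion) in language-model-based agents over long-term interaction. The framework decomposes memory into functionally specialized subsystems (episodic, semantic, salience-aware, and control-oriented) and supports long-horizon reasoning by organizing episodic memories along explicit timelines.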

Details

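Motivation: To address the difficulty language-model-based agents have in preserving temporally grounded information and maintaining behavioral consistency across sessions during long-term interaction, a failure mode termed "soul erosion".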

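Result: Under the standard long-horizon evaluation setting of the LoCoMo benchmark, BMAM reaches 78.45% accuracy; ablation analyses confirm that the hippocampus-inspired episodic memory subsystem plays a critical role in temporal reasoning.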

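Insight: The main innovation is modeling memory as a set of functionally specialized subsystems rather than a single unstructured store, with a decomposition inspired by cognitive memory systems. Viewed objectively, its mechanism of organizing explicit timelines and retrieving by fusing multiple complementary signals provides a new structured approach to long-horizon reasoning.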

Abstract: Language-model-based agents operating over extended interaction horizons face persistent challenges in preserving temporally grounded information and maintaining behavioral consistency across sessions, a failure mode we term soul erosion. We present BMAM (Brain-inspired Multi-Agent Memory), a general-purpose memory architecture that models agent memory as a set of functionally specialized subsystems rather than a single unstructured store. Inspired by cognitive memory systems, BMAM decomposes memory into episodic, semantic, salience-aware, and control-oriented components that operate at complementary time scales. To support long-horizon reasoning, BMAM organizes episodic memories along explicit timelines and retrieves evidence by fusing multiple complementary signals. Experiments on the LoCoMo benchmark show that BMAM achieves 78.45 percent accuracy under the standard long-horizon evaluation setting, and ablation analyses confirm that the hippocampus-inspired episodic memory subsystem plays a critical role in temporal reasoning.


[26] P2S: Probabilistic Process Supervision for General-Domain Reasoning Question Answering cs.CL

Wenlin Zhong, Chengyuan Liu, Yiquan Wu, Bovin Tan, Changlong Sun

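TL;DR: This paper proposes Probabilistic Process Supervision (P2S), a novel self-supervision framework that addresses the lack of verifiable reward signals for large language models on general-domain reasoning tasks. It synthesizes a high-quality reference reasoning chain and computes a Path Faithfulness Reward (PFR) that gives each reasoning step fine-grained supervision, mitigating reward sparsity.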

Details

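Motivation: Existing outcome-reward methods such as RLPR neglect step-by-step supervision of the reasoning process itself in general-domain reasoning tasks, which leaves the reward signal sparse. The paper aims to fill this gap with fine-grained process supervision that needs neither a separate reward model nor human annotation.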

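Result: Extensive experiments on reading comprehension (e.g., DROP) and medical question answering (e.g., MedQA) benchmarks show that P2S significantly outperforms strong baselines, including RLPR.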

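Insight: The core innovation is the Path Faithfulness Reward (PFR), which scores the faithfulness of the current reasoning prefix by the conditional probability that the model generates the suffix of the reference reasoning chain, and which can be flexibly combined with outcome-based rewards to give the reasoning process dense, self-supervised guidance.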

Abstract: While reinforcement learning with verifiable rewards (RLVR) has advanced LLM reasoning in structured domains like mathematics and programming, its application to general-domain reasoning tasks remains challenging due to the absence of verifiable reward signals. To this end, methods like Reinforcement Learning with Reference Probability Reward (RLPR) have emerged, leveraging the probability of generating the final answer as a reward signal. However, these outcome-focused approaches neglect crucial step-by-step supervision of the reasoning process itself. To address this gap, we introduce Probabilistic Process Supervision (P2S), a novel self-supervision framework that provides fine-grained process rewards without requiring a separate reward model or human-annotated reasoning steps. During reinforcement learning, P2S synthesizes and filters a high-quality reference reasoning chain (gold-CoT). The core of our method is to calculate a Path Faithfulness Reward (PFR) for each reasoning step, which is derived from the conditional probability of generating the gold-CoT’s suffix, given the model’s current reasoning prefix. Crucially, this PFR can be flexibly integrated with any outcome-based reward, directly tackling the reward sparsity problem by providing dense guidance. Extensive experiments on reading comprehension and medical Question Answering benchmarks show that P2S significantly outperforms strong baselines.
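
The Path Faithfulness Reward described above can be illustrated with a small sketch that scores how likely the gold-CoT suffix is given a reasoning prefix. The abstract does not give the exact formula, so the length-normalized geometric mean used here is an assumption, and `toy_logprob` is a hypothetical stand-in for a language model's token log-probability.

```python
import math
from typing import Callable, List, Sequence

LogProbFn = Callable[[Sequence[str], str], float]


def path_faithfulness_reward(
    prefix: Sequence[str], gold_suffix: Sequence[str], token_logprob: LogProbFn
) -> float:
    """Length-normalized probability of the gold suffix given the prefix (assumed form)."""
    context: List[str] = list(prefix)
    total = 0.0
    for tok in gold_suffix:
        total += token_logprob(context, tok)  # log p(tok | prefix + earlier suffix)
        context.append(tok)
    return math.exp(total / max(len(gold_suffix), 1))


def toy_logprob(context: Sequence[str], token: str) -> float:
    # Hypothetical model: a token is likely (p=0.9) if it already appears
    # in the context, otherwise unlikely (p=0.1).
    return math.log(0.9 if token in context else 0.1)


gold_suffix = ["x", "=", "2"]
faithful = path_faithfulness_reward(["x", "=", "2"], gold_suffix, toy_logprob)
unfaithful = path_faithfulness_reward(["y"], gold_suffix, toy_logprob)
```

A prefix that keeps the reference continuation probable earns a high reward at that step, which is how a dense per-step signal can be produced without a separate reward model.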


[27] Efficient Multimodal Planning Agent for Visual Question-Answering cs.CL

Zhuo Chen, Xinyu Geng, Xinyu Wang, Yong Jiang, Zhen Zhang

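TL;DR: This paper proposes an efficient multimodal planning agent for Visual Question-Answering (VQA) that dynamically decomposes the multimodal Retrieval-Augmented Generation (mRAG) pipeline and decides whether each step is necessary, substantially improving efficiency while maintaining task performance.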

Details

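Motivation: To address the inefficiency of existing multi-stage mRAG pipelines for VQA, which stems from their inherent dependencies, especially on knowledge-intensive queries.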

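Result: Experiments on six datasets show that the method outperforms all baselines on average, including a Deep Research agent and a carefully designed prompt-based method, while cutting search time by over 60% and reducing costly tool calls.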

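Insight: The core innovation is training a multimodal planning agent to decide dynamically which path through the mRAG pipeline to execute, optimizing the trade-off between efficiency and effectiveness. Viewed objectively, bringing planning into multimodal RAG systems for adaptive pipeline control is a promising direction.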

Abstract: Visual Question-Answering (VQA) is a challenging multimodal task that requires integrating visual and textual information to generate accurate responses. While multimodal Retrieval-Augmented Generation (mRAG) has shown promise in enhancing VQA systems by providing more evidence on both image and text sides, the default procedure that addresses VQA queries, especially the knowledge-intensive ones, often relies on multi-stage pipelines of mRAG with inherent dependencies. To mitigate this inefficiency while maintaining VQA task performance, this paper proposes a method that trains a multimodal planning agent, dynamically decomposing the mRAG pipeline to solve the VQA task. Our method optimizes the trade-off between efficiency and effectiveness by training the agent to intelligently determine the necessity of each mRAG step. In our experiments, the agent can help reduce redundant computations, cutting search time by over 60% compared to existing methods and decreasing costly tool calls. Meanwhile, experiments demonstrate that our method outperforms all baselines, including a Deep Research agent and a carefully designed prompt-based method, on average over six diverse datasets. Code will be released.


[28] AgentLongBench: A Controllable Long Benchmark For Long-Contexts Agents via Environment Rollouts cs.CL

Shicheng Fang, Yuxin Wang, XiaoRan Liu, Jiahao Lu, Chuanyuan Tan

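TL;DR: This paper introduces AgentLongBench, a controllable benchmark that evaluates long-context agents through simulated environment rollouts. It generates interaction trajectories from Lateral Thinking Puzzles across knowledge-intensive and knowledge-free scenarios to test agents' dynamic information synthesis and non-linear reasoning. Experiments find that current SOTA models do well on static retrieval but degrade markedly on dynamic workflows, a degradation driven mainly by the minimum number of tokens required to resolve a query.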

Details

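Motivation: Existing benchmarks are largely static and rely on passive retrieval tasks, so they cannot simulate the complexity of agent-environment interaction (such as non-linear reasoning and iterative feedback); a benchmark is needed that evaluates agents in long-context, dynamic environments.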

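Result: Experiments with SOTA models and memory systems over contexts of 32K to 4M tokens show that agents perform well on static retrieval but degrade markedly on dynamic information-synthesis workflows, and that the degradation is tied to the minimum number of tokens required to resolve a query.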

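Insight: The novelty is building a dynamic interaction benchmark from environment simulation and Lateral Thinking Puzzles. It reveals that high-information-density tool responses are harder for agents than the memory fragmentation of long dialogues, and highlights the minimum token count as a key performance factor.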

Abstract: The evolution of Large Language Models (LLMs) into autonomous agents necessitates the management of extensive, dynamic contexts. Current benchmarks, however, remain largely static, relying on passive retrieval tasks that fail to simulate the complexities of agent-environment interaction, such as non-linear reasoning and iterative feedback. To address this, we introduce AgentLongBench, which evaluates agents through simulated environment rollouts based on Lateral Thinking Puzzles. This framework generates rigorous interaction trajectories across knowledge-intensive and knowledge-free scenarios. Experiments with state-of-the-art models and memory systems (32K to 4M tokens) expose a critical weakness: while adept at static retrieval, agents struggle with the dynamic information synthesis essential for workflows. Our analysis indicates that this degradation is driven by the minimum number of tokens required to resolve a query. This factor explains why the high information density inherent in massive tool responses poses a significantly greater challenge than the memory fragmentation typical of long-turn dialogues.


[29] Persona Prompting as a Lens on LLM Social Reasoning cs.CL

Jing Yang, Moritz Hechtbauer, Elisabeth Khalilov, Evelyn Luise Brinkmann, Vera Schmitt

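TL;DR: This paper studies how Persona Prompting affects the explanations large language models generate on socially sensitive tasks such as hate speech detection. It finds that persona prompting improves classification performance but degrades explanation quality, that simulated personas fail to align with real-world demographic groups, and that models remain biased.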

Details

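Motivation: To examine how persona prompting shapes LLM-generated explanations on socially sensitive tasks, in order to assess its effect on model bias and human alignment.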

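Result: Evaluation across three LLMs shows that persona prompting improves classification on the most subjective task (hate speech detection) but lowers explanation quality; simulated personas fail to align with their real-world counterparts; and models show consistent biases and a tendency to over-flag content as harmful.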

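Insight: Persona prompting may trade explanation quality for classification performance and does not effectively mitigate model bias, so it should be applied with caution; the misalignment between simulated personas and real populations shows that models respond only weakly to social factors.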

Abstract: For socially sensitive tasks like hate speech detection, the quality of explanations from Large Language Models (LLMs) is crucial for factors like user trust and model alignment. While Persona prompting (PP) is increasingly used as a way to steer models towards user-specific generation, its effect on model rationales remains underexplored. We investigate how LLM-generated rationales vary when conditioned on different simulated demographic personas. Using datasets annotated with word-level rationales, we measure agreement with human annotations from different demographic groups, and assess the impact of PP on model bias and human alignment. Our evaluation across three LLMs reveals three key findings: (1) PP improves classification on the most subjective task (hate speech) but degrades rationale quality. (2) Simulated personas fail to align with their real-world demographic counterparts, and high inter-persona agreement shows models are resistant to significant steering. (3) Models exhibit consistent demographic biases and a strong tendency to over-flag content as harmful, regardless of PP. Our findings reveal a critical trade-off: while PP can improve classification in socially-sensitive tasks, it often comes at the cost of rationale quality and fails to mitigate underlying biases, urging caution in its application.


[30] SERA: Soft-Verified Efficient Repository Agents cs.CL | cs.LG | cs.SE

Ethan Shen, Danny Tormoen, Saurabh Shah, Ali Farhadi, Tim Dettmers

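TL;DR: This paper proposes SERA (Soft-Verified Efficient Repository Agents), an efficient method for training coding agents that makes it fast and cheap to create agents specialized to private codebases. Using only supervised finetuning (SFT), it achieves state-of-the-art performance among fully open-source models while matching frontier open-weight models such as Devstral-Small-2.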

Details

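Motivation: To overcome the high cost and complexity of specializing open-weight coding agents to private codebases, turning this theoretical advantage into a practical one.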

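Result: SERA achieves SOTA results among fully open-source (open data, method, code) models and matches frontier open-weight models such as Devstral-Small-2. Reaching equivalent performance is 26x cheaper than reinforcement learning and 57x cheaper than previous synthetic data methods.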

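Insight: The novelty is Soft Verified Generation (SVG), which generates thousands of trajectories from a single code repository and, combined with its cost-efficiency, enables specialization to private codebases. Applying SVG to a larger corpus of codebases yields over 200,000 synthetic trajectories, which the authors use to provide detailed analysis of scaling laws, ablations, and confounding factors for training coding agents.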

Abstract: Open-weight coding agents should hold a fundamental advantage over closed-source systems: they can be specialized to private codebases, encoding repository-specific information directly in their weights. Yet the cost and complexity of training has kept this advantage theoretical. We show it is now practical. We present Soft-Verified Efficient Repository Agents (SERA), an efficient method for training coding agents that enables the rapid and cheap creation of agents specialized to private codebases. Using only supervised finetuning (SFT), SERA achieves state-of-the-art results among fully open-source (open data, method, code) models while matching the performance of frontier open-weight models like Devstral-Small-2. Creating SERA models is 26x cheaper than reinforcement learning and 57x cheaper than previous synthetic data methods to reach equivalent performance. Our method, Soft Verified Generation (SVG), generates thousands of trajectories from a single code repository. Combined with cost-efficiency, this enables specialization to private codebases. Beyond repository specialization, we apply SVG to a larger corpus of codebases, generating over 200,000 synthetic trajectories. We use this dataset to provide detailed analysis of scaling laws, ablations, and confounding factors for training coding agents. Overall, we believe our work will greatly accelerate research on open coding agents and showcase the advantage of open-source models that can specialize to private codebases. We release SERA as the first model in Ai2’s Open Coding Agents series, along with all our code, data, and Claude Code integration to support the research community.


[31] Dissecting Multimodal In-Context Learning: Modality Asymmetries and Circuit Dynamics in modern Transformers cs.CL | cs.LG

Yiran Huang, Karsten Roth, Quentin Bouniot, Wenjia Xu, Zeynep Akata

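TL;DR: Through controlled experiments on small Transformer models, this paper studies the mechanisms of multimodal in-context learning (ICL), in particular the asymmetry between modalities. It finds that once a model is pretrained on high-diversity data from a primary modality, very low data complexity in the secondary modality suffices for multimodal ICL, and that the mechanism relies on cross-modal, induction-style label-copying circuits.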

Details

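Motivation: To investigate how Transformer models learn to associate information across modalities from in-context examples, in order to understand the mechanisms behind ICL in multimodal large language models.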

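Result: In experiments on synthetic classification tasks, Rotary Position Embeddings (RoPE) raise the data complexity threshold for ICL, and the multimodal setting shows a learning asymmetry: low data complexity in the secondary modality is enough to trigger multimodal ICL.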

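Insight: The paper reveals the asymmetric learning behavior between modalities in multimodal ICL and explains the circuit dynamics in terms of an induction mechanism, establishing a mechanistic foundation for understanding multimodal ICL in modern Transformers and introducing a controlled testbed.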

Abstract: Transformer-based multimodal large language models often exhibit in-context learning (ICL) abilities. Motivated by this phenomenon, we ask: how do transformers learn to associate information across modalities from in-context examples? We investigate this question through controlled experiments on small transformers trained on synthetic classification tasks, enabling precise manipulation of data statistics and model architecture. We begin by revisiting core principles of unimodal ICL in modern transformers. While several prior findings replicate, we find that Rotary Position Embeddings (RoPE) increases the data complexity threshold for ICL. Extending to the multimodal setting reveals a fundamental learning asymmetry: when pretrained on high-diversity data from a primary modality, surprisingly low data complexity in the secondary modality suffices for multimodal ICL to emerge. Mechanistic analysis shows that both settings rely on an induction-style mechanism that copies labels from matching in-context exemplars; multimodal training refines and extends these circuits across modalities. Our findings provide a mechanistic foundation for understanding multimodal ICL in modern transformers and introduce a controlled testbed for future investigation.


cs.CV

[32] Size Matters: Reconstructing Real-Scale 3D Models from Monocular Images for Food Portion Estimation cs.CV | cs.AI | cs.LG | cs.MM

Gautham Vinod, Bruce Coburn, Siddeshwar Raghavan, Jiangpeng He, Fengqing Zhu

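TL;DR: This paper proposes a method for reconstructing real-scale 3D models from monocular images for accurate food portion estimation. It uses rich visual features extracted from models trained on large-scale datasets to estimate the scale of the reconstructed object, converting single-view 3D reconstructions into physically meaningful, true-to-scale models. Experiments on two public datasets show that it substantially outperforms existing techniques.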

Details

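Motivation: To solve the ill-posed problem of recovering real food size (portion) from monocular images so as to support accurate dietary intake monitoring in precision nutrition, addressing the failure of existing 3D reconstruction methods to recover real-world scale.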

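Result: Extensive experiments and ablation studies on two public datasets show that the method consistently outperforms existing techniques, reducing mean absolute volume-estimation error by nearly 30%.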

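Insight: The novelty is learning and recovering the real scale of the reconstructed object from visual features extracted by large-scale pretrained models, bridging 3D computer vision and digital health and achieving physically meaningful real-scale 3D reconstruction.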

Abstract: The rise of chronic diseases related to diet, such as obesity and diabetes, emphasizes the need for accurate monitoring of food intake. While AI-driven dietary assessment has made strides in recent years, the ill-posed nature of recovering size (portion) information from monocular images for accurate estimation of "how much did you eat?" is a pressing challenge. Some 3D reconstruction methods have achieved impressive geometric reconstruction but fail to recover the crucial real-world scale of the reconstructed object, limiting its usage in precision nutrition. In this paper, we bridge the gap between 3D computer vision and digital health by proposing a method that recovers a true-to-scale 3D reconstructed object from a monocular image. Our approach leverages rich visual features extracted from models trained on large-scale datasets to estimate the scale of the reconstructed object. This learned scale enables us to convert single-view 3D reconstructions into true-to-life, physically meaningful models. Extensive experiments and ablation studies on two publicly available datasets show that our method consistently outperforms existing techniques, achieving nearly a 30% reduction in mean absolute volume-estimation error, showcasing its potential to enhance the domain of precision nutrition. Code: https://gitlab.com/viper-purdue/size-matters
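
A short worked example shows why recovering scale matters so much for portion estimation: volume grows with the cube of the linear scale factor, so small scale errors become large volume errors. The function names and numbers below are illustrative assumptions and do not come from the paper.

```python
from typing import Sequence


def real_volume(unit_volume: float, scale: float) -> float:
    """Volume of a unit-scale reconstruction after applying a linear scale factor."""
    return unit_volume * scale ** 3


def mean_absolute_error(pred: Sequence[float], true: Sequence[float]) -> float:
    """Mean absolute volume-estimation error over a set of objects."""
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(true)


# A reconstruction of volume 2.0 (arbitrary units) scaled by a factor of 3.
scaled = real_volume(2.0, 3.0)
# A 10% overestimate of scale inflates volume by about 33%.
inflation = real_volume(1.0, 1.1)
mae = mean_absolute_error([54.0, 10.0], [50.0, 12.0])
```

This cubic sensitivity is why a geometrically accurate but scale-free reconstruction is of limited use for estimating how much food was eaten.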


[33] DiSa: Saliency-Aware Foreground-Background Disentangled Framework for Open-Vocabulary Semantic Segmentation cs.CV

Zhen Yao, Xin Li, Taotao Jing, Shuai Zhang, Mooi Choo Chuah

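TL;DR: DiSa is a novel saliency-aware foreground-background disentangled framework that addresses the foreground bias and limited spatial localization that open-vocabulary semantic segmentation methods inherit from vision-language models such as CLIP. It models foreground and background features separately with a Saliency-aware Disentanglement Module and refines features at the pixel and channel level with a Hierarchical Refinement Module, improving segmentation performance.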

Details

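Motivation: Existing open-vocabulary semantic segmentation methods built on vision-language models such as CLIP have two key limitations: foreground bias (a tendency to ignore background regions) and limited spatial localization (which blurs object boundaries). DiSa aims to resolve them by explicitly disentangling foreground and background features.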

Result: Extensive experiments on six benchmarks show that DiSa consistently outperforms state-of-the-art methods in open-vocabulary semantic segmentation.

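Insight: The innovations are (1) a Saliency-aware Disentanglement Module that models foreground and background features separately in a divide-and-conquer manner, and (2) a Hierarchical Refinement Module that leverages pixel-wise spatial context and performs channel-wise feature refinement through multi-level updates. Together they offer a new way to handle the inherent bias of VLMs in dense prediction.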

Abstract: Open-vocabulary semantic segmentation aims to assign labels to every pixel in an image based on text labels. Existing approaches typically utilize vision-language models (VLMs), such as CLIP, for dense prediction. However, VLMs, pre-trained on image-text pairs, are biased toward salient, object-centric regions and exhibit two critical limitations when adapted to segmentation: (i) Foreground Bias, which tends to ignore background regions, and (ii) Limited Spatial Localization, resulting in blurred object boundaries. To address these limitations, we introduce DiSa, a novel saliency-aware foreground-background disentangled framework. By explicitly incorporating saliency cues in our designed Saliency-aware Disentanglement Module (SDM), DiSa separately models foreground and background ensemble features in a divide-and-conquer manner. Additionally, we propose a Hierarchical Refinement Module (HRM) that leverages pixel-wise spatial contexts and enables channel-wise feature refinement through multi-level updates. Extensive experiments on six benchmarks demonstrate that DiSa consistently outperforms state-of-the-art methods.


[34] Semi-Supervised Masked Autoencoders: Unlocking Vision Transformer Potential with Limited Data cs.CV | cs.AI | cs.LG

Atik Faysal, Mohammad Rostami, Reihaneh Gh. Roshan, Nikhil Muralidhar, Huaxia Wang

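TL;DR: This paper proposes the Semi-Supervised Masked Autoencoder (SSMAE), a framework for training Vision Transformers (ViTs) when labeled data is scarce but unlabeled data is abundant. SSMAE jointly optimizes masked image reconstruction and classification using unlabeled and labeled samples with dynamically selected pseudo-labels, and introduces a validation-driven gating mechanism that activates pseudo-labeling only once the model makes reliable, high-confidence predictions that are consistent across differently augmented views of an image, reducing confirmation bias.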

Details

Motivation: To address the poor training of Vision Transformers when labeled data is limited by exploiting abundant unlabeled data to improve model performance.

Result: On CIFAR-10 and CIFAR-100, SSMAE consistently outperforms supervised ViT and fine-tuned MAE, with the largest gains in low-label regimes (+9.24% over ViT on CIFAR-10 with 10% labels).

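Insight: The innovations are the joint optimization of reconstruction and classification, dynamic pseudo-label selection, and the validation-driven gating mechanism. The analysis indicates that when pseudo-labels are introduced matters as much as how they are generated, offering a new perspective on data-efficient Transformer training.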

Abstract: We address the challenge of training Vision Transformers (ViTs) when labeled data is scarce but unlabeled data is abundant. We propose Semi-Supervised Masked Autoencoder (SSMAE), a framework that jointly optimizes masked image reconstruction and classification using both unlabeled and labeled samples with dynamically selected pseudo-labels. SSMAE introduces a validation-driven gating mechanism that activates pseudo-labeling only after the model achieves reliable, high-confidence predictions that are consistent across both weakly and strongly augmented views of the same image, reducing confirmation bias. On CIFAR-10 and CIFAR-100, SSMAE consistently outperforms supervised ViT and fine-tuned MAE, with the largest gains in low-label regimes (+9.24% over ViT on CIFAR-10 with 10% labels). Our results demonstrate that when pseudo-labels are introduced is as important as how they are generated for data-efficient transformer training. Codes are available at https://github.com/atik666/ssmae.
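
The gating idea in the abstract can be sketched as a two-level filter: pseudo-labeling stays off until validation accuracy clears a bar, and then only samples whose weak-view and strong-view predictions agree with high confidence are kept. The threshold values, argument names, and the use of validation accuracy as the reliability signal are assumptions for illustration, not the authors' settings.

```python
from typing import Dict, List, Sequence


def select_pseudo_labels(
    weak_probs: Sequence[Sequence[float]],
    strong_probs: Sequence[Sequence[float]],
    val_accuracy: float,
    val_gate: float = 0.7,          # hypothetical validation threshold
    conf_threshold: float = 0.95,   # hypothetical confidence threshold
) -> Dict[int, int]:
    """Return {sample index: pseudo-label} for samples that pass the gate."""
    if val_accuracy < val_gate:
        return {}  # gate closed: the model is not yet reliable enough
    selected: Dict[int, int] = {}
    for i, (pw, ps) in enumerate(zip(weak_probs, strong_probs)):
        label_w = max(range(len(pw)), key=pw.__getitem__)
        label_s = max(range(len(ps)), key=ps.__getitem__)
        confident = min(pw[label_w], ps[label_s]) >= conf_threshold
        if label_w == label_s and confident:
            selected[i] = label_w
    return selected


weak: List[List[float]] = [[0.97, 0.03], [0.60, 0.40], [0.02, 0.98]]
strong: List[List[float]] = [[0.96, 0.04], [0.70, 0.30], [0.99, 0.01]]
early = select_pseudo_labels(weak, strong, val_accuracy=0.5)
late = select_pseudo_labels(weak, strong, val_accuracy=0.8)
```

Early in training the gate returns nothing, so noisy pseudo-labels cannot reinforce the model's own mistakes; later, only consistent and confident predictions are admitted.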


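[35] Sparse CLIP: Co-Optimizing Interpretability and Performance in Contrastive Learning cs.CV

Chuan Qin, Constantin Venhoff, Sonia Joseph, Fanyi Xiao, Stefan Scherer

TL;DR: This paper proposes Sparse CLIP, which integrates sparsity constraints directly into CLIP's contrastive training to produce vision-language representations that are both performant and interpretable. It maintains strong downstream performance, exceeds post-hoc sparse autoencoders (SAEs) in interpretability, and retains CLIP's inherent multimodal capabilities.

Details

Motivation: Despite CLIP's success in vision-language representation learning, its dense and opaque latent representations pose significant interpretability challenges. The common assumption is that interpretability and performance trade off: enforcing sparsity during training degrades accuracy, while existing post-hoc methods such as SAEs degrade downstream performance and lose multimodal capabilities. This paper addresses that tension directly.

Result: Compared with SAEs, Sparse CLIP representations preserve strong downstream performance, achieve better interpretability, and retain multimodal capabilities. The sparse multimodal features enable straightforward semantic concept alignment and reveal the training dynamics of how cross-modal knowledge emerges.

Insight: The core contribution is integrating sparsity constraints directly into CLIP's contrastive training, which co-optimizes interpretability and performance and challenges the view that one must be sacrificed for the other. This suggests a design principle for future models: interpretability and high performance can be pursued together.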

Abstract: Contrastive Language-Image Pre-training (CLIP) has become a cornerstone in vision-language representation learning, powering diverse downstream tasks and serving as the default vision backbone in multimodal large language models (MLLMs). Despite its success, CLIP’s dense and opaque latent representations pose significant interpretability challenges. A common assumption is that interpretability and performance are in tension: enforcing sparsity during training degrades accuracy, motivating recent post-hoc approaches such as Sparse Autoencoders (SAEs). However, these post-hoc approaches often suffer from degraded downstream performance and loss of CLIP’s inherent multimodal capabilities, with most learned features remaining unimodal. We propose a simple yet effective approach that integrates sparsity directly into CLIP training, yielding representations that are both interpretable and performant. Compared to SAEs, our Sparse CLIP representations preserve strong downstream task performance, achieve superior interpretability, and retain multimodal capabilities. We show that multimodal sparse features enable straightforward semantic concept alignment and reveal training dynamics of how cross-modal knowledge emerges. Finally, as a proof of concept, we train a vision-language model on sparse CLIP representations that enables interpretable, vision-based steering capabilities. Our findings challenge conventional wisdom that interpretability requires sacrificing accuracy and demonstrate that interpretability and performance can be co-optimized, offering a promising design principle for future models.


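[36] NucFuseRank: Dataset Fusion and Performance Ranking for Nuclei Instance Segmentation cs.CV

Nima Torbati, Anastasia Meshcheryakova, Ramona Woitek, Sepideh Hatamikia, Diana Mechtcheriakova

TL;DR: This paper presents NucFuseRank, a dataset fusion and performance ranking framework for nuclei instance segmentation in H&E-stained images. Based on a literature review, the authors collect and standardize multiple public datasets, then systematically evaluate and rank them using two state-of-the-art segmentation models (one CNN-based, one hybrid CNN-ViT). They also create a unified test set (NucFuse-test) for fair cross-dataset evaluation and a unified training set (NucFuse-train), built by merging images from multiple datasets, to improve segmentation performance.

Details

Motivation: Nuclei instance segmentation research mostly develops new algorithms and benchmarks them on a few arbitrarily selected public datasets, without systematically evaluating the datasets themselves. This paper takes the dataset perspective to address the lack of a standardized, unified evaluation benchmark.

Result: The study systematically evaluates and ranks multiple public H&E-stained nuclei instance segmentation datasets. NucFuse-test and NucFuse-train provide a new benchmark for training, testing, and evaluating nuclei instance segmentation models.

Insight: The novelty lies in shifting the focus from model development to the datasets: standardizing, evaluating, ranking, and fusing multiple datasets yields a fairer and more comprehensive benchmark. This gives the field a reproducible, comparable test platform and shows the potential of dataset fusion for improving model performance.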

Abstract: Nuclei instance segmentation in hematoxylin and eosin (H&E)-stained images plays an important role in automated histological image analysis, with various applications in downstream tasks. While several machine learning and deep learning approaches have been proposed for nuclei instance segmentation, most research in this field focuses on developing new segmentation algorithms and benchmarking them on a limited number of arbitrarily selected public datasets. In this work, rather than focusing on model development, we focused on the datasets used for this task. Based on an extensive literature review, we identified manually annotated, publicly available datasets of H&E-stained images for nuclei instance segmentation and standardized them into a unified input and annotation format. Using two state-of-the-art segmentation models, one based on convolutional neural networks (CNNs) and one based on a hybrid CNN and vision transformer architecture, we systematically evaluated and ranked these datasets based on their nuclei instance segmentation performance. Furthermore, we proposed a unified test set (NucFuse-test) for fair cross-dataset evaluation and a unified training set (NucFuse-train) for improved segmentation performance by merging images from multiple datasets. By evaluating and ranking the datasets, performing comprehensive analyses, generating fused datasets, conducting external validation, and making our implementation publicly available, we provided a new benchmark for training, testing, and evaluating nuclei instance segmentation models on H&E-stained histological images.


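[37] Look in the Middle: Structural Anchor Pruning for Scalable Visual RAG Indexing cs.CV | cs.CL | cs.IR

Zhuchenyang Liu, Ziyu Hu, Yao Zhang, Yu Xiao

TL;DR: This paper proposes Structural Anchor Pruning (SAP), a training-free method that achieves high compression in visual document retrieval while preserving retrieval performance. By identifying key visual patches from middle layers, it compresses index vectors by over 90% on the ViDoRe benchmark with robust retrieval quality.

Details

Motivation: Index vectors produced by vision-language models for visual document retrieval are too large to scale, and existing training-free pruning methods perform poorly at high compression.

Result: On the ViDoRe benchmark, SAP reduces index vector size by over 90% while maintaining robust retrieval fidelity, providing a highly scalable solution for Visual RAG.

Insight: The novelty is pruning by semantic structural anchor patches identified in middle layers rather than the final layer, where structural signals dissipate. The paper also introduces the Oracle Score Retention (OSR) protocol to evaluate how layer-wise information affects compression efficiency.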

Abstract: Recent Vision-Language Models (e.g., ColPali) enable fine-grained Visual Document Retrieval (VDR) but incur prohibitive index vector size overheads. Training-free pruning solutions (e.g., EOS-attention based methods) can reduce index vector size by approximately 60% without model adaptation, but often underperform random selection in high-compression scenarios (> 80%). Prior research (e.g., Light-ColPali) attributes this to the conclusion that visual token importance is inherently query-dependent, thereby questioning the feasibility of training-free pruning. In this work, we propose Structural Anchor Pruning (SAP), a training-free pruning method that identifies key visual patches from middle layers to achieve high-performance compression. We also introduce the Oracle Score Retention (OSR) protocol to evaluate how layer-wise information affects compression efficiency. Evaluations on the ViDoRe benchmark demonstrate that SAP reduces index vectors by over 90% while maintaining robust retrieval fidelity, providing a highly scalable solution for Visual RAG. Furthermore, our OSR-based analysis reveals that semantic structural anchor patches persist in the middle layers, unlike traditional pruning solutions that focus on the final layer where structural signals dissipate.
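As a rough illustration of what index pruning at a fixed compression rate involves, the sketch below keeps the highest-scoring patch vectors of one page. How SAP derives its anchor scores from middle layers is not specified in the abstract, so `anchor_scores` is a hypothetical input, and the function name and `keep_ratio` parameter are assumptions.

```python
import math

def prune_index(patch_vectors, anchor_scores, keep_ratio=0.1):
    """Keep the top-scoring fraction of patch vectors for one document page.

    anchor_scores is assumed to be one importance value per patch, computed
    elsewhere (for example from a middle layer); it is a placeholder here.
    """
    k = max(1, math.ceil(len(patch_vectors) * keep_ratio))
    ranked = sorted(range(len(patch_vectors)),
                    key=lambda i: anchor_scores[i], reverse=True)
    kept = sorted(ranked[:k])  # restore the original patch order
    return [patch_vectors[i] for i in kept], kept
```

Keeping 10% of the patches corresponds to the 90% reduction in index vectors quoted above; because the scores do not depend on the query, the pruning can be applied once at indexing time.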


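[38] Efficient Token Pruning for LLaDA-V cs.CV

Zhewen Wan, Tianchen Song, Chen Lin, Zhiyong Zhao, Xianpeng Lang

TL;DR: To address the heavy computational cost of the diffusion-based large multimodal model LLaDA-V, this paper proposes a structured visual token pruning strategy. An attention analysis shows that LLaDA-V aggregates cross-modal information mainly in middle-to-late layers, so the method selectively removes a proportion of visual tokens at the middle-to-late layers of the first denoising step, reducing FLOPs while preserving critical semantic information.

Details

Motivation: The bidirectional attention and iterative denoising of diffusion-based large multimodal models such as LLaDA-V cause visual tokens to be processed repeatedly across all layers and denoising steps, which adds significant computational overhead. The paper aims for efficient inference through token pruning.

Result: Across multiple benchmarks, the best configuration reduces computational cost by up to 65% while preserving an average of 95% of task performance.

Insight: According to the authors, this is the first study of structured token pruning in diffusion-based large multimodal models. To match the model's delayed attention aggregation, pruning targets the middle-to-late layers of the first denoising step, unlike FastV, which prunes in shallow layers. The work provides an empirical basis for efficient inference in diffusion-based multimodal models and highlights the potential of vision-aware pruning.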

Abstract: Diffusion-based large multimodal models, such as LLaDA-V, have demonstrated impressive capabilities in vision-language understanding and generation. However, their bidirectional attention mechanism and diffusion-style iterative denoising paradigm introduce significant computational overhead, as visual tokens are repeatedly processed across all layers and denoising steps. In this work, we conduct an in-depth attention analysis and reveal that, unlike autoregressive decoders, LLaDA-V aggregates cross-modal information predominantly in middle-to-late layers, leading to delayed semantic alignment. Motivated by this observation, we propose a structured token pruning strategy inspired by FastV, selectively removing a proportion of visual tokens at designated layers to reduce FLOPs while preserving critical semantic information. To the best of our knowledge, this is the first work to investigate structured token pruning in diffusion-based large multimodal models. Unlike FastV, which focuses on shallow-layer pruning, our method targets the middle-to-late layers of the first denoising step, aligning with LLaDA-V’s delayed attention aggregation to maintain output quality; pruning at the first step also reduces computation across all subsequent steps. Our framework provides an empirical basis for efficient LLaDA-V inference and highlights the potential of vision-aware pruning in diffusion-based multimodal models. Across multiple benchmarks, our best configuration reduces computational cost by up to 65% while preserving an average of 95% of task performance.


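[39] TeleStyle: Content-Preserving Style Transfer in Images and Videos cs.CV

Shiwen Zhang, Xiaoyan Yang, Bojia Zi, Haibin Huang, Chi Zhang

TL;DR: This paper presents TeleStyle, a lightweight yet effective model for image and video style transfer that addresses the entanglement of content and style features in Diffusion Transformers (DiTs). Built on Qwen-Image-Edit and trained on a hybrid dataset with a Curriculum Continual Learning framework, it generalizes to unseen styles while preserving content fidelity, and achieves state-of-the-art performance on three core metrics: style similarity, content consistency, and aesthetic quality.

Details

Motivation: Content-preserving style transfer is difficult for DiTs because content and style features are entangled in their internal representations. The goal is to generate stylized outputs from content and style references while preserving the content well.

Result: State-of-the-art performance on the three core evaluation metrics: style similarity, content consistency, and aesthetic quality.

Insight: Contributions include leveraging the base capabilities of Qwen-Image-Edit for content preservation and style customization; a Curriculum Continual Learning framework over a hybrid dataset of clean (curated) and noisy (synthetic) triplets to improve generalization; and a video-to-video stylization module that improves temporal consistency and visual quality.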

Abstract: Content-preserving style transfer, generating stylized outputs based on content and style references, remains a significant challenge for Diffusion Transformers (DiTs) due to the inherent entanglement of content and style features in their internal representations. In this technical report, we present TeleStyle, a lightweight yet effective model for both image and video stylization. Built upon Qwen-Image-Edit, TeleStyle leverages the base model’s robust capabilities in content preservation and style customization. To facilitate effective training, we curated a high-quality dataset of distinct specific styles and further synthesized triplets using thousands of diverse, in-the-wild style categories. We introduce a Curriculum Continual Learning framework to train TeleStyle on this hybrid dataset of clean (curated) and noisy (synthetic) triplets. This approach enables the model to generalize to unseen styles without compromising precise content fidelity. Additionally, we introduce a video-to-video stylization module to enhance temporal consistency and visual quality. TeleStyle achieves state-of-the-art performance across three core evaluation metrics: style similarity, content consistency, and aesthetic quality. Code and pre-trained models are available at https://github.com/Tele-AI/TeleStyle


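[40] Automated Marine Biofouling Assessment: Benchmarking Computer Vision and Multimodal LLMs on the Level of Fouling Scale cs.CV

Brayden Hamilton, Tim Cashmore, Peter Driscoll, Trevor Gee, Henry Williams

TL;DR: This paper studies automated classification of vessel biofouling severity on the Level of Fouling (LoF) scale using computer vision models and multimodal large language models (LLMs). Convolutional neural networks, transformer-based segmentation, and zero-shot LLMs are evaluated on an expert-labelled dataset from the New Zealand Ministry for Primary Industries. Computer vision models are highly accurate at extreme LoF categories but struggle with intermediate levels because of dataset imbalance and image framing, whereas LLMs guided by structured prompts and retrieval achieve competitive performance without training and provide interpretable outputs. The paper suggests that hybrid methods integrating segmentation coverage with LLM reasoning are a promising path to scalable, interpretable biofouling assessment.

Details

Motivation: Marine biofouling on vessel hulls poses major ecological, economic, and biosecurity risks. Traditional surveys rely on diver inspections, which are hazardous and hard to scale. The study explores automated classification of fouling severity with computer vision and LLMs to overcome these limitations.

Result: On the expert-labelled LoF dataset, computer vision models are accurate at the extreme LoF categories (for example, very clean or very heavily fouled) but weak at intermediate levels. Zero-shot LLMs guided by structured prompts and retrieval achieve competitive performance without training and provide interpretable text outputs. The results indicate that hybrid methods combining segmentation coverage with LLM reasoning have potential.

Insight: The paper systematically compares conventional computer vision models with multimodal LLMs on marine biofouling assessment. Its central observation is that zero-shot LLMs, given suitable prompting and retrieval, can reach competitive performance in this domain while remaining interpretable. This points toward scalable, trustworthy automated assessment that needs little labelled data by combining the region-level analysis of vision models with the semantic reasoning of LLMs.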

Abstract: Marine biofouling on vessel hulls poses major ecological, economic, and biosecurity risks. Traditional survey methods rely on diver inspections, which are hazardous and limited in scalability. This work investigates automated classification of biofouling severity on the Level of Fouling (LoF) scale using both custom computer vision models and large multimodal language models (LLMs). Convolutional neural networks, transformer-based segmentation, and zero-shot LLMs were evaluated on an expert-labelled dataset from the New Zealand Ministry for Primary Industries. Computer vision models showed high accuracy at extreme LoF categories but struggled with intermediate levels due to dataset imbalance and image framing. LLMs, guided by structured prompts and retrieval, achieved competitive performance without training and provided interpretable outputs. The results demonstrate complementary strengths across approaches and suggest that hybrid methods integrating segmentation coverage with LLM reasoning offer a promising pathway toward scalable and interpretable biofouling assessment.


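[41] Feature Projection Learning for Better Vision-Language Reasoning cs.CV

Yi Zhang, Weicheng Lin, Liang-Jie Zhang

TL;DR: This paper proposes Feature Projection Learning (FPL), a simple and efficient method for adapting pre-trained vision-language models such as CLIP to downstream tasks. A projection model projects class prototype features into the query image feature space and reconstructs the feature map, which turns classification into a feature projection problem. The final prediction combines the outputs of the projection model and the original CLIP.

Details

Motivation: Existing methods for adapting VLP models such as CLIP to downstream tasks suffer from limited performance, too many learnable parameters, or long training times, all of which limit their effectiveness.

Result: Comprehensive empirical evaluation shows that FPL surpasses current state-of-the-art methods in accuracy by a substantial margin.

Insight: The novelty is recasting classification as feature projection, using the reconstruction error as the class score and fusing it with the original CLIP prediction, which gives a parameter-efficient adaptation strategy with strong performance.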

Abstract: Vision-Language Pre-Trained models, notably CLIP, that utilize contrastive learning have proven highly adept at extracting generalizable visual features. To inherit the well-learned knowledge of VLP models for downstream tasks, several approaches aim to adapt them efficiently with limited supervision. However, these methods suffer from limited performance, excessive learnable parameters, or extended training times, all of which hinder their effectiveness in adapting the CLIP model to downstream tasks. In this work, we propose a simple yet efficient and effective method called Feature Projection Learning (FPL) to address these problems. Specifically, we develop a projection model that projects class prototype features into the query image feature space and reconstructs the query image feature map. The negative average squared reconstruction error is used as the class score. In this way, we transform the classification problem into a feature projection problem. The final output of this method is a combination of the prediction from the projection model and the original pre-trained CLIP. Comprehensive empirical evaluations confirm that FPL delivers superior accuracy, surpassing the current state-of-the-art methods by a substantial margin.
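The scoring rule described in the abstract (negative average squared reconstruction error, combined with the CLIP prediction) can be sketched as follows. The paper's projection model is learned; this sketch swaps in a plain orthogonal projection onto a single prototype vector per class, and the function names and the `alpha` mixing weight are assumptions.

```python
def projection_score(feature_map, prototype):
    """Negative mean squared error of reconstructing each local feature
    from one class prototype (orthogonal projection used as a stand-in)."""
    pp = sum(p * p for p in prototype)
    total = 0.0
    for q in feature_map:
        coef = sum(a * b for a, b in zip(q, prototype)) / pp
        total += sum((a - coef * b) ** 2 for a, b in zip(q, prototype))
    return -total / len(feature_map)

def classify(feature_map, prototypes, clip_logits, alpha=1.0):
    """Add the projection score to the original CLIP logit for each class."""
    scores = [clip + alpha * projection_score(feature_map, proto)
              for clip, proto in zip(clip_logits, prototypes)]
    return max(range(len(scores)), key=scores.__getitem__), scores
```

A class whose prototype reconstructs the query features well receives a score near zero, while a poor reconstruction is penalized in proportion to its squared error.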


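[42] Visual Prompt-Agnostic Evolution cs.CV

Junze Wang, Lei Fan, Dezheng Zhang, Weipeng Jing, Donglin Di

TL;DR: This paper proposes Prompt-Agnostic Evolution (PAE) to address training instability, gradient oscillation, and cross-layer mismatch in Visual Prompt Tuning (VPT). PAE initializes task-aware prompts from a frequency-domain perspective, coordinates updates across layers with a shared Koopman operator, and adds a regularizer based on Lyapunov stability theory that constrains error amplification, which stabilizes training and improves performance.

Details

Motivation: Existing VPT methods often show unstable training dynamics in the form of gradient oscillations: shallow-layer prompts stagnate early while deeper-layer prompts oscillate with high variance. The resulting cross-layer mismatch slows convergence and degrades final performance.

Result: Extensive experiments on 25 datasets across multiple downstream tasks show that PAE speeds up convergence by 1.41× on average and improves accuracy by 1-3%.

Insight: Contributions are frequency-domain prompt initialization that exploits the frequency shortcut patterns inherent to the backbone, a shared Koopman operator that gives coordinated linear evolution across layers, and a stability-theory regularizer that controls training dynamics. The method is a general, lightweight framework that requires no change to the backbone or to inference and integrates with a range of VPT variants, improving training stability and efficiency.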

Abstract: Visual Prompt Tuning (VPT) adapts a frozen Vision Transformer (ViT) to downstream tasks by inserting a small number of learnable prompt tokens into the token sequence at each layer. However, we observe that existing VPT variants often suffer from unstable training dynamics, characterized by gradient oscillations. A layer-wise analysis reveals that shallow-layer prompts tend to stagnate early, while deeper-layer prompts exhibit high-variance oscillations, leading to cross-layer mismatch. These issues slow convergence and degrade final performance. To address these challenges, we propose Prompt-Agnostic Evolution (PAE), which strengthens vision prompt tuning by explicitly modeling prompt dynamics. From a frequency-domain perspective, we initialize prompts in a task-aware direction by uncovering and propagating frequency shortcut patterns that the backbone inherently exploits for recognition. To ensure coherent evolution across layers, we employ a shared Koopman operator that imposes a global linear transformation instead of uncoordinated, layer-specific updates. Finally, inspired by Lyapunov stability theory, we introduce a regularizer that constrains error amplification during evolution. Extensive experiments show that PAE accelerates convergence with an average 1.41× speedup and improves accuracy by 1–3% on 25 datasets across multiple downstream tasks. Beyond performance, PAE is prompt-agnostic and lightweight, and it integrates seamlessly with diverse VPT variants without backbone modification or inference-time changes.
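One way to read the shared-operator and stability ideas is as a single linear map applied between layers, with a penalty that keeps that map from amplifying errors. The formulation below is an illustrative assumption, not the authors' loss: the symbols $p^{(l)}$ (prompt at layer $l$), $K$ (shared operator), $e^{(l)}$ (a perturbation of the prompt), and the penalty $\mathcal{R}$ are introduced here only for exposition.

```latex
% Shared linear evolution of the layer-wise prompts
p^{(l+1)} = K\,p^{(l)}, \qquad l = 1, \dots, L-1

% A perturbation that evolves under the same map is bounded by the spectral norm of K
e^{(l+1)} = K\,e^{(l)}
\;\Longrightarrow\;
\|e^{(l+1)}\|_2 \le \|K\|_2 \,\|e^{(l)}\|_2

% One hypothetical regularizer that discourages amplification
\mathcal{R}(K) = \max\bigl(0,\ \|K\|_2 - 1\bigr)
```

When $\|K\|_2 \le 1$ the penalty vanishes and the perturbation cannot grow from one layer to the next, which is the kind of constraint on error amplification the abstract describes.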


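[43] Reversible Efficient Diffusion for Image Fusion cs.CV

Xingxin Xu, Bing Cao, DongDong Li, Qinghua Hu, Pengfei Zhu

TL;DR: This paper proposes the Reversible Efficient Diffusion (RED) model for image fusion, which targets the detail loss that diffusion models suffer in fusion tasks because noise errors accumulate along the Markov process. With an explicitly supervised training framework, the model inherits the generative capability of diffusion models while avoiding distribution estimation, improving computational efficiency while keeping high visual fidelity.

Details

Motivation: Multi-modal image fusion consolidates complementary information from different source images into a unified representation. Existing diffusion models applied to fusion often lose detail and produce degraded results because of the noise error accumulation inherent in the Markov process, while adding explicit supervision to end-to-end training raises computational efficiency challenges.

Result: The abstract reports no specific quantitative results, benchmarks, or performance levels (such as SOTA).

Insight: The novelty is an explicitly supervised Reversible Efficient Diffusion (RED) training framework that keeps the generative strengths of diffusion models while improving efficiency by avoiding distribution estimation, which offers a way to balance detail preservation against computational cost in image fusion.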

Abstract: Multi-modal image fusion aims to consolidate complementary information from diverse source images into a unified representation. The fused image is expected to preserve fine details and maintain high visual fidelity. While diffusion models have demonstrated impressive generative capabilities in image generation, they often suffer from detail loss when applied to image fusion tasks. This issue arises from the accumulation of noise errors inherent in the Markov process, leading to inconsistency and degradation in the fused results. However, incorporating explicit supervision into end-to-end training of diffusion-based image fusion introduces challenges related to computational efficiency. To address these limitations, we propose the Reversible Efficient Diffusion (RED) model, an explicitly supervised training framework that inherits the powerful generative capability of diffusion models while avoiding distribution estimation.


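[44] Hallucination Begins Where Saliency Drops cs.CV

Xiaofeng Zhang, Yuanchao Zhu, Chaochen Gu, Xiaosong Yuan, Qiyan Zhao

TL;DR: This paper proposes LVLMs-Saliency, a framework that fuses attention weights with input gradients to quantify the visual grounding strength of each output token in large vision-language models (LVLMs). It identifies a key pattern: hallucinations arise when preceding tokens show low saliency toward the prediction of the next token. Based on this, it proposes two inference-time mechanisms to mitigate hallucination, Saliency-Guided Rejection Sampling (SGRS) and Local Coherence Reinforcement (LocoRE).

Details

Motivation: Existing methods rely only on forward-pass attention patterns and ignore gradient-based signals, so they cannot reliably distinguish hallucinated outputs from factually grounded ones. A gradient-aware diagnostic framework is needed to better understand and mitigate hallucination in LVLMs.

Result: Extensive experiments across multiple LVLMs show that the method significantly reduces hallucination rates while preserving fluency and task performance, offering a robust and interpretable way to improve model reliability.

Insight: The approach fuses gradient signals with attention patterns to quantify visual grounding strength and finds a decisive link between hallucination and a drop in the saliency of preceding tokens. The resulting SGRS and LocoRE mechanisms are lightweight, plug-and-play inference-time interventions that counteract contextual forgetting.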

Abstract: Recent studies have examined attention dynamics in large vision-language models (LVLMs) to detect hallucinations. However, existing approaches remain limited in reliably distinguishing hallucinated from factually grounded outputs, as they rely solely on forward-pass attention patterns and neglect gradient-based signals that reveal how token influence propagates through the network. To bridge this gap, we introduce LVLMs-Saliency, a gradient-aware diagnostic framework that quantifies the visual grounding strength of each output token by fusing attention weights with their input gradients. Our analysis uncovers a decisive pattern: hallucinations frequently arise when preceding output tokens exhibit low saliency toward the prediction of the next token, signaling a breakdown in contextual memory retention. Leveraging this insight, we propose a dual-mechanism inference-time framework to mitigate hallucinations: (1) Saliency-Guided Rejection Sampling (SGRS), which dynamically filters candidate tokens during autoregressive decoding by rejecting those whose saliency falls below a context-adaptive threshold, thereby preventing coherence-breaking tokens from entering the output sequence; and (2) Local Coherence Reinforcement (LocoRE), a lightweight, plug-and-play module that strengthens attention from the current token to its most recent predecessors, actively counteracting the contextual forgetting behavior identified by LVLMs-Saliency. Extensive experiments across multiple LVLMs demonstrate that our method significantly reduces hallucination rates while preserving fluency and task performance, offering a robust and interpretable solution for enhancing model reliability. Code is available at: https://github.com/zhangbaijin/LVLMs-Saliency
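The two ingredients named in the abstract, a saliency score that fuses attention with gradients and a rejection rule with a context-adaptive threshold, can be sketched as below. This is a minimal illustration under stated assumptions: the fusion as a mean of absolute attention-gradient products, the threshold as a fraction of the running mean saliency, and the `ratio` parameter are all hypothetical choices, not the authors' definitions.

```python
def token_saliency(attention, gradient):
    """Fuse attention weights with their gradients for one candidate token.

    attention[i] and gradient[i] refer to the i-th preceding token; the mean
    of |a * g| is an assumed fusion rule used only for illustration.
    """
    return sum(abs(a * g) for a, g in zip(attention, gradient)) / len(attention)

def accept_candidate(candidate_saliency, history, ratio=0.5):
    """Reject a candidate whose saliency falls below a context-adaptive threshold.

    history holds the saliency of tokens accepted so far; the threshold is a
    hypothetical fraction (ratio) of their mean.
    """
    if not history:
        return True
    threshold = ratio * sum(history) / len(history)
    return candidate_saliency >= threshold
```

In a decoding loop, a rejected candidate would be replaced by the next-best token and re-scored, so that tokens weakly tied to the preceding context do not enter the output sequence.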


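[45] Artifact-Aware Evaluation for High-Quality Video Generation cs.CV

Chen Zhu, Jiashu Zhu, Yanxun Li, Meiqi Wu, Bingze Song

TL;DR: This paper proposes an artifact-aware evaluation method for high-quality video generation. Focusing on three key dimensions (Appearance, Motion, and Camera), it defines a taxonomy of 10 common artifact categories, builds GenVID, a large-scale dataset of 80k generated videos, and develops DVAR, a Dense Video Artifact Recognition framework for fine-grained artifact detection and classification.

Details

Motivation: Existing evaluation methods for video generation usually give only coarse quality scores without localizing or categorizing specific artifacts, which falls short of what rapidly improving video generation requires.

Result: Extensive experiments show that the approach significantly improves artifact detection accuracy and enables effective filtering of low-quality content.

Insight: The novelty is a comprehensive evaluation protocol and artifact taxonomy grounded in human perception (Appearance, Motion, Camera), together with a large annotated dataset that supports training fine-grained evaluation models, enabling automated and detailed assessment of generated video quality.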

Abstract: With the rapid advancement of video generation techniques, evaluating and auditing generated videos has become increasingly crucial. Existing approaches typically offer coarse video quality scores, lacking detailed localization and categorization of specific artifacts. In this work, we introduce a comprehensive evaluation protocol focusing on three key aspects affecting human perception: Appearance, Motion, and Camera. We define these axes through a taxonomy of 10 prevalent artifact categories reflecting common generative failures observed in video generation. To enable robust artifact detection and categorization, we introduce GenVID, a large-scale dataset of 80k videos generated by various state-of-the-art video generation models, each carefully annotated for the defined artifact categories. Leveraging GenVID, we develop DVAR, a Dense Video Artifact Recognition framework for fine-grained identification and classification of generative artifacts. Extensive experiments show that our approach significantly improves artifact detection accuracy and enables effective filtering of low-quality content.


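[46] Physically Guided Visual Mass Estimation from a Single RGB Image cs.CV | cs.AI

Sungjae Lee, Junhan Jeong, Yeonjoo Hong, Kwang In Kim

TL;DR: This paper proposes a physically guided framework for estimating mass from a single RGB image. It recovers object-centric 3D geometry from the image to inform volume and extracts coarse material semantics to guide density reasoning, fuses these representations with an instance-adaptive gating mechanism, and predicts volume- and density-related factors through two separate regression heads, achieving more accurate mass estimation under mass-only supervision.

Details

Motivation: Estimating object mass from an RGB image is ill-posed because mass depends on geometric volume and material density, neither of which is directly observable from RGB appearance. Physically meaningful representations are needed to constrain the solution space.

Result: Experiments on image2mass and ABO-500 show that the method consistently outperforms state-of-the-art methods.

Insight: The novelty is aligning visual cues with the physical factors that govern mass (volume and density): geometry comes from monocular depth estimation, material semantics from a vision-language model, and the two are fused by instance-adaptive gating, giving physically structured mass prediction that needs only mass supervision.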

Abstract: Estimating object mass from visual input is challenging because mass depends jointly on geometric volume and material-dependent density, neither of which is directly observable from RGB appearance. Consequently, mass prediction from pixels is ill-posed and therefore benefits from physically meaningful representations to constrain the space of plausible solutions. We propose a physically structured framework for single-image mass estimation that addresses this ambiguity by aligning visual cues with the physical factors governing mass. From a single RGB image, we recover object-centric three-dimensional geometry via monocular depth estimation to inform volume and extract coarse material semantics using a vision-language model to guide density-related reasoning. These geometry, semantic, and appearance representations are fused through an instance-adaptive gating mechanism, and two physically guided latent factors (volume- and density-related) are predicted through separate regression heads under mass-only supervision. Experiments on image2mass and ABO-500 show that the proposed method consistently outperforms state-of-the-art methods.
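The physical relation behind the two regression heads is mass = volume × density, which becomes additive in log space. The sketch below assumes the two heads output log-scale factors and that the loss is a squared error on log mass; both choices, and the function names, are illustrative assumptions rather than details taken from the paper.

```python
import math

def predict_mass(log_volume_factor, log_density_factor):
    """Combine the two latent factors: mass = volume * density."""
    return math.exp(log_volume_factor + log_density_factor)

def mass_only_loss(log_volume_factor, log_density_factor, true_mass):
    """Squared error in log space; only the sum of the two factors is supervised."""
    return (log_volume_factor + log_density_factor - math.log(true_mass)) ** 2
```

Because the loss sees only the sum, mass supervision alone cannot separate the two factors; in the described framework that separation has to come from the geometry and material cues fed to each head.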


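[47] Structure-constrained Language-informed Diffusion Model for Unpaired Low-dose Computed Tomography Angiography Reconstruction cs.CV | cs.AI

Genyuan Zhang, Zihao Wang, Zhifan Gao, Lei Xu, Zhen Zhou

TL;DR: This paper proposes a Structure-constrained Language-informed Diffusion Model (SLDM) to address inaccurate enhancement caused by incompletely paired images in low-dose CT angiography reconstruction. The model integrates structural prior information and semantic supervision with spatial intelligence to keep the enhancement structurally consistent, and uses a subtraction angiography enhancement module to improve contrast, achieving accurate enhancement for low-dose contrast medium CT angiography.

Details

Motivation: Existing deep learning methods for generating CT images from low-dose iodinated contrast media struggle with incompletely paired images, mainly because the models have limited ability to recognize specific structures, so enhancement is inaccurate. The paper combines structural constraints with language guidance to improve low-dose CT angiography reconstruction.

Result: Qualitative visual comparisons and quantitative results on several metrics demonstrate the effectiveness of the method for low-dose contrast medium CT angiography reconstruction, in particular improved structural consistency and enhancement accuracy.

Insight: Contributions are: 1) structural prior information that constrains model inference and keeps enhancement structurally consistent; 2) a semantic supervision strategy with spatial intelligence that integrates visual perception and spatial reasoning; 3) a subtraction angiography enhancement module that improves contrast in the contrast medium region. Together they offer a structure-aware, semantically guided approach to medical image generation.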

Abstract: The application of iodinated contrast media (ICM) improves the sensitivity and specificity of computed tomography (CT) for a wide range of clinical indications. However, overdose of ICM can cause problems such as kidney damage and life-threatening allergic reactions. Deep learning methods can generate CT images of normal-dose ICM from low-dose ICM, reducing the required dose while maintaining diagnostic power. However, existing methods struggle to achieve accurate enhancement with incompletely paired images, mainly because of the limited ability of the model to recognize specific structures. To overcome this limitation, we propose a Structure-constrained Language-informed Diffusion Model (SLDM), a unified medical generation model that integrates structural synergy and spatial intelligence. First, the structural prior information of the image is effectively extracted to constrain the model inference process, thus ensuring structural consistency in the enhancement process. Subsequently, a semantic supervision strategy with spatial intelligence is introduced, which integrates the functions of visual perception and spatial reasoning, thus prompting the model to achieve accurate enhancement. Finally, the subtraction angiography enhancement module is applied, which serves to improve the contrast of the ICM agent region to a suitable interval for observation. Qualitative analysis of visual comparison and quantitative results of several metrics demonstrate the effectiveness of our method in angiographic reconstruction for low-dose contrast medium CT angiography.


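[48] OSDEnhancer: Taming Real-World Space-Time Video Super-Resolution with One-Step Diffusion cs.CV | cs.GR

Shuoyan Wei, Feng Li, Chen Zhou, Runmin Cong, Yao Zhao

TL;DR: This paper proposes OSDEnhancer, a framework based on a one-step diffusion process that, according to the authors, is the first to achieve real-world space-time video super-resolution (STVSR) under complex unknown degradations. It initializes spatiotemporal structures with linear pre-interpolation, trains a mixture of experts for temporal refinement and spatial enhancement, and adds a bidirectional deformable VAE decoder for recurrent spatiotemporal aggregation and propagation.

Details

Motivation: Diffusion models perform well in video super-resolution, but their potential for STVSR remains underexplored, especially in real-world scenarios with complex degradations where high-resolution content must be recovered from low resolution while the frame rate is raised with coherent temporal dynamics. Existing methods mostly rely on simplified degradation assumptions and struggle in practice.

Result: Experiments show that the method achieves state-of-the-art performance in real-world scenarios while maintaining superior generalization.

Insight: Contributions include real-world STVSR through an efficient one-step diffusion process; a linear pre-interpolation strategy that initializes spatiotemporal structures; a temporal refinement and spatial enhancement mixture of experts, in which separate expert pathways learn robust representations for temporal coherence and spatial detail and reinforce each other at inference; and a bidirectional deformable VAE decoder for recurrent spatiotemporal aggregation and propagation, which improves cross-frame reconstruction fidelity.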

Abstract: Diffusion models (DMs) have demonstrated exceptional success in video super-resolution (VSR), showcasing a powerful capacity for generating fine-grained details. However, their potential for space-time video super-resolution (STVSR), which necessitates not only recovering realistic visual content from low-resolution to high-resolution but also improving the frame rate with coherent temporal dynamics, remains largely underexplored. Moreover, existing STVSR methods predominantly address spatiotemporal upsampling under simplified degradation assumptions, which often struggle in real-world scenarios with complex unknown degradations. Such a high demand for reconstruction fidelity and temporal consistency makes the development of a robust STVSR framework particularly non-trivial. To address these challenges, we propose OSDEnhancer, a novel framework that, to the best of our knowledge, represents the first method to achieve real-world STVSR through an efficient one-step diffusion process. OSDEnhancer initializes essential spatiotemporal structures through a linear pre-interpolation strategy and pivots on training temporal refinement and spatial enhancement mixture of experts (TR-SE MoE), which allows distinct expert pathways to progressively learn robust, specialized representations for temporal coherence and spatial detail, further collaboratively reinforcing each other during inference. A bidirectional deformable variational autoencoder (VAE) decoder is further introduced to perform recurrent spatiotemporal aggregation and propagation, enhancing cross-frame reconstruction fidelity. Experiments demonstrate that the proposed method achieves state-of-the-art performance while maintaining superior generalization capability in real-world scenarios.
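The abstract does not detail the linear pre-interpolation strategy, so the following is only a sketch of the temporal half of such an initialization: blending neighbouring frames linearly to raise the frame rate before any learned refinement. Frames are represented as flat lists of pixel values, and the function name and `factor` parameter are assumptions.

```python
def linear_pre_interpolate(frames, factor=2):
    """Insert factor-1 linearly blended frames between each consecutive pair.

    frames: list of frames, each a flat list of pixel values of equal length.
    """
    out = []
    for prev, nxt in zip(frames, frames[1:]):
        for k in range(factor):
            t = k / factor  # t = 0 reproduces the earlier frame
            out.append([(1.0 - t) * a + t * b for a, b in zip(prev, nxt)])
    out.append(list(frames[-1]))
    return out
```

The blended frames are blurry wherever there is motion, which is why such an initialization would only supply coarse structure for later stages to refine.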


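[49] GVGS: Gaussian Visibility-Aware Multi-View Geometry for Accurate Surface Reconstruction cs.CV

Mai Su, Qihan Yu, Zhongtao Wang, Yilong Li, Chengwei Pan

TL;DR: This paper proposes GVGS to address accurate surface reconstruction with 3D Gaussian Splatting. It introduces a Gaussian visibility-aware multi-view geometric consistency constraint that aggregates the visibility of shared Gaussian primitives across views for more accurate and stable geometric supervision, and a progressive quadtree-calibrated monocular depth constraint that performs block-wise affine calibration from coarse to fine spatial scales, mitigating the scale ambiguity of depth priors while preserving detail.

Details

Motivation: Existing methods improve surface reconstruction by refining Gaussian depth estimates through multi-view geometric consistency or monocular depth priors. The former becomes unreliable under large geometric discrepancies, and the latter suffers from scale ambiguity and local inconsistency, so Gaussian depth supervision ends up inaccurate. The paper aims to overcome these limitations.

Result: Extensive experiments on the DTU and TNT datasets show consistent improvements in geometric accuracy over prior Gaussian-based and implicit surface reconstruction methods.

Insight: The novelty is combining a visibility-aware multi-view geometric constraint with a progressively calibrated monocular depth constraint: the first makes multi-view supervision more robust, and the second addresses the scale problem of monocular depth priors, yielding more accurate surface reconstruction.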

Abstract: 3D Gaussian Splatting enables efficient optimization and high-quality rendering, yet accurate surface reconstruction remains challenging. Prior methods improve surface reconstruction by refining Gaussian depth estimates, either via multi-view geometric consistency or through monocular depth priors. However, multi-view constraints become unreliable under large geometric discrepancies, while monocular priors suffer from scale ambiguity and local inconsistency, ultimately leading to inaccurate Gaussian depth supervision. To address these limitations, we introduce a Gaussian visibility-aware multi-view geometric consistency constraint that aggregates the visibility of shared Gaussian primitives across views, enabling more accurate and stable geometric supervision. In addition, we propose a progressive quadtree-calibrated monocular depth constraint that performs block-wise affine calibration from coarse to fine spatial scales, mitigating the scale ambiguity of depth priors while preserving fine-grained surface details. Extensive experiments on DTU and TNT datasets demonstrate consistent improvements in geometric accuracy over prior Gaussian-based and implicit surface reconstruction methods. Code is available at an anonymous repository: https://github.com/GVGScode/GVGS.
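Block-wise affine calibration can be illustrated with a least-squares fit of a scale and a shift per image block. The sketch below is an assumption-laden illustration, not the paper's procedure: the reference depth `ref` (for example a rendered depth map), the square block layout, and both function names are hypothetical.

```python
def fit_affine(mono, ref):
    """Least-squares scale a and shift b so that a * mono + b approximates ref."""
    n = len(mono)
    mean_m = sum(mono) / n
    mean_r = sum(ref) / n
    var = sum((m - mean_m) ** 2 for m in mono)
    if var == 0.0:
        return 1.0, mean_r - mean_m  # flat block: apply a shift only
    cov = sum((m - mean_m) * (r - mean_r) for m, r in zip(mono, ref))
    a = cov / var
    return a, mean_r - a * mean_m

def calibrate_blocks(mono, ref, block):
    """Calibrate a monocular depth map against ref, one square block at a time."""
    h, w = len(mono), len(mono[0])
    out = [row[:] for row in mono]
    for r0 in range(0, h, block):
        for c0 in range(0, w, block):
            cells = [(r, c) for r in range(r0, min(r0 + block, h))
                     for c in range(c0, min(c0 + block, w))]
            a, b = fit_affine([mono[r][c] for r, c in cells],
                              [ref[r][c] for r, c in cells])
            for r, c in cells:
                out[r][c] = a * mono[r][c] + b
    return out
```

A coarse-to-fine schedule would call `calibrate_blocks` with progressively smaller `block` values, so that early calibration fixes the global scale and later calibration follows local structure.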


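[50] MMSF: Multitask and Multimodal Supervised Framework for WSI Classification and Survival Analysis cs.CV

Chengying She, Chengwei Chen, Xinran Zhang, Ben Wang, Lizhuang Liu

TL;DR: This paper proposes MMSF, a multitask and multimodal supervised framework that integrates whole slide images (WSIs) and clinical data in computational pathology for joint cancer classification and survival analysis. It explicitly decomposes and fuses cross-modal information through a graph feature extraction module, a clinical data embedding module, a feature fusion module, and a Mamba-based multiple instance learning (MIL) encoder, which lets it handle heterogeneous data effectively.

Details

Motivation: In computational pathology, gigapixel WSIs capture tumor morphology and patient-level clinical descriptors provide prognostic context, but their feature spaces differ in statistics and scale, which makes integrating these heterogeneous signals challenging.

Result: On CAMELYON16 and TCGA-NSCLC, MMSF improves accuracy by 2.1-6.6% and AUC by 2.2-6.9% over competitive baselines. On five TCGA survival cohorts, it improves the C-index by 7.1-9.8% over unimodal methods and by 5.6-7.1% over multimodal alternatives.

Insight: The novelty is a linear-complexity MIL framework that explicitly decomposes and fuses cross-modal information, combining graph feature extraction, clinical data standardization, feature alignment, and a Mamba-based MIL encoder to integrate WSI topology with clinical data and improve multitask prediction.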

Abstract: Multimodal evidence is critical in computational pathology: gigapixel whole slide images capture tumor morphology, while patient-level clinical descriptors preserve complementary context for prognosis. Integrating such heterogeneous signals remains challenging because feature spaces exhibit distinct statistics and scales. We introduce MMSF, a multitask and multimodal supervised framework built on a linear-complexity MIL backbone that explicitly decomposes and fuses cross-modal information. MMSF comprises a graph feature extraction module embedding tissue topology at the patch level, a clinical data embedding module standardizing patient attributes, a feature fusion module aligning modality-shared and modality-specific representations, and a Mamba-based MIL encoder with multitask prediction heads. Experiments on CAMELYON16 and TCGA-NSCLC demonstrate 2.1–6.6% accuracy and 2.2–6.9% AUC improvements over competitive baselines, while evaluations on five TCGA survival cohorts yield 7.1–9.8% C-index improvements compared with unimodal methods and 5.6–7.1% over multimodal alternatives.
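The survival results above are reported as C-index improvements. For readers unfamiliar with the metric, the sketch below computes Harrell's concordance index, a common definition; the abstract does not state which variant the authors use, so treating it as Harrell's C is an assumption.

```python
def concordance_index(times, events, risks):
    """Harrell's C: fraction of comparable pairs ranked correctly by risk.

    times: observed time per patient; events: 1 if the event was observed,
    0 if censored; risks: predicted risk (higher means earlier expected event).
    """
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when i had an observed event before j's time.
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5  # tied risks count as half
    return concordant / comparable
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking; the function raises `ZeroDivisionError` when no pair is comparable, for example when every patient is censored.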


[51] PalmBridge: A Plug-and-Play Feature Alignment Framework for Open-Set Palmprint Verification cs.CVPDF

Chenke Zhang, Ziyuan Yang, Licheng Yan, Shuyi Li, Andrew Beng Jin Teoh

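TL;DR: This paper proposes PalmBridge, a plug-and-play feature alignment framework that addresses feature distribution shifts caused by heterogeneous deployment conditions in open-set palmprint verification. Built on vector quantization, the method learns a compact set of representative vectors from training features. During enrollment and verification, each feature vector is mapped to its nearest representative vector and the two are blended, suppressing nuisance variation induced by domain shift while retaining discriminative identity information.

Details

Motivation: Existing deep palmprint models typically assume a closed and stationary distribution, so they overfit to dataset-specific textures instead of learning domain-invariant representations. Data augmentation is widely used, but it assumes that augmented samples approximate the target deployment distribution, an assumption that often fails under significant domain mismatch.

Result: Experiments on multiple palmprint datasets and backbone architectures show that PalmBridge consistently reduces the equal error rate (EER) in intra-dataset open-set evaluation and improves cross-dataset generalization, with negligible to modest runtime overhead.

Insight: The novelty lies in a vector-quantization-based feature-space alignment method that jointly optimizes the representative vectors with the backbone network (combining task supervision, a feature-consistency objective, and orthogonality regularization) to build a stable, well-structured shared embedding space. Rather than relying on data-level augmentation, the method aligns features directly in feature space and can be viewed as a lightweight domain adaptation technique.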

Abstract: Palmprint recognition is widely used in biometric systems, yet real-world performance often degrades due to feature distribution shifts caused by heterogeneous deployment conditions. Most deep palmprint models assume a closed and stationary distribution, leading to overfitting to dataset-specific textures rather than learning domain-invariant representations. Although data augmentation is commonly used to mitigate this issue, it assumes augmented samples can approximate the target deployment distribution, an assumption that often fails under significant domain mismatch. To address this limitation, we propose PalmBridge, a plug-and-play feature-space alignment framework for open-set palmprint verification based on vector quantization. Rather than relying solely on data-level augmentation, PalmBridge learns a compact set of representative vectors directly from training features. During enrollment and verification, each feature vector is mapped to its nearest representative vector under a minimum-distance criterion, and the mapped vector is then blended with the original vector. This design suppresses nuisance variation induced by domain shifts while retaining discriminative identity cues. The representative vectors are jointly optimized with the backbone network using task supervision, a feature-consistency objective, and an orthogonality regularization term to form a stable and well-structured shared embedding space. Furthermore, we analyze feature-to-representative mappings via assignment consistency and collision rate to assess the model’s sensitivity to blending weights. Experiments on multiple palmprint datasets and backbone architectures show that PalmBridge consistently reduces EER in intra-dataset open-set evaluation and improves cross-dataset generalization with negligible to modest runtime overhead.
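
The map-then-blend step described in the abstract might look roughly like the sketch below. It assumes squared Euclidean distance as the minimum-distance criterion and a convex blend controlled by a weight `alpha`; the abstract does not specify either, and the function name `blend_with_nearest` is hypothetical.

```python
import numpy as np

def blend_with_nearest(features, codebook, alpha=0.5):
    """Map each feature to its nearest representative vector, then blend.

    features: (N, D) array of feature vectors.
    codebook: (K, D) array of representative vectors.
    alpha:    blending weight on the representative vector (assumed form).
    """
    # Squared Euclidean distance between every feature and every code: (N, K).
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = dists.argmin(axis=1)
    blended = alpha * codebook[idx] + (1.0 - alpha) * features
    return blended, idx
```

With `alpha = 0` the features pass through unchanged, and with `alpha = 1` they collapse onto the codebook, so the weight trades identity detail against suppression of domain-specific variation.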


[52] Everything in Its Place: Benchmarking Spatial Intelligence of Text-to-Image Models cs.CVPDF

Zengbin Wang, Xuecai Hu, Yong Wang, Feng Xiong, Man Zhang

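TL;DR: This paper introduces SpatialGenEval, a benchmark for systematically evaluating the spatial intelligence of text-to-image (T2I) models. It comprises 1,230 long, information-dense prompts across 25 real-world scenes and covers 10 spatial sub-domains. An evaluation of 21 state-of-the-art models finds that higher-order spatial reasoning remains the main bottleneck. The authors also build the SpatialT2I dataset; fine-tuning foundation models such as Stable Diffusion-XL on it improves performance and yields more realistic spatial relations.

Details

Motivation: Current T2I models generate high-fidelity images but often fail on complex spatial relationships such as spatial perception, reasoning, or interaction. Existing benchmarks overlook these aspects because their prompts are short or information-sparse.

Result: Evaluating 21 state-of-the-art models on SpatialGenEval shows that higher-order spatial reasoning is the bottleneck. Fine-tuning foundation models (Stable Diffusion-XL, Uniworld-V1, OmniGen2) on SpatialT2I yields performance gains of +4.2%, +5.7%, and +4.4%, along with more realistic spatial relations.

Insight: The contributions include an information-dense long-prompt benchmark for systematically evaluating spatial intelligence and a dataset that improves spatial reasoning through a data-centric paradigm. Viewed objectively, the work underscores the importance of prompt design and data quality for the spatial capabilities of T2I models.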

Abstract: Text-to-image (T2I) models have achieved remarkable success in generating high-fidelity images, but they often fail in handling complex spatial relationships, e.g., spatial perception, reasoning, or interaction. These critical aspects are largely overlooked by current benchmarks due to their short or information-sparse prompt design. In this paper, we introduce SpatialGenEval, a new benchmark designed to systematically evaluate the spatial intelligence of T2I models, covering two key aspects: (1) SpatialGenEval involves 1,230 long, information-dense prompts across 25 real-world scenes. Each prompt integrates 10 spatial sub-domains and corresponding 10 multi-choice question-answer pairs, ranging from object position and layout to occlusion and causality. Our extensive evaluation of 21 state-of-the-art models reveals that higher-order spatial reasoning remains a primary bottleneck. (2) To demonstrate that the utility of our information-dense design goes beyond simple evaluation, we also construct the SpatialT2I dataset. It contains 15,400 text-image pairs with rewritten prompts to ensure image consistency while preserving information density. Fine-tuned results on current foundation models (i.e., Stable Diffusion-XL, Uniworld-V1, OmniGen2) yield consistent performance gains (+4.2%, +5.7%, +4.4%) and more realistic effects in spatial relations, highlighting a data-centric paradigm to achieve spatial intelligence in T2I models.


[53] Let’s Roll a BiFTA: Bi-refinement for Fine-grained Text-visual Alignment in Vision-Language Models cs.CV | cs.AIPDF

Yuhao Sun, Chengyi Cai, Jiacheng Zhang, Zesheng Ye, Xingliang Yuan

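TL;DR: This paper proposes BiFTA, which removes redundant information from fine-grained text-visual alignment on both sides, through view refinement and description refinement, to improve the zero-shot performance of pre-trained vision-language models such as CLIP.

Details

Motivation: Methods that align fine-grained text descriptions with localized image patches are often limited because both contain redundant information. This paper aims to address that problem.

Result: BiFTA achieves superior zero-shot performance on 6 benchmark datasets for both ViT-based and ResNet-based CLIP models.

Insight: The novelty lies in removing redundancy on both the image and text sides: view refinement removes redundant image patches with high IoU ratios, and description refinement removes redundant text descriptions with high cosine similarity, making the alignment more distinctive and diverse.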

Abstract: Recent research has shown that aligning fine-grained text descriptions with localized image patches can significantly improve the zero-shot performance of pre-trained vision-language models (e.g., CLIP). However, we find that both fine-grained text descriptions and localized image patches often contain redundant information, making text-visual alignment less effective. In this paper, we tackle this issue from two perspectives: View Refinement and Description Refinement, termed Bi-refinement for Fine-grained Text-visual Alignment (BiFTA). View refinement removes redundant image patches with high Intersection over Union (IoU) ratios, resulting in more distinctive visual samples. Description refinement removes redundant text descriptions with high pairwise cosine similarity, ensuring greater diversity in the remaining descriptions. BiFTA achieves superior zero-shot performance on 6 benchmark datasets for both ViT-based and ResNet-based CLIP, justifying the necessity to remove redundant information in visual-text alignment.
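
The two refinement steps can be sketched as greedy filters. The sketch assumes boxes in (x1, y1, x2, y2) format, precomputed description embeddings, and a keep-first greedy rule with fixed thresholds; none of these details come from the abstract, and the function names are hypothetical.

```python
import numpy as np

def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def filter_views(boxes, thr=0.5):
    """Keep a patch only if its IoU with every kept patch is at most thr."""
    kept = []
    for i, box in enumerate(boxes):
        if all(iou(box, boxes[j]) <= thr for j in kept):
            kept.append(i)
    return kept

def filter_descriptions(embeddings, thr=0.9):
    """Keep a description only if its cosine similarity to every kept one is at most thr."""
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    kept = []
    for i in range(len(unit)):
        if all(float(unit[i] @ unit[j]) <= thr for j in kept):
            kept.append(i)
    return kept
```

Both filters return the indices of the retained items, so the surviving patches and descriptions can be passed on to whatever alignment scoring follows.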


[54] Quartet of Diffusions: Structure-Aware Point Cloud Generation through Part and Symmetry Guidance cs.CVPDF

Chenliang Zhou, Fangcheng Zhong, Weihao Xia, Albert Miao, Canberk Baykal

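TL;DR: This paper proposes the Quartet of Diffusions, a structure-aware framework for point cloud generation. It coordinates four diffusion models that learn the distributions of global shape latents, symmetries, semantic parts, and their spatial assembly, so that part composition and symmetry are modeled explicitly during generation.

Details

Motivation: Existing methods treat shape generation as a holistic process or support only part composition, and lack a unified model of symmetry and part structure. This paper addresses the gap by disentangling the generative process, enabling fine-grained control over shape attributes while preserving global structural consistency.

Result: Experiments show that the method achieves state-of-the-art performance on point cloud generation.

Insight: The main novelty is that, according to the authors, this is the first 3D point cloud generation framework to fully integrate and enforce both symmetry and part priors throughout the generative process. The structured four-diffusion pipeline provides guaranteed symmetry, coherent part placement, and diverse, high-quality outputs, while making generation more interpretable and controllable.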

Abstract: We introduce the Quartet of Diffusions, a structure-aware point cloud generation framework that explicitly models part composition and symmetry. Unlike prior methods that treat shape generation as a holistic process or only support part composition, our approach leverages four coordinated diffusion models to learn distributions of global shape latents, symmetries, semantic parts, and their spatial assembly. This structured pipeline guarantees symmetry and yields coherent part placement and diverse, high-quality outputs. By disentangling the generative process into interpretable components, our method supports fine-grained control over shape attributes, enabling targeted manipulation of individual parts while preserving global consistency. A central global latent further reinforces structural coherence across assembled parts. Our experiments show that the Quartet achieves state-of-the-art performance. To the best of our knowledge, this is the first 3D point cloud generation framework that fully integrates and enforces both symmetry and part priors throughout the generative process.


[55] Youtu-Parsing: Perception, Structuring and Recognition via High-Parallelism Decoding cs.CVPDF

Kun Yin, Yunfei Wu, Bing Liu, Zhongpeng Cai, Xiaotian Li

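TL;DR: This paper presents Youtu-Parsing, an efficient and versatile document parsing model for high-performance content extraction. It uses a native Vision Transformer with a dynamic-resolution visual encoder to extract shared document features, paired with a prompt-guided Youtu-LLM-2B language model for layout analysis and region-prompted decoding. Building on this decoupled, feature-reusable framework, it introduces a high-parallelism decoding strategy comprising token parallelism and query parallelism, which substantially speeds up decoding and achieves state-of-the-art performance on the OmniDocBench and olmOCR-bench benchmarks.

Details

Motivation: The work targets the decoding-efficiency bottleneck of conventional document parsing models, especially on highly structured content such as tables, while improving parsing capability and robustness across diverse document elements (text, formulas, charts, and so on).

Result: The model achieves state-of-the-art performance on OmniDocBench and olmOCR-bench. Token parallelism yields a 5–11x speedup over traditional autoregressive decoding, and query parallelism adds a further 2x acceleration while keeping output quality unchanged.

Insight: The novelty lies in the high-parallelism decoding strategy (token parallelism and query parallelism), combined with a dynamic-resolution visual encoder and a prompt-guided language model to balance feature sharing and efficient decoding. The decoupled framework and region-prompted decoding mechanism can improve both the efficiency and the generalization of structured document processing.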

Abstract: This paper presents Youtu-Parsing, an efficient and versatile document parsing model designed for high-performance content extraction. The architecture employs a native Vision Transformer (ViT) featuring a dynamic-resolution visual encoder to extract shared document features, coupled with a prompt-guided Youtu-LLM-2B language model for layout analysis and region-prompted decoding. Leveraging this decoupled and feature-reusable framework, we introduce a high-parallelism decoding strategy comprising two core components: token parallelism and query parallelism. The token parallelism strategy concurrently generates up to 64 candidate tokens per inference step, which are subsequently validated through a verification mechanism. This approach yields a 5–11x speedup over traditional autoregressive decoding and is particularly well-suited for highly structured scenarios, such as table recognition. To further exploit the advantages of region-prompted decoding, the query parallelism strategy enables simultaneous content prediction for multiple bounding boxes (up to five), providing an additional 2x acceleration while maintaining output quality equivalent to standard decoding. Youtu-Parsing encompasses a diverse range of document elements, including text, formulas, tables, charts, seals, and hierarchical structures. Furthermore, the model exhibits strong robustness when handling rare characters, multilingual text, and handwritten content. Extensive evaluations demonstrate that Youtu-Parsing achieves state-of-the-art (SOTA) performance on both the OmniDocBench and olmOCR-bench benchmarks. Overall, Youtu-Parsing demonstrates significant experimental value and practical utility for large-scale document intelligence applications.


[56] MARE: Multimodal Alignment and Reinforcement for Explainable Deepfake Detection via Vision-Language Models cs.CVPDF

Wenbo Xu, Wei Lu, Xiangyang Luo, Jiantao Zhou

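TL;DR: This paper proposes MARE, a multimodal alignment and reinforcement method for explainable Deepfake detection with vision-language models. It designs comprehensive reward functions and incorporates reinforcement learning from human feedback (RLHF) to encourage text-spatially aligned reasoning content that matches human preferences. It also introduces a forgery disentanglement module that captures intrinsic forgery traces from high-level facial semantics, improving detection accuracy and reliability.

Details

Motivation: Rapid progress in generative models places new demands on Deepfake detection. Existing methods mainly model the problem as classification or spatial localization and lack explainability. This paper aims to improve the accuracy and reliability of vision-language models in Deepfake detection and reasoning.

Result: In a thorough evaluation of the generated reasoning content, both quantitative and qualitative results show that MARE achieves state-of-the-art accuracy and reliability.

Insight: The contributions include reward functions designed with RLHF to produce text-spatially aligned reasoning content, and a forgery disentanglement module that separates forgery traces from high-level semantics. Together they offer a multimodal reinforcement learning approach to explainable Deepfake detection.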

Abstract: Deepfake detection is a widely researched topic that is crucial for combating the spread of malicious content, with existing methods mainly modeling the problem as classification or spatial localization. The rapid advancements in generative models impose new demands on Deepfake detection. In this paper, we propose multimodal alignment and reinforcement for explainable Deepfake detection via vision-language models, termed MARE, which aims to enhance the accuracy and reliability of Vision-Language Models (VLMs) in Deepfake detection and reasoning. Specifically, MARE designs comprehensive reward functions, incorporating reinforcement learning from human feedback (RLHF), to incentivize the generation of text-spatially aligned reasoning content that adheres to human preferences. Besides, MARE introduces a forgery disentanglement module to capture intrinsic forgery traces from high-level facial semantics, thereby improving its authenticity detection capability. We conduct thorough evaluations on the reasoning content generated by MARE. Both quantitative and qualitative experimental results demonstrate that MARE achieves state-of-the-art performance in terms of accuracy and reliability.


[57] Efficient Autoregressive Video Diffusion with Dummy Head cs.CVPDF

Hang Guo, Zhaoyang Jia, Jiahao Li, Bin Li, Yuanhao Cai

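TL;DR: This paper proposes Dummy Forcing, an efficient method for autoregressive video diffusion. The authors find that multi-head self-attention in existing models under-utilizes historical frames, with some heads attending almost exclusively to the current frame. The method reduces head-wise context redundancy through heterogeneous memory allocation, adaptively classifies head types through dynamic head programming, and applies context packing for more aggressive cache compression, significantly speeding up inference without additional training cost.

Details

Motivation: Multi-head self-attention in autoregressive video diffusion models uses historical frames inefficiently: some heads redundantly attend to the current frame, wasting computation and memory. The work aims to improve inference efficiency.

Result: Without additional training, the method achieves up to a 2.0x speedup over the baseline and supports video generation at 24.3 FPS with a quality drop of less than 0.5%.

Insight: The novelty lies in Dummy Forcing, which combines heterogeneous memory allocation, dynamic head programming, and context packing to reduce redundant computation across attention heads, enabling efficient cache compression and faster inference. It offers a new direction for improving the efficiency of video generation models.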

Abstract: The autoregressive video diffusion model has recently gained considerable research interest due to its causal modeling and iterative denoising. In this work, we identify that the multi-head self-attention in these models under-utilizes historical frames: approximately 25% of heads attend almost exclusively to the current frame, and discarding their KV caches incurs only minor performance degradation. Building upon this, we propose Dummy Forcing, a simple yet effective method to control context accessibility across different heads. Specifically, the proposed heterogeneous memory allocation reduces head-wise context redundancy, accompanied by dynamic head programming to adaptively classify head types. Moreover, we develop a context packing technique to achieve more aggressive cache compression. Without additional training, our Dummy Forcing delivers up to 2.0x speedup over the baseline, supporting video generation at 24.3 FPS with less than 0.5% quality drop. Project page is available at https://csguoh.github.io/project/DummyForcing/.
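
The observation that some heads attend almost exclusively to the current frame suggests a simple diagnostic, sketched below. It assumes access to a softmax-normalized attention tensor of shape (heads, queries, keys) in which the last `n_current` keys belong to the current frame; the threshold and the function names are hypothetical, and this is not the paper's dynamic head programming procedure.

```python
import numpy as np

def current_frame_mass(attn, n_current):
    """Average attention mass each head places on current-frame keys.

    attn: (heads, queries, keys), each row summing to 1.
    n_current: number of trailing keys that belong to the current frame.
    """
    return attn[:, :, -n_current:].sum(axis=-1).mean(axis=-1)

def find_dummy_heads(attn, n_current, thr=0.95):
    """Boolean mask of heads whose current-frame mass is at least thr (assumed rule)."""
    return current_frame_mass(attn, n_current) >= thr
```

Heads flagged this way are candidates for a reduced historical KV cache, since they make little use of past-frame context.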


[58] Latent Temporal Discrepancy as Motion Prior: A Loss-Weighting Strategy for Dynamic Fidelity in T2V cs.CVPDF

Meiqi Wu, Bingze Song, Ruimin Lin, Chen Zhu, Xiaokun Feng

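TL;DR: This paper proposes Latent Temporal Discrepancy (LTD), a motion prior that guides loss weighting in text-to-video generation models. By measuring frame-to-frame variation in the latent space, it assigns larger penalties to regions with strong dynamics, improving the quality of generated dynamic videos.

Details

Motivation: Existing video generation models perform well on static scenes but degrade on dynamic videos, mainly because noise disrupts temporal coherence and dynamic regions are hard to learn. Existing diffusion models use a static loss for all scenarios, which limits their ability to capture complex dynamics.

Result: Extensive experiments on the general benchmark VBench and the motion-focused VMBench show consistent gains: the method outperforms strong baselines by 3.31% on VBench and 3.58% on VMBench, with significant improvements in motion quality.

Insight: The core novelty is using LTD as a motion prior to guide loss weighting, a motion-aware training strategy that stabilizes training and better reconstructs high-frequency dynamics. Viewed objectively, it is a novel adaptive loss-weighting design based on temporal discrepancy in latent space that improves a model's ability to capture dynamic content.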

Abstract: Video generation models have achieved notable progress in static scenarios, yet their performance in motion video generation remains limited, with quality degrading under drastic dynamic changes. This is due to noise disrupting temporal coherence and increasing the difficulty of learning dynamic regions. Unfortunately, existing diffusion models rely on static loss for all scenarios, constraining their ability to capture complex dynamics. To address this issue, we introduce Latent Temporal Discrepancy (LTD) as a motion prior to guide loss weighting. LTD measures frame-to-frame variation in the latent space, assigning larger penalties to regions with higher discrepancy while maintaining regular optimization for stable regions. This motion-aware strategy stabilizes training and enables the model to better reconstruct high-frequency dynamics. Extensive experiments on the general benchmark VBench and the motion-focused VMBench show consistent gains, with our method outperforming strong baselines by 3.31% on VBench and 3.58% on VMBench, achieving significant improvements in motion quality.
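
A loss weighting driven by latent frame-to-frame discrepancy could be sketched as follows. The weight form `1 + lam * normalized_discrepancy`, the channel averaging, and the padding of the first frame are assumptions made for illustration; the abstract does not give the exact formula, and the function names are hypothetical.

```python
import numpy as np

def ltd_weights(latents, lam=1.0):
    """Per-location loss weights from latent temporal discrepancy (assumed form).

    latents: (T, C, H, W) latent video.
    Returns weights of shape (T, H, W): 1 in stable regions, up to 1 + lam
    where the frame-to-frame change is largest.
    """
    diff = np.abs(latents[1:] - latents[:-1]).mean(axis=1)   # (T-1, H, W)
    diff = np.concatenate([diff[:1], diff], axis=0)          # reuse first diff for frame 0
    norm = diff / (diff.max() + 1e-8)
    return 1.0 + lam * norm

def weighted_mse(pred, target, weights):
    """Mean squared error with per-location weights broadcast over channels."""
    per_elem = (pred - target) ** 2                          # (T, C, H, W)
    return float((weights[:, None] * per_elem).mean())
```

Stable regions keep the regular weight of 1, so the usual objective is recovered wherever there is no motion.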


[59] Context Tokens are Anchors: Understanding the Repetition Curse in dMLLMs from an Information Flow Perspective cs.CVPDF

Qiyan Zhao, Xiaofeng Zhang, Shuochen Chang, Qianyu Chen, Xiaosong Yuan

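TL;DR: This paper studies the repetitive text generation that appears when diffusion-based multimodal large language models (dMLLMs) use caching to accelerate decoding, a problem the authors call the "Repeat Curse", and analyzes its underlying mechanism from an information-flow perspective. The study finds that context tokens act as anchors that aggregate semantic information and guide the final predictions, and that disruptions to their information flow and the failure of their entropy to converge in deeper layers are the key causes of repetition. Based on this, the authors propose CoTA, a plug-and-play method that strengthens attention to context tokens and adds a decoding penalty to mitigate repetition. Experiments show that it is effective and also improves performance on general tasks.

Details

Motivation: The goal is to resolve the repetitive text generation (the Repeat Curse) that arises when dMLLMs use caching to accelerate decoding, so that inference latency can be reduced without sacrificing generation quality.

Result: CoTA is significantly effective at alleviating repetitive generation and achieves consistent performance improvements on general tasks.

Insight: The novelty lies in explaining the Repeat Curse from an information-flow perspective, namely the role of context tokens as information anchors and the importance of their entropy convergence. CoTA repairs the information-flow pattern in a plug-and-play manner through attention enhancement and a confidence penalty, offering a new approach to mitigating repetitive generation in LLMs.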

Abstract: Recent diffusion-based Multimodal Large Language Models (dMLLMs) suffer from high inference latency and therefore rely on caching techniques to accelerate decoding. However, the application of cache mechanisms often introduces undesirable repetitive text generation, a phenomenon we term the Repeat Curse. To better investigate the underlying mechanism behind this issue, we analyze repetitive generation through the lens of information flow. Our work reveals three key findings: (1) context tokens aggregate semantic information as anchors and guide the final predictions; (2) as information propagates across layers, the entropy of context tokens converges in deeper layers, reflecting the model’s growing prediction certainty; (3) repetition is typically linked to disruptions in the information flow of context tokens and to the inability of their entropy to converge in deeper layers. Based on these insights, we present CoTA, a plug-and-play method for mitigating repetition. CoTA enhances the attention of context tokens to preserve intrinsic information flow patterns, while introducing a penalty term to the confidence score during decoding to avoid outputs driven by uncertain context tokens. With extensive experiments, CoTA demonstrates significant effectiveness in alleviating repetition and achieves consistent performance improvements on general tasks. Code is available at https://github.com/ErikZ719/CoTA


[60] AnomalyVFM – Transforming Vision Foundation Models into Zero-Shot Anomaly Detectors cs.CVPDF

Matic Fučka, Vitjan Zavrtanik, Danijel Skočaj

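TL;DR: This paper proposes AnomalyVFM, a framework that turns pretrained vision foundation models (VFMs) into strong zero-shot anomaly detectors. It combines a robust three-stage synthetic dataset generation scheme with a parameter-efficient adaptation mechanism (low-rank feature adapters and a confidence-weighted pixel loss) to address the limited dataset diversity and overly shallow adaptation strategies of existing methods, substantially improving performance.

Details

Motivation: In zero-shot anomaly detection, methods based on purely vision foundation models (such as DINOv2) lag behind methods based on vision-language models (such as CLIP). The authors attribute the gap to two practical issues: limited diversity in existing auxiliary anomaly detection datasets and overly shallow VFM adaptation strategies.

Result: With RADIO as the backbone, AnomalyVFM achieves an average image-level AUROC of 94.1% across 9 diverse datasets, surpassing the previous best methods by 3.3 percentage points and setting a new state of the art.

Insight: The novelty lies in a general framework that combines robust synthetic data generation with parameter-efficient adaptation: (1) a three-stage synthetic dataset generation scheme that increases data diversity, and (2) low-rank feature adapters and a confidence-weighted pixel loss for efficient, deeper VFM adaptation. This offers a useful template for transferring general-purpose vision foundation models to specific downstream tasks such as anomaly detection.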

Abstract: Zero-shot anomaly detection aims to detect and localise abnormal regions in the image without access to any in-domain training images. While recent approaches leverage vision-language models (VLMs), such as CLIP, to transfer high-level concept knowledge, methods based on purely vision foundation models (VFMs), like DINOv2, have lagged behind in performance. We argue that this gap stems from two practical issues: (i) limited diversity in existing auxiliary anomaly detection datasets and (ii) overly shallow VFM adaptation strategies. To address both challenges, we propose AnomalyVFM, a general and effective framework that turns any pretrained VFM into a strong zero-shot anomaly detector. Our approach combines a robust three-stage synthetic dataset generation scheme with a parameter-efficient adaptation mechanism, utilising low-rank feature adapters and a confidence-weighted pixel loss. Together, these components enable modern VFMs to substantially outperform current state-of-the-art methods. More specifically, with RADIO as a backbone, AnomalyVFM achieves an average image-level AUROC of 94.1% across 9 diverse datasets, surpassing previous methods by a significant 3.3 percentage points. Project Page: https://maticfuc.github.io/anomaly_vfm/


[61] Advancing Open-source World Models cs.CVPDF

Robbyant Team, Zelin Gao, Qiuyu Wang, Yanhong Zeng, Jiapeng Zhu

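TL;DR: This paper presents LingBot-World, an open-source world simulator built on video generation. It offers high fidelity, robust dynamics, minute-level long-term memory, and real-time interactivity, with the aim of narrowing the gap between open-source and closed-source technologies and enabling applications in content creation, gaming, and robot learning.

Details

Motivation: Existing open-source world models fall short in environment diversity, long-term memory, and real-time interactivity. The work provides a high-performance, publicly accessible world simulator to support the community.

Result: LingBot-World shows high fidelity and robust dynamics across varied environments (such as realistic scenes, scientific contexts, and cartoon styles), supports a minute-level horizon while preserving contextual consistency, and reaches a real-time interaction latency of under 1 second when generating 16 frames per second.

Insight: The contributions include building an open-source world model on video generation technology with broad environment coverage, long-term memory, and low-latency interaction. Viewed objectively, the open release may help the community build practical applications in areas such as content creation and robot learning, and may broaden access to world model technology.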

Abstract: We present LingBot-World, an open-sourced world simulator stemming from video generation. Positioned as a top-tier world model, LingBot-World offers the following features. (1) It maintains high fidelity and robust dynamics in a broad spectrum of environments, including realism, scientific contexts, cartoon styles, and beyond. (2) It enables a minute-level horizon while preserving contextual consistency over time, which is also known as “long-term memory”. (3) It supports real-time interactivity, achieving a latency of under 1 second when producing 16 frames per second. We provide public access to the code and model in an effort to narrow the divide between open-source and closed-source technologies. We believe our release will empower the community with practical applications across areas like content creation, gaming, and robot learning.


[62] DeepSeek-OCR 2: Visual Causal Flow cs.CVPDF

Haoran Wei, Yaofeng Sun, Yukun Li

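TL;DR: DeepSeek-OCR 2 proposes DeepEncoder V2, a new visual encoder that dynamically reorders visual tokens to mimic the causal reasoning of human vision, with the aim of improving how vision-language models understand images.

Details

Motivation: Conventional vision-language models process visual tokens in a fixed raster-scan order, which contradicts the flexible, semantically coherent way humans perceive images, especially images with complex layouts.

Result: The paper explores a new paradigm in which 2D image understanding is achieved through two cascaded 1D causal reasoning structures. Code and model weights are publicly available.

Insight: The core novelty is giving the visual encoder causal reasoning capabilities so that it can dynamically reorder the token sequence according to image semantics. This offers a new architectural approach to visual understanding with the potential to achieve genuine 2D reasoning.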

Abstract: We present DeepSeek-OCR 2 to investigate the feasibility of a novel encoder, DeepEncoder V2, capable of dynamically reordering visual tokens based on image semantics. Conventional vision-language models (VLMs) invariably process visual tokens in a rigid raster-scan order (top-left to bottom-right) with fixed positional encoding when fed into LLMs. However, this contradicts human visual perception, which follows flexible yet semantically coherent scanning patterns driven by inherent logical structures. Particularly for images with complex layouts, human vision exhibits causally-informed sequential processing. Inspired by this cognitive mechanism, DeepEncoder V2 is designed to endow the encoder with causal reasoning capabilities, enabling it to intelligently reorder visual tokens prior to LLM-based content interpretation. This work explores a novel paradigm: whether 2D image understanding can be effectively achieved through two cascaded 1D causal reasoning structures, thereby offering a new architectural approach with the potential to achieve genuine 2D reasoning. Code and model weights are publicly accessible at http://github.com/deepseek-ai/DeepSeek-OCR-2.


[63] DiffVC-RT: Towards Practical Real-Time Diffusion-based Perceptual Neural Video Compression cs.CVPDF

Wenzhuo Ma, Zhenzhong Chen

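TL;DR: DiffVC-RT is the first framework designed to achieve real-time diffusion-based perceptual neural video compression (NVC). Through an efficient and informative model architecture, explicit and implicit consistency modeling, and an asynchronous and parallel decoding pipeline, it sharply reduces computational complexity and inference latency while maintaining high perceptual quality.

Details

Motivation: Practical deployment of diffusion-based NVC faces severe information loss, prohibitive inference latency, and poor temporal consistency.

Result: On the HEVC dataset, DiffVC-RT achieves 80.1% bitrate savings in terms of LPIPS over VTM-17.0, and reaches real-time speeds of 206 fps encoding and 30 fps decoding for 720p videos on an NVIDIA H800 GPU.

Insight: The contributions include: (1) an efficient and informative model architecture that uses module replacement and pruning to reduce computational complexity while mitigating structural information loss; (2) explicit and implicit consistency modeling that combines a zero-cost Online Temporal Shift Module with hybrid implicit constraints to suppress flickering artifacts; and (3) an asynchronous and parallel decoding pipeline with a batch-dimension temporal shift design and mixed half precision for high throughput.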

Abstract: The practical deployment of diffusion-based Neural Video Compression (NVC) faces critical challenges, including severe information loss, prohibitive inference latency, and poor temporal consistency. To bridge this gap, we propose DiffVC-RT, the first framework designed to achieve real-time diffusion-based perceptual NVC. First, we introduce an Efficient and Informative Model Architecture. Through strategic module replacements and pruning, this architecture significantly reduces computational complexity while mitigating structural information loss. Second, to address generative flickering artifacts, we propose Explicit and Implicit Consistency Modeling. We enhance temporal consistency by explicitly incorporating a zero-cost Online Temporal Shift Module within the U-Net, complemented by hybrid implicit consistency constraints. Finally, we present an Asynchronous and Parallel Decoding Pipeline incorporating Mixed Half Precision, which enables asynchronous latent decoding and parallel frame reconstruction via a Batch-dimension Temporal Shift design. Experiments show that DiffVC-RT achieves 80.1% bitrate savings in terms of LPIPS over VTM-17.0 on the HEVC dataset, with real-time encoding and decoding speeds of 206 / 30 fps for 720p videos on an NVIDIA H800 GPU, marking a significant milestone in diffusion-based video compression.


[64] StructAlign: Structured Cross-Modal Alignment for Continual Text-to-Video Retrieval cs.CVPDF

Shaokun Wang, Weili Guan, Jizhou Han, Jianlong Wu, Yupeng Hu

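TL;DR: This paper proposes StructAlign to address catastrophic forgetting in continual text-to-video retrieval. It introduces a simplex Equiangular Tight Frame (ETF) geometric prior to align cross-modal features and designs a cross-modal relation preserving loss to suppress intra-modal feature drift, mitigating the forgetting caused by feature drift.

Details

Motivation: Continual text-to-video retrieval is highly prone to catastrophic forgetting. The core difficulty is feature drift: intra-modal drift caused by continual learning, and non-cooperative drift across modalities that leads to modality misalignment.

Result: Extensive experiments on multiple benchmark datasets show that the method consistently outperforms state-of-the-art continual retrieval approaches.

Insight: The novelty lies in using the simplex ETF as a unified geometric prior to align cross-modal features in a structured way, and in using complementary modality information through a relation preserving loss to stabilize feature updates, jointly addressing cross-modal and intra-modal feature drift.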

Abstract: Continual Text-to-Video Retrieval (CTVR) is a challenging multimodal continual learning setting, where models must incrementally learn new semantic categories while maintaining accurate text-video alignment for previously learned ones, thus making it particularly prone to catastrophic forgetting. A key challenge in CTVR is feature drift, which manifests in two forms: intra-modal feature drift caused by continual learning within each modality, and non-cooperative feature drift across modalities that leads to modality misalignment. To mitigate these issues, we propose StructAlign, a structured cross-modal alignment method for CTVR. First, StructAlign introduces a simplex Equiangular Tight Frame (ETF) geometry as a unified geometric prior to mitigate modality misalignment. Building upon this geometric prior, we design a cross-modal ETF alignment loss that aligns text and video features with category-level ETF prototypes, encouraging the learned representations to form an approximate simplex ETF geometry. In addition, to suppress intra-modal feature drift, we design a Cross-modal Relation Preserving loss, which leverages complementary modalities to preserve cross-modal similarity relations, providing stable relational supervision for feature updates. By jointly addressing non-cooperative feature drift across modalities and intra-modal feature drift, StructAlign effectively alleviates catastrophic forgetting in CTVR. Extensive experiments on benchmark datasets demonstrate that our method consistently outperforms state-of-the-art continual retrieval approaches.
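
For readers unfamiliar with the simplex ETF geometry the abstract refers to, the sketch below builds K unit-norm prototypes whose pairwise cosine similarity is exactly -1/(K-1), using the standard construction from a random orthonormal basis. The cosine-based alignment loss is an illustrative assumption, not the paper's loss, and the function names are hypothetical.

```python
import numpy as np

def simplex_etf(num_classes, dim, seed=0):
    """Return (num_classes, dim) simplex ETF prototypes; requires dim >= num_classes."""
    K = num_classes
    rng = np.random.default_rng(seed)
    # U has orthonormal columns, shape (dim, K).
    U, _ = np.linalg.qr(rng.standard_normal((dim, K)))
    M = np.sqrt(K / (K - 1)) * U @ (np.eye(K) - np.ones((K, K)) / K)
    return M.T

def etf_alignment_loss(features, labels, prototypes):
    """Mean (1 - cosine similarity) between features and their class prototypes (assumed form)."""
    unit = features / np.linalg.norm(features, axis=1, keepdims=True)
    cos = (unit * prototypes[labels]).sum(axis=1)
    return float((1.0 - cos).mean())
```

Because the prototypes are fixed and maximally separated, both text and video features of a category can be pulled toward the same target, which gives the two modalities a shared reference geometry.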


[65] Person Re-ID in 2025: Supervised, Self-Supervised, and Language-Aligned. What Works? cs.CV | cs.AIPDF

Lakshman Balasubramanian

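TL;DR: This paper reviews three training paradigms for person re-identification (ReID), namely supervised, self-supervised, and language-aligned models, and evaluates their robustness in cross-domain applications. It finds that supervised models excel within their training domain but perform poorly on cross-domain data, while language-aligned models show unexpected cross-domain robustness.

Details

Motivation: The study evaluates how different training paradigms perform on person ReID, particularly how well they generalize across domains, and explores the role of foundation models such as SigLIP2 in improving the transferability of visual representations.

Result: An analysis across 11 models and 9 datasets shows that supervised models dominate their training domain but collapse on cross-domain data, while language-aligned models show surprising robustness on cross-domain ReID tasks even though they are not explicitly trained for them.

Insight: The novelty lies in a systematic comparison of the cross-domain ReID performance of three training paradigms, which reveals the potential of language-aligned models as foundation models and points to a direction for future work: using language-aligned representations to improve generalization.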

Abstract: Person Re-Identification (ReID) remains a challenging problem in computer vision. This work reviews various training paradigms, evaluates the robustness of state-of-the-art ReID models in cross-domain applications, and examines the role of foundation models in improving generalization through richer, more transferable visual representations. We compare three training paradigms: supervised, self-supervised, and language-aligned models. The study aims to answer the following questions: Can supervised models generalize in cross-domain scenarios? How do foundation models like SigLIP2 perform on ReID tasks? What are the weaknesses of current supervised and foundation models for ReID? We conducted the analysis across 11 models and 9 datasets. Our results show a clear split: supervised models dominate their training domain but crumble on cross-domain data. Language-aligned models, however, show surprising cross-domain robustness on ReID tasks, even though they are not explicitly trained for them. Code and data are available at: https://github.com/moiiai-tech/object-reid-benchmark.


[66] GDCNet: Generative Discrepancy Comparison Network for Multimodal Sarcasm Detection cs.CV | cs.AI | cs.CLPDF

Shuguang Zhang, Junhong Lian, Guoxin Yu, Baoxun Xu, Xiang Ao

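TL;DR: This paper proposes GDCNet, a Generative Discrepancy Comparison Network for multimodal sarcasm detection. It uses multimodal large language models to generate objective image descriptions as semantic anchors, computes the semantic and sentiment discrepancies between the generated description and the original text, measures visual-textual fidelity to capture cross-modal conflicts, and finally fuses the multimodal features adaptively through a gated module for detection.

Details

Motivation: Existing methods rely on cross-modal embedding misalignment to detect inconsistency, but they struggle when visual and textual content are loosely related or semantically indirect. Methods that use large language models to generate sarcastic cues introduce noise because of the diversity and subjectivity of the generations. GDCNet addresses both issues by generating objective, factually grounded image descriptions as stable semantic anchors.

Result: Extensive experiments on MSD benchmarks show that GDCNet is more accurate and robust than prior methods and sets a new state of the art on the MMSD2.0 benchmark.

Insight: The novelty lies in using objective descriptions generated by MLLMs as stable semantic anchors to quantify cross-modal discrepancy, avoiding the noise introduced by subjectively generated sarcastic cues. The gated module adaptively balances modality contributions, improving the handling of loosely related or semantically indirect multimodal content.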

Abstract: Multimodal sarcasm detection (MSD) aims to identify sarcasm within image-text pairs by modeling semantic incongruities across modalities. Existing methods often exploit cross-modal embedding misalignment to detect inconsistency but struggle when visual and textual content are loosely related or semantically indirect. While recent approaches leverage large language models (LLMs) to generate sarcastic cues, the inherent diversity and subjectivity of these generations often introduce noise. To address these limitations, we propose the Generative Discrepancy Comparison Network (GDCNet). This framework captures cross-modal conflicts by utilizing descriptive, factually grounded image captions generated by Multimodal LLMs (MLLMs) as stable semantic anchors. Specifically, GDCNet computes semantic and sentiment discrepancies between the generated objective description and the original text, alongside measuring visual-textual fidelity. These discrepancy features are then fused with visual and textual representations via a gated module to adaptively balance modality contributions. Extensive experiments on MSD benchmarks demonstrate GDCNet’s superior accuracy and robustness, establishing a new state-of-the-art on the MMSD2.0 benchmark.


[67] ProSkill: Segment-Level Skill Assessment in Procedural Videos cs.CVPDF

Michele Mazzamuto, Daniele Di Mauro, Gianpiero Francesca, Giovanni Maria Farinella, Antonino Furnari

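TL;DR: This paper introduces ProSkill, the first benchmark dataset for action-level skill assessment in procedural videos. Using a novel, scalable annotation protocol, it derives an absolute skill assessment ranking from pairwise comparisons and provides both absolute and pairwise annotations. The authors benchmark current state-of-the-art skill assessment algorithms on the dataset; the results show that existing methods perform poorly, highlighting how challenging and valuable ProSkill is.

Details

Motivation: Current skill assessment research focuses mainly on sports and lacks large-scale datasets for complex procedural activities. Existing studies typically cover only a limited number of actions and focus on pairwise assessments or binary labels, which rules out fine-grained absolute skill assessment.

Result: On the proposed ProSkill benchmark, the main state-of-the-art skill assessment algorithms, covering both ranking-based and pairwise paradigms, achieve suboptimal results, demonstrating the difficulty of the dataset.

Insight: The novelty lies in a scalable annotation protocol that uses a Swiss Tournament scheme for efficient pairwise comparisons and an ELO-based rating system to aggregate those comparisons into consistent, continuous global scores, so that an absolute skill ranking can be built from pairwise assessments. This provides a new data foundation and methodological reference for fine-grained skill assessment in procedural videos.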

Abstract: Skill assessment in procedural videos is crucial for the objective evaluation of human performance in settings such as manufacturing and procedural daily tasks. Current research on skill assessment has predominantly focused on sports and lacks large-scale datasets for complex procedural activities. Existing studies typically involve only a limited number of actions and focus either on pairwise assessments (e.g., A is better than B) or on binary labels (e.g., good execution vs needs improvement). In response to these shortcomings, we introduce ProSkill, the first benchmark dataset for action-level skill assessment in procedural tasks. ProSkill provides absolute skill assessment annotations, along with pairwise ones. This is enabled by a novel and scalable annotation protocol that allows for the creation of an absolute skill assessment ranking starting from pairwise assessments. This protocol leverages a Swiss Tournament scheme for efficient pairwise comparisons, which are then aggregated into consistent, continuous global scores using an ELO-based rating system. We use our dataset to benchmark the main state-of-the-art skill assessment algorithms, including both ranking-based and pairwise paradigms. The suboptimal results achieved by the current state-of-the-art highlight the challenges and thus the value of ProSkill in the context of skill assessment for procedural videos. All data and code are available at https://fpv-iplab.github.io/ProSkill/
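
The aggregation of pairwise judgments into continuous scores can be illustrated with the standard Elo update. The sketch assumes the classic logistic expectation with a 400-point scale, a K-factor of 32, and a starting rating of 1500; the abstract does not state the paper's parameters, and the Swiss Tournament pairing step is not modelled here. Function names are hypothetical.

```python
def expected_score(r_a, r_b):
    """Probability that A is judged better than B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def elo_update(r_a, r_b, a_wins, k=32.0):
    """Return updated ratings for A and B after one pairwise comparison."""
    e_a = expected_score(r_a, r_b)
    s_a = 1.0 if a_wins else 0.0
    new_a = r_a + k * (s_a - e_a)
    new_b = r_b + k * ((1.0 - s_a) - (1.0 - e_a))
    return new_a, new_b

def rate(comparisons, items, start=1500.0):
    """Aggregate (winner, loser) pairs into a rating per item."""
    ratings = {item: start for item in items}
    for winner, loser in comparisons:
        ratings[winner], ratings[loser] = elo_update(ratings[winner], ratings[loser], True)
    return ratings
```

For example, `rate([("a", "b"), ("a", "c"), ("b", "c")], ["a", "b", "c"])` ranks the three clips as a, then b, then c, and the total rating mass stays constant because each update is zero-sum.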


[68] Bi-modal Textual Prompt Learning for Vision-Language Models in Remote Sensing cs.CVPDF

Pankhi Kashyap, Mainak Singha, Biplab Banerjee

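TL;DR: This paper proposes BiMoRS, a lightweight bi-modal prompt learning framework designed for remote sensing (RS) tasks. A frozen image captioning model (e.g., BLIP-2) extracts textual semantic summaries from RS images; the captions are tokenized with a BERT tokenizer and fused with high-level visual features from the CLIP encoder. A lightweight cross-attention module then conditions a learnable query prompt on the fused representation, producing context-aware prompts without altering the CLIP backbone.

Details

Motivation: Existing prompt learning methods generalize well on natural image datasets, but applying them directly to RS imagery is difficult because of multi-label scenes, high intra-class variability, and diverse spatial resolutions. As a result, they struggle to identify dominant semantic cues and generalize poorly to novel classes.

Result: Evaluated on four RS datasets across three domain generalization tasks, BiMoRS delivers consistent gains, outperforming strong baselines by up to 2% on average.

Insight: The novelty lies in fusing caption-derived textual summaries with visual features and using bi-modal prompt learning to strengthen semantic understanding and generalization in RS scenes. Viewed objectively, the lightweight module combines textual and visual information effectively and offers an efficient way to adapt pretrained vision-language models to RS imagery.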

Abstract: Prompt learning (PL) has emerged as an effective strategy to adapt vision-language models (VLMs), such as CLIP, for downstream tasks under limited supervision. While PL has demonstrated strong generalization on natural image datasets, its transferability to remote sensing (RS) imagery remains underexplored. RS data present unique challenges, including multi-label scenes, high intra-class variability, and diverse spatial resolutions, that hinder the direct applicability of existing PL methods. In particular, current prompt-based approaches often struggle to identify dominant semantic cues and fail to generalize to novel classes in RS scenarios. To address these challenges, we propose BiMoRS, a lightweight bi-modal prompt learning framework tailored for RS tasks. BiMoRS employs a frozen image captioning model (e.g., BLIP-2) to extract textual semantic summaries from RS images. These captions are tokenized using a BERT tokenizer and fused with high-level visual features from the CLIP encoder. A lightweight cross-attention module then conditions a learnable query prompt on the fused textual-visual representation, yielding contextualized prompts without altering the CLIP backbone. We evaluate BiMoRS on four RS datasets across three domain generalization (DG) tasks and observe consistent performance gains, outperforming strong baselines by up to 2% on average. Codes are available at https://github.com/ipankhi/BiMoRS.


[69] Decoupling Perception and Calibration: Label-Efficient Image Quality Assessment Framework cs.CV | cs.AIPDF

Xinyue Li, Zhichao Zhang, Zhiming Xu, Shubo Xu, Xiongkuo Min

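TL;DR: This paper proposes LEAF, a label-efficient image quality assessment (IQA) framework that distills perceptual quality priors from a multimodal large language model (MLLM) teacher into a lightweight student regressor, enabling Mean Opinion Score (MOS) scale calibration with minimal human supervision. The teacher provides dense supervision through point-wise judgments and pair-wise preferences, together with an estimate of decision reliability; the student learns the teacher's quality perception patterns and is then calibrated on a small MOS-labeled subset to align with human annotations.

Details

Motivation: Current MLLM-based IQA methods are computationally expensive and depend on large numbers of MOS annotations. The authors argue that the core bottleneck is MOS scale calibration rather than quality perception capacity, and therefore aim to reduce the need for human annotation.

Result: On user-generated and AI-generated IQA benchmarks, the method significantly reduces human annotation requirements while maintaining strong correlation with MOS, making lightweight IQA practical under limited annotation budgets.

Insight: The novelty is decoupling quality perception from scale calibration: knowledge distillation transfers the MLLM's perceptual priors to a lightweight model, and point-wise and pair-wise supervision combined with reliability estimates enables efficient calibration, reducing dependence on large-scale labeled data.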

Abstract: Recent multimodal large language models (MLLMs) have demonstrated strong capabilities in image quality assessment (IQA) tasks. However, adapting such large-scale models is computationally expensive and still relies on substantial Mean Opinion Score (MOS) annotations. We argue that for MLLM-based IQA, the core bottleneck lies not in the quality perception capacity of MLLMs, but in MOS scale calibration. Therefore, we propose LEAF, a Label-Efficient Image Quality Assessment Framework that distills perceptual quality priors from an MLLM teacher into a lightweight student regressor, enabling MOS calibration with minimal human supervision. Specifically, the teacher conducts dense supervision through point-wise judgments and pair-wise preferences, with an estimate of decision reliability. Guided by these signals, the student learns the teacher’s quality perception patterns through joint distillation and is calibrated on a small MOS subset to align with human annotations. Experiments on both user-generated and AI-generated IQA benchmarks demonstrate that our method significantly reduces the need for human annotations while maintaining strong MOS-aligned correlations, making lightweight IQA practical under limited annotation budgets.
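The calibration step can be pictured as fitting a small mapping from the student's raw scores onto the MOS scale using only a handful of labeled images. The abstract does not say what form LEAF's calibration takes, so the sketch below assumes a plain least-squares linear map; the function name and the sample values are hypothetical.

```python
def fit_linear_calibration(scores, mos):
    """Least-squares fit of mos ~ slope * score + intercept (closed form)."""
    n = len(scores)
    mean_s = sum(scores) / n
    mean_m = sum(mos) / n
    cov = sum((s - mean_s) * (m - mean_m) for s, m in zip(scores, mos))
    var = sum((s - mean_s) ** 2 for s in scores)
    slope = cov / var
    intercept = mean_m - slope * mean_s
    return slope, intercept


# Hypothetical small labeled subset: uncalibrated student outputs and MOS labels (1-5 scale).
student_scores = [0.1, 0.4, 0.5, 0.9]
mos_labels = [1.4, 2.6, 3.0, 4.6]

slope, intercept = fit_linear_calibration(student_scores, mos_labels)
calibrated = [slope * s + intercept for s in student_scores]
print(slope, intercept)
```

Only two parameters are fitted here, which is why a few MOS labels suffice once the student already ranks images sensibly; the ranking ability itself comes from the distillation stage, which this sketch does not model.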


[70] LEMON: How Well Do MLLMs Perform Temporal Multimodal Understanding on Instructional Videos? cs.CV | cs.AIPDF

Zhuang Yu, Lei Shen, Jing Zhao, Shiliang Sun

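TL;DR: This paper introduces LEMON, a multimodal understanding benchmark built on STEM lecture videos, for evaluating MLLMs on long-form, knowledge-intensive, temporally structured educational content. It contains 2,277 video segments and 4,181 high-quality QA pairs spanning six major tasks, and reveals significant weaknesses of current state-of-the-art models in temporal reasoning and instructional prediction.

Details

Motivation: The performance of existing MLLMs on long-form, knowledge-intensive, temporally structured educational video content remains largely unexplored; this paper aims to fill that gap.

Result: Comprehensive experiments show substantial performance gaps across tasks, even for state-of-the-art MLLMs such as GPT-4o, which struggle in particular with temporal reasoning and instructional prediction.

Insight: The novelty is a benchmark that is semantically rich, tightly couples modalities, has explicit temporal and pedagogical structure, and includes contextually linked multi-turn questions. Its task design covers the full cognitive spectrum from perception to reasoning to generation, providing an extensible and challenging benchmark for evaluating and advancing multimodal understanding of long-form instructional content.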

Abstract: Recent multimodal large language models (MLLMs) have shown remarkable progress across vision, audio, and language tasks, yet their performance on long-form, knowledge-intensive, and temporally structured educational content remains largely unexplored. To bridge this gap, we introduce LEMON, a Lecture-based Evaluation benchmark for MultimOdal uNderstanding, focusing on STEM lecture videos that require long-horizon reasoning and cross-modal integration. LEMON comprises 2,277 video segments spanning 5 disciplines and 29 courses, with an average duration of 196.1 seconds, yielding 4,181 high-quality QA pairs, including 3,413 multiple-choice and 768 open-ended questions. Distinct from existing video benchmarks, LEMON features: (1) semantic richness and disciplinary density, (2) tightly coupled video-audio-text modalities, (3) explicit temporal and pedagogical structure, and (4) contextually linked multi-turn questioning. It further encompasses six major tasks and twelve subtasks, covering the full cognitive spectrum from perception to reasoning and then to generation. Comprehensive experiments reveal substantial performance gaps across tasks, highlighting that even state-of-the-art MLLMs like GPT-4o struggle with temporal reasoning and instructional prediction. We expect LEMON to serve as an extensible and challenging benchmark for advancing multimodal perception, reasoning, and generation in long-form instructional contents.


[71] Li-ViP3D++: Query-Gated Deformable Camera-LiDAR Fusion for End-to-End Perception and Trajectory Prediction cs.CV | cs.AI | cs.ROPDF

Matej Halinkovic, Nina Masarykova, Alexey Vinel, Marek Galinski

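TL;DR: Li-ViP3D++ is an end-to-end perception and trajectory prediction framework for autonomous driving. Its proposed Query-Gated Deformable Fusion (QGDF) adaptively fuses multi-view camera and LiDAR data in query space to jointly optimize object detection, tracking, and multi-hypothesis trajectory forecasting.

Details

Motivation: Modular perception-prediction pipelines restrict information flow and accumulate errors, while the camera-LiDAR fusion schemes in existing query-based end-to-end models rely on heuristic alignment and discrete selection steps, which underuse available information and introduce bias.

Result: On nuScenes, Li-ViP3D++ improves end-to-end behavior prediction and detection quality, achieving higher EPA (0.335) and mAP (0.502) and substantially reducing false positives (FP ratio 0.147), while running faster (139.82 ms) than its predecessor Li-ViP3D (145.91 ms).

Insight: The novelty is the fully differentiable QGDF mechanism, which aggregates image evidence via masked attention, extracts LiDAR context through differentiable BEV sampling with learned per-query offsets, and uses query-conditioned gating to adaptively weight visual and geometric cues, yielding more robust multimodal fusion that makes fuller use of the available information.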

Abstract: End-to-end perception and trajectory prediction from raw sensor data is one of the key capabilities for autonomous driving. Modular pipelines restrict information flow and can amplify upstream errors. Recent query-based, fully differentiable perception-and-prediction (PnP) models mitigate these issues, yet the complementarity of cameras and LiDAR in the query-space has not been sufficiently explored. Models often rely on fusion schemes that introduce heuristic alignment and discrete selection steps which prevent full utilization of available information and can introduce unwanted bias. We propose Li-ViP3D++, a query-based multimodal PnP framework that introduces Query-Gated Deformable Fusion (QGDF) to integrate multi-view RGB and LiDAR in query space. QGDF (i) aggregates image evidence via masked attention across cameras and feature levels, (ii) extracts LiDAR context through fully differentiable BEV sampling with learned per-query offsets, and (iii) applies query-conditioned gating to adaptively weight visual and geometric cues per agent. The resulting architecture jointly optimizes detection, tracking, and multi-hypothesis trajectory forecasting in a single end-to-end model. On nuScenes, Li-ViP3D++ improves end-to-end behavior and detection quality, achieving higher EPA (0.335) and mAP (0.502) while substantially reducing false positives (FP ratio 0.147), and it is faster than the prior Li-ViP3D variant (139.82 ms vs. 145.91 ms). These results indicate that query-space, fully differentiable camera-LiDAR fusion can increase robustness of end-to-end PnP without sacrificing deployability.
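Step (iii) of QGDF, query-conditioned gating, can be sketched as a sigmoid gate computed from the query that blends the image and LiDAR features. This is only an illustration of the general gating idea: the scalar gate, the linear gate parameters, and the function names are assumptions, and the paper's actual gate may be vector-valued or structured differently.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(query, img_feat, lidar_feat, gate_weights, gate_bias=0.0):
    """Blend two feature vectors with a scalar gate computed from the query.

    gate close to 1 -> trust the image feature; gate close to 0 -> trust LiDAR.
    """
    gate = sigmoid(sum(w * q for w, q in zip(gate_weights, query)) + gate_bias)
    fused = [gate * i + (1.0 - gate) * l for i, l in zip(img_feat, lidar_feat)]
    return fused, gate


# Hypothetical per-agent query and features.
query = [0.5, -1.0, 2.0]
img_feat = [1.0, 0.0]
lidar_feat = [0.0, 1.0]

# With zero gate weights the gate is exactly 0.5, i.e. a plain average.
fused, gate = gated_fusion(query, img_feat, lidar_feat, gate_weights=[0.0, 0.0, 0.0])
print(gate, fused)
```

Because the gate is a smooth function of the query, the whole fusion stays differentiable, which is the property the abstract contrasts with discrete selection steps.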


[72] Compression Tells Intelligence: Visual Coding, Visual Token Technology, and the Unification cs.CVPDF

Xin Jin, Jinming Liu, Yuntao Wei, Junyan Lin, Zhicheng Wang

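TL;DR: This paper examines the intrinsic connection between visual coding and visual token technology, noting that both center on the trade-off between compression efficiency and semantic fidelity. It first surveys the two technique families, then unifies them from an optimization perspective, and uses this unified formulation to offer bidirectional insights and an outlook on future techniques. Experiments indicate that task-oriented token techniques hold great potential in practical tasks such as MLLMs, AIGC, and embodied AI, and may eventually lead to an efficient, general token standard analogous to traditional codecs (e.g., H.264/265).

Details

Motivation: The aim is to highlight the positive correlation between compression efficiency and model performance in AI, especially in multimodal large language models, and to unify traditional visual coding with emerging visual token technology in order to explore their common basis in optimization.

Result: No specific quantitative results are given, but the analysis and outlook indicate that task-oriented token techniques show great potential in practical applications such as MLLMs, AIGC, and embodied AI, and may drive the emergence of an efficient, general token standard.

Insight: The novelty is unifying visual coding and visual token technology from an optimization perspective for the first time, proposing a theoretical framework that connects them, and forecasting the direction of next-generation visual codecs and token techniques, with emphasis on the role of task-oriented design and standardization in making intelligent tasks more efficient.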

Abstract: The claim that “Compression Tells Intelligence” is supported by research in artificial intelligence, particularly concerning (multimodal) large language models (LLMs/MLLMs), where compression efficiency often correlates with improved model performance and capabilities. For compression, classical visual coding based on traditional information theory has developed over decades, achieving great success with numerous international industrial standards widely applied in multimedia (e.g., image/video) systems. Beyond that, the recently emerging visual token technology of generative multimodal large models shares a fundamental objective with visual coding: maximizing semantic information fidelity during representation learning while minimizing computational cost. Therefore, this paper first provides a comprehensive overview of the two dominant technique families, Visual Coding and Visual Token Technology, and then unifies them from the perspective of optimization, discussing the essence of the underlying trade-off between compression efficiency and model performance. Next, based on the proposed unified formulation bridging visual coding and visual token technology, we synthesize bidirectional insights between them and forecast the next generation of visual codec and token techniques. Last but not least, we experimentally show the large potential of task-oriented token developments in more practical tasks such as multimodal LLMs (MLLMs), AI-generated content (AIGC), and embodied AI, and shed light on the future possibility of standardizing a general token technology, like the traditional codecs (e.g., H.264/265), with high efficiency for a wide range of intelligent tasks in a unified and effective manner.


[73] FAIRT2V: Training-Free Debiasing for Text-to-Video Diffusion Models cs.CV | cs.AIPDF

Haonan Zhong, Wei Song, Tingxu Han, Maurice Pagnucco, Jingling Xue

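TL;DR: This paper proposes FairT2V, a training-free debiasing framework for text-to-video (T2V) diffusion models. Its analysis shows that demographic bias in T2V models, gender bias in particular, originates mainly from the pretrained text encoder. FairT2V neutralizes prompt embeddings with an anchor-based spherical geodesic transformation and preserves temporal coherence through a dynamic denoising schedule, reducing bias in generated videos without finetuning.

Details

Motivation: T2V diffusion models are progressing rapidly, yet their demographic biases, especially gender bias, remain largely unexplored. The bias stems mainly from pretrained text encoders, which encode implicit gender associations even for neutral prompts, so a method that mitigates encoder-induced bias without training or finetuning is needed.

Result: Experiments on the modern T2V model Open-Sora show that FairT2V substantially reduces gender bias in videos generated for different occupations, with minimal impact on video quality. Evaluation combines VideoLLM-based reasoning with human verification.

Insight: The main contributions are: (1) showing that T2V bias originates mainly from the text encoder and quantifying it with a gender-leaning score; (2) an anchor-based spherical geodesic transformation that neutralizes prompt embeddings while preserving semantics; (3) a dynamic denoising schedule that applies debiasing only during the early identity-forming steps to maintain temporal coherence; and (4) a video-level fairness evaluation protocol that combines VideoLLM-based reasoning with human verification. Because it requires no training, the method is general and practical.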

Abstract: Text-to-video (T2V) diffusion models have achieved rapid progress, yet their demographic biases, particularly gender bias, remain largely unexplored. We present FairT2V, a training-free debiasing framework for text-to-video generation that mitigates encoder-induced bias without finetuning. We first analyze demographic bias in T2V models and show that it primarily originates from pretrained text encoders, which encode implicit gender associations even for neutral prompts. We quantify this effect with a gender-leaning score that correlates with bias in generated videos. Based on this insight, FairT2V mitigates demographic bias by neutralizing prompt embeddings via anchor-based spherical geodesic transformations while preserving semantics. To maintain temporal coherence, we apply debiasing only during early identity-forming steps through a dynamic denoising schedule. We further propose a video-level fairness evaluation protocol combining VideoLLM-based reasoning with human verification. Experiments on the modern T2V model Open-Sora show that FairT2V substantially reduces demographic bias across occupations with minimal impact on video quality.
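A spherical geodesic move between two embeddings is commonly implemented as spherical linear interpolation (slerp). The sketch below shows that building block only; how FairT2V chooses its anchors and how far it moves along the arc are not stated in the abstract, so the `neutral_anchor` vector and the fraction `t` are hypothetical.

```python
import math


def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def slerp(p, q, t):
    """Point at fraction t along the great-circle arc from p to q.

    Inputs are normalized first. Not well defined when p and q are
    antipodal, because the connecting arc is then not unique.
    """
    p, q = normalize(p), normalize(q)
    dot = max(-1.0, min(1.0, sum(a * b for a, b in zip(p, q))))
    theta = math.acos(dot)
    if theta < 1e-8:  # vectors already coincide
        return p
    s = math.sin(theta)
    w_p = math.sin((1.0 - t) * theta) / s
    w_q = math.sin(t * theta) / s
    return [w_p * a + w_q * b for a, b in zip(p, q)]


# Hypothetical 2-D embeddings: a prompt embedding and a neutral anchor.
prompt_embedding = [1.0, 0.0]
neutral_anchor = [0.0, 1.0]
shifted = slerp(prompt_embedding, neutral_anchor, t=0.5)
print(shifted)
```

Unlike straight-line interpolation, the result stays on the unit sphere, so the embedding's norm is unchanged while its direction moves toward the anchor.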


[74] Open-Vocabulary Functional 3D Human-Scene Interaction Generation cs.CV | cs.AIPDF

Jie Liu, Yu Sun, Alpar Cseke, Yao Feng, Nicolas Heron

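TL;DR: This paper proposes FunHSI, a training-free, functionality-driven framework that generates functionally correct 3D human-scene interactions from open-vocabulary task prompts. It performs functionality-aware contact reasoning to identify functional scene elements, reconstructs their 3D geometry, and models high-level interactions with a contact graph. Vision-language models then synthesize an image of a human performing the task and estimate 3D body and hand poses, and stage-wise optimization finally enforces physical plausibility and functional correctness.

Details

Motivation: Existing methods for generating 3D human-scene interactions usually lack explicit reasoning about object functionality and the corresponding human-scene contact, which leads to implausible or functionally incorrect interactions. This paper addresses that open problem to support applications such as embodied AI, robotics, and interactive content creation.

Result: Extensive experiments show that FunHSI consistently generates functionally correct and physically plausible 3D human-scene interactions across diverse indoor and outdoor scenes.

Insight: The novelty is a training-free framework built on functionality-aware contact reasoning that supports fine-grained functional interactions from open-vocabulary prompts (e.g., "increasing the room temperature") rather than only general ones (e.g., "sitting on a sofa"). At its core, contact-graph modeling and stage-wise optimization connect high-level semantic tasks to concrete 3D geometry and pose generation.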

Abstract: Generating 3D humans that functionally interact with 3D scenes remains an open problem with applications in embodied AI, robotics, and interactive content creation. The key challenge involves reasoning about both the semantics of functional elements in 3D scenes and the 3D human poses required to achieve functionality-aware interaction. Unfortunately, existing methods typically lack explicit reasoning over object functionality and the corresponding human-scene contact, resulting in implausible or functionally incorrect interactions. In this work, we propose FunHSI, a training-free, functionality-driven framework that enables functionally correct human-scene interactions from open-vocabulary task prompts. Given a task prompt, FunHSI performs functionality-aware contact reasoning to identify functional scene elements, reconstruct their 3D geometry, and model high-level interactions via a contact graph. We then leverage vision-language models to synthesize a human performing the task in the image and estimate proposed 3D body and hand poses. Finally, the proposed 3D body configuration is refined via stage-wise optimization to ensure physical plausibility and functional correctness. In contrast to existing methods, FunHSI not only synthesizes more plausible general 3D interactions, such as “sitting on a sofa”, but also supports fine-grained functional human-scene interactions, e.g., “increasing the room temperature”. Extensive experiments demonstrate that FunHSI consistently generates functionally correct and physically plausible human-scene interactions across diverse indoor and outdoor scenes.


[75] A New Dataset and Framework for Robust Road Surface Classification via Camera-IMU Fusion cs.CV | cs.AIPDF

Willams de Lima Costa, Thifany Ketuli Silva de Souza, Jonas Ferreira Silva, Carlos Gabriel Bezerra Pereira, Bruno Reis Vila Nova

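TL;DR: This paper proposes a multimodal framework for robust road surface classification that fuses camera images with inertial measurement unit (IMU) data, and introduces a new dataset, ROAD. The framework uses a lightweight bidirectional cross-attention module followed by an adaptive gating layer that adjusts modality contributions under domain shift. Experiments show that it outperforms prior work on both the PVS benchmark and the ROAD dataset, and remains stable on minority classes and under adverse visual conditions.

Details

Motivation: Existing road surface classification techniques fail to generalize beyond narrow operating conditions because of limited sensing modalities and datasets that lack environmental diversity. This paper addresses both issues through multimodal fusion and a new dataset.

Result: The method improves on the previous state of the art by 1.4 percentage points on the PVS benchmark and by 11.6 percentage points on the multimodal ROAD subset, with higher F1-scores on minority classes. It also performs stably under challenging visual conditions such as nighttime, heavy rain, and mixed-surface transitions.

Insight: The contributions are a multimodal fusion framework with lightweight bidirectional cross-attention and adaptive gating, and the ROAD dataset, whose real-world, vision-only, and synthetic subsets increase environmental diversity and support generalization. Viewed objectively, combining low-cost camera and IMU sensors with attention mechanisms offers a scalable, robust solution for road surface understanding.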

Abstract: Road surface classification (RSC) is a key enabler for environment-aware predictive maintenance systems. However, existing RSC techniques often fail to generalize beyond narrow operational conditions due to limited sensing modalities and datasets that lack environmental diversity. This work addresses these limitations by introducing a multimodal framework that fuses images and inertial measurements using a lightweight bidirectional cross-attention module followed by an adaptive gating layer that adjusts modality contributions under domain shifts. Given the limitations of current benchmarks, especially regarding lack of variability, we introduce ROAD, a new dataset composed of three complementary subsets: (i) real-world multimodal recordings with RGB-IMU streams synchronized using a gold-standard industry datalogger, captured across diverse lighting, weather, and surface conditions; (ii) a large vision-only subset designed to assess robustness under adverse illumination and heterogeneous capture setups; and (iii) a synthetic subset generated to study out-of-distribution generalization in scenarios difficult to obtain in practice. Experiments show that our method achieves a +1.4 pp improvement over the previous state-of-the-art on the PVS benchmark and an +11.6 pp improvement on our multimodal ROAD subset, with consistently higher F1-scores on minority classes. The framework also demonstrates stable performance across challenging visual conditions, including nighttime, heavy rain, and mixed-surface transitions. These findings indicate that combining affordable camera and IMU sensors with multimodal attention mechanisms provides a scalable, robust foundation for road surface understanding, particularly relevant for regions where environmental variability and cost constraints limit the adoption of high-end sensing suites.


[76] FreeFix: Boosting 3D Gaussian Splatting via Fine-Tuning-Free Diffusion Models cs.CVPDF

Hongyu Zhou, Zisen Shao, Sheng Miao, Pan Wang, Dongfeng Bai

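TL;DR: FreeFix is a method that improves 3D Gaussian Splatting without fine-tuning diffusion models, using pretrained image diffusion models to enhance extrapolated renderings. It adopts an interleaved 2D-3D refinement strategy and introduces a per-pixel confidence mask that identifies uncertain regions for targeted improvement, raising fidelity while preserving generalization.

Details

Motivation: Neural Radiance Fields and 3D Gaussian Splatting degrade with sparse inputs and at extrapolated views. Existing remedies face a trade-off between generalization and fidelity: fine-tuning diffusion models risks overfitting, while fine-tuning-free methods yield lower fidelity.

Result: Experiments on multiple datasets show that FreeFix improves multi-frame consistency and matches or surpasses fine-tuning-based methods while retaining strong generalization.

Insight: The contributions are consistent refinement without fine-tuning the diffusion model and without costly video diffusion models, plus an interleaved 2D-3D refinement strategy and a per-pixel confidence mask that target uncertain regions, balancing generalization and fidelity.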

Abstract: Neural Radiance Fields and 3D Gaussian Splatting have advanced novel view synthesis, yet still rely on dense inputs and often degrade at extrapolated views. Recent approaches leverage generative models, such as diffusion models, to provide additional supervision, but face a trade-off between generalization and fidelity: fine-tuning diffusion models for artifact removal improves fidelity but risks overfitting, while fine-tuning-free methods preserve generalization but often yield lower fidelity. We introduce FreeFix, a fine-tuning-free approach that pushes the boundary of this trade-off by enhancing extrapolated rendering with pretrained image diffusion models. We present an interleaved 2D-3D refinement strategy, showing that image diffusion models can be leveraged for consistent refinement without relying on costly video diffusion models. Furthermore, we take a closer look at the guidance signal for 2D refinement and propose a per-pixel confidence mask to identify uncertain regions for targeted improvement. Experiments across multiple datasets show that FreeFix improves multi-frame consistency and achieves performance comparable to or surpassing fine-tuning-based methods, while retaining strong generalization ability.


eess.IV [Back]

[77] SegRap2025: A Benchmark of Gross Tumor Volume and Lymph Node Clinical Target Volume Segmentation for Radiotherapy Planning of Nasopharyngeal Carcinoma eess.IV | cs.CVPDF

Jia Fu, Litingyu Wang, He Li, Zihao Luo, Huamin Wang

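TL;DR: SegRap2025 is a benchmark for segmenting Gross Tumor Volume (GTV) and Lymph Node Clinical Target Volume (LN CTV) in radiotherapy planning for nasopharyngeal carcinoma (NPC). Building on SegRap2023, it introduces multi-center, multi-modality CT data to evaluate how well segmentation models generalize and remain robust across imaging centers and modalities. The challenge has two tasks: Task01 targets GTV segmentation and evaluates cross-center generalization, and Task02 targets LN CTV segmentation and evaluates cross-center and cross-modality robustness. The paper summarizes the solutions and performance of ten participating teams.

Details

Motivation: Accurate segmentation of GTV, LN CTV, and organs at risk from CT images is essential for precise radiotherapy planning in NPC. Existing models generalize poorly across centers and across CT modalities (e.g., non-contrast versus contrast-enhanced CT), so SegRap2025 aims to establish a large-scale benchmark to advance research on this problem.

Result: For GTV segmentation, the top-performing models achieved average Dice Similarity Coefficients (DSC) of 74.61% and 56.79% on the internal and external test sets, respectively. For LN CTV segmentation, the best average DSC reached 60.24%, 60.50%, and 57.23% on the paired-CT, ceCT-only, and ncCT-only subsets, respectively.

Insight: The main contribution is a large-scale, multi-center, multi-modality benchmark dedicated to evaluating the generalization and robustness of radiotherapy target segmentation models. It provides an evaluation framework and insights for developing clinically applicable automated radiotherapy planning systems, and underlines the importance of performance in real clinical settings, where data sources vary and modalities may be missing.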

Abstract: Accurate delineation of Gross Tumor Volume (GTV), Lymph Node Clinical Target Volume (LN CTV), and Organ-at-Risk (OAR) from Computed Tomography (CT) scans is essential for precise radiotherapy planning in Nasopharyngeal Carcinoma (NPC). Building upon SegRap2023, which focused on OAR and GTV segmentation using single-center paired non-contrast CT (ncCT) and contrast-enhanced CT (ceCT) scans, the SegRap2025 challenge aims to enhance the generalizability and robustness of segmentation models across imaging centers and modalities. SegRap2025 comprises two tasks: Task01 addresses GTV segmentation using paired CT from the SegRap2023 dataset, with an additional external testing set to evaluate cross-center generalization, and Task02 focuses on LN CTV segmentation using multi-center training data and an unseen external testing set, where each case contains paired CT scans or a single modality, emphasizing both cross-center and cross-modality robustness. This paper presents the challenge setup and provides a comprehensive analysis of the solutions submitted by ten participating teams. For the GTV segmentation task, the top-performing models achieved average Dice Similarity Coefficients (DSC) of 74.61% and 56.79% on the internal and external testing cohorts, respectively. For the LN CTV segmentation task, the highest average DSC values reached 60.24%, 60.50%, and 57.23% on paired CT, ceCT-only, and ncCT-only subsets, respectively. SegRap2025 establishes a large-scale multi-center, multi-modality benchmark for evaluating generalization and robustness in radiotherapy target segmentation, providing valuable insights toward clinically applicable automated radiotherapy planning systems. The benchmark is available at: https://hilab-git.github.io/SegRap2025_Challenge.
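For readers unfamiliar with the metric reported above, the Dice Similarity Coefficient between a predicted mask P and a reference mask T is 2|P ∩ T| / (|P| + |T|). A minimal sketch on flattened binary masks follows; the example masks are made up, and returning 1.0 when both masks are empty is a convention chosen here, not something the challenge specifies.

```python
def dice_coefficient(pred, target):
    """Dice = 2 * |P ∩ T| / (|P| + |T|) for flat binary masks of equal length."""
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        return 1.0  # both masks empty: treated as perfect agreement (a convention)
    return 2.0 * intersection / total


# Hypothetical 4-voxel masks: one voxel overlaps, each mask has two foreground voxels.
pred_mask = [1, 1, 0, 0]
true_mask = [1, 0, 1, 0]
dsc = dice_coefficient(pred_mask, true_mask)
print(f"DSC = {dsc:.2%}")
```

A DSC of 74.61% therefore means that twice the overlapping volume is about three quarters of the combined predicted and reference volumes.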


cs.DC [Back]

[78] StreamFusion: Scalable Sequence Parallelism for Distributed Inference of Diffusion Transformers on GPUs cs.DC | cs.CVPDF

Jiacheng Yang, Jun Wu, Yaoyao Ding, Zhiying Xu, Yida Wang

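TL;DR: This paper presents StreamFusion, a scalable sequence-parallel serving engine for distributed inference of Diffusion Transformers (DiTs). It combines a topology-aware sequence parallelism technique, Torus Attention (a new technique that overlaps inter-machine all-to-all operations with computation), and a one-sided communication implementation to address the poor communication patterns, latency bottlenecks, and synchronization overheads of existing sequence parallelism methods on modern GPU clusters.

Details

Motivation: As demand grows for higher-resolution images and longer videos, single-GPU inference becomes inefficient because of increased latency and large activation sizes. Existing sequence parallelism techniques (such as Ulysses Attention and Ring Attention) have three main limitations when scaling inference: communication patterns that are suboptimal for modern GPU topologies, latency bottlenecks from inter-machine all-to-all operations, and GPU sender-receiver synchronization and computation overheads from two-sided communication libraries.

Result: Experiments show that StreamFusion outperforms the state-of-the-art approach by 1.35× on average and by up to 1.77×.

Insight: The contributions are: (1) a topology-aware sequence parallelism technique that accounts for inter- and intra-machine bandwidth differences; (2) Torus Attention, a novel sequence parallelism technique that overlaps inter-machine all-to-all communication with computation; and (3) a one-sided communication implementation that minimizes GPU synchronization and computation overheads. Viewed objectively, the core insight is to tie communication optimization closely to hardware topology and to hide latency by overlapping communication with computation, which is relevant to large-scale distributed model inference in general.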

Abstract: Diffusion Transformers (DiTs) have gained increasing adoption in high-quality image and video generation. As demand for higher-resolution images and longer videos increases, single-GPU inference becomes inefficient due to increased latency and large activation sizes. Current frameworks employ sequence parallelism (SP) techniques such as Ulysses Attention and Ring Attention to scale inference. However, these implementations have three primary limitations: (1) suboptimal communication patterns for network topologies on modern GPU machines, (2) latency bottlenecks from all-to-all operations in inter-machine communication, and (3) GPU sender-receiver synchronization and computation overheads from using two-sided communication libraries. To address these issues, we present StreamFusion, a topology-aware efficient DiT serving engine. StreamFusion incorporates three key innovations: (1) a topology-aware sequence parallelism technique that accounts for inter- and intra-machine bandwidth differences, (2) Torus Attention, a novel SP technique enabling overlapping of inter-machine all-to-all operations with computation, and (3) a one-sided communication implementation that minimizes GPU sender-receiver synchronization and computation overheads. Our experiments demonstrate that StreamFusion outperforms the state-of-the-art approach by an average of $1.35\times$ (up to $1.77\times$).


cs.LG [Back]

[79] MeanCache: From Instantaneous to Average Velocity for Accelerating Flow Matching Inference cs.LG | cs.AI | cs.CVPDF

Huanlin Gao, Ping Chen, Fuyuan Shi, Ruijia Wu, Li YanTao

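TL;DR: MeanCache is a training-free caching framework for accelerating Flow Matching inference. It adopts an average-velocity perspective: cached Jacobian-vector products (JVP) are used to construct interval average velocities from instantaneous velocities, which mitigates local error accumulation. A trajectory-stability scheduling strategy further improves cache timing and the stability of JVP reuse.

Details

Motivation: Existing caching methods rely on instantaneous velocity information (e.g., feature caching), which leads to severe trajectory deviation and error accumulation at high acceleration ratios. MeanCache aims to solve this problem.

Result: On FLUX.1, Qwen-Image, and HunyuanVideo, MeanCache achieves 4.12×, 4.56×, and 3.59× acceleration, respectively, while consistently outperforming state-of-the-art caching baselines in generation quality.

Insight: The novelty is building the caching mechanism from an average-velocity perspective, using JVP caching and trajectory-stability scheduling to reduce error accumulation. This offers a new, stability-driven approach to accelerating Flow Matching inference that is applicable to commercial-scale generative models.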

Abstract: We present MeanCache, a training-free caching framework for efficient Flow Matching inference. Existing caching methods reduce redundant computation but typically rely on instantaneous velocity information (e.g., feature caching), which often leads to severe trajectory deviations and error accumulation under high acceleration ratios. MeanCache introduces an average-velocity perspective: by leveraging cached Jacobian–vector products (JVP) to construct interval average velocities from instantaneous velocities, it effectively mitigates local error accumulation. To further improve cache timing and JVP reuse stability, we develop a trajectory-stability scheduling strategy as a practical tool, employing a Peak-Suppressed Shortest Path under budget constraints to determine the schedule. Experiments on FLUX.1, Qwen-Image, and HunyuanVideo demonstrate that MeanCache achieves 4.12×, 4.56×, and 3.59× acceleration, respectively, while consistently outperforming state-of-the-art caching baselines in generation quality. We believe this simple yet effective approach provides a new perspective for Flow Matching inference and will inspire further exploration of stability-driven acceleration in commercial-scale generative models.
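Why an average velocity helps can be seen on a toy ODE. Over a step of length Δ, the true displacement equals Δ times the average velocity on the interval, and a first-order Taylor expansion gives average ≈ v(t) + (Δ/2)·dv/dt, where dv/dt along the trajectory is a Jacobian-vector product. The sketch below is an illustration of that relationship only, not MeanCache's actual construction: the velocity field v(x) = -x and the analytic derivative are assumptions chosen so the exact answer is known.

```python
import math


def velocity(x):
    """Toy velocity field v(x) = -x, so trajectories are x(t) = x0 * exp(-t)."""
    return -x


def velocity_total_derivative(x):
    """dv/dt along the trajectory: (dv/dx) * v = (-1) * (-x) = x.

    In a real model this quantity would come from a Jacobian-vector product.
    """
    return x


x0, step = 1.0, 0.5

# True average velocity over the step, from the closed-form trajectory.
exact_avg = (x0 * math.exp(-step) - x0) / step

# Reusing the instantaneous velocity for the whole step (plain Euler).
instantaneous = velocity(x0)

# First-order corrected estimate of the interval average velocity.
corrected = instantaneous + 0.5 * step * velocity_total_derivative(x0)

print(exact_avg, instantaneous, corrected)
```

In this example the corrected estimate is noticeably closer to the true interval average than the instantaneous velocity is, which is the kind of error reduction that matters when large steps are taken or cached values are reused.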


[80] Latent Object Permanence: Topological Phase Transitions, Free-Energy Principles, and Renormalization Group Flows in Deep Transformer Manifolds cs.LG | cs.CLPDF

Faruk Alpay, Bugra Kilictas

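TL;DR: This paper studies the emergence of multi-step reasoning in deep Transformer language models through the lens of geometry and statistical physics. Treating hidden-state trajectories as a flow on an implicit Riemannian manifold and analyzing the layerwise covariance spectrum of activations, the authors find that models undergo a phase transition at a critical normalized depth, marked by a sharp drop in effective dimensionality and the formation of stable "concept basins".

Details

Motivation: The goal is to understand how multi-step reasoning emerges in the internal representation space of deep Transformer models, and which geometric and statistical-physics mechanisms underlie that emergence.

Result: Across several open-weight model families (1.5B to 30B parameters), layerwise probes validate the predicted signatures: a sparsity/localization-based order parameter exhibits a discontinuity near a critical normalized depth γ_c ≈ 0.42, indicating a phase transition.

Insight: The novelty is formalizing the Transformer forward pass as a discrete coarse-graining map and relating it to fixed points of renormalization-group-like dynamics, which explains the formation of reusable, object-like structures in representation space, called Transient Class Objects (TCOs). The paper also gives theoretical conditions connecting logical separability to spectral decay.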

Abstract: We study the emergence of multi-step reasoning in deep Transformer language models through a geometric and statistical-physics lens. Treating the hidden-state trajectory as a flow on an implicit Riemannian manifold, we analyze the layerwise covariance spectrum of activations, where $C^{(\ell)}=\mathbb{E}[h^{(\ell)}h^{(\ell)\top}]$, and track deviations from a random-matrix bulk. Across model scales (1.5B–30B), we observe a sharp reduction in effective dimensionality consistent with a phase transition: an order parameter based on sparsity/localization, $\Omega(h)=1-\lVert h\rVert_1/(\sqrt{d}\,\lVert h\rVert_2)$, exhibits a discontinuity near a critical normalized depth $\gamma_c\approx 0.42$ in sufficiently large models. We formalize the forward pass as a discrete coarse-graining map and relate the appearance of stable “concept basins” to fixed points of this renormalization-like dynamics. The resulting low-entropy regime is characterized by a spectral tail collapse and by the formation of transient, reusable object-like structures in representation space, which we call Transient Class Objects (TCOs). We provide theoretical conditions connecting logical separability to spectral decay and validate the predicted signatures with layerwise probes on multiple open-weight model families.
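The order parameter in the abstract can be computed directly from a hidden-state vector. The sketch below reads the bars in the formula as the L1 and L2 norms and d as the vector dimension, which is an assumption about the intended notation; the example vectors are made up.

```python
import math


def sparsity_order_parameter(h):
    """Omega(h) = 1 - ||h||_1 / (sqrt(d) * ||h||_2), with d = len(h).

    Equals 0 when all entries have equal magnitude (fully dense) and
    approaches 1 - 1/sqrt(d) when a single entry dominates (localized).
    """
    d = len(h)
    l1 = sum(abs(x) for x in h)
    l2 = math.sqrt(sum(x * x for x in h))
    return 1.0 - l1 / (math.sqrt(d) * l2)


dense = [1.0, 1.0, 1.0, 1.0]      # activation spread evenly over all units
localized = [1.0, 0.0, 0.0, 0.0]  # activation concentrated in one unit

print(sparsity_order_parameter(dense))
print(sparsity_order_parameter(localized))
```

Tracking this quantity layer by layer, as the paper describes, would show whether activations become abruptly more localized at some depth; the function itself says nothing about where such a jump occurs.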


[81] Continual GUI Agents cs.LG | cs.CVPDF

Ziwei Liu, Borui Kang, Hangjie Yuan, Zixiang Zhao, Wei Li

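TL;DR: This paper proposes Continual GUI Agents, a new task that addresses the performance degradation of GUI agents when the data distribution shifts over time (for example, to new domains or resolutions). The authors introduce GUI-AiF, a reinforcement fine-tuning framework whose two novel rewards, APR-iF and ARR-iF, guide agents to adapt to shifting interaction points and regions and thereby stabilize continual learning. Experiments show that the method surpasses existing baselines.

Details

Motivation: GUI agents trained on static data distributions degrade in dynamic digital environments, where new GUI data with new domains or resolutions keeps arriving.

Result: Extensive experiments on the Continual GUI Agents task show that GUI-AiF surpasses state-of-the-art baselines.

Insight: The novelty is GUI-AiF, the first continual learning framework for GUI agents. Its APR-iF and ARR-iF rewards stabilize grounding on shifting interaction points and regions and avoid the tendency of existing reward strategies to over-adapt to static grounding cues (e.g., fixed coordinates or element scales). This reveals the untapped potential of reinforcement fine-tuning for continual GUI agents.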

Abstract: As digital environments (data distributions) are in flux, with new GUI data arriving over time and introducing new domains or resolutions, agents trained on static environments deteriorate in performance. In this work, we introduce Continual GUI Agents, a new task that requires GUI agents to perform continual learning under shifted domains and resolutions. We find that existing methods fail to maintain stable grounding as GUI distributions shift over time, due to the diversity of UI interaction points and regions in scenarios that are in flux. To address this, we introduce GUI-Anchoring in Flux (GUI-AiF), a new reinforcement fine-tuning framework that stabilizes continual learning through two novel rewards: Anchoring Point Reward in Flux (APR-iF) and Anchoring Region Reward in Flux (ARR-iF). These rewards guide the agents to align with shifting interaction points and regions, mitigating the tendency of existing reward strategies to over-adapt to static grounding cues (e.g., fixed coordinates or element scales). Extensive experiments show that GUI-AiF surpasses state-of-the-art baselines. Our work establishes the first continual learning framework for GUI agents, revealing the untapped potential of reinforcement fine-tuning for continual GUI agents.


[82] Spark: Strategic Policy-Aware Exploration via Dynamic Branching for Long-Horizon Agentic Learning cs.LG | cs.CLPDF

Jinyang Wu, Shuo Yang, Changpeng Yang, Yuhao Shen, Shuai Zhang

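TL;DR: This paper proposes Spark, a reinforcement learning framework that explores selectively by branching dynamically at critical decision states, addressing the scarcity of high-quality trajectories and the waste of computation in long-horizon tasks. It uses the agent's intrinsic decision-making signals to allocate resources precisely, prioritizing sample quality over blind coverage, and thereby improves task success rates and generalization with fewer training samples.

Details

Motivation: Training LLM-based agents for long-horizon tasks is hampered by the scarcity of high-quality trajectories. Existing methods typically scale up rollout sizes and spread computation uniformly across steps, which wastes substantial computation on trivial steps and does not guarantee sample quality.

Result: Experiments on diverse tasks, such as embodied planning, show that Spark achieves higher success rates with significantly fewer training samples and generalizes robustly to unseen scenarios.

Insight: The novelty is a strategic, policy-aware exploration mechanism based on dynamic branching at key states: adaptive branching is activated at critical decision points to probe promising trajectories, which reduces dependence on human priors and lets the agent expand exploration autonomously and generalize better. Viewed objectively, this dynamic resource allocation strategy offers a new approach to sample efficiency in long-horizon reinforcement learning.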

Abstract: Reinforcement learning has empowered large language models to act as intelligent agents, yet training them for long-horizon tasks remains challenging due to the scarcity of high-quality trajectories, especially under limited resources. Existing methods typically scale up rollout sizes and indiscriminately allocate computational resources among intermediate steps. Such attempts inherently waste substantial computation budget on trivial steps while failing to guarantee sample quality. To address this, we propose Spark (Strategic Policy-Aware exploRation via Key-state dynamic branching), a novel framework that selectively branches at critical decision states for resource-efficient exploration. Our key insight is to activate adaptive branching exploration at critical decision points to probe promising trajectories, thereby achieving precise resource allocation that prioritizes sampling quality over blind coverage. This design leverages the agent’s intrinsic decision-making signals to reduce dependence on human priors, enabling the agent to autonomously expand exploration and achieve stronger generalization. Experiments across diverse tasks (e.g., embodied planning) demonstrate that Spark achieves superior success rates with significantly fewer training samples, exhibiting robust generalization even in unseen scenarios.
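Selective branching can be sketched as a rule that spends extra rollouts only at states where the policy is undecided. The abstract does not say which intrinsic signal Spark uses, so the sketch below assumes the entropy of the action distribution as a stand-in; the threshold, the branch count, and the function names are all hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def branch_factor(action_probs, threshold=0.5, max_branches=3):
    """Number of rollouts to continue from this state.

    States where the policy is uncertain (entropy above the threshold) are
    treated as critical and get several branches; all others get one.
    """
    return max_branches if entropy(action_probs) > threshold else 1


confident_state = [0.97, 0.02, 0.01]  # policy strongly prefers one action
uncertain_state = [0.4, 0.3, 0.3]     # policy is undecided

print(branch_factor(confident_state), branch_factor(uncertain_state))
```

Under a fixed rollout budget, a rule of this kind concentrates computation on a few decision points instead of spreading it evenly over every intermediate step.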


[83] Truthfulness Despite Weak Supervision: Evaluating and Training LLMs Using Peer Prediction cs.LG | cs.AI | cs.CL | cs.GTPDF

Tianyi Alex Qiu, Micah Carroll, Cameron Allen

TL;DR: 本文提出了一种基于同伴预测(peer prediction)的方法,用于在弱监督下评估和训练大型语言模型(LLMs),以解决因缺乏强监督而导致的模型欺骗性评估问题。该方法利用博弈论中的激励相容原理,通过互预测性度量奖励诚实且信息丰富的回答,无需真实标签。理论保证和实证验证(包括对高达405B参数模型的测试)表明,该方法能有效抵抗欺骗,甚至在评估者与参与者能力差距巨大时(如超过100倍规模差异)表现更优。

Details

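Motivation: Current LLM evaluation and post-training rely heavily on supervision, but strong supervision is often unavailable for difficult tasks, so evaluations built on imperfect supervision are easily exploited by models and yield deceptive results.

Result: The method shows resistance to deception in both theoretical guarantees and empirical validation. For example, training an 8B model with a peer prediction-based reward recovers most of the truthfulness lost to malicious finetuning, even when the reward is produced by a 0.135B language model with no finetuning. For evaluation, in contrast to LLM-as-a-Judge, which requires strong and trusted judges, peer prediction exhibits an inverse scaling property: resistance to deception strengthens as the capability gap between experts and participants widens, enabling reliable evaluation of strong models under weak supervision. Specifically, LLM-as-a-Judge performs worse than random guessing against deceptive models 5-20x the judge's size, while peer prediction remains effective when the gap is large, including size differences of over 100x.

Insight: The core contribution is bringing peer prediction from mechanism design into LLM evaluation and training, using mutual predictability as the reward metric so that honest answers are incentivized without ground-truth labels. Viewed objectively, the inverse scaling property it reveals (the larger the capability gap, the more reliable the evaluation) is counterintuitive but important, and suggests a new way to evaluate frontier models when resources are limited or strong supervision is lacking. The method has practical potential in weak-supervision settings, particularly adversarial ones or those with asymmetric model capabilities.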

Abstract: The evaluation and post-training of large language models (LLMs) rely on supervision, but strong supervision for difficult tasks is often unavailable, especially when evaluating frontier models. In such cases, models are demonstrated to exploit evaluations built on such imperfect supervision, leading to deceptive results. However, underutilized in LLM research, a wealth of mechanism design research focuses on game-theoretic incentive compatibility, i.e., eliciting honest and informative answers with weak supervision. Drawing from this literature, we introduce the peer prediction method for model evaluation and post-training. It rewards honest and informative answers over deceptive and uninformative ones, using a metric based on mutual predictability and without requiring ground truth labels. We demonstrate the method’s effectiveness and resistance to deception, with both theoretical guarantees and empirical validation on models with up to 405B parameters. We show that training an 8B model with peer prediction-based reward recovers most of the drop in truthfulness due to prior malicious finetuning, even when the reward is produced by a 0.135B language model with no finetuning. On the evaluation front, in contrast to LLM-as-a-Judge, which requires strong and trusted judges, we discover an inverse scaling property in peer prediction, where, surprisingly, resistance to deception is strengthened as the capability gap between the experts and participants widens, enabling reliable evaluation of strong models with weak supervision. In particular, LLM-as-a-Judge becomes worse than random guessing when facing deceptive models 5-20x the judge’s size, while peer prediction thrives when such gaps are large, including in cases with over 100x size difference.


[84] TABED: Test-Time Adaptive Ensemble Drafting for Robust Speculative Decoding in LVLMs cs.LG | cs.CLPDF

Minjae Lee, Wonjun Kang, Byeongkeun Ahn, Christian Classen, Kevin Galim

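TL;DR: This paper proposes TABED (Test-time Adaptive Batched Ensemble Drafting), a method for accelerating inference in large vision-language models (LVLMs). It dynamically ensembles the candidate tokens produced by multiple drafts and adapts the ensemble using deviations from past ground truths available in the speculative decoding setting, yielding stable and efficient speedups across diverse input scenarios.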

Details

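Motivation: Existing speculative decoding (SD) methods mainly target text-only LLMs; their application to LVLMs is underexplored, and their performance fluctuates considerably across input scenarios. The paper aims to make LVLM inference acceleration robust by using a dynamic ensemble strategy that adapts to scenario changes.

Result: Experiments on 11 datasets show that TABED achieves an average robust wall-clock speedup of 1.74x over autoregressive decoding and a 5% improvement over single drafting methods, while remaining training-free with negligible ensembling overhead.

Insight: The novelty is a training-free, plug-and-play dynamic ensemble drafting framework that adapts the ensemble using deviations from past ground truths observed during speculative decoding. This improves the robustness and efficiency of LVLM inference acceleration, and the framework can be combined with other advanced verification and drafting methods.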

Abstract: Speculative decoding (SD) has proven effective for accelerating LLM inference by quickly generating draft tokens and verifying them in parallel. However, SD remains largely unexplored for Large Vision-Language Models (LVLMs), which extend LLMs to process both image and text prompts. To address this gap, we benchmark existing inference methods with small draft models on 11 datasets across diverse input scenarios and observe scenario-specific performance fluctuations. Motivated by these findings, we propose Test-time Adaptive Batched Ensemble Drafting (TABED), which dynamically ensembles multiple drafts obtained via batch inference by leveraging deviations from past ground truths available in the SD setting. The dynamic ensemble method achieves an average robust walltime speedup of 1.74x over autoregressive decoding and a 5% improvement over single drafting methods, while remaining training-free and keeping ensembling costs negligible through parameter sharing. With its plug-and-play compatibility, we further enhance TABED by integrating advanced verification and alternative drafting methods. Code and custom-trained models are available at https://github.com/furiosa-ai/TABED.
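
The draft-then-verify loop that the abstract's first sentence refers to can be sketched as below. This is a minimal greedy variant of generic speculative decoding, not TABED's ensemble drafting: the function name `speculative_step`, the callables `draft_next` and `target_next`, and the block size `k` are all hypothetical, and the verification is written sequentially for clarity where a real system would score all draft positions in one parallel pass.

```python
def speculative_step(prefix, draft_next, target_next, k=4):
    """One greedy speculative-decoding step (illustrative sketch).

    draft_next / target_next map a token list to the next token under
    the small draft model and the large target model respectively.
    """
    # 1) The cheap draft model proposes k tokens autoregressively.
    proposal = []
    ctx = list(prefix)
    for _ in range(k):
        tok = draft_next(ctx)
        proposal.append(tok)
        ctx.append(tok)

    # 2) The target model checks each proposed token in order.
    accepted = []
    ctx = list(prefix)
    for tok in proposal:
        expected = target_next(ctx)
        if tok != expected:
            # First mismatch: keep the target's token and stop.
            accepted.append(expected)
            return accepted
        accepted.append(tok)
        ctx.append(tok)

    # 3) Every draft token matched, so the target adds one extra token.
    accepted.append(target_next(ctx))
    return accepted
```

Under greedy decoding the accepted tokens are exactly what the target model would have produced on its own, so the speedup depends only on how often the draft agrees with the target.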


[85] Training Reasoning Models on Saturated Problems via Failure-Prefix Conditioning cs.LG | cs.AI | cs.CLPDF

Minwu Kim, Safal Shrestha, Keith Ross

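TL;DR: This paper proposes failure-prefix conditioning, a method for the stalled learning that occurs in Reinforcement Learning with Verifiable Rewards (RLVR) training as problems become saturated. Instead of conditioning on the original question, training is conditioned on prefixes extracted from rare incorrect reasoning trajectories, so the model is exposed to failure-prone states more often and makes effective use of learning signals that already exist.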

Details

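Motivation: In RLVR training, as problems become saturated, informative failures are rarely encountered during standard rollouts, so existing learning signals go unused and further performance gains are blocked. The paper targets this core challenge.

Result: Experiments show that performance gains on saturated problems match those obtained by training on medium-difficulty problems, while token efficiency is preserved. An iterative strategy that refreshes failure prefixes yields additional gains after performance plateaus. The analysis also finds that the method reduces performance degradation under misleading failure prefixes.

Insight: The main contribution is failure-prefix conditioning, a simple and effective training strategy that taps the remaining learning potential of saturated problems by reallocating exploration toward failure-prone states. Viewed objectively, the core insight is to shift training-data generation from questions to intermediate failure states, an efficient way to use failure samples for curriculum learning or data augmentation.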

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has substantially improved the reasoning abilities of large language models (LLMs), yet training often stalls as problems become saturated. We identify the core challenge as the poor accessibility of informative failures: learning signals exist but are rarely encountered during standard rollouts. To address this, we propose failure-prefix conditioning, a simple and effective method for learning from saturated problems. Rather than starting from the original question, our approach reallocates exploration by conditioning training on prefixes derived from rare incorrect reasoning trajectories, thereby exposing the model to failure-prone states. We observe that failure-prefix conditioning yields performance gains matching those of training on medium-difficulty problems, while preserving token efficiency. Furthermore, we analyze the model’s robustness, finding that our method reduces performance degradation under misleading failure prefixes, albeit with a mild trade-off in adherence to correct early reasoning. Finally, we demonstrate that an iterative approach, which refreshes failure prefixes during training, unlocks additional gains after performance plateaus. Overall, our results suggest that failure-prefix conditioning offers an effective pathway to extend RLVR training on saturated problems.
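
One way to picture the prefix construction described above is the sketch below. It is an illustration under assumptions, not the authors' pipeline: the function name `build_failure_prefix_prompts`, the newline-based step splitting, the truncation fractions, and the prompt layout are all hypothetical choices made for this example.

```python
def build_failure_prefix_prompts(question, rollouts, is_correct,
                                 fractions=(0.25, 0.5, 0.75)):
    """Turn rare incorrect rollouts into new training prompts (sketch).

    Each prompt is the original question followed by the first part of
    an incorrect reasoning trace, so later rollouts start from a
    failure-prone state instead of from the bare question.
    """
    prompts = []
    seen = set()
    for trace in rollouts:
        if is_correct(trace):
            continue  # on saturated problems most rollouts end up here
        steps = trace.strip().split("\n")  # assumption: one step per line
        for frac in fractions:
            cut = max(1, int(len(steps) * frac))
            prefix = "\n".join(steps[:cut])
            if prefix not in seen:  # avoid duplicate prompts
                seen.add(prefix)
                prompts.append(question + "\n" + prefix + "\n")
    return prompts
```

For a four-step incorrect trace and the default fractions, this yields three prompts ending after steps one, two, and three. The iterative variant mentioned in the abstract would presumably rerun this construction on fresh rollouts as training progresses.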


[86] Evolutionary Strategies lead to Catastrophic Forgetting in LLMs cs.LG | cs.AI | cs.CLPDF

Immanuel Abdi, Akshat Gupta, Micah Mok, Alexander Lu, Nicholas Lee

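TL;DR: This paper presents a comprehensive analysis of the continual-learning ability of Evolutionary Strategies (ES) in large language models (LLMs). Although ES reaches performance close to GRPO on math and reasoning tasks, training is accompanied by severe catastrophic forgetting, which limits its applicability to online learning.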

Details

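Motivation: Current AI systems lack the ability to learn continually after deployment, and gradient-based algorithms have large memory requirements. The paper therefore studies gradient-free Evolutionary Strategies as an alternative and evaluates their forgetting behavior in continual learning.

Result: On math and reasoning tasks, ES performs close to GRPO under a comparable compute budget, but it shows significant catastrophic forgetting as the number of update steps increases, with forgetting curves that contrast with those of GRPO.

Insight: ES updates are much less sparse than GRPO updates and have an ℓ2 norm that is orders of magnitude larger, which explains the forgetting behavior. The study highlights the forgetting problem in gradient-free algorithms such as ES and points to directions for mitigating it in future work.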

Abstract: One of the biggest missing capabilities in current AI systems is the ability to learn continuously after deployment. Implementing such continually learning systems has several challenges, one of which is the large memory requirement of gradient-based algorithms that are used to train state-of-the-art LLMs. Evolutionary Strategies (ES) have recently re-emerged as a gradient-free alternative to traditional learning algorithms and have shown encouraging performance on specific tasks in LLMs. In this paper, we perform a comprehensive analysis of ES and specifically evaluate its forgetting curves when training for an increasing number of update steps. We first find that ES is able to reach performance numbers close to GRPO for math and reasoning tasks with a comparable compute budget. However, and most importantly for continual learning, the performance gains in ES are accompanied by significant forgetting of prior abilities, limiting its applicability for training models online. We also explore the reason behind this behavior and show that the updates made using ES are much less sparse and have orders of magnitude larger $\ell_2$ norm compared to corresponding GRPO updates, explaining the contrasting forgetting curves between the two algorithms. With this study, we aim to highlight the issue of forgetting in gradient-free algorithms like ES and hope to inspire future work to mitigate these issues.
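
To make the sparsity and norm observation concrete, the sketch below runs one step of a vanilla Gaussian-perturbation ES on a toy objective and measures the resulting update. It is a textbook-style ES with standardized scores, not the paper's training setup; `es_step`, `toy_fitness`, and all hyperparameter values are hypothetical and chosen only for illustration.

```python
import math
import random


def es_step(theta, fitness, sigma=0.1, lr=0.05, population=16, rng=None):
    """One vanilla ES update: perturb, score, move along weighted noise."""
    if rng is None:
        rng = random.Random(0)
    dim = len(theta)
    # Sample one Gaussian perturbation per population member.
    noises = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
              for _ in range(population)]
    scores = [fitness([t + sigma * e for t, e in zip(theta, noise)])
              for noise in noises]
    # Standardize scores so the step size does not depend on reward scale.
    mean = sum(scores) / population
    std = math.sqrt(sum((s - mean) ** 2 for s in scores) / population) or 1.0
    weights = [(s - mean) / std for s in scores]
    update = [
        lr / (population * sigma)
        * sum(w * noise[i] for w, noise in zip(weights, noises))
        for i in range(dim)
    ]
    return [t + u for t, u in zip(theta, update)], update


def l2_norm(vec):
    return math.sqrt(sum(x * x for x in vec))


def nonzero_fraction(vec, tol=1e-12):
    return sum(abs(x) > tol for x in vec) / len(vec)


def toy_fitness(params):
    # Hypothetical objective that depends on the first coordinate only.
    return -(params[0] - 1.0) ** 2
```

Even though `toy_fitness` depends on a single coordinate, the ES update moves every coordinate, because each parameter receives a share of the score-weighted noise. This is one intuition for why ES updates can be dense; the paper's measurements on LLMs are a separate empirical result.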


cs.RO [Back]

[87] TRACER: Texture-Robust Affordance Chain-of-Thought for Deformable-Object Refinement cs.RO | cs.CVPDF

Wanjun Jia, Kang Li, Fan Yang, Mengfei Duan, Wenrui Chen

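TL;DR: This paper proposes TRACER, a framework that addresses the challenge of aligning high-level semantic instructions with physical interaction points when robots manipulate deformable objects under complex appearance and texture variations. It decomposes task intentions with a tree-structured affordance chain-of-thought and combines Spatial-Constrained Boundary Refinement with an Interactive Convergence Refinement Flow to improve the accuracy and physical consistency of functional-region prediction.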

Details

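Motivation: When handling deformable objects, existing vision-based affordance prediction methods often suffer from boundary overflow and fragmented functional regions because of the extremely high degrees of freedom, complex dynamics, and heterogeneous patterns, which makes it hard to align high-level semantic reasoning with low-level physical execution.

Result: Extensive experiments on the Fine-AGDDO15 dataset and a real-world robotic platform show that TRACER significantly improves affordance grounding precision across the diverse textures and patterns of deformable objects and raises the success rate of long-horizon tasks.

Insight: The novelty is a cross-hierarchical mapping framework from hierarchical semantic reasoning to appearance-robust, physically consistent functional-region refinement. Concretely, a tree-structured chain-of-thought decomposes the task, while Spatial-Constrained Boundary Refinement and the Interactive Convergence Refinement Flow suppress prediction spillover and aggregate noisy pixels, improving spatial continuity and physical plausibility.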

Abstract: The central challenge in robotic manipulation of deformable objects lies in aligning high-level semantic instructions with physical interaction points under complex appearance and texture variations. Due to near-infinite degrees of freedom, complex dynamics, and heterogeneous patterns, existing vision-based affordance prediction methods often suffer from boundary overflow and fragmented functional regions. To address these issues, we propose TRACER, a Texture-Robust Affordance Chain-of-thought with dEformable-object Refinement framework, which establishes a cross-hierarchical mapping from hierarchical semantic reasoning to appearance-robust and physically consistent functional region refinement. Specifically, a Tree-structured Affordance Chain-of-Thought (TA-CoT) is formulated to decompose high-level task intentions into hierarchical sub-task semantics, providing consistent guidance across various execution stages. To ensure spatial integrity, a Spatial-Constrained Boundary Refinement (SCBR) mechanism is introduced to suppress prediction spillover, guiding the perceptual response to converge toward authentic interaction manifolds. Furthermore, an Interactive Convergence Refinement Flow (ICRF) is developed to aggregate discrete pixels corrupted by appearance noise, significantly enhancing the spatial continuity and physical plausibility of the identified functional regions. Extensive experiments conducted on the Fine-AGDDO15 dataset and a real-world robotic platform demonstrate that TRACER significantly improves affordance grounding precision across diverse textures and patterns inherent to deformable objects. More importantly, it enhances the success rate of long-horizon tasks, effectively bridging the gap between high-level semantic reasoning and low-level physical execution. The source code and dataset will be made publicly available at https://github.com/Dikay1/TRACER.


cs.AI [Back]

[88] Scaling Medical Reasoning Verification via Tool-Integrated Reinforcement Learning cs.AI | cs.CLPDF

Hang Zhang, Ruheng Wang, Yuelyu Ji, Mingu Kwak, Xizhi Wu

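TL;DR: This paper proposes an agentic framework named Tool-Integrated Reinforcement Learning for strengthening the fact-checking ability of large language models in medical reasoning. The framework trains medical reasoning verifiers to iteratively query external medical corpora during evaluation, combining tool-augmented verification with a reinforcement learning paradigm that requires only trace-level supervision and an adaptive curriculum mechanism that dynamically adjusts the training data distribution.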

Details

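Motivation: Existing reward-model-based methods for verifying medical reasoning traces have two main limitations: they output only scalar reward values without explicit justification, and they rely on single-pass retrieval, which prevents adaptive knowledge access during verification. The paper aims to address these problems to improve the reliability and interpretability of medical reasoning systems.

Result: Across four medical reasoning benchmarks (including MedQA and MedXpertQA), the method achieves substantial gains over existing methods, improving MedQA accuracy by 23.5% and MedXpertQA accuracy by 32.0% relative to the base generator. Crucially, it reduces the sampling budget requirement by 8x compared to prior reward model baselines.

Insight: The novelty lies in grounding verification in dynamically retrieved evidence through a tool-augmented, iterative reinforcement learning framework, which yields more efficient and interpretable verification of medical reasoning. This offers a principled path toward more reliable medical reasoning systems, and the adaptive curriculum mechanism and reduced sampling cost are worth borrowing.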

Abstract: Large language models have achieved strong performance on medical reasoning benchmarks, yet their deployment in clinical settings demands rigorous verification to ensure factual accuracy. While reward models offer a scalable approach for reasoning trace verification, existing methods face two limitations: they produce only scalar reward values without explicit justification, and they rely on single-pass retrieval that precludes adaptive knowledge access as verification unfolds. We introduce $\method$, an agentic framework that addresses these limitations by training medical reasoning verifiers to iteratively query external medical corpora during evaluation. Our approach combines tool-augmented verification with an iterative reinforcement learning paradigm that requires only trace-level supervision, alongside an adaptive curriculum mechanism that dynamically adjusts training data distribution. Across four medical reasoning benchmarks, $\method$ achieves substantial gains over existing methods, improving MedQA accuracy by 23.5% and MedXpertQA by 32.0% relative to the base generator in particular. Crucially, $\method$ demonstrates an $\mathbf{8\times}$ reduction in sampling budget requirement compared to prior reward model baselines. These findings establish that grounding verification in dynamically retrieved evidence offers a principled path toward more reliable medical reasoning systems.


[89] CtrlCoT: Dual-Granularity Chain-of-Thought Compression for Controllable Reasoning cs.AI | cs.CLPDF

Zhenxuan Fan, Jie Cao, Yang Dai, Zheqi Lv, Wenqiao Zhang

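TL;DR: CtrlCoT is a dual-granularity chain-of-thought (CoT) compression framework that harmonizes semantic abstraction and token-level pruning to compress CoTs, reducing the latency and memory overhead of LLM reasoning while preserving correctness.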

Details

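Motivation: Existing CoT compression methods either compress too conservatively at the semantic level or prune tokens too aggressively, which can drop critical reasoning cues and reduce accuracy. Combining the two is also difficult because of sequential dependency, task-agnostic pruning, and distribution mismatch.

Result: On the MATH-500 benchmark with Qwen2.5-7B-Instruct, CtrlCoT uses 30.7% fewer tokens while achieving accuracy 7.6 percentage points higher than the strongest baseline, giving more efficient and reliable reasoning.

Insight: The contributions are: Hierarchical Reasoning Abstraction, which produces CoTs at multiple semantic granularities; Logic-Preserving Distillation, which trains a logic-aware pruner to retain key reasoning cues; and Distribution-Alignment Generation, which aligns compressed CoTs with inference-time reasoning styles to avoid fragmentation. Together they provide a systematic framework for controllable CoT compression.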

Abstract: Chain-of-thought (CoT) prompting improves LLM reasoning but incurs high latency and memory cost due to verbose traces, motivating CoT compression with preserved correctness. Existing methods either shorten CoTs at the semantic level, which is often conservative, or prune tokens aggressively, which can miss task-critical cues and degrade accuracy. Moreover, combining the two is non-trivial due to sequential dependency, task-agnostic pruning, and distribution mismatch. We propose CtrlCoT, a dual-granularity CoT compression framework that harmonizes semantic abstraction and token-level pruning through three components: Hierarchical Reasoning Abstraction produces CoTs at multiple semantic granularities; Logic-Preserving Distillation trains a logic-aware pruner to retain indispensable reasoning cues (e.g., numbers and operators) across pruning ratios; and Distribution-Alignment Generation aligns compressed traces with fluent inference-time reasoning styles to avoid fragmentation. On MATH-500 with Qwen2.5-7B-Instruct, CtrlCoT uses 30.7% fewer tokens while achieving accuracy 7.6 percentage points higher than the strongest baseline, demonstrating more efficient and reliable reasoning. Our code will be publicly available at https://github.com/fanzhenxuan/Ctrl-CoT.


[90] PathWise: Planning through World Model for Automated Heuristic Design via Self-Evolving LLMs cs.AI | cs.CLPDF

Oguzhan Gungordu, Siheng Xiong, Faramarz Fekri

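TL;DR: PathWise is a multi-agent reasoning framework built on self-evolving large language models for automated heuristic design in combinatorial optimization. It models heuristic generation as a sequential decision process over an entailment graph: a policy agent plans evolutionary actions, a world model agent generates heuristic rollouts, and critic agents provide reflections, moving from trial-and-error evolution to state-aware planning through reasoning.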

Details

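Motivation: Existing LLM-based automated heuristic design frameworks rely on fixed evolutionary rules and static prompt templates, which leads to myopic heuristic generation, redundant evaluations, and little reasoning about how new heuristics should be derived. PathWise aims to address these problems.

Result: Experiments across diverse combinatorial optimization problems show that PathWise converges faster to better heuristics, generalizes well across different LLM backbones, and scales to larger problem sizes.

Insight: The novelty is formalizing heuristic generation as a sequential decision process over an entailment graph and introducing multi-agent collaboration (policy, world model, critic) for state-aware planning, which goes beyond the traditional trial-and-error evolutionary paradigm and improves reasoning ability and search efficiency.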

Abstract: Large Language Models (LLMs) have enabled automated heuristic design (AHD) for combinatorial optimization problems (COPs), but existing frameworks’ reliance on fixed evolutionary rules and static prompt templates often leads to myopic heuristic generation, redundant evaluations, and limited reasoning about how new heuristics should be derived. We propose a novel multi-agent reasoning framework, referred to as Planning through World Model for Automated Heuristic Design via Self-Evolving LLMs (PathWise), which formulates heuristic generation as a sequential decision process over an entailment graph serving as a compact, stateful memory of the search trajectory. This approach allows the system to carry forward past decisions and reuse or avoid derivation information across generations. A policy agent plans evolutionary actions, a world model agent generates heuristic rollouts conditioned on those actions, and critic agents provide routed reflections summarizing lessons from prior steps, shifting LLM-based AHD from trial-and-error evolution toward state-aware planning through reasoning. Experiments across diverse COPs show that PathWise converges faster to better heuristics, generalizes across different LLM backbones, and scales to larger problem sizes.


[91] Harder Is Better: Boosting Mathematical Reasoning via Difficulty-Aware GRPO and Multi-Aspect Question Reformulation cs.AI | cs.CLPDF

Yanqi Dai, Yuxiang Ji, Xiao Zhang, Yong Wang, Xiangxiang Chu

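TL;DR: This paper proposes the MathForge framework, which combines a Difficulty-Aware Group Policy Optimization (DGPO) algorithm with a Multi-Aspect Question Reformulation (MQR) strategy to improve, from both the algorithmic and data perspectives, how large models handle hard questions in mathematical reasoning, significantly raising overall performance.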

Details

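Motivation: Existing reinforcement learning methods based on verifiable rewards lack systematic emphasis on more challenging questions at both the algorithmic and data levels, which limits improvement of the model's underdeveloped capabilities.

Result: Extensive experiments show that MathForge significantly outperforms existing methods on various mathematical reasoning tasks.

Insight: The contributions are: 1) algorithmically, difficulty-balanced group advantage estimation and difficulty-aware question-level weighting correct the implicit imbalance in GRPO and prioritize harder questions; 2) on the data side, Multi-Aspect Question Reformulation systematically increases intrinsic difficulty while preserving the original answer; 3) the two form a synergistic loop that jointly pushes the boundary of model capability.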

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) offers a robust mechanism for enhancing mathematical reasoning in large models. However, we identify a systematic lack of emphasis on more challenging questions in existing methods from both algorithmic and data perspectives, despite their importance for refining underdeveloped capabilities. Algorithmically, widely used Group Relative Policy Optimization (GRPO) suffers from an implicit imbalance where the magnitude of policy updates is lower for harder questions. Data-wise, augmentation approaches primarily rephrase questions to enhance diversity without systematically increasing intrinsic difficulty. To address these issues, we propose a two-dual MathForge framework to improve mathematical reasoning by targeting harder questions from both perspectives, which comprises a Difficulty-Aware Group Policy Optimization (DGPO) algorithm and a Multi-Aspect Question Reformulation (MQR) strategy. Specifically, DGPO first rectifies the implicit imbalance in GRPO via difficulty-balanced group advantage estimation, and further prioritizes harder questions by difficulty-aware question-level weighting. Meanwhile, MQR reformulates questions across multiple aspects to increase difficulty while maintaining the original gold answer. Overall, MathForge forms a synergistic loop: MQR expands the data frontier, and DGPO effectively learns from the augmented data. Extensive experiments show that MathForge significantly outperforms existing methods on various mathematical reasoning tasks. The code and augmented data are all available at https://github.com/AMAP-ML/MathForge.
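
The implicit imbalance mentioned in the abstract can be explored with the commonly used GRPO advantage, which standardizes rewards within each question's rollout group. The sketch below assumes binary rewards and uses the sum of absolute advantages as a rough proxy for update magnitude; both are assumptions made for illustration, and the code does not implement DGPO. The names `grpo_advantages` and `total_abs_advantage` are hypothetical.

```python
import math


def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: standardize rewards within one group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]


def total_abs_advantage(num_correct, group_size):
    """Proxy for update magnitude on a question with a given pass count."""
    rewards = [1.0] * num_correct + [0.0] * (group_size - num_correct)
    return sum(abs(a) for a in grpo_advantages(rewards))


if __name__ == "__main__":
    group_size = 8
    for num_correct in range(group_size + 1):
        proxy = total_abs_advantage(num_correct, group_size)
        print(f"{num_correct}/{group_size} correct -> {proxy:.3f}")
```

With binary rewards and pass rate p, the proxy works out to 2G·sqrt(p(1-p)) for group size G, so it peaks at p = 0.5 and shrinks as the pass rate approaches 0 or 1. Under this proxy, a question solved in one of eight rollouts receives less total weight than one solved in four of eight; whether this matches the paper's exact analysis is not confirmed here.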