cs.CL

[1] ICG: Improving Cover Image Generation via MLLM-based Prompting and Personalized Preference Alignment cs.CL

Zhipeng Bian, Jieming Zhu, Qijiong Liu, Wang Lin, Guohao Cai

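TL;DR: This paper proposes ICG, a framework that improves cover image generation through multimodal large language model (MLLM)-based prompting and personalized preference alignment. ICG uses meta tokens to extract semantic features from item titles and reference images, refines them with user embeddings, and injects the resulting personalized context into a diffusion model. A multi-reward learning strategy combines public aesthetic and relevance rewards with a personalized preference model trained on user behavior, enabling end-to-end training.

Motivation: Cover images are critical for boosting user engagement on digital platforms, yet personalized cover generation remains underexplored. Existing methods rely on handcrafted prompts and disjointed modules, and lack effective integration of personalization and context.

Result: Experiments show that ICG significantly improves image quality, semantic fidelity, and personalization, leading to stronger user appeal and better offline recommendation accuracy in downstream tasks. As a plug-and-play adapter, the framework is compatible with common checkpoints and requires no ground-truth labels during optimization.

Insight: Key contributions: MLLM-based prompting that produces semantically rich prompts; personalized context injection via meta tokens and user embeddings; a multi-reward learning strategy that combines public and personalized rewards for label-free optimization; and an adapter that bridges the MLLM and the diffusion model to support end-to-end training.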

Abstract: Recent advances in multimodal large language models (MLLMs) and diffusion models (DMs) have opened new possibilities for AI-generated content. Yet, personalized cover image generation remains underexplored, despite its critical role in boosting user engagement on digital platforms. We propose ICG, a novel framework that integrates MLLM-based prompting with personalized preference alignment to generate high-quality, contextually relevant covers. ICG extracts semantic features from item titles and reference images via meta tokens, refines them with user embeddings, and injects the resulting personalized context into the diffusion model. To address the lack of labeled supervision, we adopt a multi-reward learning strategy that combines public aesthetic and relevance rewards with a personalized preference model trained from user behavior. Unlike prior pipelines relying on handcrafted prompts and disjointed modules, ICG employs an adapter to bridge MLLMs and diffusion models for end-to-end training. Experiments demonstrate that ICG significantly improves image quality, semantic fidelity, and personalization, leading to stronger user appeal and offline recommendation accuracy in downstream tasks. As a plug-and-play adapter bridging MLLMs and diffusion models, ICG is compatible with common checkpoints and requires no ground-truth labels during optimization.


[2] OralAgent: Integrating Reasoning, Tools, and Knowledge for Interactive Dental Image Analysis cs.CL | cs.CV | cs.MA

Jing Hao, Siyuan Dai, Yongxin Zhang, Yuci Liang, Jiamin Wu

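TL;DR: This paper presents OralAgent, the first AI agent specialized for dentistry, which unifies multimodal reasoning, tool-based decision-making, and knowledge retrieval within an end-to-end automated framework. The agent integrates 22 visual analysis tools and 368 classical dental textbooks, enabling autonomous reasoning, planning, tool use, knowledge retrieval, and multi-step workflow execution. The paper also introduces OralCorpus, a large-scale bilingual text resource, and OralQA-ZH, a Chinese multiple-choice benchmark. Experiments show that OralAgent achieves state-of-the-art performance on multiple benchmarks.

Motivation: Existing dental AI models are typically designed for specific tasks and a single imaging modality, and this isolated design limits their practical use in real clinical workflows. The paper addresses this by building a unified agent that integrates reasoning, tools, and knowledge to support more practical clinical workflows.

Result: Extensive experiments on the MMOral-Uni, MMOral-OPG, and OralQA-ZH benchmarks show that OralAgent achieves state-of-the-art performance, demonstrating its effectiveness, interpretability, and adaptability in real-world clinical settings.

Insight: The main contribution is the first dental-specialized end-to-end AI agent framework integrating multimodal reasoning, tool calling, and knowledge retrieval. Viewed objectively, its integration of a large-scale domain text resource (OralCorpus) and its construction of a multidisciplinary dental knowledge benchmark (OralQA-ZH) offer a useful reference for building domain-specific AI agents.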

Abstract: Dental image analysis plays a pivotal role in supporting accurate diagnosis and treatment planning in oral healthcare. Although recent advances have produced dental AI models for specific tasks and individual imaging modalities, their isolated designs limit practical use in real-world clinical workflows. In this paper, we present OralAgent, the first dental-specialized AI agent that unifies multimodal reasoning, tool-based decision-making, and knowledge-grounded retrieval within an end-to-end automated framework. It integrates 22 visual analysis tools and 368 widely-used classical dental textbooks, enabling autonomous reasoning, planning, tool use, knowledge retrieval, and multi-step workflow execution. Furthermore, we introduce OralCorpus, a large-scale, high-quality bilingual textual resource containing 134.8M tokens curated for dental retrieval-augmented generation (RAG). To evaluate models’ multidisciplinary dental knowledge, we construct OralQA-ZH, a Chinese multiple-choice question benchmark consisting of 798 items across eleven oral subspecialties. Extensive experiments demonstrate that OralAgent achieves state-of-the-art performance on the MMOral-Uni, MMOral-OPG, and OralQA-ZH benchmarks, highlighting its effectiveness, interpretability, and adaptability in real-world clinical settings. The code and models are publicly available at https://github.com/isjinghao/OralAgent.


[3] Bridging the Stability-Expressivity Gap: Synthetic Data Scaling and Preference Alignment for Low-Resource Spoken Language Models cs.CL | cs.AI

Yizhong Geng, Yanliang Li, Jinghan Yang, Tianhan Jiang, Boxun An

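TL;DR: This paper addresses the stability-expressivity trade-off caused by scaling low-resource spoken language models (SLMs) with synthetic data, and proposes two self-alignment frameworks. The study finds that synthetic data improves phonetic accuracy but suppresses prosodic variability, leading to a collapse of expressivity (Synthetic Erosion). To bridge the gap, the authors propose Disentanglement-Guided Self-Alignment (DGSA) and Temperature-Driven Self-Critique (TDSC) to recover expressivity and stabilize generation. The approach outperforms commercial systems on low-resource languages, including Lao, and enables the first zero-shot voice cloning for Lao.

Motivation: The paper tackles the fundamental trade-off between stability and expressivity that arises when low-resource SLMs rely on synthetic data for scaling: synthetic data raises phonetic accuracy while suppressing prosodic diversity, which degrades expressivity (Synthetic Erosion).

Result: The proposed approach outperforms strong commercial systems, including ElevenLabs and Gemini Pro, and enables the first zero-shot voice cloning capability for Lao, demonstrating its effectiveness on low-resource languages.

Insight: The contribution lies in identifying and formalizing the Stability-Expressivity Gap in synthetic data scaling and in proposing two self-alignment frameworks: DGSA exploits prosody-timbre separation to recover expressivity for complex languages, and TDSC stabilizes generation through automated exploration and filtering when authentic references are extremely limited. Together they offer a reusable recipe for low-resource SLM training.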

Abstract: Spoken Language Models (SLMs) have emerged as a promising paradigm for speech synthesis by bypassing explicit grapheme-to-phoneme pipelines. However, their effectiveness in low-resource languages remains fundamentally limited by the scarcity of transcribed speech. In practice, synthetic data has become the primary strategy for scaling SLMs in such settings, providing reliable phonetic supervision when real data is insufficient. In this work, we show that this reliance introduces a fundamental trade-off, which we term the Stability-Expressivity Gap: while synthetic data improves phonetic accuracy, it progressively suppresses prosodic variability, ultimately leading to a collapse of expressivity (Synthetic Erosion). To bridge this gap, we propose two self-alignment frameworks. Disentanglement-Guided Self-Alignment (DGSA) recovers expressivity for complex languages by exploiting prosody-timbre separation. For regimes where authentic references are exceptionally limited, Temperature-Driven Self-Critique (TDSC) stabilizes generation through automated exploration and filtering. Our approach outperforms strong commercial systems, including ElevenLabs and Gemini Pro, and enables the first zero-shot voice cloning capability for Lao.


[4] PAST2HARM: A Simple Adaptive Past Tense Attack for Jailbreaking Multimodal AI cs.CL

Snehasis Mukhopadhyay

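TL;DR: This paper proposes PAST2HARM, a simple adaptive jailbreak framework targeting multimodal text-to-image models. The framework uses past-tense reformulation to evade model safeguards and strengthens the attack along two dimensions, breadth (temporal deepening) and depth (iterative escalation). It achieves high success rates on several state-of-the-art models and exposes fundamental brittleness in current multimodal AI safeguards.

Motivation: Jailbreak attacks on multimodal AI systems, especially text-to-image models, are underexplored, even though unsafe image generation can have more severe consequences than unsafe text and existing defenses are relatively immature.

Result: In a black-box, gradient-free setting on Gemini Nano Banana Pro, GPT Image 2, and SD XL, the attack success rates reach 83%, 67%, and 100%, respectively, with cross-model transfer success rates above 50%. The attack elicits diverse harmful content, including explicit sexual content, political disinformation, and hate speech.

Insight: The contribution is the systematic exploitation of past-tense reformulation as a linguistic vulnerability, together with a two-dimensional attack framework of temporal deepening and iterative escalation that erodes model safety boundaries. Viewed objectively, the method is simple and effective; it exposes the brittleness of current multimodal alignment training with respect to temporal framing and mid-conversation vulnerability windows, and provides a benchmark for red teaming and alignment research.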

Abstract: Jailbreak attacks on multimodal AI systems remain underexplored, even though unsafe image generation can have more severe consequences than unsafe text and current defenses are relatively immature. We introduce PAST2HARM, a simple yet effective adaptive jailbreak framework that bypasses refusal training in state of the art multimodal text to image models. Building on prior findings that past tense reformulations can evade safeguards, PAST2HARM systematically exploits this vulnerability in multimodal generative AI. We characterize the attack along two dimensions. First, breadth: through temporal deepening, the framework incrementally strengthens historical anchoring and archival cues, eroding refusal boundaries across models with varying alignment strength. Second, depth: via iterative escalation after initial compliance, we probe the upper bound of harmful generation, measuring severity using a scalar severity jailbreak metric evaluated by a language model acting as a judge. We find that mid conversation turns form peak vulnerability windows, where harmfulness increases before plateauing and eventually undergoing semantic inversion. We evaluate PAST2HARM on three models Gemini Nano Banana Pro, GPT Image 2, and SD XL achieving attack success rates of 83 percent, 67 percent, and 100 percent in a black box, gradient free setting. Adversarial prompts also transfer across models, with cross model success rates above 50 percent. The attack elicits diverse harmful outputs, including explicit sexual content, political disinformation, historical denial narratives, hate speech, and self harm glorification. We further release a curated benchmark of prompts, reformulations, and outputs as a resource for red teaming and alignment. Our results expose fundamental brittleness in current safeguards and highlight the need for stronger multimodal safety training.


[5] The Future of Facts: Tracing the Factual Generation-Verification Gap cs.CL | cs.AI | cs.LG

Tim R. Davidson, Anja Surina, Caglar Gulcehre

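TL;DR: This paper studies the gap between language models' ability to generate and to verify factual knowledge (the GV-gap). By analyzing different training phases (acquisition, continual learning, and updating) and model scales, it finds that verification is always learned before generation and is more robust to continual learning, and that factual updates can leave a model in a "multi-verse" state in which it verifies both the old and the new answer as correct.

Motivation: Language models have become the primary interface to factual knowledge, yet they often generate outputs less reliably than they verify them. This generation-verification gap is central to self-improvement and reasoning, but its dynamics on factual knowledge remain unclear.

Result: Experiments on four open-source model families at two scales each show that verification is learned earlier and is more robust to continual learning, and that after a factual update a model may judge both the old and the new answer to be correct. Natural experiments on frontier models reproduce these dynamics at scale and reveal residual verification biases on well-covered facts.

Insight: The contribution is a systematic tracing of the training mechanisms behind the factual generation-verification gap, distinguishing it from its computational and aesthetic counterparts, and revealing verification-first learning, stronger robustness of verification, and the post-update "multi-verse" phenomenon. This offers a new perspective on how models handle factual knowledge.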

Abstract: Language models are becoming the default interface to factual knowledge, yet they often verify outputs more reliably than they generate them. This generation-verification gap (GV-gap) underlies many recent advances in self-improvement and reasoning, but its dynamics on factual knowledge specifically remain poorly understood. We focus on the training mechanisms underlying factual GV-gaps, distinguishing them from their computational and aesthetic counterparts. We trace generation and verification capabilities through three training phases (acquisition, continual learning, and updating) across four open-source model families at two scales each. Three findings recur across models: (i) verification is consistently learned before generation; (ii) verification is more robust to continual learning than generation; and (iii) factual updates can leave models in a “multi-verse” state, simultaneously verifying both old and new answers as correct. Natural experiments on frontier models reproduce these dynamics at scale and reveal residual verification biases on well-covered facts.


[6] Simorgh at SemEval-2026 task 7: Region-Aware Hybrid Retrieval for Low-Resource Cultural Reasoning in Multilingual Question Answering cs.CL

Hadi Bayrami Asl Tekanlou, Mahdi Bakhtiyarzadeh, Jafar Razmara

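TL;DR: This paper targets cultural reasoning in low-resource languages and proposes a region-aware hybrid retrieval method that combines BM25 lexical matching with dense semantic similarity and adds regional weighting heuristics to improve the retrieval of culturally relevant answers in multilingual question answering. The method is evaluated on the multilingual BLEnD benchmark, where retrieval-augmented prompting improves the performance of a quantized Qwen3-14B model.

Motivation: Large language models perform well on general-domain reasoning tasks, but still struggle with culturally grounded knowledge in low-resource languages with limited digital text. Retrieval augmentation therefore needs to be improved to make cross-lingual cultural reasoning more stable.

Result: Experiments on the multilingual BLEnD benchmark show that hybrid retrieval improves cross-lingual stability over pure parametric inference, but imbalanced training data still leaves notable performance gaps between languages.

Insight: The contribution is combining region-aware hybrid retrieval with heuristic weighting to improve the relevance of retrieved culturally relevant documents. Viewed objectively, the method partially mitigates the cultural reasoning problem in low-resource languages, but data imbalance remains the main limiting factor for retrieval augmentation.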

Abstract: Although Large Language Models (LLMs) perform well on general-domain reasoning tasks, they may struggle with culturally grounded knowledge in languages with limited digital and textual data. In this paper, we investigate culturally grounded multiple-choice question answering with the BLEnD benchmark, which consists of a multilingual corpus of 30 languages and covers various socio-cultural domains, such as cuisine, sports, and family. We propose a region-aware hybrid retrieval approach that combines BM25 lexical matching and dense semantic similarity with regional weighting heuristics to improve the relevance of the answer. The retrieved documents are used to construct a structured prompt for the Qwen3-14B quantized model with logit-based deterministic answer selection. The experimental results show improvements to cross-lingual stability with the hybrid retrieval approach over pure parametric inference for culturally grounded question answering. However, there are still notable performance gaps between languages with more and less training data. This shows that retrieval augmentation does not entirely overcome the training data imbalance problem.
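
Illustrative sketch (not from the paper): the abstract describes fusing BM25 lexical scores with dense similarity and a regional weighting heuristic, but gives no formula. The Python below shows one plausible way such a fusion could look. The mixing weight `alpha`, the multiplicative `region_boost`, the min-max normalization, and the toy documents and two-dimensional "embeddings" are all assumptions made for illustration.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Okapi BM25 score of every tokenized document against the query."""
    n_docs = len(docs_tokens)
    avg_len = sum(len(doc) for doc in docs_tokens) / n_docs
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for doc in docs_tokens for term in set(doc))
    scores = []
    for doc in docs_tokens:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            tf = term_freq[term]
            if tf == 0:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avg_len))
        scores.append(score)
    return scores


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * c for a, c in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(c * c for c in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def min_max(values):
    """Rescale scores to [0, 1] so lexical and dense scores are comparable."""
    lo, hi = min(values), max(values)
    return [0.0 if hi == lo else (v - lo) / (hi - lo) for v in values]


def hybrid_rank(query_tokens, query_vec, query_region, docs, alpha=0.5, region_boost=1.2):
    """Rank documents by a weighted lexical/dense score, boosted on region match.

    alpha and region_boost are hypothetical knobs, not values from the paper.
    """
    lexical = min_max(bm25_scores(query_tokens, [d["tokens"] for d in docs]))
    dense = min_max([cosine(query_vec, d["vec"]) for d in docs])
    fused = []
    for doc, lex, den in zip(docs, lexical, dense):
        score = alpha * lex + (1 - alpha) * den
        if doc["region"] == query_region:
            score *= region_boost  # assumed form of the regional heuristic
        fused.append((score, doc["id"]))
    return [doc_id for _, doc_id in sorted(fused, reverse=True)]


# Toy corpus with made-up embeddings and region tags.
DOCS = [
    {"id": "a", "tokens": ["tea", "ceremony", "japan"], "vec": [1.0, 0.0], "region": "JP"},
    {"id": "b", "tokens": ["tea", "time", "britain"], "vec": [0.9, 0.1], "region": "UK"},
    {"id": "c", "tokens": ["football", "league"], "vec": [0.0, 1.0], "region": "UK"},
]
ranking = hybrid_rank(["tea"], [1.0, 0.0], "UK", DOCS)
```

In this toy setup the regional boost is what lifts the UK tea document above the otherwise slightly better-matching Japanese one; setting `region_boost=1.0` removes that effect.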


[7] Can Hallucinations Be Useful? Solving Multi-Hop Questions With SLMs By Chaining System-I/II Reasoning cs.CL

Saptarshi Sengupta, Suhang Wang

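TL;DR: This paper proposes a cognitively inspired framework for handling hallucination in small language models (SLMs) on complex multi-step reasoning problems. It adopts an "answer first, reason later" strategy: the model first gives a quick initial answer (System-I thinking), and then thinks more deeply (System-II thinking) over evidence retrieved from a knowledge source using that hypothesis, turning hallucination into an advantage.

Motivation: SLMs are fast and have low hardware demands, but hallucinate more often than large language models (LLMs). This hurts their ability to solve complex multi-step reasoning problems, because early mistakes cascade into the final response. Existing work usually adopts a "think first, then retrieve" strategy to reduce hallucination, but the authors argue that this strategy is not always necessary.

Result: By combining System-I and System-II thinking, the method outperforms prior work that follows the traditional think-first route on several multi-step question-answering benchmarks.

Insight: The contribution is treating SLM hallucinations as a potential asset and proposing the inverted "answer first, reason later" strategy. It draws on dual-process theory from cognitive psychology: a fast intuitive response (System-I) produces an initial hypothesis, and slow analytical reasoning (System-II) verifies and corrects it, so that hallucinations are used to guide retrieval and refine the answer.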

Abstract: Recently, there has been increased interest in Small Language Models (SLMs), which are fast, show good performance, and have lower hardware demands than large language models (LLMs). However, SLMs hallucinate more frequently than LLMs, impacting their ability to solve complex multi-step reasoning problems as early mistakes cascade to the final response. To address this, existing works adopt a think-first strategy followed by iterative retrieval to reduce hallucination. We argue that the think-first strategy is not always necessary as we find that: (i) SLMs are often accurately confident in their initial answer and, (ii) hallucinations can actually be beneficial for homing in on the true answer. As such, we position our work as an inversion of this strategy, i.e., answer first, reason later. We propose a cognitively-inspired framework where the model is first allowed to quickly answer the question (System-I (zero-shot)) and then resorts to deeper thinking (System-II) based on evidence retrieved from a knowledge source using the initial hypothesis. By combining System-I and System-II style thinking, we show that our method can outperform prior work that takes the traditional think-first route on various multi-step question-answering benchmarks.
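
Illustrative sketch (not from the paper): the control flow of "answer first, reason later" can be outlined in a few lines of Python. The model calls are replaced by hypothetical stand-ins (`toy_quick_answer`, `toy_deliberate`), the retriever is a simple word-overlap ranker, and the three-passage knowledge base is invented for the example. Only the ordering of the steps reflects the abstract.

```python
import re


def tokens(text):
    """Lowercased word tokens, used by the toy overlap-based retriever."""
    return set(re.findall(r"\w+", text.lower()))


def retrieve(query, knowledge_base, top_k=2):
    """Rank passages by how many word tokens they share with the query."""
    query_tokens = tokens(query)
    ranked = sorted(knowledge_base, key=lambda p: -len(tokens(p) & query_tokens))
    return ranked[:top_k]


def answer_first_reason_later(question, quick_answer, deliberate, knowledge_base):
    """System-I guess first, then System-II reasoning over retrieved evidence."""
    hypothesis = quick_answer(question)  # fast, possibly hallucinated
    # The hypothesis is appended to the query so it can steer retrieval.
    evidence = retrieve(f"{question} {hypothesis}", knowledge_base)
    return deliberate(question, hypothesis, evidence)


# Toy knowledge source (assumption: stands in for a real corpus).
KNOWLEDGE_BASE = [
    "Inception was directed by Christopher Nolan.",
    "Christopher Nolan was born in London.",
    "New York is a city in the United States.",
]


def toy_quick_answer(question):
    """Hypothetical System-I output: right entity, hallucinated birthplace."""
    return "Christopher Nolan, born in New York"


def toy_deliberate(question, hypothesis, evidence):
    """Hypothetical System-II step: trust retrieved evidence over the guess."""
    for passage in evidence:
        match = re.search(r"was born in (\w+)", passage)
        if match:
            return match.group(1)
    return hypothesis


QUESTION = "Where was the director of Inception born?"
final_answer = answer_first_reason_later(
    QUESTION, toy_quick_answer, toy_deliberate, KNOWLEDGE_BASE
)
```

Here the initial guess gets the birthplace wrong but names the right person, and that entity is enough to pull the birthplace passage to the top of the retrieved evidence.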


[8] Chain-based Adaptive Reconfiguration Over Lattices for Hallucination Reduction cs.CL | cs.IR

Joan Vendrell Gallart, Solmaz Kia, Russell Bent, Michael Grosskopf

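TL;DR: This paper proposes CAROL (Chain-based Adaptive Reconfiguration Over Lattices), a probabilistic framework for reducing hallucinations of large language models at test time. CAROL defines a semantic uncertainty measure based on the consistency between generated responses and a trusted context, casts hallucination mitigation as a string-submodular optimization problem over a lattice of textual sequences, and iteratively refines outputs with a Markov chain accept-reject process that has provable convergence and near-optimality guarantees.

Motivation: Existing approaches to hallucination in LLM-generated content mostly rely on token-level uncertainty. CAROL instead uses a semantic-level uncertainty measure to unify hallucination detection and mitigation and to improve the reliability and interpretability of model outputs.

Result: On question answering and multi-agent reasoning benchmarks, CAROL significantly reduces hallucinations and improves reliability and interpretability compared with likelihood-based and retrieval-augmented baselines, while maintaining competitive computational efficiency.

Insight: The contribution is formalizing hallucination mitigation as a string-submodular optimization problem and introducing a Markov chain process with provable guarantees for iterative refinement. Viewed objectively, defining uncertainty through semantic consistency rather than token probabilities yields a more robust and interpretable framework for hallucination reduction.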

Abstract: We introduce CAROL (Chain-based Adaptive Reconfiguration Over Lattices), a probabilistic framework for test-time hallucination reduction in large language models. Rather than relying on token-level uncertainty, CAROL defines a semantic uncertainty measure based on the consistency between generated responses and a trusted context, inducing a string-submodular objective over a lattice of textual sequences. This formulation enables hallucination mitigation to be cast as a Markov chain accept-reject process with provable convergence and near-optimality guarantees, allowing the model to iteratively refine outputs toward semantic consistency. By operating at the level of meaning, CAROL unifies hallucination detection and mitigation within a single framework. Empirical results on question answering and multi-agent reasoning benchmarks show that CAROL significantly reduces hallucinations and improves reliability and interpretability compared to likelihood-based and retrieval-augmented baselines, while maintaining competitive computational efficiency.


[9] TRACES: Proactive Safety Auditing for Multi-Turn LLM Agents via Trajectory-State Modeling cs.CL | cs.LG

Jiaqian Li, Yanshu Li, Boxuan Zhang, Ruixiang Tang, Kuan-Hao Huang

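TL;DR: This paper proposes TRACES, a representation-based proactive safety auditing method for multi-turn LLM agents. It models trajectory risk states learned from the hidden representations of an observer LLM, so that risk drift can be predicted from intermediate steps before the final unsafe outcome appears. The method is trained with weak trajectory-level supervision, without costly step-level annotation, and improves safety prediction and proactive risk identification on multiple agent safety benchmarks.

Motivation: LLM agents operate through multi-turn tool use and environment interaction, and safety risks often arise in intermediate steps long before they surface in the final outcome. Reactive, post-hoc auditing frequently misses the chance to flag risks as they unfold, so a proactive auditing method is needed to identify risks early.

Result: Across multiple agent safety benchmarks, TRACES improves both full-trajectory safety prediction and proactive risk identification. Analyses suggest that the learned risk states can help train safer agents.

Insight: The contribution is a proactive auditing framework based on trajectory-state modeling: it induces latent mechanism features from the hidden representations of an observer LLM and models their temporal evolution, producing dense prefix-level risk estimates with only weak trajectory-level supervision. This avoids the cost and ambiguity of step-level annotation and offers a new direction for long-horizon agent safety.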

Abstract: LLM agents increasingly operate through multi-turn tool use and environment interaction, where safety risks often emerge from intermediate steps long before they surface in the final outcome. Reactive auditing is therefore insufficient: post-hoc diagnosis frequently misses the chance to flag risks while they are unfolding. We propose TRACES, a representation-based proactive auditor that learns prefix-level trajectory risk states from the hidden representations of an observer LLM. TRACES induces latent mechanism features from step representations and models their temporal evolution to estimate whether a partial trajectory is drifting toward unsafe behavior. To sidestep the cost and ambiguity of step-level risk annotation, TRACES is trained with weak trajectory-level supervision while still producing dense prefix-level risk estimates. Across multiple agent safety benchmarks, TRACES improves both full-trajectory safety prediction and proactive risk discrimination. Our analyses further suggest that these risk states can help train a safer agent, highlighting the broader potential of proactive auditing for long-horizon agent safety.


[10] Beyond Input Understanding: Diagnosing Multilingual Mathematical Reasoning with Directed Acyclic Trace Graphs cs.CL

Jiaqiao Zhang, Zhoujun Li, Raoyuan Zhao, Jian Lan, Thomas Seidl

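TL;DR: This paper proposes DATG (Directed Acyclic Trace Graph), a framework for diagnosing the performance gap of large reasoning models in multilingual mathematical reasoning. It finds that the gap stems not only from poor understanding of non-English problem statements but also from the language in which reasoning is executed. By mapping reasoning traces to language-independent mathematical anchors and dependencies, DATG quantifies how non-English reasoning falls short in covering key mathematical nodes and respecting dependencies. Based on this diagnosis, the authors propose two simple test-time controls (Loop-Retry and Formula-Retry) that target reasoning performance in low-resource languages.

Motivation: Large reasoning models perform strongly on mathematical reasoning in English but are much less reliable in many low- and medium-resource languages. The conventional view attributes this to a failure to understand non-English problem statements; this paper investigates whether language also directly affects reasoning execution itself.

Result: Experiments on the Qwen3 series across 12 languages show that non-English reasoning often suffers from reduced anchor coverage and weaker dependency fidelity, especially in low-resource languages. The two test-time controls derived from the DATG diagnosis (Loop-Retry and Formula-Retry) consistently improve target-language reasoning performance in low-resource languages.

Insight: The core contribution is DATG, a language-independent framework that decomposes the reasoning process into mathematical anchors and a dependency graph, making it possible to diagnose failure patterns in multilingual reasoning precisely rather than only evaluating input understanding. The targeted interventions built on this diagnosis (Loop-Retry and Formula-Retry) are a simple and effective test-time improvement and offer a new approach to language bias in multilingual reasoning.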

Abstract: Large reasoning models (LRMs) achieve strong mathematical reasoning performance in English, but remain much less reliable in many low- and medium-resource languages. This gap is often explained as a failure to understand non-English problem statements. We show that this view is incomplete: even when the problem is given in English, controlling the model’s reasoning language can substantially reduce accuracy, suggesting that language also affects reasoning execution itself. To study this effect, we introduce DATG, a Directed Acyclic Trace Graph framework that maps reasoning traces to language-independent mathematical anchors and dependencies. This allows us to align target-language traces with reference DAGs and measure whether they cover required mathematical nodes, respect dependency edges, and avoid harmful mathematical actions. Experiments on the Qwen3 series across 12 languages show that non-English reasoning often suffers from reduced anchor coverage and weaker dependency fidelity, especially in low-resource languages. Motivated by this diagnosis, we propose Loop-Retry and Formula-Retry, two simple test-time controls targeting DATG-exposed failure modes, and show that they consistently improve target-language reasoning performance in low-resource languages.
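
Illustrative sketch (not from the paper): the abstract names anchor coverage and dependency fidelity but does not define them. The Python below shows one simple way such quantities could be computed once a reasoning trace has been mapped to an ordered list of anchors. Both definitions, and the toy reference DAG, are assumptions for illustration rather than the authors' metrics.

```python
def anchor_coverage(trace, ref_nodes):
    """Fraction of reference anchors that appear anywhere in the trace."""
    return len(set(trace) & set(ref_nodes)) / len(ref_nodes)


def dependency_fidelity(trace, ref_edges):
    """Fraction of reference edges (u, v) where u first appears before v.

    An edge counts as violated if either endpoint is missing from the trace.
    This is an assumed definition, not the one used in the paper.
    """
    first_seen = {}
    for position, anchor in enumerate(trace):
        first_seen.setdefault(anchor, position)
    respected = sum(
        1
        for u, v in ref_edges
        if u in first_seen and v in first_seen and first_seen[u] < first_seen[v]
    )
    return respected / len(ref_edges)


# Toy reference DAG for a word problem: price and quantity feed the total,
# and the total feeds the discounted amount.
REF_NODES = {"unit_price", "quantity", "total_cost", "discount"}
REF_EDGES = {
    ("unit_price", "total_cost"),
    ("quantity", "total_cost"),
    ("total_cost", "discount"),
}

# A target-language trace that uses quantity too late and never applies the discount.
trace = ["unit_price", "total_cost", "quantity"]
coverage = anchor_coverage(trace, REF_NODES)
fidelity = dependency_fidelity(trace, REF_EDGES)
```

With these definitions the example trace covers three of the four reference anchors and respects one of the three reference edges.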


[11] ReverseMath: Answer Inversion for Scalable and Verifiable Mathematical Problem Generation cs.CL

Raoyuan Zhao, Yihong Liu, Yupei Du, Hinrich Schütze, Michael A. Hedderich

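TL;DR: This paper proposes ReverseMath, which generates scalable and verifiable math problems through answer inversion. The method masks a numerical value in the original problem and rewrites the problem with the original answer as a known condition, so that the masked value becomes the new answer. The study uses it for both evaluation and training: in evaluation, paired original/reversed problems reveal behavioral shifts such as memorization; in training, the automatically labeled problems serve as data augmentation for reinforcement learning and improve mathematical reasoning performance across multiple benchmarks.

Motivation: Existing mathematical reasoning benchmarks are mostly static and publicly exposed, making it hard to separate genuine reasoning from memorization. Meanwhile, manually constructing new math problems with reliable answers is costly.

Result: Experiments show that augmenting training with ReverseMath-generated data improves performance on multiple mathematical reasoning benchmarks (such as GSM8K and MATH), supporting its value as scalable training data. In evaluation, models sometimes fail on reversed problems or incorrectly output the original answer.

Insight: The contribution is generating verifiable problems automatically through structured answer inversion, which provides a new tool for assessing memorization tendencies and a scalable data augmentation method that improves training efficiency. Viewed objectively, it turns problem generation into a deterministic transformation, which ensures answer reliability and avoids manual annotation cost.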

Abstract: Mathematical reasoning benchmarks are vital for evaluating large language models (LLMs), but many are static and repeatedly exposed through public evaluation and training pipelines, making it difficult to separate genuine reasoning from memorization. Meanwhile, manually constructing new math problems with reliable answers remains costly. We introduce ReverseMath, a scalable method for generating new math problems through answer inversion. Given a problem and its answer, ReverseMath masks a numerical value in the original problem, treats the original answer as a known condition, and rewrites the problem so that the masked value becomes the new answer. The generated problem reverses the original input-output relation, making its answer known by construction. We study ReverseMath for both evaluation and training. For evaluation, paired original/reversed problems reveal substantial behavioral shifts: models sometimes fail on reversed problems and even incorrectly output the original answer, suggesting memorization-like behavior. For training, ReverseMath provides automatically labeled reversed problems as data augmentation for reinforcement learning (RL). Experiments show that including ReverseMath-generated data improves mathematical reasoning performance across multiple benchmarks, demonstrating its value as both an analysis tool and a scalable source of verifiable training data.
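
Illustrative sketch (not from the paper): answer inversion is easiest to see on a single templated problem. The Python below masks one number, supplies the original answer as a given, and asks for the masked number, so the new answer is known by construction. The template, the `Problem` class, and the `looks_memorized` check are hypothetical; in the paper the rewriting is described as a general method rather than a fixed template.

```python
from dataclasses import dataclass


@dataclass
class Problem:
    text: str
    answer: int


def make_original(count, price):
    """Forward problem: given count and unit price, ask for the total."""
    text = (
        f"A shop sells {count} apples at {price} dollars each. "
        "What is the total cost in dollars?"
    )
    return Problem(text, count * price)


def invert(count, price):
    """Reversed problem: hide the count, reveal the total, ask for the count."""
    original = make_original(count, price)
    text = (
        f"A shop sells some apples at {price} dollars each. "
        f"The total cost is {original.answer} dollars. "
        "How many apples does it sell?"
    )
    # The masked value becomes the answer, so no separate labeling is needed.
    return Problem(text, count)


def looks_memorized(model_answer, original, reversed_problem):
    """Flag the case where a model repeats the original answer on the reversed problem."""
    return model_answer == original.answer and model_answer != reversed_problem.answer


original = make_original(count=7, price=3)
reversed_problem = invert(count=7, price=3)
```

A model that answers the reversed problem with the original total rather than the masked count would be flagged by `looks_memorized`, which mirrors the memorization-like behavior the abstract reports.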


[12] UserHarness: Harnessing User Minds for Stronger Agent Theory-of-Mind cs.CL | cs.AI

Cheng Qian, Jiayu Liu, Heng Ji

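TL;DR: This paper proposes UserHarness, a framework that reframes Theory-of-Mind (ToM) reasoning as explicit user-mind reconstruction. By decomposing the user's mental state, its relation to the external environment, and the resulting actions, the framework lets an agent track what the user observes, believes, intends, and does. Across five benchmarks the method reaches up to 95.94% macro accuracy, clearly outperforming existing methods.

Motivation: Existing methods typically address ToM tasks with complex pipelines that model behavior indirectly, without explicitly reconstructing the user's mental state. This misses the core structure of the problem: users act on their beliefs, beliefs are updated through observations of the environment, beliefs and intentions jointly determine actions, and social reasoning often requires nested beliefs about what others believe or intend.

Result: Across five benchmarks, UserHarness reaches up to 95.94% macro accuracy, a relative improvement of more than 15% over existing inference methods and of about 20% over the strongest prompt-only method, reaching state-of-the-art performance.

Insight: The core contribution is explicitly recasting ToM tasks as user-mind reconstruction, capturing the underlying structure of the problem by modeling the dynamic relations among the user's beliefs, intentions, observations, and actions. This provides a promising foundation for more adaptive agent assistants and underscores the importance of reasoning from the roots of the user's mind.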

Abstract: Understanding what a user believes and intends is central to building effective agent assistants. This ability is often evaluated through Theory-of-Mind (ToM) tasks, where success requires reasoning from the user’s perspective. However, many existing approaches address ToM with complex pipelines that model behavior indirectly, without explicitly reconstructing the user’s mental state. This misses the core structure of the problem: users act based on their beliefs, which are updated through observations of the environment; beliefs and intentions jointly determine actions, which in turn change the environment; and social reasoning often requires nested beliefs about what others believe or intend. We propose UserHarness, a simple framework that reframes ToM reasoning as explicit user-mind reconstruction. UserHarness decomposes the user’s mental state, its relation to the external environment, and the actions that follow from it, enabling agents to track what the user observes, believes, intends, and does. Across five benchmarks, UserHarness reaches up to 95.94% macro accuracy, improving over existing inference methods by more than 15% relative and over the strongest prompt-only harness by about 20% relative. These results suggest that robust user understanding requires reasoning from the roots of the user’s mind, positioning user harnessing as a promising foundation for more adaptive future assistants.


[13] Reading or Guessing? Visual Grounding Failures of Vision-Language Models for OCR in Ancient Greek Editions cs.CL | cs.AI | cs.CV | cs.DL

Antonia Karamolegkou, Nicolas Angleraud, Benoît Sagot, Thibault Clérice

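TL;DR: This paper studies visual grounding failures of vision-language models (VLMs) on OCR of Ancient Greek critical editions. It finds that VLMs often remain fluent even when they produce wrong text, whereas traditional OCR engines produce local recognition noise. To analyze visual evidence during decoding, it introduces controlled image perturbations and token-level grounding measures based on conditional versus image-free decoding distributions, revealing that reliance on priors differs across models. The results show that fluent output is not necessarily visually grounded and call for interpretability-driven evaluation beyond aggregate accuracy.

Motivation: VLMs used for OCR may rely on language priors rather than visual evidence, generating plausible text that the image does not support. The paper tests this phenomenon on low-resource historical documents, specifically Ancient Greek critical editions.

Result: On Ancient Greek critical editions, VLMs remain fluent when wrong and produce plausible Greek substitutions, while traditional OCR produces local noise. Under character-level perturbations, VLMs diverge sharply from the perturbed ground truth, while traditional OCR stays comparatively faithful. Token-level analysis shows that an OCR-specialist model relies little on the image when it makes fluent lexical errors, whereas general-purpose VLMs remain conditioned on the visual input even when wrong. Post-OCR language-model correction improves several systems only by repairing text after generation, and decode-time interventions fail to reliably restore grounding.

Insight: Contributions include controlled image perturbations and token-level grounding measures for quantifying visual evidence, and an extension of prior evidence of OCR language-prior reliance to low-resource historical documents and a broader set of models. Viewed objectively, fluent output can mask visual grounding failures, so interpretability-driven evaluation is needed to confirm that models actually rely on the visual input.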

Abstract: Recent work has shown that Vision-Language Models (VLMs) used for optical character recognition (OCR) can generate plausible but visually unsupported text, suggesting reliance on language priors. Comparing open-weight VLMs with traditional OCR baselines on low-resource Ancient Greek critical editions, we show that VLM errors often remain fluent even when wrong, producing plausible Greek substitutions where traditional engines produce local recognition noise. To analyze visual evidence during decoding, we introduce controlled image perturbations and token-level grounding measures based on conditional versus image-free decoding distributions. Under character-level perturbations, VLMs diverge sharply from the perturbed ground truth while traditional OCR remains comparatively faithful; however, token-level analysis shows that prior reliance is model-specific: in an OCR-specialist model, fluent lexical errors are produced with little reliance on the image, whereas general-purpose VLMs remain conditioned on the visual input even when wrong. Decode-time interventions fail to reliably restore grounding, while post-OCR language-model correction improves several systems only by repairing text after generation. Our results extend prior evidence of OCR language-prior reliance to low-resource historical documents and a broader set of models, showing that fluent output is not necessarily visually grounded and motivating interpretability-driven evaluation beyond aggregate accuracy.
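
Illustrative sketch (not from the paper): the abstract says the grounding measures compare image-conditioned and image-free decoding distributions but does not name the statistic. The Python below uses per-token KL divergence as one plausible choice. The use of KL, the function names, and the two-symbol toy distributions are assumptions.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two discrete distributions over the same vocabulary."""
    return sum(
        p_i * math.log((p_i + eps) / (q_i + eps))
        for p_i, q_i in zip(p, q)
        if p_i > 0
    )


def grounding_scores(conditional_dists, image_free_dists):
    """Per-token divergence between decoding with and without the image.

    A score near zero suggests the token would have been predicted the same
    way without looking at the page, i.e. it is driven by the language prior.
    """
    return [
        kl_divergence(p, q)
        for p, q in zip(conditional_dists, image_free_dists)
    ]


# Toy next-token distributions over a two-symbol vocabulary at two positions.
with_image = [[0.9, 0.1], [0.5, 0.5]]
without_image = [[0.5, 0.5], [0.5, 0.5]]
scores = grounding_scores(with_image, without_image)
```

In the toy example the first position changes substantially when the image is removed, while the second does not change at all, so only the first would count as visually grounded under this measure.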


[14] Escape the Language Prior: Mitigating Late-Stage Modality Collapse in Audio Reasoning via Modality-Aware Policy Optimization cs.CL

Cihan Xiao, Yiwen Shao, Chenxing Li, Xiang He, Zhenwen Liang

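TL;DR: This paper proposes Modality-Aware Policy Optimization (MAPO), a dual-branch reinforcement learning framework that addresses late-stage modality collapse in audio and omni-modal large language models under standard reinforcement learning post-training. It uses a modality relevance mask to concentrate the policy gradient on modality-critical tokens and adds an auxiliary attention loss branch to sustain cross-modal grounding, improving long-horizon reasoning fidelity and multimodal instruction following.

Motivation: Standard reinforcement learning post-training algorithms such as GRPO apply uniform policy gradients to all tokens and ignore their unequal dependence on the non-text source modality. For audio and omni-modal large language models, this leads to modality collapse late in long chain-of-thought generation: the model progressively abandons the primary source signal in favor of compressed textual priors and produces confident but ungrounded hallucinations.

Result: On complex audio reasoning benchmarks, MAPO substantially improves long-horizon reasoning fidelity and multimodal instruction following, achieving highly competitive results and setting a new state of the art among open-weight models on several key benchmarks.

Insight: The contribution is the dynamic reweighting of the policy gradient through a modality relevance mask, together with an auxiliary attention loss branch that reinforces cross-modal grounding. Both rely only on native statistical signals rather than domain-specific inductive biases, offering a promising foundation for mitigating epistemic collapse in multimodal systems.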

Abstract: Audio and omni-modal large language models exhibit impressive cross-modal reasoning capabilities. However, applying standard reinforcement learning post-training algorithms to these models exposes a critical structural vulnerability: methods like GRPO apply uniform policy gradients across all tokens, ignoring their unequal dependence on the non-text source modality. This exacerbates late-stage modality collapse during extended chain-of-thought generation, where models progressively abandon the primary source signal in favor of compressed textual priors, leading to confident but ungrounded hallucinations. To address this, we introduce Modality-Aware Policy Optimization (MAPO), a novel dual-branch reinforcement learning framework. First, MAPO dynamically concentrates the policy gradient on modality-critical tokens using a modality relevance mask, which is derived from the cross-modal differential entropy between an audio-ablated reference and the multimodal policy. Second, it integrates an auxiliary attention loss branch that applies a targeted, temporally scaled penalty to the model’s internal attention distributions. This ensures the model actively sustains cross-modal grounding deep into the reasoning trace. Evaluations on complex audio reasoning benchmarks demonstrate that MAPO substantially improves long-horizon reasoning fidelity and multimodal instruction following, achieving highly competitive performance and setting new state-of-the-art results on several key benchmarks among open-weight models. By relying strictly on native statistical signals rather than domain-specific inductive biases, MAPO offers a promising foundation for mitigating epistemic collapse across diverse multimodal systems.
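
Illustrative sketch (not from the paper): the abstract derives the modality relevance mask from an entropy difference between an audio-ablated reference and the multimodal policy, without giving the exact rule. The Python below thresholds that per-token entropy gap and uses the resulting mask to average per-token losses. The threshold, the binary mask, and the toy distributions are assumptions, and the auxiliary attention loss branch is not modeled.

```python
import math


def entropy(dist):
    """Shannon entropy in nats of a discrete distribution."""
    return -sum(p * math.log(p) for p in dist if p > 0)


def modality_mask(policy_dists, ablated_dists, threshold=0.1):
    """Mark tokens whose prediction gets much less certain without audio.

    A token is treated as modality-critical when the audio-ablated reference
    is more uncertain than the multimodal policy by more than `threshold`.
    The threshold and the hard 0/1 mask are hypothetical choices.
    """
    return [
        1.0 if entropy(ablated) - entropy(policy) > threshold else 0.0
        for policy, ablated in zip(policy_dists, ablated_dists)
    ]


def masked_mean(token_losses, mask):
    """Average per-token losses over masked-in tokens only."""
    kept = sum(mask)
    if kept == 0:
        return 0.0
    return sum(loss * m for loss, m in zip(token_losses, mask)) / kept


# Toy next-token distributions for two positions over a four-symbol vocabulary.
policy = [[0.97, 0.01, 0.01, 0.01], [0.25, 0.25, 0.25, 0.25]]
ablated = [[0.25, 0.25, 0.25, 0.25], [0.25, 0.25, 0.25, 0.25]]
mask = modality_mask(policy, ablated)
loss = masked_mean([2.0, 4.0], mask)
```

Only the first token, whose prediction becomes uncertain once the audio is removed, contributes to the masked loss in this example.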


[15] UniMaia: Steering Chess Policies with Language for Human-like Play cs.CL | cs.AI | cs.LG

Sherman Siu, Lesley Istead

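TL;DR: This paper proposes UniMaia, a framework that combines a frozen Lc0 chess policy network with a parameter-efficient text encoder and a ControlNet-style conditioning mechanism to modulate the policy from natural-language prompts in a semantically controllable way. It enables semantic control over gameplay, such as opening selection and player strength, while preserving the pretrained policy representations. The paper further introduces UniMaia-Aux, which incorporates auxiliary temporal conditioning and behavioral prediction objectives.

Motivation: In structured decision-making domains such as chess, specialized policy networks perform strongly but lack semantic controllability, while prompt-conditioned language models are more flexible but have weaker domain grounding. The paper explores how to achieve prompt-conditioned control of a domain-specific policy network without end-to-end multimodal training.

Result: UniMaia achieves state-of-the-art expected accuracy on several prompt-conditioned benchmarks and competitive top-move accuracy on general instruction-following tasks, while remaining competitive with dedicated metadata-conditioned approaches on human move prediction benchmarks. UniMaia-Aux further improves expected accuracy and behavioral modeling across several evaluation settings.

Insight: The contribution is a parameter-efficient framework for prompt-conditioned policy modulation that combines the flexibility of language models with the strong inductive biases of a domain-specific policy network, achieving semantic control without large-scale multimodal training. Auxiliary objectives and a large-scale dataset help balance the trade-off between controllability and predictive performance.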

Abstract: Recent advances in large language models have enabled natural language to serve as a flexible interface for controlling complex systems, but often at the cost of large-scale multimodal training or weakened domain-specific inductive biases. In structured decision-making domains such as chess, specialized policy networks achieve strong performance but lack semantic controllability, while prompt-conditioned language models are more flexible yet typically exhibit weaker domain grounding. We propose UniMaia, a framework for prompt-conditioned policy modulation that adapts a frozen Lc0-based chess policy network using a parameter-efficient text encoder and a ControlNet-style conditioning mechanism. UniMaia enables semantic control over gameplay, including opening selection and player strength, while preserving the pretrained policy representations. We further introduce UniMaia-Aux, which incorporates auxiliary temporal conditioning and behavioral prediction objectives. To support this work, we construct a large-scale metadata-augmented Lichess dataset, develop a semi-automated prompt-generation pipeline, and introduce benchmarks spanning both prompt-conditioned and metadata-conditioned settings. UniMaia achieves state-of-the-art expected accuracy on several prompt-conditioned benchmarks and competitive top-move accuracy on general instruction-following tasks, while remaining competitive with dedicated metadata-conditioned approaches on human move prediction benchmarks. UniMaia-Aux further improves expected accuracy and behavioral modeling across several evaluation settings, with modest trade-offs in top-move accuracy. Overall, our results demonstrate that prompt-conditioned control of domain-specific policy networks is feasible without end-to-end multimodal training, while highlighting trade-offs between controllability and predictive performance.


[16] Do Models Know Why They Changed Their Mind? Interpretability and Faithfulness of Chain-of-Thought Under Knowledge Conflict cs.CL | cs.AI | cs.LG

Pruthvinath Jeripity Venkata

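TL;DR: This paper studies whether a language model's chain-of-thought (CoT) reasoning faithfully reflects the internal mechanism behind its decision when it encounters a document that conflicts with its training knowledge. It introduces the notion of "introspective faithfulness" and tests it across 200 questions, 8 models, and 4 prompt conditions. CoT reasoning turns out to be highly stable across opposite decisions (flip pairs retain 96% of same-answer similarity), while self-rated confidence carries only a faint genuine signal. GPT-4o is the only model with statistically reliable coupling between reasoning and decision.

Motivation: The paper asks whether, under knowledge conflict, a model's CoT reasoning truthfully reports its internal choice mechanism, that is, whether the model follows the external document or trusts its own training knowledge.

Result: Across multiple models and conditions, CoT reasoning remains highly stable across opposite decisions (flip pairs retain 96% of same-answer similarity, d=0.34; confirmed by ROUGE-L, d=0.45). For obscure facts, confidence still predicts decisions (p<0.001) and tracks item-level knowledge (r=0.134). GPT-4o is the only model showing statistically reliable reasoning-decision coupling.

Insight: The contribution is the evaluation concept of "introspective faithfulness" and the finding that CoT decomposes into a decision-invariant knowledge display (about 96%) and a thin confidence layer carrying a weak but real signal. Practical takeaways: when monitoring model decisions, read the confidence rather than the argument; internal thinking tokens are more decision-sensitive than user-facing CoT.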

Abstract: When a language model sees a document contradicting its training knowledge, it must choose: follow the document or trust itself. Prior work proved this choice depends on how well-known the fact is. We ask: does the model’s chain-of-thought (CoT) reasoning faithfully report this mechanism? We introduce introspective faithfulness and test it across 200 questions, 8 models, and 4 prompt conditions. We find CoT reasoning is highly stable across opposite decisions: flip pairs retain 96% of same-answer similarity (d=0.34; confirmed by ROUGE-L, d=0.45). Yet self-rated confidence carries a faint genuine signal: for obscure facts where entity fame is uninformative, confidence still predicts decisions (p<0.001) and tracks item-level knowledge (r=0.134). GPT-4o is the only model with statistically reliable reasoning-decision coupling. Claude Sonnet 4.6 shows the widest confidence range (SD=1.39) but near-zero pooled correlation because the confidence-decision relationship reverses between conditions; a temperature ablation confirms this is model-specific. Internal thinking tokens show greater decision-sensitivity than user-facing CoT (p=0.033). CoT decomposes into a decision-invariant knowledge display (~96%) and a thin confidence layer with weak but real signal. For monitoring: read confidence, not the argument.
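The stability claim above rests on comparing reasoning texts from opposite decisions, with ROUGE-L as a confirming metric. The following is a minimal sketch of that kind of comparison, assuming whitespace tokenization and reading "retain 96% of same-answer similarity" as a ratio of mean similarities; `retention_ratio` and this reading are illustrative assumptions, not the paper's stated procedure.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, candidate):
    """ROUGE-L F1 over whitespace tokens (tokenization is an assumption)."""
    ref, cand = reference.split(), candidate.split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def retention_ratio(flip_pair_sims, same_answer_sims):
    """Hypothetical helper: mean flip-pair similarity relative to the
    mean similarity of pairs that reached the same answer."""
    mean_flip = sum(flip_pair_sims) / len(flip_pair_sims)
    mean_same = sum(same_answer_sims) / len(same_answer_sims)
    return mean_flip / mean_same
```

A ratio near 1.0 would mean the reasoning text barely changes when the decision flips, which is the pattern the abstract reports.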


[17] Playing with Words, Improving with Rewards: Training Language Models for Creative Association cs.CLPDF

Vijeta Deshpande, Namrata Shivagunde, Sherin Muckatira, Hadrien Glaude, Mikhail Gronas

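TL;DR: This paper proposes improving the creativity of large language models (LLMs) by training them to play the word-association game Codenames. Because game outcomes are objectively verifiable, training can use Reinforcement Learning with Verifiable Rewards (RLVR) instead of subjective human judgment. The study finds that model scale shapes the trade-off between creativity and precision: the 8B model improves on most creativity benchmarks, while the smaller models make substantial gains on reasoning tasks.

Details

Motivation: Training LLMs for creativity is hard because creativity is subjective and human judgment is limited, so a method that can be evaluated objectively and trained effectively is needed.

Result: Evaluated on ten creativity and four reasoning benchmarks, the 8B model shows modest but consistent gains on 8 of the 10 creativity benchmarks with only minor reasoning degradation, whereas the 1.7B and 4B models achieve substantial gains on reasoning tasks at the cost of some creativity.

Insight: The novelty lies in using Codenames, an objectively verifiable word-association game, as a training environment combined with RLVR, which offers a scalable way to train LLMs for creativity. The analysis indicates that model scale is a key factor in the creativity-precision trade-off.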

Abstract: Large Language Models (LLMs) are being applied to increasingly difficult problems and use cases. To navigate their vast solution spaces effectively, LLMs need to be creative. Yet the subjective nature of creativity and the limits of human judgment make training LLMs for creativity especially challenging. As a solution, we train LLMs on Codenames, a word-association game that exercises the two central axes of creativity, divergent and convergent thinking, while yielding objectively verifiable outcomes. This verifiability lets us bypass human judgment and train with Reinforcement Learning with Verifiable Rewards (RLVR). We train Qwen3-1.7B, 4B, and 8B models and evaluate them on ten creativity and four reasoning benchmarks. We find that the precision-diversity trade-off is scale-dependent: the 8B model prioritizes creativity over precision, while the 1.7B and 4B models gain reasoning precision at the cost of creativity. Concretely, the 8B model shows modest but consistent creativity gains (8 of 10 benchmarks) with only minor reasoning degradation, whereas the smaller models achieve substantial gains on reasoning tasks. Our study presents a scalable and effective solution to train LLMs for creativity.


[18] TARQ: Tail-Aware Reconstruction Quantization for Rare-Word Robust Automatic Speech Recognition cs.CL | cs.MMPDF

Xinyu Wang, Ziyu Zhao, Ke Bai, Silin Meng, Dongming Shen

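TL;DR: This paper proposes TARQ, a tail-aware reconstruction quantization method that aims to make automatic speech recognition (ASR) systems more robust on rare words such as names, numerals, and domain-specific vocabulary. Using a per-layer calibration rule called RareBAL and a residual correction, it shifts the weight of quantization calibration from frequent words toward rare ones, without entity labels, additional training, or validation decoding.

Details

Motivation: Data-aware post-training quantization (PTQ) minimizes a per-token reconstruction loss on a small calibration corpus, which implicitly weights positions by word frequency. For ASR, this is misaligned with tail-sensitive risk: rare words receive very little of the calibration mass, which can degrade their recognition after quantization.

Result: At W4G128, across eight ASR backbones and six datasets, TARQ improves mean rare-word error rate (rare-WER) without an aggregate-WER regression and achieves the lowest cross-corpus rare-WER swing among the compared methods. It also transfers to entity-rich benchmarks (ProfASR, ContextASR-Speech-En) without entity supervision.

Insight: The core contribution is TARQ, a label-free PTQ framework whose key component, RareBAL, is a closed-form rule that equalizes common and tail calibration mass in each Linear layer, paired with a metric-consistent residual correction. Viewed objectively, the method addresses the mismatch between the calibration data distribution and the evaluation target (rare-word recognition), and it needs no extra supervision or additional training.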

Abstract: Data-aware post-training quantization (PTQ) minimizes a per-token reconstruction loss on a small calibration corpus, implicitly weighting positions by their empirical frequency. For Automatic Speech Recognition (ASR), this misaligns with tail-sensitive risk: names, numerals, and domain-specific words receive proportionally little calibration mass. We propose Tail-Aware Reconstruction Quantization (TARQ), a label-free PTQ framework that shifts calibration toward the lexical tail via RareBAL, a closed-form per-Linear-layer rule equalizing common/tail mass, paired with a metric-consistent residual correction. TARQ requires no entity labels, no curated calibration set, no validation decoding, and no additional training. Across eight ASR backbones and six datasets at W4G128, TARQ improves mean rare-Word Error Rate (rare-WER) without an aggregate-WER regression, achieves the lowest cross-corpus rare-WER swing among compared methods, and transfers to entity-rich benchmarks (ProfASR, ContextASR-Speech-En) without entity supervision.
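To make the idea of "equalizing common/tail mass" concrete, here is a minimal sketch of frequency-balanced position weights for a calibration corpus. It is not the paper's RareBAL rule, whose closed form is not given in the abstract; the count-based tail definition (`tail_max_count`) and the function name are assumptions for illustration only.

```python
from collections import Counter


def balanced_position_weights(tokens, tail_max_count=1):
    """Assign per-position weights so that tail tokens and common tokens
    each receive half of the total calibration mass.

    A token is treated as 'tail' when it occurs at most `tail_max_count`
    times in the calibration corpus (an assumed, simplistic definition).
    """
    if not tokens:
        return []
    counts = Counter(tokens)
    is_tail = [counts[t] <= tail_max_count for t in tokens]
    n_tail = sum(is_tail)
    n_common = len(tokens) - n_tail
    if n_tail == 0 or n_common == 0:
        # Only one group present: fall back to uniform weighting.
        return [1.0 / len(tokens)] * len(tokens)
    return [0.5 / n_tail if tail else 0.5 / n_common for tail in is_tail]
```

Under uniform weighting, a rare word that appears once among a thousand positions gets 0.1% of the reconstruction loss; under this balanced scheme the tail group as a whole gets 50%, which is the kind of shift the abstract describes.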


[19] MERIT: Matching Expertise via Rubric-Informed Training for Reviewer Assignment cs.CLPDF

Zixuan Yang, Yibo Zhao, Weicong Liu, Xiang Li

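TL;DR: This paper proposes MERIT, a two-stage framework for matching papers with reviewers at large venues. The first stage trains a reviewer assessor with reinforcement learning, with rewards from an LLM judge guided by paper-specific expertise rubrics; the second stage distills the assessor's predictions into an embedding-based retriever for efficient large-scale assignment.

Details

Motivation: Existing reviewer-assignment methods either rely on coarse proxy signals that conflate general relatedness with true suitability, or require human annotations that are hard to scale. MERIT aims to close this gap through rubric-driven expertise matching.

Result: Experiments show that the 4B-parameter reviewer assessor outperforms larger general-purpose LLMs on suitability classification, and the resulting retriever achieves state-of-the-art performance on LR-Bench and the CMU Gold dataset.

Insight: The novelty is converting criterion-level expertise matching into scalable supervision, and combining reinforcement learning with distillation to go from fine-grained assessment to efficient retrieval.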

Abstract: Matching submissions with suitable reviewers at scale is a growing challenge for major venues, yet existing approaches either rely on coarse proxy signals that conflate general relatedness with true suitability, or require expensive human annotations that are difficult to scale for training. We propose MERIT, a two-stage framework that bridges this gap by converting criterion-level expertise matching into scalable suitability supervision. In the first stage, we train a reviewer assessor via reinforcement learning to identify the expertise dimensions a paper requires, match them against the reviewer’s prior work, and produce a suitability decision, with rewards provided by an LLM judge guided by paper-specific expertise rubrics. In the second stage, we distill the assessor’s predictions into an embedding-based retriever for efficient large-scale assignment. Experiments show that our 4B reviewer assessor outperforms larger general-purpose LLMs on suitability classification, and the resulting retriever achieves state-of-the-art performance across LR-Bench and the CMU Gold dataset. Our code is available at https://github.com/Luli3220/MERIT.


[20] DecomposeRL: Learning to Ask Useful, Informative, and Diverse Questions for Semi-Supervised, Traceable Claim Verification cs.CL | cs.AI | cs.LGPDF

Shubhashis Roy Dipta, Ankur Padia, Francis Ferraro

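TL;DR: DecomposeRL is a traceable claim-verification system that supports semi-supervised learning. It decomposes verification into question-asking steps and is trained with reinforcement learning (GRPO) and a multi-faceted reward ensemble. A data-curation funnel condenses a large pool of claims into a small, high-quality training set, so that a 7B-parameter model trained with full supervision on only about 5K curated claims reaches strong accuracy across 11 cross-domain benchmarks, matching larger baselines and GPT-4.1-mini, and outperforming baselines in the semi-supervised setting.

Details

Motivation: Claim verification currently faces a dilemma: end-to-end classifiers are accurate but not traceable, while decomposition-based methods are traceable but less accurate. The paper aims for a verifier that is both accurate and produces inspectable traces.

Result: Across 11 benchmarks covering biomedical, political, scientific, and general-domain claims, DecomposeRL-7B achieves 86.3 in-domain and 69.8 out-of-domain balanced accuracy. Despite being 4x smaller, it matches 32B baselines and GPT-4.1-mini, and it outperforms baselines in a semi-supervised setting with only 10% labeled claims.

Insight: The core idea is to recast decomposition in claim verification as a reinforcement-learning policy that, through GRPO and a multi-faceted reward ensemble, learns to ask useful, informative, and diverse questions, yielding traceable decompositions. A second key element is the data-curation funnel, which cuts training cost by distilling 115K claims to 5K and supports learning from a small, high-quality set.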

Abstract: Claim verification splits between end-to-end classifiers that are accurate but yield no inspectable traces, and decomposition-based methods that produce inspectable traces but lag in performance on benchmark datasets. We propose DecomposeRL, an accurate claim verifier that produces inspectable traces. DecomposeRL frames decomposition as an RL policy trained with GRPO and a multi-faceted reward ensemble, enabling both fully supervised and semi-supervised learning from unlabeled claims. DecomposeRL addresses the prohibitive training cost of GRPO with a data-curation funnel that distills 115K fact-verification claims into a compact, learning-signal-dense subset of 5K claims. We show that a DecomposeRL-7B policy trained with full supervision on only ~5K curated claims achieves 86.3 in-domain and 69.8 out-of-domain balanced accuracy across 11 claim-verification benchmarks containing biomedical, political, scientific, and general-domain claims. Despite being 4x smaller, it matches 32B baselines and GPT-4.1-mini, and it further outperforms baselines in a semi-supervised setting with only 10% labeled claims data. Code, data, and models are available at https://dipta007.github.io/DecomposeRL
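The headline numbers above are balanced accuracy, which averages recall over classes so that a majority-class verdict does not inflate the score. A minimal sketch of the standard definition follows; the label values in the usage note are placeholders, not the benchmarks' actual label sets.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, computed over the classes in y_true."""
    classes = sorted(set(y_true))
    recalls = []
    for c in classes:
        indices = [i for i, y in enumerate(y_true) if y == c]
        correct = sum(1 for i in indices if y_pred[i] == c)
        recalls.append(correct / len(indices))
    return sum(recalls) / len(recalls)
```

For example, a verifier that labels every claim as supported on a set with four supported claims and one refuted claim has 80% plain accuracy but only 50% balanced accuracy.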


[21] GRADE: Generalizable Reasoning-Aware Dialogue Evaluation for AI Tutors cs.CLPDF

Parth Bhalerao, Jeromy Chang, David Chou, Oana Ignat

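TL;DR: This paper presents GRADE, a systematic study of open-source models for assessing the pedagogical ability of AI tutors in dialogue, going beyond factual correctness alone. It evaluates 120 configurations spanning five language models, zero-shot inference, LoRA fine-tuning, synthetic augmentation, CoT+Reasoning, and single-task versus multitask formulations. Gemma3-12B performs best for single-task evaluation, while Gemma3-27B in 8-bit precision is more reliable for multitask prediction.

Details

Motivation: Evaluating AI tutor responses requires more than factual correctness; it must cover pedagogical abilities such as identifying mistakes, locating errors, providing guidance, and offering actionable next steps.

Result: In the BEA 2025 TutorMind setting, Gemma3-12B performs best for single-task evaluation and Gemma3-27B in 8-bit precision is more reliable for multitask prediction. Carefully selected open-source LoRA pipelines can match or surpass proprietary and ensemble-based systems on key pedagogical dimensions.

Insight: The contribution is a systematic study of how many configuration choices affect pedagogical-ability assessment. Augmentation helps models that struggle with the original data, and CoT+Reasoning is more useful for synthetic data generation than for direct classification. A key observation is that LoRA fine-tuning on structured classification objectives interferes with instruction-following behavior under thinking mode, redirecting generation away from the required evaluation format.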

Abstract: Evaluating AI tutor responses requires more than factual correctness: tutors must identify mistakes, locate errors, provide guidance, and offer actionable next steps. We present GRADE, a systematic study of open-source models for pedagogical ability assessment in student-tutor dialogues. Building on the BEA 2025 TutorMind setting, we evaluate 120 configurations across five language models, zero-shot inference, LoRA fine-tuning, synthetic augmentation, CoT+Reasoning, and single-task versus multitask formulations. Gemma3-12B performs best for single-task evaluation, while Gemma3-27B in 8-bit precision is more reliable for multitask prediction. We find that augmentation helps models that struggle with the original data, verification adds limited gains despite higher cost, and CoT+Reasoning is more useful for synthetic data generation than direct classification. We further show that LoRA fine-tuning on structured classification objectives interferes with instruction-following behavior under thinking mode, redirecting generation away from the required evaluation format. Carbon analysis shows that model choice and reasoning mode substantially affect emissions. Overall, GRADE shows that carefully selected open-source LoRA pipelines can match or surpass proprietary and ensemble-based systems on key pedagogical dimensions, with code and data available at https://github.com/pvbgeek/GRADE.


[22] Retrieval, Reward, and Training Protocols: What Matters in Training Search Agents? cs.CLPDF

Yibo Zhao, Zichen Ding, Jiayi Wu, Zun Wang, Xiang Li

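TL;DR: Through a controlled empirical study, this paper analyzes three dimensions of training search agents: the retrieval corpus, reward design, and training protocol. Correcting a data-coverage issue in the Wikipedia 2018 corpus yields larger gains than the differences between training algorithms; simple outcome-based rewards match or beat process-based rewards in most settings; and the study distills practical guidelines on training data diversity, off-policy data utilization, and search budget scaling.

Details

Motivation: Existing training methods for LLM-based search agents differ in retrieval corpora, reward designs, and training protocols, making it unclear what actually drives improvements; controlled comparison is needed.

Result: Correcting the data-coverage issue in the Wikipedia 2018 corpus alone yields larger gains than the differences between training algorithms. Across three base models, the simplest outcome-based reward approach achieves competitive or superior performance in most settings, while process-level credit assignment can over-correct agent behavior.

Insight: The contribution is a controlled study that isolates and empirically evaluates three under-explored dimensions of search-agent training. It shows that data quality (such as corpus coverage) can matter more than algorithmic differences, and it challenges the common assumption that process rewards beat outcome rewards, offering data-driven guidance for training design.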

Abstract: Search agents powered by large language models can autonomously decompose queries, retrieve information, and synthesize answers through multi-step reasoning. However, the rapid growth of training methods has outpaced controlled comparison: existing works differ in retrieval corpora, reward designs, and training protocols, making it unclear what actually drives improvements. We present a controlled empirical study that isolates three under-explored dimensions of search agent training. First, we identify a critical data-coverage issue in the widely used Wikipedia 2018 corpus and show that correcting it alone yields larger gains than the differences between training algorithms. Second, we systematically compare outcome-based and process-based reward methods across three base models, finding that the simplest outcome-based approach achieves competitive or superior performance in most settings, and that process-level credit assignment can over-correct agent behavior. Third, we analyze training data diversity, off-policy data utilization, and search budget scaling, distilling practical guidelines for training effective search agents. Our code is available at https://github.com/YiboZhao624/SearchAgentReview.


[23] VibeSearchBench: Benchmarking Long-horizon Proactive Search in the Wild cs.CL | cs.AIPDF

Xiaohongshu Inc

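TL;DR: This paper introduces VibeSearchBench, a benchmark for evaluating the long-horizon proactive search abilities of LLM agents in realistic settings. It contains 200 bilingual tasks split into professional and daily-life subsets, evaluated with a progressive-disclosure user simulator and a graph-matching framework. Experiments show that current models remain substantially inadequate for VibeSearch tasks, highlighting the need for fundamental advances in long-context reasoning, proactive intent elicitation, and structured knowledge construction.

Details

Motivation: Existing search benchmarks rely on over-specified queries, single-turn interactions, and fixed-schema evaluation, which differ from real search, where users and agents collaboratively refine vague intent through multi-turn dialogue. The paper aims to close this gap between evaluation and experience.

Result: Seven frontier models were tested under both the ReAct framework and the OpenClaw agent harness; all remain substantially inadequate for VibeSearch, with a best F1 of only 30.30.

Insight: The novelty is the VibeSearch paradigm and its benchmark, which model how users and agents refine vague intent together over multi-turn dialogue, using schema-free ground-truth knowledge graphs, a progressive-disclosure user simulator, and a graph-matching evaluation framework. Together these offer a new way to assess agents' long-context reasoning and proactive interaction.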

Abstract: LLM-based agents score well on search benchmarks, yet real users consistently find results unsatisfying, revealing a persistent evaluation-experience gap. We attribute this gap to existing benchmarks’ reliance on over-specified queries, single-turn interactions, and fixed-schema evaluation, none of which reflect real search behavior where users and agents collaboratively refine vague intent through multi-turn dialogue. We term this paradigm VibeSearch and introduce VibeSearchBench, a benchmark comprising 200 manually curated bilingual (Chinese and English) tasks across 20 domains, split into VibeSearch-Pro (professional) and VibeSearch-Daily (daily-life) subsets. Each task pairs a user persona with a schema-free ground-truth knowledge graph, and is evaluated through a progressive-disclosure user simulator and a graph-matching evaluation framework. We benchmark seven frontier models under both the ReAct framework and the OpenClaw agent harness. Results show that all models remain substantially inadequate for VibeSearch (best F1: 30.30), highlighting the need for fundamental advances in long-context reasoning, proactive intent elicitation, and structured knowledge construction.
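As a rough illustration of scoring a constructed knowledge graph against a ground-truth one, the sketch below computes F1 over exactly matching (subject, relation, object) triples. This is an assumption-laden simplification: the benchmark's graph-matching framework is schema-free and is likely more lenient than exact matching, and the abstract does not specify its procedure.

```python
def triple_f1(predicted, gold):
    """F1 between two collections of (subject, relation, object) triples,
    counting only exact matches (a deliberately strict assumption)."""
    pred, ref = set(predicted), set(gold)
    if not pred or not ref:
        return 0.0
    true_positives = len(pred & ref)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(ref)
    return 2 * precision * recall / (precision + recall)
```

An agent that returns two triples, one of them correct, against a three-triple ground truth scores precision 0.5, recall 1/3, and F1 0.4.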


[24] FinBoardBench: Benchmarking Dynamic Wealth Management and Strategic Financial Reasoning of LLMs via Board Game Simulations cs.CL | cs.CEPDF

Xuesi Hu, Peng Wang, Jinpeng Miao, Xilin Tao, Caiwei Li

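TL;DR: This paper presents FinBoardBench, an evaluation suite built on three classic financial board games (Cashflow, Acquire, and Monopoly) to assess LLMs' abilities in dynamic wealth management and financial decision-making. Experiments show that although advanced LLMs exhibit basic long-term planning and investment logic, they fail to leverage complex interactions for profit, and their static reasoning ability does not carry over into successful dynamic decision-making.

Details

Motivation: Existing static financial benchmarks are insufficient for assessing LLMs' dynamic wealth-management and financial decision-making capabilities in real-world environments, so a new evaluation framework is needed.

Result: Experiments with 9 advanced LLMs show that they tend to prioritize immediate asset acquisition over maintaining sufficient liquidity, leaving them vulnerable to financial crises triggered by random events, and that strong static reasoning performance does not translate into successful dynamic decision-making.

Insight: The novelty is using board-game simulations to build a benchmark for dynamic financial decision-making. It offers a reference for future LLM-based decision-making systems and exposes key weaknesses in handling complex interactions and managing liquidity.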

Abstract: Recently, large language models (LLMs) have achieved superior performance in static financial reasoning and simple dynamic trading tasks. However, existing static financial benchmarks are insufficient to assess the dynamic wealth management and financial decision-making capabilities of LLMs in real-world environments. To bridge this gap, we present FinBoardBench, an evaluation suite based on three classic financial board games: Cashflow, Acquire, and Monopoly. FinBoardBench assesses a comprehensive set of financial skills, including personal cash flow management with debt balancing, corporate investment and acquisition forecasting, and competitive trade negotiations with asset auctions. Our experiments with 9 advanced LLMs reveal that while exhibiting basic long-term planning and investment logic, they fail to effectively leverage complex interactions for profit, and their strong static reasoning performance does not transform into successful dynamic decision-making. Notably, they tend to prioritize immediate asset acquisition over maintaining sufficient liquidity, making them vulnerable to financial crises triggered by random events. We hope that FinBoardBench can provide a valuable reference for more intelligent LLM-based decision-making systems in the future.


[25] ESC-Skills: Discovering and Self-Evolving Skills for Emotional Support Conversations cs.CL | cs.AIPDF

Jie Zhu, Huaixia Dou, Shuo Jiang, Junhui Li, Lifan Guo

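TL;DR: This paper proposes ESC-Skills, a skill-centric, self-evolving framework for emotional support conversations. It models Intervention Units to discover executable emotional support skills and build a skills bank, then refines the bank through self-evolution driven by interactions with diverse simulated seekers, improving robustness and interpretability.

Details

Motivation: Existing emotional support conversation systems mainly rely on end-to-end response generation or coarse strategy supervision, offering limited interpretability and little support for systematic skill improvement. The paper addresses this by building a bank of discoverable, executable, self-evolving skills that yields more interpretable and controllable support behaviors.

Result: Experimental results show that ESC-Skills improves both response-level quality and dialogue-level emotional outcomes while providing more interpretable and controllable support behaviors.

Insight: The novelty is modeling emotional support conversation in terms of executable skill units and introducing a multi-profile self-evolutionary refinement framework, in which simulated interactions are used to identify and fix missing skills, unsafe interventions, and profile-specific failure patterns in the skills bank.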

Abstract: Existing emotional support conversation (ESC) systems mainly rely on end-to-end response generation or coarse strategy supervision, offering limited interpretability and little support for systematic skill improvement. We propose ESC-Skills, a skill-centric framework that discovers and self-evolves executable emotional support skills. We first model localized support interactions as Intervention Units (IUs), which capture state–action–outcome dynamics between seeker states, support interventions, and post-response emotional changes. Based on IUs extracted from both successful and failed ESC dialogues, we construct the ESC-Skills Bank, a repository of executable emotional support skills containing intervention guidance, applicability conditions, expected outcomes, and potential risks. To further improve robustness, we introduce a multi-profile self-evolutionary refinement framework in which an ESC agent interacts with diverse simulated seeker profiles under SAGE evaluation. The resulting interaction traces are analyzed to identify missing skills, unsafe interventions, and profile-specific failure patterns, which are then used to refine the Skills Bank through simulation-based verification. Experimental results demonstrate that ESC-Skills improves both response-level quality and dialogue-level emotional outcomes while providing more interpretable and controllable support behaviors. We will release the code, prompts, and ESC-Skills Bank at https://github.com/aliyun/qwen-dianjin.


[26] GeneralThinker: Domain-General Reasoning through Likelihood-Guided Answer-Conditioned Optimization cs.CLPDF

Shengmin Piao, Sanghyun Park

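TL;DR: GeneralThinker is an on-policy framework for language model reasoning that reformulates reasoning supervision as dense answer-conditioned optimization, enabling response-level evaluation and fine-grained token-level credit assignment without domain-specific verifiers. It evaluates generated reasoning trajectories using the likelihood of the ground-truth answer and stabilizes optimization through clipping and direction-preserving modulation. Across 11 benchmarks spanning mathematics, STEM, and general reasoning, GeneralThinker achieves the best average performance.

Details

Motivation: Existing reinforcement-learning approaches to language model reasoning depend on domain-specific verifiers, sparse outcome rewards, and coarse-grained credit assignment, which limits their applicability.

Result: Across 11 benchmarks spanning mathematics, STEM, and general reasoning, GeneralThinker achieves the best average performance.

Insight: The novelty is reformulating reasoning supervision as dense answer-conditioned optimization, which gives fine-grained token-level credit assignment without domain-specific verifiers. Clipping and direction-preserving modulation stabilize token-level updates; the analyses indicate that uncontrolled modulation can destabilize training, whereas controlled modulation makes fine-grained credit assignment consistently effective.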

Abstract: Reinforcement learning with verifiable rewards improves language model reasoning, but its reliance on domain-specific verifiers, sparse outcome rewards, and coarse-grained credit assignment limits its applicability. We introduce GeneralThinker, an on-policy framework that reformulates reasoning supervision as dense answer-conditioned optimization, enabling response-level evaluation and token-level credit assignment without domain-specific verifiers. GeneralThinker evaluates generated reasoning trajectories using the likelihood of the ground-truth answer and derives token-wise compatibility signals for fine-grained credit assignment. To stabilize optimization, it constrains token-level updates through clipping and direction-preserving modulation. Across 11 benchmarks spanning mathematics, STEM, and general reasoning, GeneralThinker achieves the best average performance. Further analyses show that uncontrolled token-level modulation can destabilize training, whereas controlled modulation makes fine-grained credit assignment consistently effective.


[27] DisasterBench: Benchmarking LLM Planning under Typed Tool Interface Constraints cs.CLPDF

Zhitong Chen, Kai Yin, Weifeng Zhang, Zhiyuan Wang, Xiangjue Dong

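TL;DR: This paper introduces DisasterBench, a benchmark for evaluating how well large language models perform structured multi-agent planning in disaster-response scenarios over tools that are semantically similar but operationally distinct. The benchmark emphasizes generating executable workflows with correct parameter binding and dependency propagation. The authors also propose First-Point-of-Failure (FPoF), which localizes the earliest root-cause error in a predicted workflow. The evaluation finds that planning effectiveness depends strongly on model capacity, that tool mismatch and parameter binding are the main bottlenecks, and that verbose intermediate reasoning can clash with structured output requirements.

Details

Motivation: Disasters cause severe societal impacts and demand rapid coordination of heterogeneous AI tools, from satellite analysis to flood prediction, into coherent multi-step workflows. As LLMs increasingly orchestrate such pipelines, effective coordination requires not only selecting semantically plausible tools but also generating executable workflows with correct parameter binding and dependency propagation.

Result: The evaluation on DisasterBench yields three findings: planning method effectiveness depends strongly on model capacity; tool mismatch and parameter-binding errors dominate first failures, revealing semantic grounding and execution consistency as distinct bottlenecks; and verbose intermediate reasoning can create an instruction clash with structured output requirements, disrupting plan generation.

Insight: The contributions are a new benchmark focused on LLM planning under typed tool interface constraints and the First-Point-of-Failure method for step-level error attribution. The central insight is a fundamental gap between semantic reasoning and execution-grounded coordination, which calls for planning frameworks that jointly model semantic intent, execution constraints, and workflow consistency.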

Abstract: Disasters cause severe societal impacts, demanding rapid coordination of heterogeneous AI tools, from satellite analysis to flood prediction and damage assessment, into coherent multi-step workflows. As LLMs increasingly serve as orchestrators of such pipelines, effective coordination requires more than selecting semantically plausible tools: LLMs must generate executable workflows with correct parameter binding and dependency propagation. We introduce DisasterBench, a benchmark for evaluating structured multi-agent planning over semantically similar but operationally distinct disaster-response tools. To enable step-level failure attribution, we further propose First-Point-of-Failure (FPoF), which localizes the earliest root cause in a predicted workflow, separating primary errors from downstream cascading effects. Our evaluation reveals three findings: planning method effectiveness depends strongly on model capacity; tool mismatch and parameter-binding errors dominate first failures, revealing semantic grounding and execution consistency as distinct bottlenecks; and verbose intermediate reasoning can create instruction clash with structured output requirements, disrupting plan generation. Together, these findings highlight a fundamental gap between semantic reasoning and execution-grounded coordination, underscoring the need for planning frameworks that jointly model semantic intent, execution constraints, and workflow consistency. Code, data, and evaluation resources are available at: https://github.com/TamuChen18/DisasterBench_Open
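The idea behind First-Point-of-Failure, locating the earliest error so that downstream cascading effects are not double-counted, can be sketched as a step-by-step comparison against a reference workflow. The step schema (`tool`, `args`), the error labels, and the tool names in the usage note are hypothetical; the paper's actual attribution procedure is not described in the abstract.

```python
def first_point_of_failure(predicted, gold):
    """Return (step_index, error_type) for the earliest divergence between
    a predicted workflow and a reference workflow, or None if they match.

    Each step is a dict with a 'tool' name and an 'args' dict; this
    representation is an assumption made for illustration.
    """
    for index, (pred_step, gold_step) in enumerate(zip(predicted, gold)):
        if pred_step["tool"] != gold_step["tool"]:
            return index, "tool_mismatch"
        if pred_step["args"] != gold_step["args"]:
            return index, "parameter_binding"
    if len(predicted) != len(gold):
        # One workflow is a strict prefix of the other.
        return min(len(predicted), len(gold)), "length_mismatch"
    return None
```

For a reference plan of `satellite_analysis` followed by `flood_prediction` (hypothetical tool names), a prediction that picks the right tools but binds the second step to the wrong upstream output would be reported as `(1, "parameter_binding")`; any later errors are treated as downstream of that step.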


[28] Semantic Flow Regularization: Teaching LLMs to Generate Diverse Yet Coherent Responses cs.CL | cs.AIPDF

Kerui Peng, Feifei Li, Xingyu Fan, Wenhui Que

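TL;DR: This paper targets the severe loss of output diversity that occurs when large language models are fine-tuned to generate persona- or tone-conditioned responses, a failure the authors call Cross-Style Collapse, and proposes a lightweight method, Semantic Flow Regularization (SFR). SFR supervises the backbone with continuous sentence-encoder embeddings of future segments via conditional flow matching; the flow-matching head is discarded at inference, so deployment cost is zero. Experiments show that SFR improves output diversity, style fidelity, and response quality on an industrial dialogue dataset and improves pass@k on a code-generation benchmark.

Details

Motivation: Fine-tuning large language models for specific personas or tones severely reduces their output diversity (Cross-Style Collapse). The authors trace the problem to the cross-entropy objective, which under shared representations tends to suppress diverse continuations.

Result: On a large-scale industrial dialogue dataset (Qwen3-32B, 9 personas), SFR improves output diversity, style fidelity, and response quality over standard supervised fine-tuning. On the public LiveCodeBench-v5 benchmark (Qwen2.5-Coder-7B-Instruct), SFR consistently improves pass@k, confirming generality beyond stylized dialogue. A controlled comparison on MBPP shows Multi-Token Prediction to be a degenerate special case of SFR.

Insight: The claimed contribution is Semantic Flow Regularization, a lightweight auxiliary objective that uses conditional flow matching to supervise with continuous semantic embeddings of future segments; its stochastic flow source preserves multi-modality by construction, and it adds no inference overhead. Viewed objectively, the method brings flow matching into language-model regularization, offers a new framework for the output homogenization caused by fine-tuning, and draws a connection to Multi-Token Prediction.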

Abstract: When large language models are fine-tuned to generate persona- or tone-conditioned responses, their output diversity is severely limited–a failure we term Cross-Style Collapse. We trace this collapse to the cross-entropy objective, which under shared representations tends to suppress diverse continuations. We propose Semantic Flow Regularization (SFR), a lightweight auxiliary objective that supervises the backbone with continuous sentence-encoder embeddings of future segments via conditional flow matching. The stochastic flow source preserves multi-modality by construction; the flow-matching head is discarded at inference, adding zero deployment cost. On a large-scale industrial dialogue dataset (Qwen3-32B, 9 personas), SFR improves output diversity, style fidelity, and response quality over SFT. We further validate on the public LiveCodeBench-v5 (Qwen2.5-Coder-7B-Instruct), where SFR consistently improves pass@k, confirming generality beyond stylized dialogue. A controlled comparison on MBPP reveals Multi-Token Prediction to be a degenerate special case of SFR.
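The code-generation result above is reported as pass@k. For reference, here is a sketch of the widely used unbiased estimator, which computes the probability that at least one of k samples drawn from n generations is correct when c of the n pass. Whether the paper uses exactly this estimator is an assumption; the abstract does not say.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate: 1 - C(n - c, k) / C(n, k).

    n: total samples generated per problem
    c: number of those samples that pass the tests
    k: sampling budget being evaluated (k <= n)
    """
    if k > n:
        raise ValueError("k must not exceed n")
    # math.comb returns 0 when n - c < k, which yields pass@k = 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Because pass@k at larger k rewards having at least one correct sample among many, it is sensitive to output diversity, which is plausibly why it serves as a diversity-related check here.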


[29] Rethinking Visual Neglect: Steering via Context-Preference for MLLM Hallucination Mitigation cs.CLPDF

Jingwen Wu, Xijun Zhang, Ge Song

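TL;DR: This paper revisits object hallucination in multimodal large language models (MLLMs) and challenges the assumption behind existing methods that hallucination stems mainly from visual neglect. Through systematic interventions, the authors find that simply increasing visual reliance can make hallucinations worse on some models. They propose a training-free inference-time framework, Context-Preference Activation Steering (CAS), which extracts and injects Context Preference Vectors to control the model's reliance on the image, parametric knowledge, and textual context, thereby mitigating hallucinations.

Details

Motivation: Current inference-time methods for mitigating object hallucination in MLLMs mainly assume that hallucinations stem from visual neglect and steer models toward greater visual reliance. The authors find this assumption incomplete, since pushing toward more visual reliance worsens hallucinations on some models. A finer-grained framework is therefore needed to understand and manage the competition between the image as context, the model's parametric knowledge, and the textual context.

Result: Experiments show that CAS substantially mitigates object hallucinations without increasing decoding latency, while preserving the model's native text-generation quality.

Insight: The core insight is that MLLM hallucination cannot simply be blamed on insufficient visual reliance; the image, as a context, competes with the model's parametric knowledge and the textual context. CAS extracts two semantically distinct Context Preference Vectors from small sets of designed conflict samples and applies them at inference via single-pass signed residual injection at mid-early MLP layers, giving control over information reliance without added latency or training.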

Abstract: Object hallucination remains a primary obstacle to the reliable deployment of Multimodal Large Language Models (MLLMs). Current inference-time mitigation methods mainly assume hallucinations stem from visual neglect, steering models to enhance visual reliance. In contrast, our systematic interventions on multiple MLLMs show that pushing toward more visual reliance may exacerbate hallucinations on some models, while less may mitigate hallucinations. This result suggests that attributing hallucinations solely to visual insufficiency is underdetermined. We argue that the image, as a context, simultaneously competes with the model’s parametric knowledge and the textual context. For this, we propose a training-free framework, Context-Preference Activation Steering (CAS). It extracts two semantically distinct Context Preference Vectors (CPVs) via two small sets of designed conflict samples and applies them via single-pass signed residual injection at mid-early MLP layers during inference to control information reliance. Experiments show that CAS substantially mitigates object hallucinations without increasing decoding latency and preserves native text-generation quality.
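Activation steering of this kind is often implemented by taking a difference of mean activations between two sets of contrasting samples and adding a scaled copy of that vector to a layer's output. The sketch below shows that generic recipe on plain lists; it is an assumption that CAS extracts its Context Preference Vectors this way, and the function names are illustrative.

```python
def mean_vector(rows):
    """Element-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(column) / n for column in zip(*rows)]


def preference_vector(acts_prefer_context, acts_prefer_other):
    """Difference of mean activations between two sets of conflict samples
    (a common steering-vector recipe, assumed here for illustration)."""
    mean_a = mean_vector(acts_prefer_context)
    mean_b = mean_vector(acts_prefer_other)
    return [a - b for a, b in zip(mean_a, mean_b)]


def inject(hidden, vector, strength):
    """Signed residual injection: a positive strength pushes toward the
    preference direction, a negative strength pushes away from it."""
    return [h + strength * v for h, v in zip(hidden, vector)]
```

The sign of `strength` is what lets one mechanism either increase or decrease reliance on a given context, matching the observation that more visual reliance helps some models and hurts others. Since the addition happens once per forward pass, it adds no extra decoding steps.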


[30] ResearchMath-14K: Scaling Research-Level Mathematics via Agents cs.CLPDF

Guijin Son, Seungyeop Yi, Minju Gwak, Hyunwoo Ko, Wongi Jang

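TL;DR: This paper introduces ResearchMath-14K, a dataset of 14,056 research-level mathematics problems curated from academic sources via a multi-agent pipeline, the largest collection of its kind to date. The authors further generate 220K teacher trajectories (ResearchMath-Reasoning) and observe avoidance behaviors such as non-attempts and fabricated references. After agentic filtering, fine-tuning Qwen3 models (4B to 30B parameters) improves over base models by 9.2 points on average, showing that filtered open-problem attempts can provide useful supervision even without fully correct reasoning traces.

Details

Motivation: The frontier of mathematics is defined by problems that remain unsolved, yet it is unclear whether language models can meaningfully engage with such problems without human intervention. A major obstacle is the lack of large-scale research-level math datasets.

Result: Across eight open-weight models, newer generations produce 5.6x more references and 5.0x more fake references per trace. After agentic filtering, fine-tuning Qwen3 models improves over base models by 9.2 points on average.

Insight: The contributions are a large-scale research-level math dataset built with a multi-agent pipeline and the identification of avoidance behaviors when models attempt open problems. Viewed objectively, supervised fine-tuning on filtered, imperfect reasoning traces appears to be an effective training strategy, and the dataset offers a new resource for research on advanced mathematical reasoning.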

Abstract: The frontier of mathematics is defined by problems whose solutions are not yet known, yet it remains unclear whether language models can meaningfully engage with such problems without human intervention. A major obstacle is the lack of large-scale research-level math datasets. To this end, we introduce ResearchMath-14k, a set of 14,056 problems curated from academic sources via a multi-agent pipeline, making it the largest collection of research-level mathematical problems to date. We further generate ResearchMath-Reasoning, 220K teacher trajectories from two open models, where we observe recurring avoidance behaviors such as non-attempts and fabricated references. Interestingly, across eight open-weight models, newer generations produce 5.6× more references and 5.0× more fake references per trace. After agentic filtering of ResearchMath-Reasoning, fine-tuning Qwen3 models from 4B to 30B parameters improves over base models by 9.2 points on average. This shows that filtered open-problem attempts can provide useful supervision even without fully correct reasoning traces. We make ResearchMath-14k publicly available for future works on research-level mathematical reasoning.


[31] Beyond Chunk-Local Extraction: Cross-Chunk Graph Augmentation for GraphRAG cs.CLPDF

Jiaming Zhang, Yibo Zhao, Jing Yu, Jianxiang Yu, Xiang Li

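TL;DR: This paper proposes CrossAug, a method for augmenting GraphRAG (graph-based retrieval-augmented generation). GraphRAG organizes a corpus as an explicit knowledge graph to support complex question answering, but existing frameworks extract entities and relations only within individual chunks and miss cross-chunk relations. CrossAug is an offline, GNN-guided graph augmentation step: it derives training supervision through self-supervised graph corruption, scores subgraphs for missingness, and applies evidence-grounded LLM completion only to selected high-scoring regions, enriching the graph index with cross-chunk relations.

Details

Motivation: When building the knowledge graph, existing GraphRAG frameworks extract entities and relations only within individual chunks, so cross-chunk relations, whose evidence spans multiple passages, are systematically absent from the index. Exhaustive LLM-based recovery of these relations is impractical because of the combinatorial explosion of chunk combinations.

Result: Experiments on three LLM-based GraphRAG frameworks across four multi-hop and long-document QA benchmarks show that CrossAug consistently improves performance, confirming the benefit of cross-chunk graph augmentation for retrieval-based question answering.

Insight: The main contribution is CrossAug, an efficient GNN-guided offline method for cross-chunk graph augmentation. It generates training supervision via self-supervised graph corruption and uses a topology-aware GNN to score subgraphs for missingness, so that costly LLM completion is applied selectively, avoiding the combinatorial explosion while enriching the graph's structure. Viewed objectively, it combines the structural reasoning of GNNs with the semantic understanding of LLMs to make relation extraction scalable when building graphs over large document collections.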

Abstract: GraphRAG extends retrieval-augmented generation by organizing corpora as explicit knowledge graphs, enabling graph-based retrieval for complex question answering. However, existing frameworks extract entities and relations within individual chunks, leaving cross-chunk relations – those whose evidence spans multiple passages – systematically absent from the index. Exhaustive LLM-based recovery of such relations is impractical due to the combinatorial explosion of chunk combinations. We present CrossAug, a GNN-guided CROSS-Chunk Graph AUGmentation method that enriches GraphRAG indices with cross-chunk relational structure as an offline step before query-time retrieval. CrossAug derives training supervision through self-supervised graph corruption, uses a topology-aware GNN to score subgraphs for missingness, and applies evidence-grounded LLM completion only to selected high-scoring regions. Experiments on three LLM-based GraphRAG frameworks across four multi-hop and long-document QA benchmarks demonstrate that CrossAug consistently improves performance, confirming the benefit of cross-chunk graph augmentation for retrieval-based question answering. Our code is available at https://github.com/DonFinliani/CrossAug.
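Self-supervised graph corruption generally means hiding part of a known graph and training a model to detect what was hidden. The sketch below shows one simple variant, random edge removal, where the removed edges serve as positive "missing" labels. The drop fraction, the uniform sampling, and the function name are assumptions; the abstract does not specify CrossAug's corruption scheme.

```python
import random


def corrupt_graph(edges, drop_fraction=0.2, seed=0):
    """Randomly remove a fraction of edges from a graph.

    Returns (kept_edges, removed_edges). A scorer can then be trained on
    the kept graph to flag regions where the removed edges used to be.
    """
    unique_edges = sorted(set(edges))
    rng = random.Random(seed)
    n_drop = int(len(unique_edges) * drop_fraction)
    removed = set(rng.sample(unique_edges, n_drop))
    kept = [edge for edge in unique_edges if edge not in removed]
    return kept, sorted(removed)
```

The appeal of this setup is that labels come for free from the existing index, so no human annotation of missing relations is needed before the scorer decides where LLM completion is worth its cost.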


[32] Integrated and Cross-Architecture Interpretation of LLM Reasoning cs.CL | cs.AIPDF

Leonardo Matthew Yauw, Wei-Bin Kou, Yujiu Yang

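TL;DR: This paper proposes the Integrated, cross-Architecture Reasoning (IAR) framework to address the opacity of reasoning in large language models (LLMs). The framework combines bandwidth-calibrated Mutual Information Peak (MIP) with Tukey IQR peak detection to isolate reasoning-crucial tokens at the output layer, analyzes their overlap with Deep-Thinking Ratio (DTR) tokens to trace cross-layer trajectories, and applies a Jaccard stability metric over multi-domain problems to check whether the MIP-identified tokens are reliable indicators of reasoning quality.

Details

Motivation: Understanding LLM reasoning is hindered by an asymmetry: generated outputs are observable, but the underlying reasoning patterns are opaque. Relying on a single probe (such as MIP or DTR) risks underestimating the genuine inferential structure, so a unified approach to interpreting LLM reasoning is needed.

Result: Extensive experiments on three models (Qwen-7B, Qwen-14B, and Llama-8B) across four domains (mathematics, code, logic, and common sense) demonstrate that IAR's interpretation capabilities generalize across architectures.

Insight: The novelty is integrating multiple probes (MIP and DTR) for cross-validation and adding bandwidth calibration, peak detection, and a stability metric to reveal LLM reasoning patterns and their evolution across layers more fully, offering a unified framework for interpretability research.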

Abstract: Understanding how LLMs reason is hindered by a practical asymmetry: while their generated outputs are observable, the underlying reasoning patterns remain opaque. Relying on single probes, such as Mutual Information Peak (MIP) or Deep-Thinking Ratio (DTR), risks underestimating the genuine inferential structure. To address this deficiency, we present an Integrated, cross-Architecture Reasoning (IAR) framework, designed to provide a unified approach to LLM reasoning interpretability. Specifically, we first propose to use bandwidth-calibrated MIP coupled with Tukey IQR peak-detection to isolate reasoning-crucial tokens at the output layer. Second, we perform an overlap analysis between MIP-picked tokens and DTR-deep tokens to trace the cross-layer trajectories of those tokens. This also reveals whether reasoning-crucial tokens are computation-intensive as well, which helps in understanding how reasoning patterns evolve across model layers. Finally, we apply a Jaccard stability metric over multi-domain problems to verify whether the MIP-identified tokens are reliable indicators of reasoning quality. Extensive experiments on three models (Qwen-7B, Qwen-14B, and Llama-8B) across four domains (mathematics, code, logic, and common sense) demonstrate IAR's generalizable interpretation capabilities across architectures.
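Two of the building blocks named above are standard and easy to illustrate: Tukey's IQR rule for flagging unusually high values, and the Jaccard index for comparing token sets. The sketch assumes per-token scores are already available as a list of numbers (how MIP scores are estimated and bandwidth-calibrated is outside this example), and it uses the conventional 1.5 multiplier, which may differ from the paper's setting.

```python
import statistics


def tukey_peaks(scores, k=1.5):
    """Indices of scores above the upper Tukey fence Q3 + k * IQR."""
    q1, _, q3 = statistics.quantiles(scores, n=4, method="inclusive")
    upper_fence = q3 + k * (q3 - q1)
    return [i for i, s in enumerate(scores) if s > upper_fence]


def jaccard(a, b):
    """Jaccard index of two sets; defined as 1.0 when both are empty."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)
```

In this reading, `tukey_peaks` would pick out candidate reasoning-crucial token positions from a score sequence, and `jaccard` would quantify how much the selected token sets agree across runs or across probes.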


[33] ROSD: Reflective On-Policy Self-Distillation for Language Model Reasoning across Domains cs.CL | cs.LGPDF

Ziqi Zhao, Xinyu Ma, Liu Yang, Yujie Feng, Daiting Shi

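TL;DR: This paper proposes ROSD (Reflective On-Policy Self-Distillation), a new framework for improving the reasoning of large language models (LLMs). It introduces a self-reflection mechanism that locates the first erroneous span in a reasoning trace and generates a targeted corrective idea, turning conventional reference-solution imitation into targeted error correction and improving both in-domain and out-of-domain reasoning performance.

Details

Motivation: Existing on-policy self-distillation (OPSD) methods yield limited in-domain gains and generalize poorly to out-of-domain problems. The root cause is that a self-teacher conditioned on a verified solution tends to imitate training-domain reference trajectories instead of making error-specific corrections, and distilling over the full response can overwrite valid reasoning prefixes and worsen overfitting.

Result: Experiments on multiple in-domain and out-of-domain reasoning benchmarks show that, compared with standard OPSD, ROSD is stronger overall on in-domain reasoning and substantially better at out-of-domain generalization.

Insight: The core innovation is using a self-reflector to extract a corrective idea and locate the first erroneous span, which shifts training from imitation to targeted correction and restricts distillation to the local region that needs fixing, so flawed reasoning is corrected while valid prefixes are preserved. This offers a new way around the limitations of OPSD.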

Abstract: On-policy self-distillation (OPSD) improves the reasoning performance of large language models (LLMs) by providing dense token-level supervision for on-policy rollouts. However, existing OPSD methods often yield limited gains on in-domain reasoning and generalize poorly to out-of-domain problems. We identify two key causes: conditioning the self-teacher on a verified solution encourages imitation of training-domain reference trajectories rather than error-specific correction, and applying distillation to the full response can overwrite valid reasoning prefixes and reinforce overfitting. We propose Reflective On-policy Self-Distillation (ROSD), a framework that turns reference-solution imitation into targeted reasoning correction through reflection-guided, error-localized distillation. For each rollout, ROSD uses a self-reflector to extract a corrective idea and locate the first erroneous span. The corrective idea guides the self-teacher toward targeted supervision, while the localized error span restricts distillation to where correction is needed. This design corrects flawed reasoning while preserving valid prefixes. Experiments on multiple in-domain and out-of-domain reasoning benchmarks show that ROSD yields stronger in-domain reasoning performance overall and substantially better out-of-domain generalization than standard OPSD. Code is available at https://github.com/ZiqiZhao1/ROSD.
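The idea of restricting distillation to the erroneous span can be illustrated with a simple loss mask. This is a sketch under assumptions: the per-token losses and the span boundaries are hypothetical inputs, and the abstract does not specify the exact masking or normalization the authors use.

```python
def localized_distillation_loss(token_losses, error_start, error_end):
    """Mean per-token loss over [error_start, error_end); other tokens get weight 0."""
    mask = [1.0 if error_start <= i < error_end else 0.0
            for i in range(len(token_losses))]
    total = sum(m * loss for m, loss in zip(mask, token_losses))
    count = sum(mask)
    return total / count if count else 0.0

# Hypothetical per-token distillation losses for one rollout.
per_token = [0.2, 0.1, 0.9, 0.7, 0.3]
full_loss = sum(per_token) / len(per_token)                 # full-response distillation
local_loss = localized_distillation_loss(per_token, 2, 4)   # span-restricted distillation
```

With the mask in place, tokens in the valid prefix (indices 0 and 1 here) contribute nothing to the loss, so no gradient from the teacher signal would push on them.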


[34] KSAFE-MM: A Multimodal Safety Benchmark via Localized Contextualization for Korean Cultural Risks cs.CLPDF

Yongwoo Kim, Sojung An, Yunjin Park, Jungwon Yoon, Dujin Lee

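TL;DR: This paper presents KSAFE-MM, a benchmark for Korean multimodal safety evaluation with two parts: KSAFE-MM-G for general risks and KSAFE-MM-C for culture-specific risks. Samples are built through linguistic contextualization and localized visual queries, and are used to evaluate safety vulnerabilities of multimodal large language models (MLLMs) in Korean cultural contexts. An evaluation of 12 state-of-the-art MLLMs shows that models are more vulnerable to culturally grounded attacks, that jailbreaking strategies substantially raise attack success rates, and that there is a trade-off between safety and over-refusal.

Details

Motivation: Current MLLM safety evaluation tools have two main limitations: English-centric dataset construction, and a focus on generic risks that are not tied to local cultural contexts. This paper addresses both by providing a dedicated benchmark for multimodal safety evaluation in Korean cultural contexts.

Result: Twelve state-of-the-art MLLMs were evaluated on KSAFE-MM. Models are more vulnerable to culturally grounded attacks than to generic ones, and jailbreaking strategies substantially amplify attack success rates (ProgramExecution reaches up to 74.2% ASR, versus 13.4% for standard queries). Models that achieve low ASR also tend to over-refuse benign queries.

Insight: The novelty is a general-to-local construction pipeline that uses linguistic contextualization and localized visual queries derived from real-world contexts to evaluate both globally shared risks and culture-specific vulnerabilities. The work underscores the urgency of culturally grounded safety evaluation beyond English-centric benchmarks and offers a new methodological perspective on multimodal safety evaluation.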

Abstract: Multimodal Large Language Models (MLLMs) exacerbate safety risks by introducing vulnerabilities across multiple modalities, such as language and vision. Current MLLM safety evaluation tools, however, suffer from major limitations: 1) English-centric dataset construction, and 2) a focus on generic risks that are not tied to local cultural contexts. This paper introduces KSAFE-MM, a benchmark for Korean multimodal safety evaluation that covers both general safety risks and culture-specific vulnerabilities. KSAFE-MM consists of two parts, KSAFE-MM-G and KSAFE-MM-C. KSAFE-MM-G evaluates globally shared risks in Korean contexts through linguistic contextualization, which transforms generic safety queries into contextually grounded multimodal samples. KSAFE-MM-C targets culture-dependent MLLM safety vulnerabilities using localized visual queries derived from real-world contexts. It pairs these visual queries with jailbreak-style textual queries to cover multimodal safety risks involving cultural visual cues and malicious textual intent. Together, these components provide a general-to-local construction pipeline for evaluating both globally shared safety risks and culture-specific vulnerabilities. We evaluate 12 state-of-the-art MLLMs on KSAFE-MM and reveal that models exhibit greater vulnerability to culturally grounded attacks than to generic ones. Notably, jailbreaking strategies substantially amplify attack success rates, with ProgramExecution yielding up to 74.2% ASR compared to 13.4% for standard queries. Furthermore, we identify a systematic trade-off between safety and over-refusal, where models achieving low ASR tend to exhibit excessive refusal behavior on benign queries. These findings highlight the urgent need for culturally grounded safety evaluation beyond English-centric benchmarks.
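The two quantities behind the reported trade-off, attack success rate (ASR) and over-refusal rate, are simple proportions. The sketch below assumes each response has already been judged (unsafe or not, refused or not); the judging procedure, the function names, and the toy flags are illustrative assumptions, not the benchmark's evaluation code.

```python
def _rate(flags):
    """Fraction of True values; 0.0 for an empty list."""
    flags = list(flags)
    return sum(1 for f in flags if f) / len(flags) if flags else 0.0

def attack_success_rate(unsafe_on_harmful):
    """Share of harmful queries that produced a response judged unsafe."""
    return _rate(unsafe_on_harmful)

def over_refusal_rate(refused_on_benign):
    """Share of benign queries that the model refused."""
    return _rate(refused_on_benign)

# Hypothetical judgments for four harmful and four benign queries.
asr = attack_success_rate([True, False, False, True])
orr = over_refusal_rate([True, False, False, False])
```

Reporting both numbers side by side makes the trade-off visible: a model can drive `asr` down simply by refusing more, which shows up as a higher `orr`.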


[35] MemGuard: Preventing Memory Contamination in Long-Term Memory-Augmented Large Language Models cs.CL | cs.AI | cs.LGPDF

Hyeonjeong Ha, Jeonghwan Kim, Cheng Qian, Jiayu Liu, William M. Campbell

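TL;DR: This paper proposes MemGuard, a type-aware memory framework that addresses heterogeneous memory contamination in long-term memory-augmented large language models. It assigns each memory an explicit functional role at write time, maintains relations across type-isolated memories, and at retrieval time selectively composes evidence only from the necessary memory types, reducing interference from irrelevant or functionally incompatible evidence.

Details

Motivation: Existing memory systems collapse functionally distinct memories, such as stable user facts, episodic events, and behavioral rules, into a shared space, so they are retrieved and used as interchangeable evidence. This causes heterogeneous memory contamination: context-specific events become overgeneralized claims, or semantically relevant but functionally incompatible memories mislead generation.

Result: On hallucination and long-horizon conversation benchmarks, MemGuard improves memory reliability by up to 28.27% while retrieving up to 5.8x fewer memory tokens than prior methods, indicating a substantial efficiency gain without sacrificing reliability.

Insight: The novelty is a type-aware memory framework that organizes and uses heterogeneous memory in a principled way by explicitly assigning functional roles, isolating memory types, and selectively composing evidence. It offers a design direction for reliable long-term reasoning systems and highlights the importance of functional memory boundaries and selective use.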

Abstract: Memory-augmented large language models extend reasoning beyond a fixed context window by maintaining long-term memory across interactions. However, existing memory systems often collapse stable user facts, episodic events, and behavioral rules into a shared space, allowing functionally distinct memories to be retrieved and used as interchangeable evidence. We identify this failure mode as heterogeneous memory contamination, where context-specific events become overgeneralized claims, or semantically relevant but functionally incompatible memories mislead generation. To this end, we introduce MemGuard, a type-aware memory framework that preserves functional memory boundaries during memory construction and retrieval. It assigns each memory an explicit functional role at write time, maintains relations across type-isolated memories, and selectively composes evidence only from necessary memory types, reducing contamination from irrelevant or functionally incompatible evidence. Across hallucination and long-horizon conversation benchmarks, MemGuard improves memory reliability by up to 28.27% while retrieving up to 5.8x fewer memory tokens than prior methods. These results suggest that reliable long-term reasoning depends on principled organization and selective use of heterogeneous memory.
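A toy store can illustrate what "type-isolated" memory with selective retrieval might look like. Everything here is an assumption made for illustration: the `MemoryType` names mirror the three categories the abstract mentions, keyword matching stands in for real retrieval, and the relations across types that MemGuard maintains are not modeled.

```python
from dataclasses import dataclass, field
from enum import Enum

class MemoryType(Enum):
    FACT = "stable_user_fact"
    EPISODE = "episodic_event"
    RULE = "behavioral_rule"

@dataclass
class TypedMemoryStore:
    # One isolated bucket per functional type.
    items: dict = field(default_factory=lambda: {t: [] for t in MemoryType})

    def write(self, text, mem_type):
        """Assign the functional role at write time."""
        self.items[mem_type].append(text)

    def retrieve(self, query_terms, needed_types):
        """Search only the requested buckets (naive keyword match)."""
        terms = {t.lower() for t in query_terms}
        return [(mem_type, text)
                for mem_type in needed_types
                for text in self.items[mem_type]
                if terms & set(text.lower().split())]

store = TypedMemoryStore()
store.write("user is vegetarian", MemoryType.FACT)
store.write("user ate chicken at a party once", MemoryType.EPISODE)
fact_hits = store.retrieve(["user"], [MemoryType.FACT])
```

Because the one-off episode lives in a separate bucket, a question about the user's stable diet never sees it as evidence, which is the contamination the paper describes.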


[36] Beyond pass@k: Redundancy-Aware RLVR for Multi-Sample Code Generation cs.CL | cs.SEPDF

Le Bronnec Florian, Alexandre Verine, Rio Yokota, Benjamin Negrevergne

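TL;DR: This paper studies redundancy in LLM code generation under repeated sampling. It shows that correctness-only verifier-based reinforcement learning (RLVR) often concentrates generations around repeated implementations, whereas Pass@k-aware objectives maintain lower redundancy and improve larger-budget performance. The authors therefore augment RLVR with a direct anti-redundancy reward based on JPlag similarity to discourage near-duplicate generations, improving executable performance under a finite sampling budget.

Details

Motivation: The goal is to understand how verifier-based reinforcement learning (RLVR) affects redundancy among sampled programs in code generation: existing methods improve correctness but may cause implementation-level repetition, which hurts overall performance under a finite sampling budget.

Result: Experiments across 3 models and 3 benchmarks show that RLVR with anti-redundancy rewards reliably improves finite-budget executable performance, often matching or outperforming specialized Pass@k-aware objectives.

Insight: The novelty is using JPlag code-similarity detection to quantify redundancy and integrating a direct anti-redundancy reward into RLVR, which gives a way to balance correctness and diversity that could carry over to other multi-sample code generation tasks.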

Abstract: LLMs for code generation are commonly evaluated in repeated-sampling settings using Pass@k, where multiple candidate programs are executed against unit tests under a finite sampling budget. While recent verifier-based reinforcement learning (RLVR) methods improve executable correctness, how these objectives affect redundancy among sampled programs remains poorly understood. In this work, we study implementation-level redundancy in code generation using JPlag, a plagiarism-detection system for code. Across models and benchmarks, we show that correctness-only RLVR often concentrates generations around repeated implementations, whereas Pass@k-aware objectives maintain lower redundancy and improve larger-budget performance. Motivated by these observations, we augment RLVR with direct anti-redundancy rewards based on JPlag similarity. Across 3 models and 3 benchmarks, discouraging near-duplicate generations reliably improves finite-budget executable performance, often matching or outperforming specialized Pass@k-aware objectives.
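For readers unfamiliar with the metric, the sketch below shows the widely used unbiased Pass@k estimator, 1 - C(n-c, k)/C(n, k) for n samples of which c are correct, together with a toy anti-redundancy reward. Two assumptions should be stated plainly: the abstract does not say which Pass@k estimator the authors use, and `difflib.SequenceMatcher` is swapped in here as a stand-in for JPlag similarity. The reward shape and the `weight` parameter are hypothetical.

```python
from difflib import SequenceMatcher
from math import comb

def pass_at_k(n, c, k):
    """Unbiased Pass@k estimate from n samples with c correct ones."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

def redundancy_penalized_rewards(programs, correct, weight=0.5):
    """Correctness reward minus weight * similarity to the closest other sample."""
    rewards = []
    for i, prog in enumerate(programs):
        sims = [SequenceMatcher(None, prog, other).ratio()
                for j, other in enumerate(programs) if j != i]
        max_sim = max(sims, default=0.0)
        rewards.append((1.0 if correct[i] else 0.0) - weight * max_sim)
    return rewards

# Three toy "programs": two exact duplicates and one distinct sample.
rewards = redundancy_penalized_rewards(["a=1", "a=1", "xyz"], [True, True, True])
```

In this toy case the duplicated samples are each penalized while the distinct one keeps its full correctness reward, which is the behavior an anti-redundancy term is meant to encourage.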


[37] Knowledge Dependency Estimation for Reliable Question Answering cs.CLPDF

Chaodong Tong, Qi Zhang, Nannan Sun, Lei Jiang, Yanbing Liu

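TL;DR: This paper studies knowledge dependency estimation for reliable question answering: estimating how sensitive a fixed black-box QA model is to different candidate knowledge units. The authors propose Knot, a structured, rank-aware knowledge dependency estimator that learns from subset-level counterfactual supervision, models subset sensitivity through coverage over latent dependency factors, and derives rank-aware unit scores to identify influential candidate knowledge.

Details

Motivation: Reliable QA requires not only judging whether an answer is correct but also identifying which available knowledge the prediction depends on. In realistic LLM-based QA, knowledge comes from diverse sources and may be noisy and redundant, forming a candidate space instead of a clean evidence set, so a method is needed to estimate how much the model depends on that knowledge.

Result: On multiple-choice and generative QA benchmarks, Knot outperforms all compared baselines in subset-sensitivity prediction and produces more faithful unit rankings than deployable baselines without extra QA-model calls. Its dependency scores help flag error-prone QA predictions early in practical risk screening.

Insight: The novelty is formalizing knowledge dependency estimation as a structured learning problem that models redundancy, substitutability, and complementarity among knowledge units through subset-level supervision and coverage over latent dependency factors, yielding fine-grained, rank-aware dependency scores without exhaustive test-time perturbation. This provides a new way to understand and improve the reliability of black-box models in complex knowledge settings.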

Abstract: Reliable question answering requires identifying not only whether an answer is correct, but also which available knowledge the prediction depends on. In realistic LLM-based QA, this knowledge may come from context, retrieval, decomposition, or intermediate reasoning, forming a noisy and redundant candidate space rather than a clean gold evidence set. We study knowledge dependency estimation: estimating the sensitivity of a fixed black-box QA model to different candidate knowledge units. The challenge is to obtain fine-grained dependency scores without exhaustive test-time perturbation while modeling redundancy, substitutability, and complementarity. We propose Knot, a structured rank-aware knowledge dependency estimator. Knot learns from subset-level counterfactual supervision, models subset sensitivity through coverage over latent dependency factors, and derives rank-aware unit scores to identify influential candidates. Across multiple-choice and generative QA benchmarks, Knot outperforms all compared baselines in subset-sensitivity prediction and produces more faithful unit rankings than deployable baselines without extra QA-model calls; when used for practical risk screening, its dependency scores help flag error-prone QA predictions early.


[38] ConvMemory: A Lightweight Learned Memory Reranker, a Negative Attribution Result, and a Research-Preview Conflict Editor cs.CL | cs.IRPDF

Taiheng Pan

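TL;DR: ConvMemory is a lightweight learned memory reranker for conversational long-term memory retrieval, trained with cross-encoder teacher supervision over fused dense and lexical features. On the LongMemEval benchmark it exceeds the BGE-large cross-encoder in Recall@10 at 12-47x lower latency, stays close to mxbai-rerank-large-v1 on Clean500 at 28x lower cost, and remains low-latency on Stress1000 although the performance gap widens. The paper also publishes a negative attribution result on its temporal-window mechanism and releases CCGE-LA, a research-preview conflict-aware candidate-set editor.

Details

Motivation: To address the efficiency and cost of long-term memory retrieval in conversational systems by developing a lightweight reranker that balances retrieval performance against latency and compute overhead.

Result: On LongMemEval, ConvMemory exceeds the BGE-large cross-encoder in Recall@10 at 12-47x lower latency; on Clean500 it is within 0.025 Recall@10 of mxbai-rerank-large-v1 at 28x lower cost; on Stress1000 the gap widens to 0.081 at 117x lower latency. CCGE-LA yields modest but consistent gains on the supersession and stale/rescue slices of LoCoMo.

Insight: Contributions include a lightweight reranker design over fused dense and lexical features, and a rigorous experiment showing that the learned temporal window does not actually exploit temporal structure; its effect is better described as cross-encoder distillation. The research-preview CCGE-LA illustrates the potential value of conflict-aware editing in memory retrieval.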

Abstract: We describe ConvMemory, a small 3.6M-parameter learned reranker for conversational long-term memory retrieval, trained with cross-encoder teacher supervision over fused dense and lexical features. On the LongMemEval memory family, ConvMemory operates above the BGE-large cross-encoder in Recall@10 at 12-47x lower latency, remains within 0.025 Recall@10 of mxbai-rerank-large-v1 on Clean500 while running 28x cheaper; under Stress1000 distractors the Recall@10 gap widens to 0.081 but ConvMemory still operates at 117x lower latency; these LongMemEval numbers are single-run or single-seed and are reported as indicative cost-frontier evidence, not benchmark-grade. We then publish a rigorous negative attribution result on a previously claimed mechanism: a five-seed retrained ablation with paired bootstrap shows that ConvMemory’s learned temporal window is statistically significant on aggregate but not temporally specific, with the largest effects on hard non-temporal controls and no significant effect on multi-hop temporal queries. The honest description of the mechanism is cheap cross-encoder distillation in a fused dense+lexical feature space, not temporal-structure exploitation. We additionally release CCGE-LA, a low-amplitude conflict-aware candidate-set editor over ConvMemory, as a research preview with modest but consistent gains on supersession and stale/rescue slices on LoCoMo. All results are retrieval-stage; ConvMemory does not match mxbai-rerank-large-v1 in absolute LoCoMo MRR, and the report is single-author and not yet independently audited.
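The two retrieval metrics quoted above can be sketched in a few lines. Definitions of Recall@k vary between benchmarks (fraction of relevant items retrieved versus a per-query hit indicator); the version below uses the fraction-retrieved form as an assumption, and the identifiers are illustrative and not taken from LongMemEval or LoCoMo tooling.

```python
def recall_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of relevant items that appear in the top-k of the ranking."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    return len(set(ranked_ids[:k]) & relevant) / len(relevant)

def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant item, or 0.0 if none is retrieved."""
    relevant = set(relevant_ids)
    for rank, item in enumerate(ranked_ids, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0

# Hypothetical reranker output for one query.
ranking = ["m3", "m7", "m1"]
r_at_2 = recall_at_k(ranking, ["m7", "m9"], k=2)
rr = reciprocal_rank(ranking, ["m7"])
```

Averaging `reciprocal_rank` over queries gives MRR, the metric on which the abstract notes ConvMemory does not match mxbai-rerank-large-v1.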


[39] StoryLens: Preference-Aligned Story Rewriting via Context-Aware Narrative Enrichment cs.CL | cs.AIPDF

Hanwen Cui, Yuting Mei, Yuhang Fu, Dingyi Yang, Qin Jin

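TL;DR: StoryLens proposes story rewriting based on context-aware narrative enrichment, adapting existing narratives to reader preferences while preserving plot consistency and narrative coherence. The work introduces STORYLENSBENCH, a large-scale benchmark with structured story books and multi-dimensional reader preferences, and develops the evaluation model STORYLENSEVAL and the two-stage rewriting model STORYLENSWRITER. Experiments show that the method outperforms existing baselines on fidelity, coherence, and reader satisfaction.

Details

Motivation: Conventional style transfer does little to improve reader satisfaction in story rewriting. In a pilot human study, the authors find that style adaptation alone gives only a marginal 2.3% gain, whereas context-enhanced rewriting improves user preference alignment by 24.5%. The paper therefore targets context-aware narrative enrichment beyond surface-level stylistic adaptation for more effective personalized story rewriting.

Result: On the proposed STORYLENSBENCH benchmark, STORYLENSWRITER consistently outperforms strong generation and personalization baselines across fidelity, coherence, and reader satisfaction.

Insight: The core contribution is showing that context-aware narrative enrichment, not just surface-level style transfer, matters for personalized story rewriting. Concrete contributions are a large-scale, structured benchmark for preference-aligned story rewriting and a two-stage rewriting model that combines supervised fine-tuning with GRPO-based reinforcement learning, which provide reusable evaluation and modeling methods for personalized content generation.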

Abstract: Story rewriting aims to adapt existing narratives to diverse reader preferences while preserving plot consistency and narrative coherence. Unlike conventional work on style transfer, we argue that effective story rewriting demands context-aware narrative enrichment beyond surface-level stylistic adaptation. Our pilot human study shows that style adaptation alone provides only marginal gains in reader satisfaction (2.3%), while context-enhanced rewriting substantially improves user preference alignment (24.5%). Motivated by this, we introduce STORYLENSBENCH, a large-scale benchmark for preference-aligned story rewriting, comprising structured story books, multi-dimensional reader preference profiles, and ranked context-aware rewritten stories. Building on this benchmark, we propose STORYLENSEVAL, a reward model for estimating reader satisfaction over rewritten stories, and STORYLENSWRITER, a two-stage rewriting model combining supervised fine-tuning with GRPO-based reinforcement learning. We further establish a comprehensive evaluation framework covering fidelity, coherence, and reader satisfaction. Experimental results demonstrate that STORYLENSWRITER consistently outperforms strong generation and personalization baselines, highlighting the importance of context-aware narrative enrichment for personalized story rewriting.


[40] SMILE-Next: Teaching Large Language Models to Detect, Classify, and Reason about Laughter cs.CL | cs.AIPDF

Lee Jung-Mok, Kim Sung-Bin, Joohyun Chang, Lee Hyun, Tae-Hyun Oh

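TL;DR: This paper presents the SMILE-Next dataset and a laughter-specialized large language model for real-world laughter understanding. Using laughter-specific Self-Instruct and a Mixture-of-Laugh-Experts (MoLE) framework, the model achieves nuanced understanding across three tasks (laughter detection, classification, and reasoning) and substantially outperforms multimodal LLM baselines.

Details

Motivation: Laughter is a complex social signal that conveys communicative intent beyond amusement, yet prior work has focused on isolated laughter analysis tasks, and comprehensive understanding of laughter in real-world scenarios remains underexplored.

Result: Experiments show that combining laughter-specific Self-Instruct with the MoLE framework substantially outperforms multimodal LLM baselines, advancing robust real-world laughter understanding.

Insight: The novelty is building SMILE-Next, a real-world laughter understanding dataset with multimodal textual representations and question-answer annotations, and proposing laughter-specific Self-Instruct generation and a task-adaptive expert routing mechanism to improve generalization and task-specific performance.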

Abstract: Laughter is a complex social signal that conveys communicative intent beyond amusement. While prior work has focused on isolated laughter analysis tasks, a comprehensive understanding of laughter in real-world scenarios remains underexplored. Therefore, we introduce SMILE-Next, a dataset for real-world laughter understanding with multimodal textual representations and question-answer annotations across three tasks: laughter detection, laughter type classification, and laughter reasoning. Building upon SMILE-Next, we aim to develop a laughter-specialized large language model capable of nuanced understanding of laughter in real-world contexts. To this end, we propose two key components: laughter-specific Self-Instruct and the Mixture-of-Laugh-Experts (MoLE) framework. Laughter-specific Self-Instruct enhances generalization across tasks and domains by automatically synthesizing diverse laughter-centric instructions. MoLE introduces a task-adaptive expert routing mechanism that dynamically selects specialized experts tailored to each laughter-related task, improving task-specific performance and efficiency. Experimental results show that the combination of our proposed components substantially outperforms multimodal LLM baselines, advancing robust real-world laughter understanding. Project page is at: https://mok0102.github.io/smile-next/.


[41] ConRAG: Consensus-Driven Multi-View Retrieval for Multi-Hop Question Answering cs.CLPDF

Yikai Zhu, Kunfeng Chen, Qihuang Zhong, Juhua Liu, Bo Du

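TL;DR: This paper proposes ConRAG, a consensus-driven multi-view retrieval-augmented generation framework for improving large language models on complex multi-hop question answering. It systematically optimizes both the query and corpus sides and uses multi-view evidence (relation, entity, and text signals) for more accurate retrieval. Experiments on three multi-hop QA benchmarks show that ConRAG clearly outperforms all baselines.

Details

Motivation: Current multi-hop RAG methods focus mainly on query-side task decomposition or corpus-side knowledge graph construction, yet still struggle on complex multi-hop QA. This paper aims to improve retrieval accuracy by optimizing both sides and integrating multi-view evidence.

Result: Extensive experiments on three multi-hop QA benchmarks show that ConRAG consistently outperforms all baselines by a clear margin, e.g., up to +26.9% average performance gain over vanilla RAG, and enables Gemma-4-31B to set a new state of the art on the challenging MuSiQue benchmark.

Insight: The novelty is a consensus-driven multi-view retrieval framework that integrates query-side optimization, corpus-side optimization, and multi-view evidence (relation, entity, and text signals) to improve retrieval accuracy for multi-hop QA. Its core contribution is combining multi-source evidence fusion with optimization on both sides.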

Abstract: Retrieval-augmented generation (RAG) has emerged as a promising paradigm for enhancing large language models (LLMs) on multi-hop question answering (QA), which requires reasoning over evidence from multiple documents. Current multi-hop RAG methods generally focus on either query-side task decomposition or corpus-side knowledge graph construction. Despite their progress, these methods still struggle to achieve satisfactory performance on complex multi-hop QA tasks. To this end, we propose ConRAG, a consensus-driven multi-view RAG framework that effectively boosts LLMs on complex multi-hop QA. The core of ConRAG is to systematically optimize both the query and corpus sides and to leverage multi-view evidence (relation, entity, and text signals) for more accurate retrieval. Extensive experiments on three multi-hop QA benchmarks show that ConRAG consistently outperforms all baselines by a clear margin, e.g., up to +26.9% average performance gains over vanilla RAG, and enables Gemma-4-31B to achieve a new state-of-the-art record on the challenging MuSiQue benchmark.
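The abstract does not describe how ConRAG forms a consensus across its relation, entity, and text views. As a generic illustration of multi-view fusion, the sketch below uses reciprocal rank fusion (RRF), a standard rank-aggregation technique that is swapped in here and is not claimed to be the authors' method. The view names follow the abstract; the document ids and the constant `k=60` (a common RRF default) are assumptions.

```python
def reciprocal_rank_fusion(view_rankings, k=60):
    """Fuse per-view rankings: score(d) = sum over views of 1 / (k + rank of d)."""
    scores = {}
    for ranking in view_rankings.values():
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    # Higher fused score first; ties broken by document id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical top results from three retrieval views.
views = {
    "relation": ["d1", "d2"],
    "entity": ["d2", "d3"],
    "text": ["d2", "d1"],
}
fused = reciprocal_rank_fusion(views)
```

A document that several views agree on (here `d2`) rises to the top even when no single view ranks it first everywhere, which is the intuition behind consensus-style retrieval.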


Zerui Chen, Qinggang Zhang, Zhishang Xiang, Zhimin Wei, Linfeng Gao

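TL;DR: This paper proposes LegalGraphRAG, a framework that addresses two challenges in applying graph-based retrieval-augmented generation (GraphRAG) to legal reasoning: the heterogeneity of legal corpora and the lack of transparent verification in traditional RAG. It builds a hierarchical legal knowledge graph to organize multi-granular knowledge and uses a multi-agent system of a Researcher, an Auditor, and an Adjudicator to retrieve, verify, and synthesize evidence for reliable legal reasoning. Experiments show state-of-the-art performance on legal analysis tasks.

Details

Motivation: Applying GraphRAG to legal reasoning faces two problems: legal corpora contain knowledge of heterogeneous granularity, and traditional RAG passes retrieved context on without verification, which makes reasoning opaque and error-prone.

Result: Extensive experiments show that LegalGraphRAG achieves state-of-the-art performance on legal analysis, outperforming existing GraphRAG baselines in accuracy and trustworthiness.

Insight: The novelty is a hierarchical legal knowledge graph that supports retrieval at different abstraction levels, together with a multi-agent system with a clear division of labor (retrieval, verification, adjudication) that keeps the reasoning process transparent and reliable. The architecture could inform other trustworthy domain-specific RAG systems.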

Abstract: Graph-based Retrieval-Augmented Generation (GraphRAG) improves on flat document retrieval by structuring knowledge as relational graphs, enabling more coherent and effective reasoning. However, applying it to specific domains like legal reasoning faces critical challenges. (i) Legal corpora are heterogeneous, containing multi-granular knowledge from cases, articles and interpretations. A flat knowledge graph cannot adequately differentiate between factual details, applied rules, and abstract principles, limiting accurate retrieval. (ii) Reliable legal judgment demands transparent, evidence-based reasoning. Traditional RAG passes retrieved context directly to an LLM without verification, resulting in opaque, error-prone reasoning. To this end, we propose LegalGraphRAG, a framework designed for reliable legal reasoning. Our approach introduces two core components: a hierarchical legal graph that organizes legal sources to enable retrieval at appropriate abstraction levels, and a multi-agent system for reliable legal reasoning, where a Researcher retrieves candidate evidence, an Auditor rigorously verifies its validity against source documents, and an Adjudicator synthesizes the set of verified evidence to render a final judgment. Extensive experiments show that LegalGraphRAG achieves state-of-the-art performance, outperforming existing GraphRAG baselines in accurate and trustworthy legal analysis. Our code, datasets and implementation details are available at https://github.com/XMUDeepLIT/LegalGraphRAG.


[43] Risk-aware Selective Prompting for Hallucination Mitigation in Large Vision-Language Models cs.CLPDF

Yuang Huang, Yafeng Zhang, Yu Zilan

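TL;DR: This paper systematically studies prompt-based verification for mitigating hallucinations in large vision-language models (LVLMs) and finds it to be a risk-bearing intervention: its corrections increase with input difficulty, but newly introduced errors persist across all difficulty levels. Always-on prompting therefore helps on hard inputs but offers little benefit, and can even harm, on easier ones. Further analysis shows that verification prompts redistribute attention from visual tokens toward instruction tokens and induce a distinct middle-layer entropy pattern. Motivated by this input-dependent risk, the authors propose Risk-aware Selective Prompting (RSP), a training-free method that uses pre-generation uncertainty signals to trigger verification selectively.

Details

Motivation: Prompt-based verification is widely used to mitigate hallucinations in LVLMs, but when it helps is unclear. This paper aims to systematically understand how verification prompting works and to address its inconsistent risk across inputs of different difficulty.

Result: Across two representative LVLM architectures and hallucination benchmarks, always-on prompting helps on hard inputs but offers limited benefit, or even harm, on easier ones. The proposed RSP mitigates the degradation caused by always-on prompting while preserving baseline performance, and reveals that effective selection signals differ across architectures.

Insight: The main contribution is showing that the risk profile of verification prompting depends on input difficulty, and proposing a training-free method that triggers verification selectively from pre-generation uncertainty signals. The core insight is to treat verification as a conditional intervention and not as an always-beneficial enhancement, with attention redistribution and entropy-pattern analysis providing a mechanistic explanation. RSP illustrates the potential of using internal model signals for adaptive decisions.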

Abstract: Prompt-based verification is widely used to mitigate hallucinations in large vision-language models (LVLMs), yet when it helps remains poorly understood. We systematically study verification prompting across two representative LVLM architectures and hallucination benchmarks, and find that it is a risk-bearing intervention: its corrections increase with input difficulty, while newly introduced errors persist across difficulty levels. As a result, always-on prompting helps on hard inputs but offers little benefit – and can harm – easier ones. Our analysis further shows that this behavior is associated with a conservative output shift. Verification prompts redistribute attention from visual tokens toward instruction tokens and induce a distinct middle-layer entropy pattern absent in a neutral-prompt control, suggesting instruction-conditioned attention redistribution rather than uniformly improved visual grounding. Motivated by this input-dependent risk, we propose Risk-aware Selective Prompting (RSP), a training-free approach that uses pre-generation uncertainty signals to trigger verification selectively. RSP mitigates the degradation of always-on prompting while preserving baseline performance, and reveals that effective selection signals vary across architectures.
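Selective triggering can be pictured as a gate on an uncertainty score computed before generation. The sketch below uses the Shannon entropy of a next-token distribution as that score, purely as an assumption: the abstract says only that pre-generation uncertainty signals are used and that the effective signal varies across architectures. The threshold, prompts, and probabilities are hypothetical.

```python
from math import log

def entropy(probs):
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * log(p) for p in probs if p > 0)

def choose_prompt(first_token_probs, threshold, base_prompt, verify_prompt):
    """Use the verification prompt only when pre-generation uncertainty is high."""
    if entropy(first_token_probs) > threshold:
        return verify_prompt
    return base_prompt

BASE = "Answer the question about the image."
VERIFY = "Check the image carefully before answering."

easy_choice = choose_prompt([0.97, 0.02, 0.01], 0.5, BASE, VERIFY)  # confident model
hard_choice = choose_prompt([0.40, 0.35, 0.25], 0.5, BASE, VERIFY)  # uncertain model
```

Confident inputs keep the plain prompt and so avoid the newly introduced errors the paper attributes to always-on verification, while uncertain inputs still receive the verification prompt.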


[44] Better heads do not guarantee better binarized constituency parsing cs.CLPDF

Zeyao Qi, Yige Chen, Eitan Klinger, Vivaan Wadhwa, Jungyeul Park

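TL;DR: This paper revisits punctuation-aware tree binarization for constituency parsing and asks whether dependency-induced headedness improves supervision for binary parsers. Although learned heads substantially outperform rule-based heads in intrinsic head prediction, they do not yield consistent parsing gains after debinarization. In particular, under punctuation-sensitive evaluation, learned headedness underperforms rule-based binarization in macro-average F1, despite a small overall gain on CTB.

Details

Motivation: The study asks whether dependency-induced headedness can serve as a binarization control signal that improves parser performance, particularly for punctuation-sensitive parsing.

Result: On the Chinese Treebank (CTB), learned headedness gives a small overall F1 gain but underperforms rule-based binarization in macro-average punctuation-sensitive F1. Cross-treebank transfer experiments show similar instability.

Insight: The paper shows that linguistically better-motivated head prediction does not necessarily translate into better parsing, especially under punctuation-sensitive metrics. It is a negative result: better head prediction does not guarantee better punctuation-sensitive constituency parsing, which prompts a rethink of how binarization control signals are chosen.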

Abstract: We revisit punctuation-aware tree binarization for constituency parsing and ask whether dependency-induced headedness improves binary parser supervision. Although learned heads substantially outperform rule-based heads in intrinsic head prediction, they do not yield consistent parsing gains after debinarization. In particular, punctuation-conditioned evaluation shows that learned headedness underperforms rule-based binarization in macro-average punctuation-sensitive $F_1$, despite a small overall gain on CTB. Similar instability appears under cross-treebank transfer. These results suggest that linguistically grounded headedness is not necessarily parser-optimal when used as a binarization control signal. The paper presents a negative result: better head prediction does not imply better punctuation-sensitive constituency parsing.


[45] DEPART: DEcomposing PARiTy across Multilingual LLMs cs.CL | cs.AIPDF

Manan Uppadhyay, Prashant Kodali, Pranjal Chitale, Reshma Ramaprasad, Himanshu Beniwal

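TL;DR: This paper proposes DEPART, a two-step Bayesian hierarchical analysis that decomposes performance disparities in multilingual large language models (mLLMs). The first step isolates the variance attributable to language identity and finds that observable language features (script, family, typological distance) explain 79% of this variance on understanding tasks and 92% on reasoning tasks, with a model's internal representational similarity to English as the dominant predictor. The second step decomposes the full (model × benchmark × language) cube and shows that variance on understanding tasks is dominated by model identity (66.7%), whereas variance on reasoning tasks is dominated by the benchmark × model interaction (46.3%).

Details

Motivation: Current mLLM leaderboards report only per-language accuracy and do not explain where disparities come from, leaving systemic biases unattributed and offering no actionable levers for improvement.

Result: Distribution-free Friedman and Kruskal–Wallis tests confirm that the performance gaps are systematic. Language features explain 79% of the language-attributable variance on understanding tasks and 92% on reasoning tasks; a model's internal representational similarity to English is the dominant predictor across both task groups; model identity accounts for 66.7% of variance on understanding tasks, and the benchmark × model interaction accounts for 46.3% on reasoning tasks.

Insight: The novelty is turning multilingual evaluation from passive performance mapping into an explainable diagnostic framework: the Bayesian hierarchical decomposition identifies the root drivers of language disparity and provides concrete levers for targeted improvement (for example, focusing on a model's internal representational similarity to English).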

Abstract: Multilingual Large Language Model (mLLM) leaderboards report per-language accuracy but rarely explain why disparities emerge, leaving systemic biases unattributed and offering practitioners no actionable levers. We first establish that these gaps are systematic rather than artifacts of sampling noise via distribution-free Friedman and Kruskal–Wallis tests, then introduce a two-step Bayesian hierarchical framework that decomposes multilingual performance variance into interpretable components. First, isolating the variance attributable to language identity, we show that observable language features (script, family, typological distance) explain $R^2_{\text{ling}} = 79\%$ of this variance on understanding tasks and $92\%$ on reasoning, with a model’s internal representational similarity to English emerging as the dominant predictor across both task buckets. Second, decomposing the full (model$\times$benchmark$\times$language) cube, we find that NLU and reasoning have fundamentally divergent variance profiles: model identity dominates understanding ($66.7\%$ of variance), whereas the benchmark$\times$model interaction dominates reasoning ($46.3\%$). Together these results recast multilingual evaluation from passive performance mapping into an explainable, diagnostic framework with concrete levers for targeting the root drivers of language disparity.


Sebastian Nagl, Ann-Kristin Mayrhofer, Martin Heidebach, Aleyna Koçak, Anne Zettelmeier

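TL;DR: This paper introduces BenGER, a dataset for evaluating large language models on subsumption-based legal reasoning in German law. It contains 596 exam-style free-text legal case tasks and 531 short doctrinal reasoning tasks. The study evaluates 12 contemporary LLM systems and introduces a rubric-aligned LLM-as-a-Judge framework; closed flagship systems lead across all corpora, and human–AI co-creation substantially outperforms unaided human work.

Details

Motivation: German law lacks a dedicated benchmark for evaluating the subsumption-based reasoning of LLM systems; the paper aims to provide a comprehensive evaluation framework for such legal reasoning tasks.

Result: Twelve LLM systems were evaluated on BenGER, with closed flagship models leading across all corpora. Human–AI co-creation substantially outperforms unaided human work under timed conditions. The LLM-as-a-Judge framework agrees closely with human reviewers (Calderon r=0.96).

Insight: Contributions include BenGER, a comprehensive benchmark for subsumption-based reasoning in German law, and a cross-validated LLM-as-a-Judge evaluation framework that reduces the human grading burden while remaining highly reliable.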

Abstract: We introduce the BenGER (Benchmark for German Law) dataset for evaluating LLM systems on subsumption-based legal reasoning in German law. The BenGER dataset consists of three components: 596 exam-style free-text legal case tasks across multiple levels of legal education and 531 short doctrinal reasoning tasks. We evaluate 12 contemporary LLM systems – closed flagship, efficiency-oriented, and open-weight – across automatic and judge-based metrics. On a controlled validation subset of timed human-written solutions under both unaided and human–AI co-creation conditions, we contextualise model performance against these human baselines. We introduce a rubric-aligned LLM-as-a-Judge framework cross-validated against a multi-rater human-grading protocol (three blind reviews plus one author-informed creator review per solution). Our results show that replacing a blind human reviewer with the LLM judge degrades agreement with the full human pool no more than removing that reviewer altogether (Calderon r=0.96 vs. r=0.96, matched n=30), that closed-flagship systems lead the leaderboard across all corpora, and that human–AI co-creation substantially outperforms unaided human work.


[47] When Confidence Misleads: Suffix Anchoring and Anchor-Proximity Confidence Modulation for Diffusion Language Models cs.CLPDF

Jungwon Park, Jimyeong Kim, Jungmin Ko, Nojun Kwak, Wonjong Rhee

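TL;DR: This paper studies how model confidence can mislead position selection in fully non-autoregressive decoding of diffusion language models, in particular how high-confidence EOT tokens cause incomplete generation and how suffix anchors introduce local overconfidence. The authors propose Suffix-Anchored Confidence Modulation, a training-free method that inserts a short suffix anchor to encourage response completion and modulates confidence near the anchor according to decoding progress, improving decoding performance across multiple benchmarks.

Details

Motivation: The paper revisits the assumption behind confidence-based position selection in diffusion language model decoding: high-confidence positions (such as EOT tokens) may not actually be ready to decode, which leads to incomplete generation or premature decoding of anchor-adjacent tokens. These misleading-confidence problems must be addressed to improve fully non-autoregressive decoding.

Result: On text-only reasoning, vision-language reasoning, and code-generation benchmarks, the method consistently improves confidence-based fully non-autoregressive decoding, outperforms explicit EOT suppression, and preserves the parallel decoding advantage of fully non-autoregressive generation.

Insight: The novelty is Suffix-Anchored Confidence Modulation, a simple training-free decoding strategy that dynamically modulates confidence near the anchor to balance response completion against premature decoding, offering a new confidence-calibration perspective on diffusion language model decoding.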

Abstract: Diffusion language models decode text by iteratively denoising masked token sequences, making the choice of which positions to decode a central inference-time decision. Most training-free decoding strategies use model confidence for position selection, assuming that high-confidence positions are ready to be decoded. In this work, we revisit this assumption by studying when confidence misleads fully non-autoregressive (fully non-AR) decoding. EOT tokens can receive high confidence and cause incomplete generation; inserting a suffix anchor can mitigate this issue but introduces local overconfidence near the anchor, causing anchor-adjacent tokens to be decoded too early. To address these issues, we propose Suffix-Anchored Confidence Modulation, a simple training-free method that inserts a short suffix anchor to encourage response completion and modulates confidence near the anchor according to decoding progress. This preserves the response-completion benefit of suffix anchoring while reducing premature decoding of anchor-adjacent tokens. Across text-only reasoning, vision-language reasoning, and code-generation benchmarks, our method consistently improves confidence-based fully non-AR decoding, outperforms explicit EOT suppression, and preserves the parallel decoding advantage of fully non-AR generation.


[48] Framing Matters: Addressing Framing Sensitivity in Decision-Making through Behaviorally-Grounded Value Alignment cs.CLPDF

Seojin Hwang, Minju Kim, Junhyuk Choi, JeongHyun Park, Hwanhee Lee

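TL;DR: This paper studies the sensitivity of large language models to semantic framing in decision-making tasks and finds that changing only the presentation (value-tinted narration, temporal slice, and narrative vividness) while keeping the facts fixed can significantly destabilize model decisions. The authors build the Fragile benchmark to evaluate the problem systematically and propose Valign, a representation-level method that reduces framing-induced decision flips by anchoring decisions to a stable value prior, steering hidden states toward the value-consistent direction, and projecting out temporal-vividness-sensitive directions.

Details

Motivation: LLMs are increasingly deployed in high-stakes decision-making settings such as legal reasoning, where consistent decisions under factually equivalent inputs are critical, yet existing models' sensitivity to semantic framing can undermine that consistency.

Result: Experiments on Fragile show that LLMs are highly sensitive to framing, with an average decision flip rate of 28.6%. Valign consistently reduces framing-induced decision flips, indicating that intervening directly on the internal pathways through which framing operates is an effective mitigation strategy.

Insight: The paper systematically identifies and quantifies framing sensitivity in LLMs and improves decision robustness through representation-level intervention, as opposed to prompt-level or activation-level intervention, that directly aligns the model's internal value representations. This gives value-alignment research a behaviorally grounded perspective.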

Abstract: Large Language Models (LLMs) are increasingly deployed in high-stakes decision-making settings such as legal reasoning, where consistency under factually equivalent inputs is critical. However, we find that fact-preserved but differently framed inputs can significantly destabilize LLM decisions. To systematically investigate this problem, we introduce Fragile, a large-scale benchmark that isolates fact-preserving semantic framing across three controlled dimensions: value-tinted narration, temporal slice, and narrative vividness. Our experiments reveal a high susceptibility of LLMs to framing, with an average decision flip rate of 28.6%. We find that simple prior prompt-level and activation-level interventions not only fail to suppress framing sensitivity but actively amplify it. We therefore propose Valign, a representation-level method that explicitly targets these framing dimensions by anchoring decisions to a stable value prior, steering hidden states toward the model’s value-consistent direction, and projecting out temporal-vividness-sensitive directions from the model’s hidden states. Valign consistently reduces framing-induced decision flips, demonstrating that robust mitigation requires directly targeting the internal pathways in which framing operates.
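Two operations mentioned in the abstract have simple generic forms: the decision flip rate, and projecting a direction out of a hidden state via h' = h - ((h·d)/(d·d)) d. The sketch below shows both with plain Python lists. It is an illustration under assumptions: how Valign finds the temporal-vividness-sensitive directions, and at which layers it intervenes, is not stated in the abstract, and all names and values here are hypothetical.

```python
def flip_rate(base_decisions, framed_decisions):
    """Share of cases whose decision changes under a fact-preserving reframing."""
    if len(base_decisions) != len(framed_decisions):
        raise ValueError("decision lists must have equal length")
    if not base_decisions:
        return 0.0
    flips = sum(b != f for b, f in zip(base_decisions, framed_decisions))
    return flips / len(base_decisions)

def project_out(hidden, direction):
    """Remove the component of `hidden` that lies along `direction`."""
    norm_sq = sum(d * d for d in direction)
    if norm_sq == 0:
        return list(hidden)  # nothing to remove for a zero direction
    coeff = sum(h * d for h, d in zip(hidden, direction)) / norm_sq
    return [h - coeff * d for h, d in zip(hidden, direction)]

# Hypothetical decisions on four cases, before and after reframing.
rate = flip_rate(["A", "B", "A", "A"], ["A", "A", "A", "B"])
cleaned = project_out([3.0, 4.0], [1.0, 0.0])
```

After projection the hidden state has no component along the removed direction, so a downstream readout that depended only on that direction can no longer be moved by it.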


[49] CIRF: Tokenizing Chain-of-Thoughts into Reusable Functional Units for Efficient Latent Reasoning in Large Language Models cs.CLPDF

Yukyung Lee, Yumeng Shen, Jinhyeong Park, Hyein Yang, Jun-Hyung Park

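TL;DR: This paper proposes CIRF, a framework that decomposes chain-of-thought (CoT) into reusable functional units to make implicit reasoning in large language models more efficient. By mapping explicit reasoning steps to discrete functional tokens, it aligns implicit reasoning with explicit reasoning and supports adaptive reasoning.

Details

Motivation: Existing implicit CoT methods typically lack alignment with explicit reasoning and adaptivity to problem complexity, which limits both efficiency and effectiveness.

Result: On mathematical, symbolic, and commonsense reasoning benchmarks, CIRF achieves a better accuracy-latency trade-off than state-of-the-art implicit CoT methods and shows consistent performance gains.

Insight: The novelty lies in decomposing reasoning into semantically coherent functional units and tokenizing them, which gives implicit reasoning structure, interpretability, and parallel trainability, and thereby improves efficiency and adaptivity.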

Abstract: Implicit Chain-of-Thought (CoT) reduces the inference cost of large language models by internalizing the explicit rationales. However, existing approaches typically lack alignment with explicit rationales and adaptivity to example complexity. In this work, we propose CIRF (Chain-of-thoughts Into Reusable Functional units), an implicit CoT framework that performs reasoning as a dynamic sequence of discrete functional tokens. CIRF assigns a functional token to each semantically coherent reasoning unit in explicit CoT traces. The model is then fine-tuned to autoregressively generate functional tokens and their optional results, followed by the final answer. This design aligns latent reasoning with a sequence of functional units, facilitating parallel training, explicit rationale alignment, and adaptive reasoning. Extensive experiments on mathematical, symbolic, and commonsense reasoning benchmarks show that CIRF provides a favorable accuracy-latency trade-off compared with state-of-the-art implicit CoT methods. Further analyses demonstrate that CIRF constructs distinct, interpretable functional tokens, leading to consistent performance improvements.


[50] Revisiting Anthropomorphic Reflection Markers in Large Language Model Reasoning cs.CL | cs.AIPDF

Yahan Yu, Noa Nakanishi, Fei Cheng

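TL;DR: This paper revisits the anthropomorphic reflection markers (such as "wait" and "hmm") that large language models produce during complex reasoning. It suppresses these markers through prompt-level and token-level interventions and analyzes the effect on reasoning performance. The markers turn out not to be uniformly necessary: suppressing them preserves or improves performance in several settings, and models can still perform marker-free verification.

Details

Motivation: To examine whether anthropomorphic reflection markers are necessary for LLM reasoning and what role they play, clarifying how valid they are as visible indicators of reflection and addressing the risk of overthinking associated with redundant markers.

Result: Across four benchmarks and two model scales, suppressing the markers preserves or improves task performance in several settings, especially under larger sampling budgets; models still carry out marker-free verification.

Insight: Anthropomorphic reflection markers tend to be surface cues rather than reliable proxies for reflection itself, which motivates research on reasoning mechanisms beyond explicit marker patterns. Marker suppression is also a potential strategy for improving reasoning efficiency.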

Abstract: Large Language Models (LLMs) often produce explicit reflective traces during complex reasoning, accompanied by anthropomorphic markers such as wait, hmm, and alternatively. Although these markers are commonly used as visible indicators of reflection, their mechanisms remain unclear, which leaves open the risk of overthinking associated with redundant and repetitive reflection markers. In this work, we revisit anthropomorphic reflection markers, examining their necessity for reasoning and their role in reflection. We suppress these markers through prompt-level and token-level interventions, and analyze their effects on task performance across four benchmarks and two model scales. Our results show that anthropomorphic markers are not uniformly necessary for reasoning performance: suppressing them can preserve or improve performance in several settings, especially under larger sampling budgets. Meanwhile, marker suppression does not necessarily remove reflection behavior, as models can still perform marker-free verification. These findings suggest that anthropomorphic markers tend to be surface cues rather than reliable proxies for reflection itself, and motivate future research on reasoning mechanisms beyond explicit marker patterns.


[51] Argument Quality Assessment with Large Language Models: A Pairwise Bradley-Terry Approach cs.CLPDF

Nicolás Benjamín Ocampo, Agnes Paullate Nyiranziza, Davide Ceolin

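TL;DR: This paper studies how well large language models (LLMs) can assess argument quality. The authors test 12 open-weight LLMs of different sizes and families under zero-shot, few-shot, and chain-of-thought prompting on pairwise comparisons of arguments along three dimensions (logical, rhetorical, and dialectic), then use a Bradley-Terry model to infer latent strength scores and rank the arguments.

Details

Motivation: LLMs perform well on reasoning and judgment tasks, but assessing argument quality requires rigorous evaluation. The study asks whether LLMs can perform this task effectively.

Result: LLMs show promising but moderate correlation with human expert judgments. Llama-70B aligns best (Cohen’s κ = 0.493; correlations with Bradley-Terry scores between 0.327 and 0.477). Other LLMs show weak, moderate, or high alignment with Llama-70B while achieving comparable results against human experts. LLM predictions are stable across runs, with fewer than 7.75% of cases yielding different labels.

Insight: The novelty is combining pairwise LLM comparisons with a Bradley-Terry model to infer latent argument strength and produce a ranking. The approach offers a systematic, quantifiable framework for fine-grained, multi-dimensional argument quality assessment with LLMs, and it shows that different models have a partial but complementary understanding of the quality dimensions.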

Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in tasks related to reasoning and judgment. However, assessing the quality of arguments requires a rigorous evaluation. We investigate the extent to which LLMs can effectively perform this task. We tested 12 open-weight LLMs of different sizes and families under zero-shot, few-shot, and chain-of-thought prompting to approximate expert pairwise comparisons of argument quality across three dimensions (logical, rhetorical, and dialectic) and used these comparisons in a Bradley-Terry model to infer latent strength scores and derive a ranking of arguments. Our results show that LLMs have promising but moderate correlation with human expert judgments, with Llama-70B obtaining the strongest alignment, reaching moderate Cohen’s $κ$ = 0.493 and moderate correlations with Bradley-Terry scores derived from these annotations (Kendall, Pearson, and Spearman: 0.327-0.477). Other LLMs exhibit weak, moderate, or high alignment with Llama-70B while achieving comparable results against human experts, suggesting partial but complementary understanding of underlying quality dimensions despite differences in model size and family. Moreover, LLM predictions are stable across trial runs, with fewer than 7.75% of cases yielding different labels. Remaining variability is handled via majority voting and few-shot prompting for large-size models.
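
To make the Bradley-Terry step concrete, here is a minimal sketch of fitting latent strengths from pairwise outcomes with the standard minorization-maximization update. It is an illustration only: the paper does not state which fitting procedure it uses, and the function name, the `(winner, loser)` input format, and the iteration count are assumptions. The sketch also assumes every item wins at least once, since an item with zero wins has strength 0 under maximum likelihood.

```python
def fit_bradley_terry(n_items, comparisons, iters=200):
    """Estimate Bradley-Terry strengths from (winner, loser) index pairs.

    Under the model, P(i beats j) = p[i] / (p[i] + p[j]).
    Returns strengths normalized to sum to 1.
    """
    wins = [0] * n_items
    pair_counts = {}  # unordered pair -> number of comparisons
    for winner, loser in comparisons:
        wins[winner] += 1
        key = (min(winner, loser), max(winner, loser))
        pair_counts[key] = pair_counts.get(key, 0) + 1

    p = [1.0 / n_items] * n_items
    for _ in range(iters):
        denom = [0.0] * n_items
        for (i, j), n in pair_counts.items():
            share = n / (p[i] + p[j])
            denom[i] += share
            denom[j] += share
        # MM update: p_i <- wins_i / sum_j n_ij / (p_i + p_j)
        new_p = [wins[i] / denom[i] if denom[i] > 0 else p[i]
                 for i in range(n_items)]
        total = sum(new_p)
        p = [x / total for x in new_p]
    return p

# Toy data: argument 0 usually beats 1, and 1 usually beats 2.
data = ([(0, 1)] * 3 + [(1, 0)] + [(1, 2)] * 3 + [(2, 1)]
        + [(0, 2)] * 3 + [(2, 0)])
strengths = fit_bradley_terry(3, data)
ranking = sorted(range(3), key=lambda i: -strengths[i])
```

Sorting the fitted strengths gives the ranking; the strengths themselves are what would be correlated against scores derived from expert annotations.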


[52] When Discourse Pressures Conflict: Information Structure in Vision-Language Model Outputs cs.CLPDF

Marcell Fekete, Johannes Bjerva, Tamás Káldi

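TL;DR: This paper asks whether vision-language models (VLMs) express content in a form that fits discourse information structure in visual question answering. Exploiting the dedicated syntactic positions of Topic and Focus in Hungarian, it compares six VLMs with human participants. VLMs produce information-structure-relevant constructions but over-regularise, collapsing onto narrow response templates in a way that resembles mode collapse.

Details

Motivation: Current VLM evaluation focuses on whether models identify the right visual content, and little is known about whether they express it in a discourse-appropriate form. The paper fills this gap by testing whether VLMs distinguish discourse-old information (Topics) from discourse-new information (Foci) in visual question answering.

Result: In Hungarian visual question answering experiments, the six VLMs are sensitive to information structure but over-regularise. Humans choose variable strategies depending on discourse status, grammatical role, and definiteness, whereas VLMs collapse onto narrow response templates, resembling mode collapse.

Insight: The novelty is applying information structure theory to VLM evaluation, which exposes a limitation in how VLMs express content in discourse: they rely on fixed templates and lack human-like flexibility. VLM evaluation should therefore look beyond content accuracy to how content is packaged for the discourse, which would also improve the practical usefulness of these models.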

Abstract: Vision-language models (VLMs) are increasingly evaluated for whether they identify the right visual content, but little is known about whether they express such content in a discourse-appropriate form. We address this research gap using information structure (IS), testing whether VLMs distinguish discourse-old Topics from discourse-new Foci in visually grounded question answering. We exploit Hungarian, a language in which Topic and Focus map onto dedicated syntactic positions, making IS choices observable in text. Comparing six VLMs with human participants, we find that models produce IS-relevant constructions, but over-regularise this sensitivity. Under the interacting pressures of discourse status, grammatical role (preference for subject Topics) and definiteness (preference for indefinite Foci), humans choose variable strategies for IS realisation. VLMs, by contrast, collapse onto narrow response templates, resembling mode collapse (Kirk et al., 2024). Our findings suggest that VLM evaluation should look beyond content accuracy to how content is packaged for the discourse.


[53] FABSVer: Faster Training and Better Self-Verification for LLM Mathematical Reasoning cs.CLPDF

Haihui Pan, Junwei Bao, Hongfei Jiang, Yang Song

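TL;DR: This paper proposes FABSVer, which fuses answer generation and self-verification for mathematical reasoning into a single generation pass, substantially reducing training time. It also introduces Dynamic Reference Model Update (DRMU) to break the reward convergence bottleneck, yielding better reasoning and verification performance on multiple math benchmarks.

Details

Motivation: Large language models are weak at self-verification in mathematical reasoning, and existing approaches treat answer generation and verification as separate tasks, which greatly increases training time. In addition, the reward tends to plateau during training because the policy is constrained by a fixed reference model.

Result: On math benchmarks, FABSVer achieves better self-verification and reasoning performance across three model scales while needing only 51%–71% of the training time of existing methods. Analysis shows that the gap between verify and answer rewards shrinks noticeably as model size grows.

Insight: The novelty is fusing generation and verification into a single pass for training efficiency and dynamically updating the reference model to raise the reward ceiling. The work also reveals distinct phases in how models learn self-verification and shows how scaling affects reward alignment.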

Abstract: While large language models have made significant progress in mathematical reasoning, they remain unreliable at judging the correctness of their own solutions. Existing approaches that equip models with self-verification typically treat solution generation and verification as two separate tasks, leading to substantially increased training time. In this paper, we propose FABSVer, which fuses these two tasks into a single generation pass, dramatically reducing training overhead while jointly optimizing both capabilities. We further identify a convergence bottleneck both theoretically and empirically: as training progresses, the reward reaches a plateau because the policy is constrained by a fixed reference model. To overcome this, we introduce Dynamic Reference Model Update (DRMU), which raises the reward ceiling and enables sustained reward growth. Extensive experiments on math benchmarks demonstrate that FABSVer achieves superior self-verification and reasoning performance across three model scales, while requiring only 51%–71% of the training time of existing methods. Analysis further reveals distinct learning phases in how models acquire self-verification, and that the gap between verify and answer rewards shrinks noticeably as model size increases.


[54] Roles with Rails: Contract-Preserving Role Evolution in Multi-Agent Structured Reasoning cs.CLPDF

Ling-Yue Ge, Lan-Zhe Guo

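TL;DR: This paper proposes SERO, a Self-Evolving Role Orchestration framework, to address the problem that adaptively evolving the role pool in role-based LLM multi-agent systems can break structural contracts. Using credit-guided retrieval, a credit-ranked communication DAG, and a contextual-bandit controller, it ensures that every role edit preserves five structural contracts: capability coverage, communication compatibility, validation, answer aggregation, and output protocol.

Details

Motivation: Existing role-based LLM multi-agent systems either fix the role inventory and lose adaptivity, or allow unconstrained generation that causes role drift and breaks the structural contracts the system depends on (such as capability coverage and message compatibility). A method is needed that preserves these contracts while roles evolve.

Result: Experiments on real-world reasoning benchmarks across three LLM backbones confirm the value of contract-preserving role evolution; the abstract does not report specific quantitative results or whether the method reaches SOTA.

Insight: The novelty is formalizing role evolution as a contract-preservation problem and designing SERO, which uses credit mechanisms and conditional validator repair to keep the structural contracts intact, while a contextual-bandit controller commits only those proposed edits that improve the task score.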

Abstract: Role-based LLM multi-agent systems need adaptive role pools, yet adapting such systems is not merely a matter of prompt optimization: roles often carry structural obligations, including capability coverage, message compatibility, validation, final-answer aggregation, and parser-compatible output protocols. Existing systems either fix the role inventory and lose adaptivity, or allow unconstrained generation to induce role drift, removing structurally necessary roles and breaking answer contracts. We formulate this as contract-preserving role evolution, requiring every committed edit to preserve five structural contracts (capability, communication, validation, aggregation, output protocol). We instantiate this formulation in SERO, a Self-Evolving Role Orchestration framework that evolves a typed role-card pool through credit-guided retrieval, a credit-ranked communication DAG with a protected terminal aggregator and conditional validator repair, and a contextual-bandit controller whose LLM-proposed edits are committed only when they preserve the contracts and improve task score. Experiments on real-world reasoning benchmarks across three LLM backbones confirm the value of contract-preserving role evolution.
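
The commit rule stated in the abstract (an LLM-proposed edit is committed only when it preserves all five contracts and improves the task score) can be written as a small gate. The sketch below is hypothetical: the contract names are taken from the abstract, but the function name, the dictionary of boolean checks, and the score comparison are illustrative assumptions, and how each contract is actually checked is not specified here.

```python
# The five structural contracts named in the abstract.
CONTRACTS = (
    "capability",
    "communication",
    "validation",
    "aggregation",
    "output_protocol",
)

def should_commit(contract_checks, score_before, score_after):
    """Return True only if every contract holds and the score improves.

    `contract_checks` maps a contract name to a boolean; a missing
    entry is treated as a violation (a conservative assumption).
    """
    preserved = all(contract_checks.get(name, False) for name in CONTRACTS)
    return preserved and score_after > score_before

ok_edit = {name: True for name in CONTRACTS}
drifting_edit = dict(ok_edit, aggregation=False)  # e.g. aggregator removed

decisions = [
    should_commit(ok_edit, 0.61, 0.66),        # preserved and improved
    should_commit(drifting_edit, 0.61, 0.70),  # better score, broken contract
    should_commit(ok_edit, 0.61, 0.60),        # preserved, no improvement
]
```

The second case is the one the formulation is meant to rule out: an edit that raises the score but removes a structurally necessary role is rejected.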


[55] Skill0.5: Joint Skill Internalization and Utilization for Out-of-Distribution Generalization in Agentic Reinforcement Learning cs.CLPDF

Jiapeng Zhu, Jianxiang Yu, Yibo Zhao, Chengcheng Han, Qi Gu

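TL;DR: This paper proposes Skill0.5, an agentic reinforcement learning framework that addresses the dilemma in skill-based RL between externalizing skills (high context overhead) and internalizing them (risk of overfitting and knowledge conflicts). A dynamic, difficulty-aware router sorts tasks into mastery tiers; general skills are internalized to build a cognitive foundation, while task-specific skills are kept external and utilized, improving generalization on both in-distribution and out-of-distribution tasks.

Details

Motivation: Existing skill-based RL methods typically force a rigid choice between full externalization (prohibitive context overhead) and full internalization (risk of overfitting and knowledge conflicts), which limits generalization on complex tasks. The paper seeks a more flexible and efficient way to handle skills.

Result: Experiments on ALFWorld and WebShop show that Skill0.5 outperforms memory-based and skill-based RL baselines, with gains in both in-distribution and out-of-distribution scenarios.

Insight: The core novelty is to differentiate skill treatment explicitly, combining internalization of general skills with external utilization of task-specific skills. A dynamic, difficulty-aware router tiers the tasks, and each tier gets a tailored optimization strategy: privileged distillation for internalization, and diagnostic probing to penalize shortcuts and enforce use of specific skills. This offers a new way to balance consolidating skill knowledge with using it flexibly.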

Abstract: Equipping large language models with explicit skills has emerged as a promising paradigm for enabling autonomous agents to solve complex tasks. Agent skills can be inherently divided into general skills for broad cognitive transfer and task-specific skills for dynamic execution. However, existing skill-based reinforcement learning (RL) methods typically force a rigid choice between full externalization, which incurs prohibitive context overhead, and full internalization, which risks overfitting and knowledge conflicts. To address this dilemma, we propose Skill0.5, a novel agentic RL framework that explicitly differentiates skill treatments by combining general skill internalization with task-specific skill utilization. Driven by a dynamic, difficulty-aware router, Skill0.5 streams tasks into distinct mastery tiers to apply tailored optimization strategies: it internalizes general skills via privileged distillation to build a cognitive foundation for hard tasks, while using diagnostic probing on easy tasks to penalize shortcuts and enforce specific skill utilization. Experiments on ALFWorld and WebShop demonstrate that Skill0.5 outperforms both memory-based and skill-based RL baselines, yielding performance improvements across both in-distribution and out-of-distribution scenarios.


[56] Beyond One Path: Evaluating and Enhancing Divergent Thinking in Interactive LLM Agents cs.CLPDF

Jihyeong Park, Ingeol Baek, Jeonghyun Park, Hwanhee Lee

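TL;DR: This paper introduces MUTATE, an interactive benchmark for evaluating divergent thinking in LLM agents at the path level and the action level, and reveals a blind spot: under convergence pressure, existing models fall into "action fixation". To overcome this, it proposes ReDNA, which separates unconstrained divergent candidate generation from convergent constraint selection. ReDNA significantly outperforms prior methods at both divergence levels and generalizes to an external creativity environment.

Details

Motivation: Existing evaluations treat large language models as single-turn text generators and fail to capture how an agent reasons through iterative interaction, overlooking divergent thinking, a core dimension of creativity. A new evaluation framework is needed to measure divergent thinking in interactive LLM agents.

Result: Experiments on MUTATE show that frontier LLMs perform poorly on action-level divergence when exposed to immediate convergence pressure. ReDNA significantly outperforms prior methods on both path-level and action-level divergence and generalizes to an external creativity environment; its success stems from a qualitative enhancement of resilient divergent reasoning.

Insight: The core novelty is a two-level benchmark (MUTATE) dedicated to divergent thinking in interactive agents, together with a framework (ReDNA) that decouples divergent candidate generation from convergent constraint selection. This offers a new way to address premature convergence in creative tasks and highlights the importance of protecting the divergent exploration phase in the agent decision loop.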

Abstract: Divergent thinking is a core dimension of creativity, yet existing evaluations of Large Language Models (LLMs) treat them as single-turn text generations, failing to capture how an agent reasons through iterative interaction. To address this, we introduce MUTATE, an interactive benchmark designed to evaluate agentic divergent thinking at two levels: path-level, where an agent discovers multiple alternative paths to the same goal, and action-level, where individual actions require non-typical, mechanism-shifting object uses. Unlike success-only evaluations, MUTATE scores both completed paths and off-path attempts, capturing divergent reasoning that conventional success rates discard. Our experiments with frontier LLMs reveal a structural blind spot in existing frameworks: when exposed to immediate convergence pressure, they tend to fall into immediate action fixation, failing to improve action-level divergence. To overcome this, we propose ReDNA, which separates unconstrained divergent candidate generation from convergent constraint selection. ReDNA significantly outperforms prior methods across both divergence levels and generalizes effectively to an external creativity environment. We also confirm its success stems from a qualitative enhancement of resilient divergent reasoning rather than simple environmental exploration.


[57] ClinicalEncoder26AM: A Multlilingual Diagnosable ColBERT Model; Evidences from the MultiClinNER Shared Task cs.CLPDF

François Remy

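TL;DR: ClinicalEncoder26AM is a multilingual Diagnosable ColBERT model designed for clinical and biomedical text. It aligns token-level semantics at multiple levels with the ClinicalMap25 clinical latent space and is post-trained on synthetic clinical notes, patient–doctor conversations, and annotated data. In the MultiClinNER shared task, fine-tuned as a BIO tagger, it achieves state-of-the-art multilingual entity recall on symptom, disorder, and procedure entities and ranks in the top five overall on character-weighted F1.

Details

Motivation: To address multilingual entity recognition and information extraction in clinical and biomedical text by building a semantic representation model specialized for the clinical domain, improving downstream performance and data efficiency.

Result: In the MultiClinNER shared task, the model fine-tuned as a BIO tagger achieves state-of-the-art multilingual entity recall and ranks in the top five overall on character-weighted F1 across all entity types and languages. Training curves show that it is markedly more data-efficient than the base M3 model.

Insight: The novel elements include multi-level semantic alignment to a clinical latent space, multi-adapter distillation combining synthetic and annotated data, and a lightweight CNN head that improves boundary detection. The clinical post-training strategy improves the model’s generalization and efficiency on information extraction tasks.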

Abstract: ClinicalEncoder26AM is a multilingual Diagnosable ColBERT for clinical and biomedical texts, which aligns its token-level semantics at multiple levels with ClinicalMap25, a clinical latent space inspired by BioLORD-2023 and enriched with synthetic and annotated supervision. The post-training recipe builds upon BGE-M3, and combines synthetic clinical notes, patient–doctor conversations, and annotated resources such as MedMentions, while considering both named-entity-level and sentence-level representations in a multi-adapter distillation, along with a ColBERT-style retrieval objective. In this system demonstration paper, we evaluate the model in the MultiClinNER shared task by finetuning it as a BIO tagger for patient symptoms, disorders, and procedure spans, using a lightweight two-layer CNN head to improve local boundary detection. The resulting system remains simple, processes most documents in a single 8192-token window, and achieves state-of-the-art multilingual entity recall, while ranking in the top five overall across all entity types and languages in character-weighted F1 scores. Training curves further show that ClinicalEncoder26AM is markedly more data-efficient than the base M3 model, supporting the usefulness of its clinical post-training for downstream information extraction. The model can be downloaded from https://huggingface.co/Parallia/ClinicalEncoder26AM-Diagnosable-Colbert-L2-for-multilingual-medical-texts


[58] GUI-CIDER: Mid-training GUI Agents via Causal Internalization and Density-aware Exemplar Reselection cs.CLPDF

Zheng Wu, Chengcheng Han, Zhengxi Lu, Tianjie Ju, Yanyu Chen

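TL;DR: This paper proposes GUI-CIDER, a mid-training method that uses causal internalization and density-aware exemplar reselection to make multimodal large language models learn world knowledge about GUI operations explicitly, improving the task completion ability of GUI agents. The method has three stages (data synthesis, exemplar reselection, and mid-training) and is validated on multiple benchmarks.

Details

Motivation: GUI agents built on multimodal large language models are limited in real task completion by a lack of world knowledge about GUI operations. Conventional post-training methods such as supervised fine-tuning or reinforcement learning only let the model absorb this knowledge implicitly, leading to inefficient trajectory memorization rather than genuine understanding. A method that learns this knowledge explicitly is therefore needed.

Result: Extensive experiments on two GUI knowledge benchmarks and three task completion benchmarks show that GUI-CIDER consistently improves both the agent’s understanding of GUI operations and its task success rate.

Insight: The novelty is a mid-training paradigm that distills static planning and dynamic causal knowledge from GUI trajectories into text and reselects exemplars by rewarding causal structure and penalizing semantic redundancy, thereby internalizing GUI world knowledge explicitly. This is more efficient than post-training alone and promotes genuine understanding.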

Abstract: Despite the rapid progress of multimodal large language models in building Graphical User Interface (GUI) agents, their real-world task completion is fundamentally bottlenecked by a lack of world knowledge about GUI operations. Existing solutions typically rely on expensive multi-agent scaffolding or conventional post-training paradigms, such as Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL). However, post-training only allows agents to implicitly absorb world knowledge through action annotations or reward signals, leading to inefficient trajectory memorization rather than genuine comprehension. Therefore, an approach that enables explicit learning of this knowledge is imperative. To this end, we propose GUI-CIDER, a mid-training method that explicitly internalizes GUI world knowledge through Causal Internalization and Density-aware Exemplar Reselection. GUI-CIDER operates in three stages: (1) data synthesis, which distills static planning and dynamic causal knowledge from GUI trajectories into text; (2) exemplar reselection, which filters the corpus by rewarding causal structures and penalizing semantic redundancy; and (3) mid-training, where the refined data is used to embed the acquired knowledge. Extensive experiments on two GUI knowledge benchmarks and three task completion benchmarks demonstrate that GUI-CIDER consistently improves both the agent’s understanding of GUI operations and its task success rates. The code is available at https://github.com/Wuzheng02/GUI-CIDER.


[59] Soft-SVeRL: Self-Verified Reinforcement Learning with Soft Rewards cs.CL | cs.LGPDF

Saurabh Dash, Pierre Clavier, John Dang, Matthias Galle, Marzieh Fadaee

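TL;DR: This paper proposes the Soft-RLVR framework and its self-verifying variant Soft-SVeRL for reinforcement learning on partially verifiable tasks. The framework decomposes each prompt into a checklist of atomic requirements and scores responses item by item with an LLM verifier to produce a soft reward, turning sparse pass/fail supervision into a denser partial-credit signal.

Details

Motivation: Reinforcement learning from verifiable rewards (RLVR) works in domains such as mathematics and code, where correctness can be checked automatically. Many important tasks, however, are only partially verifiable: prompts contain multiple requirements, no single reference answer exists, and responses may satisfy some requirements but not all.

Result: In a controlled instruction-following setting with rule-based ground-truth evaluation, checklist-based Soft-RLVR improves IFEval by up to 11.1 points using only learned verifier rewards. Experiments show that verifier quality and checklist quality both affect downstream RL outcomes and that explicit stabilization is essential for effective self-verification.

Insight: The novelty is decomposing holistic verification into a checklist of atomic requirements to produce soft rewards, formalizing the trade-off between checklist-based and holistic verification (noise reduction versus rewarding partial credit), and proposing the self-verifying variant Soft-SVeRL, which uses explicit stabilization to prevent the reward-inflation collapse caused by overly permissive self-judgments.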

Abstract: Reinforcement Learning from Verifiable Rewards (RLVR) has improved language models in domains such as mathematics and code, where correctness can be checked automatically. However, many important tasks are only partially verifiable: prompts contain multiple requirements, responses may satisfy some but not all of them, or no single reference answer might exist. We introduce Soft-RLVR, a framework for reinforcement learning from decomposed, learned verification signals. Soft-RLVR converts each prompt into a checklist of atomic requirements, scores candidate responses item by item with an LLM verifier, and trains on the resulting soft reward. Checklist-based rewards turn sparse pass/fail supervision into a denser partial-credit signal, but they also introduce a tradeoff: averaging item-level judgments can reduce verifier noise, while partial credit can reward incomplete responses. We formalize this tradeoff and identify conditions under which checklist-based verification gives a more reliable RL training signal than holistic verification. We further introduce Soft-SVeRL, a self-verifying variant of Soft-RLVR in which the policy also acts as the verifier. We show that self-verification is prone to reward inflation from overly permissive self-judgments, and that explicit stabilization is needed to prevent this collapse. In a controlled instruction-following setting with rule-based ground-truth evaluation, checklist-based Soft-RLVR improves IFEval by up to 11.1 points using only learned verifier rewards. Our experiments further show that verifier quality and checklist quality both affect downstream RL outcomes, and that explicit stabilization is essential for effective self-verification.
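
The contrast between a checklist-based soft reward and an all-or-nothing reward can be shown with a short sketch. It assumes the soft reward is the plain mean of per-item verifier verdicts in [0, 1]; the abstract mentions averaging item-level judgments but does not give the exact aggregation, so the function names and the strict comparator below are illustrative assumptions.

```python
def soft_reward(item_judgments):
    """Mean of per-item verdicts, each in [0, 1] (assumed aggregation)."""
    if not item_judgments:
        raise ValueError("checklist must contain at least one item")
    return sum(item_judgments) / len(item_judgments)

def strict_reward(item_judgments):
    """Pass/fail comparator: 1.0 only if every item is fully satisfied."""
    return 1.0 if all(j >= 1.0 for j in item_judgments) else 0.0

# Hypothetical verdicts for one response against a 4-item checklist,
# e.g. "under 100 words", "mentions a date", "uses a bullet list", ...
verdicts = [1.0, 1.0, 0.0, 1.0]
dense = soft_reward(verdicts)     # partial credit
sparse = strict_reward(verdicts)  # no credit for a near miss
```

The same example shows both sides of the trade-off described above: the soft signal distinguishes a near miss from a complete failure, but it also pays out for a response that still violates one requirement.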


[60] Mobile-Aptus: Confidence-Driven Proactive and Robust Interaction in MLLM-based Mobile-Using Agents cs.CLPDF

Zheng Wu, Pengzhou Cheng, Zongru Wu, Yuan Guo, Tianjie Ju

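TL;DR: This paper proposes Mobile-Aptus, a confidence-driven framework for proactive and robust interaction in mobile-using agents built on multimodal large language models (MLLMs). Through two stages, interaction capability empowerment and confidence bias correction, the agent learns to output both actions and confidence scores, so it can proactively request human interaction when it cannot complete a task on its own while avoiding both over-execution and over-soliciting.

Details

Motivation: Existing MLLM-based mobile-using agents suffer from over-execution (attempting tasks they cannot complete) or over-soliciting (relying excessively on human intervention). A mechanism is needed to balance autonomous execution against necessary human interaction.

Result: Mobile-Aptus reaches SOTA performance on four popular mobile-agent benchmarks (OS-Kairos, AITZ, Meta-GUI, and AndroidControl), with an average improvement of more than 17% in task success rate on offline benchmarks. In real-world dynamic experiments it surpasses the baseline by 26% in task success rate with only 0.64 intervention steps per instruction.

Insight: The novelty is a universal confidence integration framework that learns to output confidence through supervised fine-tuning and corrects confidence bias by combining semantic similarity retrieval with direct preference optimization, enabling confidence-driven, proactive, and robust interaction decisions that mitigate both over-execution and over-soliciting.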

Abstract: Recent advancements in multimodal large language models (MLLMs) have shown exceptional potential in enabling mobile-using agents to autonomously execute human instructions. However, fully automated agents often try to execute tasks even when they are unable to resolve them, leading to the problem of over-execution. Previous studies address this by training interactive mobile-using agents that request human interaction when they cannot complete user instructions. However, we find that these interactive agents tend to exhibit over-soliciting behavior, relying excessively on human intervention. To mitigate both over-execution and over-soliciting, we propose a universal confidence integration framework that enables confidence-driven proactive and robust interaction in MLLM-based mobile-using agents. The framework consists of two stages: interaction capability empowerment and confidence bias correction. In the interaction capability empowerment stage, agents learn through supervised fine-tuning to output both actions and confidence scores. In the confidence bias correction stage, agents learn to output more accurate confidence scores by combining semantic similarity retrieval with direct preference optimization. Experimental results show Mobile-Aptus achieves state-of-the-art performance on four popular mobile-using agent benchmarks: OS-Kairos, AITZ, Meta-GUI, and AndroidControl. Mobile-Aptus consistently outperforms all baselines in offline benchmarks, with an average improvement of over 17% in task success rate. In real-world dynamic experiments, Mobile-Aptus surpasses the baseline by 26% in task success rate with only 0.64 intervention steps per instruction. The code is available at https://github.com/Wuzheng02/Mobile-Aptus.
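
A confidence-driven interaction policy of the kind described can be reduced to a gate on the agent’s reported confidence. The sketch below is a simplified illustration under stated assumptions: the threshold value, the `AgentStep` structure, and the `"execute"` / `"ask_human"` labels are hypothetical and do not come from the paper, which trains the confidence scores rather than hand-setting a rule.

```python
from dataclasses import dataclass

@dataclass
class AgentStep:
    action: str        # e.g. "tap(login_button)"
    confidence: float  # score in [0, 1] emitted alongside the action

def route(step, threshold=0.5):
    """Execute autonomously above the threshold, otherwise ask a human.

    `threshold` is a hypothetical tuning knob: too low invites
    over-execution, too high invites over-soliciting.
    """
    if step.confidence >= threshold:
        return ("execute", step.action)
    return ("ask_human", step.action)

steps = [
    AgentStep("tap(search_bar)", 0.93),
    AgentStep("type(password_field)", 0.21),
]
routed = [route(s) for s in steps]
interventions = sum(1 for kind, _ in routed if kind == "ask_human")
```

Counting `ask_human` outcomes per instruction gives a quantity comparable in spirit to the intervention steps per instruction reported in the abstract; such a gate is only useful if the confidence scores are well calibrated.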


[61] IPO-Mine: A Toolkit and Dataset for Section-Structured Analysis of Long, Multimodal IPO Documents cs.CL | cs.AIPDF

Michael Galarnyk, Siddharth Lohani, Vidhyakshaya Kannan, Sagnik Nandi, Aman Patel

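TL;DR: This paper presents IPO-Mine, a toolkit and dataset for analyzing long, multimodal IPO filings. The toolkit downloads and parses IPO filings into standardized section-structured text and extracted images, and is used to build a large multimodal dataset of more than 109,000 IPO filings.

Details

Motivation: IPO filings are important financial-market documents, yet there is no large-scale, standardized dataset or benchmark for studying them with modern language and multimodal models. The filings are typically very long and inconsistently structured, which makes systematic analysis difficult.

Result: On multimodal reasoning tasks such as chart quality and misleadingness assessment, state-of-the-art multimodal models diverge from expert human judgments, exposing alignment challenges on long, real-world regulatory documents.

Insight: The novelty is an open-source framework and large dataset that enable large-scale, reproducible, structured analysis of IPO filings, together with evaluation tasks over extracted financial charts, providing a new benchmark for multimodal understanding of long documents.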

Abstract: An Initial Public Offering (IPO) filing is a document released when a private firm goes public, allowing individual (retail) investors to purchase its shares. These filings describe a firm’s business, financials, and risks and are long, multimodal documents with narrative text and images. Despite their importance to financial markets, there is no large-scale, standardized dataset or benchmark for studying IPO filings with modern language and multimodal models. These documents pose significant challenges: filings frequently exceed 500,000 tokens and lack consistent structural organization. We introduce the IPO-Toolkit, an open-source framework for downloading and parsing IPO filings into standardized section-structured text and extracted images. The toolkit segments filings, extracts embedded images, and produces structured outputs that enable large-scale, reproducible analysis workflows over long, multimodal documents. Using this infrastructure, we construct the IPO-Dataset, a large, section-structured, multimodal dataset covering more than 109,000 IPO filings and amendments from 1994 to 2026 and containing over 76,000 images. We establish structured evaluation tasks over extracted financial charts, including chart quality and misleadingness assessment. Our experiments show that state-of-the-art multimodal models often diverge from expert human judgments on these tasks, exposing alignment challenges in multimodal reasoning over long, real-world regulatory documents. Beyond benchmarking, the IPO-Dataset enables large-scale analysis of section-level textual variation and cross-industry differences in visual and textual disclosure practices. Our code, dataset, and website are publicly available under CC-BY-4.0.


[62] MemTrace: Tracing and Attributing Errors in Large Language Model Memory Systems cs.CL | cs.AI | cs.LGPDF

Xinle Deng, Ruobin Zhong, Hujin Peng, Xiaoben Lu, Yanzhe Wu

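TL;DR: This paper proposes MemTrace, a framework for tracing and attributing errors in LLM memory systems. It converts memory pipelines into executable memory evolution graphs, enabling fine-grained tracing of operational information flow, and builds the MemTraceBench benchmark to study memory failure modes systematically. It further introduces an automatic attribution method that iteratively traces operation subgraphs to locate the root cause of a failure and uses the fine-grained attribution signals to guide downstream prompt optimization, forming a closed loop that automatically corrects faults and improves task performance.

Details

Motivation: Existing LLM memory systems are unreliable and hard to debug. Tracing how memory evolves is essential for understanding how information is synthesized, propagated, or corrupted over time, which motivates the study of error tracing and attribution in LLM memory systems.

Result: Analysis on MemTraceBench (collected from representative memory systems such as Long-Context, RAG, Mem0, and EverMemOS) shows that memory failures are systematic and stem from operation-level issues such as information loss and retrieval misalignment. Using the attribution signals to guide prompt optimization automatically corrects faults and improves end-task performance by up to 7.62%.

Insight: The novel elements are modeling memory pipelines as executable memory evolution graphs for fine-grained operation-flow tracing, building a systematic benchmark of memory failures, and proposing automatic attribution by iterative operation-subgraph tracing within a closed-loop system that feeds diagnostic signals directly into performance improvement.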

Abstract: Memory is essential for enabling large language models to support long-horizon reasoning, yet existing memory systems remain unreliable and difficult to debug. Tracing memory’s dynamic evolution is crucial to understand how information is synthesized, propagated, or corrupted over time. In this work, we study the new problem of error tracing and attribution in LLM memory systems. We propose a novel framework that transforms memory pipelines into executable memory evolution graphs, enabling fine-grained tracing of operational information flow. We then construct MemTraceBench, a benchmark collected from representative memory systems such as Long-Context, RAG, Mem0, and EverMemOS, to systematically study memory failure modes. We further introduce an automatic attribution method that iteratively traces operation subgraphs to pinpoint the root cause of any failed case. Our analysis reveals that memory failures are systematic, stemming from operation-level issues like information loss and retrieval misalignment. Crucially, we leverage these fine-grained attribution signals to guide downstream prompt optimization, establishing a closed-loop system that automatically corrects faults and boosts end-task performance by up to 7.62%. Code will be released at https://github.com/zjunlp/MemTrace.


[63] Towards Reliable Multilingual LLMs-as-a-Judge: An Empirical Study cs.CL | cs.AIPDF

Irune Zubiaga, Aitor Soroa, Rodrigo Agerri

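TL;DR: This paper studies how to build reliable multilingual LLM evaluators (LLMs-as-a-Judge). It systematically analyzes English, Spanish, and Basque (representing high-, mid-, and low-resource languages) and examines instruction translation, monolingual versus multilingual supervision, and model size. When in-domain data is available, fine-tuned smaller models can match proprietary models; in out-of-domain settings, zero-shot evaluation with larger models is more effective.

Details

Motivation: Demand for multilingual evaluation is growing, but extending LLM-based evaluators to multilingual settings remains challenging, particularly for low-resource languages and when in-domain data is scarce. The paper explores strategies for developing reliable multilingual LLM evaluators.

Result: The study extends two existing meta-evaluation datasets to Basque and Spanish. When in-domain data is available, fine-tuned smaller models perform comparably to proprietary models; in out-of-domain settings, zero-shot evaluation with larger models is more effective. Fine-tuning on out-of-domain data can also hurt model performance.

Insight: The novelty is a systematic empirical study of strategies for building LLM evaluators across languages with different resource levels. It reveals the key trade-off between data availability (in-domain versus out-of-domain) and model choice (fine-tuned small model versus zero-shot large model), giving practical guidance for efficient, reliable multilingual evaluation pipelines. The study also extends meta-evaluation datasets to a low-resource language (Basque).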

Abstract: Large language models (LLMs) are increasingly used for the automatic evaluation of generated text, yet most prior work focuses on English. Despite the growing demand for multilingual evaluation, extending LLM-based evaluators to multilingual settings remains challenging, particularly for low-resource languages and scenarios where in-domain data is scarce. This work explores several strategies for developing multilingual LLMs-as-a-judge, considering whether in-domain data is available for fine-tuning or not. We systematically analyze English, Spanish, and Basque, representing high-, mid-, and low-resource languages, considering instruction translation, monolingual versus multilingual supervision, and model size. For evaluation, we extend two existing meta-evaluation datasets to Basque and Spanish. Our results reveal key trade-offs: When in-domain data is available, fine-tuned smaller models can achieve performance comparable to proprietary models, whereas zero-shot evaluation with larger models proves more effective in out-of-domain settings. We also observe that fine-tuning on out-of-domain data can adversely affect model performance. These findings provide practical guidance for building efficient, reliable multilingual evaluation pipelines. The data and code are publicly available at hitz-zentroa/mJudge.


[64] Agent Explorative Policy Optimization for Multimodal Agentic Reasoning cs.CLPDF

Minki Kang, Shizhe Diao, Ryo Hachiuma, Sung Ju Hwang, Pavlo Molchanov

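TL;DR: This paper proposes AXPO (Agent eXplorative Policy Optimization), a method that addresses the "Thinking-Acting Gap" in multimodal agentic reasoning. By fixing the thinking prefix and resampling the tool call sequence, combined with uncertainty-based prefix selection, it improves agent performance on tasks that require external tools.

Details

Motivation: Multimodal agentic reasoning has a structural asymmetry between internal thinking (the default behavior) and external tool use (a high-variance auxiliary behavior), called the Thinking-Acting Gap. Under standard RL recipes such as GRPO, this leads to a low rate of tool-use attempts and a high error rate when tools are used, which suppresses the learning signal at the critical tool calls.

Result: Across nine multimodal benchmarks and three scales of Qwen3-VL-Thinking, SFT+AXPO outperforms SFT+GRPO on average (for example, +1.8pp Pass@1 and +1.8pp Pass@4 at the 8B scale), and the 8B SFT+AXPO model surpasses the 32B base model on Pass@4 with four times fewer parameters.

Insight: The novelty is identifying and formalizing the Thinking-Acting Gap as a training bottleneck and proposing AXPO as a targeted remedy. Resampling tool call sequences and selecting prefixes by uncertainty strengthens exploration and makes better use of the learning signal from tool-use behavior, offering a new approach to policy optimization for multimodal agents.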

Abstract: Vision-language models with extended reasoning succeed on complex problems, but many real-world problems require external tools that internal reasoning alone often cannot resolve. Agentic reasoning therefore interleaves two behaviors with a structural asymmetry: thinking (the self-contained default) and tool use (a high-variance auxiliary behavior). We refer to this asymmetry as the Thinking-Acting Gap. Under standard RL recipes like GRPO, the gap manifests as two diagnostic symptoms during training: tool use is attempted on only ~30% of rollouts, and when attempted, the tool-using rollouts within a group are all-wrong on ~40% of questions, suppressing the learning signal at the tool calls that needed it. We propose AXPO (Agent eXplorative Policy Optimization): for each all-wrong tool-using subgroup, AXPO fixes the thinking prefix and resamples the tool call and its continuation, paired with uncertainty-based prefix selection. Across nine multimodal benchmarks and three scales of Qwen3-VL-Thinking, SFT+AXPO outperforms SFT+GRPO on average (+1.8pp Pass@1 and +1.8pp Pass@4 at 8B), and the 8B model with SFT+AXPO surpasses the 32B Base on Pass@4 with four times fewer parameters.
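
Two pieces of this abstract lend themselves to a short sketch: detecting an all-wrong tool-using subgroup within a group of rollouts, and computing Pass@k. Both are illustrative assumptions rather than the authors’ code. The rollout dictionary keys are hypothetical, and the Pass@k function uses the widely used unbiased estimator 1 - C(n-c, k) / C(n, k); the abstract does not say which estimator the paper uses.

```python
from math import comb

def is_all_wrong_tool_subgroup(rollouts):
    """True if some rollouts used a tool and none of those were correct.

    Each rollout is a dict with hypothetical boolean keys
    "used_tool" and "correct".
    """
    tool_rollouts = [r for r in rollouts if r["used_tool"]]
    return bool(tool_rollouts) and not any(r["correct"] for r in tool_rollouts)

def pass_at_k(n, c, k):
    """Unbiased Pass@k estimate from n samples with c correct ones."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # Probability that a random size-k subset contains no correct sample.
    return 1.0 - comb(n - c, k) / comb(n, k)

group = [
    {"used_tool": True, "correct": False},
    {"used_tool": True, "correct": False},
    {"used_tool": False, "correct": True},
    {"used_tool": False, "correct": False},
]
needs_resampling = is_all_wrong_tool_subgroup(group)
p1 = pass_at_k(n=4, c=1, k=1)
p4 = pass_at_k(n=4, c=1, k=4)
```

In this toy group the only correct rollout did not use a tool, so the tool-using subgroup carries no positive example; this is the situation in which AXPO is described as fixing the thinking prefix and resampling the tool call.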


[65] The Abstraction Gap in Vision-Language Causal Reasoning cs.CL | cs.CVPDF

Chinh Hoang, Mohammad Rashedul Hasan

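TL;DR: This paper examines causal reasoning in vision-language models (VLMs) and argues that current evaluations cannot distinguish linguistic fluency from genuine causal reasoning. The authors propose a dual-probe methodology (a Text-Only Probe and a Chain-Text Probe) and the Abstraction Gap (AG) metric to quantify the difference. Evaluating eight VLMs on the CAGE benchmark of 49,500 questions, they find that most models show a large abstraction gap, but one model achieves a near-zero gap, indicating that the capability depends on pretraining and architectural choices.

Details

Motivation: Current vision-language models generate fluent causal explanations, but evaluations cannot tell whether these reflect linguistic plausibility or genuine causal reasoning, which limits the use of such models in tasks that require reliable causal reasoning.

Result: Eight VLMs are evaluated on the CAGE benchmark (5,500 images and 49,500 questions spanning Pearl's causal hierarchy). Seven models show an abstraction gap above 0.50 (text scores of 6-8, causal-chain scores below 2.5). Fine-tuning on 45,000 chain-annotated examples fails to close the gap, but one model achieves a near-zero abstraction gap.

Insight: The contribution is the dual-probe methodology and the Abstraction Gap metric, which isolate and quantify linguistic fluency versus causal reasoning ability. The results show that faithful causal reasoning is already achievable within current VLM architectures but depends heavily on pretraining and architectural design. CAGE can serve as a diagnostic tool for assessing the faithfulness of causal reasoning in VLMs.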

Abstract: Vision-language models (VLMs) generate fluent causal explanations, but current evaluations cannot distinguish linguistic plausibility from faithful causal reasoning. We introduce a dual-probe methodology that isolates these properties. The Text-Only Probe measures linguistic quality. The Chain-Text Probe requires models to first generate explicit causal chains. The Abstraction Gap (AG) metric quantifies the normalized performance difference. Evaluating eight VLMs on CAGE (Causal Abstraction Gap Evaluation), a benchmark of 49,500 questions across 5,500 images spanning Pearl’s causal hierarchy, we find seven models exhibit AG exceeding 0.50 with text scores of 6–8 but chain scores below 2.5. Fine-tuning on 45,000 chain-annotated examples fails to close the gap. However, one model achieves near-zero AG. The capability exists within current VLM architectures and depends on pretraining and architectural choices. CAGE provides a diagnostic tool for assessing faithful causal reasoning in VLMs.
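
The abstract describes AG only as a "normalized performance difference" and does not give the formula. The Python sketch below therefore uses one plausible normalization, the relative drop from the text score to the chain score. This choice, the function name, and the argument names are assumptions for illustration and may differ from the paper's definition.

```python
def abstraction_gap(text_score, chain_score):
    """Assumed form of AG: relative drop from the text score to the chain score.

    This normalization is an illustrative assumption; the abstract does not
    state the exact formula.
    """
    if text_score <= 0:
        raise ValueError("text_score must be positive")
    return (text_score - chain_score) / text_score


# Scores at the edge of the ranges quoted in the abstract (text 6, chain 2.5).
ag_example = abstraction_gap(text_score=6.0, chain_score=2.5)
```

Under this assumed form, a model whose chain score matches its text score has an AG of zero, and the example scores give a gap above 0.50, which is at least consistent with the ranges quoted in the abstract.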


[66] Can Large Language Models Handle Discourse Particles? A Case Study of Colloquial Malay cs.CLPDF

Mariah Al Giptiah Binte Yusoff, Jakin Tan, Bocheng Chen, Guangliang Liu, Xi Chen

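TL;DR: This paper proposes MalayPrag, a benchmark for systematically evaluating how well large language models handle discourse particles in colloquial Malay, and introduces five attributes as a unified framework for interpreting their pragmatic functions. Experiments show that existing LLMs struggle considerably to connect Malay discourse particles with their pragmatic functions, and that the proposed attribute framework significantly improves model performance.

Details

Motivation: Existing studies lack a comprehensive understanding of how LLMs handle discourse particles (components that convey emotions, intentions, and interpersonal meanings), and they focus mainly on high-resource languages such as English while neglecting Southeast Asian languages. This paper aims to fill that gap, with a focus on colloquial Malay.

Result: Experiments with ten off-the-shelf LLMs on three prediction tasks in MalayPrag show that current LLMs have substantial difficulty accurately connecting Malay discourse particles with their pragmatic functions. Providing the five attributes designed in this study as structured scaffolding significantly improves these connections.

Insight: The contribution is the first benchmark for systematically evaluating LLMs' pragmatic competence (specifically, the handling of discourse particles) in a low-resource language (Malay), together with a linguistically grounded, unified attribute framework for decomposing pragmatic functions, which offers an interpretable path toward improving models' conversational ability.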

Abstract: Discourse particles, such as "well" and "kind of", are crucial components that enable LLMs to "speak" more like humans. They are used to convey emotions, intentions, and interpersonal meanings. However, existing studies have not yet built a comprehensive understanding of LLMs' capabilities in handling discourse particles. Moreover, the limited number of studies focuses primarily on high-resource languages such as English, with little attention paid to Southeast Asian languages. In this paper, we (1) propose MalayPrag, a benchmark designed to systematically evaluate and analyze LLMs' capabilities in handling discourse particles in colloquial Malay; and (2) introduce five attributes that provide a linguistically grounded, unified framework for interpreting the pragmatic functions of discourse particles. Applying these two contributions, we prompt ten off-the-shelf LLMs to perform three prediction tasks. The experimental results reveal substantial challenges for current LLMs in accurately connecting discourse particles with their pragmatic functions in Malay. The provision of the five attributes designed in this study is found to significantly improve these connections, highlighting the need for structured scaffolding for models' pragmatic competence.


[67] OmniVerifier-M1: Multimodal Meta-Verifier with Explicit Structured Recalibration cs.CL | cs.AI | cs.CV | cs.LGPDF

Xinchen Zhang, Bowei Liu, Jiale Liu, Chufan Shi, Yizhen Zhang

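TL;DR: This paper proposes OmniVerifier-M1, a generalist visual verifier that leverages symbolic meta-verification and decoupled reinforcement learning. It uses symbolic outputs (such as bounding boxes) as meta-verification rationales and separates the reinforcement learning objectives for binary judgment and meta-verification, enabling more reliable, fine-grained multimodal verification.

Details

Motivation: As visual outcomes become increasingly central to multimodal large language models, reliable and fine-grained verification is needed to scale generalist foundation models. The paper studies meta-verification that uses verifier-generated rationales rather than decision-only signals, and explores how to incorporate meta-verification feedback effectively into multimodal verifier training.

Result: OmniVerifier-M1, trained with the proposed approach, provides robust verification and fine-grained error localization. It also enables M1-TTS, a verifier-driven agentic generation system that achieves dynamic region-level self-correction.

Insight: The key findings are that symbolic verifier outputs outperform textual explanations as meta-verification rationales, and that decoupling the reinforcement learning objectives for binary judgment and meta-verification substantially outperforms joint reward optimization. Together these point toward more reliable, interpretable, and fine-grained multimodal verification.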

Abstract: Visual outcomes are increasingly central to multimodal large language models, making reliable and fine-grained verification essential for scaling generalist foundation models. In this work, we investigate multimodal meta-verification, which leverages verifier-generated rationales rather than decision-only signals, and explore how to effectively incorporate meta-verification feedback into multimodal verifier training. We identify two key findings. First, symbolic verifier outputs (e.g., bounding boxes) outperform textual explanations as meta-verification rationales, enabling efficient rule-based reinforcement learning rewards while avoiding reliance on model-based rewards from auxiliary judge models. Second, decoupling reinforcement learning objectives for binary judgment and meta-verification substantially outperforms joint reward optimization, due to intrinsic differences in output structure and learning dynamics. Based on these insights, we train OmniVerifier-M1, a generalist visual verifier leveraging symbolic meta-verification and decoupled reinforcement learning. OmniVerifier-M1 provides robust verification and fine-grained error localization, and further enables M1-TTS, a verifier-driven agentic generation system achieving dynamic region-level self-correction. This approach paves the way for more reliable, interpretable, and fine-grained multimodal verification, supporting safer and more controllable foundation model deployment.


[68] Skill-Conditioned Gated Self-Distillation for LLM Reasoning cs.CL | cs.AIPDF

Jiazhen Huang, Xiao Chen, Xiao Luo, Yong Dai, Senkang Hu

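TL;DR: This paper proposes Skill-Conditioned Gated Self-Distillation (SGSD), a method for improving the reasoning ability of large language models (LLMs). It retrieves skill-mistake pairs from an experience-derived skill bank to build a multi-teacher pool, and reformulates skill-based self-distillation as teacher hypothesis validation rather than unconditional imitation. SGSD uses a robust gated objective to distill informative teacher-student disagreements while suppressing uncertain or extreme supervision signals.

Details

Motivation: Existing on-policy self-distillation methods usually assume that the teacher side has trusted privileged information, such as reference answers or successful traces. This paper asks whether compact, reusable skills from an experience-derived skill bank, which may be irrelevant or misleading, can serve as privileged information, so that model reasoning improves under a weaker assumption.

Result: Experiments on several mathematical reasoning benchmarks (AIME24, AIME25, HMMT25) show that SGSD consistently outperforms GRPO on Qwen3-1.7B and remains competitive with answer-conditioned OPSD under a weaker privileged-information assumption, exceeding them by 6.2% and 1.7% on average, respectively.

Insight: The core idea is to recast skill-based self-distillation as a teacher hypothesis validation framework, with a robust gated objective that selectively uses supervision from the multi-teacher pool. This offers a way to perform effective self-distillation when privileged information is unreliable or limited.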

Abstract: On-policy self-distillation (SD) improves LLM reasoning by using teacher-side privileged information (PI) to turn sparse verifier outcomes into dense token-level supervision. Existing methods usually assume trusted PI, such as reference answers or successful traces. We ask whether PI can instead come from an experience-derived skill bank, where retrieved skills are compact and reusable but may also be irrelevant or misleading. We propose Skill-Conditioned Gated Self-Distillation (SGSD), which formulates skill-based SD as teacher hypothesis validation rather than unconditional imitation. SGSD retrieves skill-mistake pairs, constructs a multi-teacher pool, and lets all skill-conditioned teachers score the same plain-prompt student rollout. The verifier validates each teacher’s polarity: supporting a success or suppressing a failure gives positive supervision, while the opposite stance is reversed. A robust gated objective then distills informative teacher-student disagreements while suppressing uncertain or extreme signals. Experiments on multiple mathematical reasoning benchmarks show that SGSD consistently improves over GRPO and remains competitive with answer-conditioned OPSD under a weaker PI assumption. For example, on Qwen3-1.7B, SGSD outperforms GRPO by 6.2% and OPSD by 1.7% on average on AIME24, AIME25, and HMMT25. Our code is available at https://github.com/walawalagoose/SGSD.


[69] Human Label Variation as Stable Signal: Learning Annotator-Specific Explanation Behavior via Cross-Annotator Preference Optimization cs.CLPDF

Beiduo Chen, Pingjun Hong, Ziyun Zhang, Benjamin Roth, Anna Korhonen

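TL;DR: This paper studies whether large language models can learn and reproduce annotator-specific label-explanation behavior. Using two tasks, natural language inference and paraphrase judgment, it analyzes the stability of individual annotator patterns and proposes cross-annotator preference optimization. Experiments show that the method captures annotator-specific behavior effectively.

Details

Motivation: The paper explores how free-text explanations extend human label variation by revealing the reasoning and preferences behind annotators' decisions, and whether large language models can learn such annotator-specific label-explanation behavior.

Result: On natural language inference and paraphrase judgment, prompting is limited and unstable, supervised fine-tuning better captures annotator-specific behavior, and cross-annotator preference optimization further improves aggregation-aware imitation and judge-based attribution while preserving target-specific reasoning patterns under human validation.

Insight: The paper treats human label variation as a stable signal and proposes cross-annotator preference optimization, which learns annotator-specific behavior by contrasting the target annotator's response with those of other annotators, opening a path toward scalable explanation-based annotation.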

Abstract: Free-text explanations extend human label variation (HLV) beyond label disagreement by revealing the reasoning and preferences behind annotators’ decisions. We study whether large language models (LLMs) can learn and reproduce such annotator-specific label-explanation behavior. Using two sentence-pair tasks with four annotators each – natural language inference and paraphrase judgment – we first analyze whether annotators exhibit stable individual patterns. We find that such patterns are weak at the single-annotation level due to strong input-content effects, but become detectable after input-content reduction and annotator-level aggregation. We then compare prompting and supervised fine-tuning (SFT) baselines and propose cross-annotator preference optimization (CAPO), which contrasts a target annotator’s response with other valid but less target-specific annotations for the same input. Experiments show that prompting is limited and unstable, SFT better captures annotator-specific behavior, and CAPO further improves aggregation-aware imitation and judge-based attribution while preserving target-specific reasoning patterns under human validation. Overall, our results show that HLV can be learned as annotator-specific label-explanation behavior, suggesting a path toward scalable explanation-based annotation grounded in annotator histories rather than labels alone.


[70] VLMs May Not Globally Enhance Human Alignment over LLMs During Natural Reading cs.CL | q-bio.NCPDF

Jinzhou Wu, Zhengwu Ma, Jixing Li, Baoping Tang, Zitong Lu

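TL;DR: Using tightly matched LLM and VLM pairs in a text-only setting, this study evaluates how multimodal pretraining affects alignment with human language processing. Multimodal pretraining does not confer a global advantage in human alignment during natural reading, and language-internal representations remain the key factor. However, a VLM advantage emerges selectively when sentences contain stronger visual semantic content.

Details

Motivation: The study asks whether vision-language learning makes text representations more human-like during natural reading, isolating the effect of multimodal training history to clarify what VLMs actually contribute over LLMs in human alignment.

Result: Evaluated on a human natural-reading dataset with whole-cortex fMRI responses and synchronized eye tracking, VLMs show no global alignment advantage, but for sentences with stronger visual semantic content, both fMRI and eye-movement alignment indicate a selective VLM advantage.

Insight: The study establishes a tightly controlled framework for assessing the effect of multimodal training history, and shows that multimodal pretraining contributes selectively rather than globally to human-like language representations, underscoring the central role of language-internal representations.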

Abstract: Large language models (LLMs) have become increasingly useful computational models of human language processing, but it remains unclear whether vision-language learning makes text representations more human-like during natural reading. Here, we address this question by comparing tightly matched LLM and vision-language model (VLM) pairs under a strictly text-only setting, allowing us to isolate the effect of multimodal training history from online visual input or cross-modal fusion. We evaluate model alignment with a human natural-reading dataset that includes whole-cortex fMRI responses and synchronized eye-tracking saccades. Our findings demonstrate that multimodal pretraining may not confer a uniform, global advantage in human alignment during natural reading, indicating that language-internal representations remain the key factor for modeling human text processing. However, the VLM advantage could emerge more selectively when sentences contain stronger visual semantic content, with converging evidence from both fMRI and eye-movement alignments. Together, our findings provide a controlled in silico framework for testing how visual learning history shapes model-human alignment of language processing, suggesting that multimodal pretraining contributes selectively rather than globally to human-like language representations during natural reading.


cs.CV [Back]

[71] From Affect to Complex Behavior: Advancing Multimodal Human-Centered AI at the 10th ABAW Workshop & Competition cs.CVPDF

Dimitrios Kollias, Panagiotis Tzirakis, Alan Cowen, Stefanos Zafeiriou, Irene Kotsia

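TL;DR: The 10th ABAW Workshop and Competition, held at CVPR 2026, advances research on modelling, analysing, and understanding human affect and behavior in real-world, unconstrained environments. The workshop comprises a competition and a paper track. The competition sets challenges in continuous affect estimation, discrete affect recognition, and complex behavior analysis, benchmarked on large-scale in-the-wild datasets. The paper track covers pose, motion, and behavior estimation, affect modelling and multimodal learning, datasets and evaluation, fairness, and other directions.

Details

Motivation: The workshop addresses the challenge of accurately modelling and understanding human affect and behavior in real-world, unconstrained environments, and aims to advance human-centered multimodal AI systems.

Result: Through its competition, the workshop provides several large-scale in-the-wild datasets as benchmarks for evaluating state-of-the-art approaches on continuous affect estimation, discrete affect recognition, emotional mimicry intensity estimation, ambivalence/hesitancy recognition, and fine-grained violence detection.

Insight: The workshop extends the scope from basic affect analysis to more complex, context-dependent behavior understanding tasks (such as emotional mimicry and hesitancy recognition) and brings in topics such as multimodal learning, fairness, and robustness, providing a comprehensive platform and benchmark for next-generation human-centered AI systems.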

Abstract: The 10th Affective & Behavior Analysis in-the-Wild (ABAW) Workshop and Competition, held at CVPR 2026, continues to advance research on the modelling, analysis, and understanding of human affect and behavior in real-world, unconstrained environments. The workshop maintains its dual structure, comprising both a competition and a paper track. The ABAW Competition introduces a diverse set of challenges targeting key aspects of affective and behavioral understanding, including continuous affect (valence-arousal) estimation, discrete affect (expression and action unit) recognition, as well as more complex behavior analysis tasks, such as emotional mimicry intensity estimation, ambivalence/hesitancy recognition, and fine-grained violence detection. These challenges are built upon large-scale in-the-wild datasets, providing comprehensive benchmarks for state-of-the-art approaches. In parallel, the paper track presents a wide range of contributions spanning pose, motion & behavior estimation; affect modelling & multimodal learning; benchmarks, datasets & evaluation protocols; and fairness, robustness & deployment. Overall, the 10th ABAW Workshop and Competition continues to serve as a key platform for benchmarking, collaboration, and innovation, shaping the development of next-generation multimodal, human-centered AI systems.


[72] Fine-Tuning Vision-Language Models for Understanding Current Damage and Scoring Priority with Quality Guard Agent cs.CVPDF

Takato Yasuno

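TL;DR: This paper proposes an automated bridge damage assessment method based on a fine-tuned vision-language model (VLM), addressing inconsistent manual ratings and the aging of expert engineers in Japanese bridge inspection. It fine-tunes LLaVA-1.5-7B with QLoRA, uses a rule-based scoring engine to produce a five-level repair priority index, and filters low-quality outputs with a two-stage Quality Guard.

Details

Motivation: Damage ratings (levels a-e) assigned by different engineers in Japan's periodic bridge inspections vary significantly, and the aging of skilled engineers threatens inspection capacity. The paper aims at automated, standardized infrastructure management.

Result: Evaluated on 800 test images, the model outputs natural language descriptions that identify structural members and damage patterns. A progressive training study shows that 2k training samples reach near-optimal validation loss in 2.9 hours, and semantic similarity peaks at 3k samples (0.6909). After inference optimization, per-image processing time drops to 10.06 seconds, a 70.2% reduction over the unoptimized baseline.

Insight: Contributions include: 1) applying VLM fine-tuning to damage understanding and priority scoring in a specialized engineering domain (bridge inspection); 2) a training strategy that favors data quality over quantity, showing that mid-scale, quality-curated data outperforms larger but noisier data; 3) a two-stage Quality Guard (based on a Swallow-8B SLM) that filters low-quality VLM outputs and prevents spurious scores.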

Abstract: Bridge inspection in Japan requires mandatory visual assessments every five years, yet qualitative damage ratings (levels a-e) assigned by different engineers exhibit significant inter-rater variability – a critical barrier to consistent infrastructure management. The aging of skilled engineers further threatens inspection capacity. This paper presents a methodology for automating bridge damage understanding and repair priority scoring using fine-tuned Vision-Language Models (VLMs). We fine-tune LLaVA-1.5-7B with QLoRA on up to 4,000 paired bridge damage images and inspection text records, then evaluate on a fixed test set of 800 images. The model outputs natural language descriptions identifying structural members and damage patterns, from which a rule-based scoring engine calculates a five-level repair priority index. A progressive training study (1k/2k/3k/4k samples) reveals that 2k training samples achieve near-optimal validation loss in only 2.9 hours of training; beyond 2k, validation loss improves by no more than 0.2% per doubling of training samples, exhibiting clear diminishing returns. Furthermore, semantic similarity on the held-out test set peaks at 3k (0.6909) and degrades at 4k (0.6739), indicating that quality-curated mid-scale data outperforms larger but noisier corpora. Inference optimization combining torch.compile() and batch processing (batch_size=8) achieves 10.06 seconds per image – a 70.2% reduction over the unoptimized baseline. Our approach contributes to data governance in bridge inspection, reduces inter-rater variability, and provides AI-assisted triage to augment expert engineers in inspection workflows. Furthermore, we introduce a two-stage Quality Guard using a fine-tuned Swallow-8B SLM to reject low-quality VLM outputs before priority scoring, preventing spurious scores from damaged or unrecognised images.
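
The reported latency figures imply an unoptimized baseline that the abstract does not state directly. The short Python calculation below recovers it, assuming the 70.2% reduction is measured relative to the baseline per-image time. The 800-image total additionally assumes that every test image takes the average 10.06 seconds. Variable names are illustrative.

```python
optimized_s = 10.06   # seconds per image after optimization (from the abstract)
reduction = 0.702     # 70.2% reduction over the unoptimized baseline

# optimized = baseline * (1 - reduction)  =>  baseline = optimized / (1 - reduction)
baseline_s = optimized_s / (1.0 - reduction)

# Rough wall-clock time for the 800-image test set at the optimized rate.
test_images = 800
total_hours = test_images * optimized_s / 3600.0

print(f"implied baseline: {baseline_s:.2f} s/image")
print(f"800 images at optimized rate: {total_hours:.2f} h")
```

Under these assumptions the baseline works out to roughly 33.8 seconds per image, and a full pass over the test set takes a little over two hours at the optimized rate.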


[73] D$^2$Turb: Depth-Aware Simulation and Decoupled Learning for Single-Frame Atmospheric Turbulence Mitigation cs.CVPDF

Zixiao Hu, Tianyu Li, Guoqing Wang, Wei Li, Guoguo Xin

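TL;DR: This paper proposes D$^2$Turb, a unified framework for single-frame atmospheric turbulence image restoration. It introduces a depth-aware turbulence synthesis protocol that incorporates scene depth into the physical simulation, and adopts a decoupled learning strategy that splits restoration into two interacting stages, texture deblurring and geometric rectification, to address the difficulty existing methods have in balancing texture recovery against geometric rectification.

Details

Motivation: Single-frame atmospheric turbulence restoration is ill-posed because spatially varying blur is coupled with non-rigid geometric distortion. Existing end-to-end methods trained on flat-field simulations struggle to balance texture recovery with geometric rectification.

Result: Extensive experiments on synthetic and real-world datasets show that D$^2$Turb achieves state-of-the-art performance in both texture recovery and geometric fidelity.

Insight: Main contributions: 1) a Depth-Aware Turbulence Synthesis protocol that brings scene depth into the physical simulation and provides a crucial intermediate supervision signal for decoupled learning; 2) a framework that decouples restoration into two interacting stages, texture deblurring and geometric rectification; 3) an Adaptive Structural Prior Injection mechanism that mitigates the information fragmentation common in cascaded designs by dynamically transferring deep structural representations to guide dense flow prediction for spatial unwarping.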

Abstract: Single-frame atmospheric turbulence mitigation is inherently ill-posed due to spatially varying blur coupled with non-rigid geometric distortion. Existing end-to-end approaches trained on flat-field simulations often struggle to balance texture recovery with geometric rectification. To overcome this limitation, we propose D$^2$Turb, a unified framework that bridges physics-grounded simulation with explicitly decoupled restoration. First, we introduce a Depth-Aware Turbulence Synthesis protocol that incorporates scene depth into the phase-to-space formulation. This generates physically consistent, depth-dependent degradations and provides a crucial intermediate tilt supervision signal for disentangled learning. Building upon this simulation engine, D$^2$Turb decomposes restoration into two interactive stages: texture deblurring and geometric rectification. The texture deblurring stage employs a deblurring backbone to recover fine-grained details while preserving geometric distortion for the subsequent rectification stage. To mitigate the information fragmentation commonly observed in cascaded designs, we further propose an Adaptive Structural Prior Injection (ASPI) mechanism that dynamically transfers deep structural representations from the deblurring module to guide dense flow prediction for spatial unwarping. Extensive experiments demonstrate that D$^2$Turb achieves state-of-the-art performance on both synthetic and real-world datasets, with consistent improvements in both texture recovery and geometric fidelity. Our code and pre-trained models are publicly available at https://github.com/HertzDot222/D2Turb.


[74] AdaMerge: Salience-Aware Adaptive Token Merging for Training-Free Acceleration of Vision Transformers cs.CV | cs.AIPDF

Semi Lee, Hyejin Go, Hyesong Choi

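TL;DR: This paper proposes AdaMerge, a training-free token merging framework for accelerating Vision Transformers (ViTs). Through two mechanisms, salience-weighted similarity and adaptive merging intensity, it addresses the information loss that existing token merging methods incur by ignoring differences in token importance. On ImageNet-1k, AdaMerge outperforms ToMe, PiToMe, and DSM at matched FLOPs, with a larger advantage at high compression ratios.

Details

Motivation: The quadratic cost of self-attention in Vision Transformers is a major bottleneck for practical deployment. Existing training-free token merging methods such as ToMe implicitly treat all tokens as equally important, which conflicts with the non-uniformity of self-attention and causes information loss in high-salience tokens under aggressive compression.

Result: On ImageNet-1k with ViT-B/16, AdaMerge outperforms ToMe, PiToMe, and DSM in all FLOPs-matched settings. At the 13.4G FLOPs operating point, AdaMerge loses only 1.06% Top-1 accuracy, compared with 1.45% for PiToMe and 4.62% for DSM, advancing the accuracy-FLOPs Pareto frontier of ViT acceleration.

Insight: The contribution is combining salience-weighted similarity (using feature-affinity centrality as a proxy for token importance) and adaptive per-layer merging intensity (adjusting each layer's reduction count according to input-specific redundancy) in a single training-free framework, yielding, to the authors' knowledge, the first such framework with explicit modeling of token non-uniformity and input-adaptive compression.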

Abstract: The quadratic cost of self-attention in Vision Transformers (ViTs) constitutes a fundamental bottleneck for practical deployment, motivating a vibrant line of research on token reduction. Among existing approaches, token merging (ToMe) has emerged as an elegant training-free solution; yet its design rests on an unspoken premise of token equality, which contravenes the well-documented non-uniformity of self-attention and leads to information loss in high-salience tokens under aggressive compression. We address this limitation with AdaMerge, a token-merging framework based on two complementary mechanisms. First, salience-weighted similarity leverages column-wise feature-affinity centrality as a token-importance proxy and incorporates the resulting salience scores into the bipartite matching score, ensuring that pivotal tokens contribute more strongly to the merged representation. Second, adaptive merging intensity uses pre-computed layer-wise similarity statistics to dynamically modulate the per-layer reduction count in accordance with input-specific redundancy. On ImageNet-1k with ViT-B/16, AdaMerge consistently outperforms ToMe, PiToMe, and DSM across all FLOPs-matched regimes. The accuracy gap widens monotonically with compression: at the 13.4G FLOPs operating point, AdaMerge sustains a Top-1 degradation of only -1.06%, compared to -1.45% for PiToMe and -4.62% for DSM. To our knowledge, AdaMerge is the first to combine salience-weighted similarity and adaptive per-layer reduction into a single training-free token merging framework, advancing the accuracy-FLOPs Pareto frontier of ViT acceleration.
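
The first mechanism can be pictured with a small pure-Python sketch. It merges the single most similar pair across an even/odd bipartite split (the split used by ToMe) and weights the merge by a salience score computed as each token's summed cosine affinity to the others. The abstract does not specify how salience enters the matching score, so the weighting rule here, the function names, and the toy tokens are all assumptions rather than AdaMerge's actual formulation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)


def salience(tokens):
    """Column-wise centrality: each token's summed affinity to all other tokens."""
    n = len(tokens)
    return [sum(cosine(tokens[i], tokens[j]) for i in range(n) if i != j)
            for j in range(n)]


def merge_most_similar(tokens):
    """Merge the most similar (A, B) pair using salience as merge weights."""
    sal = salience(tokens)
    a_idx = range(0, len(tokens), 2)  # set A: even positions
    b_idx = range(1, len(tokens), 2)  # set B: odd positions
    i, j = max(((a, b) for a in a_idx for b in b_idx),
               key=lambda p: cosine(tokens[p[0]], tokens[p[1]]))
    # Clamp so weights stay positive even if a salience score is negative.
    wi = max(sal[i], 0.0) + 1e-6
    wj = max(sal[j], 0.0) + 1e-6
    merged = [(wi * x + wj * y) / (wi + wj) for x, y in zip(tokens[i], tokens[j])]
    kept = [t for k, t in enumerate(tokens) if k not in (i, j)]
    return kept + [merged], (i, j)


# Hypothetical 2-D token features.
tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.5, 0.5]]
merged_tokens, pair = merge_most_similar(tokens)
```

With a plain average, both merged tokens would contribute equally. Weighting by salience lets the more central token dominate the merged representation, which is the intuition the abstract gives for protecting high-salience tokens.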


[75] What-If World: A Causal Benchmark for General World Models in Embodied Scenarios cs.CVPDF

Kunlin Cai, Rui Song, Jinghuai Zhang, Kaiyuan Zhang, Pranav Bodapati

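TL;DR: The paper proposes What-If World, a benchmark for evaluating the causal reasoning of video generation models used as world simulators in embodied scenarios. It contains 319 prompt pairs built on real frames from nuScenes and DROID. Each pair describes the same scene with one physical detail varied, and the physical consistency of the generated videos is assessed with the four-part APEO rubric.

Details

Motivation: When video generation models serve as world simulators (for example, in driving and robotic manipulation), what matters is not whether a single video looks realistic but whether the model's output responds to input changes in a physically correct way. Existing benchmarks score each video in isolation and therefore cannot detect failures of causal reasoning.

Result: Across nine state-of-the-art models, no system exceeds 52% on the paired score, and open-source models cluster near 28%. Every model fails on a large fraction of causal interventions, indicating that they remain far from reliably supporting action-conditioned simulation or model-based planning. Visually pronounced interventions score up to 40.4%, while some visually subtle ones score as low as 14.2%.

Insight: The contribution is the first benchmark focused on evaluating the causal reasoning of world models, using carefully designed prompt pairs and the APEO rubric (prompt adherence, physical consistency, scene preservation, and correct differential outcome) to quantify model shortcomings systematically. The work reveals a widespread lack of physical understanding in current video generation models and indicates that performance tracks the visual prominence of an intervention rather than the tractability of its underlying physics.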

Abstract: Video generation models are increasingly used as world simulators for tasks like driving and robotic manipulation. What matters in these settings is not whether a single video looks right, but whether the model’s output changes when its input changes. We test this by giving a model two prompts describing the same scene with one physical detail varied, and checking whether the two videos diverge the way physics predicts. The wording difference between the prompts is small by design, since only one variable is changed, but the correct physical difference is not. A model that misses this can still produce two videos that each look plausible individually, and existing benchmarks score videos one at a time and cannot detect this failure. We introduce What-If World, 319 such prompt pairs built on real frames from nuScenes and DROID, organized by a taxonomy of six physical variables shared across driving and manipulation. Each pair is scored with APEO, a four-part rubric checking whether each video follows its prompt (Adherence), is physically consistent (Physics), preserves the shared scene (Environment), and ends in the correct difference (Outcome). Across nine state-of-the-art models, no system exceeds 52% on the paired score, and open-source models cluster near 28%. Every model tested fails on a large fraction of causal interventions, indicating substantial room before these models can reliably support action-conditioned simulation or model-based planning. Where models do score well, performance appears to track the visual prominence of the intervention rather than the tractability of its underlying physics. Some visually subtle interventions score as low as 14.2%, while visually pronounced ones reach 40.4%.


[76] Hallucination Behavior in Multimodal LLMs Across Agricultural Image Interpretation and Generation Tasks cs.CV | cs.AIPDF

Partho Ghose, Al Bashir, Prem Raj, Azlan Zahid

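TL;DR: This study systematically evaluates the hallucination behavior of multimodal large language models in agricultural image interpretation and generation tasks. In image interpretation, zero-shot accuracy is modest (63-75%) and few-shot prompting raises it to as much as 86.8%, though false detections and missed infections remain. In text-to-image generation, advanced models such as GPT-5 and Gemini 2.5 Flash produce up to 91% biologically inconsistent scenes under relaxed prompt constraints.

Details

Motivation: Large language models used in agricultural imaging frequently produce hallucinated outputs that appear confident yet deviate from biological or environmental reality, which can lead to wrong agronomic decisions. The study systematically analyzes hallucinations in two directions: image-to-text interpretation and text-to-image generation.

Result: In image interpretation, models such as Gemma, LLaVA, Qwen, and MiniCPM reach 63-75% zero-shot accuracy, rising to as much as 86.8% with few-shot prompting. In text-to-image generation, GPT-5 and Gemini 2.5 Flash produce up to 91% biologically inconsistent scenes under relaxed prompt constraints.

Insight: The paper presents the first systematic evaluation of hallucination behavior of multimodal LLMs in agriculture, exposing fundamental weaknesses in biological consistency, contextual accuracy, and agronomic plausibility. The analysis shows that few-shot prompting mitigates but does not eliminate hallucinations, an insight relevant to improving the reliability of agricultural imaging platforms.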

Abstract: Large Language Models (LLMs) are being rapidly adopted in agricultural imaging applications, ranging from crop interpretation to synthetic field image generation. However, these models frequently exhibit hallucinations, that is, outputs that appear confident yet deviate from biological or environmental reality, potentially leading to misinformed agronomic insights. This study investigates such hallucinations in two complementary directions: image-to-text, where LLMs interpret crop or field imagery to describe conditions such as biotic and abiotic stresses, and text-to-image, where models generate synthetic agricultural scenes based on descriptive prompts. We examine errors involving biological inconsistency, contextual inaccuracy, and agronomic implausibility, evaluating the outputs under domain-informed criteria across multiple imaging modalities. Our analysis identifies recurring hallucination patterns within both interpretive and generative tasks. In image interpretation, LLMs (e.g., Gemma, LLaVA, Qwen, and MiniCPM) achieved modest zero-shot accuracy (63 to 75 percent), whereas few-shot prompting improved performance up to 86.8 percent while still exhibiting false detections and missed infections, indicating residual hallucination effects. In text-to-image tasks, advanced models such as GPT-5 and Gemini 2.5 Flash generate up to 91 percent biologically inconsistent scenes under relaxed prompt constraints, revealing fundamental weaknesses in current LLMs. This systematic assessment of visual reasoning and generation offers critical insights toward enhancing the reliability and trustworthiness of LLM-based agricultural imaging platforms.


[77] ForestHG-Trace: Traceable Long-Horizon Ecological Reasoning over Large-Scale Forest Scenes cs.CV | cs.MMPDF

Zihang Cheng, Duanchu Wang, Cheng Li, Jing Huang, Huanzhao Fu

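TL;DR: This paper proposes ForestHG-Trace, a framework for traceable long-horizon ecological reasoning over forest remote sensing scenes. It represents multimodal forest scenes as ecological hypergraphs and uses an LLM-guided agent that invokes deterministic tools for multi-step reasoning, producing replayable execution traces and evidence records.

Details

Motivation: Remote sensing question answering (RS-QA) over large-scale forest scenes is challenging: conventional methods struggle with complex ecological analysis that requires multi-step filtering, numerical aggregation, neighborhood reasoning, and verifiable evidence.

Result: On the newly constructed ForestTraceQA benchmark, ForestHG-Trace substantially improves answer accuracy and execution faithfulness over single-step baselines and scene-graph agents, and the results identify execution depth as the main bottleneck for long-horizon ecological QA.

Insight: The contribution is modelling forest scenes as ecological hypergraphs that support higher-order reasoning beyond conventional pairwise scene graphs, together with a traceable reasoning framework that combines LLM guidance with deterministic tool calls so that the process is auditable and reproducible.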

Abstract: Remote sensing question answering (RS-QA) often requires more than direct semantic prediction, especially in large-scale forest scenes where ecological analysis involves multi-step filtering, numerical aggregation, neighborhood reasoning, and verifiable evidence. We introduce ForestHG-Trace, a framework for traceable long-horizon ecological reasoning over forest environments. It represents multimodal NEON forest scenes as ecological hypergraphs, where tree instances, spatial units, semantic groups, and neighborhood relations support higher-order reasoning beyond pairwise scene graphs. An LLM-guided agent then invokes deterministic tools for reading, filtering, expansion, aggregation, comparison, and auditing, producing replayable execution traces and compact evidence records rather than only free-form answers. We further construct ForestTraceQA, an executable benchmark for evaluating ecological QA across diverse task types and reasoning depths. Experiments show that ForestHG-Trace substantially improves answer accuracy and execution faithfulness over single-step baselines and scene-graph agents, while highlighting execution depth as the main bottleneck for long-horizon ecological QA.


[78] Tensor Memory: Fixed-Size Recurrent State for Long-Horizon Transformers cs.CV | cs.AIPDF

Kabir Swain, Sijie Han, Daniel Karl I. Weidele, Mauro Martino, Antonio Torralba

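TL;DR: This paper proposes Tensor Memory, a lightweight module that augments Transformer blocks with a fixed-size recurrent 3D memory tensor, addressing memory growth and the lack of explicit spatial state in long-sequence processing. Tokens store content in a voxel grid through a differentiable soft write, the memory is updated through local interaction and gated recurrent dynamics, and context is read back through continuous sampling with gated residual fusion.

Details

Motivation: Transformers process images and videos by flattening space and time into long token sequences, so the memory used by attention and KV caching grows with sequence length, and there is no explicit, persistent spatial state. This makes long-horizon video understanding and occlusion-sensitive reasoning difficult.

Result: The module is evaluated on standard language, image, and video benchmarks and on a controlled toy diagnostic suite designed to isolate when persistent state is beneficial. Tensor Memory integrates with standard Transformer training pipelines and can be attached to or removed from existing blocks without other architectural changes.

Insight: The contribution is a fixed-size recurrent 3D memory tensor that preserves a spatial inductive bias through differentiable soft writes, local interaction, and gated recurrent dynamics, decoupling state capacity from input length and offering an efficient, flexible approach to long-sequence processing.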

Abstract: Transformers process images and videos by flattening space and time into long token sequences. While attention and KV caching preserve past features, their memory grows with sequence length and they lack an explicit, persistent spatial state, making long-horizon video understanding and occlusion-sensitive reasoning difficult. We propose Tensor Memory, a lightweight module that augments Transformer blocks with a fixed-size recurrent 3D memory tensor: tokens write into a voxel grid via a differentiable soft write that deposits content as a Gaussian-weighted volume around a predicted continuous 3D location, the memory is updated with an efficient local interaction operator and gated recurrent dynamics, and tokens read back context via continuous sampling with gated residual fusion. Because the memory tensor has a constant size, Tensor Memory decouples state capacity from input length while preserving a spatial inductive bias. We evaluate the module on standard language, image, and video benchmarks and on a controlled toy diagnostic suite designed to isolate when persistent state is beneficial; it integrates with standard Transformer training pipelines and can be attached to or removed from existing blocks without other architectural changes.
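
The soft write can be sketched in a few lines of plain Python. The sketch deposits a scalar value into a small voxel grid with Gaussian weights centered on a continuous 3D position. It is a simplification under stated assumptions: the paper writes vector content with learned, differentiable components, while here the scalar content, the weight normalization, the `sigma` value, and the function name are all illustrative choices.

```python
import math


def soft_write(memory, position, content, sigma=1.0):
    """Add `content` to the grid, spread by a normalized Gaussian around `position`.

    `memory` is a nested list indexed [z][y][x]; `position` is a continuous
    (z, y, x) location in voxel coordinates. Normalizing the weights so they
    sum to 1 is an assumption of this sketch.
    """
    depth, height, width = len(memory), len(memory[0]), len(memory[0][0])
    weights = {}
    for z in range(depth):
        for y in range(height):
            for x in range(width):
                d2 = ((z - position[0]) ** 2
                      + (y - position[1]) ** 2
                      + (x - position[2]) ** 2)
                weights[(z, y, x)] = math.exp(-d2 / (2.0 * sigma ** 2))
    total = sum(weights.values())
    for (z, y, x), w in weights.items():
        memory[z][y][x] += content * w / total
    return memory


# A 3x3x3 memory stays 3x3x3 no matter how many tokens write into it.
memory = [[[0.0] * 3 for _ in range(3)] for _ in range(3)]
soft_write(memory, position=(1.0, 1.0, 1.0), content=1.0)
soft_write(memory, position=(0.2, 1.5, 2.0), content=2.0)
stored = sum(v for plane in memory for row in plane for v in row)
```

The grid size is fixed up front, so the state does not grow with the number of writes, which is the decoupling of state capacity from input length that the abstract emphasizes.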


[79] Bounded-Compute Multimodal Regression for Product-Rating Prediction cs.CVPDF

William Leach, Ru He, Sizhuo Ma, Yizhen Jia, Min Cao

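TL;DR: This paper proposes a bounded-compute multimodal regression method for product-rating prediction. It adapts SmolVLM2-256M-Video-Instruct by replacing the language-modeling head with a lightweight MLP and using fixed-size images and truncated metadata, enabling efficient scalar regression under strict latency budgets.

Details

Motivation: Existing vision-language models (VLMs) rely on autoregressive text generation and dynamic visual processing, which makes scalar regression under strict latency budgets inefficient. A compute-bounded adaptation is therefore needed.

Result: In the official evaluation of the LoViF 2026 Efficient VLM challenge, the 228M-parameter model achieves 0.39 PLCC and 0.40 CES, providing a strong, reproducible baseline for resource-constrained multimodal regression.

Insight: The contribution is replacing token-based score generation with feature-based regression (a lightweight MLP) and adopting static global image processing with deterministic inputs, which significantly reduces compute while maintaining performance.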

Abstract: Vision-language models (VLMs) are increasingly attractive for multimodal quality assessment, but their default reliance on autoregressive text generation and dynamic visual processing is poorly matched to scalar regression under strict latency budgets. We present a bounded-compute adaptation of SmolVLM2-256M-Video-Instruct for product-rating prediction in the LoViF 2026 Efficient VLM challenge. Motivated by recent multimodal engagement-prediction results showing that feature-based regression can outperform token-based score generation, we replace the language-modeling head with a lightweight two-layer MLP fed by pooled decoder states, and we enforce deterministic inputs through fixed 384x384 images and truncated metadata. Across controlled ablations, static global image processing slightly outperforms dynamic tiling, and scaling from 100K to 16M training examples substantially improves validation correlation. Under the official held-out evaluation, our 228M-parameter model achieves 0.39 PLCC and 0.40 CES, providing a strong and reproducible baseline for resource-constrained multimodal regression.
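
Two pieces of this setup are easy to show in isolation: a two-layer MLP that maps a pooled feature vector to one scalar, and the Pearson linear correlation coefficient (PLCC) used for scoring. The Python sketch below is illustrative only. The ReLU activation, the toy weights, and the function names are assumptions, and the abstract does not say which pooling or activation the authors use.

```python
import math


def mlp_head(pooled, w1, b1, w2, b2):
    """Two-layer MLP: hidden = ReLU(W1 @ pooled + b1), output = w2 . hidden + b2."""
    hidden = [max(0.0, sum(w * x for w, x in zip(row, pooled)) + b)
              for row, b in zip(w1, b1)]
    return sum(w * h for w, h in zip(w2, hidden)) + b2


def plcc(pred, target):
    """Pearson linear correlation coefficient between predictions and targets."""
    n = len(pred)
    mean_p = sum(pred) / n
    mean_t = sum(target) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_t = sum((t - mean_t) ** 2 for t in target)
    if var_p == 0 or var_t == 0:
        raise ValueError("PLCC is undefined for constant inputs")
    return cov / math.sqrt(var_p * var_t)


# Toy weights and a 2-dimensional pooled feature, for illustration only.
score = mlp_head([1.0, 2.0], w1=[[1.0, 0.0], [0.0, -1.0]], b1=[0.0, 0.0],
                 w2=[1.0, 1.0], b2=0.5)
perfect = plcc([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
inverse = plcc([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
```

A head of this kind produces the rating in a single forward pass with no token-by-token decoding, which is what keeps the compute bounded.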


[80] AndroidDaily: A Verifiable Benchmark for Mobile GUI Agents on Real-World Closed-Source Applications cs.CV | cs.SEPDF

Yifan Sui, Xin Huang, Hongbing Li, Fang Xu, Jiahe Lv

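TL;DR: This paper proposes AndroidDaily, a benchmark for evaluating mobile GUI agents on real-world closed-source applications, and designs the GRADE evaluator to enable automatic verification. The benchmark contains 350 daily-use tasks across 94 high-frequency Android applications, addressing the difficulty of automatically evaluating closed-source applications that expose no internal state.

Details

Motivation: Most existing GUI agent benchmarks rely on simulated environments or open-source applications and cannot evaluate real closed-source applications, because these do not expose internal states and traditional automatic verification does not apply.

Result: The GRADE evaluator reaches 87.37% agreement with human evaluators. The strongest current model achieves a 62.0% success rate on AndroidDaily, revealing a substantial gap between current reasoning capabilities and practical execution of mobile workflows.

Insight: The contribution includes GRADE, a three-tiered evaluation system built on observable external guidelines (operational obligations, output quality, and negative constraints), which turns long-horizon, open-ended mobile interactions into verifiable evaluation without relying on hidden internal states.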

Abstract: The rapid development of GUI foundation models and mobile GUI agents has spurred numerous evaluation benchmarks, yet most rely on simulated environments or open-source applications, leaving real-world closed-source applications largely unevaluated. The core difficulty is that closed-source applications do not expose internal states, making traditional automatic verification inapplicable. To bridge this gap, we introduce AndroidDaily, a large-scale benchmark comprising 350 realistic daily-use tasks across 94 high-frequency Android applications spanning transportation, shopping, local services, entertainment, content creation, social media, and everyday utilities. To enable automatic and verifiable assessment in these opaque environments, we propose Guideline-grounded Reviewer for Automatic Diagnostic Evaluation (GRADE), a process-aware evaluator built on a three-tiered system of observable external guidelines: operational obligations, output quality, and negative constraints. GRADE tracks the agent’s visual trajectory against these criteria and produces step-level diagnostic judgments, turning long-horizon, open-ended mobile interactions into verifiable evaluation without relying on hidden internal states. Experiments show that GRADE achieves 87.37% agreement with human evaluators. The strongest model reaches a 62.0% success rate on AndroidDaily, highlighting a substantial gap between current reasoning capabilities and practical execution in realistic mobile workflows.
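
How step-level judgments might roll up into a task verdict, and how agreement with human evaluators can be computed, is sketched below in Python. The three tier names follow the abstract, but the aggregation rule (a task passes only if no tier has a failed check), the record fields, and the function names are assumptions of this sketch, not GRADE's actual procedure.

```python
TIERS = ("obligation", "quality", "negative")


def failed_tiers(step_judgments):
    """Return the tiers that have at least one failed step-level check."""
    return [tier for tier in TIERS
            if any(s["tier"] == tier and not s["ok"] for s in step_judgments)]


def task_verdict(step_judgments):
    """Assumed rule: the task succeeds only when no tier has a failed check."""
    return not failed_tiers(step_judgments)


def agreement_rate(auto_verdicts, human_verdicts):
    """Fraction of tasks where the automatic and human verdicts match."""
    matches = sum(a == h for a, h in zip(auto_verdicts, human_verdicts))
    return matches / len(human_verdicts)


# Hypothetical step-level judgments for one trajectory.
trajectory = [
    {"tier": "obligation", "ok": True},   # e.g. opened the required screen
    {"tier": "quality", "ok": True},      # e.g. final output matches the request
    {"tier": "negative", "ok": False},    # e.g. performed a forbidden action
]
verdict = task_verdict(trajectory)
diagnosis = failed_tiers(trajectory)
rate = agreement_rate([True, False, True, True], [True, False, False, True])
```

Keeping the failed tiers alongside the verdict gives a diagnostic reason for each failure, in the spirit of the step-level diagnostic judgments the abstract describes.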


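[81] Can Segmentation Models Understand the World? Towards Proactive Affordance Reasoning via Visual Chain-of-Thought cs.CV | cs.AI

Yuchen Guo, Junli Gong, Hongmin Cai, Yiu-ming Cheung, Weifeng Su

TL;DR: This paper proposes SegWorld, which reasons proactively through a multi-level visual chain-of-thought (CoT) to overcome the limits of existing segmentation models on intent-level instructions, i.e., instructions that describe a desired outcome rather than a specific region. Before receiving an instruction, the model observes the scene, describes visible objects, and infers plausible events. Given an instruction, it reasons step by step from the intent-relevant object, through the action that satisfies the intent, to the physical interaction site (the object part that affords the action).

Details

Motivation: Existing segmentation models couple large language models (LLMs) with mask decoders to produce masks from complex language expressions, but their instructions remain target-referential: they describe, constrain, or imply the region to segment. In real-world embodied interaction, human instructions are often intent-level, stating only the desired outcome without naming the region that enables it. The paper aims to bridge this gap.

Result: Affordance-bearing part segmentation is evaluated on a newly constructed intent-to-part benchmark. SegWorld matches instruction-driven baselines on target-referential instructions and improves substantially on intent-level ones.

Insight: The paper introduces a proactive visual chain-of-thought for multi-level reasoning (from object to action to interaction site) and formalizes SegWorld as probabilistic inference, in which the linguistic scene context supplied by proactive observation improves mask prediction under intent-level instructions. This offers a new direction for segmentation models that reason about affordances.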

Abstract: Recent segmentation models couple large language models (LLMs) with mask decoders to ground complex language expressions into masks, yet their instructions remain target-referential: they describe, constrain, or imply the region to be segmented. However, in real-world embodied interaction, human instructions are often at the intent-level, which includes the desired outcome without naming the region that enables it. To bridge this gap, we introduce SegWorld, where the model reasons about the scene through a multi-level visual chain-of-thought (CoT) before committing to a mask. Before receiving any instructions, it proactively observes the scene, describing visible objects and inferring plausible events they may support. Given an instruction, it continues the chain: from the object relevant to the intent, through the action that satisfies it, to the physical interaction site, the object part that affords the action. We formalize SegWorld as probabilistic inference, in which proactive observation supplies a linguistic scene context that improves mask prediction when instructions are given at the level of intent. We construct an intent-to-part benchmark for evaluating affordance-bearing part segmentation from high-level goals. Experiments show SegWorld matches instruction-driven baselines on target-referential instructions and improves substantially on intent-level ones.


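[82] CuriosAI Submission to the CASTLE Challenge at EgoVis 2026 cs.CV

Yuto Kanda, Hayato Tanoue, Takayuki Hori

TL;DR: This report describes the CuriosAI team's two approaches to the CASTLE Challenge at EgoVis 2026, which asks 185 multiple-choice questions over more than 600 hours of multi-view egocentric video. On top of a shared multimodal preprocessing layer, the team explores SVA (Search-Verify-Answer) and TMKG (Temporal-Multimodal-Knowledge-Graph). SVA is a three-stage pipeline that answers questions through hierarchical search, VLM verification under anti-confabulation rules, and LLM evidence fusion. TMKG builds a temporal multimodal knowledge graph, locates key information via graph search, and generates the answer with a single grounded VLM. SVA reaches a leaderboard accuracy of 0.50 and was chosen as the final submission; TMKG reaches 0.35.

Details

Motivation: Answering complex questions over long, multi-view egocentric video, a task involving large volumes of synchronized video and multiple-choice questions, requires effective integration of visual, textual, and temporal information.

Result: On the CASTLE 2026 leaderboard, SVA reaches an accuracy of 0.50 and TMKG reaches 0.35; SVA was selected as the final submission.

Insight: Contributions include a shared multimodal preprocessing layer (per-person timelines, speaker-resolved transcripts, and multi-VLM caption ensembles); SVA's hierarchical search, VLM verification under anti-confabulation rules, and LLM evidence fusion; and TMKG's temporal multimodal knowledge graph construction and graph search strategy. Together they illustrate the potential of combining structured search with multimodal reasoning for complex video question answering.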

Abstract: CASTLE 2026 asks 185 multiple-choice questions over 600+ hours of synchronized multi-view egocentric video. We explore two approaches on top of a shared multimodal preprocessing layer, including per-person timelines, speaker-resolved transcripts, and multi-VLM caption ensembles. Approach A, SVA: Search-Verify-Answer, is a three-stage pipeline that hierarchically narrows to a primary window, verifies sub-windows with a VLM under four anti-confabulation rules, and fuses evidence with an LLM judge under an evidence-priority hierarchy. Approach B, TMKG: Temporal-Multimodal-Knowledge-Graph, is the contrast: it builds a temporal multimodal knowledge graph, locates a primary cell via graph search, and produces the final answer with a single grounded VLM. SVA reaches a leaderboard accuracy of 0.50 and is our final challenge submission; TMKG reaches 0.35.


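[83] A self-supervised learning approach to deep filter banks for texture recognition cs.CV

Joao B. Florindo, Lucas O. Lyra, Antonio E. Fabris

TL;DR: This paper proposes a self-supervised learning framework for texture recognition that pretrains a convolutional autoencoder and extracts texture features with deep filters coupled with Fisher vector pooling. It targets the limited training data common in texture recognition while keeping computational cost low.

Details

Motivation: Texture recognition in real-world applications often suffers from limited training data. Existing self-supervised pretraining methods such as masked autoencoders usually rely on computationally intensive vision transformers, yet the relevant information in texture images is concentrated in a local area around each pixel, so capturing long-range dependencies may be unnecessary. The authors therefore pretrain a lighter convolutional autoencoder.

Result: The method is compared with several state-of-the-art methods on multiple texture databases, and the results indicate its potential in both classification accuracy and computational complexity.

Insight: The paper combines self-supervised pretraining with a convolutional autoencoder, exploiting the local nature of texture to avoid the cost of attention, and uses deep filters with Fisher vector pooling to capture texture pattern information, yielding a lightweight approach for data-limited texture recognition.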

Abstract: An important challenge in texture recognition is the limited amount of data for training frequently found in real-world applications. In computer vision in general, a successful strategy to mitigate this issue is the use of a pretraining stage where the neural network learns to identify relations between parts of the data in a self-supervised manner. A well-established framework in this direction is masked autoencoder. Nevertheless, these models usually rely on computationally intensive architectures, such as vision transformers. In the particular case of texture images, most of the relevant information is compacted within a delimited area around each pixel, which suggests that capturing long-range dependence via the attention mechanism may be unnecessary. Based on that assumption, here we propose a framework where the pretraining model is a convolutional autoencoder. To leverage the rich information conveyed by texture patterns, we employ deep filters coupled with Fisher vector pooling. In this way, we improve the performance of texture recognition without adding significant computational burden. Our approach is compared with several state-of-the-art methods in different texture databases, confirming its potential both in terms of classification accuracy and computational complexity.


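[84] Reflective Dialogue between Teacher and Solver Agents for Video Question Answering cs.CV

Takuya Murakawa, Toru Tamaki

TL;DR: This paper proposes a method for adapting video question answering models to new domains through inference-time context injection. It constructs a reflective dialogue between a Teacher agent and a Solver agent and uses the dialogue history as inference context, outperforming zero-shot and standard in-context learning baselines on the EgoCross benchmark.

Details

Motivation: Acquiring task-specific knowledge at inference time from only a small labeled support set, without fine-tuning, remains a challenge when adapting vision-language models to video question answering.

Result: On EgoCross, the method outperforms both a zero-shot baseline and standard in-context learning, and it placed 3rd in the Open-source Track of the 1st Cross-Domain EgoCross Challenge at the CVPR 2026 EgoVis Workshop.

Insight: The reflective dialogue mechanism integrates visual grounding explanations and correctness feedback over multiple turns and uses the dialogue history as inference context, achieving adaptation purely through inference-time context injection with no model fine-tuning.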

Abstract: Various approaches have been proposed to adapt Vision-Language Models (VLMs) to specialized domains for Video Question Answering, including fine-tuning and in-context learning. However, acquiring task-specific knowledge at the inference phase from only a small labeled support set without fine-tuning remains a challenge. In this paper, we propose a method that achieves adaptation solely through inference-time context injection. Our method first constructs a Reflective Dialogue (RD) – a multi-turn conversation between two agents, in which Teacher poses each support question and delivers correctness feedback, and Solver answers and provides visual grounding explanations (or reflections) for both correct and incorrect answers. This dialogue history is then used as context at the inference phase. Experiments on the EgoCross benchmark demonstrate that our method outperforms both a baseline zero-shot setting and a standard in-context learning approach that passes support set examples directly, achieving 3rd place in the Open-source Track of the 1st Cross-Domain EgoCross Challenge at the CVPR 2026 EgoVis Workshop, for which this paper also serves as a technical report.
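The dialogue construction described in the abstract can be sketched in Python as below. This is a hedged illustration only: the turn order (question, answer, feedback, reflection), the message dictionary format, and the `answer_fn` and `reflect_fn` callables standing in for VLM calls are all assumptions that the abstract does not specify.

```python
def build_reflective_dialogue(support_set, answer_fn, reflect_fn):
    """Build a Teacher/Solver dialogue history from a labeled support set.

    support_set: list of {"question": str, "answer": str} items.
    answer_fn(question, history) -> str: hypothetical Solver answer call.
    reflect_fn(question, feedback, history) -> str: hypothetical Solver reflection call.
    """
    history = []
    for item in support_set:
        question, gold = item["question"], item["answer"]
        # Teacher poses the support question.
        history.append({"role": "teacher", "content": question})
        # Solver answers.
        predicted = answer_fn(question, history)
        history.append({"role": "solver", "content": predicted})
        # Teacher delivers correctness feedback.
        if predicted == gold:
            feedback = "Correct."
        else:
            feedback = f"Incorrect. The right answer is {gold}."
        history.append({"role": "teacher", "content": feedback})
        # Solver reflects, whether the answer was right or wrong.
        reflection = reflect_fn(question, feedback, history)
        history.append({"role": "solver", "content": reflection})
    return history


def build_inference_context(history, query):
    """Prepend the stored dialogue to a new test question without mutating it."""
    return history + [{"role": "teacher", "content": query}]
```

No model weights change in this scheme: the only thing carried from the support set to test time is the list returned by `build_reflective_dialogue`.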


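[85] SmartDirector: Keyframe-Conditioned Cinematic Video Generation with Narrative Pacing Control cs.CV | cs.AI

Zhida Zhang, Jie Ma, Zhan Peng, Haoxue Wu, Yang Han

TL;DR: This paper proposes SmartDirector, a framework that strengthens the narrative capacity of video generation through multi-keyframe conditioning and supports single-shot generation, multi-shot narrative synthesis, and video extension. It works in two stages: Director-Gen generates a low-resolution video from the keyframes, and Director-SR uses high-resolution keyframes as semantic anchors to recover detail. Experiments show substantial gains over existing state-of-the-art methods.

Details

Motivation: Existing video generation methods rely mainly on sparse conditioning signals such as text prompts or first/last frames, which makes precise control over narrative structure and temporal pacing difficult and limits narrative quality.

Result: Extensive experiments on single-shot and multi-shot sequences curated from movies show that SmartDirector substantially outperforms existing state-of-the-art approaches.

Insight: The paper uses multiple keyframes as conditioning signals to improve narrative control and designs a two-stage generation and super-resolution pipeline in which high-resolution keyframes act as semantic anchors for fine detail. Its movie-derived data pipeline supports robust multi-keyframe training.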

Abstract: The narrative quality of a video fundamentally determines its perceptual value. Although existing video generation methods can produce visually appealing content, they predominantly rely on sparse conditioning signals such as text prompts or first/last frames, which limits precise control over narrative structure and temporal pacing. In this paper, we propose SmartDirector, a framework that enhances the narrative capacity of video generation models through multiple keyframes. SmartDirector supports flexible generation scenarios including single-shot generation, multi-shot narrative synthesis, and video extension. The framework operates in two stages: Director-Gen generates a low-resolution video conditioned on the provided keyframes, and Director-SR refines the output by exploiting high-resolution keyframes as semantic anchors to recover fine-grained details. To enable robust multi-keyframe training, we construct a data pipeline that curates single-shot and multi-shot sequences from movies. Extensive experiments demonstrate that SmartDirector substantially outperforms existing state-of-the-art approaches. We will release the code to facilitate further research.


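[86] SIGMA: Bridging Structural and Distributional Gaps for Vision Foundation Model Adaptation cs.CV

Lingyu Xiong, Jinjin Shi, Xuran Xu, Cong Luo, Runyu Shi

TL;DR: This paper proposes SIGMA (Scale-Integrated Global Modulation Adapter), a lightweight parameter-efficient fine-tuning (PEFT) method that addresses the structural and distributional gaps vision foundation models face when adapted to dense prediction tasks. A scale-adaptive fusion module strengthens multi-granularity visual information extraction to bridge the structural gap, and semantic modulation globally aligns the fused features to reduce the distribution gap. SIGMA trains only 1.72% as many parameters as the backbone and outperforms existing state-of-the-art PEFT methods across several dense downstream tasks and backbones.

Details

Motivation: Vision foundation models have strong representational capabilities, but full fine-tuning incurs heavy computational and storage overhead. PEFT aims to match full fine-tuning at minimal training cost, yet applying it to dense prediction tasks is hampered by gaps in model structure and data distribution.

Result: Comprehensive experiments across various downstream dense tasks and multiple vision foundation model backbones show that SIGMA achieves consistent and superior performance over state-of-the-art PEFT methods.

Insight: SIGMA addresses adaptation along two dimensions at once: scale-adaptive fusion handles multi-scale spatial information, and semantic modulation performs global feature alignment. This unified spatial and distributional design adapts the model with very few trainable parameters.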

Abstract: Vision Foundation Models (VFMs) have demonstrated impressive representational capabilities. However, adapting them to downstream tasks via full fine-tuning incurs prohibitive computational and storage overhead. Parameter-Efficient Fine-Tuning (PEFT) has emerged as a compelling alternative, aiming to achieve performance parity with full fine-tuning at minimal training costs. Nonetheless, applying PEFT to VFMs for dense prediction tasks remains challenging due to the structural and distributional gaps. To bridge these gaps, we propose Scale-Integrated Global Modulation Adapter (SIGMA), a novel lightweight PEFT method, which consists of two modules: scale-adaptive fusion and semantic modulation. Specifically, the scale-adaptive fusion module is utilized to bridge structural gaps by enhancing the extraction of multi-granularity visual information. Furthermore, SIGMA introduces semantic modulation on the fusion features to perform global feature alignment to further eliminate the distribution gap. This design facilitates unified spatial and distributional adaptation, requiring only 1.72% trainable parameters relative to the VFM backbone. Comprehensive experiments across various downstream dense tasks and multiple VFM backbones demonstrate that SIGMA achieves consistent and superior performance over state-of-the-art PEFT methods.


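[87] Towards Unified Vision-Language Models with Incomplete Multi-Modal Inputs cs.CV

Xiang Fang, Wanlong Fang, Changshuo Wang, Keke Tang, Daizong Liu

TL;DR: This paper proposes a unified incomplete video-language model that handles the incomplete multi-modal inputs arising in real-world settings when sensors are deactivated, with the goal of improving robustness and safety when training and testing data are inconsistent.

Details

Motivation: Existing video-language models usually assume that video and language inputs are complete. In practice, deactivated sensors (for example, unavailable cameras) can produce modality-incomplete data, causing a mismatch between training and testing, and incomplete inputs may undermine model safety and trustworthiness.

Result: Extensive experiments show that the method can serve as a plug-and-play module for prior work, improving performance across various multi-modal tasks.

Insight: The paper presents a first attempt at a unified model for incomplete multi-modal inputs, aiming to improve generalization and safety in real-world scenarios, and it can be integrated into existing models as a general-purpose module.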

Abstract: Video-Language Models (VLMs) have demonstrated impressive multi-modal reasoning capabilities across diverse computer vision applications. However, these VLMs are task-specific and assume that both video and language inputs are complete. Real-world VLM applications might face challenges due to deactivated sensors (e.g., cameras are unavailable due to data privacy), yielding modality-incomplete data and leading to inconsistency between training and testing data. While straightforward incomplete input can boast training generalization ability and lead to training failure, its potential risks to VLMs regarding safety and trustworthiness have been largely neglected. To this end, we make the first attempt to propose a unified incomplete video-language model to process the incomplete multi-modal inputs. Extensive experimental results show that our method can serve as a plug-and-play module for previous works to improve their performance in various multi-modal tasks.


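[88] Decoupled Training with Local Reinforcement Fine-Tuning in Federated Learning cs.CV

Yuting Ma, Lechao Cheng, Xiaohua Xu

TL;DR: This paper proposes FedDTL, a federated vision-language model framework that decouples training of the image encoder and text encoder across clients and the server and adds a two-stage local fine-tuning scheme (supervised fine-tuning followed by reinforcement learning). The goal is to balance global task adaptation and generalization under heterogeneous and full-data federated learning settings.

Details

Motivation: Existing federated learning methods built on pre-trained vision-language models emphasize fully local optimization with simple parameter aggregation. Under heterogeneous and full-data settings, this amplifies inter-client optimization inconsistency and intra-client over-specialization, making it hard to balance global task adaptation and generalization.

Result: Extensive experiments on multiple benchmarks, including label skew and feature shift, show that FedDTL balances global task adaptation and generalization across various federated data distributions in both few-shot and full-data regimes.

Insight: The main contributions are decoupled encoder training with server-client modality alignment, which promotes coherent global semantic updates, and a two-stage local fine-tuning strategy combining supervised fine-tuning and reinforcement learning, which mitigates intra-client over-specialization and improves generalization.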

Abstract: Federated Learning (FL) with pre-trained Vision-Language Models (VLMs) has emerged as a promising paradigm for various downstream tasks. By leveraging their strong representations, recent studies improve task adaptation under insufficient local data while preserving generalization. However, these methods emphasize fully local optimization with simple parameter aggregation, which can amplify inter-client optimization inconsistency and intra-client over-specialization under heterogeneous and full-data FL settings, making it difficult to balance global task adaptation and generalization. To address these challenges, we propose FedDTL, a novel federated VLM framework that decouples the image encoder and text encoder across clients and the server. Through decoupled encoder training with server-client modality alignment, FedDTL promotes coherent global semantic update and reduces inter-client optimization inconsistency, improving global task adaptation. To further mitigate intra-client over-specialization, we introduce a two-stage local fine-tuning, where a supervised fine-tuning stage enables rapid and reliable warm-start, followed by a reinforcement learning stage that enhances generalization. Extensive experiments on multiple benchmarks, including label skew and feature shift, demonstrate that FedDTL achieves an effective balance between global task adaptation and generalization under various FL data distributions in both few-shot and full-data regimes.


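[89] OphIn-500K: Curating Web-Scale Visual Instructions for Scaling Ophthalmic Multimodal Large Language Models cs.CV | cs.CL

Xuanzhao Dong, Wenhui Zhu, Xiwen Chen, Hao Wang, Xin Li

TL;DR: This paper proposes OphIn-Engine, a pipeline that builds high-quality instruction data from ophthalmology web videos, and uses it to create OphIn-500K, a large-scale ophthalmic multimodal instruction-tuning dataset. On this dataset the authors develop OphIn-VL, an ophthalmology-specific multimodal large language model with strong visual understanding and conversational capabilities.

Details

Motivation: Ophthalmology lacks large-scale, high-quality, domain-specific instruction-tuning data, which limits the development of conversational assistants able to handle real-world clinical complexity.

Result: Experiments show that OphIn-VL outperforms state-of-the-art general medical and domain-specific multimodal large language models.

Insight: The paper contributes a complete pipeline (OphIn-Engine) for automatically extracting and constructing high-quality, diverse ophthalmic clinical dialogue instructions from open web videos, and a large-scale ophthalmic multimodal instruction dataset (OphIn-500K), easing the domain data scarcity bottleneck.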

Abstract: The advancement of general medical Multimodal Large Language Models (MLLMs) has shown great potential for building conversational assistants to support clinical diagnosis. However, their adaptation to highly specialized domains such as ophthalmology remains underexplored, primarily due to the scarcity of large-scale, domain-specific instruction-tuning data. Existing ophthalmic datasets for conversational agents are often limited in scale and largely rely on images from established public benchmarks, limiting the scalability of ophthalmic MLLMs and their ability to capture real-world clinical complexity. To address this gap, we propose OphIn-Engine, an ophthalmology-specific instruction data curation pipeline that constructs high-quality instruction data from open-access ophthalmology web-scale videos. The pipeline integrates multimodal transcription for extracting image-transcript pairs, visual cue separation and scoring for identifying clinically relevant visual descriptions, and instruction synthesis with quality control for generating accurate and diverse clinical dialogues. Using this engine, we introduce OphIn-500K, a large-scale multimodal ophthalmology instruction-tuning dataset containing over 500,000 instruction instances and more than 151,000 unique images from over 29,000 video clips, formatted as visual question answering (VQA), multi-turn conversational interactions, and chain-of-thought (CoT) reasoning. Built upon this dataset, we further develop OphIn-VL, an ophthalmology-specific MLLM with advanced visual understanding and conversational capabilities. Comprehensive experiments and case studies demonstrate that OphIn-VL achieves superior performance compared with state-of-the-art general medical and domain-specific MLLMs.


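[90] Rethinking Video-Language Model from the Language Input Perspective cs.CV

Xiang Fang, Wanlong Fang, Changshuo Wang, Xiaoye Qu, Daizong Liu

TL;DR: This paper proposes a plug-and-play framework that rethinks video-language models (VLMs) from the language input side, improving existing VLMs through positive and negative text generation, attribute-based text reasoning, and video-guided cross-modal bridging.

Details

Motivation: Existing VLMs typically assume that all texts follow a predefined template. This is impractical and user-unfriendly in real applications: predefining texts is time-consuming and labor-intensive, the templates are rigid, and texts with similar semantics but different templates can yield markedly different performance.

Result: Extensive experiments show that the method works as a plug-and-play module that improves state-of-the-art VLMs; the abstract does not name specific benchmarks or report quantitative results.

Insight: Starting from the language input, the method uses contrastive text generation, fine-grained semantic mining, and a self-weighted loss to achieve more flexible cross-modal bridging, offering a general strategy for making VLMs more robust to variation in language input.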

Abstract: Driven by the wave of large language models, Video-Language Models (VLMs) have become a significant yet challenging technology to bridge the gap between videos and texts. Although previous VLM works have made significant progress, almost all of them implicitly assume that all the texts are predefined by a specific template. In real-world applications, such a strict assumption is impossible to satisfy since 1) predefining all the texts is extremely time-consuming and labor-intensive, and 2) these predefined text inputs are too restrictive and user-unfriendly, limiting their applications. It is observed that, given a video input, texts with similar semantics but different templates lead to varying performance. To this end, in this paper, we propose a novel plug-and-play framework for various VLM-based methods to fully bridge videos and texts. Specifically, we first generate positive and negative texts from the original ones to target specific text components. Then, we propose an attribute-based text reasoning strategy to mine fine-grained textual semantics of generated texts. Finally, we utilize videos as guidance to conduct cross-modal bridging by designing a self-weighted loss. Extensive experiments show that the proposed method can serve as a plug-and-play module to effectively improve the performance of state-of-the-art VLMs.


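[91] Structure-Guided Visual Perturbation Neutralization for LVLMs cs.CV | cs.LG

Yuanhe Zhang, Xueting Wang, YanBin Ren, Haoran Gao, Xinhan Zheng

TL;DR: This paper proposes SIGN, a lightweight plug-and-play defense framework that protects large vision language models (LVLMs) against adversarial image perturbations. Through Prior Structural Extraction and Dynamic Guided Neutralization, it suppresses perturbations while modifying only a small fraction of pixels and preserving the model's original visual representations and benign task performance.

Details

Motivation: Most existing defenses were designed for traditional computer vision and overlook the cross-modal alignment LVLMs require, which degrades performance. Defenses tailored to LVLMs often require substantial image modifications and add considerable computational overhead, hurting inference quality and efficiency.

Result: In extensive experiments, SIGN achieves a defense success rate above 87% while modifying only 0.5% of pixels and taking 0.16 seconds per image, and it nearly preserves the original visual representations and benign task performance.

Insight: The paper exploits a vision encoder for prior structural extraction to provide efficient adversarial protection, offering a lightweight alternative to defenses that require costly model training; Dynamic Guided Neutralization delivers a high defense success rate with little pixel modification.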

Abstract: Image inputs enable Large Vision Language Models (LVLMs) to perceive fine-grained visual information, but also introduce a pixel-level attack surface through which adversarial perturbations can elicit unsafe model behaviors. However, most existing defenses are designed for traditional computer vision settings and thus often overlook the cross-modal alignment required by LVLMs, leading to degraded performance. Meanwhile, the limited defenses tailored to LVLMs often require substantial image modifications and introduce considerable computational overhead, thereby compromising inference quality and efficiency. To address these limitations, we propose Structure-Induced Guided Neutralization (SIGN), a lightweight, plug-and-play defense framework that improves LVLM compatibility via Prior Structural Extraction and achieves efficient perturbation suppression via Dynamic Guided Neutralization. Extensive experiments show that SIGN achieves over 87% defense success rate with only 0.5% pixel modification and 0.16 seconds per image, while nearly preserving original visual representations and benign task performance. Our work offers a lightweight alternative to defenses that require costly model training and highlights the potential of exploiting a vision encoder for efficient adversarial protection. Our code is open source on https://anonymous.4open.science/r/SIGN-BCB1.


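[92] SIGMA: Semantic-Difference Instruction-Grounding Mask Annotator for Text-Driven Image Manipulation Localization cs.CV

Peiyu Zhuang, Jianquan Yang, Haodong Li, Zhuoying Cai, Ruitao Xie

TL;DR: This paper proposes SIGMA (Semantic-difference Instruction-Grounding Mask Annotator), which automatically generates pixel-level masks from text-driven image editing pairs to reduce the annotation cost of training data for image manipulation localization (IML). It localizes edited regions through semantic feature differencing and cross-modal refinement and is trained in two stages; the resulting mask dataset substantially improves several IML detectors.

Details

Motivation: IML models need large pixel-annotated datasets, but public editing datasets contain only (original, edited) image pairs with no masks. Manual annotation is expensive, and existing automatic approaches fall short: pixel differencing is overwhelmed by diffusion-induced perturbations, and instruction-only grounding misses unintended editing side effects.

Result: SIGMA outperforms existing automatic mask generators on five benchmarks (+12.20% F1, +11.16% IoU). Applied to public editing corpora, it produces an IML training set of about 1.1M samples that improves six different detectors by an average of +18.34% F1 across five datasets, reaching state-of-the-art results.

Insight: The method combines semantic feature differencing with an instruction-derived spatial prior through bidirectional cross-modal refinement to separate intended edits from diffusion perturbations, and it uses two-stage training (inpainting-mask supervision, then noise-calibrated self-training) to close the diffusion-domain shift. This turns unlabeled editing data into a model-agnostic supervisory resource for IML.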

Abstract: Text-driven image editing has advanced rapidly, but reliably localizing these manipulations requires image manipulation localization (IML) models trained on large pixel-annotated datasets, and there is still no low-cost way to obtain such training data at scale. We observe that these data already exist in disguise: public editing datasets contain millions of structurally identical (original, edited) pairs to IML training samples, lacking only pixel-level masks. Recovering these masks automatically is non-trivial: pixel differencing is overwhelmed by diffusion-induced perturbations across all pixels, and instruction-only grounding localizes only what the prompt describes, missing unintended editor side-effects. We propose SIGMA (Semantic-difference Instruction-Grounding Mask Annotator), which performs semantic-feature differencing in a vision foundation backbone and injects an instruction-derived spatial prior into this visual stream via bidirectional cross-modal refinement, amplifying the difference signal at intended-edit regions when the editor faithfully realizes user intent. SIGMA is trained in two complementary stages: Stage I supervises on inpainting masks; Stage II closes the diffusion-domain shift via VAE-roundtrip noise calibration, EMA self-training, and an edit-noise disentanglement loss. SIGMA outperforms existing automatic mask generators on five benchmarks (+12.20% F1, +11.16% IoU). When applied to public editing corpora, it produces a ~1.1M IML training set that improves six diverse detectors by +18.34% F1 across five datasets, turning previously unused editing data into a model-agnostic supervisory resource for IML. We’ll release the full codebase as soon as the paper is accepted.
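The reported gains are stated in F1 and IoU over predicted masks. For readers unfamiliar with these metrics, the following Python sketch computes both for a pair of binary masks using the standard pixel-level definitions. The function name and the convention of returning 1.0 when both masks are empty are assumptions; the paper's exact evaluation protocol (thresholding, per-image versus dataset averaging) is not given in the abstract.

```python
def mask_f1_iou(pred, truth):
    """Pixel-level F1 and IoU for two binary masks given as equal-sized nested lists."""
    tp = fp = fn = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                tp += 1          # manipulated pixel correctly flagged
            elif p and not t:
                fp += 1          # clean pixel wrongly flagged
            elif t and not p:
                fn += 1          # manipulated pixel missed
    f1_den = 2 * tp + fp + fn
    iou_den = tp + fp + fn
    # Assumed convention: two empty masks count as a perfect match.
    f1 = 2 * tp / f1_den if f1_den else 1.0
    iou = tp / iou_den if iou_den else 1.0
    return f1, iou
```

F1 weights true positives twice, so it is never lower than IoU on the same pair of masks.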


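[93] Do We Really Need Quantum Machine Learning?: A Multidimensional Empirical Study cs.CV | cs.AI | cs.LG | quant-ph

Sudip Vhaduri, Ryan Gammon, Sayanton Dibbo

TL;DR: This paper presents a multidimensional empirical study of classical and quantum machine learning models for image recognition on the MNIST handwritten digit dataset, comparing support vector machines (CSVM/QSVM) and convolutional neural networks (CCNN/QCNN) on classification accuracy, runtime, parameter count, and memory requirements. Quantum models generally match or exceed classical accuracy, especially at high feature dimensionality or large sample sizes, but at higher computational cost: QSVM is clearly more accurate than CSVM, while QCNN matches CCNN in accuracy with far better parameter and memory efficiency.

Details

Motivation: As computer vision tasks grow more complex, classical machine learning models face computational bottlenecks, motivating the exploration of quantum computing as an emerging paradigm and a more comprehensive performance comparison than prior work provides.

Result: Among SVM models, QSVM reaches about 0.90 accuracy at 1,000 samples versus about 0.85 for CSVM, at higher computational cost; 10 qubits and 200 to 500 samples are practical operating points that balance accuracy and runtime. Among neural network models, CCNN and QCNN both exceed 0.96 accuracy at 64 features and 60,000 samples, but at higher feature counts QCNN needs about 94% fewer parameters and about 75% less memory, despite longer runtime. The accuracy advantage of quantum models grows as feature dimensionality or sample size increases.

Insight: The paper systematically compares classical and quantum models through multidimensional benchmarking (accuracy, runtime, parameter count, memory) and controlled experiments (feature dimensionality, sample size, CPU/GPU environments). The analysis indicates that quantum models show promise in accuracy and efficiency but require weighing the computational overhead, providing empirical grounding for quantum machine learning in image recognition.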

Abstract: The rapid growth of computer vision and increasingly complex image recognition tasks has exposed fundamental computational limitations of classical machine learning models, motivating the exploration of quantum computing as an emerging new paradigm. This paper presents a comprehensive benchmarking study of classical and quantum machine learning models for image recognition on the MNIST handwritten digit dataset, evaluating both traditional models, a Classical Support Vector Machine (CSVM) and a Quantum Support Vector Machine (QSVM), and deep neural network models, a Classical Convolutional Neural Network (CCNN) and a Quantum Convolutional Neural Network (QCNN), across four performance dimensions: classification accuracy, computational runtime, parameter count, and memory requirements. Experiments are conducted as functions of both feature dimensionality and sample size, and across CPU and GPU execution environments, providing a controlled, multidimensional comparison to address gaps in prior work. For the SVM-based models, QSVM consistently outperforms CSVM in accuracy, reaching ~0.90 versus ~0.85 at 1,000 samples, with a higher computational cost. A feature count of 10 qubits and a sample size in the range of 200 to 500 emerge as practical operating points that balance accuracy and runtime. For the neural network models, CCNN and QCNN achieve comparable classification accuracy, both exceeding 0.96 at 64 features and 60,000 samples, yet QCNN offers substantially superior parameter and memory efficiency, requiring ~94% fewer parameters and ~75% less memory than CCNN at higher feature counts, while incurring higher runtime. Across both model families, quantum models consistently outperform classical models by greater margins in accuracy as feature dimensionality or sample size increases.


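[94] When Think-with-Image Meets Safety: What Determines Multimodal Jailbreak Robustness? cs.CV | cs.AI | cs.CL | cs.CR | cs.LG

Yuan Tian, Bing Hu, Fang Wu, Xiaomin Li, Binghang Lu

TL;DR: This paper studies how different think-with-image design paradigms in large vision-language models affect robustness to multimodal jailbreak attacks. Explicit image-tool invocation lowers attack success rates the most, by around 30% relative on average. The authors propose an image-tool safety vector framework to explain this, arguing that tool invocation introduces a residual shift in hidden representations toward a safety-relevant direction.

Details

Motivation: Think-with-image reasoning is an emerging inference paradigm whose safety implications are unclear. The paper asks which process design (direct response generation, text-only prior turn, visual-state manipulation, or explicit image-tool invocation) best resists multimodal jailbreaks, and why.

Result: Across multiple vision-language models, explicit image-tool interaction reduces attack success rate (ASR) the most, by around 30% relative on average. ASR stays low even when the returned image-tool output is manually overridden or itself looks unsafe, but returns to near direct-answering levels under text-only prior turn controls.

Insight: The paper systematically evaluates how think-with-image paradigms affect safety and finds that explicit tool invocation improves robustness. Its image-tool safety vector framework explains the effect at the representation level: tool invocation adds a safety-relevant residual shift, rather than relying on benign returned-image semantics or on the textual tool trace. This underscores the need for pipeline-specific safety evaluation.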

Abstract: Think-with-image reasoning is emerging as a new inference paradigm for large vision-language models, but its safety implications remain poorly understood. Existing systems already span multiple process designs, including direct response generation, text-only prior turn, visual-state manipulation, and explicit external image-tool invocation. In this paper, we ask which of these evaluated paradigms improves multimodal jailbreak robustness, and why. Across multiple vision-language models, explicit image-tool interaction yields the lowest attack success rates in our experiments, reducing jailbreak success by around 30% relative on average across the evaluated models. This finding is initially surprising: ASR remains low even when the returned image-tool output is manually overridden or itself unsafe-looking, but returns near direct-answering levels under text-only prior turn controls. These results indicate that the lower ASR is not explained by benign returned-image semantics or by the textual image-tool trace alone. To explain the pattern, we introduce an image-tool safety vector framework that models image-tool invocation as a residual shift in hidden representations toward a safety-relevant direction. Representation-level analyses and activation interventions support this account. Overall, our results suggest that explicit image-tool interaction is a promising design pattern for improving jailbreak robustness, while also motivating pipeline-specific safety evaluation.
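The abstract reports a reduction of "around 30% relative", which differs from a 30-point absolute drop. A small Python sketch of the distinction follows; the function names and the example rates used in the note below are hypothetical, not figures from the paper.

```python
def attack_success_rate(outcomes):
    """ASR: fraction of attack attempts judged successful (outcomes are booleans)."""
    if not outcomes:
        raise ValueError("at least one attack attempt is required")
    return sum(outcomes) / len(outcomes)


def relative_reduction(baseline_asr, defended_asr):
    """Relative drop in ASR, as a fraction of the baseline ASR."""
    if baseline_asr <= 0:
        raise ValueError("baseline ASR must be positive")
    return (baseline_asr - defended_asr) / baseline_asr


def absolute_reduction(baseline_asr, defended_asr):
    """Absolute drop in ASR, in rate units (multiply by 100 for percentage points)."""
    return baseline_asr - defended_asr
```

For instance, with a hypothetical baseline ASR of 0.50 and a defended ASR of 0.35, the relative reduction is 30% while the absolute reduction is only 15 percentage points.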


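[95] SEMAGIC: Learning Semantically Consistent Deformable 3D Representations from In-the-Wild Images cs.CV

Sky Cen, Wufei Ma, Guofeng Zhang, Alan Yuille, Adam Kortylewski

TL;DR: SEMAGIC is a framework for learning semantically consistent deformable 3D representations from single-view in-the-wild images. It treats deformable modeling as a mechanism for discovering category-level correspondences rather than as a reconstruction goal, using a canonical template mesh and a deformation field so that vertices keep a consistent semantic meaning across instances.

Details

Motivation: Existing deformable 3D reconstruction methods produce visually plausible geometry but yield unstable correspondences across instances, perform poorly on semantic correspondence benchmarks, and lack the semantic structure downstream tasks need.

Result: On SPair-71k, SEMAGIC improves the semantic correspondence of deformable models by +14.7 PCK@0.1, establishing deformable models as effective semantic 3D representations.

Insight: The method explicitly couples geometric deformation with semantic alignment, enforcing semantic consistency through a feature-level consistency loss and vertex-index-conditioned deformation, and thereby keeps part correspondences stable across intra-category variation.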

Abstract: Learning deformable 3D object models from single-view in-the-wild images has enabled impressive 3D shape reconstruction without supervision. However, it remains unclear whether these models capture the semantic structure required for downstream tasks. We find that existing deformable reconstruction approaches, despite producing visually plausible geometry, yield unstable correspondences across instances and perform poorly on semantic correspondence benchmarks. We introduce SEMAGIC, a framework for learning semantically consistent deformable 3D representations from single-view in-the-wild images. Rather than treating reconstruction as the end goal, SEMAGIC uses deformable modeling as a mechanism to discover category-level correspondences. Each category is represented by a canonical template mesh and a learned deformation field, functioning similarly to an autoencoder that reconstructs instance geometry from image features, enabling vertices to maintain consistent semantic meaning across instances. Semantic consistency is enforced during training through (i) a feature-level consistency loss aligning semantic features between canonical and deformed meshes, and (ii) vertex-index-conditioned deformation that preserves semantic correspondence across instances. By explicitly coupling geometric deformation with semantic alignment, SEMAGIC produces representations that maintain stable part correspondences across intra-category variation. Experiments demonstrate that SEMAGIC improves semantic correspondence of deformable models by +14.7 PCK@0.1 on SPair-71k, establishing deformable models as effective semantic 3D representations.
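The headline result is reported as PCK@0.1. As a hedged illustration, the Python sketch below computes PCK under a common convention in which a predicted keypoint counts as correct when it lies within `alpha * max(width, height)` of the ground truth, using the object's bounding box. The function name and argument layout are assumptions, and the paper's exact evaluation protocol is not stated in the abstract.

```python
import math


def pck(pred_pts, gt_pts, bbox_w, bbox_h, alpha=0.1):
    """Percentage of Correct Keypoints for one image pair.

    pred_pts, gt_pts: equal-length lists of (x, y) keypoints.
    bbox_w, bbox_h: bounding-box size of the target object, in pixels.
    """
    if len(pred_pts) != len(gt_pts) or not gt_pts:
        raise ValueError("keypoint lists must be non-empty and of equal length")
    # Distance threshold scales with the larger side of the bounding box.
    threshold = alpha * max(bbox_w, bbox_h)
    correct = sum(
        math.dist(p, g) <= threshold for p, g in zip(pred_pts, gt_pts)
    )
    return correct / len(gt_pts)
```

Under this convention, a gain of +14.7 means 14.7 more keypoints out of every 100 land within the threshold.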


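[96] Con-DSO: Learning Short-Horizon Consistency Priors for RGB-D Direct Sparse Odometry cs.CV | cs.RO

Haolan Zhang, Thanh Nguyen Canh, Chenghao Li, Ziyan Gao, Xiongwen Jiang

TL;DR: This paper proposes Con-DSO, a consistency-aware RGB-D direct sparse odometry framework. It predicts dense photometric and depth-geometric consistency uncertainty from adjacent RGB-D frame pairs and converts it into a quality prior for keyframe tracking, so that unreliable observations are continuously attenuated during pose estimation. This improves robustness under dynamic objects, occlusions, and illumination changes.

Details

Motivation: RGB-D direct visual odometry relies on short-horizon photometric and depth-geometric consistency assumptions, which break down under dynamic objects, occlusions, illumination changes, and unreliable depth. Existing methods usually depend on external modules or hand-crafted designs tailored to individual failure modes, limiting flexibility and the ability to handle diverse challenges in a unified way.

Result: Experiments on five public RGB-D benchmarks (including ICL-NUIM, RGB-D Scenes V2, TUM/Bonn Dynamic, and OpenLORIS) show substantial gains over direct RGB-D VO baselines: absolute trajectory error falls by over 20% on ICL-NUIM and by 50% to 80% on the other sequences.

Insight: The framework learns to predict pixel-level consistency uncertainty and feeds it into VO as a quality prior, attenuating unreliable observations continuously instead of rejecting them outright or gating them by threshold, which improves robustness across diverse challenges in a unified manner.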

Abstract: Visual odometry (VO) is a fundamental component in robotics and augmented reality. RGB-D direct VO benefits from metric depth measurements, but it can degrade in challenging environments, where dynamic objects, occlusions, illumination changes, and unreliable depth violate the short-horizon photometric and depth-geometric consistency assumptions used by direct alignment. Existing approaches mitigate these issues through semantic filtering, explicit occlusion reasoning, illumination adaptation, or hand-crafted geometric criteria, but often rely on external modules or fixed assumptions tailored to individual failure modes, limiting their flexibility and ability to handle diverse challenges in a unified manner. In this work, we propose Con-DSO, a consistency-aware RGB-D direct sparse odometry framework that predicts dense photometric and depth-geometric consistency uncertainty from temporally adjacent RGB-D frame pairs. The consistency network is trained using flow-guided photometric errors and projective depth-consistency errors, allowing consistency violations to be represented as pixel-level uncertainty. These pairwise uncertainty predictions are converted into a host-side quality prior for keyframe-based tracking. The prior is then applied to VO through quality-aware support-pixel selection and decoupled photometric-geometric weighting during pose estimation, enabling continuous attenuation of unreliable observations rather than hard rejection or threshold-based gating. Experiments on five public RGB-D benchmarks show substantial gains over direct RGB-D VO baselines, with over 20% absolute trajectory error reduction on ICL-NUIM and 50%–80% reductions on RGB-D Scenes V2, TUM/Bonn Dynamic, and OpenLORIS sequences.
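The contrast between continuous attenuation and hard rejection can be illustrated with a toy weighted least-squares cost in Python. This is a sketch under stated assumptions: the weighting function `1 / (1 + u)`, the threshold value, and all function names are hypothetical stand-ins, not the weighting scheme used in Con-DSO.

```python
def soft_weight(uncertainty):
    """Hypothetical continuous attenuation: weight decays smoothly with uncertainty."""
    return 1.0 / (1.0 + uncertainty)


def hard_gate(uncertainty, threshold=1.0):
    """Threshold-based gating: an observation is either fully kept or fully dropped."""
    return 1.0 if uncertainty < threshold else 0.0


def weighted_cost(residuals, uncertainties, weight_fn):
    """Sum of squared residuals, each scaled by a per-pixel weight."""
    return sum(
        weight_fn(u) * r * r for r, u in zip(residuals, uncertainties)
    )
```

With soft weighting, a high-uncertainty pixel still contributes a reduced term to the cost, so the optimizer keeps some signal from it; with hard gating, the same pixel contributes nothing once it crosses the threshold.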


[97] ROVER: Routing Object-Centric Visual Evidence for Grounded Multi-Image Reasoning cs.CV | cs.AIPDF

Guannan Lv, Ren Nie, Hongjian Dou

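TL;DR: This paper proposes ROVER (Routing Object-centric Visual Evidence for grounded multi-image Reasoning), a lightweight, learnable plugin for efficient global visual evidence routing in multimodal large language models. It injects a step-specific token triplet that jointly aggregates the reasoning context, distills intra-image cues into a visual working space via object-centric differential attention, and routes and integrates history-aware evidence across objects and images within that space. The authors integrate ROVER into Qwen2.5-VL-7B and develop an interleaved SFT-to-GRPO training pipeline.

Details

Motivation: Existing grounding-based multimodal reasoning methods typically focus on regions of interest (RoIs), which can weaken holistic scene understanding and inter-object relations, while decoding cost grows with the number and size of RoIs. Adaptive visual feature selection, in turn, often requires fine-grained supervision or complex heuristics. ROVER aims to overcome these limitations with efficient, global visual evidence routing.

Result: Strictly following the original datasets and evaluation protocols, the method achieves the best performance on MM-GCoT (+4.8% answer accuracy, +14.6% grounding accuracy) and improves answer accuracy on VideoEspresso by 8.6%. The VideoEspresso-trained model also transfers well, outperforming the base model by 4.7% on average across diverse benchmarks.

Insight: The contribution is ROVER, a lightweight plugin that uses object-centric differential attention and history-aware evidence routing to integrate visual evidence across objects and images efficiently while preserving holistic scene understanding, avoiding the drawbacks of RoI-based methods without requiring complex supervision or heuristics.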

Abstract: Multimodal Large Language Models (MLLMs) have increasingly localized and interleaved visual evidence for deliberative reasoning. Grounding-based approaches typically focus on regions of interest (RoIs) by injecting cropped image patches or RoI-specific features into the reasoning context. However, such designs can weaken holistic scene understanding and inter-object relations, while incurring decoding costs that scale with the number and size of RoIs. Alternatively, adaptive visual feature selection often requires fine-grained supervision or complex heuristics. To address these limitations, we propose ROVER (Routing Object-centric Visual Evidence for grounded multi-image Reasoning), a lightweight, learnable plugin for efficient global visual evidence routing. Upon each object grounding prediction, ROVER injects a step-specific token triplet to synergistically: (i) aggregate the ongoing reasoning context, (ii) distill intra-image cues into a visual working space via object-centric differential attention, and (iii) route and integrate history-aware evidence across objects and images within this space for subsequent reasoning. We integrate ROVER into Qwen2.5-VL-7B and develop an interleaved SFT-to-GRPO training pipeline. Strictly adhering to the original datasets and evaluation protocols, our method achieves the best performance on MM-GCoT (+4.8% answer accuracy, +14.6% grounding accuracy) and VideoEspresso (+8.6% answer accuracy). The VideoEspresso-trained model demonstrates strong transferability, outperforming the base model by +4.7% on average across diverse benchmarks.


[98] Evaluating the Feasibility of Inferring Dietary Behavior Change Receptivity from Egocentric Images of Eating Environment cs.CVPDF

Long Li, Yuning Huang, Heather A. Eicher-Miller, J. Graham Thomas, Fengqing Zhu

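TL;DR: This study examines the feasibility of inferring participants' self-reported behavior change receptivity from egocentric eating images captured by a wearable camera. Using image sequences of free-living eating episodes collected with the AIM-2 device and paired with a behavior change receptivity questionnaire, it evaluates a transfer-learning framework built on a pre-trained CLIP vision encoder and a lightweight transformer classifier. Preliminary results show that the approach outperforms simple baselines on receptivity indicators, suggesting that egocentric eating images may contain relevant cues.

Details

Motivation: Accurately assessing dietary behavior change receptivity is essential for designing effective just-in-time adaptive interventions (JITAIs), but self-report-based assessment is typically sparse and delayed, which limits its usefulness for continuous monitoring. The study explores whether passive sensing, such as wearable camera images, can help address this challenge.

Result: Preliminary experiments show promising improvements over simple baseline models on behavior change receptivity indicators (awareness, interaction capability, and motivation). The approach shows potential but needs further validation on larger and more comprehensive datasets.

Insight: The novelty lies in using egocentric eating image sequences to infer behavior change receptivity, with a pre-trained CLIP encoder and a lightweight transformer extracting semantic and temporal cues, which offers a new angle on passive monitoring of dietary behavior. Viewed objectively, the approach links computer vision with health behavior intervention and could help make JITAIs more timely and personalized.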

Abstract: Accurately assessing dietary behavior change receptivity is essential for designing effective just-in-time adaptive interventions (JITAIs) that promote healthier eating habits. However, self-report-based assessment of behavior change receptivity is sparse and delayed, limiting its practical use in continuous monitoring. To explore whether passive sensing may help address this challenge, this study conducts a pilot investigation of inferring participants' self-reported behavior change receptivity from egocentric eating images collected by a wearable camera. We use pilot data obtained from free-living eating episodes using the Automatic Ingestion Monitor v2 (AIM-2). The data include egocentric image sequences captured during eating, paired with responses to questions assessing specific dimensions of behavior change receptivity (awareness, interaction capability, and motivation). To examine whether visual information is relevant to these responses, we evaluate a transfer-learning-assisted framework that combines a pre-trained Contrastive Language-Image Pre-Training (CLIP) vision encoder with a lightweight transformer classifier. The model processes eating episode image sequences to extract potential semantic and temporal cues related to behavior change receptivity. Preliminary experimental results show promising improvements over simple baseline models for behavior change receptivity indicators. These early findings suggest that egocentric eating episode images may contain cues related to dietary behavior change receptivity, and warrant further investigation with larger and more comprehensive datasets.


[99] Mags-RL: Wearing Multimodal LLMs a Magnifying Glass via Agentic Reinforcement Learning For Complex Scene Reasoning cs.CVPDF

Xuanzhao Dong, Wenhui Zhu, Peijie Qiu, Xiwen Chen, Xiaobing Yu

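TL;DR: This paper proposes Mags-RL, an agentic reinforcement learning framework that equips multimodal large language models with an external super-resolution "magnifying glass" agent, so the model can autonomously identify key image regions and inspect them at high resolution, improving reasoning in complex scenes. The method uses two-round reasoning, requires no extra annotations, and achieves data-efficient RL training through a novel curriculum learning strategy.

Details

Motivation: Existing MLLMs often interpret images inaccurately, which limits their reasoning in complex scenes with dense objects and cluttered backgrounds. Prior work relies on extra annotations such as bounding boxes, and low-resolution crops lose fine-grained detail, so an annotation-free method for high-resolution, fine-grained inspection is needed.

Result: Experiments on VSR, TallyQA, and GQA subsets show that Mags-RL outperforms recent strong competing methods, delivering high-quality reasoning with precise visual grounding, and that as few as 40 training samples suffice for reasonable performance.

Insight: Contributions include: 1) integrating a super-resolution agent into MLLMs as an external "magnifying glass" for autonomous region identification and high-resolution, fine-grained inspection; 2) two-round reasoning (an initial rationale followed by verification) that strengthens visual grounding; 3) a data-efficient curriculum learning strategy that sharply reduces the number of samples RL training needs.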

Abstract: Despite their popularity and success, Multimodal Large Language Models (MLLMs) often struggle to interpret images accurately, which limits their reasoning capability in complex scenarios (e.g., high object density and complex background clutter). Prior work mainly addresses this limitation by incorporating explicit visual cues like bounding boxes that require extra annotations. In addition, the resulting low-resolution crops often miss fine-grained details that MLLMs require for accurate reasoning. Therefore, we propose Mags-RL, an Agentic Reinforcement Learning (RL) framework that equips MLLMs with an external super-resolution "magnifying glass" agent for high-resolution fine-grained inspection. Specifically, the model performs two-round reasoning: in the first round, it generates an initial rationale and autonomously identifies regions of interest without relying on additional annotations; in the second round, it invokes a super-resolution agent to crop and upscale those regions, then revisits and verifies its earlier reasoning to produce the final answer. We also introduce a novel curriculum learning strategy that enables data-efficient RL training, needing as few as 40 training samples to achieve reasonable performance. Experiments on VSR, TallyQA, and GQA subsets show its superior performance against recent strong competing methods, demonstrating high-quality reasoning with precise visual grounding. Code and weights will be released soon.


[100] ABot-OCR Technical Report cs.CVPDF

Kaitao Jiang, Ruiyan Gong, Xiaolong Cheng, Kangning Niu, Tianlun Li

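TL;DR: This report introduces ABot-OCR, an end-to-end vision-language model that transcribes a full page image directly into clean Markdown. A dedicated data engine supplies large-scale, structurally consistent supervision, and the authors propose Decoupled Heterogeneous Document Optimization, a structure-constrained reinforcement learning method that improves textual accuracy and markup well-formedness.

Details

Motivation: Conventional document image transcription pipelines depend on brittle, complex modular orchestration. ABot-OCR aims to convert images into structured Markdown directly and robustly in a single forward pass.

Result: On the OmniDocBench v1.5 and v1.6 benchmarks, ABot-OCR scores 92.81 and 93.30 respectively, state of the art among end-to-end systems, and substantially narrows the gap to strong pipeline baselines.

Insight: The main contributions are an end-to-end single-model architecture that removes the brittleness of modular pipelines; Decoupled Heterogeneous Document Optimization, a structure-constrained reinforcement learning method that enforces textual accuracy and well-formed markup beyond supervised fine-tuning; and a data engine that offers an effective way to generate large-scale, high-quality training data.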

Abstract: We introduce ABot-OCR, an end-to-end vision-language model that transcribes a page image directly into clean Markdown in a single forward pass. By doing so, our approach completely eliminates the need for brittle modular orchestration. To maximize parsing fidelity, we develop a dedicated data engine to provide large-scale, structurally consistent supervision. Furthermore, we propose Decoupled Heterogeneous Document Optimization, a structure-constrained reinforcement learning method that sharpens textual accuracy and strictly enforces markup well-formedness beyond supervised fine-tuning alone. Extensive evaluations demonstrate the superior performance of our framework. On the OmniDocBench v1.5 and v1.6 benchmarks, ABot-OCR achieves state-of-the-art scores of 92.81 and 93.30 among all end-to-end systems, substantially narrowing the performance gap relative to strong pipeline baselines. Finally, comprehensive multilingual text recognition across ten diverse languages further confirms the robust generalizability of ABot-OCR.


[101] VCap: Hypergeometric Rewards for Weak-to-Strong Visual Captioning cs.CV | cs.AI | cs.CL | cs.MMPDF

Xingyu Lu, Jinpeng Wang, Yi-Fan Zhang, Yankai Yang, Yancheng Long

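TL;DR: VCap proposes a Witness-Adjudicator reward that pairs the reference caption (the witness) with the visual signal (the adjudicator) and explicitly verifies factual consistency between the reference and policy-generated captions, grounded in the visual signal, to provide fine-grained, reliable rewards for visual captioning. The design supports effective learning from imperfect references, enabling weak-to-strong generalization.

Details

Motivation: Existing reward designs for visual captioning fail to provide fine-grained, reliable signals for factual verification, which limits how effectively reinforcement learning can drive multimodal large language models toward higher precision and broader coverage.

Result: An 8B-parameter model trained with VCap outperforms open- and closed-source SOTA models on multiple image and video captioning benchmarks, and human evaluation further confirms strong alignment with factual correctness.

Insight: The core idea is to treat the reference caption as a "witness" and the visual signal as an "adjudicator", and to verify factual consistency between the reference and generated captions against the visual signal, yielding a reward signal with hypergeometric-distribution-level precision. This allows imperfect references to be used, achieves weak-to-strong generalization, and challenges prior assumptions about RLVR.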

Abstract: Visual captioning requires models to capture visual content faithfully while minimizing both omission and hallucination. As the dominant paradigm for captioning, MLLMs have achieved strong performance through scaling and high-quality data. Recently, RL has emerged as a key route to driving MLLMs toward higher precision and broader coverage. However, existing reward designs for captioning fail to provide fine-grained and reliable signals for factual verification, limiting their effectiveness. To address this, we propose VCap, a Witness-Adjudicator reward that pairs the reference caption (a witness) with the visual signal (an adjudicator). By explicitly verifying factual consistency between the reference and policy-generated captions grounded in the visual signal, VCap delivers a reward signal with hypergeometric-distribution-level precision for caption quality verification. This design enables effective learning even from imperfect references, facilitating weak-to-strong generalization in RL training. In our experiments, an 8B model trained with VCap outperforms open- and closed-source SOTA models on multiple image and video captioning benchmarks. Human evaluation further confirms its strong alignment with factual correctness. Additionally, VCap improves MLLM perceptual capability, generalizes across tasks, and surpasses best-of-N distillation, challenging prior assumptions about RLVR.


[102] Beyond Surrogate Gradients: Fully Differentiable Token Pruning for Vision-Language Models cs.CVPDF

Landi He, Mingde Yao, Shawn Young, Lijian Xu

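TL;DR: This paper proposes DiffPrune, a fully differentiable visual token pruning method for vision-language models (VLMs). By introducing an Information Throttler, it reformulates pruning as continuous control of token information rather than discrete selection learning, avoiding the unreliable learning caused by the surrogate gradients that conventional methods depend on. At inference, tokens are removed by hard thresholding on the learned importance scores, yielding efficient acceleration.

Details

Motivation: Existing visual token pruning methods typically rely on Gumbel-Softmax to approximate discrete selection during training, so optimization is driven by surrogate gradients rather than the true selection process, and token importance is learned unreliably. The paper addresses this with a fully differentiable pruning method.

Result: Across ten VLM benchmarks, DiffPrune retains 96.5% of full-model accuracy while accelerating LLM prefill by 2.85x, with only 0.69 ms of inference overhead.

Insight: The core idea is to recast pruning as continuous control of token information: the Information Throttler modulates each token with variance-preserving noise conditioned on its importance score, giving a fully differentiable optimization path for learning token importance. This avoids discrete approximation and balances more reliable importance learning with efficient inference.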

Abstract: Visual token pruning reduces the computational cost of Vision-Language Models (VLMs) by removing redundant visual tokens. Existing methods typically rely on Gumbel-Softmax to approximate discrete selection during training. However, the optimization is driven by surrogate gradients rather than the true selection process, leading to unreliable learning of token importance. In this paper, we propose DiffPrune, which reformulates pruning as continuous control of token information instead of discrete selection learning. Specifically, we introduce an Information Throttler that modulates each token using variance-preserving noise conditioned on importance scores, where higher scores induce less information suppression during training. This design directly operates on token representations, naturally providing a fully differentiable optimization path for learning token importance. At inference, tokens are removed via hard thresholding on the learned scores. Across ten VLM benchmarks, DiffPrune retains 96.5% of full-model accuracy while accelerating LLM prefill by 2.85x, with only 0.69 ms of inference overhead.
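
One way to read the Information Throttler is as a variance-preserving mix of a token with Gaussian noise. The sketch below assumes the mixing rule sqrt(s) * x + sqrt(1 - s) * eps with a score s in [0, 1]; this rule, the function names `throttle` and `prune`, and the threshold value are illustrative assumptions, not details given in the abstract.

```python
import math
import random


def throttle(token, score, rng):
    """Training-time modulation (assumed form).

    Mixes each feature with unit Gaussian noise. For unit-variance features the
    output variance stays at score + (1 - score) = 1, and a higher score
    suppresses less of the token's information.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must lie in [0, 1]")
    keep = math.sqrt(score)
    noise_scale = math.sqrt(1.0 - score)
    return [keep * x + noise_scale * rng.gauss(0.0, 1.0) for x in token]


def prune(tokens, scores, threshold):
    """Inference-time hard thresholding on the learned scores."""
    return [tok for tok, s in zip(tokens, scores) if s >= threshold]


if __name__ == "__main__":
    rng = random.Random(0)
    print(throttle([1.0, -2.0], 1.0, rng))   # score 1: token passes unchanged
    print(throttle([1.0, -2.0], 0.0, rng))   # score 0: token replaced by noise
    print(prune(["a", "b", "c"], [0.9, 0.2, 0.5], threshold=0.5))
```

Because the modulation is a smooth function of the score, gradients can flow to the score directly, which is the property the abstract contrasts with Gumbel-Softmax surrogates.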


[103] VLA-Hijack: A Transferable Patch Attack against Vision-Language-Action Models via Visual Proprioception Hijacking cs.CVPDF

Jiyuan Fu, Kaixun Jiang, Jingkai Jia, Zhaoyu Chen, Xueyao Chen

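TL;DR: This paper proposes VLA-Hijack, a transferable adversarial patch attack framework against Vision-Language-Action (VLA) models. It exploits a vulnerability shared by VLA models, their reliance on vision to localize the robot itself, and concurrently optimizes Attention-Guided Proprioceptive Suppression and Multimodal Proprioceptive Injection to establish the patch as a surrogate "phantom embodiment", effectively severing the semantic relationship between the agent's true embodiment and its control policy.

Details

Motivation: The severe vulnerability of VLA models to adversarial patches hinders their deployment in safety-critical domains. Existing patch attacks are mostly white-box and heavily overfit to the target model's specific action output space, resulting in poor cross-architecture transferability.

Result: Extensive experiments across diverse architectures (OpenVLA, UniVLA, and CronusVLA) show that VLA-Hijack achieves superior optimization efficiency in white-box settings and sets a new SOTA for cross-architecture and cross-domain black-box transferability.

Insight: The core contribution is identifying and exploiting, as an attack surface, the fundamental vulnerability that a VLA model must visually localize its own robotic arm before planning any motion, together with a unified adversarial framework that hijacks visual proprioception by alternating between semantic concept anchoring and visual prototype projection, enabling efficient and transferable attacks.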

Abstract: While Vision-Language-Action (VLA) models have emerged as powerful generalist policies, their severe vulnerability to adversarial patches significantly hinders their deployment in safety-critical domains. Moreover, existing patch attacks primarily focus on white-box settings, heavily overfitting to the specific action output space of the target model, which results in poor cross-architecture transferability. To overcome this limitation, we propose VLA-Hijack, a unified adversarial framework that breaks the transferability bottleneck by exploiting a fundamental vulnerability identified in this work: before planning any motion, a VLA model must first use visual information to locate its own robotic arm within the environment. Targeting this shared visual self-localization process, our approach concurrently optimizes Attention-Guided Proprioceptive Suppression to inhibit the real robotic arm’s features, and Multimodal Proprioceptive Injection to establish the patch as a surrogate “phantom embodiment”. By alternating between semantic concept anchoring and visual prototype projection, VLA-Hijack effectively severs the semantic relationship between the agent’s true embodiment and its control policy. Extensive experiments across diverse architectures (OpenVLA, UniVLA, and CronusVLA) demonstrate that VLA-Hijack achieves superior optimization efficiency in white-box settings and sets a new SOTA for cross-architecture and cross-domain black-box transferability.


[104] CogPortrait: Fine-Grained Eye-Region Control in Portrait Animation via Hierarchical Agent Planning cs.CVPDF

He Feng, Yongjia Ma, Donglin Di, Lei Fan, Tonghua Su

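TL;DR: This paper proposes CogPortrait, a two-stage portrait animation framework for fine-grained eye-region control. In the first stage, a chain of multimodal large language model agents compiles high-level labels into facial keypoints; in the second stage, a DiT-based video generation backbone synthesizes the final animation from the keypoints, reference portrait, audio, and text prompt. Experiments on HDTF and the newly introduced EMH benchmark show more precise eye-region control with better visual quality and identity consistency.

Details

Motivation: Portrait animation methods have made substantial progress in visual quality and lip synchronization, but fine-grained manipulation of the eye region still faces a trade-off between input granularity and motion accuracy. Emotion labels and coarse text prompts cannot describe subtle ocular dynamics, while approaches based on Action Units or driving videos offer higher fidelity at the cost of a heavier input burden, and these limitations remain restrictive for beyond-emotion states such as thinking and drowsiness.

Result: Extensive experiments on HDTF and the newly introduced EMH benchmark (covering diverse emotions and beyond-emotion categories) show that CogPortrait achieves more precise eye-region control than existing methods while maintaining superior visual quality and identity consistency. Evaluation uses two Action-Unit-level metrics to quantify fine-grained eye-region and head-motion control.

Insight: Contributions include: 1) hierarchical agent planning, in which three chain-of-thought MLLM agents compile high-level labels into facial keypoints, enabling fine-grained control from semantics down to physiological constraints; 2) a dynamic classifier-free guidance strategy in the DiT generation backbone, combined with eye-region-aware reweighting and KTO-based refinement for boundary cases, which improves generation quality. The approach offers a new route to generating complex, subtle nonverbal facial dynamics.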

Abstract: Portrait animation methods have achieved substantial visual quality and lip synchronization, but fine-grained manipulation of the eye region still faces a trade-off between input granularity and motion accuracy. Existing methods using emotion labels or coarse text prompts are insufficient for describing subtle ocular dynamics, whereas approaches based on Action Units or driving videos provide higher fidelity at the cost of a heavier input burden. These limitations are still restrictive for beyond-emotion states (e.g., thinking and drowsiness). In light of the above, we propose CogPortrait, a two-stage framework that generates portrait animations from high-level labels. In the first stage, three chain-of-thought Multimodal Large Language Model (MLLM) agents compile high-level labels into facial keypoints through temporal event planning, prototype retrieval and composition from a real-behavior library, and semantic-physiological constraint enforcement. In the second stage, a DiT-based video generation backbone synthesizes the final animation conditioned on the keypoints, reference portrait, audio, and text prompt, enhanced by a dynamic classifier-free guidance strategy with eye-region-aware reweighting and KTO-based refinement for boundary cases. We further introduce the EMH benchmark covering diverse emotions and beyond-emotion categories, with two AU-level metrics for evaluating fine-grained eye-region and head-motion control. Extensive experiments on HDTF and the EMH benchmark demonstrate that CogPortrait achieves more precise eye-region control than existing methods while maintaining superior visual quality and identity consistency.


[105] ST-ColoNet: Spatio-Temporal Colon Segment Recognition via Hybrid Attention and Edge-Guided Feature Learning cs.CVPDF

Ziyi Wang, Zhengjie Zhang, Jingsheng Gao, Dahong Qian, Suncheng Xiang

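TL;DR: This paper proposes ST-ColoNet, a two-stage deep learning framework for colon segment recognition from colonoscopy videos. It exploits spatio-temporal information through the Colorlaus module, which uses metric learning to optimize edge-guided spatial feature extraction, and the Full-Temp module, which combines three self-attention patterns to better approximate full self-attention on long sequences and optimize temporal feature aggregation. The authors also curate and release a labeled dataset for the task.

Details

Motivation: Existing automatic colon segment recognition methods use only colonoscopy images and do not fully exploit temporal information, leading to poor performance, and relevant public video datasets are scarce.

Result: On the authors' dataset, ST-ColoNet reaches 81.0% accuracy and a 70.7% F1-score; extensive ablation experiments indicate state-of-the-art performance on the task, a substantial improvement over existing methods.

Insight: The novelty lies in a two-stage framework that combines edge-guided spatial feature learning with hybrid self-attention temporal modeling, and in releasing a scarce dedicated video dataset for the task, addressing both the neglect of temporal information and the lack of data in existing methods.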

Abstract: Colo-segment recognition in colonoscopy videos is a key requirement for many downstream tasks, but existing automatic recognition methods use only colonoscopy images without fully exploiting temporal information, leading to poor performance. Additionally, relevant public video-based datasets are scarce. To tackle this problem, we curate and release a labeled dataset specifically for the task of colo-segment recognition. In addition, we propose a two-stage deep learning-based framework, Colo-Segment Recognition via SpatioTemporal Network (ST-ColoNet), for colo-segment recognition from colonoscopy videos. It includes the Colorlaus module, which uses metric learning to optimize edge-mediated spatial feature extraction, and the Full-Temp module, which combines three self-attention patterns to better approximate full self-attention on long colonoscopy sequences and optimize temporal feature aggregation. Through extensive ablation experiments, we show that our framework achieves state-of-the-art performance on the task of colo-segment recognition, with an accuracy of 81.0% and an F1-score of 70.7%, a substantial improvement over prior state-of-the-art methods.


[106] Qwen-Image-Bench: From Generation to Creation in Text-to-Image Evaluation cs.CVPDF

Niantong Li, Guangzheng Hu, Weixu Qiao, Ying Ba, Qichen Hong

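TL;DR: This paper introduces Qwen-Image-Bench, a text-to-image (T2I) evaluation benchmark co-designed with professional artists and grounded in real-world creation scenarios. It goes beyond conventional text-image alignment by adding two application-driven dimensions, Real-world Fidelity and Creative Generation, and builds a hierarchical taxonomy of 5 pillars, 23 sub-capabilities, and 56 verifiable rubrics. Q-Judger, a unified judge model trained on Qwen3.6-27B under professional annotation, produces fine-grained, attributable diagnostic scores for each image, reliably distinguishing leading T2I models and providing an optimization signal for production-level development.

Details

Motivation: Existing T2I benchmarks still focus mainly on basic text-image alignment and do not capture the nuanced capabilities that matter in real artistic practice, making it difficult to reliably distinguish state-of-the-art models.

Result: Qwen-Image-Bench achieves its greatest separation between leading T2I models on the two application-driven dimensions, Real-world Fidelity and Creative Generation, where existing benchmarks provide little insight, while also providing a trustworthy optimization signal for production-level T2I development.

Insight: The novelty lies in a creator-centric, hierarchical, fine-grained evaluation taxonomy rooted in real creation scenarios, and in a unified judge model trained on professional annotations that delivers attributable, fine-grained diagnostics rather than a single opaque score, offering a new paradigm for assessing the practical capability of T2I models in professional workflows.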

Abstract: Text-to-Image generation has evolved from basic image synthesis into a frequently used core capability in professional creative workflows, where simple text-image alignment can no longer satisfy users’ pressing demands for faithful real-world reconstruction and genuine creative expression. Existing benchmarks, however, remain anchored in these foundational criteria and do not yet capture the nuanced capabilities that matter in authentic artistic practice, making it difficult to reliably distinguish state-of-the-art T2I models. To address the gap, we introduce Qwen-Image-Bench, a creator-centric benchmark co-designed with professional artists and grounded in real-world creation scenarios. Qwen-Image-Bench enriches conventional evaluation with two application-driven dimensions: Real-world Fidelity and Creative Generation. Drawing on the staged reasoning inherent in professional artistic workflows, we organize these five pillars into a top-down hierarchical taxonomy that further decomposes into 23 second-level sub-capabilities and 56 third-level verifiable rubrics. To ensure broad coverage, we curate 1000 stratified prompts with each prompt jointly exercising more than four fine-grained facets across multiple pillars. We train a unified judge model Q-Judger based on Qwen3.6-27B, supervised by 80 professional annotators from global art academies under blind labeling and triple-review protocols, that scores every image across all 56 verifiable facets, producing fine-grained, rubric-grounded, and fully attributable diagnostics rather than a single opaque score. Empirically, Qwen-Image-Bench reliably distinguishes leading T2I models, achieving the greatest separation on the two application-driven dimensions of Real-world Fidelity and Creative Generation where existing benchmarks provide little insight, while also providing a trustworthy optimization signal for production-level T2I development.


[107] Which Pretraining Paradigm Better Serves Spatial Intelligence? An Empirical Comparison of Vision-Language and Video Generation Models cs.CVPDF

Haozhan Shen, Tiancheng Zhao, Kangjia Zhao, Jianwei Yin

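TL;DR: This paper presents the first systematic comparison of two mainstream pre-training paradigms, Vision-Language Models (VLMs) and Video Generation Models (VGMs), on spatial intelligence tasks. Frozen-feature probing shows that VLMs are stronger at semantic tagging and instance grouping, while VGMs are better at dense geometry and camera motion prediction. A naive fusion of the two already performs well on both geometry and semantics, pointing toward stronger spatial-intelligence backbones.

Details

Motivation: Spatial intelligence requires visual representations that capture both semantic objects and geometric structure in the physical world. It remains unclear which of the two widely used pre-training schemes, VLMs or VGMs, provides the better representation substrate for spatial intelligence.

Result: Frozen-feature probing on three representative spatial intelligence tasks (semantic tagging, instance grouping, and 3D geometry prediction) shows that VLMs are stronger on semantic tasks and VGMs on geometric tasks. A naive feature fusion performs well on both.

Insight: The study reveals a clear complementarity between VLM and VGM representations. Effectively integrating features from both model families is a promising direction for building stronger spatial-intelligence backbones, with implications for the design of multimodal pre-trained models.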

Abstract: Spatial intelligence requires visual representations that capture both semantic objects and geometric structure in the physical world. To support this, two major pre-training schemes are now widely used as foundation backbones: Vision-Language Models (VLMs), which use language supervision to align visual observations with semantic concepts, and Video Generation Models (VGMs), which learn from temporally evolving visual worlds. However, it remains unclear which pre-training scheme provides a better representation substrate for spatial intelligence. In this paper, we present the first systematic frozen-feature probing study of VLMs and VGMs across three representative axes of spatial intelligence: semantic tagging, instance grouping, and 3D geometry prediction. Using a lightweight probe, our framework enables a controlled comparison of what information is already encoded in frozen representations from the two model families. Experimental results reveal a clear complementarity: VLMs are stronger at semantic tagging and instance grouping, while VGMs provide more accessible signals for dense geometry and camera motion. Moreover, a naive fusion of the two already yields a representation that excels at both geometry and semantics, suggesting a promising direction for building stronger spatial-intelligence backbones by effectively integrating features from both model families. Our code is available at https://github.com/om-ai-lab/Probing-VLM-VGM.


[108] SAM-Enhanced Segmentation on Road Datasets: Balancing Critical Classes in Autonomous Driving cs.CV | cs.ROPDF

Toomas Tahves, Mauro Bellone, Junyi Gu, Raivo Sell

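TL;DR: This paper proposes a Segment Anything Model (SAM)-based annotation pipeline that generates dense, pixel-level semantic segmentation annotations for the Zenseact Open Dataset (ZOD), which provides only bounding-box labels. After processing over 100,000 frames and manually curating a 2,300-frame high-quality subset, the authors evaluate segmentation models such as CLFT and DeepLabV3+ across weather conditions and explore specialized models targeting rare classes to address extreme class imbalance. The pipeline is also validated on the Iseauto autonomous-vehicle platform, and bidirectional transfer learning shows that SAM-derived representations transfer effectively across sensor configurations.

Details

Motivation: Autonomous driving requires dense semantic segmentation, but many multi-modal datasets, ZOD among them, lack pixel-level annotations and offer only bounding-box labels, which limits their use in segmentation research.

Result: On ZOD, the transformer-based CLFT-Hybrid model reaches 48.1% mIoU; validation on the Iseauto autonomous-vehicle platform reaches 77.5% mIoU.

Insight: The main contribution is a SAM-based pipeline that converts bounding boxes into pixel-level segmentation annotations, followed by manual curation, to address annotation scarcity. In addition, the exploration of specialized rare-class models and bidirectional transfer learning shows that SAM-derived representations transfer across sensor configurations.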

Abstract: Dense semantic segmentation is essential for autonomous driving, yet many multi-modal datasets lack pixel-level annotations. The Zenseact Open Dataset (ZOD) provides rich multi-sensor data but only bounding-box labels, limiting its use for segmentation research. Our primary contribution is a Segment Anything Model (SAM)-based annotation pipeline that produces dense, pixel-level annotations for ZOD by converting bounding boxes into semantic masks. In this pilot study, we process over 100,000 frames and manually curate a 2,300-frame subset (36% acceptance rate) to establish a reliable baseline. Using these annotations, we evaluate transformer-based CLFT and CNN-based DeepLabV3+ architectures across diverse weather conditions, achieving up to 48.1% mIoU with CLFT-Hybrid. To address extreme class imbalance, where pedestrians, cyclists, and signs constitute less than 1% of pixels, we explore specialized models targeting rare classes. We further validate the pipeline on the Iseauto autonomous-vehicle platform, achieving 77.5% mIoU, and show that SAM-derived representations transfer effectively across sensor configurations via bidirectional transfer learning. All code and annotations are released to support reproducible research.
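
Because mIoU averages per-class IoU with equal weight, rare classes such as pedestrians or signs influence the score as much as dominant ones. The following minimal sketch computes mIoU over flat label lists; skipping classes absent from both prediction and ground truth is one common convention and is an assumption here, as is the function name `mean_iou`.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union across classes for flat label sequences."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes that appear in neither sequence
            ious.append(inter / union)
    if not ious:
        raise ValueError("no class is present in pred or target")
    return sum(ious) / len(ious)


if __name__ == "__main__":
    # Class 1 is rare: one missed pixel costs far more than it would in pixel accuracy.
    target = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
    pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
    print(mean_iou(pred, target, num_classes=2))
```

In the usage example, pixel accuracy is 90%, yet the rare class reaches only 50% IoU, which pulls the mean down noticeably.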


[109] Intra-YOLO: A Small Object Detection Model for Caries and Molar-Incisor Hypomineralization in Intraoral Photography Based on Transfer Learning with Reinforcement Learning cs.CVPDF

Po-Lun Chwang, Po-Yu Chang, Wen-Liang Lin, Tung-Sheng Wu, Min-Ching Wang

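TL;DR: This study develops a computer-aided diagnosis system, based on transfer learning and reinforcement learning, for detecting caries and molar-incisor hypomineralization (MIH) in intraoral photographs. The system uses the Intra-YOLO model, tailored to the small size and similar appearance of these lesions, to ease clinical differentiation.

Details

Motivation: Caries and MIH look similar in intraoral photographs, and their small size and variable imaging conditions make clinical differentiation difficult, motivating an automatic detection system to assist diagnosis.

Result: The abstract reports no specific experimental results, benchmarks, or performance figures; the title indicates that the model is built on transfer learning with reinforcement learning and targets small object detection.

Insight: The novelty lies in combining transfer learning and reinforcement learning to adapt a YOLO model to small object detection in intraoral photography, an approach that may carry over to similar problems in medical image analysis.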

Abstract: This study developed a computer-aided diagnosis (CAD) system for detecting caries and molar-incisor hypomineralization (MIH) in intraoral photographs. These lesions share similar appearances, making clinical differentiation challenging, especially given their small size and variability in imaging conditions.


[110] MeniOmni: A Structured Multimodal Benchmark for Holistic Meniscus Injury Assessment cs.CVPDF

Shurui Xu, Siqi Yang, Weiping Ding, Hui Wang, Mengzhen Fan

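TL;DR: This paper presents MeniOmni, a structured multimodal benchmark for meniscus injury assessment comprising 746 multi-center MRI studies that combine tri-planar volumetric MRI, Clinical Priors, and expert-annotated clinical text. The benchmark supports two tasks, fine-grained Stoller severity grading and diagnostic report generation, and introduces risk-aware ordinal evaluation and a semantic consistency metric (Meni-Score) to better reflect clinical relevance. Baseline experiments show that incorporating Clinical Priors improves grading performance and reduces severe errors.

Details

Motivation: Existing knee MRI benchmarks are typically unimodal and rely on coarse labels, so they cannot evaluate holistic clinical reasoning. Clinical diagnosis requires radiologists to integrate volumetric MRI evidence with patient context (e.g., sex, age, BMI) and to produce structured diagnostic reports, which calls for a more comprehensive multimodal benchmark.

Result: Baseline experiments show that incorporating Clinical Priors improves fine-grained Stoller severity grading and reduces severe errors. The benchmark provides a new standard for evaluating models on meniscus injury assessment.

Insight: The novelty lies in a structured multimodal benchmark that integrates imaging, Clinical Priors, and text, together with risk-aware ordinal evaluation and a semantic consistency metric (Meni-Score) designed for clinical tasks, underscoring the value of multimodal context for safer, more holistic medical image assessment.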

Abstract: Clinical diagnosis of meniscus injuries requires radiologists to integrate volumetric MRI evidence with patient context (e.g., sex, age, BMI) and to produce structured diagnostic reports. Existing knee MRI benchmarks are typically unimodal and rely on coarse labels, limiting their ability to evaluate holistic clinical reasoning. We introduce MeniOmni, a structured multimodal benchmark for meniscus injury assessment, consisting of 746 multi-center MRI studies with tri-planar volumetric inputs, Clinical Priors, and expert-annotated clinical text. MeniOmni supports two tasks: (1) fine-grained Stoller severity grading and (2) diagnostic report generation. We further propose risk-aware ordinal evaluation and a semantic consistency metric (Meni-Score) to better reflect clinical relevance. Baseline experiments show that incorporating Clinical Priors improves grading performance and reduces severe errors, highlighting the value of multimodal context for safer assessment. Code and data are available at https://github.com/ShuruiXu/MeniOmni.


[111] DebFilter: Eradicating Biases Stashed in Value cs.CVPDF

Seung Hyuk Lee, Songkuk Kim

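TL;DR: This paper proposes DebFilter, a lightweight, training-free framework for mitigating social and semantic biases in text-to-image diffusion models. It adjusts the value components within cross-attention, correcting the guidance embedding directly at inference time, to produce more balanced and fair images without changing model parameters.

Details

Motivation: In text-to-image diffusion models, text embeddings extracted from pretrained vision-language models such as CLIP encode social biases (for example, related to gender and age) that are propagated and amplified during generation; combined with imbalanced training data, this leads to skewed outputs.

Result: The results indicate that DebFilter effectively mitigates social biases in generated images, offering an efficient and scalable solution that requires no fine-tuning or retraining.

Insight: The novelty lies in analyzing cross-attention dynamics and applying a fixed offset to a slice of the guidance embedding at inference time, which steers cross-attention values toward unbiased representations. The method is lightweight, needs no additional data or model updates, and supports fairer generation.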

Abstract: Text-to-image diffusion models, which are theoretically equivalent to score-based generative models, generate images through a multi-step denoising process guided by text embeddings extracted from pretrained vision-language models such as CLIP. However, these text embeddings inherently encode social and semantic biases, such as those related to gender and age, that are subsequently propagated and amplified through the guidance mechanism. Together with the model's training on large-scale datasets that are imbalanced with respect to these bias-related concepts, this often leads to skewed outputs in text-to-image generation. We propose DebFilter, a lightweight and training-free framework for mitigating such biases in text-to-image diffusion models. Observing that the model's error prediction at each denoising step is primarily influenced by cross-attention dynamics, we introduce a bias-correction strategy that adjusts the value components within cross-attention. Specifically, we apply a fixed offset to a slice of the guidance embedding, effectively steering the semantic direction of cross-attention values toward unbiased representations. This adjustment reconfigures the score landscape to produce balanced outputs while maintaining alignment with the intended text semantics. Unlike prior approaches that rely on fine-tuning or retraining, DebFilter operates entirely at inference time, requiring no additional data or model updates. Our results demonstrate that this method effectively mitigates social biases in generated images, offering an efficient and scalable pathway toward fairer and more inclusive text-to-image generation.


[112] FLORO: A Multimodal Geospatial Foundation Model for Ecological Remote Sensing Across Sensors and Scales cs.CV | cs.AIPDF

Jorge L. Rodriguez, Victor Angulo Morales, Areej Alwahas, Mariana Elias Lara, Fida Mohammad Thoker

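TL;DR: This paper presents FLORO, a multimodal geospatial foundation model designed to learn transferable representations from a small but highly diverse remote sensing corpus. It is pretrained with masked autoencoding on a heterogeneous combination of Sentinel-1, Sentinel-2, SkySAT imagery, elevation, and UAV-derived data, and uses availability-aware inputs to handle sensor variability.

Details

Motivation: Many current foundation models depend on very large pretraining datasets and fixed sensor configurations, which limits their suitability for ecological and environmental applications where observation platforms, spatial and spectral resolutions, and available modalities vary. A method that learns general representations from a small, diverse dataset is therefore needed.

Result: Under the frozen-encoder protocol of the PANGAEA benchmark, FLORO transfers strongly and stably across scene classification, segmentation, and regression tasks. It obtains the second-best average segmentation performance across six PANGAEA benchmarks, trailing only a model pretrained on over two orders of magnitude more images, remains competitive on scene classification, and is robust on regression tasks.

Insight: The novelty lies in availability-aware inputs that unify the input space across heterogeneous sensor configurations, allowing the model to handle multi-source, multi-scale, multimodal remote sensing data. In addition, a separate controlled experiment on EuroSAT-MS shows that geo-positional encoding further improves classification relative to absolute positional encoding.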

Abstract: Foundation models offer a promising route to transferable remote sensing representations, but many current approaches depend on very large pretraining datasets and fixed sensor configurations, limiting their suitability for ecological and environmental applications, where observations often vary across platforms, spatial and spectral resolutions, and available modalities. We introduce FLORO, a multimodal geospatial foundation model designed to learn transferable representations from a small but highly diverse remote sensing corpus. FLORO is pretrained using masked autoencoding on a heterogeneous combination of Sentinel-1, Sentinel-2, SkySAT imagery, elevation, and UAV-derived data. To accommodate sensor variability, FLORO incorporates availability-aware inputs that indicate which spectral bands and auxiliary modalities are present in each sample, enabling a unified input space across heterogeneous sensor configurations. We evaluated FLORO on the PANGAEA benchmark under a frozen-encoder protocol across scene classification, segmentation, and regression tasks. Despite being pretrained on a smaller corpus than competing foundation models, FLORO achieved strong and stable transfer across optical, optical-SAR, and optical-elevation benchmarks spanning medium-resolution satellite, airborne, and ultra-high-resolution UAV imagery. FLORO obtained the second-best average segmentation performance across six PANGAEA benchmarks, trailing only a recently introduced foundation model pretrained on over two orders of magnitude more images, remained competitive on scene classification, and was robust in regression tasks, while qualitative results showed improved preservation of spatial structure in flood, urban, biomass, and canopy-height prediction settings. In a separate controlled experiment on EuroSAT-MS, geo-positional encoding further improved classification relative to absolute positional encoding.
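
The idea of availability-aware inputs can be sketched as a fixed channel layout plus a binary mask marking which channels a sample actually provides. The channel names in `BAND_ORDER`, the zero fill value, and the function name `build_input` are hypothetical and chosen for illustration; the abstract does not specify FLORO's actual encoding.

```python
# Hypothetical fixed channel layout shared by all samples.
BAND_ORDER = ["s2_red", "s2_nir", "s1_vv", "elevation"]


def build_input(sample):
    """Map a sample with arbitrary available channels onto the fixed layout.

    Returns (values, available): missing channels are zero-filled and flagged
    with 0 in the availability mask, so a model can tell "absent" from "zero".
    """
    values = []
    available = []
    for name in BAND_ORDER:
        if name in sample:
            values.append(float(sample[name]))
            available.append(1)
        else:
            values.append(0.0)
            available.append(0)
    return values, available


if __name__ == "__main__":
    optical_only = {"s2_red": 0.3, "s2_nir": 0.6}
    optical_plus_elevation = {"s2_red": 0.3, "elevation": 120.0}
    print(build_input(optical_only))
    print(build_input(optical_plus_elevation))
```

Both samples end up with the same input shape, which is what lets a single encoder consume heterogeneous sensor configurations.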


[113] VidPrism: Heterogeneous Mixture of Experts for Image-to-Video Transfer cs.CV | cs.AIPDF

Rui Lin, Chuanming Wang, Huadong Ma

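TL;DR: This paper proposes VidPrism, a heterogeneous temporal Mixture-of-Experts framework for image-to-video transfer learning. It deploys functionally specialized experts (ranging from spatial understanding to temporal modeling), introduces a content-aware multi-rate sampling module that provides specialized inputs, and adds a dynamic bidirectional fusion mechanism, overcoming the expert homogenization of conventional MoE designs and improving video understanding performance.

Details

Motivation: Existing MoE-based video understanding methods suffer from expert homogenization: all experts act as identical generalists that inefficiently learn spatio-temporal features from undifferentiated video streams, which limits model performance.

Result: Extensive experiments on multiple video recognition benchmarks show that VidPrism achieves state-of-the-art performance and effectively fosters expert specialization.

Insight: The core innovation is the co-design of heterogeneous expert division of labor with content-aware multi-rate sampling: specialized video streams, ranging from semantically rich to motion-focused, are generated dynamically and combined through bidirectional fusion, enabling more efficient spatio-temporal feature learning. This offers a new architectural direction for video understanding models.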

Abstract: With the rapid development of pre-training technologies, adapting large-scale Vision-Language Models (VLMs) for video understanding, i.e., image-to-video transfer learning, has become a dominant paradigm. Among recent advances, employing Mixture-of-Experts (MoE) to enhance VLMs’ temporal modeling capabilities has emerged as an effective strategy for achieving superior performance. However, conventional MoE designs suffer from expert homogenization, where all experts act as identical generalists, inefficiently learning spatio-temporal features from undifferentiated video streams. To overcome this problem, we propose VidPrism, a novel heterogeneous temporal Mixture-of-Experts framework. VidPrism pioneers a division of labor by deploying functionally specialized experts, each assuming a role ranging from spatial understanding to temporal modeling. To feed these specialists appropriately, we introduce a content-aware, multi-rate sampling module that dynamically generates streams ranging from semantically rich to motion-focused representations, providing specialized inputs for experts. Furthermore, a dynamic, bidirectional fusion mechanism enables synergistic information exchange between these pathways, leading to a comprehensive video representation. Extensive experiments on various video recognition benchmarks demonstrate that VidPrism achieves state-of-the-art performance and effectively fosters expert specialization. Our source code is available at https://github.com/Lrrrr549/VidPrism.git.


[114] Proprio: Latent Self-Scoring and Inference-Time Refinement for Physically Plausible Video Generation cs.CVPDF

Mariam Hassan, Kaouther Messaoud, Wuyang Li, Alexandre Alahi

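TL;DR: This paper proposes Proprio, a training-free framework that improves the physical plausibility of videos produced by a frozen video generator. Inspired by proprioception, it uses the model's flow residual under controlled latent perturbations as a self-scoring signal and improves outputs through best-of-N selection or gradient-based self-refinement.

Details

Motivation: Existing video generative models produce visually impressive results but frequently violate basic physical principles, so a method is needed to assess and improve the physical plausibility of generated content.

Result: On text-to-video and image-to-video benchmarks, Proprio significantly improves physical plausibility. With TurboWan2.2, Physics-IQ rises from 32.2 to 37.5 (+16.5%) and VideoPhy2-hard physical commonsense from 45.6 to 55.0 (+20.6%); in human evaluation, Proprio's outputs are preferred in roughly two-thirds of comparisons.

Insight: The novelty lies in using the generative model's own flow residual as an internal self-scoring signal and focusing it on motion-relevant regions with a dynamic spatiotemporal mask, enabling physical-plausibility assessment and refinement without an external world model or additional training.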

Abstract: Modern video generative models produce visually impressive results, yet frequently violate basic physical principles. We propose Proprio, a training-free framework that enables a frozen video generator to assess and improve the physical plausibility of its own outputs. Inspired by proprioception, the biological sense of one’s own movement, Proprio treats the model’s flow residual under controlled latent perturbations as a self-scoring signal. Samples that are better explained by the generator’s learned dynamics induce smaller and more stable residuals. We aggregate this signal across timesteps and perturbations, focus it on motion-relevant regions with a dynamic spatiotemporal mask, and use it for best-of-N search, gradient-based self-refinement, or both. Across text-to-video and image-to-video benchmarks, Proprio consistently improves physical plausibility, outperforming VLM-based scoring and external world-model baselines in several settings. With TurboWan2.2, Proprio improves Physics-IQ from 32.2 to 37.5 (+16.5%) and VideoPhy2-hard physical commonsense from 45.6 to 55.0 (+20.6%). Human evaluation further shows that raters prefer Proprio-selected or refined videos for physical plausibility in roughly two-thirds of comparisons. These results suggest that frozen video generators contain actionable internal signals for evaluating and improving the physical plausibility of their own outputs.
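
The best-of-N use of a self-score, as described in the abstract, can be sketched in a few lines: score each candidate by averaging a residual over timesteps and perturbations, then keep the candidate with the smallest score. This is a toy illustration, not Proprio's implementation. `residual_fn` stands in for the generator's flow residual, `toy_residual` is a made-up placeholder, and the spatiotemporal masking step is omitted.

```python
from statistics import mean


def self_score(sample, residual_fn, timesteps, seeds):
    """Average residual over timesteps and perturbation seeds (lower is better)."""
    return mean(residual_fn(sample, t, s) for t in timesteps for s in seeds)


def best_of_n(candidates, residual_fn, timesteps, seeds):
    """Return the candidate with the smallest aggregated self-score."""
    return min(candidates, key=lambda c: self_score(c, residual_fn, timesteps, seeds))


def toy_residual(sample, t, seed):
    # Hypothetical stand-in: a real system would query the frozen video generator here.
    return abs(sample) * (1.0 + t) + 0.01 * seed


# Candidates are plain numbers in this toy; in practice they would be video latents.
best = best_of_n([3.0, -0.5, 2.0], toy_residual, timesteps=[0.2, 0.8], seeds=[0, 1])
```

Selection needs only forward evaluations of the scoring function, which is why it can run on a frozen generator without any training.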


[115] From Kellgren-Lawrence to Calcium Pyrophosphate Crystal Deposition: A Soft-Labelling Framework for Knee Osteoarthritis Assessment cs.CVPDF

Francisco Bérchez-Moreno, Riccardo Rosati, Maria Chiara Fiorentino, Víctor M. Vargas, Edoardo Cipolletta

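TL;DR: This paper proposes a soft-label-based ordinal deep learning framework for assessing knee osteoarthritis (KOA) severity, targeting Kellgren-Lawrence (KL) grading and Calcium Pyrophosphate Deposition Disease (CPPD) grading. It replaces conventional one-hot labels with unimodal probability distributions (binomial, beta, triangular, and exponential) to better capture the grading uncertainty seen in clinical practice and the asymmetric relationship between the two scales. On a dataset of 2172 knee X-ray images, all soft-labelling strategies significantly outperformed the conventional baseline, with the beta and triangular distributions performing best on KL and CPPD grading, respectively.

Details

Motivation: Conventional deep learning approaches based on one-hot labels fail to capture the ordinal uncertainty of KL and CPPD severity scores and the asymmetric relationship between the two scales observed in clinical practice, which limits model accuracy and clinical utility in knee osteoarthritis assessment.

Result: On a dataset of 2172 knee X-ray images, 968 of which are jointly annotated for KL and CPPD severity, all four soft-labelling strategies (binomial, beta, triangular, exponential) consistently and significantly (p < 0.001) outperformed the conventional one-hot baseline. For CPPD grading, the triangular distribution achieved the highest Quadratic Weighted Kappa (QWK = 0.796) and the lowest Mean Absolute Error (MAE = 0.438), while the beta distribution gave the most balanced class-wise performance in terms of Average MAE (AMAE) and Maximum MAE (MMAE). For KL grading, the beta-based approach gave the best overall performance: the highest QWK (0.777), the lowest MAE (0.529), and the lowest class-wise errors.

Insight: The core contribution is a general ordinal soft-labelling framework that replaces hard one-hot labels with unimodal probability distributions centred on the annotated grade, encoding grading uncertainty and clinical knowledge (such as the asymmetric relationship between scales) into the supervision signal. Soft labelling thus offers a flexible and effective way to bring the ambiguity and ordinal structure inherent in medical assessment into model training, and may benefit other medical image analysis tasks with ordinal, uncertain labels.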

Abstract: Background and objective. Conventional Deep Learning (DL) approaches for Knee Osteoarthritis (KOA) grading rely on one-hot labels, which fail to capture both the ordinal uncertainty of Kellgren–Lawrence (KL) and Calcium Pyrophosphate Deposition Disease (CPPD) severity scores and the asymmetric relationship between the two scales observed in clinical practice. Methods. We retrospectively collected 2172 knee X-ray images, including 968 radiographs jointly annotated for KL and CPPD severity. An ordinal DL framework based on soft-labelling was developed for both tasks, replacing one-hot targets with unimodal probability distributions centred on the annotated grade. Four formulations were investigated: binomial, beta, triangular, and exponential. Results. All soft-labelling strategies consistently outperformed the nominal baseline. For CPPD grading, the triangular formulation achieved the highest Quadratic Weighted Kappa (QWK) and the lowest Mean Absolute Error (MAE) (QWK = 0.796; MAE = 0.438), while the beta formulation yielded the most balanced class-wise performance considering Average MAE (AMAE) and Maximum MAE (MMAE) across classes (AMAE = 0.458; MMAE = 0.573). For KL grading, the beta-based approach provided the best overall performance, achieving the highest QWK together with the lowest MAE and class-wise errors (QWK = 0.777; MAE = 0.529; AMAE = 0.523; MMAE = 0.775). Statistical analysis demonstrated significant improvements over conventional one-hot supervision (p < 0.001).
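
To make the idea of replacing one-hot targets with unimodal distributions concrete, here is a minimal sketch of two of the four families named in the abstract (binomial and triangular) and a cross-entropy loss against such a target. The parameterizations are assumptions chosen for illustration (the binomial `p` is set so the mode falls on the annotated grade, and `half_width` is a hypothetical knob); the paper's exact formulations are not given in the abstract.

```python
from math import comb, log


def binomial_soft_label(grade, num_classes):
    """Binomial distribution over classes 0..K-1 whose mode sits on `grade`."""
    n = num_classes - 1
    p = (grade + 0.5) / num_classes  # makes floor((n + 1) * p) equal to grade
    return [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(num_classes)]


def triangular_soft_label(grade, num_classes, half_width=2):
    """Discretized triangle centred on `grade`, normalized to sum to 1."""
    weights = [max(0.0, 1.0 - abs(k - grade) / half_width) for k in range(num_classes)]
    total = sum(weights)
    return [w / total for w in weights]


def soft_cross_entropy(predicted_probs, soft_target, eps=1e-12):
    """Cross-entropy between predicted class probabilities and a soft target."""
    return -sum(t * log(max(p, eps)) for p, t in zip(predicted_probs, soft_target))


target = triangular_soft_label(grade=2, num_classes=5)
near_miss = [0.01, 0.01, 0.01, 0.96, 0.01]  # confident prediction one grade off
far_miss = [0.01, 0.01, 0.01, 0.01, 0.96]   # confident prediction two grades off
loss_near = soft_cross_entropy(near_miss, target)
loss_far = soft_cross_entropy(far_miss, target)
```

With a one-hot target both misses would incur the same loss; with the soft target the prediction one grade off is penalized less than the one two grades off, which is the ordinal behaviour the framework is after.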


[116] Learning to Label: A Reinforced Self-Evolving Framework for Semi-supervised Referring Expression Segmentation cs.CVPDF

Runlong Cao, Ying Zang, Chuanwei Zhou, Tianrun Chen, Tong Zhang

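TL;DR: This paper proposes Learning to Label (L2L), a reinforced self-evolving framework that addresses limited supervision and unreliable pseudo-labels in semi-supervised referring expression segmentation (SS-RES). The framework models pseudo-label construction as a learnable decision-making process: a multimodal large language model extracts semantic-spatial priors as initial soft segmentation proposals, these are combined with textual cues into learnable guidance signals, and reinforcement learning adaptively selects high-utility pixel-level supervision, jointly optimizing the segmentation model and the pseudo-labels.

Details

Motivation: Semi-supervised referring expression segmentation suffers from insufficient supervision due to limited annotated data and from unreliable pseudo-labels when exploiting unlabeled image-text pairs.

Result: Extensive experiments on RefCOCO, RefCOCO+, and RefCOCOg show that the method outperforms existing methods, validating its effectiveness and generalization.

Insight: The main novelty is formalizing pseudo-label construction as a learnable, reinforcement-learning-based decision process, with a reinforced self-evolving loop that jointly optimizes the segmentation model and pseudo-label quality. It uses MLLM priors as learnable guidance signals and couples them with reinforcement-learning-driven pseudo-label selection to address learning stability under sparse supervision.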

Abstract: Semi-supervised referring expression segmentation (SS-RES) aims to achieve precise pixel-level language grounding under limited annotation, yet suffers from limited supervision and unreliable pseudo-labels when exploiting unlabeled image-text pairs. In this work, we propose Learning to Label, a reinforced self-evolving framework (L2L) that casts pseudo-label construction as a learnable decision-making process. To build foundational understanding, we leverage a multimodal large language model to extract semantic-spatial priors, which are instantiated as initial soft segmentation proposals and elevated, together with textual cues, into learnable guidance signals that condition a hierarchical segmentation network. To ensure stable learning, reinforced pseudo-label selection is formulated as an exploratory decision process that adaptively rewards high-utility pixel-level supervision based on multimodal priors and model predictions. This reinforced self-evolving loop enables joint optimization of the segmentation model and pseudo-labels, progressively enhancing label reliability under sparse supervision. Extensive experiments on RefCOCO, RefCOCO+, and RefCOCOg demonstrate improvements over existing methods, validating its effectiveness and generalization.


[117] PointQ-Bench: Benchmarking Diagnostic and Interpretable Point Cloud Quality Assessment cs.CVPDF

Duanchu Wang, Cheng Li, Junjie Yang, Jing Huang, Zihang Cheng

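TL;DR: This paper presents PointQ-Bench, a comprehensive benchmark for point cloud quality assessment. It aims to extend PCQA from conventional scalar score prediction to comprehensive quality understanding, including defect identification, issue-type diagnosis, usability grading, and open-ended quality reporting. The benchmark contains point clouds from multiple sources annotated with a range of quality-related attributes.

Details

Motivation: Existing PCQA research focuses mainly on scalar score prediction, whereas practical quality assessment requires identifying defects, diagnosing issue types, assessing downstream usability, and providing evidence-supported descriptions, none of which current benchmarks adequately evaluate.

Result: Extensive experiments on 14 vision-language models and traditional PCQA baselines reveal a perception-diagnosis gap: existing models show emerging ability in coarse defect perception but struggle with grounded diagnosis and quality calibration. Strong 2D MLLMs generally outperform existing 3D VLMs, and the benefit of additional views or point-level inputs varies across tasks, data sources, and models.

Insight: The contribution is a comprehensive PCQA benchmark supporting multiple tasks (perception, diagnosis, grading, reporting), together with the SSFRQ-5D protocol for evaluating free-form quality descriptions. The study underscores the importance of shifting PCQA from scalar scoring toward interpretable, diagnostic assessment and provides an empirical analysis of model capability gaps.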

Abstract: Point cloud quality plays a critical role in 3D acquisition, reconstruction, rendering, and perception, yet existing point cloud quality assessment (PCQA) research remains largely centered on scalar score prediction. In practical inspection scenarios, quality assessment often involves identifying defects, characterizing dominant issue types, assessing downstream usability, and providing evidence-supported descriptions, which are not explicitly evaluated by current benchmarks. We introduce PointQ-Bench, a benchmark designed to extend PCQA from scalar scoring toward comprehensive quality understanding. PointQ-Bench consists of 3,083 point clouds spanning authentic scans, simulated distortions, and AI-generated content, covering eight major issue types. Each sample is annotated with mean opinion scores (MOS), quality levels, issue tags, expert-grounded descriptions, and 12,332 question-answer pairs. The benchmark supports three perception-oriented tasks: anomaly sensing, defect diagnosis, and usability grading, as well as a cognition-oriented task of open-ended quality reporting. To evaluate free-form quality descriptions, we further propose SSFRQ-5D, a five-dimensional evaluation protocol validated through human-AI agreement analysis. Extensive experiments on 14 vision-language models and traditional PCQA baselines reveal a consistent perception-diagnosis gap: while current models exhibit emerging abilities in coarse defect perception, they struggle with grounded diagnosis and quality calibration. Strong 2D MLLMs generally outperform existing 3D VLMs, and the benefit of additional views or point-level inputs is non-uniform, varying across tasks, data sources, and models, particularly under boundary-ambiguous conditions. Overall, PointQ-Bench provides a diagnostic testbed for advancing reliable and interpretable point cloud quality understanding.


[118] Category-Level 3D Correspondence in Camera Space via Morphable Object Priors cs.CVPDF

Leonhard Sommer, Artur Jesslen, Basavaraj Sunagad, Adam Kortylewski

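TL;DR: This paper studies category-level 3D correspondence prediction and proposes a method that needs no explicit correspondence supervision: by learning a shared morphable object prior, it predicts from a single image 3D locations in camera space that remain consistent across instances. The authors introduce HouseCorr3D, the first large-scale benchmark for this task, and propose Morpheus, which learns morphable category-level shape priors by disentangling canonical shape, deformation, and object pose, so that semantically meaningful 3D correspondences emerge implicitly.

Details

Motivation: Current representations for category-level pose estimation fail to capture fine-grained semantics and cannot support reasoning about object parts, functions, and interactions. This work addresses predicting category-level 3D correspondence from a single image to advance semantic understanding of 3D objects.

Result: On the proposed HouseCorr3D benchmark, the method sets a new state of the art, showing that semantic 3D object understanding can arise without direct correspondence supervision.

Insight: Contributions include the large-scale HouseCorr3D benchmark, which provides amodal correspondence labels for occluded regions and explicit symmetry annotations, and the Morpheus method, which learns a shared canonical grounding through disentanglement so that 3D correspondences emerge implicitly without explicit supervision.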

Abstract: Understanding 3D objects from images is fundamental to robotics and AR/VR applications. While recent work has made progress in category-level pose estimation, current representations fail to capture the fine-grained semantics needed for reasoning about object parts, functions, and interactions. In this work, we study category-level 3D correspondence in camera space – predicting, from a single image, 3D locations that remain consistent across instances within a category – and show that it can emerge without explicit correspondence supervision by learning a shared morphable object prior. To enable research in this direction, we introduce HouseCorr3D, the first large-scale benchmark for monocular category-level 3D correspondence with 178k images across 50 household object categories, 280 unique instances, and 3D keypoint annotations directly on CAD models. Crucially, HouseCorr3D provides amodal correspondence labels for occluded regions and explicit symmetry annotations, addressing key limitations of existing datasets. We further propose Morpheus, a method that learns morphable category-level shape priors by disentangling canonical shape, deformation, and object pose. Through this shared canonical grounding, semantically meaningful 3D correspondences in camera space emerge implicitly. These emerging 3D correspondences set a new state of the art on HouseCorr3D, demonstrating that semantic 3D object understanding can arise without direct correspondence supervision. Data and code are publicly available at https://github.com/GenIntel/HouseCorr3D.


[119] MORI-Seg: Learning Morphological Geometry for Instance Segmentation without Instance Annotations cs.CVPDF

Leiyue Zhao, Tianyu Shi, Daniel Reisenbuchler, Xinzi He, Junchao Zhu

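TL;DR: MORI-Seg is a deep learning framework that performs instance segmentation using only semantic segmentation annotations. It learns morphology-aware geometric representations by jointly modeling object-centric distance fields and boundary-band representations, decomposing connected semantic regions into distinct instance masks.

Details

Motivation: The work targets instance-level quantification of kidney functional units. Publicly available pathology datasets usually provide only semantic segmentation annotations, so adjacent structures of the same class are merged and reliable instance-level analysis is impossible. Existing heuristic post-processing methods perform poorly in crowded and adherent regions, while deep learning-based instance segmentation requires costly, labor-intensive instance-level annotations.

Result: Experiments show that MORI-Seg improves instance separation accuracy and yields more reliable morphometric quantification than classical post-processing pipelines and representative semantic-to-instance learning approaches.

Insight: The novelty is that no instance-level annotations are needed: distance fields and boundary-band representations are learned jointly to encode interior structure and contact interfaces, and a class-conditioned feature disentanglement module promotes intra-instance coherence and inter-instance separation, giving end-to-end semantic-to-instance segmentation.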

Abstract: Instance-level quantification of kidney functional units is essential for morphometric analysis, yet most publicly available pathology datasets provide only semantic segmentation annotations, where adjacent structures of the same class are merged into single regions. This prevents reliable instance-level analysis and limits downstream quantitative studies. Existing heuristic post-processing methods often yield suboptimal instance separation, particularly in crowded and adherent regions, while deep learning-based instance segmentation approaches typically require intensive instance-level annotations that are costly and labor-intensive to obtain. We propose MORI-Seg, a deep learning framework that enables instance segmentation without requiring instance-level annotations. Instead of heuristic splitting or instance supervision, MORI-Seg learns morphology-aware geometric representations directly from semantic masks by jointly modeling object-centric distance fields and boundary-band representations to encode interior structure and contact interfaces. A class-conditioned feature disentanglement module further promotes intra-instance coherence and inter-instance separation. Under semantic-only supervision, MORI-Seg decomposes connected semantic regions into distinct instance masks in an end-to-end manner. Experiments demonstrate improved instance separation accuracy and more reliable morphometric quantification compared with classical post-processing pipelines and representative semantic-to-instance learning approaches. The official implementation is publicly available at https://github.com/ddrrnn123/MORI-Seg.
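
For readers unfamiliar with the two geometric targets named in the abstract, the sketch below shows one common way to derive a distance field and a boundary band from a binary semantic mask. It is an assumption-laden illustration, not MORI-Seg's definition: it uses 4-neighbour (city-block) distance to the nearest background pixel, and `band_width` is a hypothetical parameter.

```python
from collections import deque


def distance_to_background(mask):
    """City-block distance from each foreground pixel to the nearest background pixel.

    `mask` is a list of rows with 1 for foreground and 0 for background.
    Foreground pixels stay None if the mask contains no background at all.
    """
    height, width = len(mask), len(mask[0])
    dist = [[0 if mask[i][j] == 0 else None for j in range(width)] for i in range(height)]
    queue = deque((i, j) for i in range(height) for j in range(width) if mask[i][j] == 0)
    while queue:  # multi-source breadth-first search starting from all background pixels
        i, j = queue.popleft()
        for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            ni, nj = i + di, j + dj
            if 0 <= ni < height and 0 <= nj < width and dist[ni][nj] is None:
                dist[ni][nj] = dist[i][j] + 1
                queue.append((ni, nj))
    return dist


def boundary_band(dist, band_width=1):
    """Mark foreground pixels lying within `band_width` of the background."""
    return [[1 if d is not None and 0 < d <= band_width else 0 for d in row] for row in dist]


# A 3x3 object inside a 5x5 image.
mask = [[1 if 1 <= i <= 3 and 1 <= j <= 3 else 0 for j in range(5)] for i in range(5)]
dist = distance_to_background(mask)
band = boundary_band(dist, band_width=1)
```

The distance field peaks in object interiors while the band highlights rims and contact interfaces, which is why the two together can help split touching structures.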


[120] LV-OSD: Language-Vision-Complementary Open-Set Object Detection cs.CVPDF

Yupeng Zhang, Ruize Han, Wei Feng, Song Wang, Liang Wan

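TL;DR: This paper proposes a new problem, language-vision-complementary open-set object detection (LV-OSD), in which target categories are specified through flexible text and/or image prompts. The authors design a dual-branch detection framework, LVDor, and introduce a Target-guided Prompt Dynamic Weighting (TPDW) module to bridge the semantic gap between modalities. Experiments validate the problem setting and the method's effectiveness.

Details

Motivation: The work addresses a scenario that is more common and practical in real-world applications: specifying the categories to detect with flexible text and/or image prompts rather than a fixed category list or query images.

Result: Extensive experiments verify the reasonableness of the problem setting and the effectiveness of LVDor, but the abstract names no specific benchmark datasets and gives no quantitative comparison with existing SOTA models.

Insight: The novelty lies in the LV-OSD problem setting and the TPDW module, which dynamically fuses multimodal prompts for precise semantic alignment. In addition, the Prompt Random Masking (PRM) mechanism used during training simulates arbitrary prompt combinations at test time, improving robustness.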

Abstract: Object detection is an important task in computer vision that aims to detect objects of interest specified by a given category list or query images. In this work, we propose a new problem of language-vision-complementary open-set object detection (LV-OSD), i.e., using flexible text-based and/or image-based prompts to specify the desired object categories. This setting is more common and practical in real-world applications. For this purpose, we design a dual-branch detection framework, LVDor, which can simultaneously accept both text and image prompts. Specifically, we first build the Multi-modal Prompts (MPr) containing various text descriptions and image samples for each category. Subsequently, to bridge the semantic gap among the input image, text prompts, and image prompts, we design a Target-guided Prompt Dynamic Weighting (TPDW) module. Guided by the prior information of the target image, this module dynamically produces the text and image prompts that best align with the target semantics, achieving precise alignment and effectively reducing the discrepancy between the two modalities, thereby accommodating the LV-OSD setting. We also propose a simple Prompt Random Masking (PRM) mechanism during training to simulate the arbitrary combination of text and/or image prompts in testing. Extensive experimental results verify the reasonableness of our problem formulation and the effectiveness of our method. Prompts and code will be released publicly.
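
A mechanism like Prompt Random Masking can be illustrated with a small sampler that, at each training step, keeps only the text prompts, only the image prompts, or both. This is a hedged sketch: the function name, the probabilities `p_text_only` and `p_image_only`, and the use of `None` for a masked modality are all assumptions, since the abstract does not specify how PRM is implemented.

```python
import random


def prompt_random_mask(text_prompts, image_prompts, rng, p_text_only=1 / 3, p_image_only=1 / 3):
    """Randomly drop one prompt modality; never drops both.

    Returns (text_prompts or None, image_prompts or None).
    """
    r = rng.random()
    if r < p_text_only:
        return text_prompts, None            # simulate a text-only query
    if r < p_text_only + p_image_only:
        return None, image_prompts           # simulate an image-only query
    return text_prompts, image_prompts       # both modalities available


rng = random.Random(0)
text = ["a red fire hydrant"]
images = ["hydrant_example.png"]  # placeholder identifiers, not real files
kept_text, kept_images = prompt_random_mask(text, images, rng)
```

Training under all three combinations is what lets a single model accept whichever prompts a user happens to supply at test time.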


[121] EchoAvatar: Real-time Generative Avatar Animation from Audio Streams cs.CVPDF

Bohong Chen, Yumeng Li, Yinglin Xu, Youyi Zheng, Yanlin Weng

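TL;DR: This paper presents the EchoAvatar framework for generating high-quality 3D character animation from audio streams in real time. It uses a unified streaming architecture that synthesizes continuous, coherent full-body motion from incremental audio input and supports both speech and music. Reinforcement learning refines online generation quality, and a tool-call interface adds semantic control, turning voice agents into interactive humanoid avatars.

Details

Motivation: Most existing methods are limited to offline processing of complete audio sequences or to specific domains, rarely handle both speech and music effectively, and lack real-time capability. This work addresses real-time generation of high-fidelity 3D character motion from streaming audio to support next-generation interactive avatars and virtual assistants.

Result: Extensive experiments show that the method outperforms state-of-the-art real-time baselines in motion quality and synchronization while retaining the flexibility needed for live deployment.

Insight: Contributions include a unified streaming architecture, a robust training strategy that needs no explicit domain labels or mode switching, reinforcement learning to refine online generation, and a tool-call interface that bridges reactive animation with intent-driven behavior, combining semantic control with audio-driven synthesis.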

Abstract: Real-time synthesis of high-fidelity 3D character motion from audio is a pivotal component for next-generation interactive avatars and virtual assistants. However, most existing approaches are limited to offline processing of complete audio sequences or are constrained to specific domains, rarely handling both speech and music effectively. In this paper, we introduce a novel framework designed to generate continuous, coherent full-body motion from streaming speech and music with low latency. Central to our approach is a unified streaming architecture capable of synthesizing continuous motion from incremental audio inputs. We employ a robust training strategy that enforces strong audio dependency, allowing the model to seamlessly generalize across conversational speech and rhythmic music without requiring explicit domain labels or mode switching. Additionally, we explore reinforcement learning to refine the quality of online generation. Furthermore, we bridge reactive animation with intent-driven behavior via a tool-call interface that allows upstream Large Language Models to inject explicit semantic control. By combining this controllability with streaming audio-driven synthesis, our framework serves as a plug-and-play solution for transforming voice agents into interactive humanoid avatars. Extensive experiments demonstrate that our method outperforms state-of-the-art real-time baselines in motion quality and synchronization while maintaining the flexibility required for live deployment. Our code, pre-trained models, and videos are available at https://robinwitch.github.io/EchoAvatar-Page.


[122] Every9D-21M: Large-Scale Real-World 9D Canonicalization of Everyday Objects cs.CVPDF

Leonhard Sommer, Emil Akopyan, Adam Kortylewski

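TL;DR: This paper presents Every9D-21M, a large-scale real-world 9D pose dataset with 9D pose annotations for 21.8M images from 109K object-centric videos spanning 700 everyday object categories. It is built by reconstructing object point clouds via multi-view geometry and aligning similar instances into a shared canonical coordinate frame, and it is two orders of magnitude larger than existing real-world 9D pose benchmarks in both image and category count.

Details

Motivation: Estimating the 9D pose of everyday objects from a single real-world image remains challenging, largely because large-scale supervision is lacking. Existing datasets either rely heavily on synthetic renderings or offer limited coverage of real-world objects.

Result: Training on Every9D-21M improves performance on the ImageNet3D and PASCAL3D+ benchmarks and generalizes to HANDAL better than training on ImageNet3D. The dataset also serves as a benchmark for 9D pose foundation models, with dedicated training and evaluation splits.

Insight: The novelty is the use of object-centric videos: through multi-view geometric reconstruction and cross-instance alignment, canonical poses are manually annotated for only a very small set of reference objects, propagated at scale to the remaining instances, and verified from multiple viewpoints. Cross-category orientation rules are also introduced to induce category-level symmetries, enabling symmetry-aware evaluation.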

Abstract: Estimating the 9D pose of everyday objects from a single real-world image remains challenging. This is largely due to the lack of large-scale supervision. Most existing datasets either rely heavily on synthetic renderings or provide limited coverage of real-world objects: the largest real-world 9D pose dataset to date contains only 17K annotated objects across 9 categories. We address this gap with Every9D-21M, a dataset of 9D pose annotations for 21.8M real-world images from 109K object-centric videos spanning 700 everyday object categories, two orders of magnitude larger than prior real-world 9D pose benchmarks in both image and category count. To achieve this scale, we leverage object-centric videos by reconstructing object-level point clouds via multi-view geometry and aligning similar instances into a shared canonical coordinate frame. Canonical poses are manually annotated for only a small set of reference objects (fewer than 0.01% of all images) and propagated to the remaining instances via cross-instance alignment. All propagated canonical poses are then verified from multiple viewpoints. We further introduce cross-category orientation rules that induce category-level symmetries, enabling symmetry-aware evaluation. Beyond establishing dedicated training and evaluation splits as a benchmark for 9D pose foundation models, we show that training on Every9D-21M improves performance on ImageNet3D and PASCAL3D+, and generalizes to HANDAL substantially better than training on ImageNet3D. Data and code are available at https://github.com/GenIntel/Every9D.


[123] Transfer learning RGB models to hyperspectral images with trainable tensor decompositions cs.CVPDF

Mariette Schönfeld, Laurens Devos, Wannes Meert, Hendrik Blockeel

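TL;DR: This paper proposes transferring pretrained RGB models to hyperspectral images via trainable tensor decompositions. Convolutional filters are decomposed into spatial and spectral components, and the spectral components are replaced to fit the hyperspectral channels, preserving the image information and the spatial patterns of the original filters.

Details

Motivation: Existing transfer learning methods assume 3-channel input images and are therefore incompatible with multi- or hyperspectral images; current workarounds sacrifice information in either the image or the model. This work aims to resolve that incompatibility.

Result: Experiments on several hyperspectral datasets show that the method is more accurate and robust than other hyperspectral transfer learning methods.

Insight: The novelty is the use of partially trainable tensor decompositions to separate the spatial and spectral components of pretrained filters; replacing the spectral components extends the channel dimension, so transfer learning retains spatial information while adapting to hyperspectral data.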

Abstract: Transfer learning makes it possible to use large vision networks on a variety of domains, by specializing their models’ general filters to new tasks. However, these networks assume the input images to have 3 input channels, making them incompatible with multi- or hyperspectral images. Current approaches that mitigate this incompatibility sacrifice information in either the image, or the model. This work proposes a novel approach that preserves the image and spatial information present in the model by using partially trainable tensor decompositions. We create such decompositions of pretrained convolutional filters, separating the filters into spatial and spectral components. The spectral components are then replaced with trainable components of higher channel dimensionality. This creates hyperspectral filters that can specialize to new datasets, while retaining the spatial patterns of the original filter. Experiments on a variety of hyperspectral datasets show that our approach is more accurate and robust than other hyperspectral transfer learning methods.
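
The separation into spatial and spectral components can be pictured with a low-rank factorization in which a filter of shape channels x height x width is the sum over rank terms of a spectral vector times a spatial pattern. The sketch below only shows the recomposition step and how swapping the spectral factor changes the channel count; it assumes the factors are already available and does not reproduce the paper's decomposition or training procedure. All names and values are hypothetical.

```python
def compose_filter(spectral, spatial):
    """Rebuild a filter [C][H][W] from spectral factors [R][C] and spatial factors [R][H][W]."""
    rank = len(spatial)
    channels = len(spectral[0])
    height, width = len(spatial[0]), len(spatial[0][0])
    return [
        [
            [sum(spectral[r][c] * spatial[r][i][j] for r in range(rank)) for j in range(width)]
            for i in range(height)
        ]
        for c in range(channels)
    ]


# One pretrained spatial pattern (kept fixed) with its original 3-channel spectral factor.
spatial = [[[1.0, 0.0], [0.0, -1.0]]]
spectral_rgb = [[1.0, 2.0, 3.0]]
rgb_filter = compose_filter(spectral_rgb, spatial)

# Replace the spectral factor with a new 5-channel one that would be trained on the target data.
spectral_hyper = [[0.5, 0.5, 0.5, 0.5, 0.5]]  # hypothetical initial values
hyper_filter = compose_filter(spectral_hyper, spatial)
```

Every channel of `hyper_filter` is a scaled copy of the same spatial pattern, which is the sense in which the spatial structure of the original filter is retained while the channel dimension grows.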


[124] Toward Semantic-Agnostic and Shape-Aware Vision-Language Segmentation Models cs.CVPDF

Corentin Seutin, Mohamed Amine Ettaki, Michaël Clément, Pierrick Coupé, Rémi Giraud

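TL;DR: This paper proposes a new segmentation paradigm, SANSA (Semantic-Agnostic aNd Shape-Aware), in which vision-language segmentation models must segment from non-semantic textual descriptions alone, reducing reliance on high-level semantic categories and strengthening reasoning about intrinsic visual properties such as shape and geometry.

Details

Motivation: Existing vision-language segmentation models rely too heavily on high-level semantic categories, which limits their ability to reason about intrinsic visual properties such as shape, geometry, and texture that are essential in many real-world applications.

Result: Experiments show that models finetuned on SANSA prompts achieve up to a 20% mIoU improvement on the new segmentation task over pretrained SOTA models while maintaining strong performance on standard semantic prompts.

Insight: The novelty is the semantic-agnostic segmentation paradigm, with non-semantic text prompts generated by dictionary-constrained or example-guided strategies. The results highlight the importance of low- and mid-level visual reasoning for the generalization and controllability of these models.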

Abstract: Vision-language segmentation models have recently achieved strong performance by leveraging high-level semantic object categories expressed in natural language. However, this semantic dependence limits their ability to reason about intrinsic visual properties such as shape, geometry, or texture, which are essential in many real-world applications. In this work, we introduce Semantic-Agnostic aNd Shape-Aware (SANSA) segmentation, a new paradigm that requires segmentation models to operate solely from non-semantic textual descriptions. To this end, we propose two strategies to generate SANSA segmentation prompts based on either dictionary constraints or example guidance, both generating semantic-agnostic textual descriptions. These prompts are then used to finetune segmentation models under semantic-agnostic supervision. Experiments show that finetuning on SANSA prompts yields up to a 20% mIoU improvement on this new segmentation task, compared to pretrained state-of-the-art models, while maintaining strong performance on standard semantic prompts. These results highlight the importance of low- and mid-level visual reasoning for improving the generalization and controllability of vision-language segmentation models.


[125] Sketch2Motion: Text-driven 2D Sketch to 3D Animation via Diffusion-guided Skeleton Optimization cs.CV | cs.GRPDF

Gaurav Rai, Ojaswa Sharma

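TL;DR: This paper proposes Sketch2Motion, a framework that turns 2D hand-drawn sketches into text-driven 3D animation through diffusion-guided skeleton optimization. It combines classical character animation pipelines with deep generative priors, using motion-aware score-distillation sampling (MoSDS) and physics-inspired constraints to produce realistic, semantically consistent, temporally coherent animation for biped, quadruped, and non-living articulated characters.

Details

Motivation: Animating 2D hand-drawn sketches is effective for visual communication but poses challenges in occlusion handling and motion mapping, and existing methods are mostly limited to specific motion types (such as bipedal movement and facial expressions). This work aims at a general framework that generates diverse 3D animation from 2D sketches and tackles the complexity of motion estimation.

Result: Experiments show that the method produces temporally coherent, text-aligned animations and outperforms baseline motion transfer methods that lack generative priors or explicit physical constraints.

Insight: Contributions include: 1) skeleton optimization guided by a diffusion model (via MoSDS), with no paired motion data required; 2) physics-inspired smoothness, topological, and contact constraints that stabilize optimization; 3) an integrated spring-mass simulator that adds secondary motion effects for greater realism. The framework is general, differentiable, and modular.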

Abstract: Animation of 2D hand-drawn sketches provides an effective medium for visual communication. However, these sketches pose challenges, particularly in handling occlusions and accurately mapping motion. While 3D animation naturally addresses these challenges, estimating 3D motion remains a very complex task. Recent approaches to converting 2D sketches to 3D animations have mainly focused on specific types of motion, such as bipedal movements and facial expressions. We propose Sketch2Motion, a diffusion-guided framework for skeleton-based motion synthesis that combines classical character animation pipelines with deep generative priors. Our method represents motion using skeletal transformations, which are propagated to mesh deformations via linear blend skinning. To guide the resulting animation toward realistic and semantically meaningful motion, we integrate a text-to-video diffusion model via motion-aware score-distillation sampling (MoSDS), enabling optimization without paired motion data. Additionally, we apply physics-inspired smoothness, topological, and contact constraints to stabilize optimization and preserve motion plausibility. Further, we integrate a spring-mass simulator to introduce secondary motion effects. The proposed framework is generalized, fully differentiable, modular, and compatible with biped, quadruped, and non-living articulated characters. Experiments demonstrate that our approach produces temporally coherent, text-aligned animations that outperform baseline motion transfer methods that lack generative priors or explicit physical constraints. We will make our code and dataset publicly available.
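
Linear blend skinning, which the abstract uses to propagate skeletal transformations to the mesh, moves each vertex to a weighted average of where each bone's rigid transform would send it. The following is a minimal 2D sketch of that standard formula; the helper names and the 2D simplification are choices made for illustration and are not taken from Sketch2Motion.

```python
from math import cos, sin


def bone_transform(angle, tx, ty):
    """2D rigid transform as (a, b, tx, c, d, ty): rotate by `angle`, then translate."""
    return (cos(angle), -sin(angle), tx, sin(angle), cos(angle), ty)


def skin_vertex(vertex, weights, transforms):
    """Linear blend skinning: v' = sum_b w_b * (R_b v + t_b), with weights summing to 1."""
    x, y = vertex
    out_x = out_y = 0.0
    for w, (a, b, tx, c, d, ty) in zip(weights, transforms):
        out_x += w * (a * x + b * y + tx)
        out_y += w * (c * x + d * y + ty)
    return out_x, out_y


# A vertex influenced equally by a static bone and a bone translated by 2 units along x.
bones = [bone_transform(0.0, 0.0, 0.0), bone_transform(0.0, 2.0, 0.0)]
skinned = skin_vertex((1.0, 1.0), weights=[0.5, 0.5], transforms=bones)
```

Since the output is a smooth function of the bone parameters, gradients from an image- or video-space loss can flow back to the skeleton, which fits the abstract's description of the framework as fully differentiable.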


[126] EgoRelight: Egocentric Human Capture and Illumination Recovery for Relightable and Photoreal Avatar Rendering cs.CVPDF

Jianchun Chen, Yinda Zhang, Rohit Pandey, Thabo Beeler, Marc Habermann

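TL;DR: EgoRelight is an immersive telepresence framework for Mixed Reality (MR) headsets that, from a single head-mounted display (HMD), simultaneously captures full-body human motion, synthesizes photorealistic and relightable appearance, and estimates high dynamic range (HDR) environment maps.

Details

Motivation: Existing methods treat human motion capture, relighting, and environment understanding as isolated problems, either driving avatars with baked-in lighting or relying on complex studio setups. This work provides a unified solution from the constrained HMD viewpoint so that avatars blend seamlessly and realistically into real or virtual environments.

Result: The framework's individual components and the integrated system significantly outperform state-of-the-art (SOTA) baselines in geometric accuracy, rendering quality, and relighting fidelity.

Insight: Contributions include: 1) extracting dense depth maps from the headset's down-facing stereo cameras as geometric control signals; 2) a novel neural appearance model that learns view-dependent specular and view-independent diffuse shading separately and uses a specialized ray-sampling strategy to generalize to unseen illumination; 3) a test-time inverse rendering process that matches the pretrained avatar's appearance to live egocentric camera observations to recover an HDR environment map, integrating the avatar seamlessly into the physical world.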

Abstract: Mixed Reality (MR) headsets promise a future of immersive telepresence where virtual humans blend indistinguishably into real or virtual surroundings. Achieving this vision requires a method for capturing a user’s motion, estimating appearance under novel lighting, and understanding the environment - all from the constrained viewpoint of a head-mounted display (HMD). Existing approaches treat these as isolated problems: they either focus on driving avatars with baked-in lighting or rely on studio setups for relighting. In this paper, we present EgoRelight, a holistic framework for egocentric telepresence that simultaneously captures full-body human performance, synthesizes photorealistic and relightable appearance, and estimates high dynamic range (HDR) environment maps from a single HMD. First, to ensure motion and surface reconstruction, we propose an egocentric perception module that leverages stereo down-facing cameras to extract dense depth maps, which serve as geometric control signals to drive a mesh-based avatar. Second, we introduce a novel neural appearance model that learns to synthesize view-dependent specular and view-independent diffuse shading separately. By employing a specialized ray-sampling strategy, our model generalizes to unseen illumination without relying on restrictive analytical BRDF priors. Third, we enable seamless avatar integration into the physical world via a test-time inverse rendering process, which recovers an HDR environment map by matching the pre-trained avatar’s appearance to live egocentric camera observations. We demonstrate our system through a social telepresence application, where remote users are coherently relit according to their physical environment. Extensive experiments show that our components and the integrated system significantly outperform state-of-the-art baselines in geometric accuracy and rendering as well as relighting fidelity.


[127] VITAL: Visual-Semantic Dual Supervision for Enhanced and Interpretable Latent Reasoning in Medical MLLMs cs.CV | cs.AIPDF

Qiaoru Li, Shaotian Liang, Jintao Chen, Haoran Sun, Yuxiang Cai

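TL;DR: VITAL is a latent-space reasoning framework for medical multimodal large language models that strengthens reasoning and interpretability through visual-semantic dual supervision. An auxiliary text decoder reconstructs reasoning chains and a visual projector regresses ROI features; both modules are discarded at inference for zero overhead yet still support post-hoc dual interpretability. On 7 benchmarks, VITAL substantially outperforms the backbone, all latent reasoning baselines, and medical MLLMs trained on larger data, reaching state-of-the-art results competitive with trillion-parameter proprietary models.

Details

Motivation: Existing latent reasoning methods suffer from modality collapse, insufficient visual supervision, and train-inference mismatch, and their opaque latent states offer no interpretability, a critical shortcoming in clinical applications.

Result: Trained on a 61K dataset spanning 9 imaging modalities and evaluated on 7 benchmarks, VITAL consistently and substantially outperforms the backbone, all latent reasoning baselines, and medical MLLMs trained on larger data, achieving state-of-the-art results competitive with trillion-parameter proprietary models.

Insight: The novelty is a visual-semantic dual supervision mechanism that tackles modality collapse and interpretability through auxiliary text decoding and visual feature regression. The auxiliary modules are discarded at inference for zero overhead and can be re-attached post hoc to provide textual and visual explanations of the reasoning.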

Abstract: Latent reasoning enables reasoning over continuous hidden states rather than explicit tokens, avoiding the language bottleneck and inference overhead of chain-of-thought for medical VQA. However, existing methods suffer from modality collapse, insufficient visual supervision, and train-inference mismatch. Moreover, their opaque latent states offer no interpretability, which is critical in clinical applications. We propose VITAL, a latent-space reasoning framework for medical MLLMs with visual-semantic dual supervision: an auxiliary text decoder reconstructs reasoning chains from latent states, while a visual projector regresses ROI features from a frozen, independent medical vision encoder. Both modules are discarded at inference with zero overhead, yet can be re-attached post-hoc for dual interpretability, providing textual and visual explanations of the reasoning process without sacrificing efficiency. We construct a 61K dataset spanning 9 imaging modalities, exceeding prior medical visual latent reasoning datasets by an order of magnitude. Experiments on 7 benchmarks show that VITAL consistently and substantially outperforms the backbone, all latent reasoning baselines, and medical MLLMs trained on far larger data, achieving state-of-the-art results competitive with trillion-parameter proprietary models.


[128] Adaptive Temporal Gating of Longitudinal Magnetic Resonance Imaging for Alzheimer’s Prediction cs.CVPDF

Alireza Moayedikia, Sara Fin, Alicia Troncoso Lora, Uffe Kock Wiil

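TL;DR: This paper proposes TAF-Net, a hybrid CNN-Transformer architecture that predicts conversion from mild cognitive impairment (MCI) to Alzheimer’s disease (AD) from longitudinal 3D MRI scans. Its core is a temporal fusion module governed by an adaptive temporal gate, which learns patient-specific weights to synthesize three spatiotemporal representations. On the three-year prediction task in the ADNI cohort, the method achieves the best discriminative performance using structural MRI alone, significantly outperforming baselines and approaching methods that require multimodal data.

Details

Motivation: Current deep learning predictors rely mainly on cross-sectional structural MRI and overlook the prognostic value of patient-specific anatomical trajectories. The paper models paired longitudinal MRI scans to capture these dynamics and improve MCI-to-AD conversion prediction.

Result: On three-year MCI-to-AD conversion prediction in the Alzheimer’s Disease Neuroimaging Initiative (ADNI) cohort, TAF-Net achieves the highest discriminative performance among all evaluated methods using only structural MRI, significantly outperforming the strongest baseline and approaching methods that require PET, CSF, or genetic data. Ablation studies show that longitudinal fusion reduces predictive variance by 48% compared with single-timepoint evaluation.

Insight: The novelty is an adaptive temporal gating mechanism that fuses longitudinal MRI data and learns patient-specific spatiotemporal representations. Using structural MRI alone, the method approaches multimodal performance and shows strong data efficiency. Interpretability analyses indicate that its attention aligns with known AD pathology regions (the medial temporal lobe and ventricles), and that the gate prioritizes explicit volumetric change, which correlates strongly and positively with conversion risk.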

Abstract: Predicting conversion from Mild Cognitive Impairment (MCI) to Alzheimer’s Disease (AD) is critical for early intervention. Current deep learning paradigms predominantly rely on cross-sectional structural MRI, neglecting prognostic value in patient-specific anatomical trajectories. We introduce the Temporal Adaptive Fusion Network (TAF-Net), a hybrid CNN-Transformer architecture that models paired longitudinal 3D MRI scans. Central to TAF-Net is a Temporal Fusion Module governed by an Adaptive Temporal Gate, which learns patient-specific weightings to synthesize three spatiotemporal representations: explicit structural change, region-to-region temporal cross-attention, and bilateral feature concatenation. Evaluated on the Alzheimer’s Disease Neuroimaging Initiative cohort for three-year MCI-to-AD conversion prediction, TAF-Net achieved the highest discriminative performance among all evaluated methods using only structural MRI, significantly outperforming the strongest baseline and approaching multimodal methods requiring PET, CSF, or genetic data. The architecture exhibited exceptional data efficiency, matching baseline performance with a fraction of training data. Ablation studies demonstrate that longitudinal fusion improves discrimination while reducing predictive variance by 48% compared to single-timepoint evaluation. Interpretability analyses reveal spatial attention aligned with established AD pathology in the medial temporal lobe and ventricles, while the gating mechanism prioritizes explicit volumetric change with strong positive correlation to conversion risk.
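
The gating idea in this abstract can be illustrated with a small sketch. The abstract does not specify how the Adaptive Temporal Gate computes its weights, so the softmax over three logits below is an assumption, and the feature values, the logits, and the names `softmax` and `gated_fusion` are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def gated_fusion(branches, gate_logits):
    """Blend equally sized branch feature vectors using softmax gate weights."""
    weights = softmax(gate_logits)
    dim = len(branches[0])
    fused = [sum(w * b[i] for w, b in zip(weights, branches)) for i in range(dim)]
    return fused, weights

# Hypothetical 4-d features for the three representations named in the abstract.
structural_change = [0.9, 0.1, 0.4, 0.0]
cross_attention = [0.2, 0.8, 0.1, 0.3]
concatenation = [0.5, 0.5, 0.5, 0.5]

# In a trained model the logits would be predicted per patient; these are made up.
fused, weights = gated_fusion(
    [structural_change, cross_attention, concatenation],
    gate_logits=[2.0, 0.5, 0.1],
)
```

Because the weights sum to one, the fused vector stays on the same scale as the branch features, and a large logit for the structural-change branch mirrors the reported tendency of the gate to favor explicit volumetric change.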


[129] BiasEdit: A Training-Free Bias-Detect-and-Edit Framework for Learning Fair Visual Classifiers cs.CV | cs.AIPDF

Jungwook Seo, Yoonsik Park, Changmin Lee, Sungyong Baik

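TL;DR: This paper proposes BiasEdit, a modular, training-free framework that detects and edits bias in visual data. It automatically discovers unknown bias attributes by analyzing visual-linguistic representations, then uses text-guided image editing to build a debiased dataset for training fair image classifiers.

Details

Motivation: Web visual data contain spurious correlations and social biases. Once neural networks learn these biases, they reinforce unfairness in web services and create a vicious cycle.

Result: BiasEdit achieves state-of-the-art debiasing performance even on fully biased datasets, without manual annotations or synthetic mixing, using only off-the-shelf vision-language and editing models.

Insight: The novelty lies in automatically detecting unknown bias attributes through statistical dependence and mutual information analysis, and in combining this with text-guided image editing to generate realistic bias-conflict samples, which offers a new route to unsupervised debiasing.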

Abstract: Visual data from the Web power image classifiers, which often underpin many web services, such as recommendation and content moderation. However, the raw Web data often contain spurious correlations and social biases, and neural networks are known for their tendency to learn biases present in data. This can reinforce unfairness in web services and the web data, leading to a vicious cycle. In the context of image classification, networks learn bias attributes for a specific class when a majority of images contain the same attribute only for a given class. Hence, training a fair and debiased classifier from a biased dataset demands handling an imbalance between a majority of images with bias attributes (bias-aligned samples) and a minority without (bias-conflict samples). In this work, we introduce BiasEdit, a modular framework that automatically detects bias attributes from the original dataset and edits them to construct a debiased dataset. Specifically, BiasEdit first detects unknown bias attributes via statistical dependence and mutual information analysis of visual-linguistic representations, and then explicitly edits those attributes using text-guided image editing to generate realistic bias-conflict samples. Unlike prior works that assume known bias attributes or rely on synthetic mixing, our method operates without manual annotations and can leverage off-the-shelf vision-language and editing models. BiasEdit addresses a fundamental challenge in Web-sourced visual AI, mitigating dataset-induced bias and achieving state-of-the-art debiasing performance even when training data are fully biased.
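
To make the mutual-information step concrete, here is a minimal sketch that scores how strongly a discrete attribute depends on the class label. It is an illustration only: the abstract analyzes visual-linguistic representations rather than hand-written attribute tags, and the function name and toy labels are hypothetical.

```python
import math
from collections import Counter

def mutual_information(labels, attributes):
    """Mutual information (in bits) between two equally long discrete sequences."""
    n = len(labels)
    joint = Counter(zip(labels, attributes))
    label_counts = Counter(labels)
    attr_counts = Counter(attributes)
    mi = 0.0
    for (y, a), count in joint.items():
        p_joint = count / n
        p_indep = (label_counts[y] / n) * (attr_counts[a] / n)
        mi += p_joint * math.log2(p_joint / p_indep)
    return mi

labels = ["bird", "bird", "plane", "plane"]

# Fully biased toy data: the background attribute determines the class.
biased = mutual_information(labels, ["water", "water", "sky", "sky"])

# Unbiased toy data: the background attribute is independent of the class.
unbiased = mutual_information(labels, ["water", "sky", "water", "sky"])
```

An attribute with high mutual information relative to the class label is a candidate bias attribute, while a value near zero suggests the attribute carries no class shortcut.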


[130] REVEAL: Reference-Grounded Reasoning for Multimodal Manipulation Detection cs.CVPDF

Jun Zhou, Bingwen Hu, Yaxiong Wang, Zhedong Zheng, Yongzhen Wang

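TL;DR: The paper proposes REVEAL, a framework that recasts multimodal manipulation detection as reference-grounded verification: the authenticity of a query image-text pair is assessed by comparing it against retrieved authentic evidence. A difference-aware fusion mechanism captures fine-grained discrepancies, and a task-decoupled Mixture-of-Experts architecture jointly performs instance-level detection and fine-grained grounding. Experiments show that REVEAL significantly outperforms existing methods and supports training-free domain adaptation by updating the reference library.

Details

Motivation: Existing methods typically memorize isolated forgery artifacts and struggle with imperceptible manipulation traces or domain shifts. Inspired by human comparative reasoning, the paper verifies content against authentic reference evidence to detect multimodal forgeries more robustly.

Result: Backed by a large-scale reference library of 170K authentic news image-text pairs, REVEAL significantly outperforms state-of-the-art methods across multiple benchmarks and demonstrates training-free domain adaptation simply by updating the reference library.

Insight: Contributions include recasting the task as a reference-grounded verification paradigm, building a large-scale authentic reference library, designing a difference-aware fusion mechanism, and adopting a task-decoupled MoE architecture to ease optimization conflicts between heterogeneous objectives. Together they offer a practical solution for detecting evolving misinformation.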

Abstract: Multimodal manipulation detection aims to simultaneously identify forged image–text pairs and localize tampered regions, yet existing methods typically rely on memorizing isolated artifacts and struggle with imperceptible manipulation traces or domain shifts. Inspired by human comparative reasoning, we reformulate this task as a reference-grounded verification problem, where authenticity is assessed by comparing a query against retrieved authentic evidence. We propose REVEAL (Reference-Enabled Verification for Evidence Analysis and Localization), a framework explicitly designed for this comparative paradigm. To support this paradigm, we construct a large-scale reference library comprising 170K authentic news image–text pairs featuring over 40K public figures. Technically, REVEAL employs a difference-aware fusion mechanism to capture fine-grained discrepancies between the query and retrieved evidence. Furthermore, we introduce a task-decoupled Mixture-of-Experts (MoE) architecture to jointly execute instance-level detection and fine-grained grounding, effectively mitigating optimization conflicts between these heterogeneous objectives. Extensive experiments demonstrate that REVEAL significantly outperforms state-of-the-art methods, and notably enables training-free domain adaptation by simply updating the reference library, offering a robust and practical solution for detecting evolving misinformation. Code is available at https://anonymous.4open.science/r/REVEAL-Reference-A006.


[131] DriveWAM: Video Generative Priors Enable Scalable World-Action Modeling for Autonomous Driving cs.CVPDF

Chen Shi, Jinrui Xu, Shaoshuai Shi, Kehua Sheng, Bo Zhang

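TL;DR: DriveWAM is a world-action model for end-to-end autonomous driving that adapts a pretrained video diffusion transformer into an autoregressive video-action policy handling video and action streams in a unified way. It introduces scene-evolving driving guidance and a selective KV memory mechanism to integrate high-level semantic intent and keep long-horizon rollout bounded. It shows strong planning performance on the NAVSIM and PhysicalAI-Autonomous-Vehicles benchmarks, and a data-scaling study from 4k to 100k driving clips demonstrates its scaling potential.

Details

Motivation: Existing foundation models are pretrained mainly on static image-text pairs and lack the temporal dynamics and motion priors that driving requires. The paper uses the spatiotemporal priors captured by video generative models to build a scalable driving world-action model and improve planning in end-to-end autonomous driving.

Result: DriveWAM achieves strong planning performance on the NAVSIM and PhysicalAI-Autonomous-Vehicles benchmarks. A data-scaling study from 4k to 100k driving clips further confirms the scaling potential of world-action modeling for end-to-end autonomous driving.

Insight: Contributions include adapting a pretrained video diffusion model into an autoregressive action policy with video and action unified in one temporal token sequence; scene-evolving driving guidance from a frozen VLM that supplies semantic intent; and bounded, modality-aware memory pools maintained through relevance-redundancy cache selection to control long-horizon rollout. This suggests a new way to use large-scale video generative priors for action decisions.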

Abstract: Pretrained foundation models have become an important basis for end-to-end autonomous driving. In contrast to vision-language models pretrained primarily on static image-text pairs, video generative models capture temporal dynamics and motion priors that are naturally suited for driving. We present DriveWAM, a driving world-action model that adapts a pretrained video diffusion transformer into an autoregressive video-action policy. DriveWAM organizes video and action streams into a unified temporal token sequence and trains them under a joint flow-matching objective, preserving the pretrained video-generation architecture while adapting its large-scale video priors to action generation. To incorporate high-level scene understanding, we introduce scene-evolving driving guidance, where a frozen VLM produces chunk-specific semantic intent to guide video-action generation. To keep long-horizon rollout bounded, we further introduce selective KV memory, which maintains bounded modality-aware video and action memory pools through relevance-redundancy cache selection at inference time. Experiments on NAVSIM and the PhysicalAI-Autonomous-Vehicles benchmark show that DriveWAM achieves strong planning performance, and a data-scaling study from 4k to 100k driving clips further confirms the scaling potential of world-action modeling for end-to-end autonomous driving.
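
The abstract names relevance-redundancy cache selection without giving the scoring rule. The sketch below assumes a greedy, maximal-marginal-relevance style rule (relevance to the current query minus similarity to entries already kept); that rule, the cosine similarity, the `redundancy_weight` parameter, and the toy vectors are all assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def select_cache(keys, query, budget, redundancy_weight=1.0):
    """Greedily keep at most `budget` cache entries, returned as indices."""
    selected = []
    remaining = list(range(len(keys)))

    def score(i):
        relevance = cosine(keys[i], query)
        # Redundancy is the highest similarity to anything already kept.
        redundancy = max((cosine(keys[i], keys[j]) for j in selected), default=0.0)
        return relevance - redundancy_weight * redundancy

    while remaining and len(selected) < budget:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy cache: entries 0 and 1 are near-duplicates, entry 2 is different.
keys = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
query = [1.0, 0.3]

diverse = select_cache(keys, query, budget=2, redundancy_weight=1.0)
relevance_only = select_cache(keys, query, budget=2, redundancy_weight=0.0)
```

With the redundancy penalty active, the second slot goes to the dissimilar entry rather than the near-duplicate, which is how a fixed-size memory pool can stay informative over a long rollout.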


[132] SSR3D-LLM: Structured Spatial Reasoning via Latent Steps for Fine-Grained Grounding in Unified 3D-LLMs cs.CV | cs.AIPDF

Jiawei Li, Ziyi Liu, Weijie Shi, Long Chen, Jiajie Xu

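TL;DR: This paper proposes SSR3D-LLM, a structured spatial reasoning interface for unified 3D-LLMs that targets fine-grained object grounding in 3D scenes. The LLM writes a sequence of latent spatial reasoning steps and memory tokens, and a geometry-aware scorer refines the candidate ranking step by step, going beyond existing methods that rely on a single pointer-style decision.

Details

Motivation: Existing unified, instance-centric 3D-LLMs usually ground objects with a single pointer-style decision, which compresses a complex relational instruction into one selection. This is brittle for fine-grained queries in which several same-class candidates must be ruled out using context objects and spatial relations.

Result: On the ReferIt3D, ScanRefer, and Multi3DRef benchmarks, SSR3D-LLM achieves the strongest results among unified 3D-LLM baselines, with substantial gains over the single-pointer QPG baseline on fine-grained grounding and consistent improvements over prior unified 3D-LLMs, while preserving the default language-task route.

Insight: The core novelty is the structured latent spatial reasoning steps, which decompose complex spatial reasoning into a learnable sequence that a geometry-aware scorer refines step by step. Making the reasoning explicit, rather than implicit in a single decision, improves handling of fine-grained spatial relations. Training needs only standard target supervision plus auxiliary referential-cue supervision, and inference requires no extra annotations.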

Abstract: 3D object grounding localizes referred objects in a 3D scene from natural language. Unified instance-centric 3D-LLMs aim to solve grounding together with dialog, QA, and captioning, yet many rely on a single pointer-style grounding decision that compresses a relational instruction into one selection. This is brittle for fine-grained queries where multiple same-class candidates must be ruled out by context objects and spatial relations. We propose Structured Spatial Reasoning 3D-LLM (SSR3D-LLM), a structured grounding interface for unified 3D-LLMs. Given fixed Mask3D object proposals, the LLM writes a sequence of latent spatial reasoning steps and memory tokens from the query, and a geometry-aware scorer reads these latent steps in order to refine candidate rankings step by step with step-length masking. The latent steps are learned from standard benchmark target supervision with auxiliary referential-cue supervision during training, while inference uses only the input query and Mask3D proposals. Across ReferIt3D, ScanRefer, and Multi3DRef, SSR3D-LLM achieves the strongest results among unified 3D-LLM baselines, with substantial gains over the single-pointer QPG baseline on fine-grained grounding and consistent improvements over prior unified 3D-LLMs, while preserving the default language-task route.


[133] GEM: Generative Supervision Helps Embodied Intelligence cs.CVPDF

Ruowen Zhao, Bangguo Li, Zuyan Liu, Yinan Liang, Junliang Ye

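TL;DR: This paper proposes GEM (Generative-supervised Embodied vision-language Model), which pre-trains an embodied vision-language model (VLM) with an integrated depth map generation task to bridge the gap between high-level semantic understanding and low-level spatial and physical knowledge. The model is trained on the newly built large-scale dataset GEM-4M and achieves state-of-the-art results on multiple embodied benchmarks; its deployed action model, GEM-VLA, shows superior task execution in both simulation and real-world evaluations.

Details

Motivation: Existing embodied VLMs built on text-guided pre-training focus on high-level semantics, leaving a significant gap to the low-level spatial and physical knowledge needed to execute embodied tasks. GEM aims to bridge this gap through generative supervision.

Result: GEM achieves state-of-the-art (SOTA) results on multiple embodied benchmarks, and its deployed action model, GEM-VLA, shows markedly superior task execution in both simulation environments and real-world evaluations.

Insight: The novelty is integrating a depth map generation task directly into VLM pre-training, so that jointly training the generative objective improves both semantic understanding and physical operation. The authors also build and release GEM-4M, a large-scale mixed dataset with high-quality depth supervision, as a resource for embodied intelligence research.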

Abstract: Embodied Vision-Language Models (VLMs) have demonstrated impressive performance and generalization in robotics, particularly within Vision-Language-Action frameworks. However, a significant gap remains between the high-level semantic focus of standard text-guided pre-training paradigms and the low-level spatial and physical knowledge critical for execution in embodied environments. In this paper, we introduce GEM, a Generative-supervised Embodied vision-language Model designed to bridge this divide. We propose integrating a depth map generation task directly into the VLM pre-training phase. By training this generative objective jointly with the main model, we observe substantial improvements in embodied intelligence, significantly enhancing both semantic understanding and physical operation capabilities. To support this paradigm, we curate and release GEM-4M, a comprehensive large-scale dataset featuring a mixture of grounding, reasoning, and planning data paired with high-quality depth supervision. Extensive experiments demonstrate that GEM achieves state-of-the-art results across diverse embodied benchmarks. Furthermore, our deployed action model, GEM-VLA, exhibits vastly superior task execution abilities in both simulation environments and real-world evaluations. Code, models, and datasets are available at https://zhaorw02.github.io/GEM/


[134] Deformable Gaussian Occupancy: Decoupling Rigid and Nonrigid Motion with Factorized Distillation cs.CVPDF

Yang Gao, Wuyang Li, Po-Chien Luan, Alexandre Alahi

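TL;DR: This paper proposes DeGO, a deformable Gaussian occupancy framework for understanding dynamic 3D environments, particularly human-centric nonrigid objects. By decoupling rigid and nonrigid motion and combining this with factorized 4D foundation-model distillation, it improves weakly supervised occupancy prediction and achieves SOTA results on the Occ3D-NuScenes benchmark.

Details

Motivation: Existing weakly supervised occupancy prediction frameworks mostly assume rigid-body motion and rely on simple frame-to-frame offsets, so they struggle to capture fine-grained deformation and maintain temporal coherence. This limits dynamic understanding of nonrigid agents such as pedestrians in scenarios like autonomous driving.

Result: On the Occ3D-NuScenes benchmark, the method achieves state-of-the-art performance under weak supervision, with gains of 13.5% on human-centric instances and 10.9% overall.

Insight: The core novelty is decoupling the motion of Gaussian primitives into a rigid (offset-based) part and a nonrigid (deformation-based) part, together with a factorized 4D distillation strategy that transfers cross-camera and cross-frame knowledge from the VGGT foundation model. The resulting foundation-aligned features strengthen temporal consistency and offer a new modeling approach for dynamic scene understanding.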

Abstract: Understanding dynamic 3D environments is essential for safe autonomous driving, particularly when reasoning about human-centric, nonrigid agents. However, existing weakly supervised occupancy prediction frameworks predominantly assume rigid-body motion and rely on simple frame-to-frame offsets, limiting their ability to capture fine-grained deformations and maintain temporal coherence. To address this issue, we propose DeGO, a deformable Gaussian occupancy framework that unifies decoupled Gaussian deformation with factorized 4D foundation-model distillation. DeGO disentangles rigid and nonrigid motion, enabling each Gaussian primitive to evolve through both deformation and offset-based updates. In parallel, a factorized 4D distillation strategy transfers cross-camera and cross-frame knowledge from the VGGT foundation model, producing foundation-aligned features that enhance temporal consistency. Experiments on the Occ3D-NuScenes benchmark demonstrate that our method achieves state-of-the-art performance under weak supervision, delivering 13.5% gains on human-centric instances and 10.9% overall improvements. These results highlight the effectiveness of deformation-aware and foundation-guided occupancy modeling for dynamic scene understanding. The code is publicly available: https://github.com/vita-epfl/DeGO


[135] Mining Multi-Modality Spatio-Temporal Cues for Video Important Person Identification cs.CV | cs.AIPDF

Xiao Wang, Minglei Yang, Bin Yang, Wenke Huang, Zheng Wang

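TL;DR: The paper introduces the Video Important Person identification task and builds Temporal-VIP, a large-scale dataset of 9,249 video segments. To address the Temporal Importance Shift caused by existing methods ignoring spatio-temporal information in video, the authors propose the VIP-Net framework: a Social Cue Encoder extracts multi-modal spatio-temporal cues, a Temporal Importance Rectifier performs hierarchical fusion and cross-modal alignment, and a final stage ranks the individuals.

Details

Motivation: Existing methods focus on static images and immediate visual cues and overlook the rich spatio-temporal information in videos. This leads to Temporal Importance Shift, in which individuals deemed important in early frames may lose importance once the full temporal context is considered.

Result: On the proposed Temporal-VIP dataset, VIP-Net reaches 67.3% accuracy, significantly above the 37.5%-53.9% of the best existing models, and its feature-guided LLM refinement yields textual rationales with a mean similarity of 0.63 to the ground-truth annotations.

Insight: The paper formally defines the Video Important Person identification task and builds the corresponding dataset and evaluation benchmark. VIP-Net mitigates Temporal Importance Shift through multi-modal spatio-temporal cue extraction and hierarchical fusion, and uses an LLM to generate interpretable textual rationales.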

Abstract: Identifying key individuals in video scenes is essential for applications such as automated video editing and intelligent surveillance. Current methods primarily focus on static images and immediate visual cues, overlooking the rich spatio-temporal information in videos. This leads to the phenomenon of Temporal Importance Shift (TIS), wherein individuals deemed significant in early frames may be demoted as the entire temporal context is considered. To address this, we introduce the Video Important Person (VIP) identification task, aimed at automatically identifying the most influential individuals in videos while providing textual rationales. We present Temporal-VIP, a large-scale rationale-annotated dataset consisting of 9,249 video segments across 11 categories with aligned importance rationales. To mitigate TIS, we develop the VIP-Net framework, which includes a Social Cue Encoder (SCE) for extracting multi-modal spatio-temporal cues, a Temporal Importance Rectifier (TIR) for hierarchical cue fusion and cross-modal alignment, and VIP Inference for ranking individuals. Experimental results show that VIP-Net achieves 67.3% accuracy, significantly outperforming state-of-the-art models (37.5%-53.9%) and yielding a mean rationale similarity of 0.63 to ground truth through feature-guided LLM refinement. The dataset and code are available at https://huggingface.co/datasets/yml2002/Temporal-VIP.


[136] JECA^2: Judgment-Explanation Consistent Adversarial Attack against Forensic Vision-Language Models cs.CVPDF

Jiachen Qian

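TL;DR: This paper proposes JECA^2, a judgment-explanation consistent adversarial attack against forensic vision-language models (VLMs). In a white-box threat setting, it jointly redirects visual attribution and adjusts the textual explanation so that the model’s tampering judgment and its natural-language explanation are misled together, exposing a consistency vulnerability in such models.

Details

Motivation: Existing adversarial attacks usually only flip the model’s binary judgment (tampered or authentic), while the generated explanation may still reveal tampering cues and contradict the attacked judgment. The paper studies attacks that keep judgment and explanation consistent in order to evaluate the robustness of forensic VLMs more fully.

Result: Experiments on forensic VLM benchmarks show that, in white-box threat settings, JECA^2 exceeds the baselines in both attack success rate and automated judgment-explanation consistency. Transfer to closed-source VLMs is measurable but limited.

Insight: The novelty is the attack objective, presented as the first for forensic VLMs to require that judgment and explanation be misled together, and a joint framework that works on the visual side (Grad-CAM-guided perturbations) and the textual side (prompt-embedding optimization under a token-proximity constraint). The results show that binary detection accuracy alone is an insufficient robustness measure and that explanation consistency should also be evaluated.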

Abstract: Forensic vision-language models (VLMs) have recently been developed to detect image tampering and provide natural-language explanations. However, their robustness against adversarial manipulation remains underexplored. Existing adversarial attacks typically aim to flip the model’s binary judgment, while the accompanying explanation may still reveal forensic cues and contradict the attacked judgment. In this paper, we study judgment-explanation consistent adversarial attacks against forensic VLMs and propose JECA^2, a controlled white-box red-team diagnostic that jointly redirects visual attribution and aligns textual explanations with the target judgment. On the visual side, JECA^2 uses Grad-CAM-guided perturbations to divert attribution from tampered regions toward benign regions. On the textual side, it optimizes prompt embeddings toward authenticity-affirming semantics under a token-proximity constraint. Experiments on forensic VLM benchmarks show that JECA^2 achieves higher attack success and automated judgment-explanation consistency than implemented baselines under white-box threat settings, while transfer to closed-source VLMs remains measurable but limited. Our results highlight a consistency failure mode in explanation-based forensic VLMs and motivate future robustness evaluation beyond binary detection accuracy.


[137] OSP-Next: Efficient High-Quality Video Generation with Sparse Sequence Parallelism, HiF8 Quantization, and Reinforcement Learning cs.CVPDF

Yunyang Ge, Xianyi He, Zezhong Zhang, Bin Lin, Bin Zhu

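TL;DR: OSP-Next is an efficient text-to-video generation model that improves the efficiency of diffusion transformers by integrating sparse attention, parallelism, quantization, and reinforcement learning. It uses a hybrid full-sparse attention architecture with the Skiparse-2D sparse attention mechanism, proposes Sparse Sequence Parallelism (SSP) to cut communication overhead, and combines HiF8 quantization with Mix-GRPO post-training. Experiments show significant speedups on multiple hardware platforms while maintaining high-quality video generation.

Details

Motivation: Diffusion transformers produce strong video quality, but the quadratic cost of full attention limits efficiency. The paper builds an efficient, high-quality text-to-video model by introducing sparse attention, parallelism, and quantization.

Result: On VBench, OSP-Next reaches a total score of 83.73%, surpassing the Wan2.1 baseline. Under the 5-second 720P and 768P settings, it achieves up to 1.64× single-GPU speedup and over 1.52× eight-GPU speedup on NVIDIA H200 GPUs. On an Ascend 950PR, OSP-Next-HiF8 achieves 1.69× and 2.27× speedups at the cost of only a 0.4% drop in VBench total score.

Insight: Contributions include: 1) a hybrid full-sparse attention architecture with Skiparse-2D sparse attention, which exploits locality while staying compatible with FlashAttention kernels; 2) Sparse Sequence Parallelism (SSP), a native parallel strategy for sparse attention that reduces communication volume by 75% relative to Ulysses sequence parallelism; 3) HiF8 quantization, which supports stable joint training with 8-bit quantization and sparse fine-tuning; 4) Mix-GRPO post-training, which improves the sparse model’s performance. Together these enable efficient, high-quality video generation across hardware platforms.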

Abstract: Diffusion Transformers achieve strong video generation quality, but the quadratic cost of full attention limits efficiency. We introduce OSP-Next, an efficient text-to-video generation model that integrates sparse attention, parallelism, quantization, and reinforcement learning. OSP-Next uses a hybrid full-sparse attention architecture, where the sparse component is implemented with Skiparse-2D Attention. This fixed-pattern mechanism applies token-wise and group-wise sparse attention along spatial dimensions, leveraging locality while maintaining native compatibility with FlashAttention kernels. Based on the local equivalence of rearrangement in Skiparse-2D Attention, we further propose Sparse Sequence Parallelism (SSP), which partitions subsequences across ranks and switches sparse patterns through a single All-to-All communication. Compared with Ulysses Sequence Parallelism (SP), SSP provides a native parallel strategy for sparse attention and reduces communication volume by 75%. OSP-Next also incorporates HiF8 quantization to enable stable joint training with 8-bit quantization and sparse fine-tuning, and applies Mix-GRPO post-training to improve the performance of the sparse model. Experiments show that OSP-Next achieves a VBench total score of 83.73%, surpassing the Wan2.1 baseline. Under the 5-second 720P and 5-second 768P settings, OSP-Next achieves up to 1.64× single-GPU speedup and over 1.52× eight-GPU speedup on NVIDIA H200 GPUs. In addition, with only a 0.4% drop in VBench total score, OSP-Next-HiF8 achieves 1.69× and 2.27× speedups under the two settings on a single Ascend 950PR, demonstrating the efficiency and performance of OSP-Next across hardware platforms.


[138] Self-Prophetic Decoding to Unlock Visual Search in LVLMs cs.CVPDF

Zhendong He, Qiyuan Dai, Guanbin Li, Liang Lin, Sibei Yang

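TL;DR: This paper proposes SeProD, a self-prophetic decoding framework that addresses two problems large vision-language models (LVLMs) face in visual search: capability incompatibility after post-training and interference in long-context reasoning. It uses the single-step reasoning ability of the pre-training model to guide the post-training model toward coherent multi-step reasoning through probabilistic sampling, and it works plug-and-play without additional training.

Details

Motivation: In visual search, LVLMs suffer from incompatibility among intrinsic capabilities introduced by post-training and are easily disturbed in long multi-step reasoning contexts, which limits true multimodal reasoning.

Result: Experiments show that SeProD consistently improves multiple visual-search LVLMs across all 12 splits of 4 visual search benchmarks, also improves results on general VQA benchmarks, and adds no computational overhead.

Insight: The core novelty is a self-regulation mechanism plus probability-based prophetic sampling. The former uses the pre-training model’s single-step capabilities to mitigate capability deterioration and long-context interference. The latter provides a probabilistic interface in which the pre-training model acts as a "prophet" and the post-training model selectively accepts its tokens, preserving coherent multi-step reasoning.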

Abstract: Large Vision-Language Models (LVLMs) are rapidly evolving toward true multimodal reasoning, with visual search representing a concrete instantiation of the thinking-with-images paradigm. However, LVLM visual search faces two key challenges: incompatibility among intrinsic capabilities after post-training, and interference in long multi-step reasoning contexts. To address these, we identify two novel insights. First, self-regulation between pre- and post-training LVLMs leverages the intrinsic single-step capabilities of the pre-training model to mitigate capability deterioration and long-context interference. Second, probability-based prophetic sampling, replacing naive prompting, provides a probabilistic interface where the pre-training model acts as a prophet and the post-training model selectively accepts prophetic tokens under its output distribution, preserving coherent multi-step reasoning. Building on these insights, we introduce SeProD, a self-prophetic decoding framework that leverages intrinsic single-step capabilities to enable coherent multi-step reasoning in a training-free, plug-and-play manner. Experiments show that SeProD consistently improves multiple visual-search LVLMs across all 12 splits of 4 visual search benchmarks, as well as across general VQA benchmarks, without added computational overhead, thanks to its parallel prophetic acceptance mechanism.
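
The abstract says the post-training model selectively accepts prophetic tokens under its own output distribution but does not state the acceptance rule. The sketch below assumes the simplest possible rule, a fixed probability threshold with a fallback to the post-training model's own most likely token; the threshold, the function name, and the toy distribution are hypothetical.

```python
def prophetic_step(prophet_token, post_probs, threshold=0.1):
    """Pick the next token and report whether the prophet's proposal was accepted.

    prophet_token: token proposed by the pre-training model (the "prophet").
    post_probs: mapping from token to probability under the post-training model.
    """
    if post_probs.get(prophet_token, 0.0) >= threshold:
        return prophet_token, True
    # Rejected: fall back to the post-training model's own top choice.
    fallback = max(post_probs, key=post_probs.get)
    return fallback, False

# Toy next-token distribution from the post-training model (made-up numbers).
post_probs = {"zoom": 0.55, "crop": 0.30, "answer": 0.15}

accepted = prophetic_step("crop", post_probs)     # plausible under post_probs
rejected = prophetic_step("rotate", post_probs)   # unsupported by post_probs
```

A rule of this shape lets the post-training model keep control of the reasoning trajectory while still adopting prophet suggestions that it already considers plausible.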


[139] Ω-QVLA: Robust Quantization for Vision-Language-Action Models via Composite Rotation and Per-step Scaling cs.CV | cs.LGPDF

Xinyu Wang, Mingze Li, Sicheng Lyu, Dongxiu Liu, Kaicheng Yang

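TL;DR: This paper proposes Omega-QVLA, a training-free post-training quantization framework that is the first to compress both the language backbone and the entire diffusion action head of a vision-language-action model to uniform W4A4 precision, avoiding mixed-precision allocation. It combines a composite SVD-Hadamard rotation with per-step DiT activation scaling to stabilize quantization. On the LIBERO benchmark, it greatly reduces memory footprint while matching or exceeding the task success rate of the FP16 reference models.

Details

Motivation: Vision-language-action models have very large parameter counts, and quantizing their diffusion action heads has been considered unstable. As a result, existing quantization methods compress only part of the model or fall back to mixed precision, which limits on-device deployment. The paper challenges this assumption and targets full, uniform low-precision quantization of VLA models.

Result: On LIBERO, Omega-QVLA compresses Pi 0.5 and GR00T N1.5 to W4A4 with task success rates of 98.0% and 87.8%, matching or exceeding their FP16 references of 97.1% and 87.0%, while reducing the static memory footprint by 71.3%. Real-world manipulation experiments also confirm smooth, accurate manipulation that improves on prior methods.

Insight: The novelty is the first uniform W4A4 quantization of an entire VLA model, including the diffusion action head, enabled by two techniques: a composite rotation that equalizes weight energy and diffuses activation outliers, and per-step activation scaling that absorbs dynamic-range drift across denoising steps. Together they address the core instability of quantization and offer an efficient path to on-device deployment of high-capacity multimodal models.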

Abstract: Vision-Language-Action (VLA) models unify perception, reasoning, and control within a single policy, yet their multi-billion-parameter backbones and diffusion-based action heads make on-device deployment prohibitively expensive. Prior quantization efforts offer only partial solutions, compressing the LLM backbone while leaving the DiT action head at full precision, or resorting to mixed-precision schemes, driven by the belief that uniformly quantizing the action head is inherently unstable. We challenge this assumption with Omega-QVLA, the first training-free post-training quantization framework that compresses both the language backbone and the entire diffusion action head of a VLA model to a uniform W4A4 precision, eliminating the need for mixed-precision allocation. Omega-QVLA combines a composite SVD-Hadamard rotation that equalizes per-channel weight energy while diffusing residual activation outliers with per-step DiT activation scaling quantization that absorbs dynamic-range drift across denoising steps. On LIBERO, Omega-QVLA compresses Pi 0.5 and GR00T N1.5 to W4A4 with 98.0% and 87.8% task success rates, matching or exceeding their FP16 references of 97.1% and 87.0%, while reducing the static memory footprint by 71.3%. Real-world manipulation experiments further confirm smooth, accurate manipulation where prior methods fail. Code is available at https://github.com/UCMP13753/Omega-QVLA.
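
The outlier-diffusion effect of a rotation can be shown on a toy vector. This sketch uses a plain orthonormal Hadamard matrix and symmetric per-tensor 4-bit quantization; it is not the paper's composite SVD-Hadamard rotation or its per-step scaling, and the activation values and helper names are made up for illustration.

```python
import math

def hadamard(n):
    """Orthonormal Hadamard matrix via the Sylvester construction (n a power of two)."""
    h = [[1.0]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    scale = 1.0 / math.sqrt(n)
    return [[v * scale for v in row] for row in h]

def matvec(m, x):
    return [sum(a * b for a, b in zip(row, x)) for row in m]

def transpose(m):
    return [list(col) for col in zip(*m)]

def fake_quant_int4(x):
    """Symmetric per-tensor 4-bit quantize-dequantize on the integer grid [-7, 7]."""
    peak = max(abs(v) for v in x)
    if peak == 0.0:
        return list(x)
    step = peak / 7.0
    return [max(-7, min(7, round(v / step))) * step for v in x]

def squared_error(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))

# Toy activation vector with one large outlier channel (made-up numbers).
x = [8.0, 0.3, -0.2, 0.1, 0.25, -0.15, 0.05, -0.3]
H = hadamard(len(x))

direct = fake_quant_int4(x)                                 # quantize as-is
rotated = matvec(H, x)                                      # spread the outlier
restored = matvec(transpose(H), fake_quant_int4(rotated))   # quantize, rotate back

err_direct = squared_error(x, direct)
err_rotated = squared_error(x, restored)
```

Without rotation, the outlier sets the quantization step and every small channel rounds to zero. After rotation the peak magnitude drops, so the step shrinks, and for this toy vector the reconstruction error is lower. Because the rotation is orthogonal, it can be folded into adjacent weights without changing the full-precision output.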


[140] Bias Leaves a Gradient Trail: Label-Free Bias Identification via Gradient Probes on Concept Decompositions cs.CV | cs.LGPDF

Thomas Vitry, Kieran Edgeworth, Stefan Wermter, Jae Hee Lee

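TL;DR: This paper proposes a bias-label-free, post-hoc method for identifying spuriously correlated concepts in frozen vision models. It extracts interpretable concept vectors from intermediate activations with non-negative matrix factorization and estimates bias from how those concepts interact with backpropagated gradients on misclassified examples, so spurious concepts can be identified and suppressed without retraining.

Details

Motivation: Vision classifiers can exploit spurious correlations and then degrade under distribution shift. Existing bias mitigation and analysis methods usually depend on curated datasets, spurious-attribute or group labels, or retraining, which may be infeasible once a model is deployed or when the relevant bias is unknown.

Result: On Colored MNIST and Waterbirds, the method recovers concepts aligned with the known spurious cue; on CelebA, it surfaces decision-relevant directions that only partially coincide with the annotated gender attribute. Suppressing the top-ranked concepts at inference time, without retraining or parameter updates, improves worst-group accuracy by up to 17.9 percentage points on Waterbirds and 10.4 on CelebA.

Insight: The novelty is a label-free bias identification method built on gradient probes and concept decomposition. It analyzes frozen models post hoc and finds decision-relevant spurious directions that need not coincide with annotated attributes, giving an interpretable and actionable route to model auditing and debiasing.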

Abstract: Vision classifiers can exploit spurious correlations, achieving high in-distribution accuracy yet failing under distribution shift. Existing approaches to bias mitigation and analysis often depend on curated datasets, spurious-attribute or group labels, or retraining, which may be infeasible once a model is deployed or the relevant bias is unknown. We present a bias-label-free, post-hoc method for identifying spurious concepts in frozen vision models, relying only on standard class labels from a held-out audit dataset. For each target class, we collect patches from inputs predicted as that class and apply non-negative matrix factorization to intermediate activations to obtain a bank of interpretable concept vectors. Candidate concepts are then ranked with a bias estimator derived from their interaction with backpropagated gradients on misclassified examples: bias concepts tend to get activated when correcting false negatives and suppressed when correcting false positives. On Colored MNIST and Waterbirds the method recovers concepts aligned with the known spurious cue, and on CelebA it surfaces decision-relevant directions that only partially coincide with the annotated gender attribute; suppressing the top-ranked concepts at inference time improves worst-group accuracy by up to 17.9 percentage points on Waterbirds and 10.4 on CelebA without any retraining or parameter updates. Our method identifies decision-relevant spurious directions that need not coincide with annotated ones, providing both an interpretable auditing tool and an actionable debiasing handle for frozen vision models. Code is available at https://github.com/vitryt/label-free-bias-identification.
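
The ranking heuristic in this abstract (bias concepts are activated when correcting false negatives and suppressed when correcting false positives) can be sketched as follows. The exact estimator is not given in the abstract, so the score below is an assumption: it takes the negative loss gradient as the corrective direction and contrasts its projection onto each concept between the two error types. The concept vectors, gradients, and function names are hypothetical, and the NMF step that would produce the concepts is omitted.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def mean_corrective_push(concept, grads):
    """Average projection of the corrective direction (-gradient) onto a concept."""
    return sum(-dot(g, concept) for g in grads) / len(grads)

def bias_score(concept, fn_grads, fp_grads):
    """High when correcting false negatives raises the concept and correcting
    false positives lowers it."""
    return mean_corrective_push(concept, fn_grads) - mean_corrective_push(concept, fp_grads)

def rank_concepts(concepts, fn_grads, fp_grads):
    """Concept names sorted from most to least bias-like."""
    scores = {name: bias_score(vec, fn_grads, fp_grads) for name, vec in concepts.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical 2-d concept directions, e.g. obtained from NMF of activations.
concepts = {"background": [1.0, 0.0], "shape": [0.0, 1.0]}

# Made-up loss gradients w.r.t. activations on misclassified audit examples.
fn_grads = [[-1.0, 0.1], [-0.8, -0.1]]   # false negatives of the target class
fp_grads = [[0.9, 0.0], [1.1, 0.1]]      # false positives of the target class

ranking = rank_concepts(concepts, fn_grads, fp_grads)
```

In this toy setup the background direction ranks first, since fixing either kind of error mostly means moving along it, while the shape direction is barely involved.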


[141] Personal Visual Memory from Explicit and Implicit Evidence cs.CV | cs.CL | cs.IRPDF

Viet Nguyen, Thao Nguyen, Vishal M. Patel, Yuheng Li

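TL;DR: The paper introduces a visual memory benchmark for the long-term memory of personalized AI agents and a hybrid visual-text architecture, VisualMem, to address existing memory systems’ over-reliance on text and neglect of the personal information carried by images. VisualMem uses a structured visual memory module together with conversational context to resolve identity, ownership, and durable user facts, substantially improving performance on personal visual memory tasks.

Details

Motivation: Existing long-term memory benchmarks and methods are largely text-centric. Even when images are included, the user-specific information is usually recoverable from text alone, so images are reduced to generic captions. This ignores the personal evidence in images, both explicit (such as recurring user-associated entities) and implicit (such as latent user facts inferred from visual or multimodal cues).

Result: On the proposed personal visual memory benchmark, VisualMem substantially outperforms prior memory systems while remaining competitive on standard text-memory benchmarks, indicating both effectiveness and generality.

Insight: The novelty is explicitly distinguishing and modeling explicit and implicit personal evidence in images, and designing a structured visual memory module that uses conversational context instead of simply compressing images into captions. This points toward more complete long-term memory systems for personalized AI.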

Abstract: Long-term memory is increasingly important for personalized AI agents, yet existing benchmarks and methods remain largely text-centric. Even when images are included, the user-specific information needed for later questions is typically recoverable from text alone, and most memory systems reduce image turns to generic captions. Yet images often carry personal information that text rarely states – both explicit evidence, such as recurring user-associated entities, and implicit evidence, such as latent user facts inferred from visual or multimodal cues. We introduce a benchmark for personal visual memory that targets both forms of evidence, and propose VisualMem, a hybrid visual–text architecture that augments a text-memory backend with a structured personal visual memory module. Rather than collapsing images into captions, VisualMem uses conversational context to resolve identity, ownership, and durable user facts. Experiments show that VisualMem substantially outperforms prior memory systems on our benchmark while remaining competitive on standard text-memory benchmarks, indicating that personal visual memory is a distinct and important component of long-term memory for personalized AI agents.


[142] HarmoVid: Relightful Video Portrait Harmonization cs.CVPDF

Jun Myeong Choi, Jae Shin Yoon, Luchao Qi, Roni Sengupta, Joon-Young Lee

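TL;DR: This paper proposes HarmoVid, a video portrait lighting harmonization method that matches the lighting of a foreground video to a target background scene by adjusting shadows, color tone, and illumination intensity. It introduces a lighting deflickering model to stabilize temporal jitter and trains a video diffusion model on real and synthetic video data to produce high-quality, temporally consistent harmonization with physically meaningful lighting.

Details

Motivation: Video lighting harmonization lacks paired annotated data (the same motion recorded under different lighting), and applying existing image harmonization methods frame by frame produces temporal flicker and incoherence.

Result: Experiments show that, compared with prior image-based and video-based harmonization methods, the model performs well on temporal coherence, naturalness, boundary cleanliness, and physical lighting behavior while retaining strong relighting expressiveness.

Insight: Contributions include a lighting deflickering model that stabilizes global and local lighting flicker artifacts, and an asymmetric alpha mask conditioning technique for learning clean boundaries from real videos. Viewed objectively, the method uses synthetic data augmentation and a video diffusion model to address data scarcity and temporal consistency in video harmonization.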

Abstract: We present a method for harmonizing the lighting of a foreground video to match a target background scene, adjusting shadows, color tone, and illumination intensity (relightful harmonization). Unlike images, acquiring labeled data for videos, where identical motions are recorded under different lighting conditions, is practically infeasible and non-scalable. While one way to create such paired data is to apply existing image-based harmonization models frame by frame to a video, the resulting outputs often suffer from significant temporal jitters. We overcome this problem by introducing a novel lighting deflickering model that can stabilize the global and local lighting flickering artifacts. Our video diffusion model learns from these upgraded deflickered data with a volume of real and synthetic videos to generate high-quality video harmonization results. We further propose an asymmetric alpha mask conditioning technique to learn the clean boundaries from real videos. Experiments demonstrate that our model achieves strong temporal coherence, naturalness, cleaner boundaries, and physically meaningful lighting behavior, while maintaining strong relighting expressiveness compared to prior image-based and video-based harmonization methods.


[143] Gamma-World: Generative Multi-Agent World Modeling Beyond Two Players cs.CVPDF

Fangfu Liu, Kai He, Tianchang Shen, Tianshi Cao, Sanja Fidler

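TL;DR: This paper proposes Gamma-World, a generative world model for multi-agent interactive video generation. It uses Simplex Rotary Agent Encoding to give agents distinguishable yet permutation-symmetric representations, and Sparse Hub Attention to reduce the cost of cross-agent attention. By distilling a full-context diffusion teacher into a causal student, the model also achieves real-time, action-responsive generation at 24 FPS.

Details

Motivation: Existing world models for interactive video generation focus mainly on single-agent settings, while many generated environments require multiple agents to interact simultaneously in a shared space. Extending world models to multi-agent scenarios requires a principled design that keeps agents independently controllable and permutation-symmetric, supports efficient inference, and maintains consistency across time and perspectives.

Result: Experiments in multiplayer virtual environments show that the model outperforms slot-based and dense-attention baselines in video fidelity, action controllability, and inter-agent consistency, and generalizes from two to four players without additional training.

Insight: The main contributions are: 1) Simplex Rotary Agent Encoding, a parameter-free method that represents agents as vertices of a regular simplex in rotary angle space, giving scalable and permutation-symmetric agent identity encoding; 2) Sparse Hub Attention, which mediates cross-agent interaction through learnable hub tokens and reduces cross-agent attention cost from quadratic to linear; 3) real-time generation through knowledge distillation, compressing a full-context diffusion model into a causal student model with KV caching.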

Abstract: World models for interactive video generation have largely focused on single-agent settings, where future observations are generated from a single control signal. However, many generated environments require multi-agent interaction: multiple players, robots, or embodied agents act simultaneously within a shared space. Scaling world models to such settings requires a principled multi-agent design: agents should remain independently controllable, permutation-symmetric, and support efficient inference while maintaining consistency across time and perspectives. In this paper, we present our generative multi-agent world model for interactive simulation. It introduces Simplex Rotary Agent Encoding, a parameter-free extension of 3D RoPE that represents agents as vertices of a regular simplex in rotary angle space. This gives each agent a distinct phase while making all agents permutation-equivalent, enabling scalable agent identity without learned per-slot identities or a fixed agent ordering. To avoid dense all-to-all attention across agents, we further propose Sparse Hub Attention, where learnable hub tokens mediate token interaction across agents, reducing cross-agent attention cost from quadratic to linear in the number of agents. For real-time rollout, we distill a full-context diffusion teacher into a causal student that generates temporal blocks sequentially with KV caching, enabling action-responsive generation at 24 FPS. Experiments in multiplayer virtual environments show that our model improves video fidelity, action controllability, and inter-agent consistency over slot-based and dense-attention baselines, while generalizing from two to four players without additional training.
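The quadratic-to-linear claim for Sparse Hub Attention can be made concrete by counting query-key pairs. The sketch below is only a back-of-the-envelope illustration, not the paper's implementation: the function names, the two-pass accounting (hubs read from agents, agents read from hubs), and the hub count are all assumptions introduced here.

```python
def dense_cross_agent_pairs(num_agents, tokens_per_agent):
    """Query-key pairs when every agent's tokens attend to every other agent's tokens."""
    return num_agents * (num_agents - 1) * tokens_per_agent ** 2


def hub_mediated_pairs(num_agents, tokens_per_agent, num_hubs):
    """Query-key pairs when agents exchange information only through hub tokens.

    Assumed accounting: hub tokens read from all agent tokens, then agent
    tokens read back from the hub tokens.
    """
    return 2 * num_agents * tokens_per_agent * num_hubs


# Hypothetical sizes: 100 tokens per agent and 8 hub tokens.
for n in (2, 4, 8):
    print(n, dense_cross_agent_pairs(n, 100), hub_mediated_pairs(n, 100, 8))
```

Under this accounting, doubling the number of agents doubles the hub-mediated cost, while the dense cost grows roughly fourfold.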


[144] From Pixels to Words – Towards Native One-Vision Models at Scale cs.CVPDF

Haiwen Diao, Jiahao Wang, Penghao Wu, Yuhao Dong, Yuwei Niu

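TL;DR: This paper proposes NEO-ov, a native, unified vision foundation model that learns cross-frame and pixel-word correspondence end-to-end, without external encoders, auxiliary adapters, or post-hoc fusion modules. It targets the fragmented pixel signals and scattered early interactions that modular stitching causes in current vision-language models (VLMs), and is competitive on fine-grained visual perception and multi-image/video understanding tasks.

Details

Motivation: Current VLMs typically stitch together separate image encoders and language decoders through multi-stage alignment. This modular framework fragments pixel-level signals across frames and scatters early pixel-word interactions. Meanwhile, native VLMs remain largely unexplored in multi-image understanding, video understanding, and spatial intelligence.

Result: NEO-ov substantially narrows the performance gap to modular counterparts while excelling at fine-grained visual perception, supporting the feasibility and competitiveness of native "one-vision" architectures at scale.

Insight: The core contribution is a native end-to-end architecture that eliminates module boundaries entirely, so that fine-grained, unified spatiotemporal modeling emerges natively inside the model. The paper also provides systematic architectural analyses and detailed training recipes to support subsequent research on native multimodal modeling.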

Abstract: Current vision-language models (VLMs) typically stitch together separate image encoders and language decoders via multi-stage alignment, a modular framework that inevitably fragments pixel-level signals across frames and scatters early pixel-word interactions. In parallel, native VLMs, despite impressive performance on single images, remain largely unexplored in multi-image, video understanding, and spatial intelligence. Hence, we introduce NEO-ov, a native foundation model that learns cross-frame and pixel-word correspondence end-to-end, without any external encoders, auxiliary adapters, or post-hoc fusion. By eliminating module boundaries entirely, NEO-ov enables fine-grained and unified spatiotemporal modeling to emerge natively inside the model. Notably, NEO-ov largely narrows the gap to modular counterparts while excelling at fine-grained visual perception, validating that native “one-vision” architectures are not only feasible but competitive at scale. Beyond empirical performance, we unveil systematic architectural analyses and detailed training recipes to facilitate subsequent native multimodal modeling. Our code and models are publicly available at: https://github.com/EvolvingLMMs-Lab/NEO.


eess.IV [Back]

[145] NL-MambaXCT: Self-Supervised Nested-Learning Mamba for Nomex Honeycomb X-ray CT Defect Classification eess.IV | cs.CVPDF

Ghaleb Aldoboni, Lobna Nassar, Fakhri Karray, Reem Alshamsi

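TL;DR: This paper proposes NL-MambaXCT, a Mamba-based framework that combines self-supervised masked image modeling with a Nested Learning (NL) strategy for defect classification in X-ray CT images of Nomex honeycomb structures. The backbone is a hybrid encoder with RegNet convolutional blocks and Mamba sequence mixing. It is first pretrained with masked image modeling on a large set of unlabeled industrial XCT slices and then fine-tuned on a small labeled set.

Details

Motivation: X-ray CT inspection of Nomex honeycomb structures in aerospace manufacturing still relies heavily on manual interpretation and on supervised models trained with limited labeled data, so an automated, label-efficient defect classification method is needed.

Result: On the held-out test set, the MIM-pretrained NL-MambaXCT model reaches 96.91% accuracy and 96.8% macro F1, exceeding CNN, attention, and single-timescale Mamba baselines by 3.11 to 10.31 percentage points in accuracy.

Insight: The main contribution is combining self-supervised masked image modeling with the fast/slow learning dynamics of Nested Learning (NL). Two-timescale parameter dynamics (exponential-moving-average traces and a deep-momentum optimizer) are used to improve robustness and performance on industrial defect classification.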

Abstract: X-ray computed tomography (XCT) is widely used for non-destructive testing of Nomex honeycomb structures in aerospace manufacturing, but industrial inspection still relies heavily on manual interpretation and supervised models trained on limited labeled data. This work introduces NL-MambaXCT, a Mamba-based framework that combines self-supervised masked image modelling with a Nested Learning (NL) formulation for automated, label-efficient defect classification from production XCT slices. The backbone is a four-stage 2D encoder with RegNet convolutional blocks in the early stages and Mamba-based sequence mixing with attention in the deeper stages. It is pretrained by masked image modelling on 19,961 unlabeled industrial XCT slices and fine-tuned on 2,000 relabeled Nomex XCT slices split by production order. NL is instantiated through two-timescale parameter dynamics: selected projections maintain slow exponential-moving-average traces alongside fast weights, while a deep-momentum optimizer introduces an additional slow parameter-update trajectory. On the held-out test set, the MIM-pretrained NL-MambaXCT model achieves 96.91% accuracy and 96.8% macro F1, outperforming CNN, attention, and single-timescale Mamba baselines by 3.11–10.31 percentage points in accuracy. The results suggest that combining masked self-supervision with NL-style fast/slow learning dynamics is a promising strategy for robust defect classification in Nomex honeycomb XCT inspection.
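The two-timescale idea of fast weights paired with a slow exponential-moving-average (EMA) trace can be illustrated with a minimal sketch. This is a generic EMA update under assumed names (`two_timescale_step`, `lr`, `ema_decay`); it is not the paper's NL formulation and does not model the deep-momentum optimizer.

```python
def two_timescale_step(fast, slow, grad, lr=0.1, ema_decay=0.99):
    """One illustrative update on a flat list of parameters.

    The fast weights take a plain gradient step; the slow trace follows
    the updated fast weights as an exponential moving average.
    """
    new_fast = [w - lr * g for w, g in zip(fast, grad)]
    new_slow = [ema_decay * s + (1.0 - ema_decay) * w
                for s, w in zip(slow, new_fast)]
    return new_fast, new_slow


# Hypothetical single-parameter example.
fast, slow = two_timescale_step([1.0], [1.0], [1.0], lr=0.1, ema_decay=0.9)
print(fast, slow)
```

With a decay close to 1, the slow trace changes far less per step than the fast weights, which is the sense in which the two sets of parameters evolve on different timescales.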


[146] Deep Learning Strain Estimation: Is Physics-Based Simulation the Solution? eess.IV | cs.AI | cs.CVPDF

Thierry Judge, Nicolas Duchateau, Andreas Østvik, Khuram Faraz, Anders Austlid Taskén

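TL;DR: This paper proposes a new simulation strategy for deep-learning-based myocardial strain estimation. It incorporates speckle decorrelation measures from real ultrasound videos and uses an iterative refinement process to make the simulated motion more realistic, yielding an open-source realistic dataset of 1,478 videos. An echocardiographic motion estimation algorithm trained on this dataset outperforms existing methods on both global and regional strain estimation.

Details

Motivation: Speckle tracking echocardiography (STE) is the clinical standard for myocardial strain estimation, but its accuracy for regional strain is limited, even though regional strain is a key biomarker for early diagnosis and for characterizing subtle abnormalities. Deep learning is a promising alternative, but its development is constrained by the lack of reliable motion references. Existing methods rely either on STE-derived labels or on simulations from physics-based models, and these synthetic sequences have limited realism compared with clinical data.

Result: The proposed method achieves unmatched performance on global and regional strain estimation. In an inter-expert setting, its GLS variability reaches 1.42%, compared with 1.78% for the clinical reference.

Insight: The claimed contribution is a simulation strategy that incorporates speckle decorrelation measures from real videos and improves motion realism through iterative refinement, producing a high-quality open-source dataset. Viewed objectively, the key idea is to inject a physical property of real data (speckle decorrelation) into the simulation process, which narrows the realism gap between synthetic and clinical data and provides a foundation for training more reliable deep learning models.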

Abstract: Speckle tracking echocardiography (STE) is the clinical standard for myocardial strain estimation. Despite good performance on global longitudinal strain (GLS), its accuracy for regional strain remains limited, even though this biomarker is highly relevant for early diagnosis and the characterization of subtle abnormalities from clinical data. Deep learning is a promising alternative, but its development is constrained by the lack of reliable motion references. Existing solutions rely either on STE-derived labels or on simulations generated by physics-based models, but these synthetic sequences still have limited realism compared with clinical data. In this paper, we propose a novel simulation strategy that incorporates speckle decorrelation measures from real videos and uses an iterative refinement process to improve the motion realism in the simulations. We created an open-source photorealistic dataset of 1,478 videos with reference motion, which was used to train an echocardiographic motion estimation algorithm. The proposed method achieves unmatched performance on global and regional strain, notably reaching a GLS variability of 1.42% in an inter-expert setting compared to 1.78% for the clinical reference.


cs.LG [Back]

[147] Explicit Critic Guidance for Aligning Diffusion Models cs.LG | cs.CVPDF

Zhengyang Liang, Qihang Zhang, Ceyuan Yang

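TL;DR: This paper proposes a state-aligned latent actor-critic framework for post-training alignment of diffusion models. The diffusion model itself serves as a timestep-conditioned value function that predicts values directly on noisy latent states. This supports trajectory-level PPO training and stable actor-critic optimization, and lets the learned critic be reused to steer generation at inference time.

Details

Motivation: Existing online reinforcement learning methods for diffusion model alignment face two main problems: assigning fine-grained credit along denoising trajectories is difficult, and stable value-based optimization is hard to achieve.

Result: On both UNet- and DiT-based backbones, the method consistently outperforms prior group-relative RL and actor-critic baselines on single-reward and multi-reward benchmarks, and inference-time steering further improves generation quality.

Insight: The contribution is to use the diffusion model itself as a timestep-conditioned value function, enabling trajectory-level PPO training and stable actor-critic optimization. The framework also supports joint multi-reward training to alleviate reward hacking, and the learned critic can be used directly for inference-time steering, which makes the method more practical.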

Abstract: Online reinforcement learning is becoming increasingly important for aligning diffusion models with non-differentiable objectives. However, existing methods still face limitations in assigning fine-grained credit along denoising trajectories and in realizing stable value-based optimization. We propose a state-aligned latent actor-critic framework for diffusion post-training, in which the diffusion model serves as its own timestep-conditioned value function and predicts values directly on noisy latent states. This enables trajectory-level PPO training, supports stable actor-critic optimization with simple conditioning and value pretraining strategies, and naturally allows the learned critic to be reused for inference-time steering. We further extend the framework to multi-reward optimization, where joint training with complementary rewards helps alleviate reward hacking. Across both UNet- and DiT-based backbones, our method consistently outperforms prior group-relative RL and actor-critic baselines on single-reward and multi-reward benchmarks, while test-time steering provides additional gains in generation quality.


[148] Aligning LLMs with Human Uncertainty: A Beta-Bernoulli Calibrator for LLM Forecasting cs.LG | cs.AI | cs.CLPDF

Hui Dai, Ryan Teehan, Parsa Torabian, Mengye Ren

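TL;DR: This paper proposes the Beta-Bernoulli Calibrator (BBC), a new method for improving probabilistic forecasts from large language models (LLMs). Using binary outcomes and human forecasts (both the crowd probability estimate and the degree of agreement among forecasters) as supervision, it converts any model's initial point-estimate forecast into a distribution over event likelihood, yielding a calibrated point forecast and an estimate of epistemic uncertainty.

Details

Motivation: Existing methods mainly learn from binary outcomes to output verbalized forecasts, and do not fully use the rich information in human forecasts, such as the crowd probability estimate and the degree of agreement among forecasters. This paper addresses how to use these signals to improve LLM forecasting.

Result: Experiments show that BBC generally gives better calibrated and more accurate forecasts than traditional post-hoc calibration methods and models fine-tuned specifically for forecasting, while remaining lightweight and generalizing well. The epistemic uncertainty captured by BBC also predicts forecasting error more reliably than verbalized confidence.

Insight: The core contribution is a unified probabilistic framework (the Beta-Bernoulli model) that integrates crowd information and disagreement signals from human forecasts into the calibration of LLM forecasts, improving both the accuracy of the point forecast and the reliability of the uncertainty estimate. This offers a new way to use collective human judgment to calibrate and interpret LLM forecasts.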

Abstract: Probabilistic forecasting estimates the likelihood of uncertain future events. To improve LLM forecasting, existing methods typically learn from binary outcomes to output verbalized forecasts. However, while aggregated human forecasts contain rich information in both the crowd probability estimate and the degree of agreement among forecasters, how to utilize these signals remains underexplored. To address this, we propose the Beta-Bernoulli Calibrator (BBC), which converts an initial point estimate forecast from any model into a distribution over event likelihood, using supervision from both binary outcomes and human forecasts. BBC models event likelihood $p \sim \text{Beta}(\alpha, \beta)$ and outcome $y \sim \text{Bernoulli}(p)$, with the mean as the calibrated point forecast and the variance as the epistemic uncertainty. Our results show that BBC generally provides better calibrated and more accurate forecasts than both traditional post-hoc calibration methods and models fine-tuned specifically for forecasting, while remaining lightweight and having good generalization. We also show that the epistemic uncertainty captured by BBC is a more reliable predictor of forecasting error than verbalized confidence.
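The abstract reads off two quantities from the Beta distribution: its mean as the point forecast and its variance as the epistemic uncertainty. The sketch below computes both from the standard closed-form Beta moments. The function name `beta_summary` is an assumption introduced here, and how BBC actually predicts the Beta parameters is not shown in the abstract and is not modeled.

```python
def beta_summary(alpha, beta):
    """Return (mean, variance) of a Beta(alpha, beta) distribution.

    mean     = alpha / (alpha + beta)
    variance = alpha * beta / ((alpha + beta)**2 * (alpha + beta + 1))
    """
    if alpha <= 0 or beta <= 0:
        raise ValueError("alpha and beta must be positive")
    total = alpha + beta
    mean = alpha / total
    variance = alpha * beta / (total ** 2 * (total + 1))
    return mean, variance


# Two hypothetical forecasts with the same point estimate but different spread.
print(beta_summary(2, 2))
print(beta_summary(20, 20))
```

Both calls share the mean 0.5, but the second has a smaller variance, which is how a single distribution can separate "how likely" from "how sure".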


[149] CAREF: Calibration-Aware Regularization for Explanation Faithfulness Without Rationale Supervision cs.LG | cs.CLPDF

Naphat Nithisopa, Teerapong Panboonyuen

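TL;DR: CAREF is a parameter-efficient fine-tuning framework that jointly optimizes predictive accuracy and explanation faithfulness through calibration-aware regularization. At its core is a unified loss, LSCED, that couples entropy-based calibration with token-level sparsity control, without human-annotated rationale supervision. On four NLE benchmarks (COS-E, ECQA, ComVE, e-SNLI), the lightweight CAREF-AQ variant uses only 6.43% of trainable parameters and attains the best average accuracy (89.04) and explanation alignment score (81.00 nBERT), outperforming LoRA and AdaLoRA.

Details

Motivation: The paper addresses how to improve both predictive accuracy and explanation faithfulness when fine-tuning large language models, without relying on human-annotated explanations (rationale supervision).

Result: On four natural language explanation benchmarks, CAREF-AQ attains an average accuracy of 89.04 and an nBERT explanation alignment score of 81.00, outperforming LoRA and AdaLoRA and reaching state-of-the-art level.

Insight: The main contribution is to unify, for the first time, entropy calibration and sparsity regularization in a single training objective for interpretable LLM fine-tuning. This gives a parameter-efficient way to improve explanation faithfulness automatically, without supervision signals.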

Abstract: We introduce CAREF, a parameter-efficient fine-tuning framework that jointly optimizes predictive accuracy and explanation faithfulness via calibration-aware regularization. At its core, CAREF couples entropy-based calibration with token-level sparsity control through a single unified loss, the Calibration-Aware Regularization for Explanation Faithfulness (LSCED), without requiring rationale supervision. Evaluated on four NLE benchmarks (COS-E, ECQA, ComVE, e-SNLI) with Flan-T5, our lightweight CAREF-AQ variant attains the best average accuracy (89.04) and explanation alignment (81.00 nBERT) using only 6.43% of trainable parameters, outperforming LoRA and AdaLoRA. To our knowledge, CAREF is the first method to unify entropy and sparsity regularization in a single training objective for interpretable LLM fine-tuning.


[150] Knowing When to Ask: Segment-Level Credit Assignment for LLM Tool Use cs.LG | cs.CLPDF

Abhijit Kumar, Zoey Wu, Mohit Suley

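TL;DR: This paper proposes CARL (Competence-Aware Reinforcement Learning), which targets the inability of large language models to judge accurately when they need external help during tool use. The method decomposes model-generated trajectories into segments at natural boundaries and trains a critic to assign independent credit to each segment. This teaches the model to recognize the boundary of its own knowledge, reduces unnecessary tool calls, and improves accuracy when tools are needed.

Details

Motivation: Current prompt-based and reinforcement learning approaches do not effectively teach language models to recognize the boundary of their own knowledge, so models either over-rely on tools or fail to call them when needed. Trajectory-level credit assignment cannot tell which tool calls in a successful trajectory actually helped, and cannot penalize unnecessary calls.

Result: Across five benchmarks spanning arithmetic, multi-hop factual QA, and numerical reasoning over financial tables, CARL improves exact match (EM) over the best RL baseline by 6.7 points at 7B parameters and 9.7 points at 3B, with the largest gain on Musique (+8.3 EM at 7B, +9.0 EM at 3B). The model issues 53% fewer tool calls on parametrically answerable questions while remaining about 10 EM points more accurate on them.

Insight: The core contribution is a segment-level credit assignment mechanism. Trajectories are decomposed at natural boundaries (such as code fence delimiters), and each segment receives independent credit from a single binary outcome, with no external judges or step-level annotations. This lets the model learn its domain competence and separate parametrically solvable questions from tool-dependent ones. The study also finds that this ability to "know when to ask" benefits smaller models (such as 3B) more, suggesting that it compensates in particular for smaller parametric memory.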

Abstract: Humans know when to reach for help: for example, $347 \times 28$ warrants a calculator while $2+2$ does not. Language models do not. Prompt-based approaches can instruct a model when to invoke tools, but this scaffolding does not teach it to recognize the boundary of its own knowledge. RL approaches that assign a single outcome reward to the whole trajectory fare no better: trajectory-level credit cannot isolate which tool call in a successful episode actually helped, nor penalize unnecessary calls. We propose CARL (Competence-Aware Reinforcement Learning), which trains a critic on the model’s own rollouts to learn where parametric knowledge suffices and where it needs external help. By decomposing each rollout at natural tool-use boundaries (e.g., code fence delimiters and context block transitions), CARL assigns independent credit to each segment from a single binary outcome, without external judges or step-level annotations. As a result, erroneous tool calls, incorrect extractions, and unnecessary calls each receive appropriately signed advantages. The trained critic captures the model’s domain competence: it separates parametrically solvable from tool-dependent questions with AUC 0.93 at 7B. On five benchmarks spanning arithmetic, multi-hop factual QA, and numerical reasoning over financial tables, CARL improves exact-match accuracy by 6.7 points at 7B and 9.7 points at 3B over the best RL baseline, with the largest gain (+8.3 EM at 7B, +9.0 EM at 3B) on Musique. The model issues 53% fewer tool calls on parametrically answerable questions while remaining ${\sim}10$ EM points more accurate on them. Gains are largest at small scale: the 3B improvement is $1.4\times$ the 7B improvement, suggesting that knowing when to ask disproportionately benefits models with smaller parametric memory.
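Decomposing a rollout at code fence delimiters can be sketched as a simple string split. The helper below is a hypothetical illustration of the segmentation step only: the name `split_segments` and the two segment labels are assumptions, context block transitions are not handled, and the critic that assigns credit to each segment is not modeled.

```python
FENCE = "`" * 3  # code fence delimiter, built this way to avoid nesting fences


def split_segments(rollout):
    """Split a rollout into (kind, text) segments at code-fence boundaries.

    Text outside fences is labeled 'reasoning'; text inside a pair of
    fences is labeled 'tool_call'. Empty segments are dropped.
    """
    segments = []
    for i, part in enumerate(rollout.split(FENCE)):
        text = part.strip()
        if text:
            kind = "tool_call" if i % 2 == 1 else "reasoning"
            segments.append((kind, text))
    return segments


# Hypothetical rollout using the arithmetic example from the abstract.
rollout = ("I need a calculator for this." + FENCE + "print(347 * 28)"
           + FENCE + "The answer is 9716.")
for kind, text in split_segments(rollout):
    print(kind, "|", text)
```

Each resulting segment is a unit that a segment-level critic could score separately, instead of sharing one reward across the whole trajectory.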


[151] Self-Consistency via Marginal Sharpening cs.LG | cs.CLPDF

Aleksei Arzhantsev, Otmane Sakhi, Nicolas Chopin

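TL;DR: This paper proposes a new inference-time sampling method that sharpens the answer marginal rather than the full output distribution, making self-consistency an inference-time objective. It uses an efficient parallel sampling algorithm and is both stronger and faster than existing methods on mathematics and coding benchmarks.

Details

Motivation: Existing inference-time sampling methods improve language model reasoning by sharpening the full output distribution. The authors argue that this entangles the reasoning trace with the final answer, whereas what matters is whether an answer is supported by many plausible reasoning paths.

Result: On mathematics and coding benchmarks, the method performs better than standard power sampling while being orders of magnitude faster.

Insight: The contribution is to turn self-consistency from a post-hoc voting criterion into an inference-time objective. Sharpening the answer marginal assesses answer reliability more directly, and an efficient parallel sampling algorithm is designed to sample from it approximately.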

Abstract: Inference-time sampling can elicit strong reasoning abilities from language models without additional training. Existing power-sampling methods do so by sharpening the distribution over full generated outputs, favoring completions that are individually likely under the model. We argue that this is the wrong object to target for reasoning: a completion entangles a reasoning trace with a final answer, whereas what matters is whether an answer is supported by many plausible reasoning paths. We therefore shift the target from the full-output distribution to the sharpened answer marginal, making self-consistency an inference-time objective rather than a post-hoc voting criterion. Surprisingly, this marginal target admits an efficient approximation: we propose a simple, purely autoregressive parallel sampling algorithm that approximately samples from the sharpened answer marginal, eliciting stronger performance than standard power sampling on mathematics and coding benchmarks while being orders of magnitude faster.
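The target distribution, a sharpened answer marginal, can be illustrated empirically: estimate each answer's marginal probability from the fraction of sampled completions that end in it, raise it to a power, and renormalize. This sketch only illustrates the target under assumed names (`sharpened_answer_marginal`, `alpha`); it is not the paper's parallel sampling algorithm, which avoids this kind of brute-force estimate.

```python
from collections import Counter


def sharpened_answer_marginal(answers, alpha=4.0):
    """Empirical answer marginal p(a), sharpened to p(a)**alpha / Z.

    `answers` holds the final answers extracted from sampled completions;
    different reasoning traces that reach the same answer pool their mass.
    """
    counts = Counter(answers)
    n = len(answers)
    weights = {a: (c / n) ** alpha for a, c in counts.items()}
    z = sum(weights.values())
    return {a: w / z for a, w in weights.items()}


# Hypothetical sample: three reasoning paths reach "42", one reaches "7".
print(sharpened_answer_marginal(["42", "42", "42", "7"], alpha=2.0))
```

With `alpha=1` this reduces to the plain vote shares used by majority-vote self-consistency; larger values concentrate mass on the answer supported by the most paths.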


[152] Interpretability-Guided Layer Selection over Subspace Projection: SAEs as Stethoscopes, Not Scalpels, for Raw Task Vector Model Editing cs.LG | cs.CLPDF

Li Lei, Madalina Ciobanu, Qingqing Mao, Ritankar Das

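TL;DR: This paper studies model editing for large language models (LLMs). It evaluates editing by projecting onto sparse autoencoder (SAE) feature subspaces and finds that the approach fails because of an information bottleneck. The authors instead propose using the SAE as a "stethoscope" rather than a "scalpel": the SAE provides layer-level diagnosis, after which unfiltered raw task vectors are injected, giving significant and robust improvements on mathematical reasoning tasks.

Details

Motivation: Enhancing domain-specific capabilities of LLMs through full fine-tuning incurs large computational cost and catastrophic forgetting, so more precise, "surgical" model editing methods are needed. Sparse autoencoders (SAEs) are seen as a promising tool for identifying where to intervene at the feature level.

Result: On Gemma-3-4B-IT and the Minerva Math benchmark, conventional SAE subspace projection discards about 97% of the modification energy and yields no statistically significant improvement across seven math subjects. The proposed method (SAE as a stethoscope) raises Number Theory accuracy from 29.6% to 39.4%, with five of seven subjects significantly improved and none significantly degraded.

Insight: The core contribution is a shift in perspective: the SAE is repositioned from a "scalpel" for intervention-level filtering to a "stethoscope" for layer-level diagnosis. This exposes a geometric misalignment between activation-space SAE directions and weight-space task vectors, and provides a deterministic, principled framework for interpretability-guided model editing with no additional inference cost.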

Abstract: LLMs increasingly require surgical model editing to enhance domain-specific capabilities without incurring the computational cost or catastrophic forgetting associated with full fine-tuning. Sparse Autoencoders (SAEs) have emerged as a promising tool in this setting, in principle allowing for feature-level identification of where to intervene. In this work, we rigorously evaluate an SAE-guided editing pipeline for mathematical reasoning on Gemma-3-4B-IT and uncover a fundamental failure mode: the intuitively appealing approach of projecting task vectors onto SAE feature subspaces acts as an information bottleneck that discards approximately 97% of the modification energy, yielding no statistically significant improvements across seven math subjects. We show that this failure stems from a geometric misalignment between activation-space SAE directions and weight-space task vectors. We then propose a shift in perspective: SAE as a Stethoscope, Not a Scalpel, where SAEs are used for layer-level diagnosis rather than intervention-level filtering. By injecting unfiltered raw task vectors only into layers identified by an SAE-derived specificity score, we improve Number Theory accuracy from 29.6% to 39.4% (z=+3.41, p=0.0007) on the Minerva Math benchmark; 5 of 7 math subjects significantly improved and none significantly degraded. Our method is fully deterministic, requires no additional inference cost, and provides a principled framework for interpretability-guided model editing.
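Layer-selective injection of a raw task vector can be sketched as follows. The task vector for a layer is taken as the difference between fine-tuned and base weights, and it is added only to layers whose score passes a cutoff. The names `inject_task_vector`, `specificity`, `threshold`, and `scale` are assumptions, as is the thresholding rule itself; the abstract does not say how the SAE-derived specificity score is computed or how layers are selected from it.

```python
def inject_task_vector(base, finetuned, specificity, threshold, scale=1.0):
    """Add the unfiltered task vector (finetuned - base) to selected layers.

    `base` and `finetuned` map layer names to flat lists of weights;
    `specificity` maps layer names to a diagnostic score. Layers scoring
    below `threshold` are copied from the base model unchanged.
    """
    edited = {}
    for layer, weights in base.items():
        if specificity.get(layer, 0.0) >= threshold:
            edited[layer] = [w + scale * (f - w)
                             for w, f in zip(weights, finetuned[layer])]
        else:
            edited[layer] = list(weights)
    return edited


# Hypothetical two-layer model where only layer "l1" is selected.
base = {"l0": [1.0, 2.0], "l1": [0.0, 0.0]}
finetuned = {"l0": [2.0, 4.0], "l1": [1.0, 1.0]}
scores = {"l0": 0.1, "l1": 0.9}
print(inject_task_vector(base, finetuned, scores, threshold=0.5))
```

Because the edit is a one-time change to the weights, the edited model runs with the same inference cost as the base model.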


[153] Learn from Weaknesses: Automated Domain Specialization for Small Computer-Use Agents cs.LG | cs.AI | cs.CLPDF

Suji Kim, Kangsan Kim, Sung Ju Hwang

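TL;DR: This paper proposes LearnWeak, a framework for automatically improving the performance of small computer-use agents in specific domains. It uses a stronger reference agent to identify the student agent's weaknesses, generates targeted tasks, and constructs supervision automatically. By disentangling planning errors from execution errors, it enables more precise behavioral updates.

Details

Motivation: Small computer-use agents are weak overall and fail unevenly across domains, and existing approaches based on large-scale synthetic training data give limited improvement. A more efficient domain specialization framework that needs no human annotation is therefore required.

Result: Across eight domains of the OSWorld benchmark, LearnWeak achieves average gains of 11.6 and 11.1 percentage points over EvoCUA-8B and OpenCUA-7B, respectively. The student-aware data generation and training approaches are also shown to outperform existing autonomous trajectory generation and training baselines.

Insight: The contributions are a student-aware, weakness-driven data synthesis framework and an error-aware specialization objective. Disentangling error types enables targeted reinforcement, offering a more principled and efficient path for specializing small agents.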

Abstract: Computer-use agents (CUAs) have recently made substantial progress, but deploying a separate large expert for each software domain remains expensive. Small open computer-use agents are more practical specialization targets, but they remain substantially weaker and exhibit uneven domain-specific failures. A straightforward remedy is to synthesize large-scale training data for the target domain, yet we find that this naive approach yields only marginal improvements. Building on this observation, we introduce LearnWeak, an annotation-free specialization framework for small computer-use agents that uses a stronger reference agent to identify the student’s weaknesses in the target domain, synthesize targeted tasks, and construct supervision automatically. LearnWeak further introduces an error-aware specialization objective that disentangles planning and execution errors, enabling more behaviorally precise updates than broad uniform supervision. On OSWorld, LearnWeak achieves average gains of 11.6 and 11.1 percentage points over EvoCUA-8B and OpenCUA-7B, respectively, across eight domains. We also validate that our student-aware dataset generation and training approaches outperform existing autonomous trajectory generation and training baselines. Our work highlights the importance of student awareness in both data synthesis and agent training, pointing toward a more principled and efficient path for specializing small computer-use agents in diverse domains.


[154] Extrapolative Weight Averaging Reveals Correctness-Efficiency Frontiers in Code RL cs.LG | cs.AI | cs.CLPDF

Kunhao Zheng, Pierre Chambon, Juliette Decugis, Jonas Gehring, Taco Cohen

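TL;DR: This paper studies how extrapolative weight averaging can explore and extend the Pareto frontier between correctness and efficiency in reinforcement learning (RL) for code. In competitive programming, checkpoints trained under nested unit-test coverage form a correctness-efficiency frontier. Interpolative and extrapolative weight averaging recover and go beyond this frontier, yielding complementary policies at inference time and improving overall performance.

Details

Motivation: The paper asks whether extrapolative weight averaging can extend the Pareto frontier between fine-tuned checkpoints to new checkpoints that are useful at inference time, without additional RL training, particularly for balancing functional correctness and computational efficiency in code RL.

Result: On LCB/hard, ensembles with extrapolative weight averaging improve pass@250 by 3.3% over the best single checkpoint at a matched sample budget. The frontier and its extrapolative continuation are observed across three inference settings (pure reasoning, tool use, and agentic coding) and two model scales (32B and 7B).

Insight: The contribution is to show that nested unit-test coverage induces a correctness-efficiency frontier in code RL, and that extrapolative weight averaging can navigate, extend, and exploit this frontier, opening a new route to inference-time policy combination and performance gains.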

Abstract: Linear interpolation between fine-tuned checkpoints has been shown to trace the Pareto front between competing objectives, but whether extrapolative weight averaging can extend such frontiers to new checkpoints useful at inference time, without additional RL training, remains unclear. We study this question in RL for competitive programming, where hidden unit tests under time and memory limits enforce both functional correctness and computational efficiency. Starting from a shared initialization, we train checkpoints under nested unit-test coverage: low-coverage rewards require passing smaller-input tests, while high-coverage rewards require passing progressively larger tests up to the full suite. This sweep reveals the emergence of a correctness-efficiency frontier: on hard problems, higher-coverage reward reduces optimization failures but increases correctness failures, leaving solve rate nearly unchanged. Interpolation between low- and high-coverage checkpoints recovers this frontier, while extrapolation extends it beyond the trained endpoints. Both the frontier and its extrapolative continuation appear across three inference settings, pure reasoning, tool use, and agentic coding, and across two model scales, 32B and 7B. At the problem level, moving along the frontier changes which problems are solved, making extrapolated checkpoints complementary policies in inference-time scaling. Ensembles with extrapolative weight averaging broaden coverage and improve pass@250 on LCB/hard by 3.3% over the best single checkpoint at matched sample budget. These results show that nested unit-test coverage in code RL induces a frontier that extrapolative weight averaging can navigate, extend, and exploit.
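Interpolation and extrapolation between two checkpoints are the same linear formula with different coefficients. The sketch below applies it to toy checkpoints stored as dictionaries of flat lists. The name `combine_checkpoints` and the coefficient `lam` are assumptions introduced here, and the values of `lam` are illustrative rather than taken from the paper.

```python
def combine_checkpoints(theta_low, theta_high, lam):
    """Return (1 - lam) * theta_low + lam * theta_high, parameter by parameter.

    0 <= lam <= 1 interpolates between the two checkpoints;
    lam < 0 or lam > 1 extrapolates beyond the trained endpoints.
    """
    return {
        name: [(1.0 - lam) * a + lam * b
               for a, b in zip(theta_low[name], theta_high[name])]
        for name in theta_low
    }


# Hypothetical low-coverage and high-coverage checkpoints.
low = {"w": [0.0, 1.0]}
high = {"w": [1.0, 3.0]}
print(combine_checkpoints(low, high, 0.5))   # interpolation
print(combine_checkpoints(low, high, 1.5))   # extrapolation past `high`
print(combine_checkpoints(low, high, -0.5))  # extrapolation past `low`
```

Each value of `lam` gives a distinct checkpoint at no extra training cost, which is what makes it cheap to build an ensemble of complementary policies along the frontier.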


cs.MA [Back]

[155] You Only Align Once: Propagating Cooperative Behaviors in Multi-Agent Systems through Seed Agents cs.MA | cs.CLPDF

Nicole Hsing, Asuka Yuxi Zheng, Yi Zhao, Haoqin Tu, Jen-Tse Huang

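TL;DR: This paper identifies a phenomenon called "Alignment Propagation": in distributed open multi-agent systems, a single aligned "seed agent" can propagate cooperative behaviors to untrained agents purely through natural language interaction. In the Red-Black Game, a team-based iterated Prisoner's Dilemma, the authors distill a teacher model's cooperative reasoning and persuasive dialogues into Qwen-3-14B to create a seed agent. The seed agent substantially raises the team cooperation rate and transfers zero-shot to Sugarscape, a spatial survival simulation, where it greatly increases the trade success rate.

Details

Motivation: The paper addresses the challenge of ensuring agent behaviors (cooperative behaviors in particular) in distributed open multi-agent systems as the number of agents grows and unaligned agents may be present.

Result: In the Red-Black Game, the seed agent raises the team cooperation rate from a 24.8% baseline to 62.2%, outperforming the teacher model and a vanilla Gemini-3.1-Pro. In Sugarscape, it transfers zero-shot and raises the trade success rate from a 21.6% baseline to 91.5%.

Insight: The core contribution is to reframe alignment from a traditional problem requiring exhaustive training of every agent into a scalable social capability that can be engineered through strategic placement of seed agents. This offers a more efficient and scalable paradigm for multi-agent alignment: "You Only Align Once".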

Abstract: Ensuring agent behaviors in distributed open multi-agent systems remains challenging, especially as populations grow and unaligned agents may exist. We show that a single aligned agent can propagate cooperative behaviors to untrained agents purely through natural language interaction, a phenomenon we term Alignment Propagation. We study this in the Red-Black Game, a team-based iterated Prisoner’s Dilemma in which teammates deliberate and vote to determine their team’s collective action. By distilling the cooperative reasoning and persuasive dialogues of a teacher model into Qwen-3-14B, we obtain a seed agent that, when placed among four untrained teammates, more than doubles the cooperation rate from 24.8% to 62.2%, outperforming the teacher model and a vanilla Gemini-3.1-Pro. Remarkably, a seed trained exclusively on the Red-Black Game transfers zero-shot to Sugarscape, a spatially grounded survival simulation with pairwise trading, achieving a 91.5% trade success rate versus a 21.6% baseline. Our results reframe multi-agent alignment from an exhaustive per-agent training problem to a scalable social capability that can be engineered through strategic seed placement.


cs.PL [Back]

[156] FPMoE: A Sparse Mixture-of-Experts Approach to Functional Code Generation cs.PL | cs.AI | cs.CLPDF

Loc Pham, Lang Hong Nguyet Anh, Thanh Le-Cong

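TL;DR: This paper proposes FPMoE, a lightweight open-source code generation model built on a sparse Mixture-of-Experts (MoE) architecture, aimed at the poor performance of existing large language models on functional programming languages such as Haskell, OCaml, and Scala. The model has three language-specific experts and one shared expert that captures cross-language functional abstractions (such as monadic reasoning and type-directed programming), so that it removes the interference seen in multi-language fine-tuning while preserving shared abstractions.

Details

Motivation: Existing LLM-based code generation models are trained mainly on imperative languages, so their performance on functional programming languages lags substantially. Simple per-language or multi-language fine-tuning both have drawbacks: the former fails to capture shared functional abstractions, and the latter introduces cross-language interference.

Result: On the FPEval benchmark, FPMoE substantially outperforms fine-tuned baselines. With only 3B active parameters, it matches much larger models such as DeepSeek-Coder-6.7B, Qwen2.5-Coder-14B-Instruct, and Qwen3-Coder-30B-A3B.

Insight: The core contribution is a sparse MoE architecture with language-specific experts and a shared expert, which resolves the tension between interference and abstraction sharing in multi-language code generation. This suggests a direction for efficient, lightweight domain-specific code generation models, especially in resource-constrained settings or when handling several related but distinct languages.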

Abstract: Despite rapid progress in LLM-based code generation, existing models are predominantly trained on imperative languages, leaving functional programming languages (FPLs) such as Haskell, OCaml, and Scala chronically underexplored, with even frontier models performing substantially worse on FPLs. Fine-tuning is a natural remedy, but our experiments show that per-language fine-tuning fails to capture shared functional abstractions, while merged multi-language fine-tuning introduces cross-language interference. To address this, we introduce FPMoE, a lightweight, open-source code generation model built on a sparse Mixture-of-Experts (MoE) architecture with three language-specific routed experts (one each for Haskell, OCaml, and Scala) and a shared expert that captures cross-language functional patterns such as monadic reasoning and type-directed programming. This design resolves both failure modes simultaneously: dedicated experts eliminate interference, while the shared expert preserves abstractions that per-language models miss. On FPEval, FPMoE substantially outperforms fine-tuned baselines and, with only 3B active parameters, matches the performance of much larger models including DeepSeek-Coder-6.7B, Qwen2.5-Coder-14B-Instruct, and Qwen3-Coder-30B-A3B.
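The division of labor between routed and shared experts can be sketched with plain functions standing in for expert networks. This is a toy illustration under stated assumptions: routing by an explicit language tag and combining the two outputs by elementwise sum are choices made here for clarity, and the name `fpmoe_layer` is hypothetical. The abstract does not specify FPMoE's actual router or combination rule.

```python
def fpmoe_layer(x, language, routed_experts, shared_expert):
    """Combine one language-specific expert with the shared expert.

    Only the expert for `language` is evaluated (sparse routing), while
    the shared expert runs for every input regardless of language.
    """
    routed = routed_experts[language](x)
    shared = shared_expert(x)
    return [r + s for r, s in zip(routed, shared)]


# Toy experts: each is a function on a list of floats.
routed_experts = {
    "haskell": lambda v: [2.0 * t for t in v],
    "ocaml": lambda v: [3.0 * t for t in v],
    "scala": lambda v: [4.0 * t for t in v],
}
shared_expert = lambda v: [t + 1.0 for t in v]

print(fpmoe_layer([1.0, 2.0], "haskell", routed_experts, shared_expert))
print(fpmoe_layer([1.0, 2.0], "scala", routed_experts, shared_expert))
```

Updates for one language touch only that language's routed expert plus the shared expert, which is the intuition behind isolating interference while still sharing common abstractions.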


cs.RO [Back]

[157] Uni-LaViRA: Language-Vision-Robot Actions Translation for Unified Embodied Navigation cs.RO | cs.CVPDF

Hongyu Ding, Sizhuo Zhang, Ziming Xu, Jinwen Guo, Hongxiu Liu

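TL;DR: This paper proposes Uni-LaViRA, a unified agentic architecture for embodied navigation that structures the navigation task as a single Language-Vision-Robot Actions Translation problem. It reasons with the intrinsic capabilities of pretrained multimodal large language models (MLLMs), needs no training on robot data, and is applied zero-shot across four task families and four heterogeneous real robot platforms.

Details

Motivation: The dominant approach to embodied navigation scales vision-language-action (VLA) foundation models on large collections of robot trajectories. This paper argues that, for navigation, generality can be obtained through structural design rather than data scale alone.

Result: With zero training cost, Uni-LaViRA matches or surpasses recent navigation foundation models trained on millions of samples and thousands of GPU-hours across several benchmarks: VLN-CE R2R (60.7% SR), VLN-CE RxR (51.3% SR), HM3D-v2 (77.7% SR), HM3D-OVON (60.0% SR), MP3D-EQA (54.7% SR), and OpenUAV (40.0% SR).

Insight: The core contribution is to structure navigation decisions as a unified translation task whose outputs lie within the natural output manifold of MLLMs. Two agent-loop mechanisms make this practical: TODO List Memory (TDM) for structured sub-goal management, and Second Chance Backtrack (SCB), which turns single-pass navigation into a self-correcting process.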

Abstract: Embodied navigation requires an agent to map language and visual observations to a stream of spatial actions that drive a real robot through environments it has never seen. The dominant approach has been to scale vision-language-action (VLA) foundation models on ever-larger collections of robot trajectories. This paper argues that, for navigation specifically, generality can be obtained structurally, not only through data scale. The underlying decision structure of navigation reduces to a single Language-Vision-Robot Actions Translation. The language action emits a semantic-level directional command and the vision action emits a pixel-level visual target. Both outputs lie inside the natural output manifold of pretrained multimodal large language models (MLLMs), so the task can be reasoned about by an agent rather than learned from robot data. Therefore, we present Uni-LaViRA, a unified agentic architecture that extends the same insight to four task families (VLN-CE, ObjectNav, EQA, and Aerial-VLN) and to four heterogeneous real robots (a wheeled robot, a quadruped, a humanoid, and a self-built UAV) in a zero-shot manner. Two agent-loop mechanisms make this unification practical. TODO List Memory (TDM) rewrites a structured checklist of pending sub-goals at every step, reciting the unfinished items back into the agent’s most recent attention window. Second Chance Backtrack (SCB) rolls the robot back to the pre-error state and conditions the agent’s next plan on the failed sub-trajectory, turning single-pass navigation into a self-correcting process. With zero training effort, Uni-LaViRA reaches 60.7% SR on VLN-CE R2R, 51.3% on VLN-CE RxR, 77.7% on HM3D-v2, 60.0% on HM3D-OVON, 54.7% on MP3D-EQA, and 40.0% on OpenUAV, matching or even surpassing recent trained navigation foundation models that consume millions of samples and thousands of GPU-hours.


[158] Turning Video Models into Generalist Robot Policies cs.RO | cs.AI | cs.CV | cs.LGPDF

Sizhe Lester Li, Evan Kim, Xingjian Bai, Tong Zhao, Tao Pang

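TL;DR: This paper proposes VERA, a Video-to-Embodied Robot Action Model that builds a closed-loop video-to-action policy by combining an action-free video world model with an inverse dynamics model (IDM) designed around the robot's Jacobian. Decoupling the video planner from the IDM keeps the planner embodiment-agnostic and lets it be paired flexibly with different IDMs, enabling zero-shot, cross-embodiment, and generalizable robot control.

Details

Motivation: Existing methods typically fine-tune video models on action-labeled data to jointly predict future observations and actions. This paper explores an alternative: leave the video planner unchanged and train an embodiment-specific inverse dynamics model (IDM), to gain the flexibility, data efficiency, and scalability that decoupling provides.

Result: VERA performs strongly on simulated and real-world benchmarks, including zero-shot Panda arm manipulation and 16-DoF Allegro-hand cube re-orientation, supporting the effectiveness of the approach.

Insight: The core contribution is to decouple video planning from action generation. A data-efficient and scalable IDM based on the robot Jacobian turns a video model into a general robot policy, offering a viable route to cross-embodiment robot control.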

Abstract: Video generative models have emerged as a promising robotics backbone, capable of generating videos that depict the completion of complex tasks across embodiments and environments. Recent work proposes robot foundation models that jointly predict future observations and actions by finetuning video models with action-labeled data. In this paper, we test the limits of an alternative approach: leave the video planner as-is while training an embodiment-specific inverse dynamics model (IDM). This decoupling offers several natural benefits: the video planner remains embodiment-agnostic, different video models can be interchanged easily without re-training the IDM, and the IDM can be independently trained with readily available self-play data. We present a closed-loop, video-to-action policy that combines an action-free video world model with a carefully-designed IDM based on the robot embodiment Jacobian. We demonstrate that our IDM design is both data-efficient and scalable to high-dimensional action spaces. Our policy, which we coin the Video-to-Embodied Robot Action Model (VERA), achieves strong performance across simulated and real-world benchmarks, including zero-shot Panda arm manipulation and 16-DoF Allegro-hand dexterous cube re-orientation. The same video planner can be used across multiple embodiments by pairing it with different embodiment-specific IDMs. Our results show that decoupled video planning plus faithful video-to-action translation is a viable alternative route towards zero-shot, cross-embodiment, and generalizable robot control. More results are available on our project website: https://vera.csail.mit.edu.


[159] POINav: Benchmarking and Enhancing Final-Meters Arrival in Real-World Vision-Language Navigation cs.RO | cs.CV

Ruiyan Gong, Meisheng Zhang, Yuxiang Zhao, Mingchao Sun, Yanfen Shen

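TL;DR: The paper presents POINav-Bench, the first benchmark for closed-loop evaluation of real-world point-of-interest (POI) navigation, comprising 11 commercial areas reconstructed with 3D Gaussian Splatting. It also proposes the POINav Brain-Action Framework, in which a Brain module performs POI-grounded reasoning to guide an Action module that predicts continuous waypoints, improving final-arrival precision in real-world navigation.

Details

Motivation: In real-world vision-language navigation, existing benchmarks suffer from coarse granularity or large sim-to-real gaps caused by generated scenes, which makes the "final-meters" challenge of reaching a precise POI hard to address.

Result: Experiments indicate that the proposed POINav Brain-Action Framework offers a viable path toward refining real-world POI-goal navigation; it is evaluated on the high-fidelity benchmark the authors constructed.

Insight: The novelty lies in building the first closed-loop navigation benchmark based on real-world captures and 3D Gaussian Splatting reconstruction with traversability-aware annotations, and in proposing a Brain-Action framework that decouples high-level reasoning from low-level waypoint prediction to tackle the final-arrival problem.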

Abstract: Real-world navigation is fundamentally driven by Points of Interest (POIs), yet reaching a precise POI remains a critical “final-meters” challenge. Existing Vision-Language Navigation (VLN) benchmarks for POI-goal navigation often suffer from coarse granularity or significant sim-to-real gaps due to generated scenes. To bridge this gap, we present POINav-Bench, the first benchmark designed for closed-loop evaluation of real-world POI-goal navigation. It comprises 11 commercial areas reconstructed from real-world captures using 3D Gaussian Splatting (3DGS), covering 126,398 $\mathrm{m}^{2}$ in total and spanning 163 distinct POIs. With traversability-aware annotations and reference trajectories, POINav-Bench enables high-fidelity evaluation of navigation agents in realistic, POI-rich real-world environments. Building on this, we propose the POINav Brain-Action Framework, in which a Brain module performs POI-grounded reasoning to guide an Action module in predicting continuous waypoints for real-world execution. We further curate the POINav-Dataset, containing 70K real-world signage-entrance pairs. Experiments show that our framework provides a viable path toward refining real-world POI-goal navigation.


[160] EventShiftFlow: Towards Hardware-efficient FPGA-based Flow Estimation cs.RO | cs.CV

Arianna Alonso Bizzi, Fernando Cladera, C. J. Taylor

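TL;DR: This paper presents EventShiftFlow, an event-camera flow (velocity) estimation algorithm designed for hardware-efficient FPGA implementation. It discretizes asynchronous events into fixed-duration time bins, builds a 1-bit spatial occupancy grid, and evaluates multiple velocity hypotheses in parallel using integer-only logic, with no floating-point arithmetic or iterative optimization.

Details

Motivation: Event cameras are asynchronous and offer high temporal resolution, which suits low-latency robotic perception, but existing event-based motion estimation methods are computationally intensive and hard to map to FPGA hardware. The paper aims to design a hardware-friendly streaming velocity estimator for size-, weight-, and power-constrained platforms.

Result: On noisy synthetic data with known ground-truth velocities, the method recovers both magnitude and direction, with magnitude estimation hardest when objects of different velocities intersect. On a real event-camera sequence, directional accuracy reaches 99.5% across four motion segments and stays robust across occupancy densities of 10-40%. The datapath needs less than 2 kB of storage, and a single-axis prototype was implemented on a low-cost Xilinx Artix-7.

Insight: The novelty lies in a hardware-efficient design built from discretized events, a 1-bit occupancy grid, and purely fixed-width integer logic (shift registers, counters, and similar), trading dense sub-pixel optical flow for a sparse, quantized velocity estimate optimized for low-latency reactive tasks.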

Abstract: Event-based vision sensors offer asynchronous, high-temporal-resolution measurements that are attractive for low-latency robotic perception, but many event-based motion estimation methods are computationally intensive and difficult to map to FPGA hardware. We present a streaming velocity estimator that discretizes asynchronous events into fixed-duration time bins, constructs a 1-bit spatial occupancy grid, and evaluates multiple velocity hypotheses in parallel using only fixed-width integer logic - shift registers, counters, comparators, and small LUT-mapped multiplies - with no dividers and no DSP blocks. It requires no frame reconstruction, no floating-point arithmetic, and no iterative optimization. The method deliberately trades dense sub-pixel optical flow for a sparse, quantized velocity estimate at each active pixel, suited to low-latency tasks such as reactive obstacle avoidance on size-, weight-, and power-constrained platforms. On noisy synthetic data with known ground-truth velocities, the method recovers both magnitude and direction, with magnitude estimates being most challenged when objects of different velocities intersect. On a real event-camera sequence, directional accuracy reaches 99.5% across all four evaluated motion segments, with performance remaining robust across occupancy densities in the 10-40% range. We characterize the algorithm’s density-dependent behavior, present a parameter sensitivity analysis, show that the proposed datapath requires less than 2 kB of storage, and implement a single-axis prototype on a low-cost Xilinx Artix-7.
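
The binning and hypothesis-scoring idea described in the abstract can be illustrated with a small software sketch. This is not the authors' datapath: the function names, the single-row (1D) layout, the integer velocity hypotheses in pixels per bin, and the shift-AND-popcount score are all assumptions chosen to mirror the "1-bit occupancy grid plus integer logic" description.

```python
# Illustrative sketch only; names and scoring rule are assumptions,
# not the EventShiftFlow implementation.

def bin_events(events, bin_us, width):
    """Discretize (timestamp_us, x) events into per-bin 1-bit occupancy masks.

    Each time bin is one integer whose bit x is set if any event hit pixel x.
    """
    masks = {}
    for t_us, x in events:
        if 0 <= x < width:
            b = t_us // bin_us                    # integer bin index
            masks[b] = masks.get(b, 0) | (1 << x)
    n_bins = max(masks) + 1 if masks else 0
    return [masks.get(b, 0) for b in range(n_bins)]


def shift(mask, v, width):
    """Shift an occupancy mask by v pixels, dropping bits that leave the row."""
    full = (1 << width) - 1
    return (mask << v) & full if v >= 0 else mask >> (-v)


def best_velocity(prev_mask, cur_mask, hypotheses, width):
    """Score each velocity hypothesis by overlap (AND + popcount); keep the best."""
    best_v, best_score = None, -1
    for v in hypotheses:
        score = bin(shift(prev_mask, v, width) & cur_mask).count("1")
        if score > best_score:
            best_v, best_score = v, score
    return best_v, best_score


# Toy example: a two-pixel object moving +2 pixels per 1000-microsecond bin.
events = [(100, 3), (200, 4), (1100, 5), (1500, 6)]
masks = bin_events(events, bin_us=1000, width=8)
velocity, overlap = best_velocity(masks[0], masks[1], [-2, -1, 0, 1, 2], width=8)
```

In hardware each hypothesis would be scored concurrently; the Python loop only models the arithmetic, which uses shifts, bitwise AND, and counting.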


[161] Self-Supervised Online Robot-Agnostic Traversability Estimation for Open-World Environments cs.RO | cs.CV

Julia Hindel, Simon Bultmann, Houman Masnavi, Daniele Cattaneo, Abhinav Valada

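TL;DR: This paper proposes COTRATE, a self-supervised online learning framework for robot traversability estimation in open-world environments. It infers traversability scores with a robot-agnostic multimodal perception module, uses an alignment loss to supervise a visual network, and mitigates forgetting during continual learning with a diversity-aware feature selection strategy.

Details

Motivation: Existing methods rely on handcrafted proprioceptive scores or on clustering prior data, which limits robot-agnosticism and online learning; continual learning methods also tend to incur high computational overhead, hindering practical onboard deployment.

Result: The method is evaluated on a dataset of about 50,000 images covering 11 outdoor terrains and two robot platforms, and benchmarked on navigation tasks in three representative outdoor environments; the results indicate that it supports knowledge transfer across robot platforms.

Insight: Contributions include a robot-agnostic online terrain assessment module, an alignment loss that associates visual embeddings with online assessments, and a low-overhead diversity-aware feature selection strategy, which together enable continual traversability learning and cross-platform knowledge transfer in open-world environments.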

Abstract: Self-supervised online traversability estimation enables robots to continuously learn from unlabeled open-world experiences and adapt their navigation behavior toward safe and efficient trajectories. Existing approaches either rely on handcrafted proprioceptive traversability scores, limiting robot-agnosticism, or cluster prior data, preventing online learning. Moreover, many continual learning methods incur substantial memory and computational costs, hindering onboard deployment. We introduce COTRATE, an online learning framework for continuous traversability estimation from multimodal, unlabeled robot experience. Our method first infers robust traversability scores using a robot-agnostic, learning-based online terrain assessment module operating on proprioceptive and inertial signals. These scores then supervise a visual traversability network through a novel alignment loss that associates visual embeddings with online terrain assessments. To mitigate forgetting during continual learning with minimal overhead, we propose a diversity-aware feature selection strategy that preserves performance using a compact replay memory. We further show that the learned traversability representation supports knowledge transfer across different robot platforms with different locomotion kinematics. We evaluate COTRATE on a dataset of approximately 50,000 images collected with two robotic platforms across 11 outdoor terrains, and benchmark it on navigation tasks in three representative outdoor environments. We make the dataset, code, and trained models publicly available.


cs.CR

[162] MIRAGE: Context-Aware Prompt Injection against Mobile GUI Agents via User-Generated Content cs.CR | cs.AI | cs.CL

Ruoqi Guo, Yi Liu, Gelei Deng, Yiheng Xiong, Yuekang Li

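TL;DR: This paper presents MIRAGE, a context-aware prompt-injection attack against mobile GUI agents. It turns benign mobile screenshots into adversarial samples by planting attacker-controlled text in user-generated content regions, without modifying the agent, the application, or the operating system. On a 1,111-sample benchmark covering 10 applications and 11 attack intents, all five evaluated VLM agents were compromised, with attack success rates of 23%-30%.

Details

Motivation: Mobile GUI agents are built on vision-language models and act on rendered screen pixels, so they cannot reliably distinguish trusted interface elements from user-generated content, which creates a security risk. The paper explores and demonstrates how carefully crafted user-generated content can exploit this weakness for prompt injection.

Result: On the benchmark spanning 10 applications and 11 attack intents, MIRAGE achieves attack success rates of 23%-30% against five VLM agents. In human realism ratings, MIRAGE (3.02/5) outscores the strongest prior attack (2.52/5). The study also finds that per-sample realism and attack success are uncorrelated.

Insight: The novelty is a staged, context-aware adversarial sample generation pipeline (localize, generate, curate) that blends attack payloads into user-generated content regions and separates control over reach, realism, and distributional balance. The work shows that visual-quality filtering alone cannot defend against this attack, underscoring the fundamental challenge of context understanding and content-provenance verification in mobile GUI agent security.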

Abstract: Mobile graphical user interface (GUI) agents driven by vision-language models (VLMs) perceive the screen as rendered pixels and choose actions from what they see, so they cannot reliably separate trusted interface elements from user-generated content. We present MIRAGE (Mobile Injection of Realistic Adversarial GUI Examples), a pipeline that turns benign mobile screenshots into prompt-injection samples by placing attacker-controlled text into ordinary user-generated content regions, without modifying the agent, the application, or the operating system. MIRAGE operates in three stages: a Localizer identifies user-controllable regions on the screenshot, a Generator synthesises context-aware payloads and renders them in the application’s native style, and a Curator moderates realism and balances the samples across applications, region types, and attack intents. A key challenge is that an injected screenshot must stay visually indistinguishable from genuine user content while still diverting the agent; we address this by separating the stages that control reach, realism, and distributional balance. On a 1,111-sample benchmark spanning ten applications and eleven attack intents, all five evaluated VLM agents are vulnerable, with attack success rates of 23%-30%, and MIRAGE scores higher on human realism ratings than the strongest prior attack (3.02 versus 2.52 out of 5). We further find that per-sample realism and attack success are uncorrelated, so visual-quality filtering alone cannot reliably defend against this threat.


[163] MaskClaw: Edge-Side Personalized Privacy Arbitration for GUI Agents with Behavior-Driven Skill Evolution cs.CR | cs.CL

Yanqiu Zhao, Dongying Zheng, Kaibo Huang, Yukun Wei, Zhongliang Yang

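TL;DR: MaskClaw is an edge-side privacy arbitration system designed for GUI agents. It extracts local visual evidence, retrieves user- and task-specific policy memory, and decides whether to allow, mask, or ask the user before raw screenshots leave the trusted environment. The system turns user corrections, cancellations, and edits into reusable privacy skills, validated through a sandbox gate in five skill-evolution scenarios.

Details

Motivation: GUI agents rely on screenshots to infer intent and operate applications, but these screenshots often contain sensitive information such as private messages, medical records, and payment credentials. Existing static PII detectors cannot adapt to boundaries that shift with task, recipient, application state, and user role, and cloud-side VLM reasoning may upload the raw screen before deciding what to protect, risking privacy leakage.

Result: Experiments are run on P-GUI-Evo, a benchmark built from real UI patterns, reconstructed HTML screens, and sanitized labels. Under the same protocol, pattern matching, cloud reasoning, or routing alone tend to over-confirm, over-mask, or expose raw screenshots, whereas MaskClaw arbitrates privacy more effectively.

Insight: The main contribution is an edge-side, behavior-driven privacy arbitration framework that moves the privacy decision point inside the trusted environment and combines policy memory with skill evolution. Systematically converting user feedback (such as corrections) into reusable privacy skills offers a new approach to dynamic, personalized privacy protection.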

Abstract: GUI agents rely on screenshots to infer intent and operate across applications, but these screenshots often contain private messages, medical records, payment credentials, and workplace-specific workflows. Privacy decisions in this setting depend on task, recipient, application state, and user role, yet static PII detectors miss these boundaries and cloud-side VLM reasoning can upload the raw screen before deciding what should be protected. We present MaskClaw, an edge-side privacy arbitrator for GUI agents. MaskClaw extracts local visual evidence, retrieves user- and task-specific policy memory, and decides Allow, Mask, or Ask before raw screenshots leave a trusted user- or organization-controlled environment. In five designed skill-evolution scenarios, it turns corrections, cancellations, and edits into reusable privacy skills checked by a sandbox gate. We introduce P-GUI-Evo, a benchmark built from real UI patterns, reconstructed HTML screens, and sanitized labels. Experiments show that pattern matching, cloud reasoning, and routing alone tend to over-confirm, over-mask, or expose raw screenshots under the same protocol. The artifact is available at https://github.com/Theodora-Y/MaskClaw.


cs.AI

[164] Diffusion Large Language Models for Visual Speech Recognition cs.AI | cs.CV | eess.AS

Jeong Hun Yeo, Chae Won Kim, Hyeongseop Rha, Yong Man Ro

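TL;DR: This paper proposes DLLM-VSR, the first visual speech recognition (VSR) framework based on a diffusion large language model. It uses iterative masked denoising and flexible-order decoding to address the premature decisions that conventional autoregressive decoding makes on visually ambiguous tokens. Confidence-based unmasking commits high-confidence positions early and refines ambiguous tokens with bidirectional context, and a two-stage masked-denoising training strategy plus length-guided candidate decoding further improve performance. Using only the labeled training data of LRS3, the method reaches a word error rate of 19.5%, a new state of the art.

Details

Motivation: Existing VSR systems rely on left-to-right autoregressive decoding, which tends to commit too early on visually ambiguous tokens when context is insufficient, causing errors to accumulate. The paper aims to use the iterative denoising of diffusion LLMs to exploit context more flexibly and defer decisions.

Result: On the LRS3 benchmark, the method reaches a 19.5% word error rate using only LRS3's labeled training data, a state-of-the-art result. Experiments indicate that length-guided decoding narrows the performance gap caused by target-length uncertainty.

Insight: Contributions include: formulating VSR as iterative masked denoising with flexible-order decoding; a confidence-based unmasking mechanism that uses bidirectional context for refinement; a two-stage training strategy that separates visual-to-text alignment from length modeling; and length-guided candidate decoding that uses video-duration-based hypotheses to reduce length uncertainty. These ideas suggest new directions for applying diffusion models to sequence generation.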

Abstract: Existing Visual Speech Recognition (VSR) systems commonly rely on left-to-right autoregressive decoding, which can force premature decisions on visually ambiguous tokens before sufficient context is available. We propose DLLM-VSR, to the best of our knowledge, the first Diffusion Large Language Model (DLLM)-based VSR framework, formulating transcription as iterative masked denoising with flexible-order decoding. With confidence-based unmasking, DLLM-VSR commits high-confidence positions early and uses the committed tokens as bidirectional context to refine ambiguous ones. To adapt DLLMs to VSR, we introduce a two-stage masked-denoising training strategy that separates visual-to-text content alignment from length modeling. We further observe a performance gap with oracle-length decoding, which assumes access to the true transcript length, indicating that reducing target-length uncertainty can improve DLLM-based VSR. To reduce this gap, we develop length-guided candidate decoding, which uses video duration to construct plausible transcript-length hypotheses, decodes under multiple hypotheses, and reranks candidates using length plausibility and decoding confidence. The proposed method achieves a state-of-the-art WER of 19.5% on LRS3 using only its labeled training data.
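
The confidence-based unmasking loop described above can be sketched in a few lines. This is a minimal illustration, not the DLLM-VSR decoder: `confidence_unmask`, `toy_predict`, the `per_step` parameter, and the fixed toy confidences are hypothetical stand-ins for a real diffusion LLM conditioned on video features.

```python
# Illustrative sketch only; the predictor below is a hypothetical stand-in
# for a diffusion LLM and returns fixed toy confidences.

MASK = None  # marker for a position that has not been committed yet


def confidence_unmask(length, predict, per_step=1):
    """Iteratively commit the highest-confidence masked positions.

    `predict(tokens)` must return {position: (token, confidence)} for every
    masked position, given the tokens committed so far as context.
    Returns the final tokens and the order in which positions were committed.
    """
    tokens = [MASK] * length
    order = []
    while any(t is MASK for t in tokens):
        proposals = predict(tokens)
        if not proposals:          # guard against a predictor that gives up
            break
        ranked = sorted(proposals.items(), key=lambda kv: -kv[1][1])
        for pos, (tok, _conf) in ranked[:per_step]:
            tokens[pos] = tok      # committed tokens become context next round
            order.append(pos)
    return tokens, order


# Toy predictor: the middle word is "visually ambiguous" (low confidence).
TARGET = ["the", "cat", "sat"]
BASE_CONF = [0.9, 0.3, 0.6]


def toy_predict(tokens):
    return {i: (TARGET[i], BASE_CONF[i])
            for i, t in enumerate(tokens) if t is MASK}


decoded, commit_order = confidence_unmask(3, toy_predict)
```

In this toy run the ambiguous middle position is committed last, after both neighbours are fixed, which is the non-left-to-right behaviour the abstract contrasts with autoregressive decoding.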


[165] Why LLMs Fail at Causal Discovery and How Interventional Agents Escape cs.AI | cs.CL

Amartya Roy, Sonali Parbhoo

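TL;DR: This paper examines the root cause of large language models' poor performance on causal discovery and proposes an interventional agent method, Agentic Causal Bayesian Optimization (A-CBO), to address it. The authors prove that supervised fine-tuning, direct preference optimization, and in-context learning have inherent theoretical limits and cannot distinguish causal graphs that generate similar observational data. A-CBO sidesteps this limit by using a frozen LLM as an oracle for queries about intervention effects, combined with an external Bayesian optimization loop, achieving reliable causal discovery without training.

Details

Motivation: The work seeks the root cause of LLMs' poor causal discovery performance and addresses the inherent theoretical limits of conventional LLM learning paradigms (such as fine-tuning and preference optimization), which cannot reliably distinguish different causal structures.

Result: On the Corr2Cause benchmark, A-CBO matches fine-tuned baselines without any training. On Extended Corr2Cause, a new benchmark from the authors that scales to 24 variables and 18K test samples, A-CBO significantly outperforms fine-tuning and preference optimization, with the advantage widening as task complexity increases.

Insight: The core contribution is showing that conventional LLM learning paradigms face a fundamental theoretical barrier in causal discovery (the kernel obstruction theorem), and proposing a new "agentic" paradigm (A-CBO) that treats the LLM as a fixed module for querying intervention effects while placing the graph-inference logic in a Bayesian optimization loop outside the model, thereby bypassing the barrier. This offers a new architecture for reliable causal reasoning with existing LLMs.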

Abstract: Causal discovery is a cornerstone of scientific reasoning, yet whether large language models can perform it reliably remains an open question. Recent benchmarks show that even fine-tuned models plateau on simple causal graphs and degrade as complexity grows, but why they fail has not been established. We prove the failure is fundamental: supervised fine-tuning, direct preference optimization, and in-context learning all produce predictors that cannot distinguish between causal graphs generating similar observational data, and any attempt to do so requires the model’s internal representations to grow unboundedly, violating the very conditions under which these methods work. We formalize this as a kernel obstruction theorem, establishing that the limitation is intrinsic to the learning paradigm, not any particular model or dataset. We propose Agentic Causal Bayesian Optimization (A-CBO), wherein a frozen language model serves as an interventional oracle answering targeted queries about intervention effects, while an external Bayesian loop concentrates beliefs over candidate graphs in logarithmically many rounds. Because the decision operates outside the space where the obstruction applies, A-CBO provably converges while the underlying model remains unchanged. On Corr2Cause, A-CBO matches fine-tuned baselines without any training. On Extended Corr2Cause, a new benchmark scaling to 24 variables with 18K test samples, A-CBO significantly outperforms both fine-tuning and preference optimization, with the advantage growing


[166] Revealing Algorithmic Deductive Circuits for Logical Reasoning cs.AI | cs.CL

Phuong Minh Nguyen, Tien Huu Dang, Naoya Inoue

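TL;DR: The paper studies how large language models (LLMs) understand abstract reasoning steps and the overall algorithm when performing few-shot logical reasoning. By aligning reasoning steps with token logits and applying causal mediation analysis, it localizes the attention heads responsible for individual reasoning steps and characterizes the types of information passed among them. It finds that token positions steering the reasoning process are associated with low confidence scores, and that LLMs retrieve factual and rule-based information through specialized attention heads (about 3% of all heads), while higher layers mainly support information integration and the emergence of global reasoning strategies such as graph traversal algorithms.

Details

Motivation: LLMs can achieve strong few-shot reasoning performance by incorporating functional symbolic representations that abstractly describe graph traversal algorithms and step-by-step reasoning, but it remains unclear how a model genuinely understands the abstract meaning of each reasoning step and the overall algorithm from only a limited number of demonstrations.

Result: Under a symbolic-aided Chain-of-Thought (CoT) prompting framework, token positions that steer the reasoning process are associated with low confidence scores caused by constraints on satisfying the reasoning behavior patterns in the demonstrations. Causal mediation analysis identifies the attention heads responsible for these patterns.

Insight: The novelty lies in aligning reasoning steps with token logits and using causal mediation analysis to localize and characterize the attention mechanisms responsible for reasoning. The analysis reveals a division of labor in complex reasoning between specialized attention heads (retrieving information for sub-tasks) and higher layers (integrating global strategy), offering a new view of models' internal reasoning mechanisms.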

Abstract: Recent studies have shown that Large Language Models (LLMs) can achieve strong reasoning performance by incorporating functional symbolic representations that abstractly describe graph traversal algorithms and step-by-step reasoning in few-shot learning settings. However, it remains unclear how LLMs genuinely understand the abstract meaning of each reasoning step and the overall algorithm from only a limited number of demonstrations. This work aims to localize the attention heads responsible for individual reasoning steps and characterize the types of information transferred among them. We first align constituent reasoning steps with their corresponding token logits under a symbolic-aided Chain-of-Thought (CoT) prompting framework. Our analysis shows that token positions that steer the reasoning process are associated with low confidence scores caused by constraints on satisfying reasoning behavior patterns in demonstrations. We then adopt causal mediation analysis techniques to identify the attention heads responsible for these patterns. In addition, our findings indicate that LLMs retrieve factual and rule-based information for individual sub-reasoning tasks through specialized attention heads (approximately 3% total heads), whereas higher layers predominantly facilitate information integration and the emergence of global reasoning strategies (e.g., graph traversal algorithms) that coordinate multiple intermediate reasoning steps to solve the overall task.


[167] MemCog: From Memory-as-Tool to Memory-as-Cognition in Conversational Agents cs.AI | cs.CL

Zihan Li, Xingyu Fan, Feifei Li, Wenhui Que

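TL;DR: This paper proposes MemCog, a system that shifts the memory paradigm of conversational agents from "Memory-as-Tool" to "Memory-as-Cognition". By building a navigable memory store, a cross-dimensional navigation interface, and a proactive reasoning protocol, it makes memory access an integral part of the reasoning process. Experiments show that MemCog reaches state of the art on passive QA benchmarks and substantially outperforms baselines on ProactiveMemBench, a newly constructed benchmark for proactive memory triggering.

Details

Motivation: Existing agent memory systems generally follow the Memory-as-Tool paradigm, which suffers from passive invocation, decoupling of reasoning from retrieval, and a structural mismatch between retrieved fragments and the agent's navigational needs.

Result: MemCog achieves state-of-the-art scores of 92.98 on LoCoMo and 95.8 on LongMemEval, both passive QA benchmarks, and substantially outperforms baselines on the newly proposed ProactiveMemBench.

Insight: The core contribution is elevating the memory system from a passive retrieval tool to an active cognitive component: knowledge is organized as an associative link graph, traversal is driven by multi-step reasoning, and a protocol spontaneously triggers memory exploration from conversational context. This offers a new way to address reasoning-retrieval decoupling.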

Abstract: Existing agent memory systems universally follow what we term a Memory-as-Tool paradigm, in which a single query triggers one-shot retrieval of flat passage lists; this paradigm suffers from passive invocation, reasoning-retrieval decoupling, and structural mismatch between retrieved fragments and the agent’s navigational needs. We propose MemCog, a Memory-as-Cognition system that makes memory access an integral part of the reasoning process. MemCog organizes user knowledge as a Navigable Memory Store with associative link graphs, exposes a Cross-Dimensional Navigation Interface for multi-step reasoning-driven traversal, and employs a Proactive Reasoning Protocol that drives agents to spontaneously initiate memory exploration from conversational context. We additionally construct ProactiveMemBench, the first benchmark for evaluating proactive memory triggering. Experiments show that MemCog achieves state-of-the-art results on passive QA benchmarks (92.98 on LoCoMo, 95.8 on LongMemEval) while substantially outperforming baselines on ProactiveMemBench, demonstrating the advantage of Memory-as-Cognition.


[168] Explaining is Harder Than Predicting Alone: Evaluating Concept-based Explanations of MLLMs as ICL Visual Classifiers cs.AI | cs.CL | cs.LG | cs.LO | cs.MA

Carmen Quiles-Ramírez, Leticia L. Rodríguez, Nicolás Martorell, Natalia Díaz-Rodríguez

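TL;DR: This paper systematically evaluates the concept-based explainability of frozen multimodal large language models (MLLMs) under few-shot in-context learning (ICL), using five conditions of increasing rigor that range from baseline classification to Description Logics axiom generation. It finds that explaining is harder than predicting alone: forcing models to generate formal, concept-based explanations lowers predictive accuracy monotonically (from 93.8% to 90.1%), yet when models successfully articulate class-discriminative visual features, explanation quality correlates strongly with correct predictions.

Details

Motivation: How MLLMs use the provided context to classify under few-shot ICL is unclear, and their internal computation is opaque. The paper evaluates models' concept-based explanation ability in order to shed light on their reasoning.

Result: Across four state-of-the-art MLLMs, evaluated with an independent LLM-as-a-judge pipeline, predictive accuracy falls from 93.8% to 90.1%, indicating that formal explanation hurts performance. Explanation quality nonetheless correlates strongly with correct predictions when visual features are successfully articulated, exposing the models' shortcomings in formal explainability.

Insight: The novelty is a systematic evaluation of MLLMs' concept-based explainability using progressively stricter conditions and an LLM-as-a-judge method. The analysis finds that MLLMs excel at visual classification but lack the specific instruction tuning needed for formal, machine-verifiable explanations, challenging the assumption that explicit reasoning universally improves performance.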

Abstract: In-context learning (ICL) enables multimodal large language models (MLLMs) to classify images from a few labelled examples. Yet, how these models use the provided context remains opaque. While Chain-of-Thought prompting is widely used, recent work argues that it may not reflect true internal computation. In this paper, we systematically evaluate the concept-based explainability of frozen MLLMs under few-shot ICL using five conditions of increasing formal rigour, ranging from baseline classification to Description Logics (DL) axiom generation. Evaluating four state-of-the-art MLLMs via an independent LLM-as-a-judge pipeline, we demonstrate that explaining is genuinely harder than predicting alone. Surprisingly, forcing models to generate formally structured, concept-based explanations degrades predictive accuracy monotonically (from 93.8% to 90.1%), contradicting the assumption that explicit reasoning universally aids performance. However, when models successfully articulate class-discriminative visual features, explanation quality strongly correlates with correct predictions. Our findings suggest that while MLLMs excel at visual classification, they lack the specific instruction-tuning required for formal, machine-verifiable explainability.


[169] Where Rollouts Begin: Low-Load, High-Leverage First-Token Diversification for RLVR cs.AI | cs.CL | cs.LG

Soeun Kim, Albert No

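TL;DR: This paper addresses the exploration-diversity bottleneck in Reinforcement Learning with Verifiable Rewards (RLVR) and proposes REFT (Rollout Exploration with First-Token Diversification), a lightweight method. It increases rollout diversity by sampling the first token after the reasoning marker uniformly from the policy's own top-N candidates, without altering the correctness signal. Experiments show that REFT improves Pass@1, Pass@8, and Pass@64 across multiple base models and difficulty levels.

Details

Motivation: Rollout diversity is a central bottleneck when training reasoning models with RLVR. Existing methods broaden exploration mainly by adjusting temperature, prefixes, or rollout selection, but the authors find that the first token after the reasoning marker is an overlooked yet structurally distinctive position: its distribution is sharply peaked and decoupled from correctness, so it can widen exploration efficiently.

Result: Across four base models (0.5B-7B) and three difficulty regimes, REFT outperforms the DAPO and GRPO baselines on aggregate Pass@1, Pass@8, and Pass@64.

Insight: The novelty lies in identifying and exploiting the properties of the first-token distribution after the reasoning marker (sharply peaked and decoupled from correctness) as a low-cost, high-leverage exploration point. REFT is lightweight: it changes only first-token sampling and rollout allocation and leaves the rest of the RLVR pipeline untouched, yet it improves model performance and suggests a new direction for exploration strategies in RLVR.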

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) trains reasoning models without labeled trajectories, relying on grouped rollouts to expose the policy to alternative reasoning paths and a verifier to score them. Rollout diversity has accordingly emerged as a central bottleneck in RLVR, with most existing methods broadening exploration through temperature, prefix, or rollout-selection adjustments. We identify a structurally distinguished but overlooked position for broadening this diversity: the first token after the reasoning marker. The policy’s first-token distribution exhibits a sharply peaked yet correctness-decoupled phenomenon, and this first token position can broaden the regions a rollout group covers without altering the correctness signal. We introduce REFT (Rollout Exploration with First-Token Diversification), a light addition to the RLVR pipeline that samples first tokens uniformly from the policy’s own top-$N$ candidates and allocates rollouts evenly, leaving every other component unchanged. Trained on the resulting diversified rollouts, REFT improves aggregate Pass@1, Pass@8, and Pass@64 over DAPO and GRPO baselines across four base models (0.5B-7B) and three difficulty regimes.
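
A rough sketch of the first-token step may help make the idea concrete. It assumes the policy's first-token probabilities are already available as a dictionary and reads "allocates rollouts evenly" as a round-robin assignment over the top-$N$ tokens; the function names and the toy distribution are hypothetical, and the paper's exact allocation rule may differ.

```python
# Illustrative sketch only; names and the round-robin reading of
# "allocates rollouts evenly" are assumptions, not the REFT code.

def top_n_tokens(first_token_probs, n):
    """Return the n most probable first tokens (ties broken alphabetically)."""
    ranked = sorted(first_token_probs.items(), key=lambda kv: (-kv[1], kv[0]))
    return [tok for tok, _prob in ranked[:n]]


def allocate_first_tokens(first_token_probs, n, group_size):
    """Assign a forced first token to each rollout in a group.

    Tokens are spread evenly over the top-n candidates, ignoring how peaked
    the policy's own distribution is. Everything after the first token would
    be sampled from the policy as usual.
    """
    candidates = top_n_tokens(first_token_probs, n)
    if not candidates:
        return []
    return [candidates[i % len(candidates)] for i in range(group_size)]


# Toy, sharply peaked first-token distribution after the reasoning marker.
probs = {"First": 0.90, "Let": 0.06, "We": 0.03, "To": 0.01}
starts = allocate_first_tokens(probs, n=3, group_size=6)
```

With ordinary sampling nearly every rollout in this toy group would open with "First"; the even allocation gives each of the three top candidates two rollouts.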


[170] Risk-Controlled Lean-as-Judge for Natural-Language Mathematical Reasoning cs.AI | cs.CL | cs.LO

Pauline Bourigault, Xiaotong Ji, Matthieu Zimmer, Rasul Tutunov, Haitham Bou Ammar

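TL;DR: This paper studies the limits of using the Lean theorem prover to evaluate natural-language mathematical answers, showing that its signal is partial and unreliable, and proposes COVCAL, which uses selective risk control to either certify the accuracy of accepted answers or abstain, so that a partial formal signal can be trusted under specific conditions (for example, with a specialized autoformalizer).

Details

Motivation: When Lean is used as a judge for natural-language mathematical reasoning, its signal is incomplete and unreliable: answers may fail to formalize, and a failed proof may stem from a type error or a missing library fact and not from a wrong answer, which limits Lean's validity as a judging standard.

Result: On MATH-500, the proof-winning answer is correct 96% of the time at high coverage but only 20% at low coverage. With a 7B autoformalizer, only 28% of problems can be proved, and only about 43% of those proofs are faithful. With a specialized autoformalizer that reaches 79% coverage, COVCAL is feasible on 17 of 20 bootstrap partitions, accepting about 48% of problems at 0.98 accepted accuracy.

Insight: The novelty is COVCAL, a selector over Lean-trace diagnostics that controls risk through finite-sample selective-risk bounds (a conservative Bonferroni bound and a tighter dev-then-cal rule), giving a precise theoretical and practical account of when a partial formal signal can be trusted, which depends in particular on autoformalizer coverage.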

Abstract: Lean is increasingly used to judge natural-language mathematical answers, but its signal is partial: many answers never formalize, and a failed proof may reflect an ill-typed statement or a missing library fact, not a wrong answer. On MATH-500 we show this signal is (i) sharply coverage-dependent (the proof-winning answer is correct 96% of the time at high proved coverage but only 20% at low coverage) and (ii) sparse and often unfaithful (a 7B autoformalizer proves a class for only 28% of problems, and a manual audit finds only approximately 43% of those proofs faithful). We propose COVCAL, a selector over Lean-trace diagnostics that certifies a finite-sample selective-risk bound on accepted answers or abstains, under two regimes (a conservative Bonferroni bound and a tighter dev-then-cal rule). Feasibility depends on autoformalization coverage: with the 7B formalizer the signal is too sparse and Bonferroni abstains on all 20 bootstrap partitions, whereas a prover-specialized formalizer reaches 79% coverage and flips it to feasible on 17 of 20, accepting approximately 48% of problems at 0.98 accepted accuracy. Since self-consistency alone is already 91% accurate, our contribution is a precise account of when, and with which formalizer, a partial formal signal can be trusted under risk control.


[171] Cultural Binding Heads in Language Models cs.AI | cs.CL | cs.LG

Avrile Floro, Luca Benedetto

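TL;DR: Using mechanistic interpretability on the N4 cultural appropriation benchmark, the paper identifies specific attention heads responsible for cultural binding in language models (2-3 mid-layer heads per model). It finds that cultural binding, the process of associating cultural items with the appropriate identity, forms mainly during pre-training, and that knocking out these heads weakens binding strength. Generation-time steering via α-scaling improves cultural differentiation accuracy, and knowledge probing shows that models know far more than they actually use, placing the bottleneck in routing.

Details

Motivation: Language models lack difference awareness when handling cultural groups: they tend to default to equal treatment across cultures even when context calls for differentiation, which the paper attributes to weak cultural binding.

Result: On the N4 cultural appropriation benchmark, knocking out the identified attention heads lowers cultural binding strength by 9-23%. Moderate amplification steering at generation (α = 2-3) raises cultural differentiation accuracy by 1-3 percentage points while largely preserving neutral reasoning. Knowledge probing shows that models know 3-5 times more than they act on.

Insight: The novelty lies in using mechanistic interpretability to localize the attention heads responsible for cultural binding and in showing that this ability originates mainly in pre-training. In practice, lightweight steering (α-scaling) is enough to improve cultural sensitivity, and the bottleneck is identified as knowledge routing, not knowledge itself, which offers a new angle on cultural alignment.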

Abstract: LLMs often default to equal treatment across cultural groups, even though context warrants differentiation: this is a lack of difference awareness. Using mechanistic interpretability and a factorial design on the N4 cultural appropriation benchmark from Wang et al. (2025), we identify 2-3 mid-layer attention heads per model that contribute causally to cultural binding across eight models (four architectures, base and instruct). Cultural binding is the process of associating cultural items with the appropriate identity. Knockout of the identity-to-item edges on these heads lowers the binding strength by 9-23%. The identified heads transfer from instruct to base models, suggesting that cultural binding is created at pre-training. $\alpha$-scaling shows a graded dose-response, and moderate amplification steering at generation ($\alpha = 2$-$3$) increases cultural differentiation accuracy by 1-3 pp while leaving neutral reasoning mostly intact. A knowledge probing task shows that models know 3-5 times more than they act upon, indicating that the bottleneck lies in routing and not knowledge.


[172] Satisfiability Solving with LLMs: A Matched-Pair Evaluation of Reasoning Capability cs.AI | cs.CL | cs.LO

Leizhen Zhang, Shuhan Chen, Sheng Chen

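TL;DR: This paper systematically evaluates the reasoning ability of large language models (LLMs) on Boolean satisfiability (SAT), focusing on 2-SAT and 3-SAT and their reductions to Vertex Cover and discrete 3D packing. It finds that conventional evaluation metrics are misleading and therefore proposes a paired-formula evaluation protocol and the Accurate Differentiation Rate (ADR) to measure reasoning ability more reliably.

Details

Motivation: LLMs are increasingly used for tasks that reduce to SAT, but their actual reasoning ability on SAT is unclear; a more reliable evaluation is needed to tell whether a model is genuinely reasoning or relying on heuristics.

Result: Under conventional metrics and in the SAT phase-transition setting, many models score highly by over-predicting satisfiable formulas but fail to reproduce the classical easy-hard-easy pattern around the 3-SAT threshold, and their performance drops sharply as the number of variables grows. ADR better separates reasoning-oriented models from heuristic ones, and for most models the decisions on CNF and on the corresponding graph or packing instances agree on more than 80% of instances.

Insight: The novelty is an evaluation protocol built on minimally different satisfiable/unsatisfiable instance pairs, together with the ADR metric, which assesses LLM reasoning more faithfully than conventional metrics. Cross-representation consistency tests further show that model decision rules are stable across problem representations, suggesting that SAT can serve as a conservative probe of LLM reasoning.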

Abstract: Large language models (LLMs) are increasingly used for tasks that implicitly reduce to Boolean satisfiability (SAT), yet their reasoning ability on SAT remains unclear. We present a systematic study of LLMs on 2-SAT and 3-SAT, together with two canonical reductions, Vertex Cover and discrete 3D packing, to probe representation-invariant reasoning. We first evaluate models using conventional metrics, including accuracy, precision, recall, and F1, as well as the SAT phase-transition setting. We find that these metrics can be misleading: many models obtain high scores by over-predicting satisfiable formulas, fail to reproduce the classical easy-hard-easy signature around the 3-SAT threshold, and degrade sharply as the number of variables grows. To address this problem, we introduce a paired-formula protocol based on minimally different satisfiable and unsatisfiable instances, together with Accurate Differentiation Rate (ADR), which requires both members of each pair to be classified correctly. ADR separates reasoning-oriented models from heuristic ones and correlates with witness validity. Beyond CNF, we test cross-representation consistency by converting CNF to Vertex Cover and 3-SAT to discrete 3D packing. Model decisions on CNF and on the corresponding graph or packing instances agree for most models on more than 80 percent of instances, suggesting stable decision rules across representations. Overall, our results show that SAT is a conservative probe for LLM reasoning, and that paired evaluation with ADR provides a more faithful and representation-robust assessment than conventional metrics.
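
The difference between plain accuracy and ADR can be shown with a short sketch. It follows the abstract's definition (a pair counts only if both its satisfiable and its unsatisfiable member are classified correctly); the function names, the pair encoding, and the toy predictions are assumptions for illustration.

```python
# Illustrative sketch only; the data layout is an assumption.
# Each pair is (prediction_for_SAT_member, prediction_for_UNSAT_member),
# where True means the model answered "satisfiable".

def accuracy(pairs):
    """Per-instance accuracy over all 2 * len(pairs) formulas."""
    correct = sum((p_sat is True) + (p_unsat is False) for p_sat, p_unsat in pairs)
    return correct / (2 * len(pairs))


def adr(pairs):
    """Accurate Differentiation Rate: both members of a pair must be correct."""
    both = sum(1 for p_sat, p_unsat in pairs if p_sat is True and p_unsat is False)
    return both / len(pairs)


# A model that always answers "satisfiable" on four matched pairs.
always_sat = [(True, True)] * 4

# A model that gets some pairs fully right and some half right.
mixed = [(True, False), (True, True), (False, False), (True, False)]
```

The always-"satisfiable" model reaches 50% accuracy on a balanced set (and would score higher on a set skewed toward satisfiable formulas) while its ADR is zero, which is the over-prediction effect the abstract describes.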


[173] Adaptive Multimodal Agents-Based Framework for Automatic Workflow Execution cs.AI | cs.CL

Susanna Cifani, Mario Luca Bernardi, Marta Cimitile

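TL;DR: This paper proposes a novel multimodal multi-agent framework for automatic workflow execution. It uses a two-phase pipeline of offline discovery and online inference: the offline phase adaptively builds a topological knowledge base from fragmented execution logs; during online inference, agents apply adaptive retrieval-augmented generation over the fixed, pre-built graph, coupled with a closed-loop collaborative verification protocol for dynamic self-correction and navigation.

Details

Motivation: Current methods struggle with the transition from structured metadata parsing to general environmental perception and typically treat task sequences as discrete, linear episodes. This fragmentation prevents agents from capturing the underlying transition topology and limits their effectiveness in novel or non-stationary scenarios.

Result: The framework is validated in a real-world setting; the authors report that it maintains high reliability and semantic awareness even with limited training data, with strong task decomposition and adaptive navigation performance.

Insight: The core contribution is modeling workflow execution as the construction and use of a graph-based topological knowledge base, combined with adaptive RAG and a closed-loop collaborative verification protocol, enabling robust, adaptive navigation of complex, non-stationary workflows in place of conventional linear task processing.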

Abstract: Modern information systems require autonomous agents capable of navigating complex workflows, yet current methodologies often struggle with the transition from structured metadata parsing to general environmental perception. While the integration of MLLMs has enabled agents to interact directly with GUIs, existing approaches typically treat task sequences as discrete, linear episodes. This fragmentation prevents agents from capturing the underlying transition topology, limiting their effectiveness in novel or non-stationary scenarios. To address this, we propose a novel multimodal multi-agent framework that achieves automatic workflow execution through a distinct two-phase pipeline. First, during an offline discovery phase, the architecture adaptively constructs a topological knowledge base from fragmented execution logs. During inference, agents leverage Adaptive Retrieval-Augmented Generation (RAG) over this fixed, pre-established graph, coupled with a closed-loop collaborative verification protocol to dynamically self-correct and navigate. This graph-based approach facilitates superior task decomposition and adaptive navigation performance. We validate our framework in a real-world context, demonstrating its ability to maintain high reliability and semantic awareness even with limited training data.


[174] The Importance of Being Statistically Earnest: A Critical Re-evaluation of GSM-Symbolic cs.AI | cs.CL

Dominika Agnieszka Długosz, Arlindo Oliveira, Natalia Díaz Rodríguez

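TL;DR: This paper critically re-evaluates the conclusions of the GSM-Symbolic benchmark, arguing that its claim that large language models lack genuine reasoning ability is statistically unreliable. Re-evaluating 20 open-weight models with Generalised Linear Mixed Models, it finds that only half show statistically significant performance changes under the original prompt format, and it identifies a confound in the dataset itself: a shifted distribution of large integers.

Details

Motivation: The paper questions the statistical rigor of the GSM-Symbolic benchmark and the sweeping conclusions about LLM reasoning drawn from it, arguing that they rest on weak statistical foundations.

Result: Among the 20 open-weight models, only about half show statistically significant performance changes on GSM-Symbolic variants; after controlling for the systematic shift in the distribution of large integers, about half of those significant cases are no longer significant.

Insight: The novelty is applying more rigorous statistics (generalised linear mixed models) to benchmark evaluation and exposing a dataset construction bias (the shifted distribution of large integers) as a key confound, stressing the value of model-specific, mechanistic analysis of LLM failure modes over blanket claims.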

Abstract: The GSM-Symbolic benchmark (Mirzadeh et al., 2025) reported consistent performance drops across 25 Large Language Models (LLMs) when tested on template-generated variants of GSM8K problems, concluding that the models lack genuine reasoning capabilities. We argue that this conclusion rests on shaky statistical ground. Re-evaluating 20 open-weight models using Generalised Linear Mixed Models with per-question random effects, we find that only half exhibit statistically significant performance changes under the original prompt format. Moreover, we identify a previously unacknowledged factor: the main GSM-Symbolic dataset contains a systematically shifted distribution of larger integers in problem texts relative to GSM-Base (K-S statistic = 0.12, p < 0.001), contradicting the original authors’ claims. Controlling for this large number effect accounts for significance in roughly half the remaining cases. Among models with statistically significant performance deltas, we identify distinct, model-specific failure profiles - including fragility of variable binding, arithmetic limitations, and dual-task interference - underscoring that blanket claims about LLM reasoning are both statistically premature and mechanistically misleading.
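
For readers unfamiliar with the K-S statistic quoted above, the sketch below computes a two-sample Kolmogorov-Smirnov statistic as the largest gap between two empirical distribution functions. It assumes the two-sample form of the test is the relevant one, uses made-up toy values in place of the integers extracted from GSM-Base and GSM-Symbolic, and leaves out the p-value computation.

```python
# Illustrative sketch only; toy data, two-sample form assumed, no p-value.
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample K-S statistic: max |F_a(x) - F_b(x)| over all observed x."""
    a, b = sorted(sample_a), sorted(sample_b)
    d = 0.0
    for x in sorted(set(a) | set(b)):
        f_a = bisect_right(a, x) / len(a)   # empirical CDF of sample_a at x
        f_b = bisect_right(b, x) / len(b)   # empirical CDF of sample_b at x
        d = max(d, abs(f_a - f_b))
    return d


# Toy stand-ins for "integers appearing in problem texts".
base_numbers = [1, 2, 3, 4]
shifted_numbers = [3, 4, 5, 6]   # same shape, shifted toward larger values

d_shift = ks_statistic(base_numbers, shifted_numbers)
d_same = ks_statistic(base_numbers, base_numbers)
```

A statistic of 0 means the two empirical distributions coincide, and larger values mean a larger maximum gap; the reported 0.12 is therefore a modest gap that the abstract reports as significant (p < 0.001).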


cs.IR [Back]

[175] RE-TRIANGLE: Does TRIANGLE Enable Multimodal Alignment Beyond Cosine Similarity in Retrieval? cs.IR | cs.AI | cs.CVPDF

Arijit Ghosh, Aritra Bandyopadhyay, Chiranjeev Bindra, Jingfen Qiao

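TL;DR: This paper is a reproducibility study of the TRIANGLE framework. TRIANGLE enforces holistic alignment by minimizing the area of modality triplets on a hypersphere, addressing the lack of mutual consistency between peripheral modalities under traditional pairwise strategies. The study verifies the robustness of this geometric objective for retrieval tasks and confirms that it outperforms pairwise baselines in zero-shot settings, but it fails to reproduce the learning-from-scratch results and analyzes the optimization instability behind this.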

Details

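Motivation: Traditional pairwise strategies for multimodal alignment have a geometric blind spot: they align an anchor modality (e.g., text) with the others but place no constraint on mutual consistency between peripheral modalities (e.g., video and audio). TRIANGLE aims to address this with a geometric objective, and this study sets out to verify its effectiveness and reproducibility.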

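Result: In zero-shot retrieval, TRIANGLE outperforms pairwise baselines, with Recall@1 gains of up to +8.7 points, though the benefits are domain-dependent. The learning-from-scratch results reported in the original paper could not be reproduced.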

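Insight: The contribution is a validation of the triplet-area geometric alignment objective for zero-shot retrieval. The analysis shows that jointly optimizing geometric alignment with the Data-Text Matching (DTM) loss is unstable, that cosine regularization primarily stabilizes text-to-video retrieval, and that fine-tuning with domain supervision amplifies the geometric benefits but reduces cross-dataset generalization.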

Abstract: Multimodal alignment is critical for bridging the semantic gap in information retrieval. However, traditional pairwise strategies introduce a geometric blind spot: while they align anchor modalities (e.g., text) with others, they lack constraints to enforce mutual consistency between peripheral modalities (e.g., video and audio). The TRIANGLE framework addresses this by minimizing the area of modality triplets on a hypersphere to enforce holistic alignment. In this reproducibility study, we verify the robustness of this geometric objective for retrieval tasks. We confirm that TRIANGLE outperforms pairwise baselines in zero-shot settings, achieving Recall@1 gains of up to +8.7 points, though benefits are domain-dependent. However, we fail to reproduce the reported learning-from-scratch results. Analysis using a synthetic toy dataset attributes this to instability when jointly optimizing geometric alignment with Data-Text Matching (DTM) loss. Furthermore, we find that cosine regularization primarily stabilizes text-to-video retrieval, and fine-tuning with domain supervision amplifies geometric benefits but reduces cross-dataset generalization. Our findings support the efficacy of geometric alignment while highlighting critical optimization sensitivities. Code available at https://github.com/ARIJIT00171/RE-TRIANGLE.
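
To make the geometric objective concrete, the sketch below computes the area of the triangle whose vertices are three L2-normalized embeddings, using the Gram-determinant formula. This is an assumption about the general shape of such a loss, not the exact formulation used by TRIANGLE or by this study; the function names and toy vectors are hypothetical.

```python
import math


def normalize(v):
    """Project a vector onto the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def triangle_area(a, b, c):
    """Area of the triangle spanned by three unit-normalized embeddings."""
    a, b, c = normalize(a), normalize(b), normalize(c)
    u = [bi - ai for ai, bi in zip(a, b)]  # edge from a to b
    v = [ci - ai for ai, ci in zip(a, c)]  # edge from a to c
    uu = sum(x * x for x in u)
    vv = sum(x * x for x in v)
    uv = sum(x * y for x, y in zip(u, v))
    # Gram determinant; clamp at zero to absorb floating-point round-off.
    return 0.5 * math.sqrt(max(uu * vv - uv * uv, 0.0))


# Hypothetical text, video, and audio embeddings.
aligned = triangle_area([1.0, 0.0, 0.0], [2.0, 0.0, 0.0], [0.5, 0.0, 0.0])
misaligned = triangle_area([1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0])
print(aligned, misaligned)
```

The area reaches zero only when all three normalized embeddings coincide or are collinear, so minimizing it constrains the video-audio pair as well as each pair involving the text anchor. A purely pairwise anchor-based loss does not impose that constraint.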


cs.GR [Back]

[176] ClothTransformer: Unified Latent-Space Transformers for Scalable Cloth Simulation cs.GR | cs.CVPDF

Yu Zhang, Yidi Shao, Wenqi Ouyang, Yushi Lan, Zhexin Liang

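TL;DR: This paper presents ClothTransformer, a unified Transformer framework that reformulates cloth simulation as autoregressive sequence modeling in a latent space. By compressing arbitrary-resolution meshes into a fixed-size set of latent tokens, it makes dynamics computation independent of mesh resolution, and it uses a high-fidelity penetration-free dataset of about 493.4k frames to enable a differentiable continuous collision detection module that handles collisions effectively.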

Details

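Motivation: Existing neural cloth simulators are typically limited to a single scenario, tightly coupled to the mesh discretization, and lack robust collision handling. The paper investigates whether modern Transformer techniques can tackle the challenging task of cloth simulation and overcome these limitations.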

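Result: Across three scenarios (body-driven garments, robotic manipulation, and free-fall collisions), a single model achieves approximately 4 to 9 times lower error than prior state-of-the-art methods.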

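Insight: The contributions are: 1) a unified Transformer architecture that handles cloth simulation across multiple scenarios; 2) a scalable latent-space formulation that decouples dynamics computation from mesh resolution; 3) a large-scale penetration-free dataset together with an integrated differentiable continuous collision detection module that suppresses penetration artifacts.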

Abstract: Unified and scalable Transformers have recently achieved remarkable success in modeling diverse phenomena traditionally associated with computer graphics, such as 3D visual effects, rendering processes, and motion in videos. In this work, we take a step further by investigating whether modern Transformer techniques can tackle the challenging task of cloth simulation. To this end, we present ClothTransformer, a framework that reformulates cloth simulation as autoregressive sequence modeling in a learned latent space. Existing neural cloth simulators are largely specialized to single scenarios, intrinsically coupled to the mesh discretization, and lack robust collision handling. Our approach addresses these limitations through three contributions: (1) a unified Transformer architecture that handles diverse scenarios – body-driven garments, robotic manipulation, and free-fall collisions – under a single model and achieves approximately $4$–$9{\times}$ lower error than prior state-of-the-art methods across all scenarios; (2) a scalable latent-space formulation that compresses arbitrary-resolution meshes into a fixed-size set of latent tokens, making temporal dynamics computation independent of mesh resolution; and (3) a diverse-scenario high-fidelity penetration-free dataset of ${\sim}$493.4k frames spanning all three settings, which enables a differentiable Continuous Collision Detection (CCD) module to suppress penetration artifacts.
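
One common way to compress a variable number of mesh vertices into a fixed-size token set is cross-attention from a fixed bank of latent queries. The abstract does not say which mechanism ClothTransformer uses, so the sketch below is only an assumed illustration of the idea: the query bank, the dimensions, and the absence of learned key and value projections are all simplifications introduced here.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def compress(vertex_features, latent_queries):
    """Each latent query attends over all vertices and yields one pooled token."""
    scale = math.sqrt(len(latent_queries[0]))
    tokens = []
    for q in latent_queries:
        weights = softmax([dot(q, f) / scale for f in vertex_features])
        token = [
            sum(w * f[d] for w, f in zip(weights, vertex_features))
            for d in range(len(q))
        ]
        tokens.append(token)
    return tokens


rng = random.Random(0)
DIM, NUM_LATENTS = 4, 3  # hypothetical sizes, far smaller than a real model
queries = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_LATENTS)]
coarse_mesh = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(50)]
fine_mesh = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(800)]

coarse_tokens = compress(coarse_mesh, queries)
fine_tokens = compress(fine_mesh, queries)
print(len(coarse_tokens), len(fine_tokens))
```

Both meshes map to the same number of tokens, so any temporal model that consumes those tokens has a cost that does not grow with the vertex count; only the compression step itself scales with mesh resolution.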


cs.CY [Back]

[177] REC-CBM: Rubric-Aware Error-Correction Concept Bottleneck Models for Trustworthy Open-Ended Grading cs.CY | cs.AI | cs.CLPDF

Chengshuai Zhao, Fan Zhang, Kumar Satvik Chaudhary, Yiwen Li, Lo Pang-Yun Ting

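TL;DR: This paper proposes REC-CBM, a rubric-aware error-correction concept bottleneck model for open-ended grading. It introduces a rubric-aware concept encoder, an ordinal calibration objective, and a latent concept error-correction module to improve the performance, interpretability, and trustworthiness of automated grading.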

Details

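Motivation: Existing automated grading systems based on neural networks or large language models are typically black boxes whose scoring processes and rationales are difficult for educators to verify and trust. Standard concept bottleneck models (CBMs) fall short when applied to open-ended grading: they do not explicitly model fine-grained rubric dimensions, do not adequately capture the ordinal semantics of scoring scales, and neglect the inherent reliability issues in human concept annotations.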

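Result: Comprehensive experiments on publicly available datasets show that REC-CBM consistently outperforms state-of-the-art baselines in grading performance and produces more faithful concept-level reasoning. Further analyses validate the contribution of each component and demonstrate its applicability in realistic educational settings.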

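Insight: The contribution is a specialization of concept bottleneck models (CBMs) for open-ended grading, built around rubric-aware concept encoding, ordinal calibration, and a latent concept error-correction module that denoises noisy concept predictions while preserving interpretability. This offers a practical technical path toward transparent, trustworthy automated educational assessment.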

Abstract: Open-ended grading is central to equitable and personalized education, yet manual grading remains time-consuming and costly, underscoring the need for automated grading systems. Although recent neural and large language model (LLM)-based systems have demonstrated superior performance, they are typically black-box models whose scoring processes and rationales are difficult for educators to verify and trust. Concept bottleneck models (CBMs) have emerged as a promising approach by routing predictions through human-interpretable concepts, providing a mechanistic guarantee of transparency. However, standard CBMs are not tailored to open-ended grading: they do not explicitly model fine-grained rubric dimensions, inadequately capture the ordinal semantics of scoring scales, and neglect inherent reliability issues in human concept annotations. To address these limitations, we propose REC-CBM, a rubric-aware error-correction concept bottleneck model for trustworthy open-ended grading. REC-CBM introduces a rubric-aware concept encoder that learns concept-specific representations over responses and an ordinal pairwise calibration objective that preserves ranking structure among rubric dimensions. It further incorporates a latent concept error-correction module that denoises concept predictions before final grade prediction while preserving interpretability. Comprehensive experiments on publicly available datasets show that REC-CBM consistently improves grading performance and produces more faithful concept-level reasoning than state-of-the-art baselines. Further analyses validate the contribution of each component and demonstrate its applicability in realistic educational settings. Overall, this work provides a practical, interpretable grading solution that enables educators to inspect, intervene in, and trust automated decisions, advancing more transparent and trustworthy education.
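
The core property of a concept bottleneck, that the grade depends only on inspectable concept scores which an educator can override, can be shown in a few lines. The sketch below is a generic CBM readout with intervention, not REC-CBM itself: it omits the rubric-aware encoder, the ordinal calibration objective, and the error-correction module, and the rubric dimensions, weights, and scores are hypothetical.

```python
def predict_grade(concept_scores, weights, bias=0.0):
    """Final grade is a linear readout of the concept scores and nothing else."""
    return bias + sum(weights[name] * score for name, score in concept_scores.items())


def intervene(concept_scores, corrections):
    """Return a copy in which an educator's corrections replace predicted concepts."""
    updated = dict(concept_scores)
    updated.update(corrections)
    return updated


# Hypothetical rubric dimensions, readout weights, and predicted concept scores.
weights = {"thesis": 2.0, "evidence": 2.0, "clarity": 1.0}
predicted = {"thesis": 1.0, "evidence": 0.5, "clarity": 1.0}

grade_before = predict_grade(predicted, weights)
corrected = intervene(predicted, {"evidence": 1.0})
grade_after = predict_grade(corrected, weights)
print(grade_before, grade_after)
```

Because the readout sees only the concept scores, correcting a single concept changes the grade in a way the educator can trace directly to that rubric dimension.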