Table of Contents
- cs.CL [Total: 23]
- cs.CV [Total: 42]
- cs.SD [Total: 1]
- eess.IV [Total: 1]
- eess.AS [Total: 1]
- cs.AI [Total: 2]
- cs.LG [Total: 4]
- physics.optics [Total: 1]
- cs.MM [Total: 1]
- cs.RO [Total: 2]
cs.CL
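[1] BabyReasoningBench: Generating Developmentally-Inspired Reasoning Tasks for Evaluating Baby Language Models cs.CL | cs.AI
Kaustubh D. Dhole
TL;DR: This paper introduces BabyReasoningBench, a benchmark generated from classic developmental-psychology paradigms for evaluating the reasoning abilities of "baby language models" trained on child-directed corpora. GPT-2-based baby models show low and uneven overall performance on its 19 reasoning tasks, with dissociations across task families.
Motivation: Reasoning evaluations for language models rely mainly on adult-centric benchmarks that presuppose broad world knowledge, complex instruction following, and mature pragmatic competence. These assumptions do not fit baby language models trained on developmentally plausible input (such as child-directed speech and early-childhood narratives), and they obscure which reasoning abilities can emerge under such constraints.
Result: Two GPT-2-based baby language models, pretrained on 10 million and 100 million words of child-directed speech text, show low and uneven overall performance on BabyReasoningBench. Scaling improves some causal and physical reasoning tasks, while belief attribution (theory of mind) and pragmatics-sensitive tasks remain challenging.
Insight: The novelty is a developmentally inspired reasoning benchmark built specifically for baby language models. It gives a basis for analyzing which kinds of reasoning child-like training distributions support and for testing mechanistic hypotheses about how those abilities emerge, and it offers a new angle for evaluating resource-constrained or early-stage models.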
Abstract: Traditional evaluations of reasoning capabilities of language models are dominated by adult-centric benchmarks that presuppose broad world knowledge, complex instruction following, and mature pragmatic competence. These assumptions are mismatched to baby language models trained on developmentally plausible input such as child-directed speech and early-childhood narratives, and they obscure which reasoning abilities (if any) emerge under such constraints. We introduce BabyReasoningBench, a GPT-5.2 generated benchmark of 19 reasoning tasks grounded in classic paradigms from developmental psychology, spanning theory of mind, analogical and relational reasoning, causal inference and intervention selection, and core reasoning primitives that are known to be confounded by memory and pragmatics. We find that two GPT-2 based baby language models (pretrained on 10M and 100M of child-directed speech text) show overall low but uneven performance, with dissociations across task families: scaling improves several causal and physical reasoning tasks, while belief attribution and pragmatics-sensitive tasks remain challenging. BabyReasoningBench provides a developmentally grounded lens for analyzing what kinds of reasoning are supported by child-like training distributions, and for testing mechanistic hypotheses about how such abilities emerge.
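[2] LLMs versus the Halting Problem: Revisiting Program Termination Prediction cs.CL | cs.AI | cs.PL
Oren Sultan, Jordi Armengol-Estape, Pascal Kesseli, Julien Vanegue, Dafna Shahaf
TL;DR: This paper evaluates how well large language models (LLMs) predict program termination, using C programs from the Termination category of SV-Comp 2025. LLMs such as GPT-5 and Claude Sonnet-4.5 come close to state-of-the-art dedicated verification tools in prediction performance, but they often fail to provide a valid proof, and their performance drops as program length increases.
Motivation: The undecidability of Turing's Halting Problem rules out a universal termination verifier, and existing tools typically depend on problem-specific architectures and particular programming languages. Given the success of LLMs at code understanding, the paper asks whether LLMs can reliably predict program termination, and so probes their potential for reasoning about undecidable problems.
Result: On the SV-Comp 2025 C termination benchmark, GPT-5 and Claude Sonnet-4.5 (using test-time scaling) would rank just behind the top-ranked tool, and Code World Model (CWM) would place just behind the second-ranked tool, indicating strong LLM performance on this task.
Insight: The novelty is presented as the first systematic evaluation of LLM predictions on a classic undecidable problem (program termination). It shows that LLMs can act as effective approximate verifiers but cannot reliably supply formal proofs, which points to future work on combining LLMs with formal methods.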
Abstract: Determining whether a program terminates is a central problem in computer science. Turing’s foundational result established the Halting Problem as undecidable, showing that no algorithm can universally determine termination for all programs and inputs. Consequently, automatic verification tools approximate termination, sometimes failing to prove or disprove; these tools rely on problem-specific architectures and abstractions, and are usually tied to particular programming languages. Recent success and progress in large language models (LLMs) raises the following question: can LLMs reliably predict program termination? In this work, we evaluate LLMs on a diverse set of C programs from the Termination category of the International Competition on Software Verification (SV-Comp) 2025. Our results suggest that LLMs perform remarkably well at predicting program termination, where GPT-5 and Claude Sonnet-4.5 would rank just behind the top-ranked tool (using test-time scaling), and Code World Model (CWM) would place just behind the second-ranked tool. While LLMs are effective at predicting program termination, they often fail to provide a valid witness as a proof. Moreover, LLM performance drops as program length increases. We hope these insights motivate further research into program termination and the broader potential of LLMs for reasoning about undecidable problems.
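[3] FROST: Filtering Reasoning Outliers with Attention for Efficient Reasoning cs.CL | cs.AI | cs.LG
Haozheng Luo, Zhuolin Jiang, Md Zahid Hasan, Yan Chen, Soumalya Sarkar
TL;DR: This paper proposes FROST, an attention-aware method for efficient reasoning. It uses attention weights to prune uncritical reasoning paths, producing shorter and more reliable reasoning trajectories. Validated on four benchmarks with Phi-4-Reasoning and GPT-OSS-20B, FROST outperforms existing methods in both token reduction and accuracy.
Motivation: Conventional reasoning methods are inefficient, and reasoning paths may contain redundant or unreliable segments. The paper aims to improve the efficiency and reliability of reasoning by identifying and removing outliers from the reasoning process.
Result: Across four benchmarks, FROST reduces token usage by 69.68% on average and improves accuracy by 26.70% relative to the base model, outperforming state-of-the-art methods such as TALE and ThinkLess. On attention outlier metrics, it lowers the maximum infinity norm by 15.97% and the average kurtosis by 91.09%.
Insight: The core novelty is the concept of "reasoning outliers" together with an attention-based mechanism that identifies and removes them, optimizing reasoning paths at the sentence level. The method preserves the model's reasoning capacity in theory and improves both efficiency and accuracy in practice, suggesting a new direction for designing efficient reasoning models.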
Abstract: We propose FROST, an attention-aware method for efficient reasoning. Unlike traditional approaches, FROST leverages attention weights to prune uncritical reasoning paths, yielding shorter and more reliable reasoning trajectories. Methodologically, we introduce the concept of reasoning outliers and design an attention-based mechanism to remove them. Theoretically, FROST preserves and enhances the model’s reasoning capacity while eliminating outliers at the sentence level. Empirically, we validate FROST on four benchmarks using two strong reasoning models (Phi-4-Reasoning and GPT-OSS-20B), outperforming state-of-the-art methods such as TALE and ThinkLess. Notably, FROST achieves an average 69.68% reduction in token usage and a 26.70% improvement in accuracy over the base model. Furthermore, in evaluations of attention outlier metrics, FROST reduces the maximum infinity norm by 15.97% and the average kurtosis by 91.09% compared to the base model. Code is available at https://github.com/robinzixuan/FROST
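The abstract reports two attention outlier metrics, the maximum infinity norm and the average kurtosis, without defining them in this digest. The sketch below shows one plausible reading, assuming the infinity norm is the largest absolute entry of an attention row and kurtosis is the standard (non-excess) fourth standardized moment. The function names are hypothetical and are not taken from the FROST repository.

```python
from statistics import fmean


def infinity_norm(row):
    """Largest absolute value in one row of attention weights."""
    return max(abs(x) for x in row)


def kurtosis(row):
    """Fourth standardized moment (non-excess); 0.0 for a constant row."""
    mu = fmean(row)
    var = fmean((x - mu) ** 2 for x in row)
    if var == 0:
        return 0.0
    return fmean((x - mu) ** 4 for x in row) / var ** 2


def outlier_metrics(attn):
    """Assumed aggregation: max of row-wise norms, mean of row-wise kurtosis."""
    return {
        "max_inf_norm": max(infinity_norm(r) for r in attn),
        "avg_kurtosis": fmean(kurtosis(r) for r in attn),
    }
```

Under this reading, a row that puts almost all of its mass on one position scores higher on both metrics than a uniform row, which is the kind of spike an outlier filter would target.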
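[4] Optimizing Conversational Quality in Spoken Dialogue Systems with Reinforcement Learning from AI Feedback cs.CL | cs.SD
Siddhant Arora, Jinchuan Tian, Jiatong Shi, Hayato Futami, Yosuke Kashiwagi
TL;DR: This paper proposes the first multi-reward RLAIF framework for spoken dialogue systems. It combines semantic, audio-quality, and emotion-consistency rewards and uses a DPO objective with turn-level preference sampling and per-block log-probability aggregation, addressing the fact that existing methods ignore the multi-dimensional nature of conversational quality and do not match incremental generation in duplex systems.
Motivation: Existing RLHF/RLAIF methods for spoken dialogue systems are mostly limited to a single semantic reward applied at the utterance level. They overlook the multi-dimensional, multi-modal nature of conversational quality (semantic coherence, audio naturalness, speaker consistency, and so on) and are mismatched with the incremental generation of duplex systems.
Result: Experiments show that single-reward RLAIF selectively improves its targeted metric, while joint multi-reward training yields consistent gains in both semantic quality and audio naturalness, supporting the effectiveness of multi-reward alignment.
Insight: The novelty is bringing a multi-reward RLAIF framework to spoken dialogue systems for the first time. Turn-level preference sampling aligned with blockwise decoding enables joint optimization of several dimensions of conversational quality (semantic, audio, emotion), and a multi-reward DPO dataset is released to support reproducible research.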
Abstract: Reinforcement learning from human or AI feedback (RLHF/RLAIF) for speech-in/speech-out dialogue systems (SDS) remains underexplored, with prior work largely limited to single semantic rewards applied at the utterance level. Such setups overlook the multi-dimensional and multi-modal nature of conversational quality, which encompasses semantic coherence, audio naturalness, speaker consistency, emotion alignment, and turn-taking behavior. Moreover, they are fundamentally mismatched with duplex spoken dialogue systems that generate responses incrementally, where agents must make decisions based on partial utterances. We address these limitations with the first multi-reward RLAIF framework for SDS, combining semantic, audio-quality, and emotion-consistency rewards. To align utterance-level preferences with incremental, blockwise decoding in duplex models, we apply turn-level preference sampling and aggregate per-block log-probabilities within a single DPO objective. We present the first systematic study of preference learning for improving SDS quality in both multi-turn Chain-of-Thought and blockwise duplex models, and release a multi-reward DPO dataset to support reproducible research. Experiments show that single-reward RLAIF selectively improves its targeted metric, while joint multi-reward training yields consistent gains across semantic quality and audio naturalness. These results highlight the importance of holistic, multi-reward alignment for practical conversational SDS.
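The abstract says per-block log-probabilities are aggregated within a single DPO objective but does not give the formula. The following is a minimal sketch of the standard DPO loss for one preference pair, assuming the aggregation is a plain sum over blocks. The function names, the summation, and the default `beta` are illustrative assumptions, not the authors' implementation.

```python
import math


def sequence_logprob(block_logprobs):
    """Aggregate the per-block log-probabilities of one turn (assumed: sum)."""
    return sum(block_logprobs)


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss, -log(sigmoid(margin)), for one preference pair.

    Each argument is a list of per-block log-probabilities for one turn.
    """
    chosen_gain = sequence_logprob(policy_chosen) - sequence_logprob(ref_chosen)
    rejected_gain = sequence_logprob(policy_rejected) - sequence_logprob(ref_rejected)
    margin = beta * (chosen_gain - rejected_gain)
    # Numerically stable softplus(-margin).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy matches the reference, the margin is zero and the loss is log 2; the loss falls below that value once the policy favours the chosen turn more than the reference does.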
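[5] PsyProbe: Proactive and Interpretable Dialogue through User State Modeling for Exploratory Counseling cs.CL
Sohhyung Park, Hyunji Kang, Sungzoon Cho, Dongil Kim
TL;DR: This paper presents PsyProbe, a proactive and interpretable dialogue system for the exploration phase of counseling. It systematically tracks the user's psychological state through the PPPPPI framework augmented with cognitive error detection, and combines state building, memory construction, strategy planning, and response generation modules to produce proactive questions. In evaluations on real-world Korean counseling scenarios, the system outperforms baselines in automatic evaluation, user engagement, and expert assessment.
Motivation: Existing LLM-based mental health dialogue systems are mostly reactive and lack systematic modeling of the user's psychological state to support proactive therapeutic exploration. The paper addresses this by building a proactive, interpretable dialogue system for the exploration phase of counseling.
Result: In an evaluation with 27 participants in real-world Korean counseling scenarios, the full PsyProbe model consistently outperforms the baseline and ablation modes in automatic evaluation. User evaluation shows significantly higher engagement intention and improved naturalness. Expert evaluation by a certified counselor shows that PsyProbe substantially improves understanding of the core issue and reaches question rates comparable to professional counselors.
Insight: The main novelty is integrating systematic user state modeling (the PPPPPI framework combined with cognitive error detection) into a dialogue system, with a modular architecture covering state tracking, information-gap identification, strategy planning, and critic-based revision. This moves the system from reactive response to proactive exploration and offers a reusable framework for therapeutic dialogue systems.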
Abstract: Recent advances in large language models have enabled mental health dialogue systems, yet existing approaches remain predominantly reactive, lacking systematic user state modeling for proactive therapeutic exploration. We introduce PsyProbe, a dialogue system designed for the exploration phase of counseling that systematically tracks user psychological states through the PPPPPI framework (Presenting, Predisposing, Precipitating, Perpetuating, Protective, Impact) augmented with cognitive error detection. PsyProbe combines State Builder for extracting structured psychological profiles, Memory Construction for tracking information gaps, Strategy Planner for Motivational Interviewing behavioral codes, and Response Generator with Question Ideation and Critic/Revision modules to generate contextually appropriate, proactive questions. We evaluate PsyProbe with 27 participants in real-world Korean counseling scenarios, including automatic evaluation across ablation modes, user evaluation, and expert evaluation by a certified counselor. The full PsyProbe model consistently outperforms baseline and ablation modes in automatic evaluation. User evaluation demonstrates significantly increased engagement intention and improved naturalness compared to baseline. Expert evaluation shows that PsyProbe substantially improves core issue understanding and achieves question rates comparable to professional counselors, validating the effectiveness of systematic state modeling and proactive questioning for therapeutic exploration.
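[6] Do Images Speak Louder than Words? Investigating the Effect of Textual Misinformation in VLMs cs.CL
Chi Zhang, Wenxuan Ding, Jiale Liu, Mingrui Wu, Qingyun Wu
TL;DR: This paper studies how robust vision-language models (VLMs) are to textual misinformation. The authors propose the CONTEXT-VQA dataset, which pairs images and questions with text prompts that conflict with the visual evidence, and design an evaluation framework to measure how susceptible various VLMs are. Current state-of-the-art VLMs are broadly vulnerable to misleading text: they often override clear visual evidence in favor of the conflicting text, with an average performance drop of over 48.2% after a single round of persuasive conversation.
Motivation: Prior work has focused on misinformation in text-only settings, and it remains unclear how VLMs arbitrate between conflicting information from different modalities (vision versus text). The paper fills this gap by examining VLM robustness to textual misinformation.
Result: Comprehensive experiments on 11 state-of-the-art VLMs show that these models are fragile under conflicting multimodal inputs, with an average performance drop of over 48.2%. The results expose a serious limitation of current VLMs in resisting textual manipulation.
Insight: The novelty is presented as the first systematic study of VLM behavior when visual and textual information conflict, together with the dedicated CONTEXT-VQA dataset for evaluation. The work reveals a key weakness in multimodal information integration, namely that text can dominate the model's decision, and it points to directions for improving robustness, such as better cross-modal attention mechanisms.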
Abstract: Vision-Language Models (VLMs) have shown strong multimodal reasoning capabilities on Visual-Question-Answering (VQA) benchmarks. However, their robustness against textual misinformation remains under-explored. While existing research has studied the effect of misinformation in text-only domains, it is not clear how VLMs arbitrate between contradictory information from different modalities. To bridge the gap, we first propose the CONTEXT-VQA (i.e., Conflicting Text) dataset, consisting of image-question pairs together with systematically generated persuasive prompts that deliberately conflict with visual evidence. Then, a thorough evaluation framework is designed and executed to benchmark the susceptibility of various models to these conflicting multimodal inputs. Comprehensive experiments over 11 state-of-the-art VLMs reveal that these models are indeed vulnerable to misleading textual prompts, often overriding clear visual evidence in favor of the conflicting text, and show an average performance drop of over 48.2% after only one round of persuasive conversation. Our findings highlight a critical limitation in current VLMs and underscore the need for improved robustness against textual manipulation.
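[7] A Hybrid Supervised-LLM Pipeline for Actionable Suggestion Mining in Unstructured Customer Reviews cs.CL | cs.AI
Aakash Trivedi, Aniket Upadhyay, Pratik Narang, Dhruv Kumar, Praveen Kumar
TL;DR: This paper proposes a pipeline that combines supervised learning with a large language model (LLM) to mine actionable suggestions from unstructured customer reviews. It pairs a high-recall RoBERTa classifier, trained with a precision-recall surrogate loss to reduce unrecoverable false negatives, with an instruction-tuned LLM that performs suggestion extraction, categorization, clustering, and summarization.
Motivation: Existing approaches usually either classify suggestion-bearing sentences or generate high-level summaries, and rarely isolate the precise improvement instructions a business needs. The paper addresses accurate extraction of actionable suggestions from mixed-intent, unstructured text.
Result: On real-world hospitality and food datasets, the hybrid system outperforms prompt-only, rule-based, and classifier-only baselines in extraction accuracy and cluster coherence. Human evaluation further confirms that the resulting suggestions and summaries are clear, faithful, and interpretable.
Insight: The novelty is combining a high-recall supervised classifier with a controlled LLM in a hybrid reasoning architecture, which yields meaningful improvements in fine-grained actionable suggestion mining while highlighting challenges in domain adaptation and efficient local deployment.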
Abstract: Extracting actionable suggestions from customer reviews is essential for operational decision-making, yet these directives are often embedded within mixed-intent, unstructured text. Existing approaches either classify suggestion-bearing sentences or generate high-level summaries, but rarely isolate the precise improvement instructions businesses need. We evaluate a hybrid pipeline that combines a high-recall RoBERTa classifier, trained with a precision-recall surrogate to reduce unrecoverable false negatives, with a controlled, instruction-tuned LLM for suggestion extraction, categorization, clustering, and summarization. Across real-world hospitality and food datasets, the hybrid system outperforms prompt-only, rule-based, and classifier-only baselines in extraction accuracy and cluster coherence. Human evaluations further confirm that the resulting suggestions and summaries are clear, faithful, and interpretable. Overall, our results show that hybrid reasoning architectures achieve meaningful improvements in fine-grained actionable suggestion mining while highlighting challenges in domain adaptation and efficient local deployment.
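[8] RPO-RAG: Aligning Small LLMs with Relation-aware Preference Optimization for Knowledge Graph Question Answering cs.CL | cs.AI
Kaehyun Um, KyuHwan Yeom, Haerim Yang, Minyoung Choi, Hyeongjun Yang
TL;DR: This paper proposes RPO-RAG, a knowledge-graph-based retrieval-augmented generation framework designed for small LLMs (under 7B parameters) on knowledge graph question answering. Its three contributions, query-path semantic sampling, relation-aware preference optimization, and answer-centered prompt design, improve the reasoning ability and answer precision of small models on KGQA.
Motivation: Existing KG-based RAG methods sample paths without regard to semantics, are weakly aligned with KG reasoning objectives, and do not organize retrieved paths into answer-centered reasoning paths, all of which limit accuracy. Prior work also relies mainly on large LLMs or backbones above 7B parameters, leaving sub-7B models underexplored.
Result: Experiments on two KGQA benchmarks, WebQSP and CWQ, show that RPO-RAG substantially narrows the performance gap between small and large language models. It improves F1 by up to 8.8% on WebQSP, and on CWQ it achieves new state-of-the-art Hit and F1 results among models under 8B parameters.
Insight: The contributions are: 1) a query-path semantic sampling strategy that provides informative supervisory signals; 2) relation-aware preference optimization that aligns training with intermediate KG reasoning signals (such as relations); 3) an answer-centered prompt design that organizes entities and reasoning paths in an interpretable format. Together they suggest potential for resource-efficient, practical on-device KGQA applications.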
Abstract: Large Language Models (LLMs) have recently demonstrated remarkable reasoning abilities, yet hallucinate on knowledge-intensive tasks. Retrieval-augmented generation (RAG) mitigates this issue by grounding answers in external sources, e.g., knowledge graphs (KGs). However, existing KG-based RAG approaches rely on semantics-unaware path sampling and are weakly aligned with KG reasoning objectives, which limits further accuracy gains. They also feed retrieved paths directly into the reasoner without organizing them into answer-centered reasoning paths, hindering small LLMs’ ability to leverage the retrieved knowledge. Furthermore, prior works predominantly rely on large LLMs (e.g., ChatGPT/GPT-4) or assume backbones above 7B parameters, leaving sub-7B models underexplored. We address this gap with RPO-RAG, the first KG-based RAG framework specifically designed for small LLMs, to the best of our knowledge. RPO-RAG introduces three key innovations: (1) a query-path semantic sampling strategy that provides informative supervisory signals; (2) a relation-aware preference optimization that aligns training with intermediate KG reasoning signals (e.g., relation); and (3) an answer-centered prompt design that organizes entities and reasoning paths in an interpretable format. Extensive experiments on two benchmark Knowledge Graph Question Answering (KGQA) datasets, WebQSP and CWQ, demonstrate that RPO-RAG effectively bridges the performance gap between small and large language models. On WebQSP, it improves F1 by up to 8.8%, reflecting enhanced answer precision, while on CWQ it achieves new state-of-the-art results among models under 8B parameters in both Hit and F1. Overall, RPO-RAG substantially improves the reasoning capability of small LLMs, even under 3B parameters, highlighting their potential for resource-efficient and practical on-device KGQA applications.
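The results above are reported in Hit and F1, which this digest does not define. The sketch below shows one common set-based formulation, assuming Hit is 1 when any predicted answer appears in the gold set and F1 is the harmonic mean of set precision and recall. The paper's exact evaluation script may differ, and the function names are illustrative.

```python
def hit(predicted, gold):
    """1.0 if at least one predicted answer is a gold answer, else 0.0."""
    return 1.0 if set(predicted) & set(gold) else 0.0


def f1(predicted, gold):
    """Set-based F1 between predicted and gold answer sets."""
    pred, ref = set(predicted), set(gold)
    true_positives = len(pred & ref)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Hit rewards finding any correct entity, whereas F1 also penalizes spurious or missing answers, which is why the abstract ties the F1 gain to answer precision.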
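[9] DiaDem: Advancing Dialogue Descriptions in Audiovisual Video Captioning for Multimodal Large Language Models cs.CL
Xinlong Chen, Weihong Lin, Jingyun Hua, Linli Yao, Yue Ding
TL;DR: This paper proposes DiaDem, a model that improves the accuracy of dialogue descriptions in audiovisual video captioning by synthesizing a high-quality SFT dataset and applying a difficulty-partitioned two-stage GRPO strategy. It also introduces the DiaDemBench benchmark, which evaluates models across diverse dialogue scenarios with a focus on speaker attribution and utterance transcription fidelity. Experiments show that DiaDem surpasses the Gemini series in dialogue description accuracy and is competitive on general audiovisual captioning benchmarks.
Motivation: Existing audiovisual video captioning models struggle to generate faithful dialogue descriptions, which hurts downstream understanding and generation tasks, so a model that describes dialogue precisely is needed.
Result: On DiaDemBench, DiaDem outperforms the Gemini series in dialogue description accuracy, and it reaches competitive performance on general audiovisual captioning benchmarks, demonstrating its overall effectiveness.
Insight: The contributions are a synthesized high-quality SFT dataset, a difficulty-partitioned two-stage GRPO strategy for strengthening dialogue descriptions, and the DiaDemBench benchmark for systematic evaluation of dialogue capability. Targeted optimization of dialogue generation, combined with systematic evaluation, improves the dialogue faithfulness of multimodal LLMs in audiovisual captioning.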
Abstract: Accurate dialogue description in audiovisual video captioning is crucial for downstream understanding and generation tasks. However, existing models generally struggle to produce faithful dialogue descriptions within audiovisual captions. To mitigate this limitation, we propose DiaDem, a powerful audiovisual video captioning model capable of generating captions with more precise dialogue descriptions while maintaining strong overall performance. We first synthesize a high-quality dataset for SFT, then employ a difficulty-partitioned two-stage GRPO strategy to further enhance dialogue descriptions. To enable systematic evaluation of dialogue description capabilities, we introduce DiaDemBench, a comprehensive benchmark designed to evaluate models across diverse dialogue scenarios, emphasizing both speaker attribution accuracy and utterance transcription fidelity in audiovisual captions. Extensive experiments on DiaDemBench reveal that even commercial models still exhibit substantial room for improvement in dialogue-aware captioning. Notably, DiaDem not only outperforms the Gemini series in dialogue description accuracy but also achieves competitive performance on general audiovisual captioning benchmarks, demonstrating its overall effectiveness.
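[10] Riddle Quest : The Enigma of Words cs.CL | cs.AI | cs.IT
Niharika Sri Parasa, Chaitali Diwan, Srinath Srinivasa
TL;DR: This paper presents a simple pipeline for creating and evaluating analogy-based riddles, consisting of a triples creator, a semantic mapper, a stylized generator, and a validator. The validator is then used to study whether large language models can recover the full answer set for different riddle types.
Motivation: The paper addresses how to create and evaluate analogy-based riddles systematically, and uses riddles as a lightweight tool for examining the reasoning coverage and ambiguity handling of language models.
Result: The case study shows that models often guess the main intended answer but frequently miss other valid interpretations, which demonstrates the value of riddles for assessing reasoning coverage and ambiguity handling.
Insight: The novelty is a systematic riddle generation and validation pipeline, along with what is presented as the first use of riddles to evaluate reasoning coverage and ambiguity handling in language models, giving a new perspective on model evaluation.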
Abstract: Riddles are concise linguistic puzzles that describe an object or idea through indirect, figurative, or playful clues. They are a longstanding form of creative expression, requiring the solver to interpret hints, recognize patterns, and draw inferences to identify the answers. In this work, we introduce a simple pipeline for creating and evaluating analogy-based riddles. The system includes a triples creator that builds structured facts about a concept, a semantic mapper that selects attributes useful for analogy, a stylized generator that turns them into riddle clues, and a validator that collects all possible answers the riddle could point to. We use this validator to study whether large language models can recover the full answer set for different riddle types. Our case study shows that while models often guess the main intended answer, they frequently miss other valid interpretations. This highlights the value of riddles as a lightweight tool for examining reasoning coverage and ambiguity handling in language models.
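[11] MetaGen: Self-Evolving Roles and Topologies for Multi-Agent LLM Reasoning cs.CL
Yimeng Wang, Jiaxing Zhao, Hongbin Xie, Hexing Ma, Yuzhen Lei
TL;DR: This paper proposes MetaGen, a training-free framework for multi-agent LLM reasoning that dynamically adjusts agent role definitions and the collaboration topology at inference time to solve complex tasks.
Motivation: Existing multi-agent systems built on a fixed role library and a static interaction topology are rigid. This design often leads to task mismatch, prevents adaptation to new evidence that emerges during reasoning, and inflates inference cost.
Result: Experiments on code generation and multi-step reasoning benchmarks show that MetaGen achieves a better accuracy-cost tradeoff than strong multi-agent baselines.
Insight: The core novelty is an inference-time, self-evolving framework that does not update base model weights. It generates and rewrites query-conditioned role specifications, then iteratively updates role prompts and adjusts structural decisions using lightweight feedback signals, so that both the role space and the collaboration topology adapt dynamically.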
Abstract: Large language models are increasingly deployed as multi-agent systems, where specialized roles communicate and collaborate through structured interactions to solve complex tasks that often exceed the capacity of a single agent. However, most existing systems still rely on a fixed role library and an execution-frozen interaction topology, a rigid design choice that frequently leads to task mismatch, prevents timely adaptation when new evidence emerges during reasoning, and further inflates inference cost. We introduce MetaGen, a training-free framework that adapts both the role space and the collaboration topology at inference time, without updating base model weights. MetaGen generates and rewrites query-conditioned role specifications to maintain a controllable dynamic role pool, then instantiates a constrained execution graph around a minimal backbone. During execution, it iteratively updates role prompts and adjusts structural decisions using lightweight feedback signals. Experiments on code generation and multi-step reasoning benchmarks show that MetaGen improves the accuracy and cost tradeoff over strong multi-agent baselines.
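[12] Formula-One Prompting: Adaptive Reasoning Through Equations For Applied Mathematics cs.CL
Natapong Nitarach, Pittawat Taveekitworachai, Kunat Pipatanakul
TL;DR: This paper proposes Formula-One Prompting (F-1), a prompting technique for improving LLM reasoning in applied mathematics. It works in two phases: it first derives the governing equations from the problem description, then adaptively selects a solving strategy (chain-of-thought, program-of-thought, or direct computation) based on those equations. Experiments show that it outperforms existing methods across several models and benchmarks, with the largest gains in applied domains such as finance and physics.
Motivation: Existing prompting techniques such as chain-of-thought and program-of-thought do not explicitly use or derive the governing equations behind an applied mathematics problem, even though this is a key step in domains such as finance, physics, and cryptography.
Result: Across five models and four benchmarks, F-1 outperforms chain-of-thought (CoT) by 5.76% and program-of-thought (PoT) by 8.42% on average. Gains are largest in applied domains: 13.30% over CoT on FinanceMath, and within OlympiadBench the gain on physics problems (+2.55%) is much larger than on pure math problems (+0.44%).
Insight: The core novelty is using mathematical equations as an intermediate representation and selecting the solving strategy adaptively from them, which helps the model handle applied problems that require domain knowledge. The method splits a problem into explicit "formulate" and "solve" phases, both completed within a single LLM call, giving a structured and interpretable reasoning framework.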
Abstract: Prompting techniques such as Chain-of-Thought (CoT) and Program-of-Thought (PoT) improve LLM mathematical reasoning by structuring intermediate steps in natural language or code. However, applied mathematics problems in domains like finance, physics, and cryptography often require recalling or deriving governing equations, a step that current approaches do not explicitly leverage. We propose Formula-One Prompting (F-1), a two-phase approach that uses mathematical equations as an intermediate representation before adaptive solving. F-1 first formulates governing equations from problem descriptions, then selects a solving strategy among CoT, PoT, or direct computation based on the generated equations, all within a single LLM call. Results across five models and four benchmarks show F-1 outperforms CoT by +5.76% and PoT by +8.42% on average. Crucially, gains are largest in applied domains: +13.30% on FinanceMath over CoT, and within OlympiadBench, larger gains on physics (+2.55%) than pure math (+0.44%). This demonstrates that F-1 is more effective than CoT in applied mathematics problems.
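[13] KG-CRAFT: Knowledge Graph-based Contrastive Reasoning with LLMs for Enhancing Automated Fact-checking cs.CL | cs.AI
Vítor N. Lourenço, Aline Paes, Tillman Weyde, Audrey Depeige, Mohnish Dubey
TL;DR: This paper proposes KG-CRAFT, a method that improves automatic claim verification by augmenting large language models with contrastive questions grounded in a knowledge graph. It first builds a knowledge graph from claims and associated reports, then generates contextually relevant contrastive questions from the graph structure to guide the distillation of evidence-based reports, and finally synthesizes a concise summary that the LLM uses for veracity assessment.
Motivation: Claim verification is the core problem in automated fact-checking systems. The paper addresses how to use reliable evidence sources (such as documents or knowledge bases) more effectively to assess the truthfulness of a claim, and how to improve LLM performance on this task.
Result: Extensive evaluation on two real-world datasets (LIAR-RAW and RAWFC) shows that the method achieves a new state of the art in predictive performance.
Insight: The novelty is combining a knowledge graph with contrastive reasoning to strengthen LLM fact-checking. Contrastive questions generated from the graph structure guide evidence distillation and summary synthesis, improving the accuracy and interpretability of verification.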
Abstract: Claim verification is a core component of automated fact-checking systems, aimed at determining the truthfulness of a statement by assessing it against reliable evidence sources such as documents or knowledge bases. This work presents KG-CRAFT, a method that improves automatic claim verification by leveraging large language models (LLMs) augmented with contrastive questions grounded in a knowledge graph. KG-CRAFT first constructs a knowledge graph from claims and associated reports, then formulates contextually relevant contrastive questions based on the knowledge graph structure. These questions guide the distillation of evidence-based reports, which are synthesised into a concise summary that is used for veracity assessment by LLMs. Extensive evaluations on two real-world datasets (LIAR-RAW and RAWFC) demonstrate that our method achieves a new state-of-the-art in predictive performance. Comprehensive analyses validate in detail the effectiveness of our knowledge graph-based contrastive reasoning approach in improving LLMs’ fact-checking capabilities.
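[14] Automated Safety Benchmarking: A Multi-agent Pipeline for LVLMs cs.CL
Xiangyang Zhu, Yuan Tian, Zicheng Zhang, Qi Jia, Chunyi Li
TL;DR: This paper proposes VLSafetyBencher, the first automated benchmark construction system for safety evaluation of large vision-language models (LVLMs). Four collaborative agents (data preprocessing, generation, augmentation, and selection) construct and select high-quality safety test samples, overcoming the labor-intensive, static, and weakly discriminative nature of existing benchmark construction.
Motivation: Constructing LVLM safety benchmarks today is labor-intensive, the resulting benchmarks have static complexity and limited discriminative power, and they struggle to keep pace with rapidly evolving models and emerging risks. An automated, efficient, high-quality construction method is therefore needed.
Result: Experiments show that VLSafetyBencher can build a high-quality safety benchmark within one week at minimal cost. The generated benchmark effectively separates models by safety, with a 70% gap in safety rate between the most and least safe models.
Insight: The core novelty is applying a multi-agent collaboration framework to automated construction of LVLM safety benchmarks, giving an end-to-end automated flow from sample generation and augmentation through selection. This provides a scalable way to evaluate the safety of rapidly evolving models continuously and efficiently.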
Abstract: Large vision-language models (LVLMs) exhibit remarkable capabilities in cross-modal tasks but face significant safety challenges, which undermine their reliability in real-world applications. Efforts have been made to build LVLM safety evaluation benchmarks to uncover their vulnerability. However, existing benchmarks are hindered by their labor-intensive construction process, static complexity, and limited discriminative power. Thus, they may fail to keep pace with rapidly evolving models and emerging risks. To address these limitations, we propose VLSafetyBencher, the first automated system for LVLM safety benchmarking. VLSafetyBencher introduces four collaborative agents (Data Preprocessing, Generation, Augmentation, and Selection) to construct and select high-quality samples. Experiments validate that VLSafetyBencher can construct high-quality safety benchmarks within one week at a minimal cost. The generated benchmark effectively distinguishes model safety, with a safety rate disparity of 70% between the most and least safe models.
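[15] Decompose-and-Formalise: Recursively Verifiable Natural Language Inference cs.CL
Xin Quan, Marco Valentino, Louise A. Dennis, André Freitas
TL;DR: This paper proposes a decompose-and-formalise framework for improving entailment verification and explanation refinement in natural language inference (NLI). It decomposes each premise-hypothesis pair into an entailment tree of atomic steps, verifies the tree bottom-up to isolate failing nodes, and refines locally based on diagnostics instead of regenerating globally. It also introduces θ-substitution in an event-based logical form to make autoformalisation more consistent, which raises verification rates and reduces computational overhead.
Motivation: Existing neuro-symbolic methods struggle with naturalistic NLI: long, complex inputs and multi-step reasoning amplify autoformalisation errors, and it is hard to localise the point of failure from prover diagnostics, which forces costly global regeneration.
Result: Across a range of reasoning tasks with five LLM backbones, the method achieves the highest explanation verification rates, improving over the state of the art by 26.2%, 21.7%, 21.6%, and 48.9%, while reducing refinement iterations and runtime and preserving strong NLI accuracy.
Insight: The contributions are a recursive decompose-and-verify framework that localises failures, diagnostic-guided local refinement in place of global regeneration, and θ-substitution in an event-based logical form that enforces consistent argument-role bindings to improve the faithfulness of formalisation.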
Abstract: Recent work has shown that integrating large language models (LLMs) with theorem provers (TPs) in neuro-symbolic pipelines helps with entailment verification and proof-guided refinement of explanations for natural language inference (NLI). However, scaling such refinement to naturalistic NLI remains difficult: long, syntactically rich inputs and deep multi-step arguments amplify autoformalisation errors, where a single local mismatch can invalidate the proof. Moreover, current methods often handle failures via costly global regeneration due to the difficulty of localising the responsible span or step from prover diagnostics. Aiming to address these problems, we propose a decompose-and-formalise framework that (i) decomposes premise-hypothesis pairs into an entailment tree of atomic steps, (ii) verifies the tree bottom-up to isolate failures to specific nodes, and (iii) performs local diagnostic-guided refinement instead of regenerating the whole explanation. Moreover, to improve faithfulness of autoformalisation, we introduce $θ$-substitution in an event-based logical form to enforce consistent argument-role bindings. Across a range of reasoning tasks using five LLM backbones, our method achieves the highest explanation verification rates, improving over the state-of-the-art by 26.2%, 21.7%, 21.6% and 48.9%, while reducing refinement iterations and runtime and preserving strong NLI accuracy.
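Steps (i) and (ii) of the abstract, an entailment tree verified bottom-up so that failures are isolated to specific nodes, can be sketched with a post-order traversal. The `Node` type, the `verify` callback signature, and the function name are hypothetical; in the paper the check would be performed by a theorem prover over autoformalised logic, which is not modelled here.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Node:
    """One atomic entailment step: support (premises + children) => conclusion."""
    conclusion: str
    premises: List[str] = field(default_factory=list)
    children: List["Node"] = field(default_factory=list)


def failing_nodes(node: Node, verify: Callable[[List[str], str], bool]) -> List[Node]:
    """Check children before their parent and collect the steps that fail."""
    failures: List[Node] = []
    for child in node.children:
        failures.extend(failing_nodes(child, verify))
    # A step is supported by its own premises plus its children's conclusions.
    support = node.premises + [c.conclusion for c in node.children]
    if not verify(support, node.conclusion):
        failures.append(node)
    return failures
```

Only the nodes returned by `failing_nodes` would then need refinement, which is the motivation the abstract gives for avoiding regeneration of the whole explanation.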
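[16] Up to 36x Speedup: Mask-based Parallel Inference Paradigm for Key Information Extraction in MLLMs cs.CL | cs.AI
Xinzhong Wang, Ya Guo, Jing Li, Huan Chen, Yi Tu
TL;DR: This paper proposes PIP, a parallel inference paradigm that addresses the inefficiency of autoregressive inference when multimodal LLMs extract key information from visually-rich documents. By using mask tokens as placeholders, it generates all target values in a single forward pass, which greatly speeds up inference.
Motivation: Autoregressive inference becomes an efficiency bottleneck when extracting multiple semantically independent fields, which limits practical use of key information extraction.
Result: Experiments show that PIP models achieve a 5x to 36x inference speedup over traditional autoregressive base models while maintaining high accuracy.
Insight: The novelty is recasting sequence generation as a parallel mask-filling task, supported by a tailored mask pre-training strategy and large-scale supervised datasets, which opens a path to scalable, real-time key information extraction.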
Abstract: Key Information Extraction (KIE) from visually-rich documents (VrDs) is a critical task, for which recent Large Language Models (LLMs) and Multi-Modal Large Language Models (MLLMs) have demonstrated strong potential. However, their reliance on autoregressive inference, which generates outputs sequentially, creates a significant efficiency bottleneck, especially as KIE tasks often involve extracting multiple, semantically independent fields. To overcome this limitation, we introduce PIP: a Parallel Inference Paradigm for KIE. Our approach reformulates the problem by using “[mask]” tokens as placeholders for all target values, enabling their simultaneous generation in a single forward pass. To facilitate this paradigm, we develop a tailored mask pre-training strategy and construct large-scale supervised datasets. Experimental results show that our PIP-models achieve a 5-36x inference speedup with negligible performance degradation compared to traditional autoregressive base models. By substantially improving efficiency while maintaining high accuracy, PIP paves the way for scalable and practical real-world KIE solutions.
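The idea of using "[mask]" tokens as placeholders for all target values can be illustrated by how such an input might be laid out. This is a toy sketch only: the template format, the fixed number of slots per field, and both function names are assumptions, and the pass count is an idealised comparison that ignores everything except value tokens.

```python
MASK = "[mask]"


def build_masked_template(fields, slots_per_field=4):
    """Lay out every field with mask placeholders so all values share one input."""
    lines = []
    for name in fields:
        placeholders = " ".join([MASK] * slots_per_field)
        lines.append(f"{name}: {placeholders}")
    return "\n".join(lines)


def decoding_passes(value_lengths, parallel):
    """Idealised forward-pass count: one pass if all masks are filled at once,
    otherwise one pass per generated value token."""
    return 1 if parallel else sum(value_lengths)
```

Because the fields are semantically independent, nothing in one value has to wait for another, which is why filling all masks together can replace token-by-token decoding.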
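[17] RATE: Reviewer Profiling and Annotation-free Training for Expertise Ranking in Peer Review Systems cs.CL
Weicong Liu, Zixuan Yang, Yibo Zhao, Xiang Li
TL;DR: Reviewer assignment in the LLM era suffers from rapid topic shifts that make existing benchmarks outdated and from proxy signals that fail to reflect true familiarity. This paper introduces LR-bench, a high-fidelity, up-to-date benchmark built from 2024-2025 AI/NLP manuscripts, and proposes the RATE framework, which builds keyword-based reviewer profiles and fine-tunes an embedding model with weak preference supervision so that manuscripts can be matched directly against reviewer profiles.
Motivation: The paper addresses two problems in reviewer assignment: existing benchmarks are outdated because topics shift quickly, and proxy signals do not accurately reflect a reviewer's true familiarity with a manuscript.
Result: On both LR-bench and the CMU gold-standard dataset, RATE achieves state-of-the-art performance and clearly outperforms strong embedding baselines.
Insight: The contributions are LR-bench, a high-fidelity, up-to-date benchmark based on self-assessed familiarity with recent manuscripts, and RATE, a reviewer-centric ranking framework based on keyword profiles and weak preference supervision, which enables annotation-free training and efficient matching.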
Abstract: Reviewer assignment is increasingly critical yet challenging in the LLM era, where rapid topic shifts render many pre-2023 benchmarks outdated and where proxy signals poorly reflect true reviewer familiarity. We address this evaluation bottleneck by introducing LR-bench, a high-fidelity, up-to-date benchmark curated from 2024-2025 AI/NLP manuscripts with five-level self-assessed familiarity ratings collected via a large-scale email survey, yielding 1055 expert-annotated paper-reviewer-score annotations. We further propose RATE, a reviewer-centric ranking framework that distills each reviewer’s recent publications into compact keyword-based profiles and fine-tunes an embedding model with weak preference supervision constructed from heuristic retrieval signals, enabling matching each manuscript against a reviewer profile directly. Across LR-bench and the CMU gold-standard dataset, our approach consistently achieves state-of-the-art performance, outperforming strong embedding baselines by a clear margin. We release LR-bench at https://huggingface.co/datasets/Gnociew/LR-bench, and a GitHub repository at https://github.com/Gnociew/RATE-Reviewer-Assign.
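Matching a manuscript directly against reviewer profiles comes down to ranking profile embeddings by similarity to the manuscript embedding. The sketch below assumes cosine similarity and takes the vectors as given; the abstract does not state the similarity function, and the fine-tuned embedding model itself is not reproduced here. All names are illustrative.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_reviewers(manuscript_vec, profile_vecs):
    """Return reviewer names ordered from most to least similar profile.

    profile_vecs maps a reviewer name to the embedding of that reviewer's profile.
    """
    return sorted(
        profile_vecs,
        key=lambda name: cosine(manuscript_vec, profile_vecs[name]),
        reverse=True,
    )
```

For example, `rank_reviewers([1.0, 0.0], {"r1": [0.0, 1.0], "r2": [1.0, 0.1]})` places `r2` first because its profile vector points in nearly the same direction as the manuscript vector.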
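[18] SynCABEL: Synthetic Contextualized Augmentation for Biomedical Entity Linking cs.CL | cs.AI | cs.IR | cs.LG
Adam Remaki, Christel Gérardin, Eulàlia Farré-Maduell, Martin Krallinger, Xavier Tannier
TL;DR: SynCABEL is a framework that addresses the scarcity of expert-annotated training data in biomedical entity linking (BEL). It uses large language models to generate context-rich synthetic training examples for all candidate concepts in a target knowledge base, providing broad supervision without manual annotation. Combined with decoder-only models and guided inference, it sets new state-of-the-art (SOTA) results on three multilingual benchmarks: MedMentions (English), QUAERO (French), and SPACCC (Spanish). In terms of data efficiency, SynCABEL matches full human supervision with up to 60% less annotated data and significantly raises the rate of clinically valid predictions.
Details
Motivation: To address the scarcity of expert-annotated training data, a central bottleneck in supervised biomedical entity linking (BEL).
Result: Sets new SOTA results on three multilingual benchmarks: MedMentions (English), QUAERO (French), and SPACCC (Spanish). Reaches the performance of full human supervision with up to 60% less annotated data. Under an LLM-as-a-judge evaluation protocol, significantly improves the rate of clinically valid predictions.
Insight: The core innovation is using large language models to automatically generate context-rich synthetic training data for every candidate concept in the knowledge base, addressing data scarcity at low cost. In addition, an LLM-based evaluation protocol measures clinical validity more accurately, compensating for the shortcomings of conventional evaluation based on exact code matching. The work thus innovates in both data augmentation and evaluation.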
Abstract: We present SynCABEL (Synthetic Contextualized Augmentation for Biomedical Entity Linking), a framework that addresses a central bottleneck in supervised biomedical entity linking (BEL): the scarcity of expert-annotated training data. SynCABEL leverages large language models to generate context-rich synthetic training examples for all candidate concepts in a target knowledge base, providing broad supervision without manual annotation. We demonstrate that SynCABEL, when combined with decoder-only models and guided inference, establishes new state-of-the-art results across three widely used multilingual benchmarks: MedMentions for English, QUAERO for French, and SPACCC for Spanish. Evaluating data efficiency, we show that SynCABEL reaches the performance of full human supervision using up to 60% less annotated data, substantially reducing reliance on labor-intensive and costly expert labeling. Finally, acknowledging that standard evaluation based on exact code matching often underestimates clinically valid predictions due to ontology redundancy, we introduce an LLM-as-a-judge protocol. This analysis reveals that SynCABEL significantly improves the rate of clinically valid predictions. Our synthetic datasets, models, and code are released to support reproducibility and future research.
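[19] Strong Reasoning Isn’t Enough: Evaluating Evidence Elicitation in Interactive Diagnosis cs.CL
Zhuohan Long, Zhijie Bao, Zhongyu Wei
TL;DR: This paper proposes an interactive evaluation framework for medical consultation that models the consultation process with a simulated patient and a simulated reporter grounded in atomic evidence, and introduces the Information Coverage Rate (ICR) to quantify how completely an agent gathers the necessary evidence during interaction. The authors build EviMed, an evidence-based benchmark, and evaluate 10 models with varying reasoning abilities, finding that strong diagnostic reasoning does not guarantee effective information collection, which is the primary bottleneck limiting interactive performance. To address this, they propose REFINE, a strategy that uses diagnostic verification to guide the agent in proactively resolving uncertainty. Experiments show that REFINE outperforms baselines across multiple datasets and facilitates effective model collaboration.
Details
Motivation: Existing evaluations are mostly static or outcome-centric and neglect the interactive evidence-gathering process in medical consultation, so an evaluation framework is needed that explicitly models the consultation process and quantifies the completeness of evidence gathering.
Result: An evaluation of 10 models on the EviMed benchmark shows that models with strong diagnostic reasoning still fall short in information collection. The proposed REFINE strategy consistently outperforms baselines across multiple datasets and enables smaller models to achieve superior performance under strong reasoning supervision.
Insight: The contributions are an interactive evaluation framework and the ICR metric for quantifying evidence gathering, the finding that reasoning ability and information-collection effectiveness are decoupled, and the REFINE strategy, which uses diagnostic verification to actively guide information collection. Together these offer a new approach to evaluating and optimizing interactive AI systems.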
Abstract: Interactive medical consultation requires an agent to proactively elicit missing clinical evidence under uncertainty. Yet existing evaluations largely remain static or outcome-centric, neglecting the evidence-gathering process. In this work, we propose an interactive evaluation framework that explicitly models the consultation process using a simulated patient and a simulated reporter grounded in atomic evidence. Based on this representation, we introduce Information Coverage Rate (ICR) to quantify how completely an agent uncovers necessary evidence during interaction. To support systematic study, we build EviMed, an evidence-based benchmark spanning diverse conditions from common complaints to rare diseases, and evaluate 10 models with varying reasoning abilities. We find that strong diagnostic reasoning does not guarantee effective information collection, and this insufficiency acts as a primary bottleneck limiting performance in interactive settings. To address this, we propose REFINE, a strategy that leverages diagnostic verification to guide the agent in proactively resolving uncertainties. Extensive experiments demonstrate that REFINE consistently outperforms baselines across diverse datasets and facilitates effective model collaboration, enabling smaller agents to achieve superior performance under strong reasoning supervision. Our code can be found at https://github.com/NanshineLoong/EID-Benchmark.
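The abstract describes ICR as a measure of how completely an agent uncovers necessary evidence. A minimal sketch of one plausible reading follows, assuming ICR is the fraction of required atomic evidence items that appear among the items the agent elicited. The function name, the set-based representation, and the example evidence labels are illustrative assumptions, not the definition used by the authors.

```python
def information_coverage_rate(required, elicited):
    """Fraction of required atomic evidence items the agent uncovered.

    Assumed reading of ICR; the paper's exact definition may differ.
    """
    required = set(required)
    if not required:
        raise ValueError("required evidence set must not be empty")
    return len(required & set(elicited)) / len(required)


# Hypothetical evidence labels, for illustration only.
required_evidence = {"fever", "cough_duration", "chest_pain", "smoking_history"}
elicited_evidence = {"fever", "cough_duration", "travel_history"}

icr = information_coverage_rate(required_evidence, elicited_evidence)
print(f"ICR = {icr:.2f}")  # 2 of 4 required items covered -> "ICR = 0.50"
```

Under this reading, questions that elicit evidence outside the required set (here `travel_history`) neither raise nor lower the score.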
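[20] LVLMs and Humans Ground Differently in Referential Communication cs.CL | cs.AI | cs.HC
Peter Zeng, Weiling Li, Amie Paige, Zhengxiang Wang, Panagiotis Kaliosis
TL;DR: Through a referential communication experiment, this paper compares humans and large vision-language models (LVLMs) on collaboratively matching pictures of objects that have no obvious lexicalized labels, revealing the limitations of LVLMs in interactively resolving referring expressions.
Details
Motivation: To address the limited ability of generative AI agents to accurately predict human intent when collaborating with human users, in particular their inability to model common ground.
Result: The experiment yields a corpus of 356 dialogues, analyzed for accuracy, efficiency, and lexical overlap. The results show that LVLMs have significant limitations in interactively resolving referring expressions.
Insight: The paper uses a multi-turn interactive experiment to expose the shortcomings of LVLMs in dynamic referential communication, underscores the importance of modeling common ground for AI collaboration, and releases its data collection pipeline and tools to support further research.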
Abstract: For generative AI agents to partner effectively with human users, the ability to accurately predict human intent is critical. But this ability to collaborate remains limited by a critical deficit: an inability to model common ground. Here, we present a referential communication experiment with a factorial design involving director-matcher pairs (human-human, human-AI, AI-human, and AI-AI) that interact with multiple turns in repeated rounds to match pictures of objects not associated with any obvious lexicalized labels. We release the online pipeline for data collection, the tools and analyses for accuracy, efficiency, and lexical overlap, and a corpus of 356 dialogues (89 pairs over 4 rounds each) that unmasks LVLMs’ limitations in interactively resolving referring expressions, a crucial skill that underlies human language use.
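[21] When Iterative RAG Beats Ideal Evidence: A Diagnostic Study in Scientific Multi-hop Question Answering cs.CL | cs.AI | cs.IR
Mahdi Astaraki, Mohammad Arshi Saloot, Ali Shiraee Kasmaee, Hamidreza Mahyar, Soheila Samiee
TL;DR: This paper presents the first mechanism-level diagnostic study comparing iterative retrieval-augmented generation (Iterative RAG) with RAG given ideal static evidence (Gold Context) in scientific multi-hop question answering. On the chemistry-focused ChemKGMultiHopQA dataset, Iterative RAG, which alternates retrieval, hypothesis refinement, and evidence-aware stopping, consistently outperforms static RAG that supplies all ideal evidence at once, with gains of up to 25.6 percentage points.
Details
Motivation: To determine when and why synchronized iterative retrieval and reasoning can surpass the static RAG upper bound of supplying all ideal evidence at once in scientific domains, which are characterized by multi-hop reasoning, sparse domain knowledge, and heterogeneous evidence, and thereby to clarify the settings and mechanisms in which Iterative RAG is effective.
Result: Tests of 11 SOTA LLMs on ChemKGMultiHopQA show that Iterative RAG consistently outperforms Gold Context RAG on questions requiring genuine retrieval, with gains of up to 25.6 percentage points, especially for non-reasoning fine-tuned models.
Insight: The paper offers the first controlled diagnostic comparison of Iterative RAG and ideal-evidence RAG, showing that staged retrieval reduces late-hop failures, mitigates context overload, and dynamically corrects early hypothesis drift. The core insight is that the staged retrieval process itself, not merely the presence of ideal evidence, is critical to performance. The work provides practical guidance for deploying and diagnosing RAG systems in specialized scientific settings and a foundation for more reliable, controllable iterative retrieval-reasoning frameworks.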
Abstract: Retrieval-Augmented Generation (RAG) extends large language models (LLMs) beyond parametric knowledge, yet it is unclear when iterative retrieval-reasoning loops meaningfully outperform static RAG, particularly in scientific domains with multi-hop reasoning, sparse domain knowledge, and heterogeneous evidence. We provide the first controlled, mechanism-level diagnostic study of whether synchronized iterative retrieval and reasoning can surpass an idealized static upper bound (Gold Context) RAG. We benchmark eleven state-of-the-art LLMs under three regimes: (i) No Context, measuring reliance on parametric memory; (ii) Gold Context, where all oracle evidence is supplied at once; and (iii) Iterative RAG, a training-free controller that alternates retrieval, hypothesis refinement, and evidence-aware stopping. Using the chemistry-focused ChemKGMultiHopQA dataset, we isolate questions requiring genuine retrieval and analyze behavior with diagnostics spanning retrieval coverage gaps, anchor-carry drop, query quality, composition fidelity, and control calibration. Across models, Iterative RAG consistently outperforms Gold Context, with gains up to 25.6 percentage points, especially for non-reasoning fine-tuned models. Staged retrieval reduces late-hop failures, mitigates context overload, and enables dynamic correction of early hypothesis drift, but remaining failure modes include incomplete hop coverage, distractor latch trajectories, early stopping miscalibration, and high composition failure rates even with perfect retrieval. Overall, staged retrieval is often more influential than the mere presence of ideal evidence; we provide practical guidance for deploying and diagnosing RAG systems in specialized scientific settings and a foundation for more reliable, controllable iterative retrieval-reasoning frameworks.
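The Iterative RAG regime is described as a training-free controller that alternates retrieval, hypothesis refinement, and evidence-aware stopping. The sketch below illustrates that control flow on a toy two-hop lookup. Everything in it is an assumption made for illustration: the `iterative_rag` function, the dictionary knowledge base, and the placeholder entity and relation names are hypothetical, and the real controller relies on an LLM to form queries and refine hypotheses.

```python
def iterative_rag(anchor, relations, retrieve, max_steps=5):
    """Alternate retrieval and hypothesis refinement, one hop at a time.

    Returns (answer, evidence); answer is None if any hop lacks evidence.
    """
    hypothesis, evidence = anchor, []
    for relation in relations[:max_steps]:
        fact = retrieve(hypothesis, relation)
        if fact is None:
            # Evidence-aware stop: nothing supports the next hop.
            break
        evidence.append((hypothesis, relation, fact))
        # Refine the working hypothesis with the newly retrieved fact.
        hypothesis = fact
    answered = len(evidence) == len(relations)
    return (hypothesis if answered else None), evidence


# Toy knowledge base with placeholder entities (hypothetical data).
TOY_KB = {("A", "r1"): "B", ("B", "r2"): "C"}


def toy_retrieve(entity, relation):
    """Stand-in retriever: exact lookup in the toy knowledge base."""
    return TOY_KB.get((entity, relation))


answer, trace = iterative_rag("A", ["r1", "r2"], toy_retrieve)
print(answer, trace)
```

Each hop conditions the next query on the hypothesis from the previous hop, which is the property the abstract credits with reducing late-hop failures. A Gold Context run would instead hand all facts to the model in a single prompt.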
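[22] Identifying and Transferring Reasoning-Critical Neurons: Improving LLM Inference Reliability via Activation Steering cs.CL
Fangan Dong, Zuming Yan, Xuri Ge, Zhiwei Xu, Mengqi Zhang
TL;DR: This paper proposes AdaRAS, a lightweight test-time framework that improves reasoning reliability by identifying and selectively intervening on reasoning-critical neurons in large language models. The method requires no additional training or expensive sampling and delivers consistent gains on multiple mathematics and coding benchmarks.
Details
Motivation: Although large language models have strong reasoning capabilities, achieving reliable performance on complex tasks usually requires post-training or computationally expensive sampling strategies, which limits practical efficiency. This paper aims to improve reasoning reliability with a lightweight, training-free method.
Result: Experiments on 10 mathematics and coding benchmarks show consistent improvements, including gains of over 13% on AIME-24 and AIME-25. The method outperforms post-training approaches without additional training or sampling cost and shows strong transferability across datasets and scalability to stronger models.
Insight: The key finding is that a small subset of neurons in LLMs is strongly predictive of reasoning correctness. Building on it, the paper proposes identifying reasoning-critical neurons with a polarity-aware mean-difference criterion, together with an adaptive activation-steering intervention that enhances incorrect reasoning traces at inference time while avoiding degradation on already-correct cases.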
Abstract: Despite the strong reasoning capabilities of recent large language models (LLMs), achieving reliable performance on challenging tasks often requires post-training or computationally expensive sampling strategies, limiting their practical efficiency. In this work, we first show that a small subset of neurons in LLMs exhibits strong predictive correlations with reasoning correctness. Based on this observation, we propose AdaRAS (Adaptive Reasoning Activation Steering), a lightweight test-time framework that improves reasoning reliability by selectively intervening on neuron activations. AdaRAS identifies Reasoning-Critical Neurons (RCNs) via a polarity-aware mean-difference criterion and adaptively steers their activations during inference, enhancing incorrect reasoning traces while avoiding degradation on already-correct cases. Experiments on 10 mathematics and coding benchmarks demonstrate consistent improvements, including over 13% gains on AIME-24 and AIME-25. Moreover, AdaRAS exhibits strong transferability across datasets and scalability to stronger models, outperforming post-training methods without additional training or sampling cost.
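The abstract names a polarity-aware mean-difference criterion for selecting Reasoning-Critical Neurons but does not spell it out. One plausible reading is sketched below: score each neuron by its mean activation on correct traces minus its mean on incorrect traces, rank by magnitude, and keep the sign as the steering direction. The function names, the fixed steering strength, and the toy activations are assumptions. The adaptive part of AdaRAS, which decides when and how strongly to steer, is not modeled here.

```python
from statistics import mean


def mean_difference_scores(correct_acts, incorrect_acts):
    """Per-neuron mean activation on correct traces minus mean on incorrect ones."""
    n_neurons = len(correct_acts[0])
    return [
        mean(row[j] for row in correct_acts) - mean(row[j] for row in incorrect_acts)
        for j in range(n_neurons)
    ]


def select_neurons(scores, k):
    """Top-k neurons by |score|; polarity +1 means 'more active when correct'."""
    ranked = sorted(range(len(scores)), key=lambda j: abs(scores[j]), reverse=True)
    return [(j, 1 if scores[j] > 0 else -1) for j in ranked[:k]]


def steer(activations, selected, strength):
    """Shift each selected neuron along its polarity by a fixed strength."""
    steered = list(activations)
    for j, polarity in selected:
        steered[j] += strength * polarity
    return steered


# Toy activations: rows are reasoning traces, columns are neurons.
correct = [[1.0, 0.2, 0.5], [0.8, 0.1, 0.5]]
incorrect = [[0.2, 0.9, 0.5], [0.4, 0.7, 0.5]]

scores = mean_difference_scores(correct, incorrect)
selected = select_neurons(scores, k=2)
print(selected)
print(steer([0.5, 0.5, 0.5], selected, strength=0.1))
```

In this toy data neuron 2 has identical activations in both groups, so it receives a score of zero and is never selected.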
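[23] Reflective Translation: Improving Low-Resource Machine Translation via Structured Self-Reflection cs.CL
Nicholas Cheng
TL;DR: This paper proposes Reflective Translation, a prompting framework for improving machine translation quality for low-resource languages. A large language model first generates an initial translation, then produces a structured self-critique, and finally uses that critique to generate a refined translation. On English-isiZulu and English-isiXhosa translation, the method consistently improves BLEU and COMET scores without fine-tuning.
Details
Motivation: To address the machine translation quality challenges facing low-resource languages such as isiZulu and isiXhosa, which stem from limited parallel data and linguistic resources.
Result: Experiments on OPUS-100 and NTREX-African show that, for English-isiZulu and English-isiXhosa translation, second-pass translations consistently improve over first-pass translations in both BLEU and COMET (average gains of up to +0.22 BLEU and +0.18 COMET), and statistical tests confirm that the improvements are robust.
Insight: The contribution is a structured application of LLM self-reflection to low-resource machine translation, in the form of a model-agnostic prompting framework that requires no fine-tuning. The reflection-augmented dataset it produces can support future supervised or analysis-driven work, and the results indicate that structured self-reflection is an effective mechanism for improving low-resource translation quality.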
Abstract: Low-resource languages such as isiZulu and isiXhosa face persistent challenges in machine translation due to limited parallel data and linguistic resources. Recent advances in large language models suggest that self-reflection, prompting a model to critique and revise its own outputs, can improve reasoning quality and factual consistency. Building on this idea, this paper introduces Reflective Translation, a prompt-based framework in which a model generates an initial translation, produces a structured self-critique, and then uses this reflection to generate a refined translation. The approach is evaluated on English-isiZulu and English-isiXhosa translation using OPUS-100 and NTREX-African, across multiple prompting strategies and confidence thresholds. Results show consistent improvements in both BLEU and COMET scores between first- and second-pass translations, with average gains of up to +0.22 BLEU and +0.18 COMET. Statistical significance testing using paired nonparametric tests confirms that these improvements are robust. The proposed method is model-agnostic, requires no fine-tuning, and introduces a reflection-augmented dataset that can support future supervised or analysis-driven work. These findings demonstrate that structured self-reflection is a practical and effective mechanism for improving translation quality in low-resource settings.
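The three-step procedure in the abstract (initial translation, structured self-critique, refined translation) maps naturally onto three model calls. The sketch below shows that flow with a stand-in model. The prompt wording, the `reflective_translate` function, and `fake_llm` are hypothetical and exist only to make the example runnable. The paper's actual prompts and its confidence thresholds are not reproduced.

```python
from typing import Callable


def reflective_translate(source: str, target_lang: str, llm: Callable[[str], str]) -> dict:
    """Draft, critique, then refine a translation using the same model."""
    draft = llm(f"Translate into {target_lang}:\n{source}")
    critique = llm(
        "Critique this translation. List errors in meaning, grammar, and fluency.\n"
        f"Source: {source}\nTranslation: {draft}"
    )
    refined = llm(
        f"Revise the translation into {target_lang} using the critique.\n"
        f"Source: {source}\nDraft: {draft}\nCritique: {critique}"
    )
    return {"draft": draft, "critique": critique, "refined": refined}


def fake_llm(prompt: str) -> str:
    """Stand-in model (not a real API) that tags which step it was asked for."""
    if prompt.startswith("Translate"):
        return "<draft>"
    if prompt.startswith("Critique"):
        return "<critique>"
    return "<refined>"


result = reflective_translate("Good morning", "isiZulu", fake_llm)
print(result)
```

Keeping the draft and critique alongside the refined output mirrors the reflection-augmented records the abstract mentions, and makes it possible to score first-pass and second-pass translations separately.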
cs.CV
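[24] Dynamic Mask-Based Backdoor Attack Against Vision AI Models: A Case Study on Mushroom Detection cs.CV
Zeineb Dridi, Jihen Bennaceur, Amine Ben Hassouna
TL;DR: This paper proposes a novel dynamic mask-based backdoor attack against object detection models. By embedding a dynamically placed malicious trigger into a mushroom detection dataset, it demonstrates the practical risk in a critical real-world domain. The method uses the SAM segmentation model to generate masks for dynamic trigger placement, keeping the YOLOv7 model highly accurate on clean data while achieving a high attack success rate on poisoned samples.
Details
Motivation: Deep learning models deployed for computer vision tasks face a growing threat from backdoor attacks, and conventional backdoor injection based on static patterns has limitations. This paper aims to design a stealthier dynamic mask-based backdoor attack and, using mushroom detection as a critical real-world case, to expose the serious risk that training data may be poisoned when model training is outsourced.
Result: Extensive experiments with the YOLOv7 object detection model show that the method maintains high accuracy on clean data while achieving high attack success rates on poisoned samples, outperforming conventional backdoor injection based on static, consistent patterns.
Insight: The main innovation is using an advanced image segmentation model (SAM) to generate masks for dynamic, adaptive trigger placement, yielding a stealthier and more flexible backdoor attack. This opens a new direction for backdoor attack research and underscores the urgency of developing robust defenses against such dynamic, stealthy attacks.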
Abstract: Deep learning has revolutionized numerous tasks within the computer vision field, including image classification, image segmentation, and object detection. However, the increasing deployment of deep learning models has exposed them to various adversarial attacks, including backdoor attacks. This paper presents a novel dynamic mask-based backdoor attack method, specifically designed for object detection models. We exploit a dataset poisoning technique to embed a malicious trigger, rendering any models trained on this compromised dataset vulnerable to our backdoor attack. We particularly focus on a mushroom detection dataset to demonstrate the practical risks posed by such attacks on critical real-life domains. Our work also emphasizes the importance of creating a detailed backdoor attack scenario to illustrate the significant risks associated with the outsourcing practice. Our approach leverages SAM, a recent and powerful image segmentation AI model, to create masks for dynamic trigger placement, introducing a new and stealthy attack method. Through extensive experimentation, we show that our sophisticated attack scenario maintains high accuracy on clean data with the YOLOv7 object detection model while achieving high attack success rates on poisoned samples. Our approach surpasses traditional methods for backdoor injection, which are based on static and consistent patterns. Our findings underscore the urgent need for robust countermeasures to protect deep learning models from these evolving adversarial threats.
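[25] SelfieAvatar: Real-time Head Avatar reenactment from a Selfie Video cs.CV
Wei Liang, Hui Yu, Derui Ding, Rachael E. Jack, Philippe G. Schyns
TL;DR: This paper proposes SelfieAvatar, a method for generating animatable, high-fidelity head avatars in real time from a monocular selfie video. It combines a 3D Morphable Model (3DMM) with a StyleGAN-based generator and uses mixed loss functions with adversarial training to recover high-frequency details, achieving detailed reconstruction of the entire head, including non-facial regions and the background.
Details
Motivation: Existing 3DMM-based methods struggle to capture the entire head (including non-facial regions and background details) in real time, while GAN-based methods are limited in reproducing fine-grained head details such as wrinkles and hair textures and usually rely on large amounts of training data. This paper addresses these issues, focusing on high-quality avatar reenactment from only a simple selfie video.
Result: Qualitative and quantitative evaluations on self-reenactment and cross-reenactment tasks show that the proposed method achieves better head avatar reconstruction than existing methods, with richer and more intricate textures.
Insight: The main innovation is combining a 3DMM with a StyleGAN generator and proposing a detailed reconstruction model whose mixed loss functions handle both foreground reconstruction and avatar image generation during adversarial training, effectively recovering high-frequency details. The method is also advantageous in data efficiency (it needs only a selfie video) and in real-time operation.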
Abstract: Head avatar reenactment focuses on creating animatable personal avatars from monocular videos, serving as a foundational element for applications like social signal understanding, gaming, human-machine interaction, and computer vision. Recent advances in 3D Morphable Model (3DMM)-based facial reconstruction methods have achieved remarkably high-fidelity face estimation. However, they struggle to capture the entire head, including non-facial regions and background details, in real time, which is essential for producing realistic, high-fidelity head avatars. Meanwhile, recent approaches leveraging generative adversarial networks (GANs) for head avatar generation from videos can achieve high-quality reenactments but encounter limitations in reproducing fine-grained head details, such as wrinkles and hair textures. In addition, existing methods generally rely on a large amount of training data, and rarely focus on using only a simple selfie video to achieve avatar reenactment. To address these challenges, this study introduces a method for detailed head avatar reenactment using a selfie video. The approach combines 3DMMs with a StyleGAN-based generator. A detailed reconstruction model is proposed, incorporating mixed loss functions for foreground reconstruction and avatar image generation during adversarial training to recover high-frequency details. Qualitative and quantitative evaluations on self-reenactment and cross-reenactment tasks demonstrate that the proposed method achieves superior head avatar reconstruction with rich and intricate textures compared to existing approaches.
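[26] On the Role of Depth in Surgical Vision Foundation Models: An Empirical Study of RGB-D Pre-training cs.CV
John J. Han, Adam Schmidt, Muhammad Abdullah Jamal, Chinedu Nwoye, Anita Rau
TL;DR: This paper presents a large-scale empirical study of ViT-based surgical vision foundation models, focusing on the role of depth (RGB-D) pre-training. It compares eight models that differ in pre-training domain, learning objective, and input modality (RGB vs. RGB-D), pre-trains on 1.4 million surgical images paired with depth maps, and evaluates on eight surgical datasets spanning detection, segmentation, depth estimation, and pose estimation.
Details
Motivation: Current surgical vision foundation models rely mainly on unimodal RGB pre-training and overlook the complex 3D geometry inherent to surgical environments. Although architectures that support multimodal or geometry-aware inputs exist in general computer vision, the benefits of incorporating depth in surgical settings remain underexplored.
Result: Models with explicit geometric tokenization, such as MultiMAE, substantially outperform unimodal baselines on all tasks. Geometry-aware pre-training yields remarkable data efficiency: models fine-tuned on only 25% of the labeled data consistently surpass RGB-only models trained on the full dataset. These gains require no architectural changes or added runtime overhead at inference, since depth is used only during pre-training.
Insight: The core contribution is a systematic demonstration that depth-aware pre-training is broadly effective for surgical vision tasks, especially in the data efficiency it brings. A key practical insight is that this multimodal pre-training improves performance without increasing inference cost, offering a viable path toward more capable surgical vision systems.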
Abstract: Vision foundation models (VFMs) have emerged as powerful tools for surgical scene understanding. However, current approaches predominantly rely on unimodal RGB pre-training, overlooking the complex 3D geometry inherent to surgical environments. Although several architectures support multimodal or geometry-aware inputs in general computer vision, the benefits of incorporating depth information in surgical settings remain underexplored. We conduct a large-scale empirical study comparing eight ViT-based VFMs that differ in pre-training domain, learning objective, and input modality (RGB vs. RGB-D). For pre-training, we use a curated dataset of 1.4 million robotic surgical images paired with depth maps generated from an off-the-shelf network. We evaluate these models under both frozen-backbone and end-to-end fine-tuning protocols across eight surgical datasets spanning object detection, segmentation, depth estimation, and pose estimation. Our experiments yield several consistent findings. Models incorporating explicit geometric tokenization, such as MultiMAE, substantially outperform unimodal baselines across all tasks. Notably, geometric-aware pre-training enables remarkable data efficiency: models fine-tuned on just 25% of labeled data consistently surpass RGB-only models trained on the full dataset. Importantly, these gains require no architectural or runtime changes at inference; depth is used only during pre-training, making adoption straightforward. These findings suggest that multimodal pre-training offers a viable path towards building more capable surgical vision systems.
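[27] FreeOrbit4D: Training-Free Arbitrary Camera Redirection for Monocular Videos via Geometry-Complete 4D Reconstruction cs.CV | cs.AI | cs.GR
Wei Cao, Hao Zhang, Fengrui Tian, Yulun Wu, Yingying Li
TL;DR: This paper proposes FreeOrbit4D, a training-free framework for camera redirection of monocular videos along user-specified large-angle camera trajectories. Its core idea is to decouple foreground and background reconstruction to recover a geometry-complete 4D proxy (a static background plus geometry-complete foreground point clouds), which serves as structural grounding that guides a conditional video diffusion model to generate view-consistent redirected videos.
Details
Motivation: To address the inherently ill-posed problem of large-angle camera redirection for monocular videos. A monocular video provides only partial observations of the 4D world (3D space plus time), so under large-angle viewpoint changes far from the original trajectory, existing diffusion-based methods lack visual grounding and suffer severe geometric ambiguity and temporal inconsistency.
Result: Extensive experiments show that FreeOrbit4D produces more faithful, geometrically consistent redirected videos under challenging large-angle trajectories.
Insight: The core innovation is building a geometry-complete 4D proxy as structural grounding, through decoupled reconstruction and an object-centric multi-view diffusion model, which effectively alleviates geometric ambiguity in redirection. The method is training-free, and its 4D proxy opens new avenues for practical applications such as edit propagation and 4D data generation.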
Abstract: Camera redirection aims to replay a dynamic scene from a single monocular video under a user-specified camera trajectory. However, large-angle redirection is inherently ill-posed: a monocular video captures only a narrow spatio-temporal view of a dynamic 3D scene, providing highly partial observations of the underlying 4D world. The key challenge is therefore to recover a complete and coherent representation from this limited input, with consistent geometry and motion. While recent diffusion-based methods achieve impressive results, they often break down under large-angle viewpoint changes far from the original trajectory, where missing visual grounding leads to severe geometric ambiguity and temporal inconsistency. To address this, we present FreeOrbit4D, an effective training-free framework that tackles this geometric ambiguity by recovering a geometry-complete 4D proxy as structural grounding for video generation. We obtain this proxy by decoupling foreground and background reconstructions: we unproject the monocular video into a static background and geometry-incomplete foreground point clouds in a unified global space, then leverage an object-centric multi-view diffusion model to synthesize multi-view images and reconstruct geometry-complete foreground point clouds in canonical object space. By aligning the canonical foreground point cloud to the global scene space via dense pixel-synchronized 3D–3D correspondences and projecting the geometry-complete 4D proxy onto target camera viewpoints, we provide geometric scaffolds that guide a conditional video diffusion model. Extensive experiments show that FreeOrbit4D produces more faithful redirected videos under challenging large-angle trajectories, and our geometry-complete 4D proxy further opens a potential avenue for practical applications such as edit propagation and 4D data generation. Project page and code will be released soon.
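[28] Anatomically-aware conformal prediction for medical image segmentation with random walks cs.CV | cs.LG
Mélanie Gaillochet, Christian Desrosiers, Hervé Lombaert
TL;DR: This paper proposes Random-Walk Conformal Prediction (RW-CP), a model-agnostic framework for uncertainty quantification in medical image segmentation. It builds a k-nearest-neighbour graph from pre-trained vision foundation model features and applies a random walk to diffuse uncertainty, enforcing spatial coherence and producing anatomically valid prediction sets. RW-CP maintains rigorous marginal coverage guarantees while significantly improving segmentation quality.
Details
Motivation: Standard conformal prediction (CP) ignores anatomical context in medical image segmentation, producing prediction sets that are spatially incoherent and over-segmented and that have limited clinical utility. The aim is uncertainty quantification that is both statistically valid and anatomically meaningful.
Result: Evaluations on multi-modal public datasets show that, at an allowable error rate of α=0.1, RW-CP improves segmentation quality by up to 35.4% over standard CP baselines while maintaining rigorous marginal coverage guarantees.
Insight: The innovation is combining random-walk diffusion with conformal prediction: a graph built from pre-trained vision foundation model features regularizes the non-conformity scores, enforcing spatially coherent and anatomically plausible prediction sets. The result is a model-agnostic uncertainty quantification framework that yields more stable and continuous anatomical boundaries.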
Abstract: The reliable deployment of deep learning in medical imaging requires uncertainty quantification that provides rigorous error guarantees while remaining anatomically meaningful. Conformal prediction (CP) is a powerful distribution-free framework for constructing statistically valid prediction intervals. However, standard applications in segmentation often ignore anatomical context, resulting in fragmented, spatially incoherent, and over-segmented prediction sets that limit clinical utility. To bridge this gap, this paper proposes Random-Walk Conformal Prediction (RW-CP), a model-agnostic framework which can be added on top of any segmentation method. RW-CP enforces spatial coherence to generate anatomically valid sets. Our method constructs a k-nearest neighbour graph from pre-trained vision foundation model features and applies a random walk to diffuse uncertainty. The random walk diffusion regularizes the non-conformity scores, making the prediction sets less sensitive to the conformal calibration parameter $λ$, ensuring more stable and continuous anatomical boundaries. RW-CP maintains rigorous marginal coverage while significantly improving segmentation quality. Evaluations on multi-modal public datasets show improvements of up to $35.4\%$ compared to standard CP baselines, given an allowable error rate of $α=0.1$.
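The diffusion step described in the abstract (a k-nearest-neighbour graph over features, then a random walk that smooths non-conformity scores) can be illustrated with a small pure-Python sketch. The uniform transition weights, the lazy-walk update with mixing factor `alpha`, the function names, and the toy 2-D feature vectors are all assumptions. The paper uses features from a pre-trained vision foundation model, and its exact transition matrix and calibration procedure are not given in the abstract.

```python
import math


def knn_graph(features, k):
    """For each node, the indices of its k nearest neighbours (Euclidean, self excluded)."""
    neighbours = []
    for i, fi in enumerate(features):
        dists = sorted(
            (math.dist(fi, fj), j) for j, fj in enumerate(features) if j != i
        )
        neighbours.append([j for _, j in dists[:k]])
    return neighbours


def random_walk_diffuse(scores, neighbours, steps=10, alpha=0.5):
    """Lazy random walk: each step mixes a node's score with its neighbours' mean."""
    s = list(scores)
    for _ in range(steps):
        s = [
            (1 - alpha) * s[i] + alpha * sum(s[j] for j in nbrs) / len(nbrs)
            for i, nbrs in enumerate(neighbours)
        ]
    return s


# Toy data: two well-separated feature clusters, one noisy score in the first.
features = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
scores = [0.1, 0.9, 0.1, 0.8, 0.9, 0.8]

graph = knn_graph(features, k=2)
smoothed = random_walk_diffuse(scores, graph)
print([round(v, 3) for v in smoothed])
```

Because neighbours are chosen in feature space, the outlying score in the first cluster is pulled toward its cluster's average while the second cluster is unaffected by it. This is the kind of spatial regularization the abstract attributes to the random walk.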
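[29] NuiWorld: Exploring a Scalable Framework for End-to-End Controllable World Generation cs.CV
Han-Hung Lee, Cheng-Yu Yang, Yu-Lun Liu, Angel X. Chang
TL;DR: NuiWorld is a framework for end-to-end controllable world generation that aims to address the controllability, scalability, and efficiency challenges of existing methods. It uses a generative bootstrapping strategy to synthesize diverse scene data from a few input images, and represents scenes with variable scene chunks and a flattened vector-set representation to improve geometric fidelity and computational efficiency for large scenes.
Details
Motivation: Existing world generation methods face three obstacles, namely controllability, scalability, and efficiency: end-to-end models are limited by data scarcity, object-centric methods rely on fixed-resolution representations that degrade fidelity for large scenes, and training-free methods are slow and computationally expensive at inference.
Result: The framework synthesizes diverse scene data through generative bootstrapping, supports controllable generation via pseudo sketch labels, and shows a degree of generalization to unseen sketches. Its scene representation significantly reduces token length for large scenes, improving training and inference efficiency.
Insight: Contributions include a generative bootstrapping strategy that mitigates data scarcity, variable scene chunks with a flattened vector-set representation for efficient handling of large scenes, and pseudo sketch labels for improved controllability, together offering a new approach to scalable world generation.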
Abstract: World generation is a fundamental capability for applications like video games, simulation, and robotics. However, existing approaches face three main obstacles: controllability, scalability, and efficiency. End-to-end scene generation models have been limited by data scarcity. Object-centric generation approaches rely on fixed-resolution representations, degrading fidelity for larger scenes. Training-free approaches, while flexible, are often slow and computationally expensive at inference time. We present NuiWorld, a framework that attempts to address these challenges. To overcome data scarcity, we propose a generative bootstrapping strategy that starts from a few input images. Leveraging recent 3D reconstruction and expandable scene generation techniques, we synthesize scenes of varying sizes and layouts, producing enough data to train an end-to-end model. Furthermore, our framework enables controllability through pseudo sketch labels, and demonstrates a degree of generalization to previously unseen sketches. Our approach represents scenes as a collection of variable scene chunks, which are compressed into a flattened vector-set representation. This significantly reduces the token length for large scenes, enabling consistent geometric fidelity across scene sizes while improving training and inference efficiency.
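[30] Pixel-Grounded Retrieval for Knowledgeable Large Multimodal Models cs.CV | cs.AI
Jeonghwan Kim, Renjie Tao, Sanat Sharma, Jiaqi Wang, Kai Sun
TL;DR: This paper proposes PixSearch, the first end-to-end segmenting large multimodal model (LMM), which unifies region-level perception and retrieval-augmented reasoning. The model, by generating
Details
Motivation: Existing multimodal retrieval-augmented generation systems lack an internal policy for deciding when and how to retrieve. The aim is to couple fine-grained perception more effectively with factual knowledge beyond the image.
Result: On the CRAG-MM benchmark, accuracy improves by a relative 19.7% over whole-image retrieval, while reasoning performance remains competitive on a range of VQA and text-only QA tasks.
Insight: The innovation is unifying retrieval triggering, query-modality selection, and pixel-level mask generation in a single end-to-end model. A two-stage supervised fine-tuning strategy teaches retrieval timing and query selection while preserving segmentation ability, simplifying the complex pipelines of conventional multimodal RAG.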
Abstract: Visual Question Answering (VQA) often requires coupling fine-grained perception with factual knowledge beyond the input image. Prior multimodal Retrieval-Augmented Generation (MM-RAG) systems improve factual grounding but lack an internal policy for when and how to retrieve. We propose PixSearch, the first end-to-end Segmenting Large Multimodal Model (LMM) that unifies region-level perception and retrieval-augmented reasoning. During encoding, PixSearch emits
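[31] m2sv: A Scalable Benchmark for Map-to-Street-View Spatial Reasoning cs.CV | cs.AI
Yosub Shin, Michael Buriek, Igor Molybog
TL;DR: This paper introduces m2sv, a benchmark for evaluating vision-language models on map-to-street-view spatial reasoning: inferring camera viewing direction by aligning a north-up overhead map with a Street View image of the same real-world intersection. It includes m2sv-20k, a geographically diverse dataset with controlled ambiguity, and m2sv-sft-11k, a set of structured reasoning traces for supervised fine-tuning.
Details
Motivation: Existing vision-language models perform well on multimodal benchmarks but remain brittle on spatial reasoning tasks that require aligning abstract overhead representations with egocentric views, so a scalable benchmark is needed to systematically evaluate and improve models on such tasks.
Result: On m2sv, the best vision-language model reaches only 65.2% accuracy, far below the human baseline of 95%. Supervised fine-tuning and reinforcement learning yield consistent gains, but cross-benchmark evaluation shows limited transfer.
Insight: The contribution is a scalable, geographically diverse benchmark with controlled ambiguity for map-to-street-view spatial reasoning, together with a systematic analysis of persistent gaps in geometric alignment, evidence aggregation, and reasoning consistency, which points to directions for research on grounded spatial reasoning across viewpoints.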
Abstract: Vision–language models (VLMs) achieve strong performance on many multimodal benchmarks but remain brittle on spatial reasoning tasks that require aligning abstract overhead representations with egocentric views. We introduce m2sv, a scalable benchmark for map-to-street-view spatial reasoning that asks models to infer camera viewing direction by aligning a north-up overhead map with a Street View image captured at the same real-world intersection. We release m2sv-20k, a geographically diverse benchmark with controlled ambiguity, along with m2sv-sft-11k, a curated set of structured reasoning traces for supervised fine-tuning. Despite strong performance on existing multimodal benchmarks, the best evaluated VLM achieves only 65.2% accuracy on m2sv, far below the human baseline of 95%. While supervised fine-tuning and reinforcement learning yield consistent gains, cross-benchmark evaluations reveal limited transfer. Beyond aggregate accuracy, we systematically analyze difficulty in map-to-street-view reasoning using both structural signals and human effort, and conduct an extensive failure analysis of adapted open models. Our findings highlight persistent gaps in geometric alignment, evidence aggregation, and reasoning consistency, motivating future work on grounded spatial reasoning across viewpoints.
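[32] Glance and Focus Reinforcement for Pan-cancer Screening cs.CV
Linshan Wu, Jiaxin Zhuang, Hao Chen
TL;DR: This paper proposes GF-Screen, a reinforcement learning framework for pan-cancer screening in large-scale CT scans. Mimicking the glance-and-focus diagnostic strategy of radiologists, a Glance model localizes diseased regions and crops sub-volumes, a Focus model segments them precisely, and the segmentation results are used to reward the Glance model via reinforcement learning, improving lesion localization.
Details
Motivation: Existing AI methods struggle with pan-cancer screening in large-scale CT scans, mainly because of the difficulty of localizing many types of tiny lesions in large CT volumes. Extreme foreground-background imbalance prevents models from focusing on diseased regions, while redundant attention to healthy regions both lowers efficiency and increases false positives.
Result: Extensive experiments on 16 internal and 7 external datasets covering 9 lesion types demonstrate the effectiveness of GF-Screen. GF-Screen leads the public validation leaderboard of the MICCAI FLARE25 pan-cancer challenge, surpassing the FLARE24 champion solution by a large margin (+25.6% DSC and +28.2% NSD).
Insight: Contributions include: 1) the first effective application of cutting-edge reinforcement learning techniques to the specific challenges of pan-cancer screening; 2) a novel group relative learning paradigm that uses within-group relative comparison to prioritize high-advantage predictions and discard low-advantage ones, improving efficiency and reducing false positives; 3) joint optimization of the Glance and Focus models, achieved by rewarding the non-differentiable selection operation with segmentation results through reinforcement learning.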
Abstract: Pan-cancer screening in large-scale CT scans remains challenging for existing AI methods, primarily due to the difficulty of localizing diverse types of tiny lesions in large CT volumes. The extreme foreground-background imbalance significantly hinders models from focusing on diseased regions, while redundant focus on healthy regions not only decreases the efficiency but also increases false positives. Inspired by radiologists’ glance and focus diagnostic strategy, we introduce GF-Screen, a Glance and Focus reinforcement learning framework for pan-cancer screening. GF-Screen employs a Glance model to localize the diseased regions and a Focus model to precisely segment the lesions, where segmentation results of the Focus model are leveraged to reward the Glance model via Reinforcement Learning (RL). Specifically, the Glance model crops a group of sub-volumes from the entire CT volume and learns to select the sub-volumes with lesions for the Focus model to segment. Given that the selecting operation is non-differentiable for segmentation training, we propose to employ the segmentation results to reward the Glance model. To optimize the Glance model, we introduce a novel group relative learning paradigm, which employs group relative comparison to prioritize high-advantage predictions and discard low-advantage predictions within sub-volume groups, not only improving efficiency but also reducing false positives. In this way, for the first time, we effectively extend cutting-edge RL techniques to tackle the specific challenges in pan-cancer screening. Extensive experiments on 16 internal and 7 external datasets across 9 lesion types demonstrated the effectiveness of GF-Screen. Notably, GF-Screen leads the public validation leaderboard of MICCAI FLARE25 pan-cancer challenge, surpassing the FLARE24 champion solution by a large margin (+25.6% DSC and +28.2% NSD).
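The abstract says the Glance model is optimized with a group relative comparison that prioritizes high-advantage predictions and discards low-advantage ones within a group of sub-volumes. One common way to realize such a comparison is to normalize each reward against the group mean and standard deviation. The sketch below takes that approach as an assumption. The function names, the zero threshold, and the toy rewards (standing in for segmentation quality from the Focus model) are hypothetical and may not match the update rule used by the authors.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group: (reward - group mean) / group std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def keep_high_advantage(sub_volumes, rewards, threshold=0.0):
    """Keep sub-volumes whose advantage exceeds the threshold; drop the rest."""
    advantages = group_relative_advantages(rewards)
    return [v for v, a in zip(sub_volumes, advantages) if a > threshold]


# Toy group: four cropped sub-volumes with hypothetical segmentation rewards.
sub_volumes = ["v0", "v1", "v2", "v3"]
rewards = [0.0, 0.1, 0.8, 0.7]

advantages = group_relative_advantages(rewards)
print([round(a, 2) for a in advantages])
print(keep_high_advantage(sub_volumes, rewards))  # ['v2', 'v3']
```

Because advantages are relative to the group, a sub-volume is kept only if it scores better than its peers from the same CT volume, which is one way to suppress attention to healthy regions without an absolute reward cutoff.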
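[33] QA-ReID: Quality-Aware Query-Adaptive Convolution Leveraging Fused Global and Structural Cues for Clothes-Changing ReID cs.CV
Yuxiang Wang, Kunming Jiang, Tianxiang Zhang, Ke Tian, Gaozhe Jiang
TL;DR: This paper proposes QA-ReID, a quality-aware dual-branch matching method for clothes-changing person re-identification (CC-ReID), where clothing changes cause drastic appearance variation. It jointly uses RGB-based features and parsing-based representations to model global appearance and clothing-invariant structural cues, respectively, and adaptively fuses these heterogeneous features with a multi-modal attention module. At the matching stage, it further introduces Quality-Aware Query Adaptive Convolution (QAConv-QA), which uses pixel-level importance weighting and bidirectional consistency constraints to improve robustness to clothing changes.
Details
Motivation: To address the severe challenge that conventional person re-identification faces in clothes-changing scenarios, where the substantial appearance variation introduced by clothing changes degrades recognition performance.
Result: Achieves state-of-the-art performance on multiple benchmarks, including PRCC, LTCC, and VC-Clothes, and significantly outperforms existing methods in cross-clothing scenarios.
Insight: The innovations are dual-branch modeling that combines global appearance with clothing-invariant structural cues, and adaptive fusion of heterogeneous features through multi-modal attention. The Quality-Aware Query Adaptive Convolution (QAConv-QA) introduced at the matching stage improves robustness to clothing changes through pixel-level weighting and bidirectional consistency constraints, offering a reusable fusion and adaptive matching strategy for cross-clothing person re-identification.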
Abstract: Unlike conventional person re-identification (ReID), clothes-changing ReID (CC-ReID) presents severe challenges due to substantial appearance variations introduced by clothing changes. In this work, we propose the Quality-Aware Dual-Branch Matching (QA-ReID), which jointly leverages RGB-based features and parsing-based representations to model both global appearance and clothing-invariant structural cues. These heterogeneous features are adaptively fused through a multi-modal attention module. At the matching stage, we further design the Quality-Aware Query Adaptive Convolution (QAConv-QA), which incorporates pixel-level importance weighting and bidirectional consistency constraints to enhance robustness against clothing variations. Extensive experiments demonstrate that QA-ReID achieves state-of-the-art performance on multiple benchmarks, including PRCC, LTCC, and VC-Clothes, and significantly outperforms existing approaches under cross-clothing scenarios.
[34] TFFM: Topology-Aware Feature Fusion Module via Latent Graph Reasoning for Retinal Vessel Segmentation cs.CVPDF
Iftekhar Ahmed, Shakib Absar, Aftar Ahmad Sami, Shadman Sakib, Debojyoti Biswas
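TL;DR: This paper proposes a Topology-Aware Feature Fusion Module (TFFM) for retinal vessel segmentation. It maps local features into a latent graph space and uses Graph Attention Networks to capture global structural dependencies, and it combines this with a hybrid loss (Tversky loss plus soft clDice loss) that explicitly penalizes topological disconnects, producing topologically coherent vessel segmentations.
Details
Motivation: Standard convolutional architectures produce topologically fragmented retinal vessel segmentations (gaps and discontinuities). These breaks make graph-based clinical analysis unreliable even when pixel-level accuracy is high.
Result: The method achieves SOTA performance on the Fundus-AVSeg dataset, with a combined Dice score of 90.97% and a 95% Hausdorff Distance of 3.50 pixels, and reduces vessel fragmentation by approximately 38% relative to baselines.
Insight: The novelty lies in building topological constraints explicitly into the segmentation framework, preserving vessel connectivity through latent graph reasoning and a hybrid loss. This offers a new approach for medical image segmentation tasks that must preserve structural coherence.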
Abstract: Precise segmentation of retinal arteries and veins supports the diagnosis of systemic cardiovascular conditions. However, standard convolutional architectures often yield topologically disjointed segmentations, characterized by gaps and discontinuities that render reliable graph-based clinical analysis impossible despite high pixel-level accuracy. To address this, we introduce a topology-aware framework engineered to maintain vascular connectivity. Our architecture incorporates a Topological Feature Fusion Module (TFFM) that maps local feature representations into a latent graph space, deploying Graph Attention Networks to capture global structural dependencies often missed by fixed receptive fields. Furthermore, we drive the learning process with a hybrid objective function, coupling Tversky loss for class imbalance with soft clDice loss to explicitly penalize topological disconnects. Evaluation on the Fundus-AVSeg dataset reveals state-of-the-art performance, achieving a combined Dice score of 90.97% and a 95% Hausdorff Distance of 3.50 pixels. Notably, our method decreases vessel fragmentation by approximately 38% relative to baselines, yielding topologically coherent vascular trees viable for automated biomarker quantification. We open-source our code at https://tffm-module.github.io/.
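As a rough illustration of the class-imbalance half of the hybrid objective, the sketch below implements the standard Tversky loss on flattened probability lists. The weights alpha and beta and the example inputs are assumptions chosen for illustration, not values from the paper, and the soft clDice term (which requires soft skeletonization) is not shown.

```python
def tversky_loss(pred, target, alpha=0.3, beta=0.7, eps=1e-7):
    """Tversky loss: 1 - TP / (TP + alpha * FP + beta * FN).

    pred holds foreground probabilities in [0, 1]; target holds 0/1 labels.
    alpha weights false positives and beta weights false negatives; the
    defaults here are illustrative assumptions.
    """
    tp = sum(p * t for p, t in zip(pred, target))
    fp = sum(p * (1 - t) for p, t in zip(pred, target))
    fn = sum((1 - p) * t for p, t in zip(pred, target))
    index = (tp + eps) / (tp + alpha * fp + beta * fn + eps)
    return 1.0 - index


# Hypothetical 4-pixel example: the target has two vessel pixels.
target = [1, 1, 0, 0]
loss_perfect = tversky_loss([1, 1, 0, 0], target)
loss_missed_vessel = tversky_loss([1, 0, 0, 0], target)
loss_false_alarm = tversky_loss([1, 1, 1, 0], target)
```

With alpha = beta = 0.5 the expression reduces to the Dice loss; choosing beta > alpha penalizes missed vessel pixels more heavily than false alarms, which is one common way to handle thin, sparse foreground structures.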
[35] SNR-Edit: Structure-Aware Noise Rectification for Inversion-Free Flow-Based Editing cs.CV | cs.AIPDF
Lifan Jiang, Boxi Wu, Yuhang Pei, Tianrun Wu, Yongyuan Chen
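TL;DR: SNR-Edit is a training-free framework for inversion-free image editing with flow-based generative models; it requires no model tuning and no inversion. Through structure-aware noise rectification, it injects segmentation constraints into the initial noise, correcting the stochastic component of the source trajectory, reducing trajectory drift, and enabling high-fidelity, structure-preserving edits.
Details
Motivation: Existing inversion-free, flow-based editing methods rely on fixed Gaussian noise to construct the source trajectory, which biases the trajectory dynamics and causes structural degradation or quality loss.
Result: On SD3 and FLUX, evaluated with PIE-Bench and SNR-Bench, SNR-Edit performs well on both pixel-level metrics and VLM-based scoring while adding only about 1 second of overhead per image.
Insight: The novelty is a lightweight adaptive noise-control mechanism: structure-aware noise rectification folds segmentation information into the initial noise, stabilizing the trajectory and preserving the structural integrity of the edited image without extra training or inversion.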
Abstract: Inversion-free image editing using flow-based generative models challenges the prevailing inversion-based pipelines. However, existing approaches rely on fixed Gaussian noise to construct the source trajectory, leading to biased trajectory dynamics and causing structural degradation or quality loss. To address this, we introduce SNR-Edit, a training-free framework achieving faithful Latent Trajectory Correction via adaptive noise control. Mechanistically, SNR-Edit uses structure-aware noise rectification to inject segmentation constraints into the initial noise, anchoring the stochastic component of the source trajectory to the real image’s implicit inversion position and reducing trajectory drift during source–target transport. This lightweight modification yields smoother latent trajectories and ensures high-fidelity structural preservation without requiring model tuning or inversion. Across SD3 and FLUX, evaluations on PIE-Bench and SNR-Bench show that SNR-Edit delivers strong performance on pixel-level metrics and VLM-based scoring, while adding only about 1s overhead per image.
[36] Contrastive Spectral Rectification: Test-Time Defense towards Zero-shot Adversarial Robustness of CLIP cs.CVPDF
Sen Nie, Jie Zhang, Zhuo Wang, Shiguang Shan, Xilin Chen
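TL;DR: This paper proposes Contrastive Spectral Rectification (CSR), an efficient test-time defense that improves the zero-shot adversarial robustness of vision-language models such as CLIP. By analyzing the feature inconsistency of adversarial examples under progressive frequency attenuation, and exploiting the model's inherent spectral bias, CSR optimizes a rectification perturbation under a spectral-guided contrastive objective to adaptively realign the input with the natural manifold.
Details
Motivation: Vision-language models such as CLIP generalize well in the zero-shot setting but are highly vulnerable to adversarial examples. Existing test-time defenses are not robust enough against strong attacks and are often limited by high inference latency and task-specific applicability.
Result: Extensive experiments on 16 classification benchmarks show that CSR outperforms the current SOTA by an average of 18.1% against strong AutoAttack, with modest inference overhead. CSR also shows broad applicability across diverse visual tasks.
Insight: The novelty lies in showing that adversarial examples exhibit severe feature inconsistency under progressive frequency attenuation, and in attributing this to the model's inherent spectral bias. Building on that, the paper proposes a spectral-guided contrastive objective for optimizing an input-adaptive rectification perturbation, giving an efficient test-time defense. Viewed objectively, combining spectral analysis with contrastive learning for adversarial defense is a novel idea, and its input-adaptive, low-overhead design has practical value.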
Abstract: Vision-language models (VLMs) such as CLIP have demonstrated remarkable zero-shot generalization, yet remain highly vulnerable to adversarial examples (AEs). While test-time defenses are promising, existing methods fail to provide sufficient robustness against strong attacks and are often hampered by high inference latency and task-specific applicability. To address these limitations, we start by investigating the intrinsic properties of AEs, which reveals that AEs exhibit severe feature inconsistency under progressive frequency attenuation. We further attribute this to the model’s inherent spectral bias. Leveraging this insight, we propose an efficient test-time defense named Contrastive Spectral Rectification (CSR). CSR optimizes a rectification perturbation to realign the input with the natural manifold under a spectral-guided contrastive objective, which is applied input-adaptively. Extensive experiments across 16 classification benchmarks demonstrate that CSR outperforms the SOTA by an average of 18.1% against strong AutoAttack with modest inference overhead. Furthermore, CSR exhibits broad applicability across diverse visual tasks. Code is available at https://github.com/Summu77/CSR.
[37] UniPCB: A Unified Vision-Language Benchmark for Open-Ended PCB Quality Inspection cs.CV | cs.AIPDF
Fuxiang Sun, Xi Jiang, Jiansheng Wu, Haigang Zhang, Feng Zheng
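TL;DR: This paper proposes UniPCB, the first unified vision-language benchmark for open-ended PCB quality inspection, and builds the PCB-GPT model on top of it. PCB-GPT uses a progressive curriculum that mimics how experts learn, and it substantially outperforms existing MLLMs on tasks such as fine-grained defect localization.
Details
Motivation: Existing MLLMs fall short in complex industrial scenarios such as PCB inspection, and no high-quality, unified vision-language benchmark exists for quantitatively evaluating them. The causes are data scarcity, fragmented datasets, and inconsistent standardization.
Result: On the UniPCB benchmark, PCB-GPT establishes a new performance baseline. On fine-grained defect localization it more than doubles the performance of the strongest competitors, with significant advantages in localization and analysis.
Insight: The contributions are UniPCB, the first unified PCB vision-language benchmark, and PCB-GPT, a model trained with a progressive curriculum that mimics the learning process of human experts to improve performance on domain-specific tasks.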
Abstract: Multimodal Large Language Models (MLLMs) show promise for general industrial quality inspection, but fall short in complex scenarios, such as Printed Circuit Board (PCB) inspection. PCB inspection poses unique challenges due to densely packed components, complex wiring structures, and subtle defect patterns that require specialized domain expertise. However, a high-quality, unified vision-language benchmark for quantitatively evaluating MLLMs across PCB inspection tasks remains absent, stemming not only from limited data availability but also from fragmented datasets and inconsistent standardization. To fill this gap, we propose UniPCB, the first unified vision-language benchmark for open-ended PCB quality inspection. UniPCB is built via a systematic pipeline that curates and standardizes data from disparate sources across three annotated scenarios. Furthermore, we introduce PCB-GPT, an MLLM trained on a new instruction dataset generated by this pipeline, utilizing a novel progressive curriculum that mimics the learning process of human experts. Evaluations on the UniPCB benchmark show that while existing MLLMs falter on domain-specific tasks, PCB-GPT establishes a new baseline. Notably, it more than doubles the performance on fine-grained defect localization compared to the strongest competitors, with significant advantages in localization and analysis. We will release the instruction data, benchmark, and model to facilitate future research.
[38] Towards Pixel-Level VLM Perception via Simple Points Prediction cs.CVPDF
Tianhui Song, Haoyu Lu, Hao Yang, Lin Sui, Haoning Wu
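TL;DR: This paper proposes SimpleSeg, which gives Multimodal Large Language Models (MLLMs) native pixel-level perception by recasting segmentation as a simple sequence generation problem. The model predicts sequences of points (textual coordinates) that describe object boundaries directly in language space, and a two-stage training pipeline (SF→RL) uses reinforcement learning with an IoU-based reward to refine the point sequences so that they match ground-truth contours. Experiments show that a standard MLLM architecture can unlock strong low-level perception without specialized design, matching or surpassing complex task-specific methods on segmentation benchmarks.
Details
Motivation: MLLMs lack native pixel-level perception. The paper challenges the reliance of existing methods on auxiliary components and explores a unified framework that achieves precise spatial understanding through simple point prediction.
Result: On segmentation benchmarks, SimpleSeg performs comparably to, or better than, methods that rely on complex task-specific designs.
Insight: The novelty lies in recasting segmentation as a point-sequence generation problem that a language model can handle directly, and in using reinforcement learning to optimize how well the sequence matches the contour. Viewed objectively, the method reveals the latent low-level perception capacity of standard MLLM architectures, achieving strong segmentation without specialized design and pointing toward more unified and capable vision-language models.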
Abstract: We present SimpleSeg, a strikingly simple yet highly effective approach to endow Multimodal Large Language Models (MLLMs) with native pixel-level perception. Our method reframes segmentation as a simple sequence generation problem: the model directly predicts sequences of points (textual coordinates) delineating object boundaries, entirely within its language space. To achieve high fidelity, we introduce a two-stage SF$\to$RL training pipeline, where Reinforcement Learning with an IoU-based reward refines the point sequences to accurately match ground-truth contours. We find that the standard MLLM architecture possesses a strong, inherent capacity for low-level perception that can be unlocked without any specialized architecture. On segmentation benchmarks, SimpleSeg achieves performance that is comparable to, and often surpasses, methods relying on complex, task-specific designs. This work demonstrates that precise spatial understanding can emerge from simple point prediction, challenging the prevailing need for auxiliary components and paving the way for more unified and capable VLMs. Homepage: https://simpleseg.github.io/
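To make the IoU-based reward concrete, the following sketch scores a predicted point sequence against a ground-truth contour by rasterizing both polygons on a pixel grid. It is a minimal illustration under stated assumptions: the even-odd ray-casting test, the pixel-center sampling, and every function name are hypothetical choices, not the paper's implementation.

```python
def point_in_polygon(x, y, polygon):
    """Even-odd ray-casting test for a point against a closed polygon."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Count edges that a horizontal ray from (x, y) towards +x crosses.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def rasterize(polygon, width, height):
    """Return the set of pixels whose centers fall inside the polygon."""
    return {
        (col, row)
        for row in range(height)
        for col in range(width)
        if point_in_polygon(col + 0.5, row + 0.5, polygon)
    }


def iou_reward(pred_polygon, gt_polygon, width, height):
    """Intersection over union of the two rasterized masks, in [0, 1]."""
    pred = rasterize(pred_polygon, width, height)
    gt = rasterize(gt_polygon, width, height)
    union = pred | gt
    return len(pred & gt) / len(union) if union else 0.0


# Hypothetical contours: two 4x4 squares offset by two pixels on an 8x8 grid.
gt_square = [(0, 0), (4, 0), (4, 4), (0, 4)]
pred_square = [(2, 0), (6, 0), (6, 4), (2, 4)]
reward = iou_reward(pred_square, gt_square, width=8, height=8)
```

Because the reward depends only on the enclosed region, two point sequences that trace the same contour with different vertex counts or starting points receive the same score, which is one reason a region-level reward suits sequence outputs.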
[39] VC-Bench: Pioneering the Video Connecting Benchmark with a Dataset and Evaluation Metrics cs.CV | cs.MMPDF
Zhiyu Yin, Zhipeng Liu, Kehai Chen, Lemao Liu, Jin Liu
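TL;DR: This paper introduces Video Connecting, a new task that generates smooth intermediate transition content between given start and end video clips. Because the task lacks a standardized evaluation benchmark, the authors build VC-Bench, which comprises a high-quality, diverse video dataset and a comprehensive set of evaluation metrics that go beyond conventional quality assessment.
Details
Motivation: Current video generation research focuses on text- or image-conditioned generation, but practical applications such as video editing and vlogging often need to connect separate clips seamlessly. The lack of a standardized evaluation benchmark has held this task back.
Result: The authors evaluate multiple state-of-the-art video generation models on VC-Bench. The results show that existing models have significant shortcomings in maintaining start-end consistency and transition smoothness, leading to lower overall coherence and fluidity.
Insight: The paper is the first to define the video connecting task and builds VC-Bench for it, with a high-quality dataset and comprehensive metrics (VQS, SECS, TSS). The benchmark goes beyond quality-only evaluation by emphasizing consistency and smoothness, giving future research a clear evaluation framework and direction.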
Abstract: While current video generation focuses on text or image conditions, practical applications like video editing and vlogging often need to seamlessly connect separate clips. In our work, we introduce Video Connecting, an innovative task that aims to generate smooth intermediate video content between given start and end clips. However, the absence of standardized evaluation benchmarks has hindered the development of this task. To bridge this gap, we propose VC-Bench, a novel benchmark specifically designed for video connecting. It includes 1,579 high-quality videos collected from public platforms, covering 15 main categories and 72 subcategories to ensure diversity and structure. VC-Bench focuses on three core aspects: Video Quality Score (VQS), Start-End Consistency Score (SECS), and Transition Smoothness Score (TSS). Together, they form a comprehensive framework that moves beyond conventional quality-only metrics. We evaluate multiple state-of-the-art video generation models on VC-Bench. Experimental results reveal significant limitations in maintaining start-end consistency and transition smoothness, leading to lower overall coherence and fluidity. We expect that VC-Bench will serve as a pioneering benchmark to inspire and guide future research in video connecting. The evaluation metrics and dataset are publicly available at: https://anonymous.4open.science/r/VC-Bench-1B67/.
[40] Beyond Shadows: A Large-Scale Benchmark and Multi-Stage Framework for High-Fidelity Facial Shadow Removal cs.CVPDF
Tailong Luo, Jiesong Bai, Jinyang Huang, Junyu Xia, Wangyu Wu
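TL;DR: This paper presents ASFW, the first large-scale real-world dataset for facial shadow removal, containing 1,081 paired shadow and shadow-free images created through a professional Photoshop workflow. It also proposes the Face Shadow Eraser (FSE) method to validate the dataset, and reports significantly improved shadow removal in real-world scenes.
Details
Motivation: Existing methods struggle to remove shadows while preserving texture under complex lighting, and they lack real-world paired datasets for training, which limits their performance.
Result: Deep models trained on ASFW show improved shadow removal under real-world conditions, setting a new standard for the task.
Insight: Building a large-scale, high-quality, real-world paired dataset through a professional workflow is the key contribution for addressing the data bottleneck in shadow removal. The proposed FSE method can serve as a baseline framework for validating the dataset.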
Abstract: Facial shadows often degrade image quality and the performance of vision algorithms. Existing methods struggle to remove shadows while preserving texture, especially under complex lighting conditions, and they lack real-world paired datasets for training. We present the Augmented Shadow Face in the Wild (ASFW) dataset, the first large-scale real-world dataset for facial shadow removal, containing 1,081 paired shadow and shadow-free images created via a professional Photoshop workflow. ASFW offers photorealistic shadow variations and accurate ground truths, bridging the gap between synthetic and real domains. Deep models trained on ASFW demonstrate improved shadow removal in real-world conditions. We also introduce the Face Shadow Eraser (FSE) method to showcase the effectiveness of the dataset. Experiments demonstrate that ASFW enhances the performance of facial shadow removal models, setting new standards for this task.
[41] Instance-Guided Radar Depth Estimation for 3D Object Detection cs.CV | cs.AIPDF
Chen-Chou Lo, Patrick Vandewalle
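TL;DR: This paper proposes an instance-guided Radar depth estimation framework for 3D object detection. It enhances monocular 3D detection with two key components: InstaRadar, an instance segmentation-guided expansion method that uses pre-trained segmentation masks to improve Radar density and semantic alignment, and the integration of the pre-trained RCDPT depth estimation model into the BEVDepth framework.
Details
Motivation: Monocular camera-based 3D detection suffers from depth ambiguity and reduced robustness under challenging conditions, while the sparsity and low resolution of Radar data limit its direct use in detection frameworks. This calls for effective Radar-camera fusion with improved preprocessing and depth estimation strategies.
Result: InstaRadar achieves state-of-the-art (SOTA) results in Radar-guided depth estimation. Integrating RCDPT into BEVDepth with InstaRadar-enhanced inputs consistently improves 3D detection performance, with steady gains over the baseline BEVDepth model.
Insight: The novelty lies in the instance segmentation-guided Radar enhancement method (InstaRadar), which produces a more structured representation, and in demonstrating the advantage of explicit depth supervision in 3D object detection. Viewed objectively, using Radar only as guidance rather than as an independent feature stream is a limitation of the current framework, and it points to directions for future work, such as extending to point cloud-like representations and adding a dedicated Radar branch with temporal cues.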
Abstract: Accurate depth estimation is fundamental to 3D perception in autonomous driving, supporting tasks such as detection, tracking, and motion planning. However, monocular camera-based 3D detection suffers from depth ambiguity and reduced robustness under challenging conditions. Radar provides complementary advantages such as resilience to poor lighting and adverse weather, but its sparsity and low resolution limit its direct use in detection frameworks. This motivates the need for effective Radar-camera fusion with improved preprocessing and depth estimation strategies. We propose an end-to-end framework that enhances monocular 3D object detection through two key components. First, we introduce InstaRadar, an instance segmentation-guided expansion method that leverages pre-trained segmentation masks to enhance Radar density and semantic alignment, producing a more structured representation. InstaRadar achieves state-of-the-art results in Radar-guided depth estimation, showing its effectiveness in generating high-quality depth features. Second, we integrate the pre-trained RCDPT into the BEVDepth framework as a replacement for its depth module. With InstaRadar-enhanced inputs, the RCDPT integration consistently improves 3D detection performance. Overall, these components yield steady gains over the baseline BEVDepth model, demonstrating the effectiveness of InstaRadar and the advantage of explicit depth supervision in 3D object detection. Although the framework lags behind Radar-camera fusion models that directly extract BEV features, since Radar serves only as guidance rather than an independent feature stream, this limitation highlights potential for improvement. Future work will extend InstaRadar to point cloud-like representations and integrate a dedicated Radar branch with temporal cues for enhanced BEV fusion.
[42] Innovator-VL: A Multimodal Large Language Model for Scientific Discovery cs.CV | cs.AIPDF
Zichen Wen, Boxue Yang, Shuang Chen, Yaojie Zhang, Yuhang Han
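TL;DR: This paper presents Innovator-VL, a multimodal large language model for scientific discovery. The model aims to improve understanding and reasoning across scientific domains while maintaining excellent performance on general vision tasks. Its core claim is that principled training design and a transparent methodology can deliver strong scientific intelligence with far less data than is conventional.
Details
Motivation: The current trend relies heavily on massive domain-specific pretraining and opaque pipelines. This paper aims to show that principled training design and transparent methodology can achieve strong scientific intelligence with substantially reduced data requirements, providing a practical foundation for efficient, reproducible scientific multimodal models.
Result: The model achieves competitive performance on a variety of scientific tasks using fewer than five million curated samples and no large-scale pretraining. It also shows strong generalization, with competitive performance on general vision, multimodal reasoning, and scientific benchmarks.
Insight: The claimed contributions are: 1) a fully transparent, end-to-end reproducible training pipeline; 2) remarkable data efficiency, emphasizing that effective reasoning can be achieved through principled data selection rather than indiscriminate scaling; 3) evidence that scientific alignment can be integrated into a unified model without compromising general-purpose capabilities. Viewed objectively, the emphasis on transparency, data efficiency, and a unified model design is a direction worth following.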
Abstract: We present Innovator-VL, a scientific multimodal large language model designed to advance understanding and reasoning across diverse scientific domains while maintaining excellent performance on general vision tasks. Contrary to the trend of relying on massive domain-specific pretraining and opaque pipelines, our work demonstrates that principled training design and transparent methodology can yield strong scientific intelligence with substantially reduced data requirements. (i) First, we provide a fully transparent, end-to-end reproducible training pipeline, covering data collection, cleaning, preprocessing, supervised fine-tuning, reinforcement learning, and evaluation, along with detailed optimization recipes. This facilitates systematic extension by the community. (ii) Second, Innovator-VL exhibits remarkable data efficiency, achieving competitive performance on various scientific tasks using fewer than five million curated samples without large-scale pretraining. These results highlight that effective reasoning can be achieved through principled data selection rather than indiscriminate scaling. (iii) Third, Innovator-VL demonstrates strong generalization, achieving competitive performance on general vision, multimodal reasoning, and scientific benchmarks. This indicates that scientific alignment can be integrated into a unified model without compromising general-purpose capabilities. Our practices suggest that efficient, reproducible, and high-performing scientific multimodal models can be built even without large-scale data, providing a practical foundation for future research.
[43] RoamScene3D: Immersive Text-to-3D Scene Generation via Adaptive Object-aware Roaming cs.CVPDF
Jisheng Chu, Wenrui Li, Rui Zhao, Wangmeng Zuo, Shifeng Chen
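TL;DR: RoamScene3D is a new framework for generating immersive 3D scenes from text. It uses adaptive object-aware roaming to address the spatial blindness of existing methods and their reliance on predefined trajectories. The method uses a vision-language model to build a scene graph that encodes object relations, guiding the camera to perceive salient object boundaries and plan an adaptive roaming trajectory. It also introduces a Motion-Injected Inpainting model, fine-tuned on a synthetic panoramic dataset, that adapts to camera motion and produces consistent, photorealistic scenes.
Details
Motivation: Existing text-to-3D scene generation methods based on 2D diffusion priors suffer from spatial blindness: they rely on predefined trajectories and cannot exploit the inner relationships among salient objects, so they fail to understand the semantic layout or adaptively infer occluded content. In addition, current inpainting models operate in 2D image space and struggle to plausibly fill holes caused by camera motion.
Result: Extensive experiments show that, with semantic reasoning and geometric constraints, the method significantly outperforms existing state-of-the-art (SOTA) approaches in producing consistent and photorealistic 3D scenes.
Insight: The novelty lies in combining semantic guidance with spatial generation: 1) a vision-language model builds a scene graph that encodes object relations and guides adaptive camera trajectory planning, giving the system an understanding of the scene's semantic layout; 2) a Motion-Injected Inpainting model, fine-tuned on a synthetic panoramic dataset that integrates authentic camera trajectories, adapts to camera motion and addresses the limitations of 2D priors under changing viewpoints.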
Abstract: Generating immersive 3D scenes from texts is a core task in computer vision, crucial for applications in virtual reality and game development. Despite the promise of leveraging 2D diffusion priors, existing methods suffer from spatial blindness and rely on predefined trajectories that fail to exploit the inner relationships among salient objects. Consequently, these approaches are unable to comprehend the semantic layout, preventing them from exploring the scene adaptively to infer occluded content. Moreover, current inpainting models operate in 2D image space, struggling to plausibly fill holes caused by camera motion. To address these limitations, we propose RoamScene3D, a novel framework that bridges the gap between semantic guidance and spatial generation. Our method reasons about the semantic relations among objects and produces consistent and photorealistic scenes. Specifically, we employ a vision-language model (VLM) to construct a scene graph that encodes object relations, guiding the camera to perceive salient object boundaries and plan an adaptive roaming trajectory. Furthermore, to mitigate the limitations of static 2D priors, we introduce a Motion-Injected Inpainting model that is fine-tuned on a synthetic panoramic dataset integrating authentic camera trajectories, making it adaptive to camera motion. Extensive experiments demonstrate that with semantic reasoning and geometric constraints, our method significantly outperforms state-of-the-art approaches in producing consistent and photorealistic scenes. Our code is available at https://github.com/JS-CHU/RoamScene3D.
[44] Dynamic Worlds, Dynamic Humans: Generating Virtual Human-Scene Interaction Motion in Dynamic Scenes cs.CVPDF
Yin Wang, Zhiying Leng, Haitian Liu, Frederick W. B. Li, Mu Li
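TL;DR: This paper proposes Dyn-HSI, the first cognitive architecture for human-scene interaction (HSI) generation in dynamic scenes. The method mimics the human perception-memory-control mechanism: dynamic scene-aware navigation, hierarchical experience memory, and diffusion-based interaction generation let virtual humans produce high-quality, context-aware interaction motions in dynamically changing environments.
Details
Motivation: Existing human-scene interaction generation methods typically treat the scene as static, which does not match the real world, where scenes change continuously. This paper aims to generate realistic, adaptive virtual human interaction motions in dynamic scenes.
Result: Extensive qualitative and quantitative experiments are conducted on the newly constructed dynamic benchmark Dyn-Scenes and on static scenes. The results show that Dyn-HSI generates high-quality human-scene interaction motions in both dynamic and static settings and consistently outperforms existing methods.
Insight: The novelty lies in bringing dynamic scene changes into the HSI generation framework for the first time and, inspired by world models, building a cognitive architecture that mimics human perception (visual navigation), memory (storing and using experience), and control (diffusion-based generation). This gives virtual humans environmental awareness and adaptability, improving motion quality and generalization.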
Abstract: Scenes are continuously undergoing dynamic changes in the real world. However, existing human-scene interaction generation methods typically treat the scene as static, which deviates from reality. Inspired by world models, we introduce Dyn-HSI, the first cognitive architecture for dynamic human-scene interaction, which endows virtual humans with three humanoid components. (1) Vision (human eyes): we equip the virtual human with Dynamic Scene-Aware Navigation, which continuously perceives changes in the surrounding environment and adaptively predicts the next waypoint. (2) Memory (human brain): we equip the virtual human with a Hierarchical Experience Memory, which stores and updates experiential data accumulated during training. This allows the model to leverage prior knowledge during inference for context-aware motion priming, thereby enhancing both motion quality and generalization. (3) Control (human body): we equip the virtual human with a Human-Scene Interaction Diffusion Model, which generates high-fidelity interaction motions conditioned on multimodal inputs. To evaluate performance in dynamic scenes, we extend the existing static human-scene interaction datasets to construct a dynamic benchmark, Dyn-Scenes. We conduct extensive qualitative and quantitative experiments to validate Dyn-HSI, showing that our method consistently outperforms existing approaches and generates high-quality human-scene interaction motions in both static and dynamic settings.
[45] Entropy-Guided k-Guard Sampling for Long-Horizon Autoregressive Video Generation cs.CVPDF
Yizhao Han, Tianxing Shi, Zhao Wang, Zifan Xu, Zhiyuan Pu
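TL;DR: This paper proposes Entropy-Guided k-Guard (ENkG) sampling, an adaptive sampling strategy for autoregressive video generation. Static top-k/top-p sampling degrades long-horizon quality because video tokens have low semantic density and high spatio-temporal redundancy. ENkG adjusts the number of candidate tokens dynamically according to the entropy of each token's predicted distribution, suppressing noise and reducing error accumulation.
Details
Motivation: Autoregressive architectures face a challenge in video generation: video tokens have low semantic density and high spatio-temporal redundancy, so the static top-k/top-p strategies borrowed from LLMs introduce unnecessary randomness in low-uncertainty regions, while in high-uncertainty regions early errors accumulate and severely degrade long-horizon quality.
Result: Experiments show that ENkG yields consistent improvements in perceptual quality and structural stability over static top-k/top-p strategies. The method is model-agnostic, training-free, and adds negligible computational overhead.
Insight: The novelty lies in quantifying token-wise dispersion with entropy and adapting the size of the sampling candidate set accordingly, balancing structural integrity against error propagation. This offers a new approach to sampling strategy design for video generation, and because it is model-agnostic and training-free it is easy to deploy in existing frameworks.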
Abstract: Autoregressive (AR) architectures have achieved significant successes in LLMs, inspiring explorations for video generation. In LLMs, top-p/top-k sampling strategies work exceptionally well: language tokens have high semantic density and low redundancy, so a fixed size of token candidates already strikes a balance between semantic accuracy and generation diversity. In contrast, video tokens have low semantic density and high spatio-temporal redundancy. This mismatch makes static top-k/top-p strategies ineffective for video decoders: they either introduce unnecessary randomness for low-uncertainty regions (static backgrounds) or get stuck in early errors for high-uncertainty regions (foreground objects). Prediction errors will accumulate as more frames are generated and eventually severely degrade long-horizon quality. To address this, we propose Entropy-Guided k-Guard (ENkG) sampling, a simple yet effective strategy that adapts sampling to token-wise dispersion, quantified by the entropy of each token’s predicted distribution. ENkG uses adaptive token candidate sizes: for low-entropy regions, it employs fewer candidates to suppress redundant noise and preserve structural integrity; for high-entropy regions, it uses more candidates to mitigate error compounding. ENkG is model-agnostic, training-free, and adds negligible overhead. Experiments demonstrate consistent improvements in perceptual quality and structural stability compared to static top-k/top-p strategies.
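The idea of adapting the candidate set to token-wise entropy can be sketched as follows. The abstract does not specify how entropy is mapped to a candidate count, so the linear mapping from normalized entropy to a value between k_min and k_max is an assumption, as are the function names and default bounds.

```python
import math
import random


def entropy(probs):
    """Shannon entropy (in nats) of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def adaptive_k(probs, k_min=1, k_max=50):
    """Map normalized entropy in [0, 1] linearly onto [k_min, k_max].

    The linear mapping is an illustrative assumption, not the paper's rule.
    """
    max_entropy = math.log(len(probs))
    ratio = entropy(probs) / max_entropy if max_entropy > 0 else 0.0
    k = k_min + round(ratio * (k_max - k_min))
    return max(k_min, min(k, k_max, len(probs)))


def sample_token(probs, rng, k_min=1, k_max=50):
    """Sample from the k most likely tokens, with k chosen by entropy."""
    k = adaptive_k(probs, k_min, k_max)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    weights = [probs[i] for i in top]  # rng.choices renormalizes the weights
    return rng.choices(top, weights=weights, k=1)[0]


# Hypothetical distributions: a confident one (static background) and a
# uniform one (uncertain foreground), each over a 4-token vocabulary.
peaked = [0.97, 0.01, 0.01, 0.01]
flat = [0.25, 0.25, 0.25, 0.25]
k_peaked = adaptive_k(peaked, k_min=1, k_max=4)
k_flat = adaptive_k(flat, k_min=1, k_max=4)
token = sample_token(peaked, random.Random(0), k_min=1, k_max=4)
```

Under this mapping the confident distribution collapses to a single candidate, so sampling there is effectively greedy, while the uniform distribution keeps all four candidates.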
[46] Bridging Information Asymmetry: A Hierarchical Framework for Deterministic Blind Face Restoration cs.CVPDF
Zhengjian Yao, Jiakui Hu, Kaiwen Li, Hangzhou He, Xinliang Zhang
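TL;DR: This paper proposes Pref-Restore, a hierarchical framework that addresses information asymmetry in blind face restoration. It integrates discrete semantic logic with continuous texture generation and uses two complementary strategies, augmenting input density and pruning the output distribution, to achieve deterministic restoration aligned with human preferences.
Details
Motivation: Current generative approaches to blind face restoration suffer from information asymmetry, the intrinsic disparity between information-sparse low-quality inputs and information-dense high-quality outputs. This leads to a one-to-many mapping, stochastic uncertainty, and hallucinated artifacts.
Result: Pref-Restore achieves state-of-the-art performance on synthetic and real-world benchmarks. Empirical analysis confirms that its preference-aligned strategy significantly reduces solution entropy, establishing a robust path toward reliable, deterministic blind restoration.
Insight: The novelty lies in using an auto-regressive integrator to increase the semantic density of the input and, for the first time, integrating on-policy reinforcement learning directly into the diffusion restoration loop, where human preferences act as differentiable constraints that prune the output distribution, reducing randomness and enabling deterministic restoration.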
Abstract: Blind face restoration remains a persistent challenge due to the inherent ill-posedness of reconstructing holistic structures from severely constrained observations. Current generative approaches, while capable of synthesizing realistic textures, often suffer from information asymmetry, the intrinsic disparity between the information-sparse low-quality inputs and the information-dense high-quality outputs. This imbalance leads to a one-to-many mapping, where insufficient constraints result in stochastic uncertainty and hallucinatory artifacts. To bridge this gap, we present Pref-Restore, a hierarchical framework that integrates discrete semantic logic with continuous texture generation to achieve deterministic, preference-aligned restoration. Our methodology fundamentally addresses this information disparity through two complementary strategies: (1) Augmenting Input Density: We employ an auto-regressive integrator to reformulate textual instructions into dense latent queries, injecting high-level semantic stability to constrain the degraded signals; (2) Pruning Output Distribution: We pioneer the integration of on-policy reinforcement learning directly into the diffusion restoration loop. By transforming human preferences into differentiable constraints, we explicitly penalize stochastic deviations, thereby sharpening the posterior distribution toward the desired high-fidelity outcomes. Extensive experiments demonstrate that Pref-Restore achieves state-of-the-art performance across synthetic and real-world benchmarks. Furthermore, empirical analysis confirms that our preference-aligned strategy significantly reduces solution entropy, establishing a robust pathway toward reliable and deterministic blind restoration.
[47] A Non-Invasive 3D Gait Analysis Framework for Quantifying Psychomotor Retardation in Major Depressive Disorder cs.CVPDF
Fouad Boutaleb, Emery Pierson, Mohamed Daoudi, Clémence Nineuil, Ali Amad
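TL;DR: This paper proposes a non-invasive 3D gait analysis framework that extracts clinically relevant gait kinematics from monocular RGB video to quantify psychomotor retardation (PMR) in Major Depressive Disorder (MDD). The framework combines Gravity-View Coordinates with a novel trajectory-correction algorithm to mitigate monocular depth errors, and identifies robust motor signatures from a small clinical dataset, detecting PMR with high accuracy and explaining variance in depression severity.
Details
Motivation: Clinical assessment of psychomotor retardation (PMR) in Major Depressive Disorder (MDD) is largely subjective, and existing 3D motion capture depends on specialized hardware that is hard to use in routine clinical practice. The paper aims to develop a non-invasive, objective, and interpretable quantification method based on monocular video.
Result: Validated on the CALYPSO dataset, the method detects PMR with 83.3% accuracy and explains 64% of the variance in overall depression severity (R^2=0.64). It also reveals a strong association between reduced ankle propulsion, restricted pelvic mobility, and the depressive motor phenotype.
Insight: The contributions include a non-invasive framework that converts monocular RGB video into clinically relevant 3D gait kinematics, combining Gravity-View Coordinates with a trajectory-correction algorithm based on closed-loop topology to mitigate depth errors, and a stability-based machine learning framework for small clinical datasets that extracts robust biomechanical features while preventing overfitting. Together they provide a transparent, scalable tool for objective monitoring of depression.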
Abstract: Predicting the status of Major Depressive Disorder (MDD) from objective, non-invasive methods is an active research field. Yet, automatically extracting objective, interpretable features for a detailed analysis of the patient state remains largely unexplored. Among MDD’s symptoms, psychomotor retardation (PMR) is a core item, yet its clinical assessment remains largely subjective. While 3D motion capture offers an objective alternative, its reliance on specialized hardware often precludes routine clinical use. In this paper, we propose a non-invasive computational framework that transforms monocular RGB video into clinically relevant 3D gait kinematics. Our pipeline uses Gravity-View Coordinates along with a novel trajectory-correction algorithm that leverages the closed-loop topology of our adapted Timed Up and Go (TUG) protocol to mitigate monocular depth errors. This novel pipeline enables the extraction of 297 explicit gait biomechanical biomarkers from a single camera capture. To address the challenges of small clinical datasets, we introduce a stability-based machine learning framework that identifies robust motor signatures while preventing overfitting. Validated on the CALYPSO dataset, our method achieves an 83.3% accuracy in detecting PMR and explains 64% of the variance in overall depression severity (R^2=0.64). Notably, our study reveals a strong link between the depressive motor phenotype and both reduced ankle propulsion and restricted pelvic mobility. These results demonstrate that physical movement serves as a robust proxy for the cognitive state, offering a transparent and scalable tool for the objective monitoring of depression in standard clinical environments.
[48] The S3LI Vulcano Dataset: A Dataset for Multi-Modal SLAM in Unstructured Planetary Environments cs.CV | cs.ROPDF
Riccardo Giubilato, Marcus Gerhard Müller, Marco Sewtz, Laura Alejandra Encinar Gonzalez, John Folkesson
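TL;DR: This paper releases the S3LI Vulcano dataset, a multi-modal dataset for developing and benchmarking Simultaneous Localization and Mapping (SLAM) and place recognition algorithms that rely on visual and LiDAR modalities. Several sequences were recorded on the volcanic island of Vulcano in the Aeolian Islands, Sicily, Italy, covering a variety of environments, textures, and terrains, including basaltic or iron-rich rocks, geological formations from old lava channels, dry vegetation, and water.
Details
Motivation: The goal is to provide a multi-modal (visual and LiDAR) dataset for developing and benchmarking SLAM and place recognition algorithms, particularly in unstructured planetary-like environments such as volcanic terrain, where evaluation data is scarce.
Result: The abstract reports no specific quantitative results or benchmark performance. It releases the dataset together with an open-source toolkit for generating ground truth poses and preparing labelled samples for place recognition tasks.
Insight: The contribution is a multi-modal SLAM dataset focused on unstructured planetary-like environments (volcanic terrain as an analogue), covering diverse geological textures and terrains, together with an open-source toolkit that supports algorithm development and evaluation. It should help advance SLAM research in extreme environments.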
Abstract: We release the S3LI Vulcano dataset, a multi-modal dataset for the development and benchmarking of Simultaneous Localization and Mapping (SLAM) and place recognition algorithms that rely on visual and LiDAR modalities. Several sequences are recorded on the volcanic island of Vulcano, in the Aeolian Islands, Sicily, Italy. The sequences provide users with data from a variety of environments, textures and terrains, including basaltic or iron-rich rocks, geological formations from old lava channels, as well as dry vegetation and water. The data (rmc.dlr.de/s3li_dataset) is accompanied by an open source toolkit (github.com/DLR-RM/s3li-toolkit) providing tools for generating ground truth poses as well as preparation of labelled samples for place recognition tasks.
[49] QuaMo: Quaternion Motions for Vision-based 3D Human Kinematics Capture cs.CVPDF
Cuong Le, Pavlo Melnyk, Urs Waldmann, Mårten Wadenbäck, Bastian Wandt
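TL;DR: This paper proposes QuaMo, a new method that uses quaternion differential equations (QDE) for vision-based 3D human motion capture. It addresses the motion jitter that arises when traditional methods ignore temporal consistency, and it improves the accuracy and continuity of motion estimation through an acceleration enhancement and a unit-sphere constraint.
Details
Motivation: Traditional 3D pose estimation methods often ignore temporal consistency between frames, producing unnatural, jittery motion. Existing kinematics-based methods rely on Euler angles, which suffer from discontinuities, especially in online settings where trajectory refinement is unavailable. Quaternions are continuous, so the paper uses them to improve the stability and accuracy of human motion capture.
Result: Experiments show that QuaMo outperforms current state-of-the-art methods on multiple datasets, including Human3.6M, Fit3D, SportsPose, and AIST, and accurately estimates 3D human kinematics with no discontinuity and minimal implausibilities.
Insight: The contributions include: using quaternion differential equations (QDE) instead of Euler angles to avoid discontinuities; a meta-PD controller with an acceleration enhancement that adaptively regulates the control signals; and solving the QDE under the quaternion unit-sphere constraint to improve estimation accuracy. Together these improve the continuity and real-time performance of motion capture.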
Abstract: Vision-based 3D human motion capture from videos remains a challenge in computer vision. Traditional 3D pose estimation approaches often ignore the temporal consistency between frames, causing implausible and jittery motion. The emerging field of kinematics-based 3D motion capture addresses these issues by estimating the temporal transitioning between poses instead. A major drawback in current kinematics approaches is their reliance on Euler angles. Despite their simplicity, Euler angles suffer from discontinuity that leads to unstable motion reconstructions, especially in online settings where trajectory refinement is unavailable. In contrast, quaternions have no discontinuity and can produce continuous transitions between poses. In this paper, we propose QuaMo, a novel Quaternion Motions method using quaternion differential equations (QDE) for human kinematics capture. We utilize the state-space model, an effective system for describing real-time kinematics estimations, with quaternion state and the QDE describing quaternion velocity. The corresponding angular acceleration is computed from a meta-PD controller with a novel acceleration enhancement that adaptively regulates the control signals as the human quickly changes to a new pose. Unlike previous work, our QDE is solved under the quaternion unit-sphere constraint that results in more accurate estimations. Experimental results show that our novel formulation of the QDE with acceleration enhancement accurately estimates 3D human kinematics with no discontinuity and minimal implausibilities. QuaMo outperforms comparable state-of-the-art methods on multiple datasets, namely Human3.6M, Fit3D, SportsPose and AIST. The code is available at https://github.com/cuongle1206/QuaMo
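For readers unfamiliar with quaternion kinematics, the sketch below integrates the standard quaternion differential equation dq/dt = 0.5 * q ⊗ (0, ω) and projects the result back onto the unit sphere after each step. This is a generic textbook scheme under stated assumptions (body-frame angular velocity, explicit Euler stepping, renormalization as the constraint); it is not the solver used in the paper, and the function names are hypothetical.

```python
import math


def quat_mul(a, b):
    """Hamilton product of two quaternions stored as (w, x, y, z)."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return (
        aw * bw - ax * bx - ay * by - az * bz,
        aw * bx + ax * bw + ay * bz - az * by,
        aw * by - ax * bz + ay * bw + az * bx,
        aw * bz + ax * by - ay * bx + az * bw,
    )


def normalize(q):
    """Project a quaternion back onto the unit sphere."""
    norm = math.sqrt(sum(c * c for c in q))
    return tuple(c / norm for c in q)


def step(q, omega, dt):
    """One explicit Euler step of dq/dt = 0.5 * q * (0, omega).

    omega is the angular velocity in the body frame (rad/s). Renormalizing
    after the step keeps q a valid rotation; this simple projection is an
    assumption standing in for the paper's constrained solver.
    """
    dq = quat_mul(q, (0.0, *omega))
    q_next = tuple(c + 0.5 * dt * d for c, d in zip(q, dq))
    return normalize(q_next)


# Rotate about the z-axis at pi/2 rad/s for one second in 1 ms steps.
q = (1.0, 0.0, 0.0, 0.0)
for _ in range(1000):
    q = step(q, (0.0, 0.0, math.pi / 2), 0.001)
```

After one second the state should be close to a 90-degree rotation about z, that is, approximately (cos(π/4), 0, 0, sin(π/4)). Without the normalization step the quaternion norm would drift above 1 under explicit Euler integration, which is the kind of error a unit-sphere constraint is meant to prevent.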
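[50] ScenePilot-Bench: A Large-Scale Dataset and Benchmark for Evaluation of Vision-Language Models in Autonomous Driving cs.CV
Yujin Wang, Yutong Zheng, Wenxian Fan, Tianyi Wang, Hongqing Chu
TL;DR: This paper presents ScenePilot-Bench, a large-scale first-person driving benchmark for evaluating vision-language models (VLMs) in autonomous driving scenarios. It is built on ScenePilot-4K, a dataset of 3,847 hours of driving video, and includes a four-axis evaluation suite covering scene understanding, spatial perception, motion planning, and GPT-Score.
Details
Motivation: To provide a comprehensive benchmark for evaluating VLMs in safety-critical autonomous driving settings, clarifying the performance boundaries of current models and identifying gaps in driving-oriented reasoning.
Result: The paper benchmarks representative VLMs on ScenePilot-Bench and provides empirical analyses, but the abstract reports no specific quantitative results and does not state whether any model reaches SOTA.
Insight: The contribution is a large-scale driving dataset with multi-granularity annotations, together with a four-axis evaluation framework that includes safety-aware metrics and cross-region generalization settings, giving VLM evaluation for autonomous driving a systematic tool. Its multi-dimensional evaluation (notably the inclusion of GPT-Score) and its attention to generalization are directions worth borrowing.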
Abstract: In this paper, we introduce ScenePilot-Bench, a large-scale first-person driving benchmark designed to evaluate vision-language models (VLMs) in autonomous driving scenarios. ScenePilot-Bench is built upon ScenePilot-4K, a diverse dataset comprising 3,847 hours of driving videos, annotated with multi-granularity information including scene descriptions, risk assessments, key participant identification, ego trajectories, and camera parameters. The benchmark features a four-axis evaluation suite that assesses VLM capabilities in scene understanding, spatial perception, motion planning, and GPT-Score, with safety-aware metrics and cross-region generalization settings. We benchmark representative VLMs on ScenePilot-Bench, providing empirical analyses that clarify current performance boundaries and identify gaps for driving-oriented reasoning. ScenePilot-Bench offers a comprehensive framework for evaluating and advancing VLMs in safety-critical autonomous driving contexts.
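[51] GMS-CAVP: Improving Audio-Video Correspondence with Multi-Scale Contrastive and Generative Pretraining cs.CV | cs.AI | cs.LG | cs.SD | eess.AS
Shentong Mo, Zehua Chen, Jun Zhu
TL;DR: This paper proposes GMS-CAVP, a framework that strengthens video-audio (V-A) correspondence modeling by combining multi-scale contrastive learning with a diffusion-based generative pretraining objective. It targets the insufficient modeling of the dense, multi-scale nature of video and audio signals in existing methods, with the aim of improving cross-modal retrieval and generation.
Details
Motivation: Existing methods such as CAVP model semantic and temporal correspondences between modalities with contrastive objectives, yet their performance remains suboptimal. The main cause is insufficient modeling of the dense, multi-scale spatial-temporal structure of video and audio, which leaves fine- to coarse-grained correspondences underused.
Result: Extensive experiments on VGGSound, AudioSet, and Panda70M show that GMS-CAVP outperforms previous methods on both generation and retrieval.
Insight: The contribution is a unified discriminative-generative framework: a multi-scale contrastive learning strategy captures semantic and temporal relations at different granularities, and a diffusion-based generative objective enables translation and synthesis between modalities. This supports deeper cross-modal understanding and paves the way for high-fidelity generation.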
Abstract: Recent advances in video-audio (V-A) understanding and generation have increasingly relied on joint V-A embeddings, which serve as the foundation for tasks such as cross-modal retrieval and generation. While prior methods like CAVP effectively model semantic and temporal correspondences between modalities using contrastive objectives, their performance remains suboptimal. A key limitation is the insufficient modeling of the dense, multi-scale nature of both video and audio signals: correspondences often span fine- to coarse-grained spatial-temporal structures, which are underutilized in existing frameworks. To this end, we propose GMS-CAVP, a novel framework that combines Multi-Scale Video-Audio Alignment and Multi-Scale Spatial-Temporal Diffusion-based pretraining objectives to enhance V-A correspondence modeling. First, GMS-CAVP introduces a multi-scale contrastive learning strategy that captures semantic and temporal relations across varying granularities. Second, we go beyond traditional contrastive learning by incorporating a diffusion-based generative objective, enabling modality translation and synthesis between video and audio. This unified discriminative-generative formulation facilitates deeper cross-modal understanding and paves the way for high-fidelity generation. Extensive experiments on VGGSound, AudioSet, and Panda70M demonstrate that GMS-CAVP outperforms previous methods in generation and retrieval.
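[52] Towards Governance-Oriented Low-Altitude Intelligence: A Management-Centric Multi-Modal Benchmark With Implicitly Coordinated Vision-Language Reasoning Framework cs.CV
Hao Chang, Zhihui Wang, Lingxiang Wu, Peijin Wang, Wenhui Diao
TL;DR: This paper introduces GovLA-10K, a new benchmark for low-altitude intelligence aimed at urban governance, and GovLA-Reasoner, an accompanying reasoning framework. Rather than exhaustively annotating every object, the benchmark focuses on functionally salient targets tied to governance and provides actionable management suggestions. The framework uses an efficient feature adapter to implicitly coordinate discriminative representation sharing between a visual detector and a large language model (LLM), so that fine-grained visual grounding and high-level contextual language reasoning work together.
Details
Motivation: Existing object-centric perception paradigms and loosely coupled vision-language pipelines struggle to support the management-oriented anomaly understanding that real-world urban governance requires; the paper aims to bridge this gap.
Result: Extensive experiments show that the method significantly improves performance while avoiding the need to fine-tune any task-specific component.
Insight: The contributions are the first management-oriented multi-modal benchmark, whose annotation strategy maps directly onto practical management needs, and a unified framework that implicitly coordinates representation sharing between the vision and language models, achieving effective joint reasoning without fine-tuning task-specific components. It offers a new perspective and foundation for research on management-aware low-altitude vision-language systems.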
Abstract: Low-altitude vision systems are becoming a critical infrastructure for smart city governance. However, existing object-centric perception paradigms and loosely coupled vision-language pipelines still struggle to support the management-oriented anomaly understanding required in real-world urban governance. To bridge this gap, we introduce GovLA-10K, the first management-oriented multi-modal benchmark for low-altitude intelligence, along with GovLA-Reasoner, a unified vision-language reasoning framework tailored for governance-aware aerial perception. Unlike existing studies that aim to exhaustively annotate all visible objects, GovLA-10K is deliberately designed around functionally salient targets that directly correspond to practical management needs, and further provides actionable management suggestions grounded in these observations. To effectively coordinate fine-grained visual grounding with high-level contextual language reasoning, GovLA-Reasoner introduces an efficient feature adapter that implicitly coordinates discriminative representation sharing between the visual detector and the large language model (LLM). Extensive experiments show that our method significantly improves performance while avoiding the need to fine-tune any task-specific component. We believe our work offers a new perspective and foundation for future studies on management-aware low-altitude vision-language systems.
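[53] KeepLoRA: Continual Learning with Residual Gradient Adaptation cs.CV | cs.LG
Mao-Lin Luo, Zi-Hao Zhou, Yi-Lin Zhang, Yuanyu Wan, Tong Wei
TL;DR: This paper proposes KeepLoRA, a continual learning method for pre-trained vision-language models. An analysis of the model parameter space finds that general knowledge is mainly encoded in the principal subspace, while task-specific knowledge is encoded in the residual subspace. KeepLoRA learns a new task by projecting its gradient onto a subspace orthogonal to both the principal subspace of the pre-trained model and the dominant directions of previous task features, which restricts LoRA parameter updates to the residual subspace and balances three goals: retaining pre-trained knowledge, preserving knowledge from learned tasks, and maintaining the plasticity to acquire new knowledge.
Details
Motivation: Continual learning for pre-trained vision-language models must balance three competing objectives: retaining pre-trained knowledge, preserving knowledge from a sequence of learned tasks, and maintaining the plasticity to acquire new knowledge. The paper aims for a simple, effective method that balances them.
Result: Theoretical and empirical analyses confirm that KeepLoRA balances the three objectives and achieves state-of-the-art performance on continual learning tasks.
Insight: The contribution is an analysis of knowledge retention from the parameter-space perspective and, based on it, a method that constrains LoRA updates to the residual subspace. The core insight is to project gradients onto directions orthogonal to the subspaces holding critical knowledge, minimizing interference with existing knowledge; this is a novel, theory-driven continual learning strategy.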
Abstract: Continual learning for pre-trained vision-language models requires balancing three competing objectives: retaining pre-trained knowledge, preserving knowledge from a sequence of learned tasks, and maintaining the plasticity to acquire new knowledge. This paper presents a simple but effective approach called KeepLoRA to balance these objectives. We first analyze the knowledge retention mechanism within the model parameter space and find that general knowledge is mainly encoded in the principal subspace, while task-specific knowledge is encoded in the residual subspace. Motivated by this finding, KeepLoRA learns new tasks by restricting LoRA parameter updates in the residual subspace to prevent interfering with previously learned capabilities. Specifically, we infuse knowledge for a new task by projecting its gradient onto a subspace orthogonal to both the principal subspace of the pre-trained model and the dominant directions of previous task features. Our theoretical and empirical analyses confirm that KeepLoRA balances the three objectives and achieves state-of-the-art performance. The implementation code is available at https://github.com/MaolinLuo/KeepLoRA.
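The projection step described in the abstract can be illustrated with a small sketch: given a set of protected directions, remove from a gradient every component that lies in their span. This is only the generic linear-algebra operation, not the KeepLoRA implementation; the helper names, the use of Gram-Schmidt, and the toy 3-dimensional vectors are assumptions, and in the paper the protected directions would come from the pre-trained weights and previous task features.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def orthonormalize(vectors, eps=1e-12):
    """Gram-Schmidt: return an orthonormal basis for the span of `vectors`."""
    basis = []
    for v in vectors:
        w = list(v)
        for u in basis:
            c = dot(w, u)
            w = [wi - c * ui for wi, ui in zip(w, u)]
        norm = dot(w, w) ** 0.5
        if norm > eps:  # skip directions already covered by the basis
            basis.append([wi / norm for wi in w])
    return basis

def project_out(grad, protected):
    """Return the part of `grad` orthogonal to every protected direction."""
    g = list(grad)
    for u in orthonormalize(protected):
        c = dot(g, u)
        g = [gi - c * ui for gi, ui in zip(g, u)]
    return g

# Hypothetical protected directions spanning the x-y plane.
protected = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
grad = [0.5, -2.0, 3.0]
residual_grad = project_out(grad, protected)
```

Here only the z-component of `grad` survives, so an update along `residual_grad` cannot move the parameters along either protected direction.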
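[54] Video-KTR: Reinforcing Video Reasoning via Key Token Attribution cs.CV
Ziyue Wang, Sheng Jin, Zhongrong Zuo, Jiawei Wu, Han Qiu
TL;DR: This paper proposes Video-KTR, a modality-aware policy shaping framework for video reasoning. It combines three attribution signals (visual awareness, temporal sensitivity, and predictive uncertainty) to apply selective, token-level updates to key tokens only during reinforcement learning, improving the accuracy and interpretability of multimodal large language models on video reasoning tasks.
Details
Motivation: Existing video reasoning methods usually rely on coarse sequence-level rewards or single-factor token selection, neglecting the fine-grained links among visual inputs, temporal dynamics, and linguistic outputs, which limits accuracy and interpretability.
Result: Across five challenging benchmarks, Video-KTR achieves state-of-the-art or highly competitive results, for example 42.7% on Video-Holmes (surpassing GPT-4o), with consistent gains on both reasoning and general video understanding tasks. Ablation studies verify the complementary roles of the attribution signals and the robustness of targeted token-level updates.
Insight: The contribution is a token-level RL framework that combines multimodal attribution signals (visual-aware, temporal-aware, and high-entropy uncertainty) and focuses learning by reinforcing only key tokens. It offers a simple, drop-in extension to RL for complex video reasoning while improving interpretability.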
Abstract: Reinforcement learning (RL) has shown strong potential for enhancing reasoning in multimodal large language models, yet existing video reasoning methods often rely on coarse sequence-level rewards or single-factor token selection, neglecting fine-grained links among visual inputs, temporal dynamics, and linguistic outputs, limiting both accuracy and interpretability. We propose Video-KTR, a modality-aware policy shaping framework that performs selective, token-level RL by combining three attribution signals: (1) visual-aware tokens identified via counterfactual masking to reveal perceptual dependence; (2) temporal-aware tokens detected through frame shuffling to expose temporal sensitivity; and (3) high-entropy tokens signaling predictive uncertainty. By reinforcing only these key tokens, Video-KTR focuses learning on semantically informative, modality-sensitive content while filtering out low-value tokens. Across five challenging benchmarks, Video-KTR achieves state-of-the-art or highly competitive results, achieving 42.7% on Video-Holmes (surpassing GPT-4o) with consistent gains on both reasoning and general video understanding tasks. Ablation studies verify the complementary roles of the attribution signals and the robustness of targeted token-level updates. Overall, Video-KTR improves accuracy and interpretability, offering a simple, drop-in extension to RL for complex video reasoning. Our code and models are available at https://github.com/zywang0104/Video-KTR.
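Of the three attribution signals, the high-entropy one is the easiest to sketch in isolation. The example below computes the Shannon entropy of each generated token's predictive distribution and builds a 0/1 mask over the most uncertain tokens, which could then gate a per-token loss. The `keep_ratio` threshold, the function names, and the toy distributions are assumptions; the abstract does not say how Video-KTR sets its cutoff or how it combines this signal with the visual and temporal ones.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one token's predictive distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def high_entropy_mask(step_probs, keep_ratio=0.5):
    """Mark the top `keep_ratio` fraction of tokens by entropy with 1, the rest with 0.

    `keep_ratio` is a hypothetical hyperparameter used only for this sketch.
    """
    entropies = [token_entropy(p) for p in step_probs]
    k = max(1, round(keep_ratio * len(entropies)))
    ranked = sorted(range(len(entropies)), key=lambda i: entropies[i], reverse=True)
    keep = set(ranked[:k])
    mask = [1 if i in keep else 0 for i in range(len(entropies))]
    return mask, entropies

# One toy distribution per generated token (vocabulary of four).
step_probs = [
    [0.97, 0.01, 0.01, 0.01],  # confident
    [0.25, 0.25, 0.25, 0.25],  # maximally uncertain
    [0.70, 0.10, 0.10, 0.10],
    [0.40, 0.30, 0.20, 0.10],
]
mask, entropies = high_entropy_mask(step_probs, keep_ratio=0.5)
```

With these inputs the second and fourth tokens are selected, so only their positions would receive a policy-gradient update in a masked loss.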
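[55] DSVM-UNet: Enhancing VM-UNet with Dual Self-distillation for Medical Image Segmentation cs.CV
Renrong Shao, Dongyang Li, Dong Xia, Lin Shao, Jiangdong Lu
TL;DR: This paper proposes DSVM-UNet, a medical image segmentation method that enhances the existing VM-UNet model with dual self-distillation, without complex architectural design, aiming to improve performance while keeping computation efficient.
Details
Motivation: Existing Vision Mamba-based UNet models (VM-UNet) mainly rely on complex architectural designs to enhance the perception of semantic features. This paper aims to improve VM-UNet with a simple, effective dual self-distillation method that avoids added structural complexity.
Result: Extensive experiments on the ISIC2017, ISIC2018, and Synapse benchmarks show that the method achieves state-of-the-art (SOTA) performance while maintaining computational efficiency.
Insight: The contribution is a dual self-distillation method that aligns features at both the global and local levels, a model-enhancement strategy that does not depend on complex architectural design and suggests a new way to improve Vision Mamba models.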
Abstract: Vision Mamba models, which address the limitations of previous models by effectively managing long-range dependencies with linear-time overhead, have been extensively researched in various fields. Several prospective studies have further designed Vision Mamba based on UNet (VM-UNet) for medical image segmentation. These approaches primarily focus on optimizing architectural designs by creating more complex structures to enhance the model's ability to perceive semantic features. In this paper, we propose a simple yet effective approach to improve the model by Dual Self-distillation for VM-UNet (DSVM-UNet) without any complex architectural designs. To achieve this goal, we develop double self-distillation methods to align the features at both the global and local levels. Extensive experiments conducted on the ISIC2017, ISIC2018, and Synapse benchmarks demonstrate that our approach achieves state-of-the-art performance while maintaining computational efficiency. Code is available at https://github.com/RoryShao/DSVM-UNet.git.
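[56] Self-Supervised Weight Templates for Scalable Vision Model Initialization cs.CV | cs.LG
Yucheng Xie, Fu Feng, Ruixiao Shi, Jing Wang, Yong Rui
TL;DR: This paper proposes SWEET, a self-supervised framework that provides scalable model initialization for vision tasks. Through constraint-based pre-training it learns a shared weight template and size-specific weight scalers under Tucker-based factorization, supporting flexible adaptation to architectures of varying depth and width. Target models are initialized by composing and reweighting the template through lightweight weight scalers, which can be learned efficiently from minimal training data.
Details
Motivation: The growing parameter scale and complexity of modern models underscore the importance of pre-trained models. Real deployments, however, often need architectures of different sizes, which exposes the limits of conventional pre-training and fine-tuning.
Result: Extensive experiments on classification, detection, segmentation, and generation tasks show that SWEET achieves state-of-the-art performance for initializing variable-sized vision models.
Insight: The core contribution is learning, by self-supervision, a shared weight template and lightweight size adapters, which makes model initialization modular and scalable. The width-wise stochastic scaling technique adds flexibility in width expansion and encourages robust, width-invariant representations, improving cross-width generalization.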
Abstract: The increasing scale and complexity of modern model parameters underscore the importance of pre-trained models. However, deployment often demands architectures of varying sizes, exposing limitations of conventional pre-training and fine-tuning. To address this, we propose SWEET, a self-supervised framework that performs constraint-based pre-training to enable scalable initialization in vision tasks. Instead of pre-training a fixed-size model, we learn a shared weight template and size-specific weight scalers under Tucker-based factorization, which promotes modularity and supports flexible adaptation to architectures with varying depths and widths. Target models are subsequently initialized by composing and reweighting the template through lightweight weight scalers, whose parameters can be efficiently learned from minimal training data. To further enhance flexibility in width expansion, we introduce width-wise stochastic scaling, which regularizes the template along width-related dimensions and encourages robust, width-invariant representations for improved cross-width generalization. Extensive experiments on classification, detection, segmentation, and generation tasks demonstrate the state-of-the-art performance of SWEET for initializing variable-sized vision models.
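[57] PaW-ViT: A Patch-based Warping Vision Transformer for Robust Ear Verification cs.CV
Deeksha Arun, Kevin W. Bowyer, Patrick Flynn
TL;DR: This paper proposes PaW-ViT, a preprocessing approach that normalizes ear images using anatomical knowledge to improve Vision Transformer (ViT) performance on ear verification. By accurately aligning ViT token boundaries with detected ear feature boundaries, it gains robustness to shape, size, and pose variation and produces more consistent token representations.
Details
Motivation: The rectangular tokens used in standard vision transformer methods often include information outside the object to be recognized (here, the ear), which can strongly affect performance. The paper aims to resolve the mismatch between the morphological variation of ear biometrics and the positional sensitivity of transformer architectures.
Result: Experiments confirm the effectiveness of PaW-ViT on several ViT models (ViT-T, ViT-S, ViT-B, ViT-L) and show reasonable alignment robustness to variation in shape, size, and pose.
Insight: The contribution is an anatomy-informed image preprocessing method that aligns token boundaries with biometric feature boundaries to make ViT recognition of a specific object (the ear) more robust. It offers a feasible path for using prior knowledge to improve transformers on specific biometric tasks.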
Abstract: The rectangular tokens common to vision transformer methods for visual recognition can strongly affect performance of these methods due to incorporation of information outside the objects to be recognized. This paper introduces PaW-ViT, Patch-based Warping Vision Transformer, a preprocessing approach rooted in anatomical knowledge that normalizes ear images to enhance the efficacy of ViT. By accurately aligning token boundaries to detected ear feature boundaries, PaW-ViT obtains greater robustness to shape, size, and pose variation. By aligning feature boundaries to natural ear curvature, it produces more consistent token representations for various morphologies. Experiments confirm the effectiveness of PaW-ViT on various ViT models (ViT-T, ViT-S, ViT-B, ViT-L) and yield reasonable alignment robustness to variation in shape, size, and pose. Our work aims to solve the disconnect between ear biometric morphological variation and transformer architecture positional sensitivity, presenting a possible avenue for authentication schemes.
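[58] GeoDiff3D: Self-Supervised 3D Scene Generation with Geometry-Constrained 2D Diffusion Guidance cs.CV
Haozhi Zhu, Miaomiao Zhao, Dingyao Liu, Runze Tian, Yan Zhang
TL;DR: This paper proposes GeoDiff3D, a self-supervised framework for 3D scene generation. It uses coarse geometry as a structural anchor and a geometry-constrained 2D diffusion model to generate texture-rich reference images, without requiring strict multi-view consistency. With voxel-aligned 3D feature aggregation and dual self-supervision, it maintains scene coherence and detail while substantially reducing dependence on labeled data, enabling efficient, high-quality 3D scene generation.
Details
Motivation: Existing 3D scene generation methods (indirect 2D-to-3D reconstruction and direct 3D generation) suffer from weak structural modeling and heavy reliance on large-scale ground-truth supervision, which leads to structural artifacts, geometric inconsistencies, and degraded detail in complex scenes. The paper aims to address these problems with a more practical, efficient approach to 3D scene construction.
Result: Extensive experiments on challenging scenes show that GeoDiff3D improves on existing baselines in generalization and generation quality and enables fast generation of high-quality 3D scenes.
Insight: The contributions are: 1) coarse geometry as a structural anchor combined with geometry-constrained 2D diffusion guidance, which needs no strict multi-view consistency and is robust to noisy guidance; 2) voxel-aligned 3D feature aggregation and dual self-supervision, which maintain scene coherence and detail while reducing dependence on labeled data; 3) a framework with low computational cost that enables efficient self-supervised 3D scene generation.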
Abstract: 3D scene generation is a core technology for gaming, film/VFX, and VR/AR. Growing demand for rapid iteration, high-fidelity detail, and accessible content creation has further increased interest in this area. Existing methods broadly follow two paradigms - indirect 2D-to-3D reconstruction and direct 3D generation - but both are limited by weak structural modeling and heavy reliance on large-scale ground-truth supervision, often producing structural artifacts, geometric inconsistencies, and degraded high-frequency details in complex scenes. We propose GeoDiff3D, an efficient self-supervised framework that uses coarse geometry as a structural anchor and a geometry-constrained 2D diffusion model to provide texture-rich reference images. Importantly, GeoDiff3D does not require strict multi-view consistency of the diffusion-generated references and remains robust to the resulting noisy, inconsistent guidance. We further introduce voxel-aligned 3D feature aggregation and dual self-supervision to maintain scene coherence and fine details while substantially reducing dependence on labeled data. GeoDiff3D also trains with low computational cost and enables fast, high-quality 3D scene generation. Extensive experiments on challenging scenes show improved generalization and generation quality over existing baselines, offering a practical solution for accessible and efficient 3D scene construction.
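[59] Diffusion for De-Occlusion: Accessory-Aware Diffusion Inpainting for Robust Ear Biometric Recognition cs.CV
Deeksha Arun, Kevin W. Bowyer, Patrick Flynn
TL;DR: This paper proposes a diffusion-based ear image inpainting method for handling occlusion by ear accessories (such as earrings and earphones). It serves as a pre-processing aid for transformer-based ear biometric recognition systems, improving recognition under occlusion.
Details
Motivation: Ear occlusions caused by accessories degrade the performance of ear biometric recognition systems, especially under unconstrained imaging conditions, so an effective pre-processing method is needed to restore the occluded regions.
Result: Experiments show that diffusion-based inpainting effectively alleviates ear accessory occlusion and improves the overall performance of transformer-based ear recognition models on several benchmark datasets.
Insight: The contribution is applying diffusion models to ear image inpainting: given an automatically derived occlusion mask, the model synthesizes the missing pixels while preserving local geometric coherence along key anatomical structures (helix, antihelix, concha, and lobule), as a pre-processing step for recognition systems.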
Abstract: Ear occlusions (arising from the presence of ear accessories such as earrings and earphones) can negatively impact performance in ear-based biometric recognition systems, especially in unconstrained imaging circumstances. In this study, we assess the effectiveness of a diffusion-based ear inpainting technique as a pre-processing aid to mitigate the issues of ear accessory occlusions in transformer-based ear recognition systems. Given an input ear image and an automatically derived accessory mask, the inpainting model reconstructs clean and anatomically plausible ear regions by synthesizing missing pixels while preserving local geometric coherence along key ear structures, including the helix, antihelix, concha, and lobule. We evaluate the effectiveness of this pre-processing aid in transformer-based recognition systems for several vision transformer models and different patch sizes for a range of benchmark datasets. Experiments show that diffusion-based inpainting can be a useful pre-processing aid to alleviate ear accessory occlusions to improve overall recognition performance.
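[60] Youtu-VL: Unleashing Visual Potential via Unified Vision-Language Supervision cs.CV
Zhixiang Wei, Yi Li, Zhehan Kan, Xinghua Jiang, Zuwei Long
TL;DR: This paper proposes the Youtu-VL framework, which adopts the Vision-Language Unified Autoregressive Supervision (VLUAS) paradigm: visual signals change from conditional inputs into prediction targets, addressing the weakness of existing vision-language models in retaining fine-grained visual information. By supervising visual details and linguistic content jointly, the framework improves multimodal understanding and applies directly to vision-centric tasks without task-specific additions.
Details
Motivation: Current vision-language models (VLMs) train with a text-dominant optimization bias, treating visual signals merely as passive conditional inputs rather than supervisory targets, which causes loss of fine-grained visual information and coarse multimodal comprehension. The paper aims to mitigate this by changing the optimization objective.
Result: Extensive empirical evaluations show that Youtu-VL achieves competitive performance on both general multimodal tasks and vision-centric tasks, laying a solid foundation for developing comprehensive generalist visual agents.
Insight: The core contribution is the VLUAS paradigm, which integrates visual tokens directly into the prediction stream for unified autoregressive supervision, shifting from "vision-as-input" to "vision-as-target". This strengthens the model's retention of visual detail and extends a standard VLM's generality to vision-centric tasks.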
Abstract: Despite the significant advancements represented by Vision-Language Models (VLMs), current architectures often exhibit limitations in retaining fine-grained visual information, leading to coarse-grained multimodal comprehension. We attribute this deficiency to a suboptimal training paradigm inherent in prevailing VLMs, which exhibits a text-dominant optimization bias by conceptualizing visual signals merely as passive conditional inputs rather than supervisory targets. To mitigate this, we introduce Youtu-VL, a framework leveraging the Vision-Language Unified Autoregressive Supervision (VLUAS) paradigm, which fundamentally shifts the optimization objective from "vision-as-input" to "vision-as-target." By integrating visual tokens directly into the prediction stream, Youtu-VL applies unified autoregressive supervision to both visual details and linguistic content. Furthermore, we extend this paradigm to encompass vision-centric tasks, enabling a standard VLM to perform vision-centric tasks without task-specific additions. Extensive empirical evaluations demonstrate that Youtu-VL achieves competitive performance on both general multimodal tasks and vision-centric tasks, establishing a robust foundation for the development of comprehensive generalist visual agents.
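[61] Query-Guided Spatial-Temporal-Frequency Interaction for Music Audio-Visual Question Answering cs.CV
Kun Li, Michael Ying Yang, Sami Sebastian Brandt
TL;DR: This paper proposes a novel Query-guided Spatial-Temporal-Frequency (QSTar) interaction method for music audio-visual question answering. Guided by the question, it fuses the frequency-domain characteristics of audio with the spatial and temporal features of video, and it introduces a Query Context Reasoning block to focus on semantically relevant features, significantly outperforming existing methods on several benchmarks.
Details
Motivation: Existing audio-visual question answering (AVQA) methods mainly rely on pre-trained models to process visual information and treat audio as a complement to video analysis; the textual question is integrated only in the late stages of reasoning and contributes little to audio-visual understanding. The paper aims to address this underuse of audio information and insufficient question guidance.
Result: Extensive experiments on several AVQA benchmarks show significant performance improvements over existing Audio QA, Visual QA, Video QA, and AVQA approaches.
Insight: The core contribution is the query-guided spatial-temporal-frequency interaction mechanism, which systematically exploits the frequency-domain characteristics of audio in AVQA for the first time and uses the question for early, deep guidance. The Query Context Reasoning module, inspired by prompt learning, focuses more precisely on semantically relevant features, strengthening cross-modal alignment and reasoning.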
Abstract: Audio–Visual Question Answering (AVQA) is a challenging multimodal task that requires jointly reasoning over audio, visual, and textual information in a given video to answer natural language questions. Inspired by recent advances in Video QA, many existing AVQA approaches primarily focus on visual information processing, leveraging pre-trained models to extract object-level and motion-level representations. However, in those methods, the audio input is primarily treated as complementary to video analysis, and the textual question information contributes minimally to audio–visual understanding, as it is typically integrated only in the final stages of reasoning. To address these limitations, we propose a novel Query-guided Spatial–Temporal–Frequency (QSTar) interaction method, which effectively incorporates question-guided clues and exploits the distinctive frequency-domain characteristics of audio signals, alongside spatial and temporal perception, to enhance audio–visual understanding. Furthermore, we introduce a Query Context Reasoning (QCR) block inspired by prompting, which guides the model to focus more precisely on semantically relevant audio and visual features. Extensive experiments conducted on several AVQA benchmarks demonstrate the effectiveness of our proposed method, achieving significant performance improvements over existing Audio QA, Visual QA, Video QA, and AVQA approaches. The code and pretrained models will be released after publication.
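[62] HexFormer: Hyperbolic Vision Transformer with Exponential Map Aggregation cs.CV
Haya Alyoussef, Ahmad Bdeir, Diego Coello de Portugal Mecke, Tom Hanika, Niels Landwehr
TL;DR: This paper proposes HexFormer, a hyperbolic vision transformer for image classification that models hierarchical and relational structure in data through an attention mechanism with exponential map aggregation. It explores a fully hyperbolic version (HexFormer) and a hybrid version (HexFormer-Hybrid) that combines a hyperbolic encoder with a Euclidean linear classification head. Experiments show that the method outperforms Euclidean baselines and prior hyperbolic ViTs on multiple datasets, and that hyperbolic models have more stable gradients and lower sensitivity to training strategy.
Details
Motivation: Data across modalities such as images, text, and graphs often contains hierarchical and relational structure that Euclidean geometry models poorly, while hyperbolic geometry provides a natural framework for it. The paper aims to use hyperbolic geometry to enhance vision transformers, improving representation quality and training stability through a modified attention mechanism.
Result: Experiments on multiple datasets show that HexFormer consistently outperforms Euclidean baselines and prior hyperbolic ViTs, with the hybrid variant (HexFormer-Hybrid) achieving the strongest overall results. Analysis shows that hyperbolic models have more stable gradients and reduced sensitivity to warmup strategies.
Insight: The contributions include a novel attention mechanism based on exponential map aggregation, which yields more accurate and stable aggregated representations than standard centroid-based averaging. The work also shows the practical benefit of simple mechanisms such as exponential map aggregation and the potential of hyperbolic geometry to improve gradient stability and accuracy in vision transformers.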
Abstract: Data across modalities such as images, text, and graphs often contains hierarchical and relational structures, which are challenging to model within Euclidean geometry. Hyperbolic geometry provides a natural framework for representing such structures. Building on this property, this work introduces HexFormer, a hyperbolic vision transformer for image classification that incorporates exponential map aggregation within its attention mechanism. Two designs are explored: a hyperbolic ViT (HexFormer) and a hybrid variant (HexFormer-Hybrid) that combines a hyperbolic encoder with a Euclidean linear classification head. HexFormer incorporates a novel attention mechanism based on exponential map aggregation, which yields more accurate and stable aggregated representations than standard centroid-based averaging, showing that simpler approaches retain competitive merit. Experiments across multiple datasets demonstrate consistent performance improvements over Euclidean baselines and prior hyperbolic ViTs, with the hybrid variant achieving the strongest overall results. Additionally, this study provides an analysis of gradient stability in hyperbolic transformers. The results reveal that hyperbolic models exhibit more stable gradients and reduced sensitivity to warmup strategies compared to Euclidean architectures, highlighting their robustness and efficiency in training. Overall, these findings indicate that hyperbolic geometry can enhance vision transformer architectures by improving gradient stability and accuracy. In addition, relatively simple mechanisms such as exponential map aggregation can provide strong practical benefits.
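As a rough illustration of what aggregating through an exponential map can look like, the sketch below averages points in the tangent space at the origin and maps the result back onto the Poincaré ball. The abstract does not specify the hyperbolic model, the base point, or the exact formula HexFormer uses, so the Poincaré-ball choice, the curvature parameter `c`, and all function names here are assumptions rather than the paper's method.

```python
import math

def exp_map_origin(v, c=1.0):
    """Map a tangent vector at the origin onto the Poincare ball of curvature -c."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm < 1e-12:
        return [0.0] * len(v)
    scale = math.tanh(math.sqrt(c) * norm) / (math.sqrt(c) * norm)
    return [scale * x for x in v]

def log_map_origin(y, c=1.0):
    """Inverse of exp_map_origin: map a ball point back to the tangent space."""
    norm = math.sqrt(sum(x * x for x in y))
    if norm < 1e-12:
        return [0.0] * len(y)
    scale = math.atanh(math.sqrt(c) * norm) / (math.sqrt(c) * norm)
    return [scale * x for x in y]

def aggregate(points, weights, c=1.0):
    """Weighted sum in the tangent space, followed by the exponential map."""
    tangents = [log_map_origin(p, c) for p in points]
    dim = len(points[0])
    mixed = [sum(w * t[d] for w, t in zip(weights, tangents)) for d in range(dim)]
    return exp_map_origin(mixed, c)

# Two points on the ball and attention-like weights that sum to 1.
points = [exp_map_origin([2.0, 0.0]), exp_map_origin([0.0, 3.0])]
weights = [0.25, 0.75]
out = aggregate(points, weights)
```

Because `tanh` is bounded by 1, the aggregated point always lands strictly inside the unit ball, whatever the weights are.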
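[63] EgoHandICL: Egocentric 3D Hand Reconstruction with In-Context Learning cs.CV
Binzhu Xie, Shi Qiu, Sicheng Zhang, Yinqiao Wang, Hao Xu
TL;DR: This paper presents EgoHandICL, the first in-context learning (ICL) framework for egocentric 3D hand reconstruction, targeting the challenges of depth ambiguity, self-occlusion, and complex hand-object interactions. It improves semantic alignment, visual consistency, and robustness through complementary exemplar retrieval guided by vision-language models, a tokenizer tailored to multimodal context, and a masked-autoencoder-based architecture.
Details
Motivation: 3D hand reconstruction in egocentric vision is challenging because of depth ambiguity, self-occlusion, and complex hand-object interactions. Prior methods mitigate these issues by scaling training data or adding auxiliary cues, but they often struggle in unseen contexts.
Result: On the ARCTIC and EgoExo4D benchmarks, EgoHandICL shows consistent gains over state-of-the-art (SOTA) methods. It also demonstrates real-world generalization and improves EgoVLM hand-object interaction reasoning by using reconstructed hands as visual prompts.
Insight: The contributions are: 1) complementary exemplar retrieval with vision-language models to enrich the context; 2) an ICL tokenizer designed for multimodal context; 3) a masked-autoencoder-based architecture trained with hand-guided geometric and perceptual objectives. This suggests a new approach to few-shot or zero-shot adaptation in egocentric understanding.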
Abstract: Robust 3D hand reconstruction in egocentric vision is challenging due to depth ambiguity, self-occlusion, and complex hand-object interactions. Prior methods mitigate these issues by scaling training data or adding auxiliary cues, but they often struggle in unseen contexts. We present EgoHandICL, the first in-context learning (ICL) framework for 3D hand reconstruction that improves semantic alignment, visual consistency, and robustness under challenging egocentric conditions. EgoHandICL introduces complementary exemplar retrieval guided by vision-language models (VLMs), an ICL-tailored tokenizer for multimodal context, and a masked autoencoder (MAE)-based architecture trained with hand-guided geometric and perceptual objectives. Experiments on ARCTIC and EgoExo4D show consistent gains over state-of-the-art methods. We also demonstrate real-world generalization and improve EgoVLM hand-object interaction reasoning by using reconstructed hands as visual prompts. Code and data: https://github.com/Nicous20/EgoHandICL
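[64] SONIC: Spectral Oriented Neural Invariant Convolutions cs.CV | cs.LG
Gijs Joppe Moens, Regina Beets-Tan, Eduardo H. P. Pooch
TL;DR: This paper proposes SONIC (Spectral Oriented Neural Invariant Convolutions), a new convolution parameterization. It is a continuous spectral parameterization that models convolutional operators with a small set of shared, orientation-selective components, yielding filters with global receptive fields that adapt naturally across resolutions.
Details
Motivation: Conventional convolutional neural networks (CNNs) use fixed-size kernels and so struggle to capture global context or long-range dependencies, while Vision Transformers (ViTs) lack spatial inductive bias, depend on explicit positional encodings, and remain tied to the initial patch size. Bridging these limitations requires a representation that is both structured and global.
Result: Across synthetic benchmarks, large-scale image classification, and 3D medical datasets, SONIC shows improved robustness to geometric transformations, noise, and resolution shifts, and matches or exceeds convolutional, attention-based, and prior spectral architectures with an order of magnitude fewer parameters.
Insight: The contribution is a continuous, orientation-aware spectral parameterization that serves as a principled, scalable alternative to conventional spatial and spectral operators, providing global receptive fields and filtering that adapts across resolutions.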
Abstract: Convolutional Neural Networks (CNNs) rely on fixed-size kernels scanning local patches, which limits their ability to capture global context or long-range dependencies without very deep architectures. Vision Transformers (ViTs), in turn, provide global connectivity but lack spatial inductive bias, depend on explicit positional encodings, and remain tied to the initial patch size. Bridging these limitations requires a representation that is both structured and global. We introduce SONIC (Spectral Oriented Neural Invariant Convolutions), a continuous spectral parameterisation that models convolutional operators using a small set of shared, orientation-selective components. These components define smooth responses across the full frequency domain, yielding global receptive fields and filters that adapt naturally across resolutions. Across synthetic benchmarks, large-scale image classification, and 3D medical datasets, SONIC shows improved robustness to geometric transformations, noise, and resolution shifts, and matches or exceeds convolutional, attention-based, and prior spectral architectures with an order of magnitude fewer parameters. These results demonstrate that continuous, orientation-aware spectral parameterisations provide a principled and scalable alternative to conventional spatial and spectral operators.
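[65] DuwatBench: Bridging Language and Visual Heritage through an Arabic Calligraphy Benchmark for Multimodal Understanding cs.CV
Shubham Patle, Sara Ghaboura, Hania Tariq, Mohammad Usman Khan, Omkar Thawakar
TL;DR: This paper presents DuwatBench, an Arabic calligraphy benchmark of 1,272 samples covering six classical and modern calligraphic styles, each with sentence-level detection annotations, for evaluating how well multimodal models understand artistic Arabic text.
Details
Motivation: To address the limited ability of multimodal models to process Arabic script, especially in artistic and stylized calligraphic forms, and thereby foster fair inclusion of the Arabic language and visual heritage.
Result: An evaluation of 13 leading Arabic and multilingual multimodal models finds that they perform well on clean text but struggle with calligraphic variation, artistic distortions, and precise visual-text alignment.
Insight: The contribution is a benchmark dedicated to the visual heritage of Arabic calligraphy. It underlines the importance of cultural context in multimodal research and exposes the limits of current models on artistic text.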
Abstract: Arabic calligraphy represents one of the richest visual traditions of the Arabic language, blending linguistic meaning with artistic form. Although multimodal models have advanced across languages, their ability to process Arabic script, especially in artistic and stylized calligraphic forms, remains largely unexplored. To address this gap, we present DuwatBench, a benchmark of 1,272 curated samples containing about 1,475 unique words across six classical and modern calligraphic styles, each paired with sentence-level detection annotations. The dataset reflects real-world challenges in Arabic writing, such as complex stroke patterns, dense ligatures, and stylistic variations that often challenge standard text recognition systems. Using DuwatBench, we evaluated 13 leading Arabic and multilingual multimodal models and showed that while they perform well on clean text, they struggle with calligraphic variation, artistic distortions, and precise visual-text alignment. By publicly releasing DuwatBench and its annotations, we aim to advance culturally grounded multimodal research, foster fair inclusion of the Arabic language and visual heritage in AI systems, and support continued progress in this area. Our dataset (https://huggingface.co/datasets/MBZUAI/DuwatBench) and evaluation suite (https://github.com/mbzuai-oryx/DuwatBench) are publicly available.
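cs.SD
[66] SICL-AT: Another way to adapt Auditory LLM to low-resource task cs.SD | cs.AI | cs.CL
Haolong Zheng, Siyin Wang, Zengrui Jin, Mark Hasegawa-Johnson
TL;DR: This paper proposes SICL-AT, a post-training method that strengthens the in-context learning ability of auditory large language models on low-resource tasks. When labeled data is scarce or mismatched to the test distribution, it avoids the brittleness of direct fine-tuning and improves the model's zero-shot performance across a range of speech and audio understanding tasks.
Details
Motivation: Auditory LLMs perform poorly on low-resource or unfamiliar tasks; in particular, when labeled data is scarce or mismatched to the true test distribution, direct fine-tuning is unstable.
Result: Experiments show that SICL-AT consistently outperforms direct fine-tuning in low-resource scenarios and that the strengthened in-context learning ability generalizes to audio understanding and reasoning tasks.
Insight: The contribution is post-training on high-resource speech data specifically to strengthen in-context learning, rather than directly fine-tuning model parameters. This gives multimodal large models a training-free, inference-time route to adapting to low-resource tasks.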
Abstract: Auditory Large Language Models (LLMs) have demonstrated strong performance across a wide range of speech and audio understanding tasks. Nevertheless, they often struggle when applied to low-resource or unfamiliar tasks. When labeled in-domain data is scarce or mismatched to the true test distribution, direct fine-tuning can be brittle. In-Context Learning (ICL) provides a training-free, inference-time solution by adapting auditory LLMs through conditioning on a few in-domain demonstrations. In this work, we first show that Vanilla ICL improves zero-shot performance across diverse speech and audio tasks for selected models, which suggests that this ICL adaptation capability can be generalized to the multimodal setting. Building on this, we propose Speech In-Context Learning Adaptation Training (SICL-AT), a post-training recipe that uses only high-resource speech data and is intended to strengthen the model's in-context learning capability. The enhancement can generalize to audio understanding and reasoning tasks. Experiments indicate that our proposed method consistently outperforms direct fine-tuning in low-resource scenarios.
eess.IV [Back]
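[67] AMGFormer: Adaptive Multi-Granular Transformer for Brain Tumor Segmentation with Missing Modalities eess.IV | cs.CV
Chengxiang Guo, Jian Wang, Junhua Fei, Xiao Li, Chunling Chen
TL;DR: This paper proposes AMGFormer, an adaptive multi-granular Transformer for brain tumor segmentation with missing modalities. Three cooperating modules (QIB, MGAO, MQAE) markedly improve stability and performance across modality combinations, addressing the sharp performance swings that existing methods show when modalities are missing.
Details
Motivation: Multimodal MRI data in clinical practice often lacks one or more modalities, which makes existing segmentation methods unstable (variance above 40%) and clinically unreliable. The paper aims for a brain tumor segmentation method that is robust to missing modalities and stable in performance.
Result: On BraTS 2018, the method reaches Dice scores of 89.33% WT, 82.70% TC, and 67.23% ET across 15 modality combinations, with performance variance below 0.5%. Single-modality ET segmentation improves by 40-81% relative to state-of-the-art methods. The method generalizes to BraTS 2020/2021, reaching up to 92.44% WT, 89.91% TC, and 84.57% ET. Inference takes 1.2 s.
Insight: The novelty lies in the joint design of three modules: QIB performs spatially adaptive fusion to keep predictions consistent; MGAO uses multi-granular attention to focus on pathological regions and reduce background sensitivity; MQAE is aware of modality quality and prevents error propagation. The core contribution is a systematic treatment of segmentation stability under missing modalities, which points toward clinical deployment.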
Abstract: Multimodal MRI is essential for brain tumor segmentation, yet missing modalities in clinical practice cause existing methods to exhibit >40% performance variance across modality combinations, rendering them clinically unreliable. We propose AMGFormer, achieving significantly improved stability through three synergistic modules: (1) QuadIntegrator Bridge (QIB) enabling spatially adaptive fusion maintaining consistent predictions regardless of available modalities, (2) Multi-Granular Attention Orchestrator (MGAO) focusing on pathological regions to reduce background sensitivity, and (3) Modality Quality-Aware Enhancement (MQAE) preventing error propagation from corrupted sequences. On BraTS 2018, our method achieves 89.33% WT, 82.70% TC, 67.23% ET Dice scores with <0.5% variance across 15 modality combinations, solving the stability crisis. Single-modality ET segmentation shows 40-81% relative improvements over state-of-the-art methods. The method generalizes to BraTS 2020/2021, achieving up to 92.44% WT, 89.91% TC, 84.57% ET. The model demonstrates potential for clinical deployment with 1.2s inference. Code: https://github.com/guochengxiangives/AMGFormer.
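To make the evaluation protocol concrete: BraTS provides four MRI sequences (T1, T1ce, T2, FLAIR), so the 15 modality combinations are the non-empty subsets of those four. The sketch below enumerates them and computes a Dice score on flat binary masks. The function names and the empty-mask convention are illustrative assumptions and are not taken from the AMGFormer code.

```python
from itertools import combinations

MODALITIES = ["T1", "T1ce", "T2", "FLAIR"]  # the four BraTS MRI sequences

def modality_subsets(modalities):
    """All non-empty subsets: 2**4 - 1 = 15 for four modalities."""
    return [c for r in range(1, len(modalities) + 1)
            for c in combinations(modalities, r)]

def dice(pred, target):
    """Dice = 2|P ∩ T| / (|P| + |T|) for flat binary masks (lists of 0/1)."""
    inter = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * inter / total
```

A stability figure like the reported variance would then come from evaluating `dice` once per subset and comparing the spread of the scores.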
eess.AS [Back]
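[68] Rethinking Discrete Speech Representation Tokens for Accent Generation eess.AS | cs.CL | cs.SD
Jinzuomu Zhong, Yi Wang, Korin Richmond, Peter Bell
TL;DR: This paper presents the first systematic study of how accent information is encoded in Discrete Speech Representation Tokens (DSRTs). It proposes a unified evaluation framework that measures the accessibility of accent information with a novel Accent ABX task and its recoverability with cross-accent voice conversion resynthesis. The study finds that fine-tuning the encoder with ASR supervision substantially reduces accent information, while naive codebook size reduction cannot effectively disentangle accent from phonetic and speaker information. Based on this, the authors propose new content-only and content-accent DSRTs that significantly outperform existing designs in controllable accent generation.
Details
Motivation: Prior work has studied phonetic and speaker information in DSRTs extensively, but how accent information is encoded remains underexplored. The paper fills this gap to support controllable accent speech generation.
Result: Under the proposed framework, the authors analyse DSRTs derived from a variety of speech encoders and find that ASR-supervised fine-tuning greatly reduces accent information. The new content-only and content-accent DSRTs significantly outperform existing designs in controllable accent generation.
Insight: The novelty lies in the first systematic study of accent information in DSRTs, a unified evaluation framework (Accent ABX and cross-accent VC resynthesis), and more effective DSRT variants designed from the analysis. The work underlines the importance of accent-aware evaluation and gives practical guidance for designing DSRTs for accent-controlled speech generation.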
Abstract: Discrete Speech Representation Tokens (DSRTs) have become a foundational component in speech generation. While prior work has extensively studied phonetic and speaker information in DSRTs, how accent information is encoded in DSRTs remains largely unexplored. In this paper, we present the first systematic investigation of accent information in DSRTs. We propose a unified evaluation framework that measures both accessibility of accent information via a novel Accent ABX task and recoverability via cross-accent Voice Conversion (VC) resynthesis. Using this framework, we analyse DSRTs derived from a variety of speech encoders. Our results reveal that accent information is substantially reduced when ASR supervision is used to fine-tune the encoder, but cannot be effectively disentangled from phonetic and speaker information through naive codebook size reduction. Based on these findings, we propose new content-only and content-accent DSRTs that significantly outperform existing designs in controllable accent generation. Our work highlights the importance of accent-aware evaluation and provides practical guidance for designing DSRTs for accent-controlled speech generation.
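An ABX test can be sketched in a few lines. In the generic form assumed here, each trial is a triple (A, B, X) in which X shares its accent with A but not with B, and the trial counts as correct when X is closer to A than to B. The fixed-size vector representation and the Euclidean distance are assumptions for illustration; the paper's exact Accent ABX protocol is not given in the abstract.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def abx_accuracy(triples, dist=euclidean):
    """triples: (a, b, x) where x shares its accent with a, not with b.

    A trial is correct when x is closer to a than to b; ties count as half.
    """
    score = 0.0
    for a, b, x in triples:
        da, db = dist(a, x), dist(b, x)
        score += 1.0 if da < db else 0.5 if da == db else 0.0
    return score / len(triples)
```

An accuracy near 0.5 would suggest that the representation carries little accessible accent information, while values near 1.0 suggest that accents are easy to tell apart.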
cs.AI [Back]
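[69] LocationAgent: A Hierarchical Agent for Image Geolocation via Decoupling Strategy and Evidence from Parametric Knowledge cs.AI | cs.CV
Qiujun Li, Zijin Xiao, Xulin Wang, Zhidan Ma, Cheng Yang
TL;DR: This paper proposes LocationAgent, a hierarchical agent for image geolocation. The core idea is to keep the hierarchical reasoning logic inside the model while offloading verification of geographic evidence to external tools, addressing the factual hallucinations and generalization bottlenecks that existing methods show in open-world or dynamic-knowledge settings.
Details
Motivation: Existing image geolocation methods usually internalize location knowledge and reasoning patterns into static memory through supervised training or trajectory-based reinforcement fine-tuning, which makes them prone to factual hallucinations and generalization bottlenecks in open-world settings or scenarios that require dynamic knowledge.
Result: In zero-shot settings, LocationAgent outperforms existing methods by at least 30%. The authors also build CCL-Bench (China City Location Bench), a dataset covering various scene granularities and difficulty levels.
Insight: The novelty lies in the RER (Reasoner-Executor-Recorder) architecture, which uses role separation and context compression to implement hierarchical reasoning, together with a suite of clue exploration tools for evidence verification. Decoupling reasoning from verification and drawing dynamic knowledge from external tools improves reliability and generalization.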
Abstract: Image geolocation aims to infer capture locations based on visual content. Fundamentally, this constitutes a reasoning process composed of hypothesis-verification cycles, requiring models to possess both geospatial reasoning capabilities and the ability to verify evidence against geographic facts. Existing methods typically internalize location knowledge and reasoning patterns into static memory via supervised training or trajectory-based reinforcement fine-tuning. Consequently, these methods are prone to factual hallucinations and generalization bottlenecks in open-world settings or scenarios requiring dynamic knowledge. To address these challenges, we propose a Hierarchical Localization Agent, called LocationAgent. Our core philosophy is to retain hierarchical reasoning logic within the model while offloading the verification of geographic evidence to external tools. To implement hierarchical reasoning, we design the RER architecture (Reasoner-Executor-Recorder), which employs role separation and context compression to prevent the drifting problem in multi-step reasoning. For evidence verification, we construct a suite of clue exploration tools that provide diverse evidence to support location reasoning. Furthermore, to address data leakage and the scarcity of Chinese data in existing datasets, we introduce CCL-Bench (China City Location Bench), an image geolocation benchmark encompassing various scene granularities and difficulty levels. Extensive experiments demonstrate that LocationAgent significantly outperforms existing methods by at least 30% in zero-shot settings.
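[70] MATA: A Trainable Hierarchical Automaton System for Multi-Agent Visual Reasoning cs.AI | cs.CV
Zhixi Cai, Fucai Ke, Kevin Leo, Sukai Huang, Maria Garcia de la Banda
TL;DR: MATA is a trainable hierarchical automaton system for multi-agent visual reasoning. A trainable hyper agent controls the top-level state transitions; each agent corresponds to one state and runs a rule-based sub-automaton; all agents share a memory, which yields a transparent execution history. The method achieves state-of-the-art results on multiple visual reasoning benchmarks.
Details
Motivation: Current vision-language models reason implicitly on complex queries, which is hard to explain and prone to hallucination. Existing compositional methods mostly rely on a single agent or a hand-crafted pipeline and cannot decide dynamically when agents should collaborate or compete.
Result: Across multiple visual reasoning benchmarks, MATA achieves state-of-the-art (SOTA) results compared with monolithic models and compositional baselines.
Insight: The novelty lies in casting the multi-agent system as a hierarchical finite-state automaton and introducing a trainable hyper agent that learns the state-transition policy. Transition-trajectory trees are built and converted into a supervised fine-tuning dataset (MATA-SFT-90K), so that the LLM acting as the policy understands the query and the agents' capabilities and can efficiently select the best agent. The design provides a transparent execution history and reliable micro-control.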
Abstract: Recent vision-language models have strong perceptual ability, but their implicit reasoning is hard to explain and easily generates hallucinations on complex queries. Compositional methods improve interpretability, but most rely on a single agent or hand-crafted pipeline and cannot decide when to collaborate across complementary agents or compete among overlapping ones. We introduce MATA (Multi-Agent hierarchical Trainable Automaton), a multi-agent system presented as a hierarchical finite-state automaton for visual reasoning whose top-level transitions are chosen by a trainable hyper agent. Each agent corresponds to a state in the hyper automaton and runs a small rule-based sub-automaton for reliable micro-control. All agents read and write a shared memory, yielding a transparent execution history. To supervise the hyper agent’s transition policy, we build transition-trajectory trees and transform them into memory-to-next-state pairs, forming the MATA-SFT-90K dataset for supervised finetuning (SFT). The finetuned LLM, acting as the transition policy, understands the query and the capabilities of the agents, and it can efficiently choose the optimal agent to solve the task. Across multiple visual reasoning benchmarks, MATA achieves state-of-the-art results compared with monolithic and compositional baselines. The code and dataset are available at https://github.com/ControlNet/MATA.
cs.LG [Back]
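[71] Save the Good Prefix: Precise Error Penalization via Process-Supervised RL to Enhance LLM Reasoning cs.LG | cs.CL
Haolin Liu, Dian Yu, Sidi Lu, Yujun Zhou, Rui Liu
TL;DR: This paper proposes Verifiable Prefix Policy Optimization (VPPO), a process-supervised reinforcement learning method for improving the reasoning of large language models. It uses a process reward model only to locate the first error in a reasoning path, splits the trajectory into a verified correct prefix and an erroneous suffix, rewards the prefix, and applies targeted penalties to the suffix, which gives a more stable and interpretable learning signal.
Details
Motivation: Existing RL methods based on sparse outcome rewards cannot credit correct intermediate steps in partially successful solutions, while the fine-grained step-level scores from process reward models are often noisy and hard to evaluate. The evaluation target for process reward models (detecting the first incorrect step) is misaligned with their typical use in RL (maximizing their step scores as raw rewards).
Result: Across multiple reasoning benchmarks, VPPO consistently outperforms sparse-reward RL and prior PRM-based baselines on both Pass@1 and Pass@K.
Insight: The core novelty is changing how the process reward model is used: instead of supplying raw step rewards, it only locates the first error, and the trajectory is then split into precisely rewarded and penalized parts. This improves credit assignment, stabilizes the learning signal, and better aligns how the model is evaluated with how it is used.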
Abstract: Reinforcement learning (RL) has emerged as a powerful framework for improving the reasoning capabilities of large language models (LLMs). However, most existing RL approaches rely on sparse outcome rewards, which fail to credit correct intermediate steps in partially successful solutions. Process reward models (PRMs) offer fine-grained step-level supervision, but their scores are often noisy and difficult to evaluate. As a result, recent PRM benchmarks focus on a more objective capability: detecting the first incorrect step in a reasoning path. However, this evaluation target is misaligned with how PRMs are typically used in RL, where their step-wise scores are treated as raw rewards to maximize. To bridge this gap, we propose Verifiable Prefix Policy Optimization (VPPO), which uses PRMs only to localize the first error during RL. Given an incorrect rollout, VPPO partitions the trajectory into a verified correct prefix and an erroneous suffix based on the first error, rewarding the former while applying targeted penalties only after the detected mistake. This design yields stable, interpretable learning signals and improves credit assignment. Across multiple reasoning benchmarks, VPPO consistently outperforms sparse-reward RL and prior PRM-guided baselines on both Pass@1 and Pass@K.
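The prefix/suffix split described in the abstract can be illustrated with a minimal sketch. It assumes the PRM has already returned the index of the first incorrect step, and it assigns one scalar per step; the reward magnitudes, the function name, and the choice to include the erroneous step itself in the penalized suffix are illustrative assumptions, not the paper's exact formulation.

```python
def prefix_suffix_rewards(num_steps, first_error, bonus=1.0, penalty=-1.0):
    """Per-step learning signal for one incorrect rollout.

    first_error: 0-based index of the first wrong step, as located by a PRM.
    Steps before it form the verified prefix; the remaining steps form the
    erroneous suffix (assumed here to include the wrong step itself).
    """
    if not 0 <= first_error < num_steps:
        raise ValueError("first_error must index a step in the rollout")
    return [bonus if i < first_error else penalty for i in range(num_steps)]
```

For a five-step rollout whose third step is the first mistake, `prefix_suffix_rewards(5, 2)` credits the first two steps and penalizes the rest, in contrast to an outcome-only reward that would penalize all five equally.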
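[72] Principled Fine-tuning of LLMs from User-Edits: A Medley of Preference, Supervision, and Reward cs.LG | cs.AI | cs.CL | stat.ML
Dipendra Misra, Aldo Pacchiano, Ta-Chung Chi, Ge Gao
TL;DR: This paper studies how to fine-tune large language models on user-edit data (a context, the agent's response, and the user's edits), which arises naturally in applications such as LLM-based writing assistants and coding agents. It examines learning from user edits theoretically, derives bounds for algorithms that learn from each feedback type (preferences, supervised labels, and cost), and proposes a simple ensembling procedure that learns from all of them jointly. Experiments on two datasets show that the ensemble outperforms methods that learn from a single feedback type and adapts robustly to different user-edit distributions.
Details
Motivation: User-edit data is generated naturally, which makes it an ideal source for adapting and personalizing LLMs, and it unifies feedback types (preferences, supervised labels, and cost) that are usually studied separately. The paper aims to study theoretically how to use this data effectively for fine-tuning.
Result: On two domains adapted from Gao et al. 2024, the proposed ensembling procedure outperforms methods that learn from an individual feedback type and adapts robustly to different user-edit distributions at test time.
Insight: The novelty lies in a theoretical analysis of the bounds for learning from user-edit data and a simple method that ensembles several feedback types. This gives a principled framework for fine-tuning LLMs on natural user-interaction data; the unified multi-feedback learning and the robust adaptation strategy are both reusable ideas.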
Abstract: We study how to fine-tune LLMs using user-edit deployment data consisting of a context, an agent’s response, and user edits. This deployment data is naturally generated by users in applications such as LLM-based writing assistants and coding agents. The natural origin of user edits makes them a desirable source for adapting and personalizing LLMs. This setup unifies various feedback types, namely preferences, supervised labels, and cost, that are typically studied separately in the literature. In this paper, we initiate the theoretical investigation of learning from user edits. We first derive bounds for learning algorithms that learn from each of these feedback types. We prove that these algorithms have different trade-offs depending upon the user, data distribution, and model class. We then propose a simple ensembling procedure to jointly learn from these feedback types. On two domains adapted from Gao et al. 2024, we show that our ensembling procedure outperforms the methods that learn from individual feedback types. Further, we show that our proposed procedure can robustly adapt to different user-edit distributions at test time.
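[73] Group Distributionally Robust Optimization-Driven Reinforcement Learning for LLM Reasoning cs.LG | cs.AI | cs.CL
Kishan Panaganti, Zhenwen Liang, Wenhao Yu, Haitao Mi, Dong Yu
TL;DR: This paper proposes Multi-Adversary Group Distributionally Robust Optimization (GDRO), an optimization framework that addresses the training inefficiency of standard RL methods for LLM reasoning (such as GRPO) caused by uniform prompt sampling and a fixed number of rollouts. An online difficulty classifier partitions prompts by difficulty dynamically, and two independent adversarial games, Prompt-GDRO and Rollout-GDRO, optimize the prompt sampling distribution and the rollout allocation respectively, so that the model trains more efficiently on hard, long-tail reasoning problems.
Details
Motivation: In current LLM reasoning training, standard RL paradigms such as GRPO sample prompts uniformly and use a fixed number of rollouts per prompt. On heterogeneous, heavy-tailed reasoning data this is inefficient: compute is wasted on already-solved problems while hard long-tail problems are under-trained.
Result: Validated on the DAPO 14.1k dataset with Qwen3-Base models (1.7B, 4B, and 8B). Compared with the GRPO baseline, Prompt-GDRO and Rollout-GDRO achieve average relative gains in pass@8 accuracy of +10.6% and +10.1% respectively. Qualitative analysis shows an emergent curriculum in which resources shift dynamically to the evolving reasoning frontier.
Insight: The main novelty is bringing distributionally robust optimization into RL training for LLM reasoning, using dynamic difficulty grouping and adversarial mechanisms (an EMA-debiased multiplicative-weights bandit sampler and a shadow-price controller) to adapt the training distribution and use compute more efficiently on hard tasks. The compute-neutral Rollout-GDRO and the Prompt-GDRO that targets the difficulty margin offer a new theoretical framework and practical methods for optimizing the training data distribution and resource allocation.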
Abstract: Recent progress in Large Language Model (LLM) reasoning is increasingly driven by the refinement of post-training loss functions and alignment strategies. However, standard Reinforcement Learning (RL) paradigms like Group Relative Policy Optimization (GRPO) remain constrained by static uniformity: uniform prompt sampling and a fixed number of rollouts per prompt. For heterogeneous, heavy-tailed reasoning data, this creates structural inefficiencies that waste compute on already-solved patterns while under-training the long tail of hard problems. To address this, we propose Multi-Adversary Group Distributionally Robust Optimization (GDRO), an optimization-first framework that moves beyond uniform reasoning models by dynamically adapting the training distribution. We introduce an Online Difficulty Classifier that partitions prompts into dynamic pass@k difficulty groups. We then propose two independent GDRO games for post-training: (1) Prompt-GDRO, which employs an EMA-debiased multiplicative-weights bandit sampler to target the intensive difficulty margin and upweight persistently hard groups without frequency bias; and (2) Rollout-GDRO, which uses a shadow-price controller to reallocate rollouts across groups, maximizing gradient variance reduction on hard tasks under a fixed mean budget (compute-neutral). We provide no-regret guarantees for both controllers and additionally a variance-proxy analysis motivating a square-root optimal rollout allocation for Rollout-GDRO. We validate our framework on the DAPO 14.1k dataset using Qwen3-Base models. Prompt-GDRO and Rollout-GDRO achieve average relative gains of +10.6% and +10.1%, respectively, in pass@8 accuracy across 1.7B, 4B, and 8B scales compared to the GRPO baseline. Qualitative analysis shows an emergent curriculum: the adversaries shift resources to the evolving reasoning frontier, enhancing the reasoning model’s performance.
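The idea of upweighting persistently hard groups can be illustrated with a plain multiplicative-weights update over difficulty groups. This is a simplified, full-information sketch: it omits the EMA debiasing and bandit feedback that the abstract attributes to Prompt-GDRO, and the step size `eta` and per-group loss values are assumptions.

```python
import math

def mw_update(weights, group_losses, eta=0.5):
    """One multiplicative-weights step over difficulty groups.

    Groups with higher loss gain sampling mass; the result is renormalized
    so that it remains a probability distribution.
    """
    scaled = [w * math.exp(eta * loss)
              for w, loss in zip(weights, group_losses)]
    total = sum(scaled)
    return [s / total for s in scaled]
```

Repeating the update with losses measured as, say, one minus the group's pass rate would gradually shift prompt sampling toward the groups the policy still fails on.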
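[74] Explicit Multi-head Attention for Inter-head Interaction in Large Language Models cs.LG | cs.AI | cs.CL
Runyu Peng, Yunhua Zhou, Demin Song, Kai Lv, Bo Wang
TL;DR: This paper proposes Multi-head Explicit Attention (MEA), a new attention mechanism that improves attention in large language models by explicitly modeling interaction between heads. MEA has two core components, a head-level linear composition module and a head-level group normalization layer, which together enable communication across heads and align their statistics. The method is robust in pretraining and tolerates larger learning rates, which speeds up convergence, lowers validation loss, and improves performance on several tasks. The paper also explores MEA's parameter efficiency: by reducing the number of attention heads and reconstructing them as low-rank "virtual heads", it compresses the key-value cache, cutting memory use by 50% with negligible performance loss.
Details
Motivation: Prior studies of Transformer-based large language models have shown that interaction between heads can improve attention performance. The paper therefore models cross-head interaction explicitly to address the limited head interaction in existing attention mechanisms and to improve model quality and efficiency.
Result: MEA is robust in pretraining and tolerates larger learning rates, leading to faster convergence, lower validation loss, and better performance on several tasks. For key-value cache compression, the method reduces memory use by 50% with negligible loss on knowledge-intensive and scientific reasoning tasks and only a 3.59% accuracy drop on Olympiad-level mathematical benchmarks.
Insight: The novelty is the MEA mechanism for explicit head interaction, comprising the head-level linear composition module and the head-level group normalization layer, which enable cross-head communication and align statistics. Beyond improving attention, the method achieves efficient compression through low-rank virtual-head reconstruction, a useful idea for optimizing and deploying large language models.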
Abstract: In large language models built upon the Transformer architecture, recent studies have shown that inter-head interaction can enhance attention performance. Motivated by this, we propose Multi-head Explicit Attention (MEA), a simple yet effective attention variant that explicitly models cross-head interaction. MEA consists of two key components: a Head-level Linear Composition (HLC) module that separately applies learnable linear combinations to the key and value vectors across heads, thereby enabling rich inter-head communication; and a head-level Group Normalization layer that aligns the statistical properties of the recombined heads. MEA shows strong robustness in pretraining, which allows the use of larger learning rates that lead to faster convergence, ultimately resulting in lower validation loss and improved performance across a range of tasks. Furthermore, we explore the parameter efficiency of MEA by reducing the number of attention heads and leveraging HLC to reconstruct them using low-rank “virtual heads”. This enables a practical key-value cache compression strategy that reduces KV-cache memory usage by 50% with negligible performance loss on knowledge-intensive and scientific reasoning tasks, and only a 3.59% accuracy drop for Olympiad-level mathematical benchmarks.
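Head-level linear composition can be pictured as multiplying the stack of per-head key (or value) vectors by a small mixing matrix. The sketch below handles a single token in pure Python; the function name, shapes, and coefficients are illustrative assumptions, and the head-level group normalization that MEA applies afterwards is omitted.

```python
def compose_heads(heads, mix):
    """Linearly recombine per-head vectors for one token.

    heads: H_in vectors of size d (the keys or the values of one token).
    mix:   H_out x H_in matrix of coefficients (learnable in a real model).
    Returns H_out vectors, each a linear combination of the input heads.
    """
    d = len(heads[0])
    return [[sum(row[j] * heads[j][k] for j in range(len(heads)))
             for k in range(d)]
            for row in mix]

# Two stored heads reconstructed into four "virtual" heads, so the cache
# holds half as many key/value vectors as the attention layer consumes.
stored = [[1.0, 0.0], [0.0, 2.0]]
mix = [[1, 0], [0, 1], [0.5, 0.5], [1, -1]]
virtual = compose_heads(stored, mix)
```

Because the abstract states that keys and values are recombined separately, a fuller version would keep one mixing matrix for each.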
physics.optics [Back]
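[75] Learned split-spectrum metalens for obstruction-free broadband imaging in the visible physics.optics | cs.AI | cs.CV | physics.app-ph
Seungwoo Yoon, Dohyun Kang, Eunsue Choi, Sohyun Lee, Seoyeon Kim
TL;DR: This paper proposes a learned split-spectrum metalens. It splits the spectrum of each RGB channel into pass and stop bands and uses multi-band spectral filtering to focus light from distant objects while filtering out light from near-depth obstructions; neural-network post-processing then yields broadband, obstruction-free imaging.
Details
Motivation: Obstructions such as raindrops, fences, or dust degrade image quality, especially when mechanical cleaning is infeasible. Conventional solutions rely on bulky compound optics arrays or computational inpainting, sacrificing compactness or fidelity.
Result: For broadband obstruction-free imaging, the method achieves a relative PSNR gain of 32.29% and improves object detection and semantic segmentation accuracy, with absolute gains of +13.54% mAP, +48.45% IoU, and +20.35% mIoU over a conventional hyperbolic design.
Insight: The novelty is the learned split-spectrum metalens design, which divides the spectrum into pass and stop bands to filter obstruction light optically and then enhances the signal with a neural network. This offers robust obstruction-free sensing for space-constrained systems such as mobile robots, drones, and endoscopes.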
Abstract: Obstructions such as raindrops, fences, or dust degrade captured images, especially when mechanical cleaning is infeasible. Conventional solutions to obstructions rely on a bulky compound optics array or computational inpainting, which compromise compactness or fidelity. Metalenses composed of subwavelength meta-atoms promise compact imaging, but simultaneous achievement of broadband and obstruction-free imaging remains a challenge, since a metalens that images distant scenes across a broadband spectrum cannot properly defocus near-depth occlusions. Here, we introduce a learned split-spectrum metalens that enables broadband obstruction-free imaging. Our approach divides the spectrum of each RGB channel into pass and stop bands with multi-band spectral filtering and learns the metalens to focus light from far objects through pass bands, while filtering focused near-depth light through stop bands. This optical signal is further enhanced using a neural network. Our learned split-spectrum metalens achieves broadband and obstruction-free imaging with relative PSNR gains of 32.29% and improves object detection and semantic segmentation accuracies with absolute gains of +13.54% mAP, +48.45% IoU, and +20.35% mIoU over a conventional hyperbolic design. This promises robust obstruction-free sensing and vision for space-constrained systems, such as mobile robots, drones, and endoscopes.
cs.MM [Back]
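[76] Benchmarking Multimodal Large Language Models for Missing Modality Completion in Product Catalogues cs.MM | cs.CV | cs.IR
Junchen Fu, Wenhao Deng, Kaiwen Zheng, Alexandros Karatzoglou, Ioannis Arapakis
TL;DR: This paper studies how well multimodal large language models can complete missing modalities in e-commerce settings and proposes the Missing Modality Product Completion Benchmark (MMPCBench), which consists of a content quality completion sub-benchmark and a recommendation sub-benchmark. It evaluates six SOTA MLLMs from the Qwen2.5-VL and Gemma-3 families on image-to-text and text-to-image completion across nine real-world e-commerce categories. MLLMs capture high-level semantics but struggle with fine-grained alignment; performance varies by category and model scale and does not correlate simply with model size. The paper also explores GRPO, which improves image-to-text completion but not text-to-image completion, exposing the limits of current MLLMs in real-world cross-modal generation.
Details
Motivation: On e-commerce platforms, annotation errors or incomplete metadata leave modalities missing (for example, absent images or text descriptions), which harms product presentation and downstream applications such as recommendation systems. The paper asks whether MLLMs can generate the missing modality in this setting.
Result: Six SOTA MLLMs are evaluated on MMPCBench. They capture high-level semantics but perform poorly on fine-grained word-level and pixel- or patch-level alignment. Performance differs substantially across product categories and model scales, and model size shows no simple positive correlation with performance, unlike the trend on mainstream benchmarks. GRPO improves image-to-text completion but not text-to-image completion.
Insight: The novelty lies in the first systematic study of MLLMs on missing-modality completion for e-commerce and in the dedicated MMPCBench benchmark. The analysis shows that current MLLMs are limited by fine-grained alignment and category dependence in real-world cross-modal generation, and that GRPO helps task alignment only selectively, which points to directions for improvement.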
Abstract: Missing-modality information on e-commerce platforms, such as absent product images or textual descriptions, often arises from annotation errors or incomplete metadata, impairing both product presentation and downstream applications such as recommendation systems. Motivated by the multimodal generative capabilities of recent Multimodal Large Language Models (MLLMs), this work investigates a fundamental yet underexplored question: can MLLMs generate missing modalities for products in e-commerce scenarios? We propose the Missing Modality Product Completion Benchmark (MMPCBench), which consists of two sub-benchmarks: a Content Quality Completion Benchmark and a Recommendation Benchmark. We further evaluate six state-of-the-art MLLMs from the Qwen2.5-VL and Gemma-3 model families across nine real-world e-commerce categories, focusing on image-to-text and text-to-image completion tasks. Experimental results show that while MLLMs can capture high-level semantics, they struggle with fine-grained word-level and pixel- or patch-level alignment. In addition, performance varies substantially across product categories and model scales, and we observe no trivial correlation between model size and performance, in contrast to trends commonly reported in mainstream benchmarks. We also explore Group Relative Policy Optimization (GRPO) to better align MLLMs with this task. GRPO improves image-to-text completion but does not yield gains for text-to-image completion. Overall, these findings expose the limitations of current MLLMs in real-world cross-modal generation and represent an early step toward more effective missing-modality product completion.
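For readers unfamiliar with GRPO, its central step scores each sampled response relative to the other responses drawn for the same prompt. The sketch below shows the commonly described normalization by group mean and standard deviation; whether the authors use exactly this variant is not stated in the abstract, and the population standard deviation and the `eps` term are assumptions.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Advantage of each response relative to its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumed choice
    # eps guards against division by zero when all rewards are equal.
    return [(r - mu) / (sigma + eps) for r in rewards]
```

When every response in a group receives the same reward, all advantages are zero and that prompt contributes no gradient signal.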
cs.RO [Back]
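[77] ALRM: Agentic LLM for Robotic Manipulation cs.RO | cs.CL
Vitor Gaboardi dos Santos, Ibrahim Khadraoui, Ibrahim Farhat, Hamza Yous, Samy Teffahi
TL;DR: This paper proposes ALRM (Agentic LLM for Robotic Manipulation), an LLM-driven agentic framework for robotic manipulation. Through a ReAct-style reasoning loop it combines policy generation with agentic execution and supports two complementary modes, Code-as-Policy (CaP) and Tool-as-Policy (TaP). For systematic evaluation, the authors also introduce a new simulation benchmark with 56 tasks. Experiments show that ALRM offers a scalable, interpretable, and modular way to connect natural-language reasoning with reliable robot execution.
Details
Motivation: Existing LLM-based robot control methods lack modular, agentic closed-loop execution mechanisms, and existing manipulation benchmarks focus on low-level control without systematically evaluating multistep reasoning and linguistic variation. The paper addresses both limitations.
Result: Ten LLMs are evaluated on the new simulation benchmark. Under CaP, Claude-4.1-Opus is the best closed-source model and Falcon-H1-7B is the best open-source model.
Insight: The main novelty is a modular framework (ALRM) that integrates policy generation with agentic execution, together with a new simulation benchmark that covers linguistic variation. At its core, it applies a ReAct-style reasoning loop to robotic manipulation and distinguishes direct code generation (CaP) from iterative tool calling (TaP), which improves flexibility and reliability.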
Abstract: Large Language Models (LLMs) have recently empowered agentic frameworks to exhibit advanced reasoning and planning capabilities. However, their integration in robotic control pipelines remains limited in two aspects: (1) prior LLM-based approaches often lack modular, agentic execution mechanisms, limiting their ability to plan, reflect on outcomes, and revise actions in a closed-loop manner; and (2) existing benchmarks for manipulation tasks focus on low-level control and do not systematically evaluate multistep reasoning and linguistic variation. In this paper, we propose Agentic LLM for Robot Manipulation (ALRM), an LLM-driven agentic framework for robotic manipulation. ALRM integrates policy generation with agentic execution through a ReAct-style reasoning loop, supporting two complementary modes: Code-as-Policy (CaP) for direct executable control code generation, and Tool-as-Policy (TaP) for iterative planning and tool-based action execution. To enable systematic evaluation, we also introduce a novel simulation benchmark comprising 56 tasks across multiple environments, capturing linguistically diverse instructions. Experiments with ten LLMs demonstrate that ALRM provides a scalable, interpretable, and modular approach for bridging natural language reasoning with reliable robotic execution. Results reveal Claude-4.1-Opus as the top closed-source model and Falcon-H1-7B as the top open-source model under CaP.
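[78] Perception-to-Pursuit: Track-Centric Temporal Reasoning for Open-World Drone Detection and Autonomous Chasing cs.RO | cs.CV
Venkatakrishna Reddy Oruganti
TL;DR: This paper proposes Perception-to-Pursuit (P2P), a track-centric temporal reasoning framework for open-world drone detection and autonomous chasing. It represents drone motion as compact 8-dimensional tokens and uses a causal Transformer to reason about future behavior, with the aim of closing the gap between detection and actionable pursuit planning.
Details
Motivation: Existing tracking methods optimize only for prediction accuracy and ignore pursuit feasibility, so the predicted trajectories are almost never physically interceptable. The paper addresses this disconnect between detection and actionable trajectory prediction in autonomous drone pursuit.
Result: Evaluated on the Anti-UAV-RGBT dataset with 226 real drone sequences, P2P achieves an average displacement error of 28.12 pixels and an intercept success rate of 0.597. Compared with tracking-only baselines, trajectory prediction improves by 77% and pursuit feasibility by 597x, while drone classification accuracy stays at 100%.
Insight: The novelty is a track-centric representation that encodes motion patterns (velocity, acceleration, scale, and smoothness) as compact tokens, together with a new Intercept Success Rate metric that quantifies pursuit feasibility under realistic interceptor constraints, tying temporal reasoning directly to actionable pursuit planning.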
Abstract: Autonomous drone pursuit requires not only detecting drones but also predicting their trajectories in a manner that enables kinematically feasible interception. Existing tracking methods optimize for prediction accuracy but ignore pursuit feasibility, resulting in trajectories that are physically impossible to intercept 99.9% of the time. We propose Perception-to-Pursuit (P2P), a track-centric temporal reasoning framework that bridges detection and actionable pursuit planning. Our method represents drone motion as compact 8-dimensional tokens capturing velocity, acceleration, scale, and smoothness, enabling a 12-frame causal transformer to reason about future behavior. We introduce the Intercept Success Rate (ISR) metric to measure pursuit feasibility under realistic interceptor constraints. Evaluated on the Anti-UAV-RGBT dataset with 226 real drone sequences, P2P achieves 28.12 pixel average displacement error and 0.597 ISR, representing a 77% improvement in trajectory prediction and 597x improvement in pursuit feasibility over tracking-only baselines, while maintaining perfect drone classification accuracy (100%). Our work demonstrates that temporal reasoning over motion patterns enables both accurate prediction and actionable pursuit planning.
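The average displacement error reported above is the mean Euclidean distance between predicted and ground-truth positions at matching time steps. A minimal sketch follows, with trajectories given as lists of (x, y) pixel coordinates; the function name and input layout are illustrative assumptions. The ISR metric is not sketched because its interceptor constraints are not specified in the abstract.

```python
import math

def average_displacement_error(pred, truth):
    """Mean Euclidean distance between matched predicted and true points."""
    if len(pred) != len(truth) or not pred:
        raise ValueError("trajectories must be non-empty and equal length")
    return sum(math.dist(p, t) for p, t in zip(pred, truth)) / len(pred)
```

For example, a prediction that is exact at one frame and 5 pixels off at the next has an error of 2.5 pixels.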