cs.CL

[1] Order Matters: Rethinking Prompt Construction in In-Context Learning

Warren Li,Yiqian Wang,Zihan Wang,Jingbo Shang

Main category: cs.CL

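TL;DR: This paper studies the role of example ordering in in-context learning (ICL), finds that its effect on performance is comparable to that of example selection, and shows that strong orderings can be identified with a development set.

Motivation: Prior work has focused mainly on how example selection affects ICL performance and has largely overlooked example ordering. This paper re-evaluates the relative importance of selection and ordering.

Contribution: 1. Shows experimentally that example ordering affects performance about as much as example selection does; 2. Proposes a development-set-based method that identifies near-optimal orderings.

Method: Controlled experiments on classification and generation tasks, using multiple open-source models (0.5B to 27B parameters) and GPT-5, compare the effects of selection and ordering; candidate orderings are then evaluated on a development set.

Result: The performance variance across different orderings is comparable to that across entirely different example sets; the development-set method identifies near-optimal orderings.

Insight: Example selection and ordering are equally important in prompt design, which calls for re-examining the assumptions held in ICL.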

Abstract: In-context learning (ICL) enables large language models to perform new tasks by conditioning on a sequence of examples. Most prior work reasonably and intuitively assumes that which examples are chosen has a far greater effect on performance than how those examples are ordered, leading to a focus on example selection. We revisit this assumption and conduct a systematic comparison between the effect of selection and ordering. Through controlled experiments on both classification and generation tasks, using multiple open-source model families (0.5B to 27B parameters) and GPT-5, we find that the variance in performance due to different example orderings is comparable to that from using entirely different example sets. Furthermore, we show that strong orderings can be identified using only a development set, achieving performance close to an oracle that selects the best ordering based on test labels. Our findings highlight the equal and intertwined importance of example selection and ordering in prompt design, calling for a reexamination of the assumptions held in ICL.
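
As a rough illustration of picking an ordering with a development set, the sketch below scores a bounded number of permutations of a fixed example set and keeps the best one. It is a minimal sketch under stated assumptions, not the authors' procedure: `select_ordering`, `dev_accuracy`, the `max_orderings` cap, and the recency-biased `toy_predict` stand-in for a model call are all hypothetical.

```python
from itertools import islice, permutations

def dev_accuracy(ordering, dev_set, predict):
    """Fraction of dev items answered correctly with this demonstration order."""
    correct = sum(predict(ordering, x) == y for x, y in dev_set)
    return correct / len(dev_set)

def select_ordering(examples, dev_set, predict, max_orderings=24):
    """Score up to max_orderings permutations on the dev set and return the best."""
    candidates = islice(permutations(examples), max_orderings)
    scored = [(dev_accuracy(o, dev_set, predict), o) for o in candidates]
    return max(scored, key=lambda pair: pair[0])

# Hypothetical stand-in for an LLM call: a "model" with pure recency bias
# that copies the label of the last demonstration in the prompt.
def toy_predict(ordering, x):
    return ordering[-1][1]

examples = [("great film", "pos"), ("dull plot", "neg"), ("loved it", "pos")]
dev_set = [("awful", "neg"), ("boring", "neg"), ("fine", "pos")]
best_acc, best_order = select_ordering(examples, dev_set, toy_predict)
```

With a real model, `predict` would build the prompt from `ordering` and query the model; the cap on permutations matters because the number of orderings grows factorially with the number of examples.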

[2] Assessing the Applicability of Natural Language Processing to Traditional Social Science Methodology: A Case Study in Identifying Strategic Signaling Patterns in Presidential Directives

C. LeMay,A. Lane,J. Seales,M. Winstead,S. Baty

Main category: cs.CL

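TL;DR: This paper explores the potential of natural language processing (NLP) in social science research. A case study on identifying signaling themes in Presidential Directives from the Reagan through Clinton administrations shows how NLP can help analyze large text corpora, and also reveals discrepancies between NLP and human-labeled results.

Motivation: Social science research often involves analyzing large volumes of text, where traditional methods are slow and subjective. The authors want to use NLP to improve efficiency and objectivity while testing whether it is applicable in this domain.

Contribution: 1. Presents a practical case study of NLP applied to social science; 2. Documents discrepancies between NLP and human analysis, pointing to directions for future research.

Method: The authors combine NLP with human annotation to extract signaling themes from Presidential Directives of the Reagan through Clinton administrations, and compare the two sets of results.

Result: NLP identified relevant documents, but its results diverged from the human-labeled ones, indicating that existing tools still need improvement.

Insight: NLP has potential in social science research, but its validity needs further assessment. Because the tools evolve quickly, research has to keep pace with them.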

Abstract: Our research investigates how Natural Language Processing (NLP) can be used to extract main topics from a larger corpus of written data, as applied to the case of identifying signaling themes in Presidential Directives (PDs) from the Reagan through Clinton administrations. Analysts and NLP both identified relevant documents, demonstrating the potential utility of NLP in research involving large written corpora. However, we also identified discrepancies between NLP and human-labeled results that indicate a need for more research to assess the validity of NLP in this use case. The research was conducted in 2023, and the rapidly evolving AI/ML landscape means existing tools have improved and new tools have been developed; this research displays the inherent capabilities of a potentially dated AI tool in emerging social science applications.

[3] How Small Can You Go? Compact Language Models for On-Device Critical Error Detection in Machine Translation

Muskaan Chopra,Lorenz Sparrenberg,Sarthak Khanna,Rafet Sifa

Main category: cs.CL

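TL;DR: The paper studies how to deploy compact language models (under 2B parameters) on edge devices for efficient Critical Error Detection (CED) in machine translation, and finds that models of around one billion parameters, such as Gemma-3-1B, offer the best quality-efficiency trade-off.

Motivation: Large language models (LLMs) are strong evaluators of machine translation, but their scale and cost limit their use on edge devices and in privacy-sensitive settings. The paper asks whether compact models can stay accurate while being cheap to deploy.

Contribution: 1. Proposes a standardized framework for evaluating small models on CED; 2. Finds that models of about one billion parameters, such as Gemma-3-1B, give the best balance of quality and efficiency; 3. Releases the datasets, prompts, and scripts.

Method: 1. Standardized prompt design; 2. Lightweight logit-bias calibration and majority voting; 3. Evaluation on both semantic quality (MCC, F1-ERR/F1-NOT) and compute metrics (VRAM, latency, throughput).

Result: Gemma-3-1B gives the best trade-off (MCC=0.77, F1-ERR=0.98 on SynCED-EnDe-2025 after fine-tuning), with a single-sample latency of 400 ms on a MacBook Pro M4 Pro; the larger Qwen-3-1.7B reaches a higher MCC but at a higher compute cost.

Insight: Compact, instruction-tuned LLMs combined with lightweight calibration and small-sample supervision can deliver efficient, privacy-preserving, on-device error detection suitable for real translation pipelines.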

Abstract: Large Language Models (LLMs) excel at evaluating machine translation (MT), but their scale and cost hinder deployment on edge devices and in privacy-sensitive workflows. We ask: how small can you get while still detecting meaning-altering translation errors? Focusing on English->German Critical Error Detection (CED), we benchmark sub-2B models (LFM2-350M, Qwen-3-0.6B/1.7B, Llama-3.2-1B-Instruct, Gemma-3-1B) across WMT21, WMT22, and SynCED-EnDe-2025. Our framework standardizes prompts, applies lightweight logit-bias calibration and majority voting, and reports both semantic quality (MCC, F1-ERR/F1-NOT) and compute metrics (VRAM, latency, throughput). Results reveal a clear sweet spot around one billion parameters: Gemma-3-1B provides the best quality-efficiency trade-off, reaching MCC=0.77 with F1-ERR=0.98 on SynCED-EnDe-2025 after merged-weights fine-tuning, while maintaining 400 ms single-sample latency on a MacBook Pro M4 Pro (24 GB). At larger scale, Qwen-3-1.7B attains the highest absolute MCC (+0.11 over Gemma) but with higher compute cost. In contrast, ultra-small models (0.6B) remain usable with few-shot calibration yet under-detect entity and number errors. Overall, compact, instruction-tuned LLMs augmented with lightweight calibration and small-sample supervision can deliver trustworthy, on-device CED for MT, enabling private, low-cost error screening in real-world translation pipelines. All datasets, prompts, and scripts are publicly available at our GitHub repository.
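
To make the reported metric concrete, the sketch below combines several sampled labels per translation by majority vote and then computes the Matthews correlation coefficient (MCC) from the resulting confusion counts. It is an illustrative sketch only: the `ERR`/`NOT` label strings, the three samples per item, and the toy data are assumptions, not values taken from the paper.

```python
from collections import Counter
from math import sqrt

def majority_vote(votes):
    """Return the most frequent label among the sampled predictions."""
    return Counter(votes).most_common(1)[0][0]

def mcc(y_true, y_pred, positive="ERR"):
    """Matthews correlation coefficient for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when any marginal count is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Toy data (hypothetical): three sampled labels per translation.
samples = [
    ["ERR", "ERR", "NOT"],
    ["NOT", "NOT", "NOT"],
    ["ERR", "NOT", "NOT"],
    ["ERR", "ERR", "ERR"],
]
y_true = ["ERR", "NOT", "ERR", "ERR"]
y_pred = [majority_vote(v) for v in samples]
score = mcc(y_true, y_pred)
```

An odd number of samples per item avoids ties in the binary vote.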

[4] Improving Graduate Outcomes by Identifying Skills Gaps and Recommending Courses Based on Career Interests

Rahul Soni,Basem Suleiman,Sonit Singh

Main category: cs.CL

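TL;DR: The paper proposes a course recommendation system based on data analytics and machine learning that combines user preferences, academic criteria, and industry trends to help students choose courses that match their career interests.

Motivation: Students currently lack concrete guidance when choosing courses and struggle to align their choices with industry needs. The paper aims to build an intelligent recommendation system that helps them make better-informed decisions.

Contribution: 1. Proposes a course recommendation framework that combines data mining and collaborative filtering. 2. Designs a user-friendly front-end interface to improve usability. 3. Refines the system with user feedback so that it meets the needs of its target users.

Method: Data mining and machine learning algorithms (such as collaborative filtering) analyze past course data and career goals; the front end is refined through iterative prototyping and user testing.

Result: The resulting system provides personalized course recommendations that help students make better-informed academic and career decisions.

Insight: The system's value comes from combining multiple data sources with user feedback to narrow the gap between university learning and industry expectations.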

Abstract: This paper aims to address the challenge of selecting relevant courses for students by proposing the design and development of a course recommendation system. The course recommendation system utilises a combination of data analytics techniques and machine learning algorithms to recommend courses that align with current industry trends and requirements. In order to provide customised suggestions, the study entails the design and implementation of an extensive algorithmic framework that combines machine learning methods, user preferences, and academic criteria. The system employs data mining and collaborative filtering techniques to examine past courses and individual career goals in order to provide course recommendations. Moreover, to improve the accessibility and usefulness of the recommendation system, special attention is given to the development of an easy-to-use front-end interface. The front-end design prioritises visual clarity, interaction, and simplicity through iterative prototyping and user input revisions, guaranteeing a smooth and captivating user experience. We refined and optimised the proposed system by incorporating user feedback, ensuring that it effectively meets the needs and preferences of its target users. The proposed course recommendation system could be a useful tool for students, instructors, and career advisers to use in promoting lifelong learning and professional progression as it fills the gap between university learning and industry expectations. We hope that the proposed course recommendation system will help university students in making data-driven and industry-informed course decisions, in turn improving graduate outcomes for the university sector.

[5] Answering Students’ Questions on Course Forums Using Multiple Chain-of-Thought Reasoning and Finetuning RAG-Enabled LLM

Neo Wang,Sonit Singh

Main category: cs.CL

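TL;DR: To address delayed answers and repetitive questions on course forums, the paper proposes a question answering system built on retrieval-augmented generation (RAG) and a large language model (LLM), and uses multiple chain-of-thought reasoning to reduce hallucination.

Motivation: As course enrollments grow, instructors struggle to answer forum questions promptly, and the same questions recur. These challenges motivate an automated question answering system.

Contribution: The main contribution is a question answering system that combines RAG with an open-source LLM, using a local knowledge base and multiple chain-of-thought reasoning to improve performance and reduce LLM hallucination.

Method: 1) Retrieve relevant documents from a local knowledge base with RAG; 2) fine-tune the LLM; 3) add multiple chain-of-thought reasoning to reduce hallucination.

Result: Experiments show that the fine-tuned LLM with RAG performs strongly on the HotpotQA dataset.

Insight: The study shows the potential of RAG and multiple chain-of-thought reasoning for improving LLM question answering, particularly in education, and offers an approach that could extend to similar settings.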

Abstract: Course forums are increasingly significant and play a vital role in facilitating student discussions and answering their questions related to the course. They provide a platform for students to post questions about course content and administrative issues. However, several challenges arise as the number of students enrolled in a course increases. The primary challenge is that students' queries cannot be answered immediately, and instructors have to face many repetitive questions. To mitigate these issues, we propose a question answering system based on a large language model with the retrieval augmented generation (RAG) method. This work focuses on designing a question answering system with an open-source Large Language Model (LLM) and fine-tuning it on the relevant course dataset. To further improve the performance, we use a local knowledge base and apply the RAG method to retrieve documents relevant to students' queries, where the local knowledge base contains all the course content. To mitigate the hallucination of LLMs, we also integrate the system with multi chain-of-thought reasoning. In this work, we experiment with a fine-tuned LLM with the RAG method on the HotpotQA dataset. The experimental results demonstrate that the fine-tuned LLM with the RAG method performs strongly on the question answering task.

[6] In-Token Rationality Optimization: Towards Accurate and Concise LLM Reasoning via Self-Feedback

Mingye Zhu,Yi Liu,Zheren Fu,Quan Wang,Yongdong Zhang

Main category: cs.CL

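TL;DR: InTRO (In-Token Rationality Optimization) is a new framework that uses self-feedback and token-level exploration to make LLM reasoning accurate and concise. It avoids the generalization problems and computational cost of conventional methods, raises accuracy on math-reasoning tasks by up to 20% relative to the base model, and reduces verbosity.

Motivation: Supervised fine-tuning on a single golden rationale penalizes other valid alternatives and limits generalization, while reinforcement learning with verifiable rewards suffers from credit assignment problems and high computational cost. InTRO aims to address these limitations.

Contribution: Proposes the InTRO framework, which optimizes reasoning through token-level exploration and self-feedback; introduces correction factors that estimate token importance to improve accuracy and conciseness; demonstrates cross-domain transfer.

Method: Token-wise importance weights (correction factors) are estimated from the information discrepancy between the generative policy and its answer-conditioned counterpart, so that exploration and self-feedback both happen within a single forward pass; the objective encourages accurate and concise rationales.

Result: Across six math-reasoning benchmarks, InTRO improves accuracy by up to 20% relative to the base model and produces more concise reasoning chains; it also transfers to reasoning tasks outside mathematics.

Insight: Combining token-level exploration with self-feedback is key to optimizing LLM reasoning efficiently; correction factors provide a lightweight credit assignment mechanism that avoids the complexity of conventional reinforcement learning.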

Abstract: Training Large Language Models (LLMs) for chain-of-thought reasoning presents a significant challenge: supervised fine-tuning on a single “golden” rationale hurts generalization as it penalizes equally valid alternatives, whereas reinforcement learning with verifiable rewards struggles with credit assignment and prohibitive computational cost. To tackle these limitations, we introduce InTRO (In-Token Rationality Optimization), a new framework that enables both token-level exploration and self-feedback for accurate and concise reasoning. Instead of directly optimizing an intractable objective over all valid reasoning paths, InTRO leverages correction factors (token-wise importance weights estimated by the information discrepancy between the generative policy and its answer-conditioned counterpart) for informative next token selection. This approach allows the model to perform token-level exploration and receive self-generated feedback within a single forward pass, ultimately encouraging accurate and concise rationales. Across six math-reasoning benchmarks, InTRO consistently outperforms other baselines, raising solution accuracy by up to 20% relative to the base model. Its chains of thought are also notably more concise, exhibiting reduced verbosity. Beyond this, InTRO enables cross-domain transfer, successfully adapting to out-of-domain reasoning tasks that extend beyond the realm of mathematics, demonstrating robust generalization.

[7] HierRouter: Coordinated Routing of Specialized Large Language Models via Reinforcement Learning

Nikunj Gupta,Bill Guo,Rajgopal Kannan,Viktor K. Prasanna

Main category: cs.CL

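TL;DR: HierRouter is a reinforcement-learning-based hierarchical routing approach that dynamically assembles inference pipelines from a pool of specialized, lightweight language models to improve quality while keeping cost low.

Motivation: Large language models (LLMs) perform very well but impose high computational and memory costs, which limits their use in resource-constrained or real-time settings. HierRouter is proposed to address this.

Contribution: The main contribution is HierRouter, a PPO-based reinforcement learning approach that dynamically assembles inference pipelines from lightweight specialized models, improving response quality at little extra cost.

Method: The problem is formulated as a finite-horizon Markov Decision Process (MDP), and a PPO agent is trained to select which model to invoke at each stage of inference, conditioned on the evolving context and the accumulated cost.

Result: Across six benchmarks, HierRouter improves response quality by up to 2.4x over using individual models independently, while adding only a minimal inference cost on average.

Insight: Hierarchical routing coordinates model resources efficiently and offers a practical route to high-performance LLM inference in resource-constrained settings.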

Abstract: Large Language Models (LLMs) deliver state-of-the-art performance across many tasks but impose high computational and memory costs, limiting their deployment in resource-constrained or real-time settings. To address this, we propose HierRouter, a hierarchical routing approach that dynamically assembles inference pipelines from a pool of specialized, lightweight language models. Formulated as a finite-horizon Markov Decision Process (MDP), our approach trains a Proximal Policy Optimization (PPO)-based reinforcement learning agent to iteratively select which models to invoke at each stage of multi-hop inference. The agent conditions on the evolving context and accumulated cost to make context-aware routing decisions. Experiments with three open-source candidate LLMs across six benchmarks, including QA, code generation, and mathematical reasoning, show that HierRouter improves response quality by up to 2.4x compared to using individual models independently, while incurring only a minimal additional inference cost on average. These results highlight the promise of hierarchical routing for cost-efficient, high-performance LLM inference. All code is available at https://github.com/Nikunj-Gupta/hierouter.

[8] EnchTable: Unified Safety Alignment Transfer in Fine-tuned Large Language Models

Jialin Wu,Kecen Li,Zhicong Huang,Xinfeng Li,Xiaofeng Wang,Cheng Hong

Main category: cs.CL

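TL;DR: EnchTable is a new framework that preserves safety alignment in fine-tuned large language models (LLMs) without extensive retraining, balancing safety and utility through NTK-based safety vector distillation and interference-aware merging.

Motivation: Fine-tuning LLMs can systematically degrade safety alignment and increase the risk of harmful outputs. Existing methods require extensive retraining or sacrifice performance, so an efficient and general solution is needed.

Contribution: 1. Proposes the EnchTable framework for transferring safety alignment across models and task domains; 2. An NTK-based safety vector distillation method that decouples safety constraints from task-specific reasoning; 3. An interference-aware merging technique that balances safety and utility; 4. Validation of generality and attack resistance across multiple task domains and LLMs.

Method: 1. NTK-based safety vector distillation extracts the safety vector; 2. Interference-aware merging combines the safety vector with the task model; 3. Multiple LLM architectures and task domains are supported.

Result: Evaluated on 11 datasets, EnchTable achieves a significantly lower unsafe rate and a higher utility score than six parameter modification methods and two inference-time alignment baselines, and is robust to jailbreaking attacks.

Insight: Safety alignment can be transferred efficiently through vector distillation and interference-aware merging without extensive retraining, and generality across models and task domains is achievable.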

Abstract: Many machine learning models are fine-tuned from large language models (LLMs) to achieve high performance in specialized domains like code generation, biomedical analysis, and mathematical problem solving. However, this fine-tuning process often introduces a critical vulnerability: the systematic degradation of safety alignment, undermining ethical guidelines and increasing the risk of harmful outputs. Addressing this challenge, we introduce EnchTable, a novel framework designed to transfer and maintain safety alignment in downstream LLMs without requiring extensive retraining. EnchTable leverages a Neural Tangent Kernel (NTK)-based safety vector distillation method to decouple safety constraints from task-specific reasoning, ensuring compatibility across diverse model architectures and sizes. Additionally, our interference-aware merging technique effectively balances safety and utility, minimizing performance compromises across various task domains. We implemented a fully functional prototype of EnchTable on three different task domains and three distinct LLM architectures, and evaluated its performance through extensive experiments on eleven diverse datasets, assessing both utility and model safety. Our evaluations include LLMs from different vendors, demonstrating EnchTable’s generalization capability. Furthermore, EnchTable exhibits robust resistance to static and dynamic jailbreaking attacks, outperforming vendor-released safety models in mitigating adversarial prompts. Comparative analyses with six parameter modification methods and two inference-time alignment baselines reveal that EnchTable achieves a significantly lower unsafe rate, higher utility score, and universal applicability across different task domains. Additionally, we validate EnchTable can be seamlessly integrated into various deployment pipelines without significant overhead.

[9] HI-TransPA: Hearing Impairments Translation Personal Assistant

Zhiming Ma,Shiyu Gan,Junhao Zhao,Xianming Li,Qingyun Pan,Peidong Wang,Mingjun Pan,Yuhao Mo,Jiajie Cheng,Chengxin Chen,Zhonglun Cao,Chonghan Liu,Shi Cheng

Main category: cs.CL

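TL;DR: HI-TransPA is a multimodal personal assistant for hearing-impaired people that fuses speech with lip dynamics to support translation and dialogue, using curriculum learning and careful data preprocessing to improve robustness.

Motivation: To provide a unified solution for everyday communication among hearing-impaired individuals, and to address the limitations of existing Omni-Models in handling noisy data and hearing-impaired speech.

Contribution: 1. Proposes HI-TransPA, an instruction-driven multimodal assistant; 2. Builds a data preprocessing and quality assessment pipeline; 3. Uses curriculum learning and a SigLIP encoder to improve model performance.

Method: 1. Multimodal data preprocessing (facial landmark detection, lip region isolation and stabilization, and so on); 2. A curriculum learning strategy; 3. A SigLIP encoder combined with a Unified 3D-Resampler to encode lip dynamics.

Result: On the HI-Dialogue dataset, HI-TransPA achieves state-of-the-art performance in both literal accuracy and semantic fidelity.

Insight: The Omni-Model paradigm can be applied effectively to assistive technology, and future research can build on this foundation.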

Abstract: To provide a unified and flexible solution for daily communication among hearing-impaired individuals, we introduce the Omni-Model paradigm into assistive technology and present HI-TransPA, an instruction-driven audio-visual personal assistant. The model fuses indistinct speech with high-frame-rate lip dynamics, enabling both translation and dialogue within a single multimodal framework. To tackle the challenges of noisy and heterogeneous raw data and the limited adaptability of existing Omni-Models to hearing-impaired speech, we construct a comprehensive preprocessing and curation pipeline that detects facial landmarks, isolates and stabilizes the lip region, and quantitatively assesses multimodal sample quality. These quality scores guide a curriculum learning strategy that first trains on clean, high-confidence samples and progressively incorporates harder cases to strengthen model robustness. We further adopt a SigLIP encoder combined with a Unified 3D-Resampler to efficiently encode high-frame-rate lip motion. Experiments on our purpose-built HI-Dialogue dataset show that HI-TransPA achieves state-of-the-art performance in both literal accuracy and semantic fidelity. This work establishes a foundation for applying Omni-Models to assistive communication technology, providing an end-to-end modeling framework and essential processing tools for future research.

[10] MINDS: A Cross-cultural Dialogue Corpus for Social Norm Classification and Adherence Detection

Pritish Sahu,Anirudh Som,Dimitra Vergyri,Ajay Divakaran

Main category: cs.CL

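TL;DR: The paper proposes Norm-RAG, a retrieval-augmented framework for social norm inference in multi-turn dialogues, and introduces the MINDS dataset of Mandarin-English and Spanish-English bilingual conversations for social norm classification and adherence detection.

Motivation: Social norms are implicit, culturally grounded expectations that guide interpersonal communication. Existing annotated datasets mostly cover isolated utterances or synthetic dialogues and cannot capture the fluid, multi-turn nature of real conversations, so better models and multicultural datasets are needed.

Contribution: 1. Proposes the Norm-RAG framework, which performs context-aware norm reasoning through multi-attribute modeling and semantic retrieval; 2. Releases the bilingual MINDS dataset with multi-turn dialogues and cross-cultural norm annotations.

Method: Norm-RAG models utterance-level attributes such as communicative intent, speaker roles, interpersonal framing, and linguistic cues, and retrieves structured normative documentation through a novel Semantic Chunking approach.

Result: Experiments show that Norm-RAG improves norm detection and generalization, which benefits culturally adaptive and socially intelligent dialogue systems.

Insight: Social norm reasoning requires multi-attribute modeling and cultural context, and semantic retrieval improves a model's interpretability and adaptability.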

Abstract: Social norms are implicit, culturally grounded expectations that guide interpersonal communication. Unlike factual commonsense, norm reasoning is subjective, context-dependent, and varies across cultures, posing challenges for computational models. Prior works provide valuable normative annotations but mostly target isolated utterances or synthetic dialogues, limiting their ability to capture the fluid, multi-turn nature of real-world conversations. In this work, we present Norm-RAG, a retrieval-augmented, agentic framework for nuanced social norm inference in multi-turn dialogues. Norm-RAG models utterance-level attributes including communicative intent, speaker roles, interpersonal framing, and linguistic cues and grounds them in structured normative documentation retrieved via a novel Semantic Chunking approach. This enables interpretable and context-aware reasoning about norm adherence and violation across multilingual dialogues. We further introduce MINDS (Multilingual Interactions with Norm-Driven Speech), a bilingual dataset comprising 31 multi-turn Mandarin-English and Spanish-English conversations. Each turn is annotated for norm category and adherence status using multi-annotator consensus, reflecting cross-cultural and realistic norm expression. Our experiments show that Norm-RAG improves norm detection and generalization, and demonstrate improved performance for culturally adaptive and socially intelligent dialogue systems.

[11] Leveraging Large Language Models for Identifying Knowledge Components

Canwen Wang,Jionghao Lin,Kenneth R. Koedinger

Main category: cs.CL

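TL;DR: The paper studies the use of large language models (LLMs) to automate the identification of Knowledge Components (KCs) and proposes a cosine-similarity-based semantic merging strategy that reduces redundant labels and improves model fit.

Motivation: Manually labeling Knowledge Components (KCs) is a bottleneck for adaptive learning systems, while prior LLM-based approaches have been limited to small datasets and produce redundant labels.

Contribution: 1. Scales an LLM prompting strategy to a larger dataset (646 multiple-choice questions); 2. Proposes a method for merging KC labels based on cosine similarity.

Method: 1. Generate KC labels with GPT-4o-mini; 2. Merge semantically similar labels using a cosine similarity threshold (for example, 0.8).

Result: The merging strategy reduces the KC count from 569 to 428 and improves the RMSE from 0.4285 to 0.4259, closer to the expert-designed model (0.4206).

Insight: LLM-generated KC labels alone are insufficient, but combining them with semantic merging offers a viable path toward automated KC identification.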

Abstract: Knowledge Components (KCs) are foundational to adaptive learning systems, but their manual identification by domain experts is a significant bottleneck. While Large Language Models (LLMs) offer a promising avenue for automating this process, prior research has been limited to small datasets and has been shown to produce superfluous, redundant KC labels. This study addresses these limitations by first scaling a “simulated textbook” LLM prompting strategy (using GPT-4o-mini) to a larger dataset of 646 multiple-choice questions. We found that this initial automated approach performed significantly worse than an expert-designed KC model (RMSE 0.4285 vs. 0.4206) and generated an excessive number of KCs (569 vs. 101). To address the issue of redundancy, we proposed and evaluated a novel method for merging semantically similar KC labels based on their cosine similarity. This merging strategy significantly improved the model’s performance; a model using a cosine similarity threshold of 0.8 achieved the best result, reducing the KC count to 428 and improving the RMSE to 0.4259. This demonstrates that while scaled LLM generation alone is insufficient, combining it with a semantic merging technique offers a viable path toward automating and refining KC identification.
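
One simple way to merge labels by cosine similarity is a greedy pass that maps each label to the first earlier representative whose similarity meets the threshold. The sketch below illustrates that idea only; the abstract does not specify the clustering procedure or the embedding model, so the greedy strategy, the function names, and the toy three-dimensional vectors are all assumptions. The default of 0.8 mirrors the threshold reported above.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def merge_labels(embeddings, threshold=0.8):
    """Greedily map each label to the first representative at or above the threshold."""
    representatives = []
    mapping = {}
    for label, vec in embeddings.items():
        for rep in representatives:
            if cosine(vec, embeddings[rep]) >= threshold:
                mapping[label] = rep
                break
        else:
            # No sufficiently similar representative: this label starts a new group.
            representatives.append(label)
            mapping[label] = label
    return mapping

# Toy embeddings (hypothetical); real ones would come from a sentence encoder.
embeddings = {
    "add fractions": [1.0, 0.1, 0.0],
    "adding fractions": [0.9, 0.2, 0.0],
    "solve linear equation": [0.0, 0.1, 1.0],
}
mapping = merge_labels(embeddings)
merged_count = len(set(mapping.values()))
```

A greedy pass depends on label order; a full pairwise clustering would be an alternative if order sensitivity is a concern.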

[12] REAP: Enhancing RAG with Recursive Evaluation and Adaptive Planning for Multi-Hop Question Answering

Yijie Zhu,Haojie Zhou,Wanting Hong,Tailin Liu,Ning Wang

Main category: cs.CL

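TL;DR: The paper proposes REAP, which improves multi-hop question answering through recursive evaluation and adaptive planning, addressing the lack of global planning and the under-use of clues in existing RAG methods.

Motivation: Existing RAG methods lack global planning on multi-hop reasoning tasks and can get stuck in local reasoning impasses; they also under-exploit retrieved content, which leads to inaccurate reasoning outcomes.

Contribution: 1. Proposes the REAP framework, whose Sub-task Planner and Fact Extractor modules enable dynamic optimization and a global knowledge representation; 2. Designs a unified task paradigm that supports multi-task fine-tuning; 3. Validates the approach on multiple datasets.

Method: REAP consists of a Sub-task Planner (SP) and a Fact Extractor (FE). SP maintains a global perspective and dynamically optimizes the task-solving trajectory; FE performs fine-grained analysis of retrieved content. Together they build a global knowledge representation.

Result: On multiple multi-hop datasets, REAP significantly outperforms existing RAG methods, in both in-domain and out-of-domain settings.

Insight: Global planning and dynamic trajectory optimization are key to better multi-hop question answering; a unified task paradigm can strengthen performance on data-scarce tasks.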

Abstract: Retrieval-augmented generation (RAG) has been extensively employed to mitigate hallucinations in large language models (LLMs). However, existing methods for multi-hop reasoning tasks often lack global planning, increasing the risk of falling into local reasoning impasses. Insufficient exploitation of retrieved content and the neglect of latent clues fail to ensure the accuracy of reasoning outcomes. To overcome these limitations, we propose Recursive Evaluation and Adaptive Planning (REAP), whose core idea is to explicitly maintain structured sub-tasks and facts related to the current task through the Sub-task Planner (SP) and Fact Extractor (FE) modules. SP maintains a global perspective, guiding the overall reasoning direction and evaluating the task state based on the outcomes of FE, enabling dynamic optimization of the task-solving trajectory. FE performs fine-grained analysis over retrieved content to extract reliable answers and clues. These two modules incrementally enrich a logically coherent representation of global knowledge, enhancing the reliability and the traceability of the reasoning process. Furthermore, we propose a unified task paradigm design that enables effective multi-task fine-tuning, significantly enhancing SP’s performance on complex, data-scarce tasks. We conduct extensive experiments on multiple public multi-hop datasets, and the results demonstrate that our method significantly outperforms existing RAG methods in both in-domain and out-of-domain settings, validating its effectiveness in complex multi-hop reasoning tasks.

[13] NumPert: Numerical Perturbations to Probe Language Models for Veracity Prediction

Peter Røysland Aarnes,Vinay Setty

Main category: cs.CL

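TL;DR: The paper uses numerical perturbations to evaluate large language models on veracity prediction and finds that they are not robust on numerical claims: accuracy drops sharply under some perturbations and generally falls as context grows, although adding perturbed demonstrations to the extended context lets most models recover.

Motivation: Large language models perform well on knowledge-intensive tasks but often struggle with numerical reasoning. The authors carry out a systematic evaluation to expose the limitations and robustness problems of models in numerical veracity prediction.

Contribution: Proposes NumPert, which evaluates robustness using controlled perturbations (such as label-flipping probes); finds that no model is robust across all conditions.

Method: Controlled numerical perturbations, including label-flipping probes, are used to generate test data and measure model performance; the effects of context length and of enriching the context with perturbed demonstrations are also studied.

Result: Even leading models lose up to 62% accuracy under certain perturbations; longer context generally reduces accuracy, but enriching it with perturbed demonstrations substantially recovers performance for most models.

Insight: Numerical reasoning remains a weakness of large language models, especially in terms of robustness; more context does not always help unless it is paired with suitable perturbed demonstrations.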

Abstract: Large language models show strong performance on knowledge intensive tasks such as fact-checking and question answering, yet they often struggle with numerical reasoning. We present a systematic evaluation of state-of-the-art models for veracity prediction on numerical claims and evidence pairs using controlled perturbations, including label-flipping probes, to test robustness. Our results indicate that even leading proprietary systems experience accuracy drops of up to 62% under certain perturbations. No model proves to be robust across all conditions. We further find that increasing context length generally reduces accuracy, but when extended context is enriched with perturbed demonstrations, most models substantially recover. These findings highlight critical limitations in numerical fact-checking and suggest that robustness remains an open challenge for current language models.

[14] Language Drift in Multilingual Retrieval-Augmented Generation: Characterization and Decoding-Time Mitigation

Bo Li,Zhenghua Xu,Rui Xie

Main category: cs.CL

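TL;DR: The paper studies language drift in multilingual retrieval-augmented generation (RAG), finds that it stems from decoder-level collapse rather than comprehension failure, and proposes SCD, a lightweight, training-free decoding strategy, to mitigate it.

Motivation: Multilingual RAG produces unintended language drift when the retrieved evidence is in a different language from the query, and the effect is stronger in reasoning-intensive generation, hurting performance on multilingual tasks.

Contribution: 1) A systematic study of language drift in multilingual RAG; 2) evidence that English acts as a semantic attractor; 3) the lightweight SCD decoding strategy.

Method: Soft Constrained Decoding (SCD) steers generation toward the target language by penalizing non-target-language tokens.

Result: SCD consistently improves language alignment and task performance across multilingual datasets and typologically diverse languages.

Insight: Language drift stems mainly from decoder-level collapse rather than comprehension failure, and English is the strongest source of interference in cross-lingual settings.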

Abstract: Multilingual Retrieval-Augmented Generation (RAG) enables large language models (LLMs) to perform knowledge-intensive tasks in multilingual settings by leveraging retrieved documents as external evidence. However, when the retrieved evidence differs in language from the user query and in-context exemplars, the model often exhibits language drift by generating responses in an unintended language. This phenomenon is especially pronounced during reasoning-intensive decoding, such as Chain-of-Thought (CoT) generation, where intermediate steps introduce further language instability. In this paper, we systematically study output language drift in multilingual RAG across multiple datasets, languages, and LLM backbones. Our controlled experiments reveal that the drift results not from comprehension failure but from decoder-level collapse, where dominant token distributions and high-frequency English patterns dominate the intended generation language. We further observe that English serves as a semantic attractor under cross-lingual conditions, emerging as both the strongest interference source and the most frequent fallback language. To mitigate this, we propose Soft Constrained Decoding (SCD), a lightweight, training-free decoding strategy that gently steers generation toward the target language by penalizing non-target-language tokens. SCD is model-agnostic and can be applied to any generation algorithm without modifying the architecture or requiring additional data. Experiments across three multilingual datasets and multiple typologically diverse languages show that SCD consistently improves language alignment and task performance, providing an effective and generalizable solution in multilingual RAG.
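
The idea of penalizing non-target-language tokens can be pictured as subtracting a constant from their logits before the next token is chosen. The sketch below is a toy illustration under stated assumptions, not the paper's implementation: the function name, the fixed `penalty` value, the dictionary of logits, and the hand-written `target_vocab` set (standing in for a real token-level language check) are all hypothetical.

```python
def soft_constrained_logits(logits, is_target_language, penalty=2.0):
    """Subtract a fixed penalty from the logit of every non-target-language token."""
    return {
        token: score if is_target_language(token) else score - penalty
        for token, score in logits.items()
    }

# Toy next-token logits (hypothetical); the intended output language is German.
logits = {"capital": 2.2, "is": 1.0, "Hauptstadt": 1.5}
target_vocab = {"Hauptstadt", "ist"}

adjusted = soft_constrained_logits(logits, lambda tok: tok in target_vocab)
before = max(logits, key=logits.get)      # greedy choice without the penalty
after = max(adjusted, key=adjusted.get)   # greedy choice with the penalty
```

Because the penalty is soft rather than a hard mask, a non-target token with a sufficiently high logit can still be generated, which matters for names, numbers, and code-switched terms.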

[15] PustakAI: Curriculum-Aligned and Interactive Textbooks Using Large Language Models

Shivam Sharma,Riya Naik,Tejas Gawas,Heramb Patil,Kunal Korgaonkar

Main category: cs.CL

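TL;DR: The paper presents PustakAI, a framework for designing and evaluating NCERT-QA, a question-answering dataset aligned with India's NCERT curriculum (English and Science, grades 6 to 8), and uses several prompting techniques and evaluation metrics to analyze the suitability and limitations of open-source and high-end large language models in formal education.

Motivation: Large language models (LLMs) have great potential in education, especially for providing personalized and interactive learning in regions with limited teaching resources. However, adapting them to curriculum-specific content, such as India's NCERT syllabus, raises challenges of accuracy, alignment, and pedagogical relevance.

Contribution: (1) The PustakAI framework; (2) NCERT-QA, a dataset aligned with the NCERT curriculum; (3) an analysis of several prompting techniques and of the educational suitability of open-source and high-end LLMs.

Method: (1) Design the NCERT-QA dataset and classify its QA pairs (Factoid, Inferential, and Others); (2) evaluate models with meta-prompt, few-shot, and chain-of-thought (CoT) prompting; (3) compare open-source models (such as Gemma3:1b) with high-end models (such as Deepseek-r1-70B).

Result: Some prompting techniques (such as CoT) fit the curriculum's demands better, and high-end models perform better in educational settings, while open-source models remain practical in resource-limited environments.

Insight: The paper highlights the potential and challenges of LLMs in education, the importance of prompting technique for curriculum alignment, and the trade-offs between models of different sizes in resource-constrained settings.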

Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in understanding and generating human-like content. This has revolutionized various sectors such as healthcare, software development, and education. In education, LLMs offer potential for personalized and interactive learning experiences, especially in regions with limited teaching resources. However, adapting these models effectively to curriculum-specific content, such as the National Council of Educational Research and Training (NCERT) syllabus in India, presents unique challenges in terms of accuracy, alignment, and pedagogical relevance. In this paper, we present the framework “PustakAI” (Pustak means “book” in many Indian languages) for the design and evaluation of a novel question-answering dataset “NCERT-QA” aligned with the NCERT curriculum for English and Science subjects of grades 6 to 8. We classify the curated QA pairs as Factoid, Inferential, and Others (evaluative and reasoning). We evaluate the dataset with various prompting techniques, such as meta-prompt, few-shot, and CoT-style prompting, using diverse evaluation metrics to understand which approach aligns more efficiently with the structure and demands of the curriculum. Along with the usability of the dataset, we analyze the strengths and limitations of current open-source LLMs (Gemma3:1b, Llama3.2:3b, and Nemotron-mini:4b) and high-end LLMs (Llama-4-Scout-17B and Deepseek-r1-70B) as AI-based learning tools in formal education systems.

[16] ScaleFormer: Span Representation Cumulation for Long-Context Transformer

Jiangshu Du,Wenpeng Yin,Philip Yu

Main category: cs.CL

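TL;DR: ScaleFormer is a plug-and-play framework that adapts pre-trained models to long-context tasks through chunking and a context cumulation mechanism, without architectural changes or training from scratch.

Motivation: The quadratic complexity of standard self-attention limits Transformers on long-context tasks, and existing efficient variants usually require architectural changes and pre-training from scratch. ScaleFormer offers a solution that leaves the architecture unchanged.

Contribution: Proposes a novel parameter-free fusion mechanism that uses cumulative context vectors to give chunk representations structural awareness, achieving linear complexity and effective reasoning over long text.

Method: ScaleFormer splits long inputs into overlapping chunks and produces a compressed, context-aware representation, enriching each chunk's boundary representations with cumulative context vectors from the other chunks.

Result: On long-document summarization, ScaleFormer is highly competitive with and often outperforms state-of-the-art approaches, without architectural modifications or external retrieval mechanisms.

Insight: A simple chunking and context cumulation strategy can substantially improve a pre-trained model's handling of long text, which points to the importance of structural awareness in long-context tasks.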

Abstract: The quadratic complexity of standard self-attention severely limits the application of Transformer-based models to long-context tasks. While efficient Transformer variants exist, they often require architectural changes and costly pre-training from scratch. To circumvent this, we propose ScaleFormer (Span Representation Cumulation for Long-Context Transformer), a simple and effective plug-and-play framework that adapts off-the-shelf pre-trained encoder-decoder models to process long sequences without requiring architectural modifications. Our approach segments long inputs into overlapping chunks and generates a compressed, context-aware representation for the decoder. The core of our method is a novel, parameter-free fusion mechanism that endows each chunk’s representation with structural awareness of its position within the document. It achieves this by enriching each chunk’s boundary representations with cumulative context vectors from all preceding and succeeding chunks. This strategy provides the model with a strong signal of the document’s narrative flow, achieves linear complexity, and enables pre-trained models to reason effectively over long-form text. Experiments on long-document summarization show that our method is highly competitive with and often outperforms state-of-the-art approaches without requiring architectural modifications or external retrieval mechanisms.
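
The two ingredients named in the abstract, overlapping chunks and cumulative context from preceding and succeeding chunks, can be sketched with plain lists. This is only a guess at the general shape of such a mechanism: the abstract does not give the exact fusion rule, so the use of running sums over one summary vector per chunk, and all function names here, are assumptions. The point of the sketch is that both prefix and suffix contexts are computed in time linear in the number of chunks.

```python
def overlapping_chunks(tokens, size, overlap):
    """Split tokens into windows of `size` that share `overlap` tokens with the next window."""
    step = size - overlap
    last_start = max(len(tokens) - overlap, 1)
    return [tokens[i:i + size] for i in range(0, last_start, step)]

def cumulative_context(summaries):
    """For each chunk, sum the summary vectors of all preceding and all succeeding chunks."""
    n, dim = len(summaries), len(summaries[0])
    prefix = [[0.0] * dim for _ in range(n)]
    suffix = [[0.0] * dim for _ in range(n)]
    for i in range(1, n):
        # Context before chunk i = context before chunk i-1 plus chunk i-1 itself.
        prefix[i] = [p + s for p, s in zip(prefix[i - 1], summaries[i - 1])]
    for i in range(n - 2, -1, -1):
        # Context after chunk i = context after chunk i+1 plus chunk i+1 itself.
        suffix[i] = [q + s for q, s in zip(suffix[i + 1], summaries[i + 1])]
    return prefix, suffix

chunks = overlapping_chunks(list(range(10)), size=4, overlap=2)
# One toy summary vector per chunk (hypothetical); a real system would use encoder states.
prefix, suffix = cumulative_context([[1.0], [2.0], [3.0]])
```

In a real encoder-decoder setup, the prefix and suffix vectors would presumably be combined with each chunk's first and last hidden states rather than kept as separate lists.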

[17] Do Language Models Associate Sound with Meaning? A Multimodal Study of Sound Symbolism

Jinhong Jeong,Sunghyun Lee,Jaeyoung Lee,Seonah Han,Youngjae Yu

Main category: cs.CL

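TL;DR: The paper examines how multimodal large language models (MLLMs) associate sound with meaning, using sound symbolism to probe how models process auditory information, and introduces the LEX-ICON dataset to support the study.

Details Motivation: Sound symbolism refers to non-arbitrary associations between phonetic forms and their meanings. The study asks whether MLLMs capture these associations, as a window into their cross-modal language processing.

Contribution: Introduces LEX-ICON, a dataset of real words from multiple languages and constructed pseudo-words; presents the first large-scale quantitative analysis of MLLMs on sound symbolism, revealing attention patterns on iconic phonemes.

Method: Evaluates MLLMs on up to 25 semantic dimensions (e.g., sharp vs. round) with textual (orthographic and IPA) and auditory inputs, and measures phoneme-level attention fraction scores.

Result: MLLMs' phonetic intuitions align with existing linguistic research, and their attention concentrates on iconic phonemes.

Insight: The study provides an empirical basis at the intersection of AI and cognitive linguistics, indicating that MLLMs can capture sound symbolism and offering a new perspective on cross-modal language understanding.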

Abstract: Sound symbolism is a linguistic concept that refers to non-arbitrary associations between phonetic forms and their meanings. We suggest that this can be a compelling probe into how Multimodal Large Language Models (MLLMs) interpret auditory information in human languages. We investigate MLLMs’ performance on phonetic iconicity across textual (orthographic and IPA) and auditory forms of inputs with up to 25 semantic dimensions (e.g., sharp vs. round), observing models’ layer-wise information processing by measuring phoneme-level attention fraction scores. To this end, we present LEX-ICON, an extensive mimetic word dataset consisting of 8,052 words from four natural languages (English, French, Japanese, and Korean) and 2,930 systematically constructed pseudo-words, annotated with semantic features applied across both text and audio modalities. Our key findings demonstrate (1) MLLMs’ phonetic intuitions that align with existing linguistic research across multiple semantic dimensions and (2) phonosemantic attention patterns that highlight models’ focus on iconic phonemes. These results bridge domains of artificial intelligence and cognitive linguistics, providing the first large-scale, quantitative analyses of phonetic iconicity in terms of MLLMs’ interpretability.

[18] Format Matters: The Robustness of Multimodal LLMs in Reviewing Evidence from Tables and Charts

Xanh Ho,Yun-Ang Wu,Sunisth Kumar,Florian Boudin,Atsuhiro Takasu,Akiko Aizawa

Main category: cs.CL

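TL;DR: The paper studies how robust multimodal LLMs are when verifying scientific claims against tables versus charts. Models perform better with table evidence and worse with chart evidence, and small models (under 8B) show limited cross-modal generalization.

Details Motivation: As the number of scientific papers grows, reviewers need systems that help verify research claims. Experimental results are often presented as tables or charts, but the robustness of current multimodal LLMs across evidence formats remains unclear.

Contribution: Designs experiments evaluating 12 multimodal LLMs on table and chart evidence, finds that models are weaker on charts and that small models generalize poorly across modalities, and recommends that future models improve chart understanding.

Method: Adapts two existing datasets of scientific papers by adding the structure and annotations needed for a multimodal claim verification task, evaluates the models, and runs a human evaluation for comparison.

Result: Models do better on table-based evidence and struggle on chart-based evidence; humans perform well on both; small models (under 8B) show limited cross-modal generalization.

Insight: Current multimodal LLMs have a clear weakness in chart understanding, which future work should target in order to support scientific claim verification.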

Abstract: With the growing number of submitted scientific papers, there is an increasing demand for systems that can assist reviewers in evaluating research claims. Experimental results are a core component of scientific work, often presented in varying formats such as tables or charts. Understanding how robust current multimodal large language models (multimodal LLMs) are at verifying scientific claims across different evidence formats remains an important and underexplored challenge. In this paper, we design and conduct a series of experiments to assess the ability of multimodal LLMs to verify scientific claims using both tables and charts as evidence. To enable this evaluation, we adapt two existing datasets of scientific papers by incorporating annotations and structures necessary for a multimodal claim verification task. Using this adapted dataset, we evaluate 12 multimodal LLMs and find that current models perform better with table-based evidence while struggling with chart-based evidence. We further conduct human evaluations and observe that humans maintain strong performance across both formats, unlike the models. Our analysis also reveals that smaller multimodal LLMs (under 8B) show weak correlation in performance between table-based and chart-based tasks, indicating limited cross-modal generalization. These findings highlight a critical gap in current models’ multimodal reasoning capabilities. We suggest that future multimodal LLMs should place greater emphasis on improving chart understanding to better support scientific claim verification.

[19] ELYADATA & LIA at NADI 2025: ASR and ADI Subtasks

Haroun Elleuch,Youssef Saidi,Salima Mdhaffar,Yannick Estève,Fethi Bougares

Main category: cs.CL

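TL;DR: The paper describes Elyadata and LIA's joint submission to NADI 2025 (multi-dialectal Arabic speech processing), with strong results on the Spoken Arabic Dialect Identification (ADI) and multi-dialectal Arabic automatic speech recognition (ASR) subtasks.

Details Motivation: Arabic has many dialects, which makes speech processing challenging and calls for effective systems that identify and transcribe multi-dialectal Arabic.

Contribution: Adapts pre-trained models to both subtasks, ranking first on ADI and second on ASR, and demonstrates the effectiveness of large pre-trained models for Arabic speech processing.

Method: For ADI, a fine-tuned Whisper-large-v3 encoder with data augmentation; for ASR, SeamlessM4T-v2 Large fine-tuned separately for each of the eight dialects.

Result: 79.83% accuracy on the official ADI test set; an average WER of 38.54% and CER of 14.53% on the ASR test set.

Insight: Task-specific data augmentation and per-dialect fine-tuning are key to improving Arabic speech processing performance.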

Abstract: This paper describes Elyadata & LIA’s joint submission to the NADI multi-dialectal Arabic Speech Processing 2025. We participated in the Spoken Arabic Dialect Identification (ADI) and multi-dialectal Arabic ASR subtasks. Our submission ranked first for the ADI subtask and second for the multi-dialectal Arabic ASR subtask among all participants. Our ADI system is a fine-tuned Whisper-large-v3 encoder with data augmentation. This system obtained the highest ADI accuracy score of 79.83% on the official test set. For multi-dialectal Arabic ASR, we fine-tuned SeamlessM4T-v2 Large (Egyptian variant) separately for each of the eight considered dialects. Overall, we obtained an average WER and CER of 38.54% and 14.53%, respectively, on the test set. Our results demonstrate the effectiveness of large pre-trained speech models with targeted fine-tuning for Arabic speech processing.
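
For readers unfamiliar with the two ASR metrics reported above, the following is a minimal sketch of how word error rate (WER) and character error rate (CER) are commonly computed: the Levenshtein edit distance between reference and hypothesis, divided by the reference length. The function names are illustrative, and the shared task's official scorer may normalize text differently (for example in its handling of whitespace, punctuation, or diacritics), so this is not a reimplementation of it.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance divided by reference length.

    Spaces are counted as characters here, which is an assumption of this sketch.
    """
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

Because insertions count as errors, both rates can exceed 100% when the hypothesis is much longer than the reference.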

[20] On the Military Applications of Large Language Models

Satu Johansson,Taneli Riihonen

Main category: cs.CL

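TL;DR: The paper explores military applications of large language models such as GPT, analyzing how their summarization and generation capabilities directly support many applications and assessing the feasibility of building them on commercial cloud services such as Microsoft Azure.

Details Motivation: With the rapid development of generative pre-trained models such as ChatGPT, the authors explore their potential military applications, with a view to improving task efficiency and decision support.

Contribution: 1. Examines what a GPT model discloses about its knowledge of military applications; 2. Assesses the feasibility of building such applications on commercial cloud services; 3. Summarizes the direct military uses of language-model properties such as summarization and generation.

Method: 1. Interrogates a GPT-based model (Microsoft Copilot) to reveal its knowledge of military applications; 2. Analyzes the suitability of commercial cloud services (Microsoft Azure).

Result: The summarization and generative properties of language models directly facilitate many military applications, and other features may find particular uses.

Insight: Commercial cloud services and existing language-model technology already have the potential to support military applications, but their reliability and security need further validation.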

Abstract: In this paper, military use cases or applications and implementation thereof are considered for natural language processing and large language models, which have broken into fame with the invention of the generative pre-trained transformer (GPT) and the extensive foundation model pretraining done by OpenAI for ChatGPT and others. First, we interrogate a GPT-based language model (viz. Microsoft Copilot) to make it reveal its own knowledge about their potential military applications and then critically assess the information. Second, we study how commercial cloud services (viz. Microsoft Azure) could be used readily to build such applications and assess which of them are feasible. We conclude that the summarization and generative properties of language models directly facilitate many applications at large and other features may find particular uses.

[21] Generalizing to Unseen Disaster Events: A Causal View

Philipp Seeberger,Steffen Freisinger,Tobias Bocklet,Korbinian Riedhammer

Main category: cs.CL

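TL;DR: The paper addresses bias in disaster event classification from a causal perspective, proposing a method that reduces event- and domain-related biases and improves generalization to future events.

Details Motivation: Existing systems for processing disaster event data are exposed to event-related biases, which hurts generalization. Advances in causal learning and debiasing are promising but remain underexplored in the disaster event domain.

Contribution: Proposes a causal-perspective method that reduces event- and domain-related biases and improves generalization to future disaster events.

Method: Uses a causal learning framework to identify and reduce event- and domain-related biases, improving generalization through changes to model training.

Result: The method outperforms multiple baselines by up to +1.9% F1 across three disaster classification tasks and significantly improves a PLM-based classifier.

Insight: A causal perspective offers a new way to handle disaster event data, reducing bias and improving generalization, especially on emerging events.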

Abstract: Due to the rapid growth of social media platforms, these tools have become essential for monitoring information during ongoing disaster events. However, extracting valuable insights requires real-time processing of vast amounts of data. A major challenge in existing systems is their exposure to event-related biases, which negatively affects their ability to generalize to emerging events. While recent advancements in debiasing and causal learning offer promising solutions, they remain underexplored in the disaster event domain. In this work, we approach bias mitigation through a causal lens and propose a method to reduce event- and domain-related biases, enhancing generalization to future events. Our approach outperforms multiple baselines by up to +1.9% F1 and significantly improves a PLM-based classifier across three disaster classification tasks.

[22] Beyond the Black Box: Demystifying Multi-Turn LLM Reasoning with VISTA

Yiran Zhang,Mingyang Lin,Mark Dras,Usman Naseem

Main category: cs.CL

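TL;DR: The paper presents VISTA, a web-based visual interactive system for analyzing and visualizing the reasoning process in multi-turn LLM reasoning tasks, with support for editing context and automatically generating dependency trees.

Details Motivation: LLM reasoning in multi-turn interactions is complex and lacks specialized visualization tools, which places a high cognitive load on researchers; a tool is needed to make the analysis more transparent and simpler.

Contribution: 1. Develops the VISTA platform for visual, interactive analysis of multi-turn LLM reasoning; 2. Supports context modification and automatic generation of reasoning dependency trees; 3. Provides an open-source framework that integrates custom benchmarks and local models.

Method: 1. Designs a web-based visualization system; 2. Implements interactive editing of conversation histories and automatic dependency-tree parsing; 3. Provides a unified interactive framework.

Result: VISTA significantly reduces the complexity of analyzing reasoning chains and supports a deeper understanding of LLM capabilities and limitations.

Insight: Visualization tools make LLM reasoning easier to inspect and give a transparent view of the model's step-by-step logical path.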

Abstract: Recent research has increasingly focused on the reasoning capabilities of Large Language Models (LLMs) in multi-turn interactions, as these scenarios more closely mirror real-world problem-solving. However, analyzing the intricate reasoning processes within these interactions presents a significant challenge due to complex contextual dependencies and a lack of specialized visualization tools, leading to a high cognitive load for researchers. To address this gap, we present VISTA, a web-based Visual Interactive System for Textual Analytics in multi-turn reasoning tasks. VISTA allows users to visualize the influence of context on model decisions and interactively modify conversation histories to conduct “what-if” analyses across different models. Furthermore, the platform can automatically parse a session and generate a reasoning dependency tree, offering a transparent view of the model’s step-by-step logical path. By providing a unified and interactive framework, VISTA significantly reduces the complexity of analyzing reasoning chains, thereby facilitating a deeper understanding of the capabilities and limitations of current LLMs. The platform is open-source and supports easy integration of custom benchmarks and local models.

[23] Text2SQL-Flow: A Robust SQL-Aware Data Augmentation Framework for Text-to-SQL

Qifeng Cai,Hao Liang,Chang Xu,Tao Xie,Wentao Zhang,Bin Cui

Main category: cs.CL

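TL;DR: Text2SQL-Flow is a SQL-aware data augmentation framework that generates high-quality, large-scale Text-to-SQL pairs across six augmentation dimensions and is used to build the SQLFlow dataset, which improves model performance.

Details Motivation: Text-to-SQL performance is limited by scarce, simplistic, and low-diversity data.

Contribution: Proposes the Text2SQL-Flow framework for generating high-quality Text-to-SQL data; builds the SQLFlow dataset; introduces a masked alignment retrieval method.

Method: Applies a six-dimension augmentation strategy within a pipeline that includes SQL execution verification and natural language question generation; designs a modular Database Manager; introduces masked alignment retrieval.

Result: SQLFlow contains 89,544 annotated examples; it improves both open-source and closed-source LLMs; the retrieval strategy outperforms existing methods.

Insight: High-quality structured data is critical for Text-to-SQL systems, and data diversity improves model generalization.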

Abstract: The data-centric paradigm has become pivotal in AI, especially for Text-to-SQL, where performance is limited by scarce, simplistic, and low-diversity datasets. To address this, we propose Text2SQL-Flow, a SQL-aware data augmentation framework that generates large-scale, semantically valid, and structurally diverse Text-to-SQL pairs from minimal seed data. It operates across six augmentation dimensions and integrates an end-to-end pipeline featuring SQL execution verification, natural language question generation, chain-of-thought reasoning traces, and data classification. A modular Database Manager ensures cross-database compatibility and scalability. Using this framework, we build SQLFlow, a high-quality dataset of 89,544 annotated examples. We evaluate SQLFlow in two settings: (1) For open-source LLMs, fine-tuning on SQLFlow consistently improves performance across benchmarks under the same data budget. (2) For closed-source LLMs, we introduce a masked alignment retrieval method that treats SQLFlow as both knowledge base and training data for the retriever. This enables structure-aware example matching by modeling fine-grained alignments between questions and SQL queries. Experiments show our retrieval strategy outperforms existing methods, underscoring the value of SQLFlow’s high-fidelity data and our novel technique. Our work establishes a scalable, data-centric foundation for advancing Text-to-SQL systems and highlights the critical role of high-quality structured data in modern AI.

[24] EffiReason-Bench: A Unified Benchmark for Evaluating and Advancing Efficient Reasoning in Large Language Models

Junquan Huang,Haotian Wu,Yubo Gao,Yibo Yan,Junyan Zhang,Yonghua Hei,Song Dai,Jie Zhang,Puay Siew Tan,Xuming Hu

Main category: cs.CL

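TL;DR: EffiReason-Bench is a unified benchmark for evaluating efficient reasoning methods in large language models (LLMs), addressing the fragmentation of current evaluation practice.

Details Motivation: LLMs with CoT prompting reason well but produce long explanations that raise cost and can reduce accuracy, and there is no unified framework for evaluating efficiency.

Contribution: Introduces the EffiReason-Bench benchmark and the E3-Score metric, supports cross-paradigm comparison, and enables step-by-step evaluation through verified CoT annotations.

Method: Builds a pipeline for verified CoT annotations and evaluates 7 methods on 6 open-source LLMs across mathematics, commonsense, and logic tasks.

Result: No single method is best in every setting; the best strategy depends on model scale, task complexity, and architecture.

Insight: The effectiveness of efficient reasoning methods depends heavily on the setting, so a unified benchmark and a stable metric are essential for comparing methods.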

Abstract: Large language models (LLMs) with Chain-of-Thought (CoT) prompting achieve strong reasoning but often produce unnecessarily long explanations, increasing cost and sometimes reducing accuracy. Fair comparison of efficiency-oriented approaches is hindered by fragmented evaluation practices. We introduce EffiReason-Bench, a unified benchmark for rigorous cross-paradigm evaluation of efficient reasoning methods across three categories: Reasoning Blueprints, Dynamic Execution, and Post-hoc Refinement. To enable step-by-step evaluation, we construct verified CoT annotations for CommonsenseQA and LogiQA via a pipeline that enforces standardized reasoning structures, comprehensive option-wise analysis, and human verification. We evaluate 7 methods across 6 open-source LLMs (1B-70B) on 4 datasets spanning mathematics, commonsense, and logic, and propose the E3-Score, a principled metric inspired by economic trade-off modeling that provides smooth, stable evaluation without discontinuities or heavy reliance on heuristics. Experiments show that no single method universally dominates; optimal strategies depend on backbone scale, task complexity, and architecture.

[25] Rectify Evaluation Preference: Improving LLMs’ Critique on Math Reasoning via Perplexity-aware Reinforcement Learning

Changyuan Tian,Zhicong Lu,Shuang Qian,Nayu Liu,Peiguang Li,Li Jin,Leiyi Hu,Zhizhao Zeng,Sirui Wang,Ke Zeng,Zhi Guo

Main category: cs.CL

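TL;DR: The paper uses a perplexity-aware reinforcement learning algorithm to address LLMs' weak critiquing ability in multi-step mathematical reasoning (MsMR); it identifies an imbalanced evaluation preference in LLMs and proposes a method to rectify it.

Details Motivation: Existing methods rely on high-quality supervised fine-tuning to strengthen critiquing and overlook the root cause of the poor performance. The paper finds that LLMs have an imbalanced evaluation preference (they tend to judge lower-perplexity solutions as correct), which limits their critiquing ability.

Contribution: 1. Builds the One-to-many Problem-Solution (OPS) benchmark to quantify how LLMs behave when evaluating their own solutions versus those of others. 2. Identifies and analyzes the imbalanced evaluation preference. 3. Proposes a perplexity-aware reinforcement learning algorithm that rectifies this preference through policy optimization.

Method: 1. Uses the OPS benchmark to quantify differences in critiquing behavior. 2. Runs a statistical analysis based on perplexity, which reveals the imbalanced evaluation preference. 3. Designs a perplexity-aware variant of Group Relative Policy Optimization that guides LLMs to evaluate both high-perplexity and low-perplexity solutions correctly.

Result: Experiments on OPS and on existing critic benchmarks show that the method significantly improves LLMs' critiquing ability.

Insight: 1. LLMs show a perplexity-driven bias when evaluating solutions. 2. Directly optimizing against this bias can significantly improve critiquing performance. 3. Perplexity can serve as a useful signal for guiding policy optimization in reinforcement learning.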

Abstract: To improve Multi-step Mathematical Reasoning (MsMR) of Large Language Models (LLMs), it is crucial to obtain scalable supervision from the corpus by automatically critiquing mistakes in the reasoning process of MsMR and rendering a final verdict on the problem-solution. Most existing methods rely on crafting high-quality supervised fine-tuning demonstrations to enhance critiquing capability and pay little attention to the underlying reason for the poor critiquing performance of LLMs. In this paper, we orthogonally quantify and investigate the potential reason, imbalanced evaluation preference, and conduct a statistical preference analysis. Motivated by this analysis, a novel perplexity-aware reinforcement learning algorithm is proposed to rectify the evaluation preference, elevating the critiquing capability. Specifically, to probe LLMs’ critiquing characteristics, a One-to-many Problem-Solution (OPS) benchmark is meticulously constructed to quantify the behavior difference of LLMs when evaluating problem solutions generated by themselves and by others. Then, to investigate the behavior difference in depth, we conduct a statistical preference analysis oriented on perplexity and find an intriguing phenomenon, “LLMs tend to judge solutions with lower perplexity as correct”, which we dub imbalanced evaluation preference. To rectify this preference, we regard perplexity as the baton in the algorithm of Group Relative Policy Optimization, supporting the LLMs to explore trajectories that judge lower perplexity as wrong and higher perplexity as correct. Extensive experimental results on our OPS benchmark and existing critic benchmarks demonstrate the validity of our method.
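
Two quantities mentioned in the abstract have standard definitions that are easy to state in code: the perplexity of a sequence, computed from per-token log-probabilities, and the group-relative advantage used in Group Relative Policy Optimization, where each sampled response's reward is standardized within its group. The sketch below shows only these two textbook pieces; how the authors feed perplexity into the policy update is not described in the abstract and is deliberately not reproduced. The function names are illustrative.

```python
import math
from statistics import mean, pstdev


def perplexity(token_logprobs):
    """Perplexity of a sequence from natural-log token probabilities: exp(-mean log p)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def group_relative_advantages(rewards):
    """Standardize rewards within one sampled group: (r - mean) / std.

    If every reward in the group is identical there is no learning signal,
    so all advantages are set to zero (a convention assumed in this sketch).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0:
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]
```

For instance, a solution whose tokens each receive probability 0.5 has perplexity 2, and a group of critiques rewarded `[1, 0, 1, 0]` yields advantages `[1, -1, 1, -1]`.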

[26] BhashaKritika: Building Synthetic Pretraining Data at Scale for Indic Languages

Guduru Manoj,Neel Prabhanjan Rachamalla,Ashish Kulkarni,Gautam Rajeev,Jay Piplodiya,Arul Menezes,Shaharukh Khan,Souvik Rana,Manya Sah,Chandra Khatri,Shubham Agarwal

Main category: cs.CL

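TL;DR: The paper systematically studies how to build large-scale synthetic pretraining data for Indic languages, presents the BhashaKritika dataset (540B tokens), and examines generation techniques, language choice, and evaluation methods.

Details Motivation: Low-resource languages lack LLM pretraining data, so the benefits of LLMs are unevenly distributed across languages; synthetic data offers a way to address this.

Contribution: Constructs BhashaKritika, a synthetic pretraining dataset of 540B tokens for 10 Indic languages; studies the effect of grounding generation in documents, personas, and topics, and of language choice; introduces a modular quality evaluation pipeline.

Method: Generates synthetic data for 10 languages using 5 techniques, with quality control through a modular evaluation pipeline (e.g., language detection, repetition analysis, and perplexity-based filtering).

Result: Experiments reveal key trade-offs among generation strategies and yield best practices for building multilingual corpora.

Insight: Language choice (in prompt instructions and in document grounding) affects synthetic data quality, and the evaluation pipeline must be designed for diverse scripts and linguistic contexts.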

Abstract: In the context of pretraining of Large Language Models (LLMs), synthetic data has emerged as an alternative for generating high-quality pretraining data at scale. This is particularly beneficial in low-resource language settings where the benefits of recent LLMs have been unevenly distributed across languages. In this work, we present a systematic study on the generation and evaluation of synthetic multilingual pretraining data for Indic languages, where we construct a large-scale synthetic dataset BhashaKritika, comprising 540B tokens using 5 different techniques for 10 languages. We explore the impact of grounding generation in documents, personas, and topics. We analyze how language choice, both in the prompt instructions and document grounding, affects data quality, and we compare translations of English content with native generation in Indic languages. To support scalable and language-sensitive evaluation, we introduce a modular quality evaluation pipeline that integrates script and language detection, metadata consistency checks, n-gram repetition analysis, and perplexity-based filtering using KenLM models. Our framework enables robust quality control across diverse scripts and linguistic contexts. Empirical results through model runs reveal key trade-offs in generation strategies and highlight best practices for constructing effective multilingual corpora.
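
One stage of the quality pipeline, n-gram repetition analysis, can be illustrated with a small sketch. The metric below (the share of n-gram positions occupied by an n-gram that occurs more than once) is a common heuristic for spotting degenerate, looping generations; it is an assumption for illustration, not the paper's actual statistic, and the name `repeated_ngram_fraction` is hypothetical. Whitespace tokenization is also a simplification for Indic scripts.

```python
from collections import Counter


def repeated_ngram_fraction(tokens, n):
    """Fraction of n-gram positions whose n-gram appears more than once in the document."""
    if n <= 0:
        raise ValueError("n must be positive")
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0  # document shorter than n tokens: nothing to repeat
    counts = Counter(ngrams)
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / len(ngrams)
```

A filter built on this would drop documents whose fraction exceeds some threshold; that threshold would need to be tuned per language and is not given in the abstract.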

[27] Knowledge Graphs Generation from Cultural Heritage Texts: Combining LLMs and Ontological Engineering for Scholarly Debates

Andrea Schimmenti,Valentina Pasqual,Fabio Vitali,Marieke van Erp

Main category: cs.CL

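TL;DR: The paper proposes ATR4CH, a systematic methodology that combines LLMs and ontological engineering to generate knowledge graphs from Cultural Heritage texts, validated through a case study.

Details Motivation: Cultural Heritage texts contain rich knowledge that is hard to query systematically because converting unstructured text into structured knowledge graphs is difficult. The methodology targets this problem.

Contribution: The five-step ATR4CH methodology, described as the first systematic approach to coordinating LLM-based extraction with Cultural Heritage ontologies.

Method: ATR4CH combines annotation models, ontological frameworks, and LLM-based extraction in five steps: foundational analysis, annotation schema development, pipeline architecture, integration refinement, and comprehensive evaluation. Experiments use three LLMs (Claude Sonnet 3.7, Llama 3.3 70B, GPT-4o-mini).

Result: F1 of 0.96-0.99 for metadata extraction, 0.7-0.8 for entity recognition, 0.65-0.75 for hypothesis extraction, and 0.95-0.97 for evidence extraction, with a G-EVAL score of 0.62 for discourse representation; smaller models performed competitively.

Insight: ATR4CH offers a replicable framework adaptable across Cultural Heritage domains and institutional resources. The results are encouraging, but post-processing still needs human oversight and the produced KG is limited to Wikipedia articles.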

Abstract: Cultural Heritage texts contain rich knowledge that is difficult to query systematically due to the challenges of converting unstructured discourse into structured Knowledge Graphs (KGs). This paper introduces ATR4CH (Adaptive Text-to-RDF for Cultural Heritage), a systematic five-step methodology for Large Language Model-based Knowledge Extraction from Cultural Heritage documents. We validate the methodology through a case study on authenticity assessment debates. Methodology - ATR4CH combines annotation models, ontological frameworks, and LLM-based extraction through iterative development: foundational analysis, annotation schema development, pipeline architecture, integration refinement, and comprehensive evaluation. We demonstrate the approach using Wikipedia articles about disputed items (documents, artifacts…), implementing a sequential pipeline with three LLMs (Claude Sonnet 3.7, Llama 3.3 70B, GPT-4o-mini). Findings - The methodology successfully extracts complex Cultural Heritage knowledge: 0.96-0.99 F1 for metadata extraction, 0.7-0.8 F1 for entity recognition, 0.65-0.75 F1 for hypothesis extraction, 0.95-0.97 for evidence extraction, and 0.62 G-EVAL for discourse representation. Smaller models performed competitively, enabling cost-effective deployment. Originality - This is the first systematic methodology for coordinating LLM-based extraction with Cultural Heritage ontologies. ATR4CH provides a replicable framework adaptable across CH domains and institutional resources. Research Limitations - The produced KG is limited to Wikipedia articles. While the results are encouraging, human oversight is necessary during post-processing. Practical Implications - ATR4CH enables Cultural Heritage institutions to systematically convert textual knowledge into queryable KGs, supporting automated metadata enrichment and knowledge discovery.

[28] Position: On the Methodological Pitfalls of Evaluating Base LLMs for Reasoning

Jason Chan,Zhixue Zhao,Robert Gaizauskas

Main category: cs.CL

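TL;DR: This position paper argues that evaluating the reasoning capabilities of base large language models (LLMs) raises inherent methodological problems, pointing to an overlooked mismatch between the pretraining objective and the normative criteria (such as correctness) by which reasoning is assessed.

Details Motivation: Existing work evaluates base LLMs' reasoning to uncover their limitations or biases, but the authors argue that this practice overlooks fundamental methodological problems.

Contribution: Exposes the mismatch between base LLMs' pretraining objective and the criteria used to assess reasoning, and challenges two assumptions implicit in existing work: (1) that base LLMs' outputs can be treated as bona fide attempts at correct answers; (2) that conclusions about base LLMs' reasoning generalize to instruction-tuned LLMs.

Method: Analyzes how base LLMs produce logical conclusions and shows that these conclusions are byproducts of conforming to statistically plausible linguistic patterns rather than the result of genuine reasoning.

Result: Base LLMs' outputs cannot be used directly to assess reasoning ability, because the process that generates their conclusions is disconnected from what reasoning requires.

Insight: The paper calls for a re-examination of the assumptions behind existing evaluations of LLM reasoning and for future work to avoid these methodological pitfalls.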

Abstract: Existing work investigates the reasoning capabilities of large language models (LLMs) to uncover their limitations, human-like biases and underlying processes. Such studies include evaluations of base LLMs (pre-trained on unlabeled corpora only) for this purpose. Our position paper argues that evaluating base LLMs’ reasoning capabilities raises inherent methodological concerns that are overlooked in such existing studies. We highlight the fundamental mismatch between base LLMs’ pretraining objective and normative qualities, such as correctness, by which reasoning is assessed. In particular, we show how base LLMs generate logically valid or invalid conclusions as coincidental byproducts of conforming to purely linguistic patterns of statistical plausibility. This fundamental mismatch challenges the assumptions that (a) base LLMs’ outputs can be assessed as their bona fide attempts at correct answers or conclusions; and (b) conclusions about base LLMs’ reasoning can generalize to post-trained LLMs optimized for successful instruction-following. We call for a critical re-examination of existing work that relies implicitly on these assumptions, and for future work to account for these methodological pitfalls.

[29] Reasoning About Intent for Ambiguous Requests

Irina Saparina,Mirella Lapata

Main category: cs.CL

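TL;DR: To address LLMs' tendency to commit to a single interpretation of an ambiguous request, the paper proposes a single-step structured response containing multiple interpretation-answer pairs, trained with reinforcement learning and customized reward functions to improve coverage and accuracy.

Details Motivation: LLMs often implicitly choose one interpretation of an ambiguous request, which can lead to misunderstandings and safety risks. The paper generates multiple interpretation-answer pairs to improve transparency and accuracy.

Contribution: 1. Proposes a structured response that generates multiple interpretation-answer pairs in a single step; 2. Improves coverage through reinforcement learning with customized reward functions; 3. Shows that the method outperforms baselines on conversational question answering and semantic parsing.

Method: 1. Trains models to generate multiple interpretation-answer pairs; 2. Optimizes them with reinforcement learning and customized reward functions; 3. Uses a single generation step for efficiency.

Result: The method achieves higher coverage of valid answers than baselines, and human evaluation confirms that predicted interpretations are highly aligned with their answers.

Insight: Structured responses improve transparency and also support downstream applications, while single-step generation keeps the approach efficient.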

Abstract: Large language models often respond to ambiguous requests by implicitly committing to one interpretation. Intent misunderstandings can frustrate users and create safety risks. To address this, we propose generating multiple interpretation-answer pairs in a single structured response to ambiguous requests. Our models are trained with reinforcement learning and customized reward functions using multiple valid answers as supervision. Experiments on conversational question answering and semantic parsing demonstrate that our method achieves higher coverage of valid answers than baseline approaches. Human evaluation confirms that predicted interpretations are highly aligned with their answers. Our approach promotes transparency with explicit interpretations, achieves efficiency by requiring only one generation step, and supports downstream applications through its structured output format.

[30] Exploring State Tracking Capabilities of Large Language Models

Kiamehr Rezaee,Jose Camacho-Collados,Mohammad Taher Pilehvar

Main category: cs.CL

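TL;DR: The paper studies how well large language models (LLMs) perform state tracking, proposing a benchmark and analyzing the capabilities of different models.

Details Motivation: State tracking is a complex task that requires a model to maintain the changing state of multiple entities. The study evaluates LLMs on this task and examines their limitations.

Contribution: Proposes a benchmark focused on state tracking and analyzes recent LLMs such as GPT-4 and Llama3, comparing them with models from the previous generation.

Method: Designs three well-defined state tracking tasks, tests a range of LLMs on them, and evaluates performance with mechanisms such as Chain of Thought.

Result: Recent LLMs (such as GPT-4 and Llama3) track state well, especially with Chain of Thought; previous-generation models often fail after a certain number of steps.

Insight: State tracking ability has improved with newer model generations, and techniques such as Chain of Thought can substantially improve performance.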

Abstract: Large Language Models (LLMs) have demonstrated impressive capabilities in solving complex tasks, including those requiring a certain level of reasoning. In this paper, we focus on state tracking, a problem where models need to keep track of the state governing a number of entities. To isolate the state tracking component from other factors, we propose a benchmark based on three well-defined state tracking tasks and analyse the performance of LLMs in different scenarios. The results indicate that the recent generation of LLMs (specifically, GPT-4 and Llama3) are capable of tracking state, especially when integrated with mechanisms such as Chain of Thought. However, models from the former generation, while understanding the task and being able to solve it at the initial stages, often fail at this task after a certain number of steps.
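
To make the notion of state tracking concrete, here is a small hypothetical task of the kind such a benchmark might contain: items sit in boxes, a sequence of moves is applied, and the model must report the contents after each move. The abstract does not describe the paper's three tasks, so the task, the data layout, and the names `apply_moves` and `first_error_step` are all assumptions made for illustration. The second function locates the first step at which a model's reported states diverge from the simulated ground truth, mirroring the observation that some models fail only after a certain number of steps.

```python
def apply_moves(initial_state, moves):
    """Return box contents after applying (item, source, target) moves in order.

    `initial_state` maps box name -> set of items and is not modified.
    """
    state = {box: set(items) for box, items in initial_state.items()}
    for item, source, target in moves:
        if item not in state.get(source, set()):
            raise ValueError(f"{item!r} is not in {source!r}")
        state[source].remove(item)
        state.setdefault(target, set()).add(item)
    return state


def first_error_step(initial_state, moves, predicted_states):
    """Return the 1-based step where predictions first differ from ground truth, or None.

    `predicted_states[k]` is the model's reported state after move k + 1.
    """
    for step in range(1, len(moves) + 1):
        truth = apply_moves(initial_state, moves[:step])
        if predicted_states[step - 1] != truth:
            return step
    return None
```

Re-simulating from the start at every step is quadratic in the number of moves, which keeps the sketch short; an incremental version would be preferable for long sequences.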

[31] LocalBench: Benchmarking LLMs on County-Level Local Knowledge and Reasoning

Zihan Gao,Yifei Xu,Jacob Thebault-Spieker

Main category: cs.CL

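TL;DR: LocalBench is the first benchmark focused on evaluating large language models (LLMs) on county-level local knowledge and reasoning in the United States. Using 14,782 validated questions across 526 counties, it shows that current LLMs have significant limitations on local knowledge, particularly on numerical reasoning and narrative-style questions.

Details Motivation: Existing benchmarks do not adequately capture how LLMs handle hyper-local knowledge (such as neighborhood dynamics, cultural narratives, and local governance), which matters for real-world applications such as civic platforms and community journalism.

Contribution: Presents LocalBench, the first benchmark to systematically evaluate LLMs on county-level local knowledge and reasoning, drawing on diverse data sources and covering multiple dimensions of local knowledge.

Method: Grounded in the Localness Conceptual Framework, the benchmark integrates Census statistics, local forum discussions, and regional news into 14,782 validated questions across 526 counties.

Result: The best-performing models reach only 56.8% accuracy on narrative-style questions and stay below 15.5% on numerical reasoning. Web augmentation helps some models (Gemini gains 13.6%) but hurts others (the GPT series loses 11.4%).

Insight: Larger model size and web augmentation do not always improve performance, which underscores the need for AI systems that are equitable and place-aware.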

Abstract: Large language models (LLMs) have been widely evaluated on macro-scale geographic tasks, such as global factual recall, event summarization, and regional reasoning. Yet, their ability to handle hyper-local knowledge remains poorly understood. This gap is increasingly consequential as real-world applications, from civic platforms to community journalism, demand AI systems that can reason about neighborhood-specific dynamics, cultural narratives, and local governance. Existing benchmarks fall short in capturing this complexity, often relying on coarse-grained data or isolated references. We present LocalBench, the first benchmark designed to systematically evaluate LLMs on county-level local knowledge across the United States. Grounded in the Localness Conceptual Framework, LocalBench includes 14,782 validated question-answer pairs across 526 U.S. counties in 49 states, integrating diverse sources such as Census statistics, local subreddit discourse, and regional news. It spans physical, cognitive, and relational dimensions of locality. Using LocalBench, we evaluate 13 state-of-the-art LLMs under both closed-book and web-augmented settings. Our findings reveal critical limitations: even the best-performing models reach only 56.8% accuracy on narrative-style questions and perform below 15.5% on numerical reasoning. Moreover, larger model size and web augmentation do not guarantee better performance, for example, search improves Gemini’s accuracy by +13.6%, but reduces GPT-series performance by -11.4%. These results underscore the urgent need for language models that can support equitable, place-aware AI systems: capable of engaging with the diverse, fine-grained realities of local communities across geographic and cultural contexts.

[32] Beyond Elicitation: Provision-based Prompt Optimization for Knowledge-Intensive Tasks

Yunzhe Xu,Zhuosheng Zhang,Zhe Liu

Main category: cs.CL

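TL;DR: The paper proposes KPPO, a knowledge-provision-based prompt optimization framework for knowledge-intensive tasks that improves language model performance through systematic knowledge integration rather than elicitation of latent capabilities.

Details Motivation: Existing prompt optimization methods focus on eliciting model capabilities, which is inherently limited on knowledge-intensive tasks because elicitation cannot supply the required domain knowledge, terminology precision, and reasoning patterns.

Contribution: KPPO introduces three key innovations: 1) a knowledge gap filling mechanism; 2) a batch-wise candidate evaluation approach; 3) an adaptive knowledge pruning strategy. Together they improve performance and reduce token usage.

Method: KPPO integrates knowledge systematically by identifying and remediating knowledge gaps, evaluating candidates batch-wise for both performance and distributional stability, and pruning knowledge adaptively to balance performance and efficiency.

Result: Across 15 knowledge-intensive benchmarks, KPPO improves average performance by about 6% over the strongest baseline, and its pruning reduces token usage by up to 29%.

Insight: Knowledge-intensive tasks call for methods that go beyond conventional prompt optimization; systematically integrating domain knowledge improves model performance more effectively.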

Abstract: While prompt optimization has emerged as a critical technique for enhancing language model performance, existing approaches primarily focus on elicitation-based strategies that search for optimal prompts to activate models’ capabilities. These methods exhibit fundamental limitations when addressing knowledge-intensive tasks, as they operate within fixed parametric boundaries rather than providing the factual knowledge, terminology precision, and reasoning patterns required in specialized domains. To address these limitations, we propose Knowledge-Provision-based Prompt Optimization (KPPO), a framework that reformulates prompt optimization as systematic knowledge integration rather than potential elicitation. KPPO introduces three key innovations: 1) a knowledge gap filling mechanism for knowledge gap identification and targeted remediation; 2) a batch-wise candidate evaluation approach that considers both performance improvement and distributional stability; 3) an adaptive knowledge pruning strategy that balances performance and token efficiency, reducing up to 29% token usage. Extensive evaluation on 15 knowledge-intensive benchmarks from various domains demonstrates KPPO’s superiority over elicitation-based methods, with an average performance improvement of ~6% over the strongest baseline while achieving comparable or lower token consumption. Code at: https://github.com/xyz9911/KPPO.

[33] Rubric-Based Benchmarking and Reinforcement Learning for Advancing LLM Instruction Following

Yun He,Wenzhe Li,Hejia Zhang,Songlin Li,Karishma Mandyam,Sopan Khosla,Yuanhao Xiong,Nanshu Wang,Selina Peng,Beibin Li,Shengjie Bi,Shishir G. Patil,Qi Qi,Shengyu Feng,Julian Katz-Samuels,Richard Yuanzhe Pang,Sujan Gonugondla,Hunter Lang,Yue Yu,Yundi Qian,Maryam Fazel-Zarandi,Licheng Yu,Amine Benhalloum,Hany Awadalla,Manaal Faruqui

Main category: cs.CL

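TL;DR: The paper introduces AdvancedIF, a benchmark of more than 1,600 complex instructions, and proposes RIFL, which uses rubric-based reward signals and improved RL training to substantially improve LLMs' instruction following.

Details Motivation: Current LLMs fall short on complex, multi-turn, and system-level instruction following, and the lack of high-quality benchmarks and reliable reward signals holds back both training and evaluation.

Contribution: 1) Introduces the AdvancedIF benchmark with 1,600+ prompts and expert-curated rubrics; 2) Designs RIFL, which combines rubric generation, rubric verification, and reward shaping to improve instruction following.

Method: RIFL has three parts: 1) generating rubrics as evaluation criteria; 2) fine-tuning a rubric verifier; 3) reinforcement learning with the resulting shaped reward signal.

Result: RIFL achieves a 6.7% absolute gain on AdvancedIF and strong results on public benchmarks. Ablation studies confirm the effectiveness of each component.

Insight: Rubrics are useful both for training LLMs and for measuring their instruction-following ability.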

Abstract: Recent progress in large language models (LLMs) has led to impressive performance on a range of tasks, yet advanced instruction following (IF), especially for complex, multi-turn, and system-prompted instructions, remains a significant challenge. Rigorous evaluation and effective training for such capabilities are hindered by the lack of high-quality, human-annotated benchmarks and reliable, interpretable reward signals. In this work, we introduce AdvancedIF (we will release this benchmark soon), a comprehensive benchmark featuring over 1,600 prompts and expert-curated rubrics that assess LLMs’ ability to follow complex, multi-turn, and system-level instructions. We further propose RIFL (Rubric-based Instruction-Following Learning), a novel post-training pipeline that leverages rubric generation, a finetuned rubric verifier, and reward shaping to enable effective reinforcement learning for instruction following. Extensive experiments demonstrate that RIFL substantially improves the instruction-following abilities of LLMs, achieving a 6.7% absolute gain on AdvancedIF and strong results on public benchmarks. Our ablation studies confirm the effectiveness of each component in RIFL. This work establishes rubrics as a powerful tool for both training and evaluating advanced IF in LLMs, paving the way for more capable and reliable AI systems.

[34] LOCA-R: Near-Perfect Performance on the Chinese Physics Olympiad 2025

Dong-Shan Jian,Xiang Li,Chen-Xu Yan,Hui-Wen Zheng,Zhi-Zhang Bian,You-Le Fang,Sheng-Qi Zhang,Bing-Rui Gong,Ren-Xi He,Jing-Tian Zhang,Ce Meng,Yan-Qing Ma

Main category: cs.CL

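TL;DR: LOCA-R is an improved logical-chain-augmentation reasoning framework applied to complex Chinese Physics Olympiad problems, where it achieves a near-perfect score.

Details Motivation: Olympiad-level physics problem solving is a major challenge for both humans and AI, requiring precise calculation, abstract reasoning, and a solid grasp of physical principles.

Contribution: Proposes LOCA-R, adapted for complex reasoning tasks, and validates it on the Chinese Physics Olympiad.

Method: A logical chain augmentation approach built on the LOCA framework, focused on integrating precise calculation with abstract reasoning.

Result: Scores 313 out of 320 on the CPhO 2025 theory examination, surpassing the highest-scoring human competitor and all baseline methods.

Insight: Logical chain augmentation performs well on complex physics problems, showing the potential of AI on advanced reasoning tasks.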

Abstract: Olympiad-level physics problem-solving presents a significant challenge for both humans and artificial intelligence (AI), as it requires a sophisticated integration of precise calculation, abstract reasoning, and a fundamental grasp of physical principles. The Chinese Physics Olympiad (CPhO), renowned for its complexity and depth, serves as an ideal and rigorous testbed for these advanced capabilities. In this paper, we introduce LOCA-R (LOgical Chain Augmentation for Reasoning), an improved version of the LOCA framework adapted for complex reasoning, and apply it to the CPhO 2025 theory examination. LOCA-R achieves a near-perfect score of 313 out of 320 points, solidly surpassing the highest-scoring human competitor and significantly outperforming all baseline methods.

[35] Convomem Benchmark: Why Your First 150 Conversations Don’t Need RAG

Egor Pakhomov,Erik Nijkamp,Caiming Xiong

Main category: cs.CL

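TL;DR: The paper introduces the Convomem Benchmark, a comprehensive benchmark for conversational memory evaluation, examines how conversational memory relates to retrieval-augmented generation (RAG), and finds that simple full-context approaches outperform RAG systems on small conversation histories.

Details Motivation: Existing benchmarks are limited in statistical power, data generation consistency, and evaluation flexibility. This work addresses those gaps and explores the differences and connections between conversational memory and RAG.

Contribution: Introduces the Convomem Benchmark, with 75,336 question-answer pairs across diverse categories; shows that on small histories (under 150 conversations) simple full-context approaches outperform RAG-based memory systems.

Method: Builds a large-scale conversational memory benchmark and compares simple full-context approaches with RAG-based systems (such as Mem0) in accuracy and cost as the number of conversations grows.

Result: On histories under 150 conversations, simple full-context approaches reach 70-82% accuracy while RAG-based systems reach only 30-45%; as the number of conversations grows, hybrid or RAG approaches become necessary.

Insight: The small-corpus advantage of conversational memory makes exhaustive search and complete reranking feasible, which merits dedicated research rather than direct reuse of general RAG solutions.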

Abstract: We introduce a comprehensive benchmark for conversational memory evaluation containing 75,336 question-answer pairs across diverse categories including user facts, assistant recall, abstention, preferences, temporal changes, and implicit connections. While existing benchmarks have advanced the field, our work addresses fundamental challenges in statistical power, data generation consistency, and evaluation flexibility that limit current memory evaluation frameworks. We examine the relationship between conversational memory and retrieval-augmented generation (RAG). While these systems share fundamental architectural patterns–temporal reasoning, implicit extraction, knowledge updates, and graph representations–memory systems have a unique characteristic: they start from zero and grow progressively with each conversation. This characteristic enables naive approaches that would be impractical for traditional RAG. Consistent with recent findings on long context effectiveness, we observe that simple full-context approaches achieve 70-82% accuracy even on our most challenging multi-message evidence cases, while sophisticated RAG-based memory systems like Mem0 achieve only 30-45% when operating on conversation histories under 150 interactions. Our analysis reveals practical transition points: long context excels for the first 30 conversations, remains viable with manageable trade-offs up to 150 conversations, and typically requires hybrid or RAG approaches beyond that point as costs and latencies become prohibitive. These patterns indicate that the small-corpus advantage of conversational memory–where exhaustive search and complete reranking are feasible–deserves dedicated research attention rather than simply applying general RAG solutions to conversation histories.
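The transition points reported in the abstract can be read as a simple routing rule. The sketch below encodes them directly; the function name and the returned labels are hypothetical, and the thresholds (30 and 150 conversations) are taken from the abstract rather than tuned for any particular deployment.

```python
def choose_memory_strategy(num_conversations: int) -> str:
    """Pick a memory strategy from the size of the conversation history.

    Thresholds follow the abstract: long context excels for the first 30
    conversations, stays viable up to 150, and hybrid or RAG approaches
    are typically needed beyond that.
    """
    if num_conversations < 0:
        raise ValueError("num_conversations must be non-negative")
    if num_conversations <= 30:
        return "full_context"
    if num_conversations <= 150:
        return "full_context_with_tradeoffs"
    return "hybrid_or_rag"
```

A real system would likely also weigh token cost and latency budgets, which the abstract names as the reasons the thresholds exist.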

[36] URaG: Unified Retrieval and Generation in Multimodal LLMs for Efficient Long Document Understanding

Yongxin Shi,Jiapeng Wang,Zeyu Shan,Dezhi Peng,Zening Lin,Lianwen Jin

Main category: cs.CL

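TL;DR: URaG is a framework that unifies retrieval and generation, exploiting the inherent evidence localization ability of MLLMs for efficient long document understanding.

Details Motivation: Current multimodal large language models (MLLMs) face information interference and high computational cost on long documents, and existing approaches do not balance efficiency against preserving fine-grained detail.

Contribution: Proposes URaG, which unifies retrieval and generation within a single MLLM and improves both the efficiency and the accuracy of long document understanding.

Method: Adds a lightweight cross-modal retrieval module that turns the early Transformer layers into an efficient evidence selector, so that deeper layers concentrate on relevant content.

Result: URaG achieves state-of-the-art performance while reducing computational overhead by 44-56%.

Insight: MLLMs show a human-like coarse-to-fine reasoning pattern, which can be exploited explicitly to unify retrieval and generation.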

Abstract: Recent multimodal large language models (MLLMs) still struggle with long document understanding due to two fundamental challenges: information interference from abundant irrelevant content, and the quadratic computational cost of Transformer-based architectures. Existing approaches primarily fall into two categories: token compression, which sacrifices fine-grained details; and introducing external retrievers, which increase system complexity and prevent end-to-end optimization. To address these issues, we conduct an in-depth analysis and observe that MLLMs exhibit a human-like coarse-to-fine reasoning pattern: early Transformer layers attend broadly across the document, while deeper layers focus on relevant evidence pages. Motivated by this insight, we posit that the inherent evidence localization capabilities of MLLMs can be explicitly leveraged to perform retrieval during the reasoning process, facilitating efficient long document understanding. To this end, we propose URaG, a simple-yet-effective framework that Unifies Retrieval and Generation within a single MLLM. URaG introduces a lightweight cross-modal retrieval module that converts the early Transformer layers into an efficient evidence selector, identifying and preserving the most relevant pages while discarding irrelevant content. This design enables the deeper layers to concentrate computational resources on pertinent information, improving both accuracy and efficiency. Extensive experiments demonstrate that URaG achieves state-of-the-art performance while reducing computational overhead by 44-56%. The code is available at https://github.com/shi-yx/URaG.
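The evidence-selection step can be pictured as a top-k filter over per-page relevance scores. This is a minimal sketch, not the authors' implementation: `select_evidence_pages`, the score list, and the fixed `k` are hypothetical, and the abstract does not say how URaG computes page scores inside the early layers.

```python
from typing import List


def select_evidence_pages(page_scores: List[float], k: int) -> List[int]:
    """Return the indices of the k highest-scoring pages, in document order.

    Pages outside the returned list would be discarded before the deeper
    layers run, which is where the compute savings would come from.
    """
    if k <= 0:
        return []
    ranked = sorted(
        range(len(page_scores)), key=lambda i: page_scores[i], reverse=True
    )
    return sorted(ranked[:k])
```

Keeping the surviving pages in document order is a design choice made here for readability of the downstream context; it is not claimed by the paper.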

[37] Evaluating Prompting Strategies with MedGemma for Medical Order Extraction

Abhinand Balachandran,Bavana Durgapraveen,Gowsikkan Sikkan Sudhagar,Vidhya Varshany J S,Sriram Rajkumar

Main category: cs.CL

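TL;DR: The paper studies MedGemma for extracting medical orders from doctor-patient conversations, comparing three prompting strategies (one-shot prompting, the ReAct framework, and a multi-step agentic workflow), and finds that simple one-shot prompting performs best on the validation set.

Details Motivation: Accurate extraction of medical orders is critical for reducing clinical documentation burden and ensuring patient safety. The paper explores how MedGemma performs under different prompting strategies to guide clinical information extraction.

Contribution: 1) A systematic evaluation of MedGemma on medical order extraction; 2) a comparison of three prompting strategies, finding one-shot prompting more robust on manually annotated data; 3) practical guidance on choosing a prompting strategy for clinical information extraction.

Method: The authors evaluate MedGemma with three prompting approaches: direct one-shot prompting, the reasoning-focused ReAct framework, and a multi-step agentic workflow.

Result: One-shot prompting performs best on the manually annotated official validation set, while more complex reasoning approaches (ReAct and the agentic workflow) may introduce noise through "overthinking".

Insight: On manually annotated data, a simple prompting strategy can be more efficient because complex reasoning chains tend to introduce noise; MedGemma shows promise on this task, but the prompting strategy should be chosen to fit the data at hand.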

Abstract: The accurate extraction of medical orders from doctor-patient conversations is a critical task for reducing clinical documentation burdens and ensuring patient safety. This paper details our team's submission to the MEDIQA-OE-2025 Shared Task. We investigate the performance of MedGemma, a new domain-specific open-source language model, for structured order extraction. We systematically evaluate three distinct prompting paradigms: a straightforward one-shot approach, a reasoning-focused ReAct framework, and a multi-step agentic workflow. Our experiments reveal that while more complex frameworks like ReAct and agentic flows are powerful, the simpler one-shot prompting method achieved the highest performance on the official validation set. We posit that on manually annotated transcripts, complex reasoning chains can lead to “overthinking” and introduce noise, making a direct approach more robust and efficient. Our work provides valuable insights into selecting appropriate prompting strategies for clinical information extraction in varied data conditions.

[38] SSR: Socratic Self-Refine for Large Language Model Reasoning

Haizhou Shi,Ye Liu,Bo Pang,Zeyu Leo Liu,Hao Wang,Silvio Savarese,Caiming Xiong,Yingbo Zhou,Semih Yavuz

Main category: cs.CL

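TL;DR: The paper proposes Socratic Self-Refine (SSR), a framework that decomposes a model's response into fine-grained steps, then verifies and refines them one at a time to improve large language model (LLM) reasoning; experiments show it outperforms existing methods on multiple benchmarks.

Details Motivation: Existing LLM reasoning frameworks usually rely on coarse self-verification and self-correction, which limits their effectiveness on complex tasks; a finer-grained approach is needed to improve reasoning accuracy.

Contribution: Proposes SSR, which decomposes responses into verifiable (sub-question, sub-answer) pairs, estimates confidence step by step, and refines unreliable steps, offering a black-box way to evaluate and understand an LLM's internal reasoning process.

Method: SSR decomposes a response into (sub-question, sub-answer) pairs, estimates step-level confidence through controlled re-solving and self-consistency checks, and iteratively refines unreliable steps to improve the reasoning chain.

Result: Across five reasoning benchmarks and three LLMs, SSR consistently outperforms state-of-the-art iterative self-refinement baselines.

Insight: Beyond performance gains, SSR provides an interpretable black-box analysis tool that helps in understanding the internal reasoning of LLMs.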

Abstract: Large Language Models (LLMs) have demonstrated remarkable reasoning abilities, yet existing test-time frameworks often rely on coarse self-verification and self-correction, limiting their effectiveness on complex tasks. In this paper, we propose Socratic Self-Refine (SSR), a novel framework for fine-grained evaluation and precise refinement of LLM reasoning. Our proposed SSR decomposes model responses into verifiable (sub-question, sub-answer) pairs, enabling step-level confidence estimation through controlled re-solving and self-consistency checks. By pinpointing unreliable steps and iteratively refining them, SSR produces more accurate and interpretable reasoning chains. Empirical results across five reasoning benchmarks and three LLMs show that SSR consistently outperforms state-of-the-art iterative self-refinement baselines. Beyond performance gains, SSR provides a principled black-box approach for evaluating and understanding the internal reasoning processes of LLMs. Code is available at https://github.com/SalesforceAIResearch/socratic-self-refine-reasoning.
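Step-level confidence estimation through re-solving can be sketched as follows: each sub-question is solved again several times, and the confidence of the original sub-answer is the fraction of re-solves that agree with it. This is a simplified reading of the abstract, not the released code; `step_confidences`, `least_reliable_step`, the `resolve` callback signature, and the exact-match agreement rule are all assumptions.

```python
from typing import Callable, List, Sequence, Tuple

Step = Tuple[str, str]  # (sub-question, sub-answer)


def step_confidences(
    steps: Sequence[Step],
    resolve: Callable[[Sequence[Step], str, int], str],
    n_samples: int = 5,
) -> List[float]:
    """Estimate per-step confidence by re-solving each sub-question.

    `resolve(context, question, sample_index)` is a hypothetical hook that
    would call an LLM; `context` holds the steps preceding the question.
    """
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    confidences = []
    for i, (question, answer) in enumerate(steps):
        context = steps[:i]
        samples = [resolve(context, question, s) for s in range(n_samples)]
        agreement = sum(sample == answer for sample in samples)
        confidences.append(agreement / n_samples)
    return confidences


def least_reliable_step(confidences: Sequence[float]) -> int:
    """Index of the step with the lowest confidence (first one on ties)."""
    return min(range(len(confidences)), key=lambda i: confidences[i])
```

The step returned by `least_reliable_step` is the natural candidate for refinement in the next iteration.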

[39] Instella: Fully Open Language Models with Stellar Performance

Jiang Liu,Jialian Wu,Xiaodong Yu,Yusheng Su,Prakamya Mishra,Gowtham Ramesh,Sudhanshu Ranjan,Chaitanya Manem,Ximeng Sun,Ze Wang,Pratik Prabhanjan Brahma,Zicheng Liu,Emad Barsoum

Main category: cs.CL

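TL;DR: Instella is a family of fully open 3-billion-parameter language models, trained on openly available data and an open codebase, that is competitive with leading open models of comparable size.

Details Motivation: Most high-performing language models are closed-source or only partially open, which limits transparency and reproducibility. Instella aims to advance open research by fully opening its models and data.

Contribution: 1) Releases the fully open Instella model family; 2) releases two specialized variants, Instella-Long and Instella-Math; 3) is competitive with open-weight models of comparable size.

Method: Large-scale pre-training, general-purpose instruction tuning, and alignment with human preferences; the specialized variants target long contexts (Instella-Long) and mathematical reasoning through supervised fine-tuning and reinforcement learning (Instella-Math).

Result: At the 3-billion-parameter scale, Instella achieves state-of-the-art results among fully open models and is competitive with leading open-weight models.

Insight: Fully open language models can reach high performance while advancing transparency and community research; the specialized variants show the potential for diversifying such models.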

Abstract: Large language models (LLMs) have demonstrated remarkable performance across a wide range of tasks, yet the majority of high-performing models remain closed-source or partially open, limiting transparency and reproducibility. In this work, we introduce Instella, a family of fully open three billion parameter language models trained entirely on openly available data and codebase. Powered by AMD Instinct MI300X GPUs, Instella is developed through large-scale pre-training, general-purpose instruction tuning, and alignment with human preferences. Despite using substantially fewer pre-training tokens than many contemporaries, Instella achieves state-of-the-art results among fully open models and is competitive with leading open-weight models of comparable size. We further release two specialized variants: Instella-Long, capable of handling context lengths up to 128K tokens, and Instella-Math, a reasoning-focused model enhanced through supervised fine-tuning and reinforcement learning on mathematical tasks. Together, these contributions establish Instella as a transparent, performant, and versatile alternative for the community, advancing the goal of open and reproducible language modeling research.

[40] ParoQuant: Pairwise Rotation Quantization for Efficient Reasoning LLM Inference

Yesheng Liang,Haisheng Chen,Song Han,Zhijian Liu

Main category: cs.CL

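TL;DR: ParoQuant is a weight-only post-training quantization method based on pairwise rotations. It uses hardware-efficient independent Givens rotations and channel-wise scaling to handle outliers and large dynamic range in LLMs, improving accuracy at low inference overhead.

Details Motivation: Weight-only post-training quantization (PTQ) reduces the memory footprint of large language models (LLMs) and accelerates inference, but outliers in weights and activations often cause large quantization errors and accuracy loss, especially in reasoning tasks where errors accumulate over many steps. Existing methods either fail to suppress outliers sufficiently or add significant inference overhead.

Contribution: 1. Proposes Pairwise Rotation Quantization (ParoQuant), which combines hardware-efficient independent Givens rotations with channel-wise scaling to even out magnitudes across channels and narrow the dynamic range within each quantization group; 2. co-designs an inference kernel that exploits GPU parallelism and keeps the runtime computation lightweight.

Method: ParoQuant applies independent Givens rotations to pairs of weight channels and adjusts magnitudes with channel-wise scaling, co-designed with a GPU inference kernel to reduce runtime overhead.

Result: On reasoning tasks, ParoQuant improves accuracy over AWQ by 2.4% on average with less than 10% overhead.

Insight: Combining rotations with channel-wise scaling suppresses the effect of outliers on quantization and improves accuracy while remaining hardware-efficient.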

Abstract: Weight-only post-training quantization (PTQ) compresses the weights of Large Language Models (LLMs) into low-precision representations to reduce memory footprint and accelerate inference. However, the presence of outliers in weights and activations often leads to large quantization errors and severe accuracy degradation, especially in recent reasoning LLMs where errors accumulate across long chains of thought. Existing PTQ methods either fail to sufficiently suppress outliers or introduce significant overhead during inference. In this paper, we propose Pairwise Rotation Quantization (ParoQuant), a weight-only PTQ method that combines hardware-efficient and optimizable independent Givens rotations with channel-wise scaling to even out the magnitude across channels and narrow the dynamic range within each quantization group. We further co-design the inference kernel to fully exploit GPU parallelism and keep the rotations and scaling lightweight at runtime. ParoQuant achieves an average 2.4% accuracy improvement over AWQ on reasoning tasks with less than 10% overhead. This paves the way for more efficient and accurate deployment of reasoning LLMs.
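A Givens rotation acts on a single pair of coordinates, so it can spread an outlier's magnitude across two channels without changing the pair's norm. The sketch below shows only that principle on two scalars; it is not ParoQuant's kernel or its rotation-selection procedure, and the helper names and the 45-degree angle are illustrative assumptions.

```python
import math
from typing import Sequence, Tuple


def givens_rotate(x: float, y: float, theta: float) -> Tuple[float, float]:
    """Rotate the pair (x, y) by angle theta in its 2D plane."""
    c, s = math.cos(theta), math.sin(theta)
    return c * x - s * y, s * x + c * y


def dynamic_range(values: Sequence[float]) -> float:
    """Largest absolute value, which sets the step size of a symmetric quantizer."""
    return max(abs(v) for v in values)


# One outlier channel paired with a near-empty channel.
outlier_pair = (8.0, 0.0)
rotated_pair = givens_rotate(*outlier_pair, math.pi / 4)

# The rotation is orthogonal, so rotating by -theta undoes it.
restored_pair = givens_rotate(*rotated_pair, -math.pi / 4)
```

Because the rotation is invertible, it can in principle be undone (or folded into adjacent computation) at inference time; how ParoQuant does this efficiently on GPU is described in the paper, not here.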

cs.CV

[41] MMaDA-Parallel: Multimodal Large Diffusion Language Models for Thinking-Aware Editing and Generation

Ye Tian,Ling Yang,Jiongfan Yang,Anran Wang,Yu Tian,Jiani Zheng,Haochen Wang,Zhiyang Teng,Zhuochen Wang,Yinjie Wang,Yunhai Tong,Mengdi Wang,Xiangtai Li

Main category: cs.CV

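TL;DR: MMaDA-Parallel is a multimodal diffusion framework that improves cross-modal consistency through parallel, bidirectional interaction, substantially improving thinking-aware image generation.

Details Motivation: Existing sequential autoregressive approaches can degrade on complex tasks because of error propagation, with poor cross-modal alignment as the main cause. This work targets that problem.

Contribution: 1) The ParaBench benchmark; 2) the MMaDA-Parallel framework, which supports bidirectional text-image interaction; 3) the ParaRL optimization strategy.

Method: MMaDA-Parallel combines supervised finetuning with ParaRL, which applies semantic rewards along the denoising trajectory to enforce cross-modal consistency.

Result: The model improves Output Alignment on ParaBench by 6.9% over the state-of-the-art model Bagel.

Insight: Parallel interaction and multimodal reinforcement learning are effective ways to improve cross-modal generation.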

Abstract: While thinking-aware generation aims to improve performance on complex tasks, we identify a critical failure mode where existing sequential, autoregressive approaches can paradoxically degrade performance due to error propagation. To systematically analyze this issue, we propose ParaBench, a new benchmark designed to evaluate both text and image output modalities. Our analysis using ParaBench reveals that this performance degradation is strongly correlated with poor alignment between the generated reasoning and the final image. To resolve this, we propose a parallel multimodal diffusion framework, MMaDA-Parallel, that enables continuous, bidirectional interaction between text and images throughout the entire denoising trajectory. MMaDA-Parallel is trained with supervised finetuning and then further optimized by Parallel Reinforcement Learning (ParaRL), a novel strategy that applies semantic rewards along the trajectory to enforce cross-modal consistency. Experiments validate that our model significantly improves cross-modal alignment and semantic consistency, achieving a 6.9% improvement in Output Alignment on ParaBench compared to the state-of-the-art model, Bagel, establishing a more robust paradigm for thinking-aware image synthesis. Our code is open-sourced at https://github.com/tyfeld/MMaDA-Parallel

[42] PriVi: Towards A General-Purpose Video Model For Primate Behavior In The Wild

Felix B. Mueller,Jan F. Meier,Timo Lueddecke,Richard Vogg,Roger L. Freixanet,Valentin Hassler,Tiffany Bosshard,Elif Karakoc,William J. O’Hearn,Sofia M. Pereira,Sandro Sehner,Kaja Wierucka,Judith Burkart,Claudia Fichtel,Julia Fischer,Alexander Gail,Catherine Hobaiter,Julia Ostner,Liran Samuni,Oliver Schülke,Neda Shahidi,Erin G. Wessling,Alexander S. Ecker

Main category: cs.CV

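TL;DR: PriVi is a large-scale primate-centric video pretraining dataset that takes a data-centric approach to improving generalization in primate behavior analysis. V-JEPA pretrained on PriVi performs strongly on several benchmark datasets, outperforming fully finetuned baselines.

Details Motivation: Existing methods rely on human-centric pretrained models and focus on single datasets, which limits generalization. PriVi addresses this with a primate-specific, data-centric approach.

Contribution: 1. The PriVi dataset, with 424 hours of curated primate video; 2. a scalable data curation pipeline; 3. evidence that primate-centric pretraining is data-efficient and generalizes well in low-label settings.

Method: 1. Build PriVi from behavioral research video and web-sourced footage; 2. pretrain V-JEPA on PriVi; 3. evaluate with a lightweight frozen classifier.

Result: Outperforms prior work, including fully finetuned baselines, on four benchmark datasets (ChimpACT, BaboonLand, PanAf500, and ChimpBehave), and scales favorably with fewer labels.

Insight: Primate-centric pretraining substantially improves data efficiency and generalization, showing the potential of data-centric approaches for domain-specific tasks.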

Abstract: Non-human primates are our closest living relatives, and analyzing their behavior is central to research in cognition, evolution, and conservation. Computer vision could greatly aid this research, but existing methods often rely on human-centric pretrained models and focus on single datasets, which limits generalization. We address this limitation by shifting from a model-centric to a data-centric approach and introduce PriVi, a large-scale primate-centric video pretraining dataset. PriVi contains 424 hours of curated video, combining 174 hours from behavioral research across 11 settings with 250 hours of diverse web-sourced footage, assembled through a scalable data curation pipeline. We pretrain V-JEPA on PriVi to learn primate-specific representations and evaluate it using a lightweight frozen classifier. Across four benchmark datasets, ChimpACT, BaboonLand, PanAf500, and ChimpBehave, our approach consistently outperforms prior work, including fully finetuned baselines, and scales favorably with fewer labels. These results demonstrate that primate-centric pretraining substantially improves data efficiency and generalization, making it a promising approach for low-label applications. Code, models, and the majority of the dataset will be made available.

[43] Classifying Phonotrauma Severity from Vocal Fold Images with Soft Ordinal Regression

Katie Matton,Purvaja Balaji,Hamzeh Ghasemzadeh,Jameson C. Cooper,Daryush D. Mehta,Jarrad H. Van Stan,Robert E. Hillman,Rosalind Picard,John Guttag,S. Mazdak Abulnaga

Main category: cs.CV

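TL;DR: The paper presents a method for automatically classifying phonotrauma severity from vocal fold images, using a soft ordinal regression framework to handle the ordinal nature and the uncertainty of the labels.

Details Motivation: Assessing phonotrauma severity depends on the subjective judgment of clinical experts, which is costly and varies in reliability; automated tools are needed to support large-scale studies and clinical decisions.

Contribution: 1. The first method for automatically classifying phonotrauma severity from vocal fold images; 2. a soft ordinal regression method that accounts for both label ordering and label uncertainty.

Method: Adopts an ordinal regression framework and modifies its loss functions to operate on soft labels that reflect the distribution of annotator ratings.

Result: Predictive performance approaches that of clinical experts, with well-calibrated uncertainty estimates.

Insight: Soft ordinal regression may carry over to other medical image classification tasks with ordinal, uncertain labels.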

Abstract: Phonotrauma refers to vocal fold tissue damage resulting from exposure to forces during voicing. It occurs on a continuum from mild to severe, and treatment options can vary based on severity. Assessment of severity involves a clinician’s expert judgment, which is costly and can vary widely in reliability. In this work, we present the first method for automatically classifying phonotrauma severity from vocal fold images. To account for the ordinal nature of the labels, we adopt a widely used ordinal regression framework. To account for label uncertainty, we propose a novel modification to ordinal regression loss functions that enables them to operate on soft labels reflecting annotator rating distributions. Our proposed soft ordinal regression method achieves predictive performance approaching that of clinical experts, while producing well-calibrated uncertainty estimates. By providing an automated tool for phonotrauma severity assessment, our work can enable large-scale studies of phonotrauma, ultimately leading to improved clinical understanding and patient care.
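One common way to build an ordinal loss is to predict K-1 cumulative events "severity > k" and sum a binary cross-entropy over them. The sketch below extends that construction to soft labels by turning an annotator rating distribution into soft cumulative targets. The abstract does not name the framework or the exact modification the authors use, so this formulation, and every function name in it, is an assumption made for illustration.

```python
import math
from typing import List, Sequence


def soft_cumulative_targets(rating_dist: Sequence[float]) -> List[float]:
    """Convert a rating distribution (or raw counts) over K ordered classes
    into K-1 soft targets t_k = P(label > k)."""
    total = sum(rating_dist)
    if total <= 0:
        raise ValueError("rating_dist must have positive mass")
    targets = []
    tail = 1.0
    for mass in rating_dist[:-1]:
        tail -= mass / total
        targets.append(min(1.0, max(0.0, tail)))  # guard against float drift
    return targets


def bce_with_logits(logit: float, target: float) -> float:
    """Numerically stable binary cross-entropy for one logit and a soft target."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


def soft_ordinal_loss(logits: Sequence[float], rating_dist: Sequence[float]) -> float:
    """Sum of binary cross-entropies over the K-1 cumulative thresholds."""
    targets = soft_cumulative_targets(rating_dist)
    if len(logits) != len(targets):
        raise ValueError("expected one logit per threshold (K-1 in total)")
    return sum(bce_with_logits(z, t) for z, t in zip(logits, targets))
```

For example, two annotators rating an image as class 1 and two as class 2 (counts `[0, 2, 2, 0]`) give thresholds that are fully, half, and not exceeded, so the middle logit is pulled toward zero rather than to either extreme.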

[44] SliderEdit: Continuous Image Editing with Fine-Grained Instruction Control

Arman Zarei,Samyadeep Basu,Mobina Pournemat,Sayan Nag,Ryan Rossi,Soheil Feizi

Main category: cs.CV

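TL;DR: SliderEdit is a framework for continuous image editing through fine-grained, interpretable slider controls, removing the fixed instruction strength imposed by existing methods.

Details Motivation: Existing instruction-based image editing models cannot continuously adjust the strength of an individual edit instruction, which limits precise user control. SliderEdit aims to fill this gap.

Contribution: Proposes SliderEdit, which exposes each instruction as a globally trained slider for continuous edit control, without separate training for each attribute.

Method: Learns a single set of low-rank adaptation matrices that generalize across diverse edits, attributes, and compositional instructions.

Result: Applied to models such as FLUX-Kontext and Qwen-Image-Edit, SliderEdit substantially improves edit controllability, visual consistency, and user steerability.

Insight: SliderEdit shows a technical path to continuous, fine-grained control in instruction-based image editing and opens a new direction for interactive image manipulation.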

Abstract: Instruction-based image editing models have recently achieved impressive performance, enabling complex edits to an input image from a multi-instruction prompt. However, these models apply each instruction in the prompt with a fixed strength, limiting the user’s ability to precisely and continuously control the intensity of individual edits. We introduce SliderEdit, a framework for continuous image editing with fine-grained, interpretable instruction control. Given a multi-part edit instruction, SliderEdit disentangles the individual instructions and exposes each as a globally trained slider, allowing smooth adjustment of its strength. Unlike prior works that introduced slider-based attribute controls in text-to-image generation, typically requiring separate training or fine-tuning for each attribute or concept, our method learns a single set of low-rank adaptation matrices that generalize across diverse edits, attributes, and compositional instructions. This enables continuous interpolation along individual edit dimensions while preserving both spatial locality and global semantic consistency. We apply SliderEdit to state-of-the-art image editing models, including FLUX-Kontext and Qwen-Image-Edit, and observe substantial improvements in edit controllability, visual consistency, and user steerability. To the best of our knowledge, we are the first to explore and propose a framework for continuous, fine-grained instruction control in instruction-based image editing models. Our results pave the way for interactive, instruction-driven image manipulation with continuous and compositional control.

[45] Density Estimation and Crowd Counting

Balachandra Devarangadi Sunil,Rakshith Venkatesh,Shantanu Todmal

Main category: cs.CV

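TL;DR: The study extends a crowd density estimation algorithm originally designed for images to video. It uses a denoising probabilistic model based on diffusion processes to generate high-quality density maps, adds a regression branch and a consolidation mechanism to improve accuracy, and uses event-driven sampling to reduce computation; experiments validate the approach.

Details Motivation: Crowd density estimation in video must cope with temporal dynamics and a heavy computational load. Existing methods mostly target static images and do not transfer directly to video, so an efficient method suited to dynamic scenes is needed for real-time monitoring.

Contribution: 1. A video crowd density estimation algorithm that combines a denoising probabilistic model with diffusion processes; 2. narrow Gaussian kernels and multiple density map outputs to improve accuracy; 3. an event-driven sampling technique that reduces computation and storage.

Method: 1. Generate high-quality crowd density maps with a diffusion process; 2. extract features with a regression branch and consolidate the multiple density maps; 3. select key frames with event-driven sampling based on the Farneback optical flow algorithm.

Result: The model captures crowd dynamics in both sparse and dense scenes, the sampling method reduces the frame count while retaining key events, and accuracy is assessed with Mean Absolute Error (MAE).

Insight: Combining a diffusion-based probabilistic model with event-driven sampling gives an efficient approach to video crowd analysis, particularly for real-time monitoring.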

Abstract: This study enhances a crowd density estimation algorithm originally designed for image-based analysis by adapting it for video-based scenarios. The proposed method integrates a denoising probabilistic model that utilizes diffusion processes to generate high-quality crowd density maps. To improve accuracy, narrow Gaussian kernels are employed, and multiple density map outputs are generated. A regression branch is incorporated into the model for precise feature extraction, while a consolidation mechanism combines these maps based on similarity scores to produce a robust final result. An event-driven sampling technique, utilizing the Farneback optical flow algorithm, is introduced to selectively capture frames showing significant crowd movements, reducing computational load and storage by focusing on critical crowd dynamics. Through qualitative and quantitative evaluations, including overlay plots and Mean Absolute Error (MAE), the model demonstrates its ability to effectively capture crowd dynamics in both dense and sparse settings. The efficiency of the sampling method is further assessed, showcasing its capability to decrease frame counts while maintaining essential crowd events. By addressing the temporal challenges unique to video analysis, this work offers a scalable and efficient framework for real-time crowd monitoring in applications such as public safety, disaster response, and event management.
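Event-driven sampling can be reduced to a thresholding rule over inter-frame motion. The sketch below assumes the mean optical-flow magnitude between consecutive frames has already been computed (for instance from the output of OpenCV's `cv2.calcOpticalFlowFarneback`) and keeps only frames that follow significant motion. The function name, the always-keep-first-frame rule, and the fixed threshold are assumptions, not details from the paper.

```python
from typing import List, Sequence


def select_event_frames(
    flow_magnitudes: Sequence[float], threshold: float
) -> List[int]:
    """Select frame indices that follow significant motion.

    `flow_magnitudes[i]` is the (precomputed) mean optical-flow magnitude
    between frame i and frame i + 1. Frame 0 is always kept as a reference.
    """
    kept = [0]
    for i, magnitude in enumerate(flow_magnitudes):
        if magnitude >= threshold:
            kept.append(i + 1)
    return kept
```

With magnitudes `[0.1, 2.0, 0.3, 1.5]` and a threshold of `1.0`, only frames 0, 2, and 4 would be passed on to the density estimator.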

[46] PALMS+: Modular Image-Based Floor Plan Localization Leveraging Depth Foundation Model

Yunqian Cheng,Benjamin Princen,Roberto Manduchi

Main category: cs.CV

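TL;DR: PALMS+ is a modular, image-based indoor localization system that reconstructs 3D point clouds from RGB images with a depth estimation model and localizes accurately by matching geometric layout against a floor plan.

Details Motivation: GPS is unavailable indoors, and existing vision-based localization methods such as PALMS are limited by the short range of smartphone LiDAR and by ambiguity in indoor layouts.

Contribution: Proposes PALMS+, which reconstructs scale-aligned 3D point clouds with a monocular depth estimation model (Depth Pro) and localizes accurately by convolutional matching against the floor plan.

Method: 1. Use Depth Pro to generate depth maps from posed RGB images and reconstruct a 3D point cloud; 2. estimate location and orientation through geometric layout matching (convolution with the floor plan).

Result: On two datasets (Structured3D and a custom campus dataset), PALMS+ outperforms PALMS and F3Loc in stationary localization accuracy; on 33 real-world trajectories it achieves lower sequential localization error.

Insight: PALMS+ localizes accurately without any training, showing potential for camera-free tracking in infrastructure-free applications.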

Abstract: Indoor localization in GPS-denied environments is crucial for applications like emergency response and assistive navigation. Vision-based methods such as PALMS enable infrastructure-free localization using only a floor plan and a stationary scan, but are limited by the short range of smartphone LiDAR and ambiguity in indoor layouts. We propose PALMS+, a modular, image-based system that addresses these challenges by reconstructing scale-aligned 3D point clouds from posed RGB images using a foundation monocular depth estimation model (Depth Pro), followed by geometric layout matching via convolution with the floor plan. PALMS+ outputs a posterior over the location and orientation, usable for direct or sequential localization. Evaluated on Structured3D and a custom campus dataset consisting of 80 observations across four large campus buildings, PALMS+ outperforms PALMS and F3Loc in stationary localization accuracy, without requiring any training. Furthermore, when integrated with a particle filter for sequential localization on 33 real-world trajectories, PALMS+ achieved lower localization errors compared to other methods, demonstrating robustness for camera-free tracking and its potential for infrastructure-free applications. Code and data are available at https://github.com/Head-inthe-Cloud/PALMS-Plane-based-Accessible-Indoor-Localization-Using-Mobile-Smartphones
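Layout matching by convolution amounts to sliding a local occupancy template over the floor plan and scoring the overlap at each offset. The sketch below does this for binary grids at a single orientation. It is a toy under stated assumptions: the grids, the function name `match_layout`, and the raw-overlap score are hypothetical, and the actual system also searches over orientation and outputs a posterior rather than a single best cell.

```python
from typing import List, Tuple

Grid = List[List[int]]


def match_layout(floor_plan: Grid, scan: Grid) -> Tuple[Tuple[int, int], int]:
    """Slide `scan` over `floor_plan` and return the best (row, col) offset
    together with its overlap score (first maximum wins on ties)."""
    plan_h, plan_w = len(floor_plan), len(floor_plan[0])
    scan_h, scan_w = len(scan), len(scan[0])
    if scan_h > plan_h or scan_w > plan_w:
        raise ValueError("scan must fit inside the floor plan")
    best_pos, best_score = (0, 0), -1
    for r in range(plan_h - scan_h + 1):
        for c in range(plan_w - scan_w + 1):
            score = sum(
                floor_plan[r + i][c + j] * scan[i][j]
                for i in range(scan_h)
                for j in range(scan_w)
            )
            if score > best_score:
                best_pos, best_score = (r, c), score
    return best_pos, best_score


# Hypothetical 4x5 floor plan (1 = wall) and a 2x2 corner-shaped scan.
plan = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 0, 0, 1],
]
corner = [
    [1, 1],
    [1, 0],
]
```

Normalizing the score map over all offsets (and orientations) would give a distribution that a particle filter could consume for sequential localization.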

[47] Social LSTM with Dynamic Occupancy Modeling for Realistic Pedestrian Trajectory Prediction

Ahmed Alia,Mohcine Chraibi,Armin Seyfried

Main category: cs.CV

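TL;DR: The paper proposes an enhanced Social LSTM trained with a Dynamic Occupied Space loss function, which lowers collision rates and improves displacement accuracy in pedestrian trajectory prediction.

Details Motivation: In dynamic, crowded environments, pedestrian trajectory prediction is hard because human motion is complex and individuals influence one another. Existing methods usually treat pedestrians as point entities and ignore the physical space each person occupies.

Contribution: A new Dynamic Occupied Space loss function that combines a collision penalty with displacement error, reducing collisions without increasing displacement error.

Method: Trains Social LSTM with the Dynamic Occupied Space loss, whose collision penalty is sensitive to scene density and individual spatial occupancy.

Result: The model reduces the collision rate by up to 31% and lowers average and final displacement error by 5% and 6%, respectively, on average across datasets.

Insight: Accounting for the space pedestrians actually occupy, and for scene density, materially improves trajectory prediction.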

Abstract: In dynamic and crowded environments, realistic pedestrian trajectory prediction remains a challenging task due to the complex nature of human motion and the mutual influences among individuals. Deep learning models have recently achieved promising results by implicitly learning such patterns from 2D trajectory data. However, most approaches treat pedestrians as point entities, ignoring the physical space that each person occupies. To address these limitations, this paper proposes a novel deep learning model that enhances the Social LSTM with a new Dynamic Occupied Space loss function. This loss function guides Social LSTM in learning to avoid realistic collisions without increasing displacement error across different crowd densities, ranging from low to high, in both homogeneous and heterogeneous density settings. Such a function achieves this by combining the average displacement error with a new collision penalty that is sensitive to scene density and individual spatial occupancy. For efficient training and evaluation, five datasets were generated from real pedestrian trajectories recorded during the Festival of Lights in Lyon 2022. Four datasets represent homogeneous crowd conditions – low, medium, high, and very high density – while the fifth corresponds to a heterogeneous density distribution. The experimental findings indicate that the proposed model not only lowers collision rates but also enhances displacement prediction accuracy in each dataset. Specifically, the model achieves up to a 31% reduction in the collision rate and reduces the average displacement error and the final displacement error by 5% and 6%, respectively, on average across all datasets compared to the baseline. Moreover, the proposed model consistently outperforms several state-of-the-art deep learning models across most test sets.
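A loss of this kind can be sketched as average displacement error plus a weighted penalty for overlapping bodies. The version below models each pedestrian as a disc and penalizes the overlap depth between every pair. It is not the paper's Dynamic Occupied Space loss: the disc model, the linear overlap penalty, the fixed `weight`, and all function names are assumptions, and the density sensitivity described in the abstract is not modeled.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def average_displacement_error(pred: Sequence[Point], truth: Sequence[Point]) -> float:
    """Mean Euclidean distance between predicted and true positions."""
    if len(pred) != len(truth) or not pred:
        raise ValueError("trajectories must be non-empty and equally long")
    return sum(math.dist(p, t) for p, t in zip(pred, truth)) / len(pred)


def collision_penalty(positions: Sequence[Point], radii: Sequence[float]) -> float:
    """Sum of overlap depths between all pairs of disc-shaped pedestrians."""
    penalty = 0.0
    for i in range(len(positions)):
        for j in range(i + 1, len(positions)):
            gap = math.dist(positions[i], positions[j])
            penalty += max(0.0, radii[i] + radii[j] - gap)
    return penalty


def combined_loss(
    pred: Sequence[Point],
    truth: Sequence[Point],
    scene_positions: Sequence[Sequence[Point]],
    radii: Sequence[float],
    weight: float = 1.0,
) -> float:
    """ADE plus the mean per-step collision penalty, scaled by `weight`.

    `scene_positions[t]` lists every pedestrian's predicted position at step t.
    """
    ade = average_displacement_error(pred, truth)
    mean_penalty = sum(
        collision_penalty(step, radii) for step in scene_positions
    ) / len(scene_positions)
    return ade + weight * mean_penalty
```

Using a soft overlap depth instead of a binary collision count keeps the penalty informative for gradient-based training, which is one plausible reason to prefer it; the paper may make a different choice.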

[48] Soiling detection for Advanced Driver Assistance Systems

Filip Beránek,Václav Diviš,Ivan Gruber

Main category: cs.CV

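TL;DR: The paper treats soiling detection for automotive cameras as a semantic segmentation task and compares several segmentation methods, which outperform tile-level classification. It also shows that the Woodscape dataset suffers from data leakage and imprecise annotations, and proposes a smaller subset that reaches comparable results in less time.

Details Motivation: Soiling detection for automotive cameras is crucial for advanced driver assistance systems (ADAS), but the available dataset has quality problems (data leakage and imprecise annotations) that affect model performance. The paper aims for a more efficient and accurate way to address the task.

Contribution: 1. Frames soiling detection as semantic segmentation and shows the advantage of segmentation methods; 2. exposes data leakage and annotation problems in the Woodscape dataset; 3. proposes a smaller data subset that reaches comparable results in a much shorter time.

Method: Applies semantic segmentation and provides a comprehensive comparison of popular segmentation methods; analyzes the Woodscape dataset and derives a revised subset from it.

Result: Segmentation methods outperform tile-level classification. Despite being much smaller, the new subset lets the segmentation method reach comparable results in a much shorter time.

Insight: Data quality (annotation precision, data leakage) has a strong effect on model performance; curating the dataset can improve training efficiency and robustness.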

Abstract: Soiling detection for automotive cameras is a crucial part of advanced driver assistance systems to make them more robust to external conditions like weather, dust, etc. In this paper, we regard soiling detection as a semantic segmentation problem. We provide a comprehensive comparison of popular segmentation methods and show that they outperform tile-level classification approaches. Moreover, we present an extensive analysis of the Woodscape dataset showing that the original dataset contains data leakage and imprecise annotations. To address these problems, we create a new data subset, which, despite being much smaller, provides enough information for the segmentation method to reach comparable results in a much shorter time. All our code and dataset splits are available at https://github.com/filipberanek/woodscape_revision.

[49] Feature Quality and Adaptability of Medical Foundation Models: A Comparative Evaluation for Radiographic Classification and Segmentation

Frank Li,Theo Dapamede,Mohammadreza Chavoshi,Young Seok Jeon,Bardia Khosravi,Abdulhameed Dere,Beatrice Brown-Mulry,Rohan Satya Isaac,Aawez Mansuri,Chiratidzo Sanyika,Janice Newsome,Saptarshi Purkayastha,Imon Banerjee,Hari Trivedi,Judy Gichoya

Main category: cs.CV

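TL;DR: The paper evaluates eight medical and general-domain foundation models (FMs) for chest X-ray analysis, comparing classification and segmentation performance. Medical pre-training clearly beats general-domain pre-training, but feature utility is highly task-dependent. Text-image alignment is not required, and a supervised baseline performs strongly on segmentation.

Details Motivation: Foundation models hold great promise for medical imaging, but it is unclear how pre-training domain (medical vs. general), paradigm (e.g., text-guided), and architecture affect feature quality. The paper evaluates how these factors affect radiology tasks to help select the most suitable encoder.

Contribution: 1. Compares medical and general-domain foundation models on chest X-ray classification and segmentation. 2. Finds that medical pre-training significantly improves feature quality, but feature utility depends on the task. 3. Shows that text-image alignment is not a prerequisite and that a supervised baseline remains competitive.

Method: Uses linear probing and fine-tuning to evaluate eight medical and general-domain FMs on classification (pneumothorax, cardiomegaly) and segmentation (pneumothorax, cardiac boundary), and analyzes the effect of pre-training domain, paradigm, and architecture.

Result: 1. Medical FMs outperform general-domain models in linear probing. 2. Features work well for global classification and for segmenting salient anatomy, but poorly for segmenting complex pathologies such as pneumothorax. 3. A supervised baseline matches or exceeds the best FMs on segmentation.

Insight: Medical pre-training helps, but architectural choices (e.g., multi-scale) are critical. Pre-trained features are not universally effective, and supervised models keep an edge on complex localization tasks. Because text-image alignment is not required, image-only and label-supervised approaches are viable alternatives.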

Abstract: Foundation models (FMs) promise to generalize medical imaging, but their effectiveness varies. It remains unclear how pre-training domain (medical vs. general), paradigm (e.g., text-guided), and architecture influence embedding quality, hindering the selection of optimal encoders for specific radiology tasks. To address this, we evaluate vision encoders from eight medical and general-domain FMs for chest X-ray analysis. We benchmark classification (pneumothorax, cardiomegaly) and segmentation (pneumothorax, cardiac boundary) using linear probing and fine-tuning. Our results show that domain-specific pre-training provides a significant advantage; medical FMs consistently outperformed general-domain models in linear probing, establishing superior initial feature quality. However, feature utility is highly task-dependent. Pre-trained embeddings were strong for global classification and segmenting salient anatomy (e.g., heart). In contrast, for segmenting complex, subtle pathologies (e.g., pneumothorax), all FMs performed poorly without significant fine-tuning, revealing a critical gap in localizing subtle disease. Subgroup analysis showed FMs use confounding shortcuts (e.g., chest tubes for pneumothorax) for classification, a strategy that fails for precise segmentation. We also found that expensive text-image alignment is not a prerequisite; image-only (RAD-DINO) and label-supervised (Ark+) FMs were among top performers. Notably, a supervised, end-to-end baseline remained highly competitive, matching or exceeding the best FMs on segmentation tasks. These findings show that while medical pre-training is beneficial, architectural choices (e.g., multi-scale) are critical, and pre-trained features are not universally effective, especially for complex localization tasks where supervised models remain a strong alternative.

[50] STORM: Segment, Track, and Object Re-Localization from a Single 3D Model

Yu Deng,Teng Cao,Hikaru Shindo,Jiahong Xue,Quentin Delfosse,Kristian Kersting

Main category: cs.CV

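TL;DR: STORM is an annotation-free, real-time 6D pose estimation system that combines vision-language understanding with self-supervised feature matching to segment, track, and re-localize objects accurately.

Details Motivation: Existing methods rely on a manually annotated segmentation mask in the first frame, which is labor-intensive and degrades under occlusion and rapid motion. STORM aims to remove these limitations.

Contribution: 1. An annotation-free three-stage pipeline that combines vision-language understanding with self-supervised feature matching; 2. An automatic re-registration mechanism that handles occlusion and rapid motion; 3. State-of-the-art accuracy and real-time performance on industrial datasets.

Method: 1. Contextual object descriptions guide localization; 2. Self-cross-attention identifies candidate regions; 3. A segmentation model produces precise masks; 4. Feature-similarity monitoring triggers automatic re-registration.

Result: State-of-the-art accuracy on industrial datasets with occlusions, high-speed motion, and varying illumination, at real-time speed and without additional training.

Insight: By removing manual annotation and automating recovery, STORM lowers deployment cost and offers a practical option for applications such as manufacturing and quality control.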

Abstract: Accurate 6D pose estimation and tracking are fundamental capabilities for physical AI systems such as robots. However, existing approaches typically rely on a manually annotated segmentation mask of the target in the first frame, which is labor-intensive and leads to reduced performance when faced with occlusions or rapid movement. To address these limitations, we propose STORM (Segment, Track, and Object Re-localization from a single 3D Model), an open-source robust real-time 6D pose estimation system that requires no manual annotation. STORM employs a novel three-stage pipeline combining vision-language understanding with self-supervised feature matching: contextual object descriptions guide localization, self-cross-attention mechanisms identify candidate regions, and a segmentation model produces precise masks for accurate pose estimation. Another key innovation is our automatic re-registration mechanism that detects tracking failures through feature similarity monitoring and recovers from severe occlusions or rapid motion. STORM achieves state-of-the-art accuracy on challenging industrial datasets featuring multi-object occlusions, high-speed motion, and varying illumination, while operating at real-time speeds without additional training. This annotation-free approach significantly reduces deployment overhead, providing a practical solution for modern applications, such as flexible manufacturing and intelligent quality control.

[51] Test-Time Spectrum-Aware Latent Steering for Zero-Shot Generalization in Vision-Language Models

Konstantinos M. Dafnis,Dimitris N. Metaxas

Main category: cs.CV

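TL;DR: STS is a lightweight test-time adaptation framework that extracts a spectral subspace and steers latent representations, improving the zero-shot generalization of vision-language models without modifying model weights.

Details Motivation: Vision-language models excel at zero-shot inference but degrade under test-time domain shift. Existing methods typically require backpropagation through the encoders or changes to core model components, which is computationally expensive.

Contribution: Proposes the STS framework, which achieves efficient, lightweight test-time adaptation through spectral subspace extraction and latent-space steering.

Method: STS extracts a spectral subspace from the text embeddings to define principal semantic directions, then adapts a small number of per-sample shift parameters that steer latent representations by minimizing entropy across augmented views.

Result: STS surpasses or compares favorably with state-of-the-art test-time adaptation methods, with inference up to 8x faster and a 12x smaller memory footprint than conventional test-time prompt tuning.

Insight: STS shows the potential of lightweight steering in latent space and offers an efficient test-time adaptation approach that requires no model modification.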

Abstract: Vision-Language Models (VLMs) excel at zero-shot inference but often degrade under test-time domain shifts. For this reason, episodic test-time adaptation strategies have recently emerged as powerful techniques for adapting VLMs to a single unlabeled image. However, existing adaptation strategies, such as test-time prompt tuning, typically require backpropagating through large encoder weights or altering core model components. In this work, we introduce Spectrum-Aware Test-Time Steering (STS), a lightweight adaptation framework that extracts a spectral subspace from the textual embeddings to define principal semantic directions and learns to steer latent representations in a spectrum-aware manner by adapting a small number of per-sample shift parameters to minimize entropy across augmented views. STS operates entirely at inference in the latent space, without backpropagation through or modification of the frozen encoders. Building on standard evaluation protocols, our comprehensive experiments demonstrate that STS largely surpasses or compares favorably against state-of-the-art test-time adaptation methods, while introducing only a handful of additional parameters and achieving inference speeds up to 8x faster with a 12x smaller memory footprint than conventional test-time prompt tuning. The code is available at https://github.com/kdafnis/STS.
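The two ingredients named in the abstract, a spectral subspace of the text embeddings and an entropy objective over augmented views, can be sketched in NumPy. This is only an illustrative sketch under assumptions, not the authors' implementation: the function names, the choice to shift the class text embeddings, the centering before the SVD, and the temperature are all hypothetical. The optimizer that would update `shift` is omitted.

```python
import numpy as np

def spectral_basis(text_embeds, rank):
    """Top-`rank` right singular vectors of the centered text embeddings (rows = classes)."""
    centered = text_embeds - text_embeds.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:rank]  # shape (rank, dim), rows are orthonormal

def steer(text_embeds, basis, shift):
    """Move every class embedding inside the spectral subspace, then re-normalize.

    `shift` holds the per-sample parameters (one coefficient per basis direction).
    """
    steered = text_embeds + shift @ basis
    return steered / np.linalg.norm(steered, axis=1, keepdims=True)

def marginal_entropy(view_feats, class_embeds, temperature=0.01):
    """Entropy of the class prediction averaged over augmented views of one image."""
    logits = view_feats @ class_embeds.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    probs = probs / probs.sum(axis=1, keepdims=True)
    mean_probs = probs.mean(axis=0)
    return float(-(mean_probs * np.log(mean_probs + 1e-12)).sum())
```

In such a setup only the `rank` entries of `shift` would be optimized per test image, while both encoders stay frozen, which is consistent with the abstract's description of a handful of additional parameters.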

[52] From Street to Orbit: Training-Free Cross-View Retrieval via Location Semantics and LLM Guidance

Jeongho Min,Dongyoung Kim,Jaehyup Lee

Main category: cs.CV

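TL;DR: Proposes a training-free cross-view image retrieval framework that uses a pretrained vision encoder and a large language model (LLM), retrieving satellite images from street-view images through location semantics and LLM guidance.

Details Motivation: Existing cross-view retrieval methods usually require supervised training or depend on specific datasets, which limits practical use. A method that needs no additional training and adapts well is needed.

Contribution: 1. A training-free cross-view image retrieval framework; 2. LLM-based inference combined with location semantics to improve retrieval; 3. Support for automatically constructing semantically aligned street-to-satellite datasets.

Method: 1. Infer the location through web-based image search and an LLM; 2. Generate a satellite query with a geocoding API; 3. Retrieve matching tiles using a pretrained vision encoder (e.g., DINOv2) with PCA-based whitening feature refinement.

Result: Under zero-shot settings, the method outperforms prior learning-based approaches on the benchmark dataset and supports efficient automatic dataset construction.

Insight: Combining pretrained models with an LLM enables cross-view retrieval without task-specific training, which offers flexibility and scalability for practical applications.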

Abstract: Cross-view image retrieval, particularly street-to-satellite matching, is a critical task for applications such as autonomous navigation, urban planning, and localization in GPS-denied environments. However, existing approaches often require supervised training on curated datasets and rely on panoramic or UAV-based images, which limits real-world deployment. In this paper, we present a simple yet effective cross-view image retrieval framework that leverages a pretrained vision encoder and a large language model (LLM), requiring no additional training. Given a monocular street-view image, our method extracts geographic cues through web-based image search and LLM-based location inference, generates a satellite query via a geocoding API, and retrieves matching tiles using a pretrained vision encoder (e.g., DINOv2) with PCA-based whitening feature refinement. Despite using no ground-truth supervision or finetuning, our proposed method outperforms prior learning-based approaches on the benchmark dataset under zero-shot settings. Moreover, our pipeline enables automatic construction of semantically aligned street-to-satellite datasets, offering a scalable and cost-efficient alternative to manual annotation. All source code will be made publicly available at https://jeonghomin.github.io/street2orbit.github.io/.
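The final retrieval step, PCA-based whitening followed by similarity ranking, can be illustrated with a short NumPy sketch. It assumes the encoder features are already extracted as arrays and that the whitening is fitted on the candidate satellite tiles; the function names, the number of components, and the use of cosine similarity are hypothetical choices, not details confirmed by the abstract.

```python
import numpy as np

def fit_pca_whitening(feats, n_components, eps=1e-6):
    """Fit a PCA whitening transform on a matrix of features (rows = images)."""
    mean = feats.mean(axis=0)
    centered = feats - mean
    cov = centered.T @ centered / (len(feats) - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1][:n_components]  # largest variance first
    proj = eigvecs[:, order] / np.sqrt(eigvals[order] + eps)
    return mean, proj

def embed(feats, mean, proj):
    """Whiten features and L2-normalize them so dot products are cosine similarities."""
    z = (np.atleast_2d(feats) - mean) @ proj
    return z / np.linalg.norm(z, axis=1, keepdims=True)

def retrieve(query, gallery, mean, proj, top_k=5):
    """Return indices of the `top_k` gallery tiles most similar to the query."""
    scores = embed(gallery, mean, proj) @ embed(query, mean, proj)[0]
    return np.argsort(-scores)[:top_k]
```

Whitening equalizes the variance of the retained directions, so a few dominant feature dimensions do not control the cosine ranking.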

[53] AHA! Animating Human Avatars in Diverse Scenes with Gaussian Splatting

Aymen Mir,Jian Wang,Riza Alp Guler,Chuan Guo,Gerard Pons-Moll,Bing Zhou

Main category: cs.CV

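TL;DR: The paper proposes a framework based on 3D Gaussian Splatting (3DGS) for animating humans in 3D scenes. Representing both humans and scenes as Gaussians enables geometry-consistent free-viewpoint rendering.

Details Motivation: Existing animation pipelines typically use meshes or point clouds as the 3D representation, which limits free-viewpoint rendering of human-scene interaction. 3DGS, a recent scene representation, remains under-explored for human animation.

Contribution: 1. Introduces 3DGS as the 3D representation for animating humans in scenes; 2. Proposes a Gaussian-aligned motion synthesis module and a human-scene Gaussian refinement optimization; 3. Supports geometry-consistent free-viewpoint rendering, with motion synthesis decoupled from rendering and no need for paired human-scene data.

Method: 1. Represent humans and scenes with 3DGS; 2. Use a Gaussian-aligned motion module that guides motion synthesis with opacity-based cues and projected Gaussian structures; 3. Apply a human-scene Gaussian refinement optimization to enforce natural contact and navigation.

Result: The method is evaluated on scenes from Scannet++ and the SuperSplat library, with avatars reconstructed from sparse and dense multi-view capture, and it supports editing monocular RGB videos with newly animated humans.

Insight: 3DGS has potential advantages for human animation, especially geometry-consistent free-viewpoint rendering, and opens new applications for monocular video.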

Abstract: We present a novel framework for animating humans in 3D scenes using 3D Gaussian Splatting (3DGS), a neural scene representation that has recently achieved state-of-the-art photorealistic results for novel-view synthesis but remains under-explored for human-scene animation and interaction. Unlike existing animation pipelines that use meshes or point clouds as the underlying 3D representation, our approach introduces the use of 3DGS as the 3D representation to the problem of animating humans in scenes. By representing humans and scenes as Gaussians, our approach allows for geometry-consistent free-viewpoint rendering of humans interacting with 3D scenes. Our key insight is that the rendering can be decoupled from the motion synthesis and each sub-problem can be addressed independently, without the need for paired human-scene data. Central to our method is a Gaussian-aligned motion module that synthesizes motion without explicit scene geometry, using opacity-based cues and projected Gaussian structures to guide human placement and pose alignment. To ensure natural interactions, we further propose a human-scene Gaussian refinement optimization that enforces realistic contact and navigation. We evaluate our approach on scenes from Scannet++ and the SuperSplat library, and on avatars reconstructed from sparse and dense multi-view human capture. Finally, we demonstrate that our framework allows for novel applications such as geometry-consistent free-viewpoint rendering of edited monocular RGB videos with new animated humans, showcasing the unique advantage of 3DGS for monocular video-based human animation.

[54] CertMask: Certifiable Defense Against Adversarial Patches via Theoretically Optimal Mask Coverage

Xuntao Lyu,Ching-Chi Lin,Abdullah Al Arafat,Georg von der Brüggen,Jian-Jia Chen,Zhishan Guo

Main category: cs.CV

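TL;DR: CertMask is a certifiable defense against adversarial patch attacks based on theoretically optimal mask coverage. Compared with the existing PatchCleanser method, it needs only a single round of masking and substantially improves certified robust accuracy.

Details Motivation: Adversarial patch attacks mislead deep learning models through localized perturbations and pose high risk in real-world applications. Existing defenses such as PatchCleanser need two rounds of masking with $O(n^2)$ inference cost. CertMask aims to provide an efficient defense with theoretical guarantees.

Contribution: CertMask proposes a mask coverage strategy that ensures every possible patch location is covered at least $k$ times, giving an efficient and robust defense in a single round of masking with $O(n)$ complexity.

Method: CertMask generates a set of binary masks with a mathematically rigorous coverage strategy so that each patch location is covered at least $k$ times, and proves through theoretical analysis that this coverage condition is sufficient for certification.

Result: On ImageNet, ImageNette, and CIFAR-10, CertMask improves certified robust accuracy by up to +13.4% over PatchCleanser while keeping clean accuracy nearly identical to the vanilla model.

Insight: CertMask shows the value of theoretically grounded coverage for efficient defense and provides a scalable, certifiable approach to adversarial patch defense.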

Abstract: Adversarial patch attacks inject localized perturbations into images to mislead deep vision models. These attacks can be physically deployed, posing serious risks to real-world applications. In this paper, we propose CertMask, a certifiably robust defense that constructs a provably sufficient set of binary masks to neutralize patch effects with strong theoretical guarantees. While the state-of-the-art approach (PatchCleanser) requires two rounds of masking and incurs $O(n^2)$ inference cost, CertMask performs only a single round of masking with $O(n)$ time complexity, where $n$ is the cardinality of the mask set to cover an input image. Our proposed mask set is computed using a mathematically rigorous coverage strategy that ensures each possible patch location is covered at least $k$ times, providing both efficiency and robustness. We offer a theoretical analysis of the coverage condition and prove its sufficiency for certification. Experiments on ImageNet, ImageNette, and CIFAR-10 show that CertMask improves certified robust accuracy by up to +13.4% over PatchCleanser, while maintaining clean accuracy nearly identical to the vanilla model.
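The coverage condition, that every possible patch location is covered by at least $k$ masks, can be checked with a short one-dimensional sketch. The sliding-mask layout, the function names, and the parameters below are hypothetical and only illustrate the condition; they are not CertMask's mask construction.

```python
def mask_offsets(image_size, mask_size, stride):
    """Offsets of sliding masks along one axis; the last mask is clamped to the border."""
    if mask_size > image_size:
        raise ValueError("mask must fit inside the image")
    last = image_size - mask_size
    offsets = list(range(0, last + 1, stride))
    if offsets[-1] != last:
        offsets.append(last)
    return offsets

def min_coverage(image_size, patch_size, mask_size, stride):
    """Smallest number of masks that fully cover any single patch location."""
    offsets = mask_offsets(image_size, mask_size, stride)
    counts = []
    for start in range(image_size - patch_size + 1):
        end = start + patch_size
        # A mask covers the patch only if the whole patch lies inside the mask.
        counts.append(sum(1 for o in offsets if o <= start and end <= o + mask_size))
    return min(counts)

def covers_k_times(image_size, patch_size, mask_size, stride, k):
    """True if every patch location is covered by at least k masks."""
    return min_coverage(image_size, patch_size, mask_size, stride) >= k
```

For example, `covers_k_times(32, 4, 12, 4, k=1)` holds, while the same layout fails for `k=2` because patches at the image border fall inside only one mask. In a single-round scheme, each mask corresponds to one forward pass, so the cost grows linearly with the number of masks.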

[55] CORONA-Fields: Leveraging Foundation Models for Classification of Solar Wind Phenomena

Daniela Martin,Jinsu Hong,Connor O’Brien,Valmir P Moraes Filho,Jasmine R. Kobayashi,Evangelia Samara,Joseph Gallego

Main category: cs.CV

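TL;DR: The paper proposes a deep learning architecture built on a foundation model for automatic classification of solar wind phenomena. It combines remote-sensing and in situ measurements and offers an initial proof of feasibility for space weather prediction.

Details Motivation: Solar activity and solar wind phenomena pose major risks to space weather at Earth and to technological infrastructure such as satellites, yet automated classification of these structures remains challenging.

Contribution: 1. Transfers a solar physics foundation model to solar wind structure analysis; 2. Combines its embeddings with spacecraft position and solar magnetic connectivity; 3. Proposes a neural field-based model for the solar wind classification task.

Method: 1. Generate embeddings with a pretrained solar physics foundation model; 2. Concatenate the embeddings with spacecraft position and solar magnetic connectivity, encoded using Fourier features; 3. Fine-tune the full architecture to link remote-sensing and in situ observations.

Result: Classification performance is modest, likely limited by coarse labels, class imbalance, and limited transferability of the pretrained model, but the study demonstrates that foundation model embeddings are feasible for solar wind tasks.

Insight: As a first proof of concept, the study lays groundwork for improved space weather prediction models and highlights the value of combining multiple data modalities.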

Abstract: Space weather at Earth, driven by the solar activity, poses growing risks to satellites around our planet as well as to critical ground-based technological infrastructure. Major space weather contributors are the solar wind and coronal mass ejections whose variable density, speed, temperature, and magnetic field make the automated classification of those structures challenging. In this work, we adapt a foundation model for solar physics, originally trained on Solar Dynamics Observatory imagery, to create embeddings suitable for solar wind structure analysis. These embeddings are concatenated with the spacecraft position and solar magnetic connectivity encoded using Fourier features which generates a neural field-based model. The full deep learning architecture is fine-tuned bridging the gap between remote sensing and in situ observations. Labels are derived from Parker Solar Probe measurements, forming a downstream classification task that maps plasma properties to solar wind structures. Although overall classification performance is modest, likely due to coarse labeling, class imbalance, and limited transferability of the pretrained model, this study demonstrates the feasibility of leveraging foundation model embeddings for in situ solar wind tasks. As a first proof-of-concept, it lays the groundwork for future improvements toward more reliable space weather predictions. The code and configuration files used in this study are publicly available to support reproducibility.

[56] Remember Me: Bridging the Long-Range Gap in LVLMs with Three-Step Inference-Only Decay Resilience Strategies

Peng Gao,Yujian Lee,Xiaofeng Zhang,Zailong Chen,Hui Zhang

Main category: cs.CV

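TL;DR: The paper proposes T-DRS, inference-only Three-step Decay Resilience Strategies that mitigate the long-range dependency problem caused by ROPE in LVLMs. It combines semantic-driven amplification, distance-aware control, and re-reinforcement of distant dependencies to improve the model's memory of global context.

Details Motivation: Large Vision-Language Models (LVLMs) that use ROPE (Rotary Positional Encoding) model long-range dependencies poorly. In particular, attention decays with token distance, which makes it hard for the model to remember global context. The paper aims to address this problem.

Contribution: Proposes T-DRS, which consists of SD-DRS (Semantic-Driven DRS), DC-DRS (Distance-aware Control DRS), and reRD-DRS (re-Reinforce Distant DRS), and improves long-range dependency modeling at inference time only.

Method: T-DRS is applied purely at inference and needs no additional training. (1) SD-DRS amplifies semantically meaningful long-range signals through content-aware residuals; (2) DC-DRS smoothly modulates attention weights based on positional distance to suppress noise; (3) reRD-DRS consolidates the remaining informative remote dependencies to maintain global coherence.

Result: On Visual Question Answering (VQA) benchmarks, T-DRS consistently improves model performance without additional training.

Insight: Inference-time strategies can compensate for ROPE's weakness in long-range dependency modeling without harming local inductive biases, which suggests a direction for improving other models as well.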

Abstract: Large Vision-Language Models (LVLMs) have achieved impressive performance across a wide range of multimodal tasks. However, they still face critical challenges in modeling long-range dependencies when using Rotary Positional Encoding (ROPE). Although ROPE facilitates precise modeling of token positions, it induces attention decay that grows with token distance and is strongest over distant token pairs, which severely impairs the model’s ability to remember global context. To alleviate this issue, we propose inference-only Three-step Decay Resilience Strategies (T-DRS), comprising (1) Semantic-Driven DRS (SD-DRS), which amplifies semantically meaningful but distant signals via content-aware residuals, (2) Distance-aware Control DRS (DC-DRS), which purifies attention by smoothly modulating weights based on positional distances, suppressing noise while preserving locality, and (3) re-Reinforce Distant DRS (reRD-DRS), which consolidates the remaining informative remote dependencies to maintain global coherence. Together, T-DRS recovers suppressed long-range token pairs without harming local inductive biases. Extensive experiments on Visual Question Answering (VQA) benchmarks demonstrate that T-DRS can consistently improve performance in a training-free manner. The code is available at https://github.com/labixiaoq-qq/Remember-me

[57] SAM-DAQ: Segment Anything Model with Depth-guided Adaptive Queries for RGB-D Video Salient Object Detection

Jia Lin,Xiaofei Zhou,Jiyuan Liu,Runmin Cong,Guodao Zhang,Zhi Liu,Jiyong Zhang

Main category: cs.CV

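TL;DR: SAM-DAQ combines the Segment Anything Model (SAM) with depth-guided adaptive queries for RGB-D video salient object detection, addressing the dependence on manual prompts, high memory consumption, and computational burden.

Details Motivation: Applying SAM directly to RGB-D video salient object detection runs into three challenges: dependence on manual prompts, the high memory consumption of sequential adapters, and the computational burden of memory attention.

Contribution: Proposes SAM-DAQ, which improves performance through depth-guided adaptive queries, a parallel adapter-based multi-modal image encoder (PAMIE), and a query-driven temporal memory (QTM) module.

Method: Uses the parallel adapter-based multi-modal image encoder (PAMIE) with depth-guided parallel adapters (DPAs), together with the query-driven temporal memory (QTM) module, which unifies temporal feature extraction and query updating.

Result: Experiments on three RGB-D VSOD datasets show that SAM-DAQ outperforms existing methods on all evaluation metrics.

Insight: Effective fusion of depth information and an adaptive query mechanism substantially improve video salient object detection.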

Abstract: Recently, the Segment Anything Model (SAM) has attracted widespread attention, and it is often treated as a vision foundation model for universal segmentation. Some researchers have attempted to directly apply the foundation model to the RGB-D video salient object detection (RGB-D VSOD) task, which often encounters three challenges, including the dependence on manual prompts, the high memory consumption of sequential adapters, and the computational burden of memory attention. To address these limitations, we propose a novel method, namely Segment Anything Model with Depth-guided Adaptive Queries (SAM-DAQ), which adapts SAM2 to pop out salient objects from videos by seamlessly integrating depth and temporal cues within a unified framework. Firstly, we deploy a parallel adapter-based multi-modal image encoder (PAMIE), which incorporates several depth-guided parallel adapters (DPAs) in a skip-connection way. Remarkably, we fine-tune the frozen SAM encoder under prompt-free conditions, where the DPA utilizes depth cues to facilitate the fusion of multi-modal features. Secondly, we deploy a query-driven temporal memory (QTM) module, which unifies the memory bank and prompt embeddings into a learnable pipeline. Concretely, by leveraging both frame-level queries and video-level queries simultaneously, the QTM module can not only selectively extract temporal consistency features but also iteratively update the temporal representations of the queries. Extensive experiments are conducted on three RGB-D VSOD datasets, and the results show that the proposed SAM-DAQ consistently outperforms state-of-the-art methods in terms of all evaluation metrics.

[58] RWKV-PCSSC: Exploring RWKV Model for Point Cloud Semantic Scene Completion

Wenzhe He,Xiaojun Chen,Wentang Chen,Hongyu Wang,Ying Liu,Ruihui Li

Main category: cs.CV

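TL;DR: The paper proposes RWKV-PCSSC, a lightweight point cloud semantic scene completion network. Introducing the RWKV mechanism reduces model complexity, and the network outperforms existing methods in both accuracy and resource efficiency.

Details Motivation: Existing semantic scene completion methods usually adopt dense network architectures with many parameters and high resource demands, which limits practical use. The authors therefore propose a lightweight alternative.

Contribution: 1. Designs the RWKV Seed Generator (RWKV-SG) module, which produces a coarse point cloud with coarse features; 2. Proposes multi-stage RWKV Point Deconvolution (RWKV-PD) modules that progressively restore point-wise features; 3. Achieves substantial gains in parameter count and memory efficiency.

Method: 1. The RWKV-SG module aggregates features from the partial point cloud to generate coarse features; 2. The RWKV-PD modules restore point cloud features stage by stage; 3. The overall network design is compact and efficient.

Result: Compared with PointSSC, RWKV-PCSSC reduces the parameter count by 4.18x and improves memory efficiency by 1.37x, while reaching state-of-the-art performance on several datasets.

Insight: The RWKV mechanism performs well on point cloud tasks, and a lightweight design can cut compute requirements while keeping high accuracy.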

Abstract: Semantic Scene Completion (SSC) aims to generate a complete semantic scene from an incomplete input. Existing approaches often employ dense network architectures with a high parameter count, leading to increased model complexity and resource demands. To address these limitations, we propose RWKV-PCSSC, a lightweight point cloud semantic scene completion network inspired by the Receptance Weighted Key Value (RWKV) mechanism. Specifically, we introduce an RWKV Seed Generator (RWKV-SG) module that can aggregate features from a partial point cloud to produce a coarse point cloud with coarse features. Subsequently, the point-wise features of the point cloud are progressively restored through multiple stages of RWKV Point Deconvolution (RWKV-PD) modules. By leveraging a compact and efficient design, our method achieves a lightweight model representation. Experimental results demonstrate that RWKV-PCSSC reduces the parameter count by 4.18$\times$ and improves memory efficiency by 1.37$\times$ compared to the state-of-the-art method PointSSC. Furthermore, our network achieves state-of-the-art performance on established indoor (SSC-PC, NYUCAD-PC) and outdoor (PointSSC) scene datasets, as well as on our proposed datasets (NYUCAD-PC-V2, 3D-FRONT-PC).

[59] HCC-3D: Hierarchical Compensatory Compression for 98% 3D Token Reduction in Vision-Language Models

Liheng Zhang,Jin Wang,Hui Li,Bingfeng Zhang,Weifeng Liu

Main category: cs.CV

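TL;DR: HCC-3D proposes a hierarchical compensatory compression method that sharply reduces the computational overhead of 3D vision-language models while preserving essential information.

Details Motivation: Current 3D-VLMs embed 3D point clouds directly into 3D tokens, which is computationally expensive and limits their application. The goal is to reduce the overhead of 3D tokens without losing critical information.

Contribution: Proposes HCC-3D, which comprises global structure compression (GSC) and adaptive detail mining (ADM), reaching a 3D token compression ratio of about 98% while also improving performance.

Method: 1. Global structure compression (GSC): global queries compress all 3D tokens into a few key tokens. 2. Adaptive detail mining (ADM): selectively recompresses salient but under-attended features.

Result: At a compression ratio of about 98%, HCC-3D outperforms previous methods, improving both efficiency and performance.

Insight: Hierarchical compression with detail compensation balances computational efficiency against information integrity and suggests a direction for 3D multi-modal modeling.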

Abstract: 3D understanding has drawn significant attention recently, leveraging Vision-Language Models (VLMs) to enable multi-modal reasoning between point cloud and text data. Current 3D-VLMs directly embed the 3D point clouds into 3D tokens, following large 2D-VLMs with powerful reasoning capabilities. However, this framework has a great computational cost limiting its application, where we identify that the bottleneck lies in processing all 3D tokens in the Large Language Model (LLM) part. This raises the question: how can we reduce the computational overhead introduced by 3D tokens while preserving the integrity of their essential information? To address this question, we introduce Hierarchical Compensatory Compression (HCC-3D) to efficiently compress 3D tokens while maintaining critical detail retention. Specifically, we first propose a global structure compression (GSC), in which we design global queries to compress all 3D tokens into a few key tokens while keeping overall structural information. Then, to compensate for the information loss in GSC, we further propose an adaptive detail mining (ADM) module that selectively recompresses salient but under-attended features through complementary scoring. Extensive experiments demonstrate that HCC-3D not only achieves extreme compression ratios (approximately 98%) compared to previous 3D-VLMs, but also achieves new state-of-the-art performance, showing the great improvements on both efficiency and performance.

[60] MosaicDoc: A Large-Scale Bilingual Benchmark for Visually Rich Document Understanding

Ketong Chen,Yuhao Chen,Yang Xue

Main category: cs.CV

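TL;DR: The paper presents MosaicDoc, a large-scale bilingual (Chinese and English) benchmark for Visually Rich Document Understanding (VRDU) that addresses the limited language diversity, layout complexity, and task diversity of existing benchmarks.

Details Motivation: Existing benchmarks for Vision-Language Models (VLMs) are mostly English, use simple layouts, and cover few tasks, so they cannot assess models on VRDU tasks with complex layouts and dense text.

Contribution: 1. Proposes DocWeaver, a multi-agent pipeline that uses large language models to generate a benchmark automatically; 2. Releases MosaicDoc, a bilingual large-scale benchmark with 72K images and over 600K QA pairs that supports diverse tasks and complex layouts.

Method: The DocWeaver multi-agent pipeline uses large language models to generate annotations automatically from newspaper and magazine sources, producing the MosaicDoc dataset.

Result: MosaicDoc is positioned as a definitive benchmark for VRDU, and evaluation on it shows that current state-of-the-art models still have clear limitations on complex real-world documents.

Insight: Complex document layouts and multilingual support are key challenges in VRDU, and future work needs to improve model performance on such tasks.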

Abstract: Despite the rapid progress of Vision-Language Models (VLMs), their capabilities are inadequately assessed by existing benchmarks, which are predominantly English-centric, feature simplistic layouts, and support limited tasks. Consequently, they fail to evaluate model performance for Visually Rich Document Understanding (VRDU), a critical challenge involving complex layouts and dense text. To address this, we introduce DocWeaver, a novel multi-agent pipeline that leverages Large Language Models to automatically generate a new benchmark. The result is MosaicDoc, a large-scale, bilingual (Chinese and English) resource designed to push the boundaries of VRDU. Sourced from newspapers and magazines, MosaicDoc features diverse and complex layouts (including multi-column and non-Manhattan), rich stylistic variety from 196 publishers, and comprehensive multi-task annotations (OCR, VQA, reading order, and localization). With 72K images and over 600K QA pairs, MosaicDoc serves as a definitive benchmark for the field. Our extensive evaluation of state-of-the-art models on this benchmark reveals their current limitations in handling real-world document complexity and charts a clear path for future research.

[61] Compensating Distribution Drifts in Class-incremental Learning of Pre-trained Vision Transformers

Xuan Rao,Simian Xu,Zheng Li,Bo Zhao,Derong Liu,Mingming Ha,Cesare Alippi

Main category: cs.CV

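TL;DR: The paper proposes SLDC, which uses a latent space transition operator together with knowledge distillation to address distribution drift in class-incremental learning with pre-trained ViTs, substantially improving the performance of SeqFT.

Details Motivation: In class-incremental learning, sequential fine-tuning (SeqFT) of a pre-trained ViT causes feature distribution drift, which degrades classifier performance. Existing methods do not address this effectively.

Contribution: Proposes SLDC, which aligns feature distributions across tasks through linear and weakly nonlinear transition operators combined with knowledge distillation, reducing drift and bringing SeqFT close to joint training.

Method: 1. Linear SLDC: learns a linear transition operator by regularized least squares; 2. Weakly nonlinear SLDC: uses learnable weakly nonlinear mappings; 3. Both variants are combined with knowledge distillation.

Result: Experiments show that SLDC significantly improves SeqFT, and with knowledge distillation the performance is comparable to joint training.

Insight: Distribution drift is a key obstacle in class-incremental learning, and SLDC addresses it through feature alignment and knowledge distillation.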

Abstract: Recent advances have shown that sequential fine-tuning (SeqFT) of pre-trained vision transformers (ViTs), followed by classifier refinement using approximate distributions of class features, can be an effective strategy for class-incremental learning (CIL). However, this approach is susceptible to distribution drift, caused by the sequential optimization of shared backbone parameters. This results in a mismatch between the distributions of the previously learned classes and that of the updated model, ultimately degrading classifier performance over time. To address this issue, we introduce a latent space transition operator and propose Sequential Learning with Drift Compensation (SLDC). SLDC aims to align feature distributions across tasks to mitigate the impact of drift. First, we present a linear variant of SLDC, which learns a linear operator by solving a regularized least-squares problem that maps features before and after fine-tuning. Next, we extend this with a weakly nonlinear SLDC variant, which assumes that the ideal transition operator lies between purely linear and fully nonlinear transformations. This is implemented using learnable, weakly nonlinear mappings that balance flexibility and generalization. To further reduce representation drift, we apply knowledge distillation (KD) in both algorithmic variants. Extensive experiments on standard CIL benchmarks demonstrate that SLDC significantly improves the performance of SeqFT. Notably, by combining KD to address representation drift with SLDC to compensate for distribution drift, SeqFT achieves performance comparable to joint training across all evaluated datasets. Code: https://github.com/raoxuan98-hash/sldc.git.
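The linear variant, a regularized least-squares map from features before fine-tuning to features after it, has a closed-form ridge solution. The sketch below assumes paired feature matrices from the old and new backbone on the same inputs and Gaussian class statistics (mean and covariance); the function names and the regularization value are hypothetical, not taken from the SLDC code.

```python
import numpy as np

def fit_linear_transition(feats_before, feats_after, reg=1e-3):
    """Ridge solution of min_W ||feats_before @ W - feats_after||^2 + reg * ||W||^2."""
    dim = feats_before.shape[1]
    gram = feats_before.T @ feats_before + reg * np.eye(dim)
    return np.linalg.solve(gram, feats_before.T @ feats_after)

def compensate_mean(old_mean, operator):
    """Move a stored class mean into the feature space of the fine-tuned backbone."""
    return old_mean @ operator

def compensate_cov(old_cov, operator):
    """Propagate a stored class covariance through the same linear map."""
    return operator.T @ old_cov @ operator
```

With such an operator, statistics saved for earlier classes could be updated after each task without revisiting old data, and the classifier refined on the compensated distributions.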

[62] AdaptViG: Adaptive Vision GNN with Exponential Decay Gating

Mustafa Munir,Md Mostafijur Rahman,Radu Marculescu

Main category: cs.CV

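TL;DR: AdaptViG is an efficient Vision GNN (ViG) that addresses the computational cost of conventional ViGs with Adaptive Graph Convolution and an Exponential Decay Gating mechanism, achieving a state-of-the-art trade-off between accuracy and efficiency among Vision GNNs.

Details Motivation: Conventional ViGs have a computational bottleneck, especially in the graph construction phase, which limits practical use. AdaptViG aims to solve this with dynamic gating and a hybrid strategy.

Contribution: 1. Proposes the Adaptive Graph Convolution mechanism; 2. Introduces the Exponential Decay Gating strategy; 3. Uses a hybrid design of static and dynamic components for efficient feature aggregation.

Method: 1. A static axial scaffold serves as the base structure; 2. Dynamic gating selectively weighs long-range connections based on feature similarity; 3. Early stages use the gating mechanism, and the final stage uses a full Global Attention block.

Result: AdaptViG-M reaches 82.6% top-1 accuracy on ImageNet, 0.3% above ViG-B with 80% fewer parameters and 84% fewer GMACs. On downstream tasks it surpasses the much larger EfficientFormer-L7 with 78% fewer parameters.

Insight: Dynamic gating and a hybrid design improve ViG efficiency while keeping high accuracy, which suggests a direction for graph neural networks in vision.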

Abstract: Vision Graph Neural Networks (ViGs) offer a new direction for advancements in vision architectures. While powerful, ViGs often face substantial computational challenges stemming from their graph construction phase, which can hinder their efficiency. To address this issue we propose AdaptViG, an efficient and powerful hybrid Vision GNN that introduces a novel graph construction mechanism called Adaptive Graph Convolution. This mechanism builds upon a highly efficient static axial scaffold and a dynamic, content-aware gating strategy called Exponential Decay Gating. This gating mechanism selectively weighs long-range connections based on feature similarity. Furthermore, AdaptViG employs a hybrid strategy, utilizing our efficient gating mechanism in the early stages and a full Global Attention block in the final stage for maximum feature aggregation. Our method achieves a new state-of-the-art trade-off between accuracy and efficiency among Vision GNNs. For instance, our AdaptViG-M achieves 82.6% top-1 accuracy, outperforming ViG-B by 0.3% while using 80% fewer parameters and 84% fewer GMACs. On downstream tasks, AdaptViG-M obtains 45.8 mIoU, 44.8 APbox, and 41.1 APmask, surpassing the much larger EfficientFormer-L7 by 0.7 mIoU, 2.2 APbox, and 2.1 APmask, respectively, with 78% fewer parameters.
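The idea of weighting a long-range connection by feature similarity with an exponentially decaying gate can be sketched as follows. The abstract does not give the formula, so the L1 distance, the temperature, and the normalized aggregation below are assumptions made for illustration, and the function names are hypothetical.

```python
import math

def decay_gate(feat_a, feat_b, temperature=1.0):
    """Gate in (0, 1]: equals 1 for identical features and decays with their L1 distance."""
    distance = sum(abs(a - b) for a, b in zip(feat_a, feat_b))
    return math.exp(-distance / temperature)

def gated_aggregate(center, neighbors, temperature=1.0):
    """Average neighbor features, each weighted by its gate relative to the center node."""
    gates = [decay_gate(center, n, temperature) for n in neighbors]
    total = sum(gates)
    if total == 0.0:
        return list(center)  # all gates underflowed; fall back to the center feature
    dim = len(center)
    return [sum(g * n[d] for g, n in zip(gates, neighbors)) / total for d in range(dim)]
```

Because the candidate neighbors come from a fixed scaffold, a gate like this only rescales existing connections and avoids a nearest-neighbor search over all nodes.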

[63] TSPE-GS: Probabilistic Depth Extraction for Semi-Transparent Surface Reconstruction via 3D Gaussian Splatting

Zhiyuan Xu,Nan Min,Yuhang Guo,Tong Wei

Main category: cs.CV

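TL;DR: The paper proposes TSPE-GS, which improves semi-transparent surface reconstruction in 3D Gaussian Splatting through probabilistic depth extraction, resolving the cross-surface depth ambiguity caused by assuming a single depth per pixel.

Details Motivation: Conventional 3D Gaussian Splatting methods handle semi-transparent surfaces poorly because they assume one depth per pixel, whereas in semi-transparent scenes several surfaces can be visible at once.

Contribution: Proposes TSPE-GS, which uniformly samples transmittance to model a pixel-wise multi-modal distribution of opacity and depth, replacing the single-peak assumption and resolving cross-surface depth ambiguity.

Method: The method progressively fuses truncated signed distance functions to reconstruct external and internal surfaces separately within a unified framework, and it generalizes to other Gaussian-based reconstruction pipelines without extra training overhead.

Result: On public and self-collected semi-transparent and opaque datasets, TSPE-GS significantly improves semi-transparent geometry reconstruction while maintaining performance on opaque scenes.

Insight: Modeling a pixel-wise multi-modal distribution is the key to semi-transparent surface reconstruction, and the method transfers to other Gaussian-based reconstruction pipelines.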

Abstract: 3D Gaussian Splatting offers a strong speed-quality trade-off but struggles to reconstruct semi-transparent surfaces because most methods assume a single depth per pixel, which fails when multiple surfaces are visible. We propose TSPE-GS (Transparent Surface Probabilistic Extraction for Gaussian Splatting), which uniformly samples transmittance to model a pixel-wise multi-modal distribution of opacity and depth, replacing the prior single-peak assumption and resolving cross-surface depth ambiguity. By progressively fusing truncated signed distance functions, TSPE-GS reconstructs external and internal surfaces separately within a unified framework. The method generalizes to other Gaussian-based reconstruction pipelines without extra training overhead. Extensive experiments on public and self-collected semi-transparent and opaque datasets show TSPE-GS significantly improves semi-transparent geometry reconstruction while maintaining performance on opaque scenes.

[64] Beyond Cosine Similarity Magnitude-Aware CLIP for No-Reference Image Quality Assessment

Zhicheng Liao,Dongxu Wu,Zhenshan Shi,Sijie Mai,Hanwei Zhu,Lingyu Zhu,Yuncheng Jiang,Baoliang Chen

Main category: cs.CV

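TL;DR: The paper proposes a no-reference image quality assessment method that combines CLIP cosine similarity with a magnitude-aware feature cue, using statistical normalization and an adaptive fusion framework to improve performance.

Details Motivation: Existing CLIP-based NR-IQA methods rely only on cosine-similarity semantic matching and overlook the strong correlation between the magnitude of CLIP image features and perceptual quality.

Contribution: 1. Shows that the magnitude of CLIP image features correlates strongly with image quality; 2. Proposes a statistically normalized, magnitude-aware auxiliary cue; 3. Designs a confidence-guided adaptive fusion framework.

Method: 1. Take the absolute CLIP image features and normalize them with a Box-Cox transformation; 2. Combine cosine similarity with the normalized feature magnitude; 3. Fuse the two cues with confidence-based adaptive weighting.

Result: On multiple benchmark datasets, the method outperforms standard CLIP-based IQA and state-of-the-art baselines without any task-specific training.

Insight: The magnitude of CLIP features is an effective complement to semantic matching, and statistical normalization and adaptive fusion are key to improving NR-IQA performance.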

Abstract: Recent efforts have repurposed the Contrastive Language-Image Pre-training (CLIP) model for No-Reference Image Quality Assessment (NR-IQA) by measuring the cosine similarity between the image embedding and textual prompts such as “a good photo” or “a bad photo.” However, this semantic similarity overlooks a critical yet underexplored cue: the magnitude of the CLIP image features, which we empirically find to exhibit a strong correlation with perceptual quality. In this work, we introduce a novel adaptive fusion framework that complements cosine similarity with a magnitude-aware quality cue. Specifically, we first extract the absolute CLIP image features and apply a Box-Cox transformation to statistically normalize the feature distribution and mitigate semantic sensitivity. The resulting scalar summary serves as a semantically-normalized auxiliary cue that complements cosine-based prompt matching. To integrate both cues effectively, we further design a confidence-guided fusion scheme that adaptively weighs each term according to its relative strength. Extensive experiments on multiple benchmark IQA datasets demonstrate that our method consistently outperforms standard CLIP-based IQA and state-of-the-art baselines, without any task-specific training.
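The two cues can be sketched in plain Python: a cosine-based score against a "good photo" and a "bad photo" prompt, and a magnitude cue obtained by applying the Box-Cox transformation to the absolute feature values. The Box-Cox formula is standard, but the parameter values, the averaging into a scalar, the temperature, and the function names are assumptions, and the confidence-guided fusion step is not reproduced here.

```python
import math

def box_cox(x, lam):
    """Box-Cox transform of a positive value: log(x) if lam == 0, else (x**lam - 1) / lam."""
    if x <= 0:
        raise ValueError("Box-Cox requires a positive input")
    return math.log(x) if lam == 0 else (x ** lam - 1.0) / lam

def magnitude_cue(features, lam=0.5, eps=1e-6):
    """Scalar summary: mean Box-Cox value of the absolute feature entries."""
    values = [box_cox(abs(f) + eps, lam) for f in features]
    return sum(values) / len(values)

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def cosine_quality(image_feat, good_prompt, bad_prompt, temperature=0.01):
    """Two-way softmax probability that the image matches the 'good photo' prompt."""
    gap = (cosine(image_feat, bad_prompt) - cosine(image_feat, good_prompt)) / temperature
    return 1.0 / (1.0 + math.exp(gap))
```

Cosine similarity discards the feature norm by construction, which is why a separate magnitude cue can carry information that prompt matching alone cannot.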

[65] Robust Object Detection with Pseudo Labels from VLMs using Per-Object Co-teaching

Uday Bhaskar,Rishabh Bhattacharya,Avinash Patel,Sarthak Khoche,Praveen Anil Kulkarni,Naresh Manwani

Main category: cs.CV

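TL;DR: The paper proposes a pipeline that trains efficient real-time object detectors from pseudo-labels generated by vision-language models (VLMs). A per-object co-teaching strategy reduces the noise in VLM-generated labels and substantially improves detection performance.

Details Motivation: In domains such as autonomous driving, manual annotation is expensive. VLMs offer zero-shot object detection, but their latency and hallucinated predictions prevent direct deployment. An efficient and robust way to train real-time detectors from VLM pseudo-labels is therefore needed.

Contribution: 1) A new training pipeline based on pseudo-labels; 2) A per-object co-teaching strategy that reduces the effect of noisy labels; 3) Significant performance gains on autonomous driving datasets.

Method: The core is a per-object co-teaching strategy in which two YOLO models learn collaboratively, each filtering out unreliable bounding boxes based on the other's per-object loss values.

Result: On KITTI, mAP@0.5 rises from 31.12% to 46.61%, and to 57.97% when a small fraction (10%) of ground-truth labels is added. Similar improvements are observed on ACDC and BDD100k.

Insight: VLM-generated pseudo-labels can train efficient detectors when co-teaching limits the effect of noise, and a small amount of ground-truth labels gives a further large gain, which reduces annotation cost in practice.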

Abstract: Foundation models, especially vision-language models (VLMs), offer compelling zero-shot object detection for applications like autonomous driving, a domain where manual labelling is prohibitively expensive. However, their detection latency and tendency to hallucinate predictions render them unsuitable for direct deployment. This work introduces a novel pipeline that addresses this challenge by leveraging VLMs to automatically generate pseudo-labels for training efficient, real-time object detectors. Our key innovation is a per-object co-teaching-based training strategy that mitigates the inherent noise in VLM-generated labels. The proposed per-object co-teaching approach filters noisy bounding boxes from training instead of filtering the entire image. Specifically, two YOLO models learn collaboratively, filtering out unreliable boxes from each mini-batch based on their peers’ per-object loss values. Overall, our pipeline provides an efficient, robust, and scalable approach to train high-performance object detectors for autonomous driving, significantly reducing reliance on costly human annotation. Experimental results on the KITTI dataset demonstrate that our method outperforms a baseline YOLOv5m model, achieving a significant mAP@0.5 boost (31.12% to 46.61%) while maintaining real-time detection latency. Furthermore, we show that supplementing our pseudo-labelled data with a small fraction of ground truth labels (10%) leads to further performance gains, reaching 57.97% mAP@0.5 on the KITTI dataset. We observe similar performance improvements for the ACDC and BDD100k datasets.
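The per-object filtering step can be sketched as a small-loss selection in which each model trains only on the boxes its peer finds reliable. The keep ratio, the function names, and the representation of a mini-batch as a flat list of per-box losses are assumptions for illustration; the detector and loss computation are not shown.

```python
def small_loss_indices(per_object_losses, keep_ratio):
    """Indices of the pseudo-labelled boxes with the smallest loss, in ascending index order."""
    if not per_object_losses:
        return []
    n_keep = max(1, int(keep_ratio * len(per_object_losses)))
    ranked = sorted(range(len(per_object_losses)), key=lambda i: per_object_losses[i])
    return sorted(ranked[:n_keep])

def co_teaching_step(losses_a, losses_b, keep_ratio):
    """Exchange selections: model A trains on boxes chosen by B, and B on boxes chosen by A."""
    boxes_for_a = small_loss_indices(losses_b, keep_ratio)
    boxes_for_b = small_loss_indices(losses_a, keep_ratio)
    return boxes_for_a, boxes_for_b
```

For example, with per-box losses `[0.2, 3.1, 0.4, 0.9]` from model A and `[0.3, 0.5, 2.8, 0.7]` from model B and a keep ratio of 0.5, model A would train on boxes 0 and 1 and model B on boxes 0 and 2. Filtering per box rather than per image keeps the clean boxes of an image even when other boxes in it are hallucinated.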

[66] Equivariant Sampling for Improving Diffusion Model-based Image Restoration

Chenxu Wu,Qingpeng Kong,Peiang Zhao,Wendi Yang,Wenxin Ma,Fenghe Tang,Zihang Jiang,S. Kevin Zhou

Main category: cs.CV

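TL;DR: The paper proposes EquS, a method that imposes equivariant information through dual sampling trajectories to improve diffusion model-based image restoration (DMIR). It also proposes a Timestep-Aware Schedule (TAS) to further improve performance.

Details Motivation: Existing problem-agnostic DMIR methods fail to fully exploit diffusion priors, resulting in suboptimal performance. The paper addresses these limitations by analyzing the sampling process and proposing effective solutions.

Contribution: 1. Proposes EquS, which imposes equivariant information through dual sampling trajectories. 2. Proposes the TAS schedule, which prioritizes deterministic steps to improve efficiency and performance.

Method: 1. Impose equivariant information via EquS. 2. Optimize the sampling process with the TAS schedule.

Result: Experiments show that EquS is compatible with existing DMIR methods and significantly improves their performance without increasing computational cost.

Insight: Introducing equivariant information and timestep-aware scheduling can significantly improve diffusion models on image restoration tasks.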

Abstract: Recent advances in generative models, especially diffusion models, have significantly improved image restoration (IR) performance. However, existing problem-agnostic diffusion model-based image restoration (DMIR) methods face challenges in fully leveraging diffusion priors, resulting in suboptimal performance. In this paper, we address the limitations of current problem-agnostic DMIR methods by analyzing their sampling process and providing effective solutions. We introduce EquS, a DMIR method that imposes equivariant information through dual sampling trajectories. To further boost EquS, we propose the Timestep-Aware Schedule (TAS) and introduce EquS$^+$. TAS prioritizes deterministic steps to enhance certainty and sampling efficiency. Extensive experiments on benchmarks demonstrate that our method is compatible with previous problem-agnostic DMIR methods and significantly boosts their performance without increasing computational costs. Our code is available at https://github.com/FouierL/EquS.

[67] Difference Vector Equalization for Robust Fine-tuning of Vision-Language Models

Satoshi Suzuki,Shin’ya Yamaguchi,Shoichiro Takeda,Taiga Yamane,Naoki Makishima,Naotaka Kawata,Mana Ihori,Tomohiro Tanaka,Shota Orihashi,Ryo Masumura

Main category: cs.CV

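TL;DR: The paper proposes DiVE, a new method that preserves the geometric structure of embeddings when fine-tuning vision-language models such as CLIP, improving in-distribution (ID) performance without compromising generalization in out-of-distribution (OOD) and zero-shot settings.

Details Motivation: Existing robust fine-tuning methods reuse contrastive learning during fine-tuning, but this distorts the geometric structure of the embeddings and limits OOD and zero-shot performance. The paper aims to solve this problem.

Contribution: Proposes Difference Vector Equalization (DiVE), which preserves the geometric structure by constraining the difference vectors between embeddings from the pre-trained and fine-tuned models, and introduces two losses: average vector loss (AVL) and pairwise vector loss (PVL).

Method: DiVE preserves the geometric structure by constraining the difference vectors (the difference between a sample's embeddings under the pre-trained and fine-tuned models) to be consistent globally (AVL) and locally (PVL).

Result: Experiments show that DiVE preserves the geometric structure while achieving strong results on ID, OOD, and zero-shot metrics.

Insight: Preserving the geometric structure of embeddings is crucial to the generalization of vision-language models, and DiVE achieves this by constraining difference vectors.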

Abstract: Contrastive pre-trained vision-language models, such as CLIP, demonstrate strong generalization abilities in zero-shot classification by leveraging embeddings extracted from image and text encoders. This paper aims to robustly fine-tune these vision-language models on in-distribution (ID) data without compromising their generalization abilities in out-of-distribution (OOD) and zero-shot settings. Current robust fine-tuning methods tackle this challenge by reusing contrastive learning, which was used in pre-training, for fine-tuning. However, we found that these methods distort the geometric structure of the embeddings, which plays a crucial role in the generalization of vision-language models, resulting in limited OOD and zero-shot performance. To address this, we propose Difference Vector Equalization (DiVE), which preserves the geometric structure during fine-tuning. The idea behind DiVE is to constrain difference vectors, each of which is obtained by subtracting the embeddings extracted from the pre-trained and fine-tuning models for the same data sample. By constraining the difference vectors to be equal across various data samples, we effectively preserve the geometric structure. Therefore, we introduce two losses: average vector loss (AVL) and pairwise vector loss (PVL). AVL preserves the geometric structure globally by constraining difference vectors to be equal to their weighted average. PVL preserves the geometric structure locally by ensuring a consistent multimodal alignment. Our experiments demonstrate that DiVE effectively preserves the geometric structure, achieving strong results across ID, OOD, and zero-shot metrics.
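A small sketch may help show why equal difference vectors preserve geometry. It assumes the difference vector is the fine-tuned embedding minus the pre-trained one, uniform weights for the average, and a squared-distance penalty; the abstract fixes none of these, and the function names are hypothetical. PVL is not reproduced.

```python
def difference_vectors(pretrained, finetuned):
    """d_i = finetuned_i - pretrained_i (the sign convention is an assumption)."""
    return [[f - p for f, p in zip(fv, pv)] for pv, fv in zip(pretrained, finetuned)]


def average_vector_loss(diffs, weights=None):
    """Mean squared distance between each difference vector and their weighted average."""
    n = len(diffs)
    weights = weights or [1.0 / n] * n  # uniform weights: an assumption
    dim = len(diffs[0])
    mean = [sum(w * d[k] for w, d in zip(weights, diffs)) for k in range(dim)]
    return sum(sum((d[k] - mean[k]) ** 2 for k in range(dim)) for d in diffs) / n


pre = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
shifted = [[1.5, 0.5], [0.5, 1.5], [1.5, 1.5]]  # every sample moved by (0.5, 0.5)
warped = [[2.0, 0.0], [0.0, 1.0], [1.0, 3.0]]   # samples moved in different directions

loss_shifted = average_vector_loss(difference_vectors(pre, shifted))
loss_warped = average_vector_loss(difference_vectors(pre, warped))
```

When all samples move by the same vector, the embedding set is only translated, so pairwise relations are unchanged and the loss is essentially zero; the warped set changes those relations and is penalized.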

[68] DBGroup: Dual-Branch Point Grouping for Weakly Supervised 3D Instance Segmentation

Xuexun Liu,Xiaoxu Xu,Qiudan Zhang,Lin Ma,Xu Wang

Main category: cs.CV

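TL;DR: DBGroup proposes a dual-branch point grouping approach that uses scene-level annotations as an efficient form of weak supervision for 3D instance segmentation, and improves performance through pseudo-label generation and self-training.

Details Motivation: Existing weakly supervised 3D instance segmentation methods still rely on costly manual annotation, and the process is complex and inefficient. DBGroup aims to reduce annotation cost and improve scalability by using scene-level annotations.

Contribution: 1) A Dual-Branch Point Grouping module that generates pseudo-labels; 2) two refinement strategies, Granularity-Aware Instance Merging and Semantic Selection and Propagation, that improve label quality; 3) a multi-round self-training framework and an Instance Mask Filter that further improve performance.

Method: 1) Stage one: generate pseudo-labels from semantic and mask cues extracted from multi-view images; 2) Stage two: refine the pseudo-labels and train the segmentation network through self-training and instance mask filtering.

Result: DBGroup is competitive with sparse-point-level supervised 3D instance segmentation methods and surpasses state-of-the-art scene-level supervised 3D semantic segmentation methods.

Insight: Scene-level annotation is a low-cost, efficient form of weak supervision; careful pseudo-label refinement and multi-round self-training significantly improve 3D instance segmentation.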

Abstract: Weakly supervised 3D instance segmentation is essential for 3D scene understanding, especially given the growing scale of data and the high annotation costs associated with fully supervised approaches. Existing methods primarily rely on two forms of weak supervision: one-thing-one-click annotations and bounding box annotations, both of which aim to reduce labeling efforts. However, these approaches still encounter limitations, including labor-intensive annotation processes, high complexity, and reliance on expert annotators. To address these challenges, we propose DBGroup, a two-stage weakly supervised 3D instance segmentation framework that leverages scene-level annotations as a more efficient and scalable alternative. In the first stage, we introduce a Dual-Branch Point Grouping module to generate pseudo labels guided by semantic and mask cues extracted from multi-view images. To further improve label quality, we develop two refinement strategies: Granularity-Aware Instance Merging and Semantic Selection and Propagation. The second stage involves multi-round self-training on an end-to-end instance segmentation network using the refined pseudo-labels. Additionally, we introduce an Instance Mask Filter strategy to address inconsistencies within the pseudo labels. Extensive experiments demonstrate that DBGroup achieves competitive performance compared to sparse-point-level supervised 3D instance segmentation methods, while surpassing state-of-the-art scene-level supervised 3D semantic segmentation approaches. Code is available at https://github.com/liuxuexun/DBGroup.

[69] LampQ: Towards Accurate Layer-wise Mixed Precision Quantization for Vision Transformers

Minjun Kim,Jaeri Lee,Jongjin Kim,Jeongin Yun,Yongmo Kwon,U Kang

Main category: cs.CV

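TL;DR: LampQ is a layer-wise mixed precision quantization method for Vision Transformers that addresses the coarse granularity, metric scale mismatch, and quantization-unaware bit allocation of existing methods.

Details Motivation: Most existing quantization methods use uniform precision, ignoring the differing sensitivity of ViT components to quantization.

Contribution: 1. A layer-wise quantization method; 2. a type-aware Fisher-based metric; 3. bit allocation optimized through integer linear programming and iterative updates.

Method: 1. Layer-wise quantization; 2. a Fisher-based sensitivity metric; 3. integer linear programming followed by iterative bit-width updates.

Result: Achieves state-of-the-art performance on multiple tasks, including image classification, object detection, and zero-shot quantization.

Insight: Differences in the quantization sensitivity of ViT components can be captured accurately with a type-aware metric and optimized bit allocation, improving quantization quality.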

Abstract: How can we accurately quantize a pre-trained Vision Transformer model? Quantization algorithms compress Vision Transformers (ViTs) into low-bit formats, reducing memory and computation demands with minimal accuracy degradation. However, existing methods rely on uniform precision, ignoring the diverse sensitivity of ViT components to quantization. Metric-based Mixed Precision Quantization (MPQ) is a promising alternative, but previous MPQ methods for ViTs suffer from three major limitations: 1) coarse granularity, 2) mismatch in metric scale across component types, and 3) quantization-unaware bit allocation. In this paper, we propose LampQ (Layer-wise Mixed Precision Quantization for Vision Transformers), an accurate metric-based MPQ method for ViTs to overcome these limitations. LampQ performs layer-wise quantization to achieve both fine-grained control and efficient acceleration, incorporating a type-aware Fisher-based metric to measure sensitivity. Then, LampQ assigns bit-widths optimally through integer linear programming and further updates them iteratively. Extensive experiments show that LampQ provides state-of-the-art performance in quantizing ViTs pre-trained on various tasks such as image classification, object detection, and zero-shot quantization.
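The bit allocation step can be pictured as a small constrained selection problem: choose one bit-width per layer so that the total sensitivity is minimal while the model fits a size budget. The sketch below is a toy under stated assumptions: exhaustive search stands in for an integer linear programming solver, the sensitivity table and sizes are made-up numbers, and `allocate_bits` is a hypothetical name. The type-aware Fisher metric and the iterative updates are not reproduced.

```python
from itertools import product


def allocate_bits(sensitivity, num_params, bit_choices, budget_bits):
    """Pick one bit-width per layer minimizing total sensitivity within a size budget.

    sensitivity[l][b]: hypothetical sensitivity score of layer l at bit-width b.
    Exhaustive search stands in for an integer linear programming solver here.
    """
    best_cost, best_assignment = float("inf"), None
    for assignment in product(bit_choices, repeat=len(num_params)):
        size = sum(n * b for n, b in zip(num_params, assignment))
        if size > budget_bits:
            continue  # violates the size constraint
        cost = sum(sensitivity[l][b] for l, b in enumerate(assignment))
        if cost < best_cost:
            best_cost, best_assignment = cost, assignment
    return best_assignment, best_cost


# Toy numbers, not taken from the paper.
num_params = [100, 100, 200]
sensitivity = [
    {2: 9.0, 4: 3.0, 8: 1.0},  # layer 0: quite sensitive
    {2: 2.0, 4: 1.0, 8: 0.5},  # layer 1: robust
    {2: 3.0, 4: 2.0, 8: 1.5},  # layer 2: large but tolerant of low precision
]
# Budget equals a uniform 4-bit model: 400 parameters * 4 bits.
bits, cost = allocate_bits(sensitivity, num_params, (2, 4, 8), budget_bits=1600)
```

With these numbers the search assigns 8 bits to the sensitive layer and 2 bits to the large tolerant one, which beats uniform 4-bit allocation at the same size.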

[70] MIRNet: Integrating Constrained Graph-Based Reasoning with Pre-training for Diagnostic Medical Imaging

Shufeng Kong,Zijie Wang,Nuan Cui,Hao Tang,Yihan Meng,Yuanyuan Wei,Feifan Chen,Yingheng Wang,Zhuo Cai,Yaonan Wang,Yulong Zhang,Yuzheng Li,Zibin Zheng,Caihua Liu

Main category: cs.CV

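TL;DR: MIRNet combines self-supervised pre-training with constrained graph-based reasoning to improve medical image diagnosis, particularly tongue diagnosis. It learns visual representations with an MAE, models label relations with a GAT, and introduces the TongueAtlas-4K dataset to address annotation scarcity.

Details Motivation: Medical image diagnosis must cope with annotation scarcity, label imbalance, and clinical plausibility constraints; tongue diagnosis in particular demands fine-grained visual-semantic understanding.

Contribution: 1. Proposes the MIRNet framework, which integrates self-supervised pre-training with graph-based reasoning; 2. introduces the TongueAtlas-4K dataset; 3. achieves state-of-the-art performance on tongue diagnosis.

Method: 1. Self-supervised pre-training with an MAE; 2. label relation modeling with a GAT; 3. clinical priors injected through KL divergence and regularization losses; 4. label imbalance handled with ASL and ensembling.

Result: MIRNet performs strongly on tongue diagnosis and can generalize to other diagnostic medical imaging tasks.

Insight: Combining self-supervised learning with expert-knowledge-driven graph reasoning effectively improves the robustness and generalization of medical image diagnosis.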

Abstract: Automated interpretation of medical images demands robust modeling of complex visual-semantic relationships while addressing annotation scarcity, label imbalance, and clinical plausibility constraints. We introduce MIRNet (Medical Image Reasoner Network), a novel framework that integrates self-supervised pre-training with constrained graph-based reasoning. Tongue image diagnosis is a particularly challenging domain that requires fine-grained visual and semantic understanding. Our approach leverages a self-supervised masked autoencoder (MAE) to learn transferable visual representations from unlabeled data; employs graph attention networks (GAT) to model label correlations through expert-defined structured graphs; enforces clinical priors via constraint-aware optimization using KL divergence and regularization losses; and mitigates imbalance using asymmetric loss (ASL) and boosting ensembles. To address annotation scarcity, we also introduce TongueAtlas-4K, a comprehensive expert-curated benchmark comprising 4,000 images annotated with 22 diagnostic labels, representing the largest public dataset in tongue analysis. Validation shows our method achieves state-of-the-art performance. While optimized for tongue diagnosis, the framework readily generalizes to broader diagnostic medical imaging tasks.
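For readers unfamiliar with asymmetric loss, the following sketch shows the usual multi-label formulation: positives and negatives get different focusing exponents, and negative probabilities are shifted down by a margin. The hyperparameter values are illustrative only; the abstract does not state MIRNet's settings.

```python
import math


def asymmetric_loss(probs, targets, gamma_pos=1.0, gamma_neg=4.0, margin=0.05):
    """Asymmetric loss for multi-label prediction, summed over labels.

    probs: predicted probabilities in (0, 1); targets: 0/1 labels.
    Hyperparameter values here are illustrative, not MIRNet's settings.
    """
    eps = 1e-8
    total = 0.0
    for p, y in zip(probs, targets):
        if y == 1:
            # Positive term: focal-style down-weighting of easy positives.
            total -= (1.0 - p) ** gamma_pos * math.log(max(p, eps))
        else:
            # Negative term: shift the probability down by the margin so that
            # very easy negatives (p <= margin) contribute nothing.
            p_m = max(p - margin, 0.0)
            total -= p_m ** gamma_neg * math.log(max(1.0 - p_m, eps))
    return total


easy_negative = asymmetric_loss([0.03], [0])    # below the margin
hard_negative = asymmetric_loss([0.90], [0])    # confidently wrong negative
missed_positive = asymmetric_loss([0.10], [1])  # rare positive label missed
```

The asymmetry matters for imbalanced labels: the many easy negatives are silenced, while a missed positive keeps a large loss.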

[71] AffordBot: 3D Fine-grained Embodied Reasoning via Multimodal Large Language Models

Xinyi Wang,Xun Yang,Yanlong Xu,Yuchen Wu,Zhen Li,Na Zhao

Main category: cs.CV

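TL;DR: The paper proposes AffordBot, a new framework for fine-grained 3D embodied reasoning. Using multimodal large language models and a tailored chain-of-thought reasoning paradigm, it predicts the spatial location, motion type, and motion axis of affordance elements in a 3D scene.

Details Motivation: Existing approaches usually operate at the object level or handle fine-grained affordance reasoning disjointedly, lacking coherent, instruction-driven grounding and reasoning. The paper aims to close this gap to enable more effective human-agent collaboration.

Contribution: 1) Introduces the new task of fine-grained 3D embodied reasoning; 2) designs the AffordBot framework, which combines multimodal large language models with chain-of-thought reasoning; 3) bridges the gap between 3D input and 2D-compatible MLLMs by rendering surround-view images of the scene and projecting 3D element candidates into them.

Method: The pipeline has two stages: 1) an active perception stage that selects the most informative viewpoint; 2) a step-by-step reasoning stage that localizes affordance elements and infers their motion. Multi-view rendering and 3D projection produce a rich visual representation.

Result: On the SceneFun3D dataset, AffordBot achieves state-of-the-art performance, showing strong generalization and physically grounded reasoning with only 3D point cloud input and MLLMs.

Insight: 1) Multi-view rendering and projection effectively solve the visual representation problem for 3D scenes; 2) chain-of-thought reasoning significantly improves task performance; 3) the framework is applicable to a broader range of embodied reasoning tasks.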

Abstract: Effective human-agent collaboration in physical environments requires understanding not only what to act upon, but also where the actionable elements are and how to interact with them. Existing approaches often operate at the object level or disjointedly handle fine-grained affordance reasoning, lacking coherent, instruction-driven grounding and reasoning. In this work, we introduce a new task: Fine-grained 3D Embodied Reasoning, which requires an agent to predict, for each referenced affordance element in a 3D scene, a structured triplet comprising its spatial location, motion type, and motion axis, based on a task instruction. To solve this task, we propose AffordBot, a novel framework that integrates Multimodal Large Language Models (MLLMs) with a tailored chain-of-thought (CoT) reasoning paradigm. To bridge the gap between 3D input and 2D-compatible MLLMs, we render surround-view images of the scene and project 3D element candidates into these views, forming a rich visual representation aligned with the scene geometry. Our CoT pipeline begins with an active perception stage, prompting the MLLM to select the most informative viewpoint based on the instruction, before proceeding with step-by-step reasoning to localize affordance elements and infer plausible interaction motions. Evaluated on the SceneFun3D dataset, AffordBot achieves state-of-the-art performance, demonstrating strong generalization and physically grounded reasoning with only 3D point cloud input and MLLMs.

[72] Anomagic: Crossmodal Prompt-driven Zero-shot Anomaly Generation

Yuxin Jiang,Wei Luo,Hui Zhang,Qiyu Chen,Haiming Yao,Weiming Shen,Yunkang Cao

Main category: cs.CV

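TL;DR: Anomagic is a zero-shot anomaly generation method that combines visual and textual information through crossmodal prompt encoding. It generates semantically coherent anomalies without any real anomaly samples and uses a contrastive refinement strategy to improve downstream anomaly detection.

Details Motivation: Existing anomaly generation methods usually depend on real anomaly samples, which limits generalization. Anomagic removes this limitation by achieving zero-shot anomaly generation through crossmodal prompts.

Contribution: 1. Proposes Anomagic, a zero-shot anomaly generation method that needs no real anomaly samples; 2. introduces the AnomVerse dataset of 12,987 anomaly-mask-caption triplets; 3. shows significant gains from Anomagic on downstream anomaly detection.

Method: 1. Crossmodal prompt encoding that combines visual and textual information; 2. an inpainting-based generation pipeline; 3. a contrastive refinement strategy that enforces precise alignment between anomalies and masks.

Result: Anomagic generates more realistic and varied anomalies, significantly improves downstream anomaly detection, and can generate anomalies for any normal image from user-defined prompts.

Insight: Combining crossmodal prompts with contrastive refinement gives a general and efficient solution for anomaly generation.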

Abstract: We propose Anomagic, a zero-shot anomaly generation method that produces semantically coherent anomalies without requiring any exemplar anomalies. By unifying both visual and textual cues through a crossmodal prompt encoding scheme, Anomagic leverages rich contextual information to steer an inpainting-based generation pipeline. A subsequent contrastive refinement strategy enforces precise alignment between synthesized anomalies and their masks, thereby bolstering downstream anomaly detection accuracy. To facilitate training, we introduce AnomVerse, a collection of 12,987 anomaly-mask-caption triplets assembled from 13 publicly available datasets, where captions are automatically generated by multimodal large language models using structured visual prompts and template-based textual hints. Extensive experiments demonstrate that Anomagic trained on AnomVerse can synthesize more realistic and varied anomalies than prior methods, yielding superior improvements in downstream anomaly detection. Furthermore, Anomagic can generate anomalies for any normal-category image using user-defined prompts, establishing a versatile foundation model for anomaly generation.

[73] DGFusion: Dual-guided Fusion for Robust Multi-Modal 3D Object Detection

Feiyang Jia,Caiyan Jia,Ailin Liu,Shaoqing Xu,Qiming Xia,Lin Liu,Lei Yang,Yan Gong,Ziying Song

Main category: cs.CV

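TL;DR: DGFusion proposes a dual-guided fusion method that uses difficulty-aware instance matching and dual-guided modules to improve multi-modal 3D object detection on hard instances (distant, small, or occluded objects).

Details Motivation: Existing multi-modal 3D object detection methods usually follow a single-guided paradigm and ignore the differences in information density between modalities, which makes them poorly suited to hard instances and compromises the safety of autonomous driving systems.

Contribution: 1. Proposes the DGFusion framework, which combines the Point-guide-Image and Image-guide-Point paradigms into a dual-guided paradigm; 2. designs the Difficulty-aware Instance Pair Matcher (DIPM), which generates instance pairs according to difficulty; 3. improves the effectiveness of multi-modal feature fusion.

Method: 1. Use DIPM to generate easy and hard instance pairs; 2. fuse multi-modal features through the Dual-guided Modules; 3. validate performance on the nuScenes dataset.

Result: Compared with baseline methods, DGFusion improves mAP, NDS, and average recall by 1.0%, 0.8%, and 1.3% respectively, and is robust across a range of hard scenarios.

Insight: The dual-guided paradigm exploits the strengths of each modality and significantly improves detection of hard instances, providing a more reliable solution for autonomous driving perception.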

Abstract: As a critical task in autonomous driving perception systems, 3D object detection is used to identify and track key objects, such as vehicles and pedestrians. However, detecting distant, small, or occluded objects (hard instances) remains a challenge, which directly compromises the safety of autonomous driving systems. We observe that existing multi-modal 3D object detection methods often follow a single-guided paradigm, failing to account for the differences in information density of hard instances between modalities. In this work, we propose DGFusion, based on the Dual-guided paradigm, which fully inherits the advantages of the Point-guide-Image paradigm and integrates the Image-guide-Point paradigm to address the limitations of the single paradigms. The core of DGFusion, the Difficulty-aware Instance Pair Matcher (DIPM), performs instance-level feature matching based on difficulty to generate easy and hard instance pairs, while the Dual-guided Modules exploit the advantages of both pair types to enable effective multi-modal feature fusion. Experimental results demonstrate that our DGFusion outperforms the baseline methods, with respective improvements of +1.0% mAP, +0.8% NDS, and +1.3% average recall on nuScenes. Extensive experiments demonstrate consistent robustness gains for hard instance detection across ego-distance, size, visibility, and small-scale training scenarios.

[74] FreDFT: Frequency Domain Fusion Transformer for Visible-Infrared Object Detection

Wencong Wu,Xiuwei Zhang,Hanlin Yin,Shun Dai,Hongxi Zhang,Yanning Zhang

Main category: cs.CV

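TL;DR: FreDFT proposes a frequency-domain multimodal fusion Transformer for visible-infrared object detection, improving performance through frequency domain attention and a mixed-scale feature fusion strategy.

Details Motivation: Existing methods mostly use Transformers in the spatial domain to fuse multimodal information and overlook the complementary information available in the frequency domain. In complex scenes this leaves multimodal information imbalanced and degrades detection performance.

Contribution: 1. Proposes MFDA (multimodal frequency domain attention) and FDFFL (frequency domain feed-forward layer) to mine complementary information in the frequency domain; 2. designs CGMM (cross-modal global modeling module) and LFEM (local feature enhancement module) to address modality imbalance.

Method: 1. MFDA and FDFFL for frequency-domain feature fusion; 2. CGMM for cross-modal feature interaction along the spatial and channel dimensions; 3. LFEM to enhance local features with convolutions and channel shuffle.

Result: Outperforms state-of-the-art methods on multiple public datasets.

Insight: Frequency-domain Transformers offer a clear advantage for multimodal fusion because they mine complementary information more effectively.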

Abstract: Visible-infrared object detection has gained considerable attention due to its detection performance in low light, fog, and rain conditions. However, visible and infrared modalities captured by different sensors suffer from an information imbalance problem in complex scenarios, which can cause inadequate cross-modal fusion and degraded detection performance. Furthermore, most existing methods use transformers in the spatial domain to capture complementary features, ignoring the advantages of developing frequency domain transformers to mine complementary information. To solve these weaknesses, we propose a frequency domain fusion transformer, called FreDFT, for visible-infrared object detection. The proposed approach employs a novel multimodal frequency domain attention (MFDA) to mine complementary information between modalities, and a frequency domain feed-forward layer (FDFFL) with a mixed-scale frequency feature fusion strategy is designed to better enhance multimodal features. To eliminate the imbalance of multimodal information, a cross-modal global modeling module (CGMM) is constructed to perform pixel-wise inter-modal feature interaction in a spatial and channel manner. Moreover, a local feature enhancement module (LFEM) is developed to strengthen multimodal local feature representation and promote multimodal feature fusion by using various convolution layers and applying a channel shuffle. Extensive experimental results have verified that our proposed FreDFT achieves excellent performance on multiple public datasets compared with other state-of-the-art methods. The code of our FreDFT is available at https://github.com/WenCongWu/FreDFT.

[75] MuSc-V2: Zero-Shot Multimodal Industrial Anomaly Classification and Segmentation with Mutual Scoring of Unlabeled Samples

Xurui Li,Feng Xue,Yu Zhou

Main category: cs.CV

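TL;DR: MuSc-V2 is a zero-shot multimodal framework for industrial anomaly classification and segmentation that achieves accurate anomaly detection through mutual scoring of unlabeled samples, with significant performance gains.

Details Motivation: Existing zero-shot anomaly detection methods overlook the fact that normal patches find many similar patches in both 2D and 3D, while anomalous patches are diverse and isolated. MuSc-V2 aims to exploit this property to improve detection.

Contribution: Proposes a flexible mutual scoring framework (MuSc-V2) that supports single-modality or multimodal input and improves detection through iterative point grouping, similarity neighborhood aggregation, and multi-scale feature fusion.

Method: 1. Improve the 3D representation with Iterative Point Grouping (IPG); 2. fuse 2D/3D neighborhood features with SNAMD; 3. score samples jointly with the Mutual Scoring Mechanism (MSM) and Cross-modal Anomaly Enhancement (CAE); 4. suppress false classifications with RsCon.

Result: Achieves AP gains of 23.7% on MVTec 3D-AD and 19.3% on Eyecandies, surpassing previous zero-shot benchmarks and even most few-shot methods.

Insight: Normal samples are highly similar to one another across modalities while anomalies are isolated, and cross-modal mutual scoring compensates for the limitations of a single modality.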

Abstract: Zero-shot anomaly classification (AC) and segmentation (AS) methods aim to identify and outline defects without using any labeled samples. In this paper, we reveal a key property that is overlooked by existing methods: normal image patches across industrial products typically find many other similar patches, not only in 2D appearance but also in 3D shapes, while anomalies remain diverse and isolated. To explicitly leverage this discriminative property, we propose a Mutual Scoring framework (MuSc-V2) for zero-shot AC/AS, which flexibly supports single 2D/3D or multimodality. Specifically, our method begins by improving 3D representation through Iterative Point Grouping (IPG), which reduces false positives from discontinuous surfaces. Then we use Similarity Neighborhood Aggregation with Multi-Degrees (SNAMD) to fuse 2D/3D neighborhood cues into more discriminative multi-scale patch features for mutual scoring. The core comprises a Mutual Scoring Mechanism (MSM) that lets samples within each modality assign scores to each other, and Cross-modal Anomaly Enhancement (CAE) that fuses 2D and 3D scores to recover modality-specific missing anomalies. Finally, Re-scoring with Constrained Neighborhood (RsCon) suppresses false classification based on similarity to more representative samples. Our framework flexibly works on both the full dataset and smaller subsets with consistently robust performance, ensuring seamless adaptability across diverse product lines. With this framework, MuSc-V2 achieves significant performance improvements: a +23.7% AP gain on the MVTec 3D-AD dataset and a +19.3% boost on the Eyecandies dataset, surpassing previous zero-shot benchmarks and even outperforming most few-shot methods. The code will be available at https://github.com/HUST-SLOW/MuSc-V2.
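The property behind mutual scoring (normal patches find close matches in other unlabeled samples, anomalies do not) can be shown with a toy sketch. It is a simplification under stated assumptions: plain Euclidean distance on made-up 2-D features, and a simple average over all other images. The function name `mutual_scores` is hypothetical, and IPG, SNAMD, CAE, and RsCon are not reproduced.

```python
import math


def mutual_scores(images):
    """Score each patch by how hard it is to match in the other unlabeled images.

    images: list of images, each a list of patch feature vectors.
    Returns per-image lists of patch scores (higher = more anomalous).
    """
    scores = []
    for i, patches in enumerate(images):
        others = [img for j, img in enumerate(images) if j != i]
        image_scores = []
        for patch in patches:
            # Distance to the nearest patch in each other image, then averaged.
            nearest = [min(math.dist(patch, q) for q in other) for other in others]
            image_scores.append(sum(nearest) / len(nearest))
        scores.append(image_scores)
    return scores


# Three toy "images" with two 2-D patch features each; one patch is an outlier.
images = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.1, 0.0], [1.0, 0.9]],
    [[0.0, 0.1], [5.0, 5.0]],  # second patch has no close match elsewhere
]
scores = mutual_scores(images)
```

The isolated patch in the third image receives by far the highest score, even though no image was labeled as normal or anomalous.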

[76] Image Aesthetic Reasoning via HCM-GRPO: Empowering Compact Model for Superior Performance

Zhiyuan Hu,Zheng Sun,Yi Wei,Long Yu

Main category: cs.CV

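TL;DR: The paper proposes HCM-GRPO, which combines a Hard Cases Mining (HCM) strategy with a Dynamic Proportional Accuracy (DPA) reward to significantly improve image aesthetic reasoning, and uses a dataset of over 128k samples to show that it outperforms large open-source and closed-source models.

Details Motivation: Existing Multimodal Large Language Models (MLLMs) perform poorly at image aesthetic reasoning, and relevant datasets are scarce, which holds back progress on image screening.

Contribution: 1) Builds an image aesthetic reasoning dataset with over 128k samples; 2) proposes HCM-GRPO, which significantly improves reasoning ability; 3) shows that a small model trained with this method can surpass much larger models.

Method: Adds Hard Cases Mining (HCM) and a Dynamic Proportional Accuracy (DPA) reward to Group Relative Policy Optimization (GRPO), forming the HCM-GRPO framework.

Result: Experiments show that a small model trained with HCM-GRPO surpasses closed-source large models such as GPT4o and Qwen-VL-Max.

Insight: Dataset quality and methodological improvements such as HCM and DPA are key to image aesthetic reasoning; with the right optimization, a small model can match or exceed large models.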

Abstract: The performance of image generation has been significantly improved in recent years. However, the study of image screening is rare and its performance with Multimodal Large Language Models (MLLMs) is unsatisfactory due to the lack of data and the weak image aesthetic reasoning ability in MLLMs. In this work, we propose a complete solution to address these problems in terms of data and methodology. For data, we collect a comprehensive image screening dataset with over 128k samples (about 640k images). Each sample consists of an original image and four generated images. The dataset evaluates the image aesthetic reasoning ability under four aspects: appearance deformation, physical shadow, placement layout, and extension rationality. Regarding data annotation, we investigate multiple approaches, including purely manual, fully automated, and answer-driven annotations, to acquire high-quality chains of thought (CoT) data in the most cost-effective manner. Methodologically, we introduce a Hard Cases Mining (HCM) strategy with a Dynamic Proportional Accuracy (DPA) reward into the Group Relative Policy Optimization (GRPO) framework, called HCM-GRPO. This enhanced method demonstrates superior image aesthetic reasoning capabilities compared to the original GRPO. Our experimental results reveal that even state-of-the-art closed-source MLLMs, such as GPT4o and Qwen-VL-Max, exhibit performance akin to random guessing in image aesthetic reasoning. In contrast, by leveraging the HCM-GRPO, we are able to surpass the scores of both large-scale open-source and leading closed-source models with a much smaller model.
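As background for the GRPO framework that HCM-GRPO builds on, the sketch below shows the group-relative advantage: several responses are sampled for one prompt and each reward is normalized against the group's mean and standard deviation. The population standard deviation and the `eps` constant are assumptions of this sketch. How HCM and the DPA reward are defined is not stated in the abstract, so neither is reproduced.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each sampled response's reward against its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])   # responses disagree
solved = group_relative_advantages([1.0, 1.0, 1.0, 1.0])  # every response correct
```

A group whose responses all receive the same reward yields zero advantages and therefore no gradient signal, so only prompts with mixed outcomes drive learning under this normalization.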

[77] When Eyes and Ears Disagree: Can MLLMs Discern Audio-Visual Confusion?

Qilang Ye,Wei Zeng,Meng Liu,Jie Zhang,Yupeng Hu,Zitong Yu,Yu Zhou

Main category: cs.CV

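TL;DR: The paper introduces AV-ConfuseBench, a new benchmark that tests multimodal large language models (MLLMs) in "audio-visual confusion" scenes, and finds that visually dominated reasoning makes it hard for models to recognize missing audio. The authors therefore propose RL-CoMM, which uses reinforcement learning together with a Large Audio Language Model (LALM) to improve multimodal reasoning.

Details Motivation: To study how MLLMs behave when audio and vision disagree, and whether they can correctly recognize absent audio despite visually dominated reasoning.

Contribution: 1) Proposes the AV-ConfuseBench benchmark; 2) designs RL-CoMM, which combines reinforcement learning with an external audio model to improve multimodal reasoning; 3) significantly improves model performance in experiments.

Method: RL-CoMM has two stages: 1) a Large Audio Language Model (LALM) generates audio-only reasoning as a reference, and a Step-wise Reasoning Reward function is designed to optimize the MLLM; 2) Answer-centered Confidence Optimization reduces uncertainty.

Result: With limited training data, RL-CoMM improves accuracy on audio-visual question answering and audio-visual hallucination tasks by 10~30%.

Insight: MLLMs are prone to visual dominance when audio and vision disagree; combining audio-only reasoning with reinforcement learning effectively improves robustness on multimodal tasks.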

Abstract: Can Multimodal Large Language Models (MLLMs) discern confused objects that are visually present but audio-absent? To study this, we introduce a new benchmark, AV-ConfuseBench, which simulates an "Audio-Visual Confusion" scene by modifying the corresponding sound of an object in the video, e.g., mute the sounding object and ask MLLMs "Is there a/an muted-object sound". Experimental results reveal that MLLMs, such as Qwen2.5-Omni and Gemini 2.5, struggle to discriminate non-existent audio due to visually dominated reasoning. Motivated by this observation, we introduce RL-CoMM, a Reinforcement Learning-based Collaborative Multi-MLLM that is built upon the Qwen2.5-Omni foundation. RL-CoMM includes two stages: 1) To alleviate visually dominated ambiguities, we introduce an external model, a Large Audio Language Model (LALM), as the reference model to generate audio-only reasoning. Then, we design a Step-wise Reasoning Reward function that enables MLLMs to self-improve audio-visual reasoning with the audio-only reference. 2) To ensure an accurate answer prediction, we introduce Answer-centered Confidence Optimization to reduce the uncertainty of potential heterogeneous reasoning differences. Extensive experiments on audio-visual question answering and audio-visual hallucination show that RL-CoMM improves the accuracy by 10~30% over the baseline model with limited training data. Follow: https://github.com/rikeilong/AVConfusion.

[78] Multivariate Gaussian Representation Learning for Medical Action Evaluation

Luming Yang,Haoxian Liu,Siqing Li,Alper Yilmaz

Main category: cs.CV

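TL;DR: The paper proposes GaussMedAct, a medical action evaluation method based on multivariate Gaussian representation, and introduces the CPREval-6k dataset, improving medical action analysis through adaptive spatiotemporal representation learning.

Details Motivation: Fine-grained action evaluation in medical vision suffers from a lack of datasets, stringent precision requirements, and insufficient spatiotemporal modeling of rapid actions, so new methods and datasets are needed.

Contribution: 1. Introduces the CPREval-6k dataset, containing 6,372 expert-annotated videos with 22 clinical labels; 2. proposes the GaussMedAct framework, which uses multivariate Gaussian representation for adaptive spatiotemporal analysis.

Method: 1. Multivariate Gaussian Representation projects joint motions into a temporally scaled multi-dimensional space and decomposes actions into adaptive 3D Gaussian tokens; 2. Hybrid Spatial Encoding extracts skeletal information with a Cartesian and Vector dual-stream strategy.

Result: Achieves 92.1% Top-1 accuracy on the benchmark, 5.9% higher than the ST-GCN baseline, with only 10% of the FLOPs. Cross-dataset experiments confirm its robustness.

Insight: Multivariate Gaussian representation captures motion semantics effectively while remaining robust to spatiotemporal noise, offering a new direction for medical action analysis.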

Abstract: Fine-grained action evaluation in medical vision faces unique challenges due to the unavailability of comprehensive datasets, stringent precision requirements, and insufficient spatiotemporal dynamic modeling of very rapid actions. To support development and evaluation, we introduce CPREval-6k, a multi-view, multi-label medical action benchmark containing 6,372 expert-annotated videos with 22 clinical labels. Using this dataset, we present GaussMedAct, a multivariate Gaussian encoding framework, to advance medical motion analysis through adaptive spatiotemporal representation learning. Multivariate Gaussian Representation projects the joint motions to a temporally scaled multi-dimensional space, and decomposes actions into adaptive 3D Gaussians that serve as tokens. These tokens preserve motion semantics through anisotropic covariance modeling while maintaining robustness to spatiotemporal noise. Hybrid Spatial Encoding, employing a Cartesian and Vector dual-stream strategy, effectively utilizes skeletal information in the form of joint and bone features. The proposed method achieves 92.1% Top-1 accuracy with real-time inference on the benchmark, outperforming the ST-GCN baseline by +5.9% accuracy with only 10% FLOPs. Cross-dataset experiments confirm the superiority of our method in robustness.

[79] Perceive, Act and Correct: Confidence Is Not Enough for Hyperspectral Classification

Muzhou Yang,Wuzhou Quan,Mingqiang Wei

Main category: cs.CV

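TL;DR: The paper proposes CABIN, a framework that uses a closed-loop learning process of perception, action, and correction to address misleading confidence in hyperspectral image classification.

Details Motivation: In hyperspectral image classification, relying on confidence alone easily leads to errors, especially under sparse annotations or class imbalance, where models overfit confident but wrong predictions.

Contribution: Proposes the CABIN framework, comprising uncertainty estimation, an Uncertainty-Guided Dual Sampling Strategy, and a Fine-Grained Dynamic Assignment Strategy, which reduces bias and improves generalization.

Method: 1. Estimate epistemic uncertainty to perceive ambiguous regions; 2. apply uncertainty-guided dual sampling to explore uncertain samples while anchoring confident ones; 3. dynamically assign pseudo-labeled data to subsets and apply tailored losses.

Result: Experiments show that CABIN significantly improves the labeling efficiency and performance of a range of existing methods.

Insight: With closed-loop learning and dynamic strategies, CABIN handles uncertainty more reliably, offering a new direction for semi-supervised learning.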

Abstract: Confidence alone is often misleading in hyperspectral image classification, as models tend to mistake high predictive scores for correctness while lacking awareness of uncertainty. This leads to confirmation bias, especially under sparse annotations or class imbalance, where models overfit confident errors and fail to generalize. We propose CABIN (Cognitive-Aware Behavior-Informed learNing), a semi-supervised framework that addresses this limitation through a closed-loop learning process of perception, action, and correction. CABIN first develops perceptual awareness by estimating epistemic uncertainty, identifying ambiguous regions where errors are likely to occur. It then acts by adopting an Uncertainty-Guided Dual Sampling Strategy, selecting uncertain samples for exploration while anchoring confident ones as stable pseudo-labels to reduce bias. To correct noisy supervision, CABIN introduces a Fine-Grained Dynamic Assignment Strategy that categorizes pseudo-labeled data into reliable, ambiguous, and noisy subsets, applying tailored losses to enhance generalization. Experimental results show that a wide range of state-of-the-art methods benefit from the integration of CABIN, with improved labeling efficiency and performance.

[80] VLF-MSC: Vision-Language Feature-Based Multimodal Semantic Communication System

Gwangyeon Ahn,Jiwan Seo,Joonhyuk Kang

Main category: cs.CV

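TL;DR: VLF-MSC proposes a unified multimodal semantic communication system based on vision-language features. A single compact vision-language representation supports both image and text generation, improving spectral efficiency and adaptability.

Details Motivation: Existing semantic communication techniques usually process each modality separately, which leads to low spectral efficiency and poor adaptability. VLF-MSC aims to solve this with a unified vision-language feature representation.

Contribution: 1) Proposes a unified multimodal semantic communication system based on vision-language features; 2) uses pre-trained models for encoding and decoding, improving robustness and semantic fidelity; 3) shows experimentally that the system outperforms baselines at low SNR.

Method: 1) Encode the source image into a vision-language feature (VLF) with a pre-trained vision-language model (VLM); 2) transmit the VLF over the wireless channel; 3) generate text and an image from the VLF at the receiver.

Result: Experiments show that VLF-MSC outperforms text-only and image-only baselines at low SNR while significantly reducing bandwidth requirements.

Insight: A unified vision-language feature representation can support multiple generation tasks at once, and using pre-trained models markedly improves robustness and semantic fidelity.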

Abstract: We propose Vision-Language Feature-based Multimodal Semantic Communication (VLF-MSC), a unified system that transmits a single compact vision-language representation to support both image and text generation at the receiver. Unlike existing semantic communication techniques that process each modality separately, VLF-MSC employs a pre-trained vision-language model (VLM) to encode the source image into a vision-language semantic feature (VLF), which is transmitted over the wireless channel. At the receiver, a decoder-based language model and a diffusion-based image generator are both conditioned on the VLF to produce a descriptive text and a semantically aligned image. This unified representation eliminates the need for modality-specific streams or retransmissions, improving spectral efficiency and adaptability. By leveraging foundation models, the system achieves robustness to channel noise while preserving semantic fidelity. Experiments demonstrate that VLF-MSC outperforms text-only and image-only baselines, achieving higher semantic accuracy for both modalities under low SNR with significantly reduced bandwidth.
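To make the "low SNR" setting concrete, the sketch below passes a feature vector through an additive white Gaussian noise (AWGN) channel at a chosen signal-to-noise ratio. The AWGN model is an assumption of this sketch, since the abstract only says "wireless channel"; the sine-wave vector is a stand-in for a real VLF, and the function names are hypothetical.

```python
import math
import random


def awgn_channel(feature, snr_db, rng):
    """Add white Gaussian noise so that signal power / noise power matches snr_db."""
    signal_power = sum(x * x for x in feature) / len(feature)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    sigma = math.sqrt(noise_power)
    return [x + rng.gauss(0.0, sigma) for x in feature]


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


rng = random.Random(0)
feature = [math.sin(0.1 * k) for k in range(2000)]  # stand-in for a VLF vector
noisy_low = awgn_channel(feature, snr_db=0.0, rng=rng)    # noise as strong as signal
noisy_high = awgn_channel(feature, snr_db=20.0, rng=rng)  # noise 100x weaker
```

At 0 dB the received feature is distorted by noise of the same power as the signal, which is the regime where a receiver must rely on robust decoders to recover the semantics.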

[81] Mitigating Error Accumulation in Co-Speech Motion Generation via Global Rotation Diffusion and Multi-Level Constraints

Xiangyue Zhang,Jianfang Li,Jianqiang Ren,Jiaxu Zhang

Main category: cs.CV

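TL;DR: GlobalDiff is a diffusion-based framework that, for the first time, operates directly in the space of global joint rotations. With a multi-level constraint scheme, it alleviates hierarchical error accumulation and significantly improves the accuracy and smoothness of co-speech motion generation.

Details Motivation: Existing generative methods usually operate on local joint rotations, which causes hierarchical error accumulation and produces unstable, implausible motion at the end-effectors. To solve this, the paper proposes operating directly in global rotation space.

Contribution: 1. Proposes GlobalDiff, the first diffusion framework to operate in global joint rotation space, decoupling each joint's prediction from upstream dependencies; 2. introduces a multi-level constraint scheme (joint structure, skeleton structure, and temporal structure constraints) to compensate for the missing structural priors in global rotation space.

Method: GlobalDiff generates motion directly with a global rotation diffusion model combined with multi-level constraints: the joint structure constraint introduces virtual anchor points to capture fine-grained orientation; the skeleton structure constraint keeps bone angles consistent; the temporal structure constraint uses a multi-scale variational encoder to align the temporal patterns of the generated motion.

Result: On standard co-speech motion generation benchmarks, GlobalDiff improves performance by 46.0% over the current state of the art and generates smoother, more accurate motion.

Insight: Operating directly in global rotation space substantially reduces hierarchical error accumulation, but extra structural constraints are needed to keep the motion plausible. The multi-level constraint scheme is key for this kind of generation task.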

Abstract: Reliable co-speech motion generation requires precise motion representation and consistent structural priors across all joints. Existing generative methods typically operate on local joint rotations, which are defined hierarchically based on the skeleton structure. This leads to cumulative errors during generation, manifesting as unstable and implausible motions at end-effectors. In this work, we propose GlobalDiff, a diffusion-based framework that operates directly in the space of global joint rotations for the first time, fundamentally decoupling each joint’s prediction from upstream dependencies and alleviating hierarchical error accumulation. To compensate for the absence of structural priors in global rotation space, we introduce a multi-level constraint scheme. Specifically, a joint structure constraint introduces virtual anchor points around each joint to better capture fine-grained orientation. A skeleton structure constraint enforces angular consistency across bones to maintain structural integrity. A temporal structure constraint utilizes a multi-scale variational encoder to align the generated motion with ground-truth temporal patterns. These constraints jointly regularize the global diffusion process and reinforce structural awareness. Extensive evaluations on standard co-speech benchmarks show that GlobalDiff generates smooth and accurate motions, improving the performance by 46.0% compared to the current SOTA under multiple speaker identities.
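
The hierarchical error accumulation described above can be illustrated with a toy example. The sketch below is not the GlobalDiff method: it assumes a planar (2D) chain of unit-length bones, and the function names are hypothetical. It only shows why the same per-joint angular error hurts the end-effector more when rotations are local (each joint inherits every upstream error) than when they are global (each bone's orientation is predicted independently).

```python
import math

def local_to_global(local_angles):
    """Global orientation of each bone = running sum of upstream local rotations."""
    heading, out = 0.0, []
    for a in local_angles:
        heading += a
        out.append(heading)
    return out

def end_effector_from_global(global_angles, bone_len=1.0):
    """Tip position of a planar chain given each bone's global orientation."""
    x = sum(bone_len * math.cos(g) for g in global_angles)
    y = sum(bone_len * math.sin(g) for g in global_angles)
    return (x, y)

def end_effector_from_local(local_angles, bone_len=1.0):
    """Tip position when each bone is rotated relative to its parent."""
    return end_effector_from_global(local_to_global(local_angles), bone_len)

# Toy setup (assumption): a straight 4-bone chain, every predicted angle off by 0.1 rad.
truth = [0.0, 0.0, 0.0, 0.0]          # identical in local and global form for a straight chain
noisy = [a + 0.1 for a in truth]
target = end_effector_from_local(truth)
err_local = math.dist(end_effector_from_local(noisy), target)
err_global = math.dist(end_effector_from_global(noisy), target)
```

In this toy case `err_local` is larger than `err_global`, because the local parameterization compounds the error at every joint down the chain. The trade-off noted in the abstract is that global rotations drop the structural prior the hierarchy provided, which is what the multi-level constraints are meant to restore.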

[82] GridPrune: From “Where to Look” to “What to Select” in Visual Token Pruning for MLLMs

Yuxiang Duan,Ao Li,Yingqin Li,Luyu Li,Pengwei Wang

Main category: cs.CV

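TL;DR: GridPrune is a new visual token pruning method that uses a two-stage "guide-globally, select-locally" strategy to improve the efficiency of MLLMs.

Details Motivation: Cognitive science suggests that the human visual system allocates attention in two stages: first "where to look", then "what to select". Existing visual token pruning methods optimize "what to select" directly and neglect spatial allocation.

Contribution: Proposes GridPrune, which dynamically allocates a token budget across spatial zones and then selects locally within each zone, addressing the inefficient spatial allocation and positional bias of existing methods.

Method: GridPrune has two steps: 1) text-conditional guidance dynamically allocates the token budget across spatial zones; 2) local selection is performed within each budgeted zone.

Result: On LLaVA-NeXT-7B, GridPrune retains 96.98% of full performance while using only 11.1% of the tokens, outperforming the best-performing baseline by 2.34% at the same pruning rate.

Insight: Human attention allocation can inspire efficient token pruning; the combination of global guidance and local selection is the key.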

Abstract: Multimodal large language models (MLLMs) have shown remarkable capabilities in a wide range of vision-language tasks. However, the large number of visual tokens introduces significant computational overhead. To address this issue, visual token pruning has emerged as a key technique for enhancing the efficiency of MLLMs. In cognitive science, humans tend to first determine which regions of a scene to attend to (“where to look”) before deciding which specific elements within those regions to process in detail (“what to select”). This two-stage strategy enables the visual system to efficiently allocate attention at a coarse spatial level before performing fine-grained selection. However, existing pruning methods primarily focus on directly optimizing “what to select”, typically using attention scores or similarity metrics. They rarely consider “where to look”, which has been shown to lead to inefficient spatial allocation, positional bias, and the retention of irrelevant or redundant tokens. In this paper, we propose GridPrune, a method that replaces the global Top-K mechanism with a “guide-globally, select-locally” zonal selection system. GridPrune splits the pruning process into two steps: first, it uses text-conditional guidance to dynamically allocate a token budget across spatial zones; and then, it performs local selection within each budgeted zone. Experimental results demonstrate that GridPrune achieves superior performance across various MLLM architectures. On LLaVA-NeXT-7B, GridPrune retains 96.98% of the full performance while using 11.1% of the tokens, outperforming the best-performing baseline by 2.34% at the same pruning rate.
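
The two-step idea (allocate a budget per zone, then select within each zone) can be sketched as follows. This is a minimal illustration under stated assumptions, not GridPrune's implementation: it assumes each visual token already has a single non-negative text-relevance score, that zones are square blocks of a row-major token grid, and that the budget is split in proportion to each zone's total score. The names `allocate_budget` and `zonal_prune` are hypothetical.

```python
def allocate_budget(weights, sizes, budget):
    """Step 1: split `budget` across zones in proportion to zone weight.

    Uses largest-remainder rounding and never gives a zone more slots than it has tokens.
    Assumes at least one weight is positive.
    """
    total = sum(weights)
    quotas = [budget * w / total for w in weights]
    alloc = [min(int(q), s) for q, s in zip(quotas, sizes)]
    order = sorted(range(len(quotas)), key=lambda i: quotas[i] - int(quotas[i]), reverse=True)
    remaining = budget - sum(alloc)
    while remaining > 0:
        gave = False
        for i in order:
            if remaining > 0 and alloc[i] < sizes[i]:
                alloc[i] += 1
                remaining -= 1
                gave = True
        if not gave:  # every zone is full
            break
    return alloc

def zonal_prune(scores, height, width, zone, budget):
    """Step 2: keep the top-scoring tokens inside each zone, up to that zone's budget."""
    assert len(scores) == height * width
    zones = {}
    for idx in range(len(scores)):
        r, c = divmod(idx, width)
        zones.setdefault((r // zone, c // zone), []).append(idx)
    keys = sorted(zones)
    weights = [sum(scores[i] for i in zones[k]) for k in keys]
    sizes = [len(zones[k]) for k in keys]
    alloc = allocate_budget(weights, sizes, budget)
    kept = []
    for k, n in zip(keys, alloc):
        kept.extend(sorted(zones[k], key=lambda i: scores[i], reverse=True)[:n])
    return sorted(kept)

# 4x4 token grid, 2x2 zones, keep 4 tokens (illustrative scores).
scores = [9, 8, 5, 1,
          7, 6, 1, 1,
          1, 1, 4, 1,
          1, 1, 1, 2]
kept = zonal_prune(scores, height=4, width=4, zone=2, budget=4)
global_topk = sorted(sorted(range(16), key=lambda i: scores[i], reverse=True)[:4])
```

With these scores a global Top-K keeps only the top-left zone, while the zonal version spends part of the budget on other zones that still contain relevant tokens, which is the spatial-allocation effect the abstract argues for.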

[83] SUGAR: Learning Skeleton Representation with Visual-Motion Knowledge for Action Recognition

Qilang Ye,Yu Zhou,Lian He,Jie Zhang,Xuanming Guo,Jiayu Zhang,Mingkui Tan,Weicheng Xie,Yue Sun,Tao Tan,Xiaochen Yuan,Ghada Khoriba,Zitong Yu

Main category: cs.CV

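TL;DR: SUGAR is a new paradigm that learns skeleton representations with visual-motion knowledge and uses large language models (LLMs) for action classification and description.

Details Motivation: Traditional action recognition methods usually rely on hand-crafted features or end-to-end deep learning, whereas LLMs hold rich implicit knowledge and strong transferability. The paper explores how to combine LLMs with human skeleton data, addressing how LLMs can understand skeletons and how they can distinguish among actions.

Contribution: 1) Proposes the SUGAR paradigm, which uses large-scale video models to generate visual-motion knowledge as a prior for supervising skeleton learning; 2) designs the Temporal Query Projection (TQP) module to model long skeleton sequences; 3) validates SUGAR's generalization in zero-shot scenarios.

Method: 1) Off-the-shelf large-scale video models generate visual and motion information; 2) this prior knowledge supervises skeleton learning to yield discrete representations; 3) an LLM with untouched pre-trained weights interprets the representations and generates action targets and descriptions; 4) a TQP module models long skeleton sequences.

Result: SUGAR performs strongly on several skeleton-based action classification benchmarks and, in zero-shot scenarios, is more versatile than linear-based methods.

Insight: SUGAR shows the potential of combining LLMs with prior knowledge and offers a new direction for multimodal action recognition.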

Abstract: Large Language Models (LLMs) hold rich implicit knowledge and powerful transferability. In this paper, we explore the combination of LLMs with the human skeleton to perform action classification and description. However, when treating LLM as a recognizer, two questions arise: 1) How can LLMs understand skeleton? 2) How can LLMs distinguish among actions? To address these problems, we introduce a novel paradigm named learning Skeleton representation with visUal-motion knowledGe for Action Recognition (SUGAR). In our pipeline, we first utilize off-the-shelf large-scale video models as a knowledge base to generate visual, motion information related to actions. Then, we propose to supervise skeleton learning through this prior knowledge to yield discrete representations. Finally, we use the LLM with untouched pre-training weights to understand these representations and generate the desired action targets and descriptions. Notably, we present a Temporal Query Projection (TQP) module to continuously model the skeleton signals with long sequences. Experiments on several skeleton-based action classification benchmarks demonstrate the efficacy of our SUGAR. Moreover, experiments on zero-shot scenarios show that SUGAR is more versatile than linear-based methods.

[84] MTAttack: Multi-Target Backdoor Attacks against Large Vision-Language Models

Zihan Wang,Guansong Pang,Wenjun Miao,Jin Zheng,Xiao Bai

Main category: cs.CV

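TL;DR: MTAttack is a multi-target backdoor attack framework against large vision-language models (LVLMs); a dedicated optimization method enforces accurate mappings between multiple triggers and their targets, exposing the vulnerability of LVLMs to multi-target attacks.

Details Motivation: Existing backdoor attacks focus on single-target attacks; multi-target attacks are hard to realize because of feature interference among triggers. The authors identify this threat and propose MTAttack to address the challenges of multi-target attacks.

Contribution: Proposes MTAttack, the first multi-target backdoor attack framework, with a Proxy Space Partitioning constraint and a Trigger Prototype Anchoring constraint that resolve interference among multiple trigger-target mappings.

Method: The core of MTAttack is a new optimization method with two constraints, Proxy Space Partitioning and Trigger Prototype Anchoring. It jointly optimizes multiple triggers in the latent space so that each trigger independently maps to a unique proxy class while the triggers remain separable.

Result: Experiments show that MTAttack achieves a high success rate for multi-target attacks, substantially outperforming existing attack methods, with strong generalization across datasets and robustness against backdoor defense strategies.

Insight: MTAttack exposes the vulnerability of LVLMs to multi-target backdoor attacks and underscores the urgency of defending against such threats.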

Abstract: Recent advances in Large Visual Language Models (LVLMs) have demonstrated impressive performance across various vision-language tasks by leveraging large-scale image-text pretraining and instruction tuning. However, the security vulnerabilities of LVLMs have become increasingly concerning, particularly their susceptibility to backdoor attacks. Existing backdoor attacks focus on single-target attacks, i.e., targeting a single malicious output associated with a specific trigger. In this work, we uncover multi-target backdoor attacks, where multiple independent triggers corresponding to different attack targets are added in a single pass of training, posing a greater threat to LVLMs in real-world applications. Executing such attacks in LVLMs is challenging since there can be many incorrect trigger-target mappings due to severe feature interference among different triggers. To address this challenge, we propose MTAttack, the first multi-target backdoor attack framework for enforcing accurate multiple trigger-target mappings in LVLMs. The core of MTAttack is a novel optimization method with two constraints, namely Proxy Space Partitioning constraint and Trigger Prototype Anchoring constraint. It jointly optimizes multiple triggers in the latent space, with each trigger independently mapping clean images to a unique proxy class while at the same time guaranteeing their separability. Experiments on popular benchmarks demonstrate a high success rate of MTAttack for multi-target attacks, substantially outperforming existing attack methods. Furthermore, our attack exhibits strong generalizability across datasets and robustness against backdoor defense strategies. These findings highlight the vulnerability of LVLMs to multi-target backdoor attacks and underscore the urgent need for mitigating such threats. Code is available at https://github.com/mala-lab/MTAttack.

[85] RobIA: Robust Instance-aware Continual Test-time Adaptation for Deep Stereo

Jueun Ko,Hyewon Park,Hyesong Choi,Dongbo Min

Main category: cs.CV

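TL;DR: RobIA is a robust, instance-aware continual test-time adaptation framework for dynamic domain shifts in stereo depth estimation, achieving efficient adaptation through dynamic routing and a robust teacher model.

Details Motivation: Stereo depth estimation in real-world environments faces dynamic domain shifts, sparse or unreliable supervision, and the high cost of acquiring dense labels; conventional test-time adaptation methods have limited effectiveness under continual shifts.

Contribution: 1) Attend-and-Excite Mixture-of-Experts (AttEx-MoE), which dynamically routes inputs for efficient adaptation; 2) Robust AdaptBN Teacher, which supplements sparse labels with pseudo-supervision.

Method: Combines a lightweight self-attention mechanism for dynamic routing with a PEFT-based teacher model, providing input-specific flexibility and broad supervision coverage and improving generalization under domain shift.

Result: Experiments show that RobIA achieves superior adaptation performance across dynamic target domains while remaining computationally efficient.

Insight: With instance-aware dynamic adaptation and pseudo-supervision, RobIA is more robust and adaptable under continual domain shift.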

Abstract: Stereo Depth Estimation in real-world environments poses significant challenges due to dynamic domain shifts, sparse or unreliable supervision, and the high cost of acquiring dense ground-truth labels. While recent Test-Time Adaptation (TTA) methods offer promising solutions, most rely on static target domain assumptions and input-invariant adaptation strategies, limiting their effectiveness under continual shifts. In this paper, we propose RobIA, a novel Robust, Instance-Aware framework for Continual Test-Time Adaptation (CTTA) in stereo depth estimation. RobIA integrates two key components: (1) Attend-and-Excite Mixture-of-Experts (AttEx-MoE), a parameter-efficient module that dynamically routes input to frozen experts via a lightweight self-attention mechanism tailored to epipolar geometry, and (2) Robust AdaptBN Teacher, a PEFT-based teacher model that provides dense pseudo-supervision by complementing sparse handcrafted labels. This strategy enables input-specific flexibility and broad supervision coverage, improving generalization under domain shift. Extensive experiments demonstrate that RobIA achieves superior adaptation performance across dynamic target domains while maintaining computational efficiency.

[86] Explicit Temporal-Semantic Modeling for Dense Video Captioning via Context-Aware Cross-Modal Interaction

Mingda Jia,Weiliang Meng,Zenghuang Fu,Yiheng Li,Qi Zeng,Yifan Zhang,Ju Xin,Rongtao Xu,Jiguang Zhang,Xiaopeng Zhang

Main category: cs.CV

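TL;DR: For dense video captioning, the paper proposes CACMI, an explicit temporal-semantic modeling framework that improves performance through cross-modal retrieval and context-aware feature enhancement.

Details Motivation: Existing dense video captioning methods mostly rely on implicit modeling (for example, frame-level features) and neglect the temporal coherence of event sequences and the semantic completeness of visual context, which limits their effectiveness.

Contribution: Proposes the CACMI framework, which explicitly models the temporal characteristics and semantic context of videos, combining cross-modal retrieval with query-guided attention to improve caption quality and accuracy.

Method: 1. Cross-modal Frame Aggregation: aggregates relevant frames and extracts temporally coherent, event-aligned textual features through cross-modal retrieval; 2. Context-aware Feature Enhancement: uses query-guided attention to integrate visual dynamics with pseudo-event semantics.

Result: Achieves state-of-the-art performance on the ActivityNet Captions and YouCook2 datasets.

Insight: Explicit modeling of temporal and semantic information is key to dense video captioning; cross-modal interaction and context awareness compensate for the weaknesses of implicit approaches.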

Abstract: Dense video captioning jointly localizes and captions salient events in untrimmed videos. Recent methods primarily focus on leveraging additional prior knowledge and advanced multi-task architectures to achieve competitive performance. However, these pipelines rely on implicit modeling that uses frame-level or fragmented video features, failing to capture the temporal coherence across event sequences and comprehensive semantics within visual contexts. To address this, we propose an explicit temporal-semantic modeling framework called Context-Aware Cross-Modal Interaction (CACMI), which leverages both latent temporal characteristics within videos and linguistic semantics from text corpus. Specifically, our model consists of two core components: Cross-modal Frame Aggregation aggregates relevant frames to extract temporally coherent, event-aligned textual features through cross-modal retrieval; and Context-aware Feature Enhancement utilizes query-guided attention to integrate visual dynamics with pseudo-event semantics. Extensive experiments on the ActivityNet Captions and YouCook2 datasets demonstrate that CACMI achieves the state-of-the-art performance on dense video captioning task.

[87] Right Looks, Wrong Reasons: Compositional Fidelity in Text-to-Image Generation

Mayank Vatsa,Aparna Bharati,Richa Singh

Main category: cs.CV

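TL;DR: The paper argues that today's leading text-to-image models are deficient at logical composition, with performance collapsing on combinations of negation, counting, and spatial relations, and it analyzes three key factors behind the failure.

Details Motivation: Existing text-to-image models perform well on individual logical primitives, but performance drops sharply when primitives such as negation, counting, and spatial relations are combined.

Contribution: The survey documents the collapse of compositional ability and traces it to three factors: training data, attention architecture, and evaluation metrics.

Method: The paper surveys how current leading models perform on compositional logic (negation, counting, spatial relations) and analyzes the root causes of their failures.

Result: Existing models perform very poorly on compositional tasks, and neither current solutions nor simple scaling closes the gap.

Insight: Genuine compositionality will require fundamental advances in representation and reasoning rather than incremental adjustments to existing architectures.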

Abstract: The architectural blueprint of today’s leading text-to-image models contains a fundamental flaw: an inability to handle logical composition. This survey investigates this breakdown across three core primitives: negation, counting, and spatial relations. Our analysis reveals a dramatic performance collapse: models that are accurate on single primitives fail precipitously when these are combined, exposing severe interference. We trace this failure to three key factors. First, training data show a near-total absence of explicit negations. Second, continuous attention architectures are fundamentally unsuitable for discrete logic. Third, evaluation metrics reward visual plausibility over constraint satisfaction. By analyzing recent benchmarks and methods, we show that current solutions and simple scaling cannot bridge this gap. Achieving genuine compositionality, we conclude, will require fundamental advances in representation and reasoning rather than incremental adjustments to existing architectures.

[88] Split-Layer: Enhancing Implicit Neural Representation by Maximizing the Dimensionality of Feature Space

Zhicheng Cai,Hao Zhu,Linsen Chen,Qiu Shen,Xun Cao

Main category: cs.CV

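TL;DR: The paper proposes the split-layer, which divides each layer of a multilayer perceptron (MLP) into multiple parallel branches and combines their outputs via a Hadamard product, substantially increasing the representational capacity of implicit neural representations (INR) without a sharp increase in computational cost.

Details Motivation: INR is widely used for signal modeling and inverse problems, but the low-dimensional feature space of a conventional MLP limits its representational capacity. Widening the MLP increases capacity but makes computational and memory costs grow quadratically, so a more efficient approach is needed.

Contribution: The main contribution is the split-layer, a reformulation of MLP construction that builds a high-degree polynomial space through parallel branches and a Hadamard product, improving INR capacity while avoiding rapid growth in computational overhead.

Method: The split-layer divides each MLP layer into multiple parallel branches and integrates their outputs via a Hadamard product, constructing a high-degree polynomial space instead of simply widening the MLP.

Result: Experiments show that the split-layer surpasses existing methods on 2D image fitting, 2D CT reconstruction, 3D shape representation, and 5D novel view synthesis.

Insight: Combining parallel branches through a Hadamard product expands the dimensionality of the feature space efficiently, offering a low-overhead way to increase INR capacity.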

Abstract: Implicit neural representation (INR) models signals as continuous functions using neural networks, offering efficient and differentiable optimization for inverse problems across diverse disciplines. However, the representational capacity of INR, defined by the range of functions the neural network can characterize, is inherently limited by the low-dimensional feature space in conventional multilayer perceptron (MLP) architectures. While widening the MLP can linearly increase feature space dimensionality, it also leads to a quadratic growth in computational and memory costs. To address this limitation, we propose the split-layer, a novel reformulation of MLP construction. The split-layer divides each layer into multiple parallel branches and integrates their outputs via Hadamard product, effectively constructing a high-degree polynomial space. This approach significantly enhances INR’s representational capacity by expanding the feature space dimensionality without incurring prohibitive computational overhead. Extensive experiments demonstrate that the split-layer substantially improves INR performance, surpassing existing methods across multiple tasks, including 2D image fitting, 2D CT reconstruction, 3D shape representation, and 5D novel view synthesis.
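
The branch-and-multiply structure described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes each branch is a plain affine map, omits any activation function (the abstract does not say where nonlinearities are applied), and uses hypothetical names `linear` and `split_layer`.

```python
def linear(weights, bias, x):
    """Affine map y = W x + b, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]

def split_layer(branches, x):
    """Run every branch on the same input and combine outputs element-wise (Hadamard product).

    `branches` is a list of (weights, bias) pairs that all produce the same output width.
    """
    out = None
    for weights, bias in branches:
        y = linear(weights, bias, x)
        out = y if out is None else [a * b for a, b in zip(out, y)]
    return out

# Two branches on a 2-D input: each output coordinate becomes a degree-2 polynomial in x.
branch_a = ([[1, 0], [0, 1]], [0, 0])   # identity map
branch_b = ([[1, 1], [1, -1]], [1, 0])  # (x0 + x1 + 1, x0 - x1)
y = split_layer([branch_a, branch_b], [1, 2])
```

With k affine branches, each output coordinate is a product of k affine functions of the input, so the layer spans degree-k polynomial terms that a single affine map of the same width cannot express. That is the sense in which the feature space grows without widening the layer.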

[89] Physically Interpretable Multi-Degradation Image Restoration via Deep Unfolding and Explainable Convolution

Hu Gao,Xiaoning Lei,Xichen Xu,Depeng Dang,Lizhuang Ma

Main category: cs.CV

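TL;DR: The paper proposes InterIR, which uses a deep unfolding network and an explainable convolution module to handle multi-degradation image restoration while keeping the model physically interpretable.

Details Motivation: Real-world images often contain several degradations at once (such as rain, noise, and haze), whereas most existing methods target a single degradation. In addition, methods that gain performance by stacking modules often lack interpretability.

Contribution: 1. Proposes a multi-degradation image restoration method based on a deep unfolding network; 2. Designs an explainable convolution module that improves the model's interpretability and adaptability.

Method: 1. Uses an improved second-order semi-smooth Newton algorithm so that each module retains a clear physical interpretation; 2. Designs an explainable convolution module inspired by the human brain's flexible information processing and the intrinsic characteristics of images.

Result: InterIR performs strongly on multi-degradation restoration and remains competitive on single-degradation tasks.

Insight: A deep unfolding structure derived from a mathematical optimization algorithm, combined with explainable modules, can improve performance while preserving physical interpretability.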

Abstract: Although image restoration has advanced significantly, most existing methods target only a single type of degradation. In real-world scenarios, images often contain multiple degradations simultaneously, such as rain, noise, and haze, requiring models capable of handling diverse degradation types. Moreover, methods that improve performance through module stacking often suffer from limited interpretability. In this paper, we propose a novel interpretability-driven approach for multi-degradation image restoration, built upon a deep unfolding network that maps the iterative process of a mathematical optimization algorithm into a learnable network structure. Specifically, we employ an improved second-order semi-smooth Newton algorithm to ensure that each module maintains clear physical interpretability. To further enhance interpretability and adaptability, we design an explainable convolution module inspired by the human brain’s flexible information processing and the intrinsic characteristics of images, allowing the network to flexibly leverage learned knowledge and autonomously adjust parameters for different input. The resulting tightly integrated architecture, named InterIR, demonstrates excellent performance in multi-degradation restoration while remaining highly competitive on single-degradation tasks.

[90] CephRes-MHNet: A Multi-Head Residual Network for Accurate and Robust Cephalometric Landmark Detection

Ahmed Jaheen,Islam Hassan,Mohanad Abouserie,Abdelaty Rehab,Adham Elasfar,Knzy Elmasry,Mostafa El-Dawlatly,Seif Eldawlatly

Main category: cs.CV

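TL;DR: Proposes CephRes-MHNet, a multi-head residual network for efficient and accurate landmark detection in lateral skull X-rays, outperforming the evaluated baselines.

Details Motivation: Manual annotation of cephalometric landmarks is time-consuming and error-prone, while existing automated methods struggle with low contrast and anatomical complexity.

Contribution: Proposes the CephRes-MHNet network, which combines multi-head decoders, residual encoding, and dual-attention mechanisms to improve the accuracy and robustness of landmark detection.

Method: Uses residual encoding and multi-head decoders to strengthen contextual reasoning, and dual-attention mechanisms to improve anatomical precision.

Result: On the Aariz dataset of 1,000 radiographs, the model reaches a mean radial error (MRE) of 1.23 mm and a success detection rate (SDR) within 2 mm of 85.5%, outperforming the baselines.

Insight: Architectural efficiency is the key: residual and multi-head design can improve accuracy while using far fewer parameters.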

Abstract: Accurate localization of cephalometric landmarks from 2D lateral skull X-rays is vital for orthodontic diagnosis and treatment. Manual annotation is time-consuming and error-prone, whereas automated approaches often struggle with low contrast and anatomical complexity. This paper introduces CephRes-MHNet, a multi-head residual convolutional network for robust and efficient cephalometric landmark detection. The architecture integrates residual encoding, dual-attention mechanisms, and multi-head decoders to enhance contextual reasoning and anatomical precision. Trained on the Aariz Cephalometric dataset of 1,000 radiographs, CephRes-MHNet achieved a mean radial error (MRE) of 1.23 mm and a success detection rate (SDR) @ 2.0 mm of 85.5%, outperforming all evaluated models. In particular, it exceeded the strongest baseline, the attention-driven AFPF-Net (MRE = 1.25 mm, SDR @ 2.0 mm = 84.1%), while using less than 25% of its parameters. These results demonstrate that CephRes-MHNet attains state-of-the-art accuracy through architectural efficiency, providing a practical solution for real-world orthodontic analysis.
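
For readers unfamiliar with the two reported metrics, the sketch below computes them under their usual definitions: MRE as the mean Euclidean distance between predicted and ground-truth landmarks, and SDR as the percentage of landmarks whose radial error falls within a threshold. The function names, the `mm_per_pixel` parameter, and the use of "less than or equal" at the threshold are assumptions for illustration; the paper's evaluation code may differ.

```python
import math

def radial_errors(pred, gt, mm_per_pixel=1.0):
    """Per-landmark Euclidean distance, converted to millimetres."""
    return [math.dist(p, g) * mm_per_pixel for p, g in zip(pred, gt)]

def mean_radial_error(errors):
    """MRE: average radial error over all landmarks."""
    return sum(errors) / len(errors)

def success_detection_rate(errors, threshold_mm=2.0):
    """SDR: percentage of landmarks with radial error <= threshold_mm."""
    return 100.0 * sum(e <= threshold_mm for e in errors) / len(errors)

# Toy example with three landmarks (coordinates already in millimetres).
pred = [(0.0, 0.0), (3.0, 4.0), (1.0, 0.0)]
gt = [(0.0, 0.0), (0.0, 0.0), (0.0, 0.0)]
errors = radial_errors(pred, gt)
mre = mean_radial_error(errors)
sdr = success_detection_rate(errors, threshold_mm=2.0)
```

The two metrics are complementary: MRE is sensitive to a few large misses, while SDR reports how many landmarks are clinically usable at a given tolerance.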

[91] VISTA: A Vision and Intent-Aware Social Attention Framework for Multi-Agent Trajectory Prediction

Stephane Da Silva Martins,Emanuel Aldea,Sylvie Le Hégarat-Mascle

Main category: cs.CV

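TL;DR: VISTA is a multi-agent trajectory prediction method built on a recursive goal-conditioned transformer; it combines long-term intent with social interaction and improves the realism and safety of predicted trajectories.

Details Motivation: Existing methods struggle to capture agents' long-term goals and their fine-grained social interactions at the same time, which leads to unrealistic multi-agent predictions.

Contribution: VISTA introduces a cross-attention fusion module, a social-token attention mechanism, and interpretable pairwise attention maps, extending single-agent goal-conditioned prediction to a multi-agent framework.

Method: VISTA uses a recursive goal-conditioned transformer that combines long-horizon intent, social interaction modeling, and interpretable attention to generate goal-aware, socially compliant trajectories.

Result: On the high-density MADRAS benchmark and on SDD, VISTA achieves state-of-the-art accuracy and far fewer collisions (on MADRAS the average collision rate drops from 2.14% to 0.03%; on SDD it reaches zero collisions).

Insight: By modeling goals and interactions jointly, VISTA generates more realistic, safer, and more interpretable trajectories, which makes it promising for safety-critical autonomous systems.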

Abstract: Multi-agent trajectory prediction is crucial for autonomous systems operating in dense, interactive environments. Existing methods often fail to jointly capture agents’ long-term goals and their fine-grained social interactions, which leads to unrealistic multi-agent futures. We propose VISTA, a recursive goal-conditioned transformer for multi-agent trajectory forecasting. VISTA combines (i) a cross-attention fusion module that integrates long-horizon intent with past motion, (ii) a social-token attention mechanism for flexible interaction modeling across agents, and (iii) pairwise attention maps that make social influence patterns interpretable at inference time. Our model turns single-agent goal-conditioned prediction into a coherent multi-agent forecasting framework. Beyond standard displacement metrics, we evaluate trajectory collision rates as a measure of joint realism. On the high-density MADRAS benchmark and on SDD, VISTA achieves state-of-the-art accuracy and substantially fewer collisions. On MADRAS, it reduces the average collision rate of strong baselines from 2.14 to 0.03 percent, and on SDD it attains zero collisions while improving ADE, FDE, and minFDE. These results show that VISTA generates socially compliant, goal-aware, and interpretable trajectories, making it promising for safety-critical autonomous systems.
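
The evaluation quantities mentioned above can be sketched as follows. ADE and FDE follow their standard definitions (mean and final-step displacement error). The collision check is an assumption made for illustration: two agents are counted as colliding if they come closer than a fixed `radius` at the same timestep, and the rate is the fraction of colliding agent pairs. The paper's exact collision definition is not given in the abstract.

```python
import math
from itertools import combinations

def ade(pred, gt):
    """Average displacement error: mean L2 distance over all timesteps."""
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(gt)

def fde(pred, gt):
    """Final displacement error: L2 distance at the last timestep."""
    return math.dist(pred[-1], gt[-1])

def collide(traj_a, traj_b, radius):
    """True if the two agents are closer than `radius` at any shared timestep (assumed criterion)."""
    return any(math.dist(a, b) < radius for a, b in zip(traj_a, traj_b))

def collision_rate(trajs, radius):
    """Fraction of agent pairs in a scene that collide (assumed definition)."""
    pairs = list(combinations(trajs, 2))
    if not pairs:
        return 0.0
    return sum(collide(a, b, radius) for a, b in pairs) / len(pairs)

# Toy scene with three predicted agents over two timesteps.
a = [(0.0, 0.0), (1.0, 0.0)]
b = [(0.0, 1.0), (1.0, 0.1)]
c = [(5.0, 5.0), (6.0, 6.0)]
rate = collision_rate([a, b, c], radius=0.2)
```

Displacement metrics score each agent in isolation, so a set of individually accurate trajectories can still overlap in space. A joint measure such as a collision rate is what captures that failure.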

[92] HeatV2X: Scalable Heterogeneous Collaborative Perception via Efficient Alignment and Interaction

Yueran Zhao,Zhang Zhang,Chao Sun,Tianze Wang,Chao Yue,Nuoran Li

Main category: cs.CV

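TL;DR: HeatV2X is a scalable heterogeneous framework for V2X collaborative perception that addresses multi-modal heterogeneity and scalability through efficient heterogeneous alignment and multi-agent interaction.

Details Motivation: Existing V2X collaborative perception frameworks face challenges of multi-modal heterogeneity and scalability; as more agents participate, heterogeneity and training cost become bottlenecks.

Contribution: Proposes the HeatV2X framework, comprising a foundation agent based on heterogeneous graph attention, Local Heterogeneous Fine-Tuning, and Global Collaborative Fine-Tuning, to achieve efficient alignment and collaboration.

Method: Hetero-Aware Adapters extract modality-specific differences, and the Multi-Cognitive Adapter enhances cross-agent collaboration, reducing training overhead.

Result: On the OPV2V-H and DAIR-V2X datasets, the method achieves superior perception performance with significantly lower training cost, outperforming existing approaches.

Insight: Heterogeneous alignment and lightweight fine-tuning are the keys to scalable collaborative perception; HeatV2X offers an efficient solution for multi-agent collaboration.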

Abstract: Vehicle-to-Everything (V2X) collaborative perception extends sensing beyond single vehicle limits through transmission. However, as more agents participate, existing frameworks face two key challenges: (1) the participating agents are inherently multi-modal and heterogeneous, and (2) the collaborative framework must be scalable to accommodate new agents. The former requires effective cross-agent feature alignment to mitigate heterogeneity loss, while the latter renders full-parameter training impractical, highlighting the importance of scalable adaptation. To address these issues, we propose Heterogeneous Adaptation (HeatV2X), a scalable collaborative framework. We first train a high-performance agent based on heterogeneous graph attention as the foundation for collaborative learning. Then, we design Local Heterogeneous Fine-Tuning and Global Collaborative Fine-Tuning to achieve effective alignment and interaction among heterogeneous agents. The former efficiently extracts modality-specific differences using Hetero-Aware Adapters, while the latter employs the Multi-Cognitive Adapter to enhance cross-agent collaboration and fully exploit the fusion potential. These designs enable substantial performance improvement of the collaborative framework with minimal training cost. We evaluate our approach on the OPV2V-H and DAIR-V2X datasets. Experimental results demonstrate that our method achieves superior perception performance with significantly reduced training overhead, outperforming existing state-of-the-art approaches. Our implementation will be released soon.

[93] Next-Frame Feature Prediction for Multimodal Deepfake Detection and Temporal Localization

Ashutosh Anshul,Shreyas Gopal,Deepu Rajan,Eng Siong Chng

Main category: cs.CV

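TL;DR: The paper proposes a single-stage training framework that combines next-frame prediction with a window-level attention mechanism, improving the generalization of multimodal deepfake detection and enabling precise temporal localization.

Details Motivation: Existing multimodal deepfake detection methods fall short in generalization and against manipulations that preserve audio-visual alignment.

Contribution: 1. Proposes a single-stage training framework that incorporates next-frame prediction; 2. Introduces a window-level attention mechanism to capture discrepancies between frames; 3. Supports both classification of fully manipulated videos and temporal localization in partially spoofed samples.

Method: 1. Next-frame prediction over uni-modal and cross-modal features improves generalization; 2. A window-level attention mechanism detects inconsistencies between predicted and actual frames.

Result: The model shows strong generalization and precise temporal localization on multiple benchmark datasets.

Insight: Combining temporal prediction with attention improves deepfake detection, particularly the localization of manipulated segments in partially spoofed samples.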

Abstract: Recent multimodal deepfake detection methods designed for generalization conjecture that single-stage supervised training struggles to generalize across unseen manipulations and datasets. However, such approaches that target generalization require pretraining over real samples. Additionally, these methods primarily focus on detecting audio-visual inconsistencies and may overlook intra-modal artifacts causing them to fail against manipulations that preserve audio-visual alignment. To address these limitations, we propose a single-stage training framework that enhances generalization by incorporating next-frame prediction for both uni-modal and cross-modal features. Additionally, we introduce a window-level attention mechanism to capture discrepancies between predicted and actual frames, enabling the model to detect local artifacts around every frame, which is crucial for accurately classifying fully manipulated videos and effectively localizing deepfake segments in partially spoofed samples. Our model, evaluated on multiple benchmark datasets, demonstrates strong generalization and precise temporal localization.

[94] TubeRMC: Tube-conditioned Reconstruction with Mutual Constraints for Weakly-supervised Spatio-Temporal Video Grounding

Jinxuan Li,Yi Zhang,Jian-Fang Hu,Chaolei Tan,Tianming Liang,Beihao Xia

Main category: cs.CV

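TL;DR: TubeRMC (Tube-conditioned Reconstruction with Mutual Constraints) is a new framework for weakly-supervised spatio-temporal video grounding (STVG) that addresses failed target identification and inconsistent tracking through text-conditioned candidate tube generation and tube-conditioned reconstruction.

Details Motivation: Existing weakly-supervised STVG methods typically use a simple late-fusion strategy, which leads to failed target identification and inconsistent tracking. The paper aims to improve on this with text-conditioned candidate tube generation and reconstruction under spatio-temporal constraints.

Contribution: 1) Proposes the TubeRMC framework, which generates text-conditioned candidate tubes with pre-trained visual grounding models and refines them through reconstruction with spatio-temporal constraints; 2) designs three reconstruction strategies (temporal, spatial, and spatio-temporal) to capture rich tube-text correspondences; 3) outperforms existing methods on the VidSTG and HCSTVG benchmarks.

Method: TubeRMC uses pre-trained models to generate text-conditioned candidate tubes and reconstructs the key clues in the query with three Tube-conditioned Reconstructors (temporal, spatial, and spatio-temporal). Mutual constraints between spatial and temporal proposals further improve their quality for reconstruction.

Result: On the two public benchmarks VidSTG and HCSTVG, TubeRMC outperforms existing methods; visualizations show that it mitigates both target identification errors and inconsistent tracking.

Insight: With text-conditioned tube generation and reconstruction under spatio-temporal constraints, TubeRMC achieves higher-quality grounding in weakly-supervised STVG, showing an effective combination of spatio-temporal reasoning and vision-language understanding.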

Abstract: Spatio-Temporal Video Grounding (STVG) aims to localize a spatio-temporal tube that corresponds to a given language query in an untrimmed video. This is a challenging task since it involves complex vision-language understanding and spatiotemporal reasoning. Recent works have explored the weakly-supervised setting in STVG to eliminate reliance on fine-grained annotations like bounding boxes or temporal stamps. However, they typically follow a simple late-fusion manner, which generates tubes independent of the text description, often resulting in failed target identification and inconsistent target tracking. To address this limitation, we propose a Tube-conditioned Reconstruction with Mutual Constraints (TubeRMC) framework that generates text-conditioned candidate tubes with pre-trained visual grounding models and further refines them via tube-conditioned reconstruction with spatio-temporal constraints. Specifically, we design three reconstruction strategies from temporal, spatial, and spatio-temporal perspectives to comprehensively capture rich tube-text correspondences. Each strategy is equipped with a Tube-conditioned Reconstructor, utilizing spatio-temporal tubes as condition to reconstruct the key clues in the query. We further introduce mutual constraints between spatial and temporal proposals to enhance their quality for reconstruction. TubeRMC outperforms existing methods on two public benchmarks VidSTG and HCSTVG. Further visualization shows that TubeRMC effectively mitigates both target identification errors and inconsistent tracking.

[95] FineSkiing: A Fine-grained Benchmark for Skiing Action Quality Assessment

Yongji Zhang,Siqi Li,Yue Gao,Yu Jiang

Main category: cs.CV

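TL;DR: The paper presents the FineSkiing dataset and the JudgeMind method. It provides the first fine-grained sub-score and deduction annotations for aerial skiing and improves the performance and reliability of action quality assessment (AQA) by simulating how referees judge and score.

Details Motivation: Existing AQA methods usually predict scores from features extracted over the entire video, which limits their interpretability and reliability, and existing datasets lack fine-grained annotations for action scores.

Contribution: 1. Builds FineSkiing, the first AQA dataset with fine-grained sub-score and deduction annotations; 2. Proposes JudgeMind, which improves AQA performance through stage-wise scoring and knowledge fusion.

Method: 1. Segments the action video into stages and scores each stage; 2. Introduces a stage-aware feature enhancement and fusion module; 3. Proposes a knowledge-based grade-aware decoder that incorporates possible deduction items as prior knowledge.

Result: Experiments show that JudgeMind achieves state-of-the-art performance on the FineSkiing dataset.

Insight: Stage-wise scoring combined with referee knowledge can improve the interpretability and reliability of AQA, especially in fine-grained scoring tasks.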

Abstract: Action Quality Assessment (AQA) aims to evaluate and score sports actions, which has attracted widespread interest in recent years. Existing AQA methods primarily predict scores based on features extracted from the entire video, resulting in limited interpretability and reliability. Meanwhile, existing AQA datasets also lack fine-grained annotations for action scores, especially for deduction items and sub-score annotations. In this paper, we construct the first AQA dataset containing fine-grained sub-score and deduction annotations for aerial skiing, which will be released as a new benchmark. For the technical challenges, we propose a novel AQA method, named JudgeMind, which significantly enhances performance and reliability by simulating the judgment and scoring mindset of professional referees. Our method segments the input action video into different stages and scores each stage to enhance accuracy. Then, we propose a stage-aware feature enhancement and fusion module to boost the perception of stage-specific key regions and enhance the robustness to visual changes caused by frequent camera viewpoints switching. In addition, we propose a knowledge-based grade-aware decoder to incorporate possible deduction items as prior knowledge to predict more accurate and reliable scores. Experimental results demonstrate that our method achieves state-of-the-art performance.

[96] Facial-R1: Aligning Reasoning and Recognition for Facial Emotion Analysis

Jiulong Wu,Yucheng Shen,Lingyong Yan,Haixin Sun,Deguo Xia,Jizhou Huang,Min Cao

Main category: cs.CV

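TL;DR: Facial-R1 is a three-stage alignment framework that uses instruction fine-tuning, reinforcement training, and data synthesis to address hallucinated reasoning and the misalignment between reasoning and recognition in facial emotion analysis, achieving state-of-the-art results on multiple benchmarks.

Details Motivation: Existing facial emotion analysis approaches suffer from hallucinated reasoning and from misalignment between reasoning and recognition. Facial-R1 aims to deliver finer-grained, explainable analysis by combining emotion recognition, facial Action Unit (AU) recognition, and AU-based emotion reasoning.

Contribution: Proposes the three-stage alignment framework Facial-R1 (instruction fine-tuning, reinforcement training, and data synthesis) and releases the FEA-20K benchmark dataset.

Method: 1. Instruction fine-tuning establishes basic emotional reasoning capability; 2. Reinforcement training uses emotion and AU labels as reward signals to align the reasoning process with the predicted emotion; 3. Data synthesis iteratively expands the training data.

Result: Achieves state-of-the-art performance across eight standard benchmarks, with strong generalization and interpretability.

Insight: Combining reinforcement training with data synthesis can improve performance on fine-grained emotion analysis while also strengthening interpretability.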

Abstract: Facial Emotion Analysis (FEA) extends traditional facial emotion recognition by incorporating explainable, fine-grained reasoning. The task integrates three subtasks: emotion recognition, facial Action Unit (AU) recognition, and AU-based emotion reasoning to model affective states jointly. While recent approaches leverage Vision-Language Models (VLMs) and achieve promising results, they face two critical limitations: (1) hallucinated reasoning, where VLMs generate plausible but inaccurate explanations due to insufficient emotion-specific knowledge; and (2) misalignment between emotion reasoning and recognition, caused by fragmented connections between observed facial features and final labels. We propose Facial-R1, a three-stage alignment framework that effectively addresses both challenges with minimal supervision. First, we employ instruction fine-tuning to establish basic emotional reasoning capability. Second, we introduce reinforcement training guided by emotion and AU labels as reward signals, which explicitly aligns the generated reasoning process with the predicted emotion. Third, we design a data synthesis pipeline that iteratively leverages the prior stages to expand the training dataset, enabling scalable self-improvement of the model. Built upon this framework, we introduce FEA-20K, a benchmark dataset comprising 17,737 training and 1,688 test samples with fine-grained emotion analysis annotations. Extensive experiments across eight standard benchmarks demonstrate that Facial-R1 achieves state-of-the-art performance in FEA, with strong generalization and robust interpretability.

[97] PROPA: Toward Process-level Optimization in Visual Reasoning via Reinforcement Learning

Yanbei Jiang,Chao Lei,Yihao Ding,Krista Ehinger,Jey Han Lau

Main category: cs.CV

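TL;DR: PROPA integrates MCTS with GRPO to optimize visual reasoning with dense, process-level rewards, without human annotations.

Details Motivation: Complex visual reasoning in vision-language models involves multi-step dependencies, so early errors cascade through the reasoning chain. Existing post-training paradigms are limited: SFT relies on costly step-level annotations, while RLVR provides only sparse, outcome-level feedback.

Contribution: Proposes the PROPA framework, which combines MCTS with GRPO to generate dense process-level rewards, interleaves GRPO updates with SFT to overcome the cold-start problem, and trains a Process Reward Model (PRM) to guide inference-time search.

Method: PROPA uses MCTS to generate dense process-level rewards, interleaves GRPO with SFT to optimize reasoning trajectories, and uses the PRM to align test-time search with the training signal.

Result: Across seven benchmarks and four VLM backbones, PROPA outperforms SFT- and RLVR-based baselines, with gains of up to 17.0% on in-domain tasks and up to 21.0% on out-of-domain tasks.

Insight: Dense process-level rewards, combined with interleaved GRPO and SFT updates, improve both capability and generalization in complex visual reasoning without costly human annotations.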

Abstract: Despite significant progress, Vision-Language Models (VLMs) still struggle with complex visual reasoning, where multi-step dependencies cause early errors to cascade through the reasoning chain. Existing post-training paradigms are limited: Supervised Fine-Tuning (SFT) relies on costly step-level annotations, while Reinforcement Learning with Verifiable Rewards (RLVR) methods like GRPO provide only sparse, outcome-level feedback, hindering stable optimization. We introduce PROPA (Process-level Reasoning Optimization with interleaved Policy Alignment), a novel framework that integrates Monte Carlo Tree Search (MCTS) with GRPO to generate dense, process-level rewards and optimize reasoning at each intermediate step without human annotations. To overcome the cold-start problem, PROPA interleaves GRPO updates with SFT, enabling the model to learn from both successful and failed reasoning trajectories. A Process Reward Model (PRM) is further trained to guide inference-time search, aligning the test-time search with the training signal. Across seven benchmarks and four VLM backbones, PROPA consistently outperforms both SFT- and RLVR-based baselines. It achieves up to 17.0% gains on in-domain tasks and 21.0% gains on out-of-domain tasks compared to existing state-of-the-art, establishing a strong reasoning and generalization capability for visual reasoning tasks. The code is available at: https://github.com/YanbeiJiang/PROPA.
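
To make the "sparse, outcome-level feedback" point concrete, the sketch below computes GRPO-style group-relative advantages from one scalar reward per sampled response. It is a minimal illustration, not PROPA's implementation: the function name, the use of the population standard deviation, and the `eps` constant are assumptions, and the MCTS-derived process-level rewards described in the abstract are not reproduced here.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of sampled responses.

    Each response receives (reward - group mean) / (group std + eps).
    Using the population std and this eps value are assumptions;
    implementations differ on both.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Outcome-level feedback: a single 0/1 reward per complete response.
mixed_group = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# If every response in the group gets the same outcome reward,
# all advantages are zero and the group yields no learning signal.
uniform_group = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
```

The second call shows one way outcome-only rewards can be uninformative: a group whose responses all succeed (or all fail) yields zero advantages, whatever happened at the intermediate steps.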

[98] Towards Blind and Low-Vision Accessibility of Lightweight VLMs and Custom LLM-Evals

Shruti Singh Baghel,Yash Pratap Singh Rathore,Sushovan Jena,Anurag Pradhan,Amit Shukla,Arnav Bhavsar,Pawan Goyal

Main category: cs.CV

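TL;DR: The paper studies lightweight vision-language models (VLMs) for blind and low-vision (BLV) users, evaluating SmolVLM2 models of two sizes and introducing two new evaluation frameworks focused on BLV accessibility.

Details Motivation: Large VLMs perform well on video description, but their high resource demands limit practical use for BLV users. This motivates studying how well lightweight models serve these users.

Contribution: 1. Evaluates SmolVLM2 at two model sizes (500M and 2.2B parameters) on accessibility-focused description quality; 2. Introduces two new evaluation frameworks (the Multi-Context BLV Framework and the Navigational Assistance Framework); 3. Studies deployment performance on a smartphone.

Method: 1. Uses two datasets, AVCaps (outdoor) and Charades (indoor); 2. Evaluates four prompt design strategies; 3. Tests FP32 and INT8 precision variants.

Result: The paper shows the potential of lightweight VLMs for BLV users and uses the new evaluation frameworks to give a more comprehensive performance analysis.

Insight: Lightweight VLMs can be deployed on resource-limited devices while addressing the specific needs of BLV users, and the new evaluation frameworks offer a direction for future research.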

Abstract: Large Vision-Language Models (VLMs) excel at understanding and generating video descriptions, but their high memory, computation, and deployment demands hinder practical use, particularly for blind and low-vision (BLV) users who depend on detailed, context-aware descriptions. To study the effect of model size on accessibility-focused description quality, we evaluate SmolVLM2 variants with 500M and 2.2B parameters across two diverse datasets: AVCaps (outdoor) and Charades (indoor). In this work, we introduce two novel evaluation frameworks specifically designed for BLV accessibility assessment: the Multi-Context BLV Framework, evaluating spatial orientation, social interaction, action events, and ambience contexts; and the Navigational Assistance Framework, focusing on mobility-critical information. Additionally, we conduct a systematic evaluation of four different prompt design strategies and deploy both models on a smartphone, evaluating FP32 and INT8 precision variants to assess real-world performance constraints on resource-limited mobile devices.
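
As a rough back-of-the-envelope check on why precision matters on a phone, the sketch below estimates weight storage for the two parameter counts named in the abstract. It assumes 4 bytes per FP32 weight and 1 byte per INT8 weight, counts weights only (no activations, caches, or quantization scale factors), and uses decimal gigabytes; the dictionary keys are illustrative labels, not official model names.

```python
# Bytes needed to store one weight at each precision.
BYTES_PER_PARAM = {"FP32": 4, "INT8": 1}

def weight_memory_gb(num_params: float, precision: str) -> float:
    """Weights-only storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9

# Parameter counts taken from the abstract; the keys are just labels.
PARAM_COUNTS = {"500M": 500e6, "2.2B": 2.2e9}

for name, n in PARAM_COUNTS.items():
    for precision in BYTES_PER_PARAM:
        print(f"{name} {precision}: {weight_memory_gb(n, precision):.1f} GB")
```

Under these assumptions INT8 weights occupy a quarter of the FP32 footprint. Actual on-device memory use also depends on the runtime and is not reported in the abstract.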

[99] Adaptive Residual-Update Steering for Low-Overhead Hallucination Mitigation in Large Vision Language Models

Zhengtao Zou,Ya Gao,Jiarui Guan,Bin Li,Pekka Marttinen

Main category: cs.CV

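TL;DR: The paper proposes RUDDER, a low-overhead framework that reduces object hallucination in large vision-language models through adaptive residual-update steering, balancing efficiency and reliability.

Details Motivation: Large vision-language models (LVLMs) often hallucinate objects, generating text inconsistent with the visual input, which undermines reliability. Existing inference-time interventions incur substantial computational overhead, typically extra forward passes; RUDDER aims to resolve this trade-off between efficiency and effectiveness.

Contribution: Proposes the RUDDER framework, comprising the CARD vector and an adaptive gate, which extracts visual evidence during a single standard forward pass and corrects model outputs with negligible added latency.

Method: 1. CARD vector: a per-sample visual evidence vector extracted from the residual update of a self-attention layer. 2. Adaptive gate: a Bayesian-inspired gate that injects the corrective signal token by token, with strength conditioned on the model's deviation from the visual context.

Result: On benchmarks including POPE and CHAIR, RUDDER performs comparably to state-of-the-art methods while adding negligible latency.

Insight: A single standard forward pass suffices to improve reliability, making RUDDER a practical option for real-world deployment.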

Abstract: Large Vision-Language Models (LVLMs) often suffer from object hallucination, generating text inconsistent with visual inputs, which can critically undermine their reliability. Existing inference-time interventions to mitigate this issue present a challenging trade-off: while methods that steer internal states or adjust output logits can be effective, they often incur substantial computational overhead, typically requiring extra forward passes. This efficiency bottleneck can limit their practicality for real-world, latency-sensitive deployments. In this work, we aim to address this trade-off with Residual-Update Directed DEcoding Regulation (RUDDER), a low-overhead framework that steers LVLMs towards visually-grounded generation. RUDDER is built on two key innovations: (1) Contextual Activation Residual Direction (CARD) vector, a per-sample visual evidence vector extracted from the residual update of a self-attention layer during a single, standard forward pass. (2) A Bayesian-inspired adaptive gate that performs token-wise injection, applying a corrective signal whose strength is conditioned on the model’s deviation from the visual context. Extensive experiments on key hallucination benchmarks, including POPE and CHAIR, indicate that RUDDER achieves performance comparable to state-of-the-art methods while introducing negligible computational latency, validating RUDDER as a pragmatic and effective approach for improving LVLMs’ reliability without a significant compromise on efficiency.
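
The abstract describes token-wise injection of a corrective signal whose strength depends on the model's deviation from the visual context. The sketch below shows only the generic shape of such a gated additive steering step. It is not RUDDER's gate: the cosine-based deviation measure, the linear gate, and every name here are assumptions made for illustration, and both vectors are assumed to be non-zero.

```python
import math

def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def gated_steer(hidden, evidence_direction, max_strength=1.0):
    """Add a steering vector to one token's hidden state.

    The deviation is 0 when the hidden state is aligned with the
    evidence direction and 1 when it points the opposite way; the
    injected strength scales linearly with it (a hypothetical choice).
    """
    deviation = (1.0 - cosine(hidden, evidence_direction)) / 2.0
    gate = max_strength * deviation
    steered = [h + gate * d for h, d in zip(hidden, evidence_direction)]
    return steered, gate

aligned, gate_aligned = gated_steer([1.0, 0.0], [1.0, 0.0])
opposed, gate_opposed = gated_steer([-1.0, 0.0], [1.0, 0.0])
```

The property this illustrates is that a token already consistent with the visual evidence is left untouched, while a strongly deviating token receives the full correction, all without an additional forward pass.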

[100] Rethinking Visual Information Processing in Multimodal LLMs

Dongwan Kim,Viresh Ranjan,Takashi Nagata,Arnab Dhua,Amit Kumar K C

Main category: cs.CV

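TL;DR: The paper proposes LLaViT, which lets the LLM also act as a vision encoder to improve visual information processing in multimodal LLMs, significantly outperforming the LLaVA baseline.

Details Motivation: The LLaVA architecture performs well on vision-language tasks, but the mismatch between text and vision modalities makes it hard to integrate visual features effectively. The paper addresses this by treating the LLM not only as a language model but also as a powerful vision encoder.

Contribution: Proposes LLaViT, which through three key modifications (separate QKV projections, bidirectional attention, and global plus local visual representations) enables the LLM to function simultaneously as a vision encoder, improving vision-language modeling.

Method: 1. Learn separate QKV projections for the vision modality; 2. Enable bidirectional attention on visual tokens; 3. Incorporate both global and local visual representations.

Result: LLaViT significantly outperforms the LLaVA baseline on many benchmarks, even surpassing models with twice its parameter count.

Insight: With suitable modifications, an LLM can serve as a vision encoder as well as a language model, giving a more effective approach to vision-language modeling.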

Abstract: Despite the remarkable success of the LLaVA architecture for vision-language tasks, its design inherently struggles to effectively integrate visual features due to the inherent mismatch between text and vision modalities. We tackle this issue from a novel perspective in which the LLM not only serves as a language model but also a powerful vision encoder. To this end, we present LLaViT - Large Language Models as extended Vision Transformers - which enables the LLM to simultaneously function as a vision encoder through three key modifications: (1) learning separate QKV projections for vision modality, (2) enabling bidirectional attention on visual tokens, and (3) incorporating both global and local visual representations. Through extensive controlled experiments on a wide range of LLMs, we demonstrate that LLaViT significantly outperforms the baseline LLaVA method on a multitude of benchmarks, even surpassing models with double its parameter count, establishing a more effective approach to vision-language modeling.
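
The second modification, bidirectional attention on visual tokens, can be pictured as a change to the attention mask. The sketch below builds a standard causal mask and then lets visual tokens attend to one another in both directions. It assumes all visual tokens belong to a single image block; the function name and the boolean-list representation are illustrative, not taken from the paper.

```python
def build_attention_mask(is_visual):
    """Return mask[i][j] = True when query token i may attend to key j.

    Starts from a causal mask (j <= i), then opens up every
    visual-to-visual pair so visual tokens see each other in both
    directions. Text tokens remain strictly causal.
    """
    n = len(is_visual)
    mask = [[j <= i for j in range(n)] for i in range(n)]
    for i in range(n):
        for j in range(n):
            if is_visual[i] and is_visual[j]:
                mask[i][j] = True
    return mask

# Three visual tokens followed by two text tokens.
is_visual = [True, True, True, False, False]
mask = build_attention_mask(is_visual)
```

In this layout the first visual token can attend to the third (a "future" visual token), but no token can attend to text that comes after it.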

[101] CLIP4VI-ReID: Learning Modality-shared Representations via CLIP Semantic Bridge for Visible-Infrared Person Re-identification

Xiaomei Yang,Xizhan Gao,Sijie Niu,Fa Zhu,Guang Feng,Xiaofeng Qu,David Camacho

Main category: cs.CV

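TL;DR: The paper proposes CLIP4VI-ReID, a CLIP-driven modality-shared representation learning network that uses three modules (Text Semantic Generation, Infrared Feature Embedding, and High-level Semantic Alignment) to achieve modality alignment and shared representation learning for visible-infrared person re-identification.

Details Motivation: Given the large gap in physical characteristics between visible and infrared images, conventional methods struggle to achieve effective cross-modal alignment, so a new way of learning modality-shared representations is needed.

Contribution: Proposes the CLIP4VI-ReID network, which uses text semantics generated with CLIP as a bridge to align the visible and infrared modalities indirectly, and further refines representation learning through high-level semantic alignment.

Method: Three modules are designed: Text Semantic Generation (TSG) generates text semantics for visible images only; Infrared Feature Embedding (IFE) rectifies the feature embeddings of infrared images using the generated text semantics; High-level Semantic Alignment (HSA) refines the semantic alignment so that the text semantics contain only identity-related information.

Result: Experiments on several widely used VI-ReID datasets show that CLIP4VI-ReID outperforms other state-of-the-art methods.

Insight: Text semantics can serve as an effective bridge for cross-modal alignment, and high-level semantic alignment further improves the quality of the shared representations.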

Abstract: This paper proposes a novel CLIP-driven modality-shared representation learning network named CLIP4VI-ReID for the visible-infrared person re-identification (VI-ReID) task, which consists of Text Semantic Generation (TSG), Infrared Feature Embedding (IFE), and High-level Semantic Alignment (HSA). Specifically, considering the huge gap in physical characteristics between natural images and infrared images, the TSG is designed to generate text semantics only for visible images, thereby enabling preliminary visible-text modality alignment. Then, the IFE is proposed to rectify the feature embeddings of infrared images using the generated text semantics. This process injects id-related semantics into the shared image encoder, enhancing its adaptability to the infrared modality. Besides, with text serving as a bridge, it enables indirect visible-infrared modality alignment. Finally, the HSA is established to refine the high-level semantic alignment. This process ensures that the fine-tuned text semantics only contain id-related information, thereby achieving more accurate cross-modal alignment and enhancing the discriminability of the learned modality-shared representations. Extensive experimental results demonstrate that the proposed CLIP4VI-ReID outperforms other state-of-the-art methods on some widely used VI-ReID datasets.

[102] Depth-Consistent 3D Gaussian Splatting via Physical Defocus Modeling and Multi-View Geometric Supervision

Yu Deng,Baozhu Zhao,Junyan Su,Xiaohan Zhang,Qi Liu

Main category: cs.CV

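TL;DR: This paper proposes a 3D Gaussian Splatting method that combines depth-of-field supervision with multi-view consistency supervision to address inconsistent depth estimation in scenes with extreme depth variation, improving depth fidelity.

Details Motivation: In scenes with extreme depth variation, existing methods cannot simultaneously address inaccurate depth estimation in far-field regions and structural degradation in near-field regions. This motivates an approach that unites physical imaging principles with learning-based depth regularization.

Contribution: A computational framework that integrates depth-of-field supervision and multi-view consistency supervision, improving the depth fidelity of 3D Gaussian Splatting through a physically grounded depth-of-field loss and the minimization of cross-view geometric errors.

Method: The method has two components: 1) depth-of-field supervision uses a scale-recovered monocular depth estimator (e.g., Metric3D) to generate depth priors, synthesizes defocused images with defocus convolution, and enforces depth consistency through a depth-of-field loss; 2) multi-view consistency supervision uses LoFTR-based semi-dense feature matching to minimize cross-view geometric errors.

Result: On the Waymo Open Dataset, the method improves PSNR by 0.8 dB over the state-of-the-art method, which the authors present as evidence of improved depth fidelity.

Insight: Unifying physical imaging principles with multi-view geometric constraints offers a scalable solution for complex depth stratification in urban environments.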

Abstract: Three-dimensional reconstruction in scenes with extreme depth variations remains challenging due to inconsistent supervisory signals between near-field and far-field regions. Existing methods fail to simultaneously address inaccurate depth estimation in distant areas and structural degradation in close-range regions. This paper proposes a novel computational framework that integrates depth-of-field supervision and multi-view consistency supervision to advance 3D Gaussian Splatting. Our approach comprises two core components: (1) Depth-of-field Supervision employs a scale-recovered monocular depth estimator (e.g., Metric3D) to generate depth priors, leverages defocus convolution to synthesize physically accurate defocused images, and enforces geometric consistency through a novel depth-of-field loss, thereby enhancing depth fidelity in both far-field and near-field regions; (2) Multi-View Consistency Supervision employs LoFTR-based semi-dense feature matching to minimize cross-view geometric errors and enforces depth consistency via least squares optimization of reliable matched points. By unifying defocus physics with multi-view geometric constraints, our method achieves superior depth fidelity, demonstrating a 0.8 dB PSNR improvement over the state-of-the-art method on the Waymo Open Dataset. This framework bridges physical imaging principles and learning-based depth regularization, offering a scalable solution for complex depth stratification in urban environments.
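
Depth-of-field supervision rests on the fact that defocus blur depends on depth. The sketch below computes the blur-circle diameter from the standard thin-lens model, which is one common way to turn a depth value into a blur size. Whether the paper uses exactly this model is an assumption, and the camera parameters in the example are hypothetical.

```python
def circle_of_confusion(depth, focus_dist, focal_len, f_number):
    """Blur-circle diameter on the sensor under the thin-lens model.

    All lengths are in metres. `depth` is the distance of the scene
    point, `focus_dist` the distance the lens is focused at, and the
    aperture diameter is focal_len / f_number. Assumes
    focus_dist > focal_len and depth > 0.
    """
    aperture = focal_len / f_number
    return (aperture * focal_len * abs(depth - focus_dist)
            / (depth * (focus_dist - focal_len)))

# Hypothetical 50 mm f/2 lens focused at 5 m.
in_focus = circle_of_confusion(5.0, 5.0, 0.05, 2.0)
mid_range = circle_of_confusion(10.0, 5.0, 0.05, 2.0)
far_field = circle_of_confusion(50.0, 5.0, 0.05, 2.0)
```

A point at the focus distance has zero blur, and blur grows as the point moves away from it. A rendered depth map therefore implies a specific defocus pattern, which is what lets a defocused image act as a supervisory signal on depth.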

[103] Learning to Tell Apart: Weakly Supervised Video Anomaly Detection via Disentangled Semantic Alignment

Wenti Yin,Huaxin Zhang,Xiang Wang,Yuqing Lu,Yicheng Zhang,Bingquan Gong,Jialong Zuo,Li Yu,Changxin Gao,Nong Sang

Main category: cs.CV

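TL;DR: The paper proposes DSANet, a weakly supervised video anomaly detection method that uses disentangled semantic alignment to explicitly separate abnormal and normal features, improving fine-grained classification.

Details Motivation: Existing weakly supervised video anomaly detection methods tend to detect only the most salient response segments, neglect mining diverse normal patterns separated from anomalies, and are prone to category confusion among visually similar classes, leading to poor fine-grained classification.

Contribution: 1. Introduces a self-guided normality modeling branch that reconstructs input video features to exploit the normality cues in the video; 2. Proposes a decoupled contrastive semantic alignment mechanism that handles event-centric and background-centric components separately to strengthen class-discriminative representations.

Method: 1. Coarse-grained level: learned normal prototypes guide the reconstruction of input features, separating normal patterns from anomalous events; 2. Fine-grained level: each video is temporally decomposed using frame-level anomaly scores, and visual-language contrastive learning then improves classification.

Result: On the XD-Violence and UCF-Crime benchmarks, DSANet outperforms existing state-of-the-art methods.

Insight: Explicitly disentangling abnormal from normal features, combined with multimodal contrastive learning, improves both fine-grained classification and temporal separation in video anomaly detection.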

Abstract: Recent advancements in weakly-supervised video anomaly detection have achieved remarkable performance by applying the multiple instance learning paradigm based on multimodal foundation models such as CLIP to highlight anomalous instances and classify categories. However, their objectives may tend to detect the most salient response segments, while neglecting to mine diverse normal patterns separated from anomalies, and are prone to category confusion due to similar appearance, leading to unsatisfactory fine-grained classification results. Therefore, we propose a novel Disentangled Semantic Alignment Network (DSANet) to explicitly separate abnormal and normal features from coarse-grained and fine-grained aspects, enhancing the distinguishability. Specifically, at the coarse-grained level, we introduce a self-guided normality modeling branch that reconstructs input video features under the guidance of learned normal prototypes, encouraging the model to exploit normality cues inherent in the video, thereby improving the temporal separation of normal patterns and anomalous events. At the fine-grained level, we present a decoupled contrastive semantic alignment mechanism, which first temporally decomposes each video into event-centric and background-centric components using frame-level anomaly scores and then applies visual-language contrastive learning to enhance class-discriminative representations. Comprehensive experiments on two standard benchmarks, namely XD-Violence and UCF-Crime, demonstrate that DSANet outperforms existing state-of-the-art methods.
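
The fine-grained step starts by splitting each video into event-centric and background-centric parts using frame-level anomaly scores. A minimal sketch of such a split follows. The fixed threshold of 0.5 and the function name are assumptions; the abstract does not say how the split is actually made.

```python
def decompose_by_anomaly_score(scores, threshold=0.5):
    """Split frame indices into event-centric and background-centric sets.

    Frames scoring at or above the threshold are treated as
    event-centric. A fixed threshold is an illustrative assumption.
    """
    event = [i for i, s in enumerate(scores) if s >= threshold]
    background = [i for i, s in enumerate(scores) if s < threshold]
    return event, background

frame_scores = [0.1, 0.2, 0.9, 0.8, 0.3]
event_frames, background_frames = decompose_by_anomaly_score(frame_scores)
```

Features pooled over each index set could then be aligned separately with text embeddings, which is the role the abstract assigns to visual-language contrastive learning.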

[104] DermAI: Clinical dermatology acquisition through quality-driven image collection for AI classification in mobile

Thales Bezerra,Emanoel Thyago,Kelvin Cunha,Rodrigo Abreu,Fábio Papais,Francisco Mauro,Natália Lopes,Érico Medeiros,Jéssica Guido,Shirley Cruz,Paulo Borba,Tsang Ing Ren

Main category: cs.CV

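TL;DR: DermAI is a lightweight smartphone-based application for real-time capture, annotation, and classification of skin lesion images, intended to advance AI adoption in dermatology. It targets biased datasets, variable image quality, and limited validation.

Details Motivation: AI adoption in dermatology is limited by biased datasets, variable image quality, and limited validation, which DermAI aims to address.

Contribution: DermAI enables real-time image capture with on-device quality checks on a smartphone, supports local model adaptation, and yields a diverse clinical dataset.

Method: A smartphone application captures images in real time and checks their quality, and models are fine-tuned on local data to improve performance.

Result: In preliminary experiments, models trained on public datasets failed to generalize to the DermAI dataset, while fine-tuning on local data improved performance.

Insight: The study highlights the importance of standardized, diverse data collection, particularly where healthcare needs meet machine learning development.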

Abstract: AI-based dermatology adoption remains limited by biased datasets, variable image quality, and limited validation. We introduce DermAI, a lightweight, smartphone-based application that enables real-time capture, annotation, and classification of skin lesions during routine consultations. Unlike prior dermoscopy-focused tools, DermAI performs on-device quality checks and local model adaptation. The DermAI clinical dataset encompasses a wide range of skin tones, ethnicities, and source devices. In preliminary experiments, models trained on public datasets failed to generalize to our samples, while fine-tuning with local data improved performance. These results highlight the importance of standardized, diverse data collection aligned with healthcare needs and oriented to machine learning development.

[105] MSGNav: Unleashing the Power of Multi-modal 3D Scene Graph for Zero-Shot Embodied Navigation

Xun Huang,Shijia Zhao,Yunxiang Wang,Xin Lu,Wanfa Zhang,Rongsheng Qu,Weixin Li,Yunhong Wang,Chenglu Wen

Main category: cs.CV

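TL;DR: MSGNav is a zero-shot navigation system built on a Multi-modal 3D Scene Graph (M3DSG) that preserves visual cues by replacing textual relational edges with dynamically assigned images. Together with several new modules, it addresses limitations of prior methods and achieves state-of-the-art performance.

Details Motivation: Existing zero-shot navigation methods that build explicit 3D scene graphs typically compress visual observations into text-only relations, leading to high construction cost, loss of visual information, and constrained vocabularies. A method is needed that preserves visual cues and supports open vocabularies.

Contribution: 1. Proposes the Multi-modal 3D Scene Graph (M3DSG), which replaces textual relational edges with dynamically assigned images; 2. Develops the MSGNav system, which includes several new modules (such as Key Subgraph Selection); 3. Identifies and addresses the "last-mile" problem in zero-shot navigation.

Method: 1. M3DSG preserves visual cues; 2. MSGNav includes a Key Subgraph Selection module for efficient reasoning, an Adaptive Vocabulary Update module, and a Closed-Loop Reasoning module; 3. A Visibility-based Viewpoint Decision module addresses the last-mile problem.

Result: Achieves state-of-the-art performance on the GOAT-Bench and HM3D-OVON datasets.

Insight: Preserving visual relations in a multi-modal scene graph improves the generalization and accuracy of zero-shot navigation, underscoring the value of visual information in complex tasks.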

Abstract: Embodied navigation is a fundamental capability for robotic agents. Real-world deployment requires open vocabulary generalization and low training overhead, motivating zero-shot methods rather than task-specific RL training. However, existing zero-shot methods that build explicit 3D scene graphs often compress rich visual observations into text-only relations, leading to high construction cost, irreversible loss of visual evidence, and constrained vocabularies. To address these limitations, we introduce the Multi-modal 3D Scene Graph (M3DSG), which preserves visual cues by replacing textual relational edges with dynamically assigned images. Built on M3DSG, we propose MSGNav, a zero-shot navigation system that includes a Key Subgraph Selection module for efficient reasoning, an Adaptive Vocabulary Update module for open vocabulary support, and a Closed-Loop Reasoning module for accurate exploration reasoning. We further identify the last-mile problem in zero-shot navigation, namely determining the feasible target location with a suitable final viewpoint, and propose a Visibility-based Viewpoint Decision module to explicitly resolve it. Comprehensive experimental results demonstrate that MSGNav achieves state-of-the-art performance on the GOAT-Bench and HM3D-OVON datasets. The open-source code will be publicly available.

[106] SAMIRO: Spatial Attention Mutual Information Regularization with a Pre-trained Model as Oracle for Lane Detection

Hyunjong Lee,Jangho Lee,Jaekoo Lee

Main category: cs.CV

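TL;DR: The paper proposes SAMIRO, which uses a pre-trained model as an oracle together with spatial attention mutual information regularization to improve lane detection, and which applies to multiple state-of-the-art models and datasets.

Details Motivation: Real-world challenges such as background clutter, varying illumination, and occlusion make lane detection difficult, especially for data-driven approaches whose data collection and annotation are costly. A method is needed that exploits contextual and global information while reducing dependence on large annotated datasets.

Contribution: Proposes SAMIRO, which combines a pre-trained model used as an oracle with spatial attention mutual information regularization to improve lane detection, and validates its adaptability across models and datasets.

Method: SAMIRO transfers knowledge from a pre-trained model to the lane detection task through spatial attention and mutual information regularization, while preserving domain-agnostic spatial information.

Result: On major benchmarks including CULane, Tusimple, and LLAMAS, SAMIRO consistently improves the performance of different lane detection models.

Insight: SAMIRO shows how a pre-trained model and a regularization term can improve a specific task, and its plug-and-play design demonstrates flexibility and extensibility.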

Abstract: Lane detection is an important topic in future mobility solutions. Real-world environmental challenges such as background clutter, varying illumination, and occlusions pose significant obstacles to effective lane detection, particularly when relying on data-driven approaches that require substantial effort and cost for data collection and annotation. To address these issues, lane detection methods must leverage contextual and global information from surrounding lanes and objects. In this paper, we propose Spatial Attention Mutual Information Regularization with a pre-trained model as an Oracle, called SAMIRO. SAMIRO enhances lane detection performance by transferring knowledge from a pre-trained model while preserving domain-agnostic spatial information. Leveraging SAMIRO's plug-and-play characteristic, we integrate it into various state-of-the-art lane detection approaches and conduct extensive experiments on major benchmarks such as CULane, Tusimple, and LLAMAS. The results demonstrate that SAMIRO consistently improves performance across different models and datasets. The code will be made available upon publication.

[107] MonkeyOCR v1.5 Technical Report: Unlocking Robust Document Parsing for Complex Patterns

Jiarui Zhang,Yuliang Liu,Zijun Wu,Guosheng Pang,Zhili Ye,Yupei Zhong,Junteng Ma,Tao Wei,Haiyang Xu,Weikai Chen,Zeen Wang,Qiangjun Ji,Fanxi Zhou,Qi Zhang,Yuanrui Hu,Jiahao Liu,Zhang Li,Ziyang Zhang,Qiang Liu,Xiang Bai

Main category: cs.CV

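TL;DR: MonkeyOCR v1.5 is a unified vision-language framework that handles complex document layouts with a two-stage parsing pipeline. The first stage uses a large multimodal model to predict layout and reading order; the second performs localized content recognition. A visual consistency-based reinforcement learning scheme and two specialized modules improve the parsing of complex tables, and experiments show that it outperforms existing methods.

Details Motivation: Complex real-world document layouts (multi-level tables, embedded images or formulas, cross-page structures) remain challenging for existing OCR systems. The goal is more robust and accurate document parsing.

Contribution: 1. A unified two-stage parsing framework; 2. A visual consistency-based reinforcement learning scheme that improves table parsing quality; 3. Two modules, Image-Decoupled Table Parsing and Type-Guided Table Merging, for tables with embedded images and tables that cross pages or columns.

Method: 1. Stage one: a large multimodal model jointly predicts document layout and reading order; 2. Stage two: text, formulas, and tables are recognized locally within the detected regions; 3. Visual consistency-based reinforcement learning, which scores recognition quality via render-and-compare alignment, improves table structure; 4. The Image-Decoupled Table Parsing and Type-Guided Table Merging modules handle complex tables.

Result: On OmniDocBench v1.5, MonkeyOCR v1.5 achieves state-of-the-art performance, outperforming PPOCR-VL and MinerU 2.5 and showing strong robustness in visually complex documents.

Insight: A two-stage parsing pipeline that combines visual and language information, together with dedicated handling of complex tables, is key to improving OCR on complex documents.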

Abstract: Document parsing is a core task in document intelligence, supporting applications such as information extraction, retrieval-augmented generation, and automated document analysis. However, real-world documents often feature complex layouts with multi-level tables, embedded images or formulas, and cross-page structures, which remain challenging for existing OCR systems. We introduce MonkeyOCR v1.5, a unified vision-language framework that enhances both layout understanding and content recognition through a two-stage parsing pipeline. The first stage employs a large multimodal model to jointly predict document layout and reading order, leveraging visual information to ensure structural and sequential consistency. The second stage performs localized recognition of text, formulas, and tables within detected regions, maintaining high visual fidelity while reducing error propagation. To address complex table structures, we propose a visual consistency-based reinforcement learning scheme that evaluates recognition quality via render-and-compare alignment, improving structural accuracy without manual annotations. Additionally, two specialized modules, Image-Decoupled Table Parsing and Type-Guided Table Merging, are introduced to enable reliable parsing of tables containing embedded images and reconstruction of tables crossing pages or columns. Comprehensive experiments on OmniDocBench v1.5 demonstrate that MonkeyOCR v1.5 achieves state-of-the-art performance, outperforming PPOCR-VL and MinerU 2.5 while showing exceptional robustness in visually complex document scenarios.

[108] LLM-YOLOMS: Large Language Model-based Semantic Interpretation and Fault Diagnosis for Wind Turbine Components

Yaru Li,Yanxue Wang,Meng Li,Xinming Li,Jianbo Feng

Main category: cs.CV

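TL;DR: This paper proposes LLM-YOLOMS, a framework that combines YOLOMS with a large language model (LLM) for semantic interpretation and fault diagnosis of wind turbine components. Enhanced feature extraction and semantic reasoning improve fault detection accuracy and the interpretability of maintenance recommendations.

Details Motivation: Existing fault detection methods for wind turbine components rely mainly on visual recognition; their outputs lack semantic interpretability and cannot support maintenance decisions. The paper aims to make diagnostic results more interpretable and practical by combining visual detection with a language model.

Contribution: 1) An integrated framework combining YOLOMS with an LLM; 2) A lightweight key-value (KV) mapping module that converts visual outputs into structured text; 3) A fault detection accuracy of 90.6% and maintenance reports with an average accuracy of 89%.

Method: YOLOMS enhances feature extraction with multi-scale detection and sliding-window cropping; the KV module converts detection results into textual representations; a domain-tuned LLM performs semantic reasoning to generate fault analyses and maintenance recommendations.

Result: Experiments show a fault detection accuracy of 90.6% and an average maintenance-report accuracy of 89%, improving the interpretability of diagnoses and their value for decision support.

Insight: Multimodal approaches that combine vision and language models can make industrial fault diagnosis more interpretable, suggesting a direction for intelligent operation and maintenance.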

Abstract: The health condition of wind turbine (WT) components is crucial for ensuring stable and reliable operation. However, existing fault detection methods are largely limited to visual recognition, producing structured outputs that lack semantic interpretability and fail to support maintenance decision-making. To address these limitations, this study proposes an integrated framework that combines YOLOMS with a large language model (LLM) for intelligent fault analysis and diagnosis. Specifically, YOLOMS employs multi-scale detection and sliding-window cropping to enhance fault feature extraction, while a lightweight key-value (KV) mapping module bridges the gap between visual outputs and textual inputs. This module converts YOLOMS detection results into structured textual representations enriched with both qualitative and quantitative attributes. A domain-tuned LLM then performs semantic reasoning to generate interpretable fault analyses and maintenance recommendations. Experiments on real-world datasets demonstrate that the proposed framework achieves a fault detection accuracy of 90.6% and generates maintenance reports with an average accuracy of 89%, thereby improving the interpretability of diagnostic results and providing practical decision support for the operation and maintenance of wind turbines.
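
The key-value mapping step, which turns detector output into structured text for the LLM, can be sketched as a small serialization function. Everything specific below is hypothetical: the field names, the detection dictionary layout, and the `key=value; ...` format are illustrative choices, since the abstract only states that the text carries both qualitative and quantitative attributes.

```python
def detection_to_text(det):
    """Serialize one detection as key=value pairs for an LLM prompt.

    `det` is a hypothetical detector output with a component name,
    a fault label, a confidence score, and a pixel-space box
    (x1, y1, x2, y2).
    """
    x1, y1, x2, y2 = det["box"]
    fields = {
        "component": det["component"],           # qualitative
        "fault": det["fault"],                   # qualitative
        "confidence": f"{det['score']:.2f}",     # quantitative
        "box_area_px": str((x2 - x1) * (y2 - y1)),  # quantitative
    }
    return "; ".join(f"{k}={v}" for k, v in fields.items())

example = {"component": "blade", "fault": "crack",
           "score": 0.913, "box": (100, 40, 180, 90)}
prompt_line = detection_to_text(example)
```

One such line per detection could be concatenated into the prompt of the domain-tuned LLM, giving it explicit attributes to reason over rather than raw tensors.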

[109] RodEpil: A Video Dataset of Laboratory Rodents for Seizure Detection and Benchmark Evaluation

Daniele Perlo,Vladimir Despotovic,Selma Boudissa,Sang-Yoon Kim,Petr Nazarov,Yanrong Zhang,Max Wintermark,Olivier Keunen

Main category: cs.CV

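TL;DR: The paper introduces RodEpil, a video dataset of laboratory rodents for seizure detection, with clips labeled as normal activity or seizure. A TimeSformer baseline reaches an average F1-score of 97%.

Details Motivation: To meet the need for non-invasive, video-based monitoring in preclinical epilepsy research, the authors provide a curated video dataset and a benchmark evaluation.

Contribution: Releases the RodEpil dataset of labeled video clips and reports baseline seizure detection performance with a TimeSformer model.

Method: Experiments use TimeSformer (a transformer-based video classifier) with five-fold cross-validation, partitioned strictly by subject to prevent data leakage.

Result: TimeSformer discriminates between seizure and normal activity with an average F1-score of 97%.

Insight: The dataset and baseline code provide a reproducible basis for non-invasive, video-based monitoring in preclinical epilepsy research.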

Abstract: We introduce a curated video dataset of laboratory rodents for automatic detection of convulsive events. The dataset contains short (10 s) top-down and side-view video clips of individual rodents, labeled at clip level as normal activity or seizure. It includes 10,101 negative samples and 2,952 positive samples collected from 19 subjects. We describe the data curation, annotation protocol and preprocessing pipeline, and report baseline experiments using a transformer-based video classifier (TimeSformer). Experiments employ five-fold cross-validation with strict subject-wise partitioning to prevent data leakage (no subject appears in more than one fold). Results show that the TimeSformer architecture enables discrimination between seizure and normal activity with an average F1-score of 97%. The dataset and baseline code are publicly released to support reproducible research on non-invasive, video-based monitoring in preclinical epilepsy research. RodEpil Dataset access - DOI: 10.5281/zenodo.17601357
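
Subject-wise partitioning means folds are built from subjects, not from clips, so that clips of the same animal never straddle a train/test boundary. The sketch below shows one simple way to do this with a round-robin assignment of subjects to folds. The assignment rule and the synthetic subject IDs are assumptions; the abstract does not describe how subjects were distributed or whether class balance was considered.

```python
def subject_wise_folds(subject_ids, n_folds=5):
    """Group clip indices into folds so each subject is in exactly one fold.

    Subjects are assigned to folds round-robin in sorted order, an
    illustrative rule that ignores class balance.
    """
    subjects = sorted(set(subject_ids))
    fold_of = {s: k % n_folds for k, s in enumerate(subjects)}
    folds = [[] for _ in range(n_folds)]
    for clip_idx, s in enumerate(subject_ids):
        folds[fold_of[s]].append(clip_idx)
    return folds

# Synthetic example: 190 clips spread over 19 subjects.
subject_ids = [i % 19 for i in range(190)]
folds = subject_wise_folds(subject_ids, n_folds=5)

# For each fold, the set of subjects whose clips it contains.
subjects_per_fold = [{subject_ids[i] for i in fold} for fold in folds]
```

With 19 subjects and five folds the folds cannot hold equal numbers of subjects, so held-out sets differ in size. A random clip-level split would avoid that imbalance but would leak subject identity across folds.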

[110] OpenSR-SRGAN: A Flexible Super-Resolution Framework for Multispectral Earth Observation Data

Simon Donike,Cesar Aybar,Julio Contreras,Luis Gómez-Chova

Main category: cs.CV

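TL;DR: OpenSR-SRGAN is an open, modular super-resolution framework for multispectral Earth observation data, designed to simplify configuring and applying SRGAN-style models.

Details Motivation: Existing super-resolution implementations usually require code changes to adapt to different tasks and datasets, which raises the barrier to use. OpenSR-SRGAN lowers that barrier with a configuration-driven approach.

Contribution: Provides a flexible, configuration-driven SRGAN framework that supports multiple architectures, scale factors, and band setups.

Method: Generators, discriminators, loss functions, and training schedules are defined through configuration files, so model code does not need to be modified. The framework also includes hooks for logging, validation, and large-scene inference.

Result: OpenSR-SRGAN offers ready-to-use configurations for multispectral satellite data (such as Sentinel-2) and serves as a benchmark implementation rather than a state-of-the-art model.

Insight: Turning super-resolution into a configuration-driven workflow improves flexibility and reproducibility across diverse Earth observation datasets.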

Abstract: We present OpenSR-SRGAN, an open and modular framework for single-image super-resolution in Earth Observation. The software provides a unified implementation of SRGAN-style models that is easy to configure, extend, and apply to multispectral satellite data such as Sentinel-2. Instead of requiring users to modify model code, OpenSR-SRGAN exposes generators, discriminators, loss functions, and training schedules through concise configuration files, making it straightforward to switch between architectures, scale factors, and band setups. The framework is designed as a practical tool and benchmark implementation rather than a state-of-the-art model. It ships with ready-to-use configurations for common remote sensing scenarios, sensible default settings for adversarial training, and built-in hooks for logging, validation, and large-scene inference. By turning GAN-based super-resolution into a configuration-driven workflow, OpenSR-SRGAN lowers the entry barrier for researchers and practitioners who wish to experiment with SRGANs, compare models in a reproducible way, and deploy super-resolution pipelines across diverse Earth-observation datasets.

[111] SPOT: Sparsification with Attention Dynamics via Token Relevance in Vision Transformers

Oded Schlesinger,Amirhossein Farzam,J. Matias Di Martino,Guillermo Sapiro

Main category: cs.CV

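TL;DR: SPOT is a token sparsification method based on attention dynamics that predicts token relevance and prunes redundant tokens, yielding efficiency gains of up to 40% over standard ViTs while maintaining or improving accuracy.

Details Motivation: The computational cost of Vision Transformers (ViTs) grows quadratically with the number of tokens, so an efficient way of removing redundant tokens is needed.

Contribution: Proposes the SPOT framework, which infers token importance from token embeddings, interactions, and attention dynamics across layers to detect and prune redundant tokens early, supporting flexible trade-offs between performance and efficiency.

Method: Lightweight predictors assess token relevance across ViT layers and guide sparsification, pruning redundant tokens to cut computation.

Result: Experiments show efficiency gains of up to 40% over standard ViTs while maintaining or improving accuracy, and the predictors plug into various ViT architectures.

Insight: SPOT's token pruning is general and interpretable, offering a route to efficient ViT deployment.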

Abstract: While Vision Transformers (ViT) have demonstrated remarkable performance across diverse tasks, their computational demands are substantial, scaling quadratically with the number of processed tokens. Compact attention representations, reflecting token interaction distributions, can guide early detection and reduction of less salient tokens prior to attention computation. Motivated by this, we present SParsification with attentiOn dynamics via Token relevance (SPOT), a framework for early detection of redundant tokens within ViTs that leverages token embeddings, interactions, and attention dynamics across layers to infer token importance, resulting in a more context-aware and interpretable relevance detection process. SPOT informs token sparsification and facilitates the elimination of such tokens, improving computational efficiency without sacrificing performance. SPOT employs computationally lightweight predictors that can be plugged into various ViT architectures and learn to derive effective input-specific token prioritization across layers. Its versatile design supports a range of performance levels adaptable to varying resource constraints. Empirical evaluations demonstrate significant efficiency gains of up to 40% compared to standard ViTs, while maintaining or even improving accuracy. Code and models are available at https://github.com/odedsc/SPOT .
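
Because attention cost scales quadratically with token count, dropping tokens before attention pays off more than linearly. The sketch below keeps the top-scoring tokens for a given keep ratio and counts the query-key pairs that remain. It is a generic top-k pruning illustration, not SPOT itself: SPOT learns its relevance predictors, whereas here the scores are simply given, and the keep ratio and function names are hypothetical.

```python
def prune_tokens(tokens, relevance, keep_ratio):
    """Keep the highest-relevance tokens, preserving original order.

    `relevance` holds one score per token (assumed to come from some
    predictor); at least one token is always kept.
    """
    k = max(1, round(len(tokens) * keep_ratio))
    ranked = sorted(range(len(tokens)), key=lambda i: relevance[i], reverse=True)
    keep = sorted(ranked[:k])
    return [tokens[i] for i in keep]

def attention_pair_count(num_tokens):
    """Number of query-key pairs in full self-attention."""
    return num_tokens * num_tokens

kept = prune_tokens(["a", "b", "c", "d"], [0.9, 0.1, 0.5, 0.7], keep_ratio=0.5)

# Effect on pairwise attention cost for a 196-token input.
full_pairs = attention_pair_count(196)
pruned_pairs = attention_pair_count(len(prune_tokens(list(range(196)), list(range(196)), 0.6)))
```

Keeping 60% of the tokens leaves well under half of the query-key pairs. This is why pruning before the attention computation, as the abstract describes, is the useful place to intervene.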

[112] SemanticVLA: Semantic-Aligned Sparsification and Enhancement for Efficient Robotic Manipulation

Wei Li,Renshan Zhang,Rui Shao,Zhijian Fang,Kaiwen Zhou,Zhuotao Tian,Liqiang Nie

Main category: cs.CV

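TL;DR: SemanticVLA is a new vision-language-action (VLA) framework that uses semantic-aligned sparsification and enhancement to improve the efficiency and performance of robotic manipulation.

Details Motivation: Current VLA models face two key limitations in robotic manipulation: 1) perceptual redundancy, where irrelevant visual inputs are processed inefficiently; 2) superficial instruction-vision alignment, which hampers the semantic grounding of actions.

Contribution: 1) Designs SD-Pruner for semantic-aligned sparsification; 2) Proposes SH-Fuser to fuse sparse and dense features; 3) Introduces SA-Coupler to improve the transformation from perception to action.

Method: 1) SD-Pruner sparsifies via ID-Pruner (in SigLIP) and SA-Pruner (in DINOv2); 2) SH-Fuser fuses dense patches and sparse tokens across SigLIP and DINOv2; 3) SA-Coupler replaces the conventional observation-to-DoF approach for action modeling.

Result: On the LIBERO benchmark, SemanticVLA surpasses OpenVLA by 21.1% in success rate while reducing training cost 3.0-fold and inference latency 2.7-fold.

Insight: Semantic-aligned sparsification and enhancement can improve both the efficiency and the performance of robotic manipulation while lowering computational cost.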

Abstract: Vision-Language-Action (VLA) models have advanced in robotic manipulation, yet practical deployment remains hindered by two key limitations: 1) perceptual redundancy, where irrelevant visual inputs are processed inefficiently, and 2) superficial instruction-vision alignment, which hampers semantic grounding of actions. In this paper, we propose SemanticVLA, a novel VLA framework that performs Semantic-Aligned Sparsification and Enhancement for Efficient Robotic Manipulation. Specifically: 1) To sparsify redundant perception while preserving semantic alignment, the Semantic-guided Dual Visual Pruner (SD-Pruner) comprises two parts: the Instruction-driven Pruner (ID-Pruner) extracts global action cues and local semantic anchors in SigLIP, and the Spatial-aggregation Pruner (SA-Pruner) compacts geometry-rich features into task-adaptive tokens in DINOv2. 2) To exploit sparsified features and integrate semantics with spatial geometry, the Semantic-complementary Hierarchical Fuser (SH-Fuser) fuses dense patches and sparse tokens across SigLIP and DINOv2 for coherent representation. 3) To enhance the transformation from perception to action, the Semantic-conditioned Action Coupler (SA-Coupler) replaces the conventional observation-to-DoF approach, yielding more efficient and interpretable behavior modeling for manipulation tasks. Extensive experiments on simulation and real-world tasks show that SemanticVLA sets a new SOTA in both performance and efficiency. SemanticVLA surpasses OpenVLA on the LIBERO benchmark by 21.1% in success rate, while reducing training cost and inference latency by 3.0-fold and 2.7-fold. SemanticVLA is open-sourced and publicly available at https://github.com/JiuTian-VL/SemanticVLA

[113] Dynamic Avatar-Scene Rendering from Human-centric Context

Wenqing Wang,Haosen Yang,Josef Kittler,Xiatian Zhu

Main category: cs.CV

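TL;DR: The paper proposes a strategy called Separate-then-Map (StM) for reconstructing dynamic humans interacting with real-world environments from monocular video. It bridges separately optimized models through a dedicated information mapping mechanism, significantly improving visual quality and rendering accuracy.

Details Motivation: Existing methods either model the dynamic scene holistically or model scenes and backgrounds separately while introducing parametric human priors. These methods either fail to handle the distinct motion characteristics of different components (especially humans) or ignore the information exchange between components, leading to spatial inconsistencies and visual artifacts.

Contribution: Proposes the StM strategy, which unifies separately modeled components through shared transformation functions, improving computational efficiency and ensuring spatial and visual coherence between humans and their surroundings.

Method: Adopts the Separate-then-Map strategy: a dedicated mapping mechanism bridges the separately optimized models, and shared transformation functions unify the Gaussian attributes.

Result: Experiments on monocular video datasets show that StM significantly outperforms existing methods in visual quality and rendering accuracy, particularly at complex human-scene interaction boundaries.

Insight: Modeling components separately and adding a dedicated information mapping mechanism can effectively resolve the spatial inconsistencies and visual artifacts that arise in human-scene interaction.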

Abstract: Reconstructing dynamic humans interacting with real-world environments from monocular videos is an important and challenging task. Despite considerable progress in 4D neural rendering, existing approaches either model dynamic scenes holistically or model scenes and backgrounds separately in order to introduce parametric human priors. However, these approaches either neglect the distinct motion characteristics of the various components in the scene, especially humans, leading to incomplete reconstructions, or ignore the information exchange between the separately modeled components, resulting in spatial inconsistencies and visual artifacts at human-scene boundaries. To address this, we propose the Separate-then-Map (StM) strategy, which introduces a dedicated information mapping mechanism to bridge separately defined and optimized models. Our method employs a shared transformation function for each Gaussian attribute to unify separately modeled components, enhancing computational efficiency by avoiding exhaustive pairwise interactions while ensuring spatial and visual coherence between humans and their surroundings. Extensive experiments on monocular video datasets demonstrate that StM significantly outperforms existing state-of-the-art methods in both visual quality and rendering accuracy, particularly at challenging human-scene interaction boundaries.

[114] A Style is Worth One Code: Unlocking Code-to-Style Image Generation with Discrete Style Space

Huijie Liu,Shuhao Cui,Haoxiang Cao,Shuai Ma,Kai Wu,Guoliang Kang

Main category: cs.CV

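TL;DR: The paper introduces a new task and an open-source method, CoTyle, which generates images with novel and consistent visual styles from a numerical style code, removing the dependence on text prompts, reference images, or complex style representations found in conventional methods.

Details Motivation: Existing visual style generation methods usually rely on text prompts or reference images and struggle to guarantee style consistency and diversity. To address this, the paper explores controlling style generation directly through a numerical code.

Contribution: 1. Proposes the new task of code-to-style image generation; 2. proposes CoTyle, the first open-source method for this task, which combines discrete style codes with a diffusion model to map a numerical code directly to a style; 3. shows that a style can feasibly be controlled by a single numerical code.

Method: 1. Train a discrete style codebook on a collection of images to extract style embeddings; 2. use a text-to-image diffusion model (T2I-DM) conditioned on the style embeddings to generate stylized images; 3. train an autoregressive style generator to model the distribution of style embeddings, enabling the synthesis of new styles.

Result: Experiments confirm that CoTyle can generate diverse and consistent visual styles from a single numerical code, demonstrating the method's effectiveness and creativity.

Insight: Complex visual styles can be controlled efficiently with a simple numerical code, offering the style generation field a new idea and a new tool.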

Abstract: Innovative visual stylization is a cornerstone of artistic creation, yet generating novel and consistent visual styles remains a significant challenge. Existing generative approaches typically rely on lengthy textual prompts, reference images, or parameter-efficient fine-tuning to guide style-aware image generation, but often struggle with style consistency, limited creativity, and complex style representations. In this paper, we affirm that a style is worth one numerical code by introducing the novel task, code-to-style image generation, which produces images with novel, consistent visual styles conditioned solely on a numerical style code. To date, this field has been explored primarily by industry (e.g., Midjourney), with no open-source research from the academic community. To fill this gap, we propose CoTyle, the first open-source method for this task. Specifically, we first train a discrete style codebook from a collection of images to extract style embeddings. These embeddings serve as conditions for a text-to-image diffusion model (T2I-DM) to generate stylistic images. Subsequently, we train an autoregressive style generator on the discrete style embeddings to model their distribution, allowing the synthesis of novel style embeddings. During inference, a numerical style code is mapped to a unique style embedding by the style generator, and this embedding guides the T2I-DM to generate images in the corresponding style. Unlike existing methods, our method offers unparalleled simplicity and diversity, unlocking a vast space of reproducible styles from minimal input. Extensive experiments validate that CoTyle effectively turns a numerical code into a style controller, demonstrating that a style is worth one code.

[115] OmniVGGT: Omni-Modality Driven Visual Geometry Grounded

Haosong Peng,Hao Li,Yalun Dai,Yushi Lan,Yihang Luo,Tianyu Qi,Zhengshen Zhang,Yufeng Zhan,Junfei Zhang,Wenchao Xu,Ziwei Liu

Main category: cs.CV

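TL;DR: OmniVGGT is an omni-modality driven visual geometry grounded framework. Using a GeoAdapter and a stochastic multimodal fusion strategy, it makes effective use of geometric information (such as depth and camera parameters) and achieves SOTA results with both RGB-only and multimodal inputs.

Details Motivation: Existing foundation models mostly rely on RGB input and overlook the importance of geometric information. OmniVGGT aims to make full use of these auxiliary modalities to improve model performance.

Contribution: 1. Proposes GeoAdapter, which injects geometric information through zero-initialized convolutions without disrupting the foundation model's representation space; 2. proposes a stochastic multimodal fusion strategy that supports an arbitrary number of input modalities and improves robustness.

Method: 1. GeoAdapter encodes depth and camera parameters; 2. modality subsets are randomly sampled during training to encourage robust representation learning.

Result: Outperforms existing methods on tasks such as monocular/multi-view depth estimation and multi-view stereo; when integrated into VLA models, it improves performance and outperforms the point-cloud-based baseline.

Insight: Geometric information is essential for vision tasks, and a lightweight multimodal design can substantially improve model performance without hurting efficiency.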

Abstract: General 3D foundation models have started to lead the trend of unifying diverse vision tasks, yet most assume RGB-only inputs and ignore readily available geometric cues (e.g., camera intrinsics, poses, and depth maps). To address this issue, we introduce OmniVGGT, a novel framework that can effectively benefit from an arbitrary number of auxiliary geometric modalities during both training and inference. In our framework, a GeoAdapter is proposed to encode depth and camera intrinsics/extrinsics into a spatial foundation model. It employs zero-initialized convolutions to progressively inject geometric information without disrupting the foundation model’s representation space. This design ensures stable optimization with negligible overhead, maintaining inference speed comparable to VGGT even with multiple additional inputs. Additionally, a stochastic multimodal fusion regimen is proposed, which randomly samples modality subsets per instance during training. This enables an arbitrary number of modality inputs during testing and promotes learning robust spatial representations instead of overfitting to auxiliary cues. Comprehensive experiments on monocular/multi-view depth estimation, multi-view stereo, and camera pose estimation demonstrate that OmniVGGT outperforms prior methods with auxiliary inputs and achieves state-of-the-art results even with RGB-only input. To further highlight its practical utility, we integrated OmniVGGT into vision-language-action (VLA) models. The enhanced VLA model by OmniVGGT not only outperforms the vanilla point-cloud-based baseline on mainstream benchmarks, but also effectively leverages accessible auxiliary inputs to achieve consistent gains on robotic tasks.
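The stochastic multimodal fusion regimen described above (randomly sampling a modality subset per training instance) can be sketched roughly as follows. The modality names, the independent per-modality keep probability, and the rule that RGB is always present are assumptions for illustration; the abstract does not state the actual sampling distribution.

```python
import random

# Hypothetical auxiliary modalities; RGB is assumed to always be available.
AUX_MODALITIES = ("depth", "intrinsics", "extrinsics")


def sample_modalities(rng, keep_prob=0.5):
    """Pick the inputs for one training instance.

    Each auxiliary modality is kept independently with probability `keep_prob`,
    so the model sees every subset size from RGB-only to all modalities.
    """
    chosen = ["rgb"]
    for name in AUX_MODALITIES:
        if rng.random() < keep_prob:
            chosen.append(name)
    return chosen


rng = random.Random(0)  # seeded only to make the sketch reproducible
batch_inputs = [sample_modalities(rng) for _ in range(4)]
```

Because every subset can appear during training, a model trained this way can plausibly be queried at test time with whichever auxiliary inputs happen to be available.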

[116] From 2D to 3D Without Extra Baggage: Data-Efficient Cancer Detection in Digital Breast Tomosynthesis

Yen Nhi Truong Vu,Dan Guo,Sripad Joshi,Harshit Kumar,Jason Su,Thomas Paul Matthews

Main category: cs.CV

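TL;DR: The paper proposes an architecture called M&M-3D that transfers a 2D FFDM model to 3D DBT data without adding parameters, significantly improving breast cancer detection performance.

Details Motivation: Annotated DBT data is limited, and existing methods either discard 3D information or require complex architectures and large amounts of data. M&M-3D aims to reuse the 2D model's parameters efficiently while retaining 3D reasoning ability.

Contribution: Proposes the M&M-3D architecture, which achieves efficient 2D-to-3D transfer through 3D feature construction and mixing that add no parameters, and performs well in both low-data and high-data regimes.

Method: M&M-3D learns 3D reasoning by repeatedly mixing 3D features with slice-level information. It modifies the operations of the 2D model directly without adding parameters, which enables weight transfer.

Result: M&M-3D outperforms 2D projection and 3D slice-based baselines by 11-54% on localization and 3-10% on classification, and substantially outperforms complex 3D methods in the low-data regime.

Insight: 3D reasoning does not require complex architectures or large datasets; efficient feature design and transfer learning can deliver substantial gains. The approach may carry over to other 3D medical imaging tasks.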

Abstract: Digital Breast Tomosynthesis (DBT) enhances finding visibility for breast cancer detection by providing volumetric information that reduces the impact of overlapping tissues; however, limited annotated data has constrained the development of deep learning models for DBT. To address data scarcity, existing methods attempt to reuse 2D full-field digital mammography (FFDM) models by either flattening DBT volumes or processing slices individually, thus discarding volumetric information. Alternatively, 3D reasoning approaches introduce complex architectures that require more DBT training data. Tackling these drawbacks, we propose M&M-3D, an architecture that enables learnable 3D reasoning while remaining parameter-free relative to its FFDM counterpart, M&M. M&M-3D constructs malignancy-guided 3D features, and 3D reasoning is learned through repeatedly mixing these 3D features with slice-level information. This is achieved by modifying operations in M&M without adding parameters, thus enabling direct weight transfer from FFDM. Extensive experiments show that M&M-3D surpasses 2D projection and 3D slice-based methods by 11-54% for localization and 3-10% for classification. Additionally, M&M-3D outperforms complex 3D reasoning variants by 20-47% for localization and 2-10% for classification in the low-data regime, while matching their performance in high-data regime. On the popular BCS-DBT benchmark, M&M-3D outperforms previous top baseline by 4% for classification and 10% for localization.

[117] One Small Step in Latent, One Giant Leap for Pixels: Fast Latent Upscale Adapter for Your Diffusion Models

Aleksandr Razin,Danil Kazantsev,Ilya Makarov

Main category: cs.CV

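TL;DR: This paper proposes the Latent Upscaler Adapter (LUA), a lightweight module that performs super-resolution directly in the latent space of a diffusion model, avoiding the latency and artifacts of conventional pixel-space super-resolution.

Details Motivation: Diffusion models sample slowly and at high cost beyond their training resolution, while conventional image super-resolution operates after decoding and introduces artifacts and extra latency.

Contribution: LUA achieves high-resolution synthesis with a single feed-forward pass in latent space, without modifying the base model or adding diffusion stages. It substantially reduces decoding and upscaling time and supports multiple upscaling factors.

Method: Uses a shared Swin-style backbone with scale-specific pixel-shuffle heads, supports 2x and 4x upscaling, and operates in latent space for efficiency.

Result: LUA adds only 0.42 s for 1024 px generation, roughly 3x faster than conventional pixel-space super-resolution (1.87 s), while maintaining comparable perceptual quality.

Insight: LUA generalizes well across the latent spaces of different VAEs, so it can be deployed easily without retraining for each new decoder.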

Abstract: Diffusion models struggle to scale beyond their training resolutions, as direct high-resolution sampling is slow and costly, while post-hoc image super-resolution (ISR) introduces artifacts and additional latency by operating after decoding. We present the Latent Upscaler Adapter (LUA), a lightweight module that performs super-resolution directly on the generator’s latent code before the final VAE decoding step. LUA integrates as a drop-in component, requiring no modifications to the base model or additional diffusion stages, and enables high-resolution synthesis through a single feed-forward pass in latent space. A shared Swin-style backbone with scale-specific pixel-shuffle heads supports 2x and 4x factors and remains compatible with image-space SR baselines, achieving comparable perceptual quality with nearly 3x lower decoding and upscaling time (adding only +0.42 s for 1024 px generation from 512 px, compared to 1.87 s for pixel-space SR using the same SwinIR architecture). Furthermore, LUA shows strong generalization across the latent spaces of different VAEs, making it easy to deploy without retraining from scratch for each new decoder. Extensive experiments demonstrate that LUA closely matches the fidelity of native high-resolution generation while offering a practical and efficient path to scalable, high-fidelity image synthesis in modern diffusion pipelines.
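The pixel-shuffle operation used by LUA's scale-specific heads rearranges channels into spatial resolution. The sketch below shows the standard rearrangement on nested Python lists, following the usual convention that output position `(h*r + i, w*r + j)` of channel `c` is read from input channel `c*r*r + i*r + j`. It illustrates the generic operation only; the layer sizes and the toy input are assumptions, and nothing here reproduces LUA's actual network.

```python
def pixel_shuffle(x, r):
    """Rearrange a (C*r*r, H, W) nested list into (C, H*r, W*r).

    Each group of r*r input channels supplies the r x r sub-pixels of one
    output channel, which is how a latent can be upscaled by a factor r.
    """
    channels_in = len(x)
    height, width = len(x[0]), len(x[0][0])
    if channels_in % (r * r) != 0:
        raise ValueError("channel count must be divisible by r*r")
    channels_out = channels_in // (r * r)
    out = [[[0.0] * (width * r) for _ in range(height * r)]
           for _ in range(channels_out)]
    for c in range(channels_out):
        for i in range(r):
            for j in range(r):
                src = x[c * r * r + i * r + j]
                for h in range(height):
                    for w in range(width):
                        out[c][h * r + i][w * r + j] = src[h][w]
    return out


# Four 1x1 channels become one 2x2 channel (a 2x upscale).
latent = [[[1]], [[2]], [[3]], [[4]]]
upscaled = pixel_shuffle(latent, 2)  # [[[1, 2], [3, 4]]]
```

A 4x head would use the same rearrangement with `r = 4`, consuming 16 input channels per output channel.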

[118] Enhancing the Outcome Reward-based RL Training of MLLMs with Self-Consistency Sampling

Jiahao Wang,Weiye Xu,Aijun Yang,Wengang Zhou,Lewei Lu,Houqiang Li,Xiaohua Wang,Jinguo Zhu

Main category: cs.CV

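TL;DR: The paper proposes Self-Consistency Sampling (SCS), which uses visual perturbations and trajectory resampling to address unfaithful trajectories in outcome-reward reinforcement learning (RL) training of multimodal large language models (MLLMs), and significantly improves performance on multimodal benchmarks.

Details Motivation: In the multiple-choice setting of multimodal reasoning benchmarks, outcome-reward RL training has a common problem: a model can guess the correct answer even when its reasoning chain is wrong, so unfaithful trajectories receive the same reward as genuine reasoning. This problem needs to be addressed.

Contribution: Proposes SCS, which derives a consistency score from visual perturbations and trajectory resampling and down-weights unreliable trajectories during policy updates, correcting the unfaithful-trajectory problem in RL training.

Method: SCS has two steps: (1) introduce small perturbations to the visual input; (2) repeatedly truncate and resample the initial trajectory, then use the consistency score of the resulting trajectories to adjust policy updates.

Result: On MLLMs including Qwen2.5-VL-7B-Instruct, SCS significantly improves performance, by up to 7.7 percentage points, and also performs well on Qwen2.5-VL-3B-Instruct and InternVL3-8B.

Insight: SCS offers a simple, general remedy for unfaithful trajectories in multimodal RL training, with negligible extra computation, and it applies to a range of RL algorithms.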

Abstract: Outcome-reward reinforcement learning (RL) is a common and increasingly significant way to refine the step-by-step reasoning of multimodal large language models (MLLMs). In the multiple-choice setting - a dominant format for multimodal reasoning benchmarks - the paradigm faces a significant yet often overlooked obstacle: unfaithful trajectories that guess the correct option after a faulty chain of thought receive the same reward as genuine reasoning, which is a flaw that cannot be ignored. We propose Self-Consistency Sampling (SCS) to correct this issue. For each question, SCS (i) introduces small visual perturbations and (ii) performs repeated truncation and resampling of an initial trajectory; agreement among the resulting trajectories yields a differentiable consistency score that down-weights unreliable traces during policy updates. Based on Qwen2.5-VL-7B-Instruct, plugging SCS into RLOO, GRPO, and REINFORCE++ series improves accuracy by up to 7.7 percentage points on six multimodal benchmarks with negligible extra computation. SCS also yields notable gains on both Qwen2.5-VL-3B-Instruct and InternVL3-8B, offering a simple, general remedy for outcome-reward RL in MLLMs.
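The idea of down-weighting trajectories whose answer is not reproduced under perturbation and resampling can be sketched as below. The abstract describes a differentiable consistency score; this sketch substitutes a plain agreement fraction, which is an assumption made for readability, and the function names are hypothetical.

```python
def consistency_score(original_answer, resampled_answers):
    """Fraction of resampled trajectories that reach the original answer.

    A simple agreement ratio standing in for the paper's score; returns 1.0
    when no resamples are available so the reward is left unchanged.
    """
    if not resampled_answers:
        return 1.0
    agree = sum(1 for answer in resampled_answers if answer == original_answer)
    return agree / len(resampled_answers)


def weighted_reward(outcome_reward, original_answer, resampled_answers):
    """Scale the outcome reward by how consistently the answer is reproduced."""
    return outcome_reward * consistency_score(original_answer, resampled_answers)


# A correct answer "B" that only 3 of 4 resamples reproduce gets reward 0.75.
reward = weighted_reward(1.0, "B", ["B", "B", "A", "B"])
```

Under this scheme a lucky guess that no resample reproduces receives zero reward even though its final option was correct.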

cs.MA [Back]

[119] Rethinking the Reliability of Multi-agent System: A Perspective from Byzantine Fault Tolerance

Lifan Zheng,Jiawei Chen,Qinghong Yin,Jingyuan Zhang,Xinyi Zeng,Yu Tian

Main category: cs.MA

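TL;DR: This paper studies the reliability of multi-agent systems (MAS) built on large language models (LLMs) from the perspective of Byzantine fault tolerance, examines the reliability advantages of LLM agents, and proposes a new consensus mechanism, CP-WBFT. The method improves system stability through probe-based weighted information flow transmission and performs well under extreme fault conditions.

Details Motivation: Although LLM agents are now widely used in multi-agent systems, their reliability has not been studied in depth. Traditional agents handle erroneous message flows poorly, whereas LLM agents show stronger skepticism. The paper aims to quantify this reliability gap and propose an improvement.

Contribution: 1. Quantifies, for the first time, the reliability advantage of LLM agents from the perspective of Byzantine fault tolerance; 2. designs the CP-WBFT consensus mechanism, which uses the reflective and discriminative abilities of LLMs to improve system stability; 3. validates the method under extreme conditions with a fault rate of up to 85.7%.

Method: CP-WBFT is a confidence probe-based weighted Byzantine Fault Tolerant consensus mechanism. It uses probe-based weighted information flow transmission and the intrinsic capabilities of LLM agents (such as reflection and discrimination) to improve the reliability of MAS.

Result: Experiments show that CP-WBFT performs well across different network topologies and significantly outperforms traditional methods under extreme Byzantine conditions (85.7% fault rate). It maintains high accuracy and reliability on mathematical reasoning and safety assessment tasks.

Insight: The skepticism of LLM agents is key to improving MAS reliability. By combining the inherent properties of LLMs with a weighted consensus mechanism, CP-WBFT points toward more stable multi-agent systems.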

Abstract: Ensuring the reliability of agent architectures and effectively identifying problematic agents when failures occur are crucial challenges in multi-agent systems (MAS). Advances in large language models (LLMs) have established LLM-based agents as a major branch of MAS, enabling major breakthroughs in complex problem solving and world modeling. However, the reliability implications of this shift remain largely unexplored, i.e., whether substituting traditional agents with LLM-based agents can effectively enhance the reliability of MAS. In this work, we investigate and quantify the reliability of LLM-based agents from the perspective of Byzantine fault tolerance. We observe that LLM-based agents demonstrate stronger skepticism when processing erroneous message flows, a characteristic that enables them to outperform traditional agents across different topological structures. Motivated by the results of the pilot experiment, we design CP-WBFT, a confidence probe-based weighted Byzantine Fault Tolerant consensus mechanism to enhance the stability of MAS with different topologies. It capitalizes on the intrinsic reflective and discriminative capabilities of LLMs by employing a probe-based, weighted information flow transmission method to improve the reliability of LLM-based agents. Extensive experiments demonstrate that CP-WBFT achieves superior performance across diverse network topologies under extreme Byzantine conditions (85.7% fault rate). Notably, our approach surpasses traditional methods by attaining remarkable accuracy on various topologies and maintaining strong reliability in both mathematical reasoning and safety assessment tasks.
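To make the intuition behind confidence-weighted consensus concrete, the sketch below compares a plain majority vote with a vote weighted by per-agent confidence. It is not CP-WBFT itself: how the paper's probes produce confidence values is not described in the abstract, so the confidence numbers, the seven-agent setup (six faulty agents out of seven is about 85.7%), and the function names are all assumptions.

```python
from collections import Counter, defaultdict


def majority_vote(votes):
    """Unweighted baseline: the most frequent answer wins."""
    return Counter(answer for answer, _ in votes).most_common(1)[0][0]


def weighted_consensus(votes):
    """Sum each answer's confidence weights and return the heaviest answer.

    `votes` is a list of (answer, confidence) pairs; the confidence values
    here are hypothetical stand-ins for probe outputs.
    """
    totals = defaultdict(float)
    for answer, confidence in votes:
        totals[answer] += confidence
    return max(totals, key=totals.get)


# One honest, confident agent against six low-confidence faulty agents.
votes = [("42", 0.95)] + [("41", 0.10)] * 6
plain = majority_vote(votes)          # "41": the faulty majority wins
weighted = weighted_consensus(votes)  # "42": 0.95 outweighs 6 * 0.10
```

The example only works if faulty agents really do receive low confidence, which is the property a probe would have to supply.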

cs.LG [Back]

[120] Probability-Biased Attention over Directed Bipartite Graphs for Long-Tail ICD Coding

Tianlei Chen,Yuxiao Chen,Yang Li,Feifei Wang

Main category: cs.LG

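TL;DR: This paper proposes a probability-biased attention method for long-tail ICD coding. By building a directed bipartite graph and injecting a co-occurrence probability bias, it significantly improves classification of rare codes.

Details Motivation: ICD coding faces a very large label space and a long-tail distribution in which rare codes lack sufficient training data. The paper therefore models fine-grained co-occurrence relationships among codes to improve classification of rare codes.

Contribution: 1) Designs a directed bipartite graph encoder that supports one-way information flow from common codes to rare codes; 2) proposes an attention bias based on co-occurrence probability (Co-occurrence Encoding); 3) uses a large language model to generate code descriptions, enriching the input embeddings with clinical context.

Method: 1) Build a directed bipartite graph whose nodes are split into common codes and rare codes; 2) define edge weights from co-occurrence probabilities and inject them into the attention module; 3) use LLM-generated code descriptions as external knowledge.

Result: Experiments on three benchmark datasets show significant gains in Macro-F1, the key metric for long-tail classification, reaching state-of-the-art performance.

Insight: Combining statistical co-occurrence relationships with external knowledge can substantially improve learning for long-tail labels, especially for rare codes.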

Abstract: Automated International Classification of Diseases (ICD) coding aims to assign multiple disease codes to clinical documents, constituting a crucial multi-label text classification task in healthcare informatics. However, the task is challenging due to its large label space (10,000 to 20,000 codes) and long-tail distribution, where a few codes dominate while many rare codes lack sufficient training data. To address this, we propose a learning method that models fine-grained co-occurrence relationships among codes. Specifically, we construct a Directed Bipartite Graph Encoder with disjoint sets of common and rare code nodes. To facilitate a one-way information flow, edges are directed exclusively from common to rare codes. The nature of these connections is defined by a probability-based bias, which is derived from the conditional probability of a common code co-occurring given the presence of a rare code. This bias is then injected into the encoder’s attention module, a process we term Co-occurrence Encoding. This structure empowers the graph encoder to enrich rare code representations by aggregating latent comorbidity information reflected in the statistical co-occurrence of their common counterparts. To ensure high-quality input to the graph, we utilize a large language model (LLM) to generate comprehensive descriptions for codes, enriching initial embeddings with clinical context and comorbidity information, serving as external knowledge for the statistical co-occurrence relationships in the code system. Experiments on three automated ICD coding benchmark datasets demonstrate that our method achieves state-of-the-art performance with particularly notable improvements in Macro-F1, which is the key metric for long-tail classification.
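The probability-based bias described above is derived from the conditional probability of a common code given a rare code. A minimal way to estimate that quantity from labeled documents is sketched below; the placeholder code names (`C1`, `C2`, `R1`), the function name, and the choice to return 0.0 for rare codes that never occur are assumptions, and how the paper turns this probability into an attention bias is not shown.

```python
def cooccurrence_bias(documents, common_codes, rare_codes):
    """Estimate P(common | rare) for every (common, rare) pair.

    `documents` is a list of sets of codes assigned to each clinical note.
    The result maps (common, rare) to the share of notes containing the rare
    code that also contain the common code.
    """
    bias = {}
    for rare in rare_codes:
        with_rare = [codes for codes in documents if rare in codes]
        for common in common_codes:
            if not with_rare:
                bias[(common, rare)] = 0.0  # no evidence for this rare code
                continue
            both = sum(1 for codes in with_rare if common in codes)
            bias[(common, rare)] = both / len(with_rare)
    return bias


# Toy corpus with placeholder codes: R1 appears in two notes.
docs = [{"C1", "C2", "R1"}, {"C1", "R1"}, {"C2"}, {"C1"}]
bias = cooccurrence_bias(docs, common_codes=["C1", "C2"], rare_codes=["R1"])
# bias[("C1", "R1")] is 1.0 and bias[("C2", "R1")] is 0.5
```

Each entry corresponds to one directed edge from a common code to a rare code in the bipartite graph.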

[121] OutSafe-Bench: A Benchmark for Multimodal Offensive Content Detection in Large Language Models

Yuping Yan,Yuhan Xie,Yuanshuai Li,Yingchao Yu,Lingjuan Lyu,Yaochu Jin

Main category: cs.LG

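TL;DR: The paper introduces OutSafe-Bench, a content safety benchmark for multimodal large language models (MLLMs). It covers multimodal data, provides new evaluation metrics and methods, and exposes safety vulnerabilities in existing models.

Details Motivation: As MLLMs become widely used, the unsafe content they may output (toxic language, biased imagery, and so on) raises concern. Current safety benchmarks fall short in modality coverage and performance evaluation, so a more comprehensive evaluation tool is needed.

Contribution: 1. Releases OutSafe-Bench, the first comprehensive multimodal content safety benchmark, with a large-scale multimodal dataset annotated across nine content risk categories. 2. Proposes the new metric MCRS and the evaluation framework FairScore to improve the fairness and robustness of evaluation.

Method: 1. Build a multimodal dataset covering text, images, audio, and video. 2. Design the MCRS metric to quantify correlation across risk categories. 3. Introduce the FairScore framework, which uses adaptive jury models to reduce the bias of single-model judgments.

Result: Evaluating nine state-of-the-art MLLMs reveals substantial safety vulnerabilities, indicating that existing models need stronger safeguards.

Insight: Multimodal content safety evaluation needs to cover a wider range of risk categories and modalities, and must also address the bias of single-model judgments.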

Abstract: Since Multimodal Large Language Models (MLLMs) are increasingly being integrated into everyday tools and intelligent agents, growing concerns have arisen regarding their potential to output unsafe content, ranging from toxic language and biased imagery to privacy violations and harmful misinformation. Current safety benchmarks remain highly limited in both modality coverage and performance evaluations, often neglecting the extensive landscape of content safety. In this work, we introduce OutSafe-Bench, the first comprehensive content safety evaluation test suite designed for the multimodal era. OutSafe-Bench includes a large-scale dataset that spans four modalities, featuring over 18,000 bilingual (Chinese and English) text prompts, 4,500 images, 450 audio clips and 450 videos, all systematically annotated across nine critical content risk categories. In addition to the dataset, we introduce a Multidimensional Cross Risk Score (MCRS), a novel metric designed to model and assess overlapping and correlated content risks across different categories. To ensure fair and robust evaluation, we propose FairScore, an explainable automated multi-reviewer weighted aggregation framework. FairScore selects top-performing models as adaptive juries, thereby mitigating biases from single-model judgments and enhancing overall evaluation reliability. Our evaluation of nine state-of-the-art MLLMs reveals persistent and substantial safety vulnerabilities, underscoring the pressing need for robust safeguards in MLLMs.

[122] AgentEvolver: Towards Efficient Self-Evolving Agent System

Yunpeng Zhai,Shuchang Tao,Cheng Chen,Anni Zou,Ziqian Chen,Qingxu Fu,Shinji Mai,Li Yu,Jiaji Deng,Zouying Cao,Zhaoyang Liu,Bolin Ding,Jingren Zhou

Main category: cs.LG

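TL;DR: AgentEvolver is an autonomous agent system built on large language models (LLMs) that achieves efficient self-evolution through three mechanisms: self-questioning, self-navigating, and self-attributing.

Details Motivation: Existing autonomous agent systems depend on manually constructed datasets and reinforcement learning, which is costly and inefficient. AgentEvolver aims to address these problems using the semantic understanding and reasoning abilities of LLMs.

Contribution: Introduces three synergistic mechanisms: self-questioning (reducing dependence on handcrafted datasets), self-navigating (improving exploration efficiency), and self-attributing (enhancing sample efficiency).

Method: Integrates self-questioning for task generation, self-navigating for experience reuse and hybrid policy guidance, and self-attributing for differentiated reward assignment.

Result: Experiments indicate that AgentEvolver outperforms traditional reinforcement learning methods in exploration efficiency, sample utilization, and adaptation speed.

Insight: Using the reasoning ability of LLMs can substantially lower the development cost of agent systems and raise their efficiency, suggesting a new direction for autonomous agents.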

Abstract: Autonomous agents powered by large language models (LLMs) have the potential to significantly enhance human productivity by reasoning, using tools, and executing complex tasks in diverse environments. However, current approaches to developing such agents remain costly and inefficient, as they typically require manually constructed task datasets and reinforcement learning (RL) pipelines with extensive random exploration. These limitations lead to prohibitively high data-construction costs, low exploration efficiency, and poor sample utilization. To address these challenges, we present AgentEvolver, a self-evolving agent system that leverages the semantic understanding and reasoning capabilities of LLMs to drive autonomous agent learning. AgentEvolver introduces three synergistic mechanisms: (i) self-questioning, which enables curiosity-driven task generation in novel environments, reducing dependence on handcrafted datasets; (ii) self-navigating, which improves exploration efficiency through experience reuse and hybrid policy guidance; and (iii) self-attributing, which enhances sample efficiency by assigning differentiated rewards to trajectory states and actions based on their contribution. By integrating these mechanisms into a unified framework, AgentEvolver enables scalable, cost-effective, and continual improvement of agent capabilities. Preliminary experiments indicate that AgentEvolver achieves more efficient exploration, better sample utilization, and faster adaptation compared to traditional RL-based baselines.

[123] Impact of Layer Norm on Memorization and Generalization in Transformers

Rishi Singhal,Jung-Eun Kim

Main category: cs.LG

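TL;DR: The study examines how layer normalization (LayerNorm) affects memorization and learning in Pre-LayerNorm and Post-LayerNorm transformers, finding that it stabilizes learning in Pre-LayerNorm models and affects memorization in Post-LayerNorm models.

Details Motivation: LayerNorm is a core component of transformers, but its specific effect on memorization and learning across architectures is unclear, particularly the differences between Pre-LayerNorm and Post-LayerNorm models.

Contribution: Reveals the different roles LayerNorm plays in Pre-LayerNorm and Post-LayerNorm transformers and clarifies its key influence on memorization and learning.

Method: Analyzes the differences in gradient flow and behavior between Pre-LayerNorm and Post-LayerNorm models, and validates the findings on 13 models and 6 datasets.

Result: Removing LayerNorm in Pre-LayerNorm models destabilizes learning and exacerbates memorization, whereas removing it in Post-LayerNorm models effectively reduces memorization and restores genuine labels.

Insight: LayerNorm in the early layers has the greatest influence on model behavior, and its role differs markedly between Pre and Post architectures, offering a new perspective for designing efficient transformers.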

Abstract: Layer Normalization (LayerNorm) is one of the fundamental components in transformers that stabilizes training and improves optimization. In recent times, Pre-LayerNorm transformers have become the preferred choice over Post-LayerNorm transformers due to their stable gradient flow. However, the impact of LayerNorm on learning and memorization across these architectures remains unclear. In this work, we investigate how LayerNorm influences memorization and learning for Pre- and Post-LayerNorm transformers. We identify that LayerNorm serves as a key factor for stable learning in Pre-LayerNorm transformers, while in Post-LayerNorm transformers, it impacts memorization. Our analysis reveals that eliminating LayerNorm parameters in Pre-LayerNorm models exacerbates memorization and destabilizes learning, while in Post-LayerNorm models, it effectively mitigates memorization by restoring genuine labels. We further identify that LayerNorm in the early layers is more critical than in the middle or later layers, and that its influence varies between Pre- and Post-LayerNorm models. We validate these findings on 13 models across 6 vision and language datasets. These insights shed new light on the role of LayerNorm in shaping memorization and learning in transformers.
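For readers less familiar with the two architectures compared above, the sketch below shows where normalization sits in each residual block: Pre-LayerNorm normalizes the sublayer's input and leaves the residual path untouched, while Post-LayerNorm normalizes the sum of the residual and the sublayer output. The `layer_norm` here omits the learnable gain and bias, and the doubling `sublayer` is a toy stand-in for attention or an MLP; both simplifications are assumptions for illustration.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learnable params)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def pre_ln_block(x, sublayer):
    """Pre-LayerNorm: x + sublayer(LN(x)); the residual path is not normalized."""
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]


def post_ln_block(x, sublayer):
    """Post-LayerNorm: LN(x + sublayer(x)); normalization follows the residual add."""
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])


def double(v):
    """Toy sublayer standing in for attention or a feed-forward network."""
    return [2.0 * t for t in v]


x = [1.0, 2.0, 3.0]
pre_out = pre_ln_block(x, double)    # keeps the scale of x on the residual path
post_out = post_ln_block(x, double)  # always re-centered to zero mean
```

The unnormalized residual path in the Pre-LayerNorm block is the usual explanation for its more stable gradient flow.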

[124] Towards Emotionally Intelligent and Responsible Reinforcement Learning

Garapati Keerthana,Manik Gupta

Main category: cs.LG

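TL;DR: The paper proposes a Responsible Reinforcement Learning (RRL) framework that brings emotional and ethical considerations into the decision process, addressing their neglect in personalization systems. Using a Constrained Markov Decision Process (CMDP) and a multi-objective reward function, RRL balances short-term behavioral engagement against long-term user well-being.

Details Motivation: Current personalized decision systems usually rely on static rules or engagement-maximizing heuristics, overlooking users' emotional context and ethical constraints, which can lead to insensitive or unsafe interventions.

Contribution: 1. Proposes the RRL framework, which integrates emotional understanding and ethical safety into the reinforcement learning decision process; 2. designs a multi-objective reward function and an emotion-informed state representation; 3. the framework is flexible and can be combined with a variety of RL algorithms.

Method: Models the problem as a CMDP, combining a multi-objective reward function (balancing engagement and well-being) with an emotion-informed state representation (capturing emotional fluctuations). Safety constraints or Lagrangian regularization strengthen the ethical safety of the RL algorithm.

Result: The RRL framework offers an emotionally intelligent and ethically safe approach to reinforcement learning, suited to human-centric domains such as behavioral health and education.

Insight: Combining affective computing with safe reinforcement learning can give personalization systems greater ethical trustworthiness and emotional intelligence.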

Abstract: Personalized decision systems in healthcare and behavioral support often rely on static rule-based or engagement-maximizing heuristics that overlook users’ emotional context and ethical constraints. Such approaches risk recommending insensitive or unsafe interventions, especially in domains involving serious mental illness, substance use disorders, or depression. To address this limitation, we propose a Responsible Reinforcement Learning (RRL) framework that integrates emotional and contextual understanding with ethical considerations into the sequential decision-making process. RRL formulates personalization as a Constrained Markov Decision Process (CMDP), where the agent optimizes engagement and adherence while ensuring emotional alignment and ethical safety. We introduce a multi-objective reward function that explicitly balances short-term behavioral engagement with long-term user well-being, and define an emotion-informed state representation that captures fluctuations in emotional readiness, affect, and risk. The proposed architecture can be instantiated with any RL algorithm (e.g., DQN, PPO) augmented with safety constraints or Lagrangian regularization. Conceptually, this framework operationalizes empathy and responsibility within machine learning policy optimization, bridging safe RL, affective computing and responsible AI. We discuss the implications of this approach for human-centric domains such as behavioral health, education, and digital therapeutics, and outline simulation-based validation paths for future empirical work. This paper aims to initiate a methodological conversation about ethically aligned reinforcement learning for emotionally aware and trustworthy personalization systems.
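One common way to write a constrained MDP with Lagrangian relaxation is shown below. The symbols are assumptions for illustration ($r$ for the engagement reward, $c$ for an emotional or ethical cost signal, $d$ for the cost budget, $\lambda$ for the multiplier); the abstract does not give the paper's exact formulation.

```latex
\begin{aligned}
\max_{\pi}\quad & J_r(\pi) = \mathbb{E}_{\pi}\Big[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t)\Big] \\
\text{s.t.}\quad & J_c(\pi) = \mathbb{E}_{\pi}\Big[\sum_{t=0}^{\infty} \gamma^{t}\, c(s_t, a_t)\Big] \le d \\[4pt]
& \mathcal{L}(\pi, \lambda) = J_r(\pi) - \lambda\,\big(J_c(\pi) - d\big), \qquad \lambda \ge 0
\end{aligned}
```

In this form the policy maximizes $\mathcal{L}$ while $\lambda$ is increased whenever the expected cost exceeds the budget $d$, which is how a standard algorithm such as PPO or DQN can be augmented with a safety constraint.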

[125] How does My Model Fail? Automatic Identification and Interpretation of Physical Plausibility Failure Modes with Matryoshka Transcoders

Yiming Tang,Abhijeet Sinha,Dianbo Liu

Main category: cs.LG

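TL;DR: The paper proposes a new framework called Matryoshka Transcoders for automatically discovering and interpreting physical plausibility failure modes in generative models. Using multi-granularity sparse feature learning on the intermediate representations of a physical plausibility classifier, it provides a detailed analysis of how models violate physical constraints.

Details Motivation: Although existing generative models produce realistic, instruction-following outputs, they still show notable problems with physical plausibility. These problems are often hard to detect with existing evaluation methods, and there is no framework for identifying and interpreting them automatically, which blocks targeted improvement.

Contribution: Proposes the Matryoshka Transcoders framework, which combines Matryoshka representation learning with multimodal models to discover and interpret physical plausibility failure modes automatically, and establishes a benchmark for evaluating physical plausibility.

Method: Building on the Matryoshka representation learning paradigm, the method extends the transcoder architecture to support sparse feature learning at multiple granularities, and uses the intermediate representations of a physical plausibility classifier together with large multimodal models to interpret failure modes.

Result: The method outperforms existing approaches in feature relevance and feature accuracy, and it uncovers a range of physical plausibility failure modes across eight state-of-the-art generative models, pointing to directions for improvement.

Insight: By automatically identifying and interpreting physical plausibility errors, the method both reveals common failure modes of generative models and provides a practical tool for improving their physical plausibility.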

Abstract: Although recent generative models are remarkably capable of producing instruction-following and realistic outputs, they remain prone to notable physical plausibility failures. Though critical in applications, these physical plausibility errors often escape detection by existing evaluation methods. Furthermore, no framework exists for automatically identifying and interpreting specific physical error patterns in natural language, preventing targeted model improvements. We introduce Matryoshka Transcoders, a novel framework for the automatic discovery and interpretation of physical plausibility features in generative models. Our approach extends the Matryoshka representation learning paradigm to transcoder architectures, enabling hierarchical sparse feature learning at multiple granularity levels. By training on intermediate representations from a physical plausibility classifier and leveraging large multimodal models for interpretation, our method identifies diverse physics-related failure modes without manual feature engineering, achieving superior feature relevance and feature accuracy compared to existing approaches. We utilize the discovered visual patterns to establish a benchmark for evaluating physical plausibility in generative models. Our analysis of eight state-of-the-art generative models provides valuable insights into how these models fail to follow physical constraints, paving the way for further model improvements.

physics.chem-ph [Back]

[126] VEDA: 3D Molecular Generation via Variance-Exploding Diffusion with Annealing

Peining Zhang,Jinbo Bi,Minghu Song

Main category: physics.chem-ph

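TL;DR: VEDA is an SE(3)-equivariant framework that combines variance-exploding (VE) diffusion with annealing to generate accurate 3D molecular structures efficiently, addressing the trade-off between sampling speed and conformational accuracy in diffusion models.

Details Motivation: Existing diffusion models for 3D molecular generation face a trade-off between sampling efficiency and conformational accuracy. VEDA aims to resolve it by combining VE diffusion with an annealing strategy.

Contribution: 1. A VE schedule that enables noise injection analogous to simulated annealing; 2. a new preconditioning scheme that reconciles SE(3)-equivariant networks with a residual-based diffusion objective; 3. an arcsin-based scheduler that concentrates sampling in critical intervals.

Method: VEDA uses an SE(3)-equivariant framework that combines VE diffusion with annealing, improving efficiency and accuracy through better noise injection and sampling schedules.

Result: On the QM9 and GEOM-DRUGS datasets, VEDA matches the sampling efficiency of flow-based models with only 100 sampling steps, and the median relaxation energy change of its generated structures is only 1.72 kcal/mol.

Insight: Combining VE diffusion with SE(3)-equivariant architectures can deliver 3D molecular generation that is both efficient and accurate, suggesting a new direction for molecular design.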

Abstract: Diffusion models show promise for 3D molecular generation, but face a fundamental trade-off between sampling efficiency and conformational accuracy. While flow-based models are fast, they often produce geometrically inaccurate structures, as they have difficulty capturing the multimodal distributions of molecular conformations. In contrast, denoising diffusion models are more accurate but suffer from slow sampling, a limitation attributed to sub-optimal integration between diffusion dynamics and SE(3)-equivariant architectures. To address this, we propose VEDA, a unified SE(3)-equivariant framework that combines variance-exploding diffusion with annealing to efficiently generate conformationally accurate 3D molecular structures. Specifically, our key technical contributions include: (1) a VE schedule that enables noise injection functionally analogous to simulated annealing, improving 3D accuracy and reducing relaxation energy; (2) a novel preconditioning scheme that reconciles the coordinate-predicting nature of SE(3)-equivariant networks with a residual-based diffusion objective, and (3) a new arcsin-based scheduler that concentrates sampling in critical intervals of the logarithmic signal-to-noise ratio. On the QM9 and GEOM-DRUGS datasets, VEDA matches the sampling efficiency of flow-based models, achieving state-of-the-art valency stability and validity with only 100 sampling steps. More importantly, VEDA’s generated structures are remarkably stable, as measured by their relaxation energy during GFN2-xTB optimization. The median energy change is only 1.72 kcal/mol, significantly lower than the 32.3 kcal/mol from its architectural baseline, SemlaFlow. Our framework demonstrates that principled integration of VE diffusion with SE(3)-equivariant architectures can achieve both high chemical accuracy and computational efficiency.
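As background for the variance-exploding schedule mentioned above, the sketch below implements the textbook VE forward perturbation: coordinates keep their mean while Gaussian noise with a geometrically growing standard deviation is added. This is the generic VE formulation, not VEDA's arcsin-based scheduler or preconditioning; the `sigma_min` and `sigma_max` values, the function names, and the toy coordinates are assumptions.

```python
import random


def ve_sigma(t, sigma_min=0.01, sigma_max=50.0):
    """Geometric VE noise level: sigma_min at t=0 rising to sigma_max at t=1."""
    return sigma_min * (sigma_max / sigma_min) ** t


def ve_perturb(coords, t, rng):
    """Add isotropic Gaussian noise of scale ve_sigma(t) to every 3D coordinate.

    Unlike variance-preserving diffusion, the clean signal is not shrunk, so
    the variance of the noised sample grows ("explodes") with t.
    """
    sigma = ve_sigma(t)
    return [[c + sigma * rng.gauss(0.0, 1.0) for c in atom] for atom in coords]


rng = random.Random(0)  # seeded only to make the sketch reproducible
molecule = [[0.0, 0.0, 0.0], [1.1, 0.0, 0.0], [1.6, 1.0, 0.0]]  # toy coordinates
noisy = ve_perturb(molecule, t=0.5, rng=rng)
```

At large noise levels the atoms move almost freely and at small levels they settle, which is the loose analogy to simulated annealing that the abstract draws.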

cs.AI [Back]

[127] Enhancing the Medical Context-Awareness Ability of LLMs via Multifaceted Self-Refinement Learning

Yuxuan Zhou,Yubin Wang,Bin Wang,Chen Ning,Xien Liu,Ji Wu,Jianye Hao

Main category: cs.AI

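TL;DR: The paper proposes MuSeR, a multifaceted self-refinement learning method that strengthens the context-awareness of large language models (LLMs) in the medical domain, using self-evaluation and refinement to improve three key facets: decision-making, communication, and safety.

Details Motivation: Although LLMs perform well on several medical benchmarks, they still underperform in real-world medical scenarios, mainly because they lack awareness of contextual details such as user identity, medical history, and risk factors. MuSeR aims to address this.

Contribution: 1) Proposes MuSeR, which strengthens LLM context-awareness through multifaceted self-refinement; 2) designs an attribute-conditioned query generator that simulates diverse real-world user scenarios; 3) combines the method with knowledge distillation to lift a smaller LLM to SOTA on HealthBench and its hard subset.

Method: 1) Use the attribute-conditioned query generator to simulate diverse user scenarios; 2) have an LLM generate responses, self-evaluate them, and refine them along the three facets of decision-making, communication, and safety; 3) reinforce context-awareness through supervised fine-tuning; 4) add knowledge distillation for further gains.

Result: Experiments on the HealthBench dataset show that MuSeR significantly improves LLM performance, especially on context-awareness. With knowledge distillation, a smaller LLM (Qwen3-32B) surpasses its teacher model and reaches SOTA (63.8% on HealthBench and 43.1% on its hard subset).

Insight: 1) Self-refinement can effectively improve how LLMs perform in real-world scenarios; 2) simulating diverse contexts is essential for robust performance; 3) knowledge distillation can amplify the benefits of an effective method and empower smaller models.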

Abstract: Large language models (LLMs) have shown great promise in the medical domain, achieving strong performance on several benchmarks. However, they continue to underperform in real-world medical scenarios, which often demand stronger context-awareness, i.e., the ability to recognize missing or critical details (e.g., user identity, medical history, risk factors) and provide safe, helpful, and contextually appropriate responses. To address this issue, we propose Multifaceted Self-Refinement (MuSeR), a data-driven approach that enhances LLMs’ context-awareness along three key facets (decision-making, communication, and safety) through self-evaluation and refinement. Specifically, we first design an attribute-conditioned query generator that simulates diverse real-world user contexts by varying attributes such as role, geographic region, intent, and degree of information ambiguity. An LLM then responds to these queries, self-evaluates its answers along three key facets, and refines its responses to better align with the requirements of each facet. Finally, the queries and refined responses are used for supervised fine-tuning to reinforce the model’s context-awareness ability. Evaluation results on the latest HealthBench dataset demonstrate that our method significantly improves LLM performance across multiple aspects, with particularly notable gains in the context-awareness axis. Furthermore, by incorporating knowledge distillation with the proposed method, the performance of a smaller backbone LLM (e.g., Qwen3-32B) surpasses its teacher model, achieving a new SOTA across all open-source LLMs on HealthBench (63.8%) and its hard subset (43.1%). Code and dataset will be released at https://muser-llm.github.io.
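To make the attribute-conditioned query generator more concrete, here is a minimal Python sketch of how attribute combinations could be turned into query-generation prompts. The attribute types (role, region, intent, ambiguity) come from the abstract; the concrete values, the prompt template, and the function name `build_generation_prompts` are hypothetical and are not taken from the MuSeR code.

```python
import itertools
import random

# Hypothetical attribute values; the abstract names only the attribute types.
ATTRIBUTES = {
    "role": ["patient", "caregiver", "clinician"],
    "region": ["North America", "South Asia"],
    "intent": ["symptom triage", "medication question"],
    "ambiguity": ["complete", "missing key details"],
}

# Hypothetical prompt template for the LLM that writes the user query.
TEMPLATE = (
    "Write a medical question asked by a {role} in {region}. "
    "Intent: {intent}. Information provided: {ambiguity}."
)


def build_generation_prompts(attributes, sample_size=None, seed=0):
    """Return (attribute combination, prompt) pairs, optionally subsampled."""
    keys = list(attributes)
    combos = [
        dict(zip(keys, values))
        for values in itertools.product(*attributes.values())
    ]
    if sample_size is not None:
        # Seeded sampling keeps the selected subset reproducible.
        combos = random.Random(seed).sample(combos, sample_size)
    return [(combo, TEMPLATE.format(**combo)) for combo in combos]


prompts = build_generation_prompts(ATTRIBUTES)
print(len(prompts), prompts[0][1])
```

Enumerating the full Cartesian product is only practical for small attribute sets; with more attributes, passing `sample_size` keeps the number of generated queries manageable.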

[128] ProgRAG: Hallucination-Resistant Progressive Retrieval and Reasoning over Knowledge Graphs

Minbae Park,Hyemin Yang,Jeonghyun Kim,Kunsoo Park,Hyunjoon Kim

Main category: cs.AI

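TL;DR: ProgRAG is a multi-hop knowledge graph question answering framework that decomposes complex questions into sub-questions and progressively extends reasoning paths. It refines evidence retrieval with uncertainty-aware pruning and reorganizes the context, which markedly improves answer reliability and reasoning quality.

Motivation: Large language models (LLMs) excel at reasoning tasks but suffer from hallucinations and limited transparency. Knowledge graphs (KGs) can strengthen LLM reasoning, yet existing methods still face inaccurate retrieval and reasoning failures. ProgRAG aims to address these problems through progressive retrieval and reasoning.

Contribution: Proposes the ProgRAG framework, which markedly improves the performance and reliability of multi-hop KGQA through question decomposition, progressive extension of reasoning paths, uncertainty-aware pruning, and context optimization.

Method: 1. Decompose complex questions into sub-questions; 2. extend reasoning paths step by step; 3. external retrievers collect evidence, which the LLM refines through uncertainty-aware pruning; 4. organize and rearrange the partial reasoning paths obtained from the sub-question answers to optimize the context.

Result: Experiments on three well-known datasets show that ProgRAG outperforms existing baselines on multi-hop KGQA, with markedly better reasoning quality and reliability.

Insight: Progressive retrieval and reasoning-path optimization effectively address retrieval failures and reasoning errors on complex questions; uncertainty-aware pruning is key to making LLM reasoning more reliable.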

Abstract: Large Language Models (LLMs) demonstrate strong reasoning capabilities but struggle with hallucinations and limited transparency. Recently, KG-enhanced LLMs that integrate knowledge graphs (KGs) have been shown to improve reasoning performance, particularly for complex, knowledge-intensive tasks. However, these methods still face significant challenges, including inaccurate retrieval and reasoning failures, often exacerbated by long input contexts that obscure relevant information or by context constructions that struggle to capture the richer logical directions required by different question types. Furthermore, many of these approaches rely on LLMs to directly retrieve evidence from KGs, and to self-assess the sufficiency of this evidence, which often results in premature or incorrect reasoning. To address the retrieval and reasoning failures, we propose ProgRAG, a multi-hop knowledge graph question answering (KGQA) framework that decomposes complex questions into sub-questions, and progressively extends partial reasoning paths by answering each sub-question. At each step, external retrievers gather candidate evidence, which is then refined through uncertainty-aware pruning by the LLM. Finally, the context for LLM reasoning is optimized by organizing and rearranging the partial reasoning paths obtained from the sub-question answers. Experiments on three well-known datasets demonstrate that ProgRAG outperforms existing baselines in multi-hop KGQA, offering improved reliability and reasoning quality.
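One way to picture the uncertainty-aware pruning step is the Python sketch below: it keeps the smallest set of top-ranked candidates whose softmax probability mass reaches a threshold, so a confident score distribution keeps few candidates and an uncertain one keeps more. This cumulative-mass rule, the function name `prune_candidates`, and the example relation names are assumptions chosen for illustration; the abstract does not specify how ProgRAG measures uncertainty.

```python
import math


def prune_candidates(scores, mass=0.9, temperature=1.0):
    """Keep the smallest set of top candidates whose softmax mass reaches `mass`.

    scores: dict mapping candidate -> raw relevance score (higher is better).
    """
    if not scores:
        return []
    top = max(scores.values())
    # Subtract the maximum before exponentiating for numerical stability.
    weights = {c: math.exp((s - top) / temperature) for c, s in scores.items()}
    total = sum(weights.values())
    ranked = sorted(weights, key=weights.get, reverse=True)

    kept, cumulative = [], 0.0
    for candidate in ranked:
        kept.append(candidate)
        cumulative += weights[candidate] / total
        if cumulative >= mass:
            break
    return kept


# Hypothetical relation scores for one sub-question.
confident = prune_candidates({"directed_by": 8.0, "produced_by": 1.0, "starred_in": 0.5})
uncertain = prune_candidates({"directed_by": 2.0, "produced_by": 1.9, "starred_in": 1.8})
print(confident, uncertain)
```

In the confident case a single relation survives, while in the uncertain case all three are kept for the next reasoning step.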

[129] FactGuard: Event-Centric and Commonsense-Guided Fake News Detection

Jing He,Han Zhang,Yuanhui Xiao,Wei Guo,Shaowen Yao,Renyang Liu

Main category: cs.AI

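TL;DR: FactGuard is a fake news detection framework based on event content and commonsense. It uses large language models (LLMs) to extract event-centric content and improves robustness and practicality through a dynamic usability mechanism and knowledge distillation.

Motivation: Existing style-based fake news detectors lose effectiveness as adversaries imitate the style of authentic news. LLMs have potential, but their practical use in fake news detection is limited by shallow functionality exploration, ambiguous usability, and prohibitive inference costs.

Contribution: Proposes the FactGuard framework, which reduces the influence of writing style on detection through event-centric content extraction, and introduces a dynamic usability mechanism and knowledge distillation to achieve efficient, robust fake news detection.

Method: 1. Use LLMs to extract event-centric content; 2. design a dynamic usability mechanism that identifies contradictions and ambiguous cases; 3. derive a lightweight version, FactGuard-D, through knowledge distillation.

Result: Experiments on two benchmark datasets show that FactGuard outperforms existing methods in both robustness and accuracy, effectively addressing style sensitivity and LLM usability.

Insight: Event content and commonsense reasoning are key directions for fake news detection; combining them with a dynamic usability mechanism for LLMs can substantially improve detection reliability.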

Abstract: Fake news detection methods based on writing style have achieved remarkable progress. However, as adversaries increasingly imitate the style of authentic news, the effectiveness of such approaches is gradually diminishing. Recent research has explored incorporating large language models (LLMs) to enhance fake news detection. Yet, despite their transformative potential, LLMs remain an untapped goldmine for fake news detection, with their real-world adoption hampered by shallow functionality exploration, ambiguous usability, and prohibitive inference costs. In this paper, we propose a novel fake news detection framework, dubbed FactGuard, that leverages LLMs to extract event-centric content, thereby reducing the impact of writing style on detection performance. Furthermore, our approach introduces a dynamic usability mechanism that identifies contradictions and ambiguous cases in factual reasoning, adaptively incorporating LLM advice to improve decision reliability. To ensure efficiency and practical deployment, we employ knowledge distillation to derive FactGuard-D, enabling the framework to operate effectively in cold-start and resource-constrained scenarios. Comprehensive experiments on two benchmark datasets demonstrate that our approach consistently outperforms existing methods in both robustness and accuracy, effectively addressing the challenges of style sensitivity and LLM usability in fake news detection.

[130] EgoEMS: A High-Fidelity Multimodal Egocentric Dataset for Cognitive Assistance in Emergency Medical Services

Keshara Weerasinghe,Xueren Ge,Tessa Heick,Lahiru Nuwan Wijayasingha,Anthony Cortez,Abhishek Satpathy,John Stankovic,Homa Alemzadeh

Main category: cs.AI

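TL;DR: The paper introduces EgoEMS, the first high-fidelity, multimodal egocentric dataset intended to support the development of AI cognitive assistants for emergency medical services, with over 20 hours of multi-task scenario data and several types of annotation.

Motivation: In emergency medical services (EMS), first responders face high-stress tasks with heavy cognitive load. AI cognitive assistants could ease this burden through real-time data collection and decision support.

Contribution: Introduces EgoEMS, the first end-to-end, multimodal, multiperson egocentric dataset, covering 233 simulated emergency scenarios with rich annotations and benchmark tasks.

Method: The dataset is recorded with an open-source, low-cost capture system and covers the activities of 62 participants (including 46 EMS professionals). Annotations include keysteps, audio transcripts, and action quality.

Result: EgoEMS provides high-fidelity data and multi-task benchmarks that support the development and evaluation of AI tools and advance research on intelligent EMS systems.

Insight: The realism and multimodality of the dataset provide an important foundation for building real-time cognitive assistants, with the potential to improve emergency response efficiency and patient outcomes.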

Abstract: Emergency Medical Services (EMS) are critical to patient survival in emergencies, but first responders often face intense cognitive demands in high-stakes situations. AI cognitive assistants, acting as virtual partners, have the potential to ease this burden by supporting real-time data collection and decision making. In pursuit of this vision, we introduce EgoEMS, the first end-to-end, high-fidelity, multimodal, multiperson dataset capturing over 20 hours of realistic, procedural EMS activities from an egocentric view in 233 simulated emergency scenarios performed by 62 participants, including 46 EMS professionals. Developed in collaboration with EMS experts and aligned with national standards, EgoEMS is captured using an open-source, low-cost, and replicable data collection system and is annotated with keysteps, timestamped audio transcripts with speaker diarization, action quality metrics, and bounding boxes with segmentation masks. Emphasizing realism, the dataset includes responder-patient interactions reflecting real-world emergency dynamics. We also present a suite of benchmarks for real-time multimodal keystep recognition and action quality estimation, essential for developing AI support tools for EMS. We hope EgoEMS inspires the research community to push the boundaries of intelligent EMS systems and ultimately contribute to improved patient outcomes.

[131] Learning to Pose Problems: Reasoning-Driven and Solver-Adaptive Data Synthesis for Large Reasoning Models

Yongxian Wei,Yilin Zhao,Li Shen,Xinrui Chen,Runxi Cheng,Sinan Du,Hao Yu,Gang Liu,Jiahong Yan,Chun Yuan,Dian Li

Main category: cs.AI

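TL;DR: This paper proposes a problem-generation method that combines reasoning with adaptive difficulty adjustment for training large reasoning models, significantly improving the quality of synthesized data and the generalization of the resulting models.

Motivation: Existing data synthesis methods lack precise control over the direction and difficulty of generated problems, which yields low-value or overly simple problems. This paper addresses these issues through explicit reasoning and adaptive difficulty adjustment.

Contribution: 1. Proposes a reasoning-based problem generator that explicitly plans problem directions; 2. adapts problem difficulty using feedback signals; 3. validates the method on 10 mathematical and general reasoning benchmarks.

Method: 1. Construct related problem pairs and have a reasoning model generate the intermediate problem-design chain of thought (CoT); 2. use the solver’s feedback on synthetic problems as a reward signal to adjust difficulty; 3. co-evolve the generator and the solver.

Result: Experiments show an average improvement of 2.5%, with generalization to both language and vision-language models. Co-evolution yields a further 0.7% gain.

Insight: Explicit reasoning and adaptive difficulty adjustment are key to higher-quality data synthesis; co-designing the generator and the solver can keep improving model performance.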

Abstract: Data synthesis for training large reasoning models offers a scalable alternative to limited, human-curated datasets, enabling the creation of high-quality data. However, existing approaches face several challenges: (i) indiscriminate generation that ignores the solver’s ability and yields low-value problems, or reliance on complex data pipelines to balance problem difficulty; and (ii) a lack of reasoning in problem generation, leading to shallow problem variants. In this paper, we develop a problem generator that reasons explicitly to plan problem directions before synthesis and adapts difficulty to the solver’s ability. Specifically, we construct related problem pairs and augment them with intermediate problem-design CoT produced by a reasoning model. These data bootstrap problem-design strategies from the generator. Then, we treat the solver’s feedback on synthetic problems as a reward signal, enabling the generator to calibrate difficulty and produce complementary problems near the edge of the solver’s competence. Extensive experiments on 10 mathematical and general reasoning benchmarks show that our method achieves an average improvement of 2.5% and generalizes to both language and vision-language models. Moreover, a solver trained on the synthesized data provides improved rewards for continued generator training, enabling co-evolution and yielding a further 0.7% performance gain. Our code will be made publicly available.
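The idea of rewarding problems "near the edge of the solver’s competence" can be illustrated with a small Python sketch. It scores a synthetic problem by how close the solver’s empirical pass rate is to 0.5, so problems that are always or never solved earn no reward. The triangular reward shape and the function name `difficulty_reward` are illustrative assumptions; the abstract does not state the reward the authors actually use.

```python
def difficulty_reward(solver_outcomes):
    """Reward a synthetic problem by how close the solver's pass rate is to 0.5.

    solver_outcomes: list of booleans, one per independent solver attempt.
    Returns a value in [0, 1]: 0 for always/never solved, 1 for a 50% pass rate.
    """
    if not solver_outcomes:
        raise ValueError("need at least one solver attempt")
    pass_rate = sum(solver_outcomes) / len(solver_outcomes)
    # Assumed triangular shape peaking at pass_rate == 0.5.
    return 1.0 - abs(2.0 * pass_rate - 1.0)


too_easy = difficulty_reward([True] * 8)
borderline = difficulty_reward([True, False, True, False])
mostly_solved = difficulty_reward([True, True, True, False])
print(too_easy, borderline, mostly_solved)
```

A generator trained against such a signal is pushed away from trivial and unsolvable problems alike, which matches the calibration behavior the abstract describes.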

[132] Querying Labeled Time Series Data with Scenario Programs

Edward Kim,Devan Shanker,Varun Bharadwaj,Hongbeen Park,Jinkyu Kim,Hazem Torfah,Daniel J Fremont,Sanjit A Seshia

Main category: cs.AI

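TL;DR: The paper proposes a new method that formally matches labeled time series data against scenario programs, making it more efficient to check whether failure scenarios found in simulation testing also occur in real data.

Motivation: To bridge the sim-to-real gap between simulation testing and real systems, and to verify whether failure scenarios found in simulation are reproducible in the real world, an efficient way of matching these scenarios in real data is needed.

Contribution: Gives a formal definition of how time series data matches an abstract scenario, and designs an efficient querying algorithm that quickly identifies matching scenarios in labeled datasets.

Method: Abstract scenarios are represented in the Scenic probabilistic programming language, and a querying algorithm efficiently matches them against labeled time series data.

Result: Experiments show that the algorithm is more accurate and orders of magnitude faster at querying scenarios than state-of-the-art commercial vision large language models, and that it scales with the duration of the data.

Insight: With a formal definition and an efficient querying algorithm, the method offers a new tool for validating simulation testing and helps make safety validation of autonomous driving systems more efficient.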

Abstract: Simulation-based testing has become a crucial complement to road testing for ensuring the safety of cyber physical systems (CPS). As a result, significant research efforts have been directed toward identifying failure scenarios within simulation environments. However, a critical question remains. Are the AV failure scenarios discovered in simulation reproducible on actual systems in the real world? The sim-to-real gap caused by differences between simulated and real sensor data means that failure scenarios identified in simulation might either be artifacts of synthetic sensor data or actual issues that also occur with real sensor data. To address this, an effective approach to validating simulated failure scenarios is to locate occurrences of these scenarios within real-world datasets and verify whether the failure persists on the datasets. To this end, we introduce a formal definition of how labeled time series sensor data can match an abstract scenario, represented as a scenario program using the Scenic probabilistic programming language. We present a querying algorithm that, given a scenario program and a labeled dataset, identifies the subset of data that matches the specified scenario. Our experiment shows that our algorithm is more accurate and orders of magnitude faster in querying scenarios than the state-of-the-art commercial vision large language models, and can scale with the duration of queried time series data.

cs.CR

[133] Trapped by Their Own Light: Deployable and Stealth Retroreflective Patch Attacks on Traffic Sign Recognition Systems

Go Tsuruoka,Takami Sato,Qi Alfred Chen,Kazuki Nomoto,Ryunosuke Kobayashi,Yuna Tanaka,Tatsuya Mori

Main category: cs.CR

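TL;DR: The paper proposes a new adversarial attack, the Adversarial Retroreflective Patch (ARP), which uses retroreflective materials that activate only under the victim’s headlights to attack traffic sign recognition (TSR) systems with both high deployability and stealth. Using black-box optimization to maximize attack effectiveness, ARP reaches a ≥93.4% success rate in dynamic scenarios and is stealthier than conventional patch attacks. The paper also proposes a defense, DPR Shield, with a defense success rate of ≥75% on specific signs.

Motivation: TSR systems are critical to the safety of autonomous driving, but existing adversarial attacks (such as stickers or laser projections) are either visually detectable or constrained in implementation. The paper explores a new attack surface that uses retroreflective materials to achieve a stealthy, deployable attack.

Contribution: 1. Proposes the ARP attack, combining the high deployability of patch attacks with the stealth of laser projections; 2. develops a retroreflection simulation method and a black-box optimization strategy; 3. proposes the DPR Shield defense, which effectively counters the attack; 4. validates attack effectiveness and stealth in dynamic and real-world scenarios.

Method: 1. Design ARP patches from retroreflective materials; 2. develop a retroreflection simulation method; 3. use black-box optimization (e.g., genetic algorithms) to maximize attack success rate; 4. test the attack in dynamic and real-world scenarios; 5. design the DPR Shield defense (polarized filtering).

Result: 1. ARP achieves a ≥93.4% attack success rate in dynamic scenarios at 35 meters; 2. it achieves a ≥60% success rate against commercial TSR systems in real-world conditions; 3. its stealthiness scores are ≥1.9% higher than those of previous patch attacks; 4. DPR Shield achieves a ≥75% defense success rate on specific signs (stop signs and speed limit signs).

Insight: 1. Retroreflective materials offer adversarial attacks a new balance of stealth and effectiveness; 2. black-box optimization can generate attacks for complex scenarios; 3. polarized filtering is an effective defense against such attacks.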

Abstract: Traffic sign recognition plays a critical role in ensuring safe and efficient transportation of autonomous vehicles but remains vulnerable to adversarial attacks using stickers or laser projections. While existing attack vectors demonstrate security concerns, they suffer from visual detectability or implementation constraints, suggesting unexplored vulnerability surfaces in TSR systems. We introduce the Adversarial Retroreflective Patch (ARP), a novel attack vector that combines the high deployability of patch attacks with the stealthiness of laser projections by utilizing retroreflective materials activated only under victim headlight illumination. We develop a retroreflection simulation method and employ black-box optimization to maximize attack effectiveness. ARP achieves $\geq$93.4% success rate in dynamic scenarios at 35 meters and $\geq$60% success rate against commercial TSR systems in real-world conditions. Our user study demonstrates that ARP attacks maintain near-identical stealthiness to benign signs while achieving $\geq$1.9% higher stealthiness scores than previous patch attacks. We propose the DPR Shield defense, employing strategically placed polarized filters, which achieves $\geq$75% defense success rates for stop signs and speed limit signs against micro-prism patches.

cs.SI

[134] Simulating Misinformation Propagation in Social Networks using Large Language Models

Raj Gaurav Maurya,Vaibhav Shukla,Raj Abhijit Dandekar,Rajat Dandekar,Sreedath Panat

Main category: cs.SI

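TL;DR: The paper proposes a large language model (LLM)-based framework for simulating misinformation propagation in social networks. By simulating users with different cognitive biases and ideologies, it quantifies and analyzes how misinformation spreads.

Motivation: Misinformation on social networks often spreads by exploiting human cognitive biases and emotion, a process that traditional methods struggle to simulate precisely. The authors use LLMs to simulate user behavior and quantify how misinformation diffuses.

Contribution: 1. Proposes an LLM-based auditor-node framework with persona-conditioned nodes that simulates misinformation propagation in social networks; 2. designs an auditing mechanism and quantitative metrics (a misinformation index and a misinformation propagation rate); 3. finds that ideology-driven personas accelerate misinformation, whereas expert personas preserve factual stability.

Method: 1. Use LLMs to create diverse personas (e.g., ideology-driven or expert-driven users); 2. propagate news content through persona nodes, each rewriting what it receives; 3. track factual fidelity at every step with a question-answering-based auditor; 4. quantify the propagation rate and severity of misinformation.

Result: Experiments show that ideology-driven personas accelerate misinformation in politics, marketing, and technology, while expert personas preserve factual stability. Interactions among heterogeneous personas rapidly escalate misinformation to propaganda-level distortion.

Insight: LLMs can both mimic human biases and serve as auditors that track information fidelity, offering a new approach to studying and mitigating misinformation in digital ecosystems.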

Abstract: Misinformation on social media thrives on surprise, emotion, and identity-driven reasoning, often amplified through human cognitive biases. To investigate these mechanisms, we model large language model (LLM) personas as synthetic agents that mimic user-level biases, ideological alignments, and trust heuristics. Within this setup, we introduce an auditor–node framework to simulate and analyze how misinformation evolves as it circulates through networks of such agents. News articles are propagated across networks of persona-conditioned LLM nodes, each rewriting received content. A question–answering-based auditor then measures factual fidelity at every step, offering interpretable, claim-level tracking of misinformation drift. We formalize a misinformation index and a misinformation propagation rate to quantify factual degradation across homogeneous and heterogeneous branches of up to 30 sequential rewrites. Experiments with 21 personas across 10 domains reveal that identity- and ideology-based personas act as misinformation accelerators, especially in politics, marketing, and technology. By contrast, expert-driven personas preserve factual stability. Controlled-random branch simulations further show that once early distortions emerge, heterogeneous persona interactions rapidly escalate misinformation to propaganda-level distortion. Our taxonomy of misinformation severity – spanning factual errors, lies, and propaganda – connects observed drift to established theories in misinformation studies. These findings demonstrate the dual role of LLMs as both proxies for human-like biases and as auditors capable of tracing information fidelity. The proposed framework provides an interpretable, empirically grounded approach for studying, simulating, and mitigating misinformation diffusion in digital ecosystems.
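The question-answering-based audit can be pictured with the toy Python sketch below. It treats the misinformation index at each rewrite as the fraction of audit questions whose answer no longer matches the source article, and the propagation rate as the average change in that index per rewrite. Both definitions, the function names, and the example answers are assumptions made for illustration only; the abstract names these metrics but does not give their formulas.

```python
def misinformation_index(reference_answers, audited_answers):
    """Fraction of audit questions whose answer differs from the source article."""
    if not reference_answers or len(reference_answers) != len(audited_answers):
        raise ValueError("answer lists must be non-empty and of equal length")
    mismatches = sum(
        ref != got for ref, got in zip(reference_answers, audited_answers)
    )
    return mismatches / len(reference_answers)


def propagation_rate(indices):
    """Average per-rewrite change in the index along one branch (assumed form)."""
    if len(indices) < 2:
        return 0.0
    return (indices[-1] - indices[0]) / (len(indices) - 1)


# Hypothetical auditor answers for the source article and three successive rewrites.
reference = ["2021", "Geneva", "12%", "yes"]
branch = [
    ["2021", "Geneva", "12%", "yes"],
    ["2021", "Geneva", "20%", "yes"],
    ["2021", "Paris", "20%", "no"],
]
indices = [misinformation_index(reference, answers) for answers in branch]
print(indices, propagation_rate(indices))
```

Because each mismatch is tied to one audit question, drift can be traced back to individual claims, in the spirit of the claim-level tracking the abstract describes.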

eess.AS

[135] Music Flamingo: Scaling Music Understanding in Audio Language Models

Sreyan Ghosh,Arushi Goel,Lasha Koroshinadze,Sang-gil Lee,Zhifeng Kong,Joao Felipe Santos,Ramani Duraiswami,Dinesh Manocha,Wei Ping,Mohammad Shoeybi,Bryan Catanzaro

Main category: eess.AS

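TL;DR: Music Flamingo is a new large audio-language model designed to improve understanding of music (including songs). By building the high-quality music dataset MF-Skills and using an improved training recipe, the model achieves several SOTA results on music understanding and reasoning tasks.

Motivation: Current audio-language models understand music only superficially, mainly because high-quality music data is scarce and hard to annotate. Music Flamingo aims to address this and move music understanding from surface-level recognition toward layered, human-like perception.

Contribution: 1. Builds MF-Skills, a large-scale music dataset with rich captions and question-answer pairs; 2. proposes an enhanced Audio Flamingo 3 model; 3. introduces MF-Think, a chain-of-thought dataset grounded in music theory, together with GRPO-based reinforcement learning to improve the model’s reasoning.

Method: 1. Label the MF-Skills dataset through a multi-stage pipeline; 2. fine-tune on top of Audio Flamingo 3; 3. cold-start with MF-Think, then optimize the model with GRPO-based reinforcement learning.

Result: Music Flamingo achieves SOTA results on 10+ music understanding and reasoning benchmarks, demonstrating strong capability and generalization in the music domain.

Insight: High-quality datasets and training methods grounded in music theory are essential to improving the music understanding of audio-language models. Music Flamingo offers a new direction and standard for future models.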

Abstract: We introduce Music Flamingo, a novel large audio-language model designed to advance music (including song) understanding in foundational audio models. While audio-language research has progressed rapidly, music remains challenging due to its dynamic, layered, and information-dense nature. Progress has been further limited by the difficulty of scaling open audio understanding models, primarily because of the scarcity of high-quality music data and annotations. As a result, prior models are restricted to producing short, high-level captions, answering only surface-level questions, and showing limited generalization across diverse musical cultures. To address these challenges, we curate MF-Skills, a large-scale dataset labeled through a multi-stage pipeline that yields rich captions and question-answer pairs covering harmony, structure, timbre, lyrics, and cultural context. We fine-tune an enhanced Audio Flamingo 3 backbone on MF-Skills and further strengthen multiple skills relevant to music understanding. To improve the model’s reasoning abilities, we introduce a post-training recipe: we first cold-start with MF-Think, a novel chain-of-thought dataset grounded in music theory, followed by GRPO-based reinforcement learning with custom rewards. Music Flamingo achieves state-of-the-art results across 10+ benchmarks for music understanding and reasoning, establishing itself as a generalist and musically intelligent audio-language model. Beyond strong empirical results, Music Flamingo sets a new standard for advanced music understanding by demonstrating how models can move from surface-level recognition toward layered, human-like perception of songs. We believe this work provides both a benchmark and a foundation for the community to build the next generation of models that engage with music as meaningfully as humans do.
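For readers unfamiliar with GRPO, the core of the method is a group-relative advantage: several responses are sampled for the same prompt, and each reward is normalized by the mean and standard deviation of its group. The Python sketch below shows that normalization only. The function name, the use of the population standard deviation, and the example reward values are assumptions; the abstract does not describe Music Flamingo’s custom rewards.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward by the mean and std of its sampling group.

    rewards: rewards for several responses sampled from the same prompt.
    """
    mean = statistics.fmean(rewards)
    # Population std is an assumption here; implementations differ on this detail.
    std = statistics.pstdev(rewards)
    # eps avoids division by zero when every response gets the same reward.
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for four sampled answers to one music question.
advantages = group_relative_advantages([1.0, 0.0, 0.5, 0.5])
print(advantages)
```

Responses scoring above the group average receive positive advantages and those below receive negative ones, so no separate value network is needed to provide a baseline.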