Table of Contents
- cs.CL [Total: 65]
- cs.CV [Total: 60]
- eess.IV [Total: 4]
- cs.CR [Total: 2]
- cs.HC [Total: 1]
- cs.IR [Total: 1]
- cs.GR [Total: 1]
- cs.CY [Total: 2]
- cs.AI [Total: 4]
- cs.RO [Total: 3]
- cs.SD [Total: 2]
- cs.LG [Total: 5]
cs.CL
[1] OpenStaxQA: A multilingual dataset based on open-source college textbooks
Pranav Gupta
Main category: cs.CL
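TL;DR: OpenStaxQA is a multilingual evaluation dataset built from 43 open-source college textbooks in English, Spanish, and Polish. The author fine-tunes and evaluates large language models (LLMs) with quantized low-rank adapters (QLoRA), and runs a zero-shot evaluation on the AI2 Reasoning Challenge dev set to check whether the dataset helps generalization to other tasks.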
Details
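Motivation: Provide a multilingual evaluation benchmark for college-level educational applications, and explore how far open educational resources can help improve LLM performance.
Contribution: Introduces OpenStaxQA, a multilingual benchmark for college-level education, and tests its usefulness for LLMs through fine-tuning and zero-shot evaluation.
Method: (1) Build a multilingual dataset from 43 open-source textbooks; (2) fine-tune LLMs with QLoRA; (3) run a zero-shot evaluation on the AI2 Reasoning Challenge dev set.
Result: OpenStaxQA can be used to fine-tune LLMs and shows potential to generalize to other tasks such as the AI2 Reasoning Challenge.
Insight: Open educational resources can be an important source for building high-quality datasets, and multilingual dataset design helps make LLMs more broadly applicable.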
Abstract: We present OpenStaxQA, an evaluation benchmark specific to college-level educational applications based on 43 open-source college textbooks in English, Spanish, and Polish, available under a permissive Creative Commons license. We finetune and evaluate large language models (LLMs) with approximately 7 billion parameters on this dataset using quantized low rank adapters (QLoRa). Additionally we also perform a zero-shot evaluation on the AI2 reasoning challenge dev dataset in order to check if OpenStaxQA can lead to an improved performance on other tasks. We also discuss broader impacts relevant to datasets such as OpenStaxQA.
[2] Knowledge Graph-Guided Multi-Agent Distillation for Reliable Industrial Question Answering with Datasets
Jiqun Pan,Zhenke Duan,Jiani Tu,Anzhi Cheng,Yanqing Wang
Main category: cs.CL
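TL;DR: The paper proposes Knowledge Graph-guided Multi-Agent System Distillation (KG-MASD) to improve the safety and reliability of industrial question-answering systems. It combines a knowledge graph with multi-agent collaborative reasoning to generate high-quality instruction-tuning data, and distills both reasoning and verification ability into a lightweight student model.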
Details
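Motivation: Industrial QA systems need higher safety and reliability in high-risk scenarios, but multi-agent collaborative reasoning suffers from uncontrolled iterations and unverifiable outputs, and conventional distillation struggles to transfer collaborative reasoning to lightweight models.
Contribution: 1. Proposes KG-MASD, which formulates distillation as a Markov Decision Process; 2. introduces a knowledge graph as a verifiable structured prior; 3. jointly distills reasoning depth and verifiability into a lightweight student model.
Method: KG-MASD combines a knowledge graph with multi-agent collaborative reasoning, models the distillation task as a Markov Decision Process, generates high-confidence instruction-tuning data, and distills collaborative reasoning together with knowledge-graph-based verification into the student model.
Result: On an industrial QA dataset, KG-MASD improves accuracy by 2.4% to 20.1% over baselines and significantly enhances reliability.
Insight: The knowledge graph both enriches the state representation and keeps the reasoning verifiable, offering an effective route to trustworthy AI deployment in high-risk industrial settings.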
Abstract: Industrial question-answering (QA) systems require higher safety and reliability than general-purpose dialogue models, as errors in high-risk scenarios such as equipment fault diagnosis can have severe consequences. Although multi-agent large language models enhance reasoning depth, they suffer from uncontrolled iterations and unverifiable outputs, and conventional distillation methods struggle to transfer collaborative reasoning capabilities to lightweight, deployable student models. To address these challenges, we propose Knowledge Graph-guided Multi-Agent System Distillation (KG-MASD). Our approach formulates distillation as a Markov Decision Process and incorporates a knowledge graph as a verifiable structured prior to enrich state representation and ensure convergence. By integrating collaborative reasoning with knowledge grounding, KG-MASD generates high-confidence instruction-tuning data and jointly distills reasoning depth and verifiability into compact student models suitable for edge deployment. Experiments on an industrial QA dataset show that KG-MASD improves accuracy by 2.4 per cent to 20.1 per cent over baselines and significantly enhances reliability, enabling trustworthy AI deployment in safety-critical industrial scenarios. Code and data are available at https://github.com/erwinmsmith/KG-MAD/.
[3] CoT Referring: Improving Referring Expression Tasks with Grounded Reasoning
Qihua Dong,Luis Figueroa,Handong Zhao,Kushal Kafle,Jason Kuen,Zhihong Ding,Scott Cohen,Yun Fu
Main category: cs.CL
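TL;DR: The paper proposes CoT Referring, a strategy that uses chain-of-thought-structured training data to strengthen the reasoning of multimodal models on referring expression tasks, improving accuracy on complex queries.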
Details
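Motivation: Referring expression comprehension and segmentation are key tasks for evaluating Multimodal Large Language Models (MLLMs). Existing methods perform poorly on complex queries, so a more systematic way to improve model reasoning is needed.
Contribution: 1. Proposes CoT Referring, which strengthens cross-modal reasoning through structured chain-of-thought training data; 2. restructures the training data with a new annotation format; 3. compiles an evaluation benchmark for complex referring cases; 4. proposes an adaptive weighted loss for a unified detection-and-segmentation framework.
Method: 1. Parse the text into sequential referring steps; 2. at each step, identify relationships and keep references aligned; 3. restructure the training data and provide new annotations; 4. integrate detection and segmentation into a unified MLLM framework; 5. train the model with an adaptive weighted loss.
Result: On RefCOCO/+/g and the newly curated benchmark, CoT Referring improves over baseline models by more than 2.5%.
Insight: Structured chain-of-thought reasoning with systematic cross-modal alignment improves performance on complex referring expression tasks. A unified MLLM framework with an adaptive loss offers a new optimization direction for multimodal tasks.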
Abstract: Referring Expression Comprehension and Segmentation are critical tasks for assessing the integration of language understanding and image comprehension, serving as benchmarks for Multimodal Large Language Models (MLLMs) capabilities. To address these challenges, we propose a new strategy, CoT Referring, which enhances model reasoning across modalities through a structured, chain-of-thought training data structure. Our approach systematically parses textual structures to a sequential referring step, where in each step it identifies relationships and ensures consistent reference alignment, thereby improving accuracy in complex query scenarios. We restructure the training data to enforce a new output form, providing new annotations for existing datasets and compiling an evaluation benchmark from existing resources. This benchmark is designed explicitly for complex referring cases. We also integrate detection and segmentation capabilities into a unified MLLM framework, training it with a novel adaptive weighted loss to optimize performance. Experimental results on our curated benchmark and RefCOCO/+/g demonstrate the effectiveness of our approach, with a notable increase of 2.5%+ over baseline models.
[4] TRepLiNa: Layer-wise CKA+REPINA Alignment Improves Low-Resource Machine Translation in Aya-23 8B
Toshiki Nakai,Ravi Kiran Chikkala,Lena Sophie Oberkircher,Nicholas Jennings,Natalia Skachkova,Tatiana Anikina,Jesujoba Oluwadara Alabi
Main category: cs.CL
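TL;DR: TRepLiNa is a layer-wise alignment method that combines CKA and REPINA to improve translation from low-resource languages (LRLs) to high-resource languages (HRLs), and it is most useful in data-scarce settings.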
Details
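Motivation: India's diverse low-resource languages suffer from poor translation quality because resources are lacking; the work asks whether enforcing cross-lingual similarity in specific layers of an LLM helps.
Contribution: Proposes TRepLiNa, which combines CKA (cross-lingual representation alignment) with REPINA (a constraint on parameter updates), and validates experimentally that it improves LRL translation.
Method: Applies TRepLiNa (CKA+REPINA) to align mid-level representations in Aya-23 8B, with experiments covering zero-shot, few-shot, and fine-tuning settings.
Result: TRepLiNa works best when aligning mid-level layers and improves translation quality for low-resource languages as a low-cost, practical method.
Insight: Aligning mid-level representations matters most for cross-lingual translation, especially when data is scarce, and TRepLiNa offers an effective way to do it.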
Abstract: The 2025 Multimodal Models for Low-Resource Contexts and Social Impact (MMLoSo) Language Challenge addresses one of India’s most pressing linguistic gaps: the lack of resources for its diverse low-resource languages (LRLs). In this study, we investigate whether enforcing cross-lingual similarity in specific internal layers of a decoder-only multilingual large language model (LLM) can improve translation quality from LRL to high-resource language (HRL). Specifically, we combine Centered Kernel Alignment (CKA), a similarity metric that encourages representations of different languages to align, with REPINA, a regularization method that constrains parameter updates to remain close to the pretrained model, into a joint method we call TRepLiNa. In this research project, we experiment with zero-shot, few-shot, and fine-tuning settings using Aya-23 8B with QLoRA across MMLoSo shared task language pairs (Mundari, Santali, Bhili) with Hindi/English pivots. Our results show that aligning mid-level layers using TRepLiNa (CKA+REPINA) is a low-cost, practical approach to improving LRL translation, especially in data-scarce settings.
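Note on [4]: the abstract describes CKA as a similarity metric between representations of different languages. The following is a minimal pure-Python sketch of the standard linear CKA formula, not the authors' implementation. The function names and toy matrices are illustrative assumptions, and using 1 - CKA as an alignment penalty is only one plausible reading of "encourages representations to align"; the abstract does not specify the loss.

```python
from math import sqrt

def center_columns(X):
    """Subtract the per-feature mean from every row of X (a list of rows)."""
    n = len(X)
    means = [sum(col) / n for col in zip(*X)]
    return [[v - m for v, m in zip(row, means)] for row in X]

def cross_gram_sq_norm(X, Y):
    """Squared Frobenius norm of X^T Y for row-major matrices X and Y."""
    total = 0.0
    for xcol in zip(*X):
        for ycol in zip(*Y):
            dot = sum(a * b for a, b in zip(xcol, ycol))
            total += dot * dot
    return total

def linear_cka(X, Y):
    """Linear CKA between two sets of n representations (n x d1 and n x d2)."""
    Xc, Yc = center_columns(X), center_columns(Y)
    numerator = cross_gram_sq_norm(Xc, Yc)
    denominator = sqrt(cross_gram_sq_norm(Xc, Xc) * cross_gram_sq_norm(Yc, Yc))
    return numerator / denominator

# Toy hidden states for four parallel sentences (made-up numbers).
hidden_lrl = [[1.0, 0.0], [0.0, 1.0], [2.0, 1.0], [3.0, 5.0]]
hidden_hrl = [[2.0, 0.0], [0.0, 2.0], [4.0, 2.0], [6.0, 10.0]]

# Hypothetical alignment penalty: small when the two layers are similar.
alignment_penalty = 1.0 - linear_cka(hidden_lrl, hidden_hrl)
```

Linear CKA is invariant to isotropic scaling and to permuting feature dimensions, which is why the scaled toy matrix above yields a penalty close to zero.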
[5] EverydayMMQA: A Multilingual and Multimodal Framework for Culturally Grounded Spoken Visual QA
Firoj Alam,Ali Ezzat Shahroor,Md. Arid Hasan,Zien Sheikh Ali,Hunzalah Hassan Bhatti,Mohamed Bayan Kmainasi,Shammur Absar Chowdhury,Basel Mousi,Fahim Dalvi,Nadir Durrani,Natasa Milic-Frayling
Main category: cs.CL
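TL;DR: The paper introduces the EverydayMMQA framework and the OASIS dataset for multilingual, multimodal, culturally grounded visual question answering, addressing the lack of cultural background knowledge in existing models for low-resource and underrepresented languages.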
Details
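Motivation: Large-scale multimodal models perform well on tasks such as Visual Question Answering (VQA) but often fail on queries that require culturally grounded, everyday knowledge, particularly in low-resource and underrepresented languages.
Contribution: 1. Proposes the EverydayMMQA framework for creating large-scale, culturally grounded multimodal datasets; 2. builds the OASIS dataset, with about 0.92M images and 14.8M QA pairs, covering multiple input combinations of speech, images, and text.
Method: The EverydayMMQA framework integrates speech, image, and text data to build OASIS, which supports four input combinations: speech-only, text-only, speech+image, and text+image. The dataset focuses on English and Arabic varieties and covers diverse real-world situations across 18 countries.
Result: Four closed-source models, three open-source models, and one fine-tuned model are benchmarked. OASIS tests models on tasks involving pragmatic, commonsense, and culturally aware reasoning.
Insight: Cultural grounding and multilingual support are key to improving how multimodal models generalize, and OASIS provides a benchmark for building culturally aware models.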
Abstract: Large-scale multimodal models achieve strong results on tasks like Visual Question Answering (VQA), but they often fail when queries require culturally grounded, everyday knowledge, particularly in low-resource and underrepresented languages. To bridge this gap, we introduce Everyday Multimodal and Multilingual QA (EverydayMMQA), a framework for creating large-scale, culturally-grounded datasets for spoken and visual question answering (SVQA). Using this framework, we developed OASIS, a multimodal dataset integrating speech, images, and text. With over ~0.92M images and 14.8M QA pairs, OASIS contains 3.7M spoken questions, enabling four unique input combinations: speech-only, text-only, speech+image, and text+image. Focused on English and Arabic varieties, 18 countries, the dataset content is curated to reflect diverse, real-world situations. OASIS tests models on tasks beyond object recognition that involve pragmatic, commonsense, and culturally aware reasoning. We benchmarked four closed-source models, three open-source models, and one fine-tuned model. EverydayMMQA and OASIS together provide a benchmark and training dataset for building multimodal LLMs for a comprehensive set of everyday tasks within cultural contexts. The framework and dataset will be made publicly available to the community.
[6] Semantic Regexes: Auto-Interpreting LLM Features with a Structured Language
Angie Boggust,Donghao Ren,Yannick Assogba,Dominik Moritz,Arvind Satyanarayan,Fred Hohman
Main category: cs.CL
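TL;DR: The paper proposes semantic regexes, a structured language for describing large language model (LLM) features precisely, addressing the vagueness and inconsistency of natural language descriptions.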
Details
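Motivation: Existing methods for interpreting LLM features usually rely on natural language descriptions, which are often vague, inconsistent, and in need of manual relabeling. Semantic regexes are proposed in response.
Contribution: Introduces semantic regexes, a structured-language approach that describes LLM features more precisely and consistently, and whose quantification and composition support model-wide analysis.
Method: Combines primitives that capture linguistic and semantic feature patterns with modifiers for contextualization, composition, and quantification to produce structured feature descriptions.
Result: Semantic regexes match the accuracy of natural language descriptions while being more concise and consistent, and they enable new analyses such as quantifying feature complexity across layers. User studies find that they help people build accurate mental models of LLM feature activations.
Insight: A structured language can make feature interpretation more precise and consistent, and it provides a new tool for model-wide analysis.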
Abstract: Automated interpretability aims to translate large language model (LLM) features into human understandable descriptions. However, these natural language feature descriptions are often vague, inconsistent, and require manual relabeling. In response, we introduce semantic regexes, structured language descriptions of LLM features. By combining primitives that capture linguistic and semantic feature patterns with modifiers for contextualization, composition, and quantification, semantic regexes produce precise and expressive feature descriptions. Across quantitative benchmarks and qualitative analyses, we find that semantic regexes match the accuracy of natural language while yielding more concise and consistent feature descriptions. Moreover, their inherent structure affords new types of analyses, including quantifying feature complexity across layers, scaling automated interpretability from insights into individual features to model-wide patterns. Finally, in user studies, we find that semantic regex descriptions help people build accurate mental models of LLM feature activations.
[7] Protecting De-identified Documents from Search-based Linkage Attacks
Pierre Lison,Mark Anderson
Main category: cs.CL
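TL;DR: The paper presents a method that builds an inverted index of N-grams and applies LLM-based rewriting to prevent de-identified documents from being linked back to their source through search, while preserving the semantic integrity of the text.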
Details
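Motivation: Existing de-identification models can conceal personal identifiers in a document, but they do not address the risk that the text can still be mapped back to the original dataset through search, which can leak private information.
Contribution: A two-step method: 1) build an inverted index of N-grams to identify rare N-grams; 2) use LLM-based rewriting to produce text that is semantically faithful but no longer linkable, which counters search-based linkage attacks.
Method: First, build an inverted index of the N-grams in the document collection and identify those that appear in fewer than k documents. Second, iteratively query an LLM to rewrite those spans until linkage is no longer possible.
Result: Experiments on a collection of court cases show that the method effectively prevents search-based linkage while remaining faithful to the original content.
Insight: LLMs have clear potential in privacy protection: semantic rewriting can protect privacy without sacrificing the usefulness and readability of the text.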
Abstract: While de-identification models can help conceal the identity of the individual(s) mentioned in a document, they fail to address linkage risks, defined as the potential to map the de-identified text back to its source. One straightforward way to perform such linkages is to extract phrases from the de-identified document and then check their presence in the original dataset. This paper presents a method to counter search-based linkage attacks while preserving the semantic integrity of the text. The method proceeds in two steps. We first construct an inverted index of the N-grams occurring in the document collection, making it possible to efficiently determine which N-grams appear in less than $k$ documents (either alone or in combination with other N-grams). An LLM-based rewriter is then iteratively queried to reformulate those spans until linkage is no longer possible. Experimental results on a collection of court cases show that the method is able to effectively prevent search-based linkages while remaining faithful to the original content.
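Note on [7]: the first step of the method, finding N-grams that occur in fewer than k documents, can be illustrated with a small inverted index. This is a minimal sketch under simplifying assumptions (whitespace tokenization, single N-grams only, hypothetical function names and toy documents); it does not cover the combinations of N-grams or the LLM-based rewriting step described in the abstract.

```python
from collections import defaultdict

def ngrams(tokens, n):
    """All contiguous n-token spans of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def build_inverted_index(docs, n):
    """Map each N-gram to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for gram in ngrams(text.lower().split(), n):
            index[gram].add(doc_id)
    return index

def rare_ngrams(index, doc_id, k):
    """N-grams of doc_id that occur in fewer than k documents overall."""
    return sorted(g for g, ids in index.items() if doc_id in ids and len(ids) < k)

# Toy collection (made-up sentences standing in for court cases).
docs = {
    "case1": "the applicant lodged an appeal with the court",
    "case2": "the applicant lodged a complaint with the court",
    "case3": "the court dismissed the appeal",
}
index = build_inverted_index(docs, n=2)

# Bigrams that would let a searcher single out case1 when k = 2.
risky = rare_ngrams(index, "case1", k=2)
```

In this toy collection, the spans flagged for case1 are the ones a rewriter would need to reformulate, while common bigrams such as "the court" are left alone.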
[8] Controllable Stylistic Text Generation with Train-Time Attribute-Regularized Diffusion
Fan Zhou,Chang Tian,Tim Van de Cruys
Main category: cs.CL
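TL;DR: The paper proposes RegDiff, a diffusion framework with train-time attribute regularization for controllable stylistic text generation, avoiding high computational cost at sampling time.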
Details
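Motivation: Classifier-free guidance (CFG) preserves semantic content but often gives weak attribute control, while classifier guidance (CG) aligns attributes better at the price of high sampling cost and classifier generalization issues. A more efficient and precise method is needed.
Contribution: Proposes RegDiff, which uses a latent diffusion model trained with attribute supervision to achieve efficient and precise attribute-controlled text generation.
Method: Uses a VAE-based encoder-decoder architecture to ensure reconstruction fidelity, together with a latent diffusion model trained with attribute supervision. Attribute information is injected only during training.
Result: Experiments on five datasets show that RegDiff outperforms strong baselines at generating text across multiple stylistic attributes.
Insight: Train-time attribute regularization removes the need for a pretrained classifier during sampling, which reduces computational cost while retaining attribute control.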
Abstract: Generating stylistic text with specific attributes is a key problem in controllable text generation. Recently, diffusion models have emerged as a powerful paradigm for both visual and textual generation. Existing approaches can be broadly categorized into classifier-free guidance (CFG) and classifier guidance (CG) methods. While CFG effectively preserves semantic content, it often fails to provide effective attribute control. In contrast, CG modifies the denoising trajectory using classifier gradients, enabling better attribute alignment but incurring high computational costs during sampling and suffering from classifier generalization issues. In this work, we propose RegDiff, a regularized diffusion framework that leverages attribute features without requiring a pretrained classifier during sampling, thereby achieving controllable generation with reduced computational costs. Specifically, RegDiff employs a VAE-based encoder–decoder architecture to ensure reconstruction fidelity and a latent diffusion model trained with attribute supervision to enable controllable text generation. Attribute information is injected only during training. Experiments on five datasets spanning multiple stylistic attributes demonstrate that RegDiff outperforms strong baselines in generating stylistic texts. These results validate the effectiveness of RegDiff as an efficient solution for attribute-controllable text diffusion. Our code, datasets, and resources will be released upon publication at https://github.com/xxxx.
[9] FinLFQA: Evaluating Attributed Text Generation of LLMs in Financial Long-Form Question Answering
Yitao Long,Tiansheng Hu,Yilun Zhao,Arman Cohan,Chen Zhao
Main category: cs.CL
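TL;DR: FinLFQA is a new benchmark for evaluating attributed text generation by LLMs in long-form question answering on complex financial questions. It emphasizes multi-faceted attribution evaluation and comes with an automatic evaluation framework.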
Details
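Motivation: Existing benchmarks mostly focus on simple attribution that retrieves supporting textual evidence, and overlook the multi-faceted attribution (such as numerical reasoning and domain knowledge) needed in real-world settings like finance. A more comprehensive evaluation tool is needed.
Contribution: Introduces the FinLFQA benchmark, covering three aspects of attribution: supporting evidence extracted from financial reports, intermediate numerical reasoning steps, and domain-specific financial knowledge. Also provides an automatic evaluation framework.
Method: Builds the three attribution aspects through human annotation, designs an automatic evaluation framework covering answer quality and attribution quality, and compares eight LLMs across multiple attribution-generation paradigms.
Result: Fine-grained metrics are important for distinguishing model capabilities; end-to-end generation performs comparably to post-hoc approaches; iterative refinement helps only when guided by external feedback.
Insight: Attribution needs multi-faceted evaluation; end-to-end generation is competitive with post-hoc attribution; external feedback is what makes iterative refinement work.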
Abstract: Large Language Models (LLMs) frequently hallucinate to long-form questions, producing plausible yet factually incorrect answers. A common mitigation strategy is to provide attribution to LLM outputs. However, existing benchmarks primarily focus on simple attribution that retrieves supporting textual evidence as references. We argue that in real-world scenarios such as financial applications, attribution goes beyond reference retrieval. We introduce FinLFQA, a benchmark designed to evaluate the ability of LLMs to generate long-form answers to complex financial questions with reliable and nuanced attributions. FinLFQA evaluates three critical aspects of attribution through human annotations: (1) supporting evidence extracted from financial reports, (2) intermediate numerical reasoning steps, and (3) domain-specific financial knowledge that informs the reasoning process. We further provide an automatic evaluation framework covering both answer quality and attribution quality. Through extensive experiments on eight LLMs across multiple attribution-generation paradigms, we find that fine-grained metrics are important to distinguish model capabilities, that end-to-end generation achieves comparable performance to post-hoc approaches, and that iterative refinement only helps when guided by external feedback.
[10] Bridging Discourse Treebanks with a Unified Rhetorical Structure Parser
Elena Chistova
Main category: cs.CL
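TL;DR: UniRST is a unified RST-style discourse parser that handles 18 treebanks in 11 languages without modifying their relation inventories. Two training strategies, Multi-Head and Masked-Union, address inventory incompatibility, and the unified model outperforms most mono-treebank baselines in multilingual end-to-end discourse parsing.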
Details
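Motivation: Relation inventories are incompatible across treebanks in multilingual discourse parsing. The work aims for a single parser that works efficiently across languages and treebanks.
Contribution: (1) Proposes UniRST, the first unified RST-style discourse parser; (2) designs two training strategies, Multi-Head and Masked-Union, to overcome inventory incompatibility; (3) proposes a simple yet effective augmentation technique for low-resource settings.
Method: (1) Multi-Head assigns a separate relation classification layer to each inventory; (2) Masked-Union trains shared parameters through selective label masking; (3) data augmentation is applied in low-resource settings.
Result: (1) The parameter-efficient Masked-Union approach is also the strongest; (2) UniRST outperforms 16 of 18 mono-treebank baselines, showing the advantage of a single model for multilingual end-to-end discourse parsing.
Insight: Selective label masking with shared parameters makes efficient, unified RST-style discourse parsing possible across languages and treebanks, while also benefiting low-resource settings.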
Abstract: We introduce UniRST, the first unified RST-style discourse parser capable of handling 18 treebanks in 11 languages without modifying their relation inventories. To overcome inventory incompatibilities, we propose and evaluate two training strategies: Multi-Head, which assigns separate relation classification layer per inventory, and Masked-Union, which enables shared parameter training through selective label masking. We first benchmark monotreebank parsing with a simple yet effective augmentation technique for low-resource settings. We then train a unified model and show that (1) the parameter efficient Masked-Union approach is also the strongest, and (2) UniRST outperforms 16 of 18 mono-treebank baselines, demonstrating the advantages of a single-model, multilingual end-to-end discourse parsing across diverse resources.
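Note on [10]: "selective label masking" over a shared label set can be read as restricting the softmax to the labels that a given treebank actually uses. The sketch below illustrates that reading only; it is not the UniRST implementation, and the label names, the two inventories, and all function names are hypothetical.

```python
from math import exp, log

# Hypothetical union of relation labels across all treebanks.
UNION_LABELS = ["Elaboration", "Contrast", "Cause", "Background", "Attribution"]

# Hypothetical per-treebank relation inventories (subsets of the union).
TREEBANK_LABELS = {
    "tb_a": {"Elaboration", "Contrast", "Cause"},
    "tb_b": {"Elaboration", "Background", "Attribution"},
}

def masked_log_softmax(logits, treebank):
    """Log-softmax over the union logits, restricted to one treebank's inventory."""
    allowed = TREEBANK_LABELS[treebank]
    kept = [(lab, z) for lab, z in zip(UNION_LABELS, logits) if lab in allowed]
    top = max(z for _, z in kept)  # subtract the max for numerical stability
    log_norm = top + log(sum(exp(z - top) for _, z in kept))
    return {lab: z - log_norm for lab, z in kept}

def masked_nll(logits, treebank, gold):
    """Negative log-likelihood of the gold label under the masked distribution."""
    return -masked_log_softmax(logits, treebank)[gold]

# One shared classifier output, scored against two different inventories.
logits = [2.0, 1.0, 0.5, 3.0, -1.0]
loss_a = masked_nll(logits, "tb_a", "Cause")
loss_b = masked_nll(logits, "tb_b", "Background")
```

Because labels outside a treebank's inventory are excluded from the normalization, a single output layer can be trained on every treebank without merging or renaming their relation sets.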
[11] MathRobust-LV: Evaluation of Large Language Models’ Robustness to Linguistic Variations in Mathematical Reasoning
Neeraja Kirtane,Yuvraj Khanna,Peter Relan
Main category: cs.CL
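TL;DR: The paper introduces MathRobust-LV to evaluate how robust large language models are to linguistic variation in mathematical reasoning. Although LLMs excel on math benchmarks, their accuracy drops under linguistic variation, especially for smaller models.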
Details
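Motivation: Recent evaluation of mathematical reasoning concentrates on high-difficulty competitions such as the IMO and neglects high-school-level problems in real educational settings. Instructors rephrase problems across assessments while keeping difficulty constant, so models need to be evaluated for robustness to this kind of variation.
Contribution: Introduces MathRobust-LV, a test set and evaluation methodology focused on high-school-level math problems. It changes surface details (names, contexts, variables) while preserving numerical structure and answers.
Method: Builds the MathRobust-LV test set by rephrasing problems into variants that keep the mathematical structure and answers fixed, and evaluates 34 models on these variants.
Result: Accuracy declines when moving from the baseline problems to the variants: smaller models drop by 9-11%, stronger models also show measurable degradation, and frontier models such as GPT-5 and Gemini-2.5pro remain comparatively stable.
Insight: Robustness to linguistic variation remains a fundamental challenge for LLM mathematical reasoning; even stronger models show measurable degradation.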
Abstract: Large language models excel on math benchmarks, but their math reasoning robustness to linguistic variation is underexplored. While recent work increasingly treats high-difficulty competitions like the IMO as the gold standard for evaluating reasoning, we believe in comprehensive benchmarking of high school-level math problems in real educational settings. We introduce MathRobust-LV, a test set and evaluation methodology that mirrors how instructors rephrase problems across assessments while keeping difficulty constant: we change surface details (names, contexts, variables) while preserving numerical structure and answers. In contrast to prior efforts that alter problem content or emphasize IMO-level tasks, we focus on high-school-level dataset problems at the difficulty level where models are currently deployed in educational settings: tutoring and assessment systems. In these applications, instructors rephrase identical concepts in varied ways, making linguistic robustness essential for reliable deployment. Although MATH data benchmarking is often regarded as saturated, our experiment on 34 models reveals that accuracy declines when moving from the baseline to the variants. These drops are severe for smaller models (9-11%) while stronger models also show measurable degradation. Frontier models like GPT-5, Gemini-2.5pro remain comparatively stable. Our results highlight that robustness to linguistic variation is a fundamental challenge, exposing reasoning vulnerabilities in models.
[12] A Survey on Agentic Security: Applications, Threats and Defenses
Asif Shahriar,Md Nafiu Rahman,Sadif Ahmed,Farig Sadeque,Md Rizwan Parvez
Main category: cs.CL
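TL;DR: The paper presents itself as the first holistic survey of autonomous LLM agents in cybersecurity. It is organized around three pillars (applications, threats, and defenses), analyzes more than 150 papers, and identifies emerging trends and research gaps.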
Details
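Motivation: The rapid shift from passive LLMs to autonomous LLM agents brings new risks and challenges for cybersecurity, which calls for a systematic account of applications, threats, and defenses.
Contribution: Presents the first holistic survey of agentic security, builds a taxonomy around applications, threats, and defenses, and analyzes more than 150 papers.
Method: Systematically reviews and categorizes existing work, builds a three-pillar framework for agentic security, and uses cross-cutting analysis to expose trends and gaps.
Result: Summarizes how agents are used, the vulnerabilities they have, and the countermeasures designed to protect them, and identifies research gaps in model and modality coverage.
Insight: Agentic security needs more attention to the diversity of models and modalities, along with effective defenses against emerging threats.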
Abstract: The rapid shift from passive LLMs to autonomous LLM-agents marks a new paradigm in cybersecurity. While these agents can act as powerful tools for both offensive and defensive operations, the very agentic context introduces a new class of inherent security risks. In this work we present the first holistic survey of the agentic security landscape, structuring the field around three interdependent pillars: Applications, Threats, and Defenses. We provide a comprehensive taxonomy of over 150 papers, explaining how agents are used, the vulnerabilities they possess, and the countermeasures designed to protect them. A detailed cross-cutting analysis shows emerging trends in agent architecture while revealing critical research gaps in model and modality coverage.
[13] Linguistically Informed Tokenization Improves ASR for Underresourced Languages
Massimo Daul,Alessio Tosolini,Claire Bowern
Main category: cs.CL
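TL;DR: The paper studies automatic speech recognition (ASR) for underresourced languages. Replacing orthographic tokenization with phonemic tokenization substantially improves recognition of Yan-nhangu, and the work also examines how useful ASR is in language documentation.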
Details
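Motivation: Modern ASR systems rely on large amounts of data and complex transformer architectures, which makes them hard to apply to underresourced languages. The study explores whether a better tokenization strategy improves ASR for such languages and whether ASR is of practical value in language documentation.
Contribution: 1. Proposes a linguistically informed phonemic tokenization strategy that substantially lowers WER and CER; 2. shows that ASR is viable for documentation of underresourced languages; 3. shows that hand-correcting ASR output is more efficient than transcribing from scratch.
Method: Fine-tunes a wav2vec2 model on Yan-nhangu and compares phonemic with orthographic tokenization. In parallel, evaluates ASR as a tool in a language documentation pipeline.
Result: Phonemic tokenization substantially outperforms orthographic tokenization, lowering WER and CER. Hand-correcting ASR output is much faster than hand-transcribing audio from scratch.
Insight: Linguistically informed tokenization can improve ASR for underresourced languages, and ASR tools have practical potential in language documentation.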
Abstract: Automatic speech recognition (ASR) is a crucial tool for linguists aiming to perform a variety of language documentation tasks. However, modern ASR systems use data-hungry transformer architectures, rendering them generally unusable for underresourced languages. We fine-tune a wav2vec2 ASR model on Yan-nhangu, a dormant Indigenous Australian language, comparing the effects of phonemic and orthographic tokenization strategies on performance. In parallel, we explore ASR’s viability as a tool in a language documentation pipeline. We find that a linguistically informed phonemic tokenization system substantially improves WER and CER compared to a baseline orthographic tokenization scheme. Finally, we show that hand-correcting the output of an ASR model is much faster than hand-transcribing audio from scratch, demonstrating that ASR can work for underresourced languages.
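Note on [13]: WER and CER, the two metrics reported in the abstract, are both edit distances normalized by reference length, computed over words and characters respectively. The sketch below shows the usual definition; it is not the authors' evaluation code, and the function names and example strings are illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (unit cost per edit)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion of a reference item
                curr[j - 1] + 1,         # insertion of a hypothesis item
                prev[j - 1] + (r != h),  # substitution, or match at cost 0
            ))
        prev = curr
    return prev[-1]

def wer(reference, hypothesis):
    """Word error rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference, hypothesis):
    """Character error rate: character-level edit distance over reference length."""
    return edit_distance(reference, hypothesis) / len(reference)

# Toy transcripts: one substituted word and one dropped word.
reference = "the cat sat on the mat"
hypothesis = "the cat sit on mat"
word_error = wer(reference, hypothesis)
char_error = cer(reference, hypothesis)
```

Both functions assume a non-empty reference. CER is often the more forgiving of the two, since a single wrong character makes a whole word count as an error under WER.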
[14] Test-Time Scaling of Reasoning Models for Machine Translation
Zihao Li,Shaoxiong Ji,Jörg Tiedemann
Main category: cs.CL
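TL;DR: The paper studies test-time scaling (TTS) in machine translation (MT). General-purpose reasoning models (RMs) gain little from it in direct translation, but domain-specific fine-tuning or multi-step self-correction makes it effective.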
Details
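Motivation: Test-time scaling works well on math and coding tasks, but its effect on machine translation is unclear. The study explores its practical value there.
Contribution: Systematically evaluates TTS for MT and finds that domain-specific fine-tuning and multi-step self-correction are the keys to improving translation quality.
Method: Evaluates 12 reasoning models in three scenarios (direct translation, forced-reasoning extrapolation, and post-editing) and analyzes the effect of TTS.
Result: For general-purpose models, TTS gives limited and inconsistent benefits, while domain-specific models benefit consistently. Forcing a model to reason beyond its natural stopping point degrades quality, whereas TTS is highly effective in post-editing.
Insight: The value of TTS lies not in single-pass translation with general models but in task-specialized models and multi-step workflows such as self-correction.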
Abstract: Test-time scaling (TTS) has enhanced the performance of Reasoning Models (RMs) on various tasks such as math and coding, yet its efficacy in machine translation (MT) remains underexplored. This paper investigates whether increased inference-time computation improves translation quality. We evaluate 12 RMs across a diverse suite of MT benchmarks spanning multiple domains, examining three scenarios: direct translation, forced-reasoning extrapolation, and post-editing. Our findings show that for general-purpose RMs, TTS provides limited and inconsistent benefits for direct translation, with performance quickly plateauing. However, the effectiveness of TTS is unlocked by domain-specific fine-tuning, which aligns a model’s reasoning process with task requirements, leading to consistent improvements up to an optimal, self-determined reasoning depth. We also find that forcing a model to reason beyond its natural stopping point consistently degrades translation quality. In contrast, TTS proves highly effective in a post-editing context, reliably turning self-correction into a beneficial process. These results indicate that the value of inference-time computation in MT lies not in enhancing single-pass translation with general models, but in targeted applications like multi-step, self-correction workflows and in conjunction with task-specialized models.
[15] Webscale-RL: Automated Data Pipeline for Scaling RL Data to Pretraining Levels
Zhepeng Cen,Haolin Chen,Shiyu Wang,Zuxin Liu,Zhiwei Liu,Ding Zhao,Silvio Savarese,Caiming Xiong,Huan Wang,Weiran Yao
Main category: cs.CL
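TL;DR: The paper proposes Webscale-RL, a data pipeline that converts large-scale pretraining documents into diverse question-answer pairs to relieve the data bottleneck in reinforcement learning. The resulting dataset contains 1.2 million examples, and models trained on it significantly outperform continual pretraining and data refinement baselines.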
Details
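Motivation: Large language models (LLMs) have succeeded through imitation learning on text corpora, but this creates a training-generation gap and limits robust reasoning. Reinforcement learning (RL) can bridge the gap but is held back by a shortage of data at scale.
Contribution: Proposes the Webscale-RL pipeline, which efficiently converts pretraining documents into data usable for RL, and builds the large-scale Webscale-RL dataset.
Method: Develops an automated data pipeline that systematically generates millions of diverse, verifiable question-answer pairs for RL training.
Result: Models trained on the Webscale-RL dataset significantly outperform continual pretraining and other baselines across several benchmarks, and are more data-efficient, matching continual pretraining with up to 100x fewer tokens.
Insight: The work offers a viable path for scaling RL to pretraining levels, which may lead to more efficient and more capable language models.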
Abstract: Large Language Models (LLMs) have achieved remarkable success through imitation learning on vast text corpora, but this paradigm creates a training-generation gap and limits robust reasoning. Reinforcement learning (RL) offers a more data-efficient solution capable of bridging this gap, yet its application has been constrained by a critical data bottleneck: existing RL datasets are orders of magnitude smaller and less diverse than web-scale pre-training corpora. To address this, we introduce the Webscale-RL pipeline, a scalable data engine that systematically converts large-scale pre-training documents into millions of diverse, verifiable question-answer pairs for RL. Using this pipeline, we construct the Webscale-RL dataset, containing 1.2 million examples across more than 9 domains. Our experiments show that the model trained on this dataset significantly outperforms continual pretraining and strong data refinement baselines across a suite of benchmarks. Notably, RL training with our dataset proves substantially more efficient, achieving the performance of continual pre-training with up to 100$\times$ fewer tokens. Our work presents a viable path toward scaling RL to pre-training levels, enabling more capable and efficient language models.
[16] From Acceleration to Saturation: Scaling Behavior of Bootstrapped Language Model Pretraining
Seng Pei Liew,Takuya Kato
Main category: cs.CL
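TL;DR: The paper studies the efficiency of bootstrapped pretraining (further pretraining that starts from a pretrained base model) and finds that its scaling efficiency falls as the base model is pretrained on more tokens, with the scaling exponent decreasing logarithmically.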
Details
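Motivation: Bootstrapped pretraining (such as continual pretraining or model growth) promises to cut the cost of training language models from scratch, but its effectiveness is unclear, especially when the base model is overtrained.
Contribution: Finds that the scaling efficiency of bootstrapped pretraining decreases logarithmically with the amount of base-model pretraining, and proposes a simple scaling law that describes the relationship.
Method: Empirically studies the scaling behavior of bootstrapped pretraining and analyzes its dependence on the number of tokens in the first and second pretraining stages.
Result: The more extensively the base model is pretrained, the less additional benefit second-stage pretraining provides, revealing an inherent trade-off in multi-stage pretraining strategies.
Insight: The study offers practical guidance for efficient language model training and shows that reusing overtrained models needs careful consideration.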
Abstract: Bootstrapped pretraining, i.e., the reuse of a pretrained base model for further pretraining, such as continual pretraining or model growth, is promising at reducing the cost of training language models from scratch. However, its effectiveness remains unclear, especially when applied to overtrained base models. In this work, we empirically study the scaling behavior of bootstrapped pretraining and find that its scaling efficiency diminishes in a predictable manner: The scaling exponent with respect to second-stage pretraining tokens decreases logarithmically with the number of tokens used to pretrain the base model. The joint dependence on first- and second-stage tokens is accurately modeled by a simple scaling law. Such saturation effect reveals a fundamental trade-off in multi-stage pretraining strategies: the more extensively a model is pretrained, the less additional benefit bootstrapping provides. Our findings provide practical insights for efficient language model training and raise important considerations for the reuse of overtrained models.
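Note on [16]: the abstract states that the scaling exponent for second-stage tokens decreases logarithmically with the number of first-stage tokens. The sketch below illustrates only that qualitative statement. The functional form `alpha0 - beta * log(D1)` and every constant are made-up placeholders, not the paper's fitted law, and the function names are hypothetical.

```python
from math import log

def second_stage_exponent(d1_tokens, alpha0=0.9, beta=0.03):
    """Hypothetical exponent for second-stage tokens.

    Decreases logarithmically in the first-stage token count d1_tokens.
    alpha0 and beta are placeholder constants chosen for illustration.
    """
    return alpha0 - beta * log(d1_tokens)

def loss_ratio_after_doubling(d1_tokens):
    """Ratio L(2 * D2) / L(D2) if reducible loss scaled as D2 ** (-alpha(D1)).

    A value closer to 1 means doubling second-stage data helps less.
    """
    return 2.0 ** (-second_stage_exponent(d1_tokens))

# Compare a lightly pretrained base model with a heavily pretrained one.
light_base = loss_ratio_after_doubling(1e9)
heavy_base = loss_ratio_after_doubling(1e12)
```

Under these placeholder constants, the heavily pretrained base model has the smaller exponent, so each doubling of second-stage tokens removes a smaller fraction of the remaining loss, which matches the saturation effect the abstract describes.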
[17] The Algebra of Meaning: Why Machines Need Montague More Than Moore’s Law
Cheonkam Jeong,Sungdo Kim,Jewoo Park
Main category: cs.CL
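TL;DR: The paper argues that current language models commit type errors when handling meaning, and proposes compiling natural-language input into Montague-style logical forms within a neuro-symbolic architecture to support compliance-aware decisions.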
Details
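Motivation: Existing language models are fluent but mishandle types of meaning (descriptive, normative, and legal), which leads to hallucination and opaque compliance outcomes. The paper attributes these problems to missing type-theoretic semantics.
Contribution: 1) Diagnoses hallucination as a type error; 2) proposes a framework bridging Montague semantics and ontologies for business and legal reasoning; 3) designs a production-oriented neuro-symbolic architecture.
Method: Proposes the Savassan architecture, in which neural components extract candidate structures from the input and symbolic components perform type checking and constraint reasoning, producing a cross-jurisdiction compliance mapping.
Result: The paper outlines an evaluation plan using legal reasoning benchmarks and synthetic multi-jurisdiction suites; it does not yet report results.
Insight: Trustworthy autonomy requires compositional typing of meaning, so that a system can distinguish what is described, what is prescribed, and what incurs liability within one unified framework.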
Abstract: Contemporary language models are fluent yet routinely mis-handle the types of meaning their outputs entail. We argue that hallucination, brittle moderation, and opaque compliance outcomes are symptoms of missing type-theoretic semantics rather than data or scale limitations. Building on Montague’s view of language as typed, compositional algebra, we recast alignment as a parsing problem: natural-language inputs must be compiled into structures that make explicit their descriptive, normative, and legal dimensions under context. We present Savassan, a neuro-symbolic architecture that compiles utterances into Montague-style logical forms and maps them to typed ontologies extended with deontic operators and jurisdictional contexts. Neural components extract candidate structures from unstructured inputs; symbolic components perform type checking, constraint reasoning, and cross-jurisdiction mapping to produce compliance-aware guidance rather than binary censorship. In cross-border scenarios, the system “parses once” (e.g., defect claim(product x, company y)) and projects the result into multiple legal ontologies (e.g., defamation risk in KR/JP, protected opinion in US, GDPR checks in EU), composing outcomes into a single, explainable decision. This paper contributes: (i) a diagnosis of hallucination as a type error; (ii) a formal Montague-ontology bridge for business/legal reasoning; and (iii) a production-oriented design that embeds typed interfaces across the pipeline. We outline an evaluation plan using legal reasoning benchmarks and synthetic multi-jurisdiction suites. Our position is that trustworthy autonomy requires compositional typing of meaning, enabling systems to reason about what is described, what is prescribed, and what incurs liability within a unified algebra of meaning.
[18] TinyScientist: An Interactive, Extensible, and Controllable Framework for Building Research Agents
Haofei Yu,Keyang Xuan,Fenghai Li,Kunlun Zhu,Zijie Lei,Jiaxun Zhang,Ziheng Qi,Kyle Richardson,Jiaxuan You
Main category: cs.CL
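TL;DR: TinyScientist is an interactive, extensible, and controllable framework for building research agents. It simplifies the development and maintenance of automatic research workflows through an open-source codebase, an interactive web demonstration, and a PyPI package.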
Details
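Motivation: As LLMs are used more for automatic research, the complexity of multi-agent systems, planning, and tool use makes research workflows hard to extend and maintain. TinyScientist addresses this with a flexible, easily extended framework.
Contribution: 1. Proposes an interactive, extensible, and controllable framework that supports integrating new tools and iterative development. 2. Provides an open-source codebase, an interactive web demonstration, and a PyPI package for researchers and developers.
Method: The framework identifies the essential components of the automatic research workflow and achieves interactivity and extensibility through modular design.
Result: With its open-source tools and package, the framework adapts readily to state-of-the-art auto-research pipelines and applies broadly to research and development settings.
Insight: Modularity and extensibility are key to simplifying complex research workflows, and interactive design can make collaboration between users and agents more efficient.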
Abstract: Automatic research with Large Language Models (LLMs) is rapidly gaining importance, driving the development of increasingly complex workflows involving multi-agent systems, planning, tool usage, code execution, and human-agent interaction to accelerate research processes. However, as more researchers and developers begin to use and build upon these tools and platforms, the complexity and difficulty of extending and maintaining such agentic workflows have become a significant challenge, particularly as algorithms and architectures continue to advance. To address this growing complexity, TinyScientist identifies the essential components of the automatic research workflow and proposes an interactive, extensible, and controllable framework that easily adapts to new tools and supports iterative growth. We provide an open-source codebase, an interactive web demonstration, and a PyPI Python package to make state-of-the-art auto-research pipelines broadly accessible to every researcher and developer.
[19] Do Internal Layers of LLMs Reveal Patterns for Jailbreak Detection?
Sri Durga Sai Sowmya Kadali,Evangelos E. Papalexakis
Main category: cs.CL
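TL;DR: This paper investigates whether the internal layers of large language models (LLMs) reveal patterns useful for jailbreak detection. Analyzing the internal representations of GPT-J and Mamba2, it finds that jailbreak and benign prompts differ in the hidden layers.
Details
Motivation: As conversational LLMs become widespread, jailbreak attacks (carefully engineered prompts that elicit restricted content) have become a pressing problem. Existing defenses cannot fully resist novel attacks, so studying internal layer dynamics may offer a new angle.
Contribution: A layer-wise analysis of how LLM internals respond to jailbreak versus benign prompts, laying groundwork for jailbreak detection based on internal dynamics.
Method: For the open-source LLM GPT-J and the state-space model Mamba2, hidden-layer activation patterns are extracted and compared across prompt types.
Result: Preliminary results indicate that jailbreak prompts behave distinctly from benign prompts in specific layers, suggesting that internal layer dynamics could be used to detect jailbreak attacks.
Insight: Activation patterns in the internal layers of LLMs may carry signatures of jailbreak behavior, which opens the door to defenses that do not depend on external rules.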
Abstract: Jailbreaking large language models (LLMs) has emerged as a pressing concern with the increasing prevalence and accessibility of conversational LLMs. Adversarial users often exploit these models through carefully engineered prompts to elicit restricted or sensitive outputs, a strategy widely referred to as jailbreaking. While numerous defense mechanisms have been proposed, attackers continuously develop novel prompting techniques, and no existing model can be considered fully resistant. In this study, we investigate the jailbreak phenomenon by examining the internal representations of LLMs, with a focus on how hidden layers respond to jailbreak versus benign prompts. Specifically, we analyze the open-source LLM GPT-J and the state-space model Mamba2, presenting preliminary findings that highlight distinct layer-wise behaviors. Our results suggest promising directions for further research on leveraging internal model dynamics for robust jailbreak detection and defense.
[20] A Comparative Analysis of Contextual Representation Flow in State-Space and Transformer Architectures
Nhat M. Hoang,Do Xuan Long,Cong-Duy Nguyen,Min-Yen Kan,Luu Anh Tuan
Main category: cs.CL
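TL;DR: This paper presents the first unified token- and layer-level analysis of contextual information flow in State Space Models (SSMs) and Transformer-Based Models (TBMs), revealing key differences in representation propagation and the mechanisms behind them.
Details
Motivation: SSMs are efficient alternatives to TBMs for long sequences, offering linear scaling and lower memory use, yet how contextual information flows through the two architectures remains understudied. The paper aims to fill this gap.
Contribution: A unified token- and layer-level analysis framework that characterizes representation propagation in SSMs and TBMs and exposes key differences in how each homogenizes representations and preserves diversity.
Method: Centered kernel alignment, stability metrics, and probing are used to analyze how representations evolve within and across layers. Theoretical analysis and parameter randomization are then used to trace the origin of the differences.
Result: TBMs rapidly homogenize token representations in early layers, with diversity reemerging only in later layers, whereas SSMs preserve token uniqueness early and converge toward homogenization in deeper layers. Oversmoothing in TBMs stems from architectural design, while in SSMs it arises mainly from training dynamics.
Insight: These findings clarify the inductive biases of both architectures and inform the design and training of future models for long-context reasoning.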
Abstract: State Space Models (SSMs) have recently emerged as efficient alternatives to Transformer-Based Models (TBMs) for long-sequence processing, offering linear scaling and lower memory use. Yet, how contextual information flows across layers and tokens in these architectures remains understudied. We present the first unified, token- and layer-level analysis of representation propagation in SSMs and TBMs. Using centered kernel alignment, stability metrics, and probing, we characterize how representations evolve within and across layers. We find a key divergence: TBMs rapidly homogenize token representations, with diversity reemerging only in later layers, while SSMs preserve token uniqueness early but converge to homogenization deeper. Theoretical analysis and parameter randomization further reveal that oversmoothing in TBMs stems from architectural design, whereas in SSMs it arises mainly from training dynamics. These insights clarify the inductive biases of both architectures and inform future model and training designs for long-context reasoning.
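Centered kernel alignment, one of the tools named in the abstract, can be sketched in a few lines. The version below is the common linear variant in plain Python; the function names and the toy activation matrices are illustrative assumptions, not the authors' code, and the paper may use a different kernel or estimator.

```python
import math

def center_columns(matrix):
    """Subtract each column's mean (rows = tokens, columns = features)."""
    n = len(matrix)
    means = [sum(col) / n for col in zip(*matrix)]
    return [[v - m for v, m in zip(row, means)] for row in matrix]

def cross_cov_sq_norm(a, b):
    """Squared Frobenius norm of a^T b for two row-aligned matrices."""
    total = 0.0
    for col_a in zip(*a):
        for col_b in zip(*b):
            dot = sum(x * y for x, y in zip(col_a, col_b))
            total += dot * dot
    return total

def linear_cka(x, y):
    """Linear CKA between two sets of representations of the same tokens."""
    xc, yc = center_columns(x), center_columns(y)
    numerator = cross_cov_sq_norm(xc, yc)
    denominator = math.sqrt(cross_cov_sq_norm(xc, xc) * cross_cov_sq_norm(yc, yc))
    return numerator / denominator

# Toy "layer activations": 4 tokens x 2 features (hypothetical values).
layer_a = [[1.0, 2.0], [2.0, 0.5], [3.0, 1.5], [0.0, 4.0]]
# The same representation with features swapped and rescaled.
layer_b = [[3 * row[1], 3 * row[0]] for row in layer_a]
# A mostly collapsed representation: three of four tokens look identical.
layer_c = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0], [5.0, 1.0]]

print(linear_cka(layer_a, layer_b))  # invariant to permutation and scaling
print(linear_cka(layer_a, layer_c))  # lower similarity to the collapsed layer
```

Comparing a layer against every other layer with such a score gives a similarity map in which homogenization shows up as blocks of high similarity.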
[21] Aligning Large Language Models via Fully Self-Synthetic Data
Shangjian Yin,Zhepei Wei,Xinyu Zhu,Wei-Lin Chen,Yu Meng
Main category: cs.CL
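TL;DR: The paper proposes SAO, a fully self-synthetic data framework for aligning large language models. It needs no costly human annotation or external reward model: the model generates its own data and preferences, improving chat capability while maintaining downstream task performance.
Details
Motivation: Conventional RLHF and RLAIF depend on expensive annotation by humans or external models, which limits large-scale use. The paper aims to cut this cost with fully self-generated data while preserving model performance.
Contribution: The Self-Alignment Optimization (SAO) framework, in which the model itself generates all training data (prompts, responses, and preferences) for low-cost LLM alignment.
Method: SAO has two steps: 1) the model generates diverse prompts and responses through persona role-play; 2) the model evaluates these itself to produce preference data for preference optimization.
Result: Experiments show that SAO improves chat capability on the AlpacaEval 2.0 benchmark while maintaining strong performance on downstream tasks such as question answering and math reasoning.
Insight: Alignment from fully self-generated data lowers cost and offers a practical route to LLM self-improvement.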
Abstract: Traditional reinforcement learning from human feedback (RLHF) for large language models (LLMs) relies on expensive human-annotated datasets, while Reinforcement Learning from AI Feedback (RLAIF) also incurs significant costs, requiring the collection of diverse prompts and corresponding responses, often necessitating external reward models or proprietary models like GPT-4 to annotate preference pairs. In this work, we introduce Self-Alignment Optimization (SAO), a fully self-synthetic framework for LLM alignment, where all training data, including prompts (i.e., user queries), responses, and preferences, are generated by the model itself. Specifically, SAO first instructs the LLM to engage in persona role-play and generate diverse prompts and responses, which are then self-evaluated for preference optimization. Extensive experiments demonstrate that SAO effectively enhances the model’s chat capabilities on standard benchmarks like AlpacaEval 2.0, while maintaining strong performance on downstream objective tasks (e.g., question-answering, math reasoning). Our work provides a practical solution for self-improvement in aligning LLMs, and the code for reproducing our results is available at: https://github.com/SJY8460/SAO.
[22] ToolMem: Enhancing Multimodal Agents with Learnable Tool Capability Memory
Yunzhong Xiao,Yangmin Li,Hewei Wang,Yunlong Tang,Zora Zhiruo Wang
Main category: cs.CL
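TL;DR: ToolMem augments multimodal agents with a learnable memory of tool capabilities. The agent summarizes each tool's strengths and weaknesses, stores them in memory, and retrieves them at inference to pick the most suitable tool, which improves task accuracy.
Details
Motivation: Existing agents typically rely on fixed tools and cannot flexibly choose the best one for a task. Inspired by how humans learn what tools can do by interacting with them, the paper proposes ToolMem so that agents can improve tool selection through a learned capability memory.
Contribution: 1. ToolMem, which strengthens an agent's tool selection by learning a memory of tool capabilities. 2. An evaluation on text generation and text-to-image generation tools showing more accurate tool selection and performance prediction.
Method: 1. The agent learns tool capabilities from interactions, summarizing strengths and weaknesses and storing them in memory (ToolMem). 2. At inference, the agent retrieves relevant entries from ToolMem and selects the most suitable tool for the task.
Result: Compared with generic agents without memory, ToolMem-augmented agents predict tool performance 14.8% and 28.7% more accurately in text and multimodal generation scenarios, and improve optimal tool selection among multiple candidates by 21% and 24% absolute, respectively.
Insight: A learned memory of tool capabilities lets agents adapt to different task requirements, and this kind of dynamic tool selection points toward more capable multimodal agents.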
Abstract: Agents utilizing tools powered by large language models (LLMs) or vision-language models (VLMs) have demonstrated remarkable progress in diverse tasks across text and visual modalities. Unlike traditional tools such as calculators, which give deterministic outputs, neural tools perform uncertainly across task scenarios. While different tools for a task may excel in varied scenarios, existing agents typically rely on fixed tools, thus limiting the flexibility in selecting the most suitable tool for specific tasks. In contrast, humans snowball their understanding of the capabilities of different tools by interacting with them, and apply this knowledge to select the optimal tool when solving a future task. To build agents that similarly benefit from this process, we propose ToolMem that enables agents to develop memories of tool capabilities from previous interactions, by summarizing their strengths and weaknesses and storing them in memory; at inference, the agent can retrieve relevant entries from ToolMem, and select the best tool to solve individual tasks more accurately. We evaluate ToolMem on learning varied text generation and text-to-image generation neural tools. Compared to no-memory, generic agents, we find ToolMem-augmented agents predict tool performance 14.8% and 28.7% more accurately across text and multimodal generation scenarios. Moreover, ToolMem facilitates optimal tool selection among multiple choices by 21% and 24% absolute increases in respective scenarios.
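The store-then-retrieve loop described above can be illustrated with a toy memory. This is only a sketch under stated assumptions: the `ToolMemory` class, the word-overlap retrieval, and the tool names and scores are all hypothetical stand-ins, whereas the paper stores natural-language summaries of strengths and weaknesses.

```python
from collections import defaultdict

class ToolMemory:
    """Toy capability memory: (tool, task description, observed score)."""

    def __init__(self):
        self.entries = []

    def add(self, tool, task, score):
        self.entries.append((tool, set(task.lower().split()), score))

    def select(self, task, default=None):
        """Pick the tool with the best similarity-weighted past score."""
        query = set(task.lower().split())
        totals = defaultdict(lambda: [0.0, 0.0])  # [weighted score, weight]
        for tool, words, score in self.entries:
            overlap = len(query & words) / len(query | words)  # Jaccard
            if overlap > 0:
                totals[tool][0] += overlap * score
                totals[tool][1] += overlap
        if not totals:
            return default  # nothing relevant remembered
        return max(totals, key=lambda t: totals[t][0] / totals[t][1])

memory = ToolMemory()
# Hypothetical past interactions with two text-to-image tools.
memory.add("model_a", "render text inside image", 0.2)
memory.add("model_b", "render text inside image", 0.9)
memory.add("model_a", "photorealistic landscape", 0.95)
memory.add("model_b", "photorealistic landscape", 0.4)

print(memory.select("image with rendered text"))
print(memory.select("a photorealistic mountain landscape"))
```

The point of the design is that the choice depends on the task: the same two tools are ranked differently for different requests.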
[23] PIKA: Expert-Level Synthetic Datasets for Post-Training Alignment from Scratch
Shangjian Yin,Shining Liang,Wenbiao Ding,Yuli Qian,Zhouxing Shi,Hongzhi Li,Yutao Xie
Main category: cs.CL
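TL;DR: PiKa is a data-efficient family of expert-level alignment datasets. With only 30k SFT examples it outperforms existing approaches that need far more data, lowering the cost and barrier of aligning open-source LLMs.
Details
Motivation: Most existing alignment datasets are either private or require costly human annotation, which limits reproducibility and scalability. Data quality concerns remain even with RLAIF, and it is unclear how much data is needed to fine-tune a base model into a strong instruction-following model.
Contribution: The PiKa family of datasets, in particular PiKa-SFT, which with only 30k SFT examples outperforms much larger datasets and shows that data-efficient alignment is feasible.
Method: The authors build the small, high-quality PiKa-SFT dataset and validate it by fine-tuning models such as Llama-3-8B-Base on PiKa and on other public datasets.
Result: On AlpacaEval 2.0 and Arena-Hard, fine-tuning on PiKa-SFT even surpasses the official Llama-3-8B-Instruct model, which was trained on over 10 million proprietary examples.
Insight: High-quality alignment can be achieved with a small dataset, offering a scalable path for aligning open-source LLMs.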
Abstract: Reinforcement Learning from Human Feedback (RLHF) has become a cornerstone for aligning large language models (LLMs). However, its effectiveness depends on high-quality instruction data. Most existing alignment datasets are either private or require costly human annotation, which limits reproducibility and scalability. Even with Reinforcement Learning from AI Feedback (RLAIF), concerns about data quality remain. Moreover, it is unclear how much data is actually required to fine-tune a base model into a strong instruction-following model. Current approaches often rely on over 300k examples even at the supervised fine-tuning (SFT) stage, yet they still underperform compared to proprietary models, creating barriers for academic and resource-limited communities. To address this gap, we introduce PiKa, a data-efficient family of expert-level alignment datasets. In particular, the PiKa-SFT dataset uses only 30k SFT examples, far fewer than state-of-the-art datasets like Magpie. Through evaluations by fine-tuning Llama-3-8B-Base on PiKa and other public datasets, we show that PiKa-SFT outperforms models trained on much larger data. On AlpacaEval 2.0 and Arena-Hard benchmarks, PiKa-SFT fine-tuning even surpasses the official Llama-3-8B-Instruct model trained on over 10 million proprietary examples. We further extend our study by training the Qwen2.5 series (0.5B to 7B) on PiKa-SFT, achieving consistent gains. These findings demonstrate that high-quality alignment can be achieved with significantly less data, offering a scalable path for open-source LLM alignment. Code and data: https://github.com/SJY8460/PiKa.
[24] Incremental Summarization for Customer Support via Progressive Note-Taking and Agent Feedback
Yisha Wu,Cen Zhao,Yuanpei Cao,Xiaoqing Su,Yashar Mehdad,Mindy Ji,Claire Na Cheng
Main category: cs.CL
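TL;DR: This paper presents an incremental summarization system that generates concise notes during customer support conversations, reducing agents' context switching and redundant review. It combines a Mixtral-8x7B model with a DeBERTa classifier and is refined through agent edit feedback.
Details
Motivation: Support agents handling complex conversations must switch context frequently, and conventional bulk summarization is inefficient and redundant. An incremental system is needed that produces concise notes in real time and keeps improving summary quality.
Contribution: (1) An incremental summarization system combining a Mixtral-8x7B model with a DeBERTa classifier; (2) agent edit feedback that refines online note generation and informs offline retraining; (3) validation of the system in a real production deployment.
Method: (1) A fine-tuned Mixtral-8x7B model generates notes continuously; (2) a DeBERTa-based classifier filters trivial content; (3) agent edits refine note generation and model training.
Result: In production, the system reduced case handling time by 3% compared with bulk summarization (up to 9% in highly complex cases), alongside high agent satisfaction ratings in surveys.
Insight: Incremental summarization combined with continuous feedback improves agent productivity and summary quality, especially in complex conversations.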
Abstract: We introduce an incremental summarization system for customer support agents that intelligently determines when to generate concise bullet notes during conversations, reducing agents’ context-switching effort and redundant review. Our approach combines a fine-tuned Mixtral-8x7B model for continuous note generation with a DeBERTa-based classifier to filter trivial content. Agent edits refine the online notes generation and regularly inform offline model retraining, closing the agent edits feedback loop. Deployed in production, our system achieved a 3% reduction in case handling time compared to bulk summarization (with reductions of up to 9% in highly complex cases), alongside high agent satisfaction ratings from surveys. These results demonstrate that incremental summarization with continuous feedback effectively enhances summary quality and agent productivity at scale.
[25] Learning to Rewrite Prompts for Bootstrapping LLMs on Downstream Tasks
Qinhao Zhou,Xiang Xiang,Kun He,John E. Hopcroft
Main category: cs.CL
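TL;DR: The paper proposes a prompt optimization method for machine translation that uses a small-parameter model and a back-translation-based strategy to reduce training overhead while delivering strong performance.
Details
Motivation: Existing prompt engineering methods mainly optimize the instruction component, which limits their applicability to machine translation, where the input component matters most. A dedicated optimization method is therefore needed.
Contribution: A prompt optimization method for machine translation based on a small-parameter model trained with back-translation, which reduces training overhead and can be extended to other tasks with some adaptation.
Method: A small-parameter model is trained with a back-translation-based strategy to rewrite the input component of machine translation prompts.
Result: The method delivers highly effective performance on machine translation while reducing training overhead.
Insight: Prompt engineering should attend to the input component as well as the instruction, especially for tasks like machine translation; a small-parameter model combined with back-translation is an efficient way to do so.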
Abstract: In recent years, the growing interest in Large Language Models (LLMs) has significantly advanced prompt engineering, transitioning from manual design to model-based optimization. Prompts for LLMs generally comprise two components: the instruction, which defines the task or objective, and the input, which is tailored to the instruction type. In natural language generation (NLG) tasks such as machine translation, the input component is particularly critical, while the instruction component tends to be concise. Existing prompt engineering methods primarily focus on optimizing the instruction component for general tasks, often requiring large-parameter LLMs as auxiliary tools. However, these approaches exhibit limited applicability for tasks like machine translation, where the input component plays a more pivotal role. To address this limitation, this paper introduces a novel prompt optimization method specifically designed for machine translation tasks. The proposed approach employs a small-parameter model trained using a back-translation-based strategy, significantly reducing training overhead for single-task optimization while delivering highly effective performance. With certain adaptations, this method can also be extended to other downstream tasks.
[26] How Language Models Conflate Logical Validity with Plausibility: A Representational Analysis of Content Effects
Leonardo Bertolazzi,Sandro Pezzelle,Raffaella Bernardi
Main category: cs.CL
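TL;DR: This paper examines how large language models (LLMs) conflate logical validity with semantic plausibility. Both concepts are linearly represented and strongly aligned in the models' representational geometry, which leads models to mistake plausibility for validity; intervening on the representations reduces the bias.
Details
Motivation: Both humans and LLMs exhibit content effects, in which the plausibility of a reasoning problem's semantic content influences judgments of its logical validity. In humans this is explained by dual-process theory, but the mechanism in LLMs is unclear, so the paper studies how internal representations encode the two concepts.
Contribution: 1) Shows that validity and plausibility are linearly represented and geometrically aligned in LLMs; 2) demonstrates that plausibility and validity vectors causally influence each other; 3) proposes a representational intervention that reduces content effects and improves reasoning accuracy.
Method: The authors analyze the internal representations of LLMs, use steering vectors to test the causal relationship between validity and plausibility, and construct debiasing vectors that disentangle the two concepts.
Result: 1) Validity and plausibility are tightly aligned inside LLMs; 2) the degree of alignment predicts the magnitude of behavioral content effects across models; 3) debiasing vectors reduce content effects and improve logical reasoning accuracy.
Insight: Semantic information interferes with how LLMs represent logical concepts, which can bias reasoning. Representational interventions can mitigate this and suggest a path toward more logical systems.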
Abstract: Both humans and large language models (LLMs) exhibit content effects: biases in which the plausibility of the semantic content of a reasoning problem influences judgments regarding its logical validity. While this phenomenon in humans is best explained by the dual-process theory of reasoning, the mechanisms behind content effects in LLMs remain unclear. In this work, we address this issue by investigating how LLMs encode the concepts of validity and plausibility within their internal representations. We show that both concepts are linearly represented and strongly aligned in representational geometry, leading models to conflate plausibility with validity. Using steering vectors, we demonstrate that plausibility vectors can causally bias validity judgements, and vice versa, and that the degree of alignment between these two concepts predicts the magnitude of behavioral content effects across models. Finally, we construct debiasing vectors that disentangle these concepts, reducing content effects and improving reasoning accuracy. Our findings advance understanding of how abstract logical concepts are represented in LLMs and highlight representational interventions as a path toward more logical systems.
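A minimal sketch of the geometry involved, assuming a difference-of-means construction for concept directions and a simple projection for debiasing; both are common choices but are assumptions here, since the abstract does not specify how the steering or debiasing vectors are built. The 3-dimensional "hidden states" are made-up values.

```python
import math

def mean_vector(rows):
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def difference_of_means(positive, negative):
    """Concept direction: mean(positive states) - mean(negative states)."""
    return [p - q for p, q in zip(mean_vector(positive), mean_vector(negative))]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    return dot(u, v) / math.sqrt(dot(u, u) * dot(v, v))

def project_out(hidden, direction):
    """Remove the component of a hidden state along a direction."""
    scale = dot(hidden, direction) / dot(direction, direction)
    return [h - scale * d for h, d in zip(hidden, direction)]

# Hypothetical hidden states for labeled reasoning problems.
valid = [[2.0, 1.0, 0.0], [2.2, 0.8, 0.1]]
invalid = [[0.0, 0.0, 0.0], [0.2, -0.2, 0.1]]
plausible = [[1.0, 2.0, 0.0], [1.2, 1.8, 0.2]]
implausible = [[0.0, 0.0, 0.1], [0.0, 0.2, 0.1]]

validity_dir = difference_of_means(valid, invalid)
plausibility_dir = difference_of_means(plausible, implausible)

# High cosine similarity means the two concepts are hard to tell apart.
alignment = cosine(validity_dir, plausibility_dir)
print(round(alignment, 3))

# One possible debiasing step: strip the plausibility component.
debiased = project_out([1.5, 1.5, 0.3], plausibility_dir)
```

Adding a multiple of `plausibility_dir` to a hidden state would be the corresponding steering intervention.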
[27] Scaling LLM Multi-turn RL with End-to-end Summarization-based Context Management
Miao Lu,Weiwei Sun,Weihua Du,Zhan Ling,Xuesong Yao,Kang Liu,Jiecao Chen
Main category: cs.CL
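TL;DR: The paper proposes summarization-based context management (SUPO), which compresses tool-use history with LLM-generated summaries to get past a fixed context window and significantly improves success rates on multi-turn tool-use tasks.
Details
Motivation: Existing RL pipelines for long-horizon multi-turn tool use are constrained by context length and can suffer from degraded instruction following and excessive rollout costs.
Contribution: SUPO, an LLM RL algorithm that optimizes summarization strategies and tool-use behaviors end to end, enabling RL training beyond a fixed context window.
Method: A summarization mechanism periodically compresses the tool-use history while retaining task-relevant information. A derived policy gradient representation lets standard LLM RL infrastructure jointly optimize tool use and summarization end to end.
Result: On interactive function calling and search tasks, SUPO significantly improves the success rate while keeping the working context length the same as or lower than the baselines.
Insight: Summarization offers a scalable way around fixed context limits. For complex search tasks, raising the maximum number of summarization rounds at test time beyond the training setting yields further gains.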
Abstract: We study reinforcement learning (RL) fine-tuning of large language model (LLM) agents for long-horizon multi-turn tool use, where context length quickly becomes a fundamental bottleneck. Existing RL pipelines can suffer from degraded instruction following, excessive rollout costs, and most importantly, strict context limits. To address these challenges, we introduce summarization-based context management to training. Specifically, it periodically compresses the tool-use history into LLM-generated summaries that retain task-relevant information, keeping the context compact while enabling the agent to scale beyond the fixed context window. Building on this formulation, we derive a policy gradient representation that seamlessly enables standard LLM RL infrastructures to optimize both tool-use behaviors and summarization strategies in an end-to-end fashion. We instantiate this framework with SUmmarization augmented Policy Optimization (SUPO), an LLM RL algorithm that enables long-horizon training beyond a fixed context limit. Experiments on interactive function calling and searching tasks demonstrate that SUPO significantly improves the success rate while maintaining the same or even lower working context length compared to baselines. We also demonstrate that for complex searching tasks, SUPO can further improve the evaluation performance when scaling the test-time maximum number of summarization rounds beyond that of training time. Our results establish summarization-based context management as a principled and scalable approach for training RL agents beyond a fixed context length limit.
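The rollout-side mechanics of summarization-based context management can be sketched as a loop that compresses history whenever a token budget is exceeded. Everything here is a simplified assumption: `count_tokens` is a whitespace counter standing in for a tokenizer, `toy_summarize` stands in for the LLM-written summary, and the trigger rule is hypothetical. In the paper the summary is produced by the policy itself and trained end to end, which this sketch does not model.

```python
def count_tokens(text):
    """Crude whitespace token count, standing in for a real tokenizer."""
    return len(text.split())

def toy_summarize(entries, max_words=4):
    """Placeholder for an LLM-written summary: keep only the last few words."""
    words = " ".join(entries).split()
    return "summary: " + " ".join(words[-max_words:])

def run_with_summaries(task, observations, budget, summarize=toy_summarize):
    """Append tool observations; compress history when over budget."""
    context = [task]
    num_summaries = 0
    for obs in observations:
        context.append(obs)
        if sum(count_tokens(c) for c in context) > budget:
            # Replace everything after the task prompt with one summary.
            context = [task, summarize(context[1:])]
            num_summaries += 1
    return context, num_summaries

task = "find the capital of France"
observations = [
    "search returned: France is in Europe",
    "lookup returned: the capital is Paris",
    "verify returned: Paris confirmed as capital",
]
context, num_summaries = run_with_summaries(task, observations, budget=12)
print(context, num_summaries)
```

Because the working context is reset to the task plus one summary, its length stays bounded no matter how many tool calls the episode contains.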
[28] AWM: Accurate Weight-Matrix Fingerprint for Large Language Models
Boyi Zeng,Lin Chen,Ziwei He,Xinbing Wang,Zhouhan Lin
Main category: cs.CL
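TL;DR: The paper proposes AWM, a training-free fingerprinting method based on weight matrices that determines whether a large language model (LLM) is derived from an existing base model, addressing the difficulty that post-training poses for model identification.
Details
Motivation: Training LLMs is expensive, so protecting their intellectual property is crucial. Common post-training operations such as fine-tuning and pruning, however, make reliable identification very challenging.
Contribution: A weight-matrix fingerprint that combines the Linear Assignment Problem (LAP) with an unbiased Centered Kernel Alignment (CKA) similarity, yielding a robust, high-fidelity similarity metric for model identification.
Method: LAP and unbiased CKA similarity are combined to neutralize the effects of parameter manipulations and produce a highly robust fingerprint.
Result: The method achieves perfect scores on all classification metrics over 150 model pairs (60 positive, 90 negative), and the entire computation completes within 30 s on an NVIDIA 3090 GPU.
Insight: Training-free fingerprinting provides a reliable tool for protecting model intellectual property, and it holds up well under heavy post-training.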
Abstract: Protecting the intellectual property of large language models (LLMs) is crucial, given the substantial resources required for their training. Consequently, there is an urgent need for both model owners and third parties to determine whether a suspect LLM is trained from scratch or derived from an existing base model. However, the intensive post-training processes that models typically undergo, such as supervised fine-tuning, extensive continued pretraining, reinforcement learning, multi-modal extension, pruning, and upcycling, pose significant challenges to reliable identification. In this work, we propose a training-free fingerprinting method based on weight matrices. We leverage the Linear Assignment Problem (LAP) and an unbiased Centered Kernel Alignment (CKA) similarity to neutralize the effects of parameter manipulations, yielding a highly robust and high-fidelity similarity metric. On a comprehensive testbed of 60 positive and 90 negative model pairs, our method demonstrates exceptional robustness against all six aforementioned post-training categories while exhibiting a near-zero risk of false positives. By achieving perfect scores on all classification metrics, our approach establishes a strong basis for reliable model lineage verification. Moreover, the entire computation completes within 30s on an NVIDIA 3090 GPU. The code is available at https://github.com/LUMIA-Group/AWM.
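To illustrate why a linear assignment step helps against parameter manipulation, the sketch below undoes a row permutation of a tiny weight matrix. It is not the AWM procedure: the dot-product similarity, the brute-force search over permutations, and the toy matrices are assumptions chosen to keep the example dependency-free. For realistic sizes one would use a proper solver such as `scipy.optimize.linear_sum_assignment`.

```python
from itertools import permutations

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def best_assignment(base_rows, suspect_rows):
    """Brute-force LAP: match each base row to a distinct suspect row,
    maximizing the total similarity. Only feasible for tiny matrices."""
    n = len(base_rows)
    sim = [[dot(b, s) for s in suspect_rows] for b in base_rows]
    return max(
        permutations(range(n)),
        key=lambda p: sum(sim[i][p[i]] for i in range(n)),
    )

# Hypothetical base weights: 3 hidden units x 3 inputs.
base = [[1.0, 0.0, 2.0], [0.0, 3.0, 1.0], [2.0, 1.0, 0.0]]
# Suspect model: same units, shuffled and slightly perturbed.
shuffle = (2, 0, 1)  # suspect row j holds base row shuffle[j]
suspect = [[w + 0.01 for w in base[i]] for i in shuffle]

match = best_assignment(base, suspect)
print(match)  # match[i] is the suspect row assigned to base row i

# After re-ordering, the suspect weights line up with the base again,
# so a similarity score such as CKA can be computed on aligned matrices.
aligned = [suspect[j] for j in match]
```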
[29] Gold-Switch: Training-Free Superposition of Slow- and Fast- Thinking LLMs
Jaeseong Lee,Dayoung Kwon,seung-won hwang
Main category: cs.CL
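TL;DR: Gold-Switch is a training-free superposed deployment strategy that uses lightweight regulation to selectively switch off a large reasoning model's reasoning, avoiding the resources wasted on overthinking.
Details
Motivation: Large Reasoning Models (LRMs) excel at structured tasks but are prone to overthinking, which degrades performance and wastes resources. The baseline of deploying both an LLM and an LRM and routing between them can be costly or impractical.
Contribution: A training-free superposed deployment strategy that selectively unlearns from the LRM at inference, scaling down computation while preserving reasoning ability.
Method: By analyzing the cumulative energy of singular values, the method identifies low-rank projections that adjust how much reasoning is applied at inference.
Result: The method reduces the overhead of overthinking while preserving the model's reasoning ability.
Insight: Selectively switching off a model's reasoning is an efficient way to save computation while preserving reasoning ability.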
Abstract: Large Reasoning Models (LRMs) excel in structured tasks by emulating deliberate human reasoning but often suffer from overthinking, degrading performance and wasting resources. One possible baseline is to deploy both LLM and LRM, then route input by predicting whether it requires reasoning and may cause overthinking. However, deploying multiple models can be costly or impractical. We propose a superposed deployment strategy with a lightweight, training-free regulation to optimize inference by switching one model on and off. Instead of routing, we selectively unlearn from LRM at inference, scaling down computation while preserving reasoning. By analyzing the cumulative energy of singular values, we identify optimal low-rank projections to adjust reasoning just right.
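The "cumulative energy of singular values" criterion can be sketched as follows. The definition of energy as the sum of squared singular values and the 90% threshold are assumptions made for illustration; the abstract does not state which definition or cutoff the authors use, and the singular values are made up.

```python
def rank_for_energy(singular_values, threshold=0.9):
    """Smallest rank k whose top-k singular values hold at least
    `threshold` of the total energy (sum of squared singular values)."""
    energies = [s * s for s in sorted(singular_values, reverse=True)]
    total = sum(energies)
    running = 0.0
    for k, energy in enumerate(energies, start=1):
        running += energy
        if running / total >= threshold:
            return k
    return len(energies)

# Hypothetical singular values of a weight-difference matrix.
sigma = [3.0, 2.0, 1.0, 0.5]
print(rank_for_energy(sigma, 0.6))   # 1: top value holds 9 / 14.25 of the energy
print(rank_for_energy(sigma, 0.9))   # 2: top two hold 13 / 14.25
print(rank_for_energy(sigma, 0.95))  # 3: top three hold 14 / 14.25
```

A low-rank projection would then keep only the singular directions up to the returned rank.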
[30] Adaptive LLM-Symbolic Reasoning via Dynamic Logical Solver Composition
Lei Xu,Pierre Beckmann,Marco Valentino,André Freitas
Main category: cs.CL
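TL;DR: The paper proposes an adaptive LLM-symbolic reasoning framework based on dynamic logical solver composition. It automatically identifies the formal reasoning strategy a natural-language problem requires and dynamically selects a specialized logical solver, substantially improving reasoning performance.
Details
Motivation: Existing neuro-symbolic NLP methods are mostly static, with the target solver fixed at design time, which limits the use of diverse formal inference strategies. An adaptive framework is needed to integrate language models and logical solvers dynamically.
Contribution: An adaptive, multi-paradigm neuro-symbolic inference framework with automatic strategy identification and dynamic solver selection, which clearly outperforms baselines in experiments.
Method: An LLM predicts the required reasoning strategy, and specialized formal logical solvers are then selected and applied dynamically through autoformalization interfaces.
Result: LLMs predict the necessary formal reasoning strategy with accuracy above 90 percent. The framework outperforms GPT-4o and DeepSeek-V3.1 baselines by 27 percent and 6 percent, respectively, and adaptive reasoning also improves pure LLM methods, with gains of 10, 5, and 6 percent in zero-shot, CoT, and symbolic CoT settings with GPT-4o.
Insight: Adaptive reasoning improves not only neuro-symbolic methods but also pure LLM methods. Smaller models struggle with adaptive neuro-symbolic reasoning, but post-training offers a viable path to improvement.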
Abstract: Neuro-symbolic NLP methods aim to leverage the complementary strengths of large language models and formal logical solvers. However, current approaches are mostly static in nature, i.e., the integration of a target solver is predetermined at design time, hindering the ability to employ diverse formal inference strategies. To address this, we introduce an adaptive, multi-paradigm, neuro-symbolic inference framework that: (1) automatically identifies formal reasoning strategies from problems expressed in natural language; and (2) dynamically selects and applies specialized formal logical solvers via autoformalization interfaces. Extensive experiments on individual and multi-paradigm reasoning tasks support the following conclusions: LLMs are effective at predicting the necessary formal reasoning strategies with an accuracy above 90 percent. This enables flexible integration with formal logical solvers, resulting in our framework outperforming competing baselines by 27 percent and 6 percent compared to GPT-4o and DeepSeek-V3.1, respectively. Moreover, adaptive reasoning can even positively impact pure LLM methods, yielding gains of 10, 5, and 6 percent on zero-shot, CoT, and symbolic CoT settings with GPT-4o. Finally, although smaller models struggle with adaptive neuro-symbolic reasoning, post-training offers a viable path to improvement. Overall, this work establishes the foundations for adaptive LLM-symbolic reasoning, offering a path forward for unifying material and formal inferences on heterogeneous reasoning challenges.
[31] FURINA: A Fully Customizable Role-Playing Benchmark via Scalable Multi-Agent Collaboration Pipeline
Haotian Wu,Shufan Jiang,Chios Chen,Yiyang Feng,Hehai Lin,Heqing Zou,Yao Shu,Yanran Li,Chengwei Qin
Main category: cs.CL
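TL;DR: FURINA-Builder is a scalable multi-agent collaboration pipeline that automatically constructs fully customizable role-playing (RP) benchmarks, addressing the narrow scope, outdated interaction paradigms, and limited adaptability of existing benchmarks.
Details
Motivation: As large language models (LLMs) advance in role-playing tasks, existing benchmarks quickly become obsolete because of their narrow scope, outdated interaction paradigms, and poor fit to diverse application scenarios.
Contribution: FURINA-Builder, presented as the first benchmark builder in the RP area, which supports evaluating arbitrary characters with dimension-specific evaluation criteria.
Method: FURINA-Builder simulates dialogues between a test character and other characters through a multi-agent pipeline, while an LLM judge selects fine-grained evaluation dimensions and adjusts the test character's responses into final test utterances.
Result: o3 and DeepSeek-R1 perform best on English and Chinese RP tasks, respectively. Reasoning improves RP performance but also increases RP hallucinations.
Insight: Model scale does not monotonically reduce hallucinations. For reasoning LLMs there is a new trade-off: reasoning improves RP performance while increasing hallucinations.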
Abstract: As large language models (LLMs) advance in role-playing (RP) tasks, existing benchmarks quickly become obsolete due to their narrow scope, outdated interaction paradigms, and limited adaptability across diverse application scenarios. To address this gap, we introduce FURINA-Builder, a novel multi-agent collaboration pipeline that automatically constructs fully customizable RP benchmarks at any scale. It enables evaluation of arbitrary characters across diverse scenarios and prompt formats, as the first benchmark builder in RP area for adaptable assessment. FURINA-Builder simulates dialogues between a test character and other characters drawn from a well-constructed character-scene pool, while an LLM judge selects fine-grained evaluation dimensions and adjusts the test character’s responses into final test utterances. Using this pipeline, we build FURINA-Bench, a new comprehensive role-playing benchmark featuring both established and synthesized test characters, each assessed with dimension-specific evaluation criteria. Human evaluation and preliminary separability analysis justify our pipeline and benchmark design. We conduct extensive evaluations of cutting-edge LLMs and find that o3 and DeepSeek-R1 achieve the best performance on English and Chinese RP tasks, respectively. Across all models, established characters consistently outperform synthesized ones, with reasoning capabilities further amplifying this disparity. Interestingly, we observe that model scale does not monotonically reduce hallucinations. More critically, for reasoning LLMs, we uncover a novel trade-off: reasoning improves RP performance but simultaneously increases RP hallucinations. This trade-off extends to a broader Pareto frontier between RP performance and reliability for all LLMs. These findings demonstrate the effectiveness of FURINA-Builder and the challenge posed by FURINA-Bench.
[32] Overview of the Plagiarism Detection Task at PAN 2025
André Greiner-Petter,Maik Fröbe,Jan Philip Wahle,Terry Ruas,Bela Gipp,Akiko Aizawa,Martin Potthast
Main category: cs.CL
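TL;DR: The plagiarism detection task at PAN 2025 focuses on identifying automatically generated plagiarism in scientific articles and evaluates approaches on a newly built large-scale dataset. Current semantic-similarity approaches perform fairly well but generalize poorly.
Details
Motivation: With the spread of large language models, automatically generated plagiarism has become a new challenge. PAN 2025 aims to develop methods that identify such plagiarism and align it with its sources.
Contribution: 1. A large-scale dataset of automatically generated plagiarism built with Llama, DeepSeek-R1, and Mistral. 2. An evaluation of multiple approaches on both the new and the older dataset.
Method: 1. Three large language models generate the plagiarism dataset. 2. Baselines include naive semantic similarity based on embedding vectors. 3. Participants and baselines are compared on the PAN 2025 and PAN 2015 datasets.
Result: Embedding-based approaches reach up to 0.8 recall and 0.5 precision on PAN 2025, but most of them underperform significantly on the 2015 dataset, indicating limited generalizability.
Insight: Current approaches are effective on the new kind of plagiarism but need better generalization; future work may need to draw on more context or additional signals to improve robustness.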
Abstract: The generative plagiarism detection task at PAN 2025 aims at identifying automatically generated textual plagiarism in scientific articles and aligning them with their respective sources. We created a novel large-scale dataset of automatically generated plagiarism using three large language models: Llama, DeepSeek-R1, and Mistral. In this task overview paper, we outline the creation of this dataset, summarize and compare the results of all participants and four baselines, and evaluate the results on the last plagiarism detection task from PAN 2015 in order to interpret the robustness of the proposed approaches. We found that the current iteration does not invite a large variety of approaches as naive semantic similarity approaches based on embedding vectors provide promising results of up to 0.8 recall and 0.5 precision. In contrast, most of these approaches underperform significantly on the 2015 dataset, indicating a lack in generalizability.
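To make the reported numbers concrete, the sketch below scores a naive cosine-threshold detector on toy labeled pairs and then combines the abstract's best-case figures (0.8 recall, 0.5 precision) into an F1 score. The vectors, the 0.6 threshold, and the use of plain F1 are illustrative assumptions; the task's official measure may differ.

```python
import math

def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    return num / math.sqrt(sum(a * a for a in u) * sum(b * b for b in v))

def precision_recall(pairs, threshold):
    """pairs: (embedding_a, embedding_b, is_plagiarism) triples."""
    tp = fp = fn = 0
    for a, b, label in pairs:
        predicted = cosine(a, b) >= threshold
        tp += predicted and label
        fp += predicted and not label
        fn += (not predicted) and label
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def f1(precision, recall):
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0

# Hypothetical 2-d "embeddings" of suspicious/source passage pairs.
pairs = [
    ([1.0, 0.0], [1.0, 0.0], True),   # caught (true positive)
    ([1.0, 1.0], [1.0, 0.0], False),  # similar topic only (false positive)
    ([1.0, 0.0], [0.0, 1.0], True),   # heavy paraphrase missed (false negative)
    ([0.0, 1.0], [1.0, 0.0], False),  # unrelated (true negative)
]
p, r = precision_recall(pairs, threshold=0.6)
print(p, r)

# F1 implied by the abstract's best-case recall and precision.
print(round(f1(0.5, 0.8), 3))  # 0.615
```

A precision of 0.5 means half of the flagged passages are false alarms, which is why high recall alone does not settle the task.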
[33] Adaptive Tool Generation with Models as Tools and Reinforcement Learning
Chenpeng Wang,Xiaojie Cheng,Chunye Wang,Linfeng Yang,Lei Zhang
Main category: cs.CL
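TL;DR: MTR is a simulation-first training framework that uses a multi-agent architecture and reinforcement learning to achieve tool-augmented reasoning without live APIs, with competitive results on multi-hop QA.
Details
Motivation: Existing tool-augmented language models rely on live API access, which creates scalability and reliability problems during training and deployment. MTR addresses these problems through simulation.
Contribution: The MTR framework, comprising a multi-agent architecture (ToolMaker, AutoAgent, and ToolActor) and a two-stage training recipe (SFT and GRPO) that enables tool reasoning without live APIs.
Method: The multi-agent architecture generates tool interfaces and simulated responses. Training has two stages: SFT teaches the "trace grammar" of reasoning sequences, and GRPO optimizes strategy with a composite reward that balances answer correctness and internal consistency.
Result: On four multi-hop QA benchmarks, MTR attains Exact Match scores competitive with live-API systems and excels on reasoning-intensive tasks.
Insight: Effective tool reasoning can be learned from structured traces without live interaction, which suggests a new approach to tool augmentation for language models.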
Abstract: Tool-augmented language models have demonstrated strong capabilities, but their reliance on live API access creates scalability and reliability challenges during training and deployment. We propose MTR, a simulation-first training framework for tool-augmented reasoning. Instead of relying on live APIs, MTR learns from complete ReAct traces with schema-validated, simulated observations. Our approach operates through a multi-agent architecture where a ToolMaker generates task-specific, OpenAI-compatible tool interfaces, an AutoAgent produces structured think-act-observe sequences, and a ToolActor simulates realistic responses. Training proceeds in two stages: Stage-1 Supervised Fine-Tuning (SFT) teaches ‘trace grammar’ from complete reasoning sequences; Stage-2 Group Relative Policy Optimization (GRPO) optimizes strategy with a composite trace reward that balances answer correctness and internal consistency. Across four multi-hop QA benchmarks (HotpotQA, MuSiQue, 2WikiMultiHopQA, Bamboogle), MTR attains competitive Exact Match (EM) scores to live-API systems and excels on reasoning-intensive tasks, suggesting that effective tool reasoning can be learned from structured traces without live interactions.
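The Stage-2 signal can be sketched as a composite reward followed by the group-relative normalization that gives GRPO its name. The 0.7/0.3 weighting, the consistency scores, and the use of the population standard deviation are assumptions for illustration; the abstract only says the reward balances answer correctness and internal consistency.

```python
import statistics

def composite_reward(answer_correct, consistency, weight=0.7):
    """Hypothetical mix of answer correctness (0 or 1) and a trace
    consistency score in [0, 1]."""
    return weight * float(answer_correct) + (1.0 - weight) * consistency

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of rollouts for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Four sampled traces for one question: (correct?, consistency score).
rollouts = [(True, 0.9), (True, 0.5), (False, 0.8), (False, 0.2)]
rewards = [composite_reward(c, s) for c, s in rollouts]
advantages = group_relative_advantages(rewards)
print([round(a, 2) for a in advantages])
```

Traces that beat their group's average get a positive advantage, so no separate value model is needed to provide a baseline.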
[34] Mid-Training of Large Language Models: A Survey
Kaixiang Mo,Yuxin Shi,Weiwei Weng,Zhiqiang Zhou,Shuman Liu,Haibo Zhang,Anxiang Zeng
Main category: cs.CL
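TL;DR: This paper is the first survey to treat mid-training of large language models (LLMs) as a unified paradigm. It proposes a taxonomy spanning data distribution, learning-rate scheduling, and long-context extension, and summarizes practical insights, evaluation benchmarks, and reported gains.
Details
Motivation: Mid-training is widely used in state-of-the-art systems, yet it has not been studied systematically as a unified paradigm. The survey aims to fill this gap and offer theoretical and practical guidance.
Contribution: (1) The first taxonomy of LLM mid-training; (2) a summary of practical insights and reported gains; (3) a compilation of evaluation benchmarks; (4) open challenges and directions for future research.
Method: The survey explains the effectiveness of mid-training through gradient noise scale, the information bottleneck, and curriculum learning, and organizes the field with a taxonomy spanning data distribution, learning-rate scheduling, and long-context extension.
Result: Mid-training mitigates diminishing returns from noisy tokens, stabilizes convergence, and expands model capability in late training.
Insight: The success of mid-training can be explained through gradient noise scale, the information bottleneck, and curriculum learning, which together promote generalization and abstraction.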
Abstract: Large language models (LLMs) are typically developed through large-scale pre-training followed by task-specific fine-tuning. Recent advances highlight the importance of an intermediate mid-training stage, where models undergo multiple annealing-style phases that refine data quality, adapt optimization schedules, and extend context length. This stage mitigates diminishing returns from noisy tokens, stabilizes convergence, and expands model capability in late training. Its effectiveness can be explained through gradient noise scale, the information bottleneck, and curriculum learning, which together promote generalization and abstraction. Despite widespread use in state-of-the-art systems, there has been no prior survey of mid-training as a unified paradigm. We introduce the first taxonomy of LLM mid-training spanning data distribution, learning-rate scheduling, and long-context extension. We distill practical insights, compile evaluation benchmarks, and report gains to enable structured comparisons across models. We also identify open challenges and propose avenues for future research and practice.
[35] SID: Multi-LLM Debate Driven by Self Signals
Xuhang Chen,Zhifan Song,Deyi Ji,Shuo Gao,Lanyun Zhu
Main category: cs.CL
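TL;DR: The paper proposes SID, a multi-LLM debate method driven by self signals. It uses model-level confidence and token-level semantic focus to steer the debate adaptively, cutting redundant computation and improving performance.
Details
Motivation: Existing Multi-LLM Agent Debate (MAD) methods rely mainly on external structures such as debate graphs or LLM-as-a-Judge and neglect self signals that arise during generation, such as token logits and attention. This can lead to redundant computation and degraded performance.
Contribution: 1. Brings self signals (model-level confidence and token-level semantic focus) into multi-LLM debate. 2. Proposes SID, which uses these signals to guide the debate, letting high-confidence agents exit early and compressing redundant debate content.
Method: SID uses two kinds of self signals: model-level confidence, to identify high-confidence agents that can exit early, and token-level semantic focus, to compress redundant content based on the attention mechanism.
Result: Across multiple benchmarks with LLMs and multimodal LLMs, SID outperforms existing MAD methods in accuracy while consuming fewer tokens.
Insight: Self signals are an overlooked resource in multi-LLM debate, and using them adaptively can improve both the performance and the efficiency of debate systems.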
Abstract: Large Language Models (LLMs) have exhibited impressive capabilities across diverse application domains. Recent work has explored Multi-LLM Agent Debate (MAD) as a way to enhance performance by enabling multiple LLMs to discuss and refine responses iteratively. Nevertheless, existing MAD methods predominantly focus on utilizing external structures, such as debate graphs, using LLM-as-a-Judge, while neglecting the application of self signals, such as token logits and attention, that arise during generation. This omission leads to redundant computation and potential performance degradation. In this paper, we shift the focus to the self signals of multi-LLM debate and introduce a Self-Signals Driven Multi-LLM Debate (SID), which leverages two types of self-signals: model-level confidence and token-level semantic focus, to adaptively guide the debate process. Our approach enables high-confidence agents to exit early at the model level and compress the redundant debate contents based on the attention mechanism. We evaluate our method on various LLMs and Multimodal LLMs across multiple challenging benchmarks. Experimental results demonstrate that our method not only outperforms existing MAD techniques in accuracy but also reduces token consumption, highlighting the effectiveness of utilizing self signals in enhancing both the performance and efficiency of multi-agent debate systems. Our code will be available at https://github.com/xuhang2019/SID.
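The model-level early exit described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes confidence is approximated by the mean token probability computed from each agent's token log-probabilities, and the function names, the agent names, and the 0.9 threshold are all hypothetical.

```python
import math

def sequence_confidence(token_logprobs):
    """Mean token probability of one answer (an assumed confidence proxy)."""
    if not token_logprobs:
        return 0.0
    return sum(math.exp(lp) for lp in token_logprobs) / len(token_logprobs)

def split_agents(agent_logprobs, threshold=0.9):
    """Partition agents into those that exit early and those that keep debating."""
    exited, remaining = [], []
    for name, logprobs in agent_logprobs.items():
        if sequence_confidence(logprobs) >= threshold:
            exited.append(name)      # confident enough: skip further rounds
        else:
            remaining.append(name)   # uncertain: continue the debate
    return exited, remaining

# Hypothetical per-token log-probabilities for two agents' answers.
agents = {
    "agent_a": [-0.01, -0.02, -0.05],
    "agent_b": [-0.9, -1.2, -0.4],
}
exited, remaining = split_agents(agents)
```

Only the agents left in `remaining` would consume tokens in the next round, which is where the savings in token consumption would come from under these assumptions.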
[36] OpenJAI-v1.0: An Open Thai Large Language Model
Pontakorn Trakuekul,Attapol T. Rutherford,Jullajak Karnjanaekarin,Narongkorn Panitsrisit,Sumana Sumanakul
Main category: cs.CL
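TL;DR: OpenJAI-v1.0 is an open-source large language model for Thai and English built on Qwen3-14B. It targets three use cases: instruction following, long-context understanding, and tool use. It outperforms other leading open-source Thai models while avoiding catastrophic forgetting.
Details
Motivation: The Thai AI community needs high-quality open-source language models. OpenJAI-v1.0 aims to help fill this gap and to support a range of practical tasks. Contribution: 1. Releases the open-source Thai LLM OpenJAI-v1.0; 2. improves task performance through carefully curated data; 3. outperforms other leading open-source Thai models in evaluation.
Method: Starting from Qwen3-14B, the model is trained on carefully curated data covering three use cases: instruction following, long-context understanding, and tool use.
Result: On a diverse suite of benchmarks, OpenJAI-v1.0 improves on its base model and outperforms other leading open-source Thai models, without catastrophic forgetting.
Insight: Carefully curated data is key to improving task performance, and releasing open models supports the Thai AI community.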
Abstract: We introduce OpenJAI-v1.0, an open-source large language model for Thai and English, developed from the Qwen3-14B model. Our work focuses on boosting performance on practical tasks through carefully curated data across three key use cases: instruction following, long-context understanding, and tool use. Evaluation results show that OpenJAI-v1.0 improves on the capabilities of its base model and outperforms other leading open-source Thai models on a diverse suite of benchmarks, while avoiding catastrophic forgetting. OpenJAI-v1.0 is publicly released as another alternative NLP resource for the Thai AI community.
[37] Unlocking Latent Discourse Translation in LLMs Through Quality-Aware Decoding
Wafaa Mohammed,Vlad Niculae,Chrysoula Zerva
Main category: cs.CL
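TL;DR: The paper proposes quality-aware decoding (QAD) to extract latent discourse knowledge from LLMs and thereby improve context-aware translation. QAD yields semantically richer translations that align more closely with human preferences.
Details
Motivation: Large language models (LLMs) are strong at machine translation but still struggle with discourse phenomena such as pronoun resolution and lexical cohesion at the document level. The authors aim to draw out the discourse knowledge latent in LLMs through the decoding method. Contribution: Shows that discourse knowledge is encoded within LLMs, proposes QAD to extract it, and demonstrates improved context-aware translation quality.
Method: Applies a quality-aware decoding (QAD) strategy that analyzes and selects among decoding candidates to extract the discourse knowledge in LLMs, and compares it with other decoding approaches.
Result: QAD outperforms other decoding approaches, improving the semantic richness of translations and their alignment with human preferences.
Insight: Discourse knowledge is encoded in LLMs, but an appropriate decoding method such as QAD is needed to extract and use it.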
Abstract: Large language models (LLMs) have emerged as strong contenders in machine translation. Yet, they still struggle to adequately handle discourse phenomena, such as pronoun resolution and lexical cohesion at the document level. In this study, we thoroughly investigate the discourse phenomena performance of LLMs in context-aware translation. We demonstrate that discourse knowledge is encoded within LLMs and propose the use of quality-aware decoding (QAD) to effectively extract this knowledge, showcasing its superiority over other decoding approaches through comprehensive analysis. Furthermore, we illustrate that QAD enhances the semantic richness of translations and aligns them more closely with human preferences.
[38] $λ$-GRPO: Unifying the GRPO Frameworks with Learnable Token Preferences
Yining Wang,Jinman Zhao,Chuangxin Zhao,Shuhao Guan,Gerald Penn,Shinan Liu
Main category: cs.CL
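TL;DR: The paper proposes $λ$-GRPO, which introduces a learnable parameter $λ$ that adaptively controls token-level weighting to address the length bias of GRPO, and reports consistent improvements on multiple mathematical reasoning benchmarks.
Details
Motivation: GRPO suffers from a length bias: the same advantage is assigned uniformly to every token of a response, so longer responses contribute disproportionately to gradient updates. Existing variants such as DAPO and Dr. GRPO change how the token-level loss is aggregated, but they remain heuristic and offer limited interpretability. The paper aims to unify these frameworks and let the model learn its own token preference. Contribution: 1. Proposes $λ$-GRPO, which uses a learnable parameter $λ$ to adaptively control token-level weighting and address the length bias of GRPO; 2. unifies existing frameworks under a single formulation; 3. improves performance on multiple mathematical reasoning benchmarks.
Method: 1. Introduces a learnable parameter $λ$ that adaptively controls the weighting used in token-level loss aggregation; 2. expresses GRPO and its variants in a single formulation; 3. validates $λ$-GRPO experimentally.
Result: On Qwen2.5 models with 1.5B, 3B, and 7B parameters, $λ$-GRPO improves average accuracy over GRPO by 1.9%, 1.0%, and 1.7%, respectively, with no additional computational cost and no changes to the training data.
Insight: Learning token preferences is an effective and practical way to improve LLM reasoning, and it requires no additional resources.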
Abstract: Reinforcement Learning with Human Feedback (RLHF) has been the dominant approach for improving the reasoning capabilities of Large Language Models (LLMs). Recently, Reinforcement Learning with Verifiable Rewards (RLVR) has simplified this paradigm by replacing the reward and value models with rule-based verifiers. A prominent example is Group Relative Policy Optimization (GRPO). However, GRPO inherently suffers from a length bias, since the same advantage is uniformly assigned to all tokens of a response. As a result, longer responses distribute the reward over more tokens and thus contribute disproportionately to gradient updates. Several variants, such as DAPO and Dr. GRPO, modify the token-level aggregation of the loss, yet these methods remain heuristic and offer limited interpretability regarding their implicit token preferences. In this work, we explore the possibility of allowing the model to learn its own token preference during optimization. We unify existing frameworks under a single formulation and introduce a learnable parameter $\lambda$ that adaptively controls token-level weighting. We use $\lambda$-GRPO to denote our method, and we find that $\lambda$-GRPO achieves consistent improvements over vanilla GRPO and DAPO on multiple mathematical reasoning benchmarks. On Qwen2.5 models with 1.5B, 3B, and 7B parameters, $\lambda$-GRPO improves average accuracy by +1.9%, +1.0%, and +1.7% compared to GRPO, respectively. Importantly, these gains come without any modifications to the training data or additional computational cost, highlighting the effectiveness and practicality of learning token preferences.
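To make the "token-level aggregation" differences concrete, the sketch below computes the weight each token receives under the three fixed rules as they are commonly described: GRPO averages over the tokens of each response and then over the group, DAPO averages over all tokens in the group, and Dr. GRPO divides by a constant length. The function names are hypothetical, and the learnable $\lambda$ weighting itself is omitted because the abstract does not give its parameterization.

```python
def token_weights(lengths, scheme, max_len=None):
    """Per-token loss weight for each response in a group, by aggregation scheme."""
    group = len(lengths)
    if scheme == "grpo":
        # Mean over tokens within a response, then mean over the group.
        return [1.0 / (group * n) for n in lengths]
    if scheme == "dapo":
        # Single mean over every token in the group.
        total = sum(lengths)
        return [1.0 / total for _ in lengths]
    if scheme == "dr_grpo":
        # Token sum divided by a constant length (here: the longest response).
        const = max_len if max_len is not None else max(lengths)
        return [1.0 / (group * const) for _ in lengths]
    raise ValueError(f"unknown scheme: {scheme}")

def response_share(lengths, scheme, max_len=None):
    """Total weight carried by each whole response (per-token weight times length)."""
    weights = token_weights(lengths, scheme, max_len)
    return [w * n for w, n in zip(weights, lengths)]

# A group with one short (2-token) and one long (8-token) response.
lengths = [2, 8]
shares = {s: response_share(lengths, s) for s in ("grpo", "dapo", "dr_grpo")}
```

Comparing `shares` across schemes shows how each fixed rule encodes an implicit preference between short and long responses, which is the preference the paper proposes to learn instead.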
[39] SHANKS: Simultaneous Hearing and Thinking for Spoken Language Models
Cheng-Han Chiang,Xiaofei Wang,Linjie Li,Chung-Ching Lin,Kevin Lin,Shujie Liu,Zhendong Wang,Zhengyuan Yang,Hung-yi Lee,Lijuan Wang
Main category: cs.CL
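TL;DR: SHANKS is a general inference framework that lets spoken language models (SLMs) generate unspoken chain-of-thought reasoning while still listening to the user, reducing interaction latency and improving response accuracy.
Details
Motivation: Current large language models (LLMs) and spoken language models (SLMs) start thinking and acting only after the user has finished speaking, which causes high response latency and is poorly suited to real-time speech interaction. Humans think while they listen, and SHANKS tries to emulate this. Contribution: Proposes the SHANKS framework, which lets an SLM generate unspoken reasoning while the user is speaking and, when appropriate, interrupt the user or make tool calls to complete the task.
Method: SHANKS streams the input speech in fixed-duration chunks. As soon as a chunk arrives, it generates chain-of-thought reasoning based on all previous speech and reasoning, and uses that reasoning to decide whether to interrupt or to call a tool.
Result: When the user walks through a math solution, SHANKS achieves 37.1% higher interruption accuracy than a baseline that interrupts without thinking. In tool-augmented dialogue, it completes 56.9% of tool calls before the user finishes speaking.
Insight: SHANKS shows the value of models that keep thinking throughout the conversation rather than only after a turn ends, pointing to a new direction for real-time speech interaction.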
Abstract: Current large language models (LLMs) and spoken language models (SLMs) begin thinking and taking actions only after the user has finished their turn. This prevents the model from interacting during the user’s turn and can lead to high response latency while it waits to think. Consequently, thinking after receiving the full input is not suitable for speech-to-speech interaction, where real-time, low-latency exchange is important. We address this by noting that humans naturally “think while listening.” In this paper, we propose SHANKS, a general inference framework that enables SLMs to generate unspoken chain-of-thought reasoning while listening to the user input. SHANKS streams the input speech in fixed-duration chunks and, as soon as a chunk is received, generates unspoken reasoning based on all previous speech and reasoning, while the user continues speaking. SHANKS uses this unspoken reasoning to decide whether to interrupt the user and to make tool calls to complete the task. We demonstrate that SHANKS enhances real-time user-SLM interaction in two scenarios: (1) when the user is presenting a step-by-step solution to a math problem, SHANKS can listen, reason, and interrupt when the user makes a mistake, achieving 37.1% higher interruption accuracy than a baseline that interrupts without thinking; and (2) in a tool-augmented dialogue, SHANKS can complete 56.9% of the tool calls before the user finishes their turn. Overall, SHANKS moves toward models that keep thinking throughout the conversation, not only after a turn ends. Animated illustrations of Shanks can be found at https://d223302.github.io/SHANKS/
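The chunked "think while listening" loop described above can be sketched as follows. This is a toy illustration under stated assumptions: text strings stand in for fixed-duration speech chunks, and `toy_reason` is a hypothetical arithmetic checker used in place of an SLM's unspoken reasoning.

```python
def think_while_listening(chunks, reason, should_interrupt):
    """Process chunks as they arrive, reasoning after each one."""
    heard, thoughts = [], []
    for index, chunk in enumerate(chunks):
        heard.append(chunk)
        # Unspoken reasoning conditioned on everything heard and thought so far.
        thought = reason(heard, thoughts)
        thoughts.append(thought)
        if should_interrupt(thought):
            return {"interrupted_at": index, "thoughts": thoughts}
    return {"interrupted_at": None, "thoughts": thoughts}

def toy_reason(heard, thoughts):
    """Check the arithmetic in the latest chunk, formatted like '4 * 3 = 12'."""
    left, claimed = heard[-1].split("=")
    a, op, b = left.split()
    actual = {"+": int(a) + int(b), "-": int(a) - int(b), "*": int(a) * int(b)}[op]
    return "ok" if actual == int(claimed) else "mistake"

# The user states a step-by-step solution; the second step is wrong.
steps = ["2 + 2 = 4", "4 * 3 = 13", "13 - 1 = 12"]
result = think_while_listening(steps, toy_reason, lambda t: t == "mistake")
```

Because reasoning runs after every chunk, the loop can stop at the faulty step instead of waiting for the full turn, which is the latency benefit the abstract describes.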
[40] Open ASR Leaderboard: Towards Reproducible and Transparent Multilingual and Long-Form Speech Recognition Evaluation
Vaibhav Srivastav,Steven Zheng,Eric Bezzam,Eustache Le Bihan,Nithin Koluguri,Piotr Żelasko,Somshubra Majumdar,Adel Moumen,Sanchit Gandhi
Main category: cs.CL
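TL;DR: The paper presents the Open ASR Leaderboard, a fully reproducible benchmark for multilingual and long-form ASR evaluation. By standardizing text normalization and reporting both WER and RTFx, it enables fair comparison of accuracy and efficiency.
Details
Motivation: ASR evaluation is dominated by short-form English, efficiency is rarely reported, and transparent, reproducible evaluation standards are lacking. Contribution: 1) An open-source, extensible ASR evaluation platform; 2) standardized text normalization and efficiency reporting; 3) an evaluation of more than 60 systems that exposes the accuracy-efficiency trade-offs of different architectures.
Method: The benchmark adds dedicated multilingual and long-form tracks, applies standardized text normalization, and reports both WER and RTFx to compare open-source and proprietary ASR systems.
Result: For English transcription, Conformer encoders paired with LLM decoders achieve the best average WER but are slower, while CTC and TDT decoders deliver much better RTFx, making them attractive for long-form and offline use.
Insight: 1) Accuracy and efficiency must be traded off; 2) specific scenarios such as long-form audio call for different architectures; 3) open benchmarks improve the transparency of ASR research.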
Abstract: Despite rapid progress, ASR evaluation remains saturated with short-form English, and efficiency is rarely reported. We present the Open ASR Leaderboard, a fully reproducible benchmark and interactive leaderboard comparing 60+ open-source and proprietary systems across 11 datasets, including dedicated multilingual and long-form tracks. We standardize text normalization and report both word error rate (WER) and inverse real-time factor (RTFx), enabling fair accuracy-efficiency comparisons. For English transcription, Conformer encoders paired with LLM decoders achieve the best average WER but are slower, while CTC and TDT decoders deliver much better RTFx, making them attractive for long-form and offline use. Whisper-derived encoders fine-tuned for English improve accuracy but often trade off multilingual coverage. All code and dataset loaders are open-sourced to support transparent, extensible evaluation.
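The two reported metrics can be computed as in the sketch below. WER is the word-level edit distance divided by the number of reference words, and RTFx is audio duration divided by processing time. The `normalize` step here (lowercasing and stripping ASCII punctuation) is a deliberately simplified stand-in, not the leaderboard's actual normalizer.

```python
import string

def normalize(text):
    """Simplified normalization (an assumption): lowercase, drop ASCII punctuation."""
    return text.lower().translate(str.maketrans("", "", string.punctuation)).split()

def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)

def inverse_real_time_factor(audio_seconds, processing_seconds):
    """RTFx: seconds of audio transcribed per second of compute (higher is faster)."""
    return audio_seconds / processing_seconds

wer = word_error_rate("The cat sat on the mat.", "the cat sit on mat")
rtfx = inverse_real_time_factor(audio_seconds=60.0, processing_seconds=0.5)
```

Reporting both numbers side by side is what allows the accuracy-efficiency comparison the abstract describes: a system can win on `wer` while losing on `rtfx`.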
[41] EDUMATH: Generating Standards-aligned Educational Math Word Problems
Bryan R. Christ,Penelope Molitz,Jonathan Kropko,Thomas Hartvigsen
Main category: cs.CL
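TL;DR: The paper uses LLMs to generate math word problems (MWPs) aligned with education standards and evaluates them with a joint human expert and LLM judge approach. It builds the first teacher-annotated dataset for this task and trains a 12B open model that matches larger open models. In a study with grade school students, performance on generated and human-written MWPs is similar, but students consistently prefer the customized MWPs.
Details
Motivation: MWPs are critical K-12 educational tools, but large class sizes and teacher burnout make it hard to customize them for each student. LLMs could support math education by generating customized MWPs. Contribution: 1) The first teacher-annotated dataset for standards-aligned educational MWP generation; 2) a 12B open model that matches larger, more capable open models; 3) a text classifier that enables a 30B open LLM to outperform existing closed baselines without any training; 4) the first study of customized LLM-generated MWPs with grade school students.
Method: 1) Evaluate more than 11,000 generated MWPs with human experts and an LLM judge; 2) train an open model and a text classifier on the teacher-annotated data; 3) compare generated and human-written MWPs in a student study of performance and preference.
Result: The 12B open model matches larger models; the 30B model with the classifier outperforms closed baselines; the models' MWPs are more similar to human-written MWPs than those from existing models; students perform similarly on generated and human-written MWPs but prefer the customized ones.
Insight: LLMs can generate standards-aligned MWPs and reduce teacher workload; smaller models trained on high-quality data can match larger ones; students' preference for customized content supports the practical value of LLMs in education.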
Abstract: Math word problems (MWPs) are critical K-12 educational tools, and customizing them to students’ interests and ability levels can increase learning outcomes. However, teachers struggle to find time to customize MWPs for each student given large class sizes and increasing burnout. We propose that LLMs can support math education by generating MWPs customized to student interests and math education standards. To this end, we use a joint human expert-LLM judge approach to evaluate over 11,000 MWPs generated by open and closed LLMs and develop the first teacher-annotated dataset for standards-aligned educational MWP generation. We show the value of our data by using it to train a 12B open model that matches the performance of larger and more capable open models. We also use our teacher-annotated data to train a text classifier that enables a 30B open LLM to outperform existing closed baselines without any training. Next, we show our models’ MWPs are more similar to human-written MWPs than those from existing models. We conclude by conducting the first study of customized LLM-generated MWPs with grade school students, finding they perform similarly on our models’ MWPs relative to human-written MWPs but consistently prefer our customized MWPs.
[42] Probing Social Identity Bias in Chinese LLMs with Gendered Pronouns and Social Groups
Geng Liu,Feng Li,Junjie Mu,Mengxiao Zhu,Francesco Pierri
Main category: cs.CL
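TL;DR: Using Mandarin-specific prompts and social groups salient in the Chinese context, the study probes social identity bias in Chinese LLMs. It finds systematic bias across ingroup ("We") and outgroup ("They") framings, which also appears in real conversations and may be stronger there.
Details
Motivation: As LLMs are deployed in user-facing applications, concern is growing that they reflect and amplify social biases. The study focuses on Chinese models to address a gap in cross-lingual bias evaluation. Contribution: 1. A language-aware evaluation framework for Chinese LLMs; 2. evidence that ingroup-outgroup bias is widespread and may strengthen in real conversations; 3. an extended bias analysis covering 240 Chinese social groups.
Method: 1. Controlled experiments with Mandarin-specific prompts; 2. analysis of responses from ten representative Chinese LLMs; 3. analysis of Chinese-language conversations from a corpus of real user-chatbot interactions.
Result: Models show ingroup-positive and outgroup-negative tendencies, and the authors suggest these may strengthen in real interactions.
Insight: Social identity biases documented in English generalize across languages, and user-facing contexts may intensify them, which underlines the need for bias evaluation in non-English settings.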
Abstract: Large language models (LLMs) are increasingly deployed in user-facing applications, raising concerns about their potential to reflect and amplify social biases. We investigate social identity framing in Chinese LLMs using Mandarin-specific prompts across ten representative Chinese LLMs, evaluating responses to ingroup (“We”) and outgroup (“They”) framings, and extending the setting to 240 social groups salient in the Chinese context. To complement controlled experiments, we further analyze Chinese-language conversations from a corpus of real interactions between users and chatbots. Across models, we observe systematic ingroup-positive and outgroup-negative tendencies, which are not confined to synthetic prompts but also appear in naturalistic dialogue, indicating that bias dynamics might strengthen in real interactions. Our study provides a language-aware evaluation framework for Chinese LLMs, demonstrating that social identity biases documented in English generalize cross-linguistically and intensify in user-facing contexts.
[43] Towards Reliable Retrieval in RAG Systems for Large Legal Datasets
Markus Reuter,Tobias Lingenberg,Rūta Liepiņa,Francesca Lagioia,Marco Lippi,Giovanni Sartor,Andrea Passerini,Burcu Sayin
Main category: cs.CL
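TL;DR: The paper proposes Summary-Augmented Chunking (SAC), which injects a document-level synthetic summary into each text chunk to make retrieval over large legal datasets more reliable. SAC greatly reduces Document-Level Retrieval Mismatch (DRM) and improves retrieval precision and recall.
Details
Motivation: In the legal domain, large databases of structurally similar documents often cause retrieval systems to fail. In particular, Document-Level Retrieval Mismatch (DRM) undermines the reliability of Retrieval-Augmented Generation (RAG) systems. Contribution: (1) Identifies and quantifies the DRM failure mode; (2) proposes SAC, a simple and computationally efficient method that adds a document-level summary to each chunk to supply global context; (3) shows experimentally that a generic summarization strategy outperforms a targeted approach built on legal expert knowledge.
Method: Summary-Augmented Chunking (SAC) attaches a document-level synthetic summary to each text chunk, preserving global context that standard chunking would otherwise lose.
Result: SAC greatly reduces DRM and also improves text-level retrieval precision and recall. A generic summarization strategy performs better on the legal tasks than one that targets specific legal elements.
Insight: For legal datasets, global context such as a document-level summary does more for retrieval reliability than domain-specific targeting, and SAC is practical and easy to integrate.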
Abstract: Retrieval-Augmented Generation (RAG) is a promising approach to mitigate hallucinations in Large Language Models (LLMs) for legal applications, but its reliability is critically dependent on the accuracy of the retrieval step. This is particularly challenging in the legal domain, where large databases of structurally similar documents often cause retrieval systems to fail. In this paper, we address this challenge by first identifying and quantifying a critical failure mode we term Document-Level Retrieval Mismatch (DRM), where the retriever selects information from entirely incorrect source documents. To mitigate DRM, we investigate a simple and computationally efficient technique which we refer to as Summary-Augmented Chunking (SAC). This method enhances each text chunk with a document-level synthetic summary, thereby injecting crucial global context that would otherwise be lost during a standard chunking process. Our experiments on a diverse set of legal information retrieval tasks show that SAC greatly reduces DRM and, consequently, also improves text-level retrieval precision and recall. Interestingly, we find that a generic summarization strategy outperforms an approach that incorporates legal expert domain knowledge to target specific legal elements. Our work provides evidence that this practical, scalable, and easily integrable technique enhances the reliability of RAG systems when applied to large-scale legal document datasets.
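A minimal sketch of the chunk augmentation step described above. It assumes the document-level summary has already been produced (in the paper it is synthetic; here it is passed in as a string), and the word-based chunker, the function names, and the record layout are illustrative choices rather than the authors' pipeline.

```python
def chunk_text(text, chunk_size):
    """Split text into consecutive chunks of at most chunk_size words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]

def summary_augmented_chunks(doc_id, text, summary, chunk_size=50):
    """Prepend the document-level summary to every chunk before indexing."""
    return [
        {"doc_id": doc_id, "text": f"{summary}\n\n{chunk}"}
        for chunk in chunk_text(text, chunk_size)
    ]

# Hypothetical document and summary.
chunks = summary_augmented_chunks(
    doc_id="lease_001",
    text="a b c d e",
    summary="Lease between X and Y.",
    chunk_size=2,
)
```

Each record in `chunks` would then be embedded as a whole, so two structurally similar clauses from different contracts no longer look identical to the retriever.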
[44] Pragyaan: Designing and Curating High-Quality Cultural Post-Training Datasets for Indian Languages
Neel Prabhanjan Rachamalla,Aravind Konakalla,Gautam Rajeev,Ashish Kulkarni,Chandra Khatri,Shubham Agarwal
Main category: cs.CL
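TL;DR: The paper introduces Pragyaan, a high-quality, culturally grounded post-training dataset for Indian languages. It is built with a human-in-the-loop pipeline that combines translation with synthetic expansion, addressing gaps in diversity and cultural grounding in existing datasets.
Details
Motivation: Existing open-source datasets cover Indian languages poorly and lack cultural grounding and task diversity, which limits the effectiveness of large language models (LLMs). Contribution: Introduces two datasets, Pragyaan-IT and Pragyaan-Align, spanning 10 Indian languages and drawing on 57 source datasets, with an emphasis on task diversity, multi-turn dialogue, and cultural nuance.
Method: A human-in-the-loop pipeline combines translation with synthetic expansion to produce reliable and diverse data.
Result: The resulting Indian-language datasets contain 22.5K (Pragyaan-IT) and 100K (Pragyaan-Align) examples and cover 13 broad categories and 56 sub-categories.
Insight: Cultural grounding and task diversity matter for multilingual LLM quality, and a human-in-the-loop process can raise data quality.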
Abstract: The effectiveness of Large Language Models (LLMs) depends heavily on the availability of high-quality post-training data, particularly instruction-tuning and preference-based examples. Existing open-source datasets, however, often lack multilingual coverage and cultural grounding, and suffer from task diversity gaps that are especially pronounced for Indian languages. We introduce a human-in-the-loop pipeline that combines translations with synthetic expansion to produce reliable and diverse Indic post-training data. Using this pipeline, we curate two datasets: Pragyaan-IT (22.5K) and Pragyaan-Align (100K) across 10 Indian languages covering 13 broad and 56 sub-categories, leveraging 57 diverse datasets. Our dataset protocol incorporates several often-overlooked dimensions and emphasizes task diversity, multi-turn dialogue, instruction fidelity, safety alignment, and preservation of cultural nuance, providing a foundation for more inclusive and effective multilingual LLMs.
[45] Native Hybrid Attention for Efficient Sequence Modeling
Jusen Du,Jiaxi Hu,Tao Zhang,Weigao Sun,Yu Cheng
Main category: cs.CL
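TL;DR: The paper proposes Native Hybrid Attention (NHA), a hybrid architecture that combines linear and full attention in a unified layer design, giving efficient sequence modeling while retaining long-context recall.
Details
Motivation: Transformers excel at sequence modeling, but their quadratic complexity limits efficiency. Linear attention is more efficient but sacrifices recall accuracy over long contexts. NHA aims to resolve this tension. Contribution: 1. Proposes NHA, a hybrid attention architecture that combines the strengths of linear and full attention. 2. Introduces a unified layer design that needs no additional fusion parameters. 3. Controls the hybrid behavior through a single hyperparameter, the sliding window size.
Method: NHA keeps long-term context in key-value slots updated by a linear RNN and augments them with short-term tokens from a sliding window, then applies a single softmax attention operation over all keys and values. The sliding window size is the hyperparameter that controls the mix.
Result: NHA surpasses Transformers and other hybrid baselines on recall-intensive and commonsense reasoning tasks. Pretrained LLMs can also be structurally hybridized with NHA, achieving competitive accuracy with significant efficiency gains.
Insight: Hybrid attention can balance efficiency and accuracy, and NHA offers a flexible design for long-sequence modeling.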
Abstract: Transformers excel at sequence modeling but face quadratic complexity, while linear attention offers improved efficiency but often compromises recall accuracy over long contexts. In this work, we introduce Native Hybrid Attention (NHA), a novel hybrid architecture of linear and full attention that integrates both intra- and inter-layer hybridization into a unified layer design. NHA maintains long-term context in key-value slots updated by a linear RNN, and augments them with short-term tokens from a sliding window. A single softmax attention operation is then applied over all keys and values, enabling per-token and per-head context-dependent weighting without requiring additional fusion parameters. The inter-layer behavior is controlled through a single hyperparameter, the sliding window size, which allows smooth adjustment between purely linear and full attention while keeping all layers structurally uniform. Experimental results show that NHA surpasses Transformers and other hybrid baselines on recall-intensive and commonsense reasoning tasks. Furthermore, pretrained LLMs can be structurally hybridized with NHA, achieving competitive accuracy while delivering significant efficiency gains. Code is available at https://github.com/JusenD/NHA.
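The "single softmax over all keys and values" step can be sketched for one query and one head as below. This is a dependency-free toy, not the released implementation: the long-term slots are taken as given (the linear-RNN update that produces them is omitted), and all names are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def hybrid_attention(query, slot_keys, slot_values, window_keys, window_values):
    """One softmax over long-term slots and sliding-window tokens together."""
    keys = slot_keys + window_keys        # concatenate both memories
    values = slot_values + window_values
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, k) / scale for k in keys])
    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return output, weights

# One long-term slot and one recent token, 2-dimensional for readability.
output, weights = hybrid_attention(
    query=[1.0, 0.0],
    slot_keys=[[1.0, 0.0]], slot_values=[[1.0, 1.0]],
    window_keys=[[0.0, 1.0]], window_values=[[0.0, 0.0]],
)
```

Because slots and window tokens compete inside the same softmax, the weighting between long-term and short-term context is decided per query, with no separate fusion gate. An empty window leaves only the slots, and a window covering the whole sequence approaches full attention.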
[46] Search-R3: Unifying Reasoning and Embedding Generation in Large Language Models
Yuntao Gui,James Cheng
Main category: cs.CL
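TL;DR: Search-R3 is a framework that couples the reasoning ability of large language models (LLMs) with embedding generation to improve their performance on retrieval tasks.
Details
Motivation: LLMs have remarkable natural language understanding but have been underutilized for retrieval tasks. Search-R3 aims to address this gap. Contribution: A unified framework that ties the LLM's reasoning process to embedding generation, improving retrieval performance.
Method: Three mechanisms: 1. a supervised learning stage that teaches the model to produce quality embeddings; 2. reinforcement learning that optimizes embedding generation alongside reasoning; 3. a specialized RL environment that handles evolving embedding representations efficiently.
Result: On diverse benchmarks, Search-R3 significantly outperforms prior methods, showing its effectiveness on complex knowledge-intensive tasks.
Insight: Unifying reasoning and embedding generation is key to improving LLM retrieval, and this integrated approach suggests a direction for future work.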
Abstract: Despite their remarkable natural language understanding capabilities, Large Language Models (LLMs) have been underutilized for retrieval tasks. We present Search-R3, a novel framework that addresses this limitation by adapting LLMs to generate search embeddings as a direct output of their reasoning process. Our approach exploits LLMs’ chain-of-thought capabilities, allowing them to produce more effective embeddings by reasoning step-by-step through complex semantic analyses. We implement this through three complementary mechanisms: (1) a supervised learning stage that enables the model to produce quality embeddings, (2) a reinforcement learning (RL) methodology that optimizes embedding generation alongside reasoning, and (3) a specialized RL environment that efficiently handles evolving embedding representations without requiring complete corpus re-encoding at each training iteration. Our extensive evaluations on diverse benchmarks demonstrate that Search-R3 significantly outperforms prior methods by unifying the reasoning and embedding generation processes. This integrated post-training approach represents a substantial advancement in handling complex knowledge-intensive tasks that require both sophisticated reasoning and effective information retrieval. Project page: https://github.com/ytgui/Search-R3
[47] Does Local News Stay Local?: Online Content Shifts in Sinclair-Acquired Stations
Miriam Wanner,Sophia Hager,Anjalie Field
Main category: cs.CL
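TL;DR: The study examines how acquisition by the Sinclair Broadcast Group affects the content of local news stations. After acquisition, stations report more on national news at the expense of local topics, and their coverage of polarizing national topics increases.
Details
Motivation: Local news stations are often seen as reliable sources of non-politicized information, especially on local issues that residents care about. Sinclair's acquisitions raise the question of how coverage changes. Contribution: Shows that stations acquired by Sinclair shift clearly toward national coverage and increase coverage of polarizing topics, pointing to the potential impact of concentrated media ownership on local news content.
Method: Computational methods are used to analyze changes in the online content of local news stations before and after acquisition, and in comparison with national news outlets.
Result: Acquired stations report more frequently on national news, reduce coverage of local topics, and cover more polarizing national topics.
Insight: Concentrated media ownership may erode the local character of local news in favor of national issues, affecting how the public gets information and forms opinions.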
Abstract: Local news stations are often considered to be reliable sources of non-politicized information, particularly about local concerns that residents care about. Because these stations are trusted news sources, viewers are particularly susceptible to the information they report. The Sinclair Broadcast Group is a broadcasting company that has acquired many local news stations in the last decade. We investigate the effects of local news stations being acquired by Sinclair: how does coverage change? We use computational methods to investigate changes in internet content put out by local news stations before and after being acquired by Sinclair and in comparison to national news outlets. We find that there is clear evidence that local news stations report more frequently on national news at the expense of local topics, and that their coverage of polarizing national topics increases.
[48] Revisiting Metric Reliability for Fine-grained Evaluation of Machine Translation and Summarization in Indian Languages
Amir Hossein Yari,Kalmit Kulkarni,Ahmad Raza Khan,Fajri Koto
Main category: cs.CL
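TL;DR: The paper introduces ITEM, a benchmark that systematically evaluates how well 26 automatic metrics agree with human judgments across six Indian languages. Findings include that LLM-based evaluators align best with humans, that outliers have a significant impact, and that metrics emphasize different qualities in translation and summarization.
Details
Motivation: Existing automatic metrics have been developed and validated mainly for English and other high-resource languages. They have not been validated for Indian languages, which limits the generality and reliability of evaluation. Contribution: Introduces the ITEM benchmark with large-scale, fine-grained annotations for Indian languages, and characterizes how metrics relate to human judgments and how this differs across tasks.
Method: The study evaluates 26 metrics across six Indian languages, analyzing agreement with human judgments, sensitivity to outliers, language-specific reliability, inter-metric correlations, and resilience to controlled perturbations.
Result: LLM-based evaluators show the strongest alignment with human judgments; outliers significantly affect metric-human agreement; in translation, metrics better reflect fluency, whereas in summarization they better capture content fidelity.
Insight: Indian languages need more robust evaluation metrics, and the task type (translation or summarization) affects which metrics are suitable.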
Abstract: While automatic metrics drive progress in Machine Translation (MT) and Text Summarization (TS), existing metrics have been developed and validated almost exclusively for English and other high-resource languages. This narrow focus leaves Indian languages, spoken by over 1.5 billion people, largely overlooked, casting doubt on the universality of current evaluation practices. To address this gap, we introduce ITEM, a large-scale benchmark that systematically evaluates the alignment of 26 automatic metrics with human judgments across six major Indian languages, enriched with fine-grained annotations. Our extensive evaluation, covering agreement with human judgments, sensitivity to outliers, language-specific reliability, inter-metric correlations, and resilience to controlled perturbations, reveals four central findings: (1) LLM-based evaluators show the strongest alignment with human judgments at both segment and system levels; (2) outliers exert a significant impact on metric-human agreement; (3) in TS, metrics are more effective at capturing content fidelity, whereas in MT, they better reflect fluency; and (4) metrics differ in their robustness and sensitivity when subjected to diverse perturbations. Collectively, these findings offer critical guidance for advancing metric design and evaluation in Indian languages.
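Finding (2), that outliers strongly affect metric-human agreement, is easy to reproduce on toy data. The sketch below uses Pearson correlation as the agreement measure, which is an assumption since the abstract does not name the statistic, and the scores are invented purely for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)

# Invented scores: five similar-quality outputs plus one very bad output (last).
human = [3.0, 3.2, 3.1, 3.3, 3.0, 1.0]
metric = [0.62, 0.60, 0.65, 0.61, 0.63, 0.10]

with_outlier = pearson(human, metric)
without_outlier = pearson(human[:-1], metric[:-1])
```

On this data a single extreme segment dominates the correlation, and removing it changes the picture entirely, which is why agreement figures benefit from being reported both with and without outliers.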
[49] TALENT: Table VQA via Augmented Language-Enhanced Natural-text Transcription
Guo Yutong,Wanying Wang,Yue Wu,Zichen Miao,Haoyu Wang
Main category: cs.CL
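TL;DR: TALENT is a lightweight framework for Table VQA that combines OCR text with a natural language narration of the table, avoiding direct reliance on computationally expensive large vision-language models.
Details
Motivation: Existing Table VQA approaches either rely on large vision-language models, which are computationally expensive, or reason over structured outputs such as Markdown tables, which introduce errors. TALENT aims to address both problems with a lightweight alternative. Contribution: 1. Proposes the TALENT framework, which combines OCR text with natural language narration; 2. constructs ReTabVQA, a more challenging dataset; 3. matches the performance of large models at low computational cost.
Method: TALENT prompts a small vision-language model to produce both OCR text and a natural language narration, then passes them with the question to a large language model for reasoning. This reframes Table VQA as an LLM-centric multimodal reasoning task.
Result: On public datasets and ReTabVQA, TALENT matches or surpasses a single large vision-language model at significantly lower computational cost.
Insight: Splitting perception from reasoning lets a lightweight combination work well for Table VQA, and natural language narration may suit LLM reasoning better than structured table formats.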
Abstract: Table Visual Question Answering (Table VQA) is typically addressed by large vision-language models (VLMs). While such models can answer directly from images, they often miss fine-grained details unless scaled to very large sizes, which are computationally prohibitive, especially for mobile deployment. A lighter alternative is to have a small VLM perform OCR and then use a large language model (LLM) to reason over structured outputs such as Markdown tables. However, these representations are not naturally optimized for LLMs and still introduce substantial errors. We propose TALENT (Table VQA via Augmented Language-Enhanced Natural-text Transcription), a lightweight framework that leverages dual representations of tables. TALENT prompts a small VLM to produce both OCR text and natural language narration, then combines them with the question for reasoning by an LLM. This reframes Table VQA as an LLM-centric multimodal reasoning task, where the VLM serves as a perception-narration module rather than a monolithic solver. Additionally, we construct ReTabVQA, a more challenging Table VQA dataset requiring multi-step quantitative reasoning over table images. Experiments show that TALENT enables a small VLM-LLM combination to match or surpass a single large VLM at significantly lower computational cost on both public datasets and ReTabVQA.
[50] Reasoning for Hierarchical Text Classification: The Case of Patents
Lekang Jiang,Wenjun Sun,Stephan Goetz
Main category: cs.CL
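TL;DR: The paper proposes RHC, a framework that recasts hierarchical text classification (HTC) as a step-by-step reasoning task and trains large language models (LLMs) in two stages to improve both classification performance and explainability.
Details
Motivation: Patent subject classification is one of the hardest HTC scenarios, and prior approaches output only a flat label set with little insight into the reasoning behind predictions. Contribution: Proposes the RHC framework, which turns HTC into a multi-step reasoning task trained in two stages (a cold-start stage and a reinforcement learning stage), and demonstrates effectiveness, explainability, scalability, and broad applicability in experiments.
Method: RHC first aligns LLM outputs with a chain-of-thought (CoT) reasoning format in a cold-start stage, then strengthens multi-step reasoning in a reinforcement learning stage.
Result: RHC surpasses previous baselines on patent classification and other HTC benchmarks, outperforms its supervised fine-tuning counterparts by about 3% in accuracy and macro F1, and produces natural-language justifications.
Insight: Step-by-step reasoning combined with two-stage LLM training gives HTC a solution that is both high-performing and explainable.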
Abstract: Hierarchical text classification (HTC) assigns documents to multiple levels of a pre-defined taxonomy. Automated patent subject classification represents one of the hardest HTC scenarios because of domain knowledge difficulty and a huge number of labels. Prior approaches only output a flat label set, which offers little insight into the reason behind predictions. Therefore, we propose Reasoning for Hierarchical Classification (RHC), a novel framework that reformulates HTC as a step-by-step reasoning task to sequentially deduce hierarchical labels. RHC trains large language models (LLMs) in two stages: a cold-start stage that aligns outputs with chain-of-thought (CoT) reasoning format and a reinforcement learning (RL) stage to enhance multi-step reasoning ability. RHC demonstrates four advantages in our experiments. (1) Effectiveness: RHC surpasses previous baselines and outperforms the supervised fine-tuning counterparts by approximately 3% in accuracy and macro F1. (2) Explainability: RHC produces natural-language justifications before prediction to facilitate human inspection. (3) Scalability: RHC scales favorably with model size with larger gains compared to standard fine-tuning. (4) Applicability: Beyond patents, we further demonstrate that RHC achieves state-of-the-art performance on other widely used HTC benchmarks, which highlights its broad applicability.
[51] More Data or Better Data? A Critical Analysis of Data Selection and Synthesis for Mathematical Reasoning
Yike Zhao,Simin Guo,Ziqing Yang,Shifan Han,Dahua Lin,Fei Tan
Main category: cs.CL
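TL;DR: The paper analyzes the practical utility of open-source mathematical reasoning datasets and data synthesis techniques. It finds that improving data quality, for example through more interpretable formats or distillation from stronger models, often outweighs simply scaling up data volume.
Details
Motivation: The reasoning ability of large language models (LLMs) is critical for many downstream tasks and depends strongly on training data quality. The practical utility of existing data construction methods in real-world pipelines, however, remains underexplored. Contribution: Provides a comprehensive analysis of mathematical reasoning datasets and data synthesis techniques, distills effective data selection strategies, and identifies practical methods suitable for industrial applications.
Method: The authors evaluate open-source datasets and data synthesis techniques under a unified pipeline and compare the methods experimentally.
Result: Structuring data in more interpretable formats or distilling from stronger models is often more effective than simply adding data. The results give actionable guidance for integrating training data.
Insight: For real-world reasoning tasks, balancing "more data" against "better data" matters, and the study points to directions for further research.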
Abstract: The reasoning capabilities of Large Language Models (LLMs) play a critical role in many downstream tasks, yet depend strongly on the quality of training data. Despite various proposed data construction methods, their practical utility in real-world pipelines remains underexplored. In this work, we conduct a comprehensive analysis of open-source datasets and data synthesis techniques for mathematical reasoning, evaluating them under a unified pipeline designed to mirror training and deployment scenarios. We further distill effective data selection strategies and identify practical methods suitable for industrial applications. Our findings highlight that structuring data in more interpretable formats, or distilling from stronger models often outweighs simply scaling up data volume. This study provides actionable guidance for integrating training data to enhance LLM capabilities, supporting both cost-effective data curation and scalable model enhancement. We hope this work will inspire further research on how to balance “more data” versus “better data” for real-world reasoning tasks.
[52] NurseLLM: The First Specialized Language Model for Nursing
Md Tawkat Islam Khondaker,Julia Harrington,Shady Shehata
Main category: cs.CL
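TL;DR: NurseLLM is the first language model specialized for nursing, focused on multiple-choice question answering (MCQ). It is trained on a large-scale nursing MCQ dataset built with a multi-stage data generation pipeline and performs strongly on several benchmarks.
Details
Motivation: LLMs are widely applied in medical systems, but their potential in specialized domains such as nursing remains largely underexplored, which motivates a nursing-specific model. Contribution: 1. Introduces NurseLLM, the first nursing-specialized language model; 2. builds the first large-scale nursing MCQ dataset; 3. introduces multiple nursing benchmarks.
Method: A multi-stage data generation pipeline builds the nursing MCQ dataset used to train NurseLLM. The paper also explores the role of reasoning and multi-agent collaboration in nursing.
Result: NurseLLM outperforms general-purpose and medical-specialized language models of comparable size on multiple benchmarks.
Insight: Domain-specialized models can outperform general-purpose ones within their domain, and reasoning and multi-agent collaboration systems show promise for nursing applications.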
Abstract: Recent advancements in large language models (LLMs) have significantly transformed medical systems. However, their potential within specialized domains such as nursing remains largely underexplored. In this work, we introduce NurseLLM, the first nursing-specialized LLM tailored for multiple choice question-answering (MCQ) tasks. We develop a multi-stage data generation pipeline to build the first large scale nursing MCQ dataset to train LLMs on a broad spectrum of nursing topics. We further introduce multiple nursing benchmarks to enable rigorous evaluation. Our extensive experiments demonstrate that NurseLLM outperforms SoTA general-purpose and medical-specialized LLMs of comparable size on different benchmarks, underscoring the importance of a specialized LLM for the nursing domain. Finally, we explore the role of reasoning and multi-agent collaboration systems in nursing, highlighting their promise for future research and applications.
[53] Customer-R1: Personalized Simulation of Human Behaviors via RL-based LLM Agent in Online Shopping
Ziyi Wang,Yuxuan Lu,Yimeng Zhang,Jing Huang,Dakuo Wang
Main category: cs.CL
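TL;DR: Customer-R1 is an RL-based LLM agent method for simulating personalized user behavior in online shopping. Conditioned on an explicit persona, it optimizes next-step rationale and action generation and significantly outperforms existing baselines.
Details
Motivation: Prior methods mainly learn a population-level policy without modeling the individual user, so the simulated behavior is generic rather than personalized. Contribution: Customer-R1, an RL-based LLM agent method that conditions on a persona to produce personalized user behavior simulations.
Method: Reinforcement learning with action-correctness reward signals optimizes the generation of the next-step rationale and action.
Result: On the OPeRA dataset, Customer-R1 significantly outperforms prompting and supervised fine-tuning baselines on next-action prediction, and its action distribution is closer to that of real users.
Insight: Combining explicit personas with reinforcement learning improves the fidelity and accuracy of personalized behavior simulation.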
Abstract: Simulating step-wise human behavior with Large Language Models (LLMs) has become an emerging research direction, enabling applications in various practical domains. While prior methods, including prompting, supervised fine-tuning (SFT), and reinforcement learning (RL), have shown promise in modeling step-wise behavior, they primarily learn a population-level policy without conditioning on a user’s persona, yielding generic rather than personalized simulations. In this work, we pose a critical question: how can LLM agents better simulate personalized user behavior? We introduce Customer-R1, an RL-based method for personalized, step-wise user behavior simulation in online shopping environments. Our policy is conditioned on an explicit persona, and we optimize next-step rationale and action generation via action correctness reward signals. Experiments on the OPeRA dataset demonstrate that Customer-R1 not only significantly outperforms prompting and SFT-based baselines in next-action prediction tasks, but also better matches users’ action distribution, indicating higher fidelity in personalized behavior simulation.
[54] Benchmarking LLM Causal Reasoning with Scientifically Validated Relationships
Donggyu Lee,Sungwon Park,Yerin Hwang,Hyunwoo Oh,Hyoshin Kim,Jungwon Kim,Meeyoung Cha,Sangyoon Park,Jihee Kim
Main category: cs.CL
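TL;DR: The paper introduces a benchmark built from scientifically validated causal relationships to evaluate the causal reasoning of large language models (LLMs). Relationships extracted from top-tier economics and finance journals yield a diverse test set of 40,379 evaluation items. Current LLMs perform poorly on it; the best model reaches only 57.6% accuracy.
Details
Motivation: Existing causal reasoning benchmarks rely on synthetic data and cover narrow domains, so they cannot assess the real capabilities of LLMs. Contribution: A new benchmark based on scientifically validated causal relationships, covering domains such as health, environment, technology, law, and culture.
Method: Causal relationships are extracted from top-tier economics and finance journals, where they were identified with rigorous methods such as instrumental variables, difference-in-differences, and regression discontinuity designs. Eight state-of-the-art LLMs are then evaluated across five task types.
Result: LLMs perform poorly on causal reasoning: the best model reaches only 57.6% accuracy, model scale does not consistently translate into better performance, and even advanced reasoning models struggle with basic causal relationship identification.
Insight: Current LLMs fall well short on causal reasoning, which exposes a gap between existing capabilities and the reliable causal reasoning that high-stakes applications demand.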
Abstract: Causal reasoning is fundamental for Large Language Models (LLMs) to understand genuine cause-and-effect relationships beyond pattern matching. Existing benchmarks suffer from critical limitations such as reliance on synthetic data and narrow domain coverage. We introduce a novel benchmark constructed from causally identified relationships extracted from top-tier economics and finance journals, drawing on rigorous methodologies including instrumental variables, difference-in-differences, and regression discontinuity designs. Our benchmark comprises 40,379 evaluation items covering five task types across domains such as health, environment, technology, law, and culture. Experimental results on eight state-of-the-art LLMs reveal substantial limitations, with the best model achieving only 57.6% accuracy. Moreover, model scale does not consistently translate to superior performance, and even advanced reasoning models struggle with fundamental causal relationship identification. These findings underscore a critical gap between current LLM capabilities and demands of reliable causal reasoning in high-stakes applications.
[55] LAD-RAG: Layout-aware Dynamic RAG for Visually-Rich Document Understanding
Zhivar Sourati,Zheng Wang,Marianne Menglin Liu,Yazhe Hu,Mengqing Guo,Sujeeth Bharadwaj,Kyu Han,Tao Sheng,Sujith Ravi,Morteza Dehghani,Dan Roth
Main category: cs.CL
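TL;DR: LAD-RAG is a layout-aware dynamic RAG framework for understanding visually rich documents (VRDs). By combining a symbolic document graph with a dynamic retrieval mechanism, it improves retrieval quality and QA accuracy on multi-page reasoning tasks.
Details
Motivation: Conventional RAG methods fall short on VRD question answering: they encode content in isolated chunks, losing document structure and cross-page dependencies, and they retrieve a fixed number of pages at inference, which leads to incomplete evidence and lower answer quality. Contribution: The LAD-RAG framework, which builds a symbolic document graph that preserves layout structure and cross-page dependencies and retrieves the necessary evidence dynamically at inference.
Method: During ingestion, LAD-RAG builds a symbolic document graph alongside standard neural embeddings. During inference, an LLM agent interacts with the neural and symbolic indices to retrieve evidence adaptively for each query.
Result: On MMLongBench-Doc, LongDocURL, DUDE, and MP-DocVQA, LAD-RAG achieves over 90% perfect recall on average without top-k tuning and outperforms baseline retrievers by up to 20% in recall at comparable noise levels, yielding higher QA accuracy with minimal latency.
Insight: The core idea is to combine symbolic structural information with neural representations and to use dynamic retrieval to overcome the limits of conventional RAG on multi-page document reasoning.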
Abstract: Question answering over visually rich documents (VRDs) requires reasoning not only over isolated content but also over documents’ structural organization and cross-page dependencies. However, conventional retrieval-augmented generation (RAG) methods encode content in isolated chunks during ingestion, losing structural and cross-page dependencies, and retrieve a fixed number of pages at inference, regardless of the specific demands of the question or context. This often results in incomplete evidence retrieval and degraded answer quality for multi-page reasoning tasks. To address these limitations, we propose LAD-RAG, a novel Layout-Aware Dynamic RAG framework. During ingestion, LAD-RAG constructs a symbolic document graph that captures layout structure and cross-page dependencies, adding it alongside standard neural embeddings to yield a more holistic representation of the document. During inference, an LLM agent dynamically interacts with the neural and symbolic indices to adaptively retrieve the necessary evidence based on the query. Experiments on MMLongBench-Doc, LongDocURL, DUDE, and MP-DocVQA demonstrate that LAD-RAG improves retrieval, achieving over 90% perfect recall on average without any top-k tuning, and outperforming baseline retrievers by up to 20% in recall at comparable noise levels, yielding higher QA accuracy with minimal latency.
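The retrieval metrics quoted above can be made concrete with a small sketch. It assumes that "perfect recall" means the share of questions for which every gold evidence page was retrieved; the abstract does not spell out the definition, and the function names and page IDs below are hypothetical.

```python
from typing import Iterable, Sequence, Set


def perfect_recall_rate(
    retrieved: Sequence[Iterable[int]],
    gold: Sequence[Set[int]],
) -> float:
    """Fraction of questions whose retrieved pages contain every gold page."""
    if len(retrieved) != len(gold):
        raise ValueError("retrieved and gold need one entry per question")
    if not gold:
        return 0.0
    hits = sum(1 for r, g in zip(retrieved, gold) if g <= set(r))
    return hits / len(gold)


def mean_recall(
    retrieved: Sequence[Iterable[int]],
    gold: Sequence[Set[int]],
) -> float:
    """Average per-question recall: share of gold pages that were retrieved."""
    scores = [len(g & set(r)) / len(g) for r, g in zip(retrieved, gold) if g]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical page IDs for three questions (not data from the paper).
retrieved_pages = [[1, 2, 3], [4, 7], [9]]
gold_pages = [{1, 3}, {4, 5}, {9}]

print(perfect_recall_rate(retrieved_pages, gold_pages))
print(mean_recall(retrieved_pages, gold_pages))
```

Perfect recall is the stricter of the two: the second question retrieves one of its two evidence pages, so it contributes 0.5 to mean recall but nothing to perfect recall.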
[56] Red-Bandit: Test-Time Adaptation for LLM Red-Teaming via Bandit-Guided LoRA Experts
Christos Ziakas,Nicholas Loo,Nishita Jain,Alessandra Russo
Main category: cs.CL
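TL;DR: Red-Bandit is a red-teaming framework for LLMs that adapts online, dynamically selecting attack styles to identify and exploit the target model's failure modes efficiently.
Details
Motivation: Existing LLM red-teaming approaches lack mechanisms to adapt efficiently to model-specific vulnerabilities at inference; Red-Bandit aims to fill this gap. Contribution: 1. The Red-Bandit framework, which combines LoRA experts with reinforcement learning to select attack styles dynamically; 2. a multi-armed bandit policy that balances exploration and exploitation; 3. state-of-the-art results on AdvBench with more human-readable prompts.
Method: 1. Post-train a set of parameter-efficient LoRA experts, each specialized for one attack style; 2. train the experts with reinforcement learning that rewards generating unsafe prompts, as judged by a rule-based safety model; 3. at inference, use a multi-armed bandit policy to select among the experts based on the safety of the target model's responses.
Result: Red-Bandit achieves state-of-the-art results on AdvBench under sufficient exploration (ASR@10) while producing lower-perplexity, more human-readable prompts. The bandit policy also serves as a diagnostic tool for model-specific vulnerabilities.
Insight: Beyond improving red-teaming efficiency and effectiveness, Red-Bandit offers a new tool for diagnosing model safety.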
Abstract: Automated red-teaming has emerged as a scalable approach for auditing Large Language Models (LLMs) prior to deployment, yet existing approaches lack mechanisms to efficiently adapt to model-specific vulnerabilities at inference. We introduce Red-Bandit, a red-teaming framework that adapts online to identify and exploit model failure modes under distinct attack styles (e.g., manipulation, slang). Red-Bandit post-trains a set of parameter-efficient LoRA experts, each specialized for a particular attack style, using reinforcement learning that rewards the generation of unsafe prompts via a rule-based safety model. At inference, a multi-armed bandit policy dynamically selects among these attack-style experts based on the target model’s response safety, balancing exploration and exploitation. Red-Bandit achieves state-of-the-art results on AdvBench under sufficient exploration (ASR@10), while producing more human-readable prompts (lower perplexity). Moreover, Red-Bandit’s bandit policy serves as a diagnostic tool for uncovering model-specific vulnerabilities by indicating which attack styles most effectively elicit unsafe behaviors.
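The exploration-exploitation step can be illustrated with a generic multi-armed bandit. The sketch below uses UCB1, which is an assumption: the abstract does not say which bandit algorithm Red-Bandit uses. The arm names and the simulated Bernoulli rewards are hypothetical stand-ins for the experts and for the safety signal.

```python
import math
import random


class UCB1:
    """UCB1 bandit over a fixed set of arms (hypothetical expert names)."""

    def __init__(self, arms):
        self.arms = list(arms)
        self.counts = {a: 0 for a in self.arms}
        self.totals = {a: 0.0 for a in self.arms}

    def select(self):
        # Play every arm once before using the confidence bound.
        for arm in self.arms:
            if self.counts[arm] == 0:
                return arm
        t = sum(self.counts.values())

        def ucb(arm):
            mean = self.totals[arm] / self.counts[arm]
            bonus = math.sqrt(2.0 * math.log(t) / self.counts[arm])
            return mean + bonus

        return max(self.arms, key=ucb)

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.totals[arm] += reward


def simulate(success_prob, rounds=2000, seed=0):
    """Run the bandit against simulated Bernoulli rewards; return pull counts."""
    rng = random.Random(seed)
    bandit = UCB1(success_prob)
    for _ in range(rounds):
        arm = bandit.select()
        reward = 1.0 if rng.random() < success_prob[arm] else 0.0
        bandit.update(arm, reward)
    return bandit.counts


# Assumed per-arm success rates, chosen only for illustration.
print(simulate({"expert_a": 0.1, "expert_b": 0.2, "expert_c": 0.6}))
```

The pull counts double as the kind of diagnostic the abstract mentions: the arm that accumulates the most pulls is the one the policy found most effective against that particular target.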
[57] Hybrid Reinforcement: When Reward Is Sparse, It’s Better to Be Dense
Leitian Tao,Ilia Kulikov,Swarnadeep Saha,Tianlu Wang,Jing Xu,Yixuan Li,Jason E Weston,Ping Yu
Main category: cs.CL
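TL;DR: HERO (Hybrid Ensemble Reward Optimization) is a reinforcement learning framework that combines verifier signals with reward-model scores. Using stratified normalization and variance-aware weighting, it improves language model performance on mathematical reasoning tasks.
Details
Motivation: Binary (0-1) verifier feedback is reliable but brittle: it under-credits partially correct or alternative answers, and the resulting all-or-nothing supervision limits learning. Continuous reward-model feedback can complement it. Contribution: The HERO framework, which integrates verifier and reward-model signals and optimizes learning with stratified normalization and variance-aware weighting.
Method: HERO applies stratified normalization, which bounds reward-model scores within verifier-defined groups, and variance-aware weighting, which emphasizes challenging prompts, to combine the stability of verifiers with the flexibility of reward models.
Result: Across diverse mathematical reasoning benchmarks, HERO consistently outperforms RM-only and verifier-only baselines, with strong gains on both verifiable and hard-to-verify tasks.
Insight: A hybrid reward design keeps the stability of verifiers while using the finer-grained feedback of reward models to improve reasoning.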
Abstract: Post-training for reasoning of large language models (LLMs) increasingly relies on verifiable rewards: deterministic checkers that provide 0-1 correctness signals. While reliable, such binary feedback is brittle: many tasks admit partially correct or alternative answers that verifiers under-credit, and the resulting all-or-nothing supervision limits learning. Reward models offer richer, continuous feedback, which can serve as a complementary supervisory signal to verifiers. We introduce HERO (Hybrid Ensemble Reward Optimization), a reinforcement learning framework that integrates verifier signals with reward-model scores in a structured way. HERO employs stratified normalization to bound reward-model scores within verifier-defined groups, preserving correctness while refining quality distinctions, and variance-aware weighting to emphasize challenging prompts where dense signals matter most. Across diverse mathematical reasoning benchmarks, HERO consistently outperforms RM-only and verifier-only baselines, with strong gains on both verifiable and hard-to-verify tasks. Our results show that hybrid reward design retains the stability of verifiers while leveraging the nuance of reward models to advance reasoning.
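One way to read stratified normalization is sketched below: reward-model scores are min-max scaled separately inside the verifier-correct and verifier-incorrect groups, then mapped into two non-overlapping bands. The band width `spread`, the min-max scaling, and the function names are assumptions for illustration; the abstract does not give HERO's exact formula, and variance-aware weighting is not covered here.

```python
from typing import List, Sequence


def _min_max(values: Sequence[float]) -> List[float]:
    """Scale values to [0, 1]; a constant group maps to 0.5."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.5 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def hybrid_rewards(
    verifier: Sequence[int],
    rm_scores: Sequence[float],
    spread: float = 0.2,
) -> List[float]:
    """Combine 0/1 verifier labels with reward-model scores for one prompt.

    Incorrect samples land in [0, spread]; correct samples land in
    [1 - spread, 1]. With spread < 0.5 the two bands never overlap, so
    every verified-correct sample outranks every incorrect one.
    """
    if len(verifier) != len(rm_scores):
        raise ValueError("need one verifier label per reward-model score")
    if not 0.0 <= spread < 0.5:
        raise ValueError("spread must be in [0, 0.5)")
    rewards = [0.0] * len(verifier)
    for label in (0, 1):
        idx = [i for i, v in enumerate(verifier) if v == label]
        if not idx:
            continue
        scaled = _min_max([rm_scores[i] for i in idx])
        base = 0.0 if label == 0 else 1.0 - spread
        for i, s in zip(idx, scaled):
            rewards[i] = base + spread * s
    return rewards


# Four sampled responses to one prompt (made-up scores).
print(hybrid_rewards([1, 0, 1, 0], [2.0, -1.0, 0.5, 3.0]))
```

In this example the incorrect response with the highest raw reward-model score (3.0) still ends up below both correct responses, which is the "preserving correctness while refining quality distinctions" property the abstract describes.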
[58] LeMAJ (Legal LLM-as-a-Judge): Bridging Legal Reasoning and LLM Evaluation
Joseph Enguehard,Morgane Van Ermengem,Kate Atkinson,Sujeong Cha,Arijit Ghosh Chowdhury,Prashanth Kallur Ramaswamy,Jeremy Roghair,Hannah R Marlowe,Carina Suzana Negreanu,Kitty Boxall,Diana Mincu
Main category: cs.CL
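TL;DR: The paper proposes LeMAJ, a reference-free evaluation method that breaks legal LLM outputs into ‘Legal Data Points’ (LDPs). It outperforms baselines on legal question answering and agrees more closely with human expert evaluations.
Details
Motivation: Evaluating LLMs in the legal domain poses unique challenges. Existing approaches depend on reference data or standardized assessment methods, both of which have significant limitations for legal applications and are not reliable enough. Contribution: 1) LDPs and a reference-free evaluation methodology; 2) better results than baselines on a proprietary dataset and an open-source dataset; 3) closer agreement with human expert evaluations; 4) open-sourced LDPs for a subset of LegalBench.
Method: LLM responses are broken down into LDPs, and a reference-free evaluation method is built to reflect how lawyers evaluate legal answers.
Result: The method outperforms a variety of baselines, correlates more closely with human expert evaluations, and helps improve inter-annotator agreement.
Insight: LLM evaluation in the legal domain should follow the reasoning of practicing lawyers more closely; LDPs offer a new approach to evaluating legal question answering.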
Abstract: Evaluating large language model (LLM) outputs in the legal domain presents unique challenges due to the complex and nuanced nature of legal analysis. Current evaluation approaches either depend on reference data, which is costly to produce, or use standardized assessment methods, both of which have significant limitations for legal applications. Although LLM-as-a-Judge has emerged as a promising evaluation technique, its reliability and effectiveness in legal contexts depend heavily on evaluation processes unique to the legal industry and how trustworthy the evaluation appears to the human legal expert. This is where existing evaluation methods currently fail and exhibit considerable variability. This paper aims to close the gap: a) we break down lengthy responses into ‘Legal Data Points’ (LDPs), self-contained units of information, and introduce a novel, reference-free evaluation methodology that reflects how lawyers evaluate legal answers; b) we demonstrate that our method outperforms a variety of baselines on both our proprietary dataset and an open-source dataset (LegalBench); c) we show how our method correlates more closely with human expert evaluations and helps improve inter-annotator agreement; and finally d) we open source our Legal Data Points for a subset of LegalBench used in our experiments, allowing the research community to replicate our results and advance research in this vital area of LLM evaluation on legal question-answering.
[59] Don’t Adapt Small Language Models for Tools; Adapt Tool Schemas to the Models
Jonggeun Lee,Woojung Song,Jongwook Han,Haesung Pyun,Yohan Jo
Main category: cs.CL
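TL;DR: The paper proposes PA-Tool, a training-free method that renames tool schema components to align with a small language model's pretrained knowledge, substantially improving performance on tool-use tasks.
Details
Motivation: Small language models (SLMs) offer computational advantages for tool-augmented AI systems but suffer from schema misalignment on tool-use tasks, such as hallucinating tool names that do not exist. Forcing models to adapt to arbitrary schemas works poorly, so the authors adapt the schemas to the models' pretrained knowledge instead. Contribution: PA-Tool, which uses a pretraining-familiarity signal (peakedness) to rename tool components automatically so that they align with the model's pretrained knowledge, sharply reducing schema misalignment errors.
Method: PA-Tool borrows the peakedness signal from contamination detection: it generates multiple candidate names and selects those with the highest output concentration across samples, achieving schema alignment without additional training.
Result: On MetaTool and RoTBench, performance improves by up to 17 percentage points and schema misalignment errors fall by 80%, bringing small models close to state-of-the-art performance.
Insight: Adapting the schema rather than the model can unlock the tool-use potential of resource-constrained models while avoiding the cost of retraining.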
Abstract: Small language models (SLMs) offer significant computational advantages for tool-augmented AI systems, yet they struggle with tool-use tasks, particularly in selecting appropriate tools and identifying correct parameters. A common failure mode is schema misalignment: models hallucinate plausible but non-existent tool names that reflect naming conventions internalized during pretraining but absent from the provided tool schema. Rather than forcing models to adapt to arbitrary schemas, we propose adapting schemas to align with models’ pretrained knowledge. We introduce PA-Tool (Pretraining-Aligned Tool Schema Generation), a training-free method that leverages peakedness (a signal from contamination detection indicating pretraining familiarity) to automatically rename tool components. By generating multiple candidates and selecting those with highest output concentration across samples, PA-Tool identifies pretrain-aligned naming patterns. Experiments on MetaTool and RoTBench show improvements of up to 17 percentage points, with schema misalignment errors reduced by 80%. PA-Tool enables small models to approach state-of-the-art performance while maintaining computational efficiency for adaptation to new tools without retraining. Our work demonstrates that schema-level interventions can unlock the tool-use potential of resource-efficient models by adapting schemas to models rather than models to schemas.
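The candidate-selection idea can be sketched as follows. The scoring here is an assumption: peakedness is approximated as the share of sampled names that are near-duplicates of a candidate, using `difflib.SequenceMatcher` as the similarity measure. The threshold, the tie-break rule, and the sample names are hypothetical and are not PA-Tool's published procedure.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def peakedness(candidate: str, samples: List[str], threshold: float = 0.8) -> float:
    """Share of samples that are near-duplicates of the candidate."""
    if not samples:
        return 0.0
    close = sum(1 for s in samples if similarity(candidate, s) >= threshold)
    return close / len(samples)


def pick_name(samples: List[str], threshold: float = 0.8) -> Tuple[str, float]:
    """Return the sampled name with the most concentrated neighbourhood.

    Ties on peakedness are broken by how often the exact string was sampled.
    """
    if not samples:
        raise ValueError("need at least one sampled name")
    best = max(
        set(samples),
        key=lambda c: (peakedness(c, samples, threshold), samples.count(c)),
    )
    return best, peakedness(best, samples, threshold)


# Hypothetical names a model might propose for a weather tool.
sampled_names = [
    "get_weather", "get_weather", "get_weather",
    "getWeather", "fetch_forecast", "weather_lookup",
]
print(pick_name(sampled_names))
```

In this toy sample the model keeps converging on one spelling, so that spelling would be adopted as the schema name; a flat distribution over many unrelated names would signal that no candidate is strongly familiar.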
[60] Online Rubrics Elicitation from Pairwise Comparisons
MohammadHossein Rezaei,Robert Vacareanu,Zihao Wang,Clinton Wang,Yunzhong He,Afra Feyza Akyürek
Main category: cs.CL
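TL;DR: The paper proposes OnlineRubrics, an online method that generates evaluation rubrics dynamically by comparing responses from the current and reference policies. It avoids the limitations of static rubrics during training and improves LLM performance.
Details
Motivation: Static rubrics are vulnerable to reward hacking during training and cannot capture desiderata that emerge as training proceeds. The paper addresses this with dynamically generated rubrics. Contribution: 1. OnlineRubrics, a method that curates evaluation criteria dynamically; 2. improvements of up to 8% across several evaluation datasets; 3. a qualitative analysis of prominent themes in the elicited criteria (such as transparency and practicality).
Method: Responses from the current and reference policies are compared pairwise in an online manner to generate evaluation criteria, so that errors can be identified and mitigated continuously during training.
Result: On AlpacaEval, GPQA, ArenaHard, and validation sets of expert questions and rubrics, the method yields improvements of up to 8% over training with static rubrics alone.
Insight: Dynamic rubrics adapt better to the needs that arise during training and improve model performance; themes such as transparency, practicality, organization, and reasoning feature prominently in the elicited criteria.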
Abstract: Rubrics provide a flexible way to train LLMs on open-ended long-form answers where verifiable rewards are not applicable and human preferences provide coarse signals. Prior work shows that reinforcement learning with rubric-based rewards leads to consistent gains in LLM post-training. Most existing approaches rely on rubrics that remain static over the course of training. Such static rubrics, however, are vulnerable to reward-hacking type behaviors and fail to capture emergent desiderata that arise during training. We introduce Online Rubrics Elicitation (OnlineRubrics), a method that dynamically curates evaluation criteria in an online manner through pairwise comparisons of responses from current and reference policies. This online process enables continuous identification and mitigation of errors as training proceeds. Empirically, this approach yields consistent improvements of up to 8% over training exclusively with static rubrics across AlpacaEval, GPQA, ArenaHard as well as the validation sets of expert questions and rubrics. We qualitatively analyze the elicited criteria and identify prominent themes such as transparency, practicality, organization, and reasoning.
[61] On the Convergence of Moral Self-Correction in Large Language Models
Guangliang Liu,Haitao Mao,Bochuan Cao,Zhiyu Xue,Xitong Zhang,Rongrong Wang,Kristen Marie Johnson
Main category: cs.CL
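TL;DR: Large language models (LLMs) can improve their responses through self-correction, and moral self-correction in particular shows performance convergence. The paper explains the mechanism behind this convergence over multi-round interactions: consistently injected self-correction instructions activate moral concepts, which reduces model uncertainty.
Details
Motivation: To understand how LLMs improve response quality from internal knowledge alone when instructions give no specific details about the problem (intrinsic self-correction), with a focus on the mechanism of moral self-correction. Contribution: The paper identifies performance convergence as a key characteristic of moral self-correction and provides a mechanistic analysis of this behavior. Experiments show that consistently injected self-correction instructions reduce model uncertainty.
Method: Multi-round interaction experiments analyze LLM behavior during moral self-correction and examine how instructions activate moral concepts and lead to converged performance.
Result: Moral self-correction exhibits performance convergence because the moral concepts activated by the repeated instructions stabilize over successive rounds, reducing uncertainty.
Insight: Consistently injected self-correction instructions steer LLM behavior in the moral domain toward a stable state; this mechanism may extend to self-correction research in other domains.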
Abstract: Large Language Models (LLMs) are able to improve their responses when instructed to do so, a capability known as self-correction. When instructions provide only a general and abstract goal without specific details about potential issues in the response, LLMs must rely on their internal knowledge to improve response quality, a process referred to as intrinsic self-correction. The empirical success of intrinsic self-correction is evident in various applications, but how and why it is effective remains unknown. Focusing on moral self-correction in LLMs, we reveal a key characteristic of intrinsic self-correction: performance convergence through multi-round interactions; and provide a mechanistic analysis of this convergence behavior. Based on our experimental results and analysis, we uncover the underlying mechanism of convergence: consistently injected self-correction instructions activate moral concepts that reduce model uncertainty, leading to converged performance as the activated moral concepts stabilize over successive rounds. This paper demonstrates the strong potential of moral self-correction by showing that it exhibits a desirable property of converged performance.
[62] Think Natively: Unlocking Multilingual Reasoning with Consistency-Enhanced Reinforcement Learning
Xue Zhang,Yunlong Liang,Fandong Meng,Songming Zhang,Kaiyu Huang,Yufeng Chen,Jinan Xu,Jie Zhou
Main category: cs.CL
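TL;DR: The paper proposes M-Thinker, trained with the GRPO algorithm using a language consistency reward and a cross-lingual thinking alignment reward, to address input-output language inconsistency and weak reasoning in large reasoning models on multilingual tasks.
Details
Motivation: Current large reasoning models often fail to keep the output language consistent with the input language and reason less accurately in non-English languages, which degrades the user experience and hinders global deployment. Contribution: M-Thinker, trained with GRPO, together with a Language Consistency reward and a novel Cross-lingual Thinking Alignment reward, which substantially improve multilingual performance.
Method: GRPO training combines a Language Consistency (LC) reward with a Cross-lingual Thinking Alignment (CTA) reward, and an iterative RL procedure optimizes the model on multilingual tasks.
Result: The M-Thinker-1.5B/7B models perform strongly on the MMATH and PolyMath benchmarks, achieve nearly 100% language consistency, and generalize well to out-of-domain languages.
Insight: Optimizing language consistency and cross-lingual reasoning with reinforcement learning improves multilingual performance and makes global deployment more feasible.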
Abstract: Large Reasoning Models (LRMs) have achieved remarkable performance on complex reasoning tasks by adopting the “think-then-answer” paradigm, which enhances both accuracy and interpretability. However, current LRMs exhibit two critical limitations when processing non-English languages: (1) They often struggle to maintain input-output language consistency; (2) They generally perform poorly with wrong reasoning paths and lower answer accuracy compared to English. These limitations significantly degrade the user experience for non-English speakers and hinder the global deployment of LRMs. To address these limitations, we propose M-Thinker, which is trained by the GRPO algorithm that involves a Language Consistency (LC) reward and a novel Cross-lingual Thinking Alignment (CTA) reward. Specifically, the LC reward defines a strict constraint on the language consistency between the input, thought, and answer. Besides, the CTA reward compares the model’s non-English reasoning paths with its English reasoning path to transfer its own reasoning capability from English to non-English languages. Through an iterative RL procedure, our M-Thinker-1.5B/7B models not only achieve nearly 100% language consistency and superior performance on two multilingual benchmarks (MMATH and PolyMath), but also exhibit excellent generalization on out-of-domain languages.
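A language-consistency check of the kind the LC reward describes can be sketched with a coarse proxy. The sketch below compares the dominant Unicode script of the input, thought, and answer; this is an assumption made for illustration, since script is not language (it cannot tell French from English), and the reward values are hypothetical. A real setup would likely use a proper language identifier.

```python
import unicodedata
from collections import Counter


def dominant_script(text: str) -> str:
    """Return a coarse script label such as 'LATIN', 'CJK', or 'CYRILLIC'."""
    counts = Counter()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                # The first word of the Unicode name identifies the script.
                counts[name.split(" ")[0]] += 1
    return counts.most_common(1)[0][0] if counts else "UNKNOWN"


def lc_reward(question: str, thought: str, answer: str, penalty: float = -1.0) -> float:
    """Return 0.0 when all three parts share a script, else the penalty.

    Parts without letters (for example a bare number) count as 'UNKNOWN'
    and therefore as a mismatch in this simplified version.
    """
    scripts = {dominant_script(part) for part in (question, thought, answer)}
    return 0.0 if len(scripts) == 1 else penalty


print(lc_reward("什么是二加二？", "二加二等于四。", "四"))
print(lc_reward("什么是二加二？", "Two plus two is four.", "四"))
```

The second call is penalized because the reasoning switches to English even though the question and answer are in Chinese, which is the input-thought-answer mismatch the abstract targets.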
[63] Agent Bain vs. Agent McKinsey: A New Text-to-SQL Benchmark for the Business Domain
Yue Li,Ran Tao,Derek Hommel,Yusuf Denizay Dönder,Sungyong Chang,David Mimno,Unso Eun Seo Jo
Main category: cs.CL
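TL;DR: CORGI is a new text-to-SQL benchmark focused on real-world business scenarios. It covers descriptive, explanatory, predictive, and recommendational questions and exposes the weaknesses of LLMs on high-level business queries.
Details
Motivation: Existing text-to-SQL benchmarks focus on retrieving past records and do not reflect the complex queries of the business domain, especially those that require causal reasoning, temporal forecasting, and multi-step decision making. Contribution: The CORGI benchmark, built on synthetic databases inspired by enterprises such as DoorDash and Airbnb, with four categories of increasingly complex business queries, plus a public dataset and evaluation framework.
Method: The authors design synthetic databases and four categories of questions (descriptive, explanatory, predictive, and recommendational) and evaluate LLMs on them.
Result: LLM performance drops on high-level questions such as prediction and recommendation. Based on execution success rate, CORGI is about 21% more difficult than the BIRD benchmark.
Insight: Current LLMs still need to improve at multi-step reasoning and decision making in real business settings; the results highlight the gap between LLM capabilities and the needs of business intelligence.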
Abstract: In the business domain, where data-driven decision making is crucial, text-to-SQL is fundamental for easy natural language access to structured data. While recent LLMs have achieved strong performance in code generation, existing text-to-SQL benchmarks remain focused on factual retrieval of past records. We introduce CORGI, a new benchmark specifically designed for real-world business contexts. CORGI is composed of synthetic databases inspired by enterprises such as Doordash, Airbnb, and Lululemon. It provides questions across four increasingly complex categories of business queries: descriptive, explanatory, predictive, and recommendational. This challenge calls for causal reasoning, temporal forecasting, and strategic recommendation, reflecting multi-level and multi-step agentic intelligence. We find that LLM performance drops on high-level questions, struggling to make accurate predictions and offer actionable plans. Based on execution success rate, the CORGI benchmark is about 21% more difficult than the BIRD benchmark. This highlights the gap between popular LLMs and the need for real-world business intelligence. We release a public dataset and evaluation framework, and a website for public submissions.
[64] Vibe Checker: Aligning Code Evaluation with Human Preference
Ming Zhong,Xiang Zhou,Ting-Yun Chang,Qingze Wang,Nan Xu,Xiance Si,Dan Garrette,Shyam Upadhyay,Jeremiah Liu,Jiawei Han,Benoit Schillings,Jiao Sun
Main category: cs.CL
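TL;DR: The paper presents Vibe Checker, which combines functional correctness with code instruction following to quantify how well LLMs align with human coding preferences, and shows that instruction following is a key factor in user experience.
Details
Motivation: Current code evaluation (for example pass@k) captures only functional correctness and overlooks non-functional human preferences such as readability and intent preservation. The paper aims to fill this gap by quantifying how well LLMs follow instructions in code generation. Contribution: 1. VeriCode, a taxonomy of 30 verifiable code instructions with corresponding deterministic verifiers; 2. Vibe Checker, a testbed that assesses both functional correctness and instruction following; 3. the finding that instruction following is the primary differentiator between LLMs on real-world tasks.
Method: 1. Build the VeriCode taxonomy, defining 30 code instructions and their verifiers; 2. combine instruction verification with established functional test suites to form Vibe Checker; 3. evaluate 31 leading LLMs and analyze their functional regression and instruction following.
Result: Even the strongest LLMs struggle to comply with multiple instructions at once and show clear functional regression. A composite score of functional correctness and instruction following correlates best with human preference, with instruction following as the primary differentiator.
Insight: 1. Functional correctness alone is not enough to assess the quality of generated code; 2. instruction following is a core part of the user experience; 3. Vibe Checker offers a concrete path toward models that better match human preferences.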
Abstract: Large Language Models (LLMs) have catalyzed vibe coding, where users leverage LLMs to generate and iteratively refine code through natural language interactions until it passes their vibe check. Vibe check is tied to real-world human preference and goes beyond functionality: the solution should feel right, read cleanly, preserve intent, and remain correct. However, current code evaluation remains anchored to pass@k and captures only functional correctness, overlooking the non-functional instructions that users routinely apply. In this paper, we hypothesize that instruction following is the missing piece underlying vibe check that represents human preference in coding besides functional correctness. To quantify models’ code instruction following capabilities with measurable signals, we present VeriCode, a taxonomy of 30 verifiable code instructions together with corresponding deterministic verifiers. We use the taxonomy to augment established evaluation suites, resulting in Vibe Checker, a testbed to assess both code instruction following and functional correctness. Upon evaluating 31 leading LLMs, we show that even the strongest models struggle to comply with multiple instructions and exhibit clear functional regression. Most importantly, a composite score of functional correctness and instruction following correlates the best with human preference, with the latter emerging as the primary differentiator on real-world programming tasks. Our work identifies core factors of the vibe check, providing a concrete path for benchmarking and developing models that better align with user preferences in coding.
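A deterministic verifier of the sort VeriCode describes can be illustrated with two simple checks. The two instructions below (a line-length limit and mandatory docstrings) and the equally weighted composite are hypothetical examples; the abstract does not list the 30 instructions or say how the composite score is computed.

```python
import ast
from typing import Callable, List


def check_max_line_length(source: str, limit: int = 79) -> bool:
    """Instruction: no line may exceed `limit` characters."""
    return all(len(line) <= limit for line in source.splitlines())


def check_functions_have_docstrings(source: str) -> bool:
    """Instruction: every function definition must carry a docstring."""
    tree = ast.parse(source)
    funcs = [
        node for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    ]
    return all(ast.get_docstring(f) is not None for f in funcs)


def instruction_following_score(
    source: str, verifiers: List[Callable[[str], bool]]
) -> float:
    """Fraction of instructions the code satisfies."""
    results = [v(source) for v in verifiers]
    return sum(results) / len(results) if results else 1.0


def composite_score(functional: float, instruction: float, weight: float = 0.5) -> float:
    """Weighted mix of functional correctness and instruction following."""
    return weight * functional + (1.0 - weight) * instruction


SAMPLE = '''def add(a, b):
    """Return the sum of a and b."""
    return a + b


def sub(a, b):
    return a - b
'''

VERIFIERS = [check_max_line_length, check_functions_have_docstrings]
if_score = instruction_following_score(SAMPLE, VERIFIERS)
print(if_score, composite_score(functional=1.0, instruction=if_score))
```

The sample passes its (assumed) unit tests and the line-length check but misses a docstring on `sub`, so a pass@k-style metric would rate it as perfect while the composite score would not.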
[65] Artificial Hippocampus Networks for Efficient Long-Context Modeling
Yunhao Fang,Weihao Yu,Shu Zhong,Qinghao Ye,Xuehan Xiong,Lai Wei
Main category: cs.CL
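TL;DR: The paper proposes a memory framework that combines the strengths of RNNs and Transformers. It introduces the Artificial Hippocampus Network (AHN) for efficient long-sequence modeling, substantially reducing compute and memory requirements.
Details
Motivation: Long-sequence modeling faces a trade-off between the efficiency of the fixed-size compressive memory in RNN-like models and the fidelity of the lossless, growing memory in Transformers. Contribution: The Artificial Hippocampus Network (AHN), which pairs a lossless sliding-window short-term memory with a compressed, fixed-size long-term memory to make long-sequence modeling more efficient.
Method: A sliding window of the Transformer's KV cache serves as lossless short-term memory, while the AHN recurrently compresses out-of-window information into a fixed-size long-term memory. AHNs are instantiated with modern RNN-like architectures such as Mamba2, DeltaNet, and Gated DeltaNet.
Result: On the LV-Eval and InfiniteBench benchmarks, AHN-augmented models consistently outperform sliding-window baselines and reach performance comparable or even superior to full-attention models, with much lower compute and memory requirements.
Insight: The AHN framework offers an efficient, lightweight approach to long-sequence modeling and notably reduces resource consumption at inference.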
Abstract: Long-sequence modeling faces a fundamental trade-off between the efficiency of compressive fixed-size memory in RNN-like models and the fidelity of lossless growing memory in attention-based Transformers. Inspired by the Multi-Store Model in cognitive science, we introduce a memory framework of artificial neural networks. Our method maintains a sliding window of the Transformer’s KV cache as lossless short-term memory, while a learnable module termed Artificial Hippocampus Network (AHN) recurrently compresses out-of-window information into a fixed-size compact long-term memory. To validate this framework, we instantiate AHNs using modern RNN-like architectures, including Mamba2, DeltaNet, and Gated DeltaNet. Extensive experiments on long-context benchmarks LV-Eval and InfiniteBench demonstrate that AHN-augmented models consistently outperform sliding window baselines and achieve performance comparable or even superior to full-attention models, while substantially reducing computational and memory requirements. For instance, augmenting the Qwen2.5-3B-Instruct with AHNs reduces inference FLOPs by 40.5% and memory cache by 74.0%, while improving its average score on LV-Eval (128k sequence length) from 4.41 to 5.88. Code is available at: https://github.com/ByteDance-Seed/AHN.
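The two-tier memory described above can be sketched as a data structure. This is a toy under stated assumptions: an exponential moving average stands in for the learned AHN module (the paper uses Mamba2, DeltaNet, or Gated DeltaNet), plain float vectors stand in for KV-cache entries, and the class and parameter names are hypothetical.

```python
from collections import deque
from typing import Deque, List


class WindowWithCompressedMemory:
    """Lossless sliding window plus a fixed-size recurrent summary."""

    def __init__(self, window: int, dim: int, decay: float = 0.9):
        self.max_window = window
        self.decay = decay
        self.window: Deque[List[float]] = deque()
        self.memory: List[float] = [0.0] * dim  # size never changes

    def append(self, vec: List[float]) -> None:
        if len(vec) != len(self.memory):
            raise ValueError("vector has the wrong dimension")
        self.window.append(list(vec))
        if len(self.window) > self.max_window:
            evicted = self.window.popleft()
            # Recurrent update: fold the evicted entry into the summary.
            # (Stand-in for the learned compression module.)
            self.memory = [
                self.decay * m + (1.0 - self.decay) * e
                for m, e in zip(self.memory, evicted)
            ]

    def context(self) -> List[List[float]]:
        """What attention would see: one summary slot plus the window."""
        return [self.memory] + list(self.window)


store = WindowWithCompressedMemory(window=4, dim=2)
for t in range(10):
    store.append([float(t), 1.0])

print(len(store.context()), store.memory)
```

However long the sequence grows, the context stays at `window + 1` entries, which is where the cache savings come from; the cost is that out-of-window information survives only in compressed form.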
cs.CV [Back]
[66] Milestone Determination for Autonomous Railway Operation
Josh Hunter,John McDermid,Simon Burton,Poppy Fynes,Mia Dempster
Main category: cs.CV
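TL;DR: The paper proposes a milestone-determination approach to computer vision for railway automation. It generates contextually relevant sequential data and simplifies the recognition of dynamic components by focusing on critical decision points.
Details
Motivation: Computer vision systems for railway automation suffer from a shortage of high-quality sequential data: traditional datasets lack spatiotemporal context, and alternative solutions raise issues of realism and applicability. Contribution: Using milestone determination, the paper develops targeted, rule-based models that avoid generalized recognition of dynamic components and focus on the critical decision points along a route, with the aim of safer and more efficient railway automation systems.
Method: The approach generates data from route-specific contextual cues to build rich sequential datasets, and uses milestone determination to simplify learning by removing the need for generalized recognition of dynamic content.
Result: The approach provides a practical framework for railway automation in which vision agents can be trained in controlled, predictable environments.
Insight: Focusing on critical decision points rather than generalized recognition of dynamic content can simplify the training of computer vision systems and improve their practicality and reliability in railway automation.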
Abstract: In the field of railway automation, one of the key challenges has been the development of effective computer vision systems due to the limited availability of high-quality, sequential data. Traditional datasets are restricted in scope, lacking the spatiotemporal context necessary for real-time decision-making, while alternative solutions introduce issues related to realism and applicability. By focusing on route-specific, contextually relevant cues, we can generate rich, sequential datasets that align more closely with real-world operational logic. The concept of milestone determination allows for the development of targeted, rule-based models that simplify the learning process by eliminating the need for generalized recognition of dynamic components, focusing instead on the critical decision points along a route. We argue that this approach provides a practical framework for training vision agents in controlled, predictable environments, facilitating safer and more efficient machine learning systems for railway automation.
[67] CML-Bench: A Framework for Evaluating and Enhancing LLM-Powered Movie Scripts Generation
Mingzhe Zheng,Dingjie Song,Guanyu Zhou,Jun You,Jiahao Zhan,Xuran Ma,Xinyuan Song,Ser-Nam Lim,Qifeng Chen,Harry Yang
Main category: cs.CV
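TL;DR: The paper proposes CML-Bench, a framework for evaluating and improving LLM-generated movie scripts along three dimensions: dialogue coherence, character consistency, and plot reasonableness. It also introduces the CML-Instruction prompting strategy, which improves the scripts LLMs generate.
Details
Motivation: LLMs are good at generating structured text, but movie scripts demand nuanced storytelling and emotional depth that LLMs currently struggle to capture. Contribution: 1) CML-Dataset, containing (summary, content) pairs drawn from high-quality movie scripts; 2) the CML-Bench evaluation framework, which defines the DC, CC, and PR quality dimensions; 3) the CML-Instruction prompting strategy, which improves the quality of LLM-generated scripts.
Method: The authors analyze multi-shot continuity and narrative structure in CML-Dataset scripts, define three quality dimensions, and develop quantitative metrics for CML-Bench. CML-Instruction then gives LLMs detailed guidance on character dialogue and event logic.
Result: CML-Bench assigns high scores to well-crafted human-written scripts and pinpoints the weaknesses of LLM-generated ones, while CML-Instruction yields higher-quality scripts whose results align with human preferences.
Insight: 1) Script generation must attend to storytelling and emotional depth; 2) structured prompting strategies such as CML-Instruction can improve LLM output; 3) a quantitative evaluation framework helps improve generated content systematically.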
Abstract: Large Language Models (LLMs) have demonstrated remarkable proficiency in generating highly structured texts. However, while exhibiting a high degree of structural organization, movie scripts demand an additional layer of nuanced storytelling and emotional depth, the ‘soul’ of compelling cinema, that LLMs often fail to capture. To investigate this deficiency, we first curated CML-Dataset, a dataset comprising (summary, content) pairs for Cinematic Markup Language (CML), where ‘content’ consists of segments from esteemed, high-quality movie scripts and ‘summary’ is a concise description of the content. Through an in-depth analysis of the intrinsic multi-shot continuity and narrative structures within these authentic scripts, we identified three pivotal dimensions for quality assessment: Dialogue Coherence (DC), Character Consistency (CC), and Plot Reasonableness (PR). Informed by these findings, we propose the CML-Bench, featuring quantitative metrics across these dimensions. CML-Bench effectively assigns high scores to well-crafted, human-written scripts while concurrently pinpointing the weaknesses in screenplays generated by LLMs. To further validate our benchmark, we introduce CML-Instruction, a prompting strategy with detailed instructions on character dialogue and event logic, to guide LLMs to generate more structured and cinematically sound scripts. Extensive experiments validate the effectiveness of our benchmark and demonstrate that LLMs guided by CML-Instruction generate higher-quality screenplays, with results aligned with human preferences.
[68] User to Video: A Model for Spammer Detection Inspired by Video Classification Technology
Haoyang Zhang,Zhou Yang,Yucai Pang
Main category: cs.CV
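TL;DR: The paper proposes UVSD, a spammer detection model inspired by video classification. It treats user behavior subspaces as frame images, assembles them into user behavior videos, and applies a video classification algorithm for detection.
Details
Motivation: Inspired by video classification technology, the authors treat sequences of user behavior as video frames and design a new spammer detection method on that basis. Contribution: 1. The user2piexl algorithm, which quantifies users as pixels; 2. the behavior2image algorithm, which converts user behavior subspaces into frame images; 3. spammer identification with a video classification algorithm.
Method: 1. User pixelization (user2piexl); 2. conversion of behavior subspaces into images (behavior2image); 3. construction of user behavior videos, followed by a video classification algorithm.
Result: On the WEIBO and TWITTER datasets, the UVSD model outperforms existing methods.
Insight: Modeling user behavior sequences as video frames is a novel and effective approach to spammer detection.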
Abstract: This article is inspired by video classification technology: if a user behavior subspace is viewed as a frame image, then consecutive frame images can be viewed as a video. Following this idea, we propose UVSD, a spammer detection model based on user videoization. First, a user2piexl algorithm for user pixelization is proposed: considering the adversarial behavior of user stances, each user is viewed as a pixel and the stance is quantified as the pixel’s RGB value. Second, a behavior2image algorithm is proposed for transforming the user behavior subspace into frame images: low-rank dense vectorization of subspace user relations is performed using representation learning, and cutting and diffusion algorithms are introduced to complete the conversion into frame images. Finally, user behavior videos are constructed based on temporal features, and a video classification algorithm is applied to identify spammers. Experiments on the publicly available WEIBO and TWITTER datasets show that UVSD outperforms state-of-the-art methods.
[69] Uncertainty Quantification In Surface Landmines and UXO Classification Using MC Dropout
Sagar Lekhak,Emmett J. Ientilucci,Dimah Dera,Susmita Ghosh
Main category: cs.CV
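TL;DR: The paper proposes a deep learning model based on MC Dropout to quantify uncertainty in the classification of surface landmines and unexploded ordnance (UXO). With MC Dropout integrated into a fine-tuned ResNet-50, the study assesses prediction reliability under adversarial perturbations and noise.
Details
Motivation: Deterministic neural networks for landmine and UXO classification are vulnerable to noise and adversarial attacks, which can cause missed detections or misclassification. Quantifying uncertainty is therefore needed to make models more reliable and robust. Contribution: 1) MC Dropout applied to surface landmine and UXO classification to quantify epistemic uncertainty; 2) evidence of how vulnerable existing neural networks are to adversarial threats; 3) a proof-of-concept model showing that uncertainty quantification is useful under challenging conditions.
Method: 1) A fine-tuned ResNet-50 serves as the base architecture; 2) MC Dropout is integrated to produce predictive uncertainty; 3) the model is tested on clean, adversarially perturbed, and noisy images from a simulated dataset.
Result: MC Dropout quantifies uncertainty and flags unreliable predictions under adversarial and noisy conditions, giving demining operations an additional basis for decisions.
Insight: The study underlines the importance of uncertainty quantification in demining and lays groundwork for more robust models. It also shows the threat adversarial attacks pose to existing models and calls for further work on reliability in practical settings.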
Abstract: Detecting surface landmines and unexploded ordnances (UXOs) using deep learning has shown promise in humanitarian demining. However, deterministic neural networks can be vulnerable to noisy conditions and adversarial attacks, leading to missed detection or misclassification. This study introduces the idea of uncertainty quantification through Monte Carlo (MC) Dropout, integrated into a fine-tuned ResNet-50 architecture for surface landmine and UXO classification, which was tested on a simulated dataset. Integrating the MC Dropout approach helps quantify epistemic uncertainty, providing an additional metric for prediction reliability, which could be helpful to make more informed decisions in demining operations. Experimental results on clean, adversarially perturbed, and noisy test images demonstrate the model’s ability to flag unreliable predictions under challenging conditions. This proof-of-concept study highlights the need for uncertainty quantification in demining, raises awareness about the vulnerability of existing neural networks in demining to adversarial threats, and emphasizes the importance of developing more robust and reliable models for practical applications.
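The MC Dropout procedure can be sketched without a deep learning framework. The example below keeps dropout active at inference, averages the class probabilities over many stochastic passes, and reports the entropy of the averaged prediction. The tiny linear classifier, its weights, and the feature vector are hypothetical stand-ins for the final layer of a fine-tuned ResNet-50; note also that predictive entropy is only one common uncertainty summary and mixes epistemic and aleatoric effects.

```python
import math
import random
from typing import List, Tuple


def softmax(logits: List[float]) -> List[float]:
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def forward_with_dropout(
    features: List[float],
    weights: List[List[float]],
    p_drop: float,
    rng: random.Random,
) -> List[float]:
    """One stochastic pass: drop each feature with probability p_drop."""
    keep = 1.0 - p_drop
    # Inverted dropout: surviving features are rescaled by 1 / keep.
    masked = [f / keep if rng.random() < keep else 0.0 for f in features]
    logits = [sum(w * x for w, x in zip(row, masked)) for row in weights]
    return softmax(logits)


def mc_dropout_predict(
    features: List[float],
    weights: List[List[float]],
    p_drop: float = 0.5,
    passes: int = 100,
    seed: int = 0,
) -> Tuple[List[float], float]:
    """Return mean class probabilities and their predictive entropy."""
    rng = random.Random(seed)
    mean = [0.0] * len(weights)
    for _ in range(passes):
        probs = forward_with_dropout(features, weights, p_drop, rng)
        mean = [m + p / passes for m, p in zip(mean, probs)]
    entropy = -sum(p * math.log(p) for p in mean if p > 0.0)
    return mean, entropy


# Made-up weights for a 3-class head over 4 features.
W = [[1.2, -0.4, 0.3, 0.0], [-0.5, 0.9, 0.1, 0.2], [0.0, 0.1, -0.7, 1.1]]
x = [0.8, 0.1, 0.5, 0.3]

mean_probs, uncertainty = mc_dropout_predict(x, W)
print(mean_probs, uncertainty)
```

A deployment could compare the entropy against a threshold chosen on validation data and route high-uncertainty images to a human operator, which matches the "flag unreliable predictions" use described in the abstract.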
[70] multimodars: A Rust-powered toolkit for multi-modality cardiac image fusion and registration
Anselm W. Stark,Marc Ilic,Ali Mokhtari,Pooya Mohammadi Kazaj,Christoph Graeni,Isaac Shiri
Main category: cs.CV
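TL;DR: multimodars is a Rust-based toolkit for fusing and registering multi-modality cardiac images, designed to address the shortcomings of existing tools in determinism, performance, and flexibility.
Details
Motivation: Cardiac imaging modalities are complementary (for example, high-resolution intravascular imaging versus CCTA, which supplies whole-vessel geometry), yet no open, flexible toolkit supports multi-state analysis while offering high performance and deterministic behaviour. Contribution: multimodars fills this gap with deterministic alignment algorithms, a compact NumPy-centred data model, and an optimised Rust backend suited to scalable, reproducible experiments.
Method: The toolkit uses Rust as its backend for high-performance processing, accepts CSV/NumPy inputs, and is compatible with data formats produced by the AIVUS-CAA software.
Result: multimodars provides efficient multi-modality image fusion and registration and is easy to integrate into existing pipelines.
Insight: Pairing a high-performance Rust implementation with a flexible NumPy data model shows how multi-modality medical image processing can balance performance and usability.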
Abstract: Combining complementary imaging modalities is critical to build reliable 3D coronary models: intravascular imaging gives sub-millimetre resolution but limited whole-vessel context, while CCTA supplies 3D geometry but suffers from limited spatial resolution and artefacts (e.g., blooming). Prior work demonstrated intravascular/CCTA fusion, yet no open, flexible toolkit is tailored for multi-state analysis (rest/stress, pre-/post-stenting) while offering deterministic behaviour, high performance, and easy pipeline integration. multimodars addresses this gap with deterministic alignment algorithms, a compact NumPy-centred data model, and an optimised Rust backend suitable for scalable, reproducible experiments. The package accepts CSV/NumPy inputs including data formats produced by the AIVUS-CAA software.
[71] Does Physics Knowledge Emerge in Frontier Models?
Ieva Bagdonaviciute,Vibhav Vineet
Main category: cs.CV
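TL;DR: Frontier vision-language models (VLMs) perform well at visual perception and general reasoning, but how well they understand physical dynamics is unclear. The paper benchmarks six frontier VLMs on three physical simulation datasets (CLEVRER, Physion, and Physion++) and finds that perception and physics-reasoning skills correlate only weakly with predictive and counterfactual accuracy, exposing a limitation of current models.
Details
Motivation: To investigate whether frontier VLMs understand physical dynamics, and how perception and physics-reasoning skills relate to performance on physical prediction tasks. Contribution: (1) A systematic benchmark of six frontier VLMs on three physical simulation datasets; (2) diagnostic subtests that isolate perception from physics reasoning; (3) the finding that diagnostic performance correlates only weakly with evaluation accuracy.
Method: (1) Evaluate on CLEVRER, Physion, and Physion++; (2) design diagnostic subtests for perception (objects, colors, occluders) and for physics reasoning (motion prediction, spatial relations); (3) analyse how performance on the subtests correlates with evaluation accuracy.
Result: Models that do well on perception or physics-reasoning subtests do not consistently do better on predictive or counterfactual evaluation; the two kinds of skill fail to combine into causal understanding.
Insight: Perceptual and physics skills in frontier VLMs remain fragmented, which points to a need for architectures that couple perception and reasoning more tightly.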
Abstract: Leading Vision-Language Models (VLMs) show strong results in visual perception and general reasoning, but their ability to understand and predict physical dynamics remains unclear. We benchmark six frontier VLMs on three physical simulation datasets - CLEVRER, Physion, and Physion++ - where the evaluation tasks test whether a model can predict outcomes or hypothesize about alternative situations. To probe deeper, we design diagnostic subtests that isolate perception (objects, colors, occluders) from physics reasoning (motion prediction, spatial relations). Intuitively, stronger diagnostic performance should support higher evaluation accuracy. Yet our analysis reveals weak correlations: models that excel at perception or physics reasoning do not consistently perform better on predictive or counterfactual evaluation. This counterintuitive gap exposes a central limitation of current VLMs: perceptual and physics skills remain fragmented and fail to combine into causal understanding, underscoring the need for architectures that bind perception and reasoning more tightly.
[72] Enhanced Self-Distillation Framework for Efficient Spiking Neural Network Training
Xiaochen Zhao,Chengting Yu,Kairong Yu,Lei Liu,Aili Wang
Main category: cs.CV
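TL;DR: The paper proposes an enhanced self-distillation framework for efficient training of spiking neural networks (SNNs). Jointly optimising rate-based backpropagation with self-distillation reduces training complexity while improving performance.
Details
Motivation: Conventional SNN training based on surrogate gradients and BPTT lags behind artificial neural networks (ANNs) in performance, and its compute and memory costs grow linearly with the temporal dimension. The paper aims to improve both the efficiency and the performance of SNN training. Contribution: (1) An enhanced self-distillation framework that projects the firing rates of intermediate SNN layers onto lightweight ANN branches and uses high-quality self-generated knowledge to optimise substructures; (2) a decoupling of the teacher signal into reliable and unreliable components, with only the reliable part guiding optimisation.
Method: (1) Jointly optimise rate-based backpropagation and self-distillation; (2) project intermediate-layer firing rates onto lightweight ANN branches; (3) decouple the teacher signal and keep only the reliable component.
Result: Experiments on CIFAR-10, CIFAR-100, CIFAR10-DVS, and ImageNet show reduced training complexity together with high-performance SNN training.
Insight: Low-quality self-generated knowledge can hinder convergence, so restricting guidance to the reliable part of the teacher signal improves training.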
Abstract: Spiking Neural Networks (SNNs) exhibit exceptional energy efficiency on neuromorphic hardware due to their sparse activation patterns. However, conventional training methods based on surrogate gradients and Backpropagation Through Time (BPTT) not only lag behind Artificial Neural Networks (ANNs) in performance, but also incur significant computational and memory overheads that grow linearly with the temporal dimension. To enable high-performance SNN training under limited computational resources, we propose an enhanced self-distillation framework, jointly optimized with rate-based backpropagation. Specifically, the firing rates of intermediate SNN layers are projected onto lightweight ANN branches, and high-quality knowledge generated by the model itself is used to optimize substructures through the ANN pathways. Unlike traditional self-distillation paradigms, we observe that low-quality self-generated knowledge may hinder convergence. To address this, we decouple the teacher signal into reliable and unreliable components, ensuring that only reliable knowledge is used to guide the optimization of the model. Extensive experiments on CIFAR-10, CIFAR-100, CIFAR10-DVS, and ImageNet demonstrate that our method reduces training complexity while achieving high-performance SNN training. Our code is available at https://github.com/Intelli-Chip-Lab/enhanced-self-distillation-framework-for-snn.
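The idea of guiding the student only with reliable teacher knowledge can be illustrated as follows. The abstract does not say how reliability is determined, so this sketch assumes a simple rule: a teacher output counts as reliable when its top class matches the ground-truth label. The function names and the KL-divergence loss are likewise assumptions for illustration.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def reliable_distillation_loss(teacher_probs, student_probs, labels):
    """Average KL(teacher || student) over samples whose teacher signal is reliable.

    Assumed reliability rule: the teacher's argmax equals the true label.
    Unreliable samples contribute nothing to the distillation loss.
    """
    losses = []
    for t, s, y in zip(teacher_probs, student_probs, labels):
        teacher_pred = max(range(len(t)), key=t.__getitem__)
        if teacher_pred == y:
            losses.append(kl_divergence(t, s))
    return sum(losses) / len(losses) if losses else 0.0
```

In a full training loop this term would be added to the usual task loss, so that samples with an unreliable teacher signal are still learned from their labels.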
[73] Ensemble Deep Learning and LLM-Assisted Reporting for Automated Skin Lesion Diagnosis
Sher Khan,Raz Muhammad,Adil Hussain,Muhammad Sajjad,Muhammad Rashid
Main category: cs.CV
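TL;DR: The paper proposes a unified AI framework that combines a heterogeneous ensemble of convolutional neural networks with a large language model to automate skin lesion diagnosis and clinical report generation, improving diagnostic reliability and patient communication.
Details
Motivation: Dermatological diagnosis suffers from inter-observer variability and dataset bias across skin tones, and existing systems treat natural language processing as a post-hoc explanation rather than as part of clinical decision-making. Contribution: (1) A heterogeneous CNN ensemble that provides complementary diagnostic perspectives; (2) a large language model embedded in the diagnostic workflow to generate structured clinical reports.
Method: (1) An ensemble of architecturally diverse CNNs with an uncertainty mechanism that flags discordant cases for specialist review; (2) an integrated LLM that produces patient-friendly clinical reports.
Result: The approach is reported to improve diagnostic precision and to support the full pathway from detection to patient education, aiding early intervention for skin lesions.
Insight: Combining multiple AI components with the clinical workflow can raise diagnostic quality and, through patient education, strengthen early intervention.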
Abstract: Cutaneous malignancies demand early detection for favorable outcomes, yet current diagnostics suffer from inter-observer variability and access disparities. While AI shows promise, existing dermatological systems are limited by homogeneous architectures, dataset biases across skin tones, and fragmented approaches that treat natural language processing as separate post-hoc explanations rather than integral to clinical decision-making. We introduce a unified framework that fundamentally reimagines AI integration for dermatological diagnostics through two synergistic innovations. First, a purposefully heterogeneous ensemble of architecturally diverse convolutional neural networks provides complementary diagnostic perspectives, with an intrinsic uncertainty mechanism flagging discordant cases for specialist review – mimicking clinical best practices. Second, we embed large language model capabilities directly into the diagnostic workflow, transforming classification outputs into clinically meaningful assessments that simultaneously fulfill medical documentation requirements and deliver patient-centered education. This seamless integration generates structured reports featuring precise lesion characterization, accessible diagnostic reasoning, and actionable monitoring guidance – empowering patients to recognize early warning signs between visits. By addressing both diagnostic reliability and communication barriers within a single cohesive system, our approach bridges the critical translational gap that has prevented previous AI implementations from achieving clinical impact. The framework represents a significant advancement toward deployable dermatological AI that enhances diagnostic precision while actively supporting the continuum of care from initial detection through patient education, ultimately improving early intervention rates for skin lesions.
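The mechanism of flagging discordant ensemble cases for specialist review can be sketched as a vote over model outputs. The abstract does not describe the actual uncertainty mechanism, so the majority vote, the `min_agreement` threshold, and the class names below are assumptions chosen for illustration.

```python
from collections import Counter

def ensemble_decision(votes, min_agreement=1.0):
    """Combine per-model class votes and flag cases where models disagree.

    votes: list of predicted labels, one per ensemble member.
    Returns (majority_label, agreement_ratio, needs_review).
    """
    counts = Counter(votes)
    label, n_votes = counts.most_common(1)[0]
    agreement = n_votes / len(votes)
    # Assumed policy: anything below the agreement threshold goes to a specialist.
    return label, agreement, agreement < min_agreement

# Usage with hypothetical labels from three hypothetical models.
label, agreement, needs_review = ensemble_decision(["melanoma", "melanoma", "nevus"])
```

With the default threshold of 1.0, any disagreement triggers review; a lower threshold would trade specialist workload against the risk of passing a contested diagnosis.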
[74] Vision Transformer for Transient Noise Classification
Divyansh Srivastava,Andrzej Niedzielski
Main category: cs.CV
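TL;DR: A Vision Transformer (ViT) is used to classify transient noise (glitches) in LIGO data. Trained on the Gravity Spy dataset plus two additional classes from the O3a run, it reaches a classification efficiency of 92.26%.
Details
Motivation: Transient noise (glitches) in LIGO data hinders gravitational-wave detection. The O3 run introduced two new noise classes, so new models must be trained to classify them effectively. Contribution: Applies a Vision Transformer to LIGO transient-noise classification and demonstrates its effectiveness for this task.
Method: A pre-trained Vision Transformer (ViT-B/32) is trained on the Gravity Spy dataset combined with the two additional classes from the O3a run.
Result: The model achieves a classification efficiency of 92.26%, showing the potential of ViT for distinguishing transient noise.
Insight: Vision Transformers perform well at classifying glitches and could help improve the accuracy of gravitational-wave detection.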
Abstract: Transient noise (glitches) in LIGO data hinders the detection of gravitational waves (GW). The Gravity Spy project has categorized these noise events into various classes. The O3 run introduced two additional noise classes, creating a need to train new models for effective classification. We aim to classify glitches in LIGO data into 22 existing classes from the first run plus 2 additional noise classes from O3a using the Vision Transformer (ViT) model. We fine-tune a pre-trained Vision Transformer (ViT-B/32) model on a combined dataset consisting of the Gravity Spy dataset with the additional two classes from the LIGO O3a run. We achieve a classification efficiency of 92.26%, demonstrating the potential of Vision Transformer to improve the accuracy of gravitational wave detection by effectively distinguishing transient noise. Keywords: gravitational waves, vision transformer, machine learning
[75] General and Efficient Visual Goal-Conditioned Reinforcement Learning using Object-Agnostic Masks
Fahim Shahriar,Cheryl Wang,Alireza Azimi,Gautham Vasan,Hany Hamed Elanwar,A. Rupam Mahmood,Colin Bellinger
Main category: cs.CV
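TL;DR: The paper proposes a mask-based goal representation for goal-conditioned reinforcement learning (GCRL); object-agnostic visual cues enable efficient learning and generalization.
Details
Motivation: Existing goal representations (target state images, 3D coordinates, one-hot vectors) generalize poorly to unseen objects, converge slowly, or require special cameras. A mask representation avoids these limitations. Contribution: A mask-based goal representation system that learns efficiently and generalizes without any positional information about the target, supporting high-accuracy task completion and sim-to-real transfer.
Method: Object-agnostic masks are processed into dense rewards, avoiding error-prone distance calculations. Training uses ground-truth masks in simulation; on physical robots, pretrained open-vocabulary object detection models generate the masks.
Result: In simulation the method achieves 99.9% reaching accuracy on both training and unseen test objects, and it is applied successfully to real robot tasks.
Insight: Mask representations simplify goal-conditioned reinforcement learning while improving generalization and task completion.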
Abstract: Goal-conditioned reinforcement learning (GCRL) allows agents to learn diverse objectives using a unified policy. The success of GCRL, however, is contingent on the choice of goal representation. In this work, we propose a mask-based goal representation system that provides object-agnostic visual cues to the agent, enabling efficient learning and superior generalization. In contrast, existing goal representation methods, such as target state images, 3D coordinates, and one-hot vectors, face issues of poor generalization to unseen objects, slow convergence, and the need for special cameras. Masks can be processed to generate dense rewards without requiring error-prone distance calculations. Learning with ground truth masks in simulation, we achieved 99.9% reaching accuracy on training and unseen test objects. Our proposed method can be utilized to perform pick-up tasks with high accuracy, without using any positional information of the target. Moreover, we demonstrate learning from scratch and sim-to-real transfer applications using two different physical robots, utilizing pretrained open vocabulary object detection models for mask generation.
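To make the idea of deriving a dense reward from a mask concrete, the sketch below scores a binary target mask seen from the agent's camera. The abstract does not give the reward formula, so this particular choice (mask area fraction minus a penalty for the mask centroid being off-centre) and the function name are purely hypothetical.

```python
def mask_reward(mask):
    """Hypothetical dense reward from a binary target mask (list of rows of 0/1).

    Larger, more centred masks score higher; no object position is needed.
    Returns -1.0 when the target is not visible at all.
    """
    h, w = len(mask), len(mask[0])
    pixels = [(r, c) for r in range(h) for c in range(w) if mask[r][c]]
    if not pixels:
        return -1.0
    area = len(pixels) / (h * w)
    cy = sum(r for r, _ in pixels) / len(pixels)
    cx = sum(c for _, c in pixels) / len(pixels)
    # Centroid offset from the image centre, normalised to [0, 1] per axis.
    off_y = abs(cy - (h - 1) / 2) / max((h - 1) / 2, 1)
    off_x = abs(cx - (w - 1) / 2) / max((w - 1) / 2, 1)
    return area - 0.5 * (off_y + off_x)
```

Because the reward depends only on the mask, the same function applies to any object a detector can segment, which is the sense in which such a signal is object-agnostic.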
[76] Improving the Spatial Resolution of GONG Solar Images to GST Quality Using Deep Learning
Chenyang Li,Qin Li,Haimin Wang,Bo Shen
Main category: cs.CV
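TL;DR: The paper proposes a GAN-based super-resolution method that enhances low-resolution GONG solar images to a quality approaching high-resolution BBSO/GST observations. Using Real-ESRGAN, it recovers fine structures within sunspots and filaments.
Details
Motivation: High-resolution solar imaging is crucial for capturing small-scale dynamic features such as filaments and fibrils, but GONG full-disk Hα images lack the resolution to show them clearly, so an effective super-resolution method is needed. Contribution: A GAN-based super-resolution framework built on Real-ESRGAN that raises GONG image resolution to a quality comparable with BBSO/GST observations.
Method: Real-ESRGAN with Residual-in-Residual Dense Blocks and a relativistic discriminator, trained on carefully aligned GONG-GST image pairs.
Result: The model recovers fine detail in sunspot penumbrae and filaments, with an average MSE of 467.15, RMSE of 21.59, and cross-correlation of 0.7794. Slight misalignments between image pairs limit quantitative performance.
Insight: GANs are effective for this super-resolution task but are sensitive to image alignment. Expanding the dataset and improving alignment should further improve reconstruction quality.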
Abstract: High-resolution (HR) solar imaging is crucial for capturing fine-scale dynamic features such as filaments and fibrils. However, the spatial resolution of the full-disk H$\alpha$ images is limited and insufficient to resolve these small-scale structures. To address this, we propose a GAN-based superresolution approach to enhance low-resolution (LR) full-disk H$\alpha$ images from the Global Oscillation Network Group (GONG) to a quality comparable with HR observations from the Big Bear Solar Observatory/Goode Solar Telescope (BBSO/GST). We employ Real-ESRGAN with Residual-in-Residual Dense Blocks and a relativistic discriminator. We carefully aligned GONG-GST pairs. The model effectively recovers fine details within sunspot penumbrae and resolves fine details in filaments and fibrils, achieving an average mean squared error (MSE) of 467.15, root mean squared error (RMSE) of 21.59, and cross-correlation (CC) of 0.7794. Slight misalignments between image pairs limit quantitative performance, which we plan to address in future work alongside dataset expansion to further improve reconstruction quality.
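The three reported metrics can be computed from a pair of images as sketched below. Images are represented as flat lists of pixel intensities to keep the example dependency-free. The abstract does not define its cross-correlation, so the zero-mean normalised form (the Pearson coefficient) used here is an assumption, and the function names are illustrative.

```python
import math

def mse(pred, target):
    """Mean squared error between two equally sized flat pixel lists."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def rmse(pred, target):
    return math.sqrt(mse(pred, target))

def cross_correlation(pred, target):
    """Zero-mean normalised cross-correlation (assumed definition of CC)."""
    n = len(pred)
    mp, mt = sum(pred) / n, sum(target) / n
    cov = sum((p - mp) * (t - mt) for p, t in zip(pred, target))
    sp = math.sqrt(sum((p - mp) ** 2 for p in pred))
    st = math.sqrt(sum((t - mt) ** 2 for t in target))
    if sp == 0 or st == 0:
        raise ValueError("cross-correlation is undefined for a constant image")
    return cov / (sp * st)
```

Note that an average of per-image RMSE values need not equal the square root of the average MSE, which may be why the reported RMSE of 21.59 differs slightly from the square root of 467.15 (about 21.61).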
[77] ChainMPQ: Interleaved Text-Image Reasoning Chains for Mitigating Relation Hallucinations
Yike Wu,Yiwei Wang,Yujun Cai
Main category: cs.CV
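TL;DR: ChainMPQ is a training-free method that reduces relation hallucinations in large vision-language models by means of multi-perspective questions and an interleaved chain of images and text.
Details
Motivation: Relation hallucinations account for the largest share of hallucinations in large vision-language models but have received the least attention, and they undermine model reliability. Contribution: ChainMPQ, which uses multi-perspective questions and accumulated textual and visual memories to substantially reduce relation hallucinations.
Method: ChainMPQ extracts subject and object keywords from the question, enhances the corresponding image regions, constructs multi-perspective questions, and reasons about the relation step by step along an interleaved image-text chain.
Result: Experiments on multiple models and benchmarks show a substantial reduction in relation hallucinations; ablation studies confirm the effectiveness of the core modules.
Insight: Multi-perspective questions combined with a progressively built interleaved chain improve a model's grasp of relations.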
Abstract: While Large Vision-Language Models (LVLMs) achieve strong performance in multimodal tasks, hallucinations continue to hinder their reliability. Among the three categories of hallucinations, which include object, attribute, and relation, relation hallucinations account for the largest proportion but have received the least attention. To address this issue, we propose ChainMPQ (Multi-Perspective Questions guided Interleaved Chain of Image and Text), a training-free method that improves relational inference in LVLMs by utilizing accumulated textual and visual memories. ChainMPQ first extracts subject and object keywords from the question to enhance the corresponding image regions. It then constructs multi-perspective questions that focus on the three core components of a relationship: the subject, the object, and the relation that links them. These questions are sequentially input to the model, with textual and visual memories from earlier steps providing supporting context for subsequent ones, thereby forming an interleaved chain of images and text that guides progressive relational reasoning. Experiments on multiple LVLMs and benchmarks show that ChainMPQ substantially reduces relation hallucinations, while ablation studies further validate the effectiveness of its three core modules.
[78] Scalable deep fusion of spaceborne lidar and synthetic aperture radar for global forest structural complexity mapping
Tiago de Conto,John Armston,Ralph Dubayah
Main category: cs.CV
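TL;DR: The paper presents a scalable deep learning approach that fuses GEDI spaceborne lidar with multimodal SAR data to produce global, high-resolution (25 m) maps of forest structural complexity, with accurate predictions and uncertainty estimates.
Details
Motivation: GEDI spaceborne lidar samples sparsely and cannot by itself support continuous, high-resolution mapping of forest structural complexity. Combining it with SAR data enables continuous global monitoring. Contribution: (1) A scalable deep learning framework fusing GEDI and multimodal SAR data to map global forest structural complexity at 25 m resolution; (2) a lightweight model based on EfficientNetV2 (fewer than 400,000 parameters), trained on over 130 million GEDI footprints, with strong performance (global R2 = 0.82); (3) a global, multi-temporal dataset of forest structural complexity from 2015 to 2022.
Method: An adapted EfficientNetV2 architecture fuses GEDI lidar with multimodal SAR data in a lightweight model (fewer than 400,000 parameters). The model provides uncertainty estimates and can be extended to other forest structural variables through transfer learning.
Result: The model reaches a global R2 of 0.82, produces 25 m global maps of forest structural complexity, and supports multi-temporal monitoring. Predictions hold up across biomes and time periods.
Insight: (1) Fusing lidar with SAR markedly improves the ability to map forest structural complexity; (2) the lightweight design makes the model scalable and computationally efficient enough for global use; (3) uncertainty estimates and transfer learning make it a flexible tool for ecosystem monitoring.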
Abstract: Forest structural complexity metrics integrate multiple canopy attributes into a single value that reflects habitat quality and ecosystem function. Spaceborne lidar from the Global Ecosystem Dynamics Investigation (GEDI) has enabled mapping of structural complexity in temperate and tropical forests, but its sparse sampling limits continuous high-resolution mapping. We present a scalable, deep learning framework fusing GEDI observations with multimodal Synthetic Aperture Radar (SAR) datasets to produce global, high-resolution (25 m) wall-to-wall maps of forest structural complexity. Our adapted EfficientNetV2 architecture, trained on over 130 million GEDI footprints, achieves high performance (global R2 = 0.82) with fewer than 400,000 parameters, making it an accessible tool that enables researchers to process datasets at any scale without requiring specialized computing infrastructure. The model produces accurate predictions with calibrated uncertainty estimates across biomes and time periods, preserving fine-scale spatial patterns. It has been used to generate a global, multi-temporal dataset of forest structural complexity from 2015 to 2022. Through transfer learning, this framework can be extended to predict additional forest structural variables with minimal computational cost. This approach supports continuous, multi-temporal monitoring of global forest structural dynamics and provides tools for biodiversity conservation and ecosystem management efforts in a changing climate.
[79] Lumina-DiMOO: An Omni Diffusion Large Language Model for Multi-Modal Generation and Understanding
Yi Xin,Qi Qin,Siqi Luo,Kaiwen Zhu,Juncheng Yan,Yan Tai,Jiayi Lei,Yuewen Cao,Keqi Wang,Yibin Wang,Jinbin Bai,Qian Yu,Dengyang Jiang,Yuandong Pu,Haoxing Chen,Le Zhuo,Junjun He,Gen Luo,Tianbin Li,Ming Hu,Jin Ye,Shenglong Ye,Bo Zhang,Chang Xu,Wenhai Wang,Hongsheng Li,Guangtao Zhai,Tianfan Xue,Bin Fu,Xiaohong Liu,Yu Qiao,Yihao Liu
Main category: cs.CV
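TL;DR: Lumina-DiMOO is an open-source foundation model for multi-modal generation and understanding built on fully discrete diffusion modeling. With efficient sampling and broad task support, it outperforms existing open-source unified multi-modal models.
Details
Motivation: Existing unified multi-modal models are limited in sampling efficiency and task coverage; Lumina-DiMOO addresses both with discrete diffusion modeling. Contribution: A fully discrete diffusion paradigm that supports efficient sampling and a range of multi-modal tasks (text-to-image generation, image-to-image generation, image understanding) and reaches state-of-the-art performance on multiple benchmarks.
Method: Discrete diffusion modeling handles inputs and outputs across modalities, avoiding the efficiency bottlenecks of autoregressive or hybrid AR-diffusion models while supporting a broad set of generation and understanding tasks.
Result: The model surpasses existing open-source unified multi-modal models on multiple benchmarks, with higher sampling efficiency.
Insight: Discrete diffusion models are promising for multi-modal tasks and suggest a new direction for future research.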
Abstract: We introduce Lumina-DiMOO, an open-source foundational model for seamless multi-modal generation and understanding. Lumina-DiMOO sets itself apart from prior unified models by utilizing a fully discrete diffusion modeling to handle inputs and outputs across various modalities. This innovative approach allows Lumina-DiMOO to achieve higher sampling efficiency compared to previous autoregressive (AR) or hybrid AR-Diffusion paradigms and adeptly support a broad spectrum of multi-modal tasks, including text-to-image generation, image-to-image generation (e.g., image editing, subject-driven generation, and image inpainting, etc.), as well as image understanding. Lumina-DiMOO achieves state-of-the-art performance on multiple benchmarks, surpassing existing open-source unified multi-modal models. To foster further advancements in multi-modal and discrete diffusion model research, we release our code and checkpoints to the community. Project Page: https://synbol.github.io/Lumina-DiMOO.
[80] TransFIRA: Transfer Learning for Face Image Recognizability Assessment
Allen Tu,Kartik Narayan,Joshua Gleason,Jennifer Xu,Matthew Meyn,Tom Goldstein,Vishal M. Patel
Main category: cs.CV
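TL;DR: TransFIRA is a lightweight, annotation-free, transfer-learning framework for assessing face image recognizability. It defines recognizability through the geometry of the embedding space, achieves state-of-the-art verification performance, and extends to other modalities.
Details
Motivation: Conventional face image quality assessment relies on visual heuristics or annotated data and does not directly reflect the encoder's decision geometry. Contribution: (1) A definition of recognizability based on class-center similarity (CCS) and class-center angular separation (CCAS); (2) an aggregation strategy that needs no external labels or heuristics; (3) extensions beyond faces, together with explainability analysis.
Method: Transfer learning exploits the geometry of the encoder's embedding space (CCS and CCAS) to assess recognizability, and a recognizability-informed aggregation strategy improves verification performance.
Result: State-of-the-art verification accuracy on BRIAR and IJB-C, with robustness under cross-dataset shifts.
Insight: The geometry of the embedding space offers a natural and efficient basis for assessing recognizability and carries over to other recognition tasks.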
Abstract: Face recognition in unconstrained environments such as surveillance, video, and web imagery must contend with extreme variation in pose, blur, illumination, and occlusion, where conventional visual quality metrics fail to predict whether inputs are truly recognizable to the deployed encoder. Existing FIQA methods typically rely on visual heuristics, curated annotations, or computationally intensive generative pipelines, leaving their predictions detached from the encoder’s decision geometry. We introduce TransFIRA (Transfer Learning for Face Image Recognizability Assessment), a lightweight and annotation-free framework that grounds recognizability directly in embedding space. TransFIRA delivers three advances: (i) a definition of recognizability via class-center similarity (CCS) and class-center angular separation (CCAS), yielding the first natural, decision-boundary–aligned criterion for filtering and weighting; (ii) a recognizability-informed aggregation strategy that achieves state-of-the-art verification accuracy on BRIAR and IJB-C while nearly doubling correlation with true recognizability, all without external labels, heuristics, or backbone-specific training; and (iii) new extensions beyond faces, including encoder-grounded explainability that reveals how degradations and subject-specific factors affect recognizability, and the first recognizability-aware body recognition assessment. Experiments confirm state-of-the-art results on faces, strong performance on body recognition, and robustness under cross-dataset shifts. Together, these contributions establish TransFIRA as a unified, geometry-driven framework for recognizability assessment – encoder-specific, accurate, interpretable, and extensible across modalities – significantly advancing FIQA in accuracy, explainability, and scope.
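The two embedding-space quantities named above can be sketched as follows. The abstract names CCS and CCAS but does not give their formulas, so this example assumes one plausible reading: CCS is the cosine similarity between an embedding and its own class center, and CCAS is that similarity minus the highest similarity to any other class center. The function names and the dictionary layout are illustrative.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def recognizability(embedding, centers, true_class):
    """Return (ccs, ccas) under the assumed definitions described above.

    centers: dict mapping class id -> class-center embedding (at least two classes).
    """
    ccs = cosine(embedding, centers[true_class])
    nearest_other = max(
        cosine(embedding, center)
        for cls, center in centers.items()
        if cls != true_class
    )
    # A positive margin means the sample sits closer to its own class center
    # than to any impostor center.
    return ccs, ccs - nearest_other
```

Under this reading, a sample with a margin near zero lies on a decision boundary and would be a natural candidate for filtering or down-weighting.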
[81] Road Surface Condition Detection with Machine Learning using New York State Department of Transportation Camera Images and Weather Forecast Data
Carly Sutter,Kara J. Sulia,Nick P. Bassill,Christopher D. Wirz,Christopher D. Thorncroft,Jay C. Rothenberger,Vanessa Przybylo,Mariana G. Cains,Jacob Radford,David Aaron Evans
Main category: cs.CV
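TL;DR: The study uses machine learning (convolutional neural networks and random forests) with New York State Department of Transportation camera images and weather data to classify road surface conditions automatically, reaching 81.5% accuracy on unseen cameras.
Details
Motivation: NYSDOT currently assesses road conditions by watching live cameras and driving the roads, which is labor-intensive. Machine learning can provide automated support for operational decisions. Contribution: (1) A hand-labeled dataset of about 22,000 camera images; (2) a method combining CNNs and random forests that prioritizes generalization to new cameras.
Method: Convolutional neural networks and random forests are trained on camera images and weather data to classify six road surface conditions.
Result: The model achieves 81.5% accuracy on completely unseen cameras, in line with NYSDOT's operational needs.
Insight: Combining images with weather data supports road-condition classification that generalizes across cameras, which suits statewide traffic management.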
Abstract: The New York State Department of Transportation (NYSDOT) has a network of roadside traffic cameras that are used by both the NYSDOT and the public to observe road conditions. The NYSDOT evaluates road conditions by driving on roads and observing live cameras, tasks which are labor-intensive but necessary for making critical operational decisions during winter weather events. However, machine learning models can provide additional support for the NYSDOT by automatically classifying current road conditions across the state. In this study, convolutional neural networks and random forests are trained on camera images and weather data to predict road surface conditions. Models are trained on a hand-labeled dataset of ~22,000 camera images, each classified by human labelers into one of six road surface conditions: severe snow, snow, wet, dry, poor visibility, or obstructed. Model generalizability is prioritized to meet the operational needs of the NYSDOT decision makers, and the weather-related road surface condition model in this study achieves an accuracy of 81.5% on completely unseen cameras.
[82] From Captions to Keyframes: Efficient Video Summarization via Caption- and Context-Aware Frame Scoring
Shih-Yao Lin,Sibendu Paul,Caren Chen
Main category: cs.CV
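TL;DR: The paper proposes KeyScore, a multimodal frame-scoring framework that uses captions and visual context to estimate frame importance, and STACFP, which generates compact and diverse frame candidates, for efficient video summarization.
Details
Motivation: Efficient language understanding of long videos requires selecting a small set of frames that retain semantic and contextual information. Existing methods usually depend on a fixed number of frames or on explicit video summarization and overlook efficient multimodal alignment. Contribution: (1) KeyScore, which estimates frame importance from semantic similarity, temporal diversity, and contextual drop impact; (2) STACFP, which generates compact and diverse frame candidates; (3) experiments on MSRVTT, MSVD, and DiDeMo demonstrating the method's efficiency.
Method: (1) KeyScore computes frame importance from captions and visual context; (2) STACFP generates frame candidates through spatio-temporal adaptive clustering; (3) together the modules achieve up to 99% frame reduction.
Result: On MSRVTT, MSVD, and DiDeMo the method substantially outperforms standard 8-frame encoders, giving efficient and scalable video understanding.
Insight: Emphasizing multimodal alignment between visual and textual signals is key to efficient video understanding and removes the need for an explicit summarization step.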
Abstract: Efficient video-language understanding requires selecting a small set of frames that retain semantic and contextual information from long videos. We propose KeyScore, a multimodal frame scoring framework that jointly leverages captions and visual context to estimate frame-level importance. By combining semantic similarity, temporal diversity, and contextual drop impact, KeyScore identifies the most informative frames for downstream tasks such as retrieval, captioning, and video-language reasoning. To complement KeyScore, we introduce STACFP (Spatio-Temporal Adaptive Clustering for Frame Proposals), which generates compact and diverse frame candidates for long-form videos. Together, these modules achieve up to 99% frame reduction compared to full-frame inference and substantially outperform standard 8-frame encoders on MSRVTT, MSVD, and DiDeMo. Our results demonstrate that emphasizing multimodal alignment between visual and textual signals enables scalable, efficient, and caption-grounded video understanding – without explicit video summarization.
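The combination of three per-frame signals into one importance score can be sketched as a weighted sum followed by top-k selection. The abstract names the three components but not how they are combined, so the equal weights, the assumption that each component is already normalised to [0, 1], and the function names are hypothetical.

```python
def key_scores(semantic, diversity, drop_impact, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Combine three per-frame signals into one score (assumed weighted sum).

    semantic:    caption-frame similarity per frame
    diversity:   temporal diversity per frame
    drop_impact: how much context is lost if the frame is dropped
    """
    w_s, w_d, w_i = weights
    return [w_s * s + w_d * d + w_i * i
            for s, d, i in zip(semantic, diversity, drop_impact)]

def select_keyframes(scores, k):
    """Indices of the k highest-scoring frames, returned in temporal order."""
    top = sorted(range(len(scores)), key=lambda idx: scores[idx], reverse=True)[:k]
    return sorted(top)

# Usage with made-up component values for three frames.
scores = key_scores([0.9, 0.1, 0.5], [0.2, 0.2, 0.8], [0.7, 0.1, 0.9])
keyframes = select_keyframes(scores, k=2)
```

Returning indices in temporal order keeps the selected frames usable as a short clip for a downstream video-language encoder.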
[83] LogSTOP: Temporal Scores over Prediction Sequences for Matching and Retrieval
Avishree Khare,Hideki Okamoto,Bardh Hoxha,Georgios Fainekos,Rajeev Alur
Main category: cs.CV
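TL;DR: LogSTOP is a scoring function that computes scores for temporal properties from sequences of local-property predictions. It targets query matching and ranked retrieval over video and audio, and outperforms large vision/audio language models and other temporal-logic baselines.
Details
Motivation: Neural models such as YOLO and HuBERT detect local properties (objects, emotions) in individual video frames or audio segments, but they output per-frame or per-segment scores. Supporting query matching and retrieval for temporal properties (for example, "does the speaker eventually sound happy?") requires lifting these local scores to sequences. Contribution: (1) Formalizes the problem of assigning Scores for TempOral Properties (STOPs) from local-property scores; (2) proposes LogSTOP, which efficiently scores temporal properties expressed in Linear Temporal Logic; (3) demonstrates LogSTOP's advantage on video and audio tasks.
Method: LogSTOP takes sequences of local-property predictions (such as YOLO or HuBERT outputs) and computes scores for temporal properties represented in Linear Temporal Logic, using an efficient scoring function that handles complex temporal queries.
Result: On query matching with temporal properties over objects in videos and emotions in speech, LogSTOP outperforms large vision/audio language models and other temporal-logic baselines by at least 16%. On ranked retrieval over objects and actions in videos, it improves mean average precision by at least 19% and recall by at least 16% over zero-shot text-to-video retrieval baselines.
Insight: (1) Lifting local properties to temporal ones substantially improves performance on complex temporal queries; (2) LogSTOP's efficiency makes it practical; (3) combining temporal logic with neural models opens a new direction for multimedia retrieval.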
Abstract: Neural models such as YOLO and HuBERT can be used to detect local properties such as objects (“car”) and emotions (“angry”) in individual frames of videos and audio clips respectively. The likelihood of these detections is indicated by scores in [0, 1]. Lifting these scores to temporal properties over sequences can be useful for several downstream applications such as query matching (e.g., “does the speaker eventually sound happy in this audio clip?”), and ranked retrieval (e.g., “retrieve top 5 videos with a 10 second scene where a car is detected until a pedestrian is detected”). In this work, we formalize this problem of assigning Scores for TempOral Properties (STOPs) over sequences, given potentially noisy score predictors for local properties. We then propose a scoring function called LogSTOP that can efficiently compute these scores for temporal properties represented in Linear Temporal Logic. Empirically, LogSTOP, with YOLO and HuBERT, outperforms Large Vision / Audio Language Models and other Temporal Logic-based baselines by at least 16% on query matching with temporal properties over objects-in-videos and emotions-in-speech respectively. Similarly, on ranked retrieval with temporal properties over objects and actions in videos, LogSTOP with Grounding DINO and SlowR50 reports at least a 19% and 16% increase in mean average precision and recall over zero-shot text-to-video retrieval baselines respectively.
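To show what lifting per-frame scores to temporal properties can look like, the sketch below applies a standard max/min quantitative semantics to log-scores. This is not the LogSTOP scoring function, whose definition the abstract does not give; it is a generic stand-in, and the operator definitions and function names are assumptions.

```python
import math

def to_log(scores, eps=1e-9):
    """Map per-frame scores in [0, 1] to log space, clamping away from zero."""
    return [math.log(max(s, eps)) for s in scores]

def eventually(log_p):
    """Score of 'p holds at some step': the best single step."""
    return max(log_p)

def always(log_p):
    """Score of 'p holds at every step': the worst single step."""
    return min(log_p)

def until(log_p, log_q):
    """Score of 'p holds until q holds' under max/min semantics."""
    best = -math.inf
    prefix = math.inf  # min of p over steps strictly before the current one
    for lp, lq in zip(log_p, log_q):
        best = max(best, min(lq, prefix))
        prefix = min(prefix, lp)
    return best

# Usage: "a car is detected until a pedestrian is detected" on made-up scores.
car = to_log([0.9, 0.8, 0.7, 0.1])
pedestrian = to_log([0.1, 0.1, 0.2, 0.95])
score = math.exp(until(car, pedestrian))
```

Ranking clips by such a score gives a simple retrieval baseline; a threshold on the same score turns it into a yes/no query matcher.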
[84] VUGEN: Visual Understanding priors for GENeration
Xiangyi Chen,Théophane Vallaeys,Maha Elbayad,John Nguyen,Jakob Verbeek
Main category: cs.CV
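TL;DR: VUGEN is a framework that uses the visual understanding priors of a pretrained vision-language model for efficient, high-quality image generation, avoiding the representation mismatch of conventional approaches.
Details
Motivation: Vision-language models excel at image understanding but still struggle with image generation; existing approaches suffer from misalignment between understanding and generation representations or from architectural complexity. Contribution: VUGEN reduces the dimensionality of the vision encoder's latent space while preserving visual information, and generates high-quality images directly in the pretrained model's visual space without degrading its understanding ability.
Method: (1) Transform the high-dimensional latent space of the vision encoder into a lower-dimensional, tractable distribution; (2) train the model to sample within this reduced space; (3) use a pixel diffusion decoder to map the generated latents back to image space.
Result: VUGEN improves DPG Bench from 71.17 to 74.32 and FID on COCO from 11.86 to 9.06, while preserving understanding capabilities.
Insight: High-quality generation is possible by using visual understanding priors directly, without decoders that depend on complex latent diffusion or VAE latents.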
Abstract: Recent advances in Vision-Language Models (VLMs) have enabled unified understanding across text and images, yet equipping these models with robust image generation capabilities remains challenging. Existing approaches often rely on reconstruction-oriented autoencoders or complex bridging mechanisms, leading to misalignment between understanding and generation representations, or architectural complexity. In this work, we propose VUGEN, a novel framework that explicitly leverages VLM’s pretrained visual understanding priors for efficient and high-quality image generation. Our approach first transforms the high-dimensional latent space of the VLM’s native vision encoder into a lower-dimensional, tractable distribution that maximally preserves visual information. The VLM is then trained to sample within this reduced latent space, ensuring alignment with its visual understanding capabilities. Finally, a dedicated pixel decoder maps these generated latents back to the image space. We find that a VAE-free pixel diffusion decoder is on par with or better than commonly used complex latent diffusion decoders that internally rely on VAE latents. Extensive experiments demonstrate that VUGEN achieves superior image generation performance, improving DPG Bench from 71.17 to 74.32 and FID from 11.86 to 9.06 on COCO, while fully preserving the VLM’s original understanding capabilities.
[85] Cluster Paths: Navigating Interpretability in Neural Networks
Nicholas M. Kroeger,Vincent Bindschaedler
Main category: cs.CV
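TL;DR: The paper proposes cluster paths, a post-hoc interpretability method that clusters activations at intermediate network layers and represents each input as a sequence of cluster IDs. Four metrics assess the paths, and experiments show that they identify spurious features, stay faithful and stable, and detect out-of-distribution samples.
Details
Motivation: Deep neural networks perform well on vision tasks, but their opaque decision processes risk unwarranted trust, undetected biases, and unexpected failures. Cluster paths are proposed to address this interpretability problem. Contribution: (1) The cluster-path method for explaining network decisions; (2) four evaluation metrics; (3) demonstrations of spurious-feature identification, out-of-distribution detection, and discovery of visual concepts at multiple depths.
Method: Activations at selected layers are clustered and each input is represented as its sequence of cluster IDs. The four metrics are path complexity, weighted-path purity, decision-alignment faithfulness, and path agreement.
Result: In a spurious-cue CIFAR-10 experiment the method identifies color-based shortcuts; on a CelebA task it reaches 90% faithfulness and 96% agreement; and it detects out-of-distribution samples effectively.
Insight: Cluster paths provide explanations, reveal the visual concepts a network has learned (such as colors and textures), and scale to large vision models such as ViT.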
Abstract: While modern deep neural networks achieve impressive performance in vision tasks, they remain opaque in their decision processes, risking unwarranted trust, undetected biases and unexpected failures. We propose cluster paths, a post-hoc interpretability method that clusters activations at selected layers and represents each input as its sequence of cluster IDs. To assess these cluster paths, we introduce four metrics: path complexity (cognitive load), weighted-path purity (class alignment), decision-alignment faithfulness (predictive fidelity), and path agreement (stability under perturbations). In a spurious-cue CIFAR-10 experiment, cluster paths identify color-based shortcuts and collapse when the cue is removed. On a five-class CelebA hair-color task, they achieve 90% faithfulness and maintain 96% agreement under Gaussian noise without sacrificing accuracy. Scaling to a Vision Transformer pretrained on ImageNet, we extend cluster paths to concept paths derived from prompting a large language model on minimal path divergences. Finally, we show that cluster paths can serve as an effective out-of-distribution (OOD) detector, reliably flagging anomalous samples before the model generates over-confident predictions. Cluster paths uncover visual concepts, such as color palettes, textures, or object contexts, at multiple network depths, demonstrating that cluster paths scale to large vision models while generating concise and human-readable explanations.
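Building a cluster path and checking its stability can be sketched as follows. The example assumes the per-layer centroids have already been fitted (for instance with k-means) and measures agreement as the fraction of layers where two paths share a cluster ID; both the nearest-centroid assignment and this agreement formula are assumptions, since the abstract does not specify them.

```python
def nearest_centroid(vec, centroids):
    """Index of the centroid closest to vec in squared Euclidean distance."""
    return min(
        range(len(centroids)),
        key=lambda k: sum((a - b) ** 2 for a, b in zip(vec, centroids[k])),
    )

def cluster_path(layer_activations, layer_centroids):
    """Represent one input as its sequence of cluster IDs, one per chosen layer."""
    return tuple(
        nearest_centroid(act, centroids)
        for act, centroids in zip(layer_activations, layer_centroids)
    )

def path_agreement(path_a, path_b):
    """Fraction of layers at which two paths share a cluster ID (assumed metric)."""
    return sum(x == y for x, y in zip(path_a, path_b)) / len(path_a)

# Usage with toy activations for two layers and hypothetical pre-fitted centroids.
centroids = [[[0.0, 0.0], [10.0, 10.0]], [[0.0], [5.0]]]
clean = cluster_path([[1.0, 1.0], [4.0]], centroids)
noisy = cluster_path([[1.5, 0.5], [1.0]], centroids)
stability = path_agreement(clean, noisy)
```

Comparing the path of an input with the path of its perturbed copy gives a per-sample stability check; inputs whose paths match few training paths could likewise be flagged as possible out-of-distribution samples.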
[86] Ming-UniVision: Joint Image Understanding and Generation with a Unified Continuous Tokenizer
Ziyuan Huang,DanDan Zheng,Cheng Zou,Rui Liu,Xiaolong Wang,Kaixiang Ji,Weilong Chai,Jianxin Sun,Libin Wang,Yongjie Lv,Taozhi Huang,Jiajia Liu,Qingpei Guo,Ming Yang,Jingdong Chen,Jun Zhou
Main category: cs.CV
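TL;DR: The paper introduces MingTok, a visual tokenizer with a continuous latent space, which unifies image understanding and generation under a single autoregressive paradigm.
Details
Motivation: Existing methods use tokenizers with discrete latent spaces to align with large language model tokens, but quantization errors limit semantic expressiveness and degrade vision-language understanding. Contribution: MingTok, a continuous-latent-space visual tokenizer that supports unified autoregressive generation and understanding.
Method: A three-stage sequential architecture (low-level encoding, semantic expansion, visual reconstruction) yields a unified visual representation.
Result: State-of-the-art-level performance on both understanding and generation tasks.
Insight: A unified continuous visual representation reconciles the competing demands that understanding and generation place on the tokenizer.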
Abstract: Visual tokenization remains a core challenge in unifying visual understanding and generation within the autoregressive paradigm. Existing methods typically employ tokenizers in discrete latent spaces to align with the tokens from large language models, where the quantization errors can limit semantic expressiveness and degrade the capability of vision-language understanding. To address this, we introduce MingTok, a new family of visual tokenizers with a continuous latent space, for unified autoregressive generation and understanding. While understanding tasks favor discriminative high-dimensional features, generation tasks prefer compact low-level codes. Thus, to reconcile these competing demands, MingTok adopts a three-stage sequential architecture involving low-level encoding, semantic expansion, and visual reconstruction. Built on top of it, Ming-UniVision eliminates the need for task-specific visual representations, and unifies diverse vision-language tasks under a single autoregressive prediction paradigm. By formulating both understanding and generation as next-token prediction in a shared continuous space, it seamlessly supports multi-round, in-context tasks such as iterative understanding, generation and editing. Empirically, we find that using a unified continuous visual representation reconciles the competing requirements on the tokenizers by the understanding and generation tasks, thereby leading to state-of-the-art level performance across both domains. We hope our findings will facilitate unified visual tokenization in the continuous domain. Inference code and model weights are released to benefit the community.
[87] A Bridge from Audio to Video: Phoneme-Viseme Alignment Allows Every Face to Speak Multiple Languages
Zibo Su,Kun Wei,Jiahua Li,Xu Yang,Cheng Deng
Main category: cs.CV
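TL;DR: The paper proposes MuEx, a framework that uses phoneme-viseme alignment to tackle multilingual talking face synthesis, and reports superior performance on a multilingual benchmark.
Details
Motivation: Current speech-driven talking face synthesis (TFS) models perform well in English but poorly in other languages, mainly because training data is English-dominated and the models lack cross-language generalization. Contribution: 1. Proposes the MuEx framework, which uses phonemes and visemes as universal intermediaries; 2. Introduces the PV-Align mechanism to address audiovisual synchronization; 3. Builds MTFB, a multilingual benchmark dataset.
Method: 1. A PG-MoE architecture combines phoneme and viseme features; 2. The PV-Align mechanism improves cross-modal alignment; 3. Training and evaluation use a large-scale multilingual dataset.
Result: MuEx performs strongly across the languages in the multilingual benchmark and generalizes zero-shot to unseen languages.
Insight: Phonemes and visemes, used as universal intermediaries, can effectively address cross-language talking face generation.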
Abstract: Speech-driven talking face synthesis (TFS) focuses on generating lifelike facial animations from audio input. Current TFS models perform well in English but unsatisfactorily in non-English languages, producing wrong mouth shapes and rigid facial expressions. The terrible performance is caused by the English-dominated training datasets and the lack of cross-language generalization abilities. Thus, we propose Multilingual Experts (MuEx), a novel framework featuring a Phoneme-Guided Mixture-of-Experts (PG-MoE) architecture that employs phonemes and visemes as universal intermediaries to bridge audio and video modalities, achieving lifelike multilingual TFS. To alleviate the influence of linguistic differences and dataset bias, we extract audio and video features as phonemes and visemes respectively, which are the basic units of speech sounds and mouth movements. To address audiovisual synchronization issues, we introduce the Phoneme-Viseme Alignment Mechanism (PV-Align), which establishes robust cross-modal correspondences between phonemes and visemes. In addition, we build a Multilingual Talking Face Benchmark (MTFB) comprising 12 diverse languages with 95.04 hours of high-quality videos for training and evaluating multilingual TFS performance. Extensive experiments demonstrate that MuEx achieves superior performance across all languages in MTFB and exhibits effective zero-shot generalization to unseen languages without additional training.
[88] MSITrack: A Challenging Benchmark for Multispectral Single Object Tracking
Tao Feng,Tingfa Xu,Haolin Qin,Tianhao Li,Shuaihao Han,Xuyang Zou,Zhan Lv,Jianan Li
Main category: cs.CV
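TL;DR: The paper introduces MSITrack, a large-scale, diverse multispectral single object tracking dataset aimed at the limitations of RGB trackers in complex scenes.
Details
Motivation: RGB trackers are limited in real-world scenes with occlusion, interference from similar objects, and complex backgrounds. Multispectral imagery captures pixel-level spectral reflectance and improves target discriminability, but multispectral tracking datasets remain scarce. Contribution: Releases the MSITrack dataset, with more challenging attributes, richer natural scenes, and larger-scale multispectral imagery (300 videos, over 129k frames), filling a gap in multispectral tracking.
Method: Builds a high-quality dataset through careful multispectral image collection, manual labeling, and multi-stage verification.
Result: Experiments show that multispectral data significantly improves tracking performance over RGB-only baselines.
Insight: Multispectral data has clear advantages in complex scenes and can drive further development of multispectral tracking algorithms.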
Abstract: Visual object tracking in real-world scenarios presents numerous challenges including occlusion, interference from similar objects and complex backgrounds, all of which limit the effectiveness of RGB-based trackers. Multispectral imagery, which captures pixel-level spectral reflectance, enhances target discriminability. However, the availability of multispectral tracking datasets remains limited. To bridge this gap, we introduce MSITrack, the largest and most diverse multispectral single object tracking dataset to date. MSITrack offers the following key features: (i) More Challenging Attributes: including interference from similar objects and similarity in color and texture between targets and backgrounds in natural scenarios, along with a wide range of real-world tracking challenges; (ii) Richer and More Natural Scenes: spanning 55 object categories and 300 distinct natural scenes, MSITrack far exceeds the scope of existing benchmarks. Many of these scenes and categories are introduced to the multispectral tracking domain for the first time; (iii) Larger Scale: 300 videos comprising over 129k frames of multispectral imagery. To ensure annotation precision, each frame has undergone meticulous processing, manual labeling and multi-stage verification. Extensive evaluations using representative trackers demonstrate that the multispectral data in MSITrack significantly improves performance over RGB-only baselines, highlighting its potential to drive future advancements in the field. The MSITrack dataset is publicly available at: https://github.com/Fengtao191/MSITrack.
[89] StaR-KVQA: Structured Reasoning Traces for Implicit-Knowledge Visual Question Answering
Zhihao Wen,Wenkang Wei,Yuan Fang,Xingtong Yu,Hui Zhang,Weicheng Zhu,Xin Zhang
Main category: cs.CV
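TL;DR: StaR-KVQA proposes a structured reasoning approach that supervises dual symbolic relation paths and natural-language explanations to improve the accuracy and interpretability of implicit-knowledge visual question answering (IK-KVQA).
Details
Motivation: Existing MLLMs lack explicit reasoning supervision in IK-KVQA, produce inconsistent justifications, and generalize poorly after standard supervised fine-tuning. Contribution: Builds a dataset from structured reasoning traces (relation paths and explanations) and fine-tunes the model via self-distillation, without external retrieval or knowledge bases.
Method: Constructs a dataset of path-grounded reasoning traces and fine-tunes an MLLM via structured self-distillation, so that the reasoning process is transparent and verifiable.
Result: On the OK-VQA benchmark, StaR-KVQA achieves up to 11.3% higher answer accuracy than the strongest baseline and shows robust cross-domain generalization.
Insight: Structured reasoning traces can improve model transparency and generalization while reducing reliance on external resources.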
Abstract: Knowledge-based Visual Question Answering (KVQA) requires models to ground entities in images and reason over factual knowledge. We study its implicit-knowledge variant, IK-KVQA, where a multimodal large language model (MLLM) is the sole knowledge source, without external retrieval. Yet, MLLMs lack explicit reasoning supervision and produce inconsistent justifications, and generalize poorly after standard supervised fine-tuning (SFT). We present StaR-KVQA (Structured Reasoning Traces for IK-KVQA), which supervises structured traces - dual symbolic relation paths plus path-grounded natural-language explanations - so that reasoning becomes transparent and verifiable. With one open-source MLLM, StaR-KVQA constructs and selects path-grounded reasoning traces to form a trace-enriched dataset, then fine-tunes via structured self-distillation to align generation with supervision; no external retrievers, verifiers, or curated knowledge bases (KBs) are used, traces are built offline, and inference is a single autoregressive pass. Across benchmarks, StaR-KVQA improves both accuracy and interpretability, achieving up to +11.3% higher answer accuracy on OK-VQA over the strongest baseline while exhibiting robust cross-domain generalization.
[90] Automated Neural Architecture Design for Industrial Defect Detection
Yuxi Liu,Yunfeng Ma,Yi Tang,Min Liu,Shuai Jiang,Yaonan Wang
Main category: cs.CV
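TL;DR: AutoNAD is an automated neural architecture design framework for industrial surface defect detection (SDD) that jointly searches over convolutions, transformers, and multi-layer perceptrons to address two challenges: intraclass difference and interclass similarity.
Details
Motivation: Industrial SDD faces intraclass difference and interclass similarity. Manually designed models require extensive trial and error and often fail to address both challenges, so an automated method is needed to improve detection performance and efficiency. Contribution: 1. Proposes the AutoNAD framework, which jointly searches over multiple model structures; 2. Introduces a cross weight sharing strategy to accelerate training; 3. Designs a searchable multi-level feature aggregation module (MFAM); 4. Incorporates a latency-aware prior to improve runtime efficiency.
Method: AutoNAD searches a hybrid space of convolutions, transformers, and MLPs to capture local and global features, uses cross weight sharing and MFAM to improve training efficiency and feature learning, and applies a latency-aware prior to guide architecture selection.
Result: AutoNAD is validated on three industrial defect datasets and has been applied within a defect imaging and detection platform.
Insight: An automated design framework can reduce the cost of manual network design while improving the performance and efficiency of industrial SDD.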
Abstract: Industrial surface defect detection (SDD) is critical for ensuring product quality and manufacturing reliability. Due to the diverse shapes and sizes of surface defects, SDD faces two main challenges: intraclass difference and interclass similarity. Existing methods primarily utilize manually designed models, which require extensive trial and error and often struggle to address both challenges effectively. To overcome this, we propose AutoNAD, an automated neural architecture design framework for SDD that jointly searches over convolutions, transformers, and multi-layer perceptrons. This hybrid design enables the model to capture both fine-grained local variations and long-range semantic context, addressing the two key challenges while reducing the cost of manual network design. To support efficient training of such a diverse search space, AutoNAD introduces a cross weight sharing strategy, which accelerates supernet convergence and improves subnet performance. Additionally, a searchable multi-level feature aggregation module (MFAM) is integrated to enhance multi-scale feature learning. Beyond detection accuracy, runtime efficiency is essential for industrial deployment. To this end, AutoNAD incorporates a latency-aware prior to guide the selection of efficient architectures. The effectiveness of AutoNAD is validated on three industrial defect datasets and further applied within a defect imaging and detection platform. Code will be available at https://github.com/Yuxi104/AutoNAD.
[91] Heptapod: Language Modeling on Visual Signals
Yongxin Zhu,Jiawei Chen,Yuanzhe Chen,Zhuo Chen,Dongya Jia,Jian Cong,Xiaobin Zhuang,Yuping Wang,Yuxuan Wang
Main category: cs.CV
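TL;DR: Heptapod is an image autoregressive model that applies language modeling principles to visual signals. It generates images with causal attention and next 2D distribution prediction, and significantly outperforms previous causal autoregressive approaches.
Details
Motivation: Conventional visual autoregressive models rely on semantic tokenizers and CFG, and lack an objective that unifies generative training with self-supervised learning. Heptapod aims to close this gap with a new learning framework. Contribution: Proposes a new learning objective, next 2D distribution prediction, which unifies autoregressive modeling with masked autoencoding and avoids semantic tokenizers.
Method: A causal-attention Transformer, paired with a reconstruction-focused visual tokenizer, predicts the distribution over the entire 2D spatial grid at each timestep.
Result: On the ImageNet generation benchmark, Heptapod achieves an FID of 2.70, outperforming previous causal autoregressive approaches.
Insight: A unified generative and self-supervised objective lets the model capture image semantics more comprehensively, offering a new direction for language modeling on visual signals.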
Abstract: We introduce Heptapod, an image autoregressive model that adheres to the foundational principles of language modeling. Heptapod employs causal attention, eliminates reliance on CFG, and eschews the trend of semantic tokenizers. Our key innovation is next 2D distribution prediction: a causal Transformer with reconstruction-focused visual tokenizer, learns to predict the distribution over the entire 2D spatial grid of images at each timestep. This learning objective unifies the sequential modeling of autoregressive framework with the holistic self-supervised learning of masked autoencoding, enabling the model to capture comprehensive image semantics via generative training. On the ImageNet generation benchmark, Heptapod achieves an FID of 2.70, significantly outperforming previous causal autoregressive approaches. We hope our work inspires a principled rethinking of language modeling on visual signals and beyond.
[92] DreamOmni2: Multimodal Instruction-based Editing and Generation
Bin Xia,Bohao Peng,Yuechen Zhang,Junjia Huang,Jiyang Liu,Jingyao Li,Haoru Tan,Sitong Wu,Chengyao Wang,Yitong Wang,Xinglong Wu,Bei Yu,Jiaya Jia
Main category: cs.CV
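TL;DR: DreamOmni2 proposes multimodal instruction-based editing and generation tasks, addressing limitations of earlier approaches with a new data synthesis pipeline and a model framework that handles multi-image input and complex instructions.
Details
Motivation: Conventional instruction-based image editing relies solely on language instructions, which often fail to capture specific editing details, while subject-driven generation is limited to concrete objects and overlooks abstract concepts. Both fall short of practical user needs. Contribution: 1. Proposes the multimodal instruction-based editing and generation tasks; 2. Designs a data synthesis pipeline; 3. Proposes an index encoding and position encoding shift scheme; 4. Introduces joint training with a VLM; 5. Establishes comprehensive benchmarks.
Method: 1. Data synthesis: a feature mixing method creates extraction data for abstract and concrete concepts, and the editing and extraction models then generate training data; 2. Framework design: index encoding and position encoding shift help the model distinguish images and avoid pixel confusion; 3. Joint training: the VLM is trained together with the generation/editing model to process complex instructions.
Result: Experiments show that DreamOmni2 performs well on multimodal instruction-based editing and generation, supporting the effectiveness of the approach.
Insight: Supporting multimodal instructions (text plus image) broadens the range of applicable tasks, and handling both concrete and abstract concepts improves practical usefulness. The data synthesis pipeline and the model design are the key ingredients.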
Abstract: Recent advancements in instruction-based image editing and subject-driven generation have garnered significant attention, yet both tasks still face limitations in meeting practical user needs. Instruction-based editing relies solely on language instructions, which often fail to capture specific editing details, making reference images necessary. Meanwhile, subject-driven generation is limited to combining concrete objects or people, overlooking broader, abstract concepts. To address these challenges, we propose two novel tasks: multimodal instruction-based editing and generation. These tasks support both text and image instructions and extend the scope to include both concrete and abstract concepts, greatly enhancing their practical applications. We introduce DreamOmni2, tackling two primary challenges: data creation and model framework design. Our data synthesis pipeline consists of three steps: (1) using a feature mixing method to create extraction data for both abstract and concrete concepts, (2) generating multimodal instruction-based editing training data using the editing and extraction models, and (3) further applying the extraction model to create training data for multimodal instruction-based editing. For the framework, to handle multi-image input, we propose an index encoding and position encoding shift scheme, which helps the model distinguish images and avoid pixel confusion. Additionally, we introduce joint training with the VLM and our generation/editing model to better process complex instructions. In addition, we have proposed comprehensive benchmarks for these two new tasks to drive their development. Experiments show that DreamOmni2 has achieved impressive results. Models and codes will be released.
[93] Semantic Segmentation Algorithm Based on Light Field and LiDAR Fusion
Jie Luo,Yuxuan Jiang,Xin Jin,Mingyu Liu,Yihui Fan
Main category: cs.CV
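TL;DR: The paper proposes Mlpfseg, a multimodal semantic segmentation algorithm that fuses light field and LiDAR data, using feature completion and depth perception modules to improve segmentation in complex scenes such as occlusion.
Details
Motivation: Semantic segmentation for autonomous driving struggles in complex scenes such as occlusion. Light field and LiDAR provide complementary visual and spatial information, but fusing them effectively is hindered by limited viewpoint diversity and modality discrepancies. Contribution: 1. Proposes the first multimodal semantic segmentation dataset combining light field and point cloud data; 2. Designs the Mlpfseg network, which fuses the modalities through feature completion and depth perception modules and improves segmentation performance.
Method: 1. The feature completion module addresses the density mismatch between point clouds and image pixels through differential reconstruction of point-cloud feature maps; 2. The depth perception module improves occlusion awareness by reinforcing attention scores.
Result: Mlpfseg improves mIoU by 1.71 over image-only segmentation and by 2.38 over point cloud-only segmentation.
Insight: Multimodal fusion of light field and LiDAR data can address segmentation in complex scenes, with feature completion and depth perception as the key techniques for making the modalities complementary.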
Abstract: Semantic segmentation serves as a cornerstone of scene understanding in autonomous driving but continues to face significant challenges under complex conditions such as occlusion. Light field and LiDAR modalities provide complementary visual and spatial cues that are beneficial for robust perception; however, their effective integration is hindered by limited viewpoint diversity and inherent modality discrepancies. To address these challenges, the first multimodal semantic segmentation dataset integrating light field data and point cloud data is proposed. Based on this dataset, we propose a multi-modal light field point-cloud fusion segmentation network (Mlpfseg), incorporating feature completion and depth perception to segment both camera images and LiDAR point clouds simultaneously. The feature completion module addresses the density mismatch between point clouds and image pixels by performing differential reconstruction of point-cloud feature maps, enhancing the fusion of these modalities. The depth perception module improves the segmentation of occluded objects by reinforcing attention scores for better occlusion awareness. Our method outperforms image-only segmentation by 1.71 Mean Intersection over Union (mIoU) and point cloud-only segmentation by 2.38 mIoU, demonstrating its effectiveness.
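For readers unfamiliar with the metric quoted above, the following is a minimal Python sketch of how mean Intersection over Union is commonly computed from flat label lists. The function name `mean_iou` and the choice to skip classes absent from both prediction and ground truth are illustrative assumptions, not details taken from the paper.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over classes that appear in the prediction or the ground truth.

    pred, target: equal-length sequences of integer class labels.
    Skipping classes with an empty union is an assumption of this sketch.
    """
    ious = []
    for c in range(num_classes):
        # Intersection: positions where both prediction and target are class c.
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        # Union: positions where either prediction or target is class c.
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0
```

With `pred = [0, 0, 1, 1]` and `target = [0, 1, 1, 1]`, class 0 has IoU 1/2 and class 1 has IoU 2/3, so the mean is 7/12.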
[94] SCas4D: Structural Cascaded Optimization for Boosting Persistent 4D Novel View Synthesis
Jipeng Lyu,Jiahua Dong,Yu-Xiong Wang
Main category: cs.CV
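TL;DR: SCas4D is a cascaded optimization framework that exploits structural patterns in 3D Gaussian Splatting to model dynamic scenes efficiently, reaching results comparable to existing methods within 100 iterations per time frame.
Details
Motivation: Dynamic scene modeling must capture accurate deformations while staying computationally efficient. SCas4D addresses this by exploiting the hierarchical patterns of real-world deformations. Contribution: Proposes SCas4D, a cascaded optimization framework that refines deformations from coarse to fine and needs only one-twentieth of the training iterations of existing methods.
Method: Exploits structural patterns in 3D Gaussian Splatting and refines deformations progressively, from the part level to the point level.
Result: Converges within 100 iterations per time frame with results comparable to existing methods, and is effective in self-supervised articulated object segmentation, novel view synthesis, and dense point tracking.
Insight: Real-world deformations are hierarchical, so structured cascaded optimization can greatly reduce computation while retaining modeling accuracy.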
Abstract: Persistent dynamic scene modeling for tracking and novel-view synthesis remains challenging due to the difficulty of capturing accurate deformations while maintaining computational efficiency. We propose SCas4D, a cascaded optimization framework that leverages structural patterns in 3D Gaussian Splatting for dynamic scenes. The key idea is that real-world deformations often exhibit hierarchical patterns, where groups of Gaussians share similar transformations. By progressively refining deformations from coarse part-level to fine point-level, SCas4D achieves convergence within 100 iterations per time frame and produces results comparable to existing methods with only one-twentieth of the training iterations. The approach also demonstrates effectiveness in self-supervised articulated object segmentation, novel view synthesis, and dense point tracking tasks.
[95] Evaluating LLMs for Historical Document OCR: A Methodological Framework for Digital Humanities
Maria Levchenko
Main category: cs.CV
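TL;DR: The paper proposes a methodological framework for evaluating large language models (LLMs) on historical document OCR, targeting temporal biases and period-specific errors that traditional metrics fail to capture.
Details
Motivation: Digital humanities scholars increasingly use LLMs to digitize historical documents but lack an evaluation framework for LLM-based OCR. Traditional metrics do not capture issues that matter for historical corpus creation, such as temporal biases and period-specific errors. Contribution: Proposes an evaluation methodology for LLM-based historical OCR, including new metrics (HCPR and AIR), a contamination control protocol, and stability testing, and evaluates 12 multimodal LLMs.
Method: Uses 18th-century Russian Civil font texts, introduces the HCPR and AIR metrics, designs contamination control and stability testing protocols, and compares the performance of different LLMs.
Result: Gemini and Qwen models outperform traditional OCR but exhibit over-historicization; post-OCR correction degrades rather than improves performance.
Insight: The paper gives digital humanities practitioners practical guidelines for model selection and corpus quality assessment, and exposes the limits of post-OCR correction.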
Abstract: Digital humanities scholars increasingly use Large Language Models for historical document digitization, yet lack appropriate evaluation frameworks for LLM-based OCR. Traditional metrics fail to capture temporal biases and period-specific errors crucial for historical corpus creation. We present an evaluation methodology for LLM-based historical OCR, addressing contamination risks and systematic biases in diplomatic transcription. Using 18th-century Russian Civil font texts, we introduce novel metrics including Historical Character Preservation Rate (HCPR) and Archaic Insertion Rate (AIR), alongside protocols for contamination control and stability testing. We evaluate 12 multimodal LLMs, finding that Gemini and Qwen models outperform traditional OCR while exhibiting over-historicization: inserting archaic characters from incorrect historical periods. Post-OCR correction degrades rather than improves performance. Our methodology provides digital humanities practitioners with guidelines for model selection and quality assessment in historical corpus digitization.
[96] DeRainMamba: A Frequency-Aware State Space Model with Detail Enhancement for Image Deraining
Zhiliang Zhu,Tao Zeng,Tao Yang,Guoliang Luo,Jiyong Zeng
Main category: cs.CV
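TL;DR: The paper proposes DeRainMamba, which combines a Frequency-Aware State-Space Module (FASSM) with Multi-Directional Perception Convolution (MDPConv) to balance rain streak removal and detail preservation, and outperforms existing methods on several public datasets.
Details
Motivation: Existing Mamba-based models are limited in image deraining because they struggle to capture fine-grained details and lack frequency-domain awareness, so a more effective way to balance rain removal and detail preservation is needed. Contribution: Proposes DeRainMamba, which integrates FASSM and MDPConv to combine frequency-domain modeling with spatial detail enhancement inside a state-space framework.
Method: FASSM uses the Fourier transform to distinguish rain streaks from high-frequency image details; MDPConv restores local structures by capturing anisotropic gradient features and fusing multiple convolution branches.
Result: On four public benchmarks, DeRainMamba outperforms existing methods in PSNR and SSIM while requiring fewer parameters and lower computational cost.
Insight: Combining frequency-domain modeling with spatial detail enhancement offers an efficient and effective direction for single image deraining.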
Abstract: Image deraining is crucial for improving visual quality and supporting reliable downstream vision tasks. Although Mamba-based models provide efficient sequence modeling, their limited ability to capture fine-grained details and lack of frequency-domain awareness restrict further improvements. To address these issues, we propose DeRainMamba, which integrates a Frequency-Aware State-Space Module (FASSM) and Multi-Directional Perception Convolution (MDPConv). FASSM leverages Fourier transform to distinguish rain streaks from high-frequency image details, balancing rain removal and detail preservation. MDPConv further restores local structures by capturing anisotropic gradient features and efficiently fusing multiple convolution branches. Extensive experiments on four public benchmarks demonstrate that DeRainMamba consistently outperforms state-of-the-art methods in PSNR and SSIM, while requiring fewer parameters and lower computational costs. These results validate the effectiveness of combining frequency-domain modeling and spatial detail enhancement within a state-space framework for single image deraining.
[97] OBS-Diff: Accurate Pruning For Diffusion Models in One-Shot
Junhan Zhu,Hesong Wang,Mingluo Su,Zefang Wang,Huan Wang
Main category: cs.CV
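TL;DR: OBS-Diff proposes a one-shot pruning framework for large-scale text-to-image diffusion models, adapting the OBS method and adding a timestep-aware Hessian construction to achieve training-free compression.
Details
Motivation: Large-scale text-to-image diffusion models are computationally expensive, and existing one-shot pruning methods can hardly be applied directly because of the iterative denoising nature of diffusion models. OBS-Diff aims to fill this gap. Contribution: 1. Adapts the classic OBS method to the complex architectures of modern diffusion models; 2. Proposes a timestep-aware Hessian construction that better matches the iterative dynamics of the diffusion process; 3. Designs a computationally efficient group-wise sequential pruning strategy.
Method: Combines the adapted OBS pruning method, the timestep-aware Hessian construction, and the group-wise sequential pruning strategy, and supports several pruning granularities (unstructured, N:M semi-structured, and structured).
Result: OBS-Diff achieves state-of-the-art one-shot pruning for diffusion models, accelerating inference with minimal degradation in visual quality.
Insight: Aligning the pruning criterion with error accumulation, in particular by weighting earlier timesteps more heavily, is key to pruning diffusion models well.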
Abstract: Large-scale text-to-image diffusion models, while powerful, suffer from prohibitive computational cost. Existing one-shot network pruning methods can hardly be directly applied to them due to the iterative denoising nature of diffusion models. To bridge the gap, this paper presents OBS-Diff, a novel one-shot pruning framework that enables accurate and training-free compression of large-scale text-to-image diffusion models. Specifically, (i) OBS-Diff revitalizes the classic Optimal Brain Surgeon (OBS), adapting it to the complex architectures of modern diffusion models and supporting diverse pruning granularity, including unstructured, N:M semi-structured, and structured (MHA heads and FFN neurons) sparsity; (ii) To align the pruning criteria with the iterative dynamics of the diffusion process, by examining the problem from an error-accumulation perspective, we propose a novel timestep-aware Hessian construction that incorporates a logarithmic-decrease weighting scheme, assigning greater importance to earlier timesteps to mitigate potential error accumulation; (iii) Furthermore, a computationally efficient group-wise sequential pruning strategy is proposed to amortize the expensive calibration process. Extensive experiments show that OBS-Diff achieves state-of-the-art one-shot pruning for diffusion models, delivering inference acceleration with minimal degradation in visual quality.
[98] A deep multiple instance learning approach based on coarse labels for high-resolution land-cover mapping
Gianmarco Perantoni,Lorenzo Bruzzone
Main category: cs.CV
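TL;DR: The paper proposes a Deep Multiple Instance Learning (DMIL) approach that trains high-resolution land-cover classifiers from low-resolution labels, using flexible pooling layers to learn the high-resolution labels implicitly. Experiments show its effectiveness.
Details
Motivation: The quantity and quality of training labels are central problems in high-resolution land-cover mapping. Existing low-resolution or obsolete products can supply weak labels in large quantities, but these labels must be used effectively to train high-resolution classifiers. Contribution: Proposes a DMIL-based framework that implicitly learns a high-resolution land-cover classifier from low-resolution labels and supports both multi-class and multi-label settings.
Method: Flexible pooling layers link the semantics of pixels in high-resolution imagery to the low-resolution reference labels. The multiple instance learning problem is re-framed in multi-class and multi-label settings, and a Positive-Unlabeled Learning (PUL) strategy is used for training.
Result: On the 2020 IEEE GRSS Data Fusion Contest dataset, the proposed method outperforms standard training strategies.
Insight: Weakly supervised approaches such as DMIL and PUL can make effective use of low-resolution labels to train high-resolution classifiers, offering a new option for land-cover mapping.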
Abstract: The quantity and the quality of the training labels are central problems in high-resolution land-cover mapping with machine-learning-based solutions. In this context, weak labels can be gathered in large quantities by leveraging existing low-resolution or obsolete products. In this paper, we address the problem of training land-cover classifiers using high-resolution imagery (e.g., Sentinel-2) and weak low-resolution reference data (e.g., MODIS-derived land-cover maps). Inspired by recent works in Deep Multiple Instance Learning (DMIL), we propose a method that trains pixel-level multi-class classifiers and predicts low-resolution labels (i.e., patch-level classification), where the actual high-resolution labels are learned implicitly without direct supervision. This is achieved with flexible pooling layers that are able to link the semantics of the pixels in the high-resolution imagery to the low-resolution reference labels. Then, the Multiple Instance Learning (MIL) problem is re-framed in a multi-class and in a multi-label setting. In the former, the low-resolution annotation represents the majority of the pixels in the patch. In the latter, the annotation only provides us information on the presence of one of the land-cover classes in the patch and thus multiple labels can be considered valid for a patch at a time, whereas the low-resolution labels provide us only one label. Therefore, the classifier is trained with a Positive-Unlabeled Learning (PUL) strategy. Experimental results on the 2020 IEEE GRSS Data Fusion Contest dataset show the effectiveness of the proposed framework compared to standard training strategies.
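The idea of pooling pixel-level predictions into a patch-level prediction can be illustrated with a small Python sketch. The abstract only mentions "flexible pooling layers"; the mean and max pooling used here are simple stand-ins chosen for illustration (mean pooling loosely matches the "majority of pixels" multi-class view, max pooling the "class is present" multi-label view), and the function names are hypothetical.

```python
def mean_pool(pixel_probs):
    """Average per-pixel class probabilities into one patch-level vector.

    pixel_probs: list of per-pixel probability lists, one entry per class.
    Illustrative stand-in for the multi-class (majority) setting.
    """
    n = len(pixel_probs)
    num_classes = len(pixel_probs[0])
    return [sum(p[c] for p in pixel_probs) / n for c in range(num_classes)]


def max_pool(pixel_probs):
    """Take the per-class maximum over pixels.

    Illustrative stand-in for the multi-label (presence) setting: a class
    scores high at patch level if any single pixel is confident about it.
    """
    num_classes = len(pixel_probs[0])
    return [max(p[c] for p in pixel_probs) for c in range(num_classes)]
```

In such a setup, the patch-level output would be compared against the low-resolution label during training, so that gradients reach the pixel-level classifier without any pixel-level supervision.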
[99] TTRV: Test-Time Reinforcement Learning for Vision Language Models
Akshit Singh,Shyam Marjit,Wei Lin,Paul Gavrikov,Serena Yeung-Levy,Hilde Kuehne,Rogerio Feris,Sivan Doveh,James Glass,M. Jehanzeb Mirza
Main category: cs.CV
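TL;DR: The paper proposes TTRV, which strengthens vision language models through test-time reinforcement learning. The model is adapted on the fly at inference time without labeled data, yielding clear gains on object recognition and visual question answering.
Details
Motivation: Existing reinforcement learning methods typically rely on labeled data and dedicated training splits, unlike humans, who learn directly from their environment. TTRV aims to improve vision language models through test-time adaptation without any labeled data. Contribution: 1. Proposes TTRV, which applies test-time reinforcement learning to vision language models; 2. Extends the GRPO framework with a reward based on the frequency of the model's outputs; 3. Adds a low-entropy reward on the output distribution to control diversity and further improve performance.
Method: 1. Introduces frequency-based rewards within the GRPO framework; 2. Runs inference on each test sample multiple times and adapts the model dynamically; 3. Rewards low entropy of the empirical output distribution to control diversity.
Result: TTRV improves object recognition and VQA by up to 52.4% and 29.8%, respectively, with average gains of 24.6% and 10.0% across 16 datasets. On image recognition, TTRV applied to InternVL 8B surpasses GPT-4o by an average of 2.3% over 8 benchmarks.
Insight: 1. Test-time reinforcement learning can substantially improve vision language models; 2. Even when adaptation uses a single randomly chosen unlabeled test example, TTRV still yields improvements of up to 5.5% on recognition tasks; 3. The approach is competitive with strong proprietary models.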
Abstract: Existing methods for extracting reward signals in Reinforcement Learning typically rely on labeled data and dedicated training splits, a setup that contrasts with how humans learn directly from their environment. In this work, we propose TTRV to enhance vision language understanding by adapting the model on the fly at inference time, without the need for any labeled data. Concretely, we enhance the Group Relative Policy Optimization (GRPO) framework by designing rewards based on the frequency of the base model’s output, while inferring on each test sample multiple times. Further, we also propose to control the diversity of the model’s output by simultaneously rewarding the model for obtaining low entropy of the output empirical distribution. Our approach delivers consistent gains across both object recognition and visual question answering (VQA), with improvements of up to 52.4% and 29.8%, respectively, and average boosts of 24.6% and 10.0% across 16 datasets. Remarkably, on image recognition, TTRV applied to InternVL 8B surpasses GPT-4o by an average of 2.3% over 8 benchmarks, while remaining highly competitive on VQA, demonstrating that test-time reinforcement learning can match or exceed the strongest proprietary models. Finally, we find many interesting properties of test-time RL for VLMs: for example, even in extremely data-constrained scenarios, where adaptation is performed on a single randomly chosen unlabeled test example, TTRV still yields non-trivial improvements of up to 5.5% in recognition tasks.
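The two label-free reward signals described above (output frequency and low entropy of the empirical output distribution) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the subtraction used to combine the two terms, and the weight `beta` are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def frequency_rewards(answers):
    """Reward each sampled answer by its empirical frequency in the group."""
    counts = Counter(answers)
    n = len(answers)
    return [counts[a] / n for a in answers]


def empirical_entropy(answers):
    """Shannon entropy (in nats) of the empirical answer distribution."""
    counts = Counter(answers)
    n = len(answers)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def combined_rewards(answers, beta=0.5):
    """Hypothetical combination: frequency reward minus a weighted entropy penalty.

    A lower-entropy (more consistent) group of samples receives a smaller
    penalty, which is one simple way to 'reward low entropy'.
    """
    h = empirical_entropy(answers)
    return [r - beta * h for r in frequency_rewards(answers)]
```

For the sampled answers `["cat", "cat", "dog", "cat"]`, the frequency rewards are 0.75 for each "cat" and 0.25 for "dog". In a GRPO-style update these per-sample rewards would then be normalized within the group to form advantages.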
[100] VA-Adapter: Adapting Ultrasound Foundation Model to Echocardiography Probe Guidance
Teng Wang,Haojun Jiang,Yuxuan Wang,Zhenguo Sun,Shiji Song,Gao Huang
Main category: cs.CV
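TL;DR: The paper proposes VA-Adapter, a parameter-efficient adapter that adapts a pre-trained ultrasound foundation model to the echocardiography probe guidance task, helping junior sonographers acquire high-quality images in real time.
Details
Motivation: Cardiac ultrasound is difficult to operate and skilled personnel are scarce, so patients struggle to obtain timely examinations. The work aims to use the medical knowledge that foundation models learn from large datasets to provide real-time operational recommendations for probe guidance. Contribution: Designs the parameter-efficient Vision-Action Adapter (VA-Adapter), which lets the foundation model's image encoder encode vision-action sequences and improves guidance performance.
Method: With sequential reasoning built into a compact design, VA-Adapter lets a pre-trained ultrasound foundation model learn precise probe adjustment strategies by fine-tuning only a small subset of parameters.
Result: Extensive experiments show that VA-Adapter can surpass strong probe guidance baselines.
Insight: An adapter structure can transfer the knowledge of large pre-trained models to a specific medical task efficiently while greatly reducing fine-tuning cost.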
Abstract: Echocardiography is a critical tool for detecting heart diseases. Recently, ultrasound foundation models have demonstrated remarkable capabilities in cardiac ultrasound image analysis. However, obtaining high-quality ultrasound images is a prerequisite for accurate diagnosis. Due to the exceptionally high operational difficulty of cardiac ultrasound, there is a shortage of highly skilled personnel, which hinders patients from receiving timely examination services. In this paper, we aim to adapt the medical knowledge learned by foundation models from vast datasets to the probe guidance task, which is designed to provide real-time operational recommendations for junior sonographers to acquire high-quality ultrasound images. Moreover, inspired by the practice where experts optimize action decisions based on past explorations, we meticulously design a parameter-efficient Vision-Action Adapter (VA-Adapter) to enable the foundation model’s image encoder to encode vision-action sequences, thereby enhancing guidance performance. With built-in sequential reasoning capabilities in a compact design, the VA-Adapter enables a pre-trained ultrasound foundation model to learn precise probe adjustment strategies by fine-tuning only a small subset of parameters. Extensive experiments demonstrate that the VA-Adapter can surpass strong probe guidance models. Our code will be released after acceptance.
[101] Efficient Discriminative Joint Encoders for Large Scale Vision-Language Reranking
Mitchell Keren Taraday,Shahaf Wagner,Chaim Baskin
Main category: cs.CV
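TL;DR: EDJE is an efficient multimodal joint encoder that precomputes vision tokens and stores them in compressed form, sharply reducing online compute and storage while preserving retrieval performance.
Details
Motivation: Existing joint encoders such as BLIP are bottlenecked by an expensive visual feature-extraction stage, which prevents deployment at scale. EDJE targets this bottleneck. Contribution: Proposes EDJE, which precomputes and compresses vision tokens to reduce storage and online compute while keeping strong retrieval performance.
Method: EDJE precomputes vision tokens offline and compresses them with a lightweight attention-based adapter, so that online inference runs only a compact joint encoder.
Result: EDJE processes 50k image-text pairs per second and needs only 49kB of disk storage per image, while matching prior art on Flickr (zero-shot) and COCO (fine-tuned) retrieval.
Insight: Separating offline preprocessing from online computation enables efficient multimodal reranking and offers a path to large-scale deployment.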
Abstract: Multimodal retrieval still leans on embedding-based models like CLIP for fast vector search over pre-computed image embeddings. Yet, unlike text retrieval, where joint-encoder rerankers are standard, comparable vision–language rerankers are largely absent. We find that seminal joint encoders such as BLIP are severely bottlenecked by an expensive visual feature-extraction stage, preventing practical deployment at scale. Motivated by this bottleneck, we introduce EDJE, an Efficient Discriminative Joint Encoder that precomputes vision tokens offline and compresses them via a lightweight attention-based adapter, so online inference runs only a compact joint encoder over a small set of visual tokens plus the text. EDJE preserves strong retrieval performance while drastically reducing storage and online compute, enabling high-throughput inference. Specifically, EDJE processes 50k image–text pairs/second while requiring 49kB of disk storage per image, matching prior art on Flickr (zero-shot) and COCO (fine-tuned) retrieval. The implementation and checkpoints will be made publicly available shortly.
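To put the two reported figures (49kB per image, 50k pairs per second) in perspective, here is a back-of-the-envelope Python sketch. The helper names, the decimal unit convention (1 GB = 1e6 kB), and the example corpus and candidate sizes are assumptions made for illustration; actual throughput depends on hardware and batch size, which the abstract does not specify.

```python
def index_storage_gb(num_images, kb_per_image=49):
    """Disk footprint of the precomputed tokens in decimal GB (1 GB = 1e6 kB)."""
    return num_images * kb_per_image / 1e6


def rerank_seconds(num_candidates, pairs_per_second=50_000):
    """Time to score num_candidates image-text pairs at the reported throughput."""
    return num_candidates / pairs_per_second
```

Under these assumptions, a hypothetical corpus of one million images would need about 49 GB of storage, and reranking 1,000 candidates for one query would take about 0.02 seconds.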
[102] Continual Action Quality Assessment via Adaptive Manifold-Aligned Graph Regularization
Kanglei Zhou,Qingyi Pan,Xingxing Zhang,Hubert P. H. Shum,Frederick W. B. Li,Xiaohui Liang,Liyuan Wang
Main category: cs.CV
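TL;DR: The paper proposes MAGR++, a method for continual learning in action quality assessment that uses adaptive manifold-aligned graph regularization to stabilize feature representations and reduce forgetting.
Details
Motivation: In practice, action quality assessment (AQA) faces non-stationary quality distributions that conventional static methods handle poorly. Continual learning (CL) can help, but existing parameter-efficient fine-tuning methods fall short for AQA. Contribution: 1. Introduces the Continual AQA (CAQA) task; 2. Shows that full-parameter fine-tuning is necessary and identifies the problems it causes; 3. Designs MAGR++, which combines manifold projection and graph regularization to stabilize feature representations; 4. Constructs four CAQA benchmarks.
Method: MAGR++ has two parts: 1. Backbone fine-tuning that stabilizes shallow layers while adapting deeper ones; 2. A two-step feature rectification pipeline consisting of a manifold projector and a graph regularizer.
Result: Experiments show average correlation gains over the strongest baseline of 3.6% offline and 12.2% online.
Insight: Full-parameter fine-tuning is essential for representation learning, but it must be paired with regularization to avoid overfitting and feature manifold shift.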
Abstract: Action Quality Assessment (AQA) quantifies human actions in videos, supporting applications in sports scoring, rehabilitation, and skill evaluation. A major challenge lies in the non-stationary nature of quality distributions in real-world scenarios, which limits the generalization ability of conventional methods. We introduce Continual AQA (CAQA), which equips AQA with Continual Learning (CL) capabilities to handle evolving distributions while mitigating catastrophic forgetting. Although parameter-efficient fine-tuning of pretrained models has shown promise in CL for image classification, we find it insufficient for CAQA. Our empirical and theoretical analyses reveal two insights: (i) Full-Parameter Fine-Tuning (FPFT) is necessary for effective representation learning; yet (ii) uncontrolled FPFT induces overfitting and feature manifold shift, thereby aggravating forgetting. To address this, we propose Adaptive Manifold-Aligned Graph Regularization (MAGR++), which couples backbone fine-tuning that stabilizes shallow layers while adapting deeper ones with a two-step feature rectification pipeline: a manifold projector to translate deviated historical features into the current representation space, and a graph regularizer to align local and global distributions. We construct four CAQA benchmarks from three datasets with tailored evaluation protocols and strong baselines, enabling systematic cross-dataset comparison. Extensive experiments show that MAGR++ achieves state-of-the-art performance, with average correlation gains of 3.6% offline and 12.2% online over the strongest baseline, confirming its robustness and effectiveness. Our code is available at https://github.com/ZhouKanglei/MAGRPP.
[103] Online Generic Event Boundary Detection
Hyungrok Jung,Daneul Kim,Seunggyun Lim,Jeany Son,Jonghyun Choi
Main category: cs.CV
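TL;DR: The paper introduces a new task, Online Generic Event Boundary Detection (On-GEBD), and a framework called Estimator for detecting event boundaries in streaming video in real time. Estimator has two core components, the Consistent Event Anticipator (CEA) and the Online Boundary Discriminator (OBD), and detects boundaries from the discrepancy between predicted and actual frames. Experiments show that it outperforms the baselines and performs comparably to offline methods.
Details
Motivation: Existing Generic Event Boundary Detection (GEBD) methods need the complete video before they can predict, whereas humans process data online and in real time. The paper aims to close this gap by defining the task of detecting event boundaries in streaming video in real time. Contribution: 1. Introduces the On-GEBD task; 2. Designs the Estimator framework with the CEA and OBD components; 3. Achieves strong performance on the Kinetics-GEBD and TAPOS datasets.
Method: Estimator is inspired by Event Segmentation Theory (EST). The CEA predicts the future frame, and the OBD measures the prediction error and adaptively adjusts the threshold to detect event boundaries. The method relies only on past frames and needs no future frames.
Result: Experiments show that Estimator outperforms all baselines and performs comparably to offline GEBD methods.
Insight: Real-time event boundary detection calls for dynamic prediction and error analysis, and a theory of human cognition (EST) can usefully guide algorithm design.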
Abstract: Generic Event Boundary Detection (GEBD) aims to interpret long-form videos through the lens of human perception. However, current GEBD methods require processing complete video frames to make predictions, unlike humans, who process data online and in real time. To bridge this gap, we introduce a new task, Online Generic Event Boundary Detection (On-GEBD), aiming to detect boundaries of generic events immediately in streaming videos. This task faces unique challenges of identifying subtle, taxonomy-free event changes in real time, without access to future frames. To tackle these challenges, we propose a novel On-GEBD framework, Estimator, inspired by Event Segmentation Theory (EST), which explains how humans segment ongoing activity into events by leveraging the discrepancies between predicted and actual information. Our framework consists of two key components: the Consistent Event Anticipator (CEA) and the Online Boundary Discriminator (OBD). Specifically, the CEA generates a prediction of the future frame reflecting current event dynamics based solely on prior frames. Then, the OBD measures the prediction error and adaptively adjusts the threshold using statistical tests on past errors to capture diverse, subtle event transitions. Experimental results demonstrate that Estimator outperforms all baselines adapted from recent online video understanding models and achieves performance comparable to prior offline-GEBD methods on the Kinetics-GEBD and TAPOS datasets.
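The adaptive thresholding idea behind the OBD can be illustrated with a small sketch. The abstract only says that the threshold is adjusted "using statistical tests on past errors"; the sliding-window z-score rule below is an assumed stand-in for that test, and the function name `detect_boundaries` and its parameters (`window`, `z_thresh`, `min_history`) are hypothetical.

```python
from collections import deque
from statistics import mean, pstdev


def detect_boundaries(errors, window=8, z_thresh=3.0, min_history=4):
    """Flag frames whose prediction error is an outlier versus recent errors.

    Assumption: a z-score rule over a sliding window replaces the paper's
    unspecified statistical test. Only past errors are used (online setting).
    """
    history = deque(maxlen=window)
    boundaries = []
    for t, err in enumerate(errors):
        if len(history) >= min_history:
            mu = mean(history)
            sigma = pstdev(history)
            # The floor on sigma avoids a zero-width band on flat histories.
            if err > mu + z_thresh * max(sigma, 1e-6):
                boundaries.append(t)
                history.clear()  # a new event begins, so restart the statistics
                continue
        history.append(err)
    return boundaries


# Synthetic per-frame prediction errors with one sharp jump at index 6.
errors = [0.10, 0.12, 0.08, 0.10, 0.11, 0.09, 0.95, 0.10, 0.12, 0.08, 0.10, 0.11]
print(detect_boundaries(errors))
```

Because the threshold follows the recent error level, the same rule can react to subtle transitions in calm footage without firing constantly in busy footage, which is the motivation the abstract gives for adapting it.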
[104] Explaining raw data complexity to improve satellite onboard processing
Adrien Dorise,Marjorie Bellizzi,Adrien Girard,Benjamin Francesconi,Stéphane May
Main category: cs.CV
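TL;DR: The paper examines the feasibility of running AI directly on raw sensor data onboard satellites. Using simulated data, it compares object detection models trained on raw and on preprocessed data, finds that the raw-data model identifies object boundaries less well at high confidence levels, and suggests improved contouring methods.
Details
Motivation: As processing power grows, deploying AI models onboard satellites for remote sensing is becoming feasible. Using raw sensor data instead of preprocessed ground products brings new challenges, however. Current work relies mainly on preprocessed data, and few approaches use raw data directly. Contribution: 1. A simulation workflow that generates raw-like products from high-resolution L1 imagery, enabling systematic evaluation of the effect of raw data. 2. A comparison of two object detection models (YOLOv11s and YOLOX-S) trained on raw and L1 data, which reveals a boundary identification problem for raw data at high confidence levels.
Method: 1. Generate raw-like products with the simulation workflow. 2. Train YOLOv11s and YOLOX-S on both raw and L1 data. 3. Evaluate the models with standard detection metrics and explainability tools.
Result: At low to medium confidence thresholds the models perform similarly; at high confidence levels the model trained on raw data identifies object boundaries less well.
Insight: Improving how AI models handle contours may be key to better object detection on raw data, which would advance onboard AI processing.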
Abstract: With increasing processing power, deploying AI models for remote sensing directly onboard satellites is becoming feasible. However, new constraints arise, mainly when using raw, unprocessed sensor data instead of preprocessed ground-based products. While current solutions primarily rely on preprocessed sensor images, few approaches directly leverage raw data. This study investigates the effects of utilising raw data on deep learning models for object detection and classification tasks. We introduce a simulation workflow to generate raw-like products from high-resolution L1 imagery, enabling systematic evaluation. Two object detection models (YOLOv11s and YOLOX-S) are trained on both raw and L1 datasets, and their performance is compared using standard detection metrics and explainability tools. Results indicate that while both models perform similarly at low to medium confidence thresholds, the model trained on raw data struggles with object boundary identification at high confidence levels. This suggests that adapting AI architectures with improved contouring methods can enhance object detection on raw images, improving onboard AI for remote sensing.
[105] HARP-NeXt: High-Speed and Accurate Range-Point Fusion Network for 3D LiDAR Semantic Segmentation
Samir Abou Haidar,Alexandre Chariot,Mehdi Darouich,Cyril Joly,Jean-Emmanuel Deschaud
Main category: cs.CV
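TL;DR: HARP-NeXt is a high-speed, accurate LiDAR semantic segmentation network. A novel pre-processing method and a multi-scale range-point fusion backbone improve both speed and accuracy without test-time augmentation or ensemble models.
Details
Motivation: Existing LiDAR semantic segmentation methods trade speed against accuracy: point-based and sparse-convolution methods are accurate but slow, while projection-based methods are fast but lose geometric information. Pre-processing and test-time augmentation (TTA) add further computational cost. Contribution: 1. A novel pre-processing method that significantly reduces computational overhead; 2. The Conv-SE-NeXt feature extraction block, which avoids deep layer stacking; 3. A multi-scale range-point fusion backbone that preserves geometric detail and improves accuracy.
Method: 1. Optimized pre-processing; 2. Efficient feature extraction with Conv-SE-NeXt blocks; 3. A multi-scale range-point fusion backbone that combines information from multiple abstraction levels.
Result: On the nuScenes and SemanticKITTI benchmarks, HARP-NeXt achieves a better speed-accuracy trade-off than all state-of-the-art methods and, without ensemble models or TTA, is comparable to the top-ranked PTv3 while running 24 times faster.
Insight: Optimized pre-processing and multi-scale information fusion are key to fast and accurate LiDAR semantic segmentation, and avoiding deep stacking and TTA improves efficiency.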
Abstract: LiDAR semantic segmentation is crucial for autonomous vehicles and mobile robots, requiring high accuracy and real-time processing, especially on resource-constrained embedded systems. Previous state-of-the-art methods often face a trade-off between accuracy and speed. Point-based and sparse convolution-based methods are accurate but slow due to the complexity of neighbor searching and 3D convolutions. Projection-based methods are faster but lose critical geometric information during the 2D projection. Additionally, many recent methods rely on test-time augmentation (TTA) to improve performance, which further slows the inference. Moreover, the pre-processing phase across all methods increases execution time and is demanding on embedded platforms. Therefore, we introduce HARP-NeXt, a high-speed and accurate LiDAR semantic segmentation network. We first propose a novel pre-processing methodology that significantly reduces computational overhead. Then, we design the Conv-SE-NeXt feature extraction block to efficiently capture representations without deep layer stacking per network stage. We also employ a multi-scale range-point fusion backbone that leverages information at multiple abstraction levels to preserve essential geometric details, thereby enhancing accuracy. Experiments on the nuScenes and SemanticKITTI benchmarks show that HARP-NeXt achieves a superior speed-accuracy trade-off compared to all state-of-the-art methods, and, without relying on ensemble models or TTA, is comparable to the top-ranked PTv3, while running 24× faster. The code is available at https://github.com/SamirAbouHaidar/HARP-NeXt
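For readers unfamiliar with the "range" side of range-point fusion, the sketch below shows the standard spherical projection that maps a LiDAR point to a pixel of a range image. It is a generic illustration, not HARP-NeXt's pre-processing, which the abstract does not describe. The image size and vertical field of view are assumed values typical of a 64-beam sensor, and `project_point` is a hypothetical name.

```python
import math


def project_point(x, y, z, width=1024, height=64,
                  fov_up_deg=3.0, fov_down_deg=-25.0):
    """Map one 3D point to (row, col, range) in a spherical range image.

    Assumptions: sensor at the origin, x forward, z up; the field-of-view
    values are illustrative defaults, not taken from the paper.
    """
    r = math.sqrt(x * x + y * y + z * z)
    if r == 0.0:
        raise ValueError("the origin has no defined viewing direction")
    yaw = -math.atan2(y, x)          # horizontal angle
    pitch = math.asin(z / r)         # vertical angle
    fov_up = math.radians(fov_up_deg)
    fov_down = math.radians(fov_down_deg)
    fov = fov_up - fov_down          # total vertical field of view
    u = 0.5 * (yaw / math.pi + 1.0) * width
    v = (1.0 - (pitch - fov_down) / fov) * height
    # Clamp so points on the border of the field of view stay inside the image.
    col = min(width - 1, max(0, int(math.floor(u))))
    row = min(height - 1, max(0, int(math.floor(v))))
    return row, col, r


# A point 10 m straight ahead lands in the middle column of the image.
print(project_point(10.0, 0.0, 0.0))
```

Several points can fall into the same pixel, which is the loss of geometric information that the abstract attributes to projection-based methods and that fusing range features with per-point features is meant to offset.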
[106] Addressing the ID-Matching Challenge in Long Video Captioning
Zhantao Yang,Huangji Wang,Ruili Feng,Han Zhang,Yuting Hu,Shangwen Zhu,Junyan Li,Yu Liu,Fan Cheng
Main category: cs.CV
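TL;DR: The paper proposes RICE, a new method that uses large vision-language models (LVLMs) to address the ID-Matching problem in long video captioning, substantially improving precision and recall.
Details
Motivation: ID matching is essential to long video captioning, but existing methods generalize poorly and depend on point-wise matching, which limits their effectiveness. The paper aims to solve the problem by exploiting the priors of LVLMs. Contribution: 1) A new benchmark for assessing the ID-Matching capability of video captions; 2) RICE, a method that improves ID-Matching performance by making better use of image information and increasing the amount of information in individual descriptions.
Method: 1) Draw on the inherent capabilities of LVLMs; 2) Improve ID matching by enhancing the use of image information and enriching individual descriptions; 3) Propose RICE and implement it on GPT-4o.
Result: RICE raises ID-Matching precision from 50% to 90% and recall from 15% to 80%, making it possible to track individuals continuously across a long video.
Insight: The ID-Matching ability of LVLMs can be improved substantially by optimizing the use of image information and of individual descriptions, which offers a new approach to complex video captioning.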
Abstract: Generating captions for long and complex videos is both critical and challenging, with significant implications for the growing fields of text-to-video generation and multi-modal understanding. One key challenge in long video captioning is accurately recognizing the same individuals who appear in different frames, which we refer to as the ID-Matching problem. Few prior works have focused on this important issue. Those that have usually suffer from limited generalization and depend on point-wise matching, which limits their overall effectiveness. In this paper, unlike previous approaches, we build upon large vision-language models (LVLMs) to leverage their powerful priors. We aim to unlock the inherent ID-Matching capabilities within LVLMs themselves to enhance the ID-Matching performance of captions. Specifically, we first introduce a new benchmark for assessing the ID-Matching capabilities of video captions. Using this benchmark, we investigate LVLMs including GPT-4o, revealing that the performance of ID-Matching can be improved through two methods: 1) enhancing the usage of image information and 2) increasing the amount of information in individual descriptions. Based on these insights, we propose a novel video captioning method called Recognizing Identities for Captioning Effectively (RICE). Extensive experiments, including assessments of caption quality and ID-Matching performance, demonstrate the superiority of our approach. Notably, when implemented on GPT-4o, our RICE improves the precision of ID-Matching from 50% to 90% and improves the recall of ID-Matching from 15% to 80% compared to the baseline. RICE makes it possible to continuously track different individuals in the captions of long videos.
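To make the precision and recall figures concrete, here is one plausible way to score ID matching. The abstract does not define the benchmark's metric, so the pairwise formulation below is an assumption: every pair of person mentions counts as a positive if both mentions carry the same identity. The function name `pairwise_id_matching` is hypothetical.

```python
from itertools import combinations


def pairwise_id_matching(true_ids, pred_ids):
    """Pairwise precision/recall for identity assignments.

    Assumed metric, not the paper's: a pair of mentions is a true positive
    when caption and ground truth both say the two mentions are one person.
    """
    if len(true_ids) != len(pred_ids):
        raise ValueError("one predicted identity is needed per mention")
    tp = fp = fn = 0
    for i, j in combinations(range(len(true_ids)), 2):
        same_true = true_ids[i] == true_ids[j]
        same_pred = pred_ids[i] == pred_ids[j]
        if same_pred and same_true:
            tp += 1
        elif same_pred:
            fp += 1   # caption merged two different people
        elif same_true:
            fn += 1   # caption split one person into two
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Five mentions across frames; person "A" appears three times, "B" twice.
true_ids = ["A", "A", "B", "B", "A"]
pred_ids = ["p1", "p1", "p2", "p3", "p2"]
print(pairwise_id_matching(true_ids, pred_ids))  # (0.5, 0.25)
```

Under this reading, low recall means the captions keep describing a returning person as someone new, which matches the failure the abstract calls the ID-Matching problem.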
[107] No MoCap Needed: Post-Training Motion Diffusion Models with Reinforcement Learning using Only Textual Prompts
Girolamo Macaluso,Lorenzo Mandelli,Mirko Bicchierai,Stefano Berretti,Andrew D. Bagdanov
Main category: cs.CV
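TL;DR: The paper proposes a reinforcement-learning-based post-training framework that fine-tunes pretrained motion diffusion models using only textual prompts, adapting them to unseen actions or styles without additional motion capture data.
Details
Motivation: Adapting conventional motion diffusion models to new actions or styles requires additional motion capture data and full retraining, which is costly and hard to scale. The paper aims for a low-cost, efficient, and privacy-friendly alternative. Contribution: 1) A post-training framework that relies only on textual prompts; 2) Use of a pretrained text-motion retrieval network as the reward signal; 3) Optimization of the diffusion policy with Denoising Diffusion Policy Optimization; 4) Validation in cross-dataset and leave-one-out experiments.
Method: The method combines reinforcement learning with diffusion models: 1) a pretrained text-motion retrieval network provides the reward signal; 2) Denoising Diffusion Policy Optimization optimizes the generation policy; 3) experiments cover both latent-space and joint-space diffusion architectures.
Result: On the HumanML3D and KIT-ML datasets, the method consistently improves the quality and diversity of generated motions while preserving performance on the original distribution. Both quantitative metrics and user studies support this.
Insight: Reinforcement learning can be used effectively for domain adaptation of diffusion models, and methods that need no paired motion data have advantages in privacy and data efficiency.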
Abstract: Diffusion models have recently advanced human motion generation, producing realistic and diverse animations from textual prompts. However, adapting these models to unseen actions or styles typically requires additional motion capture data and full retraining, which is costly and difficult to scale. We propose a post-training framework based on Reinforcement Learning that fine-tunes pretrained motion diffusion models using only textual prompts, without requiring any motion ground truth. Our approach employs a pretrained text-motion retrieval network as a reward signal and optimizes the diffusion policy with Denoising Diffusion Policy Optimization, effectively shifting the model’s generative distribution toward the target domain without relying on paired motion data. We evaluate our method on cross-dataset adaptation and leave-one-out motion experiments using the HumanML3D and KIT-ML datasets across both latent- and joint-space diffusion architectures. Results from quantitative metrics and user studies show that our approach consistently improves the quality and diversity of generated motions, while preserving performance on the original distribution. Our approach is a flexible, data-efficient, and privacy-preserving solution for motion adaptation.
[108] Bayesian Modelling of Multi-Year Crop Type Classification Using Deep Neural Networks and Hidden Markov Models
Gianmarco Perantoni,Giulio Weikmann,Lorenzo Bruzzone
Main category: cs.CV
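TL;DR: The paper proposes a new approach that combines deep learning with Bayesian modelling to classify yearly satellite image time series (SITS). It integrates a Transformer Encoder (TE) with a Hidden Markov Model (HMM) to capture temporal correlations and multi-year crop type patterns.
Details
Motivation: Temporal consistency of yearly land-cover maps is essential for modelling land-cover change over the years. Existing methods often ignore it, which leads to unstable predictions. Contribution: A new approach combining a TE with an HMM that improves the performance and temporal consistency of multi-year crop type classification.
Method: A TE-based DNN extracts features, and an HMM layer on top of it models temporal consistency. Bayesian modelling refines the classification.
Result: In validation on 47 crop types and six years of Sentinel-2 data, the HMM improves overall classification performance and F1 scores.
Insight: Modelling temporal consistency is essential for multi-year crop type classification, and an HMM is an effective way to make deep learning methods more temporally robust.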
Abstract: The temporal consistency of yearly land-cover maps is of great importance to model the evolution and change of the land cover over the years. In this paper, we focus on a novel approach to the classification of yearly satellite image time series (SITS) that combines deep learning with Bayesian modelling, using Hidden Markov Models (HMMs) integrated with Transformer Encoder (TE) based DNNs. The proposed approach aims to capture both i) intricate temporal correlations in yearly SITS and ii) specific patterns in multiyear crop type sequences. It leverages the cascade classification of an HMM layer built on top of the TE, discerning consistent yearly crop-type sequences. Validation on a multiyear crop type classification dataset spanning 47 crop types and six years of Sentinel-2 acquisitions demonstrates the importance of modelling temporal consistency in the predicted labels. HMMs enhance the overall performance and F1 scores, emphasising the effectiveness of the proposed approach.
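The effect of an HMM layer on per-year predictions can be sketched with Viterbi decoding. This is a minimal illustration under stated assumptions, not the authors' cascade: the yearly class scores stand in for TE outputs and are used directly as emission scores, and the transition matrix, prior, and function name `viterbi` are hypothetical.

```python
import math


def viterbi(posteriors, transition, prior):
    """Most likely class sequence given yearly scores and class transitions.

    posteriors[t][k]: classifier score for class k in year t (assumed input).
    transition[i][j]: probability of class j following class i.
    prior[k]: probability of class k in the first year.
    """
    n_classes = len(prior)

    def log(p):
        return math.log(max(p, 1e-12))  # floor avoids log(0)

    score = [log(prior[k]) + log(posteriors[0][k]) for k in range(n_classes)]
    back = []
    for t in range(1, len(posteriors)):
        new_score, pointers = [], []
        for j in range(n_classes):
            best_i = max(range(n_classes),
                         key=lambda i: score[i] + log(transition[i][j]))
            new_score.append(score[best_i] + log(transition[best_i][j])
                             + log(posteriors[t][j]))
            pointers.append(best_i)
        score = new_score
        back.append(pointers)
    # Trace the best path backwards from the best final state.
    state = max(range(n_classes), key=lambda k: score[k])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


# Two classes, three years; year 2 is a weak, probably spurious flip to class 1.
posteriors = [[0.8, 0.2], [0.45, 0.55], [0.9, 0.1]]
transition = [[0.9, 0.1], [0.1, 0.9]]
print(viterbi(posteriors, transition, prior=[0.5, 0.5]))  # [0, 0, 0]
```

Taking the per-year argmax would give the sequence 0, 1, 0, whereas decoding the sequence jointly removes the isolated flip. Real crop rotations would need a transition matrix that encodes which successions are plausible rather than the simple "sticky" one assumed here.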
[109] DADO: A Depth-Attention framework for Object Discovery
Federico Gonzalez,Estefania Talavera,Petia Radeva
Main category: cs.CV
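TL;DR: DADO is a framework for unsupervised object discovery that combines an attention mechanism with a depth model. Dynamic weighting handles noisy attention maps and complex scenes, and the method performs well on standard benchmarks.
Details
Motivation: Unsupervised object discovery is an important challenge in computer vision, and conventional methods are often limited by noisy attention or complex scenes. DADO aims to combine depth and attention information to make object discovery more accurate and robust. Contribution: 1. The DADO framework, which combines an attention mechanism and a depth model for unsupervised object discovery. 2. A dynamic weighting scheme that adaptively adjusts the weights of depth and attention features.
Method: 1. Generate candidate object regions with an attention mechanism. 2. Estimate scene depth with a depth model. 3. Weight depth and attention dynamically according to the global characteristics of each image.
Result: On standard benchmarks, DADO outperforms existing methods in object discovery accuracy and robustness, without fine-tuning.
Insight: Combining depth information with attention is an effective answer to scene complexity and noise in unsupervised object discovery, and dynamic weighting is a direction worth exploring further.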
Abstract: Unsupervised object discovery, the task of identifying and localizing objects in images without human-annotated labels, remains a significant challenge and a growing focus in computer vision. In this work, we introduce a novel model, DADO (Depth-Attention self-supervised technique for Discovering unseen Objects), which combines an attention mechanism and a depth model to identify potential objects in images. To address challenges such as noisy attention maps or complex scenes with varying depth planes, DADO employs dynamic weighting to adaptively emphasize attention or depth features based on the global characteristics of each image. We evaluated DADO on standard benchmarks, where it outperforms state-of-the-art methods in object discovery accuracy and robustness without the need for fine-tuning.
[110] Enhancing Concept Localization in CLIP-based Concept Bottleneck Models
Rémi Kazmierczak,Steve Azzolin,Eloïse Berthier,Goran Frehse,Gianni Franchi
Main category: cs.CV
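TL;DR: The paper proposes CHILI to address the unfaithful explanations caused by CLIP's concept hallucination in Concept Bottleneck Models (CBMs). It disentangles image embeddings and localizes the pixels of target concepts, yielding more interpretable explanations.
Details
Motivation: Existing CLIP-based CBMs extract concepts in a zero-shot manner and are prone to concept hallucination, that is, incorrectly predicting the presence or absence of a concept, which undermines the faithfulness of explanations. Contribution: 1. Exposes the concept hallucination problem of CLIP in CBMs; 2. Proposes CHILI, which inhibits hallucination by disentangling image embeddings and localizing target-concept pixels; 3. Supports the generation of more interpretable saliency-based explanations.
Method: CHILI disentangles CLIP's image embedding space and identifies and localizes the pixel regions related to a target concept, reducing concept hallucination and making explanations more localized and interpretable.
Result: CHILI reduces concept hallucination and produces more faithful and interpretable saliency-based explanations.
Insight: 1. Concept hallucination by CLIP is a key problem in CBMs; 2. Disentangling and localizing in the embedding space is an effective way to improve explanation faithfulness.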
Abstract: This paper addresses explainable AI (XAI) through the lens of Concept Bottleneck Models (CBMs) that do not require explicit concept annotations, relying instead on concepts extracted using CLIP in a zero-shot manner. We show that CLIP, which is central in these techniques, is prone to concept hallucination, incorrectly predicting the presence or absence of concepts within an image in scenarios used in numerous CBMs, hence undermining the faithfulness of explanations. To mitigate this issue, we introduce Concept Hallucination Inhibition via Localized Interpretability (CHILI), a technique that disentangles image embeddings and localizes pixels corresponding to target concepts. Furthermore, our approach supports the generation of saliency-based explanations that are more interpretable.
[111] MoRe: Monocular Geometry Refinement via Graph Optimization for Cross-View Consistency
Dongki Jung,Jaehoon Choi,Yonghan Lee,Sungmin Eum,Heesung Kwon,Dinesh Manocha
Main category: cs.CV
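TL;DR: MoRe is a training-free geometry refinement method that uses graph optimization to improve cross-view consistency and scale alignment, building on monocular geometric priors to improve 3D reconstruction and novel view synthesis.
Details
Motivation: Monocular 3D foundation models hold great promise for perception tasks but suffer from cross-view inconsistency and scale ambiguity. MoRe aims to refine their geometric priors without any training. Contribution: MoRe, a graph-optimization framework for monocular geometry refinement that resolves scale ambiguity through local planar approximation and performs well in 3D reconstruction and novel view synthesis.
Method: Starting from the 3D points and surface normals estimated by monocular foundation models, the method establishes inter-frame correspondences by feature matching and builds a graph optimization framework that performs local planar approximation.
Result: MoRe improves cross-view consistency, improves rendering in sparse-view settings, and performs well in 3D reconstruction and novel view synthesis.
Insight: Combining geometric constraints with graph optimization can improve the cross-view consistency of monocular methods without additional training, offering a new approach to monocular 3D perception.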
Abstract: Monocular 3D foundation models offer an extensible solution for perception tasks, making them attractive for broader 3D vision applications. In this paper, we propose MoRe, a training-free Monocular Geometry Refinement method designed to improve cross-view consistency and achieve scale alignment. To induce inter-frame relationships, our method employs feature matching between frames to establish correspondences. Rather than applying simple least squares optimization on these matched points, we formulate a graph-based optimization framework that performs local planar approximation using the 3D points and surface normals estimated by monocular foundation models. This formulation addresses the scale ambiguity inherent in monocular geometric priors while preserving the underlying 3D structure. We further demonstrate that MoRe not only enhances 3D reconstruction but also improves novel view synthesis, particularly in sparse view rendering scenarios.
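The role of surface normals in scale alignment can be shown with a toy point-to-plane problem. This is a deliberately simplified sketch, not MoRe's graph optimization: it assumes a single global scale, two frames that already share a camera frame, and matched points with known normals. The helper names `dot` and `plane_scale` are hypothetical.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def plane_scale(points, targets, normals):
    """Closed-form scale s minimizing sum_i (n_i . (s * p_i - q_i))^2.

    points:  3D points from the frame to be rescaled (p_i).
    targets: matched 3D points in the reference frame (q_i).
    normals: unit surface normals at the targets (n_i).
    Setting the derivative with respect to s to zero gives
    s = sum (n_i.p_i)(n_i.q_i) / sum (n_i.p_i)^2.
    """
    num = sum(dot(n, p) * dot(n, q) for p, q, n in zip(points, targets, normals))
    den = sum(dot(n, p) ** 2 for p, n in zip(points, normals))
    if den == 0.0:
        raise ValueError("points carry no component along the normals")
    return num / den


# Points on a fronto-parallel surface whose depths are off by a factor of 2.
points = [(0.0, 0.0, 1.0), (1.0, 0.0, 2.0), (0.0, 1.0, 4.0)]
targets = [(0.0, 0.0, 2.0), (2.0, 0.0, 4.0), (0.0, 2.0, 8.0)]
normals = [(0.0, 0.0, 1.0)] * 3
print(plane_scale(points, targets, normals))  # 2.0
```

A point-to-plane residual only penalizes error along the normal, so a match that slides within the local surface costs nothing. That tolerance to imprecise correspondences is one common reason to prefer planar approximation over plain point-to-point least squares.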
[112] Validation of Various Normalization Methods for Brain Tumor Segmentation: Can Federated Learning Overcome This Heterogeneity?
Jan Fiszer,Dominika Ciupek,Maciej Malawski
Main category: cs.CV
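TL;DR: The paper studies how federated learning (FL) copes with non-independent and identically distributed (non-IID) data in brain tumor segmentation, simulating heterogeneity with different MRI intensity normalization methods. FL proves robust to the inconsistency and performs comparably to a centralized model.
Details
Motivation: Deep learning is widely used in medical imaging, but data privacy and heterogeneity limit its effectiveness. FL is a potential solution, but its behavior on non-IID data needs to be validated. Contribution: The paper validates FL under heterogeneous MRI normalization and shows that FL can overcome this kind of heterogeneity without loss of performance.
Method: Non-IID data are simulated by applying different MRI intensity normalization methods to separate subsets, and FL and centralized models are trained and compared on brain tumor segmentation.
Result: FL reaches a 3D Dice score of 92%, comparable to the centralized model, and is robust to inconsistent normalization across clients.
Insight: FL can address both privacy and heterogeneity in medical data, which makes it suitable for demanding medical applications.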
Abstract: Deep learning (DL) has been increasingly applied in medical imaging; however, it requires large amounts of data, which raises many challenges related to data privacy, storage, and transfer. Federated learning (FL) is a training paradigm that overcomes these issues, though its effectiveness may be reduced when dealing with non-independent and identically distributed (non-IID) data. This study simulates non-IID conditions by applying different MRI intensity normalization techniques to separate data subsets, reflecting a common cause of heterogeneity. These subsets are then used for training and testing models for brain tumor segmentation. The findings provide insights into the influence of the MRI intensity normalization methods on segmentation models, in both training and inference. Notably, the FL methods demonstrated resilience to inconsistently normalized data across clients, achieving a 3D Dice score of 92%, which is comparable to a centralized model (trained using all data). These results indicate that FL is a solution to effectively train high-performing models without violating data privacy, a crucial concern in medical applications. The code is available at: https://github.com/SanoScience/fl-varying-normalization.
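Two ingredients of this study are easy to show in isolation: how different normalization schemes change the same intensities, and how the Dice score is computed. The sketch below uses z-score and min-max normalization as examples; the abstract does not list which normalization techniques were compared, so these two are assumptions, and the function names are hypothetical.

```python
from statistics import mean, pstdev


def zscore(values):
    """Zero-mean, unit-variance normalization of a list of intensities."""
    mu = mean(values)
    sd = pstdev(values) or 1.0   # constant inputs keep a unit divisor
    return [(v - mu) / sd for v in values]


def minmax(values):
    """Rescale intensities linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0
    return [(v - lo) / span for v in values]


def dice(pred, target):
    """Dice coefficient between two binary masks given as flat 0/1 lists."""
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(pred) + sum(target)
    return 1.0 if total == 0 else 2.0 * intersection / total


# The same three voxels look very different to a model after each scheme.
voxels = [0, 50, 100]
print(minmax(voxels))   # [0.0, 0.5, 1.0]
print(zscore(voxels))

# One true-positive voxel, one false positive: Dice = 2*1 / (2 + 1).
print(dice([1, 1, 0, 0], [1, 0, 0, 0]))
```

If each client normalizes its scans differently, the clients' input distributions differ even when the underlying anatomy is similar, which is the non-IID condition the study simulates.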
[113] Graph Conditioned Diffusion for Controllable Histopathology Image Generation
Sarah Cechnicka,Matthew Baugh,Weitong Zhang,Mischa Dombrowski,Zhe Li,Johannes C. Paetzold,Candice Roufosse,Bernhard Kainz
Main category: cs.CV
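TL;DR: The paper proposes Graph-Conditioned-Diffusion, a graph-based diffusion approach for controllable medical image generation. Graph nodes represent the major structures in an image and their relationships, giving fine-grained control over generated content.
Details
Motivation: Medical images such as histopathology slides have inherent structure and texture. Existing diffusion models operate in noisy latent spaces that lack semantic structure and strong priors, which makes meaningful control over generation difficult. Contribution: A graph-based, object-level representation that models the major structures of an image and their relationships as graph nodes, and achieves fine-grained control through a transformer module and the diffusion model's text-conditioning mechanism.
Method: Graph nodes represent image structures; a transformer module processes these representations, and they are integrated into the diffusion model through its text-conditioning mechanism to control generation.
Result: In a real-world histopathology use case, the generated data can reliably substitute for annotated patient data in downstream segmentation tasks.
Insight: With structured graph representations, diffusion models can offer a higher level of control in medical image generation, addressing the lack of semantic structure in existing methods.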
Abstract: Recent advances in Diffusion Probabilistic Models (DPMs) have set new standards in high-quality image synthesis. Yet, controlled generation remains challenging, particularly in sensitive areas such as medical imaging. Medical images feature inherent structure such as consistent spatial arrangement, shape or texture, all of which are critical for diagnosis. However, existing DPMs operate in noisy latent spaces that lack semantic structure and strong priors, making it difficult to ensure meaningful control over generated content. To address this, we propose graph-based object-level representations for Graph-Conditioned-Diffusion. Our approach generates graph nodes corresponding to each major structure in the image, encapsulating their individual features and relationships. These graph representations are processed by a transformer module and integrated into a diffusion model via the text-conditioning mechanism, enabling fine-grained control over generation. We evaluate this approach using a real-world histopathology use case, demonstrating that our generated data can reliably substitute for annotated patient data in downstream segmentation tasks. The code is available here.
[114] Few-Shot Adaptation Benchmark for Remote Sensing Vision-Language Models
Karim El Khoury,Maxime Zanella,Christophe De Vleeschouwer,Benoit Macq
Main category: cs.CV
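TL;DR: The paper presents the first structured benchmark for evaluating few-shot adaptation of Remote Sensing Vision-Language Models (RSVLMs). It reveals marked differences between models under few-shot conditions and calls for more robust methods.
Details
Motivation: RSVLMs perform well after large-scale pretraining, but their ability to generalize in low-data regimes such as few-shot learning has not been studied enough, so their adaptability needs to be evaluated and compared systematically. Contribution: 1. The first structured benchmark for few-shot adaptation of RSVLMs; 2. Comprehensive experiments across 10 remote sensing scene classification datasets and 5 few-shot adaptation methods; 3. Open-source code and a reproducible benchmarking framework.
Method: Five few-shot adaptation strategies are applied on 10 remote sensing datasets to three RSVLMs with different backbones, and the resulting performance is compared.
Result: Models with similar zero-shot performance behave very differently under few-shot adaptation, and no existing method is a clear winner.
Insight: The few-shot adaptability of RSVLMs is closely tied to their backbone architecture and pretraining strategy, and future work needs adaptation methods designed for remote sensing tasks.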
Abstract: Remote Sensing Vision-Language Models (RSVLMs) have shown remarkable potential thanks to large-scale pretraining, achieving strong zero-shot performance on various tasks. However, their ability to generalize in low-data regimes, such as few-shot learning, remains insufficiently explored. In this work, we present the first structured benchmark for evaluating few-shot adaptation methods on RSVLMs. We conduct comprehensive experiments across ten remote sensing scene classification datasets, applying five widely used few-shot adaptation strategies to three state-of-the-art RSVLMs with varying backbones. Our findings reveal that models with similar zero-shot performance can exhibit markedly different behavior under few-shot adaptation, with some RSVLMs being inherently more amenable to such adaptation than others. The variability of performance and the absence of a clear winner among existing methods highlight the need for the development of more robust methods for few-shot adaptation tailored to RS. To facilitate future research, we provide a reproducible benchmarking framework and open-source code to systematically evaluate RSVLMs under few-shot conditions. The source code is publicly available on Github: https://github.com/elkhouryk/fewshot_RSVLMs
[115] Are We Using the Right Benchmark: An Evaluation Framework for Visual Token Compression Methods
Chenfei Liao,Wensong Wang,Zichen Wen,Xu Zheng,Yiyu Wang,Haocong He,Yuanhuiyi Lyu,Lutao Jiang,Xin Zou,Yuqian Fu,Bin Ren,Linfeng Zhang,Xuming Hu
Main category: cs.CV
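TL;DR: The paper argues that current evaluation of visual token compression methods for Multimodal Large Language Models (MLLMs) suffers from a task mismatch, and proposes VTC-Bench, an evaluation framework whose data filtering mechanism makes assessment fairer and more accurate.
Details
Motivation: Existing benchmarks were designed to assess the perception and reasoning capabilities of MLLMs, not visual token compression. Simple downsampling outperforms many advanced methods on them, which indicates that the benchmarks are noisy for this task and that a better-suited evaluation framework is needed. Contribution: The VTC-Bench framework, which denoises existing benchmarks with a data filtering mechanism so that visual token compression methods can be assessed more fairly and accurately.
Method: After extensive experiments expose the benchmark noise, the authors use downsampling as a data filter that gauges sample difficulty, and build VTC-Bench on it.
Result: Simple downsampling outperforms many advanced methods on existing benchmarks, confirming the noise problem; VTC-Bench gives more reliable evaluation results.
Insight: Relying only on existing benchmarks to evaluate visual token compression can mislead research, and data filtering is an effective way to improve evaluation quality.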
Abstract: Recent endeavors to accelerate inference in Multimodal Large Language Models (MLLMs) have primarily focused on visual token compression. The effectiveness of these methods is typically assessed by measuring the accuracy drop on established benchmarks, comparing model performance before and after compression. However, these benchmarks are originally designed to assess the perception and reasoning capabilities of MLLMs, rather than to evaluate compression techniques. As a result, directly applying them to visual token compression introduces a task mismatch. Strikingly, our investigation reveals that simple image downsampling consistently outperforms many advanced compression methods across multiple widely used benchmarks. Through extensive experiments, we make the following observations: (i) Current benchmarks are noisy for the visual token compression task. (ii) Down-sampling is able to serve as a data filter to evaluate the difficulty of samples in the visual token compression task. Motivated by these findings, we introduce VTC-Bench, an evaluation framework that incorporates a data filtering mechanism to denoise existing benchmarks, thereby enabling fairer and more accurate assessment of visual token compression methods. All data and code are available at https://github.com/Chenfei-Liao/VTC-Bench.
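The idea of using downsampling as a data filter can be sketched as follows. The exact grouping rule in VTC-Bench is not given in the abstract, so this sketch assumes one simple reading: samples that a plain downsampling baseline already answers correctly are treated as uninformative for comparing compression methods, and methods are scored on the remainder. The record fields and function names are hypothetical.

```python
def split_by_downsampling(records):
    """Partition benchmark samples by whether plain downsampling solves them.

    Assumed record layout: {'id', 'downsample_correct', 'method_correct'}.
    """
    easy = [r for r in records if r["downsample_correct"]]
    hard = [r for r in records if not r["downsample_correct"]]
    return easy, hard


def accuracy(records, key):
    """Fraction of records where the boolean field `key` is true."""
    if not records:
        return float("nan")
    return sum(1 for r in records if r[key]) / len(records)


records = [
    {"id": "q1", "downsample_correct": True, "method_correct": True},
    {"id": "q2", "downsample_correct": True, "method_correct": True},
    {"id": "q3", "downsample_correct": False, "method_correct": True},
    {"id": "q4", "downsample_correct": False, "method_correct": False},
]
easy, hard = split_by_downsampling(records)
print(accuracy(records, "method_correct"))  # 0.75 on the unfiltered benchmark
print(accuracy(hard, "method_correct"))     # 0.5 on the filtered subset
```

In this toy case the unfiltered score is inflated by questions that need little visual detail, so the filtered score is the one that reflects how much useful information the compression method preserved.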
[116] MV-Performer: Taming Video Diffusion Model for Faithful and Synchronized Multi-view Performer Synthesis
Yihao Zhi,Chenghong Li,Hongjie Liao,Xihe Yang,Zhengwentai Sun,Jiahao Chang,Xiaodong Cun,Wensen Feng,Xiaoguang Han
Main category: cs.CV
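TL;DR: MV-Performer is a video diffusion framework that generates synchronized multi-view videos from monocular full-body captures, addressing the difficulty existing methods have with 360-degree viewpoint changes.
Details
Motivation: Current video generation methods concentrate on redirecting the camera trajectory within the front view and struggle to produce 360-degree viewpoint changes. The paper aims to remove this limitation, specifically in the human-centric domain. Contribution: The MV-Performer framework, which uses the MVHumanNet dataset and camera-dependent normal maps to reduce the ambiguity between seen and unseen observations, and a multi-view human-centric video diffusion model that keeps the generated videos synchronized.
Method: Camera-dependent normal maps serve as the condition signal, and a multi-view human-centric video diffusion model fuses information from the reference video, partial renderings, and different viewpoints. A robust inference procedure reduces the artifacts caused by imperfect monocular depth estimation.
Result: Experiments on three datasets show that MV-Performer achieves state-of-the-art effectiveness and robustness in human-centric 4D novel view synthesis.
Insight: By combining multi-view information with an informative condition signal, MV-Performer advances 360-degree viewpoint change and video synchronization, providing a strong tool for human-centric novel view synthesis.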
Abstract: Recent breakthroughs in video generation, powered by large-scale datasets and diffusion techniques, have shown that video diffusion models can function as implicit 4D novel view synthesizers. Nevertheless, current methods primarily concentrate on redirecting camera trajectory within the front view while struggling to generate 360-degree viewpoint changes. In this paper, we focus on the human-centric subdomain and present MV-Performer, an innovative framework for creating synchronized novel view videos from monocular full-body captures. To achieve a 360-degree synthesis, we extensively leverage the MVHumanNet dataset and incorporate an informative condition signal. Specifically, we use the camera-dependent normal maps rendered from oriented partial point clouds, which effectively alleviate the ambiguity between seen and unseen observations. To maintain synchronization in the generated videos, we propose a multi-view human-centric video diffusion model that fuses information from the reference video, partial rendering, and different viewpoints. Additionally, we provide a robust inference procedure for in-the-wild video cases, which greatly mitigates the artifacts induced by imperfect monocular depth estimation. Extensive experiments on three datasets demonstrate our MV-Performer’s state-of-the-art effectiveness and robustness, setting a strong model for human-centric 4D novel view synthesis.
[117] Resolution scaling governs DINOv3 transfer performance in chest radiograph classification
Soroosh Tayebi Arasteh,Mina Shaigan,Christiane Kuhl,Jakob Nikolas Kather,Sven Nebelung,Daniel Truhn
Main category: cs.CV
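TL;DR: The paper examines the transfer performance of DINOv3 in chest radiograph classification. Raising the input resolution to 512x512 clearly improves performance, going further to 1024x1024 brings no additional gain, and ConvNeXt-B outperforms ViT-B/16 in most settings.
Details
Motivation: Self-supervised learning (SSL) has advanced visual representation learning, but its value for chest radiography, a high-resolution task with fine-grained findings, remains unclear. The study evaluates how DINOv3 performs in this domain and whether its design choices pay off. Contribution: 1. A systematic evaluation of DINOv3 transfer performance in chest radiograph classification; 2. Evidence that raising the resolution to 512x512 is critical to performance; 3. Confirmation that ConvNeXt-B suits the task better than ViT-B/16; 4. Evidence that domain adaptation is necessary (fine-tuning beats frozen features).
Method: 1. Compare DINOv3, DINOv2, and ImageNet initialization across 7 datasets (n>814,000); 2. Evaluate two backbones, ViT-B/16 and ConvNeXt-B; 3. Test three resolutions (224x224, 512x512, 1024x1024) and frozen features from a 7B model; 4. Use mean AUROC across labels as the primary outcome.
Result: 1. At 512x512, DINOv3 outperforms DINOv2 and ImageNet initialization; 2. ConvNeXt-B generally outperforms ViT-B/16; 3. The larger 1024x1024 resolution brings no meaningful improvement; 4. Frozen features underperform, underlining the importance of fine-tuning.
Insight: 1. For chest radiographs, raising the resolution to 512x512 is the key optimization; 2. ConvNeXt-B is better suited to this kind of fine-grained task; 3. For clinical use, a DINOv3-initialized ConvNeXt-B at 512x512 is a practical and efficient choice.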
Abstract: Self-supervised learning (SSL) has advanced visual representation learning, but its value in chest radiography, a high-volume imaging modality with fine-grained findings, remains unclear. Meta’s DINOv3 extends earlier SSL models through Gram-anchored self-distillation. Whether these design choices improve transfer learning for chest radiography has not been systematically tested. We benchmarked DINOv3 against DINOv2 and ImageNet initialization across seven datasets (n>814,000). Two representative backbones were evaluated: ViT-B/16 and ConvNeXt-B. Images were analyzed at 224x224, 512x512, and 1024x1024 pixels. We additionally assessed frozen features from a 7B model. The primary outcome was mean AUROC across labels. At 224x224, DINOv3 and DINOv2 achieved comparable performance on adult datasets. Increasing resolution to 512x512 yielded consistent improvements for DINOv3 over both DINOv2 and ImageNet. In contrast, results in the pediatric cohort showed no differences across initializations. Across all settings, ConvNeXt-B outperformed ViT-B/16. Models using frozen DINOv3-7B features underperformed relative to fully finetuned 86-89M-parameter backbones, highlighting the importance of domain adaptation. Scaling to 1024x1024 did not further improve accuracy. Resolution-related gains were most evident for boundary-dependent and small focal abnormalities. In chest radiography, higher input resolution is critical for leveraging the benefits of modern self-supervised models. 512x512 pixels represent a practical upper limit where DINOv3-initialized ConvNeXt-B networks provide the strongest performance, while larger inputs offer minimal return on cost. Clinically, these findings support use of finetuned, mid-sized backbones at 512x512 for chest radiograph interpretation, with the greatest gains expected in detecting subtle or boundary-centered lesions relevant to emergency and critical care settings.
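The primary outcome, mean AUROC across labels, can be computed as sketched below for a multi-label setting in which each column is one radiographic finding. The pairwise (Mann-Whitney) formulation is a standard way to compute AUROC; the data layout and the function names `auroc` and `mean_auroc` are assumptions for illustration, not the study's evaluation code.

```python
def auroc(labels, scores):
    """AUROC as the probability that a positive outscores a negative.

    Ties between a positive and a negative count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mean_auroc(label_matrix, score_matrix):
    """Average the per-label AUROC over all label columns."""
    n_labels = len(label_matrix[0])
    per_label = [
        auroc([row[k] for row in label_matrix],
              [row[k] for row in score_matrix])
        for k in range(n_labels)
    ]
    return sum(per_label) / n_labels


# Four images, two findings (columns); rows are images.
labels = [[1, 0], [0, 1], [1, 1], [0, 0]]
scores = [[0.9, 0.2], [0.3, 0.8], [0.6, 0.4], [0.7, 0.1]]
print(mean_auroc(labels, scores))  # 0.875
```

Averaging per label gives rare findings the same weight as common ones, so a model cannot reach a high mean AUROC by doing well only on frequent labels.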
[118] TalkCuts: A Large-Scale Dataset for Multi-Shot Human Speech Video Generation
Jiaben Chen,Zixin Wang,Ailing Zeng,Yang Fu,Xueyang Yu,Siyuan Cen,Julian Tanke,Yihang Chen,Koichi Saito,Yuki Mitsufuji,Chuang Gan
Main category: cs.CV
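TL;DR: TalkCuts is a large-scale dataset for multi-shot human speech video generation, offering diverse camera shots and rich annotations. Orator, an LLM-guided multi-modal generation framework, is presented as a baseline to show the value of the dataset.
Details
Motivation: Existing datasets focus mostly on single-shot, static viewpoints. TalkCuts fills the gap for multi-shot speech video generation and supports richer multimodal learning and research on controllable video generation. Contribution: TalkCuts, a large-scale dataset with diverse camera shots and rich annotations, and Orator, an LLM-guided multi-modal generation framework that serves as a baseline.
Method: A language model acts as a multi-faceted director that coordinates camera transitions, gestures, and vocal modulation, and a multi-modal video generation module synthesizes coherent long-form videos.
Result: Experiments in both pose-guided and audio-driven settings show that training on TalkCuts significantly improves the cinematographic coherence and visual appeal of the generated videos.
Insight: TalkCuts provides a foundation for controllable multi-shot speech video generation and multimodal learning, and shows the potential of language models in multi-modal video generation.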
Abstract: In this work, we present TalkCuts, a large-scale dataset designed to facilitate the study of multi-shot human speech video generation. Unlike existing datasets that focus on single-shot, static viewpoints, TalkCuts offers 164k clips totaling over 500 hours of high-quality human speech videos with diverse camera shots, including close-up, half-body, and full-body views. The dataset includes detailed textual descriptions, 2D keypoints and 3D SMPL-X motion annotations, covering over 10k identities, enabling multimodal learning and evaluation. As a first attempt to showcase the value of the dataset, we present Orator, an LLM-guided multi-modal generation framework as a simple baseline, where the language model functions as a multi-faceted director, orchestrating detailed specifications for camera transitions, speaker gesticulations, and vocal modulation. This architecture enables the synthesis of coherent long-form videos through our integrated multi-modal video generation module. Extensive experiments in both pose-guided and audio-driven settings show that training on TalkCuts significantly enhances the cinematographic coherence and visual appeal of generated multi-shot speech videos. We believe TalkCuts provides a strong foundation for future work in controllable, multi-shot speech video generation and broader multimodal learning.
[119] Evaluating Fundus-Specific Foundation Models for Diabetic Macular Edema Detection
Franco Javier Arellano,José Ignacio Orlando
Main category: cs.CV
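TL;DR: The paper compares fundus-specific Foundation Models (FM) with standard transfer learning for detecting Diabetic Macular Edema (DME) and finds that in most settings FM do not clearly outperform a lightweight CNN.
Details
Motivation: DME is a leading cause of vision loss among patients with diabetes, but applying deep learning to the task is hampered by the scarcity of annotated data. FM are seen as a potential solution, yet their practical value for DME detection has not been established. Contribution: A systematic comparison of two popular FM, RETFound and FLAIR, with EfficientNet-B0 across training regimes and evaluation settings, which reveals the limitations of FM in a fine-grained ophthalmic task.
Method: RETFound, FLAIR, and EfficientNet-B0 are evaluated in multiple scenarios on the IDRiD, MESSIDOR-2, and OEFI datasets, covering both fine-tuning and zero-shot performance.
Result: EfficientNet-B0 ranks first or second in most evaluation settings, while RETFound shows promising results only on OEFI. FLAIR is competitive in the zero-shot setting, but overall FM do not consistently outperform the lightweight CNN.
Insight: In data-scarce settings a lightweight CNN remains a strong and efficient baseline, and FM may need more careful design, or better-suited tasks, to pay off.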
Abstract: Diabetic Macular Edema (DME) is a leading cause of vision loss among patients with Diabetic Retinopathy (DR). While deep learning has shown promising results for automatically detecting this condition from fundus images, its application remains challenging due to the limited availability of annotated data. Foundation Models (FM) have emerged as an alternative solution. However, it is unclear if they can cope with DME detection in particular. In this paper, we systematically compare different FM and standard transfer learning approaches for this task. Specifically, we compare the two most popular FM for retinal images (RETFound and FLAIR) and an EfficientNet-B0 backbone, across different training regimes and evaluation settings in IDRiD, MESSIDOR-2 and OCT-and-Eye-Fundus-Images (OEFI). Results show that despite their scale, FM do not consistently outperform fine-tuned CNNs in this task. In particular, an EfficientNet-B0 ranked first or second in terms of area under the ROC and precision/recall curves in most evaluation settings, with RETFound only showing promising results in OEFI. FLAIR, on the other hand, demonstrated competitive zero-shot performance, achieving notable AUC-PR scores when prompted appropriately. These findings reveal that FM might not be a good tool for fine-grained ophthalmic tasks such as DME detection even after fine-tuning, suggesting that lightweight CNNs remain strong baselines in data-scarce environments.
[120] SpecGuard: Spectral Projection-based Advanced Invisible Watermarking
Inzamamul Alam,Md Tanvir Islam,Khan Muhammad,Simon S. Woo
Main category: cs.CV
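TL;DR: SpecGuard is an invisible watermarking method based on spectral projection. It embeds the message in the frequency domain, improving robustness to a range of attacks while preserving invisibility and capacity.
Details
Motivation: Existing watermarking methods lack robustness to image transformations such as distortions, adversarial perturbations, and image regeneration, so they cannot reliably protect copyright information in real-world use. SpecGuard is proposed to address this.
Contribution: 1. A new frequency-domain embedding method that applies spectral projection to a higher-frequency band obtained by wavelet projection; 2. A strength factor that increases resistance to diverse attacks; 3. A decoder that leverages Parseval's theorem to extract the watermark.
Method: 1. Convert the image from the spatial domain to the frequency domain, decomposing the higher-frequency band by wavelet projection; 2. Apply a strength factor during encoding to improve resilience to attacks; 3. Use Parseval's theorem during decoding to improve watermark extraction.
Result: Experiments show that SpecGuard outperforms existing methods on invisibility, capacity, and robustness, particularly under geometric distortions and adversarial perturbations.
Insight: Combining frequency-domain embedding with wavelet decomposition and Parseval's theorem offers a new route to watermarks that are both robust and invisible.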
Abstract: Watermarking embeds imperceptible patterns into images for authenticity verification. However, existing methods often lack robustness against various transformations primarily including distortions, image regeneration, and adversarial perturbation, creating real-world challenges. In this work, we introduce SpecGuard, a novel watermarking approach for robust and invisible image watermarking. Unlike prior approaches, we embed the message inside hidden convolution layers by converting from the spatial domain to the frequency domain using spectral projection of a higher frequency band that is decomposed by wavelet projection. Spectral projection employs Fast Fourier Transform approximation to transform spatial data into the frequency domain efficiently. In the encoding phase, a strength factor enhances resilience against diverse attacks, including adversarial, geometric, and regeneration-based distortions, ensuring the preservation of copyrighted information. Meanwhile, the decoder leverages Parseval’s theorem to effectively learn and extract the watermark pattern, enabling accurate retrieval under challenging transformations. We evaluate the proposed SpecGuard based on the embedded watermark’s invisibility, capacity, and robustness. Comprehensive experiments demonstrate the proposed SpecGuard outperforms the state-of-the-art models. To ensure reproducibility, the full code is released on GitHub: https://github.com/inzamamulDU/SpecGuard_ICCV_2025.
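The abstract relies on Parseval's theorem, which states that a signal's energy is the same in the spatial and frequency domains (up to the transform's normalization). The following is a minimal numerical check of that identity only, not SpecGuard's decoder. The helper names `dft` and `energies` and the sample signal are illustrative assumptions, and a naive DFT stands in for the FFT.

```python
import cmath


def dft(signal):
    """Naive O(N^2) discrete Fourier transform (unnormalized forward transform)."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def energies(signal):
    """Return (spatial-domain energy, frequency-domain energy divided by N)."""
    spectrum = dft(signal)
    spatial = sum(abs(v) ** 2 for v in signal)
    # With an unnormalized forward DFT, Parseval's identity carries a 1/N factor.
    spectral = sum(abs(c) ** 2 for c in spectrum) / len(signal)
    return spatial, spectral


# Hypothetical 1D "pixel row"; any real or complex sequence works.
spatial_energy, spectral_energy = energies([0.5, -1.0, 2.0, 0.25, 0.0, 1.5, -0.75, 1.0])
print(spatial_energy, spectral_energy)
```

Because the two energies agree, a perturbation of known energy added in the frequency domain changes the image by the same energy in the spatial domain, which is one reason frequency-domain embedding can be budgeted for invisibility.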
[121] MATRIX: Mask Track Alignment for Interaction-aware Video Generation
Siyoon Jin,Seongchan Kim,Dahyun Chung,Jaeho Lee,Hyunwook Choi,Jisu Nam,Jiyoung Kim,Seungryong Kim
Main category: cs.CV
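TL;DR: This paper proposes MATRIX, which aligns the attention of video DiTs with multi-instance mask tracks to improve the generation of multi-instance and subject-object interactions, and introduces the InterGenEval evaluation protocol.
Details
Motivation: Current video DiTs struggle to model multi-instance or subject-object interactions. The paper investigates how these models represent interactions internally and proposes an improvement.
Contribution: 1) The MATRIX-11K dataset; 2) the MATRIX regularization; 3) the InterGenEval evaluation protocol.
Method: Interaction representations are analyzed through video-to-text attention (semantic grounding) and video-to-video attention (semantic propagation). The MATRIX regularization then aligns attention in specific layers with mask tracks to strengthen both.
Result: MATRIX improves interaction fidelity and semantic alignment while reducing drift and hallucination.
Insight: Interaction-dominant attention is concentrated in a small subset of layers; aligning those layers specifically improves interaction modeling.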
Abstract: Video DiTs have advanced video generation, yet they still struggle to model multi-instance or subject-object interactions. This raises a key question: How do these models internally represent interactions? To answer this, we curate MATRIX-11K, a video dataset with interaction-aware captions and multi-instance mask tracks. Using this dataset, we conduct a systematic analysis that formalizes two perspectives of video DiTs: semantic grounding, via video-to-text attention, which evaluates whether noun and verb tokens capture instances and their relations; and semantic propagation, via video-to-video attention, which assesses whether instance bindings persist across frames. We find both effects concentrate in a small subset of interaction-dominant layers. Motivated by this, we introduce MATRIX, a simple and effective regularization that aligns attention in specific layers of video DiTs with multi-instance mask tracks from the MATRIX-11K dataset, enhancing both grounding and propagation. We further propose InterGenEval, an evaluation protocol for interaction-aware video generation. In experiments, MATRIX improves both interaction fidelity and semantic alignment while reducing drift and hallucination. Extensive ablations validate our design choices. Codes and weights will be released.
[122] WristWorld: Generating Wrist-Views via 4D World Models for Robotic Manipulation
Zezhong Qian,Xiaowei Chi,Yuming Li,Shizun Wang,Zhiyuan Qin,Xiaozhu Ju,Sirui Han,Shanghang Zhang
Main category: cs.CV
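TL;DR: WristWorld is a 4D world model, the first to generate wrist-view videos from anchor views alone. Its geometric and temporal consistency improves robotic manipulation performance.
Details
Motivation: Wrist-view recordings are scarce in large-scale datasets, leaving a large gap between anchor views and wrist views that existing world models cannot bridge.
Contribution: WristWorld, which bridges the gap between anchor and wrist views through geometrically consistent wrist-view pose estimation and a video generation model.
Method: Two stages: (i) a reconstruction stage that extends VGGT and adds the SPC loss to estimate poses and 4D point clouds; (ii) a generation stage that synthesizes temporally coherent wrist-view videos.
Result: State-of-the-art video generation on Droid, Calvin, and Franka Panda, with improved VLA performance: the average task completion length on Calvin rises by 3.81% and 42.4% of the anchor-wrist view gap is closed.
Insight: Geometric and cross-view priors can handle extreme viewpoint shifts, opening a new direction for visual perception in robotic manipulation.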
Abstract: Wrist-view observations are crucial for VLA models as they capture fine-grained hand-object interactions that directly enhance manipulation performance. Yet large-scale datasets rarely include such recordings, resulting in a substantial gap between abundant anchor views and scarce wrist views. Existing world models cannot bridge this gap, as they require a wrist-view first frame and thus fail to generate wrist-view videos from anchor views alone. Amid this gap, recent visual geometry models such as VGGT emerge with geometric and cross-view priors that make it possible to address extreme viewpoint shifts. Inspired by these insights, we propose WristWorld, the first 4D world model that generates wrist-view videos solely from anchor views. WristWorld operates in two stages: (i) Reconstruction, which extends VGGT and incorporates our Spatial Projection Consistency (SPC) Loss to estimate geometrically consistent wrist-view poses and 4D point clouds; (ii) Generation, which employs our video generation model to synthesize temporally coherent wrist-view videos from the reconstructed perspective. Experiments on Droid, Calvin, and Franka Panda demonstrate state-of-the-art video generation with superior spatial consistency, while also improving VLA performance, raising the average task completion length on Calvin by 3.81% and closing 42.4% of the anchor-wrist view gap.
[123] Pixel-Perfect Depth with Semantics-Prompted Diffusion Transformers
Gangwei Xu,Haotong Lin,Hongcheng Luo,Xianqi Wang,Jingfeng Yao,Lianghui Zhu,Yuechuan Pu,Cheng Chi,Haiyang Sun,Bing Wang,Guang Chen,Hangjun Ye,Sida Peng,Xin Yang
Main category: cs.CV
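TL;DR: This paper proposes Pixel-Perfect Depth, a monocular depth estimation model based on pixel-space diffusion generation. By avoiding the flying pixels introduced by VAE compression, it produces high-quality point clouds.
Details
Motivation: Current generative depth estimation methods use a VAE to compress depth maps into latent space, which causes flying pixels at edges and details. The paper addresses this by generating directly in pixel space.
Contribution: 1) Diffusion generation performed directly in pixel space, avoiding VAE-induced artifacts; 2) Semantics-Prompted Diffusion Transformers (SP-DiT) and a Cascade DiT Design, which improve global semantic consistency and fine detail.
Method: 1) SP-DiT feeds semantic representations from vision foundation models into DiT to prompt the diffusion process; 2) Cascade DiT progressively increases the number of tokens to improve efficiency and accuracy.
Result: Best performance among published generative models across five benchmarks, and significantly better than all other models in edge-aware point cloud evaluation.
Insight: Pixel-space generation avoids the artifacts of latent compression but needs an efficient design; incorporating semantic information markedly improves the global consistency of depth maps.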
Abstract: This paper presents Pixel-Perfect Depth, a monocular depth estimation model based on pixel-space diffusion generation that produces high-quality, flying-pixel-free point clouds from estimated depth maps. Current generative depth estimation models fine-tune Stable Diffusion and achieve impressive performance. However, they require a VAE to compress depth maps into latent space, which inevitably introduces flying pixels at edges and details. Our model addresses this challenge by directly performing diffusion generation in the pixel space, avoiding VAE-induced artifacts. To overcome the high complexity associated with pixel-space generation, we introduce two novel designs: 1) Semantics-Prompted Diffusion Transformers (SP-DiT), which incorporate semantic representations from vision foundation models into DiT to prompt the diffusion process, thereby preserving global semantic consistency while enhancing fine-grained visual details; and 2) Cascade DiT Design that progressively increases the number of tokens to further enhance efficiency and accuracy. Our model achieves the best performance among all published generative models across five benchmarks, and significantly outperforms all other models in edge-aware point cloud evaluation.
[124] Quantum-enhanced Computer Vision: Going Beyond Classical Algorithms
Natacha Kuete Meli,Shuteng Wang,Marcel Seelbach Benkner,Michele Sasdelli,Tat-Jun Chin,Tolga Birdal,Michael Moeller,Vladislav Golyanik
Main category: cs.CV
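TL;DR: This paper surveys the emerging field of Quantum-enhanced Computer Vision (QeCV), reviewing the potential and applications of quantum computing methods in computer vision.
Details
Motivation: In some scenarios classical (non-quantum) methods cannot find a solution in reasonable time or yield only approximate solutions. Quantum computing, by exploiting quantum-mechanical effects, may offer better time scalability or better solutions for such problems.
Contribution: A holistic quantum computing reference for the computer vision community that introduces the foundations, methods, and tools of QeCV and discusses its open challenges and social implications.
Method: A survey covering the two main quantum computational paradigms, gate-based quantum computing and quantum annealing, together with concrete QeCV techniques and implementation tools.
Result: The paper reviews existing quantum computing tools and learning materials and discusses publishing and reviewing QeCV papers as well as social implications.
Insight: Quantum computing has considerable potential in computer vision, but fundamentally new algorithms must be developed to be compatible with quantum hardware and to realize that potential.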
Abstract: Quantum-enhanced Computer Vision (QeCV) is a new research field at the intersection of computer vision, optimisation theory, machine learning and quantum computing. It has high potential to transform how visual signals are processed and interpreted with the help of quantum computing that leverages quantum-mechanical effects in computations inaccessible to classical (i.e. non-quantum) computers. In scenarios where existing non-quantum methods cannot find a solution in a reasonable time or compute only approximate solutions, quantum computers can provide, among others, advantages in terms of better time scalability for multiple problem classes. Parametrised quantum circuits can also become, in the long term, a considerable alternative to classical neural networks in computer vision. However, specialised and fundamentally new algorithms must be developed to enable compatibility with quantum hardware and unveil the potential of quantum computational paradigms in computer vision. This survey contributes to the existing literature on QeCV with a holistic review of this research field. It is designed as a quantum computing reference for the computer vision community, targeting computer vision students, scientists and readers with related backgrounds who want to familiarise themselves with QeCV. We provide a comprehensive introduction to QeCV, its specifics, and methodologies for formulations compatible with quantum hardware and QeCV methods, leveraging two main quantum computational paradigms, i.e. gate-based quantum computing and quantum annealing. We elaborate on the operational principles of quantum computers and the available tools to access, program and simulate them in the context of QeCV. Finally, we review existing quantum computing tools and learning materials and discuss aspects related to publishing and reviewing QeCV papers, open challenges and potential social implications.
[125] Temporal Prompting Matters: Rethinking Referring Video Object Segmentation
Ci-Siang Lin,Min-Hung Chen,I-Jieh Liu,Chien-Yi Wang,Sifei Liu,Yu-Chiang Frank Wang
Main category: cs.CV
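TL;DR: This paper decomposes the RVOS task and addresses it with the Temporal Prompt Generation and Selection (Tenet) framework, which uses foundation segmentation models and off-the-shelf detectors and trackers to produce temporal prompts, and uses Prompt Preference Learning to assess their quality, enabling efficient adaptation.
Details
Motivation: Existing RVOS methods usually require end-to-end training with dense mask annotations, which is computationally costly and scales poorly. The authors rethink the key factors of the task and aim for an efficient solution built on existing foundation models and external tools.
Contribution: 1) Decomposition of RVOS into referring, video, and segmentation factors. 2) The Tenet framework, comprising temporal prompt generation and selection and Prompt Preference Learning. 3) Efficient adaptation of image-based foundation segmentation models to RVOS without end-to-end training on dense annotations.
Method: 1) Off-the-shelf object detectors and trackers produce temporal prompts. 2) Prompt Preference Learning evaluates prompt quality. 3) The selected prompts instruct an image-based foundation segmentation model to produce high-quality masks.
Result: Experiments on RVOS benchmarks demonstrate the effectiveness of the Tenet framework.
Insight: The quality and selection of temporal prompts are critical for RVOS; decomposing the task and using external tools can cut computational cost and make model adaptation more efficient.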
Abstract: Referring Video Object Segmentation (RVOS) aims to segment the object referred to by the query sentence in the video. Most existing methods require end-to-end training with dense mask annotations, which could be computation-consuming and less scalable. In this work, we rethink the RVOS problem and aim to investigate the key to this task. Based on existing foundation segmentation models, we decompose the RVOS task into referring, video, and segmentation factors, and propose a Temporal Prompt Generation and Selection (Tenet) framework to address the referring and video factors while leaving the segmentation problem to foundation models. To efficiently adapt image-based foundation segmentation models to referring video object segmentation, we leverage off-the-shelf object detectors and trackers to produce temporal prompts associated with the referring sentence. While high-quality temporal prompts could be produced, they can not be easily identified from confidence scores. To tackle this issue, we propose Prompt Preference Learning to evaluate the quality of the produced temporal prompts. By taking such prompts to instruct image-based foundation segmentation models, we would be able to produce high-quality masks for the referred object, enabling efficient model adaptation to referring video object segmentation. Experiments on RVOS benchmarks demonstrate the effectiveness of the Tenet framework.
eess.IV [Back]
[126] Stacked Regression using Off-the-shelf, Stimulus-tuned and Fine-tuned Neural Networks for Predicting fMRI Brain Responses to Movies (Algonauts 2025 Report)
Robert Scholz,Kunal Bagga,Christine Ahrends,Carlo Alberto Barbano
Main category: eess.IV
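TL;DR: This report describes a submission to the Algonauts 2025 Challenge that integrates multimodal representations (large language models, video encoders, audio models, and vision-language models) and combines their predictions with stacked regression to predict fMRI brain responses to movie stimuli.
Details
Motivation: The goal is a multimodal encoding model that accurately predicts brain responses to movie stimuli, combining existing models with fine-tuning strategies to improve prediction performance.
Contribution: The main contributions are:
- integration of multimodal representations (language, video, audio, and vision-language models);
- a prediction-fusion method based on stacked regression;
- public release of code and resources to support the development of multimodal brain encoding models.
Method: The approach:
- uses both off-the-shelf and fine-tuned neural network models;
- enhances textual inputs with detailed transcripts and summaries;
- applies stimulus-tuning and fine-tuning to language and vision models;
- fuses the predictions of the individual models with stacked regression.
Result: The submission, under the team name Seinfeld, ranked 10th in the challenge.
Insight: Fusing multimodal representations and tuning the underlying models can improve the prediction of brain responses, suggesting directions for future brain encoding models.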
Abstract: We present our submission to the Algonauts 2025 Challenge, where the goal is to predict fMRI brain responses to movie stimuli. Our approach integrates multimodal representations from large language models, video encoders, audio models, and vision-language models, combining both off-the-shelf and fine-tuned variants. To improve performance, we enhanced textual inputs with detailed transcripts and summaries, and we explored stimulus-tuning and fine-tuning strategies for language and vision models. Predictions from individual models were combined using stacked regression, yielding solid results. Our submission, under the team name Seinfeld, ranked 10th. We make all code and resources publicly available, contributing to ongoing efforts in developing multimodal encoding models for brain activity.
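As a rough illustration of the stacked regression mentioned in the abstract, the sketch below fits second-level weights that combine held-out predictions from several base models into one prediction. It is a generic, minimal formulation and not the team's pipeline: the function names, the tiny ridge term, and the toy numbers are all assumptions made for illustration.

```python
def solve_linear(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_stacking_weights(base_preds, target, ridge=1e-8):
    """Least-squares weights over base-model predictions (normal equations).

    base_preds: one list of held-out predictions per base model.
    target: measured responses for the same samples.
    ridge: small diagonal term for numerical stability (an assumed value).
    """
    k = len(base_preds)
    gram = [
        [
            sum(p * q for p, q in zip(base_preds[i], base_preds[j]))
            + (ridge if i == j else 0.0)
            for j in range(k)
        ]
        for i in range(k)
    ]
    rhs = [sum(p * y for p, y in zip(base_preds[i], target)) for i in range(k)]
    return solve_linear(gram, rhs)


def stack_predict(weights, base_preds):
    """Weighted combination of base-model predictions, sample by sample."""
    n_samples = len(base_preds[0])
    return [sum(w * p[i] for w, p in zip(weights, base_preds)) for i in range(n_samples)]


# Toy example: two hypothetical base models whose 0.7/0.3 blend matches the target.
model_a = [1.0, 2.0, 3.0, 4.0]
model_b = [2.0, 1.0, 4.0, 3.0]
target = [0.7 * a + 0.3 * b for a, b in zip(model_a, model_b)]
weights = fit_stacking_weights([model_a, model_b], target)
stacked = stack_predict(weights, [model_a, model_b])
```

In practice the weights would be fit on predictions for data the base models did not train on, and typically separately per brain region or voxel, so that the second level does not simply reward overfitting.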
[127] A Total Variation Regularized Framework for Epilepsy-Related MRI Image Segmentation
Mehdi Rabiee,Sergio Greco,Reza Shahbazian,Irina Trubitsyna
Main category: eess.IV
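TL;DR: This paper proposes a new framework for segmenting focal cortical dysplasia (FCD) regions in 3D brain MRI, combining a transformer-enhanced encoder-decoder with an anisotropic total variation (TV) loss term to improve segmentation accuracy and consistency.
Details
Motivation: FCD lesions are subtle and small, which makes them hard to detect in brain MRI, and existing methods fall short in handling 3D multimodal inputs and in producing smooth outputs.
Contribution: 1. A new framework whose loss combines Dice loss with an anisotropic TV term; 2. Improved accuracy and consistency of FCD segmentation.
Method: A transformer-enhanced encoder-decoder architecture trained with a new loss function that combines Dice loss with an anisotropic TV term.
Result: On a public dataset of 85 epilepsy patients, the Dice coefficient improves by 11.9%, precision by 13.3%, and the number of false-positive clusters drops by 61.6% relative to the baseline model.
Insight: Adding the TV term improves the spatial smoothness of the segmentation and reduces false-positive clusters without post-processing.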
Abstract: Focal Cortical Dysplasia (FCD) is a primary cause of drug-resistant epilepsy and is difficult to detect in brain magnetic resonance imaging (MRI) due to the subtle and small-scale nature of its lesions. Accurate segmentation of FCD regions in 3D multimodal brain MRI images is essential for effective surgical planning and treatment. However, this task remains highly challenging due to the limited availability of annotated FCD datasets, the extremely small size and weak contrast of FCD lesions, the complexity of handling 3D multimodal inputs, and the need for output smoothness and anatomical consistency, which is often not addressed by standard voxel-wise loss functions. This paper presents a new framework for segmenting FCD regions in 3D brain MRI images. We adopt state-of-the-art transformer-enhanced encoder-decoder architecture and introduce a novel loss function combining Dice loss with an anisotropic Total Variation (TV) term. This integration encourages spatial smoothness and reduces false positive clusters without relying on post-processing. The framework is evaluated on a public FCD dataset with 85 epilepsy patients and demonstrates superior segmentation accuracy and consistency compared to standard loss formulations. The model with the proposed TV loss shows an 11.9% improvement on the Dice coefficient and 13.3% higher precision over the baseline model. Moreover, the number of false positive clusters is reduced by 61.6%.
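To make the loss described above concrete, here is a minimal sketch of a soft Dice loss plus an anisotropic TV penalty on a 3D probability volume stored as nested lists. The exact formulation and weighting used in the paper are not given in the abstract, so the forward-difference TV, the `tv_weight` value, and the toy volumes are assumptions.

```python
def dice_loss(pred, target, eps=1e-6):
    """Soft Dice loss between two 3D volumes with values in [0, 1]."""
    flat_p = [v for plane in pred for row in plane for v in row]
    flat_t = [v for plane in target for row in plane for v in row]
    intersection = sum(p * t for p, t in zip(flat_p, flat_t))
    return 1.0 - (2.0 * intersection + eps) / (sum(flat_p) + sum(flat_t) + eps)


def anisotropic_tv(volume):
    """Sum of absolute forward differences along each of the three axes."""
    depth, height, width = len(volume), len(volume[0]), len(volume[0][0])
    total = 0.0
    for z in range(depth):
        for y in range(height):
            for x in range(width):
                v = volume[z][y][x]
                if z + 1 < depth:
                    total += abs(volume[z + 1][y][x] - v)
                if y + 1 < height:
                    total += abs(volume[z][y + 1][x] - v)
                if x + 1 < width:
                    total += abs(volume[z][y][x + 1] - v)
    return total


def combined_loss(pred, target, tv_weight=0.01):
    """Dice loss plus a weighted TV penalty; tv_weight is an assumed value."""
    return dice_loss(pred, target) + tv_weight * anisotropic_tv(pred)


def empty_volume(size=3):
    return [[[0.0] * size for _ in range(size)] for _ in range(size)]


# Toy 3x3x3 example: the lesion is the single centre voxel.
target = empty_volume()
target[1][1][1] = 1.0
clean_pred = empty_volume()
clean_pred[1][1][1] = 1.0
noisy_pred = empty_volume()
noisy_pred[1][1][1] = 1.0
noisy_pred[0][0][0] = 1.0  # an isolated false-positive voxel

clean_loss = combined_loss(clean_pred, target)
noisy_loss = combined_loss(noisy_pred, target)
```

The isolated false positive is penalized twice, once through the lower Dice overlap and once through the extra boundary it adds to the TV term, which matches the intuition that the TV term discourages scattered false-positive clusters.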
[128] SER-Diff: Synthetic Error Replay Diffusion for Incremental Brain Tumor Segmentation
Sashank Makanaboyina
Main category: eess.IV
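TL;DR: SER-Diff is a framework that combines diffusion models with incremental learning, using synthetic error replay to counter catastrophic forgetting in brain tumor segmentation; it achieves the best results on several datasets.
Details
Motivation: Incremental learning is essential for brain tumor segmentation models that must adapt to evolving clinical data, but catastrophic forgetting limits performance. Existing incremental learning frameworks rely heavily on generative replay or auxiliary storage, while diffusion models, though promising for refinement, have not been explored in incremental learning.
Contribution: SER-Diff, the first framework to unify diffusion-based refinement with incremental learning. A frozen teacher diffusion model generates synthetic error maps that are replayed during training, and a dual-loss formulation balances learning new tasks against retaining old knowledge.
Method: A teacher model generates synthetic error maps from past tasks, which are replayed while training on new tasks. The loss combines Dice loss on new data (adaptability) with a knowledge distillation loss on replayed errors (retention).
Result: On BraTS2020, BraTS2021, and BraTS2023, SER-Diff reaches Dice scores of 95.8%, 94.9%, and 94.6% with the lowest HD95 values (4.4 mm, 4.7 mm, and 4.9 mm), outperforming prior methods.
Insight: Pairing the fine-grained generative ability of diffusion models with the needs of incremental learning both mitigates catastrophic forgetting and improves segmentation accuracy and anatomical coherence.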
Abstract: Incremental brain tumor segmentation is critical for models that must adapt to evolving clinical datasets without retraining on all prior data. However, catastrophic forgetting, where models lose previously acquired knowledge, remains a major obstacle. Recent incremental learning frameworks with knowledge distillation partially mitigate forgetting but rely heavily on generative replay or auxiliary storage. Meanwhile, diffusion models have proven effective for refining tumor segmentations, but have not been explored in incremental learning contexts. We propose Synthetic Error Replay Diffusion (SER-Diff), the first framework that unifies diffusion-based refinement with incremental learning. SER-Diff leverages a frozen teacher diffusion model to generate synthetic error maps from past tasks, which are replayed during training on new tasks. A dual-loss formulation combining Dice loss for new data and knowledge distillation loss for replayed errors ensures both adaptability and retention. Experiments on BraTS2020, BraTS2021, and BraTS2023 demonstrate that SER-Diff consistently outperforms prior methods. It achieves the highest Dice scores of 95.8%, 94.9%, and 94.6%, along with the lowest HD95 values of 4.4 mm, 4.7 mm, and 4.9 mm, respectively. These results indicate that SER-Diff not only mitigates catastrophic forgetting but also delivers more accurate and anatomically coherent segmentations across evolving datasets.
[129] Conditional Denoising Diffusion Model-Based Robust MR Image Reconstruction from Highly Undersampled Data
Mohammed Alsubaie,Wenxi Liu,Linxia Gu,Ovidiu C. Andronesi,Sirani M. Perera,Xianqi Li
Main category: eess.IV
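TL;DR: This paper proposes a robust MRI reconstruction method based on a conditional denoising diffusion model. It embeds the measurement model in every reverse diffusion step and trains on paired undersampled/ground-truth data, improving both the pixel-level fidelity and the perceptual realism of reconstructed images.
Details
Motivation: Long acquisition time is a major limitation of MRI in clinical use, and undersampling, while faster, introduces artifacts and degrades image quality. Diffusion models are promising, but most existing approaches either use unsupervised score functions without paired supervision or apply data consistency only as a post-processing step.
Contribution: A conditional denoising diffusion framework that embeds the measurement model directly in every reverse diffusion step and is trained with paired supervision, combining generative flexibility with explicit enforcement of MRI physics.
Method: A conditional denoising diffusion model with iterative data-consistency correction, in which the measurement model is applied at each reverse diffusion step.
Result: On the fastMRI dataset the method outperforms recent deep learning and diffusion-based methods in SSIM, PSNR, and LPIPS, with LPIPS capturing the perceptual improvements more faithfully.
Insight: Combining conditional supervision with iterative consistency updates gives a more robust and practical approach to accelerated MRI reconstruction, improving quality at both the pixel and perceptual levels.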
Abstract: Magnetic Resonance Imaging (MRI) is a critical tool in modern medical diagnostics, yet its prolonged acquisition time remains a critical limitation, especially in time-sensitive clinical scenarios. While undersampling strategies can accelerate image acquisition, they often result in image artifacts and degraded quality. Recent diffusion models have shown promise for reconstructing high-fidelity images from undersampled data by learning powerful image priors; however, most existing approaches either (i) rely on unsupervised score functions without paired supervision or (ii) apply data consistency only as a post-processing step. In this work, we introduce a conditional denoising diffusion framework with iterative data-consistency correction, which differs from prior methods by embedding the measurement model directly into every reverse diffusion step and training the model on paired undersampled-ground truth data. This hybrid design bridges generative flexibility with explicit enforcement of MRI physics. Experiments on the fastMRI dataset demonstrate that our framework consistently outperforms recent state-of-the-art deep learning and diffusion-based methods in SSIM, PSNR, and LPIPS, with LPIPS capturing perceptual improvements more faithfully. These results demonstrate that integrating conditional supervision with iterative consistency updates yields substantial improvements in both pixel-level fidelity and perceptual realism, establishing a principled and practical advance toward robust, accelerated MRI reconstruction.
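The data-consistency correction described above can be illustrated with the common "hard projection" form: transform the current estimate to k-space, overwrite the sampled locations with the measured values, and transform back. This is a generic 1D sketch under that assumption and may differ from the paper's exact update; the diffusion denoiser itself is not modelled, and all names and values below are illustrative.

```python
import cmath


def dft(x, inverse=False):
    """Naive DFT; the inverse transform carries the 1/N normalization."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [
        sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]
    return [v / n for v in out] if inverse else out


def data_consistency(estimate, measured_kspace, mask):
    """Replace sampled k-space entries of the estimate with the measurements."""
    k_est = dft(estimate)
    k_dc = [m if sampled else e for e, m, sampled in zip(k_est, measured_kspace, mask)]
    return dft(k_dc, inverse=True)


# Toy 1D "image" and an undersampling mask that keeps 4 of 8 k-space samples.
truth = [0.0, 1.0, 3.0, 2.0, 0.5, 0.0, 1.5, 1.0]
mask = [True, True, False, False, False, False, True, True]
measured = [k if s else 0.0 for k, s in zip(dft(truth), mask)]

# Stand-in for the output of one reverse diffusion step (purely hypothetical).
denoised_guess = [0.2, 0.9, 2.5, 2.2, 0.4, 0.1, 1.2, 1.1]
corrected = data_consistency(denoised_guess, measured, mask)
corrected_kspace = dft(corrected)
```

Applying this projection after every reverse step, rather than once at the end, keeps each intermediate sample compatible with the acquired k-space data, which is the distinction the abstract draws with post-processing approaches.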
cs.CR [Back]
[130] Reading Between the Lines: Towards Reliable Black-box LLM Fingerprinting via Zeroth-order Gradient Estimation
Shuo Shao,Yiming Li,Hongwei Yao,Yifei Chen,Yuchen Yang,Zhan Qin
Main category: cs.CR
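TL;DR: This paper proposes ZeroPrint, which uses zeroth-order gradient estimation to fingerprint LLMs in a black-box setting, addressing the information loss that existing black-box fingerprinting methods suffer by relying on model outputs passed through non-linear functions.
Details
Motivation: LLMs are valuable intellectual property and need reliable copyright protection, but existing black-box fingerprinting methods rely on model outputs and often fail to produce distinctive fingerprints.
Contribution: 1. A formal argument, based on Fisher information theory, that input gradients are more informative for fingerprinting than outputs; 2. ZeroPrint, which approximates these gradients in a black-box setting with zeroth-order estimation.
Method: ZeroPrint simulates input perturbations through semantic-preserving word substitutions and estimates the model's Jacobian matrix as the fingerprint.
Result: On the standard benchmark, ZeroPrint significantly outperforms existing black-box methods in both effectiveness and robustness.
Insight: Input gradients reflect a model's unique parameters better than its outputs do, making them a more effective feature for LLM fingerprinting.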
Abstract: The substantial investment required to develop Large Language Models (LLMs) makes them valuable intellectual property, raising significant concerns about copyright protection. LLM fingerprinting has emerged as a key technique to address this, which aims to verify a model’s origin by extracting an intrinsic, unique signature (a “fingerprint”) and comparing it to that of a source model to identify illicit copies. However, existing black-box fingerprinting methods often fail to generate distinctive LLM fingerprints. This ineffectiveness arises because black-box methods typically rely on model outputs, which lose critical information about the model’s unique parameters due to the usage of non-linear functions. To address this, we first leverage Fisher Information Theory to formally demonstrate that the gradient of the model’s input is a more informative feature for fingerprinting than the output. Based on this insight, we propose ZeroPrint, a novel method that approximates these information-rich gradients in a black-box setting using zeroth-order estimation. ZeroPrint overcomes the challenge of applying this to discrete text by simulating input perturbations via semantic-preserving word substitutions. This operation allows ZeroPrint to estimate the model’s Jacobian matrix as a unique fingerprint. Experiments on the standard benchmark show ZeroPrint achieves a state-of-the-art effectiveness and robustness, significantly outperforming existing black-box methods.
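Zeroth-order estimation recovers gradient information from function evaluations alone, which is what makes it usable against a black-box model. The sketch below shows the idea on a continuous toy function using central differences. ZeroPrint itself perturbs discrete text through word substitutions, which this example does not attempt to model, and `black_box` is a made-up stand-in for a queried model.

```python
def zeroth_order_gradient(f, x, eps=1e-4):
    """Estimate the gradient of f at x using only evaluations of f."""
    grad = []
    for i in range(len(x)):
        plus = x[:i] + [x[i] + eps] + x[i + 1:]
        minus = x[:i] + [x[i] - eps] + x[i + 1:]
        # Central difference along coordinate i.
        grad.append((f(plus) - f(minus)) / (2 * eps))
    return grad


def black_box(x):
    """Hypothetical opaque model: we may call it but not inspect it."""
    return x[0] ** 2 + 3 * x[0] * x[1] - 2 * x[1]


estimated = zeroth_order_gradient(black_box, [1.0, 2.0])
# Analytic gradient at (1, 2), for comparison: [2*1 + 3*2, 3*1 - 2] = [8, 1].
```

For a model with vector-valued outputs, repeating this per output coordinate yields an estimate of the Jacobian, which is the quantity the abstract describes using as a fingerprint.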
[131] RedTWIZ: Diverse LLM Red Teaming via Adaptive Attack Planning
Artur Horal,Daniel Pina,Henrique Paz,Iago Paulo,João Soares,Rafael Ferreira,Diogo Tavares,Diogo Glória-Silva,João Magalhães,David Semedo
Main category: cs.CR
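TL;DR: The paper presents RedTWIZ, a framework for diverse red teaming of large language models (LLMs) through adaptive attack planning, used to audit their robustness in AI-assisted software development.
Details
Motivation: Robustness evaluation of LLMs under adversarial attack currently lacks systematic coverage and diversity, so a method that generates multi-turn, goal-oriented attack strategies is needed.
Contribution: 1. Systematic assessment of LLM conversational jailbreaks; 2. A diverse generative multi-turn attack suite; 3. A hierarchical attack planner that adaptively triggers attacks.
Method: A unified framework combining assessment, attack generation, and strategic planning, in which a hierarchical planner adaptively tailors attacks to a specific LLM's vulnerabilities.
Result: Experiments show that RedTWIZ's multi-turn adversarial attack strategies lead state-of-the-art LLMs to produce unsafe generations, exposing weaknesses in their robustness.
Insight: More research is needed on improving LLM robustness, especially adversarial defenses in multi-turn conversations.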
Abstract: This paper presents the vision, scientific contributions, and technical details of RedTWIZ: an adaptive and diverse multi-turn red teaming framework, to audit the robustness of Large Language Models (LLMs) in AI-assisted software development. Our work is driven by three major research streams: (1) robust and systematic assessment of LLM conversational jailbreaks; (2) a diverse generative multi-turn attack suite, supporting compositional, realistic and goal-oriented jailbreak conversational strategies; and (3) a hierarchical attack planner, which adaptively plans, serializes, and triggers attacks tailored to specific LLM’s vulnerabilities. Together, these contributions form a unified framework – combining assessment, attack generation, and strategic planning – to comprehensively evaluate and expose weaknesses in LLMs’ robustness. Extensive evaluation is conducted to systematically assess and analyze the performance of the overall system and each component. Experimental results demonstrate that our multi-turn adversarial attack strategies can successfully lead state-of-the-art LLMs to produce unsafe generations, highlighting the pressing need for more research into enhancing LLM’s robustness.
cs.HC [Back]
[132] GPT-5 Model Corrected GPT-4V’s Chart Reading Errors, Not Prompting
Kaichun Yang,Jian Chen
Main category: cs.HC
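TL;DR: The paper presents a quantitative evaluation of how the choice of zero-shot large language model (LLM) and the prompting affect chart reading, comparing the agentic GPT-5 with the multimodal GPT-4V. GPT-5 greatly improved accuracy, while prompt variants had only small effects.
Details
Motivation: The study aims to understand how zero-shot LLMs and prompting affect performance on chart reading tasks, especially on difficult image instances that GPT-4V could not handle.
Contribution: A quantitative comparison of the effect of the model (GPT-5 versus GPT-4V) against that of prompt variants on chart reading, showing that model improvements matter more than prompt engineering.
Method: LLMs answer 107 visualization questions, and the inference accuracy of GPT-5 and multimodal GPT-4V is compared on difficult image instances where GPT-4V failed to produce correct answers.
Result: GPT-5 substantially improved accuracy while prompt variants yielded only small effects, indicating that model architecture is the dominant factor in inference accuracy.
Insight: For complex tasks such as chart reading, model improvements (as in GPT-5) matter more than prompt engineering.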
Abstract: We present a quantitative evaluation to understand the effect of zero-shot large language models (LLMs) and prompting on chart reading tasks. We asked LLMs to answer 107 visualization questions to compare inference accuracies between the agentic GPT-5 and multimodal GPT-4V on difficult image instances where GPT-4V failed to produce correct answers. Our results show that model architecture dominates the inference accuracy: GPT-5 largely improved accuracy, while prompt variants yielded only small effects. Pre-registration of this work is available here: https://osf.io/u78td/?view_only=6b075584311f48e991c39335c840ded3; the Google Drive materials are here: https://drive.google.com/file/d/1ll8WWZDf7cCNcfNWrLViWt8GwDNSvVrp/view.
cs.IR [Back]
[133] Crossing Domains without Labels: Distant Supervision for Term Extraction
Elena Senger,Yuri Campbell,Rob van der Goot,Barbara Plank
Main category: cs.IR
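TL;DR: This paper proposes a term extraction method based on distant supervision and large language models (LLMs), addressing the reliance of existing methods on expensive human annotation and their difficulty with domain transfer.
Details
Motivation: Existing automatic term extraction (ATE) methods depend on extensive human annotation and transfer poorly across domains, which limits practical deployment. More robust, scalable solutions and more realistic evaluation settings are needed.
Contribution: 1. A comprehensive benchmark spanning seven domains; 2. An LLM-based term extraction model that outperforms supervised cross-domain models and few-shot baselines and is competitive with its GPT-4o teacher; 3. Lightweight post-hoc heuristics that improve document-level consistency.
Method: 1. A black-box LLM generates pseudo-labels to support generalization; 2. The first LLMs for ATE are fine-tuned on this data; 3. Lightweight post-hoc heuristics refine the output.
Result: The method exceeds previous approaches on 5 of 7 domains, with an average improvement of 10 percentage points.
Insight: 1. Distant supervision with pseudo-labels can reduce dependence on human annotation; 2. LLMs perform well on term extraction, especially across domains; 3. Lightweight post-processing is important for document-level consistency.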
Abstract: Automatic Term Extraction (ATE) is a critical component in downstream NLP tasks such as document tagging, ontology construction and patent analysis. Current state-of-the-art methods require expensive human annotation and struggle with domain transfer, limiting their practical deployment. This highlights the need for more robust, scalable solutions and realistic evaluation settings. To address this, we introduce a comprehensive benchmark spanning seven diverse domains, enabling performance evaluation at both the document- and corpus-levels. Furthermore, we propose a robust LLM-based model that outperforms both supervised cross-domain encoder models and few-shot learning baselines and performs competitively with its GPT-4o teacher on this benchmark. The first step of our approach is generating pseudo-labels with this black-box LLM on general and scientific domains to ensure generalizability. Building on this data, we fine-tune the first LLMs for ATE. To further enhance document-level consistency, oftentimes needed for downstream tasks, we introduce lightweight post-hoc heuristics. Our approach exceeds previous approaches on 5/7 domains with an average improvement of 10 percentage points. We release our dataset and fine-tuned models to support future research in this area.
cs.GR [Back]
[134] Capture and Interact: Rapid 3D Object Acquisition and Rendering with Gaussian Splatting in Unity
Islomjon Shukhratov,Sergey Gorinsky
Main category: cs.GR
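TL;DR: This paper presents an end-to-end pipeline based on 3D Gaussian Splatting (3D GS) for rapid capture and real-time rendering of 3D objects, aimed at AR, digital twins, and similar applications. It combines mobile scanning, cloud processing, and interactive rendering in Unity at an average of 150 fps.
Details
Motivation: Real-time capture and rendering of 3D objects has great potential in applications such as augmented reality and digital twins, but existing methods struggle with speed and interactivity.
Contribution: 1) An end-to-end pipeline combining mobile capture, cloud-based 3D GS, and Unity rendering; 2) Rapid 3D object acquisition (about 10 minutes of processing) and high-frame-rate interactive rendering (150 fps).
Method: 1) Scan the object with a smartphone video; 2) Reconstruct it in the cloud with 3D GS; 3) Render it in real time in Unity.
Result: Experiments show that processing a scan takes about 10 minutes on a GPU and that rendering on a laptop runs in real time at an average of 150 fps.
Insight: Combining mobile devices with cloud computing enables efficient capture and interactive rendering of 3D objects, which is useful for AR and remote collaboration.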
Abstract: Capturing and rendering three-dimensional (3D) objects in real time remain a significant challenge, yet hold substantial potential for applications in augmented reality, digital twin systems, remote collaboration and prototyping. We present an end-to-end pipeline that leverages 3D Gaussian Splatting (3D GS) to enable rapid acquisition and interactive rendering of real-world objects using a mobile device, cloud processing and a local computer. Users scan an object with a smartphone video, upload it for automated 3D reconstruction, and visualize it interactively in Unity at an average of 150 frames per second (fps) on a laptop. The system integrates mobile capture, cloud-based 3D GS and Unity rendering to support real-time telepresence. Our experiments show that the pipeline processes scans in approximately 10 minutes on a graphics processing unit (GPU) achieving real-time rendering on the laptop.
cs.CY [Back]
[135] Surgeons Are Indian Males and Speech Therapists Are White Females: Auditing Biases in Vision-Language Models for Healthcare Professionals
Zohaib Hasan Siddiqui,Dayam Nadeem,Mohammad Masudur Rahman,Mohammad Nadeem,Shahab Saquib Sohail,Beenish Moalla Chaudhry
Main category: cs.CY
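TL;DR: This paper studies the demographic biases that vision-language models (VLMs) reflect about healthcare professions and proposes an evaluation protocol to quantify these biases and assess their operational risk.
Details
Motivation: Models such as CLIP and OpenCLIP may encode stereotypes about healthcare professions, with potential negative effects on equity, compliance, and patient trust.
Contribution: An evaluation protocol that defines a taxonomy of healthcare professions, curates a profession-aware prompt suite, and quantifies demographic skew against a balanced face corpus.
Method: The study defines a taxonomy of healthcare roles, designs a matching prompt suite, and benchmarks model behavior against balanced face data to quantify skew.
Result: Experiments show consistent demographic biases across multiple roles and vision models.
Insight: Identifying and addressing model bias matters in critical domains such as healthcare, where it can harm equity and patient trust.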
Abstract: Vision language models (VLMs), such as CLIP and OpenCLIP, can encode and reflect stereotypical associations between medical professions and demographic attributes learned from web-scale data. We present an evaluation protocol for healthcare settings that quantifies associated biases and assesses their operational risk. Our methodology (i) defines a taxonomy spanning clinicians and allied healthcare roles (e.g., surgeon, cardiologist, dentist, nurse, pharmacist, technician), (ii) curates a profession-aware prompt suite to probe model behavior, and (iii) benchmarks demographic skew against a balanced face corpus. Empirically, we observe consistent demographic biases across multiple roles and vision models. Our work highlights the importance of bias identification in critical domains such as healthcare as AI-enabled hiring and workforce analytics can have downstream implications for equity, compliance, and patient trust.
[136] Asking For It: Question-Answering for Predicting Rule Infractions in Online Content Moderation
Mattia Samory,Diana Pamfile,Andrew To,Shruti Phadke
Main category: cs.CY
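TL;DR: This paper proposes ModQ, a question-answering framework for predicting rule violations in online content moderation. By matching comments against community rules, it outperforms existing baselines.
Details
Motivation: Rules in online communities are diverse and change over time, which traditional moderation approaches handle poorly. The paper uses a question-answering framework to improve the transparency and automation of moderation.
Contribution: ModQ, a QA-based framework for rule-sensitive content moderation with extractive and multiple-choice variants that remain lightweight and interpretable.
Method: Extractive and multiple-choice QA models are trained on Reddit and Lemmy data and condition on the full set of community rules to identify which rule a comment violates.
Result: ModQ outperforms existing baselines at identifying violated rules and generalizes to unseen communities and rules.
Insight: A QA formulation offers a flexible approach to moderation under changing rules, suited to low-resource and dynamic governance settings.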
Abstract: Online communities rely on a mix of platform policies and community-authored rules to define acceptable behavior and maintain order. However, these rules vary widely across communities, evolve over time, and are enforced inconsistently, posing challenges for transparency, governance, and automation. In this paper, we model the relationship between rules and their enforcement at scale, introducing ModQ, a novel question-answering framework for rule-sensitive content moderation. Unlike prior classification or generation-based approaches, ModQ conditions on the full set of community rules at inference time and identifies which rule best applies to a given comment. We implement two model variants - extractive and multiple-choice QA - and train them on large-scale datasets from Reddit and Lemmy, the latter of which we construct from publicly available moderation logs and rule descriptions. Both models outperform state-of-the-art baselines in identifying moderation-relevant rule violations, while remaining lightweight and interpretable. Notably, ModQ models generalize effectively to unseen communities and rules, supporting low-resource moderation settings and dynamic governance environments.
cs.AI [Back]
[137] AlphaApollo: Orchestrating Foundation Models and Professional Tools into a Self-Evolving System for Deep Agentic Reasoning
Zhanke Zhou,Chentao Cao,Xiao Feng,Xuan Li,Zongze Li,Xiangyu Lu,Jiangchao Yao,Weikai Huang,Linrui Xu,Tian Cheng,Guanyu Jiang,Yiming Zheng,Brando Miranda,Tongliang Liu,Sanmi Koyejo,Masashi Sugiyama,Bo Han
Main category: cs.AI
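TL;DR: AlphaApollo is a self-evolving agentic reasoning system that combines foundation models with professional tools to address two bottlenecks in foundation-model reasoning: limited model-intrinsic capacity and unreliable test-time iteration.
Details
Motivation: Foundation-model reasoning is constrained by limited model-intrinsic capacity and unreliable test-time iteration. AlphaApollo strengthens reasoning by integrating professional tools with multi-round, multi-model collaboration.
Contribution: Proposes the AlphaApollo system, which couples a computation tool with a retrieval tool and supports multi-round, multi-model collaboration to improve the reasoning capability and reliability of foundation models.
Method: The system integrates a Python computation tool and a retrieval tool for task-relevant external information, and uses a shared state map to record candidate solutions, executable checks, and feedback for iterative refinement.
Result: On AIME 2024/2025, AlphaApollo improves Qwen2.5-14B-Instruct by +5.15% Average@32 and +23.34% Pass@32, and Llama-3.3-70B-Instruct by +8.91% Average@32 and +26.67% Pass@32. More than 80% of tool calls execute successfully.
Insight: Tool integration and collaboration mechanisms raise the reasoning capability of foundation models and show the potential of self-evolving systems.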
Abstract: We present AlphaApollo, a self-evolving agentic reasoning system that aims to address two bottlenecks in foundation model (FM) reasoning: limited model-intrinsic capacity and unreliable test-time iteration. AlphaApollo orchestrates multiple models with professional tools to enable deliberate, verifiable reasoning. It couples (i) a computation tool (Python with numerical and symbolic libraries) and (ii) a retrieval tool (task-relevant external information) to execute exact calculations and ground decisions. The system further supports multi-round, multi-model solution evolution via a shared state map that records candidates, executable checks, and feedback for iterative refinement. In evaluations on AIME 2024/2025 across multiple models, AlphaApollo delivers consistent gains: +5.15% Average@32 and +23.34% Pass@32 for Qwen2.5-14B-Instruct, and +8.91% Average@32 with +26.67% Pass@32 for Llama-3.3-70B-Instruct. Tool-use analysis shows that more than 80% of tool calls are successfully executed, with consistent outperformance of non-tool baselines, thereby lifting the capability ceiling of FMs. More empirical results and implementation details will be updated at https://github.com/tmlr-group/AlphaApollo.
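The abstract reports Average@32 and Pass@32 without defining them. The sketch below assumes the common reading: Average@k is the mean accuracy over k sampled answers per problem, and Pass@k is the fraction of problems with at least one correct sample. The function names and the toy data are illustrative and are not taken from the AlphaApollo repository.

```python
def average_at_k(correct):
    """Mean per-sample accuracy, averaged over problems.

    `correct` is a list of rows; each row holds k booleans, one per sample.
    """
    return sum(sum(row) / len(row) for row in correct) / len(correct)


def pass_at_k(correct):
    """Fraction of problems where at least one of the k samples is correct."""
    return sum(1 for row in correct if any(row)) / len(correct)


# Toy data (hypothetical): 3 problems, 4 samples each.
results = [
    [True, False, False, False],
    [False, False, False, False],
    [True, True, True, False],
]

print(round(average_at_k(results), 4))
print(round(pass_at_k(results), 4))
```

Under this reading, Pass@k can rise much faster than Average@k when a system turns problems with zero correct samples into problems with a few, which is consistent with the larger Pass@32 gains quoted above.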
[138] PuzzlePlex: Benchmarking Foundation Models on Reasoning and Planning with Puzzles
Yitao Long,Yuru Jiang,Hongjun Liu,Yilun Zhao,Jingchen Sun,Yiqiu Shen,Chen Zhao,Arman Cohan,Dennis Shasha
Main category: cs.AI
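TL;DR: Introduces PuzzlePlex, a benchmark for evaluating the reasoning and planning capabilities of foundation models in complex, dynamic environments. It covers 15 puzzle types and comes with customized evaluation metrics and analysis.
Details
Motivation: To understand how well foundation models reason and plan in complex environments, and to build an extensible benchmark from a diverse set of puzzles.
Contribution: Designs the PuzzlePlex benchmark, develops fine-grained evaluation metrics, and systematically analyzes frontier foundation models.
Method: Builds environments for 15 puzzle types, implements customized game-playing strategies for comparison, and evaluates models in two settings: instruction-based and code-based.
Result: Reasoning models outperform others in the instruction-based setting. Code-based execution is more challenging but offers a scalable and efficient alternative.
Insight: PuzzlePlex enables targeted evaluation and guides future improvements in the reasoning and planning of foundation models.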
Abstract: This work investigates the reasoning and planning capabilities of foundation models and their scalability in complex, dynamic environments. We introduce PuzzlePlex, a benchmark designed to assess these capabilities through a diverse set of puzzles. PuzzlePlex consists of 15 types of puzzles, including deterministic and stochastic games of varying difficulty, as well as single-player and two-player scenarios. The PuzzlePlex framework provides a comprehensive environment for each game, and supports extensibility to generate more challenging instances as foundation models evolve. Additionally, we implement customized game-playing strategies for comparison. Building on this benchmark, we develop fine-grained metrics to measure performance and conduct an in-depth analysis of frontier foundation models across two settings: instruction-based and code-based. Furthermore, we systematically investigate their scaling limits. Our findings show that reasoning models outperform others in instruction-based settings, while code-based execution presents greater challenges but offers a scalable and efficient alternative. PuzzlePlex enables targeted evaluation and guides future improvements in reasoning, planning, and generalization for foundation models.
[139] Evolving and Executing Research Plans via Double-Loop Multi-Agent Collaboration
Zhi Zhang,Yan Liu,Zhejing Hu,Gong Chen,Sheng-hua Zhong,Jiannong Cao
Main category: cs.AI
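TL;DR: The paper proposes a Double-Loop Multi-Agent (DLMA) framework for automatically generating and executing research plans. The leader loop (professor agents) iteratively refines research proposals with an evolutionary algorithm, while the follower loop (doctoral student agents) dynamically adjusts execution so the plan is implemented correctly. Experiments show that papers generated by DLMA score well in automated evaluation.
Details
Motivation: Automating research end to end poses a twofold challenge: evolving high-level plans and executing them under dynamic conditions. Conventional single-loop approaches struggle to deliver both novelty and soundness.
Contribution: Proposes the DLMA framework, in which two cooperating loops handle the evolution and the execution of research plans respectively, improving the quality of automated research.
Method: The leader loop (professor agents) refines proposals with an evolutionary algorithm. The follower loop (doctoral student agents) dynamically adjusts the plan during execution, using contextual and external observations to keep each step well supported.
Result: On benchmarks such as ACLAward and Laboratory, papers generated by DLMA achieve state-of-the-art scores in automated evaluation, significantly outperforming strong baselines.
Insight: The double-loop design separates novelty (evolution) from soundness (execution) and optimizes them jointly. Ablation studies confirm that both loops matter.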
Abstract: Automating the end-to-end scientific research process poses a fundamental challenge: it requires both evolving high-level plans that are novel and sound, and executing these plans correctly amidst dynamic and uncertain conditions. To address this bilevel challenge, we propose a novel Double-Loop Multi-Agent (DLMA) framework to solve the given research problem automatically. The leader loop, composed of professor agents, is responsible for evolving research plans. It employs an evolutionary algorithm through involvement, improvement, and integration meetings to iteratively generate and refine a pool of research proposals, exploring the solution space effectively. The follower loop, composed of doctoral student agents, is responsible for executing the best-evolved plan. It dynamically adjusts the plan during implementation via pre-hoc and post-hoc meetings, ensuring each step (e.g., drafting, coding) is well-supported by contextual and external observations. Extensive experiments on benchmarks like ACLAward and Laboratory show that DLMA generates research papers that achieve state-of-the-art scores in automated evaluation, significantly outperforming strong baselines. Ablation studies confirm the critical roles of both loops, with evolution driving novelty and execution ensuring soundness.
[140] Revisiting the Uniform Information Density Hypothesis in LLM Reasoning Traces
Minju Gwak,Guijin Son,Jaehyung Kim
Main category: cs.AI
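TL;DR: The paper examines how uniform information density is across the reasoning traces of large language models (LLMs). It proposes an entropy-based stepwise information density metric and shows experimentally that uniformity correlates positively with reasoning quality.
Details
Motivation: To test whether the Uniform Information Density (UID) hypothesis applies to LLM reasoning, and whether the uniformity of information density can serve as an indicator of reasoning quality.
Contribution: (1) Proposes an entropy-based stepwise information density metric. (2) Introduces local and global uniformity scores. (3) Shows that uniformity correlates positively with reasoning quality and is useful in practice for improving accuracy.
Method: (1) Designs an entropy-based stepwise information density metric. (2) Defines local and global uniformity scores over that metric. (3) Validates both on six reasoning benchmarks.
Result: (1) Uniformity of information density is associated with reasoning quality. (2) Correct reasoning traces tend to avoid sharp information density spikes. (3) Selecting traces by uniformity improves accuracy, with 10-32% relative gains over baselines on AIME2025.
Insight: Uniformity of information density is a useful indicator of reasoning quality, and avoiding sharp density spikes supports more reliable reasoning systems.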
Abstract: The Uniform Information Density (UID) hypothesis suggests that effective communication maintains a stable flow of information. In this work, we revisit this principle in the context of large language model (LLM) reasoning traces, asking whether step-level uniformity reflects reasoning quality. To this end, we propose an entropy-based stepwise information density metric and introduce two complementary measures of uniformity, local and global uniformity scores. Across experiments on six different reasoning benchmarks, we find that step-level uniformity not only provides a strong theoretical lens but also yields practical performance benefits; for example, selecting reasoning traces with more uniform step-level information density yields 10-32% relative accuracy gains over baselines on AIME2025. Our analysis further reveals that correct reasoning traces tend to avoid sharp information density spikes, while incorrect traces exhibit irregular information bursts. These results demonstrate that UID-inspired information density measures outperform alternative internal signals as predictors of reasoning quality. Results highlight the uniformity of the information density as a robust diagnostic and selection criterion for building more reliable and accurate reasoning systems.
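The abstract does not give the formulas for the stepwise metric or the two uniformity scores. As a rough illustration only, the sketch below assumes that a step's information density is the mean token-level entropy within that step, that global uniformity is the negative variance of step densities, and that local uniformity is the negative mean squared change between consecutive steps. All three definitions, and every name in the code, are assumptions rather than the paper's own.

```python
from statistics import mean, pvariance


def step_density(token_entropies):
    """Assumed step-level density: mean token entropy within one step."""
    return mean(token_entropies)


def global_uniformity(densities):
    """Assumed global score: higher (closer to 0) means flatter overall."""
    return -pvariance(densities)


def local_uniformity(densities):
    """Assumed local score: penalizes jumps between consecutive steps."""
    diffs = [(b - a) ** 2 for a, b in zip(densities, densities[1:])]
    return -mean(diffs) if diffs else 0.0


def select_most_uniform(traces):
    """Pick the trace whose step densities have the lowest variance."""
    def score(name):
        return global_uniformity([step_density(s) for s in traces[name]])
    return max(traces, key=score)


# Hypothetical traces: each is a list of steps, each step a list of
# per-token entropies (in nats).
traces = {
    "smooth": [[1.0, 1.2], [1.1, 0.9], [1.0, 1.0]],
    "spiky": [[0.2, 0.2], [3.0, 3.4], [0.1, 0.3]],
}

print(select_most_uniform(traces))
```

The selection rule mirrors the abstract's use case of choosing among sampled traces. A real implementation would obtain the entropies from the model's next-token distributions.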
cs.RO [Back]
[141] Vision-Language-Action Models for Robotics: A Review Towards Real-World Applications
Kento Kawaharazuka,Jihoon Oh,Jun Yamada,Ingmar Posner,Yuke Zhu
Main category: cs.RO
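TL;DR: The paper surveys Vision-Language-Action (VLA) models for robotics from a full-stack perspective, covering model architectures, learning paradigms, and real-world deployment.
Details
Motivation: The potential of large language models (LLMs) and vision-language models (VLMs) for robotics has drawn wide attention. VLA models aim to unify vision, language, and action data so that policies generalize across diverse tasks and environments, enabling more flexible and scalable robot deployment.
Contribution: Provides a systematic review of VLA models, including the evolution of strategies and architectures, building blocks, modality-specific processing techniques, and learning paradigms, together with a broad review of robot platforms, datasets, and evaluation benchmarks.
Method: By analyzing and categorizing existing work, the paper integrates the software and hardware components of VLA systems into a full-stack view that spans data collection, training methods, and deployment.
Result: The paper summarizes current progress on VLA models and offers the robotics community practical guidance and resources for applying them to real systems.
Insight: The generalization ability of VLA models may reduce the need for task-specific data, but real-world deployment still has to address data collection, modality alignment, and hardware integration.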
Abstract: Amid growing efforts to leverage advances in large language models (LLMs) and vision-language models (VLMs) for robotics, Vision-Language-Action (VLA) models have recently gained significant attention. By unifying vision, language, and action data at scale, which have traditionally been studied separately, VLA models aim to learn policies that generalise across diverse tasks, objects, embodiments, and environments. This generalisation capability is expected to enable robots to solve novel downstream tasks with minimal or no additional task-specific data, facilitating more flexible and scalable real-world deployment. Unlike previous surveys that focus narrowly on action representations or high-level model architectures, this work offers a comprehensive, full-stack review, integrating both software and hardware components of VLA systems. In particular, this paper provides a systematic review of VLAs, covering their strategy and architectural transition, architectures and building blocks, modality-specific processing techniques, and learning paradigms. In addition, to support the deployment of VLAs in real-world robotic applications, we also review commonly used robot platforms, data collection strategies, publicly available datasets, data augmentation methods, and evaluation benchmarks. Throughout this comprehensive survey, this paper aims to offer practical guidance for the robotics community in applying VLAs to real-world robotic systems. All references categorized by training approach, evaluation method, modality, and dataset are available in the table on our project website: https://vla-survey.github.io .
[142] TrackVLA++: Unleashing Reasoning and Memory Capabilities in VLA Models for Embodied Visual Tracking
Jiahang Liu,Yunpeng Qi,Jiazhao Zhang,Minghan Li,Shaoan Wang,Kui Wu,Hanjing Ye,Hong Zhang,Zhibo Chen,Fangwei Zhong,Zhizheng Zhang,He Wang
Main category: cs.RO
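TL;DR: TrackVLA++ is a new Vision-Language-Action (VLA) model that improves embodied visual tracking by adding a spatial reasoning mechanism and a Target Identification Memory (TIM), achieving state-of-the-art performance on public benchmarks.
Details
Motivation: Existing embodied visual tracking methods lack explicit spatial reasoning and effective temporal memory, so they fail under severe occlusion or when similar-looking distractors are present. TrackVLA++ targets these failures.
Contribution: (1) Introduces the Polar-CoT spatial reasoning mechanism. (2) Proposes the Target Identification Memory (TIM) module. (3) Improves on the previous leading approach on the EVT-Bench DT split.
Method: Combines a Chain-of-Thought paradigm (Polar-CoT), which infers the target's relative position and encodes it as a polar-coordinate token, with a gated-update TIM module that preserves long-horizon target memory.
Result: TrackVLA++ achieves state-of-the-art performance on public benchmarks in both egocentric and multi-camera settings and shows strong zero-shot generalization.
Insight: Explicit spatial reasoning and a target memory mechanism are key to handling occlusion and distractors in embodied visual tracking.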
Abstract: Embodied Visual Tracking (EVT) is a fundamental ability that underpins practical applications, such as companion robots, guidance robots and service assistants, where continuously following moving targets is essential. Recent advances have enabled language-guided tracking in complex and unstructured scenes. However, existing approaches lack explicit spatial reasoning and effective temporal memory, causing failures under severe occlusions or in the presence of similar-looking distractors. To address these challenges, we present TrackVLA++, a novel Vision-Language-Action (VLA) model that enhances embodied visual tracking with two key modules, a spatial reasoning mechanism and a Target Identification Memory (TIM). The reasoning module introduces a Chain-of-Thought paradigm, termed Polar-CoT, which infers the target’s relative position and encodes it as a compact polar-coordinate token for action prediction. Guided by these spatial priors, the TIM employs a gated update strategy to preserve long-horizon target memory, ensuring spatiotemporal consistency and mitigating target loss during extended occlusions. Extensive experiments show that TrackVLA++ achieves state-of-the-art performance on public benchmarks across both egocentric and multi-camera settings. On the challenging EVT-Bench DT split, TrackVLA++ surpasses the previous leading approach by 5.1 and 12, respectively. Furthermore, TrackVLA++ exhibits strong zero-shot generalization, enabling robust real-world tracking in dynamic and occluded scenarios.
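The abstract says Polar-CoT encodes the target's relative position as a compact polar-coordinate token but does not give the encoding. One plausible way to do this, shown purely as an assumption, is to discretize the bearing and distance into bins and map the pair to a single integer id. The bin counts, distance edges, and function name below are hypothetical.

```python
import math


def polar_token(dx, dy, n_angle_bins=8, dist_edges=(1.0, 2.0, 4.0)):
    """Map a relative offset (dx forward, dy left, in meters) to a token id.

    Hypothetical scheme: the bearing is split into `n_angle_bins` equal
    sectors and the distance into len(dist_edges) + 1 ranges.
    """
    angle = math.atan2(dy, dx)  # in [-pi, pi]
    frac = (angle + math.pi) / (2 * math.pi)  # in [0, 1]
    angle_bin = int(frac * n_angle_bins) % n_angle_bins
    dist = math.hypot(dx, dy)
    dist_bin = sum(dist >= edge for edge in dist_edges)
    return angle_bin * (len(dist_edges) + 1) + dist_bin


# Target 1.5 m straight ahead.
print(polar_token(1.5, 0.0))
```

With the defaults this yields 32 distinct ids, small enough to add to a model vocabulary as special tokens. Whether TrackVLA++ uses a discretization of this kind is not stated in the abstract.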
[143] TIGeR: Tool-Integrated Geometric Reasoning in Vision-Language Models for Robotics
Yi Han,Cheng Chi,Enshen Zhou,Shanyu Rong,Jingkun An,Pengwei Wang,Zhongyuan Wang,Lu Sheng,Shanghang Zhang
Main category: cs.RO
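TL;DR: TIGeR is a new framework that turns vision-language models (VLMs) from perceptual estimators into geometric computers. The model generates and executes precise geometric computations through external tools, reaching centimeter-level precision in robotic manipulation.
Details
Motivation: Current VLMs offer only qualitative precision in spatial reasoning, which falls short of the computational precision robotic manipulation needs. The authors address this limitation with tool-integrated geometric reasoning.
Contribution: (1) Proposes the TIGeR framework, which lets VLMs call external tools for exact geometric computation. (2) Releases the TIGeR-300K dataset covering a range of geometric tasks. (3) Designs a two-stage training pipeline with a hierarchical reward.
Method: Two-stage training with supervised fine-tuning (SFT) followed by reinforcement fine-tuning (RFT), combined with a hierarchical reward design, teaches the model to recognize geometric requirements, generate computational code, and invoke external tools.
Result: TIGeR achieves SOTA performance on geometric reasoning benchmarks and demonstrates centimeter-level precision in real-world robotic manipulation tasks.
Insight: Tool integration compensates for the weakness of neural networks at exact geometric computation while preserving the high-level reasoning ability of VLMs.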
Abstract: Vision-Language Models (VLMs) have shown remarkable capabilities in spatial reasoning, yet they remain fundamentally limited to qualitative precision and lack the computational precision required for real-world robotics. Current approaches fail to leverage metric cues from depth sensors and camera calibration, instead reducing geometric problems to pattern recognition tasks that cannot deliver the centimeter-level accuracy essential for robotic manipulation. We present TIGeR (Tool-Integrated Geometric Reasoning), a novel framework that transforms VLMs from perceptual estimators to geometric computers by enabling them to generate and execute precise geometric computations through external tools. Rather than attempting to internalize complex geometric operations within neural networks, TIGeR empowers models to recognize geometric reasoning requirements, synthesize appropriate computational code, and invoke specialized libraries for exact calculations. To support this paradigm, we introduce TIGeR-300K, a comprehensive tool-invocation-oriented dataset covering point transformations, pose estimation, trajectory generation, and spatial compatibility verification, complete with tool invocation sequences and intermediate computations. Through a two-stage training pipeline combining supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT) with our proposed hierarchical reward design, TIGeR achieves SOTA performance on geometric reasoning benchmarks while demonstrating centimeter-level precision in real-world robotic manipulation tasks.
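To make "metric cues from depth sensors and camera calibration" concrete, here is a minimal sketch of the kind of exact computation a tool call could perform: back-projecting a pixel with known depth through a pinhole camera model and then applying a rigid transform. This is textbook geometry, not code from TIGeR, and the intrinsics and poses used are made-up values.

```python
def backproject(u, v, depth, fx, fy, cx, cy):
    """Pinhole back-projection: pixel (u, v) at `depth` meters -> camera-frame XYZ."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def transform(point, rotation, translation):
    """Apply a rigid transform: rotation (3x3 nested lists) then translation."""
    return tuple(
        sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
        for i in range(3)
    )


# Hypothetical intrinsics and camera-to-base pose.
IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
p_cam = backproject(420, 240, 2.0, fx=500.0, fy=500.0, cx=320.0, cy=240.0)
p_base = transform(p_cam, IDENTITY, (0.1, 0.0, 0.5))
print(p_cam, p_base)
```

A language model estimating the same quantity from pixels alone has no way to guarantee this arithmetic, which is the gap the abstract describes.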
cs.SD [Back]
[144] XLSR-Kanformer: A KAN-Intergrated model for Synthetic Speech Detection
Phuong Tuan Dat,Tran Huy Dat
Main category: cs.SD
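TL;DR: The paper proposes XLSR-Kanformer, which replaces the MLP in XLSR-Conformer with a Kolmogorov-Arnold Network (KAN) to improve synthetic speech detection. On ASVspoof2021 it yields a 60.55% relative improvement in Equal Error Rate (EER) on the LA and DF sets and achieves 0.70% EER on the 21LA set.
Details
Motivation: Advances in speech synthesis have made spoofing attacks increasingly sophisticated, posing a serious challenge to automatic speaker verification systems. The XLSR-Conformer model, built on self-supervised learning (SSL), performs well but leaves room for architectural improvement.
Contribution: Proposes integrating KAN into the XLSR-Conformer model in place of the conventional MLP, which improves synthetic speech detection. The replacement is also robust across various SSL architectures.
Method: Replaces the MLP in the XLSR-Conformer model with a Kolmogorov-Arnold Network (KAN), a universal approximator based on the Kolmogorov-Arnold representation theorem. Experiments are run on ASVspoof2021.
Result: XLSR-Kanformer improves EER by a relative 60.55% on the LA and DF sets and achieves 0.70% EER on the 21LA set.
Insight: Incorporating KAN into SSL-based models is a promising direction for synthetic speech detection.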
Abstract: Recent advancements in speech synthesis technologies have led to increasingly sophisticated spoofing attacks, posing significant challenges for automatic speaker verification systems. While systems based on self-supervised learning (SSL) models, particularly the XLSR-Conformer architecture, have demonstrated remarkable performance in synthetic speech detection, there remains room for architectural improvements. In this paper, we propose a novel approach that replaces the traditional Multi-Layer Perceptron (MLP) in the XLSR-Conformer model with a Kolmogorov-Arnold Network (KAN), a powerful universal approximator based on the Kolmogorov-Arnold representation theorem. Our experimental results on ASVspoof2021 demonstrate that integrating KAN into the XLSR-Conformer model improves performance by a relative 60.55% in Equal Error Rate (EER) on the LA and DF sets, further achieving 0.70% EER on the 21LA set. In addition, the proposed replacement is robust across various SSL architectures. These findings suggest that incorporating KAN into SSL-based models is a promising direction for advances in synthetic speech detection.
[145] AudioMarathon: A Comprehensive Benchmark for Long-Context Audio Understanding and Efficiency in Audio LLMs
Peize He,Zichen Wen,Yubo Wang,Yuxuan Wang,Xiaoqian Liu,Jiajie Huang,Zehui Lei,Zhuangcheng Gu,Xiangqi Jin,Jiabing Yang,Kai Li,Zhifei Liu,Weijia Li,Cunxiang Wang,Conghui He,Linfeng Zhang
Main category: cs.SD
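TL;DR: AudioMarathon is a comprehensive benchmark for long-context audio understanding, built to expose the efficiency and performance problems of current Large Audio Language Models (LALMs) on long audio.
Details
Motivation: LALMs face the quadratic cost of attention and the difficulty of modeling long-range temporal dependencies on long audio, while existing audio benchmarks are built mostly from short clips and cannot evaluate long-context behavior.
Contribution: Proposes the AudioMarathon benchmark, which covers speech, sound, and music, uses long audio inputs (90-300 seconds), and includes complex reasoning tasks, filling a gap in long-context audio evaluation.
Method: The benchmark rests on three pillars: long-context audio inputs, full domain coverage, and multi-hop reasoning tasks. It evaluates current LALMs and measures the efficiency of several acceleration techniques.
Result: Model performance drops clearly as audio length grows, exposing the limits of current models. The paper also analyzes the trade-offs of token pruning and KV cache eviction.
Insight: AudioMarathon reveals weaknesses of current LALMs in temporal reasoning and memory efficiency and points to directions for future models.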
Abstract: Processing long-form audio is a major challenge for Large Audio Language models (LALMs). These models struggle with the quadratic cost of attention ($O(N^2)$) and with modeling long-range temporal dependencies. Existing audio benchmarks are built mostly from short clips and do not evaluate models in realistic long context settings. To address this gap, we introduce AudioMarathon, a benchmark designed to evaluate both understanding and inference efficiency on long-form audio. AudioMarathon provides a diverse set of tasks built upon three pillars: long-context audio inputs with durations ranging from 90.0 to 300.0 seconds, which correspond to encoded sequences of 2,250 to 7,500 audio tokens, respectively, full domain coverage across speech, sound, and music, and complex reasoning that requires multi-hop inference. We evaluate state-of-the-art LALMs and observe clear performance drops as audio length grows. We also study acceleration techniques and analyze the trade-offs of token pruning and KV cache eviction. The results show large gaps across current LALMs and highlight the need for better temporal reasoning and memory-efficient architectures. We believe AudioMarathon will drive the audio and multimodal research community to develop more advanced audio understanding models capable of solving complex audio tasks.
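The token counts in the abstract imply a fixed encoding rate, and the quadratic attention cost it mentions can be made concrete with a few lines of arithmetic. The sketch below uses only the numbers quoted above; the helper names are illustrative.

```python
def tokens_per_second(n_tokens, seconds):
    """Audio tokens produced per second of input."""
    return n_tokens / seconds


def attention_cost_ratio(n_long, n_short):
    """Relative cost of full self-attention, which scales as O(N^2)."""
    return (n_long / n_short) ** 2


# Figures from the abstract: 90 s -> 2,250 tokens, 300 s -> 7,500 tokens.
print(tokens_per_second(2250, 90.0))
print(tokens_per_second(7500, 300.0))
print(round(attention_cost_ratio(7500, 2250), 2))
```

Both endpoints work out to 25 tokens per second, so the longest clips carry about 3.3 times as many tokens as the shortest but cost roughly 11 times as much in full attention. That gap is what token pruning and KV cache eviction aim to reduce.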
cs.LG [Back]
[146] SaFeR-VLM: Toward Safety-aware Fine-grained Reasoning in Multimodal Models
Huahui Yi,Kun Wang,Qiankun Li,Miao Yu,Liang Lin,Gongli Xi,Hao Wu,Xuming Hu,Kang Li,Yang Liu
Main category: cs.LG
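TL;DR: SaFeR-VLM is a safety-aware reinforcement learning framework that embeds safety directly into multimodal reasoning. It combines a curated dataset, safety-aware rollout, structured reward modeling, and GRPO optimization, improving both safety and helpfulness and surpassing several much larger models.
Details
Motivation: Multimodal Large Reasoning Models (MLRMs) can amplify safety risks during reasoning. Existing defenses act mainly at the output level and do not constrain the reasoning process, leaving implicit risks in place.
Contribution: Proposes the SaFeR-VLM framework, comprising the QI-Safe-10K dataset, a safety-aware rollout mechanism, a structured reward model, and GRPO optimization, to unify safety and reasoning ability.
Method: Uses reinforcement learning that integrates the dataset, safety-aware rollout, multi-dimensional reward modeling, and GRPO optimization to actively constrain the reasoning process.
Result: SaFeR-VLM-3B surpasses same-scale and more than 10x larger models on safety and helpfulness, and SaFeR-VLM-7B surpasses GPT-5-mini and Gemini-2.5-Flash on safety metrics.
Insight: Embedding safety in the reasoning process addresses risks that output-level defenses miss and shows that safety and reasoning ability can improve together.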
Abstract: Multimodal Large Reasoning Models (MLRMs) demonstrate impressive cross-modal reasoning but often amplify safety risks under adversarial or unsafe prompts, a phenomenon we call the Reasoning Tax. Existing defenses mainly act at the output level and do not constrain the reasoning process, leaving models exposed to implicit risks. In this paper, we propose SaFeR-VLM, a safety-aligned reinforcement learning framework that embeds safety directly into multimodal reasoning. The framework integrates four components: (I) QI-Safe-10K, a curated dataset emphasizing safety-critical and reasoning-sensitive cases; (II) safety-aware rollout, where unsafe generations undergo reflection and correction instead of being discarded; (III) structured reward modeling with multi-dimensional weighted criteria and explicit penalties for hallucinations and contradictions; and (IV) GRPO optimization, which reinforces both safe and corrected trajectories. This unified design shifts safety from a passive safeguard to an active driver of reasoning, enabling scalable and generalizable safety-aware reasoning. SaFeR-VLM further demonstrates robustness against both explicit and implicit risks, supporting dynamic and interpretable safety decisions beyond surface-level filtering. SaFeR-VLM-3B achieves average performance 70.13 and 78.97 on safety and helpfulness across six benchmarks, surpassing both same-scale and more than 10x larger models such as Skywork-R1V3-38B, Qwen2.5VL-72B, and GLM4.5V-106B. Remarkably, SaFeR-VLM-7B benefits from its increased scale to surpass GPT-5-mini and Gemini-2.5-Flash by 6.47 and 16.76 points respectively on safety metrics, achieving this improvement without any degradation in helpfulness performance. Our codes are available at https://github.com/HarveyYi/SaFeR-VLM.
[147] The Markovian Thinker
Milad Aghajohari,Kamran Chitsaz,Amirhossein Kazemnejad,Sarath Chandar,Alessandro Sordoni,Aaron Courville,Siva Reddy
Main category: cs.LG
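TL;DR: The paper proposes Markovian Thinking, a paradigm that decouples thinking length from context size by conditioning on a constant-size state, giving linear compute and constant memory. Its instantiation, the Delethink environment, shows the approach is efficient on long reasoning tasks.
Details
Motivation: In the standard reinforcement learning (RL) thinking environment, the state grows with reasoning length, which forces quadratic compute. The authors redesign the environment to remove this bottleneck.
Contribution: Proposes the Markovian Thinking paradigm, which achieves linear compute through a constant-size state, and designs the Delethink environment to validate it.
Method: Reasoning is split into fixed-size chunks. At the end of each chunk the environment resets the context and carries over a short state, and the model learns through RL to write a state that allows seamless continuation.
Result: Reasoning in 8K-token chunks, an R1-Distill 1.5B model trained with Delethink thinks up to 24K tokens and matches or surpasses LongCoT-RL trained with a 24K budget. At an average thinking length of 96K tokens, the estimated cost is 7 H100-months versus 27 for LongCoT-RL.
Insight: Redesigning the thinking environment is a strong lever for efficient long reasoning, and off-the-shelf reasoning models already sample Markovian traces zero-shot.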
Abstract: Reinforcement learning (RL) has recently become a strong recipe for training reasoning LLMs that produce long chains of thought (LongCoT). Yet the standard RL “thinking environment”, where the state is the prompt plus all prior reasoning tokens, makes the state unbounded and forces attention-based policies to pay quadratic compute as thoughts lengthen. We revisit the environment itself. We propose Markovian Thinking, a paradigm in which the policy advances reasoning while conditioning on a constant-size state, decoupling thinking length from context size. As an immediate consequence this yields linear compute with constant memory. We instantiate this idea with Delethink, an RL environment that structures reasoning into fixed-size chunks. Within each chunk, the model thinks as usual; at the boundary, the environment resets the context and reinitializes the prompt with a short carryover. Through RL, the policy learns to write a textual state near the end of each chunk sufficient for seamless continuation of reasoning after reset. Trained in this environment, an R1-Distill 1.5B model reasons in 8K-token chunks yet thinks up to 24K tokens, matching or surpassing LongCoT-RL trained with a 24K budget. With test-time scaling, Delethink continues to improve where LongCoT plateaus. The effect of linear compute is substantial: we empirically estimate at 96K average thinking length LongCoT-RL costs 27 H100-months vs. 7 for Delethink. Analysis at RL initialization shows off-the-shelf reasoning models (1.5B-120B) often sample Markovian traces zero-shot across diverse benchmarks, providing positive samples that make RL effective at scale. Our results show that redesigning the thinking environment is a powerful lever: it enables very long reasoning without quadratic overhead and opens a path toward efficient, scalable reasoning LLMs.
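The quadratic-versus-linear claim can be illustrated with a simple count of attention pairs. The sketch below assumes each generated token attends to every token currently in context, ignores the prompt, and treats the carryover as a fixed number of tokens copied into each new chunk. These simplifications and the function names are mine, and the counts are not comparable to the H100-month figures in the abstract.

```python
def longcot_cost(total_tokens):
    """Attention pairs when every token attends to all earlier thinking tokens."""
    return sum(range(1, total_tokens + 1))


def chunked_cost(total_tokens, chunk_size, carryover=0):
    """Attention pairs when context is reset every chunk, keeping `carryover` tokens.

    Each chunk holds `carryover` carried tokens plus up to
    (chunk_size - carryover) newly generated tokens.
    """
    new_per_chunk = chunk_size - carryover
    if new_per_chunk <= 0:
        raise ValueError("carryover must be smaller than chunk_size")
    cost, remaining = 0, total_tokens
    while remaining > 0:
        n = min(new_per_chunk, remaining)
        cost += sum(carryover + i for i in range(1, n + 1))
        remaining -= n
    return cost


# Doubling the thinking length roughly quadruples the LongCoT count
# but only doubles the chunked count.
print(longcot_cost(24), longcot_cost(48))
print(chunked_cost(24, 8), chunked_cost(48, 8))
```

Because each chunk's context is bounded by `chunk_size`, the per-token cost is bounded too, so total cost grows linearly in the number of thinking tokens and the KV cache never exceeds one chunk.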
[148] Revisiting Mixout: An Overlooked Path to Robust Finetuning
Masih Aminbeidokhti,Heitor Rapela Medeiros,Eric Granger,Marco Pedersoli
Main category: cs.LG
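TL;DR: The paper revisits Mixout and proposes GMixout, which improves finetuning robustness through a dynamic anchor and an explicit resampling frequency. Experiments show it generalizes better out of domain than existing methods.
Details
Motivation: Finetuning vision foundation models usually improves in-domain accuracy but reduces robustness under distribution shift. The paper revisits Mixout, a stochastic regularizer, to address this.
Contribution: (1) Proposes GMixout, with a dynamic anchor and an explicit resampling-frequency control. (2) Provides a sparse-kernel implementation that reduces compute overhead. (3) Validates GMixout on multiple benchmarks.
Method: GMixout updates the anchor weights with an exponential moving average and adds an explicit resampling-frequency hyperparameter that regulates the masking period. The sparse-kernel implementation updates only a small fraction of parameters.
Result: On ImageNet, DomainNet, and other datasets, GMixout improves in-domain accuracy and surpasses Model Soups and strong parameter-efficient finetuning baselines under distribution shift.
Insight: The masking anchor, resampling frequency, and mask sparsity are the key levers governing robustness.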
Abstract: Finetuning vision foundation models often improves in-domain accuracy but comes at the cost of robustness under distribution shift. We revisit Mixout, a stochastic regularizer that intermittently replaces finetuned weights with their pretrained reference, through the lens of a single-run, weight-sharing implicit ensemble. This perspective reveals three key levers that govern robustness: the masking anchor, resampling frequency, and mask sparsity. Guided by this analysis, we introduce GMixout, which (i) replaces the fixed anchor with an exponential moving-average snapshot that adapts during training, and (ii) regulates masking period via an explicit resampling-frequency hyperparameter. Our sparse-kernel implementation updates only a small fraction of parameters with no inference-time overhead, enabling training on consumer-grade GPUs. In experiments on benchmarks covering covariate shift, corruption, and class imbalance (ImageNet / ImageNet-LT, DomainNet, iWildCam, and CIFAR100-C), GMixout consistently improves in-domain accuracy beyond zero-shot performance while surpassing both Model Soups and strong parameter-efficient finetuning baselines under distribution shift.
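The three levers named in the abstract can be shown on plain lists of weights. The sketch below is a simplified illustration, not the authors' sparse-kernel implementation: it omits the rescaling term used in the original Mixout formulation, and the function names, decay value, and schedule are assumptions.

```python
import random


def sample_mask(n, p, rng):
    """Mask sparsity: each weight is swapped for the anchor with probability p."""
    return [rng.random() < p for _ in range(n)]


def apply_mask(weights, anchor, mask):
    """Masking anchor: masked positions use the anchor value instead."""
    return [a if m else w for w, a, m in zip(weights, anchor, mask)]


def ema_update(anchor, weights, decay):
    """Dynamic anchor: exponential moving average of the finetuned weights."""
    return [decay * a + (1.0 - decay) * w for a, w in zip(anchor, weights)]


def resample_steps(n_steps, resample_every):
    """Resampling frequency: steps at which a fresh mask is drawn."""
    return [s for s in range(n_steps) if s % resample_every == 0]


rng = random.Random(0)
mask = sample_mask(5, 0.5, rng)
print(apply_mask([1.0, 2.0, 3.0], [0.0, 0.0, 0.0], [True, False, True]))
print(resample_steps(10, 4))
```

In a training loop, the forward pass would use the output of `apply_mask`, the mask would be redrawn only at the steps returned by `resample_steps`, and the anchor would be refreshed with `ema_update` after each optimizer step.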
[149] Sharpness-Aware Data Generation for Zero-shot Quantization
Dung Hoang-Anh,Cuong Pham Trung Le,Jianfei Cai,Thanh-Toan Do
Main category: cs.LG
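TL;DR: The paper proposes a new zero-shot quantization method that accounts for the sharpness of the quantized model when generating synthetic data, in order to improve generalization. Experiments show it outperforms prior methods in low-bit quantization settings.
Details
Motivation: Zero-shot quantization must generate synthetic data without access to the original training data, but existing methods ignore how the sharpness of the quantized model affects generalization. The paper improves synthetic data generation by optimizing for sharpness.
Contribution: (1) Proposes sharpness-aware synthetic data generation, using the sharpness of the quantized model as a criterion for generating data. (2) Shows that, under certain assumptions, sharpness minimization can be attained by maximizing gradient matching between synthetic data and real validation data. (3) Approximates that gradient matching so that no real validation data is needed.
Method: (1) Minimizes sharpness by maximizing gradient matching between the reconstruction-loss gradients on synthetic and real validation data. (2) Without real validation data, approximates this with gradient matching between each generated sample and its neighbors.
Result: Experiments on CIFAR-100 and ImageNet show the method outperforms state-of-the-art techniques in low-bit quantization settings.
Insight: The sharpness of the quantized model matters for generalization, and optimizing it through gradient matching improves zero-shot quantization.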
Abstract: Zero-shot quantization aims to learn a quantized model from a pre-trained full-precision model with no access to original real training data. The common idea in zero-shot quantization approaches is to generate synthetic data for quantizing the full-precision model. While it is well-known that deep neural networks with low sharpness have better generalization ability, none of the previous zero-shot quantization works considers the sharpness of the quantized model as a criterion for generating training data. This paper introduces a novel methodology that takes into account quantized model sharpness in synthetic data generation to enhance generalization. Specifically, we first demonstrate that sharpness minimization can be attained by maximizing gradient matching between the reconstruction loss gradients computed on synthetic and real validation data, under certain assumptions. We then circumvent the problem of the gradient matching without real validation set by approximating it with the gradient matching between each generated sample and its neighbors. Experimental evaluations on CIFAR-100 and ImageNet datasets demonstrate the superiority of the proposed method over the state-of-the-art techniques in low-bit quantization settings.
[150] A Multi-Agent Framework for Stateful Inference-Time Search
Arshika Lalan,Rajat Ghosh,Aditya Kolsur,Debojyoti Dutta
Main category: cs.LG
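TL;DR: The paper proposes a training-free, stateful multi-agent inference framework that combines persistent state, adversarial mutation, and evolutionary preservation, improving the coverage and robustness of automated unit test generation.
Details
Motivation: Stateless inference struggles on multi-step tasks, and task-specific fine-tuning or instruction-tuning remains brittle on tasks that need deeper reasoning and long-horizon dependencies.
Contribution: Proposes a new stateful multi-agent inference framework built on persistent inference-time state, adversarial mutation, and evolutionary preservation.
Method: Specialized agents sequentially propose, mutate, and score candidates, while a controller maintains persistent state across generations and evolutionary preservation keeps the candidate pool diverse, producing robust edge cases.
Result: On benchmarks such as HumanEval and TestGenEvalMini, the method achieves substantial coverage gains over stateless single-step baselines across three LLM families (Llama, Gemma, and GPT).
Insight: Combining persistent inference-time state with evolutionary search materially improves unit-test generation.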
Abstract: Recent work explores agentic inference-time techniques to perform structured, multi-step reasoning. However, stateless inference often struggles on multi-step tasks due to the absence of persistent state. Moreover, task-specific fine-tuning or instruction-tuning often achieve surface-level code generation but remain brittle on tasks requiring deeper reasoning and long-horizon dependencies. To address these limitations, we propose stateful multi-agent evolutionary search, a training-free framework that departs from prior stateless approaches by combining (i) persistent inference-time state, (ii) adversarial mutation, and (iii) evolutionary preservation. We demonstrate its effectiveness in automated unit test generation through the generation of edge cases. We generate robust edge cases using an evolutionary search process, where specialized agents sequentially propose, mutate, and score candidates. A controller maintains persistent state across generations, while evolutionary preservation ensures diversity and exploration across all possible cases. This yields a generalist agent capable of discovering robust, high-coverage edge cases across unseen codebases. Experiments show our stateful multi-agent inference framework achieves substantial gains in coverage over stateless single-step baselines, evaluated on prevalent unit-testing benchmarks such as HumanEval and TestGenEvalMini and using three diverse LLM families - Llama, Gemma, and GPT. These results indicate that combining persistent inference-time state with evolutionary search materially improves unit-test generation.
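The propose, mutate, and score loop with a controller holding persistent state can be sketched on a toy problem. In the example below, plain functions stand in for the LLM agents, the "coverage" signal is just which branch of a three-branch function an input reaches, and the mutation is a random perturbation rather than an adversarial one. Every name and design detail here is hypothetical.

```python
import random


def target(x):
    """Toy function under test with three branches."""
    if x < 0:
        return "neg"
    if x == 0:
        return "zero"
    return "pos"


def propose(rng):
    """Proposer agent stand-in: suggest a fresh input."""
    return rng.randint(-5, 5)


def mutate(x, rng):
    """Mutator agent stand-in: nudge a kept input toward a boundary."""
    return x + rng.choice([-1, 1])


def score(x, state):
    """Scorer agent stand-in: reward inputs that reach an uncovered branch."""
    return 1 if target(x) not in state["covered"] else 0


def search(generations, seed=0):
    rng = random.Random(seed)
    state = {"covered": set(), "kept": []}  # persists across generations
    for _ in range(generations):
        parents = state["kept"] or [propose(rng)]
        candidates = [propose(rng)] + [mutate(p, rng) for p in parents]
        for c in candidates:
            if score(c, state) > 0:
                state["covered"].add(target(c))
                state["kept"].append(c)  # preservation: keep useful cases
    return state


state = search(generations=20)
print(sorted(state["covered"]), state["kept"])
```

A stateless baseline would call `propose` once per attempt with no memory of earlier coverage. The persistent `state` is what lets later generations build on earlier finds.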