cs.CL

[1] Hallucination reduction with CASAL: Contrastive Activation Steering For Amortized Learning

Wannan Yang,Xinchi Qiu,Lei Yu,Yuchen Zhang,Oliver Aobo Yang,Narine Kokhlikyan,Nicola Cancedda,Diego Garcia-Olano

Main category: cs.CL

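TL;DR: CASAL is an efficient algorithm that combines contrastive activation steering with amortized optimization to bake the benefits of activation steering directly into model weights, substantially reducing LLM hallucination.

Motivation: Large language models (LLMs) often hallucinate, confidently giving wrong answers instead of admitting ignorance. Existing steering approaches require real-time monitoring and intervention at inference time, which is inefficient. CASAL aims to be an efficient solution with low data requirements.

Contribution: 1. Proposes CASAL, which bakes the benefits of activation steering directly into model weights; 2. CASAL is 30x more compute-efficient and 20x more data-efficient than LoRA-based SFT and DPO baselines; 3. To the authors' knowledge, it is the first steering-based training method shown to be effective for both dense and Mixture-of-Experts (MoE) models.

Method: CASAL combines contrastive activation steering with amortized optimization and trains only a submodule of a single transformer layer, folding the effect of steering into the model weights to reduce hallucination.

Result: CASAL reduces hallucination by 30%-40% across multiple short-form QA benchmarks, is markedly more compute- and data-efficient than the SFT and DPO baselines, and generalizes effectively to out-of-distribution (OOD) domains.

Insight: CASAL shows the practical potential of interpretability-inspired methods and offers an efficient option for deployment in production systems.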

Abstract: Large Language Models (LLMs) exhibit impressive capabilities but often hallucinate, confidently providing incorrect answers instead of admitting ignorance. Prior work has shown that models encode linear representations of their own knowledge and that activation steering can reduce hallucinations. These approaches, however, require real-time monitoring and intervention during inference. We introduce Contrastive Activation Steering for Amortized Learning (CASAL), an efficient algorithm that connects interpretability with amortized optimization. CASAL directly bakes the benefits of activation steering into model’s weights. Once trained, LLMs answer questions they know while abstaining from answering those they do not. CASAL’s light-weight design requires training only a submodule of a single transformer layer and yet reduces hallucination by 30%-40% across multiple short-form QA benchmarks. CASAL is 30x more compute-efficient and 20x more data-efficient than strong LoRA-based baselines such as SFT and DPO, boosting its practical applicability in data scarce domains. Importantly, CASAL also generalizes effectively to out-of-distribution (OOD) domains. We showcase CASAL’s flexibility in mitigating hallucinations in both text-only and vision-language models. To our knowledge, CASAL is the first steering-based training method that has been shown to be effective for both dense and Mixture-of-Experts (MoE) models. CASAL represents a promising step forward for applying interpretability-inspired method for practical deployment in production systems.
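
To make the activation-steering idea in the abstract concrete, here is a minimal sketch of difference-of-means contrastive steering, a common formulation in the steering literature. It is not CASAL's training procedure: the toy activations, the function names `mean_vector`, `steering_vector`, and `steer`, and the scale `alpha` are all assumptions made for illustration.

```python
from typing import List

Vector = List[float]


def mean_vector(rows: List[Vector]) -> Vector:
    """Element-wise mean of equally sized activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def steering_vector(known: List[Vector], unknown: List[Vector]) -> Vector:
    """Contrastive direction: mean(known) minus mean(unknown)."""
    mu_known = mean_vector(known)
    mu_unknown = mean_vector(unknown)
    return [a - b for a, b in zip(mu_known, mu_unknown)]


def steer(hidden: Vector, direction: Vector, alpha: float) -> Vector:
    """Shift a hidden state along the direction; alpha < 0 pushes toward 'unknown'."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


# Toy 3-dimensional activations (hypothetical values, not from the paper).
known_acts = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
unknown_acts = [[0.0, 1.0, 2.0], [0.0, 3.0, 2.0]]

direction = steering_vector(known_acts, unknown_acts)
steered = steer([0.5, 0.5, 0.5], direction, alpha=-1.0)
```

Per the abstract, CASAL amortizes this kind of intervention by training a submodule of one transformer layer, so no steering is needed at inference time; the sketch covers only the steering step itself.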

[2] Hallucination-Resistant, Domain-Specific Research Assistant with Self-Evaluation and Vector-Grounded Retrieval

Vivek Bhavsar,Joseph Ereifej,Aravanan Gurusami

Main category: cs.CL

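TL;DR: RA-FSM is a modular GPT-based research assistant that wraps generation in a finite-state control loop (Relevance -> Confidence -> Knowledge) to reduce hallucination and mis-citation and make the assistant more useful in expert workflows.

Motivation: Large language models hallucinate and mis-cite when synthesizing literature, which limits their use in expert workflows.

Contribution: Proposes the RA-FSM system, which combines vector retrieval with a deterministic citation pipeline to produce transparent, verifiable answers.

Method: A finite-state controller, grounded in a dense vector index and a relational store, filters out-of-scope queries, scores answerability, decomposes questions, and triggers retrieval only when needed.

Result: In blinded A/B reviews across six task categories in photonics, domain experts prefer RA-FSM to both Notebook LM and a single-pass GPT API baseline, citing stronger boundary-condition handling and more defensible evidence use.

Insight: Modular design and a deterministic control flow can markedly improve the reliability and usefulness of language models in specialized domains.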

Abstract: Large language models accelerate literature synthesis but can hallucinate and mis-cite, limiting their usefulness in expert workflows. We present RA-FSM (Research Assistant - Finite State Machine), a modular GPT-based research assistant that wraps generation in a finite-state control loop: Relevance -> Confidence -> Knowledge. The system is grounded in vector retrieval and a deterministic citation pipeline. The controller filters out-of-scope queries, scores answerability, decomposes questions, and triggers retrieval only when needed, and emits answers with confidence labels and in-corpus, de-duplicated references. A ranked-tier ingestion workflow constructs a domain knowledge base from journals, conferences, indices, preprints, and patents, writing both to a dense vector index and to a relational store of normalized metrics. We implement the system for photonics and evaluate it on six task categories: analytical reasoning, numerical analysis, methodological critique, comparative synthesis, factual extraction, and application design. In blinded A/B reviews, domain experts prefer RA-FSM to both a strong Notebook LM (NLM) and a vanilla Default GPT API call single-pass baseline, citing stronger boundary-condition handling and more defensible evidence use. Coverage and novelty analyses indicate that RA-FSM explores beyond the NLM while incurring tunable latency and cost overheads. The design emphasizes transparent, well-cited answers for high-stakes technical work and is generalizable to other scientific domains.
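
The Relevance -> Confidence -> Knowledge loop described above can be sketched as a small state machine. This is an illustrative sketch only: the callables `in_scope`, `answerability`, and `retrieve`, the `threshold` value, the confidence labels, and the rule that retrieval fires when the answerability score falls below the threshold are all assumptions, not details taken from the paper.

```python
from enum import Enum, auto
from typing import Callable, List, Tuple


class State(Enum):
    RELEVANCE = auto()
    CONFIDENCE = auto()
    KNOWLEDGE = auto()
    DONE = auto()


def run_controller(
    query: str,
    in_scope: Callable[[str], bool],
    answerability: Callable[[str], float],
    retrieve: Callable[[str], List[str]],
    threshold: float = 0.7,
) -> Tuple[str, List[str], List[State]]:
    """Walk Relevance -> Confidence -> Knowledge; return (label, refs, trace)."""
    state = State.RELEVANCE
    trace: List[State] = []
    label: str = "out_of_scope"
    refs: List[str] = []
    while state is not State.DONE:
        trace.append(state)
        if state is State.RELEVANCE:
            # Filter out-of-scope queries before doing any further work.
            state = State.CONFIDENCE if in_scope(query) else State.DONE
        elif state is State.CONFIDENCE:
            # Assumed policy: retrieve only when the score is below the threshold.
            if answerability(query) >= threshold:
                label, state = "high_confidence", State.DONE
            else:
                state = State.KNOWLEDGE
        elif state is State.KNOWLEDGE:
            # De-duplicate references while preserving their order.
            refs = list(dict.fromkeys(retrieve(query)))
            label = "grounded" if refs else "low_confidence"
            state = State.DONE
    return label, refs, trace


# Hypothetical stand-ins for the scoring and retrieval components.
result = run_controller(
    "ring resonator Q factor",
    in_scope=lambda q: "resonator" in q,
    answerability=lambda q: 0.4,
    retrieve=lambda q: ["doc-1", "doc-2", "doc-1"],
)
```

Keeping the transitions in one explicit loop makes every answer traceable to the states it passed through, which fits the abstract's emphasis on transparency.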

[3] AMANDA: Agentic Medical Knowledge Augmentation for Data-Efficient Medical Visual Question Answering

Ziqing Wang,Chengsheng Mao,Xiaole Wen,Yuan Luo,Kaize Ding

Main category: cs.CL

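TL;DR: The paper proposes AMANDA, a framework that augments medical knowledge through LLM agents to address the reasoning bottlenecks of medical multimodal large language models in low-resource settings.

Motivation: Existing Med-MLLMs perform poorly in low-resource settings, mainly because their intrinsic reasoning ignores details in the medical image and their extrinsic reasoning lacks specialized medical knowledge.

Contribution: Proposes the AMANDA framework, which improves Med-VQA performance through intrinsic (question decomposition) and extrinsic (knowledge graph retrieval) medical knowledge augmentation.

Method: A training-free agentic framework that combines coarse-to-fine question decomposition with biomedical knowledge graph retrieval.

Result: Substantial improvements on eight Med-VQA benchmarks in both zero-shot and few-shot settings.

Insight: Combining LLM agents with knowledge graphs can compensate for gaps in medical reasoning, especially in low-resource scenarios.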

Abstract: Medical Multimodal Large Language Models (Med-MLLMs) have shown great promise in medical visual question answering (Med-VQA). However, when deployed in low-resource settings where abundant labeled data are unavailable, existing Med-MLLMs commonly fail due to their medical reasoning capability bottlenecks: (i) the intrinsic reasoning bottleneck that ignores the details from the medical image; (ii) the extrinsic reasoning bottleneck that fails to incorporate specialized medical knowledge. To address those limitations, we propose AMANDA, a training-free agentic framework that performs medical knowledge augmentation via LLM agents. Specifically, our intrinsic medical knowledge augmentation focuses on coarse-to-fine question decomposition for comprehensive diagnosis, while extrinsic medical knowledge augmentation grounds the reasoning process via biomedical knowledge graph retrieval. Extensive experiments across eight Med-VQA benchmarks demonstrate substantial improvements in both zero-shot and few-shot Med-VQA settings. The code is available at https://github.com/REAL-Lab-NU/AMANDA.

[4] SelfJudge: Faster Speculative Decoding via Self-Supervised Judge Verification

Kanghoon Yoon,Minsub Kim,Sungjae Lee,Joonhyung Lee,Sunghyeon Woo,Yeonjun In,Se Jung Kwon,Chanyoung Park,Dongsoo Lee

Main category: cs.CL

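TL;DR: The paper proposes SelfJudge, which trains judge verifiers through self-supervision of the target model to speed up large language model (LLM) inference. The method needs no human annotations or verifiable ground truths, which improves generalization across diverse NLP tasks.

Motivation: Existing judge decoding methods rely on human annotations or on tasks with verifiable ground truths, limiting their generality across diverse NLP tasks. SelfJudge aims to train the verifier in a self-supervised way and offer a more general solution.

Contribution: A self-supervised judge verification method that trains the verifier automatically by assessing semantic preservation, improving the trade-off between inference speed and accuracy without relying on external data.

Method: SelfJudge uses data generated by the target model and trains the judge verifier by checking whether token-substituted responses preserve the meaning of the original responses. The procedure applies automatically across diverse NLP tasks.

Result: Experiments show that SelfJudge achieves a better trade-off between inference speed and accuracy than judge decoding baselines.

Insight: Deriving the verification criterion from the target model's own self-supervised data reduces dependence on external annotations and makes the method more general and flexible.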

Abstract: Speculative decoding accelerates LLM inference by verifying candidate tokens from a draft model against a larger target model. Recent judge decoding boosts this process by relaxing verification criteria by accepting draft tokens that may exhibit minor discrepancies from target model output, but existing methods are restricted by their reliance on human annotations or tasks with verifiable ground truths, limiting generalizability across diverse NLP tasks. We propose SelfJudge, which trains judge verifiers via self-supervision of the target model. Our method measures semantic preservation by assessing whether token-substituted responses preserve the meaning of original responses, enabling automatic verifier training across diverse NLP tasks. Our experiments show SelfJudge achieves superior inference-accuracy trade-offs than judge decoding baselines, offering a broadly applicable solution for faster LLM inference.
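
The difference between strict verification and judge-relaxed verification can be illustrated with a small greedy-decoding sketch. It assumes `target[i]` is the target model's token at position `i` given the draft prefix, and it uses a hand-written synonym table as a hypothetical stand-in for a trained judge; SelfJudge's actual verifier and training procedure are not reproduced here.

```python
from typing import Callable, List


def verify_draft(
    draft: List[str],
    target: List[str],
    judge: Callable[[str, str], bool] = lambda d, t: False,
) -> List[str]:
    """Return the tokens emitted by one verification step (greedy variant).

    A draft token is accepted if it equals the target token or if the judge
    accepts the substitution. At the first rejection the target token is
    emitted instead and verification stops.
    """
    accepted: List[str] = []
    for d, t in zip(draft, target):
        if d == t or judge(d, t):
            accepted.append(d)
        else:
            accepted.append(t)
            break
    return accepted


draft_tokens = ["The", "film", "was", "great"]
target_tokens = ["The", "movie", "was", "good"]
synonyms = {("film", "movie")}  # hypothetical judge knowledge

strict = verify_draft(draft_tokens, target_tokens)
relaxed = verify_draft(
    draft_tokens, target_tokens, judge=lambda d, t: (d, t) in synonyms
)
```

In this toy case the relaxed verifier emits four tokens in the step where the strict one emits two, which is the source of the speedup; the accuracy side of the trade-off depends on how reliable the judge is.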

[5] Human Mobility Datasets Enriched With Contextual and Social Dimensions

Chiara Pugliese,Francesco Lettich,Guido Rocchietti,Chiara Renso,Fabio Pinelli

Main category: cs.CL

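TL;DR: This resource paper presents two semantically enriched human trajectory datasets and the pipeline used to build them. They combine real GPS traces, contextual data (stops, POIs, transportation modes, weather), and LLM-generated social media content to support multimodal and semantic analysis.

Motivation: Existing human mobility datasets usually lack semantic richness and multimodal support, so they fall short for complex analyses. The paper aims to fill this gap.

Contribution: To the authors' knowledge, the first reusable framework that combines real-world movement, structured semantic enrichment, LLM-generated text, and semantic web compatibility, covering two large cities: Paris and New York.

Method: The datasets are built with an open-source pipeline that integrates GPS traces from OpenStreetMap with contextual layers (stops, moves, transportation modes, and so on) and uses LLMs to generate synthetic social media posts.

Result: The datasets are released in tabular and RDF formats and support research tasks such as behavior modeling, mobility prediction, and knowledge graph construction.

Insight: LLM-generated social media content adds a semantic dimension to trajectory data and shows the potential of multimodal data fusion for mobility analysis.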

Abstract: In this resource paper, we present two publicly available datasets of semantically enriched human trajectories, together with the pipeline to build them. The trajectories are publicly available GPS traces retrieved from OpenStreetMap. Each dataset includes contextual layers such as stops, moves, points of interest (POIs), inferred transportation modes, and weather data. A novel semantic feature is the inclusion of synthetic, realistic social media posts generated by Large Language Models (LLMs), enabling multimodal and semantic mobility analysis. The datasets are available in both tabular and Resource Description Framework (RDF) formats, supporting semantic reasoning and FAIR data practices. They cover two structurally distinct, large cities: Paris and New York. Our open source reproducible pipeline allows for dataset customization, while the datasets support research tasks such as behavior modeling, mobility prediction, knowledge graph construction, and LLM-based applications. To our knowledge, our resource is the first to combine real-world movement, structured semantic enrichment, LLM-generated text, and semantic web compatibility in a reusable framework.

[6] Where Did It Go Wrong? Attributing Undesirable LLM Behaviors via Representation Gradient Tracing

Zhe Li,Wei Zhao,Yige Li,Jun Sun

Main category: cs.CL

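TL;DR: The paper proposes an efficient framework that diagnoses undesirable behaviors of large language models (LLMs), such as harmful content generation and biased outputs, directly in the model's activation space by analyzing representations and their gradients.

Motivation: Existing attribution methods based on parameter gradients suffer from noisy signals and high computational complexity, making it hard to diagnose undesirable LLM behaviors effectively.

Contribution: A novel representation-gradient framework that provides a semantically meaningful signal directly in activation space and supports both sample-level and fine-grained token-level attribution.

Method: The framework analyzes representations and their gradients to link outputs to training data directly in the LLM's activation space, enabling efficient behavior diagnosis.

Result: The method performs well on tracking harmful content, detecting backdoor poisoning, and identifying knowledge contamination, and it supports fine-grained analysis.

Insight: The framework is a strong diagnostic tool for understanding and mitigating LLM risks, especially where precise attribution is needed.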

Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities, yet their deployment is frequently undermined by undesirable behaviors such as generating harmful content, factual inaccuracies, and societal biases. Diagnosing the root causes of these failures poses a critical challenge for AI safety. Existing attribution methods, particularly those based on parameter gradients, often fall short due to prohibitive noisy signals and computational complexity. In this work, we introduce a novel and efficient framework that diagnoses a range of undesirable LLM behaviors by analyzing representation and its gradients, which operates directly in the model’s activation space to provide a semantically meaningful signal linking outputs to their training data. We systematically evaluate our method for tasks that include tracking harmful content, detecting backdoor poisoning, and identifying knowledge contamination. The results demonstrate that our approach not only excels at sample-level attribution but also enables fine-grained token-level analysis, precisely identifying the specific samples and phrases that causally influence model behavior. This work provides a powerful diagnostic tool to understand, audit, and ultimately mitigate the risks associated with LLMs. The code is available at https://github.com/plumprc/RepT.

[7] Optimizing Long-Form Clinical Text Generation with Claim-Based Rewards

Samyak Jhaveri,Praphul Singh,Jangwon Kim,Tara Taghavi,Krishnaram Kenthapadi

Main category: cs.CL

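TL;DR: The paper proposes a reinforcement learning framework for long-form clinical text generation that couples Group Relative Policy Optimization (GRPO) with the DocLens evaluator, directly optimizing factual grounding and completeness without training a separate reward model or relying on human-authored references.

Motivation: Automated clinical documentation must be complete and factually grounded, yet existing methods may depend on human references or complex reward models, which limits efficiency and quality.

Contribution: An evaluation-integrated reinforcement learning framework that directly optimizes the factuality and completeness of clinical text, with a simple reward-gating strategy that lowers training cost.

Method: The method couples GRPO with DocLens: DocLens provides deterministic, dialogue-grounded rewards at the claim level, and GRPO optimizes the generation policy.

Result: Experiments show improved clinical note quality with fewer omissions and hallucinations, and an independent GPT-5 qualitative evaluation shows higher preference for the GRPO outputs.

Insight: Generation quality can be optimized directly without a separate reward model or human references, which suits real-world clinical documentation.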

Abstract: Automating clinical documentation with large language models requires precise alignment with priorities such as completeness and factual grounding. We present an evaluation-integrated reinforcement learning framework for long-form clinical text generation that couples Group Relative Policy Optimization (GRPO) with DocLens, a claim-level evaluator that provides deterministic, dialogue-grounded rewards. Our method directly optimizes factual grounding and completeness without training a separate reward model or relying on human-authored references. Empirically, the approach improves clinical note quality and reduces training cost via a simple reward-gating strategy. An independent GPT-5 qualitative evaluation further supports these gains, showing higher preference for GRPO outputs in factuality, completeness, and brevity, with fewer omissions and hallucinations. Because the benchmarks are relatively clean and the base model already well aligned, these improvements likely represent a conservative lower bound. The framework is scalable to real-world settings and can incorporate custom objectives such as guideline adherence or billing preferences.
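
For readers unfamiliar with GRPO, the core step is to score each sampled output relative to the other samples for the same prompt. The sketch below computes those group-relative advantages; the reward values are hypothetical stand-ins for claim-level evaluator scores, the population standard deviation and `eps` are implementation assumptions (implementations differ on the exact normalization), and the paper's reward-gating strategy is not modeled.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward by the mean and spread of its sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every sample gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical evaluator rewards for four notes sampled from one dialogue.
rewards = [1.0, 0.0, 1.0, 0.0]
advantages = group_relative_advantages(rewards)
```

Because the baseline is the group mean, no learned value function is needed, which matches the abstract's point that no separate reward model has to be trained.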

[8] Can Prompts Rewind Time for LLMs? Evaluating the Effectiveness of Prompted Knowledge Cutoffs

Xin Gao,Ruiyi Zhang,Daniel Du,Saurabh Mahindre,Sai Ashish Somayajula,Pengtao Xie

Main category: cs.CL

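TL;DR: The paper studies whether large language models (LLMs) can be prompted to simulate an earlier knowledge cutoff and evaluates how well this works. Prompting is effective when the model is directly asked about post-cutoff information, but it struggles to induce forgetting of causally related content, underscoring the need for more rigorous evaluation in temporal prediction tasks.

Motivation: In temporal prediction tasks, LLMs' reliance on pretraining data means accurate predictions may reflect memorization rather than reasoning, leading to overestimates of their generalization. The authors therefore examine how effective prompted knowledge cutoffs are.

Contribution: 1. Builds three evaluation datasets that test how well LLMs can forget different types of knowledge; 2. Reveals a performance gap between direct queries and causally related knowledge; 3. Releases the datasets and evaluation code to support further research.

Method: Prompting is used to simulate forgetting in LLMs, and its effect is evaluated on three datasets: direct factual knowledge, semantic shifts, and causally related knowledge.

Result: Prompting simulates forgetting effectively when the post-cutoff information is queried directly, but it has limited effect on causally related knowledge that is not directly asked about.

Insight: Prompting has clear limits as a way to simulate a knowledge cutoff, particularly in causal reasoning tasks, which calls for more rigorous evaluation in temporal prediction.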

Abstract: Large Language Models (LLMs) are widely used for temporal prediction, but their reliance on pretraining data raises contamination concerns, as accurate predictions on pre-cutoff test data may reflect memorization rather than reasoning, leading to an overestimation of their generalization capability. With the recent emergence of prompting-based unlearning techniques, a natural question arises: Can LLMs be prompted to simulate an earlier knowledge cutoff? In this work, we investigate the capability of prompting to simulate earlier knowledge cutoff in LLMs. We construct three evaluation datasets to assess the extent to which LLMs can forget (1) direct factual knowledge, (2) semantic shifts, and (3) causally related knowledge. Results demonstrate that while prompt-based simulated knowledge cutoffs show effectiveness when directly queried with the information after that date, they struggle to induce forgetting when the forgotten content is not directly asked but causally related to the query. These findings highlight the need for more rigorous evaluation settings when applying LLMs for temporal prediction tasks. The full dataset and evaluation code are available at https://github.com/gxx27/time_unlearn.

[9] LLMSQL: Upgrading WikiSQL for the LLM Era of Text-to-SQL

Dzmitry Pihulski,Karol Charchut,Viktoria Novogrodskaia,Jan Kocoń

Main category: cs.CL

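TL;DR: LLMSQL is a systematic revision and transformation of the WikiSQL dataset for the large language model (LLM) era. It fixes the structural and annotation problems of the original dataset and provides a clean benchmark suited to modern Text-to-SQL models.

Motivation: WikiSQL played a key role in early NL2SQL research, but its usage has declined because of structural and annotation issues such as case sensitivity inconsistencies, data type mismatches, and syntax errors. LLMSQL aims to provide a dataset better suited to the LLM era.

Contribution: 1) Classifies WikiSQL's errors and cleans and re-annotates them with automated methods; 2) Provides a clean benchmark that directly supports generation and evaluation for modern Text-to-SQL models.

Method: The paper uses automated methods to clean and re-annotate WikiSQL, resolving issues such as case sensitivity and data type mismatches, and produces clean natural language questions and full SQL queries as plain text.

Result: Multiple large language models (including Gemma 3 and LLaMA 3.2) are evaluated to assess the impact of these improvements and to position LLMSQL as a benchmark for modern Text-to-SQL.

Insight: LLMSQL is introduced not merely as a dataset update but as an LLM-ready benchmark that emphasizes direct generation and evaluation of SQL queries, rather than the token selection from the input used by pointer-network models.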

Abstract: Converting natural language questions into SQL queries (Text-to-SQL) enables non-expert users to interact with relational databases and has long been a central task for natural language interfaces to data. While the WikiSQL dataset played a key role in early NL2SQL research, its usage has declined due to structural and annotation issues, including case sensitivity inconsistencies, data type mismatches, syntax errors, and unanswered questions. We present LLMSQL, a systematic revision and transformation of WikiSQL designed for the LLM era. We classify these errors and implement automated methods for cleaning and re-annotation. To assess the impact of these improvements, we evaluated multiple large language models (LLMs), including Gemma 3, LLaMA 3.2, Mistral 7B, gpt-oss 20B, Phi-3.5 Mini, Qwen 2.5, OpenAI o4-mini, DeepSeek R1 and others. Rather than serving as an update, LLMSQL is introduced as an LLM-ready benchmark: unlike the original WikiSQL, tailored for pointer-network models selecting tokens from input, LLMSQL provides clean natural language questions and full SQL queries as plain text, enabling straightforward generation and evaluation for modern natural language-to-SQL models.

[10] Language, Culture, and Ideology: Personalizing Offensiveness Detection in Political Tweets with Reasoning LLMs

Dzmitry Pihulski,Jan Kocoń

Main category: cs.CL

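TL;DR: The paper studies how large language models (LLMs) judge the offensiveness of political tweets when prompted to adopt specific political and cultural perspectives, and finds that larger models with explicit reasoning abilities handle ideological and cultural variation better.

Motivation: To examine how LLMs can personalize offensiveness judgments in multilingual and politically diverse settings, improving their sensitivity and adaptability in sociopolitical text classification.

Contribution: Shows the importance of reasoning ability for offensiveness detection with LLMs and demonstrates the stronger performance of larger models across languages and political perspectives.

Method: Using a multilingual subset of the MD-Agreement dataset, the authors evaluate several LLMs (such as DeepSeek-R1 and GPT-4.1-mini) under different political personas (far-right, conservative, and others) and cultural contexts.

Result: Larger models (such as DeepSeek-R1) are more consistent and sensitive in offensiveness detection, while smaller models often miss subtle distinctions. Reasoning capabilities significantly improve personalization and interpretability.

Insight: Reasoning mechanisms are key to adapting LLMs for sociopolitical text classification across languages and ideologies.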

Abstract: We explore how large language models (LLMs) assess offensiveness in political discourse when prompted to adopt specific political and cultural perspectives. Using a multilingual subset of the MD-Agreement dataset centered on tweets from the 2020 US elections, we evaluate several recent LLMs - including DeepSeek-R1, o4-mini, GPT-4.1-mini, Qwen3, Gemma, and Mistral - tasked with judging tweets as offensive or non-offensive from the viewpoints of varied political personas (far-right, conservative, centrist, progressive) across English, Polish, and Russian contexts. Our results show that larger models with explicit reasoning abilities (e.g., DeepSeek-R1, o4-mini) are more consistent and sensitive to ideological and cultural variation, while smaller models often fail to capture subtle distinctions. We find that reasoning capabilities significantly improve both the personalization and interpretability of offensiveness judgments, suggesting that such mechanisms are key to adapting LLMs for nuanced sociopolitical text classification across languages and ideologies.

[11] Modeling the language cortex with form-independent and enriched representations of sentence meaning reveals remarkable semantic abstractness

Shreya Saha,Shurui Li,Greta Tuckute,Yuanning Li,Ru-Yuan Zhang,Leila Wehbe,Evelina Fedorenko,Meenakshi Khosla

Main category: cs.CL

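TL;DR: The paper studies abstract meaning representations in the human language cortex. Using embeddings from vision and language models, it finds that embeddings averaged over multiple images or multiple paraphrases predict language cortex responses more accurately, revealing highly abstract meaning representations.

Motivation: To probe how abstract meaning is in the human language system and how the language cortex represents semantic information beyond specific linguistic forms.

Contribution: Demonstrates highly abstract, form-independent meaning representations in the language cortex, supported by embeddings from vision and language models.

Method: Vision and language model embeddings are aggregated across multiple generated images of a sentence, or averaged across multiple paraphrases, and used to predict language cortex responses.

Result: Averaging over multiple images or paraphrases improves prediction accuracy, and paraphrases enriched with implicit contextual details even surpass the embedding of the original sentence, suggesting that the language system holds richer semantic representations.

Insight: Semantic representations in the language cortex are richer and broader than those of language models, and highly abstract.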

Abstract: The human language system represents both linguistic forms and meanings, but the abstractness of the meaning representations remains debated. Here, we searched for abstract representations of meaning in the language cortex by modeling neural responses to sentences using representations from vision and language models. When we generate images corresponding to sentences and extract vision model embeddings, we find that aggregating across multiple generated images yields increasingly accurate predictions of language cortex responses, sometimes rivaling large language models. Similarly, averaging embeddings across multiple paraphrases of a sentence improves prediction accuracy compared to any single paraphrase. Enriching paraphrases with contextual details that may be implicit (e.g., augmenting “I had a pancake” to include details like “maple syrup”) further increases prediction accuracy, even surpassing predictions based on the embedding of the original sentence, suggesting that the language system maintains richer and broader semantic representations than language models. Together, these results demonstrate the existence of highly abstract, form-independent meaning representations within the language cortex.
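
The paraphrase-averaging step is simple to state in code. The toy example below shows why pooling can help: form-specific components of individual paraphrase embeddings cancel, leaving the shared component. The 2-dimensional vectors, the `meaning` reference vector, and the use of cosine similarity are illustrative assumptions; the study itself evaluates embeddings by how well they predict neural responses, which this sketch does not model.

```python
import math
from typing import List

Vector = List[float]


def average_embeddings(embeddings: List[Vector]) -> Vector:
    """Element-wise mean over a set of equally sized embeddings."""
    n = len(embeddings)
    return [sum(col) / n for col in zip(*embeddings)]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical embeddings: a shared meaning axis plus opposite form-specific noise.
meaning = [1.0, 0.0]
paraphrases = [[1.0, 1.0], [1.0, -1.0]]

pooled = average_embeddings(paraphrases)
single_scores = [cosine(p, meaning) for p in paraphrases]
pooled_score = cosine(pooled, meaning)
```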

[12] ChunkLLM: A Lightweight Pluggable Framework for Accelerating LLMs Inference

Haojie Ouyang,Jianwei Lv,Lei Ren,Chen Wei,Xiaojie Wang,Fangxiang Feng

Main category: cs.CL

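TL;DR: ChunkLLM is a lightweight, pluggable framework that speeds up large language model inference by making self-attention computation more efficient.

Motivation: Transformer-based large models are computationally inefficient because self-attention scales quadratically with the number of input tokens. Existing block selection and compression methods suffer from either semantic incompleteness or poor training-inference efficiency, so a more comprehensive solution is needed.

Contribution: Proposes the ChunkLLM framework with two core components, the QK Adapter and the Chunk Adapter, which improve inference efficiency through feature compression and chunk attention, together with an attention distillation method that raises the recall of key chunks.

Method: 1. The QK Adapter performs feature compression and chunk attention acquisition; 2. The Chunk Adapter detects chunk boundaries; 3. During training the backbone is frozen and only the adapters are trained; 4. During inference, chunk selection is triggered only when the current token is detected as a chunk boundary.

Result: ChunkLLM attains comparable performance on short-text benchmarks, maintains 98.64% of the performance on long-context benchmarks with a 48.58% key-value cache retention rate, and reaches a maximum speedup of 4.48x over the vanilla Transformer on 120K-token texts.

Insight: Adapter modules that make self-attention cheaper can balance semantic completeness against training-inference efficiency and markedly improve long-text processing.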

Abstract: Transformer-based large models excel in natural language processing and computer vision, but face severe computational inefficiencies due to the self-attention’s quadratic complexity with input tokens. Recently, researchers have proposed a series of methods based on block selection and compression to alleviate this problem, but they either have issues with semantic incompleteness or poor training-inference efficiency. To comprehensively address these challenges, we propose ChunkLLM, a lightweight and pluggable training framework. Specifically, we introduce two components: QK Adapter (Q-Adapter and K-Adapter) and Chunk Adapter. The former is attached to each Transformer layer, serving dual purposes of feature compression and chunk attention acquisition. The latter operates at the bottommost layer of the model, functioning to detect chunk boundaries by leveraging contextual semantic information. During the training phase, the parameters of the backbone remain frozen, with only the QK Adapter and Chunk Adapter undergoing training. Notably, we design an attention distillation method for training the QK Adapter, which enhances the recall rate of key chunks. During the inference phase, chunk selection is triggered exclusively when the current token is detected as a chunk boundary, thereby accelerating model inference. Experimental evaluations are conducted on a diverse set of long-text and short-text benchmark datasets spanning multiple tasks. ChunkLLM not only attains comparable performance on short-text benchmarks but also maintains 98.64% of the performance on long-context benchmarks while preserving a 48.58% key-value cache retention rate. Particularly, ChunkLLM attains a maximum speedup of 4.48x in comparison to the vanilla Transformer in the processing of 120K long texts.

[13] Beyond Manuals and Tasks: Instance-Level Context Learning for LLM Agents

Kuntai Cai,Juncheng Liu,Xianglin Yang,Zhaojie Niu,Xiaokui Xiao,Xing Chen

Main category: cs.CL

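TL;DR: The paper proposes Instance-Level Context Learning (ILCL), a new form of context learning that addresses the tendency of large language model (LLM) agents to overlook verifiable, reusable facts tied to a specific environment instance in complex tasks. Guided exploration and a lightweight plan-act-extract loop markedly improve task success and efficiency.

Motivation: LLM agents usually rely on environment-level manuals and task-level guidance but ignore instance-level context (object locations, crafting recipes, and so on), which is a common source of failure in complex tasks. The authors argue that exploring and exploiting this context efficiently is key to better agent performance.

Contribution: (1) Formalizes the Instance-Level Context Learning (ILCL) problem; (2) Designs a task-agnostic method that uses a TODO forest and a plan-act-extract loop to automatically produce a high-precision, reusable context document; (3) Validates the method on multiple benchmarks.

Method: Guided exploration with a lightweight plan-act-extract loop: (1) a compact TODO forest prioritizes the next actions; (2) iterating plan-act-extract produces the context document; (3) the document is reusable across downstream tasks and agents.

Result: Experiments on TextWorld, ALFWorld, and Crafter show consistent gains; in TextWorld, ReAct's mean success rate rises from 37% to 95% and IGE's from 81% to 95%. Both efficiency and reliability improve.

Insight: Instance-level context is key to LLM agent success in complex tasks. By turning one-off exploration into persistent knowledge, the method highlights the value of context reuse and efficient exploration in agent design.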

Abstract: Large language model (LLM) agents typically receive two kinds of context: (i) environment-level manuals that define interaction interfaces and global rules, and (ii) task-level guidance or demonstrations tied to specific goals. In this work, we identify a crucial but overlooked third type of context, instance-level context, which consists of verifiable and reusable facts tied to a specific environment instance, such as object locations, crafting recipes, and local rules. We argue that the absence of instance-level context is a common source of failure for LLM agents in complex tasks, as success often depends not only on reasoning over global rules or task prompts but also on making decisions based on precise and persistent facts. Acquiring such context requires more than memorization: the challenge lies in efficiently exploring, validating, and formatting these facts under tight interaction budgets. We formalize this problem as Instance-Level Context Learning (ILCL) and introduce our task-agnostic method to solve it. Our method performs a guided exploration, using a compact TODO forest to intelligently prioritize its next actions and a lightweight plan-act-extract loop to execute them. This process automatically produces a high-precision context document that is reusable across many downstream tasks and agents, thereby amortizing the initial exploration cost. Experiments across TextWorld, ALFWorld, and Crafter demonstrate consistent gains in both success and efficiency: for instance, ReAct’s mean success rate in TextWorld rises from 37% to 95%, while IGE improves from 81% to 95%. By transforming one-off exploration into persistent, reusable knowledge, our method complements existing contexts to enable more reliable and efficient LLM agents.

[14] Pretraining with hierarchical memories: separating long-tail and common knowledge

Hadi Pouransari,David Grangier,C Thomas,Michael Kirchhof,Oncel Tuzel

Main category: cs.CL

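TL;DR: The paper proposes pretraining small language models with hierarchical memory banks. Fetching a small, context-dependent memory block substantially improves performance while keeping the parameter count low.

Motivation: Modern language models improve by scaling parameters, but compressing all world knowledge into model parameters is unnecessary and impractical for edge devices. The authors therefore store long-tail knowledge in memory parameters while the small language model focuses on common knowledge and general reasoning.

Contribution: A hierarchical memory-augmented architecture and pretraining strategy, a demonstration of the gains from pairing small language models with memory banks, and an extensive experimental study of memory type and size.

Method: The core is a hierarchical parametric memory bank with context-dependent fetching of memory blocks. During pretraining, long-tail knowledge is stored in the memory parameters while the small model learns common knowledge. At inference, only a small relevant memory block is loaded, limiting memory and compute.

Result: A 160M-parameter model augmented with an 18M-parameter memory fetched from a 4.6B memory bank performs comparably to a regular model with more than 2x the parameters. Memories scale robustly to over 21B parameters.

Insight: Hierarchical memories can separate long-tail knowledge from common knowledge, pointing to a new route for deploying small language models efficiently.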

Abstract: The impressive performance gains of modern language models currently rely on scaling parameters: larger models store more world knowledge and reason better. Yet compressing all world knowledge into parameters is unnecessary, as only a fraction is used per prompt, and impractical for edge devices with limited inference-time memory and compute. We address this shortcoming by a memory-augmented architecture and a pretraining strategy aligned with existing hardware paradigms. We introduce small language models that access large hierarchical parametric memory banks encoding world knowledge. During pretraining and inference, we fetch a small, context-dependent memory block and add it to the model. Our pretraining learns to store long-tail world knowledge in the memory parameters, while the small language model acts as an anchor capturing common knowledge and general reasoning abilities. Through trillion-token-scale experiments, we show significant gains: a 160M-parameters model augmented with an 18M-parameters memory fetched from a 4.6B memory bank obtains comparable performance to a regular model with more than 2x the parameters. Through extensive experiments, we study the optimal type and size of parametric memories in transformers, scaling them to over 21B parameters. We find that our proposed hierarchical feed-forward memories work robustly across transformer architectures, whether added during pretraining or post-hoc.

[15] Uncertainty-Aware Answer Selection for Improved Reasoning in Multi-LLM Systems

Aakriti Agrawal,Rohith Aralikatti,Anirudh Satheesh,Souradip Chakraborty,Amrit Singh Bedi,Furong Huang

Main category: cs.CL

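TL;DR: The paper proposes an efficient answer selection method for multi-LLM systems based on a calibrated log-likelihood score, improving reasoning performance without costly external verifiers or human evaluation.

Motivation: Selecting the most reliable response from multiple LLMs is difficult. Existing methods depend on costly external verification or on drawing multiple samples, which limits the potential of multi-LLM systems.

Contribution: A novel, computationally efficient method that uses a calibrated log-likelihood score to implicitly leverage the models' knowledge and confidence, improving reasoning performance.

Method: A calibrated log-likelihood score selects the best response from multiple LLMs, in both debate (multi-round discussion) and non-debate (Best-of-N) settings.

Result: Improvements of about 4%, 3%, and 5% on GSM8K, MMLU (6 subsets), and ARC, respectively.

Insight: The method implicitly uses the models' inherent knowledge and confidence, giving multi-LLM systems an efficient and reliable way to select answers.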

Abstract: Large Language Models (LLMs) have demonstrated exceptional capabilities, yet selecting the most reliable response from multiple LLMs remains a challenge, particularly in resource-constrained settings. Existing approaches often depend on costly external verifiers, human evaluators, or self-consistency techniques that require multiple samples from a single model. While multi-LLM systems produce more diverse responses than single models and thus have greater potential, they often underperform compared to single LLM self-consistency. We propose a principled, novel and computationally efficient method to select the best response from multiple different LLMs using a calibrated log-likelihood score, implicitly leveraging the inherent knowledge and confidence of these models. Our method demonstrates improvements of approx. 4%, 3%, and 5% across both debate (multi-round LLM discussions) and non-debate (Best-of-N with multiple LLMs) settings on GSM8K, MMLU (6 subsets), and ARC datasets respectively.
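
A likelihood-based selector can be sketched in a few lines. The abstract does not specify how the score is calibrated, so this sketch substitutes simple length normalization (mean token log-probability) as a labeled assumption; the model names, responses, and log-probabilities are hypothetical.

```python
from typing import Dict, List, Tuple


def length_normalized_loglik(token_logprobs: List[float]) -> float:
    """Mean token log-probability; a stand-in for the paper's calibrated score."""
    return sum(token_logprobs) / len(token_logprobs)


def select_response(
    candidates: Dict[str, Tuple[str, List[float]]]
) -> Tuple[str, str]:
    """Pick the (model, response) pair with the highest normalized score."""
    best = max(
        candidates,
        key=lambda name: length_normalized_loglik(candidates[name][1]),
    )
    return best, candidates[best][0]


# Hypothetical responses with per-token log-probabilities from each model.
candidates = {
    "model_a": ("42", [-0.2, -0.4]),
    "model_b": ("41", [-1.0, -1.4, -0.6]),
}
winner, answer = select_response(candidates)
```

Normalizing by length keeps longer responses from being penalized merely for containing more tokens; it does not correct for differences in confidence scale between models, which is presumably what a proper calibration step addresses.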

[16] Learning to Route: A Rule-Driven Agent Framework for Hybrid-Source Retrieval-Augmented Generation

Haoyue Bai,Haoyu Wang,Shengyu Chen,Zhengzhang Chen,Lu-An Tang,Wei Cheng,Haifeng Chen,Yanjie Fu

Main category: cs.CL

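TL;DR: The paper proposes a rule-driven routing framework for hybrid-source retrieval-augmented generation (RAG) that selects the most suitable retrieval path (database or documents) for each query to improve accuracy and efficiency.

Motivation: Existing RAG systems rely mainly on unstructured documents and overlook the strengths of relational databases. The paper aims to exploit the complementarity of the two sources and to optimize query handling through rule-driven routing.

Contribution: 1) A systematic analysis of the complementarity of databases and documents; 2) A rule-driven routing framework comprising a routing agent, a rule-making expert agent, and a path-level meta-cache; 3) Validation on multiple QA benchmarks.

Method: A routing agent scores candidate paths against explicit rules, a rule-making expert agent refines the rules iteratively using QA feedback, and a meta-cache reuses routing decisions for semantically similar queries.

Result: On three QA benchmarks the framework outperforms static strategies and learned routing baselines, achieving higher accuracy at moderate computational cost.

Insight: Query types align with retrieval paths in regular ways, so rule-driven routing can noticeably improve hybrid-source RAG.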

Abstract: Large Language Models (LLMs) have shown remarkable performance on general Question Answering (QA), yet they often struggle in domain-specific scenarios where accurate and up-to-date information is required. Retrieval-Augmented Generation (RAG) addresses this limitation by enriching LLMs with external knowledge, but existing systems primarily rely on unstructured documents, while largely overlooking relational databases, which provide precise, timely, and efficiently queryable factual information, serving as indispensable infrastructure in domains such as finance, healthcare, and scientific research. Motivated by this gap, we conduct a systematic analysis that reveals three central observations: (i) databases and documents offer complementary strengths across queries, (ii) naively combining both sources introduces noise and cost without consistent accuracy gains, and (iii) selecting the most suitable source for each query is crucial to balance effectiveness and efficiency. We further observe that query types show consistent regularities in their alignment with retrieval paths, suggesting that routing decisions can be effectively guided by systematic rules that capture these patterns. Building on these insights, we propose a rule-driven routing framework. A routing agent scores candidate augmentation paths based on explicit rules and selects the most suitable one; a rule-making expert agent refines the rules over time using QA feedback to maintain adaptability; and a path-level meta-cache reuses past routing decisions for semantically similar queries to reduce latency and cost. Experiments on three QA benchmarks demonstrate that our framework consistently outperforms static strategies and learned routing baselines, achieving higher accuracy while maintaining moderate computational cost.
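
The routing agent and meta-cache described above can be sketched as follows. Everything concrete here is an assumption for illustration: the keyword rules, their weights, the path names, and the cache key (a normalized token set standing in for semantic similarity). The rule-making expert agent that refines rules from QA feedback is not modeled.

```python
from typing import Callable, Dict, List, Tuple

# A rule is (predicate over the query, path it votes for, weight of the vote).
Rule = Tuple[Callable[[str], bool], str, float]


class RuleRouter:
    def __init__(self, rules: List[Rule], paths: List[str]) -> None:
        self.rules = rules
        self.paths = paths
        self.cache: Dict[str, str] = {}
        self.cache_hits = 0

    @staticmethod
    def _key(query: str) -> str:
        # Crude stand-in for semantic similarity: order-insensitive token set.
        return " ".join(sorted(set(query.lower().split())))

    def route(self, query: str) -> str:
        key = self._key(query)
        if key in self.cache:
            # Reuse a past routing decision instead of re-scoring.
            self.cache_hits += 1
            return self.cache[key]
        scores = {path: 0.0 for path in self.paths}
        for predicate, path, weight in self.rules:
            if predicate(query):
                scores[path] += weight
        # Ties fall back to the first path listed.
        best = max(self.paths, key=lambda p: scores[p])
        self.cache[key] = best
        return best


rules: List[Rule] = [
    (lambda q: any(w in q.lower() for w in ("how many", "total", "average")), "database", 1.0),
    (lambda q: any(w in q.lower() for w in ("why", "explain", "describe")), "documents", 1.0),
]
router = RuleRouter(rules, paths=["documents", "database"])

first = router.route("How many trades settled today")
second = router.route("today how many trades settled")
third = router.route("Explain the settlement policy")
```

Keeping the rules explicit makes each routing decision inspectable, and the cache trades a small risk of stale decisions for lower latency on repeated or near-duplicate queries.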

[17] Words That Make Language Models Perceive

Sophie L. Wang,Phillip Isola,Brian Cheung

Main category: cs.CL

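TL;DR: The paper asks whether large language models (LLMs) trained only on text can be pushed toward multimodal representations by sensory prompts such as 'see' or 'hear'. It finds that simple prompt engineering steers the model toward representations closer to those of vision or audio encoders.

Details Motivation: Although LLMs are trained on text alone, they may have implicitly learned multimodal regularities. The paper tests whether explicit sensory prompts can surface this latent multimodal structure.

Contribution: Shows that simple sensory prompts (e.g. 'see' or 'hear') markedly improve the representational alignment of text-only LLMs with specialist vision and audio encoders.

Method: Adds sensory prompts (e.g. 'see' or 'hear') to the model input and measures whether the resulting representations move closer to the specialist encoder for that modality (such as CLIP or HuBERT).

Result: Under sensory prompting, the representations of text-only LLMs are closer to those of multimodal encoders, confirming that the prompts are effective.

Insight: Text-only language models may have implicitly learned multimodal knowledge, and sensory prompting is a lightweight way to activate it, which suggests new options for multimodal uses of LLMs.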

Abstract: Large language models (LLMs) trained purely on text ostensibly lack any direct perceptual experience, yet their internal representations are implicitly shaped by multimodal regularities encoded in language. We test the hypothesis that explicit sensory prompting can surface this latent structure, bringing a text-only LLM into closer representational alignment with specialist vision and audio encoders. When a sensory prompt tells the model to ‘see’ or ‘hear’, it cues the model to resolve its next-token predictions as if they were conditioned on latent visual or auditory evidence that is never actually supplied. Our findings reveal that lightweight prompt engineering can reliably activate modality-appropriate representations in purely text-trained LLMs.

[18] CLARITY: Clinical Assistant for Routing, Inference, and Triage

Vladimir Shaposhnikov,Aleksandr Nesterov,Ilia Kopanichuk,Ivan Bakulin,Egor Zhelvakov,Ruslan Abramov,Ekaterina Tsapieva,Dmitry V. Dylov,Ivan Oseledets

Main category: cs.CL

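TL;DR: CLARITY is an AI-driven clinical assistant platform that combines a finite state machine with large language models for patient triage, clinical consultation, and severity assessment.

Details Motivation: Improve the efficiency and accuracy of patient triage, shorten consultations, and fit the requirements of existing healthcare IT systems.

Contribution: Proposes a hybrid architecture combining an FSM with LLMs for efficient patient routing and condition assessment, and validates it on a large-scale healthcare IT platform.

Method: A modular microservices framework in which an FSM structures the dialogue flow and collaborating LLM agents analyze symptoms and route patients.

Result: Deployed across more than 55,000 dialogues, 2,500 of which were expert-annotated for validation. CLARITY exceeds human-level first-attempt routing precision, and consultations are up to 3 times shorter.

Insight: Modular design and a hybrid architecture improve the adaptability and performance of AI systems in complex healthcare settings.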

Abstract: We present CLARITY (Clinical Assistant for Routing, Inference, and Triage), an AI-driven platform designed to facilitate patient-to-specialist routing, clinical consultations, and severity assessment of patients’ conditions. Its hybrid architecture combines a Finite State Machine (FSM) for structured dialogue flows with collaborative agents that employ Large Language Model (LLM) to analyze symptoms and prioritize referrals to appropriate specialists. Built on a modular microservices framework, CLARITY ensures safe, efficient, and robust performance, flexible and readily scalable to meet the demands of existing workflows and IT solutions in healthcare. We report integration of our clinical assistant into a large-scale nation-wide inter-hospital IT platform, with over 55,000 content-rich user dialogues completed within the two months of deployment, 2,500 of which were expert-annotated for a consequent validation. The validation results show that CLARITY surpasses human-level performance in terms of the first-attempt routing precision, naturally requiring up to 3 times shorter duration of the consultation than with a human.

[19] Knowledge-Graph Based RAG System Evaluation Framework

Sicheng Dong,Vahid Zolfaghari,Nenad Petrovic,Alois Knoll

Main category: cs.CL

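TL;DR: The paper proposes a knowledge-graph (KG) based evaluation framework for RAG systems. It extends the RAGAS tool with multi-hop reasoning and semantic community clustering to provide more comprehensive scoring metrics.

Details Motivation: Existing metrics struggle to capture the high fluency and naturalness of modern LLM output, so traditional methods cannot fully evaluate RAG systems.

Contribution: Extends RAGAS with multi-hop reasoning and semantic community clustering, yielding a more comprehensive KG-based evaluation framework.

Method: KG-based multi-hop reasoning and semantic community clustering, with a human-annotated subset used to check correlation with human judgments.

Result: Experiments show the KG-based method is more sensitive to semantic differences than RAGAS and correlates better with human judgments.

Insight: Future work should target more sensitive semantic evaluation metrics and further refinement of the multi-hop reasoning and clustering components.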

Abstract: Large language models (LLMs) have become a significant research focus and are used in various fields, such as text generation and dialog systems. One of the most essential applications of LLMs is Retrieval-Augmented Generation (RAG), which greatly enhances the reliability and relevance of generated content. However, evaluating RAG systems remains a challenging task. Traditional evaluation metrics struggle to capture the key features of modern LLM-generated content, which often exhibits high fluency and naturalness. Inspired by RAGAS, a well-known RAG evaluation framework, we extend it into a KG-based evaluation paradigm that enables multi-hop reasoning and semantic community clustering to derive more comprehensive scoring metrics. By incorporating these evaluation criteria, we gain a deeper understanding of RAG systems and a more nuanced perspective on their performance. To validate the effectiveness of our approach, we compare its performance with RAGAS scores and construct a human-annotated subset to assess the correlation between human judgments and automated metrics. In addition, we conduct targeted experiments to demonstrate that our KG-based evaluation method is more sensitive to subtle semantic differences in generated outputs. Finally, we discuss the key challenges in evaluating RAG systems and highlight potential directions for future research.

[20] Transcribe, Translate, or Transliterate: An Investigation of Intermediate Representations in Spoken Language Models

Tolúlọpé Ògúnrèmí,Christopher D. Manning,Dan Jurafsky,Karen Livescu

Main category: cs.CL

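TL;DR: The paper studies how modality adapters (MAs) in spoken language models (SLMs) turn speech-encoder output into a representation the decoder language model (LM) can use. It finds two strategies: a meaning-based English interlingua, and a phonetic representation expressed with English words.

Details Motivation: Understanding how MAs transform speech representations matters for improving multimodal models, yet their inner workings have received little study.

Contribution: Identifies two MA representation strategies: 1) a meaning-based English interlingua, which supports languages unseen in instruction tuning; 2) a phonetic representation expressed with English words. Hypothesises that the strategy depends on the speech encoder's training objective (speech recognition only, or also translation).

Method: Analyzes MA outputs in three SLMs (SALMONN, Qwen2-Audio and Phi-4-Multimodal-Instruct) by finding the decoder LM token nearest to each MA representation, and infers the transformation strategy from those tokens.

Result: Models with a Whisper encoder tend to map speech to a meaning-based English interlingua. Models without one, such as Phi-4-Multimodal-Instruct, instead represent the phonetics of the input using English words.

Insight: The MA representation strategy is closely tied to the training objective of the speech encoder, which offers useful guidance for designing more efficient multimodal models.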

Abstract: Spoken language models (SLMs) that integrate speech with large language models (LMs) rely on modality adapters (MAs) to map the output of speech encoders to a representation that is understandable to the decoder LM. Yet we know very little about how these crucial MAs transform representations. Here we examine the MA output representation in three SLMs (SALMONN, Qwen2-Audio and Phi-4-Multimodal-Instruct). By finding the nearest decoder LM token to an MA representation, we uncover two strategies for MA representations. For models using a Whisper encoder, MAs appear to represent the meaning of the input using an English-based interlingua, allowing them to handle languages unseen in instruction tuning. For models that don’t, like Phi-4-Multimodal-Instruct, MAs instead represent the phonetics of the input, but expressed with English words. We hypothesise that which arises depends on whether the speech encoder is trained only for speech recognition or also for translation.
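
The nearest-token probe described in the abstract can be illustrated with a small nearest-neighbour lookup. This is only a sketch under stated assumptions: the abstract does not say which distance is used, so cosine similarity is an assumption here, and the three-dimensional embeddings, the vocabulary, and the helper names `cosine` and `nearest_token` are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def nearest_token(vector, embeddings):
    """Return the vocabulary token whose embedding is most similar to `vector`."""
    return max(embeddings, key=lambda tok: cosine(vector, embeddings[tok]))

# Hypothetical token embeddings of the decoder LM (illustrative values only).
toy_embeddings = {
    "dog": [0.9, 0.1, 0.0],
    "cat": [0.1, 0.9, 0.0],
    "run": [0.0, 0.1, 0.9],
}

# Hypothetical modality-adapter outputs, one vector per position.
adapter_outputs = [[0.8, 0.2, 0.1], [0.0, 0.2, 0.7]]

# Reading off the nearest token at each position gives a rough "transcript"
# of what the adapter output looks like from the decoder's point of view.
decoded = [nearest_token(v, toy_embeddings) for v in adapter_outputs]
print(decoded)
```

Whether the decoded tokens read as a translation or as a phonetic spelling of the input is the kind of evidence the abstract uses to separate the two strategies.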

[21] SoT: Structured-of-Thought Prompting Guides Multilingual Reasoning in Large Language Models

Rui Qi,Zhibo Man,Yufeng Chen,Fengran Mo,Jinan Xu,Kaiyu Huang

Main category: cs.CL

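TL;DR: The paper proposes SoT (Structured-of-Thought), a training-free method that improves LLM performance on multilingual reasoning through a multi-step transformation: Language Thinking Transformation and Structured Knowledge Transformation.

Details Motivation: Current LLMs reason well on complex tasks in high-resource languages but perform poorly in other languages. Resource constraints and differences in linguistic expression limit multilingual reasoning.

Contribution: Proposes SoT, which converts language-specific semantic information into language-agnostic structured representations so that the model understands queries across languages while keeping a consistent reasoning pathway.

Method: A two-step transformation (Language Thinking Transformation and Structured Knowledge Transformation) converts multilingual queries into structured representations, which lets the model reason in a more concentrated way and handle cross-lingual differences in expression.

Result: Experiments show SoT outperforms baselines on several multilingual reasoning benchmarks and can be combined with other training-free strategies for further gains.

Insight: Structured representations are an effective route to multilingual reasoning, markedly improving cross-lingual reasoning without additional training.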

Abstract: Recent developments have enabled Large Language Models (LLMs) to engage in complex reasoning tasks through deep thinking. However, this reasoning capacity has not transferred successfully to non-high-resource languages because of resource constraints, so models still struggle with multilingual reasoning tasks. To this end, we propose Structured-of-Thought (SoT), a training-free method that improves performance on multilingual reasoning through a multi-step transformation: Language Thinking Transformation and Structured Knowledge Transformation. The SoT method converts language-specific semantic information into language-agnostic structured representations, enabling the models to understand queries in different languages in a more sophisticated way. In addition, SoT effectively guides LLMs toward more concentrated reasoning to maintain consistent underlying reasoning pathways when handling cross-lingual variations in expression. Experimental results demonstrate that SoT outperforms several strong baselines on multiple multilingual reasoning benchmarks when adapted to various LLM backbones. It can also be integrated with other training-free strategies for further improvements. Our code is available at https://github.com/Cherry-qwq/SoT.

[22] Self-Improvement in Multimodal Large Language Models: A Survey

Shijian Deng,Kai Wang,Tianyu Yang,Harsh Singh,Yapeng Tian

Main category: cs.CL

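TL;DR: This is the first comprehensive survey of self-improvement in multimodal large language models (MLLMs). It discusses existing methods from three perspectives (data collection, data organization, and model optimization), summarizes evaluations and downstream applications, and outlines future research directions.

Details Motivation: Self-improvement has succeeded for unimodal LLMs, and extending it to the multimodal domain holds great potential, but related work is still scarce. The survey aims to fill this gap and support further development of MLLMs.

Contribution: 1) The first systematic survey of self-improvement methods for MLLMs; 2) an organization of existing techniques along three dimensions: data collection, data organization, and model optimization; 3) a summary of evaluations and applications, together with open challenges and future directions.

Method: Uses a structured taxonomy that groups existing techniques into data collection (e.g. self-supervised data generation), data organization (e.g. multimodal data alignment), and model optimization (e.g. adaptive fine-tuning), and analyzes each in detail.

Result: The survey finds that self-improvement can markedly raise MLLM performance while reducing human effort, but challenges such as data heterogeneity and modality alignment remain.

Insight: Self-improvement in the multimodal domain needs more attention to cross-modal consistency and dynamic adaptability. Future directions include more efficient optimization frameworks and multimodal collaborative learning mechanisms.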

Abstract: Recent advancements in self-improvement for Large Language Models (LLMs) have efficiently enhanced model capabilities without significantly increasing costs, particularly in terms of human effort. While this area is still relatively young, its extension to the multimodal domain holds immense potential for leveraging diverse data sources and developing more general self-improving models. This survey is the first to provide a comprehensive overview of self-improvement in Multimodal LLMs (MLLMs). We provide a structured overview of the current literature and discuss methods from three perspectives: 1) data collection, 2) data organization, and 3) model optimization, to facilitate the further development of self-improvement in MLLMs. We also include commonly used evaluations and downstream applications. Finally, we conclude by outlining open challenges and future research directions.

[23] TravelBench : Exploring LLM Performance in Low-Resource Domains

Srinivas Billa,Xiaonan Jing

Main category: cs.CL

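TL;DR: The paper presents TravelBench, a benchmark for a low-resource domain (travel), analyzes how LLMs perform on its tasks, and finds that general benchmark results do not reveal the performance bottlenecks of low-resource tasks.

Details Motivation: Existing LLM benchmarks say little about low-resource tasks, which makes it hard to assess models in these domains, so domain-specific benchmarks are needed.

Contribution: Builds TravelBench, a benchmark of 14 travel-domain datasets covering 7 common NLP tasks, and analyzes LLM performance, scaling behaviour, and reasoning capabilities.

Method: Uses anonymised data from real-world scenarios to evaluate the accuracy, scaling behaviour, and reasoning capabilities of different LLMs on low-resource tasks.

Result: General benchmark results do not accurately reflect bottlenecks in low-resource tasks. Regardless of training FLOPs, out-of-the-box LLMs still hit performance bottlenecks in complex, domain-specific tasks, and reasoning helps smaller LLMs more.

Insight: In low-resource domains, domain-specific evaluation is essential, and reasoning capability is especially important for improving small models.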

Abstract: Results on existing LLM benchmarks capture little information over the model capabilities in low-resource tasks, making it difficult to develop effective solutions in these domains. To address these challenges, we curated 14 travel-domain datasets spanning 7 common NLP tasks using anonymised data from real-world scenarios, and analysed the performance across LLMs. We report on the accuracy, scaling behaviour, and reasoning capabilities of LLMs in a variety of tasks. Our results confirm that general benchmarking results are insufficient for understanding model performance in low-resource tasks. Despite the amount of training FLOPs, out-of-the-box LLMs hit performance bottlenecks in complex, domain-specific scenarios. Furthermore, reasoning provides a more significant boost for smaller LLMs by making the model a better judge on certain tasks.

[24] PGMEL: Policy Gradient-based Generative Adversarial Network for Multimodal Entity Linking

KM Pooja,Cheng Long,Aixin Sun

Main category: cs.CL

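TL;DR: PGMEL is a policy gradient-based generative adversarial network for multimodal entity linking that improves representation learning by generating high-quality negative samples.

Details Motivation: Existing multimodal entity linking techniques do not exploit the potential of selecting high-quality negative samples, which limits representation learning.

Contribution: PGMEL is presented as the first use of a generative adversarial framework in multimodal entity linking, with the generator optimized by policy gradient to produce challenging negative samples.

Method: Proposes a policy gradient-based generative adversarial network (PGMEL) in which the generator produces negative samples and the discriminator performs the metric learning task.

Result: On the Wiki-MEL, Richpedia-MEL and WikiDiverse datasets, PGMEL outperforms existing methods by generating challenging negative samples.

Insight: Generating high-quality negative samples is crucial to representation learning for multimodal entity linking.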

Abstract: The task of entity linking, which involves associating mentions with their respective entities in a knowledge graph, has received significant attention due to its numerous potential applications. Recently, various multimodal entity linking (MEL) techniques have been proposed, targeted to learn comprehensive embeddings by leveraging both text and vision modalities. The selection of high-quality negative samples can potentially play a crucial role in metric/representation learning. However, to the best of our knowledge, this possibility remains unexplored in existing literature within the framework of MEL. To fill this gap, we address the multimodal entity linking problem in a generative adversarial setting where the generator is responsible for generating high-quality negative samples, and the discriminator is assigned the responsibility for the metric learning tasks. Since the generator is involved in generating samples, which is a discrete process, we optimize it using policy gradient techniques and propose a policy gradient-based generative adversarial network for multimodal entity linking (PGMEL). Experimental results based on Wiki-MEL, Richpedia-MEL and WikiDiverse datasets demonstrate that PGMEL learns meaningful representation by selecting challenging negative samples and outperforms state-of-the-art methods.

[25] IndiCASA: A Dataset and Bias Evaluation Framework in LLMs Using Contrastive Embedding Similarity in the Indian Context

Santhosh G S,Akshay Govind S,Gokul S Krishnan,Balaraman Ravindran,Sriraam Natarajan

Main category: cs.CL

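TL;DR: The paper proposes an evaluation framework built on a contrastively trained encoder for measuring fine-grained bias in LLMs in the Indian cultural context, and introduces IndiCASA, a new dataset of 2,575 human-validated sentences. All evaluated models show some bias, with disability-related bias standing out.

Details Motivation: LLMs are widely deployed in high-stakes applications, and in culturally diverse contexts such as India existing bias evaluation methods fail to capture subtle stereotypes, so a more precise evaluation framework is needed.

Contribution: 1) An evaluation framework based on a contrastively trained encoder for fine-grained bias in LLMs; 2) the IndiCASA dataset, covering bias along five demographic axes; 3) an empirical analysis of bias in multiple LLMs.

Method: Trains an encoder with contrastive learning, captures bias through embedding similarity, and quantifies it using the IndiCASA dataset.

Result: All LLMs exhibit bias. Disability-related bias is the most pronounced, while religion bias is comparatively low, possibly because of global debiasing efforts.

Insight: Reveals how bias is distributed in LLMs in the Indian cultural context and underlines the need to develop fairer models.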

Abstract: Large Language Models (LLMs) have gained significant traction across critical domains owing to their impressive contextual understanding and generative capabilities. However, their increasing deployment in high-stakes applications necessitates rigorous evaluation of embedded biases, particularly in culturally diverse contexts like India, where existing embedding-based bias assessment methods often fall short in capturing nuanced stereotypes. We propose an evaluation framework based on an encoder trained using contrastive learning that captures fine-grained bias through embedding similarity. We also introduce a novel dataset, IndiCASA (IndiBias-based Contextually Aligned Stereotypes and Anti-stereotypes), comprising 2,575 human-validated sentences spanning five demographic axes: caste, gender, religion, disability, and socioeconomic status. Our evaluation of multiple open-weight LLMs reveals that all models exhibit some degree of stereotypical bias, with disability-related biases being notably persistent and religion bias generally lower, likely due to global debiasing efforts, demonstrating the need for fairer model development.

[26] The Path of Self-Evolving Large Language Models: Achieving Data-Efficient Learning via Intrinsic Feedback

Hangfan Zhang,Siyuan Xu,Zhimeng Guo,Huaisheng Zhu,Shicheng Liu,Xinrun Wang,Qiaosheng Zhang,Yang Chen,Peng Ye,Lei Bai,Shuyue Hu

Main category: cs.CL

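TL;DR: The paper proposes a reinforcement learning method based on intrinsic feedback, in which a large language model (LLM) alternates between proposing tasks and solving them to learn data-efficiently. Self-awareness mechanisms (difficulty prediction and limit breaking) markedly improve performance with very little extra data.

Details Motivation: Conventional reinforcement learning for LLMs needs large amounts of annotated data, which is expensive. The paper aims to reduce dependence on external data through self-feedback and so train efficiently.

Contribution: 1. Two self-awareness mechanisms: self-aware difficulty prediction and self-aware limit breaking; 2. Experimental results showing a large performance gain from a small amount of extra data.

Method: 1. The LLM alternates between proposing a task and attempting to solve it; 2. Self-awareness mechanisms let the model assess task difficulty relative to its own ability and recognize when a task lies beyond its capability boundary; 3. An external-data request mechanism lets it break through that boundary.

Result: A 53.8% relative improvement across nine benchmarks with less than 1.2% extra data.

Insight: Self-feedback and difficulty-aware task selection can markedly improve training efficiency while reducing reliance on external data, which points to a new direction for self-evolving LLMs.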

Abstract: Reinforcement learning (RL) has demonstrated potential in enhancing the reasoning capabilities of large language models (LLMs), but such training typically demands substantial efforts in creating and annotating data. In this work, we explore improving LLMs through RL with minimal data. Our approach alternates between the LLM proposing a task and then attempting to solve it. To minimize data dependency, we introduce two novel mechanisms grounded in self-awareness: (1) self-aware difficulty prediction, where the model learns to assess task difficulty relative to its own abilities and prioritize challenging yet solvable tasks, and (2) self-aware limit breaking, where the model recognizes when a task is beyond its capability boundary and proactively requests external data to break through that limit. Extensive experiments on nine benchmarks showing a 53.8% relative improvement with less than 1.2% extra data demonstrate the efficacy of self-aware RL and underscore the promise of self-evolving agent training.

[27] XTRA: Cross-Lingual Topic Modeling with Topic and Representation Alignments

Tien Phat Nguyen,Vu Minh Ngo,Tung Nguyen,Linh Van Ngo,Duc Anh Nguyen,Sang Dinh,Trung Le

Main category: cs.CL

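TL;DR: XTRA is a new cross-lingual topic modeling framework that combines Bag-of-Words modeling with multilingual embeddings. Its dual mechanism of representation alignment and topic alignment markedly improves topic coherence, diversity, and cross-lingual alignment quality.

Details Motivation: Existing cross-lingual topic modeling methods fall short on topic coherence and cross-lingual alignment, and XTRA is designed to address this.

Contribution: 1) The XTRA framework, which unifies Bag-of-Words modeling with multilingual embeddings; 2) a dual mechanism of representation alignment and topic alignment; 3) experiments confirming XTRA's advantage in topic coherence, diversity, and alignment quality.

Method: 1) Representation alignment: document-topic distributions are aligned via contrastive learning in a shared semantic space; 2) Topic alignment: topic-word distributions are projected into the same space to enforce cross-lingual consistency.

Result: Experiments on multilingual corpora show that XTRA significantly outperforms existing baselines.

Insight: XTRA's dual alignment mechanism secures both topic interpretability and cross-lingual consistency, offering a new approach to cross-lingual topic modeling.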

Abstract: Cross-lingual topic modeling aims to uncover shared semantic themes across languages. Several methods have been proposed to address this problem, leveraging both traditional and neural approaches. While previous methods have achieved some improvements in topic diversity, they often struggle to ensure high topic coherence and consistent alignment across languages. We propose XTRA (Cross-Lingual Topic Modeling with Topic and Representation Alignments), a novel framework that unifies Bag-of-Words modeling with multilingual embeddings. XTRA introduces two core components: (1) representation alignment, aligning document-topic distributions via contrastive learning in a shared semantic space; and (2) topic alignment, projecting topic-word distributions into the same space to enforce cross-lingual consistency. This dual mechanism enables XTRA to learn topics that are interpretable (coherent and diverse) and well-aligned across languages. Experiments on multilingual corpora confirm that XTRA significantly outperforms strong baselines in topic coherence, diversity, and alignment quality. Code and reproducible scripts are available at https://github.com/tienphat140205/XTRA.

[28] StepChain GraphRAG: Reasoning Over Knowledge Graphs for Multi-Hop Question Answering

Tengjun Ni,Xin Yuan,Shenghong Li,Kai Wu,Ren Ping Liu,Wei Ni,Wenjie Zhang

Main category: cs.CL

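TL;DR: StepChain GraphRAG combines question decomposition with a Breadth-First Search (BFS) Reasoning Flow, improving both the accuracy and the explainability of multi-hop question answering (QA) and reaching state-of-the-art results on several datasets.

Details Motivation: Existing retrieval-augmented generation (RAG) methods struggle to integrate iterative reasoning steps with external knowledge retrieval in multi-hop QA, which hurts accuracy and explainability.

Contribution: Proposes the StepChain GraphRAG framework, which tackles complex reasoning in multi-hop QA through on-the-fly knowledge graph construction and a BFS reasoning flow.

Method: First builds a global index over the corpus. At inference time, retrieved passages are parsed on the fly into a knowledge graph, the complex question is split into sub-questions, and a BFS traversal of the graph assembles explicit evidence chains.

Result: State-of-the-art results on MuSiQue, 2WikiMultiHopQA, and HotpotQA, with average gains of 2.57% in EM and 2.13% in F1 over the previous SOTA method.

Insight: The work highlights the potential of combining dynamic knowledge graph construction with multi-hop reasoning, and notes that computational overhead and LLM hallucination still need to be addressed.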

Abstract: Recent progress in retrieval-augmented generation (RAG) has led to more accurate and interpretable multi-hop question answering (QA). Yet, challenges persist in integrating iterative reasoning steps with external knowledge retrieval. To address this, we introduce StepChain GraphRAG, a framework that unites question decomposition with a Breadth-First Search (BFS) Reasoning Flow for enhanced multi-hop QA. Our approach first builds a global index over the corpus; at inference time, only retrieved passages are parsed on-the-fly into a knowledge graph, and the complex query is split into sub-questions. For each sub-question, a BFS-based traversal dynamically expands along relevant edges, assembling explicit evidence chains without overwhelming the language model with superfluous context. Experiments on MuSiQue, 2WikiMultiHopQA, and HotpotQA show that StepChain GraphRAG achieves state-of-the-art Exact Match and F1 scores. StepChain GraphRAG lifts average EM by 2.57% and F1 by 2.13% over the SOTA method, achieving the largest gain on HotpotQA (+4.70% EM, +3.44% F1). StepChain GraphRAG also fosters enhanced explainability by preserving the chain-of-thought across intermediate retrieval steps. We conclude by discussing how future work can mitigate the computational overhead and address potential hallucinations from large language models to refine efficiency and reliability in multi-hop QA.
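
The BFS-based traversal mentioned in the abstract can be pictured with a plain breadth-first search that records the triples it walks through. This is a minimal sketch, not the authors' implementation: the toy graph, the `max_hops` limit, and the function name `bfs_evidence_chain` are assumptions made for illustration, and the real system additionally filters edges by relevance to each sub-question.

```python
from collections import deque

def bfs_evidence_chain(graph, start, goal, max_hops=3):
    """Breadth-first search over a knowledge graph.

    Returns the shortest chain of (head, relation, tail) triples that links
    `start` to `goal`, or None if no chain exists within `max_hops` edges.
    """
    queue = deque([(start, [])])
    visited = {start}
    while queue:
        node, chain = queue.popleft()
        if node == goal:
            return chain
        if len(chain) >= max_hops:
            continue  # do not expand beyond the hop budget
        for relation, neighbour in graph.get(node, []):
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append((neighbour, chain + [(node, relation, neighbour)]))
    return None

# Toy graph, as if parsed from retrieved passages (hypothetical example data).
toy_graph = {
    "Inception": [("directed_by", "Christopher Nolan"), ("released_in", "2010")],
    "Christopher Nolan": [("born_in", "London")],
    "London": [("capital_of", "United Kingdom")],
}

# Sub-question: "Where was the director of Inception born?"
chain = bfs_evidence_chain(toy_graph, "Inception", "London")
for head, relation, tail in chain:
    print(f"{head} --{relation}--> {tail}")
```

Because the chain is returned as explicit triples, it can be passed to the language model as compact evidence instead of the full retrieved passages.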

[29] Evaluating Large Language Models for IUCN Red List Species Information

Shinya Uryu

Main category: cs.CL

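TL;DR: The study evaluates five large language models on IUCN Red List species information. The models do very well on taxonomic classification (94.9%) but poorly on conservation reasoning (27.2%), exposing a knowledge-reasoning gap, and the author recommends a hybrid approach that keeps human experts involved.

Details Motivation: LLMs are being widely adopted in conservation to address the biodiversity crisis, but their reliability for species assessment is unclear. The study validates them on the core assessment components of the IUCN Red List.

Contribution: Shows that LLMs excel at taxonomic classification but fall short on conservation reasoning, reveals the resulting knowledge-reasoning gap, and recommends a hybrid approach that combines LLMs with human experts.

Method: Systematically evaluates five leading LLMs on 21,955 species across four core IUCN assessment components: taxonomy, conservation status, distribution, and threats.

Result: Models excel at taxonomic classification (94.9%) but do poorly on reasoning tasks such as conservation status assessment (27.2%). They also show a systematic bias in favour of charismatic vertebrates.

Insight: The knowledge-reasoning gap indicates that LLMs suit information retrieval but need human experts for judgment-based tasks, which gives guidance for deploying the models responsibly.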

Abstract: Large Language Models (LLMs) are rapidly being adopted in conservation to address the biodiversity crisis, yet their reliability for species evaluation is uncertain. This study systematically validates five leading models on 21,955 species across four core IUCN Red List assessment components: taxonomy, conservation status, distribution, and threats. A critical paradox was revealed: models excelled at taxonomic classification (94.9%) but consistently failed at conservation reasoning (27.2% for status assessment). This knowledge-reasoning gap, evident across all models, suggests inherent architectural constraints, not just data limitations. Furthermore, models exhibited systematic biases favoring charismatic vertebrates, potentially amplifying existing conservation inequities. These findings delineate clear boundaries for responsible LLM deployment: they are powerful tools for information retrieval but require human oversight for judgment-based decisions. A hybrid approach is recommended, where LLMs augment expert capacity while human experts retain sole authority over risk assessment and policy.

[30] Constraint Satisfaction Approaches to Wordle: Novel Heuristics and Cross-Lexicon Validation

Jahidul Arafat,Fariha Tasmin,Sanjaya Poudel,Kamrujjaman,Eftakhar Ahmed Arnob,Ahsan Habib Tareq

Main category: cs.CL

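TL;DR: The paper presents the first comprehensive constraint satisfaction problem (CSP) formulation of Wordle and introduces two new strategies, CSP-Aware Entropy and a Probabilistic CSP framework, which markedly improve solving performance and robustness.

Details Motivation: Existing Wordle solvers typically rely on entropy maximization or frequency heuristics and lack a formal treatment of constraints. The study fills this gap with a CSP approach.

Contribution: 1. The first comprehensive CSP formulation of Wordle; 2. CSP-Aware Entropy and a Probabilistic CSP framework; 3. Cross-lexicon validation showing that the CSP approach transfers across languages.

Method: 1. CSP-Aware Entropy: information gain is computed after constraint propagation; 2. Probabilistic CSP: Bayesian word-frequency priors are combined with logical constraints; 3. Validation on lexicons in more than one language.

Result: CSP-Aware Entropy averages 3.54 guesses with a 99.9% success rate; Probabilistic CSP reaches 100% success at all noise levels; validation on Spanish words yields 88% success.

Insight: Formal CSP-based methods outperform classical information-theoretic and learning-based approaches in structured puzzle domains, and the core CSP principles are language-independent.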

Abstract: Wordle presents an algorithmically rich testbed for constraint satisfaction problem (CSP) solving. While existing solvers rely on information-theoretic entropy maximization or frequency-based heuristics without formal constraint treatment, we present the first comprehensive CSP formulation of Wordle with novel constraint-aware solving strategies. We introduce CSP-Aware Entropy, computing information gain after constraint propagation rather than on raw candidate sets, and a Probabilistic CSP framework integrating Bayesian word-frequency priors with logical constraints. Through evaluation on 2,315 English words, CSP-Aware Entropy achieves 3.54 average guesses with 99.9% success rate, a statistically significant 1.7% improvement over Forward Checking (t=-4.82, p<0.001, Cohen’s d=0.07) with 46% faster runtime (12.9ms versus 23.7ms per guess). Under 10% noise, CSP-aware approaches maintain 5.3 percentage point advantages (29.0% versus 23.7%, p=0.041), while Probabilistic CSP achieves 100% success across all noise levels (0-20%) through constraint recovery mechanisms. Cross-lexicon validation on 500 Spanish words demonstrates 88% success with zero language-specific tuning, validating that core CSP principles transfer across languages despite an 11.2 percentage point gap from linguistic differences (p<0.001, Fisher’s exact test). Our open-source implementation with 34 unit tests achieving 91% code coverage provides reproducible infrastructure for CSP research. The combination of formal CSP treatment, constraint-aware heuristics, probabilistic-logical integration, robustness analysis, and cross-lexicon validation establishes new performance benchmarks demonstrating that principled constraint satisfaction techniques outperform classical information-theoretic and learning-based approaches for structured puzzle-solving domains.
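
The idea of computing information gain on the constraint-filtered candidate set, rather than on the raw word list, can be sketched as follows. This is an illustrative simplification and not the paper's CSP-Aware Entropy heuristic: constraint propagation is reduced to filtering candidates by feedback consistency, candidates are assumed equally likely, and the five-word lexicon and function names are hypothetical.

```python
import math
from collections import Counter

def feedback(guess, answer):
    """Wordle feedback string: 'g' green, 'y' yellow, 'b' grey (duplicate-safe)."""
    result = ["b"] * len(guess)
    remaining = Counter()
    # First pass: mark greens and count the unmatched answer letters.
    for i, (g, a) in enumerate(zip(guess, answer)):
        if g == a:
            result[i] = "g"
        else:
            remaining[a] += 1
    # Second pass: a non-green letter is yellow only while unmatched copies remain.
    for i, g in enumerate(guess):
        if result[i] == "b" and remaining[g] > 0:
            result[i] = "y"
            remaining[g] -= 1
    return "".join(result)

def propagate(candidates, guess, observed):
    """Keep only the candidates consistent with the observed feedback."""
    return [w for w in candidates if feedback(guess, w) == observed]

def entropy(guess, candidates):
    """Expected information (bits) of `guess` over equally likely candidates."""
    counts = Counter(feedback(guess, w) for w in candidates)
    total = len(candidates)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Hypothetical mini-lexicon; the hidden answer is "slate".
words = ["crane", "crate", "trace", "slate", "stale"]
survivors = propagate(words, "crane", feedback("crane", "slate"))

# Score the next guess on the filtered set, not on the raw word list.
best = max(survivors, key=lambda g: entropy(g, survivors))
print(survivors, best)
```

Scoring only against the surviving candidates keeps the entropy estimate tied to words that can still be the answer, which is the distinction the abstract draws from entropy computed on raw candidate sets.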

[31] Self-Reflective Generation at Test Time

Jian Mu,Qixin Zhang,Zhiyong Wang,Menglin Yang,Shuang Qiu,Chengwei Qin,Zhongxiang Dai,Yao Shu

Main category: cs.CL

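TL;DR: SRGen is a lightweight test-time self-reflection framework. It reflects before generating at uncertain points and adjusts the token probability distribution on the fly, markedly improving the reasoning of language models.

Details Motivation: Large language models (LLMs) are prone to early errors that cascade in complex reasoning. Existing self-reflection methods either revise full drafts or learn self-correction through expensive training, so they are inefficient and reactive.

Contribution: Proposes the SRGen framework, which uses dynamic entropy thresholding to identify uncertain tokens and corrects the token probability distribution during generation, giving efficient adaptive correction at test time without fine-tuning the model.

Method: SRGen identifies high-uncertainty tokens with a dynamic entropy threshold, trains a specific corrective vector for each such token, and uses the already generated context to correct the token probability distribution, achieving self-reflection and correction during generation.

Result: On mathematical reasoning benchmarks SRGen markedly improves performance, for example +12.0% on Pass@1 and +13.3% on Cons@5 on AIME2024.

Insight: SRGen is a plug-and-play method that composes with other techniques such as RLHF and SLOT, offering an efficient and reliable approach to LLM reasoning.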

Abstract: Large language models (LLMs) increasingly solve complex reasoning tasks via long chain-of-thought, but their forward-only autoregressive generation process is fragile; early token errors can cascade, which creates a clear need for self-reflection mechanisms. However, existing self-reflection either performs revisions over full drafts or learns self-correction via expensive training, both fundamentally reactive and inefficient. To address this, we propose Self-Reflective Generation at Test Time (SRGen), a lightweight test-time framework that reflects before generating at uncertain points. During token generation, SRGen utilizes dynamic entropy thresholding to identify high-uncertainty tokens. For each identified token, it trains a specific corrective vector, which fully exploits the already generated context for a self-reflective generation to correct the token probability distribution. By retrospectively analyzing the partial output, this self-reflection enables more trustworthy decisions, thereby significantly reducing the probability of errors at highly uncertain points. Evaluated on challenging mathematical reasoning benchmarks and a diverse set of LLMs, SRGen can consistently strengthen model reasoning: improvements in single-pass quality also translate into stronger self-consistency voting. Especially, on AIME2024 with DeepSeek-R1-Distill-Qwen-7B, SRGen yields absolute improvements of +12.0% on Pass@1 and +13.3% on Cons@5. Moreover, our findings position SRGen as a plug-and-play method that integrates reflection into the generation process for reliable LLM reasoning, achieving consistent gains with bounded overhead and broad composability with other training-time (e.g., RLHF) and test-time (e.g., SLOT) techniques.
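
To make the notion of dynamic entropy thresholding concrete, here is a small sketch that flags a decoding step when its entropy jumps well above the recent running level. The abstract does not specify the threshold rule, so the mean-plus-`k`-standard-deviations form, the `window` size, and both function names are assumptions; the corrective-vector training that SRGen performs at flagged steps is not shown.

```python
import math
from statistics import mean, pstdev

def token_entropy(probs):
    """Shannon entropy (in nats) of a next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def flag_uncertain(entropies, window=4, k=2.0):
    """Flag step t when its entropy exceeds mean + k * std of the previous
    `window` steps. The rule is an assumed stand-in for a dynamic threshold."""
    flags = []
    for t, h in enumerate(entropies):
        history = entropies[max(0, t - window):t]
        if len(history) < 2:
            flags.append(False)  # not enough history to set a threshold yet
            continue
        threshold = mean(history) + k * pstdev(history)
        flags.append(h > threshold)
    return flags

# Hypothetical next-token distributions over a 4-word vocabulary.
distributions = [
    [0.97, 0.01, 0.01, 0.01],
    [0.94, 0.02, 0.02, 0.02],
    [0.97, 0.01, 0.01, 0.01],
    [0.25, 0.25, 0.25, 0.25],  # uniform: the model is maximally unsure here
    [0.94, 0.02, 0.02, 0.02],
]
entropies = [token_entropy(p) for p in distributions]
flags = flag_uncertain(entropies)
print(flags)
```

Because the threshold follows the recent history rather than a fixed constant, the same rule adapts to prompts whose baseline uncertainty differs.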

[32] Semantic Differentiation in Speech Emotion Recognition: Insights from Descriptive and Expressive Speech Roles

Rongchen Guo,Vincent Francoeur,Isar Nejadgholi,Sylvain Gagnon,Miodrag Bolic

Main category: cs.CL

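TL;DR: The paper studies the distinction between descriptive and expressive semantics in Speech Emotion Recognition (SER), and shows experimentally that descriptive semantics align with intended emotions while expressive semantics correlate with evoked emotions.

Details Motivation: SER accuracy is limited by the complexity of emotional nuance in speech. The study aims to improve recognition by separating descriptive from expressive semantics.

Contribution: Proposes a distinction between descriptive and expressive semantics and verifies experimentally that the two play different roles in emotion recognition.

Method: Records participants describing their experience after watching emotionally charged movie segments, and analyzes the audio together with intended emotion tags, self-rated emotional responses, and valence/arousal scores.

Result: Descriptive semantics align with intended emotions and expressive semantics correlate with evoked emotions, which offers a new perspective for SER applications.

Insight: Separating descriptive from expressive semantics helps make AI systems more context-aware.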

Abstract: Speech Emotion Recognition (SER) is essential for improving human-computer interaction, yet its accuracy remains constrained by the complexity of emotional nuances in speech. In this study, we distinguish between descriptive semantics, which represents the contextual content of speech, and expressive semantics, which reflects the speaker’s emotional state. After participants watched emotionally charged movie segments, we recorded audio clips of them describing their experiences, along with the intended emotion tags for each clip, participants’ self-rated emotional responses, and their valence/arousal scores. Through experiments, we show that descriptive semantics align with intended emotions, while expressive semantics correlate with evoked emotions. Our findings inform SER applications in human-AI interaction and pave the way for more context-aware AI systems.

[33] Revisiting Direct Speech-to-Text Translation with Speech LLMs: Better Scaling than CoT Prompting?

Oriol Pareras,Gerard I. Gállego,Federico Costa,Cristina España-Bonet,Javier Hernando

Main category: cs.CL

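TL;DR: The paper systematically compares Chain-of-Thought (CoT) prompting and Direct prompting for Speech-to-Text Translation (S2TT) and finds that Direct prompting improves more consistently than CoT as the amount of data grows.

Details Motivation: To understand how the performance of CoT and Direct prompting in S2TT changes as data increases, and so determine which approach suits future large-scale data settings.

Contribution: Pseudo-labels an ASR corpus by translating its transcriptions into several languages, systematically compares CoT and Direct prompting at different data scales, and shows that Direct prompting scales better.

Method: Pseudo-labels an ASR corpus (translated into six European languages) and trains LLM-based S2TT systems with both prompting strategies (CoT and Direct), comparing them at different data scales.

Result: Direct prompting improves more consistently as data increases, suggesting it may become the more effective approach as larger S2TT resources are created.

Insight: The study shows the potential of Direct prompting in large-scale data settings and gives direction for the design of future S2TT models.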

Abstract: Recent work on Speech-to-Text Translation (S2TT) has focused on LLM-based models, introducing the increasingly adopted Chain-of-Thought (CoT) prompting, where the model is guided to first transcribe the speech and then translate it. CoT typically outperforms direct prompting primarily because it can exploit abundant Automatic Speech Recognition (ASR) and Text-to-Text Translation (T2TT) datasets to explicitly model its steps. In this paper, we systematically compare CoT and Direct prompting under increasing amounts of S2TT data. To this end, we pseudo-label an ASR corpus by translating its transcriptions into six European languages, and train LLM-based S2TT systems with both prompting strategies at different data scales. Our results show that Direct improves more consistently as the amount of data increases, suggesting that it may become a more effective approach as larger S2TT resources are created.

[34] Semantic Similarity in Radiology Reports via LLMs and NER

Beth Pearson,Ahmed Adnan,Zahraa Abdallah

Main category: cs.CL

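TL;DR: The paper explores using LLMs and NER to compare the semantic similarity of radiology reports, and proposes Llama-EntScore, which combines Llama 3.1 with NER and outperforms either used alone.

Details Motivation: Comparing radiology reports is important for training radiologists and for diagnostic accuracy, but applying AI here is difficult, particularly because LLMs need specialised domain knowledge.

Contribution: Proposes Llama-EntScore, which combines LLMs with NER and provides an interpretable semantic similarity score.

Method: Compares several LLMs and NER-based approaches, then proposes Llama-EntScore, which combines Llama 3.1 and NER through tunable weights.

Result: Llama-EntScore achieves 67% exact-match accuracy and 93% accuracy within ±1 against radiologist-provided ground truth scores, outperforming LLMs and NER used independently.

Insight: Combining LLMs with traditional NER assesses semantic differences between radiology reports more effectively and provides interpretable feedback.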

Abstract: Radiology report evaluation is a crucial part of radiologists’ training and plays a key role in ensuring diagnostic accuracy. As part of the standard reporting workflow, a junior radiologist typically prepares a preliminary report, which is then reviewed and edited by a senior radiologist to produce the final report. Identifying semantic differences between preliminary and final reports is essential for junior doctors, both as a training tool and to help uncover gaps in clinical knowledge. While AI in radiology is a rapidly growing field, the application of large language models (LLMs) remains challenging due to the need for specialised domain knowledge. In this paper, we explore the ability of LLMs to provide explainable and accurate comparisons of reports in the radiology domain. We begin by comparing the performance of several LLMs in comparing radiology reports. We then assess a more traditional approach based on Named Entity Recognition (NER). However, both approaches exhibit limitations in delivering accurate feedback on semantic similarity. To address this, we propose Llama-EntScore, a semantic similarity scoring method using a combination of Llama 3.1 and NER with tunable weights to emphasise or de-emphasise specific types of differences. Our approach generates a quantitative similarity score for tracking progress and also gives an interpretation of the score that aims to offer junior radiologists valuable guidance in reviewing and refining their reporting. We find our method achieves 67% exact-match accuracy and 93% accuracy within ±1 when compared to radiologist-provided ground truth scores, outperforming both LLMs and NER used independently. Code is available at https://github.com/otmive/llama_reports
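
The role of tunable weights in a combined score can be illustrated with a toy calculation. This sketch is not the Llama-EntScore formula, which the abstract does not give: the weighted-overlap measure, the linear blend with parameter `alpha`, the entity types, and the example entities are all assumptions, and the LLM score is supplied as a plain number instead of being produced by Llama 3.1.

```python
def entity_similarity(prelim, final, weights):
    """Weighted overlap of (type, text) entities between two reports, in [0, 1].

    `weights` maps an entity type to its importance; unknown types count as 1.0.
    """
    union = prelim | final
    total = sum(weights.get(etype, 1.0) for etype, _ in union)
    if total == 0:
        return 1.0  # nothing to compare, treat the reports as identical
    matched = sum(weights.get(etype, 1.0) for etype, _ in prelim & final)
    return matched / total

def combined_score(llm_score, ent_score, alpha=0.5):
    """Blend an LLM-assigned similarity in [0, 1] with the entity score."""
    return alpha * llm_score + (1 - alpha) * ent_score

# Hypothetical weights: a missed finding matters more than a missed anatomy term.
weights = {"finding": 2.0, "anatomy": 1.0}

# Hypothetical NER output for a preliminary and a final report.
prelim = {("finding", "effusion"), ("anatomy", "left lung")}
final = {("finding", "effusion"), ("finding", "pneumothorax"), ("anatomy", "left lung")}

ent = entity_similarity(prelim, final, weights)
score = combined_score(llm_score=0.8, ent_score=ent)
print(round(ent, 2), round(score, 2))
```

Raising the weight of a type makes disagreements of that type pull the score down harder, which is one way to emphasise or de-emphasise specific kinds of differences.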

[35] Listening or Reading? Evaluating Speech Awareness in Chain-of-Thought Speech-to-Text Translation

Jacobo Romero-Díaz,Gerard I. Gállego,Oriol Pareras,Federico Costa,Javier Hernando,Cristina España-Bonet

Main category: cs.CL

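TL;DR: The paper studies Chain-of-Thought (CoT) prompting for speech-to-text translation (S2TT), finds that it relies on the transcript rather than the speech signal, and proposes simple training interventions to make better use of speech information.

Details Motivation: S2TT systems built as a cascade of automatic speech recognition (ASR) and text-to-text translation (T2TT) suffer from error propagation and cannot exploit acoustic cues. The study tests whether CoT overcomes these problems.

Contribution: Shows that CoT in S2TT relies mainly on transcripts rather than speech, and proposes training with direct S2TT data or noisy transcript injection, which improves robustness and the use of speech information.

Method: Analyzes CoT behavior through attribution methods, robustness evaluation with corrupted transcripts, and prosody-awareness tests, and measures the effect of the two training interventions.

Result: CoT in S2TT largely behaves like a cascade, relying on transcripts rather than speech; simple training interventions enhance robustness and increase speech attribution.

Insight: The paper challenges the assumed advantages of CoT and stresses the need for translation architectures that explicitly integrate acoustic information.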

Abstract: Speech-to-Text Translation (S2TT) systems built from Automatic Speech Recognition (ASR) and Text-to-Text Translation (T2TT) modules face two major limitations: error propagation and the inability to exploit prosodic or other acoustic cues. Chain-of-Thought (CoT) prompting has recently been introduced, with the expectation that jointly accessing speech and transcription will overcome these issues. Analyzing CoT through attribution methods, robustness evaluations with corrupted transcripts, and prosody-awareness, we find that it largely mirrors cascaded behavior, relying mainly on transcripts while barely leveraging speech. Simple training interventions, such as adding Direct S2TT data or noisy transcript injection, enhance robustness and increase speech attribution. These findings challenge the assumed advantages of CoT and highlight the need for architectures that explicitly integrate acoustic information into translation.

[36] SurveyBench: How Well Can LLM(-Agents) Write Academic Surveys?

Zhaojun Sun,Xuzhou Zhu,Xuanhe Zhou,Xin Tong,Shuo Wang,Jie Fu,Guoliang Li,Zhiyuan Liu,Fan Wu

Main category: cs.CL

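TL;DR: SurveyBench is a fine-grained, quiz-driven evaluation framework for assessing how well LLMs (and LLM agents) generate academic surveys automatically; it exposes the gap between existing methods and human standards.

Details Motivation: Writing an academic survey is labor-intensive and demanding. Surveys produced by existing automated methods (LLM4Survey) fall short in quality, and there is no rigorous, reader-aligned benchmark.

Contribution: 1) Proposes the SurveyBench evaluation framework; 2) builds a dataset from 11,343 arXiv papers and 4,947 high-quality surveys; 3) designs a multifaceted metric hierarchy and a dual-mode evaluation protocol.

Method: Evaluates surveys through content-based quality assessment and quiz-based answerability tests (content mode and quiz mode), covering outline quality, content quality, and non-textual richness.

Result: Existing LLM4Survey methods score on average 21% lower than human-written surveys in content-based evaluation.

Insight: SurveyBench reveals core weaknesses of LLM-generated surveys (such as logical coherence and clarity of insights) and points to directions for improvement.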

Abstract: Academic survey writing, which distills vast literature into a coherent and insightful narrative, remains a labor-intensive and intellectually demanding task. While recent approaches, such as general DeepResearch agents and survey-specialized methods, can generate surveys automatically (a.k.a. LLM4Survey), their outputs often fall short of human standards and there is no rigorous, reader-aligned benchmark for thoroughly revealing their deficiencies. To fill the gap, we propose a fine-grained, quiz-driven evaluation framework SurveyBench, featuring (1) typical survey topics sourced from 11,343 recent arXiv papers and the corresponding 4,947 high-quality surveys; (2) a multifaceted metric hierarchy that assesses the outline quality (e.g., coverage breadth, logical coherence), content quality (e.g., synthesis granularity, clarity of insights), and non-textual richness; and (3) a dual-mode evaluation protocol that includes content-based and quiz-based answerability tests, explicitly aligned with readers’ informational needs. Results show SurveyBench effectively challenges existing LLM4Survey approaches (e.g., on average 21% lower than human in content-based evaluation).

[37] Beyond the Final Layer: Intermediate Representations for Better Multilingual Calibration in Large Language Models

Ej Zhou,Caiqi Zhang,Tiancheng Hu,Chengzu Li,Nigel Collier,Ivan Vulić,Anna Korhonen

Main category: cs.CL

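TL;DR: The paper presents the first systematic study of multilingual calibration in large language models, finds that calibration is worse in non-English languages, and proposes a training-free method (LACE) that improves calibration using intermediate layers.

Details Motivation: Confidence calibration of large language models in multilingual settings is under-explored, and non-English languages are worse calibrated, which calls for a more equitable solution.

Contribution: 1. First large-scale study of multilingual calibration; 2. finds that intermediate layers are better suited to multilingual calibration than the final layer; 3. proposes the training-free LACE method to improve calibration.

Method: Analyzes the model's internal representations and finds that late-intermediate layers give a better-calibrated signal; LACE adaptively selects an optimal ensemble of layers for each language.

Result: LACE significantly improves multilingual calibration, especially for non-English languages.

Insight: English-centric training leaves the final layer poorly calibrated; intermediate layers provide a more equitable multilingual confidence signal.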

Abstract: Confidence calibration, the alignment of a model’s predicted confidence with its actual accuracy, is crucial for the reliable deployment of Large Language Models (LLMs). However, this critical property remains largely under-explored in multilingual contexts. In this work, we conduct the first large-scale, systematic study of multilingual calibration across six model families and over 100 languages, revealing that non-English languages suffer from systematically worse calibration. To diagnose this, we investigate the model’s internal representations and find that the final layer, biased by English-centric training, provides a poor signal for multilingual confidence. In contrast, our layer-wise analysis uncovers a key insight that late-intermediate layers consistently offer a more reliable and better-calibrated signal. Building on this, we introduce a suite of training-free methods, including Language-Aware Confidence Ensemble (LACE), which adaptively selects an optimal ensemble of layers for each specific language. Our study highlights the hidden costs of English-centric alignment and offers a new path toward building more globally equitable and trustworthy LLMs by looking beyond the final layer.
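To make "better-calibrated signal" concrete, the sketch below computes expected calibration error (ECE), a standard calibration measure, and picks the layer with the lowest ECE for each language. This is a deliberately simplified stand-in: LACE selects an ensemble of layers, whereas this example selects a single layer, and the function names, data layout, and use of ECE as the selection criterion are assumptions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between confidence and accuracy over equal-width bins."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    n = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / n * abs(avg_conf - accuracy)
    return ece


def best_layer_per_language(layer_confidences, correct):
    """layer_confidences: {language: {layer: [confidence, ...]}} (hypothetical layout).
    correct: {language: [bool, ...]} on held-out data."""
    return {
        lang: min(layers, key=lambda layer: expected_calibration_error(
            layers[layer], correct[lang]))
        for lang, layers in layer_confidences.items()
    }
```

For example, a layer that reports 0.75 confidence on a set where 3 of 4 answers are correct has an ECE of 0, while a final layer reporting 0.99 on the same set has an ECE of 0.24, so the first would be selected.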

[38] EditLens: Quantifying the Extent of AI Editing in Text

Katherine Thai,Bradley Emi,Elyas Masrour,Mohit Iyyer

Main category: cs.CL

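TL;DR: The paper proposes EditLens, a model that quantifies the extent of AI editing in a text, using lightweight similarity metrics to distinguish human-written, AI-generated, and mixed (AI-edited) text.

Details Motivation: Existing work focuses on detecting fully AI-generated text and overlooks AI-edited text, which matters particularly in areas such as education and policy.

Contribution: Proposes EditLens, which predicts the amount of AI editing in a text and achieves state-of-the-art performance on classification tasks.

Method: Trains a regression model to predict the amount of AI editing, using lightweight similarity metrics as intermediate supervision.

Result: Achieves F1=94.7% on binary and F1=90.4% on ternary classification, and demonstrates a practical application through a Grammarly case study.

Insight: AI-edited text can be detected and the degree of editing can be quantified, which has implications for authorship attribution, education, and policy.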

Abstract: A significant proportion of queries to large language models ask them to edit user-provided text, rather than generate new text from scratch. While previous work focuses on detecting fully AI-generated text, we demonstrate that AI-edited text is distinguishable from human-written and AI-generated text. First, we propose using lightweight similarity metrics to quantify the magnitude of AI editing present in a text given the original human-written text and validate these metrics with human annotators. Using these similarity metrics as intermediate supervision, we then train EditLens, a regression model that predicts the amount of AI editing present within a text. Our model achieves state-of-the-art performance on both binary (F1=94.7%) and ternary (F1=90.4%) classification tasks in distinguishing human, AI, and mixed writing. Not only do we show that AI-edited text can be detected, but also that the degree of change made by AI to human writing can be detected, which has implications for authorship attribution, education, and policy. Finally, as a case study, we use our model to analyze the effects of AI-edits applied by Grammarly, a popular writing assistance tool. To encourage further research, we commit to publicly releasing our models and dataset.
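The abstract does not name the similarity metrics it uses, so the following is only a sketch of what a lightweight edit-magnitude measure could look like. It uses Python's standard `difflib.SequenceMatcher` over word tokens; the function name and the choice of metric are assumptions for illustration.

```python
from difflib import SequenceMatcher


def edit_magnitude(original, edited):
    """Return a value in [0, 1]: 0.0 means unchanged, 1.0 means no words in common.

    Compares whitespace-separated tokens, so it measures word-level rewriting
    rather than character-level changes.
    """
    matcher = SequenceMatcher(None, original.split(), edited.split())
    return 1.0 - matcher.ratio()


print(edit_magnitude("the cat sat", "the cat sat"))  # 0.0
print(edit_magnitude("the cat sat", "the dog sat"))  # between 0 and 1
```

A score like this could serve as the regression target that a detector is trained to predict from the edited text alone, which is the role the abstract assigns to its similarity metrics.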

[39] FocusAgent: Simple Yet Effective Ways of Trimming the Large Context of Web Agents

Imene Kerboua,Sahar Omidi Shayegan,Megh Thakkar,Xing Han Lù,Léo Boisvert,Massimo Caccia,Jérémy Espinas,Alexandre Aussem,Véronique Eglin,Alexandre Lacoste

Main category: cs.CL

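TL;DR: FocusAgent uses a lightweight LLM retriever to extract the relevant lines from AxTree observations and prune web page content, reducing computational cost and security risk while matching baseline task performance.

Details Motivation: Web agents must process large amounts of page content, which saturates the context, raises computational cost, and exposes them to prompt injection attacks; existing pruning strategies perform poorly.

Contribution: Proposes FocusAgent, which extracts relevant AxTree lines with a lightweight LLM retriever, substantially shrinking observations and lowering security risk while preserving task performance.

Method: A lightweight LLM retriever extracts task-relevant lines from the AxTree and prunes irrelevant content, reducing computation and the attack surface.

Result: On the WorkArena and WebArena benchmarks, FocusAgent reduces observation size by more than 50% while matching baseline performance, and a variant significantly reduces the success rate of prompt-injection attacks.

Insight: Targeted content pruning improves efficiency and also strengthens security, making it a practical strategy for building web agents.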

Abstract: Web agents powered by large language models (LLMs) must process lengthy web page observations to complete user goals; these pages often exceed tens of thousands of tokens. This saturates context limits and increases computational cost; moreover, processing full pages exposes agents to security risks such as prompt injection. Existing pruning strategies either discard relevant content or retain irrelevant context, leading to suboptimal action prediction. We introduce FocusAgent, a simple yet effective approach that leverages a lightweight LLM retriever to extract the most relevant lines from accessibility tree (AxTree) observations, guided by task goals. By pruning noisy and irrelevant content, FocusAgent enables efficient reasoning while reducing vulnerability to injection attacks. Experiments on WorkArena and WebArena benchmarks show that FocusAgent matches the performance of strong baselines, while reducing observation size by over 50%. Furthermore, a variant of FocusAgent significantly reduces the success rate of prompt-injection attacks, including banner and pop-up attacks, while maintaining task success performance in attack-free settings. Our results highlight that targeted LLM-based retrieval is a practical and robust strategy for building web agents that are efficient, effective, and secure.
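The line-level pruning idea can be sketched as follows. In FocusAgent the relevance judgment comes from a lightweight LLM retriever; here a simple keyword-overlap scorer stands in for it so that the example runs on its own. The function name, the `keep_ratio` parameter, and the sample AxTree lines are hypothetical.

```python
import re


def prune_axtree(axtree, goal, keep_ratio=0.5):
    """Keep the AxTree lines most relevant to the goal, in their original order.

    Relevance here is keyword overlap with the goal, a placeholder for the
    LLM retriever described in the abstract.
    """
    lines = axtree.splitlines()
    goal_terms = set(re.findall(r"[a-z0-9]+", goal.lower()))
    scored = [
        (len(goal_terms & set(re.findall(r"[a-z0-9]+", line.lower()))), i)
        for i, line in enumerate(lines)
    ]
    k = max(1, int(len(lines) * keep_ratio))
    # Highest score first; ties broken by position so the result is deterministic.
    top = sorted(scored, key=lambda s: (-s[0], s[1]))[:k]
    return "\n".join(lines[i] for i in sorted(i for _, i in top))


page = "\n".join([
    "[1] banner 'Big sale today'",
    "[2] button 'Submit order'",
    "[3] link 'Privacy policy'",
    "[4] textbox 'Order form notes'",
])
print(prune_axtree(page, "Submit the order form"))
```

With these inputs the banner and the privacy link are dropped. Dropping content that is unrelated to the goal is also the route by which pruning can remove injected text before the agent sees it.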

[40] Cache-to-Cache: Direct Semantic Communication Between Large Language Models

Tianyu Fu,Zihan Min,Hanling Zhang,Jichao Yan,Guohao Dai,Wanli Ouyang,Yu Wang

Main category: cs.CL

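TL;DR: The paper proposes Cache-to-Cache (C2C), which enables direct semantic communication between large language models (LLMs), avoiding the overhead and information loss of explicit text generation and improving accuracy and efficiency.

Details Motivation: In existing designs LLMs communicate through text, which loses semantic information and adds generation latency. The paper asks whether LLMs can communicate directly, beyond text.

Contribution: Proposes C2C, which transfers semantics directly through the KV-Cache for efficient inter-LLM communication, with a learnable gating mechanism that selects which layers take part.

Method: C2C uses a neural network to project the source model's KV-Cache and fuse it with the target model's, enabling semantic transfer; a learnable gate selects the target layers.

Result: C2C achieves 8.5-10.5% higher average accuracy than individual models and about 3.0-5.0% higher than the text communication paradigm, with an average 2.0x latency speedup.

Insight: The KV-Cache is an effective medium for efficient semantic communication between LLMs, avoiding the semantic loss and latency of explicit text generation.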

Abstract: Multi-LLM systems harness the complementary strengths of diverse Large Language Models, achieving performance and efficiency gains unattainable by a single model. In existing designs, LLMs communicate through text, forcing internal representations to be transformed into output token sequences. This process both loses rich semantic information and incurs token-by-token generation latency. Motivated by these limitations, we ask: Can LLMs communicate beyond text? Oracle experiments show that enriching the KV-Cache semantics can improve response quality without increasing cache size, supporting KV-Cache as an effective medium for inter-model communication. Thus, we propose Cache-to-Cache (C2C), a new paradigm for direct semantic communication between LLMs. C2C uses a neural network to project and fuse the source model’s KV-cache with that of the target model to enable direct semantic transfer. A learnable gating mechanism selects the target layers that benefit from cache communication. Compared with text communication, C2C utilizes the deep, specialized semantics from both models, while avoiding explicit intermediate text generation. Experiments show that C2C achieves 8.5-10.5% higher average accuracy than individual models. It further outperforms the text communication paradigm by approximately 3.0-5.0%, while delivering an average 2.0x speedup in latency. Our code is available at https://github.com/thu-nics/C2C.
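The project-then-fuse step with a per-layer gate can be pictured with a toy example. This is not the C2C architecture: the real projector and fuser are learned neural networks over full KV-Cache tensors, while this sketch uses one linear projection on a single vector and additive fusion. All names and the additive rule are assumptions.

```python
def project(matrix, vec):
    """Map a source-model vector into the target model's dimension (matrix @ vec)."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def fuse_layer(target_kv, source_kv, proj, gate):
    """Add the projected source cache entry to the target entry, scaled by a gate.

    gate = 0.0 leaves the target layer untouched; gate = 1.0 fully admits the
    projected source information. In C2C the gate is learned per target layer.
    """
    projected = project(proj, source_kv)
    return [t + gate * p for t, p in zip(target_kv, projected)]


proj = [[1, 0, 0], [0, 1, 0]]  # hypothetical 3-dim source to 2-dim target map
print(fuse_layer([1.0, 1.0], [0.5, -1.0, 2.0], proj, gate=1.0))  # [1.5, 0.0]
print(fuse_layer([1.0, 1.0], [0.5, -1.0, 2.0], proj, gate=0.0))  # [1.0, 1.0]
```

The point of the gate is that not every target layer benefits from the source model's cache, so layers with a gate near zero behave exactly as they would without communication.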

[41] Self-Anchor: Large Language Model Reasoning via Step-by-step Attention Alignment

Hongxiang Zhang,Yuan Tian,Tianyi Zhang

Main category: cs.CL

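TL;DR: The paper proposes Self-Anchor, a method that uses structured reasoning steps to steer the attention of large language models (LLMs), addressing attention drift in complex reasoning tasks.

Details Motivation: In complex reasoning tasks, as the reasoning chain grows, critical intermediate steps and the original prompt get buried in the context, receive too little attention, and cause errors. Existing prompting-based methods do not solve this effectively.

Contribution: Proposes Self-Anchor, which decomposes reasoning trajectories into structured plans and automatically aligns model attention to the most relevant reasoning steps, improving reasoning performance.

Method: Self-Anchor decomposes the reasoning process into a structured plan and uses an attention-alignment mechanism to keep the model focused on key steps throughout generation.

Result: Self-Anchor outperforms state-of-the-art prompting methods on six benchmarks and significantly narrows the performance gap between "non-reasoning" models and specialized reasoning models.

Insight: Attention alignment has the potential to let most LLMs handle complex reasoning tasks without retraining.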

Abstract: For solving complex reasoning tasks with Large Language Models (LLMs), prompting-based methods offer a lightweight alternative to fine-tuning and reinforcement learning. However, as reasoning chains extend, critical intermediate steps and the original prompt will be buried in the context, receiving insufficient attention and leading to errors. In this paper, we propose Self-Anchor, a novel pipeline that leverages the inherent structure of reasoning to steer LLM attention. Self-Anchor decomposes reasoning trajectories into structured plans and automatically aligns the model’s attention to the most relevant inference steps, allowing the model to maintain focus throughout generation. Our experiments show that Self-Anchor outperforms SOTA prompting methods across six benchmarks. Notably, Self-Anchor significantly reduces the performance gap between “non-reasoning” models and specialized reasoning models, with the potential to enable most LLMs to tackle complex reasoning tasks without retraining.

[42] Reward Models are Metrics in a Trench Coat

Sebastian Gehrmann

Main category: cs.CL

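TL;DR: The paper examines the similarities between reward models and evaluation metrics and the separation between the two research areas, and takes the position that they should collaborate more closely.

Details Motivation: The rise of reinforcement learning in post-training of large language models has drawn wide attention to reward models, yet that work is largely separate from research on evaluation metrics, leading to redundant terminology and repeated pitfalls.

Contribution: Highlights the similarities between reward models and evaluation metrics and calls for closer collaboration to avoid redundancy and shared pitfalls.

Method: Through task-performance comparisons and an extensive literature survey, shows that evaluation metrics outperform reward models on specific tasks.

Result: The paper identifies areas where closer alignment can improve both reward models and metrics, including preference elicitation, avoidance of spurious correlations, and reward hacking.

Insight: Reward models are essentially a specific form of evaluation metric; bringing the two together avoids duplicated effort and addresses shared challenges.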

Abstract: The emergence of reinforcement learning in post-training of large language models has sparked significant interest in reward models. Reward models assess the quality of sampled model outputs to generate training signals. This task is also performed by evaluation metrics that monitor the performance of an AI model. We find that the two research areas are mostly separate, leading to redundant terminology and repeated pitfalls. Common challenges include susceptibility to spurious correlations, impact on downstream reward hacking, methods to improve data quality, and approaches to meta-evaluation. Our position paper argues that a closer collaboration between the fields can help overcome these issues. To that end, we show how metrics outperform reward models on specific tasks and provide an extensive survey of the two areas. Grounded in this survey, we point to multiple research topics in which closer alignment can improve reward models and metrics in areas such as preference elicitation methods, avoidance of spurious correlations and reward hacking, and calibration-aware meta-evaluation.

cs.CV

[43] Exploring OCR-augmented Generation for Bilingual VQA

JoonHo Lee,Sunho Park

Main category: cs.CV

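TL;DR: The paper studies OCR-augmented generation for bilingual VQA, introduces the KLOCR model and the KOCRBench dataset, and shows that OCR-extracted text significantly boosts model performance.

Details Motivation: Explores how to add OCR ability to vision language models (VLMs) to support multilingual VQA, in particular the Korean-English bilingual setting.

Contribution: 1. Trains and releases KLOCR, a strong bilingual OCR baseline model; 2. curates KOCRBench, a Korean VQA dataset; 3. analyzes different prompting methods.

Method: Extracts text from images with OCR and supplies it to VLMs for bilingual VQA. Experiments cover open-source and commercial models to verify the gain from OCR text.

Result: OCR-extracted text significantly boosts performance on bilingual VQA across Korean and English settings.

Insight: OCR-augmented text effectively improves VLM performance on multilingual VQA and suggests directions for future multilingual OCR and VQA research.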

Abstract: We investigate OCR-augmented generation with Vision Language Models (VLMs), exploring tasks in Korean and English toward multilingualism. To support research in this domain, we train and release KLOCR, a strong bilingual OCR baseline trained on 100M instances to augment VLMs with OCR ability. To complement existing VQA benchmarks, we curate KOCRBench for Korean VQA, and analyze different prompting methods. Extensive experiments show that OCR-extracted text significantly boosts performance across open source and commercial models. Our work offers new insights into OCR-augmented generation for bilingual VQA. Model, code, and data are available at https://github.com/JHLee0513/KLOCR.

[44] Oracle-RLAIF: An Improved Fine-Tuning Framework for Multi-modal Video Models through Reinforcement Learning from Ranking Feedback

Derek Shi,Ruben Glatt,Christine Klymko,Shubham Mohole,Hongjun Choi,Shashank Kushwaha,Sam Sakla,Felipe Leno da Silva

Main category: cs.CV

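TL;DR: Oracle-RLAIF is an improved fine-tuning framework for multi-modal video models that replaces the conventional trained reward model with ranking-based AI feedback, lowering cost and improving performance.

Details Motivation: As large video-language models (VLMs) grow in parameter count, gathering human feedback becomes much more expensive. Existing RLAIF frameworks depend on training a specialized reward model, which is costly and restrictive.

Contribution: Proposes Oracle-RLAIF, which replaces the trained reward model with a general Oracle ranker, and introduces $GRPO_{rank}$, a rank-based loss function that directly optimizes ordinal feedback.

Method: An Oracle ranker ranks model responses instead of scoring them, and the $GRPO_{rank}$ loss optimizes against that ranking feedback.

Result: Oracle-RLAIF outperforms existing fine-tuning methods across multiple video comprehension benchmarks.

Insight: Rank-based feedback aligns large multi-modal video models more flexibly and efficiently, reducing reliance on expensive reward models.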

Abstract: Recent advances in large video-language models (VLMs) rely on extensive fine-tuning techniques that strengthen alignment between textual and visual comprehension. Leading pipelines typically pair supervised fine-tuning (SFT) with reinforcement learning from preference data to enhance video comprehension. However, as VLMs scale in parameter size, so does the cost of gathering enough human feedback. To make fine-tuning more cost-effective, recent frameworks explore reinforcement learning with AI feedback (RLAIF), which replaces human preference with AI as a judge. Current RLAIF frameworks rely on a specialized reward model trained with video narratives to create calibrated scalar rewards, an expensive and restrictive pipeline. We propose Oracle-RLAIF, a novel framework that replaces the trained reward model with a more general Oracle ranker which acts as a drop-in model ranking candidate model responses rather than scoring them. Alongside Oracle-RLAIF, we introduce $GRPO_{rank}$, a novel rank-based loss function based on Group Relative Policy Optimization (GRPO) that directly optimizes ordinal feedback with rank-aware advantages. Empirically, we demonstrate that Oracle-RLAIF consistently outperforms leading VLMs using existing fine-tuning methods when evaluated across various video comprehension benchmarks. Oracle-RLAIF paves the path to creating flexible and data-efficient frameworks for aligning large multi-modal video models with reinforcement learning from rank rather than score.
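One simple way to turn an ordinal ranking into GRPO-style group-relative advantages is sketched below. The abstract does not give the $GRPO_{rank}$ formula, so the mapping from rank to reward (negated rank) and the group normalization are assumptions borrowed from the usual GRPO recipe, and the function name is hypothetical.

```python
from statistics import mean, pstdev


def rank_advantages(ranks):
    """Convert ranks (1 = best) for one group of responses into advantages.

    Each rank is negated so that better ranks get higher reward, then the
    rewards are standardized within the group, as GRPO does with scalar rewards.
    """
    rewards = [-float(r) for r in ranks]
    mu, sigma = mean(rewards), pstdev(rewards)
    if sigma == 0:
        return [0.0 for _ in rewards]  # all tied: no learning signal
    return [(r - mu) / sigma for r in rewards]


# Three candidate responses ranked 2nd, 1st, 3rd by the ranker.
print(rank_advantages([2, 1, 3]))
```

Because only the ordering enters the computation, the ranker never needs to produce calibrated scalar scores, which is the property the abstract contrasts with trained reward models.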

[45] PhysHMR: Learning Humanoid Control Policies from Vision for Physically Plausible Human Motion Reconstruction

Qiao Feng,Yiming Huang,Yufu Wang,Jiatao Gu,Lingjie Liu

Main category: cs.CV

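TL;DR: PhysHMR is a unified framework that learns physics-based human motion reconstruction directly from monocular video, combining visual features with physical constraints and avoiding the error accumulation of conventional two-stage methods.

Details Motivation: Most existing methods rely on kinematics-based pose estimation followed by physics-based post-processing, which yields unrealistic results and accumulates error. PhysHMR instead learns a visual-to-action policy directly, aiming for motion that is physically plausible and aligned with the input.

Contribution: 1. Proposes a reconstruction framework that learns physically plausible motion directly from visual input; 2. introduces the pixel-as-ray strategy, which provides global pose guidance without noisy 3D root predictions; 3. combines distillation with reinforcement learning to improve sample efficiency.

Method: 1. Lifts 2D keypoints into 3D spatial rays with the pixel-as-ray strategy; 2. combines local visual features from a pretrained encoder with the global ray information; 3. distills motion knowledge from a mocap-trained expert and refines the policy with reinforcement learning.

Result: PhysHMR produces high-fidelity, physically plausible motion across diverse scenarios and outperforms prior approaches in both visual accuracy and physical realism.

Insight: Learning a visual-to-action policy directly avoids the error accumulation of two-stage methods, and combining soft global grounding with distillation improves efficiency and quality.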

Abstract: Reconstructing physically plausible human motion from monocular videos remains a challenging problem in computer vision and graphics. Existing methods primarily focus on kinematics-based pose estimation, often leading to unrealistic results due to the lack of physical constraints. To address such artifacts, prior methods have typically relied on physics-based post-processing following the initial kinematics-based motion estimation. However, this two-stage design introduces error accumulation, ultimately limiting the overall reconstruction quality. In this paper, we present PhysHMR, a unified framework that directly learns a visual-to-action policy for humanoid control in a physics-based simulator, enabling motion reconstruction that is both physically grounded and visually aligned with the input video. A key component of our approach is the pixel-as-ray strategy, which lifts 2D keypoints into 3D spatial rays and transforms them into global space. These rays are incorporated as policy inputs, providing robust global pose guidance without depending on noisy 3D root predictions. This soft global grounding, combined with local visual features from a pretrained encoder, allows the policy to reason over both detailed pose and global positioning. To overcome the sample inefficiency of reinforcement learning, we further introduce a distillation scheme that transfers motion knowledge from a mocap-trained expert to the vision-conditioned policy, which is then refined using physically motivated reinforcement learning rewards. Extensive experiments demonstrate that PhysHMR produces high-fidelity, physically plausible motion across diverse scenarios, outperforming prior approaches in both visual accuracy and physical realism.
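The geometric core of lifting a 2D keypoint into a 3D ray in global space can be sketched with a standard pinhole camera model. The abstract does not specify how PhysHMR parameterizes its rays, so the intrinsics (`fx`, `fy`, `cx`, `cy`), the camera-to-world rotation `R`, and the function name are assumptions chosen for illustration.

```python
import math


def pixel_to_ray(u, v, fx, fy, cx, cy, R):
    """Back-project pixel (u, v) to a unit ray direction in world coordinates.

    fx, fy, cx, cy: pinhole intrinsics (focal lengths and principal point).
    R: 3x3 camera-to-world rotation as nested lists.
    """
    d_cam = [(u - cx) / fx, (v - cy) / fy, 1.0]  # direction in camera space
    d_world = [sum(R[i][j] * d_cam[j] for j in range(3)) for i in range(3)]
    norm = math.sqrt(sum(c * c for c in d_world))
    return [c / norm for c in d_world]


identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
# A keypoint at the principal point looks straight down the optical axis.
print(pixel_to_ray(320, 240, 500, 500, 320, 240, identity))  # [0.0, 0.0, 1.0]
```

A ray constrains where a joint can lie without committing to a depth, which fits the abstract's description of global guidance that does not depend on noisy 3D root predictions.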

[46] Unlocking the power of partnership: How humans and machines can work together to improve face recognition

P. Jonathon Phillips,Geraldine Jeckeln,Carina A. Hahn,Amy N. Yates,Peter C. Fontana,Alice J. O’Toole

Main category: cs.CV

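TL;DR: The paper examines human-machine collaboration in face recognition, proposes intelligent fusion based on the Proximal Accuracy Rule (PAR), and shows that under certain conditions combining humans and machines significantly improves identification accuracy.

Details Motivation: Humans and machines each have strengths and weaknesses in face recognition, but how to combine them to raise overall accuracy is unclear. The paper establishes empirically the conditions under which collaboration helps.

Contribution: 1) Proposes the Proximal Accuracy Rule (PAR) for predicting the benefit of human-machine collaboration; 2) defines a "critical fusion zone" in which fusing human and machine decisions improves accuracy; 3) implements "intelligent human-machine fusion" by selecting people with the potential to increase the accuracy of a high-performing machine; 4) uses graph theory to find the highest accuracy achievable with human-only partnerships and compares it with human-machine collaboration.

Method: Using face identification data from experts and non-experts, analyzes human-human and human-machine collaboration. Establishes the critical fusion zone from the PAR and implements an intelligent human-machine fusion strategy.

Result: Intelligent human-machine fusion is more accurate than the machine alone and more accurate than combining all human and machine judgments. The best human-only system approximates the average performance of intelligent human-machine collaboration, but the latter more effectively limits the impact of low-performing humans.

Insight: 1) The benefit of collaboration depends on the difference in baseline accuracy between collaborators; 2) within the critical fusion zone, even humans less accurate than the machine can improve system accuracy; 3) selecting the right human participants is key to effective human-machine collaboration.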

Abstract: Human review of consequential decisions by face recognition algorithms creates a “collaborative” human-machine system. Individual differences between people and machines, however, affect whether collaboration improves or degrades accuracy in any given case. We establish the circumstances under which combining human and machine face identification decisions improves accuracy. Using data from expert and non-expert face identifiers, we examined the benefits of human-human and human-machine collaborations. The benefits of collaboration increased as the difference in baseline accuracy between collaborators decreased, following the Proximal Accuracy Rule (PAR). This rule predicted collaborative (fusion) benefit across a wide range of baseline abilities, from people with no training to those with extensive training. Using the PAR, we established a critical fusion zone, where humans are less accurate than the machine, but fusing the two improves system accuracy. This zone was surprisingly large. We implemented “intelligent human-machine fusion” by selecting people with the potential to increase the accuracy of a high-performing machine. Intelligent fusion was more accurate than the machine operating alone and more accurate than combining all human and machine judgments. The highest system-wide accuracy achievable with human-only partnerships was found by graph theory. This fully human system approximated the average performance achieved by intelligent human-machine collaboration. However, intelligent human-machine collaboration more effectively minimized the impact of low-performing humans on system-wide accuracy. The results demonstrate a meaningful role for both humans and machines in assuring accurate face identification. This study offers an evidence-based road map for the intelligent use of AI in face identification.

[47] How Confident are Video Models? Empowering Video Models to Express their Uncertainty

Zhiting Mei,Ola Shorinwa,Anirudha Majumdar

Main category: cs.CV

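TL;DR: The paper presents the first uncertainty quantification approach for generative video models, consisting of a calibration metric based on robust rank correlation, a black-box uncertainty quantification method (S-QUBED), and a benchmark dataset.

Details Motivation: Generative video models show strong text-to-video capabilities but also hallucinate, producing plausible videos that are factually wrong. No uncertainty quantification method exists for video models, which raises safety concerns.

Contribution: 1. Proposes the first uncertainty quantification framework for video models; 2. designs a calibration metric based on robust rank correlation; 3. develops S-QUBED, a black-box uncertainty quantification method that decomposes uncertainty into aleatoric and epistemic components.

Method: 1. Evaluates video model calibration with a robust rank correlation metric; 2. S-QUBED decomposes predictive uncertainty through latent modeling; 3. builds a benchmark dataset for calibration testing.

Result: S-QUBED yields calibrated total uncertainty estimates that are negatively correlated with task accuracy, and effectively separates the aleatoric and epistemic components.

Insight: Uncertainty in video models can be decomposed by conditioning the generation task in the latent space; future work could combine this with calibration techniques to improve model safety.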

Abstract: Generative video models demonstrate impressive text-to-video capabilities, spurring widespread adoption in many real-world applications. However, like large language models (LLMs), video generation models tend to hallucinate, producing plausible videos even when they are factually wrong. Although uncertainty quantification (UQ) of LLMs has been extensively studied in prior work, no UQ method for video models exists, raising critical safety concerns. To our knowledge, this paper represents the first work towards quantifying the uncertainty of video models. We present a framework for uncertainty quantification of generative video models, consisting of: (i) a metric for evaluating the calibration of video models based on robust rank correlation estimation with no stringent modeling assumptions; (ii) a black-box UQ method for video models (termed S-QUBED), which leverages latent modeling to rigorously decompose predictive uncertainty into its aleatoric and epistemic components; and (iii) a UQ dataset to facilitate benchmarking calibration in video models. By conditioning the generation task in the latent space, we disentangle uncertainty arising due to vague task specifications from that arising from lack of knowledge. Through extensive experiments on benchmark video datasets, we demonstrate that S-QUBED computes calibrated total uncertainty estimates that are negatively correlated with the task accuracy and effectively computes the aleatoric and epistemic constituents.
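A rank-correlation check of this kind can be sketched with Kendall's tau, which depends only on orderings and so makes no distributional assumptions. Whether the paper uses Kendall's tau or another robust estimator is not stated in the abstract; the choice of estimator, the function name, and the sample values are assumptions.

```python
from itertools import combinations


def kendall_tau(xs, ys):
    """Kendall's tau-a: (concordant pairs - discordant pairs) / total pairs."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs


uncertainty = [0.1, 0.5, 0.9]  # hypothetical per-prompt uncertainty estimates
accuracy = [0.9, 0.5, 0.2]     # hypothetical per-prompt task accuracy
print(kendall_tau(uncertainty, accuracy))  # -1.0
```

A well-calibrated uncertainty estimate should produce a strongly negative value here, since higher uncertainty ought to go with lower accuracy.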

[48] Ego-Exo 3D Hand Tracking in the Wild with a Mobile Multi-Camera Rig

Patrick Rim,Kun He,Kevin Harris,Braden Copple,Shangchen Han,Sizhe An,Ivan Shugurov,Tomas Hodan,He Wen,Xu Xie

Main category: cs.CV

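TL;DR: The paper presents a marker-less multi-camera system for accurate 3D hand tracking in the wild, combining a lightweight back-mounted rig with a Meta Quest 3 headset to provide high-accuracy ground truth across diverse environments.

Details Motivation: Existing datasets are mostly captured in controlled lab setups, which limits environmental diversity and model generalization. The paper develops a new system to address 3D hand tracking in the wild.

Contribution: 1. Designs a lightweight multi-camera system for accurate 3D hand tracking in the wild. 2. Proposes a tracking pipeline that combines exocentric and egocentric views to generate accurate ground truth. 3. Collects an annotated dataset with synchronized multi-view images and precise 3D hand poses.

Method: Builds a back-mounted capture rig with eight exocentric cameras, adds two egocentric views from the headset, and designs an ego-exo tracking pipeline that generates 3D hand pose ground truth.

Result: The system significantly reduces the trade-off between environmental realism and 3D annotation accuracy.

Insight: A multi-view system combining exocentric and egocentric cameras can improve the accuracy and practicality of 3D hand tracking in the wild.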

Abstract: Accurate 3D tracking of hands and their interactions with the world in unconstrained settings remains a significant challenge for egocentric computer vision. With few exceptions, existing datasets are predominantly captured in controlled lab setups, limiting environmental diversity and model generalization. To address this, we introduce a novel marker-less multi-camera system designed to capture precise 3D hands and objects, which allows for nearly unconstrained mobility in genuinely in-the-wild conditions. We combine a lightweight, back-mounted capture rig with eight exocentric cameras, and a user-worn Meta Quest 3 headset, which contributes two egocentric views. We design an ego-exo tracking pipeline to generate accurate 3D hand pose ground truth from this system, and rigorously evaluate its quality. By collecting an annotated dataset featuring synchronized multi-view images and precise 3D hand poses, we demonstrate the capability of our approach to significantly reduce the trade-off between environmental realism and 3D annotation accuracy.

[49] Input-Aware Sparse Attention for Real-Time Co-Speech Video Generation

Beijia Lu,Ziyi Chen,Jing Xiao,Jun-Yan Zhu

Main category: cs.CV

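TL;DR: The paper proposes a real-time co-speech video generation method based on input-aware sparse attention and an input-aware distillation loss, improving generation efficiency and quality.

Details Motivation: Existing diffusion-based co-speech video generation is too computationally expensive for real-time use, and directly applying existing distillation methods degrades quality.

Contribution: 1) Proposes input-aware sparse attention, which uses human pose keypoints to guide attention and cut redundant computation; 2) designs an input-aware distillation loss that improves lip synchronization and hand motion realism.

Method: 1) Uses input human pose keypoints to guide attention to relevant regions; 2) distills the model with the sparse attention and the distillation loss together for efficient video generation.

Result: The method reaches real-time performance with improved visual quality compared with recent audio-driven and input-driven methods.

Insight: Input-aware attention and loss design improve generation efficiency and quality and suggest a direction for real-time video generation.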

Abstract: Diffusion models can synthesize realistic co-speech video from audio for various applications, such as video creation and virtual agents. However, existing diffusion-based methods are slow due to numerous denoising steps and costly attention mechanisms, preventing real-time deployment. In this work, we distill a many-step diffusion video model into a few-step student model. Unfortunately, directly applying recent diffusion distillation methods degrades video quality and falls short of real-time performance. To address these issues, our new video distillation method leverages input human pose conditioning for both attention and loss functions. We first propose using accurate correspondence between input human pose keypoints to guide attention to relevant regions, such as the speaker’s face, hands, and upper body. This input-aware sparse attention reduces redundant computations and strengthens temporal correspondences of body parts, improving inference efficiency and motion coherence. To further enhance visual quality, we introduce an input-aware distillation loss that improves lip synchronization and hand motion realism. By integrating our input-aware sparse attention and distillation loss, our method achieves real-time performance with improved visual quality compared to recent audio-driven and input-driven methods. We also conduct extensive experiments showing the effectiveness of our algorithmic design choices.

[50] Deep Generative Continual Learning using Functional LoRA: FunLoRA

Victor Enescu,Hichem Sahbi

Main category: cs.CV

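TL;DR: The paper proposes FunLoRA, a new conditioning mechanism based on low rank adaptation (LoRA) for continual learning in deep generative models; it avoids catastrophic forgetting, improves performance through dynamic conditioning, and lowers memory cost and sampling time.

Details Motivation: Deep generative models are widely used in text and vision applications, but incremental training suffers from catastrophic forgetting; the common remedy of retraining on synthetic data makes training time grow until it becomes intractable.

Contribution: Designs FunLoRA, a LoRA-based dynamic conditioning mechanism that uses only rank 1 matrices and raises their rank through functions, avoiding catastrophic forgetting while training only on current-task data.

Method: Uses low rank adaptation (LoRA) with functions that raise the matrix rank, giving a parameter-efficient fine-tuning (PEFT) method that improves continual learning for generative models.

Result: With flow-matching based models, FunLoRA surpasses prior state-of-the-art results based on diffusion models, reaching higher classification accuracy at a fraction of the memory cost and sampling time.

Insight: Dynamic conditioning with parameter-efficient fine-tuning can avoid catastrophic forgetting in continual learning while keeping performance high and resource use low.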

Abstract: Continual adaptation of deep generative models holds tremendous potential and critical importance, given their rapid and expanding usage in text and vision based applications. Incremental training, however, remains highly challenging due to the catastrophic forgetting phenomenon, which makes it difficult for neural networks to effectively incorporate new knowledge. A common strategy consists in retraining the generative model on its own synthetic data in order to mitigate forgetting. Yet, such an approach faces two major limitations: (i) the continually increasing training time eventually becomes intractable, and (ii) reliance on synthetic data inevitably leads to long-term performance degradation, since synthetic samples lack the richness of real training data. In this paper, we attenuate these issues by designing a novel and more expressive conditioning mechanism for generative models based on low rank adaptation (LoRA) that exclusively employs rank 1 matrices, whose reparametrized matrix rank is functionally increased using carefully selected functions; we dub it functional LoRA (FunLoRA). Using this dynamic conditioning, the generative model is guaranteed to avoid catastrophic forgetting and needs only to be trained on data from the current task. Extensive experiments using flow-matching based models trained from scratch showcase that our proposed parameter-efficient fine-tuning (PEFT) method surpasses prior state-of-the-art results based on diffusion models, reaching higher classification accuracy scores, while only requiring a fraction of the memory cost and sampling time.
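The idea that a function can raise the rank of a rank 1 matrix can be checked numerically. In the sketch below an elementwise exponential is applied to an outer product; both the elementwise application and the choice of `exp` are assumptions made for illustration, since the abstract does not say which functions FunLoRA uses or how they are applied.

```python
import math


def matrix_rank(m, tol=1e-9):
    """Rank via Gaussian elimination with partial pivoting (small dense matrices)."""
    a = [row[:] for row in m]
    rows, cols, rank = len(a), len(a[0]), 0
    for c in range(cols):
        pivot = max(range(rank, rows), key=lambda r: abs(a[r][c]), default=None)
        if pivot is None or abs(a[pivot][c]) < tol:
            continue  # no usable pivot in this column
        a[rank], a[pivot] = a[pivot], a[rank]
        for r in range(rank + 1, rows):
            factor = a[r][c] / a[rank][c]
            a[r] = [x - factor * y for x, y in zip(a[r], a[rank])]
        rank += 1
    return rank


u = [0.0, 1.0, 2.0]
v = [0.0, 1.0, 2.0]
outer = [[ui * vj for vj in v] for ui in u]             # rank 1 by construction
lifted = [[math.exp(x) for x in row] for row in outer]  # elementwise function
print(matrix_rank(outer), matrix_rank(lifted))  # 1 3
```

The stored parameters are still just the two vectors `u` and `v`, yet the transformed matrix is full rank, which is the sense in which a functional reparametrization can be more expressive than a plain rank 1 update at the same parameter count.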

[51] Sequence-Preserving Dual-FoV Defense for Traffic Sign and Light Recognition in Autonomous Vehicles

Abhishek Joshi,Jahnavi Krishna Koda,Abhishek Phadke

Main category: cs.CV

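TL;DR: The paper proposes a dual field-of-view (FoV), sequence-preserving defense framework for traffic sign and light recognition in autonomous vehicles, using a unified three-layer defense stack (feature squeezing, defensive distillation, and entropy-based anomaly detection) to improve robustness to digital and natural perturbations.

Details Motivation: Misrecognizing traffic signs and lights directly affects the safety and navigation of autonomous vehicles. Existing work does not jointly consider temporal continuity, multistatic field-of-view (FoV) sensing, and robustness to both digital and natural perturbations.

Contribution: Proposes a dual-FoV, sequence-preserving defense framework that integrates a three-layer defense stack of feature squeezing, defensive distillation, and entropy-based anomaly detection, validated on a multi-source dataset.

Method: Builds a multi-source dataset (aiMotive, Udacity, Waymo, and self-recorded videos), temporally aligns mid- and long-term sequences of RGB images, designs the unified defense stack, and adds sequence-wise temporal voting.

Result: The Unified Defense Stack achieves 79.8% mAP and reduces the attack success rate (ASR) to 18.2%, outperforming YOLOv8, YOLOv9, and BEVFormer, while reducing high-risk misclassification to 32%.

Insight: Temporal information is important for multi-FoV traffic sign and light recognition, and a unified defense strategy improves robustness to digital and physical perturbations.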

Abstract: Traffic light and sign recognition is key for Autonomous Vehicles (AVs) because perception mistakes directly influence navigation and safety. In addition to digital adversarial attacks, models are vulnerable to existing perturbations (glare, rain, dirt, or graffiti), which could lead to dangerous misclassifications. The current work lacks consideration of temporal continuity, multistatic field-of-view (FoV) sensing, and robustness to both digital and natural degradation. This study proposes a dual FoV, sequence-preserving robustness framework for traffic lights and signs in the USA based on a multi-source dataset built on aiMotive, Udacity, Waymo, and self-recorded videos from the region of Texas. Mid and long-term sequences of RGB images are temporally aligned for four operational design domains (ODDs): highway, night, rainy, and urban. Over a series of experiments on a real-life application of anomaly detection, this study outlines a unified three-layer defense stack framework that incorporates feature squeezing, defensive distillation, and entropy-based anomaly detection, as well as sequence-wise temporal voting for further enhancement. The evaluation measures included accuracy, attack success rate (ASR), risk-weighted misclassification severity, and confidence stability. Physical transferability was confirmed using probes for recapture. The results showed that the Unified Defense Stack achieved 79.8% mAP and reduced the ASR to 18.2%, which is superior to YOLOv8, YOLOv9, and BEVFormer, while reducing the high-risk misclassification to 32%.
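Three of the inference-time pieces named above (feature squeezing, entropy-based anomaly detection, and temporal voting) can be sketched in a few lines; defensive distillation is a training-time step and is left out. The function names, the bit-depth form of squeezing, and the entropy threshold of 0.6 nats are assumptions for illustration, not the paper's settings.

```python
import math
from collections import Counter


def squeeze_bit_depth(pixels, bits=4):
    """Feature squeezing by bit-depth reduction on pixel values in [0, 1]."""
    levels = 2 ** bits - 1
    return [round(p * levels) / levels for p in pixels]


def entropy(probs):
    """Shannon entropy (nats) of a predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def temporal_vote(frame_probs, labels, max_entropy=0.6):
    """Majority vote over a frame sequence, skipping high-entropy (suspect) frames."""
    votes = Counter()
    for probs in frame_probs:
        if entropy(probs) > max_entropy:
            continue  # treat an uncertain frame as anomalous and ignore it
        votes[labels[probs.index(max(probs))]] += 1
    return votes.most_common(1)[0][0] if votes else None


frames = [[0.9, 0.1], [0.8, 0.2], [0.45, 0.55]]  # hypothetical per-frame softmax outputs
print(temporal_vote(frames, ["stop", "yield"]))  # stop
```

In this example the third frame disagrees with the first two, but its near-uniform distribution puts it above the entropy threshold, so it is excluded and the sequence-level decision stays "stop".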

[52] Smart-GRPO: Smartly Sampling Noise for Efficient RL of Flow-Matching Models

Benjamin Yu,Jackie Liu,Justin Cui

Main category: cs.CV

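TL;DR: Smart-GRPO is a reinforcement learning method for flow-matching models that optimizes noise perturbations through smarter sampling, improving both reward optimization and visual quality.

Details Motivation: Flow-matching models enable high-quality text-to-image generation, but their deterministic nature leaves them without the stochasticity that reinforcement learning needs. Perturbing latents with random noise, as in prior work, is inefficient and unstable; Smart-GRPO targets this problem.

Contribution: The first method to optimize noise perturbations for reinforcement learning in flow-matching models. An iterative search refines the noise distribution, improving training efficiency and generation quality.

Method: An iterative search strategy: decode candidate perturbations, score them with a reward function, and refine the noise distribution toward higher-reward regions.

Result: Experiments show that Smart-GRPO improves both reward optimization and visual quality over baseline methods, supporting its practicality within flow-matching frameworks.

Insight: Smart-GRPO offers a practical path toward combining flow-matching models with reinforcement learning, bridging the gap between efficient training and human-aligned generation.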

Abstract: Recent advancements in flow-matching have enabled high-quality text-to-image generation. However, the deterministic nature of flow-matching models makes them poorly suited for reinforcement learning, a key tool for improving image quality and human alignment. Prior work has introduced stochasticity by perturbing latents with random noise, but such perturbations are inefficient and unstable. We propose Smart-GRPO, the first method to optimize noise perturbations for reinforcement learning in flow-matching models. Smart-GRPO employs an iterative search strategy that decodes candidate perturbations, evaluates them with a reward function, and refines the noise distribution toward higher-reward regions. Experiments demonstrate that Smart-GRPO improves both reward optimization and visual quality compared to baseline methods. Our results suggest a practical path toward reinforcement learning in flow-matching frameworks, bridging the gap between efficient training and human-aligned generation.
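
The iterative search described in the abstract (sample candidate perturbations, score them, refine the noise distribution) can be sketched as follows. This is a hedged illustration, not the authors' algorithm: refitting a diagonal Gaussian to the top-scoring candidates is a cross-entropy-method-style update chosen here as an assumption, and `toy_reward` is a hypothetical stand-in for decoding a latent and scoring the image with a reward model.

```python
import random
import statistics

def toy_reward(noise):
    """Hypothetical stand-in for 'decode the perturbed latent, then score it'."""
    target = [1.5, -0.5]
    return -sum((n - t) ** 2 for n, t in zip(noise, target))

def refine_noise_distribution(reward_fn, dim=2, rounds=20, candidates=64,
                              keep=8, seed=0):
    """Iteratively move a diagonal Gaussian toward higher-reward noise."""
    rng = random.Random(seed)
    mean = [0.0] * dim
    std = [1.0] * dim
    for _ in range(rounds):
        # 1. Sample candidate perturbations from the current distribution.
        samples = [[rng.gauss(m, s) for m, s in zip(mean, std)]
                   for _ in range(candidates)]
        # 2. Evaluate each candidate and keep the best-scoring ones.
        samples.sort(key=reward_fn, reverse=True)
        elite = samples[:keep]
        # 3. Refit the distribution to the elite set (assumed update rule).
        for d in range(dim):
            column = [s[d] for s in elite]
            mean[d] = statistics.fmean(column)
            std[d] = max(statistics.pstdev(column), 1e-3)
    return mean, std

mean, std = refine_noise_distribution(toy_reward)
```

After a few rounds the mean of the noise distribution drifts toward the region the toy reward prefers, which is the behaviour the abstract attributes to the search.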

[53] FSFSplatter: Build Surface and Novel Views with Sparse-Views within 3min

Yibin Zhao,Yihan Pan,Jun Nan,Jianjun Yi

Main category: cs.CV

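TL;DR: FSFSplatter is a Gaussian Splatting based method for fast surface reconstruction from free sparse images, addressing existing methods' need for dense, calibrated views and the poor surface quality they produce from sparse views.

Details Motivation: Conventional Gaussian Splatting methods need dense, calibrated views to reconstruct high-quality surfaces and synthesize novel views, while sparse views lead to degraded reconstruction and overfitting. FSFSplatter aims to provide fast, high-quality reconstruction under these conditions.

Contribution: 1. FSFSplatter, an end-to-end Gaussian Splatting method for fast, high-quality surface reconstruction from free sparse images; 2. dense Gaussian initialization combined with geometry-enhanced scene optimization; 3. contribution-based pruning to remove floaters, plus depth and multi-view feature supervision to reduce overfitting.

Method: 1. Encode multi-view images with a large Transformer; 2. generate a dense, geometrically consistent Gaussian scene initialization with a self-splitting Gaussian head; 3. during rapid optimization, apply depth and multi-view feature supervision and prune local floaters to reduce overfitting.

Result: FSFSplatter outperforms current state-of-the-art methods on the DTU and Replica datasets, delivering high-quality surface reconstruction and novel view synthesis.

Insight: High-quality reconstruction from sparse views is achievable with dense initialization and geometry-enhanced optimization, and combining a Transformer encoder with multiple supervision signals further improves results.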

Abstract: Gaussian Splatting has become a leading reconstruction technique, known for its high-quality novel view synthesis and detailed reconstruction. However, most existing methods require dense, calibrated views. Reconstructing from free sparse images often leads to poor surfaces due to limited overlap and overfitting. We introduce FSFSplatter, a new approach for fast surface reconstruction from free sparse images. Our method integrates end-to-end dense Gaussian initialization, camera parameter estimation, and geometry-enhanced scene optimization. Specifically, FSFSplatter employs a large Transformer to encode multi-view images and generates a dense and geometrically consistent Gaussian scene initialization via a self-splitting Gaussian head. It eliminates local floaters through contribution-based pruning and mitigates overfitting to limited views by leveraging depth and multi-view feature supervision with differentiable camera parameters during rapid optimization. FSFSplatter outperforms current state-of-the-art methods on the widely used DTU and Replica datasets.

[54] MoGIC: Boosting Motion Generation via Intention Understanding and Visual Context

Junyu Shi,Yong Sun,Zhiyuan Zhang,Lijiang Liu,Zhengjie Zhang,Yuxin He,Qiang Nie

Main category: cs.CV

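TL;DR: MoGIC is a unified framework that improves text-driven motion generation by adding intention modeling and visual priors, enabling multimodal motion synthesis. A mixture-of-attention mechanism aligns conditional tokens with motion subsequences, and the approach is validated on the large-scale Mo440H benchmark.

Details Motivation: Existing text-driven motion generation methods usually treat the task as a bidirectional mapping between language and motion, and fail to capture the causal logic of action execution and the human intentions behind it. The lack of visual grounding further limits precision and personalization.

Contribution: 1. MoGIC, a multimodal motion synthesis framework that integrates intention modeling and visual priors; 2. a mixture-of-attention mechanism that aligns conditional tokens with motion subsequences; 3. the Mo440H benchmark dataset.

Method: Jointly optimizes multimodal-conditioned motion generation and intention prediction, using a mixture-of-attention mechanism with adaptive scope to align conditional tokens with motion subsequences.

Result: After finetuning, MoGIC reduces FID by 38.6% on HumanML3D and 34.6% on Mo440H, surpasses LLM-based methods on motion captioning, and supports intention prediction and vision-conditioned generation.

Insight: Combining intention and visual priors can markedly improve the precision and controllability of motion generation, and the mixture-of-attention mechanism is an effective tool for multimodal alignment.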

Abstract: Existing text-driven motion generation methods often treat synthesis as a bidirectional mapping between language and motion, but remain limited in capturing the causal logic of action execution and the human intentions that drive behavior. The absence of visual grounding further restricts precision and personalization, as language alone cannot specify fine-grained spatiotemporal details. We propose MoGIC, a unified framework that integrates intention modeling and visual priors into multimodal motion synthesis. By jointly optimizing multimodal-conditioned motion generation and intention prediction, MoGIC uncovers latent human goals, leverages visual priors to enhance generation, and exhibits versatile multimodal generative capability. We further introduce a mixture-of-attention mechanism with adaptive scope to enable effective local alignment between conditional tokens and motion subsequences. To support this paradigm, we curate Mo440H, a 440-hour benchmark from 21 high-quality motion datasets. Experiments show that after finetuning, MoGIC reduces FID by 38.6% on HumanML3D and 34.6% on Mo440H, surpasses LLM-based methods in motion captioning with a lightweight text head, and further enables intention prediction and vision-conditioned generation, advancing controllable motion synthesis and intention understanding. The code is available at https://github.com/JunyuShi02/MoGIC

[55] From Tokens to Nodes: Semantic-Guided Motion Control for Dynamic 3D Gaussian Splatting

Jianing Chen,Zehao Li,Yujun Cai,Hao Jiang,Shuqin Gao,Honglong Zhao,Tianlu Mao,Yucheng Zhang

Main category: cs.CV

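TL;DR: The paper proposes a semantic-guided motion control framework for dynamic 3D Gaussian Splatting reconstruction. Motion-adaptive compression concentrates sparse control points in dynamic regions, improving reconstruction quality and efficiency.

Details Motivation: Dynamic 3D reconstruction from monocular video is hampered by the ambiguity of inferring 3D motion and by high computational cost. Existing sparse control methods allocate control points purely by geometry, which leaves static regions redundant and dynamic regions under-covered.

Contribution: 1. A motion-adaptive framework based on semantic and motion priors that improves control point allocation; 2. spline-based trajectory parameterization in place of MLP-based deformation fields, giving smoother motion representation; 3. significant gains in reconstruction quality and efficiency.

Method: 1. Extract semantic and motion priors from vision foundation models; 2. establish patch-token-node correspondences; 3. adaptively compress control points through iterative voxelization and motion tendency scoring; 4. introduce spline-based trajectory parameterization.

Result: Experiments show significant improvements in reconstruction quality and efficiency over existing state-of-the-art methods.

Insight: Semantic-guided, motion-adaptive control points together with spline-based trajectory parameterization address control point allocation and motion representation in dynamic 3D reconstruction more efficiently.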

Abstract: Dynamic 3D reconstruction from monocular videos remains difficult due to the ambiguity of inferring 3D motion from limited views and the computational demands of modeling temporally varying scenes. While recent sparse control methods alleviate computation by reducing millions of Gaussians to thousands of control points, they suffer from a critical limitation: they allocate points purely by geometry, leading to static redundancy and dynamic insufficiency. We propose a motion-adaptive framework that aligns control density with motion complexity. Leveraging semantic and motion priors from vision foundation models, we establish patch-token-node correspondences and apply motion-adaptive compression to concentrate control points in dynamic regions while suppressing redundancy in static backgrounds. Our approach achieves flexible representational density adaptation through iterative voxelization and motion tendency scoring, directly addressing the fundamental mismatch between control point allocation and motion complexity. To capture temporal evolution, we introduce spline-based trajectory parameterization initialized by 2D tracklets, replacing MLP-based deformation fields to achieve smoother motion representation and more stable optimization. Extensive experiments demonstrate significant improvements in reconstruction quality and efficiency over existing state-of-the-art methods.
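
To make the idea of a spline-based trajectory concrete, the sketch below evaluates a control point's position over time from a handful of knots. The abstract does not say which spline family is used; the uniform Catmull-Rom spline, the integer knot times, and the names `catmull_rom` and `position_at` are all assumptions made for illustration.

```python
def catmull_rom(p0, p1, p2, p3, t):
    """Uniform Catmull-Rom segment: passes through p1 at t=0 and p2 at t=1."""
    return tuple(
        0.5 * (2 * b + (c - a) * t
               + (2 * a - 5 * b + 4 * c - d) * t ** 2
               + (-a + 3 * b - 3 * c + d) * t ** 3)
        for a, b, c, d in zip(p0, p1, p2, p3)
    )

def position_at(knots, time):
    """Position of one control point at a continuous time.

    Knot k is assumed to sit at time k; end knots are repeated so the
    first and last segments are defined. Requires at least two knots.
    """
    last = len(knots) - 1
    time = min(max(time, 0.0), float(last))  # clamp to the knot range
    i = min(int(time), last - 1)             # index of the active segment
    t = time - i                             # local parameter in [0, 1]
    p0 = knots[max(i - 1, 0)]
    p3 = knots[min(i + 2, last)]
    return catmull_rom(p0, knots[i], knots[i + 1], p3, t)

# Four 3D knots for a single control point (toy values).
knots = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 1.0, 0.0), (3.0, 1.0, 0.0)]
midway = position_at(knots, 1.5)
```

A trajectory stored this way needs only a few knots per control point and is smooth by construction, which is the property the abstract contrasts with MLP-based deformation fields.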

[56] Net2Net: When Un-trained Meets Pre-trained Networks for Robust Real-World Denoising

Weimin Yuan,Cai Meng

Main category: cs.CV

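TL;DR: Net2Net combines the strengths of untrained and pre-trained networks through regularization by denoising (RED) to tackle real-world noise removal, adapting to varied noise patterns without large amounts of labeled data.

Details Motivation: Traditional denoising methods rely on handcrafted priors and struggle with the complexity and variability of real noise, while deep learning methods need extensive labeled data and generalize poorly. Net2Net aims to combine untrained and pre-trained networks to address both problems.

Contribution: Proposes the Net2Net framework, which couples the unsupervised DIP with the supervised pre-trained DRUNet via RED for real-world noise removal, improving generalization and performance when training data is limited.

Method: The untrained network (DIP) adapts to the noise characteristics of each input image, the pre-trained network (DRUNet) contributes representations learned from large-scale datasets, and RED ties the two together.

Result: Experiments on benchmark datasets show the method's superiority for real-world noise removal, particularly when training data is limited.

Insight: Net2Net shows that untrained and pre-trained networks are complementary, suggesting a route to adaptive denoising that does not depend on labeled data.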

Abstract: Traditional denoising methods have largely relied on handcrafted priors; they often perform well in controlled environments but struggle to address the complexity and variability of real noise. In contrast, deep learning-based approaches have gained prominence for learning noise characteristics from large datasets, but these methods frequently require extensive labeled data and may not generalize effectively across diverse noise types and imaging conditions. In this paper, we present an innovative method, termed Net2Net, that combines the strengths of untrained and pre-trained networks to tackle the challenges of real-world noise removal. The innovation of Net2Net lies in its combination of the unsupervised deep image prior (DIP) and the supervised pre-trained model DRUNet through regularization by denoising (RED). The untrained network adapts to the unique noise characteristics of each input image without requiring labeled data, while the pre-trained network leverages learned representations from large-scale datasets to deliver robust denoising performance. This hybrid framework enhances generalization across varying noise patterns and improves performance, particularly in scenarios with limited training data. Extensive experiments on benchmark datasets demonstrate the superiority of our method for real-world noise removal.

[57] Retrv-R1: A Reasoning-Driven MLLM Framework for Universal and Efficient Multimodal Retrieval

Lanyun Zhu,Deyi Ji,Tianrun Chen,Haiyang Wu,Shiqi Wang

Main category: cs.CV

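TL;DR: Retrv-R1 is a reasoning-driven multimodal retrieval framework. An information compression module and a new training paradigm address the high computational cost and instability of applying RL directly to retrieval, yielding efficient, universal, high-performing multimodal retrieval.

Details Motivation: Directly applying R1-style RL methods to multimodal retrieval incurs high computational cost and unstable, suboptimal results, which limits their use in retrieval tasks. Retrv-R1 aims to solve these problems and improve both efficiency and performance.

Contribution: 1. An information compression module with a details inspection mechanism that lowers computational cost while preserving critical information. 2. A new training paradigm that uses a synthetic CoT dataset and a curriculum reward.

Method: 1. Reduce the number of tokens with the information compression module. 2. Use the details inspection mechanism so that important information is not lost. 3. Train in two stages: first activate the model with a synthetic CoT dataset, then optimize with RL and a curriculum reward.

Result: Retrv-R1 achieves SOTA performance, high efficiency, and strong generalization across multiple benchmarks and tasks.

Insight: Reasoning-driven multimodal retrieval can improve both performance and efficiency; the combination of information compression and the new training paradigm is key.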

Abstract: The success of DeepSeek-R1 demonstrates the immense potential of using reinforcement learning (RL) to enhance LLMs’ reasoning capabilities. This paper introduces Retrv-R1, the first R1-style MLLM specifically designed for multimodal universal retrieval, achieving higher performance by employing step-by-step reasoning to produce more accurate retrieval results. We find that directly applying the methods of DeepSeek-R1 to retrieval tasks is not feasible, mainly due to (1) the high computational cost caused by the large token consumption required for multiple candidates with reasoning processes, and (2) the instability and suboptimal results when directly applying RL to train for retrieval tasks. To address these issues, Retrv-R1 introduces an information compression module with a details inspection mechanism, which enhances computational efficiency by reducing the number of tokens while ensuring that critical information for challenging candidates is preserved. Furthermore, a new training paradigm is proposed, including an activation stage using a retrieval-tailored synthetic CoT dataset for more effective optimization, followed by RL with a novel curriculum reward to improve both performance and efficiency. Incorporating these novel designs, Retrv-R1 achieves SOTA performance, high efficiency, and strong generalization ability, as demonstrated by experiments across multiple benchmarks and tasks.

[58] Bayesian Test-time Adaptation for Object Recognition and Detection with Vision-language Models

Lihua Zhou,Mao Ye,Shuaifeng Li,Nianxin Li,Jinlin Wu,Xiatian Zhu,Lei Deng,Hongbin Liu,Jiebo Luo,Zhen Lei

Main category: cs.CV

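TL;DR: BCA+ is a training-free, efficient test-time adaptation framework. It uses Bayesian inference with dynamically updated class embeddings, spatial scales, and adaptive priors to handle object recognition and detection in a unified way, achieving state-of-the-art performance.

Details Motivation: Existing test-time adaptation methods either are computationally expensive because they rely on backpropagation, or adapt only the likelihood and overlook the role of the prior, which limits their use in real-time deployment.

Contribution: Proposes BCA+, a unified training-free framework that combines a dynamic cache with Bayesian inference to adapt both object recognition and detection, correcting the model's semantic understanding and its contextual confidence.

Method: 1. A dynamic cache stores and updates class embeddings, spatial scales, and adaptive priors; 2. adaptation is formulated as Bayesian inference, fusing the initial VLM output with a cache-based prediction; 3. the fusion is uncertainty-guided.

Result: Achieves state-of-the-art performance on multiple recognition and detection benchmarks.

Insight: Combining a dynamic cache with Bayesian inference handles distribution shift effectively, and the training-free design makes the method well suited to real-time applications.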

Abstract: Vision-language models (VLMs) such as CLIP and Grounding DINO have achieved remarkable success in object recognition and detection. However, their performance often degrades under real-world distribution shifts. Test-time adaptation (TTA) aims to mitigate this issue by adapting models during inference. Existing methods either rely on computationally expensive backpropagation, which hinders real-time deployment, or focus solely on likelihood adaptation, which overlooks the critical role of the prior. Our prior work, Bayesian Class Adaptation (BCA), addressed these shortcomings for object recognition by introducing a training-free framework that incorporates adaptive priors. Building upon this foundation, we now present Bayesian Class Adaptation plus (BCA+), a unified, training-free framework for TTA for both object recognition and detection. BCA+ introduces a dynamic cache that adaptively stores and updates class embeddings, spatial scales (for detection), and, crucially, adaptive class priors derived from historical predictions. We formulate adaptation as a Bayesian inference problem, where final predictions are generated by fusing the initial VLM output with a cache-based prediction. This cache-based prediction combines a dynamically updated likelihood (measuring feature and scale similarity) and a prior (reflecting the evolving class distribution). This dual-adaptation mechanism, coupled with uncertainty-guided fusion, enables BCA+ to correct both the model’s semantic understanding and its contextual confidence. As a training-free method requiring no backpropagation, BCA+ is highly efficient. Extensive experiments demonstrate that BCA+ achieves state-of-the-art performance on both recognition and detection benchmarks.
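
The cache-based Bayesian update described in the abstract can be sketched in a few lines: a cache prediction proportional to likelihood times prior, a prior that evolves with past predictions, and a fusion weighted by uncertainty. This is a toy reading of the abstract, not the BCA+ implementation; the Laplace-smoothed count prior, the inverse-entropy fusion weights, and all names (`PriorCache`, `fuse`) are assumptions.

```python
import math

def normalize(weights):
    total = sum(weights)
    return [w / total for w in weights]

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0.0)

class PriorCache:
    """Running class prior built from past predictions (smoothed counts)."""

    def __init__(self, num_classes):
        self.counts = [1.0] * num_classes  # Laplace smoothing (assumption)

    def prior(self):
        return normalize(self.counts)

    def update(self, posterior):
        for k, p in enumerate(posterior):
            self.counts[k] += p

def fuse(vlm_probs, likelihood, cache):
    """Fuse the initial VLM output with a cache-based Bayesian prediction."""
    # Cache prediction: posterior proportional to likelihood times prior.
    cache_probs = normalize([l * p for l, p in zip(likelihood, cache.prior())])
    # Uncertainty-guided weights: lower entropy earns more weight (assumption).
    w_vlm = 1.0 / (entropy(vlm_probs) + 1e-6)
    w_cache = 1.0 / (entropy(cache_probs) + 1e-6)
    alpha = w_vlm / (w_vlm + w_cache)
    fused = [alpha * v + (1.0 - alpha) * c
             for v, c in zip(vlm_probs, cache_probs)]
    cache.update(fused)  # the prior evolves with the prediction history
    return fused

cache = PriorCache(num_classes=3)
fused = fuse(vlm_probs=[0.40, 0.35, 0.25],
             likelihood=[0.20, 0.70, 0.10],
             cache=cache)
```

No gradients are computed anywhere, which mirrors the training-free, backpropagation-free property the abstract emphasizes.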

[59] Hierarchical Generalized Category Discovery for Brain Tumor Classification in Digital Pathology

Matthias Perkonigg,Patrick Rockenschaub,Georg Göbel,Adelheid Wöhrer

Main category: cs.CV

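TL;DR: The paper proposes Hierarchical Generalized Category Discovery for Brain Tumor Classification (HGCD-BT), which integrates hierarchical clustering with contrastive learning, substantially improving classification accuracy and performing well on two datasets.

Details Motivation: Brain tumor classification is critical for intra-operative decisions in neuro-oncological surgery, but existing methods are restricted to predefined classes and cannot recognize tumor types unseen during training. Generalized Category Discovery (GCD) can fill this gap, but it does not model hierarchical structure.

Contribution: Proposes HGCD-BT, which brings hierarchical clustering into the GCD framework through a novel semi-supervised hierarchical clustering loss. It outperforms state-of-the-art GCD methods on OpenSRH, especially on previously unseen categories, and generalizes to the Digital Brain Tumor Atlas.

Method: Combines contrastive learning with a semi-supervised hierarchical clustering loss so that the hierarchy of brain tumor taxonomies is reflected in the model. Training uses both labelled and unlabelled data and optimizes the contrastive and clustering objectives together.

Result: On OpenSRH, HGCD-BT achieves a +28% improvement in accuracy over state-of-the-art GCD methods for patch-level classification, particularly on previously unseen tumor categories. It also generalizes to slide-level classification of hematoxylin and eosin stained whole-slide images from the Digital Brain Tumor Atlas.

Insight: Modeling hierarchical structure improves GCD, especially on complex classification tasks. The method applies across imaging modalities and offers a new approach to recognizing unseen categories.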

Abstract: Accurate brain tumor classification is critical for intra-operative decision making in neuro-oncological surgery. However, existing approaches are restricted to a fixed set of predefined classes and are therefore unable to capture patterns of tumor types not available during training. Unsupervised learning can extract general-purpose features, but it lacks the ability to incorporate prior knowledge from labelled data, and semi-supervised methods often assume that all potential classes are represented in the labelled data. Generalized Category Discovery (GCD) aims to bridge this gap by categorizing both known and unknown classes within unlabelled data. To reflect the hierarchical structure of brain tumor taxonomies, in this work, we introduce Hierarchical Generalized Category Discovery for Brain Tumor Classification (HGCD-BT), a novel approach that integrates hierarchical clustering with contrastive learning. Our method extends contrastive learning based GCD by incorporating a novel semi-supervised hierarchical clustering loss. We evaluate HGCD-BT on OpenSRH, a dataset of stimulated Raman histology brain tumor images, achieving a +28% improvement in accuracy over state-of-the-art GCD methods for patch-level classification, particularly in identifying previously unseen tumor categories. Furthermore, we demonstrate the generalizability of HGCD-BT on slide-level classification of hematoxylin and eosin stained whole-slide images from the Digital Brain Tumor Atlas, confirming its utility across imaging modalities.

[60] AdaRD-key: Adaptive Relevance-Diversity Keyframe Sampling for Long-form Video understanding

Xian Zhang,Zexi Wu,Zinuo Li,Hongming Xu,Luqi Gong,Farid Boussaid,Naoufel Werghi,Mohammed Bennamoun

Main category: cs.CV

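TL;DR: AdaRD-Key is a training-free keyframe sampling method that combines query relevance with visual diversity for long-form video understanding, improving performance.

Details Motivation: Most existing long-form video approaches rely on uniform sampling or on keyframe selection with rigid temporal spacing, which can miss important moments, while other methods emphasize visual diversity but neglect query relevance. Balancing query relevance and visual diversity is the key challenge.

Contribution: 1. The AdaRD-Key module, which selects keyframes with a unified Relevance-Diversity Max-Volume (RD-MV) objective; 2. a lightweight relevance-aware gating mechanism that adapts the sampling strategy; 3. an efficient, plug-and-play method that needs no additional training.

Method: AdaRD-Key combines a query-conditioned relevance score with a log-determinant diversity component and maximizes the RD-MV objective. When alignment between the query and the video is weak, it switches to a diversity-only mode.

Result: Achieves SOTA performance on LongVideoBench and Video-MME, particularly on long-form videos.

Insight: Balancing query relevance and visual diversity is essential for long-form video understanding, and a lightweight gating mechanism handles weakly aligned queries effectively.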

Abstract: Understanding long-form videos remains a significant challenge for vision–language models (VLMs) due to their extensive temporal length and high information density. Most current multimodal large language models (MLLMs) rely on uniform sampling, which often overlooks critical moments, leading to incorrect responses to queries. In parallel, many keyframe selection approaches impose rigid temporal spacing: once a frame is chosen, an exclusion window suppresses adjacent timestamps to reduce redundancy. While effective at limiting overlap, this strategy frequently misses short, fine-grained cues near important events. Other methods instead emphasize visual diversity but neglect query relevance. We propose AdaRD-Key, a training-free keyframe sampling module for query-driven long-form video understanding. AdaRD-Key maximizes a unified Relevance–Diversity Max-Volume (RD-MV) objective, combining a query-conditioned relevance score with a log-determinant diversity component to yield informative yet non-redundant frames. To handle broad queries with weak alignment to the video, AdaRD-Key employs a lightweight relevance-aware gating mechanism; when the relevance distribution indicates weak alignment, the method seamlessly shifts into a diversity-only mode, enhancing coverage without additional supervision. Our pipeline is training-free, computationally efficient (running in real time on a single GPU), and compatible with existing VLMs in a plug-and-play manner. Extensive experiments on LongVideoBench and Video-MME demonstrate state-of-the-art performance, particularly on long-form videos. Code available at https://github.com/Xian867/AdaRD-Key.
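
A relevance-plus-log-determinant objective of the kind the abstract describes can be illustrated with a greedy selector. The exact RD-MV objective and gating rule are not given in the abstract, so the sketch below assumes a simple form, (sum of relevance) + trade_off * log det(Gram matrix), a greedy maximizer, and a gate based on the spread of relevance scores; all names and constants are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def log_det(matrix):
    """Log-determinant of a small symmetric positive-definite matrix."""
    m = [row[:] for row in matrix]
    total = 0.0
    for i in range(len(m)):
        pivot = m[i][i]
        if pivot <= 1e-12:
            return float("-inf")
        total += math.log(pivot)
        for r in range(i + 1, len(m)):
            factor = m[r][i] / pivot
            for c in range(i, len(m)):
                m[r][c] -= factor * m[i][c]
    return total

def gram(features, indices, ridge=1e-3):
    """Gram matrix of the chosen frames; the ridge keeps it invertible."""
    return [[dot(features[i], features[j]) + (ridge if i == j else 0.0)
             for j in indices] for i in indices]

def select_keyframes(features, relevance, budget, trade_off=1.0, gate=0.1):
    """Greedy selection; falls back to diversity-only when relevance is flat."""
    use_relevance = (max(relevance) - min(relevance)) >= gate  # assumed gate
    chosen = []
    for _ in range(budget):
        best, best_score = None, float("-inf")
        for i in range(len(features)):
            if i in chosen:
                continue
            trial = chosen + [i]
            score = trade_off * log_det(gram(features, trial))
            if use_relevance:
                score += sum(relevance[j] for j in trial)
            if score > best_score:
                best, best_score = i, score
        chosen.append(best)
    return sorted(chosen)

# Frames 0 and 1 are near-duplicates; frame 2 is visually different.
features = [[1.0, 0.0], [0.99, 0.14], [0.0, 1.0], [0.7, 0.7]]
relevance = [0.9, 0.85, 0.6, 0.2]
picked = select_keyframes(features, relevance, budget=2)
```

Although frame 1 has the second-highest relevance, the log-determinant term penalizes it for being nearly identical to frame 0, so the selector prefers the less relevant but more distinct frame 2.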

[61] Reasoning Riddles: How Explainability Reveals Cognitive Limits in Vision-Language Models

Prahitha Movva

Main category: cs.CV

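TL;DR: Through an explainability analysis, the paper exposes the cognitive limits of Vision-Language Models (VLMs) on complex lateral thinking puzzles such as rebus puzzles. It contributes a systematically annotated dataset and an evaluation framework, and shows how different prompting strategies affect reasoning quality and effectiveness.

Details Motivation: VLMs excel at many multimodal tasks, but their cognitive processes and failure patterns on complex lateral thinking challenges such as rebus puzzles remain unclear. The paper fills this gap by using explainability analysis to study how VLMs reason.

Contribution: 1) A systematically annotated dataset of 221 rebus puzzles across six cognitive categories; 2) an evaluation framework that separates reasoning quality from answer correctness; 3) an analysis of how three prompting strategies affect model reasoning.

Method: 1) Build a systematically annotated, multi-category dataset; 2) design an evaluation framework that distinguishes reasoning quality from answer correctness; 3) study three different prompting strategies.

Result: Reasoning quality varies dramatically across puzzle categories: models are systematically strong at visual composition but show fundamental limitations in absence interpretation and cultural symbolism. Prompting strategy substantially influences both cognitive approach and problem-solving effectiveness.

Insight: Explainability is an integral component of model performance rather than a post-hoc consideration, and prompt design directly shapes a model's cognitive process and problem-solving ability.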

Abstract: Vision-Language Models (VLMs) excel at many multimodal tasks, yet their cognitive processes remain opaque on complex lateral thinking challenges like rebus puzzles. While recent work has demonstrated these models struggle significantly with rebus puzzle solving, the underlying reasoning processes and failure patterns remain largely unexplored. We address this gap through a comprehensive explainability analysis that moves beyond performance metrics to understand how VLMs approach these complex lateral thinking challenges. Our study contributes a systematically annotated dataset of 221 rebus puzzles across six cognitive categories, paired with an evaluation framework that separates reasoning quality from answer correctness. We investigate three prompting strategies designed to elicit different types of explanatory processes and reveal critical insights into VLM cognitive processes. Our findings demonstrate that reasoning quality varies dramatically across puzzle categories, with models showing systematic strengths in visual composition while exhibiting fundamental limitations in absence interpretation and cultural symbolism. We also discover that prompting strategy substantially influences both cognitive approach and problem-solving effectiveness, establishing explainability as an integral component of model performance rather than a post-hoc consideration.

[62] OTR: Synthesizing Overlay Text Dataset for Text Removal

Jan Zdenek,Wataru Shimoda,Kota Yamaguchi

Main category: cs.CV

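TL;DR: Introduces OTR, a new dataset for text removal that addresses problems in existing benchmarks: ground truth artifacts from manual editing, overly simple backgrounds, and evaluation metrics that do not capture result quality.

Details Motivation: Existing text removal datasets such as SCUT-EnsText have ground truth quality problems and cover only scene text removal, which limits model generalization and evaluation accuracy.

Contribution: A synthetic dataset, OTR, that supports complex backgrounds and text removal in domains beyond scene text, with object-aware placement and vision-language model-generated content ensuring clean ground truth.

Method: Synthesizes images with text rendered over complex backgrounds, using object-aware placement and vision-language model-generated content, which avoids the ground truth artifacts caused by manual editing.

Result: The OTR dataset provides high-quality synthetic text removal scenarios that support more accurate evaluation and out-of-domain generalization.

Insight: Synthetic datasets can overcome the limitations of real datasets, especially in ground truth quality and task diversity.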

Abstract: Text removal is a crucial task in computer vision with applications such as privacy preservation, image editing, and media reuse. While existing research has primarily focused on scene text removal in natural images, limitations in current datasets hinder out-of-domain generalization or accurate evaluation. In particular, widely used benchmarks such as SCUT-EnsText suffer from ground truth artifacts due to manual editing, overly simplistic text backgrounds, and evaluation metrics that do not capture the quality of generated results. To address these issues, we introduce an approach to synthesizing a text removal benchmark applicable to domains other than scene texts. Our dataset features text rendered on complex backgrounds using object-aware placement and vision-language model-generated content, ensuring clean ground truth and challenging text removal scenarios. The dataset is available at https://huggingface.co/datasets/cyberagent/OTR .

[63] Align Your Query: Representation Alignment for Multimodality Medical Object Detection

Ara Seo,Bryan Sangwoo Kim,Hyungjin Chung,Jong Chul Ye

Main category: cs.CV

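TL;DR: The paper proposes a simple, detector-agnostic framework that uses representation alignment to handle heterogeneity in multimodality medical object detection. It introduces modality tokens and Multimodality Context Attention (MoCA), together with a QueryREPA pretraining stage, to obtain modality-aware query representations and improve detection performance.

Details Motivation: When a medical object detector is trained on mixed modalities (e.g., CXR, CT, MRI), heterogeneous statistics and disjoint representation spaces hurt performance. A method is needed to align feature representations across modalities.

Contribution: 1. Compact modality tokens that encode the imaging modality; 2. Multimodality Context Attention (MoCA), which injects modality context; 3. the QueryREPA pretraining stage, which aligns query representations.

Method: 1. Encode the modality with text-derived modality tokens; 2. propagate modality context within the query set through MoCA; 3. pretrain with QueryREPA, which aligns query representations to their modality tokens using a contrastive objective.

Result: The method consistently improves AP (average precision) across modalities trained together, with no architectural modifications and negligible added latency.

Insight: Representation alignment is an effective approach to multimodality medical object detection; lightweight modality embeddings and context propagation are the key ingredients.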

Abstract: Medical object detection suffers when a single detector is trained on mixed medical modalities (e.g., CXR, CT, MRI) due to heterogeneous statistics and disjoint representation spaces. To address this challenge, we turn to representation alignment, an approach that has proven effective for bringing features from different sources into a shared space. Specifically, we target the representations of DETR-style object queries and propose a simple, detector-agnostic framework to align them with modality context. First, we define modality tokens: compact, text-derived embeddings encoding imaging modality that are lightweight and require no extra annotations. We integrate the modality tokens into the detection process via Multimodality Context Attention (MoCA), mixing object-query representations via self-attention to propagate modality context within the query set. This preserves DETR-style architectures and adds negligible latency while injecting modality cues into object queries. We further introduce QueryREPA, a short pretraining stage that aligns query representations to their modality tokens using a task-specific contrastive objective with modality-balanced batches. Together, MoCA and QueryREPA produce modality-aware, class-faithful queries that transfer effectively to downstream training. Across diverse modalities trained altogether, the proposed approach consistently improves AP with minimal overhead and no architectural modifications, offering a practical path toward robust multimodality medical object detection. Project page: https://araseo.github.io/alignyourquery/.

[64] MaskCD: Mitigating LVLM Hallucinations by Image Head Masked Contrastive Decoding

Jingyuan Deng,Yujiu Yang

Main category: cs.CV

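TL;DR: MaskCD is a contrastive decoding method that builds contrastive samples by masking image heads, mitigating hallucinations in large vision-language models (LVLMs) while retaining their general capabilities.

Details Motivation: LVLMs perform well on multimodal tasks but hallucinate, generating output that contradicts the input. Existing remedies have drawbacks: contrastive decoding struggles to construct appropriate contrastive samples, and attention manipulation is highly sensitive and unstable.

Contribution: Proposes MaskCD, which masks the image heads in an LVLM to construct contrastive samples, improving on conventional contrastive decoding and mitigating hallucinations more effectively.

Method: Identifies the "image heads" in an LVLM and masks them to produce contrastive samples for contrastive decoding, sidestepping the sample-construction difficulty of conventional contrastive decoding.

Result: On LLaVA-1.5-7b and Qwen-VL-7b, MaskCD effectively alleviates hallucinations on benchmarks including CHAIR, POPE, AMBER, and MME while retaining the models' general capabilities.

Insight: Masking image heads is a stable and efficient way to construct contrastive samples, offering a new angle on LVLM hallucination.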

Abstract: Large vision-language models (LVLMs) have shown remarkable performance in visual-language understanding for downstream multimodal tasks. While their capabilities are improving, problems emerge simultaneously. Among these problems, hallucinations, the phenomenon where LVLMs generate content that contradicts their input visual and text contents, have attracted much attention. Many approaches have been proposed to deal with this issue, such as contrastive decoding and attention manipulation. However, contrastive decoding methods struggle in constructing appropriate contrastive samples, and attention manipulation methods are highly sensitive, lacking stability. In this work, we propose image head Masked Contrastive Decoding (MaskCD). Our approach utilizes the “image heads” in LVLMs, masking them to construct contrastive samples for contrastive decoding. We evaluated MaskCD on LLaVA-1.5-7b and Qwen-VL-7b, using various benchmarks such as CHAIR, POPE, AMBER and MME. The results demonstrate that MaskCD effectively alleviates the phenomenon of hallucinations and retains the general capabilities of LVLMs. Corresponding resources can be found at: https://github.com/Deng-Jingyuan/MaskCD .
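
The contrastive decoding step can be sketched on toy logits. The combination rule below, (1 + alpha) * full - alpha * masked, is the form commonly used in the contrastive decoding literature and is assumed here; the abstract does not state MaskCD's exact formula, and the vocabulary, logits, and alpha value are invented for illustration. In the method itself the "masked" logits would come from a forward pass with the image heads masked.

```python
import math

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def contrastive_logits(full_logits, masked_logits, alpha=1.0):
    """Amplify what the full model predicts beyond the masked (contrast) pass."""
    return [(1.0 + alpha) * f - alpha * m
            for f, m in zip(full_logits, masked_logits)]

vocab = ["dog", "cat", "frisbee"]
full_logits = [1.8, 0.5, 2.0]    # normal forward pass (toy values)
masked_logits = [0.6, 0.4, 1.9]  # pass with image heads masked (toy values)

adjusted = contrastive_logits(full_logits, masked_logits)
probs = softmax(adjusted)
greedy_before = vocab[max(range(len(vocab)), key=lambda i: full_logits[i])]
greedy_after = vocab[max(range(len(vocab)), key=lambda i: adjusted[i])]
```

In this toy case "frisbee" scores highly even when visual information is suppressed, so it is treated as a language-prior guess and down-weighted, and the greedy token changes from "frisbee" to "dog".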

[65] VERNIER: an open-source software pushing marker pose estimation down to the micrometer and nanometer scales

Patrick Sandoz,Antoine N. André,Guillaume J. Laurent

Main category: cs.CV

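TL;DR: The paper presents VERNIER, an open-source phase processing software that enables high-precision 6-degree-of-freedom pose estimation at the micrometer and nanometer scales.

Details Motivation: High-precision pose estimation at small scales remains a challenge. Existing methods struggle to reach nanometer and microradian resolutions over relatively large ranges. VERNIER is proposed to address this.

Contribution: The main contribution is the VERNIER software, which applies phase processing to pseudo-periodic patterns to deliver fast, reliable pose measurement that is robust to noise, defocus, and occlusion.

Method: The method includes a phase-based local thresholding algorithm and several pattern designs for different application needs. The paper walks through the successive steps of the phase processing and illustrates the implementation with synthetic and experimental images.

Result: VERNIER shows high robustness and precision and can meet the needs of various microscopy applications.

Insight: The paper also gives guidelines for selecting the appropriate pattern design and microscope magnification lenses for the desired performance.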

Abstract: Pose estimation is still a challenge at small scales. Few solutions exist to capture the 6 degrees of freedom of an object with nanometer and microradian resolutions over relatively large ranges. Over the years, we have proposed several fiducial marker and pattern designs to achieve reliable performance for various microscopy applications. Centimeter ranges are possible using pattern encoding methods, while nanometer resolutions can be achieved using phase processing of the periodic frames. This paper presents VERNIER, an open-source phase processing software designed to provide fast and reliable pose measurement based on pseudo-periodic patterns. Thanks to a phase-based local thresholding algorithm, the software has proven to be particularly robust to noise, defocus and occlusion. The successive steps of the phase processing are presented, as well as the different types of patterns that address different application needs. The implementation procedure is illustrated with synthetic and experimental images. Finally, guidelines are given for selecting the appropriate pattern design and microscope magnification lenses as a function of the desired performance.
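
Why phase processing of a periodic pattern gives sub-pixel resolution can be shown with a one-dimensional toy example. This is a generic illustration of phase-based displacement measurement and does not use or reproduce the VERNIER API; the function names, the ideal noise-free sinusoidal pattern, and the requirement that the signal span a whole number of periods are assumptions of the sketch.

```python
import cmath
import math

def make_pattern(num_pixels, period, shift):
    """Synthetic 1D periodic intensity profile shifted by `shift` pixels."""
    return [1.0 + math.cos(2.0 * math.pi * (x - shift) / period)
            for x in range(num_pixels)]

def estimate_shift(signal, period):
    """Recover the shift from the phase at the pattern's spatial frequency.

    Assumes len(signal) is a whole number of periods and |shift| < period / 2,
    since the phase alone is ambiguous modulo one period.
    """
    projection = sum(value * cmath.exp(-2j * math.pi * x / period)
                     for x, value in enumerate(signal))
    return -cmath.phase(projection) * period / (2.0 * math.pi)

period = 8.0  # pattern period in pixels
signal = make_pattern(num_pixels=64, period=period, shift=0.37)
estimated = estimate_shift(signal, period)
```

The estimate resolves a shift of a fraction of a pixel because the phase integrates information from every period in the image. The ambiguity modulo one period is what the abstract's pattern encoding methods address when extending the measurement range.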

[66] Med-K2N: Flexible K-to-N Modality Translation for Medical Image Synthesis

Feng Yuan,Yifan Gao,Yuehua Ye,Haoyue Li,Xin Gao

Main category: cs.CV

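TL;DR: Med-K2N is a flexible K-to-N modality translation method for medical image synthesis that addresses three challenges: modeling the contributions of different modalities, controlling fusion quality, and keeping modality identity consistent.

Details Motivation: Clinical needs call for flexible multi-modality reconstruction, which requires handling uneven contributions across modalities, fusion of noisy information, and modality identity consistency.

Contribution: 1. A K-to-N modality translation framework; 2. three modules (PreWeightNet, ThresholdNet, EffiWeightNet) for adaptive weight learning; 3. the Causal Modality Identity Module (CMIM), which maintains modality identity consistency.

Method: 1. Treat multi-modal data as sequential frames with a quality-driven selection mechanism; 2. use PreWeightNet for global contribution assessment, ThresholdNet for adaptive filtering, and EffiWeightNet for effective weight computation; 3. CMIM establishes causal constraints between generated images and target modality descriptions using vision-language modeling.

Result: Outperforms state-of-the-art methods by significant margins on multiple benchmarks.

Insight: 1. Sequential frames with quality-driven selection are an effective way to handle multi-modal fusion; 2. vision-language modeling strengthens modality identity consistency; 3. adaptive weight learning is key.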

Abstract: Cross-modal medical image synthesis research focuses on reconstructing missing imaging modalities from available ones to support clinical diagnosis. Driven by clinical necessities for flexible modality reconstruction, we explore K-to-N medical generation, where three critical challenges emerge: How can we model the heterogeneous contributions of different modalities to various target tasks? How can we ensure fusion quality control to prevent degradation from noisy information? How can we maintain modality identity consistency in multi-output generation? Drawing inspiration from SAM2’s sequential frame paradigm and clinicians’ progressive workflow of incrementally adding and selectively integrating multi-modal information, we treat multi-modal medical data as sequential frames with quality-driven selection mechanisms. Our key idea is to “learn” adaptive weights for each modality-task pair and “memorize” beneficial fusion patterns through progressive enhancement. To achieve this, we design three collaborative modules: PreWeightNet for global contribution assessment, ThresholdNet for adaptive filtering, and EffiWeightNet for effective weight computation. Meanwhile, to maintain modality identity consistency, we propose the Causal Modality Identity Module (CMIM) that establishes causal constraints between generated images and target modality descriptions using vision-language modeling. Extensive experimental results demonstrate that our proposed Med-K2N outperforms state-of-the-art methods by significant margins on multiple benchmarks. Source code is available.

[67] ELMF4EggQ: Ensemble Learning with Multimodal Feature Fusion for Non-Destructive Egg Quality Assessment

Md Zahim Hassan,Md. Osama,Muhammad Ashad Kabir,Md. Saiful Islam,Zannatul Naim

Main category: cs.CV

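TL;DR: ELMF4EggQ is an ensemble learning framework that uses multimodal feature fusion for non-destructive assessment of egg grade and freshness. Combining external attributes (image, shape, and weight) significantly improves classification accuracy.

Details Motivation: Conventional egg quality assessment requires destructive testing, which is costly and inefficient. Non-destructive methods matter for food safety and production efficiency.

Contribution: 1. The first framework to assess internal egg quality from external features only; 2. the first publicly released labeled dataset for this task; 3. multimodal fusion that significantly improves performance.

Method: Extracts image features with pre-trained CNNs (ResNet152, DenseNet169, and ResNet152V2), applies PCA-based dimensionality reduction and SMOTE augmentation, and combines classifier predictions with an ensemble voting mechanism.

Result: The multimodal ensemble reaches 86.57% accuracy in grade classification and 70.83% in freshness prediction, outperforming single-modality baselines.

Insight: Multimodal feature fusion has strong potential for non-destructive quality assessment, and a public dataset can help advance the field.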

Abstract: Accurate, non-destructive assessment of egg quality is critical for ensuring food safety, maintaining product standards, and operational efficiency in commercial poultry production. This paper introduces ELMF4EggQ, an ensemble learning framework that employs multimodal feature fusion to classify egg grade and freshness using only external attributes - image, shape, and weight. A novel, publicly available dataset of 186 brown-shelled eggs was constructed, with egg grade and freshness levels determined through laboratory-based expert assessments involving internal quality measurements, such as yolk index and Haugh unit. To the best of our knowledge, this is the first study to apply machine learning methods for internal egg quality assessment using only external, non-invasive features, and the first to release a corresponding labeled dataset. The proposed framework integrates deep features extracted from external egg images with structural characteristics such as egg shape and weight, enabling a comprehensive representation of each egg. Image feature extraction is performed using top-performing pre-trained CNN models (ResNet152, DenseNet169, and ResNet152V2), followed by PCA-based dimensionality reduction, SMOTE augmentation, and classification using multiple machine learning algorithms. An ensemble voting mechanism combines predictions from the best-performing classifiers to enhance overall accuracy. Experimental results demonstrate that the multimodal approach significantly outperforms image-only and tabular (shape and weight) only baselines, with the multimodal ensemble approach achieving 86.57% accuracy in grade classification and 70.83% in freshness prediction. All code and data are publicly available at https://github.com/Kenshin-Keeps/Egg_Quality_Prediction_ELMF4EggQ, promoting transparency, reproducibility, and further research in this domain.

[68] One Patch to Caption Them All: A Unified Zero-Shot Captioning Framework

Lorenzo Bianchi,Giacomo Pacini,Fabio Carrara,Nicola Messina,Giuseppe Amato,Fabrizio Falchi

Main category: cs.CV

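TL;DR: The paper proposes Patch-ioner, a unified zero-shot image captioning framework that shifts from global image representations to local patch representations, enabling captioning of arbitrary regions without region-level supervision.

Details Motivation: Existing zero-shot captioning methods are limited to global image representations and whole-image captions, so they cannot flexibly describe arbitrary regions of an image. Patch-ioner aims to remove this limitation and provide patch-level captioning without additional supervision.

Contribution: 1. A patch-centric zero-shot captioning framework that supports flexible captioning from single patches to non-contiguous regions; 2. An analysis of the key ingredients of existing latent captioners, showing that dense visual features (e.g., DINO) are crucial; 3. State-of-the-art performance on multiple tasks.

Method: Images are split into patches that serve as atomic captioning units, and patch features are aggregated to describe arbitrary regions. Dense visual features (e.g., DINO) provide patch-level semantic representations, which enables flexible captioning.

Result: Patch-ioner outperforms existing methods on zero-shot dense captioning, region-set captioning, and a newly introduced trace captioning task, validating the effectiveness of patch-level representations.

Insight: Dense visual features are key to zero-shot captioning, and patch-level representations markedly improve the flexibility and scalability of captioning.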

Abstract: Zero-shot captioners are recently proposed models that utilize common-space vision-language representations to caption images without relying on paired image-text data. To caption an image, they proceed by textually decoding a text-aligned image feature, but they limit their scope to global representations and whole-image captions. We present Patch-ioner, a unified framework for zero-shot captioning that shifts from an image-centric to a patch-centric paradigm, enabling the captioning of arbitrary regions without the need of region-level supervision. Instead of relying on global image representations, we treat individual patches as atomic captioning units and aggregate them to describe arbitrary regions, from single patches to non-contiguous areas and entire images. We analyze the key ingredients that enable current latent captioners to work in our novel proposed framework. Experiments demonstrate that backbones producing meaningful, dense visual features, such as DINO, are key to achieving state-of-the-art performance in multiple region-based captioning tasks. Compared to other baselines and state-of-the-art competitors, our models achieve better performance on zero-shot dense, region-set, and a newly introduced trace captioning task, highlighting the effectiveness of patch-wise semantic representations for scalable caption generation. Project page at https://paciosoft.com/Patch-ioner/ .
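
The idea of aggregating patches into a region representation can be sketched as follows. This is an illustration only: it assumes mean pooling as the aggregation rule, which the abstract does not specify, and `aggregate_patches` and the toy feature vectors are hypothetical.

```python
def aggregate_patches(patch_features, selected):
    """Mean-pool the feature vectors of the selected patch indices."""
    if not selected:
        raise ValueError("select at least one patch")
    dim = len(patch_features[0])
    pooled = [0.0] * dim
    for idx in selected:
        for d in range(dim):
            pooled[d] += patch_features[idx][d]
    return [value / len(selected) for value in pooled]


# Toy 2-dimensional features for three patches (hypothetical values).
features = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0]]

# A non-contiguous region made of patches 0 and 2.
region_feature = aggregate_patches(features, [0, 2])
print(region_feature)
```

Because the region is defined only by a set of patch indices, the same function covers a single patch, a non-contiguous region, or the whole image; the pooled vector would then be passed to a text decoder.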

[69] Training-Free Out-Of-Distribution Segmentation With Foundation Models

Laith Nayal,Hadi Salloum,Ahmad Taha,Yaroslav Kholodov,Alexander Gasnikov

Main category: cs.CV

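TL;DR: The paper proposes a training-free OoD segmentation method that combines features from a foundation model (InternImage) with simple clustering, reaching 50.02 AP on RoadAnomaly and 48.77 AP on ADE-OoD.

Details Motivation: Detecting unknown objects in semantic segmentation is essential for safety-critical applications such as autonomous driving. Foundation models perform well on closed-set tasks, but their ability to detect OoD regions remains underexplored.

Contribution: A simple, training-free OoD segmentation method that uses only foundation-model features together with K-Means clustering and confidence thresholding, and that surpasses several supervised and unsupervised baselines.

Method: Features from the InternImage backbone are clustered with K-Means and combined with confidence thresholding to identify OoD regions without any additional training.

Result: The method reaches 50.02 AP on the RoadAnomaly benchmark and 48.77 AP on the ADE-OoD benchmark, outperforming several baselines.

Insight: Foundation-model features generalize well enough to support OoD detection without extra supervision, which points to their potential in open-world tasks.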

Abstract: Detecting unknown objects in semantic segmentation is crucial for safety-critical applications such as autonomous driving. Large vision foundation models, including DINOv2, InternImage, and CLIP, have advanced visual representation learning by providing rich features that generalize well across diverse tasks. While their strength in closed-set semantic tasks is established, their capability to detect out-of-distribution (OoD) regions in semantic segmentation remains underexplored. In this work, we investigate whether foundation models fine-tuned on segmentation datasets can inherently distinguish in-distribution (ID) from OoD regions without any outlier supervision. We propose a simple, training-free approach that utilizes features from the InternImage backbone and applies K-Means clustering alongside confidence thresholding on raw decoder logits to identify OoD clusters. Our method achieves 50.02 Average Precision on the RoadAnomaly benchmark and 48.77 on the benchmark of ADE-OoD with InternImage-L, surpassing several supervised and unsupervised baselines. These results suggest a promising direction for generic OoD segmentation methods that require minimal assumptions or additional data.
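
A rough sketch of how clustering and confidence thresholding might be combined is given below. It assumes that cluster assignments (for example from K-Means over backbone features) are already available, that the maximum raw logit serves as the per-pixel confidence, and that a cluster is flagged when its mean confidence falls below a threshold. These choices, along with `flag_ood_clusters` and the toy values, are assumptions and not details confirmed by the abstract.

```python
def flag_ood_clusters(pixel_logits, cluster_ids, threshold):
    """Return ids of clusters whose mean max-logit confidence is below threshold."""
    sums, counts = {}, {}
    for logits, cid in zip(pixel_logits, cluster_ids):
        confidence = max(logits)  # assumed confidence: largest raw decoder logit
        sums[cid] = sums.get(cid, 0.0) + confidence
        counts[cid] = counts.get(cid, 0) + 1
    return {cid for cid in sums if sums[cid] / counts[cid] < threshold}


# Four pixels with two-class logits; the first two belong to cluster 0.
pixel_logits = [[5.0, 1.0], [4.0, 0.5], [0.2, 0.1], [0.3, 0.4]]
cluster_ids = [0, 0, 1, 1]

print(flag_ood_clusters(pixel_logits, cluster_ids, threshold=1.0))
```

Scoring whole clusters rather than single pixels smooths out isolated low-confidence pixels inside otherwise well-recognized regions.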

[70] Don’t Just Chase “Highlighted Tokens” in MLLMs: Revisiting Visual Holistic Context Retention

Xin Zou,Di Lu,Yizhou Wang,Yibo Yan,Yuanhuiyi Lyu,Xu Zheng,Linfeng Zhang,Xuming Hu

Main category: cs.CV

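TL;DR: The paper proposes HoloV, a visual token pruning framework that preserves visual context from a holistic perspective, addressing the performance drop of existing attention-first pruning methods at high pruning ratios.

Details Motivation: Current multimodal large language models (MLLMs) rely on a large number of visual tokens, which leads to heavy computational overhead. Existing pruning methods, such as attention-first pruning, tend to keep semantically similar tokens at high pruning ratios, causing a pronounced performance drop.

Contribution: The HoloV framework, which adaptively allocates the pruning budget from a holistic perspective so that the retained tokens capture the global visual context and task-relevant information is preserved at high pruning ratios.

Method: HoloV rethinks the token retention strategy by adaptively distributing the pruning budget across different spatial crops, which avoids keeping only isolated salient features.

Result: Experiments show that HoloV outperforms existing methods across tasks, MLLM architectures, and pruning ratios. For example, LLaVA1.5 with HoloV retains 95.8% of its original performance after pruning 88.9% of visual tokens.

Insight: A holistic token pruning strategy can avoid representational collapse and maintain both accuracy and efficiency at high pruning ratios.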

Abstract: Despite their powerful capabilities, Multimodal Large Language Models (MLLMs) suffer from considerable computational overhead due to their reliance on massive visual tokens. Recent studies have explored token pruning to alleviate this problem, which typically uses text-vision cross-attention or [CLS] attention to assess and discard redundant visual tokens. In this work, we identify a critical limitation of such attention-first pruning approaches, i.e., they tend to preserve semantically similar tokens, resulting in pronounced performance drops under high pruning ratios. To this end, we propose HoloV, a simple yet effective, plug-and-play visual token pruning framework for efficient inference. Distinct from previous attention-first schemes, HoloV rethinks token retention from a holistic perspective. By adaptively distributing the pruning budget across different spatial crops, HoloV ensures that the retained tokens capture the global visual context rather than isolated salient features. This strategy minimizes representational collapse and maintains task-relevant information even under aggressive pruning. Experimental results demonstrate that our HoloV achieves superior performance across various tasks, MLLM architectures, and pruning ratios compared to SOTA methods. For instance, LLaVA1.5 equipped with HoloV preserves 95.8% of the original performance after pruning 88.9% of visual tokens, achieving superior efficiency-accuracy trade-offs.

[71] Zero-Shot Robustness of Vision Language Models Via Confidence-Aware Weighting

Nikoo Naghavian,Mostafa Tavassolipour

Main category: cs.CV

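TL;DR: The paper proposes Confidence-Aware Weighting (CAW), which improves the zero-shot robustness of vision-language models through a confidence-aware loss and a feature alignment regularization, outperforming existing methods while using less memory.

Details Motivation: Vision-language models such as CLIP show strong zero-shot generalization but remain highly vulnerable to adversarial attacks. CAW aims to address this weakness and improve robustness.

Contribution: 1. The CAW method, consisting of a confidence-aware loss and a feature alignment regularization; 2. Improved clean and adversarial accuracy without sacrificing generalization; 3. Superior results across multiple datasets and attack settings.

Method: 1. Confidence-aware loss: scales the KL divergence between clean and adversarial predictions to prioritize uncertain adversarial examples; 2. Feature alignment regularization: minimizes the distance between features of the frozen and fine-tuned image encoders to preserve semantic consistency.

Result: On TinyImageNet and 14 additional datasets, CAW outperforms methods such as PMG-AFT and TGA-ZSR under strong attacks like AutoAttack, with lower memory requirements.

Insight: Combining confidence awareness with feature alignment can improve the robustness of vision-language models while preserving zero-shot generalization.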

Abstract: Vision-language models like CLIP demonstrate impressive zero-shot generalization but remain highly vulnerable to adversarial attacks. In this work, we propose Confidence-Aware Weighting (CAW) to enhance zero-shot robustness in vision-language models. CAW consists of two components: (1) a Confidence-Aware loss that prioritizes uncertain adversarial examples by scaling the KL divergence between clean and adversarial predictions, and (2) a feature alignment regularization that preserves semantic consistency by minimizing the distance between frozen and fine-tuned image encoder features on adversarial inputs. These components work jointly to improve both clean and robust accuracy without sacrificing generalization. Extensive experiments on TinyImageNet and 14 additional datasets show that CAW outperforms recent methods such as PMG-AFT and TGA-ZSR under strong attacks like AutoAttack, while using less memory.
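
The first CAW component can be illustrated with a small sketch. The abstract only states that the KL divergence between clean and adversarial predictions is scaled to prioritize uncertain adversarial examples; the specific weight used here (one minus the top adversarial probability), the KL direction, and the function names are assumptions made for illustration.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as equal-length lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def confidence_aware_kl(clean_probs, adv_probs):
    """Scale KL(clean || adv) by the uncertainty of the adversarial prediction."""
    # Hypothetical weighting: a low top probability means high uncertainty,
    # which increases the weight of this example.
    weight = 1.0 - max(adv_probs)
    return weight * kl_divergence(clean_probs, adv_probs)


clean = [0.9, 0.1]
uncertain_adv = [0.5, 0.5]
print(confidence_aware_kl(clean, uncertain_adv))
print(confidence_aware_kl(clean, clean))
```

When the clean and adversarial predictions coincide the loss is zero, and for a fixed divergence the contribution grows as the adversarial prediction becomes less confident.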

[72] Multimodal Carotid Risk Stratification with Large Vision-Language Models: Benchmarking, Fine-Tuning, and Clinical Insights

Daphne Tsolissou,Theofanis Ganitidis,Konstantinos Mitsis,Stergios Christodoulidis,Maria Vakalopoulou,Konstantina Nikita

Main category: cs.CV

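TL;DR: The paper examines large vision-language models (LVLMs) for multimodal risk assessment of carotid atheromatous disease. Integrating ultrasound imaging with clinical, demographic, and biomarker data, it finds that general-purpose LVLMs perform poorly at risk classification, and that fine-tuning LLaVa-NeXT-Vicuna with low-rank adaptation (LoRA) substantially improves stroke risk stratification.

Details Motivation: Risk assessment for carotid atheromatous disease requires integrating diverse clinical and imaging information, yet current approaches fall short in transparency and interpretability. The study explores the potential of LVLMs for multimodal risk assessment.

Contribution: 1. A multimodal evaluation framework that simulates realistic diagnostic scenarios; 2. A comparison of several open-source LVLMs on carotid risk assessment; 3. A LoRA-adapted LLaVa-NeXT-Vicuna model with substantially improved performance; 4. Evidence that multimodal data integration improves specificity and balanced accuracy.

Method: 1. Evaluates open-source LVLMs, including general-purpose and medically tuned models; 2. Adapts LLaVa-NeXT-Vicuna to the ultrasound domain with LoRA; 3. Designs interview-style question sequences that simulate diagnostic scenarios; 4. Integrates ultrasound imaging with structured multimodal data.

Result: 1. In zero-shot experiments, general-purpose LVLMs perform poorly at risk classification; 2. The fine-tuned LLaVa-NeXT-Vicuna substantially improves risk stratification; 3. Multimodal data integration improves the clinical applicability of the model.

Insight: LVLMs show promise for multimodal medical image analysis, but clinical use requires combining them with domain adaptation and multimodal data integration.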

Abstract: Reliable risk assessment for carotid atheromatous disease remains a major clinical challenge, as it requires integrating diverse clinical and imaging information in a manner that is transparent and interpretable to clinicians. This study investigates the potential of state-of-the-art and recent large vision-language models (LVLMs) for multimodal carotid plaque assessment by integrating ultrasound imaging (USI) with structured clinical, demographic, laboratory, and protein biomarker data. A framework that simulates realistic diagnostic scenarios through interview-style question sequences is proposed, comparing a range of open-source LVLMs, including both general-purpose and medically tuned models. Zero-shot experiments reveal that even if they are very powerful, not all LVLMs can accurately identify imaging modality and anatomy, while all of them perform poorly in accurate risk classification. To address this limitation, LLaVa-NeXT-Vicuna is adapted to the ultrasound domain using low-rank adaptation (LoRA), resulting in substantial improvements in stroke risk stratification. The integration of multimodal tabular data in the form of text further enhances specificity and balanced accuracy, yielding competitive performance compared to prior convolutional neural network (CNN) baselines trained on the same dataset. Our findings highlight both the promise and limitations of LVLMs in ultrasound-based cardiovascular risk prediction, underscoring the importance of multimodal integration, model calibration, and domain adaptation for clinical translation.

[73] TIT-Score: Evaluating Long-Prompt Based Text-to-Image Alignment via Text-to-Image-to-Text Consistency

Juntong Wang,Huiyu Duan,Jiarui Wang,Ziheng Jia,Guangtao Zhai,Xiongkuo Min

Main category: cs.CV

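TL;DR: The paper proposes TIT-Score, a new method for evaluating text-to-image alignment on long prompts that quantifies alignment through text-to-image-to-text consistency, and shows on the new LPG-Bench benchmark that it outperforms existing metrics.

Details Motivation: Current text-to-image (T2I) models perform well on short prompts, but their outputs for long prompts lack accuracy and consistency, and existing evaluation metrics agree poorly with human preferences.

Contribution: 1. LPG-Bench, a benchmark of 200 long prompts with an average length of over 250 words. 2. TIT-Score and TIT-Score-LLM, two evaluation methods based on text-to-image-to-text consistency that clearly outperform existing metrics.

Method: Alignment is measured by comparing the original prompt with an LMM-produced description of the generated image. TIT-Score is an efficient score-based instantiation, while TIT-Score-LLM is based on an LLM.

Result: TIT-Score-LLM achieves a 7.31% absolute improvement in pairwise accuracy over the strongest baseline and agrees more closely with human judgment.

Insight: Evaluating long-prompt generation requires methods that track human preferences more closely, and text-to-image-to-text consistency offers an effective and scalable evaluation framework.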

Abstract: With the rapid advancement of large multimodal models (LMMs), recent text-to-image (T2I) models can generate high-quality images and demonstrate great alignment to short prompts. However, they still struggle to effectively understand and follow long and detailed prompts, displaying inconsistent generation. To address this challenge, we introduce LPG-Bench, a comprehensive benchmark for evaluating long-prompt-based text-to-image generation. LPG-Bench features 200 meticulously crafted prompts with an average length of over 250 words, approaching the input capacity of several leading commercial models. Using these prompts, we generate 2,600 images from 13 state-of-the-art models and further perform comprehensive human-ranked annotations. Based on LPG-Bench, we observe that state-of-the-art T2I alignment evaluation metrics exhibit poor consistency with human preferences on long-prompt-based image generation. To address the gap, we introduce a novel zero-shot metric based on text-to-image-to-text consistency, termed TIT, for evaluating long-prompt-generated images. The core concept of TIT is to quantify T2I alignment by directly comparing the consistency between the raw prompt and the LMM-produced description on the generated image, which includes an efficient score-based instantiation TIT-Score and a large-language-model (LLM) based instantiation TIT-Score-LLM. Extensive experiments demonstrate that our framework achieves superior alignment with human judgment compared to CLIP-score, LMM-score, etc., with TIT-Score-LLM attaining a 7.31% absolute improvement in pairwise accuracy over the strongest baseline. LPG-Bench and TIT methods together offer a deeper perspective to benchmark and foster the development of T2I models. All resources will be made publicly available.
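
Pairwise accuracy, the figure quoted for TIT-Score-LLM, can be computed along the following lines. This is a generic sketch: the abstract does not describe how ties or the human rankings are handled, so skipping human-tied pairs and the example scores below are assumptions.

```python
from itertools import combinations


def pairwise_accuracy(metric_scores, human_scores):
    """Fraction of image pairs that the metric orders the same way as human raters."""
    agree, total = 0, 0
    for i, j in combinations(range(len(metric_scores)), 2):
        human_diff = human_scores[i] - human_scores[j]
        if human_diff == 0:
            continue  # assumed convention: ignore pairs that humans rate as tied
        metric_diff = metric_scores[i] - metric_scores[j]
        total += 1
        if metric_diff * human_diff > 0:
            agree += 1
    return agree / total if total else 0.0


# Hypothetical scores for three images generated from the same long prompt.
metric = [0.9, 0.4, 0.6]
human_a = [3, 1, 2]  # the metric agrees on every pair
human_b = [3, 2, 1]  # the metric disagrees on the (1, 2) pair

print(pairwise_accuracy(metric, human_a))
print(pairwise_accuracy(metric, human_b))
```

A metric that ranks images exactly as humans do scores 1.0, and each pair ordered the wrong way lowers the value by one over the number of comparable pairs.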

[74] Towards Scalable and Consistent 3D Editing

Ruihao Xia,Yang Tang,Pan Zhou

Main category: cs.CV

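TL;DR: The paper proposes the 3DEditFormer method and the 3DEditVerse dataset to address consistency, structural fidelity, and fine-grained control in 3D editing, enabling precise edits without auxiliary 3D masks.

Details Motivation: 3D editing has wide applications in immersive content creation and AR/VR, but existing methods are slow, prone to geometric distortion, or dependent on manual 3D masks. The paper aims to address these challenges.

Contribution: 1) 3DEditVerse, the largest paired 3D editing benchmark dataset to date; 2) 3DEditFormer, a model that achieves precise mask-free editing through dual-guidance attention and time-adaptive gating.

Method: 3DEditFormer augments image-to-3D generation with dual-guidance attention and time-adaptive gating, disentangling editable regions from preserved structure to keep edits precise and consistent.

Result: Experiments show that 3DEditFormer outperforms existing methods both quantitatively and qualitatively, setting a new standard for 3D editing.

Insight: Dual-guidance attention that disentangles editable regions from preserved structure is key to effective 3D editing, and large-scale, high-quality data is essential for model performance.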

Abstract: 3D editing - the task of locally modifying the geometry or appearance of a 3D asset - has wide applications in immersive content creation, digital entertainment, and AR/VR. However, unlike 2D editing, it remains challenging due to the need for cross-view consistency, structural fidelity, and fine-grained controllability. Existing approaches are often slow, prone to geometric distortions, or dependent on manual and accurate 3D masks that are error-prone and impractical. To address these challenges, we advance both the data and model fronts. On the data side, we introduce 3DEditVerse, the largest paired 3D editing benchmark to date, comprising 116,309 high-quality training pairs and 1,500 curated test pairs. Built through complementary pipelines of pose-driven geometric edits and foundation model-guided appearance edits, 3DEditVerse ensures edit locality, multi-view consistency, and semantic alignment. On the model side, we propose 3DEditFormer, a 3D-structure-preserving conditional transformer. By enhancing image-to-3D generation with dual-guidance attention and time-adaptive gating, 3DEditFormer disentangles editable regions from preserved structure, enabling precise and consistent edits without requiring auxiliary 3D masks. Extensive experiments demonstrate that our framework outperforms state-of-the-art baselines both quantitatively and qualitatively, establishing a new standard for practical and scalable 3D editing. Dataset and code will be released. Project: https://www.lv-lab.org/3DEditFormer/

[75] Not every day is a sunny day: Synthetic cloud injection for deep land cover segmentation robustness evaluation across data sources

Sara Mobsite,Renaud Hostache,Laure Berti Equille,Emmanuel Roux,Joris Guerin

Main category: cs.CV

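TL;DR: The paper proposes a synthetic cloud injection algorithm to evaluate how cloud cover in Sentinel-2 satellite data affects deep-learning land cover segmentation, along with a lightweight method that injects Normalized Difference Indices (NDIs) to improve model performance. It also incorporates Sentinel-1 radar data to compensate for optical data under cloud cover.

Details Motivation: Most existing Sentinel-2 datasets are cloud-free, which limits their use in tropical regions. In addition, encoder downsampling in deep networks loses spatial and spectral information. The paper aims to address both problems and improve robustness under cloud cover.

Contribution: 1. A synthetic cloud injection algorithm that simulates realistic cloud cover. 2. A lightweight NDI injection method that reduces information loss from downsampling. 3. Validation of Sentinel-1 radar data under cloud-covered conditions.

Method: 1. A synthetic cloud injection algorithm simulates cloud cover. 2. NDIs are injected into the final decoding layers to retain key spatial features. 3. Sentinel-1 radar data is combined with optical data to compensate for its limitations.

Result: 1. NDI injection improves performance on the DFC2020 dataset on cloud-free imagery (U-Net +1.99%, DeepLabV3 +2.78%). 2. Under cloud cover, adding Sentinel-1 data significantly outperforms using optical data alone.

Insight: Radar-optical fusion offers a clear advantage under cloud cover, and lightweight NDI injection improves performance at low computational cost.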

Abstract: Supervised deep learning for land cover semantic segmentation (LCS) relies on labeled satellite data. However, most existing Sentinel-2 datasets are cloud-free, which limits their usefulness in tropical regions where clouds are common. To properly evaluate the extent of this problem, we developed a cloud injection algorithm that simulates realistic cloud cover, allowing us to test how Sentinel-1 radar data can fill in the gaps caused by cloud-obstructed optical imagery. We also tackle the issue of losing spatial and/or spectral details during encoder downsampling in deep networks. To mitigate this loss, we propose a lightweight method that injects Normalized Difference Indices (NDIs) into the final decoding layers, enabling the model to retain key spatial features with minimal additional computation. Injecting NDIs enhanced land cover segmentation performance on the DFC2020 dataset, yielding improvements of 1.99% for U-Net and 2.78% for DeepLabV3 on cloud-free imagery. Under cloud-covered conditions, incorporating Sentinel-1 data led to significant performance gains across all models compared to using optical data alone, highlighting the effectiveness of radar-optical fusion in challenging atmospheric scenarios.
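
A normalized difference index is computed per pixel as (a - b) / (a + b) for two spectral bands a and b. The sketch below shows this for NDVI, which uses the near-infrared and red bands (B8 and B4 on Sentinel-2). Which indices the authors inject is not stated in the abstract, so NDVI, the reflectance values, and the small `eps` guard are illustrative assumptions.

```python
def normalized_difference(band_a, band_b, eps=1e-8):
    """Per-pixel normalized difference index: (a - b) / (a + b)."""
    # eps avoids division by zero when both bands are zero (illustrative safeguard).
    return [(a - b) / (a + b + eps) for a, b in zip(band_a, band_b)]


# Hypothetical reflectance values for two pixels.
nir = [0.6, 0.3]  # near-infrared band
red = [0.2, 0.3]  # red band

ndvi = normalized_difference(nir, red)
print(ndvi)
```

The result is bounded between -1 and 1 for non-negative reflectances, which makes such indices convenient extra channels to concatenate with decoder features.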

[76] When and Where do Events Switch in Multi-Event Video Generation?

Ruotong Liao,Guowen Huang,Qing Cheng,Thomas Seidl,Daniel Cremers,Volker Tresp

Main category: cs.CV

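TL;DR: The paper studies event switching in multi-event text-to-video (T2V) generation, introduces the MEve evaluation suite, and shows the importance of early intervention in denoising steps and block-wise model layers.

Details Motivation: Existing multi-event generation methods overlook the intrinsic factors behind event switching. The paper aims to answer when and where multi-event prompts control event transitions in T2V generation.

Contribution: Introduces the MEve evaluation suite, systematically studies two representative model families (OpenSora and CogVideoX), and identifies the key factors in multi-event video generation.

Method: Experimentally analyzes how early intervention in denoising steps and block-wise model layers affects event switching.

Result: Early intervention proves essential for multi-event video generation, which opens possibilities for multi-event conditioning in future models.

Insight: Event switching is determined mainly in the early denoising steps and in specific model blocks.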

Abstract: Text-to-video (T2V) generation has surged in response to challenging questions, especially when a long video must depict multiple sequential events with temporal coherence and controllable content. Existing methods that extend to multi-event generation do not examine the intrinsic factors behind event shifting. The paper aims to answer the central question: when and where do multi-event prompts control event transitions during T2V generation? This work introduces MEve, a self-curated prompt suite for evaluating multi-event T2V generation, and conducts a systematic study of two representative model families, i.e., OpenSora and CogVideoX. Extensive experiments demonstrate the importance of early intervention in denoising steps and block-wise model layers, revealing the essential factor for multi-event video generation and highlighting the possibilities for multi-event conditioning in future models.

[77] InsideOut: An EfficientNetV2-S Based Deep Learning Framework for Robust Multi-Class Facial Emotion Recognition

Ahsan Farabi,Israt Khandaker,Ibrahim Khalil Shanto,Md Abdul Ahad Minhaz,Tanisha Zaman

Main category: cs.CV

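TL;DR: InsideOut is a deep learning framework built on EfficientNetV2-S for robust multi-class facial emotion recognition (FER), achieving competitive results through data augmentation and imbalance-aware optimization.

Details Motivation: In real-world use, FER is hindered by occlusion, illumination changes, pose variation, and dataset imbalance, so an efficient and robust solution is needed.

Contribution: The InsideOut framework, which combines EfficientNetV2-S with a class-weighted loss to address class imbalance and challenging conditions in FER.

Method: Uses EfficientNetV2-S as the backbone and improves performance through data augmentation and a class-weighted loss.

Result: Reaches 62.8% accuracy and a macro-averaged F1 of 0.590 on FER2013, which is competitive with conventional CNN baselines.

Insight: Efficient architectures combined with imbalance handling can provide practical and reproducible FER solutions.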

Abstract: Facial Emotion Recognition (FER) is a key task in affective computing, enabling applications in human-computer interaction, e-learning, healthcare, and safety systems. Despite advances in deep learning, FER remains challenging due to occlusions, illumination and pose variations, subtle intra-class differences, and dataset imbalance that hinders recognition of minority emotions. We present InsideOut, a reproducible FER framework built on EfficientNetV2-S with transfer learning, strong data augmentation, and imbalance-aware optimization. The approach standardizes FER2013 images, applies stratified splitting and augmentation, and fine-tunes a lightweight classification head with class-weighted loss to address skewed distributions. InsideOut achieves 62.8% accuracy with a macro averaged F1 of 0.590 on FER2013, showing competitive results compared to conventional CNN baselines. The novelty lies in demonstrating that efficient architectures, combined with tailored imbalance handling, can provide practical, transparent, and reproducible FER solutions.
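
Two quantities mentioned in the abstract, class weights for an imbalanced dataset and macro-averaged F1, can be sketched as follows. The weighting formula n / (k * count) is a common inverse-frequency heuristic and is an assumption here, since the abstract does not say how the authors derive their weights; the emotion labels are toy data.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * cnt) for cls, cnt in counts.items()}


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores over the classes in y_true."""
    scores = []
    for cls in sorted(set(y_true)):
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


y_true = ["happy", "happy", "happy", "disgust"]
y_pred = ["happy", "happy", "disgust", "disgust"]

print(inverse_frequency_weights(y_true))
print(macro_f1(y_true, y_pred))
```

Macro averaging gives a minority emotion the same influence on the score as a majority one, which is why it is often reported next to accuracy on imbalanced datasets such as FER2013.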

[78] What Drives Compositional Generalization in Visual Generative Models?

Karim Farid,Rajat Sahay,Yumna Ali Alnaggar,Simon Schrodi,Volker Fischer,Cordelia Schmid,Thomas Brox

Main category: cs.CV

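TL;DR: The paper studies how design choices in visual generative models affect compositional generalization, identifies the discrete-versus-continuous nature of the training objective and the conditioning information as key factors, and proposes a way to improve the compositional performance of discrete models.

Details Motivation: The study seeks to understand which mechanisms enable or inhibit compositional generalization in visual generative models, in order to improve their ability to generate novel combinations.

Contribution: A systematic analysis of how design choices affect compositional generalization, and an improved method that combines discrete and continuous objectives.

Method: Controlled experiments examine the discrete or continuous nature of the training objective and the role of conditioning information. The paper also proposes adding an auxiliary continuous, JEPA-based objective to MaskGIT.

Result: The modified MaskGIT shows improved compositional performance.

Insight: Compositional generalization depends on the nature of the training objective and on how much conditioning information is provided; combining discrete and continuous objectives can improve performance.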

Abstract: Compositional generalization, the ability to generate novel combinations of known concepts, is a key ingredient for visual generative models. Yet, not all mechanisms that enable or inhibit it are fully understood. In this work, we conduct a systematic study of how various design choices influence compositional generalization in image and video generation in a positive or negative way. Through controlled experiments, we identify two key factors: (i) whether the training objective operates on a discrete or continuous distribution, and (ii) to what extent conditioning provides information about the constituent concepts during training. Building on these insights, we show that relaxing the MaskGIT discrete loss with an auxiliary continuous JEPA-based objective can improve compositional performance in discrete models like MaskGIT.

[79] Geometry Meets Vision: Revisiting Pretrained Semantics in Distilled Fields

Zhiting Mei,Ola Shorinwa,Anirudha Majumdar

Main category: cs.CV

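TL;DR: The paper studies how geometry-grounding affects semantic distillation in radiance fields and proposes SPINE, a new framework for inverting radiance fields without an initial guess. It finds that geometry-grounded features provide finer geometric detail but perform worse on pose estimation.

Details Motivation: To explore whether geometry-grounded semantic features can improve spatial tasks in radiance fields, such as pose estimation and semantic localization.

Contribution: Proposes the SPINE framework, which combines coarse inversion using distilled semantics with fine inversion using photometric optimization and requires no initial guess. It also shows that visual-only features are more versatile across tasks than geometry-grounded features.

Method: SPINE consists of coarse inversion (based on distilled semantics) and fine inversion (based on photometric optimization). Experiments compare visual-only and geometry-grounded features across several tasks.

Result: Geometry-grounded features contain richer geometric detail but perform worse on pose estimation; visual-only features are more versatile across tasks.

Insight: Future work should explore more effective geometry-grounding strategies that balance geometric detail with versatility across tasks.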

Abstract: Semantic distillation in radiance fields has spurred significant advances in open-vocabulary robot policies, e.g., in manipulation and navigation, founded on pretrained semantics from large vision models. While prior work has demonstrated the effectiveness of visual-only semantic features (e.g., DINO and CLIP) in Gaussian Splatting and neural radiance fields, the potential benefit of geometry-grounding in distilled fields remains an open question. In principle, visual-geometry features seem very promising for spatial tasks such as pose estimation, prompting the question: Do geometry-grounded semantic features offer an edge in distilled fields? Specifically, we ask three critical questions: First, does spatial-grounding produce higher-fidelity geometry-aware semantic features? We find that image features from geometry-grounded backbones contain finer structural details compared to their counterparts. Secondly, does geometry-grounding improve semantic object localization? We observe no significant difference in this task. Thirdly, does geometry-grounding enable higher-accuracy radiance field inversion? Given the limitations of prior work and their lack of semantics integration, we propose a novel framework SPINE for inverting radiance fields without an initial guess, consisting of two core components: coarse inversion using distilled semantics, and fine inversion using photometric-based optimization. Surprisingly, we find that the pose estimation accuracy decreases with geometry-grounded features. Our results suggest that visual-only features offer greater versatility for a broader range of downstream tasks, although geometry-grounded features contain more geometric detail. Notably, our findings underscore the necessity of future research on effective strategies for geometry-grounding that augment the versatility and performance of pretrained semantic features.

[80] GeoComplete: Geometry-Aware Diffusion for Reference-Driven Image Completion

Beibei Lin,Tingting Chen,Robby T. Tan

Main category: cs.CV

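TL;DR: GeoComplete is a new reference-driven image completion framework that uses explicit 3D structural guidance to enforce geometric consistency, addressing the misalignment of existing generative methods when viewpoints differ significantly.

Details Motivation: Existing methods rely solely on diffusion priors and lack geometric cues such as camera pose or depth, which leads to inaccurate or implausible completions. GeoComplete aims to solve this by introducing geometric information.

Contribution: 1) Conditioning the diffusion process on projected point clouds; 2) A target-aware masking strategy; 3) A dual-branch diffusion architecture that combines geometric features with image synthesis.

Method: A dual-branch diffusion architecture: one branch synthesizes the missing regions from the masked target, while the other extracts geometric features from the projected point cloud. Joint self-attention across branches keeps the completion coherent and accurate.

Result: Experiments show that GeoComplete achieves a 17.1 PSNR improvement over state-of-the-art methods, significantly boosting geometric accuracy while maintaining high visual quality.

Insight: Incorporating geometric information clearly improves the quality and target consistency of reference-driven image completion, especially when viewpoints differ significantly.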

Abstract: Reference-driven image completion, which restores missing regions in a target view using additional images, is particularly challenging when the target view differs significantly from the references. Existing generative methods rely solely on diffusion priors and, without geometric cues such as camera pose or depth, often produce misaligned or implausible content. We propose GeoComplete, a novel framework that incorporates explicit 3D structural guidance to enforce geometric consistency in the completed regions, setting it apart from prior image-only approaches. GeoComplete introduces two key ideas: conditioning the diffusion process on projected point clouds to infuse geometric information, and applying target-aware masking to guide the model toward relevant reference cues. The framework features a dual-branch diffusion architecture. One branch synthesizes the missing regions from the masked target, while the other extracts geometric features from the projected point cloud. Joint self-attention across branches ensures coherent and accurate completion. To address regions visible in references but absent in the target, we project the target view into each reference to detect occluded areas, which are then masked during training. This target-aware masking directs the model to focus on useful cues, enhancing performance in difficult scenarios. By integrating a geometry-aware dual-branch diffusion architecture with a target-aware masking strategy, GeoComplete offers a unified and robust solution for geometry-conditioned image completion. Experiments show that GeoComplete achieves a 17.1 PSNR improvement over state-of-the-art methods, significantly boosting geometric accuracy while maintaining high visual quality.

[81] Taming Text-to-Sounding Video Generation via Advanced Modality Condition and Interaction

Kaisi Guan,Xihua Wang,Zhengfeng Lai,Xin Cheng,Peng Zhang,XiaoJiang Liu,Ruihua Song,Meng Cao

Main category: cs.CV

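TL;DR: The paper proposes a new Text-to-Sounding-Video (T2SV) generation method that disentangles the text captions for video and audio and introduces a dual-tower diffusion transformer (BridgeDiT) for cross-modal interaction, improving generation quality and synchronization and achieving state-of-the-art results on most metrics.

Details Motivation: Existing T2SV generation methods suffer from modal interference and an unclear mechanism for cross-modal interaction, which limits the quality and synchronization of the generated results. The paper aims to address these problems.

Contribution: 1. The Hierarchical Visual-Grounded Captioning (HVGC) framework, which generates disentangled video and audio captions; 2. BridgeDiT, a dual-tower diffusion transformer that uses a Dual CrossAttention (DCA) mechanism for symmetric, bidirectional information exchange.

Method: 1. HVGC framework: generates separate captions for video and audio; 2. BridgeDiT: a diffusion transformer that uses the DCA mechanism for cross-modal feature interaction.

Result: Experiments on three benchmark datasets, supported by human evaluations, show state-of-the-art results on most metrics, and ablation studies validate each contribution.

Insight: Disentangled captions and symmetric cross-modal interaction are key to improving T2SV generation.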

Abstract: This study focuses on a challenging yet promising task, Text-to-Sounding-Video (T2SV) generation, which aims to generate a video with synchronized audio from text conditions while ensuring that both modalities are aligned with the text. Despite progress in joint audio-video training, two critical challenges remain unaddressed: (1) a single, shared text caption where the text for video is equal to the text for audio often creates modal interference, confusing the pretrained backbones, and (2) the optimal mechanism for cross-modal feature interaction remains unclear. To address these challenges, we first propose the Hierarchical Visual-Grounded Captioning (HVGC) framework that generates pairs of disentangled captions, a video caption and an audio caption, eliminating interference at the conditioning stage. Based on HVGC, we further introduce BridgeDiT, a novel dual-tower diffusion transformer, which employs a Dual CrossAttention (DCA) mechanism that acts as a robust “bridge” to enable a symmetric, bidirectional exchange of information, achieving both semantic and temporal synchronization. Extensive experiments on three benchmark datasets, supported by human evaluations, demonstrate that our method achieves state-of-the-art results on most metrics. Comprehensive ablation studies further validate the effectiveness of our contributions, offering key insights for the future T2SV task. All code and checkpoints will be publicly released.

[82] HAVIR: HierArchical Vision to Image Reconstruction using CLIP-Guided Versatile Diffusion

Shiyi Zhang,Dong Liang,Hairong Zheng,Yihang Zhou

Main category: cs.CV

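TL;DR: HAVIR is a hierarchical vision-to-image reconstruction model that uses CLIP-guided Versatile Diffusion to address the difficulty existing techniques have in reconstructing visual information from complex scenes.

Details Motivation: Existing methods struggle to reconstruct complex visual stimuli because natural scenes exhibit heterogeneous low-level features and semantically entangled high-level features. Inspired by the hierarchical representation theory of the visual cortex, HAVIR proposes hierarchical processing to address this.

Contribution: HAVIR extracts and processes visual information hierarchically: the Structural Generator converts structural information from spatial processing voxels into latent diffusion priors, the Semantic Extractor converts semantic processing voxels into CLIP embeddings, and the Versatile Diffusion model integrates both to generate the final image.

Method: HAVIR separates the visual cortex into two hierarchical regions and extracts structural and semantic features from each. Structural information is modeled as diffusion priors, semantic information is converted into CLIP embeddings, and the Versatile Diffusion model fuses them to generate the image.

Result: Experiments show that HAVIR improves both the structural and semantic quality of reconstructed images in complex scenes and outperforms existing models.

Insight: Hierarchical processing effectively separates low-level from high-level visual features, CLIP embeddings strengthen the representation of semantic information, and the diffusion model further improves reconstruction quality.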

Abstract: The reconstruction of visual information from brain activity fosters interdisciplinary integration between neuroscience and computer vision. However, existing methods still face challenges in accurately recovering highly complex visual stimuli. This difficulty stems from the characteristics of natural scenes: low-level features exhibit heterogeneity, while high-level features show semantic entanglement due to contextual overlaps. Inspired by the hierarchical representation theory of the visual cortex, we propose the HAVIR model, which separates the visual cortex into two hierarchical regions and extracts distinct features from each. Specifically, the Structural Generator extracts structural information from spatial processing voxels and converts it into latent diffusion priors, while the Semantic Extractor converts semantic processing voxels into CLIP embeddings. These components are integrated via the Versatile Diffusion model to synthesize the final image. Experimental results demonstrate that HAVIR enhances both the structural and semantic quality of reconstructions, even in complex scenes, and outperforms existing models.

[83] Mask2IV: Interaction-Centric Video Generation via Mask Trajectories

Gen Li,Bo Zhao,Jianfei Yang,Laura Sevilla-Lara

Main category: cs.CV

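TL;DR: Mask2IV proposes a decoupled two-stage framework for interaction-centric video generation that first predicts motion trajectories for the actor and the object and then generates the video, requiring no dense mask input while letting users flexibly control the interaction.

Details Motivation: Generating interaction-centric videos is crucial for embodied intelligence, but existing methods struggle to model complex, dynamic interactions, and obtaining dense mask annotations is difficult in real-world use.

Contribution: The Mask2IV framework, which decouples motion trajectory prediction from video generation, requires no dense mask input, and supports flexible control and intuitive interaction guidance.

Method: A two-stage pipeline: 1) predict motion trajectories for the actor and the object; 2) generate the video conditioned on those trajectories. The interaction can be controlled through action descriptions or spatial position cues.

Result: On two diverse benchmarks, the method achieves better visual realism and controllability than existing baselines.

Insight: The decoupled design and trajectory prediction improve the flexibility and quality of interaction video generation while reducing annotation requirements.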

Abstract: Generating interaction-centric videos, such as those depicting humans or robots interacting with objects, is crucial for embodied intelligence, as they provide rich and diverse visual priors for robot learning, manipulation policy training, and affordance reasoning. However, existing methods often struggle to model such complex and dynamic interactions. While recent studies show that masks can serve as effective control signals and enhance generation quality, obtaining dense and precise mask annotations remains a major challenge for real-world use. To overcome this limitation, we introduce Mask2IV, a novel framework specifically designed for interaction-centric video generation. It adopts a decoupled two-stage pipeline that first predicts plausible motion trajectories for both actor and object, then generates a video conditioned on these trajectories. This design eliminates the need for dense mask inputs from users while preserving the flexibility to manipulate the interaction process. Furthermore, Mask2IV supports versatile and intuitive control, allowing users to specify the target object of interaction and guide the motion trajectory through action descriptions or spatial position cues. To support systematic training and evaluation, we curate two benchmarks covering diverse action and object categories across both human-object interaction and robotic manipulation scenarios. Extensive experiments demonstrate that our method achieves superior visual realism and controllability compared to existing baselines.

[84] ReeMark: Reeb Graphs for Simulating Patterns of Life in Spatiotemporal Trajectories

Anantajit Subrahmanya,Chandrakanth Gudavalli,Connor Levenson,Umang Garg,B. S. Manjunath

Main category: cs.CV

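TL;DR: The paper proposes Markovian Reeb Graphs, a new framework for simulating spatiotemporal trajectories that preserve the Patterns of Life (PoLs) in the underlying data. It combines individual- and population-level mobility structure to generate future trajectories that are both consistent and varied.

Details Motivation: Accurately modeling human mobility is critical for urban planning, epidemiology, and traffic management. Existing methods fall short in preserving patterns of life and in computational efficiency, so a new framework is needed.

Contribution: Proposes Markovian Reeb Graphs, a framework built on a probabilistic topological model that preserves individual- and population-level mobility structure when simulating spatiotemporal trajectories, while remaining data-efficient.

Method: Generates future trajectories by combining individual- and population-level mobility structures within a probabilistic topological model. Evaluated with the Jensen-Shannon Divergence (JSD) on the Urban Anomalies dataset (Atlanta and Berlin subsets).

Result: Shows high fidelity on population- and agent-level metrics while remaining data- and compute-efficient.

Insight: As a scalable framework, Markovian Reeb Graphs can be applied broadly across diverse urban environments and offer a new approach to trajectory simulation.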

Abstract: Accurately modeling human mobility is critical for urban planning, epidemiology, and traffic management. In this work, we introduce Markovian Reeb Graphs, a novel framework for simulating spatiotemporal trajectories that preserve Patterns of Life (PoLs) learned from baseline data. By combining individual- and population-level mobility structures within a probabilistic topological model, our approach generates realistic future trajectories that capture both consistency and variability in daily life. Evaluations on the Urban Anomalies dataset (Atlanta and Berlin subsets) using the Jensen-Shannon Divergence (JSD) across population- and agent-level metrics demonstrate that the proposed method achieves strong fidelity while remaining data- and compute-efficient. These results position Markovian Reeb Graphs as a scalable framework for trajectory simulation with broad applicability across diverse urban environments.
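The evaluation above relies on the Jensen-Shannon Divergence. As a minimal sketch of how that metric compares a real and a simulated distribution, the snippet below computes JSD between two histograms. The histograms and the function name `jensen_shannon_divergence` are illustrative assumptions, not the paper's evaluation code or data.

```python
import math


def jensen_shannon_divergence(p, q, base=2.0):
    """JSD between two discrete distributions given as equal-length count vectors."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of bins")
    total_p, total_q = sum(p), sum(q)
    p = [x / total_p for x in p]  # normalize counts to probabilities
    q = [x / total_q for x in q]
    m = [(a + b) / 2 for a, b in zip(p, q)]  # mixture distribution

    def kl(a, b):
        # KL(a || b); bins with a == 0 contribute nothing
        return sum(x * math.log(x / y, base) for x, y in zip(a, b) if x > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Hypothetical visit counts per location category (real vs. simulated agents)
real_visits = [120, 45, 30, 5]
simulated_visits = [110, 50, 28, 12]
jsd = jensen_shannon_divergence(real_visits, simulated_visits)
print(f"JSD = {jsd:.4f}")
```

With base-2 logarithms the value lies in [0, 1], where 0 means the two distributions are identical, so lower is better when comparing simulated trajectories against the baseline data.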

[85] SpineBench: A Clinically Salient, Level-Aware Benchmark Powered by the SpineMed-450k Corpus

Ming Zhao,Wenhui Dong,Yang Zhang,Xiang Zheng,Zhonghao Zhang,Zian Zhou,Yunzhi Guan,Liukun Xu,Wei Peng,Zhaoyang Gong,Zhicheng Zhang,Dachuan Li,Xiaosheng Ma,Yuli Ma,Jianing Ni,Changjiang Jiang,Lixia Tian,Qixin Chen,Kaishun Xia,Pingping Liu,Tongshun Zhang,Zhiqiang Liu,Zhongan Bi,Chenyang Si,Tiansheng Sun,Caifeng Shan

Main category: cs.CV

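TL;DR: The paper introduces the SpineMed ecosystem, comprising SpineMed-450k, the first large-scale dataset for vertebral-level reasoning, and the SpineBench evaluation framework. A clinician-in-the-loop, two-stage LLM generation method ensures data quality, and a model fine-tuned on the dataset shows clear advantages on clinical tasks.

Details Motivation: Spine disorders are a global health problem, but AI-assisted diagnosis is held back by the lack of level-aware multimodal datasets and standardized evaluation tools.

Contribution: 1. Introduces SpineMed-450k, the first large-scale, multimodal, vertebral-level dataset; 2. Designs SpineBench, a clinically grounded evaluation framework; 3. Ensures data quality with a two-stage LLM generation method.

Method: Builds the dataset from textbooks, guidelines, and hospital cases using a two-stage LLM generation method (draft and revision) combined with a clinician-in-the-loop annotation pipeline.

Result: The model fine-tuned on SpineMed-450k performs best on SpineBench, significantly outperforming other LVLMs, especially on vertebral-level reasoning and pathology assessment.

Insight: 1. Clinician involvement in data generation is essential for the practical usefulness of AI models on medical tasks; 2. Fine-grained reasoning is a weakness of current LVLMs.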

Abstract: Spine disorders affect 619 million people globally and are a leading cause of disability, yet AI-assisted diagnosis remains limited by the lack of level-aware, multimodal datasets. Clinical decision-making for spine disorders requires sophisticated reasoning across X-ray, CT, and MRI at specific vertebral levels. However, progress has been constrained by the absence of traceable, clinically-grounded instruction data and standardized, spine-specific benchmarks. To address this, we introduce SpineMed, an ecosystem co-designed with practicing spine surgeons. It features SpineMed-450k, the first large-scale dataset explicitly designed for vertebral-level reasoning across imaging modalities with over 450,000 instruction instances, and SpineBench, a clinically-grounded evaluation framework. SpineMed-450k is curated from diverse sources, including textbooks, guidelines, open datasets, and ~1,000 de-identified hospital cases, using a clinician-in-the-loop pipeline with a two-stage LLM generation method (draft and revision) to ensure high-quality, traceable data for question-answering, multi-turn consultations, and report generation. SpineBench evaluates models on clinically salient axes, including level identification, pathology assessment, and surgical planning. Our comprehensive evaluation of several recently advanced large vision-language models (LVLMs) on SpineBench reveals systematic weaknesses in fine-grained, level-specific reasoning. In contrast, our model fine-tuned on SpineMed-450k demonstrates consistent and significant improvements across all tasks. Clinician assessments confirm the diagnostic clarity and practical utility of our model’s outputs.

[86] Memory Forcing: Spatio-Temporal Memory for Consistent Scene Generation on Minecraft

Junchao Huang,Xinting Hu,Boyao Han,Shaoshuai Shi,Zhuotao Tian,Tianyu He,Li Jiang

Main category: cs.CV

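TL;DR: Memory Forcing is a learning framework that incorporates spatio-temporal memory to generate consistent scenes in Minecraft. Through Hybrid Training and Chained Forward Training it dynamically balances the use of temporal and spatial memory, and it uses a geometry-indexed spatial memory with efficient retrieval to achieve long-term spatial consistency and generation quality.

Details Motivation: Under a limited compute budget, existing models struggle to deliver both high-quality generation of new scenes and long-term spatial consistency in explored areas. A method is needed that balances temporal and spatial memory to support natural interactive scene generation in games such as Minecraft.

Contribution: 1. Proposes the Memory Forcing learning framework, which combines Hybrid Training and Chained Forward Training; 2. Introduces a geometry-indexed spatial memory and an efficient Point-to-Frame Retrieval method; 3. Demonstrates experimentally its advantage in long-term spatial consistency and generation quality.

Method: 1. Hybrid Training: distinguishes exploration from revisiting and adjusts the use of temporal and spatial memory accordingly; 2. Chained Forward Training: extends the prediction horizon with model rollouts; 3. Geometry-indexed spatial memory and Point-to-Frame Retrieval improve the use of historical information.

Result: Across diverse environments, Memory Forcing achieves strong long-term spatial consistency and generation quality while remaining computationally efficient.

Insight: Dynamically balancing temporal and spatial memory is the key; combining Hybrid Training with Chained Forward Training improves the model's adaptability and consistency.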

Abstract: Autoregressive video diffusion models have proved effective for world modeling and interactive scene generation, with Minecraft gameplay as a representative application. To faithfully simulate play, a model must generate natural content while exploring new scenes and preserve spatial consistency when revisiting explored areas. Under limited computation budgets, it must compress and exploit historical cues within a finite context window, which exposes a trade-off: Temporal-only memory lacks long-term spatial consistency, whereas adding spatial memory strengthens consistency but may degrade new scene generation quality when the model over-relies on insufficient spatial context. We present Memory Forcing, a learning framework that pairs training protocols with a geometry-indexed spatial memory. Hybrid Training exposes distinct gameplay regimes, guiding the model to rely on temporal memory during exploration and incorporate spatial memory for revisits. Chained Forward Training extends autoregressive training with model rollouts, where chained predictions create larger pose variations and encourage reliance on spatial memory for maintaining consistency. Point-to-Frame Retrieval efficiently retrieves history by mapping currently visible points to their source frames, while Incremental 3D Reconstruction maintains and updates an explicit 3D cache. Extensive experiments demonstrate that Memory Forcing achieves superior long-term spatial consistency and generative quality across diverse environments, while maintaining computational efficiency for extended sequences.
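The abstract describes Point-to-Frame Retrieval as mapping currently visible points to their source frames. A minimal sketch of that idea, assuming each cached 3D point records the index of the frame it was reconstructed from, is to let visible points vote for frames and keep the top-k. The function name `retrieve_source_frames` and the voting rule are hypothetical illustrations, not the paper's implementation.

```python
from collections import Counter


def retrieve_source_frames(visible_point_sources, k=2):
    """Return the k history frames that contributed the most currently visible points.

    visible_point_sources: one source-frame index per visible 3D point
    (assumed to be stored alongside each point in the 3D cache).
    """
    votes = Counter(visible_point_sources)
    return [frame for frame, _ in votes.most_common(k)]


# Hypothetical example: six visible points originating from frames 3, 7, and 12
sources = [3, 3, 7, 3, 7, 12]
print(retrieve_source_frames(sources, k=2))
```

Retrieval cost under this scheme scales with the number of visible points rather than the length of the history, which is one plausible reason such a lookup stays cheap for long sequences.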

[87] MonSTeR: a Unified Model for Motion, Scene, Text Retrieval

Luca Collorone,Matteo Gioia,Massimiliano Pappa,Paolo Leoni,Giovanni Ficarra,Or Litany,Indro Spinelli,Fabio Galasso

Main category: cs.CV

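TL;DR: MonSTeR is a unified motion-scene-text retrieval model. It is the first to enable evaluation of alignment across the three modalities (motion, text, scene), and it supports flexible retrieval by building a unified latent space.

Details Motivation: Human movement is driven by intention, but whether a movement is feasible depends on whether the surrounding environment supports it. Existing research lacks tools for evaluating the alignment between motion, intention, and scene.

Contribution: Proposes MonSTeR, the first multimodal (motion-scene-text) retrieval model, which captures complex dependencies between modalities through a unified latent space and supports flexible retrieval.

Method: Builds a unified latent space from unimodal and cross-modal representations, capturing higher-order relations to align the modalities.

Result: MonSTeR outperforms models that rely solely on unimodal representations, its retrieval scores agree with human preferences, and it shows versatility on zero-shot in-scene object placement and motion captioning.

Insight: A unified multimodal latent space can effectively capture complex dependencies and provides a tool for follow-up research.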

Abstract: Intention drives human movement in complex environments, but such movement can only happen if the surrounding context supports it. Despite the intuitive nature of this mechanism, existing research has not yet provided tools to evaluate the alignment between skeletal movement (motion), intention (text), and the surrounding context (scene). In this work, we introduce MonSTeR, the first MOtioN-Scene-TExt Retrieval model. Inspired by the modeling of higher-order relations, MonSTeR constructs a unified latent space by leveraging unimodal and cross-modal representations. This allows MonSTeR to capture the intricate dependencies between modalities, enabling flexible but robust retrieval across various tasks. Our results show that MonSTeR outperforms trimodal models that rely solely on unimodal representations. Furthermore, we validate the alignment of our retrieval scores with human preferences through a dedicated user study. We demonstrate the versatility of MonSTeR’s latent space on zero-shot in-Scene Object Placement and Motion Captioning. Code and pre-trained models are available at github.com/colloroneluca/MonSTeR.

[88] Improving GUI Grounding with Explicit Position-to-Coordinate Mapping

Suyuchen Wang,Tianyu Zhang,Ahmed Masry,Christopher Pal,Spandana Gella,Bang Liu,Perouz Taslakian

Main category: cs.CV

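TL;DR: The paper improves GUI grounding with an explicit position-to-coordinate mapping, avoiding the performance drop that existing VLMs suffer at unseen resolutions. RULER tokens and I-MRoPE improve accuracy and robustness.

Details Motivation: GUI grounding requires mapping natural-language instructions to pixel coordinates, but existing methods degrade severely on high-resolution displays unseen during training, mainly because the implicit coordinate mapping is unreliable.

Contribution: 1. Proposes RULER tokens as explicit coordinate markers; 2. Designs Interleaved MRoPE (I-MRoPE) to improve the symmetry of spatial encoding; 3. Validates the gains at high resolution on multiple datasets.

Method: 1. RULER tokens provide explicit coordinate references; 2. I-MRoPE uses interleaved encoding so that the width and height dimensions are represented symmetrically.

Result: On the ScreenSpot family of datasets, the method significantly improves grounding accuracy on high-resolution GUIs.

Insight: Explicit position encoding and symmetric spatial representation are essential for GUI grounding performance at high resolution.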

Abstract: GUI grounding, the task of mapping natural-language instructions to pixel coordinates, is crucial for autonomous agents, yet remains difficult for current VLMs. The core bottleneck is reliable patch-to-pixel mapping, which breaks when extrapolating to high-resolution displays unseen during training. Current approaches generate coordinates as text tokens directly from visual features, forcing the model to infer complex position-to-pixel mappings implicitly; as a result, accuracy degrades and failures proliferate on new resolutions. We address this with two complementary innovations. First, RULER tokens serve as explicit coordinate markers, letting the model reference positions similar to gridlines on a map and adjust rather than generate coordinates from scratch. Second, Interleaved MRoPE (I-MRoPE) improves spatial encoding by ensuring that width and height dimensions are represented equally, addressing the asymmetry of standard positional schemes. Experiments on ScreenSpot, ScreenSpot-V2, and ScreenSpot-Pro show consistent gains in grounding accuracy, with the largest improvements on high-resolution interfaces. By providing explicit spatial guidance rather than relying on implicit learning, our approach enables more reliable GUI automation across diverse resolutions and platforms.

[89] LEAML: Label-Efficient Adaptation to Out-of-Distribution Visual Tasks for Multimodal Large Language Models

Ci-Siang Lin,Min-Hung Chen,Yu-Yang Sheng,Yu-Chiang Frank Wang

Main category: cs.CV

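TL;DR: LEAML is a label-efficient adaptation framework for multimodal large language models (MLLMs) that addresses label scarcity on out-of-distribution (OOD) tasks in specialized domains such as medical imaging.

Details Motivation: MLLMs perform well on general visual tasks but poorly on OOD tasks in specialized domains such as medical imaging, mainly because labeled data is scarce and expensive.

Contribution: Proposes the LEAML framework, which combines a small amount of labeled data with abundant unlabeled images, generates domain-relevant pseudo question-answer pairs, and selectively updates the neurons most relevant to question answering to acquire domain-specific knowledge efficiently.

Method: Uses a QA generator regularized by caption distillation to produce pseudo QA pairs for unlabeled data, and selectively updates the most relevant neurons so that the QA generator adapts to the domain efficiently.

Result: On visual question answering in gastrointestinal endoscopy and sports, LEAML outperforms standard fine-tuning under minimal supervision.

Insight: Pseudo-label generation combined with selective neuron updates can effectively address label scarcity in specialized domains.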

Abstract: Multimodal Large Language Models (MLLMs) have achieved strong performance on general visual benchmarks but struggle with out-of-distribution (OOD) tasks in specialized domains such as medical imaging, where labeled data is limited and expensive. We introduce LEAML, a label-efficient adaptation framework that leverages both scarce labeled VQA samples and abundant unlabeled images. Our approach generates domain-relevant pseudo question-answer pairs for unlabeled data using a QA generator regularized by caption distillation. Importantly, we selectively update only those neurons most relevant to question-answering, enabling the QA Generator to efficiently acquire domain-specific knowledge during distillation. Experiments on gastrointestinal endoscopy and sports VQA demonstrate that LEAML consistently outperforms standard fine-tuning under minimal supervision, highlighting the effectiveness of our proposed LEAML framework.

cs.CY [Back]

[90] Representing Beauty: Towards a Participatory but Objective Latent Aesthetics

Alexander Michael Rusnak

Main category: cs.CY

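TL;DR: The paper examines how neural networks can represent beauty objectively through cross-model representational convergence, argues that the formal structure of beautiful images has a realist basis, and raises the possibility of human-machine co-creation.

Details Motivation: To study how machines recognize beauty: although beauty is a complex and plural concept, neural networks can learn to model aesthetic judgment.

Contribution: Shows that aesthetic images produce convergent representations across models, indicating that beauty has a realist basis rather than being only a social construct.

Method: Uses cross-model representational convergence to analyze how consistently models trained on different data and modalities represent aesthetic images.

Result: Aesthetic images produce more similar representations across models, indicating that their formal structure has a realist basis and supporting the possibility of human-machine co-creation.

Insight: Beauty is not only a product of cultural construction; it is jointly grounded in physical and cultural substance, and machines can produce novel creative insights from the vantage point of scale.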

Abstract: What does it mean for a machine to recognize beauty? While beauty remains a culturally and experientially compelling but philosophically elusive concept, deep learning systems increasingly appear capable of modeling aesthetic judgment. In this paper, we explore the capacity of neural networks to represent beauty despite the immense formal diversity of objects to which the term applies. By drawing on recent work on cross-model representational convergence, we show how aesthetic content produces more similar and aligned representations between models which have been trained on distinct data and modalities, while unaesthetic images do not produce more aligned representations. This finding implies that the formal structure of beautiful images has a realist basis rather than being only a reflection of socially constructed values. Furthermore, we propose that these realist representations exist because of a joint grounding of aesthetic form in physical and cultural substance. We argue that human perceptual and creative acts play a central role in shaping the latent spaces of deep learning systems, but that a realist basis for aesthetics shows that machines are not mere creative parrots but can produce novel creative insights from the unique vantage point of scale. Our findings suggest that human-machine co-creation is not merely possible, but foundational, with beauty serving as a teleological attractor in both cultural production and machine perception.

cs.DC [Back]

[91] PyRadiomics-cuda: a GPU-accelerated 3D features extraction from medical images within PyRadiomics

Jakub Lisowski,Piotr Tyrakowski,Szymon Zyguła,Krzysztof Kaczmarski

Main category: cs.DC

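TL;DR: PyRadiomics-cuda is a GPU-accelerated extension of PyRadiomics for efficiently extracting three-dimensional shape features from medical images. It significantly reduces processing time and is fully compatible with the existing PyRadiomics API.

Details Motivation: Extracting 3D shape features in medical image processing is computationally expensive. PyRadiomics-cuda aims to solve this with GPU acceleration and to support efficient AI pipelines.

Contribution: Develops PyRadiomics-cuda, a GPU-accelerated extension that preserves compatibility with PyRadiomics and significantly improves feature-extraction efficiency.

Method: Implemented in Python and C/CUDA, offloading key geometric computations to GPU hardware, with detailed installation and usage documentation.

Result: Tests on different hardware environments show that PyRadiomics-cuda significantly reduces processing time and is suitable for large datasets.

Insight: GPU acceleration can significantly improve the efficiency of medical image feature extraction and suits high-throughput AI applications.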

Abstract: PyRadiomics-cuda is a GPU-accelerated extension of the PyRadiomics library, designed to address the computational challenges of extracting three-dimensional shape features from medical images. By offloading key geometric computations to GPU hardware, it dramatically reduces processing times for large volumetric datasets. The system maintains full compatibility with the original PyRadiomics API, enabling seamless integration into existing AI workflows without code modifications. This transparent acceleration facilitates efficient, scalable radiomics analysis, supporting the rapid feature extraction essential for high-throughput AI pipelines. Tests performed on a typical computational cluster and on budget and home devices demonstrate its usefulness in all scenarios. PyRadiomics-cuda is implemented in Python and C/CUDA and is freely available under the BSD license at https://github.com/mis-wut/pyradiomics-CUDA. Additionally, the PyRadiomics-cuda test suite is available at https://github.com/mis-wut/pyradiomics-cuda-data-gen. It provides a detailed handbook and sample scripts suited to different kinds of workflows, plus detailed installation instructions. The dataset used for testing is available on Kaggle at https://www.kaggle.com/datasets/sabahesaraki/kidney-tumor-segmentation-challengekits-19

cs.CR [Back]

[92] Secure and Robust Watermarking for AI-generated Images: A Comprehensive Survey

Jie Cao,Qi Li,Zelin Zhang,Jianbing Ni

Main category: cs.CR

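TL;DR: This paper is a comprehensive survey of robust watermarking for AI-generated images, covering the formalization of watermarking systems, a comparison of techniques, evaluation methods, security vulnerabilities, and future directions, with the aim of advancing a trustworthy digital ecosystem.

Details Motivation: With the rapid development of generative AI, intellectual property protection, authenticity, and accountability for AI-generated images have become key challenges. Watermarking is one solution: it distinguishes AI-generated content from natural content and keeps provenance traceable.

Contribution: The main contributions are: 1) formalizing watermarking systems; 2) comparing diverse watermarking techniques; 3) presenting evaluation methodologies; 4) analyzing security vulnerabilities; 5) summarizing challenges and future directions.

Method: A survey that systematically reviews the state of AI-generated image watermarking along five dimensions: formalization, technique comparison, evaluation metrics, security attacks, and future development.

Result: The survey gives a comprehensive picture of current watermarking techniques, reports how they perform in terms of visual quality, capacity, and detectability, and highlights their vulnerability to adversarial attacks.

Insight: Watermarking is key to trustworthy AI-generated images, but further research is needed to withstand malicious attacks and to improve evaluation methods.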

Abstract: The rapid advancement of generative artificial intelligence (Gen-AI) has facilitated the effortless creation of high-quality images, while simultaneously raising critical concerns regarding intellectual property protection, authenticity, and accountability. Watermarking has emerged as a promising solution to these challenges by distinguishing AI-generated images from natural content, ensuring provenance, and fostering trustworthy digital ecosystems. This paper presents a comprehensive survey of the current state of AI-generated image watermarking, addressing five key dimensions: (1) formalization of image watermarking systems; (2) an overview and comparison of diverse watermarking techniques; (3) evaluation methodologies with respect to visual quality, capacity, and detectability; (4) vulnerabilities to malicious attacks; and (5) prevailing challenges and future directions. The survey aims to equip researchers with a holistic understanding of AI-generated image watermarking technologies, thereby promoting their continued development.

eess.AS [Back]

[93] WEE-Therapy: A Mixture of Weak Encoders Framework for Psychological Counseling Dialogue Analysis

Yongqi Kang,Yong Zhao

Main category: eess.AS

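TL;DR: The paper proposes the WEE-Therapy framework, which improves psychological counseling dialogue analysis by combining an ensemble of weak encoders with a dual-routing strategy.

Details Motivation: Existing audio language models usually rely on a single encoder pre-trained on general data and struggle to capture the emotional and professional features of the counseling domain.

Contribution: Proposes the WEE-Therapy framework, which combines a weak encoder ensemble with a dual-routing strategy to strengthen domain-specific feature extraction.

Method: Supplements a powerful base encoder with a pool of weak encoders and dynamically selects experts through a dual-routing strategy.

Result: In the multi-task evaluation, WEE-Therapy improves performance significantly with only a small increase in parameters.

Insight: A lightweight weak encoder ensemble with dynamic routing can effectively improve model performance on domain-specific tasks.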

Abstract: The advancement of computational psychology requires AI tools capable of deeply understanding counseling dialogues. Existing audio language models (AudioLLMs) often rely on single speech encoders pre-trained on general data, struggling to capture domain-specific features like complex emotions and professional techniques. To address this, we propose WEE-Therapy, a multi-task AudioLLM incorporating a Weak Encoder Ensemble (WEE) mechanism. This supplements a powerful base encoder with a pool of lightweight, specialized encoders. A novel dual-routing strategy combines stable, data-independent domain knowledge with dynamic, data-dependent expert selection. Evaluated on emotion recognition, technique classification, risk detection, and summarization, WEE-Therapy achieves significant performance gains across all tasks with minimal parameter overhead, demonstrating strong potential for AI-assisted clinical analysis.

[94] SpeechCT-CLIP: Distilling Text-Image Knowledge to Speech for Voice-Native Multimodal CT Analysis

Lukas Buess,Jan Geier,David Bani-Harouni,Chantal Pellegrini,Matthias Keicher,Paula Andrea Perez-Toro,Nassir Navab,Andreas Maier,Tomas Arias-Vergara

Main category: eess.AS

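TL;DR: The paper explores learning visual-language representations directly from spoken radiology reports. It proposes SpeechCT-CLIP and uses knowledge distillation from a pretrained text-image CLIP model to transfer semantic alignment, substantially narrowing the performance gap between speech and text models.

Details Motivation: Spoken communication plays a central role in clinical workflows, yet medical AI systems today rely mainly on written text. The paper aims to close this gap by learning multimodal representations directly from spoken radiology reports.

Contribution: The main contributions are: (1) synthesizing Speech-RATE, a large-scale dataset of spoken radiology reports; (2) proposing SpeechCT-CLIP, which aligns speech and 3D CT scans through contrastive learning; (3) showing that knowledge distillation effectively improves the speech model.

Method: Trains SpeechCT-CLIP with contrastive learning to align speech and 3D CT scans. Knowledge distillation from a pretrained text-image CLIP model transfers semantic alignment and improves the speech model.

Result: Zero-shot classification F1 for the speech model improves from 0.623 to 0.705, recovering 88% of the performance gap. The model also performs well on retrieval without requiring text input at inference.

Insight: Speech can serve as a practical alternative to text in multimodal pretraining, opening the way to voice-driven diagnostic support tools in clinical practice.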

Abstract: Spoken communication plays a central role in clinical workflows. In radiology, for example, most reports are created through dictation. Yet, nearly all medical AI systems rely exclusively on written text. In this work, we address this gap by exploring the feasibility of learning visual-language representations directly from spoken radiology reports. Specifically, we synthesize a large-scale dataset (Speech-RATE) of spoken radiology reports and train SpeechCT-CLIP, a contrastive model that aligns speech and 3D CT volumes in a shared representation space. While naive speech-based models underperform compared to text-trained counterparts, we show that knowledge distillation from a pretrained text-image CLIP model effectively transfers semantic alignment capabilities from text to speech, substantially narrowing this gap. Experiments demonstrate improved zero-shot classification F1 from 0.623 to 0.705, recovering 88% of the performance difference, and strong retrieval results without requiring text at inference. These findings highlight speech as a practical alternative to text in multimodal pretraining and open the door to voice-driven diagnostic support tools in clinical practice.
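The "recovering 88% of the performance difference" figure can be unpacked with a little arithmetic. The sketch below assumes the 88% refers to the gap between the naive speech model and the text-trained model; the text-model F1 it derives is only implied by the rounded numbers in the abstract and is not a value the abstract reports.

```python
# Values stated in the abstract
naive_speech_f1 = 0.623
distilled_speech_f1 = 0.705
recovered_fraction = 0.88

# Gain from distillation
gain = distilled_speech_f1 - naive_speech_f1

# If the gain is 88% of the speech-to-text gap, the full gap is gain / 0.88
implied_gap = gain / recovered_fraction

# Implied (not reported) F1 of the text-trained counterpart
implied_text_f1 = naive_speech_f1 + implied_gap

print(f"gain = {gain:.3f}, implied gap = {implied_gap:.3f}, "
      f"implied text-model F1 = {implied_text_f1:.3f}")
```

Under this reading, the distilled speech model sits roughly 0.01 F1 below the text-trained model, which is consistent with the claim that the gap is substantially narrowed rather than closed.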

cs.RO [Back]

[95] Work Zones challenge VLM Trajectory Planning: Toward Mitigation and Robust Autonomous Driving

Yifan Liao,Zhen Sun,Xiaoyun Qiu,Zixiao Zhao,Wenbing Tang,Xinlei He,Xinhu Zheng,Tianwei Zhang,Xinyi Huang,Xingshuo Han

Main category: cs.RO

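TL;DR: The paper presents the first systematic study of vision-language models (VLMs) for trajectory planning in work zones and finds that mainstream models fail to generate correct trajectories in 68% of cases. After analyzing the failure patterns, the authors propose REACT-Drive, a framework that incorporates Retrieval-Augmented Generation (RAG) and significantly improves the accuracy and efficiency of trajectory planning.

Details Motivation: The complex environment of work zones (irregular layouts, dynamically changing geometry) challenges VLM trajectory planning, yet it has not been studied. The authors aim to fill this gap and make VLMs more practical in this setting.

Contribution: 1) The first systematic study of VLMs for work zone trajectory planning; 2) identification of 8 common failure patterns; 3) the REACT-Drive framework, which significantly improves performance.

Method: 1) Identify failure patterns through subgraph mining and clustering analysis; 2) combine VLMs with RAG to convert prior failure cases into constraint rules and executable code; 3) retrieve similar patterns in new scenarios to guide trajectory generation.

Result: On the ROADWork dataset, REACT-Drive reduces average displacement error by about 3×, and its inference time (0.58 s) is far lower than that of methods such as fine-tuning (17.90 s). Its practicality is validated in real-world scenarios.

Insight: 1) VLMs have significant limitations in work zone trajectory planning; 2) combining prior knowledge with retrieval effectively improves performance; 3) the framework performs well in real environments and has practical potential.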

Abstract: Visual Language Models (VLMs), with powerful multimodal reasoning capabilities, are gradually being integrated into autonomous driving by several automobile manufacturers to enhance planning capability in challenging environments. However, the trajectory planning capability of VLMs in work zones, which often include irregular layouts, temporary traffic control, and dynamically changing geometric structures, is still unexplored. To bridge this gap, we conduct the first systematic study of VLMs for work zone trajectory planning, revealing that mainstream VLMs fail to generate correct trajectories in 68.0% of cases. To better understand these failures, we first identify candidate patterns via subgraph mining and clustering analysis, and then confirm the validity of 8 common failure patterns through human verification. Building on these findings, we propose REACT-Drive, a trajectory planning framework that integrates VLMs with Retrieval-Augmented Generation (RAG). Specifically, REACT-Drive leverages VLMs to convert prior failure cases into constraint rules and executable trajectory planning code, while RAG retrieves similar patterns in new scenarios to guide trajectory generation. Experimental results on the ROADWork dataset show that REACT-Drive yields a reduction of around 3× in average displacement error relative to VLM baselines under evaluation with Qwen2.5-VL. In addition, REACT-Drive yields the lowest inference time (0.58 s) compared with other methods such as fine-tuning (17.90 s). We further conduct experiments using a real vehicle in 15 work zone scenarios in the physical world, demonstrating the strong practicality of REACT-Drive.
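Average displacement error (ADE) is the headline metric here. As a minimal sketch, assuming the standard definition (mean Euclidean distance between predicted and ground-truth waypoints at matching time steps), it can be computed as follows. The waypoints are made-up values, and the paper's exact evaluation protocol may differ.

```python
import math


def average_displacement_error(predicted, ground_truth):
    """Mean Euclidean distance between corresponding 2D waypoints."""
    if len(predicted) != len(ground_truth) or not predicted:
        raise ValueError("trajectories must be non-empty and of equal length")
    total = sum(math.dist(p, g) for p, g in zip(predicted, ground_truth))
    return total / len(predicted)


# Hypothetical waypoints in meters: the prediction drifts sideways over time
ground_truth = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
predicted = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
print(average_displacement_error(predicted, ground_truth))
```

A reduction of around 3× then means the baseline ADE divided by the method's ADE is roughly 3 on the same set of trajectories.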

[96] MM-Nav: Multi-View VLA Model for Robust Visual Navigation via Multi-Expert Learning

Tianyu Xu,Jiawei Chen,Jiazhao Zhang,Wenyao Zhang,Zekun Qi,Minghan Li,Zhizheng Zhang,He Wang

Main category: cs.RO

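TL;DR: The paper proposes MM-Nav, a multi-view Vision-Language-Action (VLA) model that achieves robust visual navigation through multi-expert learning. Built on pretrained large language models and visual foundation models and trained on synthetic expert data, it shows strong generalization.

Details Motivation: Visual navigation policies attract attention because they mimic how humans navigate from visual observations, but visual information is hard to model explicitly and calls for intelligent models and large-scale data.

Contribution: Proposes MM-Nav, a multi-view VLA model that learns diverse navigation capabilities from synthetic RL expert data within a multi-expert learning framework.

Method: Builds on pretrained large language models and visual foundation models with 360-degree observations, and trains the VLA model iteratively while dynamically balancing the training ratio.

Result: In experiments in synthetic environments and the real world, MM-Nav shows strong generalization and outperforms its RL expert teachers.

Insight: Integrating multiple experts has a clear synergistic effect: multi-view data and a dynamic training strategy improve the model's navigation capability.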

Abstract: Visual navigation policy is widely regarded as a promising direction, as it mimics humans by using egocentric visual observations for navigation. However, the optical information in visual observations is difficult to model explicitly, unlike LiDAR point clouds or depth maps, and therefore requires intelligent models and large-scale data. To this end, we propose to leverage the intelligence of the Vision-Language-Action (VLA) model to learn diverse navigation capabilities from synthetic expert data in a teacher-student manner. Specifically, we implement the VLA model, MM-Nav, as a multi-view VLA (with 360° observations) based on pretrained large language models and visual foundation models. For large-scale navigation data, we collect expert data from three reinforcement learning (RL) experts trained with privileged depth information in three challenging tailor-made environments for different navigation capabilities: reaching, squeezing, and avoiding. We iteratively train our VLA model using data collected online from RL experts, where the training ratio is dynamically balanced based on performance on individual capabilities. Through extensive experiments in synthetic environments, we demonstrate that our model achieves strong generalization capability. Moreover, we find that our student VLA model outperforms the RL teachers, demonstrating the synergistic effect of integrating multiple capabilities. Extensive real-world experiments further confirm the effectiveness of our method.
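The abstract states that the training ratio is dynamically balanced based on per-capability performance but does not give the rule. One plausible scheme, shown purely as an assumption, samples more data from experts whose capability the student currently handles worst. The function `balance_training_ratio`, the `floor` parameter, and the success rates are hypothetical.

```python
def balance_training_ratio(success_rates, floor=0.05):
    """Assign each capability a sampling ratio proportional to its failure rate.

    success_rates: capability name -> student success rate in [0, 1].
    floor: minimum weight, so a mastered capability is never dropped entirely.
    """
    deficits = {name: max(1.0 - rate, floor) for name, rate in success_rates.items()}
    total = sum(deficits.values())
    return {name: deficit / total for name, deficit in deficits.items()}


# Hypothetical per-capability success rates after one training iteration
rates = {"reaching": 0.90, "squeezing": 0.60, "avoiding": 0.75}
ratios = balance_training_ratio(rates)
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}")
```

In this sketch the weakest capability (squeezing) receives the largest share of the next round's expert data, and the ratios are recomputed after every iteration.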

[97] Simulation to Rules: A Dual-VLM Framework for Formal Visual Planning

Yilun Hao,Yongchao Chen,Chuchu Fan,Yang Zhang

Main category: cs.RO

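TL;DR: VLMFP is a dual-VLM framework in which SimVLM simulates action consequences and GenVLM generates PDDL files. It addresses the difficulty vision-language models have in generating PDDL domain files and improves the precision and generalization of visual planning.

Details Motivation: Vision Language Models (VLMs) show potential for visual planning but perform poorly on precise spatial and long-horizon reasoning, while PDDL planners excel at formal planning but cannot handle visual input. Combining the two requires solving the problem that VLMs generate PDDL domain files inaccurately.

Contribution: Proposes the VLMFP framework, in which two VLMs cooperate (SimVLM simulates actions and GenVLM generates PDDL files) to produce both PDDL problem and domain files autonomously without human intervention, significantly improving the generalization of planning.

Method: The framework contains two VLMs: SimVLM simulates action effects from rule descriptions, and GenVLM generates and iteratively refines PDDL files, checking accuracy by comparing PDDL execution results against the simulated results.

Result: On 6 grid-world domains, SimVLM accurately describes 95.5% of scenarios and simulates 85.5% of action sequences for seen appearances, and the files generated by VLMFP reach 70.0% valid plans on unseen instances with seen appearances.

Insight: The dual-VLM framework not only solves the problem of generating PDDL domain files but also shows the potential of VLMs to generalize across problems and appearances, offering a new approach to formal visual planning.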

Abstract: Vision Language Models (VLMs) show strong potential for visual planning but struggle with precise spatial and long-horizon reasoning. In contrast, Planning Domain Definition Language (PDDL) planners excel at long-horizon formal planning, but cannot interpret visual inputs. Recent works combine these complementary advantages by enabling VLMs to turn visual planning problems into PDDL files for formal planning. However, while VLMs can generate PDDL problem files satisfactorily, they struggle to accurately generate the PDDL domain files, which describe all the planning rules. As a result, prior methods rely on human experts to predefine domain files or on constant environment access for refinement. We propose VLMFP, a Dual-VLM-guided framework that can autonomously generate both PDDL problem and domain files for formal visual planning. VLMFP introduces two VLMs to ensure reliable PDDL file generation: a SimVLM that simulates action consequences based on input rule descriptions, and a GenVLM that generates and iteratively refines PDDL files by comparing the PDDL and SimVLM execution results. VLMFP unleashes multiple levels of generalizability: the same generated PDDL domain file works for all the different instances under the same problem, and VLMs generalize to different problems with varied appearances and rules. We evaluate VLMFP with 6 grid-world domains and test its generalization to unseen instances, appearances, and game rules. On average, for seen and unseen appearances respectively, SimVLM accurately describes 95.5% and 82.6% of scenarios, simulates 85.5% and 87.8% of action sequences, and judges goal reaching correctly in 82.4% and 85.6% of cases. With the guidance of SimVLM, VLMFP can generate PDDL files that reach 70.0% and 54.1% valid plans for unseen instances in seen and unseen appearances, respectively. Project page: https://sites.google.com/view/vlmfp.

cs.SE [Back]

[98] When Names Disappear: Revealing What LLMs Actually Understand About Code

Cuong Chi Le,Minh V. T. Pham,Cuong Duc Van,Hoang N. Phan,Huy N. Phan,Tien N. Nguyen

Main category: cs.SE

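TL;DR: The paper studies how large language models (LLMs) understand code, finds that naming patterns affect both intent-level and execution tasks, and proposes an obfuscation approach that evaluates LLMs' semantic understanding more faithfully.

Details Motivation: LLMs perform well on code tasks, but how they understand program semantics remains unclear. The study separates the contribution of structural semantics from that of human-readable naming to reveal whether LLMs truly understand code semantics or rely on naming patterns.

Contribution: 1. Argues that code communicates through two channels, structural semantics and naming; 2. Shows that naming affects both intent-level and execution tasks; 3. Introduces ClassEval-Obf, a semantics-preserving obfuscation benchmark that evaluates LLMs' semantic understanding more faithfully.

Method: Removes naming information (obfuscates code) and analyzes LLM performance on intent-level tasks (such as summarization) and execution tasks, introducing the ClassEval-Obf benchmark to evaluate understanding of obfuscated code systematically.

Result: Obfuscation significantly degrades intent-level tasks (summaries regress to line-by-line descriptions) and even affects execution tasks, showing that current benchmarks reward memorization of naming patterns rather than semantic reasoning. ClassEval-Obf weakens memorization shortcuts and gives a more reliable evaluation.

Insight: LLMs' current understanding of code may rely too heavily on naming patterns rather than semantic structure; obfuscation is an effective tool for evaluating a model's genuine semantic reasoning.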

Abstract: Large Language Models (LLMs) achieve strong results on code tasks, but how they derive program meaning remains unclear. We argue that code communicates through two channels: structural semantics, which define formal behavior, and human-interpretable naming, which conveys intent. Removing the naming channel severely degrades intent-level tasks such as summarization, where models regress to line-by-line descriptions. Surprisingly, we also observe consistent reductions on execution tasks that should depend only on structure, revealing that current benchmarks reward memorization of naming patterns rather than genuine semantic reasoning. To disentangle these effects, we introduce a suite of semantics-preserving obfuscations and show that they expose identifier leakage across both summarization and execution. Building on these insights, we release ClassEval-Obf, an obfuscation-enhanced benchmark that systematically suppresses naming cues while preserving behavior. Our results demonstrate that ClassEval-Obf reduces inflated performance gaps, weakens memorization shortcuts, and provides a more reliable basis for assessing LLMs’ code understanding and generalization.
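To make "semantics-preserving obfuscation" concrete, the sketch below renames function names, parameters, and assigned variables to opaque identifiers with Python's `ast` module while leaving behavior unchanged. It is a deliberately simple illustration, not the authors' obfuscation suite or ClassEval-Obf: the function `obfuscate` is hypothetical, requires Python 3.9+ for `ast.unparse`, and ignores keyword arguments at call sites, attributes, classes, imports, and possible collisions with names that already look like `v0`.

```python
import ast


def obfuscate(source):
    """Rename identifiers defined in `source` to opaque names (v0, v1, ...)."""
    tree = ast.parse(source)

    # Pass 1: collect names introduced by the snippet itself
    mapping = {}
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            name = node.name
        elif isinstance(node, ast.arg):
            name = node.arg
        elif isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store):
            name = node.id
        else:
            continue
        mapping.setdefault(name, f"v{len(mapping)}")

    # Pass 2: rewrite definitions and uses; builtins such as len stay untouched
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            node.name = mapping[node.name]
        elif isinstance(node, ast.arg):
            node.arg = mapping[node.arg]
        elif isinstance(node, ast.Name) and node.id in mapping:
            node.id = mapping[node.id]

    return ast.unparse(tree), mapping


original = '''
def compute_average(values):
    total = 0
    for value in values:
        total += value
    return total / len(values)
'''

obfuscated, mapping = obfuscate(original)
print(obfuscated)
```

The obfuscated function still computes the same result, but a reader (or model) can no longer infer "average" from the identifiers and must reason from the structure alone, which is the effect the paper measures.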

q-bio.QM [Back]

[99] Glaucoma Detection and Structured OCT Report Generation via a Fine-tuned Multimodal Large Language Model

Jalil Jalili,Yashraj Gavhane,Evan Walker,Anna Heinke,Christopher Bowd,Akram Belghith,Massimo A. Fazio,Christopher A. Girkin,C. Gustavo De Moraes,Jeffrey M. Liebmann,Sally L. Baxter,Robert N. Weinreb,Linda M. Zangwill,Mark Christopher

Main category: q-bio.QM

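TL;DR: The paper presents an explainable multimodal large language model (MM-LLM) for glaucoma screening and structured OCT report generation; fine-tuning improves diagnostic accuracy and report quality.

Details Motivation: Glaucoma is one of the leading causes of irreversible blindness worldwide. Clinics need an efficient, accurate way to analyze OCT images automatically and generate structured reports, reducing the burden on physicians and improving diagnostic efficiency.

Contribution: 1. Proposes an explainable MM-LLM that assesses OCT image quality, diagnoses glaucoma, and generates structured reports; 2. Validates the model's performance on a large dataset (43,849 OCT scans).

Method: Fine-tunes the Llama 3.2 Vision-Instruct model on paired OCT images and structured reports. The model is evaluated on three tasks: image quality assessment, glaucoma detection, and RNFL thinning classification.

Result: The model reaches 0.90 accuracy and 0.98 specificity for quality assessment; glaucoma detection accuracy is 0.86, with sensitivity 0.91 and specificity 0.73; RNFL thinning prediction accuracy ranges from 0.83 to 0.94.

Insight: Multimodal large language models have great potential for medical image analysis and report generation and can significantly improve clinical efficiency and diagnostic accuracy, but their explainability and adaptability need further study.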

Abstract: Objective: To develop an explainable multimodal large language model (MM-LLM) that (1) screens optic nerve head (ONH) OCT circle scans for quality and (2) generates structured clinical reports that include glaucoma diagnosis and sector-wise retinal nerve fiber layer (RNFL) thinning assessments. Design: Retrospective cohort study of 1,310 subjects contributing 43,849 Spectralis ONH OCT circle scans (1,331 glaucomatous and 867 healthy eyes) from the DIGS and ADAGES cohorts. Methods: A MM-LLM (Llama 3.2 Vision-Instruct model) was fine-tuned to generate clinical descriptions of OCT imaging data. Training data included paired OCT images and automatically generated, structured clinical reports that described global and sectoral RNFL thinning. Poor-quality scans were labeled as unusable and paired with a fixed refusal statement. The model was evaluated on a held-out test set for three tasks: quality assessment, glaucoma detection, and RNFL thinning classification across seven anatomical sectors. Evaluation metrics included accuracy, sensitivity, specificity, precision, and F1-score. Model description quality was also evaluated using standard text evaluation metrics. Results: The model achieved 0.90 accuracy and 0.98 specificity for quality triage. For glaucoma detection, accuracy was 0.86 (sensitivity 0.91, specificity 0.73, F1-score 0.91). RNFL thinning prediction accuracy ranged from 0.83 to 0.94, with highest performance in global and temporal sectors. Text generation scores showed strong alignment with reference reports (BLEU: 0.82; ROUGE-1: 0.94; ROUGE-2: 0.87; ROUGE-L: 0.92; BERTScore-F1: 0.99). Conclusions: The fine-tuned MM-LLM generated accurate clinical descriptions based on OCT imaging. The model achieved high accuracy in identifying image quality issues and detecting glaucoma. The model also provided sectoral descriptions of RNFL thinning to help support clinical OCT evaluation.
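The evaluation metrics listed in the abstract all derive from a binary confusion matrix. As a minimal sketch of how they relate, the snippet below computes them from counts. The counts are invented for illustration and are not the study's data.

```python
def binary_metrics(tp, fp, tn, fn):
    """Standard binary classification metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)  # recall on diseased eyes
    specificity = tn / (tn + fp)  # recall on healthy eyes
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }


# Hypothetical counts: 100 glaucomatous and 100 healthy test scans
metrics = binary_metrics(tp=80, fp=10, tn=90, fn=20)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Because accuracy is a prevalence-weighted average of sensitivity and specificity, a reported accuracy that sits closer to the sensitivity than to the specificity suggests a test set with more diseased than healthy samples.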

eess.IV

Daeyoung Kim

Main category: eess.IV

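TL;DR: GCVAMD is a modified CausalVAE model that works from raw OCT images to detect age-related macular degeneration (AMD) and to model its causal mechanisms and major risk factors, namely drusen and neovascularization.

Details Motivation: AMD is a leading cause of permanent vision impairment, and current treatments cannot reverse the vision loss it causes. Deep learning methods can distinguish AMD retinas from normal ones, but they focus on prediction performance and neglect pathology and underlying causal mechanisms. GCVAMD aims to fill this gap.

Contribution: Proposes GCVAMD, which modifies CausalVAE to extract latent causal factors from raw OCT images alone, supports intervention analysis and treatment simulation, and returns latent causal features that can enhance downstream tasks.

Method: A modified CausalVAE models the causal mechanisms of AMD in latent space, with the risk factors drusen and neovascularization represented explicitly.

Result: Results show that GCVAMD identifies drusen status and neovascularization status in its latent space, and that these latent factors can be used for tasks ranging from AMD detection (classification) to intervention analysis.

Insight: Models that incorporate causality can support medical decisions more reliably, particularly for intervention analysis and treatment simulation, and offer a new direction for early diagnosis and treatment of AMD.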

Abstract: Age-Related Macular Degeneration (AMD) is one of the leading causes of permanent vision impairment in ophthalmology. Although treatments such as anti-VEGF drugs and photodynamic therapies have been developed to slow the degenerative process of AMD, there is still no specific cure to reverse the vision loss it causes. Detecting AMD, or its risk factors, in the patient's retina at an early stage is therefore crucial for reducing the possibility of vision impairment. Apart from traditional approaches, deep learning based methods, especially attention-mechanism-based CNNs and GradCAM-based XAI analysis on OCT scans, have shown successful performance in distinguishing AMD retinas from normal retinas, making it possible to use AI-driven models to aid ophthalmologists in the diagnosis and analysis of AMD. However, despite this success, previous works have mostly focused on prediction performance itself rather than on the pathologies or underlying causal mechanisms of AMD, which can prohibit intervention analysis on specific factors or even lead to less reliable decisions. This paper therefore introduces a novel causal AMD analysis model, GCVAMD, which incorporates a modified CausalVAE approach that can extract latent causal factors from raw OCT images alone. By considering causality in AMD detection, GCVAMD enables causal inference, such as treatment simulation or intervention analysis, regarding the major risk factors drusen and neovascularization, while returning informative latent causal features that can enhance downstream tasks. Results show that through GCVAMD, drusen status and neovascularization status can be identified with AMD causal mechanisms in GCVAMD latent spaces, which can in turn be used for various tasks from AMD detection (classification) to intervention analysis.

[101] Wave-GMS: Lightweight Multi-Scale Generative Model for Medical Image Segmentation

Talha Ahmed,Nehal Ahmed Shaikh,Hassan Mohy-ud-Din

Main category: eess.IV

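TL;DR: Wave-GMS is a lightweight multi-scale generative model for medical image segmentation. It targets high performance while remaining trainable on cost-effective GPUs, with few trainable parameters and support for large batch sizes.

Details Motivation: Deploying AI tools widely in healthcare requires deep segmentation networks that perform well and can be trained on GPUs with limited memory and compute.

Contribution: Proposes Wave-GMS, an efficient generative model with about 2.6M trainable parameters that does not need to load memory-intensive pretrained vision foundation models and supports large-batch training.

Method: A multi-scale generative architecture designed to keep model size and training cost low, so that segmentation models can be trained on resource-limited GPUs.

Result: Achieves state-of-the-art segmentation performance on four public datasets (BUS, BUSI, Kvasir-Instrument, HAM10000) with superior cross-domain generalizability.

Insight: Lightweight design and efficient training are key to practical deployment of medical image segmentation models, especially in resource-constrained settings.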

Abstract: For equitable deployment of AI tools in hospitals and healthcare facilities, we need Deep Segmentation Networks that offer high performance and can be trained on cost-effective GPUs with limited memory and large batch sizes. In this work, we propose Wave-GMS, a lightweight and efficient multi-scale generative model for medical image segmentation. Wave-GMS has a substantially smaller number of trainable parameters, does not require loading memory-intensive pretrained vision foundation models, and supports training with large batch sizes on GPUs with limited memory. We conducted extensive experiments on four publicly available datasets (BUS, BUSI, Kvasir-Instrument, and HAM10000), demonstrating that Wave-GMS achieves state-of-the-art segmentation performance with superior cross-domain generalizability, while requiring only ~2.6M trainable parameters. Code is available at https://github.com/ATPLab-LUMS/Wave-GMS.

cs.AI

[102] On the Role of Temperature Sampling in Test-Time Scaling

Yuheng Wu,Azalia Mirhoseini,Thierry Tambe

Main category: cs.AI

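TL;DR: The paper studies the role of temperature sampling in test-time scaling (TTS). Sampling at a single temperature explores only part of a model's potential, whereas sampling across multiple temperatures substantially improves reasoning performance. A proposed multi-temperature voting method reduces the added compute overhead.

Details Motivation: Prior work shows that increasing the number of samples K improves reasoning accuracy, but the gains saturate at large K. The authors find that different sampling temperatures solve different subsets of problems, which motivates scaling along the temperature dimension.

Contribution: 1. Shows the importance of temperature sampling in TTS; 2. Proposes temperature scaling, which substantially improves reasoning performance; 3. Designs a multi-temperature voting method that reduces compute overhead.

Method: Extends the reasoning boundary of TTS by sampling at multiple temperatures, and uses a multi-temperature voting strategy to balance performance against compute cost.

Result: Across multiple benchmarks, temperature scaling gains an average of 7.3 points over single-temperature TTS and lets base models approach the performance of RL-trained counterparts without additional post-training.

Insight: Temperature scaling is a simple and effective way to unlock the latent potential of base models, and TTS is more powerful than previously thought.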

Abstract: Large language models (LLMs) can improve reasoning at inference time through test-time scaling (TTS), where multiple reasoning traces are generated and the best one is selected. Prior work shows that increasing the number of samples K steadily improves accuracy. In this paper, we demonstrate that this trend does not hold indefinitely: at large K, further scaling yields no gains, and certain hard questions remain unsolved regardless of the number of traces. Interestingly, we find that different sampling temperatures solve different subsets of problems, implying that single-temperature scaling explores only part of a model’s potential. We therefore propose scaling along the temperature dimension, which enlarges the reasoning boundary of LLMs. Averaged over Qwen3 (0.6B, 1.7B, 4B, 8B) and five representative reasoning benchmarks (AIME 2024/2025, MATH500, LiveCodeBench, Hi-ToM), temperature scaling yields an additional 7.3 points over single-temperature TTS. Temperature scaling also enables base models to reach performance comparable to reinforcement learning (RL)-trained counterparts, without additional post-training. We further provide a comprehensive analysis of this phenomenon and design a multi-temperature voting method that reduces the overhead of temperature scaling. Overall, our findings suggest that TTS is more powerful than previously thought, and that temperature scaling offers a simple and effective way to unlock the latent potential of base models.
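To illustrate why sampling across temperatures can enlarge the set of solvable questions, here is a minimal Python sketch. It is not the authors' implementation. It assumes per-question correctness flags and answers have already been collected at each temperature; the function names are hypothetical, and the plain pooled majority vote is a stand-in, since the abstract does not specify the voting rule.

```python
from collections import Counter


def solved_set(results):
    """Return the questions solved by at least one sample.

    results: dict mapping question id -> list of bool (per-sample correctness).
    """
    return {q for q, flags in results.items() if any(flags)}


def multi_temperature_coverage(results_by_temp):
    """Union of the questions solved at any of the sampled temperatures."""
    covered = set()
    for results in results_by_temp.values():
        covered |= solved_set(results)
    return covered


def pooled_majority_vote(answers_by_temp):
    """Assumed voting rule: majority vote over answers pooled from all temperatures."""
    pooled = [a for answers in answers_by_temp.values() for a in answers]
    return Counter(pooled).most_common(1)[0][0]
```

If the union returned by `multi_temperature_coverage` is larger than every single-temperature `solved_set`, the temperatures are solving different subsets of problems, which is the observation the abstract describes.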

[103] NCV: A Node-Wise Consistency Verification Approach for Low-Cost Structured Error Localization in LLM Reasoning

Yulong Zhang,Li Wang,Wei Du,Peilin Li,Yuqin Dai,Zhiyuan Zhao,Lingyong Fang,Ziniu Liu,Ru Zhang,Huijia Zhu,Gongshen Liu

Main category: cs.AI

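TL;DR: NCV is a node-wise consistency verification method that localizes errors in LLM reasoning precisely and at low cost, improving efficiency and interpretability.

Details Motivation: Existing methods for verifying multi-step reasoning localize errors imprecisely and incur high compute cost, so a more efficient approach is needed.

Contribution: Proposes NCV, a training-free framework that verifies reasoning at the node level through lightweight binary consistency checks.

Method: Decomposes the reasoning chain into verification nodes and runs a consistency check on each node, avoiding unnecessary long-form generation.

Result: On public datasets, NCV improves F1 by 10% to 25% over baselines while using 6x to 58x fewer tokens.

Insight: Node-level verification is a practical low-cost approach that balances precision and efficiency.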

Abstract: Verifying multi-step reasoning in large language models is difficult due to imprecise error localization and high token costs. Existing methods either assess entire reasoning chains, suffering attention dilution, or rely on expensive multi-sampling. We introduce Node-wise Consistency Verification (NCV), a training-free framework that recasts verification as lightweight binary consistency checks at the node level. By decomposing the chain of thought into interconnected verification nodes, NCV precisely localizes errors and avoids unnecessary long-form generation. Experiments demonstrate that our approach enhances interpretability and efficiency, presenting a scalable solution for reliable LLM reasoning verification. On public datasets, NCV achieves a 10% to 25% improvement in F1 scores over baselines while utilizing $6\times$ to $58\times$ fewer tokens than traditional methods like CoT-based verifiers.
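The node-level idea can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the NCV implementation: the `Node` structure, `locate_first_error`, and the toy arithmetic checker are all hypothetical. In practice the binary check would presumably be a short yes/no query to an LLM rather than a rule.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Node:
    claim: str                                          # one step of the reasoning chain
    premises: List[int] = field(default_factory=list)   # indices of earlier nodes it relies on


def locate_first_error(
    nodes: List[Node],
    is_consistent: Callable[[List[str], str], bool],
) -> Optional[int]:
    """Run one binary check per node; return the index of the first failing node."""
    for i, node in enumerate(nodes):
        context = [nodes[j].claim for j in node.premises]
        if not is_consistent(context, node.claim):
            return i
    return None


def toy_arithmetic_checker(context: List[str], claim: str) -> bool:
    """Hypothetical stand-in for an LLM yes/no check: validates claims like '2+3=5'."""
    lhs, rhs = claim.split("=")
    a, b = lhs.split("+")
    return int(a) + int(b) == int(rhs)
```

Each check sees only a node and the claims it depends on, so the verifier's output can stay at a single yes/no decision per node instead of a long critique of the whole chain.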

[104] Beyond the Final Answer: Evaluating the Reasoning Trajectories of Tool-Augmented Agents

Wonjoong Kim,Sangwu Park,Yeonjun In,Sein Kim,Dongha Lee,Chanyoung Park

Main category: cs.AI

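TL;DR: The paper proposes TRACE, a framework for multi-dimensional evaluation of the problem-solving trajectories of tool-augmented LLM agents. It goes beyond answer matching to assess efficiency, hallucination, and adaptivity, relies on an evidence bank, and is validated on a new meta-evaluation dataset.

Details Motivation: Existing tool-augmented benchmarks rely mainly on answer matching and overlook aspects of the problem-solving trajectory such as efficiency, hallucination, and adaptivity. Annotating all valid ground-truth trajectories is prohibitively expensive, so a low-cost, multi-dimensional evaluation method is needed.

Contribution: 1. Proposes the TRACE framework for multi-dimensional evaluation of tool-augmented LLM agents; 2. Introduces an evidence bank that accumulates knowledge from preceding reasoning steps; 3. Builds a new meta-evaluation dataset to validate TRACE; 4. Reports previously unreported observations about agent behavior on tool-augmented tasks.

Method: TRACE uses an evidence bank that accumulates information from reasoning steps as they occur, enabling multi-dimensional evaluation of a trajectory (efficiency, hallucination, and so on). Evaluation accuracy is validated on a meta-evaluation dataset built by augmenting existing benchmarks with diverse, flawed trajectories.

Result: TRACE evaluates complex behaviors accurately and at low cost, even with small open-source LLMs. Experiments support its effectiveness for multi-dimensional analysis and yield new insights into agent behavior.

Insight: Answer matching alone is insufficient for evaluating complex tasks; accumulating evidence across steps and evaluating along multiple dimensions are key to trajectory evaluation; TRACE shows that small LLMs can serve as capable evaluators.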

Abstract: Although recent tool-augmented benchmarks incorporate complex user requests and diverse tools, the evaluation methods for most of them remain limited to answer matching. However, as the number of steps required to resolve a user request increases, a proper evaluation of an agent’s performance must go beyond the final answer to also assess the problem-solving trajectory, including previously ignored aspects such as efficiency, hallucination, and adaptivity. The most straightforward method for evaluating these aspects is to compare an agent’s trajectory with the ground-truth trajectory, but this approach is fundamentally limited since annotating all valid ground-truth trajectories is prohibitively expensive. However, a simple LLM-based evaluator struggles to assess trajectories in detail without ground truth. To effectively evaluate the agents in this manner, we introduce TRACE, a framework for the multi-dimensional evaluation of tool-augmented LLM agent performance. By incorporating an evidence bank, which accumulates knowledge gathered from preceding reasoning steps, TRACE enables a multi-faceted analysis and evaluation of an agent’s reasoning trajectory effectively. To validate our framework, we develop a new meta-evaluation dataset by augmenting existing benchmarks with diverse and flawed trajectories, each labeled with multi-faceted performance scores. Our results confirm that TRACE accurately evaluates these complex behaviors in a scalable and cost-effective manner, even with small open-source LLMs. Furthermore, we apply our method to evaluate the trajectories that agents produce while solving tool-augmented tasks, presenting previously unreported observations and their corresponding insights.

[105] Coevolutionary Continuous Discrete Diffusion: Make Your Diffusion Language Model a Latent Reasoner

Cai Zhou,Chenxiao Yang,Yi Hu,Chenyu Wang,Chubin Zhang,Muhan Zhang,Lester Mackey,Tommi Jaakkola,Stephen Bates,Dinghuai Zhang

Main category: cs.AI

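TL;DR: The paper proposes CCDD, a multimodal diffusion method that runs a joint diffusion process over continuous and discrete spaces. It addresses the underperformance of continuous diffusion models in language modeling, combining strong expressivity with good trainability.

Details Motivation: Continuous diffusion models typically underperform discrete diffusion models in language modeling, even though they are theoretically more expressive. The paper aims to resolve this gap between theory and practice.

Contribution: Proposes CCDD, which couples diffusion in continuous and discrete spaces; proves that continuous diffusion is more expressive than discrete diffusion and looped transformers; and designs effective architectures and training/sampling techniques.

Method: Defines a joint multimodal diffusion process and uses a single model to denoise simultaneously in the joint space, combining the semantic richness of continuous representations with the trainability of discrete tokens.

Result: CCDD shows strong empirical performance in extensive language modeling experiments on real-world tasks.

Insight: Continuous diffusion models are highly expressive, but the difficulty of decoding from continuous representations into discrete tokens limits their performance. Joint multimodal diffusion offers a new direction for addressing this.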

Abstract: Diffusion language models, especially masked discrete diffusion models, have achieved great success recently. While there are some theoretical and primary empirical results showing the advantages of latent reasoning with looped transformers or continuous chain-of-thoughts, continuous diffusion models typically underperform their discrete counterparts. In this paper, we argue that diffusion language models do not necessarily need to be in the discrete space. In particular, we prove that continuous diffusion models have stronger expressivity than discrete diffusions and looped transformers. We attribute the contradiction between the theoretical expressiveness and empirical performance to their practical trainability: while continuous diffusion provides intermediate supervision that looped transformers lack, they introduce additional difficulty decoding tokens into the discrete token space from the continuous representation space. We therefore propose Coevolutionary Continuous Discrete Diffusion (CCDD), which defines a joint multimodal diffusion process on the union of a continuous representation space and a discrete token space, leveraging a single model to simultaneously denoise in the joint space. By combining two modalities, CCDD is expressive with rich semantics in the latent space, as well as good trainability and sample quality with the help of explicit discrete tokens. We also propose effective architectures and advanced training/sampling techniques for CCDD, which reveals strong empirical performance in extensive language modeling experiments on real-world tasks.

cs.LG

[106] How to Train Your Advisor: Steering Black-Box LLMs with Advisor Models

Parth Asawa,Alan Zhu,Matei Zaharia,Alexandros G. Dimakis,Joseph E. Gonzalez

Main category: cs.LG

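TL;DR: The paper proposes Advisor Models, a lightweight approach that trains a small model with reinforcement learning to generate natural-language instructions for each input, steering a black-box large model and adapting to different inputs and environments.

Details Motivation: Foundation models are increasingly deployed as black-box services, so users cannot modify model weights and customization is limited to prompting. Static prompt optimization is effective but cannot adapt to changing inputs, users, or environments.

Contribution: Proposes Advisor Models, lightweight learnable policy models trained with reinforcement learning to generate in-context instructions dynamically, steering black-box model behavior toward personalization and environment adaptation.

Method: Trains a small advisor model with reinforcement learning to generate an instruction for each input, optimizing its policy from environment reward signals.

Result: On reasoning and personalization tasks, Advisor Models outperform static prompt optimizers, adapt to environment dynamics, and transfer across different black-box models.

Insight: Advisor Models provide a learnable interface to black-box systems, achieving personalization through dynamic optimization while retaining robustness to out-of-distribution inputs.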

Abstract: Foundation models are increasingly deployed as black-box services, where model weights cannot be modified and customization is limited to prompting. While static prompt optimization has shown promise, it produces a single fixed prompt that fails to adapt to different inputs, users, or environments. We introduce Advisor Models, lightweight parametric policies trained with reinforcement learning to reactively issue natural language steering instructions in-context to black-box models. The advisor is a second small model that sits between the input and the model, shaping behavior on a per-instance basis using reward signals from the environment. Across multiple domains involving reasoning and personalization, we show that Advisor Models outperform static prompt optimizers, discovering environment dynamics and improving downstream task performance. We also demonstrate the generalizability of advisors by transferring them across black-box models, as well as the framework’s ability to achieve specialization while retaining robustness to out-of-distribution inputs. Viewed more broadly, Advisor Models provide a learnable interface to black-box systems where the advisor acts as a parametric, environment-specific memory. We argue that dynamic optimization of black-box models via Advisor Models is a promising direction for enabling personalization and environment-adaptable AI with frontier-level capabilities.

[107] Beyond Imitation: Recovering Dense Rewards from Demonstrations

Jiangnan Li,Thuy-Trang Vu,Ehsan Abbasnejad,Gholamreza Haffari

Main category: cs.LG

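TL;DR: This paper recasts supervised fine-tuning (SFT) as an inverse reinforcement learning (IRL) process, showing that SFT learns not only a policy but also an implicit, dense, token-level reward model. The authors show how to extract this reward signal from the SFT model and use it for fine-grained credit assignment in reinforcement learning.

Details Motivation: SFT is conventionally viewed as simple imitation learning that only trains a policy to imitate expert behavior. This paper challenges that view and sets out to show that SFT is a form of inverse reinforcement learning that implicitly learns a dense reward model.

Contribution: 1) Proves that the SFT objective is a special case of Inverse Q-Learning, which implies that SFT implicitly learns a dense reward model; 2) Formulates a baseline-relative reward function that recovers the dense reward signal directly from the SFT model; 3) Uses this reward signal to improve the policy, yielding the Dense-Path REINFORCE method.

Method: A theoretical analysis establishes the equivalence between SFT and Inverse Q-Learning, and a baseline-relative reward function extracts the implicit dense reward. The recovered rewards are then used to further optimize the policy with reinforcement learning.

Result: Dense-Path REINFORCE consistently outperforms the original SFT models on instruction-following benchmarks, supporting the practical value of the dense reward model.

Insight: The paper reframes SFT as a reward learning mechanism rather than mere policy imitation. This view opens new ways to leverage expert demonstrations, especially for fine-grained credit assignment.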

Abstract: Conventionally, supervised fine-tuning (SFT) is treated as a simple imitation learning process that only trains a policy to imitate expert behavior on demonstration datasets. In this work, we challenge this view by establishing a fundamental equivalence between SFT and Inverse Reinforcement Learning. We prove that the SFT objective is a special case of Inverse Q-Learning, which implies that the SFT process does not just learn a policy, but also an implicit, dense, token-level reward model that explains the expert demonstrations. We then show how to recover this dense reward signal directly from the SFT model by formulating a baseline-relative reward function. The availability of such a dense reward model offers numerous benefits, providing granular credit assignment for each token generated. We demonstrate one key application by using these recovered rewards to further improve the policy with reinforcement learning. Our method, Dense-Path REINFORCE, consistently outperforms the original SFT models on instruction-following benchmarks. This work reframes SFT not merely as policy imitation but as a powerful reward learning mechanism, opening new possibilities for leveraging expert demonstrations.
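The abstract does not give the form of the baseline-relative reward. As a rough illustration only, the Python sketch below assumes one plausible reading: the per-token reward is the log-probability of the token under the SFT model minus its log-probability under some baseline model. Both that formula and the generic reward-to-go weighting are assumptions made for illustration, and all function names are hypothetical; the paper's actual definition and its Dense-Path REINFORCE update may differ.

```python
def token_rewards(sft_logprobs, baseline_logprobs):
    """Assumed dense reward: log pi_sft(token) - log pi_baseline(token), per token."""
    if len(sft_logprobs) != len(baseline_logprobs):
        raise ValueError("log-prob sequences must have the same length")
    return [s - b for s, b in zip(sft_logprobs, baseline_logprobs)]


def returns_to_go(rewards, gamma=1.0):
    """Generic REINFORCE-style weight: discounted sum of rewards from each token onward."""
    out = []
    running = 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        out.append(running)
    return out[::-1]
```

Under this reading, every generated token receives its own reward, which is what makes token-level credit assignment possible, in contrast to a single sequence-level score.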

[108] A Granular Study of Safety Pretraining under Model Abliteration

Shashank Agnihotri,Jonas Jakubassa,Priyam Dey,Sachin Goyal,Bernt Schiele,Venkatesh Babu Radhakrishnan,Margret Keuper

Main category: cs.LG

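TL;DR: The paper studies how model abliteration affects safety pretraining. It evaluates 20 systems, original and abliterated, across a sequence of checkpoints and outlines a practical protocol for incorporating inference-time edits into safety assessments.

Details Motivation: Open-weight LLMs can have their inference-time behavior modified with simple activation edits, which poses a challenge for safety. The paper asks whether common safety interventions, such as refusal training or metatag training, survive model abliteration.

Contribution: 1. Studies model abliteration, a lightweight projection technique that removes refusal-sensitive directions; 2. Provides a fine-grained evaluation across Safety Pretraining checkpoints; 3. Quantifies how judge selection influences evaluation outcomes; 4. Outlines a safety evaluation protocol that incorporates inference-time edits.

Method: 1. Apply model abliteration to remove refusal-sensitive directions; 2. Run a controlled evaluation on SmolLM2-1.7B checkpoints and widely used open baselines; 3. Classify responses as Refusal or Non-Refusal using multiple judges, validated on a small human-labeled subset; 4. Probe whether models can identify refusal in their own outputs.

Result: Some data-centric safety components remain robust under abliteration, and the choice of judge influences evaluation outcomes.

Insight: 1. Safety interventions may be weakened by abliteration; 2. The design of the evaluation protocol matters for the conclusions; 3. Human validation improves the reliability of judge-based evaluation.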

Abstract: Open-weight LLMs can be modified at inference time with simple activation edits, which raises a practical question for safety: do common safety interventions like refusal training or metatag training survive such edits? We study model abliteration, a lightweight projection technique designed to remove refusal-sensitive directions, and conduct a controlled evaluation across a granular sequence of Safety Pretraining checkpoints for SmolLM2-1.7B, alongside widely used open baselines. For each of 20 systems, original and abliterated, we issue 100 prompts with balanced harmful and harmless cases, classify responses as Refusal or Non-Refusal using multiple judges, and validate judge fidelity on a small human-labeled subset. We also probe whether models can identify refusal in their own outputs. Our study produces a checkpoint-level characterization of which data-centric safety components remain robust under abliteration, quantifies how judge selection influences evaluation outcomes, and outlines a practical protocol for integrating inference-time edits into safety assessments. Code: https://github.com/shashankskagnihotri/safety_pretraining.
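The projection step that the abstract describes can be illustrated with a small Python sketch. It assumes the standard form of directional ablation, in which the component of a hidden vector along a given direction is subtracted. The function name is hypothetical, and how the refusal-sensitive direction is estimated is outside the scope of the sketch.

```python
import math


def project_out(hidden, direction):
    """Remove the component of `hidden` along `direction`: h - (h . r_hat) * r_hat."""
    norm = math.sqrt(sum(d * d for d in direction))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    unit = [d / norm for d in direction]
    coeff = sum(h * u for h, u in zip(hidden, unit))
    return [h - coeff * u for h, u in zip(hidden, unit)]
```

After the edit, the activation has no component along the chosen direction, so any downstream computation that relied on that direction no longer receives a signal from it.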

[109] Low-probability Tokens Sustain Exploration in Reinforcement Learning with Verifiable Reward

Guanhua Huang,Tingqiang Xu,Mingze Wang,Qi Yi,Xue Gong,Siheng Li,Ruibin Xiong,Kejiao Li,Yuhao Jiang,Bo Zhou

Main category: cs.LG

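TL;DR: The paper studies exploration dynamics in reinforcement learning and proposes Low-probability Regularization (Lp-Reg), which improves performance by protecting low-probability tokens that are valuable for exploration ("reasoning sparks").

Details Motivation: In Reinforcement Learning with Verifiable Rewards (RLVR), degraded exploration creates a performance bottleneck, and conventional approaches that maintain high entropy do not resolve it effectively. The authors find that low-probability tokens play an important role in exploration but are over-penalized by existing methods.

Contribution: 1. Identifies a cause of exploration degradation in RLVR: low-probability but valuable tokens (reasoning sparks) are gradually eliminated; 2. Proposes Lp-Reg, which protects these tokens by constructing a heuristic proxy distribution.

Method: Lp-Reg regularizes the policy toward a heuristic proxy distribution obtained by filtering out presumed noise tokens and re-normalizing. A KL-divergence term keeps valuable tokens from being over-penalized.

Result: Lp-Reg sustains stable on-policy training for around 1,000 steps and reaches 60.17% average accuracy on five math benchmarks, an improvement of 2.66% over prior methods.

Insight: Low-probability tokens play a significant role in exploration. Indiscriminate entropy control can miss that value and amplify noise tokens instead; Lp-Reg addresses this by filtering noise and regularizing toward the filtered distribution.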

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has propelled Large Language Models in complex reasoning, yet its scalability is often hindered by a training bottleneck where performance plateaus as policy entropy collapses, signaling a loss of exploration. Previous methods typically address this by maintaining high policy entropy, yet the precise mechanisms that govern meaningful exploration have remained underexplored. Our analysis suggests that an unselective focus on entropy risks amplifying irrelevant tokens and destabilizing training. This paper investigates the exploration dynamics within RLVR and identifies a key issue: the gradual elimination of valuable low-probability exploratory tokens, which we term reasoning sparks. We find that while abundant in pre-trained models, these sparks are systematically extinguished during RLVR due to over-penalization, leading to a degeneracy in exploration. To address this, we introduce Low-probability Regularization (Lp-Reg). Its core mechanism regularizes the policy towards a heuristic proxy distribution. This proxy is constructed by filtering out presumed noise tokens and re-normalizing the distribution over the remaining candidates. The result is a less-noisy proxy where the probability of reasoning sparks is amplified, which then serves as a soft regularization target to shield these valuable tokens from elimination via KL divergence. Experiments show that Lp-Reg enables stable on-policy training for around 1,000 steps, a regime where baseline entropy-control methods collapse. This sustained exploration leads to state-of-the-art performance, achieving a 60.17% average accuracy on five math benchmarks, an improvement of 2.66% over prior methods. Code is available at https://github.com/CarlanLark/Lp-Reg.
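The proxy construction (filter presumed noise, then re-normalize) can be sketched in Python as follows. This is a simplified illustration, not the Lp-Reg code from the linked repository. The fixed probability threshold used as the noise filter, the direction of the KL term, and the function names are all assumptions made for this example.

```python
import math


def proxy_distribution(probs, noise_threshold):
    """Drop tokens below an assumed noise threshold, then re-normalize the rest.

    probs: dict mapping token -> probability under the current policy.
    """
    kept = {tok: p for tok, p in probs.items() if p >= noise_threshold}
    if not kept:
        raise ValueError("threshold removed every token")
    total = sum(kept.values())
    return {tok: p / total for tok, p in kept.items()}


def kl_divergence(p, q):
    """KL(p || q) over the support of p; q must be positive wherever p is."""
    return sum(pv * math.log(pv / q[tok]) for tok, pv in p.items() if pv > 0.0)
```

Re-normalizing over the surviving tokens raises the probability of every token that passes the filter, including low-probability ones, so a regularizer that pulls the policy toward this proxy pushes back against their elimination.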

cs.IR

[110] Less LLM, More Documents: Searching for Improved RAG

Jingjie Ning,Yibo Kong,Yunfan Long,Jamie Callan

Main category: cs.IR

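TL;DR: This paper explores enlarging the retriever's corpus to reduce reliance on large language models (LLMs), showing that corpus scaling improves retrieval-augmented generation (RAG) and can often substitute for increasing model size.

Details Motivation: RAG systems typically rely on large LLMs for accuracy, which raises cost and limits deployability. The authors aim to reduce that reliance by enlarging the retriever's corpus, lowering cost and improving practicality.

Contribution: Proposes and validates an orthogonal scaling axis: improving RAG by enlarging the retrieval corpus rather than the LLM, and characterizes the trade-off between corpus size and generator size.

Method: Experiments pair generators of different sizes with corpora of different sizes, measure the effect of corpus scaling on RAG performance, and analyze the mechanism behind the gains (such as coverage of more answer-bearing passages).

Result: Small- and mid-sized generators paired with larger corpora often rival much larger models with smaller corpora, and the gains from corpus scaling diminish at larger scales.

Insight: Corpus scaling improves performance mainly by increasing coverage of answer-bearing passages, while utilization efficiency remains largely unchanged. This points to a design trade-off for RAG systems between corpus size and generator size.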

Abstract: Retrieval-Augmented Generation (RAG) couples document retrieval with large language models (LLMs). While scaling generators improves accuracy, it also raises cost and limits deployability. We explore an orthogonal axis: enlarging the retriever’s corpus to reduce reliance on large LLMs. Experimental results show that corpus scaling consistently strengthens RAG and can often serve as a substitute for increasing model size, though with diminishing returns at larger scales. Small- and mid-sized generators paired with larger corpora often rival much larger models with smaller corpora; mid-sized models tend to gain the most, while tiny and large models benefit less. Our analysis shows that improvements arise primarily from increased coverage of answer-bearing passages, while utilization efficiency remains largely unchanged. These findings establish a principled corpus-generator trade-off: investing in larger corpora offers an effective path to stronger RAG, often comparable to enlarging the LLM itself.
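The distinction between coverage and utilization can be made concrete with a short Python sketch. The definitions here are assumptions chosen for illustration and may not match the paper's exact metrics: coverage is taken as the fraction of questions whose retrieved passages contain an answer, and utilization as accuracy restricted to those covered questions. The function name and record format are hypothetical.

```python
def coverage_and_utilization(records):
    """Compute (coverage, utilization) from per-question records.

    records: list of (covered, correct) boolean pairs, where `covered` means an
    answer-bearing passage was retrieved and `correct` means the generator's
    final answer was right.
    """
    if not records:
        raise ValueError("records must not be empty")
    covered = [rec for rec in records if rec[0]]
    coverage = len(covered) / len(records)
    if not covered:
        return coverage, 0.0
    utilization = sum(1 for _, correct in covered if correct) / len(covered)
    return coverage, utilization
```

Under these definitions, a larger corpus that raises coverage while utilization stays flat would match the pattern the abstract reports: the gain comes from retrieving more answer-bearing passages, not from the generator using them better.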