Table of Contents

cs.CL

[1] Trainable Reference-Based Evaluation Metric for Identifying Quality of English-Gujarati Machine Translation System

Nisheeth Joshi,Pragya Katyayan,Palak Arora

Main category: cs.CL

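TL;DR: The paper proposes a supervised-learning-based evaluation metric for English-Gujarati machine translation. Designed around the characteristics of Indian languages, it comes in two versions (6 and 10 hidden layers), and on 1,000 MT outputs compared against human reference translations it correlates better with human judgments than existing metrics.

Motivation: Existing MT evaluation metrics were developed mainly for European languages and work poorly for Indian languages, so an evaluation method tailored to Gujarati is needed.

Contribution: Proposes a supervised evaluation metric built on 25 features, trains two neural network versions of it, and shows that it correlates better with human evaluation than other available metrics.

Method: Two neural network models are trained on 25 features (6 and 10 hidden layers, 500 epochs each) and tested on 1,000 MT outputs collected from seven MT systems, each compared against one human reference translation.

Result: Compared with existing metrics, the models correlate better with human judgments.

Insight: Evaluation metrics for specific languages (such as Indian languages) need customized design, and supervised learning can capture the complexity of translation quality.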

Abstract: Machine Translation (MT) Evaluation is an integral part of the MT development life cycle. Without analyzing the outputs of MT engines, it is impossible to evaluate the performance of an MT system. Through experiments, it has been identified that what works for English and other European languages does not work well with Indian languages. Thus, in this paper, we introduce a reference-based MT evaluation metric for Gujarati which is based on supervised learning. We have trained two versions of the metric, both of which use 25 features for training. Of the two models, one is trained using 6 hidden layers with 500 epochs, while the other is trained using 10 hidden layers with 500 epochs. To test the performance of the metric, we collected 1000 MT outputs of seven MT systems. These MT engine outputs were compared with one human reference translation. When the developed metrics were compared with other available metrics, they were found to produce better human correlations.

[2] Towards Structured Knowledge: Advancing Triple Extraction from Regional Trade Agreements using Large Language Models

Durgesh Nandini,Rebekka Koch,Mirco Schoenfeld

Main category: cs.CL

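TL;DR: This paper studies how effectively large language models (LLMs) extract subject-predicate-object triples from regional trade agreement texts, comparing zero-shot, one-shot, and few-shot prompting.

Motivation: The study addresses the challenge of extracting structured knowledge (such as trade-related information triples) from natural-language legal texts, in support of building economic trade knowledge graphs.

Contribution: (1) Applies an LLM (Llama 3.1) to extract triples from regional trade agreements; (2) evaluates zero-shot, one-shot, and few-shot prompting; (3) provides a practical case of structured knowledge extraction in the economics domain.

Method: Uses the Llama 3.1 model with zero-shot, one-shot, and few-shot prompting (including positive and negative examples) to extract triples from unstructured regional trade agreement texts.

Result: Extraction performance is evaluated with quantitative and qualitative metrics, indicating the potential of LLMs for knowledge extraction in economics.

Insight: LLMs can extract structured knowledge, though challenges remain; the paper stresses the importance of language models in economic applications and outlines future research directions.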

Abstract: This study investigates the effectiveness of Large Language Models (LLMs) for the extraction of structured knowledge in the form of Subject-Predicate-Object triples. We apply the setup to the domain of economics. The findings can be applied to a wide range of scenarios, including the creation of economic trade knowledge graphs from natural language legal trade agreement texts. As a use case, we apply the model to regional trade agreement texts to extract trade-related information triples. In particular, we explore the zero-shot, one-shot and few-shot prompting techniques, incorporating positive and negative examples, and evaluate their performance based on quantitative and qualitative metrics. Specifically, we used the Llama 3.1 model to process the unstructured regional trade agreement texts and extract triples. We discuss key insights, challenges, and potential future directions, emphasizing the significance of language models in economic applications.

[3] CARE: Cognitive-reasoning Augmented Reinforcement for Emotional Support Conversation

Jie Zhu,Yuanchen Zhou,Shuo Jiang,Junhui Li,Lifan Guo,Feng Chen,Chi Zhang,Fang Kong

Main category: cs.CL

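TL;DR: This paper proposes CARE, a framework that improves the logical soundness and supportiveness of emotional support conversation systems by strengthening cognitive reasoning and applying reinforcement learning, without relying on large-scale synthetic data.

Motivation: Existing work focuses on data augmentation and synthetic corpus construction but overlooks the deeper cognitive reasoning processes that underpin effective emotional support.

Contribution: CARE strengthens cognitive reasoning and combines it with reinforcement learning to improve the logical coherence and supportive quality of emotional support conversations.

Method: Uses the original ESC training set to guide the model toward logically coherent, supportive responses, then applies reinforcement learning to further refine the reasoning process.

Result: Experiments show that CARE significantly improves the logical soundness and supportive quality of responses, advancing more empathetic and human-like emotional support systems.

Insight: Combining cognitive reasoning with reinforcement learning is an effective way to improve the emotional support ability of dialogue systems without relying on large-scale synthetic data.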

Abstract: Emotional Support Conversation (ESC) plays a vital role in alleviating psychological stress and providing emotional value through dialogue. While recent studies have largely focused on data augmentation and synthetic corpus construction, they often overlook the deeper cognitive reasoning processes that underpin effective emotional support. To address this gap, we propose CARE, a novel framework that strengthens reasoning in ESC without relying on large-scale synthetic data. CARE leverages the original ESC training set to guide models in generating logically coherent and supportive responses, thereby explicitly enhancing cognitive reasoning. Building on this foundation, we further employ reinforcement learning to refine and reinforce the reasoning process. Experimental results demonstrate that CARE significantly improves both the logical soundness and supportive quality of responses, advancing the development of empathetic, cognitively robust, and human-like emotional support systems.

[4] Catalog-Native LLM: Speaking Item-ID Dialect with Less Entanglement for Recommendation

Reza Shirkavand,Xiaokai Wei,Chen Wang,Zheng Hui,Heng Huang,Michelle Gong

Main category: cs.CL

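TL;DR: The paper proposes a unified recommendation approach that combines collaborative filtering with Large Language Models (LLMs). Its IDIOMoE model treats item interaction histories as a "dialect" within the language space and avoids interference between the text and catalog modalities.

Motivation: User expectations for natural-language queries and transparent explanations keep growing. Collaborative filtering and LLMs each have strengths and weaknesses, so a unified approach that combines them is needed.

Contribution: Proposes IDIOMoE, which separates a text expert from an item expert to avoid modality interference while preserving the text understanding of the pretrained model.

Method: In each block of a pretrained LLM, the Feed Forward Network is split into a text expert and an item expert, selected by token-type gating.

Result: IDIOMoE shows strong recommendation performance on both public and proprietary datasets while preserving text understanding.

Insight: Treating item interaction histories as part of the language space is an effective way to combine collaborative signals with natural language understanding.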

Abstract: While collaborative filtering delivers predictive accuracy and efficiency, and Large Language Models (LLMs) enable expressive and generalizable reasoning, modern recommendation systems must bring these strengths together. Growing user expectations, such as natural-language queries and transparent explanations, further highlight the need for a unified approach. However, doing so is nontrivial. Collaborative signals are often token-efficient but semantically opaque, while LLMs are semantically rich but struggle to model implicit user preferences when trained only on textual inputs. This paper introduces Item-ID + Oral-language Mixture-of-Experts Language Model (IDIOMoE), which treats item interaction histories as a native dialect within the language space, enabling collaborative signals to be understood in the same way as natural language. By splitting the Feed Forward Network of each block of a pretrained LLM into a separate text expert and an item expert with token-type gating, our method avoids destructive interference between text and catalog modalities. IDIOMoE demonstrates strong recommendation performance across both public and proprietary datasets, while preserving the text understanding of the pretrained model.
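
The split described in the abstract can be pictured with a minimal sketch. It assumes hard routing, where every position goes to exactly one expert according to a per-token flag; the abstract only says "token-type gating", so the routing rule, the tiny two-layer experts, and all names below (`make_expert`, `token_type_gated_ffn`) are illustrative assumptions rather than the authors' implementation.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def linear(x: Sequence[float], weight: Sequence[Sequence[float]]) -> Vector:
    """Multiply a weight matrix (one row per output unit) by an input vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def make_expert(w_in, w_out) -> Callable[[Sequence[float]], Vector]:
    """Build a tiny feed-forward expert: linear -> ReLU -> linear (hypothetical)."""
    def expert(x: Sequence[float]) -> Vector:
        hidden = [max(0.0, h) for h in linear(x, w_in)]
        return linear(hidden, w_out)
    return expert


def token_type_gated_ffn(hidden_states, is_item_token, text_expert, item_expert):
    """Route each position to one expert based on its token type.

    Assumption: hard gating. Item-ID tokens use the item expert and all
    other tokens use the text expert, so the two never share FFN weights.
    """
    if len(hidden_states) != len(is_item_token):
        raise ValueError("one token-type flag is required per position")
    return [
        item_expert(h) if is_item else text_expert(h)
        for h, is_item in zip(hidden_states, is_item_token)
    ]


if __name__ == "__main__":
    identity = [[1.0, 0.0], [0.0, 1.0]]
    doubled = [[2.0, 0.0], [0.0, 2.0]]
    text_expert = make_expert(identity, identity)
    item_expert = make_expert(identity, doubled)
    states = [[1.0, 2.0], [1.0, 2.0]]
    print(token_type_gated_ffn(states, [False, True], text_expert, item_expert))
```

Because text tokens never pass through the item expert in this sketch, updating the item expert cannot overwrite what the text path has learned, which is one way to read the abstract's claim about avoiding destructive interference.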

[5] Improving Metacognition and Uncertainty Communication in Language Models

Mark Steyvers,Catarina Belem,Padhraic Smyth

Main category: cs.CL

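TL;DR: The paper studies whether supervised finetuning can improve the metacognitive abilities of language models, specifically how they communicate uncertainty (confidence calibration and discrimination between correct and incorrect answers). Multitask training yields broader generalization, whereas the gains from single-task training do not transfer between tasks.

Motivation: Large language models are widely used in decision-making contexts, but their stated confidence is often miscalibrated, which can lead users to act on wrong answers. The study aims to improve uncertainty communication through finetuning.

Contribution: 1. Shows that supervised finetuning improves the confidence calibration and discrimination of language models; 2. Finds that multitask training yields gains that generalize across tasks and domains.

Method: 1. Finetunes models on general knowledge, mathematics, and open-ended trivia datasets; 2. Evaluates two tasks, single-question confidence estimation and pairwise confidence comparison; 3. Tests generalization to the medical and legal domains.

Result: Finetuning improves calibration and discrimination within and across domains while leaving accuracy unchanged. Multitask training generalizes better than single-task training.

Insight: Improving uncertainty communication requires multitask training, because the gains from single-task training do not transfer on their own. Metacognitive skills are distinct and must be developed together.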

Abstract: Large language models (LLMs) are increasingly used in decision-making contexts, but when they present answers without signaling low confidence, users may unknowingly act on erroneous outputs. While prior work shows that LLMs maintain internal uncertainty signals, their explicit verbalized confidence is typically miscalibrated and poorly discriminates between correct and incorrect answers. Across two types of LLMs, we investigate whether supervised finetuning can improve models’ ability to communicate uncertainty and whether such improvements generalize across tasks and domains. We finetune the LLMs on datasets spanning general knowledge, mathematics, and open-ended trivia, and evaluate two metacognitive tasks: (1) single-question confidence estimation, where the model assigns a numeric certainty to its answer, and (2) pairwise confidence comparison, where the model selects which of two answers it is more likely to have correct. We assess generalization to unseen domains, including medical and legal reasoning. Results show that finetuning improves calibration (alignment between stated confidence and accuracy) and discrimination (higher confidence for correct vs. incorrect responses) within and across domains, while leaving accuracy unchanged. However, improvements are task-specific: training on single-question calibration does not transfer to pairwise comparison, and vice versa. In contrast, multitask finetuning on both forms of metacognition yields broader gains, producing lower calibration error and stronger discrimination in out-of-domain evaluations. These results show that while uncertainty communication in LLMs is trainable and generalizable, different metacognitive skills do not naturally reinforce one another and must be developed together through multitask training.
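
The two quantities named in the abstract, calibration and discrimination, can be made concrete with a small sketch. The abstract does not say which estimators the authors use, so the binned expected calibration error and the AUC-style pairwise discrimination score below are common choices assumed here for illustration; the function names are hypothetical.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted mean of |accuracy - mean confidence| over bins.

    Assumption: confidences lie in [0, 1] and bins have equal width.
    """
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("need one correctness flag per confidence")
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes to the last bin
        bins[index].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for items in bins:
        if not items:
            continue
        mean_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        ece += (len(items) / total) * abs(accuracy - mean_conf)
    return ece


def discrimination_auc(confidences, correct):
    """Probability that a correct answer gets higher confidence than an incorrect one.

    Ties count as half a win. 0.5 means no discrimination, 1.0 is perfect.
    """
    pos = [c for c, ok in zip(confidences, correct) if ok]
    neg = [c for c, ok in zip(confidences, correct) if not ok]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect answer")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    stated = [0.9, 0.9, 0.6, 0.6]
    was_correct = [True, True, True, False]
    print(expected_calibration_error(stated, was_correct))
    print(discrimination_auc(stated, was_correct))
```

The two scores are independent: a model can be well calibrated on average yet assign similar confidence to right and wrong answers, which is why the abstract reports them separately.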

[6] Advancing Automated Spatio-Semantic Analysis in Picture Description Using Language Models

Si-Ioi Ng,Pranav S. Ambadi,Kimberly D. Mueller,Julie Liss,Visar Berisha

Main category: cs.CL

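TL;DR: The paper proposes a BERT-based automated method for extracting and ordering content information units (CIUs) from picture descriptions. Fine-tuned with binary cross-entropy and pairwise ranking loss, it achieves high accuracy in CIU detection and ordering.

Motivation: Existing methods for assessing cognitive-linguistic impairment through picture description often neglect the visual narrative path (the sequence and locations of the elements described). Manual tagging and dictionary-based mapping are labor-intensive, so an automated solution is needed.

Contribution: 1. Proposes a BERT-based automated pipeline for CIU extraction and ordering; 2. Fine-tunes the model with binary cross-entropy and pairwise ranking loss; 3. Achieves high-accuracy CIU detection and ordering and validates the approach for cognitive impairment assessment.

Method: 1. Fine-tunes a BERT model; 2. Optimizes with a combination of binary cross-entropy and pairwise ranking loss; 3. Evaluates performance with 5-fold cross-validation.

Result: The model reaches 93% median precision and 96% median recall in CIU detection, with a 24% sequence error rate. The extracted features correlate strongly with manually annotated ones and perform comparably in ANCOVA analysis.

Insight: The method makes automated analysis more efficient and offers a new tool for cognitive impairment assessment, advancing research based on the visual narrative path.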

Abstract: Current methods for automated assessment of cognitive-linguistic impairment via picture description often neglect the visual narrative path: the sequence and locations of elements a speaker described in the picture. Analyses of spatio-semantic features capture this path using content information units (CIUs), but manual tagging or dictionary-based mapping is labor-intensive. This study proposes a BERT-based pipeline, fine-tuned with binary cross-entropy and pairwise ranking loss, for automated CIU extraction and ordering from the Cookie Theft picture description. Evaluated by 5-fold cross-validation, it achieves 93% median precision and 96% median recall in CIU detection, and a 24% sequence error rate. The proposed method extracts features that exhibit strong Pearson correlations with ground truth, surpassing the dictionary-based baseline in external validation. These features also perform comparably to those derived from manual annotations in evaluating group differences via ANCOVA. The pipeline is shown to effectively characterize visual narrative paths for cognitive impairment assessment, with the implementation and models open-sourced to the public.
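
A rough sketch of how the two training objectives named in the abstract could be combined. It assumes one detection logit and one order score per CIU, a hinge-style margin for the ranking term, and a simple weighted sum of the two losses; the abstract does not give the exact formulation, so these choices and the names (`bce_loss`, `pairwise_ranking_loss`, `combined_loss`, `alpha`) are illustrative only.

```python
import math


def bce_loss(logits, labels):
    """Mean binary cross-entropy on raw logits (numerically stable form)."""
    total = 0.0
    for z, y in zip(logits, labels):
        # Equivalent to -[y*log(sigmoid(z)) + (1-y)*log(1-sigmoid(z))].
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)


def pairwise_ranking_loss(scores, positions, margin=1.0):
    """Hinge loss over CIU pairs: a CIU mentioned earlier should score lower.

    `positions` holds the reference mention order; `scores` are predictions.
    """
    losses = []
    for i in range(len(scores)):
        for j in range(len(scores)):
            if positions[i] < positions[j]:
                losses.append(max(0.0, margin - (scores[j] - scores[i])))
    return sum(losses) / len(losses) if losses else 0.0


def combined_loss(det_logits, det_labels, order_scores, positions, alpha=1.0):
    """Detection loss plus alpha times the ordering loss (assumed weighting)."""
    return bce_loss(det_logits, det_labels) + alpha * pairwise_ranking_loss(
        order_scores, positions
    )


if __name__ == "__main__":
    # Three CIUs mentioned in reference order 0, 1, 2.
    print(combined_loss([2.0, 1.5, -1.0], [1, 1, 0], [0.0, 2.0, 4.0], [0, 1, 2]))
```

The detection term decides whether a CIU was mentioned at all, while the ranking term only compares pairs, so it constrains relative order without requiring the scores to match absolute positions.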

[7] Training Large Language Models To Reason In Parallel With Global Forking Tokens

Sheng Jia,Xiao Wang,Shiva Prasad Kasiviswanathan

Main category: cs.CL

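TL;DR: This paper proposes a new supervised fine-tuning method (SSFT) that uses global forking tokens and a bipartite matching loss over multiple reasoning paths to improve both the diversity and the accuracy of large language models (LLMs) in parallel reasoning.

Motivation: When LLMs improve performance by scaling parallel test-time compute, they face a trade-off between the diversity and the accuracy of reasoning paths, especially on hard problems where forking tokens typically sit deep in the sampling tree. Common ways to encourage diversity, such as temperature scaling, struggle to deliver both.

Contribution: Proposes Set Supervised Fine-Tuning (SSFT), which frames training as set-of-next-token prediction and uses a self-supervised bipartite matching loss to preserve distinct reasoning modes during SFT and produce global forking tokens.

Method: Treats parallel reasoning as a set-of-next-token-prediction problem and aligns global forking tokens with unique reasoning traces through self-supervised bipartite matching, avoiding the mode collapse caused by naive fine-tuning.

Result: On multiple reasoning benchmarks, SSFT outperforms standard SFT under both Pass@1 and Cons@k.

Insight: Global forking tokens combined with set-level supervision can balance the diversity and accuracy of reasoning, which matters most on complex reasoning tasks.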

Abstract: Although LLMs have demonstrated improved performance by scaling parallel test-time compute, doing so relies on generating reasoning paths that are both diverse and accurate. For challenging problems, the forking tokens that trigger diverse yet correct reasoning modes are typically deep in the sampling tree. Consequently, common strategies to encourage diversity, such as temperature scaling, encounter a worsened trade-off between diversity and accuracy. Motivated by this challenge, we treat parallel reasoning as a set-of-next-token-prediction problem, and incorporate a set-based global loss into Supervised Fine-Tuning (SFT) using self-supervised bipartite matching between our global forking tokens and unique reasoning traces. We observe that, while naive fine-tuning with multiple reasoning traces collapses these unique reasoning modes, our proposed method, Set Supervised Fine-Tuning (SSFT), preserves these modes and produces emergent global forking tokens. Experiments on multiple reasoning benchmarks show that our SSFT consistently outperforms SFT under both Pass@1 and Cons@k metrics.
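
To illustrate the matching step mentioned in the abstract, the sketch below assigns each reasoning trace to a distinct forking token so that the summed cost is minimal. It assumes a precomputed cost matrix (for example, the training loss of trace j when conditioned on token i) and uses brute-force search over permutations, which is only practical for a handful of tokens; the paper's actual cost definition and solver are not given in the abstract, and `min_cost_matching` is a hypothetical name.

```python
from itertools import permutations


def min_cost_matching(cost):
    """Return (assignment, total) for the cheapest token-to-trace matching.

    cost[i][j] is the cost of pairing forking token i with reasoning trace j.
    assignment[j] is the token index chosen for trace j; tokens are not reused.
    Brute force is a stand-in for a polynomial-time solver such as the
    Hungarian algorithm.
    """
    n_tokens = len(cost)
    n_traces = len(cost[0])
    if n_traces > n_tokens:
        raise ValueError("need at least as many forking tokens as traces")
    best_total = float("inf")
    best_assignment = None
    for perm in permutations(range(n_tokens), n_traces):
        total = sum(cost[perm[j]][j] for j in range(n_traces))
        if total < best_total:
            best_total = total
            best_assignment = list(perm)
    return best_assignment, best_total


if __name__ == "__main__":
    # Three candidate forking tokens, two distinct reasoning traces.
    cost = [
        [0.2, 1.5],
        [1.0, 0.3],
        [0.9, 0.8],
    ]
    assignment, total = min_cost_matching(cost)
    print(assignment, total)
```

Under this reading, the set-based loss would then be the sum of the matched costs, so each trace is supervised under the token it fits best instead of all traces being averaged under every token.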

[8] Characterizing Model Behavior Under Synthetic Data Training: An Empirical Study Across Scales and Mixing Ratios

Y. Du,G. Wu,G. Tang,W. Wang,Q. Fan

Main category: cs.CL

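TL;DR: This empirical study examines how the proportion of synthetic data affects model performance, calibration, and output characteristics across model scales and tasks. It identifies thresholds for the synthetic data ratio, shows how model scale and task type matter, and offers practical recommendations.

Motivation: Synthetic data is increasingly important in modern NLP training, but the effect of its proportion on model behavior has not been studied systematically. The paper aims to fill this gap and clarify the limits of synthetic data use.

Contribution: 1) Identifies a threshold for synthetic data (performance is stable up to 20%); 2) finds that larger models are more robust to synthetic data; 3) shows that calibration degradation is an early signal of performance loss; 4) provides guidance on synthetic data budgets by model scale and task.

Method: Uses the Pythia model suite (410M-12B parameters) on five tasks, with one to three training iterations and synthetic data proportions from 0-50%, evaluating performance, calibration, and output characteristics.

Result: 1) Degradation accelerates once synthetic data exceeds 30%; 2) larger models (6.9B-12B) are more robust; 3) calibration degradation precedes accuracy loss; 4) reasoning tasks degrade faster than retrieval tasks under synthetic data training.

Insight: Synthetic data should be used with care, with the ratio adjusted to model scale and task characteristics. Current best practices such as STaR and Self-Instruct, which keep synthetic data below 20%, fall within the safe regime.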

Abstract: Synthetic data generated by large language models has become integral to modern NLP training pipelines, from bootstrapping reasoning capabilities to augmenting instruction-following datasets. While recent work demonstrates successful applications maintaining high external data ratios, systematic understanding of how synthetic data proportion affects model behavior across different scales remains limited. This paper presents a controlled empirical study examining model performance, calibration, and output characteristics when trained on varying synthetic-to-external data ratios. Using the Pythia model suite (410M-12B parameters) across five diverse tasks, we evaluate models after one to three training iterations with synthetic data proportions ranging from 0-50%. Our key findings include: models maintain stable performance with up to 20% synthetic data, but degradation accelerates beyond 30%; larger models (6.9B-12B) show greater robustness to synthetic data than smaller models (410M-1.4B); calibration degradation precedes accuracy loss, providing an early warning signal; and task characteristics matter, with reasoning tasks degrading faster than retrieval tasks under synthetic data training. Importantly, we find that current best practices, such as those employed in STaR and Self-Instruct systems that maintain greater than 80% external data, operate well within safe regimes identified by our experiments. We provide practical guidance for practitioners on synthetic data budgets based on model scale and task requirements, alongside detailed comparison with concurrent work including Shumailov et al.’s model collapse findings.

[9] Curiosity-Driven LLM-as-a-judge for Personalized Creative Judgment

Vanya Bannihatti Kumar,Divyanshu Goyal,Akhil Eppa,Neel Bhandari

Main category: cs.CL

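TL;DR: The paper proposes a curiosity-driven LLM-as-a-judge method for personalized evaluation of creative writing, addressing the weakness of LLMs in subjective creative assessment.

Motivation: Modern large language models (LLMs) excel at objective tasks such as evaluating mathematical reasoning and factual accuracy but fall short in subjective creative assessment such as creative writing. The paper aims to fill this gap with a personalized evaluation method.

Contribution: Proposes a curiosity-driven LLM-as-a-judge framework that learns each individual's creative judgments and outperforms the baseline on the Torrance Test of Creative Thinking (TTCW) benchmark.

Method: A curiosity-driven mechanism lets the LLM learn the subjective creative judgments of different individuals. Experiments use the TTCW dataset, with supervised finetuning (SFT) as the baseline.

Result: The method outperforms the baseline on metrics including Pearson correlation, Cohen's kappa, and F1, and is especially useful when annotators disagree.

Insight: Curiosity-driven personalized evaluation can improve LLM performance on subjective creative tasks and offers a new direction for personalized creative judgment.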

Abstract: Modern large language models (LLMs) excel at objective tasks such as evaluating mathematical reasoning and factual accuracy, yet they falter when faced with the nuanced, subjective nature of assessing creativity. In this work, we propose a novel curiosity-driven LLM-as-a-judge for evaluating creative writing which is personalized to each individual’s creative judgments. We use the Torrance Test of Creative Thinking (TTCW) benchmark introduced in Chakrabarty et al. (2024), which has stories annotated by expert humans across various subjective dimensions like Originality, to test our hypothesis. We show that our method enables models of various sizes to learn the nuanced creative judgments of different individuals, with improvements over the baseline supervised finetuning (SFT) method across evaluation metrics such as Pearson correlation, Cohen’s kappa, and F1. Our method is especially useful in subjective evaluations where not all the annotators agree with each other.

[10] Linguistic Characteristics of AI-Generated Text: A Survey

Luka Terčon,Kaja Dobrovoljc

Main category: cs.CL

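TL;DR: This survey summarizes the linguistic characteristics of AI-generated text, proposes a classification scheme for existing studies, and points to future research directions such as cross-linguistic and cross-model work.

Motivation: As large language models (LLMs) spread across many fields, studying the linguistic features of AI-generated text matters for corpus linguistics, computational linguistics, and natural language processing.

Contribution: Provides a systematic survey of existing research, categorizes the reported linguistic features, and identifies current limitations and future directions.

Method: Categorizes and analyzes existing studies along several dimensions: level of linguistic description, models, genres, languages, and approach to prompting.

Result: AI-generated text tends toward a more formal and impersonal style with lower lexical diversity, and research is concentrated on English and the GPT model family.

Insight: Future research needs to cover more languages and models and to address prompt sensitivity.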

Abstract: Large language models (LLMs) are solidifying their position in the modern world as effective tools for the automatic generation of text. Their use is quickly becoming commonplace in fields such as education, healthcare, and scientific research. There is a growing need to study the linguistic features present in AI-generated text, as the increasing presence of such texts has profound implications in various disciplines such as corpus linguistics, computational linguistics, and natural language processing. Many observations have already been made; however, a broader synthesis of the findings made so far is required to provide a better understanding of the topic. The present survey paper aims to provide such a synthesis of extant research. We categorize the existing works along several dimensions, including the levels of linguistic description, the models included, the genres analyzed, the languages analyzed, and the approach to prompting. Additionally, the same scheme is used to present the findings made so far and expose the current trends followed by researchers. Among the most-often reported findings is the observation that AI-generated text is more likely to contain a more formal and impersonal style, signaled by the increased presence of nouns, determiners, and adpositions and the lower reliance on adjectives and adverbs. AI-generated text is also more likely to feature a lower lexical diversity, a smaller vocabulary size, and repetitive text. Current research, however, remains heavily concentrated on English data and mostly on text generated by the GPT model family, highlighting the need for broader cross-linguistic and cross-model investigation. In most cases authors also fail to address the issue of prompt sensitivity, leaving much room for future studies that employ multiple prompt wordings in the text generation phase.

[11] Demystifying deep search: a holistic evaluation with hint-free multi-hop questions and factorised metrics

Maojia Song,Renhang Liu,Xinyu Wang,Yong Jiang,Pengjun Xie,Fei Huang,Soujanya Poria,Jingren Zhou

Main category: cs.CL

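TL;DR: The paper presents the WebDetective benchmark and the EvidenceLoop workflow. They address two problems in current multi-hop deep search evaluation: questions that leak surface cues, and reliance on a single pass rate. The evaluation reveals systematic weaknesses in knowledge utilisation and refusal behaviour and shows how a diagnostic framework can guide architectural improvements.

Motivation: Current multi-hop deep search evaluation has two main problems. Question text leaks the reasoning path, so models can rely on surface cues. Evaluation is also reduced to a single pass rate, which hides the specific cause of failure (inadequate search, poor knowledge use, or inappropriate refusal).

Contribution: 1. Presents the WebDetective benchmark, with hint-free multi-hop questions and a controlled Wikipedia sandbox that ensures full traceability of model actions.
2. Proposes a holistic evaluation framework that assesses search sufficiency, knowledge utilisation, and refusal behaviour separately.
3. Designs the EvidenceLoop workflow, which improves search and synthesis through verification loops and systematic evidence tracking.

Method: 1. Builds the WebDetective benchmark, removing reasoning-path leakage from the questions.
2. Designs factorised metrics that separately measure search sufficiency, knowledge utilisation, and refusal behaviour.
3. Proposes EvidenceLoop, which adds verification loops and evidence tracking to improve the model's search and synthesis capabilities.

Result: An evaluation of 25 state-of-the-art models finds systematic weaknesses in knowledge utilisation and refusal behaviour. EvidenceLoop improves search and synthesis capabilities, demonstrating the diagnostic value of WebDetective.

Insight: Current models are good at executing a given reasoning path but struggle when they must discover one themselves. Diagnostic evaluation frameworks such as WebDetective can guide concrete architectural improvements toward genuinely autonomous reasoning systems.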

Abstract: RAG (Retrieval-Augmented Generation) systems and web agents are increasingly evaluated on multi-hop deep search tasks, yet current practice suffers from two major limitations. First, most benchmarks leak the reasoning path in the question text, allowing models to follow surface cues rather than discover reasoning chains autonomously. Second, evaluation is typically reduced to a single pass rate, which collapses diverse behaviours into one score and obscures whether failures stem from inadequate search, poor knowledge use, or inappropriate refusal. To address these issues, we present WebDetective, a benchmark of hint-free multi-hop questions paired with a controlled Wikipedia sandbox that ensures full traceability of model actions, and a holistic evaluation framework that separates search sufficiency, knowledge utilisation, and refusal behaviour. Our evaluation of 25 state-of-the-art models reveals systematic weaknesses across all architectures: models struggle with knowledge utilisation despite having sufficient evidence and demonstrate near-absent appropriate refusal when evidence is lacking. These patterns expose a fundamental gap: today’s systems excel at executing given reasoning paths but fail when required to discover them. We develop an agentic workflow, EvidenceLoop, that explicitly targets the challenges our benchmark identifies, incorporating verification loops and systematic evidence tracking that improve both search and synthesis capabilities. This baseline demonstrates that WebDetective’s diagnostic framework can guide concrete architectural improvements, establishing our benchmark as a critical tool for developing genuinely autonomous reasoning systems rather than pattern-following agents.

[12] SynCED-EnDe 2025: A Synthetic and Curated English - German Dataset for Critical Error Detection in Machine Translation

Muskaan Chopra,Lorenz Sparrenberg,Rafet Sifa

Main category: cs.CL

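TL;DR: SynCED-EnDe 2025 is a new synthetic and curated English-German dataset for detecting critical errors in machine translation. It addresses the limits of the WMT21 dataset in scale, label balance, domain coverage, and temporal freshness. The dataset contains 1,000 gold-labeled and 8,000 silver-labeled sentence pairs and introduces fine-grained auxiliary judgments, advancing the CED task.

Motivation: The existing WMT21 English-German CED dataset is small, has imbalanced labels, covers few domains, and is dated. A richer dataset is needed to support the detection and analysis of critical errors in machine translation.

Contribution: Presents the SynCED-EnDe dataset, with 1,000 gold-labeled and 8,000 silver-labeled sentence pairs, balanced labels, and fine-grained auxiliary judgments (such as obviousness, severity, and localization complexity). The data draws from diverse 2024-2025 sources and is permanently hosted on GitHub and Hugging Face.

Method: The dataset is built through synthesis and curation from diverse sources such as StackExchange and GOV.UK. Annotation introduces explicit error subclasses, structured trigger flags, and fine-grained auxiliary judgments.

Result: Benchmark models based on XLM-R and related encoders perform better on SynCED-EnDe than on WMT21, which the authors attribute to balanced labels and refined annotations.

Insight: SynCED-EnDe goes beyond binary error detection and supports systematic analysis of error risk and intricacy, which should help the safe deployment of machine translation in emerging settings such as information retrieval and conversational assistants.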

Abstract: Critical Error Detection (CED) in machine translation aims to determine whether a translation is safe to use or contains unacceptable deviations in meaning. While the WMT21 English-German CED dataset provided the first benchmark, it is limited in scale, label balance, domain coverage, and temporal freshness. We present SynCED-EnDe, a new resource consisting of 1,000 gold-labeled and 8,000 silver-labeled sentence pairs, balanced 50/50 between error and non-error cases. SynCED-EnDe draws from diverse 2024-2025 sources (StackExchange, GOV.UK) and introduces explicit error subclasses, structured trigger flags, and fine-grained auxiliary judgments (obviousness, severity, localization complexity, contextual dependency, adequacy deviation). These enrichments enable systematic analyses of error risk and intricacy beyond binary detection. The dataset is permanently hosted on GitHub and Hugging Face, accompanied by documentation, annotation guidelines, and baseline scripts. Benchmark experiments with XLM-R and related encoders show substantial performance gains over WMT21 due to balanced labels and refined annotations. We envision SynCED-EnDe as a community resource to advance safe deployment of MT in information retrieval and conversational assistants, particularly in emerging contexts such as wearable AI devices.

[13] Every Step Counts: Decoding Trajectories as Authorship Fingerprints of dLLMs

Qi Li,Runpeng Yu,Haiquan Lu,Xinchao Wang

Main category: cs.CL

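TL;DR: The paper proposes using the decoding trajectories of Discrete Diffusion Large Language Models (dLLMs) as model fingerprints. By capturing structural relationships between decoding steps, it attributes outputs to different models and to different checkpoints of the same model.

Motivation: Effective methods for telling apart different dLLMs and their checkpoints are lacking, yet the decoding mechanism of dLLMs carries distinctive structural information that can be used for attribution. The paper aims to turn the information in decoding trajectories into a broadly applicable fingerprinting technique.

Contribution: 1. Proposes the Directed Decoding Map (DDM), which extracts structural relationships between decoding steps; 2. Designs Gaussian-Trajectory Attribution (GTA), which fits Gaussian distributions and scores a decoding trajectory by its likelihood to attribute it to a model; 3. Validates the methods across diverse experimental settings.

Method: 1. DDM: extracts structural relationships from the decoding trajectory, avoiding the redundancy of raw per-step model confidence; 2. GTA: fits a Gaussian distribution at each decoding position for each target model and uses the likelihood score to determine which model produced a trajectory.

Result: Experiments show that DDM and GTA distinguish different dLLMs and their checkpoints and clearly outperform the direct use of model confidence.

Insight: The decoding trajectories of dLLMs carry rich model-specific behavioral information. Extracting it in a structured form enables accurate model attribution.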

Abstract: Discrete Diffusion Large Language Models (dLLMs) have recently emerged as a competitive paradigm for non-autoregressive language modeling. Their distinctive decoding mechanism enables faster inference speed and strong performance in code generation and mathematical tasks. In this work, we show that the decoding mechanism of dLLMs not only enhances model utility but also can be used as a powerful tool for model attribution. A key challenge in this problem lies in the diversity of attribution scenarios, including distinguishing between different models as well as between different checkpoints or backups of the same model. To ensure broad applicability, we identify two fundamental problems: what information to extract from the decoding trajectory, and how to utilize it effectively. We first observe that relying directly on per-step model confidence yields poor performance. This is mainly due to the bidirectional decoding nature of dLLMs: each newly decoded token influences the confidence of other decoded tokens, making model confidence highly redundant and washing out structural signal regarding decoding order or dependencies. To overcome this, we propose a novel information extraction scheme called the Directed Decoding Map (DDM), which captures structural relationships between decoding steps and better reveals model-specific behaviors. Furthermore, to make full use of the extracted structural information during attribution, we propose Gaussian-Trajectory Attribution (GTA), where we fit a cell-wise Gaussian distribution at each decoding position for each target model, and define the likelihood of a trajectory as the attribution score: if a trajectory exhibits higher log-likelihood under the distribution of a specific model, it is more likely to have been generated by that model. Extensive experiments under different settings validate the utility of our methods.
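
The attribution rule described at the end of the abstract can be sketched as follows. The sketch assumes each trajectory has already been reduced to a fixed-length list of numbers (one per cell), standing in for a flattened Directed Decoding Map; how the DDM itself is built is not reproduced here. All names (`fit_cellwise_gaussian`, `log_likelihood`, `attribute`) and the variance floor are illustrative assumptions.

```python
import math


def fit_cellwise_gaussian(trajectories, min_var=1e-6):
    """Fit an independent Gaussian (mean, variance) to every cell.

    `trajectories` is a list of equal-length feature lists from one model.
    `min_var` is an assumed floor that avoids division by zero.
    """
    n = len(trajectories)
    n_cells = len(trajectories[0])
    means = [sum(t[c] for t in trajectories) / n for c in range(n_cells)]
    variances = [
        max(sum((t[c] - means[c]) ** 2 for t in trajectories) / n, min_var)
        for c in range(n_cells)
    ]
    return means, variances


def log_likelihood(trajectory, params):
    """Sum of per-cell Gaussian log densities for one trajectory."""
    means, variances = params
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (x - m) ** 2 / v)
        for x, m, v in zip(trajectory, means, variances)
    )


def attribute(trajectory, fitted_models):
    """Return the model name whose fitted Gaussians give the highest log-likelihood."""
    return max(fitted_models, key=lambda name: log_likelihood(trajectory, fitted_models[name]))


if __name__ == "__main__":
    fitted = {
        "model_a": fit_cellwise_gaussian([[0.0, 1.0], [0.2, 1.2], [0.1, 0.9]]),
        "model_b": fit_cellwise_gaussian([[2.0, 3.0], [2.1, 2.8], [1.9, 3.1]]),
    }
    print(attribute([0.1, 1.1], fitted))
```

Treating the cells as independent keeps the fit to one mean and one variance per cell, so only a few reference trajectories per model are needed, at the price of ignoring correlations between cells.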

[14] Chronological Thinking in Full-Duplex Spoken Dialogue Language Models

Donghang Wu,Haoyang Zhang,Chen Chen,Tianyu Zhang,Fei Tian,Xuerui Yang,Gang Yu,Hexin Liu,Nana Hou,Yuchen Hu,Eng Siong Chng

Main category: cs.CL

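TL;DR: Proposes Chronological Thinking, an on-the-fly conversational thinking mechanism that improves response quality in full-duplex spoken dialogue language models. The method is strictly causal and adds no latency, and experiments show improved response quality and competitive interaction performance.

Motivation: During the listening phase, existing full-duplex spoken dialogue systems repeatedly predict the silence token, which differs from the lightweight thinking people do during conversation. The authors aim to mimic this human behavior with a real-time thinking mechanism.

Contribution: Proposes Chronological Thinking, a real-time thinking mechanism built for streaming audio input that is strictly causal, adds no latency, and improves response quality and interaction performance.

Method: Chronological Thinking updates internal hypotheses through incremental reasoning that runs only within the listening window. Once the user stops speaking, the agent responds immediately with no further delay.

Result: Experiments show improved response quality in both objective metrics and human evaluations, along with robust handling of conversational dynamics.

Insight: In full-duplex spoken dialogue systems, mimicking the lightweight real-time thinking of humans can improve the interaction experience.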

Abstract: Recent advances in spoken dialogue language models (SDLMs) reflect growing interest in shifting from turn-based to full-duplex systems, where the models continuously perceive user speech streams while generating responses. This simultaneous listening and speaking design enables real-time interaction, and the agent can handle dynamic conversational behaviors like user barge-in. However, during the listening phase, existing systems keep the agent idle by repeatedly predicting the silence token, which departs from human behavior: we usually engage in lightweight thinking during conversation rather than remaining absent-minded. Inspired by this, we propose Chronological Thinking, an on-the-fly conversational thinking mechanism that aims to improve response quality in full-duplex SDLMs. Specifically, chronological thinking presents a paradigm shift from conventional LLM thinking approaches, such as Chain-of-Thought, purpose-built for streaming acoustic input. (1) Strictly causal: the agent reasons incrementally while listening, updating internal hypotheses only from past audio with no lookahead. (2) No additional latency: reasoning is amortized during the listening window; once the user stops speaking, the agent halts thinking and begins speaking without further delay. Experiments demonstrate the effectiveness of chronological thinking: both objective metrics and human evaluations show consistent improvements in response quality. Furthermore, chronological thinking robustly handles conversational dynamics and attains competitive performance on full-duplex interaction metrics.

[15] Exploring Large Language Models for Financial Applications: Techniques, Performance, and Challenges with FinMA

Prudence Djagba,Abdelkader Y. Saley

Main category: cs.CL

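TL;DR: This paper studies the strengths and limitations of large language models (LLMs) adapted to the financial domain, focusing on how the FinMA model performs on financial NLP tasks. FinMA does well in sentiment analysis and classification but faces challenges in numerical reasoning, entity recognition, and summarization.

Motivation: Financial NLP tasks place strict demands on accuracy, reliability, and domain adaptation, so the design and evaluation of finance-specific LLMs deserve close study.

Contribution: Examines FinMA's architecture and its instruction tuning on the FIT dataset, evaluates it under the FLARE benchmark, and identifies its strengths and weaknesses on financial tasks.

Method: FinMA is built within the PIXIU framework, instruction-tuned on the Financial Instruction Tuning (FIT) dataset, and tested under the FLARE benchmark.

Result: FinMA performs well in sentiment analysis and classification but poorly in numerical reasoning, entity recognition, and summarization.

Insight: The design of financial LLMs needs more focus on numerical reasoning and complex tasks, together with more comprehensive evaluation frameworks.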

Abstract: This research explores the strengths and weaknesses of domain-adapted Large Language Models (LLMs) in the context of financial natural language processing (NLP). The analysis centers on FinMA, a model created within the PIXIU framework, which is evaluated for its performance in specialized financial tasks. Recognizing the critical demands of accuracy, reliability, and domain adaptation in financial applications, this study examines FinMA’s model architecture, its instruction tuning process utilizing the Financial Instruction Tuning (FIT) dataset, and its evaluation under the FLARE benchmark. Findings indicate that FinMA performs well in sentiment analysis and classification, but faces notable challenges in tasks involving numerical reasoning, entity recognition, and summarization. This work aims to advance the understanding of how financial LLMs can be effectively designed and evaluated to assist in finance-related decision-making processes.

[16] A Single Character can Make or Break Your LLM Evals

Jingtong Su,Jianyu Zhang,Karen Ullrich,Léon Bottou,Mark Ibrahim

Main category: cs.CL

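TL;DR: The paper finds that the choice of delimiter in large language model (LLM) evaluations has a large effect on measured performance and can even be used to manipulate rankings. It analyzes attention to explain the effect and proposes ways to improve robustness.

Motivation: LLM evaluations are highly sensitive to how in-context examples are separated, yet this choice has received little study. Such a small detail can cause large performance swings, so its impact and possible remedies need closer examination.

Contribution: 1. Shows that delimiter choice strongly affects LLM performance; 2. Explains the performance differences by probing attention head scores; 3. Offers practical recommendations for improving robustness.

Method: Measures the effect of different delimiters on LLM performance, probes attention head scores to investigate the mechanism, and specifies the delimiter in the prompt to improve robustness.

Result: Performance can vary by ±23%, and changing the delimiter can reorder model rankings. Stating the delimiter explicitly in the prompt makes models more robust.

Insight: LLMs are highly sensitive to small changes in input format, and where the delimiter steers attention is a key factor. Delimiter choice deserves care in both evaluation design and real-world use.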

Abstract: Common Large Language Model (LLM) evaluations rely on demonstration examples to steer models’ responses to the desired style. While the number of examples used has been studied and standardized, the choice of how to format examples is less investigated. In evaluation protocols and real world usage, users must choose how to separate in-context examples: a comma, a new line, a semicolon, a hashtag, and so on. Surprisingly, we find this seemingly minor choice can dramatically alter model response quality. Across leading model families (Llama, Qwen, Gemma), performance on MMLU, for example, can vary by $\pm 23\%$ depending on the choice of delimiter. In fact, one can manipulate model rankings to put any model in the lead by only modifying the single character separating examples. We find that LLMs’ brittleness pervades topics and model families and doesn’t improve with scale. By probing attention head scores, we find that good-performing delimiters steer attention towards key tokens in the input. Finally, we explore methods to improve LLMs’ robustness to the choice of delimiter. We find that specifying the selected delimiter in the prompt boosts robustness, and we offer practical recommendations for the best-performing delimiters to select.
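
To show what "the single character separating examples" means in practice, here is a minimal sketch of a few-shot prompt builder with a configurable delimiter and an optional header that names the delimiter, in the spirit of the robustness fix the abstract mentions. The `Q:`/`A:` template, the header wording, and the `build_prompt` name are assumptions made for illustration; the paper's actual prompt format is not given in the abstract.

```python
def build_prompt(examples, query, delimiter="\n", announce_delimiter=False):
    """Join in-context examples and the final query with a chosen delimiter.

    examples: list of (question, answer) pairs used as demonstrations.
    announce_delimiter: if True, prepend a line stating which delimiter is used.
    """
    shots = [f"Q: {question} A: {answer}" for question, answer in examples]
    body = delimiter.join(shots + [f"Q: {query} A:"])
    if announce_delimiter:
        # repr() makes invisible delimiters such as "\n" readable in the header.
        return f"Examples are separated by {delimiter!r}.\n" + body
    return body


if __name__ == "__main__":
    demos = [("1+1?", "2"), ("3+3?", "6")]
    for sep in ["\n", ",", ";", "#"]:
        print(repr(build_prompt(demos, "2+2?", delimiter=sep)))
    print(build_prompt(demos, "2+2?", delimiter="#", announce_delimiter=True))
```

Evaluating the same model on the prompts produced by each delimiter, and reporting the spread instead of a single number, is one simple way to check how sensitive a result is to this choice.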

[17] Can AI Truly Represent Your Voice in Deliberations? A Comprehensive Study of Large-Scale Opinion Aggregation with LLMs

Shenzhe Zhu,Shu Yang,Michiel A. Bakker,Alex Pentland,Jiaxin Pei

Main category: cs.CL

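TL;DR: The paper presents the DeliberationBank dataset and the DeliberationJudge model for evaluating the representativeness and fairness of LLM-generated summaries in large-scale deliberations, and reveals that minority perspectives are underrepresented.

Details Motivation: In large-scale public deliberation, LLM-generated summaries may underrepresent minority perspectives and exhibit bias, so reliable tools are needed to evaluate and improve their fairness.

Contribution: 1) Builds the DeliberationBank dataset; 2) Trains the DeliberationJudge model, which aligns more closely with human judgments; 3) Evaluates the summarization weaknesses of 18 LLMs.

Method: 1) Collects deliberation data from 3,000 participants; 2) Trains a DeBERTa model on summary judgment data annotated by 4,500 participants; 3) Evaluates LLM summarization performance.

Result: DeliberationJudge is more efficient than LLM judges and more aligned with human judgments; LLM summaries persistently underrepresent minority positions.

Insight: LLMs show systematic bias in deliberation summarization, and dedicated evaluation tools are needed to ensure AI systems are fair and representative.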

Abstract: Large-scale public deliberations generate thousands of free-form contributions that must be synthesized into representative and neutral summaries for policy use. While LLMs have been shown as a promising tool to generate summaries for large-scale deliberations, they also risk underrepresenting minority perspectives and exhibiting bias with respect to the input order, raising fairness concerns in high-stakes contexts. Studying and fixing these issues requires a comprehensive evaluation at a large scale, yet current practice often relies on LLMs as judges, which show weak alignment with human judgments. To address this, we present DeliberationBank, a large-scale human-grounded dataset with (1) opinion data spanning ten deliberation questions created by 3,000 participants and (2) summary judgment data annotated by 4,500 participants across four dimensions (representativeness, informativeness, neutrality, policy approval). Using these datasets, we train DeliberationJudge, a fine-tuned DeBERTa model that can rate deliberation summaries from individual perspectives. DeliberationJudge is more efficient and more aligned with human judgements compared to a wide range of LLM judges. With DeliberationJudge, we evaluate 18 LLMs and reveal persistent weaknesses in deliberation summarization, especially underrepresentation of minority positions. Our framework provides a scalable and reliable way to evaluate deliberation summarization, helping ensure AI systems are more representative and equitable for policymaking.

[18] A novel hallucination classification framework

Maksym Zavhorodnii,Dmytro Dehtiarov,Anna Konovalenko

Main category: cs.CL

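TL;DR: The paper proposes a method for automatically detecting hallucinations generated by large language models (LLMs), combining a systematic taxonomy and controlled generation of hallucination types with unsupervised learning and vector-space analysis.

Details Motivation: LLMs often hallucinate when generating content; these hallucinations distort information and undermine model reliability. Existing detection methods typically require complex supervised learning or external verification, and lightweight solutions are lacking.

Contribution: 1. Proposes a hallucination detection framework based on a systematic taxonomy and prompt engineering; 2. Builds a dedicated hallucination dataset and maps it into a vector space with an embedding model; 3. Uses unsupervised learning to show that hallucinations and correct outputs are spatially separable.

Method: 1. Uses prompt engineering to systematically generate diverse hallucination types; 2. Maps the hallucination dataset into a vector space with an embedding model; 3. Analyzes the spatial distribution of hallucinations and correct outputs using dimensionality reduction and unsupervised learning; 4. Quantifies hallucination severity from inter-cluster distances.

Result: The degree of informational distortion in a hallucination correlates consistently with its distance in vector space from the cluster of correct outputs. Even simple classification algorithms can reliably distinguish hallucinations from correct outputs.

Insight: Hallucinations have distinguishable signatures in vector space, enabling lightweight detection without complex external verification. The approach offers a practical tool for improving LLM reliability.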

Abstract: This work introduces a novel methodology for the automatic detection of hallucinations generated during large language model (LLM) inference. The proposed approach is based on a systematic taxonomy and controlled reproduction of diverse hallucination types through prompt engineering. A dedicated hallucination dataset is subsequently mapped into a vector space using an embedding model and analyzed with unsupervised learning techniques in a reduced-dimensional representation of hallucinations with veridical responses. Quantitative evaluation of inter-centroid distances reveals a consistent correlation between the severity of informational distortion in hallucinations and their spatial divergence from the cluster of correct outputs. These findings provide theoretical and empirical evidence that even simple classification algorithms can reliably distinguish hallucinations from accurate responses within a single LLM, thereby offering a lightweight yet effective framework for improving model reliability.
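The inter-centroid distance analysis mentioned in the abstract can be sketched as follows. The 2D vectors and the labels `correct`, `mild`, and `severe` are hypothetical toy values standing in for reduced-dimension embeddings, and the nearest-centroid rule is only one example of a "simple classification algorithm"; the digest does not say which classifier or embedding model the authors used.

```python
import math

def centroid(points):
    """Mean vector of a list of equal-length points."""
    dim = len(points[0])
    return [sum(p[i] for p in points) / len(points) for i in range(dim)]

def distance(u, v):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def nearest_centroid(point, centroids):
    """Label of the centroid closest to `point`."""
    return min(centroids, key=lambda label: distance(point, centroids[label]))

# Hypothetical toy embeddings, one cluster per response type.
embeddings = {
    "correct": [[0.0, 0.0], [0.2, 0.0]],
    "mild": [[1.0, 0.0], [1.2, 0.0]],
    "severe": [[3.0, 0.0], [3.2, 0.0]],
}
centroids = {label: centroid(pts) for label, pts in embeddings.items()}
# Distance of each cluster centre from the cluster of correct outputs.
divergence = {label: distance(c, centroids["correct"]) for label, c in centroids.items()}
```

In this toy setup a larger `divergence` value plays the role of a more severe distortion, which mirrors the correlation the abstract describes.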

[19] Let it Calm: Exploratory Annealed Decoding for Verifiable Reinforcement Learning

Chenghao Yang,Lin Gui,Chenxiao Yang,Victor Veitch,Lizhu Zhang,Zhuokai Zhao

Main category: cs.CL

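TL;DR: The paper proposes Exploratory Annealed Decoding (EAD), which anneals the sampling temperature from high to low during generation to balance exploration and exploitation, improving sample efficiency in RLVR.

Details Motivation: The success of RLVR depends on effective exploration, but fixed-temperature sampling struggles to balance sample quality and training stability.

Contribution: Proposes EAD, a dynamic "explore first, exploit later" sampling strategy that significantly improves sample efficiency.

Method: EAD anneals the temperature from high to low, encouraging semantic diversity early in the sequence and preserving sample quality later.

Result: EAD outperforms fixed-temperature sampling across various RLVR algorithms and model sizes.

Insight: Aligning exploration with the natural dynamics of sequential generation is an effective path to improving LLM reasoning.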

Abstract: Reinforcement learning with verifiable rewards (RLVR) is a powerful paradigm for enhancing the reasoning capabilities of large language models (LLMs), yet its success hinges on effective exploration. An ideal exploration strategy must navigate two fundamental challenges: it must preserve sample quality while also ensuring training stability. While standard fixed-temperature sampling is simple, it struggles to balance these competing demands, as high temperatures degrade sample quality and low temperatures limit discovery. In this work, we propose a simpler and more effective strategy, Exploratory Annealed Decoding (EAD), grounded in the insight that exploration is most impactful on early tokens which define a sequence’s semantic direction. EAD implements an intuitive explore-at-the-beginning, exploit-at-the-end strategy by annealing the sampling temperature from high to low during generation. This dynamic schedule encourages meaningful, high-level diversity at the start, then gradually lowers the temperature to preserve sample quality and keep the sampling distribution close to the target policy, which is essential for stable training. We demonstrate that EAD is a lightweight, plug-and-play method that significantly improves sample efficiency, consistently outperforming fixed-temperature sampling across various RLVR algorithms and model sizes. Our work suggests that aligning exploration with the natural dynamics of sequential generation offers a robust path to improving LLM reasoning.
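A minimal sketch of the explore-at-the-beginning, exploit-at-the-end idea is given below. The linear schedule, the default temperatures 1.5 and 0.7, and the function names are assumptions; the abstract only states that the temperature is annealed from high to low during generation, not the schedule's exact form.

```python
import math
import random

def annealed_temperature(step, total_steps, t_start=1.5, t_end=0.7):
    """Linearly interpolate from t_start (first token) to t_end (last token)."""
    if total_steps <= 1:
        return t_end
    frac = min(step, total_steps - 1) / (total_steps - 1)
    return t_start + (t_end - t_start) * frac

def sample_token(logits, temperature, rng):
    """Sample an index from softmax(logits / temperature)."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(x - m) for x in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

def generate(logits_fn, total_steps, seed=0):
    """Decode total_steps tokens; logits_fn maps the prefix to next-token logits."""
    rng = random.Random(seed)
    tokens = []
    for step in range(total_steps):
        temp = annealed_temperature(step, total_steps)
        tokens.append(sample_token(logits_fn(tokens), temp, rng))
    return tokens
```

Here `logits_fn` is a placeholder for a model's forward pass. Early tokens are drawn from a flatter distribution and later tokens from a sharper one, which is the behavior the abstract attributes to EAD.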

[20] Camellia: Benchmarking Cultural Biases in LLMs for Asian Languages

Tarek Naous,Anagha Savit,Carlos Rafael Catalan,Geyang Guo,Jaehyeok Lee,Kyungdon Lee,Lheane Marie Dizon,Mengyu Ye,Neel Kothari,Sahajpreet Singh,Sarah Masud,Tanish Patwa,Trung Thanh Tran,Zohaib Khan,Alan Ritter,JinYeong Bak,Keisuke Sakaguchi,Tanmoy Chakraborty,Yuki Arase,Wei Xu

Main category: cs.CL

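TL;DR: The paper introduces Camellia, a benchmark for measuring cultural biases of large language models (LLMs) in nine Asian languages spanning six Asian cultures. The study shows that LLMs broadly struggle with cultural adaptation and with understanding entities across cultures.

Details Motivation: As LLMs gain stronger multilingual capabilities, handling culturally diverse entities fairly becomes crucial. Prior work showed that LLMs favor Western entities in Arabic, but benchmarks for other non-Western languages, Asian languages in particular, have been lacking.

Contribution: The Camellia benchmark: 19,530 culturally annotated entities and 2,173 naturally occurring masked contexts across nine Asian languages, filling a gap in research on cultural bias in Asian languages.

Method: Using manually annotated entity-culture associations (Asian vs. Western) and naturally occurring masked contexts extracted from social media, the paper evaluates four multilingual LLM families on cultural context adaptation, sentiment association, and entity extractive QA.

Result: LLMs struggle with cultural adaptation in all Asian languages, with performance differing according to the access to culturally relevant data in the regions where the models were developed; different model families show distinct biases; context understanding in Asian languages is markedly weak.

Insight: Cultural biases in LLMs are language- and model-specific, underscoring the importance of culturally diverse data; context understanding in Asian languages needs improvement to narrow performance gaps between cultures.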

Abstract: As Large Language Models (LLMs) gain stronger multilingual capabilities, their ability to handle culturally diverse entities becomes crucial. Prior work has shown that LLMs often favor Western-associated entities in Arabic, raising concerns about cultural fairness. Due to the lack of multilingual benchmarks, it remains unclear if such biases also manifest in different non-Western languages. In this paper, we introduce Camellia, a benchmark for measuring entity-centric cultural biases in nine Asian languages spanning six distinct Asian cultures. Camellia includes 19,530 entities manually annotated for association with the specific Asian or Western culture, as well as 2,173 naturally occurring masked contexts for entities derived from social media posts. Using Camellia, we evaluate cultural biases in four recent multilingual LLM families across various tasks such as cultural context adaptation, sentiment association, and entity extractive QA. Our analyses show a struggle by LLMs at cultural adaptation in all Asian languages, with performance differing across models developed in regions with varying access to culturally-relevant data. We further observe that different LLM families hold their distinct biases, differing in how they associate cultures with particular sentiments. Lastly, we find that LLMs struggle with context understanding in Asian languages, creating performance gaps between cultures in entity extraction.

[21] WeatherArchive-Bench: Benchmarking Retrieval-Augmented Reasoning for Historical Weather Archives

Yongan Yu,Xianda Du,Qingchen Hu,Jiahao Liang,Jingwei Ni,Dan Qiang,Kaiyu Huang,Grant McKenzie,Renee Sieber,Fengran Mo

Main category: cs.CL

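TL;DR: WeatherArchive-Bench is the first benchmark for evaluating retrieval-augmented generation (RAG) systems on historical weather archives. It comprises a retrieval task and an assessment task, and reveals the limitations of dense retrievers and large language models on historical archives.

Details Motivation: Historical weather archives offer qualitative accounts of societal vulnerability and resilience, but their vast scale, noisy digitized quality, and archaic language make them hard to turn into structured knowledge. Existing RAG systems fall short in understanding historical terminology and societal indicators.

Contribution: Introduces WeatherArchive-Bench, the first RAG benchmark for historical weather archives, with a retrieval task (WeatherArchive-Retrieval) and an assessment task (WeatherArchive-Assessment), and releases the dataset and evaluation framework.

Method: Builds a benchmark dataset of over one million archival news segments, with a retrieval task (measuring a system's ability to locate relevant passages) and an assessment task (evaluating whether LLMs can classify vulnerability and resilience indicators).

Result: Experiments show that dense retrievers often fail on historical terminology and that LLMs frequently misinterpret vulnerability and resilience concepts, revealing the limits of RAG systems in reasoning about complex societal indicators.

Insight: The linguistic and conceptual complexity of historical archives poses distinctive challenges for RAG systems; future designs need to account more closely for the specific needs and social context of climate research.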

Abstract: Historical archives on weather events are collections of enduring primary source records that offer rich, untapped narratives of how societies have experienced and responded to extreme weather events. These qualitative accounts provide insights into societal vulnerability and resilience that are largely absent from meteorological records, making them valuable for climate scientists to understand societal responses. However, their vast scale, noisy digitized quality, and archaic language make it difficult to transform them into structured knowledge for climate research. To address this challenge, we introduce WeatherArchive-Bench, the first benchmark for evaluating retrieval-augmented generation (RAG) systems on historical weather archives. WeatherArchive-Bench comprises two tasks: WeatherArchive-Retrieval, which measures a system’s ability to locate historically relevant passages from over one million archival news segments, and WeatherArchive-Assessment, which evaluates whether Large Language Models (LLMs) can classify societal vulnerability and resilience indicators from extreme weather narratives. Extensive experiments across sparse, dense, and re-ranking retrievers, as well as a diverse set of LLMs, reveal that dense retrievers often fail on historical terminology, while LLMs frequently misinterpret vulnerability and resilience concepts. These findings highlight key limitations in reasoning about complex societal indicators and provide insights for designing more robust climate-focused RAG systems from archival contexts. The constructed dataset and evaluation framework are publicly available at https://anonymous.4open.science/r/WeatherArchive-Bench/.

[22] Residualized Similarity for Faithfully Explainable Authorship Verification

Peter Zeng,Pegah Alipoormolabashi,Jihu Mun,Gourab Dey,Nikita Soni,Niranjan Balasubramanian,Owen Rambow,H. Schwartz

Main category: cs.CL

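TL;DR: The paper proposes Residualized Similarity (RS), a method that combines interpretable features with a neural network to improve authorship verification performance while keeping the model interpretable and faithful.

Details Motivation: Existing authorship verification (AV) systems are accurate but lack direct interpretability; in particular, predictions from large language models (LLMs) cannot be explained in a way that is faithful to the model. This limits their use where decisions have real-world consequences.

Contribution: Proposes RS, which adds a neural network on top of an interpretable system to predict the similarity residual, improving performance while preserving interpretability and faithfulness.

Method: RS treats authorship verification as a similarity task and uses a neural network to predict the error (residual) in the similarity produced by the interpretable system.

Result: Experiments on four datasets show that the method matches state-of-the-art models while showing how and to what degree the final prediction is faithful and interpretable.

Insight: Combining interpretable features with neural networks can deliver faithful explanations without sacrificing performance, offering a more reliable basis for high-stakes decision support.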

Abstract: Responsible use of Authorship Verification (AV) systems not only requires high accuracy but also interpretable solutions. More importantly, for systems to be used to make decisions with real-world consequences, the model’s prediction must be explainable using interpretable features that can be traced to the original texts. Neural methods achieve high accuracies, but their representations lack direct interpretability. Furthermore, LLM predictions cannot be explained faithfully – if there is an explanation given for a prediction, it doesn’t represent the reasoning process behind the model’s prediction. In this paper, we introduce Residualized Similarity (RS), a novel method that supplements systems using interpretable features with a neural network to improve their performance while maintaining interpretability. Authorship verification is fundamentally a similarity task, where the goal is to measure how alike two documents are. The key idea is to use the neural network to predict a similarity residual, i.e. the error in the similarity predicted by the interpretable system. Our evaluation across four datasets shows that not only can we match the performance of state-of-the-art authorship verification models, but we can show how and to what degree the final prediction is faithful and interpretable.
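The residual idea can be illustrated with a short sketch: an interpretable similarity plus a learned correction. Cosine similarity over feature vectors, the additive sign convention, and `toy_residual_model` (a constant stand-in for a trained network) are assumptions for illustration, not the authors' implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two interpretable feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def residualized_similarity(feats_a, feats_b, residual_model):
    """Final score = interpretable similarity + predicted residual."""
    base = cosine(feats_a, feats_b)
    residual = residual_model(feats_a, feats_b)
    return {"interpretable": base, "residual": residual, "final": base + residual}

def toy_residual_model(feats_a, feats_b):
    """Hypothetical stand-in for a neural network; returns a fixed correction."""
    return 0.05

result = residualized_similarity([1.0, 0.0, 2.0], [1.0, 0.0, 2.0], toy_residual_model)
```

Keeping the two terms separate in the returned dictionary shows how much of the final score comes from the interpretable system: the smaller the residual relative to the base similarity, the more of the prediction is covered by the interpretable features.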

[23] A Lightweight Large Language Model-Based Multi-Agent System for 2D Frame Structural Analysis

Ziheng Geng,Jiachen Liu,Ran Cao,Lu Cheng,Haifeng Wang,Minghui Cheng

Main category: cs.CL

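TL;DR: The paper presents a multi-agent system based on a lightweight large language model (LLM) that automates finite element modeling of 2D frames; decomposing the task into subtasks handled by specialized agents improves modeling efficiency and accuracy.

Details Motivation: The potential of LLMs for finite element modeling in structural engineering remains underexplored, especially for tasks requiring geometric modeling, complex reasoning, and domain knowledge.

Contribution: Develops a multi-agent system built on the Llama-3.3 70B Instruct model for automating finite element modeling of 2D frames.

Method: The system decomposes structural analysis into subtasks, each handled by a specialized agent: problem analysis, geometric modeling, code generation, model validation, and load application.

Result: On 20 benchmark problems, the system achieves accuracy above 80% in most cases, outperforming Gemini-2.5 Pro and ChatGPT-4o.

Insight: A lightweight LLM combined with a multi-agent system can effectively handle complex modeling problems in structural engineering, improving automation and accuracy.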

Abstract: Large language models (LLMs) have recently been used to empower autonomous agents in engineering, significantly improving automation and efficiency in labor-intensive workflows. However, their potential remains underexplored in structural engineering, particularly for finite element modeling tasks requiring geometric modeling, complex reasoning, and domain knowledge. To bridge this gap, this paper develops an LLM-based multi-agent system to automate finite element modeling of 2D frames. The system decomposes structural analysis into subtasks, each managed by a specialized agent powered by the lightweight Llama-3.3 70B Instruct model. The workflow begins with a Problem Analysis Agent, which extracts geometry, boundary, and material parameters from the user input. Next, a Geometry Agent incrementally derives node coordinates and element connectivity by applying expert-defined rules. These structured outputs are converted into executable OpenSeesPy code by a Translation Agent and refined by a Model Validation Agent through consistency checks. Then, a Load Agent applies load conditions to the assembled structural model. Experimental evaluations on 20 benchmark problems demonstrate that the system achieves accuracy over 80% in most cases across 10 repeated trials, outperforming Gemini-2.5 Pro and ChatGPT-4o models.

[24] Self-Filtered Distillation with LLMs-generated Trust Indicators for Reliable Patent Classification

Yoo Yongmin,Zhang Xu,Cao Longbing

Main category: cs.CL

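TL;DR: The paper proposes Self-Filtered Distillation, a framework that addresses errors in LLM-generated rationales by treating them as trust signals rather than supervision and distilling selectively, improving the accuracy, stability, and interpretability of patent classification.

Details Motivation: LLM-generated rationales enhance interpretability but often contain logical errors and label mismatches; using them directly introduces noise and harms training stability.

Contribution: Proposes the Self-Filtered Distillation framework, which derives trust signals from three unsupervised trust metrics (Self-Consistency, Class Entailment Alignment, and LLM Agreement Scoring) to improve patent classification.

Method: The framework combines the three trust metrics into a unified trust score that weights training samples and optionally filters out low-trust ones, enabling reasoning-aware supervision.

Result: On the USPTO-2M dataset, the method outperforms label-based learning and conventional distillation in accuracy, stability, and interpretability.

Insight: Treating LLM-generated rationales as trust signals rather than supervision reduces noise and improves model performance and training stability.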

Abstract: Large language models (LLMs) increasingly generate natural language rationales to enhance interpretability, but these often contain logical errors, label mismatches, and domain-specific misalignments. Directly using such rationales as supervision risks propagating noise and undermining training stability. To address this challenge, we introduce Self-Filtered Distillation, a framework specifically tailored for patent classification, which treats LLM-generated rationales as trust signals rather than ground-truth supervision. The framework employs selective distillation guided by three unsupervised trust metrics: (1) Self-Consistency, which measures the stability of LLM-generated rationales across multiple generations; (2) Class Entailment Alignment, which assesses semantic coherence with patent-specific class definitions; and (3) LLM Agreement Scoring, which validates rationale-label plausibility. These metrics are integrated into a unified trust score that primarily weights training samples while optionally filtering out extremely low-trust cases, enabling reasoning-aware supervision. Experiments on the USPTO-2M dataset, a widely used benchmark for patent classification, show that our method outperforms label-based learning and conventional distillation in accuracy, stability, and interpretability, establishing a reliable paradigm for leveraging reasoning-aware trust indicators in patent analytics.
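The trust-weighted training described in the abstract might look roughly like the sketch below. The equal-weight average of the three metrics, the `drop_below` threshold of 0.1, and the function names are assumptions; the digest does not state how the unified trust score is actually computed.

```python
def trust_score(self_consistency, entailment, agreement, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Combine the three trust metrics (each assumed to lie in [0, 1])."""
    parts = (self_consistency, entailment, agreement)
    return sum(w * p for w, p in zip(weights, parts))

def weighted_loss(per_sample_losses, trust_scores, drop_below=0.1):
    """Trust-weighted mean loss; samples under `drop_below` are filtered out."""
    kept = [(loss, t) for loss, t in zip(per_sample_losses, trust_scores) if t >= drop_below]
    total_weight = sum(t for _, t in kept)
    if total_weight == 0:
        return 0.0
    return sum(loss * t for loss, t in kept) / total_weight

scores = [trust_score(0.9, 0.8, 1.0), trust_score(0.0, 0.1, 0.0)]
batch_loss = weighted_loss([2.0, 10.0], scores)
```

In this example the second sample has a very low trust score and is dropped, so only the first sample contributes to `batch_loss`. Weighting is the primary mechanism and filtering is the optional extreme case, matching the description in the abstract.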

[25] AgentRouter: A Knowledge-Graph-Guided LLM Router for Collaborative Multi-Agent Question Answering

Zheyuan Zhang,Kaiwen Shi,Zhengqing Yuan,Zehong Wang,Tianyi Ma,Keerthiram Murugesan,Vincent Galassi,Chuxu Zhang,Yanfang Ye

Main category: cs.CL

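TL;DR: AgentRouter is a knowledge-graph-guided LLM routing framework that uses a graph neural network (GNN) to route questions among collaborating agents in multi-agent question answering, consistently outperforming single-agent and ensemble baselines.

Details Motivation: With the rapid progress of LLMs and agent frameworks, choosing the best configuration has become a challenge. Existing routing methods often overlook the fine-grained contextual and relational structure of QA tasks; adaptive routing is needed to exploit the complementary strengths of different agents.

Contribution: Proposes AgentRouter, which formulates multi-agent QA as a knowledge-graph-guided routing problem, generates task-aware routing distributions with a heterogeneous GNN, and learns collaboration schemes through soft supervision and weighted aggregation.

Method: Converts each QA instance into a knowledge graph encoding queries, contextual entities, and agents; uses a heterogeneous GNN to propagate information and produce routing distributions; optimizes the collaboration strategy with soft supervision and weighted aggregation.

Result: Experiments show that AgentRouter outperforms single-agent and ensemble baselines across multiple benchmarks and LLM backbones, demonstrating effectiveness and robustness.

Insight: Combining knowledge graphs with GNNs offers a new approach to multi-agent collaboration; task-aware routing distributions make full use of agents' complementary strengths.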

Abstract: Large language models (LLMs) and agent-based frameworks have advanced rapidly, enabling diverse applications. Yet, with the proliferation of models and agentic strategies, practitioners face substantial uncertainty in selecting the best configuration for a downstream task. Prior studies show that different agents and backbones exhibit complementary strengths, and that larger models are not always superior, underscoring the need for adaptive routing mechanisms. Existing approaches to agent routing, however, often emphasize cost efficiency while overlooking the fine-grained contextual and relational structure inherent in QA tasks. In this paper, we propose AgentRouter, a framework that formulates multi-agent QA as a knowledge-graph-guided routing problem supervised by empirical performance signals. Specifically, we convert each QA instance into a knowledge graph that jointly encodes queries, contextual entities, and agents, and then train a heterogeneous graph neural network (GNN) to propagate information across node types and produce task-aware routing distributions over agents. By leveraging soft supervision and weighted aggregation of agent outputs, AgentRouter learns principled collaboration schemes that capture the complementary strengths of diverse agents. Extensive experiments demonstrate that our framework consistently outperforms single-agent and ensemble baselines, while generalizing across benchmarks and LLM backbones. These results highlight the effectiveness and robustness of graph-supervised multi-agent routing for question answering.
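The weighted aggregation step can be sketched independently of the graph model. Below, the per-agent routing scores are simply given as numbers standing in for the GNN's output, and the aggregation is a probability-weighted vote over answer strings; both are assumptions, since the digest does not spell out the aggregation rule.

```python
import math
from collections import defaultdict

def softmax(scores):
    """Turn raw routing scores into a distribution over agents."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def aggregate(agent_answers, routing_scores):
    """Pick the answer with the largest total routing probability."""
    probs = softmax(routing_scores)
    votes = defaultdict(float)
    for answer, p in zip(agent_answers, probs):
        votes[answer] += p
    best = max(votes, key=votes.get)
    return best, dict(votes)

# Hypothetical scores: the router strongly prefers the first agent.
best, votes = aggregate(["A", "B", "B"], [2.0, 1.0, 1.0])
```

With these scores the single preferred agent outweighs the two others that agree with each other, so the result differs from a plain majority vote. That is the kind of behavior a learned routing distribution can produce.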

[26] SocialNLI: A Dialogue-Centric Social Inference Dataset

Akhil Deo,Kate Sanders,Benjamin Van Durme

Main category: cs.CL

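TL;DR: The paper introduces SocialNLI (SoNLI), a dataset focused on inference over social dialogue, for evaluating how well large language and reasoning models handle complex social phenomena such as irony and sarcasm.

Details Motivation: Current AI models struggle to understand complex social phenomena in human dialogue, such as irony and sarcasm, and a way to assess and improve their social reasoning is needed.

Contribution: SoNLI is the first social dialogue inference dataset; it contains dialogue transcripts, inferences, likelihood scores, and human-written explanations, and is intended for assessing and improving models' social reasoning.

Method: Collects dialogue transcripts featuring complex social phenomena and pairs them with inferences and human explanations to build the dataset. Evaluates models' theory-of-mind ability through multi-step counterfactual reasoning.

Result: The SoNLI dataset exposes the limitations of current large language and reasoning models on social inference tasks.

Insight: Social reasoning is a foundation for capable AI assistants; assessing and improving models in this respect can make them more useful in real conversational settings.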

Abstract: Making theory-of-mind inferences from human dialogue is a strong indicator of a model’s underlying social abilities, which are fundamental for adept AI assistants. However, large language and reasoning models struggle to understand sophisticated social phenomena in transcript data, such as sarcasm and irony. To assess the weaknesses of current models and to identify their solutions, we introduce SocialNLI (SoNLI) – the first social dialogue inference dataset. SoNLI consists of a collection of dialogue transcripts hand-picked to center complex social nuances like irony and sarcasm, paired with inferences, corresponding likelihood scores, and human-written explanations. We explore social inference analysis as a facet of theory-of-mind, and evaluate LLM and reasoning model theory-of-mind ability through multi-step counterfactual reasoning.

[27] TensorBLEU: Vectorized GPU-based BLEU Score Implementation for Per-Sentence In-Training Evaluation

Adam Filipek

Main category: cs.CL

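TL;DR: The paper presents TensorBLEU, a vectorized GPU-based BLEU implementation aimed at the computational bottleneck of evaluation during NLP model training. Compared with a CPU implementation (NLTK), it is over 13x faster on an NVIDIA T4 and over 40x faster on an NVIDIA A100.

Details Motivation: NLP models keep growing, but the computational cost of evaluation tools such as BLEU has become a bottleneck, especially when per-sentence reward signals must be computed efficiently during training; a GPU-accelerated implementation is therefore needed.

Contribution: 1. Proposes TensorBLEU, a BLEU implementation optimized for GPUs; 2. Designs a memory-efficient n-gram counting mechanism; 3. Open-sources the implementation to speed up research in areas such as RL.

Method: 1. Uses vectorized PyTorch operations for GPU acceleration; 2. Uses torch.unique to build a compact n-gram dictionary, avoiding the high memory cost of traditional hashing approaches; 3. Supports batched per-sentence computation.

Result: Experiments show speedups of over 13x on an NVIDIA T4 and over 40x on an NVIDIA A100, substantially reducing the evaluation bottleneck in training.

Insight: GPU-optimized evaluation tools can greatly improve NLP training efficiency, especially in settings such as RL that require frequent evaluation.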

Abstract: Modern natural language processing models have achieved unprecedented scale, yet the tools for their evaluation often remain a computational bottleneck, limiting the pace of research. This is particularly acute for in-training evaluation metrics, such as per-sentence reward signals in Reinforcement Learning, which must operate efficiently on batches of token IDs directly on the GPU. In this paper, we introduce TensorBLEU, a novel implementation of the BLEU metric designed from the ground up for this specific use case. Our approach is fully vectorized for GPU-accelerated, per-sentence computation within PyTorch and introduces a memory-efficient counting mechanism. By creating a compact, batch-specific dictionary of n-grams using torch.unique, our method avoids the prohibitive memory costs of traditional hashing-based vectorization, making it practical for large-vocabulary models. We benchmark TensorBLEU against NLTK, the standard library for token-ID-based BLEU calculation on the CPU. Experiments show that TensorBLEU provides speedups of over 13x on consumer-grade GPUs (NVIDIA T4) and over 40x on data-center-class hardware (NVIDIA A100). This performance transforms a significant bottleneck into a negligible part of the training loop. By clearly defining its role as a “Token-ID BLEU” for development purposes and open-sourcing our implementation, we provide a powerful tool for accelerating research in areas like RL-based model fine-tuning.
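The batch-specific n-gram dictionary can be illustrated with a pure-Python analogue. This is not the TensorBLEU implementation and does not use PyTorch: the set-based deduplication below only plays the role that torch.unique plays in the paper, and the function names and flat-counter layout are assumptions for illustration.

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token-ID list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def batch_counts(batch, n):
    """Count n-grams per sentence using IDs from a batch-specific dictionary."""
    per_sentence = [ngrams(s, n) for s in batch]
    # Only n-grams that occur in this batch get an ID, so memory scales with
    # the batch contents rather than with vocab_size ** n.
    unique = sorted({g for grams in per_sentence for g in grams})
    index = {g: i for i, g in enumerate(unique)}
    # One flat counter; sentence i owns slots [i * len(unique), (i + 1) * len(unique)).
    flat = [0] * (len(batch) * len(unique))
    for i, grams in enumerate(per_sentence):
        for g in grams:
            flat[i * len(unique) + index[g]] += 1
    rows = [flat[i * len(unique):(i + 1) * len(unique)] for i in range(len(batch))]
    return unique, rows

def clipped_matches(candidate_row, reference_row):
    """BLEU's clipped n-gram matches: sum of element-wise minima."""
    return sum(min(c, r) for c, r in zip(candidate_row, reference_row))

candidate, reference = [1, 2, 3, 2, 3], [1, 2, 3, 4]
unique, rows = batch_counts([candidate, reference], n=2)
matches = clipped_matches(rows[0], rows[1])
```

Because candidate and reference share one dictionary, their count rows line up column by column, and clipping reduces to an element-wise minimum. Here the candidate has four bigrams and `matches` of them are credited after clipping.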

[28] Language Model as Planner and Formalizer under Constraints

Cassie Huang,Stuti Mohan,Ziyi Yang,Stefanie Tellex,Li Zhang

Main category: cs.CL

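TL;DR: The paper studies language models (LLMs) as planners and formalizers, points out the limitations of existing benchmarks, and evaluates robustness by introducing fine-grained natural language constraints.

Details Motivation: Existing planning benchmarks are overly simple and generic, which may lead to overestimating LLMs' planning ability and to safety concerns in downstream tasks.

Contribution: Augments benchmarks with manually annotated, fine-grained natural language constraints to evaluate LLM planning and formalization under complex constraints.

Method: Tests performance under constraints across 4 state-of-the-art reasoning LLMs, 3 formal languages, 5 methods, and 4 datasets.

Result: Introducing constraints consistently halves performance and significantly reduces robustness to problem complexity and lexical shift.

Insight: Fine-grained constraints give a more realistic picture of LLMs' actual abilities and highlight their limitations in complex environments.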

Abstract: LLMs have been widely used in planning, either as planners to generate action sequences end-to-end, or as formalizers to represent the planning domain and problem in a formal language that can derive plans deterministically. However, both lines of work rely on standard benchmarks that only include generic and simplistic environmental specifications, leading to potential overestimation of the planning ability of LLMs and safety concerns in downstream tasks. We bridge this gap by augmenting widely used planning benchmarks with manually annotated, fine-grained, and rich natural language constraints spanning four formally defined categories. Over 4 state-of-the-art reasoning LLMs, 3 formal languages, 5 methods, and 4 datasets, we show that the introduction of constraints not only consistently halves performance, but also significantly challenges robustness to problem complexity and lexical shift.

[29] LANTERN: Scalable Distillation of Large Language Models for Job-Person Fit and Explanation

Zhoutong Fu,Yihan Cao,Yi-Lin Chen,Aman Lunia,Liming Dong,Neha Saraf,Ruijie Jiang,Yun Dai,Qingquan Song,Tan Wang,Guoyao Li,Derek Koh,Haichao Wei,Zhipeng Wang,Aman Gupta,Chengming Jiang,Jianqiang Shen,Liangjie Hong,Wenjing Zhang

Main category: cs.CL

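TL;DR: LANTERN is an LLM knowledge distillation framework for job-person fit; through multi-objective modeling and multi-level knowledge distillation it improves task metrics and user experience.

Details Motivation: Applying large language models (LLMs) directly to domain-specific tasks such as job-person fit works poorly because of domain complexity, high inference latency, and the need for structured outputs. LANTERN aims to address these issues.

Contribution: Proposes the LANTERN framework, which uses multi-level knowledge distillation to transfer LLM knowledge into lightweight models, and shares key insights on post-training techniques and prompt engineering.

Method: LANTERN combines multi-objective modeling (classification and generation tasks), multi-level knowledge distillation (data level and logit level), and post-training optimization to transfer knowledge efficiently.

Result: Experiments show that LANTERN significantly improves task metrics; online evaluation further confirms its effectiveness, with job seeker engagement rising (apply rate up 0.24% and qualified applications up 0.28%).

Insight: Prompt engineering and post-training techniques are crucial for domain-specific LLM applications; multi-level knowledge distillation can narrow the performance gap between lightweight models and large LLMs.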

Abstract: Large language models (LLMs) have achieved strong performance across a wide range of natural language processing tasks. However, deploying LLMs at scale for domain specific applications, such as job-person fit and explanation in job seeking platforms, introduces distinct challenges. At LinkedIn, the job person fit task requires analyzing a candidate’s public profile against job requirements to produce both a fit assessment and a detailed explanation. Directly applying open source or finetuned LLMs to this task often fails to yield high quality, actionable feedback due to the complexity of the domain and the need for structured outputs. Moreover, the large size of these models leads to high inference latency and limits scalability, making them unsuitable for online use. To address these challenges, we introduce LANTERN, a novel LLM knowledge distillation framework tailored specifically for job person fit tasks. LANTERN involves modeling over multiple objectives, an encoder model for classification purpose, and a decoder model for explanation purpose. To better distill the knowledge from a strong black box teacher model to multiple downstream models, LANTERN incorporates multi level knowledge distillation that integrates both data and logit level insights. In addition to introducing the knowledge distillation framework, we share our insights on post training techniques and prompt engineering, both of which are crucial for successfully adapting LLMs to domain specific downstream tasks. Extensive experimental results demonstrate that LANTERN significantly improves task specific metrics for both job person fit and explanation. Online evaluations further confirm its effectiveness, showing measurable gains in job seeker engagement, including a 0.24% increase in apply rate and a 0.28% increase in qualified applications.

[30] Prototype-Based Dynamic Steering for Large Language Models

Ceyhun Efe Kayan,Li Zhang

Main category: cs.CL

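TL;DR: The paper proposes Prototype-Based Dynamic Steering (PDS), a test-time method that amplifies the reasoning of large language models (LLMs) without modifying instructions or fine-tuning.

Details Motivation: Existing LLMs rely on explicit reasoning instructions or static steering methods and lack adaptive, instruction-free reasoning amplification.

Contribution: Proposes PDS, which builds prototypes by clustering activation differences without supervision and uses them to steer the model dynamically at inference time.

Method: Generates reasoning prototypes by clustering activation differences between Chain-of-Thought (CoT) and neutral prompts; at inference, projects the input's hidden state onto the prototypes to form an instance-specific steering vector.

Result: On GSM8K, AQuA-RAT, and BIG-Bench tasks, PDS consistently improves accuracy, and the gains persist even when CoT is suppressed.

Insight: Dynamic prototype-guided steering is a lightweight alternative that strengthens LLM reasoning without changing model parameters or prompts.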

Abstract: Despite impressive breadth, LLMs still rely on explicit reasoning instructions or static, one-fits-all steering methods, leaving a gap for adaptive, instruction-free reasoning amplification. We present Prototype-Based Dynamic Steering (PDS), a test-time method that amplifies large language model (LLM) reasoning without adding or altering instructions. We introduce “reasoning prototypes” by clustering activation differences between Chain-of-Thought (CoT) and neutral prompts. At inference, an input’s hidden state is projected onto these prototypes to form an instance-specific steering vector. Evaluated on GSM8K, AQuA-RAT, and BIG-Bench tasks, PDS consistently improves accuracy without fine-tuning or prompt engineering. Notably, the gains persist even when CoT is explicitly suppressed to improve cost-efficiency, indicating that the intervention strengthens latent reasoning processes rather than inducing a superficial behavioral shift. These results position dynamic, prototype-guided steering as a lightweight alternative to training-time approaches for enhancing LLM reasoning.
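The projection step can be sketched with plain vectors. The prototypes are taken as given (in the paper they come from clustering activation differences), and the specific formula, a sum of unit prototypes weighted by the hidden state's dot product with each, as well as the scale `alpha`, are assumptions; the abstract says only that the hidden state is projected onto the prototypes to form an instance-specific steering vector.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    n = math.sqrt(dot(v, v))
    return [x / n for x in v] if n else list(v)

def steering_vector(hidden, prototypes):
    """Weight each unit-norm prototype by the hidden state's projection onto it."""
    units = [normalize(p) for p in prototypes]
    coeffs = [dot(hidden, u) for u in units]
    return [sum(c * u[i] for c, u in zip(coeffs, units)) for i in range(len(hidden))]

def steer(hidden, prototypes, alpha=0.5):
    """Add the scaled, instance-specific steering vector to the hidden state."""
    delta = steering_vector(hidden, prototypes)
    return [h + alpha * d for h, d in zip(hidden, delta)]

# Hypothetical 3-dimensional hidden state and two prototypes.
prototypes = [[2.0, 0.0, 0.0], [0.0, 0.0, 3.0]]
hidden = [1.0, 5.0, -2.0]
steered = steer(hidden, prototypes)
```

Because the coefficients depend on `hidden`, two different inputs receive different steering vectors from the same prototypes, which is what distinguishes this from a static, one-vector-for-all intervention.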

[31] CAM: A Constructivist View of Agentic Memory for LLM-Based Reading Comprehension

Rui Li,Zeyu Zhang,Xiaohe Bo,Zihang Tian,Xu Chen,Quanyu Dai,Zhenhua Dong,Ruiming Tang

Main category: cs.CL

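TL;DR: The paper presents CAM (Constructivist Agentic Memory), a memory module grounded in constructivist theory that aims to improve large language models (LLMs) on long-document reading comprehension. Designed for structurality, flexibility, and dynamicity, CAM improves both performance and efficiency.

Details Motivation: LLMs face information overload on long documents, and systematic design principles for memory modules are lacking. The authors draw on Jean Piaget's Constructivist Theory to fill this gap.

Contribution: Proposes CAM, an agentic memory module based on constructivist theory, featuring structured summarization, flexible information integration, and dynamic updating.

Method: At its core, CAM uses an incremental overlapping clustering algorithm that supports hierarchical summarization and online batch integration. During inference, CAM adaptively explores the memory structure to activate query-relevant information.

Result: CAM shows advantages in both performance and efficiency across long-text reading comprehension tasks, including question answering, query-based summarization, and claim verification.

Insight: Constructivist theory offers systematic guidance for designing LLM memory modules, and CAM's results support its applicability to AI.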

Abstract: Current Large Language Models (LLMs) are confronted with overwhelming information volume when comprehending long-form documents. This challenge raises the imperative of a cohesive memory module, which can elevate vanilla LLMs into autonomous reading agents. Despite the emergence of some heuristic approaches, a systematic design principle remains absent. To fill this void, we draw inspiration from Jean Piaget’s Constructivist Theory, illuminating three traits of the agentic memory – structured schemata, flexible assimilation, and dynamic accommodation. This blueprint forges a clear path toward a more robust and efficient memory system for LLM-based reading comprehension. To this end, we develop CAM, a prototype implementation of Constructivist Agentic Memory that simultaneously embodies the structurality, flexibility, and dynamicity. At its core, CAM is endowed with an incremental overlapping clustering algorithm for structured memory development, supporting both coherent hierarchical summarization and online batch integration. During inference, CAM adaptively explores the memory structure to activate query-relevant information for contextual response, akin to the human associative process. Compared to existing approaches, our design demonstrates dual advantages in both performance and efficiency across diverse long-text reading comprehension tasks, including question answering, query-based summarization, and claim verification.

[32] KEO: Knowledge Extraction on OMIn via Knowledge Graphs and RAG for Safety-Critical Aviation Maintenance

Kuangshi Ai,Jonathan A. Karr Jr,Meng Jiang,Nitesh V. Chawla,Chaoli Wang

Main category: cs.CL

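TL;DR: KEO is a knowledge extraction and reasoning framework for aviation maintenance that combines a knowledge graph (KG) with retrieval-augmented generation (RAG), targeting both global sensemaking and fine-grained maintenance tasks.

Details Motivation: Knowledge extraction and reasoning are challenging in safety-critical domains, particularly in high-risk settings such as aviation maintenance, where conventional RAG falls short on global reasoning.

Contribution: Proposes KEO, which builds a knowledge graph and integrates it into RAG, markedly improving global reasoning while retaining performance on fine-grained tasks.

Method: Builds a QA benchmark from the OMIn dataset, combines a KG with RAG, evaluates several locally deployable LLMs (e.g., Gemma-3, Phi-4), and uses stronger models (e.g., GPT-4o) as judges.

Result: KEO excels at global sensemaking, while conventional text-chunk RAG is more effective on fine-grained tasks, showing the promise of KG-augmented LLMs for safety-critical QA.

Insight: Combining KGs with RAG has clear advantages in high-stakes domains, supporting both global reasoning and localized tasks, and could extend to similar domains.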

Abstract: We present Knowledge Extraction on OMIn (KEO), a domain-specific knowledge extraction and reasoning framework with large language models (LLMs) in safety-critical contexts. Using the Operations and Maintenance Intelligence (OMIn) dataset, we construct a QA benchmark spanning global sensemaking and actionable maintenance tasks. KEO builds a structured Knowledge Graph (KG) and integrates it into a retrieval-augmented generation (RAG) pipeline, enabling more coherent, dataset-wide reasoning than traditional text-chunk RAG. We evaluate locally deployable LLMs (Gemma-3, Phi-4, Mistral-Nemo) and employ stronger models (GPT-4o, Llama-3.3) as judges. Experiments show that KEO markedly improves global sensemaking by revealing patterns and system-level insights, while text-chunk RAG remains effective for fine-grained procedural tasks requiring localized retrieval. These findings underscore the promise of KG-augmented LLMs for secure, domain-specific QA and their potential in high-stakes reasoning.

[33] H1B-KV: Hybrid One-Bit Caches for Memory-Efficient Large Language Model Inference

Harshil Vejendla

Main category: cs.CL

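TL;DR: H1B-KV is a hybrid compression scheme that represents key vectors with 1-bit binary sketches and quantizes value vectors to 4 bits, sharply reducing memory use during LLM inference while preserving the full context.

Details Motivation: Autoregressive decoding in LLMs requires caching a large number of past key-value pairs, making long-context inference a memory bottleneck. Existing methods compress only part of the cache or discard context information, and so do not fully solve the problem.

Contribution: Proposes H1B-KV, the first scheme to compress the key-value cache comprehensively, combining 1-bit key sketches with 4-bit value quantization to achieve a 70x memory reduction while maintaining performance on downstream tasks.

Method: Represents key vectors with 1-bit binary sketches to enable hardware-friendly bitwise attention, and quantizes value vectors to 4 bits. Lightweight fine-tuning recovers performance.

Result: H1B-KV compresses the 8k-context cache of a 7B-parameter model to under 60 MB, with performance comparable to the full-precision model on tasks such as GSM8K, MMLU, and HumanEval.

Insight: H1B-KV shows that combining low-precision quantization with hardware-friendly design allows LLMs to be deployed efficiently in memory-constrained environments without losing context.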

Abstract: Autoregressive decoding in large language models (LLMs) requires caching a growing list of past key-value (KV) pairs, making long-context inference a memory-bound problem. While recent methods have explored quantizing the cache, evicting tokens, or using binary sketches for keys (e.g., Loki), these approaches often provide an incomplete solution by leaving one component (like values) uncompressed or by discarding context information. This paper introduces the Hybrid One-Bit KV Cache (H1B-KV), a comprehensive compression scheme that radically reduces memory usage without sacrificing context. H1B-KV represents each key vector using a 1-bit binary sketch, enabling hardware-friendly bitwise attention, and further compresses value vectors using 4-bit quantization. This holistic, hybrid approach allows a 7-billion parameter LLM to handle an 8k-token context with under 60 MB of cache memory - a 70x reduction. We demonstrate that after a lightweight finetuning, H1B-KV matches full-precision performance not only on perplexity benchmarks but also on complex downstream tasks like mathematical reasoning (GSM8K), multi-task understanding (MMLU), and code generation (HumanEval). Our results show H1B-KV significantly outperforms leading quantization (KIVI), token eviction (SparseLLM), and key-only sketching (Loki) methods in quality-per-byte, establishing it as a robust solution for deploying LLMs in memory-constrained environments.
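As a rough illustration of the two compression steps named in the abstract, the sketch below binarizes a key vector by sign and quantizes a value vector to 4-bit integer codes. It is a toy under stated assumptions, not the paper's implementation: the sign-based sketch, the bit-agreement score, the per-vector min/max scaling, and all function names are hypothetical choices made here for illustration.

```python
def sign_sketch(vec):
    """Pack the sign bits of a vector into one int (1 bit per dimension).
    Assumption: a plain sign sketch; the paper's sketch may differ."""
    bits = 0
    for i, x in enumerate(vec):
        if x >= 0:
            bits |= 1 << i
    return bits


def bitwise_score(q_bits, k_bits, dim):
    """Agreement score in [-dim, dim]: matching sign bits minus mismatching ones."""
    mismatches = bin(q_bits ^ k_bits).count("1")
    return dim - 2 * mismatches


def quantize_4bit(vec):
    """Uniformly quantize a vector to codes in 0..15 (hypothetical min/max scaling)."""
    lo, hi = min(vec), max(vec)
    scale = (hi - lo) / 15 or 1.0  # fall back to 1.0 for a constant vector
    codes = [round((x - lo) / scale) for x in vec]
    return codes, lo, scale


def dequantize_4bit(codes, lo, scale):
    """Map 4-bit codes back to approximate float values."""
    return [lo + c * scale for c in codes]


key_bits = sign_sketch([0.5, -1.2, 0.3, -0.1])
query_bits = sign_sketch([1.0, -0.4, -0.2, -0.9])
score = bitwise_score(query_bits, key_bits, dim=4)  # 3 matching signs, 1 mismatch
```

The score is computed with XOR and a bit count only, which is the kind of operation the abstract calls hardware-friendly; the value path trades a reconstruction error of at most half a quantization step for a 4-bit code per element.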

[34] On the Role of Difficult Prompts in Self-Play Preference Optimization

Yao Xiao,Jung-jae Kim,Roy Ka-wei Lee,Lidong Bing

Main category: cs.CL

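TL;DR: This paper studies the role of prompt difficulty in self-play preference optimization. It finds that difficult prompts degrade optimization performance, that the effect shrinks as model capacity grows, and that selectively removing difficult prompts improves performance.

Details Motivation: Self-play preference optimization has become an important paradigm for aligning large language models, but the role of prompts is underexplored. The paper investigates how prompts of varying difficulty affect optimization performance.

Contribution: 1. Proposes the mean reward of a prompt's sampled responses as a difficulty measure; 2. Shows the negative effect of difficult prompts on self-play optimization; 3. Finds that model capacity mitigates the effect of difficulty; 4. Proposes selectively removing difficult prompts.

Method: 1. Quantifies prompt difficulty by mean reward; 2. Analyzes the effect of prompts of different difficulty on optimization; 3. Tests experimentally how model capacity interacts with difficulty; 4. Explores strategies for removing difficult prompts.

Result: 1. Difficult prompts substantially degrade optimization performance; 2. Increasing model capacity narrows the performance gap between difficult and easy prompts; 3. Selectively removing difficult prompts improves overall performance.

Insight: Prompt difficulty is an important factor in optimization. Model capacity partly mitigates its negative effect, but difficult prompts need to be managed deliberately to avoid degradation.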

Abstract: Self-play preference optimization has emerged as a prominent paradigm for aligning large language models (LLMs). It typically involves a language model to generate on-policy responses for prompts and a reward model (RM) to guide the selection of chosen and rejected responses, which can be further trained with direct preference optimization (DPO). However, the role of prompts remains underexplored, despite being a core component in this pipeline. In this work, we investigate how prompts of varying difficulty influence self-play preference optimization. We first use the mean reward of $N$ sampled responses of a prompt as a proxy for its difficulty. We find that difficult prompts exhibit substantially inferior self-play optimization performance in comparison to easy prompts for language models. Moreover, incorporating difficult prompts into training fails to enhance overall performance and, in fact, leads to slight degradation compared to training on easy prompts alone. We also observe that the performance gap between difficult and easy prompts closes as the model capacity increases, suggesting that difficulty interacts with the model capacity. Building on these findings, we explore strategies to mitigate the negative effect of difficult prompts on final performance. We demonstrate that selectively removing an appropriate portion of challenging prompts enhances overall self-play performance, while also reporting failed attempts and lessons learned.
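The difficulty proxy and the filtering step described in the abstract can be sketched in a few lines. This is a minimal illustration, assuming rewards are already available as a mapping from prompt to a list of reward-model scores; the function names and the `drop_fraction` parameter are hypothetical, and the abstract does not say which fraction is appropriate.

```python
from statistics import mean


def prompt_difficulty(rewards_per_prompt):
    """Mean reward of the N sampled responses per prompt (lower mean = harder)."""
    return {prompt: mean(rewards) for prompt, rewards in rewards_per_prompt.items()}


def drop_hardest(rewards_per_prompt, drop_fraction):
    """Return the prompts kept after removing the lowest-mean-reward fraction."""
    scores = prompt_difficulty(rewards_per_prompt)
    ranked = sorted(scores, key=scores.get)  # hardest prompts first
    n_drop = int(len(ranked) * drop_fraction)
    return ranked[n_drop:]


rewards = {
    "a": [0.9, 0.8],
    "b": [0.1, 0.2],
    "c": [0.5, 0.6],
    "d": [0.3, 0.4],
}
kept = drop_hardest(rewards, drop_fraction=0.25)  # removes "b", the hardest prompt
```

The kept prompts would then feed the usual chosen/rejected pair construction for DPO; that part is unchanged and omitted here.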

[35] Activation-Informed Pareto-Guided Low-Rank Compression for Efficient LLM/VLM

Ryan Solgi,Parsa Madinei,Jiayi Tian,Rupak Swaminathan,Jing Liu,Nathan Susanj,Zheng Zhang

Main category: cs.CL

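TL;DR: The paper proposes PGSVD, a low-rank compression framework for large language models (LLMs) and vision-language models (VLMs) that uses activation-aware, Pareto-guided optimization to achieve higher accuracy and faster inference.

Details Motivation: LLMs and VLMs perform well but pose major memory and compute challenges at deployment. The authors aim to address this with low-rank compression while preserving model performance.

Contribution: 1. Fills a gap in the literature with a theoretical upper bound based on layer-wise activation compression errors; 2. Formulates low-rank compression as a bi-objective optimization problem and proves that a single tolerance yields Pareto-optimal non-uniform ranks; 3. Proposes PGSVD, a zero-shot compression framework.

Method: 1. Theoretical analysis of an activation-informed upper bound; 2. Bi-objective optimization formulation; 3. Pareto-Guided Singular Value Decomposition (PGSVD) with an alternating least-squares implementation.

Result: On both LLMs and VLMs, PGSVD achieves higher accuracy at the same compression level, together with faster inference.

Insight: Activation information is essential for low-rank compression, and Pareto optimization effectively guides non-uniform rank selection, balancing compression ratio and performance.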

Abstract: Large language models (LLM) and vision-language models (VLM) have achieved state-of-the-art performance, but they impose significant memory and computing challenges in deployment. We present a novel low-rank compression framework to address this challenge. First, we upper bound the change of network loss via layer-wise activation-based compression errors, filling a theoretical gap in the literature. We then formulate low-rank model compression as a bi-objective optimization and prove that a single uniform tolerance yields surrogate Pareto-optimal heterogeneous ranks. Based on our theoretical insights, we propose Pareto-Guided Singular Value Decomposition (PGSVD), a zero-shot pipeline that improves activation-aware compression via Pareto-guided rank selection and alternating least-squares implementation. We apply PGSVD to both LLM and VLM, showing better accuracy at the same compression levels and inference speedup.
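To see why one uniform tolerance can produce different ranks per layer, consider the simpler, weight-only case. By the Eckart-Young theorem, the relative Frobenius error of the best rank-r approximation is the square root of the discarded squared singular values over the total. The sketch below picks the smallest rank meeting a shared tolerance. It is an illustration under that simplification, not PGSVD itself: the paper's error is activation-based, and the function name and example spectra are hypothetical.

```python
import math


def rank_for_tolerance(singular_values, tol):
    """Smallest rank r whose truncated SVD has relative Frobenius error <= tol."""
    s = sorted(singular_values, reverse=True)
    total = sum(x * x for x in s)
    if total == 0:
        return 0
    tail = total  # squared energy discarded at rank 0
    for r in range(len(s) + 1):
        if math.sqrt(max(tail, 0.0) / total) <= tol:
            return r
        if r < len(s):
            tail -= s[r] * s[r]
    return len(s)


# Two hypothetical layers: one with a fast-decaying spectrum, one flat.
layers = {
    "fast_decay": [10.0, 1.0, 0.1, 0.01],
    "flat": [1.0, 1.0, 1.0, 1.0],
}
ranks = {name: rank_for_tolerance(s, tol=0.2) for name, s in layers.items()}
```

With the same tolerance of 0.2, the fast-decaying layer is cut to rank 1 while the flat layer keeps all 4 components, so the ranks are heterogeneous even though the tolerance is uniform.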

[36] Presenting a Paper is an Art: Self-Improvement Aesthetic Agents for Academic Presentations

Chengzhi Liu,Yuzhe Yang,Kaiwen Zhou,Zhen Zhang,Yue Fan,Yannan Xie,Peng Qi,Xin Eric Wang

Main category: cs.CL

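TL;DR: The paper introduces the EvoPresent framework, which uses PresAesth, a multi-task reinforcement learning model, to improve the aesthetic quality and content coherence of academic presentations, and proposes the EvoPresent Benchmark for systematic evaluation. It stresses the importance of high-quality feedback for agent self-improvement and reveals a trade-off between visual design and content construction in automated generation.

Details Motivation: Promoting academic papers requires efficient, engaging dissemination, but existing automated methods are limited in storytelling, aesthetic quality, and self-adjustment, and fall short of this need.

Contribution: Proposes the EvoPresent framework, which together with the PresAesth multi-task reinforcement learning model yields a self-improving aesthetic agent for academic presentations, and establishes the EvoPresent Benchmark for evaluation.

Method: Uses the multi-task reinforcement learning (RL) model PresAesth to provide aesthetic scoring, defect adjustment, and comparative feedback, and delivers coherent content designs through virtual characters.

Result: High-quality feedback is essential for agent self-improvement, and multi-task RL training generalizes better on aesthetic awareness tasks.

Insight: Automated generation methods trade off visual design against content construction, which suggests that aesthetics and content need to be optimized jointly.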

Abstract: The promotion of academic papers has become an important means of enhancing research visibility. However, existing automated methods struggle with limited storytelling, insufficient aesthetic quality, and constrained self-adjustment, making it difficult to achieve efficient and engaging dissemination. At the heart of those challenges is a simple principle: there is no way to improve it when you cannot evaluate it right. To address this, we introduce EvoPresent, a self-improvement agent framework that unifies coherent narratives, aesthetic-aware designs, and realistic presentation delivery via virtual characters. Central to EvoPresent is PresAesth, a multi-task reinforcement learning (RL) aesthetic model that provides reliable aesthetic scoring, defect adjustment, and comparative feedback, enabling iterative self-improvement even under limited aesthetic training data. To systematically evaluate the methods, we introduce EvoPresent Benchmark, a comprehensive benchmark comprising: Presentation Generation Quality, built on 650 top-tier AI conference papers with multimodal resources (slides, videos and scripts) to assess both content and design; and Aesthetic Awareness, consisting of 2,000 slide pairs with varying aesthetic levels, supporting joint training and evaluation on scoring, defect adjustment, and comparison. Our findings highlight that (i) high-quality feedback is essential for agent self-improvement, while initial capability alone does not guarantee effective self-correction; (ii) automated generation pipelines exhibit a trade-off between visual design and content construction; and (iii) multi-task RL training shows stronger generalization in aesthetic awareness tasks.

[37] Mission Impossible: Feedback-Guided Dynamic Interactive Planning for Improving Reasoning on LLMs

Dong Yan,Gaochen Wu,Bowen Zhou

Main category: cs.CL

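TL;DR: The paper proposes FGDIP, a framework that uses dynamic, adaptive information-retrieval strategies to strengthen large language models on multi-hop reasoning tasks, substantially improving their handling of open-domain problems.

Details Motivation: Existing methods for multi-hop reasoning usually rely on a fixed sequence of actions and struggle with open-domain problems that require large-scale information retrieval. The authors therefore propose a dynamic, adaptive approach.

Contribution: The main contribution is the FGDIP framework, which refines the reasoning process with a dynamic strategy combining historical error analysis and real-time feedback, markedly improving performance on large open-domain datasets.

Method: FGDIP combines depth-first search with a novel node generation technique to adjust its reasoning strategy dynamically. It identifies key entities as initial nodes, generates child nodes, and uses feedback to refine path selection.

Result: FGDIP reaches F1 scores of 54.47% on HotpotQA and 70.05% on StrategyQA, exceeding the baseline by 5.03% and 7.25%, respectively.

Insight: Dynamically adjusting the reasoning strategy with real-time feedback can markedly improve language model performance on multi-hop reasoning, especially on open-domain problems.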

Abstract: Recent advancements in language agents have led to significant improvements in multi-hop reasoning tasks. However, existing approaches often struggle with handling open-domain problems, which require massive information retrieval due to their reliance on a fixed sequence of actions. To address this, we propose Feedback-Guided Dynamic Interactive Planning (FGDIP), a novel framework tailored to enhance reasoning in LLMs by utilizing dynamic and adaptive strategies for information exploration in open-domain multi-hop reasoning tasks. Our approach begins by identifying key entities relevant to the problem, which serve as the initial nodes in the reasoning process. From these initial nodes, we then generate reasoning child nodes with the process being refined through a combination of historical error analysis and real-time feedback, which allows the framework to dynamically adjust and optimize its reasoning strategies. By integrating depth-first search with an innovative node generation technique, our framework adapts based on both prior error paths and concurrently generated nodes at the same hierarchical level. This dynamic strategy effectively expands the search space while ensuring the reasoning process systematically converges toward accurate solutions. Experimental results show that FGDIP achieved up to 54.47% F1 score on the HotpotQA dataset and 70.05% on the StrategyQA dataset, surpassing the best baseline by 5.03% and 7.25% respectively, highlighting its versatility and potential to enhance language agents in multi-hop reasoning tasks.

[38] A Goal Without a Plan Is Just a Wish: Efficient and Effective Global Planner Training for Long-Horizon Agent Tasks

Shuzheng Si,Haozhe Zhao,Kangyang Luo,Gang Chen,Fanchao Qi,Minjia Zhang,Baobao Chang,Maosong Sun

Main category: cs.CL

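TL;DR: EAGLET is an efficient and effective training method for global planners. A two-step process (synthesizing high-quality plans, then rule-based reinforcement learning) improves the executor agent's long-horizon planning, reaching SOTA performance at reduced training cost.

Details Motivation: Existing LLM-based agents lack global planning in long-horizon tasks, which leads to blind trial and error and hallucinated actions. An efficient planner training method that needs no human effort is required.

Contribution: Proposes EAGLET, which improves agent planning through a two-step training process (plan synthesis and reinforcement learning), substantially reducing training cost without extra data.

Method: 1) Synthesizes high-quality plans from an advanced LLM using a homologous consensus filtering strategy and fine-tunes on them; 2) adds a rule-based reinforcement learning stage with an executor capability gain reward.

Result: On three long-horizon tasks, EAGLET outperforms existing methods and reaches SOTA performance, with an 8x reduction in training cost.

Insight: Combining high-quality plan synthesis with reinforcement learning is key to improving long-horizon planning, while greatly reducing the need for human effort and data.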

Abstract: Agents based on large language models (LLMs) struggle with brainless trial-and-error and generating hallucinatory actions due to a lack of global planning in long-horizon tasks. In this paper, we introduce a plan-and-execute framework and propose EAGLET, an efficient and effective planner training method to enhance the executor agent’s planning abilities without human effort. Specifically, we train a plug-and-play global planner through a two-step process: we first synthesize high-quality plans from an advanced LLM using our proposed homologous consensus filtering strategy, and apply fine-tuning as a cold start. Moreover, we further improve the planner with a rule-based reinforcement learning stage using a novel executor capability gain reward, ensuring it can handle task instructions of varying difficulty. Experiments on three long-horizon agent tasks show that executor agents equipped with our planner outperform existing methods, achieving new state-of-the-art performance. Meanwhile, EAGLET reduces training costs by 8x compared to RL-based baselines, and it does not require manual effort or extra training data, offering an efficient and effective solution.

[39] MADIAVE: Multi-Agent Debate for Implicit Attribute Value Extraction

Wei-Chieh Huang,Cornelia Caragea

Main category: cs.CL

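TL;DR: MADIAVE is a multi-agent debate framework in which multiple MLLM agents debate iteratively to improve the accuracy and robustness of implicit attribute value extraction. Experiments on the ImplicitAVE dataset show substantial gains, especially for attributes with initially low performance.

Details Motivation: Implicit attribute value extraction is essential in e-commerce, but existing multimodal large language models (MLLMs) remain limited in handling multidimensional data and vision-text understanding.

Contribution: Introduces a multi-agent debate framework that improves the performance and robustness of implicit attribute value extraction through iterative verification and updating among agents.

Method: Multiple MLLM agents debate over several rounds, verifying and updating each other's inferences. Experiments evaluate different debate configurations.

Result: Accuracy improves substantially even after a few debate rounds, especially for attributes with initially low performance.

Insight: Multi-agent debate can overcome the limitations of single-agent approaches and offers a scalable solution for implicit attribute value extraction in multimodal e-commerce.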

Abstract: Implicit Attribute Value Extraction (AVE) is essential for accurately representing products in e-commerce, as it infers latent attributes from multimodal data. Despite advances in multimodal large language models (MLLMs), implicit AVE remains challenging due to the complexity of multidimensional data and gaps in vision-text understanding. In this work, we introduce MADIAVE, a multi-agent debate framework that employs multiple MLLM agents to iteratively refine inferences. Through a series of debate rounds, agents verify and update each other’s responses, thereby improving inference performance and robustness. Experiments on the ImplicitAVE dataset demonstrate that even a few rounds of debate significantly boost accuracy, especially for attributes with initially low performance. We systematically evaluate various debate configurations, including identical or different MLLM agents, and analyze how debate rounds affect convergence dynamics. Our findings highlight the potential of multi-agent debate strategies to address the limitations of single-agent approaches and offer a scalable solution for implicit AVE in multimodal e-commerce.

[40] The African Languages Lab: A Collaborative Approach to Advancing Low-Resource African NLP

Sheriff Issaka,Keyi Wang,Yinka Ajibola,Oluwatumininu Samuel-Ipaye,Zhaoyi Zhang,Nicte Aguillon Jimenez,Evans Kofi Agyei,Abraham Lin,Rohan Ramachandran,Sadick Abdul Mumin,Faith Nchifor,Mohammed Shuraim,Lieqi Liu,Erick Rosas Gonzalez,Sylvester Kpei,Jemimah Osei,Carlene Ajeneza,Persis Boateng,Prisca Adwoa Dufie Yeboah,Saadia Gabriel

Main category: cs.CL

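TL;DR: The paper presents the African Languages Lab (All Lab), which addresses the low-resource status of African languages in NLP through systematic data collection, model development, and capacity building, and reports substantial progress.

Details Motivation: Although African languages make up nearly one-third of the world's languages, they are severely underserved by NLP technology, with 88% severely underrepresented or ignored entirely.

Contribution: 1. A quality-controlled data collection pipeline yielding text and speech datasets covering 40 languages; 2. Experiments showing that the dataset combined with fine-tuning substantially outperforms baseline models; 3. A research program that has mentored 15 early-career researchers.

Method: Systematic data collection, multimodal dataset construction, and model fine-tuning.

Result: Average gains of +23.69 ChrF++, +0.33 COMET, and +15.34 BLEU across 31 languages.

Insight: Progress in African-language NLP requires data, models, and local capacity to develop together.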

Abstract: Despite representing nearly one-third of the world’s languages, African languages remain critically underserved by modern NLP technologies, with 88% classified as severely underrepresented or completely ignored in computational linguistics. We present the African Languages Lab (All Lab), a comprehensive research initiative that addresses this technological gap through systematic data collection, model development, and capacity building. Our contributions include: (1) a quality-controlled data collection pipeline, yielding the largest validated African multi-modal speech and text dataset spanning 40 languages with 19 billion tokens of monolingual text and 12,628 hours of aligned speech data; (2) extensive experimental validation demonstrating that our dataset, combined with fine-tuning, achieves substantial improvements over baseline models, averaging +23.69 ChrF++, +0.33 COMET, and +15.34 BLEU points across 31 evaluated languages; and (3) a structured research program that has successfully mentored fifteen early-career researchers, establishing sustainable local capacity. Our comparative evaluation against Google Translate reveals competitive performance in several languages while identifying areas that require continued development.

[41] Code-Switching In-Context Learning for Cross-Lingual Transfer of Large Language Models

Haneul Yoo,Jiho Jin,Kyunghyun Cho,Alice Oh

Main category: cs.CL

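TL;DR: The paper proposes CSICL, a prompting strategy whose demonstrations and instructions progressively switch from the target language to English, strengthening the cross-lingual reasoning of multilingual LLMs.

Details Motivation: The multilingual ability of large language models (LLMs) relies on English as a latent representation, so performance drops sharply in non-English languages. Existing methods fail to address this and instead reinforce the translation barrier.

Contribution: Proposes Code-Switching In-Context Learning (CSICL), which explicitly controls language switching to reduce the LLM's reliance on translation into English and improve cross-lingual performance.

Method: CSICL transitions progressively from the target language to English within demonstrations and instructions, forming an implicit linguistic bridge that facilitates latent reasoning in English. Experiments cover 4 LLMs, 6 datasets, and 10 languages.

Result: CSICL achieves gains of 3.1 and 1.9 percentage points in target and unseen languages, respectively, with larger gains in low-resource languages (14.7% in target and 5.3% in unseen languages).

Insight: An explicit strategy of controlled language switching can reduce LLMs' reliance on internal translation and move toward more equitable and effective multilingual systems.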

Abstract: While large language models (LLMs) exhibit strong multilingual abilities, their reliance on English as latent representations creates a translation barrier, where reasoning implicitly depends on internal translation into English. When this process fails, performance in non-English languages deteriorates sharply, limiting the inclusiveness of LLM-based applications. Existing cross-lingual in-context learning (X-ICL) methods primarily leverage monolingual demonstrations, often failing to mitigate this barrier and instead reinforcing it. In this work, we introduce code-switching in-context learning (CSICL), a simple yet effective prompting strategy that progressively transitions from a target language to English within demonstrations and instruction to facilitate their latent reasoning in English. By explicitly scaffolding the reasoning process through controlled code-switching, CSICL acts as an implicit linguistic bridge that enhances cross-lingual alignment and reduces reliance on the translation barrier. We conduct extensive experiments across 4 LLMs, 6 datasets, and 10 languages, spanning both knowledge-intensive and reasoning-oriented domains. Our results demonstrate that CSICL consistently outperforms X-ICL baselines, achieving gains of 3.1%p and 1.9%p in both target and unseen languages, respectively. The improvement is even more pronounced in low-resource settings, with gains of 14.7% in target and 5.3% in unseen languages. These findings establish code-switching as a principled and robust approach for overcoming the translation barrier during inference, moving LLMs toward more equitable and effective multilingual systems.

[42] DecEx-RAG: Boosting Agentic Retrieval-Augmented Generation with Decision and Execution Optimization via Process Supervision

Yongqi Leng,Yikun Lei,Xikai Liu,Meizhi Zhong,Bojian Xiong,Yurong Zhang,Yan Gao,Yi Wu,Yao Hu,Deyi Xiong

Main category: cs.CL

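TL;DR: DecEx-RAG uses process supervision to optimize decision-making and execution, substantially improving the task-handling ability of Agentic RAG and addressing the inefficient exploration, sparse rewards, and ambiguous global feedback of existing methods.

Details Motivation: Outcome-supervised reinforcement learning methods (e.g., Search-R1) suffer from inefficient exploration, sparse rewards, and ambiguous global feedback on complex tasks, which limits Agentic RAG performance.

Contribution: 1. Models RAG as an MDP incorporating decision-making and execution; 2. Proposes an efficient pruning strategy to optimize data expansion; 3. Uses process-level policy optimization to substantially improve task decomposition, dynamic retrieval, and high-quality answer generation.

Method: 1. Models the RAG task as an MDP combining decision-making and execution; 2. Introduces process-supervised reinforcement learning; 3. Proposes a pruning strategy to make data expansion more efficient.

Result: An average absolute improvement of 6.2% across six datasets, with data construction nearly 6 times more efficient.

Insight: Combining process supervision with pruning offers an efficient, high-performing approach to training Agentic RAG.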

Abstract: Agentic Retrieval-Augmented Generation (Agentic RAG) enhances the processing capability for complex tasks through dynamic retrieval and adaptive workflows. Recent advances (e.g., Search-R1) have shown that outcome-supervised reinforcement learning demonstrates strong performance. However, this approach still suffers from inefficient exploration, sparse reward signals, and ambiguous global reward feedback. To address these challenges, we propose DecEx-RAG, which models RAG as a Markov Decision Process (MDP) incorporating decision-making and execution, while introducing an efficient pruning strategy to optimize data expansion. Through comprehensive process-level policy optimization, DecEx-RAG significantly enhances the autonomous task decomposition, dynamic retrieval, and high-quality answer generation capabilities of large language models (LLMs). Experiments show that DecEx-RAG achieves an average absolute performance improvement of 6.2% across six datasets, significantly outperforming existing baselines. Moreover, the pruning strategy improves data construction efficiency by nearly 6×, providing an efficient solution for process-supervised RAG training. The code is available at https://github.com/sdsxdxl/DecEx-RAG.

[43] Adaptive and Multi-Source Entity Matching for Name Standardization of Astronomical Observation Facilities

Liza Fretel,Baptiste Cecconi,Laura Debisschop

Main category: cs.CL

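TL;DR: The paper proposes an adaptive, multi-source method for standardizing the names of astronomical observation facilities. It combines adaptable scoring criteria with NLP techniques, draws on multiple semantic resources to produce standardized labels, and uses an LLM to check the plausibility of each mapping.

Details Motivation: To resolve inconsistent naming of astronomical observation facilities, using multi-source mapping and standardization to improve data interoperability and adherence to the FAIR principles (findable, accessible, interoperable, reusable).

Contribution: Proposes an entity matching method that combines adaptable scoring criteria with multi-source NLP techniques, validates mappings with an LLM, and produces standardized labels.

Method: Combines adaptable scoring criteria with NLP techniques (bag-of-words, sequential, and surface approaches), extracts entity properties from multiple semantic resources (such as Wikidata and astronomy-oriented resources), and validates mappings with an LLM.

Result: Produces multi-source synonym sets with exactly one standardized label per entity, to be used in the Name Resolver API and integrated into the IVOA Vocabularies and the OntoPortal-Astro platform.

Insight: Combining multi-source data with adaptable scoring criteria offers a new approach to entity standardization within a domain, and the LLM step makes mappings more explainable and plausible.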

Abstract: This ongoing work focuses on the development of a methodology for generating a multi-source mapping of astronomical observation facilities. To compare two entities, we compute scores with adaptable criteria and Natural Language Processing (NLP) techniques (Bag-of-Words approaches, sequential approaches, and surface approaches) to map entities extracted from eight semantic artifacts, including Wikidata and astronomy-oriented resources. We utilize every property available, such as labels, definitions, descriptions, external identifiers, and more domain-specific properties, such as the observation wavebands, spacecraft launch dates, funding agencies, etc. Finally, we use a Large Language Model (LLM) to accept or reject a mapping suggestion and provide a justification, ensuring the plausibility and FAIRness of the validated synonym pairs. The resulting mapping is composed of multi-source synonym sets providing only one standardized label per entity. Those mappings will be used to feed our Name Resolver API and will be integrated into the International Virtual Observatory Alliance (IVOA) Vocabularies and the OntoPortal-Astro platform.
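The idea of scoring a candidate pair with several adaptable criteria can be sketched with two simple label similarities. This is a minimal illustration only: the token-overlap and `difflib` measures are stand-ins for the paper's bag-of-words, sequential, and surface approaches, the weights are arbitrary, and every function name here is hypothetical. The real method also compares many other properties (identifiers, wavebands, launch dates) that are not modeled.

```python
from difflib import SequenceMatcher


def token_jaccard(a, b):
    """Bag-of-words overlap between two labels (case-insensitive)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def sequence_ratio(a, b):
    """Character-sequence similarity in [0, 1] from difflib."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_score(a, b, weights=(0.5, 0.5)):
    """Weighted combination of the two signals; the weights are the tunable criteria."""
    w_bow, w_seq = weights
    return w_bow * token_jaccard(a, b) + w_seq * sequence_ratio(a, b)


same = match_score("Hubble Space Telescope", "hubble space telescope")
different = match_score("Hubble Space Telescope", "Very Large Array")
```

In a pipeline like the one described, pairs whose score exceeds some threshold would be passed to the LLM for acceptance or rejection; the threshold and prompt are left out because the abstract does not specify them.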

[44] EEPO: Exploration-Enhanced Policy Optimization via Sample-Then-Forget

Liang Chen,Xueting Han,Qizhou Wang,Bo Han,Jing Bai,Hinrich Schutze,Kam-Fai Wong

Main category: cs.CL

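TL;DR: EEPO strengthens exploration in reinforcement learning through two-stage rollouts and an adaptive unlearning mechanism, addressing the under-exploration caused by conventional methods' overemphasis on exploitation and substantially improving performance.

Details Motivation: Conventional RLVR methods overemphasize exploitation, leading to entropy collapse, reduced exploration, and limited performance gains. EEPO aims to break this cycle and promote wider exploration.

Contribution: Proposes the EEPO framework, which uses two-stage rollouts and adaptive unlearning to force the model to explore different regions of the output space, addressing under-exploration.

Method: EEPO uses two-stage rollouts: the first stage generates part of the trajectories, a lightweight unlearning step temporarily suppresses them, and the second stage is thereby forced to explore different regions.

Result: On five reasoning benchmarks, EEPO outperforms GRPO, with average relative gains of 24.3% (Qwen2.5-3B), 33.0% (Llama3.2-3B-Instruct), and 10.4% (Qwen3-8B-Base).

Insight: By forcing the model to forget regions it has already explored, EEPO breaks the self-reinforcing loop of dominant behavioral modes and substantially improves exploration and performance.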

Abstract: Balancing exploration and exploitation remains a central challenge in reinforcement learning with verifiable rewards (RLVR) for large language models (LLMs). Current RLVR methods often overemphasize exploitation, leading to entropy collapse, diminished exploratory capacity, and ultimately limited performance gains. Although techniques that increase policy stochasticity can promote exploration, they frequently fail to escape dominant behavioral modes. This creates a self-reinforcing loop, in which dominant modes are repeatedly sampled and rewarded, that further erodes exploration. We introduce Exploration-Enhanced Policy Optimization (EEPO), a framework that promotes exploration via two-stage rollouts with adaptive unlearning. In the first stage, the model generates half of the trajectories; it then undergoes a lightweight unlearning step to temporarily suppress these sampled responses, forcing the second stage to explore different regions of the output space. This sample-then-forget mechanism disrupts the self-reinforcing loop and promotes wider exploration during rollouts. Across five reasoning benchmarks, EEPO outperforms GRPO, achieving average relative gains of 24.3% on Qwen2.5-3B, 33.0% on Llama3.2-3B-Instruct, and 10.4% on Qwen3-8B-Base.
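The sample-then-forget idea can be illustrated on a toy categorical policy. The sketch below is not EEPO: a real unlearning step would update model parameters, whereas here it is replaced by a fixed logit penalty on the actions sampled in the first stage. The penalty value, the function names, and the four-action policy are all assumptions made for illustration.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]


def two_stage_rollout(logits, n_samples, penalty, rng):
    """Sample half the rollouts, suppress what was sampled, then sample the rest."""
    actions = list(range(len(logits)))
    half = n_samples // 2
    first = rng.choices(actions, weights=softmax(logits), k=half)
    sampled = set(first)
    # Stand-in for the unlearning step: temporarily lower logits of sampled actions.
    suppressed = [l - penalty if a in sampled else l for a, l in zip(actions, logits)]
    second = rng.choices(actions, weights=softmax(suppressed), k=n_samples - half)
    return first, second


rng = random.Random(0)
# Action 0 dominates the unmodified policy, mimicking a dominant behavioral mode.
first, second = two_stage_rollout([5.0, 0.0, 0.0, 0.0], n_samples=8, penalty=50.0, rng=rng)
```

With a large penalty, the second stage draws almost exclusively from actions the first stage did not visit, which is the effect the abstract attributes to the temporary suppression. The original logits are left untouched, matching the "temporarily" in the description.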

[45] Automated Boilerplate: Prevalence and Quality of Contract Generators in the Context of Swiss Privacy Policies

Luka Nenadic,David Rodriguez

Main category: cs.CL

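TL;DR: The paper studies the prevalence and quality of automated contract generators in the context of the Swiss privacy law revision and finds that generator use is associated with substantially higher compliance.

Details Motivation: Firms facing increasingly complex digital regulation, small and medium-sized businesses in particular, often lack the resources to draft compliant legal documents and turn to cheap automated contract generators. Empirical evidence on the prevalence and quality of these generators has been lacking.

Contribution: 1. Creates and annotates a multilingual benchmark dataset covering key compliance obligations under Swiss and EU privacy law; 2. Proposes a GPT-5-based method for large-scale compliance assessment; 3. Finds that automated generators are associated with markedly higher compliance.

Method: 1. Builds the multilingual benchmark dataset; 2. Uses GPT-5 for large-scale compliance assessment; 3. Statistically analyzes the relationship between generator use and compliance.

Result: 18% of local websites use generators, and their compliance is substantially higher (by up to 15 percentage points), indicating positive effects of both the revised law and the generators.

Insight: 1. LLMs show potential for multilingual legal analysis; 2. The Brussels Effect of EU regulation is pronounced; 3. Automated tools can improve compliance and contractual quality.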

Abstract: It has become increasingly challenging for firms to comply with a plethora of novel digital regulations. This is especially true for smaller businesses that often lack both the resources and know-how to draft complex legal documents. Instead of seeking costly legal advice from attorneys, firms may turn to cheaper alternative legal service providers such as automated contract generators. While these services have a long-standing presence, there is little empirical evidence on their prevalence and output quality. We address this gap in the context of a 2023 Swiss privacy law revision. To enable a systematic evaluation, we create and annotate a multilingual benchmark dataset that captures key compliance obligations under Swiss and EU privacy law. Using this dataset, we validate a novel GPT-5-based method for large-scale compliance assessment of privacy policies, allowing us to measure the impact of the revision. We observe compliance increases indicating an effect of the revision. Generators, explicitly referenced by 18% of local websites, are associated with substantially higher levels of compliance, with increases of up to 15 percentage points compared to privacy policies without generator use. These findings contribute to three debates: the potential of LLMs for cross-lingual legal analysis, the Brussels Effect of EU regulations, and, crucially, the role of automated tools in improving compliance and contractual quality.

[46] Evaluating the Sensitivity of LLMs to Harmful Contents in Long Input

Faeze Ghorbanpour,Alexander Fraser

Main category: cs.CL

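TL;DR: The paper evaluates the sensitivity of large language models (LLMs) to harmful content in long contexts, revealing patterns in their behavior in safety-critical scenarios.

Details Motivation: As LLMs are increasingly used for long-context tasks (such as document processing and retrieval-augmented generation), their response to harmful content has not been studied systematically, especially in safety-critical scenarios.

Contribution: The first systematic study of LLM detection performance across harmful content of varying type, position, prevalence, and context length.

Method: Experiments vary the type (explicit/implicit), position (beginning/middle/end), prevalence (0.01-0.50), and context length (600-6000 tokens) of harmful content, and evaluate LLaMA-3, Qwen-2.5, and Mistral.

Result: LLMs perform best at moderate harmful prevalence (0.25); recall declines as context length increases; harmful content at the beginning is detected more easily; and explicit content is recognized more consistently than implicit content.

Insight: LLM detection of harmful content in long contexts has limitations, especially when the content is sparse or implicit, and needs further improvement for reliable safety-critical use.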

Abstract: Large language models (LLMs) increasingly support applications that rely on extended context, from document processing to retrieval-augmented generation. While their long-context capabilities are well studied for reasoning and retrieval, little is known about their behavior in safety-critical scenarios. We evaluate LLMs’ sensitivity to harmful content under extended context, varying type (explicit vs. implicit), position (beginning, middle, end), prevalence (0.01-0.50 of the prompt), and context length (600-6000 tokens). Across harmful content categories such as toxic, offensive, and hate speech, with LLaMA-3, Qwen-2.5, and Mistral, we observe similar patterns: performance peaks at moderate harmful prevalence (0.25) but declines when content is very sparse or dominant; recall decreases with increasing context length; harmful sentences at the beginning are generally detected more reliably; and explicit content is more consistently recognized than implicit. These findings provide the first systematic view of how LLMs prioritize and calibrate harmful content in long contexts, highlighting both their emerging strengths and the challenges that remain for safety-critical use.

[47] Prompt reinforcing for long-term planning of large language models

Hsien-Chin Lin,Benjamin Matthias Ruppik,Carel van Niekerk,Chia-Hao Shen,Michael Heck,Nurul Lubis,Renato Vukovic,Shutong Feng,Milica Gašić

Main category: cs.CL

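TL;DR: Proposes a reinforcement-learning-inspired prompt optimization framework that enables long-term planning in multi-turn interactive tasks by modifying only the LLM's task instruction prompt, substantially improving task performance.

Details Motivation: Large language models perform poorly in multi-turn interactive tasks, relying on incorrect early assumptions and failing to track user goals. Long-term planning is key to handling such tasks.

Contribution: 1) Proposes a reinforcement-learning-inspired prompt optimization framework; 2) optimizes prompts through feedback and experience replay; 3) substantially improves performance on multi-turn tasks such as text-to-SQL and task-oriented dialogue; 4) generalizes across different LLM agents.

Method: 1) Generates turn-by-turn feedback; 2) rewrites prompts using experience replay; 3) leaves model parameters unchanged, optimizing only the task instruction prompt.

Result: Substantial improvement on multi-turn tasks such as text-to-SQL and task-oriented dialogue, with the method applicable across different LLM agents.

Insight: Reinforcement-learning-inspired parameter-free optimization is a promising direction for future research, and prompt optimization is an effective means of planning in multi-turn tasks.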

Abstract: Large language models (LLMs) have achieved remarkable success in a wide range of natural language processing tasks and can be adapted through prompting. However, they remain suboptimal in multi-turn interactions, often relying on incorrect early assumptions and failing to track user goals over time, which makes such tasks particularly challenging. Prior works in dialogue systems have shown that long-term planning is essential for handling interactive tasks. In this work, we propose a prompt optimisation framework inspired by reinforcement learning, which enables such planning to take place by only modifying the task instruction prompt of the LLM-based agent. By generating turn-by-turn feedback and leveraging experience replay for prompt rewriting, our proposed method shows significant improvement in multi-turn tasks such as text-to-SQL and task-oriented dialogue. Moreover, it generalises across different LLM-based agents and can leverage diverse LLMs as meta-prompting agents. This warrants future research in reinforcement learning-inspired parameter-free optimisation methods.

[48] Probing the Difficulty Perception Mechanism of Large Language Models

Sunbowen Lee,Qingyu Yin,Chak Tou Leong,Jialiang Zhang,Yicheng Gong,Xiaoyu Shen

Main category: cs.CL

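TL;DR: This paper investigates whether large language models (LLMs) implicitly encode problem difficulty in their internal representations. It finds that a linear probe can model the difficulty level of math problems, and it locates the role of specific attention heads in the final Transformer layer.

Details Motivation: LLMs are increasingly used for complex reasoning, but how they internally assess problem difficulty is unclear. This ability is essential for adaptive reasoning and efficient resource allocation, so it matters whether LLMs can perceive difficulty.

Contribution: 1. Shows that problem difficulty can be modeled linearly from LLM internal representations; 2. Locates the role of specific attention heads in the final Transformer layer; 3. Points out the practical value of LLMs as automatic difficulty annotators.

Method: Applies a linear probe to the LLM's final-token representations and identifies attention heads in the final Transformer layer with opposite activation patterns for simple and difficult problems.

Result: Experiments show that LLMs not only perceive difficulty but represent it in a structured way through specific attention heads. Entropy and difficulty perception also differ significantly at the token level.

Insight: Difficulty perception in LLMs is not only implicit but structurally organized, providing a theoretical and practical basis for future research and applications such as automatic annotation and curriculum learning.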

Abstract: Large language models (LLMs) are increasingly deployed on complex reasoning tasks, yet little is known about their ability to internally evaluate problem difficulty, which is an essential capability for adaptive reasoning and efficient resource allocation. In this work, we investigate whether LLMs implicitly encode problem difficulty in their internal representations. Using a linear probe on the final-token representations of LLMs, we demonstrate that the difficulty level of math problems can be linearly modeled. We further locate the specific attention heads of the final Transformer layer: these attention heads have opposite activation patterns for simple and difficult problems, thus achieving perception of difficulty. Our ablation experiments prove the accuracy of the location. Crucially, our experiments provide practical support for using LLMs as automatic difficulty annotators, potentially substantially reducing reliance on costly human labeling in benchmark construction and curriculum learning. We also uncover that there is a significant difference in entropy and difficulty perception at the token level. Our study reveals that difficulty perception in LLMs is not only present but also structurally organized, offering new theoretical insights and practical directions for future research.
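A linear probe of the kind mentioned in the abstract is just a linear model fitted on frozen representations. The sketch below fits one by gradient descent on synthetic data. Everything in it is an assumption for illustration: the 3-dimensional "hidden states" are random vectors rather than real final-token representations, the difficulty labels are generated from a known direction so that recovery can be checked, and the function name and hyperparameters are hypothetical.

```python
import random


def fit_linear_probe(features, targets, lr=0.3, steps=2000):
    """Fit weights w minimizing the mean squared error of w.x against the targets."""
    dim = len(features[0])
    n = len(features)
    w = [0.0] * dim
    for _ in range(steps):
        grad = [0.0] * dim
        for x, y in zip(features, targets):
            err = sum(wi * xi for wi, xi in zip(w, x)) - y
            for j in range(dim):
                grad[j] += err * x[j] / n
        w = [wi - lr * g for wi, g in zip(w, grad)]
    return w


rng = random.Random(0)
true_direction = [2.0, -1.0, 0.5]  # synthetic "difficulty direction"
hidden = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(50)]
difficulty = [sum(t * h for t, h in zip(true_direction, x)) for x in hidden]
probe = fit_linear_probe(hidden, difficulty)
```

Because the synthetic labels are exactly linear in the features, the probe recovers the generating direction. On real representations the interesting question is how well such a fit generalizes to held-out problems, which this toy does not address.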

[49] LexiCon: a Benchmark for Planning under Temporal Constraints in Natural Language

Periklis Mantenoglou,Rishi Hazra,Pedro Zuidberg Dos Martires,Luc De Raedt

Main category: cs.CL

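TL;DR: LexiCon is a benchmark for evaluating large language models (LLMs) on planning tasks with temporal constraints described in natural language. It is extensible and supports generating new environments.

Details Motivation: Real-world planning tasks often have to satisfy temporal constraints (such as safety constraints), yet LLMs have mostly been tested in unconstrained settings, so a tool for evaluating constrained planning is needed.

Contribution: Proposes the LexiCon benchmark, which automatically generates natural-language planning tasks with temporal constraints and can scale problem difficulty as LLM capabilities improve.

Method: Takes existing planning environments, automatically adds temporal constraints, and translates the problems into natural language for LLMs to solve. Its key features are support for new environment generators and automatic constraint construction.

Result: Experiments show that the performance of state-of-the-art LLMs (such as GPT-5, o3, and R1) deteriorates markedly as the degree of constrainedness increases.

Insight: LLM performance on constrained planning still needs improvement, and LexiCon provides an extensible evaluation tool for future research.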

Abstract: Owing to their reasoning capabilities, large language models (LLMs) have been evaluated on planning tasks described in natural language. However, LLMs have largely been tested on planning domains without constraints. In order to deploy them in real-world settings where adherence to constraints, in particular safety constraints, is critical, we need to evaluate their performance on constrained planning tasks. We introduce LexiCon – a natural language-based (Lexi) constrained (Con) planning benchmark, consisting of a suite of environments, that can be used to evaluate the planning capabilities of LLMs in a principled fashion. The core idea behind LexiCon is to take existing planning environments and impose temporal constraints on the states. These constrained problems are then translated into natural language and given to an LLM to solve. A key feature of LexiCon is its extensibility. That is, the set of supported environments can be extended with new (unconstrained) environment generators, for which temporal constraints are constructed automatically. This renders LexiCon future-proof: the hardness of the generated planning problems can be increased as the planning capabilities of LLMs improve. Our experiments reveal that the performance of state-of-the-art LLMs, including reasoning models like GPT-5, o3, and R1, deteriorates as the degree of constrainedness of the planning tasks increases.
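To make "temporal constraints on the states" concrete, the sketch below checks a plan's state trajectory against two constraint types. It is an illustration under assumptions: the operators `always` and `sometime_before` follow the style of PDDL3 state-trajectory constraints, the abstract does not say which constraint language LexiCon uses, and the toy robot domain is hypothetical.

```python
def always(pred):
    """Constraint: pred holds in every state of the trajectory."""
    return lambda states: all(pred(s) for s in states)


def sometime_before(p, q):
    """Constraint: whenever p holds, q held in some strictly earlier state."""
    def check(states):
        seen_q = False
        for s in states:
            if p(s) and not seen_q:
                return False
            seen_q = seen_q or q(s)
        return True
    return check


def satisfies(states, constraints):
    """A trajectory is valid only if every constraint accepts it."""
    return all(c(states) for c in constraints)


# Hypothetical toy domain: a robot carries a package between rooms.
constraints = [
    # Safety: never carry the package through the lab.
    always(lambda s: not (s["room"] == "lab" and s["carrying"])),
    # Ordering: the package must be picked up before the office is reached.
    sometime_before(lambda s: s["room"] == "office", lambda s: s["carrying"]),
]

safe_trace = [
    {"room": "depot", "carrying": False},
    {"room": "depot", "carrying": True},
    {"room": "office", "carrying": True},
]
unsafe_trace = [
    {"room": "depot", "carrying": False},
    {"room": "lab", "carrying": True},
    {"room": "office", "carrying": True},
]
```

Here `satisfies(safe_trace, constraints)` holds, while `unsafe_trace` violates the `always` constraint. Adding more such constraints is one way the degree of constrainedness of a task could be raised.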

[50] MASA: Rethinking the Representational Bottleneck in LoRA with Multi-A Shared Adaptation

Qin Dong,Yuntian Tang,Heming Jia,Yunhang Shen,Bohan Jia,Wenxuan Huang,Lianyue Zhang,Jiao Xie,Shaohui Lin

Main category: cs.CL

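TL;DR: The paper proposes MASA, which addresses the representational bottleneck in LoRA with a shared multi-A expert structure and improves adaptation to downstream tasks.

Details Motivation: LoRA's single down-projection matrix A is a representational bottleneck that limits the model's ability to capture diverse features, so feature adaptation needs to be enriched.

Contribution: Proposes the MASA architecture, which enriches feature adaptation by sharing multiple A experts while preserving parameter efficiency.

Method: MASA uses a multi-A, single-B structure: the multi-A expert ensemble is asymmetrically shared across layers, and a single, layer-specific B matrix integrates the features.

Result: On the MMLU benchmark, MASA reaches an average accuracy of 59.62%, 1.08 points above standard LoRA (a relative improvement of 1.84%), with a comparable learnable-parameter ratio of 0.52%.

Insight: Asymmetrically sharing multiple A experts captures more diverse features while preserving parameter efficiency, improving downstream performance.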

Abstract: Low-Rank Adaptation (LoRA) has emerged as a dominant method in Parameter-Efficient Fine-Tuning (PEFT) for large language models, which augments the transformer layer with one down-projection $A$ and one up-projection $B$. However, LoRA’s reliance on a single down-projection matrix ($A$) creates a representational bottleneck, as this solitary feature extractor is inherently insufficient for capturing the diverse signals required by complex tasks. This motivates our architectural shift to focus on enriching the feature adaptation to improve the downstream task adaptation ability. We propose MASA (Multi-$A$ Shared Adaptation), an architecture that implements a multi-$A$, single-$B$ structure where the multi-$A$ expert ensemble is asymmetrically shared across layers to ensure parameter efficiency. In MASA, these specialized experts capture diverse features, which are then integrated by a single, layer-specific $B$-matrix. The effectiveness and versatility of our method are validated through a comprehensive suite of experiments spanning multi-domain generalization, single-domain specialization, and multi-task reasoning. For example, on the MMLU benchmark, MASA achieves an average accuracy of 59.62%, outperforming the standard LoRA by 1.08 points (a relative improvement of 1.84%) with comparable learnable parameters of 0.52%.
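The multi-A, single-B structure can be pictured with a small sketch. It rests on assumptions not stated in the abstract: the expert outputs are concatenated before the layer-specific B matrix is applied (the paper may combine them differently), and `masa_update` and `matvec` are hypothetical helper names.

```python
def matvec(matrix, vector):
    """Multiply a row-major matrix by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def masa_update(x, shared_a_experts, layer_b):
    """Low-rank update: several shared down-projections, one up-projection.

    shared_a_experts: list of (rank x d_in) matrices reused across layers.
    layer_b: (d_out x (num_experts * rank)) matrix owned by this layer.
    Assumption: expert features are concatenated before applying B.
    """
    features = []
    for a in shared_a_experts:
        features.extend(matvec(a, x))
    return matvec(layer_b, features)


def count_params(matrices):
    """Number of scalar entries across a list of matrices."""
    return sum(len(m) * len(m[0]) for m in matrices)


shared_a_experts = [[[1.0, 0.0]], [[0.0, 1.0]]]  # two rank-1 experts, d_in = 2
layer_b = [[1.0, 1.0], [2.0, 0.0]]               # d_out = 2, two expert features
delta = masa_update([1.0, 2.0], shared_a_experts, layer_b)
```

Because the A experts are shared, their parameters are counted once for the whole model, while each layer adds only its own B matrix; that is the intuition behind the parameter-efficiency claim.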

[51] CDTP: A Large-Scale Chinese Data-Text Pair Dataset for Comprehensive Evaluation of Chinese LLMs

Chengwei Wu,Jiapu Wang,Mingyang Gao,Xingrui Zhuo,Jipeng Guo,Runlin Lei,Haoran Luo,Tianyu Chen,Haoyi Zhou,Shirui Pan,Zechao Li

Main category: cs.CL

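TL;DR: The paper presents CDTP, a large-scale Chinese dataset for evaluating Chinese large language models. CDTP contains over 7 million aligned text pairs and 15 million triples across four key domains, and supports fine-grained evaluation of knowledge-driven tasks as well as multi-task fine-tuning.

Details Motivation: Existing LLM benchmarks remain predominantly English-centric; they do not address the linguistic characteristics of Chinese and lack structured datasets. CDTP aims to fill this gap with high-quality structured information.

Contribution: 1. Builds the Chinese dataset CDTP with over 7 million text pairs and 15 million triples; 2. Supports multi-task evaluation covering Knowledge Graph Completion, Triple-to-Text generation, and Question Answering; 3. Releases an open-source codebase and outlines future research directions.

Method: Constructs CDTP from large-scale unstructured Chinese text paired with corresponding structured triples, and designs a multi-task evaluation framework (Knowledge Graph Completion, Triple-to-Text generation, and Question Answering) for experimental validation.

Result: Experiments indicate that CDTP can effectively evaluate Chinese LLMs, supporting fine-grained task evaluation and multi-task fine-tuning for assessing generalization and robustness.

Insight: Structured data is important for evaluating Chinese LLMs, and CDTP provides a standardized evaluation resource and data for future research.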

Abstract: Large Language Models (LLMs) have achieved remarkable success across a wide range of natural language processing tasks. However, Chinese LLMs face unique challenges, primarily due to the dominance of unstructured free text and the lack of structured representations in Chinese corpora. While existing benchmarks for LLMs partially assess Chinese LLMs, they are still predominantly English-centric and fail to address the unique linguistic characteristics of Chinese, lacking structured datasets essential for robust evaluation. To address these challenges, we present a Comprehensive Benchmark for Evaluating Chinese Large Language Models (CB-ECLLM) based on the newly constructed Chinese Data-Text Pair (CDTP) dataset. Specifically, CDTP comprises over 7 million aligned text pairs, each consisting of unstructured text coupled with one or more corresponding triples, alongside a total of 15 million triples spanning four critical domains. The core contributions of CDTP are threefold: (i) enriching Chinese corpora with high-quality structured information; (ii) enabling fine-grained evaluation tailored to knowledge-driven tasks; and (iii) supporting multi-task fine-tuning to assess generalization and robustness across scenarios, including Knowledge Graph Completion, Triple-to-Text generation, and Question Answering. Furthermore, we conduct rigorous evaluations through extensive experiments and ablation studies to assess the effectiveness, Supervised Fine-Tuning (SFT), and robustness of the benchmark. To support reproducible research, we offer an open-source codebase and outline potential directions for future investigations based on our insights.

[52] ASPO: Asymmetric Importance Sampling Policy Optimization

Jiakang Wang,Runze Liu,Lei Lin,Wenping Hu,Xiu Li,Fuzheng Zhang,Guorui Zhou,Kun Gai

Main category: cs.CL

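TL;DR: The paper proposes ASPO (Asymmetric Importance Sampling Policy Optimization), a simple and effective fix for mismatched importance sampling (IS) ratios in outcome-supervised reinforcement learning for LLM post-training. ASPO flips the IS ratios of positive-advantage tokens and adds a soft dual-clipping mechanism, significantly improving training stability and performance.

Details Motivation: In Outcome-Supervised RL (OSRL), the IS ratios of positive-advantage tokens are mismatched, so positive and negative tokens are weighted unevenly: updates to low-probability tokens are suppressed while already high-probability tokens are over-amplified, which hurts model performance.

Contribution: Proposes ASPO, which resolves the IS-ratio mismatch in OSRL by flipping the IS ratios of positive-advantage tokens, and introduces a soft dual-clipping mechanism to stabilize extreme updates.

Method: ASPO flips the IS ratios of positive-advantage tokens so that their update direction aligns with the learning dynamics of negative ones, and applies soft dual-clipping to stabilize extreme updates while maintaining gradient flow.

Result: On coding and mathematical reasoning benchmarks, ASPO significantly mitigates premature convergence, improves training stability, and enhances final performance over strong GRPO-based baselines.

Insight: Token-level weighting plays a key role in OSRL, and correcting IS is critical in LLM RL.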

Abstract: Recent Large Language Model (LLM) post-training methods rely on token-level clipping mechanisms during Reinforcement Learning (RL). However, we identify a fundamental flaw in this Outcome-Supervised RL (OSRL) paradigm: the Importance Sampling (IS) ratios of positive-advantage tokens are mismatched, leading to unbalanced token weighting for positive and negative tokens. This mismatch suppresses the update of low-probability tokens while over-amplifying already high-probability ones. To address this, we propose Asymmetric Importance Sampling Policy Optimization (ASPO), which uses a simple yet effective strategy that flips the IS ratios of positive-advantage tokens, aligning their update direction with the learning dynamics of negative ones. ASPO further incorporates a soft dual-clipping mechanism to stabilize extreme updates while maintaining gradient flow. Comprehensive experiments on coding and mathematical reasoning benchmarks demonstrate that ASPO significantly mitigates premature convergence, improves training stability, and enhances final performance over strong GRPO-based baselines. Our analysis provides new insights into the role of token-level weighting in OSRL and highlights the critical importance of correcting IS in LLM RL. The code and models of ASPO are available at https://github.com/wizard-III/Archer2.0.
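The weighting mismatch can be illustrated numerically. The sketch below is an interpretation, not the released implementation: it reads "flipping" as taking the reciprocal of the IS ratio for positive-advantage tokens, and the hard `cap` is only a crude stand-in for the soft dual-clipping mechanism, whose exact form the abstract does not give.

```python
def is_ratio(p_new, p_old):
    """Standard importance sampling ratio pi_new(token) / pi_old(token)."""
    return p_new / p_old


def token_weight(p_new, p_old, advantage, cap=3.0):
    """Per-token weight with the ratio flipped for positive-advantage tokens.

    Assumptions: "flip" is read as the reciprocal of the ratio, and `cap`
    is a placeholder hard limit, not the paper's soft dual-clipping.
    """
    ratio = is_ratio(p_new, p_old)
    if advantage > 0:
        return min(1.0 / ratio, cap)
    return ratio


# A positive-advantage token whose probability has fallen since sampling:
lagging = token_weight(p_new=0.1, p_old=0.2, advantage=1.0)
# A positive-advantage token whose probability has already risen:
leading = token_weight(p_new=0.4, p_old=0.2, advantage=1.0)
# Negative-advantage tokens keep the standard ratio.
negative = token_weight(p_new=0.4, p_old=0.2, advantage=-1.0)
```

Under the standard ratio the lagging token would be weighted 0.5 and the leading token 2.0; with the flip the lagging token gets the larger weight, which matches the stated goal of no longer suppressing low-probability tokens.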

[53] Spectrum Tuning: Post-Training for Distributional Coverage and In-Context Steerability

Taylor Sorensen,Benjamin Newman,Jared Moore,Chan Park,Jillian Fisher,Niloofar Mireshghallah,Liwei Jiang,Yejin Choi

Main category: cs.CL

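TL;DR: The paper characterizes three desiderata for conditional distributional modeling: in-context steerability, valid output space coverage, and distributional alignment, and finds that current post-training can reduce them. The authors introduce the Spectrum Suite resource and the Spectrum Tuning method to improve steerability and distributional coverage.

Details Motivation: Current post-training improves instruction following but may come at a cost on tasks with many possible valid answers. The authors aim to improve models' distributional coverage and in-context steerability.

Contribution: 1. Characterizes three desiderata for conditional distributional modeling; 2. Introduces the Spectrum Suite resource; 3. Proposes Spectrum Tuning to improve steerability and distributional coverage.

Method: Introduces Spectrum Suite, compiled from more than 40 data sources and spanning more than 90 tasks, and proposes Spectrum Tuning, a post-training method that uses Spectrum Suite to improve steerability and distributional coverage.

Result: Spectrum Tuning often improves over pretrained models and their instruction-tuned counterparts in steerability, output space coverage, and distributional alignment on held-out datasets.

Insight: Post-training needs to balance instruction following against distributional coverage, and in-context steerability is an important research direction.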

Abstract: Language model post-training has enhanced instruction-following and performance on many downstream tasks, but also comes with an often-overlooked cost on tasks with many possible valid answers. We characterize three desiderata for conditional distributional modeling: in-context steerability, valid output space coverage, and distributional alignment, and document across three model families how current post-training can reduce these properties. In particular, we disambiguate between two kinds of in-context learning: ICL for eliciting existing underlying knowledge or capabilities, and in-context steerability, where a model must use in-context information to override its priors and steer to a novel data generating distribution. To better evaluate and improve these desiderata, we introduce Spectrum Suite, a large-scale resource compiled from >40 data sources and spanning >90 tasks requiring models to steer to and match diverse distributions ranging from varied human preferences to numerical distributions and more. We find that while current post-training techniques help elicit underlying capabilities and knowledge, they hurt models’ ability to flexibly steer in-context. To mitigate these issues, we propose Spectrum Tuning, a post-training method using Spectrum Suite to improve steerability and distributional coverage. We find that Spectrum Tuning often improves over pretrained models and their instruction-tuned counterparts, enhancing steerability, spanning more of the output space, and improving distributional alignment on held-out datasets.

[54] The Valley of Code Reasoning: Scaling Knowledge Distillation of Large Language Models

Muyu He,Muhammad Ali Shafique,Anand Kumar,Tsach Mackey,Nazneen Rajani

Main category: cs.CL

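TL;DR: The paper studies how the performance of distilling code reasoning from large language models (LLMs) scales with the amount of distillation data, and finds a "valley of code reasoning": downstream performance first drops and then rises in a sharper-than-log-linear fashion.

Details Motivation: Distilling the thinking traces of a reasoning-capable LLM into a smaller model has been shown to be effective, but little work examines how performance scales with the quantity of distillation data. This paper aims to fill that gap.

Contribution: 1. Identifies the "valley" phenomenon in code reasoning distillation; 2. Shows that in the low and medium-low data regimes, small models benefit more from easier questions than from harder ones; 3. Finds, surprisingly, that the correctness of outputs in the training data makes no difference to distillation outcomes.

Method: Distills competitive coding skills into two small non-reasoning LLMs, systematically studies how performance scales with data quantity, and further fine-tunes the models at two different distillation stages on the same data.

Result: Downstream performance first drops as data quantity increases and then rises steadily in a sharper-than-log-linear fashion; in the low and medium-low data regimes, easier questions help significantly more than harder ones.

Insight: 1. The effect of data quantity on distillation is non-monotonic; 2. Small models rely more on easier questions in low-data regimes; 3. Output correctness is not a key factor in distillation.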

Abstract: Distilling the thinking traces of a Large Language Model (LLM) with reasoning capabilities into a smaller model has been proven effective. Yet little work examines how model performance scales with the quantity of distillation data. In this work, we study the scaling trend of distilling competitive coding skills on two small non-reasoning LLMs. We validate the hypothesis that there is a "valley of code reasoning": downstream performance on competitive coding first drops as data quantity increases, then it steadily increases in a sharper-than-log-linear fashion. Having identified the trend, we further fine-tune the models at two different distillation stages on the same data to ground conclusions on their respective learning phases. We learn that across stages in the low and medium-low data regimes, small models benefit significantly more from easier coding questions than from harder ones. We also find that, surprisingly, the correctness of outputs in training data makes no difference to distillation outcomes. Our work represents a step forward in understanding the training dynamics of code reasoning distillation beyond intuition.

[55] Distributional Semantics Tracing: A Framework for Explaining Hallucinations in Large Language Models

Gagan Bhatia,Somayajulu G Sripada,Kevin Allan,Jacobo Azcona

Main category: cs.CL

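TL;DR: The paper proposes Distributional Semantics Tracing (DST), a framework for explaining hallucinations in large language models by analyzing the model's internal semantic pathways and failure mechanisms.

Details Motivation: Large language models (LLMs) are prone to hallucination, the generation of plausible yet factually incorrect statements, and the intrinsic, architectural origins of this failure mode lack explanation.

Contribution: 1. Proposes the DST framework, which integrates established interpretability techniques to produce a causal map of a model's reasoning; 2. Identifies a commitment layer at which a hallucination becomes inevitable; 3. Identifies the underlying failure mechanism, a conflict between two computational pathways.

Method: Uses the DST framework, combining distributional semantics with dual-process theory: a fast associative pathway (akin to System 1) versus a slow contextual pathway (akin to System 2).

Result: The coherence of the contextual pathway correlates strongly and negatively with hallucination rates (ρ = -0.863), implying that hallucinations are predictable consequences of internal semantic weakness.

Insight: Hallucination in the Transformer architecture is tied to the conflict between the two pathways and can be predicted from the coherence of the contextual pathway.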

Abstract: Large Language Models (LLMs) are prone to hallucination, the generation of plausible yet factually incorrect statements. This work investigates the intrinsic, architectural origins of this failure mode through three primary contributions. First, to enable the reliable tracing of internal semantic failures, we propose Distributional Semantics Tracing (DST), a unified framework that integrates established interpretability techniques to produce a causal map of a model’s reasoning, treating meaning as a function of context (distributional semantics). Second, we pinpoint the model’s layer at which a hallucination becomes inevitable, identifying a specific commitment layer where a model’s internal representations irreversibly diverge from factuality. Third, we identify the underlying mechanism for these failures. We observe a conflict between distinct computational pathways, which we interpret using the lens of dual-process theory: a fast, heuristic associative pathway (akin to System 1) and a slow, deliberate contextual pathway (akin to System 2), leading to predictable failure modes such as Reasoning Shortcut Hijacks. Our framework’s ability to quantify the coherence of the contextual pathway reveals a strong negative correlation ($\rho = -0.863$) with hallucination rates, implying that these failures are predictable consequences of internal semantic weakness. The result is a mechanistic account of how, when, and why hallucinations occur within the Transformer architecture.

[56] Parallel Tokenizers: Rethinking Vocabulary Design for Cross-Lingual Transfer

Muhammad Dehan Al Kautsar,Fajri Koto

Main category: cs.CL

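TL;DR: The paper proposes parallel tokenizers, a framework that trains tokenizers monolingually and then aligns their vocabularies so that semantically equivalent words share representations across languages, improving performance on low-resource languages.

Details Motivation: Existing tokenization methods support cross-lingual transfer poorly because semantically equivalent words in different languages are usually assigned different vocabulary indices, which prevents shared representations and limits cross-lingual generalization.

Contribution: Proposes the parallel tokenizer framework, which trains tokenizers monolingually and aligns their vocabularies so that semantically equivalent words receive consistent indices across languages, improving cross-lingual representation learning.

Method: First trains tokenizers monolingually, then aligns their vocabularies using bilingual dictionaries or word-to-word translation so that semantically equivalent words receive consistent indices.

Result: A transformer encoder pretrained from scratch on thirteen low-resource languages outperforms conventional multilingual baselines on sentiment analysis, hate speech detection, emotion classification, and sentence embedding similarity.

Insight: Rethinking tokenization is essential for advancing multilingual representation learning, especially in low-resource settings.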

Abstract: Tokenization defines the foundation of multilingual language models by determining how words are represented and shared across languages. However, existing methods often fail to support effective cross-lingual transfer because semantically equivalent words are assigned distinct embeddings. For example, “I eat rice” in English and “Ina cin shinkafa” in Hausa are typically mapped to different vocabulary indices, preventing shared representations and limiting cross-lingual generalization. We introduce parallel tokenizers. This new framework trains tokenizers monolingually and then aligns their vocabularies exhaustively using bilingual dictionaries or word-to-word translation, ensuring consistent indices for semantically equivalent words. This alignment enforces a shared semantic space across languages while naturally improving fertility balance. To assess their effectiveness, we pretrain a transformer encoder from scratch on thirteen low-resource languages and evaluate it on sentiment analysis, hate speech detection, emotion classification, and sentence embedding similarity. Across all tasks, models trained with parallel tokenizers outperform conventional multilingual baselines, confirming that rethinking tokenization is essential for advancing multilingual representation learning, especially in low-resource settings.
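The alignment step can be sketched as a dictionary lookup that reuses a pivot language's indices. This is a simplified illustration under assumptions: whole words stand in for tokenizer units, the word-by-word gloss of the Hausa example is illustrative only, and `align_vocabulary` is a hypothetical helper rather than the authors' procedure.

```python
def align_vocabulary(pivot_vocab, target_words, dictionary):
    """Give each target word the index of its pivot-language translation.

    pivot_vocab: word -> index for the pivot language (here English).
    dictionary: target word -> pivot word (a bilingual dictionary).
    Words with no usable translation receive fresh, unused indices.
    """
    aligned = {}
    used = set()
    next_free = max(pivot_vocab.values(), default=-1) + 1
    for word in target_words:
        index = pivot_vocab.get(dictionary.get(word))
        if index is None or index in used:
            index = next_free
            next_free += 1
        aligned[word] = index
        used.add(index)
    return aligned


english_vocab = {"I": 0, "eat": 1, "rice": 2}
hausa_words = ["Ina", "cin", "shinkafa", "ruwa"]
# Illustrative word-level gloss; "ruwa" has no dictionary entry here.
hausa_to_english = {"Ina": "I", "cin": "eat", "shinkafa": "rice"}

hausa_vocab = align_vocabulary(english_vocab, hausa_words, hausa_to_english)
```

With aligned indices, "I eat rice" and "Ina cin shinkafa" map to the same index sequence and therefore to the same embedding rows, which is the shared-representation effect the abstract describes.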

[57] VecInfer: Efficient LLM Inference with Low-Bit KV Cache via Outlier-Suppressed Vector Quantization

Dingyu Yao,Chenxu Yang,Zhengyang Tong,Zheng Lin,Wei Liu,Jian Luan,Weiping Wang

Main category: cs.CL

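TL;DR: VecInfer is a new vector quantization method that suppresses outliers in the KV cache, enabling efficient LLM inference with low-bit compression, substantially lower memory overhead, and higher computational efficiency.

Details Motivation: The KV cache introduces substantial memory overhead during large language model (LLM) inference. Existing vector quantization methods degrade severely at ultra-low bit-widths, mainly because outliers in the key cache hinder effective codebook utilization.

Contribution: 1. Proposes VecInfer, which suppresses key cache outliers with smooth and Hadamard transformations for efficient low-bit KV cache compression. 2. Designs an optimized CUDA kernel that reduces memory access overhead.

Method: 1. Applies smooth and Hadamard transformations to suppress outliers in the key cache. 2. Uses an optimized CUDA kernel that fuses computation with dequantization to minimize memory access overhead.

Result: With 2-bit quantization, VecInfer achieves performance comparable to full precision on long-context understanding and mathematical reasoning tasks. It delivers up to a 2.7× speedup in large-batch self-attention computation and an 8.3× reduction in single-batch end-to-end latency on Llama-3.1-8B with a 196k sequence length.

Insight: Suppressing outliers is key to effective low-bit quantization, and fusing computation with dequantization substantially improves inference efficiency.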

Abstract: The Key-Value (KV) cache introduces substantial memory overhead during large language model (LLM) inference. Although existing vector quantization (VQ) methods reduce KV cache usage and provide flexible representational capacity across bit-widths, they suffer severe performance degradation at ultra-low bit-widths due to key cache outliers that hinder effective codebook utilization. To address this challenge, we propose VecInfer, a novel VQ method for aggressive KV cache compression while enabling efficient inference. By applying smooth and Hadamard transformations, VecInfer suppresses outliers in the key cache, enabling the codebook to comprehensively cover the original data distribution and thereby reducing quantization difficulty. To facilitate efficient deployment, we design an optimized CUDA kernel that fuses computation with dequantization to minimize memory access overhead. Extensive evaluations demonstrate that VecInfer consistently outperforms existing quantization baselines across both long-context understanding and mathematical reasoning tasks. With only 2-bit quantization, VecInfer achieves performance comparable to full precision, while delivering up to $\mathbf{2.7\times}$ speedup in large-batch self-attention computation and $\mathbf{8.3\times}$ reduction in single-batch end-to-end latency on Llama-3.1-8B with a 196k sequence length.
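Why a Hadamard transformation helps with outliers can be seen on a toy vector. The sketch below shows only a generic orthonormal Walsh-Hadamard transform; it omits the smoothing step, the codebook, and the CUDA kernel, and it is not VecInfer's implementation. The function name `hadamard_transform` is a hypothetical choice.

```python
import math


def hadamard_transform(vec):
    """Orthonormal fast Walsh-Hadamard transform (length must be a power of 2)."""
    n = len(vec)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = [float(v) for v in vec]
    h = 1
    while h < n:
        # Butterfly step: combine elements that are h positions apart.
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in out]


# A key vector with one large outlier channel.
key = [8.0, 0.0, 0.0, 0.0]
rotated = hadamard_transform(key)
restored = hadamard_transform(rotated)  # the orthonormal transform is its own inverse
```

The outlier's energy is spread evenly over all channels, so the largest magnitude drops from 8 to 4 while the vector's norm is preserved. A flatter distribution of this kind is easier for a small codebook to cover.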

[58] Mixing Mechanisms: How Language Models Retrieve Bound Entities In-Context

Yoav Gur-Arieh,Mor Geva,Atticus Geiger

Main category: cs.CL

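TL;DR: The paper studies how language models (LMs) bind and retrieve entities in context. As the number of bound entities grows, the positional mechanism becomes unreliable, and LMs supplement it with lexical and reflexive mechanisms.

Details Motivation: To understand how LMs bind and retrieve entities in complex contexts. Prior work focused on a positional mechanism, which generalizes poorly to more complex settings, so a more complete picture is needed.

Contribution: 1. Shows that LMs mix positional, lexical, and reflexive mechanisms in complex contexts; 2. Develops a causal model combining all three mechanisms that estimates next-token distributions with 95% agreement; 3. Shows that the model generalizes to substantially longer, more natural inputs.

Method: Runs extensive experiments on nine models and ten binding tasks to uncover how LMs mix positional, lexical, and reflexive mechanisms, and develops a causal model combining the three.

Result: The causal model estimates next-token distributions with 95% agreement and generalizes to substantially longer inputs of open-ended text interleaved with entity groups.

Insight: In complex contexts, LMs do not rely on a single mechanism but combine positional, lexical, and reflexive mechanisms to retrieve entities.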

Abstract: A key component of in-context reasoning is the ability of language models (LMs) to bind entities for later retrieval. For example, an LM might represent “Ann loves pie” by binding “Ann” to “pie”, allowing it to later retrieve “Ann” when asked “Who loves pie?” Prior research on short lists of bound entities found strong evidence that LMs implement such retrieval via a positional mechanism, where “Ann” is retrieved based on its position in context. In this work, we find that this mechanism generalizes poorly to more complex settings; as the number of bound entities in context increases, the positional mechanism becomes noisy and unreliable in middle positions. To compensate for this, we find that LMs supplement the positional mechanism with a lexical mechanism (retrieving “Ann” using its bound counterpart “pie”) and a reflexive mechanism (retrieving “Ann” through a direct pointer). Through extensive experiments on nine models and ten binding tasks, we uncover a consistent pattern in how LMs mix these mechanisms to drive model behavior. We leverage these insights to develop a causal model combining all three mechanisms that estimates next token distributions with 95% agreement. Finally, we show that our model generalizes to substantially longer inputs of open-ended text interleaved with entity groups, further demonstrating the robustness of our findings in more natural settings. Overall, our study establishes a more complete picture of how LMs bind and retrieve entities in-context.

[59] Peeking inside the Black-Box: Reinforcement Learning for Explainable and Accurate Relation Extraction

Xinyu Guo,Zhengliang Shi,Minglai Yang,Mahdi Rahimi,Mihai Surdeanu

Main category: cs.CL

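TL;DR: The paper proposes CogRE, a framework that combines text-processing steps inspired by cognitive science with reinforcement learning optimization to improve both the accuracy and the explainability of relation extraction.

Details Motivation: Traditional relation extraction lacks supervision for language-based explanations and shows limited one-shot learning capability, so a method is needed that improves accuracy while producing high-quality explanations.

Contribution: 1. Proposes the CogRE framework, combining a cognitive-science-inspired reasoning mechanism with reinforcement learning optimization; 2. Designs a novel reward function that improves both task accuracy and explanation quality; 3. Automatically constructs a high-quality keyword dictionary using an LLM.

Method: 1. Formulates relation extraction as a series of text-processing steps inspired by cognitive science; 2. Optimizes with reinforcement learning, using a reward that promotes outputs containing important relation keywords; 3. Automatically constructs the keyword dictionary.

Result: CogRE achieves 24.65% F1 on One-shot NYT29, and reinforcement learning optimization further improves performance by +23.46% (absolute). Human evaluation shows that the generated relational keywords align closely with gold labels, and explanation quality ratings increase by 54% (relative).

Insight: Combining cognitive-science-inspired reasoning with reinforcement learning can substantially improve the performance and explainability of one-shot relation extraction.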

Abstract: This paper introduces a framework for relation extraction (RE) that enhances both accuracy and explainability. The framework has two key components: (i) a reasoning mechanism that formulates relation extraction as a series of text-processing steps inspired by cognitive science, and (ii) an optimization process driven by reinforcement learning (RL) with a novel reward function designed to improve both task accuracy and explanation quality. We call our approach CogRE. Our framework addresses the lack of supervision for language-based explanations in traditional RE by promoting outputs that include important relation keywords. These keywords are drawn from a high-quality dictionary that is automatically constructed using an LLM. We evaluate our approach for the task of one-shot RE using two LLMs and two RE datasets. Our experiments show that CogRE improves explanation quality by addressing two common failure patterns in one-shot RE: poor attention focus and limited one-shot learning capability. For example, our cognitive-structured reasoning with Qwen2.5-14B-Instruct on One-shot NYT29 achieves 24.65% F1, surpassing prior reasoning-based designs. Optimizing this approach with RL using our reward further improves performance by +23.46% (absolute). Finally, human evaluation shows that our best model generates relational keywords closely aligned with gold labels, increasing human explanation quality ratings by 54% (relative).

cs.CV

[60] SkinMap: Weighted Full-Body Skin Segmentation for Robust Remote Photoplethysmography

Zahra Maleki,Amirhossein Akbari,Amirhossein Binesh,Babak Khalaj

Main category: cs.CV

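TL;DR: The paper proposes SkinMap, a new skin segmentation method for improving the quality of remote photoplethysmography (rPPG) signals; it uses weighted full-body skin detection to reduce the influence of interfering regions.

Details Motivation: rPPG has broad potential in remote patient monitoring and smart applications, but its signal quality is sensitive to lighting, movement, and interfering regions such as the mouth and eyes, so better skin segmentation is needed for robust signal extraction.

Contribution: 1. Proposes SkinMap, which prioritizes skin regions to enhance rPPG signal quality; 2. Presents a new dataset, SYNC-rPPG, that better represents real-world conditions; 3. Captures heart rate accurately under challenging conditions such as talking and head rotation.

Method: Applies weighted full-body skin segmentation that removes interfering regions such as the mouth, eyes, and hair while detecting skin all over the body, making it more resistant to movement.

Result: Experiments on public datasets and the new SYNC-rPPG dataset show that SkinMap keeps the mean absolute error (MAE) low under challenging conditions such as talking and head rotation, and detects a diverse range of skin tones with high accuracy.

Insight: 1. Full-body skin detection improves rPPG robustness; 2. Removing interfering regions is important for signal quality; 3. The new dataset provides a more representative benchmark for real-world conditions.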

Abstract: Remote photoplethysmography (rPPG) is an innovative method for monitoring heart rate and vital signs by using a simple camera to record a person, as long as any part of their skin is visible. This low-cost, contactless approach helps in remote patient monitoring, emotion analysis, smart vehicle utilization, and more. Over the years, various techniques have been proposed to improve the accuracy of this technology, especially given its sensitivity to lighting and movement. In the unsupervised pipeline, it is necessary to first select skin regions from the video to extract the rPPG signal from the skin color changes. We introduce a novel skin segmentation technique that prioritizes skin regions to enhance the quality of the extracted signal. It can detect areas of skin all over the body, making it more resistant to movement, while removing areas such as the mouth, eyes, and hair that may cause interference. Our model is evaluated on publicly available datasets, and we also present a new dataset, called SYNC-rPPG, to better represent real-world conditions. The results indicate that our model demonstrates a superior ability to capture heartbeats in challenging conditions, such as talking and head rotation, and keeps the mean absolute error (MAE) between predicted and actual heart rates low, while other methods fail to do so. In addition, we demonstrate high accuracy in detecting a diverse range of skin tones, making this technique a promising option for real-world applications.

[61] DeepAf: One-Shot Spatiospectral Auto-Focus Model for Digital Pathology

Yousef Yeganeh,Maximilian Frantzen,Michael Lee,Kun-Hsing Yu,Nassir Navab,Azade Farshad

Main category: cs.CV

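TL;DR: DeepAf is a novel auto-focus framework that combines spatial and spectral features to predict focus from a single shot, substantially reducing focusing time while maintaining high accuracy, and is suited to resource-constrained settings.

Details Motivation: Conventional WSI scanners are expensive, while low-cost alternatives fall short in focus consistency, time cost, or generalization, so an efficient and general auto-focus method is needed.

Contribution: Proposes DeepAf, a hybrid architecture combining spatial and spectral features that predicts focus accurately from a single input image, improving efficiency and accuracy and generalizing across labs.

Method: Extracts spatiospectral features with a hybrid architecture, regresses the distance to the optimal focal point in a single shot, and adjusts the control parameters for optimal image outcomes.

Result: Focusing time is reduced by 80% compared to stack-based methods, focus accuracy reaches 0.18 μm on same-lab samples, cancer classification reaches 0.90 AUC at 4x magnification, and cross-lab generalization is robust.

Insight: Combining spatial and spectral features improves single-shot focus accuracy, which suits low-cost conversion of conventional microscopes and supports digital pathology in resource-constrained settings.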

Abstract: While Whole Slide Imaging (WSI) scanners remain the gold standard for digitizing pathology samples, their high cost limits accessibility in many healthcare settings. Other low-cost solutions also face critical limitations: automated microscopes struggle with consistent focus across varying tissue morphology, traditional auto-focus methods require time-consuming focal stacks, and existing deep-learning approaches either need multiple input images or lack generalization capability across tissue types and staining protocols. We introduce a novel automated microscopic system powered by DeepAf, a novel auto-focus framework that uniquely combines spatial and spectral features through a hybrid architecture for single-shot focus prediction. The proposed network automatically regresses the distance to the optimal focal point using the extracted spatiospectral features and adjusts the control parameters for optimal image outcomes. Our system transforms conventional microscopes into efficient slide scanners, reducing focusing time by 80% compared to stack-based methods while achieving focus accuracy of 0.18 $\mu$m on the same-lab samples, matching the performance of dual-image methods (0.19 $\mu$m) with half the input requirements. DeepAf demonstrates robust cross-lab generalization with only 0.72% false focus predictions and 90% of predictions within the depth of field. Through an extensive clinical study of 536 brain tissue samples, our system achieves 0.90 AUC in cancer classification at 4x magnification, a significant achievement at lower magnification than typical 20x WSI scans. This results in a comprehensive hardware-software design enabling accessible, real-time digital pathology in resource-constrained settings while maintaining diagnostic accuracy.

[62] LightCache: Memory-Efficient, Training-Free Acceleration for Video Generation

Yang Xiao,Gen Li,Kaiyuan Deng,Yushu Wu,Zheng Zhan,Yanzhi Wang,Xiaolong Ma,Bo Hui

Main category: cs.CV

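TL;DR: The paper proposes LightCache, a training-free acceleration method for diffusion-based video generation that uses stage-specific strategies to reduce memory consumption while preserving speedups and keeping quality stable.

Details Motivation: The redundancy of latents in diffusion model inference is a natural entry point for acceleration, but cache-based methods often cause substantial memory surges in the latter two stages (denoising and decoding).

Contribution: 1) Proposes stage-specific strategies: asynchronous cache swapping, feature chunking, and slicing latents for decoding; 2) Lowers memory consumption while still accelerating inference; 3) Releases the code.

Method: 1) Decomposes inference into encoding, denoising, and decoding stages; 2) Based on the characteristics of each stage, applies asynchronous cache swapping, feature chunking, and slicing latents for decoding.

Result: Compared with the baseline, LightCache achieves faster inference and lower memory usage while keeping quality degradation within an acceptable range.

Insight: The staged structure of diffusion inference leaves flexible design space for memory optimization; finer stage decomposition and strategy combinations could be explored further.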

Abstract: Training-free acceleration has emerged as an advanced research area in video generation based on diffusion models. The redundancy of latents in diffusion model inference provides a natural entry point for acceleration. In this paper, we decompose the inference process into the encoding, denoising, and decoding stages, and observe that cache-based acceleration methods often lead to substantial memory surges in the latter two stages. To address this problem, we analyze the characteristics of inference across different stages and propose stage-specific strategies for reducing memory consumption: (1) asynchronous cache swapping, (2) feature chunking, and (3) slicing latents for decoding. At the same time, we ensure that the time overhead introduced by these three strategies remains lower than the acceleration gains themselves. Compared with the baseline, our approach achieves faster inference speed and lower memory usage, while maintaining quality degradation within an acceptable range. The code is available at https://github.com/NKUShaw/LightCache.
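The idea behind slicing latents for decoding is that peak memory is bounded by the slice size rather than by the full sequence. The sketch below is a generic illustration under assumptions: latents are sliced along the frame axis, `decode_fn` is a hypothetical stand-in for a video decoder, and nothing here is taken from the LightCache repository.

```python
def decode_in_slices(latents, decode_fn, slice_size):
    """Decode a latent sequence slice by slice instead of all at once.

    Only `slice_size` latents are passed to the decoder per call, so the
    decoder's working memory scales with the slice, not the whole video.
    """
    if slice_size < 1:
        raise ValueError("slice_size must be at least 1")
    frames = []
    for start in range(0, len(latents), slice_size):
        frames.extend(decode_fn(latents[start:start + slice_size]))
    return frames


batch_sizes = []  # records how many latents each decoder call received


def toy_decoder(batch):
    """Hypothetical decoder: doubles each latent value."""
    batch_sizes.append(len(batch))
    return [2 * z for z in batch]


frames = decode_in_slices(list(range(10)), toy_decoder, slice_size=4)
```

The output matches a single full-batch decode for this toy decoder. A real video decoder with temporal layers may need overlapping slices or blending at slice boundaries, which this sketch does not model.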

[63] See the past: Time-Reversed Scene Reconstruction from Thermal Traces Using Visual Language Models

Kebin Contreras,Luis Toscano-Palomino,Mauro Dalla Mura,Jorge Bacca

Main category: cs.CV

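TL;DR: The paper proposes a time-reversed reconstruction framework based on thermal imaging and visual-language models that infers past scene states from thermal traces.

Details Motivation: Thermal imaging captures the residual heat traces that people leave when they interact with a scene. These traces fade over time and serve as passive temporal codes for inferring recent events, which is beyond the capabilities of RGB cameras, with potential applications in forensics and scene analysis.

Contribution: Proposes a time-reversed reconstruction framework that couples thermal imaging with visual-language models and demonstrates the feasibility of reconstructing plausible scenes from up to 120 seconds earlier.

Method: Uses paired RGB and thermal images with two visual-language models (one generates scene descriptions, the other guides image reconstruction) and a constrained diffusion process to recover past scenes with semantic and structural consistency.

Result: Evaluated in three controlled scenarios, the method reconstructs plausible past frames up to 120 seconds earlier.

Insight: Thermal traces can act as passive temporal codes, offering a first step toward time-reversed imaging.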

Abstract: Recovering the past from present observations is an intriguing challenge with potential applications in forensics and scene analysis. Thermal imaging, operating in the infrared range, provides access to otherwise invisible information. Since humans are typically warmer (37 °C, 98.6 °F) than their surroundings, interactions such as sitting, touching, or leaning leave residual heat traces. These fading imprints serve as passive temporal codes, allowing for the inference of recent events that exceed the capabilities of RGB cameras. This work proposes a time-reversed reconstruction framework that uses paired RGB and thermal images to recover scene states from a few seconds earlier. The proposed approach couples Visual-Language Models (VLMs) with a constrained diffusion process, where one VLM generates scene descriptions and another guides image reconstruction, ensuring semantic and structural consistency. The method is evaluated in three controlled scenarios, demonstrating the feasibility of reconstructing plausible past frames up to 120 seconds earlier, providing a first step toward time-reversed imaging from thermal traces.

[64] Personalizing Retrieval using Joint Embeddings or “the Return of Fluffy”

Bruno Korbar,Andrew Zisserman

Main category: cs.CV

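TL;DR: The paper proposes a personalized image retrieval method that combines image embeddings with text descriptions: a mapping network translates an image embedding into a text token that is used with CLIP, improving retrieval.

Details Motivation: Existing methods struggle to use object instance information from an image together with a natural language description, for example retrieving "Fluffy the unicorn on someone's head".

Contribution: Designs the pi-map mapping network, which translates a local image embedding of an object instance into a text token and, combined with frozen CLIP encoders, improves retrieval performance.

Method: A trainable pi-map network translates the image embedding into a text token, which is combined with the natural language query and fed to the CLIP text encoder for retrieval.

Result: Improves the state of the art on two benchmarks designed to assess personalized retrieval.

Insight: The text token is generated by a simple training procedure that only needs to be performed once per object instance.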

Abstract: The goal of this paper is to retrieve images using a compound query that combines object instance information from an image with a natural text description of what that object is doing or where it is. For example, to retrieve an image of “Fluffy the unicorn (specified by an image) on someone’s head”. To achieve this we design a mapping network that can “translate” from a local image embedding (of the object instance) to a text token, such that the combination of the token and a natural language query is suitable for CLIP-style text encoding and image retrieval. Generating a text token in this manner involves a simple training procedure that only needs to be performed once for each object instance. We show that our approach of using a trainable mapping network, termed pi-map, together with frozen CLIP text and image encoders, improves the state of the art on two benchmarks designed to assess personalized retrieval.

[65] ArchitectHead: Continuous Level of Detail Control for 3D Gaussian Head Avatars

Peizhi Yan,Rabab Ward,Qiang Tang,Shan Du

Main category: cs.CV

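TL;DR: ArchitectHead is the first 3D Gaussian head avatar framework to support continuous level-of-detail (LOD) control. It parameterizes the Gaussians in a 2D UV feature space, which allows efficient LOD adjustment without retraining.

Details Motivation: Existing 3D Gaussian avatars typically rely on a fixed number of Gaussians, but practical applications need to adjust the LOD dynamically to balance rendering efficiency and visual quality.

Contribution: Proposes the ArchitectHead framework, which supports continuous LOD control and achieves efficient dynamic adjustment through a UV feature field and a lightweight decoder.

Method: Gaussians are parameterized in a 2D UV feature space. Multi-level learnable feature maps encode their latent features, and a decoder transforms these into 3D Gaussian attributes. The LOD is controlled by dynamically resampling the feature maps.

Result: Achieves SOTA quality at the highest LOD. At the lowest LOD it uses only 6.2% of the Gaussians, rendering speed nearly doubles, and quality degrades only moderately.

Insight: Dynamic resampling of the UV feature field offers an efficient and flexible way to achieve continuous LOD control.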

Abstract: 3D Gaussian Splatting (3DGS) has enabled photorealistic and real-time rendering of 3D head avatars. Existing 3DGS-based avatars typically rely on tens of thousands of 3D Gaussian points (Gaussians), with the number of Gaussians fixed after training. However, many practical applications require adjustable levels of detail (LOD) to balance rendering efficiency and visual quality. In this work, we propose “ArchitectHead”, the first framework for creating 3D Gaussian head avatars that support continuous control over LOD. Our key idea is to parameterize the Gaussians in a 2D UV feature space and propose a UV feature field composed of multi-level learnable feature maps to encode their latent features. A lightweight neural network-based decoder then transforms these latent features into 3D Gaussian attributes for rendering. ArchitectHead controls the number of Gaussians by dynamically resampling feature maps from the UV feature field at the desired resolutions. This method enables efficient and continuous control of LOD without retraining. Experimental results show that ArchitectHead achieves state-of-the-art (SOTA) quality in self and cross-identity reenactment tasks at the highest LOD, while maintaining near SOTA performance at lower LODs. At the lowest LOD, our method uses only 6.2% of the Gaussians while the quality degrades moderately (L1 Loss +7.9%, PSNR –0.97%, SSIM –0.6%, LPIPS Loss +24.1%), and the rendering speed nearly doubles.
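The resampling idea can be illustrated with a toy sketch, assuming one Gaussian is decoded per resampled texel so that the Gaussian count grows with the square of the sampling resolution. The function names and grid sizes are hypothetical, and a single-channel grid stands in for the paper's multi-level feature maps.

```python
def bilinear_resample(feature_map, out_res):
    """Resample a square single-channel grid (list of rows) to out_res x out_res
    using bilinear interpolation with aligned corners."""
    in_res = len(feature_map)
    out = []
    for i in range(out_res):
        # Continuous source coordinate for this output row.
        y = i * (in_res - 1) / (out_res - 1) if out_res > 1 else 0.0
        y0 = int(y)
        y1 = min(y0 + 1, in_res - 1)
        wy = y - y0
        row = []
        for j in range(out_res):
            x = j * (in_res - 1) / (out_res - 1) if out_res > 1 else 0.0
            x0 = int(x)
            x1 = min(x0 + 1, in_res - 1)
            wx = x - x0
            top = (1 - wx) * feature_map[y0][x0] + wx * feature_map[y0][x1]
            bottom = (1 - wx) * feature_map[y1][x0] + wx * feature_map[y1][x1]
            row.append((1 - wy) * top + wy * bottom)
        out.append(row)
    return out


def gaussian_fraction(low_res, high_res):
    """Fraction of texels (and, under the one-Gaussian-per-texel assumption,
    of Gaussians) kept when sampling at low_res instead of high_res."""
    return (low_res * low_res) / (high_res * high_res)


coarse = [[0.0, 1.0], [2.0, 3.0]]
fine = bilinear_resample(coarse, 3)
```

Because the resolution is just an argument to the resampler, any intermediate value can be chosen at render time, which is what makes the control continuous rather than limited to a few trained levels.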

[66] Human Action Recognition from Point Clouds over Time

James Dickens

Main category: cs.CV

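TL;DR: This paper proposes a new approach to human action recognition from 3D videos: it segments and tracks human point clouds and combines point-based techniques with sparse convolutional networks to improve recognition accuracy.

Details Motivation: Existing human action recognition research focuses mainly on skeletal action recognition and video-based methods. As depth sensors and Lidar devices become widely available, using dense 3D data for action recognition is emerging as a new direction.

Contribution: The main contributions are: 1) a pipeline that segments and tracks human point clouds in 3D videos; 2) a novel backbone that combines point-based techniques with sparse convolutional networks; 3) improved recognition accuracy through auxiliary point features (such as surface normals and color).

Method: The core steps are: 1) segment human point clouds from the scene and track them over time; 2) build a 3D action recognition backbone from point-based techniques and sparse convolutional networks; 3) incorporate auxiliary point features (surface normals, color, infrared intensity, body part parsing labels).

Result: On the NTU RGB-D 120 dataset, an ensemble of sensor-based and estimated depth inputs reaches 89.3% accuracy when different human subjects are used for training and testing, outperforming previous point cloud action recognition methods.

Insight: The paper shows the potential of combining auxiliary point features with 3D point cloud techniques and suggests a direction for future action recognition research based on depth sensors and Lidar.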

Abstract: Recent research into human action recognition (HAR) has focused predominantly on skeletal action recognition and video-based methods. With the increasing availability of consumer-grade depth sensors and Lidar instruments, there is a growing opportunity to leverage dense 3D data for action recognition, to develop a third way. This paper presents a novel approach for recognizing actions from 3D videos by introducing a pipeline that segments human point clouds from the background of a scene, tracks individuals over time, and performs body part segmentation. The method supports point clouds from both depth sensors and monocular depth estimation. At the core of the proposed HAR framework is a novel backbone for 3D action recognition, which combines point-based techniques with sparse convolutional networks applied to voxel-mapped point cloud sequences. Experiments incorporate auxiliary point features including surface normals, color, infrared intensity, and body part parsing labels, to enhance recognition accuracy. Evaluation on the NTU RGB-D 120 dataset demonstrates that the method is competitive with existing skeletal action recognition algorithms. Moreover, combining both sensor-based and estimated depth inputs in an ensemble setup, this approach achieves 89.3% accuracy when different human subjects are considered for training and testing, outperforming previous point cloud action recognition methods.
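The "voxel-mapped point cloud sequences" mentioned in this abstract can be illustrated with a minimal sketch. It assumes a uniform grid with a hypothetical `voxel_size`, stores only the centroid of each occupied voxel, and uses the frame index as a fourth coordinate; the paper's actual voxelization and features may differ.

```python
import math
from collections import defaultdict


def voxelize(points, voxel_size):
    """Group 3D points by integer voxel index and return a dict that maps each
    occupied voxel to the centroid of the points inside it."""
    buckets = defaultdict(list)
    for x, y, z in points:
        key = (
            math.floor(x / voxel_size),
            math.floor(y / voxel_size),
            math.floor(z / voxel_size),
        )
        buckets[key].append((x, y, z))
    return {
        key: tuple(sum(coord) / len(pts) for coord in zip(*pts))
        for key, pts in buckets.items()
    }


def voxelize_sequence(frames, voxel_size):
    """Voxelize every frame and prepend the frame index to each voxel key,
    giving a sparse (t, i, j, k) -> centroid mapping."""
    sparse = {}
    for t, points in enumerate(frames):
        for key, centroid in voxelize(points, voxel_size).items():
            sparse[(t,) + key] = centroid
    return sparse


frame0 = [(0.1, 0.1, 0.1), (0.3, 0.1, 0.1), (1.2, 0.0, -0.1)]
frame1 = [(0.1, 0.1, 0.1)]
sparse = voxelize_sequence([frame0, frame1], voxel_size=0.5)
```

Only occupied voxels are stored, which is the property sparse convolutional networks exploit: most of the volume around a person is empty.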

[67] Seeing the Big Picture: Evaluating Multimodal LLMs’ Ability to Interpret and Grade Handwritten Student Work

Owen Henkel,Bill Roberts,Doug Jaffe,Laurence Holt

Main category: cs.CV

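TL;DR: The paper studies how multimodal large language models (MLLMs) perform at grading and analyzing handwritten student work. They come close to human performance on objective arithmetic problems but fall well short on illustration tasks that require sophisticated visual interpretation and pedagogical judgment.

Details Motivation: The study explores the potential of MLLMs in education, in particular for grading and giving feedback on handwritten work, to reduce teachers' workload and improve efficiency.

Contribution: Two experiments evaluate MLLMs on handwritten mathematics classwork, reveal a capability gap between objective problems and illustration tasks, and point to directions for improvement.

Method: Experiment A evaluates MLLM grading on 288 handwritten arithmetic responses. Experiment B evaluates performance on 150 mathematical illustrations and tests whether adding human descriptions improves the models.

Result: MLLMs perform well on arithmetic problems (95% accuracy, k = 0.90) but poorly on illustrations (k = 0.20); human descriptions raise agreement substantially (k = 0.47).

Insight: MLLMs do well on simple visual tasks but need improvement on tasks requiring sophisticated visual interpretation and pedagogical judgment; human assistance can substantially improve their performance.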

Abstract: Recent advances in multimodal large language models (MLLMs) raise the question of their potential for grading, analyzing, and offering feedback on handwritten student classwork. This capability would be particularly beneficial in elementary and middle-school mathematics education, where most work remains handwritten, because seeing students’ full working of a problem provides valuable insights into their learning processes, but is extremely time-consuming to grade. We present two experiments investigating MLLM performance on handwritten student mathematics classwork. Experiment A examines 288 handwritten responses from Ghanaian middle school students solving arithmetic problems with objective answers. In this context, models achieved near-human accuracy (95%, k = 0.90) but exhibited occasional errors that human educators would be unlikely to make. Experiment B evaluates 150 mathematical illustrations from American elementary students, where the drawings are the answer to the question. These tasks lack single objective answers and require sophisticated visual interpretation as well as pedagogical judgment in order to analyze and evaluate them. We attempted to separate MLLMs’ visual capabilities from their pedagogical abilities by first asking them to grade the student illustrations directly, and then by augmenting the image with a detailed human description of the illustration. We found that when the models had to analyze the student illustrations directly, they struggled, achieving only k = 0.20 with ground truth scores, but when given human descriptions, their agreement levels improved dramatically to k = 0.47, which was in line with human-to-human agreement levels. This gap suggests MLLMs can “see” and interpret arithmetic work relatively well, but still struggle to “see” student mathematical illustrations.
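The agreement values reported above can be read more easily with a worked computation. The sketch below assumes that "k" denotes unweighted Cohen's kappa between model grades and ground-truth grades; the abstract does not say whether a weighted variant was used, and the labels here are made up for illustration.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Unweighted Cohen's kappa: (p_o - p_e) / (1 - p_e), where p_o is the
    observed agreement and p_e the agreement expected by chance."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label; agreement is trivially perfect.
        return 1.0
    return (observed - expected) / (1 - expected)


model_grades = [1, 1, 0, 0]
human_grades = [1, 1, 0, 1]
kappa = cohen_kappa(model_grades, human_grades)
```

Here the raters agree on 3 of 4 items (p_o = 0.75) while chance agreement is 0.5, so kappa is 0.5. This shows why a kappa of 0.20 indicates weak agreement even when raw accuracy looks moderate.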

[68] Midway Network: Learning Representations for Recognition and Motion from Latent Dynamics

Christopher Hoang,Mengye Ren

Main category: cs.CV

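TL;DR: Midway Network is a new self-supervised learning architecture that, by extending latent dynamics modeling, is the first to learn representations for both object recognition and motion understanding from natural videos.

Details Motivation: Existing self-supervised methods focus on either object recognition or motion understanding rather than both together. Latent dynamics modeling has been used for decision-making tasks but not yet for visual representation learning.

Contribution: Proposes Midway Network, which introduces a midway top-down path and a dense forward prediction objective and is the first to learn representations for both object recognition and motion understanding from natural videos.

Method: Midway Network uses a midway path to infer motion latents between video frames, together with a dense forward prediction objective and a hierarchical structure to handle complex multi-object scenes.

Result: After pretraining on two large-scale natural video datasets, Midway Network performs strongly on semantic segmentation and optical flow tasks relative to prior self-supervised learning methods.

Insight: A novel analysis based on forward feature perturbation shows that Midway Network's learned dynamics can capture high-level correspondence, indicating its potential in complex scenes.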

Abstract: Object recognition and motion understanding are key components of perception that complement each other. While self-supervised learning methods have shown promise in their ability to learn from unlabeled data, they have primarily focused on obtaining rich representations for either recognition or motion rather than both in tandem. On the other hand, latent dynamics modeling has been used in decision making to learn latent representations of observations and their transformations over time for control and planning tasks. In this work, we present Midway Network, a new self-supervised learning architecture that is the first to learn strong visual representations for both object recognition and motion understanding solely from natural videos, by extending latent dynamics modeling to this domain. Midway Network leverages a midway top-down path to infer motion latents between video frames, as well as a dense forward prediction objective and hierarchical structure to tackle the complex, multi-object scenes of natural videos. We demonstrate that after pretraining on two large-scale natural video datasets, Midway Network achieves strong performance on both semantic segmentation and optical flow tasks relative to prior self-supervised learning methods. We also show that Midway Network’s learned dynamics can capture high-level correspondence via a novel analysis method based on forward feature perturbation.

[69] HoloScene: Simulation-Ready Interactive 3D Worlds from a Single Video

Hongchi Xia,Chih-Hao Lin,Hao-Yu Hsu,Quentin Leboutet,Katelyn Gao,Michael Paulitsch,Benjamin Ummenhofer,Shenlong Wang

Main category: cs.CV

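TL;DR: HoloScene is a framework that reconstructs interactive 3D worlds from a single video, jointly addressing geometry completeness, object interactivity, physical plausibility, and rendering quality.

Details Motivation: Existing 3D reconstruction methods fall short in geometry completeness, object interactivity, physical plausibility, or rendering quality, which limits their use for simulating virtual environments.

Contribution: Proposes HoloScene, a novel framework that reconstructs from a single video a 3D virtual environment combining geometry completeness, interactivity, physical plausibility, and rendering quality.

Method: Reconstruction is formulated as an energy-based optimization that integrates observational data, physical constraints, and generative priors, and is solved with a hybrid of sampling-based exploration and gradient-based refinement.

Result: The resulting digital twins have complete geometry, physical stability, and realistic rendering, and perform well on benchmarks and in practical use cases.

Insight: HoloScene's unified optimization framework suggests a path toward realistic simulation and interaction in virtual environments, with applications in AR/VR, gaming, and robotics.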

Abstract: Digitizing the physical world into accurate simulation-ready virtual environments offers significant opportunities in a variety of fields such as augmented and virtual reality, gaming, and robotics. However, current 3D reconstruction and scene-understanding methods commonly fall short in one or more critical aspects, such as geometry completeness, object interactivity, physical plausibility, photorealistic rendering, or realistic physical properties for reliable dynamic simulation. To address these limitations, we introduce HoloScene, a novel interactive 3D reconstruction framework that simultaneously achieves these requirements. HoloScene leverages a comprehensive interactive scene-graph representation, encoding object geometry, appearance, and physical properties alongside hierarchical and inter-object relationships. Reconstruction is formulated as an energy-based optimization problem, integrating observational data, physical constraints, and generative priors into a unified, coherent objective. Optimization is efficiently performed via a hybrid approach combining sampling-based exploration with gradient-based refinement. The resulting digital twins exhibit complete and precise geometry, physical stability, and realistic rendering from novel viewpoints. Evaluations conducted on multiple benchmark datasets demonstrate superior performance, while practical use-cases in interactive gaming and real-time digital-twin manipulation illustrate HoloScene’s broad applicability and effectiveness. Project page: https://xiahongchi.github.io/HoloScene.

[70] CalibCLIP: Contextual Calibration of Dominant Semantics for Text-Driven Image Retrieval

Bin Kang,Bin Chen,Junjie Wang,Yulin Li,Junzhi Zhao,Zhuotao Tian

Main category: cs.CV

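TL;DR: The paper proposes CalibCLIP, a training-free method that calibrates the visual and textual spaces to counter a problem in existing vision-language models, where a few low-contribution tokens dominate the semantics. This improves text-driven image retrieval.

Details Motivation: Existing vision-language models (VLMs) have a structural limitation: a few low-contribution tokens excessively capture global semantics and suppress discriminative features, which hurts text-driven image retrieval.

Contribution: 1. Proposes the Contrastive Visual Enhancer (CVE) in the visual space, which decouples features and dynamically suppresses dominant tokens; 2. Introduces the Discriminative Concept Calibrator (DCC) in the textual space, which distinguishes general from discriminative concepts; 3. Validates the method on multiple benchmarks.

Method: 1. Visual space: CVE decouples visual features into target and low-information regions and dynamically suppresses dominant tokens; 2. Textual space: DCC distinguishes general from discriminative concepts and strengthens discriminative features; 3. The calibration requires no training.

Result: Shows consistent improvements on seven benchmarks spanning three image retrieval tasks.

Insight: Calibrating the visual and textual spaces reduces the negative effect of dominant semantics and improves the discriminability and robustness of retrieval.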

Abstract: Existing Visual Language Models (VLMs) suffer structural limitations where a few low-contribution tokens may excessively capture global semantics, dominating the information aggregation process and suppressing the discriminative features in text-driven image retrieval tasks. To address this, we introduce CalibCLIP, a training-free method designed to calibrate the suppressive effect of dominant tokens. Specifically, in the visual space, we propose the Contrastive Visual Enhancer (CVE), which decouples visual features into target and low-information regions. Subsequently, it identifies dominant tokens and dynamically suppresses their representations. In the textual space, we introduce the Discriminative Concept Calibrator (DCC), which aims to differentiate between general and discriminative concepts within the text query. By mitigating the challenges posed by generic concepts and improving the representations of discriminative concepts, DCC strengthens the differentiation among similar samples. Finally, extensive experiments demonstrate consistent improvements across seven benchmarks spanning three image retrieval tasks, underscoring the effectiveness of CalibCLIP. Code is available at: https://github.com/kangbin98/CalibCLIP

[71] Improving Chain-of-Thought Efficiency for Autoregressive Image Generation

Zeqi Gu,Markos Georgopoulos,Xiaoliang Dai,Marjan Ghazvininejad,Chu Wang,Felix Juefei-Xu,Kunpeng Li,Yujun Shi,Zecheng He,Zijian He,Jiawei Zhou,Abe Davis,Jialiang Wang

Main category: cs.CV

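TL;DR: The paper introduces ShortCoTI, a lightweight optimization framework that reduces redundancy in chain-of-thought (CoT) reasoning, making autoregressive image generation more efficient while preserving output quality.

Details Motivation: Existing methods expand user inputs into elaborated prompts (CoT) to improve detail and alignment in image generation, but this introduces redundancy and raises computational cost.

Contribution: Proposes the ShortCoTI framework for generating more concise CoT sequences; with reinforcement learning it reduces prompt reasoning length by 54% without lowering image quality.

Method: ShortCoTI uses an adaptive reward function that scales with an estimated difficulty for each task and optimizes prompt length with reinforcement learning.

Result: On multiple benchmarks (T2I-CompBench, GenEval), prompt reasoning length drops by 54% while quality is maintained or slightly improved.

Insight: Concise CoT prompts can eliminate verbose explanations and repetitive refinements while remaining semantically rich, which improves computational efficiency.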

Abstract: Autoregressive multimodal large language models have recently gained popularity for image generation, driven by advances in foundation models. To enhance alignment and detail, newer approaches employ chain-of-thought (CoT) reasoning, expanding user inputs into elaborated prompts prior to image synthesis. However, this strategy can introduce unnecessary redundancy – a phenomenon we call visual overthinking – which increases computational costs and can introduce details that contradict the original prompt. In this work, we explore how to generate more concise CoT sequences for more efficient image generation. We introduce ShortCoTI, a lightweight optimization framework that encourages more concise CoT while preserving output image quality. ShortCoTI rewards more concise prompts with an adaptive function that scales according to an estimated difficulty for each task. Incorporating this reward into a reinforcement learning paradigm reduces prompt reasoning length by 54% while maintaining or slightly improving quality metrics across multiple benchmarks (T2I-CompBench, GenEval). Qualitative analysis shows that our method eliminates verbose explanations and repetitive refinements, producing reasoning prompts that are both concise and semantically rich. As a result, ShortCoTI improves computational efficiency without compromising the fidelity or visual appeal of generated images.
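The abstract does not give the form of the adaptive reward, so the following is only a hypothetical sketch of a length reward that scales with task difficulty. The budget formula, the default constants, and the names `length_reward` and `total_reward` are all assumptions made for illustration, not ShortCoTI's actual reward.

```python
def length_reward(num_tokens, difficulty, base_budget=50, extra_budget=150):
    """Hypothetical conciseness reward in (0, 1].

    difficulty is an estimated task difficulty in [0, 1]; harder tasks get a
    larger token budget. Reasoning within budget earns the full reward, and
    longer reasoning is penalized in proportion to how far it overshoots."""
    if not 0.0 <= difficulty <= 1.0:
        raise ValueError("difficulty must lie in [0, 1]")
    if num_tokens <= 0:
        raise ValueError("num_tokens must be positive")
    budget = base_budget + difficulty * extra_budget
    if num_tokens <= budget:
        return 1.0
    return budget / num_tokens


def total_reward(quality, num_tokens, difficulty, length_weight=0.2):
    """Combine an image-quality score with the weighted conciseness reward."""
    return quality + length_weight * length_reward(num_tokens, difficulty)


easy = length_reward(100, difficulty=0.0)   # budget 50, so penalized
hard = length_reward(100, difficulty=1.0)   # budget 200, so not penalized
```

The same 100-token reasoning trace is penalized on an easy prompt but not on a hard one, which is the intuition behind scaling the reward by difficulty rather than applying one fixed length penalty.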

[72] HOI-R1: Exploring the Potential of Multimodal Large Language Models for Human-Object Interaction Detection

Junwen Chen,Peilin Xiong,Keiji Yanai

Main category: cs.CV

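TL;DR: The paper proposes HOI-R1, the first exploration of multimodal large language models (MLLMs) for human-object interaction detection (HOID). It needs no additional detection modules, solves the HOID task purely in text, and achieves a large performance gain on the HICO-DET dataset.

Details Motivation: Existing HOID methods depend on prior knowledge from vision-language models (VLMs), and their training strategies and model architectures are complex. The reasoning ability of MLLMs on the HOID task remains under-explored.

Contribution: 1. The first use of MLLMs to solve the HOID task; 2. An HOI reasoning process and HOID reward functions that solve the task purely in text; 3. Twice the baseline accuracy on the HICO-DET dataset.

Method: 1. Designs an HOI reasoning process; 2. Introduces HOID reward functions and trains the MLLM with reinforcement learning (RL); 3. Solves the task purely in text, without additional detection modules.

Result: On the HICO-DET dataset, HOI-R1 achieves twice the accuracy of the baseline and shows strong generalization ability.

Insight: MLLMs can solve the HOID task purely in text, which simplifies existing complex frameworks and suggests a direction for multimodal reasoning tasks.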

Abstract: Recent human-object interaction detection (HOID) methods rely heavily on prior knowledge from VLMs to enhance their interaction recognition capabilities. The training strategies and model architectures for connecting the knowledge from VLMs to the HOI instance representations from the object detector are challenging, and the whole framework is complex for further development or application. On the other hand, the inherent reasoning abilities of MLLMs on human-object interaction detection are under-explored. Inspired by the recent success of training MLLMs with reinforcement learning (RL) methods, we propose HOI-R1 and are the first to explore the potential of the language model on the HOID task without any additional detection modules. We introduce an HOI reasoning process and HOID reward functions to solve the HOID task purely in text. The results on the HICO-DET dataset show that HOI-R1 achieves 2x the accuracy of the baseline with great generalization ability. The source code is available at https://github.com/cjw2021/HOI-R1.

[73] TFM Dataset: A Novel Multi-task Dataset and Integrated Pipeline for Automated Tear Film Break-Up Segmentation

Guangrong Wan,Jun liu,Tang tang,Lianghao Shi,Wenjun Luo,TingTing Xu

Main category: cs.CV

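TL;DR: The paper presents TFM, the first multi-task tear film analysis dataset, together with the efficient segmentation model TF-Net and the automated analysis pipeline TF-Collab, providing new tools for diagnosing dry eye syndrome.

Details Motivation: Tear film break-up (TFBU) analysis is critical for diagnosing dry eye syndrome, but existing methods lack annotated data and integrated solutions, which has held back automation.

Contribution: 1. Releases TFM, the first multi-task tear film analysis dataset; 2. Proposes the efficient segmentation model TF-Net; 3. Designs the integrated pipeline TF-Collab, which fully automates the analysis.

Method: 1. TF-Net uses a MobileOne-mini backbone with re-parameterization techniques and an enhanced feature pyramid network; 2. TF-Collab coordinates models trained on the three tasks to standardize the input and segment TFBU areas.

Result: TF-Net performs well on the TFM dataset, and TF-Collab fully automates tear film analysis, showing potential for clinical use.

Insight: Combining a multi-task dataset with an integrated pipeline offers a useful approach for medical image analysis and highlights the importance of efficiency and real-time operation.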

Abstract: Tear film break-up (TFBU) analysis is critical for diagnosing dry eye syndrome, but automated TFBU segmentation remains challenging due to the lack of annotated datasets and integrated solutions. This paper introduces the Tear Film Multi-task (TFM) Dataset, the first comprehensive dataset for multi-task tear film analysis, comprising 15 high-resolution videos (totaling 6,247 frames) annotated with three vision tasks: frame-level classification (‘clear’, ‘closed’, ‘broken’, ‘blur’), Placido Ring detection, and pixel-wise TFBU area segmentation. Leveraging this dataset, we first propose TF-Net, a novel and efficient baseline segmentation model. TF-Net incorporates a MobileOne-mini backbone with re-parameterization techniques and an enhanced feature pyramid network to achieve a favorable balance between accuracy and computational efficiency for real-time clinical applications. We further establish benchmark performance on the TFM segmentation subset by comparing TF-Net against several state-of-the-art medical image segmentation models. Furthermore, we design TF-Collab, a novel integrated real-time pipeline that synergistically leverages models trained on all three tasks of the TFM dataset. By sequentially orchestrating frame classification for BUT determination, pupil region localization for input standardization, and TFBU segmentation, TF-Collab fully automates the analysis. Experimental results demonstrate the effectiveness of the proposed TF-Net and TF-Collab, providing a foundation for future research in ocular surface diagnostics. Our code and the TFM datasets are available at https://github.com/glory-wan/TF-Net

[74] Ocular-Induced Abnormal Head Posture: Diagnosis and Missing Data Imputation

Saja Al-Dabet,Sherzod Turaev,Nazar Zaki,Arif O. Khan,Luai Eldweik

Main category: cs.CV

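TL;DR: The paper proposes two deep learning frameworks: AHP-CADNet, for automated diagnosis of ocular-induced abnormal head posture (AHP), and a curriculum learning-based imputation framework for handling missing medical data.

Details Motivation: Clinical diagnosis of AHP usually relies on subjective assessment, and medical records are often incomplete, which makes diagnosis difficult. The study aims to address both problems with automated tools.

Contribution: Proposes the AHP-CADNet framework with a multi-level attention fusion mechanism for accurate diagnosis, and designs a curriculum learning-based imputation framework that improves data completeness.

Method: AHP-CADNet integrates ocular landmarks, head pose features, and clinical attributes. The imputation framework progressively uses structured variables and unstructured clinical notes to fill in missing values.

Result: AHP-CADNet reaches 96.9-99.0% accuracy across classification tasks, and the imputation framework reaches 93.46-99.78% accuracy across clinical variables.

Insight: Multimodal data fusion and curriculum learning-based imputation of missing data can substantially improve the accuracy and robustness of clinical diagnosis.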

Abstract: Ocular-induced abnormal head posture (AHP) is a compensatory mechanism that arises from ocular misalignment conditions, such as strabismus, enabling patients to reduce diplopia and preserve binocular vision. Early diagnosis minimizes morbidity and secondary complications such as facial asymmetry; however, current clinical assessments remain largely subjective and are further complicated by incomplete medical records. This study addresses both challenges through two complementary deep learning frameworks. First, AHP-CADNet is a multi-level attention fusion framework for automated diagnosis that integrates ocular landmarks, head pose features, and structured clinical attributes to generate interpretable predictions. Second, a curriculum learning-based imputation framework is designed to mitigate missing data by progressively leveraging structured variables and unstructured clinical notes to enhance diagnostic robustness under realistic data conditions. Evaluation on the PoseGaze-AHP dataset demonstrates robust diagnostic performance. AHP-CADNet achieves 96.9-99.0 percent accuracy across classification tasks and low prediction errors for continuous variables, with MAE ranging from 0.103 to 0.199 and R² exceeding 0.93. The imputation framework maintains high accuracy across all clinical variables (93.46-99.78 percent with PubMedBERT), with clinical dependency modeling yielding significant improvements (p < 0.001). These findings confirm the effectiveness of both frameworks for automated diagnosis and recovery from missing data in clinical settings.

[75] SD-MVSum: Script-Driven Multimodal Video Summarization Method and Datasets

Manolis Mylonas,Charalampia Zerva,Evlampios Apostolidis,Vasileios Mezaris

Main category: cs.CV

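TL;DR: SD-MVSum is a script-driven multimodal video summarization method. A weighted cross-modal attention mechanism exploits the semantic similarity between the script and both the video content and the transcript, making the summary more relevant to the script. Two large-scale datasets are also extended to support training and evaluation.

Details Motivation: Existing script-driven video summarization considers only the visual content and ignores how the script relates to other modalities such as the spoken transcript. SD-MVSum aims to improve script-driven summarization through multimodal modeling.

Contribution: 1. Proposes SD-MVSum, which uses a weighted cross-modal attention mechanism to model the dependence between the script and the video or transcript; 2. Extends two large-scale datasets (S-VideoXum and MrHiSum) to support multimodal video summarization.

Method: 1. A weighted cross-modal attention mechanism explicitly captures the semantic similarity between the script and the video or transcript; 2. Multimodal features are fused to produce a summary that is highly relevant to the script.

Result: Experiments show that SD-MVSum is competitive with other SOTA methods on both script-driven and generic video summarization.

Insight: Multimodal information, in particular combining the script with the spoken transcript, can improve the semantic relevance and accuracy of video summaries.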

Abstract: In this work, we extend a recent method for script-driven video summarization, originally considering just the visual content of the video, to take into account the relevance of the user-provided script also with the video’s spoken content. In the proposed method, SD-MVSum, the dependence between each considered pair of data modalities, i.e., script-video and script-transcript, is modeled using a new weighted cross-modal attention mechanism. This explicitly exploits the semantic similarity between the paired modalities in order to promote the parts of the full-length video with the highest relevance to the user-provided script. Furthermore, we extend two large-scale datasets for video summarization (S-VideoXum, MrHiSum), to make them suitable for training and evaluation of script-driven multimodal video summarization methods. Experimental comparisons document the competitiveness of our SD-MVSum method against other SOTA approaches for script-driven and generic video summarization. Our new method and extended datasets are available at: https://github.com/IDT-ITI/SD-MVSum.
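The abstract does not spell out how the attention is weighted, so the sketch below shows only one plausible reading: standard scaled dot-product cross-attention whose weights are rescaled by the cosine similarity between each query and key, then renormalized. The function name and the weighting rule are assumptions for illustration, not the SD-MVSum mechanism.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    norm = math.sqrt(dot(u, u)) * math.sqrt(dot(v, v))
    return dot(u, v) / norm if norm > 0 else 0.0


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_cross_attention(queries, keys, values):
    """For each query (e.g. a video frame embedding), attend over keys
    (e.g. script sentence embeddings), with attention weights rescaled by
    non-negative cosine similarity. Hypothetical weighting rule."""
    dim = len(keys[0])
    outputs = []
    for q in queries:
        attn = softmax([dot(q, k) / math.sqrt(dim) for k in keys])
        sims = [max(cosine(q, k), 0.0) for k in keys]
        weights = [a * s for a, s in zip(attn, sims)]
        total = sum(weights)
        # Fall back to plain attention if every similarity is zero.
        weights = [w / total for w in weights] if total > 0 else attn
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, values))
             for j in range(len(values[0]))]
        )
    return outputs


frames = [[1.0, 0.0]]
script = [[1.0, 0.0], [0.0, 1.0]]
attended = weighted_cross_attention(frames, script, script)
```

In this toy case the frame is orthogonal to the second script sentence, so the similarity weighting removes that sentence entirely, whereas plain softmax attention would still assign it some weight.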

[76] A Hierarchical Geometry-guided Transformer for Histological Subtyping of Primary Liver Cancer

Anwen Lu,Mingxin Liu,Yiping Jiao,Hongyi Gong,Geyang Xu,Jun Chen,Jun Xu

Main category: cs.CV

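TL;DR: This paper proposes ARGUS, a hierarchical geometry-guided Transformer that captures macro-meso-micro hierarchical information within the tumor microenvironment (TME) to improve histological subtyping of primary liver cancer.

Details Motivation: The histological subtypes of primary liver cancer (HCC and ICC) show complex tissue morphology and cellular architecture in whole slide images (WSIs), but existing methods do not fully exploit this hierarchical information and these geometric features, which limits subtyping performance.

Contribution: 1. Introduces a micro-geometry feature that represents cell-level patterns through the geometric structure across nuclei; 2. Designs a Hierarchical Field-of-Views (FoVs) Alignment module that models macro- and meso-level interactions in WSIs; 3. Proposes a Geometry Prior Guided Fusion strategy that integrates the multi-level features.

Method: 1. Construct micro-geometry features to represent cell-level patterns; 2. Use the FoVs Alignment module to capture hierarchical interactions in WSIs; 3. Fuse the multi-level features with the Geometry Prior Guided Fusion strategy.

Result: ARGUS achieves SOTA performance in histological subtyping of liver cancer on public and private cohorts.

Insight: Hierarchical structure and geometric features are important sources of information in WSIs; exploiting them can substantially improve subtyping performance and provide a useful tool for clinical diagnosis.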

Abstract: Primary liver malignancies are widely recognized as the most heterogeneous and prognostically diverse cancers of the digestive system. Among these, hepatocellular carcinoma (HCC) and intrahepatic cholangiocarcinoma (ICC) emerge as the two principal histological subtypes, demonstrating significantly greater complexity in tissue morphology and cellular architecture than other common tumors. The intricate representation of features in Whole Slide Images (WSIs) encompasses abundant crucial information for liver cancer histological subtyping, regarding hierarchical pyramid structure, tumor microenvironment (TME), and geometric representation. However, recent approaches have not adequately exploited these indispensable effective descriptors, resulting in a limited understanding of histological representation and suboptimal subtyping performance. To mitigate these limitations, ARGUS is proposed to advance histological subtyping in liver cancer by capturing the macro-meso-micro hierarchical information within the TME. Specifically, we first construct a micro-geometry feature to represent fine-grained cell-level patterns via a geometric structure across nuclei, thereby providing a more refined and precise perspective for delineating pathological images. Then, a Hierarchical Field-of-Views (FoVs) Alignment module is designed to model macro- and meso-level hierarchical interactions inherent in WSIs. Finally, the augmented micro-geometry and FoVs features are fused into a joint representation via the proposed Geometry Prior Guided Fusion strategy for modeling holistic phenotype interactions. Extensive experiments on public and private cohorts demonstrate that ARGUS achieves state-of-the-art (SOTA) performance in histological subtyping of liver cancer, providing an effective diagnostic tool for primary liver malignancies in clinical practice.

[77] When and How to Cut Classical Concerts? A Multimodal Automated Video Editing Approach

Daniel Gonzálbez-Biosca,Josep Cabacas-Maso,Carles Ventura,Ismael Benito-Altamirano

Main category: cs.CV

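TL;DR: The paper proposes a multimodal automated video editing approach for multicamera recordings of classical music concerts, addressing two questions: when to cut and how to cut.

Details Motivation: Automated video editing is underexplored in computer vision and multimedia research, and editing classical concerts is a challenging task that requires joint processing of multimodal data (audio and video).

Contribution: 1. Proposes a novel multimodal architecture for the temporal segmentation task (when to cut). 2. Improves the spatial selection task (how to cut) by adopting a CLIP-based encoder and restricting the range from which distractors are selected. 3. Builds a pseudo-labeled dataset by automatically clustering raw video data.

Method: 1. Temporal segmentation: log-mel spectrogram features and an optional image embedding, combined through a lightweight convolutional-transformer architecture. 2. Spatial selection: the backbone is replaced with a CLIP-based encoder, and distractor selection is constrained.

Result: The models outperform previous baselines at detecting cut points and are competitive at visual shot selection, advancing multimodal automated video editing.

Insight: Multimodal data fusion and a lightweight architecture can handle complex video editing tasks effectively, especially those that need audio and visual information to work together.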

Abstract: Automated video editing remains an underexplored task in the computer vision and multimedia domains, especially when contrasted with the growing interest in video generation and scene understanding. In this work, we address the specific challenge of editing multicamera recordings of classical music concerts by decomposing the problem into two key sub-tasks: when to cut and how to cut. Building on recent literature, we propose a novel multimodal architecture for the temporal segmentation task (when to cut), which integrates log-mel spectrograms from the audio signals, plus an optional image embedding, and scalar temporal features through a lightweight convolutional-transformer pipeline. For the spatial selection task (how to cut), we improve on prior work by replacing older backbones, e.g. ResNet, with a CLIP-based encoder and constraining distractor selection to segments from the same concert. Our dataset was constructed following a pseudo-labeling approach, in which raw video data was automatically clustered into coherent shot segments. We show that our models outperform previous baselines in detecting cut points and provide competitive visual shot selection, advancing the state of the art in multimodal automated video editing.

[78] Development and Validation of a Low-Cost Imaging System for Seedling Germination Kinetics through Time-Cumulative Analysis

M. Torrente,A. Follador,A. Calcante,P. Casati,R. Oberti

Main category: cs.CV

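TL;DR: The study develops a low-cost imaging system for analyzing the germination dynamics of lettuce seeds under R. solani infection. Using time-cumulative analysis and an image processing pipeline, the method achieves accurate seedling counting and vigor assessment under complex conditions.

Details Motivation: Traditional image analysis struggles to count seedlings and assess vigor accurately when seedlings overlap or intertwine, especially under pathogen infection. The study aims to develop a low-cost, efficient solution that captures germination kinetics and provides reliable data.

Contribution: 1. Proposes a new image analysis pipeline that combines morphological and spatial features; 2. Introduces time-cumulative analysis, which improves identification accuracy under complex conditions; 3. Validates the feasibility of combining low-cost hardware with computational tools.

Method: Multiple cameras continuously capture the germination process. The image analysis algorithm integrates morphological and spatial features and accumulates information over time, which resolves the identification of overlapping leaves.

Result: The method achieves a coefficient of determination of 0.98 and an RMSE of 1.12, giving accurate seedling counts and vigor assessment, including under R. solani infection.

Insight: Time-cumulative analysis combined with multi-feature integration is an effective strategy for plant phenotyping under complex conditions, and pairing low-cost hardware with algorithms makes large-scale studies more practical.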

Abstract: The study investigates the effects of R. solani inoculation on the germination and early development of Lactuca sativa L. seeds using a low-cost, image-based monitoring system. Multiple cameras were deployed to continuously capture images of the germination process in both infected and control groups. The objective was to assess the impact of the pathogen by analyzing germination dynamics and growth over time. To achieve this, a novel image analysis pipeline was developed. The algorithm integrates both morphological and spatial features to identify and quantify individual seedlings, even under complex conditions where traditional image analysis fails. A key innovation of the method lies in its temporal integration: each analysis step considers not only the current status of the seedlings but also their development across prior time points. This approach enables robust discrimination of individual seedlings, especially when overlapping leaves significantly hinder object separation. The method demonstrated high accuracy in seedling counting and vigor assessment, even in challenging scenarios characterized by dense and intertwined growth. Results confirm that R. solani infection significantly reduces germination rates and early seedling vigor. The study also validates the feasibility of combining low-cost imaging hardware with advanced computational tools to obtain phenotyping data in a non-destructive and scalable manner. The temporal integration enabled accurate quantification of germinated seeds and precise determination of seedling emergence timing. This approach proved particularly effective in later stages of the experiment, where conventional segmentation techniques failed due to overlapping or intertwined seedlings, making accurate counting difficult. The method achieved a coefficient of determination of 0.98 and a root mean square error (RMSE) of 1.12, demonstrating its robustness and reliability.
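For readers unfamiliar with the two reported metrics, the sketch below computes RMSE and the coefficient of determination for automated counts against manual counts. It assumes the common definition R² = 1 - SS_res / SS_tot; the abstract does not state which definition was used, and the counts here are invented for illustration.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error between reference and predicted values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return math.sqrt(
        sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    )


def r_squared(y_true, y_pred):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when the reference values are constant")
    return 1 - ss_res / ss_tot


manual_counts = [10, 20, 30, 40]      # hypothetical manual seedling counts
automated_counts = [11, 19, 31, 39]   # hypothetical automated counts
error = rmse(manual_counts, automated_counts)
fit = r_squared(manual_counts, automated_counts)
```

RMSE is expressed in the unit of the measured quantity, while R² is unitless and describes the share of variance in the manual counts that the automated counts explain.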

[79] Context Matters: Learning Global Semantics for Visual Reasoning and Comprehension

Jike Zhong,Yuxiang Lai,Xiaofeng Yang,Konstantinos Psounis

Main category: cs.CV

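TL;DR: The paper proposes a semantics-grounded visual encoding approach that treats visual objects as the analogue of words in language and uses masked image modeling (MIM) to learn global semantics and context, improving visual reasoning and comprehension.

Details Motivation: Current vision models such as ViT lag behind language models in semantic and contextual understanding, mainly because ViT training lacks semantic guidance. The paper tries to narrow this gap by designing a semantic-grounded objective.

Contribution: The main contributions are: (1) treating visual objects as semantic units analogous to words in language; (2) demonstrating the advantage of object-level representation within the MIM framework; (3) showing on multimodal tasks such as VQA that semantic encoding improves reasoning ability.

Method: Uses the MIM framework but, unlike conventional random masking, masks visual objects rather than image patches, which forces the model to learn global semantics and relationships between objects.

Result: Experiments show that object-level representation helps the model learn a real-world distribution and avoids pixel-averaging shortcuts. Performance improves on multimodal tasks (VQA, GQA, ScienceQA).

Insight: Object-level semantic encoding is a key direction for improving the reasoning and contextual understanding of vision models and offers guidance for designing future vision encoders and tokenizers.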

Abstract: Recent advances in language modeling have witnessed the rise of highly desirable emergent capabilities, such as reasoning and in-context learning. However, vision models have yet to exhibit comparable progress in these areas. In this paper, we argue that this gap could stem from the lack of semantic and contextual guidance in current vision transformer (ViT) training schemes, and such a gap can be narrowed through the design of a semantic-grounded objective. Specifically, we notice that individual words in natural language are inherently semantic, and modeling directly on word tokens naturally learns a realistic distribution. In contrast, ViTs rely on spatial patchification, which inevitably lacks semantic information. To bridge this gap, we propose to directly model "object" as the visual equivalent of "word," pushing the model to learn the global context and semantics among visual elements. We investigate our hypotheses via masked image modeling (MIM), a framework where our approach can be readily tested by applying masks to visual objects rather than random patches. Considerable evidence from qualitative and quantitative evaluations reveals a key finding: object-level representation alone helps to learn a real-world distribution, whereas pixel-averaging shortcuts are often learned without it. Moreover, further evaluations with multimodal LLMs (MLLM) on visual question answering (VQA, GQA, ScienceQA) tasks demonstrate the strong reasoning and contextual understanding gained with this simple objective. We hope our study highlights the effectiveness of object-level encoding and provides a plausible direction for developing stronger vision encoders and tokenizers. Code and model will be publicly released. Keywords: Semantic Visual Tokenizer, Vision Reasoning, In-context Learning, Multimodal Reasoning
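Masking visual objects rather than random patches can be illustrated with a minimal sketch. It assumes a per-pixel segmentation map with integer object ids is already available and masks every patch that contains at least one pixel of the chosen object; the function name, the patch rule, and the toy map are assumptions, and the paper's masking procedure may differ.

```python
def object_patch_mask(seg_map, object_id, patch_size):
    """Return a grid of booleans with one entry per image patch. An entry is
    True when the patch contains at least one pixel labeled object_id, so
    that the whole object is hidden from the model."""
    height = len(seg_map)
    width = len(seg_map[0])
    mask = []
    for top in range(0, height, patch_size):
        row = []
        for left in range(0, width, patch_size):
            covered = any(
                seg_map[y][x] == object_id
                for y in range(top, min(top + patch_size, height))
                for x in range(left, min(left + patch_size, width))
            )
            row.append(covered)
        mask.append(row)
    return mask


# Toy 4x4 segmentation map: object 1 in the top-left, object 2 in one corner pixel.
seg_map = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 2],
]
mask_obj1 = object_patch_mask(seg_map, object_id=1, patch_size=2)
mask_obj2 = object_patch_mask(seg_map, object_id=2, patch_size=2)
```

With random patch masking, part of an object usually stays visible and can be filled in by local interpolation; hiding the full object removes that shortcut and leaves only the surrounding context as evidence.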

[80] AgeBooth: Controllable Facial Aging and Rejuvenation via Diffusion Models

Shihao Zhu,Bohan Cao,Ziheng Ouyang,Zhen Li,Peng-Tao Jiang,Qibin Hou

Main category: cs.CV

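TL;DR: AgeBooth is a diffusion-based method that uses an age-specific fine-tuning strategy and a matrix fusion technique to generate high-quality faces across ages without relying on large amounts of age-labeled data.

Details Motivation: Existing diffusion models struggle to control age precisely while preserving identity when generating identity-consistent face images, and fine-tuning them usually requires costly paired cross-age data.

Contribution: Proposes AgeBooth, a novel age-specific fine-tuning method that strengthens the age control of adapter-based identity personalization models; introduces age-conditioned prompt blending and an SVDMix-based LoRA fusion strategy to reduce data dependence.

Method: Applies age-conditioned prompt blending and an SVDMix-based LoRA fusion strategy, exploiting the linear nature of aging to generate intermediate-age portraits.

Result: AgeBooth generates realistic, identity-consistent face images across ages from a single reference image, and experiments show better age control and visual quality than previous editing-based methods.

Insight: With matrix fusion and age-conditioned prompts, AgeBooth shows how limited data can be used efficiently for a complex cross-age generation task.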

Abstract: Recent diffusion model research focuses on generating identity-consistent images from a reference photo, but these models struggle to accurately control age while preserving identity, and fine-tuning them often requires costly paired images across ages. In this paper, we propose AgeBooth, a novel age-specific fine-tuning approach that can effectively enhance the age control capability of adapter-based identity personalization models without the need for expensive age-varied datasets. To reduce dependence on a large amount of age-labeled data, we exploit the linear nature of aging by introducing age-conditioned prompt blending and an age-specific LoRA fusion strategy that leverages SVDMix, a matrix fusion technique. These techniques enable high-quality generation of intermediate-age portraits. Our AgeBooth produces realistic and identity-consistent face images across different ages from a single reference image. Experiments show that AgeBooth achieves superior age control and visual quality compared to previous state-of-the-art editing-based methods.

[81] Data Factory with Minimal Human Effort Using VLMs

Jiaojiao Ye,Jiaxing Zhong,Qian Xie,Yuzhou Zhou,Niki Trigoni,Andrew Markham

Main category: cs.CV

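TL;DR: The paper proposes a training-free synthetic data generation method that combines ControlNet with vision-language models (VLMs) to automatically generate images with pixel-level labels, improving downstream task performance.

Details Motivation: Traditional augmentation methods struggle to manipulate high-level semantic attributes, while existing diffusion-based methods are either computationally expensive or compromise on performance. The paper aims to address both problems.

Contribution: 1. Proposes a training-free pipeline that generates high-quality synthetic data; 2. Designs a Multi-way Prompt Generator, a Mask Generator, and a High-quality Image Selection module to improve diversity and fidelity.

Method: Combines pretrained ControlNet and VLMs, generating synthetic data through text-to-image or image-to-image transformation.

Result: Shows promising results on PASCAL-5i and COCO-20i and outperforms concurrent work on one-shot semantic segmentation.

Insight: Combining VLMs with ControlNet offers an efficient, low-cost way to generate annotated data automatically.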

Abstract: Generating enough and diverse data through augmentation offers an efficient solution to the time-consuming and labour-intensive process of collecting and annotating pixel-wise images. Traditional data augmentation techniques often face challenges in manipulating high-level semantic attributes, such as materials and textures. In contrast, diffusion models offer a robust alternative, by effectively utilizing text-to-image or image-to-image transformation. However, existing diffusion-based methods are either computationally expensive or compromise on performance. To address this issue, we introduce a novel training-free pipeline that integrates pretrained ControlNet and Vision-Language Models (VLMs) to generate synthetic images paired with pixel-level labels. This approach eliminates the need for manual annotations and significantly improves downstream tasks. To improve the fidelity and diversity, we add a Multi-way Prompt Generator, Mask Generator and High-quality Image Selection module. Our results on PASCAL-5i and COCO-20i present promising performance and outperform concurrent work for one-shot semantic segmentation.

[82] ALISE: Annotation-Free LiDAR Instance Segmentation for Autonomous Driving

Yongxuan Lyu,Guangfeng Jiang,Hongsi Liu,Jun Liu

Main category: cs.CV

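TL;DR: ALISE is an annotation-free LiDAR instance segmentation framework. It generates high-quality pseudo-labels with vision foundation models and a spatio-temporal voting module, adds 2D and 3D semantic supervision, and sets a new state of the art for unsupervised 3D instance segmentation, even outperforming MWSIS, a method supervised with ground-truth 2D bounding boxes.

Details Motivation: Manually annotating LiDAR point clouds is costly and time-consuming, and existing methods still rely on some human labeling. ALISE aims to remove the dependence on annotations entirely and perform instance segmentation without supervision.

Contribution: 1) Proposes ALISE, a fully unsupervised framework; 2) generates pseudo-labels with vision foundation models and a spatio-temporal voting module; 3) introduces 2D prior-based losses and a prototype-based contrastive loss to improve feature learning; 4) outperforms MWSIS, a method supervised with ground-truth 2D bounding boxes.

Method: 1) Vision foundation models, guided by text and images, produce initial pseudo-labels; 2) a spatio-temporal voting module refines the labels; 3) 2D prior-based losses inject visual knowledge into the 3D network; 4) a prototype-based contrastive loss strengthens 3D semantic consistency.

Result: ALISE sets a new state of the art for unsupervised 3D instance segmentation with 50.95% mAP, exceeding the 48.42% of MWSIS, which is supervised with ground-truth 2D bounding boxes.

Insight: By combining multimodal information with carefully designed losses, unsupervised methods can match or exceed supervised performance, which offers a new way to tackle the annotation bottleneck in autonomous driving.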

Abstract: The manual annotation of outdoor LiDAR point clouds for instance segmentation is extremely costly and time-consuming. Current methods attempt to reduce this burden but still rely on some form of human labeling. To completely eliminate this dependency, we introduce ALISE, a novel framework that performs LiDAR instance segmentation without any annotations. The central challenge is to generate high-quality pseudo-labels in a fully unsupervised manner. Our approach starts by employing Vision Foundation Models (VFMs), guided by text and images, to produce initial pseudo-labels. We then refine these labels through a dedicated spatio-temporal voting module, which combines 2D and 3D semantics for both offline and online optimization. To achieve superior feature learning, we further introduce two forms of semantic supervision: a set of 2D prior-based losses that inject visual knowledge into the 3D network, and a novel prototype-based contrastive loss that builds a discriminative feature space by exploiting 3D semantic consistency. This comprehensive design results in significant performance gains, establishing a new state-of-the-art for unsupervised 3D instance segmentation. Remarkably, our approach even outperforms MWSIS, a method that operates with supervision from ground-truth (GT) 2D bounding boxes by a margin of 2.53% in mAP (50.95% vs. 48.42%).
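
The abstract does not spell out how the spatio-temporal voting module works. As a rough intuition only, label refinement by voting can be sketched as a per-point majority vote over the pseudo-labels that the same physical point receives in several frames. The function name, the `ignore` convention, and the assumption that points are already associated across frames are all hypothetical.

```python
import numpy as np

def majority_vote(labels, num_classes, ignore=-1):
    """labels: (T, N) int array, the pseudo-label of each of N points in T frames.

    Returns one label per point: the most frequent valid label, or `ignore`
    when the point never received a valid label.
    """
    n_points = labels.shape[1]
    counts = np.zeros((n_points, num_classes), dtype=int)
    for frame in labels:
        valid = frame != ignore
        # Unbuffered add, so repeated (point, class) pairs are counted correctly.
        np.add.at(counts, (np.nonzero(valid)[0], frame[valid]), 1)
    voted = counts.argmax(axis=1)
    voted[counts.sum(axis=1) == 0] = ignore
    return voted

# Three frames, three points; the last point is never labelled.
frames = np.array([[0, 1, -1],
                   [0, 2, -1],
                   [1, 1, -1]])
refined = majority_vote(frames, num_classes=3)
```

Here the isolated disagreements (label 1 for the first point, label 2 for the second) are voted out, which is the kind of noise suppression a refinement stage is meant to provide.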

Zexin Zheng,Huangyu Dai,Lingtao Mao,Xinyu Sun,Zihan Liang,Ben Chen,Yuqing Ding,Chenyi Lei,Wenwu Ou,Han Li,Kun Gai

Main category: cs.CV

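TL;DR: OneVision proposes an end-to-end generative framework that uses multi-view semantic alignment to resolve the representation discrepancy of traditional multi-stage cascading architectures in vision search, with significant gains in efficiency and conversion.

Details Motivation: Traditional vision search follows the multi-stage cascading architecture (MCA), but the discrepancy between the query image's representations across stages conflicts with the optimization objectives, making it hard to reach Pareto optimality in both user experience and conversion.

Contribution: Proposes the OneVision framework, which aligns multi-view representations with VRQ encoding and incorporates personalization through a multi-stage semantic alignment scheme, enabling efficient and personalized vision search.

Method: Uses VRQ (vision-aligned residual quantization encoding) to align multi-view representations, combined with dynamic pruning and multi-stage semantic alignment, to unify retrieval and personalized generation.

Result: In offline evaluation, performance is on par with the online MCA while inference efficiency improves by 21%; in online A/B tests, item CTR rises by 2.15%, CVR by 2.27%, and order volume by 3.12%.

Insight: A generative architecture can unify retrieval and personalization, simplifying the serving pathway while improving efficiency and user experience.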

Abstract: Traditional vision search, similar to search and recommendation systems, follows the multi-stage cascading architecture (MCA) paradigm to balance efficiency and conversion. Specifically, the query image undergoes feature extraction, recall, pre-ranking, and ranking stages, ultimately presenting the user with semantically similar products that meet their preferences. This multi-view representation discrepancy of the same object in the query and the optimization objective collide across these stages, making it difficult to achieve Pareto optimality in both user experience and conversion. In this paper, an end-to-end generative framework, OneVision, is proposed to address these problems. OneVision builds on VRQ, a vision-aligned residual quantization encoding, which can align the vastly different representations of an object across multiple viewpoints while preserving the distinctive features of each product as much as possible. Then a multi-stage semantic alignment scheme is adopted to maintain strong visual similarity priors while effectively incorporating user-specific information for personalized preference generation. In offline evaluations, OneVision performs on par with online MCA, while improving inference efficiency by 21% through dynamic pruning. In A/B tests, it achieves significant online improvements: +2.15% item CTR, +2.27% CVR, and +3.12% order volume. These results demonstrate that a semantic ID centric, generative architecture can unify retrieval and personalization while simplifying the serving pathway.
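
VRQ builds on residual quantization, which turns an embedding into a short sequence of codebook indices (a semantic ID). The sketch below shows only plain residual quantization with tiny hand-written codebooks; the vision-alignment part of VRQ is not described in enough detail to reproduce, and the function names are hypothetical.

```python
import numpy as np

def residual_quantize(x, codebooks):
    """Encode x as one codeword index per level; each level quantizes what is left over."""
    residual = np.asarray(x, dtype=float).copy()
    codes = []
    for cb in codebooks:                      # cb has shape (K, D)
        idx = int(np.argmin(((cb - residual) ** 2).sum(axis=1)))
        codes.append(idx)
        residual = residual - cb[idx]
    return codes, residual

def reconstruct(codes, codebooks):
    """Sum the selected codewords to approximate the original vector."""
    return sum(cb[i] for cb, i in zip(codebooks, codes))

# Two levels: a coarse codebook, then a finer one for the remaining error.
codebooks = [
    np.array([[0.0, 0.0], [1.0, 1.0]]),
    np.array([[0.0, 0.0], [0.25, -0.25]]),
]
x = np.array([1.2, 0.8])
codes, residual = residual_quantize(x, codebooks)
approx = reconstruct(codes, codebooks)
```

The index sequence in `codes` is what a generative retriever would be trained to emit. Each extra level shrinks the residual, so items that share a prefix of codes are close in embedding space.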

[84] Flow4Agent: Long-form Video Understanding via Motion Prior from Optical Flow

Ruyang Liu,Shangkun Sun,Haoran Tang,Ge Li,Wei Gao

Main category: cs.CV

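TL;DR: Flow4Agent uses motion priors from optical flow to improve long-form video understanding, reducing redundancy through Temporal Granularity Optimization and Motion Token Pruning.

Details Motivation: Addresses temporal and spatial redundancy in long-video understanding and eases the context-length limits of multimodal large language models.

Contribution: Proposes the Flow4Agent framework, which the authors describe as the first to bring optical-flow motion priors into LLM-based long-video understanding, and uses them for both frame selection and token pruning.

Method: Two core modules, Temporal Granularity Optimization (TGO) and Motion Token Pruning (MTP), reduce redundancy at the temporal and spatial levels, respectively.

Result: Outperforms existing methods on multiple video MLLM benchmarks, reaching 64.7% on Video-MME, 71.4% on MLVU, and 60.4% on LongVideoBench.

Insight: Motion priors (optical flow) can substantially reduce redundant information in long videos and improve multimodal model performance.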

Abstract: Long-form video understanding has always been a challenging problem due to the significant redundancy in both temporal and spatial contents. This challenge is further exacerbated by the limited context length of Multimodal Large Language Models (MLLMs). To address this issue, many previous works have attempted to extract key video information, where the “key” is typically semantic-aware and heavily dependent on the CLIP model as prior. In this paper, we propose Flow4Agent, a novel framework that pioneeringly incorporates motion priors from optical flow to facilitate LLM-based long video understanding. Flow4Agent mitigates the redundancy in long videos at both temporal and spatial levels through two core modules. Temporal Granularity Optimization (TGO) adaptively refines frame-level hierarchies: it first leverages coarse flow priors to group similar visual contents and then applies semantic priors to filter out highly irrelevant scene information. Motion Token Pruning (MTP) further refines the intra-frame visual representations, pruning high-redundancy video tokens using fine-grained optical flow information. Extensive experiments demonstrate that our Flow4Agent outperforms existing methods across a wide range of video MLLM benchmarks, especially for hour-level video understanding tasks, achieving 64.7% on Video-MME, 71.4% on MLVU and 60.4% on LongVideoBench.
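
One plausible reading of Motion Token Pruning is that tokens covering regions with little motion carry redundant information and can be dropped. The sketch below makes that assumption explicit: it keeps the tokens with the largest flow magnitude. The `keep_ratio` parameter and the function name are hypothetical, and the paper's actual pruning criterion may differ.

```python
import numpy as np

def prune_tokens_by_flow(tokens, flow, keep_ratio):
    """tokens: (N, D) visual tokens; flow: (N, 2) mean optical flow (dx, dy) per token.

    Keeps the ceil-rounded top fraction of tokens by flow magnitude and
    returns them in their original order together with their indices.
    """
    magnitude = np.linalg.norm(flow, axis=1)
    k = max(1, int(round(keep_ratio * len(tokens))))
    top = np.argsort(-magnitude, kind="stable")[:k]
    keep = np.sort(top)                      # restore spatial order
    return tokens[keep], keep

tokens = np.arange(8, dtype=float).reshape(4, 2)
flow = np.array([[0.0, 0.0], [3.0, 4.0], [0.0, 1.0], [1.0, 0.0]])
kept_tokens, kept_idx = prune_tokens_by_flow(tokens, flow, keep_ratio=0.5)
```

Sorting the kept indices preserves the original token order, which matters when the language model relies on positional information.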

[85] BioAutoML-NAS: An End-to-End AutoML Framework for Multimodal Insect Classification via Neural Architecture Search on Large-Scale Biodiversity Data

Arefin Ittesafun Abian,Debopom Sutradhar,Md Rafi Ur Rashid,Reem E. Mohamed,Md Rafiqul Islam,Asif Karim,Kheng Cher Yeo,Sami Azam

Main category: cs.CV

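TL;DR: BioAutoML-NAS is an end-to-end AutoML framework that classifies insects using neural architecture search (NAS) and multimodal data fusion. It performs strongly on large-scale biodiversity datasets and outperforms existing methods by a clear margin.

Details Motivation: Insect classification matters for agricultural management and ecological research, but the task is challenging because of the complex characteristics of insects, class imbalance, and large-scale datasets.

Contribution: 1. Proposes BioAutoML-NAS, described as the first BioAutoML model to use multimodal data (images and metadata). 2. Uses NAS to automatically learn the best operations for image feature extraction. 3. Designs a multimodal fusion module that combines visual and biological information for classification.

Method: 1. NAS automatically designs the network architecture that extracts image features. 2. A multimodal fusion module combines image embeddings with metadata. 3. An alternating bi-level optimization strategy jointly updates network weights and architecture parameters, and zero operations prune less important connections for efficiency.

Result: On BIOSCAN-5M, the model reaches 96.81% accuracy, 97.46% precision, 96.81% recall, and a 97.05% F1 score, outperforming state-of-the-art transfer learning, transformer, AutoML, and NAS methods by approximately 16%, 10%, and 8%. On Insects-1M, it reaches 93.25% accuracy, 93.71% precision, 92.74% recall, and a 93.22% F1 score.

Insight: Combining multimodal data with NAS markedly improves insect classification and provides an efficient tool in support of sustainable agriculture.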

Abstract: Insect classification is important for agricultural management and ecological research, as it directly affects crop health and production. However, this task remains challenging due to the complex characteristics of insects, class imbalance, and large-scale datasets. To address these issues, we propose BioAutoML-NAS, the first BioAutoML model using multimodal data, including images, and metadata, which applies neural architecture search (NAS) for images to automatically learn the best operations for each connection within each cell. Multiple cells are stacked to form the full network, each extracting detailed image feature representations. A multimodal fusion module combines image embeddings with metadata, allowing the model to use both visual and categorical biological information to classify insects. An alternating bi-level optimization training strategy jointly updates network weights and architecture parameters, while zero operations remove less important connections, producing sparse, efficient, and high-performing architectures. Extensive evaluation on the BIOSCAN-5M dataset demonstrates that BioAutoML-NAS achieves 96.81% accuracy, 97.46% precision, 96.81% recall, and a 97.05% F1 score, outperforming state-of-the-art transfer learning, transformer, AutoML, and NAS methods by approximately 16%, 10%, and 8% respectively. Further validation on the Insects-1M dataset obtains 93.25% accuracy, 93.71% precision, 92.74% recall, and a 93.22% F1 score. These results demonstrate that BioAutoML-NAS provides accurate, confident insect classification that supports modern sustainable farming.
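
Learning "the best operation for each connection", with a zero operation that can remove a connection, resembles the mixed-operation relaxation used in differentiable NAS. The abstract does not name a specific search algorithm, so the sketch below is an assumption: candidate operations are blended with softmax weights over architecture parameters `alpha`, and the final architecture keeps the strongest one. The operation set is a toy stand-in.

```python
import numpy as np

# Toy candidate operations for one connection; "double" stands in for a learned
# operation such as a convolution.
OPS = {
    "zero": lambda x: np.zeros_like(x),
    "identity": lambda x: x,
    "double": lambda x: 2.0 * x,
}

def softmax(a):
    e = np.exp(a - np.max(a))
    return e / e.sum()

def mixed_op(x, alpha):
    """Search phase: output is the softmax-weighted sum of all candidate operations."""
    weights = softmax(np.asarray(alpha, dtype=float))
    return sum(w * op(x) for w, op in zip(weights, OPS.values()))

def discretize(alpha):
    """After search: keep the strongest operation; None means the connection is pruned."""
    name = list(OPS)[int(np.argmax(alpha))]
    return None if name == "zero" else name

x = np.array([1.0, 2.0])
blended = mixed_op(x, alpha=[0.0, 0.0, 0.0])   # equal weights: (0 + x + 2x) / 3
```

In a bi-level scheme, network weights and `alpha` would be updated in alternation on different data splits. A connection whose `alpha` favours `zero` is dropped, which yields the sparse architectures the abstract mentions.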

[86] Shaken or Stirred? An Analysis of MetaFormer’s Token Mixing for Medical Imaging

Ron Keuth,Paul Kaftan,Mattias P. Heinrich

Main category: cs.CV

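TL;DR: The paper presents the first comprehensive study of token mixers within the MetaFormer architecture for medical imaging tasks. Its experiments show that low-complexity token mixers suffice for classification, while convolutional token mixers are essential for segmentation.

Details Motivation: Although MetaFormer has been successful in computer vision, it is rarely used or studied in medical imaging, and existing work seldom compares different token mixers. The paper aims to fill this gap and identify the token mixer designs best suited to medical imaging tasks.

Contribution: 1. First systematic analysis of different token mixers within MetaFormer on medical imaging tasks; 2. shows that low-complexity token mixers are sufficient for classification; 3. shows that convolutional token mixers are essential for segmentation; 4. examines transferring weights pretrained on natural images to new token mixers.

Method: 1. Evaluates pooling-, convolution-, and attention-based token mixers within the MetaFormer architecture; 2. covers eight medical imaging datasets spanning classification (global prediction) and segmentation (dense prediction); 3. analyzes the effect of transferring pretrained weights.

Result: 1. For classification, low-complexity token mixers (e.g., grouped convolution or pooling) are sufficient, in line with findings on natural images; 2. for segmentation, the local inductive bias of convolutional token mixers is essential, and grouped convolutions are the preferred choice because they reduce runtime and parameter count; 3. pretrained weights remain useful with new token mixers.

Insight: 1. In medical imaging, MetaFormer's success depends more on its general architecture than on the complexity of the token mixer; 2. the advantage of convolutional token mixers in segmentation underlines the importance of local inductive bias; 3. the transferability of pretrained weights suggests that cross-domain knowledge transfer still has potential in medical imaging.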

Abstract: The generalization of the Transformer architecture via MetaFormer has reshaped our understanding of its success in computer vision. By replacing self-attention with simpler token mixers, MetaFormer provides strong baselines for vision tasks. However, while extensively studied on natural image datasets, its use in medical imaging remains scarce, and existing works rarely compare different token mixers, potentially overlooking more suitable design choices. In this work, we present the first comprehensive study of token mixers for medical imaging. We systematically analyze pooling-, convolution-, and attention-based token mixers within the MetaFormer architecture on image classification (global prediction task) and semantic segmentation (dense prediction task). Our evaluation spans eight datasets covering diverse modalities and common challenges in the medical domain. Given the prevalence of pretraining from natural images to mitigate medical data scarcity, we also examine transferring pretrained weights to new token mixers. Our results show that, for classification, low-complexity token mixers (e.g. grouped convolution or pooling) are sufficient, aligning with findings on natural images. Pretrained weights remain useful despite the domain gap introduced by the new token mixer. For segmentation, we find that the local inductive bias of convolutional token mixers is essential. Grouped convolutions emerge as the preferred choice, as they reduce runtime and parameter count compared to standard convolutions, while the MetaFormer’s channel-MLPs already provide the necessary cross-channel interactions. Our code is available on GitHub.
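
The parameter saving from grouped convolutions follows directly from the layer's weight shape: each output channel only sees `c_in / groups` input channels. The short calculation below illustrates this for a 3x3 convolution; the channel count and group sizes are illustrative values, not taken from the paper.

```python
def conv2d_params(c_in, c_out, k, groups=1, bias=True):
    """Number of learnable parameters in a 2D convolution with a k x k kernel."""
    if c_in % groups or c_out % groups:
        raise ValueError("c_in and c_out must both be divisible by groups")
    weights = (c_in // groups) * c_out * k * k
    return weights + (c_out if bias else 0)

standard = conv2d_params(64, 64, 3)              # every output sees all 64 inputs
grouped = conv2d_params(64, 64, 3, groups=8)     # every output sees 8 inputs
depthwise = conv2d_params(64, 64, 3, groups=64)  # every output sees 1 input
```

With 64 channels the grouped layer has roughly an eighth of the weights of the standard one. The cross-channel mixing that grouping removes is, as the abstract notes, already supplied by MetaFormer's channel-MLPs.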

[87] Diffusion Models for Low-Light Image Enhancement: A Multi-Perspective Taxonomy and Performance Analysis

Eashan Adhikarla,Yixin Liu,Brian D. Davison

Main category: cs.CV

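TL;DR: This survey offers a multi-perspective taxonomy and performance analysis of diffusion-based low-light image enhancement (LLIE) methods, compares them with GAN- and Transformer-based methods, and discusses deployment challenges and future directions.

Details Motivation: LLIE is vital for safety-critical applications, and diffusion models have become a promising tool for it because they can model complex image distributions through iterative denoising.

Contribution: Proposes a multi-perspective taxonomy with six categories, compares diffusion models in depth with GAN- and Transformer-based methods, and discusses practical deployment challenges and ethical considerations.

Method: Analyzes diffusion models for LLIE through a multi-perspective taxonomy: Intrinsic Decomposition, Spectral & Latent, Accelerated, Guided, Multimodal, and Autonomous.

Result: Identifies qualitative failure modes, benchmark inconsistencies, and trade-offs between interpretability, generalization, and inference efficiency.

Insight: Diffusion models are promising for LLIE, but resource constraints and ethical issues in real-world deployment still need to be addressed; future work could draw on foundation models and novel conditioning signals.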

Abstract: Low-light image enhancement (LLIE) is vital for safety-critical applications such as surveillance, autonomous navigation, and medical imaging, where visibility degradation can impair downstream task performance. Recently, diffusion models have emerged as a promising generative paradigm for LLIE due to their capacity to model complex image distributions via iterative denoising. This survey provides an up-to-date critical analysis of diffusion models for LLIE, distinctively featuring an in-depth comparative performance evaluation against Generative Adversarial Network and Transformer-based state-of-the-art methods, a thorough examination of practical deployment challenges, and a forward-looking perspective on the role of emerging paradigms like foundation models. We propose a multi-perspective taxonomy encompassing six categories (Intrinsic Decomposition, Spectral & Latent, Accelerated, Guided, Multimodal, and Autonomous) that map enhancement methods across physical priors, conditioning schemes, and computational efficiency. Our taxonomy is grounded in a hybrid view of both the model mechanism and the conditioning signals. We evaluate qualitative failure modes, benchmark inconsistencies, and trade-offs between interpretability, generalization, and inference efficiency. We also discuss real-world deployment constraints (e.g., memory, energy use) and ethical considerations. This survey aims to guide the next generation of diffusion-based LLIE research by highlighting trends and surfacing open research questions, including novel conditioning, real-time adaptation, and the potential of foundation models.

[88] A Dynamic Mode Decomposition Approach to Morphological Component Analysis

Owen T. Huber,Raghu G. Raj,Tianyu Chen,Zacharie I. Idriss

Main category: cs.CV

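TL;DR: The paper proposes dynamic morphological component analysis (DMCA), a new method based on dynamic mode decomposition (DMD) for adaptive video representation and signal separation. By clustering DMD eigenvalues, DMCA learns data-driven dictionaries for MCA, and the paper demonstrates it on tasks such as video denoising and target enhancement.

Details Motivation: Conventional morphological component analysis (MCA) relies on predefined dictionaries, which limits its adaptability. The authors instead learn data-driven dictionaries by clustering DMD eigenvalues to improve signal separation.

Contribution: The main contribution is DMCA, which learns adaptive dictionaries by clustering DMD eigenvalues and thereby extends conventional MCA.

Method: 1) Extract the dynamics of a video with DMD; 2) cluster the DMD eigenvalues to obtain data-driven dictionaries; 3) combine the learned dictionaries with a sparsity prior for signal separation, yielding the DMCA algorithm.

Result: The paper demonstrates DMCA on video denoising (Adobe 240fps dataset), target enhancement (a faint target summed with a sea state), and target separation (a bicycle versus wind clutter in inverse synthetic aperture radar images).

Insight: DMCA's novelty lies in combining the dynamics captured by DMD with the MCA signal separation framework, highlighting the potential of data-driven dictionaries in complex signal processing.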

Abstract: This paper introduces a novel methodology of adapting the representation of videos based on the dynamics of their scene content variation. In particular, we demonstrate how the clustering of dynamic mode decomposition eigenvalues can be leveraged to learn an adaptive video representation for separating structurally distinct morphologies of a video. We extend the morphological component analysis (MCA) algorithm, which uses multiple predefined incoherent dictionaries and a sparsity prior to separate distinct sources in signals, by introducing our novel eigenspace clustering technique to obtain data-driven MCA dictionaries, which we call dynamic morphological component analysis (DMCA). After deriving our novel algorithm, we offer a motivational example of DMCA applied to a still image, then demonstrate DMCA’s effectiveness in denoising applications on videos from the Adobe 240fps dataset. Afterwards, we provide an example of DMCA enhancing the signal-to-noise ratio of a faint target summed with a sea state, and conclude the paper by applying DMCA to separate a bicycle from wind clutter in inverse synthetic aperture radar images.
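
To make the DMD step concrete, the sketch below runs the standard exact DMD algorithm on a synthetic "video" built from two known dynamics and then splits the modes by eigenvalue. The split rule (distance of the eigenvalue from 1) is a deliberately simple stand-in for the paper's eigenspace clustering, which the abstract does not specify; the function names and the tolerance are assumptions.

```python
import numpy as np

def dmd(snapshots, rank):
    """Exact DMD of a (pixels, frames) snapshot matrix, truncated to `rank` modes."""
    X, Y = snapshots[:, :-1], snapshots[:, 1:]
    U, s, Vh = np.linalg.svd(X, full_matrices=False)
    U, s, Vh = U[:, :rank], s[:rank], Vh[:rank]
    YV_sinv = Y @ Vh.conj().T / s            # Y V S^-1 (column j divided by s[j])
    A_tilde = U.conj().T @ YV_sinv           # operator projected onto the leading modes
    eigvals, W = np.linalg.eig(A_tilde)
    modes = YV_sinv @ W
    return eigvals, modes

def split_by_eigenvalue(eigvals, tol=0.1):
    """True for near-static modes (eigenvalue close to 1), False for the rest."""
    return np.abs(eigvals - 1.0) < tol

# Synthetic data: a static component (eigenvalue 1.0) plus a decaying one (0.5),
# mixed into 12 "pixels" over 6 frames.
rng = np.random.default_rng(0)
latent = np.array([[1.0 ** k, 0.5 ** k] for k in range(6)]).T
frames = rng.standard_normal((12, 2)) @ latent
eigvals, modes = dmd(frames, rank=2)
static = split_by_eigenvalue(eigvals)
```

Modes in each group could then serve as atoms of one dictionary, so that background-like and transient content are represented separately before the sparsity-based separation step.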

[89] Detection and Measurement of Hailstones with Multimodal Large Language Models

Moritz Alker,David C. Schedl,Andreas Stöckl

Main category: cs.CV

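TL;DR: The study uses pretrained multimodal large language models to detect hailstones and measure their diameters from social media and news images, and it evaluates a two-stage prompting strategy that improves reliability for most models.

Details Motivation: Conventional hail sensors are costly and have limited coverage, whereas social media images are an abundant, spatially dense source of information that multimodal models can use to assess severe weather events quickly.

Contribution: 1) Shows the potential of pretrained multimodal models for measuring hailstone diameter; 2) proposes a two-stage prompting strategy that uses reference objects (such as human hands) to improve reliability; 3) shows that the models can complement traditional sensors without fine-tuning.

Method: 1) Uses a dataset of 474 crowdsourced images of hail events in Austria; 2) compares one-stage and two-stage prompting for four models (the latter adds size cues from reference objects); 3) evaluates mean absolute error (MAE).

Result: The best model achieves a mean absolute error of 1.12 cm, and two-stage prompting improves the reliability of most models.

Insight: Pretrained multimodal models can be used for hail measurement directly, without fine-tuning; combined with automated real-time image harvesting, which remains an open task, the approach could be applied to future hail events.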

Abstract: This study examines the use of social media and news images to detect and measure hailstones, utilizing pre-trained multimodal large language models. The dataset for this study comprises 474 crowdsourced images of hailstones from documented hail events in Austria, which occurred between January 2022 and September 2024. These hailstones have maximum diameters ranging from 2 to 11cm. We estimate the hail diameters and compare four different models utilizing one-stage and two-stage prompting strategies. The latter utilizes additional size cues from reference objects, such as human hands, within the image. Our results show that pretrained models already have the potential to measure hailstone diameters from images with an average mean absolute error of 1.12cm for the best model. In comparison to a single-stage prompt, two-stage prompting improves the reliability of most models. Our study suggests that these off-the-shelf models, even without fine-tuning, can complement traditional hail sensors by extracting meaningful and spatially dense information from social media imagery, enabling faster and more detailed assessments of severe weather events. The automated real-time image harvesting from social media and other sources remains an open task, but it will make our approach directly applicable to future hail events.
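
Two pieces of arithmetic underlie this evaluation: scaling a pixel measurement by a reference object of known size, and the mean absolute error used to score predictions. In the study the scaling is done implicitly by the model through prompting, so the explicit formula below is only an illustration of the idea; all numbers are made up for the example.

```python
def diameter_from_reference(hail_px, ref_px, ref_cm):
    """Convert a hailstone's pixel diameter to cm using a reference object in the same image."""
    if ref_px <= 0:
        raise ValueError("reference size in pixels must be positive")
    return hail_px * ref_cm / ref_px

def mean_absolute_error(predicted, actual):
    """Average of |prediction - ground truth| over all images."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)

# Hypothetical values: a hailstone spanning 120 px next to a 240 px, 9 cm reference object.
estimate_cm = diameter_from_reference(hail_px=120, ref_px=240, ref_cm=9.0)
mae_cm = mean_absolute_error([3.0, 5.0, 8.0], [2.5, 6.0, 7.0])
```

The scaling assumes the hailstone and the reference object lie at a similar distance from the camera, which is one reason a hand holding the hailstone is a useful cue.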

[90] VideoMiner: Iteratively Grounding Key Frames of Hour-Long Videos via Tree-based Group Relative Policy Optimization

Xinye Cao,Hongcan Guo,Jiawen Qian,Guoshun Nan,Chao Wang,Yuqi Pan,Tianhao Hou,Xiaojuan Wang,Yutong Gao

Main category: cs.CV

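TL;DR: VideoMiner proposes a tree-based iterative method and a tree-based group relative policy optimization (T-GRPO) to address interference from redundant information and dynamic key-frame identification in long videos.

Details Motivation: Long videos contain a large amount of redundant information, so uniformly sampled frames overwhelm LLMs, and existing hierarchical key-frame extraction methods still struggle to adapt dynamically to complex structures and to identify key frames accurately.

Contribution: 1. Proposes VideoMiner, which iteratively segments, captions, and clusters long videos into a hierarchical tree; 2. designs T-GRPO, which integrates spatiotemporal information and question guidance to optimize key-frame selection dynamically.

Method: 1. Iteratively segment the video to build a tree structure; 2. introduce T-GRPO, a reinforcement learning method that guides key-frame exploration.

Result: Reports superior performance on all long-video understanding tasks evaluated; the model spontaneously generates reasoning chains, and the tree growth auxin dynamically adjusts the expansion depth.

Insight: T-GRPO not only improves accuracy but also encourages the model to generate reasoning chains; the tree design balances accuracy and efficiency.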

Abstract: Understanding hour-long videos with multi-modal large language models (MM-LLMs) enriches the landscape of human-centered AI applications. However, for end-to-end video understanding with LLMs, uniformly sampling video frames results in LLMs being overwhelmed by a vast amount of irrelevant information as video length increases. Existing hierarchical key frame extraction methods improve the accuracy of video understanding but still face two critical challenges. 1) How can the interference of extensive redundant information in long videos be mitigated? 2) How can a model dynamically adapt to complex hierarchical structures while accurately identifying key frames? To address these issues, we propose VideoMiner, which iteratively segments, captions, and clusters long videos, forming a hierarchical tree structure. The proposed VideoMiner progresses from long videos to events to frames while preserving temporal coherence, effectively addressing the first challenge. To precisely locate key frames, we introduce T-GRPO, a tree-based group relative policy optimization in reinforcement learning method that guides the exploration of the VideoMiner. The proposed T-GRPO is specifically designed for tree structures, integrating spatiotemporal information at the event level while being guided by the question, thus solving the second challenge. We achieve superior performance in all long-video understanding tasks and uncover several interesting insights. Our proposed T-GRPO surprisingly incentivizes the model to spontaneously generate a reasoning chain. Additionally, the designed tree growth auxin dynamically adjusts the expansion depth, obtaining accuracy and efficiency gains. The code is publicly available at https://github.com/caoxinye/VideoMiner.
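
T-GRPO builds on group relative policy optimization, whose core step is to score each sampled response against the other responses in its group rather than against a learned value function. The sketch below shows only that standard normalization; how T-GRPO adapts it to the tree structure is not described in the abstract, and the reward values are hypothetical.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples: (r - mean) / (std + eps)."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Four sampled decisions for the same question, rewarded 1 if the final answer is correct.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
no_signal = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
```

When every sample in a group earns the same reward the advantages are all zero, so such groups contribute no gradient. This is why reward designs that separate good from bad explorations matter.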

[91] GLVD: Guided Learned Vertex Descent

Pol Caselles Rico,Francesc Moreno Noguer

Main category: cs.CV

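TL;DR: GLVD is a hybrid optimization-and-learning method for reconstructing 3D faces from a few images. It combines per-vertex neural field optimization with guidance from dynamically predicted 3D keypoints to achieve high-quality, efficient reconstruction.

Details Motivation: Existing 3D face modeling methods are either constrained by fixed shape priors or computationally expensive. GLVD aims to combine the strengths of optimization-based and learning-based methods to improve both quality and efficiency.

Contribution: The main contribution is a hybrid method that combines per-vertex neural field optimization with guidance from dynamically predicted 3D keypoints, enabling high-quality face reconstruction without dense 3D supervision.

Method: GLVD extends LVD: using relative spatial encoding, it iteratively refines mesh vertices, while predicted 3D keypoints provide global structural guidance.

Result: GLVD achieves state-of-the-art performance in single-view settings and remains highly competitive in multi-view scenarios, while substantially reducing inference time.

Insight: GLVD shows the potential of combining local optimization with global guidance, achieving adaptable geometry reconstruction without relying on dense supervision.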

Abstract: Existing 3D face modeling methods usually depend on 3D Morphable Models, which inherently constrain the representation capacity to fixed shape priors. Optimization-based approaches offer high-quality reconstructions but tend to be computationally expensive. In this work, we introduce GLVD, a hybrid method for 3D face reconstruction from few-shot images that extends Learned Vertex Descent (LVD) by integrating per-vertex neural field optimization with global structural guidance from dynamically predicted 3D keypoints. By incorporating relative spatial encoding, GLVD iteratively refines mesh vertices without requiring dense 3D supervision. This enables expressive and adaptable geometry reconstruction while maintaining computational efficiency. GLVD achieves state-of-the-art performance in single-view settings and remains highly competitive in multi-view scenarios, all while substantially reducing inference time.

[92] Medical Vision Language Models as Policies for Robotic Surgery

Akshay Muppidi,Martin Radfar

Main category: cs.CV

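TL;DR: The paper proposes a simple approach that integrates MedFlamingo, a medical domain-specific vision-language model, with PPO to address high-dimensional input, sparse rewards, and feature extraction difficulties in vision-based robotic laparoscopic surgery tasks, and it outperforms the baselines across multiple task environments.

Details Motivation: Conventional vision-based PPO performs poorly on robotic laparoscopic surgery tasks, mainly because of high-dimensional visual input, sparse reward signals, and the difficulty of extracting task-relevant features from raw visual data. The paper brings in a medical domain-specific vision-language model to address these problems.

Contribution: The main contribution is an approach that combines MedFlamingo (a medical vision-language model) with PPO, improving both performance and convergence speed on robotic surgery tasks.

Method: MedFlamingo is combined with PPO: it processes task observations and instructions once per episode to generate high-level planning tokens, efficiently combining medical expertise with real-time visual feedback.

Result: Across five task environments in LapGym, MedFlamingo PPO outperforms the standard PPO and OpenFlamingo PPO baselines, with task success rates exceeding 70% and improvements over baseline ranging from 66.67% to 1114.29%.

Insight: Specialized medical knowledge is valuable for robotic surgical planning and decision-making, and incorporating a vision-language model can improve task performance.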

Abstract: Vision-based Proximal Policy Optimization (PPO) struggles with visual observation-based robotic laparoscopic surgical tasks due to the high-dimensional nature of visual input, the sparsity of rewards in surgical environments, and the difficulty of extracting task-relevant features from raw visual data. We introduce a simple approach integrating MedFlamingo, a medical domain-specific Vision-Language Model, with PPO. Our method is evaluated on five diverse laparoscopic surgery task environments in LapGym, using only endoscopic visual observations. MedFlamingo PPO outperforms and converges faster compared to both standard vision-based PPO and OpenFlamingo PPO baselines, achieving task success rates exceeding 70% across all environments, with improvements ranging from 66.67% to 1114.29% compared to baseline. By processing task observations and instructions once per episode to generate high-level planning tokens, our method efficiently combines medical expertise with real-time visual feedback. Our results highlight the value of specialized medical knowledge in robotic surgical planning and decision-making.

[93] Reasoning under Vision: Understanding Visual-Spatial Cognition in Vision-Language Models for CAPTCHA

Python Song,Luke Tenyi Chang,Yun-Yun Tsai,Penghui Li,Junfeng Yang

Main category: cs.CV

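TL;DR: The paper studies the spatial reasoning ability of vision-language models (VLMs) on CAPTCHA tasks, finds that step-by-step reasoning substantially improves performance, and introduces the CAPTCHA-X benchmark together with a reasoning-oriented evaluation framework.

Details Motivation: CAPTCHAs are a natural testbed for the spatial reasoning of VLMs, yet current commercial models perform poorly (about 21.9% accuracy), which points to limits on complex reasoning tasks.

Contribution: 1. Introduces CAPTCHA-X, a benchmark covering seven categories of CAPTCHAs with step-by-step action solutions and grounding annotations; 2. defines five reasoning-oriented evaluation metrics; 3. proposes a general agentic VLM-based framework that substantially improves performance (83.9% average accuracy).

Method: Requires the model to reason step by step before generating the final coordinates, and proposes a VLM-based framework that incorporates the model's reasoning ability.

Result: Across five high-difficulty CAPTCHA types, the method reaches an average solving accuracy of 83.9%, far above the baselines.

Insight: Step-by-step reasoning is key to improving the spatial reasoning of VLMs, and current models still have considerable room for improvement on such tasks.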

Abstract: CAPTCHA, originally designed to distinguish humans from robots, has evolved into a real-world benchmark for assessing the spatial reasoning capabilities of vision-language models. In this work, we first show that step-by-step reasoning is crucial for vision-language models (VLMs) to solve CAPTCHAs, which represent high-difficulty spatial reasoning tasks, and that current commercial vision-language models still struggle with such reasoning. In particular, we observe that most commercial VLMs (e.g., Gemini, Claude, GPT, etc.) fail to effectively solve CAPTCHAs and thus achieve low accuracy (around 21.9 percent). However, our findings indicate that requiring the model to perform step-by-step reasoning before generating the final coordinates can significantly enhance its solving accuracy, underscoring the severity of the gap. To systematically study this issue, we introduce CAPTCHA-X, the first real-world CAPTCHA benchmark with reasoning, covering seven categories of CAPTCHAs (such as Gobang, hCaptcha, etc.) with step-by-step action solutions and grounding annotations. We further define five reasoning-oriented metrics that enable a comprehensive evaluation of models' reasoning capabilities. To validate the effectiveness of reasoning, we also propose a general agentic VLM-based framework that incorporates the model's inherent reasoning abilities. Our method achieves state-of-the-art performance across five high-difficulty CAPTCHA types, with an average solving accuracy of 83.9 percent, substantially surpassing existing baselines. These results reveal the limitations of current models and highlight the importance of reasoning in advancing visual-spatial challenges in the future.

[94] There is More to Attention: Statistical Filtering Enhances Explanations in Vision Transformers

Meghna P Ayyar,Jenny Benois-Pineau,Akka Zemmari

Main category: cs.CV

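TL;DR: The paper proposes combining attention with statistical filtering to produce clearer explanation maps for Vision Transformers. By removing noisy and uninformative patterns, the method improves the faithfulness and interpretability of explanations, and it is validated on multiple datasets.

Details Motivation: Existing ViT explanation methods often rely on attention weights, which tend to be noisy and yield unclear maps. The authors argue that attention remains a valuable signal, provided the noise is filtered out.

Contribution: Proposes combining attention maps with a statistical filtering method to produce sharper explanation maps, adds a class-specific variant for discriminative explanations, and uses human gaze data to assess how well the maps align with human perception.

Method: Combines attention maps with a statistical filtering technique, initially proposed for CNNs, to remove noisy patterns; extends it with a class-specific variant; and evaluates explanations with perturbation-based metrics and human gaze data.

Result: The resulting maps are sharper and more faithful, and across multiple datasets the method outperforms or matches existing methods while remaining efficient and human-plausible.

Insight: Attention is still a valuable signal for explaining ViTs, but it needs statistical filtering to be useful. Human interpretability is an essential evaluation dimension for XAI.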

Abstract: Explainable AI (XAI) has become increasingly important with the rise of large transformer models, yet many explanation methods designed for CNNs transfer poorly to Vision Transformers (ViTs). Existing ViT explanations often rely on attention weights, which tend to yield noisy maps as they capture token-to-token interactions within each layer. While attribution methods incorporating MLP blocks have been proposed, we argue that attention remains a valuable and interpretable signal when properly filtered. We propose a method that combines attention maps with a statistical filtering method, initially proposed for CNNs, to remove noisy or uninformative patterns and produce more faithful explanations. We further extend our approach with a class-specific variant that yields discriminative explanations. Evaluation against popular state-of-the-art methods demonstrates that our approach produces sharper and more interpretable maps. In addition to perturbation-based faithfulness metrics, we incorporate human gaze data to assess alignment with human perception, arguing that human interpretability remains essential for XAI. Across multiple datasets, our approach consistently outperforms or is comparable to the SOTA methods while remaining efficient and human plausible.
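
The abstract does not say which statistical filter is used. A common choice for this kind of denoising, assumed here purely for illustration, is a k-sigma rule: keep only the attention values that exceed the map's mean by `k` standard deviations and zero out the rest. The function name and the default `k` are hypothetical.

```python
import numpy as np

def sigma_filter(attn, k=1.0):
    """Zero out attention values below mean + k * std of the map."""
    attn = np.asarray(attn, dtype=float)
    threshold = attn.mean() + k * attn.std()
    return np.where(attn >= threshold, attn, 0.0)

# Toy 2x2 attention map: one strongly attended patch among weak background values.
attn = np.array([[0.1, 0.1],
                 [0.1, 0.9]])
filtered = sigma_filter(attn)
```

On this map the threshold falls between the background and the peak, so only the strongly attended patch survives. A larger `k` gives sparser, more selective maps at the risk of discarding weaker but relevant regions.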

[95] When Thinking Drifts: Evidential Grounding for Robust Video Reasoning

Mi Luo,Zihui Xue,Alex Dimakis,Kristen Grauman

Main category: cs.CV

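TL;DR: The paper analyzes why Chain-of-Thought (CoT) often underperforms in video reasoning and proposes Visual Evidence Reward (VER), a framework that explicitly rewards reasoning grounded in visual evidence, improving video understanding performance.

Details Motivation: CoT works well on text tasks, but in video reasoning it tends to produce misleading reasoning that degrades performance. The authors call this phenomenon “visual thinking drift” and argue that it must be addressed to improve video reasoning.

Contribution: 1. Reveals the limitations of CoT in video reasoning; 2. proposes VER, a reinforcement learning framework that explicitly rewards reasoning grounded in visual evidence; 3. validates VER on 10 video understanding benchmarks.

Method: Proposes Visual Evidence Reward (VER), a reinforcement learning framework that optimizes the model by rewarding reasoning traces consistent with visual evidence.

Result: Video-VER consistently achieves top performance across the video understanding benchmarks, showing that it is effective at curbing “visual thinking drift”.

Insight: Video reasoning needs to be explicitly grounded in visual evidence, and VER gives multimodal models a more robust basis for reasoning.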

Abstract: Video reasoning, the task of enabling machines to infer from dynamic visual content through multi-step logic, is crucial for advanced AI. While the Chain-of-Thought (CoT) mechanism has enhanced reasoning in text-based tasks, its application to video understanding remains underexplored. This paper presents a systematic analysis revealing that CoT often degrades performance in video reasoning, generating verbose but misleading internal monologues, and leading to hallucinated visual details and overridden correct intuitions - a phenomenon we term “visual thinking drift”. We explain this drift through a Bayesian lens, positing that CoT traces often diverge from actual visual evidence, instead amplifying internal biases or language priors, causing models to storytell rather than engage in grounded reasoning. To counteract this, we introduce Visual Evidence Reward (VER), a novel reinforcement learning framework that explicitly rewards the generation of reasoning traces that are verifiably grounded in visual evidence. Comprehensive evaluation across 10 diverse video understanding benchmarks demonstrates that our Video-VER consistently achieves top performance. Our work sheds light on the distinct challenges of video-centric reasoning and encourages the development of AI that robustly grounds its inferences in visual evidence - for large multimodal models that not only “think before answering”, but also “see while thinking”.

[96] A public cardiac CT dataset featuring the left atrial appendage

Bjoern Hansen,Jonas Pedersen,Klaus F. Kofoed,Oscar Camara,Rasmus R. Paulsen,Kristine Soerensen

Main category: cs.CV

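TL;DR: This paper presents a public cardiac CT dataset focused on high-resolution segmentation of the left atrial appendage (LAA), with improved segmentation labels for the coronary arteries (CAs) and pulmonary veins (PVs), intended to foster new approaches to LAA morphology analysis.

Details Motivation: Advanced segmentation frameworks such as TotalSegmentator still struggle to segment the LAA, CAs, and PVs, so a high-quality public dataset is needed to support research in this area.

Contribution: 1. Provides the first open-source, anatomically coherent, high-resolution LAA segmentation dataset;
2. Improves the CA and PV segmentation labels for the ImageCAS dataset;
3. Provides a list of scans that contain common data flaws.

Method: 1. Generates LAA segmentations with a state-of-the-art framework designed specifically for high-resolution LAA segmentation, trained on a large private dataset manually annotated by medical readers;
2. Transfers the model to the ImageCAS dataset;
3. Improves the CA and PV segmentation labels.

Result: Produces a high-quality LAA segmentation dataset along with improved CA and PV labels, and identifies common flaws in the data.

Insight: The dataset is expected to advance LAA morphology analysis and related medical imaging research; the annotation quality of public data significantly affects algorithm performance.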

Abstract: Despite the success of advanced segmentation frameworks such as TotalSegmentator (TS), accurate segmentations of the left atrial appendage (LAA), coronary arteries (CAs), and pulmonary veins (PVs) remain a significant challenge in medical imaging. In this work, we present the first open-source, anatomically coherent dataset of curated, high-resolution segmentations for these structures, supplemented with whole-heart labels produced by TS on the publicly available ImageCAS dataset consisting of 1000 cardiac computed tomography angiography (CCTA) scans. One purpose of the data set is to foster novel approaches to the analysis of LAA morphology. LAA segmentations on ImageCAS were generated using a state-of-the-art segmentation framework developed specifically for high resolution LAA segmentation. We trained the network on a large private dataset with manual annotations provided by medical readers guided by a trained cardiologist and transferred the model to ImageCAS data. CA labels were improved from the original ImageCAS annotations, while PV segmentations were refined from TS outputs. In addition, we provide a list of scans from ImageCAS that contains common data flaws such as step artefacts, LAAs extending beyond the scanner’s field of view, and other types of data defects.

[97] Compact Multi-level-prior Tensor Representation for Hyperspectral Image Super-resolution

Yinjian Wang,Wei Li,Yuanyuan Gui,Gemine Vivone

Main category: cs.CV

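TL;DR: The paper proposes a compact multi-level-prior tensor representation for hyperspectral image super-resolution. It decouples spectral and spatial information and introduces a non-convex mode-shuffled tensor correlated total variation to model multi-level priors efficiently.

Details Motivation: Existing tensor-based methods can exploit only one or two priors at one or two levels and cannot effectively balance the weights of multi-level priors or optimize multi-block structures, so a compact model that uses multi-level priors simultaneously is needed.

Contribution: Proposes a compact tensor-framework model that jointly models multi-level priors through block term decomposition and a non-convex mode-shuffled tensor correlated total variation.

Method: 1. Decouples spectral low-rankness from spatial priors via block term decomposition; 2. Introduces the non-convex mode-shuffled tensor correlated total variation to model the high-order low-rankness and smoothness of the spatial tensor.

Result: Experiments on multiple datasets validate the method's effectiveness.

Insight: Decoupling spectral and spatial information and jointly optimizing multi-level priors are key to hyperspectral image super-resolution.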

Abstract: Fusing a hyperspectral image with a multispectral image acquired over the same scene, i.e., hyperspectral image super-resolution, has become a popular computational way to access the latent high-spatial-spectral-resolution image. To date, a variety of fusion methods have been proposed, among which the tensor-based ones have shown that multiple priors, such as multidimensional low-rankness and spatial total variation at multiple levels, effectively drive the fusion process. However, existing tensor-based models can only effectively leverage one or two priors at one or two levels, since simultaneously incorporating multi-level priors inevitably increases model complexity. This introduces challenges in both balancing the weights of different priors and optimizing multi-block structures. To address this, we present a novel hyperspectral super-resolution model compactly characterizing these multi-level priors of hyperspectral images within the tensor framework. Firstly, the proposed model decouples the spectral low-rankness and spatial priors by casting the latent high-spatial-spectral-resolution image into spectral subspace and spatial maps via block term decomposition. Secondly, these spatial maps are stacked as the spatial tensor encoding the high-order spatial low-rankness and smoothness priors, which are co-modeled via the proposed non-convex mode-shuffled tensor correlated total variation. Finally, we draw inspiration from the linearized alternating direction method of multipliers to design an efficient algorithm to optimize the resulting model, theoretically proving its Karush-Kuhn-Tucker convergence under mild conditions. Experiments on multiple datasets demonstrate the effectiveness of the proposed algorithm. The code implementation will be available from https://github.com/WongYinJ.

[98] Multimodal Feature Prototype Learning for Interpretable and Discriminative Cancer Survival Prediction

Shuo Jiang,Zhuwen Chen,Liaoman Xu,Yanming Zhu,Changmiao Wang,Jiong Zhang,Feiwei Qin,Yifei Chen,Zhu Zhu

Main category: cs.CV

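TL;DR: Proposes FeatProto, a framework that fuses global and local whole slide image (WSI) features with genomic data to improve the interpretability and discriminability of cancer survival prediction. Combined with the EMA ProtoUp strategy and hierarchical prototype matching, it outperforms existing unimodal and multimodal methods.

Details Motivation: Current survival analysis models are hard to interpret, which limits their clinical use; traditional prototype learning methods neglect the broader tumor context and lack semantic alignment with genomic data.

Contribution: 1. A unified prototype space that fuses global/local WSI features with genomic data; 2. The EMA ProtoUp strategy; 3. A hierarchical prototype matching scheme.

Method: Combines WSI representations built from global and local features with genomic data, updates prototypes dynamically through EMA ProtoUp, and uses hierarchical matching to capture global centrality and local typicality.

Result: Outperforms existing methods on four cancer datasets, with marked gains in accuracy and interpretability.

Insight: Prototype learning for medical applications must account for both global and local features; dynamic updating and hierarchical matching are crucial to model performance.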

Abstract: Survival analysis plays a vital role in making clinical decisions. However, the models currently in use are often difficult to interpret, which reduces their usefulness in clinical settings. Prototype learning presents a potential solution, yet traditional methods focus on local similarities and static matching, neglecting the broader tumor context and lacking strong semantic alignment with genomic data. To overcome these issues, we introduce an innovative prototype-based multimodal framework, FeatProto, aimed at enhancing cancer survival prediction by addressing significant limitations in current prototype learning methodologies within pathology. Our framework establishes a unified feature prototype space that integrates both global and local features of whole slide images (WSI) with genomic profiles. This integration facilitates traceable and interpretable decision-making processes. Our approach includes three main innovations: (1) A robust phenotype representation that merges critical patches with global context, harmonized with genomic data to minimize local bias. (2) An Exponential Prototype Update Strategy (EMA ProtoUp) that sustains stable cross-modal associations and employs a wandering mechanism to adapt prototypes flexibly to tumor heterogeneity. (3) A hierarchical prototype matching scheme designed to capture global centrality, local typicality, and cohort-level trends, thereby refining prototype inference. Comprehensive evaluations on four publicly available cancer datasets indicate that our method surpasses current leading unimodal and multimodal survival prediction techniques in both accuracy and interoperability, providing a new perspective on prototype learning for critical medical applications. Our source code is available at https://github.com/JSLiam94/FeatProto.
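The abstract names an exponential-moving-average prototype update (EMA ProtoUp) but does not give the rule. A minimal sketch of a generic EMA prototype update follows; the function name `ema_update`, the `momentum` value, and the use of the mean of assigned features are illustrative assumptions, and the paper's "wandering mechanism" is not modeled.

```python
def ema_update(prototype, features, momentum=0.9):
    """Move a prototype toward the mean of the feature vectors assigned to it.

    Generic EMA rule (an assumption, not the paper's exact update):
        new = momentum * old + (1 - momentum) * mean(features)
    """
    if not features:
        # Nothing was assigned in this step: keep the prototype unchanged.
        return list(prototype)
    dim = len(prototype)
    mean = [sum(f[i] for f in features) / len(features) for i in range(dim)]
    return [momentum * p + (1.0 - momentum) * m for p, m in zip(prototype, mean)]


# Toy usage: two 2-D features pull a zero prototype halfway to their mean.
updated = ema_update([0.0, 0.0], [[1.0, 2.0], [3.0, 4.0]], momentum=0.5)
```

A momentum close to 1 keeps prototypes stable across batches, while a smaller value lets them track new data faster.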

[99] Towards Data-Efficient Medical Imaging: A Generative and Semi-Supervised Framework

Mosong Ma,Tania Stathaki,Michalis Lazarou

Main category: cs.CV

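TL;DR: SSGNet is a unified framework that combines generative modeling with semi-supervised learning to address data scarcity and label imbalance in medical imaging. It improves classification and segmentation by generating high-quality images and iteratively pseudo-labeling.

Details Motivation: Deep learning for medical imaging often faces scarce annotated data and class imbalance, which limits model performance.

Contribution: Proposes the SSGNet framework, which combines StyleGAN3-generated images with iterative semi-supervised pseudo-labeling to strengthen the classification and segmentation ability of existing baselines.

Method: Uses StyleGAN3 to generate class-specific images that expand the training data, and refines label quality through iterative pseudo-labeling.

Result: Across multiple medical imaging benchmarks, SSGNet significantly improves classification and segmentation performance; Frechet Inception Distance analysis confirms that the generated samples are of high quality.

Insight: SSGNet offers a practical strategy for easing annotation bottlenecks and improving the robustness of medical image analysis.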

Abstract: Deep learning in medical imaging is often limited by scarce and imbalanced annotated data. We present SSGNet, a unified framework that combines class specific generative modeling with iterative semisupervised pseudo labeling to enhance both classification and segmentation. Rather than functioning as a standalone model, SSGNet augments existing baselines by expanding training data with StyleGAN3 generated images and refining labels through iterative pseudo labeling. Experiments across multiple medical imaging benchmarks demonstrate consistent gains in classification and segmentation performance, while Frechet Inception Distance analysis confirms the high quality of generated samples. These results highlight SSGNet as a practical strategy to mitigate annotation bottlenecks and improve robustness in medical image analysis.
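The abstract does not say how pseudo labels are chosen in each iteration. One common choice, shown here purely as an assumption, is to keep only the unlabeled samples whose top predicted class probability reaches a confidence threshold; `select_pseudo_labels` and the `threshold` value are hypothetical names and settings.

```python
def select_pseudo_labels(probabilities, threshold=0.9):
    """Return (sample_index, class_index) pairs for confident predictions.

    `probabilities` holds one list of class probabilities per unlabeled sample.
    Thresholding on confidence is an assumed selection rule, not SSGNet's own.
    """
    selected = []
    for idx, probs in enumerate(probabilities):
        best_class = max(range(len(probs)), key=lambda c: probs[c])
        if probs[best_class] >= threshold:
            selected.append((idx, best_class))
    return selected


# Toy usage: only the first and third samples are confident enough.
picked = select_pseudo_labels([[0.95, 0.05], [0.60, 0.40], [0.08, 0.92]])
```

In an iterative scheme, the selected pairs would be added to the training set, the model retrained, and the selection repeated on the remaining unlabeled samples.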

[100] Discrete Diffusion Models with MLLMs for Unified Medical Multimodal Generation

Jiawei Mao,Yuhan Wang,Lifeng Chen,Can Zhao,Yucheng Tang,Dong Yang,Liangqiong Qu,Daguang Xu,Yuyin Zhou

Main category: cs.CV

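TL;DR: The paper proposes MeDiM, a medical multimodal generation method based on discrete diffusion. It unifies image and text generation through a shared probabilistic space and uses a multimodal large language model (MLLM) as the diffusion backbone to generate high-quality medical data.

Details Motivation: Existing generative medical models are usually limited to a single modality and struggle to integrate multimodal data (such as imaging, pathology, and clinical notes), which limits their potential as medical foundation models. The paper aims to solve this with a unified model.

Contribution: 1. Proposes MeDiM, the first medical discrete diffusion model, which unifies multimodal generation tasks;
2. Designs a diffusion architecture with bidirectional context and achieves diffusion awareness through timestep embeddings;
3. Demonstrates strong performance in medical image and report generation.

Method: 1. Uses an MLLM as the diffusion backbone;
2. Removes the causal attention mask to obtain bidirectional context;
3. Injects continuous timestep embeddings for diffusion awareness.

Result: On MIMIC-CXR and PathGen, MeDiM achieves high-fidelity medical image generation (FID 16.60 and 24.19, respectively) and accurate report generation (METEOR 0.2650 and 0.2580, respectively). Jointly generated image-report pairs significantly improve downstream task performance.

Insight: 1. By unifying generation tasks across modalities in a shared probabilistic space, MeDiM can produce coherent, clinically relevant multimodal outputs;
2. Removing the causal attention mask helps the model capture bidirectional dependencies.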

Abstract: Recent advances in generative medical models are constrained by modality-specific scenarios that hinder the integration of complementary evidence from imaging, pathology, and clinical notes. This fragmentation limits their evolution into foundation models that can learn and reason across the full spectrum of biomedical data. We propose MeDiM, the first medical discrete diffusion model that learns shared distributions across modalities without modality-specific components. MeDiM unifies multiple generative tasks: translating between images and text, and jointly producing image-report pairs across domains in response to prompts. Built on a discrete diffusion framework, MeDiM bridges vision and language representations through a shared probabilistic space. To enable unified and flexible medical generation, we employ a multimodal large language model (MLLM) as the diffusion backbone, leveraging its prior knowledge and cross-modal reasoning. Two key designs are introduced: (1) removing the causal attention mask for bidirectional context, and (2) injecting continuous timestep embeddings for diffusion awareness. Experiments demonstrate high-fidelity medical generation (FID 16.60 on MIMIC-CXR and FID 24.19 on PathGen) and accurate report generation (METEOR 0.2650 and 0.2580). Jointly generated image-report pairs further enhance downstream performance (+6.43% BLEU-1, +18.57% BLEU-2, +31.58% BLEU-3, +4.80% METEOR), showing that MeDiM supports coherent and clinically grounded multimodal outputs.
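To make the first design choice concrete, the sketch below contrasts a causal attention mask with the all-visible mask obtained by removing it. It is a toy illustration only: `attention_mask` is a hypothetical helper, and how MeDiM modifies its MLLM backbone in practice is not described in the abstract.

```python
def attention_mask(seq_len, causal):
    """Build a boolean mask where mask[i][j] is True if token i may attend to token j.

    causal=True  -> lower-triangular mask (each token sees itself and the past).
    causal=False -> every token sees every other token (bidirectional context).
    """
    return [[(j <= i) if causal else True for j in range(seq_len)]
            for i in range(seq_len)]


causal = attention_mask(3, causal=True)
bidirectional = attention_mask(3, causal=False)
```

With the bidirectional mask, a masked token in the middle of a sequence can condition on tokens both before and after it, which is what a denoising step over a partially corrupted sequence needs.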

[101] Deforming Videos to Masks: Flow Matching for Referring Video Segmentation

Zanyi Wang,Dengyang Jiang,Liuzhuozheng Li,Sizhe Dang,Chengzu Li,Harry Yang,Guang Dai,Mengmeng Wang,Jingdong Wang

Main category: cs.CV

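TL;DR: The paper proposes FlowRVS, which reformulates Referring Video Object Segmentation (RVOS) as a conditional continuous flow problem, avoiding the limitations of staged pipelines and achieving better semantic alignment and temporal consistency.

Details Motivation: Conventional RVOS methods use a cascaded “locate-then-segment” design, which can reduce semantics to coarse geometric prompts and struggles to maintain temporal consistency. FlowRVS aims to address these problems.

Contribution: Proposes the FlowRVS framework, which reformulates RVOS as a conditional continuous flow problem and leverages the strengths of pretrained text-to-video models for fine-grained pixel control and text-video semantic alignment.

Method: Learns a direct, language-guided deformation from the video's holistic representation to the target mask, rather than generating a mask from noise or predicting the mask directly.

Result: FlowRVS sets a new state of the art on benchmarks such as MeViS and Ref-DAVIS17, improving on prior results by 1.6 and 2.7 points, respectively.

Insight: Modeling video understanding tasks as continuous deformation processes has significant potential, especially for tasks that require semantic alignment and temporal consistency.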

Abstract: Referring Video Object Segmentation (RVOS) requires segmenting specific objects in a video guided by a natural language description. The core challenge of RVOS is to anchor abstract linguistic concepts onto a specific set of pixels and continuously segment them through the complex dynamics of a video. Faced with this difficulty, prior work has often decomposed the task into a pragmatic “locate-then-segment” pipeline. However, this cascaded design creates an information bottleneck by simplifying semantics into coarse geometric prompts (e.g., points), and struggles to maintain temporal consistency as the segmenting process is often decoupled from the initial language grounding. To overcome these fundamental limitations, we propose FlowRVS, a novel framework that reconceptualizes RVOS as a conditional continuous flow problem. This allows us to harness the inherent strengths of pretrained T2V models: fine-grained pixel control, text-video semantic alignment, and temporal coherence. Instead of the conventional approaches of generating a mask from noise or directly predicting the mask, we reformulate the task by learning a direct, language-guided deformation from a video’s holistic representation to its target mask. Our one-stage, generative approach achieves new state-of-the-art results across all major RVOS benchmarks. Specifically, it achieves a $\mathcal{J}\&\mathcal{F}$ of 51.1 on MeViS (+1.6 over prior SOTA) and 73.3 on zero-shot Ref-DAVIS17 (+2.7), demonstrating the significant potential of modeling video understanding tasks as continuous deformation processes.
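The idea of deforming a video representation into a mask along a continuous flow can be sketched as integrating an ordinary differential equation from t = 0 to t = 1. In the toy below, `euler_flow` is a hypothetical helper, the latents are plain lists, and the velocity field is a hand-written constant standing in for the learned, language-conditioned network that the paper would use.

```python
def euler_flow(x0, velocity, steps=10):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Toy stand-ins (assumptions): a 3-value "video latent" and its target "mask latent".
video_latent = [0.2, 0.8, 0.5]
mask_latent = [0.0, 1.0, 1.0]


def straight_line_velocity(x, t):
    # For a straight-line path the velocity is constant: target minus source.
    return [m - s for m, s in zip(mask_latent, video_latent)]


result = euler_flow(video_latent, straight_line_velocity, steps=4)
```

Because the toy velocity is constant, the Euler integration lands on the target up to floating-point error; a learned velocity field would generally need more steps or a better integrator.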

[102] Bimanual 3D Hand Motion and Articulation Forecasting in Everyday Images

Aditya Prakash,David Forsyth,Saurabh Gupta

Main category: cs.CV

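TL;DR: The paper proposes a method for forecasting bimanual 3D hand motion and articulation from everyday images, and designs a diffusion-based annotation pipeline that lifts 2D hand keypoint sequences to 4D hand motion.

Details Motivation: To address the lack of diverse 3D hand annotations in everyday scenes and thereby enable more accurate forecasting of bimanual motion and articulation.

Contribution: 1. Designs a diffusion-based annotation pipeline that lifts 2D hand keypoint sequences to 4D hand motion; 2. Adopts a diffusion loss in the forecasting model to handle the multimodality of the hand motion distribution.

Method: 1. Uses a diffusion model to generate 4D hand motion from 2D hand keypoint sequences; 2. Introduces a diffusion loss in the forecasting model to capture the multimodal distribution of hand motion.

Result: Experiments on 6 datasets show that the annotation pipeline and the forecasting model outperform baselines, especially in zero-shot generalization to everyday images (42% improvement for the annotation pipeline, 16.4% for the forecasting model).

Insight: Diverse data and the annotation pipeline are crucial to model performance, particularly in complex scenes.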

Abstract: We tackle the problem of forecasting bimanual 3D hand motion & articulation from a single image in everyday settings. To address the lack of 3D hand annotations in diverse settings, we design an annotation pipeline consisting of a diffusion model to lift 2D hand keypoint sequences to 4D hand motion. For the forecasting model, we adopt a diffusion loss to account for the multimodality in hand motion distribution. Extensive experiments across 6 datasets show the benefits of training on diverse data with imputed labels (14% improvement) and effectiveness of our lifting (42% better) & forecasting (16.4% gain) models, over the best baselines, especially in zero-shot generalization to everyday images.

[103] ShapeGen4D: Towards High Quality 4D Shape Generation from Videos

Jiraphon Yenphraphai,Ashkan Mirzaei,Jianqi Chen,Jiaxu Zou,Sergey Tulyakov,Raymond A. Yeh,Peter Wonka,Chaoyang Wang

Main category: cs.CV

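TL;DR: ShapeGen4D is a framework that generates high-quality 4D shapes directly from video. It introduces temporal attention, time-aware point sampling with 4D latent anchoring, and noise sharing to synthesize a dynamic 3D representation end-to-end.

Details Motivation: Existing methods usually require per-frame optimization and struggle to capture non-rigid motion, volume changes, and topological transitions. ShapeGen4D aims to generate temporally consistent 4D shapes directly from video.

Contribution: Proposes an end-to-end video-to-4D shape generation framework with three key components that improve temporal consistency and the quality of geometry and texture.

Method: Builds on large-scale pretrained 3D models and uses temporal attention, time-aware point sampling with 4D latent anchoring, and noise sharing across frames.

Result: On diverse in-the-wild videos, the method shows greater robustness and perceptual fidelity and fewer failure modes than the baselines.

Insight: With end-to-end learning and a design built around temporal consistency, ShapeGen4D efficiently generates high-quality dynamic 3D representations and supports complex non-rigid motion.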

Abstract: Video-conditioned 4D shape generation aims to recover time-varying 3D geometry and view-consistent appearance directly from an input video. In this work, we introduce a native video-to-4D shape generation framework that synthesizes a single dynamic 3D representation end-to-end from the video. Our framework introduces three key components based on large-scale pre-trained 3D models: (i) a temporal attention that conditions generation on all frames while producing a time-indexed dynamic representation; (ii) a time-aware point sampling and 4D latent anchoring that promote temporally consistent geometry and texture; and (iii) noise sharing across frames to enhance temporal stability. Our method accurately captures non-rigid motion, volume changes, and even topological transitions without per-frame optimization. Across diverse in-the-wild videos, our method improves robustness and perceptual fidelity and reduces failure modes compared with the baselines.

[104] Drive&Gen: Co-Evaluating End-to-End Driving and Video Generation Models

Jiahao Wang,Zhenpei Yang,Yijing Bai,Yingwei Li,Yuliang Zou,Bo Sun,Abhijit Kundu,Jose Lezama,Luna Yue Huang,Zehao Zhu,Jyh-Jing Hwang,Dragomir Anguelov,Mingxing Tan,Chiyu Max Jiang

Main category: cs.CV

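TL;DR: The paper explores combining generative models with end-to-end driving models (Drive&Gen), using novel statistical measures to evaluate the realism of generated videos and synthetic data to improve the generalization of driving models.

Details Motivation: Recent progress in generative models and end-to-end driving models opens new possibilities for autonomous driving, but questions remain about whether generated videos adhere to the specified conditions and how data can improve model generalization.

Contribution: Proposes a way to bridge driving models and generative world models (Drive&Gen), designs statistical measures for evaluating the realism of generated videos, and shows that synthetic data effectively improves model generalization.

Method: Uses end-to-end driving models to evaluate the realism of generated videos; exploits the controllability of the video generation model to study the distribution gaps that affect driving model performance; uses synthetic data to extend model generalization.

Result: The statistical measures validate the realism of the generated videos; synthetic data effectively improves end-to-end driving models in scenarios beyond their existing Operational Design Domains.

Insight: Generative models can serve as a low-cost data augmentation tool that helps autonomous driving systems adapt to new operating environments; the controllability of video generation models offers a new way to study driving models.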

Abstract: Recent advances in generative models have sparked exciting new possibilities in the field of autonomous vehicles. Specifically, video generation models are now being explored as controllable virtual testing environments. Simultaneously, end-to-end (E2E) driving models have emerged as a streamlined alternative to conventional modular autonomous driving systems, gaining popularity for their simplicity and scalability. However, the application of these techniques to simulation and planning raises important questions. First, while video generation models can generate increasingly realistic videos, can these videos faithfully adhere to the specified conditions and be realistic enough for E2E autonomous planner evaluation? Second, given that data is crucial for understanding and controlling E2E planners, how can we gain deeper insights into their biases and improve their ability to generalize to out-of-distribution scenarios? In this work, we bridge the gap between the driving models and generative world models (Drive&Gen) to address these questions. We propose novel statistical measures leveraging E2E drivers to evaluate the realism of generated videos. By exploiting the controllability of the video generation model, we conduct targeted experiments to investigate distribution gaps affecting E2E planner performance. Finally, we show that synthetic data produced by the video generation model offers a cost-effective alternative to real-world data collection. This synthetic data effectively improves E2E model generalization beyond existing Operational Design Domains, facilitating the expansion of autonomous vehicle services into new operational contexts.

[105] Fine-grained Defocus Blur Control for Generative Image Models

Ayush Shrivastava,Connelly Barnes,Xuaner Zhang,Lingzhi Zhang,Andrew Owens,Sohrab Amirghodsi,Eli Shechtman

Main category: cs.CV

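TL;DR: The paper proposes a new text-to-image diffusion framework that uses camera metadata (EXIF data) for fine-grained lens blur control. It mimics the physical image formation process to give precise interactive control without changing scene content.

Details Motivation: Existing text-to-image diffusion models generate diverse, high-quality images but struggle to incorporate fine-grained camera metadata such as aperture settings. The authors aim for fine-grained control of lens blur by simulating the image formation process.

Contribution: 1) Proposes a text-to-image diffusion framework that incorporates EXIF data; 2) Introduces a novel focus distance transformer; 3) Learns, through gradient backpropagation and without explicit supervision, to generate blur effects based on content and EXIF data.

Method: 1) Generate an all-in-focus image; 2) Estimate monocular depth; 3) Predict the focus distance with the focus distance transformer; 4) Form the defocused image with a differentiable lens blur model. Gradients flow back through the whole pipeline, so the model can learn without explicit supervision.

Result: Experiments show that the method provides finer blur control than existing diffusion models without altering scene content.

Insight: Generative methods that simulate physical image formation can incorporate camera metadata effectively and give users more natural interactive control.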

Abstract: Current text-to-image diffusion models excel at generating diverse, high-quality images, yet they struggle to incorporate fine-grained camera metadata such as precise aperture settings. In this work, we introduce a novel text-to-image diffusion framework that leverages camera metadata, or EXIF data, which is often embedded in image files, with an emphasis on generating controllable lens blur. Our method mimics the physical image formation process by first generating an all-in-focus image, estimating its monocular depth, predicting a plausible focus distance with a novel focus distance transformer, and then forming a defocused image with an existing differentiable lens blur model. Gradients flow backwards through this whole process, allowing us to learn without explicit supervision to generate defocus effects based on content elements and the provided EXIF data. At inference time, this enables precise interactive user control over defocus effects while preserving scene contents, which is not achievable with existing diffusion models. Experimental results demonstrate that our model enables superior fine-grained control without altering the depicted scene.
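The abstract does not specify which differentiable lens blur model is used. As a rough illustration of how depth, focus distance, and aperture interact, the sketch below computes the blur-circle diameter from the standard thin-lens approximation; the thin-lens choice, the function name `circle_of_confusion`, and the parameter names are all assumptions made for this example.

```python
def circle_of_confusion(depth_m, focus_m, focal_length_mm, f_number):
    """Blur-circle diameter on the sensor, in millimetres (thin-lens approximation).

    c = A * |d - d_f| / d * f / (d_f - f), with aperture diameter A = f / N.
    Assumes depth_m > 0 and focus_m greater than the focal length.
    """
    f = focal_length_mm / 1000.0          # focal length in metres
    aperture = f / f_number               # aperture diameter in metres
    blur_m = aperture * abs(depth_m - focus_m) / depth_m * f / (focus_m - f)
    return blur_m * 1000.0                # back to millimetres


# Toy usage: a 50 mm lens focused at 2 m, looking at a point 4 m away.
wide_open = circle_of_confusion(4.0, 2.0, focal_length_mm=50.0, f_number=2.0)
stopped_down = circle_of_confusion(4.0, 2.0, focal_length_mm=50.0, f_number=8.0)
```

Points at the focus distance get zero blur, and closing the aperture (a larger f-number) shrinks the blur circle proportionally, which is the kind of EXIF-driven control the paper targets.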

[106] Dropping the D: RGB-D SLAM Without the Depth Sensor

Mert Kiray,Alican Karaomer,Benjamin Busam

Main category: cs.CV

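TL;DR: DropD-SLAM is a real-time monocular SLAM system that replaces the depth sensor with pretrained vision modules and reaches accuracy comparable to RGB-D SLAM.

Details Motivation: Conventional RGB-D SLAM depends on depth sensors, which limits its cost-effectiveness and applicability. The paper aims for high-performance monocular SLAM by replacing depth input with pretrained vision modules.

Contribution: Proposes DropD-SLAM, which for the first time uses three pretrained modules (depth estimation, keypoint detection, and instance segmentation) to reach RGB-D-level SLAM performance without a depth sensor.

Method: Combines monocular depth estimation, learned keypoint detection, and an instance segmentation network. Dynamic objects are suppressed, static keypoints are assigned predicted depth values and backprojected into 3D features, and these are fed to a standard RGB-D SLAM back end.

Result: On the TUM RGB-D benchmark, mean ATE is 7.4 cm on static sequences and 1.8 cm on dynamic sequences, comparable to state-of-the-art RGB-D methods, at 22 FPS.

Insight: Pretrained vision models can replace depth sensors as reliable, real-time sources of metric scale, advancing low-cost, efficient SLAM systems.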

Abstract: We present DropD-SLAM, a real-time monocular SLAM system that achieves RGB-D-level accuracy without relying on depth sensors. The system replaces active depth input with three pretrained vision modules: a monocular metric depth estimator, a learned keypoint detector, and an instance segmentation network. Dynamic objects are suppressed using dilated instance masks, while static keypoints are assigned predicted depth values and backprojected into 3D to form metrically scaled features. These are processed by an unmodified RGB-D SLAM back end for tracking and mapping. On the TUM RGB-D benchmark, DropD-SLAM attains 7.4 cm mean ATE on static sequences and 1.8 cm on dynamic sequences, matching or surpassing state-of-the-art RGB-D methods while operating at 22 FPS on a single GPU. These results suggest that modern pretrained vision models can replace active depth sensors as reliable, real-time sources of metric scale, marking a step toward simpler and more cost-effective SLAM systems.
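The step of assigning predicted depth to static keypoints and backprojecting them into 3D can be sketched with the standard pinhole camera model. The helper names, the nested-list depth map and mask, and the intrinsics tuple are assumptions for illustration; mask dilation and the SLAM back end are left out.

```python
def backproject(u, v, depth, fx, fy, cx, cy):
    """Lift pixel (u, v) with metric depth to a 3D point in the camera frame."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def lift_static_keypoints(keypoints, depth_map, dynamic_mask, intrinsics):
    """Drop keypoints on dynamic objects and backproject the rest.

    depth_map[v][u] is the predicted metric depth; dynamic_mask[v][u] is True
    where an instance mask marks a (potentially) moving object.
    """
    fx, fy, cx, cy = intrinsics
    points = []
    for u, v in keypoints:
        if dynamic_mask[v][u]:
            continue  # suppress features on dynamic objects
        points.append(backproject(u, v, depth_map[v][u], fx, fy, cx, cy))
    return points


# Toy 2x2 image: the pixel at (u=1, v=0) lies on a dynamic object.
depth_map = [[2.0, 3.0], [4.0, 5.0]]
dynamic_mask = [[False, True], [False, False]]
points = lift_static_keypoints([(0, 0), (1, 0), (1, 1)], depth_map, dynamic_mask,
                               intrinsics=(1.0, 1.0, 0.0, 0.0))
```

The resulting 3D points carry metric scale from the depth estimator, which is what lets an unmodified RGB-D back end consume them as if they came from a sensor.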

[107] EgoNight: Towards Egocentric Vision Understanding at Night with a Challenging Benchmark

Deheng Zhang,Yuqian Fu,Runyi Yang,Yang Miao,Tianwen Qian,Xu Zheng,Guolei Sun,Ajad Chhatkuli,Xuanjing Huang,Yu-Gang Jiang,Luc Van Gool,Danda Pani Paudel

Main category: cs.CV

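TL;DR: The paper presents EgoNight, the first benchmark for nighttime egocentric vision understanding, with visual question answering (VQA) as its core task. It fills the gap left by existing benchmarks in low-light conditions and uses day-night aligned videos to show how strongly lighting affects performance.

Details Motivation: Existing egocentric vision benchmarks focus mainly on daytime scenes and overlook the real-world need to handle low-light nighttime conditions. EgoNight aims to fill this gap and advance nighttime egocentric vision research.

Contribution: 1. Presents EgoNight, the first nighttime egocentric vision benchmark; 2. Introduces day-night aligned videos to improve annotation quality; 3. Designs a VQA task built on day-night data plus two auxiliary tasks; 4. Shows that existing MLLMs degrade sharply at night.

Method: 1. Collects synthetic videos rendered with Blender and real-world recordings; 2. Builds the day-night aligned dataset EgoNight-VQA; 3. Develops a day-augmented night auto-labeling engine with human verification; 4. Evaluates methods including MLLMs.

Result: EgoNight-VQA contains 3658 QA pairs and 90 videos covering 12 question types. Existing MLLMs show significant performance drops at night.

Insight: 1. Nighttime VQA is an underexplored challenge; 2. Day-night aligned data helps improve nighttime model performance; 3. Lighting conditions strongly affect model generalization.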

Abstract: Most existing benchmarks for egocentric vision understanding focus primarily on daytime scenarios, overlooking the low-light conditions that are inevitable in real-world applications. To investigate this gap, we present EgoNight, the first comprehensive benchmark for nighttime egocentric vision, with visual question answering (VQA) as the core task. A key feature of EgoNight is the introduction of day-night aligned videos, which enhance night annotation quality using the daytime data and reveal clear performance gaps between lighting conditions. To achieve this, we collect both synthetic videos rendered by Blender and real-world recordings, ensuring that scenes and actions are visually and temporally aligned. Leveraging these paired videos, we construct EgoNight-VQA, supported by a novel day-augmented night auto-labeling engine and refinement through extensive human verification. Each QA pair is double-checked by annotators for reliability. In total, EgoNight-VQA contains 3658 QA pairs across 90 videos, spanning 12 diverse QA types, with more than 300 hours of human work. Evaluations of state-of-the-art multimodal large language models (MLLMs) reveal substantial performance drops when transferring from day to night, underscoring the challenges of reasoning under low-light conditions. Beyond VQA, EgoNight also introduces two auxiliary tasks, day-night correspondence retrieval and egocentric depth estimation at night, that further explore the boundaries of existing models. We believe EgoNight-VQA provides a strong foundation for advancing application-driven egocentric vision research and for developing models that generalize across illumination domains. All the data and code will be made available upon acceptance.

[108] Human3R: Everyone Everywhere All at Once

Yue Chen,Xingyu Chen,Yuxuan Xue,Anpei Chen,Yuliang Xiu,Gerard Pons-Moll

Main category: cs.CV

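TL;DR: Human3R is a unified, feed-forward framework for real-time 4D human-scene reconstruction from monocular video, without multi-stage pipelines or iterative optimization.

Details Motivation: Existing methods rely on multi-stage pipelines, contact-aware optimization, and heavy preprocessing (such as human detection, depth estimation, and SLAM), which makes them inefficient and hard to unify. Human3R aims to solve this with a single forward pass.

Contribution: 1. Proposes a unified model that jointly reconstructs multi-person SMPL-X bodies, a dense 3D scene, and camera trajectories; 2. Requires no preprocessing or iterative optimization; 3. Runs efficiently (real-time at 15 FPS with a low memory footprint).

Method: Builds on the CUT3R model and uses parameter-efficient visual prompt tuning to preserve its spatiotemporal priors while directly outputting multiple SMPL-X bodies.

Result: After one day of training on the synthetic dataset BEDLAM, it achieves state-of-the-art or competitive performance in global human motion estimation, local human mesh recovery, video depth estimation, and camera pose estimation.

Insight: Human3R shows the potential of unified models for highly complex tasks and underscores the importance of lightweight, efficient design.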

Abstract: We present Human3R, a unified, feed-forward framework for online 4D human-scene reconstruction, in the world frame, from casually captured monocular videos. Unlike previous approaches that rely on multi-stage pipelines, iterative contact-aware refinement between humans and scenes, and heavy dependencies, e.g., human detection, depth estimation, and SLAM pre-processing, Human3R jointly recovers global multi-person SMPL-X bodies (“everyone”), dense 3D scene (“everywhere”), and camera trajectories in a single forward pass (“all-at-once”). Our method builds upon the 4D online reconstruction model CUT3R and uses parameter-efficient visual prompt tuning in an effort to preserve CUT3R’s rich spatiotemporal priors while enabling direct readout of multiple SMPL-X bodies. Human3R is a unified model that eliminates heavy dependencies and iterative refinement. After being trained on the relatively small-scale synthetic dataset BEDLAM for just one day on one GPU, it achieves superior performance with remarkable efficiency: it reconstructs multiple humans in a one-shot manner, along with 3D scenes, in one stage, at real-time speed (15 FPS) with a low memory footprint (8 GB). Extensive experiments demonstrate that Human3R delivers state-of-the-art or competitive performance across tasks, including global human motion estimation, local human mesh recovery, video depth estimation, and camera pose estimation, with a single unified model. We hope that Human3R will serve as a simple yet strong baseline and be easily extended for downstream applications. Code is available at https://fanegg.github.io/Human3R

cs.DC [Back]

[109] Tiny but Mighty: A Software-Hardware Co-Design Approach for Efficient Multimodal Inference on Battery-Powered Small Devices

Yilong Li,Shuai Zhang,Yijing Zeng,Hao Zhang,Xinmiao Xiong,Jingyu Liu,Pan Hu,Suman Banerjee

Main category: cs.DC

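TL;DR: NANOMIND is a hardware-software co-designed inference framework that splits large multimodal models into modules and maps them onto heterogeneous accelerators, markedly improving energy efficiency and performance on small battery-powered devices.

Details Motivation: Existing multimodal models are usually executed monolithically and underuse the heterogeneous accelerators in modern SoCs, which leads to high latency and low efficiency.

Contribution: Proposes the NANOMIND framework, which uses modular design and dynamic offloading for efficient multimodal inference and substantially reduces energy consumption and memory usage.

Method: Decomposes large multimodal models into modules (such as vision, language, and audio) and schedules each on the most suitable accelerator; combines customized hardware design, system-level scheduling, and optimized low-bit computation.

Result: Energy consumption drops by 42.3% and GPU memory usage by 11.2%; a battery-powered device can run LLaVA-OneVision for nearly half a day and LLaMA-3-8B for almost 20.8 hours.

Insight: Modularization and dynamic offloading are key to making full use of heterogeneous accelerators; hardware-software co-design markedly improves inference on small devices.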

Abstract: Large Multimodal Models (LMMs) are inherently modular, consisting of vision and audio encoders, projectors, and large language models. Yet, they are almost always executed monolithically, which underutilizes the heterogeneous accelerators (NPUs, GPUs, DSPs) in modern SoCs and leads to high end-to-end latency. In this paper, we present NANOMIND, a hardware–software co-design inference framework for LMMs that breaks large models into modular “bricks” (vision, language, audio, etc.) and maps each to its ideal accelerator. The key insight is that large models can be broken into modular components and scheduled to run on the most appropriate compute units. It performs module-level dynamic offloading across accelerators on unified-memory SoCs. By combining customized hardware design, system-level scheduling, and optimized low-bit computation kernels, we demonstrate our framework with a compact, battery-powered device capable of running LMMs entirely on device. This prototype functions as a self-contained intelligent assistant that requires no network connectivity, while achieving higher throughput and superior power efficiency under strict resource constraints. The design further bypasses CPU bottlenecks and reduces redundant memory usage through token-aware buffer management and module-level coordination. Our system outperforms existing implementations in resource efficiency, cutting energy consumption by 42.3% and GPU memory usage by 11.2%. This enables a battery-powered device to run LLaVA-OneVision with a camera for nearly half a day and LLaMA-3-8B for voice interactions up to almost 20.8 hours.

cs.AI [Back]

[110] Beyond Monolithic Rewards: A Hybrid and Multi-Aspect Reward Optimization for MLLM Alignment

Radha Gulhane,Sathish Reddy Indurthi

Main category: cs.AI

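TL;DR: The paper proposes a hybrid reward modeling framework that combines model-based and rule-based rewards and uses multi-aspect rewards to align multimodal large language models (MLLMs) with human preferences, yielding marked performance gains.

Details Motivation: Existing model-based reward methods typically rely on a single signal, lack confidence calibration across domain-specific tasks, fail to capture the diversity of human preferences, and require extensive data annotation and reward model training. The paper aims to address these issues by combining several reward paradigms.

Contribution: 1. Proposes a hybrid reward modeling framework that integrates model-based and rule-based rewards; 2. Introduces multi-aspect rewards, including instruction adherence and a length penalty; 3. Markedly improves model performance on multiple multimodal benchmarks.

Method: 1. Model-based rewards: a learned reward model predicts scalar or vector scores from synthetic and human feedback; 2. Rule-based rewards: domain-specific heuristics provide explicit correctness signals; 3. Multi-aspect rewards: instruction adherence and a length penalty, which stabilize training.

Result: Experiments show an overall average improvement of about 9.5% for the best model in the 3B family, and an average improvement of about 16% on mathematical reasoning tasks.

Insight: Combining model-based and rule-based rewards aligns MLLMs with human preferences more flexibly and effectively; multi-aspect rewards help improve task performance and stability.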

Abstract: Aligning multimodal large language models (MLLMs) with human preferences often relies on single-signal, model-based reward methods. Such monolithic rewards often lack confidence calibration across domain-specific tasks, fail to capture diverse aspects of human preferences, and require extensive data annotation and reward model training. In this work, we propose a hybrid reward modeling framework that integrates complementary reward paradigms: (i) model-based rewards, where a learned reward model predicts scalar or vector scores from synthetic and human feedback, and (ii) rule-based rewards, where domain-specific heuristics provide explicit correctness signals with confidence. Beyond accuracy, we further incorporate multi-aspect rewards to enforce instruction adherence and introduce a generalized length-penalty reward to stabilize training and improve performance. The proposed framework provides a flexible and effective approach to aligning MLLMs through reinforcement learning policy optimization. Our experiments show consistent improvements across different multimodal benchmarks when applying hybrid and multi-aspect reward modeling. Our best performing model in the 3B family achieves an overall average improvement of ~9.5% across general and math reasoning tasks. Focusing specifically on mathematical benchmarks, the model achieves a significant average improvement of ~16%, highlighting its effectiveness in mathematical reasoning and problem solving.
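A minimal sketch of how the reward terms named in the abstract might be combined into one scalar follows. The weighted-sum form, the weights, the linear over-length penalty, and the function names are all assumptions for illustration; the paper's actual aggregation and its generalized length penalty are not given here.

```python
def length_penalty(num_tokens, target_tokens, scale=0.001):
    """Linear penalty for tokens beyond a target length (assumed form)."""
    return -scale * max(0, num_tokens - target_tokens)


def hybrid_reward(model_score, rule_correct, follows_instructions, num_tokens,
                  target_tokens=512, weights=(0.5, 1.0, 0.25)):
    """Combine model-based, rule-based, and multi-aspect reward terms.

    model_score: scalar from a learned reward model, assumed to lie in [0, 1].
    rule_correct: True/False from a rule-based checker, or None if no rule applies.
    follows_instructions: True if the response respects the requested format.
    """
    w_model, w_rule, w_instr = weights
    reward = w_model * model_score
    if rule_correct is not None:
        reward += w_rule * (1.0 if rule_correct else 0.0)
    reward += w_instr * (1.0 if follows_instructions else 0.0)
    return reward + length_penalty(num_tokens, target_tokens)


# Toy usage: a correct, well-formatted answer within the length budget.
r = hybrid_reward(0.8, True, True, num_tokens=100)
```

Passing `rule_correct=None` lets open-ended prompts fall back on the learned reward alone, while verifiable tasks such as math get the explicit correctness signal.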

[111] D2E: Scaling Vision-Action Pretraining on Desktop Data for Transfer to Embodied AI

Suwhan Choi,Jaeyoon Jung,Haebin Seong,Minchan Kim,Minyeong Kim,Yongjun Cho,Yoonshik Kim,Yubeen Park,Youngjae Yu,Yunsung Lee

Main category: cs.AI

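TL;DR: D2E pretrains on desktop data (such as games) and verifies that the result transfers to robotics tasks, showing that desktop data can serve as an effective pretraining substrate for real-world robot learning.

Details Motivation: Physical trajectory data is expensive to collect, which constrains embodied AI. Desktop environments (such as games) offer sensorimotor interaction data at scale while preserving observation-action coupling.

Contribution: 1) OWA Toolkit: unifies desktop interaction data into a standardized, compressed format; 2) Generalist-IDM: achieves zero-shot generalization through timestamp-based event prediction; 3) VAPT: transfers desktop-pretrained representations to physical tasks.

Method: Collects and pseudo-labels desktop data (such as gameplay), trains a generalist model, and transfers it to robotics tasks (such as manipulation and navigation).

Result: Reaches a 96.6% success rate on LIBERO manipulation and 83.3% on CANVAS navigation, validating that desktop data transfers.

Insight: Sensorimotor primitives in digital interactions are invariant enough to transfer to physical tasks; desktop pretraining is a practical paradigm for robot learning.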

Abstract: Large language models leverage internet-scale text data, yet embodied AI remains constrained by the prohibitive costs of physical trajectory collection. Desktop environments – particularly gaming – offer a compelling alternative: they provide rich sensorimotor interactions at scale while maintaining the structured observation-action coupling essential for embodied learning. We present D2E (Desktop to Embodied AI), a framework that demonstrates desktop interactions can serve as an effective pretraining substrate for embodied AI tasks in robotics. Unlike prior work that remained domain-specific (e.g., VPT for Minecraft) or kept data proprietary (e.g., SIMA), D2E establishes a complete pipeline from scalable desktop data collection to verified transfer in embodied domains. Our framework comprises three components: (1) the OWA Toolkit that unifies diverse desktop interactions into a standardized format with 152x compression, (2) the Generalist-IDM that achieves strong zero-shot generalization across unseen games through timestamp-based event prediction, enabling internet-scale pseudo-labeling, and (3) VAPT that transfers desktop-pretrained representations to physical manipulation and navigation. Using 1.3K+ hours of data (259 hours of human demonstrations, and 1K+ hours of pseudo-labeled gameplay), we achieve a 96.6% success rate on LIBERO manipulation and 83.3% on CANVAS navigation benchmarks. This validates that sensorimotor primitives in digital interactions exhibit sufficient invariance to transfer meaningfully to physical embodied tasks, establishing desktop pretraining as a practical paradigm for robotics. We will make all our work public, including the OWA toolkit, the human-collected and pseudo-labeled datasets, and VAPT-trained models, available at https://worv-ai.github.io/d2e/

[112] Optimization Modeling via Semantic Anchored Alignment

Yansen Zhang,Qingcan Kang,Yujie Chen,Yufei Wang,Xiongwei Han,Tao Zhong,Mingxuan Yuan,Chen Ma

Main category: cs.AI

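TL;DR: This paper proposes SAC-Opt, a backward-guided correction framework that uses semantic alignment to improve the accuracy of LLM-based optimization modeling, avoiding models that are syntactically correct but logically flawed.

Details Motivation: Existing approaches rely on single-pass forward generation and limited fixes driven by solver feedback, which leaves semantic errors undetected. SAC-Opt uses semantic-anchored alignment to ensure a faithful translation from problem intent to solver code.

Contribution: Proposes the SAC-Opt framework, which uses semantic-anchored alignment and backward correction to refine constraint and objective logic at a fine granularity, improving modeling fidelity and robustness.

Method: The framework compares the original semantic anchors with those reconstructed from the generated code, selectively corrects the mismatched components, and converges step by step toward a semantically consistent model.

Result: On seven public datasets, SAC-Opt improves average modeling accuracy by 7.8%, with gains of up to 21.9% on the ComplexLP dataset.

Insight: Semantic-anchored correction is key to logical consistency in LLM-based optimization modeling and avoids the limitations of relying on solver feedback alone.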

Abstract: Large language models (LLMs) have opened new paradigms in optimization modeling by enabling the generation of executable solver code from natural language descriptions. Despite this promise, existing approaches typically remain solver-driven: they rely on single-pass forward generation and apply limited post-hoc fixes based on solver error messages, leaving undetected semantic errors that silently produce syntactically correct but logically flawed models. To address this challenge, we propose SAC-Opt, a backward-guided correction framework that grounds optimization modeling in problem semantics rather than solver feedback. At each step, SAC-Opt aligns the original semantic anchors with those reconstructed from the generated code and selectively corrects only the mismatched components, driving convergence toward a semantically faithful model. This anchor-driven correction enables fine-grained refinement of constraint and objective logic, enhancing both fidelity and robustness without requiring additional training or supervision. Empirical results on seven public datasets demonstrate that SAC-Opt improves average modeling accuracy by 7.8%, with gains of up to 21.9% on the ComplexLP dataset. These findings highlight the importance of semantic-anchored correction in LLM-based optimization workflows to ensure faithful translation from problem intent to solver-executable code.
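The correction loop described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the anchor format (a name-to-string dictionary), the function names `mismatched` and `backward_correct`, and the two callables `reconstruct` and `regenerate` (stand-ins for LLM calls) are all assumptions made for the example.

```python
from typing import Callable, Dict, List

Anchors = Dict[str, str]   # component name -> semantic anchor (hypothetical format)
Model = Dict[str, str]     # component name -> generated solver code

def mismatched(original: Anchors, reconstructed: Anchors) -> List[str]:
    """Return the components whose reconstructed anchor differs from the original."""
    return sorted(name for name in original if reconstructed.get(name) != original[name])

def backward_correct(
    original: Anchors,
    model: Model,
    reconstruct: Callable[[Model], Anchors],
    regenerate: Callable[[str, str], str],
    max_steps: int = 5,
) -> Model:
    """Repeatedly fix only the mismatched components until the anchors agree."""
    for _ in range(max_steps):
        bad = mismatched(original, reconstruct(model))
        if not bad:
            break
        model = dict(model)  # copy so the caller's draft is left untouched
        for name in bad:
            model[name] = regenerate(name, original[name])
    return model

# Toy stand-ins for the two LLM calls: here the code "is" its own anchor.
anchors = {"capacity": "x + y <= 10", "objective": "maximize 3x + 2y"}
draft = {"capacity": "x + y <= 10", "objective": "minimize 3x + 2y"}
regenerated = []

def toy_reconstruct(model: Model) -> Anchors:
    return dict(model)

def toy_regenerate(name: str, anchor: str) -> str:
    regenerated.append(name)
    return anchor

fixed = backward_correct(anchors, draft, toy_reconstruct, toy_regenerate)
```

In the toy run only the `objective` component is regenerated, which mirrors the selective correction the abstract describes; the matching `capacity` constraint is never touched.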

[113] The Safety Challenge of World Models for Embodied AI Agents: A Review

Lorenzo Baraldi,Zifan Zeng,Chongzhe Zhang,Aradhana Nayak,Hongbo Zhu,Feng Liu,Qunli Zhang,Peng Wang,Shiming Liu,Zheng Hu,Angelo Cangelosi,Lorenzo Baraldi

Main category: cs.AI

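TL;DR: This paper reviews World Models (WMs) for embodied AI agents and their safety challenges, focusing on scene and control generation tasks in autonomous driving and robotics.

Details Motivation: Progress in embodied AI calls for more advanced models that perceive and predict environmental dynamics. World models can fill knowledge gaps and improve an agent's ability to act, but the safety of their predictions is critical.

Contribution: Through a literature review and an empirical analysis, the paper systematically summarizes the pathologies of world models and quantitatively evaluates their safety performance.

Method: Combines a literature review with an empirical analysis: collects predictions from state-of-the-art models, identifies and categorizes common faults (pathologies), and evaluates them quantitatively.

Result: The study exposes safety vulnerabilities of world models in practical applications and provides a basis for future safety improvements.

Insight: Applying world models in embodied AI requires more attention to safety, particularly the reliability of predictions, to avoid potential risks.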

Abstract: The rapid progress in embodied artificial intelligence has highlighted the necessity for more advanced and integrated models that can perceive, interpret, and predict environmental dynamics. In this context, World Models (WMs) have been introduced to provide embodied agents with the abilities to anticipate future environmental states and fill in knowledge gaps, thereby enhancing agents’ ability to plan and execute actions. However, when dealing with embodied agents it is fundamental to ensure that predictions are safe for both the agent and the environment. In this article, we conduct a comprehensive literature review of World Models in the domains of autonomous driving and robotics, with a specific focus on the safety implications of scene and control generation tasks. Our review is complemented by an empirical analysis, wherein we collect and examine predictions from state-of-the-art models, identify and categorize common faults (herein referred to as pathologies), and provide a quantitative evaluation of the results.

[114] Do Code Models Suffer from the Dunning-Kruger Effect?

Mukul Singh,Somya Chatterjee,Arjun Radhakrishna,Sumit Gulwani

Main category: cs.AI

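TL;DR: This paper investigates whether state-of-the-art LLMs exhibit the Dunning-Kruger Effect (DKE) on coding tasks, that is, whether less competent models overestimate their own abilities. It finds that AI models show human-like overconfidence in low-resource or unfamiliar domains.

Details Motivation: As AI and humans collaborate more in technical domains, understanding AI's cognitive boundaries and biases becomes increasingly important. The paper explores whether AI models exhibit a human-like Dunning-Kruger Effect and how it affects coding tasks.

Contribution: Shows that state-of-the-art LLMs exhibit DKE on coding tasks, with the bias more pronounced for weaker models and for rare programming languages.

Method: Analyzes model confidence and performance across programming languages, especially in low-resource or unfamiliar domains, to test for DKE tendencies.

Result: Less competent models and models operating in rare programming languages exhibit stronger DKE-like bias; the strength of the bias is inversely related to model competence.

Insight: AI overconfidence may resemble that of humans, especially in unfamiliar domains or when competence is low, which matters for reliability assessment and practical deployment of AI.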

Abstract: As artificial intelligence systems increasingly collaborate with humans in creative and technical domains, questions arise about the cognitive boundaries and biases that shape our shared agency. This paper investigates the Dunning-Kruger Effect (DKE), the tendency for those with limited competence to overestimate their abilities, in state-of-the-art LLMs in coding tasks. By analyzing model confidence and performance across a diverse set of programming languages, we reveal that AI models mirror human patterns of overconfidence, especially in unfamiliar or low-resource domains. Our experiments demonstrate that less competent models and those operating in rare programming languages exhibit stronger DKE-like bias, suggesting that the strength of the bias is proportionate to the competence of the models.
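One simple way to compare confidence with performance per programming language is the gap between mean stated confidence and accuracy. The sketch below is an assumption for illustration only: the abstract does not say which measure the authors use, and the function name, record format, and sample records are made up.

```python
from collections import defaultdict

def overconfidence_by_language(records):
    """records: iterable of (language, confidence in [0, 1], correct: bool).

    Returns mean confidence minus accuracy per language; a positive gap
    means the model was more confident than its results justify.
    """
    grouped = defaultdict(list)
    for language, confidence, correct in records:
        grouped[language].append((confidence, correct))
    gaps = {}
    for language, rows in grouped.items():
        mean_confidence = sum(c for c, _ in rows) / len(rows)
        accuracy = sum(1 for _, ok in rows if ok) / len(rows)
        gaps[language] = mean_confidence - accuracy
    return gaps

# Made-up records, purely to show the call shape.
records = [
    ("python", 0.9, True),
    ("python", 0.8, True),
    ("cobol", 0.9, False),
    ("cobol", 0.7, True),
]
gaps = overconfidence_by_language(records)
```

A DKE-like pattern would show up as larger positive gaps for rarer languages or weaker models.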

[115] VAL-Bench: Measuring Value Alignment in Language Models

Aman Gupta,Denny O’Shea,Fazl Barez

Main category: cs.AI

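TL;DR: VAL-Bench is a new benchmark that evaluates whether large language models (LLMs) maintain consistent human values when facing controversial issues. Using 115K paired prompts drawn from controversial topics on Wikipedia, it tests whether a model expresses stable values regardless of how a prompt is framed.

Details Motivation: Existing benchmarks mostly test rule compliance only (e.g., refusals or unsafe content) and cannot reveal whether a model holds consistent values on real controversial issues. VAL-Bench fills this gap by systematically evaluating LLM value alignment.

Contribution: Proposes the VAL-Bench benchmark for measuring whether LLMs keep their values consistent when prompted with opposing sides of controversial topics. It uses an LLM as a judge to quantify how much paired responses diverge.

Method: Extracts 115K pairs of opposing prompts from Wikipedia's controversial sections and has models respond to them. An LLM-as-judge scores the divergence between paired responses to measure value alignment.

Result: Leading open- and closed-source models differ markedly in value alignment, and some models show a trade-off between safety strategies (e.g., refusing to answer) and expressing values.

Insight: VAL-Bench exposes shortcomings in value consistency among current models and highlights that safety mechanisms should leave room for expressive flexibility so that models can reflect a more coherent value system.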

Abstract: Large language models (LLMs) are increasingly used for tasks where outputs shape human decisions, so it is critical to test whether their responses reflect consistent human values. Existing benchmarks mostly track refusals or predefined safety violations, but these only check rule compliance and do not reveal whether a model upholds a coherent value system when facing controversial real-world issues. We introduce the Value ALignment Benchmark (VAL-Bench), which evaluates whether models maintain a stable value stance across paired prompts that frame opposing sides of public debates. VAL-Bench consists of 115K such pairs from Wikipedia’s controversial sections. A well-aligned model should express similar underlying views regardless of framing, which we measure using an LLM-as-judge to score agreement or divergence between paired responses. Applied across leading open- and closed-source models, the benchmark reveals large variation in alignment and highlights trade-offs between safety strategies (e.g., refusals) and more expressive value systems. By providing a scalable, reproducible benchmark, VAL-Bench enables systematic comparison of how reliably LLMs embody human values.

[116] In-the-Flow Agentic System Optimization for Effective Planning and Tool Use

Zhuofeng Li,Haoxiang Zhang,Seungju Han,Sheng Liu,Jianwen Xie,Yu Zhang,Yejin Choi,James Zou,Pan Lu

Main category: cs.AI

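TL;DR: AgentFlow is a trainable multi-module agentic framework that optimizes its planner inside the multi-turn interaction loop, substantially improving task performance and tool-use reliability.

Details Motivation: Conventional tool-augmented reinforcement learning approaches perform poorly on long-horizon tasks and with diverse tools. Agentic systems allow work to be decomposed across modules, but most are never trained within the live interaction loop.

Contribution: Proposes the AgentFlow framework and the Flow-GRPO optimization method, which align local decisions with the global objective during multi-turn interaction and improve task performance.

Method: Four modules (planner, executor, verifier, generator) work together; Flow-GRPO optimizes the planner on-policy in the multi-turn environment, using trajectory-level feedback to improve decisions.

Result: Across ten benchmarks, AgentFlow achieves average accuracy gains of 14.9% on search, 14.0% on agentic, 14.5% on mathematical, and 4.1% on scientific tasks.

Insight: In-the-flow optimization is crucial in multi-turn interaction; modular design markedly improves tool-use reliability and planning, and performance improves with model size and the number of reasoning turns.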

Abstract: Outcome-driven reinforcement learning has advanced reasoning in large language models (LLMs), but prevailing tool-augmented approaches train a single, monolithic policy that interleaves thoughts and tool calls under full context; this scales poorly with long horizons and diverse tools and generalizes weakly to new scenarios. Agentic systems offer a promising alternative by decomposing work across specialized modules, yet most remain training-free or rely on offline training decoupled from the live dynamics of multi-turn interaction. We introduce AgentFlow, a trainable, in-the-flow agentic framework that coordinates four modules (planner, executor, verifier, generator) through an evolving memory and directly optimizes its planner inside the multi-turn loop. To train on-policy in live environments, we propose Flow-based Group Refined Policy Optimization (Flow-GRPO), which tackles long-horizon, sparse-reward credit assignment by converting multi-turn optimization into a sequence of tractable single-turn policy updates. It broadcasts a single, verifiable trajectory-level outcome to every turn to align local planner decisions with global success and stabilizes learning with group-normalized advantages. Across ten benchmarks, AgentFlow with a 7B-scale backbone outperforms top-performing baselines with average accuracy gains of 14.9% on search, 14.0% on agentic, 14.5% on mathematical, and 4.1% on scientific tasks, even surpassing larger proprietary models like GPT-4o. Further analyses confirm the benefits of in-the-flow optimization, showing improved planning, enhanced tool-calling reliability, and positive scaling with model size and reasoning turns.
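The two ingredients the abstract names for Flow-GRPO, group-normalized advantages and broadcasting one trajectory-level outcome to every turn, can be sketched as below. The function names, the use of the population standard deviation, and the `eps` stabilizer are assumptions; the abstract does not give the exact normalization.

```python
from statistics import mean, pstdev

def group_normalized_advantages(rewards, eps=1e-8):
    """Normalize each rollout's outcome reward against its sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; this choice is an assumption
    return [(r - mu) / (sigma + eps) for r in rewards]

def broadcast_to_turns(advantages, turns_per_rollout):
    """Give every turn of a rollout that rollout's single trajectory-level advantage."""
    return [[a] * n for a, n in zip(advantages, turns_per_rollout)]

# Four rollouts of the same task with a verifiable 0/1 outcome each.
outcome_rewards = [1, 0, 1, 0]
turn_counts = [2, 3, 1, 2]
advantages = group_normalized_advantages(outcome_rewards)
per_turn = broadcast_to_turns(advantages, turn_counts)
```

Each inner list of `per_turn` can then weight an ordinary single-turn policy update, which is how a multi-turn problem becomes a sequence of single-turn ones in this reading of the abstract.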

[117] ARM: Discovering Agentic Reasoning Modules for Generalizable Multi-Agent Systems

Bohan Yao,Shiva Krishna Reddy Malay,Vikas Yadav

Main category: cs.AI

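TL;DR: The paper proposes a new paradigm for automatically designing multi-agent systems (MAS) that focuses on optimizing Chain of Thought (CoT) reasoning, achieving higher performance and better generalization through the Agentic Reasoning Module (ARM).

Details Motivation: Existing automatic MAS design methods perform poorly and are computationally expensive. Simple CoT reasoning is competitive with them, which suggests it deserves further study.

Contribution: Proposes ARM, an agentic generalization of CoT, discovered through tree search and reflection-guided mutation.

Method: Starting from a simple CoT module, ARM is discovered via tree search over the code space, with mutations informed by reflection on execution traces.

Result: ARM significantly outperforms manually designed MASes and existing automatic methods and generalizes strongly across tasks and models.

Insight: Optimizing the CoT reasoning unit is key to improving MAS performance; ARM serves as a general-purpose reasoning building block with high generalizability.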

Abstract: Large Language Model (LLM)-powered Multi-agent systems (MAS) have achieved state-of-the-art results on various complex reasoning tasks. Recent works have proposed techniques to automate the design of MASes, eliminating the need for manual engineering. However, these techniques perform poorly, often achieving similar or inferior performance to simple baselines. Furthermore, they require computationally expensive re-discovery of architectures for each new task domain and expensive data annotation on domains without existing labeled validation sets. A critical insight is that simple Chain of Thought (CoT) reasoning often performs competitively with these complex systems, suggesting that the fundamental reasoning unit of MASes, CoT, warrants further investigation. To this end, we present a new paradigm for automatic MAS design that pivots the focus to optimizing CoT reasoning. We introduce the Agentic Reasoning Module (ARM), an agentic generalization of CoT where each granular reasoning step is executed by a specialized reasoning module. This module is discovered through a tree search over the code space, starting from a simple CoT module and evolved using mutations informed by reflection on execution traces. The resulting ARM acts as a versatile reasoning building block which can be utilized as a direct recursive loop or as a subroutine in a learned meta-orchestrator. Our approach significantly outperforms both manually designed MASes and state-of-the-art automatic MAS design methods. Crucially, MASes built with ARM exhibit superb generalization, maintaining high performance across different foundation models and task domains without further optimization.

[118] Early Multimodal Prediction of Cross-Lingual Meme Virality on Reddit: A Time-Window Analysis

Sedat Dogan,Nina Dethlefs,Debarati Chakraborty

Main category: cs.AI

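TL;DR: This study proposes a multimodal approach for early prediction of cross-lingual meme virality using a large-scale Reddit dataset, and its dynamic feature analysis shows feature importance shifting from static content to temporal dynamics as a meme spreads.

Details Motivation: Predicting the virality of online content remains challenging, especially for culturally complex, fast-evolving memes. The study explores whether early signals and multimodal features make such prediction feasible.

Contribution: 1) Proposes a virality definition based on a hybrid engagement score; 2) combines time series data with static content and network features for early prediction; 3) contributes a novel cross-lingual dataset and a robust methodological framework.

Method: 1) Uses Logistic Regression, XGBoost, and MLP models; 2) evaluates a multimodal feature set over increasing time windows (30-420 minutes); 3) uses a percentile-based threshold to prevent data leakage.

Result: XGBoost reaches a PR-AUC > 0.52 within just 30 minutes, and the analysis reveals a clear shift in feature importance from static content to temporal dynamics.

Insight: Early signals of meme virality emerge quickly, and changes in feature importance reveal key turning points in the spreading process.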

Abstract: Predicting the virality of online content remains challenging, especially for culturally complex, fast-evolving memes. This study investigates the feasibility of early prediction of meme virality using a large-scale, cross-lingual dataset from 25 diverse Reddit communities. We propose a robust, data-driven method to define virality based on a hybrid engagement score, learning a percentile-based threshold from a chronologically held-out training set to prevent data leakage. We evaluated a suite of models, including Logistic Regression, XGBoost, and a Multi-layer Perceptron (MLP), with a comprehensive, multimodal feature set across increasing time windows (30-420 min). Crucially, useful signals emerge quickly: our best-performing model, XGBoost, achieves a PR-AUC > 0.52 in just 30 minutes. Our analysis reveals a clear “evidentiary transition,” in which the importance of the feature dynamically shifts from the static context to the temporal dynamics as a meme gains traction. This work establishes a robust, interpretable, and practical benchmark for early virality prediction in scenarios where full diffusion cascade data is unavailable, contributing a novel cross-lingual dataset and a methodologically sound definition of virality. To our knowledge, this study is the first to combine time series data with static content and network features to predict early meme virality.
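The leakage-safe labeling step can be sketched as follows: sort posts by time, learn the percentile threshold from the earlier (training) portion only, then apply it to all posts. The field names `timestamp` and `score`, the 80% split, the 75th percentile, and the nearest-rank percentile rule are all assumptions for illustration; the abstract does not state the values the authors use, and the hybrid engagement score is taken as already computed.

```python
import math

def percentile_threshold(scores, pct):
    """Nearest-rank percentile of the given scores."""
    ordered = sorted(scores)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def label_viral(posts, train_fraction=0.8, pct=75):
    """Learn the threshold on the chronologically earlier posts only."""
    ordered = sorted(posts, key=lambda p: p["timestamp"])
    cut = int(len(ordered) * train_fraction)
    threshold = percentile_threshold([p["score"] for p in ordered[:cut]], pct)
    labels = [p["score"] >= threshold for p in ordered]
    return threshold, labels

# Made-up engagement scores in time order; the late outlier (100) sits in the
# held-out portion and therefore cannot influence the threshold.
scores_in_time_order = [5, 1, 3, 2, 4, 8, 6, 7, 100, 0]
posts = [{"timestamp": t, "score": s} for t, s in enumerate(scores_in_time_order)]
threshold, labels = label_viral(posts)
```

Computing the percentile over all posts instead would let future engagement shape the labels of earlier posts, which is the leakage the chronological split avoids.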

[119] MatheMagic: Generating Dynamic Mathematics Benchmarks Robust to Memorization

Dayyán O’Brien,Barry Haddow,Emily Allaway,Pinzhen Chen

Main category: cs.AI

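TL;DR: This paper proposes MatheMagic, a method for generating dynamic mathematics benchmarks. It creates randomized test instances by altering the interpretation of numbers and operators, which both avoids memorization contamination and evaluates a model's reasoning ability.

Details Motivation: Existing math benchmarks are prone to overfitting because of their limited diversity of symbols and rules and their closed-ended answers, and models may memorize public test sets. A dynamic, memorization-resistant evaluation is therefore needed to measure true reasoning.

Contribution: Proposes MatheMagic, which generates dynamic math test instances with altered interpretations of numbers and operators while keeping answers automatically verifiable. The evaluation offers stability, extensibility, comparability, and robustness to overfitting.

Method: Test instances are generated at test time from random seeds; the semantics of numbers and operators are modified to produce a dynamic, counterfactual test set that evaluates a model's induction or deduction ability.

Result: Models perform better on deduction than on induction but still revert to standard mathematical operations. Math-adapted models fail to show a general reasoning skill, and fine-tuning on induction tasks generalizes poorly.

Insight: Dynamically generated benchmarks effectively avoid memorization contamination and reveal true reasoning ability. Models are weak at induction, a weakness such evaluation brings to light.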

Abstract: Conducting contamination-free evaluation of mathematical capabilities can be difficult for two reasons: models may memorize a test set once it is made public, and current mathematical benchmarks are prone to overfitting due to having limited diversity of symbols and rules, coupled with closed-ended answers. This paper proposes a method to leverage these shortcomings as useful features to construct a dynamic, counterfactual benchmark, which can be used to both reveal overfitting and measure true reasoning. We demonstrate this via MatheMagic, which generates math test instances with the interpretations of numbers and operators altered, yet has automatically verifiable answers. Test instances are randomly seeded and constructed at test time to evaluate a model’s induction or deduction capability, offering stability, extensibility, comparability, and robustness to overfitting. Our experiments find that models solve deduction more easily than induction, but they revert to standard math. Further analysis reveals that math-adapted models fail to exhibit a general “skill” of reasoning, and fine-tuning on induction tasks generalizes poorly.
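A seeded, counterfactual test instance of the kind the abstract describes might be generated as in the sketch below. It only remaps operator symbols; the function names, the three-operator set, and the instance format are assumptions and not MatheMagic's actual generator, which also alters the interpretation of numbers.

```python
import operator
import random

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def make_instance(seed):
    """Build one test instance whose operator symbols carry shuffled meanings."""
    rng = random.Random(seed)          # same seed -> same instance
    symbols = list(OPS)
    meanings = symbols[:]
    rng.shuffle(meanings)
    rules = dict(zip(symbols, meanings))   # symbol -> standard operator it now denotes
    a, b = rng.randint(1, 9), rng.randint(1, 9)
    symbol = rng.choice(symbols)
    answer = OPS[rules[symbol]](a, b)      # ground truth under the altered rules
    return {"rules": rules, "question": f"{a} {symbol} {b}", "answer": answer}

def is_correct(instance, response):
    """Closed-ended answers make verification a plain equality check."""
    return response == instance["answer"]

instance = make_instance(seed=7)
```

In a deduction-style prompt the `rules` would be shown to the model; in an induction-style prompt the model would instead have to infer them from worked examples.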

[120] MixReasoning: Switching Modes to Think

Haiquan Lu,Gongfan Fang,Xinyin Ma,Qi Li,Xinchao Wang

Main category: cs.AI

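TL;DR: MixReasoning is a framework that dynamically adjusts reasoning depth, adaptively choosing detailed reasoning or concise inference according to difficulty, improving efficiency without reducing accuracy.

Details Motivation: Conventional reasoning models apply the same detailed reasoning to every step, ignoring differences in sub-problem difficulty and causing redundant computation.

Contribution: Proposes the MixReasoning framework, which dynamically adjusts reasoning depth and mixes detailed reasoning with concise inference, substantially improving efficiency.

Method: Identifies pivotal (hard) and non-pivotal (easy) steps and dynamically selects detailed reasoning or concise inference for each, producing a mixed chain of thought.

Result: On GSM8K, MATH-500, and AIME, MixReasoning shortens reasoning length and improves efficiency while maintaining accuracy.

Insight: Dynamically adjusting reasoning depth is an effective way to improve reasoning efficiency and could be extended to more reasoning tasks.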

Abstract: Reasoning models enhance performance by tackling problems in a step-by-step manner, decomposing them into sub-problems and exploring long chains of thought before producing an answer. However, applying extended reasoning to every step introduces substantial redundancy, as sub-problems vary widely in difficulty and complexity: a small number of pivotal steps are genuinely challenging and decisive for the final answer, while many others only involve straightforward revisions or simple computations. Therefore, a natural idea is to endow reasoning models with the ability to adaptively respond to this variation, rather than treating all steps with the same level of elaboration. To this end, we propose MixReasoning, a framework that dynamically adjusts the depth of reasoning within a single response. The resulting chain of thought then becomes a mixture of detailed reasoning on difficult steps and concise inference on simpler ones. Experiments on GSM8K, MATH-500, and AIME show that MixReasoning shortens reasoning length and substantially improves efficiency without compromising accuracy.

[121] TaTToo: Tool-Grounded Thinking PRM for Test-Time Scaling in Tabular Reasoning

Jiaru Zou,Soumya Roy,Vinay Kumar Verma,Ziyi Wang,David Wipf,Pan Lu,Sumit Negi,James Zou,Jingrui He

Main category: cs.AI

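TL;DR: TaTToo is a tool-grounded PRM framework for tabular reasoning. Through explicit reasoning and tool-based verification, it addresses the bottleneck of existing PRMs on table operations and substantially improves reasoning model performance.

Details Motivation: Existing PRMs perform well on text-only reasoning but struggle in tabular reasoning (e.g., sub-table retrieval and schema interaction), creating performance bottlenecks. TaTToo proposes a new table-grounded PRM framework to address this.

Contribution: Proposes the TaTToo framework, which reasons explicitly over table operations and integrates tool-based verification; designs a data curation pipeline and a dual-stage training paradigm (supervised fine-tuning plus reinforcement learning); and significantly outperforms baselines on multiple tabular reasoning tasks.

Method: 1. Designs a scalable data curation pipeline that produces over 60k high-quality step-level annotations; 2. trains in two stages: cold-start supervised fine-tuning followed by reinforcement learning, with tool-grounded reward shaping to align the model; 3. reasons explicitly over table operations.

Result: Across 5 tabular reasoning benchmarks, TaTToo improves downstream policy LRMs by 30.9% and, with only 8B parameters, surpasses baselines such as Qwen-2.5-Math-PRM-72B, showing strong generalizability.

Insight: Tool-based verification matters in tabular reasoning; dual-stage training (supervised learning plus reinforcement learning) is effective; and annotation quality contributes to model performance.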

Abstract: Process Reward Models (PRMs) have recently emerged as a powerful framework for enhancing the reasoning capabilities of large reasoning models (LRMs), particularly in the context of test-time scaling (TTS). However, their potential for supervising LRMs on tabular reasoning domains remains underexplored. Through detailed empirical analyses, we identify that existing PRMs, though widely adopted for supervising text-only reasoning steps, struggle with table-specific operations such as sub-table retrieval and schema interaction, leading to critical performance bottlenecks. To address this limitation, we propose TaTToo, a novel table-grounded PRM framework that (i) reasons explicitly over tabular reasoning steps and (ii) integrates tool-based verification to provide precise reward supervision. Concretely, we first design a scalable data curation pipeline that constructs over 60k high-quality step-level annotations by integrating table verification rationales with tool-based executions. Building on the collected data, we train TaTToo with a dual-stage paradigm: cold-start supervised fine-tuning to capture tool-use reasoning patterns, followed by reinforcement learning with tool-grounded reward shaping to align our model with table-based verification. We provide a comprehensive evaluation of the policy improvement induced by our newly designed PRM. Across 5 challenging tabular reasoning benchmarks covering numerical reasoning, fact-checking, and data analysis, TaTToo improves downstream policy LRMs by 30.9% at inference, surpasses strong PRM baselines such as Qwen-2.5-Math-PRM-72B with only 8B parameters, and demonstrates strong generalizability across diverse TTS strategies.

cs.RO [Back]

[122] DeLTa: Demonstration and Language-Guided Novel Transparent Object Manipulation

Taeyeop Lee,Gyuree Kang,Bowen Wen,Youngho Kim,Seunghyeok Back,In So Kweon,David Hyunchul Shim,Kuk-Jin Yoon

Main category: cs.RO

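TL;DR: DeLTa proposes a new framework that combines depth estimation, 6D pose estimation, and vision-language planning for precise long-horizon manipulation of transparent objects, generalizing to novel transparent objects from a single demonstration.

Details Motivation: Transparent object manipulation research remains limited to short-horizon tasks and basic grasping, lacking generalization to novel objects and precise long-horizon manipulation.

Contribution: 1) Proposes the DeLTa framework, combining depth and 6D pose estimation with vision-language planning; 2) a single demonstration supports long-horizon manipulation of novel transparent objects; 3) a task planner refines VLM-generated plans to respect single-arm robot constraints.

Method: Integrates depth estimation, 6D pose estimation, and vision-language planning; generalizes trajectories from a single demonstration; and refines manipulation plans with a task planner.

Result: DeLTa significantly outperforms existing methods on long-horizon transparent object manipulation, particularly where precise manipulation is required.

Insight: Combining vision, language, and demonstration enables efficient long-horizon manipulation of transparent objects, and the single-demonstration approach avoids category-level priors and additional training.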

Abstract: Despite the prevalence of transparent object interactions in human everyday life, transparent robotic manipulation research remains limited to short-horizon tasks and basic grasping capabilities. Although some methods have partially addressed these issues, most of them have limitations in generalizability to novel objects and are insufficient for precise long-horizon robot manipulation. To address this limitation, we propose DeLTa (Demonstration and Language-Guided Novel Transparent Object Manipulation), a novel framework that integrates depth estimation, 6D pose estimation, and vision-language planning for precise long-horizon manipulation of transparent objects guided by natural task instructions. A key advantage of our method is its single-demonstration approach, which generalizes 6D trajectories to novel transparent objects without requiring category-level priors or additional training. Additionally, we present a task planner that refines the VLM-generated plan to account for the constraints of a single-arm, eye-in-hand robot for long-horizon object manipulation tasks. Through comprehensive evaluation, we demonstrate that our method significantly outperforms existing transparent object manipulation approaches, particularly in long-horizon scenarios requiring precise manipulation capabilities. Project page: https://sites.google.com/view/DeLTa25/

cs.CR [Back]

[123] Towards Reliable and Practical LLM Security Evaluations via Bayesian Modelling

Mary Llewellyn,Annie Gray,Josh Collyer,Michael Harries

Main category: cs.CR

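TL;DR: This paper proposes an end-to-end framework based on Bayesian modelling for evaluating the vulnerability of large language models (LLMs) to prompt injection attacks, addressing shortcomings of existing evaluations in uncertainty quantification, model comparison, and experimental design.

Details Motivation: Existing evaluations are unreliable, struggle to quantify uncertainty, and compare models unfairly; a more rigorous way to evaluate LLM security is needed.

Contribution: 1. Proposes practical approaches to experimental design covering two scenarios, training and deployment; 2. develops a Bayesian hierarchical model with embedding-space clustering that improves uncertainty quantification; 3. demonstrates the pipeline by evaluating the security of Transformer versus Mamba architectures.

Method: Proposes a Bayesian hierarchical model with embedding-space clustering that accounts for non-deterministic LLM outputs and imperfectly designed test prompts, together with an experimental framework for the two practitioner scenarios.

Result: Accounting for output variability can make conclusions less definitive, but for some attacks the vulnerability of Transformer and Mamba variants increases notably.

Insight: Bayesian methods better capture the uncertainty in LLM outputs, and embedding-space clustering helps make the experimental analysis more reliable.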

Abstract: Before adopting a new large language model (LLM) architecture, it is critical to understand vulnerabilities accurately. Existing evaluations can be difficult to trust, often drawing conclusions from LLMs that are not meaningfully comparable, relying on heuristic inputs or employing metrics that fail to capture the inherent uncertainty. In this paper, we propose a principled and practical end-to-end framework for evaluating LLM vulnerabilities to prompt injection attacks. First, we propose practical approaches to experimental design, tackling unfair LLM comparisons by considering two practitioner scenarios: when training an LLM and when deploying a pre-trained LLM. Second, we address the analysis of experiments and propose a Bayesian hierarchical model with embedding-space clustering. This model is designed to improve uncertainty quantification in the common scenario that LLM outputs are not deterministic, test prompts are designed imperfectly, and practitioners only have a limited amount of compute to evaluate vulnerabilities. We show the improved inferential capabilities of the model in several prompt injection attack settings. Finally, we demonstrate the pipeline to evaluate the security of Transformer versus Mamba architectures. Our findings show that consideration of output variability can suggest less definitive findings. However, for some attacks, we find notably increased Transformer and Mamba-variant vulnerabilities across LLMs with the same training data or mathematical ability.

eess.IV [Back]

[124] nnSAM2: nnUNet-Enhanced One-Prompt SAM2 for Few-shot Multi-Modality Segmentation and Composition Analysis of Lumbar Paraspinal Muscles

Zhongyi Zhang,Julie A. Hides,Enrico De Martino,Abdul Joseph Fofanah,Gervase Tuxworth

Main category: eess.IV

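TL;DR: The paper proposes nnSAM2, a few-shot segmentation method for multi-modality segmentation and composition analysis of lumbar paraspinal muscles, which needs only a single annotated slice per dataset to produce measurements comparable to expert ones.

Details Motivation: Conventional segmentation methods require large amounts of annotated data. nnSAM2 aims to achieve high-quality few-shot segmentation with only a single annotated slice per dataset, reducing the annotation burden.

Contribution: nnSAM2 combines SAM2's few-shot prompting with nnU-Net's refinement, yielding an efficient segmentation framework validated on multi-modality MRI and CT data.

Method: Uses single-slice SAM2 prompts to generate pseudo-labels, which are refined by three sequential, independent nnU-Net models. Performance is assessed with the Dice similarity coefficient and statistical tests (TOST, ICC).

Result: nnSAM2 achieves DSCs of 0.94-0.96 on MR images and 0.92-0.93 on CT, and its automated measurements are statistically equivalent to expert annotations (TOST P < 0.05, ICC 0.86-1.00).

Insight: nnSAM2 demonstrates efficient, robust few-shot segmentation that works across multi-modality, multicenter data, offering a low-annotation-cost solution for medical image segmentation.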

Abstract: Purpose: To develop and validate No-New SAM2 (nnsam2) for few-shot segmentation of lumbar paraspinal muscles using only a single annotated slice per dataset, and to assess its statistical comparability with expert measurements across multi-sequence MRI and multi-protocol CT. Methods: We retrospectively analyzed 1,219 scans (19,439 slices) from 762 participants across six datasets. Six slices (one per dataset) served as labeled examples, while the remaining 19,433 slices were used for testing. In this minimal-supervision setting, nnsam2 used single-slice SAM2 prompts to generate pseudo-labels, which were pooled across datasets and refined through three sequential, independent nnU-Net models. Segmentation performance was evaluated using the Dice similarity coefficient (DSC), and automated measurements (including muscle volume, fat ratio, and CT attenuation) were assessed with two one-sided tests (TOST) and intraclass correlation coefficients (ICC). Results: nnsam2 outperformed vanilla SAM2, its medical variants, TotalSegmentator, and the leading few-shot method, achieving DSCs of 0.94-0.96 on MR images and 0.92-0.93 on CT. Automated and expert measurements were statistically equivalent for muscle volume (MRI/CT), CT attenuation, and Dixon fat ratio (TOST, P < 0.05), with consistently high ICCs (0.86-1.00). Conclusion: We developed nnsam2, a state-of-the-art few-shot framework for multi-modality LPM segmentation, producing muscle volume (MRI/CT), attenuation (CT), and fat ratio (Dixon MRI) measurements that were statistically comparable to expert references. Validated across multimodal, multicenter, and multinational cohorts, and released with open code and data, nnsam2 demonstrated high annotation efficiency, robust generalizability, and reproducibility.
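For reference, the Dice similarity coefficient used to score segmentations is DSC = 2|A ∩ B| / (|A| + |B|) for a predicted mask A and a reference mask B. The sketch below represents each mask as a set of pixel coordinates; that representation, the function name, and the convention of returning 1.0 when both masks are empty are assumptions made for the example.

```python
def dice_coefficient(predicted, reference):
    """Dice similarity coefficient between two masks given as sets of pixel coordinates."""
    predicted, reference = set(predicted), set(reference)
    if not predicted and not reference:
        return 1.0  # convention chosen here: two empty masks agree perfectly
    overlap = len(predicted & reference)
    return 2 * overlap / (len(predicted) + len(reference))

# Tiny made-up masks: two of three pixels overlap.
predicted_mask = {(0, 0), (0, 1), (1, 1)}
reference_mask = {(0, 1), (1, 1), (2, 2)}
dsc = dice_coefficient(predicted_mask, reference_mask)
```

A DSC of 1.0 means the masks coincide exactly and 0.0 means they share no pixels.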

eess.SP [Back]

[125] Leveraging Vision Transformers for Enhanced Classification of Emotions using ECG Signals

Pubudu L. Indrasiri,Bipasha Kashyap,Pubudu N. Pathirana

Main category: eess.SP

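TL;DR: This paper proposes an improved Vision Transformer (ViT)-based method for identifying emotional states from ECG signals, integrating CNN and SE blocks to boost performance, and reports strong results on the YAAD and DREAMER datasets.

Details Motivation: Conventional methods are limited in identifying emotional states from ECG signals, and the success of Transformer architectures in image classification motivates exploring them for ECG analysis.

Contribution: 1. Proposes an improved method combining ViT with CNN and SE blocks; 2. designs a pipeline for signal preprocessing and signal-to-image conversion; 3. validates the method on two public datasets.

Method: 1. Converts ECG signals into images using continuous wavelet transform and power spectral density analysis; 2. proposes a ViT architecture enhanced with CNN and SE blocks for emotion classification.

Result: On YAAD, the method surpasses existing techniques in classifying seven emotional states as well as valence and arousal; it also performs strongly on DREAMER.

Insight: Combining Transformer architectures with CNNs can improve emotion recognition from ECG signals, and signal preprocessing and image conversion are key steps.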

Abstract: Biomedical signals provide insights into various conditions affecting the human body. Beyond diagnostic capabilities, these signals offer a deeper understanding of how specific organs respond to an individual’s emotions and feelings. For instance, ECG data can reveal changes in heart rate variability linked to emotional arousal, stress levels, and autonomic nervous system activity. This data offers a window into the physiological basis of our emotional states. Recent advancements in the field diverge from conventional approaches by leveraging the power of advanced transformer architectures, which surpass traditional machine learning and deep learning methods. We begin by assessing the effectiveness of the Vision Transformer (ViT), a forefront model in image classification, for identifying emotions in imaged ECGs. Following this, we present and evaluate an improved version of ViT, integrating both CNN and SE blocks, aiming to bolster performance on imaged ECGs associated with emotion detection. Our method unfolds in two critical phases: first, we apply advanced preprocessing techniques for signal purification and converting signals into interpretable images using continuous wavelet transform and power spectral density analysis; second, we unveil a performance-boosted vision transformer architecture, cleverly enhanced with convolutional neural network components, to adeptly tackle the challenges of emotion recognition. Our methodology’s robustness and innovation were thoroughly tested using ECG data from the YAAD and DREAMER datasets, leading to remarkable outcomes. For the YAAD dataset, our approach outperformed existing state-of-the-art methods in classifying seven unique emotional states, as well as in valence and arousal classification. Similarly, in the DREAMER dataset, our method excelled in distinguishing between valence, arousal and dominance, surpassing current leading techniques.

eess.AS [Back]

[126] WaveSP-Net: Learnable Wavelet-Domain Sparse Prompt Tuning for Speech Deepfake Detection

Xi Xuan,Xuechen Liu,Wenxin Zhang,Yi-Cheng Lin,Xiaojian Lin,Tomi Kinnunen

Main category: eess.AS

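TL;DR: The paper proposes WaveSP-Net, a parameter-efficient speech deepfake detector that combines a wavelet-domain prompt-tuned front-end with a Mamba back-end, improving detection performance without altering the pre-trained model's parameters.

Details Motivation: Current front-end design for speech deepfake detection relies on full fine-tuning of large pre-trained models such as XLSR, which is not parameter-efficient and may generalize poorly to real-world data.

Contribution: 1) Proposes a family of parameter-efficient front-ends based on the Fourier and Wavelet Transforms; 2) designs WaveSP-Net, combining a Partial-WSPT-XLSR front-end with a bidirectional Mamba back-end; 3) outperforms existing methods on the Deepfake-Eval-2024 and SpoofCeleb benchmarks.

Method: 1) Performs sparse prompt tuning in the wavelet domain; 2) keeps XLSR frozen and injects multi-resolution features into the prompt embeddings to better localize synthetic artifacts; 3) uses a bidirectional Mamba back-end to model temporal information.

Result: WaveSP-Net outperforms existing methods on Deepfake-Eval-2024 and SpoofCeleb with few trainable parameters.

Insight: Combining classical signal processing with modern deep learning (e.g., wavelet transforms plus Mamba) can improve both the performance and the efficiency of speech deepfake detection.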

Abstract: Modern front-end design for speech deepfake detection relies on full fine-tuning of large pre-trained models like XLSR. However, this approach is not parameter-efficient and may lead to suboptimal generalization to realistic, in-the-wild data types. To address these limitations, we introduce a new family of parameter-efficient front-ends that fuse prompt-tuning with classical signal processing transforms. These include FourierPT-XLSR, which uses the Fourier Transform, and two variants based on the Wavelet Transform: WSPT-XLSR and Partial-WSPT-XLSR. We further propose WaveSP-Net, a novel architecture combining a Partial-WSPT-XLSR front-end and a bidirectional Mamba-based back-end. This design injects multi-resolution features into the prompt embeddings, which enhances the localization of subtle synthetic artifacts without altering the frozen XLSR parameters. Experimental results demonstrate that WaveSP-Net outperforms several state-of-the-art models on two new and challenging benchmarks, Deepfake-Eval-2024 and SpoofCeleb, with low trainable parameters and notable performance gains. The code and models are available at https://github.com/xxuan-acoustics/WaveSP-Net.

[127] TokenChain: A Discrete Speech Chain via Semantic Token Modeling

Mingxuan Wang,Satoshi Nakamura

Main category: eess.AS

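TL;DR: TokenChain is a fully discrete speech chain that jointly improves ASR and TTS through semantic token modeling. Results show substantial gains on LibriSpeech and TED-LIUM.

Details Motivation: The machine speech chain, which simulates the human perception-production loop, is effective at jointly improving automatic speech recognition (ASR) and text-to-speech (TTS). This paper explores how speech chain learning fares with a discrete semantic-token interface.

Contribution: Proposes TokenChain, a fully discrete speech chain that couples ASR and TTS through semantic token modeling, supports end-to-end feedback, and shows notable gains in experiments.

Method: TokenChain couples semantic-token ASR with a two-stage TTS: an autoregressive text-to-semantic model co-trained with ASR, and a masked-generative semantic-to-acoustic model used for synthesis only. End-to-end feedback uses straight-through argmax/Gumbel-Softmax and is balanced against supervised ASR with dynamic weight averaging.

Result: On LibriSpeech, TokenChain surpasses baseline accuracy 2-6 epochs earlier and yields 5-13% lower equal-epoch error; on TED-LIUM, it reduces relative ASR WER by 56% and T2S WER by 31% with minimal forgetting.

Insight: Chain learning remains effective with token interfaces and models; discrete semantic tokens can serve as a useful tool for jointly optimizing ASR and TTS.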

Abstract: Machine Speech Chain, simulating the human perception-production loop, proves effective in jointly improving ASR and TTS. We propose TokenChain, a fully discrete speech chain coupling semantic-token ASR with a two-stage TTS: an autoregressive text-to-semantic model co-trained with ASR and a masked-generative semantic-to-acoustic model for synthesis only. End-to-end feedback across the text interface is enabled with straight-through argmax/Gumbel-Softmax and balanced with supervised ASR via dynamic weight averaging. Ablations examine optimal temperature schedules for in- and cross-domain transfer. Evaluation reveals TokenChain surpasses baseline accuracy 2-6 epochs earlier and yields 5-13% lower equal-epoch error with stable T2S on LibriSpeech, and reduces relative ASR WER by 56% and T2S WER by 31% on TED-LIUM with minimal forgetting, showing that chain learning remains effective with token interfaces and models.
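The abstract balances chain feedback against supervised ASR with dynamic weight averaging. The sketch below follows the commonly used formulation of that technique, in which each loss is weighted by a softmax over its recent rate of change; whether TokenChain uses exactly this form, the temperature of 2.0, and the sample loss values are assumptions.

```python
import math

def dwa_weights(prev_losses, prev_prev_losses, temperature=2.0):
    """Dynamic weight averaging: weight each loss by how slowly it has been decreasing.

    ratio_k = L_k(t-1) / L_k(t-2); weights are K * softmax(ratio / temperature),
    so they always sum to the number of losses K.
    """
    ratios = [a / b for a, b in zip(prev_losses, prev_prev_losses)]
    exps = [math.exp(r / temperature) for r in ratios]
    total = sum(exps)
    return [len(ratios) * e / total for e in exps]

# Made-up losses: the first (e.g. supervised ASR) is still falling,
# the second (e.g. chain feedback) has stalled and so gets more weight.
weights = dwa_weights(prev_losses=[0.8, 0.5], prev_prev_losses=[1.0, 0.5])
```

The weighted training objective for the next epoch would then be the sum of each loss multiplied by its weight.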

stat.ML [Back]

[128] Domain-Shift-Aware Conformal Prediction for Large Language Models

Zhexiao Lin,Yuanyuan Li,Neeraj Sarna,Yuanyuan Gao,Michael von Gablenz

Main category: stat.ML

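TL;DR: The paper proposes Domain-Shift-Aware Conformal Prediction (DS-CP), a framework that addresses the failure of conformal prediction for large language models under domain shift by reweighting calibration samples to improve coverage and reliability.

Details Motivation: Large language models perform well but tend to produce overconfident, factually incorrect outputs (hallucinations) in real applications. Standard conformal prediction breaks down under domain shift, leading to under-coverage and unreliable prediction sets.

Contribution: Proposes the DS-CP framework, which systematically reweights calibration samples by their proximity to the test prompt, adapting conformal prediction to domain shift while preserving validity and efficiency.

Method: DS-CP reweights calibration samples to improve the coverage of standard conformal prediction, particularly when the domain shift is substantial. Theoretical analysis and experiments on the MMLU benchmark support its effectiveness.

Result: Experiments show that DS-CP delivers more reliable coverage than standard conformal prediction under domain shift while maintaining efficiency, offering a practical approach to uncertainty quantification for deployed large language models.

Insight: Domain shift cannot be ignored in LLM applications. Weighting calibration samples offers a way to keep coverage valid while remaining adaptive.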

Abstract: Large language models have achieved impressive performance across diverse tasks. However, their tendency to produce overconfident and factually incorrect outputs, known as hallucinations, poses risks in real world applications. Conformal prediction provides finite-sample, distribution-free coverage guarantees, but standard conformal prediction breaks down under domain shift, often leading to under-coverage and unreliable prediction sets. We propose a new framework called Domain-Shift-Aware Conformal Prediction (DS-CP). Our framework adapts conformal prediction to large language models under domain shift, by systematically reweighting calibration samples based on their proximity to the test prompt, thereby preserving validity while enhancing adaptivity. Our theoretical analysis and experiments on the MMLU benchmark demonstrate that the proposed method delivers more reliable coverage than standard conformal prediction, especially under substantial distribution shifts, while maintaining efficiency. This provides a practical step toward trustworthy uncertainty quantification for large language models in real-world deployment.
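The reweighting idea in this abstract can be illustrated with a generic weighted conformal quantile. This is a minimal sketch, not the DS-CP procedure itself: the abstract does not specify the weighting function, so the Gaussian kernel over embeddings, the `bandwidth` parameter, and both function names are assumptions made for illustration.

```python
import math


def proximity_weights(cal_embs, test_emb, bandwidth=1.0):
    """Hypothetical proximity weights: Gaussian kernel on squared distance.

    The kernel choice is an assumption; the abstract only says calibration
    samples are reweighted by their proximity to the test prompt.
    """
    weights = []
    for emb in cal_embs:
        d2 = sum((a - b) ** 2 for a, b in zip(emb, test_emb))
        weights.append(math.exp(-d2 / (2.0 * bandwidth ** 2)))
    return weights


def weighted_conformal_threshold(scores, weights, test_weight, alpha):
    """Weighted (1 - alpha) quantile of calibration nonconformity scores.

    The test point carries mass `test_weight` at +infinity, as in standard
    weighted conformal prediction. Returns math.inf when the calibration
    mass alone cannot reach the requested level.
    """
    total = sum(weights) + test_weight
    cumulative = 0.0
    for score, weight in sorted(zip(scores, weights)):
        cumulative += weight / total
        if cumulative >= 1.0 - alpha:
            return score
    return math.inf
```

A prediction set would then contain every candidate answer whose nonconformity score is at most the returned threshold; calibration samples far from the test prompt contribute little mass, so the threshold tracks the nearby samples.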

cs.LG [Back]

[129] NEO: No-Optimization Test-Time Adaptation through Latent Re-Centering

Alexander Murphy,Michal Danilowski,Soumyajit Chatterjee,Abhirup Ghosh

Main category: cs.LG

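TL;DR: NEO is an optimization-free test-time adaptation method that re-centers latent embeddings. It improves classification accuracy at roughly the cost of vanilla inference and performs well on resource-constrained devices.

Details Motivation: Current test-time adaptation (TTA) methods are often computationally expensive, need large amounts of data, or are brittle to hyperparameters. The paper aims for an optimization-free, compute-efficient TTA method.

Contribution: Proposes NEO, which improves the alignment between source and distribution-shifted data by re-centering in latent space, with no extra optimization or hyperparameter tuning and a compute cost comparable to vanilla inference.

Method: Building on a theoretical analysis of latent-space geometry, NEO re-centers target data embeddings at the origin to improve classification. It requires no training and no updates to model parameters.

Result: NEO raises the accuracy of ViT-Base on ImageNet-C from 55.6% to 59.2% after adapting on a single batch of 64 samples. With 512 samples it beats all 7 compared TTA methods on ImageNet-C, ImageNet-R, and ImageNet-S, and on Raspberry Pi and Jetson Orin Nano devices it reduces inference time by 63% and memory usage by 9% relative to baselines.

Insight: Aligning data distributions in latent space is key to TTA performance. A simple geometric operation such as re-centering can noticeably improve robustness under distribution shift.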

Abstract: Test-Time Adaptation (TTA) methods are often computationally expensive, require a large amount of data for effective adaptation, or are brittle to hyperparameters. Based on a theoretical foundation of the geometry of the latent space, we are able to significantly improve the alignment between source and distribution-shifted samples by re-centering target data embeddings at the origin. This insight motivates NEO – a hyperparameter-free fully TTA method, that adds no significant compute compared to vanilla inference. NEO is able to improve the classification accuracy of ViT-Base on ImageNet-C from 55.6% to 59.2% after adapting on just one batch of 64 samples. When adapting on 512 samples NEO beats all 7 TTA methods we compare against on ImageNet-C, ImageNet-R and ImageNet-S and beats 6/7 on CIFAR-10-C, while using the least amount of compute. NEO performs well on model calibration metrics and additionally is able to adapt from 1 class to improve accuracy on 999 other classes in ImageNet-C. On Raspberry Pi and Jetson Orin Nano devices, NEO reduces inference time by 63% and memory usage by 9% compared to baselines. Our results based on 3 ViT architectures and 4 datasets show that NEO can be used efficiently and effectively for TTA.
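The re-centering step described above can be sketched in a few lines. This is an illustrative reading only: it assumes the center is estimated as the mean of one batch of target embeddings, which the abstract does not state, and the function name `recenter` is hypothetical.

```python
def recenter(embeddings):
    """Shift a batch of embeddings so that their mean sits at the origin.

    `embeddings` is a list of equal-length lists of floats. Estimating the
    center as the batch mean is an assumption made for this sketch.
    """
    n = len(embeddings)
    dim = len(embeddings[0])
    mean = [sum(e[j] for e in embeddings) / n for j in range(dim)]
    return [[e[j] - mean[j] for j in range(dim)] for e in embeddings]


# The re-centered embeddings would then be passed to the unchanged classifier head.
batch = [[1.0, 2.0], [3.0, 6.0]]
centered = recenter(batch)
```

No gradients or parameter updates are involved, which is consistent with the claim that the method adds no significant compute over vanilla inference.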

[130] Neighborhood-Adaptive Generalized Linear Graph Embedding with Latent Pattern Mining

S. Peng,L. Hu,W. Zhang,B. Jie,Y. Luo

Main category: cs.LG

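TL;DR: Proposes NGLGE, a linear graph embedding model with an adaptive neighborhood graph that combines latent pattern mining with low-rank representation, addressing the fixed neighborhood size and single-pattern mining of traditional methods.

Details Motivation: Traditional graph embedding methods usually require a predefined neighborhood size, which limits how well latent structural correlations in the data are revealed. Linear-projection methods also rely on a single pattern-mining approach and adapt poorly to varied scenarios.

Contribution: 1. A graph learning method with an adaptive neighborhood graph; 2. A reconstructed low-rank representation and an $\ell_{2,0}$-norm constraint for flexible mining of multi-pattern information; 3. An efficient iterative solving algorithm.

Method: Adaptive neighborhood graph learning grounded in latent pattern mining, combined with a reconstructed low-rank representation and an $\ell_{2,0}$-norm constraint on the projection matrix to explore additional pattern information.

Result: Evaluations on datasets from diverse scenarios show that the model outperforms state-of-the-art methods.

Insight: Adaptive neighborhood design and multi-pattern mining can improve graph embedding, particularly for data with complex structure.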

Abstract: Graph embedding has been widely applied in areas such as network analysis, social network mining, recommendation systems, and bioinformatics. However, current graph construction methods often require the prior definition of neighborhood size, limiting the effective revelation of potential structural correlations in the data. Additionally, graph embedding methods using linear projection heavily rely on a singular pattern mining approach, resulting in relative weaknesses in adapting to different scenarios. To address these challenges, we propose a novel model, Neighborhood-Adaptive Generalized Linear Graph Embedding (NGLGE), grounded in latent pattern mining. This model introduces an adaptive graph learning method tailored to the neighborhood, effectively revealing intrinsic data correlations. Simultaneously, leveraging a reconstructed low-rank representation and imposing $\ell_{2,0}$ norm constraint on the projection matrix allows for flexible exploration of additional pattern information. Besides, an efficient iterative solving algorithm is derived for the proposed model. Comparative evaluations on datasets from diverse scenarios demonstrate the superior performance of our model compared to state-of-the-art methods.

[131] Improving Clinical Dataset Condensation with Mode Connectivity-based Trajectory Surrogates

Pafue Christy Nganjimi,Andrew Soltan,Danielle Belgrave,Lei Clifton,David A. Clifton,Anshul Thakur

Main category: cs.LG

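TL;DR: The paper proposes mode-connectivity-based trajectory surrogates to improve clinical dataset condensation. Smooth, low-loss parametric surrogates (quadratic Bézier curves) replace noisy, high-curvature SGD trajectories, improving training stability and efficiency.

Details Motivation: Current dataset condensation methods supervise synthetic data by aligning the training dynamics of models trained on real and synthetic data, but SGD trajectories are typically noisy and storage-intensive, which causes unstable training and slow convergence.

Contribution: Proposes a mode-connectivity-based trajectory surrogate that replaces full SGD trajectories with Bézier curves, reducing noise and storage overhead while improving performance.

Method: A quadratic Bézier curve connects the initial and final model states of a real training trajectory and serves as a smooth, low-loss parametric surrogate.

Result: Across five clinical datasets, the method outperforms state-of-the-art condensation approaches, and the condensed datasets support clinically effective model development.

Insight: Smooth surrogate trajectories reduce noise and storage requirements and improve both the quality and the efficiency of dataset condensation.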

Abstract: Dataset condensation (DC) enables the creation of compact, privacy-preserving synthetic datasets that can match the utility of real patient records, supporting democratised access to highly regulated clinical data for developing downstream clinical models. State-of-the-art DC methods supervise synthetic data by aligning the training dynamics of models trained on real and those trained on synthetic data, typically using full stochastic gradient descent (SGD) trajectories as alignment targets; however, these trajectories are often noisy, high-curvature, and storage-intensive, leading to unstable gradients, slow convergence, and substantial memory overhead. We address these limitations by replacing full SGD trajectories with smooth, low-loss parametric surrogates, specifically quadratic Bézier curves that connect the initial and final model states from real training trajectories. These mode-connected paths provide noise-free, low-curvature supervision signals that stabilise gradients, accelerate convergence, and eliminate the need for dense trajectory storage. We theoretically justify Bézier-mode connections as effective surrogates for SGD paths and empirically show that the proposed method outperforms state-of-the-art condensation approaches across five clinical datasets, yielding condensed datasets that enable clinically effective model development.
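A quadratic Bézier curve between two parameter vectors has the standard form B(t) = (1 - t)^2 θ_0 + 2(1 - t)t φ + t^2 θ_1. The sketch below evaluates that formula on flattened parameter lists. How the control point φ is obtained is not stated in the abstract, so `control` is simply an input here, and the function name is hypothetical.

```python
def quadratic_bezier(theta_start, control, theta_end, t):
    """Evaluate a quadratic Bezier curve in parameter space at t in [0, 1].

    theta_start and theta_end stand for the initial and final model states
    (flattened to lists of floats); `control` is the single control point,
    whose fitting procedure is outside the scope of this sketch.
    """
    a = (1.0 - t) ** 2
    b = 2.0 * (1.0 - t) * t
    c = t ** 2
    return [a * s + b * m + c * e
            for s, m, e in zip(theta_start, control, theta_end)]


# Sampling a few points along the surrogate path instead of storing every SGD step.
path = [quadratic_bezier([0.0, 0.0], [1.0, 2.0], [2.0, 0.0], k / 4) for k in range(5)]
```

Only the two endpoints and one control point need to be stored, which illustrates why such a surrogate removes the need for dense trajectory storage.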

[132] Gaussian Embeddings: How JEPAs Secretly Learn Your Data Density

Randall Balestriero,Nicolas Ballas,Mike Rabbat,Yann LeCun

Main category: cs.LG

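TL;DR: The paper shows that the anti-collapse term in JEPAs not only prevents representation collapse but also implicitly estimates the data density, and introduces JEPA-SCORE, a method for efficiently computing sample probabilities from a trained JEPA model.

Details Motivation: JEPAs learn representations that solve downstream tasks, but the real role of their anti-collapse term has not been fully understood. The paper sets out to expose the density-estimation ability behind this term and to explore its use in tasks such as data curation and outlier detection.

Contribution: The main contributions are: (1) a theoretical proof that the JEPA anti-collapse term estimates the data density; (2) JEPA-SCORE, which computes sample probabilities efficiently from the model's Jacobian matrix; (3) validation of the finding across several datasets and models (such as I-JEPA, DINOv2, and MetaCLIP).

Method: The analysis centers on the JEPA anti-collapse term and uses the model's Jacobian matrix to derive a closed-form expression for the sample probability. JEPA-SCORE estimates the data density by computing the determinant of the Jacobian matrix.

Result: Experiments show that JEPA-SCORE estimates the data density effectively on synthetic data, controlled data, and ImageNet. The method also works well on multimodal models such as MetaCLIP.

Insight: The anti-collapse term in JEPAs is more than a guard against representation collapse; it implicitly learns the data distribution. This finding has potential applications in data curation, outlier detection, and density estimation.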

Abstract: Joint Embedding Predictive Architectures (JEPAs) learn representations able to solve numerous downstream tasks out-of-the-box. JEPAs combine two objectives: (i) a latent-space prediction term, i.e., the representation of a slightly perturbed sample must be predictable from the original sample’s representation, and (ii) an anti-collapse term, i.e., not all samples should have the same representation. While (ii) is often considered as an obvious remedy to representation collapse, we uncover that JEPAs’ anti-collapse term does much more: it provably estimates the data density. In short, any successfully trained JEPA can be used to get sample probabilities, e.g., for data curation, outlier detection, or simply for density estimation. Our theoretical finding is agnostic of the dataset and architecture used: in any case one can compute the learned probabilities of sample $x$ efficiently and in closed-form using the model’s Jacobian matrix at $x$. Our findings are empirically validated across datasets (synthetic, controlled, and ImageNet) and across different Self Supervised Learning methods falling under the JEPA family (I-JEPA and DINOv2) and on multimodal models, such as MetaCLIP. We denote the method extracting the JEPA learned density as JEPA-SCORE.

[133] Adversarial Reinforcement Learning for Large Language Model Agent Safety

Zizhao Wang,Dingcheng Li,Vaishakh Keshava,Phillip Wallis,Ananth Balashankar,Peter Stone,Lukas Rutishauser

Main category: cs.LG

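TL;DR: The paper proposes ARLAS, a framework that uses adversarial reinforcement learning to train LLM agents to defend against diverse indirect prompt injections while also improving task completion.

Details Motivation: LLM agents that use tools are vulnerable to indirect prompt injection, and existing defenses rely on manually crafted attack datasets that lack diversity and fail against novel attacks.

Contribution: Proposes the ARLAS framework, which automatically generates diverse attacks through adversarial reinforcement learning and trains the agent to defend against them, with a population-based learning scheme that prevents cyclic learning.

Method: The problem is formulated as a two-player zero-sum game in which an attacker LLM and a defending agent LLM are co-trained. Under population-based learning, the agent must defend against all previous attacker checkpoints.

Result: Experiments on BrowserGym and AgentDojo show that ARLAS significantly lowers the attack success rate while improving the task success rate, and that the generated attacks are diverse and challenging.

Insight: Adversarial learning can generate diverse attacks automatically, which helps make agents more robust. Population-based learning is a key ingredient.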

Abstract: Large Language Model (LLM) agents can leverage tools such as Google Search to complete complex tasks. However, this tool usage introduces the risk of indirect prompt injections, where malicious instructions hidden in tool outputs can manipulate the agent, posing security risks like data leakage. Current defense strategies typically rely on fine-tuning LLM agents on datasets of known attacks. However, the generation of these datasets relies on manually crafted attack patterns, which limits their diversity and leaves agents vulnerable to novel prompt injections. To address this limitation, we propose Adversarial Reinforcement Learning for Agent Safety (ARLAS), a novel framework that leverages adversarial reinforcement learning (RL) by formulating the problem as a two-player zero-sum game. ARLAS co-trains two LLMs: an attacker that learns to autonomously generate diverse prompt injections and an agent that learns to defend against them while completing its assigned tasks. To ensure robustness against a wide range of attacks and to prevent cyclic learning, we employ a population-based learning framework that trains the agent to defend against all previous attacker checkpoints. Evaluated on BrowserGym and AgentDojo, agents fine-tuned with ARLAS achieve a significantly lower attack success rate than the original model while also improving their task success rate. Our analysis further confirms that the adversarial process generates a diverse and challenging set of attacks, leading to a more robust agent compared to the base model.

[134] NorMuon: Making Muon more efficient and scalable

Zichong Li,Liming Liu,Chen Liang,Weizhu Chen,Tuo Zhao

Main category: cs.LG

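TL;DR: NorMuon is an optimizer that combines Muon's orthogonalization with neuron-level adaptive learning rates. It addresses the non-uniform neuron norms in Muon's updates and improves training efficiency and scalability.

Details Motivation: The Muon optimizer improves optimization geometry by orthogonalizing parameter updates, but its updates have non-uniform neuron norms, so certain neurons dominate the optimization process. The potential of combining Muon's strengths with Adam's has not been systematically explored.

Contribution: Proposes NorMuon, which combines orthogonalization with neuron-level adaptive learning rates and balances parameter utilization through row-wise normalization. Also develops an efficient distributed implementation that improves training efficiency and scalability.

Method: On top of Muon, NorMuon introduces neuron-level adaptive learning rates by maintaining second-order momentum statistics, applies row-wise normalization after orthogonalization, and implements distributed computation under the FSDP2 framework.

Result: In the 1.1B pretraining setting, NorMuon achieves 21.74% better training efficiency than Adam and an 11.31% improvement over Muon, with a memory footprint comparable to Muon's.

Insight: Orthogonalization and adaptive learning rates are complementary rather than competing approaches, which suggests new directions for optimizer design in large-scale deep learning.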

Abstract: The choice of optimizer significantly impacts the training efficiency and computational costs of large language models (LLMs). Recently, the Muon optimizer has demonstrated promising results by orthogonalizing parameter updates, improving optimization geometry through better conditioning. Despite Muon’s emergence as a candidate successor to Adam, the potential for jointly leveraging their strengths has not been systematically explored. In this work, we bridge this gap by proposing NorMuon (Neuron-wise Normalized Muon), an optimizer that synergistically combines orthogonalization with neuron-level adaptive learning rates. Our analysis reveals that while Muon effectively reduces condition numbers, the resulting updates exhibit highly non-uniform neuron norms, causing certain neurons to dominate the optimization process. NorMuon addresses this imbalance by maintaining second-order momentum statistics for each neuron and applying row-wise normalization after orthogonalization, ensuring balanced parameter utilization while preserving Muon’s conditioning benefits. To enable practical deployment at scale, we develop an efficient distributed implementation under the FSDP2 framework that strategically distributes orthogonalization computations across devices. Experiments across multiple model scales demonstrate that NorMuon consistently outperforms both Adam and Muon, achieving 21.74% better training efficiency than Adam and an 11.31% improvement over Muon in the 1.1B pretraining setting, while maintaining a comparable memory footprint to Muon. Our findings suggest that orthogonalization and adaptive learning rates are complementary rather than competing approaches, opening new avenues for optimizer design in large-scale deep learning.

[135] Mitigating Premature Exploitation in Particle-based Monte Carlo for Inference-Time Scaling

Giorgio Giannone,Guangxuan Xu,Nikhil Shivakumar Nayak,Rohan Mahesh Awhad,Shivchander Sudalairaj,Kai Xu,Akash Srivastava

Main category: cs.LG

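TL;DR: The paper proposes Entropic Particle Filtering (ePF), which uses Entropic Annealing (EA) and Look-ahead Modulation (LaM) to address the premature exploitation of Particle Filtering (PF) in Inference-Time Scaling, improving performance on mathematical reasoning tasks.

Details Motivation: In Inference-Time Scaling, Particle Filtering (PF) is easily misled by process reward models, which leads to premature exploitation and particle impoverishment and therefore to convergence on suboptimal solutions. The paper aims to fix this by improving PF.

Contribution: 1. Proposes the Entropic Particle Filtering (ePF) algorithm, combining Entropic Annealing (EA) and Look-ahead Modulation (LaM); 2. By dynamically adjusting the resampling distribution and assessing the potential of a path through look-ahead, improves PF on complex mathematical reasoning tasks.

Method: 1. Entropic Annealing (EA): monitors search diversity via entropy and dynamically anneals the resampling distribution to prevent particle impoverishment; 2. Look-ahead Modulation (LaM): predicts the potential of a path from its successor states, strengthening exploration.

Result: On several math benchmarks, ePF achieves up to a 50% relative improvement in task reward over baseline methods, improving the robustness and performance of PF.

Insight: By combining dynamic diversity control with look-ahead evaluation, ePF balances exploring diverse solution spaces against exploiting high-reward regions, which yields higher-quality solutions for complex reasoning tasks.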

Abstract: Inference-Time Scaling (ITS) improves language models by allocating more computation at generation time. Particle Filtering (PF) has emerged as a strong ITS method for complex mathematical reasoning tasks, but it is vulnerable when guided by process reward models, which often assign overconfident scores early in the reasoning process. This causes PF to suffer from premature exploitation: it myopically commits to locally promising trajectories, prunes potentially correct hypotheses, and converges to suboptimal solutions. This failure mode, known as particle impoverishment, is especially severe under constrained computational budgets. To address this, we analyze the problem and identify two root causes: a lack of diversity in the particle set due to overconfident resampling and consequent inability to assess the potential of a reasoning path. We introduce Entropic Particle Filtering (ePF), an algorithm that integrates two new techniques to solve these issues. The first technique, Entropic Annealing (EA), directly mitigates particle impoverishment by monitoring search diversity via entropy; when diversity drops, it intervenes by dynamically annealing the resampling distribution to preserve exploration. The second, an enhancement called Look-ahead Modulation (LaM), adds a predictive guide to evaluate a state’s potential based on its successors. On several challenging math benchmarks, ePF significantly outperforms strong baselines and achieves up to a 50% relative improvement in task reward. Together, these methods improve PF’s resilience by balancing the exploration of diverse solution spaces with the exploitation of high-reward regions, ultimately leading to higher-quality solutions.
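The entropy-monitoring idea behind Entropic Annealing can be sketched as follows. This is an assumption-laden illustration rather than the authors' algorithm: the softmax resampling rule, the normalized-entropy threshold `min_entropy`, the multiplicative `step`, and all function names are hypothetical choices made to show how a resampling distribution can be flattened when diversity drops.

```python
import math


def softmax(scores, temperature=1.0):
    """Numerically stable softmax over particle scores."""
    top = max(scores)
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def normalized_entropy(probs):
    """Shannon entropy divided by log(n); requires at least two particles."""
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return entropy / math.log(len(probs))


def annealed_resampling_weights(scores, min_entropy=0.5, step=1.5,
                                max_temperature=100.0):
    """Raise the temperature until the resampling distribution is diverse enough.

    Returns the resampling probabilities and the temperature that produced them.
    """
    temperature = 1.0
    probs = softmax(scores, temperature)
    while normalized_entropy(probs) < min_entropy and temperature < max_temperature:
        temperature *= step
        probs = softmax(scores, temperature)
    return probs, temperature
```

With one overconfident reward score, the loop raises the temperature so that lower-scored particles keep a realistic chance of surviving resampling; when the scores are already spread evenly, the distribution is left untouched.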

[136] Sample Smart, Not Hard: Correctness-First Decoding for Better Reasoning in LLMs

Xueyan Li,Guinan Su,Mrinmaya Sachan,Jonas Geiping

Main category: cs.LG

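TL;DR: The paper proposes decoding strategies that calibrate sampling by estimated correctness rather than by confidence alone, to improve large language models on complex reasoning tasks.

Details Motivation: Existing decoding methods fail to balance exploration diversity against accuracy. The authors argue that sampling should be guided by predicted correctness rather than by confidence.

Contribution: Proposes three correctness-based decoding strategies, Greedy-Threshold, Calibrated-TopK, and Calibrated-epsilon, and validates them on math and general reasoning tasks.

Method: The decoding rule is calibrated so that sampling concentrates on tokens with high estimated correctness and is reduced where expected correctness is low.

Result: The new strategies show gains over conventional methods on several reasoning benchmarks.

Insight: Correctness, not confidence alone, is the better signal for guiding decoding, which challenges prevailing heuristics about decoding under uncertainty.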

Abstract: Large Language Models (LLMs) are increasingly applied to complex tasks that require extended reasoning. In such settings, models often benefit from diverse chains-of-thought to arrive at multiple candidate solutions. This requires two competing objectives: to inject enough stochasticity to explore multiple reasoning chains, and to ensure sufficient accuracy and quality in each path. Existing works pursue the first objective by increasing exploration at highly uncertain steps with higher temperature or larger candidate token sets, while others improve reliability by rejecting samples with low confidence post-generation, implying that low confidence correlates with low answer quality. These two lines of thought are in conflict, as they conflate different sources of uncertainty. To resolve this, we argue that the decoding rule should be calibrated by correctness, not confidence alone. We should sample from tokens with higher estimated correctness, and reduce sampling where expected correctness is low. We propose simple strategies that achieve this goal: Greedy-Threshold makes sampling greedy at very low confidence steps. Calibrated-TopK and Calibrated-epsilon set truncation threshold based on estimated rank-wise correctness. Together, our findings challenge prevailing heuristics about decoding under uncertainty and show gains across math and general reasoning benchmarks.
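The Greedy-Threshold rule is described only as making sampling greedy at very low confidence steps. A minimal sketch of that behavior follows, assuming that confidence is measured by the largest next-token probability and that the cut-off is a fixed `threshold`; both are assumptions, and the function name is hypothetical.

```python
import random


def greedy_threshold_sample(probs, threshold=0.3, rng=None):
    """Pick a token index from a next-token distribution.

    If the most likely token has probability below `threshold` (a very
    low-confidence step), return the argmax instead of sampling;
    otherwise sample from the distribution as usual.
    """
    rng = rng or random
    indices = range(len(probs))
    top = max(indices, key=probs.__getitem__)
    if probs[top] < threshold:
        return top  # greedy fallback at low confidence
    return rng.choices(indices, weights=probs, k=1)[0]


# A flat distribution triggers the greedy branch and returns the argmax.
flat_step = greedy_threshold_sample([0.2, 0.25, 0.15, 0.2, 0.2])
```

The design choice runs against the common heuristic of exploring more when the model is uncertain: at steps where no token is likely to be correct, extra randomness is treated as harmful rather than useful.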

[137] Learning from Failures: Understanding LLM Alignment through Failure-Aware Inverse RL

Nyal Patel,Matthieu Bou,Arjun Jagota,Satyapriya Krishna,Sonali Parbhoo

Main category: cs.LG

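TL;DR: The paper proposes a failure-aware inverse reinforcement learning (IRL) method that focuses on examples that are misclassified or hard to handle, in order to recover the latent reward signals from RLHF more accurately.

Details Motivation: RLHF aligns LLMs with human preferences, but the reward signals the model internalizes are hard to interpret, which raises safety concerns. Existing IRL methods treat all preference pairs equally and overlook the most informative failure examples.

Contribution: 1. Proposes a failure-aware IRL algorithm that focuses on misclassified or difficult examples; 2. Shows experimentally that the method outperforms existing baselines on LLM detoxification; 3. The extracted reward signals better reflect the true objectives behind RLHF.

Method: A new IRL algorithm that gives special attention to examples that are misclassified or assigned nearly equal scores (the failures) and recovers the latent reward function from them.

Result: On LLM detoxification, failure-aware IRL outperforms existing IRL baselines across multiple metrics without requiring external classifiers or supervision.

Insight: Focusing on failure examples reveals the reward signals actually learned during RLHF more efficiently, improving the interpretability and safety of model alignment.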

Abstract: Reinforcement Learning from Human Feedback (RLHF) aligns Large Language Models (LLMs) with human preferences, yet the underlying reward signals they internalize remain hidden, posing a critical challenge for interpretability and safety. Existing approaches attempt to extract these latent incentives using Inverse Reinforcement Learning (IRL), but treat all preference pairs equally, often overlooking the most informative signals: those examples the extracted reward model misclassifies or assigns nearly equal scores, which we term failures. We introduce a novel failure-aware IRL algorithm that focuses on misclassified or difficult examples to recover the latent rewards defining model behaviors. By learning from these failures, our failure-aware IRL extracts reward functions that better reflect the true objectives behind RLHF. We demonstrate that failure-aware IRL outperforms existing IRL baselines across multiple metrics when applied to LLM detoxification, without requiring external classifiers or supervision. Crucially, failure-aware IRL yields rewards that better capture the true incentives learned during RLHF, enabling more effective re-RLHF training than standard IRL. This establishes failure-aware IRL as a robust, scalable method for auditing model alignment and reducing ambiguity in the IRL process.

[138] The Alignment Auditor: A Bayesian Framework for Verifying and Refining LLM Objectives

Matthieu Bou,Nyal Patel,Arjun Jagota,Satyapriya Krishna,Sonali Parbhoo

Main category: cs.LG

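TL;DR: The paper proposes a framework based on Bayesian inverse reinforcement learning for verifying and refining the objectives of large language models (LLMs). By quantifying uncertainty, providing diagnostics, and validating policy-level utility, the framework improves transparency and alignment.

Details Motivation: The objectives that LLMs implicitly optimize are opaque, which makes trustworthy alignment and auditing difficult. Existing methods either fail to address non-identifiability or produce overconfident reward estimates.

Contribution: Proposes an auditing framework based on Bayesian inverse reinforcement learning that recovers a distribution over objectives and provides three key capabilities: (1) quantifying and reducing non-identifiability; (2) providing uncertainty-aware diagnostics; (3) validating policy-level utility.

Method: Bayesian inverse reinforcement learning is used to infer a distribution over objectives. Objectives are then refined through posterior contraction over sequential rounds of evidence, diagnostic generation, and policy-level validation.

Result: Experiments show that the framework successfully audits a detoxified LLM and yields a well-calibrated, interpretable objective that strengthens alignment.

Insight: The framework gives auditors, safety teams, and regulators a practical toolkit and supports more trustworthy and accountable AI.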

Abstract: The objectives that Large Language Models (LLMs) implicitly optimize remain dangerously opaque, making trustworthy alignment and auditing a grand challenge. While Inverse Reinforcement Learning (IRL) can infer reward functions from behaviour, existing approaches either produce a single, overconfident reward estimate or fail to address the fundamental ambiguity of the task (non-identifiability). This paper introduces a principled auditing framework that re-frames reward inference from a simple estimation task to a comprehensive process for verification. Our framework leverages Bayesian IRL to not only recover a distribution over objectives but to enable three critical audit capabilities: (i) Quantifying and systematically reducing non-identifiability by demonstrating posterior contraction over sequential rounds of evidence; (ii) Providing actionable, uncertainty-aware diagnostics that expose spurious shortcuts and identify out-of-distribution prompts where the inferred objective cannot be trusted; and (iii) Validating policy-level utility by showing that the refined, low-uncertainty reward can be used directly in RLHF to achieve training dynamics and toxicity reductions comparable to the ground-truth alignment process. Empirically, our framework successfully audits a detoxified LLM, yielding a well-calibrated and interpretable objective that strengthens alignment guarantees. Overall, this work provides a practical toolkit for auditors, safety teams, and regulators to verify what LLMs are truly trying to achieve, moving us toward more trustworthy and accountable AI.

[139] Influence Functions for Efficient Data Selection in Reasoning

Prateek Humane,Paolo Cudrano,Daniel Z. Kaplan,Matteo Matteucci,Supriyo Chakraborty,Irina Rish

Main category: cs.LG

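TL;DR: The paper proposes using influence functions to measure the causal effect of chain-of-thought (CoT) data on downstream accuracy, and builds on this an efficient data selection method that outperforms perplexity and embedding-based baselines.

Details Motivation: Current work relies on indirect heuristics, such as problem difficulty or trace length, to define the quality of reasoning data, and these heuristics leave quality ill-defined. The paper aims to give reasoning data quality a clear definition.

Contribution: 1. Proposes measuring CoT data quality with influence functions; 2. Introduces influence-based pruning, which improves performance on math reasoning tasks.

Method: Influence functions quantify the effect of individual CoT examples on downstream accuracy, and an influence-based pruning strategy selects high-quality data.

Result: Experiments show that the method outperforms perplexity and embedding-based baselines on math reasoning within a model family.

Insight: Influence functions provide a direct measure of data quality and help select high-quality examples efficiently, particularly for reasoning tasks.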

Abstract: Fine-tuning large language models (LLMs) on chain-of-thought (CoT) data shows that a small amount of high-quality data can outperform massive datasets. Yet, what constitutes “quality” remains ill-defined. Existing reasoning methods rely on indirect heuristics such as problem difficulty or trace length, while instruction-tuning has explored a broader range of automated selection strategies, but rarely in the context of reasoning. We propose to define reasoning data quality using influence functions, which measure the causal effect of individual CoT examples on downstream accuracy, and introduce influence-based pruning, which consistently outperforms perplexity and embedding-based baselines on math reasoning within a model family.

[140] Stratified GRPO: Handling Structural Heterogeneity in Reinforcement Learning of LLM Search Agents

Mingkang Zhu,Xi Chen,Bei Yu,Hengshuang Zhao,Jiaya Jia

Main category: cs.LG

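TL;DR: The paper proposes Stratified GRPO, which uses Stratified Advantage Normalization (SAN) to remove the reinforcement learning bias caused by structural heterogeneity in LLM search agent trajectories, improving training and policy performance.

Details Motivation: When LLM search agents carry out multi-step tasks, structural heterogeneity in their trajectories (the number, placement, and outcomes of search calls) leads to different reward distributions. Standard policy gradient methods use a single global baseline and therefore suffer from cross-stratum bias, which distorts credit assignment and hinders exploration.

Contribution: Proposes Stratified GRPO, whose central component is Stratified Advantage Normalization (SAN). SAN partitions trajectories into homogeneous strata and computes advantages locally, eliminating cross-stratum bias while retaining global unbiasedness and unit variance.

Method: SAN stratifies trajectories by their structural properties and computes advantages within each stratum, so that trajectories are evaluated only against homogeneous peers. SAN is further blended linearly with the global estimator to improve stability with finite samples.

Result: Experiments show that Stratified GRPO outperforms GRPO by up to 11.3 points on single-hop and multi-hop question-answering tasks, with higher training rewards, greater stability, and more effective search policies.

Insight: Stratification is an effective remedy for structural heterogeneity in RL. SAN is theoretically grounded and also improves the performance and stability of LLM search agents in practice.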

Abstract: Large language model (LLM) agents increasingly rely on external tools such as search engines to solve complex, multi-step problems, and reinforcement learning (RL) has become a key paradigm for training them. However, the trajectories of search agents are structurally heterogeneous, where variations in the number, placement, and outcomes of search calls lead to fundamentally different answer directions and reward distributions. Standard policy gradient methods, which use a single global baseline, suffer from what we identify and formalize as cross-stratum bias, an “apples-to-oranges” comparison of heterogeneous trajectories. This cross-stratum bias distorts credit assignment and hinders exploration of complex, multi-step search strategies. To address this, we propose Stratified GRPO, whose central component, Stratified Advantage Normalization (SAN), partitions trajectories into homogeneous strata based on their structural properties and computes advantages locally within each stratum. This ensures that trajectories are evaluated only against their true peers. Our analysis proves that SAN eliminates cross-stratum bias, yields conditionally unbiased unit-variance estimates inside each stratum, and retains the global unbiasedness and unit-variance properties enjoyed by standard normalization, resulting in a purer and more scale-stable learning signal. To improve practical stability under finite-sample regimes, we further linearly blend SAN with the global estimator. Extensive experiments on diverse single-hop and multi-hop question-answering benchmarks demonstrate that Stratified GRPO consistently and substantially outperforms GRPO by up to 11.3 points, achieving higher training rewards, greater training stability, and more effective search policies. These results establish stratification as a principled remedy for structural heterogeneity in RL for LLM search agents.
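Stratified advantage normalization, as described above, can be sketched as per-stratum standardization of rewards followed by a linear blend with the globally standardized advantage. The stratum key (for example, the number of search calls), the `blend` coefficient, the `eps` stabilizer, and the function names are all assumptions for illustration; the abstract does not give the exact formulas.

```python
import math
from collections import defaultdict


def _standardize(values, eps=1e-8):
    """Zero-mean, unit-variance normalization; eps guards against zero spread."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / (std + eps) for v in values]


def stratified_advantages(rewards, strata, blend=1.0):
    """Compute advantages within each stratum, blended with the global estimate.

    rewards: one scalar reward per trajectory.
    strata:  one hashable stratum key per trajectory (hypothetical choice:
             the number of search calls in the trajectory).
    blend:   1.0 uses only the stratified estimate, 0.0 only the global one.
    """
    groups = defaultdict(list)
    for index, key in enumerate(strata):
        groups[key].append(index)

    local = [0.0] * len(rewards)
    for indices in groups.values():
        normalized = _standardize([rewards[i] for i in indices])
        for i, advantage in zip(indices, normalized):
            local[i] = advantage

    global_adv = _standardize(rewards)
    return [blend * l + (1.0 - blend) * g for l, g in zip(local, global_adv)]
```

In a group with rewards `[0, 1, 10, 12]` and strata `[0, 0, 1, 1]`, the second trajectory is the best of its stratum but falls below the global mean. A global baseline assigns it a negative advantage, while the stratified estimate assigns a positive one, which is the apples-to-oranges effect the abstract calls cross-stratum bias.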

cs.MM [Back]

[141] Towards Robust and Realible Multimodal Fake News Detection with Incomplete Modality

Hengyang Zhou,Yiwei Wei,Jian Yang,Zhenyu Zhang

Main category: cs.MM

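TL;DR: The paper proposes MMLNet, a multimodal fake news detection method that dynamically compensates for missing modality information, improving robustness and generalization when modalities are incomplete.

Details Motivation: In the real world, multimodal news tends to lose part of its information during dissemination, resulting in modality incompleteness, which hurts the generalization and robustness of existing models.

Contribution: Proposes MMLNet, a generic multimodal fusion strategy with three key steps: Multi-Expert Collaborative Reasoning, Incomplete Modality Adapters, and Modality Missing Learning.

Method: Multi-expert collaborative reasoning dynamically compensates for missing modalities, incomplete modality adapters adjust the feature distribution, and contrastive learning with an adaptive weighting strategy learns a robust representation.

Result: MMLNet outperforms existing methods on three real-world benchmark datasets while keeping the model simple.

Insight: Dynamically compensating for missing modality information improves fake news detection in incomplete-modality scenarios and helps curb the spread of malicious misinformation.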

Abstract: Multimodal fake news detection (MFND) has become an urgent task with the emergence of huge multimodal fake content on social media platforms. Previous studies mainly focus on complex feature extraction and fusion to learn discriminative information from multimodal content. However, in real-world applications, multimedia news may naturally lose some information during dissemination, resulting in modality incompleteness, which is detrimental to the generalization and robustness of existing models. To this end, we propose a novel generic and robust multimodal fusion strategy, termed Multi-expert Modality-incomplete Learning Network (MMLNet), which is simple yet effective. It consists of three key steps: (1) Multi-Expert Collaborative Reasoning, which compensates for missing modalities by dynamically leveraging complementary information through multiple experts; (2) Incomplete Modality Adapters, which compensate for the missing information by leveraging the new feature distribution; and (3) Modality Missing Learning, which leverages a label-aware adaptive weighting strategy to learn a robust representation with contrastive learning. We evaluate MMLNet on three real-world benchmarks across two languages, demonstrating superior performance compared to state-of-the-art methods while maintaining relative simplicity. By ensuring the accuracy of fake news detection in incomplete modality scenarios caused by information propagation, MMLNet effectively curbs the spread of malicious misinformation. Code is publicly available at https://github.com/zhyhome/MMLNet.

[142] Controllable Audio-Visual Viewpoint Generation from 360° Spatial Information

Christian Marinoni,Riccardo Fosco Gramaccioni,Eleonora Grassucci,Danilo Comminiello

Main category: cs.MM

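TL;DR: The paper proposes a controllable audio-visual viewpoint generation framework in which a diffusion model conditioned on 360-degree spatial information generates spatially aware video and audio.

Details Motivation: Existing methods for generating audio-video content lack fine-grained control over a specific viewpoint within an immersive 360-degree environment, which limits the realism of the generated content.

Contribution: Presents what the authors describe as the first framework for controllable audio-visual generation, combining a panoramic saliency map, a distance map, and a scene caption for fine-grained control over the generated content.

Method: A diffusion model is conditioned on strong signals derived from the full 360-degree space: a panoramic saliency map, a bounding-box-aware signed distance map, and a descriptive caption of the scene. It generates spatially aware video and audio.

Result: The generated audio and video are coherent with the target viewpoint and the surrounding environmental context, supporting the effectiveness of the framework.

Insight: Integrating panoramic spatial information enables more realistic and immersive audio-visual generation and opens a direction for future work.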

Abstract: The generation of sounding videos has seen significant advancements with the advent of diffusion models. However, existing methods often lack the fine-grained control needed to generate viewpoint-specific content from larger, immersive 360-degree environments. This limitation restricts the creation of audio-visual experiences that are aware of off-camera events. To the best of our knowledge, this is the first work to introduce a framework for controllable audio-visual generation, addressing this unexplored gap. Specifically, we propose a diffusion model by introducing a set of powerful conditioning signals derived from the full 360-degree space: a panoramic saliency map to identify regions of interest, a bounding-box-aware signed distance map to define the target viewpoint, and a descriptive caption of the entire scene. By integrating these controls, our model generates spatially-aware viewpoint videos and audios that are coherently influenced by the broader, unseen environmental context, introducing a strong controllability that is essential for realistic and immersive audio-visual generation. We show audiovisual examples proving the effectiveness of our framework.

cs.NE [Back]

[143] From Neural Activity to Computation: Biological Reservoirs for Pattern Recognition in Digit Classification

Ludovico Iannello,Luca Ciampi,Fabrizio Tonelli,Gabriele Lagani,Lucio Maria Calcagnile,Federico Cremisi,Angelo Di Garbo,Giuseppe Amato

Main category: cs.NE

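TL;DR: The paper presents biological reservoir computing (BRC), a reservoir computing approach built on biological neurons, applies it to handwritten digit classification, and illustrates the computational potential of biological neural networks.

Details Motivation: Conventional reservoir computing (RC) relies on artificial recurrent units. This paper explores using the spontaneous and evoked activity of living neurons as a computational resource, drawing on biological systems to inform machine learning methods.

Contribution: 1. Proposes the biological reservoir computing (BRC) framework, which uses a multi-electrode array (MEA) to capture neural activity; 2. Shows that biological neurons are effective for digit classification; 3. Compares against an artificial reservoir to contextualize the potential of the biological system.

Method: 1. Inputs are delivered to cultured neurons as electrical stimulation through the MEA; 2. The remaining electrodes record the neural activity, which maps inputs into a high-dimensional feature space; 3. A linear classifier is trained for digit classification; 4. A standard artificial reservoir is trained on the same task for comparison.

Result: Experiments show that the biological reservoir can effectively support classification, demonstrating its feasibility as a computational substrate, with performance close to that of the artificial reservoir.

Insight: The activity of biological neural networks can serve as a computational resource and offers new ideas for biologically inspired machine learning models.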

Abstract: In this paper, we present a biologically grounded approach to reservoir computing (RC), in which a network of cultured biological neurons serves as the reservoir substrate. This system, referred to as biological reservoir computing (BRC), replaces artificial recurrent units with the spontaneous and evoked activity of living neurons. A multi-electrode array (MEA) enables simultaneous stimulation and readout across multiple sites: inputs are delivered through a subset of electrodes, while the remaining ones capture the resulting neural responses, mapping input patterns into a high-dimensional biological feature space. We evaluate the system through a case study on digit classification using a custom dataset. Input images are encoded and delivered to the biological reservoir via electrical stimulation, and the corresponding neural activity is used to train a simple linear classifier. To contextualize the performance of the biological system, we also include a comparison with a standard artificial reservoir trained on the same task. The results indicate that the biological reservoir can effectively support classification, highlighting its potential as a viable and interpretable computational substrate. We believe this work contributes to the broader effort of integrating biological principles into machine learning and aligns with the goals of human-inspired vision by exploring how living neural systems can inform the design of efficient and biologically plausible models.

cs.SD

[144] StereoSync: Spatially-Aware Stereo Audio Generation from Video

Christian Marinoni,Riccardo Fosco Gramaccioni,Kazuki Shimada,Takashi Shibuya,Yuki Mitsufuji,Danilo Comminiello

Main category: cs.SD

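TL;DR: StereoSync is a novel model that generates stereo audio synchronized with a video in both time and space. It takes spatial cues from depth maps and bounding boxes and feeds them to a diffusion-based audio generator, making video-to-audio generation markedly more immersive and realistic.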

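Details Motivation: Existing audio generation methods focus mainly on temporal synchronization and ignore the spatial structure of the video. StereoSync fills this gap by generating audio that is spatially aligned with the video's visual context.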

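Contribution: 1. Proposes a spatially aware stereo audio generation model; 2. Introduces depth maps and bounding boxes as spatial cues, so that a diffusion model generates audio that adapts dynamically to the scene; 3. Validates the method on the Walking The Maps dataset.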

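Method: A diffusion-based audio generation framework that extracts spatial cues from the video's depth maps and bounding boxes and uses them as cross-attention conditioning, achieving both temporal synchronization and spatial alignment of the audio.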

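Result: Experiments show that StereoSync aligns audio with the video both temporally and spatially, improving the quality and immersiveness of video-to-audio generation.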

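Insight: Spatial information such as depth maps and bounding boxes can markedly improve how well generated audio adapts to scene dynamics and how realistic it sounds.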

Abstract: Although audio generation has been widely studied over recent years, video-aligned audio generation still remains a relatively unexplored frontier. To address this gap, we introduce StereoSync, a novel and efficient model designed to generate audio that is both temporally synchronized with a reference video and spatially aligned with its visual context. Moreover, StereoSync also achieves efficiency by leveraging pretrained foundation models, reducing the need for extensive training while maintaining high-quality synthesis. Unlike existing methods that primarily focus on temporal synchronization, StereoSync introduces a significant advancement by incorporating spatial awareness into video-aligned audio generation. Indeed, given an input video, our approach extracts spatial cues from depth maps and bounding boxes, using them as cross-attention conditioning in a diffusion-based audio generation model. Such an approach allows StereoSync to go beyond simple synchronization, producing stereo audio that dynamically adapts to the spatial structure and movement of a video scene. We evaluate StereoSync on Walking The Maps, a curated dataset comprising videos from video games that feature animated characters walking through diverse environments. Experimental results demonstrate the ability of StereoSync to achieve both temporal and spatial alignment, advancing the state of the art in video-to-audio generation and resulting in a significantly more immersive and realistic audio experience.
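The cross-attention conditioning described in the abstract can be illustrated with a single-head scaled dot-product sketch in NumPy. This is a generic formulation under stated assumptions, not StereoSync's actual architecture: the function names, token shapes, and projection matrices `w_q`, `w_k`, `w_v` are hypothetical. Audio latent tokens act as queries, while tokens derived from depth maps and bounding boxes act as keys and values.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)


def cross_attention(audio_tokens, cue_tokens, w_q, w_k, w_v):
    """Let audio tokens (queries) attend to spatial cue tokens (keys/values).

    audio_tokens: (n_audio, d_audio) latent tokens of the audio being denoised.
    cue_tokens:   (n_cues, d_cue) features from depth maps / bounding boxes.
    Returns the conditioned tokens and the attention weights.
    """
    q = audio_tokens @ w_q                    # (n_audio, d_head)
    k = cue_tokens @ w_k                      # (n_cues, d_head)
    v = cue_tokens @ w_v                      # (n_cues, d_out)
    scores = q @ k.T / np.sqrt(q.shape[-1])   # scaled dot-product scores
    weights = softmax(scores, axis=-1)        # each row sums to 1 over the cues
    return weights @ v, weights
```

Each row of `weights` shows how strongly one audio token draws on each spatial cue, which is the mechanism that would let the generated audio follow the position and movement of objects in the scene.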

[145] FoleyGRAM: Video-to-Audio Generation with GRAM-Aligned Multimodal Encoders

Riccardo Fosco Gramaccioni,Christian Marinoni,Eleonora Grassucci,Giordano Cicchetti,Aurelio Uncini,Danilo Comminiello

Main category: cs.SD

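TL;DR: FoleyGRAM is a novel video-to-audio generation method that achieves semantic control through GRAM-aligned multimodal encoders, improving the quality and semantic alignment of the generated audio.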

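Details Motivation: Existing video-to-audio generation methods fall short in semantic control and multimodal alignment, and cannot precisely align the generated audio with the video content.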

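Contribution: FoleyGRAM's core contribution is aligning video, text, and audio embeddings through GRAM-aligned multimodal encoders, which improves the semantic accuracy and temporal alignment of the generated audio.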

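Method: FoleyGRAM uses a diffusion-based audio synthesis model conditioned on GRAM-aligned embeddings and waveform envelopes, so that the generated audio is semantically rich and temporally aligned with the video.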

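Result: Experiments on the Greatest Hits dataset show that GRAM alignment improves the semantic alignment between generated audio and video content, advancing the state of the art in video-to-audio synthesis.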

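Insight: Precise alignment of multimodal embeddings is essential for better semantic control in video-to-audio generation, and GRAM offers an effective way to achieve it.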

Abstract: In this work, we present FoleyGRAM, a novel approach to video-to-audio generation that emphasizes semantic conditioning through the use of aligned multimodal encoders. Building on prior advancements in video-to-audio generation, FoleyGRAM leverages the Gramian Representation Alignment Measure (GRAM) to align embeddings across video, text, and audio modalities, enabling precise semantic control over the audio generation process. The core of FoleyGRAM is a diffusion-based audio synthesis model conditioned on GRAM-aligned embeddings and waveform envelopes, ensuring both semantic richness and temporal alignment with the corresponding input video. We evaluate FoleyGRAM on the Greatest Hits dataset, a standard benchmark for video-to-audio models. Our experiments demonstrate that aligning multimodal encoders using GRAM enhances the system’s ability to semantically align generated audio with video content, advancing the state of the art in video-to-audio synthesis.
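The abstract names the Gramian Representation Alignment Measure but does not define it. The sketch below assumes, based on our understanding of GRAM rather than on anything stated here, that alignment is scored by the volume of the parallelotope spanned by the unit-normalized modality embeddings, computed as the square root of the determinant of their Gram matrix. The function name `gram_volume` is hypothetical. Under this assumption a smaller volume means the video, text, and audio embeddings point in more similar directions.

```python
import numpy as np


def gram_volume(embeddings):
    """Volume spanned by unit-normalized embeddings, one row per modality.

    Assumed formulation: sqrt(det(E E^T)) with rows of E scaled to unit norm.
    Returns a value in [0, 1]: near 0 for aligned vectors, 1 for orthogonal ones.
    """
    e = np.asarray(embeddings, dtype=float)
    e = e / np.linalg.norm(e, axis=1, keepdims=True)
    gram = e @ e.T                            # pairwise cosine similarities
    det = max(np.linalg.det(gram), 0.0)       # clip tiny negative round-off
    return float(np.sqrt(det))
```

For example, `gram_volume(np.eye(3))` evaluates three mutually orthogonal embeddings, the least aligned case, while three identical rows give a volume of essentially zero.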