Table of Contents
- cs.CL [Total: 53]
- cs.CV [Total: 47]
- cs.RO [Total: 2]
- cs.CR [Total: 1]
- eess.IV [Total: 1]
- cs.SE [Total: 1]
- cs.LG [Total: 6]
- cs.MA [Total: 1]
- eess.AS [Total: 1]
- cs.AI [Total: 4]
cs.CL
[1] Small Language Models Offer Significant Potential for Science Community
Jian Zhang
Main category: cs.CL
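TL;DR: The paper explores the potential of small language models (MiniLMs) for geoscience literature retrieval and proposes an efficient, low-cost, and precise information retrieval framework as an alternative to large language models (LLMs).
Details
Motivation: LLMs are increasingly used in scientific research, but they raise concerns about information bias and high computational cost. The author aims to offer a more efficient, lower-cost alternative based on small language models, focused on precise information retrieval in the geosciences.
Contribution: 1) A geoscience literature corpus of about 77 million high-quality sentences; 2) an efficient semantic search and sentence-level indexing approach based on MiniLMs; 3) a demonstration of the potential of MiniLMs for sentiment analysis and topic clustering.
Method: The author uses freely available small language models (MiniLMs) with semantic search and sentence-level indexing to extract domain-specific information from the geoscience literature efficiently. Sentiment analysis and unsupervised clustering are additionally used to analyze the emotional tone of sentences and the evolution of research topics.
Result: MiniLMs efficiently extract expert-verified information from the literature and perform especially well on quantitative findings. Sentiment analysis and topic clustering also reveal research trends and points of controversy in the geosciences.
Insight: Small language models offer clear advantages in resource-constrained scientific domains: they lower computational cost and can also provide more precise domain-specific retrieval.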
Abstract: Recent advancements in natural language processing, particularly with large language models (LLMs), are transforming how scientists engage with the literature. While the adoption of LLMs is increasing, concerns remain regarding potential information biases and computational costs. Rather than LLMs, I developed a framework to evaluate the feasibility of precise, rapid, and cost-effective information retrieval from extensive geoscience literature using freely available small language models (MiniLMs). A curated corpus of approximately 77 million high-quality sentences, extracted from 95 leading peer-reviewed geoscience journals such as Geophysical Research Letters and Earth and Planetary Science Letters published during years 2000 to 2024, was constructed. MiniLMs enable a computationally efficient approach for extracting relevant domain-specific information from these corpora through semantic search techniques and sentence-level indexing. This approach, unlike LLMs such as ChatGPT-4, which often produce generalized responses, excels at identifying substantial amounts of expert-verified information with established, multi-disciplinary sources, especially for information with quantitative findings. Furthermore, by analyzing emotional tone via sentiment analysis and topical clusters through unsupervised clustering within sentences, MiniLM provides a powerful tool for tracking the evolution of conclusions, research priorities, advancements, and emerging questions within geoscience communities. Overall, MiniLM holds significant potential within the geoscience community for applications such as fact and image retrievals, trend analyses, contradiction analyses, and educational purposes.
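Example (illustrative sketch): the sentence-level semantic search described in the abstract can be pictured as ranking indexed sentences by cosine similarity to a query. The `embed` function below is a hypothetical bag-of-words stand-in for a real sentence encoder such as a MiniLM, and the three corpus sentences are invented; neither comes from the paper.

```python
import math
from collections import Counter

def embed(sentence):
    """Toy stand-in for a sentence encoder: a bag-of-words count vector."""
    return Counter(sentence.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def search(query, index, k=2):
    """Return the k indexed sentences most similar to the query."""
    q = embed(query)
    scored = [(cosine(q, vec), sent) for sent, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [sent for _, sent in scored[:k]]

# Invented corpus; the sentence-level index stores one vector per sentence.
corpus = [
    "mantle viscosity controls postglacial rebound",
    "sea surface temperature rose during the event",
    "rebound rates constrain mantle viscosity estimates",
]
index = [(s, embed(s)) for s in corpus]
hits = search("mantle viscosity", index)
```

With a real encoder only `embed` would change; the index and ranking logic stay the same.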
[2] Transformer-Based Low-Resource Language Translation: A Study on Standard Bengali to Sylheti
Mangsura Kabir Oni,Tabia Tanzin Prama
Main category: cs.CL
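TL;DR: The paper examines Transformer-based models for low-resource translation from Standard Bengali to Sylheti and finds that fine-tuned models outperform zero-shot LLMs, underscoring the importance of task-specific adaptation.
Details
Motivation: Machine translation has made substantial progress for high-resource languages, but low-resource languages such as Sylheti remain under-studied and call for effective methods.
Contribution: The study finds that fine-tuned multilingual Transformer models perform best on Sylheti translation, offering new insight into low-resource translation.
Method: Transformer models such as mBART-50 and MarianMT are fine-tuned and compared with zero-shot LLMs on translation quality.
Result: mBART-50 achieves the highest translation adequacy, MarianMT shows the strongest character-level fidelity, and fine-tuned models significantly outperform LLMs.
Insight: Task-specific fine-tuning is essential for low-resource translation and helps advance inclusive language technologies.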
Abstract: Machine Translation (MT) has advanced from rule-based and statistical methods to neural approaches based on the Transformer architecture. While these methods have achieved impressive results for high-resource languages, low-resource varieties such as Sylheti remain underexplored. In this work, we investigate Bengali-to-Sylheti translation by fine-tuning multilingual Transformer models and comparing them with zero-shot large language models (LLMs). Experimental results demonstrate that fine-tuned models significantly outperform LLMs, with mBART-50 achieving the highest translation adequacy and MarianMT showing the strongest character-level fidelity. These findings highlight the importance of task-specific adaptation for underrepresented languages and contribute to ongoing efforts toward inclusive language technologies.
[3] DuoLens: A Framework for Robust Detection of Machine-Generated Multilingual Text and Code
Shriyansh Agrawal,Aidan Lau,Sanyam Shah,Ahan M R,Kevin Zhu,Sunishchal Dev,Vasu Sharma
Main category: cs.CL
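TL;DR: DuoLens is a framework for detecting machine-generated multilingual text and source code. By fine-tuning small language models (SLMs), it substantially outperforms existing methods in both compute cost and accuracy.
Details
Motivation: Current detectors of machine-generated content, mostly zero-shot methods such as Fast DetectGPT or GPTZero, either incur high compute cost or lack accuracy, forcing a trade-off between the two.
Contribution: 1) Fine-tuning SLMs to substantially improve detection performance; 2) showing that on binary classification SLMs outperform LLMs while using far fewer compute resources; 3) retaining robustness under cross-generator shift and adversarial transformations.
Method: The pre-trained RoBERTa and CodeBERTa models are fine-tuned on specialized datasets for binary classification.
Result: AUROC of 0.97-0.99 and macro-F1 of 0.89-0.94, with latency reduced 8-12x and peak VRAM reduced 3-5x. Under adversarial transformations, performance retains at least 92% of clean AUROC.
Insight: On specific tasks, small language models can outperform large ones while greatly reducing compute overhead, giving an efficient solution for machine-generated content detection.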
Abstract: The prevalence of Large Language Models (LLMs) for generating multilingual text and source code has only increased the imperative for machine-generated content detectors to be accurate and efficient across domains. Current detectors, predominantly utilizing zero-shot methods, such as Fast DetectGPT or GPTZero, either incur high computational cost or lack sufficient accuracy, often with a trade-off between the two, leaving room for further improvement. To address these gaps, we propose the fine-tuning of encoder-only Small Language Models (SLMs), in particular, the pre-trained models of RoBERTa and CodeBERTa using specialized datasets on source code and other natural language to prove that for the task of binary classification, SLMs outperform LLMs by a huge margin whilst using a fraction of compute. Our encoders achieve AUROC $= 0.97$ to $0.99$ and macro-F1 $0.89$ to $0.94$ while reducing latency by $8$-$12\times$ and peak VRAM by $3$-$5\times$ at $512$-token inputs. Under cross-generator shifts and adversarial transformations (paraphrase, back-translation; code formatting/renaming), performance retains $\geq 92\%$ of clean AUROC. We release training and evaluation scripts with seeds and configs; a reproducibility checklist is also included.
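Example (illustrative sketch): the robustness figure "retains at least 92% of clean AUROC" is a ratio of two AUROC values. The code below computes AUROC from its pairwise definition and then the retention ratio. The labels and detector scores are invented for illustration and are not results from the paper.

```python
def auroc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Invented detector scores: 1 = machine-generated, 0 = human-written.
labels = [1, 1, 1, 0, 0, 0]
clean_scores = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]      # classes perfectly separated
attacked_scores = [0.9, 0.8, 0.25, 0.3, 0.2, 0.1]  # one positive falls below a negative

clean = auroc(labels, clean_scores)
attacked = auroc(labels, attacked_scores)
retention = attacked / clean  # fraction of clean AUROC kept after the transformation
```

In this toy case eight of the nine positive-negative pairs remain correctly ordered after the attack, so the retention is 8/9.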
[4] Improving Topic Modeling of Social Media Short Texts with Rephrasing: A Case Study of COVID-19 Related Tweets
Wangjiaxuan Xin,Shuhua Yin,Shi Chen,Yaorong Ge
Main category: cs.CL
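TL;DR: The paper proposes TM-Rephrase, a framework that uses large language models (LLMs) to rephrase short social media texts into more standardized language in order to improve topic modeling. Experiments show clear gains in topic coherence, uniqueness, and diversity.
Details
Motivation: The brevity and noise of short social media texts such as tweets degrade topic modeling and yield topics that are hard to interpret. The paper addresses this through text rephrasing.
Contribution: A model-agnostic framework, TM-Rephrase, that rephrases text with LLMs and improves several topic modeling metrics.
Method: Tweets are standardized with two rephrasing strategies (general and colloquial-to-formal) and evaluated across multiple topic modeling algorithms, including LDA.
Result: TM-Rephrase improves topic coherence, uniqueness, and diversity while reducing redundancy, with the strongest effect for the LDA algorithm.
Insight: Standardizing short social media texts can substantially improve the usefulness of topic modeling, with broad implications for social media analysis in public health crises and other domains.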
Abstract: Social media platforms such as Twitter (now X) provide rich data for analyzing public discourse, especially during crises such as the COVID-19 pandemic. However, the brevity, informality, and noise of social media short texts often hinder the effectiveness of traditional topic modeling, producing incoherent or redundant topics that are often difficult to interpret. To address these challenges, we have developed TM-Rephrase, a model-agnostic framework that leverages large language models (LLMs) to rephrase raw tweets into more standardized and formal language prior to topic modeling. Using a dataset of 25,027 COVID-19-related Twitter posts, we investigate the effects of two rephrasing strategies, general rephrasing and colloquial-to-formal rephrasing, on multiple topic modeling methods. Results demonstrate that TM-Rephrase improves three metrics measuring topic modeling performance (i.e., topic coherence, topic uniqueness, and topic diversity) while reducing topic redundancy of most topic modeling algorithms, with the colloquial-to-formal strategy yielding the greatest performance gains, especially for the Latent Dirichlet Allocation (LDA) algorithm. This study contributes a model-agnostic approach to enhancing topic modeling in public-health-related social media analysis, with broad implications for improved understanding of public discourse in health crises as well as other important domains.
[5] MMAO-Bench: MultiModal All in One Benchmark Reveals Compositional Law between Uni-modal and Omni-modal in OmniModels
Chen Chen,ZeYang Hu,Fengjiao Chen,Liya Ma,Jiaxing Liu,Xiaoyu Li,Xuezhi Cao
Main category: cs.CL
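TL;DR: MMAO-Bench is a new high-quality multimodal benchmark designed to assess the compositional law between uni-modal and omni-modal capabilities.
Details
Motivation: Multimodal large models are evolving from uni-modal understanding toward unified omni-modal (vision, audio, language) understanding, but the relationship between uni-modal and omni-modal capability remains unclear and requires comprehensive evaluation to drive progress.
Contribution: MMAO-Bench, comprising 1,880 human-curated samples across 44 task types, plus a new multi-step open-ended question type for assessing complex reasoning.
Method: A benchmark with diverse task types is designed, and experiments analyze the compositional law between uni-modal and omni-modal performance.
Result: Experiments show that omni-modal capability acts as a bottleneck for weak models and as a synergistic boost for strong models.
Insight: Gains in omni-modal capability depend on stronger uni-modal capabilities, with different effect patterns at different model capability levels.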
Abstract: Multimodal large language models have been progressing from uni-modal understanding toward unifying visual, audio, and language modalities; such models are collectively termed omni models. However, the correlation between uni-modal and omni-modal capabilities remains unclear, which requires comprehensive evaluation to drive the evolution of omni model intelligence. In this work, we propose a novel, high-quality, and diverse omni model benchmark, MultiModal All in One Benchmark (MMAO-Bench), which effectively assesses both uni-modal and omni-modal understanding capabilities. The benchmark consists of 1880 human-curated samples across 44 task types, and an innovative multi-step open-ended question type that better assesses complex reasoning tasks. Experimental results show a compositional law between cross-modal and uni-modal performance: omni-modal capability manifests as a bottleneck effect on weak models, while exhibiting synergistic promotion on strong models.
[6] Are they lovers or friends? Evaluating LLMs’ Social Reasoning in English and Korean Dialogues
Eunsu Kim,Junyeong Park,Juhyun Oh,Kiwoong Park,Seyoung Song,A. Seza Dogruoz,Najoung Kim,Alice Oh
Main category: cs.CL
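TL;DR: The paper studies the social reasoning of large language models (LLMs) on English and Korean dialogues. Using the SCRIPTS dataset, it finds that models do better on English than on Korean and show notable social biases and reasoning failures.
Details
Motivation: As LLMs are widely used in human-AI interaction, their social reasoning ability becomes critical. The paper evaluates how well LLMs infer interpersonal relationships, particularly across languages and cultural contexts.
Contribution: The SCRIPTS dataset of English and Korean dialogues annotated with probabilistic relationship labels, and an analysis exposing the English-Korean performance gap and the limits of LLM social reasoning.
Method: Nine LLMs are evaluated on SCRIPTS, focusing on their ability to infer interpersonal relationships and on the effect of thinking models and chain-of-thought prompting.
Result: Current proprietary LLMs reach about 75-80% on the English dataset, dropping to 58-69% on Korean; models select Unlikely relationships in 10-25% of responses; social biases are occasionally amplified.
Insight: The social reasoning of current LLMs is limited, especially in cross-lingual and cross-cultural settings, which highlights the need for socially aware language models.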
Abstract: As large language models (LLMs) are increasingly used in human-AI interactions, their social reasoning capabilities in interpersonal contexts are critical. We introduce SCRIPTS, a 1k-dialogue dataset in English and Korean, sourced from movie scripts. The task involves evaluating models’ social reasoning capability to infer the interpersonal relationships (e.g., friends, sisters, lovers) between speakers in each dialogue. Each dialogue is annotated with probabilistic relational labels (Highly Likely, Less Likely, Unlikely) by native (or equivalent) Korean and English speakers from Korea and the U.S. Evaluating nine models on our task, current proprietary LLMs achieve around 75-80% on the English dataset, whereas their performance on Korean drops to 58-69%. More strikingly, models select Unlikely relationships in 10-25% of their responses. Furthermore, we find that thinking models and chain-of-thought prompting, effective for general reasoning, provide minimal benefits for social reasoning and occasionally amplify social biases. Our findings reveal significant limitations in current LLMs’ social reasoning capabilities, highlighting the need for efforts to develop socially-aware language models.
[7] Re:Member: Emotional Question Generation from Personal Memories
Zackary Rackauckas,Nobuaki Minematsu,Julia Hirschberg
Main category: cs.CL
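TL;DR: Re:Member is an emotional question generation system grounded in personal memories. It combines personal videos of the user with emotionally expressive spoken questions to make second language (L2) learning more interactive and emotionally engaging.
Details
Motivation: Conventional L2 learning tools lack emotional interaction and personalized content. Re:Member addresses this gap by using personal memories and affective design to improve the learning experience.
Contribution: A modular generation pipeline combining WhisperX transcript alignment, 3-frame visual sampling, and Style-BERT-VITS2 emotional synthesis to generate emotionally expressive questions.
Method: A multimodal approach: 1) WhisperX aligns transcripts; 2) 3-frame sampling extracts visual context; 3) Style-BERT-VITS2 synthesizes expressive speech, matching emotional tone to content.
Result: The system generates emotionally rich questions consistent with the visual context and is designed to encourage affective recall and conversational engagement.
Insight: Affective and personalized content plays an important role in educational technology and can raise learner engagement.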
Abstract: We present Re:Member, a system that explores how emotionally expressive, memory-grounded interaction can support more engaging second language (L2) learning. By drawing on users’ personal videos and generating stylized spoken questions in the target language, Re:Member is designed to encourage affective recall and conversational engagement. The system aligns emotional tone with visual context, using expressive speech styles such as whispers or late-night tones to evoke specific moods. It combines WhisperX-based transcript alignment, 3-frame visual sampling, and Style-BERT-VITS2 for emotional synthesis within a modular generation pipeline. Designed as a stylized interaction probe, Re:Member highlights the role of affect and personal media in learner-centered educational technologies.
[8] A Graph Signal Processing Framework for Hallucination Detection in Large Language Models
Valentin Noël
Main category: cs.CL
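TL;DR: The paper proposes a graph signal processing framework for detecting hallucinations in large language models. It models the attention of Transformer layers as dynamic graphs and defines diagnostics through spectral analysis; experiments show that these diagnostics separate factual statements from hallucinations.
Details
Motivation: Large language models are prone to hallucination (non-factual content), and effective detection methods are lacking. The paper uses spectral analysis to reveal patterns characteristic of hallucinations and to build a detection framework.
Contribution: 1) A framework modeling Transformer layers as dynamic graphs; 2) diagnostics including Dirichlet energy, spectral entropy, and high-frequency energy ratio; 3) experiments validating that these diagnostics discriminate between hallucination types.
Method: Using graph signal processing, attention is treated as a dynamic graph with token embeddings as signals, and spectral properties such as Dirichlet energy and spectral entropy are analyzed to detect hallucinations.
Result: Factual statements show an energy distribution with low-frequency convergence, while hallucinations show distinct spectral signatures. A detector based on spectral features reaches 88.75% accuracy versus 75% for a perplexity-based baseline.
Insight: Spectral geometry may capture reasoning patterns and error behaviors of language models and could serve as a new framework for hallucination detection.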
Abstract: Large language models achieve impressive results but distinguishing factual reasoning from hallucinations remains challenging. We propose a spectral analysis framework that models transformer layers as dynamic graphs induced by attention, with token embeddings as signals on these graphs. Through graph signal processing, we define diagnostics including Dirichlet energy, spectral entropy, and high-frequency energy ratios, with theoretical connections to computational stability. Experiments across GPT architectures suggest universal spectral patterns: factual statements exhibit consistent “energy mountain” behavior with low-frequency convergence, while different hallucination types show distinct signatures. Logical contradictions destabilize spectra with large effect sizes ($g>1.0$), semantic errors remain stable but show connectivity drift, and substitution hallucinations display intermediate perturbations. A simple detector using spectral signatures achieves 88.75% accuracy versus 75% for perplexity-based baselines, demonstrating practical utility. These findings indicate that spectral geometry may capture reasoning patterns and error behaviors, potentially offering a framework for hallucination detection in large language models.
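Example (illustrative sketch): the three diagnostics named in the abstract can be computed from one attention matrix and one matrix of token embeddings using standard graph signal processing definitions. The symmetrization of attention, the combinatorial Laplacian, and the `hf_fraction` cutoff are assumptions made here for illustration; the paper may define these quantities differently.

```python
import numpy as np

def spectral_diagnostics(attn, signal, hf_fraction=0.5):
    """Dirichlet energy, spectral entropy, and high-frequency energy ratio.

    attn:   (n, n) attention weights between n tokens
    signal: (n, d) token embeddings treated as a graph signal (assumed non-zero)
    """
    w = 0.5 * (attn + attn.T)              # assumption: symmetrize attention into edge weights
    np.fill_diagonal(w, 0.0)               # drop self-loops
    lap = np.diag(w.sum(axis=1)) - w       # combinatorial Laplacian L = D - W
    eigvals, eigvecs = np.linalg.eigh(lap)  # graph frequencies in ascending order
    coeffs = eigvecs.T @ signal            # graph Fourier transform of the signal
    energy = (coeffs ** 2).sum(axis=1)     # energy carried by each frequency
    total = energy.sum()
    p = energy / total
    nonzero = p[p > 0]
    cut = int(len(eigvals) * (1.0 - hf_fraction))  # index where the "high" band starts
    return {
        "dirichlet_energy": float(np.trace(signal.T @ lap @ signal)),
        "spectral_entropy": float(-(nonzero * np.log(nonzero)).sum()),
        "high_freq_ratio": float(energy[cut:].sum() / total),
    }

# Uniform attention over four tokens: a constant signal is perfectly smooth,
# while an alternating signal is not.
attn = np.full((4, 4), 0.25)
smooth = spectral_diagnostics(attn, np.ones((4, 2)))
rough = spectral_diagnostics(attn, np.array([[1.0], [-1.0], [1.0], [-1.0]]))
```

A smooth signal concentrates its energy at low graph frequencies, which is the kind of low-frequency convergence the abstract associates with factual statements.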
[9] Training-Free Spectral Fingerprints of Voice Processing in Transformers
Valentin Noël
Main category: cs.CL
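TL;DR: Using spectral analysis, the paper identifies computational fingerprints of different Transformer architectures when processing voice alternation across languages. Architecture and training emphasis leave detectable traces in attention graphs, and these traces correlate strongly with behavioral differences.
Details
Motivation: The study asks how different Transformer architectures realize the same linguistic computation through different connectivity patterns, and whether those patterns can be detected by spectral analysis.
Contribution: A training-free spectral analysis method for identifying computational fingerprints of Transformer models, with evidence that the fingerprints differ across languages and architectures.
Method: Graph signal processing is applied to attention-induced token graphs, tracking changes in algebraic connectivity (the Fiedler value) in the early layers (2-5).
Result: Phi-3-Mini shows a marked early-layer disruption in English, while other models respond differently, with Qwen2.5-7B shifting most on morphologically rich languages; the spectral signatures correlate strongly with behavioral differences.
Insight: Training emphasis and architectural design leave detectable traces in attention structure, which can serve as a diagnostic for revealing model biases.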
Abstract: Different transformer architectures implement identical linguistic computations via distinct connectivity patterns, yielding model-imprinted “computational fingerprints” detectable through spectral analysis. Using graph signal processing on attention-induced token graphs, we track changes in algebraic connectivity (Fiedler value, $\Delta\lambda_2$) under voice alternation across 20 languages and three model families, with a prespecified early window (layers 2–5). Our analysis uncovers clear architectural signatures: Phi-3-Mini shows a dramatic English-specific early-layer disruption ($\overline{\Delta\lambda_2}_{[2,5]} \approx -0.446$) while effects in 19 other languages are minimal, consistent with public documentation that positions the model primarily for English use. Qwen2.5-7B displays small, distributed shifts that are largest for morphologically rich languages, and LLaMA-3.2-1B exhibits systematic but muted responses. These spectral signatures correlate strongly with behavioral differences (Phi-3: $r=-0.976$) and are modulated by targeted attention head ablations, linking the effect to early attention structure and confirming functional relevance. Taken together, the findings are consistent with the view that training emphasis can leave detectable computational imprints: specialized processing strategies that manifest as measurable connectivity patterns during syntactic transformations. Beyond voice alternation, the framework differentiates reasoning modes, indicating utility as a simple, training-free diagnostic for revealing architectural biases and supporting model reliability analysis.
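Example (illustrative sketch): the Fiedler value is the second-smallest eigenvalue of a graph Laplacian, and the abstract averages its change over layers 2-5. The code below computes both for per-layer attention matrices. The symmetrization step, the sign convention (alternated minus original), and the dictionary layout keyed by layer number are assumptions for illustration, not the paper's exact procedure.

```python
import numpy as np

def fiedler_value(attn):
    """Algebraic connectivity: second-smallest eigenvalue of the graph Laplacian."""
    w = 0.5 * (attn + attn.T)          # assumption: symmetrize attention into edge weights
    np.fill_diagonal(w, 0.0)           # drop self-loops
    lap = np.diag(w.sum(axis=1)) - w   # combinatorial Laplacian L = D - W
    return float(np.linalg.eigvalsh(lap)[1])  # eigenvalues come back in ascending order

def mean_delta_lambda2(attn_original, attn_alternated, layers=range(2, 6)):
    """Average change in Fiedler value over the early window (layers 2-5)."""
    deltas = [
        fiedler_value(attn_alternated[layer]) - fiedler_value(attn_original[layer])
        for layer in layers
    ]
    return float(np.mean(deltas))

# Toy graphs on three tokens: a path (weakly connected) and a triangle (fully connected).
path = np.array([[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 0.0]])
triangle = np.array([[0.0, 1.0, 1.0], [1.0, 0.0, 1.0], [1.0, 1.0, 0.0]])
original = {layer: triangle for layer in range(2, 6)}
alternated = {layer: path for layer in range(2, 6)}
delta = mean_delta_lambda2(original, alternated)
```

A negative value, as in this toy case, means the token graph became less well connected after the alternation, which is the direction of the Phi-3-Mini effect reported in the abstract.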
[10] Tibetan Language and AI: A Comprehensive Survey of Resources, Methods and Challenges
Cheng Huang,Nyima Tashi,Fan Gao,Yutong Liu,Jiahao Li,Hao Tian,Siyang Jiang,Thupten Tsering,Ban Ma-bao,Renzeg Duojie,Gadeng Luosang,Rinchen Dongrub,Dorje Tashi,Jin Zhang,Xiao Feng,Hao Wang,Jie Tang,Guojie Tang,Xiangxiang Wang,Jia Zhang,Tsengdar Lee,Yongbin Yu
Main category: cs.CL
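TL;DR: The paper surveys the state of Tibetan AI research, covering data resources, NLP tasks, machine translation, speech recognition, and the development of large language models. It identifies challenges such as data sparsity, orthographic variation, and the lack of unified evaluation standards, and discusses the potential of cross-lingual transfer and multimodal learning.
Details
Motivation: Tibetan, a major low-resource language in Asia, has distinctive linguistic and sociocultural characteristics, but AI research on it is limited by the lack of accessible data resources, standardized benchmarks, and dedicated tools. The paper aims to fill this gap and advance Tibetan AI research.
Contribution: A systematic categorization and evaluation of resources and methods in Tibetan AI, with potential directions and technical paths for future research.
Method: Existing datasets and tools are surveyed and categorized, methods are evaluated across tasks, and performance is compared where possible.
Result: The survey summarizes the state of Tibetan AI research and highlights data sparsity and the lack of standardized evaluation.
Insight: Cross-lingual transfer and multimodal learning are promising routes to mitigate data scarcity in Tibetan AI research, alongside community-driven resource creation.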
Abstract: Tibetan, one of the major low-resource languages in Asia, presents unique linguistic and sociocultural characteristics that pose both challenges and opportunities for AI research. Despite increasing interest in developing AI systems for underrepresented languages, Tibetan has received limited attention due to a lack of accessible data resources, standardized benchmarks, and dedicated tools. This paper provides a comprehensive survey of the current state of Tibetan AI, covering textual and speech data resources, NLP tasks, machine translation, speech recognition, and recent developments in LLMs. We systematically categorize existing datasets and tools, evaluate methods used across different tasks, and compare performance where possible. We also identify persistent bottlenecks such as data sparsity, orthographic variation, and the lack of unified evaluation metrics. Additionally, we discuss the potential of cross-lingual transfer, multi-modal learning, and community-driven resource creation. This survey aims to serve as a foundational reference for future work on Tibetan AI research and encourages collaborative efforts to build an inclusive and sustainable AI ecosystem for low-resource languages.
[11] “You Are Rejected!”: An Empirical Study of Large Language Models Taking Hiring Evaluations
Dingjie Fu,Dianxing Shi
Main category: cs.CL
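TL;DR: The paper investigates whether large language models (LLMs) can pass the hiring evaluations of technology companies and finds that none of the evaluated LLMs pass.
Details
Motivation: With the rapid advance of AI, companies need to screen large numbers of engineering applicants efficiently. Since LLMs perform well on coding and reasoning tasks, the authors ask whether they can pass a hiring evaluation.
Contribution: An empirical study of LLM performance on a standardized hiring evaluation, revealing a significant inconsistency between model answers and the company-referenced solutions.
Method: State-of-the-art LLMs generate responses to a professional assessment questionnaire, which are then compared against the company-referenced solutions.
Result: All evaluated LLMs fail the hiring evaluation; their answers diverge clearly from the standard the company requires.
Insight: At their current level, LLMs cannot yet match human engineers in hiring evaluations, which underlines their limitations in practical use.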
Abstract: With the proliferation of the internet and the rapid advancement of Artificial Intelligence, leading technology companies face an urgent annual demand for a considerable number of software and algorithm engineers. To efficiently and effectively identify high-potential candidates from thousands of applicants, these firms have established a multi-stage selection process, which crucially includes a standardized hiring evaluation designed to assess job-specific competencies. Motivated by the demonstrated prowess of Large Language Models (LLMs) in coding and reasoning tasks, this paper investigates a critical question: Can LLMs successfully pass these hiring evaluations? To this end, we conduct a comprehensive examination of a widely used professional assessment questionnaire. We employ state-of-the-art LLMs to generate responses and subsequently evaluate their performance. Contrary to any prior expectation of LLMs being ideal engineers, our analysis reveals a significant inconsistency between the model-generated answers and the company-referenced solutions. Our empirical findings lead to a striking conclusion: All evaluated LLMs fail to pass the hiring evaluation.
[12] Think Straight, Stop Smart: Structured Reasoning for Efficient Multi-Hop RAG
Jihwan Bang,Juntae Lee,Seunghan Yang,Sungha Choi
Main category: cs.CL
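TL;DR: TSSS is an efficient multi-hop RAG framework that uses template-based reasoning and a retriever-based terminator to cut redundant token generation and speed up inference.
Details
Motivation: Existing multi-hop RAG methods are inefficient: they regenerate predictable tokens and rely on stochastic stopping, which wastes compute and makes termination unstable.
Contribution: The TSSS framework, which combines template-based reasoning with deterministic termination to improve inference efficiency and answer reliability.
Method: 1) Template-based reasoning: cache recurring prefixes and anchor sub-queries to the main question, reducing token generation; 2) retriever-based terminator: halt reasoning once sub-queries collapse into repetition.
Result: State-of-the-art accuracy on HotpotQA, 2WikiMultiHop, and MuSiQue, with competitive efficiency among RAG-CoT approaches.
Insight: Structured reasoning and deterministic termination can substantially improve multi-hop RAG efficiency, which suits efficiency-constrained settings such as on-device inference.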
Abstract: Multi-hop retrieval-augmented generation (RAG) is a promising strategy for complex reasoning, yet existing iterative prompting approaches remain inefficient. They often regenerate predictable token sequences at every step and rely on stochastic stopping, leading to excessive token usage and unstable termination. We propose TSSS (Think Straight, Stop Smart), a structured multi-hop RAG framework designed for efficiency. TSSS introduces (i) a template-based reasoning that caches recurring prefixes and anchors sub-queries to the main question, reducing token generation cost while promoting stable reasoning, and (ii) a retriever-based terminator, which deterministically halts reasoning once additional sub-queries collapse into repetition. This separation of structured reasoning and termination control enables both faster inference and more reliable answers. On HotpotQA, 2WikiMultiHop, and MuSiQue, TSSS achieves state-of-the-art accuracy and competitive efficiency among RAG-CoT approaches, highlighting its effectiveness in efficiency-constrained scenarios such as on-device inference.
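Example (illustrative sketch): one simple way to picture a retriever-based terminator is a loop that stops as soon as a new sub-query retrieves nothing that has not been seen before. The stopping rule, the `next_subquery` and `retrieve` callables, and the toy knowledge base are all hypothetical; the paper's actual criterion may differ.

```python
def multi_hop(question, next_subquery, retrieve, max_hops=5):
    """Collect evidence hop by hop; stop deterministically when retrieval repeats."""
    evidence = []
    hops = 0
    for _ in range(max_hops):
        sub_query = next_subquery(question, evidence)
        docs = retrieve(sub_query)
        new_docs = [d for d in docs if d not in evidence]
        if not new_docs:   # the sub-query collapsed into repetition: halt
            break
        evidence.extend(new_docs)
        hops += 1
    return evidence, hops

# Hypothetical two-hop knowledge base and planner.
KB = {
    "who directed film X": ["doc_director"],
    "where was the director born": ["doc_birthplace"],
}
SUB_QUERIES = list(KB)

def toy_subquery(question, evidence):
    # Asks the first hop, then the second, then keeps repeating the second.
    return SUB_QUERIES[min(len(evidence), len(SUB_QUERIES) - 1)]

def toy_retrieve(sub_query):
    return KB[sub_query]

evidence, hops = multi_hop("where was the director of film X born", toy_subquery, toy_retrieve)
```

Because the stop condition depends only on retrieved documents, the number of hops is the same on every run, unlike a stopping decision sampled from the generator.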
[13] When Facts Change: Probing LLMs on Evolving Knowledge with evolveQA
Nishanth Sridhar Nakshatri,Shamik Roy,Manoj Ghuhan Arivazhagan,Hanhan Zhou,Vinayshekhar Bannihatti Kumar,Rashmi Gangadharaiah
Main category: cs.CL
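TL;DR: The paper introduces evolveQA, a benchmark for evaluating how large language models (LLMs) handle temporally evolving knowledge. Built from three real-world time-stamped corpora, it reveals significant performance drops when LLMs face evolving knowledge.
Details
Motivation: Prior work mostly evaluates temporal knowledge conflicts using structured knowledge bases such as Wikidata, but it focuses on widely covered popular entities and lacks the dynamic structure needed to evaluate LLMs with different knowledge cut-off dates fairly.
Contribution: evolveQA, a benchmark focused on LLM performance under real-world knowledge evolution.
Method: evolveQA is built from three time-stamped corpora (AWS updates, Azure changes, and WHO disease outbreak reports), with questions and gold answers tailored to different LLM knowledge cut-off dates.
Result: Across 12 open and closed-source LLMs, performance on evolveQA drops by up to 31% relative to static knowledge questions.
Insight: LLMs handle evolving knowledge poorly, which suggests future work should focus on adapting to and updating temporal knowledge.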
Abstract: LLMs often fail to handle temporal knowledge conflicts, i.e., contradictions arising when facts evolve over time within their training data. Existing studies evaluate this phenomenon through benchmarks built on structured knowledge bases like Wikidata, but they focus on widely-covered, easily-memorized popular entities and lack the dynamic structure needed to fairly evaluate LLMs with different knowledge cut-off dates. We introduce evolveQA, a benchmark specifically designed to evaluate LLMs on temporally evolving knowledge, constructed from 3 real-world, time-stamped corpora: AWS updates, Azure changes, and WHO disease outbreak reports. Our framework identifies naturally occurring knowledge evolution and generates questions with gold answers tailored to different LLM knowledge cut-off dates. Through extensive evaluation of 12 open and closed-source LLMs across 3 knowledge probing formats, we demonstrate significant performance drops of up to 31% on evolveQA compared to static knowledge questions.
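Example (illustrative sketch): "gold answers tailored to different LLM knowledge cut-off dates" can be read as picking, for each model, the most recent version of a fact published on or before its cut-off. This reading, the `gold_answer` helper, and the timeline values are assumptions for illustration; the values are invented and do not describe any real service.

```python
from datetime import date

def gold_answer(timeline, cutoff):
    """Latest fact published on or before the cut-off, or None if nothing is visible yet."""
    visible = [fact for stamp, fact in sorted(timeline) if stamp <= cutoff]
    return visible[-1] if visible else None

# Invented timeline of one evolving fact (not real product data).
timeline = [
    (date(2022, 3, 1), "limit is 100"),
    (date(2023, 6, 15), "limit is 250"),
    (date(2024, 9, 1), "limit is 500"),
]

answer_for_2023_model = gold_answer(timeline, date(2023, 12, 31))
answer_for_2025_model = gold_answer(timeline, date(2025, 1, 1))
answer_for_2021_model = gold_answer(timeline, date(2021, 1, 1))
```

Under this reading two models with different cut-offs can both be correct while giving different answers, which is why a single static gold label would be unfair.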
[14] Interpretable Question Answering with Knowledge Graphs
Kartikeya Aneja,Manasvi Srivastava,Subhayan Das,Nagender Aneja
Main category: cs.CL
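TL;DR: The paper presents a knowledge graph question answering system that does not rely on retrieval-augmented generation (RAG) with large language models (LLMs); a small paraphraser model generates answers from knowledge graph retrieval results. The system is evaluated on the CRAG benchmark.
Details
Motivation: Existing question answering systems usually rely on LLMs and RAG, which may lack interpretability and are computationally expensive. The paper explores a knowledge-graph-based alternative to improve transparency and efficiency.
Contribution: 1) A question answering system built entirely on knowledge graph retrieval; 2) a small paraphraser model used in place of an LLM; 3) an evaluation of the approach on the CRAG benchmark.
Method: Two stages: 1) pre-process a document to generate question-answer pairs; 2) convert the pairs into a knowledge graph, retrieve from it using embeddings and fuzzy techniques, then re-rank and paraphrase the results.
Result: With LLM-as-a-judge evaluation on the CRAG benchmark, accuracy is 71.9% using LLAMA-3.2 and 54.4% using GPT-3.5-Turbo.
Insight: Knowledge graph retrieval can stand in for LLM-based generation in question answering, and the small paraphraser shows the potential of lightweight solutions.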
Abstract: This paper presents a question answering system that operates exclusively on a knowledge graph retrieval without relying on retrieval augmented generation (RAG) with large language models (LLMs). Instead, a small paraphraser model is used to paraphrase the entity relationship edges retrieved from querying the knowledge graph. The proposed pipeline is divided into two main stages. The first stage involves pre-processing a document to generate sets of question-answer (QA) pairs. The second stage converts these QAs into a knowledge graph from which graph-based retrieval is performed using embeddings and fuzzy techniques. The graph is queried, re-ranked, and paraphrased to generate a final answer. This work includes an evaluation using LLM-as-a-judge on the CRAG benchmark, which resulted in accuracies of 71.9% and 54.4% using LLAMA-3.2 and GPT-3.5-Turbo, respectively.
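Example (illustrative sketch): the fuzzy part of the graph retrieval step can be pictured as scoring each edge of the knowledge graph against the question with a string-similarity ratio. The sketch uses `difflib.SequenceMatcher` from the Python standard library as a stand-in; the triples, the scoring on subject plus relation, and the template that replaces the paraphraser are assumptions for illustration, and the embedding and re-ranking steps are omitted.

```python
from difflib import SequenceMatcher

# Hypothetical knowledge graph stored as (subject, relation, object) edges.
TRIPLES = [
    ("Marie Curie", "won", "the Nobel Prize in Physics in 1903"),
    ("Paris", "is the capital of", "France"),
]

def fuzzy_score(question, triple):
    """String similarity between the question and the subject-relation part of an edge."""
    subject, relation, _ = triple
    return SequenceMatcher(None, question.lower(), f"{subject} {relation}".lower()).ratio()

def retrieve(question, triples, k=1):
    """Return the k edges that best match the question."""
    return sorted(triples, key=lambda t: fuzzy_score(question, t), reverse=True)[:k]

def verbalize(triple):
    """Template stand-in for the small paraphraser model."""
    subject, relation, obj = triple
    return f"{subject} {relation} {obj}."

best = retrieve("What did Marie Curie win?", TRIPLES)[0]
answer = verbalize(best)
```

Every answer produced this way traces back to a single retrieved edge, which is the interpretability property the paper emphasizes.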
[15] Multi-Faceted Evaluation of Tool-Augmented Dialogue Systems
Zhaoyi Joey Hou,Tanya Shourya,Yingfan Wang,Shamik Roy,Vinayshekhar Bannihatti Kumar,Rashmi Gangadharaiah
Main category: cs.CL
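TL;DR: The paper proposes the TRACE benchmark and the SCOPE framework for systematically evaluating diverse error patterns in tool-augmented dialogue systems, addressing the failure of existing evaluation methods to capture critical errors in multi-turn dialogues.
Details
Motivation: Existing methods assess either user satisfaction or tool-calling capability, but in multi-turn tool-augmented dialogues an agent may misinterpret tool results while still appearing satisfactory to the user, so critical errors go unnoticed.
Contribution: 1) TRACE, a benchmark of synthesized conversations covering diverse error cases; 2) SCOPE, a framework that automatically discovers error patterns and evaluation rubrics.
Method: A multi-turn tool-augmented dialogue dataset (TRACE) is synthesized, and SCOPE automatically analyzes error patterns and applies the derived rubrics for evaluation.
Result: SCOPE significantly outperforms the baseline, particularly on challenging cases where user satisfaction signals are misleading.
Insight: Evaluation of tool-augmented dialogue systems must look for latent errors across turns rather than rely on user satisfaction alone.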
Abstract: Evaluating conversational AI systems that use external tools is challenging, as errors can arise from complex interactions among user, agent, and tools. While existing evaluation methods assess either user satisfaction or agents’ tool-calling capabilities, they fail to capture critical errors in multi-turn tool-augmented dialogues, such as when agents misinterpret tool results yet appear satisfactory to users. We introduce TRACE, a benchmark of systematically synthesized tool-augmented conversations covering diverse error cases, and SCOPE, an evaluation framework that automatically discovers diverse error patterns and evaluation rubrics in tool-augmented dialogues. Experiments show SCOPE significantly outperforms the baseline, particularly on challenging cases where user satisfaction signals are misleading.
[16] DiSRouter: Distributed Self-Routing for LLM Selections
Hang Zheng,Hongshen Xu,Yongkai Lin,Shuai Fan,Lu Chen,Kai Yu
Main category: cs.CL
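TL;DR: The paper proposes DiSRouter, a distributed self-routing paradigm that addresses the flexibility and performance limits of LLM selection.
Details
Motivation: Existing LLM selection relies on an external centralized router, which is inflexible and limited in performance because it cannot fully understand the knowledge boundaries of different LLMs.
Contribution: 1) DiSRouter, a distributed self-routing paradigm that avoids the drawbacks of centralized control; 2) a two-stage Self-Awareness Training pipeline that strengthens the ability of each LLM to judge its own competence.
Method: 1) Distributed routing: a query traverses a network of LLM agents, each independently deciding whether to answer or route onward; 2) two-stage Self-Awareness Training: improves the self-knowledge of each LLM so that it can judge its own competence.
Result: DiSRouter significantly outperforms existing routing methods across scenarios, distinguishes easy from hard queries, and generalizes well to out-of-domain tasks.
Insight: Leveraging the intrinsic self-awareness of an LLM is more effective than external assessment, pointing toward more modular and efficient multi-agent systems.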
Abstract: The proliferation of Large Language Models (LLMs) has created a diverse ecosystem of models with highly varying performance and costs, necessitating effective query routing to balance performance and expense. Current routing systems often rely on a centralized external router trained on a fixed set of LLMs, making them inflexible and prone to poor performance since the small router cannot fully understand the knowledge boundaries of different LLMs. We introduce DiSRouter (Distributed Self-Router), a novel paradigm that shifts from centralized control to distributed routing. In DiSRouter, a query traverses a network of LLM agents, each independently deciding whether to answer or route to other agents based on its own self-awareness, i.e., its ability to judge its own competence. This distributed design offers superior flexibility, scalability, and generalizability. To enable this, we propose a two-stage Self-Awareness Training pipeline that enhances each LLM’s self-awareness. Extensive experiments demonstrate that DiSRouter significantly outperforms existing routing methods in utility across various scenarios, effectively distinguishes between easy and hard queries, and shows strong generalization to out-of-domain tasks. Our work validates that leveraging an LLM’s intrinsic self-awareness is more effective than external assessment, paving the way for more modular and efficient multi-agent systems.
[17] SheetBrain: A Neuro-Symbolic Agent for Accurate Reasoning over Complex and Large Spreadsheets
Ziwei Wang,Jiayuan Su,Mengyu Zhou,Huaxing Zeng,Mengni Jia,Xiao Lv,Haoyu Dong,Xiaojun Ma,Shi Han,Dongmei Zhang
Main category: cs.CL
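TL;DR: SheetBrain is a neuro-symbolic dual-workflow agent framework for accurate reasoning over spreadsheets, supporting both question answering and manipulation tasks.
Details
Motivation: Large language models (LLMs) struggle to capture the structure of complex spreadsheets and to ensure reasoning correctness, so a more effective tool is needed.
Contribution: The SheetBrain framework, with three core modules (understanding, execution, and validation), which significantly improves the accuracy of spreadsheet reasoning.
Method: The framework combines neural and symbolic reasoning, using a Python sandbox and an Excel helper toolkit for multi-turn reasoning, with a validation module that checks results for correctness.
Result: SheetBrain significantly improves accuracy on public benchmarks and on the new SheetBench.
Insight: A combined neuro-symbolic framework has advantages on complex spreadsheet tasks, and the validation module is key to reliable reasoning.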
Abstract: Understanding and reasoning over complex spreadsheets remain fundamental challenges for large language models (LLMs), which often struggle with accurately capturing the complex structure of tables and ensuring reasoning correctness. In this work, we propose SheetBrain, a neuro-symbolic dual workflow agent framework designed for accurate reasoning over tabular data, supporting both spreadsheet question answering and manipulation tasks. SheetBrain comprises three core modules: an understanding module, which produces a comprehensive overview of the spreadsheet, including a sheet summary and query-based problem insight to guide reasoning; an execution module, which integrates a Python sandbox with preloaded table-processing libraries and an Excel helper toolkit for effective multi-turn reasoning; and a validation module, which verifies the correctness of reasoning and answers, triggering re-execution when necessary. We evaluate SheetBrain on multiple public tabular QA and manipulation benchmarks, and introduce SheetBench, a new benchmark targeting large, multi-table, and structurally complex spreadsheets. Experimental results show that SheetBrain significantly improves accuracy on both existing benchmarks and the more challenging scenarios presented in SheetBench. Our code is publicly available at https://github.com/microsoft/SheetBrain.
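Example (illustrative sketch): the interplay of the three modules can be pictured as a loop in which validation feedback triggers re-execution. The function names, signatures, and toy table below are hypothetical and are not taken from the SheetBrain code base; the sketch only shows the control flow, with plain Python functions standing in for the LLM-driven modules.

```python
def run_agent(question, table, understand, execute, validate, max_rounds=3):
    """Understand once, then execute and validate until the answer passes."""
    overview = understand(table, question)
    feedback = None
    answer = None
    for round_no in range(1, max_rounds + 1):
        answer = execute(question, table, overview, feedback)
        ok, feedback = validate(question, table, answer)
        if ok:
            return answer, round_no
    return answer, max_rounds

# Toy stand-ins for the three modules (hypothetical).
def toy_understand(table, question):
    return {"columns": sorted(table[0]), "rows": len(table)}

def toy_execute(question, table, overview, feedback):
    total = 0
    for row in table:
        cell = row["sales"]
        if not cell.isdigit():
            if feedback is None:
                return None   # first attempt gives up on a dirty cell
            continue          # the retry follows the feedback and skips it
        total += int(cell)
    return total

def toy_validate(question, table, answer):
    if answer is None:
        return False, "skip non-numeric cells"
    return True, None

table = [
    {"region": "north", "sales": "10"},
    {"region": "south", "sales": "n/a"},
    {"region": "east", "sales": "5"},
]
answer, rounds = run_agent("total sales?", table, toy_understand, toy_execute, toy_validate)
```

The validator's feedback is passed back into the executor, so a failed first attempt changes what the second attempt does.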
[18] Difficulty-Controllable Multiple-Choice Question Generation Using Large Language Models and Direct Preference Optimization
Yuto Tomikawa,Masaki Uto
Main category: cs.CL
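TL;DR: The paper proposes a difficulty-controllable multiple-choice question generation method based on a large language model and direct preference optimization (DPO), addressing the inability of prior methods to generate multiple-choice questions directly and their limited difficulty-control accuracy.
Details
Motivation: In education, difficulty-controllable question generation is a key tool for adaptive learning, but existing methods cannot directly generate multiple-choice questions and control difficulty with limited accuracy.
Contribution: A novel difficulty-controllable multiple-choice question generation framework that combines a large language model with direct preference optimization to improve the accuracy of difficulty control.
Method: A large language model is trained with direct preference optimization to improve the accuracy of difficulty control.
Result: The generated multiple-choice questions show improved difficulty control and suit educational use.
Insight: Direct preference optimization can improve model performance on difficulty control, opening a new direction for large language models in education.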
Abstract: Difficulty-controllable question generation for reading comprehension has gained significant attention in the field of education as a fundamental tool for adaptive learning support. Although several neural question generation methods have recently succeeded in controlling difficulty, conventional approaches still face two major limitations. First, they cannot directly generate multiple-choice questions, which are the most widely used question type in educational contexts. Second, they are not explicitly trained to optimize the accuracy of difficulty control, leaving room for further improvement in difficulty controllability. To address these limitations, this study proposes a novel difficulty-controllable multiple-choice question generation method for reading comprehension which leverages a large language model trained using a direct preference optimization technique to improve the accuracy of difficulty control.
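Example (illustrative sketch): direct preference optimization trains a model on pairs of a preferred and a rejected output by increasing the policy's log-probability margin relative to a frozen reference model. The function below is the standard per-pair DPO loss. How the paper builds its pairs is not stated in the abstract; treating a question whose difficulty matches the target as "chosen" and one that misses it as "rejected" is an assumption, and the log-probability values are invented.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair: -log sigmoid(beta * margin)."""
    margin = (policy_chosen_logp - ref_chosen_logp) - (policy_rejected_logp - ref_rejected_logp)
    return math.log1p(math.exp(-beta * margin))  # equals -log(sigmoid(beta * margin))

# Invented sequence log-probabilities for one pair.
# Assumed pairing: "chosen" matches the target difficulty, "rejected" does not.
untrained = dpo_loss(-5.0, -5.0, -5.0, -5.0)  # policy identical to reference
improved = dpo_loss(-4.0, -6.0, -5.0, -5.0)   # policy now favors the chosen question
```

When the policy equals the reference the loss is log 2; it falls as the policy assigns relatively more probability to the chosen question than the reference does.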
[19] TheMCPCompany: Creating General-purpose Agents with Task-specific Tools
Reza Esfandiarpoor,Vishwas Suryanarayanan,Stephen H. Bach,Vishal Chowdhary,Anthony Aue
Main category: cs.CL
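TL;DR: The paper introduces TheMCPCompany, a benchmark for evaluating tool-calling agents on tasks that involve interacting with a variety of real-world services. It shows the potential of advanced reasoning models for tool discovery, while revealing that navigating and combining tools in complex environments remains challenging.
Details
Motivation: Current general-purpose agents mainly interact with their environment through web browsers, yet task-specific tool sets are easier to develop and maintain. The authors set out to examine how practical tool-calling agents are on real-world tasks.
Contribution: Introduces the TheMCPCompany benchmark, with MCP servers exposing over 18,000 tools and manually annotated ground-truth tools for each task; demonstrates the performance potential and cost advantages of tool-calling agents; and analyzes how different models perform with tool retrieval.
Method: Builds MCP servers from the services' REST APIs to provide the tool set, evaluates agents with the annotated ground-truth tools, and then studies how tool retrieval affects agent performance.
Result: With tool retrieval, GPT-5 performs very close to its performance with ground-truth tools, whereas smaller models cannot take full advantage of the available tools. Navigating and combining tools in complex environments remains a challenge.
Insight: Current models still need better reasoning and retrieval to handle complex environments; task-specific tool sets offer a new direction for improving agent performance.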
Abstract: Since the introduction of the Model Context Protocol (MCP), the number of available tools for Large Language Models (LLMs) has increased significantly. These task-specific tool sets offer an alternative to general-purpose tools such as web browsers, while being easier to develop and maintain than GUIs. However, current general-purpose agents predominantly rely on web browsers for interacting with the environment. Here, we introduce TheMCPCompany, a benchmark for evaluating tool-calling agents on tasks that involve interacting with various real-world services. We use the REST APIs of these services to create MCP servers, which include over 18,000 tools. We also provide manually annotated ground-truth tools for each task. In our experiments, we use the ground-truth tools to show the potential of tool-calling agents for both improving performance and reducing costs assuming perfect tool retrieval. Next, we explore agent performance using tool retrieval to study the real-world practicality of tool-based agents. While all models with tool retrieval perform similarly to or better than browser-based agents, smaller models cannot take full advantage of the available tools through retrieval. On the other hand, GPT-5’s performance with tool retrieval is very close to its performance with ground-truth tools. Overall, our work shows that the most advanced reasoning models are effective at discovering tools in simpler environments, but seriously struggle with navigating complex enterprise environments. TheMCPCompany reveals that navigating tens of thousands of tools and combining them in non-trivial ways to solve complex problems is still a challenging task for current models and requires both better reasoning and better retrieval models.
[20] JointCQ: Improving Factual Hallucination Detection with Joint Claim and Query Generation
Fan Xu,Huixuan Zhang,Zhenliang Zhang,Jiahao Wang,Xiaojun Wan
Main category: cs.CL
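TL;DR: JointCQ is a framework that generates claims and queries jointly, addressing the context loss and low query specificity that hamper hallucination detection for large language models. Experiments show that it outperforms previous methods on open-domain QA tasks.
Details
Motivation: Large language models are prone to hallucination, i.e., generating content that appears factual but is unreliable. Current hallucination detection methods perform poorly in the claim extraction and query generation stages, which degrades overall detection.
Contribution: Proposes the JointCQ framework, which generates claims and queries jointly to improve the quality of the inputs to hallucination detection, making downstream search and verification more reliable.
Method: Designs evaluation criteria to filter synthesized training data, then fine-tunes a language model to extract claims and generate queries jointly, so that the downstream inputs are reliable and informative.
Result: JointCQ outperforms previous methods on multiple open-domain QA hallucination detection benchmarks.
Insight: Generating claims and queries jointly is a more effective way to detect hallucinations and supports more trustworthy and transparent language model systems.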
Abstract: Current large language models (LLMs) often suffer from hallucination issues, i.e., generating content that appears factual but is actually unreliable. A typical hallucination detection pipeline involves response decomposition (i.e., claim extraction), query generation, evidence collection (i.e., search or retrieval), and claim verification. However, existing methods exhibit limitations in the first two stages, such as context loss during claim extraction and low specificity in query generation, resulting in degraded performance across the hallucination detection pipeline. In this work, we introduce JointCQ (https://github.com/pku0xff/JointCQ), a joint claim-and-query generation framework designed to construct an effective and efficient claim-query generator. Our framework leverages elaborately designed evaluation criteria to filter synthesized training data, and finetunes a language model for joint claim extraction and query generation, providing reliable and informative inputs for downstream search and verification. Experimental results demonstrate that our method outperforms previous methods on multiple open-domain QA hallucination detection benchmarks, advancing the goal of more trustworthy and transparent language model systems.
[21] KORE: Enhancing Knowledge Injection for Large Multimodal Models via Knowledge-Oriented Augmentations and Constraints
Kailin Jiang,Hongbo Jiang,Ning Jiang,Zhi Gao,Jinhe Bi,Yuchen Ren,Bin Li,Yuntao Du,Lei Liu,Qing Li
Main category: cs.CL
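TL;DR: KORE uses knowledge-oriented augmentations and constraints to improve knowledge injection for large multimodal models while avoiding catastrophic forgetting, so that new knowledge is learned accurately and old knowledge is retained.
Details
Motivation: Large multimodal models encode extensive knowledge during pre-training, but that knowledge is static and cannot be updated in time, which hinders continual learning. Existing methods struggle both to learn new knowledge and to avoid catastrophic forgetting.
Contribution: 1. Proposes KORE, which combines knowledge-oriented augmentations and constraints to inject new knowledge accurately while retaining old knowledge; 2. Stores old knowledge in a covariance matrix and minimizes interference with it when the adapter is initialized.
Method: 1. Automatically converts individual knowledge items into structured knowledge; 2. Stores old knowledge in the covariance matrix of linear-layer activations; 3. Initializes the adapter by projecting the original weights into the null space of that matrix to minimize interference.
Result: On several models, including LLaVA-v1.5 and Qwen2.5-VL, KORE achieves superior new-knowledge injection and effectively mitigates catastrophic forgetting.
Insight: Knowledge injection has to address both accurate learning of new knowledge and retention of old knowledge; structured knowledge conversion and interference minimization offer an effective way to achieve both.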
Abstract: Large Multimodal Models encode extensive factual knowledge in their pre-trained weights. However, their knowledge remains static and limited, unable to keep pace with real-world developments, which hinders continuous knowledge acquisition. Effective knowledge injection thus becomes critical, involving two goals: knowledge adaptation (injecting new knowledge) and knowledge retention (preserving old knowledge). Existing methods often struggle to learn new knowledge and suffer from catastrophic forgetting. To address this, we propose KORE, a synergistic method of KnOwledge-oRientEd augmentations and constraints for injecting new knowledge into large multimodal models while preserving old knowledge. Unlike general text or image data augmentation, KORE automatically converts individual knowledge items into structured and comprehensive knowledge to ensure that the model accurately learns new knowledge, enabling accurate adaptation. Meanwhile, KORE stores previous knowledge in the covariance matrix of the LMM’s linear-layer activations and initializes the adapter by projecting the original weights into the matrix’s null space, defining a fine-tuning direction that minimizes interference with previous knowledge, enabling powerful retention. Extensive experiments on various LMMs, including LLaVA-v1.5-7B, LLaVA-v1.5-13B, and Qwen2.5-VL-7B, show that KORE achieves superior new knowledge injection performance and effectively mitigates catastrophic forgetting.
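The null-space idea in the KORE abstract can be illustrated with a small, pure-Python sketch. It relies on one fact: the null space of the activation covariance matrix is the orthogonal complement of the span of the stored activations. The helper names below are hypothetical, and the exact procedure in the paper (for example, how an approximate null space is chosen when the covariance is full rank) is not given in the source.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def orthonormal_basis(vectors, tol=1e-10):
    """Gram-Schmidt: orthonormal basis for the span of `vectors`."""
    basis = []
    for v in vectors:
        w = list(v)
        for q in basis:
            c = dot(w, q)
            w = [wi - c * qi for wi, qi in zip(w, q)]
        norm = dot(w, w) ** 0.5
        if norm > tol:  # skip vectors already inside the span
            basis.append([wi / norm for wi in w])
    return basis

def project_rows_to_null_space(weight, old_activations):
    """Remove from each weight row its component inside span(old_activations).

    The result maps every stored activation to zero, so an update built from it
    does not change the layer's output on the old inputs.
    """
    basis = orthonormal_basis(old_activations)
    projected = []
    for row in weight:
        r = list(row)
        for q in basis:
            c = dot(r, q)
            r = [ri - c * qi for ri, qi in zip(r, q)]
        projected.append(r)
    return projected

# Toy example: two stored activations in a 3-dimensional input space.
old_activations = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
weight = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
projected = project_rows_to_null_space(weight, old_activations)
```

In the toy example only the third input coordinate is unused by the old activations, so only the third column of the weights survives the projection.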
[22] Balancing Rewards in Text Summarization: Multi-Objective Reinforcement Learning via HyperVolume Optimization
Junjie Song,Yiwen Liu,Dapeng Li,Yin Sun,Shukun Fu,Siqi Chen,Yuji Cao
Main category: cs.CL
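TL;DR: The paper proposes a multi-objective reinforcement learning framework for text summarization based on hypervolume optimization (HVO), which dynamically adjusts how the objectives are weighted to produce more balanced summaries.
Details
Motivation: Text summarization requires optimizing consistency, coherence, relevance, and fluency at the same time, which is challenging. Although LLM-based reinforcement learning has made notable progress, multi-objective optimization for summarization has received little study.
Contribution: Proposes HVO, which dynamically adjusts the scores between groups during the RL reward process, approximates the Pareto front, and generates summaries that are balanced across objectives.
Method: Uses a hypervolume-based strategy that dynamically adjusts the scores between groups so that optimization progressively approaches the Pareto front.
Result: On several representative summarization datasets, HVO outperforms GRPO in overall score and is more balanced across dimensions. A 7B foundation model enhanced by HVO performs comparably to GPT-4 while generating shorter summaries.
Insight: Dynamically adjusting the weights of multiple objectives is an effective route to high-quality summaries, and hypervolume optimization offers a new approach to multi-objective reinforcement learning.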
Abstract: Text summarization is a crucial task that requires the simultaneous optimization of multiple objectives, including consistency, coherence, relevance, and fluency, which presents considerable challenges. Although large language models (LLMs) have demonstrated remarkable performance, enhanced by reinforcement learning (RL), few studies have focused on optimizing the multi-objective problem of summarization through RL based on LLMs. In this paper, we introduce hypervolume optimization (HVO), a novel optimization strategy that dynamically adjusts the scores between groups during the reward process in RL by using the hypervolume method. This method guides the model’s optimization to progressively approximate the Pareto front, thereby generating balanced summaries across multiple objectives. Experimental results on several representative summarization datasets demonstrate that our method outperforms group relative policy optimization (GRPO) in overall scores and shows more balanced performance across different dimensions. Moreover, a 7B foundation model enhanced by HVO performs comparably to GPT-4 in the summarization task, while maintaining a shorter generation length. Our code is publicly available at https://github.com/ai4business-LiAuto/HVO.git
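To see why a hypervolume-style score favors balanced summaries, consider the simplest case: the hypervolume of a single score vector relative to a reference point is the product of its per-objective margins. The sketch below is an illustration under that assumption only; it is not the reward used in the paper, and the score values are made up.

```python
import math

def hypervolume_score(scores, reference=0.0):
    """Volume of the axis-aligned box between a reference point and one score vector.

    Any objective at or below the reference contributes zero volume.
    """
    return math.prod(max(s - reference, 0.0) for s in scores)

# Hypothetical scores for (consistency, coherence, relevance, fluency).
balanced = [0.6, 0.6, 0.6, 0.6]
lopsided = [0.9, 0.9, 0.3, 0.3]

# Both vectors have the same sum, so a plain sum of rewards cannot tell them apart.
sum_balanced, sum_lopsided = sum(balanced), sum(lopsided)
# The product-based volume is larger for the balanced vector.
hv_balanced = hypervolume_score(balanced)
hv_lopsided = hypervolume_score(lopsided)
```

A summed reward treats the two summaries as equally good, while the volume penalizes the one that neglects two of the four objectives.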
[23] Slot Filling as a Reasoning Task for SpeechLLMs
Kadri Hacioglu,Manjunath K E,Andreas Stolcke
Main category: cs.CL
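TL;DR: The paper integrates reasoning into speech large language models (speechLLMs) for the end-to-end slot-filling task. It decomposes the task with a chain-of-thought framework, builds a reasoning dataset, and applies supervised fine-tuning. Experiments show that adding reasoning steps improves performance, but a reasoning text LLM developed mainly for math, logic, and coding may be a weaker foundation for a reasoning speechLLM. Hybrid speechLLMs perform best.
Details
Motivation: Inspired by recent progress in reasoning LLMs, the authors aim to improve speechLLM performance on slot filling by adding reasoning, advancing the integration of speech and language models.
Contribution: 1) Decomposes slot filling into multiple reasoning steps; 2) creates a reasoning dataset; 3) verifies the performance gains of reasoning speechLLMs; 4) finds that reasoning text LLMs built for other domains may be ill-suited to speech tasks; 5) shows the performance advantage of hybrid speechLLMs.
Method: 1) Uses a chain-of-thought framework to decompose the task; 2) generates a reasoning dataset; 3) fine-tunes a speechLLM with supervision; 4) compares text foundation LLMs of different types and sizes; 5) explores a hybrid training strategy that preserves both direct and reasoning modes.
Result: Adding reasoning steps improves performance, but text LLMs developed for certain domains may be unsuitable for speech tasks. Hybrid speechLLMs outperform those fine-tuned with a single mode of operation.
Insight: SpeechLLM tasks may require task-specific reasoning ability, and general-purpose reasoning ability does not necessarily transfer. A hybrid design adapts more flexibly to the task.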
Abstract: We propose integration of reasoning into speech large language models (speechLLMs) for the end-to-end slot-filling task. Inspired by the recent development of reasoning LLMs, we use a chain-of-thought framework to decompose the slot-filling task into multiple reasoning steps, create a reasoning dataset and apply the supervised fine-tuning strategy to a speechLLM. We distinguish between regular and reasoning speechLLMs and experiment with different types and sizes of LLMs as their text foundation models. We demonstrate performance improvements by introducing reasoning (intermediate) steps. However, we show that a reasoning textual LLM developed mainly for math, logic and coding domains might be inferior as a foundation model for a reasoning speechLLM. We further show that hybrid speechLLMs, built on a hybrid text foundation LLM and fine-tuned to preserve both direct and reasoning modes of operation, have better performance than those fine-tuned employing only one mode of operation.
[24] Algorithmic Fairness in NLP: Persona-Infused LLMs for Human-Centric Hate Speech Detection
Ewelina Gajewska,Arda Derbent,Jaroslaw A Chudziak,Katarzyna Budzynska
Main category: cs.CL
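TL;DR: The paper studies how infusing large language models with annotator personas (Persona-LLMs) affects their sensitivity to hate speech, particularly the biases linked to shared or differing identities between annotators and targets. The experiments use Google Gemini and OpenAI GPT-4.1-mini with two persona-prompting methods: shallow persona prompting and a deeply contextualised persona development based on Retrieval-Augmented Generation (RAG) that incorporates richer persona profiles. The paper analyzes how in-group and out-group annotator personas affect detection performance and fairness.
Details
Motivation: Automated hate speech detection systems can be biased when handling different social groups. Drawing on psychological insights about group identity, the paper explores whether persona infusion in LLMs can reduce this bias and improve the fairness of detection.
Contribution: 1. Proposes two persona-prompting methods for LLMs (shallow and RAG-based deeply contextualised); 2. Analyzes how in-group and out-group annotator personas affect model performance and fairness; 3. Shows both the potential and the limitations of persona-based approaches for reducing bias.
Method: 1. Uses the Google Gemini and OpenAI GPT-4.1-mini models; 2. Applies shallow persona prompting and RAG-based deeply contextualised persona development; 3. Evaluates how different persona attributes affect model behavior.
Result: Persona infusion reduces model bias to some extent, especially in in-group annotator settings. It also has limitations: for example, in out-group annotator settings it may not fully eliminate bias.
Insight: Combining social psychology with NLP techniques can make automated hate speech detection fairer, but the effect of persona infusion is limited by persona compatibility and data diversity.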
Abstract: In this paper, we investigate how personalising Large Language Models (Persona-LLMs) with annotator personas affects their sensitivity to hate speech, particularly regarding biases linked to shared or differing identities between annotators and targets. To this end, we employ Google’s Gemini and OpenAI’s GPT-4.1-mini models and two persona-prompting methods: shallow persona prompting and a deeply contextualised persona development based on Retrieval-Augmented Generation (RAG) to incorporate richer persona profiles. We analyse the impact of using in-group and out-group annotator personas on the models’ detection performance and fairness across diverse social groups. This work bridges psychological insights on group identity with advanced NLP techniques, demonstrating that incorporating socio-demographic attributes into LLMs can address bias in automated hate speech detection. Our results highlight both the potential and limitations of persona-based approaches in reducing bias, offering valuable insights for developing more equitable hate speech detection systems.
[25] Modeling Turn-Taking with Semantically Informed Gestures
Varsha Suresh,M. Hamza Mughal,Christian Theobalt,Vera Demberg
Main category: cs.CL
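TL;DR: The paper proposes a turn-taking model for conversation that uses semantic gestures. By extending a dataset and integrating multimodal information, it confirms that gestures play a complementary role in turn-taking prediction.
Details
Motivation: In conversation, humans manage turn-taking with multimodal cues such as speech, gestures, and gaze. Prior work has focused on linguistic and acoustic features and has overlooked the complementary role of gestures. This paper aims to fill that gap.
Contribution: Introduces DnD Gesture++, an extension of the DnD Gesture corpus with 2,663 semantic gesture annotations, and proposes a multimodal turn-taking prediction framework based on Mixture-of-Experts.
Method: Uses multimodal text, audio, and gesture data, and integrates semantic gesture information through a Mixture-of-Experts framework to predict turn-taking.
Result: Adding semantic gestures yields consistent gains over the baselines, confirming the complementary role of gestures in multimodal turn-taking prediction.
Insight: Semantic gestures carry complementary information for turn-taking, and multimodal integration improves turn-taking prediction.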
Abstract: In conversation, humans use multimodal cues, such as speech, gestures, and gaze, to manage turn-taking. While linguistic and acoustic features are informative, gestures provide complementary cues for modeling these transitions. To study this, we introduce DnD Gesture++, an extension of the multi-party DnD Gesture corpus enriched with 2,663 semantic gesture annotations spanning iconic, metaphoric, deictic, and discourse types. Using this dataset, we model turn-taking prediction through a Mixture-of-Experts framework integrating text, audio, and gestures. Experiments show that incorporating semantically guided gestures yields consistent performance gains over baselines, demonstrating their complementary role in multimodal turn-taking.
[26] M3-SLU: Evaluating Speaker-Attributed Reasoning in Multimodal Large Language Models
Yejin Kwon,Taewoo Kang,Hyunsoo Yoon,Changouk Kim
Main category: cs.CL
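TL;DR: M3-SLU is a new benchmark for multimodal large language models (MLLMs) that evaluates multi-speaker, multi-turn spoken language understanding, with a focus on the challenge of speaker-attributed reasoning.
Details
Motivation: Current multimodal large language models perform well in speech and text comprehension, but they still struggle to identify who said what and when in natural conversations. M3-SLU aims to fill this gap.
Contribution: Proposes the M3-SLU benchmark, built from four open corpora and comprising over 12,000 validated instances, to support evaluation of speaker-attributed reasoning.
Method: Defines two tasks, Speaker-Attributed Question Answering and Speaker Attribution via Utterance Matching, and provides baseline results for both end-to-end MLLMs and cascaded pipelines.
Result: Models can capture what was said but often fail to identify who said it, revealing a gap in speaker-aware dialogue understanding.
Insight: M3-SLU provides a challenging benchmark for advancing research in speaker-aware multimodal understanding.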
Abstract: We present M3-SLU, a new multimodal large language model (MLLM) benchmark for evaluating multi-speaker, multi-turn spoken language understanding. While recent models show strong performance in speech and text comprehension, they still struggle with speaker-attributed reasoning, the ability to understand who said what and when in natural conversations. M3-SLU is built from four open corpora (CHiME-6, MELD, MultiDialog, and AMI) and comprises over 12,000 validated instances with paired audio, transcripts, and metadata. It includes two tasks: (1) Speaker-Attributed Question Answering and (2) Speaker Attribution via Utterance Matching. We provide baseline results for both cascaded pipelines and end-to-end MLLMs, evaluated using an LLM-as-Judge and accuracy metrics. Results show that while models can capture what was said, they often fail to identify who said it, revealing a key gap in speaker-aware dialogue understanding. M3-SLU serves as a challenging benchmark to advance research in speaker-aware multimodal understanding.
[27] AgenticMath: Enhancing LLM Reasoning via Agentic-based Math Data Generation
Xianyang Liu,Yilin Liu,Shuai Wang,Hao Cheng,Andrew Estornell,Yuzhi Zhao,Jiaheng Wei
Main category: cs.CL
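TL;DR: AgenticMath is a multi-agent method for generating high-quality mathematical question-answer pairs. It filters seed questions, rephrases them in diverse ways, strengthens the logic of the answers, and runs a final evaluation, improving LLM performance on mathematical reasoning tasks.
Details
Motivation: Current methods for building datasets that improve LLM reasoning suffer from low-quality or incorrect answers and limited information richness, so a more efficient approach is needed.
Contribution: Proposes AgenticMath, a four-stage agentic pipeline for generating high-quality mathematical question-answer pairs that improves LLM performance on mathematical reasoning with far less data.
Method: Four stages: seed question filtering, multi-agent question rephrasing, answer augmentation, and question-answer pair evaluation, with an emphasis on logical consistency and numerical correctness.
Result: LLMs fine-tuned on only 30-60K samples are competitive with or better than baselines trained on much more data (e.g., 400K or 2.3M samples) on mathematical reasoning benchmarks.
Insight: Targeted, high-quality data generation improves mathematical reasoning more efficiently than large-scale, low-quality data.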
Abstract: The creation of high-quality datasets to improve Large Language Model (LLM) reasoning remains a significant challenge, as current methods often suffer from generating low-quality/incorrect answers and limited information richness from available data sources. To address this, we propose AgenticMath, a novel agentic pipeline for generating high-quality mathematical question-answer pairs to enhance the supervised fine-tuning of LLMs. Our method operates through four stages: (1) a Seed Question Filter that selects questions with high information richness, complexity, and clarity; (2) an Agentic Question Rephrase step that employs a multi-agent system to generate diverse, logically consistent paraphrases; (3) an Answer Augment step that rewrites answers using chain-of-thought reasoning to enhance numerical and logical correctness, without reliance on human-provided labels; and (4) a final Question and Answer Evaluation that retains only the highest-quality pairs. Extensive experiments demonstrate that fine-tuning 3B-8B parameter LLMs on AgenticMath-generated datasets (comprising only 30-60K math samples) achieves competitive or superior performance on diverse in-domain and out-of-domain mathematical reasoning benchmarks compared to baselines trained on much more data (e.g., 400K or 2.3M samples). Our work demonstrates that targeted, high-quality data generation is a more efficient path to improving mathematical reasoning in LLMs than large-scale, low-quality alternatives.
[28] LoongRL: Reinforcement Learning for Advanced Reasoning over Long Contexts
Siyuan Wang,Gaokai Zhang,Li Lyna Zhang,Ning Shang,Fan Yang,Dongyao Chen,Mao Yang
Main category: cs.CL
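TL;DR: LoongRL is a data-driven reinforcement learning method for improving advanced reasoning over long contexts in large models. Its core contribution is KeyChain, a method that turns short multi-hop QA tasks into high-difficulty long-context tasks; RL training on this data induces a plan-retrieve-reason-recheck reasoning pattern.
Details
Motivation: Long-context reasoning is essential for large language models, but reinforcement learning has mainly been applied to short-context reasoning. The advanced reasoning patterns and training data needed for long contexts remain largely unexplored.
Contribution: Proposes the LoongRL method and the KeyChain data synthesis technique, which substantially improve model performance on long-context reasoning tasks.
Method: Uses KeyChain to turn short multi-hop QA tasks into long-context tasks and trains the model with reinforcement learning, which induces a plan-retrieve-reason-recheck reasoning pattern.
Result: Models trained at a 16K length generalize to 128K tasks. On long-context multi-hop QA, Qwen2.5-7B and 14B improve by 23.5 and 21.1 absolute percentage points, respectively.
Insight: Reinforcement learning can effectively induce advanced reasoning patterns suited to long contexts, and high-difficulty training data is key.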
Abstract: Reasoning over long contexts is essential for large language models. While reinforcement learning (RL) enhances short-context reasoning by inducing “Aha” moments in chain-of-thought, the advanced thinking patterns required for long-context reasoning remain largely unexplored, and high-difficulty RL data are scarce. In this paper, we introduce LoongRL, a data-driven RL method for advanced long-context reasoning. Central to LoongRL is KeyChain, a synthesis approach that transforms short multi-hop QA into high-difficulty long-context tasks by inserting UUID chains that hide the true question among large collections of distracting documents. Solving these tasks requires the model to trace the correct chain step-by-step, identify the true question, retrieve relevant facts and reason over them to answer correctly. RL training on KeyChain data induces an emergent plan-retrieve-reason-recheck reasoning pattern that generalizes far beyond training length. Models trained at 16K effectively solve 128K tasks without prohibitive full-length RL rollout costs. On Qwen2.5-7B and 14B, LoongRL substantially improves long-context multi-hop QA accuracy, with absolute gains of +23.5% and +21.1%. The resulting LoongRL-14B reaches a score of 74.2, rivaling much larger frontier models such as o3-mini (74.5) and DeepSeek-R1 (74.9). It also improves long-context retrieval, passes all 128K needle-in-a-haystack stress tests, and preserves short-context reasoning capabilities.
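The KeyChain idea (a chain of UUID keys that eventually points at the true question, hidden among distractor chains) can be sketched as follows. The data layout, function names, and the `("next", ...)` / `("question", ...)` tags are hypothetical; the source does not describe the actual format, and the long distracting documents are omitted here.

```python
import random
import uuid

def make_chain(final_question, length, new_key):
    """Build `length` key-value entries; each key points to the next, the last holds a question."""
    keys = [new_key() for _ in range(length)]
    entries = [(keys[i], ("next", keys[i + 1])) for i in range(length - 1)]
    entries.append((keys[-1], ("question", final_question)))
    return keys[0], entries

def build_keychain_task(true_question, distractor_questions, length=3, seed=0):
    """Return the starting key and a shuffled list mixing the true chain with distractor chains."""
    rng = random.Random(seed)
    new_key = lambda: str(uuid.UUID(int=rng.getrandbits(128)))  # reproducible UUID-style keys
    start_key, entries = make_chain(true_question, length, new_key)
    for question in distractor_questions:
        _, extra = make_chain(question, length, new_key)
        entries.extend(extra)
    rng.shuffle(entries)  # in a real task these would be scattered through long documents
    return start_key, entries

def trace_chain(start_key, entries):
    """Follow the chain from the starting key until an entry holding a question is reached."""
    table = dict(entries)
    kind, value = table[start_key]
    while kind == "next":
        kind, value = table[value]
    return value

start, entries = build_keychain_task(
    "Which river flows through the capital of the author's home country?",
    ["Distractor question A", "Distractor question B"],
)
recovered = trace_chain(start, entries)
```

A model has to perform the equivalent of `trace_chain` inside its context before it can even begin the multi-hop reasoning, which is what makes the synthesized tasks hard.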
[29] The Massive Legal Embedding Benchmark (MLEB)
Umar Butler,Abdur-Rahman Butler,Adrian Lucas Malec
Main category: cs.CL
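TL;DR: The paper presents the Massive Legal Embedding Benchmark (MLEB), the largest, most diverse, and most comprehensive open-source benchmark for legal information retrieval to date. MLEB contains ten expert-annotated datasets spanning multiple jurisdictions, document types, and task types.
Details
Motivation: To fill the data gaps across jurisdictions and task types in open-source legal information retrieval and to advance legal information retrieval technology.
Contribution: Builds MLEB, a large-scale and diverse legal information retrieval benchmark that fills data gaps in the open-source landscape, and releases the code, data, and results to support reproducible evaluation.
Method: Collects ten expert-annotated legal datasets covering multiple jurisdictions and document types, seven of which were newly constructed to fill existing gaps.
Result: Releases the MLEB benchmark and its associated resources, providing a rich test bed for legal information retrieval research.
Insight: Highlights the importance of datasets that span jurisdictions and task types, and points to new directions for NLP research in the legal domain.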
Abstract: We present the Massive Legal Embedding Benchmark (MLEB), the largest, most diverse, and most comprehensive open-source benchmark for legal information retrieval to date. MLEB consists of ten expert-annotated datasets spanning multiple jurisdictions (the US, UK, EU, Australia, Ireland, and Singapore), document types (cases, legislation, regulatory guidance, contracts, and literature), and task types (search, zero-shot classification, and question answering). Seven of the datasets in MLEB were newly constructed in order to fill domain and jurisdictional gaps in the open-source legal information retrieval landscape. We document our methodology in building MLEB and creating the new constituent datasets, and release our code, results, and data openly to assist with reproducible evaluations.
[30] MoE-Prism: Disentangling Monolithic Experts for Elastic MoE Services via Model-System Co-Designs
Xinfeng Xia,Jiacheng Liu,Xiaofeng Hou,Peng Tang,Mingxuan Zhang,Wenfeng Wang,Chao Li
Main category: cs.CL
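TL;DR: MoE-Prism is a model-system co-design that turns conventional Mixture-of-Experts models into elastic services, offering more fine-grained operating points and improving performance and resource utilization.
Details
Motivation: Existing Mixture-of-Experts models route among a few monolithic experts, so their operating points are too coarse-grained to adapt to diverse Service Level Objectives (SLOs), which wastes resources.
Contribution: Proposes MoE-Prism, which consists of an offline refactoring engine and an online scheduling engine. It decomposes experts into sub-experts and optimizes the scheduling policy to provide finer-grained elastic service.
Method: 1. Offline refactoring engine: uses a metaheuristic-based approach to decompose experts into sub-experts; 2. Online scheduling engine: optimizes system performance with QoS-aware scheduling policies.
Result: Evaluated on three MoE models, MoE-Prism provides over 4 times more distinct, stable operating points than the baseline, improves throughput by up to 19.9%, and reduces latency by up to 10.36%.
Insight: Model-system co-design enables high-quality elastic serving that adapts flexibly to different SLOs and resource constraints.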
Abstract: Mixture-of-Experts (MoE) models, the state-of-the-art in large-scale AI, achieve high quality by sparsely activating parameters. However, their reliance on routing between a few monolithic experts via a top-k mechanism creates a “quality cliff”, offering only a few coarse-grained operating points. This inflexibility forces a difficult trade-off between cost and quality, preventing adaptation to diverse Service Level Objectives (SLOs) and leading to significant resource over-provisioning. This paper introduces MoE-Prism, a model-system co-design that transforms rigid MoE models into elastic services. Our methodology is divided into two phases. First, an Offline Refactoring Engine systematically deconstructs monolithic experts into fine-grained “sub-experts.” This engine employs a partitioning optimization solver that uses a metaheuristic-based approach to group neurons, preserving functional locality without requiring retraining. Second, an Online Scheduling Engine leverages this new elasticity through QoS-aware scheduling. It implements specialized policies to solve complex system problems, including maximizing throughput in cloud deployments and managing latency-optimized offloading for memory-constrained devices. Our evaluation across three different MoE models shows that MoE-Prism provides over 4 times more distinct, stable operating points than the baseline. This allows an AI service to dynamically improve throughput by up to 19.9% under a strict latency budget or reduce latency by up to 10.36% under limited resources. MoE-Prism provides the critical “control knob” to bridge the model-system gap, enabling the next generation of adaptive, efficient, and QoS-aware AI services.
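One reason an expert can be split into sub-experts without retraining is that a feed-forward expert's output is a sum of per-neuron contributions, so any partition of the hidden neurons yields sub-experts whose outputs add up to the original. The sketch below demonstrates this with a ReLU feed-forward layer and an arbitrary index split; both are simplifying assumptions, and the paper's metaheuristic neuron grouping is not reproduced here.

```python
def relu(value):
    return max(value, 0.0)

def expert_forward(w_in, w_out, x, neuron_ids=None):
    """Two-layer feed-forward expert, optionally restricted to a subset of hidden neurons.

    w_in[j]  : input weights of hidden neuron j
    w_out[j] : contribution of hidden neuron j to each output dimension
    """
    if neuron_ids is None:
        neuron_ids = range(len(w_in))
    out = [0.0] * len(w_out[0])
    for j in neuron_ids:
        hidden = relu(sum(w * xi for w, xi in zip(w_in[j], x)))
        for k, w in enumerate(w_out[j]):
            out[k] += hidden * w
    return out

# Toy expert with 4 hidden neurons, 2 inputs, 2 outputs (illustrative values).
w_in = [[1.0, -1.0], [0.5, 2.0], [-1.0, 0.5], [2.0, 1.0]]
w_out = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 2.0]]
x = [1.0, 2.0]

full_output = expert_forward(w_in, w_out, x)

# Split the neurons into two "sub-experts" and add up their outputs.
partition = [[0, 2], [1, 3]]
sub_outputs = [expert_forward(w_in, w_out, x, ids) for ids in partition]
summed_output = [sum(values) for values in zip(*sub_outputs)]
```

Activating only some of the sub-experts then gives an approximation of the full expert at lower cost, which is the source of the extra operating points; how well a subset approximates the whole depends on how the neurons are grouped.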
[31] Sign Language Translation with Sentence Embedding Supervision
Yasser Hamidullah,Josef van Genabith,Cristina España-Bonet
Main category: cs.CL
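TL;DR: The paper proposes a novel sign language translation method that uses sentence embeddings of the target sentences as the supervision signal instead of gloss annotations, substantially improving translation quality in the gloss-free setting.
Details
Motivation: Conventional sign language translation systems rely on gloss annotations, but such data is hard to obtain at scale and is annotated inconsistently across datasets. The paper aims to develop a gloss-free translation method that addresses data scarcity and annotation inconsistency.
Contribution: 1. Proposes a sign language translation method supervised by sentence embeddings, with no need for gloss annotations; 2. Validates the method on German and American sign language datasets, where it significantly outperforms other gloss-free approaches; 3. Explores monolingual and multilingual sentence embeddings, supporting multilingual scenarios.
Method: 1. Uses sentence embeddings of the target sentences as the supervision signal in place of glosses; 2. Obtains the sentence embeddings from models learned on raw textual data; 3. Aligns these embeddings with the sign language videos during training; 4. Experiments with monolingual and multilingual embedding configurations.
Result: On PHOENIX-2014T (German Sign Language) and How2Sign (American Sign Language), the method significantly outperforms other gloss-free approaches and narrows the gap to gloss-dependent systems.
Insight: 1. Sentence embeddings can serve as an effective supervision signal for sign language translation; 2. Multilingual embeddings can further improve translation ability; 3. The approach opens a new path around the data bottleneck in sign language translation.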
Abstract: State-of-the-art sign language translation (SLT) systems facilitate the learning process through gloss annotations, either in an end-to-end manner or by involving an intermediate step. Unfortunately, gloss-labelled sign language data is usually not available at scale and, when available, gloss annotations widely differ from dataset to dataset. We present a novel approach using sentence embeddings of the target sentences at training time that take the role of glosses. The new kind of supervision does not need any manual annotation but it is learned on raw textual data. As our approach easily facilitates multilinguality, we evaluate it on datasets covering German (PHOENIX-2014T) and American (How2Sign) sign languages and experiment with mono- and multilingual sentence embeddings and translation systems. Our approach significantly outperforms other gloss-free approaches, setting the new state-of-the-art for data sets where glosses are not available and when no additional SLT datasets are used for pretraining, diminishing the gap between gloss-free and gloss-dependent systems.
[32] SONAR-SLT: Multilingual Sign Language Translation via Language-Agnostic Sentence Embedding Supervision
Yasser Hamidullah,Shakib Yazdani,Cennet Oguz,Josef van Genabith,Cristina España-Bonet
Main category: cs.CL
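TL;DR: The paper proposes SONAR-SLT, which supervises sign language translation (SLT) with language-agnostic, multimodal embeddings and thereby supports direct multilingual translation. A coupled augmentation method addresses data scarcity, and experiments show gains over supervision with text-only sentence embeddings.
Details
Motivation: SLT is typically trained with text supervision in a single spoken language, which limits scalability and cross-language generalization. To address this, the paper explores supervision with language-agnostic, multimodal embeddings.
Contribution: 1. Proposes a language-agnostic, multimodal embedding supervision method that supports direct multilingual SLT. 2. Designs a coupled augmentation method that combines multilingual target augmentations with video-level perturbations to improve robustness. 3. Shows experimentally that the gains are larger in low-resource settings.
Method: 1. Supervises SLT with language-agnostic embeddings trained on text and speech from multiple languages. 2. Applies coupled augmentation, combining multilingual target augmentations with video-level perturbations.
Result: The method yields consistent BLEURT gains over supervision with text-only sentence embeddings, with larger improvements in low-resource settings.
Insight: Language-agnostic multimodal supervision combined with coupled augmentation is an effective way to make SLT more scalable and robust.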
Abstract: Sign language translation (SLT) is typically trained with text in a single spoken language, which limits scalability and cross-language generalization. Earlier approaches have replaced gloss supervision with text-based sentence embeddings, but up to now, these remain tied to a specific language and modality. In contrast, here we employ language-agnostic, multimodal embeddings trained on text and speech from multiple languages to supervise SLT, enabling direct multilingual translation. To address data scarcity, we propose a coupled augmentation method that combines multilingual target augmentations (i.e. translations into many languages) with video-level perturbations, improving model robustness. Experiments show consistent BLEURT gains over text-only sentence embedding supervision, with larger improvements in low-resource settings. Our results demonstrate that language-agnostic embedding supervision, combined with coupled augmentation, provides a scalable and semantically robust alternative to traditional SLT training.
[33] Spatio-temporal Sign Language Representation and Translation
Yasser Hamidullah,Josef van Genabith,Cristina España-Bonet
Main category: cs.CL
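TL;DR: This paper addresses the sign language translation task with a single model that learns spatio-temporal feature representations and translation jointly. Performance is modest on the development set and poor on the test set.
Details
Motivation: Conventional sign language translation methods typically use a generic sequence-to-sequence architecture and often do not benefit from temporal features. The paper proposes a single model that learns spatio-temporal representations and translation together in order to improve performance.
Contribution: Proposes a single model that learns spatio-temporal feature representations and translation, giving a truly end-to-end architecture that is expected to generalize better to new datasets.
Method: Learns spatio-temporal feature representations from sign language video and translates within one model, avoiding the separate feature extraction and translation stages of conventional approaches.
Result: The best system reaches 5±1 BLEU on the development set, but performance drops sharply to 0.11±0.06 BLEU on the test set.
Insight: The spatio-temporal representation approach works reasonably on the development set, but the drop on the test set suggests that the model generalizes poorly or that the test data differs substantially from the training data.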
Abstract: This paper describes the DFKI-MLT submission to the WMT-SLT 2022 sign language translation (SLT) task from Swiss German Sign Language (video) into German (text). State-of-the-art techniques for SLT use a generic seq2seq architecture with customized input embeddings. Instead of word embeddings as used in textual machine translation, SLT systems use features extracted from video frames. Standard approaches often do not benefit from temporal features. In our participation, we present a system that learns spatio-temporal feature representations and translation in a single model, resulting in a real end-to-end architecture expected to better generalize to new data sets. Our best system achieved $5\pm1$ BLEU points on the development set, but the performance on the test dropped to $0.11\pm0.06$ BLEU points.
[34] MINED: Probing and Updating with Multimodal Time-Sensitive Knowledge for Large Multimodal Models
Kailin Jiang,Ning Jiang,Yuchen Ren,Yuchen Li,Yifan Gao,Jinhe Bi,Yunpu Ma,Qingqing Liu,Xianhao Wang,Yifan Jia,Hongbo Jiang,Yaocong Hu,Bin Li,Lei Liu,Yuntao Du
Main category: cs.CL
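TL;DR: The paper proposes MINED, a benchmark for evaluating how well large multimodal models (LMMs) understand time-sensitive knowledge, and uses knowledge editing methods to explore whether such knowledge can be updated.
Details
Motivation: Existing LMMs understand time-sensitive knowledge poorly, and there is no dynamic benchmark that measures this ability comprehensively.
Contribution: 1. Proposes the MINED benchmark, covering 6 key dimensions and 11 tasks, to evaluate the temporal awareness of LMMs; 2. Shows through knowledge editing methods that updating time-sensitive knowledge in LMMs is feasible.
Method: 1. Builds the MINED benchmark from Wikipedia with 2,104 time-sensitive knowledge samples; 2. Evaluates 15 widely used LMMs; 3. Tests knowledge editing methods in single-editing scenarios.
Result: Gemini-2.5-Pro performs best on MINED (average CEM score of 63.07), while most open-source LMMs perform poorly. Models do best on organization knowledge and worst on sport. Knowledge editing methods are effective in single-editing scenarios.
Insight: 1. LMMs still need to improve their understanding of time-sensitive knowledge; 2. Knowledge editing offers a feasible path for dynamically updating LMM knowledge.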
Abstract: Large Multimodal Models (LMMs) encode rich factual knowledge via cross-modal pre-training, yet their static representations struggle to maintain an accurate understanding of time-sensitive factual knowledge. Existing benchmarks remain constrained by static designs, inadequately evaluating LMMs’ ability to understand time-sensitive knowledge. To address this gap, we propose MINED, a comprehensive benchmark that evaluates temporal awareness along 6 key dimensions and 11 challenging tasks: cognition, awareness, trustworthiness, understanding, reasoning, and robustness. MINED is constructed from Wikipedia by two professional annotators, containing 2,104 time-sensitive knowledge samples spanning six knowledge types. Evaluating 15 widely used LMMs on MINED shows that Gemini-2.5-Pro achieves the highest average CEM score of 63.07, while most open-source LMMs still lack time understanding ability. Meanwhile, LMMs perform best on organization knowledge, whereas their performance is weakest on sport. To address these challenges, we investigate the feasibility of updating time-sensitive knowledge in LMMs through knowledge editing methods and observe that LMMs can effectively update knowledge via knowledge editing methods in single editing scenarios.
[35] VideoAgentTrek: Computer Use Pretraining from Unlabeled Videos
Dunjie Lu,Yiheng Xu,Junli Wang,Haoyuan Wu,Xinyuan Wang,Zekun Wang,Junlin Yang,Hongjin Su,Jixuan Chen,Junda Chen,Yuchen Mao,Jingren Zhou,Junyang Lin,Binyuan Hui,Tao Yu
Main category: cs.CL
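TL;DR: VideoAgentTrek is a pipeline that automatically mines GUI interaction data from publicly available videos without manual annotation. Its Video2Action module extracts precise timing and content information, which substantially improves the performance of computer-use agents.
Details
Motivation: Training computer-use agents requires large amounts of annotated data, but manual annotation is expensive. Publicly available videos such as YouTube tutorials implicitly contain a great deal of GUI interaction information but lack explicit labels.
Contribution: Proposes VideoAgentTrek, which uses the Video2Action module to extract GUI action parameters from videos automatically, making large-scale unlabeled data usable in place of costly manual annotation.
Method: Designs the Video2Action module, which consists of a video grounding model (detecting timing and context) and an action-content recognizer (extracting structured parameters), and improves the agent through continued pretraining followed by supervised fine-tuning.
Result: On OSWorld-Verified, the task success rate rises from 9.3% to 15.8% (a 70% relative improvement). On AgentNetBench, step accuracy rises from 64.1% to 69.3%.
Insight: Internet videos can serve as a source of high-quality supervision and provide a scalable data solution for agent training, underscoring the potential of unlabeled data for computer-use agents.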
Abstract: Training computer-use agents requires massive amounts of GUI interaction data, but manually annotating action trajectories at scale is prohibitively expensive. We present VideoAgentTrek, a scalable pipeline that automatically mines training data from publicly available screen-recorded videos at web scale, eliminating the need for manual annotation. Our approach addresses a key challenge: raw videos contain implicit demonstrations but lack explicit action labels. To solve this, we develop Video2Action, an inverse dynamics module (IDM) with two components: (1) a video grounding model that detects and localizes GUI actions with precise temporal boundaries and context, and (2) an action-content recognizer that extracts structured parameters like click coordinates and typed text with high fidelity. Applied to 39,000 YouTube tutorial videos, our pipeline generates 1.52 million interaction steps automatically. We leverage this data through continued pretraining followed by supervised fine-tuning. On OSWorld-Verified, our approach improves task success rates from 9.3% (SFT-only baseline) to 15.8%, a 70% relative improvement. On AgentNetBench, step accuracy increases from 64.1% to 69.3%. Our results demonstrate that passive internet videos can be transformed into high-quality supervision for computer-use agents, providing a scalable alternative to expensive manual annotation.
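The "70% relative improvement" follows directly from the two success rates. A short check of the arithmetic, using only the numbers reported in the abstract (the helper name is illustrative):

```python
def relative_improvement(baseline, improved):
    """Improvement expressed as a fraction of the baseline value."""
    return (improved - baseline) / baseline

# OSWorld-Verified task success rate: 9.3% -> 15.8%
osworld_absolute = 15.8 - 9.3                          # about 6.5 percentage points
osworld_relative = relative_improvement(9.3, 15.8)     # about 0.699, i.e. roughly 70%

# AgentNetBench step accuracy: 64.1% -> 69.3%
agentnet_absolute = 69.3 - 64.1                        # about 5.2 percentage points
agentnet_relative = relative_improvement(64.1, 69.3)   # about 0.081, i.e. roughly 8%
```

The absolute gains on the two benchmarks are similar in size; the relative gain on OSWorld-Verified looks much larger because its baseline is low.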
[36] What is the Best Sequence Length for BABYLM?
Suchir Salhan,Richard Diehl Martinez,Zébulon Goriely,Paula Buttery
Main category: cs.CL
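TL;DR: Studies how sequence length affects pretraining in the BabyLM Challenge and finds that the optimal length depends on task and architecture: shorter sequences suffice for grammatical tasks, while longer sequences help morphological analogy tasks.
Details
Motivation: Transformer language models typically use a fixed-length context window, yet many BabyLM Challenge submissions use much shorter sequence lengths. The study aims to determine the best sequence length for BabyLM pretraining.
Contribution: Compares 125M-parameter Mamba and OPT models experimentally, shows how sequence length affects task performance, and gives length recommendations for different tasks.
Method: With 100M words of training data and a fixed compute budget, compares Mamba and OPT models at different sequence lengths on grammatical and morphological analogy tasks.
Result: Short sequences (512 tokens) are sufficient for grammatical tasks, while long sequences (2048 tokens) are more beneficial for morphological analogy tasks. The best length depends on both task and model architecture.
Insight: Sequence length should be chosen per task and per architecture; a single fixed length may not suit every setting.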
Abstract: Transformer language models typically operate with a fixed-length context window, which has grown in step with large-scale pretraining datasets. In the BabyLM Challenge, however, many past submissions have defaulted to using much shorter sequence lengths. We examine the impact of sequence length on BabyLM pretraining, to answer the simple question: what sequence length should we be using when training Baby LMs? Using 100M-word training data and fixed compute budgets, we compare 125M-parameter Mamba and OPT models, finding that although longer is often better, the optimal length depends on both task and architecture. Shorter sequences are sufficient for grammatical generalization tasks whereas longer contexts benefit morphological analogical reasoning tasks.
[37] Lookahead Routing for Large Language Models
Canbin Huang,Tianyuan Shi,Yuhua Zhu,Ruijun Chen,Xiaojun Quan
Main category: cs.CL
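TL;DR: Proposes Lookahead, a routing framework that predicts latent representations of potential model outputs to guide LLM routing, avoiding the limits of input-only classification and achieving an average gain of 7.7% over the state of the art.
Details
Motivation: Existing LLM routing methods classify based only on the input query and ignore the information carried by potential outputs, which leads to poor routing decisions for complex queries.
Contribution: Proposes the Lookahead framework, which uses predictions of latent output representations to guide model selection without running full inference.
Method: The framework is implemented in two variants, one based on causal language models and one based on masked language models.
Result: Across seven public benchmarks, Lookahead achieves an average performance gain of 7.7% over state-of-the-art routing baselines.
Insight: Predicting potential outputs can markedly improve routing quality, especially for complex or ambiguous queries.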
Abstract: Large language model (LLM) routers improve the efficiency of multi-model systems by directing each query to the most appropriate model while leveraging the diverse strengths of heterogeneous LLMs. Most existing approaches frame routing as a classification problem based solely on the input query. While this reduces overhead by avoiding inference across all models, it overlooks valuable information that could be gleaned from potential outputs and fails to capture implicit intent or contextual nuances that often emerge only during response generation. These limitations can result in suboptimal routing decisions, particularly for complex or ambiguous queries that require deeper semantic understanding. To address this challenge, we propose Lookahead, a routing framework that “foresees” potential model outputs by predicting their latent representations and uses these predictions to guide model selection, thus enabling more informed routing without full inference. Within this framework, we implement two approaches based on causal and masked language models. Empirical evaluations across seven public benchmarks - spanning instruction following, mathematical reasoning, and code generation - show that Lookahead consistently outperforms existing routing baselines, achieving an average performance gain of 7.7% over the state-of-the-art. Our code is available at https://github.com/huangcb01/lookahead-routing.
[38] Which Evaluation for Which Model? A Taxonomy for Speech Model Assessment
Maureen de Seyssel,Eeshan Gunesh Dhekane
Main category: cs.CL
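TL;DR: Proposes a unified taxonomy to address the fragmented evaluation of speech foundation models. It classifies existing evaluations along three orthogonal axes (evaluation aspect, required model capabilities, and task or protocol requirements) and provides a framework for choosing suitable evaluations.
Details
Motivation: Evaluation of speech foundation models is fragmented across tasks and model types and lacks a unified standard. Different models excel at different aspects of speech processing and therefore need targeted evaluation protocols.
Contribution: Proposes a new taxonomy with three orthogonal axes (evaluation aspect, required model capabilities, and task or protocol requirements), giving a systematic framework for evaluating speech models and exposing gaps in current evaluations.
Method: Defines the three axes, then classifies and analyzes existing evaluations and benchmarks along them to identify systematic gaps.
Result: The taxonomy systematically categorizes a broad set of evaluations and identifies under-covered areas such as prosody, interaction, and reasoning.
Insight: The taxonomy guides the selection and design of evaluations and also highlights priorities for future benchmark design, supporting more unified and comprehensive evaluation of speech models.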
Abstract: Speech foundation models have recently achieved remarkable capabilities across a wide range of tasks. However, their evaluation remains disjointed across tasks and model types. Different models excel at distinct aspects of speech processing and thus require different evaluation protocols. This paper proposes a unified taxonomy that addresses the question: Which evaluation is appropriate for which model? The taxonomy defines three orthogonal axes: the evaluation aspect being measured, the model capabilities required to attempt the task, and the task or protocol requirements needed to perform it. We classify a broad set of existing evaluations and benchmarks along these axes, spanning areas such as representation learning, speech generation, and interactive dialogue. By mapping each evaluation to the capabilities a model exposes (e.g., speech generation, real-time processing) and to its methodological demands (e.g., fine-tuning data, human judgment), the taxonomy provides a principled framework for aligning models with suitable evaluation methods. It also reveals systematic gaps, such as limited coverage of prosody, interaction, or reasoning, that highlight priorities for future benchmark design. Overall, this work offers a conceptual foundation and practical guide for selecting, interpreting, and extending evaluations of speech models.
[39] Conditions for Catastrophic Forgetting in Multilingual Translation
Danni Liu,Jan Niehues
Main category: cs.CL
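TL;DR: Investigates the conditions under which fine-tuning multilingual foundation models triggers catastrophic forgetting. The relative scale between model and data size is a primary factor, and a model's instruction-following ability matters more than its architecture.
Details
Motivation: Fine-tuning multilingual foundation models on specific languages often induces catastrophic forgetting, but the literature lacks a systematic study of when forgetting occurs.
Contribution: 1. A systematic study of the conditions that trigger catastrophic forgetting; 2. Evidence that the relative scale between model and data size is key; 3. Evidence that instruction-following ability matters more than architecture; 4. Parameter-efficient fine-tuning shows no clear advantage over full fine-tuning; 5. Cross-lingual alignment can mitigate forgetting and promote positive transfer.
Method: Uses machine translation experiments to compare model architectures, data scales, and fine-tuning approaches and to analyze the conditions for catastrophic forgetting.
Result: 1. The relative scale between model and data size is a primary determinant of forgetting; 2. Instruction-following ability is more critical than architecture; 3. Cross-lingual alignment can mitigate forgetting.
Insight: Instruction-following ability and cross-lingual alignment are key to retaining multilingual knowledge, while parameter-efficient fine-tuning is not necessarily better than full fine-tuning.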
Abstract: Fine-tuning multilingual foundation models on specific languages often induces catastrophic forgetting, degrading performance on languages unseen in fine-tuning. While this phenomenon is widely-documented, the literature presents fragmented results about when forgetting occurs. To address this ambiguity, we conduct a systematic empirical study using machine translation as a testbed to identify the conditions that trigger catastrophic forgetting in multilingual fine-tuning. Through controlled experiments across different model architectures, data scales, and fine-tuning approaches, we reveal that the relative scale between model and data size is a primary determinant of forgetting. Moreover, we demonstrate that a model’s instruction-following ability is more critical for retaining multilingual knowledge than its architecture. Contrary to assumptions, parameter-efficient fine-tuning offers no clear advantage over full fine-tuning in mitigating forgetting. Lastly, we show that cross-lingual alignment can mitigate forgetting while also facilitating positive transfer to unseen target languages.
[40] Detecting Latin in Historical Books with Large Language Models: A Multimodal Benchmark
Yu Wu,Ke Shu,Jonas Fischer,Lidia Pivovarova,David Rosson,Eetu Mäkelä,Mikko Tolonen
Main category: cs.CL
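TL;DR: Introduces a new task, extracting Latin fragments from mixed-language historical books, and evaluates large foundation models on a multimodal dataset. The results show that contemporary models can detect Latin reliably.
Details
Motivation: Historical books often mix several languages, in particular Latin with other languages, which makes automatic extraction of Latin challenging. How existing models perform on this task has not been systematically evaluated.
Contribution: 1) Proposes a new task: detecting Latin in multilingual historical books; 2) Builds a multimodal benchmark dataset of 724 annotated pages; 3) Provides the first comprehensive analysis of the capabilities and limits of large foundation models on this task.
Method: Applies large foundation models (such as multimodal pretrained models) to Latin detection in multilingual historical books and evaluates them on the benchmark dataset.
Result: Current models can reliably detect Latin in multilingual historical books, which provides a baseline for research in this area.
Insight: Large foundation models perform well on this multimodal, multilingual task, but may need further work to handle the complex layouts and language mixing found in historical books.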
Abstract: This paper presents a novel task of extracting Latin fragments from mixed-language historical documents with varied layouts. We benchmark and evaluate the performance of large foundation models against a multimodal dataset of 724 annotated pages. The results demonstrate that reliable Latin detection with contemporary models is achievable. Our study provides the first comprehensive analysis of these models’ capabilities and limits for this task.
[41] Style Attack Disguise: When Fonts Become a Camouflage for Adversarial Intent
Yangshijie Zhang,Xinda Wang,Jialin Liu,Wenqiang Wang,Zhicong Ma,Xingxing Jia
Main category: cs.CL
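TL;DR: Proposes Style Attack Disguise (SAD), a font-style-based adversarial attack that exploits the gap between how humans and NLP models perceive stylized text. Experiments confirm its effectiveness on sentiment classification and machine translation and show potential threats to multimodal tasks.
Details
Motivation: With the growth of social media, users widely adopt stylistic fonts and font-like emoji to express individuality, but NLP models may process these characters as unrelated tokens, which degrades performance. The study exploits this human-model perception gap to design an adversarial attack.
Contribution: 1. Proposes SAD, a font-style-based adversarial attack with a light and a strong variant; 2. Validates the attack on sentiment classification, machine translation, and multimodal tasks; 3. Reveals the potential threat that stylized text poses to modern NLP systems.
Method: SAD comes in two variants: a light version that prioritizes query efficiency and a strong version that prioritizes attack performance. Stylistic fonts and emoji disrupt the model's token processing while the text remains readable to humans.
Result: SAD successfully attacks traditional models, large language models (LLMs), and commercial services on sentiment classification and machine translation. It also shows potential threats to multimodal tasks such as text-to-image and text-to-speech generation.
Insight: Stylized text is visually friendly to humans but can be a weak point for NLP models. This human-model perception gap opens a new direction for adversarial attack research.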
Abstract: With social media growth, users employ stylistic fonts and font-like emoji to express individuality, creating visually appealing text that remains human-readable. However, these fonts introduce hidden vulnerabilities in NLP models: while humans easily read stylistic text, models process these characters as distinct tokens, causing interference. We identify this human-model perception gap and propose a style-based attack, Style Attack Disguise (SAD). We design two sizes: light for query efficiency and strong for superior attack performance. Experiments on sentiment classification and machine translation across traditional models, LLMs, and commercial services demonstrate SAD’s strong attack performance. We also show SAD’s potential threats to multimodal tasks including text-to-image and text-to-speech generation.
[42] LLavaCode: Compressed Code Representations for Retrieval-Augmented Code Generation
Daria Cherniuk,Nikita Sukhorukov,Nikita Sushko,Daniil Gusak,Danil Sivtsov,Elena Tutubalina,Evgeny Frolov
Main category: cs.CL
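TL;DR: LlavaCode compresses code into compact representations, greatly reducing the context length of retrieval-augmented code generation while improving generation quality and lowering latency.
Details
Motivation: Retrieval-augmented generation works well for code completion, but long contexts slow down inference and hurt the experience in interactive settings such as IDEs.
Contribution: The LlavaCode framework compresses retrieved code into a few single-token vectors, reducing context length, improving generation quality, and cutting Time-to-First-Token (TTFT) by 20-38%.
Method: A small projector module compresses code into compact, semantically rich representations that a code LLM can interpret.
Result: Experiments show that compressed context significantly improves the EM and ES metrics and reduces TTFT by 20-38% on line completion tasks compared with full-RAG pipelines.
Insight: Compact code representations are an effective way to address the latency of retrieval-augmented generation.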
Abstract: Retrieval-augmented generation has emerged as one of the most effective approaches for code completion, particularly when context from a surrounding repository is essential. However, incorporating context significantly extends sequence length, leading to slower inference - a critical limitation for interactive settings such as IDEs. In this work, we introduce LlavaCode, a framework that compresses code into compact, semantically rich representations interpretable by code LLM, enhancing generation quality while reducing the retrieved context to only a few compressed single-token vectors. Using a small projector module we can significantly increase the EM and ES metrics of coding model with negligible latency increase. Our experiments demonstrate that compressed context enables 20-38% reduction in Time-to-First-Token (TTFT) on line completion tasks compared to full-RAG pipelines.
[43] Unraveling Emotions with Pre-Trained Models
Alejandro Pajón-Sanmartín,Francisco De Arriba-Pérez,Silvia García-Méndez,Fátima Leal,Benedita Malheiro,Juan Carlos Burguillo-Rial
Main category: cs.CL
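TL;DR: Compares fine-tuned pre-trained models with general-purpose LLMs for emotion detection and stresses the importance of structured prompt design and emotion grouping. A fine-tuned model reaches metrics above 70% on emotion recognition.
Details
Motivation: Emotion recognition in open text faces challenges such as contextual ambiguity and linguistic variability, and applying generalist models directly gives limited results. The work therefore studies the effect of fine-tuning and prompt engineering.
Contribution: 1. Compares the performance of fine-tuned pre-trained models and general-purpose LLMs; 2. Analyzes the effectiveness of different emotion prompt designs; 3. Examines the impact of emotion grouping techniques on the models.
Method: Evaluates three scenarios experimentally: (i) fine-tuned models versus general-purpose LLMs with simple prompts; (ii) the effect of different prompt designs; (iii) the role of emotion grouping techniques.
Result: A fine-tuned pre-trained model attains emotion recognition metrics above 70%, while LLMs need structured prompts and emotion grouping to improve their performance.
Insight: Structured prompts and emotion grouping are key to improving LLM performance in emotion analysis, and fine-tuned models perform better on emotion recognition in open text.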
Abstract: Transformer models have significantly advanced the field of emotion recognition. However, there are still open challenges when exploring open-ended queries for Large Language Models (LLMs). Although current models offer good results, automatic emotion analysis in open texts presents significant challenges, such as contextual ambiguity, linguistic variability, and difficulty interpreting complex emotional expressions. These limitations make the direct application of generalist models difficult. Accordingly, this work compares the effectiveness of fine-tuning and prompt engineering in emotion detection in three distinct scenarios: (i) performance of fine-tuned pre-trained models and general-purpose LLMs using simple prompts; (ii) effectiveness of different emotion prompt designs with LLMs; and (iii) impact of emotion grouping techniques on these models. Experimental tests attain metrics above 70% with a fine-tuned pre-trained model for emotion recognition. Moreover, the findings highlight that LLMs require structured prompt engineering and emotion grouping to enhance their performance. These advancements improve sentiment analysis, human-computer interaction, and understanding of user behavior across various domains.
[44] DiffAdapt: Difficulty-Adaptive Reasoning for Token-Efficient LLM Inference
Xiang Liu,Xuming Hu,Xiaowen Chu,Eunsol Choi
Main category: cs.CL
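TL;DR: DiffAdapt is a lightweight framework that selects an inference strategy based on question difficulty and reasoning-trace entropy, reducing the tokens a large language model (LLM) uses at inference while keeping accuracy comparable or better.
Details
Motivation: Current LLMs have strong problem-solving abilities but are inefficient because they generate long reasoning traces. The entropy of token probabilities is high on easy problems, which points to an "overthinking" phenomenon and motivates a more efficient inference strategy.
Contribution: Identifies a U-shaped entropy pattern in LLM reasoning and proposes the DiffAdapt framework, which selects an Easy/Normal/Hard inference strategy per question and reduces token usage by up to 22.4%.
Method: Based on an analysis of token-probability entropy, a lightweight probe classifies the LLM's final hidden state to choose an inference strategy (a fixed prompt, temperature, and maximum token length), with no fine-tuning of the base model.
Result: Across five models and eight benchmarks, DiffAdapt reduces token usage by up to 22.4% while achieving comparable or improved accuracy.
Insight: Overthinking occurs on easy problems, and adapting the inference strategy to difficulty is an effective way to improve LLM reasoning efficiency.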
Abstract: Recent reasoning Large Language Models (LLMs) demonstrate remarkable problem-solving abilities but often generate long thinking traces whose utility is unclear. Our work aims to improve their efficiency, enabling them to reach high performance without overthinking. First, we analyze the entropy of token probabilities in reasoning traces. Across three models, we observe a consistent U-shaped entropy pattern: high entropy on easy problems despite high accuracy, low entropy on problems with medium difficulty, and high entropy on hard problems reflecting uncertainty. Specifically, we notice 22–25% entropy reduction from easy to medium difficulty regions, suggesting an overthinking phenomenon on easy instances. Building on these insights, we introduce DiffAdapt, a lightweight framework that selects Easy/Normal/Hard inference strategies per question based on their difficulty and reasoning trace entropy. Each inference strategy consists of a fixed prompt, temperature and maximum token length. In contrast to existing efficiency optimization methods, our approach does not fine-tune base LLM but a small probe that classifies LLM’s final hidden state, allowing inexpensive adaptation. We comprehensively evaluate our method on five models and eight benchmarks. Our method achieves comparable or improved accuracy while reducing token usage by up to 22.4%, establishing a practical path toward compute-efficient reasoning.
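To make the two ingredients of this abstract concrete, the sketch below computes the Shannon entropy of token probability distributions and maps a difficulty label to an inference strategy made of a fixed prompt, a temperature, and a maximum token length. It is a minimal illustration under stated assumptions: the prompts, temperatures, and token limits are hypothetical placeholders, and the difficulty label is passed in directly, whereas the paper obtains it from a probe on the LLM's final hidden state.

```python
import math
from dataclasses import dataclass


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_trace_entropy(trace):
    """Average per-token entropy over a reasoning trace (a list of distributions)."""
    return sum(token_entropy(p) for p in trace) / len(trace)


@dataclass(frozen=True)
class Strategy:
    prompt: str
    temperature: float
    max_tokens: int


# Hypothetical values for illustration only; the abstract does not give them.
STRATEGIES = {
    "easy": Strategy("Answer directly and concisely.", 0.2, 512),
    "normal": Strategy("Reason step by step, then answer.", 0.6, 2048),
    "hard": Strategy("Explore several approaches before answering.", 0.8, 8192),
}


def select_strategy(difficulty: str) -> Strategy:
    """Look up the strategy for a label that a difficulty probe would predict."""
    return STRATEGIES[difficulty]


trace = [[0.5, 0.5], [1.0]]          # one uncertain step, one certain step
print(mean_trace_entropy(trace))     # average entropy of the toy trace
print(select_strategy("easy"))
```

Keeping each strategy as a fixed configuration is what makes the adaptation cheap: only the small classifier is trained, and the base model is used as-is.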
[45] CoSense-LLM: Semantics at the Edge with Cost- and Uncertainty-Aware Cloud-Edge Cooperation
Hasan Akgul,Mari Eplik,Javier Rojas,Aina Binti Abdullah,Pieter van der Merwe
Main category: cs.CL
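TL;DR: CoSense-LLM is an edge-first framework that turns multimodal sensor streams into semantic tokens and coordinates with large language models under latency, energy, bandwidth, and privacy constraints.
Details
Motivation: Large-model deployments face tension between semantic understanding, privacy protection, and low latency. CoSense-LLM aims to unify these goals in an edge-first design, particularly for interference-prone environments.
Contribution: Proposes four core components: (i) SenseFusion, a lightweight encoder; (ii) Edge-RAG, a local retrieval layer; (iii) PromptRouter, a cost- and uncertainty-aware routing policy; (iv) Secure Execution, a privacy-preserving redaction path.
Method: Combines lightweight encoding, local retrieval, cloud-edge cooperation, and privacy-preserving techniques to make semantic generation and model serving more efficient.
Result: Across home, office, and clinic deployments, the system sustains sub-second (p95) latency on edge-dominant paths, reduces bandwidth consumption, and improves factual consistency through local retrieval.
Insight: An edge-first design can treat semantics, privacy, and predictable latency as co-equal goals, which suits resource-constrained and interference-prone environments.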
Abstract: We present CoSense-LLM, an edge-first framework that turns continuous multimodal sensor streams (for example Wi-Fi CSI, IMU, audio, RFID, and lightweight vision) into compact, verifiable semantic tokens and coordinates with large language models under explicit latency, energy, bandwidth, and privacy constraints. CoSense-LLM has four parts: (i) SenseFusion, a lightweight encoder that aligns sensor embeddings with language and compresses them into short discrete code sequences; (ii) Edge-RAG, a local hybrid retrieval layer that grounds generation in site specific policies and notes; (iii) PromptRouter, a cost and uncertainty aware policy that selects edge only generation, edge plus retrieval, or compact cloud escalation; and (iv) Secure Execution, an auditable redaction path that enforces data minimization so raw waveforms never leave the device. The system works with modern serving optimizations, including paged or streaming KV caches, FlashAttention style kernels, speculative decoding, and quantized LoRA adapters, and supports on device personalization and federated updates under non IID drift. Across home, office, and clinic deployments, CoSense-LLM delivers grounded explanations while meeting tight service level objectives: it sustains sub second (p95) end to end latency on edge dominant paths, reduces inter tier token and bandwidth costs by preferring local retrieval grounded responses, and preserves privacy by transmitting only discrete codes and redacted metadata. Ablations show that Edge-RAG improves factual consistency and reduces contradictions, calibrated uncertainty enables selective abstention and controlled escalations, and KV plus decoding accelerators lower energy per decision. The results support an edge first design that treats semantics, privacy, and predictable latency as co equal goals for large model deployments in interference prone environments.
[46] Are Large Language Models Sensitive to the Motives Behind Communication?
Addison J. Wu,Ryan Liu,Kerem Oktar,Theodore R. Sumers,Thomas L. Griffiths
Main category: cs.CL
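TL;DR: Studies whether large language models (LLMs) are sensitive to the motives behind human communication. LLMs can rationally discount biased information in controlled settings, perform worse in realistic settings, and become more sensitive after a simple intervention.
Details
Motivation: Human communication is motivated, and LLMs need to account for these motives to work effectively in the real world. It is unclear whether LLMs have this capacity.
Contribution: Provides a comprehensive evaluation of LLM sensitivity to communicative motives, using both controlled experiments and real-world advertising scenarios, and proposes a simple steering intervention that improves model behavior.
Method: Evaluates LLMs with controlled experiments from cognitive science and with sponsored online adverts, and applies an intervention that boosts the salience of intentions and incentives.
Result: In controlled experiments, LLM behavior is consistent with rational models of learning from motivated testimony. With sponsored adverts, their inferences track the rational model much less closely. The intervention substantially increases the correspondence.
Insight: LLMs have a basic sensitivity to the motivations of others, but need further improvement to generalize to complex real-world settings.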
Abstract: Human communication is motivated: people speak, write, and create content with a particular communicative intent in mind. As a result, information that large language models (LLMs) and AI agents process is inherently framed by humans’ intentions and incentives. People are adept at navigating such nuanced information: we routinely identify benevolent or self-serving motives in order to decide what statements to trust. For LLMs to be effective in the real world, they too must critically evaluate content by factoring in the motivations of the source – for instance, weighing the credibility of claims made in a sales pitch. In this paper, we undertake a comprehensive study of whether LLMs have this capacity for motivational vigilance. We first employ controlled experiments from cognitive science to verify that LLMs’ behavior is consistent with rational models of learning from motivated testimony, and find they successfully discount information from biased sources in a human-like manner. We then extend our evaluation to sponsored online adverts, a more naturalistic reflection of LLM agents’ information ecosystems. In these settings, we find that LLMs’ inferences do not track the rational models’ predictions nearly as closely – partly due to additional information that distracts them from vigilance-relevant considerations. However, a simple steering intervention that boosts the salience of intentions and incentives substantially increases the correspondence between LLMs and the rational model. These results suggest that LLMs possess a basic sensitivity to the motivations of others, but generalizing to novel real-world settings will require further improvements to these models.
[47] Do Prompts Reshape Representations? An Empirical Study of Prompting Effects on Embeddings
Cesar Gonzalez-Gutierrez,Dirk Hovy
Main category: cs.CL
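TL;DR: An empirical study of how prompting affects the quality of internal representations in pre-trained language models. Prompt relevance and representation quality do not consistently go together, which challenges a common assumption.
Details
Motivation: Understanding how prompts affect the internal representations of pre-trained language models, especially in zero-shot settings, helps explain how models solve tasks through prompting.
Contribution: Systematically analyzes the effect of different prompt templates on embeddings, reports a mismatch between prompt relevance and representation quality, and examines possible causes.
Method: Runs a series of probing experiments on the effect of various prompt templates on embedding quality in zero-shot classification.
Result: Prompting affects representation quality, but the effect does not consistently correlate with how relevant the prompt is to the target task. More relevant prompts do not always yield better representations.
Insight: The mechanism behind prompting may be more complex than a simple relevance assumption, and other factors, possibly prompt diversity or the model's internal attention mechanisms, need further study.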
Abstract: Prompting is a common approach for leveraging LMs in zero-shot settings. However, the underlying mechanisms that enable LMs to perform diverse tasks without task-specific supervision remain poorly understood. Studying the relationship between prompting and the quality of internal representations can shed light on how pre-trained embeddings may support in-context task solving. In this empirical study, we conduct a series of probing experiments on prompt embeddings, analyzing various combinations of prompt templates for zero-shot classification. Our findings show that while prompting affects the quality of representations, these changes do not consistently correlate with the relevance of the prompts to the target task. This result challenges the assumption that more relevant prompts necessarily lead to better representations. We further analyze potential factors that may contribute to this unexpected behavior.
[48] SmartSwitch: Advancing LLM Reasoning by Overcoming Underthinking via Promoting Deeper Thought Exploration
Xichen Zhang,Sitong Wu,Haoru Tan,Shaozuo Yu,Yinghao Zhu,Ziyi He,Jiaya Jia
Main category: cs.CL
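TL;DR: Proposes the SmartSwitch inference framework to address the performance bottleneck caused by "underthinking" in large language models on complex reasoning tasks. The framework monitors the reasoning process, detects underthinking, and steers the model toward deeper exploration, which significantly improves performance.
Details
Motivation: Large language models do well on complex reasoning tasks, but the accompanying underthinking problem (frequently switching thoughts without exploring them in depth) limits both performance and token efficiency.
Contribution: Proposes the SmartSwitch inference framework, which detects underthinking and guides deeper exploration, significantly improving the model's reasoning ability.
Method: The framework has a perception module and an intervention module. The perception module identifies thought-switch points and scores the potential of the preceding thought with a process reward model. When a high-potential thought is abandoned prematurely, the intervention module backtracks and inserts a "deepening prompt".
Result: On mathematical reasoning benchmarks, SmartSwitch significantly improves the performance of models of different sizes.
Insight: Intervening on underthinking is important for improving LLM reasoning, and SmartSwitch offers a simple and effective way to do so.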
Abstract: The long chain-of-thought (LongCoT) capability is central to the recent breakthroughs achieved by large language models in complex reasoning tasks. However, the accompanying issue of “underthinking”, where models exhibit shallow reasoning by frequently switching thoughts without sufficient exploration, limits both performance and token efficiency. To address this problem, we propose a simple yet effective reasoning strategy: the SmartSwitch inference framework. This framework can be easily integrated into any large language model as a plug-and-play solution, continuously monitoring the model’s reasoning process to detect underthinking and guide it toward deeper exploration of promising but overlooked thoughts. Specifically, the perception module identifies points where thoughts switch and evaluates the potential of the preceding thought using an off-the-shelf process reward model (PRM). If a high-potential thought is found to be prematurely abandoned, the intervention module interrupts the ongoing inference, backtracks to the point before the switch, and inserts a “deepening prompt” to encourage further exploration along that promising path. Extensive experiments on challenging mathematical reasoning benchmarks demonstrate that our method significantly enhances the performance of various large language models of different sizes.
[49] AdaSPEC: Selective Knowledge Distillation for Efficient Speculative Decoders
Yuezhou Hu,Jiaxin Guo,Xinyu Feng,Tuo Zhao
Main category: cs.CL
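TL;DR: AdaSPEC improves the efficiency of speculative decoders through selective knowledge distillation: it filters out difficult-to-fit tokens during distillation, which raises the token acceptance rate.
Details
Motivation: Speculative decoding (SD) depends on the alignment between a small draft model and a large target model. Conventional knowledge distillation minimizes the KL divergence over all tokens, which is misaligned with the SD objective of maximizing token acceptance rate and leads to suboptimal performance.
Contribution: AdaSPEC adds selective token filtering: a reference model identifies difficult-to-fit tokens, and distillation focuses on the simpler ones, improving the alignment between draft and target models.
Method: AdaSPEC uses a reference model to filter out difficult-to-fit tokens and distills the draft model only on the remaining simpler tokens, which increases the token acceptance rate.
Result: Across arithmetic reasoning, instruction following, coding, and summarization, with 31M/1.4B and 350M/2.7B parameter configurations, AdaSPEC improves the token acceptance rate by up to 15% and outperforms DistillSpec.
Insight: Selective knowledge distillation fits the SD objective better, avoids the bottleneck caused by the draft model's limited capacity, and preserves generation quality.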
Abstract: Speculative Decoding (SD) accelerates large language model inference by employing a small draft model to generate predictions, which are then verified by a larger target model. The effectiveness of SD hinges on the alignment between these models, which is typically enhanced by Knowledge Distillation (KD). However, conventional KD methods aim to minimize the KL divergence between the draft and target models across all tokens, a goal that is misaligned with the true objective of SD, which is to maximize token acceptance rate. Therefore, draft models often struggle to fully assimilate the target model’s knowledge due to capacity constraints, leading to suboptimal performance. To address this challenge, we propose AdaSPEC, a novel method that incorporates selective token filtering into the KD process. AdaSPEC utilizes a reference model to identify and filter out difficult-to-fit tokens, enabling the distillation of a draft model that better aligns with the target model on simpler tokens. This approach improves the overall token acceptance rate without compromising generation quality. We evaluate AdaSPEC across diverse tasks, including arithmetic reasoning, instruction-following, coding, and summarization, using model configurations of 31M/1.4B and 350M/2.7B parameters. Our results demonstrate that AdaSPEC consistently outperforms the state-of-the-art DistillSpec method, achieving higher acceptance rates across all tasks (up to 15%). The code is publicly available at https://github.com/yuezhouhu/adaspec.
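The idea of distilling only on tokens the draft model can realistically fit can be sketched as a masked loss. The code below is an illustration under stated assumptions, not the authors' implementation: it assumes a per-token difficulty score is already available (for example, derived from a reference model), keeps a fixed fraction of the lowest-scoring tokens, and averages a per-token distillation loss over the kept tokens only. The selection rule, the `keep_fraction` parameter, and the numbers are all hypothetical.

```python
def select_easy_tokens(difficulty, keep_fraction):
    """Return a 0/1 mask that keeps the `keep_fraction` of tokens with the
    lowest difficulty score (at least one token is always kept)."""
    n_keep = max(1, int(len(difficulty) * keep_fraction))
    order = sorted(range(len(difficulty)), key=lambda i: difficulty[i])
    kept = set(order[:n_keep])
    return [1 if i in kept else 0 for i in range(len(difficulty))]


def masked_mean(losses, mask):
    """Average the per-token losses over the positions where mask == 1."""
    total = sum(loss * m for loss, m in zip(losses, mask))
    return total / sum(mask)


# Toy per-token values for a 4-token sequence (made up for illustration).
difficulty = [0.1, 2.5, 0.3, 1.9]   # assumed scores, e.g. from a reference model
kd_losses = [0.2, 3.0, 0.4, 2.0]    # assumed per-token distillation losses

mask = select_easy_tokens(difficulty, keep_fraction=0.5)
print(mask)
print(masked_mean(kd_losses, mask))
```

Compared with averaging over every token, the masked average stops the hardest tokens from dominating the gradient, which is the intuition the abstract gives for better acceptance rates.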
[50] Adapting Multilingual Models to Code-Mixed Tasks via Model Merging
Prashant Kodali,Vaishnavi Shivkumar,Swarang Joshi,Monojit Choudhary,Ponnurangam Kumaraguru,Manish Shrivastava
Main category: cs.CL
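TL;DR: Studies model merging as an alternative way to adapt to code-mixed NLP tasks. Merging a multilingual base model with a checkpoint obtained by continued pre-training on unlabeled code-mixed text significantly improves classification performance.
Details
Motivation: Code-mixed NLP faces uneven resources across languages and difficult context understanding when inputs mix languages. Conventional approaches such as full fine-tuning or continued pre-training struggle to use unlabeled data efficiently.
Contribution: Proposes a model-merging recipe that combines the base model with a checkpoint pre-trained on unlabeled data, outperforms conventional fine-tuning and continued pre-training, and shows stronger transfer across language pairs.
Method: 1. Continue pre-training a multilingual base model on unlabeled code-mixed text; 2. Merge the resulting checkpoint with the base model; 3. Fine-tune on the downstream task data.
Result: Merged models outperform conventional methods on English-Hindi and English-Spanish classification tasks, gaining 2-5 F1 points over full fine-tuning, and also perform better in cross-pair transfer.
Insight: Model merging uses unlabeled data more effectively and suits low-resource settings. Zero-/few-shot prompting with larger LLMs lags behind fine-tuned models on code-mixed tasks.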
Abstract: We study model merging as a practical alternative to conventional adaptation strategies for code-mixed NLP. Starting from a multilingual base model, we: (i) perform continued pre-training (CPT) on unlabeled code-mixed text to obtain an adapted checkpoint, (ii) merge checkpoint with the base model, and (iii) fine-tune (FT) on the downstream task data. We evaluate our approach for sentence classification (sentiment and hate speech) task in English-Hindi (En-Hi) and English-Spanish (En-Es) using XLM-R and Llama-3.2-1B models. Our results show that merged models consistently outperform full fine-tuning and CPT->FT. We observe gains of 2–5 points in F1 over full fine-tuning and ~1-2 points over CPT->FT, indicating that unlabeled data is leveraged more effectively via merging than via CPT alone. Zero-/few-shot prompting with larger LLMs (e.g., Llama-3.3-70B) lags behind fine-tuned and merged checkpoints, underscoring limits of in-context learning for code-mixed inputs. We further test cross-pair transfer by training on En-Hi and evaluating on En-Ta and En-Ml: merged checkpoints transfer more strongly than monolingual-English baselines (e.g., TV/TIES variants reaching 0.65-0.68 F1 vs 0.61-0.63 for full fine-tuning), suggesting that code-mixed knowledge is a more reliable substrate for low-resource pairs. We conclude with adaptation recipes matched to common data regimes (labeled only; labeled+unlabeled; transfer-only) and discuss limitations and scaling considerations for broader tasks and larger models.
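Step (ii) of the recipe, merging the adapted checkpoint with the base model, can be illustrated with task-vector arithmetic, one common merging scheme. The sketch assumes that "TV" in the abstract refers to task vectors; the scaling factor `alpha`, the parameter name, and the toy weights are hypothetical, and real checkpoints would hold tensors rather than Python lists.

```python
def merge_task_vector(base, adapted, alpha=1.0):
    """Merge two checkpoints with the same parameter names:
    merged = base + alpha * (adapted - base), element by element."""
    merged = {}
    for name, base_weights in base.items():
        adapted_weights = adapted[name]
        merged[name] = [
            b + alpha * (a - b) for b, a in zip(base_weights, adapted_weights)
        ]
    return merged


# Toy checkpoints: one parameter with two weights each (made up).
base_model = {"encoder.weight": [1.0, 2.0]}
cpt_model = {"encoder.weight": [2.0, 4.0]}   # after continued pre-training

merged = merge_task_vector(base_model, cpt_model, alpha=0.5)
print(merged)
```

With `alpha=0` the merge returns the base model and with `alpha=1` it returns the adapted checkpoint, so the factor controls how much of the code-mixed adaptation is carried into the model that is then fine-tuned.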
[51] ToolDreamer: Instilling LLM Reasoning Into Tool Retrievers
Saptarshi Sengupta,Zhengyu Zhou,Jun Araki,Xingbo Wang,Bingqing Wang,Suhang Wang,Zhe Feng
Main category: cs.CL
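TL;DR: Proposes ToolDreamer, a framework that improves tool retrieval by using hypothetical tool descriptions (TD) generated by an LLM, addressing the language mismatch between user requests and TD and improving retrieval performance.
Details
Motivation: Existing retrieval models rank tools by the similarity between the user query and the tool description (TD), but user requests are often poorly aligned with the language of TD, which hurts retrieval.
Contribution: 1. Proposes the ToolDreamer framework, which uses LLM-generated hypothetical TD to improve retrieval. 2. Shows on the ToolRet dataset that the framework improves both sparse and dense retrievers.
Method: An LLM generates hypothetical TD, which are used for retriever training or at inference time so that queries and tools are matched more naturally.
Result: Experiments show that ToolDreamer improves retriever performance both with and without training, demonstrating its flexibility.
Insight: Offloading part of the reasoning burden to the retriever lets the LLM handle large tool collections without overloading its context window.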
Abstract: Tool calling has become increasingly popular for Large Language Models (LLMs). However, for large tool sets, the resulting tokens would exceed the LLM’s context window limit, making it impossible to include every tool. Hence, an external retriever is used to provide LLMs with the most relevant tools for a query. Existing retrieval models rank tools based on the similarity between a user query and a tool description (TD). This leads to suboptimal retrieval as user requests are often poorly aligned with the language of TD. To remedy the issue, we propose ToolDreamer, a framework to condition retriever models to fetch tools based on hypothetical (synthetic) TD generated using an LLM, i.e., description of tools that the LLM feels will be potentially useful for the query. The framework enables a more natural alignment between queries and tools within the language space of TD’s. We apply ToolDreamer on the ToolRet dataset and show that our method improves the performance of sparse and dense retrievers with and without training, thus showcasing its flexibility. Through our proposed framework, our aim is to offload a portion of the reasoning burden to the retriever so that the LLM may effectively handle a large collection of tools without inundating its context window.
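The retrieval idea can be sketched with a toy sparse retriever: instead of matching the raw user query against tool descriptions, match a hypothetical tool description written for that query. Everything here is an assumption made for illustration: the tool names and descriptions are invented, `dream_tool_description` is a stand-in that returns a canned string where the framework would call an LLM, and bag-of-words cosine similarity replaces whatever sparse or dense retriever is actually used.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lower-cased whitespace tokens with their counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def rank_tools(text, tools):
    """Rank tool names by similarity between `text` and each description."""
    query_vec = bag_of_words(text)
    return sorted(
        tools, key=lambda name: cosine(query_vec, bag_of_words(tools[name])), reverse=True
    )


# Hypothetical tool set.
TOOLS = {
    "calendar_api": "creates and lists calendar events for a user",
    "weather_api": "returns the weather forecast for a city and date",
}


def dream_tool_description(query):
    """Stand-in for an LLM call that imagines a useful tool for the query."""
    return "a tool that returns the weather forecast for a given city"


query = "should i bring an umbrella to paris tomorrow"
print(rank_tools(query, TOOLS))                          # raw query: no word overlap
print(rank_tools(dream_tool_description(query), TOOLS))  # hypothetical TD
```

The raw query shares no words with either description, so its ranking is arbitrary, while the hypothetical description lives in the same vocabulary as the real ones and surfaces the weather tool.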
[52] Scaf-GRPO: Scaffolded Group Relative Policy Optimization for Enhancing LLM Reasoning
Xichen Zhang,Sitong Wu,Yinghao Zhu,Haoru Tan,Shaozuo Yu,Ziyi He,Jiaya Jia
Main category: cs.CL
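TL;DR: Scaf-GRPO is a progressive training framework that injects tiered hints when a model's learning stalls, helping LLMs solve problems beyond their current ability and significantly improving performance on mathematical reasoning.
Details
Motivation: Existing reinforcement learning methods hit a "learning cliff" when an LLM faces problems far beyond its current capabilities: the learning signal vanishes and no progress is made.
Contribution: Proposes the Scaf-GRPO framework, which diagnoses learning stagnation and injects tiered hints (from abstract concepts to concrete steps) to progressively extend the LLM's ability to solve hard problems.
Method: Uses a progressive training strategy that provides tiered in-prompt hints only when the model's independent learning has plateaued, and optimizes the policy with Group Relative Policy Optimization (GRPO).
Result: On the AIME24 mathematics benchmark, Scaf-GRPO improves the pass@1 score of Qwen2.5-Math-7B by a relative 44.3% over a vanilla GRPO baseline.
Insight: Tiered guidance gives LLMs an effective way past the "learning cliff" and extends the frontier of their autonomous reasoning.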
Abstract: Reinforcement learning from verifiable rewards has emerged as a powerful technique for enhancing the complex reasoning abilities of Large Language Models (LLMs). However, these methods are fundamentally constrained by the “learning cliff” phenomenon: when faced with problems far beyond their current capabilities, models consistently fail, yielding a persistent zero-reward signal. In policy optimization algorithms like GRPO, this collapses the advantage calculation to zero, rendering these difficult problems invisible to the learning gradient and stalling progress. To overcome this, we introduce Scaf-GRPO (Scaffolded Group Relative Policy Optimization), a progressive training framework that strategically provides minimal guidance only when a model’s independent learning has plateaued. The framework first diagnoses learning stagnation and then intervenes by injecting tiered in-prompt hints, ranging from abstract concepts to concrete steps, enabling the model to construct a valid solution by itself. Extensive experiments on challenging mathematics benchmarks demonstrate Scaf-GRPO’s effectiveness, boosting the pass@1 score of the Qwen2.5-Math-7B model on the AIME24 benchmark by a relative 44.3% over a vanilla GRPO baseline. This result demonstrates our framework provides a robust and effective methodology for unlocking a model’s ability to solve problems previously beyond its reach, a critical step towards extending the frontier of autonomous reasoning in LLM.
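The "learning cliff" described in this abstract follows directly from how group-relative advantages are normalized. The sketch below uses the commonly cited form, in which each sampled response's reward is centered on the group mean and divided by the group standard deviation; the small `eps` term and the use of the population standard deviation are assumptions made here for numerical safety, not details taken from the paper.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Center each reward on the group mean and scale by the group's
    standard deviation (plus eps so a zero-variance group does not divide by 0)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# A problem far beyond the model: every sampled response fails.
all_failed = group_relative_advantages([0, 0, 0, 0])

# A problem within reach: one of four responses is verified correct.
one_solved = group_relative_advantages([0, 0, 1, 0])

print(all_failed)
print(one_solved)
```

When every response in the group gets zero reward, every advantage is zero and the problem contributes nothing to the policy gradient. As soon as one response succeeds, for instance with the help of a hint, the successful response gets a positive advantage and the failures a negative one, which restores a learning signal.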
[53] Hubble: a Model Suite to Advance the Study of LLM Memorization
Johnny Tian-Zheng Wei,Ameya Godbole,Mohammad Aflah Khan,Ryan Wang,Xiaoyuan Zhu,James Flemings,Nitya Kashyap,Krishna P. Gummadi,Willie Neiswanger,Robin Jia
Main category: cs.CL
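TL;DR: Hubble is an open-source suite of LLMs for studying LLM memorization. Its standard and perturbed models are designed to explore the risks of memorizing sensitive data and strategies for mitigating them.
Details
Motivation: To study memorization in large language models (LLMs), in particular how sensitive data is memorized and forgotten during training, and to provide empirical support for mitigating privacy risks.
Contribution: Releases the Hubble model suite with standard and perturbed variants, identifies key factors in memorization risk (such as data frequency relative to corpus size and the training phase in which data appears), and proposes two best practices: diluting sensitive data by increasing the size of the training corpus, and ordering sensitive data to appear earlier in training.
Method: Trains standard models and perturbed models. The perturbed models receive controlled insertions of text (e.g., book passages, biographies, and test sets), and memorization behavior is studied by comparing parameter scales, corpus sizes, and insertion phases.
Result: Memorization of sensitive data depends on its frequency relative to the size of the training corpus and on when it appears in training: the same data is memorized better in a smaller corpus than in a larger one, and data seen without continued exposure later in training can be forgotten.
Insight: Hubble provides a benchmark tool for memorization research and also a testbed for privacy topics such as membership inference and machine unlearning.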
Abstract: We present Hubble, a suite of fully open-source large language models (LLMs) for the scientific study of LLM memorization. Hubble models come in standard and perturbed variants: standard models are pretrained on a large English corpus, and perturbed models are trained in the same way but with controlled insertion of text (e.g., book passages, biographies, and test sets) designed to emulate key memorization risks. Our core release includes 8 models – standard and perturbed models with 1B or 8B parameters, pretrained on 100B or 500B tokens – establishing that memorization risks are determined by the frequency of sensitive data relative to the size of the training corpus (i.e., a password appearing once in a smaller corpus is memorized better than the same password in a larger corpus). Our release also includes 6 perturbed models with text inserted at different pretraining phases, showing that sensitive data without continued exposure can be forgotten. These findings suggest two best practices for addressing memorization risks: to dilute sensitive data by increasing the size of the training corpus, and to order sensitive data to appear earlier in training. Beyond these general empirical findings, Hubble enables a broad range of memorization research; for example, analyzing the biographies reveals how readily different types of private information are memorized. We also demonstrate that the randomized insertions in Hubble make it an ideal testbed for membership inference and machine unlearning, and invite the community to further explore, benchmark, and build upon our work.
cs.CV [Back]
[54] Robust Driving QA through Metadata-Grounded Context and Task-Specific Prompts
Seungjun Yu,Junsung Park,Youngsun Lim,Hyunjung Shim
Main category: cs.CV
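TL;DR: This paper proposes a two-phase vision-language QA system for high-level perception, prediction, and planning questions in autonomous driving. Combining a large multimodal model, temporal history, and metadata-augmented prompts raises overall accuracy from 62.61% (zero-shot) to 67.37%.
Details
Motivation: High-level vision-language QA in driving scenes involves complex perception, prediction, and planning questions, and existing methods leave room for improvement. The paper aims to improve the performance and robustness of QA systems through engineered prompts and context augmentation.
Contribution: 1. A two-phase vision-language QA system; 2. A metadata-grounding method that combines history with task-specific prompts; 3. Demonstrated robustness under visual corruption (96% accuracy).
Method: 1. Phase 1: a large multimodal model (Qwen2.5-VL-32B) conditioned on six-camera inputs, history frames, and few-shot prompts; 2. Phase 2: the prompt is augmented with scene metadata (e.g., object annotations, ego-vehicle state) and task-specific instructions; 3. A self-consistency ensemble (multiple sampled reasoning chains) improves reliability.
Result: On a driving QA benchmark, Phase 1 with 5 history frames and 10-shot prompting reaches 65.1% overall accuracy versus 62.61% zero-shot; self-consistency raises this to 66.85%; Phase 2 reaches 67.37% overall. The system maintains 96% accuracy under severe visual corruption.
Insight: Carefully engineered prompts and contextual grounding can substantially improve pretrained vision-language models on driving tasks. The system's robustness suggests practical value in complex scenes.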
Abstract: We present a two-phase vision-language QA system for autonomous driving that answers high-level perception, prediction, and planning questions. In Phase-1, a large multimodal LLM (Qwen2.5-VL-32B) is conditioned on six-camera inputs, a short temporal window of history, and a chain-of-thought prompt with few-shot exemplars. A self-consistency ensemble (multiple sampled reasoning chains) further improves answer reliability. In Phase-2, we augment the prompt with nuScenes scene metadata (object annotations, ego-vehicle state, etc.) and category-specific question instructions (separate prompts for perception, prediction, planning tasks). In experiments on a driving QA benchmark, our approach significantly outperforms the baseline Qwen2.5 models. For example, using 5 history frames and 10-shot prompting in Phase-1 yields 65.1% overall accuracy (vs. 62.61% with zero-shot); applying self-consistency raises this to 66.85%. Phase-2 achieves 67.37% overall. Notably, the system maintains 96% accuracy under severe visual corruption. These results demonstrate that carefully engineered prompts and contextual grounding can greatly enhance high-level driving QA with pretrained vision-language models.
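The self-consistency ensemble mentioned above can be sketched as a majority vote over the final answers of several sampled reasoning chains. This is a minimal illustration, assuming answers are short strings such as multiple-choice letters; the function name `self_consistent_answer` and the lowercase normalization are hypothetical and not taken from the paper.

```python
from collections import Counter


def self_consistent_answer(sampled_answers):
    """Return the most frequent final answer and its vote share.

    Each element is the final answer extracted from one sampled
    reasoning chain. Normalization (strip + lowercase) is an assumption.
    """
    counts = Counter(a.strip().lower() for a in sampled_answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(sampled_answers)


# Five sampled chains for one multiple-choice question.
answer, agreement = self_consistent_answer(["B", "b ", "A", "B", "C"])
```

The vote share can double as a rough confidence signal, although the abstract does not say whether the authors use it that way.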
[55] $Δ$t-Mamba3D: A Time-Aware Spatio-Temporal State-Space Model for Breast Cancer Risk Prediction
Zhengbo Zhou,Dooman Arefan,Margarita Zuley,Shandong Wu
Main category: cs.CV
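TL;DR: The paper proposes $Δ$t-Mamba3D, a state-space model for spatio-temporal modeling of mammogram sequences acquired at irregular time intervals, which improves breast cancer risk prediction.
Details
Motivation: Existing methods fail to fully exploit the spatio-temporal information in sequences of high-resolution medical images. They either collapse spatial information into vectors or rely on spatio-temporal models that are computationally inefficient and incompatible with non-uniform time steps, which limits predictive performance.
Contribution: 1. $Δ$t-Mamba3D, which jointly encodes irregular inter-visit intervals and rich spatio-temporal context; 2. A continuous-time selective scanning mechanism that explicitly integrates the true time difference between exams; 3. A multi-scale 3D neighborhood fusion module that strengthens the capture of spatio-temporal relationships.
Method: 1. A state-space architecture with a continuous-time selective scanning mechanism; 2. A multi-scale 3D neighborhood fusion module that models spatio-temporal relationships; 3. Linear complexity, which suits long sequences.
Result: On breast cancer risk prediction, the model outperforms established recurrent, transformer, and state-space variants, improving the validation c-index by 2-5 percentage points and achieving higher 1-5 year AUC scores.
Insight: Explicitly modeling time intervals and multi-scale spatio-temporal information matters for longitudinal medical image analysis, and a computationally efficient design makes long screening histories tractable.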
Abstract: Longitudinal analysis of sequential radiological images is hampered by a fundamental data challenge: how to effectively model a sequence of high-resolution images captured at irregular time intervals. This data structure contains indispensable spatial and temporal cues that current methods fail to fully exploit. Models often compromise by either collapsing spatial information into vectors or applying spatio-temporal models that are computationally inefficient and incompatible with non-uniform time steps. We address this challenge with Time-Aware $\Delta$t-Mamba3D, a novel state-space architecture adapted for longitudinal medical imaging. Our model simultaneously encodes irregular inter-visit intervals and rich spatio-temporal context while remaining computationally efficient. Its core innovation is a continuous-time selective scanning mechanism that explicitly integrates the true time difference between exams into its state transitions. This is complemented by a multi-scale 3D neighborhood fusion module that robustly captures spatio-temporal relationships. In a comprehensive breast cancer risk prediction benchmark using sequential screening mammogram exams, our model shows superior performance, improving the validation c-index by 2-5 percentage points and achieving higher 1-5 year AUC scores compared to established variants of recurrent, transformer, and state-space models. Thanks to its linear complexity, the model can efficiently process long and complex patient screening histories of mammograms, forming a new framework for longitudinal image analysis.
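To illustrate how a real time gap can enter a state transition, the sketch below applies the standard zero-order-hold discretization of a scalar linear state-space model with a per-step interval. It is a simplified stand-in, not the paper's mechanism: the function name `time_aware_scan`, the scalar parameters `a` and `b`, and the time units are assumptions.

```python
import math


def time_aware_scan(inputs, deltas, a=-0.5, b=1.0):
    """Scan a scalar SSM h' = a*h + b*x with a per-step time gap.

    deltas[i] is the elapsed time before observation i (hypothetical
    units, e.g. years between exams). With a < 0, a longer gap decays
    the carried state more strongly.
    """
    h, states = 0.0, []
    for x, dt in zip(inputs, deltas):
        a_bar = math.exp(a * dt)        # state decay over the real gap
        b_bar = (a_bar - 1.0) / a * b   # exact zero-order-hold input gain
        h = a_bar * h + b_bar * x
        states.append(h)
    return states


# Same inputs, different gap before the second exam.
short_gap = time_aware_scan([1.0, 0.0], [1.0, 1.0])
long_gap = time_aware_scan([1.0, 0.0], [1.0, 3.0])
```

With a uniform step the two calls would be indistinguishable; feeding the true interval lets the memory of the first exam fade more after a three-unit gap than after a one-unit gap.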
[56] MoAlign: Motion-Centric Representation Alignment for Video Diffusion Models
Aritra Bhowmik,Denis Korzhenkov,Cees G. M. Snoek,Amirhossein Habibian,Mohsen Ghafoorian
Main category: cs.CV
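TL;DR: The paper proposes MoAlign, a motion-centric representation alignment method that learns a disentangled motion subspace from a video encoder and aligns diffusion-model features to it, improving the motion coherence and physical plausibility of text-to-video generation.
Details
Motivation: Text-to-video diffusion models often generate complex motion that lacks coherence and physical plausibility because they understand video motion dynamics poorly. Prior work addresses this by aligning with video encoder features, but those features entangle appearance and motion dynamics, which limits the benefit.
Contribution: A framework that disentangles motion dynamics: it learns a disentangled motion subspace from a video encoder and aligns the diffusion model's features to it, improving the motion coherence and physical plausibility of generated videos.
Method: 1. Learn a disentangled motion subspace from a pretrained video encoder; 2. Optimize the subspace to predict ground-truth optical flow so that it captures true motion dynamics; 3. Align the latent features of the diffusion model to this motion subspace.
Result: Evaluations on VideoPhy, VideoPhy2, VBench, and VBench-2.0 show improved physical commonsense in generated videos while adherence to text prompts is preserved. A user study supports these findings.
Insight: Disentangling motion dynamics is key to better video generation; feature alignment transfers the motion knowledge of a pretrained model into a generative model efficiently.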
Abstract: Text-to-video diffusion models have enabled high-quality video synthesis, yet often fail to generate temporally coherent and physically plausible motion. A key reason is the models’ insufficient understanding of complex motions that natural videos often entail. Recent works tackle this problem by aligning diffusion model features with those from pretrained video encoders. However, these encoders mix video appearance and dynamics into entangled features, limiting the benefit of such alignment. In this paper, we propose a motion-centric alignment framework that learns a disentangled motion subspace from a pretrained video encoder. This subspace is optimized to predict ground-truth optical flow, ensuring it captures true motion dynamics. We then align the latent features of a text-to-video diffusion model to this new subspace, enabling the generative model to internalize motion knowledge and generate more plausible videos. Our method improves the physical commonsense in a state-of-the-art video diffusion model, while preserving adherence to textual prompts, as evidenced by empirical evaluations on VideoPhy, VideoPhy2, VBench, and VBench-2.0, along with a user study.
[57] PoSh: Using Scene Graphs To Guide LLMs-as-a-Judge For Detailed Image Descriptions
Amith Ananthram,Elias Stengel-Eskin,Lorena A. Bradford,Julia Demarest,Adam Purvis,Keith Krut,Robert Stein,Rina Elster Pantalony,Mohit Bansal,Kathleen McKeown
Main category: cs.CV
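TL;DR: PoSh is a new metric for detailed image description that uses scene graphs to guide LLMs-as-a-Judge and produces scores grounded in fine-grained errors. DOCENT is a new dataset used to validate PoSh and to serve as a benchmark for detailed image description.
Details
Motivation: Existing image description metrics (e.g., CIDEr, SPICE) were designed for short texts and struggle to assess attribute and relation errors in long texts. A more sensitive evaluation method is needed.
Contribution: 1. PoSh, a metric that uses scene graphs to guide LLMs-as-a-Judge; 2. DOCENT, a dataset of artwork paired with expert-written references; 3. Validation of PoSh in a challenging domain (art).
Method: PoSh builds structured rubrics from scene graphs and uses them to guide an LLM in producing fine-grained error scores. The DOCENT dataset, which contains artwork with expert-written references, is used to validate the metric.
Result: On DOCENT, PoSh correlates more strongly with human judgments than the best open-weight alternatives (+0.05 Spearman ρ), and it can serve as a reward function that outperforms standard supervised fine-tuning.
Insight: Foundation models still struggle with images that have rich scene dynamics, and DOCENT poses a new challenge for evaluating VLMs.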
Abstract: While vision-language models (VLMs) have advanced into detailed image description, evaluation remains a challenge. Standard metrics (e.g. CIDEr, SPICE) were designed for short texts and tuned to recognize errors that are now uncommon, such as object misidentification. In contrast, long texts require sensitivity to attribute and relation attachments and scores that localize errors to particular text spans. In this work, we introduce PoSh, a metric for detailed image description that uses scene graphs as structured rubrics to guide LLMs-as-a-Judge, producing aggregate scores grounded in fine-grained errors (e.g. mistakes in compositional understanding). PoSh is replicable, interpretable and a better proxy for human raters than existing metrics (including GPT4o-as-a-Judge). To validate PoSh, we introduce a challenging new dataset, DOCENT. This novel benchmark contains artwork, paired with expert-written references, and model-generated descriptions, augmented with granular and coarse judgments of their quality from art history students. Thus, DOCENT enables evaluating both detailed image description metrics and detailed image description itself in a challenging new domain. We show that PoSh achieves stronger correlations (+0.05 Spearman $\rho$) with the human judgments in DOCENT than the best open-weight alternatives, is robust to image type (using CapArena, an existing dataset of web imagery) and is a capable reward function, outperforming standard supervised fine-tuning. Then, using PoSh, we characterize the performance of open and closed models in describing the paintings, sketches and statues in DOCENT and find that foundation models struggle to achieve full, error-free coverage of images with rich scene dynamics, establishing a demanding new task to gauge VLM progress. Through both PoSh and DOCENT, we hope to enable advances in important areas such as assistive text generation.
[58] UniHPR: Unified Human Pose Representation via Singular Value Contrastive Learning
Zhongyu Jiang,Wenhao Chai,Lei Li,Zhuoran Zhou,Cheng-Yen Yang,Jenq-Neng Hwang
Main category: cs.CV
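TL;DR: UniHPR is a unified pose representation learning method that aligns image, 2D, and 3D human pose embeddings through singular value-based contrastive learning and performs well on 2D/3D pose estimation.
Details
Motivation: Existing work offers little systematic study of the correlation among modalities such as images, 2D keypoints, and 3D skeletons. UniHPR aims to fill this gap and improve the unity and performance of pose representations.
Contribution: Proposes the UniHPR framework, which aligns multi-modal pose representations with a novel singular value-based contrastive loss, and validates it on 2D/3D pose estimation and retrieval.
Method: Designs a singular value-based contrastive loss that aligns image, 2D, and 3D pose embeddings, and evaluates the aligned representation with a simple 3D pose decoder.
Result: Achieves MPJPE 49.9 mm on Human3.6M and PA-MPJPE 51.6 mm on 3DPW (cross-domain evaluation), with a pose retrieval error of 9.24 mm MPJPE on Human3.6M.
Insight: Singular value-based contrastive learning is an effective tool for cross-modal alignment, and a unified pose representation can improve downstream task performance.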
Abstract: In recent years, there has been a growing interest in developing effective alignment pipelines to generate unified representations from different modalities for multi-modal fusion and generation. As an important component of Human-Centric applications, Human Pose representations are critical in many downstream tasks, such as Human Pose Estimation, Action Recognition, Human-Computer Interaction, Object Tracking, etc. Human Pose representations or embeddings can be extracted from images, 2D keypoints, 3D skeletons, mesh models, and many other modalities. Yet, there are limited instances where the correlation among all of those representations has been clearly researched using a contrastive paradigm. In this paper, we propose UniHPR, a unified Human Pose Representation learning pipeline, which aligns Human Pose embeddings from images, 2D and 3D human poses. To align more than two data representations at the same time, we propose a novel singular value-based contrastive learning loss, which better aligns different modalities and further boosts performance. To evaluate the effectiveness of the aligned representation, we choose 2D and 3D Human Pose Estimation (HPE) as our evaluation tasks. In our evaluation, with a simple 3D human pose decoder, UniHPR achieves remarkable performance metrics: MPJPE 49.9mm on the Human3.6M dataset and PA-MPJPE 51.6mm on the 3DPW dataset with cross-domain evaluation. Meanwhile, we are able to achieve 2D and 3D pose retrieval with our unified human pose representations on the Human3.6M dataset, where the retrieval error is 9.24mm in MPJPE.
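The MPJPE figures quoted above are mean per-joint position errors: the Euclidean distance between predicted and reference joints, averaged over joints. A minimal sketch follows, assuming poses are given as lists of 3D coordinates in millimetres; the function name `mpjpe` and the toy poses are illustrative only.

```python
import math


def mpjpe(pred, target):
    """Mean per-joint position error between two poses.

    `pred` and `target` are equal-length sequences of (x, y, z) joints;
    the result is in the same unit as the inputs.
    """
    if len(pred) != len(target):
        raise ValueError("poses must have the same number of joints")
    return sum(math.dist(p, t) for p, t in zip(pred, target)) / len(pred)


# Toy two-joint pose: the first joint is exact, the second is 5 mm off.
error_mm = mpjpe([(0, 0, 0), (3, 4, 0)], [(0, 0, 0), (0, 0, 0)])
```

PA-MPJPE applies the same formula after rigidly aligning the prediction to the reference (Procrustes alignment), which this sketch does not implement.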
[59] Advancing Brain Tumor Segmentation via Attention-based 3D U-Net Architecture and Digital Image Processing
Eyad Gad,Seif Soliman,M. Saeed Darweesh
Main category: cs.CV
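TL;DR: This paper proposes an attention-based 3D U-Net combined with digital image processing to improve brain tumor segmentation. The reported results on BraTS 2020 exceed those of related studies.
Details
Motivation: Standard U-Net models struggle with irregular tumor shapes and ambiguous boundaries, and training on high-resolution MRI data demands substantial compute and suffers from class imbalance.
Contribution: 1. Integrates an attention mechanism into the 3D U-Net to better capture intricate details; 2. Uses digital image processing to address data imbalance; 3. Reports segmentation performance on BraTS 2020 that exceeds related studies.
Method: 1. An attention-augmented 3D U-Net architecture; 2. A tumor detection step based on digital image processing that balances the training data; 3. Evaluation on the BraTS 2020 dataset.
Result: On BraTS 2020 the model reaches a Dice score of 0.975, specificity of 0.988, and sensitivity of 0.995.
Insight: Combining attention with image processing can improve the accuracy and robustness of brain tumor segmentation and offers a more reliable aid for clinical diagnosis.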
Abstract: In the realm of medical diagnostics, rapid advancements in Artificial Intelligence (AI) have yielded remarkable improvements in brain tumor segmentation. Encoder-Decoder architectures, such as U-Net, have played a transformative role by effectively extracting meaningful representations in 3D brain tumor segmentation from magnetic resonance imaging (MRI) scans. However, standard U-Net models encounter challenges in accurately delineating tumor regions, especially when dealing with irregular shapes and ambiguous boundaries. Additionally, training robust segmentation models on high-resolution MRI data, such as the BraTS datasets, necessitates high computational resources and often faces challenges associated with class imbalance. This study proposes the integration of the attention mechanism into the 3D U-Net model, enabling the model to capture intricate details and prioritize informative regions during the segmentation process. Additionally, a tumor detection algorithm based on digital image processing techniques is utilized to address the issue of imbalanced training data and mitigate bias. This study aims to enhance the performance of brain tumor segmentation, ultimately improving the reliability of diagnosis. The proposed model is thoroughly evaluated on the BraTS 2020 dataset using various performance metrics to accomplish this goal. The obtained results indicate that the model outperformed related studies, with a Dice score of 0.975, specificity of 0.988, and sensitivity of 0.995. This supports the efficacy of the proposed model in improving brain tumor segmentation and offers valuable insights for reliable diagnosis in clinical settings.
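The three reported metrics can be computed from the confusion counts of a binary mask. The sketch below uses flat lists of 0/1 values and a hypothetical function name, `segmentation_metrics`; it is not the authors' evaluation code, and the tiny mask is invented to show how class imbalance affects each score.

```python
def segmentation_metrics(pred, truth):
    """Dice, sensitivity and specificity for flat binary masks.

    Assumes both masks contain at least one positive and one negative
    element in `truth`; otherwise a denominator would be zero.
    """
    pairs = list(zip(pred, truth))
    tp = sum(1 for p, t in pairs if p and t)
    fp = sum(1 for p, t in pairs if p and not t)
    fn = sum(1 for p, t in pairs if not p and t)
    tn = sum(1 for p, t in pairs if not p and not t)
    return {
        "dice": 2 * tp / (2 * tp + fp + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Mostly-background toy mask: 2 tumor voxels out of 8.
metrics = segmentation_metrics(
    pred=[1, 1, 0, 0, 0, 0, 0, 0],
    truth=[1, 0, 1, 0, 0, 0, 0, 0],
)
```

In this toy case only half of the tumor is found (Dice 0.5), yet specificity stays above 0.8 because background dominates, which is why imbalanced data calls for reading these metrics together.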
[60] A Novel Approach to Breast Cancer Segmentation using U-Net Model with Attention Mechanisms and FedProx
Eyad Gad,Mustafa Abou Khatwa,Mustafa A. Elattar,Sahar Selim
Main category: cs.CV
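TL;DR: This paper proposes a breast cancer segmentation method that combines a modified U-Net with attention mechanisms and FedProx, targeting accuracy and privacy when training on non-IID medical data.
Details
Motivation: Breast cancer is a leading cause of death among women, so early detection and accurate diagnosis are crucial. Ultrasound imaging is reliable and cost-effective, but the sensitivity of medical data makes it hard to develop AI models that are both accurate and privacy-preserving.
Contribution: Combines a modified U-Net with attention mechanisms and the FedProx method to address accuracy and generalization when training on non-IID data while protecting patient privacy.
Method: Trains on non-IID breast ultrasound image datasets with FedProx, using a modified U-Net with attention mechanisms to improve tumor segmentation accuracy.
Result: The global model reaches 96% accuracy, supporting the method's effectiveness in improving segmentation accuracy while preserving privacy.
Insight: FedProx is a promising approach for training accurate machine learning models on non-IID local medical datasets.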
Abstract: Breast cancer is a leading cause of death among women worldwide, emphasizing the need for early detection and accurate diagnosis. Ultrasound imaging, a reliable and cost-effective tool, is used for this purpose; however, the sensitive nature of medical data makes it challenging to develop accurate and private artificial intelligence models. Federated Learning is a promising solution, as it enables distributed machine learning on sensitive medical data while preserving patient privacy. However, training on non-Independent and Identically Distributed (non-IID) local datasets can impact the accuracy and generalization of the trained model, which is crucial for accurate tumour boundary delineation in breast cancer segmentation. This study aims to tackle this challenge by applying the Federated Proximal (FedProx) method to non-IID Ultrasonic Breast Cancer Imaging datasets. Moreover, we focus on enhancing tumour segmentation accuracy by incorporating a modified U-Net model with attention mechanisms. Our approach resulted in a global model with 96% accuracy, demonstrating the effectiveness of our method in enhancing tumour segmentation accuracy while preserving patient privacy. Our findings suggest that FedProx is a promising approach for training precise machine learning models on non-IID local medical datasets.
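FedProx differs from plain federated averaging by adding a proximal term, (mu / 2) * ||w - w_global||^2, to each client's local objective, which limits how far a client trained on non-IID data can drift from the global model. The sketch below shows one local step and a size-weighted aggregation on plain Python lists; the function names, learning rate, and `mu` value are assumptions for illustration, not settings from the paper.

```python
def fedprox_local_step(w, w_global, grad, lr=0.1, mu=0.01):
    """One gradient step on loss(w) + (mu / 2) * ||w - w_global||^2.

    `grad` is the gradient of the client's own loss at `w`; the proximal
    term contributes mu * (w - w_global).
    """
    return [wi - lr * (gi + mu * (wi - wg))
            for wi, gi, wg in zip(w, grad, w_global)]


def federated_average(client_weights, client_sizes):
    """Aggregate client models, weighting each by its sample count."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [sum(w[j] * n for w, n in zip(client_weights, client_sizes)) / total
            for j in range(dim)]


# With a zero task gradient, the proximal term alone pulls w toward w_global.
pulled = fedprox_local_step([1.0], [0.0], [0.0], lr=0.5, mu=1.0)
merged = federated_average([[0.0], [4.0]], [1, 3])
```

Setting `mu=0` recovers an ordinary local gradient step, so the proximal strength is the knob that trades local fit against agreement with the global model.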
[61] X-Ego: Acquiring Team-Level Tactical Situational Awareness via Cross-Egocentric Contrastive Video Representation Learning
Yunzhe Wang,Soham Hans,Volkan Ustun
Main category: cs.CV
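TL;DR: The paper introduces the X-Ego-CS dataset and Cross-Ego Contrastive Learning (CECL), which uses synchronized first-person videos to build team-level tactical situational awareness.
Details
Motivation: Existing work on modeling team interactions relies mostly on third-person views and overlooks the synchronous, egocentric nature of multi-agent learning.
Contribution: 1. Releases X-Ego-CS, a dataset of 124 hours of synchronized multi-player first-person video; 2. Proposes CECL, which aligns teammates' perspectives through contrastive learning to improve tactical awareness.
Method: CECL uses contrastive learning to align teammates' first-person video streams, which the dataset pairs with state-action trajectories, to build team-level situational awareness.
Result: CECL improves the prediction of teammate and opponent locations from a single first-person view, supporting its effectiveness.
Insight: Gameplay understanding is a useful testbed for multi-agent modeling and tactical learning, with implications for spatiotemporal reasoning and human-AI teaming.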
Abstract: Human team tactics emerge from each player’s individual perspective and their ability to anticipate, interpret, and adapt to teammates’ intentions. While advances in video understanding have improved the modeling of team interactions in sports, most existing work relies on third-person broadcast views and overlooks the synchronous, egocentric nature of multi-agent learning. We introduce X-Ego-CS, a benchmark dataset consisting of 124 hours of gameplay footage from 45 professional-level matches of the popular e-sports game Counter-Strike 2, designed to facilitate research on multi-agent decision-making in complex 3D environments. X-Ego-CS provides cross-egocentric video streams that synchronously capture all players’ first-person perspectives along with state-action trajectories. Building on this resource, we propose Cross-Ego Contrastive Learning (CECL), which aligns teammates’ egocentric visual streams to foster team-level tactical situational awareness from an individual’s perspective. We evaluate CECL on a teammate-opponent location prediction task, demonstrating its effectiveness in enhancing an agent’s ability to infer both teammate and opponent positions from a single first-person view using state-of-the-art video encoders. Together, X-Ego-CS and CECL establish a foundation for cross-egocentric multi-agent benchmarking in esports. More broadly, our work positions gameplay understanding as a testbed for multi-agent modeling and tactical learning, with implications for spatiotemporal reasoning and human-AI teaming in both virtual and real-world domains. Code and dataset are available at https://github.com/HATS-ICT/x-ego.
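Aligning teammates' egocentric streams with a contrastive objective can be sketched with an InfoNCE-style loss, where each player's clip embedding should be closer to its synchronized teammate clip than to clips from other moments. The abstract does not state which loss CECL uses, so InfoNCE, the temperature, and the toy 2D embeddings here are assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchors, positives, temperature=0.1):
    """Average InfoNCE loss; positives[i] is the match for anchors[i].

    All other entries of `positives` act as negatives for anchor i.
    """
    loss = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        log_norm = math.log(sum(math.exp(l) for l in logits))
        loss += log_norm - logits[i]
    return loss / len(anchors)


player_a = [[1.0, 0.0], [0.0, 1.0]]           # two moments, player A's view
teammate_synced = [[1.0, 0.0], [0.0, 1.0]]    # same moments, teammate's view
teammate_shuffled = [[0.0, 1.0], [1.0, 0.0]]  # mismatched moments

aligned_loss = info_nce(player_a, teammate_synced)
misaligned_loss = info_nce(player_a, teammate_shuffled)
```

The loss is near zero when synchronized views already agree and large when they are mismatched, which is the pressure that pulls teammates' representations of the same moment together.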
[62] FootFormer: Estimating Stability from Visual Input
Keaton Kraiger,Jingjing Li,Skanda Bharadwaj,Jesse Scott,Robert T. Collins,Yanxi Liu
Main category: cs.CV
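TL;DR: FootFormer is a cross-modality method that predicts human motion dynamics directly from visual input, matching or significantly outperforming existing methods on multiple datasets.
Details
Motivation: Existing methods typically produce only one or two motion dynamics measures (such as foot pressure distribution or center of mass). FootFormer aims to fill this gap by jointly predicting several measures from visual input.
Contribution: A cross-modality method that jointly predicts foot pressure distributions, foot contact maps, and center of mass (CoM), and that achieves SOTA performance on the stability-predictive components (CoP, CoM, BoS) used in classic kinesiology metrics.
Method: FootFormer jointly predicts motion dynamics measures directly from visual input, using cross-modality learning for multi-task prediction.
Result: On multiple datasets FootFormer is statistically significantly better than or equivalent to existing methods, and it reaches SOTA performance on the stability-predictive components (CoP, CoM, BoS).
Insight: Cross-modality learning with joint prediction can integrate visual input with motion dynamics measures, improving both the accuracy and the coverage of the predictions.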
Abstract: We propose FootFormer, a cross-modality approach for jointly predicting human motion dynamics directly from visual input. On multiple datasets, FootFormer achieves statistically significantly better or equivalent estimates of foot pressure distributions, foot contact maps, and center of mass (CoM), as compared with existing methods that generate one or two of those measures. Furthermore, FootFormer achieves SOTA performance in estimating stability-predictive components (CoP, CoM, BoS) used in classic kinesiology metrics. Code and data are available at https://github.com/keatonkraiger/Vision-to-Stability.git.
[63] PruneHal: Reducing Hallucinations in Multi-modal Large Language Models through Adaptive KV Cache Pruning
Fengyuan Sun,Hui Chen,Xinhao Xu,Dandan Zheng,Jingdong Chen,Jun Zhou,Jungong Han,Guiguang Ding
Main category: cs.CV
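TL;DR: PruneHal reduces hallucinations in multi-modal large language models through adaptive KV cache pruning, with no additional training and almost no added inference cost.
Details
Motivation: Existing methods mitigate hallucinations through extra training or by introducing external or internal information at inference time, which increases computational cost. PruneHal observes that hallucinations are associated with insufficient attention to visual tokens and proposes a more efficient remedy.
Contribution: Applies token pruning to hallucination mitigation in MLLMs, which the authors believe is a first, and proposes PruneHal, a training-free, model-agnostic method that strengthens the model's attention to critical visual information.
Method: Adaptive KV cache pruning removes redundant visual tokens so that the model focuses on the most informative ones, which reduces hallucinations.
Result: Across four mainstream MLLMs and several widely used hallucination benchmarks, PruneHal delivers robust and strong results.
Insight: A likely cause of hallucination is attention being dispersed over redundant visual tokens; pruning reallocates attention directly, offering a new angle on the problem.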
Abstract: While multi-modal large language models (MLLMs) have made significant progress in recent years, the issue of hallucinations remains a major challenge. To mitigate this phenomenon, existing solutions either introduce additional data for further training or incorporate external or internal information during inference. However, these approaches inevitably introduce extra computational costs. In this paper, we observe that hallucinations in MLLMs are strongly associated with insufficient attention allocated to visual tokens. In particular, the presence of redundant visual tokens disperses the model’s attention, preventing it from focusing on the most informative ones. As a result, critical visual cues are often under-attended, which in turn exacerbates the occurrence of hallucinations. Building on this observation, we propose PruneHal, a training-free, simple yet effective method that leverages adaptive KV cache pruning to enhance the model’s focus on critical visual information, thereby mitigating hallucinations. To the best of our knowledge, we are the first to apply token pruning for hallucination mitigation in MLLMs. Notably, our method does not require additional training and incurs nearly no extra inference cost. Moreover, PruneHal is model-agnostic and can be seamlessly integrated with different decoding strategies, including those specifically designed for hallucination mitigation. We evaluate PruneHal on several widely used hallucination evaluation benchmarks using four mainstream MLLMs, achieving robust and outstanding results that highlight the effectiveness and superiority of our method. Our code will be publicly available.
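The general idea of pruning visual entries from a KV cache can be sketched as follows: keep every text entry, rank visual entries by the attention they receive, and drop the least-attended ones. The abstract does not describe PruneHal's scoring rule or how its pruning adapts, so the function name `prune_visual_kv`, the fixed `keep_ratio`, and the attention scores below are hypothetical.

```python
def prune_visual_kv(kv_cache, attn_scores, is_visual, keep_ratio=0.5):
    """Keep all text entries and only the most-attended visual entries.

    kv_cache    : list of per-token cache entries (any objects)
    attn_scores : attention mass each cached token currently receives
    is_visual   : True where the cached token is a visual token
    Returns the pruned cache and the indices that were kept.
    """
    visual = [i for i, flag in enumerate(is_visual) if flag]
    n_keep = max(1, int(len(visual) * keep_ratio))
    ranked = sorted(visual, key=lambda i: attn_scores[i], reverse=True)
    kept_visual = set(ranked[:n_keep])
    keep = [i for i in range(len(kv_cache))
            if not is_visual[i] or i in kept_visual]
    return [kv_cache[i] for i in keep], keep


cache = ["t0", "v1", "v2", "v3", "v4", "t5"]
scores = [0.30, 0.05, 0.40, 0.01, 0.20, 0.04]
visual_mask = [False, True, True, True, True, False]
pruned, kept = prune_visual_kv(cache, scores, visual_mask)
```

After pruning, the softmax over the remaining keys has fewer low-value visual entries to spread mass over, which is one way to read the abstract's claim that removing redundant tokens concentrates attention on informative ones.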
[64] Video Consistency Distance: Enhancing Temporal Consistency for Image-to-Video Generation via Reward-Based Fine-Tuning
Takehiro Aoshima,Yusuke Shinohara,Park Byeongseon
Main category: cs.CV
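TL;DR: This paper proposes Video Consistency Distance (VCD), a metric used within a reward-based fine-tuning framework to improve temporal consistency in image-to-video generation, and validates it on multiple datasets.
Details
Motivation: Conventional reward functions target the overall quality and consistency of generated videos, but temporal consistency remains poor in image-to-video generation. The authors propose the VCD metric to address this.
Contribution: VCD, a new metric defined in frequency space that strengthens the temporal consistency of generated videos, with model optimization carried out through a reward-based fine-tuning framework.
Method: VCD is defined in the frequency space of video frame features and captures frame information through frequency-domain analysis. The model is optimized with a reward-based fine-tuning framework to improve temporal consistency.
Result: Models fine-tuned with VCD show significantly better temporal consistency across multiple datasets without degrading other performance.
Insight: Frequency-domain analysis may be an effective route to temporal consistency in video, and reward-based fine-tuning can optimize a model without real-world video datasets.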
Abstract: Reward-based fine-tuning of video diffusion models is an effective approach to improve the quality of generated videos, as it can fine-tune models without requiring real-world video datasets. However, it can sometimes be limited to specific performances because conventional reward functions are mainly aimed at enhancing the quality across the whole generated video sequence, such as aesthetic appeal and overall consistency. Notably, the temporal consistency of the generated video often suffers when applying previous approaches to image-to-video (I2V) generation tasks. To address this limitation, we propose Video Consistency Distance (VCD), a novel metric designed to enhance temporal consistency, and fine-tune a model with the reward-based fine-tuning framework. To achieve coherent temporal consistency relative to a conditioning image, VCD is defined in the frequency space of video frame features to capture frame information effectively through frequency-domain analysis. Experimental results across multiple I2V datasets demonstrate that fine-tuning a video generation model with VCD significantly enhances temporal consistency without degrading other performance compared to the previous method.
[65] Rethinking Driving World Model as Synthetic Data Generator for Perception Tasks
Kai Zeng,Zhanqian Wu,Kaixin Xiong,Xiaobao Wei,Xiangyu Guo,Zhenxin Zhu,Kalok Ho,Lijun Zhou,Bohan Zeng,Ming Lu,Haiyang Sun,Bing Wang,Guang Chen,Hangjun Ye,Wentao Zhang
Main category: cs.CV
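TL;DR: The paper proposes Dream4Drive, a synthetic data generation framework that improves downstream perception tasks in autonomous driving, particularly corner-case perception.
Details
Motivation: Existing driving world models focus on generation quality and controllability metrics and overlook evaluation on downstream perception tasks, which is crucial for autonomous driving performance.
Contribution: Proposes Dream4Drive, which decomposes the input video into 3D-aware guidance maps and renders 3D assets onto them to generate multi-view videos that improve perception performance; contributes the DriveObj3D dataset.
Method: Decompose the input video into 3D-aware guidance maps, render 3D assets onto them, and fine-tune the driving world model to produce multi-view photorealistic videos for training downstream models.
Result: Dream4Drive is highly flexible in generating multi-view corner cases and improves perception models under various training epochs.
Insight: Synthetic data generation should be tied to downstream task needs rather than generation quality alone; 3D assets and multi-view editing are key to effective data augmentation.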
Abstract: Recent advancements in driving world models enable controllable generation of high-quality RGB videos or multimodal videos. Existing methods primarily focus on metrics related to generation quality and controllability. However, they often overlook the evaluation of downstream perception tasks, which are really crucial for the performance of autonomous driving. Existing methods usually leverage a training strategy that first pretrains on synthetic data and finetunes on real data, resulting in twice the epochs compared to the baseline (real data only). When we double the epochs in the baseline, the benefit of synthetic data becomes negligible. To thoroughly demonstrate the benefit of synthetic data, we introduce Dream4Drive, a novel synthetic data generation framework designed for enhancing the downstream perception tasks. Dream4Drive first decomposes the input video into several 3D-aware guidance maps and subsequently renders the 3D assets onto these guidance maps. Finally, the driving world model is fine-tuned to produce the edited, multi-view photorealistic videos, which can be used to train the downstream perception models. Dream4Drive enables unprecedented flexibility in generating multi-view corner cases at scale, significantly boosting corner case perception in autonomous driving. To facilitate future research, we also contribute a large-scale 3D asset dataset named DriveObj3D, covering the typical categories in driving scenarios and enabling diverse 3D-aware video editing. We conduct comprehensive experiments to show that Dream4Drive can effectively boost the performance of downstream perception models under various training epochs. Project: https://wm-research.github.io/Dream4Drive/
[66] MoE-GS: Mixture of Experts for Dynamic Gaussian Splatting
In-Hwan Jin,Hyeongju Mun,Joonsoo Kim,Kugjin Yun,Kyeongbo Kong
Main category: cs.CV
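TL;DR: MoE-GS is a framework for dynamic Gaussian splatting that uses a Mixture of Experts (MoE) to improve the quality and consistency of dynamic scene reconstruction, with efficient rendering and distillation strategies to address the computational overhead.
Details
Motivation: Existing dynamic Gaussian splatting methods perform inconsistently across scenes and lack generality. MoE-GS aims to combine the strengths of several experts through a mixture-of-experts model to handle the diverse challenges of dynamic scenes.
Contribution: 1. The first dynamic Gaussian splatting framework to incorporate a mixture-of-experts model (MoE-GS)
2. A Volume-aware Pixel Router that adaptively blends expert outputs
3. Efficient rendering and distillation strategies that balance performance against computational overhead
Method: 1. Combines multiple experts in a mixture-of-experts model, with the Volume-aware Pixel Router weighting their outputs dynamically
2. Introduces single-pass multi-expert rendering and gate-aware Gaussian pruning to improve efficiency
3. Uses distillation to transfer MoE-GS performance to individual experts
Result: On the N3V and Technicolor datasets, MoE-GS consistently outperforms existing methods with improved efficiency.
Insight: 1. A mixture-of-experts model can improve the quality of dynamic scene reconstruction
2. Efficient rendering and distillation are key to balancing model capacity and real-time performance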
Abstract: Recent advances in dynamic scene reconstruction have significantly benefited from 3D Gaussian Splatting, yet existing methods show inconsistent performance across diverse scenes, indicating no single approach effectively handles all dynamic challenges. To overcome these limitations, we propose Mixture of Experts for Dynamic Gaussian Splatting (MoE-GS), a unified framework integrating multiple specialized experts via a novel Volume-aware Pixel Router. Our router adaptively blends expert outputs by projecting volumetric Gaussian-level weights into pixel space through differentiable weight splatting, ensuring spatially and temporally coherent results. Although MoE-GS improves rendering quality, the increased model capacity and reduced FPS are inherent to the MoE architecture. To mitigate this, we explore two complementary directions: (1) single-pass multi-expert rendering and gate-aware Gaussian pruning, which improve efficiency within the MoE framework, and (2) a distillation strategy that transfers MoE performance to individual experts, enabling lightweight deployment without architectural changes. To the best of our knowledge, MoE-GS is the first approach incorporating Mixture-of-Experts techniques into dynamic Gaussian splatting. Extensive experiments on the N3V and Technicolor datasets demonstrate that MoE-GS consistently outperforms state-of-the-art methods with improved efficiency. Video demonstrations are available at https://anonymous.4open.science/w/MoE-GS-68BA/.
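The blending step can be pictured per pixel as a softmax-weighted mix of the colors rendered by each expert. In MoE-GS the weights come from splatting Gaussian-level weights into pixel space; the sketch below skips that step and takes per-pixel gate logits as given, so the function name `blend_pixel` and its inputs are assumptions for illustration.

```python
import math


def blend_pixel(expert_colors, gate_logits):
    """Softmax-weighted mix of per-expert RGB predictions for one pixel.

    expert_colors : list of (r, g, b) tuples, one per expert
    gate_logits   : one routing score per expert for this pixel
    """
    peak = max(gate_logits)  # subtract the max for numerical stability
    exps = [math.exp(g - peak) for g in gate_logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    return tuple(
        sum(w * color[k] for w, color in zip(weights, expert_colors))
        for k in range(3)
    )


# Two experts disagree; equal logits give an even blend.
even = blend_pixel([(1.0, 0.0, 0.0), (0.0, 0.0, 1.0)], [0.0, 0.0])
# A strongly preferred expert dominates the result.
routed = blend_pixel([(1.0, 0.0, 0.0), (0.0, 0.0, 1.0)], [10.0, 0.0])
```

Because the weights vary per pixel, different experts can dominate in different regions of the same frame, which matches the motivation that no single method handles every dynamic scene well.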
[67] SFGFusion: Surface Fitting Guided 3D Object Detection with 4D Radar and Camera Fusion
Xiaozhi Li,Huijun Di,Jian Li,Feng Liu,Wei Liang
Main category: cs.CV
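TL;DR: SFGFusion is a 3D object detection method that fuses camera and 4D imaging radar data, using surface fitting to enhance spatial representation and cross-modal interaction and thereby improve depth prediction and point cloud density.
Details
Motivation: 4D imaging radar offers low cost, long-range detection, and accurate velocity measurement, but its sparse point clouds and low resolution limit object geometric representation and hinder multi-modal fusion.
Contribution: Introduces a surface fitting model that estimates quadratic surface parameters to enhance spatial representation and cross-modal interaction, and generates a dense pseudo-point cloud to compensate for radar sparsity.
Method: The predicted depth guides the PV-to-BEV transformation in an image branch and generates a dense pseudo-point cloud in a surface pseudo-point branch. The pseudo-point and radar branches use pillar-based encoding, and features are fused in BEV space.
Result: Achieves superior performance on the TJ4DRadSet and VoD benchmarks, effectively fusing camera and 4D radar features.
Insight: Surface fitting gives multi-modal fusion an explicit geometric constraint, improving the use of sparse point cloud data and detection accuracy.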
Abstract: 3D object detection is essential for autonomous driving. As an emerging sensor, 4D imaging radar offers advantages such as low cost, long-range detection, and accurate velocity measurement, making it highly suitable for object detection. However, its sparse point clouds and low resolution limit object geometric representation and hinder multi-modal fusion. In this study, we introduce SFGFusion, a novel camera-4D imaging radar detection network guided by surface fitting. By estimating quadratic surface parameters of objects from image and radar data, the explicit surface fitting model enhances spatial representation and cross-modal interaction, enabling more reliable prediction of fine-grained dense depth. The predicted depth serves two purposes: 1) in an image branch to guide the transformation of image features from perspective view (PV) to a unified bird’s-eye view (BEV) for multi-modal fusion, improving spatial mapping accuracy; and 2) in a surface pseudo-point branch to generate a dense pseudo-point cloud, mitigating the radar point sparsity. The original radar point cloud is also encoded in a separate radar branch. These two point cloud branches adopt a pillar-based method and subsequently transform the features into the BEV space. Finally, a standard 2D backbone and detection head are used to predict object labels and bounding boxes from BEV features. Experimental results show that SFGFusion effectively fuses camera and 4D radar features, achieving superior performance on the TJ4DRadSet and View-of-Delft (VoD) object detection benchmarks.
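The notion of estimating quadratic surface parameters from sparse points, then querying the surface for dense depth, can be sketched with an ordinary least-squares fit. In SFGFusion the parameters are estimated from image and radar data by the network; the closed-form fit, the surface parameterization z = a*x^2 + b*y^2 + c*x*y + d*x + e*y + f, and the function names below are assumptions for illustration.

```python
def fit_quadric(points):
    """Least-squares fit of z = a*x^2 + b*y^2 + c*x*y + d*x + e*y + f.

    `points` is a list of (x, y, z) samples (at least six, not degenerate).
    Returns [a, b, c, d, e, f].
    """
    rows = [[x * x, y * y, x * y, x, y, 1.0] for x, y, _ in points]
    zs = [z for _, _, z in points]
    n = 6
    # Normal equations (A^T A) w = A^T z, stored as an augmented matrix.
    aug = [
        [sum(r[i] * r[j] for r in rows) for j in range(n)]
        + [sum(r[i] * z for r, z in zip(rows, zs))]
        for i in range(n)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda k: abs(aug[k][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def quadric_depth(coef, x, y):
    """Evaluate the fitted surface at (x, y) to get a dense depth value."""
    a, b, c, d, e, f = coef
    return a * x * x + b * y * y + c * x * y + d * x + e * y + f


# Sparse samples from a known surface stand in for radar returns.
sparse = [(x, y, 2 * x * x + y * y - x * y + 0.5 * x - y + 3)
          for x in (-1, 0, 1, 2) for y in (-1, 0, 1, 2)]
coef = fit_quadric(sparse)
dense_value = quadric_depth(coef, 0.5, 0.5)
```

Once the six parameters are known, depth can be evaluated at any pixel inside the object, which is the sense in which a fitted surface turns a handful of radar points into dense pseudo-points.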
[68] Advances in 4D Representation: Geometry, Motion, and Interaction
Mingrui Zhao,Sauradip Nag,Kai Wang,Aditya Vora,Guangda Ji,Peter Chun,Ali Mahdavi-Amiri,Hao Zhang
Main category: cs.CV
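TL;DR: This survey focuses on 4D representations across geometry, motion, and interaction, emphasizing how to select and customize a 4D representation for a given task, and discusses the gaps in current datasets and future directions.
Details
Motivation: 4D generation and reconstruction is a fast-evolving subfield of computer graphics, but existing surveys mostly enumerate techniques and lack a systematic analysis of properties and challenges from the representation perspective.
Contribution: Organizes 4D representations around three pillars (geometry, motion, and interaction), analyzes the strengths and weaknesses of representative works, and offers task-driven guidance for selecting and customizing representations.
Method: Takes a selective survey approach that focuses on representative works (e.g., NeRF, 3DGS), assesses their properties and challenges under different computation, application, and data scenarios, and discusses the role of LLMs and VFMs.
Result: Summarizes both mainstream and under-explored 4D representation techniques, notes how gaps in available datasets constrain the field, and suggests directions for improvement.
Insight: The choice of 4D representation should follow the needs of the task; LLMs and VFMs hold promise for 4D applications but have limitations to resolve; richer datasets are key to advancing the field.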
Abstract: We present a survey on 4D generation and reconstruction, a fast-evolving subfield of computer graphics whose developments have been propelled by recent advances in neural fields, geometric and motion deep learning, as well as 3D generative artificial intelligence (GenAI). While our survey is not the first of its kind, we build our coverage of the domain from a unique and distinctive perspective of 4D representations, to model 3D geometry evolving over time while exhibiting motion and interaction. Specifically, instead of offering an exhaustive enumeration of many works, we take a more selective approach by focusing on representative works to highlight both the desirable properties and ensuing challenges of each representation under different computation, application, and data scenarios. The main take-away message we aim to convey to the readers is on how to select and then customize the appropriate 4D representations for their tasks. Organizationally, we separate the 4D representations based on three key pillars: geometry, motion, and interaction. Our discourse will not only encompass the most popular representations of today, such as neural radiance fields (NeRFs) and 3D Gaussian Splatting (3DGS), but also bring attention to relatively under-explored representations in the 4D context, such as structured models and long-range motions. Throughout our survey, we will reprise the role of large language models (LLMs) and video foundational models (VFMs) in a variety of 4D applications, while steering our discussion towards their current limitations and how they can be addressed. We also provide a dedicated coverage on what 4D datasets are currently available, as well as what is lacking, in driving the subfield forward. Project page: https://mingrui-zhao.github.io/4DRep-GMI/
[69] Vision-Based Mistake Analysis in Procedural Activities: A Review of Advances and Challenges
Konstantinos Bacharidis,Antonis A. Argyros
Main category: cs.CV
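TL;DR: This paper reviews advances and challenges in vision-based mistake analysis for procedural activities, examining how computer vision can detect and predict mistakes in structured tasks and summarizing existing datasets, evaluation methods, and state-of-the-art techniques.
Details
Motivation: Mistake analysis in procedural activities has important applications in industrial automation, physical rehabilitation, education, and human-robot collaboration. Detecting and predicting mistakes visually can improve the safety and efficiency of task execution. Contribution: The paper (1) categorizes vision-based mistake detection methods; (2) analyzes current challenges and limitations; (3) gives a comprehensive overview of existing datasets and evaluation metrics; and (4) discusses future research directions.
Method: The review focuses on computer-vision methods, including action recognition, anticipation, and activity understanding, for detecting deviations in procedure execution (e.g., incorrect sequencing, improper technique, or timing errors).
Result: The paper summarizes the performance and limitations of existing methods and identifies open problems, such as distinguishing permissible variations from true mistakes.
Insight: Directions such as neuro-symbolic reasoning and counterfactual state modeling may further improve the accuracy and applicability of mistake detection.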
Abstract: Mistake analysis in procedural activities is a critical area of research with applications spanning industrial automation, physical rehabilitation, education and human-robot collaboration. This paper reviews vision-based methods for detecting and predicting mistakes in structured tasks, focusing on procedural and executional errors. By leveraging advancements in computer vision, including action recognition, anticipation and activity understanding, vision-based systems can identify deviations in task execution, such as incorrect sequencing, use of improper techniques, or timing errors. We explore the challenges posed by intra-class variability, viewpoint differences and compositional activity structures, which complicate mistake detection. Additionally, we provide a comprehensive overview of existing datasets, evaluation metrics and state-of-the-art methods, categorizing approaches based on their use of procedural structure, supervision levels and learning strategies. Open challenges, such as distinguishing permissible variations from true mistakes and modeling error propagation are discussed alongside future directions, including neuro-symbolic reasoning and counterfactual state modeling. This work aims to establish a unified perspective on vision-based mistake analysis in procedural activities, highlighting its potential to enhance safety, efficiency and task performance across diverse domains.
[70] Unified Reinforcement and Imitation Learning for Vision-Language Models
Byung-Kwan Lee,Ryo Hachiuma,Yong Man Ro,Yu-Chiang Frank Wang,Yueh-Hua Wu
Main category: cs.CV
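TL;DR: The paper proposes Unified Reinforcement and Imitation Learning (RIL), a training algorithm for efficient, lightweight vision-language models (VLMs). It combines the strengths of both paradigms so that small models can mimic large ones while improving their own generation ability, reaching performance close to or beyond leading closed-source VLMs.
Details
Motivation: VLMs are large and hard to deploy in resource-constrained environments, so an efficient method for training lightweight yet high-performing VLMs is needed. Contribution: The paper proposes RIL, which unifies reinforcement learning and imitation learning and, through adversarial imitation learning and guidance from diverse teachers, substantially improves the generative ability of student models.
Method: RIL combines reinforcement learning with adversarial imitation learning, uses an LLM-based discriminator to distinguish student outputs from teacher outputs, and draws on multiple large teacher models for diverse learning signals.
Result: Experiments show that RIL narrows the performance gap with state-of-the-art open- and closed-source VLMs and surpasses them in several instances.
Insight: By unifying reinforcement and imitation learning, a lightweight model can both mimic large models and learn on its own from reinforcement signals, which improves its adaptability.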
Abstract: Vision-Language Models (VLMs) have achieved remarkable progress, yet their large scale often renders them impractical for resource-constrained environments. This paper introduces Unified Reinforcement and Imitation Learning (RIL), a novel and efficient training algorithm designed to create powerful, lightweight VLMs. RIL distinctively combines the strengths of reinforcement learning with adversarial imitation learning. This enables smaller student VLMs not only to mimic the sophisticated text generation of large teacher models but also to systematically improve their generative capabilities through reinforcement signals. Key to our imitation framework is an LLM-based discriminator that adeptly distinguishes between student and teacher outputs, complemented by guidance from multiple large teacher VLMs to ensure diverse learning. This unified learning strategy, leveraging both reinforcement and imitation, empowers student models to achieve significant performance gains, making them competitive with leading closed-source VLMs. Extensive experiments on diverse vision-language benchmarks demonstrate that RIL significantly narrows the performance gap with state-of-the-art open- and closed-source VLMs and, in several instances, surpasses them.
[71] A Training-Free Framework for Open-Vocabulary Image Segmentation and Recognition with EfficientNet and CLIP
Ying Dai,Wei Yu Chen
Main category: cs.CV
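TL;DR: The paper proposes a training-free framework for open-vocabulary image segmentation and recognition. It combines unsupervised segmentation with EfficientNetB0 and open-vocabulary object recognition with CLIP in a two-stage pipeline that achieves efficient and flexible cross-modal alignment.
Details
Motivation: Open-vocabulary vision tasks such as segmentation and recognition must handle unseen categories, and conventional methods depend on large amounts of annotated data and training. The goal is a training-free framework that exploits pretrained models to reduce the dependence on annotation. Contribution: 1. A training-free two-stage framework that combines unsupervised segmentation (EfficientNetB0) with cross-modal alignment (CLIP); 2. Adaptive determination of segmented regions through singular value decomposition (SVD) and hierarchical clustering; 3. Category-specific prompts plus a generic prompt to support open-set recognition.
Method: 1. Extract pixel-wise features with EfficientNetB0 and obtain an unsupervised segmentation through SVD and hierarchical clustering; 2. Encode the segmented regions as CLIP image embeddings and align them with precomputed text embeddings of the category prompts; 3. Recognize each region from the computed similarities.
Result: The method reaches state-of-the-art performance on COCO, ADE20K, and PASCAL VOC in terms of Hungarian mIoU, precision, recall, and F1-score, demonstrating the effectiveness and generalizability of the framework.
Insight: 1. Combining pretrained models can substantially reduce the dependence on annotated data; 2. SVD and hierarchical clustering offer a flexible route to unsupervised segmentation; 3. Cross-modal alignment is key to open-set recognition.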
Abstract: This paper presents a novel training-free framework for open-vocabulary image segmentation and object recognition (OVSR), which leverages EfficientNetB0, a convolutional neural network, for unsupervised segmentation and CLIP, a vision-language model, for open-vocabulary object recognition. The proposed framework adopts a two stage pipeline: unsupervised image segmentation followed by segment-level recognition via vision-language alignment. In the first stage, pixel-wise features extracted from EfficientNetB0 are decomposed using singular value decomposition to obtain latent representations, which are then clustered using hierarchical clustering to segment semantically meaningful regions. The number of clusters is adaptively determined by the distribution of singular values. In the second stage, the segmented regions are localized and encoded into image embeddings using the Vision Transformer backbone of CLIP. Text embeddings are precomputed using CLIP’s text encoder from category-specific prompts, including a generic something else prompt to support open set recognition. The image and text embeddings are concatenated and projected into a shared latent feature space via SVD to enhance cross-modal alignment. Recognition is performed by computing the softmax over the similarities between the projected image and text embeddings. The proposed method is evaluated on standard benchmarks, including COCO, ADE20K, and PASCAL VOC, achieving state-of-the-art performance in terms of Hungarian mIoU, precision, recall, and F1-score. These results demonstrate the effectiveness, flexibility, and generalizability of the proposed framework.
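The recognition step described above (a softmax over similarities between a segment embedding and the prompt embeddings, with a generic fallback prompt) can be sketched as follows. This is a simplified illustration that omits the SVD projection into a shared latent space; the cosine similarity, the `temperature` value, and the function names are assumptions rather than details from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def recognize(segment_emb, text_embs, labels, temperature=0.01):
    """Return (best_label, probabilities) for one segment embedding.

    text_embs holds one embedding per prompt; the last label is expected to
    be the generic fallback (e.g. "something else"). The temperature is a
    hypothetical scaling factor.
    """
    logits = [cosine(segment_emb, t) / temperature for t in text_embs]
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(labels)), key=probs.__getitem__)
    return labels[best], probs
```

A segment whose embedding is closest to the fallback prompt is labeled with the fallback, which is how an open-set "none of the listed categories" outcome arises in this sketch.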
[72] DaMo: Data Mixing Optimizer in Fine-tuning Multimodal LLMs for Mobile Phone Agents
Kai Shi,Jun Yang,Ni Yang,Binqiang Pan,Qingsong Xie,Chao Zhang,Zhenyu Yang,Tianhuang Su,Haonan Lu
Main category: cs.CV
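TL;DR: The paper proposes DaMo, a data mixing optimizer that tunes the composition of training data when fine-tuning multimodal large language models (MLLMs), improving the multitask ability of mobile phone agents (MPAs).
Details
Motivation: Existing methods struggle to determine the optimal data composition for multitask supervised fine-tuning (SFT), which limits MLLM performance in multitask scenarios. Contribution: 1. DaMo, a trainable network that predicts downstream task performance in order to optimize data mixing ratios. 2. PhoneAgentBench, the first benchmark dedicated to evaluating MLLMs on multimodal mobile phone tasks.
Method: DaMo learns from small-scale pilot experiments to predict the performance of a given data mixing ratio (R^2=0.81) and extrapolates to the optimal configuration.
Result: DaMo improves performance by 3.38% on PhoneAgentBench, by 2.57% on average across established benchmarks (such as BFCL-v3), and by 12.47% when optimizing for the BFCL-v3 task alone.
Insight: DaMo scales well, remains effective on other model architectures, and substantially improves multitask learning.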
Abstract: Mobile Phone Agents (MPAs) have emerged as a promising research direction due to their broad applicability across diverse scenarios. While Multimodal Large Language Models (MLLMs) serve as the foundation for MPAs, their effectiveness in handling multiple mobile phone tasks simultaneously remains limited. Although multitask supervised fine-tuning (SFT) is widely adopted for multitask learning, existing approaches struggle to determine optimal training data compositions for peak performance. To address this challenge, we propose DaMo (Data Mixture Optimizer), a novel solution employing a trainable network that predicts optimal data mixtures by forecasting downstream task performance for any given dataset ratio. To support comprehensive evaluation, we introduce PhoneAgentBench, the first specialized benchmark to evaluate MLLMs on multimodal mobile phone tasks, comprising 1235 QA pairs spanning diverse real-world industrial mobile application scenarios. Demonstrating strong predictive capability (R^2=0.81) in small-scale pilot experiments, DaMo efficiently extrapolates optimal data mixing configurations. Our results show DaMo achieves a 3.38% performance improvement on PhoneAgentBench compared to alternative methods. Furthermore, extensive experiments across established benchmarks including BFCL-v3, MME-Reasoning, MME-Perception, and OCRBench reveal DaMo's superior generalization, outperforming other approaches by 2.57% in terms of average score. When used solely for MLLM optimization on the BFCL-v3 task, DaMo improves the metrics by 12.47% over other methods. Notably, DaMo maintains robust scalability, preserving its effectiveness when applied to other model architectures. The code and dataset are available at https://github.com/OPPO-Mente-Lab/DaMo.git
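For readers unfamiliar with the R^2 figure quoted above, the sketch below shows how the coefficient of determination is computed and how a performance predictor could be used to pick a mixture. It is an illustration only: `best_mixture` and the `predictor` callable are hypothetical stand-ins, not DaMo's trainable network or search procedure.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def best_mixture(candidates, predictor):
    """Return the candidate mixing ratio with the highest predicted score.

    candidates: iterable of ratio tuples that each sum to 1 (assumed).
    predictor:  hypothetical callable mapping a ratio tuple to a predicted
                downstream score.
    """
    return max(candidates, key=predictor)
```

An R^2 of 1 means the predicted scores match the measured ones exactly, while 0 means the predictor does no better than always guessing the mean score.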
[73] Seeing Across Views: Benchmarking Spatial Reasoning of Vision-Language Models in Robotic Scenes
Zhiyuan Feng,Zhaolu Kang,Qijie Wang,Zhiying Du,Jiongrui Yan,Shubin Shi,Chengbo Yuan,Huizhi Liang,Yu Deng,Qixiu Li,Rushuai Yang,Arctanx An,Leqi Zheng,Weijie Wang,Shawn Chen,Sicheng Xu,Yaobo Liang,Jiaolong Yang,Baining Guo
Main category: cs.CV
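TL;DR: MV-RoboBench is a new benchmark for evaluating the multi-view spatial reasoning of vision-language models (VLMs) in robotic scenes; the results show that current models perform far below human level on multi-view tasks.
Details
Motivation: Existing VLM evaluations focus on single-view tasks, whereas robotic scenes usually need multi-view information to resolve occlusion and depth ambiguity, so the multi-view reasoning ability of VLMs needs to be evaluated. Contribution: The paper introduces the MV-RoboBench benchmark, which contains 1.7k manually curated QA items covering two categories of tasks (spatial understanding and robotic execution), and releases the data together with a standardized evaluation protocol.
Method: The benchmark is built from multi-view spatial reasoning tasks and is used to evaluate open-source and closed-source VLMs, along with enhanced versions that use CoT-inspired techniques, against human performance.
Result: Current VLMs perform far worse than humans on multi-view tasks, and spatial intelligence and robotic task execution are positively correlated in multi-view scenarios.
Insight: Performance on existing single-view benchmarks does not transfer directly to multi-view robotic tasks, which underscores the importance of multi-view reasoning.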
Abstract: Vision-language models (VLMs) are essential to Embodied AI, enabling robots to perceive, reason, and act in complex environments. They also serve as the foundation for the recent Vision-Language-Action (VLA) models. Yet most evaluations of VLMs focus on single-view settings, leaving their ability to integrate multi-view information underexplored. At the same time, multi-camera setups are increasingly standard in robotic platforms, as they provide complementary perspectives to mitigate occlusion and depth ambiguity. Whether VLMs can effectively leverage such multi-view inputs for robotic reasoning therefore remains an open question. To bridge this gap, we introduce MV-RoboBench, a benchmark specifically designed to evaluate the multi-view spatial reasoning capabilities of VLMs in robotic manipulation. MV-RoboBench consists of 1.7k manually curated QA items across eight subtasks, divided into two primary categories: spatial understanding and robotic execution. We evaluate a diverse set of existing VLMs, including both open-source and closed-source models, along with enhanced versions incorporating CoT-inspired techniques. The results show that state-of-the-art models remain far below human performance, underscoring the substantial challenges VLMs face in multi-view robotic perception. Additionally, our analysis uncovers two key findings: (i) spatial intelligence and robotic task execution are positively correlated in multi-view robotic scenarios; and (ii) strong performance on existing general-purpose single-view spatial understanding benchmarks does not reliably translate to success in the robotic spatial tasks assessed by our benchmark. We release MV-RoboBench as an open resource to foster progress in spatially grounded VLMs and VLAs, providing not only data but also a standardized evaluation protocol for multi-view embodied reasoning.
[74] Multi-Camera Worker Tracking in Logistics Warehouse Considering Wide-Angle Distortion
Yuki Mori,Kazuma Kano,Yusuke Asai,Shin Katayama,Kenta Urano,Takuro Yonezawa,Nobuo Kawaguchi
Main category: cs.CV
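TL;DR: The paper proposes a method for tracking workers in a logistics warehouse with 19 wide-angle cameras; aligning coordinates by foot position reduces the effect of image distortion and improves tracking accuracy by more than 20%.
Details
Motivation: With the spread of e-commerce, the need to improve the efficiency of logistics warehouses is growing. Digital twins require accurate tracking of worker positions, but a single camera has a limited field of view and wide-angle cameras introduce image distortion. Contribution: The paper proposes a multi-camera coordinate alignment method based on foot positions that reduces the vertical distortion at the edges of wide-angle images and substantially improves worker tracking accuracy.
Method: Nineteen wide-angle cameras on the ceiling look down at the warehouse floor, and the relationship between camera coordinates and actual positions is established by alignment on the floor surface. Detected worker positions are then aligned by foot position to reduce distortion.
Result: Experiments show that the method improves worker tracking accuracy by more than 20%; a comparison of several ways of using appearance features further validates its effectiveness.
Insight: Distortion in wide-angle images can be mitigated by aligning on a local feature such as the foot position, which matters for tracking in multi-camera systems.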
Abstract: With the spread of e-commerce, the logistics market is growing around the world. Therefore, improving the efficiency of warehouse operations is essential. To achieve this, various approaches have been explored, and among them, the use of digital twins is gaining attention. To make this approach possible, it is necessary to accurately collect the positions of workers in a warehouse and reflect them in a virtual space. However, a single camera has limitations in its field of view; therefore, sensing with multiple cameras is necessary. In this study, we explored a method to track workers using 19 wide-angle cameras installed on the ceiling, looking down at the floor of the logistics warehouse. To understand the relationship between the camera coordinates and the actual positions in the warehouse, we performed alignment based on the floor surface. However, due to the characteristics of wide-angle cameras, significant distortion occurs at the edges of the image, particularly in the vertical direction. To address this, the detected worker positions from each camera were aligned based on foot positions, reducing the effects of image distortion and enabling accurate position alignment across cameras. As a result, we confirmed an improvement of over 20% in tracking accuracy. Furthermore, we compared multiple methods for utilizing appearance features and validated the effectiveness of the proposed approach.
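One common way to relate image pixels to floor positions is a planar homography, sketched below. The abstract does not state how the floor alignment is implemented, so the 3x3 matrix `H`, the externally supplied foot pixel, and the averaging across cameras are all assumptions made for illustration; lens distortion would also need to be corrected before such a mapping is applied.

```python
def floor_position(H, foot_pixel):
    """Map a foot pixel (x, y) to floor coordinates with a 3x3 homography H.

    H is a hypothetical per-camera calibration result (list of three rows).
    foot_pixel is assumed to come from a person detector or pose estimator.
    """
    x, y = foot_pixel
    u = H[0][0] * x + H[0][1] * y + H[0][2]
    v = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return (u / w, v / w)  # divide by w to leave homogeneous coordinates

def fuse_positions(positions):
    """Average floor positions of the same worker seen by several cameras."""
    n = len(positions)
    return (sum(p[0] for p in positions) / n, sum(p[1] for p in positions) / n)
```

Using the foot rather than the head or the box center is what makes a floor-plane mapping meaningful, because only the foot actually lies on the calibrated plane.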
[75] Reasoning Like Experts: Leveraging Multimodal Large Language Models for Drawing-based Psychoanalysis
Xueqi Ma,Yanbei Jiang,Sarah Erfani,James Bailey,Weifeng Liu,Krista A. Ehinger,Jey Han Lau
Main category: cs.CV
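TL;DR: The paper proposes PICK, a multi-step framework that uses multimodal large language models (MLLMs) for drawing-based psychological analysis, with a focus on the House-Tree-Person (HTP) test; hierarchical analysis and knowledge injection improve the accuracy of the analysis.
Details
Motivation: MLLMs perform well on multimodal perception tasks but are rarely applied to subjective, emotionally nuanced domains such as psychological analysis. PICK aims to fill this gap with a structured approach that improves MLLM performance in this domain. Contribution: 1. The PICK framework, which strengthens the psychological analysis ability of MLLMs through hierarchical analysis and knowledge injection. 2. An HTP knowledge base and a feature extraction module that generate a psychological profile. 3. Experiments showing that PICK significantly improves MLLM performance in psychological analysis and generalizes to other tasks.
Method: 1. Decompose the drawing into semantically meaningful sub-drawings and build a hierarchical representation with single-object, multi-object, and whole levels. 2. Extract psychological or emotional information from the visual cues at each level. 3. Introduce an HTP knowledge base and a feature extraction module trained with reinforcement learning to generate a psychological profile. 4. Integrate the multi-level information into an expert-level psychological assessment.
Result: Experiments show that PICK significantly improves the ability of MLLMs in psychological analysis, and extensions to emotion understanding tasks confirm its generality.
Insight: With a structured approach and knowledge injection, PICK successfully applies MLLMs to a subjective domain and shows their potential in specialized fields such as psychology.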
Abstract: Multimodal Large Language Models (MLLMs) have demonstrated exceptional performance across various objective multimodal perception tasks, yet their application to subjective, emotionally nuanced domains, such as psychological analysis, remains largely unexplored. In this paper, we introduce PICK, a multi-step framework designed for Psychoanalytical Image Comprehension through hierarchical analysis and Knowledge injection with MLLMs, specifically focusing on the House-Tree-Person (HTP) Test, a widely used psychological assessment in clinical practice. First, we decompose drawings containing multiple instances into semantically meaningful sub-drawings, constructing a hierarchical representation that captures spatial structure and content across three levels: single-object level, multi-object level, and whole level. Next, we analyze these sub-drawings at each level with a targeted focus, extracting psychological or emotional insights from their visual cues. We also introduce an HTP knowledge base and design a feature extraction module, trained with reinforcement learning, to generate a psychological profile for single-object level analysis. This profile captures both holistic stylistic features and dynamic object-specific features (such as those of the house, tree, or person), correlating them with psychological states. Finally, we integrate this multi-faceted information to produce a well-informed assessment that aligns with expert-level reasoning. Our approach bridges the gap between MLLMs and specialized expert domains, offering a structured and interpretable framework for understanding human mental states through visual expression. Experimental results demonstrate that the proposed PICK significantly enhances the capability of MLLMs in psychological analysis. It is further validated as a general framework through extensions to emotion understanding tasks.
[76] PRGCN: A Graph Memory Network for Cross-Sequence Pattern Reuse in 3D Human Pose Estimation
Zhuoyang Xie,Yibo Zhao,Hui Huang,Riwei Wang,Zan Gao
Main category: cs.CV
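TL;DR: PRGCN is a graph memory network that addresses depth ambiguity in 3D human pose estimation through cross-sequence pattern reuse; combining a graph memory bank with a dual-stream hybrid architecture, it sets a new state of the art.
Details
Motivation: Existing video-based methods process each sequence in isolation and fail to exploit the structural regularities and repetitive motion patterns that recur across sequences. Contribution: The paper proposes the PRGCN framework, which casts pose estimation as pattern retrieval and adaptation and uses a graph memory bank that stores pose prototypes and retrieves them dynamically through attention.
Method: A graph memory bank is combined with a dual-stream hybrid architecture (Mamba-based state-space models plus self-attention); retrieved pose prototypes are fused with anatomical constraints.
Result: PRGCN achieves an MPJPE of 37.1mm on Human3.6M and 13.4mm on MPI-INF-3DHP, the best reported results on both benchmarks.
Insight: Cross-sequence pattern reuse is a key mechanism for advancing the field, which should shift from per-sequence optimization towards cumulative knowledge learning.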
Abstract: Monocular 3D human pose estimation remains a fundamentally ill-posed inverse problem due to the inherent depth ambiguity in 2D-to-3D lifting. While contemporary video-based methods leverage temporal context to enhance spatial reasoning, they operate under a critical paradigm limitation: processing each sequence in isolation, thereby failing to exploit the strong structural regularities and repetitive motion patterns that pervade human movement across sequences. This work introduces the Pattern Reuse Graph Convolutional Network (PRGCN), a novel framework that formalizes pose estimation as a problem of pattern retrieval and adaptation. At its core, PRGCN features a graph memory bank that learns and stores a compact set of pose prototypes, encoded as relational graphs, which are dynamically retrieved via an attention mechanism to provide structured priors. These priors are adaptively fused with hard-coded anatomical constraints through a memory-driven graph convolution, ensuring geometrical plausibility. To underpin this retrieval process with robust spatiotemporal features, we design a dual-stream hybrid architecture that synergistically combines the linear-complexity, local temporal modeling of Mamba-based state-space models with the global relational capacity of self-attention. Extensive evaluations on Human3.6M and MPI-INF-3DHP benchmarks demonstrate that PRGCN establishes a new state-of-the-art, achieving an MPJPE of 37.1mm and 13.4mm, respectively, while exhibiting enhanced cross-domain generalization capability. Our work posits that the long-overlooked mechanism of cross-sequence pattern reuse is pivotal to advancing the field, shifting the paradigm from per-sequence optimization towards cumulative knowledge learning.
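The MPJPE metric reported above (mean per-joint position error) is the average Euclidean distance between predicted and ground-truth joints. The sketch below computes it for a single pose; the function name and the tuple-based input layout are illustrative choices, and benchmark protocols typically add steps such as root alignment that are not shown here.

```python
import math

def mpjpe(pred_joints, gt_joints):
    """Mean per-joint position error for one pose.

    pred_joints, gt_joints: equal-length sequences of (x, y, z) tuples in the
    same unit (millimetres in the benchmarks cited above).
    """
    if len(pred_joints) != len(gt_joints):
        raise ValueError("pose arrays must have the same number of joints")
    distances = [math.dist(p, g) for p, g in zip(pred_joints, gt_joints)]
    return sum(distances) / len(distances)
```

For a whole dataset, the per-pose values are averaged over all frames.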
[77] Towards Single-Source Domain Generalized Object Detection via Causal Visual Prompts
Chen Li,Huiying Xu,Changxin Gao,Zeyu Wang,Yun Liu,Xinzhong Zhu
Main category: cs.CV
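TL;DR: The paper proposes Cauvis (Causal Visual Prompts), a method that tackles spurious correlations in single-source domain generalized object detection (SDGOD) and significantly improves generalization to unseen target domains.
Details
Motivation: Current SDGOD methods are limited by their reliance on data augmentation and tend to fall into spurious correlations, so that models over-rely on superficial features (such as color) rather than essential ones (such as object contours). Contribution: 1. A Cross-Attention Prompts module that reduces the bias from spurious features by combining visual prompts with cross-attention; 2. A dual-branch adapter that disentangles causal features from spurious ones while achieving domain adaptation through high-frequency feature extraction.
Method: Cauvis combines cross-attention visual prompts, which reduce the reliance on spurious features, with a dual-branch adapter that performs disentanglement and domain adaptation via high-frequency feature extraction.
Result: On SDGOD datasets, Cauvis improves on existing domain generalization methods by 15.9-31.4% and is more robust in environments with complex interference.
Insight: Causal visual prompts and high-frequency feature extraction can effectively disentangle spurious features from essential ones and thereby improve domain generalization.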
Abstract: Single-source Domain Generalized Object Detection (SDGOD), as a cutting-edge research topic in computer vision, aims to enhance model generalization capability in unseen target domains through single-source domain training. Current mainstream approaches attempt to mitigate domain discrepancies via data augmentation techniques. However, due to domain shift and limited domain-specific knowledge, models tend to fall into the pitfall of spurious correlations. This manifests as the model’s over-reliance on simplistic classification features (e.g., color) rather than essential domain-invariant representations like object contours. To address this critical challenge, we propose the Cauvis (Causal Visual Prompts) method. First, we introduce a Cross-Attention Prompts module that mitigates bias from spurious features by integrating visual prompts with cross-attention. To address the inadequate domain knowledge coverage and spurious feature entanglement in visual prompts for single-domain generalization, we propose a dual-branch adapter that disentangles causal-spurious features while achieving domain adaptation via high-frequency feature extraction. Cauvis achieves state-of-the-art performance with 15.9-31.4% gains over existing domain generalization methods on SDGOD datasets, while exhibiting significant robustness advantages in complex interference environments.
[78] CARES: Context-Aware Resolution Selector for VLMs
Moshe Kimhi,Nimrod Shabtay,Raja Giryes,Chaim Baskin,Eli Schwartz
Main category: cs.CV
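TL;DR: CARES is a lightweight preprocessing module that predicts the minimal sufficient resolution for an image-query pair, reducing the compute cost of large vision-language models while preserving task performance.
Details
Motivation: Large vision-language models usually process images at native or high resolution, which substantially increases compute and latency even when a low-resolution image would suffice. CARES addresses this problem. Contribution: The core contribution is a context-aware resolution selector that predicts the minimal sufficient input resolution for each input and cuts compute by up to 80%.
Method: CARES uses a compact vision-language model (350M parameters) to extract features and predict the resolution at which the target model's response converges; it supports interpolation to continuous resolutions for fine-grained control.
Result: Across five multimodal benchmarks, CARES preserves task performance while substantially reducing compute.
Insight: Dynamic resolution selection can greatly improve compute efficiency without hurting model performance, which is especially useful in resource-constrained settings.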
Abstract: Large vision-language models (VLMs) commonly process images at native or high resolution to remain effective across tasks. This inflates visual tokens, often to 97-99% of total tokens, resulting in high compute and latency, even when low-resolution images would suffice. We introduce CARES, a Context-Aware Resolution Selector: a lightweight preprocessing module that, given an image-query pair, predicts the minimal sufficient input resolution. CARES uses a compact VLM (350M) to extract features and predict when a target pretrained VLM's response converges to its peak ability to answer correctly. Though trained as a discrete classifier over a set of optional resolutions, CARES interpolates continuous resolutions at inference for fine-grained control. Across five multimodal benchmarks spanning documents and natural images, as well as diverse target VLMs, CARES preserves task performance while reducing compute by up to 80%.
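One plausible way to turn a discrete classifier into a continuous resolution choice is a probability-weighted average over the candidate resolutions. The abstract does not specify the interpolation rule, so the sketch below is an assumption, as are the function names and the patch size used for snapping.

```python
def expected_resolution(probs, resolutions):
    """Probability-weighted average of candidate resolutions.

    probs:       classifier scores over the discrete resolution set
                 (normalized here, so they need not sum to 1).
    resolutions: the candidate side lengths in pixels.
    """
    total = sum(probs)
    return sum(p * r for p, r in zip(probs, resolutions)) / total

def snap_to_patch(resolution, patch=14):
    """Round to the nearest multiple of a hypothetical ViT patch size."""
    return max(patch, int(round(resolution / patch)) * patch)
```

With this rule, a classifier that is split evenly between two neighboring candidates yields a resolution halfway between them instead of forcing a choice.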
[79] PoseCrafter: Extreme Pose Estimation with Hybrid Video Synthesis
Qing Mao,Tianxin Huang,Yu Zhu,Jinqiu Sun,Yanning Zhang,Gim Hee Lee
Main category: cs.CV
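TL;DR: PoseCrafter proposes Hybrid Video Generation (HVG), which couples a video interpolation model with a pose-conditioned novel view synthesis model to produce clear intermediate frames, and a Feature Matching Selector (FMS) that picks frames suited to pose estimation.
Details
Motivation: Existing pose estimation methods for sparsely overlapping image pairs perform poorly when the overlap is small or absent; synthesized intermediate frames are blurry and the frame selection strategies are inefficient. Contribution: 1. Hybrid Video Generation (HVG), which combines two models to generate clear intermediate frames; 2. A Feature Matching Selector (FMS) that improves the frame selection strategy.
Method: HVG couples a video interpolation model with a pose-conditioned novel view synthesis model; FMS selects frames suitable for pose estimation based on feature correspondence.
Result: PoseCrafter clearly improves pose estimation performance on multiple datasets, especially for pairs with small or no overlap.
Insight: Combining the strengths of generative models with a selection strategy tailored to the task can address the challenges of extreme pose estimation.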
Abstract: Pairwise camera pose estimation from sparsely overlapping image pairs remains a critical and unsolved challenge in 3D vision. Most existing methods struggle with image pairs that have small or no overlap. Recent approaches attempt to address this by synthesizing intermediate frames using video interpolation and selecting key frames via a self-consistency score. However, the generated frames are often blurry due to small overlap inputs, and the selection strategies are slow and not explicitly aligned with pose estimation. To address these issues, we propose Hybrid Video Generation (HVG) to synthesize clearer intermediate frames by coupling a video interpolation model with a pose-conditioned novel view synthesis model. We also propose a Feature Matching Selector (FMS) based on feature correspondence to select intermediate frames appropriate for pose estimation from the synthesized results. Extensive experiments on Cambridge Landmarks, ScanNet, DL3DV-10K, and NAVI demonstrate that, compared to existing SOTA methods, PoseCrafter clearly improves pose estimation performance, especially on examples with small or no overlap.
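A selector driven by feature correspondences could rank synthesized frames by how well they match both input images. The abstract says only that FMS is based on feature correspondence, so the scoring rule below (the smaller of the two match counts) and the function name are assumptions for illustration, with the match counts taken as given from some feature matcher.

```python
def select_frames(match_counts, k=2):
    """Pick the k synthesized frames that best bridge the two input images.

    match_counts: one (matches_to_image_a, matches_to_image_b) pair per
                  synthesized frame, produced by an external feature matcher.
    Returns the chosen frame indices in temporal order.
    """
    # A frame is only useful if it matches BOTH inputs, so score it by the
    # weaker of its two match counts.
    ranked = sorted(range(len(match_counts)),
                    key=lambda i: min(match_counts[i]),
                    reverse=True)
    return sorted(ranked[:k])
```

Frames near either end of the interpolated sequence match one input strongly and the other weakly, so this rule favors frames in the middle that keep correspondences with both views.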
[80] [De|Re]constructing VLMs’ Reasoning in Counting
Simone Alghisi,Gabriel Roccabruna,Massimo Rizzoli,Seyed Mahed Mousavi,Giuseppe Riccardi
Main category: cs.CV
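TL;DR: The paper studies the reasoning of vision-language models (VLMs) in counting tasks, finds that they are highly sensitive to the number, type, and spatial arrangement of objects and to distractors, and traces the errors mainly to the mapping from the last-layer representation into the output space. Fine-tuning only the output layer improves accuracy by up to 21%.
Details
Motivation: VLMs perform well across many tasks but remain limited in visual reasoning such as counting. The paper aims to analyze why VLMs fail and to propose a targeted improvement. Contribution: 1. A study of the counting ability of seven state-of-the-art VLMs under controlled experimental conditions. 2. Evidence that VLMs are sensitive to object properties and distractors. 3. A significant performance gain from fine-tuning only the output layer.
Method: 1. Controlled experiments that analyze VLM counting behavior. 2. A layer-wise analysis that locates the errors in the last-layer mapping. 3. Targeted fine-tuning of the output layer.
Result: Experiments show that VLMs are highly sensitive to object properties and distractors. Fine-tuning the output layer improves counting accuracy by up to 21%, and the gains are confirmed on real-world datasets.
Insight: The visual reasoning of VLMs can be improved substantially by targeted fine-tuning of the output layer, without large-scale changes to the model, which suggests a possible direction for other visual reasoning tasks.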
Abstract: Vision-Language Models (VLMs) have recently gained attention due to their competitive performance on multiple downstream tasks, achieved by following user-input instructions. However, VLMs still exhibit several limitations in visual reasoning, such as difficulties in identifying relations (e.g., spatial, temporal, and among objects), understanding temporal sequences (e.g., frames), and counting objects. In this work, we go beyond score-level benchmark evaluations of VLMs by investigating the underlying causes of their failures and proposing a targeted approach to improve their reasoning capabilities. We study the reasoning skills of seven state-of-the-art VLMs in the counting task under controlled experimental conditions. Our experiments show that VLMs are highly sensitive to the number and type of objects, their spatial arrangement, and the co-occurrence of distractors. A layer-wise analysis reveals that errors are due to incorrect mapping of the last-layer representation into the output space. Our targeted training shows that fine-tuning just the output layer improves accuracy by up to 21%. We corroborate these findings by achieving consistent improvements on real-world datasets.
[81] The Intricate Dance of Prompt Complexity, Quality, Diversity, and Consistency in T2I Models
Xiaofeng Zhang,Aaron Courville,Michal Drozdzal,Adriana Romero-Soriano
Main category: cs.CV
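TL;DR: The paper examines how prompt complexity affects the quality, diversity, and consistency of data generated by text-to-image (T2I) models. Through experiments and theoretical analysis, the authors propose an evaluation framework, characterize the relationship between prompt complexity and the utility of generated data, and analyze the effect of different inference-time interventions.
Details
Motivation: T2I models can generate abundant synthetic data, but its utility depends on prompt complexity, and this dependence has not been studied systematically. Contribution: 1) Synthetic experiments and theoretical analysis that expose the difficulty prompt complexity poses for T2I generation; 2) A new evaluation framework for comparing the utility of real and synthetic data; 3) An analysis of the strengths and weaknesses of different inference-time interventions.
Method: The authors design synthetic experiments and theoretical derivations, propose the evaluation framework, and run large-scale experiments on several datasets (such as CC12M, ImageNet-1k, and DCI) to study different inference-time interventions.
Result: Increasing prompt complexity lowers conditional diversity and prompt consistency but reduces the synthetic-to-real distribution shift. Among the interventions, prompt expansion performs best in image diversity and aesthetics.
Insight: 1) Prompt complexity is a key factor in the utility of T2I generation; 2) Inference-time interventions can increase the diversity of generations to some extent, but may move outside the support of real data; 3) Prompt expansion performs best because it uses a pre-trained language model as a likelihood estimator.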
Abstract: Text-to-image (T2I) models offer great potential for creating virtually limitless synthetic data, a valuable resource compared to fixed and finite real datasets. Previous works evaluate the utility of synthetic data from T2I models on three key desiderata: quality, diversity, and consistency. While prompt engineering is the primary means of interacting with T2I models, the systematic impact of prompt complexity on these critical utility axes remains underexplored. In this paper, we first conduct synthetic experiments to motivate the difficulty of generalization w.r.t. prompt complexity and explain the observed difficulty with theoretical derivations. Then, we introduce a new evaluation framework that can compare the utility of real data and synthetic data, and present a comprehensive analysis of how prompt complexity influences the utility of synthetic data generated by commonly used T2I models. We conduct our study across diverse datasets, including CC12M, ImageNet-1k, and DCI, and evaluate different inference-time intervention methods. Our synthetic experiments show that generalizing to more general conditions is harder than the other way round, since the former needs an estimated likelihood that is not learned by diffusion models. Our large-scale empirical experiments reveal that increasing prompt complexity results in lower conditional diversity and prompt consistency, while reducing the synthetic-to-real distribution shift, which aligns with the synthetic experiments. Moreover, current inference-time interventions can augment the diversity of the generations at the expense of moving outside the support of real data. Among those interventions, prompt expansion, by deliberately using a pre-trained language model as a likelihood estimator, consistently achieves the highest performance in both image diversity and aesthetics, even higher than that of real data.
[82] A Matter of Time: Revealing the Structure of Time in Vision-Language Models
Nidham Tekaya,Manuela Waldner,Matthias Zeppelzauer
Main category: cs.CV
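TL;DR: The paper studies the temporal awareness of vision-language models (VLMs), introduces the TIME10k benchmark dataset, and shows that temporal information lies on a low-dimensional, non-linear structure in the VLM embedding space. Building on this, the authors propose an explicit "timeline" representation for temporal reasoning tasks.
Details
Motivation: To investigate whether VLMs can position visual content on a timeline, which would broaden their applications and functionality. Contribution: 1. The TIME10k benchmark dataset; 2. The finding that temporal information has a low-dimensional, non-linear structure in the VLM embedding space; 3. An efficient timeline representation for temporal reasoning tasks.
Method: 1. Evaluate the time-awareness of 37 VLMs; 2. Validate model performance on the TIME10k dataset; 3. Derive an explicit timeline representation from the embedding space.
Result: The proposed timeline approaches match or exceed a prompt-based baseline on temporal reasoning tasks while being computationally efficient.
Insight: Temporal information is structured in the VLM embedding space, and the progression of time can be represented through a low-dimensional, non-linear mapping.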
Abstract: Large-scale vision-language models (VLMs) such as CLIP have gained popularity for their generalizable and expressive multimodal representations. By leveraging large-scale training data with diverse textual metadata, VLMs acquire open-vocabulary capabilities, solving tasks beyond their training scope. This paper investigates the temporal awareness of VLMs, assessing their ability to position visual content in time. We introduce TIME10k, a benchmark dataset of over 10,000 images with temporal ground truth, and evaluate the time-awareness of 37 VLMs by a novel methodology. Our investigation reveals that temporal information is structured along a low-dimensional, non-linear manifold in the VLM embedding space. Based on this insight, we propose methods to derive an explicit "timeline" representation from the embedding space. These representations model time and its chronological progression and thereby facilitate temporal reasoning tasks. Our timeline approaches achieve competitive to superior accuracy compared to a prompt-based baseline while being computationally efficient. All code and data are available at https://tekayanidham.github.io/timeline-page/.
[83] HAD: Hierarchical Asymmetric Distillation to Bridge Spatio-Temporal Gaps in Event-Based Object Tracking
Yao Deng,Xian Zhong,Wenxuan Liu,Zhaofei Yu,Jingling Yuan,Tiejun Huang
Main category: cs.CV
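TL;DR: The paper proposes HAD (Hierarchical Asymmetric Distillation), a multi-modal knowledge distillation framework that addresses the spatio-temporal asymmetry between RGB cameras and event cameras in order to improve object tracking.
Details
Motivation: RGB cameras and event cameras have complementary strengths (high spatial resolution for RGB, high temporal resolution for event cameras), but the spatio-temporal asymmetry caused by their different imaging mechanisms hinders effective multi-modal integration. Contribution: The paper proposes the HAD framework, which explicitly models and mitigates the spatio-temporal asymmetry through a hierarchical alignment strategy while keeping the student network computationally efficient and compact.
Method: A multi-modal knowledge distillation approach with a hierarchical alignment strategy that reduces information loss while preserving efficiency.
Result: Experiments show that HAD outperforms existing methods, and ablation studies confirm the effectiveness and necessity of each component.
Insight: Explicitly modeling the asymmetry makes it possible to combine the complementary strengths of the two modalities and improves the robustness of object tracking.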
Abstract: RGB cameras excel at capturing rich texture details with high spatial resolution, whereas event cameras offer exceptional temporal resolution and a high dynamic range (HDR). Leveraging their complementary strengths can substantially enhance object tracking under challenging conditions, such as high-speed motion, HDR environments, and dynamic background interference. However, a significant spatio-temporal asymmetry exists between these two modalities due to their fundamentally different imaging mechanisms, hindering effective multi-modal integration. To address this issue, we propose Hierarchical Asymmetric Distillation (HAD), a multi-modal knowledge distillation framework that explicitly models and mitigates spatio-temporal asymmetries. Specifically, HAD proposes a hierarchical alignment strategy that minimizes information loss while maintaining the student network’s computational efficiency and parameter compactness. Extensive experiments demonstrate that HAD consistently outperforms state-of-the-art methods, and comprehensive ablation studies further validate the effectiveness and necessity of each designed component. The code will be released soon.
[84] Can You Trust What You See? Alpha Channel No-Box Attacks on Video Object Detection
Ariana Yi,Ce Zhou,Liyang Xiao,Qiben Yan
Main category: cs.CV
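TL;DR: The paper presents α-Cloak, a no-box adversarial attack on video object detection that operates through the alpha channel of RGBA videos and requires no access to model internals.
Details
Motivation: As object detection models are increasingly deployed in cyber-physical systems such as autonomous vehicles (AVs) and surveillance platforms, securing them against adversarial attacks is essential. Prior adversarial attack research has focused mainly on the image domain, while no-box attacks in the video domain remain underexplored.
Contribution: Presents α-Cloak, the first alpha-channel-based no-box adversarial attack, which fools object detectors without introducing visible artifacts and achieves a 100% attack success rate.
Method: α-Cloak uses the alpha channel to fuse a malicious target video with a benign video, with a fusion algorithm designed for visual stealth and for compatibility across multiple video formats and playback applications.
Result: α-Cloak achieves a 100% attack success rate on five state-of-the-art object detectors, a vision-language model, and a multi-modal large language model (Gemini-2.0-Flash).
Insight: The paper exposes a previously unexplored alpha-channel vulnerability in video-based perception systems and stresses the urgency of defenses that account for the alpha channel in adversarial settings.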
Abstract: As object detection models are increasingly deployed in cyber-physical systems such as autonomous vehicles (AVs) and surveillance platforms, ensuring their security against adversarial threats is essential. While prior work has explored adversarial attacks in the image domain, those attacks in the video domain remain largely unexamined, especially in the no-box setting. In this paper, we present α-Cloak, the first no-box adversarial attack on object detectors that operates entirely through the alpha channel of RGBA videos. α-Cloak exploits the alpha channel to fuse a malicious target video with a benign video, resulting in a fused video that appears innocuous to human viewers but consistently fools object detectors. Our attack requires no access to model architecture, parameters, or outputs, and introduces no perceptible artifacts. We systematically study the support for alpha channels across common video formats and playback applications, and design a fusion algorithm that ensures visual stealth and compatibility. We evaluate α-Cloak on five state-of-the-art object detectors, a vision-language model, and a multi-modal large language model (Gemini-2.0-Flash), demonstrating a 100% attack success rate across all scenarios. Our findings reveal a previously unexplored vulnerability in video-based perception systems, highlighting the urgent need for defenses that account for the alpha channel in adversarial settings.
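The abstract closes by calling for defenses that account for the alpha channel. As a rough illustration only, and not a defense proposed or evaluated in the paper, one obvious precaution is to flatten every RGBA frame with the standard "over" compositing rule before it reaches a detector, so that the detector and a human viewer look at the same pixels. The function names, the nested-list frame layout, and the white default background are all assumptions made for this sketch.

```python
def flatten_rgba(pixel, background=(255, 255, 255)):
    """Composite one RGBA pixel (0-255 per channel) over an opaque background.

    Uses the standard "over" rule: out = alpha * fg + (1 - alpha) * bg.
    """
    r, g, b, a = pixel
    alpha = a / 255.0
    return tuple(
        round(alpha * fg + (1.0 - alpha) * bg)
        for fg, bg in zip((r, g, b), background)
    )


def flatten_frame(frame, background=(255, 255, 255)):
    """Flatten a frame given as a list of rows of RGBA tuples."""
    return [[flatten_rgba(p, background) for p in row] for row in frame]


def has_transparency(frame):
    """Return True if any pixel is not fully opaque (a cheap screening check)."""
    return any(p[3] < 255 for row in frame for p in row)
```

Whether flattening alone would defeat the attack described above is not something the abstract states; the sketch only shows how the viewer-side and detector-side inputs could be made consistent.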
[85] VGD: Visual Geometry Gaussian Splatting for Feed-Forward Surround-view Driving Reconstruction
Junhong Lin,Kangli Wang,Shunzhou Wang,Songlin Fan,Ge Li,Wei Gao
Main category: cs.CV
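TL;DR: VGD is a novel feed-forward, end-to-end learning framework that learns geometry explicitly and uses a Gaussian head branch to improve the semantic quality of novel views, significantly outperforming existing methods on the nuScenes dataset.
Details
Motivation: The core challenge in surround-view autonomous driving scene reconstruction is improving novel view quality while preserving generalization. Because the overlap between views is minimal, existing methods struggle to ensure geometric consistency and reconstruction quality.
Contribution: 1. Designs a lightweight VGGT variant that efficiently distills geometric priors from the pre-trained VGGT; 2. Proposes a Gaussian head branch that fuses multi-scale geometry tokens to predict Gaussian parameters; 3. Combines multi-scale features from the geometry and Gaussian head branches to jointly supervise a semantic refinement model.
Method: 1. A lightweight VGGT extracts geometric priors; 2. The Gaussian head branch predicts Gaussian parameters; 3. Multi-scale features jointly supervise semantic refinement.
Result: On the nuScenes dataset, VGD significantly outperforms existing methods in both objective metrics and subjective quality.
Insight: Explicitly learning geometry and combining it with Gaussian parameter prediction improves the quality and consistency of multi-view reconstruction.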
Abstract: Feed-forward surround-view autonomous driving scene reconstruction offers fast, generalizable inference ability, which faces the core challenge of ensuring generalization while elevating novel view quality. Due to the surround-view with minimal overlap regions, existing methods typically fail to ensure geometric consistency and reconstruction quality for novel views. To tackle this tension, we claim that geometric information must be learned explicitly, and the resulting features should be leveraged to guide the elevating of semantic quality in novel views. In this paper, we introduce Visual Gaussian Driving (VGD), a novel feed-forward end-to-end learning framework designed to address this challenge. To achieve generalizable geometric estimation, we design a lightweight variant of the VGGT architecture to efficiently distill its geometric priors from the pre-trained VGGT to the geometry branch. Furthermore, we design a Gaussian Head that fuses multi-scale geometry tokens to predict Gaussian parameters for novel view rendering, which shares the same patch backbone as the geometry branch. Finally, we integrate multi-scale features from both geometry and Gaussian head branches to jointly supervise a semantic refinement model, optimizing rendering quality through feature-consistent learning. Experiments on nuScenes demonstrate that our approach significantly outperforms state-of-the-art methods in both objective metrics and subjective quality under various settings, which validates VGD’s scalability and high-fidelity surround-view reconstruction.
[86] Multi-modal Co-learning for Earth Observation: Enhancing single-modality models via modality collaboration
Francisco Mena,Dino Ienco,Cassio F. Dantas,Roberto Interdonato,Andreas Dengel
Main category: cs.CV
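TL;DR: This paper proposes a multi-modal co-learning framework that improves single-modality models for Earth observation tasks, particularly when the modalities available at training and inference differ.
Details
Motivation: Earth observation produces large volumes of data across diverse modalities, but in practice the same sensor modalities may not be available at both training and inference. Conventional approaches tailor solutions to specific tasks or modalities and lack generality.
Contribution: Proposes a general multi-modal co-learning framework that generalizes across tasks without targeting a specific modality at inference. The method combines contrastive learning and modality discriminative learning to separate modality-shared from modality-specific information.
Method: Uses contrastive and modality discriminative learning to guide single-modality models to structure their internal manifold into modality-shared and modality-specific information.
Result: On four Earth observation benchmarks, the method outperforms state-of-the-art machine learning, computer vision, and EO-specific approaches on both classification and regression tasks.
Insight: Multi-modal co-learning can exploit the diverse modalities available at training time to improve single-modality inference, especially in real-world settings where modalities are mismatched.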
Abstract: Multi-modal co-learning is emerging as an effective paradigm in machine learning, enabling models to collaboratively learn from different modalities to enhance single-modality predictions. Earth Observation (EO) represents a quintessential domain for multi-modal data analysis, wherein diverse remote sensors collect data to sense our planet. This unprecedented volume of data introduces novel challenges. Specifically, the access to the same sensor modalities at both training and inference stages becomes increasingly complex based on real-world constraints affecting remote sensing platforms. In this context, multi-modal co-learning presents a promising strategy to leverage the vast amount of sensor-derived data available at the training stage to improve single-modality models for inference-time deployment. Most current research efforts focus on designing customized solutions for either particular downstream tasks or specific modalities available at the inference stage. To address this, we propose a novel multi-modal co-learning framework capable of generalizing across various tasks without targeting a specific modality for inference. Our approach combines contrastive and modality discriminative learning together to guide single-modality models to structure the internal model manifold into modality-shared and modality-specific information. We evaluate our framework on four EO benchmarks spanning classification and regression tasks across different sensor modalities, where only one of the modalities available during training is accessible at inference time. Our results demonstrate consistent predictive improvements over state-of-the-art approaches from the recent machine learning and computer vision literature, as well as EO-specific methods. The obtained findings validate our framework in the single-modality inference scenarios across a diverse range of EO applications.
[87] Decomposed Attention Fusion in MLLMs for Training-Free Video Reasoning Segmentation
Su Ho Han,Jeongseok Hyun,Pilhyeon Lee,Minho Shim,Dongyoon Wee,Seon Joo Kim
Main category: cs.CV
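TL;DR: This paper proposes DecAF, a training-free cross-modal method for video reasoning segmentation that refines raw attention maps through decomposed attention fusion and uses SAM2 prompting to produce fine-grained segmentation masks.
Details
Motivation: Existing MLLMs perform well at video understanding, but their raw attention maps are noisy and poorly aligned with object regions, which makes them hard to use directly for localization.
Contribution: Proposes DecAF, which refines attention maps through contrastive object-background fusion and complementary video-frame fusion, and introduces attention-guided SAM2 prompting to produce fine-grained masks.
Method: 1. Extracts attention maps via a rollout mechanism; 2. Refines them with DecAF (contrastive and complementary fusion); 3. Uses SAM2 prompting to generate fine-grained segmentation results.
Result: On referring and reasoning VOS benchmarks, DecAF outperforms training-free methods and performs comparably to training-based methods.
Insight: MLLM attention can be applied directly to video segmentation without any training, removing a limitation of conventional approaches.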
Abstract: Multimodal large language models (MLLMs) demonstrate strong video understanding by attending to visual tokens relevant to textual queries. To directly adapt this for localization in a training-free manner, we cast video reasoning segmentation as a video QA task and extract attention maps via rollout mechanism. However, raw attention maps are noisy and poorly aligned with object regions. We propose Decomposed Attention Fusion (DecAF), which refines these maps through two mechanisms: (1) contrastive object-background fusion and (2) complementary video-frame fusion. This method suppresses irrelevant activations and enhances object-focused cues, enabling direct conversion of attention maps into coarse segmentation masks. In addition, we introduce attention-guided SAM2 prompting for obtaining fine-grained masks. Unlike existing methods that jointly train MLLMs with SAM, our method operates entirely without retraining. DecAF outperforms training-free methods and achieves performance comparable to training-based methods on both referring and reasoning VOS benchmarks. The code will be available at https://github.com/HYUNJS/DecAF.
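To make the "rollout" and "contrastive object-background fusion" steps more concrete, here is a minimal sketch. `attention_rollout` follows the commonly used form of attention rollout (each layer's attention is averaged with the identity to account for residual connections, then the layers are multiplied together). `contrastive_fusion` is a hypothetical stand-in that subtracts a background map from an object map and rescales; the abstract does not give DecAF's actual fusion rule, so treat that function and all names here as assumptions.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def attention_rollout(layers):
    """Combine per-layer attention matrices (head-averaged, rows sum to 1)."""
    n = len(layers[0])
    rollout = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for attn in layers:
        # Average with the identity to model the residual connection.
        mixed = [
            [0.5 * attn[i][j] + (0.5 if i == j else 0.0) for j in range(n)]
            for i in range(n)
        ]
        # Renormalize rows, then accumulate from the first layer upward.
        mixed = [[v / sum(row) for v in row] for row in mixed]
        rollout = matmul(mixed, rollout)
    return rollout


def contrastive_fusion(object_map, background_map):
    """Hypothetical fusion: keep only where object attention exceeds background."""
    diff = [max(o - b, 0.0) for o, b in zip(object_map, background_map)]
    peak = max(diff)
    return [d / peak for d in diff] if peak > 0 else diff
```

In this toy form the fused map is a flat list of per-token scores; thresholding it would give the kind of coarse mask the abstract mentions.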
[88] CBDiff:Conditional Bernoulli Diffusion Models for Image Forgery Localization
Zhou Lei,Pan Gang,Wang Jiahao,Sun Di
Main category: cs.CV
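TL;DR: This paper proposes CBDiff, a conditional Bernoulli diffusion model for image forgery localization that generates multiple diverse localization maps to improve the credibility and reliability of predictions.
Details
Motivation: Existing methods produce a single deterministic localization map that lacks the precision and reliability required for high-stakes applications. This paper aims to close that gap.
Contribution: 1. Proposes the CBDiff model, which generates diverse forgery localization maps; 2. Introduces Bernoulli noise to match the binary, sparse nature of forgery masks; 3. Designs a Time-Step Cross-Attention mechanism (TSCAttention) to improve detection performance.
Method: CBDiff combines Bernoulli noise with a diffusion model and uses TSCAttention to exploit semantic features and time-step information when generating diverse localization maps.
Result: Experiments on eight public datasets show that CBDiff significantly outperforms existing state-of-the-art methods.
Insight: By generating diverse predictions, CBDiff captures the uncertainty of the forgery distribution more fully, which suits deployment in high-stakes settings.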
Abstract: Image Forgery Localization (IFL) is a crucial task in image forensics, aimed at accurately identifying manipulated or tampered regions within an image at the pixel level. Existing methods typically generate a single deterministic localization map, which often lacks the precision and reliability required for high-stakes applications such as forensic analysis and security surveillance. To enhance the credibility of predictions and mitigate the risk of errors, we introduce an advanced Conditional Bernoulli Diffusion Model (CBDiff). Given a forged image, CBDiff generates multiple diverse and plausible localization maps, thereby offering a richer and more comprehensive representation of the forgery distribution. This approach addresses the uncertainty and variability inherent in tampered regions. Furthermore, CBDiff innovatively incorporates Bernoulli noise into the diffusion process to more faithfully reflect the inherent binary and sparse properties of forgery masks. Additionally, CBDiff introduces a Time-Step Cross-Attention (TSCAttention), which is specifically designed to leverage semantic feature guidance with temporal steps to improve manipulation detection. Extensive experiments on eight public benchmark datasets demonstrate that CBDiff significantly outperforms existing state-of-the-art methods, highlighting its strong potential for real-world deployment.
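For readers unfamiliar with Bernoulli noise in a diffusion process, the sketch below shows the classic binomial-diffusion forward step on a binary mask, where each pixel is resampled from Bernoulli((1 − β_t)·x_{t−1} + β_t/2). Whether CBDiff uses exactly this kernel is an assumption; the abstract only says that Bernoulli noise is incorporated. The function names and the noise schedule `betas` are illustrative.

```python
import random


def marginal_prob(x0_bit, betas, t):
    """P(x_t = 1 | x_0) after t forward steps of the kernel
    x_t ~ Bernoulli((1 - beta_t) * x_{t-1} + beta_t / 2).

    Unrolling the recursion gives alpha_bar * x_0 + 0.5 * (1 - alpha_bar),
    with alpha_bar the product of (1 - beta_s) over the first t steps.
    """
    alpha_bar = 1.0
    for beta in betas[:t]:
        alpha_bar *= 1.0 - beta
    return alpha_bar * x0_bit + 0.5 * (1.0 - alpha_bar)


def noise_mask(mask, betas, t, rng):
    """Sample a noised binary mask x_t directly from the clean mask x_0."""
    return [
        [1 if rng.random() < marginal_prob(bit, betas, t) else 0 for bit in row]
        for row in mask
    ]
```

A call such as `noise_mask(mask, [0.05] * 100, 100, random.Random(0))` would yield a mask close to fair coin flips, while `t = 0` returns the clean mask unchanged. Unlike Gaussian noise, every intermediate state stays binary, which matches the motivation given in the abstract.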
[89] XBench: A Comprehensive Benchmark for Visual-Language Explanations in Chest Radiography
Haozhe Luo,Shelley Zixin Shu,Ziyu Zhou,Sebastian Otalora,Mauricio Reyes
Main category: cs.CV
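TL;DR: This paper presents XBench, the first benchmark to systematically evaluate the cross-modal interpretability of vision-language models on chest X-rays. It reveals the limitations of current models on small or diffuse lesions and underscores the importance of clinically reliable grounding.
Details
Motivation: Vision-language models perform well at medical image understanding, but their grounding ability (the extent to which textual concepts align with visual evidence) remains underexplored. In medicine, reliable grounding is essential for interpretability and clinical adoption.
Contribution: 1. Presents XBench, the first benchmark to systematically evaluate the cross-modal interpretability of vision-language models on chest X-rays.
2. Reveals the performance degradation of current models on small or diffuse lesions.
3. Finds that pretraining on chest X-ray datasets markedly improves grounding.
Method: 1. Generates visual explanations using cross-attention and similarity-based localization maps.
2. Quantitatively assesses how well these explanations align with radiologist-annotated regions across multiple pathologies.
3. Evaluates seven CLIP-style vision-language model variants.
Result: 1. All model variants localize large, well-defined pathologies reasonably well, but performance drops substantially for small or diffuse lesions.
2. Models pretrained on chest X-ray datasets show better grounding.
3. Recognition ability and grounding ability are strongly correlated.
Insight: Although current vision-language models show strong recognition ability, their clinically reliable grounding still falls short, which underscores the need for targeted interpretability benchmarks in medical practice.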
Abstract: Vision-language models (VLMs) have recently shown remarkable zero-shot performance in medical image understanding, yet their grounding ability, the extent to which textual concepts align with visual evidence, remains underexplored. In the medical domain, however, reliable grounding is essential for interpretability and clinical adoption. In this work, we present the first systematic benchmark for evaluating cross-modal interpretability in chest X-rays across seven CLIP-style VLM variants. We generate visual explanations using cross-attention and similarity-based localization maps, and quantitatively assess their alignment with radiologist-annotated regions across multiple pathologies. Our analysis reveals that: (1) while all VLM variants demonstrate reasonable localization for large and well-defined pathologies, their performance substantially degrades for small or diffuse lesions; (2) models that are pretrained on chest X-ray-specific datasets exhibit improved alignment compared to those trained on general-domain data. (3) The overall recognition ability and grounding ability of the model are strongly correlated. These findings underscore that current VLMs, despite their strong recognition ability, still fall short in clinically reliable grounding, highlighting the need for targeted interpretability benchmarks before deployment in medical practice. XBench code is available at https://github.com/Roypic/Benchmarkingattention
[90] MedReason-R1: Learning to Reason for CT Diagnosis with Reinforcement Learning and Local Zoom
Yifan Li,Fenghe Tang,Yingtai Li,Shaohua Kevin Zhou
Main category: cs.CV
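TL;DR: MedReason-R1 is a medical vision-language model designed for CT diagnosis. By combining reinforcement learning with local zoom, it carries out an explicit diagnostic reasoning process and markedly improves diagnostic performance on medical images.
Details
Motivation: General-purpose large vision-language models (VLMs) excel at describing natural images but underperform in the medical domain, mainly because large-scale, high-quality medical imaging datasets are lacking and the coarse-to-fine diagnostic process is neglected.
Contribution: 1. Constructs the CT-RATE-VQA dataset (84K QA pairs); 2. Designs MedReason-R1, which improves diagnosis by zooming in on disease regions and applying reinforcement learning (the GRPO framework); 3. Achieves state-of-the-art performance on CT diagnosis.
Method: 1. Builds the CT-RATE-VQA dataset; 2. MedReason-R1 combines local zoom on disease regions with the GRPO reinforcement learning framework to reason explicitly through the diagnosis; 3. Avoids reliance on costly manual annotation.
Result: MedReason-R1 outperforms general-purpose and medical VLMs on CT disease diagnosis while retaining generalization.
Insight: 1. Medical diagnosis needs both global localization and disease-specific detail; 2. Reinforcement learning can improve diagnostic reasoning without manual annotation; 3. High-quality specialized datasets are key to successful medical VLMs.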
Abstract: General-purpose large Vision-Language Models (VLMs) demonstrate strong capabilities in generating detailed descriptions for natural images. However, their performance in the medical domain remains suboptimal, even for relatively straightforward tasks, primarily due to the lack of large-scale, high-quality, specialized medical imaging datasets and the neglect of the diagnostic process that progresses from coarse to fine-grained. To address the first issue, we construct the CT-RATE-VQA dataset, which has 84K QA pairs. For the second issue, we propose MedReason-R1, a medical VLM with explicit reasoning process for disease diagnosis. MedReason-R1 incorporates a novel strategy that embeds zoom-in disease region-of-interest areas into the image, highlighting the crucial role of both global localization and disease-specific details in enhancing the model’s diagnostic performance. Furthermore, we introduce the GRPO reinforcement learning framework to MedReason-R1, which enables effective reasoning without relying on costly manual annotations. Compared to recent general-purpose and medical VLMs, MedReason-R1 achieves state-of-the-art performance in CT disease diagnosis while retaining generalization. The code, checkpoints, and dataset are available at: https://github.com/Leevan001/MedReason-R1
[91] From Forecasting to Planning: Policy World Model for Collaborative State-Action Prediction
Zhida Zhao,Talas Fu,Yifan Wang,Lijun Wang,Huchuan Lu
Main category: cs.CV
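TL;DR: This paper proposes Policy World Model (PWM), a new driving paradigm that unifies world modeling and trajectory planning in a single architecture and improves planning through an action-free future state forecasting scheme.
Details
Motivation: Existing driving world models are mostly used for world simulation and are decoupled from trajectory planning. Although recent work attempts to unify world modeling and planning, how world-modeling knowledge can synergistically benefit planning still needs exploration.
Contribution: 1. Proposes the PWM paradigm, which integrates world modeling and planning and achieves human-like anticipatory perception through collaborative state-action prediction; 2. Introduces a dynamically enhanced parallel token generation mechanism to make video forecasting more efficient; 3. Matches state-of-the-art methods that use multi-view, multi-modal inputs while using only a single-view input.
Method: PWM uses the learned world knowledge to improve planning via an action-free future state forecasting scheme, and adopts a dynamically enhanced parallel token generation mechanism (comprising a context-guided tokenizer and an adaptive dynamic focal loss).
Result: Experiments show that PWM, using only front camera input, matches or exceeds state-of-the-art methods that rely on multi-view and multi-modal inputs.
Insight: Collaborative state-action prediction and action-free future state forecasting let world-modeling knowledge be applied more effectively to planning, improving the reliability and performance of autonomous driving systems.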
Abstract: Despite remarkable progress in driving world models, their potential for autonomous systems remains largely untapped: the world models are mostly learned for world simulation and decoupled from trajectory planning. While recent efforts aim to unify world modeling and planning in a single framework, the synergistic facilitation mechanism of world modeling for planning still requires further exploration. In this work, we introduce a new driving paradigm named Policy World Model (PWM), which not only integrates world modeling and trajectory planning within a unified architecture, but is also able to benefit planning using the learned world knowledge through the proposed action-free future state forecasting scheme. Through collaborative state-action prediction, PWM can mimic the human-like anticipatory perception, yielding more reliable planning performance. To facilitate the efficiency of video forecasting, we further introduce a dynamically enhanced parallel token generation mechanism, equipped with a context-guided tokenizer and an adaptive dynamic focal loss. Despite utilizing only front camera input, our method matches or exceeds state-of-the-art approaches that rely on multi-view and multi-modal inputs. Code and model weights will be released at https://github.com/6550Zhao/Policy-World-Model.
[92] I Spy With My Model’s Eye: Visual Search as a Behavioural Test for MLLMs
John Burden,Jonathan Prunty,Ben Slater,Matthieu Tehenan,Greg Davis,Lucy Cheke
Main category: cs.CV
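TL;DR: This paper adapts classic visual search paradigms from cognitive psychology to evaluate the visual processing of multimodal large language models (MLLMs), and finds that they exhibit human-like "pop-out" effects and scene priors.
Details
Motivation: MLLMs perform well on vision-language tasks, but their visual processing remains opaque. Existing black-box evaluations focus on task accuracy and reveal little about the underlying mechanisms.
Contribution: The main contribution is bringing visual search paradigms to MLLM evaluation, revealing pop-out effects for colour and size features, capacity limits in multi-feature search, and evidence of scene priors.
Method: Adapts visual search paradigms from cognitive psychology in controlled experiments (targeting colour, size, and lighting features), and supports the results with fine-tuning and interpretability analyses.
Result: MLLMs show human-like pop-out effects in single-feature search but capacity limits in multi-feature search. The experiments also provide evidence that the models incorporate scene priors such as lighting direction.
Insight: Visual search can serve as a diagnostic tool for the perceptual capabilities of MLLMs. The study reveals similarities between model perception and human cognition and offers a new perspective for understanding and improving MLLMs.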
Abstract: Multimodal large language models (MLLMs) achieve strong performance on vision-language tasks, yet their visual processing is opaque. Most black-box evaluations measure task accuracy, but reveal little about underlying mechanisms. Drawing on cognitive psychology, we adapt classic visual search paradigms – originally developed to study human perception – to test whether MLLMs exhibit the "pop-out" effect, where salient visual features are detected independently of distractor set size. Using controlled experiments targeting colour, size and lighting features, we find that advanced MLLMs exhibit human-like pop-out effects in colour or size-based disjunctive (single feature) search, as well as capacity limits for conjunctive (multiple feature) search. We also find evidence to suggest that MLLMs, like humans, incorporate natural scene priors such as lighting direction into object representations. We reinforce our findings using targeted fine-tuning and mechanistic interpretability analyses. Our work shows how visual search can serve as a cognitively grounded diagnostic tool for evaluating perceptual capabilities in MLLMs.
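The pop-out effect is defined above as detection that does not depend on distractor set size. A simple way to quantify that, sketched here under the assumption that accuracy is recorded per set size, is the least-squares slope of accuracy against set size: a slope near zero suggests pop-out, while a clearly negative slope suggests a capacity limit. The function name and the sample numbers in the usage note are illustrative, not results from the paper.

```python
def search_slope(set_sizes, accuracies):
    """Least-squares slope of accuracy as a function of distractor set size."""
    n = len(set_sizes)
    mean_x = sum(set_sizes) / n
    mean_y = sum(accuracies) / n
    sxx = sum((x - mean_x) ** 2 for x in set_sizes)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(set_sizes, accuracies))
    return sxy / sxx
```

For example, accuracies of 0.95 at every set size in `[4, 8, 16, 32]` give a slope of about zero, whereas accuracies falling linearly from 0.9 to 0.2 over the same set sizes give a slope of about -0.025 per added item.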
[93] Curvilinear Structure-preserving Unpaired Cross-domain Medical Image Translation
Zihao Chen,Yi Zhou,Xudong Jiang,Li Chen,Leopold Schmetterer,Bingyao Tan,Jun Cheng
Main category: cs.CV
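TL;DR: This paper proposes CST, a framework that preserves fine curvilinear structures (such as microvasculature) in unpaired medical image translation by adding structure consistency supervision, which improves translation accuracy and diagnostic reliability.
Details
Motivation: Existing unpaired image translation methods often distort fine curvilinear structures such as microvasculature, which undermines diagnosis and quantitative analysis. This matters most in ophthalmic and vascular imaging, where subtle morphological changes carry significant clinical meaning.
Contribution: Proposes the CST framework, which integrates structure consistency supervision into training to preserve the geometric integrity of curvilinear structures and can be incorporated seamlessly into existing methods such as CycleGAN and UNSB.
Method: CST augments baseline models with a curvilinear structure extraction module that provides topological supervision. Experiments cover three imaging modalities: optical coherence tomography angiography, color fundus, and X-ray coronary angiography.
Result: CST improves translation fidelity, achieves state-of-the-art performance, and markedly improves the preservation of curvilinear structures.
Insight: CST offers a new approach to geometric integrity in medical image translation and suits structure-sensitive applications.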
Abstract: Unpaired image-to-image translation has emerged as a crucial technique in medical imaging, enabling cross-modality synthesis, domain adaptation, and data augmentation without costly paired datasets. Yet, existing approaches often distort fine curvilinear structures, such as microvasculature, undermining both diagnostic reliability and quantitative analysis. This limitation is consequential in ophthalmic and vascular imaging, where subtle morphological changes carry significant clinical meaning. We propose Curvilinear Structure-preserving Translation (CST), a general framework that explicitly preserves fine curvilinear structures during unpaired translation by integrating structure consistency into the training. Specifically, CST augments baseline models with a curvilinear extraction module for topological supervision. It can be seamlessly incorporated into existing methods. We integrate it into CycleGAN and UNSB as two representative backbones. Comprehensive evaluation across three imaging modalities: optical coherence tomography angiography, color fundus and X-ray coronary angiography demonstrates that CST improves translation fidelity and achieves state-of-the-art performance. By reinforcing geometric integrity in learned mappings, CST establishes a principled pathway toward curvilinear structure-aware cross-domain translation in medical imaging.
[94] Explainable Face Presentation Attack Detection via Ensemble-CAM
Rashik Shadman,M G Sarwar Murshed,Faraz Hussain
Main category: cs.CV
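TL;DR: This paper proposes Ensemble-CAM, a method that provides visual explanations for deep learning-based face presentation attack detection (PAD) systems to improve their transparency and trustworthiness.
Details
Motivation: Existing deep learning PAD systems are effective, but their decision process is opaque and lacks explainability. Visual explanations are needed to identify the key regions behind a system's decision.
Contribution: Proposes Ensemble-CAM, a novel technique for generating visual explanations for deep learning-based face PAD systems, improving their interpretability and trustworthiness.
Method: By ensembling class activation mapping (CAM) techniques, Ensemble-CAM produces visual heatmaps that highlight key regions and explain how the system distinguishes real from fake biometric images.
Result: Ensemble-CAM makes face PAD systems more transparent, letting users see the basis for the model's decisions, which strengthens trust in the system.
Insight: Visual explanation techniques apply beyond PAD systems and can extend to other deep learning applications that require transparent decisions.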
Abstract: Presentation attacks represent a critical security threat where adversaries use fake biometric data, such as face, fingerprint, or iris images, to gain unauthorized access to protected systems. Various presentation attack detection (PAD) systems have been designed leveraging deep learning (DL) models to mitigate this type of threat. Despite their effectiveness, most of the DL models function as black boxes - their decisions are opaque to their users. The purpose of explainability techniques is to provide detailed information about the reason behind the behavior or decision of DL models. In particular, visual explanation is necessary to better understand the decisions or predictions of DL-based PAD systems and determine the key regions due to which a biometric image is considered real or fake by the system. In this work, a novel technique, Ensemble-CAM, is proposed for providing visual explanations for the decisions made by deep learning-based face PAD systems. Our goal is to improve DL-based face PAD systems by providing a better understanding of their behavior. Our provided visual explanations will enhance the transparency and trustworthiness of DL-based face PAD systems.
[95] LyTimeT: Towards Robust and Interpretable State-Variable Discovery
Kuai Yu,Crystal Su,Xiang Liu,Judah Goldfeder,Mingyuan Shao,Hod Lipson
Main category: cs.CV
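TL;DR: LyTimeT is a two-phase framework for extracting the true variables of dynamical systems from high-dimensional video. It learns robust and interpretable latent representations through spatio-temporal attention and stability constraints.
Details
Motivation: Extracting the true variables of a dynamical system from high-dimensional video is hard because of visual distractors such as background motion, occlusions, and texture changes, so a robust and interpretable method is needed.
Contribution: Proposes the LyTimeT framework, which combines spatio-temporal attention with a Lyapunov stability constraint to extract robust and interpretable latent variables.
Method: 1. Phase 1 uses a TimeSformer-based autoencoder to learn latent representations of dynamically relevant regions; 2. Phase 2 selects physically meaningful dimensions via linear correlation analysis and refines the dynamics with Lyapunov regularization.
Result: On synthetic and real-world dynamical systems, LyTimeT outperforms baselines on mutual information and mean squared error and remains invariant under background perturbations.
Insight: Combining spatio-temporal attention with stability constraints improves both prediction accuracy and the physical interpretability of the model.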
Abstract: Extracting the true dynamical variables of a system from high-dimensional video is challenging due to distracting visual factors such as background motion, occlusions, and texture changes. We propose LyTimeT, a two-phase framework for interpretable variable extraction that learns robust and stable latent representations of dynamical systems. In Phase 1, LyTimeT employs a spatio-temporal TimeSformer-based autoencoder that uses global attention to focus on dynamically relevant regions while suppressing nuisance variation, enabling distraction-robust latent state learning and accurate long-horizon video prediction. In Phase 2, we probe the learned latent space, select the most physically meaningful dimensions using linear correlation analysis, and refine the transition dynamics with a Lyapunov-based stability regularizer to enforce contraction and reduce error accumulation during roll-outs. Experiments on five synthetic benchmarks and four real-world dynamical systems, including chaotic phenomena, show that LyTimeT achieves mutual information and intrinsic dimension estimates closest to ground truth, remains invariant under background perturbations, and delivers the lowest analytical mean squared error among CNN-based (TIDE) and transformer-only baselines. Our results demonstrate that combining spatio-temporal attention with stability constraints yields predictive models that are not only accurate but also physically interpretable.
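A Lyapunov-based regularizer that "enforces contraction" can be pictured as a penalty on any roll-out step where an energy function fails to shrink. The sketch below assumes the simplest choice, V(z) = ||z||², and a hinge penalty on V(z_{t+1}) − (1 − κ)·V(z_t); the abstract does not specify the form LyTimeT uses, so the function, the energy, and the contraction rate `kappa` are all illustrative assumptions.

```python
def lyapunov_penalty(latents, kappa=0.1):
    """Mean hinge penalty over consecutive latent states in one roll-out.

    A step is penalized when V(z_next) > (1 - kappa) * V(z_cur),
    with the assumed energy V(z) = squared Euclidean norm of z.
    """
    def energy(z):
        return sum(v * v for v in z)

    violations = [
        max(0.0, energy(nxt) - (1.0 - kappa) * energy(cur))
        for cur, nxt in zip(latents, latents[1:])
    ]
    return sum(violations) / len(violations) if violations else 0.0
```

Added to a prediction loss, a term like this is zero for trajectories that contract at least as fast as the chosen rate and grows with the size of any violation, which is one way error accumulation over long roll-outs could be discouraged.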
[96] OmniMotion-X: Versatile Multimodal Whole-Body Motion Generation
Guowei Xu,Yuxuan Bian,Ailing Zeng,Mingyi Shi,Shaoli Huang,Wen Li,Lixin Duan,Qiang Xu
Main category: cs.CV
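TL;DR: OmniMotion-X is a versatile multimodal framework for whole-body human motion generation. It uses an autoregressive diffusion transformer and supports combinations of tasks such as text-to-motion and music-to-dance. Reference motion conditioning and a new training strategy improve the consistency and quality of the generated content.
Details
Motivation: Existing motion generation methods are often limited to a single task or modality and are prone to conflicts in multimodal tasks. OmniMotion-X aims to provide a unified framework that addresses the consistency and flexibility challenges of multimodal tasks.
Contribution: 1. Proposes the OmniMotion-X framework, which supports diverse multimodal tasks; 2. Introduces reference motion as a conditioning signal to improve the consistency of generated content; 3. Proposes a progressive weak-to-strong mixed-condition training strategy; 4. Constructs OmniMoCap-X, the largest unified multimodal motion dataset to date.
Method: Uses an autoregressive diffusion transformer with reference motion conditioning and a progressive training strategy. The dataset is standardized and automatically annotated using GPT-4o.
Result: Experiments show that OmniMotion-X outperforms existing methods across multiple tasks and generates long-duration, coherent, and controllable motion.
Insight: Reference signals and a progressive training strategy can resolve conflicts in multimodal tasks and enable high-quality motion generation.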
Abstract: This paper introduces OmniMotion-X, a versatile multimodal framework for whole-body human motion generation, leveraging an autoregressive diffusion transformer in a unified sequence-to-sequence manner. OmniMotion-X efficiently supports diverse multimodal tasks, including text-to-motion, music-to-dance, speech-to-gesture, and global spatial-temporal control scenarios (e.g., motion prediction, in-betweening, completion, and joint/trajectory-guided synthesis), as well as flexible combinations of these tasks. Specifically, we propose the use of reference motion as a novel conditioning signal, substantially enhancing the consistency of generated content, style, and temporal dynamics crucial for realistic animations. To handle multimodal conflicts, we introduce a progressive weak-to-strong mixed-condition training strategy. To enable high-quality multimodal training, we construct OmniMoCap-X, the largest unified multimodal motion dataset to date, integrating 28 publicly available MoCap sources across 10 distinct tasks, standardized to the SMPL-X format at 30 fps. To ensure detailed and consistent annotations, we render sequences into videos and use GPT-4o to automatically generate structured and hierarchical captions, capturing both low-level actions and high-level semantics. Extensive experimental evaluations confirm that OmniMotion-X significantly surpasses existing methods, demonstrating state-of-the-art performance across multiple multimodal tasks and enabling the interactive generation of realistic, coherent, and controllable long-duration motions.
[97] Class-Aware Prototype Learning with Negative Contrast for Test-Time Adaptation of Vision-Language Models
Xiaozhen Qiao,Jingkai Zhao,Yuqiu Jiang,Xianda Guo,Zhe Sun,Hongyuan Zhang,Xuelong Li
Main category: cs.CV
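TL;DR: CPL-NC is a lightweight test-time adaptation framework for vision-language models. It uses a dynamically adjusted class-aware prototype cache and a negative contrastive learning mechanism to address prototype degradation and class confusion, markedly improving generalization under distribution shift.
Details
Motivation: Existing test-time adaptation methods handle long-tailed distributions and confusion between semantically similar classes poorly. CPL-NC targets these problems with dynamically adjusted prototypes and negative contrastive learning.
Contribution: Proposes the CPL-NC framework, comprising a dynamically adjusted class-aware prototype cache module and a negative contrastive learning mechanism, with asymmetric optimization to improve performance under distribution shift.
Method: CPL-NC uses a prototype cache module with dynamic capacity adjustment and a negative contrastive learning mechanism. It refines only the textual prototypes and keeps the visual features stable.
Result: On 15 benchmarks, CPL-NC outperforms prior TTA methods with both ResNet-50 and ViT-B/16 backbones.
Insight: Dynamically adjusted prototypes and negative contrastive learning address long-tailed distributions and class confusion, and asymmetric optimization makes test-time adaptation more flexible.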
Abstract: Vision-Language Models (VLMs) demonstrate impressive zero-shot generalization through large-scale image-text pretraining, yet their performance can drop once the deployment distribution diverges from the training distribution. To address this, Test-Time Adaptation (TTA) methods update models using unlabeled target data. However, existing approaches often ignore two key challenges: prototype degradation in long-tailed distributions and confusion between semantically similar classes. To tackle these issues, we propose Class-Aware Prototype Learning with Negative Contrast (CPL-NC), a lightweight TTA framework designed specifically for VLMs to enhance generalization under distribution shifts. CPL-NC introduces a Class-Aware Prototype Cache Module that dynamically adjusts per-class capacity based on test-time frequency and activation history, with a rejuvenation mechanism for inactive classes to retain rare-category knowledge. Additionally, a Negative Contrastive Learning Mechanism identifies and constrains hard visual-textual negatives to improve class separability. The framework employs asymmetric optimization, refining only textual prototypes while anchoring on stable visual features. Experiments on 15 benchmarks show that CPL-NC consistently outperforms prior TTA methods across both ResNet-50 and ViT-B/16 backbones.
[98] Pico-Banana-400K: A Large-Scale Dataset for Text-Guided Image Editing
Yusu Qian,Eli Bocek-Rivele,Liangchen Song,Jialing Tong,Yinfei Yang,Jiasen Lu,Wenze Hu,Zhe Gan
Main category: cs.CV
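TL;DR: Pico-Banana-400K is a large-scale dataset of 400K images for text-guided image editing research. It provides diverse edit pairs under systematic quality and diversity control and includes three specialized subsets that support research on complex editing scenarios.
Details
Motivation: Multimodal models have made marked progress in text-guided image editing, but research is held back by the absence of large-scale, high-quality, openly accessible datasets built from real images. Pico-Banana-400K aims to fill this gap.
Contribution: Introduces the Pico-Banana-400K dataset, which is not only large but also ensures the diversity and quality of its edit pairs through a fine-grained image editing taxonomy and quality scoring. It also provides three specialized subsets that support multi-task research.
Method: The dataset uses Nano-Banana to generate diverse edit pairs from OpenImages, with MLLM-based quality scoring and a fine-grained editing taxonomy to ensure quality and diversity.
Result: Pico-Banana-400K provides a solid foundation for training and benchmarking the next generation of text-guided image editing models and supports research on complex editing scenarios.
Insight: Large-scale, high-quality datasets are key to advancing text-guided image editing research, and fine-grained quality control and diverse editing scenarios can further improve model adaptability.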
Abstract: Recent advances in multimodal models have demonstrated remarkable text-guided image editing capabilities, with systems like GPT-4o and Nano-Banana setting new benchmarks. However, the research community’s progress remains constrained by the absence of large-scale, high-quality, and openly accessible datasets built from real images. We introduce Pico-Banana-400K, a comprehensive 400K-image dataset for instruction-based image editing. Our dataset is constructed by leveraging Nano-Banana to generate diverse edit pairs from real photographs in the OpenImages collection. What distinguishes Pico-Banana-400K from previous synthetic datasets is our systematic approach to quality and diversity. We employ a fine-grained image editing taxonomy to ensure comprehensive coverage of edit types while maintaining precise content preservation and instruction faithfulness through MLLM-based quality scoring and careful curation. Beyond single turn editing, Pico-Banana-400K enables research into complex editing scenarios. The dataset includes three specialized subsets: (1) a 72K-example multi-turn collection for studying sequential editing, reasoning, and planning across consecutive modifications; (2) a 56K-example preference subset for alignment research and reward model training; and (3) paired long-short editing instructions for developing instruction rewriting and summarization capabilities. By providing this large-scale, high-quality, and task-rich resource, Pico-Banana-400K establishes a robust foundation for training and benchmarking the next generation of text-guided image editing models.
[99] olmOCR 2: Unit Test Rewards for Document OCR
Jake Poznanski,Luca Soldaini,Kyle Lo
Main category: cs.CV
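TL;DR: olmOCR 2 is an OCR system built on a 7B vision language model (VLM) trained with reinforcement learning and unit-test rewards. It delivers strong document OCR performance, with the largest gains in math formula conversion, table parsing, and multi-column layouts.
Details
Motivation: Existing OCR systems hit performance bottlenecks on complex document layouts such as math formulas, tables, and multi-column text, and need more efficient, verifiable training methods.
Contribution: Presents olmOCR 2, a 7B VLM trained with reinforcement learning with verifiable rewards (RLVR), together with a synthetic data generation pipeline; the combination significantly improves OCR performance.
Method: Reinforcement learning with a diverse set of binary unit tests as rewards (RLVR), with test cases scaled up through a synthetic document generation pipeline.
Result: State-of-the-art performance on the olmOCR-Bench benchmark, with especially large improvements in math formula conversion, table parsing, and multi-column layouts.
Insight: Synthetic data and unit-test-driven reinforcement learning can effectively improve OCR on complex documents, and the openly released model and tools promote sharing of the technique.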
Abstract: We present olmOCR 2, the latest in our family of powerful OCR systems for converting digitized print documents, like PDFs, into clean, naturally ordered plain text. olmOCR 2 is powered by olmOCR-2-7B-1025, a specialized, 7B vision language model (VLM) trained using reinforcement learning with verifiable rewards (RLVR), where our rewards are a diverse set of binary unit tests. To scale unit test creation, we develop a pipeline for generating synthetic documents with diverse and challenging layouts, known ground-truth HTML source code, and extracted test cases. We show that RL training on these test cases results in state-of-the-art performance on olmOCR-Bench, our English-language OCR benchmark, with the largest improvements in math formula conversion, table parsing, and multi-column layouts compared to previous versions. We release our model, data and code under permissive open licenses.
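The reward described in this abstract, a set of binary unit tests applied to the OCR output, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the test helpers `contains_text` and `appears_before`, the sample strings, and the choice to average pass/fail results into one scalar are all hypothetical.

```python
from typing import Callable, List

# A unit test maps the OCR output text to pass (True) or fail (False).
UnitTest = Callable[[str], bool]


def contains_text(snippet: str) -> UnitTest:
    """Hypothetical test: the snippet must appear somewhere in the output."""
    return lambda output: snippet in output


def appears_before(first: str, second: str) -> UnitTest:
    """Hypothetical test: `first` must precede `second` (reading order)."""
    def test(output: str) -> bool:
        i, j = output.find(first), output.find(second)
        return i != -1 and j != -1 and i < j
    return test


def unit_test_reward(output: str, tests: List[UnitTest]) -> float:
    """Fraction of binary tests passed (the averaging is an assumption)."""
    if not tests:
        return 0.0
    return sum(1 for t in tests if t(output)) / len(tests)


page_tests = [
    contains_text("Introduction"),
    contains_text("E = mc^2"),
    appears_before("Introduction", "References"),
]
good_output = "Introduction\nE = mc^2\nReferences"
bad_output = "References\nIntroduction"

print(unit_test_reward(good_output, page_tests))
print(unit_test_reward(bad_output, page_tests))
```

Because each test is a plain pass/fail predicate, the reward can be checked mechanically against a known ground-truth source, which is the property that makes it "verifiable".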
[100] Is This Tracker On? A Benchmark Protocol for Dynamic Tracking
Ilona Demler,Saumya Chauhan,Georgia Gkioxari
Main category: cs.CV
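TL;DR: The paper introduces ITTO, a new benchmark suite for evaluating and diagnosing the capabilities and limitations of point tracking methods. Its real-world scenes, with complex motion and occlusion patterns, expose the shortcomings of existing trackers.
Details
Motivation: Current benchmarks lack the motion complexity and occlusion patterns of real-world scenes, which limits the practical applicability of tracking algorithms. ITTO aims to fill this gap and drive the development of more robust trackers.
Contribution: 1. Introduces the ITTO benchmark suite, with high-quality annotations of real-world scenes; 2. Provides a detailed analysis of existing tracking methods, revealing their limitations under complex motion and occlusion.
Method: Videos are sourced from existing datasets and egocentric real-world recordings and annotated through a multi-stage pipeline to build ITTO. Several state-of-the-art tracking methods are then analyzed on the benchmark.
Result: Existing trackers perform poorly under complex motion and occlusion, and are particularly weak at re-identifying points after occlusion.
Insight: Existing tracking methods fall well short in dynamic real-world scenes; new modeling approaches tailored to real-world dynamics are needed.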
Abstract: We introduce ITTO, a challenging new benchmark suite for evaluating and diagnosing the capabilities and limitations of point tracking methods. Our videos are sourced from existing datasets and egocentric real-world recordings, with high-quality human annotations collected through a multi-stage pipeline. ITTO captures the motion complexity, occlusion patterns, and object diversity characteristic of real-world scenes – factors that are largely absent in current benchmarks. We conduct a rigorous analysis of state-of-the-art tracking methods on ITTO, breaking down performance along key axes of motion complexity. Our findings reveal that existing trackers struggle with these challenges, particularly in re-identifying points after occlusion, highlighting critical failure modes. These results point to the need for new modeling approaches tailored to real-world dynamics. We envision ITTO as a foundation testbed for advancing point tracking and guiding the development of more robust tracking algorithms.
cs.RO [Back]
[101] $\nabla$-SDF: Learning Euclidean Signed Distance Functions Online with Gradient-Augmented Octree Interpolation and Neural Residual
Zhirui Dai,Qihao Qian,Tianxing Fan,Nikolay Atanasov
Main category: cs.RO
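TL;DR: The paper proposes $\nabla$-SDF, a hybrid method that combines gradient-augmented octree interpolation with a neural residual to learn Euclidean signed distance functions (SDFs) online. It outperforms prior work in computational efficiency, memory footprint, and accuracy.
Details
Motivation: Existing SDF estimation methods are limited in online, large-scale reconstruction: methods based on discrete volumetric structures compromise the continuity and differentiability of the SDF, while neural network methods are less efficient and suffer from catastrophic forgetting and memory limits. A solution that is both efficient and accurate is needed.
Contribution: Proposes $\nabla$-SDF, a hybrid method that combines explicit gradient-augmented octree interpolation with an implicit neural residual, achieving non-truncated (Euclidean) SDF reconstruction that outperforms prior work in computational efficiency and accuracy.
Method: $\nabla$-SDF uses gradient-augmented octree interpolation as an explicit prior and adds implicit features through a neural residual, balancing efficiency and accuracy.
Result: Experiments show that $\nabla$-SDF surpasses the state of the art in accuracy and efficiency, providing a scalable solution for downstream tasks in robotics and computer vision.
Insight: By combining the strengths of explicit and implicit methods, $\nabla$-SDF shows promise for large-scale, non-truncated SDF reconstruction.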
Abstract: Estimation of signed distance functions (SDFs) from point cloud data has been shown to benefit many robot autonomy capabilities, including localization, mapping, motion planning, and control. Methods that support online and large-scale SDF reconstruction tend to rely on discrete volumetric data structures, which affect the continuity and differentiability of the SDF estimates. Recently, using implicit features, neural network methods have demonstrated high-fidelity and differentiable SDF reconstruction but they tend to be less efficient, can experience catastrophic forgetting and memory limitations in large environments, and are often restricted to truncated SDFs. This work proposes $\nabla$-SDF, a hybrid method that combines an explicit prior obtained from gradient-augmented octree interpolation with an implicit neural residual. Our method achieves non-truncated (Euclidean) SDF reconstruction with computational and memory efficiency comparable to volumetric methods and differentiability and accuracy comparable to neural network methods. Extensive experiments demonstrate that $\nabla$-SDF outperforms the state of the art in terms of accuracy and efficiency, providing a scalable solution for downstream tasks in robotics and computer vision.
[102] GigaBrain-0: A World Model-Powered Vision-Language-Action Model
GigaBrain Team,Angen Ye,Boyuan Wang,Chaojun Ni,Guan Huang,Guosheng Zhao,Haoyun Li,Jie Li,Jiagang Zhu,Lv Feng,Peng Li,Qiuping Deng,Runqi Ouyang,Wenkang Qin,Xinze Chen,Xiaofeng Wang,Yang Wang,Yifan Li,Yilong Li,Yiran Ding,Yuan Xu,Yun Ye,Yukun Zhou,Zhehao Dong,Zhenan Wang,Zhichao Liu,Zheng Zhu
Main category: cs.RO
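TL;DR: GigaBrain-0 is a world model-powered vision-language-action (VLA) model. It reduces reliance on real robot data by generating diverse data, improves policy robustness through RGBD input modeling and CoT supervision, and substantially improves cross-task generalization.
Details
Motivation: Training VLA models for generalist robots typically requires large-scale real-world robot data, which is expensive and time-consuming to collect and limits the scalability and generalization of these models.
Contribution: Proposes GigaBrain-0, which uses world models to generate diverse data and so reduces reliance on real data, and improves performance on complex tasks through RGBD input and CoT supervision.
Method: Combines world model-generated data (e.g., video generation and sim2real transfer) with RGBD input modeling and CoT supervision to strengthen the model's reasoning about spatial geometry.
Result: Shows strong generalization across a range of tasks, with notable gains under variations in texture, color, object placement, and camera viewpoint.
Insight: Generated data can effectively compensate for scarce real data, and combining multimodal input with a reasoning mechanism is key to improving VLA model performance.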
Abstract: Training Vision-Language-Action (VLA) models for generalist robots typically requires large-scale real-world robot data, which is expensive and time-consuming to collect. The inefficiency of physical data collection severely limits the scalability and generalization capacity of current VLA systems. To address this challenge, we introduce GigaBrain-0, a novel VLA foundation model empowered by world model-generated data (e.g., video generation, real2real transfer, human transfer, view transfer, sim2real transfer data). By leveraging world models to generate diverse data at scale, GigaBrain-0 significantly reduces reliance on real robot data while improving cross-task generalization. Our approach further improves policy robustness through RGBD input modeling and embodied Chain-of-Thought (CoT) supervision, enabling the model to reason about spatial geometry, object states, and long-horizon dependencies during task execution. This leads to substantial gains in real-world performance on dexterous, long-horizon, and mobile manipulation tasks. Extensive experiments demonstrate that GigaBrain-0 achieves superior generalization across variations in appearances (e.g., textures, colors), object placements, and camera viewpoints. Additionally, we present GigaBrain-0-Small, an optimized lightweight variant designed to run efficiently on devices such as the NVIDIA Jetson AGX Orin.
cs.CR [Back]
[103] From See to Shield: ML-Assisted Fine-Grained Access Control for Visual Data
Mete Harun Akcay,Buse Gul Atli,Siddharth Prakash Rao,Alexandros Bakas
Main category: cs.CR
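TL;DR: The paper proposes an ML-assisted fine-grained access control system that protects sensitive information in visual data. It combines automated detection, encryption, and policy management modules, and demonstrates efficiency and scalability.
Details
Motivation: As the volume of stored data grows, identifying and protecting sensitive information in large repositories becomes a challenge, especially when the data is shared with users in different roles. A solution that selectively protects sensitive regions is needed.
Contribution: Proposes a system architecture for policy-driven access control that integrates automated detection of sensitive regions, post-correction, key management, and access control modules, and uses a hybrid encryption scheme to improve efficiency and security.
Method: The system combines symmetric encryption (for efficiency) with Attribute-Based Encryption (for policy enforcement), and supports efficient key distribution and isolated key storage. It is evaluated on visual datasets.
Result: The system detects privacy-sensitive objects effectively, improving macro-averaged F1 by 5% and mean Average Precision by 10%, with an average policy-enforced decryption time of under 1 second per image.
Insight: A hybrid encryption scheme and a modular design can balance efficiency and security, making the approach suitable for fine-grained access control over large-scale visual data.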
Abstract: As the volume of stored data continues to grow, identifying and protecting sensitive information within large repositories becomes increasingly challenging, especially when shared with multiple users with different roles and permissions. This work presents a system architecture for trusted data sharing with policy-driven access control, enabling selective protection of sensitive regions while maintaining scalability. The proposed architecture integrates four core modules that combine automated detection of sensitive regions, post-correction, key management, and access control. Sensitive regions are secured using a hybrid scheme that employs symmetric encryption for efficiency and Attribute-Based Encryption for policy enforcement. The system supports efficient key distribution and isolates key storage to strengthen overall security. To demonstrate its applicability, we evaluate the system on visual datasets, where Privacy-Sensitive Objects in images are automatically detected, reassessed, and selectively encrypted prior to sharing in a data repository. Experimental results show that our system provides effective PSO detection, increases macro-averaged F1 score (5%) and mean Average Precision (10%), and maintains an average policy-enforced decryption time of less than 1 second per image. These results demonstrate the effectiveness, efficiency and scalability of our proposed solution for fine-grained access control.
eess.IV [Back]
[104] Automated Morphological Analysis of Neurons in Fluorescence Microscopy Using YOLOv8
Banan Alnemri,Arwa Basbrain
Main category: eess.IV
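TL;DR: The paper presents a YOLOv8-based automated pipeline for segmenting neurons and analyzing their morphology in fluorescence microscopy images. Segmentation accuracy exceeds 97%, substantially reducing the need for manual annotation.
Details
Motivation: Accurate segmentation and measurement of neuron morphology is central to neuroscience and biomedical imaging, but traditional methods rely on manual work that is time-consuming and subjective, so an automated solution is needed.
Contribution: 1) Develops a YOLOv8-based pipeline for neuron instance segmentation and morphological measurement; 2) achieves segmentation accuracy above 97% on a high-resolution fluorescence microscopy dataset; 3) extracts several biologically meaningful morphological features, with an overall measurement accuracy of 75.32%.
Method: 1) Trains a YOLOv8 model on manually annotated microscopy images; 2) uses both ground-truth and predicted masks to extract features such as cell length, width, area, and grayscale intensity; 3) builds an end-to-end automated analysis framework.
Result: Segmentation accuracy exceeds 97% and the overall accuracy of the morphological measurements is 75.32%, supporting the reliability and effectiveness of the method for neuron morphology analysis.
Insight: YOLOv8 performs well on biomedical image segmentation, and an automated pipeline can substantially improve research efficiency, offering a scalable tool for neuroscience and cell imaging.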
Abstract: Accurate segmentation and precise morphological analysis of neuronal cells in fluorescence microscopy images are crucial steps in neuroscience and biomedical imaging applications. However, this process is labor-intensive and time-consuming, requiring significant manual effort and expertise to ensure reliable outcomes. This work presents a pipeline for neuron instance segmentation and measurement based on a high-resolution dataset of stem-cell-derived neurons. The proposed method uses YOLOv8, trained on manually annotated microscopy images. The model achieved high segmentation accuracy, exceeding 97%. In addition, the pipeline utilized both ground truth and predicted masks to extract biologically significant features, including cell length, width, area, and grayscale intensity values. The overall accuracy of the extracted morphological measurements reached 75.32%, further supporting the effectiveness of the proposed approach. This integrated framework offers a valuable tool for automated analysis in cell imaging and neuroscience research, reducing the need for manual annotation and enabling scalable, precise quantification of neuron morphology.
cs.SE [Back]
[105] Human-Agent Collaborative Paper-to-Page Crafting for Under $0.1
Qianli Ma,Siyu Wang,Yilin Chen,Yinhao Tang,Yixiang Yang,Chang Guo,Bingjie Gao,Zhening Xing,Yanan Sun,Zhipeng Zhang
Main category: cs.SE
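TL;DR: The paper proposes AutoPage, a hierarchical multi-agent system that converts academic papers into dynamic webpages efficiently and at low cost. A layered collaborative pipeline with verification steps addresses the challenges of automated webpage generation.
Details
Motivation: Researchers face manual, repetitive work when building dynamic webpages to present their results. Existing automation tools cannot handle dynamic, interactive webpages.
Contribution: 1. Proposes the AutoPage system, which decomposes paper-to-page creation into a hierarchical pipeline from narrative planning to multimodal content generation. 2. Introduces "Checker" agents that verify content accuracy and reduce AI hallucination. 3. Constructs PageBench, the first benchmark for this task.
Method: A multi-agent collaborative framework decomposes the task into a coarse-to-fine pipeline of narrative planning, content generation, and interactive rendering, with verification steps that keep the content consistent with the source paper.
Result: AutoPage generates high-quality, visually appealing webpages in under 15 minutes for less than $0.1.
Insight: The challenges of webpage generation can be addressed through hierarchical collaboration and verification, with the system designed as a collaborative assistant for humans and AI rather than a single-command tool.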
Abstract: In the quest for scientific progress, communicating research is as vital as the discovery itself. Yet, researchers are often sidetracked by the manual, repetitive chore of building project webpages to make their dense papers accessible. While automation has tackled static slides and posters, the dynamic, interactive nature of webpages has remained an unaddressed challenge. To bridge this gap, we reframe the problem, arguing that the solution lies not in a single command, but in a collaborative, hierarchical process. We introduce AutoPage, a novel multi-agent system that embodies this philosophy. AutoPage deconstructs paper-to-page creation into a coarse-to-fine pipeline from narrative planning to multimodal content generation and interactive rendering. To combat AI hallucination, dedicated “Checker” agents verify each step against the source paper, while optional human checkpoints ensure the final product aligns perfectly with the author’s vision, transforming the system from a mere tool into a powerful collaborative assistant. To rigorously validate our approach, we also construct PageBench, the first benchmark for this new task. Experiments show AutoPage not only generates high-quality, visually appealing pages but does so with remarkable efficiency in under 15 minutes for less than $0.1. Code and dataset will be released at https://mqleet.github.io/AutoPage_ProjectPage/.
cs.LG [Back]
[106] BAPO: Stabilizing Off-Policy Reinforcement Learning for LLMs via Balanced Policy Optimization with Adaptive Clipping
Zhiheng Xi,Xin Guo,Yang Nan,Enyu Zhou,Junrui Shen,Wenxiang Chen,Jiaqi Liu,Jixuan Huang,Zhihao Zhang,Honglin Guo,Xun Deng,Zhikai Lei,Miao Zheng,Guoteng Wang,Shuo Zhang,Peng Sun,Rui Zheng,Hang Yan,Tao Gui,Qi Zhang,Xuanjing Huang
Main category: cs.LG
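TL;DR: BAPO combines balanced policy optimization with adaptive clipping to address the sharp decline in policy entropy and the unstable optimization seen in off-policy reinforcement learning, improving training efficiency and model performance.
Details
Motivation: Off-policy reinforcement learning (RL) improves sample efficiency when training large language models (LLMs), but policy entropy declines quickly and optimization becomes unstable or even collapses. BAPO aims to address these problems.
Contribution: (1) Identifies an optimization imbalance in which negative-advantage samples dominate the policy gradient; (2) proposes an adaptive clipping rule that dynamically adjusts the clipping bounds to re-balance optimization and keep policy entropy stable.
Method: BAPO dynamically adjusts the clipping bounds to adaptively balance the contributions of positive and negative samples while preserving entropy-increasing updates, which stabilizes RL optimization.
Result: On the AIME 2024 and AIME 2025 benchmarks, the 7B BAPO model surpasses open-source counterparts, and the 32B model achieves state-of-the-art results at its scale while outperforming leading proprietary systems, with fast and stable training.
Insight: The fixed clipping mechanism in PPO-like objectives systematically blocks entropy-increasing updates and drives the policy toward over-exploitation. BAPO addresses this with adaptive clipping.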
Abstract: Reinforcement learning (RL) has recently become the core paradigm for aligning and strengthening large language models (LLMs). Yet, applying RL in off-policy settings–where stale data from past policies are used for training–improves sample efficiency, but remains challenging: policy entropy declines sharply, optimization often becomes unstable and may even collapse. Through theoretical and empirical analysis, we identify two key insights: (i) an imbalance in optimization, where negative-advantage samples dominate the policy gradient, suppressing useful behaviors and risking gradient explosions; and (ii) the derived Entropy-Clip Rule, which reveals that the fixed clipping mechanism in PPO-like objectives systematically blocks entropy-increasing updates, thereby driving the policy toward over-exploitation at the expense of exploration. Building on these insights, we propose BAlanced Policy Optimization with Adaptive Clipping (BAPO), a simple yet effective method that dynamically adjusts clipping bounds to adaptively re-balance positive and negative contributions, preserve entropy, and stabilize RL optimization. Across diverse off-policy scenarios–including sample replay and partial rollout–BAPO achieves fast, stable, and data-efficient training. On AIME 2024 and AIME 2025 benchmarks, our 7B BAPO model surpasses open-source counterparts such as SkyWork-OR1-7B, while our 32B BAPO model not only achieves state-of-the-art results among models of the same scale but also outperforms leading proprietary systems like o3-mini and Gemini-2.5-Flash-Thinking.
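The idea of re-balancing positive and negative contributions by moving the clipping bounds can be sketched on a toy batch. This is only an illustration under stated assumptions: the function names, the "share of active mass" measure, the step size, and the rule of widening only the upper bound are hypothetical stand-ins, not the update rule from the paper. The clipped surrogate itself is the standard PPO form.

```python
def clipped_surrogate(ratio, adv, c_low, c_high):
    """PPO-style clipped objective for one sample."""
    clipped = min(max(ratio, c_low), c_high)
    return min(ratio * adv, clipped * adv)


def is_active(ratio, adv, c_low, c_high):
    """True if the sample still receives a gradient, i.e. is not clipped."""
    if adv > 0:
        return ratio <= c_high
    return ratio >= c_low


def positive_share(ratios, advs, c_low, c_high):
    """Share of active |ratio * advantage| mass from positive-advantage samples."""
    pos = neg = 0.0
    for r, a in zip(ratios, advs):
        if not is_active(r, a, c_low, c_high):
            continue
        if a > 0:
            pos += r * a
        else:
            neg += r * -a
    total = pos + neg
    return pos / total if total > 0 else 0.0


def adapt_upper_bound(ratios, advs, c_low=0.8, c_high=1.2,
                      target=0.3, step=0.05, c_high_max=2.0):
    """Hypothetical rule: widen c_high until positives reach `target` share."""
    while (positive_share(ratios, advs, c_low, c_high) < target
           and c_high < c_high_max):
        c_high = round(c_high + step, 10)
    return c_low, c_high


# Toy batch: one positive-advantage sample whose ratio exceeds the fixed bound.
ratios = [1.5, 0.9, 1.1, 1.0]
advs = [1.0, -1.0, -1.0, -1.0]
print(positive_share(ratios, advs, 0.8, 1.2))
print(adapt_upper_bound(ratios, advs))
```

With fixed bounds of 0.8 and 1.2, the only positive-advantage sample is clipped and the gradient comes entirely from negative-advantage samples; widening the upper bound lets that sample contribute again.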
[107] NeuroAda: Activating Each Neuron’s Potential for Parameter-Efficient Fine-Tuning
Zhi Zhang,Yixian Shen,Congfeng Cao,Ekaterina Shutova
Main category: cs.LG
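TL;DR: NeuroAda is a new parameter-efficient fine-tuning method that selects important parameters and adds bypass connections for them, enabling fine-grained adaptation while keeping memory efficiency high. It performs strongly on 23+ tasks.
Details
Motivation: Existing parameter-efficient fine-tuning methods trade off expressiveness against memory efficiency. NeuroAda aims to reconcile the two.
Contribution: Proposes NeuroAda, which combines selective parameter adaptation with bypass connections to reach high fine-tuning performance with very few trainable parameters and markedly lower memory consumption.
Method: First identifies the important parameters in the network, then introduces bypass connections for them. During fine-tuning only the bypass connections are updated and the original parameters stay frozen.
Result: State-of-the-art performance on 23+ tasks with as little as ≤0.02% trainable parameters, and up to 60% lower CUDA memory usage.
Insight: Combining selective parameter adaptation with bypass connections balances performance against resource consumption and offers a new direction for parameter-efficient fine-tuning.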
Abstract: Existing parameter-efficient fine-tuning (PEFT) methods primarily fall into two categories: addition-based and selective in-situ adaptation. The former, such as LoRA, introduce additional modules to adapt the model to downstream tasks, offering strong memory efficiency. However, their representational capacity is often limited, making them less suitable for fine-grained adaptation. In contrast, the latter directly fine-tunes a carefully chosen subset of the original model parameters, allowing for more precise and effective adaptation, but at the cost of significantly increased memory consumption. To reconcile this trade-off, we propose NeuroAda, a novel PEFT method that enables fine-grained model finetuning while maintaining high memory efficiency. Our approach first identifies important parameters (i.e., connections within the network) as in selective adaptation, and then introduces bypass connections for these selected parameters. During finetuning, only the bypass connections are updated, leaving the original model parameters frozen. Empirical results on 23+ tasks spanning both natural language generation and understanding demonstrate that NeuroAda achieves state-of-the-art performance with as little as $\leq 0.02\%$ trainable parameters, while reducing CUDA memory usage by up to 60%. We release our code here: https://github.com/FightingFighting/NeuroAda.git.
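The two steps in this abstract, selecting important connections and then training only a bypass for them, can be sketched for a single linear layer. This is a minimal sketch, not the released code: the class name `BypassLinear`, the magnitude-based top-k selection per neuron, and the zero-initialized additive deltas are assumptions made for illustration.

```python
def select_top_k(weights, k):
    """For each neuron (row), indices of the k largest-magnitude weights.

    Magnitude as the importance criterion is an assumption of this sketch.
    """
    return [
        sorted(range(len(row)), key=lambda j: abs(row[j]), reverse=True)[:k]
        for row in weights
    ]


class BypassLinear:
    """Frozen weight matrix plus a small trainable delta per selected connection."""

    def __init__(self, weights, k):
        self.weights = weights                      # frozen original parameters
        self.index = select_top_k(weights, k)       # selected connections
        self.delta = [[0.0] * k for _ in weights]   # trainable bypass values

    def forward(self, x):
        out = []
        for row, idx, d in zip(self.weights, self.index, self.delta):
            y = sum(w * xi for w, xi in zip(row, x))       # frozen path
            y += sum(dj * x[j] for j, dj in zip(idx, d))   # bypass path
            out.append(y)
        return out

    def trainable_fraction(self):
        total = sum(len(row) for row in self.weights)
        return sum(len(d) for d in self.delta) / total


layer = BypassLinear([[0.1, -2.0, 0.5], [1.0, 0.2, -0.3]], k=1)
print(layer.index)
print(layer.trainable_fraction())
```

Only `delta` would receive gradient updates in training, so optimizer state scales with the number of selected connections rather than with the full weight matrix.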
[108] FrogDeepSDM: Improving Frog Counting and Occurrence Prediction Using Multimodal Data and Pseudo-Absence Imputation
Chirag Padubidri,Pranesh Velmurugan,Andreas Lanitis,Andreas Kamilaris
Main category: cs.LG
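TL;DR: The paper uses deep learning and data imputation to improve the accuracy of species distribution models for frogs. Data balancing sharply reduces counting error, the multimodal ensemble outperforms single-modality models, and fusing image and tabular data raises classification accuracy to 84.9%.
Details
Motivation: Traditional species distribution monitoring has incomplete coverage, and sparse or missing data limits model performance. Deep learning and data preprocessing are used to compensate and improve the predictive accuracy of ecological models.
Contribution: 1. Proposes data balancing and imputation methods that substantially improve frog distribution prediction; 2. shows that a multimodal ensemble model (combining image and tabular data) outperforms single models; 3. uses feature selection to optimize the environmental variable inputs.
Method: Applies deep learning and data imputation, integrates multimodal environmental data such as land cover and NDVI, generates pseudo-absence data, and balances the dataset.
Result: MAE drops from 189 to 29; the multimodal model reaches 84.9% classification accuracy with an AUC of 0.90 and generalizes well to unseen regions.
Insight: Multimodal learning and data preprocessing are essential for ecological modeling with sparse or incomplete data, and provide a more precise and scalable approach to biodiversity monitoring.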
Abstract: Monitoring species distribution is vital for conservation efforts, enabling the assessment of environmental impacts and the development of effective preservation strategies. Traditional data collection methods, including citizen science, offer valuable insights but remain limited in coverage and completeness. Species Distribution Modelling (SDM) helps address these gaps by using occurrence data and environmental variables to predict species presence across large regions. In this study, we enhance SDM accuracy for frogs (Anura) by applying deep learning and data imputation techniques using data from the “EY - 2022 Biodiversity Challenge.” Our experiments show that data balancing significantly improved model performance, reducing the Mean Absolute Error (MAE) from 189 to 29 in frog counting tasks. Feature selection identified key environmental factors influencing occurrence, optimizing inputs while maintaining predictive accuracy. The multimodal ensemble model, integrating land cover, NDVI, and other environmental inputs, outperformed individual models and showed robust generalization across unseen regions. The fusion of image and tabular data improved both frog counting and habitat classification, achieving 84.9% accuracy with an AUC of 0.90. This study highlights the potential of multimodal learning and data preprocessing techniques such as balancing and imputation to improve predictive ecological modeling when data are sparse or incomplete, contributing to more precise and scalable biodiversity monitoring.
[109] Every Attention Matters: An Efficient Hybrid Architecture for Long-Context Reasoning
Ling Team,Bin Han,Caizhi Tang,Chen Liang,Donghao Zhang,Fan Yuan,Feng Zhu,Jie Gao,Jingyu Hu,Longfei Li,Meng Li,Mingyang Zhang,Peijie Jiang,Peng Jiao,Qian Zhao,Qingyuan Yang,Wenbo Shen,Xinxing Yang,Yalin Zhang,Yankun Ren,Yao Zhao,Yibo Cao,Yixuan Sun,Yue Zhang,Yuchen Fang,Zibin Lin,Zixuan Cheng,Jun Zhou
Main category: cs.LG
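TL;DR: The paper presents the Ring-linear model series, which uses a hybrid architecture of linear attention and softmax attention to cut the computational cost of long-context inference, and tunes the ratio between the two for efficient training and inference.
Details
Motivation: To address the high compute and I/O overhead of long-context inference with an efficient hybrid attention architecture.
Contribution: 1. Presents the Ring-linear model series; 2. substantially lowers inference cost through the hybrid architecture; 3. identifies the best ratio found so far between the attention mechanisms; 4. develops a high-performance FP8 operator library that improves training efficiency.
Method: A hybrid architecture combining linear attention and softmax attention, with systematic exploration of the attention ratio and a self-developed FP8 operator library for performance.
Result: Inference cost falls to 1/10 of that of a 32B dense model, training efficiency improves by 50%, and SOTA performance is maintained across multiple challenging complex reasoning benchmarks.
Insight: Hybrid attention architectures are efficient and low-cost for long-context tasks, and operator-library optimization is important for both training and inference performance.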
Abstract: In this technical report, we present the Ring-linear model series, specifically including Ring-mini-linear-2.0 and Ring-flash-linear-2.0. Ring-mini-linear-2.0 comprises 16B parameters and 957M activations, while Ring-flash-linear-2.0 contains 104B parameters and 6.1B activations. Both models adopt a hybrid architecture that effectively integrates linear attention and softmax attention, significantly reducing I/O and computational overhead in long-context inference scenarios. Compared to a 32 billion parameter dense model, this series reduces inference cost to 1/10, and compared to the original Ring series, the cost is also reduced by over 50%. Furthermore, through systematic exploration of the ratio between different attention mechanisms in the hybrid architecture, we have identified the currently optimal model structure. Additionally, by leveraging our self-developed high-performance FP8 operator library, linghe, overall training efficiency has been improved by 50%. Benefiting from the high alignment between the training and inference engine operators, the models can undergo long-term, stable, and highly efficient optimization during the reinforcement learning phase, consistently maintaining SOTA performance across multiple challenging complex reasoning benchmarks.
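To see why the ratio between linear and softmax attention layers matters for long contexts, a rough cost model helps. The sketch below is a back-of-the-envelope illustration only: the interleaving pattern, the function names, and the per-layer cost estimates (about n²·d for softmax attention and n·d² for linear attention, with sequence length n and head dimension d) are assumptions, and the abstract does not state the ratio the models actually use.

```python
def layer_schedule(num_layers, linear_per_softmax):
    """Assumed pattern: `linear_per_softmax` linear layers, then one softmax layer."""
    group = linear_per_softmax + 1
    return [
        "softmax" if (i + 1) % group == 0 else "linear"
        for i in range(num_layers)
    ]


def attention_cost(schedule, seq_len, dim):
    """Rough multiply-accumulate count for the attention mixing step only."""
    cost = 0
    for kind in schedule:
        if kind == "softmax":
            cost += seq_len * seq_len * dim   # quadratic in sequence length
        else:
            cost += seq_len * dim * dim       # linear in sequence length
    return cost


hybrid = layer_schedule(num_layers=8, linear_per_softmax=3)
dense = ["softmax"] * 8
print(hybrid)
print(attention_cost(hybrid, 32768, 128) / attention_cost(dense, 32768, 128))
```

Under this model the advantage of the hybrid schedule grows with sequence length, since the linear layers stay proportional to n while the softmax layers grow with n².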
[110] A Survey on Cache Methods in Diffusion Models: Toward Efficient Multi-Modal Generation
Jiacheng Liu,Xinyu Wang,Yuqi Lin,Zhikai Wang,Peiru Wang,Peiliang Cai,Qinming Zhou,Zhengan Yan,Zexuan Yan,Zhengyi Shi,Chang Zou,Yue Ma,Linfeng Zhang
Main category: cs.LG
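TL;DR: The paper systematically surveys cache methods in diffusion models and presents Diffusion Caching as a training-free, architecture-agnostic, efficient inference paradigm that cuts computation by reusing redundant computation across the diffusion process.
Details
Motivation: Diffusion models are central to generative AI because of their generation quality and controllability, but their multi-step iterations and complex backbone networks cause heavy computational overhead and latency, limiting real-time use. Existing acceleration techniques suffer from limited applicability, high training costs, or quality degradation.
Contribution: Presents the Diffusion Caching paradigm, which reduces computation through feature-level cross-step reuse and inter-layer scheduling without modifying model parameters. It also reviews the theory and evolution of cache methods and proposes a classification framework.
Method: The core mechanism identifies and reuses computational redundancy in the diffusion process, covering both static reuse and dynamic prediction. It can be combined with other acceleration techniques such as sampling optimization and model distillation.
Result: Diffusion Caching substantially reduces computational overhead, applies to diverse tasks, and offers an efficient inference framework for multimodal and interactive applications.
Insight: Cache methods have evolved from static reuse to dynamic prediction, gaining flexibility and generality, which points toward real-time, efficient generative AI.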
Abstract: Diffusion Models have become a cornerstone of modern generative AI for their exceptional generation quality and controllability. However, their inherent multi-step iterations and complex backbone networks lead to prohibitive computational overhead and generation latency, forming a major bottleneck for real-time applications. Although existing acceleration techniques have made progress, they still face challenges such as limited applicability, high training costs, or quality degradation. Against this backdrop, Diffusion Caching offers a promising training-free, architecture-agnostic, and efficient inference paradigm. Its core mechanism identifies and reuses intrinsic computational redundancies in the diffusion process. By enabling feature-level cross-step reuse and inter-layer scheduling, it reduces computation without modifying model parameters. This paper systematically reviews the theoretical foundations and evolution of Diffusion Caching and proposes a unified framework for its classification and analysis. Through comparative analysis of representative methods, we show that Diffusion Caching evolves from static reuse to dynamic prediction. This trend enhances caching flexibility across diverse tasks and enables integration with other acceleration techniques such as sampling optimization and model distillation, paving the way for a unified, efficient inference framework for future multimodal and interactive applications. We argue that this paradigm will become a key enabler of real-time and efficient generative AI, injecting new vitality into both theory and practice of Efficient Generative Intelligence.
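The simplest form of cross-step reuse, which the abstract calls static reuse, can be sketched as recomputing an expensive feature only every few denoising steps and returning the cached value in between. This is a toy sketch: `run_with_cache`, the fixed `interval`, and the stand-in `expensive_block` are hypothetical and do not correspond to any specific method covered by the survey.

```python
def run_with_cache(num_steps, interval, expensive_block):
    """Recompute the feature every `interval` steps and reuse it otherwise.

    Returns the per-step features and the number of real computations.
    """
    cache = None
    calls = 0
    outputs = []
    for step in range(num_steps):
        if cache is None or step % interval == 0:
            cache = expensive_block(step)   # full computation at this step
            calls += 1
        outputs.append(cache)               # otherwise reuse the cached feature
    return outputs, calls


# Stand-in for a costly backbone block; a real one would return a feature tensor.
features, num_calls = run_with_cache(10, 3, lambda step: step * 10)
print(features)
print(num_calls)
```

Dynamic-prediction variants replace the plain reuse step with an estimate of how the feature changes between steps, trading a little extra work for lower approximation error.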
[111] Blackbox Model Provenance via Palimpsestic Membership Inference
Rohith Kuditipudi,Jing Huang,Sally Zhu,Diyi Yang,Christopher Potts,Percy Liang
Main category: cs.LG
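TL;DR: The paper studies how to prove, by querying a blackbox model or by observing its text, that the model derives from a specific trained model. It proposes statistical tests based on the order of the training data and validates them on language models of different sizes.
Details
Motivation: The goal is blackbox model provenance: proving that a blackbox model was derived from a specific trained model. This matters for protecting model ownership and tracing accountability.
Contribution: (1) Formulates provenance as an independence testing problem; (2) proposes test statistics based on training order; (3) validates the approach in both the query and the observational settings.
Method: (1) In the query setting, prompting is used to estimate the likelihood the blackbox model assigns to the training examples, which is then correlated with the training order; (2) in the observational setting, provenance is judged either from overlap between the text and training examples or from the likelihood of the text under versions of the model retrained on reshuffled data.
Result: The query method achieves p-values on the order of at most 1e-8 in all but six cases. In the observational setting, the second approach can distinguish the model's text from as little as a few hundred tokens.
Insight: The ordering of a language model's training data is an effective signal for provenance, and it yields strong statistical evidence even from small amounts of text.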
Abstract: Suppose Alice trains an open-weight language model and Bob uses a blackbox derivative of Alice’s model to produce text. Can Alice prove that Bob is using her model, either by querying Bob’s derivative model (query setting) or from the text alone (observational setting)? We formulate this question as an independence testing problem–in which the null hypothesis is that Bob’s model or text is independent of Alice’s randomized training run–and investigate it through the lens of palimpsestic memorization in language models: models are more likely to memorize data seen later in training, so we can test whether Bob is using Alice’s model using test statistics that capture correlation between Bob’s model or text and the ordering of training examples in Alice’s training run. If Alice has randomly shuffled her training data, then any significant correlation amounts to exactly quantifiable statistical evidence against the null hypothesis, regardless of the composition of Alice’s training data. In the query setting, we directly estimate (via prompting) the likelihood Bob’s model gives to Alice’s training examples and order; we correlate the likelihoods of over 40 fine-tunes of various Pythia and OLMo base models ranging from 1B to 12B parameters with the base model’s training data order, achieving a p-value on the order of at most 1e-8 in all but six cases. In the observational setting, we try two approaches based on estimating 1) the likelihood of Bob’s text overlapping with spans of Alice’s training examples and 2) the likelihood of Bob’s text with respect to different versions of Alice’s model we obtain by repeating the last phase (e.g., 1%) of her training run on reshuffled data. The second approach can reliably distinguish Bob’s text from as little as a few hundred tokens; the first does not involve any retraining but requires many more tokens (several hundred thousand) to achieve high power.
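The query-setting test can be illustrated with a rank correlation between training order and the likelihood a suspect model assigns to each example. This is a sketch under stated assumptions: Spearman correlation with a one-sided permutation test is used here as a simple stand-in for the paper's test statistics, the log-likelihoods are synthetic, and ties between values are not handled.

```python
import random


def ranks(values):
    """Rank of each value (0 = smallest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order):
        result[i] = rank
    return result


def spearman(xs, ys):
    """Spearman rank correlation for tie-free data."""
    rx, ry = ranks(xs), ranks(ys)
    mean = (len(xs) - 1) / 2
    cov = sum((a - mean) * (b - mean) for a, b in zip(rx, ry))
    var = sum((a - mean) ** 2 for a in rx)
    return cov / var


def permutation_p_value(order, logliks, n_perm=999, seed=0):
    """One-sided p-value: is the correlation larger than under random order?"""
    rng = random.Random(seed)
    observed = spearman(order, logliks)
    shuffled = list(order)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if spearman(shuffled, logliks) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)


# Synthetic data: examples seen later in training get slightly higher likelihood.
noise = random.Random(1)
train_order = list(range(50))
logliks = [-5.0 + 0.02 * t + noise.gauss(0.0, 0.1) for t in train_order]
rho, p = permutation_p_value(train_order, logliks)
print(rho, p)
```

Because the null distribution comes from shuffling the order itself, the p-value does not depend on what the training data contains, which mirrors the abstract's point that a randomly shuffled training run gives exactly quantifiable evidence.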
cs.MA [Back]
[112] ColorAgent: Building A Robust, Personalized, and Interactive OS Agent
Ning Li,Qiqiang Lin,Zheng Wu,Xiaoyun Mo,Weiming Zhang,Yin Zhao,Xiangmou Qu,Jiamu Zhou,Jun Wang,Congmin Zheng,Yuanyi Song,Hongjiang Chen,Heyuan Huang,Jihong Wang,Jiaxin Yin,Jingwei Yu,Junwei Liao,Qiuying Peng,Xingyu Lou,Jun Wang,Weiwen Liu,Zhuosheng Zhang,Weinan Zhang
Main category: cs.MA
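TL;DR: ColorAgent is a personalized operating system (OS) agent that uses reinforcement learning and a multi-agent framework to sustain long-horizon, robust interaction with the environment, while also performing well at user intent recognition and proactive engagement.
Details
Motivation: With advances in hardware, software, and large language models, human-computer interaction is shifting from the command line to AI agents. Building an OS agent that executes user instructions and faithfully follows user needs is becoming feasible.
Contribution: Presents ColorAgent, an OS agent that supports long-horizon, robust interaction with the environment and offers personalized user intent recognition and proactive engagement.
Method: Strengthens the model through step-wise reinforcement learning and self-evolving training, and develops a tailored multi-agent framework that ensures generality, consistency, and robustness.
Result: Achieves success rates of 77.2% on AndroidWorld and 50.7% on AndroidLab, setting a new state of the art.
Insight: Current benchmarks are insufficient for a comprehensive evaluation of OS agents; future work should explore evaluation paradigms, agent collaboration, and security.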
Abstract: With the advancements in hardware, software, and large language model technologies, the interaction between humans and operating systems has evolved from the command-line interface to the rapidly emerging AI agent interactions. Building an operating system (OS) agent capable of executing user instructions and faithfully following user desires is becoming a reality. In this technical report, we present ColorAgent, an OS agent designed to engage in long-horizon, robust interactions with the environment while also enabling personalized and proactive user interaction. To enable long-horizon interactions with the environment, we enhance the model’s capabilities through step-wise reinforcement learning and self-evolving training, while also developing a tailored multi-agent framework that ensures generality, consistency, and robustness. In terms of user interaction, we explore personalized user intent recognition and proactive engagement, positioning the OS agent not merely as an automation tool but as a warm, collaborative partner. We evaluate ColorAgent on the AndroidWorld and AndroidLab benchmarks, achieving success rates of 77.2% and 50.7%, respectively, establishing a new state of the art. Nonetheless, we note that current benchmarks are insufficient for a comprehensive evaluation of OS agents and propose further exploring directions in future work, particularly in the areas of evaluation paradigms, agent collaboration, and security. Our code is available at https://github.com/MadeAgents/mobile-use.
eess.AS [Back]
[113] StutterZero and StutterFormer: End-to-End Speech Conversion for Stuttering Transcription and Correction
Qianheng Xu
Main category: eess.AS
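TL;DR: The paper proposes StutterZero and StutterFormer, two end-to-end speech conversion models that convert stuttered speech directly into fluent speech while jointly predicting its transcription, substantially improving transcription accuracy and semantic similarity.
Details
Motivation: More than 70 million people worldwide stutter, yet existing automatic speech systems rely on staged processing or separate modules, which leads to inaccurate transcription or distortion.
Contribution: 1. Proposes the first end-to-end waveform-to-waveform models (StutterZero and StutterFormer) that jointly convert stuttered speech to fluent speech and transcribe it; 2. trains on data from SEP-28K and LibriStutter and evaluates on FluencyBank, outperforming existing methods.
Method: 1. StutterZero uses a convolutional-bidirectional LSTM encoder-decoder architecture; 2. StutterFormer combines a dual-stream Transformer with shared acoustic-linguistic representations.
Result: StutterZero reduces Word Error Rate (WER) by 24% and improves BERTScore by 31%; StutterFormer goes further, reducing WER by 28% and improving BERTScore by 34%.
Insight: End-to-end models can handle speech conversion and transcription jointly, avoiding the distortion introduced by multi-stage pipelines and opening new directions for inclusive human-computer interaction and speech therapy.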
Abstract: Over 70 million people worldwide experience stuttering, yet most automatic speech systems misinterpret disfluent utterances or fail to transcribe them accurately. Existing methods for stutter correction rely on handcrafted feature extraction or multi-stage automatic speech recognition (ASR) and text-to-speech (TTS) pipelines, which separate transcription from audio reconstruction and often amplify distortions. This work introduces StutterZero and StutterFormer, the first end-to-end waveform-to-waveform models that directly convert stuttered speech into fluent speech while jointly predicting its transcription. StutterZero employs a convolutional-bidirectional LSTM encoder-decoder with attention, whereas StutterFormer integrates a dual-stream Transformer with shared acoustic-linguistic representations. Both architectures are trained on paired stuttered-fluent data synthesized from the SEP-28K and LibriStutter corpora and evaluated on unseen speakers from the FluencyBank dataset. Across all benchmarks, StutterZero had a 24% decrease in Word Error Rate (WER) and a 31% improvement in semantic similarity (BERTScore) compared to the leading Whisper-Medium model. StutterFormer achieved better results, with a 28% decrease in WER and a 34% improvement in BERTScore. The results validate the feasibility of direct end-to-end stutter-to-fluent speech conversion, offering new opportunities for inclusive human-computer interaction, speech therapy, and accessibility-oriented AI systems.
cs.AI [Back]
[114] A Multi-faceted Analysis of Cognitive Abilities: Evaluating Prompt Methods with Large Language Models on the CONSORT Checklist
Sohyeon Jeon,Hyung-Chul Lee
Main category: cs.AI
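TL;DR: Using a behavioral and metacognitive analytic approach, the paper studies how two large language models (LLMs) assess clinical trial reports against the CONSORT standards under three prompt conditions. It reveals differences in reasoning style and expression of uncertainty, and highlights the models' limitations for medical AI development.
Details
Motivation: Although LLMs are spreading rapidly in healthcare, their cognitive and reasoning abilities when assessing clinical trial reports against the CONSORT standards remain unclear. The study aims to fill this gap and support the development of more reliable and explainable medical AI.
Contribution: Systematically compares two LLMs under three prompt conditions, revealing their differences and limitations in handling CONSORT items.
Method: Applies a behavioral and metacognitive analytic approach with expert-validated data to systematically evaluate model responses under different prompt conditions.
Result: The models differ clearly in how they handle different CONSORT items, and prompt type significantly affects reasoning style and expression of uncertainty.
Insight: Current LLMs have limitations in automating clinical compliance; building more reliable medical AI requires a deeper understanding of the models' cognitive adaptations and strategic behavior.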
Abstract: Despite the rapid expansion of Large Language Models (LLMs) in healthcare, the ability of these systems to assess clinical trial reporting according to CONSORT standards remains unclear, particularly with respect to their cognitive and reasoning strategies. This study applies a behavioral and metacognitive analytic approach with expert-validated data, systematically comparing two representative LLMs under three prompt conditions. Clear differences emerged in how the models approached various CONSORT items, and prompt type shaped response patterns, including shifts in reasoning style, explicit uncertainty, and alternative interpretations. Our results highlight the current limitations of these systems in clinical compliance automation and underscore the importance of understanding their cognitive adaptations and strategic behavior in developing more explainable and reliable medical AI.
[115] The Zero-Step Thinking: An Empirical Study of Mode Selection as Harder Early Exit in Reasoning Models
Yuqiao Tan,Shizhu He,Kang Liu,Jun Zhao
Main category: cs.AI
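TL;DR: This paper studies Mode Selection in reasoning models as a harder variant of the Early Exit problem, in which the decision to cut computational overhead must be made with zero-step thinking. It finds that existing methods perform poorly when information is limited.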
Details
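Motivation: Reasoning models excel at mathematical and logical reasoning, but their step-by-step thinking can incur excessive computational overhead. Mode Selection and Early Exit both aim to reduce this overhead, but Mode Selection is more challenging because the decision must be made before reasoning begins.
Contribution: Formalizes Mode Selection as a harder variant of the Early Exit problem, and shows empirically where existing methods fall short, especially when information is limited.
Method: Introduces the notion of zero-step thinking, in which mode selection relies on pre-defined fake thoughts, and evaluates nine baseline methods experimentally.
Result: Prompt-based methods perform poorly because of their limited classification ability, whereas methods that use internal model information perform better but lack stability.
Insight: Mode Selection remains challenging when information is limited; existing methods do not solve it effectively and need further improvement.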
Abstract: Reasoning models have demonstrated exceptional performance in tasks such as mathematics and logical reasoning, primarily due to their ability to engage in step-by-step thinking during the reasoning process. However, this often leads to overthinking, resulting in unnecessary computational overhead. To address this issue, Mode Selection aims to automatically decide between Long-CoT (Chain-of-Thought) or Short-CoT by utilizing either a Thinking or NoThinking mode. Simultaneously, Early Exit determines the optimal stopping point during the iterative reasoning process. Both methods seek to reduce the computational burden. In this paper, we first identify Mode Selection as a more challenging variant of the Early Exit problem, as they share similar objectives but differ in decision timing. While Early Exit focuses on determining the best stopping point for concise reasoning at inference time, Mode Selection must make this decision at the beginning of the reasoning process, relying on pre-defined fake thoughts without engaging in an explicit reasoning process, referred to as zero-step thinking. Through empirical studies on nine baselines, we observe that prompt-based approaches often fail due to their limited classification capabilities when provided with minimal hand-crafted information. In contrast, approaches that leverage internal information generally perform better across most scenarios but still exhibit issues with stability. Our findings indicate that existing methods relying solely on the information provided by models are insufficient for effectively addressing Mode Selection in scenarios with limited information, highlighting the ongoing challenges of this task. Our code is available at https://github.com/Trae1ounG/Zero_Step_Thinking.
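The difference in decision timing described above can be illustrated with a toy sketch. It is not the paper's method: the scoring function, the per-step confidence values, and both thresholds are hypothetical placeholders. They show only that Mode Selection commits before any reasoning step exists, while Early Exit can inspect the steps produced so far.

```python
from typing import Callable, Sequence


def select_mode(
    question: str,
    score_fn: Callable[[str], float],  # hypothetical "no reasoning needed" score
    threshold: float = 0.5,            # hypothetical cutoff
) -> str:
    """Mode Selection: decide once, before any reasoning step is generated."""
    return "NoThinking" if score_fn(question) >= threshold else "Thinking"


def early_exit_step(
    step_confidences: Sequence[float],  # hypothetical confidence after each step
    threshold: float = 0.9,             # hypothetical cutoff
) -> int:
    """Early Exit: return how many reasoning steps run before stopping."""
    for step, confidence in enumerate(step_confidences, start=1):
        if confidence >= threshold:
            return step
    # No step was confident enough, so the full chain is used.
    return len(step_confidences)
```

In this framing, `select_mode` sees only the question (zero reasoning steps), which is why the abstract treats it as the harder problem: `early_exit_step` gets fresh evidence after every step.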
[116] Memo: Training Memory-Efficient Embodied Agents with Reinforcement Learning
Gunshi Gupta,Karmesh Yadav,Zsolt Kira,Yarin Gal,Rahaf Aljundi
Main category: cs.AI
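TL;DR: The paper proposes Memo, a transformer-based architecture and training recipe for long-horizon, memory-intensive reinforcement learning tasks. By introducing periodic summarization tokens, Memo creates and retrieves memories during training, which improves compute and storage efficiency and yields better performance in long-context inference.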
Details
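Motivation: Existing transformer models face context limits and computational inefficiency on long-horizon memory tasks, whereas humans compress and use memories efficiently. The paper aims to address these problems with a more efficient approach to memory management.
Contribution: Proposes the Memo architecture and training recipe, which manage long-term memory through periodic summarization tokens and improve model performance and efficiency.
Method: Memo introduces summarization tokens during training to compress and retrieve memories dynamically, reducing the context burden on the transformer.
Result: On a gridworld meta-RL benchmark and a photo-realistic indoor navigation task, Memo outperforms baseline models and remains robust in long-context inference and streaming settings.
Insight: Dynamic memory management can effectively address the long-term memory problem of transformers while preserving compute and storage efficiency.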
Abstract: To enable embodied agents to operate effectively over extended timeframes, it is crucial to develop models that form and access memories to stay contextualized in their environment. In the current paradigm of training transformer-based policies for embodied sequential decision-making tasks, visual inputs often overwhelm the context limits of transformers, while humans can maintain and utilize a lifetime of experience compressed as memories. Significant compression is possible in principle, as much of the input is irrelevant and can be abstracted. However, existing approaches predominantly focus on either recurrent models with fixed-size memory or transformers with full-context reliance. In this work, we propose Memo, a transformer-based architecture and training recipe for reinforcement learning (RL) on memory-intensive, long-horizon tasks. Memo incorporates the creation and retrieval of memory by interleaving periodic summarization tokens with the inputs of a model during training. We demonstrate Memo’s effectiveness on a gridworld meta-RL benchmark and a multi-object navigation task in photo-realistic indoor settings. Memo outperforms naive long-context transformer baselines while being more compute and storage efficient. Additionally, Memo generalizes better to longer contexts at inference time and remains robust in streaming settings, where historical context must be truncated to fit inference constraints.
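The interleaving idea above can be sketched at the level of a token sequence. This is only an illustration of placing a summarization token after every fixed-length segment of inputs. The token string `<SUM>`, the function name, and the fixed `segment_len` are assumptions; the abstract does not describe how Memo schedules or represents these tokens.

```python
from typing import List

SUMMARY_TOKEN = "<SUM>"  # hypothetical placeholder for a summarization token


def interleave_summary_tokens(observations: List[str], segment_len: int) -> List[str]:
    """Insert a summarization token after every `segment_len` observations."""
    if segment_len <= 0:
        raise ValueError("segment_len must be positive")
    sequence: List[str] = []
    for index, observation in enumerate(observations, start=1):
        sequence.append(observation)
        # Close each full segment with a token where a summary could be formed.
        if index % segment_len == 0:
            sequence.append(SUMMARY_TOKEN)
    return sequence
```

For example, five observations with `segment_len=2` yield two summary positions and leave the trailing partial segment unsummarized.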
[117] HSCodeComp: A Realistic and Expert-level Benchmark for Deep Search Agents in Hierarchical Rule Application
Yiqian Yang,Tian Lan,Qianghuai Jia,Li Zhu,Hui Jiang,Hang Zhu,Longyue Wang,Weihua Luo,Kaifu Zhang
Main category: cs.AI
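TL;DR: HSCodeComp is a benchmark for deep search agents that evaluates their ability to apply hierarchical rules, particularly where the rules have vague boundaries and implicit logical relationships. Experiments show that existing agents perform far below human experts.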
Details
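Motivation: Current agent benchmarks overlook agents' ability to apply complex rules such as tariff rules, which is critical in real-world applications.
Contribution: Introduces HSCodeComp, an expert-level benchmark built from real e-commerce data, for evaluating agents on hierarchical rule application.
Method: Builds a dataset of 632 product entries annotated with 10-digit HS codes by human experts, and sets a task that requires agents to predict each product's code.
Result: The best agent reaches only 46.8% accuracy, far below the 95.0% achieved by human experts.
Insight: Hierarchical rule application is a significant challenge for agents, and existing test-time scaling methods do not improve performance further.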
Abstract: Effective deep search agents must not only access open-domain and domain-specific knowledge but also apply complex rules, such as legal clauses, medical manuals and tariff rules. These rules often feature vague boundaries and implicit logic relationships, making precise application challenging for agents. However, this critical capability is largely overlooked by current agent benchmarks. To fill this gap, we introduce HSCodeComp, the first realistic, expert-level e-commerce benchmark designed to evaluate deep search agents in hierarchical rule application. In this task, the deep reasoning process of agents is guided by these rules to predict the 10-digit Harmonized System Code (HSCode) of products with noisy but realistic descriptions. These codes, established by the World Customs Organization, are vital for global supply chain efficiency. Built from real-world data collected from large-scale e-commerce platforms, our proposed HSCodeComp comprises 632 product entries spanning diverse product categories, with these HSCodes annotated by several human experts. Extensive experimental results on several state-of-the-art LLMs, open-source, and closed-source agents reveal a huge performance gap: the best agent achieves only 46.8% 10-digit accuracy, far below human experts at 95.0%. Besides, detailed analysis demonstrates the challenges of hierarchical rule application, and test-time scaling fails to improve performance further.
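The 10-digit accuracy reported above is an exact-match metric, and the hierarchical nature of HS codes means a prediction can be right at a coarser level while wrong at the full code. The sketch below scores predictions at the 2-, 4-, 6-, and 10-digit prefixes. It is not the benchmark's evaluation code; the function name is an assumption, and the codes in the usage note are made-up digit strings, not real classifications.

```python
from typing import Sequence


def prefix_accuracy(predicted: Sequence[str], gold: Sequence[str], digits: int) -> float:
    """Share of predictions whose first `digits` characters match the gold code."""
    if len(predicted) != len(gold) or not gold:
        raise ValueError("predicted and gold must be non-empty and of equal length")
    hits = sum(p[:digits] == g[:digits] for p, g in zip(predicted, gold))
    return hits / len(gold)


def hierarchical_report(predicted: Sequence[str], gold: Sequence[str]) -> dict:
    """Accuracy at the chapter (2), heading (4), subheading (6), and full (10) levels."""
    return {digits: prefix_accuracy(predicted, gold, digits) for digits in (2, 4, 6, 10)}
```

With made-up gold codes `["8517130000", "6109100010", "9503000090"]` and predictions `["8517130000", "6109900010", "9504000090"]`, all three predictions agree at 2 digits, two agree at 4 digits, and only one agrees at 6 and 10 digits, which shows how errors accumulate down the hierarchy.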