Table of Contents
- cs.CL [Total: 51]
- cs.CV [Total: 49]
- cs.RO [Total: 1]
- cs.LO [Total: 1]
- cs.CY [Total: 1]
- cs.LG [Total: 6]
- cs.SE [Total: 1]
- cs.HC [Total: 1]
- eess.IV [Total: 4]
- cs.AR [Total: 2]
- cs.CR [Total: 1]
- q-bio.NC [Total: 1]
- cs.AI [Total: 2]
cs.CL
[1] KG-o1: Enhancing Multi-hop Question Answering in Large Language Models via Knowledge Graph Integration
Nan Wang,Yongqi Fan,yansha zhu,ZongYu Wang,Xuezhi Cao,Xinyan He,Haiyun Jiang,Tong Ruan,Jingping Liu
Main category: cs.CL
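TL;DR: KG-o1 integrates knowledge graphs (KGs) to strengthen the reasoning of large language models (LLMs) on multi-hop question answering, proposing a four-stage approach that outperforms existing models.
Details
Motivation: LLMs perform poorly on knowledge-intensive tasks such as multi-hop question answering because the chains of thought they generate deviate from the true reasoning paths, whereas KGs represent logical connections explicitly. KG-o1 is proposed to close this gap.
Contribution: Proposes the KG-o1 framework, a KG-based four-stage approach (entity filtering, subgraph generation, logical path construction, and self-improving corpus generation) that significantly improves the multi-hop reasoning of LLMs.
Method: A four-stage approach: 1) filter initial entities and generate complex subgraphs; 2) construct logical paths over the subgraphs; 3) use the KG to build a long-chain reasoning dataset; 4) apply rejection sampling to generate a corpus for DPO.
Result: Experiments on simple and complex datasets show that KG-o1 models outperform existing large reasoning models (LRMs) on all tasks.
Insight: Combining the explicit logical representation of KGs with the reasoning ability of LLMs addresses multi-hop reasoning effectively, and a self-improving corpus raises performance further.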
Abstract: Large Language Models (LLMs) face challenges in knowledge-intensive reasoning tasks like classic multi-hop question answering, which involves reasoning across multiple facts. This difficulty arises because the chains of thought (CoTs) generated by LLMs in such tasks often deviate from real or a priori reasoning paths. In contrast, knowledge graphs (KGs) explicitly represent the logical connections between facts through entities and relationships. This reflects a significant gap. Meanwhile, large reasoning models (LRMs), such as o1, have demonstrated that long-step reasoning significantly enhances the performance of LLMs. Building on these insights, we propose KG-o1, a four-stage approach that integrates KGs to enhance the multi-hop reasoning abilities of LLMs. We first filter out initial entities and generate complex subgraphs. Second, we construct logical paths for subgraphs and then use knowledge graphs to build a dataset with a complex and extended brainstorming process, which trains LLMs to imitate long-term reasoning. Finally, we employ rejection sampling to generate a self-improving corpus for direct preference optimization (DPO), further refining the LLMs' reasoning abilities. We conducted experiments on two simple and two complex datasets. The results show that KG-o1 models exhibit superior performance across all tasks compared to existing LRMs.
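The final stage described above (rejection sampling to build a DPO corpus) might be sketched as follows. This is only an illustration under stated assumptions: the abstract does not say how samples are accepted, so exact match against a gold answer is assumed here, and the function name, record fields, and example data are all hypothetical.

```python
def build_preference_pairs(question, candidates, gold_answer):
    """Pair accepted and rejected reasoning traces for one question.

    Assumption: a sampled candidate is accepted when its final answer
    exactly matches the gold answer. Each candidate is a dict with
    hypothetical keys "reasoning" and "answer".
    """
    chosen = [c for c in candidates if c["answer"] == gold_answer]
    rejected = [c for c in candidates if c["answer"] != gold_answer]
    # Every (accepted, rejected) combination becomes one preference record.
    return [
        {"prompt": question, "chosen": c["reasoning"], "rejected": r["reasoning"]}
        for c in chosen
        for r in rejected
    ]


# Hypothetical samples drawn from a model for a two-hop question.
samples = [
    {"reasoning": "A directed the film; A was born in Paris.", "answer": "Paris"},
    {"reasoning": "B directed the film; B was born in Rome.", "answer": "Rome"},
    {"reasoning": "A directed it; A's birthplace is Paris.", "answer": "Paris"},
]
pairs = build_preference_pairs("Where was the film's director born?", samples, "Paris")
print(len(pairs))
```

Questions for which every sample is correct, or every sample is wrong, yield no pairs under this scheme, which is one reason a real pipeline may need a different acceptance rule.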
[2] Bhav-Net: Knowledge Transfer for Cross-Lingual Antonym vs Synonym Distinction via Dual-Space Graph Transformers
Samyak S. Sanghvi
Main category: cs.CL
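TL;DR: Bhav-Net proposes a dual-space architecture that combines language-specific BERT encoders with graph transformer networks to transfer knowledge across languages and effectively distinguish antonyms from synonyms.
Details
Motivation: Distinguishing antonyms from synonyms across languages is challenging because antonyms share a semantic domain while expressing opposite meanings. Bhav-Net targets this problem by transferring knowledge from complex multilingual models to language-specific architectures.
Contribution: Proposes Bhav-Net, a dual-space architecture that combines language-specific BERT encoders with graph transformer networks for cross-lingual antonym vs synonym distinction, and validates semantic relationship modeling across eight languages.
Method: Uses language-specific BERT encoders and graph transformer networks to create two complementary semantic projection spaces: synonym pairs cluster in one, while antonym pairs show high similarity in the other.
Result: Bhav-Net performs strongly across eight languages, is competitive with state-of-the-art baselines, and provides interpretable semantic representations and cross-lingual generalization.
Insight: The dual-space design captures the distinct semantic relations of antonyms and synonyms while supporting cross-lingual knowledge transfer.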
Abstract: Antonym vs synonym distinction across multiple languages presents unique computational challenges due to the paradoxical nature of antonymous relationships: words that share semantic domains while expressing opposite meanings. This work introduces Bhav-Net, a novel dual-space architecture that enables effective knowledge transfer from complex multilingual models to simpler, language-specific architectures while maintaining robust cross-lingual antonym–synonym distinction capabilities. Our approach combines language-specific BERT encoders with graph transformer networks, creating distinct semantic projections where synonymous pairs cluster in one space while antonymous pairs exhibit high similarity in a complementary space. Through comprehensive evaluation across eight languages (English, German, French, Spanish, Italian, Portuguese, Dutch, and Russian), we demonstrate that semantic relationship modeling transfers effectively across languages. The dual-encoder design achieves competitive performance against state-of-the-art baselines while providing interpretable semantic representations and effective cross-lingual generalization.
[3] Format as a Prior: Quantifying and Analyzing Bias in LLMs for Heterogeneous Data
Jiacheng Liu,Mayi Xu,Qiankun Pi,Wenli Li,Ming Zhong,Yuanyuan Zhu,Mengchi Liu,Tieyun Qian
Main category: cs.CL
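TL;DR: This paper is the first to study format bias in large language models (LLMs) that process heterogeneous data. A three-stage empirical study analyzes whether the bias is systematic, which data-level factors drive it, and what internal mechanisms underlie it, and the paper proposes research directions for reducing it.
Details
Motivation: LLMs are increasingly used on data in heterogeneous formats (text, tables, knowledge graphs, and so on). Format bias can lead to unfair data integration and reasoning errors, yet its systematic character and internal mechanisms remain unclear.
Contribution: Provides the first systematic analysis of format bias in LLMs, establishing its existence, its data-level drivers, and how it shows up in attention, and proposes three possible mitigation directions.
Method: Constructs heterogeneous data conflict scenarios and runs a three-stage empirical study: 1) verify the presence and direction of bias; 2) analyze how data-level factors such as information richness, structure quality, and format type affect the bias; 3) examine how the bias appears in attention patterns and evaluate a lightweight intervention.
Result: Format bias is widespread in LLMs, and factors such as information richness and structure quality significantly affect its strength. Attention analysis reveals the internal mechanism, and a lightweight intervention (attention re-weighting) shows potential for mitigation.
Insight: Format bias is not only a data preprocessing problem but is also closely tied to the model's internal mechanisms. Future work can reduce it through format normalization, inference-time interventions, and format-balanced training data, improving the fairness and robustness of heterogeneous data processing.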
Abstract: Large Language Models (LLMs) are increasingly employed in applications that require processing information from heterogeneous formats, including text, tables, infoboxes, and knowledge graphs. However, systematic biases toward particular formats may undermine LLMs’ ability to integrate heterogeneous data impartially, potentially resulting in reasoning errors and increased risks in downstream tasks. Despite these concerns, it remains uncertain whether such format biases are systematic, which data-level factors contribute to them, and what internal mechanisms in LLMs underlie their emergence. In this paper, we make the first attempt to investigate and analyze the format bias in LLMs. To systematically investigate the aforementioned questions, we conduct a three-stage empirical study by constructing a heterogeneous data conflict scenario for the exploration of bias. The first stage explores the presence and direction of bias across a diverse range of LLMs. The second stage aims to examine how key data-level factors, including information richness, structure quality, and format type, influence these biases. The third stage analyzes how format bias emerges within LLMs’ attention patterns and evaluates a lightweight intervention to test its potential mitigability. Based on these investigations, we identify three future research directions to reduce format bias: improving data preprocessing through format sanitization and normalization, introducing inference-time interventions such as attention re-weighting, and developing format-balanced training corpora. These directions will support the design of more robust and fair heterogeneous data processing systems.
[4] Benchmarking the Legal Reasoning of LLMs in Arabic Islamic Inheritance Cases
Nouar AlDahoul,Yasir Zaki
Main category: cs.CL
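TL;DR: This study evaluates the reasoning of large language models (LLMs) on Islamic inheritance law cases and proposes a majority voting solution that significantly improves accuracy.
Details
Motivation: Inheritance calculations under Islamic law are complex and error-prone, which raises the question of whether LLMs can assist with this kind of legal reasoning.
Contribution: Evaluates the reasoning of multiple LLMs and proposes a majority voting method that combines three base models and performs best on Arabic Islamic inheritance cases.
Method: Uses the dataset from the ArabicNLP QIAS 2025 challenge to test a range of base and fine-tuned models, focusing on their ability to identify heirs, compute shares, and justify their reasoning on legal grounds.
Result: The majority voting solution reaches up to 92.7% accuracy in Task 1 of the QIAS 2025 challenge, placing third.
Insight: LLMs perform well on Islamic legal reasoning, but their performance depends on model choice and on the ensembling method.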
Abstract: The Islamic inheritance domain holds significant importance for Muslims to ensure fair distribution of shares between heirs. Manual calculation of shares under numerous scenarios is complex, time-consuming, and error-prone. Recent advancements in Large Language Models (LLMs) have sparked interest in their potential to assist with complex legal reasoning tasks. This study evaluates the reasoning capabilities of state-of-the-art LLMs to interpret and apply Islamic inheritance laws. We utilized the dataset proposed in the ArabicNLP QIAS 2025 challenge, which includes inheritance case scenarios given in Arabic and derived from Islamic legal sources. Various base and fine-tuned models are assessed on their ability to accurately identify heirs, compute shares, and justify their reasoning in alignment with Islamic legal principles. Our analysis reveals that the proposed majority voting solution, leveraging three base models (Gemini Flash 2.5, Gemini Pro 2.5, and GPT o3), outperforms all other models that we utilized across every difficulty level. It achieves up to 92.7% accuracy and secures third place overall in Task 1 of the QIAS 2025 challenge.
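Majority voting over three models' answers can be sketched in a few lines. The abstract does not describe how disagreements among all three models are resolved, so the tie-break below (fall back to the first model listed) is an assumption, and the model names and picks are placeholders.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent answer among the models' picks.

    Assumption: when no answer has a strict plurality, the earliest
    answer in the list wins (i.e. the first model acts as tie-breaker).
    """
    counts = Counter(answers)
    top_count = max(counts.values())
    # Scan in the original order so ties resolve to the earliest model.
    for answer in answers:
        if counts[answer] == top_count:
            return answer


# Hypothetical picks from three models on one multiple-choice question.
picks = {"model_a": "B", "model_b": "B", "model_c": "D"}
print(majority_vote(list(picks.values())))  # prints "B"
```

With three voters, a tie only occurs when all three disagree, so the ordering of the models decides those cases under this scheme.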
[5] Benchmarking the Medical Understanding and Reasoning of Large Language Models in Arabic Healthcare Tasks
Nouar AlDahoul,Yasir Zaki
Main category: cs.CL
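TL;DR: This paper evaluates the understanding and reasoning of large language models (LLMs) on Arabic healthcare tasks, testing them on multiple-choice and open-ended questions, and proposes a majority voting method that improves accuracy.
Details
Motivation: To explore how LLMs perform in the Arabic medical domain, which has received limited study, and to assess their practical potential in clinical settings.
Contribution: (1) Proposes a majority voting solution that significantly improves accuracy on multiple-choice questions; (2) systematically evaluates LLMs on Arabic medical tasks.
Method: (1) Uses the Arabic medical dataset from the AraHealthQA challenge to test LLMs on multiple-choice and open-ended questions; (2) applies majority voting over the answers of three base models (Gemini Flash 2.5, Gemini Pro 2.5, and GPT o3).
Result: (1) Majority voting reaches up to 77% accuracy on multiple-choice questions; (2) on open-ended questions, several LLMs show strong semantic alignment, with a maximum BERTScore of 86.44%.
Insight: LLMs show potential in the Arabic medical domain, but the accuracy and semantic consistency of generated content still need improvement before clinical use.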
Abstract: Recent progress in large language models (LLMs) has showcased impressive proficiency in numerous Arabic natural language processing (NLP) applications. Nevertheless, their effectiveness in Arabic medical NLP domains has received limited investigation. This research examines the degree to which state-of-the-art LLMs demonstrate and articulate healthcare knowledge in Arabic, assessing their capabilities across a varied array of Arabic medical tasks. We benchmark several LLMs using a medical dataset proposed in the Arabic NLP AraHealthQA challenge in the MedArabiQ2025 track. Various base LLMs were assessed on their ability to accurately provide correct answers from existing choices in multiple-choice questions (MCQs) and fill-in-the-blank scenarios. Additionally, we evaluated the capacity of LLMs in answering open-ended questions aligned with expert answers. Our results reveal significant variations in correct answer prediction accuracy and low variations in semantic alignment of generated answers, highlighting both the potential and limitations of current LLMs in Arabic clinical contexts. Our analysis shows that for the MCQ task, the proposed majority voting solution, leveraging three base models (Gemini Flash 2.5, Gemini Pro 2.5, and GPT o3), outperforms others, achieving up to 77% accuracy and securing first place overall in the AraHealthQA 2025 shared task, track 2 (sub-task 1). Moreover, for the open-ended questions task, several LLMs were able to demonstrate excellent performance in terms of semantic alignment and achieve a maximum BERTScore of 86.44%.
[6] Persuasiveness and Bias in LLM: Investigating the Impact of Persuasiveness and Reinforcement of Bias in Language Models
Saumya Roy
Main category: cs.CL
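TL;DR: This study examines how persuasiveness and bias interact in large language models (LLMs), shows how models could be misused to spread misinformation or reinforce social biases, and argues for safeguards.
Details
Motivation: As LLMs are widely deployed, their persuasive power and their potential to amplify bias could be misused. The study aims to assess these risks and inform safe deployment.
Contribution: Proposes a persona-based persuasion evaluation framework (convincer-skeptic) that quantifies a model's persuasiveness and how much it amplifies bias.
Method: Runs persona simulation experiments, measures persuasion with Jensen-Shannon divergence, and probes how far models reinforce bias across race, gender, and religion.
Result: LLMs can strongly shape narratives and adapt to audience values, but they can also be used to spread misinformation or reinforce social biases.
Insight: The core risk lies in misuse rather than in occasional model mistakes. Both technical measures (such as alignment-oriented design) and policy are needed to guard against harm.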
Abstract: Warning: This research studies AI persuasion and bias amplification that could be misused; all experiments are for safety evaluation. Large Language Models (LLMs) now generate convincing, human-like text and are widely used in content creation, decision support, and user interactions. Yet the same systems can spread information or misinformation at scale and reflect social biases that arise from data, architecture, or training choices. This work examines how persuasion and bias interact in LLMs, focusing on how imperfect or skewed outputs affect persuasive impact. Specifically, we test whether persona-based models can persuade with fact-based claims while also, unintentionally, promoting misinformation or biased narratives. We introduce a convincer-skeptic framework: LLMs adopt personas to simulate realistic attitudes. Skeptic models serve as human proxies; we compare their beliefs before and after exposure to arguments from convincer models. Persuasion is quantified with Jensen-Shannon divergence over belief distributions. We then ask how much persuaded entities go on to reinforce and amplify biased beliefs across race, gender, and religion. Strong persuaders are further probed for bias using sycophantic adversarial prompts and judged with additional models. Our findings show both promise and risk. LLMs can shape narratives, adapt tone, and mirror audience values across domains such as psychology, marketing, and legal assistance. But the same capacity can be weaponized to automate misinformation or craft messages that exploit cognitive biases, reinforcing stereotypes and widening inequities. The core danger lies in misuse more than in occasional model mistakes. By measuring persuasive power and bias reinforcement, we argue for guardrails and policies that penalize deceptive use and support alignment, value-sensitive design, and trustworthy deployment.
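The persuasion measure named in the abstract, Jensen-Shannon divergence between a skeptic's belief distributions before and after exposure, can be computed as below. This is a minimal sketch: the three-way belief scale, the example probabilities, and the base-2 logarithm (which bounds the score to [0, 1]) are assumptions, not details taken from the paper.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: the average KL of p and q to their midpoint."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical skeptic beliefs over (disagree, neutral, agree).
before = [0.70, 0.20, 0.10]
after = [0.30, 0.30, 0.40]
print(round(js_divergence(before, after), 4))
```

A score of 0 means the argument left the belief distribution unchanged, and larger values indicate a bigger shift. Unlike plain KL divergence, the measure is symmetric and stays finite even when one distribution assigns zero probability to an outcome.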
[7] MAC: A Live Benchmark for Multimodal Large Language Models in Scientific Understanding
Mohan Jiang,Jin Gao,Jiahao Zhan,Dequan Wang
Main category: cs.CL
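TL;DR: The paper introduces MAC, a continuously updated benchmark for evaluating multimodal large language models (MLLMs) on scientific understanding. It is built from image-text data from top-tier journals and stresses reasoning, and the accompanying DAD method improves model performance.
Details
Motivation: As MLLMs become more capable, fixed benchmarks are increasingly unable to evaluate high-level scientific understanding, so a benchmark that updates over time is needed.
Contribution: 1. Proposes the live MAC benchmark; 2. provides a dataset of scientific image-text pairs; 3. proposes the DAD method to improve MLLM reasoning.
Method: MAC builds cross-modal scientific reasoning tasks from more than 25,000 journal image-text pairs. DAD strengthens reasoning by extending visual features with language-space reasoning.
Result: Experiments show that MLLMs have strong perceptual abilities but limited cross-modal reasoning. DAD improves performance by up to 11%.
Insight: A live benchmark keeps better pace with technical progress, and scientific reasoning is an important direction for future MLLM development.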
Abstract: As multimodal large language models (MLLMs) grow increasingly capable, fixed benchmarks are gradually losing their effectiveness in evaluating high-level scientific understanding. In this paper, we introduce the Multimodal Academic Cover benchmark (MAC), a live benchmark that could continuously evolve with scientific advancement and model progress. MAC leverages over 25,000 image-text pairs sourced from issues of top-tier scientific journals such as Nature, Science, and Cell, challenging MLLMs to reason across abstract visual and textual scientific content. Experiments on our most recent yearly snapshot, MAC-2025, reveal that while MLLMs demonstrate strong perceptual abilities, their cross-modal scientific reasoning remains limited. To bridge this gap, we propose DAD, a lightweight inference-time approach that enhances MLLMs by extending MLLM visual features with language space reasoning, achieving performance improvements of up to 11%. Finally, we highlight the live nature of MAC through experiments on updating journal covers and models for curation, illustrating its potential to remain aligned with the frontier of human knowledge. We release our benchmark at https://github.com/mhjiang0408/MAC_Bench.
[8] SurfaceLogicKV: Surface and Logic Attention Behaviors are All You Need for Robust KV Cache Compression
Mengjie Li,William J. Song
Main category: cs.CL
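TL;DR: The paper proposes SurfaceLogicKV, a method that separates attention behavior into surface memorization and logic construction in order to compress the KV cache effectively while maintaining model performance.
Details
Motivation: Growing input sequence lengths in large language models (LLMs) put heavy pressure on KV cache storage and hurt inference efficiency. The paper aims to design a more efficient KV cache compression method by analyzing distinct attention behaviors.
Contribution: 1) Explicitly distinguishes attention behavior into surface memorization and logic construction for the first time; 2) proposes the two-stage SurfaceLogicKV method, which uses these behaviors for efficient KV cache compression.
Method: 1) Analyzes attention head behavior and separates it into surface memorization (0.5%) and logic construction (1.5%); 2) uses layer- and head-wise integration to design a two-stage compression method that prioritizes logic construction behavior and ignores irrelevant content.
Result: Experiments show that the method is robust across a variety of tasks and long sequences, with performance close to, and in some cases better than, baselines or FullKV.
Insight: The vast majority of attention heads (98.5%) ignore irrelevant information, while surface memorization and logic construction behaviors are rare but essential for long-context reasoning. This finding suggests a new angle on KV cache optimization.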
Abstract: The increasing input sequence length in Large Language Models (LLMs) puts significant pressure on key-value (KV) cache storage, making efficient inference challenging. Explicitly distinguishing attention behavior into our self-defined surface memorization and logic construction reveals essential roles in long-context reasoning. We observe that an individual attention head can display various behaviors, with nearly 98.5% effectively ignoring completely irrelevant information. The remaining 1.5% behaves as logic construction, and 0.5% behaves as surface memorization. Based on layer- and head-wise integration, we propose a novel two-stage SurfaceLogicKV method to utilize these attention behaviors for KV cache compression. As a result, it achieves improved compression robustness while maintaining competitive performance across various tasks and long sequences compared to baselines or even FullKV in some specific situations.
[9] KL-based self-distillation for large language models
Max Rehman Linder
Main category: cs.CL
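TL;DR: The paper proposes a KL-divergence-based self-distillation method for transferring knowledge in large language models during vocabulary expansion. It handles the mismatch between different tokenizations and performs well on code generation tasks.
Details
Motivation: Large pre-trained language models struggle to absorb new domain-specific terminology during fine-tuning, especially under vocabulary expansion, where differing tokenizations obstruct knowledge transfer. The paper proposes a mathematically grounded method for this problem.
Contribution: Proposes a KL-divergence-based method that enables knowledge distillation even when the teacher and student use different tokenizations, and uses mechanistic interpretability to analyze how new tokens are learned.
Method: Performs knowledge distillation with KL divergence, compares several strategies for initializing new token embeddings, and fine-tunes the models to integrate the new vocabulary. Evaluation is on code generation tasks.
Result: On approximately 2000 code generation tasks, the KL-based method outperforms conventional cross-entropy training.
Insight: Mechanistic interpretability analysis shows how representations for new tokens are learned, explains where the gains come from, and gives insight into the structure of embedding space during vocabulary expansion.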
Abstract: Large pre-trained language models often struggle to incorporate new domain-specific terminology when fine-tuned on small, specialized corpora. In this work, we address the challenge of vocabulary expansion in frozen LLMs by introducing a mathematically grounded method for knowledge distillation via KL divergence, even when the original and extended models use different tokenizations. This allows the student model to inherit distributional knowledge from the teacher despite differing vocabularies. We compare our KL-based distillation approach to conventional cross-entropy training, evaluating both methods across multiple strategies for initializing new token embeddings. After embedding initialization, models are further fine-tuned to integrate the new vocabulary. Each trained model is benchmarked on approximately 2000 code-generation tasks, where our approach achieves the best performance across the board. Finally, through mechanistic interpretability, we analyze how models learn representations for the new tokens, providing an explanation for the observed gains and offering insight into the structure of embedding space during vocabulary expansion.
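As background for the comparison above, a standard KL distillation loss for a single token position looks like the sketch below. It assumes the teacher and student logits are already aligned over the same vocabulary, which is precisely the restriction the paper's method is said to lift, so this is the textbook baseline rather than the paper's formulation. The forward direction KL(teacher || student) and the toy logits are also assumptions.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_z for x in logits]


def kl_distill_loss(teacher_logits, student_logits):
    """Forward KL(teacher || student) for one position, in nats.

    Assumption: both logit vectors index the same vocabulary entries.
    """
    log_p = log_softmax(teacher_logits)
    log_q = log_softmax(student_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


# Toy logits over a hypothetical 4-token vocabulary.
teacher = [2.0, 0.5, -1.0, 0.0]
student = [1.0, 1.0, -0.5, 0.2]
print(round(kl_distill_loss(teacher, student), 4))
```

Cross-entropy against a one-hot label only rewards the single correct token, whereas this loss also pushes the student to match how the teacher spreads probability over the alternatives.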
[10] Chain-of-Query: Unleashing the Power of LLMs in SQL-Aided Table Understanding via Multi-Agent Collaboration
Songyuan Sui,Hongyi Liu,Serena Liu,Li Li,Soo-Hyun Choi,Rui Chen,Xia Hu
Main category: cs.CL
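TL;DR: The paper proposes Chain-of-Query (CoQ), a multi-agent framework that uses natural-language table schema representations and clause-by-clause SQL generation to significantly improve table understanding accuracy and the validity of generated SQL.
Details
Motivation: Table understanding requires structured, multi-step reasoning, but large language models (LLMs) struggle with the structural complexity of tabular data. Existing approaches comprehend table structure unreliably, suffer error propagation that produces invalid queries, and rely too heavily on execution correctness.
Contribution: Proposes the CoQ framework, which uses natural-language schema representations to remove structural noise, generates SQL clause by clause to improve quality, and separates SQL-based mechanical reasoning from LLM-based logical inference to reduce reliance on execution outcomes.
Method: Multi-agent collaboration with natural-language schema representations, clause-by-clause SQL generation, and a hybrid reasoning division.
Result: Across five benchmarks, accuracy rises from 61.11% to 74.77% and the invalid SQL rate drops from 9.48% to 3.34%.
Insight: Natural-language schema representations and clause-by-clause SQL generation substantially improve table understanding and SQL generation, and separating mechanical from logical reasoning reduces dependence on execution results.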
Abstract: Table understanding requires structured, multi-step reasoning. Large Language Models (LLMs) struggle with it due to the structural complexity of tabular data. Recently, multi-agent frameworks for SQL generation have shown promise in tackling the challenges of understanding tabular data, but existing approaches often suffer from limitations such as the inability to comprehend table structure for reliable SQL generation, error propagation that results in invalid queries, and over-reliance on execution correctness. To address these issues, we propose Chain-of-Query (CoQ), a novel multi-agent framework for SQL-aided table understanding. CoQ adopts natural-language-style representations of table schemas to abstract away structural noise and enhance understanding. It employs a clause-by-clause SQL generation strategy to improve query quality and introduces a hybrid reasoning division that separates SQL-based mechanical reasoning from LLM-based logical inference, thereby reducing reliance on execution outcomes. Experiments with four models (both closed- and open-source) across five widely used benchmarks show that Chain-of-Query significantly improves accuracy from 61.11% to 74.77% and reduces the invalid SQL rate from 9.48% to 3.34%, demonstrating its superior effectiveness in table understanding. The code is available at https://github.com/SongyuanSui/ChainofQuery.
[11] Detecting Hope, Hate, and Emotion in Arabic Textual Speech and Multi-modal Memes Using Large Language Models
Nouar AlDahoul,Yasir Zaki
Main category: cs.CL
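TL;DR: The paper explores the potential of large language models (LLMs) to identify hope, hate speech, offensive language, and emotional expression in Arabic text and memes, and demonstrates strong performance in the MAHED 2025 challenge.
Details
Motivation: The spread of Arabic posts and memes on social media calls for precise content analysis to counter the rise in hate speech and offensive language.
Contribution: Shows that fine-tuned LLMs such as GPT-4o-mini and Gemini Flash 2.5 are effective for Arabic content analysis and achieve the top result in the challenge.
Method: Evaluates base LLMs, fine-tuned LLMs, and pre-trained embedding models on the dataset from the ArabicNLP MAHED 2025 challenge.
Result: GPT-4o-mini and Gemini Flash 2.5 reach macro F1 scores of up to 72.1%, 57.8%, and 79.6% on tasks 1, 2, and 3 respectively, ranking first overall.
Insight: Fine-tuned LLMs understand Arabic text and memes at a finer grain, offering an efficient basis for content moderation systems.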
Abstract: The rise of social media and online communication platforms has led to the spread of Arabic textual posts and memes as a key form of digital expression. While such content can be humorous and informative, it is also increasingly being used to spread offensive language and hate speech. Consequently, there is a growing demand for precise analysis of content in Arabic text and memes. This paper explores the potential of large language models to effectively identify hope, hate speech, offensive language, and emotional expressions within such content. We evaluate the performance of base LLMs, fine-tuned LLMs, and pre-trained embedding models. The evaluation is conducted using a dataset of Arabic textual speech and memes proposed in the ArabicNLP MAHED 2025 challenge. The results underscore the capacity of LLMs such as GPT-4o-mini, fine-tuned with Arabic textual speech, and Gemini Flash 2.5, fine-tuned with Arabic memes, to deliver superior performance. They achieve up to 72.1%, 57.8%, and 79.6% macro F1 scores for tasks 1, 2, and 3, respectively, and secure first place overall in the MAHED 2025 challenge. The proposed solutions offer a more nuanced understanding of both text and memes for accurate and efficient Arabic content moderation systems.
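The macro F1 scores reported above average per-class F1 with equal weight per class, which matters when classes such as hate speech are rare. A minimal sketch of the metric follows; the label names and predictions are made up for illustration and are not taken from the MAHED dataset.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        # F1 = 2TP / (2TP + FP + FN); the denominator is nonzero because
        # each label occurs at least once in y_true or y_pred.
        scores.append(2 * tp / (2 * tp + fp + fn))
    return sum(scores) / len(scores)


# Hypothetical gold labels and predictions for four posts.
gold = ["hope", "hate", "hate", "none"]
pred = ["hope", "hate", "none", "none"]
print(round(macro_f1(gold, pred), 4))  # prints 0.7778
```

Here the per-class F1 values are 1 for "hope" and 2/3 for each of "hate" and "none", so the macro average is 7/9 even though plain accuracy is 3/4.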
[12] From Clicks to Preference: A Multi-stage Alignment Framework for Generative Query Suggestion in Conversational System
Junhao Yin,Haolin Wang,Peng Bao,Ju Xu,Yongliang Wang
Main category: cs.CL
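TL;DR: The paper proposes a multi-stage alignment framework that progressively aligns the generation policy with user preferences, addressing generative query suggestion in conversational systems and significantly raising click-through rate.
Details
Motivation: Large language models give conversational systems strong generative query suggestion capabilities, but precisely aligning the outputs with nuanced and uncertain user preferences remains a key problem.
Contribution: 1. Proposes a multi-stage alignment framework that combines prompt engineering, supervised fine-tuning, and reinforcement learning; 2. designs a Gaussian Reward Model (GaRM) to capture the uncertainty in user preferences; 3. proposes a novel out-of-distribution regularization method and a two-stage reward fusion technique.
Method: 1. Uses prompt engineering as a cold-start strategy; 2. performs supervised fine-tuning with a distillation method on click logs; 3. represents user preferences as distributions with GaRM; 4. aligns the policy through reinforcement learning with a composite reward function.
Result: Significantly outperforms baselines on both automatic and human evaluations, with a 34% relative increase in click-through rate in live A/B tests.
Insight: Modeling user preferences as probability distributions, combined with multi-stage alignment and regularization, improves both the quality of generated suggestions and user engagement.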
Abstract: Generative query suggestion using large language models offers a powerful way to enhance conversational systems, but aligning outputs with nuanced user preferences remains a critical challenge. To address this, we introduce a multi-stage framework designed for progressive alignment between the generation policy and user intent. Our pipeline begins with prompt engineering as a cold-start strategy, followed by the Supervised Fine-Tuning stage, in which we introduce a distillation method on click logs to create a robust foundational model. To better model user preferences while capturing their inherent uncertainty, we develop a Gaussian Reward Model (GaRM) that represents user preferences as probability distributions rather than point estimates. Finally, we employ reinforcement learning to align the generation policy with these preferences, guided by a composite reward function that integrates GaRM with auxiliary heuristics to mitigate reward hacking. To maintain training stability, this process is enhanced by a novel out-of-distribution regularization method and a two-stage reward fusion technique. Extensive experiments demonstrate that our framework significantly outperforms baselines on both automatic and human evaluations and yields a 34% relative increase in user engagement as measured by click-through rate in live A/B tests.
[13] SCOPE: A Generative Approach for LLM Prompt Compression
Tinghui Zhang,Yifan Wang,Daisy Zhe Wang
Main category: cs.CL
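TL;DR: The paper proposes SCOPE, a generative prompt compression technique for LLMs that splits the prompt into chunks and rewrites them to preserve semantic coherence, significantly improving compression quality and stability.
Details
Motivation: Existing prompt compression methods rely mainly on token removal, which causes information loss and structural incoherence and limits generation quality. The paper addresses these problems with a generative approach.
Contribution: 1. Proposes a generative prompt compression method; 2. introduces a chunking-and-summarization mechanism; 3. designs several optimization techniques to improve compression quality.
Method: Two steps, chunking and rewriting: the prompt is split into semantically coherent chunks, which are then rewritten more concisely. Optimization techniques such as dynamic compression ratio and keyword maintaining improve the result.
Result: Experiments on question-answering and summarization tasks show better compression quality and higher stability than existing methods, especially at high compression ratios.
Insight: Generative compression outperforms token removal, achieving efficient compression while preserving semantic and structural integrity.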
Abstract: Prompt compression methods enhance the efficiency of Large Language Models (LLMs) and minimize the cost by reducing the length of input context. The goal of prompt compression is to shorten the LLM prompt while maintaining a high generation quality. However, existing solutions, mainly based on token removal, face challenges such as information loss and structural incoherence, like missing grammar elements in a sentence, or incomplete word phrases after token removal. Such challenges limit the final generation quality of the LLM. To overcome these limitations, we present a novel generative prompt compression method. Unlike the existing token removal methods, our method centers on a chunking-and-summarization mechanism. Specifically, our method splits the prompt into semantically coherent chunks and rewrites the chunks to be more concise. The chunks are finally reconstructed into a meaningful prompt. We design several optimization techniques for the mechanism, including optimized semantic chunking, outlier chunk handling, dynamic compression ratio, compression prioritization, and keyword maintaining. These techniques effectively improve the identification and preservation of critical information and coherence among texts, and provide finer-grained control of the compression ratio. We conduct extensive evaluation on question-answering and summarization tasks, with datasets covering multiple domains. The evaluation shows our method achieves significantly better compression quality and higher stability than the state-of-the-art methods, especially under high compression ratios, which demonstrates the effectiveness and practicality of our method.
[14] User-Assistant Bias in LLMs
Xu Pan,Jingxuan Fan,Zidi Xiong,Ely Hahami,Jorin Overwiening,Ziqian Xie
Main category: cs.CL
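TL;DR: In multi-turn conversations, large language models (LLMs) can favor either the user's or the assistant's information, a tendency the authors call user-assistant bias. They introduce UserAssist, an 8k multi-turn conversation dataset, and use it to benchmark and adjust this bias in 26 commercial and 26 open-weight models. Commercial models show varying degrees of user bias; instruction-tuned open-weight models show significant user bias, while reasoning models show weak user bias. Fine-tuning experiments indicate that human preference alignment increases user bias and that training on chain-of-thought reasoning traces decreases it. Direct preference optimization (DPO) can shift the bias in either direction, and the effect generalizes well.
Details
Motivation: In multi-turn conversations, LLMs may rely too heavily on either the user's information or their own, leading to overly agreeable or overly stubborn behavior. Understanding and controlling this bias requires a systematic evaluation method and tooling.
Contribution: 1) Formalizes the concept of user-assistant bias; 2) releases the UserAssist dataset; 3) benchmarks how the bias differs across commercial and open-weight models; 4) identifies ways to adjust the bias, such as DPO and chain-of-thought training.
Method: 1) Builds the UserAssist dataset; 2) benchmarks 26 commercial and 26 open-weight models; 3) analyzes the source of the bias through fine-tuning experiments; 4) adjusts the bias with DPO.
Result: User bias varies across commercial models, is significant in instruction-tuned open-weight models, and is weak in reasoning models. Human preference alignment increases the bias and chain-of-thought training decreases it. DPO adjusts the bias effectively and generalizes well.
Insight: User-assistant bias is closely tied to the training recipe, with preference alignment and reasoning training pushing in opposite directions. Bias adjustment techniques can be used to detect and correct abnormal model behavior.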
Abstract: Large language models (LLMs) can bias towards relying on their own or the user’s information in chat history, leading to overly stubborn or agreeable behaviors in multi-turn conversations. In this paper, we formalize this model characteristic as user-assistant bias and introduce an 8k multi-turn conversation dataset $\textbf{UserAssist}$, which we use to benchmark, understand and manipulate the user-assistant bias in frontier LLMs. Leveraging $\textbf{UserAssist-test}$, we first benchmark the user-assistant bias of 26 commercial and 26 open-weight models. Commercial models show various levels of user bias. Evaluation on open-weight models reveals significant user bias in the instruction-tuned models, and weak user bias in reasoning (or reasoning-distilled) models. We then perform controlled fine-tuning experiments to pinpoint the post-training recipe contributing to these bias shifts: human preference alignment increases user bias, while training on chain-of-thought reasoning traces decreases it. Finally, we demonstrate that user-assistant bias can be bidirectionally adjusted by performing direct preference optimization (DPO) on $\textbf{UserAssist-train}$, and generalizes well to both in-domain and out-of-domain conversations. Our results provide insights into how the LLM integrates information from different sources, and also a viable way to detect and control model abnormalities.
[15] Enhancing Cryptocurrency Sentiment Analysis with Multimodal Features
Chenghao Liu,Aniket Mahanti,Ranesh Naha,Guanghao Wang,Erwann Sbai
Main category: cs.CL
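TL;DR: Using multimodal analysis, the paper compares how TikTok and Twitter sentiment affect the cryptocurrency market. Video content has more influence on short-term market trends, text content aligns more closely with long-term dynamics, and integrating cross-platform signals improves forecasting accuracy.
Details
Motivation: As cryptocurrencies grow in popularity, understanding how social media shapes market sentiment becomes important. Prior work has focused mainly on text data (such as Twitter), while the emotional and contextual information in video content remains underexplored.
Contribution: Presents a multimodal analysis framework showing that TikTok video sentiment significantly affects the short-term market and that Twitter text sentiment tracks long-term dynamics, and that integrating cross-platform sentiment signals improves forecasting performance.
Method: Uses large language models to extract sentiment features from TikTok videos and Twitter text, then analyzes their dynamic dependencies and spillover effects with cryptocurrency market indicators.
Result: TikTok video sentiment significantly influences speculative assets and short-term trends, while Twitter text sentiment is more consistent with long-term dynamics. Cross-platform integration improves forecasting accuracy by up to 20%.
Insight: Video content, with its richer emotional expression, has distinct value for short-term market forecasting, while text data suits long-term analysis. A multimodal approach captures market sentiment more completely.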
Abstract: As cryptocurrencies gain popularity, the digital asset marketplace becomes increasingly significant. Understanding social media signals offers valuable insights into investor sentiment and market dynamics. Prior research has predominantly focused on text-based platforms such as Twitter. However, video content remains underexplored, despite potentially containing richer emotional and contextual sentiment that is not fully captured by text alone. In this study, we present a multimodal analysis comparing TikTok and Twitter sentiment, using large language models to extract insights from both video and text data. We investigate the dynamic dependencies and spillover effects between social media sentiment and cryptocurrency market indicators. Our results reveal that TikTok’s video-based sentiment significantly influences speculative assets and short-term market trends, while Twitter’s text-based sentiment aligns more closely with long-term dynamics. Notably, the integration of cross-platform sentiment signals improves forecasting accuracy by up to 20%.
[16] Embarrassed to observe: The effects of directive language in brand conversation
Andria Andriuzzi,Géraldine Michel
Main category: cs.CL
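TL;DR: The study shows that when brands use directive language with consumers on social media, observing consumers feel vicarious embarrassment and engage less, especially in nonproduct-centered conversations, although a strong brand relationship mitigates this negative effect.
Details
Motivation: To investigate how observing consumers react when a brand uses directive language in social media interactions with other consumers, and the psychological mechanism behind that reaction.
Contribution: Reveals the negative effect of directive language in brand conversations on observing consumers, explains the mechanism through Goffman's facework theory, and highlights the moderating roles of conversation content and brand relationship.
Method: One field study and three online experiments.
Result: Directive language triggers vicarious embarrassment in observing consumers and lowers engagement. The negative effect is stronger in nonproduct-centered conversations, but brand relationship strength mitigates it.
Insight: Conversation content (product-centered vs. nonproduct-centered) and brand relationship strength are key moderators, with direct implications for brands' social media management.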
Abstract: In social media, marketers attempt to influence consumers by using directive language, that is, expressions designed to get consumers to take action. While the literature has shown that directive messages in advertising have mixed results for recipients, we know little about the effects of directive brand language on consumers who see brands interacting with other consumers in social media conversations. On the basis of a field study and three online experiments, this study shows that directive language in brand conversation has a detrimental downstream effect on engagement of consumers who observe such exchanges. Specifically, in line with Goffman’s facework theory, because a brand that encourages consumers to react could be perceived as face-threatening, consumers who see a brand interacting with others in a directive way may feel vicarious embarrassment and engage less (compared with a conversation without directive language). In addition, we find that when the conversation is nonproduct-centered (vs. product-centered), consumers expect more freedom, as in mundane conversations, even for others; therefore, directive language has a stronger negative effect. However, in this context, the strength of the brand relationship mitigates this effect. Thus, this study contributes to the literature on directive language and brand-consumer interactions by highlighting the importance of context in interactive communication, with direct relevance for social media and brand management.
[17] Mini-Omni-Reasoner: Token-Level Thinking-in-Speaking in Large Speech Models
Zhifei Xie,Ziyang Ma,Zihang Liu,Kaiyu Pang,Hongyu Li,Jialin Zhang,Yue Liao,Deheng Ye,Chunyan Miao,Shuicheng Yan
Main category: cs.CL
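TL;DR: Mini-Omni-Reasoner is a framework that lets speech models "think while speaking": it interleaves reasoning with spoken output, cutting latency in real-time interaction while improving reasoning accuracy.
Details
Motivation: Existing large speech models (LSMs) typically "think before speaking": no speech is produced until reasoning completes, which introduces notable latency. The paper aims to remove that latency while preserving reasoning accuracy and natural-sounding speech.
Contribution: (1) A "Thinking-in-Speaking" paradigm that interleaves reasoning and spoken output at the token level; (2) a hierarchical Thinker-Talker architecture; (3) Spoken-Math-Problems-3M, a large-scale dataset for learning interleaved reasoning and response.
Method: (1) Interleave silent reasoning tokens with spoken response tokens at the token level; (2) enforce local semantic alignment so that each response token is informed by the reasoning that precedes it; (3) build on a Thinker-Talker architecture to produce fluent, logically grounded speech.
Result: On the Spoken-MQA benchmark, the model gains +19.1% in arithmetic reasoning and +6.4% in contextual understanding, with shorter outputs and zero decoding latency.
Insight: Interleaving reasoning with speech generation reduces latency and improves interaction efficiency while keeping speech natural and logically grounded, offering a new direction for real-time spoken interaction.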
Abstract: Reasoning is essential for effective communication and decision-making. While recent advances in LLMs and MLLMs have shown that incorporating explicit reasoning significantly improves understanding and generalization, reasoning in LSMs remains in a nascent stage. Early efforts attempt to transfer the “Thinking-before-Speaking” paradigm from textual models to speech. However, this sequential formulation introduces notable latency, as spoken responses are delayed until reasoning is fully completed, impairing real-time interaction and communication efficiency. To address this, we propose Mini-Omni-Reasoner, a framework that enables reasoning within speech via a novel “Thinking-in-Speaking” formulation. Rather than completing reasoning before producing any verbal output, Mini-Omni-Reasoner interleaves silent reasoning tokens with spoken response tokens at the token level. This design allows continuous speech generation while embedding structured internal reasoning, leveraging the model’s high-frequency token processing capability. Although interleaved, local semantic alignment is enforced to ensure that each response token is informed by its preceding reasoning. To support this framework, we introduce Spoken-Math-Problems-3M, a large-scale dataset tailored for interleaved reasoning and response. The dataset ensures that verbal tokens consistently follow relevant reasoning content, enabling accurate and efficient learning of speech-coupled reasoning. Built on a hierarchical Thinker-Talker architecture, Mini-Omni-Reasoner delivers fluent yet logically grounded spoken responses, maintaining both naturalness and precision. On the Spoken-MQA benchmark, it achieves a +19.1% gain in arithmetic reasoning and +6.4% in contextual understanding, with shorter outputs and zero decoding latency.
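The token-level interleaving described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `interleave`, the ("think"/"speak") tags, and the 2:1 reasoning-to-speech ratio are all assumptions made for illustration, since the abstract does not state the schedule it uses.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Tuple


def interleave(
    reasoning: Iterable[str],
    speech: Iterable[str],
    think_per_step: int = 2,   # hypothetical ratio, not taken from the paper
    speak_per_step: int = 1,   # hypothetical ratio, not taken from the paper
) -> Iterator[Tuple[str, str]]:
    """Yield (kind, token) pairs, alternating silent reasoning and spoken tokens.

    Reasoning tokens are always emitted before the speech tokens of the same
    step, so each spoken token is preceded by some reasoning.
    """
    r, s = iter(reasoning), iter(speech)
    while True:
        think: List[str] = list(islice(r, think_per_step))
        speak: List[str] = list(islice(s, speak_per_step))
        if not think and not speak:
            return  # both streams are exhausted
        for tok in think:
            yield ("think", tok)
        for tok in speak:
            yield ("speak", tok)


stream = list(interleave(["r1", "r2", "r3", "r4"], ["s1", "s2"]))
# Only the "speak" tokens would be vocalized; "think" tokens stay silent.
spoken = [tok for kind, tok in stream if kind == "speak"]
```

In this toy schedule the first spoken token appears after two reasoning tokens rather than after the whole reasoning trace, which is the latency argument the abstract makes against "Thinking-before-Speaking".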
[18] DAIQ: Auditing Demographic Attribute Inference from Question in LLMs
Srikant Panda,Hitesh Laxmichand Patel,Shahad Al-Khalifa,Amit Agarwal,Hend Al-Khalifa,Sharefah Al-Ghamdi
Main category: cs.CL
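TL;DR: The paper introduces DAIQ, a task and framework for auditing how language models infer users' demographic attributes from questions that contain no explicit demographic cues. It exposes a systemic risk across LLMs and develops a prompt-based guardrail that reduces identity inference.
Details
Motivation: Language models may implicitly infer demographic attributes such as gender or race from a question alone. This violates expectations of neutrality, infers unintended information, and encodes stereotypes that undermine fairness.
Contribution: (1) The DAIQ task and framework for systematically auditing demographic attribute inference in LLMs; (2) evidence that the risk is widespread across both open- and closed-source LLMs; (3) a prompt-based guardrail that mitigates it.
Method: Curated neutral queries, systematic prompting, and both quantitative and qualitative analysis of how LLMs infer demographic attributes from questions.
Result: Both open- and closed-source LLMs assign demographic labels based solely on question phrasing. The behavior is prevalent and consistent across models, and can reinforce societal stereotypes and propagate harm.
Insight: Implicit demographic inference is a systemic risk that can erode privacy, fairness, and trust; technical interventions such as prompt-based guardrails are needed to align model behavior with responsible-AI goals.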
Abstract: Large Language Models (LLMs) are known to reflect social biases when demographic attributes, such as gender or race, are explicitly present in the input. But even in their absence, these models still infer user identities based solely on question phrasing. This subtle behavior has received far less attention, yet poses serious risks: it violates expectations of neutrality, infers unintended demographic information, and encodes stereotypes that undermine fairness in various domains including healthcare, finance and education. We introduce Demographic Attribute Inference from Questions (DAIQ), a task and framework for auditing an overlooked failure mode in language models: inferring user demographic attributes from questions that lack explicit demographic cues. Our approach leverages curated neutral queries, systematic prompting, and both quantitative and qualitative analysis to uncover how models infer demographic information. We show that both open- and closed-source LLMs do assign demographic labels based solely on question phrasing. The prevalence and consistency of demographic inferences across diverse models reveal a systemic and underacknowledged risk: LLMs can fabricate demographic identities, reinforce societal stereotypes, and propagate harms that erode privacy, fairness, and trust, posing a broader threat to social equity and responsible AI deployment. To mitigate this, we develop a prompt-based guardrail that substantially reduces identity inference and helps align model behavior with fairness and privacy objectives.
[19] Who’s Asking? Investigating Bias Through the Lens of Disability Framed Queries in LLMs
Srikant Panda,Vishnu Hari,Kalpana Panda,Amit Agarwal,Hitesh Laxmichand Patel
Main category: cs.CL
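TL;DR: The first systematic audit of disability-conditioned demographic bias in LLMs. Models infer user traits even without explicit demographic information, disability context heavily shifts their predictions, and larger models are more prone to stereotyped reasoning.
Details
Motivation: To examine how disability cues in a query shape the demographic traits LLMs infer about users, and to expose the neglect of disability inclusion in current alignment strategies.
Contribution: The first systematic evaluation framework for demographic bias under disability context, together with the finding that larger models are both more sensitive to disability cues and more prone to biased reasoning.
Method: A balanced template corpus pairing nine disability categories with six real-world business domains is used to prompt eight LLMs to predict five demographic attributes under neutral and disability-aware conditions.
Result: Models give a definitive demographic guess in up to 97% of cases. Disability context heavily shifts predicted attribute distributions, and larger models are more sensitive to disability cues while also showing stronger bias.
Insight: Current alignment strategies have critical blind spots; the authors recommend abstention calibration and counterfactual fine-tuning to curb unwarranted demographic inference and stereotype amplification.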
Abstract: Large Language Models (LLMs) routinely infer users' demographic traits from phrasing alone, which can result in biased responses, even when no explicit demographic information is provided. The role of disability cues in shaping these inferences remains largely uncharted. Thus, we present the first systematic audit of disability-conditioned demographic bias across eight state-of-the-art instruction-tuned LLMs ranging from 3B to 72B parameters. Using a balanced template corpus that pairs nine disability categories with six real-world business domains, we prompt each model to predict five demographic attributes (gender, socioeconomic status, education, cultural background, and locality) under both neutral and disability-aware conditions. Across a varied set of prompts, models deliver a definitive demographic guess in up to 97% of cases, exposing a strong tendency to make arbitrary inferences with no clear justification. Disability context heavily shifts predicted attribute distributions, and domain context can further amplify these deviations. We observe that larger models are simultaneously more sensitive to disability cues and more prone to biased reasoning, indicating that scale alone does not mitigate stereotype amplification. Our findings reveal persistent intersections between ableism and other demographic stereotypes, pinpointing critical blind spots in current alignment strategies. We release our evaluation framework and results to encourage disability-inclusive benchmarking and recommend integrating abstention calibration and counterfactual fine-tuning to curb unwarranted demographic inference. Code and data will be released on acceptance.
[20] Alvorada-Bench: Can Language Models Solve Brazilian University Entrance Exams?
Henrique Godoy
Main category: cs.CL
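TL;DR: Alvorada-Bench is a benchmark of 4,515 questions from Brazilian university entrance exams for evaluating language models in Portuguese. Twenty models are tested under zero-shot, role-playing, and chain-of-thought prompting; the top models exceed 94% accuracy overall, but accuracy drops on Mathematics and on the engineering-oriented IME and ITA exams.
Details
Motivation: Most language model evaluation remains English-centric even though the models are increasingly used in Brazil. Alvorada-Bench aims to fill this gap by testing models against the Brazilian education system.
Contribution: Introduces the Alvorada-Bench benchmark, built from Brazilian university entrance exam questions, and evaluates twenty models under several prompting methods, exposing their limits in multi-step reasoning.
Method: Models answer the 4,515 questions under zero-shot, role-playing, and chain-of-thought prompting, and self-report their confidence, perceived difficulty, and Bloom level for each response.
Result: The top models exceed 94% accuracy overall, but accuracy declines on Mathematics and on the IME and ITA exams. Confidence is well calibrated and correlates with perceived difficulty. The cost-accuracy analysis shows that high accuracy is achievable at under $2 per 1K tokens.
Insight: Multi-step reasoning remains a persistent weakness, yet on ENEM 2024 even the weakest system (GPT-4.1 Nano) underperforms humans only in Mathematics. Models can also assess their own certainty accurately.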
Abstract: Language models are increasingly used in Brazil, but most evaluation remains English-centric. This paper presents Alvorada-Bench, a 4,515-question, text-only benchmark drawn from five Brazilian university entrance examinations. We evaluate twenty models under zero-shot, role-playing, and chain-of-thought prompting, producing 270,900 responses with structured self-reports of confidence, perceived difficulty, and Bloom level. The top models exceed 94% accuracy overall, but accuracy declines on Mathematics and on the engineering-oriented IME and ITA exams, indicating persistent weaknesses in multi-step reasoning. Confidence is well calibrated and correlates with perceived difficulty, revealing that models can accurately assess their own certainty. A cost-accuracy analysis shows that high accuracy is achievable at under $2 per 1K tokens. On ENEM 2024, the top model (O3) achieved perfect scores on Languages questions, while even the weakest system (GPT-4.1 Nano) underperforms humans only in Mathematics. Through exams that distill decades of Brazilian educational priorities and assess millions of students yearly, Alvorada-Bench establishes whether language models can navigate the intersection of language, culture, and reasoning that defines academic readiness in Brazil.
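The response count in the abstract can be checked with a line of arithmetic. The sketch below assumes, as the reported total suggests, that every model answers every question once under each of the three prompting styles; the variable names are illustrative only.

```python
# Figures taken from the abstract.
questions = 4_515
models = 20
prompting_styles = ["zero-shot", "role-playing", "chain-of-thought"]

# Assumption: one response per (question, model, prompting style) triple.
total_responses = questions * models * len(prompting_styles)
print(total_responses)  # 270900, matching the 270,900 responses reported
```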
[21] Lexical Hints of Accuracy in LLM Reasoning Chains
Arne Vanhoyweghen,Brecht Verbeken,Andres Algaba,Vincent Ginis
Main category: cs.CL
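TL;DR: The paper tests whether lexical features of an LLM's reasoning chain, such as uncertainty words and sentiment volatility, predict whether its answer is correct. Lexical markers of uncertainty are the strongest indicator of an error, while chain length is informative only on a benchmark of intermediate difficulty.
Details
Motivation: LLMs often report high self-confidence on benchmarks where their accuracy is low, reflecting poor calibration. The authors ask whether measurable properties of the Chain-of-Thought (CoT), such as length, sentiment volatility, and lexical hints, signal the model's internal confidence, in support of safer deployment.
Contribution: A lightweight post-hoc calibration signal that uses lexical uncertainty markers and sentiment shifts in the CoT to predict answer correctness.
Method: Three feature classes are analyzed: CoT length, intra-CoT sentiment volatility, and lexical hints (including hedging words). Experiments use DeepSeek-R1 and Claude 3.7 Sonnet on Humanity's Last Exam (HLE) and Omni-MATH.
Result: Lexical uncertainty markers (e.g., "guess", "stuck") are the strongest predictors of an incorrect response; sentiment shifts give a weaker signal; CoT length is informative only on the intermediate-difficulty Omni-MATH. Uncertainty indicators are more salient than high-confidence markers.
Insight: Lexical features offer a simple way to predict LLM errors, especially on low-accuracy benchmarks, which matters for safe deployment.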
Abstract: Fine-tuning Large Language Models (LLMs) with reinforcement learning to generate an explicit Chain-of-Thought (CoT) before answering produces models that consistently raise overall performance on code, math, and general-knowledge benchmarks. However, on benchmarks where LLMs currently achieve low accuracy, such as Humanity’s Last Exam (HLE), they often report high self-confidence, reflecting poor calibration. Here, we test whether measurable properties of the CoT provide reliable signals of an LLM’s internal confidence in its answers. We analyze three feature classes: (i) CoT length, (ii) intra-CoT sentiment volatility, and (iii) lexicographic hints, including hedging words. Using DeepSeek-R1 and Claude 3.7 Sonnet on both Humanity’s Last Exam (HLE), a frontier benchmark with very low accuracy, and Omni-MATH, a saturated benchmark of moderate difficulty, we find that lexical markers of uncertainty (e.g., $\textit{guess}$, $\textit{stuck}$, $\textit{hard}$) in the CoT are the strongest indicators of an incorrect response, while shifts in the CoT sentiment provide a weaker but complementary signal. CoT length is informative only on Omni-MATH, where accuracy is already high ($\approx 70\%$), and carries no signal on the harder HLE ($\approx 9\%$), indicating that CoT length predicts correctness only in the intermediate-difficulty benchmarks, i.e., inside the model’s demonstrated capability, but still below saturation. Finally, we find that uncertainty indicators in the CoT are consistently more salient than high-confidence markers, making errors easier to predict than correct responses. Our findings support a lightweight post-hoc calibration signal that complements unreliable self-reported probabilities and supports safer deployment of LLMs.
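A minimal sketch of the kind of lexical feature the abstract describes: counting uncertainty markers in a reasoning chain. The marker list contains only the three example words quoted in the abstract; the function names, the whole-word matching rule, and the per-word rate are assumptions for illustration, not the authors' feature extractor.

```python
import re
from collections import Counter

# The three example markers quoted in the abstract; the paper's full lexicon
# is not given there, so this list is deliberately incomplete.
UNCERTAINTY_MARKERS = ("guess", "stuck", "hard")


def tokenize(text: str) -> list:
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def count_markers(cot: str, markers=UNCERTAINTY_MARKERS) -> Counter:
    """Count exact whole-word occurrences of each marker in a CoT string."""
    return Counter(w for w in tokenize(cot) if w in markers)


def uncertainty_rate(cot: str) -> float:
    """Fraction of word tokens that are uncertainty markers (0.0 if empty)."""
    words = tokenize(cot)
    if not words:
        return 0.0
    return sum(count_markers(cot).values()) / len(words)


cot = "This is hard. I am stuck, so I will guess the answer is 4."
marker_counts = count_markers(cot)
rate = uncertainty_rate(cot)
```

Exact matching means inflected forms such as "guessing" are not counted; a fuller version would need stemming or a larger lexicon. How such a rate maps to a probability of error is an empirical question that the sketch does not answer.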
[22] Coarse-to-Fine Personalized LLM Impressions for Streamlined Radiology Reports
Chengbo Sun,Hui Yi Leong,Lei Li
Main category: cs.CL
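TL;DR: The paper proposes a coarse-to-fine framework that uses open-source large language models (LLMs) to automatically generate and personalize the Impression section of radiology reports, with the aim of reducing radiologists' workload.
Details
Motivation: Writing the "Impression" section of radiology reports by hand is a primary driver of radiologist burnout, so an automated approach is needed.
Contribution: A coarse-to-fine generation framework built on open-source LLMs that personalizes impressions through machine learning and reinforcement learning from human feedback (RLHF) while ensuring factual accuracy.
Method: (1) Fine-tune LLaMA and Mistral models on a large dataset of reports; (2) generate a draft impression first, then refine it with RLHF to match the individual radiologist's style.
Result: The approach is designed to reduce administrative workload and improve reporting efficiency while maintaining high standards of clinical precision; the abstract reports no quantitative results.
Insight: Combining LLMs with RLHF makes personalized, efficient automatic generation of medical report sections feasible and may serve as a reference for similar domains.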
Abstract: The manual creation of the “Impression” section in radiology reports is a primary driver of radiologist burnout. To address this challenge, we propose a coarse-to-fine framework that leverages open-source large language models (LLMs) to automatically generate and personalize impressions from clinical findings. The system first produces a draft impression and then refines it using machine learning and reinforcement learning from human feedback (RLHF) to align with individual radiologists’ styles while ensuring factual accuracy. We fine-tune LLaMA and Mistral models on a large dataset of reports from the University of Chicago Medicine. Our approach is designed to significantly reduce administrative workload and improve reporting efficiency while maintaining high standards of clinical precision.
[23] CyPortQA: Benchmarking Multimodal Large Language Models for Cyclone Preparedness in Port Operation
Chenchen Kuai,Chenhao Wu,Yang Zhou,Xiubin Bruce Wang,Tianbao Yang,Zhengzhong Tu,Zihao Li,Yunlong Zhang
Main category: cs.CL
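TL;DR: The paper introduces CyPortQA, the first multimodal benchmark for cyclone preparedness in port operations, and uses it to evaluate multimodal large language models (MLLMs).
Details
Motivation: As tropical cyclones intensify and track forecasts become more uncertain, port operators must quickly turn multimodal forecast products into actionable guidance, yet the accuracy and reliability of MLLMs in this setting have not been rigorously evaluated.
Contribution: The CyPortQA benchmark, comprising 2,917 real-world scenarios and 117,178 structured question-answer pairs, for evaluating MLLMs on port cyclone preparedness.
Method: An automated pipeline expands multisource data (tropical cyclone products, port operational impact records, and port condition bulletins) into question-answer pairs, on which a range of MLLMs is evaluated.
Result: MLLMs show promise in situation understanding but still face considerable challenges in reasoning tasks such as potential impact estimation and decision reasoning.
Insight: MLLMs are promising for cyclone preparedness, but their reasoning must improve before they can be relied on in practice.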
Abstract: As tropical cyclones intensify and track forecasts become increasingly uncertain, U.S. ports face heightened supply-chain risk under extreme weather conditions. Port operators need to rapidly synthesize diverse multimodal forecast products, such as probabilistic wind maps, track cones, and official advisories, into clear, actionable guidance as cyclones approach. Multimodal large language models (MLLMs) offer a powerful means to integrate these heterogeneous data sources alongside broader contextual knowledge, yet their accuracy and reliability in the specific context of port cyclone preparedness have not been rigorously evaluated. To fill this gap, we introduce CyPortQA, the first multimodal benchmark tailored to port operations under cyclone threat. CyPortQA assembles 2,917 real-world disruption scenarios from 2015 through 2023, spanning 145 U.S. principal ports and 90 named storms. Each scenario fuses multisource data (i.e., tropical cyclone products, port operational impact records, and port condition bulletins) and is expanded through an automated pipeline into 117,178 structured question-answer pairs. Using this benchmark, we conduct extensive experiments on diverse MLLMs, including both open-source and proprietary models. MLLMs demonstrate great potential in situation understanding but still face considerable challenges in reasoning tasks, including potential impact estimation and decision reasoning.
[24] MedCoT-RAG: Causal Chain-of-Thought RAG for Medical Question Answering
Ziyu Wang,Elahe Khatibi,Amir M. Rahmani
Main category: cs.CL
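TL;DR: MedCoT-RAG is a framework for medical question answering that combines causal-aware document retrieval with structured chain-of-thought prompting, outperforming existing methods on complex medical tasks.
Details
Motivation: Large language models (LLMs) suffer from hallucinations and shallow reasoning in medical question answering, and conventional retrieval-augmented generation (RAG) lacks the structured reasoning that clinical decision support requires.
Contribution: The MedCoT-RAG framework, which pairs causal-aware retrieval with chain-of-thought prompting tailored to medical workflows, improving the accuracy, interpretability, and consistency of medical QA.
Method: (1) Causal-aware document retrieval; (2) structured chain-of-thought prompts that mirror clinical diagnostic logic.
Result: On three medical QA benchmarks, MedCoT-RAG improves by up to 10.3% over vanilla RAG and up to 6.4% over advanced domain-adapted methods.
Insight: Modeling the clinical reasoning process improves performance on complex medical tasks, supporting the value of structured causal reasoning for medical QA.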
Abstract: Large language models (LLMs) have shown promise in medical question answering but often struggle with hallucinations and shallow reasoning, particularly in tasks requiring nuanced clinical understanding. Retrieval-augmented generation (RAG) offers a practical and privacy-preserving way to enhance LLMs with external medical knowledge. However, most existing approaches rely on surface-level semantic retrieval and lack the structured reasoning needed for clinical decision support. We introduce MedCoT-RAG, a domain-specific framework that combines causal-aware document retrieval with structured chain-of-thought prompting tailored to medical workflows. This design enables models to retrieve evidence aligned with diagnostic logic and generate step-by-step causal reasoning reflective of real-world clinical practice. Experiments on three diverse medical QA benchmarks show that MedCoT-RAG outperforms strong baselines by up to 10.3% over vanilla RAG and 6.4% over advanced domain-adapted methods, improving accuracy, interpretability, and consistency in complex medical tasks.
[25] DocHop-QA: Towards Multi-Hop Reasoning over Multimodal Document Collections
Jiwon Park,Seohyun Pyeon,Jinwoo Kim,Rina Carines Cabal,Yihao Ding,Soyeon Caren Han
Main category: cs.CL
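TL;DR: DocHop-QA is a large-scale benchmark of 11,379 QA instances for multimodal, multi-document, multi-hop question answering, supporting structured reasoning across documents and modalities.
Details
Motivation: Existing QA benchmarks are mostly confined to a single document or a single modality and fail to reflect the complexity of real-world information seeking. DocHop-QA aims to fill this gap.
Contribution: (1) The DocHop-QA benchmark; (2) support for open-ended multi-hop reasoning; (3) diverse information formats (text, tables, and layout).
Method: An LLM-driven pipeline builds the dataset from PubMed scientific documents, generating questions through semantic similarity and layout-aware evidence synthesis.
Result: Four tasks spanning structured index prediction, generative answering, and multimodal integration demonstrate DocHop-QA's capacity to support complex multimodal reasoning across documents.
Insight: DocHop-QA offers a more realistic evaluation setting for multimodal cross-document reasoning and should advance work on complex QA tasks.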
Abstract: Despite recent advances in large language models (LLMs), most QA benchmarks are still confined to single-paragraph or single-document settings, failing to capture the complexity of real-world information-seeking tasks. Practical QA often requires multi-hop reasoning over information distributed across multiple documents, modalities, and structural formats. Although prior datasets made progress in this area, they rely heavily on Wikipedia-based content and unimodal plain text, with shallow reasoning paths that typically produce brief phrase-level or single-sentence answers, thus limiting their realism and generalizability. We propose DocHop-QA, a large-scale benchmark comprising 11,379 QA instances for multimodal, multi-document, multi-hop question answering. Constructed from publicly available scientific documents sourced from PubMed, DocHop-QA is domain-agnostic and incorporates diverse information formats, including textual passages, tables, and structural layout cues. Unlike existing datasets, DocHop-QA does not rely on explicitly hyperlinked documents; instead, it supports open-ended reasoning through semantic similarity and layout-aware evidence synthesis. To scale realistic QA construction, we designed an LLM-driven pipeline grounded in 11 high-frequency scientific question concepts. We evaluated DocHop-QA through four tasks spanning structured index prediction, generative answering, and multimodal integration, reflecting both discriminative and generative paradigms. These tasks demonstrate DocHop-QA’s capacity to support complex, multimodal reasoning across multiple documents.
[26] MGSC: A Multi-granularity Consistency Framework for Robust End-to-end ASR
Xuwen Yang
Main category: cs.CL
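TL;DR: The paper proposes MGSC, a multi-granularity consistency framework that jointly regularizes macro-level sentence semantics and micro-level token alignment, improving the robustness of end-to-end ASR models in noisy environments.
Details
Motivation: End-to-end ASR models tend to produce catastrophic semantic errors in noise. The authors attribute this to an objective that penalizes only final output errors and leaves the model's internal computation unconstrained.
Contribution: The MGSC framework, together with what the authors describe as the first evidence of a strong synergy between macro-level semantic consistency and micro-level token-alignment consistency when the two are optimized jointly.
Method: Multi-Granularity Soft Consistency (MGSC) regularizes macro-level sentence semantics and micro-level token alignment simultaneously, enforcing consistency in the model's internal computation.
Result: On a public dataset, MGSC reduces the average Character Error Rate by a relative 8.7% across diverse noise conditions, mainly by preventing severe meaning-altering mistakes.
Insight: Enforcing internal consistency is a key step toward more robust and trustworthy AI systems.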
Abstract: End-to-end ASR models, despite their success on benchmarks, often produce catastrophic semantic errors in noisy environments. We attribute this fragility to the prevailing ‘direct mapping’ objective, which solely penalizes final output errors while leaving the model’s internal computational process unconstrained. To address this, we introduce the Multi-Granularity Soft Consistency (MGSC) framework, a model-agnostic, plug-and-play module that enforces internal self-consistency by simultaneously regularizing macro-level sentence semantics and micro-level token alignment. Crucially, our work is the first to uncover a powerful synergy between these two consistency granularities: their joint optimization yields robustness gains that significantly surpass the sum of their individual contributions. On a public dataset, MGSC reduces the average Character Error Rate by a relative 8.7% across diverse noise conditions, primarily by preventing severe meaning-altering mistakes. Our work demonstrates that enforcing internal consistency is a crucial step towards building more robust and trustworthy AI.
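Because the abstract reports a relative reduction, a short worked example may help separate it from an absolute one. The baseline Character Error Rate below is a hypothetical number chosen only for illustration; the abstract does not state the baseline, and the function name is likewise an assumption.

```python
def apply_relative_reduction(baseline_cer: float, relative_reduction: float) -> float:
    """Return the error rate after reducing the baseline by a relative fraction."""
    return baseline_cer * (1.0 - relative_reduction)


baseline_cer = 0.10        # hypothetical baseline CER (10%), not from the paper
relative_reduction = 0.087  # the relative 8.7% reported in the abstract

new_cer = apply_relative_reduction(baseline_cer, relative_reduction)
absolute_drop = baseline_cer - new_cer
```

Under this assumed baseline, the error rate would fall from 10% to about 9.13%, an absolute drop of roughly 0.87 percentage points rather than 8.7 points.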
[27] QU-NLP at QIAS 2025 Shared Task: A Two-Phase LLM Fine-Tuning and Retrieval-Augmented Generation Approach for Islamic Inheritance Reasoning
Mohammad AL-Smadi
Main category: cs.CL
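TL;DR: For SubTask 1 of QIAS 2025, the QU-NLP team combines LLM fine-tuning with retrieval-augmented generation (RAG) for Islamic inheritance reasoning, reaching 85.8% accuracy and outperforming larger models such as GPT 4.5.
Details
Motivation: Islamic inheritance law involves complex rules and calculations, and conventional large language models (LLMs) perform poorly on it in zero-shot settings. The team aims to improve reasoning through domain fine-tuning and retrieval augmentation.
Contribution: (1) Fine-tuning the Fanar-1-9B model with LoRA and combining it with RAG, optimized for Islamic inheritance reasoning; (2) strong test results, notably 97.6% accuracy on advanced reasoning, where the system outperforms frontier models such as Gemini 2.5 and OpenAI's o3.
Method: (1) Fine-tune Fanar-1-9B with LoRA; (2) integrate it into a RAG pipeline that combines retrieval and generation; (3) the task covers comprehending inheritance scenarios, identifying eligible heirs, applying fixed-share rules, and performing precise calculations.
Result: The system reaches 85.8% accuracy on the final test and is strongest on advanced reasoning (97.6%), outperforming several competing models evaluated with zero-shot prompting.
Insight: Combining domain-specific fine-tuning (LoRA) with retrieval grounding (RAG) lets a mid-scale Arabic LLM surpass general-purpose frontier models on a specialized task.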
Abstract: This paper presents our approach and results for SubTask 1: Islamic Inheritance Reasoning at QIAS 2025, a shared task focused on evaluating Large Language Models (LLMs) in understanding and reasoning within Islamic inheritance knowledge. We fine-tuned the Fanar-1-9B causal language model using Low-Rank Adaptation (LoRA) and integrated it into a Retrieval-Augmented Generation (RAG) pipeline. Our system addresses the complexities of Islamic inheritance law, including comprehending inheritance scenarios, identifying eligible heirs, applying fixed-share rules, and performing precise calculations. Our system achieved an accuracy of 0.858 in the final test, outperforming other competitive models such as GPT 4.5, LLaMA, Fanar, Mistral, and ALLaM evaluated with zero-shot prompting. Our results demonstrate that QU-NLP achieves near state-of-the-art accuracy (85.8%), excelling especially on advanced reasoning (97.6%), where it outperforms Gemini 2.5 and OpenAI’s o3. This highlights that domain-specific fine-tuning combined with retrieval grounding enables mid-scale Arabic LLMs to surpass frontier models in Islamic inheritance reasoning.
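Low-Rank Adaptation keeps a pretrained weight matrix frozen and learns a low-rank update B·A in its place, which is why it suits fine-tuning a 9B-parameter model. The sketch below only counts trainable parameters for one weight matrix; the layer dimensions and the rank are hypothetical, since the abstract does not report the LoRA configuration used.

```python
def full_update_params(d_out: int, d_in: int) -> int:
    """Trainable parameters when the whole d_out x d_in matrix is updated."""
    return d_out * d_in


def lora_update_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters for a rank-r update B (d_out x r) times A (r x d_in)."""
    return rank * (d_out + d_in)


# Hypothetical layer shape and rank, chosen only to illustrate the scale.
d_out, d_in, rank = 4096, 4096, 16

full = full_update_params(d_out, d_in)
lora = lora_update_params(d_out, d_in, rank)
fraction = lora / full  # share of the full update that LoRA actually trains
```

With these assumed values LoRA trains well under 1% of the parameters a full update of that matrix would need; the actual saving depends on which layers are adapted and on the chosen rank.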
[28] Counterspeech for Mitigating the Influence of Media Bias: Comparing Human and LLM-Generated Responses
Luyang Lin,Zijin Feng,Lingzhi Wang,Kam-Fai Wong
Main category: cs.CL
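TL;DR: The study examines how counterspeech can reduce the influence of media bias and compares responses written by humans with those generated by large language models (LLMs). Model-generated responses are more polite but lack diversity and novelty; few-shot learning and added news background information improve them.
Details
Motivation: Biased news deepens societal polarization, and offensive reader comments reinforce the bias and cause harm. Counterspeech can counter such speech effectively. This is the first study to explore counterspeech generation in the context of news articles.
Contribution: (1) A manually annotated dataset linking media bias, offensive comments, and counterspeech; (2) an analysis showing that over 70% of offensive comments support biased articles; (3) a comparison of human and model-generated counterspeech, with methods to improve the latter.
Method: (1) Build the annotated dataset; (2) analyze the relationship between offensive comments and bias; (3) compare human and LLM-generated counterspeech; (4) improve generation through few-shot learning and integration of news background information.
Result: Model-generated counterspeech is more polite but less diverse. Integrating background information and few-shot learning improves both diversity and relevance.
Insight: Counterspeech is an effective tool against bias, but model-generated counterspeech needs more diversity and novelty; news background information helps when combined with few-shot learning.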
Abstract: Biased news contributes to societal polarization and is often reinforced by hostile reader comments, constituting a vital yet often overlooked aspect of news dissemination. Our study reveals that offensive comments support biased content, amplifying bias and causing harm to targeted groups or individuals. Counterspeech is an effective approach to counter such harmful speech without violating freedom of speech, helping to limit the spread of bias. To the best of our knowledge, this is the first study to explore counterspeech generation in the context of news articles. We introduce a manually annotated dataset linking media bias, offensive comments, and counterspeech. We conduct a detailed analysis showing that over 70% of offensive comments support biased articles, amplifying bias and thus highlighting the importance of counterspeech generation. Comparing counterspeech generated by humans and large language models, we find that model-generated responses are more polite but lack novelty and diversity. Finally, we improve generated counterspeech through few-shot learning and integration of news background information, enhancing both diversity and relevance.
[29] XFinBench: Benchmarking LLMs in Complex Financial Problem Solving and Reasoning
Zhihan Zhang,Yixin Cao,Lizi Liao
Main category: cs.CL
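TL;DR: XFinBench is a new benchmark for evaluating large language models (LLMs) on complex, knowledge-intensive financial problems with multimodal context. Experiments show that the best text-only model still lags significantly behind human experts, especially in temporal reasoning and scenario planning.
Details
Motivation: Solving financial problems demands complex reasoning, multimodal data processing, and broad technical knowledge, which poses distinctive challenges for current LLMs.
Contribution: The XFinBench benchmark, with 4,235 examples across diverse finance topics, a definition of five core capabilities, and an experimental evaluation of 18 leading models.
Method: A multimodal financial problem benchmark is designed and used to analyze model performance on terminology understanding, temporal reasoning, future forecasting, scenario planning, and numerical modelling.
Result: The best text-only model (o1) reaches 67.3% overall accuracy and still trails human experts by 12.5%. Knowledge augmentation brings consistent gains only for small open-source models, and calculation and visual-context questions are the main sources of error.
Insight: Current LLMs fall clearly short on complex financial tasks, especially temporal reasoning and scenario planning, and the benefit of knowledge augmentation depends on model scale.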
Abstract: Solving financial problems demands complex reasoning, multimodal data processing, and a broad technical understanding, presenting unique challenges for current large language models (LLMs). We introduce XFinBench, a novel benchmark with 4,235 examples designed to evaluate LLMs’ ability to solve complex, knowledge-intensive financial problems across diverse graduate-level finance topics with multi-modal context. We identify five core capabilities of LLMs using XFinBench, i.e., terminology understanding, temporal reasoning, future forecasting, scenario planning, and numerical modelling. Upon XFinBench, we conduct extensive experiments on 18 leading models. The results show that o1 is the best-performing text-only model with an overall accuracy of 67.3%, but it still lags significantly behind human experts by 12.5%, especially in temporal reasoning and scenario planning capabilities. We further construct a knowledge bank with 3,032 finance terms for knowledge augmentation analysis, and find that knowledge relevant to the question brings consistent accuracy improvements only to small open-source models. Additionally, our error analysis reveals that rounding errors during calculation and blindness to the position and intersection of curves in the image are two primary issues leading to models’ poor performance on calculation and visual-context questions, respectively. Code and dataset are accessible via GitHub: https://github.com/Zhihan72/XFinBench.
[30] CARFT: Boosting LLM Reasoning via Contrastive Learning with Annotated Chain-of-Thought-based Reinforced Fine-Tuning
Wenqiao Zhu,Ji Liu,Rongjuncheng Zhang,Haipang Wu,Yulun Zhang
Main category: cs.CL
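TL;DR: The paper proposes CARFT, which combines contrastive learning with annotated Chain-of-Thought-based reinforced fine-tuning to improve the reasoning ability of large language models.
Details
Motivation: Current RL-based fine-tuning methods ignore annotated Chain-of-Thought (CoT) and sample reasoning paths unstably, leading to model collapse and degraded performance. Supervised fine-tuning, in turn, over-relies on annotated CoT and fails to exploit other potential reasoning paths.
Contribution: The CARFT method, which combines contrastive learning with reinforcement learning and uses both annotated and potential reasoning paths to improve reasoning performance and training stability.
Method: A representation is learned for each CoT, contrastive signals built on it guide fine-tuning, and an additional unsupervised learning signal stabilizes training.
Result: In experiments, CARFT shows clear advantages over baselines in robustness, performance (up to 10.15%), and efficiency (up to 30.62%).
Insight: Combining contrastive learning with reinforcement learning makes full use of annotated data while stabilizing training, improving model reasoning.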
Abstract: Reasoning capability plays a critical role in the broad applications of Large Language Models (LLMs). To enhance the reasoning performance of LLMs, diverse Reinforcement Learning (RL)-based fine-tuning approaches have been proposed to address the limited generalization capability of LLMs trained solely via Supervised Fine-Tuning (SFT). Despite their effectiveness, two major limitations hinder the advancement of LLMs. First, vanilla RL-based approaches ignore annotated Chain-of-Thought (CoT) and incorporate unstable reasoning path sampling, which typically results in model collapse, unstable training process, and suboptimal performance. Second, existing SFT approaches generally overemphasize the annotated CoT, potentially leading to performance degradation due to insufficient exploitation of potential CoT. In this paper, we propose a Contrastive learning with annotated CoT-based Reinforced Fine-Tuning approach, i.e., CARFT, to enhance the reasoning performance of LLMs while addressing the aforementioned limitations. Specifically, we propose learning a representation for each CoT. Based on this representation, we design novel contrastive signals to guide the fine-tuning process. Our approach not only fully exploits the available annotated CoT but also stabilizes the fine-tuning procedure by incorporating an additional unsupervised learning signal. We conduct comprehensive experiments and in-depth analysis with three baseline approaches, two foundation models, and two datasets to demonstrate significant advantages of CARFT in terms of robustness, performance (up to 10.15%), and efficiency (up to 30.62%). Code is available at https://github.com/WNQzhu/CARFT.
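The abstract does not spell out its contrastive signal, so the sketch below substitutes a generic InfoNCE-style loss over CoT representations to show the general idea: pull a sampled CoT's embedding toward the annotated CoT and away from others. The function names, the toy 2-D embeddings, and the temperature are all assumptions; this is not CARFT's actual objective.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(
    anchor: Sequence[float],
    positive: Sequence[float],
    negatives: List[Sequence[float]],
    temperature: float = 0.1,  # hypothetical value
) -> float:
    """Generic InfoNCE loss: low when the anchor is closest to the positive."""
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))


# Toy embeddings: the anchor stands in for an annotated CoT representation.
anchor = [1.0, 0.0]
negatives = [[0.0, 1.0]]

loss_aligned = info_nce(anchor, [0.9, 0.1], negatives)     # positive near anchor
loss_misaligned = info_nce(anchor, [0.0, 1.0], negatives)  # positive far away
```

The loss is near zero when the positive sits close to the anchor and equals log 2 when the positive is indistinguishable from the single negative, which is the gradient signal a contrastive term would contribute during fine-tuning.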
[31] DeepMEL: A Multi-Agent Collaboration Framework for Multimodal Entity Linking
Fang Wang,Tianwei Yan,Zonghao Yang,Minghao Hu,Jun Zhang,Zhunchen Luo,Xiaoying Bai
Main category: cs.CL
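TL;DR: DeepMEL is a multi-agent collaboration framework for multimodal entity linking. Its role-specialized division of labor addresses the difficulties existing methods have with cross-modal fusion and with jointly using large language models and large visual models.
Details
Motivation: Current multimodal entity linking methods suffer from incomplete contextual information, coarse cross-modal fusion, and difficulty in jointly using large language models (LLMs) and large visual models (LVMs).
Contribution: The DeepMEL framework, in which four specialized agents collaborate to achieve efficient cross-modal alignment and disambiguation, together with a dual-modal alignment path and an adaptive iteration strategy.
Method: The framework comprises four agents (Modal-Fuser, Candidate-Adapter, Entity-Clozer, and Role-Orchestrator) and uses a dual-modal alignment path and structured cloze prompts.
Result: State-of-the-art performance on five public benchmark datasets, improving accuracy (ACC) by 1%-57%.
Insight: Role specialization with dynamic coordination improves cross-modal linking, and structured cloze prompts reduce parsing complexity.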
Abstract: Multimodal Entity Linking (MEL) aims to associate textual and visual mentions with entities in a multimodal knowledge graph. Despite its importance, current methods face challenges such as incomplete contextual information, coarse cross-modal fusion, and the difficulty of jointly leveraging large language models (LLMs) and large visual models (LVMs). To address these issues, we propose DeepMEL, a novel framework based on multi-agent collaborative reasoning, which achieves efficient alignment and disambiguation of textual and visual modalities through a role-specialized division strategy. DeepMEL integrates four specialized agents, namely Modal-Fuser, Candidate-Adapter, Entity-Clozer and Role-Orchestrator, to complete end-to-end cross-modal linking through specialized roles and dynamic coordination. DeepMEL adopts a dual-modal alignment path and combines the fine-grained text semantics generated by the LLM with the structured image representation extracted by the LVM, significantly narrowing the modal gap. We design an adaptive iteration strategy that combines tool-based retrieval and semantic reasoning capabilities to dynamically optimize the candidate set and balance recall and precision. DeepMEL also unifies MEL tasks into a structured cloze prompt to reduce parsing complexity and enhance semantic comprehension. Extensive experiments on five public benchmark datasets demonstrate that DeepMEL achieves state-of-the-art performance, improving ACC by 1%-57%. Ablation studies verify the effectiveness of all modules.
[32] Less Redundancy: Boosting Practicality of Vision Language Model in Walking Assistants
Chongyang Li,Yuan Zhiqiang,Jiapei Zhang,Ying Deng,Hanbo Bi,Zexi Jia,Xiaoyue Duan,Peixiang Luo,Jinchao Zhang
Main category: cs.CL
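TL;DR: The paper proposes WalkVLM-LR, which reduces output redundancy and temporal redundancy to make vision language models more practical in walking assistance for blind and low-vision users.
Details
Motivation: About 283 million people worldwide live with visual impairments. Existing vision language models produce redundant output and redundant reminders in walking assistance, which hampers users' ability to assess their surroundings accurately.
Contribution: (1) Four human-preference-based reward functions that optimize the output; (2) an environment awareness discriminator that reduces temporal redundancy; (3) state-of-the-art results for WalkVLM-LR across all evaluation metrics.
Method: (1) Four reward functions within a GRPO-based reasoning framework optimize the output for conciseness, fluency, keyword density, and accuracy; (2) an environment awareness discriminator that shares the visual encoder assesses scene risk levels.
Result: Experiments show that WalkVLM-LR outperforms other models, particularly in output conciseness and reduced temporal redundancy.
Insight: Combining human preferences with scene risk assessment makes walking assistance models markedly more practical and efficient.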
Abstract: Approximately 283 million people worldwide live with visual impairments, motivating increasing research into leveraging Visual Language Models (VLMs) to develop effective walking assistance systems for blind and low vision individuals. However, existing VLMs in walking assistance tasks often produce outputs that contain considerable redundancy and extraneous details, adversely affecting users’ ability to accurately assess their surroundings. Moreover, these models typically lack the capability to proactively assess environmental risks and adaptively trigger reminders based on the appropriate scene, leading to excessive temporal redundancy. To mitigate output and temporal redundancy, we propose WalkVLM-LR, a walking assistance model with less redundancy. To reduce output redundancy, we introduce four human-preference-based custom reward functions within the GRPO-based reasoning framework to optimize the output in terms of conciseness, fluency, keyword density, and accuracy, thereby producing more informative and streamlined outputs. To minimize temporal redundancy, we incorporate an environment awareness discriminator, which shares the visual encoder with the VLMs to reduce redundant computations and enhance discriminative efficiency, so that WalkVLM-LR can assess scene risk levels and minimize unnecessary reminders. Experimental results demonstrate that our method achieves state-of-the-art performance across all evaluation metrics compared with other models, particularly in output conciseness and reduced temporal redundancy.
[33] CEQuest: Benchmarking Large Language Models for Construction Estimation
Yanzhao Wu,Lufan Wang,Rui Liu
Main category: cs.CL
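TL;DR: The paper introduces CEQuest, a new benchmark dataset for evaluating how well large language models answer construction-domain questions, with a focus on construction drawing interpretation and estimation; experiments expose the shortcomings of current models.
Details
Motivation: Large language models perform well on general-domain tasks, but their potential in specialized fields such as construction remains underexplored. The authors therefore build a domain-specific benchmark to drive the development of domain-specialized models. Contribution: The CEQuest benchmark dataset for evaluating LLMs in the construction domain, together with experiments that quantify the limitations of current models.
Method: Five state-of-the-art LLMs (Gemma 3, Phi4, LLaVA, Llama 3.3, and GPT-4.1) are evaluated on construction-related questions in terms of accuracy, execution time, and model size.
Result: Current models leave considerable room for improvement in the construction domain, underscoring the importance of integrating domain-specific knowledge.
Insight: Domain-specific LLMs need more specialized data and knowledge; the planned open-source release of CEQuest is intended to support further research.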
Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities across a wide range of general-domain tasks. However, their effectiveness in specialized fields, such as construction, remains underexplored. In this paper, we introduce CEQuest, a novel benchmark dataset specifically designed to evaluate the performance of LLMs in answering construction-related questions, particularly in the areas of construction drawing interpretation and estimation. We conduct comprehensive experiments using five state-of-the-art LLMs, including Gemma 3, Phi4, LLaVA, Llama 3.3, and GPT-4.1, and evaluate their performance in terms of accuracy, execution time, and model size. Our experimental results demonstrate that current LLMs exhibit considerable room for improvement, highlighting the importance of integrating domain-specific knowledge into these models. To facilitate further research, we will open-source the proposed CEQuest dataset, aiming to foster the development of specialized large language models (LLMs) tailored to the construction domain.
[34] CYCLE-INSTRUCT: Fully Seed-Free Instruction Tuning via Dual Self-Training and Cycle Consistency
Zhanming Shen,Hao Chen,Yulei Tang,Shaolin Zhu,Wentao Ye,Xiaomeng Hu,Haobo Wang,Gang Chen,Junbo Zhao
Main category: cs.CL
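TL;DR: Cycle-Instruct is a seed-free instruction-tuning framework that uses dual self-training and cycle consistency to achieve full automation, removing the dependence on human-written seed data or external teacher models.
Details
Motivation: Conventional instruction tuning relies on costly human-annotated seed data or powerful external teacher models, and existing back-translation methods still cannot do without a seed set. Cycle-Instruct aims to make instruction tuning fully seed-free. Contribution: The Cycle-Instruct framework, which achieves fully seed-free instruction tuning through dual self-training and cycle consistency. It learns automatically from raw, unlabeled text by having two generative models (a question generator and an answer generator) supervise each other.
Method: A dual self-training loop pairs a question generator with an answer generator. Following cycle consistency, each model reconstructs original text segments from the pseudo-labels its counterpart generates, so no seed data is required.
Result: Cycle-Instruct is validated on four diverse data tracks, where it outperforms seed-driven back-translation baselines and achieves performance comparable to strongly supervised methods.
Insight: Combining cycle consistency with dual self-training yields a seed-free route to instruction tuning, showing that a model can learn from the intrinsic structure of the data while avoiding the biases that seed data introduces.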
Abstract: Instruction tuning is vital for aligning large language models (LLMs) with human intent, but current methods typically rely on costly human-annotated seed data or powerful external teacher models. While instruction back-translation techniques reduce this dependency, they remain fundamentally tethered to an initial seed set, which limits full automation, introduces biases, and can lead to inefficient use of unlabeled corpora. In this paper, we propose Cycle-Instruct, a novel framework that achieves fully seed-free instruction tuning. Inspired by cycle consistency, Cycle-Instruct employs a dual self-training loop where two models, an answer generator and a question generator, are bootstrapped solely from raw, unlabeled text. These models mutually supervise each other by reconstructing original text segments from their counterpart’s generated pseudo-labels, effectively learning from the intrinsic structure of the data without any human-provided seeds. We demonstrate Cycle-Instruct’s efficacy across four diverse data tracks, including general instruction-following, domain-specific tasks, dialogue logs, and plain text. Our extensive experiments show that Cycle-Instruct not only outperforms seed-driven back-translation baselines but also achieves performance comparable to strongly supervised methods.
[35] From Indirect Object Identification to Syllogisms: Exploring Binary Mechanisms in Transformer Circuits
Karim Saraipour,Shichang Zhang
Main category: cs.CL
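TL;DR: The paper studies how GPT-2 small handles binary logical reasoning. By analyzing syllogism tasks it identifies several circuits and the binary mechanisms within them, and a comparison with Indirect Object Identification (IOI) yields new insights into the roles of attention heads and MLPs.
Details
Motivation: The study aims to understand how Transformer models behave on more complex logical tasks such as syllogisms, and how their internal mechanisms differ from those used in simpler linguistic tasks such as IOI, in order to deepen the understanding of model reasoning. Contribution: 1) Identifies multiple circuits that GPT-2 uses for syllogism tasks; 2) Shows how the model produces a negated token that is absent from the input prompt through negative heads; 3) Relates the findings to IOI analysis to give a unified view of the roles of attention heads and MLPs.
Method: The authors design syllogism tasks of varying difficulty, analyze the internal behavior of GPT-2 small, evaluate the circuits with a faithfulness metric, and compare the results with IOI.
Result: A circuit of five attention heads achieves over 90% of the original model's performance. The study also finds that the model can produce negated tokens through negative heads.
Insight: The analysis clarifies the roles of specific attention heads and MLPs across simple and complex tasks; binary mechanisms such as negative heads facilitate the model's logical reasoning; and IOI and syllogism analyses complement each other in building a fuller picture of the model.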
Abstract: Transformer-based language models (LMs) can perform a wide range of tasks, and mechanistic interpretability (MI) aims to reverse engineer the components responsible for task completion to understand their behavior. Previous MI research has focused on linguistic tasks such as Indirect Object Identification (IOI). In this paper, we investigate the ability of GPT-2 small to handle binary truth values by analyzing its behavior with syllogistic prompts, e.g., “Statement A is true. Statement B matches statement A. Statement B is”, which requires more complex logical reasoning compared to IOI. Through our analysis of several syllogism tasks of varying difficulty, we identify multiple circuits that mechanistically explain GPT-2’s logical-reasoning capabilities and uncover binary mechanisms that facilitate task completion, including the ability to produce a negated token not present in the input prompt through negative heads. Our evaluation using a faithfulness metric shows that a circuit comprising five attention heads achieves over 90% of the original model’s performance. By relating our findings to IOI analysis, we provide new insights into the roles of specific attention heads and MLPs in LMs. These insights contribute to a broader understanding of model reasoning and support future research in mechanistic interpretability.
[36] Text Takes Over: A Study of Modality Bias in Multimodal Intent Detection
Ankan Mullick,Saransh Sharma,Abhik Jana,Pawan Goyal
Main category: cs.CL
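TL;DR: In multimodal intent detection, the dominance of the text modality lets a text-only LLM (Mistral-7B) outperform most competitive multimodal models. After the datasets are processed with a debiasing framework, most samples are removed and model performance drops sharply, highlighting modality bias in multimodal datasets.
Details
Motivation: The study examines how models of different modalities perform on multimodal intent detection, in particular text-only LLMs versus multimodal models, and investigates the modality bias present in the datasets. Contribution: 1) Reveals the dominance of the text modality in multimodal intent datasets; 2) Proposes a dataset debiasing framework and uses it to measure the effect of modality bias; 3) Analyzes the context-specific relevance of different modalities.
Method: 1) Evaluate text-only LLMs against multimodal models; 2) Confirm the datasets' modality bias through human evaluation; 3) Design a debiasing framework, apply it to the datasets, and analyze the results.
Result: 1) Mistral-7B outperforms most competitive multimodal models by approximately 9% on MIntRec-1 and 4% on MIntRec2.0; 2) Debiasing removes more than 70% of the samples in MIntRec-1 and more than 50% in MIntRec2.0, and performance degrades significantly across all models; 3) Smaller multimodal fusion models are affected most, with an accuracy drop of over 50 - 60%.
Insight: Multimodal intent datasets carry a strong textual bias, so unbiased datasets are needed to evaluate multimodal models effectively.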
Abstract: The rise of multimodal data, integrating text, audio, and visuals, has created new opportunities for studying multimodal tasks such as intent detection. This work investigates the effectiveness of Large Language Models (LLMs) and non-LLMs, including text-only and multi-modal models, in the multimodal intent detection task. Our study reveals that Mistral-7B, a text-only LLM, outperforms most competitive multimodal models by approximately 9% on MIntRec-1 and 4% on MIntRec2.0 datasets. This performance advantage comes from a strong textual bias in these datasets, where over 90% of the samples require textual input, either alone or in combination with other modalities, for correct classification. We confirm the modality bias of these datasets via human evaluation, too. Next, we propose a framework to debias the datasets, and upon debiasing, more than 70% of the samples in MIntRec-1 and more than 50% in MIntRec2.0 get removed, resulting in significant performance degradation across all models, with smaller multimodal fusion models being the most affected with an accuracy drop of over 50 - 60%. Further, we analyze the context-specific relevance of different modalities through empirical analysis. Our findings highlight the challenges posed by modality bias in multimodal intent datasets and emphasize the need for unbiased datasets to evaluate multimodal models effectively.
[37] ParamBench: A Graduate-Level Benchmark for Evaluating LLM Understanding on Indic Subjects
Kaushal Sharma,Vivek Patel,Ayush Maheshwari,Aditya Maheshwari
Main category: cs.CL
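TL;DR: The paper presents ParamBench, a benchmark for evaluating how well large language models (LLMs) understand graduate-level questions grounded in the Indian context, with around 11.5K questions from 16 subjects.
Details
Motivation: Existing Indian benchmarks emphasize basic fact-oriented queries and offer little assessment of deeper disciplinary understanding in the Indian setting. ParamBench fills this gap with graduate-level, culturally grounded questions. Contribution: 1. The ParamBench benchmark, with around 11.5K Hindi questions covering 16 subjects; 2. An evaluation of more than 17 open-source LLMs, in which the highest overall accuracy is only 48%; 3. Evidence that LLMs perform weakly on culturally grounded topics such as music and classical instruments.
Method: Questions are drawn primarily from nationwide graduate-level entrance examinations and cover several formats (list-based matching, assertion-reason pairs, sequence ordering, and conventional multiple choice). More than 17 open-source LLMs are evaluated on the resulting dataset.
Result: Llama 3.3 70B performs best, with an overall accuracy of 48%. Even the best models remain weak on topics such as music, classical instruments, politics, and archaeology.
Insight: LLMs have limited ability to answer graduate-level questions in the Indian context, especially on culture-related topics, which underscores the persistent challenge of culturally grounded reasoning.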
Abstract: Large language models (LLMs) have been widely evaluated on tasks such as comprehension, question answering, summarization, code generation, etc. However, their performance on graduate-level, culturally grounded questions in the Indian context remains largely unexplored. Existing Indian benchmarks emphasise basic fact-oriented queries that offer limited assessment of a deeper disciplinary understanding tailored to the Indian setting. In this paper, we present ParamBench, consisting of around 11.5K questions in Hindi drawn from questionnaires on 16 diverse subjects. These questions are primarily derived from nationwide graduate-level entrance examinations covering topics such as history, music, instruments, yoga, literature, philosophy, law, etc., specifically for the Indian context. Additionally, we assess the ability of LLMs to handle diverse question formats, such as list-based matching, assertion-reason pairs, and sequence ordering, alongside conventional multiple-choice questions. We evaluated the performance of more than 17 open-source LLMs on this benchmark, observing that Llama 3.3 70B attains the highest overall accuracy of 48%. Furthermore, subject-wise analysis indicates that even for the best performing LLMs, performance remains weak on topics such as music, classical instruments, politics and archaeology, underscoring persistent challenges in culturally grounded reasoning.
[38] Seeing is Believing: Emotion-Aware Audio-Visual Language Modeling for Expressive Speech Generation
Weiting Tan,Jiachen Lian,Hirofumi Inaguma,Paden Tomasello,Philipp Koehn,Xutai Ma
Main category: cs.CL
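TL;DR: The paper proposes an Audio-Visual Language Model (AVLM) that integrates full-face visual cues into a pre-trained expressive speech model, yielding substantial gains on emotion recognition and expressive dialogue tasks.
Details
Motivation: Current speech generation models typically rely on speech input alone and ignore the role visual information plays in conveying emotion. The paper adds visual cues to make generated speech more expressive. Contribution: The AVLM model, an exploration of multiple visual encoders and multimodal fusion strategies for speech generation, and evidence that visual information benefits emotional expression.
Method: Visual encoders are integrated during pre-training using several multimodal fusion strategies, followed by fine-tuning on emotion recognition and expressive dialogue tasks.
Result: AVLM improves emotion recognition by 5 F1 points over speech-only baselines, supporting the value of visual information for speech generation.
Insight: Expressive visual information is valuable for guiding speech generation, and AVLM offers a foundation for end-to-end multimodal conversational systems.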
Abstract: We present an Audio-Visual Language Model (AVLM) for expressive speech generation by integrating full-face visual cues into a pre-trained expressive speech model. We explore multiple visual encoders and multimodal fusion strategies during pre-training to identify the most effective integration approach. Subsequent fine-tuning on emotion recognition and expressive dialogue tasks yields substantial gains over speech-only baselines (e.g., +5 F1 in emotion recognition). AVLM highlights the value of expressive visual information in guiding speech generation and offers a foundation for end-to-end multimodal conversational systems.
[39] ComicScene154: A Scene Dataset for Comic Analysis
Sandro Paval,Ivan P. Yamshchikov,Pascal Meißner
Main category: cs.CL
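TL;DR: The paper introduces ComicScene154, a manually annotated comic scene dataset intended to advance computational methods for multimodal narrative analysis and comic research.
Details
Motivation: Comics are a distinctive domain of multimodal narrative, but datasets that support their computational analysis are scarce; ComicScene154 is proposed to fill this gap. Contribution: The ComicScene154 dataset, plus a baseline scene segmentation pipeline that demonstrates its utility and provides an initial benchmark for future multimodal narrative research.
Method: The dataset of scene-level narrative arcs is built by manual annotation of public-domain comic books spanning diverse genres, and a baseline scene segmentation pipeline serves as the benchmark.
Result: The results indicate that ComicScene154 is a valuable resource for advancing computational methods in multimodal narrative understanding.
Insight: As a multimodal narrative form, comics can offer distinctive insights for broader research on multimodal storytelling.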
Abstract: Comics offer a compelling yet under-explored domain for computational narrative analysis, combining text and imagery in ways distinct from purely textual or audiovisual media. We introduce ComicScene154, a manually annotated dataset of scene-level narrative arcs derived from public-domain comic books spanning diverse genres. By conceptualizing comics as an abstraction for narrative-driven, multimodal data, we highlight their potential to inform broader research on multi-modal storytelling. To demonstrate the utility of ComicScene154, we present a baseline scene segmentation pipeline, providing an initial benchmark that future studies can build upon. Our results indicate that ComicScene154 constitutes a valuable resource for advancing computational methods in multimodal narrative understanding and expanding the scope of comic analysis within the Natural Language Processing community.
[40] CMR-SPB: Cross-Modal Multi-Hop Reasoning over Text, Image, and Speech with Path Balance
Seunghee Kim,Ingyu Bang,Seokgyu Jang,Changhyeon Kim,Sanghwan Bae,Jihun Choi,Richeng Xuan,Taeuk Kim
Main category: cs.CL
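TL;DR: The paper proposes CMR-SPB, a new benchmark for evaluating cross-modal multi-hop reasoning that addresses two weaknesses of existing benchmarks (neglect of the speech modality and biased reasoning paths), and introduces an effective ECV prompting technique.
Details
Motivation: Existing benchmarks for cross-modal multi-hop reasoning have two main problems: they largely overlook the speech modality, and their reasoning path distributions are heavily biased, which undermines fair evaluation. Contribution: 1. The CMR-SPB benchmark, which covers text, image, and speech and ensures diverse, unbiased reasoning paths; 2. Evidence of consistent model failures on specific reasoning sequences; 3. The ECV prompting technique, which narrows the performance gap across reasoning paths.
Method: 1. Design a benchmark with diverse reasoning paths over text, image, and speech; 2. Analyze model performance experimentally across reasoning paths; 3. Propose ECV (Extract, Connect, Verify) prompting to improve cross-modal reasoning.
Result: Experiments show that CMR-SPB evaluates models more fairly and that biased benchmarks risk misrepresenting model performance. ECV prompting effectively mitigates the performance gap across different reasoning paths.
Insight: Fair, unbiased evaluation matters for cross-modal reasoning, and ECV prompting is a practical tool for developing more robust multimodal AI.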
Abstract: Cross-modal multi-hop reasoning (CMR) is a valuable yet underexplored capability of multimodal large language models (MLLMs), entailing the integration of information from multiple modalities to produce a coherent output for a given context. We argue that existing benchmarks for evaluating this ability have critical shortcomings: (1) they largely overlook the speech modality, and (2) they exhibit heavily biased reasoning path distributions, which can severely undermine fair evaluation. To address these limitations, we introduce a novel benchmark – Cross-Modal Multi-Hop Reasoning over Text, Image and Speech with Path Balance (CMR-SPB) – designed to assess tri-modal multi-hop reasoning while ensuring both unbiased and diverse reasoning paths. Our experiments with the new dataset reveal consistent model failures in specific reasoning sequences and show that biased benchmarks risk misrepresenting model performance. Finally, based on our extensive analysis, we propose a new ECV (Extract, Connect, Verify) prompting technique that effectively mitigates the performance gap across different reasoning paths. Overall, we call for more careful evaluation in CMR to advance the development of robust multimodal AI.
[41] TULIP: Adapting Open-Source Large Language Models for Underrepresented Languages and Specialized Financial Tasks
İrem Demirtaş,Burak Payzun,Seçil Arslan
Main category: cs.CL
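TL;DR: The paper introduces the TULIP models, which adapt Llama 3.1 8B and Qwen 2.5 7B through a multi-stage pipeline (data collection, continual pre-training, benchmark design, synthetic data generation, and supervised fine-tuning) to improve performance on financial Turkish tasks.
Details
Motivation: Large proprietary models perform exceptionally well in finance, but smaller models that can be hosted on-premise offer advantages in adaptability and privacy, especially for underrepresented languages and sensitive data. Contribution: The TULIP models, built specifically for financial Turkish use cases and improved through a multi-stage pipeline.
Method: A five-stage development pipeline: data collection, continual pre-training (CPT), benchmark design, synthetic data generation, and supervised fine-tuning (SFT).
Result: The results show that the models' capabilities can be enhanced to accomplish the targeted financial Turkish tasks effectively.
Insight: The adaptability and privacy advantages of open-source models make them competitive for domain-specific tasks in underrepresented languages, particularly in sensitive fields such as finance.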
Abstract: Thanks to the growing popularity of large language models over the years, there is great potential for their applications in finance. Despite the exceptional performance of larger proprietary models, which are presented as black-box solutions through APIs, smaller models that can be hosted on-premise present opportunities for adaptability and privacy. Especially in cases where the management of sensitive information and application of domain knowledge is important, like finance, enhancing the capabilities of smaller models becomes crucial, notably for underrepresented languages. In this work, we introduce TULIP models, which adapt Llama 3.1 8B and Qwen 2.5 7B for domain and language adaptation, focusing on financial Turkish use cases. The five-stage development pipeline involves data collection, continual pre-training (CPT), benchmark design, synthetic data generation and supervised fine-tuning (SFT). The results show that the capabilities of the models can be enhanced to effectively accomplish targeted tasks in this specific domain and language.
[42] M3TQA: Massively Multilingual Multitask Table Question Answering
Daixin Shu,Jian Yang,Zhenhe Wu,Xianjie Wu,Xianfu Cheng,Xiangyuan Guan,Yanghai Wang,Pengfei Wu,Tingyang Yang,Hualei Zhu,Wei Zhang,Ge Zhang,Jiaheng Liu,Zhoujun Li
Main category: cs.CL
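TL;DR: The paper proposes the M3TQA framework to address language imbalance and insufficient scale in multilingual table question answering. It combines a large-scale multitask benchmark spanning 97 languages with a high-quality translation pipeline and improves performance on low-resource languages.
Details
Motivation: Most table understanding research is confined to English, and existing multilingual table benchmarks suffer from geolinguistic imbalance and lack coverage of low-resource languages. Contribution: 1. The M3TQA framework, covering 97 languages; 2. A high-quality LLM-based translation pipeline; 3. 2,916 professionally annotated question-answer pairs; 4. The finding that synthetic data boosts performance on low-resource languages.
Method: A six-step LLM-based translation pipeline (powered by DeepSeek and GPT-4o) is applied to 50 real-world tables in Chinese and English to build the multilingual benchmark, on which state-of-the-art LLMs are tested for cross-lingual generalization.
Result: Translation fidelity reaches a median BLEU score of 60.19 as validated through back-translation, and synthetic data significantly improves performance on low-resource languages. The authors position the benchmark as a new standard for multilingual table understanding.
Insight: Synthetically generated, unannotated QA data is an effective way to improve performance on low-resource languages, and multilingual benchmarks need systematic coverage of diverse language families.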
Abstract: Tabular data is a fundamental component of real-world information systems, yet most research in table understanding remains confined to English, leaving multilingual comprehension significantly underexplored. Existing multilingual table benchmarks suffer from geolinguistic imbalance - overrepresenting certain languages and lacking sufficient scale for rigorous cross-lingual analysis. To address these limitations, we introduce a comprehensive framework for massively multilingual multitask table question answering, featuring m3TQA-Instruct, a large-scale benchmark spanning 97 languages across diverse language families, including underrepresented and low-resource languages. We construct m3TQA by curating 50 real-world tables in Chinese and English, then applying a robust six-step LLM-based translation pipeline powered by DeepSeek and GPT-4o, achieving high translation fidelity with a median BLEU score of 60.19 as validated through back-translation. The benchmark includes 2,916 professionally annotated question-answering pairs across four tasks designed to evaluate nuanced table reasoning capabilities. Experiments on state-of-the-art LLMs reveal critical insights into cross-lingual generalization, demonstrating that synthetically generated, unannotated QA data can significantly boost performance, particularly for low-resource languages. M3T-Bench establishes a new standard for multilingual table understanding, providing both a challenging evaluation platform and a scalable methodology for future research.
[43] From Confidence to Collapse in LLM Factual Robustness
Alina Fastowski,Bardh Prenkaj,Gjergji Kasneci
Main category: cs.CL
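TL;DR: The paper proposes a new metric, FRS, which measures the robustness of an LLM's factual knowledge by analyzing token distribution entropy together with temperature scaling sensitivity, and validates it experimentally.
Details
Motivation: Existing evaluation methods focus mainly on performance-based metrics, often under prompt perturbations, and overlook the robustness of knowledge during the generation process; a new approach is needed to fill this gap. Contribution: The Factual Robustness Score (FRS), which combines token distribution entropy with temperature scaling sensitivity to quantify how stable a fact is under perturbations of the decoding conditions.
Method: FRS is built from token distribution entropy and temperature scaling sensitivity, and is evaluated on 5 LLMs across 3 closed-book QA datasets (SQuAD, TriviaQA, and HotpotQA).
Result: Factual robustness varies significantly with model size (FRS of 0.76 for smaller models, 0.93 for larger ones), and accuracy degrades by about 60% under increased uncertainty.
Insight: Entropy and temperature scaling have a marked effect on factual accuracy, which lays a foundation for developing more robust knowledge retention and retrieval in future models.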
Abstract: Ensuring the robustness of factual knowledge in LLMs is critical for reliable applications in tasks such as question answering and reasoning. However, existing evaluation methods predominantly focus on performance-based metrics, often investigating from the perspective of prompt perturbations, which captures only the externally triggered side of knowledge robustness. To bridge this gap, we introduce a principled approach to measure factual robustness from the perspective of the generation process by analyzing token distribution entropy in combination with temperature scaling sensitivity. These two factors build the Factual Robustness Score (FRS), a novel metric which quantifies the stability of a fact against perturbations in decoding conditions, given its initial uncertainty. To validate our approach, we conduct extensive experiments on 5 LLMs across 3 closed-book QA datasets (SQuAD, TriviaQA, and HotpotQA). We show that factual robustness varies significantly (smaller models report an FRS of $0.76$, larger ones $0.93$), with accuracy degrading by approximately $60\%$ under increased uncertainty. These insights demonstrate how entropy and temperature scaling impact factual accuracy, and lay a foundation for developing more robust knowledge retention and retrieval in future models.
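The two ingredients named in [43], token distribution entropy and temperature scaling, can be illustrated with a small sketch. It does not reproduce the FRS formula, which the abstract does not state; it only shows how the Shannon entropy of a next-token distribution grows as the softmax temperature rises. The logits below are hypothetical values for four candidate answer tokens.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; higher temperature flattens the distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Hypothetical logits: the first token is the factually correct answer.
logits = [4.0, 1.0, 0.5, 0.0]

for t in (0.5, 1.0, 2.0):
    probs = softmax(logits, t)
    print(f"T={t}: entropy={entropy(probs):.3f}, p(correct)={probs[0]:.3f}")
```

As the temperature increases, the probability mass on the top token shrinks and the entropy rises, which is the kind of decoding perturbation against which the paper measures a fact's stability.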
[44] LLMs that Understand Processes: Instruction-tuning for Semantics-Aware Process Mining
Vira Pyrih,Adrian Rebmann,Han van der Aa
Main category: cs.CL
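TL;DR: The paper investigates whether instruction-tuning can improve the generalization of large language models (LLMs) on semantics-aware process mining tasks.
Details
Motivation: Traditional process mining techniques are frequency-based and make little use of semantic information. LLMs can be fine-tuned for a specific task, but this is computationally intensive and yields models that handle only that one task, so the authors study the potential of instruction-tuning. Contribution: Instruction-tuning an LLM on several tasks (for example anomaly detection and next-activity prediction) so that it also performs better on unseen tasks such as process discovery.
Method: The LLM is exposed to prompt-answer pairs for different tasks, making it more familiar with process mining.
Result: Instruction-tuning considerably improves performance on process discovery and prediction tasks, while its effect on anomaly detection varies across models, which shows that the choice of tuning tasks is critical.
Insight: Instruction-tuning is an effective way to improve LLM generalization in process mining, but the selection of tasks used for tuning plays a key role in the outcome.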
Abstract: Process mining is increasingly using textual information associated with events to tackle tasks such as anomaly detection and process discovery. Such semantics-aware process mining focuses on what behavior should be possible in a process (i.e., expectations), thus providing an important complement to traditional, frequency-based techniques that focus on recorded behavior (i.e., reality). Large Language Models (LLMs) provide a powerful means for tackling semantics-aware tasks. However, the best performance is so far achieved through task-specific fine-tuning, which is computationally intensive and results in models that can only handle one specific task. To overcome this lack of generalization, we use this paper to investigate the potential of instruction-tuning for semantics-aware process mining. The idea of instruction-tuning here is to expose an LLM to prompt-answer pairs for different tasks, e.g., anomaly detection and next-activity prediction, making it more familiar with process mining, thus allowing it to also perform better at unseen tasks, such as process discovery. Our findings demonstrate a varied impact of instruction-tuning: while performance considerably improved on process discovery and prediction tasks, it varies across models on anomaly detection tasks, highlighting that the selection of tasks for instruction-tuning is critical to achieving desired outcomes.
[45] JaParaPat: A Large-Scale Japanese-English Parallel Patent Application Corpus
Masaaki Nagata,Katsuki Chousa,Norihito Yasuda
Main category: cs.CL
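TL;DR: The authors build JaParaPat, a parallel corpus of more than 300 million Japanese-English sentence pairs from patent applications, extracted with a translation-based sentence alignment method; adding it to web data improves patent translation by 20 BLEU points.
Details
Motivation: Demand for patent translation is growing while existing parallel corpora are limited in size, so a large, high-quality patent parallel corpus is needed. Contribution: JaParaPat, a patent parallel corpus of 300M+ Japanese-English sentence pairs, with experiments showing that it substantially improves translation quality.
Method: Patent applications are obtained from the JPO and the USPTO, documents are paired using patent family information from DOCDB, and sentence pairs are extracted with a translation-based sentence alignment method.
Result: Adding the patent corpus to 22M web-derived sentence pairs improves translation accuracy by 20 BLEU points.
Insight: In-domain parallel data markedly improves machine translation quality, especially for specialized translation tasks such as patents.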
Abstract: We constructed JaParaPat (Japanese-English Parallel Patent Application Corpus), a bilingual corpus of more than 300 million Japanese-English sentence pairs from patent applications published in Japan and the United States from 2000 to 2021. We obtained the publication of unexamined patent applications from the Japan Patent Office (JPO) and the United States Patent and Trademark Office (USPTO). We also obtained patent family information from DOCDB, a bibliographic database maintained by the European Patent Office (EPO). We extracted approximately 1.4M Japanese-English document pairs, which are translations of each other based on the patent families, and extracted about 350M sentence pairs from the document pairs using a translation-based sentence alignment method whose initial translation model is bootstrapped from a dictionary-based sentence alignment method. We experimentally improved the accuracy of the patent translations by 20 BLEU points by adding more than 300M sentence pairs obtained from patent applications to 22M sentence pairs obtained from the web.
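The document-pairing step in [45] can be sketched as a grouping by patent family. This is only an illustration under assumed inputs: the record layout, the field names `doc_id`, `office`, and `family_id`, and the identifiers are hypothetical and do not reflect the actual DOCDB schema or the authors' pipeline.

```python
from collections import defaultdict
from itertools import product

# Hypothetical bibliographic records; real DOCDB data is structured differently.
records = [
    {"doc_id": "JP-A", "office": "JPO", "family_id": "F1"},
    {"doc_id": "US-A", "office": "USPTO", "family_id": "F1"},
    {"doc_id": "JP-B", "office": "JPO", "family_id": "F2"},    # no US counterpart
    {"doc_id": "US-C", "office": "USPTO", "family_id": "F3"},  # no JP counterpart
]

def pair_by_family(records):
    """Pair every JPO document with every USPTO document in the same family."""
    families = defaultdict(lambda: {"JPO": [], "USPTO": []})
    for rec in records:
        # Assumes office is either "JPO" or "USPTO"; other values raise KeyError.
        families[rec["family_id"]][rec["office"]].append(rec["doc_id"])
    pairs = []
    for fam in families.values():
        # Families missing one side contribute no pairs.
        pairs.extend(product(fam["JPO"], fam["USPTO"]))
    return pairs

document_pairs = pair_by_family(records)
```

Only family `F1` has documents on both sides, so it is the only one that yields a candidate document pair; sentence alignment would then run inside each pair.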
[46] MizanQA: Benchmarking Large Language Models on Moroccan Legal Question Answering
Adil Bahaj,Mounir Ghogho
Main category: cs.CL
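TL;DR: MizanQA is a benchmark for evaluating large language models on Moroccan legal question answering, filling a gap in the low-resource Arabic legal domain.
Details
Motivation: Large language models remain limited in specialized, low-resource domains such as Arabic legal contexts, so targeted evaluation tools and domain-specific development are needed. Contribution: The MizanQA benchmark, with over 1,700 multiple-choice questions drawing on Modern Standard Arabic, Islamic Maliki jurisprudence, Moroccan customary law, and French legal influences.
Method: A dataset of multiple-choice questions, including multi-answer formats, is constructed and used to benchmark multilingual and Arabic-focused LLMs.
Result: Existing models show substantial performance gaps on Moroccan legal tasks, highlighting the need for tailored evaluation metrics and domain-specific model development.
Insight: Cultural context and legal specificity are critical to language model performance, and future models should be developed to fit local needs.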
Abstract: The rapid advancement of large language models (LLMs) has significantly propelled progress in natural language processing (NLP). However, their effectiveness in specialized, low-resource domains, such as Arabic legal contexts, remains limited. This paper introduces MizanQA (pronounced Mizan, meaning “scale” in Arabic, a universal symbol of justice), a benchmark designed to evaluate LLMs on Moroccan legal question answering (QA) tasks, characterised by rich linguistic and legal complexity. The dataset draws on Modern Standard Arabic, Islamic Maliki jurisprudence, Moroccan customary law, and French legal influences. Comprising over 1,700 multiple-choice questions, including multi-answer formats, MizanQA captures the nuances of authentic legal reasoning. Benchmarking experiments with multilingual and Arabic-focused LLMs reveal substantial performance gaps, highlighting the need for tailored evaluation metrics and culturally grounded, domain-specific LLM development.
[47] RoMedQA: The First Benchmark for Romanian Medical Question Answering
Ana-Cristina Rogoz,Radu Tudor Ionescu,Alexandra-Valentina Anghel,Ionut-Lucian Antone-Iordache,Simona Coniac,Andreea Iuliana Ionescu
Main category: cs.CL
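TL;DR: RoMedQA is the first Romanian question answering benchmark for the medical domain, with 102,646 QA pairs built from case summaries of cancer patients. Experiments show that supervised fine-tuned models significantly outperform zero-shot prompted models, highlighting the importance of domain- and language-specific fine-tuning.
Details
Motivation: The lack of QA datasets for specific domains and languages hinders the development of AI models that generalize, particularly for medicine and for Romanian. Contribution: 1) RoMedQA, the first Romanian medical QA benchmark; 2) A high-quality, manually annotated dataset; 3) An evaluation of several LLMs under zero-shot prompting and supervised fine-tuning.
Method: 1) Build a dataset of 102,646 QA pairs from the medical case summaries of 1,011 patients; 2) Annotate it manually (seven physicians, about 2,100 work hours); 3) Evaluate four LLMs under zero-shot prompting and supervised fine-tuning.
Result: Fine-tuned models significantly outperform their zero-shot counterparts, indicating that pretrained models fail to generalize on RoMedQA.
Insight: Domain-specific and language-specific fine-tuning are both essential for reliable clinical QA, and RoMedQA fills the gap in Romanian medical QA.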
Abstract: Question answering (QA) is an actively studied topic, being a core natural language processing (NLP) task that needs to be addressed before achieving Artificial General Intelligence (AGI). However, the lack of QA datasets in specific domains and languages hinders the development of robust AI models able to generalize across various domains and languages. To this end, we introduce RoMedQA, the first Romanian QA benchmark for the medical domain, alongside a comprehensive evaluation of state-of-the-art large language models (LLMs). We construct a high-quality and large-scale dataset comprising 102,646 QA pairs related to cancer patients. The questions regard medical case summaries of 1,011 patients, requiring either keyword extraction or reasoning to be answered correctly. RoMedQA is the result of a time-consuming manual annotation process carried out by seven physicians specialized in oncology or radiotherapy, who spent a total of about 2,100 work hours to generate the QA pairs. We experiment with four LLMs from distinct families of models on RoMedQA. Each model is employed in two scenarios, namely one based on zero-shot prompting and one based on supervised fine-tuning. Our results show that fine-tuned models significantly outperform their zero-shot counterparts, clearly indicating that pretrained models fail to generalize on RoMedQA. Our findings demonstrate the importance of both domain-specific and language-specific fine-tuning for reliable clinical QA in Romanian. We publicly release our dataset and code at https://github.com/ana-rogoz/RoMedQA.
[48] Cetvel: A Unified Benchmark for Evaluating Language Understanding, Generation and Cultural Capacity of LLMs for Turkish
Yakup Abrek Er,Ilker Kesen,Gözde Gül Şahin,Aykut Erdem
Main category: cs.CL
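TL;DR: Cetvel is a comprehensive Turkish benchmark for evaluating large language models (LLMs) on language understanding, generation, and cultural capacity, addressing the shortcomings of existing Turkish benchmarks.
Details
Motivation: Existing Turkish benchmarks often lack task diversity, culturally relevant content, or both. Cetvel addresses this by combining a broad range of discriminative and generative tasks with content that reflects the richness of Turkish language and culture. Contribution: Cetvel comprises 23 tasks in seven categories, including grammatical error correction, machine translation, and question answering rooted in Turkish history and idiomatic language, giving Turkish LLM evaluation a comprehensive and culturally grounded benchmark.
Method: Cetvel is used to evaluate 33 open-weight LLMs (up to 70B parameters) spanning different model families and instruction paradigms, and to compare Turkish-centric models with multilingual or general-purpose models.
Result: Turkish-centric instruction-tuned models generally underperform multilingual or general-purpose models (such as Llama 3 and Mistral) despite being tailored for the language. Grammatical error correction and extractive question answering are particularly effective at differentiating model capabilities.
Insight: Progress on Turkish LLMs requires more culturally relevant data and research, and Cetvel provides an important tool for future model development and assessment.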
Abstract: We introduce Cetvel, a comprehensive benchmark designed to evaluate large language models (LLMs) in Turkish. Existing Turkish benchmarks often lack either task diversity or culturally relevant content, or both. Cetvel addresses these gaps by combining a broad range of both discriminative and generative tasks ensuring content that reflects the linguistic and cultural richness of Turkish language. Cetvel covers 23 tasks grouped into seven categories, including tasks such as grammatical error correction, machine translation, and question answering rooted in Turkish history and idiomatic language. We evaluate 33 open-weight LLMs (up to 70B parameters) covering different model families and instruction paradigms. Our experiments reveal that Turkish-centric instruction-tuned models generally underperform relative to multilingual or general-purpose models (e.g. Llama 3 and Mistral), despite being tailored for the language. Moreover, we show that tasks such as grammatical error correction and extractive question answering are particularly discriminative in differentiating model capabilities. Cetvel offers a comprehensive and culturally grounded evaluation suite for advancing the development and assessment of LLMs in Turkish.
[49] A Probabilistic Inference Scaling Theory for LLM Self-Correction
Zhe Yang,Yichang Zhang,Yudong Wang,Ziyao Xu,Junyang Lin,Zhifang Sui
Main category: cs.CL
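TL;DR: The paper proposes a probabilistic theory that models how the accuracy of large language models (LLMs) evolves over multiple rounds of self-correction and explains the observed performance gains. The authors derive a convergence formula for accuracy and validate the theory experimentally.
Details
Motivation: The mechanisms by which accuracy changes during multi-round LLM self-correction are unexplored; the paper fills this gap with a quantitative account of the dynamics. Contribution: 1. A probabilistic model of how accuracy changes during LLM self-correction. 2. A convergence formula for accuracy, whose parameters can be estimated from a single round of self-correction to predict the full accuracy curve. 3. Validation of the theory through experiments across multiple models and datasets.
Method: Through mathematical modeling and derivation, the accuracy after the $t^{th}$ round is given by $Acc_t = Upp - \alpha^t(Upp - Acc_0)$, where $Acc_0$ is the initial accuracy, $Upp$ is the upper bound to which accuracy converges, and $\alpha$ determines the rate of convergence. The parameters can be computed from a single round of self-correction.
Result: The predicted accuracy curves align closely with the empirical ones, validating the model.
Insight: The model quantifies the dynamics of LLM self-correction and provides a theoretical basis for future work, such as improving the convergence rate or studying behavior across tasks.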
Abstract: Large Language Models (LLMs) have demonstrated the capability to refine their generated answers through self-correction, enabling continuous performance improvement over multiple rounds. However, the mechanisms underlying how and why accuracy evolves during this iterative process remain unexplored. To fill this gap, we propose a probabilistic theory to model the dynamics of accuracy change and explain the performance improvements observed in multi-round self-correction. Through mathematical derivation, we establish that the accuracy after the $t^{th}$ round of self-correction is given by: $Acc_t = Upp - \alpha^t(Upp - Acc_0),$ where $Acc_0$ denotes the initial accuracy, $Upp$ represents the upper bound of accuracy convergence, and $\alpha$ determines the rate of convergence. Based on our theory, these parameters can be calculated and the predicted accuracy curve then can be obtained through only a single round of self-correction. Extensive experiments across diverse models and datasets demonstrate that our theoretical predictions align closely with empirical accuracy curves, validating the effectiveness of the theory. Our work provides a theoretical foundation for understanding LLM self-correction, thus paving the way for further explorations.
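The closed form in [49], $Acc_t = Upp - \alpha^t(Upp - Acc_0)$, can be checked numerically. The sketch below assumes a simple two-state model that the abstract does not spell out: in each round a correct answer stays correct with probability `p_keep` and a wrong answer is fixed with probability `p_fix`. Both names and the numeric values are hypothetical. Under that assumption the round-by-round recurrence has exactly the stated closed form, with $\alpha$ equal to `p_keep - p_fix` and $Upp$ equal to `p_fix / (1 - p_keep + p_fix)`.

```python
def closed_form_accuracy(acc0, upp, alpha, t):
    """Acc_t = Upp - alpha^t * (Upp - Acc_0), as stated in the abstract."""
    return upp - (alpha ** t) * (upp - acc0)

def iterate_accuracy(acc0, p_keep, p_fix, rounds):
    """Assumed recurrence: correct answers survive with p_keep, wrong ones are fixed with p_fix."""
    acc = acc0
    history = [acc]
    for _ in range(rounds):
        acc = acc * p_keep + (1.0 - acc) * p_fix
        history.append(acc)
    return history

# Hypothetical single-round estimates and initial accuracy.
p_keep, p_fix, acc0 = 0.95, 0.25, 0.60

alpha = p_keep - p_fix                 # convergence rate under the assumed model
upp = p_fix / (1.0 - p_keep + p_fix)   # fixed point of the recurrence

history = iterate_accuracy(acc0, p_keep, p_fix, rounds=5)
for t, acc in enumerate(history):
    print(t, round(acc, 4), round(closed_form_accuracy(acc0, upp, alpha, t), 4))
```

The two printed columns agree at every round, and accuracy approaches `upp` geometrically at rate `alpha`, which is why estimates from a single round are enough to extrapolate the whole curve under this model.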
[50] LLM-as-classifier: Semi-Supervised, Iterative Framework for Hierarchical Text Classification using Large Language Models
Doohee You,Andy Parisi,Zach Vander Velden,Lara Dantas Inojosa
Main category: cs.CL
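TL;DR: The paper proposes a semi-supervised, iterative framework that uses the zero- and few-shot capabilities of LLMs to build hierarchical text classifiers, targeting the shifting data distributions found in real industrial deployments.
Details
Motivation: LLMs are powerful tools for text analysis, but deploying them as reliable, scalable classifiers in production is challenging, especially because real-world data distributions change over time. Contribution: A semi-supervised, iterative, human-in-the-loop framework that builds robust hierarchical classifiers through prompt refinement, hierarchical expansion, and multi-faceted validation.
Method: The framework combines zero- and few-shot LLM classification with domain knowledge elicitation, prompt refinement, iterative validation, and techniques for assessing and mitigating sequence-based biases.
Result: The framework is designed to yield accurate, interpretable, and maintainable classification systems suited to dynamic industrial data distributions.
Insight: Human feedback and iterative refinement are key to making LLMs practical, and hierarchical classification benefits from combining domain knowledge with multi-stage validation.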
Abstract: The advent of Large Language Models (LLMs) has provided unprecedented capabilities for analyzing unstructured text data. However, deploying these models as reliable, robust, and scalable classifiers in production environments presents significant methodological challenges. Standard fine-tuning approaches can be resource-intensive and often struggle with the dynamic nature of real-world data distributions, which is common in the industry. In this paper, we propose a comprehensive, semi-supervised framework that leverages the zero- and few-shot capabilities of LLMs for building hierarchical text classifiers as a framework for a solution to these industry-wide challenges. Our methodology emphasizes an iterative, human-in-the-loop process that begins with domain knowledge elicitation and progresses through prompt refinement, hierarchical expansion, and multi-faceted validation. We introduce techniques for assessing and mitigating sequence-based biases and outline a protocol for continuous monitoring and adaptation. This framework is designed to bridge the gap between the raw power of LLMs and the practical need for accurate, interpretable, and maintainable classification systems in industry applications.
[51] Transfer Learning via Lexical Relatedness: A Sarcasm and Hate Speech Case Study
Angelly Cabrera,Linus Lei,Antonio Ortega
Main category: cs.CL
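TL;DR: The paper examines whether sarcasm pre-training improves hate speech detection, especially for implicit hate speech, and shows that it improves the performance of a BERT+BiLSTM model.
Details
Motivation: Detecting implicit forms of hate speech on social media (such as sarcasm and irony) remains difficult. The paper studies whether sarcasm pre-training helps detect both implicit and explicit hate speech. Contribution: Integrates sarcasm as a pre-training step in hate speech detection models, which improves BERT+BiLSTM detection performance on ETHOS and the Implicit Hate Corpus.
Method: Two training strategies: single-step training (a model trained only on sarcasm is tested directly on hate speech) and sequential transfer learning (fine-tuning on sarcasm, implicit hate, and explicit hate in turn).
Result: Sarcasm pre-training improved BERT+BiLSTM recall by 9.7%, AUC by 7.8%, and F1-score by 6% on ETHOS. On the Implicit Hate Corpus, precision increased by 7.8% when tested only on implicit samples.
Insight: Sarcasm and hate speech are semantically related; sarcasm pre-training strengthens a model's ability to capture implicit hate and benefits hate speech detection overall.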
Abstract: Detecting hate speech in non-direct forms, such as irony, sarcasm, and innuendos, remains a persistent challenge for social networks. Although sarcasm and hate speech are regarded as distinct expressions, our work explores whether integrating sarcasm as a pre-training step improves implicit hate speech detection and, by extension, explicit hate speech detection. Incorporating samples from ETHOS, Sarcasm on Reddit, and Implicit Hate Corpus, we devised two training strategies to compare the effectiveness of sarcasm pre-training on a CNN+LSTM and BERT+BiLSTM model. The first strategy is a single-step training approach, where a model trained only on sarcasm is then tested on hate speech. The second strategy uses sequential transfer learning to fine-tune models for sarcasm, implicit hate, and explicit hate. Our results show that sarcasm pre-training improved the BERT+BiLSTM’s recall by 9.7%, AUC by 7.8%, and F1-score by 6% on ETHOS. On the Implicit Hate Corpus, precision increased by 7.8% when tested only on implicit samples. By incorporating sarcasm into the training process, we show that models can more effectively detect both implicit and explicit hate.
cs.CV [Back]
[52] Text-Driven 3D Hand Motion Generation from Sign Language Data
Léore Bensabath,Mathis Petrovich,Gül Varol
Main category: cs.CV
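TL;DR: The paper aims to generate 3D hand motions from natural language descriptions. It uses a large-scale sign language video dataset with pseudo-annotated sign categories, which an LLM translates into hand motion descriptions, to train a text-conditioned diffusion model, HandMDM, and demonstrates the model's robustness across domains.
Details
Motivation: Large-scale datasets pairing 3D hand motions with text descriptions are lacking, and existing methods struggle to stay robust across domains (such as other sign languages or non-sign hand movements). Contribution: 1) automatically builds a large-scale dataset of paired 3D hand motions and text descriptions; 2) proposes the text-conditioned diffusion model HandMDM; 3) demonstrates the model's robustness in cross-domain settings.
Method: Takes a large-scale sign language video dataset with pseudo-annotated sign categories, generates text descriptions with an LLM supported by a dictionary of sign attributes and motion-script cues, and trains the diffusion-based HandMDM on the resulting pairs.
Result: HandMDM is robust on unseen sign categories from the same sign language, on signs from another sign language, and on non-sign hand movements.
Insight: Combining large-scale pseudo-annotated data with LLM-generated text descriptions enables 3D hand motion generation that generalizes across domains.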
Abstract: Our goal is to train a generative model of 3D hand motions, conditioned on natural language descriptions specifying motion characteristics such as handshapes, locations, finger/hand/arm movements. To this end, we automatically build pairs of 3D hand motions and their associated textual labels with unprecedented scale. Specifically, we leverage a large-scale sign language video dataset, along with noisy pseudo-annotated sign categories, which we translate into hand motion descriptions via an LLM that utilizes a dictionary of sign attributes, as well as our complementary motion-script cues. This data enables training a text-conditioned hand motion diffusion model HandMDM, that is robust across domains such as unseen sign categories from the same sign language, but also signs from another sign language and non-sign hand movements. We contribute extensive experimental investigation of these scenarios and will make our trained models and data publicly available to support future research in this relatively new field.
[53] VT-LVLM-AR: A Video-Temporal Large Vision-Language Model Adapter for Fine-Grained Action Recognition in Long-Term Videos
Kaining Li,Shuwei He,Zihan Xu
Main category: cs.CV
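TL;DR: VT-LVLM-AR is a framework that converts long videos into semantically rich visual event sequences and applies a large vision-language model (LVLM) to them, addressing fine-grained action recognition in long-term videos.
Details
Motivation: Action recognition in long videos is complicated by complex backgrounds and subtle action differences; traditional deep learning models are limited by computational overhead, difficulty capturing long-range temporal dependencies, and limited semantic understanding. Contribution: Proposes the VT-LVLM-AR framework, comprising a Video-to-Event Mapper (VTEM) and an LVLM-based Action Reasoning module, for efficient fine-grained action recognition in long videos.
Method: VTEM produces visual event sequences through lightweight spatio-temporal feature extraction, adaptive temporal pooling, and conceptual quantization with an event coherence bias; a frozen LLaVA-1.5 model is adapted with parameter-efficient Prompt Tuning (P-Tuning v2).
Result: State-of-the-art performance on the NTU RGB+D and NTU RGB+D 120 datasets (e.g., 94.1% accuracy on NTU RGB+D X-Sub).
Insight: Video-to-language translation and efficient model adaptation show the potential of LVLMs for video action understanding while also improving interpretability.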
Abstract: Human action recognition in long-term videos, characterized by complex backgrounds and subtle action differences, poses significant challenges for traditional deep learning models due to computational overhead, difficulty in capturing long-range temporal dependencies, and limited semantic understanding. While Large Language Models (LLMs) and Large Vision-Language Models (LVLMs) have shown remarkable capabilities in multi-modal understanding and reasoning, their direct application to continuous video streams for fine-grained action recognition remains an open problem. This paper introduces VT-LVLM-AR (Video-Temporal Large Vision-Language Model Adapter for Action Recognition), a novel framework designed to bridge this gap. VT-LVLM-AR comprises a Video-to-Event Mapper (VTEM) that efficiently transforms raw video into compact, semantically rich, and temporally coherent “visual event sequences” through lightweight spatio-temporal feature extraction, adaptive temporal pooling, and conceptual quantization with an event coherence bias. These visual event sequences are then fed into an LVLM-based Action Reasoning module, specifically a frozen LLaVA-1.5 model, adapted using parameter-efficient Prompt Tuning (P-Tuning v2) for action classification. Comprehensive evaluations on the NTU RGB+D and NTU RGB+D 120 datasets demonstrate that VT-LVLM-AR consistently achieves state-of-the-art performance, surpassing existing methods (e.g., 94.1% accuracy on NTU RGB+D X-Sub). Ablation studies confirm the critical contributions of VTEM’s components and the efficacy of Prompt Tuning, while human evaluations underscore the interpretability of our visual event representations. This work highlights the immense potential of leveraging LVLMs for robust and interpretable video action understanding through effective video-to-language translation and efficient model adaptation.
[54] Boosting Pathology Foundation Models via Few-shot Prompt-tuning for Rare Cancer Subtyping
Dexuan He,Xiao Zhou,Wenbin Guan,Liyuan Zhang,Xiaoman Zhang,Sinuo Xu,Ge Wang,Lifeng Wang,Xiaojun Yuan,Xin Sun,Yanfeng Wang,Kun Sun,Ya Zhang,Weidi Xie
Main category: cs.CV
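TL;DR: The paper proposes PathPT, a framework that exploits vision-language pathology foundation models through spatially-aware visual aggregation and task-specific prompt tuning, substantially improving rare cancer subtyping.
Details
Motivation: Rare cancers make up 20-25% of all malignancies, and over 70% of cases in pediatric oncology, yet limited expert availability makes them hard to diagnose. Existing pathology foundation models perform well on common cancer subtyping but remain limited on rare cancers. Contribution: 1) proposes the PathPT framework, which improves rare cancer subtyping through spatially-aware visual aggregation and task-specific prompt tuning; 2) converts WSI-level supervision into fine-grained tile-level guidance, preserving localization on cancerous regions; 3) validates the method on multiple datasets covering both rare and common cancers.
Method: 1) leverages the zero-shot capabilities of vision-language foundation models to turn WSI-level supervision into tile-level guidance; 2) applies task-specific prompt tuning so that prompts align with histopathological semantics; 3) adds spatially-aware visual aggregation to improve localization.
Result: Benchmarked on eight rare cancer datasets and three common cancer datasets, PathPT delivers substantial gains in subtyping accuracy and cancerous region grounding, outperforming the evaluated vision-language models and multi-instance learning frameworks.
Insight: The work offers a scalable approach to AI-assisted diagnosis of rare cancers, particularly where specialized expertise is scarce; combining visual and language knowledge improves both classification and interpretability.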
Abstract: Rare cancers comprise 20-25% of all malignancies but face major diagnostic challenges due to limited expert availability, especially in pediatric oncology, where they represent over 70% of cases. While pathology vision-language (VL) foundation models show promising zero-shot capabilities for common cancer subtyping, their clinical performance for rare cancers remains limited. Existing multi-instance learning (MIL) methods rely only on visual features, overlooking cross-modal knowledge and compromising interpretability critical for rare cancer diagnosis. To address this limitation, we propose PathPT, a novel framework that fully exploits the potential of vision-language pathology foundation models through spatially-aware visual aggregation and task-specific prompt tuning. Unlike conventional MIL, PathPT converts WSI-level supervision into fine-grained tile-level guidance by leveraging the zero-shot capabilities of VL models, thereby preserving localization on cancerous regions and enabling cross-modal reasoning through prompts aligned with histopathological semantics. We benchmark PathPT on eight rare cancer datasets (four adult and four pediatric) spanning 56 subtypes and 2,910 WSIs, as well as three common cancer datasets, evaluating four state-of-the-art VL models and four MIL frameworks under three few-shot settings. Results show that PathPT consistently delivers superior performance, achieving substantial gains in subtyping accuracy and cancerous region grounding ability. This work advances AI-assisted diagnosis for rare cancers, offering a scalable solution for improving subtyping accuracy in settings with limited access to specialized expertise.
[55] Semantic-Aware Ship Detection with Vision-Language Integration
Jiahao Li,Jiancheng Pan,Yuze Sun,Xiaomeng Huang
Main category: cs.CV
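TL;DR: The paper proposes a ship detection framework that combines vision-language models (VLMs) with a multi-scale adaptive sliding window strategy, and introduces the ShipSem-VL dataset to capture fine-grained ship attributes.
Details
Motivation: Ship detection in remote sensing imagery has wide applications in maritime activity monitoring, shipping logistics, and environmental studies, but existing methods struggle to capture fine-grained semantic information, which limits their effectiveness in complex scenarios. Contribution: 1) a detection framework for Semantic-Aware Ship Detection (SASD) that combines VLMs with a multi-scale strategy; 2) the ShipSem-VL dataset, focused on fine-grained ship attributes; 3) an evaluation of the framework through three tasks.
Method: 1) integrates semantic information using vision-language models; 2) applies a multi-scale adaptive sliding window strategy to improve detection.
Result: The framework performs well across the three evaluation tasks and improves semantic awareness in ship detection.
Insight: Combining vision-language models with a multi-scale strategy can improve both the semantic understanding and the detection accuracy of ship detection in remote sensing imagery.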
Abstract: Ship detection in remote sensing imagery is a critical task with wide-ranging applications, such as maritime activity monitoring, shipping logistics, and environmental studies. However, existing methods often struggle to capture fine-grained semantic information, limiting their effectiveness in complex scenarios. To address these challenges, we propose a novel detection framework that combines Vision-Language Models (VLMs) with a multi-scale adaptive sliding window strategy. To facilitate Semantic-Aware Ship Detection (SASD), we introduce ShipSem-VL, a specialized Vision-Language dataset designed to capture fine-grained ship attributes. We evaluate our framework through three well-defined tasks, providing a comprehensive analysis of its performance and demonstrating its effectiveness in advancing SASD from multiple perspectives.
[56] Automatic Retrieval of Specific Cows from Unlabeled Videos
Jiawen Lyu,Manu Ramesh,Madison Simonds,Jacquelyn P. Boerman,Amy R. Reibman
Main category: cs.CV
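TL;DR: The paper presents an automated system for retrieving specific cows from unlabeled videos, composed of an automatic catalog builder, a cow recognizer that uses no deep learning, and a finder that identifies cows in a continuous video stream.
Details
Motivation: Few video systems described in the open literature enable hands-free cataloging and identification of dairy cows, especially in unlabeled, unconstrained video. Contribution: 1. a complete automated system (AutoCattloger, eidetic cow recognizer, and CowFinder); 2. cow identification without deep learning; 3. retrieval of specific cows from unlabeled, unconstrained video.
Method: The system has three parts: AutoCattloger builds a Cattlog of the herd from a single input video clip per cow, the eidetic cow recognizer identifies cows without deep learning, and CowFinder identifies cows in a continuous stream of video.
Result: The system finds individual cows in unlabeled, unsegmented videos of cows walking unconstrained through the holding area of a milking parlor.
Insight: Cow identification is feasible without deep learning, which may offer a lower-cost option for similar settings.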
Abstract: Few automated video systems are described in the open literature that enable hands-free cataloging and identification (ID) of cows in a dairy herd. In this work, we describe our system, composed of an AutoCattloger, which builds a Cattlog of dairy cows in a herd with a single input video clip per cow, an eidetic cow recognizer which uses no deep learning to ID cows, and a CowFinder, which IDs cows in a continuous stream of video. We demonstrate its value in finding individuals in unlabeled, unsegmented videos of cows walking unconstrained through the holding area of a milking parlor.
[57] Investigating Different Geo Priors for Image Classification
Angela Zhu,Christian Lange,Max Hamilton
Main category: cs.CV
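TL;DR: The paper studies how effective different Spatial Implicit Neural Representations (SINR) models are as geo priors for vision-based species classification, examining model configurations and the handling of species not included in geo-prior training. It finds that the factors that make a model an effective geo prior may differ from those needed for accurate range maps.
Details
Motivation: Species distribution models are effective priors for species classification when location information is available, but the effect of different model configurations, and of how species unseen by the prior are handled, is unclear. Contribution: Evaluates various SINR models as geo priors and identifies factors behind their effectiveness, which may differ from the factors behind accurate range maps.
Method: Uses SINR models as geo priors, varies the model configuration and the handling of predictions for species not included in geo-prior training, and analyzes the effect on classifying iNaturalist observations.
Result: The effectiveness of a geo prior depends on specific configurations and handling choices, and these factors may differ from what accurate range mapping requires.
Insight: What makes a good geo prior for visual classification is not necessarily what makes a good range map, so configurations and handling strategies should be tuned for the classification use case.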
Abstract: Species distribution models encode spatial patterns of species occurrence making them effective priors for vision-based species classification when location information is available. In this study, we evaluate various SINR (Spatial Implicit Neural Representations) models as a geographical prior for visual classification of species from iNaturalist observations. We explore the impact of different model configurations and adjust how we handle predictions for species not included in Geo Prior training. Our analysis reveals factors that contribute to the effectiveness of these models as Geo Priors, factors that may differ from making accurate range maps.
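As a rough illustration of how a geo prior can be used at inference time, the sketch below reweights image-classifier probabilities by a per-species location score and renormalizes. This product-and-renormalize rule is a common choice assumed here, not something the abstract specifies; `apply_geo_prior` and `unseen_value` are hypothetical names, and the fallback for species missing from geo-prior training is only one of the handling options the paper might compare.

```python
def apply_geo_prior(vision_probs, geo_prior, unseen_value=1.0):
    """Reweight vision probabilities by a location prior and renormalize.

    vision_probs: dict mapping species -> probability from the image classifier.
    geo_prior: dict mapping species -> presence score at the photo's location;
        species absent from this dict were not covered by geo-prior training.
    unseen_value: hypothetical fallback score used for uncovered species.
    """
    scores = {
        species: prob * geo_prior.get(species, unseen_value)
        for species, prob in vision_probs.items()
    }
    total = sum(scores.values())
    if total == 0:
        # The prior ruled out every candidate; fall back to vision only.
        return dict(vision_probs)
    return {species: score / total for species, score in scores.items()}


if __name__ == "__main__":
    # Hypothetical numbers for illustration only.
    vision = {"species_a": 0.5, "species_b": 0.3, "species_c": 0.2}
    prior = {"species_a": 0.1, "species_b": 0.9}  # species_c unseen by the prior
    print(apply_geo_prior(vision, prior))
```

Setting `unseen_value` to 1.0 leaves uncovered species at their vision score before renormalization, while 0.0 suppresses them entirely; the choice can change the top prediction.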
[58] Representation Learning with Adaptive Superpixel Coding
Mahmoud Khalil,Ahmad Khalil,Alioune Ngom
Main category: cs.CV
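TL;DR: The paper proposes Adaptive Superpixel Coding (ASC), a self-supervised Transformer-based model whose adaptive superpixel layers overcome the fixed patch partitioning of traditional Vision Transformers.
Details
Motivation: Traditional vision models rely on fixed grid structures, which limits their adaptability to varying image content. Contribution: Proposes Adaptive Superpixel Coding (ASC), which adjusts its partitioning dynamically to the image content.
Method: A self-supervised Transformer-based model with adaptive superpixel layers that dynamically adjust the image partitioning.
Result: Outperforms widely used alternatives on standard image downstream task benchmarks.
Insight: Content-adaptive partitioning fits image content better than a fixed grid and improves model performance.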
Abstract: Deep learning vision models are typically tailored for specific modalities and often rely on domain-specific assumptions, such as the grid structures used by nearly all existing vision models. In this work, we propose a self-supervised model based on Transformers, which we call Adaptive Superpixel Coding (ASC). The key insight of our model is to overcome the limitations of traditional Vision Transformers, which depend on fixed-size and non-adaptive patch partitioning. Instead, ASC employs adaptive superpixel layers that dynamically adjust to the underlying image content. We analyze key properties of the approach that make it effective, and find that our method outperforms widely-used alternatives on standard image downstream task benchmarks.
[59] Glo-VLMs: Leveraging Vision-Language Models for Fine-Grained Diseased Glomerulus Classification
Zhenhao Guo,Rachit Saluja,Tianyuan Yao,Quan Liu,Yuankai Huo,Benjamin Liechty,David J. Pisapia,Kenji Ikemura,Mert R. Sabuncu,Yihe Yang,Ruining Deng
Main category: cs.CV
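TL;DR: The paper introduces Glo-VLMs, a framework for adapting vision-language models (VLMs) to fine-grained diseased glomerulus classification with limited labeled data, showing the potential of VLMs in medical image classification.
Details
Motivation: Fine-grained classification of diseased glomeruli involves subtle morphological differences and scarce labeled data, and the effectiveness of VLMs on such tasks remains limited, which motivates a study of how to adapt large pretrained VLMs. Contribution: Proposes the Glo-VLMs framework, which jointly learns representations of pathology images and clinical text prompts for fine-grained classification in data-constrained settings, and evaluates different VLM architectures and adaptation strategies.
Method: Fine-tunes VLMs with a small number of labeled samples (8 per class) and compares methods using standardized multi-class metrics.
Result: With only 8 shots per class, the fine-tuned VLMs reach 0.7416 accuracy, 0.9045 macro-AUC, and 0.5277 F1-score.
Insight: Even under highly limited supervision, large pretrained models can be adapted by fine-tuning to fine-grained medical image classification, which opens a path for clinical research applications.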
Abstract: Vision-language models (VLMs) have shown considerable potential in digital pathology, yet their effectiveness remains limited for fine-grained, disease-specific classification tasks such as distinguishing between glomerular subtypes. The subtle morphological variations among these subtypes, combined with the difficulty of aligning visual patterns with precise clinical terminology, make automated diagnosis in renal pathology particularly challenging. In this work, we explore how large pretrained VLMs can be effectively adapted to perform fine-grained glomerular classification, even in scenarios where only a small number of labeled examples are available. To this end, we introduce Glo-VLMs, a systematic framework designed to explore the adaptation of VLMs to fine-grained glomerular classification in data-constrained settings. Our approach leverages curated pathology images alongside clinical text prompts to facilitate joint image-text representation learning for nuanced renal pathology subtypes. By assessing various VLM architectures and adaptation strategies under a few-shot learning paradigm, we explore how both the choice of method and the amount of labeled data impact model performance in clinically relevant scenarios. To ensure a fair comparison, we evaluate all models using standardized multi-class metrics, aiming to clarify the practical requirements and potential of large pretrained models for specialized clinical research applications. As a result, fine-tuning the VLMs achieved 0.7416 accuracy, 0.9045 macro-AUC, and 0.5277 F1-score with only 8 shots per class, demonstrating that even with highly limited supervision, foundation models can be effectively adapted for fine-grained medical image classification.
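The "8 shots per class" setting can be made concrete with a small sampling helper. This is a generic few-shot sketch under stated assumptions, not the Glo-VLMs data pipeline: `sample_k_shot`, the `(item_id, class_label)` input format, and the fixed seed are all hypothetical.

```python
import random
from collections import defaultdict


def sample_k_shot(examples, k=8, seed=0):
    """Draw k labeled examples per class to form a few-shot training set.

    examples: iterable of (item_id, class_label) pairs (hypothetical format).
    Returns a dict mapping class_label -> list of k sampled item_ids.
    """
    by_class = defaultdict(list)
    for item_id, label in examples:
        by_class[label].append(item_id)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    support = {}
    for label, items in by_class.items():
        if len(items) < k:
            raise ValueError(f"class {label!r} has only {len(items)} examples, need {k}")
        support[label] = rng.sample(items, k)  # sampling without replacement
    return support


if __name__ == "__main__":
    # Hypothetical pool: 60 image ids spread over 3 classes.
    pool = [(f"img{i}", i % 3) for i in range(60)]
    shots = sample_k_shot(pool, k=8)
    print({label: len(ids) for label, ids in shots.items()})
```

Everything outside the sampled support set can then serve as the evaluation pool, so that reported metrics are not computed on training examples.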
[60] Contributions to Label-Efficient Learning in Computer Vision and Remote Sensing
Minh-Tan Pham
Main category: cs.CV
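TL;DR: The manuscript summarizes the author's contributions to label-efficient learning in computer vision and remote sensing, focusing on learning from limited or partially annotated data together with abundant unlabeled data, and on methods that address challenges specific to Earth observation data.
Details
Motivation: In computer vision and remote sensing, labeled data is expensive and time-consuming to obtain, so methods are needed that learn effectively from limited annotations or from unlabeled data. Contribution: 1. weakly supervised object discovery and detection based on anomaly-aware representations; 2. multi-task learning that jointly trains on multiple datasets with disjoint annotations; 3. self-supervised and supervised contrastive learning with multimodal data for scene classification; 4. few-shot learning for hierarchical scene classification.
Method: 1. learns anomaly-aware representations from large amounts of background images; 2. multi-task joint training; 3. contrastive learning with multimodal data; 4. few-shot learning with explicit and implicit modeling of class hierarchies.
Result: The contributions are supported by extensive experimental results across natural and remote sensing datasets.
Insight: For remote sensing data, exploiting multimodal and hierarchical information is key to label-efficient learning. Future work can scale and extend these methods for real-world applications.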
Abstract: This manuscript presents a series of my selected contributions to the topic of label-efficient learning in computer vision and remote sensing. The central focus of this research is to develop and adapt methods that can learn effectively from limited or partially annotated data, and can leverage abundant unlabeled data in real-world applications. The contributions span both methodological developments and domain-specific adaptations, in particular addressing challenges unique to Earth observation data such as multi-modality, spatial resolution variability, and scene heterogeneity. The manuscript is organized around four main axes including (1) weakly supervised learning for object discovery and detection based on anomaly-aware representations learned from large amounts of background images; (2) multi-task learning that jointly trains on multiple datasets with disjoint annotations to improve performance on object detection and semantic segmentation; (3) self-supervised and supervised contrastive learning with multimodal data to enhance scene classification in remote sensing; and (4) few-shot learning for hierarchical scene classification using both explicit and implicit modeling of class hierarchies. These contributions are supported by extensive experimental results across natural and remote sensing datasets, reflecting the outcomes of several collaborative research projects. The manuscript concludes by outlining ongoing and future research directions focused on scaling and enhancing label-efficient learning for real-world applications.
[61] Panoptic Segmentation of Environmental UAV Images : Litter Beach
Ousmane Youme,Jean Marie Dembélé,Eugene C. Ezin,Christophe Cambier
Main category: cs.CV
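TL;DR: The paper explores CNN-based panoptic segmentation of environmental UAV images for beach litter monitoring, using an instance-based segmentation method and a panoptic segmentation method to address the limitations of a basic CNN model in complex scenes.
Details
Motivation: Marine litter monitoring has become a worldwide problem, and a basic CNN model performs poorly on heterogeneous beach imagery because of interference such as sand color reflections, footprints, and shadows, so more robust methods are needed. Contribution: Applies CNN-based instance and panoptic segmentation methods that reach good accuracy with only a few samples and improve the model's robustness.
Method: Uses an instance-based segmentation method and a panoptic segmentation method to detect and classify litter in complex beach scenes.
Result: The methods show good accuracy with just a few samples and cope better with complex scenes than a basic CNN model.
Insight: Panoptic segmentation is promising for environmental monitoring tasks, particularly against complex backgrounds.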
Abstract: Convolutional neural networks (CNN) have been used efficiently in several fields, including environmental challenges. In fact, CNN can help with the monitoring of marine litter, which has become a worldwide problem. UAVs have higher resolution and are more adaptable in local areas than satellite images, making it easier to find and count trash. Since the sand is heterogeneous, a basic CNN model encounters plenty of interference caused by reflections of sand color, human footsteps, shadows, algae present, dunes, holes, and tire tracks. For these types of images, other CNN models, such as CNN-based segmentation methods, may be more appropriate. In this paper, we use an instance-based segmentation method and a panoptic segmentation method that show good accuracy with just a few samples. The model is more robust and less
[62] Automated Multi-label Classification of Eleven Retinal Diseases: A Benchmark of Modern Architectures and a Meta-Ensemble on a Large Synthetic Dataset
Jerry Cao-Xue,Tien Comlekoglu,Keyi Xue,Guanliang Wang,Jiang Li,Gordon Laurie
Main category: cs.CV
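TL;DR: The paper builds a multi-label retinal disease classification benchmark on the synthetic SynFundus-1M dataset, training six modern deep learning architectures and a meta-ensemble model, and shows strong performance on synthetic data with generalization to real clinical data.
Details
Motivation: Patient privacy concerns and high costs make large, expertly annotated clinical datasets scarce, which hinders multi-label deep learning models for retinal disease classification. The release of the synthetic SynFundus-1M dataset offers a way around this barrier. Contribution: 1. establishes a performance benchmark for six modern architectures on SynFundus-1M; 2. proposes an XGBoost-based meta-ensemble that achieves the highest performance; 3. shows that models trained on synthetic data generalize to real clinical data.
Method: Trains six architectures (ConvNeXtV2, SwinV2, ViT, ResNet, EfficientNetV2, and the RETFound foundation model) with 5-fold multi-label stratified cross-validation, then stacks their out-of-fold predictions with an XGBoost classifier to form a meta-ensemble.
Result: The meta-ensemble reaches a macro-average AUC of 0.9973 on the internal validation set. On real clinical datasets, the models achieve an AUC of 0.7972 on a combined DR dataset, an AUC of 0.9126 on the AIROGS glaucoma dataset, and a macro-AUC of 0.8800 on the multi-label RFMiD dataset.
Insight: Synthetic data can serve as a practical substitute for scarce real training data, which may accelerate the development of ophthalmic AI systems; the results also show the potential of meta-ensembles for multi-label classification.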
Abstract: The development of multi-label deep learning models for retinal disease classification is often hindered by the scarcity of large, expertly annotated clinical datasets due to patient privacy concerns and high costs. The recent release of SynFundus-1M, a high-fidelity synthetic dataset with over one million fundus images, presents a novel opportunity to overcome these barriers. To establish a foundational performance benchmark for this new resource, we developed an end-to-end deep learning pipeline, training six modern architectures (ConvNeXtV2, SwinV2, ViT, ResNet, EfficientNetV2, and the RETFound foundation model) to classify eleven retinal diseases using a 5-fold multi-label stratified cross-validation strategy. We further developed a meta-ensemble model by stacking the out-of-fold predictions with an XGBoost classifier. Our final ensemble model achieved the highest performance on the internal validation set, with a macro-average Area Under the Receiver Operating Characteristic Curve (AUC) of 0.9973. Critically, the models demonstrated strong generalization to three diverse, real-world clinical datasets, achieving an AUC of 0.7972 on a combined DR dataset, an AUC of 0.9126 on the AIROGS glaucoma dataset and a macro-AUC of 0.8800 on the multi-label RFMiD dataset. This work provides a robust baseline for future research on large-scale synthetic datasets and establishes that models trained exclusively on synthetic data can accurately classify multiple pathologies and generalize effectively to real clinical images, offering a viable pathway to accelerate the development of comprehensive AI systems in ophthalmology.
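The headline metric, macro-average AUC, is the unweighted mean of the per-label AUCs. The sketch below computes it in plain Python using the pairwise-ranking definition of AUC (the probability that a random positive is scored above a random negative, with ties counted as one half). It is a minimal illustration rather than the authors' evaluation code; the function names and the toy data are hypothetical.

```python
def binary_auc(labels, scores):
    """AUC for one label via pairwise comparison of positives and negatives."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


def macro_auc(label_matrix, score_matrix):
    """Unweighted mean of per-label AUCs; rows are samples, columns are labels."""
    n_labels = len(label_matrix[0])
    per_label = [
        binary_auc([row[j] for row in label_matrix], [row[j] for row in score_matrix])
        for j in range(n_labels)
    ]
    return sum(per_label) / n_labels


if __name__ == "__main__":
    # Toy multi-label example: 4 samples, 2 labels (hypothetical values).
    y_true = [[1, 0], [0, 1], [1, 1], [0, 0]]
    y_score = [[0.9, 0.2], [0.1, 0.8], [0.8, 0.3], [0.3, 0.4]]
    print(macro_auc(y_true, y_score))
```

The pairwise loop is quadratic in the number of samples, which is fine for illustration; a rank-based computation would be the usual choice for a dataset with a million images.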
[63] DRespNeT: A UAV Dataset and YOLOv8-DRN Model for Aerial Instance Segmentation of Building Access Points for Post-Earthquake Search-and-Rescue Missions
Aykut Sirma,Angelos Plastropoulos,Argyrios Zolotas,Gilbert Tang
Main category: cs.CV
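TL;DR: The paper introduces DRespNeT, a high-resolution dataset for aerial instance segmentation of building access points after earthquakes, and an optimized YOLOv8-based model, YOLOv8-DRN, to support real-time decision-making in search-and-rescue missions.
Details
Motivation: Post-earthquake search-and-rescue (SAR) requires rapid identification of accessible entry points and structural obstacles, but existing datasets rely on satellite imagery or coarse semantic labels and lack high-resolution, fine-grained annotations. The DRespNeT dataset and the optimized model address this need. Contribution: 1. the DRespNeT dataset, with high-resolution instance segmentation annotations for 28 operationally critical classes; 2. the YOLOv8-DRN model, which combines real-time speed with high accuracy (92.7% mAP50); 3. fine-grained annotations that distinguish accessible from obstructed areas, improving operational planning.
Method: Starting from the YOLOv8-seg instance segmentation model, the authors develop the optimized YOLOv8-DRN and train and evaluate it on DRespNeT. The fine-grained annotation design supports more precise instance segmentation.
Result: YOLOv8-DRN achieves 92.7% mAP50 at an inference speed of 27 FPS on an RTX-4090 GPU, meeting real-time operational requirements.
Insight: 1. high-resolution, fine-grained annotation matters for disaster-response tasks; 2. YOLO-based models such as YOLOv8 can meet real-time requirements; 3. the dataset and models provide a foundation for human-robot collaboration in SAR.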
Abstract: Recent advancements in computer vision and deep learning have enhanced disaster-response capabilities, particularly in the rapid assessment of earthquake-affected urban environments. Timely identification of accessible entry points and structural obstacles is essential for effective search-and-rescue (SAR) operations. To address this need, we introduce DRespNeT, a high-resolution dataset specifically developed for aerial instance segmentation of post-earthquake structural environments. Unlike existing datasets, which rely heavily on satellite imagery or coarse semantic labeling, DRespNeT provides detailed polygon-level instance segmentation annotations derived from high-definition (1080p) aerial footage captured in disaster zones, including the 2023 Turkiye earthquake and other impacted regions. The dataset comprises 28 operationally critical classes, including structurally compromised buildings, access points such as doors, windows, and gaps, multiple debris levels, rescue personnel, vehicles, and civilian visibility. A distinctive feature of DRespNeT is its fine-grained annotation detail, enabling differentiation between accessible and obstructed areas, thereby improving operational planning and response efficiency. Performance evaluations using YOLO-based instance segmentation models, specifically YOLOv8-seg, demonstrate significant gains in real-time situational awareness and decision-making. Our optimized YOLOv8-DRN model achieves 92.7% mAP50 with an inference speed of 27 FPS on an RTX-4090 GPU for multi-target detection, meeting real-time operational requirements. The dataset and models support SAR teams and robotic systems, providing a foundation for enhancing human-robot collaboration, streamlining emergency response, and improving survivor outcomes.
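For readers unfamiliar with the metric, mAP50 counts a predicted instance as a match when its intersection-over-union (IoU) with a ground-truth instance is at least 0.5. The sketch below shows only that matching criterion for segmentation masks, represented as sets of pixel coordinates; it is not a full mAP implementation and not the authors' evaluation code, and the function names are hypothetical.

```python
def mask_iou(pred_pixels, gt_pixels):
    """IoU between two instance masks given as collections of (row, col) pixels."""
    pred, gt = set(pred_pixels), set(gt_pixels)
    union = pred | gt
    if not union:
        return 0.0  # both masks empty: treat as no overlap
    return len(pred & gt) / len(union)


def matches_at_iou50(pred_pixels, gt_pixels):
    """True when the prediction would count as a match at the 0.5 IoU threshold."""
    return mask_iou(pred_pixels, gt_pixels) >= 0.5


if __name__ == "__main__":
    # Two 2x2 masks that overlap in one column (hypothetical example).
    pred = {(0, 0), (0, 1), (1, 0), (1, 1)}
    gt = {(0, 1), (1, 1), (0, 2), (1, 2)}
    print(mask_iou(pred, gt), matches_at_iou50(pred, gt))
```

A complete mAP50 computation would additionally rank predictions by confidence, match each ground-truth instance at most once, and average precision over recall levels and classes.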
[64] NeuralMeshing: Complete Object Mesh Extraction from Casual Captures
Floris Erich,Naoya Chiba,Abdullah Mustafa,Ryo Hanai,Noriaki Ando,Yusuke Yoshiyasu,Yukiyasu Domae
Main category: cs.CV
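TL;DR: The paper presents an automated system that extracts complete object meshes from two or more videos without a 3D scanner, requiring only one known point per video, which a fiducial marker can supply.
Details
Motivation: The goal is to generate complete geometric models of everyday objects from casual video captures, without commercial 3D scanners, lowering the barrier to geometric modeling. Contribution: An automated system that takes multiple videos plus one known point per video (obtainable from a fiducial marker such as a checkerboard or AR marker) and uses Structure-from-Motion (SfM) techniques to produce a complete object mesh.
Method: Video frames are positioned in world space with Structure-from-Motion techniques, and results from multiple videos are merged into a complete mesh without relying on hole filling.
Result: The system generates complete object meshes from casual video captures, and the code is publicly available.
Insight: Merging multiple videos with a simple fiducial marker enables low-cost geometric modeling suited to everyday data capture.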
Abstract: How can we extract complete geometric models of objects that we encounter in our daily life, without having access to commercial 3D scanners? In this paper we present an automated system for generating geometric models of objects from two or more videos. Our system requires the specification of one known point in at least one frame of each video, which can be automatically determined using a fiducial marker such as a checkerboard or Augmented Reality (AR) marker. The remaining frames are automatically positioned in world space by using Structure-from-Motion techniques. By using multiple videos and merging results, a complete object mesh can be generated, without having to rely on hole filling. Code for our system is available from https://github.com/FlorisE/NeuralMeshing.
[65] Expandable Residual Approximation for Knowledge Distillation
Zhaoyi Yan,Binghui Chen,Yunfan Liu,Qixiang Ye
Main category: cs.CV
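TL;DR: The paper proposes Expandable Residual Approximation (ERA), a knowledge distillation method that decomposes the approximation of residual knowledge into multiple steps, using a divide-and-conquer approach to make the teacher's representation easier for the student to mimic. Combined with a Teacher Weight Integration strategy that mitigates the capacity gap, it improves image classification and object detection performance.
Details
Motivation: Knowledge distillation (KD) suffers from a learning capacity gap between teacher and student, which hinders sufficient knowledge transfer. Inspired by the progressive approximation principle in the Stone-Weierstrass theorem, the paper designs ERA to address this problem. Contribution: 1. proposes ERA, which decomposes the approximation of residual knowledge into multiple steps; 2. designs a Multi-Branched Residual Network (MBRNet) to implement the decomposition; 3. introduces a Teacher Weight Integration (TWI) strategy that reuses the teacher's head weights.
Method: ERA uses MBRNet to approximate residual knowledge step by step and TWI to reuse the teacher's head weights. The divide-and-conquer approach reduces the difficulty of the student's learning task.
Result: Improves Top-1 accuracy on the ImageNet classification benchmark by 1.41% and AP on the MS COCO object detection benchmark by 1.40, with leading performance across computer vision tasks.
Insight: Approximating residual knowledge step by step makes the student's task easier, and reusing teacher weights helps close the capacity gap, leading to more effective knowledge distillation.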
Abstract: Knowledge distillation (KD) aims to transfer knowledge from a large-scale teacher model to a lightweight one, significantly reducing computational and storage requirements. However, the inherent learning capacity gap between the teacher and student often hinders the sufficient transfer of knowledge, motivating numerous studies to address this challenge. Inspired by the progressive approximation principle in the Stone-Weierstrass theorem, we propose Expandable Residual Approximation (ERA), a novel KD method that decomposes the approximation of residual knowledge into multiple steps, reducing the difficulty of mimicking the teacher’s representation through a divide-and-conquer approach. Specifically, ERA employs a Multi-Branched Residual Network (MBRNet) to implement this residual knowledge decomposition. Additionally, a Teacher Weight Integration (TWI) strategy is introduced to mitigate the capacity disparity by reusing the teacher’s head weights. Extensive experiments show that ERA improves the Top-1 accuracy on the ImageNet classification benchmark by 1.41% and the AP on the MS COCO object detection benchmark by 1.40, as well as achieving leading performance across computer vision tasks. Codes and models are available at https://github.com/Zhaoyi-Yan/ERA.
[66] Advances and Trends in the 3D Reconstruction of the Shape and Motion of Animals
Ziqi Li,Abderraouf Amrani,Shri Rai,Hamid Laga
Main category: cs.CV
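TL;DR: The paper surveys recent progress in 3D reconstruction of animal shape and motion, covering non-intrusive deep learning-based methods, analyzing their input modalities, representations, reconstruction techniques, and training mechanisms, and identifying current challenges and future directions.
Details
Motivation: Traditional 3D scanning is intrusive, expensive, and difficult to deploy in animals' natural environments, so non-intrusive solutions are needed for biology, livestock management, animal conservation, and digital entertainment. Contribution: Systematically categorizes and discusses non-intrusive 3D animal reconstruction methods based on RGB images and video, summarizing their input modalities, representations, techniques, and training mechanisms, and analyzing their performance, strengths, and limitations.
Method: A literature survey of deep learning techniques for 3D animal reconstruction, organized by input modality (such as RGB), geometry and motion representation (such as meshes), reconstruction technique (such as optimization-based methods), and training mechanism (such as supervised learning).
Result: Deep learning methods perform well for non-intrusive 3D animal reconstruction but still face challenges such as data scarcity, the complexity of dynamic modeling, and limited generalization.
Insight: Future research could focus on multimodal data fusion, unsupervised learning, and lightweight deployment to make 3D animal reconstruction more practical and broadly applicable.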
Abstract: Reconstructing the 3D geometry, pose, and motion of animals is a long-standing problem, which has a wide range of applications, from biology, livestock management, and animal conservation and welfare to content creation in digital entertainment and Virtual/Augmented Reality (VR/AR). Traditionally, 3D models of real animals are obtained using 3D scanners. These, however, are intrusive, often prohibitively expensive, and difficult to deploy in the natural environment of the animals. In recent years, we have seen a significant surge in deep learning-based techniques that enable the 3D reconstruction, in a non-intrusive manner, of the shape and motion of dynamic objects just from their RGB image and/or video observations. Several papers have explored their application and extension to various types of animals. This paper surveys the latest developments in this emerging and growing field of research. It categorizes and discusses the state-of-the-art methods based on their input modalities, the way the 3D geometry and motion of animals are represented, the type of reconstruction techniques they use, and the training mechanisms they adopt. It also analyzes the performance of some key methods, discusses their strengths and limitations, and identifies current challenges and directions for future research.
[67] A Unified Voxel Diffusion Module for Point Cloud 3D Object Detection
Qifeng Liu,Dawei Zhao,Yabo Dong,Linzhi Shang,Liang Xiao,Juan Wang,Kunkong Zhao,Dongming Lu,Qi Zhu
Main category: cs.CV
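TL;DR: The paper proposes a Voxel Diffusion Module (VDM) that combines sparse 3D convolutions with submanifold sparse convolutions to strengthen voxel-level representation and diffusion in point cloud data, improving the performance of Transformer-based and SSM-based detection models.
Details
Motivation: Current voxel-based point cloud detection models require strict consistency between input and output dimensions, which limits the spatial diffusion capability that convolutions normally provide and hurts detection accuracy. Inspired by CNN-based detectors, the paper sets out to remove this limitation. Contribution: Proposes the VDM module, which enhances the diffusion and representation of voxel features through sparse 3D convolutions and submanifold sparse convolutions, and demonstrates its generality and performance gains across mainstream detection models.
Method: VDM consists of sparse 3D convolutions, submanifold sparse convolutions, and residual connections. For computational efficiency, the output feature maps are downsampled to one-fourth of the input resolution.
Result: VDM consistently improves detection accuracy on several benchmark datasets, setting new state-of-the-art results including 74.7 mAPH (L2) on Waymo and 72.9 NDS on nuScenes.
Insight: Using the spatial diffusion of sparse convolutions together with residual connections, VDM strengthens voxel feature representation while remaining computationally efficient.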
Abstract: Recent advances in point cloud object detection have increasingly adopted Transformer-based and State Space Models (SSMs), demonstrating strong performance. However, voxel-based representations in these models require strict consistency in input and output dimensions due to their serialized processing, which limits the spatial diffusion capability typically offered by convolutional operations. This limitation significantly affects detection accuracy. Inspired by CNN-based object detection architectures, we propose a novel Voxel Diffusion Module (VDM) to enhance voxel-level representation and diffusion in point cloud data. VDM is composed of sparse 3D convolutions, submanifold sparse convolutions, and residual connections. To ensure computational efficiency, the output feature maps are downsampled to one-fourth of the original input resolution. VDM serves two primary functions: (1) diffusing foreground voxel features through sparse 3D convolutions to enrich spatial context, and (2) aggregating fine-grained spatial information to strengthen voxel-wise feature representation. The enhanced voxel features produced by VDM can be seamlessly integrated into mainstream Transformer- or SSM-based detection models for accurate object classification and localization, highlighting the generalizability of our method. We evaluate VDM on several benchmark datasets by embedding it into both Transformer-based and SSM-based models. Experimental results show that our approach consistently improves detection accuracy over baseline models. Specifically, VDM-SSMs achieve 74.7 mAPH (L2) on Waymo, 72.9 NDS on nuScenes, 42.3 mAP on Argoverse 2, and 67.6 mAP on ONCE, setting new state-of-the-art performance across all datasets. Our code will be made publicly available.
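The difference between the two convolution types named in the abstract can be seen from which voxels are active in the output. In the general sparse-convolution literature, a regular sparse 3D convolution activates every output site whose kernel window touches an active input, so features spread to neighbors, while a submanifold sparse convolution computes outputs only at already-active sites. The sketch below tracks active coordinates only (no feature values, stride 1, grid bounds ignored); it illustrates that general behavior and is not the VDM implementation, and all names are hypothetical.

```python
from itertools import product


def kernel_offsets(kernel_size=3):
    """All integer offsets covered by a cubic kernel centered on a voxel."""
    radius = kernel_size // 2
    return list(product(range(-radius, radius + 1), repeat=3))


def sparse_conv_active_sites(active, kernel_size=3):
    """Regular sparse conv, stride 1: the active set dilates by the kernel footprint."""
    return {
        (x + dx, y + dy, z + dz)
        for (x, y, z) in active
        for (dx, dy, dz) in kernel_offsets(kernel_size)
    }


def submanifold_conv_active_sites(active, kernel_size=3):
    """Submanifold sparse conv: outputs exist only where inputs were active."""
    return set(active)


if __name__ == "__main__":
    # One foreground voxel in an otherwise empty grid (hypothetical example).
    foreground = {(5, 5, 5)}
    print(len(sparse_conv_active_sites(foreground)))       # dilated active set
    print(len(submanifold_conv_active_sites(foreground)))  # unchanged active set
```

Mixing the two lets a module spread foreground features into neighboring voxels where that helps, while keeping the rest of the computation sparse.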
[68] Ensemble learning of foundation models for precision oncology
Xiangde Luo,Xiyue Wang,Feyisope Eweje,Xiaoming Zhang,Sen Yang,Ryan Quinton,Jinxi Xiang,Yuchen Li,Yuanfeng Ji,Zhe Li,Yijiang Chen,Colin Bergstrom,Ted Kim,Francesca Maria Olguin,Kelley Yuan,Matthew Abikenari,Andrew Heider,Sierra Willens,Sanjeeth Rajaram,Robert West,Joel Neal,Maximilian Diehn,Ruijiang Li
Main category: cs.CV
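TL;DR: The paper proposes ELF (Ensemble Learning of Foundation models), a framework that integrates five state-of-the-art pathology foundation models to generate unified slide-level representations, significantly improving accuracy and robustness in disease classification, biomarker detection, and treatment response prediction.
Details
Motivation: Existing pathology foundation models are typically trained on disparate datasets with differing strategies, which leads to inconsistent performance and limited generalizability. The authors propose ELF to overcome this. Contribution: Proposes the ELF framework, which integrates five pathology foundation models into one unified framework and uses ensemble learning to capture complementary information while maintaining high data efficiency.
Method: ELF uses a slide-level architecture. Trained on 53,699 WSIs, it combines five foundation models to produce a unified representation. The approach is particularly suited to settings where clinical data are limited.
Result: Across clinical applications (disease classification, biomarker detection, and treatment response prediction), ELF outperforms each individual foundation model and existing slide-level models, showing higher accuracy and robustness.
Insight: Ensemble learning can effectively combine the strengths of different pathology foundation models, offering a scalable and generalizable route to AI-assisted precision oncology.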
Abstract: Histopathology is essential for disease diagnosis and treatment decision-making. Recent advances in artificial intelligence (AI) have enabled the development of pathology foundation models that learn rich visual representations from large-scale whole-slide images (WSIs). However, existing models are often trained on disparate datasets using varying strategies, leading to inconsistent performance and limited generalizability. Here, we introduce ELF (Ensemble Learning of Foundation models), a novel framework that integrates five state-of-the-art pathology foundation models to generate unified slide-level representations. Trained on 53,699 WSIs spanning 20 anatomical sites, ELF leverages ensemble learning to capture complementary information from diverse models while maintaining high data efficiency. Unlike traditional tile-level models, ELF’s slide-level architecture is particularly advantageous in clinical contexts where data are limited, such as therapeutic response prediction. We evaluated ELF across a wide range of clinical applications, including disease classification, biomarker detection, and response prediction to major anticancer therapies, cytotoxic chemotherapy, targeted therapy, and immunotherapy, across multiple cancer types. ELF consistently outperformed all constituent foundation models and existing slide-level models, demonstrating superior accuracy and robustness. Our results highlight the power of ensemble learning for pathology foundation models and suggest ELF as a scalable and generalizable solution for advancing AI-assisted precision oncology.
[69] Two-flow Feedback Multi-scale Progressive Generative Adversarial Network
Sun Weikai,Song Shijie,Chi Wenjie
Main category: cs.CV
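TL;DR: The paper proposes a novel two-flow feedback multi-scale progressive generative adversarial network (MSPG-SEN) that streamlines training, improves generation quality and stability, and lowers training cost.
Details
Motivation: Although diffusion models have advanced image generation, GANs still have room to develop because of their unique advantages. The authors aim to further improve GAN generation quality and training efficiency. Contribution: 1) Proposes the MSPG-SEN model, which markedly improves generated image quality and simplifies training; 2) introduces an adaptive perception-behavioral feedback loop (APFL) that improves model stability; 3) designs a globally connected two-flow dynamic residual network that improves generalization and flexibility; 4) proposes a dynamic embedded attention mechanism (DEMA) that strengthens feature expression while saving computing resources.
Method: 1) A two-flow feedback multi-scale progressive structure; 2) an adaptive feedback loop; 3) a two-flow dynamic residual network; 4) a dynamic embedded attention mechanism.
Result: Reports SOTA results on several datasets (for example, 89.7% on INKK and 96.4% on OPIN) together with a significantly lower training cost.
Insight: Combines a feedback mechanism with a multi-scale progressive structure, and adds an attention mechanism and a dynamic network design, offering a new direction for optimizing GANs.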
Abstract: Although diffusion models have made good progress in the field of image generation, GAN\cite{huang2023adaptive} still has considerable room for development due to its unique advantages, as seen in WGAN\cite{liu2021comparing}, SSGAN\cite{guibas2021adaptive} \cite{zhang2022vsa} \cite{zhou2024adapt} and so on. In this paper, we propose a novel two-flow feedback multi-scale progressive generative adversarial network (MSPG-SEN) for GAN models. This paper has four contributions: 1) We propose a two-flow feedback multi-scale progressive generative adversarial network (MSPG-SEN), which not only improves image quality and human visual perception while retaining the advantages of existing GAN models, but also simplifies the training process and reduces the training cost of GAN networks. Our experimental results show that MSPG-SEN achieves state-of-the-art generation results on the following five datasets: 89.7% on INKK, 78.3% on AWUN, 85.5% on IONJ, 88.7% on POKL, and 96.4% on OPIN. 2) We propose an adaptive perception-behavioral feedback loop (APFL), which effectively improves the robustness and training stability of the model and reduces the training cost. 3) We propose a globally connected two-flow dynamic residual network(). Ablation experiments show that it effectively improves training efficiency and greatly improves generalization ability, with stronger flexibility. 4) We propose a new dynamic embedded attention mechanism (DEMA). Experiments show that the attention can be extended to a variety of image processing tasks; it effectively captures global-local information, improves feature separation and feature expression capabilities, and requires minimal computing resources, only 88.7% with INJK, with strong cross-task capability.
[70] Domain Adaptation via Feature Refinement
Savvas Karatsiolis,Andreas Kamilaris
Main category: cs.CV
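TL;DR: The paper proposes DAFR2, a simple and effective unsupervised domain adaptation framework that combines Batch Normalization statistics adaptation, feature distillation, and hypothesis transfer to obtain a robust, domain-invariant feature space under distribution shift.
Details
Motivation: In unsupervised domain adaptation, distribution shift degrades model performance on the target domain. Conventional methods often need complex architectures or training objectives; this paper aims to align features across domains with a simple approach. Contribution: Proposes the DAFR2 framework, which combines Batch Normalization statistics adaptation, feature distillation, and hypothesis transfer to produce a robust, domain-invariant feature space without target-domain labels.
Method: Adapts Batch Normalization statistics to target-domain data and combines this with feature distillation and hypothesis transfer, aligning feature distributions at both the statistical and representational levels.
Result: Experiments on several benchmark datasets (such as CIFAR10-C and CIFAR100-C) show that DAFR2 outperforms existing methods in robustness to corruption.
Insight: Aligning feature distributions at the statistical and representational levels is key to domain adaptation performance, and DAFR2 improves feature alignment without adding model complexity.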
Abstract: We propose Domain Adaptation via Feature Refinement (DAFR2), a simple yet effective framework for unsupervised domain adaptation under distribution shift. The proposed method synergistically combines three key components: adaptation of Batch Normalization statistics using unlabeled target data, feature distillation from a source-trained model and hypothesis transfer. By aligning feature distributions at the statistical and representational levels, DAFR2 produces robust and domain-invariant feature spaces that generalize across similar domains without requiring target labels, complex architectures or sophisticated training objectives. Extensive experiments on benchmark datasets, including CIFAR10-C, CIFAR100-C, MNIST-C and PatchCamelyon-C, demonstrate that the proposed algorithm outperforms prior methods in robustness to corruption. Theoretical and empirical analyses further reveal that our method achieves improved feature alignment, increased mutual information between the domains and reduced sensitivity to input perturbations.
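The first DAFR2 component, adapting Batch Normalization statistics with unlabeled target data, can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the exponential-moving-average update, and the `momentum` value are assumptions chosen for illustration, and the numbers are made up.

```python
from statistics import fmean, pvariance


def adapt_bn_stats(target_batches, running_mean, running_var, momentum=0.1):
    """Blend per-channel statistics of unlabeled target batches into BN running stats.

    target_batches: iterable of batches; each batch is a list of feature vectors.
    momentum is a hypothetical hyperparameter, not a value from the paper.
    """
    mean, var = list(running_mean), list(running_var)
    for batch in target_batches:
        # zip(*batch) yields one tuple of values per channel.
        for c, values in enumerate(zip(*batch)):
            mean[c] = (1 - momentum) * mean[c] + momentum * fmean(values)
            var[c] = (1 - momentum) * var[c] + momentum * pvariance(values)
    return mean, var


def batch_norm(x, mean, var, eps=1e-5):
    """Normalize one feature vector with the given per-channel statistics."""
    return [(xi - m) / (v + eps) ** 0.5 for xi, m, v in zip(x, mean, var)]


# Source-domain statistics (made-up numbers) are replaced by target statistics.
source_mean, source_var = [0.0, 0.0], [1.0, 1.0]
target = [[[1.0, 10.0], [3.0, 30.0]]]  # one unlabeled batch: two samples, two channels
mean, var = adapt_bn_stats(target, source_mean, source_var, momentum=1.0)
normalized = batch_norm([3.0, 30.0], mean, var, eps=0.0)
```

No labels are used anywhere in the update, which is the property the abstract relies on. With `momentum=1.0` the source statistics are overwritten outright; smaller values blend source and target statistics.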
[71] 4D Virtual Imaging Platform for Dynamic Joint Assessment via Uni-Plane X-ray and 2D-3D Registration
Hao Tang,Rongxi Yi,Lei Li,Kaiyi Cao,Jiapeng Zhao,Yihan Xiao,Minghai Shi,Peng Yuan,Yan Xi,Hui Tang,Wei Li,Zhan Wu,Yixin Zhou
Main category: cs.CV
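TL;DR: The paper proposes an integrated 4D joint analysis platform that combines a dual robotic arm cone-beam CT (CBCT) system with dynamic 2D X-ray imaging for dynamic joint assessment, offering high accuracy at a low radiation dose.
Details
Motivation: Conventional CT cannot capture dynamic, weight-bearing joint motion, and current methods are limited either by radiation exposure or by incomplete spatial information, so a low-dose solution capable of 4D imaging is needed. Contribution: Develops an integrated 4D joint analysis platform that combines a CBCT system with 2D X-ray imaging and achieves accurate dynamic joint assessment through deep learning-based preprocessing and optimization algorithms.
Method: The approach comprises (1) a dual robotic arm CBCT system, (2) a hybrid imaging pipeline with deep learning-based preprocessing, 3D-2D projection, and iterative optimization, and (3) a clinically validated framework for quantitative kinematic assessment.
Result: In simulation studies the method reaches sub-voxel accuracy (0.235 mm) with a 99.18% success rate, outperforming conventional methods; clinical evaluation shows accurate quantification of knee motion in TKA patients.
Insight: The platform provides an efficient, low-dose 4D joint imaging tool for biomechanical research, precision diagnostics, and personalized orthopedic care.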
Abstract: Conventional computed tomography (CT) lacks the ability to capture dynamic, weight-bearing joint motion. Functional evaluation, particularly after surgical intervention, requires four-dimensional (4D) imaging, but current methods are limited by excessive radiation exposure or incomplete spatial information from 2D techniques. We propose an integrated 4D joint analysis platform that combines: (1) a dual robotic arm cone-beam CT (CBCT) system with a programmable, gantry-free trajectory optimized for upright scanning; (2) a hybrid imaging pipeline that fuses static 3D CBCT with dynamic 2D X-rays using deep learning-based preprocessing, 3D-2D projection, and iterative optimization; and (3) a clinically validated framework for quantitative kinematic assessment. In simulation studies, the method achieved sub-voxel accuracy (0.235 mm) with a 99.18 percent success rate, outperforming conventional and state-of-the-art registration approaches. Clinical evaluation further demonstrated accurate quantification of tibial plateau motion and medial-lateral variance in post-total knee arthroplasty (TKA) patients. This 4D CBCT platform enables fast, accurate, and low-dose dynamic joint imaging, offering new opportunities for biomechanical research, precision diagnostics, and personalized orthopedic care.
[72] Beyond Human-prompting: Adaptive Prompt Tuning with Semantic Alignment for Anomaly Detection
Pi-Wei Chen,Jerry Chun-Wei Lin,Wei-Han Chen,Jia Ji,Zih-Ching Chen,Feng-Hao Yeh,Chao-Chun Chen
Main category: cs.CV
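TL;DR: The paper proposes Adaptive Prompt Tuning (APT), which trains learnable prompts on self-generated anomaly samples with noise perturbations and significantly improves anomaly detection performance.
Details
Motivation: Existing prompt-based anomaly detection methods rely on human-designed prompts and lack accessible anomaly samples, which limits their understanding of context-specific anomalies. Contribution: Proposes the APT framework, which needs no prior knowledge; it uses self-generated anomaly samples and learnable prompts, together with a Self-Optimizing Meta-prompt Guiding Scheme (SMGS), to overcome the limitations of traditional approaches.
Method: APT generates anomaly samples with noise perturbations and uses them to train learnable prompts; SMGS iteratively aligns the prompts with general anomaly semantics to avoid overfitting.
Result: APT achieves state-of-the-art performance on multiple benchmark datasets without manual prompt design, providing a robust anomaly detection solution.
Insight: The method shows how adaptive prompt tuning and semantic alignment can improve a model's generalization to context-dependent anomalies.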
Abstract: Pre-trained Vision-Language Models (VLMs) have recently shown promise in detecting anomalies. However, previous approaches are fundamentally limited by their reliance on human-designed prompts and the lack of accessible anomaly samples, leading to significant gaps in context-specific anomaly understanding. In this paper, we propose Adaptive Prompt Tuning with semantic alignment for anomaly detection (APT), a groundbreaking prior knowledge-free, few-shot framework that overcomes the limitations of traditional prompt-based approaches. APT uses self-generated anomaly samples with noise perturbations to train learnable prompts that capture context-dependent anomalies in different scenarios. To prevent overfitting to synthetic noise, we propose a Self-Optimizing Meta-prompt Guiding Scheme (SMGS) that iteratively aligns the prompts with general anomaly semantics while incorporating diverse synthetic anomalies. Our system not only advances pixel-wise anomaly detection, but also achieves state-of-the-art performance on multiple benchmark datasets without requiring prior knowledge for prompt crafting, establishing a robust and versatile solution for real-world anomaly detection.
[73] RAGSR: Regional Attention Guided Diffusion for Image Super-Resolution
Haodong He,Yancheng Bai,Rui Lan,Xu Duan,Lei Sun,Xiangxiang Chu,Gui-Song Xia
Main category: cs.CV
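TL;DR: The paper proposes Regional Attention Guided Super-Resolution (RAGSR), which combines fine-grained regional descriptions with a novel attention mechanism to improve super-resolution quality in multi-object scenes.
Details
Motivation: Existing super-resolution methods built on vision-language models and diffusion models struggle to generate clear and accurate regional details, especially in multi-object scenes, mainly because they lack fine-grained regional descriptions and the models capture complex prompts poorly. Contribution: 1. Proposes RAGSR, which explicitly extracts and encodes fine-grained regional information through a regional attention mechanism; 2. designs region-text pairs as textual priors and uses a regional guided attention mechanism to prevent interference between unrelated region-text pairs.
Method: RAGSR localizes object regions in the image and assigns each a fine-grained description, forming region-text pairs that serve as textual priors. A regional guided attention mechanism then ensures these region-text pairs are handled correctly during attention.
Result: Experiments on benchmark datasets show that RAGSR generates perceptually authentic visual details while maintaining contextual consistency, outperforming existing methods.
Insight: 1. Fine-grained regional descriptions and textual priors are crucial for super-resolution quality; 2. a regional attention mechanism can control how text and image information are fused and avoid irrelevant interference.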
Abstract: The rich textual information of large vision-language models (VLMs) combined with the powerful generative prior of pre-trained text-to-image (T2I) diffusion models has achieved impressive performance in single-image super-resolution (SISR). However, existing methods still face significant challenges in generating clear and accurate regional details, particularly in scenarios involving multiple objects. This challenge primarily stems from a lack of fine-grained regional descriptions and the models’ insufficient ability to capture complex prompts. To address these limitations, we propose a Regional Attention Guided Super-Resolution (RAGSR) method that explicitly extracts localized fine-grained information and effectively encodes it through a novel regional attention mechanism, enabling both enhanced detail and overall visually coherent SR results. Specifically, RAGSR localizes object regions in an image and assigns a fine-grained caption to each region; these are formatted as region-text pairs that serve as textual priors for T2I models. A regional guided attention is then leveraged to ensure that each region-text pair is properly considered in the attention process while preventing unwanted interactions between unrelated region-text pairs. By leveraging this attention mechanism, our approach offers finer control over the integration of text and image information, thereby effectively overcoming limitations faced by traditional SISR techniques. Experimental results on benchmark datasets demonstrate that our approach exhibits superior performance in generating perceptually authentic visual details while maintaining contextual consistency compared to existing approaches.
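One common way to keep unrelated region-text pairs from interacting is to mask cross-attention so that an image token only attends to caption tokens of its own region. The sketch below illustrates that general idea; it is an assumption about how such masking could look, not RAGSR's actual mechanism, and `region_mask` and `masked_softmax` are hypothetical helper names.

```python
import math


def region_mask(image_region_ids, text_region_ids):
    """mask[i][j] is True when image token i and text token j share a region id."""
    return [[img == txt for txt in text_region_ids] for img in image_region_ids]


def masked_softmax(scores, mask):
    """Row-wise softmax over the allowed (True) positions; blocked positions get 0."""
    weights = []
    for row, allowed in zip(scores, mask):
        kept = [s for s, ok in zip(row, allowed) if ok]
        if not kept:  # no caption token for this region: attend to nothing
            weights.append([0.0] * len(row))
            continue
        peak = max(kept)  # subtract the max for numerical stability
        exps = [math.exp(s - peak) if ok else 0.0 for s, ok in zip(row, allowed)]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# Three image tokens (regions 0, 0, 1) and three caption tokens (regions 0, 1, 1).
mask = region_mask([0, 0, 1], [0, 1, 1])
attention = masked_softmax([[0.0, 0.0, 0.0]] * 3, mask)
```

In this toy case the two image tokens of region 0 put all their weight on the single region-0 caption token, while the region-1 image token splits its weight evenly over the two region-1 caption tokens.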
[74] Through the Looking Glass: A Dual Perspective on Weakly-Supervised Few-Shot Segmentation
Jiaqi Ma,Guo-Sen Xie,Fang Zhao,Zechao Li
Main category: cs.CV
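TL;DR: The paper proposes TLG, a novel homologous but heterogeneous network that uses heterogeneous visual aggregation (HA) modules and a heterogeneous transfer (HT) module to counter network homogenization in meta-learning, yielding substantial gains in weakly-supervised few-shot segmentation.
Details
Motivation: Existing meta-learning methods sample homogeneous support-query pairs and process them with identical networks, which leads to over-semantic homogenization. To address this, the paper proposes a heterogeneous network design that enhances complementarity while preserving semantic commonality. Contribution: 1. Proposes a homologous but heterogeneous network architecture; 2. designs heterogeneous visual aggregation (HA) and heterogeneous transfer (HT) modules; 3. introduces heterogeneous CLIP (HC) textual information to improve the generalization of multimodal models; 4. significantly outperforms existing methods on weakly-supervised few-shot segmentation, even surpassing fully supervised models.
Method: Treats support-query pairs as dual perspectives: HA modules enhance complementarity, the HT module reduces semantic noise and amplifies the uniqueness of heterogeneous semantics, and the HC module adds textual information to improve generalization.
Result: With only 1/24 of the parameters of existing SOTA models, TLG improves by 13.2% on Pascal-5i and 9.7% on COCO-20i and, according to the authors, is the first weakly supervised method to outperform fully supervised models under the same backbone architectures.
Insight: A heterogeneous network design can balance semantic commonality against heterogeneous complementarity, and weakly supervised methods hold considerable potential in few-shot tasks.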
Abstract: Meta-learning aims to uniformly sample homogeneous support-query pairs, characterized by the same categories and similar attributes, and extract useful inductive biases through identical network architectures. However, this identical network design results in over-semantic homogenization. To address this, we propose a novel homologous but heterogeneous network. By treating support-query pairs as dual perspectives, we introduce heterogeneous visual aggregation (HA) modules to enhance complementarity while preserving semantic commonality. To further reduce semantic noise and amplify the uniqueness of heterogeneous semantics, we design a heterogeneous transfer (HT) module. Finally, we propose heterogeneous CLIP (HC) textual information to enhance the generalization capability of multimodal models. In the weakly-supervised few-shot semantic segmentation (WFSS) task, with only 1/24 of the parameters of existing state-of-the-art models, TLG achieves a 13.2% improvement on Pascal-5i and a 9.7% improvement on COCO-20i. To the best of our knowledge, TLG is also the first weakly supervised (image-level) model that outperforms fully supervised (pixel-level) models under the same backbone architectures. The code is available at https://github.com/jarch-ma/TLG.
[75] FTIO: Frequent Temporally Integrated Objects
Mohammad Mohammadzadeh Kalati,Farhad Maleki,Ian McQuillan
Main category: cs.CV
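TL;DR: FTIO is a post-processing framework that significantly improves unsupervised video object segmentation (UVOS) by improving object selection and correcting temporal inconsistencies.
Details
Motivation: Unsupervised video object segmentation (UVOS) suffers from high uncertainty in object selection and from temporal inconsistencies, especially for small objects and complex structures. Contribution: Proposes an object selection criterion that combines frequency and saliency, together with a three-stage method for correcting temporal inconsistencies.
Method: Uses the frequency-saliency criterion to refine object selection and applies a three-stage method to integrate missing object mask regions.
Result: Experiments show that FTIO achieves SOTA performance in multi-object UVOS.
Insight: The frequency-saliency criterion and temporal integration improve the robustness and consistency of UVOS.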
Abstract: Predicting and tracking objects in real-world scenarios is a critical challenge in Video Object Segmentation (VOS) tasks. Unsupervised VOS (UVOS) has the additional challenge of finding an initial segmentation of salient objects, which affects the entire process and keeps a permanent uncertainty about the object proposals. Moreover, deformation and fast motion can lead to temporal inconsistencies. To address these problems, we propose Frequent Temporally Integrated Objects (FTIO), a post-processing framework with two key components. First, we introduce a combined criterion to improve object selection, mitigating failures common in UVOS–particularly when objects are small or structurally complex–by extracting frequently appearing salient objects. Second, we present a three-stage method to correct temporal inconsistencies by integrating missing object mask regions. Experimental results demonstrate that FTIO achieves state-of-the-art performance in multi-object UVOS. Code is available at: https://github.com/MohammadMohammadzadehKalati/FTIO
[76] SpecVLM: Enhancing Speculative Decoding of Video LLMs via Verifier-Guided Token Pruning
Yicheng Ji,Jun Zhang,Heming Xia,Jinpeng Chen,Lidan Shou,Gang Chen,Huan Li
Main category: cs.CV
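TL;DR: SpecVLM is a training-free speculative decoding framework for video large language models (Vid-LLMs) that uses two-stage video token pruning to accelerate decoding losslessly, reaching up to a 2.68× decoding speedup.
Details
Motivation: Video large language models incur heavy memory and computation overhead when processing dense video tokens, and existing video token reduction methods lose information. Contribution: Proposes the SpecVLM framework, which accelerates decoding losslessly through verifier-guided token pruning and shows that accuracy is preserved even when 90% of video tokens are pruned.
Method: Uses a two-stage pruning strategy: Stage I selects highly informative tokens based on attention signals from the verifier, and Stage II prunes the remaining redundant tokens in a spatially uniform manner.
Result: On four video understanding benchmarks, SpecVLM achieves up to a 2.68× decoding speedup for LLaVA-OneVision-72B and a 2.11× speedup for Qwen2.5-VL-32B.
Insight: The draft model's speculation is found to have low sensitivity to video token pruning, which opens a new direction for efficient video content processing.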
Abstract: Video large language models (Vid-LLMs) have shown strong capabilities in understanding video content. However, their reliance on dense video token representations introduces substantial memory and computational overhead in both prefilling and decoding. To mitigate the information loss of recent video token reduction methods and accelerate the decoding stage of Vid-LLMs losslessly, we introduce SpecVLM, a training-free speculative decoding (SD) framework tailored for Vid-LLMs that incorporates staged video token pruning. Building on our novel finding that the draft model’s speculation exhibits low sensitivity to video token pruning, SpecVLM prunes up to 90% of video tokens, enabling efficient speculation without sacrificing accuracy. To achieve this, it performs a two-stage pruning process: Stage I selects highly informative tokens guided by attention signals from the verifier (target model), while Stage II prunes remaining redundant ones in a spatially uniform manner. Extensive experiments on four video understanding benchmarks demonstrate the effectiveness and robustness of SpecVLM, which achieves up to 2.68× decoding speedup for LLaVA-OneVision-72B and 2.11× speedup for Qwen2.5-VL-32B.
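The two-stage pruning described above can be sketched as follows. This is a simplified illustration under stated assumptions rather than SpecVLM's implementation: tokens are treated as a flat 1-D sequence (so "spatially uniform" becomes an evenly spaced stride), and `keep_ratio` and `stage1_share` are hypothetical parameters.

```python
def two_stage_prune(attn_scores, keep_ratio=0.1, stage1_share=0.5):
    """Return the sorted indices of the video tokens that survive pruning.

    attn_scores: one verifier attention score per video token.
    keep_ratio: fraction of tokens kept overall (0.1 means 90% are pruned).
    stage1_share: fraction of the kept budget filled by Stage I (assumed value).
    """
    n = len(attn_scores)
    budget = max(1, round(n * keep_ratio))
    k1 = min(budget, round(budget * stage1_share))

    # Stage I: keep the tokens the verifier attends to most.
    ranked = sorted(range(n), key=lambda i: attn_scores[i], reverse=True)
    stage1 = set(ranked[:k1])

    # Stage II: sample the remaining budget at an even stride over what is left.
    remaining = [i for i in range(n) if i not in stage1]
    k2 = budget - k1
    stage2 = set()
    if k2 > 0 and remaining:
        step = len(remaining) / k2
        stage2 = {remaining[int(j * step)] for j in range(k2)}

    return sorted(stage1 | stage2)


# Twenty tokens; the verifier attends strongly to tokens 3 and 17.
scores = [0.0] * 20
scores[3], scores[17] = 0.9, 0.8
kept = two_stage_prune(scores, keep_ratio=0.2, stage1_share=0.5)
```

Here Stage I keeps the two high-attention tokens and Stage II adds two evenly spaced tokens from the rest, so low-attention areas of the frame still stay represented in the draft model's input.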
[77] T-Mask: Temporal Masking for Probing Foundation Models across Camera Views in Driver Monitoring
Thinesh Thiyakesan Ponbagavathi,Kunyu Peng,Alina Roitberg
Main category: cs.CV
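TL;DR: The paper proposes T-Mask, a temporal masking method for exploiting foundation models across camera views in driver monitoring, and shows experimentally that it outperforms existing lightweight adaptation methods.
Details
Motivation: Changes of camera viewpoint are a common challenge in driver monitoring. Deep learning methods and pretrained foundation models show promise under lightweight adaptation (such as linear probing), but their robustness to unseen viewpoints is underexplored. Contribution: 1. Introduces T-Mask, which improves cross-view generalization through temporal masking and emphasis on dynamic regions;
2. demonstrates T-Mask's advantage in cross-view tasks on the public Drive&Act dataset, with clear gains in recognition accuracy;
3. offers an effective solution for low-data settings and for recognizing secondary activities.
Method: 1. Adapts foundation models using linear probing, parameter-efficient fine-tuning (PEFT), and related methods;
2. proposes T-Mask, which masks temporal tokens and emphasizes dynamic regions to strengthen temporal modeling of video data.
Result: 1. In cross-view tasks, T-Mask improves top-1 accuracy by 1.23% over probing baselines and by 8.0% over PEFT methods;
2. for secondary activity recognition, it improves by 5.42% under the trained view and by 1.36% under cross-view settings.
Insight: 1. Lightweight adaptation methods such as T-Mask show promise in cross-view and low-data conditions;
2. temporal token selection is essential for building robust driver monitoring systems;
3. foundation models show strong generalization in fine-grained vision tasks.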
Abstract: Changes of camera perspective are a common obstacle in driver monitoring. While deep learning and pretrained foundation models show strong potential for improved generalization via lightweight adaptation of the final layers (‘probing’), their robustness to unseen viewpoints remains underexplored. We study this challenge by adapting image foundation models to driver monitoring using a single training view, and evaluating them directly on unseen perspectives without further adaptation. We benchmark simple linear probes, advanced probing strategies, and compare two foundation models (DINOv2 and CLIP) against parameter-efficient fine-tuning (PEFT) and full fine-tuning. Building on these insights, we introduce T-Mask – a new image-to-video probing method that leverages temporal token masking and emphasizes more dynamic video regions. Benchmarked on the public Drive&Act dataset, T-Mask improves cross-view top-1 accuracy by +1.23% over strong probing baselines and +8.0% over PEFT methods, without adding any parameters. It proves particularly effective for underrepresented secondary activities, boosting recognition by +5.42% under the trained view and +1.36% under cross-view settings. This work provides encouraging evidence that adapting foundation models with lightweight probing methods like T-Mask has strong potential in fine-grained driver observation, especially in cross-view and low-data settings. These results highlight the importance of temporal token selection when leveraging foundation models to build robust driver monitoring systems. Code and models will be made available at https://github.com/th-nesh/T-MASK to support ongoing research.
[78] Forecast then Calibrate: Feature Caching as ODE for Efficient Diffusion Transformers
Shikang Zheng,Liang Feng,Xinyu Wang,Qinming Zhou,Peiliang Cai,Chang Zou,Jiacheng Liu,Yuqi Lin,Junjie Chen,Yue Ma,Linfeng Zhang
Main category: cs.CV
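TL;DR: The paper proposes FoCa, which models feature caching as an ODE solving problem and significantly improves the inference efficiency of Diffusion Transformers while preserving generation quality under aggressive acceleration.
Details
Motivation: Current feature caching methods struggle to maintain generation quality at high acceleration ratios, mainly because they cannot robustly integrate historical features. Contribution: Proposes the FoCa framework, which models feature caching as ODE solving and uses a two-step forecast-then-calibrate strategy to improve both inference efficiency and generation quality.
Method: FoCa models the feature sequence as an ODE trajectory: it first forecasts future features and then corrects the error in a calibration step, giving efficient and stable acceleration.
Result: Across tasks (image synthesis, video generation, and super-resolution), FoCa delivers substantial speedups (up to 6.45×) with minimal quality loss.
Insight: The ODE view gives feature caching a theoretical footing, and the forecast-then-calibrate strategy mitigates error accumulation at high acceleration ratios.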
Abstract: Diffusion Transformers (DiTs) have demonstrated exceptional performance in high-fidelity image and video generation. To reduce their substantial computational costs, feature caching techniques have been proposed to accelerate inference by reusing hidden representations from previous timesteps. However, current methods often struggle to maintain generation quality at high acceleration ratios, where prediction errors increase sharply due to the inherent instability of long-step forecasting. In this work, we adopt an ordinary differential equation (ODE) perspective on the hidden-feature sequence, modeling layer representations along the trajectory as a feature-ODE. We attribute the degradation of existing caching strategies to their inability to robustly integrate historical features under large skipping intervals. To address this, we propose FoCa (Forecast-then-Calibrate), which treats feature caching as a feature-ODE solving problem. Extensive experiments on image synthesis, video generation, and super-resolution tasks demonstrate the effectiveness of FoCa, especially under aggressive acceleration. Without additional training, FoCa achieves near-lossless speedups of 5.50 times on FLUX, 6.45 times on HunyuanVideo, 3.17 times on Inf-DiT, and maintains high quality with a 4.53 times speedup on DiT.
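The abstract does not give FoCa's update equations, so the sketch below swaps in a textbook predictor-corrector scheme (Heun's method) on a toy ODE purely to illustrate the general forecast-then-correct pattern that ODE solvers use. It should not be read as FoCa's algorithm; the function name and the toy dynamics are assumptions.

```python
import math


def forecast_then_correct(f, t, y, h):
    """One Heun (predictor-corrector) step for dy/dt = f(t, y)."""
    slope_start = f(t, y)
    y_forecast = y + h * slope_start  # forecast: explicit Euler guess
    slope_end = f(t + h, y_forecast)  # slope re-evaluated at the guess
    # Correction: replace the single start slope by the average of both slopes.
    y_corrected = y + 0.5 * h * (slope_start + slope_end)
    return y_forecast, y_corrected


def decay(t, y):
    """Toy dynamics with the known solution y(t) = exp(-t)."""
    return -y


y_forecast, y_corrected = forecast_then_correct(decay, 0.0, 1.0, 0.1)
exact = math.exp(-0.1)
forecast_error = abs(y_forecast - exact)
corrected_error = abs(y_corrected - exact)
```

On this toy problem the corrected value lands closer to the exact solution than the raw forecast, which is the intuition behind pairing a cheap forecast with a correction step when large intervals are skipped.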
[79] OmniCache: A Trajectory-Oriented Global Perspective on Training-Free Cache Reuse for Diffusion Transformer Models
Huanpeng Chu,Wei Wu,Guanyu Fen,Yutao Zhang
Main category: cs.CV
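TL;DR: OmniCache is a training-free cache reuse method that analyzes the sampling trajectories of diffusion Transformer models and optimizes the caching strategy globally, significantly improving computational efficiency while maintaining generation quality.
Details
Motivation: Real-time deployment of diffusion Transformer models is hindered by high computational cost, and existing caching methods look only at local inter-step similarity. OmniCache optimizes cache reuse from a global sampling perspective. Contribution: Proposes OmniCache, a global cache optimization method based on sampling trajectories that dynamically filters noise and accelerates diffusion models without training.
Method: Analyzes the model's sampling trajectories, distributes cache reuse across the entire sampling process, and dynamically estimates and filters noise to reduce its impact on the sampling direction.
Result: Experiments show that OmniCache accelerates sampling while maintaining generation quality, offering a practical solution for real-time deployment of diffusion models.
Insight: A global caching strategy exploits computational redundancy more effectively, and dynamic noise filtering further improves sampling efficiency and generation quality.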
Abstract: Diffusion models have emerged as a powerful paradigm for generative tasks such as image synthesis and video generation, with Transformer architectures further enhancing performance. However, the high computational cost of diffusion Transformers, stemming from a large number of sampling steps and complex per-step computations, presents significant challenges for real-time deployment. In this paper, we introduce OmniCache, a training-free acceleration method that exploits the global redundancy inherent in the denoising process. Unlike existing methods that determine caching strategies based on inter-step similarities and tend to prioritize reusing later sampling steps, our approach originates from the sampling perspective of DiT models. We systematically analyze the model’s sampling trajectories and strategically distribute cache reuse across the entire sampling process. This global perspective enables more effective utilization of cached computations throughout the diffusion trajectory, rather than concentrating reuse within limited segments of the sampling procedure. In addition, during cache reuse, we dynamically estimate the corresponding noise and filter it out to reduce its impact on the sampling direction. Extensive experiments demonstrate that our approach accelerates the sampling process while maintaining competitive generative quality, offering a promising and practical solution for efficient deployment of diffusion-based generative models.
[80] MedOmni-45°: A Safety-Performance Benchmark for Reasoning-Oriented LLMs in Medicine
Kaiyuan Ji,Yijin Guo,Zicheng Zhang,Xiangyang Zhu,Yuan Tian,Ning Liu,Guangtao Zhai
Main category: cs.CV
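TL;DR: The paper introduces the MedOmni-45 Degrees benchmark for evaluating the safety-performance trade-off of large language models (LLMs) during medical reasoning, focusing on Chain-of-Thought faithfulness and resistance to sycophancy.
Details
Motivation: As LLMs are increasingly used in medical decision support, the reliability of their reasoning needs to be evaluated, yet existing benchmarks usually collapse such vulnerabilities into a single accuracy score. Contribution: Proposes the MedOmni-45 Degrees benchmark and workflow, which quantify the safety-performance trade-off of LLMs under manipulative hint conditions and cover multiple medical specialties and task types.
Method: The benchmark contains 1,804 medical reasoning questions, each paired with seven manipulative hint types and a no-hint baseline, producing about 27K inputs. Seven LLMs are evaluated with Accuracy, CoT-Faithfulness, and Anti-Sycophancy metrics, and the results are visualized with a 45-degree plot.
Result: No model surpasses the diagonal. The open-source QwQ-32B comes closest (43.81 degrees), balancing safety and performance best without leading in both.
Insight: MedOmni-45 Degrees is intended to expose reasoning vulnerabilities in medical LLMs and to guide the development of safer models.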
Abstract: With the increasing use of large language models (LLMs) in medical decision-support, it is essential to evaluate not only their final answers but also the reliability of their reasoning. Two key risks are Chain-of-Thought (CoT) faithfulness – whether reasoning aligns with responses and medical facts – and sycophancy, where models follow misleading cues over correctness. Existing benchmarks often collapse such vulnerabilities into single accuracy scores. To address this, we introduce MedOmni-45 Degrees, a benchmark and workflow designed to quantify safety-performance trade-offs under manipulative hint conditions. It contains 1,804 reasoning-focused medical questions across six specialties and three task types, including 500 from MedMCQA. Each question is paired with seven manipulative hint types and a no-hint baseline, producing about 27K inputs. We evaluate seven LLMs spanning open- vs. closed-source, general-purpose vs. medical, and base vs. reasoning-enhanced models, totaling over 189K inferences. Three metrics – Accuracy, CoT-Faithfulness, and Anti-Sycophancy – are combined into a composite score visualized with a 45 Degrees plot. Results show a consistent safety-performance trade-off, with no model surpassing the diagonal. The open-source QwQ-32B performs closest (43.81 Degrees), balancing safety and accuracy but not leading in both. MedOmni-45 Degrees thus provides a focused benchmark for exposing reasoning vulnerabilities in medical LLMs and guiding safer model development.
[81] PromptFlare: Prompt-Generalized Defense via Cross-Attention Decoy in Diffusion-Based Inpainting
Hohyun Na,Seunghoo Hong,Simon S. Woo
Main category: cs.CV
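TL;DR: PromptFlare is a new adversarial protection method based on the cross-attention mechanism; it injects noise to suppress the sampling process of diffusion-based inpainting models and effectively prevents malicious image modification.
Details
Motivation: The success of diffusion models makes high-quality image modification easy, but it can also be exploited maliciously. Existing methods rely on image-level inconsistencies and cannot address the influence of textual prompts. Contribution: Proposes PromptFlare, which exploits the cross-attention mechanism and the intrinsic properties of prompt embeddings, injecting noise that acts as a ‘cross-attention decoy’ to suppress malicious modifications.
Method: Identifies the shared, invariant token in prompt embeddings and injects adversarial noise that disrupts the sampling process, so that the model can no longer align the prompt with the image effectively.
Result: Performs strongly on the EditBench dataset while significantly reducing computational overhead and GPU memory usage.
Insight: The cross-attention mechanism can be used for defense: injected noise diverts the model's attention away from the prompt.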
Abstract: The success of diffusion models has enabled effortless, high-quality image modifications that precisely align with users’ intentions, thereby raising concerns about their potential misuse by malicious actors. Previous studies have attempted to mitigate such misuse through adversarial attacks. However, these approaches heavily rely on image-level inconsistencies, which pose fundamental limitations in addressing the influence of textual prompts. In this paper, we propose PromptFlare, a novel adversarial protection method designed to protect images from malicious modifications facilitated by diffusion-based inpainting models. Our approach leverages the cross-attention mechanism to exploit the intrinsic properties of prompt embeddings. Specifically, we identify and target a shared prompt token that is invariant and semantically uninformative, injecting adversarial noise to suppress the sampling process. The injected noise acts as a cross-attention decoy, diverting the model’s focus away from meaningful prompt-image alignments and thereby neutralizing the effect of the prompt. Extensive experiments on the EditBench dataset demonstrate that our method achieves state-of-the-art performance across various metrics while significantly reducing computational overhead and GPU memory usage. These findings highlight PromptFlare as a robust and efficient protection against unauthorized image manipulations. The code is available at https://github.com/NAHOHYUN-SKKU/PromptFlare.
[82] An Investigation of Visual Foundation Models Robustness
Sandeep Gupta,Roberto Passerone
Main category: cs.CV
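TL;DR: The paper examines the robustness requirements of Visual Foundation Models (VFMs) in computer vision tasks, analyzes the strengths and weaknesses of existing defenses and training strategies, and discusses the challenges and methods of evaluating network robustness.
Details
Motivation: Using VFMs in security-sensitive domains (such as biometric verification and autonomous driving) requires high robustness to the many disturbances of dynamic environments (such as lighting, weather, and sensor noise). The paper studies how to improve VFM robustness. Contribution: 1. Analyzes the robustness requirements of VFMs in dynamic environments; 2. summarizes existing defenses and robust training strategies; 3. identifies challenges in evaluating network robustness and metrics to guide it.
Method: Surveys the literature on challenges such as distribution shift, noisy inputs, and adversarial attacks, and reviews the limitations of existing defenses (such as adversarial training and robust optimization).
Result: Existing defense mechanisms face challenges related to network properties and component choices, and need further research and improvement.
Insight: 1. VFM robustness should be evaluated along multiple dimensions; 2. future work should focus on optimizing network architectures and training strategies to improve adaptability in dynamic environments.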
Abstract: Visual Foundation Models (VFMs) are becoming ubiquitous in computer vision, powering systems for diverse tasks such as object detection, image classification, segmentation, pose estimation, and motion tracking. VFMs are capitalizing on seminal innovations in deep learning models, such as LeNet-5, AlexNet, ResNet, VGGNet, InceptionNet, DenseNet, YOLO, and ViT, to deliver superior performance across a range of critical computer vision applications. These include security-sensitive domains like biometric verification, autonomous vehicle perception, and medical image analysis, where robustness is essential to fostering trust between technology and the end-users. This article investigates network robustness requirements crucial in computer vision systems to adapt effectively to dynamic environments influenced by factors such as lighting, weather conditions, and sensor characteristics. We examine the prevalent empirical defenses and robust training employed to enhance vision network robustness against real-world challenges such as distributional shifts, noisy and spatially distorted inputs, and adversarial attacks. Subsequently, we provide a comprehensive analysis of the challenges associated with these defense mechanisms, including network properties and components to guide ablation studies and benchmarking metrics to evaluate network robustness.
[83] FlexMUSE: Multimodal Unification and Semantics Enhancement Framework with Flexible interaction for Creative Writing
Jiahao Chen,Zhiyong Ma,Wenbiao Du,Qingyuan Chuai
Main category: cs.CV
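TL;DR: FlexMUSE is a multimodal unification and semantics enhancement framework for creative writing that uses flexible interaction patterns and semantic alignment to improve the creativity and consistency of multimodal output.
Details
Motivation: Existing multimodal generation methods usually require specific input modalities or costly training, and tend to produce semantic inconsistencies in multi-modal creative writing (MMCW). Contribution: Proposes the FlexMUSE framework, which enhances semantics through optional visual input, modality semantic alignment gating (msaGate), and cross-modality attention fusion. It also designs modality semantic creative direct preference optimization (mscDPO) and a new dataset, ArtMUSE.
Method: Uses a T2I module to support optional visual input, uses msaGate to restrict the textual input for modality alignment, proposes attention-based cross-modality fusion for semantic enhancement, and promotes creativity with mscDPO.
Result: FlexMUSE shows good consistency, creativity, and coherence on multi-modal creative writing tasks.
Insight: Modality semantic alignment and enhancement are crucial to the quality of multi-modal creative writing, and flexible interaction patterns can further unlock creativity.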
Abstract: Multi-modal creative writing (MMCW) aims to produce illustrated articles. Unlike common multi-modal generative (MMG) tasks such as storytelling or caption generation, MMCW is an entirely new and more abstract challenge where textual and visual contexts are not strictly related to each other. Existing methods for related tasks can be forcibly migrated to this track, but they require specific modality inputs or costly training, and often suffer from semantic inconsistencies between modalities. Therefore, the main challenge lies in economically performing MMCW with flexible interactive patterns, where the semantics between the modalities of the output are more aligned. In this work, we propose FlexMUSE with a T2I module to enable optional visual input. FlexMUSE promotes creativity and emphasizes the unification between modalities by proposing the modality semantic alignment gating (msaGate) to restrict the textual input. Besides, an attention-based cross-modality fusion is proposed to augment the input features for semantic enhancement. The modality semantic creative direct preference optimization (mscDPO) within FlexMUSE is designed by extending the rejected samples to facilitate the writing creativity. Moreover, to advance the MMCW, we release a dataset called ArtMUSE, which contains around 3k calibrated text-image pairs. FlexMUSE achieves promising results, demonstrating its consistency, creativity and coherence.
[84] UniEM-3M: A Universal Electron Micrograph Dataset for Microstructural Segmentation and Generation
Nan Wang,Zhiyi Xia,Yiming Li,Shi Tang,Zuxin Fan,Xi Fang,Haoyi Tao,Xiaochen Cai,Guolin Ke,Linfeng Zhang,Yanhui Hong
Main category: cs.CV
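TL;DR: This paper introduces UniEM-3M, the first large-scale, multimodal electron micrograph dataset for instance-level understanding. It comprises 5,091 high-resolution images, about 3 million instance segmentation labels, and image-level attribute-disentangled textual descriptions. The authors also release a diffusion-based text-to-image model that serves as a data augmentation tool and as a proxy for the complete data distribution.
Details
Motivation: Quantitative microstructural characterization in materials science relies on electron micrographs (EMs), but deep learning progress in this area has been held back by the scarcity of large-scale, diverse, expert-annotated data. This paper aims to address that gap. Contribution: 1) Releases UniEM-3M, the first large-scale EM dataset with instance segmentation labels and textual descriptions; 2) trains a text-to-image diffusion model for data augmentation; 3) provides a benchmark on UniEM-3M and a baseline model, UniEM-Net.
Method: 1) Collects and annotates a large-scale, multimodal EM dataset; 2) trains a diffusion-based text-to-image generator; 3) proposes the flow-based baseline UniEM-Net and benchmarks representative instance segmentation methods on the dataset.
Result: Experiments show that the flow-based UniEM-Net outperforms other advanced methods on the UniEM-3M benchmark.
Insight: 1) Large-scale annotated datasets are essential for deep learning applications in materials science; 2) text-to-image generative models can serve as effective data augmentation tools; 3) flow-based models perform well on instance segmentation.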
Abstract: Quantitative microstructural characterization is fundamental to materials science, where electron micrograph (EM) provides indispensable high-resolution insights. However, progress in deep learning-based EM characterization has been hampered by the scarcity of large-scale, diverse, and expert-annotated datasets, due to acquisition costs, privacy concerns, and annotation complexity. To address this issue, we introduce UniEM-3M, the first large-scale and multimodal EM dataset for instance-level understanding. It comprises 5,091 high-resolution EMs, about 3 million instance segmentation labels, and image-level attribute-disentangled textual descriptions, a subset of which will be made publicly available. Furthermore, we are also releasing a text-to-image diffusion model trained on the entire collection to serve as both a powerful data augmentation tool and a proxy for the complete data distribution. To establish a rigorous benchmark, we evaluate various representative instance segmentation methods on the complete UniEM-3M and present UniEM-Net as a strong baseline model. Quantitative experiments demonstrate that this flow-based model outperforms other advanced methods on this challenging benchmark. Our multifaceted release of a partial dataset, a generative model, and a comprehensive benchmark – available at huggingface – will significantly accelerate progress in automated materials analysis.
[85] Structuring GUI Elements through Vision Language Models: Towards Action Space Generation
Yi Xu,Yesheng Zhang,Jiajia Liu,Jingdong Chen
Main category: cs.CV
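TL;DR: This paper proposes an IoU-Augmented Maximum Likelihood (IAML) training paradigm to improve multimodal large language models (MLLMs) at localizing graphical user interface (GUI) elements. It targets two obstacles to precise coordinate generation: the lack of semantics in numerical coordinates and exposure bias.
Details
Motivation: MLLMs show great potential for structuring GUI elements, but their ability to generate precise UI element coordinates is limited, mainly because numerical coordinates carry little semantics in the language representation space and because conventional training suffers from exposure bias. Contribution: Proposes the IAML training paradigm, which combines an IoU-based coordinate sampling strategy for data augmentation with a new training method to improve MLLM performance on GUI element localization.
Method: Introduces an IoU-based coordinate sampling pipeline to augment the training data, then fine-tunes MLLMs under the IAML paradigm to mitigate the exposure bias of traditional maximum likelihood estimation.
Result: Extensive experiments show that IAML training outperforms traditional training paradigms on GUI element localization.
Insight: Data augmentation combined with a tailored training paradigm can compensate for MLLMs' weakness in generating numerical coordinates, offering a new approach to GUI understanding and interaction design.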
Abstract: Multimodal large language models (MLLMs) have emerged as pivotal tools in enhancing human-computer interaction. In this paper we focus on the application of MLLMs in the field of graphical user interface (GUI) elements structuring, where they assist in processing user instructions based on screen contents. Despite the promise of MLLMs, their performance in precisely generating UI element coordinates, a critical aspect of GUI understanding, is hindered by the nature of next-token prediction training. This challenge arises from the semantic void surrounding numerical UI coordinates in language representation spaces, necessitating a substantial and diverse dataset to bolster visual module capabilities. To address these limitations, we introduce an IoU-Augmented Maximum Likelihood (IAML) training paradigm. Specifically, our approach involves a novel pipeline for IoU-based coordinate sampling to augment the training data, which considers the proximity to ground truth coordinates. This data augmentation strategy is then employed to fine-tune MLLMs under the IAML paradigm, which is designed to mitigate the exposure bias problem inherent in traditional maximum likelihood estimation. Through extensive experiments, we demonstrate the superior performance of our IAML training approach over traditional training paradigms.
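The IoU-based coordinate sampling described in this abstract can be illustrated with a minimal sketch. The abstract does not give the actual sampling pipeline, so the function names, the uniform jitter, and the `min_iou` threshold below are assumptions made only for illustration: candidate boxes are perturbed copies of the ground-truth box, kept when their IoU with the ground truth is high enough.

```python
import random


def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def sample_near_boxes(gt_box, n, max_shift=0.2, min_iou=0.5, seed=0, max_tries=10000):
    """Hypothetical sampler: jitter each coordinate of gt_box and keep
    candidates whose IoU with gt_box is at least min_iou."""
    rng = random.Random(seed)
    width, height = gt_box[2] - gt_box[0], gt_box[3] - gt_box[1]
    samples, tries = [], 0
    while len(samples) < n and tries < max_tries:
        tries += 1
        # x coordinates are jittered relative to width, y relative to height
        cand = tuple(
            c + rng.uniform(-max_shift, max_shift) * (width if i % 2 == 0 else height)
            for i, c in enumerate(gt_box)
        )
        if cand[2] <= cand[0] or cand[3] <= cand[1]:
            continue  # skip degenerate boxes
        score = iou(gt_box, cand)
        if score >= min_iou:
            samples.append((cand, score))
    return samples
```

Each returned pair holds a sampled box and its IoU with the ground truth, so the score could serve as a proximity weight during training. How IAML actually uses the sampled coordinates is not stated in the abstract.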
[86] IRSAMap: Towards Large-Scale, High-Resolution Land Cover Map Vectorization
Yu Meng,Ligao Deng,Zhihao Xi,Jiansheng Chen,Jingbo Chen,Anzhi Yue,Diyou Liu,Kai Li,Chenhao Wang,Kaiyu Li,Yupeng Deng,Xian Sun
Main category: cs.CV
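TL;DR: IRSAMap is a global remote sensing dataset for large-scale, high-resolution land cover map vectorization. It addresses the limited class annotations, small data scale, and missing spatial structural information of existing datasets.
Details
Motivation: As remote sensing image resolution improves and deep learning advances rapidly, land cover mapping is moving from pixel-level segmentation to object-based vector modeling, and existing datasets cannot meet the resulting need for precise object boundaries and topological consistency. Contribution: IRSAMap is the first global remote sensing dataset for this purpose. It offers vector annotations for over 1.8 million instances, an intelligent annotation workflow, global coverage, and multi-task adaptability, and it provides a benchmark for geographic feature automation and collaborative modeling.
Method: An intelligent annotation workflow combines manual and AI-based labeling. The data cover 79 regions worldwide, totaling over 1,000 km, and support multiple tasks such as pixel-level classification and building outline extraction.
Result: IRSAMap provides a standardized benchmark for object-level land cover mapping and supports geographic information updates and digital twin construction.
Insight: IRSAMap fills a gap in land cover vector datasets in terms of scale, resolution, and spatial structural information, giving deep learning models richer resources for training and evaluation.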
Abstract: With the enhancement of remote sensing image resolution and the rapid advancement of deep learning, land cover mapping is transitioning from pixel-level segmentation to object-based vector modeling. This shift demands more from deep learning models, requiring precise object boundaries and topological consistency. However, existing datasets face three main challenges: limited class annotations, small data scale, and lack of spatial structural information. To overcome these issues, we introduce IRSAMap, the first global remote sensing dataset for large-scale, high-resolution, multi-feature land cover vector mapping. IRSAMap offers four key advantages: 1) a comprehensive vector annotation system with over 1.8 million instances of 10 typical objects (e.g., buildings, roads, rivers), ensuring semantic and spatial accuracy; 2) an intelligent annotation workflow combining manual and AI-based methods to improve efficiency and consistency; 3) global coverage across 79 regions in six continents, totaling over 1,000 km; and 4) multi-task adaptability for tasks like pixel-level classification, building outline extraction, road centerline extraction, and panoramic segmentation. IRSAMap provides a standardized benchmark for the shift from pixel-based to object-based approaches, advancing geographic feature automation and collaborative modeling. It is valuable for global geographic information updates and digital twin construction. The dataset is publicly available at https://github.com/ucas-dlg/IRSAMap
[87] Learning Long-Range Action Representation by Two-Stream Mamba Pyramid Network for Figure Skating Assessment
Fengshun Wang,Qiurui Wang,Peilin Zhao
Main category: cs.CV
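TL;DR: The paper proposes a two-stream Mamba pyramid network for figure skating scoring (TES and PCS). It separates a visual-feature TES evaluation stream from an audio-visual PCS evaluation stream to address three challenges in existing methods, and it uses Mamba's ability to capture long-range dependencies to process long videos efficiently.
Details
Motivation: Existing figure skating assessment methods ignore prior knowledge of the judging criteria, do not distinguish the feature needs of TES and PCS, do not score action elements individually, and handle long videos inefficiently. Contribution: 1. Proposes a two-stream Mamba pyramid network that handles TES (visual features) and PCS (audio-visual features) separately; 2. introduces a multi-level fusion mechanism that leaves the video features used for TES unaffected while improving PCS estimation; 3. uses a multi-scale Mamba pyramid to localize and evaluate action elements efficiently.
Method: 1. TES evaluation stream: a multi-scale Mamba pyramid and a TES head localize and score action elements; 2. PCS evaluation stream: a multi-level fusion mechanism combines visual and auditory features; 3. Mamba's long-range dependency modeling and linear computational complexity are used to process long videos.
Result: Achieves state-of-the-art performance on the FineFS benchmark.
Insight: 1. Distinguishing the feature needs of TES and PCS is key; 2. Mamba models suit long-video tasks; 3. multi-level fusion makes multimodal features more effective.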
Abstract: Technical Element Score (TES) and Program Component Score (PCS) evaluations in figure skating demand precise assessment of athletic actions and artistic interpretation, respectively. Existing methods face three major challenges. Firstly, video and audio cues are regarded as common features for both TES and PCS predictions in previous works without considering the prior evaluation criterion of figure skating. Secondly, action elements in competitions are separated in time, and TES should be derived from each element’s score, but existing methods try to give an overall TES prediction without evaluating each action element. Thirdly, lengthy competition videos make it difficult and inefficient to handle long-range contexts. To address these challenges, we propose a two-stream Mamba pyramid network that aligns with actual judging criteria to predict TES and PCS by separating the visual-feature based TES evaluation stream from the audio-visual-feature based PCS evaluation stream. In the PCS evaluation stream, we introduce a multi-level fusion mechanism to guarantee that video-based features remain unaffected when assessing TES, and enhance PCS estimation by fusing visual and auditory cues across each contextual level of the pyramid. In the TES evaluation stream, the multi-scale Mamba pyramid and TES head we propose effectively address the challenges of localizing and evaluating action elements with various temporal scales and give score predictions. With Mamba’s superior ability to capture long-range dependencies and its linear computational complexity, our method is ideal for handling lengthy figure skating videos. Comprehensive experimentation demonstrates that our framework attains state-of-the-art performance on the FineFS benchmark. Our source code is available at https://github.com/ycwfs/Figure-Skating-Action-Quality-Assessment.
[88] A Multimodal-Multitask Framework with Cross-modal Relation and Hierarchical Interactive Attention for Semantic Comprehension
Mohammad Zia Ur Rehman,Devraj Raghuvanshi,Umang Jain,Shubhi Bansal,Nagendra Kumar
Main category: cs.CV
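TL;DR: The paper proposes MM-ORIENT, a multimodal-multitask framework that uses cross-modal relation graphs and hierarchical interactive attention to reduce the effect of noise across modalities and improve multitask performance.
Details
Motivation: A major challenge in multimodal learning is noise within individual modalities, which degrades multimodal representations, especially when modalities interact explicitly. In addition, existing multimodal fusion methods can overlook discriminative information within a single modality. Contribution: 1. Proposes the MM-ORIENT framework, which uses cross-modal relation graphs to reduce the effect of noise at the latent stage. 2. Designs Hierarchical Interactive Monomodal Attention (HIMA) to focus on key information within a single modality. 3. Validates the framework on multiple tasks and datasets.
Method: 1. Cross-modal relation graphs: monomodal features are reconstructed from neighborhoods determined by the features of a different modality, yielding multimodal representations. 2. HIMA: a hierarchical attention mechanism learns discriminative monomodal features to support multitask learning.
Result: Experiments on three datasets show that MM-ORIENT comprehends multimodal content effectively across multiple tasks.
Insight: By avoiding explicit interaction between modalities, MM-ORIENT reduces the effect of noise at the latent stage, while HIMA preserves discriminative monomodal information, suggesting a new direction for multimodal multitask learning.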
Abstract: A major challenge in multimodal learning is the presence of noise within individual modalities. This noise inherently affects the resulting multimodal representations, especially when these representations are obtained through explicit interactions between different modalities. Moreover, multimodal fusion techniques, while aiming to achieve a strong joint representation, can neglect valuable discriminative information within the individual modalities. To this end, we propose a Multimodal-Multitask framework with crOss-modal Relation and hIErarchical iNteractive aTtention (MM-ORIENT) that is effective for multiple tasks. The proposed approach acquires multimodal representations cross-modally without explicit interaction between different modalities, reducing the noise effect at the latent stage. To achieve this, we propose cross-modal relation graphs that reconstruct monomodal features to acquire multimodal representations. The features are reconstructed based on the node neighborhood, where the neighborhood is decided by the features of a different modality. We also propose Hierarchical Interactive Monomodal Attention (HIMA) to focus on pertinent information within a modality. While cross-modal relation graphs help comprehend high-order relationships between two modalities, HIMA helps in multitasking by learning discriminative features of individual modalities before late-fusing them. Finally, extensive experimental evaluation on three datasets demonstrates that the proposed approach effectively comprehends multimodal content for multiple tasks.
[89] Exploiting Information Redundancy in Attention Maps for Extreme Quantization of Vision Transformers
Lucas Maisonnave,Karim Haroun,Tom Pegeot
Main category: cs.CV
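TL;DR: This paper proposes Entropy Attention Maps (EAM), a method that exploits information redundancy in attention maps: it quantizes low-entropy attention heads to reduce computation and memory requirements while preserving model performance.
Details
Motivation: The Multi-Head Self-Attention (MHSA) mechanism in Transformers is computationally expensive and memory hungry, which limits deployment on edge devices. The authors observe that low-entropy attention heads contribute less information, which motivates a targeted compression strategy. Contribution: Proposes the Entropy Attention Maps (EAM) model, which freezes the weights of low-entropy attention maps and quantizes them to low precision to avoid redundant re-computation, and validates it on ImageNet-1k.
Method: Uses Shannon entropy to quantify the information captured by each attention head, then freezes and quantizes the low-entropy heads to cut computational cost.
Result: For DeiT and Swin Transformer models, EAM achieves similar or higher accuracy at ≤20% sparsity in attention maps and competitive performance beyond that level.
Insight: Information redundancy in attention heads can be identified through entropy analysis, and compressing the low-entropy parts selectively is an effective way to accelerate models.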
Abstract: Transformer models rely on Multi-Head Self-Attention (MHSA) mechanisms, where each attention head contributes to the final representation. However, their computational complexity and high memory demands due to MHSA hinder their deployment at the edge. In this work, we analyze and exploit information redundancy in attention maps to accelerate model inference. By quantifying the information captured by each attention head using Shannon entropy, our analysis reveals that attention heads with lower entropy, i.e., exhibiting more deterministic behavior, tend to contribute less information, motivating targeted compression strategies. Relying on these insights, we propose Entropy Attention Maps (EAM), a model that freezes the weights of low-entropy attention maps and quantizes these values to low precision to avoid redundant re-computation. Empirical validation on ImageNet-1k shows that EAM achieves similar or higher accuracy at $\leq$20% sparsity in attention maps and competitive performance beyond this level for the DeiT and Swin Transformer models.
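The entropy-based head ranking mentioned in this abstract can be sketched as follows. This is only an illustrative reading: the abstract does not say how entropy is aggregated per head or how heads are selected, so averaging the row entropies and picking a fixed fraction of heads (the hypothetical `fraction` parameter) are assumptions.

```python
import math


def row_entropy(probs):
    """Shannon entropy (in bits) of one attention row, i.e. one query's
    distribution over keys. Zero-probability entries contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def head_entropy(attn_map):
    """Assumed aggregation: mean row entropy over all queries of a head."""
    return sum(row_entropy(row) for row in attn_map) / len(attn_map)


def low_entropy_heads(attn_maps, fraction):
    """Indices of the lowest-entropy heads, covering `fraction` of all heads.
    These would be the candidates for freezing and low-precision quantization."""
    scores = [head_entropy(m) for m in attn_maps]
    k = int(len(scores) * fraction)
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    return order[:k]
```

For example, a head whose rows are all one-hot has entropy 0 and is selected before a head with uniform rows, whose entropy is log2 of the number of keys.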
[90] Vision encoders should be image size agnostic and task driven
Nedyalko Prisadnikov,Danda Pani Paudel,Yuqian Fu,Luc Van Gool
Main category: cs.CV
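TL;DR: This position paper argues that the next generation of vision encoders should be image size agnostic and task driven. Inspired by the efficiency of biological vision, it proposes letting the task determine computational complexity.
Details
Motivation: Modern vision encoders typically tie computational complexity to image size, whereas biological vision allocates computation according to the task in order to be efficient. The paper aims to close this gap. Contribution: Argues that future vision encoders should be image size agnostic and task driven, and demonstrates feasibility with a preliminary proof of concept on image classification.
Method: Presents a proof-of-concept solution in which computational complexity depends on the task at hand rather than on the size of the input image.
Result: Preliminary experiments indicate that the approach is feasible and promising on image classification.
Insight: Vision encoders can become more efficient by imitating the task-driven behavior of biological systems, and future research should pay more attention to dynamic allocation of computation.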
Abstract: This position paper argues that the next generation of vision encoders should be image size agnostic and task driven. The source of our inspiration is biological. Not a structural aspect of biological vision, but a behavioral trait – efficiency. We focus on a couple of ways in which vision in nature is efficient, but modern vision encoders are not. We – humans and animals – deal with vast quantities of visual data, and need to be smart about where we focus our limited energy – it depends on the task. It is our belief that vision encoders should be dynamic and the computational complexity should depend on the task at hand rather than the size of the image. We also provide concrete first steps towards our vision – a proof-of-concept solution for image classification. Although classification is not very representative of what we are trying to achieve, it shows that our approach is feasible and promising.
[91] Attention Mechanism in Randomized Time Warping
Yutaro Hiraoka,Kazuya Okamura,Kota Suto,Kazuhiro Fukui
Main category: cs.CV
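TL;DR: The paper shows that Randomized Time Warping (RTW) is fundamentally related to the self-attention mechanism and reports experiments in which RTW outperforms a Transformer on motion recognition.
Details
Motivation: To examine the similarity between RTW and self-attention and to analyze how the two differ in performance on motion recognition. Contribution: Interprets the core function of RTW as a type of self-attention mechanism and identifies the advantage of RTW.
Method: Analyzes the correlation between RTW contribution weights and self-attention weights, and verifies the performance difference experimentally.
Result: RTW weights and self-attention weights have an average correlation of 0.80; on the Something-Something V2 dataset, RTW outperforms the Transformer by 5%.
Insight: RTW attends to the entire input sequence, which may give it an advantage over self-attention restricted to a local view.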
Abstract: This paper reveals that we can interpret the fundamental function of Randomized Time Warping (RTW) as a type of self-attention mechanism, a core technology of Transformers in motion recognition. The self-attention is a mechanism that enables models to identify and weigh the importance of different parts of an input sequential pattern. On the other hand, RTW is a general extension of Dynamic Time Warping (DTW), a technique commonly used for matching and comparing sequential patterns. In essence, RTW searches for optimal contribution weights for each element of the input sequential patterns to produce discriminative features. Although the two approaches look different, these contribution weights can be interpreted as self-attention weights. In fact, the two weight patterns look similar, producing a high average correlation of 0.80 across the ten smallest canonical angles. However, they work in different ways: RTW attention operates on an entire input sequential pattern, while self-attention focuses on only a local view which is a subset of the input sequential pattern because of the computational costs of the self-attention matrix. This targeting difference leads to an advantage of RTW against Transformer, as demonstrated by the 5% performance improvement on the Something-Something V2 dataset.
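For readers unfamiliar with the baseline that RTW extends, the following is a minimal sketch of classic Dynamic Time Warping (DTW) on 1-D sequences. It is not an implementation of RTW, whose randomized procedure is not described in this abstract; the absolute-difference cost is also an assumption chosen for simplicity.

```python
def dtw_distance(a, b):
    """Classic DTW distance between two 1-D sequences.

    cost[i][j] holds the minimum cumulative cost of aligning the first i
    elements of `a` with the first j elements of `b`.
    """
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])  # assumed local cost
            # extend the cheapest of: insertion, deletion, or match
            cost[i][j] = local + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]
```

Because DTW may repeat elements while aligning, `dtw_distance([1, 2, 3], [1, 2, 2, 3])` is zero even though the sequences differ in length.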
[92] A Lightweight Group Multiscale Bidirectional Interactive Network for Real-Time Steel Surface Defect Detection
Yong Zhang,Cunjian Chen,Qiang Gao,Yi Wang,Bin Fang
Main category: cs.CV
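TL;DR: Proposes GMBINet, a lightweight method for real-time steel surface defect detection whose new modules improve multiscale feature extraction and interaction, delivering high speed with competitive accuracy.
Details
Motivation: Steel manufacturing urgently needs real-time defect detection, but existing deep learning methods have high computational complexity and slow inference, which makes them hard to deploy in resource-constrained industrial environments. Contribution: Designs the GMBINet framework and proposes the Group Multiscale Bidirectional Interactive (GMBI) module and the Bidirectional Progressive Feature Interactor (BPFI) for efficient multiscale feature extraction and cross-scale interaction.
Method: Uses a group-wise strategy for multiscale feature extraction so that computational complexity does not grow with the number of scales, and combines BPFI with the parameter-free Element-Wise Multiplication-Summation (EWMS) operation to strengthen cross-scale interaction without extra computational overhead.
Result: On the SD-Saliency-900 and NRSD-MN datasets, GMBINet reaches 1048 FPS on GPU and 16.53 FPS on CPU at 512 resolution with only 0.19 M parameters, while maintaining competitive accuracy.
Insight: Lightweight design and efficient feature interaction are key to real-time detection in industrial settings; group-wise strategies and parameter-free operations help balance speed and accuracy.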
Abstract: Real-time surface defect detection is critical for maintaining product quality and production efficiency in the steel manufacturing industry. Despite promising accuracy, existing deep learning methods often suffer from high computational complexity and slow inference speeds, which limit their deployment in resource-constrained industrial environments. Recent lightweight approaches adopt multibranch architectures based on depthwise separable convolution (DSConv) to capture multiscale contextual information. However, these methods often suffer from increased computational overhead and lack effective cross-scale feature interaction, limiting their ability to fully leverage multiscale representations. To address these challenges, we propose GMBINet, a lightweight framework that enhances multiscale feature extraction and interaction through novel Group Multiscale Bidirectional Interactive (GMBI) modules. The GMBI adopts a group-wise strategy for multiscale feature extraction, ensuring scale-agnostic computational complexity. It further integrates a Bidirectional Progressive Feature Interactor (BPFI) and a parameter-free Element-Wise Multiplication-Summation (EWMS) operation to enhance cross-scale interaction without introducing additional computational overhead. Experiments on SD-Saliency-900 and NRSD-MN datasets demonstrate that GMBINet delivers competitive accuracy with real-time speeds of 1048 FPS on GPU and 16.53 FPS on CPU at 512 resolution, using only 0.19 M parameters. Additional evaluations on the NEU-CLS defect classification dataset further confirm the strong generalization ability of our method, demonstrating its potential for broader industrial vision applications beyond surface defect detection. The dataset and code are publicly available at: https://github.com/zhangyongcode/GMBINet.
[93] SAMFusion: Sensor-Adaptive Multimodal Fusion for 3D Object Detection in Adverse Weather
Edoardo Palladin,Roland Dietze,Praveen Narayanan,Mario Bijelic,Felix Heide
Main category: cs.CV
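TL;DR: SAMFusion is a multimodal sensor fusion method tailored to adverse weather. It combines RGB, LiDAR, NIR gated camera, and radar data through depth-aware attention and refinement on the BEV plane, improving object detection for autonomous driving in extreme weather.
Details
Motivation: Existing multimodal fusion methods perform poorly in adverse weather, causing autonomous driving systems to fail in conditions such as heavy fog, snow, or sensor soiling. Contribution: Proposes a multimodal sensor fusion framework adapted to adverse weather that integrates RGB, LiDAR, NIR gated camera, and radar data, with a fusion mechanism based on attention and depth.
Method: Fuses multimodal data through attentive, depth-based blending with learned refinement on the BEV plane, and predicts detections with a transformer decoder that weighs modalities according to distance and visibility.
Result: For vulnerable pedestrians at long distances in challenging foggy scenes, average precision improves by 17.2 AP over the next best method.
Insight: Multimodal fusion in adverse weather benefits from weighting sensors dynamically and from adding further sensor types, such as NIR gated cameras and radar, to improve robustness.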
Abstract: Multimodal sensor fusion is an essential capability for autonomous robots, enabling object detection and decision-making in the presence of failing or uncertain inputs. While recent fusion methods excel in normal environmental conditions, these approaches fail in adverse weather, e.g., heavy fog, snow, or obstructions due to soiling. We introduce a novel multi-sensor fusion approach tailored to adverse weather conditions. In addition to fusing RGB and LiDAR sensors, which are employed in recent autonomous driving literature, our sensor fusion stack is also capable of learning from NIR gated camera and radar modalities to tackle low light and inclement weather. We fuse multimodal sensor data through attentive, depth-based blending schemes, with learned refinement on the Bird’s Eye View (BEV) plane to combine image and range features effectively. Our detections are predicted by a transformer decoder that weighs modalities based on distance and visibility. We demonstrate that our method improves the reliability of multimodal sensor fusion in autonomous vehicles under challenging weather conditions, bridging the gap between ideal conditions and real-world edge cases. Our approach improves average precision by 17.2 AP compared to the next best method for vulnerable pedestrians in long distances and challenging foggy scenes. Our project page is available at https://light.princeton.edu/samfusion/
[94] HAMSt3R: Human-Aware Multi-view Stereo 3D Reconstruction
Sara Rojas,Matthieu Armando,Bernard Ghanem,Philippe Weinzaepfel,Vincent Leroy,Gregory Rogez
Main category: cs.CV
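TL;DR: HAMSt3R is a learning-based multi-view stereo 3D reconstruction method focused on joint human and scene reconstruction. It combines scene geometry with human understanding and adds network heads that segment people, estimate dense correspondences, and predict depth, producing a dense 3D point map enriched with human semantic information. The method is efficient and fully feed-forward, making it suitable for real-world applications.
Details
Motivation: Existing learning-based multi-view stereo methods such as DUSt3R and MASt3R are trained primarily on outdoor scenes with static environments and struggle with human-centric scenarios. HAMSt3R aims to fill this gap with efficient joint human and scene reconstruction. Contribution: 1) Proposes HAMSt3R, an extension of MASt3R for joint human and scene 3D reconstruction; 2) uses the DUNE image encoder to combine scene geometry with human understanding; 3) adds network heads to segment people, estimate dense correspondences, and predict depth.
Method: 1) Uses the DUNE image encoder, distilled from, among others, the encoders of MASt3R and multi-HMR; 2) adds network heads that segment people, estimate dense correspondences via DensePose, and predict depth; 3) produces a dense 3D point map enriched with human semantics.
Result: Performs well on the challenging EgoHumans and EgoExo4D benchmarks, and generalizes to traditional multi-view stereo and multi-view pose regression tasks.
Insight: By combining scene and human understanding, HAMSt3R achieves efficient joint human and scene reconstruction and suggests a new way to fuse human semantics with scene geometry in 3D vision.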
Abstract: Recovering the 3D geometry of a scene from a sparse set of uncalibrated images is a long-standing problem in computer vision. While recent learning-based approaches such as DUSt3R and MASt3R have demonstrated impressive results by directly predicting dense scene geometry, they are primarily trained on outdoor scenes with static environments and struggle to handle human-centric scenarios. In this work, we introduce HAMSt3R, an extension of MASt3R for joint human and scene 3D reconstruction from sparse, uncalibrated multi-view images. First, we exploit DUNE, a strong image encoder obtained by distilling, among others, the encoders from MASt3R and from a state-of-the-art Human Mesh Recovery (HMR) model, multi-HMR, for a better understanding of scene geometry and human bodies. Our method then incorporates additional network heads to segment people, estimate dense correspondences via DensePose, and predict depth in human-centric environments, enabling a more comprehensive 3D reconstruction. By leveraging the outputs of our different heads, HAMSt3R produces a dense point map enriched with human semantic information in 3D. Unlike existing methods that rely on complex optimization pipelines, our approach is fully feed-forward and efficient, making it suitable for real-world applications. We evaluate our model on EgoHumans and EgoExo4D, two challenging benchmarks containing diverse human-centric scenarios. Additionally, we validate its generalization to traditional multi-view stereo and multi-view pose regression tasks. Our results demonstrate that our method can reconstruct humans effectively while preserving strong performance in general 3D reconstruction tasks, bridging the gap between human and scene understanding in 3D vision.
[95] HOSt3R: Keypoint-free Hand-Object 3D Reconstruction from RGB images
Anilkumar Swamy,Vincent Leroy,Philippe Weinzaepfel,Jean-Sébastien Franco,Grégory Rogez
Main category: cs.CV
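TL;DR: HOSt3R is a keypoint detector-free method for hand-object 3D reconstruction. It estimates hand and object 3D transformations from monocular motion video and combines them with multi-view reconstruction to recover shape, performing well on the SHOWMe and HO3D datasets.
Details
Motivation: Existing methods rely on keypoint detection techniques such as SfM, which are sensitive to diverse object geometries, weak textures, and occlusions, limiting scalability and generalization. HOSt3R aims to solve these problems. Contribution: Proposes a keypoint detector-free approach that estimates hand-object 3D transformations directly from monocular video and combines it with multi-view reconstruction to recover shape. The method needs neither pre-scanned object templates nor camera intrinsics.
Method: Combines a transformation estimation method that does not depend on a keypoint detector with a multi-view reconstruction pipeline to reconstruct hands and objects accurately in 3D.
Result: Reaches state-of-the-art performance on the SHOWMe benchmark and, in experiments on HO3D sequences, generalizes to unseen object categories.
Insight: Keypoint detector-free methods cope better with geometric diversity and occlusion, offering a more general solution for hand-object 3D reconstruction.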
Abstract: Hand-object 3D reconstruction has become increasingly important for applications in human-robot interaction and immersive AR/VR experiences. A common approach for object-agnostic hand-object reconstruction from RGB sequences involves a two-stage pipeline: hand-object 3D tracking followed by multi-view 3D reconstruction. However, existing methods rely on keypoint detection techniques, such as Structure from Motion (SfM) and hand-keypoint optimization, which struggle with diverse object geometries, weak textures, and mutual hand-object occlusions, limiting scalability and generalization. As a key enabler to generic and seamless, non-intrusive applicability, we propose in this work a robust, keypoint detector-free approach to estimating hand-object 3D transformations from monocular motion video/images. We further integrate this with a multi-view reconstruction pipeline to accurately recover hand-object 3D shape. Our method, named HOSt3R, is unconstrained, does not rely on pre-scanned object templates or camera intrinsics, and reaches state-of-the-art performance for the tasks of object-agnostic hand-object 3D transformation and shape estimation on the SHOWMe benchmark. We also experiment on sequences from the HO3D dataset, demonstrating generalization to unseen object categories.
[96] Arbitrary-Scale 3D Gaussian Super-Resolution
Huimin Zeng,Yue Bai,Yun Fu
Main category: cs.CV
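TL;DR: Proposes a framework for 3D Gaussian super-resolution at arbitrary scale factors. It removes the fixed-scale limitation of existing methods while avoiding the added complexity and reduced rendering efficiency of a post-processing upsampler.
Details
Motivation: Existing 3D Gaussian Splatting (3DGS) super-resolution methods only support high-resolution (HR) rendering at fixed scale factors, which limits their practicality in resource-limited scenarios. Rendering arbitrary-scale HR views directly with vanilla 3DGS introduces aliasing artifacts because it lacks scale-aware rendering, while adding a post-processing upsampler complicates the framework and lowers efficiency. Contribution: Proposes an integrated framework that combines scale-aware rendering, generative prior-guided optimization, and progressive super-resolving, so that a single 3D model supports super-resolution rendering at arbitrary scale factors, both integer and non-integer.
Method: The framework has three core components: 1) scale-aware rendering to avoid aliasing artifacts; 2) generative prior-guided optimization to improve super-resolution quality; 3) progressive super-resolving to add detail step by step.
Result: When rendering arbitrary-scale HR views, the method gains 6.59 dB PSNR over vanilla 3DGS while keeping real-time rendering speed (85 FPS at 1080p).
Insight: Scale-aware rendering removes the need for a post-processing step, improving flexibility and efficiency, while the generative prior and progressive optimization yield high-quality, structurally consistent super-resolution results.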
Abstract: Existing 3D Gaussian Splatting (3DGS) super-resolution methods typically perform high-resolution (HR) rendering of fixed scale factors, making them impractical for resource-limited scenarios. Directly rendering arbitrary-scale HR views with vanilla 3DGS introduces aliasing artifacts due to the lack of scale-aware rendering ability, while adding a post-processing upsampler for 3DGS complicates the framework and reduces rendering efficiency. To tackle these issues, we build an integrated framework that incorporates scale-aware rendering, generative prior-guided optimization, and progressive super-resolving to enable 3D Gaussian super-resolution of arbitrary scale factors with a single 3D model. Notably, our approach supports both integer and non-integer scale rendering to provide more flexibility. Extensive experiments demonstrate the effectiveness of our model in rendering high-quality arbitrary-scale HR views (6.59 dB PSNR gain over 3DGS) with a single model. It preserves structural consistency with LR views and across different scales, while maintaining real-time rendering speed (85 FPS at 1080p).
[97] Seeing Clearly, Forgetting Deeply: Revisiting Fine-Tuned Video Generators for Driving Simulation
Chun-Peng Chang,Chen-Yu Wang,Julian Schmidt,Holger Caesar,Alain Pagani
Main category: cs.CV
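TL;DR: This paper studies fine-tuned video generation models for driving simulation. It finds that gains in visual fidelity can come at the cost of spatial accuracy for dynamic elements, and it shows that simple continual learning strategies offer a balanced alternative.
Details
Motivation: Recent video generation techniques have made substantial progress in visual quality and temporal coherence, but when they are applied to domains such as driving simulation, fine-tuning may degrade the accuracy of dynamic modeling. The paper investigates this phenomenon and its causes. Contribution: (1) Reveals a trade-off between visual fidelity and dynamic accuracy when fine-tuning video generation models on driving scenes; (2) shows that continual learning strategies, such as replay from diverse domains, can balance the two.
Method: Analyzes the effect of fine-tuning on driving datasets to identify why dynamic modeling degrades, then shows experimentally that continual learning mitigates the problem.
Result: Continual learning strategies preserve spatial accuracy for dynamic elements while maintaining strong visual quality.
Insight: Visual quality and dynamic modeling are highly correlated in datasets with diverse scenes, but in highly regular driving scenes fine-tuning can push the model toward surface-level realism rather than dynamic accuracy. Continual learning offers an effective way to balance the two.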
Abstract: Recent advancements in video generation have substantially improved visual quality and temporal coherence, making these models increasingly appealing for applications such as autonomous driving, particularly in the context of driving simulation and so-called “world models”. In this work, we investigate the effects of existing fine-tuning video generation approaches on structured driving datasets and uncover a potential trade-off: although visual fidelity improves, spatial accuracy in modeling dynamic elements may degrade. We attribute this degradation to a shift in the alignment between visual quality and dynamic understanding objectives. In datasets with diverse scene structures within temporal space, where objects or perspective shift in varied ways, these objectives tend to be highly correlated. However, the very regular and repetitive nature of driving scenes allows visual quality to improve by modeling dominant scene motion patterns, without necessarily preserving fine-grained dynamic behavior. As a result, fine-tuning encourages the model to prioritize surface-level realism over dynamic accuracy. To further examine this phenomenon, we show that simple continual learning strategies, such as replay from diverse domains, can offer a balanced alternative by preserving spatial accuracy while maintaining strong visual quality.
[98] Towards Open World Detection: A Survey
Andrei-Stefan Bulzan,Cosmin Cernazanu-Glavan
Main category: cs.CV
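TL;DR: The paper proposes the term "Open World Detection" (OWD) to unify class-agnostic, generally applicable detection models in the vision domain. It reviews the history, key concepts, and methods of vision subdomains and traces their convergence from early saliency detection to modern open world detection.
Details
Motivation: Early computer vision research focused on narrow tasks, but as the field progressed, increasingly complex perception tasks emerged. The paper explores how these tasks may converge and aims to encourage more general detection models. Contribution: Proposes Open World Detection (OWD) as a unifying umbrella term, summarizes the history, methods, and datasets of the related subdomains, and points to a possible unified direction for visual perception.
Method: A literature survey that traces the technical evolution from saliency detection to open world detection and analyzes the overlap and convergence between subdomains.
Result: The paper presents open world detection as a promising general perception task and identifies more unified visual perception models as a direction for future research.
Insight: Vision subtasks are gradually converging and may eventually form a single domain of perception; open world detection is a key step in that trend.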
Abstract: For decades, Computer Vision has aimed at enabling machines to perceive the external world. Initial limitations led to the development of highly specialized niches. As success in each task accrued and research progressed, increasingly complex perception tasks emerged. This survey charts the convergence of these tasks and, in doing so, introduces Open World Detection (OWD), an umbrella term we propose to unify class-agnostic and generally applicable detection models in the vision domain. We start from the history of foundational vision subdomains and cover key concepts, methodologies and datasets making up today’s state-of-the-art landscape. This traverses topics starting from early saliency detection, foreground/background separation, out of distribution detection and leading up to open world object detection, zero-shot detection and Vision Large Language Models (VLLMs). We explore the overlap between these subdomains, their increasing convergence, and their potential to unify into a singular domain in the future, perception.
[99] MV-RAG: Retrieval Augmented Multiview Diffusion
Yosef Dayani,Omer Benishu,Sagie Benaim
Main category: cs.CV
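TL;DR: MV-RAG is a retrieval-augmented multiview diffusion model for generating high-quality, consistent, and accurate 3D content, aimed in particular at out-of-domain (OOD) or rare concepts.
Details
Motivation: Existing text-to-3D methods rely on pretrained 2D diffusion priors but perform poorly on OOD or rare concepts. MV-RAG addresses this by retrieving relevant 2D images and conditioning a multiview diffusion model on them. Contribution: 1. Proposes MV-RAG, a text-to-3D method that combines retrieval with multiview diffusion; 2. designs a hybrid training strategy that bridges multiview data and real-world 2D images; 3. introduces a new OOD evaluation set.
Method: 1. Retrieves relevant images from a large 2D image database; 2. generates consistent multiview outputs with a multiview diffusion model conditioned on those images; 3. trains with a hybrid strategy that uses augmented conditioning views to simulate retrieval variance and a held-out view prediction objective.
Result: Experiments show that MV-RAG significantly improves 3D consistency, photorealism, and text adherence for OOD and rare concepts over existing methods, while remaining competitive on standard benchmarks.
Insight: Combining retrieval with multiview diffusion improves 3D generation quality, especially for complex or rare concepts.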
Abstract: Text-to-3D generation approaches have advanced significantly by leveraging pretrained 2D diffusion priors, producing high-quality and 3D-consistent outputs. However, they often fail to produce out-of-domain (OOD) or rare concepts, yielding inconsistent or inaccurate results. To this end, we propose MV-RAG, a novel text-to-3D pipeline that first retrieves relevant 2D images from a large in-the-wild 2D database and then conditions a multiview diffusion model on these images to synthesize consistent and accurate multiview outputs. Training such a retrieval-conditioned model is achieved via a novel hybrid strategy bridging structured multiview data and diverse 2D image collections. This involves training on multiview data using augmented conditioning views that simulate retrieval variance for view-specific reconstruction, alongside training on sets of retrieved real-world 2D images using a distinctive held-out view prediction objective: the model predicts the held-out view from the other views to infer 3D consistency from 2D data. To facilitate a rigorous OOD evaluation, we introduce a new collection of challenging OOD prompts. Experiments against state-of-the-art text-to-3D, image-to-3D, and personalization baselines show that our approach significantly improves 3D consistency, photorealism, and text adherence for OOD/rare concepts, while maintaining competitive performance on standard benchmarks.
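The retrieval step in this abstract can be pictured with a minimal nearest-neighbor sketch. The abstract does not say which embedding model or similarity measure MV-RAG uses, so cosine similarity over precomputed embedding vectors, the function names, and the dictionary-based database are all assumptions for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def retrieve_top_k(query_embedding, database, k):
    """Return the names of the k database images most similar to the query.

    `database` is a hypothetical mapping from image name to its embedding.
    """
    ranked = sorted(
        database.items(),
        key=lambda item: cosine(query_embedding, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]
```

In a pipeline of this kind, the retrieved images would then be passed as conditioning inputs to the multiview diffusion model.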
[100] Interpreting the linear structure of vision-language model embedding spaces
Isabel Papadimitriou,Huangyuan Su,Thomas Fel,Sham Kakade,Stephanie Gil
Main category: cs.CV
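TL;DR: The paper uses sparse autoencoders (SAEs) to analyze the linear structure of vision-language model embedding spaces and finds sparse concepts that bridge cross-modal semantics.
Details
Motivation: To study how vision-language models organize language and images in a joint embedding space, and how they encode meaning and modality.
Contribution: 1. Trains and releases SAEs for four vision-language models, revealing a sparse linear structure in their embedding spaces. 2. Introduces the Bridge Score, which quantifies how pairs of concepts cooperate across modalities.
Method: SAEs are trained on the embedding spaces of CLIP, SigLIP, SigLIP2, and AIMv2, and the Bridge Score measures the cross-modal association between concept pairs.
Result: SAEs reconstruct the embeddings well while retaining sparsity; the Bridge Score reveals semantic associations between cross-modal concept pairs.
Insight: The linear structure of the embedding space is shaped by modality, yet cross-modal semantics are integrated through latent bridges.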
Abstract: Vision-language models encode images and text in a joint space, minimizing the distance between corresponding image and text pairs. How are language and images organized in this joint space, and how do the models encode meaning and modality? To investigate this, we train and release sparse autoencoders (SAEs) on the embedding spaces of four vision-language models (CLIP, SigLIP, SigLIP2, and AIMv2). SAEs approximate model embeddings as sparse linear combinations of learned directions, or “concepts”. We find that, compared to other methods of linear feature learning, SAEs are better at reconstructing the real embeddings, while also able to retain the most sparsity. Retraining SAEs with different seeds or different data diet leads to two findings: the rare, specific concepts captured by the SAEs are liable to change drastically, but we also show that commonly-activating concepts are remarkably stable across runs. Interestingly, while most concepts activate primarily for one modality, we find they are not merely encoding modality per se. Many are almost orthogonal to the subspace that defines modality, and the concept directions do not function as good modality classifiers, suggesting that they encode cross-modal semantics. To quantify this bridging behavior, we introduce the Bridge Score, a metric that identifies concept pairs which are both co-activated across aligned image-text inputs and geometrically aligned in the shared space. This reveals that even single-modality concepts can collaborate to support cross-modal integration. We release interactive demos of the SAEs for all models, allowing researchers to explore the organization of the concept spaces. Overall, our findings uncover a sparse linear structure within VLM embedding spaces that is shaped by modality, yet stitched together through latent bridges, offering new insight into how multimodal meaning is constructed.
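The abstract describes the Bridge Score as picking out concept pairs that are both co-activated on aligned image-text inputs and geometrically aligned in the shared space. The sketch below is one plausible way to combine those two ingredients, not the paper's definition: it assumes the score for a pair (i, j) is the mean co-activation of concept i on images and concept j on the paired texts, multiplied by the cosine similarity of the two concept directions. The function name `bridge_score` and its arguments are hypothetical.

```python
import numpy as np

def bridge_score(z_img, z_txt, directions):
    """Illustrative bridge score (assumed form, not the paper's exact formula).

    z_img, z_txt: (n_pairs, n_concepts) non-negative SAE activations, where
                  row k of each array comes from the same image-text pair.
    directions:   (n_concepts, dim) learned concept directions.
    Returns an (n_concepts, n_concepts) matrix; entry (i, j) is large only if
    concept i (on images) and concept j (on texts) fire together AND point in
    similar directions.
    """
    z_img = np.asarray(z_img, dtype=float)
    z_txt = np.asarray(z_txt, dtype=float)
    d = np.asarray(directions, dtype=float)

    # Mean co-activation across aligned pairs.
    coact = z_img.T @ z_txt / z_img.shape[0]

    # Cosine similarity between concept directions.
    unit = d / np.linalg.norm(d, axis=1, keepdims=True)
    cosine = unit @ unit.T

    return coact * cosine
```

Under this assumed form, two concepts that always fire together but have orthogonal directions score zero, which mirrors the requirement that a bridge be both co-activated and geometrically aligned.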
cs.RO
[101] GelSLAM: A Real-time, High-Fidelity, and Robust 3D Tactile SLAM System
Hung-Jui Huang,Mohammad Amin Mirzaee,Michael Kaess,Wenzhen Yuan
Main category: cs.RO
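TL;DR: GelSLAM is a real-time 3D SLAM system that relies solely on tactile sensing for high-precision object pose estimation and shape reconstruction.
Details
Motivation: Compared with vision-based methods, tactile sensing offers advantages in precision and immunity to occlusion, especially when tracking and reconstructing objects in contact.
Contribution: 1. A real-time 3D SLAM system that relies solely on tactile sensing. 2. Robust tracking and loop closure from tactile-derived surface normals and curvatures. 3. Submillimeter reconstruction accuracy, including on low-texture objects.
Method: GelSLAM uses surface normals and curvatures derived from a tactile sensor, rather than a traditional point-cloud approach, for pose estimation and shape reconstruction, and tracks object motion in real time.
Result: The system tracks in real time with low error and minimal drift, and reconstructs shapes with submillimeter accuracy even for low-texture objects such as wooden tools.
Insight: Tactile sensing is not limited to local contact; it can be extended to global, long-horizon spatial perception, which opens a path for high-precision manipulation tasks such as in-hand object interaction.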
Abstract: Accurately perceiving an object’s pose and shape is essential for precise grasping and manipulation. Compared to common vision-based methods, tactile sensing offers advantages in precision and immunity to occlusion when tracking and reconstructing objects in contact. This makes it particularly valuable for in-hand and other high-precision manipulation tasks. In this work, we present GelSLAM, a real-time 3D SLAM system that relies solely on tactile sensing to estimate object pose over long periods and reconstruct object shapes with high fidelity. Unlike traditional point cloud-based approaches, GelSLAM uses tactile-derived surface normals and curvatures for robust tracking and loop closure. It can track object motion in real time with low error and minimal drift, and reconstruct shapes with submillimeter accuracy, even for low-texture objects such as wooden tools. GelSLAM extends tactile sensing beyond local contact to enable global, long-horizon spatial perception, and we believe it will serve as a foundation for many precise manipulation tasks involving interaction with objects in hand. The video demo is available on our website: https://joehjhuang.github.io/gelslam.
cs.LO
[102] Lean Meets Theoretical Computer Science: Scalable Synthesis of Theorem Proving Challenges in Formal-Informal Pairs
Terry Jingchen Zhang,Wenyuan Jiang,Rongchuan Liu,Yisong Wang,Junran Yang,Ning Wang,Nicole Ni,Yinya Huang,Mrinmaya Sachan
Main category: cs.LO
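TL;DR: The paper uses theoretical computer science (TCS) to generate formal theorem-proving challenges at scale and demonstrates their value for automated reasoning research.
Details
Motivation: Limitations of current formal theorem proving (FTP) datasets (high curation cost, scarcity) hold back the evaluation of LLM reasoning, so a scalable source of challenging problems is needed.
Contribution: Proposes a framework that uses TCS to generate formal-informal theorem-proving pairs and applies it to Busy Beaver and Mixed Boolean Arithmetic problems, offering a new way to generate datasets for automated reasoning research.
Method: Algorithmic definitions are used to automatically generate problems with parallel formal (Lean4) and informal (Markdown) specifications, forming a scalable pipeline for verified proof challenges.
Result: A frontier model, DeepSeekProver-V2-671B, reaches a 57.5% success rate on Busy Beaver problems but only 12% on Mixed Boolean Arithmetic problems, showing how hard long-form proof generation remains.
Insight: Long-form proof generation remains very challenging for models even on problems that are computationally easy to verify, and TCS is a rich resource for automated reasoning research.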
Abstract: Formal theorem proving (FTP) has emerged as a critical foundation for evaluating the reasoning capabilities of large language models, enabling automated verification of mathematical proofs at scale. However, progress has been constrained by limited datasets due to the high cost of manual curation and the scarcity of challenging problems with verified formal-informal correspondences. We propose leveraging theoretical computer science (TCS) as a scalable source of rigorous proof problems, where algorithmic definitions enable automated generation of arbitrarily many challenging theorem-proof pairs. We demonstrate this approach on two TCS domains: Busy Beaver problems, which involve proving bounds on Turing machine halting behavior, and Mixed Boolean Arithmetic problems, which combine logical and arithmetic reasoning. Our framework automatically synthesizes problems with parallel formal (Lean4) and informal (Markdown) specifications, creating a scalable pipeline for generating verified proof challenges. Evaluation on frontier models reveals substantial gaps in automated theorem proving: while DeepSeekProver-V2-671B achieves 57.5% success on Busy Beaver problems, it manages only 12% on Mixed Boolean Arithmetic problems. These results highlight the difficulty of long-form proof generation even for problems that are computationally easy to verify, demonstrating the value of TCS domains for advancing automated reasoning research.
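To make "computationally easy to verify" concrete, here is a minimal sketch of a Mixed Boolean Arithmetic identity checked by brute force. The identity x + y = (x XOR y) + 2·(x AND y) is a textbook example chosen for illustration; it is not taken from the paper's dataset, and the function name `mba_identity_holds` is hypothetical.

```python
def mba_identity_holds(bits: int = 8) -> bool:
    """Exhaustively check x + y == (x ^ y) + 2 * (x & y) on `bits`-wide words.

    XOR adds each bit position without carries, and AND marks the positions
    that produce a carry, so shifting the AND left by one (multiplying by 2)
    restores the carries. Results are masked to model fixed-width arithmetic.
    """
    mask = (1 << bits) - 1
    for x in range(1 << bits):
        for y in range(1 << bits):
            lhs = (x + y) & mask
            rhs = ((x ^ y) + 2 * (x & y)) & mask
            if lhs != rhs:
                return False
    return True
```

Checking all 65,536 pairs of 8-bit values takes a fraction of a second, whereas a formal proof has to argue about carries for every bit width at once, which is the gap between verification and long-form proof that the abstract highlights.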
cs.CY
[103] PediatricsMQA: a Multi-modal Pediatrics Question Answering Benchmark
Adil Bahaj,Mounir Ghogho
Main category: cs.CY
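TL;DR: The paper introduces PediatricsMQA, a new multi-modal pediatric question-answering benchmark aimed at the systematic age bias of large language models and vision-augmented language models on pediatric tasks.
Details
Motivation: Existing medical foundation models perform worse on pediatric tasks, which reflects the under-representation of pediatrics in medical research. The authors build a multi-modal pediatric QA benchmark to address this bias and promote equity in pediatric AI.
Contribution: The main contribution is PediatricsMQA, which contains 3,417 text-based multiple-choice questions and 2,067 vision-based multiple-choice questions, covering 131 pediatric topics and a range of imaging modalities.
Method: The dataset is built with a hybrid manual-automatic pipeline that draws on peer-reviewed pediatric literature, validated question banks, and existing resources.
Result: Experiments show that the performance of existing models drops sharply for younger cohorts, underlining the need for age-aware methods.
Insight: The paper exposes age bias in medical AI and stresses the importance of tailoring models to pediatric tasks.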
Abstract: Large language models (LLMs) and vision-augmented LLMs (VLMs) have significantly advanced medical informatics, diagnostics, and decision support. However, these models exhibit systematic biases, particularly age bias, compromising their reliability and equity. This is evident in their poorer performance on pediatric-focused text and visual question-answering tasks. This bias reflects a broader imbalance in medical research, where pediatric studies receive less funding and representation despite the significant disease burden in children. To address these issues, a new comprehensive multi-modal pediatric question-answering benchmark, PediatricsMQA, has been introduced. It consists of 3,417 text-based multiple-choice questions (MCQs) covering 131 pediatric topics across seven developmental stages (prenatal to adolescent) and 2,067 vision-based MCQs using 634 pediatric images from 67 imaging modalities and 256 anatomical regions. The dataset was developed using a hybrid manual-automatic pipeline, incorporating peer-reviewed pediatric literature, validated question banks, existing benchmarks, and existing QA resources. Evaluating state-of-the-art open models, we find dramatic performance drops in younger cohorts, highlighting the need for age-aware methods to ensure equitable AI support in pediatric care.
cs.LG
[104] RotaTouille: Rotation Equivariant Deep Learning for Contours
Odin Hoff Gardaa,Nello Blaser
Main category: cs.LG
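TL;DR: RotaTouille is a deep learning framework for contour data that achieves equivariance to rotations and cyclic shifts through complex-valued circular convolution, and is demonstrated on shape classification, reconstruction, and regression tasks.
Details
Motivation: Contour data (such as closed curves) appears in many domains, and rotating the input usually rotates the output correspondingly. Models should therefore be equivariant to both rotations and cyclic shifts.
Contribution: Proposes the RotaTouille framework, which is equivariant to rotations and cyclic shifts, and introduces equivariant non-linearities, coarsening layers, and global pooling layers to obtain invariant representations.
Method: Equivariance is obtained through complex-valued circular convolution, combined with purpose-built non-linear and pooling layers for contour data.
Result: The effectiveness of RotaTouille is demonstrated on shape classification, reconstruction, and contour regression tasks.
Insight: Complex-valued representations and circular convolution handle the equivariance requirements of contour data efficiently and suggest an approach for similar tasks.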
Abstract: Contours or closed planar curves are common in many domains. For example, they appear as object boundaries in computer vision, isolines in meteorology, and the orbits of rotating machinery. In many cases when learning from contour data, planar rotations of the input will result in correspondingly rotated outputs. It is therefore desirable that deep learning models be rotationally equivariant. In addition, contours are typically represented as an ordered sequence of edge points, where the choice of starting point is arbitrary. It is therefore also desirable for deep learning methods to be equivariant under cyclic shifts. We present RotaTouille, a deep learning framework for learning from contour data that achieves both rotation and cyclic shift equivariance through complex-valued circular convolution. We further introduce and characterize equivariant non-linearities, coarsening layers, and global pooling layers to obtain invariant representations for downstream tasks. Finally, we demonstrate the effectiveness of RotaTouille through experiments in shape classification, reconstruction, and contour regression.
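The two equivariance properties named in the abstract can be checked numerically. The sketch below assumes each contour point (x, y) is encoded as the complex number x + iy, so that a planar rotation by θ is multiplication by e^{iθ}; it implements circular convolution with NumPy's FFT and is only an illustration of the underlying property, not the RotaTouille layer itself.

```python
import numpy as np

def circular_conv(z, w):
    """Complex-valued circular convolution of a contour z with a filter w.

    Both inputs are 1-D complex arrays of the same length; the convolution
    theorem turns the circular convolution into a pointwise product in the
    Fourier domain.
    """
    return np.fft.ifft(np.fft.fft(z) * np.fft.fft(w))

rng = np.random.default_rng(0)
n = 16
z = rng.normal(size=n) + 1j * rng.normal(size=n)  # toy contour, x + iy per point
w = rng.normal(size=n) + 1j * rng.normal(size=n)  # toy complex filter

rot = np.exp(1j * 0.7)  # rotation of the plane by 0.7 rad
shift = 5               # change of starting point along the contour

# Rotation equivariance: rotating the input rotates the output.
rotation_ok = np.allclose(circular_conv(rot * z, w), rot * circular_conv(z, w))

# Cyclic-shift equivariance: shifting the input shifts the output.
shift_ok = np.allclose(
    circular_conv(np.roll(z, shift), w), np.roll(circular_conv(z, w), shift)
)
```

Rotation equivariance follows from linearity over the complex numbers, and shift equivariance from the circular structure of the convolution. A real-valued convolution applied separately to the x and y coordinates would keep the shift property but would not mix the coordinates the way a complex filter does.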
[105] TinyML Towards Industry 4.0: Resource-Efficient Process Monitoring of a Milling Machine
Tim Langer,Matthias Widra,Volkhard Beyer
Main category: cs.LG
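TL;DR: The paper presents a complete TinyML flow for resource-efficient process monitoring of an industrial milling machine, achieving high accuracy at low energy cost with a quantized CNN.
Details
Motivation: To offer a smart retrofit path for long-serving machines in Industry 4.0; TinyML is a good fit because of its resource efficiency.
Contribution: Creates the novel MillingVibes dataset and develops an 8-bit-quantized CNN that needs only 12.59 KiB of parameter storage, reaches 100% test accuracy, and consumes very little energy.
Method: A complete TinyML flow covering dataset generation, model development, and deployment of the preprocessing and classification pipeline on a microcontroller.
Result: On an ARM Cortex M4F microcontroller, the model achieves 15.4 ms inference time, 1.462 mJ per inference, and 100% test accuracy.
Insight: TinyML has strong potential for industrial process monitoring, and quantization is key to balancing resources against performance.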
Abstract: In the context of industry 4.0, long-serving industrial machines can be retrofitted with process monitoring capabilities for future use in a smart factory. One possible approach is the deployment of wireless monitoring systems, which can benefit substantially from the TinyML paradigm. This work presents a complete TinyML flow from dataset generation, to machine learning model development, up to implementation and evaluation of a full preprocessing and classification pipeline on a microcontroller. After a short review on TinyML in industrial process monitoring, the creation of the novel MillingVibes dataset is described. The feasibility of a TinyML system for structure-integrated process quality monitoring could be shown by the development of an 8-bit-quantized convolutional neural network (CNN) model with 12.59kiB parameter storage. A test accuracy of 100.0% could be reached at 15.4ms inference time and 1.462mJ per quantized CNN inference on an ARM Cortex M4F microcontroller, serving as a reference for future TinyML process monitoring solutions.
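The abstract reports an 8-bit-quantized CNN but does not spell out the quantization scheme. As a rough illustration only, the sketch below assumes symmetric per-tensor quantization, a common choice in which one floating-point scale maps weights onto the integer range [-127, 127]. The helper names `quantize_int8` and `dequantize` are hypothetical.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor int8 quantization (assumed scheme, for illustration).

    Returns the int8 weights and the float scale such that w ~= q * scale.
    """
    w = np.asarray(w, dtype=np.float32)
    max_abs = float(np.max(np.abs(w)))
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Map int8 weights back to approximate float32 values."""
    return q.astype(np.float32) * scale
```

With this scheme each weight costs one byte, and the rounding error per weight is at most half the scale. If the reported 12.59 KiB were made up purely of int8 weights, that would correspond to roughly 12.59 × 1024 ≈ 12,900 parameters; the actual breakdown is not stated in the abstract.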
[106] PGF-Net: A Progressive Gated-Fusion Framework for Efficient Multimodal Sentiment Analysis
Bin Wen,Tien-Ping Tan
Main category: cs.LG
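TL;DR: PGF-Net is a novel multimodal sentiment analysis framework that combines a progressive gated-fusion mechanism with a parameter-efficient tuning strategy to deliver a high-performing yet lightweight model.
Details
Motivation: Multimodal sentiment analysis needs efficient cross-modal fusion with low computational overhead so that it can run in resource-limited scenarios.
Contribution: 1. Proposes a Progressive Intra-Layer Fusion paradigm; 2. Designs an Adaptive Gated Arbitration mechanism; 3. Adopts a hybrid parameter-efficient fine-tuning strategy (LoRA and Post-Fusion Adapters).
Method: Cross-Attention dynamically fuses textual and non-linguistic features, and is combined with adaptive gating and a hybrid PEFT strategy in a hierarchical encoder architecture.
Result: On the MOSI dataset, the model reaches an MAE of 0.691 and an F1-Score of 86.9% with only 3.09M trainable parameters.
Insight: Progressive fusion and dynamic gating improve both performance and interpretability, while the hybrid PEFT strategy substantially reduces the number of trainable parameters.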
Abstract: We introduce PGF-Net (Progressive Gated-Fusion Network), a novel deep learning framework designed for efficient and interpretable multimodal sentiment analysis. Our framework incorporates three primary innovations. Firstly, we propose a Progressive Intra-Layer Fusion paradigm, where a Cross-Attention mechanism empowers the textual representation to dynamically query and integrate non-linguistic features from audio and visual streams within the deep layers of a Transformer encoder. This enables a deeper, context-dependent fusion process. Secondly, the model incorporates an Adaptive Gated Arbitration mechanism, which acts as a dynamic controller to balance the original linguistic information against the newly fused multimodal context, ensuring stable and meaningful integration while preventing noise from overwhelming the signal. Lastly, a hybrid Parameter-Efficient Fine-Tuning (PEFT) strategy is employed, synergistically combining global adaptation via LoRA with local refinement through Post-Fusion Adapters. This significantly reduces trainable parameters, making the model lightweight and suitable for resource-limited scenarios. These innovations are integrated into a hierarchical encoder architecture, enabling PGF-Net to perform deep, dynamic, and interpretable multimodal sentiment analysis while maintaining exceptional parameter efficiency. Experimental results on MOSI dataset demonstrate that our proposed PGF-Net achieves state-of-the-art performance, with a Mean Absolute Error (MAE) of 0.691 and an F1-Score of 86.9%. Notably, our model achieves these results with only 3.09M trainable parameters, showcasing a superior balance between performance and computational efficiency.
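The first two ideas in the abstract, text representations querying audio or visual features through cross-attention and a gate balancing the original text against the fused context, can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions: the learned query/key/value projections are omitted, the gate is a single sigmoid layer over the concatenated inputs, and the names `cross_attention` and `gated_fusion` are hypothetical rather than taken from PGF-Net.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(text, other):
    """Text tokens (T, d) attend over another modality's features (S, d).

    Learned projections are left out for brevity; text acts as the queries
    and `other` as both keys and values.
    """
    d = text.shape[-1]
    weights = softmax(text @ other.T / np.sqrt(d))  # (T, S), rows sum to 1
    return weights @ other                          # (T, d)

def gated_fusion(text, fused, w_gate, b_gate):
    """Blend original text features with fused context via a sigmoid gate.

    w_gate: (2d, d), b_gate: (d,). A gate near 1 keeps the text signal,
    a gate near 0 lets the multimodal context through.
    """
    logits = np.concatenate([text, fused], axis=-1) @ w_gate + b_gate
    gate = 1.0 / (1.0 + np.exp(-logits))
    return gate * text + (1.0 - gate) * fused
```

With a strongly positive gate bias the output collapses back to the text features, which is the sense in which a gate can stop noisy audio or visual context from overwhelming the linguistic signal.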
[107] AgentFly: Fine-tuning LLM Agents without Fine-tuning LLMs
Huichi Zhou,Yihang Chen,Siyuan Guo,Xue Yan,Kin Hei Lee,Zihan Wang,Ka Yiu Lee,Guchun Zhang,Kun Shao,Linyi Yang,Jun Wang
Main category: cs.LG
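TL;DR: The paper proposes AgentFly, a new learning paradigm that needs no fine-tuning of the underlying large language model (LLM) and achieves efficient continual adaptation through memory-based online reinforcement learning.
Details
Motivation: Existing approaches either rely on static, handcrafted reflection workflows or require computationally expensive gradient updates of LLM parameters, so they cannot adapt continually at low cost.
Contribution: Formalises a Memory-augmented Markov Decision Process (M-MDP) with a neural case-selection policy and memory read/write mechanisms for efficient learning.
Method: Memory-based reinforcement learning in which case selection and memory reading and rewriting update the policy dynamically, without fine-tuning the LLM.
Result: AgentFly reaches 87.88% Pass@3 on GAIA validation and 79.40% on the test set, and 66.6% F1 and 80.4% PM on the DeepResearcher dataset.
Insight: The method offers a scalable path for LLM agents to learn in real time through memory, without gradient updates.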
Abstract: In this paper, we introduce a novel learning paradigm for adaptive Large Language Model (LLM) agents that eliminates the need for fine-tuning the underlying LLMs. Existing approaches are often either rigid, relying on static, handcrafted reflection workflows, or computationally intensive, requiring gradient updates of LLM model parameters. In contrast, our method enables low-cost continual adaptation via memory-based online reinforcement learning. We formalise this as a Memory-augmented Markov Decision Process (M-MDP), equipped with a neural case-selection policy to guide action decisions. Past experiences are stored in an episodic memory, either differentiable or non-parametric. The policy is continually updated based on environmental feedback through a memory rewriting mechanism, whereas policy improvement is achieved through efficient memory reading (retrieval). We instantiate our agent model in the deep research setting, namely AgentFly, which attains top-1 on GAIA validation (87.88% Pass@3) and 79.40% on the test set. It reaches 66.6% F1 and 80.4% PM on the DeepResearcher dataset, outperforming the state-of-the-art training-based method, while case-based memory adds 4.7% to 9.6% absolute points on out-of-distribution tasks. Our approach offers a scalable and efficient pathway for developing generalist LLM agents capable of continuous, real-time learning without gradient updates, advancing machine learning towards open-ended skill acquisition and deep research scenarios. The code is available at https://github.com/Agent-on-the-Fly/AgentFly.
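The memory read and write operations described in the abstract can be illustrated with a small non-parametric episodic memory. This is a sketch under stated assumptions, not AgentFly's implementation: the paper's neural case-selection policy is replaced here by plain cosine-similarity retrieval, and the class `CaseMemory`, its methods, and the stored fields are all hypothetical.

```python
import numpy as np

class CaseMemory:
    """Toy episodic memory: store past cases, retrieve the most similar ones.

    Retrieval uses cosine similarity between embeddings, standing in for the
    learned case-selection policy described in the abstract.
    """

    def __init__(self):
        self.keys = []   # unit-normalised embeddings of past task states
        self.cases = []  # (case description, reward) tuples

    def write(self, embedding, case, reward):
        """Append a new experience together with its observed reward."""
        v = np.asarray(embedding, dtype=float)
        self.keys.append(v / np.linalg.norm(v))
        self.cases.append((case, reward))

    def read(self, query, k=2):
        """Return the k stored cases whose embeddings are closest to the query."""
        if not self.keys:
            return []
        q = np.asarray(query, dtype=float)
        q = q / np.linalg.norm(q)
        sims = np.stack(self.keys) @ q
        top = np.argsort(-sims)[:k]
        return [self.cases[i] for i in top]
```

A usage example: after `memory.write([1, 0], "plan A", 1.0)` and `memory.write([0, 1], "plan B", 0.0)`, calling `memory.read([1, 0], k=1)` returns the "plan A" case, which the agent could then place in the LLM's context. Adaptation happens by changing what is stored and retrieved, so the LLM's weights never receive a gradient update.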
[108] Retrieval Enhanced Feedback via In-context Neural Error-book
Jongyeop Hyun,Bumsoo Kim
Main category: cs.LG
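TL;DR: The paper proposes REFINE, a framework that improves the reasoning of multimodal large language models through retrieval-enhanced feedback and structured error analysis.
Details
Motivation: Existing methods lack a systematic way to analyze and mitigate errors, particularly in multimodal large language models, where integrating visual and textual information adds complexity.
Contribution: Proposes the REFINE framework, which delivers targeted feedback through three structured queries (Feed-Target, Feed-Check, and Feed-Path) and improves inference efficiency and scalability.
Method: A teacher-student framework systematically structures errors into a Neural Error-book and uses retrieval-enhanced feedback to improve multimodal reasoning.
Result: Experiments show that REFINE yields a substantial speedup, lowers computational cost, and generalizes well.
Insight: Structured error analysis and targeted feedback are key to better multimodal reasoning, and optimized retrieval noticeably improves efficiency.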
Abstract: Recent advancements in Large Language Models (LLMs) have significantly improved reasoning capabilities, with in-context learning (ICL) emerging as a key technique for adaptation without retraining. While previous works have focused on leveraging correct examples, recent research highlights the importance of learning from errors to enhance performance. However, existing methods lack a structured framework for analyzing and mitigating errors, particularly in Multimodal Large Language Models (MLLMs), where integrating visual and textual inputs adds complexity. To address this issue, we propose REFINE: Retrieval-Enhanced Feedback via In-context Neural Error-book, a teacher-student framework that systematically structures errors and provides targeted feedback. REFINE introduces three systematic queries to construct structured feedback – Feed-Target, Feed-Check, and Feed-Path – to enhance multimodal reasoning by prioritizing relevant visual information, diagnosing critical failure points, and formulating corrective actions. Unlike prior approaches that rely on redundant retrievals, REFINE optimizes structured feedback retrieval, improving inference efficiency, token usage, and scalability. Our results demonstrate substantial speedup, reduced computational costs, and successful generalization, highlighting REFINE’s potential for enhancing multimodal reasoning.
[109] FLAMES: Improving LLM Math Reasoning via a Fine-Grained Analysis of the Data Synthesis Pipeline
Parker Seegmiller,Kartik Mehta,Soumya Saha,Chenyang Tao,Shereen Oraby,Arpit Gupta,Tagyoung Chung,Mohit Bansal,Nanyun Peng
Main category: cs.LG
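TL;DR: The paper introduces FLAMES, a framework for systematically analyzing and optimizing synthesis strategies for math reasoning data. It finds that balancing difficulty and diversity matters, designs new data synthesis strategies, and significantly improves performance on several math benchmarks.
Details
Motivation: Prior work that improves LLM math reasoning with synthetic data uses incomparable setups, so the role of individual factors in the synthesis pipeline (such as filtering low-quality problems) remains unclear.
Contribution: 1. Proposes the FLAMES framework for systematic analysis of data synthesis strategies; 2. Finds that the balance between difficulty and diversity is key; 3. Designs two new synthesis strategies that improve out-of-domain generalization and robustness; 4. Builds the FLAMES dataset, which clearly outperforms public datasets.
Method: FLAMES is used to study 10 existing data synthesis strategies and several other factors; based on the findings, two new strategies are designed and a blended dataset is built.
Result: The FLAMES dataset outperforms public datasets on several math benchmarks (such as OlympiadBench and MATH), and the fine-tuned model surpasses larger models such as Llama3 405B.
Insight: 1. Increasing problem complexity gives the best improvements on most math metrics; 2. With a fixed generation budget, higher problem coverage matters more than keeping only problems with reliable solutions; 3. Easy-to-hard generalization is clearly observed.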
Abstract: Recent works improving LLM math reasoning with synthetic data have used unique setups, making comparison of data synthesis strategies impractical. This leaves many unanswered questions about the roles of different factors in the synthetic data pipeline, such as the impact of filtering low-quality problems. To address this gap, we introduce FLAMES, a Framework for LLM Assessment of Math rEasoning Data Synthesis, and perform a systematic study of 10 existing data synthesis strategies and multiple other factors impacting the performance of synthetic math reasoning data. Our FLAMES experiments provide several valuable insights about the optimal balance of difficulty and diversity of synthetic data. First, data agents designed to increase problem complexity lead to best improvements on most math metrics. Second, with a fixed data generation budget, keeping higher problem coverage is more important than keeping only problems with reliable solutions. Third, GSM8K- and MATH-based synthetic data can lead to improvements on competition-level benchmarks, showcasing easy-to-hard generalization. Leveraging insights from our FLAMES experiments, we design two novel data synthesis strategies for improving out-of-domain generalization and robustness. Further, we develop the FLAMES dataset, an effective blend of our novel and existing data synthesis strategies, outperforming public datasets on OlympiadBench (+15.7), CollegeMath (+4.5), GSMPlus (+6.5), and MATH (+3.1). Fine-tuning Qwen2.5-Math-7B on the FLAMES dataset achieves 81.4% on MATH, surpassing larger Llama3 405B, GPT-4o and Claude 3.5 Sonnet.
cs.SE
[110] AetherCode: Evaluating LLMs’ Ability to Win In Premier Programming Competitions
Zihan Wang,Jiaze Chen,Zhicheng Liu,Markus Mak,Yidi Du,Geonsik Moon,Luoqi Xu,Aaron Tua,Kunshuo Peng,Jiayi Lu,Mingfei Xia,Boqian Zou,Chenyang Ran,Guang Tian,Shoutai Zhu,Yeheng Duan,Zhenghui Kang,Zhenxing Lin,Shangshu Li,Qiang Luo,Qingshen Long,Zhiyong Chen,Yihan Xiao,Yurong Wu,Daoguang Zan,Yuyi Fu,Mingxuan Wang,Ming Ding
Main category: cs.SE
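TL;DR: AetherCode is a new benchmark that evaluates the coding ability of large language models (LLMs) more accurately by using harder competitive programming problems, addressing the shortcomings of existing benchmarks.
Details
Motivation: Existing benchmarks understate the gap between LLMs and elite human programmers, mainly because their problems are not difficult enough and their test cases are of low quality.
Contribution: Presents the AetherCode benchmark, which pairs difficult competition problems with expert-validated test suites for more reliable evaluation.
Method: Problems are drawn from premier programming competitions such as IOI and ICPC, and test suites are built through a hybrid of automated generation and human curation.
Result: AetherCode provides a more faithful measure of LLM coding ability and sets a new standard for evaluating it.
Insight: Existing benchmarks may overstate LLM ability; more challenging evaluation is needed to reflect the true level.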
Abstract: Competitive programming has emerged as a critical benchmark for evaluating the reasoning and coding capabilities of Large Language Models (LLMs). Despite impressive progress on existing benchmarks, we argue that current evaluations overstate model proficiency, masking a substantial gap between LLMs and elite human programmers. This gap arises from two key limitations: insufficient difficulty and scope of benchmark problems, and evaluation bias from low-quality test cases. To address these shortcomings, we present AetherCode, a new benchmark that draws problems from premier programming competitions such as IOI and ICPC, offering broader coverage and higher difficulty. AetherCode further incorporates comprehensive, expert-validated test suites built through a hybrid of automated generation and human curation, ensuring rigorous and reliable assessment. By combining challenging problem design with robust evaluation, AetherCode provides a more faithful measure of LLM capabilities and sets a new standard for future research in code reasoning.
cs.HC
[111] Prompting with Sign Parameters for Low-resource Sign Language Instruction Generation
Md Tariquzzaman,Md Farhan Ishmam,Saiyma Sittul Muna,Md Kamrul Hasan,Hasan Mahmud
Main category: cs.HC
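TL;DR: The paper proposes an instruction generation approach for low-resource sign languages, introducing Sign Parameter-Infused (SPI) prompting to improve zero-shot performance and evaluating it on BdSLIG, a newly built Bengali sign language dataset.
Details
Motivation: Many sign languages remain under-resourced in AI, which limits communication for the deaf and hard-of-hearing community. The paper aims to help non-signers learn and interact by generating structured sign language learning instructions.
Contribution: 1. Builds BdSLIG, the first Bengali Sign Language Instruction Generation dataset; 2. Proposes SPI prompting, which improves the zero-shot performance of vision language models on low-resource sign language tasks.
Method: Standard sign parameters (such as hand shape, motion, and orientation) are injected directly into the textual prompt, so that the model produces structured instructions rather than free-form natural text. This makes the instructions more reproducible and easier to follow.
Result: Experiments on BdSLIG are reported to support SPI prompting, particularly on low-resource tasks and long-tail visual concepts.
Insight: Structured prompting such as SPI can improve model performance in low-resource domains and suggests improvements for similar tasks. Inclusivity and progress in sign language learning systems are the central goals of the work.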
Abstract: Sign Language (SL) enables two-way communication for the deaf and hard-of-hearing community, yet many sign languages remain under-resourced in the AI space. Sign Language Instruction Generation (SLIG) produces step-by-step textual instructions that enable non-SL users to imitate and learn SL gestures, promoting two-way interaction. We introduce BdSLIG, the first Bengali SLIG dataset, used to evaluate Vision Language Models (VLMs) (i) on under-resourced SLIG tasks, and (ii) on long-tail visual concepts, as Bengali SL is unlikely to appear in the VLM pre-training data. To enhance zero-shot performance, we introduce Sign Parameter-Infused (SPI) prompting, which integrates standard SL parameters, like hand shape, motion, and orientation, directly into the textual prompts. Subsuming standard sign parameters into the prompt makes the instructions more structured and reproducible than free-form natural text from vanilla prompting. We envision that our work would promote inclusivity and advancement in SL learning systems for the under-resourced communities.
eess.IV
[112] Cross-Attention Multimodal Fusion for Breast Cancer Diagnosis: Integrating Mammography and Clinical Data with Explainability
Muhaisin Tiyumba Nantogmah,Abdul-Barik Alhassan,Salamudeen Alhassan
Main category: eess.IV
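TL;DR: The paper proposes a cross-attention multimodal fusion method that combines mammograms with clinical data for breast cancer diagnosis and uses explainable AI to make the model more trustworthy.
Details
Motivation: Existing computer-aided diagnosis systems often rely on mammogram features alone and do not make full use of the valuable information in clinical data. The paper explores how to fuse clinical features with mammograms effectively and how explainability methods can improve model reliability.
Contribution: 1. Proposes a cross-attention multimodal fusion framework that integrates mammograms and clinical data; 2. Improves model transparency and trustworthiness through explainable AI; 3. Reports strong performance on public datasets (AUC-ROC 0.98, accuracy 0.96).
Method: The paper compares several multimodal fusion approaches, including feature concatenation, co-attention, and cross-attention, and settles on a cross-attention fusion method.
Result: On the TCGA and CBIS-DDSM datasets, the model reaches an AUC-ROC of 0.98, accuracy of 0.96, F1-score of 0.94, precision of 0.92, and recall of 0.95.
Insight: Clinical data clearly improves breast cancer classification, and cross-attention fuses the modalities effectively. Explainability methods give intuitive explanations of model decisions and increase clinical usefulness.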
Abstract: A precise assessment of breast lesion risk can greatly reduce that risk and help physicians choose the best course of action. To categorise breast lesions, the majority of current computer-aided systems use only characteristics from mammograms. Although this method is practical, it does not fully utilise the valuable information in clinical reports to attain the best results. Will clinical features greatly enhance the categorisation of breast lesions compared with using mammography alone? How can clinical features and mammograms be combined most effectively? In what ways can explainable AI approaches improve the interpretability and reliability of models used to diagnose breast cancer? A comprehensive investigation is needed to answer these basic questions. To integrate mammography and categorical clinical characteristics, this study examines a number of multimodal deep networks based on feature concatenation, co-attention, and cross-attention. The model achieved an AUC-ROC of 0.98, accuracy of 0.96, F1-score of 0.94, precision of 0.92, and recall of 0.95 when tested on publicly accessible datasets (TCGA and CBIS-DDSM).
[113] Decoding MGMT Methylation: A Step Towards Precision Medicine in Glioblastoma
Hafeez Ur Rehman,Sumaiya Fazal,Moutaz Alazab,Ali Baydoun
Main category: eess.IV
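TL;DR: The paper proposes CAMP, a convolutional autoencoder framework with adaptive sparse penalties that predicts MGMT gene methylation status to support personalized treatment of glioblastoma, and reports a marked gain in predictive accuracy.
Details
Motivation: Glioblastoma is highly aggressive and hard to treat. MGMT methylation status is a key biomarker for predicting treatment response, but predicting it accurately from non-invasive imaging remains difficult.
Contribution: Proposes the CAMP framework, which combines convolutional autoencoders with adaptive sparse penalties to predict MGMT methylation status with high accuracy and preserves intricate tissue and tumor structures in MRI image synthesis.
Method: 1. Generate synthetic MRI slices with a tailored autoencoder; 2. Predict MGMT methylation status with a convolutional neural network enhanced by adaptive sparse penalties, which adjust dynamically to variations in the data.
Result: On benchmark datasets, CAMP reaches an accuracy of 0.97, specificity of 0.98, and sensitivity of 0.97, significantly outperforming existing methods.
Insight: Adaptive sparse penalties cope well with contrast differences and tumor heterogeneity in MRI images and provide a new tool for precision medicine.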
Abstract: Glioblastomas, constituting over 50% of malignant brain tumors, are highly aggressive brain tumors that pose substantial treatment challenges due to their rapid progression and resistance to standard therapies. The methylation status of the O-6-Methylguanine-DNA Methyltransferase (MGMT) gene is a critical biomarker for predicting patient response to treatment, particularly with the alkylating agent temozolomide. However, accurately predicting MGMT methylation status using non-invasive imaging techniques remains challenging due to the complex and heterogeneous nature of glioblastomas, which includes uneven contrast, variability within lesions, and irregular enhancement patterns. This study introduces the Convolutional Autoencoders for MGMT Methylation Status Prediction (CAMP) framework, which is based on adaptive sparse penalties to enhance predictive accuracy. The CAMP framework operates in two phases: first, generating synthetic MRI slices through a tailored autoencoder that effectively captures and preserves intricate tissue and tumor structures across different MRI modalities; second, predicting MGMT methylation status using a convolutional neural network enhanced by adaptive sparse penalties. The adaptive sparse penalty dynamically adjusts to variations in the data, such as contrast differences and tumor locations in MR images. Our method excels in MRI image synthesis, preserving brain tissue, fat, and individual tumor structures across all MRI modalities. Validated on benchmark datasets, CAMP achieved an accuracy of 0.97, specificity of 0.98, and sensitivity of 0.97, significantly outperforming existing methods. These results demonstrate the potential of the CAMP framework to improve the interpretation of MRI data and contribute to more personalized treatment strategies for glioblastoma patients.
[114] Disentangled Multi-modal Learning of Histology and Transcriptomics for Cancer Characterization
Yupei Zhang,Xiaofei Wang,Anran Liu,Lequan Yu,Chao Li
Main category: eess.IV
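TL;DR: The paper proposes a disentangled multi-modal learning framework that combines histology and transcriptomics data, using decomposition and coordination strategies to tackle multi-modal heterogeneity, multi-scale integration, and dependence on paired data, and reports clear gains in cancer diagnosis and prognosis.
Details
Motivation: Existing methods are limited by multi-modal heterogeneity, insufficient multi-scale integration, and reliance on paired data, which restricts the clinical applicability of multi-modal learning.
Contribution: 1) Handles multi-modal heterogeneity with a disentangled multi-modal fusion module and a confidence-guided gradient coordination strategy; 2) Proposes an inter-magnification gene-expression consistency strategy to strengthen multi-scale integration; 3) Reduces dependence on paired data through a subspace knowledge distillation strategy; 4) Designs an informative token aggregation module to improve inference efficiency.
Method: A disentangled multi-modal learning framework that combines gradient coordination, multi-scale alignment, knowledge distillation, and token aggregation.
Result: The method outperforms existing approaches on cancer diagnosis, prognosis, and survival prediction tasks.
Insight: Disentangled learning and multi-modal coordination strategies can substantially improve the analysis and use of multi-modal data, which is especially valuable in medicine.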
Abstract: Histopathology remains the gold standard for cancer diagnosis and prognosis. With the advent of transcriptome profiling, multi-modal learning combining transcriptomics with histology offers more comprehensive information. However, existing multi-modal approaches are challenged by intrinsic multi-modal heterogeneity, insufficient multi-scale integration, and reliance on paired data, restricting clinical applicability. To address these challenges, we propose a disentangled multi-modal framework with four contributions: 1) To mitigate multi-modal heterogeneity, we decompose WSIs and transcriptomes into tumor and microenvironment subspaces using a disentangled multi-modal fusion module, and introduce a confidence-guided gradient coordination strategy to balance subspace optimization. 2) To enhance multi-scale integration, we propose an inter-magnification gene-expression consistency strategy that aligns transcriptomic signals across WSI magnifications. 3) To reduce dependency on paired data, we propose a subspace knowledge distillation strategy enabling transcriptome-agnostic inference through a WSI-only student model. 4) To improve inference efficiency, we propose an informative token aggregation module that suppresses WSI redundancy while preserving subspace semantics. Extensive experiments on cancer diagnosis, prognosis, and survival prediction demonstrate our superiority over state-of-the-art methods across multiple settings. Code is available at https://github.com/helenypzhang/Disentangled-Multimodal-Learning.
[115] A Disease-Centric Vision-Language Foundation Model for Precision Oncology in Kidney Cancer
Yuhui Tao,Zhongwei Zhao,Zilong Wang,Xufang Luo,Feng Chen,Kang Wang,Chuanfu Wu,Xue Zhang,Shaoting Zhang,Jiaxi Yao,Xingwei Jin,Xinyang Jiang,Yifan Yang,Dongsheng Li,Lili Qiu,Zhiqiang Shao,Jianming Guo,Nengwang Yu,Shuo Wang,Ying Xiong
Main category: eess.IV
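TL;DR: The paper proposes RenalCLIP, a vision-language foundation model for precision oncology in kidney cancer, which uses a two-stage pre-training strategy with contrastive learning and clearly improves performance on diagnostic and prognostic tasks.
Details
Motivation: Non-invasive assessment of renal masses is a critical challenge; diagnostic uncertainty frequently leads to overtreatment of benign or indolent tumors.
Contribution: Proposes the RenalCLIP model, which performs strongly across 10 core tasks, including anatomical assessment, diagnostic classification, and survival prediction, and shows remarkable data efficiency.
Method: A two-stage pre-training strategy: first enhance the image and text encoders with domain-specific knowledge, then align them through a contrastive learning objective.
Result: In the TCIA cohort, RenalCLIP reaches a C-index of 0.726 for recurrence-free survival prediction, about 20% above the leading baselines; it needs only 20% of the training data to match the peak performance of the baseline models.
Insight: RenalCLIP improves diagnostic and prognostic performance and also shows advantages in data efficiency and multi-task generalization.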
Abstract: The non-invasive assessment of increasingly incidentally discovered renal masses is a critical challenge in urologic oncology, where diagnostic uncertainty frequently leads to the overtreatment of benign or indolent tumors. In this study, we developed and validated RenalCLIP, a visual-language foundation model for the characterization, diagnosis, and prognosis of renal masses, using a dataset of 27,866 CT scans from 8,809 patients across nine Chinese medical centers and the public TCIA cohort. The model was developed via a two-stage pre-training strategy that first enhances the image and text encoders with domain-specific knowledge before aligning them through a contrastive learning objective, to create robust representations for superior generalization and diagnostic precision. RenalCLIP achieved better performance and superior generalizability across 10 core tasks spanning the full clinical workflow of kidney cancer, including anatomical assessment, diagnostic classification, and survival prediction, compared with other state-of-the-art general-purpose CT foundation models. In particular, for a complicated task such as recurrence-free survival prediction in the TCIA cohort, RenalCLIP achieved a C-index of 0.726, representing a substantial improvement of approximately 20% over the leading baselines. Furthermore, RenalCLIP’s pre-training imparted remarkable data efficiency; in the diagnostic classification task, it needs only 20% of the training data to achieve the peak performance of all baseline models even after they were fully fine-tuned on 100% of the data. Additionally, it achieved superior performance in report generation, image-text retrieval and zero-shot diagnosis tasks. Our findings establish that RenalCLIP provides a robust tool with the potential to enhance diagnostic accuracy, refine prognostic stratification, and personalize the management of patients with kidney cancer.
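For readers unfamiliar with the C-index quoted above, the sketch below implements the common Harrell-style concordance index for right-censored survival data. The abstract does not say which variant the authors used, so this is an assumption; the function name `concordance_index` is hypothetical and no library is required.

```python
def concordance_index(times, events, risks):
    """Harrell-style concordance index for right-censored survival data.

    times:  observed time for each patient (event or censoring time).
    events: 1 if the event (e.g. recurrence) was observed, 0 if censored.
    risks:  predicted risk score; higher should mean earlier event.

    A pair (i, j) is comparable when patient i had an observed event strictly
    before patient j's observed time. It is concordant when the model gives
    patient i the higher risk; tied risks count as half.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored patient cannot be the earlier member of a pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, so the reported 0.726 means that, under this definition, roughly 73% of comparable patient pairs would be ordered correctly by predicted risk.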
cs.AR [Back]
[116] ASIC-Agent: An Autonomous Multi-Agent System for ASIC Design with Benchmark Evaluation
Ahmed Allam,Youssef Mansour,Mohamed Shalan
Main category: cs.AR
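TL;DR: The paper presents ASIC-Agent, an autonomous multi-agent system built for digital ASIC design tasks. It addresses the limitations of LLMs in hardware design by combining specialized sub-agents with a sandbox environment, and it introduces ASIC-Agent-Bench, the first benchmark for evaluating agentic systems on hardware design tasks.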
Details
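Motivation: LLMs alone are limited in real-world RTL design workflows: they cannot execute code, lack debugging capabilities, and have no long-term memory. Contribution: Proposes the ASIC-Agent system, which combines a multi-agent architecture with a sandbox environment, and introduces ASIC-Agent-Bench, the first benchmark for agentic systems in hardware design tasks.
Method: A multi-agent architecture with sub-agents for RTL generation, verification, OpenLane hardening, and Caravel chip integration, backed by a vector database that stores documentation, API references, and error knowledge.
Result: ASIC-Agent successfully automated ASIC design tasks across varying levels of complexity, showing the potential to significantly accelerate the design workflow, particularly when powered by Claude 4 Sonnet.
Insight: Combining a multi-agent system with a sandbox environment and specialized tooling is an effective way to address the limitations of LLMs in hardware design.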
Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in Register Transfer Level (RTL) design, enabling high-quality code generation from natural language descriptions. However, LLMs alone face significant limitations in real-world hardware design workflows, including the inability to execute code, lack of debugging capabilities, and absence of long-term memory. To address these challenges, we present ASIC-Agent, an autonomous system designed specifically for digital ASIC design tasks. ASIC-Agent enhances base LLMs with a multi-agent architecture incorporating specialized sub-agents for RTL generation, verification, OpenLane hardening, and Caravel chip integration, all operating within a comprehensive sandbox environment with access to essential hardware design tools. The system leverages a vector database containing documentation, API references, error knowledge, and curated insights from the open-source silicon community. To evaluate ASIC-Agent’s performance, we introduce ASIC-Agent-Bench, the first benchmark specifically designed to assess agentic systems in hardware design tasks. We evaluate ASIC-Agent with various base LLMs, providing quantitative comparisons and qualitative insights into agent behavior across different design scenarios. Our results demonstrate that ASIC-Agent, when powered by Claude 4 Sonnet, successfully automates a broad range of ASIC design tasks spanning varying levels of complexity, showing the potential of significantly accelerating the ASIC design workflow.
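The staged workflow described above (RTL generation, verification, hardening, chip integration) can be pictured as a pipeline in which a failed verification feeds back into generation. The sketch below is only an illustration of that control flow: every function is a hypothetical stub, no real LLM, OpenLane, or Caravel call is made, and the retry policy is an assumption rather than the authors' design.

```python
def run_pipeline(spec, generate, verify, later_stages, max_attempts=3):
    """Generate RTL, retry on verification failure, then run later stages."""
    feedback = ""
    for attempt in range(1, max_attempts + 1):
        rtl = generate(spec, feedback)
        ok, feedback = verify(rtl)
        if ok:
            artifact = rtl
            log = [f"rtl+verify (attempt {attempt})"]
            for name, stage in later_stages:
                artifact = stage(artifact)
                log.append(name)
            return artifact, log
    raise RuntimeError(f"verification failed after {max_attempts} attempts")


# Hypothetical stubs standing in for the sub-agents.
def stub_generate(spec, feedback):
    # Pretend the generator fixes its output once it receives feedback.
    suffix = "fixed" if feedback else "buggy"
    return f"module {spec}; // {suffix}"


def stub_verify(rtl):
    if "buggy" in rtl:
        return False, "testbench failed"
    return True, ""


def stub_harden(rtl):
    return rtl + " -> hardened"


def stub_integrate(layout):
    return layout + " -> integrated"


if __name__ == "__main__":
    stages = [("harden", stub_harden), ("integrate", stub_integrate)]
    print(run_pipeline("counter", stub_generate, stub_verify, stages))
```

Passing the stages in as callables keeps the orchestration logic separate from the tools, which is one plausible way to swap base LLMs or tool backends.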
[117] Hardwired-Neurons Language Processing Units as General-Purpose Cognitive Substrates
Yang Liu,Yi Chen,Yongwei Zhao,Yifan Hao,Zifu Zheng,Weihao Kong,Zhangmai Li,Dongchen Jiang,Ruiyang Xia,Zhihong Ma,Zisheng Liu,Zhaoyong Wan,Yunqi Lu,Ximing Liu,Hongrui Guo,Zhihao Yang,Zhe Wang,Tianrui Ma,Mo Zou,Rui Zhang,Ling Li,Xing Hu,Zidong Du,Zhiwei Xu,Qi Guo,Tianshi Chen,Yunji Chen
Main category: cs.AR
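TL;DR: The paper proposes the Hardwired-Neurons Language Processing Unit (HNLPU), which physically hardwires LLM weight parameters into the hardware to greatly improve computational efficiency, and uses the Metal-Embedding methodology to make the approach economically viable.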
Details
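Motivation: The energy consumption of LLM inference systems keeps growing, which calls for specialized, energy-efficient language processing units. Contribution: Proposes HNLPU and the novel Metal-Embedding methodology, which sharply reduces photomask cost while improving density and energy efficiency.
Method: Physically hardwires LLM weight parameters into the computational fabric and uses Metal-Embedding to embed the weights in the 3D topology of metal wires, which reduces photomask cost.
Result: HNLPU reached 36 tokens/J and 249,960 tokens/s, far exceeding the GPU and WSE baselines, with 8.57x cost-effectiveness and a 230x carbon footprint reduction compared to H100 clusters.
Insight: Hardware specialization combined with 3D metal-wire embedding can substantially improve the energy efficiency and economics of LLM inference, suggesting a new direction for dedicated language processing units.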
Abstract: The rapid advancement of Large Language Models (LLMs) has established language as a core general-purpose cognitive substrate, driving the demand for specialized Language Processing Units (LPUs) tailored for LLM inference. To overcome the growing energy consumption of LLM inference systems, this paper proposes a Hardwired-Neurons Language Processing Unit (HNLPU), which physically hardwires LLM weight parameters into the computational fabric, achieving several orders of magnitude computational efficiency improvement by extreme specialization. However, a significant challenge still lies in the scale of modern LLMs. An ideal estimation on hardwiring gpt-oss 120 B requires fabricating at least 6 billion dollars of photomask sets, rendering the straightforward solution economically impractical. Addressing this challenge, we propose the novel Metal-Embedding methodology. Instead of embedding weights in a 2D grid of silicon device cells, Metal-Embedding embeds weight parameters into the 3D topology of metal wires. This brings two benefits: (1) a 15x increase in density, and (2) 60 out of 70 layers of photomasks are made homogeneous across chips, including all EUV photomasks. In total, Metal-Embedding reduced the photomask cost by 112x, bringing the Non-Recurring Engineering (NRE) cost of HNLPU into an economically viable range. Experimental results show that HNLPU achieved 249,960 tokens/s (5,555x/85x of GPU/WSE), 36 tokens/J (1,047x/283x of GPU/WSE), 13,232 mm² total die area (29% inscribed rectangular area in a 300 mm wafer), $184M estimated NRE at 5 nm technology. Analysis shows that HNLPU achieved 8.57x cost-effectiveness and 230x carbon footprint reduction compared to H100 clusters, under an annual weight updating assumption.
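A few quantities follow from the reported figures by simple arithmetic. The sketch below derives them under stated assumptions: that throughput and energy efficiency were measured at the same operating point, and that "inscribed rectangular area" refers to the largest square that fits inside the 300 mm wafer circle. Neither assumption is confirmed by the abstract, and the derived values are not figures reported by the authors.

```python
# Figures quoted in the abstract.
TOKENS_PER_SECOND = 249_960
TOKENS_PER_JOULE = 36
GPU_THROUGHPUT_RATIO = 5_555   # HNLPU throughput relative to the GPU baseline
GPU_EFFICIENCY_RATIO = 1_047   # HNLPU tokens/J relative to the GPU baseline
DIE_AREA_MM2 = 13_232
WAFER_DIAMETER_MM = 300


def implied_power_watts(tokens_per_second, tokens_per_joule):
    """(tokens/s) / (tokens/J) = J/s = W."""
    return tokens_per_second / tokens_per_joule


def inscribed_square_area_mm2(diameter_mm):
    """Largest square inside a circle has side d / sqrt(2), so area d^2 / 2."""
    return diameter_mm ** 2 / 2


power_w = implied_power_watts(TOKENS_PER_SECOND, TOKENS_PER_JOULE)
gpu_tokens_per_second = TOKENS_PER_SECOND / GPU_THROUGHPUT_RATIO
gpu_tokens_per_joule = TOKENS_PER_JOULE / GPU_EFFICIENCY_RATIO
area_fraction = DIE_AREA_MM2 / inscribed_square_area_mm2(WAFER_DIAMETER_MM)

if __name__ == "__main__":
    print(f"implied power draw:      {power_w:.0f} W")
    print(f"implied GPU throughput:  {gpu_tokens_per_second:.1f} tokens/s")
    print(f"implied GPU efficiency:  {gpu_tokens_per_joule:.4f} tokens/J")
    print(f"die area / inscribed sq: {area_fraction:.1%}")
```

Under the inscribed-square reading, the area fraction comes out at about 29%, consistent with the figure quoted in the abstract.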
cs.CR [Back]
[118] Unveiling Unicode’s Unseen Underpinnings in Undermining Authorship Attribution
Robert Dilworth
Main category: cs.CR
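TL;DR: The paper examines how users of public communication channels can still be identified from the text they write (stylometric analysis) even after taking anonymization measures, and proposes a counter-strategy that uses Unicode steganography.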
Details
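Motivation: Even when users take many anonymization measures, the text itself can expose their identity through stylometry, so adversarial counter-strategies are needed to protect privacy. Contribution: Dissects the technique of stylometry, discusses adversarial stylometry as a counter-strategy, and devises enhancements through Unicode steganography.
Method: Analyzes stylometric techniques and proposes a Unicode-steganography-based approach that hides or alters textual features to counter authorship attribution.
Result: The paper shows how Unicode steganography can hide or alter stylistic features in text to counter stylometric analysis.
Insight: Even the most careful anonymization can be undone by writing style, and Unicode steganography offers one possible new way to protect privacy.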
Abstract: When using a public communication channel – whether formal or informal, such as commenting or posting on social media – end users have no expectation of privacy: they compose a message and broadcast it for the world to see. Even if an end user takes utmost precautions to anonymize their online presence – using an alias or pseudonym; masking their IP address; spoofing their geolocation; concealing their operating system and user agent; deploying encryption; registering with a disposable phone number or email; disabling non-essential settings; revoking permissions; and blocking cookies and fingerprinting – one obvious element still lingers: the message itself. Assuming they avoid lapses in judgment or accidental self-exposure, there should be little evidence to validate their actual identity, right? Wrong. The content of their message – necessarily open for public consumption – exposes an attack vector: stylometric analysis, or author profiling. In this paper, we dissect the technique of stylometry, discuss an antithetical counter-strategy in adversarial stylometry, and devise enhancements through Unicode steganography.
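The abstract does not say which Unicode features the paper relies on. As a generic illustration of Unicode steganography, the sketch below hides a payload in a cover text using two zero-width code points; the choice of U+200B and U+200C, the bit encoding, and all function names are assumptions made for this example, not the author's technique.

```python
ZERO = "\u200b"  # ZERO WIDTH SPACE, encodes bit 0
ONE = "\u200c"   # ZERO WIDTH NON-JOINER, encodes bit 1


def embed(cover: str, secret: str) -> str:
    """Insert the secret, as invisible characters, after the first visible character."""
    bits = "".join(f"{byte:08b}" for byte in secret.encode("utf-8"))
    hidden = "".join(ONE if bit == "1" else ZERO for bit in bits)
    return cover[:1] + hidden + cover[1:]


def extract(stego: str) -> str:
    """Recover the secret by reading back the zero-width characters."""
    bits = "".join("1" if ch == ONE else "0" for ch in stego if ch in (ZERO, ONE))
    data = bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
    return data.decode("utf-8")


def strip_hidden(stego: str) -> str:
    """Remove the zero-width characters, leaving only the visible text."""
    return "".join(ch for ch in stego if ch not in (ZERO, ONE))


if __name__ == "__main__":
    stego = embed("Hello world", "id7")
    print(len("Hello world"), len(stego))  # visible text is unchanged, length is not
    print(extract(stego))
```

The rendered text looks identical to the cover, but the character sequence differs, which is the property that both hiding schemes and detectors of such schemes depend on.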
q-bio.NC [Back]
[119] NeuroKoop: Neural Koopman Fusion of Structural-Functional Connectomes for Identifying Prenatal Drug Exposure in Adolescents
Badhan Mazumder,Aline Kotoski,Vince D. Calhoun,Dong Hye Ye
Main category: q-bio.NC
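TL;DR: NeuroKoop is a graph neural network-based framework that integrates structural and functional brain networks through neural Koopman operator-driven latent space fusion to identify prenatal drug exposure (PDE) in adolescents. It performs well on an adolescent cohort from the ABCD dataset and reveals salient structural-functional connections.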
Details
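Motivation: How prenatal exposure to psychoactive substances such as cannabis shapes adolescent brain organization remains unclear, and existing methods fail to fully exploit the complementary features of multimodal neuroimaging data, which limits both biological insight and predictive performance. Contribution: Proposes the NeuroKoop framework, which applies neural Koopman theory to the fusion of structural and functional brain networks, improves classification of PDE status, and reveals key neurodevelopmental effects.
Method: Built on graph neural networks and the Koopman operator, NeuroKoop unifies node embeddings from brain graphs based on source-based morphometry (SBM) and functional network connectivity (FNC), using latent space fusion for stronger representation learning.
Result: On the ABCD dataset, NeuroKoop outperformed relevant baselines and identified salient structural-functional connections associated with PDE.
Insight: Koopman theory shows promise for neuroimaging analysis: it can unify representations of structural and functional brain networks and offers a new perspective on the neurodevelopmental impact of prenatal drug exposure.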
Abstract: Understanding how prenatal exposure to psychoactive substances such as cannabis shapes adolescent brain organization remains a critical challenge, complicated by the complexity of multimodal neuroimaging data and the limitations of conventional analytic methods. Existing approaches often fail to fully capture the complementary features embedded within structural and functional connectomes, constraining both biological insight and predictive performance. To address this, we introduced NeuroKoop, a novel graph neural network-based framework that integrates structural and functional brain networks utilizing neural Koopman operator-driven latent space fusion. By leveraging Koopman theory, NeuroKoop unifies node embeddings derived from source-based morphometry (SBM) and functional network connectivity (FNC) based brain graphs, resulting in enhanced representation learning and more robust classification of prenatal drug exposure (PDE) status. Applied to a large adolescent cohort from the ABCD dataset, NeuroKoop outperformed relevant baselines and revealed salient structural-functional connections, advancing our understanding of the neurodevelopmental impact of PDE.
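The abstract does not describe how NeuroKoop parameterizes its Koopman operator. As background only, the sketch below illustrates the core idea of Koopman-style modeling: describe how embeddings evolve with a single linear operator, estimated from snapshot pairs by least squares (as in dynamic mode decomposition). The 2-dimensional embeddings, the operator `TRUE_K`, and all function names are illustrative assumptions, not the authors' method.

```python
# Hypothetical ground-truth operator acting on row vectors: x_next = x @ K.
TRUE_K = [[0.9, 0.2],
          [-0.1, 0.8]]


def step(x, K):
    """Advance a 2-d row vector by one application of K."""
    return [sum(x[a] * K[a][b] for a in range(2)) for b in range(2)]


def simulate(x0, K, n):
    """Return n snapshots starting from x0."""
    trajectory = [x0]
    for _ in range(n - 1):
        trajectory.append(step(trajectory[-1], K))
    return trajectory


def fit_linear_operator(snapshots):
    """Least-squares estimate K = (X^T X)^-1 X^T Y for 2-d snapshots,
    where X holds snapshots 0..n-2 and Y holds snapshots 1..n-1."""
    X, Y = snapshots[:-1], snapshots[1:]
    gram = [[sum(x[a] * x[b] for x in X) for b in range(2)] for a in range(2)]
    cross = [[sum(x[a] * y[b] for x, y in zip(X, Y)) for b in range(2)]
             for a in range(2)]
    det = gram[0][0] * gram[1][1] - gram[0][1] * gram[1][0]
    if abs(det) < 1e-12:
        raise ValueError("snapshots do not span the embedding space")
    gram_inv = [[gram[1][1] / det, -gram[0][1] / det],
                [-gram[1][0] / det, gram[0][0] / det]]
    return [[sum(gram_inv[a][k] * cross[k][b] for k in range(2))
             for b in range(2)] for a in range(2)]


if __name__ == "__main__":
    snapshots = simulate([1.0, 0.5], TRUE_K, 10)
    print(fit_linear_operator(snapshots))
```

In a learned setting the embeddings would come from a neural encoder, so that nonlinear relationships in the raw data become approximately linear in the latent space; here the raw state stands in for that embedding to keep the example short.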
cs.AI [Back]
[120] Modular Embedding Recomposition for Incremental Learning
Aniello Panariello,Emanuele Frascaroli,Pietro Buzzega,Lorenzo Bonicelli,Angelo Porrello,Simone Calderara
Main category: cs.AI
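TL;DR: The paper proposes MoDular Embedding Recomposition (MoDER), which trains multiple textual experts and composes them at inference time to improve the zero-shot classification ability of pre-trained vision-language models (VLMs) in incremental learning.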
Details
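Motivation: Pre-trained vision-language models (VLMs) show strong zero-shot classification ability in continual learning (CL), but fine-tuning is still needed when downstream tasks deviate significantly from the pre-training domain. Prior approaches focus on preserving the zero-shot capabilities of VLMs; this paper goes further and enhances them through modular embedding recomposition. Contribution: Proposes MoDER, which trains multiple textual experts, each specialized in a single seen class, and composes them at inference time to synthesize refined prototypes that improve zero-shot classification.
Method: MoDER uses a modular framework that trains multiple textual experts and stores them in a foundational hub. At inference time, for each unseen class, it queries the hub and composes the retrieved experts into a refined prototype. The method is evaluated on two zero-shot incremental protocols, Class-IL and MTIL.
Result: Experiments spanning 14 datasets show that MoDER is effective and improves the zero-shot classification performance of VLMs in incremental learning.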
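Insight: Recomposing class-specific experts turns the preservation of zero-shot capabilities into their enhancement, offering an efficient new direction for continual learning.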
Abstract: The advent of pre-trained Vision-Language Models (VLMs) has significantly transformed Continual Learning (CL), mainly due to their zero-shot classification abilities. Such proficiency makes VLMs well-suited for real-world applications, enabling robust performance on novel unseen classes without requiring adaptation. However, fine-tuning remains essential when downstream tasks deviate significantly from the pre-training domain. Prior CL approaches primarily focus on preserving the zero-shot capabilities of VLMs during incremental fine-tuning on a downstream task. We take a step further by devising an approach that transforms preservation into enhancement of the zero-shot capabilities of VLMs. Our approach, named MoDular Embedding Recomposition (MoDER), introduces a modular framework that trains multiple textual experts, each specialized in a single seen class, and stores them in a foundational hub. At inference time, for each unseen class, we query the hub and compose the retrieved experts to synthesize a refined prototype that improves classification. We show the effectiveness of our method across two popular zero-shot incremental protocols, Class-IL and MTIL, comprising a total of 14 datasets. The codebase is available at https://github.com/aimagelab/mammoth.
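The query-and-compose step at inference time can be sketched as follows. The abstract does not specify how experts are retrieved or combined, so this example makes several assumptions: the hub is a dictionary keyed by seen-class name, retrieval uses cosine similarity against a per-class key embedding, and composition is a softmax-weighted average of the top-k expert vectors added to the unseen class's base text embedding. All names and values are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def compose_prototype(query, hub, k=2, temperature=0.1):
    """Refine the text embedding of an unseen class using experts from the hub.

    query -- base text embedding of the unseen class
    hub   -- dict: seen-class name -> (key_embedding, expert_vector)
    """
    scored = sorted(
        ((cosine(query, key), expert) for key, expert in hub.values()),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]
    weights = [math.exp(score / temperature) for score, _ in scored]
    total = sum(weights)
    # Softmax-weighted average of the retrieved experts, per dimension.
    delta = [
        sum(w * expert[d] for w, (_, expert) in zip(weights, scored)) / total
        for d in range(len(query))
    ]
    return [q + d for q, d in zip(query, delta)]


if __name__ == "__main__":
    hub = {
        "cat": ([1.0, 0.0], [0.1, 0.0]),
        "dog": ([0.9, 0.1], [0.0, 0.2]),
        "car": ([0.0, 1.0], [5.0, 5.0]),
    }
    print(compose_prototype([1.0, 0.0], hub, k=2))
```

Restricting composition to the top-k most similar experts keeps unrelated seen classes (here `car`) from distorting the prototype of the unseen class.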
[121] Generative Foundation Model for Structured and Unstructured Electronic Health Records
Sonish Sivarajkumar,Hang Zhang,Yuelyu Ji,Maneesh Bilalpur,Xizhi Wu,Chenyu Li,Min Gu Kwak,Shyam Visweswaran,Yanshan Wang
Main category: cs.AI
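TL;DR: The paper presents Generative Deep Patient (GDP), a multimodal foundation model that combines structured and unstructured electronic health record (EHR) data to support both clinical prediction and high-quality clinical narrative generation.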
Details
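Motivation: EHR data is complex and heterogeneous, spanning structured and unstructured information, yet most current approaches serialize numeric EHR data into text, which risks losing temporal and quantitative detail. A model is needed that handles multimodal data in a unified way and supports multiple clinical tasks. Contribution: Proposes GDP, a multimodal foundation model that pairs a CNN-Transformer encoder with a LLaMA-based decoder to generate clinical narratives and predict clinical events.
Method: Two-stage training: 1) generative pretraining, including masked feature prediction and next time-step prediction; 2) multi-task fine-tuning for clinical prediction tasks.
Result: On MIMIC-IV, GDP reported heart failure AUROC = 0.923, type 2 diabetes AUROC = 0.817, and 30-day readmission AUROC = 0.627 for clinical prediction, and ROUGE-L = 0.135 and BERTScore-F1 = 0.545 for narrative generation.
Insight: A single multimodal foundation model can handle EHR data in a unified way, improving both clinical prediction and narrative generation, and may reduce hospital documentation workload.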
Abstract: Electronic health records (EHRs) are rich clinical data sources but complex repositories of patient data, spanning structured elements (demographics, vitals, lab results, codes), unstructured clinical notes and other modalities of data. Harnessing this heterogeneity is critical for improving patient outcomes. Recent advances in large language models (LLMs) have enabled foundation models that can learn from multiple data modalities and support clinical tasks. However, most current approaches simply serialize numeric EHR data into text, which risks losing temporal and quantitative detail. We introduce Generative Deep Patient (GDP), a multimodal foundation model that natively encodes structured EHR time-series via a CNN-Transformer encoder and fuses it with unstructured EHRs through cross-modal attention into a LLaMA-based decoder. GDP is trained in two stages: (1) generative pretraining, where it learns to produce clinical narratives from raw patient timelines while also performing masked feature prediction (MFP) and next time-step prediction (NTP) to capture temporal dynamics; and (2) multi-task fine-tuning for clinically meaningful predictions (e.g., heart failure, type 2 diabetes, 30-day readmission). In clinical prediction, GDP demonstrated superior performance on MIMIC-IV: heart failure AUROC = 0.923, type 2 diabetes AUROC = 0.817, and 30-day readmission AUROC = 0.627. For narrative generation, GDP achieved ROUGE-L = 0.135 and BERTScore-F1 = 0.545. In a blinded human evaluation, GDP-Instruct scored highest on faithfulness, fluency, and overall clinical utility, suggesting reduced hospital documentation workload without sacrificing accuracy. Our results demonstrate that a single multimodal foundation model can both predict clinically actionable events and generate high-quality clinical narratives. Furthermore, GDP’s flexible architecture can be extended to additional modalities.
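To make the ROUGE-L score reported for narrative generation concrete, the sketch below computes a ROUGE-L F1 from the longest common subsequence (LCS) of tokens. It assumes lowercased whitespace tokenization and the balanced F1 variant; the abstract does not state the tokenizer or the weighting the authors used, and the example sentences are invented.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L as the F1 of LCS-based precision and recall."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    generated = "patient discharged home today"
    reference = "the patient was discharged home"
    print(round(rouge_l_f1(generated, reference), 3))
```

Because the score rewards only in-order token overlap, clinically equivalent paraphrases can score low, which is one reason the abstract also reports BERTScore and a blinded human evaluation.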