Table of Contents

cs.CL

[1] Setting The Table with Intent: Intent-aware Schema Generation and Editing for Literature Review Tables

Vishakh Padmakumar,Joseph Chee Chang,Kyle Lo,Doug Downey,Aakanksha Naik

Main category: cs.CL

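TL;DR: This paper studies how large language models (LLMs) can generate and edit schemas for literature review tables, reducing ambiguity and improving efficiency. By synthesizing intents and conditioning generation on them, the authors substantially improve schema generation, and they propose a set of LLM-based editing techniques that further refine the generated schemas.

Motivation: The rapid growth of academic literature requires researchers to organize and compare documents efficiently. Existing schema generation work is limited by ambiguity in reference-based evaluation and by a lack of editing methods, so more explicit conditional generation and editing methods are needed.

Contribution: 1. An approach for augmenting unannotated table corpora with synthesized intents; 2. Evidence that intent-conditioned schema generation significantly improves performance; 3. Several LLM-based schema editing techniques that further refine generated schemas.

Method: 1. Synthesize intents and build a dataset for conditional generation; 2. Compare single-shot prompted LLM workflows with fine-tuned models; 3. Propose LLM-based schema editing techniques.

Result: Intent-conditioned schema generation significantly outperforms the baselines, and smaller open-weight models, once fine-tuned, are competitive with state-of-the-art prompted LLMs. The editing techniques further improve the generated schemas.

Insight: Conditioning generation (for example, on intent) is key to better schema generation; fine-tuned small models can approach the performance of prompted LLMs; editing tools are an important complement to the schema generation pipeline.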

Abstract: The increasing volume of academic literature makes it essential for researchers to organize, compare, and contrast collections of documents. Large language models (LLMs) can support this process by generating schemas defining shared aspects along which to compare papers. However, progress on schema generation has been slow due to: (i) ambiguity in reference-based evaluations, and (ii) lack of editing/refinement methods. Our work is the first to address both issues. First, we present an approach for augmenting unannotated table corpora with synthesized intents and apply it to create a dataset for studying schema generation conditioned on a given information need, thus reducing ambiguity. With this dataset, we show how incorporating table intents significantly improves baseline performance in reconstructing reference schemas. Next, we propose several LLM-based schema editing techniques. We start by comprehensively benchmarking several single-shot schema generation methods, including prompted LLM workflows and fine-tuned models, showing that smaller, open-weight models can be fine-tuned to be competitive with state-of-the-art prompted LLMs. Then we demonstrate that our editing techniques can further improve schemas generated by these methods.

[2] Mitigating Geospatial Knowledge Hallucination in Large Language Models: Benchmarking and Dynamic Factuality Aligning

Shengyuan Wang,Jie Feng,Tianhui Liu,Dan Pei,Yong Li

Main category: cs.CL

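TL;DR: The paper proposes a comprehensive evaluation framework for detecting geospatial knowledge hallucinations in large language models (LLMs), together with a dynamic factuality aligning method based on Kahneman-Tversky Optimization (KTO) that substantially reduces them.

Motivation: LLMs often produce incorrect or inconsistent representations of geospatial information (geospatial hallucinations), yet the systematic evaluation and mitigation of this problem remain largely unexplored.

Contribution: 1. An evaluation framework built on structured geospatial knowledge graphs; 2. A KTO-based dynamic factuality aligning method that mitigates geospatial hallucinations, improving performance by over 29.6% on the proposed benchmark.

Method: 1. Build the evaluation framework from knowledge graphs; 2. Apply a dynamic aligning method based on Kahneman-Tversky Optimization (KTO) to adjust model outputs.

Result: Experiments show that the benchmark and the method are effective in improving the trustworthiness of LLMs' geospatial knowledge, with a performance gain of over 29.6% on the proposed benchmark.

Insight: Geospatial hallucination is a key problem for LLMs; structured knowledge graphs and dynamic alignment can markedly improve model reliability and accuracy.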

Abstract: Large language models (LLMs) possess extensive world knowledge, including geospatial knowledge, which has been successfully applied to various geospatial tasks such as mobility prediction and social indicator prediction. However, LLMs often generate inaccurate geospatial knowledge, leading to geospatial hallucinations (incorrect or inconsistent representations of geospatial information) that compromise their reliability. While the phenomenon of general knowledge hallucination in LLMs has been widely studied, the systematic evaluation and mitigation of geospatial hallucinations remain largely unexplored. To address this gap, we propose a comprehensive evaluation framework for geospatial hallucinations, leveraging structured geospatial knowledge graphs for controlled assessment. Through extensive evaluation across 20 advanced LLMs, we uncover the hallucinations in their geospatial knowledge. Building on these insights, we introduce a dynamic factuality aligning method based on Kahneman-Tversky Optimization (KTO) to mitigate geospatial hallucinations in LLMs, leading to a performance improvement of over 29.6% on the proposed benchmark. Extensive experimental results demonstrate the effectiveness of our benchmark and learning algorithm in enhancing the trustworthiness of LLMs in geospatial knowledge and reasoning tasks.

[3] MOCHA: Are Code Language Models Robust Against Multi-Turn Malicious Coding Prompts?

Muntasir Wahed,Xiaona Zhou,Kiet A. Nguyen,Tianjiao Yu,Nirav Diwan,Gang Wang,Dilek Hakkani-Tür,Ismini Lourentzou

Main category: cs.CL

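TL;DR: The paper examines the robustness of code language models against multi-turn malicious coding prompts, proposes code decomposition attacks, and introduces the MOCHA benchmark. Experiments show that existing models are vulnerable in multi-turn settings and that fine-tuning on MOCHA markedly improves their defenses.

Motivation: As the code generation capabilities of LLMs improve, their robustness against malicious misuse remains underexplored, especially the safety risks in multi-turn conversations.

Contribution: 1. Code decomposition attacks, which split a malicious task across multiple conversational turns to evade safety filters; 2. The MOCHA benchmark, which evaluates model robustness against both single-turn and multi-turn malicious prompts; 3. Experiments showing that fine-tuning on MOCHA preserves coding ability while improving robustness against external attacks.

Method: 1. Design code decomposition attacks that break a malicious task into a series of seemingly benign subtasks over multiple turns; 2. Build the MOCHA benchmark, covering single-turn and multi-turn malicious prompt scenarios; 3. Fine-tune models to raise their rejection rate on malicious prompts.

Result: Existing models remain vulnerable to multi-turn malicious prompts. Fine-tuning on MOCHA raises rejection rates on external adversarial datasets by up to 32.4% without any additional supervision.

Insight: The complexity of multi-turn conversations can be exploited maliciously and existing safety measures are insufficient, but targeted fine-tuning can markedly improve a model's defenses.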

Abstract: Recent advancements in Large Language Models (LLMs) have significantly enhanced their code generation capabilities. However, their robustness against adversarial misuse, particularly through multi-turn malicious coding prompts, remains underexplored. In this work, we introduce code decomposition attacks, where a malicious coding task is broken down into a series of seemingly benign subtasks across multiple conversational turns to evade safety filters. To facilitate systematic evaluation, we introduce MOCHA, a large-scale benchmark designed to evaluate the robustness of code LLMs against both single-turn and multi-turn malicious prompts. Empirical results across open- and closed-source models reveal persistent vulnerabilities, especially under multi-turn scenarios. Fine-tuning on MOCHA improves rejection rates while preserving coding ability, and importantly, enhances robustness on external adversarial datasets with up to 32.4% increase in rejection rates without any additional supervision.

[4] HITSZ’s End-To-End Speech Translation Systems Combining Sequence-to-Sequence Auto Speech Recognition Model and Indic Large Language Model for IWSLT 2025 in Indic Track

Xuchen Wei,Yangxin Wu,Yaoyin Zhang,Henglyu Liu,Kehai Chen,Xuefeng Bai,Min Zhang

Main category: cs.CL

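TL;DR: HITSZ presents an end-to-end speech translation system that combines the Whisper automatic speech recognition model with Krutrim, an Indic-specialized large language model (LLM), for the IWSLT 2025 Indic track. Experiments show strong results for English-to-Indic and Indic-to-English translation, and the paper explores the potential and the challenges of the Chain-of-Thought (CoT) method.

Motivation: Improving translation quality is a central problem in low-resource Indic speech translation. The authors address it by combining a pre-trained speech recognition model with a large language model.

Contribution: An end-to-end speech translation system that combines the Whisper ASR model with the Krutrim LLM to improve translation quality between English and Indic languages, plus an exploration of the Chain-of-Thought method for translation.

Method: 1. Use the Whisper model for automatic speech recognition (ASR); 2. Combine it with Krutrim (an Indic-specialized LLM) for translation; 3. Experiment with the Chain-of-Thought (CoT) method to improve translation quality.

Result: The system achieves average BLEU scores of 28.88 for English-to-Indic directions and 27.86 for Indic-to-English directions. On successfully parsed outputs, the CoT method shows potential for significant gains (for example, a 13.84 BLEU increase for Tamil-to-English), but keeping the output format consistent remains a challenge.

Insight: 1. Combining ASR with an LLM is effective for low-resource translation; 2. The CoT method is promising but requires consistent output formatting; 3. Choosing a model specialized for the task (such as Krutrim) is key.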

Abstract: This paper presents HITSZ’s submission for the IWSLT 2025 Indic track, focusing on speech-to-text translation (ST) for English-to-Indic and Indic-to-English language pairs. To enhance translation quality in this low-resource scenario, we propose an end-to-end system integrating the pre-trained Whisper automated speech recognition (ASR) model with Krutrim, an Indic-specialized large language model (LLM). Experimental results demonstrate that our end-to-end system achieved average BLEU scores of $28.88$ for English-to-Indic directions and $27.86$ for Indic-to-English directions. Furthermore, we investigated the Chain-of-Thought (CoT) method. While this method showed potential for significant translation quality improvements on successfully parsed outputs (e.g. a $13.84$ BLEU increase for Tamil-to-English), we observed challenges in ensuring the model consistently adheres to the required CoT output format.
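
The format-adherence problem mentioned at the end of the abstract can be illustrated with a small parser. This is only a sketch under stated assumptions: the abstract does not describe the required CoT output format, so the `<translation>` tag and both helper names below are hypothetical and not taken from the submission.

```python
import re
from typing import List, Optional

# Hypothetical output format: the <translation> tag is an assumption for
# illustration; the abstract does not specify the actual CoT format.
FINAL_PATTERN = re.compile(r"<translation>(.*?)</translation>", re.DOTALL)


def extract_translation(model_output: str) -> Optional[str]:
    """Return the final translation if the output follows the assumed format, else None."""
    match = FINAL_PATTERN.search(model_output)
    if match is None:
        return None
    text = match.group(1).strip()
    return text or None


def parse_rate(outputs: List[str]) -> float:
    """Fraction of outputs from which a translation could be extracted."""
    if not outputs:
        return 0.0
    parsed = sum(extract_translation(o) is not None for o in outputs)
    return parsed / len(outputs)
```

Scoring only the outputs for which `extract_translation` succeeds mirrors the abstract's "successfully parsed outputs" caveat, while `parse_rate` makes the size of the unparsed remainder visible.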

[5] MCIF: Multimodal Crosslingual Instruction-Following Benchmark from Scientific Talks

Sara Papi,Maike Züfle,Marco Gaido,Beatrice Savoldi,Danni Liu,Ioannis Douros,Luisa Bentivogli,Jan Niehues

Main category: cs.CL

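TL;DR: The paper introduces MCIF, a multimodal crosslingual instruction-following benchmark for evaluating multimodal large language models (MLLMs) in multilingual and multimodal settings.

Motivation: Existing benchmarks fall short on multilingual, multimodal, and long-context evaluation, which hinders a comprehensive assessment of MLLM capabilities.

Contribution: MCIF, the first multilingual, human-annotated multimodal benchmark based on scientific talks, covering three core modalities (speech, vision, and text) and four languages.

Method: Build the MCIF benchmark from human-annotated scientific talks, covering both short-form and long-form inputs and supporting multimodal and multilingual tasks.

Result: MCIF provides a standardized tool for evaluating MLLM performance in multilingual and multimodal settings.

Insight: MCIF offers an open and comprehensive evaluation framework for future MLLM research, supporting progress on multimodal and crosslingual capabilities.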

Abstract: Recent advances in large language models have catalyzed the development of multimodal LLMs (MLLMs) that integrate text, speech, and vision within unified frameworks. As MLLMs evolve from narrow, monolingual, task-specific systems to general-purpose instruction-following models, a key frontier lies in evaluating their multilingual and multimodal capabilities over both long and short contexts. However, existing benchmarks fall short in evaluating these dimensions jointly: they are often limited to English, mostly focus on one single modality at a time, rely on short-form contexts, or lack human annotations – hindering comprehensive assessment of model performance across languages, modalities, and task complexity. To address these gaps, we introduce MCIF (Multimodal Crosslingual Instruction Following), the first multilingual human-annotated benchmark based on scientific talks that is designed to evaluate instruction-following in crosslingual, multimodal settings over both short- and long-form inputs. MCIF spans three core modalities – speech, vision, and text – and four diverse languages (English, German, Italian, and Chinese), enabling a comprehensive evaluation of MLLMs’ abilities to interpret instructions across languages and combine them with multimodal contextual information. MCIF is released under a CC-BY 4.0 license to encourage open research and progress in MLLMs development.

[6] RoD-TAL: A Benchmark for Answering Questions in Romanian Driving License Exams

Andrei Vlad Man,Răzvan-Alexandru Smădu,Cristian-George Craciun,Dumitru-Clementin Cercel,Florin Pop,Mihaela-Claudia Cercel

Main category: cs.CL

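TL;DR: The paper introduces RoD-TAL, a multimodal dataset for evaluating large language models and vision-language models on Romanian driving exam questions. Experiments show the importance of domain-specific fine-tuning for retrieval and the gains that chain-of-thought prompting brings to question answering.

Motivation: Combining AI with legal systems calls for tools that support legal education, particularly in under-resourced languages such as Romanian. The paper evaluates how well LLMs and VLMs understand and reason about Romanian driving law.

Contribution: The RoD-TAL dataset, containing text-based and image-based driving exam questions, and an evaluation of RAG pipelines, dense retrievers, and reasoning-optimized models across several tasks.

Method: Retrieval-augmented generation (RAG) pipelines, dense retrievers, and reasoning-optimized models are tested on information retrieval, question answering, visual information retrieval, and visual question answering.

Result: Domain-specific fine-tuning significantly improves retrieval performance. Chain-of-thought prompting and specialized reasoning models improve QA accuracy, surpassing the minimum grades required to pass driving exams, but visual reasoning remains challenging.

Insight: The paper shows the potential of LLMs and VLMs for legal education while also pointing out the limitations of visual reasoning.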

Abstract: The intersection of AI and legal systems presents a growing need for tools that support legal education, particularly in under-resourced languages such as Romanian. In this work, we aim to evaluate the capabilities of Large Language Models (LLMs) and Vision-Language Models (VLMs) in understanding and reasoning about Romanian driving law through textual and visual question-answering tasks. To facilitate this, we introduce RoD-TAL, a novel multimodal dataset comprising Romanian driving test questions, text-based and image-based, alongside annotated legal references and human explanations. We implement and assess retrieval-augmented generation (RAG) pipelines, dense retrievers, and reasoning-optimized models across tasks including Information Retrieval (IR), Question Answering (QA), Visual IR, and Visual QA. Our experiments demonstrate that domain-specific fine-tuning significantly enhances retrieval performance. At the same time, chain-of-thought prompting and specialized reasoning models improve QA accuracy, surpassing the minimum grades required to pass driving exams. However, visual reasoning remains challenging, highlighting the potential and the limitations of applying LLMs and VLMs to legal education.

[7] JT-Math: A Multi-Stage Framework for Advanced Mathematical Reasoning in Large Language Models

Yifan Hao,Fangning Chao,Yaqian Hao,Zhaojun Cui,Huan Bai,Haiyu Zhang,Yankai Liu,Chao Deng,Junlan Feng

Main category: cs.CL

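TL;DR: JT-Math-8B is a series of open-source models built with a multi-stage optimization framework. It uses a high-quality dataset and a systematic training pipeline to improve the performance of large language models on complex mathematical reasoning.

Motivation: Existing language models fall short on complex mathematical problems, especially those requiring deep understanding and multi-step reasoning, so a more systematic optimization approach is needed.

Contribution: The JT-Math-8B series of open-source models, comprising base, instruct, and thinking versions; a multi-stage optimization framework; and performance gains from a high-quality dataset and a staged reinforcement learning curriculum.

Method: Pre-train on a high-quality 210B-token dataset curated with model-based validation. The instruct version is optimized with Supervised Fine-Tuning (SFT) and a GRPO-based reinforcement learning method. The thinking version combines a Long Chain-of-Thought (Long CoT) approach with a multi-stage reinforcement learning curriculum.

Result: JT-Math-8B achieves the best results among open-source models of similar size, surpasses OpenAI's O1-mini and GPT-4o, and performs especially well on competition-level mathematics.

Insight: A staged reinforcement learning curriculum and Long CoT training effectively improve reasoning on complex tasks; validating the quality of the training data is key to the gains.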

Abstract: Mathematical reasoning is a cornerstone of artificial general intelligence and a primary benchmark for evaluating the capabilities of Large Language Models (LLMs). While state-of-the-art models show promise, they often falter when faced with complex problems that demand deep conceptual understanding and intricate, multi-step deliberation. To address this challenge, we introduce JT-Math-8B, a series of open-source models comprising base, instruct, and thinking versions, built upon a systematic, multi-stage optimization framework. Our pre-training corpus is a high-quality, 210B-token dataset curated through a dedicated data pipeline that uses model-based validation to ensure quality and diversity. The Instruct Model is optimized for direct, concise answers through Supervised Fine-Tuning (SFT) and a GRPO-based reinforcement learning (RL) method. The Thinking Model is trained for complex problem-solving using a Long Chain-of-Thought (Long CoT) approach, combining SFT with a novel, multi-stage RL curriculum that progressively increases task difficulty and context length up to 32K tokens. JT-Math-8B achieves state-of-the-art results among open-source models of similar size, surpassing prominent models like OpenAI’s O1-mini and GPT-4o, and demonstrating superior performance on competition-level mathematics.

[8] UloRL: An Ultra-Long Output Reinforcement Learning Approach for Advancing Large Language Models’ Reasoning Abilities

Dong Du,Shulin Liu,Tao Yang,Shaohua Chen,Yang Li

Main category: cs.CL

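TL;DR: The paper proposes an Ultra-Long Output Reinforcement Learning (UloRL) approach that uses segmented decoding and dynamic masking to address the inefficiency of traditional RL frameworks on ultra-long outputs, markedly improving the reasoning ability of large language models.

Motivation: Traditional reinforcement learning frameworks are inefficient on ultra-long output sequences, mainly because of long-tail sequence length distributions and entropy collapse during training. An efficient method is therefore needed to improve the reasoning ability of large language models (LLMs).

Contribution: UloRL, which splits ultra-long output decoding into short segments and introduces dynamic masking of well-Mastered Positive Tokens (MPTs), improving training efficiency and reasoning performance on long-sequence generation.

Method: Segment rollout splits ultra-long outputs into short segments to reduce the delays caused by long-tail samples; dynamic masking of MPTs prevents entropy collapse.

Result: On Qwen3-30B-A3B, segment rollout speeds up training by 2.06x. Training with 128k-token outputs improves performance on AIME2025 from 70.9% to 85.1% and on BeyondAIME from 50.7% to 61.9%, even surpassing Qwen3-235B-A22B.

Insight: With segmented decoding and dynamic masking, UloRL improves both training efficiency and reasoning ability, offering a new approach to ultra-long sequence generation.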

Abstract: Recent advances in large language models (LLMs) have highlighted the potential of reinforcement learning with verifiable rewards (RLVR) to enhance reasoning capabilities through extended output sequences. However, traditional RL frameworks face inefficiencies when handling ultra-long outputs due to long-tail sequence distributions and entropy collapse during training. To address these challenges, we propose an Ultra-Long Output Reinforcement Learning (UloRL) approach for advancing large language models’ reasoning abilities. Specifically, we divide ultra-long output decoding into short segments, enabling efficient training by mitigating delays caused by long-tail samples. Additionally, we introduce dynamic masking of well-Mastered Positive Tokens (MPTs) to prevent entropy collapse. Experimental results demonstrate the effectiveness of our approach. On the Qwen3-30B-A3B model, RL with segment rollout achieved a 2.06x increase in training speed, while RL training with 128k-token outputs improves the model’s performance on AIME2025 from 70.9% to 85.1% and on BeyondAIME from 50.7% to 61.9%, even surpassing Qwen3-235B-A22B with remarkable gains. These findings underscore the potential of our methods to advance the reasoning capabilities of LLMs with ultra-long sequence generation. We will release our code and model for further use by the community.
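
The idea of decoding in short segments can be sketched as a toy scheduling simulation. This is an illustration under assumptions, not the authors' implementation: `segment_rollout` is a hypothetical name, and `target_lengths` stands in for output lengths that a real system would only discover when each sample emits its end-of-sequence token.

```python
from typing import Dict, List


def segment_rollout(target_lengths: List[int], segment_len: int) -> List[List[int]]:
    """Simulate decoding in fixed-size segments.

    target_lengths[i] is the number of tokens sample i needs before it ends
    (an assumption of this toy model; unknown in advance in practice).
    Returns, for each rollout step, the indices of samples that finished
    during that step and could be handed to the trainer immediately.
    """
    if segment_len <= 0:
        raise ValueError("segment_len must be positive")
    remaining: Dict[int, int] = {i: n for i, n in enumerate(target_lengths)}
    finished_per_step: List[List[int]] = []
    while remaining:
        done_now: List[int] = []
        for idx in list(remaining):
            # Decode at most one segment for this sample in this step.
            remaining[idx] -= min(segment_len, remaining[idx])
            if remaining[idx] == 0:
                done_now.append(idx)
                del remaining[idx]
        finished_per_step.append(done_now)
    return finished_per_step
```

With lengths `[3, 10, 25]` and a segment length of 10, the two short samples are available after the first step, while the long-tail sample continues over later steps instead of blocking the whole batch.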

[9] Flora: Effortless Context Construction to Arbitrary Length and Scale

Tianxiang Chen,Zhentao Tan,Xiaofan Bo,Yue Wu,Tao Gong,Qi Chu,Jieping Ye,Nenghai Yu

Main category: cs.CL

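TL;DR: The paper proposes Flora, a long-context construction strategy that requires no human or LLM intervention. By assembling short instructions and adding long-context meta-instructions, it markedly improves the long-context ability of LLMs while only slightly affecting short-context performance.

Motivation: LLMs face several challenges with long contexts (scarce long texts, high computational cost, and forgetting of short-context abilities), which calls for a low-cost, efficient solution.

Contribution: Flora, a method that constructs long-context data of arbitrary length and scale without human or LLM intervention and markedly improves the long-context performance of LLMs.

Method: Assemble short instructions by category and use long-context meta-instructions to guide the LLM's responses, yielding diverse and scalable long-context construction.

Result: Experiments on Llama3-8B-Instruct and QwQ-32B show that Flora-enhanced LLMs excel on long-context tasks with almost no loss in short-context performance.

Insight: Simple instruction assembly guided by meta-instructions can handle the complexity and diversity of long-context construction, offering a new way to improve the long-context ability of LLMs.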

Abstract: Effectively handling long contexts is challenging for Large Language Models (LLMs) due to the rarity of long texts, high computational demands, and substantial forgetting of short-context abilities. Recent approaches have attempted to construct long contexts for instruction tuning, but these methods often require LLMs or human interventions, which are both costly and limited in length and diversity. Also, the drop in short-context performances of present long-context LLMs remains significant. In this paper, we introduce Flora, an effortless (human/LLM-free) long-context construction strategy. Flora can markedly enhance the long-context performance of LLMs by arbitrarily assembling short instructions based on categories and instructing LLMs to generate responses based on long-context meta-instructions. This enables Flora to produce contexts of arbitrary length and scale with rich diversity, while only slightly compromising short-context performance. Experiments on Llama3-8B-Instruct and QwQ-32B show that LLMs enhanced by Flora excel in three long-context benchmarks while maintaining strong performances in short-context tasks. Our data-construction code is available at https://github.com/txchen-USTC/Flora.
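
The assembly step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the released Flora code: the function name, the word-count budget, the numbering scheme, and the wording of the meta-instruction are all assumptions made here for the example.

```python
import random
from typing import Dict, List, Tuple


def assemble_long_context(
    pools: Dict[str, List[str]], target_words: int, seed: int = 0
) -> Tuple[str, str]:
    """Concatenate short instructions sampled by category until a word budget is met.

    Returns (context, meta_instruction). The meta-instruction wording is a
    hypothetical example of a long-context meta-instruction.
    """
    if target_words <= 0:
        raise ValueError("target_words must be positive")
    if not pools or any(not items for items in pools.values()):
        raise ValueError("every category needs at least one instruction")
    rng = random.Random(seed)
    categories = sorted(pools)
    picked: List[str] = []
    n_words = 0
    while n_words < target_words:
        category = rng.choice(categories)
        instruction = rng.choice(pools[category])
        picked.append(instruction)
        # Count at least one word per pick so blank entries cannot stall the loop.
        n_words += max(1, len(instruction.split()))
    numbered = [f"[{i + 1}] {text}" for i, text in enumerate(picked)]
    target = rng.randrange(len(picked)) + 1
    meta = f"Respond only to instruction [{target}] above."
    return "\n".join(numbered), meta
```

Because the loop only needs a pool of short instructions and a budget, the context length is set by `target_words` alone, which is the property the abstract emphasizes; no model or annotator is involved in building the context itself.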

[10] Zero-shot Performance of Generative AI in Brazilian Portuguese Medical Exam

Cesar Augusto Madid Truyts,Amanda Gomes Rabelo,Gabriel Mesquita de Souza,Daniel Scaldaferri Lages,Adriano Jose Pereira,Uri Adrian Prync Flato,Eduardo Pontes dos Reis,Joaquim Edson Vieira,Paulo Sergio Panse Silveira,Edson Amaro Junior

Main category: cs.CL

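TL;DR: This study evaluates six LLMs and four MLLMs on a medical exam in Brazilian Portuguese. Some models (such as Claude-3.5-Sonnet and Claude-3-Opus) reach accuracy comparable to human candidates but perform worse on multimodal questions, highlighting the need for further optimization of non-English medical AI applications.

Motivation: Evaluations of AI in healthcare have focused mostly on English, which may bias performance in other languages, so model performance in non-English settings needs to be studied.

Contribution: A systematic evaluation of several LLMs and MLLMs on a Brazilian Portuguese medical exam, revealing gaps in language and multimodal capabilities and proposing directions for future research.

Method: Six LLMs and four MLLMs are tested on the medical residency entrance exam of the Hospital das Clínicas da Faculdade de Medicina da Universidade de São Paulo (HCFMUSP), measuring accuracy, processing time, and the coherence of generated explanations.

Result: Some models (such as Claude-3.5-Sonnet and Claude-3-Opus) approach human-level accuracy and explanation coherence but fall short on multimodal questions.

Insight: Non-English medical AI applications need further optimization, especially in multimodal reasoning; future work should focus on improved training methods and real-world clinical integration.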

Abstract: Artificial intelligence (AI) has shown the potential to revolutionize healthcare by improving diagnostic accuracy, optimizing workflows, and personalizing treatment plans. Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) have achieved notable advancements in natural language processing and medical applications. However, the evaluation of these models has focused predominantly on the English language, leading to potential biases in their performance across different languages. This study investigates the capability of six LLMs (GPT-4.0 Turbo, LLaMA-3-8B, LLaMA-3-70B, Mixtral 8x7B Instruct, Titan Text G1-Express, and Command R+) and four MLLMs (Claude-3.5-Sonnet, Claude-3-Opus, Claude-3-Sonnet, and Claude-3-Haiku) to answer questions written in Brazilian Portuguese from the medical residency entrance exam of the Hospital das Clínicas da Faculdade de Medicina da Universidade de São Paulo (HCFMUSP) - the largest health complex in South America. The performance of the models was benchmarked against human candidates, analyzing accuracy, processing time, and coherence of the generated explanations. The results show that while some models, particularly Claude-3.5-Sonnet and Claude-3-Opus, achieved accuracy levels comparable to human candidates, performance gaps persist, particularly in multimodal questions requiring image interpretation. Furthermore, the study highlights language disparities, emphasizing the need for further fine-tuning and data set augmentation for non-English medical AI applications. Our findings reinforce the importance of evaluating generative AI in various linguistic and clinical settings to ensure a fair and reliable deployment in healthcare. Future research should explore improved training methodologies, improved multimodal reasoning, and real-world clinical integration of AI-driven medical assistance.

[11] Text2Vis: A Challenging and Diverse Benchmark for Generating Multimodal Visualizations from Text

Mizanur Rahman,Md Tahmid Rahman Laskar,Shafiq Joty,Enamul Hoque

Main category: cs.CL

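TL;DR: Text2Vis is a text-to-visualization benchmark covering 20+ chart types and 1,985 samples. The paper evaluates 11 models on it and proposes a cross-modal actor-critic framework to improve performance.

Motivation: There is no comprehensive benchmark for visualization generation with large language models, which limits rigorous evaluation of their capabilities.

Contribution: The Text2Vis benchmark, a cross-modal actor-critic framework, and an automated LLM-based evaluation framework.

Method: An actor-critic framework jointly refines the textual answer and the visualization code, and an automated evaluation framework measures several metrics.

Result: The proposed framework raises GPT-4o's pass rate from 26% to 42% while also improving chart quality.

Insight: Cross-modal joint refinement and automated evaluation show clear potential for text-to-visualization tasks.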

Abstract: Automated data visualization plays a crucial role in simplifying data interpretation, enhancing decision-making, and improving efficiency. While large language models (LLMs) have shown promise in generating visualizations from natural language, the absence of comprehensive benchmarks limits the rigorous evaluation of their capabilities. We introduce Text2Vis, a benchmark designed to assess text-to-visualization models, covering 20+ chart types and diverse data science queries, including trend analysis, correlation, outlier detection, and predictive analytics. It comprises 1,985 samples, each with a data table, natural language query, short answer, visualization code, and annotated charts. The queries involve complex reasoning, conversational turns, and dynamic data retrieval. We benchmark 11 open-source and closed-source models, revealing significant performance gaps, highlighting key challenges, and offering insights for future advancements. To close this gap, we propose the first cross-modal actor-critic agentic framework that jointly refines the textual answer and visualization code, increasing GPT-4o's pass rate from 26% to 42% over the direct approach and improving chart quality. We also introduce an automated LLM-based evaluation framework that enables scalable assessment across thousands of samples without human annotation, measuring answer correctness, code execution success, visualization readability, and chart accuracy. We release Text2Vis at https://github.com/vis-nlp/Text2Vis.

[12] Multi-Stage Verification-Centric Framework for Mitigating Hallucination in Multi-Modal RAG

Baiyu Chen,Wilson Wongso,Xiaoqian Hu,Yue Tan,Flora Salim

Main category: cs.CL

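TL;DR: The paper proposes a multi-stage verification framework for reducing hallucination in multi-modal RAG systems and takes third place in the CRAG-MM challenge of KDD Cup 2025.

Motivation: Modern vision language models (VLMs) tend to hallucinate when faced with egocentric imagery, long-tail entities, and complex multi-hop questions, which undermines factual accuracy in real-world applications.

Contribution: A multi-stage framework with lightweight query routing, dual-pathway generation, and post-hoc verification that prioritizes factual accuracy over completeness.

Method: Combine a query-aware retrieval and summarization pipeline with dual-pathway generation and post-hoc verification to minimize hallucination.

Result: The approach ranks third in Task 1 of KDD Cup 2025, supporting the effectiveness of the framework.

Insight: In complex multi-modal RAG systems, prioritizing answer reliability over completeness markedly reduces hallucination.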

Abstract: This paper presents the technical solution developed by team CRUISE for the KDD Cup 2025 Meta Comprehensive RAG Benchmark for Multi-modal, Multi-turn (CRAG-MM) challenge. The challenge aims to address a critical limitation of modern Vision Language Models (VLMs): their propensity to hallucinate, especially when faced with egocentric imagery, long-tail entities, and complex, multi-hop questions. This issue is particularly problematic in real-world applications where users pose fact-seeking queries that demand high factual accuracy across diverse modalities. To tackle this, we propose a robust, multi-stage framework that prioritizes factual accuracy and truthfulness over completeness. Our solution integrates a lightweight query router for efficiency, a query-aware retrieval and summarization pipeline, dual-pathway generation, and post-hoc verification. This conservative strategy is designed to minimize hallucinations, which incur a severe penalty in the competition’s scoring metric. Our approach achieved 3rd place in Task 1, demonstrating the effectiveness of prioritizing answer reliability in complex multi-modal RAG systems. Our implementation is available at https://github.com/Breezelled/KDD-Cup-2025-Meta-CRAG-MM.

[13] Multi-Agent Interactive Question Generation Framework for Long Document Understanding

Kesen Wang,Daulet Toibazar,Abdulrahman Alfulayt,Abdulaziz S. Albadawi,Ranya A. Alkahtani,Asma A. Ibrahim,Haneen A. Alhomoud,Sherif Mohamed,Pedro J. Moreno

Main category: cs.CL

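TL;DR: The paper proposes a fully automated multi-agent interactive framework for generating questions for long document understanding. It addresses the scarcity of fine-grained training data for low-resource languages such as Arabic and produces high-quality single-page and multi-page question sets.

Motivation: Long Document Understanding (DU) remains challenging in multilingual settings with complex layouts, especially for low-resource languages. Existing methods rely on human annotation, which is costly and inefficient.

Contribution: A fully automated multi-agent interactive framework that efficiently generates long-document questions in Arabic and English, supporting the development of models with better long-context understanding.

Method: A multi-agent interactive framework automatically generates high-quality question-answer pairs over long documents from diverse domains.

Result: Experiments show that the generated questions (AraEngLongBench) are challenging for major LVLMs.

Insight: Automatically generated training data can substantially reduce annotation costs for long document understanding and supports the development of models with stronger long-context ability.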

Abstract: Document Understanding (DU) in long-contextual scenarios with complex layouts remains a significant challenge in vision-language research. Although Large Vision-Language Models (LVLMs) excel at short-context DU tasks, their performance declines in long-context settings. A key limitation is the scarcity of fine-grained training data, particularly for low-resource languages such as Arabic. Existing state-of-the-art techniques rely heavily on human annotation, which is costly and inefficient. We propose a fully automated, multi-agent interactive framework to generate long-context questions efficiently. Our approach efficiently generates high-quality single- and multi-page questions for extensive English and Arabic documents, covering hundreds of pages across diverse domains. This facilitates the development of LVLMs with enhanced long-context understanding ability. Experimental results in this work have shown that our generated English and Arabic questions (AraEngLongBench) are quite challenging to major open- and closed-source LVLMs. The code and data proposed in this work can be found at https://github.com/wangk0b/Multi_Agentic_QA_Long_Doc.git. Sample Question and Answer (QA) pairs and structured system prompts can be found in the Appendix.

[14] SessionIntentBench: A Multi-task Inter-session Intention-shift Modeling Benchmark for E-commerce Customer Behavior Understanding

Yuqi Yang,Weiqi Wang,Baixuan Xu,Wei Fan,Qing Zong,Chunkit Chan,Zheye Deng,Xin Liu,Yifan Gao,Changlong Yu,Chen Luo,Yang Li,Zheng Li,Qingyu Yin,Bing Yin,Yangqiu Song

Main category: cs.CL

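TL;DR: The paper introduces SessionIntentBench, a new multimodal benchmark for evaluating how well language and vision-language models understand inter-session intention shifts, filling a gap in data and benchmarks for intention modeling in e-commerce.

Motivation: Prior work fails to capture and model customer intention effectively, mainly because it exploits too little of the available information, and there is no dedicated dataset or benchmark for intention in e-commerce sessions.

Contribution: 1) The concept of an intention tree and a dataset curation pipeline; 2) SessionIntentBench, a multimodal benchmark with a large number of intention entries and session intention trajectories; 3) Experiments showing that existing models struggle to capture intention in complex session settings.

Method: Introduce the intention tree concept and design a data curation pipeline, mine session data to build the multimodal benchmark, and add human annotation to form a gold evaluation set.

Result: Experiments show that current L(V)LMs struggle to capture intention across complex sessions, but injecting intention information improves model performance.

Insight: Intention information is crucial for understanding user behavior, especially across sessions; future work should pay more attention to intention modeling and data utilization.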

Abstract: Session history is a common way of recording user interaction behaviors throughout a browsing activity with multiple products. For example, if a user clicks a product webpage and then leaves, it might be because certain features do not satisfy the user, which serves as an important indicator of on-the-spot user preferences. However, all prior works fail to capture and model customer intention effectively because of insufficient information exploitation: only apparent information like descriptions and titles is used. There is also a lack of data and a corresponding benchmark for explicitly modeling intention in E-commerce product purchase sessions. To address these issues, we introduce the concept of an intention tree and propose a dataset curation pipeline. Together, we construct a sibling multimodal benchmark, SessionIntentBench, that evaluates L(V)LMs’ capability on understanding inter-session intention shift with four subtasks. With 1,952,177 intention entries, 1,132,145 session intention trajectories, and 13,003,664 available tasks mined using 10,905 sessions, we provide a scalable way to exploit the existing session data for customer intention understanding. We conduct human annotations to collect ground-truth labels for a subset of collected data to form an evaluation gold set. Extensive experiments on the annotated data further confirm that current L(V)LMs fail to capture and utilize the intention across the complex session setting. Further analysis shows that injecting intention enhances LLMs’ performance.

[15] Diversity-Enhanced Reasoning for Subjective Questions

Yumeng Wang,Zhiyuan Fan,Jiayu Liu,Yi R. Fung

Main category: cs.CL

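TL;DR: The paper proposes MultiRole-R1, a diversity-enhanced framework that addresses the homogeneous reasoning of large reasoning models on subjective questions. By introducing multiple role perspectives with unsupervised data construction and combining reinforcement learning with a diversity reward signal, it markedly improves both accuracy and diversity on subjective reasoning tasks.

Motivation: Large reasoning models excel at objective tasks but are limited on subjective questions, mainly because training relies on a single ground truth and therefore produces homogeneous reasoning. The finding that adding role perspectives improves performance motivates the diversity-enhanced framework.

Contribution: The MultiRole-R1 framework, comprising unsupervised data construction with multiple role perspectives and reinforcement learning (GRPO) with a diversity reward. The work also uncovers a positive relation between reasoning diversity and accuracy.

Method: 1. An unsupervised data construction pipeline generates reasoning chains that incorporate multiple role perspectives. 2. GRPO-based reinforcement learning optimizes the model with a diversity reward signal. 3. Specially designed reward functions promote perspective diversity and lexical diversity.

Result: Experiments on six benchmarks demonstrate the effectiveness and generalizability of MultiRole-R1 on both subjective and objective reasoning tasks. Diversity-enhanced training markedly improves model performance.

Insight: Diversity-enhanced training helps beyond subjective questions and improves performance on a broader range of tasks. The positive relation between reasoning diversity and accuracy underlines the value of introducing diversity during training.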

Abstract: Large reasoning models (LRM) with long chain-of-thought (CoT) capabilities have shown strong performance on objective tasks, such as math reasoning and coding. However, their effectiveness on subjective questions that may have different responses from different perspectives is still limited by a tendency towards homogeneous reasoning, introduced by the reliance on a single ground truth in supervised fine-tuning and verifiable reward in reinforcement learning. Motivated by the finding that increasing role perspectives consistently improves performance, we propose MultiRole-R1, a diversity-enhanced framework with multiple role perspectives, to improve the accuracy and diversity in subjective reasoning tasks. MultiRole-R1 features an unsupervised data construction pipeline that generates reasoning chains that incorporate diverse role perspectives. We further employ reinforcement learning via Group Relative Policy Optimization (GRPO) with reward shaping, by taking diversity as a reward signal in addition to the verifiable reward. With specially designed reward functions, we successfully promote perspective diversity and lexical diversity, uncovering a positive relation between reasoning diversity and accuracy. Our experiment on six benchmarks demonstrates MultiRole-R1’s effectiveness and generalizability in enhancing both subjective and objective reasoning, showcasing the potential of diversity-enhanced training in LRMs.
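
Reward shaping with a diversity term can be sketched as below. This is an illustration under assumptions rather than the paper's reward design: the distinct-unigram ratio used as a lexical diversity score, the `diversity_weight` value, and the use of the population standard deviation for group normalization are all choices made here for the example.

```python
import statistics
from typing import List


def distinct_unigram_ratio(text: str) -> float:
    """Share of unique whitespace tokens; a simple stand-in for lexical diversity."""
    tokens = text.lower().split()
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def shaped_rewards(
    responses: List[str], verifiable: List[float], diversity_weight: float = 0.5
) -> List[float]:
    """Add a weighted diversity score to each verifiable reward (weight is an assumption)."""
    if len(responses) != len(verifiable):
        raise ValueError("responses and verifiable rewards must align")
    return [
        v + diversity_weight * distinct_unigram_ratio(r)
        for r, v in zip(responses, verifiable)
    ]


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one sampled group, in the style of GRPO."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

In this sketch, two responses with the same verifiable reward receive different advantages when one is lexically more varied, which is how a diversity signal can still separate responses that a correctness-only reward would treat as tied.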

[16] IQ Test for LLMs: An Evaluation Framework for Uncovering Core Skills in LLMs

Aviya Maimon,Amir DN Cohen,Gal Vishne,Shauli Ravfogel,Reut Tsarfaty

Main category: cs.CL

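TL;DR: The paper proposes a new evaluation framework that uses factor analysis to uncover the latent core skills of large language models (LLMs). It moves away from the usual practice of averaging benchmark scores into a single number and gives a more complete picture of model strengths and weaknesses.

Motivation: Current LLM evaluation relies on benchmark scores, which say little about a model's overall skills and offer no understanding of how tasks relate to one another. A more comprehensive way to assess strengths and weaknesses is needed.

Contribution: A new evaluation paradigm that identifies latent skills through factor analysis, a leaderboard covering 60 LLMs on 44 tasks, and a set of practical tools.

Method: Apply factor analysis to the performance of 60 LLMs on 44 tasks and identify a small number of latent skills that explain model performance.

Result: A small set of latent skills largely explains model performance. The resulting tools identify redundant tasks, aid model selection, and profile models along each latent skill.

Insight: Factor analysis gives a fuller picture of LLM core skills than a single averaged score and offers a new perspective for model selection and optimization.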

Abstract: Current evaluations of large language models (LLMs) rely on benchmark scores, but it is difficult to interpret what these individual scores reveal about a model’s overall skills. Specifically, as a community we lack understanding of how tasks relate to one another, what they measure in common, how they differ, or which ones are redundant. As a result, models are often assessed via a single score averaged across benchmarks, an approach that fails to capture the models’ wholistic strengths and limitations. Here, we propose a new evaluation paradigm that uses factor analysis to identify latent skills driving performance across benchmarks. We apply this method to a comprehensive new leaderboard showcasing the performance of 60 LLMs on 44 tasks, and identify a small set of latent skills that largely explain performance. Finally, we turn these insights into practical tools that identify redundant tasks, aid in model selection, and profile models along each latent skill.
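
One of the practical uses named in the abstract, spotting redundant tasks, can be approximated with plain pairwise correlation. Note that this is a deliberately simpler stand-in and not the factor analysis the paper uses: it only flags pairs of tasks whose scores move together across models. The function names, the 0.95 threshold, and the example scores in the usage note are hypothetical.

```python
import math
from typing import Dict, List, Tuple


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation of two equal-length score lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 entries")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no ranking information
    return cov / math.sqrt(var_x * var_y)


def redundant_pairs(
    scores: Dict[str, List[float]], threshold: float = 0.95
) -> List[Tuple[str, str, float]]:
    """Return task pairs whose per-model scores correlate at or above the threshold.

    scores maps a task name to one score per model, in a fixed model order.
    """
    tasks = sorted(scores)
    pairs: List[Tuple[str, str, float]] = []
    for i, task_a in enumerate(tasks):
        for task_b in tasks[i + 1:]:
            r = pearson(scores[task_a], scores[task_b])
            if r >= threshold:
                pairs.append((task_a, task_b, r))
    return pairs
```

For instance, if a hypothetical `task_b` scores every model exactly 0.1 higher than `task_a`, the pair is flagged, since the second task ranks the models identically. Factor analysis goes further by explaining many tasks at once through a few shared latent skills.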

[17] Post-Completion Learning for Language Models

Xiang Fei,Siqi Wang,Shu Wei,Yuxiang Nie,Wei Shi,Hao Feng,Can Huang

Main category: cs.CL

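TL;DR: The paper proposes Post-Completion Learning (PCL), a new training framework for language models that uses the sequence space after the model's output is complete to improve reasoning and self-evaluation. It achieves multi-objective optimization by mixing dual-track SFT with reinforcement learning.

Details Motivation: Current language model training stops learning at the end-of-sequence token, leaving the sequence space after output completion unused.

Contribution: Proposes the PCL framework, combined with a white-box reinforcement learning method, to improve reasoning and self-evaluation while keeping inference efficient.

Method: Designs a hybrid optimization that mixes dual-track SFT with reinforcement learning, with supervision provided through self-assessment and reward prediction.

Result: Outperforms traditional SFT and RL methods across different datasets and models.

Insight: Using the post-completion space offers a new path for language model training that balances output quality and efficiency.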

Abstract: Current language model training paradigms typically terminate learning upon reaching the end-of-sequence token, overlooking the potential learning opportunities in the post-completion space. We propose Post-Completion Learning (PCL), a novel training framework that systematically utilizes the sequence space after model output completion, to enhance both the reasoning and self-evaluation abilities. PCL enables models to continue generating self-assessments and reward predictions during training, while maintaining efficient inference by stopping at the completion point. To fully utilize this post-completion space, we design a white-box reinforcement learning method: let the model evaluate the output content according to the reward rules, then calculate and align the score with the reward functions for supervision. We implement dual-track SFT to optimize both reasoning and evaluation capabilities, and mix it with RL training to achieve multi-objective hybrid optimization. Experimental results on different datasets and models demonstrate consistent improvements over traditional SFT and RL methods. Our method provides a new technical path for language model training that enhances output quality while preserving deployment efficiency.

[18] EMBRACE: Shaping Inclusive Opinion Representation by Aligning Implicit Conversations with Social Norms

Abeer Aldayel,Areej Alokaili

Main category: cs.CL

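TL;DR: The paper proposes EMBRACE, an evaluation framework that aligns implicit conversations with social norms, aiming to improve how inclusively NLP models represent diverse opinions.

Details Motivation: Existing methods mostly rely on explicit demographic or behavioral attributes and overlook opinions expressed implicitly in conversation, which can reinforce stereotypes or unfair representations.

Contribution: Proposes an evaluation framework that uses the stance of responses as a proxy for implicit opinion, combining PU learning and instruction-tuned models to evaluate models' alignment with social norms.

Method: Evaluates how well implicit conversations match social norms using (i) positive-unlabeled (PU) learning and (ii) instruction-tuned language models.

Result: Shows a pathway toward more inclusive model behavior and reveals how implicit opinions are (mis)represented.

Insight: Capturing implicit opinions and aligning with social norms are key to inclusiveness, offering a new perspective for the design of future NLP models.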

Abstract: Shaping inclusive representations that embrace diversity and ensure fair participation and reflections of values is at the core of many conversation-based models. However, many existing methods rely on surface inclusion using mention of user demographics or behavioral attributes of social groups. Such methods overlook the nuanced, implicit expression of opinion embedded in conversations. Furthermore, the over-reliance on overt cues can exacerbate misalignment and reinforce harmful or stereotypical representations in model outputs. Thus, we took a step back and recognized that equitable inclusion needs to account for the implicit expression of opinion and use the stance of responses to validate the normative alignment. This study aims to evaluate how opinions are represented in NLP or computational models by introducing an alignment evaluation framework that foregrounds implicit, often overlooked conversations and evaluates the normative social views and discourse. Our approach models the stance of responses as a proxy for the underlying opinion, enabling a considerate and reflective representation of diverse social viewpoints. We evaluate the framework using both (i) positive-unlabeled (PU) online learning with base classifiers, and (ii) instruction-tuned language models to assess post-training alignment. Through this, we provide a lens on how implicit opinions are (mis)represented and offer a pathway toward more inclusive model behavior.

[19] MoL-RL: Distilling Multi-Step Environmental Feedback into LLMs for Feedback-Independent Reasoning

Kang Yang,Jingxue Chen,Qingkun Tang,Tianxiang Zhang,Qianchun Lu

Main category: cs.CL

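TL;DR: MoL-RL is a new training paradigm that integrates multi-step environmental feedback signals into LLMs through a dual-objective optimization framework, enabling feedback-independent reasoning.

Details Motivation: Existing methods struggle to fully exploit multi-step, discrete environmental feedback (EF) signals, which limits reasoning ability.

Contribution: Proposes the MoL-RL framework, which distills multi-step EF signals into LLMs through MoL continual training and GRPO post-training, achieving feedback-independent reasoning.

Method: Combines MoL (which decouples EF signals from general language capabilities) with GRPO post-training to distill multi-step EF signals into single-step inference.

Result: Performs strongly on mathematical reasoning and code generation tasks, achieving state-of-the-art performance.

Insight: Multi-step signals can be integrated effectively into LLMs through a dual-objective optimization framework, improving reasoning ability.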

Abstract: Large language models (LLMs) face significant challenges in effectively leveraging sequential environmental feedback (EF) signals, such as natural language evaluations, for feedback-independent chain-of-thought (CoT) reasoning. Existing approaches either convert EF into scalar rewards, losing rich contextual information, or employ refinement datasets, failing to exploit the multi-step and discrete nature of EF interactions. To address these limitations, we propose MoL-RL, a novel training paradigm that integrates multi-step EF signals into LLMs through a dual-objective optimization framework. Our method combines MoL (Mixture-of-Losses) continual training, which decouples domain-specific EF signals (optimized via cross-entropy loss) and general language capabilities (preserved via Kullback-Leibler divergence), with GRPO-based post-training to distill sequential EF interactions into single-step inferences. This synergy enables robust feedback-independent reasoning without relying on external feedback loops. Experimental results on mathematical reasoning (MATH-500, AIME24/AIME25) and code generation (CodeAgent-Test) benchmarks demonstrate that MoL-RL achieves state-of-the-art performance with the Qwen3-8B model, while maintaining strong generalization across model scales (Qwen3-4B). This work provides a promising approach for leveraging multi-step textual feedback to enhance LLMs’ reasoning capabilities in diverse domains.
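
The dual objective described above (cross-entropy on domain-specific feedback data, KL divergence to preserve general language ability) can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: the equal weighting of the two terms, the KL direction (reference model against current model), and the function names are all hypothetical, and the distributions are toy next-token distributions over a three-token vocabulary.

```python
import math

def cross_entropy(probs, target):
    """Negative log-likelihood of the target token under the model."""
    return -math.log(probs[target])

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mixture_of_losses(domain_examples, general_examples):
    """Combine the two objectives (equal weighting is an assumption).

    domain_examples:  list of (model_probs, target_index) pairs
    general_examples: list of (reference_probs, model_probs) pairs
    """
    ce = sum(cross_entropy(p, t) for p, t in domain_examples) / len(domain_examples)
    kl = sum(kl_divergence(ref, cur) for ref, cur in general_examples) / len(general_examples)
    return ce + kl, ce, kl

# Toy inputs: one domain token with probability 0.5 on the target,
# and one general-text position where the model still matches the reference.
domain = [([0.5, 0.25, 0.25], 0)]
general = [([0.7, 0.2, 0.1], [0.7, 0.2, 0.1])]
total, ce, kl = mixture_of_losses(domain, general)
print(total, ce, kl)
```

When the model drifts away from the reference distribution on general text, the KL term grows, which is how such an objective penalizes forgetting while the cross-entropy term absorbs the feedback data.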

[20] Cognitive Chain-of-Thought: Structured Multimodal Reasoning about Social Situations

Eunkyu Park,Wesley Hanwen Deng,Gunhee Kim,Motahhare Eslami,Maarten Sap

Main category: cs.CL

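TL;DR: The paper proposes Cognitive Chain-of-Thought (CoCoT), a prompting strategy that strengthens vision-language model reasoning on multimodal tasks through three cognitively inspired stages (perception, situation, and norm). It clearly outperforms standard chain-of-thought (CoT) and direct prompting.

Details Motivation: Standard CoT prompting performs poorly on multimodal tasks, especially those involving social situations and normative judgments, because it cannot support layered reasoning that perceives, understands, and judges at once. The authors aim to improve model reasoning with a cognitively inspired, structured approach.

Contribution: Proposes CoCoT, which improves vision-language model performance on multimodal tasks through staged cognitive reasoning (perception, situation, norm), with an average gain of 8% across multiple benchmarks.

Method: CoCoT has three stages: perception (extract visual information), situation (understand the social context), and norm (make a judgment grounded in social norms). Structured prompts decompose the task into these stages and guide the model step by step.

Result: Experiments show that CoCoT outperforms standard CoT and direct prompting on multimodal tasks including intent disambiguation, commonsense reasoning, and safety, by 8% on average, while improving interpretability and social awareness.

Insight: Cognitively inspired, structured reasoning can substantially improve model performance on complex multimodal tasks and offers a practical design approach for safer, more reliable vision-language systems.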

Abstract: Chain-of-Thought (CoT) prompting helps models think step by step. But what happens when they must see, understand, and judge, all at once? In visual tasks grounded in social context, where bridging perception with norm-grounded judgments is essential, flat CoT often breaks down. We introduce Cognitive Chain-of-Thought (CoCoT), a prompting strategy that scaffolds VLM reasoning through three cognitively inspired stages: perception, situation, and norm. Our experiments show that, across multiple multimodal benchmarks (including intent disambiguation, commonsense reasoning, and safety), CoCoT consistently outperforms CoT and direct prompting (+8% on average). Our findings demonstrate that cognitively grounded reasoning stages enhance interpretability and social awareness in VLMs, paving the way for safer and more reliable multimodal systems.
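
A staged prompt of this kind might be assembled as in the sketch below. Only the three stage names (perception, situation, norm) come from the abstract; the instruction wording, the `STAGES` list, and the `build_cocot_prompt` helper are hypothetical, and a real setup would also pass the image to the vision-language model.

```python
# Stage names follow the abstract; the instruction text is an assumption.
STAGES = [
    ("Perception", "Describe what is directly observable in the image."),
    ("Situation", "Explain the social context the observations suggest."),
    ("Norm", "Judge the request in light of relevant social norms."),
]

def build_cocot_prompt(question):
    """Build a text prompt that walks the model through the three stages in order."""
    lines = [f"Question: {question}", "Answer in three stages:"]
    for number, (name, instruction) in enumerate(STAGES, start=1):
        lines.append(f"{number}. {name}: {instruction}")
    lines.append("Final answer:")
    return "\n".join(lines)

prompt = build_cocot_prompt("Is it appropriate to take a photo here?")
print(prompt)
```

Keeping the stages in a list makes it easy to compare against a flat CoT baseline by replacing the list with a single "think step by step" instruction.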

[21] CONCAP: Seeing Beyond English with Concepts Retrieval-Augmented Captioning

George Ibrahim,Rita Ramos,Yova Kementchedjhieva

Main category: cs.CL

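TL;DR: CONCAP is a multilingual image captioning model that augments retrieval-augmented generation (RAG) with image-specific concepts, reducing dependence on multilingual training data while performing strongly on low-resource languages.

Details Motivation: Existing multilingual vision-language models lag behind English models on image captioning because of limited data and the cost of large-scale parameterization. Conventional RAG approaches rely on retrieved captions translated from English, which can introduce linguistic bias and context mismatches.

Contribution: CONCAP enhances RAG with image-specific concepts, combining multilingual retrieved captions with image content. It markedly improves captioning for low-resource languages and reduces data requirements.

Method: CONCAP uses retrieval-augmented generation: it retrieves relevant captions in the target language and combines them with concepts specific to the image to generate more accurate multilingual captions.

Result: Experiments on the XM3600 dataset show that CONCAP performs strongly on low- and mid-resource languages with greatly reduced data requirements.

Insight: Concept-aware retrieval augmentation can narrow multilingual performance gaps and reduce reliance on expensive multilingual data, offering a new direction for multilingual tasks.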

Abstract: Multilingual vision-language models have made significant strides in image captioning, yet they still lag behind their English counterparts due to limited multilingual training data and costly large-scale model parameterization. Retrieval-augmented generation (RAG) offers a promising alternative by conditioning caption generation on retrieved examples in the target language, reducing the need for extensive multilingual training. However, multilingual RAG captioning models often depend on retrieved captions translated from English, which can introduce mismatches and linguistic biases relative to the source language. We introduce CONCAP, a multilingual image captioning model that integrates retrieved captions with image-specific concepts, enhancing the contextualization of the input image and grounding the captioning process across different languages. Experiments on the XM3600 dataset indicate that CONCAP enables strong performance on low- and mid-resource languages, with highly reduced data requirements. Our findings highlight the effectiveness of concept-aware retrieval augmentation in bridging multilingual performance gaps.

[22] Survey of NLU Benchmarks Diagnosing Linguistic Phenomena: Why not Standardize Diagnostics Benchmarks?

Khloud AL Jallad,Nada Ghneim,Ghaida Rebdawi

Main category: cs.CL

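TL;DR: This survey reviews existing English, Arabic, and multilingual natural language understanding (NLU) benchmarks, focusing on their diagnostic datasets and the linguistic phenomena they cover, and argues for standardizing diagnostic benchmarks.

Details Motivation: Current NLU benchmarks lack standardized naming for diagnostic datasets and a standard classification of linguistic phenomena, which limits in-depth analysis and comparison of model results.

Contribution: Argues for the importance of standardizing diagnostic benchmarks and, through a detailed comparison and analysis of existing benchmarks, lays groundwork for a unified hierarchy of linguistic phenomena.

Method: Surveys and compares the diagnostic datasets in existing NLU benchmarks and the linguistic phenomena they cover, formulates a research question, and discusses the need for standardization.

Result: Existing benchmarks are inconsistent in their coverage and categorization of linguistic phenomena and lack a unified evaluation standard. Standardization would enable deeper model comparison and analysis.

Insight: Standardizing diagnostic benchmarks (along the lines of an ISO standard) could improve NLU model evaluation, especially for cross-benchmark comparison and error analysis.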

Abstract: Natural Language Understanding (NLU) is a basic task in Natural Language Processing (NLP). The evaluation of NLU capabilities has become a trending research topic that has attracted researchers in the last few years, resulting in the development of numerous benchmarks. These benchmarks include various tasks and datasets in order to evaluate the results of pretrained models via public leaderboards. Notably, several benchmarks contain diagnostics datasets designed for investigation and fine-grained error analysis across a wide range of linguistic phenomena. This survey provides a comprehensive review of available English, Arabic, and Multilingual NLU benchmarks, with a particular emphasis on their diagnostics datasets and the linguistic phenomena they cover. We present a detailed comparison and analysis of these benchmarks, highlighting their strengths and limitations in evaluating NLU tasks and providing in-depth error analysis. When highlighting the gaps in the state of the art, we noted that there is no naming convention for macro and micro categories or even a standard set of linguistic phenomena that should be covered. Consequently, we formulated a research question regarding the evaluation metrics of the evaluation diagnostics benchmarks: “Why do not we have an evaluation standard for the NLU evaluation diagnostics benchmarks?”, similar to the ISO standards in industry. We conducted a deep analysis and comparison of the covered linguistic phenomena in order to support experts in building a global hierarchy for linguistic phenomena in the future. We think that having evaluation metrics for diagnostics evaluation could be valuable for gaining more insight when comparing the results of the studied models on different diagnostics benchmarks.

[23] CodeNER: Code Prompting for Named Entity Recognition

Sungwoo Han,Hyeyeon Kim,Jingun Kwon,Hidetaka Kamigaito,Manabu Okumura

Main category: cs.CL

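TL;DR: The paper proposes CodeNER, a code-prompting method for named entity recognition (NER) that embeds code in the prompt to give detailed BIO labeling instructions, improving the ability of large language models (LLMs) to understand and perform the task.

Details Motivation: Existing methods rely solely on input context to generate candidate entity labels, but NER also requires detailed labeling rules. The paper makes the labeling requirements explicit through code prompts.

Contribution: 1. Proposes a code-prompting method that embeds code in the prompt to specify the BIO labeling rules; 2. Shows experimentally that it outperforms conventional text prompting on multilingual benchmarks; 3. Shows that combining it with chain-of-thought prompting improves performance further.

Method: Embeds code in the prompt to give the model structured labeling instructions (such as the BIO schema), exploiting LLMs' ability to comprehend long-range scopes in programming languages. Chain-of-thought prompting is used for further improvement.

Result: Code prompting outperforms conventional text-based prompting on ten benchmarks spanning English, Arabic, Finnish, Danish, and German datasets.

Insight: Code prompts structure the requirements of the NER task and compensate for the limits of plain-text prompts. The gain from adding chain-of-thought prompting shows that the two prompting strategies can be combined.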

Abstract: Recent studies have explored various approaches for treating candidate named entity spans as both source and target sequences in named entity recognition (NER) by leveraging large language models (LLMs). Although previous approaches have successfully generated candidate named entity spans with suitable labels, they rely solely on input context information when using LLMs, particularly, ChatGPT. However, NER inherently requires capturing detailed labeling requirements with input context information. To address this issue, we propose a novel method that leverages code-based prompting to improve the capabilities of LLMs in understanding and performing NER. By embedding code within prompts, we provide detailed BIO schema instructions for labeling, thereby exploiting the ability of LLMs to comprehend long-range scopes in programming languages. Experimental results demonstrate that the proposed code-based prompting method outperforms conventional text-based prompting on ten benchmarks across English, Arabic, Finnish, Danish, and German datasets, indicating the effectiveness of explicitly structuring NER instructions. We also verify that combining the proposed code-based prompting method with the chain-of-thought prompting further improves performance.
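
For readers unfamiliar with the BIO schema that the code prompts encode, the sketch below converts entity spans into BIO tags. It illustrates the labeling scheme only; the paper's actual prompt code is not shown in the abstract, and the function name, span format, and example sentence are assumptions.

```python
def spans_to_bio(tokens, spans):
    """Convert (start, end, label) token spans, end exclusive, into BIO tags.

    B-<label> marks the first token of an entity, I-<label> marks the
    tokens that continue it, and O marks tokens outside any entity.
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags

tokens = ["Barack", "Obama", "visited", "Paris"]
spans = [(0, 2, "PER"), (3, 4, "LOC")]  # hypothetical gold annotations
tags = spans_to_bio(tokens, spans)
print(list(zip(tokens, tags)))
```

Rules like these are unambiguous when written as code, which is the kind of explicit structure a code-based prompt can hand to the model.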

[24] Speaking in Words, Thinking in Logic: A Dual-Process Framework in QA Systems

Tuan Bui,Trong Le,Phat Thai,Sang Nguyen,Minh Hua,Ngan Pham,Thang Bui,Tho Quan

Main category: cs.CL

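TL;DR: The paper proposes Text-JEPA, a framework that combines natural language processing with symbolic logical reasoning to meet the need for transparent reasoning and explainable decisions in closed-domain QA systems, at significantly lower computational cost.

Details Motivation: QA systems in closed domains (such as education, healthcare, and law) must provide transparent reasoning as well as accurate answers. Existing neural-symbolic approaches are effective but inefficient at translating natural language into logical representations.

Contribution: Proposes the lightweight Text-JEPA framework, which draws on dual-system cognitive theory to generate logic representations efficiently (System 1) and performs inference with the Z3 solver (System 2). Also proposes a three-metric evaluation framework.

Method: Following dual-system cognitive theory, Text-JEPA acts as System 1, quickly converting natural language into first-order logic, while the Z3 solver acts as System 2 and performs rigorous inference.

Result: On domain-specific datasets, Text-JEPA performs competitively with larger LLM-based systems at significantly lower computational overhead.

Insight: Structured, interpretable reasoning frameworks show promise in small-scale, compute-constrained settings and can support efficient, transparent QA systems for specialized domains.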

Abstract: Recent advances in large language models (LLMs) have significantly enhanced question-answering (QA) capabilities, particularly in open-domain contexts. However, in closed-domain scenarios such as education, healthcare, and law, users demand not only accurate answers but also transparent reasoning and explainable decision-making processes. While neural-symbolic (NeSy) frameworks have emerged as a promising solution, leveraging LLMs for natural language understanding and symbolic systems for formal reasoning, existing approaches often rely on large-scale models and exhibit inefficiencies in translating natural language into formal logic representations. To address these limitations, we introduce Text-JEPA (Text-based Joint-Embedding Predictive Architecture), a lightweight yet effective framework for converting natural language into first-order logic (NL2FOL). Drawing inspiration from dual-system cognitive theory, Text-JEPA emulates System 1 by efficiently generating logic representations, while the Z3 solver operates as System 2, enabling robust logical inference. To rigorously evaluate the NL2FOL-to-reasoning pipeline, we propose a comprehensive evaluation framework comprising three custom metrics: conversion score, reasoning score, and Spearman rho score, which collectively capture the quality of logical translation and its downstream impact on reasoning accuracy. Empirical results on domain-specific datasets demonstrate that Text-JEPA achieves competitive performance with significantly lower computational overhead compared to larger LLM-based systems. Our findings highlight the potential of structured, interpretable reasoning frameworks for building efficient and explainable QA systems in specialized domains.

[25] SAND-Math: Using LLMs to Generate Novel, Difficult and Useful Mathematics Questions and Answers

Chaitanya Manem,Pratik Prabhanjan Brahma,Prakamya Mishra,Zicheng Liu,Emad Barsoum

Main category: cs.CL

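TL;DR: SAND-Math is a pipeline for generating high-quality, difficult mathematics problems. A Difficulty Hiking step raises problem complexity, and the resulting data significantly improves the performance of mathematical LLMs.

Details Motivation: Progress on high-performing mathematical LLMs is bottlenecked by the scarcity of high-quality, difficult training data, so an effective generation method is needed.

Contribution: Proposes the SAND-Math pipeline, comprising problem generation and a Difficulty Hiking step, which significantly improves LLM mathematical reasoning. Releases the full dataset and toolkit.

Method: Generates high-quality mathematics problems, then applies the Difficulty Hiking step to systematically increase their complexity.

Result: On the AIME25 benchmark, SAND-Math data outperforms the next-best synthetic dataset by 17.85 absolute points. Difficulty Hiking raises average problem difficulty by 0.96 (from 5.02 to 5.98) and lifts performance from 46.38% to 49.23%.

Insight: Generating difficult mathematics problems and being able to raise their complexity further is important for training mathematical LLMs. The open dataset and toolkit give the community practical resources.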

Abstract: The demand for Large Language Models (LLMs) capable of sophisticated mathematical reasoning is growing across industries. However, the development of performant mathematical LLMs is critically bottlenecked by the scarcity of difficult, novel training data. We introduce SAND-Math (Synthetic Augmented Novel and Difficult Mathematics problems and solutions), a pipeline that addresses this by first generating high-quality problems from scratch and then systematically elevating their complexity via a new Difficulty Hiking step. We demonstrate the effectiveness of our approach through two key findings. First, augmenting a strong baseline with SAND-Math data significantly boosts performance, outperforming the next-best synthetic dataset by 17.85 absolute points on the AIME25 benchmark. Second, in a dedicated ablation study, we show our Difficulty Hiking process is highly effective: by increasing average problem difficulty from 5.02 to 5.98, this step lifts AIME25 performance from 46.38% to 49.23%. The full generation pipeline, final dataset, and a fine-tuned model form a practical and scalable toolkit for building more capable and efficient mathematical reasoning LLMs. SAND-Math dataset is released here: https://huggingface.co/datasets/amd/SAND-MATH

[26] Ontology-Enhanced Knowledge Graph Completion using Large Language Models

Wenbin Guo,Xin Wang,Jiaoyan Chen,Zhao Li,Zirui Chen

Main category: cs.CL

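TL;DR: The paper proposes OL-KGC, a knowledge graph completion method that combines ontological knowledge with large language models. It embeds structural information and automatically extracts ontological knowledge to strengthen reasoning, and performs strongly on several benchmark datasets.

Details Motivation: Current LLM-based KGC methods rely on implicit knowledge representation, which tends to propagate erroneous knowledge and lacks explicit logical reasoning. The authors address this by combining neural-perceptual structural information with ontological knowledge.

Contribution: Proposes OL-KGC, which embeds structural information into the textual space and automatically extracts ontological knowledge for logic guidance, significantly improving knowledge graph completion performance.

Method: OL-KGC first uses neural perceptual mechanisms to embed structural information into the textual space, then uses an automated extraction algorithm to retrieve ontological knowledge from the knowledge graph and convert it to text, giving the LLM logic guidance.

Result: On three benchmark datasets, FB15K-237, UMLS, and WN18RR, OL-KGC significantly outperforms existing mainstream methods and achieves state-of-the-art performance.

Insight: Combining structural information with ontological knowledge can improve the logical reasoning of large language models on knowledge graph completion.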

Abstract: Large Language Models (LLMs) have been extensively adopted in Knowledge Graph Completion (KGC), showcasing significant research advancements. However, as black-box models driven by deep neural architectures, current LLM-based KGC methods rely on implicit knowledge representation with parallel propagation of erroneous knowledge, thereby hindering their ability to produce conclusive and decisive reasoning outcomes. We aim to integrate neural-perceptual structural information with ontological knowledge, leveraging the powerful capabilities of LLMs to achieve a deeper understanding of the intrinsic logic of the knowledge. We propose an ontology enhanced KGC method using LLMs – OL-KGC. It first leverages neural perceptual mechanisms to effectively embed structural information into the textual space, and then uses an automated extraction algorithm to retrieve ontological knowledge from the knowledge graphs (KGs) that needs to be completed, which is further transformed into a textual format comprehensible to LLMs for providing logic guidance. We conducted extensive experiments on three widely-used benchmarks – FB15K-237, UMLS and WN18RR. The experimental results demonstrate that OL-KGC significantly outperforms existing mainstream KGC methods across multiple evaluation metrics, achieving state-of-the-art performance.

[27] Geometric-Mean Policy Optimization

Yuzhong Zhao,Yue Liu,Junpeng Liu,Jingye Chen,Xun Wu,Yaru Hao,Tengchao Lv,Shaohan Huang,Lei Cui,Qixiang Ye,Fang Wan,Furu Wei

Main category: cs.CL

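TL;DR: The paper proposes Geometric-Mean Policy Optimization (GMPO), which optimizes the geometric mean of token-level rewards to address GRPO's instability in the presence of outliers.

Details Motivation: GRPO optimizes the arithmetic mean of token-level rewards, which is sensitive to outliers and leads to unstable policy updates. GMPO uses the geometric mean to reduce the influence of outliers.

Contribution: 1. Proposes GMPO, which improves stability by optimizing the geometric mean; 2. Provides theoretical and experimental analysis; 3. Outperforms GRPO on multiple mathematical and multimodal reasoning tasks.

Method: GMPO maximizes the geometric mean of token-level rewards, reducing the effect of outliers on policy updates and keeping importance sampling ratios in a more stable range.

Result: GMPO-7B outperforms GRPO by an average of 4.1% on mathematical benchmarks and by 1.4% on a multimodal reasoning benchmark.

Insight: Geometric-mean optimization is more robust to outliers than the arithmetic mean, offering a new direction for policy optimization.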

Abstract: Recent advancements, such as Group Relative Policy Optimization (GRPO), have enhanced the reasoning capabilities of large language models by optimizing the arithmetic mean of token-level rewards. However, GRPO suffers from unstable policy updates when processing tokens with outlier importance-weighted rewards, which manifests as extreme importance sampling ratios during training, i.e., the ratio between the sampling probabilities assigned to a token by the current and old policies. In this work, we propose Geometric-Mean Policy Optimization (GMPO), a stabilized variant of GRPO. Instead of optimizing the arithmetic mean, GMPO maximizes the geometric mean of token-level rewards, which is inherently less sensitive to outliers and maintains a more stable range of importance sampling ratio. In addition, we provide comprehensive theoretical and experimental analysis to justify the design and stability benefits of GMPO. Beyond improved stability, GMPO-7B outperforms GRPO by an average of 4.1% on multiple mathematical benchmarks and 1.4% on multimodal reasoning benchmark, including AIME24, AMC, MATH500, OlympiadBench, Minerva, and Geometry3K. Code is available at https://github.com/callsys/GMPO.
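
A small numerical example shows why a geometric mean is less sensitive to outliers than an arithmetic mean. This is not the GMPO objective itself; the four values below are hypothetical per-token importance-weighted terms, one of which is an outlier.

```python
import math

def arithmetic_mean(values):
    return sum(values) / len(values)

def geometric_mean(values):
    """exp of the mean log; defined only for strictly positive values."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

# Hypothetical token-level terms: three typical tokens and one outlier.
ratios = [1.0, 1.0, 1.0, 10.0]
am = arithmetic_mean(ratios)   # dominated by the single outlier
gm = geometric_mean(ratios)    # the outlier is damped by the logarithm
print(am, gm)
```

Because the geometric mean averages in log space, a single extreme token shifts it far less than it shifts the arithmetic mean, which matches the stability argument made in the abstract.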

[28] When Scale Meets Diversity: Evaluating Language Models on Fine-Grained Multilingual Claim Verification

Hanna Shcharbakova,Tatiana Anikina,Natalia Skachkova,Josef van Genabith

Main category: cs.CL

TL;DR: This paper evaluates large language models (LLMs) on fine-grained multilingual claim verification, finding that smaller models like XLM-R outperform much larger LLMs, with improved performance and fewer biases.

Details Motivation: The rapid spread of multilingual misinformation necessitates robust fact verification systems. However, the effectiveness of large language models in this nuanced, multilingual setting is understudied.

Contribution: The paper establishes new benchmarks for multilingual fact verification, showing that smaller, specialized models (XLM-R) outperform larger, general-purpose LLMs by a significant margin.

Method: The study evaluates five state-of-the-art language models (XLM-R, mT5, Llama 3.1, Qwen 2.5, Mistral Nemo) on the X-Fact dataset, comparing fine-tuning and prompting approaches across 25 languages and 7 veracity categories.

Result: XLM-R (270M parameters) achieves 57.7% macro-F1, outperforming all tested LLMs (7-12B parameters; best 16.9%) and improving on the previous state of the art (41.9%) by 15.8 percentage points. LLMs show systematic difficulty leveraging evidence and a bias toward frequent categories.

Insight: Smaller, specialized models may be more effective than general-purpose LLMs for fine-grained multilingual fact verification, highlighting practical deployment considerations.

Abstract: The rapid spread of multilingual misinformation requires robust automated fact verification systems capable of handling fine-grained veracity assessments across diverse languages. While large language models have shown remarkable capabilities across many NLP tasks, their effectiveness for multilingual claim verification with nuanced classification schemes remains understudied. We conduct a comprehensive evaluation of five state-of-the-art language models on the X-Fact dataset, which spans 25 languages with seven distinct veracity categories. Our experiments compare small language models (encoder-based XLM-R and mT5) with recent decoder-only LLMs (Llama 3.1, Qwen 2.5, Mistral Nemo) using both prompting and fine-tuning approaches. Surprisingly, we find that XLM-R (270M parameters) substantially outperforms all tested LLMs (7-12B parameters), achieving 57.7% macro-F1 compared to the best LLM performance of 16.9%. This represents a 15.8% improvement over the previous state-of-the-art (41.9%), establishing new performance benchmarks for multilingual fact verification. Our analysis reveals problematic patterns in LLM behavior, including systematic difficulties in leveraging evidence and pronounced biases toward frequent categories in imbalanced data settings. These findings suggest that for fine-grained multilingual fact verification, smaller specialized models may be more effective than general-purpose large models, with important implications for practical deployment of fact-checking systems.
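
Macro-F1, the metric reported above, averages per-class F1 scores with equal weight, so a model biased toward frequent categories scores poorly even when its accuracy looks acceptable. The sketch below is a plain-Python illustration with made-up labels, not data from X-Fact.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)

# Hypothetical imbalanced labels and a model that always predicts the majority class.
y_true = ["false", "false", "false", "false", "half true", "true"]
y_pred = ["false"] * 6
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = macro_f1(y_true, y_pred)
print(round(accuracy, 3), round(score, 3))
```

Here the majority-class predictor is right on four of six examples, yet its macro-F1 is far lower because two of the three classes receive an F1 of zero.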

[29] Text2VLM: Adapting Text-Only Datasets to Evaluate Alignment Training in Visual Language Models

Gabriel Downer,Sean Craven,Damian Ruck,Jake Thomas

Main category: cs.CL

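TL;DR: Text2VLM is a multi-stage pipeline that converts text-only datasets into multimodal form to evaluate how well visual language models (VLMs) resist typographic prompt injection attacks. It reveals the vulnerability of open-source VLMs under visual input and is validated through human evaluation.

Details Motivation: Existing evaluation datasets focus mainly on text-only prompts and neglect visual vulnerabilities. To fill this gap, the study proposes a new evaluation tool for assessing VLM safety.

Contribution: Proposes the Text2VLM pipeline, which converts text-only data into multimodal prompts designed for VLM safety evaluation, and reveals the vulnerability of open-source VLMs under visual input.

Method: Text2VLM uses a multi-stage pipeline that converts harmful text content into typographic images to produce multimodal prompts. It then evaluates VLM resilience, with human evaluation used to validate the pipeline's outputs.

Result: Open-source VLMs become markedly more susceptible to prompt injection when visual inputs are introduced, and show a clear performance gap relative to closed-source frontier models. Human evaluation confirms that Text2VLM's outputs align with expectations.

Insight: Text2VLM provides a scalable tool for assessing multimodal vulnerabilities, supporting the development of more robust safety mechanisms and the safe deployment of VLMs in real-world settings.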

Abstract: The increasing integration of Visual Language Models (VLMs) into AI systems necessitates robust model alignment, especially when handling multimodal content that combines text and images. Existing evaluation datasets heavily lean towards text-only prompts, leaving visual vulnerabilities under-evaluated. To address this gap, we propose Text2VLM, a novel multi-stage pipeline that adapts text-only datasets into multimodal formats, specifically designed to evaluate the resilience of VLMs against typographic prompt injection attacks. The Text2VLM pipeline identifies harmful content in the original text and converts it into a typographic image, creating a multimodal prompt for VLMs. Also, our evaluation of open-source VLMs highlights their increased susceptibility to prompt injection when visual inputs are introduced, revealing critical weaknesses in the current models’ alignment. This is in addition to a significant performance gap compared to closed-source frontier models. We validate Text2VLM through human evaluations, confirming that its extracted salient concepts, text summarization, and output classification align with human expectations. Text2VLM provides a scalable tool for comprehensive safety assessment, contributing to the development of more robust safety mechanisms for VLMs. By enhancing the evaluation of multimodal vulnerabilities, Text2VLM plays a role in advancing the safe deployment of VLMs in diverse, real-world applications.

[30] Investigating Structural Pruning and Recovery Techniques for Compressing Multimodal Large Language Models: An Empirical Study

Yiran Huang,Lukas Thede,Massimiliano Mancini,Wenjia Xu,Zeynep Akata

Main category: cs.CL

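TL;DR: The paper studies structural pruning and recovery techniques for multimodal large language models (MLLMs). It compresses models through layerwise and widthwise pruning combined with supervised finetuning and knowledge distillation, and shows that performance can be recovered with only a small amount of data in low-resource settings.

Details Motivation: The high compute and memory demands of MLLMs limit practical deployment. Existing methods offer limited flexibility and are computationally expensive, so a more efficient compression and recovery technique is needed.

Contribution: 1. Studies two pruning paradigms, layerwise and widthwise pruning; 2. Proposes a recovery method that combines supervised finetuning and knowledge distillation; 3. Finds that widthwise pruning works better in low-resource settings, and that 5% of the data is enough to retain over 95% of the original performance.

Method: Applies layerwise and widthwise pruning, followed by recovery training with supervised finetuning and knowledge distillation. Experiments are run on LLaVA-v1.5-7B and Bunny-v1.0-3B.

Result: Widthwise pruning performs better in low-resource settings. Finetuning only the multimodal projector is sufficient to recover performance at small compression levels (< 20%). With only 5% of the original training data, over 95% of the original performance is retained.

Insight: Widthwise pruning suits resource-constrained settings, and recovery training can be done efficiently with little data. Combining supervised finetuning with hidden-state distillation gives the best recovery.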

Abstract: While Multimodal Large Language Models (MLLMs) demonstrate impressive capabilities, their substantial computational and memory requirements pose significant barriers to practical deployment. Current parameter reduction techniques primarily involve training MLLMs from Small Language Models (SLMs), but these methods offer limited flexibility and remain computationally intensive. To address this gap, we propose to directly compress existing MLLMs through structural pruning combined with efficient recovery training. Specifically, we investigate two structural pruning paradigms–layerwise and widthwise pruning–applied to the language model backbone of MLLMs, alongside supervised finetuning and knowledge distillation. Additionally, we assess the feasibility of conducting recovery training with only a small fraction of the available data. Our results show that widthwise pruning generally maintains better performance in low-resource scenarios with limited computational resources or insufficient finetuning data. As for the recovery training, finetuning only the multimodal projector is sufficient at small compression levels (< 20%). Furthermore, a combination of supervised finetuning and hidden-state distillation yields optimal recovery across various pruning levels. Notably, effective recovery can be achieved with as little as 5% of the original training data, while retaining over 95% of the original performance. Through empirical study on two representative MLLMs, i.e., LLaVA-v1.5-7B and Bunny-v1.0-3B, this study offers actionable insights for practitioners aiming to compress MLLMs effectively without extensive computation resources or sufficient data.

[31] Multilingual Self-Taught Faithfulness Evaluators

Carlo Alfano,Aymen Al Marjani,Zeno Jonke,Amin Mantrach,Saab Mansour,Marcello Federico

Main category: cs.CL

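TL;DR: The paper proposes a multilingual self-taught faithfulness evaluation framework that uses synthetic multilingual summarization data and cross-lingual transfer learning to reduce reliance on expensive human-labeled data, and it outperforms existing baselines in multilingual settings.

Details Motivation: As large language models (LLMs) are used more widely in multilingual settings, existing faithfulness evaluation methods remain focused on English and require large amounts of human-labeled data. An automatic evaluator that works across languages without extensive labeled data is needed.

Contribution: Proposes a framework called Self-Taught Evaluators for Multilingual Faithfulness, which achieves multilingual faithfulness evaluation through synthetic multilingual summarization data and cross-lingual transfer learning, without relying on expensive human-labeled data.

Method: The framework learns from synthetic multilingual summarization data with cross-lingual transfer learning. It compares language-specific and mixed-language fine-tuning, and analyzes the relationship between an LLM's general language capabilities and its performance on language-specific evaluation tasks.

Result: Experiments show that the framework outperforms existing baselines across several languages, including state-of-the-art English evaluators and machine translation-based approaches.

Insight: An LLM's performance in multilingual settings is closely tied to its general language capabilities, and cross-lingual transfer learning can improve multilingual task performance, especially when labeled data is scarce.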

Abstract: The growing use of large language models (LLMs) has increased the need for automatic evaluation systems, particularly to address the challenge of information hallucination. Although existing faithfulness evaluation approaches have shown promise, they are predominantly English-focused and often require expensive human-labeled training data for fine-tuning specialized models. As LLMs see increased adoption in multilingual contexts, there is a need for accurate faithfulness evaluators that can operate across languages without extensive labeled data. This paper presents Self-Taught Evaluators for Multilingual Faithfulness, a framework that learns exclusively from synthetic multilingual summarization data while leveraging cross-lingual transfer learning. Through experiments comparing language-specific and mixed-language fine-tuning approaches, we demonstrate a consistent relationship between an LLM’s general language capabilities and its performance in language-specific evaluation tasks. Our framework shows improvements over existing baselines, including state-of-the-art English evaluators and machine translation-based approaches.

[32] On The Role of Pretrained Language Models in General-Purpose Text Embeddings: A Survey

Meishan Zhang,Xin Zhang,Xinping Zhao,Shouzheng Huang,Baotian Hu,Min Zhang

Main category: cs.CL

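TL;DR: This survey reviews the key roles of pretrained language models (PLMs) in general-purpose text embeddings (GPTE), covering fundamental architecture, optimization strategies, and advanced capabilities such as multilingual support and multimodal integration, and outlines future research directions.

Details Motivation: With the rise of PLMs, GPTE has attracted wide attention for its rich, transferable representations. The survey aims to give researchers and practitioners a comprehensive overview of the roles PLMs play in GPTE and where the field is heading.

Contribution: 1. Systematically summarizes the basic roles of PLMs in GPTE (such as embedding extraction and training strategies); 2. Analyzes the advanced capabilities PLMs enable (such as multilingual support and code understanding); 3. Proposes future research directions (such as safety and bias mitigation).

Method: Through a literature review that moves from fundamental architecture to advanced capabilities, the survey categorizes the roles of PLMs in GPTE and draws on practical applications and challenges to propose future directions.

Result: The survey shows how PLMs have substantially improved the performance and applicability of GPTE, and summarizes the limitations of current research and potential breakthroughs.

Insight: PLMs are not only the core technology behind GPTE but also drive its extension into multilingual and multimodal settings. Future research should address safety, fairness, and the cognitive extension of embeddings.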

Abstract: Text embeddings have attracted growing interest due to their effectiveness across a wide range of natural language processing (NLP) tasks, such as retrieval, classification, clustering, bitext mining, and summarization. With the emergence of pretrained language models (PLMs), general-purpose text embeddings (GPTE) have gained significant traction for their ability to produce rich, transferable representations. The general architecture of GPTE typically leverages PLMs to derive dense text representations, which are then optimized through contrastive learning on large-scale pairwise datasets. In this survey, we provide a comprehensive overview of GPTE in the era of PLMs, focusing on the roles PLMs play in driving its development. We first examine the fundamental architecture and describe the basic roles of PLMs in GPTE, i.e., embedding extraction, expressivity enhancement, training strategies, learning objectives, and data construction. Then, we describe advanced roles enabled by PLMs, such as multilingual support, multimodal integration, code understanding, and scenario-specific adaptation. Finally, we highlight potential future research directions that move beyond traditional improvement goals, including ranking integration, safety considerations, bias mitigation, structural information incorporation, and the cognitive extension of embeddings. This survey aims to serve as a valuable reference for both newcomers and established researchers seeking to understand the current state and future potential of GPTE.

[33] Automating Thematic Review of Prevention of Future Deaths Reports: Replicating the ONS Child Suicide Study using Large Language Models

Sam Osian,Arpan Dutta,Sahil Bhandari,Iain E. Buchan,Dan W. Joyce

Main category: cs.CL

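TL;DR: The paper explores using large language models (LLMs) to automate the analysis of Prevention of Future Deaths (PFD) reports in England and Wales, replicating the child-suicide thematic review that the Office for National Statistics (ONS) carried out manually. The automated approach proved more efficient while remaining reliable.

Details Motivation: Manual analysis of PFD reports is time-consuming and inefficient. Automated coding and classification methods are needed to improve efficiency and reliability and to provide timely insights for public health.

Contribution: Developed an open-source "text-to-table" LLM pipeline (PFD Toolkit) that efficiently and accurately identifies and classifies child-suicide PFD reports, finding almost twice as many cases as the manual review.

Method: Uses LLMs to automatically screen and code 4,249 PFD reports, with accuracy validated by clinical experts. The pipeline identifies suicide cases, classifies them by age, and codes them against 23 sub-themes.

Result: The automated approach identified 72 child-suicide reports (almost twice the ONS count), reached a Cohen's κ of 0.82 against expert annotations, and completed in 8 minutes 16 seconds.

Insight: LLMs can reliably automate thematic analysis and markedly improve efficiency, providing scalable, reproducible, and timely data for public health.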

Abstract: Prevention of Future Deaths (PFD) reports, issued by coroners in England and Wales, flag systemic hazards that may lead to further loss of life. Analysis of these reports has previously been constrained by the manual effort required to identify and code relevant cases. In 2025, the Office for National Statistics (ONS) published a national thematic review of child-suicide PFD reports ($\leq$ 18 years), identifying 37 cases from January 2015 to November 2023 - a process based entirely on manual curation and coding. We evaluated whether a fully automated, open source “text-to-table” language-model pipeline (PFD Toolkit) could reproduce the ONS’s identification and thematic analysis of child-suicide PFD reports, and assessed gains in efficiency and reliability. All 4,249 PFD reports published from July 2013 to November 2023 were processed via PFD Toolkit’s large language model pipelines. Automated screening identified cases where the coroner attributed death to suicide in individuals aged 18 or younger, and eligible reports were coded for recipient category and 23 concern sub-themes, replicating the ONS coding frame. PFD Toolkit identified 72 child-suicide PFD reports - almost twice the ONS count. Three blinded clinicians adjudicated a stratified sample of 144 reports to validate the child-suicide screening. Against the post-consensus clinical annotations, the LLM-based workflow showed substantial to almost-perfect agreement (Cohen’s $\kappa$ = 0.82, 95% CI: 0.66-0.98, raw agreement = 91%). The end-to-end script runtime was 8m 16s, transforming a process that previously took months into one that can be completed in minutes. This demonstrates that automated LLM analysis can reliably and efficiently replicate manual thematic reviews of coronial data, enabling scalable, reproducible, and timely insights for public health and safety. The PFD Toolkit is openly available for future research.
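The agreement statistic reported above, Cohen's κ, corrects raw agreement for the agreement expected by chance: κ = (p_o − p_e) / (1 − p_e). The sketch below computes it for two label sequences; the function name `cohens_kappa` and the toy labels are invented for illustration and are not the study's data or code.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items."""
    n = len(labels_a)
    if n == 0 or n != len(labels_b):
        raise ValueError("both raters must label the same non-empty set of items")
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the raters' marginal label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)

# Toy screening decisions (1 = child-suicide report, 0 = not); illustrative only.
model = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
expert = [1, 1, 0, 0, 0, 0, 0, 0, 1, 1]
kappa = cohens_kappa(model, expert)
```

In this toy example the raw agreement is 80%, yet κ is noticeably lower because a good share of that agreement would be expected by chance alone.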

[34] MediQAl: A French Medical Question Answering Dataset for Knowledge and Reasoning Evaluation

Adrien Bazoge

Main category: cs.CL

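TL;DR: MediQAl is a French medical question answering dataset of 32,603 questions spanning three task types, designed to evaluate language models on factual medical recall and reasoning. Extensive testing with 14 large language models reveals a performance gap between factual recall and reasoning tasks, and the dataset fills a gap in multilingual medical resources.

Details Motivation: Medical question answering systems need strong factual recall and reasoning, but French-language datasets are scarce. The author aims to provide a comprehensive French medical QA dataset to advance multilingual medical NLP.

Contribution: 1. Introduces MediQAl, a large-scale French medical QA dataset with multiple task types. 2. Labels each question as Understanding or Reasoning, supporting fine-grained analysis of models' cognitive capabilities. 3. Provides benchmark results for 14 large language models, highlighting the performance gap between factual recall and reasoning tasks.

Method: 1. Collects 32,603 questions from French medical examinations, covering 41 medical subjects. 2. Organizes the questions into three task types (multiple choice with a unique answer, multiple choice with multiple answers, and open-ended short answer) and two cognitive labels (Understanding or Reasoning). 3. Validates the dataset with 14 language models, evaluating factual recall and reasoning separately.

Result: Language models perform better on factual recall but show a significant performance gap on reasoning tasks, which points to a clear direction for future model improvement.

Insight: 1. Large language models still need to improve on reasoning tasks. 2. Building multilingual medical datasets is key to advancing medical NLP. 3. Separating task types and cognitive labels gives a more complete measure of model performance.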

Abstract: This work introduces MediQAl, a French medical question answering dataset designed to evaluate the capabilities of language models in factual medical recall and reasoning over real-world clinical scenarios. MediQAl contains 32,603 questions sourced from French medical examinations across 41 medical subjects. The dataset includes three tasks: (i) Multiple-Choice Question with Unique answer, (ii) Multiple-Choice Question with Multiple answer, and (iii) Open-Ended Question with Short-Answer. Each question is labeled as Understanding or Reasoning, enabling a detailed analysis of models’ cognitive capabilities. We validate the MediQAl dataset through extensive evaluation with 14 large language models, including recent reasoning-augmented models, and observe a significant performance gap between factual recall and reasoning tasks. Our evaluation provides a comprehensive benchmark for assessing language models’ performance on French medical question answering, addressing a crucial gap in multilingual resources for the medical domain.

cs.CV [Back]

[35] Tuning adaptive gamma correction (TAGC) for enhancing images in low light

Ghufran Abualhail Alhamzawi,Ali Saeed Alfoudi,Ali Hakem Alsaeedi,Suha Mohammed Hadi,Amjed Abbas Ahmed,Md. Riad Hassan,Nurhizam Safie Mohd Satar,Waeel Yahya Yasseen

Main category: cs.CV

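TL;DR: The paper proposes TAGC, an adaptive gamma correction model for enhancing images captured in low light, which automatically computes an adaptive gamma coefficient to improve contrast and detail.

Details Motivation: Low-light images typically suffer from low contrast, heavy noise, and blurred details, which degrades image quality and downstream use. An automated, efficient way to improve image quality is needed.

Contribution: Proposes the TAGC model, which computes the adaptive gamma coefficient automatically, without human intervention, and enhances the contrast and detail of low-light images while preserving a natural color distribution.

Method: Analyzes the color luminance of the low-light image and computes the average color to determine the adaptive gamma coefficient dynamically, so the correction adapts to images at different illumination levels.

Result: Qualitative and quantitative evaluation shows that TAGC improves the quality of low-light images while preserving detail and natural contrast, yielding more natural-looking results.

Insight: Because it is automatic and adaptive, TAGC has potential advantages in applications such as night surveillance, medical image enhancement, and low-light photography.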

Abstract: Enhancing images in low-light conditions is an important challenge in computer vision. Insufficient illumination negatively affects the quality of images, resulting in low contrast, heavy noise, and blurred details. This paper presents a model for enhancing low-light images called tuning adaptive gamma correction (TAGC). The model is based on analyzing the color luminance of the low-light image and calculating the average color to determine the adaptive gamma coefficient. The gamma value is calculated automatically and adaptively for the illumination level of each image, without human intervention or manual adjustment. Based on qualitative and quantitative evaluation, the TAGC model effectively improves low-light images while maintaining details, natural contrast, and correct color distribution. It also provides natural visual quality. It can be considered a more efficient solution for processing low-light images in multiple applications such as night surveillance, improving the quality of medical images, and photography in low-light environments.
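To make the idea of deriving gamma from average luminance concrete, here is a minimal sketch. The abstract does not give TAGC's actual formula, so the rule used here (choose gamma so that the mean luminance maps to a target of 0.5), the Rec. 709 luma weights, and the names `adaptive_gamma` and `apply_gamma` are assumptions, not the authors' method. Pixel values are assumed to lie in [0, 1].

```python
import math

def adaptive_gamma(pixels, target_mean=0.5):
    """Pick gamma so the mean luminance maps to target_mean under v ** gamma.

    Hypothetical rule for illustration; TAGC's own formula is not given
    in the abstract.
    """
    # Rec. 709 luma weights for linear RGB values in [0, 1].
    luminance = [0.2126 * r + 0.7152 * g + 0.0722 * b for r, g, b in pixels]
    mean = sum(luminance) / len(luminance)
    # Clamp away from 0 and 1 so both logarithms are finite and non-zero.
    mean = min(max(mean, 1e-6), 1 - 1e-6)
    return math.log(target_mean) / math.log(mean)

def apply_gamma(pixels, gamma):
    """Apply the power-law correction to every channel of every pixel."""
    return [tuple(channel ** gamma for channel in pixel) for pixel in pixels]

# A tiny dark "image" given as a flat list of RGB triples.
dark = [(0.05, 0.04, 0.06), (0.10, 0.12, 0.08), (0.20, 0.18, 0.22)]
gamma = adaptive_gamma(dark)
bright = apply_gamma(dark, gamma)
```

For a dark image the mean luminance is below the target, so gamma comes out below 1 and every channel value is raised; a brighter input yields a gamma closer to 1 and a milder correction.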

[36] Is Exchangeability better than I.I.D to handle Data Distribution Shifts while Pooling Data for Data-scarce Medical image segmentation?

Ayush Roy,Samin Enam,Jun Xia,Vishnu Suresh Lokhande,Won Hwa Kim

Main category: cs.CV

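TL;DR: The paper studies the distribution shifts that arise in data-scarce medical image segmentation when data are pooled (combining datasets from multiple sources) or added (adding data from a new dataset). The authors propose assuming exchangeability rather than the traditional independent and identically distributed (i.i.d.) assumption to handle these shifts more effectively, and use a causal framework to improve the feature representations of deep networks and boost segmentation performance.

Details Motivation: Medical image data are scarce, and the traditional i.i.d. assumption may not hold when pooling multi-source data, which leads to distribution shift and degraded model performance. A more effective framework is needed to handle this problem.

Contribution: 1. Proposes replacing the i.i.d. assumption with an exchangeability assumption to handle distribution shift when pooling multi-source data. 2. Building on a causal framework, proposes a method that improves feature representations at every layer of the network by controlling foreground-background feature discrepancies. 3. Validates the method on five datasets (including a newly curated ultrasound dataset), achieving state-of-the-art (SOTA) segmentation performance.

Method: Draws on a causal framework to control foreground-background feature discrepancies across all layers of a deep network, improving feature representations. Adopts the exchangeability assumption as the basis for data pooling to mitigate the effect of distribution shift.

Result: Achieves SOTA segmentation performance on five histopathology and ultrasound datasets, with qualitative results showing more refined and accurate segmentation maps.

Insight: For multi-source data pooling, the exchangeability assumption is more practical than the i.i.d. assumption, and combining it with a causal framework can markedly improve medical image segmentation, especially when data are scarce.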

Abstract: Data scarcity is a major challenge in medical imaging, particularly for deep learning models. While data pooling (combining datasets from multiple sources) and data addition (adding more data from a new dataset) have been shown to enhance model performance, they are not without complications. Specifically, increasing the size of the training dataset through pooling or addition can induce distributional shifts, negatively affecting downstream model performance, a phenomenon known as the “Data Addition Dilemma”. While the traditional i.i.d. assumption may not hold in multi-source contexts, assuming exchangeability across datasets provides a more practical framework for data pooling. In this work, we investigate medical image segmentation under these conditions, drawing insights from causal frameworks to propose a method for controlling foreground-background feature discrepancies across all layers of deep networks. This approach improves feature representations, which are crucial in data-addition scenarios. Our method achieves state-of-the-art segmentation performance on histopathology and ultrasound images across five datasets, including a novel ultrasound dataset that we have curated and contributed. Qualitative results demonstrate more refined and accurate segmentation maps compared to prominent baselines across three model architectures. The code will be available on GitHub.

[37] Object-centric Video Question Answering with Visual Grounding and Referring

Haochen Wang,Qirui Chen,Cilin Yan,Jiayin Cai,Xiaolong Jiang,Yao Hu,Weidi Xie,Stratis Gavves

Main category: cs.CV

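TL;DR: The paper proposes a multimodal, object-centric VideoLLM that uses visual prompts to perform video question answering and object segmentation, outperforming existing baselines.

Details Motivation: Existing VideoLLMs focus mainly on high-level video understanding and produce text-only output, which limits the flexibility of object-centric, multi-round interaction. The authors therefore propose a model that accepts visual prompts as input and produces grounded visual output.

Contribution: 1) Introduces a VideoLLM that supports visual prompts for both input (referring) and output (grounding); 2) Proposes the STOM module, which propagates a visual prompt given on a single frame to the remaining frames; 3) Builds the VideoInfer dataset to support object-centric question answering.

Method: 1) The STOM module handles spatial-temporal propagation of visual prompts; 2) Visual and textual inputs and outputs are combined; 3) The model is trained and evaluated on the VideoInfer dataset.

Result: Across 12 benchmarks covering 6 tasks, the model outperforms baselines on both video question answering and object segmentation, demonstrating its multimodal object understanding.

Insight: Introducing visual prompts markedly improves object-centric interaction, and the effectiveness of STOM underlines the importance of propagating spatial-temporal information.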

Abstract: Video Large Language Models (VideoLLMs) have recently demonstrated remarkable progress in general video understanding. However, existing models primarily focus on high-level comprehension and are limited to text-only responses, restricting the flexibility for object-centric, multi-round interactions. In this paper, we make three contributions: (i) we address these limitations by introducing a VideoLLM model, capable of performing both object referring for input and grounding for output in video reasoning tasks, i.e., allowing users to interact with videos using both textual and visual prompts; (ii) we propose STOM (Spatial-Temporal Overlay Module), a novel approach that propagates arbitrary visual prompts input at any single timestamp to the remaining frames within a video; (iii) we present VideoInfer, a manually curated object-centric video instruction dataset featuring question-answering pairs that require reasoning. We conduct comprehensive experiments on VideoInfer and other existing benchmarks across video question answering and referring object segmentation. The results on 12 benchmarks of 6 tasks show that our proposed model consistently outperforms baselines in both video question answering and segmentation, underscoring its robustness in multimodal, object-centric video and image understanding. Project page: https://qirui-chen.github.io/RGA3-release/.

[38] Exemplar Med-DETR: Toward Generalized and Robust Lesion Detection in Mammogram Images and beyond

Sheethal Bhat,Bogdan Georgescu,Adarsh Bhandary Panambur,Mathias Zinnen,Tri-Thien Nguyen,Awais Mansoor,Karim Khalifa Elbarbary,Siming Bayer,Florin-Cristian Ghesu,Sasa Grbic,Andreas Maier

Main category: cs.CV

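TL;DR: The paper proposes Exemplar Med-DETR, a multi-modal contrastive detector that uses cross-attention with class-specific exemplar features to improve lesion detection in medical images, achieving state-of-the-art (SOTA) performance across several imaging modalities and datasets.

Details Motivation: Abnormality detection in medical images is challenging because feature representations vary widely and the relationship between anatomical structures and abnormalities is intricate. Existing methods struggle to learn effective class-specific features, which limits their use across tasks and imaging modalities.

Contribution: Proposes Exemplar Med-DETR, which enables feature-based detection through class-specific exemplar features and a cross-attention mechanism, improving the generalization and robustness of detection.

Method: A multi-modal contrastive detection framework that combines cross-attention with an iterative training strategy and uses class-specific exemplar features to strengthen detection.

Result: Performance improves markedly on mammography, chest X-ray, and angiography datasets, including an absolute gain of 16 percentage points on the Vietnamese dense breast mammogram dataset.

Insight: Introducing class-specific features and contrastive learning helps handle complex scenarios in medical images and offers a new route toward detection systems that generalize well.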

Abstract: Detecting abnormalities in medical images poses unique challenges due to differences in feature representations and the intricate relationship between anatomical structures and abnormalities. This is especially evident in mammography, where dense breast tissue can obscure lesions, complicating radiological interpretation. Despite leveraging anatomical and semantic context, existing detection methods struggle to learn effective class-specific features, limiting their applicability across different tasks and imaging modalities. In this work, we introduce Exemplar Med-DETR, a novel multi-modal contrastive detector that enables feature-based detection. It employs cross-attention with inherently derived, intuitive class-specific exemplar features and is trained with an iterative strategy. We achieve state-of-the-art performance across three distinct imaging modalities from four public datasets. On Vietnamese dense breast mammograms, we attain an mAP of 0.7 for mass detection and 0.55 for calcifications, yielding an absolute improvement of 16 percentage points. Additionally, a radiologist-supported evaluation of 100 mammograms from an out-of-distribution Chinese cohort demonstrates a twofold gain in lesion detection performance. For chest X-rays and angiography, we achieve an mAP of 0.25 for mass and 0.37 for stenosis detection, improving results by 4 and 7 percentage points, respectively. These results highlight the potential of our approach to advance robust and generalizable detection systems for medical imaging.

[39] Pre- and Post-Treatment Glioma Segmentation with the Medical Imaging Segmentation Toolkit

Adrian Celaya,Tucker Netherton,Dawid Schellingerhout,Caroline Chung,Beatrice Riviere,David Fuentes

Main category: cs.CV

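TL;DR: The paper presents the latest state of the Medical Imaging Segmentation Toolkit (MIST), focusing on its flexible, modular postprocessing framework designed for the BraTS 2025 challenge, which supports a range of postprocessing operations and improves segmentation quality.

Details Motivation: Medical image segmentation lacks standardized, customizable tooling, which makes it hard to compare and refine methods. MIST aims to fill this gap by providing modular postprocessing that improves segmentation quality.

Contribution: MIST's main contribution is its flexible postprocessing module, which supports many operations (such as small-object removal and morphological operations) that can be composed into user-defined strategies, giving finer control over the final segmentation and improving its quality.

Method: MIST extends its postprocessing module to support operations on segmentation outputs such as connected-component extraction and morphological operations, and lets users define composite strategies. The paper evaluates three representative strategies using the BraTS ranking protocol.

Result: Experiments show that MIST enables rapid production of high-quality segmentations, particularly for the BraTS 2025 challenge, and that its open, extensible design supports reproducible research in medical image segmentation.

Insight: Modular, customizable postprocessing tools can noticeably improve segmentation results and support rapid experimentation and refinement, an idea that carries over to other medical image analysis tasks.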

Abstract: Medical image segmentation continues to advance rapidly, yet rigorous comparison between methods remains challenging due to a lack of standardized and customizable tooling. In this work, we present the current state of the Medical Imaging Segmentation Toolkit (MIST), with a particular focus on its flexible and modular postprocessing framework designed for the BraTS 2025 pre- and post-treatment glioma segmentation challenge. Since its debut in the 2024 BraTS adult glioma post-treatment segmentation challenge, MIST’s postprocessing module has been significantly extended to support a wide range of transforms, including removal or replacement of small objects, extraction of the largest connected components, and morphological operations such as hole filling and closing. These transforms can be composed into user-defined strategies, enabling fine-grained control over the final segmentation output. We evaluate three such strategies - ranging from simple small-object removal to more complex, class-specific pipelines - and rank their performance using the BraTS ranking protocol. Our results highlight how MIST facilitates rapid experimentation and targeted refinement, ultimately producing high-quality segmentations for the BraTS 2025 challenge. MIST remains open source and extensible, supporting reproducible and scalable research in medical image segmentation.
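The idea of composing postprocessing transforms into a user-defined strategy can be sketched in a few lines. This is not MIST's API: the function names (`connected_components`, `remove_small_objects`, `keep_largest_component`, `compose`) are hypothetical, and the example uses a 2D binary mask with 4-connectivity for brevity, whereas real BraTS volumes are 3D and multi-class.

```python
from collections import deque

def connected_components(mask):
    """Return the foreground components of a 2D binary mask (4-connectivity)."""
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    components = []
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] and not seen[r][c]:
                seen[r][c] = True
                queue, component = deque([(r, c)]), []
                while queue:  # breadth-first flood fill
                    y, x = queue.popleft()
                    component.append((y, x))
                    for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                        if 0 <= ny < rows and 0 <= nx < cols and mask[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((ny, nx))
                components.append(component)
    return components

def _paint(mask, components):
    """Build a new mask containing only the given components."""
    out = [[0] * len(mask[0]) for _ in mask]
    for component in components:
        for y, x in component:
            out[y][x] = 1
    return out

def remove_small_objects(mask, min_size):
    """Drop components with fewer than min_size pixels."""
    return _paint(mask, [c for c in connected_components(mask) if len(c) >= min_size])

def keep_largest_component(mask):
    """Keep only the largest component (empty mask stays empty)."""
    components = connected_components(mask)
    return _paint(mask, [max(components, key=len)] if components else [])

def compose(*transforms):
    """Chain transforms into a single strategy applied left to right."""
    def strategy(mask):
        for transform in transforms:
            mask = transform(mask)
        return mask
    return strategy

mask = [
    [1, 1, 0, 0, 1],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 0, 0],
]
strategy = compose(lambda m: remove_small_objects(m, min_size=2), keep_largest_component)
cleaned = strategy(mask)
```

Treating each transform as a plain function from mask to mask is one simple way to get composability: strategies of any length are just ordered lists of such functions.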

[40] SynPAIN: A Synthetic Dataset of Pain and Non-Pain Facial Expressions

Babak Taati,Muhammad Muzammil,Yasamin Zarghami,Abhishek Moturu,Airhossein Kazerouni,Hailey Reimer,Alex Mihailidis,Thomas Hadjistavropoulos

Main category: cs.CV

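TL;DR: SynPAIN is a large-scale synthetic facial expression dataset designed for pain detection in older adults, addressing the limited demographic diversity and privacy constraints of existing datasets. Generative AI tools were used to create demographically balanced synthetic identities with clinically meaningful pain expressions; the authors validate the dataset and use it to reveal algorithmic bias in existing pain detection models.

Details Motivation: Existing pain detection datasets lack ethnic/racial diversity and underrepresent older adults, which limits the development and deployment of automated pain assessment systems.

Contribution: Presents SynPAIN, the first publicly available, demographically diverse synthetic pain expression dataset, and demonstrates its value for detecting and mitigating algorithmic bias.

Method: Creates synthetic expressions with commercial generative AI tools, validates the clinical meaningfulness of the pain expressions through facial action unit analysis, and uses the data to evaluate and improve pain detection models.

Result: SynPAIN reveals algorithmic bias in existing models, and augmenting training with the synthetic data improves average precision for pain detection by 7.0%.

Insight: Synthetic data can compensate for the limited diversity and privacy constraints of real data while helping to uncover and address algorithmic bias.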

Abstract: Accurate pain assessment in patients with limited ability to communicate, such as older adults with dementia, represents a critical healthcare challenge. Robust automated systems of pain detection may facilitate such assessments. Existing pain detection datasets, however, suffer from limited ethnic/racial diversity, privacy constraints, and underrepresentation of older adults who are the primary target population for clinical deployment. We present SynPAIN, a large-scale synthetic dataset containing 10,710 facial expression images (5,355 neutral/expressive pairs) across five ethnicities/races, two age groups (young: 20-35, old: 75+), and two genders. Using commercial generative AI tools, we created demographically balanced synthetic identities with clinically meaningful pain expressions. Our validation demonstrates that synthetic pain expressions exhibit expected pain patterns, scoring significantly higher than neutral and non-pain expressions using clinically validated pain assessment tools based on facial action unit analysis. We experimentally demonstrate SynPAIN’s utility in identifying algorithmic bias in existing pain detection models. Through comprehensive bias evaluation, we reveal substantial performance disparities across demographic characteristics. These performance disparities were previously undetectable with smaller, less diverse datasets. Furthermore, we demonstrate that age-matched synthetic data augmentation improves pain detection performance on real clinical data, achieving a 7.0% improvement in average precision. SynPAIN addresses critical gaps in pain assessment research by providing the first publicly available, demographically diverse synthetic dataset specifically designed for older adult pain detection, while establishing a framework for measuring and mitigating algorithmic bias. The dataset is available at https://doi.org/10.5683/SP3/WCXMAP

[41] Efficient Learning for Product Attributes with Compact Multimodal Models

Mandar Kulkarni

Main category: cs.CV

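TL;DR: The paper proposes a label-efficient semi-supervised fine-tuning strategy that uses Direct Preference Optimization (DPO) on unlabeled data to improve product attribute prediction with compact vision language models (VLMs).

Details Motivation: Image-based product attribute prediction in e-commerce needs large amounts of annotated data, and supervised fine-tuning of vision language models at scale is costly. The author therefore explores semi-supervised fine-tuning that exploits unlabeled data.

Contribution: The main contribution is a DPO-based, label-efficient semi-supervised fine-tuning method that generates preference data through self-consistency and uses unlabeled data to improve model performance significantly.

Method: 1) Train low-rank adapter modules with PEFT; 2) For each unlabeled sample, generate multiple reasoning-and-answer chains and split them into preferred and dispreferred sets based on self-consistency; 3) Fine-tune the model with the DPO loss and iterate with the updated model.

Result: Experiments show that DPO-based fine-tuning significantly outperforms the supervised model across twelve e-commerce verticals, and that performance improves as the amount of unlabeled data grows.

Insight: Large pools of unlabeled data combined with self-consistency-based preference optimization can improve model performance significantly while reducing reliance on labeled data.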

Abstract: Image-based product attribute prediction in e-commerce is a crucial task with numerous applications. The supervised fine-tuning of Vision Language Models (VLMs) faces significant scale challenges due to the cost of manual or API based annotation. In this paper, we investigate label-efficient semi-supervised fine-tuning strategies for compact VLMs (2B-3B parameters) that leverage unlabeled product listings through Direct Preference Optimization (DPO). Beginning with a small, API-based, annotated, and labeled set, we first employ PEFT to train low-rank adapter modules. To update the adapter weights with unlabeled data, we generate multiple reasoning-and-answer chains per unlabeled sample and segregate these chains into preferred and dispreferred based on self-consistency. We then fine-tune the model with DPO loss and use the updated model for the next iteration. By using PEFT fine-tuning with DPO, our method achieves efficient convergence with minimal compute overhead. On a dataset spanning twelve e-commerce verticals, DPO-based fine-tuning, which utilizes only unlabeled data, demonstrates a significant improvement over the supervised model. Moreover, experiments demonstrate that accuracy with DPO training improves with more unlabeled data, indicating that a large pool of unlabeled samples can be effectively leveraged to improve performance.
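The self-consistency split and the DPO objective described above can be sketched as follows. The abstract does not specify how chains are scored, so treating the majority answer as "preferred" is an assumption; the names `build_preference_pairs` and `dpo_loss`, the value of `beta`, and the toy log-probabilities are likewise hypothetical. The loss itself is the standard DPO form, the negative log-sigmoid of the scaled difference between the policy's and the reference model's log-probability margins.

```python
import math
from collections import Counter

def build_preference_pairs(chains):
    """Split sampled (reasoning, answer) chains by majority vote.

    Chains whose answer matches the most common answer are treated as
    preferred; all others as dispreferred (an assumed reading of
    "self-consistency"). Returns every (preferred, dispreferred) pair.
    """
    votes = Counter(answer for _, answer in chains)
    majority_answer, _ = votes.most_common(1)[0]
    preferred = [c for c in chains if c[1] == majority_answer]
    dispreferred = [c for c in chains if c[1] != majority_answer]
    return [(p, d) for p in preferred for d in dispreferred]

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one pair, given sequence log-probabilities."""
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    return math.log1p(math.exp(-margin))  # equals -log(sigmoid(margin))

# Three sampled chains for one unlabeled listing (toy attribute: sleeve length).
chains = [
    ("the sleeve reaches the wrist", "long"),
    ("fabric covers the whole arm", "long"),
    ("the sleeve ends above the elbow", "short"),
]
pairs = build_preference_pairs(chains)
untrained = dpo_loss(-1.0, -1.0, -1.0, -1.0)  # policy identical to reference
improved = dpo_loss(-1.0, -3.0, -2.0, -2.0)   # policy now favours the chosen chain
```

When the policy matches the reference the loss equals log 2, and it falls as the policy assigns relatively more probability to the preferred chain than the reference does.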

[42] DeepJIVE: Learning Joint and Individual Variation Explained from Multimodal Data Using Deep Learning

Matthew Drexler,Benjamin Risk,James J Lah,Suprateek Kundu,Deqiang Qiu

Main category: cs.CV

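TL;DR: DeepJIVE is a deep-learning method for learning joint and individual variation from multimodal data, addressing limitations of conventional methods such as the inability to handle high-dimensional data and nonlinear structure.

Details Motivation: Conventional multimodal data integration methods are limited when dealing with high-dimensional data and nonlinear structure. DeepJIVE aims to overcome these limitations with deep learning.

Contribution: Proposes DeepJIVE, which models joint and individual variation in multimodal data, and supports it with a mathematical derivation and experimental validation.

Method: DeepJIVE satisfies the identity and orthogonality constraints through three viable loss functions and handles 1D, 2D, and 3D data.

Result: DeepJIVE is validated on synthetic and real-world datasets, and on the ADNI dataset it identifies biologically plausible covariation patterns.

Insight: DeepJIVE offers a useful tool for multimodal data analysis, particularly for high-dimensional data with nonlinear structure.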

Abstract: Conventional multimodal data integration methods provide a comprehensive assessment of the shared or unique structure within each individual data type but suffer from several limitations such as the inability to handle high-dimensional data and identify nonlinear structures. In this paper, we introduce DeepJIVE, a deep-learning approach to performing Joint and Individual Variance Explained (JIVE). We perform mathematical derivation and experimental validations using both synthetic and real-world 1D, 2D, and 3D datasets. Different strategies of achieving the identity and orthogonality constraints for DeepJIVE were explored, resulting in three viable loss functions. We found that DeepJIVE can successfully uncover joint and individual variations of multimodal datasets. Our application of DeepJIVE to the Alzheimer’s Disease Neuroimaging Initiative (ADNI) also identified biologically plausible covariation patterns between the amyloid positron emission tomography (PET) and magnetic resonance (MR) images. In conclusion, the proposed DeepJIVE can be a useful tool for multimodal data analysis.

[43] Co-Win: Joint Object Detection and Instance Segmentation in LiDAR Point Clouds via Collaborative Window Processing

Haichuan Li,Tomi Westerlund

Main category: cs.CV

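TL;DR: The paper introduces Co-Win, a LiDAR point cloud perception framework for joint object detection and instance segmentation that uses collaborative window processing for efficient scene understanding.

Details Motivation: Accurate perception and understanding of complex urban scenes is a key challenge for autonomous driving. Prior approaches tend to treat perception as a simple regression task and lack fine-grained scene decomposition.

Contribution: Proposes Co-Win, a bird's eye view (BEV) perception framework that combines point cloud encoding with window-based feature extraction to address the multi-modality inherent in scene understanding, and uses a variational approach with instance segmentation to produce diverse predictions.

Method: A hierarchical architecture comprising a specialized encoder, a window-based backbone, and a query-based decoder, combined with mask-based instance segmentation for fine-grained scene decomposition.

Result: Co-Win produces predicted masks that are data-consistent and contextually relevant, supporting better downstream decision-making in autonomous driving systems.

Insight: Efficient parallel window processing together with a variational approach suits joint object detection and instance segmentation on LiDAR data and gives autonomous driving a finer-grained understanding of the scene.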

Abstract: Accurate perception and scene understanding in complex urban environments is a critical challenge for ensuring safe and efficient autonomous navigation. In this paper, we present Co-Win, a novel bird’s eye view (BEV) perception framework that integrates point cloud encoding with efficient parallel window-based feature extraction to address the multi-modality inherent in environmental understanding. Our method employs a hierarchical architecture comprising a specialized encoder, a window-based backbone, and a query-based decoder head to effectively capture diverse spatial features and object relationships. Unlike prior approaches that treat perception as a simple regression task, our framework incorporates a variational approach with mask-based instance segmentation, enabling fine-grained scene decomposition and understanding. The Co-Win architecture processes point cloud data through progressive feature extraction stages, ensuring that predicted masks are both data-consistent and contextually relevant. Furthermore, our method produces interpretable and diverse instance predictions, enabling enhanced downstream decision-making and planning in autonomous driving systems.

[44] Quaternion-Based Robust PCA for Efficient Moving Target Detection and Background Recovery in Color Videos

Liyang Wang,Shiqian Wu,Shun Fang,Qile Zhu,Jiaxin Wu,Sos Again

Main category: cs.CV

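TL;DR: The paper proposes uQRPCA+, an efficient quaternion-based robust PCA (QRPCA) framework built on a quaternion Riemannian manifold, for moving target detection and background recovery in color videos. It reduces the computational complexity of QSVD to o(1) and introduces the CR1B method, achieving state-of-the-art (SOTA) performance.

Details Motivation: Moving target detection is challenging in diverse color videos captured by static cameras. Existing methods are computationally expensive and fail to recover an ideal low-rank background across color channels.

Contribution: 1. Proposes the uQRPCA framework, which balances target segmentation and background recovery; 2. Introduces the CR1B method to refine the low-rank background across color channels; 3. Reduces the computational complexity of QSVD to o(1).

Method: Uses a quaternion Riemannian manifold to reduce the computational complexity of QSVD, and combines it with the CR1B method to improve background recovery.

Result: uQRPCA+ achieves SOTA performance on moving target detection and background recovery.

Insight: With manifold optimization and a design that enforces rank consistency across color channels, quaternion methods prove both efficient and robust for color video processing.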

Abstract: Moving target detection is a challenging computer vision task aimed at generating accurate segmentation maps in diverse in-the-wild color videos captured by static cameras. If backgrounds and targets can be simultaneously extracted and recombined, such synthetic data can significantly enrich annotated in-the-wild datasets and enhance the generalization ability of deep models. Quaternion-based RPCA (QRPCA) is a promising unsupervised paradigm for color image processing. However, in color video processing, Quaternion Singular Value Decomposition (QSVD) incurs high computational costs, and rank-1 quaternion matrix fails to yield rank-1 color channels. In this paper, we reduce the computational complexity of QSVD to o(1) by utilizing a quaternion Riemannian manifold. Furthermore, we propose the universal QRPCA (uQRPCA) framework, which achieves a balance in simultaneously segmenting targets and recovering backgrounds from color videos. Moreover, we expand to uQRPCA+ by introducing the Color Rank-1 Batch (CR1B) method to further process and obtain the ideal low-rank background across color channels. Experiments demonstrate our uQRPCA+ achieves state-of-the-art (SOTA) performance on moving target detection and background recovery tasks compared to existing open-source methods. Our implementation is publicly available on GitHub at https://github.com/Ruchtech/uQRPCA
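For orientation, the classical real-valued robust PCA problem that quaternion variants build on can be written as below. This is the standard principal component pursuit formulation, not the paper's quaternion objective. The symbols are notation chosen here: $X$ holds the video frames stacked as columns, $L$ is the low-rank background, $S$ is the sparse foreground containing the moving targets, and $\lambda$ is a trade-off weight.

```latex
\min_{L,\,S} \; \|L\|_{*} + \lambda \, \|S\|_{1}
\quad \text{subject to} \quad X = L + S
```

Here $\|L\|_{*}$ is the nuclear norm (sum of singular values), which encourages a low-rank background, and $\|S\|_{1}$ is the entrywise $\ell_1$ norm, which encourages sparsity in the foreground. Evaluating the nuclear norm is where singular value decompositions enter, which is why the cost of QSVD matters in the quaternion setting.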

[45] Leveraging Sparse LiDAR for RAFT-Stereo: A Depth Pre-Fill Perspective

Jinsu Yoo,Sooyoung Jeon,Zanming Huang,Tai-Yu Pan,Wei-Lun Chao

Main category: cs.CV

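TL;DR: The paper improves RAFT-Stereo by pre-filling sparse LiDAR depth, significantly raising stereo matching accuracy under sparse LiDAR conditions.

Details Motivation: The authors study LiDAR guidance within the RAFT-Stereo framework and find that its effectiveness degrades sharply when LiDAR points are sparse (e.g., a few hundred points per frame), which motivates an improved approach.

Contribution: Offers a new explanation, from a signal processing perspective, of why sparse LiDAR data hurts guidance, and designs a simple, effective pre-filling method (GRAFT-Stereo) that significantly improves performance under sparse LiDAR conditions.

Method: Pre-fills the sparse initial disparity map and also pre-fills LiDAR depth for early fusion into image features, using a distinct pre-filling strategy for each, and combines the two to improve RAFT-Stereo.

Result: GRAFT-Stereo significantly outperforms existing LiDAR-guided methods under sparse LiDAR conditions across multiple datasets.

Insight: Pre-filling sparse LiDAR data is crucial for LiDAR-guided stereo matching, and it helps at the two injection points (initial disparity map and feature fusion) for fundamentally different reasons, so each needs its own pre-filling approach.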

Abstract: We investigate LiDAR guidance within the RAFT-Stereo framework, aiming to improve stereo matching accuracy by injecting precise LiDAR depth into the initial disparity map. We find that the effectiveness of LiDAR guidance drastically degrades when the LiDAR points become sparse (e.g., a few hundred points per frame), and we offer a novel explanation from a signal processing perspective. This insight leads to a surprisingly simple solution that enables LiDAR-guided RAFT-Stereo to thrive: pre-filling the sparse initial disparity map with interpolation. Interestingly, we find that pre-filling is also effective when injecting LiDAR depth into image features via early fusion, but for a fundamentally different reason, necessitating a distinct pre-filling approach. By combining both solutions, the proposed Guided RAFT-Stereo (GRAFT-Stereo) significantly outperforms existing LiDAR-guided methods under sparse LiDAR conditions across various datasets. We hope this study inspires more effective LiDAR-guided stereo methods.
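A minimal sketch of pre-filling a sparse disparity map by interpolation is shown below. The abstract does not say which interpolation scheme is used, so nearest-neighbour filling is an assumption, as are the function name `prefill_nearest` and the convention that 0.0 marks a pixel with no LiDAR measurement. The brute-force search is for clarity only.

```python
def prefill_nearest(sparse, missing=0.0):
    """Fill missing pixels with the value of the nearest measured pixel.

    Nearest-neighbour interpolation is an illustrative choice; the
    scheme used in the paper is not specified in the abstract.
    """
    rows, cols = len(sparse), len(sparse[0])
    # Collect (row, col, value) for every pixel that has a measurement.
    measured = [(r, c, sparse[r][c])
                for r in range(rows) for c in range(cols)
                if sparse[r][c] != missing]
    dense = [row[:] for row in sparse]
    if not measured:
        return dense  # nothing to interpolate from
    for r in range(rows):
        for c in range(cols):
            if sparse[r][c] == missing:
                # Squared Euclidean distance is enough for ranking neighbours.
                _, _, value = min(measured, key=lambda p: (p[0] - r) ** 2 + (p[1] - c) ** 2)
                dense[r][c] = value
    return dense

# A 3x4 disparity map with only two LiDAR-derived measurements.
sparse = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 12.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 30.0],
]
dense = prefill_nearest(sparse)
```

After filling, every pixel carries a plausible initial disparity instead of an isolated spike surrounded by zeros, which is the kind of dense initialization the abstract describes.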

[46] Latest Object Memory Management for Temporally Consistent Video Instance Segmentation

Seunghun Lee,Jiwan Seo,Minwoo Choi,Kiljoon Han,Jaehoon Jeong,Zane Durante,Ehsan Adeli,Sang Hyun Park,Sunghoon Im

Main category: cs.CV

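TL;DR: The paper proposes Latest Object Memory Management (LOMM) for temporally consistent video instance segmentation, which explicitly models each object's presence in every frame and significantly improves long-term instance tracking.

Details Motivation: In video instance segmentation (VIS), long-term instance tracking and consistent identity management are key challenges, especially when objects frequently appear and disappear. The authors therefore explicitly model the latest state of each object to make tracking more stable and accurate.

Contribution: 1. Proposes Latest Object Memory (LOM), which continuously updates the latest states of objects;
2. Introduces Decoupled Object Association (DOA), which handles newly appearing and already existing objects separately;
3. Achieves a state-of-the-art AP score of 54.0 on YouTube-VIS 2022.

Method: 1. LOM explicitly models the presence of objects in each frame;
2. The DOA strategy treats new and existing objects independently, improving matching accuracy;
3. The memory system supports identity management and adapts to dynamic scenes.

Result: Achieves an AP score of 54.0 on YouTube-VIS 2022, clearly outperforming traditional approaches.

Insight: 1. Explicitly modeling the latest state of objects is crucial for long-term tracking;
2. Handling new and existing objects separately improves matching accuracy;
3. Memory management is key to stable performance in dynamic scenes.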

Abstract: In this paper, we present Latest Object Memory Management (LOMM) for temporally consistent video instance segmentation that significantly improves long-term instance tracking. At the core of our method is Latest Object Memory (LOM), which robustly tracks and continuously updates the latest states of objects by explicitly modeling their presence in each frame. This enables consistent tracking and accurate identity management across frames, enhancing both performance and reliability through the VIS process. Moreover, we introduce Decoupled Object Association (DOA), a strategy that separately handles newly appearing and already existing objects. By leveraging our memory system, DOA accurately assigns object indices, improving matching accuracy and ensuring stable identity consistency, even in dynamic scenes where objects frequently appear and disappear. Extensive experiments and ablation studies demonstrate the superiority of our method over traditional approaches, setting a new benchmark in VIS. Notably, our LOMM achieves state-of-the-art AP score of 54.0 on YouTube-VIS 2022, a dataset known for its challenging long videos. Project page: https://seung-hun-lee.github.io/projects/LOMM/
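The core idea of keeping the latest state of each object and refreshing it only when the object is present can be sketched as below. This is a loose illustration rather than the LOMM implementation: the function name `update_latest_memory`, the presence threshold of 0.5, and the toy feature vectors are all assumptions.

```python
def update_latest_memory(memory, frame_features, presence_scores, threshold=0.5):
    """Return a new memory in which only objects present in this frame are refreshed.

    memory and frame_features map an object id to a feature vector;
    presence_scores maps an object id to a score in [0, 1].
    """
    updated = dict(memory)
    for object_id, feature in frame_features.items():
        if presence_scores.get(object_id, 0.0) >= threshold:
            # Object is visible: overwrite its entry with the newest state.
            updated[object_id] = feature
        # Otherwise keep the last reliable state so the identity survives occlusion.
    return updated

memory = {}
frames = [
    ({"a": [1.0, 0.0], "b": [0.0, 1.0]}, {"a": 0.9, "b": 0.8}),
    ({"a": [0.9, 0.1], "b": [0.5, 0.5]}, {"a": 0.95, "b": 0.1}),  # "b" is occluded here
]
for features, presence in frames:
    memory = update_latest_memory(memory, features, presence)
```

After the second frame, object "a" holds its newest feature while object "b" keeps the feature from the frame where it was last visible, so a later reappearance of "b" can be matched against a clean state rather than one corrupted during occlusion.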

[47] MoFRR: Mixture of Diffusion Models for Face Retouching Restoration

Jiaxin Liu,Qichao Ying,Zhenxing Qian,Sheng Li,Runqi Zhang,Jian Liu,Xinpeng Zhang

Main category: cs.CV

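TL;DR: The paper introduces a new task, Face Retouching Restoration (FRR), which aims to recover the original face from a retouched image. Its model, MoFRR, is a mixture of diffusion models that uses an expert isolation strategy and a dual-branch structure to restore faces altered by complex retouching operations.

Details Motivation: Retouching on social platforms raises concerns about the authenticity of face images. Existing methods focus on detecting retouching, while accurately recovering the original face from a retouched image remains unsolved, which motivates the FRR task.

Contribution: 1. Introduces the new FRR task, focused on recovering original faces from retouched images. 2. Designs the MoFRR model, which combines diffusion models with an expert isolation strategy and uses a dual-branch structure to handle low-frequency and high-frequency information. 3. Builds a new dataset, RetouchingFFHQ++, for experimental validation.

Method: 1. A diffusion-based framework with an expert isolation strategy, comprising several specialized experts and one shared expert. 2. Each specialized expert has a dual-branch structure: a DDIM-based low-frequency branch guided by the Iterative Distortion Evaluation Module (IDEM) and a cross-attention-based high-frequency branch (HFCAM).

Result: Experiments on the RetouchingFFHQ++ dataset demonstrate the effectiveness of MoFRR in restoring retouched face images.

Insight: 1. Low-frequency information is central to retouching restoration. 2. The expert isolation strategy and dual-branch structure handle complex retouching operations effectively. 3. A high-quality dataset is essential for validating the task.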

Abstract: The widespread use of face retouching on social media platforms raises concerns about the authenticity of face images. While existing methods focus on detecting face retouching, how to accurately recover the original faces from the retouched ones has yet to be answered. This paper introduces Face Retouching Restoration (FRR), a novel computer vision task aimed at restoring original faces from their retouched counterparts. FRR differs from traditional image restoration tasks by addressing the complex retouching operations with various types and degrees, which focuses more on the restoration of the low-frequency information of the faces. To tackle this challenge, we propose MoFRR, Mixture of Diffusion Models for FRR. Inspired by DeepSeek’s expert isolation strategy, the MoFRR uses sparse activation of specialized experts handling distinct retouching types and the engagement of a shared expert dealing with universal retouching traces. Each specialized expert follows a dual-branch structure with a DDIM-based low-frequency branch guided by an Iterative Distortion Evaluation Module (IDEM) and a Cross-Attention-based High-Frequency branch (HFCAM) for detail refinement. Extensive experiments on a newly constructed face retouching dataset, RetouchingFFHQ++, demonstrate the effectiveness of MoFRR for FRR.

[48] Self-Guided Masked Autoencoder

Jeongwoo Shin,Inseo Lee,Junho Lee,Joonseok Lee

Main category: cs.CV

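TL;DR: The paper proposes a self-guided Masked Autoencoder (MAE) that uses the patch clustering information it learns early in training to generate informed masks, replacing the random masking of the original MAE and significantly improving learning.

Details Motivation: Despite MAE's success in self-supervised representation learning, what it learns and how it learns are not fully understood. Through in-depth analysis, the authors find that MAE forms pattern-based patch-level clusters from the early stages of pretraining.

Contribution: 1. Reveals the patch clustering that emerges early in MAE training; 2. Proposes the self-guided MAE, which uses internally generated clustering information to improve the masking strategy; 3. Verifies the method on a variety of downstream tasks, without relying on external models or supplementary information.

Method: Based on the analysis of MAE's learning process, proposes a self-guided masking strategy that generates masks from the patch clustering the model has learned itself, replacing random masking.

Result: Experiments show that the self-guided MAE significantly improves the learning process and outperforms the original MAE on a variety of downstream tasks.

Insight: MAE already captures patch clustering information in the early stage of training, and using this information to guide masking can further improve the model.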

Abstract: Masked Autoencoder (MAE) is a self-supervised approach for representation learning, widely applicable to a variety of downstream tasks in computer vision. In spite of its success, it is still not fully uncovered what and how MAE exactly learns. In this paper, with an in-depth analysis, we discover that MAE intrinsically learns pattern-based patch-level clustering from surprisingly early stages of pretraining. Upon this understanding, we propose self-guided masked autoencoder, which internally generates informed mask by utilizing its progress in patch clustering, substituting the naive random masking of the vanilla MAE. Our approach significantly boosts its learning process without relying on any external models or supplementary information, keeping the benefit of self-supervised nature of MAE intact. Comprehensive experiments on various downstream tasks verify the effectiveness of the proposed method.

[49] HydraMamba: Multi-Head State Space Model for Global Point Cloud Learning

Kanglin Qu,Pan Gao,Qun Dai,Yuanhao Sun

Main category: cs.CV

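TL;DR: HydraMamba proposes a multi-head point cloud learning framework based on state space models. By introducing a shuffle serialization strategy and a ConvBiS6 layer, it addresses imperfect point cloud serialization and insufficient locality learning, and reaches state-of-the-art (SOTA) results on multiple tasks.

Details Motivation: Attention dominates point cloud learning, but its quadratic complexity limits long-range dependency modeling. Although the S6 model excels at long-range modeling, existing methods still fall short in point cloud serialization and locality learning.

Contribution: 1) Designs a shuffle serialization strategy that better adapts unordered point sets to the causal nature of S6; 2) Proposes the ConvBiS6 layer, which captures local geometries and global context dependencies synergistically; 3) Extends the multi-head design to S6 to enhance its modeling capability.

Method: Optimizes the input order of the point cloud with the shuffle serialization strategy, learns local and global features with the ConvBiS6 layer, and strengthens the S6 model with the multi-head MHS6 design.

Result: Achieves SOTA results on multiple tasks at both the object level and the scene level.

Insight: Combining serialization with joint local-global learning is key to improving long-range dependency modeling for point clouds.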

Abstract: The attention mechanism has become a dominant operator in point cloud learning, but its quadratic complexity leads to limited inter-point interactions, hindering long-range dependency modeling between objects. Due to excellent long-range modeling capability with linear complexity, the selective state space model (S6), as the core of Mamba, has been exploited in point cloud learning for long-range dependency interactions over the entire point cloud. Despite some significant progress, related works still suffer from imperfect point cloud serialization and lack of locality learning. To this end, we explore a state space model-based point cloud network termed HydraMamba to address the above challenges. Specifically, we design a shuffle serialization strategy, making unordered point sets better adapted to the causal nature of S6. Meanwhile, to overcome the deficiency of existing techniques in locality learning, we propose a ConvBiS6 layer, which is capable of capturing local geometries and global context dependencies synergistically. Besides, we propose MHS6 by extending the multi-head design to S6, further enhancing its modeling capability. HydraMamba achieves state-of-the-art results on various tasks at both object-level and scene-level. The code is available at https://github.com/Point-Cloud-Learning/HydraMamba.

[50] JDATT: A Joint Distillation Framework for Atmospheric Turbulence Mitigation and Target Detection

Zhiming Liu,Paul Hill,Nantheera Anantrasirichai

Main category: cs.CV

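TL;DR: JDATT proposes a joint distillation framework that tackles atmospheric turbulence mitigation and target detection together, compressing the models through knowledge distillation to improve real-time performance.

Details Motivation: Atmospheric turbulence (AT) degrades image quality and harms downstream tasks such as target detection. Existing methods are computationally expensive and handle turbulence mitigation and detection separately, which is inefficient.

Contribution: 1. Proposes the JDATT framework, which jointly optimizes turbulence mitigation and target detection; 2. Introduces a hybrid knowledge distillation strategy (feature-level and output-level) to compress model size; 3. Validates efficiency and real-time suitability on synthetic and real-world datasets.

Method: 1. Combines state-of-the-art turbulence mitigation and detection modules; 2. Performs feature-level distillation with Channel-Wise Distillation (CWD) and Masked Generative Distillation (MGD); 3. Performs output-level distillation with Kullback-Leibler divergence.

Result: Experiments show that JDATT achieves superior visual restoration and target detection while substantially reducing model size and inference time.

Insight: Joint optimization and knowledge distillation are effective tools for complex vision tasks, especially in resource-constrained settings.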

Abstract: Atmospheric turbulence (AT) introduces severe degradations, such as rippling, blur, and intensity fluctuations, that hinder both image quality and downstream vision tasks like target detection. While recent deep learning-based approaches have advanced AT mitigation using transformer and Mamba architectures, their high complexity and computational cost make them unsuitable for real-time applications, especially in resource-constrained settings such as remote surveillance. Moreover, the common practice of separating turbulence mitigation and object detection leads to inefficiencies and suboptimal performance. To address these challenges, we propose JDATT, a Joint Distillation framework for Atmospheric Turbulence mitigation and Target detection. JDATT integrates state-of-the-art AT mitigation and detection modules and introduces a unified knowledge distillation strategy that compresses both components while minimizing performance loss. We employ a hybrid distillation scheme: feature-level distillation via Channel-Wise Distillation (CWD) and Masked Generative Distillation (MGD), and output-level distillation via Kullback-Leibler divergence. Experiments on synthetic and real-world turbulence datasets demonstrate that JDATT achieves superior visual restoration and detection accuracy while significantly reducing model size and inference time, making it well-suited for real-time deployment.
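
The output-level term of the hybrid scheme described above can be illustrated with a minimal sketch. It assumes the common temperature-scaled formulation of KL distillation on class logits; the function names, the temperature value, and the T² scaling are illustrative assumptions, not details confirmed by the JDATT abstract.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on softened class distributions.

    The temperature**2 factor is the usual scaling that keeps gradient
    magnitudes comparable across temperatures (an assumption here).
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
    return temperature ** 2 * kl
```

The loss is zero when the student reproduces the teacher's distribution and grows as the two diverge, which is what lets a compact student inherit the teacher's output behaviour.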

[51] TransFlow: Motion Knowledge Transfer from Video Diffusion Models to Video Salient Object Detection

Suhwan Cho,Minhyeok Lee,Jungho Lee,Sunghun Yang,Sangyoun Lee

Main category: cs.CV

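TL;DR: TransFlow uses the motion knowledge of pre-trained video diffusion models to generate realistic training data that improves video salient object detection.

Details Motivation: Video salient object detection relies on motion cues, but training data is limited and the optical flows produced by existing generation methods lack semantic understanding of motion.

Contribution: Proposes TransFlow, which transfers motion knowledge from video diffusion models to generate semantically aware optical flow data.

Method: Uses the semantic motion priors of video diffusion models to generate optical flows with natural motion patterns from static images.

Result: Improves performance on multiple benchmarks, demonstrating effective motion knowledge transfer.

Insight: Knowledge transfer from pre-trained models can address data scarcity and improve the semantic understanding of downstream tasks.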

Abstract: Video salient object detection (SOD) relies on motion cues to distinguish salient objects from backgrounds, but training such models is limited by scarce video datasets compared to abundant image datasets. Existing approaches that use spatial transformations to create video sequences from static images fail for motion-guided tasks, as these transformations produce unrealistic optical flows that lack semantic understanding of motion. We present TransFlow, which transfers motion knowledge from pre-trained video diffusion models to generate realistic training data for video SOD. Video diffusion models have learned rich semantic motion priors from large-scale video data, understanding how different objects naturally move in real scenes. TransFlow leverages this knowledge to generate semantically-aware optical flows from static images, where objects exhibit natural motion patterns while preserving spatial boundaries and temporal coherence. Our method achieves improved performance across multiple benchmarks, demonstrating effective motion knowledge transfer.

[52] DepthFlow: Exploiting Depth-Flow Structural Correlations for Unsupervised Video Object Segmentation

Suhwan Cho,Minhyeok Lee,Jungho Lee,Donghyeong Kim,Sangyoun Lee

Main category: cs.CV

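TL;DR: DepthFlow proposes a new method that exploits the structural correlation between depth and optical flow to generate synthetic flow, addressing data scarcity in unsupervised video object segmentation and achieving new state-of-the-art (SOTA) results.

Details Motivation: Data scarcity in unsupervised video object segmentation (VOS) limits the performance of two-stream methods based on optical flow and RGB. The authors propose generating synthetic optical flow from the structural correlation between depth and flow to expand the training data.

Contribution: 1. Proposes DepthFlow, which generates structurally correlated synthetic optical flow from a single image to address data scarcity. 2. Shows that VOS models depend more on the structural information in flow maps than on their geometric accuracy. 3. Achieves SOTA performance on multiple public VOS benchmarks.

Method: 1. Estimate a depth map from a single image. 2. Convert the depth map into a synthetic flow field that preserves key structural information. 3. Train an encoder-decoder architecture on the resulting image-flow-mask pairs.

Result: DepthFlow achieves new SOTA performance on all public VOS benchmarks, confirming the feasibility and scalability of the method.

Insight: The structural information in optical flow is highly correlated with depth, so synthesizing flow from depth can effectively address data scarcity without hurting model performance. This offers a new data augmentation approach for unsupervised VOS.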

Abstract: Unsupervised video object segmentation (VOS) aims to detect the most prominent object in a video. Recently, two-stream approaches that leverage both RGB images and optical flow have gained significant attention, but their performance is fundamentally constrained by the scarcity of training data. To address this, we propose DepthFlow, a novel data generation method that synthesizes optical flow from single images. Our approach is driven by the key insight that VOS models depend more on structural information embedded in flow maps than on their geometric accuracy, and that this structure is highly correlated with depth. We first estimate a depth map from a source image and then convert it into a synthetic flow field that preserves essential structural cues. This process enables the transformation of large-scale image-mask pairs into image-flow-mask training pairs, dramatically expanding the data available for network training. By training a simple encoder-decoder architecture with our synthesized data, we achieve new state-of-the-art performance on all public VOS benchmarks, demonstrating a scalable and effective solution to the data scarcity problem.
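
The abstract does not specify how depth is converted into flow. As a hedged illustration of why depth and flow share structure, the sketch below uses motion parallax: for a pinhole camera translating parallel to the image plane, a static point at depth Z shifts by roughly focal length × translation / Z pixels. The function name, the focal length, and the camera shift are all assumptions for illustration, not the authors' conversion rule.

```python
def depth_to_flow(depth, focal_px=500.0, shift=(0.1, 0.0), min_depth=1e-3):
    """Turn a depth map (list of rows, in metres) into a per-pixel (dx, dy) flow field.

    Assumes a pinhole camera that translates by `shift` metres parallel to the
    image plane while the scene stays static (hypothetical setup).
    """
    tx, ty = shift
    flow = []
    for row in depth:
        flow_row = []
        for z in row:
            z = max(z, min_depth)  # guard against division by zero
            # Moving the camera right makes the scene appear to move left.
            flow_row.append((-focal_px * tx / z, -focal_px * ty / z))
        flow.append(flow_row)
    return flow
```

Because displacement is inversely proportional to depth, object boundaries in the depth map reappear as motion boundaries in the synthetic flow, which is the kind of structural cue the abstract says VOS models rely on.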

[53] Smaller, Faster, Cheaper: Architectural Designs for Efficient Machine Learning

Steven Walton

Main category: cs.CV

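TL;DR: This dissertation explores how architectural design can make machine learning models more efficient (smaller, faster, and cheaper), focusing on data ingress and egress, modifications to the core neural architecture, and the natural structures of Normalizing Flows.

Details Motivation: As computer vision models are deployed in increasingly resource-constrained environments, there is a need for architectures that maintain high performance while using fewer computational resources.

Contribution: The dissertation contributes in three directions: 1) improving data ingress and egress so that smaller architectures become more performant; 2) increasing the expressivity of vision transformers by removing uniform context windows in restricted attention; 3) leveraging the properties of Normalizing Flows to better distill model knowledge.

Method: 1) Studies how information is passed into and retrieved from the core neural processing units; 2) Modifies the core neural architecture, applied to restricted attention in vision transformers; 3) Uses the natural structures of Normalizing Flows for knowledge distillation.

Result: Careful architectural design increases the efficiency of machine learning models and reduces their computational demands.

Insight: The dissertation shows that adjusting architectural design can make models lighter and more efficient without sacrificing performance.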

Abstract: Major advancements in the capabilities of computer vision models have been primarily fueled by rapid expansion of datasets, model parameters, and computational budgets, leading to ever-increasing demands on computational infrastructure. However, as these models are deployed in increasingly diverse and resource-constrained environments, there is a pressing need for architectures that can deliver high performance while requiring fewer computational resources. This dissertation focuses on architectural principles through which models can achieve increased performance while reducing their computational demands. We discuss strides towards this goal through three directions. First, we focus on data ingress and egress, investigating how information may be passed into and retrieved from our core neural processing units. This ensures that our models make the most of available data, allowing smaller architectures to become more performant. Second, we investigate modifications to the core neural architecture, applied to restricted attention in vision transformers. This section explores how removing uniform context windows in restricted attention increases the expressivity of the underlying neural architecture. Third, we explore the natural structures of Normalizing Flows and how we can leverage these properties to better distill model knowledge. These contributions demonstrate that careful design of neural architectures can increase the efficiency of machine learning algorithms, allowing them to become smaller, faster, and cheaper.

[54] ForCenNet: Foreground-Centric Network for Document Image Rectification

Peng Cai,Qiang Li,Kaicheng Yang,Dong Guo,Jia Li,Nan Zhou,Xiang An,Ninghua Yang,Jiankang Deng

Main category: cs.CV

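TL;DR: ForCenNet proposes a foreground-centric document image rectification method. With foreground label generation, a mask mechanism, and a curvature consistency loss, it significantly improves rectification and reaches state-of-the-art (SOTA) performance on several real-world benchmarks.

Details Motivation: Existing document image rectification methods often overlook foreground elements, even though the foreground provides geometric references and layout information that affect rectification quality. ForCenNet therefore focuses on the foreground to improve the task.

Contribution: 1. Proposes a foreground-centric label generation method; 2. Designs a foreground-centric mask mechanism; 3. Introduces a curvature consistency loss; 4. Achieves SOTA performance on multiple benchmarks.

Method: 1. Extracts foreground elements from undistorted images to generate labels; 2. Uses the mask mechanism to distinguish readable regions from the background; 3. Uses the curvature consistency loss to improve the model's understanding of geometric distortion.

Result: Achieves the best results on four benchmarks, including DocUNet and DIR300, and effectively rectifies layout elements such as text lines and table borders.

Insight: Foreground elements play a key role in document image rectification; foreground labels and geometric consistency constraints can significantly improve model performance.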

Abstract: Document image rectification aims to eliminate geometric deformation in photographed documents to facilitate text recognition. However, existing methods often neglect the significance of foreground elements, which provide essential geometric references and layout information for document image correction. In this paper, we introduce Foreground-Centric Network (ForCenNet) to eliminate geometric distortions in document images. Specifically, we first propose a foreground-centric label generation method, which extracts detailed foreground elements from an undistorted image. Then we introduce a foreground-centric mask mechanism to enhance the distinction between readable and background regions. Furthermore, we design a curvature consistency loss that leverages the detailed foreground labels to help the model understand the distorted geometric distribution. Extensive experiments demonstrate that ForCenNet achieves new state-of-the-art results on four real-world benchmarks: DocUNet, DIR300, WarpDoc, and DocReal. Quantitative analysis shows that the proposed method effectively undistorts layout elements, such as text lines and table borders. The resources for further comparison are provided at https://github.com/caipeng328/ForCenNet.

[55] DS-Det: Single-Query Paradigm and Attention Disentangled Learning for Flexible Object Detection

Guiping Cao,Xiangyuan Lan,Wenjian Huang,Jianguo Zhang,Dongmei Jiang,Yaowei Wang

Main category: cs.CV

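TL;DR: DS-Det introduces a Single-Query paradigm and attention disentangled learning to resolve query ambiguity and interaction conflicts in existing query-based detectors, improving flexibility and efficiency.

Details Motivation: Existing query-based detectors (such as DETR) suffer from query ambiguity and interaction conflicts, which limit detection flexibility and efficiency.

Contribution: 1) Proposes the Single-Query paradigm, turning fixed queries into flexible ones; 2) Simplifies the decoder framework through attention disentangled learning; 3) Introduces the PoCoo loss to improve small-object detection.

Method: 1) The Single-Query paradigm unifies decoder modeling; 2) Attention disentangled learning uses Cross-Attention for localization and Self-Attention for deduplication; 3) The PoCoo loss uses box size priors to guide learning.

Result: On the COCO2017 and WiderPerson datasets, DS-Det performs strongly across multiple backbone models.

Insight: Attention disentangled learning directly addresses query ambiguity and interaction conflicts, and the Single-Query paradigm makes detection more flexible.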

Abstract: Popular transformer detectors have achieved promising performance through query-based learning using attention mechanisms. However, the roles of existing decoder query types (e.g., content query and positional query) are still underexplored. These queries are generally predefined with a fixed number (fixed-query), which limits their flexibility. We find that the learning of these fixed queries is impaired by Recurrent Opposing inTeractions (ROT) between two attention operations: Self-Attention (query-to-query) and Cross-Attention (query-to-encoder), thereby degrading decoder efficiency. Furthermore, “query ambiguity” arises when shared-weight decoder layers are processed with both one-to-one and one-to-many label assignments during training, violating DETR’s one-to-one matching principle. To address these challenges, we propose DS-Det, a more efficient detector capable of detecting a flexible number of objects in images. Specifically, we reformulate decoder modeling and introduce a new unified Single-Query paradigm, turning fixed queries into flexible ones. Furthermore, we propose a simplified decoder framework through attention disentangled learning: locating boxes with Cross-Attention (one-to-many process) and deduplicating predictions with Self-Attention (one-to-one process), which addresses the “query ambiguity” and “ROT” issues directly and enhances decoder efficiency. We further introduce a unified PoCoo loss that leverages box size priors to prioritize query learning on hard samples such as small objects. Extensive experiments across five different backbone models on COCO2017 and WiderPerson datasets demonstrate the general effectiveness and superiority of DS-Det. The source code is available at https://github.com/Med-Process/DS-Det/.

[56] SeeDiff: Off-the-Shelf Seeded Mask Generation from Diffusion Models

Joon Hyun Park,Kumju Jo,Sungyong Baik

Main category: cs.CV

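TL;DR: The paper proposes SeeDiff, a method that uses the attention mechanisms of Stable Diffusion to generate high-quality pixel-level annotation masks without training a segmentation network, tuning text prompts, or using a pre-trained segmentation network.

Details Motivation: Traditional semantic segmentation networks need large numbers of manually annotated pixel-level masks, and existing methods based on generative models do not fully exploit text-guided diffusion models, so they still require extra training or tuning. SeeDiff aims to generate masks directly from the attention mechanisms of the diffusion model, reducing human effort.

Contribution: The main contributions are: 1) showing that cross-attention in diffusion models provides coarse object localization that can serve as initial seeds; 2) using self-attention to model semantic correspondence and iteratively expanding the mask with multi-scale self-attention maps; 3) proposing a background mask refinement step that further improves mask quality.

Method: The method has three steps: 1) generate initial seeds from cross-attention; 2) expand the mask with multi-scale self-attention maps; 3) refine the mask with a more accurate background mask, based on the observation that backgrounds tend to be simple.

Result: SeeDiff generates high-quality masks directly from Stable Diffusion without additional training or tuning, and experiments show that it outperforms existing methods.

Insight: The study shows that the attention mechanisms of diffusion models can be used directly for semantic segmentation, suggesting a new way to apply generative models to annotation tasks.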

Abstract: Entrusted with the goal of pixel-level object classification, semantic segmentation networks entail the laborious preparation of pixel-level annotation masks. To obtain pixel-level annotation masks for a given class without human effort, a few recent works have proposed to generate pairs of images and annotation masks by employing the image and text relationships modeled by text-to-image generative models, especially Stable Diffusion. However, these works do not fully exploit the capability of text-guided diffusion models and thus require a pre-trained segmentation network, careful text prompt tuning, or the training of a segmentation network to generate final annotation masks. In this work, we take a closer look at the attention mechanisms of Stable Diffusion, from which we draw connections with classical seeded segmentation approaches. In particular, we show that cross-attention alone provides very coarse object localization, which can nevertheless provide initial seeds. Then, akin to region expansion in seeded segmentation, we utilize the semantic-correspondence-modeling capability of self-attention to iteratively spread the attention from the seeds to the whole class using multi-scale self-attention maps. We also observe that a synthetic image generated from a simple text prompt often has a uniform background, for which correspondences are easier to find than for complex-structured objects. Thus, we further refine a mask using a more accurate background mask. Our proposed method, dubbed SeeDiff, generates high-quality masks off-the-shelf from Stable Diffusion, without an additional training procedure, prompt tuning, or a pre-trained segmentation network.
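
The seed-then-expand idea can be sketched on a toy example. The code below assumes a row-normalized patch-to-patch affinity matrix standing in for a self-attention map and a binary seed vector standing in for thresholded cross-attention; the function name, the threshold, and the single-scale setup are illustrative assumptions, not SeeDiff's actual procedure.

```python
def expand_seeds(seed, affinity, steps=3, threshold=0.25):
    """Iteratively grow a binary mask from seed patches.

    seed:     list of 0.0/1.0 flags, one per patch (hypothetical cross-attention seeds)
    affinity: affinity[i][j] is how strongly patch i attends to patch j
              (hypothetical stand-in for a self-attention map)
    """
    mask = list(seed)
    for _ in range(steps):
        # Each patch collects the attention mass it places on already-masked patches.
        spread = [sum(a * m for a, m in zip(row, mask)) for row in affinity]
        # A patch joins the mask once that mass reaches the threshold; seeds stay in.
        mask = [1.0 if (s >= threshold or m == 1.0) else 0.0
                for s, m in zip(spread, mask)]
    return mask
```

Patches that attend mostly to each other are absorbed into the mask over a few iterations, while patches that attend only to themselves (a uniform background, for instance) are left out.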

[57] FM-LC: A Hierarchical Framework for Urban Flood Mapping by Land Cover Identification Models

Xin Hong,Longchao Da,Hua Wei

Main category: cs.CV

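TL;DR: FM-LC is a hierarchical framework based on land cover identification for fine-scale mapping of urban floods in arid regions. Combining a multi-class U-Net with binary expert models, it markedly improves the accuracy of flood boundaries.

Details Motivation: Urban flood mapping in arid regions is challenged by low spectral contrast, rapid hydrological dynamics, and highly heterogeneous urban land cover, which traditional methods handle poorly. High-resolution daily imagery offers an opportunity to address this.

Contribution: Proposes the FM-LC framework, which combines a multi-class U-Net with binary expert models to separate spectrally similar classes (such as water and vegetation) and refines boundaries with Bayesian smoothing.

Method: 1. An initial multi-class U-Net segments the imagery; 2. A lightweight binary expert model is trained for the most confused class; 3. Bayesian smoothing refines the boundaries.

Result: Validated on the April 2024 Dubai storm event, the framework improves average F1-score by up to 29% and produces sharper flood delineations, significantly outperforming conventional single-stage U-Net baselines.

Insight: Combining a hierarchical framework with expert models effectively resolves confusion between spectrally similar classes and improves the accuracy and robustness of flood mapping.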

Abstract: Urban flooding in arid regions poses severe risks to infrastructure and communities. Accurate, fine-scale mapping of flood extents and recovery trajectories is therefore essential for improving emergency response and resilience planning. However, arid environments often exhibit limited spectral contrast between water and adjacent surfaces, rapid hydrological dynamics, and highly heterogeneous urban land covers, which challenge traditional flood-mapping approaches. High-resolution, daily PlanetScope imagery provides the temporal and spatial detail needed. In this work, we introduce FM-LC, a hierarchical framework for Flood Mapping by Land Cover identification, for this challenging task. Through a three-stage process, it first uses an initial multi-class U-Net to segment imagery into water, vegetation, built area, and bare ground classes. We identify that this method has confusion between spectrally similar categories (e.g., water vs. vegetation). Second, by early checking, the class with the major misclassified area is flagged, and a lightweight binary expert segmentation model is trained to distinguish the flagged class from the rest. Third, a Bayesian smoothing step refines boundaries and removes spurious noise by leveraging nearby pixel information. We validate the framework on the April 2024 Dubai storm event, using pre- and post-rainfall PlanetScope composites. Experimental results demonstrate average F1-score improvements of up to 29% across all land-cover classes and notably sharper flood delineations, significantly outperforming conventional single-stage U-Net baselines.
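
The third stage can be illustrated with a small sketch. The abstract only says that Bayesian smoothing leverages nearby pixel information; the version below assumes one simple reading in which the mean class probability of the 8-neighbourhood acts as a prior and the pixel's own prediction acts as the likelihood. The function name and this particular prior are assumptions, not the paper's formulation.

```python
def bayesian_smooth(probs):
    """Smooth a grid of per-class probabilities and return a label map.

    probs[r][c] is a list of class probabilities for pixel (r, c).
    """
    rows, cols = len(probs), len(probs[0])
    n_classes = len(probs[0][0])
    labels = []
    for r in range(rows):
        label_row = []
        for c in range(cols):
            # Prior: average class probabilities over the neighbours (pixel itself excluded).
            prior = [0.0] * n_classes
            count = 0
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr, cc = r + dr, c + dc
                    if (dr or dc) and 0 <= rr < rows and 0 <= cc < cols:
                        for k in range(n_classes):
                            prior[k] += probs[rr][cc][k]
                        count += 1
            if count:
                prior = [p / count for p in prior]
            else:
                prior = [1.0 / n_classes] * n_classes  # isolated pixel: uniform prior
            # Posterior is proportional to likelihood times prior; argmax needs no normalization.
            posterior = [probs[r][c][k] * prior[k] for k in range(n_classes)]
            label_row.append(max(range(n_classes), key=posterior.__getitem__))
        labels.append(label_row)
    return labels
```

A pixel whose own prediction weakly favours one class is relabelled when its neighbours strongly favour another, which removes isolated speckle while leaving confident regions unchanged.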

[58] LAVA: Language Driven Scalable and Versatile Traffic Video Analytics

Yanrui Yu,Tianfei Zhou,Jiaxin Sun,Lianpeng Qiao,Lizhong Ding,Ye Yuan,Guoren Wang

Main category: cs.CV

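TL;DR: LAVA proposes a natural-language-driven approach to traffic video analytics. With multi-armed bandit sampling, open-world object detection, and long-term trajectory extraction, it significantly improves query performance and efficiency.

Details Motivation: In modern urban environments, massive amounts of camera footage must be analyzed efficiently. Existing SQL-based methods are not flexible enough to support diverse queries driven by natural language.

Contribution: 1) Proposes a language-driven video analytics paradigm; 2) Designs the LAVA system, comprising efficient sampling, open-world detection, and trajectory extraction modules; 3) Builds a new evaluation benchmark.

Method: 1) Multi-armed bandit sampling to localize video segments; 2) An open-world object detection module; 3) Long-term object trajectory association.

Result: F1-score improves by 14%, MPAE drops by 0.39, top-k precision reaches 86%, and processing is 9.6 times faster than the most accurate baseline.

Insight: Natural-language-driven video analytics can significantly improve flexibility and efficiency; open-world detection and multi-level querying are key to achieving this.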

Abstract: In modern urban environments, camera networks generate massive amounts of operational footage, reaching petabytes each day, making scalable video analytics essential for efficient processing. Many existing approaches adopt an SQL-based paradigm for querying such large-scale video databases; however, this constrains queries to rigid patterns with predefined semantic categories, significantly limiting analytical flexibility. In this work, we explore a language-driven video analytics paradigm aimed at enabling flexible and efficient querying of high-volume video data driven by natural language. In particular, we build LAVA, a system that accepts natural language queries and retrieves traffic targets across multiple levels of granularity and arbitrary categories. LAVA comprises three main components: 1) a multi-armed bandit-based efficient sampling method for video segment-level localization; 2) a video-specific open-world detection module for object-level retrieval; and 3) a long-term object trajectory extraction scheme for temporal object association, yielding complete trajectories for objects of interest. To support comprehensive evaluation, we further develop a novel benchmark by providing diverse, semantically rich natural language predicates and fine-grained annotations for multiple videos. Experiments on this benchmark demonstrate that LAVA improves F1-scores for selection queries by 14%, reduces MPAE for aggregation queries by 0.39, and achieves top-k precision of 86%, while processing videos 9.6× faster than the most accurate baseline.
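
The first component, bandit-based sampling for segment-level localization, can be sketched with the standard UCB1 rule. Treating each video segment as an arm and a frame probe as a pull is an assumption about how the idea maps onto video; the `probe` callable, its reward definition, and the function name are hypothetical and not taken from LAVA.

```python
import math

def ucb1_sample(num_segments, probe, budget):
    """Spend `budget` probes across video segments using the UCB1 rule.

    probe(i) is a hypothetical callable that samples a frame from segment i
    and returns a reward in [0, 1], e.g. 1.0 if the queried object is found.
    Returns how many probes each segment received.
    """
    counts = [0] * num_segments
    totals = [0.0] * num_segments
    for t in range(1, budget + 1):
        if t <= num_segments:
            arm = t - 1  # probe every segment once before comparing them
        else:
            # Mean reward plus an exploration bonus that shrinks with more probes.
            arm = max(
                range(num_segments),
                key=lambda i: totals[i] / counts[i]
                + math.sqrt(2.0 * math.log(t) / counts[i]),
            )
        totals[arm] += probe(arm)
        counts[arm] += 1
    return counts
```

Segments that keep returning hits attract most of the budget, while unpromising segments are still revisited occasionally, so expensive detection is concentrated where the query is likely to match.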

[59] AutoSign: Direct Pose-to-Text Translation for Continuous Sign Language Recognition

Samuel Ebimobowei Johnny,Blessed Guda,Andrew Blayama Stephen,Assane Gueye

Main category: cs.CV

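TL;DR: The paper proposes AutoSign, a new continuous sign language recognition method that translates pose sequences directly into text with an autoregressive decoder, avoiding the limitations of traditional alignment mechanisms and improving recognition performance.

Details Motivation: Traditional continuous sign language recognition methods rely on multi-stage pipelines, which introduce error propagation and overfitting and limit vocabulary scalability. AutoSign aims to simplify the pipeline and improve performance by translating pose sequences directly into text.

Contribution: The core contribution of AutoSign is an end-to-end method based on an autoregressive decoder that directly generates the text (glosses) corresponding to the signs without intermediate alignment, substantially reducing word error rate (WER).

Method: The method combines a temporal compression module built from 1D CNNs with the pre-trained AraGPT2 decoder to generate text directly from pose sequences. Ablations find that hand and body poses are the most discriminative features.

Result: On the Isharah-1000 dataset, AutoSign improves WER by up to 6.1% over the best existing method.

Insight: Directly learning the mapping between poses and text is more effective than traditional alignment-based methods, and hand and body poses are the most important features for continuous sign language recognition.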

Abstract: Continuously recognizing sign gestures and converting them to glosses plays a key role in bridging the gap between the hearing and hearing-impaired communities. This involves recognizing and interpreting the signer's hand, face, and body gestures, which is challenging because it requires combining all of these features. Continuous Sign Language Recognition (CSLR) methods rely on multi-stage pipelines that first extract visual features, then align variable-length sequences with target glosses using CTC or HMM-based approaches. However, these alignment-based methods suffer from error propagation across stages and overfitting, and struggle with vocabulary scalability due to the intermediate gloss representation bottleneck. To address these limitations, we propose AutoSign, an autoregressive decoder-only transformer that directly translates pose sequences to natural language text, bypassing traditional alignment mechanisms entirely. This decoder-only approach allows the model to map directly between the features and the glosses without the need for CTC loss while also learning the textual dependencies in the glosses. Our approach incorporates a temporal compression module using 1D CNNs to efficiently process pose sequences, followed by AraGPT2, a pre-trained Arabic decoder, to generate text (glosses). Through comprehensive ablation studies, we demonstrate that hand and body gestures provide the most discriminative features for signer-independent CSLR. By eliminating the multi-stage pipeline, AutoSign achieves substantial improvements on the Isharah-1000 dataset, improving WER by up to 6.1% compared to the best existing method.
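
How a 1D CNN compresses a pose sequence in time can be shown with a strided convolution. The sketch below is single-channel with a fixed averaging kernel purely for illustration; the actual module presumably uses learned multi-channel kernels, and the kernel size, stride, and function name are assumptions.

```python
def conv1d_stride(sequence, kernel, stride=2):
    """Strided 1D convolution without padding over a list of per-frame scalars.

    Output length is (len(sequence) - len(kernel)) // stride + 1.
    """
    k = len(kernel)
    return [
        sum(w * x for w, x in zip(kernel, sequence[start:start + k]))
        for start in range(0, len(sequence) - k + 1, stride)
    ]

# Two stacked stride-2 layers shorten a 10-frame sequence to 2 time steps.
frames = [float(i) for i in range(10)]
layer1 = conv1d_stride(frames, [0.5, 0.5])
layer2 = conv1d_stride(layer1, [0.5, 0.5])
```

Each stride-2 layer roughly halves the number of time steps, so the decoder attends over a much shorter sequence than the raw frame rate would give.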

[60] Knowledge Regularized Negative Feature Tuning for Out-of-Distribution Detection with Vision-Language Models

Wenjie Zhu,Yabin Zhang,Xin Jin,Wenjun Zeng,Lei Zhang

Main category: cs.CV

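TL;DR: The paper proposes Knowledge Regularized Negative Feature Tuning (KR-NFT), which combines Negative Feature Tuning (NFT) with a knowledge-regularization (KR) strategy to improve out-of-distribution (OOD) detection with vision-language models while avoiding a loss of generalization to unseen classes and styles.

Details Motivation: Existing OOD detection methods based on negative prompt tuning improve performance but often reduce generalization to unseen classes and styles. The paper aims to solve this problem.

Contribution: 1. Proposes Negative Feature Tuning (NFT), which separates positive and negative feature spaces through distribution-aware transformations; 2. Introduces the knowledge-regularization (KR) strategy, which strengthens OOD detection while preserving pre-trained knowledge.

Method: 1. NFT applies distribution-aware transformations to pre-trained text features to separate positive and negative features; 2. A lightweight meta-network introduces image-conditional learnable factors for dynamic adaptation; 3. The KR strategy optimizes a loss that balances ID classification and OOD detection.

Result: With few-shot training on ImageNet, KR-NFT improves ID classification accuracy and OOD detection, and reduces FPR95 by 5.44%.

Insight: Separating positive and negative feature spaces and adapting dynamically to each image are key to improving generalization while avoiding knowledge forgetting.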

Abstract: Out-of-distribution (OOD) detection is crucial for building reliable machine learning models. Although negative prompt tuning has enhanced the OOD detection capabilities of vision-language models, these tuned models often suffer from reduced generalization performance on unseen classes and styles. To address this challenge, we propose a novel method called Knowledge Regularized Negative Feature Tuning (KR-NFT), which integrates an innovative adaptation architecture termed Negative Feature Tuning (NFT) and a corresponding knowledge-regularization (KR) optimization strategy. Specifically, NFT applies distribution-aware transformations to pre-trained text features, effectively separating positive and negative features into distinct spaces. This separation maximizes the distinction between in-distribution (ID) and OOD images. Additionally, we introduce image-conditional learnable factors through a lightweight meta-network, enabling dynamic adaptation to individual images and mitigating sensitivity to class and style shifts. Compared to traditional negative prompt tuning, NFT demonstrates superior efficiency and scalability. To optimize this adaptation architecture, the KR optimization strategy is designed to enhance the discrimination between ID and OOD sets while mitigating pre-trained knowledge forgetting. This enhances OOD detection performance on trained ID classes while simultaneously improving OOD detection on unseen ID datasets. Notably, when trained with few-shot samples from ImageNet dataset, KR-NFT not only improves ID classification accuracy and OOD detection but also significantly reduces the FPR95 by 5.44% under an unexplored generalization setting with unseen ID categories. Codes can be found at https://github.com/ZhuWenjie98/KRNFT.
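
The FPR95 metric quoted above is the false positive rate on OOD samples at the threshold where 95% of ID samples are accepted. A minimal sketch of that computation follows, assuming that a higher score means "more in-distribution"; the function name and the way the threshold index is chosen are illustrative, and tie handling may differ between evaluation toolkits.

```python
def fpr_at_95_tpr(id_scores, ood_scores):
    """Fraction of OOD samples accepted when at least 95% of ID samples are accepted."""
    ranked = sorted(id_scores)
    # At most 5% of ID samples may fall below the threshold (integer arithmetic
    # avoids floating-point rounding in the index).
    k = (len(ranked) * 5) // 100
    threshold = ranked[k]
    false_positives = sum(1 for s in ood_scores if s >= threshold)
    return false_positives / len(ood_scores)
```

Lower is better: a value of 0.0 means no OOD sample scores as high as the ID acceptance threshold.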

[61] FineMotion: A Dataset and Benchmark with both Spatial and Temporal Annotation for Fine-grained Motion Generation and Editing

Bizhu Wu,Jinheng Xie,Meidan Ding,Zhe Kong,Jianfeng Ren,Ruibin Bai,Rong Qu,Linlin Shen

Main category: cs.CV

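TL;DR: The paper presents the FineMotion dataset, which contains over 442,000 human motion snippets with detailed descriptions. It supports fine-grained motion generation and editing and significantly improves model accuracy.

Details Motivation: Existing methods for generating human motion often overlook the movements of specific body parts and their timing, so the generated motions lack detail.

Contribution: 1) Presents the FineMotion dataset with fine-grained spatial and temporal annotations; 2) Demonstrates the dataset's value for fine-grained motion generation; 3) Supports zero-shot fine-grained motion editing.

Method: Collects and annotates a large number of human motion snippets with detailed descriptions and uses them in text-driven pipelines to generate and edit fine-grained motion.

Result: Experiments show that the dataset significantly improves the Top-3 accuracy of the MDM model (+15.3%) and supports zero-shot fine-grained motion editing.

Insight: Fine-grained spatial and temporal annotations are crucial for improving motion generation and editing, especially when driven by natural language descriptions.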

Abstract: Generating realistic human motions from textual descriptions has undergone significant advancements. However, existing methods often overlook specific body part movements and their timing. In this paper, we address this issue by enriching the textual description with more details. Specifically, we propose the FineMotion dataset, which contains over 442,000 human motion snippets (short segments of human motion sequences) and their corresponding detailed descriptions of human body part movements. Additionally, the dataset includes about 95k detailed paragraphs describing the body part movements of entire motion sequences. Experimental results demonstrate the significance of our dataset for the text-driven fine-grained human motion generation task, especially with a remarkable +15.3% improvement in Top-3 accuracy for the MDM model. Notably, we further support a zero-shot pipeline for fine-grained motion editing, which focuses on detailed editing in both spatial and temporal dimensions via text. Dataset and code available at: CVI-SZU/FineMotion

[62] A Structure-aware and Motion-adaptive Framework for 3D Human Pose Estimation with Mamba

Ye Lu,Jie Wang,Jianjun Gao,Rui Gong,Chen Cai,Kim-Hui Yap

Main category: cs.CV

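TL;DR: The paper proposes SAMA, a structure-aware and motion-adaptive framework for 3D human pose estimation. By dynamically modeling joint topology and motion characteristics, it achieves higher accuracy at lower computational cost.

Details Motivation: Existing Mamba-based methods for pose estimation struggle to model intricate joint connections and ignore the intrinsic differences across motion characteristics. A method is needed that captures both spatial joint topology and dynamic motion characteristics.

Contribution: The main contribution is the structure-aware and motion-adaptive framework SAMA, with two key modules: the Structure-aware State Integrator (SSI) and the Motion-adaptive State Modulator (MSM), which dynamically model joint relationships and motion characteristics, respectively.

Method: SAMA consists of SSI and MSM. SSI uses dynamic joint relationships to fuse feature-level and state-level information in the state space, while MSM recognizes joint-specific motion characteristics and applies tailored adjustments to different motion patterns.

Result: Across multiple benchmarks, SAMA achieves higher accuracy at lower computational cost.

Insight: By modeling joint topology and motion characteristics independently, SAMA captures complex pose dynamics more effectively while reducing computational overhead.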

Abstract: Recent Mamba-based methods for the pose-lifting task tend to model joint dependencies by 2D-to-1D mapping with diverse scanning strategies. Though effective, they struggle to model intricate joint connections, and they process all joint motion trajectories uniformly while neglecting the intrinsic differences across motion characteristics. In this work, we propose a structure-aware and motion-adaptive framework, named SAMA, to capture spatial joint topology and diverse motion dynamics independently. Specifically, SAMA consists of a Structure-aware State Integrator (SSI) and a Motion-adaptive State Modulator (MSM). The Structure-aware State Integrator leverages dynamic joint relationships to fuse information at both the joint feature and state levels in the state space, based on pose topology rather than sequential state transitions. The Motion-adaptive State Modulator recognizes joint-specific motion characteristics and applies tailored adjustments to diverse motion patterns across different joints. Through these key modules, our algorithm enables structure-aware and motion-adaptive pose lifting. Extensive experiments across multiple benchmarks demonstrate that our algorithm achieves advanced results at lower computational cost.

[63] RaGS: Unleashing 3D Gaussian Splatting from 4D Radar and Monocular Cues for 3D Object Detection

Xiaokai Bai,Chenxu Zhou,Lianqing Zheng,Si-Yuan Cao,Jianan Liu,Xiaohan Zhang,Zhengzhuang Zhang,Hui-liang Shen

Main category: cs.CV

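TL;DR: RaGS is a 3D object detection framework that uses 3D Gaussian Splatting (GS) to fuse 4D radar and monocular vision. By dynamically allocating Gaussians to foreground objects through a cascaded pipeline (FLI, IMA, MGF), RaGS delivers efficient and flexible detection and performs strongly on multiple benchmarks.

Details Motivation: Existing methods for fusing 4D radar with monocular images either rely on instance proposals and lack holistic scene understanding, or are limited by rigid BEV grid structures. A solution is needed that focuses dynamically on objects while still providing comprehensive perception.

Contribution: Proposes RaGS, the first 3D object detection framework to use 3D Gaussian Splatting (GS) for fusing 4D radar and monocular cues, achieving efficient detection through dynamic resource allocation and a cascaded pipeline.

Method: Three key steps: 1) FLI (Frustum-based Localization Initiation) unprojects foreground pixels to initialize Gaussian positions; 2) IMA (Iterative Multimodal Aggregation) fuses semantic and geometric information to refine the Gaussians; 3) MGF (Multi-level Gaussian Fusion) renders the Gaussians into BEV features for detection.

Result: Achieves state-of-the-art performance on the View-of-Delft, TJ4DRadSet, and OmniHD-Scenes benchmarks.

Insight: 3D Gaussians offer flexibility and efficiency for dynamically allocating computational resources, which suits scenes with sparse objects. The cascaded pipeline achieves coarse-to-fine multimodal fusion.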

Abstract: 4D millimeter-wave radar has emerged as a promising sensor for autonomous driving, but effective 3D object detection from both 4D radar and monocular images remains a challenge. Existing fusion approaches typically rely on either instance-based proposals or dense BEV grids, which either lack holistic scene understanding or are limited by rigid grid structures. To address these limitations, we propose RaGS, the first framework to leverage 3D Gaussian Splatting (GS) as the representation for fusing 4D radar and monocular cues in 3D object detection. 3D GS naturally suits 3D object detection by modeling the scene as a field of Gaussians, dynamically allocating resources to foreground objects and providing a flexible, resource-efficient solution. RaGS uses a cascaded pipeline to construct and refine the Gaussian field. It starts with the Frustum-based Localization Initiation (FLI), which unprojects foreground pixels to initialize coarse 3D Gaussian positions. Then, the Iterative Multimodal Aggregation (IMA) fuses semantics and geometry, refining the limited Gaussians toward the regions of interest. Finally, the Multi-level Gaussian Fusion (MGF) renders the Gaussians into multi-level BEV features for 3D object detection. By dynamically focusing on sparse objects within scenes, RaGS concentrates on objects while offering comprehensive scene perception. Extensive experiments on View-of-Delft, TJ4DRadSet, and OmniHD-Scenes benchmarks demonstrate its state-of-the-art performance. Code will be released.
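
The unprojection step in FLI can be illustrated with the standard pinhole back-projection. The sketch assumes a per-pixel depth estimate and known camera intrinsics (fx, fy, cx, cy); where RaGS obtains depth and how it selects foreground pixels are not stated in the abstract, so the function names and inputs here are hypothetical.

```python
def unproject(u, v, depth, fx, fy, cx, cy):
    """Back-project pixel (u, v) with metric depth into the camera frame."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)

def init_gaussian_centers(foreground_pixels, depth_map, intrinsics):
    """Coarse 3D centres for Gaussians, one per foreground pixel.

    foreground_pixels: list of (u, v) integer pixel coordinates (assumed given)
    depth_map:         depth_map[v][u] in metres (assumed given)
    intrinsics:        (fx, fy, cx, cy) of the pinhole camera
    """
    fx, fy, cx, cy = intrinsics
    return [
        unproject(u, v, depth_map[v][u], fx, fy, cx, cy)
        for (u, v) in foreground_pixels
    ]
```

Only foreground pixels produce Gaussians in this sketch, which mirrors the abstract's point that resources are allocated to objects rather than spread over a dense grid.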

[64] OW-CLIP: Data-Efficient Visual Supervision for Open-World Object Detection via Human-AI Collaboration

Junwen Duan,Wei Xue,Ziyao Kang,Shixia Liu,Jiazhi Xia

Main category: cs.CV

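TL;DR: OW-CLIP is a system for data-efficient open-world object detection (OWOD) through human-AI collaboration. It addresses the data dependence, partial feature overfitting, and architecture modification requirements of existing methods.

Details Motivation: Open-world object detection (OWOD) must continually adapt to new annotations, but existing methods require large amounts of annotation, suffer from feature overfitting, and lack flexibility.

Contribution: 1) Presents the OW-CLIP system, which supports data-efficient training; 2) Develops multimodal prompt tuning and the Crop-Smoothing technique; 3) Proposes dual-modal data refinement methods; 4) Designs a visualization interface that improves annotation quality.

Method: Reduces overfitting with multimodal prompt tuning and Crop-Smoothing, generates and filters data with large language models and cross-modal similarity, and improves annotation quality through interactive visualization.

Result: OW-CLIP reaches 89% of state-of-the-art (SOTA) performance with only 3.8% self-generated data, and outperforms existing methods when trained on equivalent data volumes.

Insight: Human-AI collaboration and cross-modal techniques can significantly improve the efficiency and performance of open-world object detection.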

Abstract: Open-world object detection (OWOD) extends traditional object detection to identifying both known and unknown objects, necessitating continuous model adaptation as new annotations emerge. Current approaches face significant limitations: 1) data-hungry training due to reliance on a large number of crowdsourced annotations, 2) susceptibility to “partial feature overfitting,” and 3) limited flexibility due to required model architecture modifications. To tackle these issues, we present OW-CLIP, a visual analytics system that provides curated data and enables data-efficient incremental training of OWOD models. OW-CLIP implements plug-and-play multimodal prompt tuning tailored for OWOD settings and introduces a novel “Crop-Smoothing” technique to mitigate partial feature overfitting. To meet the data requirements of the training methodology, we propose dual-modal data refinement methods that leverage large language models and cross-modal similarity for data generation and filtering. In addition, we develop a visualization interface that enables users to explore and deliver high-quality annotations, including class-specific visual feature phrases and fine-grained differentiated images. Quantitative evaluation demonstrates that OW-CLIP reaches 89% of state-of-the-art performance while requiring only 3.8% self-generated data, and outperforms the SOTA approach when trained with equivalent data volumes. A case study shows the effectiveness of the developed method and the improved annotation quality of our visualization system.

[65] All-in-One Medical Image Restoration with Latent Diffusion-Enhanced Vector-Quantized Codebook Prior

Haowei Chen,Zhiwen Yang,Haotian Hou,Hui Zhang,Bingzheng Wei,Gang Zhou,Yan Xu

Main category: cs.CV

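TL;DR: The paper proposes DiffCode, a framework built on a latent diffusion-enhanced vector-quantized codebook prior, to handle multiple medical image restoration (MedIR) tasks with one model. It outperforms existing methods.

Details Motivation: Task heterogeneity in MedIR (for example, different degradation types) causes diverse information losses, and existing methods struggle to handle these tasks in a unified way. A new framework is needed to integrate task features and improve restoration quality.

Contribution: The DiffCode framework, which combines a task-adaptive codebook bank with a latent diffusion strategy and uses the diffusion model's strong mapping capability to unify multi-task medical image restoration.

Method: 1. Build a task-adaptive codebook bank that integrates task features; 2. introduce a latent diffusion strategy that iteratively refines the latent feature distribution, giving more accurate estimates of the prior features.

Result: On three tasks (MRI super-resolution, CT denoising, and PET synthesis), DiffCode performs strongly in both quantitative metrics and visual quality.

Insight: Using a diffusion model to strengthen prior retrieval from the codebook captures features shared across tasks more effectively, which improves multi-task medical image restoration.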

Abstract: All-in-one medical image restoration (MedIR) aims to address multiple MedIR tasks using a unified model, concurrently recovering various high-quality (HQ) medical images (e.g., MRI, CT, and PET) from low-quality (LQ) counterparts. However, all-in-one MedIR presents significant challenges due to the heterogeneity across different tasks. Each task involves distinct degradations, leading to diverse information losses in LQ images. Existing methods struggle to handle these diverse information losses associated with different tasks. To address these challenges, we propose a latent diffusion-enhanced vector-quantized codebook prior and develop DiffCode, a novel framework leveraging this prior for all-in-one MedIR. Specifically, to compensate for diverse information losses associated with different tasks, DiffCode constructs a task-adaptive codebook bank to integrate task-specific HQ prior features across tasks, capturing a comprehensive prior. Furthermore, to enhance prior retrieval from the codebook bank, DiffCode introduces a latent diffusion strategy that utilizes the diffusion model’s powerful mapping capabilities to iteratively refine the latent feature distribution, estimating more accurate HQ prior features during restoration. With the help of the task-adaptive codebook bank and latent diffusion strategy, DiffCode achieves superior performance in both quantitative metrics and visual quality across three MedIR tasks: MRI super-resolution, CT denoising, and PET synthesis.

[66] ATCTrack: Aligning Target-Context Cues with Dynamic Target States for Robust Vision-Language Tracking

X. Feng,S. Hu,X. Li,D. Zhang,M. Wu,J. Zhang,X. Chen,K. Huang

Main category: cs.CV

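TL;DR: ATCTrack is a new vision-language tracking method that achieves robust tracking by dynamically aligning target-context cues, addressing the weaknesses of existing methods in complex long-term scenarios.

Details Motivation: Vision-language tracking is not robust enough in complex long-term scenarios. Existing methods struggle to keep target-context cues aligned with the changing target state, and in particular have difficulty telling target words from context words in the text prompt.

Contribution: 1. Target-context feature modeling aligned with dynamic target states; 2. precise identification of target words from textual content alone, with adaptive calibration of context words; 3. SOTA performance on mainstream benchmarks.

Method: 1. Visual modality: temporal target-context modeling supplies timely visual cues; 2. textual modality: target words are identified precisely and context words are calibrated adaptively.

Result: ATCTrack achieves new SOTA performance on mainstream benchmarks.

Insight: Dynamically aligning target-context cues is key to robust vision-language tracking, and how well the tracker handles text prompts has a marked effect on performance.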

Abstract: Vision-language tracking aims to locate the target object in the video sequence using a template patch and a language description provided in the initial frame. To achieve robust tracking, especially in complex long-term scenarios that reflect real-world conditions as recently highlighted by MGIT, it is essential not only to characterize the target features but also to utilize the context features related to the target. However, the visual and textual target-context cues derived from the initial prompts generally align only with the initial target state. Target states change constantly, particularly in complex long-term sequences, so it is intractable for these cues to continuously guide Vision-Language Trackers (VLTs). Furthermore, for text prompts with diverse expressions, our experiments reveal that existing VLTs struggle to discern which words pertain to the target and which to the context, complicating the utilization of textual cues. In this work, we present a novel tracker named ATCTrack, which can obtain multimodal cues Aligned with the dynamic target states through comprehensive Target-Context feature modeling, thereby achieving robust tracking. Specifically, (1) for the visual modality, we propose an effective temporal visual target-context modeling approach that provides the tracker with timely visual cues. (2) For the textual modality, we achieve precise target word identification based solely on textual content, and design an innovative context word calibration method to adaptively utilize auxiliary context words. (3) We conduct extensive experiments on mainstream benchmarks, and ATCTrack achieves new SOTA performance. The code and models will be released at: https://github.com/XiaokunFeng/ATCTrack.

[67] Efficient Self-Supervised Neuro-Analytic Visual Servoing for Real-time Quadrotor Control

Sebastian Mocanu,Sebastian-Ion Nae,Mihai-Eugen Barbu,Marius Leordeanu

Main category: cs.CV

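TL;DR: The paper proposes a self-supervised lightweight neural network (the student) that learns from an improved analytical image-based visual servoing (IBVS) controller (the teacher) to deliver fast, stable visual control of a quadrotor. The method resolves the numerical instability of classical IBVS, and knowledge distillation substantially speeds up inference.

Details Motivation: In GPS-denied indoor environments, drones must rely on visual information for real-time control. Classical IBVS suffers from numerical instability and high computational cost, so an efficient, stable alternative is needed.

Contribution: 1. An improved IBVS teacher controller that resolves the numerical instability of classical approaches; 2. a two-stage segmentation pipeline (YOLOv11 + U-Net) for robust anterior-posterior vehicle segmentation and target orientation estimation; 3. an efficient teacher-student knowledge distillation system that transfers IBVS capabilities to a lightweight student network suitable for real-time deployment.

Method: 1. The teacher uses an improved IBVS controller for stability; 2. the student learns from the teacher through knowledge distillation; 3. the segmentation pipeline combines YOLOv11 and U-Net for object detection and segmentation.

Result: The student network runs inference 11x faster than the teacher with comparable control accuracy and lower computational and memory cost. The method is validated in indoor environments.

Insight: Combining classical control theory with deep learning can yield efficient, stable visual servoing. Knowledge distillation can transfer the capabilities of a complex analytical model to a lightweight network, which suits resource-constrained real-time applications.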

Abstract: This work introduces a self-supervised, cost-efficient neuro-analytical model for vision-based quadrotor control in which a small 1.7M-parameter student ConvNet learns automatically from an analytical teacher, an improved image-based visual servoing (IBVS) controller. Our IBVS system solves numerical instabilities by reducing the classical visual servoing equations and enabling efficient, stable image feature detection. Through knowledge distillation, the student model achieves 11x faster inference compared to the teacher IBVS pipeline, while demonstrating similar control accuracy at a significantly lower computational and memory cost. Our vision-only self-supervised neuro-analytic control enables quadrotor orientation and movement without requiring explicit geometric models or fiducial markers. The proposed methodology leverages simulation-to-reality transfer learning and is validated on a small drone platform in GPS-denied indoor environments. Our key contributions include: (1) an analytical IBVS teacher that solves numerical instabilities inherent in classical approaches, (2) a two-stage segmentation pipeline combining YOLOv11 with a U-Net-based mask splitter for robust anterior-posterior vehicle segmentation to correctly estimate the orientation of the target, and (3) an efficient knowledge distillation dual-path system, which transfers geometric visual servoing capabilities from the analytical IBVS teacher to a compact student neural network that outperforms the teacher, while being suitable for real-time onboard deployment.

[68] Interpretable Open-Vocabulary Referring Object Detection with Reverse Contrast Attention

Drandreb Earl O. Juanico,Rowel O. Atienza,Jeffrey Kenneth Go

Main category: cs.CV

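TL;DR: The paper proposes Reverse Contrast Attention (RCA), a plug-and-play method that needs no retraining and improves object localization in vision-language transformers by reweighting final-layer attention.

Details Motivation: In existing vision-language transformers, extreme activation or suppression values in the attention mechanism can prevent semantically relevant but subdued tokens from guiding predictions during object localization. The paper aims to fix this.

Contribution: 1. The RCA method; 2. the FitAP metric; 3. validation of RCA on multiple open-source VLMs, with gains of up to 26.6%.

Method: RCA reweights the attention layer by suppressing extreme activations and amplifying mid-level ones, giving semantically relevant tokens more influence on predictions.

Result: RCA improves FitAP on 11 of 15 open-source VLMs and is especially effective for late-fusion models.

Insight: Attention sharpness and fusion timing are key factors in performance. RCA offers both interpretability and performance gains for multimodal transformers.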

Abstract: We propose Reverse Contrast Attention (RCA), a plug-in method that enhances object localization in vision-language transformers without retraining. RCA reweights final-layer attention by suppressing extremes and amplifying mid-level activations to let semantically relevant but subdued tokens guide predictions. We evaluate it on Open Vocabulary Referring Object Detection (OV-RefOD), introducing FitAP, a confidence-free average precision metric based on IoU and box area. RCA improves FitAP in 11 out of 15 open-source VLMs, with gains up to +26.6%. Effectiveness aligns with attention sharpness and fusion timing; while late-fusion models benefit consistently, models like DeepSeek-VL2 also improve, pointing to capacity and disentanglement as key factors. RCA offers both interpretability and performance gains for multimodal transformers.

[69] TrackAny3D: Transferring Pretrained 3D Models for Category-unified 3D Point Cloud Tracking

Mengmeng Wang,Haonan Wang,Yulong Li,Xiangjie Kong,Jiaxin Du,Guojiang Shen,Feng Xia

Main category: cs.CV

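TL;DR: TrackAny3D is a general 3D point cloud tracking framework built on pretrained 3D models. It uses adapters and a mixture-of-experts architecture to adapt to geometric characteristics across categories.

Details Motivation: Most existing 3D single object tracking methods are category-specific, generalize poorly, and are impractical. TrackAny3D addresses this limitation by leveraging pretrained models for cross-category tracking.

Contribution: 1) The first general 3D tracking framework built on pretrained 3D models; 2) parameter-efficient adapters; 3) the MoGE architecture and a temporal context optimization strategy.

Method: Adapters bridge pretraining and the tracking task, MoGE adaptively activates subnetworks, and temporal context optimization propagates historical information.

Result: SOTA performance on multiple benchmarks, showing strong generalization and practicality.

Insight: Unified models and large-scale pretrained models matter for 3D tracking and can markedly improve generalization.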

Abstract: 3D LiDAR-based single object tracking (SOT) relies on sparse and irregular point clouds, posing challenges from geometric variations in scale, motion patterns, and structural complexity across object categories. Current category-specific approaches achieve good accuracy but are impractical for real-world use, requiring separate models for each category and showing limited generalization. To tackle these issues, we propose TrackAny3D, the first framework to transfer large-scale pretrained 3D models for category-agnostic 3D SOT. We first integrate parameter-efficient adapters to bridge the gap between pretraining and tracking tasks while preserving geometric priors. Then, we introduce a Mixture-of-Geometry-Experts (MoGE) architecture that adaptively activates specialized subnetworks based on distinct geometric characteristics. Additionally, we design a temporal context optimization strategy that incorporates learnable temporal tokens and a dynamic mask weighting module to propagate historical information and mitigate temporal drift. Experiments on three commonly-used benchmarks show that TrackAny3D establishes new state-of-the-art performance on category-agnostic 3D SOT, demonstrating strong generalization and competitiveness. We hope this work will enlighten the community on the importance of unified models and further expand the use of large-scale pretrained models in this field.

[70] DriveIndia: An Object Detection Dataset for Diverse Indian Traffic Scenes

Rishav Kumar,D. Santhosh Reddy,P. Rajalakshmi

Main category: cs.CV

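TL;DR: DriveIndia is a large-scale object detection dataset for complex Indian traffic scenes. It contains 66,986 high-resolution images covering 24 traffic-relevant object categories and comes with baseline results.

Details Motivation: The complexity and diversity of Indian traffic environments challenge object detection, and existing datasets do not cover these scenes adequately.

Contribution: Releases the DriveIndia dataset, which fills the gap in datasets for Indian traffic scenes and supports autonomous driving research.

Method: The dataset was collected over 120+ hours and 3,400+ kilometers, annotated in YOLO format, and benchmarked with YOLO family models.

Result: The best model reaches an mAP50 of 78.7%, supporting the dataset's usefulness.

Insight: The dataset underscores the importance of diverse traffic scenes and provides a new benchmark for autonomous driving in complex environments.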

Abstract: We introduce DriveIndia, a large-scale object detection dataset purpose-built to capture the complexity and unpredictability of Indian traffic environments. The dataset contains 66,986 high-resolution images annotated in YOLO format across 24 traffic-relevant object categories, encompassing diverse conditions such as varied weather (fog, rain), illumination changes, heterogeneous road infrastructure, and dense, mixed traffic patterns. The images were collected over 120+ hours and cover 3,400+ kilometers across urban, rural, and highway routes. DriveIndia offers a comprehensive benchmark for real-world autonomous driving challenges. We provide baseline results using state-of-the-art YOLO family models, with the top-performing variant achieving an mAP50 of 78.7%. Designed to support research in robust, generalizable object detection under uncertain road conditions, DriveIndia will be publicly available via the TiHAN-IIT Hyderabad dataset repository (https://tihan.iith.ac.in/tiand-datasets/).
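
For readers unfamiliar with YOLO-format annotations, the sketch below converts one label line into a pixel-space bounding box. It assumes the common YOLO text convention of `class_id x_center y_center width height`, with the four box values normalized to the range 0 to 1. The function name, the example label, and the 1920x1080 image size are hypothetical and are not taken from the DriveIndia release.

```python
def yolo_to_pixel_box(label_line, img_width, img_height):
    """Convert one YOLO label line to (class_id, x_min, y_min, x_max, y_max) in pixels.

    Assumes the common convention: "class_id x_center y_center width height",
    with the four box values normalized by the image width and height.
    """
    parts = label_line.split()
    if len(parts) != 5:
        raise ValueError(f"expected 5 fields, got {len(parts)}: {label_line!r}")
    class_id = int(parts[0])
    x_center, y_center, box_w, box_h = (float(p) for p in parts[1:])
    x_min = (x_center - box_w / 2) * img_width
    y_min = (y_center - box_h / 2) * img_height
    x_max = (x_center + box_w / 2) * img_width
    y_max = (y_center + box_h / 2) * img_height
    return class_id, x_min, y_min, x_max, y_max


if __name__ == "__main__":
    # Hypothetical label: class 3, box centered in a 1920x1080 image.
    print(yolo_to_pixel_box("3 0.5 0.5 0.25 0.5", 1920, 1080))
```

Keeping coordinates normalized in the label file means the same annotation stays valid if the image is resized, which is one reason the format is widely used for detection datasets.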

[71] A mini-batch training strategy for deep subspace clustering networks

Yuxuan Jiang,Chenwei Yu,Zhi Lin,Xiaolan Liu

Main category: cs.CV

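TL;DR: The paper proposes a mini-batch training strategy that removes the full-batch bottleneck of deep subspace clustering networks. It combines a memory bank with contrastive learning to make subspace clustering of high-resolution images scalable.

Details Motivation: Existing deep subspace clustering methods rely on full-batch processing, which is computationally inefficient and hard to scale to high-resolution images. The paper addresses this with a mini-batch strategy and proposes a decoder-free contrastive learning framework to reduce cost further.

Contribution: 1. A memory bank that enables mini-batch training and removes the full-batch bottleneck. 2. A decoder-free framework based on contrastive learning that cuts computational overhead while maintaining performance.

Method: 1. A memory bank stores global feature representations to support mini-batch training. 2. Contrastive learning replaces the autoencoder, avoiding the high computational cost of decoder training.

Result: On the COIL100 and ORL datasets, the method outperforms other state-of-the-art subspace clustering methods while performing comparably to full-batch methods.

Insight: Combining a memory bank with contrastive learning enables efficient mini-batch training while preserving subspace clustering performance, which is valuable for high-resolution image processing.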

Abstract: Mini-batch training is a cornerstone of modern deep learning, offering computational efficiency and scalability for training complex architectures. However, existing deep subspace clustering (DSC) methods, which typically combine an autoencoder with a self-expressive layer, rely on full-batch processing. The bottleneck arises from the self-expressive module, which requires representations of the entire dataset to construct a self-representation coefficient matrix. In this work, we introduce a mini-batch training strategy for DSC by integrating a memory bank that preserves global feature representations. Our approach enables scalable training of deep architectures for subspace clustering with high-resolution images, overcoming previous limitations. Additionally, to efficiently fine-tune large-scale pre-trained encoders for subspace clustering, we propose a decoder-free framework that leverages contrastive learning instead of autoencoding for representation learning. This design not only eliminates the computational overhead of decoder training but also provides competitive performance. Extensive experiments demonstrate that our approach not only achieves performance comparable to full-batch methods, but outperforms other state-of-the-art subspace clustering methods on the COIL100 and ORL datasets by fine-tuning deep networks.

[72] HumanSAM: Classifying Human-centric Forgery Videos in Human Spatial, Appearance, and Motion Anomaly

Chang Liu,Yunfan Ye,Fan Zhang,Qingyang Zhou,Yuchuan Luo,Zhiping Cai

Main category: cs.CV

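TL;DR: HumanSAM is a fine-grained classification framework that identifies spatial, appearance, and motion anomalies in human-centric forgery videos. It builds the forgery representation by fusing video understanding and spatial depth features, and adds a rank-based confidence enhancement strategy for robustness.

Details Motivation: Human-centric forgery videos synthesized by generative models threaten information security. Existing binary detection methods lack fine-grained classification, which hurts reliability and interpretability.

Contribution: 1) The HumanSAM framework, which classifies forgery videos into three fine-grained types: spatial, appearance, and motion anomalies; 2) multi-branch feature fusion and a rank-based confidence enhancement strategy; 3) HFV, the first public dataset annotated with fine-grained forgery types.

Method: 1) Fuse the video understanding and spatial depth branches to generate the forgery representation; 2) introduce a rank-based confidence enhancement strategy that uses prior scores for geometry, semantics, and spatiotemporal consistency to improve training; 3) train and evaluate on the HFV dataset.

Result: Experiments show that HumanSAM outperforms existing methods in both binary and multi-class forgery classification.

Insight: Fine-grained classification and robust feature representations are essential for forgery video detection. Multi-branch feature fusion and prior knowledge can markedly improve model performance.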

Abstract: Numerous synthesized videos from generative models, especially human-centric ones that simulate realistic human actions, pose significant threats to human information security and authenticity. While progress has been made in binary forgery video detection, the lack of fine-grained understanding of forgery types raises concerns regarding both reliability and interpretability, which are critical for real-world applications. To address this limitation, we propose HumanSAM, a new framework that builds upon the fundamental challenges of video generation models. Specifically, HumanSAM aims to classify human-centric forgeries into three distinct types of artifacts commonly observed in generated content: spatial, appearance, and motion anomaly. To better capture the features of geometry, semantics, and spatiotemporal consistency, we propose to generate the human forgery representation by fusing two branches of video understanding and spatial depth. We also adopt a rank-based confidence enhancement strategy during the training process to learn a more robust representation by introducing three prior scores. For training and evaluation, we construct the first public benchmark, the Human-centric Forgery Video (HFV) dataset, with all types of forgeries carefully annotated semi-automatically. In our experiments, HumanSAM yields promising results in comparison with state-of-the-art methods, both in binary and multi-class forgery classification.

[73] MambaVesselNet++: A Hybrid CNN-Mamba Architecture for Medical Image Segmentation

Qing Xu,Yanming Chen,Yue Li,Ziyu Liu,Zhenye Lou,Yixuan Zhang,Xiangjian He

Main category: cs.CV

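TL;DR: MambaVesselNet++ is a hybrid CNN-Mamba architecture for medical image segmentation. It targets the limited local receptive field of traditional convolutional methods and the high computational cost of Transformers.

Details Motivation: In medical image segmentation, traditional convolutional methods are limited by their local receptive field, while Transformers capture global context at a high computational cost. Mamba offers an alternative because it models long-range dependencies efficiently.

Contribution: Proposes MambaVesselNet++, a hybrid CNN-Mamba architecture that uses a texture-aware layer and linear-complexity Mamba modules to capture local and global features efficiently.

Method: The model uses a hybrid image encoder (Hi-Encoder) and a bifocal fusion decoder (BF-Decoder). In the Hi-Encoder, convolutions capture low-level semantic features and Mamba models long-range dependencies. The BF-Decoder fuses local and global information through skip connections.

Result: The model outperforms existing convolution-based, Transformer-based, and Mamba-based models across 2D, 3D, and instance segmentation tasks on medical images.

Insight: Mamba models long-range dependencies efficiently in medical image segmentation, and combining it with a CNN improves performance further while avoiding the high computational cost of Transformers.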

Abstract: Medical image segmentation plays an important role in computer-aided diagnosis. Traditional convolution-based U-shape segmentation architectures are usually limited by the local receptive field. Existing vision transformers have been widely applied to diverse medical segmentation frameworks due to their superior capabilities of capturing global contexts. Despite this advantage, the real-world application of vision transformers is challenged by their non-linear self-attention mechanism, which incurs huge computational costs. To address this issue, the selective state space model (SSM) Mamba has gained recognition for its adeptness in modeling long-range dependencies in sequential data, particularly noted for its efficient memory costs. In this paper, we propose MambaVesselNet++, a hybrid CNN-Mamba framework for medical image segmentation. Our MambaVesselNet++ comprises a hybrid image encoder (Hi-Encoder) and a bifocal fusion decoder (BF-Decoder). In the Hi-Encoder, we first devise the texture-aware layer to capture low-level semantic features by leveraging convolutions. Then, we utilize Mamba to effectively model long-range dependencies with linear complexity. The BF-Decoder adopts skip connections to combine local and global information from the Hi-Encoder for the accurate generation of segmentation masks. Extensive experiments demonstrate that MambaVesselNet++ outperforms current convolution-based, transformer-based, and Mamba-based state-of-the-art methods across diverse medical 2D, 3D, and instance segmentation tasks. The code is available at https://github.com/CC0117/MambaVesselNet.

[74] LLMControl: Grounded Control of Text-to-Image Diffusion-based Synthesis with Multimodal LLMs

Jiaze Wang,Rui Chen,Haowang Cui

Main category: cs.CV

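TL;DR: LLMControl is a framework that uses a multimodal LLM to control text-to-image diffusion models precisely, improving on existing methods for prompts with complex spatial compositions and multiple objects.

Details Motivation: Existing spatial control methods struggle to generate matching images for text prompts that contain multiple objects or complex spatial compositions. LLMControl aims to solve this problem.

Contribution: Introduces a multimodal LLM as a global controller that augments semantic descriptions and arranges spatial layouts, and improves the diffusion model's generation quality through the attention mechanism.

Method: A multimodal LLM generates control signals, which are injected into the denoising network to refine the attention maps so that textual and visual conditions influence generation in a complementary way.

Result: Across several pretrained T2I models, LLMControl produces higher generation quality than existing methods, particularly under complex input conditions.

Insight: The global control capability of a multimodal LLM can markedly improve diffusion models on complex generation tasks.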

Abstract: Recent spatial control methods for text-to-image (T2I) diffusion models have shown compelling results. However, these methods still fail to precisely follow the control conditions and generate the corresponding images, especially when encountering textual prompts that contain multiple objects or have complex spatial compositions. In this work, we present an LLM-guided framework called LLM_Control to address the challenges of the controllable T2I generation task. By improving grounding capabilities, LLM_Control is introduced to accurately modulate the pre-trained diffusion models, where visual conditions and textual prompts influence the structures and appearance generation in a complementary way. We utilize the multimodal LLM as a global controller to arrange spatial layouts, augment semantic descriptions, and bind object attributes. The obtained control signals are injected into the denoising network to refocus and enhance attention maps according to novel sampling constraints. Extensive qualitative and quantitative experiments have demonstrated that LLM_Control achieves competitive synthesis quality compared to other state-of-the-art methods across various pre-trained T2I models. It is noteworthy that LLM_Control allows the challenging input conditions on which most of the existing methods

[75] SCALAR: Scale-wise Controllable Visual Autoregressive Learning

Ryan Xu,Dongyang Jin,Yancheng Bai,Rui Lan,Xu Duan,Lei Sun,Xiangxiang Chu

Main category: cs.CV

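TL;DR: SCALAR is a controllable image generation method based on Visual Autoregressive (VAR) models. Its novel Scale-wise Conditional Decoding mechanism addresses the inefficient control encoding and injection mechanisms of existing methods.

Details Motivation: Existing VAR models face inefficient control encoding and disruptive injection mechanisms in controllable image generation, which hurt both generation quality and efficiency.

Contribution: Proposes SCALAR, which introduces a Scale-wise Conditional Decoding mechanism for efficient, fine-grained controllable image generation.

Method: Building on VAR, the method designs a Scale-wise Conditional Decoding mechanism that controls the generation process scale by scale.

Result: SCALAR outperforms existing methods in generation quality and control efficiency.

Insight: Scale-wise conditional decoding opens a new research direction for controllable generation with visual autoregressive models.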

Abstract: Controllable image synthesis, which enables fine-grained control over generated outputs, has emerged as a key focus in visual generative modeling. However, controllable generation remains challenging for Visual Autoregressive (VAR) models due to their hierarchical, next-scale prediction style. Existing VAR-based methods often suffer from inefficient control encoding and disruptive injection mechanisms that compromise both fidelity and efficiency. In this work, we present SCALAR, a controllable generation method based on VAR, incorporating a novel Scale-wise Conditional Decoding mechanism. SCALAR leverages a

[76] UniCT Depth: Event-Image Fusion Based Monocular Depth Estimation with Convolution-Compensated ViT Dual SA Block

Luoxi Jing,Dianxi Shi,Zhe Liu,Songchang Jin,Chunping Qiu,Ziteng Qiao,Yuxian Li,Jianqiang Xia

Main category: cs.CV

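TL;DR: The paper proposes UniCT Depth, an event-image fusion method that combines the strengths of CNNs and Transformers. A new CcViT-DA block models local and global features, and a DCC block improves detail.

Details Motivation: Image-based methods perform poorly in challenging scenes, while event cameras offer high dynamic range but produce sparse data, and fusing the two remains difficult. Existing CNN-based and Transformer-based fusion methods have limitations, so a new solution is needed.

Contribution: Proposes UniCT Depth, which combines CNNs and Transformers, and designs the CcViT-DA block (containing CMSA and MFSA) and the DCC block for effective modality fusion and detail enhancement.

Method: Within the CcViT-DA block, CMSA captures spatial dependencies and MFSA performs cross-modal fusion. The DCC block compensates for texture details. The encoder unifies local and global features.

Result: Experiments show that UniCT Depth outperforms existing image-based, event-based, and fusion-based methods on monocular depth estimation.

Insight: Combining CNNs and Transformers through dual self-attention (CMSA and MFSA) and detail compensation (DCC) can markedly improve modality fusion and depth estimation.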

Abstract: Depth estimation plays a crucial role in 3D scene understanding and is extensively used in a wide range of vision tasks. Image-based methods struggle in challenging scenarios, while event cameras offer high dynamic range and temporal resolution but face difficulties with sparse data. Combining event and image data provides significant advantages, yet effective integration remains challenging. Existing CNN-based fusion methods struggle with occlusions and depth disparities due to limited receptive fields, while Transformer-based fusion methods often lack deep modality interaction. To address these issues, we propose UniCT Depth, an event-image fusion method that unifies CNNs and Transformers to model local and global features. We propose the Convolution-compensated ViT Dual SA (CcViT-DA) Block, designed for the encoder, which integrates Context Modeling Self-Attention (CMSA) to capture spatial dependencies and Modal Fusion Self-Attention (MFSA) for effective cross-modal fusion. Furthermore, we design the tailored Detail Compensation Convolution (DCC) Block to improve texture details and enhance edge representations. Experiments show that UniCT Depth outperforms existing image, event, and fusion-based monocular depth estimation methods across key metrics.

[77] AF-CLIP: Zero-Shot Anomaly Detection via Anomaly-Focused CLIP Adaptation

Qingqing Fang,Wenxi Lv,Qinliang Su

Main category: cs.CV

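TL;DR: AF-CLIP adapts CLIP's visual representations to focus on local anomalies. It introduces a lightweight adapter and a multi-scale spatial aggregation mechanism, and performs well in both zero-shot and few-shot settings.

Details Motivation: Existing visual anomaly detection methods need many training samples, which limits their use in zero-shot or few-shot settings. CLIP has zero-shot recognition capability, but its visual features are not optimized for local anomalies, which limits its effectiveness.

Contribution: 1. AF-CLIP, which uses an adapter to refocus CLIP's visual features on local anomalies; 2. a multi-scale spatial aggregation mechanism; 3. learnable textual prompts; 4. strong performance in zero-shot and few-shot settings.

Method: 1. A lightweight adapter optimizes the visual features; 2. a multi-scale spatial aggregation mechanism consolidates neighborhood context; 3. learnable textual prompts are designed; 4. the model is optimized with a composite objective function.

Result: Experiments on industrial and medical datasets confirm the method's effectiveness and generalization, with strong performance in zero-shot and few-shot settings.

Insight: Optimizing CLIP features with an adapter and a multi-scale mechanism markedly improves local anomaly detection, and learnable textual prompts help the model characterize normal and abnormal states generically.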

Abstract: Visual anomaly detection has been widely used in industrial inspection and medical diagnosis. Existing methods typically demand substantial training samples, limiting their utility in zero-/few-shot scenarios. While recent efforts have leveraged CLIP’s zero-shot recognition capability for this task, they often ignore optimizing visual features to focus on local anomalies, reducing their efficacy. In this work, we propose AF-CLIP (Anomaly-Focused CLIP) by dramatically enhancing its visual representations to focus on local defects. Our approach introduces a lightweight adapter that emphasizes anomaly-relevant patterns in visual features, simultaneously optimizing both class-level features for image classification and patch-level features for precise localization. To capture anomalies of different sizes and improve detection accuracy, prior to the adapter, we develop a multi-scale spatial aggregation mechanism to effectively consolidate neighborhood context. Complementing these visual enhancements, we design learnable textual prompts that generically characterize normal and abnormal states. After optimization on auxiliary datasets using a composite objective function, AF-CLIP demonstrates strong zero-shot detection capability. Our method is also extended to few-shot scenarios by extra memory banks. Experimental results across diverse industrial and medical datasets demonstrate the effectiveness and generalization of our proposed method. Code is available at https://github.com/Faustinaqq/AF-CLIP.

[78] RARE: Refine Any Registration of Pairwise Point Clouds via Zero-Shot Learning

Chengyu Zheng,Jin Huang,Honghua Chen,Mingqiang Wei

Main category: cs.CV

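TL;DR: The paper proposes RARE, a zero-shot method that combines diffusion features from depth images with geometric features to refine point cloud registration accuracy without a training dataset.

Details Motivation: Large-scale pretrained diffusion models have shown promise for semantic correspondence in images, which motivates using diffusion features to improve point cloud registration.

Contribution: Proposes a zero-shot method that uses diffusion features from depth images to enrich point feature representations and thereby refine existing registration algorithms.

Method: The point cloud is projected into depth maps from multiple viewpoints, implicit features are extracted from a pretrained diffusion network, and these are combined with geometric features to refine the correspondences.

Result: Experiments show that the method significantly improves registration accuracy and generalizes well across multiple datasets.

Insight: Diffusion features add complementary semantic information to point cloud registration and demonstrate the potential of zero-shot learning.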

Abstract: Recent research leveraging large-scale pretrained diffusion models has demonstrated the potential of using diffusion features to establish semantic correspondences in images. Inspired by advancements in diffusion-based techniques, we propose a novel zero-shot method for refining point cloud registration algorithms. Our approach leverages correspondences derived from depth images to enhance point feature representations, eliminating the need for a dedicated training dataset. Specifically, we first project the point cloud into depth maps from multiple perspectives and extract implicit knowledge from a pretrained diffusion network as depth diffusion features. These features are then integrated with geometric features obtained from existing methods to establish more accurate correspondences between point clouds. By leveraging these refined correspondences, our approach achieves significantly improved registration accuracy. Extensive experiments demonstrate that our method not only enhances the performance of existing point cloud registration techniques but also exhibits robust generalization capabilities across diverse datasets. Codes are available at https://github.com/zhengcy-lambo/RARE.git.

[79] Predicting Brain Responses To Natural Movies With Multimodal LLMs

Cesar Kadir Torrico Villanueva,Jiaxin Cindy Tu,Mihir Tripathy,Connor Lane,Rishab Iyer,Paul S. Scotti

Main category: cs.CV

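TL;DR: The paper presents a multimodal LLM framework for predicting brain responses to natural movies. It fuses features from video, speech, text, and other modalities and maps them to cortical parcels through a lightweight encoder. The entry placed fourth in the Algonauts 2025 challenge.

Details Motivation: The goal is to improve the accuracy of brain response prediction by combining the representational power of multimodal pretrained models, especially generalization to novel movie stimuli.

Contribution: 1) Fusing features from pretrained models across multiple modalities (video, speech, text, and others); 2) a lightweight encoder with a shared group head and subject-specific residual heads; 3) improved generalization through model selection and ensembling.

Method: 1) Extract features from several pretrained models; 2) map the features to the fMRI time series through linear projection and temporal alignment; 3) use a simple architecture with a shared group head and subject-specific residual heads; 4) train and ensemble hundreds of model variants.

Result: The mean Pearson correlation on the test set was 0.2085, ranking fourth. A further optimization could have raised the entry to second place.

Insight: Combining multimodal features with a lightweight architecture effectively improves the generalization of brain response prediction, especially for novel stimuli.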

Abstract: We present MedARC’s team solution to the Algonauts 2025 challenge. Our pipeline leveraged rich multimodal representations from various state-of-the-art pretrained models across video (V-JEPA2), speech (Whisper), text (Llama 3.2), vision-text (InternVL3), and vision-text-audio (Qwen2.5-Omni). These features extracted from the models were linearly projected to a latent space, temporally aligned to the fMRI time series, and finally mapped to cortical parcels through a lightweight encoder comprising a shared group head plus subject-specific residual heads. We trained hundreds of model variants across hyperparameter settings, validated them on held-out movies and assembled ensembles targeted to each parcel in each subject. Our final submission achieved a mean Pearson’s correlation of 0.2085 on the test split of withheld out-of-distribution movies, placing our team in fourth place for the competition. We further discuss a last-minute optimization that would have raised us to second place. Our results highlight how combining features from models trained in different modalities, using a simple architecture consisting of shared-subject and single-subject components, and conducting comprehensive model selection and ensembling improves generalization of encoding models to novel movie stimuli. All code is available on GitHub.
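
As a reference for the reported score, the sketch below computes Pearson's correlation between a predicted and a measured time series and averages it over parcels. It is an illustration only: the function names are hypothetical, the zero-variance fallback is an assumption, and the exact averaging scheme used by the challenge (for example, across subjects) is not specified in the abstract.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("sequences must have equal length of at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumed convention for a constant series
    return cov / math.sqrt(var_x * var_y)


def mean_parcel_correlation(predicted, measured):
    """Average Pearson r over parcels; each argument is a list of time series."""
    scores = [pearson_r(p, m) for p, m in zip(predicted, measured)]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    predicted = [[1.0, 2.0, 3.0], [1.0, 2.0, 3.0]]
    measured = [[2.0, 4.0, 6.0], [3.0, 2.0, 1.0]]
    print(mean_parcel_correlation(predicted, measured))
```

Because Pearson's r is invariant to scale and offset, an encoding model is rewarded for tracking the shape of the response over time rather than its absolute amplitude.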

[80] Region-based Cluster Discrimination for Visual Representation Learning

Yin Xie,Kaicheng Yang,Xiang An,Kun Wu,Yongle Zhao,Weimo Deng,Zimin Ran,Yumeng Wang,Ziyong Feng,Roy Miles,Ismail Elezi,Jiankang Deng

Main category: cs.CV

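TL;DR: RICE is a new region-aware cluster discrimination method that improves visual representation learning by strengthening region-level visual and OCR capabilities. It performs well on segmentation, dense detection, and related tasks.

Details Motivation: Existing vision-language contrastive models such as CLIP and SigLIP rely on global representations, which limits their effectiveness on dense prediction tasks such as grounding, OCR, and segmentation. RICE aims to close this gap.

Contribution: 1. The RICE method, which enhances region-level visual and OCR capabilities; 2. a billion-scale candidate region dataset; 3. a unified region cluster discrimination loss that supports joint object and OCR learning.

Method: 1. Construct a large-scale candidate region dataset; 2. introduce a Region Transformer layer to extract regional semantics; 3. design a unified region cluster discrimination loss that supports distributed training.

Result: RICE outperforms previous methods on segmentation, dense detection, and visual perception for MLLMs. The pretrained models have been released.

Insight: Through region-level representation learning, RICE overcomes the limits that global representations place on dense tasks and shows the effectiveness of a unified joint learning framework.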

Abstract: Learning visual representations is foundational for a broad spectrum of downstream tasks. Although recent vision-language contrastive models, such as CLIP and SigLIP, have achieved impressive zero-shot performance via large-scale vision-language alignment, their reliance on global representations constrains their effectiveness for dense prediction tasks, such as grounding, OCR, and segmentation. To address this gap, we introduce Region-Aware Cluster Discrimination (RICE), a novel method that enhances region-level visual and OCR capabilities. We first construct a billion-scale candidate region dataset and propose a Region Transformer layer to extract rich regional semantics. We further design a unified region cluster discrimination loss that jointly supports object and OCR learning within a single classification framework, enabling efficient and scalable distributed training on large-scale data. Extensive experiments show that RICE consistently outperforms previous methods on tasks, including segmentation, dense detection, and visual perception for Multimodal Large Language Models (MLLMs). The pre-trained models have been released at https://github.com/deepglint/MVT.

[81] TAPS : Frustratingly Simple Test Time Active Learning for VLMs

Dhruv Sarkar,Aprameyo Chakrabartty,Bibhudatta Bhanja

Main category: cs.CV

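TL;DR: The paper proposes TAPS, a real-time Test-Time Active Learning (TTAL) framework for vision-language models (VLMs). It targets continuous data streams with one sample at a time and improves performance through a dynamically adjusted entropy threshold and a class-balanced replacement strategy.

Details Motivation: The paper addresses how to make effective use of an oracle for active querying in a real-time data stream where only one sample is available at a time, while respecting latency and memory constraints, with safety-critical applications in mind.

Contribution: 1. Proposes a TTAL framework that supports single-sample streaming; 2. Introduces a dynamic entropy threshold and a class-balanced replacement strategy; 3. Adds a class-aware distribution alignment technique to enhance adaptation.

Method: 1. Dynamically adjust the entropy threshold to query uncertain samples; 2. Apply class-balanced memory replacement; 3. Apply class-aware distribution alignment.

Result: The method outperforms existing approaches on 10 cross-dataset transfer benchmarks and 4 domain generalization datasets while keeping latency and memory overhead reasonable.

Insight: TAPS offers an efficient and practical test-time active learning solution for real-world deployment, for example in autonomous systems and medical diagnostics.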

Abstract: Test-Time Optimization enables models to adapt to new data during inference by updating parameters on-the-fly. Recent advances in Vision-Language Models (VLMs) have explored learning prompts at test time to improve performance in downstream tasks. In this work, we extend this idea by addressing a more general and practical challenge: Can we effectively utilize an oracle in a continuous data stream where only one sample is available at a time, requiring an immediate query decision while respecting latency and memory constraints? To tackle this, we propose a novel Test-Time Active Learning (TTAL) framework that adaptively queries uncertain samples and updates prompts dynamically. Unlike prior methods that assume batched data or multiple gradient updates, our approach operates in a real-time streaming scenario with a single test sample per step. We introduce a dynamically adjusted entropy threshold for active querying, a class-balanced replacement strategy for memory efficiency, and a class-aware distribution alignment technique to enhance adaptation. The design choices are justified using careful theoretical analysis. Extensive experiments across 10 cross-dataset transfer benchmarks and 4 domain generalization datasets demonstrate consistent improvements over state-of-the-art methods while maintaining reasonable latency and memory overhead. Our framework provides a practical and effective solution for real-world deployment in safety-critical applications such as autonomous systems and medical diagnostics.
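
As a rough illustration of entropy-based active querying on a stream that delivers one sample at a time, the sketch below asks the oracle whenever predictive entropy exceeds a running threshold. The class name `EntropyQuerier`, the exponential-moving-average update, and the `momentum` parameter are assumptions made for illustration; the abstract does not state how the threshold is actually adjusted.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


class EntropyQuerier:
    """Hypothetical streaming rule: query the oracle when entropy is above a moving threshold."""

    def __init__(self, init_threshold, momentum=0.9):
        self.threshold = init_threshold
        self.momentum = momentum

    def should_query(self, logits):
        h = entropy(softmax(logits))
        decision = h > self.threshold
        # Assumed update rule: the threshold tracks an exponential moving
        # average of the entropies observed so far.
        self.threshold = self.momentum * self.threshold + (1.0 - self.momentum) * h
        return decision, h
```

With this rule, a confident prediction such as logits `[10.0, 0.0, 0.0]` is not sent to the oracle, while a uniform prediction is, because its entropy of `log(3)` exceeds the running threshold.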

[82] FaRMamba: Frequency-based learning and Reconstruction aided Mamba for Medical Segmentation

Ze Rong,ZiYue Zhao,Zhaoxin Wang,Lei Ma

Main category: cs.CV

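TL;DR: FaRMamba introduces a multi-scale frequency transform module and a self-supervised reconstruction auxiliary encoder to address high-frequency information loss and spatial structure degradation in medical image segmentation, yielding a marked improvement in segmentation accuracy.

Details Motivation: Medical image segmentation suffers from blurred lesion boundaries, loss of high-frequency detail, and difficulty in modeling long-range structures. Vision Mamba alleviates the long-range dependency problem, but it disrupts local pixel adjacency and causes high-frequency information loss and spatial structure degradation.

Contribution: Proposes FaRMamba, which addresses high-frequency information loss and 2D spatial structure degradation through two modules: the Multi-Scale Frequency Transform Module (MSFM) and the Self-Supervised Reconstruction Auxiliary Encoder (SSRAE).

Method: 1. MSFM restores multi-band high-frequency information through wavelet, cosine, and Fourier transforms; 2. SSRAE strengthens 2D spatial correlations through pixel-level reconstruction.

Result: FaRMamba performs strongly on several medical image datasets, outperforming CNN-Transformer hybrids and other Mamba variants, with clear gains in boundary accuracy and detail preservation.

Insight: The frequency-aware framework directly targets core challenges in medical imaging and offers a flexible design direction for future segmentation models.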

Abstract: Accurate medical image segmentation remains challenging due to blurred lesion boundaries (LBA), loss of high-frequency details (LHD), and difficulty in modeling long-range anatomical structures (DC-LRSS). Vision Mamba employs one-dimensional causal state-space recurrence to efficiently model global dependencies, thereby substantially mitigating DC-LRSS. However, its patch tokenization and 1D serialization disrupt local pixel adjacency and impose a low-pass filtering effect, resulting in Local High-frequency Information Capture Deficiency (LHICD) and two-dimensional Spatial Structure Degradation (2D-SSD), which in turn exacerbate LBA and LHD. In this work, we propose FaRMamba, a novel extension that explicitly addresses LHICD and 2D-SSD through two complementary modules. A Multi-Scale Frequency Transform Module (MSFM) restores attenuated high-frequency cues by isolating and reconstructing multi-band spectra via wavelet, cosine, and Fourier transforms. A Self-Supervised Reconstruction Auxiliary Encoder (SSRAE) enforces pixel-level reconstruction on the shared Mamba encoder to recover full 2D spatial correlations, enhancing both fine textures and global context. Extensive evaluations on CAMUS echocardiography, MRI-based Mouse-cochlea, and Kvasir-Seg endoscopy demonstrate that FaRMamba consistently outperforms competitive CNN-Transformer hybrids and existing Mamba variants, delivering superior boundary accuracy, detail preservation, and global coherence without prohibitive computational overhead. This work provides a flexible frequency-aware framework for future segmentation models that directly mitigates core challenges in medical imaging.
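
To make the notion of isolating frequency bands concrete, here is a minimal sketch that splits a 1D signal into low- and high-frequency parts with a plain discrete Fourier transform. It is not the MSFM module: the function names, the 1D setting, and the hard `cutoff` mask are assumptions chosen to keep the example self-contained.

```python
import cmath


def dft(x):
    """Naive O(n^2) discrete Fourier transform of a real or complex sequence."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(X):
    """Inverse of dft(); returns complex samples."""
    n = len(X)
    return [
        sum(X[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]


def split_bands(x, cutoff):
    """Split x into (low, high) parts whose sum reconstructs x.

    Bin k counts as low frequency when min(k, n - k) <= cutoff, which keeps
    the mask symmetric so both parts stay real for real input.
    """
    n = len(x)
    X = dft(x)
    low = [X[k] if min(k, n - k) <= cutoff else 0 for k in range(n)]
    high = [X[k] - low[k] for k in range(n)]
    return [v.real for v in idft(low)], [v.real for v in idft(high)]
```

For a signal with a sharp spike, such as `[1, 1, 1, 1, 5, 1, 1, 1]`, most of the spike ends up in the high band, which is the kind of detail a low-pass effect would attenuate.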

[83] The Devil is in the EOS: Sequence Training for Detailed Image Captioning

Abdelrahman Mohamed,Yova Kementchedjhieva

Main category: cs.CV

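TL;DR: The paper identifies premature prediction of the EOS (end-of-sequence) token as a problem in image captioning with vision-language models and proposes an unsupervised method to reduce this bias, producing longer and more detailed captions.

Details Motivation: Despite progress in vision-language models (VLMs) for image captioning, generated captions are often short and lack detail. The paper traces this to an EOS token bias introduced during cross-entropy training.

Contribution: Proposes an unsupervised method that reduces the model's tendency to predict the EOS token prematurely, yielding long and detailed captions without complex reward functions or supervised data.

Method: Adjusts the training strategy to reduce the model's bias toward predicting EOS, which encourages longer and more detailed captions. The method is simple and applicable to any pretrained model.

Result: Experiments with three VLMs on three detailed captioning benchmarks show a substantial increase in caption length and detail, albeit with an increase in the hallucination rate.

Insight: EOS token bias is a key factor limiting caption length and detail, and a simple unsupervised method can substantially improve the generated captions.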

Abstract: Despite significant advances in vision-language models (VLMs), image captioning often suffers from a lack of detail, with base models producing short, generic captions. This limitation persists even though VLMs are equipped with strong vision and language backbones. While supervised data and complex reward functions have been proposed to improve detailed image captioning, we identify a simpler underlying issue: a bias towards the end-of-sequence (EOS) token, which is introduced during cross-entropy training. We propose an unsupervised method to debias the model’s tendency to predict the EOS token prematurely. By reducing this bias, we encourage the generation of longer, more detailed captions without the need for intricate reward functions or supervision. Our approach is straightforward, effective, and easily applicable to any pretrained model. We demonstrate its effectiveness through experiments with three VLMs and on three detailed captioning benchmarks. Our results show a substantial increase in caption length and relevant details, albeit with an expected increase in the rate of hallucinations.

[84] KB-DMGen: Knowledge-Based Global Guidance and Dynamic Pose Masking for Human Image Generation

Shibang Liu,Xuemei Xie,Guangming Shi

Main category: cs.CV

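TL;DR: KB-DMGen improves both the overall quality and the pose accuracy of human image generation through knowledge-base global guidance and dynamic pose masking.

Details Motivation: Most current human image generation methods focus only on pose accuracy and neglect quality assurance for the image as a whole. The authors propose a new method that targets pose accuracy and global quality together.

Contribution: Proposes the KB-DMGen framework, which combines a Knowledge Base (KB) with Dynamic Masking (DM) to improve the pose accuracy and overall quality of generated human images.

Method: The knowledge base enhances pose accuracy and leverages global image features, while dynamic masking adjusts the importance of pose-related regions.

Result: Achieves state-of-the-art AP and CAP on the HumanArt dataset.

Insight: Combining a knowledge base with dynamic masking is an effective strategy for optimizing both local detail and global quality in generated images.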

Abstract: Recent methods using diffusion models have made significant progress in human image generation with various control signals such as pose priors. In portrait generation, both the accuracy of human pose and the overall visual quality are crucial for realistic synthesis. Most existing methods focus on controlling the accuracy of generated poses, but ignore the quality assurance of the entire image. In order to ensure the global image quality and pose accuracy, we propose Knowledge-Based Global Guidance and Dynamic pose Masking for human image Generation (KB-DMGen). The Knowledge Base (KB) is designed not only to enhance pose accuracy but also to leverage image feature information to maintain overall image quality. Dynamic Masking (DM) dynamically adjusts the importance of pose-related regions. Experiments demonstrate the effectiveness of our model, achieving new state-of-the-art results in terms of AP and CAP on the HumanArt dataset. The code will be made publicly available.

[85] Local Prompt Adaptation for Style-Consistent Multi-Object Generation in Diffusion Models

Ankit Sanjyal

Main category: cs.CV

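TL;DR: The paper proposes Local Prompt Adaptation (LPA), a training-free method for style consistency in multi-object generation with diffusion models. LPA decomposes the prompt into content and style tokens and injects them into the U-Net's attention layers at different stages, improving layout control and stylistic consistency.

Details Motivation: Given complex prompts that involve multiple objects and style requirements, diffusion models often produce images that lack style uniformity and spatial coherence, which limits their usefulness for controllable content generation.

Contribution: LPA requires no additional training and improves diffusion models on complex scene generation by controlling when content and style tokens are injected.

Method: Decompose the prompt into content and style tokens and inject them selectively into different attention layers of the U-Net to achieve both object layout and style uniformity.

Result: On a benchmark of 50 style-rich prompts, LPA outperforms existing methods such as Composer and MultiDiffusion on CLIP score and style consistency metrics.

Insight: Controlling the injection timing of content and style tokens is an effective strategy for style consistency in multi-object scenes and suggests a new direction for controllable image generation.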

Abstract: Diffusion models have become a powerful backbone for text-to-image generation, enabling users to synthesize high-quality visuals from natural language prompts. However, they often struggle with complex prompts involving multiple objects and global or local style specifications. In such cases, the generated scenes tend to lack style uniformity and spatial coherence, limiting their utility in creative and controllable content generation. In this paper, we propose a simple, training-free architectural method called Local Prompt Adaptation (LPA). Our method decomposes the prompt into content and style tokens, and injects them selectively into the U-Net’s attention layers at different stages. By conditioning object tokens early and style tokens later in the generation process, LPA enhances both layout control and stylistic consistency. We evaluate our method on a custom benchmark of 50 style-rich prompts across five categories and compare against strong baselines including Composer, MultiDiffusion, Attend-and-Excite, LoRA, and SDXL. Our approach outperforms prior work on both CLIP score and style consistency metrics, offering a new direction for controllable, expressive diffusion-based generation.
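
The idea of conditioning on object tokens early and style tokens later can be sketched as a simple schedule over denoising steps. Everything concrete here is an assumption for illustration: the function name `tokens_for_step`, the single `switch_fraction` boundary, and the choice to add style tokens to the content tokens rather than replace them. The abstract does not specify these details.

```python
def tokens_for_step(step, total_steps, content_tokens, style_tokens, switch_fraction=0.5):
    """Return the prompt tokens used for conditioning at a given denoising step.

    Hypothetical schedule: early steps see only content (object) tokens so the
    layout settles first; later steps also see style tokens.
    """
    if not 0 <= step < total_steps:
        raise ValueError("step must be in [0, total_steps)")
    if step < switch_fraction * total_steps:
        return list(content_tokens)
    return list(content_tokens) + list(style_tokens)
```

For a 10-step schedule with the default boundary, steps 0 to 4 would be conditioned on content tokens alone and steps 5 to 9 on content plus style tokens.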

[86] Hybrid-Domain Synergistic Transformer for Hyperspectral Image Denoising

Haoyue Li,Di Wu

Main category: cs.CV

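TL;DR: The paper proposes the Hybrid-Domain Synergistic Transformer network (HDST) for hyperspectral image denoising. Using frequency domain enhancement and multiscale modeling, it processes the spatial, frequency, and channel domains jointly.

Details Motivation: Hyperspectral image denoising faces a multi-dimensional coupling of spatially non-uniform noise and spectral correlation interference. Existing deep learning methods struggle with the distinctive spatial-spectral characteristics and complex noise distributions of hyperspectral images.

Contribution: (1) Introduces an FFT preprocessing module with multi-band convolution to extract cross-band correlations; (2) Designs a dynamic cross-domain attention module that adaptively fuses spatial domain texture features and frequency domain noise priors; (3) Builds a hierarchical architecture in which shallow layers capture global noise statistics and deep layers recover detail.

Method: HDST combines frequency domain preprocessing, a dynamic cross-domain attention module, and a hierarchical architecture with multiscale atrous convolution to denoise across the spatial, frequency, and channel domains.

Result: Experiments on real and synthetic datasets show that HDST significantly improves denoising performance while maintaining computational efficiency.

Insight: By processing the frequency and spatial domains together, HDST provides a new framework for complex noise coupling in high-dimensional visual data.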

Abstract: Hyperspectral image denoising faces the challenge of multi-dimensional coupling of spatially non-uniform noise and spectral correlation interference. Existing deep learning methods mostly focus on RGB images and struggle to effectively handle the unique spatial-spectral characteristics and complex noise distributions of hyperspectral images (HSI). This paper proposes an HSI denoising framework, Hybrid-Domain Synergistic Transformer Network (HDST), based on frequency domain enhancement and multiscale modeling, achieving three-dimensional collaborative processing of spatial, frequency and channel domains. The method innovatively integrates three key mechanisms: (1) introducing an FFT preprocessing module with multi-band convolution to extract cross-band correlations and decouple spectral noise components; (2) designing a dynamic cross-domain attention module that adaptively fuses spatial domain texture features and frequency domain noise priors through a learnable gating mechanism; (3) building a hierarchical architecture where shallow layers capture global noise statistics using multiscale atrous convolution, and deep layers achieve detail recovery through frequency domain postprocessing. Experiments on both real and synthetic datasets demonstrate that HDST significantly improves denoising performance while maintaining computational efficiency, validating the effectiveness of the proposed method. This research provides new insights and a universal framework for addressing complex noise coupling issues in HSI and other high-dimensional visual data. The code is available at https://github.com/lhy-cn/HDST-HSIDenoise.

[87] Detection of Medial Epicondyle Avulsion in Elbow Ultrasound Images via Bone Structure Reconstruction

Shizuka Akahori,Shotaro Teruya,Pragyan Shrestha,Yuichi Yoshii,Satoshi Iizuka,Akira Ikumi,Hiromitsu Tsuge,Itaru Kitahara

Main category: cs.CV

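TL;DR: The paper proposes a reconstruction-based framework, trained only on normal cases, for detecting medial epicondyle avulsion in elbow ultrasound images. A masked autoencoder learns the continuity of normal bone structure, and abnormalities are detected from reconstruction error.

Details Motivation: Medial epicondyle avulsion is common in baseball players and appears as discontinuities in the bone contour. Conventional methods depend on abnormal samples, whereas this approach needs only normal samples for training, which makes it more practical.

Contribution: 1. Proposes a masked autoencoder-based, structure-aware reconstruction framework that is trained on normal samples only; 2. Constructs a new dataset with pixel-level annotations.

Method: A masked autoencoder learns the continuity of normal bone structure, and abnormalities such as avulsion are detected from reconstruction error. The model produces large reconstruction errors in abnormal regions.

Result: Achieves a pixel-wise AUC of 0.965 and an image-wise AUC of 0.967, outperforming existing methods. The dataset is publicly available.

Insight: Training on normal samples alone is sufficient to detect abnormalities, which sidesteps the scarcity of abnormal samples and suggests a useful approach to anomaly detection in medical images.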

Abstract: This study proposes a reconstruction-based framework for detecting medial epicondyle avulsion in elbow ultrasound images, trained exclusively on normal cases. Medial epicondyle avulsion, commonly observed in baseball players, involves bone detachment and deformity, often appearing as discontinuities in bone contour. Therefore, learning the structure and continuity of normal bone is essential for detecting such abnormalities. To achieve this, we propose a masked autoencoder-based, structure-aware reconstruction framework that learns the continuity of normal bone structures. Even in the presence of avulsion, the model attempts to reconstruct the normal structure, resulting in large reconstruction errors at the avulsion site. For evaluation, we constructed a novel dataset comprising normal and avulsion ultrasound images from 16 baseball players, with pixel-level annotations under orthopedic supervision. Our method outperformed existing approaches, achieving a pixel-wise AUC of 0.965 and an image-wise AUC of 0.967. The dataset is publicly available at: https://github.com/Akahori000/Ultrasound-Medial-Epicondyle-Avulsion-Dataset.
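
The scoring step of reconstruction-based anomaly detection can be sketched in a few lines, assuming a trained model has already produced a reconstruction. The helper names `error_map` and `image_score`, the absolute-difference error, and the max-pooling of the error map into an image-level score are illustrative assumptions, not details taken from the paper.

```python
def error_map(image, reconstruction):
    """Per-pixel absolute reconstruction error for two equally sized 2D grids."""
    return [
        [abs(a - b) for a, b in zip(row_img, row_rec)]
        for row_img, row_rec in zip(image, reconstruction)
    ]


def image_score(err):
    """Image-level anomaly score: the largest per-pixel error (assumed pooling)."""
    return max(max(row) for row in err)
```

Because the model is trained only on normal bone, a normal image is reconstructed closely and scores near zero, while a discontinuity that the model "repairs" leaves a large error at that location.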

[88] NeuroVoxel-LM: Language-Aligned 3D Perception via Dynamic Voxelization and Meta-Embedding

Shiyu Liu,Lianlei Shan

Main category: cs.CV

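TL;DR: NeuroVoxel-LM is a framework that combines dynamic voxelization with lightweight meta-embedding to address feature extraction speed and representation accuracy for 3D language models on sparse, large-scale point clouds, improving both efficiency and accuracy.

Details Motivation: Existing 3D language models show slow feature extraction and limited representation accuracy on sparse, large-scale point clouds, so a more efficient and accurate solution is needed.

Contribution: 1. Proposes Dynamic Resolution Multiscale Voxelization (DR-MSV), which adaptively adjusts voxel granularity to reduce computational cost while preserving reconstruction fidelity; 2. Designs an attention-based lightweight meta-embedding mechanism (TAP-LME) that enhances semantic representation.

Method: 1. Use DR-MSV to adjust the voxel resolution of the point cloud dynamically; 2. Use TAP-LME to apply attention-based weighting and residual fusion for better semantic representation.

Result: Experiments show that DR-MSV significantly improves the efficiency and accuracy of point cloud feature extraction, and that TAP-LME outperforms conventional max-pooling at capturing fine-grained semantics from NeRF weights.

Insight: Combining dynamic resolution voxelization with lightweight meta-embedding gives 3D language models an efficient and accurate solution, particularly for sparse, large-scale point clouds.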

Abstract: Recent breakthroughs in Visual Language Models (VLMs) and Multimodal Large Language Models (MLLMs) have significantly advanced 3D scene perception towards language-driven cognition. However, existing 3D language models struggle with sparse, large-scale point clouds due to slow feature extraction and limited representation accuracy. To address these challenges, we propose NeuroVoxel-LM, a novel framework that integrates Neural Radiance Fields (NeRF) with dynamic resolution voxelization and lightweight meta-embedding. Specifically, we introduce a Dynamic Resolution Multiscale Voxelization (DR-MSV) technique that adaptively adjusts voxel granularity based on geometric and structural complexity, reducing computational cost while preserving reconstruction fidelity. In addition, we propose the Token-level Adaptive Pooling for Lightweight Meta-Embedding (TAP-LME) mechanism, which enhances semantic representation through attention-based weighting and residual fusion. Experimental results demonstrate that DR-MSV significantly improves point cloud feature extraction efficiency and accuracy, while TAP-LME outperforms conventional max-pooling in capturing fine-grained semantics from NeRF weights.

[89] Local2Global query Alignment for Video Instance Segmentation

Rajat Koner,Zhipeng Wang,Srinivas Parthasarathy,Chinghang Chen

Main category: cs.CV

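TL;DR: Local2Global is an online video instance segmentation framework that combines local and global queries with an alignment mechanism to deliver high-performing, temporally consistent predictions.

Details Motivation: Online video segmentation methods handle long sequences and gradual changes well, but temporally consistent prediction remains difficult because of noise accumulation, occlusions, and scene transitions.

Contribution: Proposes the Local2Global framework, comprising local queries, global queries, and the L2G-aligner alignment mechanism. It achieves efficient, temporally consistent segmentation without complex heuristics or memory mechanisms.

Method: Building on a DETR-based query propagation framework, the method introduces local queries (which capture features of the current frame) and global queries (which carry past spatio-temporal features), and aligns the two with a lightweight L2G-aligner.

Result: Achieves 54.3, 49.4, and 37.0 AP on YouTube-VIS-19, YouTube-VIS-21, and OVIS respectively, surpassing current benchmarks.

Insight: Early alignment of local and global queries makes effective use of current-frame information while preserving temporal consistency, and simple online training is enough to reach high performance.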

Abstract: Online video segmentation methods excel at handling long sequences and capturing gradual changes, making them ideal for real-world applications. However, achieving temporally consistent predictions remains a challenge, especially with gradual accumulation of noise or drift in online propagation, abrupt occlusions, and scene transitions. This paper introduces Local2Global, an online framework for video instance segmentation that exhibits state-of-the-art performance with a simple baseline and training purely in an online fashion. Leveraging the DETR-based query propagation framework, we introduce two novel sets of queries: (1) local queries that capture initial object-specific spatial features from each frame and (2) global queries containing past spatio-temporal representations. We propose the L2G-aligner, a novel lightweight transformer decoder, to facilitate an early alignment between local and global queries. This alignment allows our model to effectively utilize current frame information while maintaining temporal consistency, producing a smooth transition between frames. Furthermore, L2G-aligner is integrated within the segmentation model, without relying on additional complex heuristics or memory mechanisms. Extensive experiments across various challenging VIS and VPS datasets showcase the superiority of our method with simple online training, surpassing current benchmarks without bells and whistles. For instance, we achieve 54.3 and 49.4 AP on the YouTube-VIS-19/-21 datasets and 37.0 AP on the OVIS dataset, respectively, with the ResNet-50 backbone.

[90] Multi-output Deep-Supervised Classifier Chains for Plant Pathology

Jianping Yao,Son N. Tran

Main category: cs.CV

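TL;DR: The paper proposes Mo-DsCC, a new model for multi-output classification in plant pathology that links the prediction of plant species and disease type through a chain structure, improving classification performance.

Details Motivation: Most existing plant disease classification methods apply convolutional neural networks directly and overlook how the relationship between plant species and disease type affects prediction performance. The paper sets out to address this.

Contribution: Proposes the Mo-DsCC model, which combines a modified VGG-16 network, deep supervision training, and a classifier chain structure, and which improves classification performance.

Method: The model has three parts: a modified VGG-16 backbone, a deep supervision training mechanism, and a stack of classifier chains. Chaining the output layers optimizes the joint prediction of plant species and disease type.

Result: Experiments on the Plant Village and PlantDoc datasets show that Mo-DsCC achieves better accuracy and F1-score than existing approaches, including multi-model, multi-label (Power-set), multi-output, and multi-task methods.

Insight: The success of Mo-DsCC suggests that explicitly modeling the relationship between labels can improve multi-output classification, offering a new technical direction for smart agriculture.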

Abstract: Plant leaf disease classification is an important task in smart agriculture which plays a critical role in sustainable production. Modern machine learning approaches have shown unprecedented potential in this classification task, which offers an array of benefits including time saving and cost reduction. However, most recent approaches directly employ convolutional neural networks where the effect of the relationship between plant species and disease types on prediction performance is not properly studied. In this study, we propose a new model named Multi-output Deep Supervised Classifier Chains (Mo-DsCC) which weaves together the prediction of plant species and disease by chaining the output layers for the two labels. Mo-DsCC consists of three components: a modified VGG-16 network as the backbone, deep supervision training, and a stack of classification chains. To evaluate the advantages of our model, we perform intensive experiments on two benchmark datasets, Plant Village and PlantDoc. Comparison with recent approaches, including multi-model, multi-label (Power-set), multi-output, and multi-task, demonstrates that Mo-DsCC achieves better accuracy and F1-score. The empirical study in this paper shows that the application of Mo-DsCC could be a useful piece of the puzzle for smart agriculture to benefit farms and bring new ideas to industry and academia.
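
The chaining of output layers can be illustrated with two linear heads in which the disease head receives the species probabilities as extra input. This is a generic classifier-chain sketch, not the Mo-DsCC architecture: the function names, the plain linear heads, and the choice to chain species before disease are assumptions made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def linear(weights, bias, x):
    """Dense layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def chained_predict(features, species_head, disease_head):
    """Predict species first, then feed its probabilities into the disease head.

    Each head is a (weights, bias) tuple. The disease head expects an input of
    length len(features) + number_of_species.
    """
    species_probs = softmax(linear(species_head[0], species_head[1], features))
    disease_input = list(features) + species_probs
    disease_probs = softmax(linear(disease_head[0], disease_head[1], disease_input))
    return species_probs, disease_probs
```

The design point is that the second label is conditioned on the first prediction, so a disease that only occurs on one species can be favored or suppressed once the species is known.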

[91] Trust the Model: Compact VLMs as In-Context Judges for Image-Text Data Quality

Daulet Toibazar,Kesen Wang,Sherif Mohamed,Abdulaziz Al-Badawi,Abdulrahman Alfulayt,Pedro J. Moreno

Main category: cs.CV

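TL;DR: The paper proposes a lightweight approach that uses a compact vision-language model (VLM) to filter out low-quality image-text data, improving training data quality without any additional modules.

Details Motivation: Vision-language models (VLMs) extend conventional language models with visual data, but adding visual inputs makes data quality harder to maintain. Carefully curated, representative training data often outperforms larger volumes of noisy data, so an efficient data filtering method is needed.

Contribution: Proposes a lightweight data filtration framework in which a compact VLM evaluates and filters image-text pairs by image quality, caption quality, and their alignment, removing the need for extra modules and reducing training overhead.

Method: A compact VLM is fine-tuned on a high-quality image-caption annotated dataset and its inherent evaluative capability is used directly to filter low-quality data, rather than adding auxiliary modules on top of a full-scale VLM.

Result: Experiments show that datasets filtered by the compact VLM perform on par with, or better than, larger and noisier datasets gathered through high-volume web crawling.

Insight: A lightweight VLM can carry out data quality assessment efficiently, offering a low-cost way to build high-quality vision-language training corpora.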

Abstract: Vision-language models (VLMs) extend conventional large language models by integrating visual data, enabling richer multimodal reasoning and significantly broadening the practical applications of AI. However, including visual inputs also brings new challenges in maintaining data quality. Empirical evidence consistently shows that carefully curated and representative training examples often yield superior results compared to simply increasing the quantity of data. Inspired by this observation, we introduce a streamlined data filtration framework that employs a compact VLM, fine-tuned on a high-quality image-caption annotated dataset. This model effectively evaluates and filters potential training samples based on caption and image quality and alignment. Unlike previous approaches, which typically add auxiliary filtration modules on top of existing full-scale VLMs, our method exclusively utilizes the inherent evaluative capability of a purpose-built small VLM. This strategy eliminates the need for extra modules and reduces training overhead. Our lightweight model efficiently filters out inaccurate, noisy web data, improving image-text alignment and caption linguistic fluency. Experimental results show that datasets that underwent high-precision filtration using our compact VLM perform on par with, or even surpass, larger and noisier datasets gathered through high-volume web crawling. Thus, our method provides a lightweight yet robust solution for building high-quality vision-language training corpora. Availability and implementation: Our compact VLM filtration model, training data, utility scripts, and Supplementary data (Appendices) are freely available at https://github.com/daulettoibazar/Compact_VLM_Filter.

[92] AnimeColor: Reference-based Animation Colorization with Diffusion Transformers

Yuhong Zhang,Liyao Wang,Han Wang,Danni Wu,Zuzeng Lin,Feng Wang,Li Song

Main category: cs.CV

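TL;DR: AnimeColor is a reference-based animation colorization framework that uses Diffusion Transformers (DiT) to achieve accurate color and temporal consistency.

Details Motivation: Existing animation colorization methods fall short on color accuracy and temporal consistency, so a more effective solution is needed.

Contribution: Proposes the AnimeColor framework, which combines a High-level Color Extractor (HCE) and a Low-level Color Guider (LCG), and uses a multi-stage training strategy to make better use of the color information in reference images.

Method: Built on a DiT-based video diffusion model, the framework introduces HCE and LCG to extract semantic and fine-grained color information respectively, and controls animation generation with sketch sequences.

Result: Experiments show that AnimeColor outperforms existing methods in color accuracy, sketch alignment, temporal consistency, and visual quality.

Insight: The framework advances animation colorization and also provides a practical solution for industrial applications.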

Abstract: Animation colorization plays a vital role in animation production, yet existing methods struggle to achieve color accuracy and temporal consistency. To address these challenges, we propose AnimeColor, a novel reference-based animation colorization framework leveraging Diffusion Transformers (DiT). Our approach integrates sketch sequences into a DiT-based video diffusion model, enabling sketch-controlled animation generation. We introduce two key components: a High-level Color Extractor (HCE) to capture semantic color information and a Low-level Color Guider (LCG) to extract fine-grained color details from reference images. These components work synergistically to guide the video diffusion process. Additionally, we employ a multi-stage training strategy to maximize the utilization of reference image color information. Extensive experiments demonstrate that AnimeColor outperforms existing methods in color accuracy, sketch alignment, temporal consistency, and visual quality. Our framework not only advances the state of the art in animation colorization but also provides a practical solution for industrial applications. The code will be made publicly available at https://github.com/IamCreateAI/AnimeColor.

[93] Player-Centric Multimodal Prompt Generation for Large Language Model Based Identity-Aware Basketball Video Captioning

Zeyu Xi,Haoying Sun,Yaofei Wu,Junchi Yan,Haoran Zhang,Lifang Wu,Liang Wang,Changwen Chen

Main category: cs.CV

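TL;DR: The paper proposes a player-centric multimodal prompt generation network (LLM-IAVC) for generating basketball video captions that include player identities. It extracts player-related multimodal embeddings and combines them with a large language model, markedly improving the accuracy of identity information.

Details Motivation: Existing sports video captioning methods often ignore player identities, which limits the usefulness of the captions. Some methods integrate extra information, but because that information is independent of the video content, identities are sometimes wrong.

Contribution: 1. Proposes an identity-related information extraction module (IRIEM), comprising a player identification network (PIN) and a bidirectional semantic interaction module (BSIM), to extract player features and link them to video content. 2. Designs a visual context learning module (VCLM) to capture key video context. 3. Builds a new dataset, NBA-Identity, with 9,726 videos covering 9 major event types.

Method: 1. Use IRIEM to extract player multimodal embeddings (visual features and names). 2. Use BSIM to strengthen the link between player features and video content. 3. Use VCLM to learn key context information. 4. Feed the multimodal prompt into a large language model to generate the caption.

Result: On the NBA-Identity and VC-NBA-2022 datasets, the model achieves advanced performance and generates captions with more accurate identity information.

Insight: Recognizing player identities from a visual perspective and linking them to video content can markedly improve caption accuracy. Multimodal prompt generation combined with a large language model is an effective route to identity-aware video captioning.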

Abstract: Existing sports video captioning methods often focus on the action yet overlook player identities, limiting their applicability. Although some methods integrate extra information to generate identity-aware descriptions, the player identities are sometimes incorrect because the extra information is independent of the video content. This paper proposes a player-centric multimodal prompt generation network for identity-aware sports video captioning (LLM-IAVC), which focuses on recognizing player identities from a visual perspective. Specifically, an identity-related information extraction module (IRIEM) is designed to extract player-related multimodal embeddings. IRIEM includes a player identification network (PIN) for extracting visual features and player names, and a bidirectional semantic interaction module (BSIM) to link player features with video content for mutual enhancement. Additionally, a visual context learning module (VCLM) is designed to capture the key video context information. Finally, by integrating the outputs of the above modules as the multimodal prompt for the large language model (LLM), it facilitates the generation of descriptions with player identities. To support this work, we construct a new benchmark called NBA-Identity, a large identity-aware basketball video captioning dataset with 9,726 videos covering 9 major event types. The experimental results on NBA-Identity and VC-NBA-2022 demonstrate that our proposed model achieves advanced performance. Code and dataset are publicly available at https://github.com/Zeyu1226-mt/LLM-IAVC.

[94] PUMPS: Skeleton-Agnostic Point-based Universal Motion Pre-Training for Synthesis in Human Motion Tasks

Clinton Ansun Mo,Kun Hu,Chengjiang Long,Dong Yuan,Wan-Chi Siu,Zhiyong Wang

Main category: cs.CV

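TL;DR: PUMPS is a universal motion pre-training method based on Temporal Point Clouds (TPCs). Using an autoencoder architecture, it enables motion data synthesis and task learning across skeletons.

Details Motivation: Differences between motion skeletons make motion data hard to transfer across skeletons. TPCs have mainly served for compatibility rather than direct task learning, so a solution is needed that handles the unique properties of TPCs, such as temporal consistency and point identifiability.

Contribution: Proposes PUMPS, the first autoencoder architecture for TPC data, which independently reduces frame-wise point clouds into sampleable feature vectors; introduces linear assignment-based point pairing to optimize reconstruction; and avoids expensive point-wise attention mechanisms.

Method: The PUMPS encoder converts frame-wise point clouds into feature vectors, and the decoder generates temporal points using latent Gaussian noise vectors as sampling identifiers. Linear assignment optimizes point pairing, and the model is pre-trained on motion synthesis tasks.

Result: PUMPS performs well on motion prediction, transition generation, and keyframe interpolation, matching state-of-the-art performance even without native dataset supervision. When fine-tuned for motion denoising or estimation, it outperforms many specialized methods.

Insight: TPCs are a promising cross-compatible motion representation, and latent-space modeling with efficient point pairing can address their unique challenges. A generalist architecture offers advantages in both task generalization and performance.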

Abstract: Motion skeletons drive 3D character animation by transforming bone hierarchies, but differences in proportions or structure make motion data hard to transfer across skeletons, posing challenges for data-driven motion synthesis. Temporal Point Clouds (TPCs) offer an unstructured, cross-compatible motion representation. Though reversible with skeletons, TPCs mainly serve for compatibility, not for direct motion task learning. Doing so would require data synthesis capabilities for the TPC format, which presents unexplored challenges regarding its unique temporal consistency and point identifiability. Therefore, we propose PUMPS, the primordial autoencoder architecture for TPC data. PUMPS independently reduces frame-wise point clouds into sampleable feature vectors, from which a decoder extracts distinct temporal points using latent Gaussian noise vectors as sampling identifiers. We introduce linear assignment-based point pairing to optimise the TPC reconstruction process, and negate the use of expensive point-wise attention mechanisms in the architecture. Using these latent features, we pre-train a motion synthesis model capable of performing motion prediction, transition generation, and keyframe interpolation. For these pre-training tasks, PUMPS performs remarkably well even without native dataset supervision, matching state-of-the-art performance. When fine-tuned for motion denoising or estimation, PUMPS outperforms many respective methods without deviating from its generalist architecture.
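
Linear assignment-based point pairing can be illustrated on a tiny example: given predicted and target points with no fixed ordering, find the one-to-one pairing with the lowest total squared distance. The brute-force search over permutations below is only practical for a handful of points and is an assumption made to keep the sketch dependency-free; a real pipeline would more likely use a Hungarian-style solver such as `scipy.optimize.linear_sum_assignment`. The function names are hypothetical.

```python
from itertools import permutations


def sq_dist(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def pair_points(pred, target):
    """Return (assignment, cost) where assignment[i] is the target index paired with pred[i].

    Exhaustive search over all permutations: O(n!) and meant for illustration only.
    """
    n = len(pred)
    if n != len(target):
        raise ValueError("pred and target must contain the same number of points")
    best_perm, best_cost = None, float("inf")
    for perm in permutations(range(n)):
        cost = sum(sq_dist(pred[i], target[perm[i]]) for i in range(n))
        if cost < best_cost:
            best_perm, best_cost = perm, cost
    return list(best_perm), best_cost
```

Once the pairing is known, a reconstruction loss can be computed between matched points, which removes the need for the decoder to emit points in any particular order.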

[95] LRR-Bench: Left, Right or Rotate? Vision-Language models Still Struggle With Spatial Understanding Tasks

Fei Kong,Jinhao Duan,Kaidi Xu,Zhenhua Guo,Xiaofeng Zhu,Xiaoshuang Shi

Main category: cs.CV

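TL;DR: The paper introduces LRR-Bench, an evaluation framework for testing vision-language models (VLMs) on spatial understanding tasks, and finds that current models still have substantial room for improvement.

Details Motivation: Real-world applications such as autonomous driving and humanoid robot manipulation require precise spatial perception, yet how vision-language models recognize spatial relationships and perceive movement remains underexplored.

Contribution: Proposes a spatial evaluation pipeline and a synthetic benchmark dataset, divided into two task types: absolute spatial understanding and 3D spatial understanding.

Method: Test samples are generated from a synthetic dataset at low cost, which also prevents dataset contamination, and multiple state-of-the-art VLMs are evaluated experimentally.

Result: Humans achieve near-perfect performance on all tasks, whereas VLMs reach human-level performance only on the two simplest tasks, and the best-performing models score near zero on multiple tasks.

Insight: Current VLMs perform poorly on spatial understanding, especially on the more complex tasks, and need further improvement.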

Abstract: Real-world applications, such as autonomous driving and humanoid robot manipulation, require precise spatial perception. However, it remains underexplored how Vision-Language Models (VLMs) recognize spatial relationships and perceive spatial movement. In this work, we introduce a spatial evaluation pipeline and construct a corresponding benchmark. Specifically, we categorize spatial understanding into two main types: absolute spatial understanding, which involves querying the absolute spatial position (e.g., left, right) of an object within an image, and 3D spatial understanding, which includes movement and rotation. Notably, our dataset is entirely synthetic, enabling the generation of test samples at a low cost while also preventing dataset contamination. We conduct experiments on multiple state-of-the-art VLMs and observe that there is significant room for improvement in their spatial understanding abilities. Explicitly, in our experiments, humans achieve near-perfect performance on all tasks, whereas current VLMs attain human-level performance only on the two simplest tasks. For the remaining tasks, the performance of VLMs is distinctly lower than that of humans. In fact, the best-performing Vision-Language Models even achieve near-zero scores on multiple tasks. The dataset and code are available on https://github.com/kong13661/LRR-Bench.

[96] Towards Universal Modal Tracking with Online Dense Temporal Token Learning

Yaozong Zheng,Bineng Zhong,Qihua Liang,Shengping Zhang,Guorong Li,Xianxian Li,Rongrong Ji

Main category: cs.CV

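TL;DR: The paper proposes a universal video-level modality-aware tracking model (Modaltracker) that uses online dense temporal token learning to support multiple tracking tasks (RGB, RGB+Thermal, RGB+Depth, and RGB+Event) with the same architecture and parameters. Its core designs are video-level sampling, video-level association, and modality scalability, and it achieves new SOTA performance.

Details Motivation: Current multi-modal trackers usually require independent training and cannot handle different modalities with a single model. The paper aims to design a unified model that supports tracking across modalities while improving performance through an online learning mechanism.

Contribution: 1. Proposes video-level sampling, which expands the model's input to the video sequence level. 2. Introduces online dense temporal token association mechanisms that propagate the target's appearance and motion trajectory information. 3. Designs two gated perceivers that adaptively learn cross-modal representations via a gated attention mechanism and compress them into the same set of parameters.

Method: 1. Use video-level sampling and association to capture rich context from a near-global perspective. 2. Learn cross-modal representations through a gated attention mechanism and use one-shot training to support multi-task inference. 3. Use the online learning mechanism so that past information guides future inference.

Result: Modaltracker achieves new SOTA performance on multiple visible and multi-modal benchmarks.

Insight: A unified model with an online learning mechanism can track efficiently across modalities while reducing the training burden and improving representation quality.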

Abstract: We propose a universal video-level modality-awareness tracking model with online dense temporal token learning (called Modaltracker). It is designed to support various tracking tasks, including RGB, RGB+Thermal, RGB+Depth, and RGB+Event, utilizing the same model architecture and parameters. Specifically, our model is designed with three core goals. Video-level Sampling: we expand the model's inputs to a video sequence level, aiming to see a richer video context from a near-global perspective. Video-level Association: we introduce two simple yet effective online dense temporal token association mechanisms to propagate the appearance and motion trajectory information of the target in a video stream manner. Modality Scalable: we propose two novel gated perceivers that adaptively learn cross-modal representations via a gated attention mechanism, and subsequently compress them into the same set of model parameters via one-shot training for multi-task inference. This new solution brings the following benefits: (i) The purified token sequences can serve as temporal prompts for the inference in the next video frames, whereby previous information is leveraged to guide future inference. (ii) Unlike multi-modal trackers that require independent training, our one-shot training scheme not only alleviates the training burden, but also improves model representation. Extensive experiments on visible and multi-modal benchmarks show that our Modaltracker achieves a new SOTA performance. The code will be available at https://github.com/GXNU-ZhongLab/ODTrack.

[97] MoCTEFuse: Illumination-Gated Mixture of Chiral Transformer Experts for Multi-Level Infrared and Visible Image Fusion

Li Jinfu,Song Hong,Xia Jianghan,Lin Yucong,Wang Ting,Shao Long,Fan Jingfan,Yang Jian

Main category: cs.CV

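TL;DR: MoCTEFuse is a dynamic multi-level image fusion network that uses an illumination-gated Mixture of Chiral Transformer Experts (MoCTE) to adaptively balance the preservation of texture details and object contrast.

Details Motivation: Existing infrared and visible image fusion methods ignore illumination changes, which leads to modality bias in the fused results. MoCTEFuse aims to address this.

Contribution: Proposes the illumination-gated Mixture of Chiral Transformer Experts (MoCTE) with an asymmetric cross-attention mechanism that dynamically assigns modality weights, together with a competitive loss function for robust training.

Method: MoCTE consists of high- and low-illumination expert subnetworks, each built on the Chiral Transformer Fusion Block (CTFB). Guided by illumination gating signals, CTFB dynamically switches between the primary and auxiliary modalities, and it is stacked at multiple stages to progressively aggregate and refine modality-specific and cross-modality information.

Result: Shows superior fusion performance on the DroneVehicle, MSRS, TNO, and RoadScene datasets, and reaches a detection mAP of 70.93% on MFNet and 45.14% on DroneVehicle.

Insight: Dynamic illumination gating and the competitive loss design effectively balance detail preservation and object contrast in modality fusion.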

Abstract: While illumination changes inevitably affect the quality of infrared and visible image fusion, many outstanding methods still ignore this factor and directly merge the information from source images, leading to modality bias in the fused results. To this end, we propose a dynamic multi-level image fusion network called MoCTEFuse, which applies an illumination-gated Mixture of Chiral Transformer Experts (MoCTE) to adaptively preserve texture details and object contrasts in balance. MoCTE consists of high- and low-illumination expert subnetworks, each built upon the Chiral Transformer Fusion Block (CTFB). Guided by the illumination gating signals, CTFB dynamically switches between the primary and auxiliary modalities as well as assigning them corresponding weights with its asymmetric cross-attention mechanism. Meanwhile, it is stacked at multiple stages to progressively aggregate and refine modality-specific and cross-modality information. To facilitate robust training, we propose a competitive loss function that integrates illumination distributions with three levels of sub-loss terms. Extensive experiments conducted on the DroneVehicle, MSRS, TNO and RoadScene datasets show MoCTEFuse’s superior fusion performance. Finally, it achieves the best detection mean Average Precision (mAP) of 70.93% on the MFNet dataset and 45.14% on the DroneVehicle dataset. The code and model are released at https://github.com/Bitlijinfu/MoCTEFuse.
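The illumination-gating idea can be illustrated with a toy example. The sketch below is not the MoCTEFuse architecture: the two "experts" are hypothetical fixed blends, and the sigmoid gate with its `midpoint` and `sharpness` parameters is an assumption made only to show how a brightness signal can weight a high-illumination and a low-illumination branch.

```python
import math


def illumination_gate(visible_pixels, midpoint=0.5, sharpness=10.0):
    """Map the mean brightness of a non-empty list of values in [0, 1] to a gate in (0, 1)."""
    mean_brightness = sum(visible_pixels) / len(visible_pixels)
    return 1.0 / (1.0 + math.exp(-sharpness * (mean_brightness - midpoint)))


def fuse(visible_pixels, infrared_pixels):
    """Blend two hypothetical experts using the illumination gate."""
    if len(visible_pixels) != len(infrared_pixels):
        raise ValueError("both images must have the same number of pixels")
    gate = illumination_gate(visible_pixels)
    # Hypothetical high-illumination expert: trusts the visible image more.
    high = [0.8 * v + 0.2 * r for v, r in zip(visible_pixels, infrared_pixels)]
    # Hypothetical low-illumination expert: trusts the infrared image more.
    low = [0.2 * v + 0.8 * r for v, r in zip(visible_pixels, infrared_pixels)]
    # Soft gating: bright scenes lean on `high`, dark scenes on `low`.
    return [gate * h + (1.0 - gate) * l for h, l in zip(high, low)]
```

With a dark visible image the gate is close to 0, so the fused result follows the infrared-weighted expert; with a bright one it follows the visible-weighted expert.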

[98] SAMwave: Wavelet-Driven Feature Enrichment for Effective Adaptation of Segment Anything Model

Saurabh Yadav,Avi Gupta,Koteswar Rao Jerripothula

Main category: cs.CV

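TL;DR: SAMwave uses the wavelet transform to extract multi-scale high-frequency features and adds complex-valued adapters to improve SAM's performance on complex tasks, significantly outperforming existing adaptation methods.

Details Motivation: Large foundation models such as SAM degrade on complex tasks they were not trained for, and existing adaptation methods are limited in their ability to extract high-frequency features.

Contribution: Proposes SAMwave, which uses the wavelet transform to extract multi-scale high-frequency features and introduces complex-valued adapters to capture spatial-frequency information.

Method: Applies the wavelet transform to extract multi-scale features and designs complex-valued adapters that adaptively integrate the wavelet coefficients.

Result: On four low-level vision tasks, SAMwave significantly outperforms existing adaptation methods, with both the SAM and SAM2 backbones.

Insight: The wavelet transform extracts richer high-frequency information, and complex-valued adapters add flexibility and interpretability.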

Abstract: The emergence of large foundation models has propelled significant advances in various domains. The Segment Anything Model (SAM), a leading model for image segmentation, exemplifies these advances, outperforming traditional methods. However, such foundation models often suffer from performance degradation when applied to complex tasks for which they are not trained. Existing methods typically employ adapter-based fine-tuning strategies to adapt SAM for tasks and leverage high-frequency features extracted from the Fourier domain. However, our analysis reveals that these approaches offer limited benefits due to constraints in their feature extraction techniques. To overcome this, we propose SAMwave, a novel and interpretable approach that utilizes the wavelet transform to extract richer, multi-scale high-frequency features from input data. Extending this, we introduce complex-valued adapters capable of capturing complex-valued spatial-frequency information via complex wavelet transforms. By adaptively integrating these wavelet coefficients, SAMwave enables SAM’s encoder to capture information more relevant for dense prediction. Empirical evaluations on four challenging low-level vision tasks demonstrate that SAMwave significantly outperforms existing adaptation methods. This superior performance is consistent across both the SAM and SAM2 backbones and holds for both real and complex-valued adapter variants, highlighting the efficiency, flexibility, and interpretability of our proposed method for adapting segment anything models.
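To make "multi-scale high-frequency features" concrete, here is a minimal sketch of a multi-level 1D Haar wavelet decomposition. The abstract does not say which wavelet SAMwave uses, so the Haar choice, the 1D setting, and the function names are assumptions for illustration only; the detail coefficients at each level are the high-frequency content at that scale.

```python
import math


def haar_step(signal):
    """One level of the orthonormal 1D Haar transform: (approximation, detail)."""
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    scale = math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) / scale for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / scale for i in range(0, len(signal), 2)]
    return approx, detail


def multiscale_details(signal, levels):
    """Collect the detail (high-frequency) coefficients at each scale, finest first."""
    details = []
    current = list(signal)
    for _ in range(levels):
        # The approximation becomes the input of the next, coarser level.
        current, detail = haar_step(current)
        details.append(detail)
    return details
```

For a step signal such as `[1, 1, 1, 1, 5, 5, 5, 5]`, the fine-scale details are all zero and the edge only shows up in the coarsest detail coefficient, which is the kind of scale-dependent information a single global Fourier spectrum does not localize.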

[99] SAViL-Det: Semantic-Aware Vision-Language Model for Multi-Script Text Detection

Mohammed-En-Nadhir Zighem,Abdenour Hadid

Main category: cs.CV

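TL;DR: SAViL-Det is a new semantic-aware vision-language model that improves multi-script text detection by integrating textual prompts with visual features, particularly for arbitrarily shaped text and diverse scripts.

Details Motivation: Existing methods fall short on multi-script and arbitrarily shaped text detection and do not fully leverage semantic context. SAViL-Det aims to address this by combining textual prompts with visual features.

Contribution: 1. Proposes a semantic-aware vision-language model (SAViL-Det); 2. Designs a language-vision decoder that propagates fine-grained semantic information via cross-modal attention; 3. Introduces a text-to-pixel contrastive learning mechanism that aligns textual and visual features.

Method: 1. Combines a pre-trained CLIP model with an Asymptotic Feature Pyramid Network (AFPN) for multi-scale visual feature fusion; 2. Uses the language-vision decoder for cross-modal attention; 3. Aligns text prompts with visual pixel features through contrastive learning.

Result: Experiments give F-scores of 84.8% on MLT-2019 and 90.2% on CTW1500, reaching state-of-the-art performance.

Insight: Cross-modal attention and contrastive learning effectively improve multi-script text detection, especially in scenes with high diversity of shapes and scripts.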

Abstract: Detecting text in natural scenes remains challenging, particularly for diverse scripts and arbitrarily shaped instances where visual cues alone are often insufficient. Existing methods do not fully leverage semantic context. This paper introduces SAViL-Det, a novel semantic-aware vision-language model that enhances multi-script text detection by effectively integrating textual prompts with visual features. SAViL-Det utilizes a pre-trained CLIP model combined with an Asymptotic Feature Pyramid Network (AFPN) for multi-scale visual feature fusion. The core of the proposed framework is a novel language-vision decoder that adaptively propagates fine-grained semantic information from text prompts to visual features via cross-modal attention. Furthermore, a text-to-pixel contrastive learning mechanism explicitly aligns textual and corresponding visual pixel features. Extensive experiments on challenging benchmarks demonstrate the effectiveness of the proposed approach, achieving state-of-the-art performance with F-scores of 84.8% on the benchmark multi-lingual MLT-2019 dataset and 90.2% on the curved-text CTW1500 dataset.

[100] Color histogram equalization and fine-tuning to improve expression recognition of (partially occluded) faces on sign language datasets

Fabrizio Nunnari,Alakshendra Jyotsnaditya Ramkrishna Singh,Patrick Gebhard

Main category: cs.CV

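TL;DR: The paper studies how well computer vision methods classify facial expressions on sign language datasets, and uses color histogram equalization and fine-tuning to improve classification accuracy, including on partially occluded faces.

Details Motivation: The study aims to quantify how accurately computer vision methods classify facial expressions on a sign language dataset, and to support further investigation of differences in emotion manifestation between hearing and deaf subjects by recognizing expressions from only part of the face.

Contribution: Introduces a color normalization stage based on histogram equalization and fine-tuning, reaching 83.8% mean sensitivity in expression classification.

Method: Normalizes the dataset's colors with histogram equalization and fine-tuning, and tests expression classification under partial occlusion (upper half or lower half of the face only).

Result: Recognition from the lower half of the face (79.6%) is higher than from the upper half (77.9%), and classification accuracy from the upper half exceeds human level.

Insight: The lower half of the face matters more for expression recognition, yet the model outperforms humans on the upper half, which shows its potential under partial occlusion.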

Abstract: The goal of this investigation is to quantify to what extent computer vision methods can correctly classify facial expressions on a sign language dataset. We extend our experiments by recognizing expressions using only the upper or lower part of the face, which is needed to further investigate the difference in emotion manifestation between hearing and deaf subjects. To take into account the peculiar color profile of a dataset, our method introduces a color normalization stage based on histogram equalization and fine-tuning. The results show the ability to correctly recognize facial expressions with 83.8% mean sensitivity and very little variance (.042) among classes. Like for humans, recognition of expressions from the lower half of the face (79.6%) is higher than that from the upper half (77.9%). Noticeably, the classification accuracy from the upper half of the face is higher than human level.
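For readers unfamiliar with histogram equalization, a minimal single-channel sketch follows. The abstract does not describe how the authors apply it to color images, so treating each channel as a flat list of integers and equalizing it independently is an assumption, and `equalize_channel` is a hypothetical helper rather than the paper's code.

```python
def equalize_channel(values, levels=256):
    """Histogram-equalize one channel given as a flat list of ints in [0, levels)."""
    histogram = [0] * levels
    for v in values:
        histogram[v] += 1

    # Cumulative distribution of intensities.
    cdf = []
    running = 0
    for count in histogram:
        running += count
        cdf.append(running)

    total = len(values)
    cdf_min = next((c for c in cdf if c > 0), 0)
    if total == 0 or total == cdf_min:
        return list(values)  # empty or single-intensity channel: nothing to stretch

    # Map each intensity so that the cumulative distribution becomes roughly linear.
    lookup = [
        max(0, round((c - cdf_min) / (total - cdf_min) * (levels - 1)))
        for c in cdf
    ]
    return [lookup[v] for v in values]
```

A narrow band of intensities such as `[100, 100, 101, 102]` is stretched across the full range, with the darkest value mapped to 0 and the brightest to 255.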

[101] When Tokens Talk Too Much: A Survey of Multimodal Long-Context Token Compression across Images, Videos, and Audios

Kele Shao,Keda Tao,Kejia Zhang,Sicheng Feng,Mu Cai,Yuzhang Shang,Haoxuan You,Can Qin,Yang Sui,Huan Wang

Main category: cs.CV

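TL;DR: This paper presents the first systematic survey of multimodal long-context token compression. It organizes existing methods by the distinct redundancy characteristics of images, videos, and audio, and discusses future research directions.

Details Motivation: As multimodal large language models (MLLMs) advance, demand grows for processing long contexts such as high-resolution images, long videos, and lengthy audio, but the resulting computational cost is a pressing problem. Token compression has emerged as a key approach.

Contribution: Provides the first systematic survey of multimodal long-context token compression, categorized by modality (image, video, audio) and by underlying mechanism (transformation-based, similarity-based, attention-based, and query-based), and maintains a public repository to track progress.

Method: Summarizes existing token compression techniques along two dimensions: modality (image, video, audio) and mechanism (transformation, similarity, attention, query).

Result: Consolidates current progress, identifies key challenges, and suggests future research directions.

Insight: Token compression is a key technique for multimodal long-context processing, and the different redundancy characteristics of each modality call for tailored compression methods.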

Abstract: Multimodal large language models (MLLMs) have made remarkable strides, largely driven by their ability to process increasingly long and complex contexts, such as high-resolution images, extended video sequences, and lengthy audio input. While this ability significantly enhances MLLM capabilities, it introduces substantial computational challenges, primarily due to the quadratic complexity of self-attention mechanisms with numerous input tokens. To mitigate these bottlenecks, token compression has emerged as an auspicious and critical approach, efficiently reducing the number of tokens during both training and inference. In this paper, we present the first systematic survey and synthesis of the burgeoning field of multimodal long context token compression. Recognizing that effective compression strategies are deeply tied to the unique characteristics and redundancies of each modality, we categorize existing approaches by their primary data focus, enabling researchers to quickly access and learn methods tailored to their specific area of interest: (1) image-centric compression, which addresses spatial redundancy in visual data; (2) video-centric compression, which tackles spatio-temporal redundancy in dynamic sequences; and (3) audio-centric compression, which handles temporal and spectral redundancy in acoustic signals. Beyond this modality-driven categorization, we further dissect methods based on their underlying mechanisms, including transformation-based, similarity-based, attention-based, and query-based approaches. By providing a comprehensive and structured overview, this survey aims to consolidate current progress, identify key challenges, and inspire future research directions in this rapidly evolving domain. We also maintain a public repository to continuously track and update the latest advances in this promising area.
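As a small illustration of the similarity-based family mentioned in the abstract, the sketch below greedily merges adjacent tokens whose cosine similarity exceeds a threshold. It is not taken from any surveyed method: the adjacency restriction, the running-average merge, and the `threshold` default are all assumptions chosen to keep the example short.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def merge_similar_tokens(tokens, threshold=0.95):
    """Greedily merge each token into the previous group if they are similar enough."""
    merged = []   # running mean vector of each group
    counts = []   # number of original tokens in each group
    for token in tokens:
        if merged and cosine(merged[-1], token) >= threshold:
            n = counts[-1]
            # Update the group's mean to include the new token.
            merged[-1] = [(m * n + t) / (n + 1) for m, t in zip(merged[-1], token)]
            counts[-1] = n + 1
        else:
            merged.append(list(token))
            counts.append(1)
    return merged
```

Because self-attention cost grows quadratically with the number of tokens, merging four tokens into two reduces the number of attention pairs from 16 to 4.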

[102] Dual-Stream Global-Local Feature Collaborative Representation Network for Scene Classification of Mining Area

Shuqi Fan,Haoyi Wang,Xianju Li

Main category: cs.CV

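TL;DR: This paper proposes a dual-stream global-local feature collaborative representation network for scene classification of mining areas. It combines a multi-scale global Transformer branch, a local enhancement collaborative representation branch, and a dual-branch deep feature fusion module to improve classification accuracy.

Details Motivation: Scene classification of mining areas is important for geological environment monitoring and resource development planning, but the complex spatial layout and multi-scale characteristics of mining areas make the task challenging.

Contribution: Proposes a dual-branch fusion model that uses collaborative representation of global and local features to capture the multi-scale characteristics and fine-grained spatial variations of mining areas.

Method: The method comprises a multi-scale global Transformer branch, a local enhancement collaborative representation branch, and a dual-branch deep feature fusion module, and uses multi-loss computation to balance the modules.

Result: The model reaches an overall accuracy of 83.63%, outperforming the compared models, and achieves the best performance on all other evaluation metrics.

Insight: Collaborative representation of global and local features captures the complex characteristics of mining areas, and the dual-stream fusion mechanism improves classification performance.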

Abstract: Scene classification of mining areas provides accurate foundational data for geological environment monitoring and resource development planning. This study fuses multi-source data to construct a multi-modal mine land cover scene classification dataset. A significant challenge in mining area classification lies in the complex spatial layout and multi-scale characteristics. By extracting global and local features, it becomes possible to comprehensively reflect the spatial distribution, thereby enabling a more accurate capture of the holistic characteristics of mining scenes. We propose a dual-branch fusion model utilizing collaborative representation to decompose global features into a set of key semantic vectors. This model comprises three key components: (1) Multi-scale Global Transformer Branch: It leverages adjacent large-scale features to generate global channel attention features for small-scale features, effectively capturing the multi-scale feature relationships. (2) Local Enhancement Collaborative Representation Branch: It refines the attention weights by leveraging local features and reconstructed key semantic sets, ensuring that the local context and detailed characteristics of the mining area are effectively integrated. This enhances the model’s sensitivity to fine-grained spatial variations. (3) Dual-Branch Deep Feature Fusion Module: It fuses the complementary features of the two branches to incorporate more scene information. This fusion strengthens the model’s ability to distinguish and classify complex mining landscapes. Finally, this study employs multi-loss computation to ensure a balanced integration of the modules. The overall accuracy of this model is 83.63%, which outperforms other comparative models. Additionally, it achieves the best performance across all other evaluation metrics.

[103] Motion-example-controlled Co-speech Gesture Generation Leveraging Large Language Models

Bohong Chen,Yumeng Li,Youyi Zheng,Yao-Xiang Ding,Kun Zhou

Main category: cs.CV

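TL;DR: This paper proposes MECo, a framework that leverages large language models (LLMs) for motion-example-controlled co-speech gesture generation, preserving the details of the examples while staying congruent with speech.

Details Motivation: Existing systems control gesture generation through predefined categorical labels or implicit pseudo-labels, which lose the rich details of the original motion examples. The authors use the comprehension capabilities of LLMs to address this.

Contribution: 1. Proposes the MECo framework, which applies LLMs to co-speech gesture generation; 2. Uses motion examples as explicit query contexts to control generation, avoiding the limitations of pseudo-labels; 3. Supports diverse input modalities (motion clips, static poses, human video sequences, and textual descriptions) and granular control of individual body parts.

Method: Fine-tunes LLMs to interpret speech audio and motion examples simultaneously, placing motion examples as explicit query contexts within the prompt structure to generate gestures that preserve example characteristics while remaining congruent with speech.

Result: Experiments show state-of-the-art performance on three metrics (FGD, motion diversity, and example-gesture similarity), with support for several input forms.

Insight: LLMs show promise for fine-grained motion generation, and using examples explicitly rather than through pseudo-labels preserves more detail and suggests a new direction for multimodal control.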

Abstract: The automatic generation of controllable co-speech gestures has recently gained growing attention. While existing systems typically achieve gesture control through predefined categorical labels or implicit pseudo-labels derived from motion examples, these approaches often compromise the rich details present in the original motion examples. We present MECo, a framework for motion-example-controlled co-speech gesture generation by leveraging large language models (LLMs). Our method capitalizes on LLMs’ comprehension capabilities through fine-tuning to simultaneously interpret speech audio and motion examples, enabling the synthesis of gestures that preserve example-specific characteristics while maintaining speech congruence. Departing from conventional pseudo-labeling paradigms, we position motion examples as explicit query contexts within the prompt structure to guide gesture generation. Experimental results demonstrate state-of-the-art performance across three metrics: Fréchet Gesture Distance (FGD), motion diversity, and example-gesture similarity. Furthermore, our framework enables granular control of individual body parts and accommodates diverse input modalities including motion clips, static poses, human video sequences, and textual descriptions. Our code, pre-trained models, and videos are available at https://robinwitch.github.io/MECo-Page.

[104] AnimalClue: Recognizing Animals by their Traces

Risa Shinoda,Nakamasa Inoue,Iro Laina,Christian Rupprecht,Hirokatsu Kataoka

Main category: cs.CV

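TL;DR: The paper introduces AnimalClue, a large-scale dataset for identifying species from indirect evidence such as footprints and feces, filling a gap in wildlife monitoring.

Details Motivation: Wildlife monitoring requires accurate species identification from indirect evidence such as footprints and feces, but existing computer vision research focuses mainly on the direct visual appearance of animals, and identification from indirect evidence remains relatively underexplored.

Contribution: 1) Introduces AnimalClue, the first large-scale dataset of its kind, with 159,605 annotated bounding boxes covering five categories of indirect evidence across 968 species; 2) The dataset includes rich annotations such as species labels and fine-grained traits (activity patterns, habitat preferences); 3) Systematically evaluates existing vision models on this task and identifies the challenges.

Method: Builds the large-scale AnimalClue dataset, annotates and analyzes five categories of indirect evidence (footprints, feces, eggs, bones, feathers) together with species labels and fine-grained traits, and evaluates the performance of existing vision models.

Result: The paper reports how representative vision models perform on identification from indirect evidence and reveals the difficulty of recognizing subtle visual features.

Insight: Identification from indirect evidence requires models to capture finer-grained visual features, which opens a new direction for future research.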

Abstract: Wildlife observation plays an important role in biodiversity conservation, necessitating robust methodologies for monitoring wildlife populations and interspecies interactions. Recent advances in computer vision have significantly contributed to automating fundamental wildlife observation tasks, such as animal detection and species identification. However, accurately identifying species from indirect evidence like footprints and feces remains relatively underexplored, despite its importance in contributing to wildlife monitoring. To bridge this gap, we introduce AnimalClue, the first large-scale dataset for species identification from images of indirect evidence. Our dataset consists of 159,605 bounding boxes encompassing five categories of indirect clues: footprints, feces, eggs, bones, and feathers. It covers 968 species, 200 families, and 65 orders. Each image is annotated with species-level labels, bounding boxes or segmentation masks, and fine-grained trait information, including activity patterns and habitat preferences. Unlike existing datasets primarily focused on direct visual features (e.g., animal appearances), AnimalClue presents unique challenges for classification, detection, and instance segmentation tasks due to the need for recognizing more detailed and subtle visual features. In our experiments, we extensively evaluate representative vision models and identify key challenges in animal identification from their traces. Our dataset and code are available at https://dahlian00.github.io/AnimalCluePage/

[105] MIRepNet: A Pipeline and Foundation Model for EEG-Based Motor Imagery Classification

Dingkun Liu,Zhu Chen,Jingwei Luo,Shijie Lian,Dongrui Wu

Main category: cs.CV

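TL;DR: MIRepNet is the first EEG foundation model designed specifically for the motor imagery (MI) paradigm. It combines a neurophysiologically informed preprocessing pipeline with a hybrid pretraining strategy and performs well on few-shot downstream tasks.

Details Motivation: Existing EEG foundation models overlook paradigm-specific neurophysiological distinctions, which limits their generalization, whereas in practical BCI deployments the paradigm (such as MI) is usually determined in advance.

Contribution: 1. The first EEG foundation model tailored to the MI paradigm; 2. A neurophysiologically informed preprocessing pipeline; 3. A hybrid pretraining strategy that combines self-supervised and supervised learning.

Method: 1. A neurophysiology-based EEG preprocessing pipeline; 2. Hybrid pretraining that combines self-supervised masked token reconstruction with supervised MI classification.

Result: Achieves state-of-the-art performance on five public MI datasets, significantly outperforming existing EEG models.

Insight: Designing foundation models for a specific paradigm can significantly improve performance, and the hybrid pretraining strategy is particularly effective for few-shot adaptation.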

Abstract: Brain-computer interfaces (BCIs) enable direct communication between the brain and external devices. Recent EEG foundation models aim to learn generalized representations across diverse BCI paradigms. However, these approaches overlook fundamental paradigm-specific neurophysiological distinctions, limiting their generalization ability. Importantly, in practical BCI deployments, the specific paradigm, such as motor imagery (MI) for stroke rehabilitation or assistive robotics, is generally determined prior to data acquisition. This paper proposes MIRepNet, the first EEG foundation model tailored for the MI paradigm. MIRepNet comprises a high-quality EEG preprocessing pipeline incorporating a neurophysiologically-informed channel template, adaptable to EEG headsets with arbitrary electrode configurations. Furthermore, we introduce a hybrid pretraining strategy that combines self-supervised masked token reconstruction and supervised MI classification, facilitating rapid adaptation and accurate decoding on novel downstream MI tasks with fewer than 30 trials per class. Extensive evaluations across five public MI datasets demonstrated that MIRepNet consistently achieved state-of-the-art performance, significantly outperforming both specialized and generalized EEG models. Our code will be available on GitHub at https://github.com/staraink/MIRepNet.

[106] L-MCAT: Unpaired Multimodal Transformer with Contrastive Attention for Label-Efficient Satellite Image Classification

Mitul Goswami,Mrinal Goswami

Main category: cs.CV

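TL;DR: L-MCAT is a lightweight Transformer-based multimodal contrastive attention framework for label-efficient satellite image classification. Its Modality-Spectral Adapters and unpaired multimodal attention alignment mechanism substantially reduce parameters and computation while maintaining high accuracy.

Details Motivation: Satellite data is multimodal and labels are expensive to obtain, so classification needs a method that is efficient, lightweight, and does not require pixel-level alignment.

Contribution: 1. Proposes Modality-Spectral Adapters (MSA) that compress high-dimensional inputs; 2. Designs Unpaired Multimodal Attention Alignment (U-MAA), which needs neither pixel-level correspondence nor labels; 3. The model performs well with very few labels and is computationally efficient.

Method: Uses a Transformer framework with MSA and U-MAA, aligning multimodal data through contrastive self-supervised learning.

Result: Reaches 95.4% overall accuracy on the SEN12MS dataset with only 20 labels per class, using 47x fewer parameters and 23x fewer FLOPs than MCTrans, and trains in under 5 hours on a single consumer GPU.

Insight: Unsupervised alignment of multimodal data and compression of high-dimensional inputs are key to efficient satellite image classification, and a lightweight model can extend to resource-constrained real-world settings.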

Abstract: We propose the Lightweight Multimodal Contrastive Attention Transformer (L-MCAT), a novel transformer-based framework for label-efficient remote sensing image classification using unpaired multimodal satellite data. L-MCAT introduces two core innovations: (1) Modality-Spectral Adapters (MSA) that compress high-dimensional sensor inputs into a unified embedding space, and (2) Unpaired Multimodal Attention Alignment (U-MAA), a contrastive self-supervised mechanism integrated into the attention layers to align heterogeneous modalities without pixel-level correspondence or labels. L-MCAT achieves 95.4% overall accuracy on the SEN12MS dataset using only 20 labels per class, outperforming state-of-the-art baselines while using 47x fewer parameters and 23x fewer FLOPs than MCTrans. It maintains over 92% accuracy even under 50% spatial misalignment, demonstrating robustness for real-world deployment. The model trains end-to-end in under 5 hours on a single consumer GPU.

[107] T$^\text{3}$SVFND: Towards an Evolving Fake News Detector for Emergencies with Test-time Training on Short Video Platforms

Liyuan Zhang,Zeyun Cheng,Yan Yang,Yong Liu,Jinke Ma

Main category: cs.CV

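TL;DR: This paper proposes T$^3$SVFND, a framework for detecting fake news in short videos that uses Test-Time Training (TTT) to improve generalization, particularly for fake news about emergencies.

Details Motivation: Existing fake news video detection methods perform poorly under distribution shift, especially on news about emergencies.

Contribution: Proposes a fake news detection framework that combines test-time training with a multimodal self-supervised auxiliary task, improving robustness on emergency events.

Method: Designs a self-supervised auxiliary task based on Mask Language Modeling (MLM) that predicts masked words using contextual information from other modalities (audio and video), and adapts to the test data distribution through this auxiliary task at test time.

Result: Performs well on a public benchmark, with especially strong results on detecting fake news about emergencies.

Insight: Combining test-time training with a multimodal self-supervised task is an effective way to handle distribution shift and improve generalization.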

Abstract: The existing methods for fake news videos detection may not be generalized, because there is a distribution shift between short video news of different events, and the performance of such techniques greatly drops if news records are coming from emergencies. We propose a new fake news videos detection framework (T$^3$SVFND) using Test-Time Training (TTT) to alleviate this limitation, enhancing the robustness of fake news videos detection. Specifically, we design a self-supervised auxiliary task based on Mask Language Modeling (MLM) that masks a certain percentage of words in text and predicts these masked words by combining contextual information from different modalities (audio and video). In the test-time training phase, the model adapts to the distribution of test data through auxiliary tasks. Extensive experiments on the public benchmark demonstrate the effectiveness of the proposed model, especially for the detection of emergency news.
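The masking step of the auxiliary task can be sketched in a few lines. The abstract only says that "a certain percentage of words" is masked, so the 15% default, the `[MASK]` token, and the `mask_words` helper are assumptions; the multimodal prediction model itself is out of scope here.

```python
import random


def mask_words(words, mask_ratio=0.15, mask_token="[MASK]", seed=0):
    """Replace a fraction of words with a mask token.

    Returns the masked sequence and a dict {position: original word}
    that serves as the prediction target for the auxiliary task.
    """
    if not words:
        return [], {}
    rng = random.Random(seed)  # seeded for reproducible examples
    n_mask = min(len(words), max(1, round(len(words) * mask_ratio)))
    positions = sorted(rng.sample(range(len(words)), n_mask))
    masked = list(words)
    targets = {}
    for position in positions:
        targets[position] = masked[position]
        masked[position] = mask_token
    return masked, targets
```

At test time, a model trained this way could be updated on the masked-word loss computed from unlabeled test videos before it makes its fake/real prediction, which is the general idea of test-time training.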

[108] Fine-structure Preserved Real-world Image Super-resolution via Transfer VAE Training

Qiaosi Yi,Shuai Li,Rongyuan Wu,Lingchen Sun,Yuhui Wu,Lei Zhang

Main category: cs.CV

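TL;DR: This paper proposes a Transfer VAE Training (TVT) strategy that transfers an 8x downsampled VAE into a 4x one while adapting to the pre-trained UNet, improving fine-structure reconstruction in real-world image super-resolution and lowering computational cost.

Details Motivation: Existing real-world image super-resolution methods based on Stable Diffusion (SD) reconstruct fine structures such as small characters and textures poorly, mainly because of the aggressive resolution reduction of the VAE in the SD model (e.g., 8x downsampling). The authors address this by lowering the VAE's downsampling rate while adapting to the pre-trained UNet.

Contribution: The main contributions are: 1) the TVT strategy, which transfers the 8x downsampled VAE into a 4x one while remaining compatible with the pre-trained UNet; 2) a compact VAE and a compute-efficient UNet that lower computational cost; 3) experiments showing significantly better fine-structure reconstruction with less computation.

Method: TVT has two steps: first train a 4x decoder on the output features of the original VAE encoder, then keep the newly trained decoder fixed and train a 4x encoder. The VAE and UNet architectures are also optimized to reduce computation.

Result: Experiments show that TVT significantly improves the reconstruction of fine structures such as small characters and textures, while requiring fewer FLOPs than state-of-the-art one-step diffusion models.

Insight: Lowering the VAE's downsampling rate through transfer training can significantly improve fine-structure reconstruction in image super-resolution while staying compatible with the pre-trained UNet, and architecture optimization can further reduce computational cost.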

Abstract: Impressive results on real-world image super-resolution (Real-ISR) have been achieved by employing pre-trained stable diffusion (SD) models. However, one critical issue of such methods lies in their poor reconstruction of image fine structures, such as small characters and textures, due to the aggressive resolution reduction of the VAE (e.g., 8$\times$ downsampling) in the SD model. One solution is to employ a VAE with a lower downsampling rate for diffusion; however, adapting its latent features with the pre-trained UNet while mitigating the increased computational cost poses new challenges. To address these issues, we propose a Transfer VAE Training (TVT) strategy to transfer the 8$\times$ downsampled VAE into a 4$\times$ one while adapting to the pre-trained UNet. Specifically, we first train a 4$\times$ decoder based on the output features of the original VAE encoder, then train a 4$\times$ encoder while keeping the newly trained decoder fixed. Such a TVT strategy aligns the new encoder-decoder pair with the original VAE latent space while enhancing image fine details. Additionally, we introduce a compact VAE and compute-efficient UNet by optimizing their network architectures, reducing the computational cost while capturing high-resolution fine-scale features. Experimental results demonstrate that our TVT method significantly improves fine-structure preservation, which is often compromised by other SD-based methods, while requiring fewer FLOPs than state-of-the-art one-step diffusion models. The official code can be found at https://github.com/Joyies/TVT.
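The "increased computational cost" of a lower downsampling rate follows from simple arithmetic on the latent grid. The sketch below assumes a 512x512 input purely for illustration (the abstract does not state an image size), and `latent_grid` is a hypothetical helper.

```python
def latent_grid(height, width, downsample):
    """Return (rows, cols, positions) of a VAE latent grid for a downsampling factor."""
    if height % downsample or width % downsample:
        raise ValueError("image size must be divisible by the downsampling factor")
    rows, cols = height // downsample, width // downsample
    return rows, cols, rows * cols


# Compare the two downsampling factors discussed in the abstract.
for factor in (8, 4):
    rows, cols, positions = latent_grid(512, 512, factor)
    print(f"{factor}x downsampling: {rows}x{cols} latent grid, {positions} positions")
```

Halving the downsampling factor doubles each spatial dimension of the latent, so the denoising network sees four times as many latent positions, which is why the authors pair the 4x VAE with a more compute-efficient UNet.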

Chenjian Gao,Lihe Ding,Rui Han,Zhanpeng Huang,Zibin Wang,Tianfan Xue

Main category: cs.CV

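TL;DR: The paper proposes a hybrid approach that combines 3D rendering with a 2D diffusion model to insert 3D bracelets into videos with temporal consistency and realistic lighting.

Details Motivation: The challenge in inserting 3D objects into videos is achieving temporal consistency and realistic lighting at the same time, especially in dynamic scenes. Existing 2D diffusion models lack temporal coherence, while traditional 3D rendering falls short of photorealistic lighting.

Contribution: Proposes a hybrid pipeline that combines the temporal consistency of 3D Gaussian Splatting (3DGS) with the lighting enhancement of a 2D diffusion model; the authors describe it as the first approach to combine 3D rendering and 2D diffusion for video object insertion.

Method: Uses 3DGS for initial rendering and a 2D diffusion-based enhancement model to refine shading and sRGB images; optimizes the 3DGS model with multi-frame weighted adjustments to maintain temporal coherence.

Result: Achieves high-quality insertion of 3D bracelets into dynamic videos, with both temporal consistency and realistic lighting.

Insight: Combining the geometric consistency of 3D rendering with the lighting refinement of 2D diffusion models is an effective route to 3D object insertion in video.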

Abstract: Inserting 3D objects into videos is a longstanding challenge in computer graphics with applications in augmented reality, virtual try-on, and video composition. Achieving both temporal consistency and realistic lighting remains difficult, particularly in dynamic scenarios with complex object motion, perspective changes, and varying illumination. While 2D diffusion models have shown promise for producing photorealistic edits, they often struggle with maintaining temporal coherence across frames. Conversely, traditional 3D rendering methods excel in spatial and temporal consistency but fall short in achieving photorealistic lighting. In this work, we propose a hybrid object insertion pipeline that combines the strengths of both paradigms. Specifically, we focus on inserting bracelets into dynamic wrist scenes, leveraging the high temporal consistency of 3D Gaussian Splatting (3DGS) for initial rendering and refining the results using a 2D diffusion-based enhancement model to ensure realistic lighting interactions. Our method introduces a shading-driven pipeline that separates intrinsic object properties (albedo, shading, reflectance) and refines both shading and sRGB images for photorealism. To maintain temporal coherence, we optimize the 3DGS model with multi-frame weighted adjustments. This is the first approach to synergize 3D rendering and 2D diffusion for video object insertion, offering a robust solution for realistic and consistent video editing. Project Page: https://cjeen.github.io/BraceletPaper/

[110] Detecting Visual Information Manipulation Attacks in Augmented Reality: A Multimodal Semantic Reasoning Approach

Yanming Xiu,Maria Gorlatova

Main category: cs.CV

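TL;DR: This paper addresses visual information manipulation (VIM) attacks in AR. It introduces a taxonomy, constructs a dataset, and proposes a detection framework based on multimodal semantic reasoning.

Details Motivation: Virtual content in AR can introduce misleading information, leading to semantic misunderstandings or user errors. The study focuses on VIM attacks and aims to address this problem.

Contribution: Introduces a taxonomy of VIM attacks, constructs the AR-VIM dataset, and proposes VIM-Sense, a multimodal semantic reasoning framework.

Method: A multimodal semantic reasoning framework that combines vision-language models (VLMs) with OCR-based textual analysis to detect VIM attacks.

Result: VIM-Sense reaches 88.94% attack detection accuracy on the AR-VIM dataset, outperforming vision-only and text-only baselines, with an average detection latency of 7.17 seconds in a real-world evaluation on a mobile Android AR application.

Insight: A multimodal approach combines visual and language information effectively, improves detection of VIM attacks, and offers a new solution for AR security.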

Abstract: The virtual content in augmented reality (AR) can introduce misleading or harmful information, leading to semantic misunderstandings or user errors. In this work, we focus on visual information manipulation (VIM) attacks in AR where virtual content changes the meaning of real-world scenes in subtle but impactful ways. We introduce a taxonomy that categorizes these attacks into three formats: character, phrase, and pattern manipulation, and three purposes: information replacement, information obfuscation, and extra wrong information. Based on the taxonomy, we construct a dataset, AR-VIM. It consists of 452 raw-AR video pairs spanning 202 different scenes, each simulating a real-world AR scenario. To detect such attacks, we propose a multimodal semantic reasoning framework, VIM-Sense. It combines the language and visual understanding capabilities of vision-language models (VLMs) with optical character recognition (OCR)-based textual analysis. VIM-Sense achieves an attack detection accuracy of 88.94% on AR-VIM, consistently outperforming vision-only and text-only baselines. The system reaches an average attack detection latency of 7.07 seconds in a simulated video processing framework and 7.17 seconds in a real-world evaluation conducted on a mobile Android AR application.

[111] Generative Pre-training for Subjective Tasks: A Diffusion Transformer-Based Framework for Facial Beauty Prediction

Djamel Eddine Boukhari,Ali chemsa

Main category: cs.CV

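TL;DR: The paper proposes Diff-FBP, a two-stage framework based on a Diffusion Transformer that uses generative pre-training to improve facial beauty prediction, significantly outperforming existing methods.

Details Motivation: Facial Beauty Prediction (FBP) is challenging because it is subjective and depends on subtle features. Existing methods are usually based on models pre-trained for generic tasks and struggle to learn features aligned with aesthetic assessment.

Contribution: Proposes the Diff-FBP framework, which uses generative pre-training of a Diffusion Transformer to learn the distribution of facial data and serve as a domain-specific feature extractor, significantly improving FBP performance.

Method: 1. Pre-trains a Diffusion Transformer on the FFHQ dataset through a self-supervised denoising task;
2. Freezes the pre-trained encoder and fine-tunes only a lightweight regression head on the FBP5500 dataset.

Result: Reaches a Pearson Correlation Coefficient of 0.932 on the FBP5500 benchmark, significantly outperforming existing methods.

Insight: Generative pre-training can learn domain-specific features and is particularly suited to subjective visual tasks, offering a new direction for related research.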

Abstract: Facial Beauty Prediction (FBP) is a challenging computer vision task due to its subjective nature and the subtle, holistic features that influence human perception. Prevailing methods, often based on deep convolutional networks or standard Vision Transformers pre-trained on generic object classification (e.g., ImageNet), struggle to learn feature representations that are truly aligned with high-level aesthetic assessment. In this paper, we propose a novel two-stage framework that leverages the power of generative models to create a superior, domain-specific feature extractor. In the first stage, we pre-train a Diffusion Transformer on a large-scale, unlabeled facial dataset (FFHQ) through a self-supervised denoising task. This process forces the model to learn the fundamental data distribution of human faces, capturing nuanced details and structural priors essential for aesthetic evaluation. In the second stage, the pre-trained and frozen encoder of our Diffusion Transformer is used as a backbone feature extractor, with only a lightweight regression head being fine-tuned on the target FBP dataset (FBP5500). Our method, termed Diff-FBP, sets a new state-of-the-art on the FBP5500 benchmark, achieving a Pearson Correlation Coefficient (PCC) of 0.932, significantly outperforming prior art based on general-purpose pre-training. Extensive ablation studies validate that our generative pre-training strategy is the key contributor to this performance leap, creating feature representations that are more semantically potent for subjective visual tasks.
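The headline metric, the Pearson Correlation Coefficient (PCC), measures how linearly the predicted scores track the human ratings. A minimal standard-library sketch is shown below; the function name and the sample scores in the usage note are made up for illustration and are not data from the paper.

```python
import math


def pearson(predicted, target):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(predicted) != len(target) or len(predicted) < 2:
        raise ValueError("need two lists of equal length with at least two items")
    mean_p = sum(predicted) / len(predicted)
    mean_t = sum(target) / len(target)
    covariance = sum((p - mean_p) * (t - mean_t) for p, t in zip(predicted, target))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_t = sum((t - mean_t) ** 2 for t in target)
    if var_p == 0 or var_t == 0:
        raise ValueError("correlation is undefined for constant input")
    return covariance / math.sqrt(var_p * var_t)
```

For example, `pearson([1, 2, 3], [2, 4, 6])` is 1 up to floating-point error because the second list is an exact linear function of the first, so a reported PCC of 0.932 indicates predictions that follow human ratings closely but not perfectly.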

[112] MagicAnime: A Hierarchically Annotated, Multimodal and Multitasking Dataset with Benchmarks for Cartoon Animation Generation

Shuolin Xu,Bingyuan Wang,Zeyu Cai,Fangteng Fu,Yue Ma,Tongyi Lee,Hongchuan Yu,Zeyu Wang

Main category: cs.CV

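TL;DR: MagicAnime is a large-scale, hierarchically annotated, multimodal dataset designed to support multi-task cartoon animation generation. It includes 400k video clips and 50k pairs of video clips and keypoints, among other data, and comes with multimodal benchmarks.

Details Motivation: Cartoon animation generation is challenging because of complex non-human characters, stylistically diverse motions, and fine-grained emotions, while existing datasets are scarce and there is a domain gap to real-world videos.

Contribution: Introduces the MagicAnime dataset, which supports multiple generation tasks, together with the multimodal benchmark MagicAnime-Bench.

Method: Builds a hierarchically annotated dataset containing video clips, keypoints, audio, and other data, and validates its effectiveness on multiple tasks through experiments.

Result: Supports high-fidelity, fine-grained, and controllable generation on video-driven, audio-driven, image-to-video, and pose-driven tasks.

Insight: MagicAnime fills a gap in cartoon animation datasets and provides important support for multimodal control and high-fidelity generation.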

Abstract: Generating high-quality cartoon animations with multimodal control is challenging due to the complexity of non-human characters, stylistically diverse motions and fine-grained emotions. There is a huge domain gap between real-world videos and cartoon animation, as cartoon animation is usually abstract and has exaggerated motion. Meanwhile, public multimodal cartoon data are extremely scarce due to the difficulty of large-scale automatic annotation processes compared with real-life scenarios. To bridge this gap, we propose the MagicAnime dataset, a large-scale, hierarchically annotated, and multimodal dataset designed to support multiple video generation tasks, along with the benchmarks it includes. It contains 400k video clips for image-to-video generation, 50k pairs of video clips and keypoints for whole-body annotation, 12k pairs of video clips for video-to-video face animation, and 2.9k pairs of video and audio clips for audio-driven face animation. Meanwhile, we also build a set of multi-modal cartoon animation benchmarks, called MagicAnime-Bench, to support the comparisons of different methods in the tasks above. Comprehensive experiments on four tasks, including video-driven face animation, audio-driven face animation, image-to-video animation, and pose-driven character animation, validate its effectiveness in supporting high-fidelity, fine-grained, and controllable generation.

[113] ModalFormer: Multimodal Transformer for Low-Light Image Enhancement

Alexandru Brateanu,Raul Balmez,Ciprian Orhei,Codruta Ancuti,Cosmin Ancuti

Main category: cs.CV

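TL;DR: ModalFormer is a multimodal low-light image enhancement framework that combines a cross-modal Transformer with multiple auxiliary subnetworks and uses nine auxiliary modalities to achieve state-of-the-art performance.

Details Motivation: Current low-light image enhancement methods mostly rely on pixel-level transformations of RGB images and neglect the rich contextual information that multiple modalities provide. ModalFormer aims to address this.

Contribution: 1. Proposes ModalFormer, the first large-scale multimodal LLIE framework; 2. Designs a Cross-modal Multi-headed Self-Attention mechanism (CM-MSA) that fuses RGB data with multimodal features; 3. Validates the model on multiple benchmark datasets.

Method: 1. A Cross-modal Transformer (CM-T) restores the image while integrating multimodal information; 2. Multiple auxiliary subnetworks reconstruct multimodal features; 3. CM-MSA fuses RGB and multimodal features to generate hybrid attention maps.

Result: ModalFormer achieves state-of-the-art performance on multiple low-light image enhancement benchmarks.

Insight: Introducing multimodal information markedly improves low-light image enhancement, and CM-MSA offers an effective solution for cross-modal feature fusion.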

Abstract: Low-light image enhancement (LLIE) is a fundamental yet challenging task due to the presence of noise, loss of detail, and poor contrast in images captured under insufficient lighting conditions. Recent methods often rely solely on pixel-level transformations of RGB images, neglecting the rich contextual information available from multiple visual modalities. In this paper, we present ModalFormer, the first large-scale multimodal framework for LLIE that fully exploits nine auxiliary modalities to achieve state-of-the-art performance. Our model comprises two main components: a Cross-modal Transformer (CM-T) designed to restore corrupted images while seamlessly integrating multimodal information, and multiple auxiliary subnetworks dedicated to multimodal feature reconstruction. Central to the CM-T is our novel Cross-modal Multi-headed Self-Attention mechanism (CM-MSA), which effectively fuses RGB data with modality-specific features–including deep feature embeddings, segmentation information, geometric cues, and color information–to generate information-rich hybrid attention maps. Extensive experiments on multiple benchmark datasets demonstrate ModalFormer’s state-of-the-art performance in LLIE. Pre-trained models and results are made available at https://github.com/albrateanu/ModalFormer.

[114] VESPA: Towards un(Human)supervised Open-World Pointcloud Labeling for Autonomous Driving

Levente Tempfli,Esteban Rivera,Markus Lienkamp

Main category: cs.CV

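TL;DR: VESPA is a multimodal autolabeling pipeline that combines the geometric precision of LiDAR with the semantic richness of camera images. It uses vision-language models (VLMs) for open-vocabulary object labeling, refines detection quality in the point cloud domain, and can discover novel categories and produce high-quality 3D pseudolabels without ground-truth annotations or HD maps.

Details Motivation: Data collection for autonomous driving is growing quickly, but manual annotation, especially 3D annotation, is costly and labor-intensive. Existing LiDAR autolabeling methods are limited by data sparsity and occlusion and lack semantic granularity. VESPA aims to address these limitations.

Contribution: 1. Proposes the multimodal autolabeling pipeline VESPA; 2. Fuses LiDAR geometry with camera semantics; 3. Uses VLMs for open-vocabulary labeling and improves detection quality in the point cloud domain; 4. Supports the discovery of novel categories without ground-truth annotations or HD maps.

Method: VESPA combines LiDAR and camera data, uses vision-language models for open-vocabulary object labeling, and refines detection quality directly in the point cloud domain through multimodal fusion. The method needs neither ground-truth annotations nor prior maps.

Result: On the nuScenes dataset, VESPA reaches 52.95% AP for object discovery and up to 46.54% AP for multiclass object detection, showing strong performance in scalable 3D scene understanding.

Insight: Multimodal fusion (LiDAR plus camera) and vision-language models markedly improve the semantic capability and practicality of autolabeling, offering a new approach to open-world 3D scene understanding.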

Abstract: Data collection for autonomous driving is rapidly accelerating, but manual annotation, especially for 3D labels, remains a major bottleneck due to its high cost and labor intensity. Autolabeling has emerged as a scalable alternative, allowing the generation of labels for point clouds with minimal human intervention. While LiDAR-based autolabeling methods leverage geometric information, they struggle with inherent limitations of LiDAR data, such as sparsity, occlusions, and incomplete object observations. Furthermore, these methods typically operate in a class-agnostic manner, offering limited semantic granularity. To address these challenges, we introduce VESPA, a multimodal autolabeling pipeline that fuses the geometric precision of LiDAR with the semantic richness of camera images. Our approach leverages vision-language models (VLMs) to enable open-vocabulary object labeling and to refine detection quality directly in the point cloud domain. VESPA supports the discovery of novel categories and produces high-quality 3D pseudolabels without requiring ground-truth annotations or HD maps. On the nuScenes dataset, VESPA achieves an AP of 52.95% for object discovery and up to 46.54% for multiclass object detection, demonstrating strong performance in scalable 3D scene understanding. Code will be available upon acceptance.

[115] The Importance of Facial Features in Vision-based Sign Language Recognition: Eyes, Mouth or Full Face?

Dinh Nam Pham,Eleftherios Avramidis

Main category: cs.CV

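TL;DR: The paper systematically studies the role of facial features (eyes, mouth, and full face) in automatic sign language recognition and finds that the mouth is the most important non-manual feature.

Details Motivation: Non-manual facial features are crucial in sign language, but their role in automatic sign language recognition (ASLR) is underexplored. Existing work mostly relies on hand-crafted features and does not compare facial regions in depth.

Contribution: Through quantitative and qualitative analysis, shows that the mouth is the most important non-manual facial feature and that it significantly improves recognition accuracy.

Method: Uses a CNN-based and a transformer-based deep learning model to systematically evaluate different facial regions on a dataset of isolated signs.

Result: Experiments show that mouth features significantly improve recognition accuracy, underscoring the need to incorporate facial features in ASLR.

Insight: Incorporating facial features, especially the mouth, can significantly improve sign language recognition and points to a direction for future ASLR research.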

Abstract: Non-manual facial features play a crucial role in sign language communication, yet their importance in automatic sign language recognition (ASLR) remains underexplored. While prior studies have shown that incorporating facial features can improve recognition, related work often relies on hand-crafted feature extraction and fails to go beyond the comparison of manual features versus the combination of manual and facial features. In this work, we systematically investigate the contribution of distinct facial regions (eyes, mouth, and full face) using two different deep learning models (a CNN-based model and a transformer-based model) trained on an SLR dataset of isolated signs with randomly selected classes. Through quantitative performance and qualitative saliency map evaluation, we reveal that the mouth is the most important non-manual facial feature, significantly improving accuracy. Our findings highlight the necessity of incorporating facial features in ASLR.

[116] $A^2R^2$: Advancing Img2LaTeX Conversion via Visual Reasoning with Attention-Guided Refinement

Zhecheng Li,Guoxian Song,Yiwei Wang,Zhen Xiong,Junsong Yuan,Yujun Cai

Main category: cs.CV

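TL;DR: The paper proposes $A^2R^2$, a framework that uses attention-guided refinement and visual reasoning to address the weak performance of existing vision-language models on the Img2LaTeX task.

Details Motivation: Existing vision-language models underperform on Img2LaTeX and are especially error-prone on fine-grained visual elements.

Contribution: Proposes the $A^2R^2$ framework, which combines attention localization with iterative refinement and significantly improves model performance, and introduces a new dataset, Img2LaTex-Hard-1K, for rigorous evaluation.

Method: $A^2R^2$ uses attention mechanisms and iterative reasoning to self-correct and progressively improve prediction quality.

Result: Experiments show that $A^2R^2$ outperforms baseline methods on all six evaluation metrics and that more inference rounds bring further gains.

Insight: Combining visual reasoning with attention-guided refinement significantly improves vision-language models on complex tasks.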

Abstract: Img2LaTeX is a practically significant task that involves converting mathematical expressions or tabular data from images into LaTeX code. In recent years, vision-language models (VLMs) have demonstrated strong performance across a variety of visual understanding tasks, owing to their generalization capabilities. While some studies have explored the use of VLMs for the Img2LaTeX task, their performance often falls short of expectations. Empirically, VLMs sometimes struggle with fine-grained visual elements, leading to inaccurate LaTeX predictions. To address this challenge, we propose $A^2R^2$: Advancing Img2LaTeX Conversion via Visual Reasoning with Attention-Guided Refinement, a framework that effectively integrates attention localization and iterative refinement within a visual reasoning framework, enabling VLMs to perform self-correction and progressively improve prediction quality. For effective evaluation, we introduce a new dataset, Img2LaTex-Hard-1K, consisting of 1,100 carefully curated and challenging examples designed to rigorously evaluate the capabilities of VLMs within this task domain. Extensive experimental results demonstrate that: (1) $A^2R^2$ significantly improves model performance across six evaluation metrics spanning both textual and visual levels, consistently outperforming other baseline methods; (2) Increasing the number of inference rounds yields notable performance gains, underscoring the potential of $A^2R^2$ in test-time scaling scenarios; (3) Ablation studies and human evaluations validate the practical effectiveness of our approach, as well as the strong synergy among its core components during inference.

[117] Automated 3D-GS Registration and Fusion via Skeleton Alignment and Gaussian-Adaptive Features

Shiyang Liu,Dianyi Yang,Yu Gao,Bohan Ren,Yi Yang,Mengyin Fu

Main category: cs.CV

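TL;DR: The paper proposes an automated method for registering and fusing multiple 3D Gaussian Splatting (3D-GS) sub-maps. Skeleton alignment and Gaussian-adaptive features significantly improve registration accuracy and fusion quality.

Details Motivation: Existing methods mostly rely on manual intervention, and hard-threshold filtering degrades rendering quality after fusion. The authors therefore propose an automated method that needs no manual intervention and improves both registration accuracy and fusion quality.

Contribution: 1. Proposes an automated 3D-GS sub-map registration method based on skeleton alignment; 2. Introduces a multi-factor Gaussian fusion strategy that reduces the loss of scene elements; 3. Improves registration and fusion quality in complex scenes.

Method: 1. Geometric skeleton extraction and ellipsoid-aware convolution capture 3D-GS attributes for robust scene registration; 2. A multi-factor Gaussian fusion strategy avoids the rendering degradation caused by hard-threshold filtering.

Result: On ScanNet-GSReg and the authors' own Coord dataset, the method reduces registration RRE by 41.9% on complex scenes and improves fusion PSNR by 10.11 dB.

Insight: Automated skeleton alignment and an adaptive fusion strategy can significantly improve the accuracy and consistency of 3D scene registration and reconstruction without manual intervention.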

Abstract: In recent years, 3D Gaussian Splatting (3D-GS)-based scene representation demonstrates significant potential in real-time rendering and training efficiency. However, most existing methods primarily focus on single-map reconstruction, while the registration and fusion of multiple 3D-GS sub-maps remain underexplored. Existing methods typically rely on manual intervention to select a reference sub-map as a template and use point cloud matching for registration. Moreover, hard-threshold filtering of 3D-GS primitives often degrades rendering quality after fusion. In this paper, we present a novel approach for automated 3D-GS sub-map alignment and fusion, eliminating the need for manual intervention while enhancing registration accuracy and fusion quality. First, we extract geometric skeletons across multiple scenes and leverage ellipsoid-aware convolution to capture 3D-GS attributes, facilitating robust scene registration. Second, we introduce a multi-factor Gaussian fusion strategy to mitigate the scene element loss caused by rigid thresholding. Experiments on the ScanNet-GSReg and our Coord datasets demonstrate the effectiveness of the proposed method in registration and fusion. For registration, it achieves a 41.9% reduction in RRE on complex scenes, ensuring more precise pose estimation. For fusion, it improves PSNR by 10.11 dB, highlighting superior structural preservation. These results confirm its ability to enhance scene alignment and reconstruction fidelity, ensuring more consistent and accurate 3D scene representation for robotic perception and autonomous navigation.
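The abstract reports a reduction in RRE without defining it. A common definition of relative rotation error, assumed here, is the geodesic angle between the estimated and ground-truth rotations. The sketch below is illustrative only; the function names are hypothetical and not taken from the paper.

```python
import math

def matmul3(a, b):
    """Multiply two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def transpose3(m):
    """Transpose a 3x3 matrix."""
    return [[m[j][i] for j in range(3)] for i in range(3)]

def relative_rotation_error_deg(r_est, r_gt):
    """Geodesic angle (degrees) between two rotation matrices.

    Assumed definition: arccos((trace(R_gt^T R_est) - 1) / 2).
    """
    delta = matmul3(transpose3(r_gt), r_est)
    trace = delta[0][0] + delta[1][1] + delta[2][2]
    # Clamp to [-1, 1] to guard against floating-point drift.
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_angle))

def rot_z(deg):
    """Rotation about the z axis, used only to build test inputs."""
    t = math.radians(deg)
    return [[math.cos(t), -math.sin(t), 0.0],
            [math.sin(t), math.cos(t), 0.0],
            [0.0, 0.0, 1.0]]
```

Under this definition, an estimate that is off by 10 degrees about a single axis yields an RRE of 10 degrees, independent of the absolute orientation.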

[118] Investigating the Effect of Spatial Context on Multi-Task Sea Ice Segmentation

Behzad Vahedi,Rafael Pires de Lima,Sepideh Jalayer,Walter N. Meier,Andrew P. Barrett,Morteza Karimzadeh

Main category: cs.CV

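TL;DR: The paper studies how spatial context affects multi-task sea ice segmentation, controlling receptive field size through the atrous rate and combining data of different resolutions to improve segmentation.

Details Motivation: The effect of multi-scale spatial context on sea ice segmentation is underexplored, particularly in multi-task settings with multi-source data such as SAR and AMSR2.

Contribution: Systematically studies the relationship between spatial context and multi-task segmentation, proposes choosing the receptive field according to observation resolution and task characteristics, and shows the benefit of multi-source data fusion.

Method: Uses Atrous Spatial Pyramid Pooling (ASPP) to adjust the receptive field, visualizes model decisions with Grad-CAM, and analyzes the fusion of data at different resolutions (Sentinel-1 and AMSR2).

Result: Small receptive fields suit high-resolution Sentinel-1 data, medium receptive fields do better for stage-of-development segmentation, and large receptive fields often perform worse. Fusing SAR and AMSR2 improves segmentation on all tasks.

Insight: Observation resolution and target properties are decisive for choosing spatial context. Multi-source fusion compensates for the limits of any single data source and offers guidance for optimizing deep learning models in geoscience applications.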

Abstract: Capturing spatial context at multiple scales is crucial for deep learning-based sea ice segmentation. However, the optimal specification of spatial context based on observation resolution and task characteristics remains underexplored. This study investigates the impact of spatial context on the segmentation of sea ice concentration, stage of development, and floe size using a multi-task segmentation model. We implement Atrous Spatial Pyramid Pooling with varying atrous rates to systematically control the receptive field size of convolutional operations, and to capture multi-scale contextual information. We explore the interactions between spatial context and feature resolution for different sea ice properties and examine how spatial context influences segmentation performance across different input feature combinations from Sentinel-1 SAR and Advanced Microwave Scanning Radiometer 2 (AMSR2) for multi-task mapping. Using Gradient-weighted Class Activation Mapping, we visualize how atrous rates influence model decisions. Our findings indicate that smaller receptive fields excel for high-resolution Sentinel-1 data, while medium receptive fields yield better performance for stage of development segmentation and larger receptive fields often lead to diminished performance. The fusion of SAR and AMSR2 enhances segmentation across all tasks. We highlight the value of lower-resolution 18.7 and 36.5 GHz AMSR2 channels in sea ice mapping. These findings highlight the importance of selecting appropriate spatial context based on observation resolution and target properties in sea ice mapping. By systematically analyzing receptive field effects in a multi-task setting, our study provides insights for optimizing deep learning models in geospatial applications.
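To make the link between atrous rate and receptive field concrete: a dilated convolution with kernel size k and rate r spans k + (k - 1)(r - 1) input pixels along each axis. The sketch below applies that standard formula to one ASPP layer. The rates (1, 6, 12, 18) are an assumption borrowed from common ASPP configurations, not values reported in the abstract.

```python
def effective_kernel_size(kernel_size: int, atrous_rate: int) -> int:
    """Spatial extent (in input pixels) of one dilated convolution kernel."""
    return kernel_size + (kernel_size - 1) * (atrous_rate - 1)

def aspp_branch_extents(kernel_size, atrous_rates):
    """Map each ASPP branch's atrous rate to the extent its kernel covers."""
    return {rate: effective_kernel_size(kernel_size, rate)
            for rate in atrous_rates}

# Hypothetical rates; the paper varies them but the abstract gives no values.
extents = aspp_branch_extents(3, [1, 6, 12, 18])
for rate, extent in extents.items():
    print(f"rate={rate:2d} -> {extent} x {extent} pixels")
```

Multiplying each extent by the pixel spacing of the input gives the ground footprint a branch sees, which is one way to relate the atrous rate to observation resolution.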

[119] GaRe: Relightable 3D Gaussian Splatting for Outdoor Scenes from Unconstrained Photo Collections

Haiyang Bai,Jiaqi Zhu,Songru Jiang,Wei Huang,Tao Lu,Yuanqi Li,Jie Guo,Runze Fu,Yanwen Guo,Lijun Chen

Main category: cs.CV

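TL;DR: The paper proposes GaRe, an outdoor relighting framework based on 3D Gaussian splatting. It uses intrinsic image decomposition to precisely integrate sunlight, sky radiance, and indirect lighting from unconstrained photo collections, and supports diverse shading manipulation and dynamic shadow effects.

Details Motivation: Prior methods compress the per-image global illumination into a single latent vector, which makes diverse shading and dynamic shadow effects hard to achieve. The paper aims for a more precise and flexible relighting method.

Contribution: 1) A residual-based sun visibility extraction method that separates direct sunlight effects; 2) A region-based supervision framework with a structural consistency loss for physically interpretable illumination decomposition; 3) A ray-tracing-based shadow simulation technique.

Method: Combines intrinsic image decomposition with Gaussian splatting, extracts sun visibility with a residual-based method, optimizes the illumination decomposition through region-based supervision and a structural consistency loss, and simulates shadows with ray tracing.

Result: Experiments show that the method synthesizes novel views with fidelity competitive with state-of-the-art relighting solutions while producing more natural and multifaceted illumination and shadow effects.

Insight: Decomposing illumination into interpretable physical components and adding dynamic shadow simulation gives more flexible control over relighting in complex outdoor scenes.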

Abstract: We propose a 3D Gaussian splatting-based framework for outdoor relighting that leverages intrinsic image decomposition to precisely integrate sunlight, sky radiance, and indirect lighting from unconstrained photo collections. Unlike prior methods that compress the per-image global illumination into a single latent vector, our approach enables simultaneously diverse shading manipulation and the generation of dynamic shadow effects. This is achieved through three key innovations: (1) a residual-based sun visibility extraction method to accurately separate direct sunlight effects, (2) a region-based supervision framework with a structural consistency loss for physically interpretable and coherent illumination decomposition, and (3) a ray-tracing-based technique for realistic shadow simulation. Extensive experiments demonstrate that our framework synthesizes novel views with competitive fidelity against state-of-the-art relighting solutions and produces more natural and multifaceted illumination and shadow effects.

[120] T2VParser: Adaptive Decomposition Tokens for Partial Alignment in Text to Video Retrieval

Yili Li,Gang Xiong,Gaopeng Gou,Xiangyan Qu,Jiamin Zhuang,Zhen Li,Junzheng Shi

Main category: cs.CV

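TL;DR: T2VParser introduces Adaptive Decomposition Tokens and uses multiview semantic representations to align text and video partially rather than globally. This addresses the fact that a text description reflects only part of a video's content, and the cross-modal content decomposition proves effective.

Details Motivation: Videos usually carry richer information than images, and in current video-text datasets the text describes only part of the video, which causes partial misalignment in video-text matching. Directly aligning text and video representations introduces incorrect supervision and ignores this inequivalence of information.

Contribution: 1. Proposes T2VParser, which achieves adaptive partial alignment between text and video through multiview semantic representations. 2. Introduces Adaptive Decomposition Tokens shared across modalities to extract corresponding representations from each modality.

Method: 1. Extract multiview semantic representations from text and video. 2. Decompose cross-modal content with the shared Adaptive Decomposition Tokens to achieve partial alignment. 3. Retain the knowledge of pretrained models while emphasizing precise text-video alignment.

Result: Experiments show that T2VParser achieves accurate partial alignment through effective cross-modal content decomposition.

Insight: 1. The inequivalence of information between text and video matters, and direct global alignment can introduce noise. 2. Adaptive Decomposition Tokens offer a flexible way to capture local cross-modal semantic correspondences.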

Abstract: Text-to-video retrieval essentially aims to train models to align visual content with textual descriptions accurately. Due to the impressive general multimodal knowledge demonstrated by image-text pretrained models such as CLIP, existing work has primarily focused on extending CLIP knowledge for video-text tasks. However, videos typically contain richer information than images. In current video-text datasets, textual descriptions can only reflect a portion of the video content, leading to partial misalignment in video-text matching. Therefore, directly aligning text representations with video representations can result in incorrect supervision, ignoring the inequivalence of information. In this work, we propose T2VParser to extract multiview semantic representations from text and video, achieving adaptive semantic alignment rather than aligning the entire representation. To extract corresponding representations from different modalities, we introduce Adaptive Decomposition Tokens, which consist of a set of learnable tokens shared across modalities. The goal of T2VParser is to emphasize precise alignment between text and video while retaining the knowledge of pretrained models. Experimental results demonstrate that T2VParser achieves accurate partial alignment through effective cross-modal content decomposition. The code is available at https://github.com/Lilidamowang/T2VParser.

[121] AgroBench: Vision-Language Model Benchmark in Agriculture

Risa Shinoda,Nakamasa Inoue,Hirokatsu Kataoka,Masaki Onishi,Yoshitaka Ushiku

Main category: cs.CV

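TL;DR: AgroBench is a vision-language model (VLM) benchmark for agriculture that covers seven agricultural topics and is annotated by expert agronomists. It is designed to evaluate VLMs on fine-grained agricultural tasks. Results show that current models still have room for improvement in fine-grained identification and perform particularly poorly on weed identification.

Details Motivation: Automated understanding of agricultural tasks is essential for sustainable crop production, and VLMs open up agricultural applications through text-based interaction. Existing benchmarks lack expert annotation and broad coverage, so a more comprehensive evaluation tool is needed.

Contribution: Proposes the AgroBench benchmark, covering 203 crop categories and 682 disease categories with expert annotation, for a more comprehensive evaluation of VLMs on agricultural tasks.

Method: Builds a dataset spanning seven agricultural topics, annotated by agronomists, and evaluates existing VLMs on fine-grained tasks such as disease and weed identification.

Result: Existing VLMs perform poorly on fine-grained identification tasks, and in weed identification most open-source VLMs perform close to random. An error analysis exposes the models' limitations.

Insight: Applying VLMs in agriculture requires optimization for fine-grained tasks, and expert-annotated datasets with broad coverage are key to evaluation and improvement.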

Abstract: Precise automated understanding of agricultural tasks such as disease identification is essential for sustainable crop production. Recent advances in vision-language models (VLMs) are expected to further expand the range of agricultural tasks by facilitating human-model interaction through easy, text-based communication. Here, we introduce AgroBench (Agronomist AI Benchmark), a benchmark for evaluating VLM models across seven agricultural topics, covering key areas in agricultural engineering and relevant to real-world farming. Unlike recent agricultural VLM benchmarks, AgroBench is annotated by expert agronomists. Our AgroBench covers a state-of-the-art range of categories, including 203 crop categories and 682 disease categories, to thoroughly evaluate VLM capabilities. In our evaluation on AgroBench, we reveal that VLMs have room for improvement in fine-grained identification tasks. Notably, in weed identification, most open-source VLMs perform close to random. With our wide range of topics and expert-annotated categories, we analyze the types of errors made by VLMs and suggest potential pathways for future VLM development. Our dataset and code are available at https://dahlian00.github.io/AgroBenchPage/ .

[122] Enhancing Spatial Reasoning through Visual and Textual Thinking

Xun Liang,Xin Guo,Zhongming Jin,Weihang Pan,Penghui Shang,Deng Cai,Binbin Lin,Jieping Ye

Main category: cs.CV

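TL;DR: The paper proposes SpatialVTS, a method that enhances spatial reasoning through simultaneous visual and textual thinking, and reports significantly better results on several spatial understanding tasks.

Details Motivation: Although vision-language models (VLMs) have developed rapidly, they still struggle with spatial reasoning. The paper aims to improve this by using visual and textual information together.

Contribution: 1. Proposes SpatialVTS, which strengthens spatial reasoning through a visual thinking phase and a textual thinking phase; 2. Improves dataset annotations and the input format; 3. Significantly improves performance without introducing additional information.

Method: 1. Spatial visual thinking phase: generate location-related specific tokens for the essential targets; 2. Spatial textual thinking phase: reason step by step toward the answer based on visual cues and dialogues; 3. Manually correct the dataset and restructure the input format.

Result: Across several spatial understanding tasks, the model's overall average performance is significantly better than that of other models.

Insight: Spatial reasoning benefits from processing visual and textual information together, and data quality and input format have a major effect on model performance.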

Abstract: The spatial reasoning task aims to reason about the spatial relationships in 2D and 3D space, which is a fundamental capability for Visual Question Answering (VQA) and robotics. Although vision language models (VLMs) have developed rapidly in recent years, they still struggle with the spatial reasoning task. In this paper, we introduce a method that can enhance Spatial reasoning through Visual and Textual thinking Simultaneously (SpatialVTS). In the spatial visual thinking phase, our model is trained to generate location-related specific tokens of essential targets automatically. Not only are the objects mentioned in the problem addressed, but also the potential objects related to the reasoning are considered. During the spatial textual thinking phase, our model conducts long-term thinking based on visual cues and dialogues, gradually inferring the answers to spatial reasoning problems. To effectively support the model’s training, we perform manual corrections to the existing spatial reasoning dataset, eliminating numerous incorrect labels resulting from automatic annotation, restructuring the data input format to enhance generalization ability, and developing thinking processes with logical reasoning details. Without introducing additional information (such as masks or depth), our model’s overall average level in several spatial understanding tasks has significantly improved compared with other models.

[123] Low-Cost Machine Vision System for Sorting Green Lentils (Lens Culinaris) Based on Pneumatic Ejection and Deep Learning

Davy Rojas Yana,Edwin Salcedo

Main category: cs.CV

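TL;DR: The paper designs a low-cost dynamic sorting system for green lentils based on computer vision and pneumatic ejection. A two-stage YOLOv8 pipeline performs real-time multi-class classification, and the system reaches a grain separation accuracy of 87.2%.

Details Motivation: To provide a low-cost, efficient, and accurate sorting solution for agricultural product processing that reduces manual intervention.

Contribution: Proposes a two-stage YOLOv8 pipeline combined with a pneumatic ejection mechanism for real-time classification and separation of green lentils.

Method: A YOLOv8 detection model locates the grains, a second YOLOv8 classification model assigns them to six classes, and a pneumatic ejector driven by an Arduino-based control system separates them.

Result: At a conveyor speed of 59 mm/s the system achieves a separation accuracy of 87.2%, with a processing rate of 8 grams per minute.

Insight: Demonstrates the potential of low-cost machine vision for sorting agricultural products and provides a modular foundation for future improvements.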

Abstract: This paper presents the design, development, and evaluation of a dynamic grain classification system for green lentils (Lens Culinaris), which leverages computer vision and pneumatic ejection. The system integrates a YOLOv8-based detection model that identifies and locates grains on a conveyor belt, together with a second YOLOv8-based classification model that categorises grains into six classes: Good, Yellow, Broken, Peeled, Dotted, and Reject. This two-stage YOLOv8 pipeline enables accurate, real-time, multi-class categorisation of lentils, implemented on a low-cost, modular hardware platform. The pneumatic ejection mechanism separates defective grains, while an Arduino-based control system coordinates real-time interaction between the vision system and mechanical components. The system operates effectively at a conveyor speed of 59 mm/s, achieving a grain separation accuracy of 87.2%. Despite a limited processing rate of 8 grams per minute, the prototype demonstrates the potential of machine vision for grain sorting and provides a modular foundation for future enhancements.
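The coordination between vision and ejection comes down to timing: a grain seen by the camera reaches the nozzle after travelling a fixed distance at the belt speed. The sketch below illustrates that calculation and a simple ejection rule. Only the 59 mm/s conveyor speed and the six class names come from the abstract. The camera-to-nozzle distance, the processing latency, and the rule that every class other than Good is ejected are assumptions made for illustration.

```python
CONVEYOR_SPEED_MM_S = 59.0  # conveyor speed reported in the abstract

CLASSES = ("Good", "Yellow", "Broken", "Peeled", "Dotted", "Reject")

def should_eject(label: str) -> bool:
    """Assumed rule: every class except 'Good' counts as defective."""
    if label not in CLASSES:
        raise ValueError(f"unknown class: {label}")
    return label != "Good"

def ejection_delay_s(distance_to_nozzle_mm: float,
                     processing_latency_s: float = 0.0,
                     speed_mm_s: float = CONVEYOR_SPEED_MM_S) -> float:
    """Time to wait after classification before firing the air valve.

    distance_to_nozzle_mm and processing_latency_s are hypothetical
    parameters; the abstract does not report them.
    """
    travel_time = distance_to_nozzle_mm / speed_mm_s
    delay = travel_time - processing_latency_s
    if delay < 0:
        raise ValueError("grain passes the nozzle before inference finishes")
    return delay
```

For example, with a hypothetical 118 mm between camera and nozzle, the grain arrives 2 s after capture, so the vision pipeline must finish within that budget.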

[124] T2I-Copilot: A Training-Free Multi-Agent Text-to-Image System for Enhanced Prompt Interpretation and Interactive Generation

Chieh-Yun Chen,Min Shi,Gong Zhang,Humphrey Shi

Main category: cs.CV

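TL;DR: T2I-Copilot is a training-free multi-agent system in which (multimodal) large language models collaborate to automate prompt refinement, model selection, and iterative generation, significantly improving the quality and controllability of text-to-image generation.

Details Motivation: Existing text-to-image models are highly sensitive to prompt phrasing, so users must adjust prompts repeatedly without clear feedback. Existing remedies such as automatic prompt engineering or multi-turn generation often need additional training and generalize poorly.

Contribution: Proposes a training-free multi-agent system in which three agents (Input Interpreter, Generation Engine, and Quality Evaluator) collaborate on prompt refinement, model selection, and iterative generation, improving generation quality and controllability.

Method: The system has three agents: the Input Interpreter parses and standardizes the prompt, the Generation Engine selects a model and generates the image, and the Quality Evaluator assesses aesthetic quality and text-image alignment. It supports both fully autonomous and human-in-the-loop modes.

Result: On GenAI-Bench, T2I-Copilot performs comparably to commercial models, surpasses FLUX1.1-pro by 6.17% at only 16.59% of its cost, and clearly outperforms other open-source models.

Insight: Multi-agent collaboration can significantly improve the quality and controllability of text-to-image generation without additional training, offering a new approach to prompt engineering and interactive generation.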

Abstract: Text-to-Image (T2I) generative models have revolutionized content creation but remain highly sensitive to prompt phrasing, often requiring users to refine prompts repeatedly without clear feedback. While techniques such as automatic prompt engineering, controlled text embeddings, denoising, and multi-turn generation mitigate these issues, they offer limited controllability, or often necessitate additional training, restricting the generalization abilities. Thus, we introduce T2I-Copilot, a training-free multi-agent system that leverages collaboration between (Multimodal) Large Language Models to automate prompt phrasing, model selection, and iterative refinement. This approach significantly simplifies prompt engineering while enhancing generation quality and text-image alignment compared to direct generation. Specifically, T2I-Copilot consists of three agents: (1) Input Interpreter, which parses the input prompt, resolves ambiguities, and generates a standardized report; (2) Generation Engine, which selects the appropriate model from different types of T2I models and organizes visual and textual prompts to initiate generation; and (3) Quality Evaluator, which assesses aesthetic quality and text-image alignment, providing scores and feedback for potential regeneration. T2I-Copilot can operate fully autonomously while also supporting human-in-the-loop intervention for fine-grained control. On GenAI-Bench, using open-source generation models, T2I-Copilot achieves a VQA score comparable to commercial models RecraftV3 and Imagen 3, surpasses FLUX1.1-pro by 6.17% at only 16.59% of its cost, and outperforms FLUX.1-dev and SD 3.5 Large by 9.11% and 6.36%. Code will be released at: https://github.com/SHI-Labs/T2I-Copilot.
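The three-agent loop described above can be sketched as a small control flow: interpret once, then generate and evaluate until the score is good enough or a round limit is hit. This is a minimal sketch, not the released implementation. The names `run_copilot` and `Evaluation`, the score threshold, the round limit, and the way feedback is appended to the report are all assumptions, and the three agents are passed in as plain callables.

```python
from dataclasses import dataclass

@dataclass
class Evaluation:
    score: float   # higher is better; the scale is an assumption
    feedback: str  # text the next generation round can use

def run_copilot(prompt, interpret, generate, evaluate,
                threshold=0.8, max_rounds=3):
    """Interpret once, then loop generate -> evaluate until good enough."""
    report = interpret(prompt)               # Input Interpreter
    best_image, best_eval = None, None
    for _ in range(max_rounds):
        image = generate(report)             # Generation Engine
        result = evaluate(image, report)     # Quality Evaluator
        if best_eval is None or result.score > best_eval.score:
            best_image, best_eval = image, result
        if result.score >= threshold:
            break
        # Fold the evaluator's feedback into the report for regeneration.
        report = f"{report}\nFeedback: {result.feedback}"
    return best_image, best_eval
```

A human-in-the-loop mode would fit the same structure by letting a person edit `report` or override `result` between rounds.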

[125] FED-PsyAU: Privacy-Preserving Micro-Expression Recognition via Psychological AU Coordination and Dynamic Facial Motion Modeling

Jingting Li,Yu Qian,Lin Zhao,Su-Jing Wang

Main category: cs.CV

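TL;DR: The paper proposes the FED-PsyAU framework, which combines psychological priors with dynamic facial motion modeling and uses federated learning for privacy-preserving micro-expression recognition.

Details Motivation: Micro-expression recognition (MER) faces challenges such as small sample sizes and subtle features, and real-world applications raise privacy concerns. The paper aims to address these issues.

Contribution: 1. Provides psychological priors on the coordination of facial action units (AUs); 2. Proposes the DPK-GAT network, which combines these priors with statistical AU patterns; 3. Designs a federated learning framework that preserves privacy.

Method: 1. A psychological study analyzes AU coordination; 2. The DPK-GAT network learns facial motion features hierarchically; 3. A federated learning framework enables privacy-preserving MER across multiple clients.

Result: Experiments on commonly used ME databases validate the effectiveness of the approach.

Insight: Combining psychological priors with federated learning improves MER performance under privacy constraints and alleviates the small-sample problem.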

Abstract: Micro-expressions (MEs) are brief, low-intensity, often localized facial expressions. They could reveal genuine emotions individuals may attempt to conceal, valuable in contexts like criminal interrogation and psychological counseling. However, ME recognition (MER) faces challenges, such as small sample sizes and subtle features, which hinder efficient modeling. Additionally, real-world applications encounter ME data privacy issues, leaving the task of enhancing recognition across settings under privacy constraints largely unexplored. To address these issues, we propose a FED-PsyAU research framework. We begin with a psychological study on the coordination of upper and lower facial action units (AUs) to provide structured prior knowledge of facial muscle dynamics. We then develop a DPK-GAT network that combines these psychological priors with statistical AU patterns, enabling hierarchical learning of facial motion features from regional to global levels, effectively enhancing MER performance. Additionally, our federated learning framework advances MER capabilities across multiple clients without data sharing, preserving privacy and alleviating the limited-sample issue for each client. Extensive experiments on commonly-used ME databases demonstrate the effectiveness of our approach.
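The federated step, in which clients improve a shared model without exchanging data, can be illustrated with sample-weighted parameter averaging (FedAvg). The abstract does not name the aggregation rule, so FedAvg is an assumption here, and the flat parameter lists stand in for real network weights.

```python
def federated_average(client_weights, client_sizes):
    """Sample-weighted average of client parameter vectors (FedAvg).

    client_weights: list of equal-length lists of floats, one per client.
    client_sizes:   number of local training samples per client.
    Only parameters leave a client; raw micro-expression data stays local.
    """
    if len(client_weights) != len(client_sizes):
        raise ValueError("one size per client is required")
    total = sum(client_sizes)
    if total <= 0:
        raise ValueError("total sample count must be positive")
    n_params = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(n_params)
    ]

# Two hypothetical clients with 1 and 3 local samples.
global_weights = federated_average([[1.0, 2.0], [3.0, 4.0]], [1, 3])
```

Weighting by sample count lets clients with very few micro-expression clips still benefit from the shared model without dominating it.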

[126] Learning Phonetic Context-Dependent Viseme for Enhancing Speech-Driven 3D Facial Animation

Hyung Kyu Kim,Hak Gu Kim

Main category: cs.CV

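TL;DR: The paper proposes a speech-driven 3D facial animation method that introduces a phonetic context-aware loss. The loss explicitly models how phonetic context affects viseme transitions and addresses the unnatural facial motion produced by conventional frame-wise alignment.

Details Motivation: Conventional frame-wise alignment in speech-driven 3D facial animation fails to capture the continuity of facial motion, which leads to unnatural output. The influence of phonetic context on viseme transitions is not modeled explicitly.

Contribution: Proposes a phonetic context-aware loss that dynamically weights viseme transitions, explicitly modeling the effect of phonetic context on visemes and producing more natural facial animation.

Method: Introduces a viseme coarticulation weight that adapts the importance of facial movements to their dynamic changes over time, and uses the resulting loss in place of the conventional reconstruction loss.

Result: Experiments show that the method improves on conventional approaches in both quantitative metrics and visual quality.

Insight: Explicitly modeling phonetic context-dependent visemes is key to synthesizing natural speech-driven 3D facial animation.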

Abstract: Speech-driven 3D facial animation aims to generate realistic facial movements synchronized with audio. Traditional methods primarily minimize reconstruction loss by aligning each frame with ground-truth. However, this frame-wise approach often fails to capture the continuity of facial motion, leading to jittery and unnatural outputs due to coarticulation. To address this, we propose a novel phonetic context-aware loss, which explicitly models the influence of phonetic context on viseme transitions. By incorporating a viseme coarticulation weight, we assign adaptive importance to facial movements based on their dynamic changes over time, ensuring smoother and perceptually consistent animations. Extensive experiments demonstrate that replacing the conventional reconstruction loss with ours improves both quantitative metrics and visual quality. It highlights the importance of explicitly modeling phonetic context-dependent visemes in synthesizing natural speech-driven 3D facial animation. Project page: https://cau-irislab.github.io/interspeech25/
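One way to picture a loss that assigns adaptive importance to frames based on their dynamics is a per-frame weighted reconstruction error. The abstract does not give the paper's formula, so the weighting below (proportional to the frame-to-frame displacement of the ground truth, normalized to average 1) is purely an assumption for illustration.

```python
import math

def motion_weights(gt_frames, eps=1e-8):
    """Per-frame weights proportional to ground-truth motion magnitude.

    Assumed scheme: weight_t ~ ||gt[t] - gt[t-1]||, normalized so that the
    weights average to 1. The first frame reuses the second frame's motion.
    """
    motion = [0.0] * len(gt_frames)
    for t in range(1, len(gt_frames)):
        motion[t] = math.sqrt(sum((a - b) ** 2
                                  for a, b in zip(gt_frames[t], gt_frames[t - 1])))
    if len(gt_frames) > 1:
        motion[0] = motion[1]
    total = sum(m + eps for m in motion)
    return [len(motion) * (m + eps) / total for m in motion]

def weighted_reconstruction_loss(pred_frames, gt_frames):
    """Mean over frames of weight_t * MSE(pred[t], gt[t])."""
    weights = motion_weights(gt_frames)
    per_frame = [sum((p - g) ** 2 for p, g in zip(pf, gf)) / len(gf)
                 for pf, gf in zip(pred_frames, gt_frames)]
    return sum(w * e for w, e in zip(weights, per_frame)) / len(per_frame)
```

With these weights, errors on frames where the mouth is moving quickly cost more than errors on static frames, and a fully static sequence falls back to a plain mean squared error.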

[127] AV-Deepfake1M++: A Large-Scale Audio-Visual Deepfake Benchmark with Real-World Perturbations

Zhixi Cai,Kartik Kuckreja,Shreya Ghosh,Akanksha Chuchra,Muhammad Haris Khan,Usman Tariq,Tom Gedeon,Abhinav Dhall

Main category: cs.CV

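TL;DR: AV-Deepfake1M++ is an extended audio-visual deepfake dataset with 2 million video clips, covering diverse generation methods and perturbation strategies, intended to advance deepfake detection research.

Details Motivation: The rapid progress of text-to-speech and face-voice reenactment models makes video fabrication easy, so datasets with diverse generation methods and perturbation strategies are needed.

Contribution: Proposes the AV-Deepfake1M++ dataset, extended to 2 million video clips with diverse generation methods and perturbation strategies, and hosts the 2025 1M-Deepfakes Detection Challenge.

Method: Describes the data generation strategies and benchmarks the dataset with existing state-of-the-art methods.

Result: Benchmarking confirms the dataset's diversity, and the dataset supports further deepfake detection research.

Insight: Diverse generation methods and perturbation strategies are essential for robust deepfake detection models, and large-scale datasets can significantly advance the field.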

Abstract: The rapid surge of text-to-speech and face-voice reenactment models makes video fabrication easier and highly realistic. To counter this problem, we require datasets that are rich in generation methods and in the perturbation strategies that are common in online videos. To this end, we propose AV-Deepfake1M++, an extension of AV-Deepfake1M with 2 million video clips, diversified manipulation strategies, and audio-visual perturbations. This paper includes the description of data generation strategies along with benchmarking of AV-Deepfake1M++ using state-of-the-art methods. We believe that this dataset will play a pivotal role in facilitating research in the Deepfake domain. Based on this dataset, we host the 2025 1M-Deepfakes Detection Challenge. The challenge details, dataset and evaluation scripts are available online under a research-only license at https://deepfakes1m.github.io/2025.

[128] M-Net: MRI Brain Tumor Sequential Segmentation Network via Mesh-Cast

Jiacheng Lu,Hui Ding,Shiyu Zhang,Guoping Huo

Main category: cs.CV

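TL;DR: M-Net is a flexible framework for sequential MRI brain tumor segmentation. Its Mesh-Cast mechanism and two-phase training strategy exploit the spatial correlation between adjacent MRI slices and significantly improve segmentation accuracy.

Details Motivation: MRI tumor segmentation is critical in medical imaging, but existing models underuse the spatial correlation between adjacent slices, which limits the continuity and accuracy of segmentation. M-Net treats this spatial correlation as temporal-like data.

Contribution: 1. Proposes the Mesh-Cast mechanism, which integrates arbitrary sequential models to process channel and temporal information; 2. Defines an MRI sequential input pattern and a Two-Phase Sequential (TPS) training strategy to refine feature extraction; 3. Achieves strong results on BraTS2019 and BraTS2023.

Method: 1. Mesh-Cast integrates sequential models to capture the temporal-like spatial correlations between MRI slices; 2. TPS training has two phases: first learn patterns common across sequences, then refine slice-specific feature extraction; 3. The design avoids costly full 3D convolutions while preserving volumetric context.

Result: M-Net outperforms existing methods on both BraTS2019 and BraTS2023, confirming its robustness for temporally-aware MRI segmentation.

Insight: Modeling the spatial correlation between MRI slices as temporal-like data reduces the computational cost of 3D segmentation while improving accuracy, offering a new approach for medical image analysis.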

Abstract: MRI tumor segmentation remains a critical challenge in medical imaging, where volumetric analysis faces unique computational demands due to the complexity of 3D data. The spatially sequential arrangement of adjacent MRI slices provides valuable information that enhances segmentation continuity and accuracy, yet this characteristic remains underutilized in many existing models. The spatial correlations between adjacent MRI slices can be regarded as “temporal-like” data, similar to frame sequences in video segmentation tasks. To bridge this gap, we propose M-Net, a flexible framework specifically designed for sequential image segmentation. M-Net introduces the novel Mesh-Cast mechanism, which seamlessly integrates arbitrary sequential models into the processing of both channel and temporal information, thereby systematically capturing the inherent “temporal-like” spatial correlations between MRI slices. Additionally, we define an MRI sequential input pattern and design a Two-Phase Sequential (TPS) training strategy, which first focuses on learning common patterns across sequences before refining slice-specific feature extraction. This approach leverages temporal modeling techniques to preserve volumetric contextual information while avoiding the high computational cost of full 3D convolutions, thereby enhancing the generalizability and robustness of M-Net in sequential segmentation tasks. Experiments on the BraTS2019 and BraTS2023 datasets demonstrate that M-Net outperforms existing methods across all key metrics, establishing itself as a robust solution for temporally-aware MRI tumor segmentation.
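Treating adjacent slices as a temporal-like sequence implies grouping a volume's slices into ordered windows before they reach a sequential model. The abstract does not specify M-Net's input pattern, so the sliding-window grouping below, including the `seq_len` and `stride` parameters, is an assumed illustration rather than the paper's scheme.

```python
def slice_sequences(num_slices: int, seq_len: int, stride: int = 1):
    """Group slice indices of one MRI volume into ordered windows.

    Each window is a list of adjacent slice indices that a sequential
    model could consume in the same way as consecutive video frames.
    """
    if seq_len < 1 or stride < 1:
        raise ValueError("seq_len and stride must be positive")
    if seq_len > num_slices:
        raise ValueError("window is longer than the volume")
    return [list(range(start, start + seq_len))
            for start in range(0, num_slices - seq_len + 1, stride)]

# A hypothetical 5-slice volume split into overlapping 3-slice sequences.
windows = slice_sequences(5, 3)
```

Overlapping windows (stride smaller than `seq_len`) let every slice be predicted with context on both sides, at the cost of processing some slices more than once.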

[129] Enhanced Deep Learning DeepFake Detection Integrating Handcrafted Features

Alejandro Hinke-Navarro,Mario Nieto-Hidalgo,Juan M. Espin,Juan E. Tapia

Main category: cs.CV

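TL;DR: The paper proposes a method that combines handcrafted features with deep learning to detect deepfake and face swap manipulations.

Details Motivation: With the rapid advancement of deepfake technology, digital security concerns are growing, particularly in identity verification and onboarding processes. Conventional detection methods struggle with sophisticated facial manipulations.

Contribution: The main contribution is a hybrid framework that combines handcrafted frequency-domain features with RGB inputs to strengthen detection.

Method: The method integrates several handcrafted features (such as the Steganalysis Rich Model and the Discrete Cosine Transform) with conventional RGB inputs, capturing manipulation artifacts in both the frequency and spatial domains.

Result: The approach provides richer and more discriminative information, improving the classifier's detection performance.

Insight: Combining frequency-domain and spatial-domain features captures deepfake manipulation artifacts more effectively, which improves detection robustness.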

Abstract: The rapid advancement of deepfake and face swap technologies has raised significant concerns in digital security, particularly in identity verification and onboarding processes. Conventional detection methods often struggle to generalize against sophisticated facial manipulations. This study proposes an enhanced deep-learning detection framework that combines handcrafted frequency-domain features with conventional RGB inputs. This hybrid approach exploits frequency and spatial domain artifacts introduced during image manipulation, providing richer and more discriminative information to the classifier. Several frequency handcrafted features were evaluated, including the Steganalysis Rich Model, Discrete Cosine Transform, Error Level Analysis, Singular Value Decomposition, and Discrete Fourier Transform
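As a rough illustration of the hybrid input described above, the following sketch appends one handcrafted frequency-domain channel (the log-magnitude of a 2D Discrete Fourier Transform) to the RGB channels of an image. It is a minimal sketch under stated assumptions, not the authors' pipeline: the function names `dft2_log_magnitude`, `to_gray`, and `stack_rgb_with_frequency` are hypothetical, the DFT is a naive pure-Python implementation suitable only for tiny images, and how the paper actually fuses the features is not stated in the abstract.

```python
import cmath
import math


def to_gray(rgb):
    """Average the three channels of an h x w x 3 nested list (assumed layout)."""
    return [[sum(px) / 3.0 for px in row] for row in rgb]


def dft2_log_magnitude(gray):
    """Naive 2D DFT of a small grayscale image; returns log(1 + |F|) per bin."""
    h, w = len(gray), len(gray[0])
    out = [[0.0] * w for _ in range(h)]
    for u in range(h):
        for v in range(w):
            acc = 0j
            for y in range(h):
                for x in range(w):
                    angle = -2.0 * math.pi * (u * y / h + v * x / w)
                    acc += gray[y][x] * cmath.exp(1j * angle)
            out[u][v] = math.log1p(abs(acc))
    return out


def stack_rgb_with_frequency(rgb):
    """Return an h x w x 4 image: the RGB values plus one frequency channel."""
    freq = dft2_log_magnitude(to_gray(rgb))
    return [
        [list(px) + [freq[y][x]] for x, px in enumerate(row)]
        for y, row in enumerate(rgb)
    ]
```

A classifier would then consume the four-channel input instead of plain RGB; other handcrafted features listed in the abstract could be appended as further channels in the same way.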

[130] DAMS: Dual-Branch Adaptive Multiscale Spatiotemporal Framework for Video Anomaly Detection

Dezhi An,Wenqiang Liu,Kefan Wang,Zening Chen,Jun Lu,Shengcai Zhang

Main category: cs.CV

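TL;DR: The paper proposes DAMS, a dual-branch framework that combines multiscale spatio-temporal feature learning with cross-modal semantic alignment to address multiscale temporal dependencies and visual-semantic heterogeneity in video anomaly detection.

Details Motivation: Video anomaly detection is challenged by multiscale temporal dependencies, visual-semantic heterogeneity, and scarce labeled data, which calls for a method that models spatio-temporal features efficiently.

Contribution: Proposes the DAMS framework, which detects anomalous events through two paths (spatio-temporal feature learning and cross-modal semantic alignment). AMTPN and CBAM handle multiscale feature extraction and refinement, while a CLIP-driven parallel path provides semantic guidance.

Method: 1. The main path combines AMTPN (multilevel temporal feature reconstruction) with CBAM (attention-based refinement). 2. The parallel path uses CLIP for cross-modal semantic alignment and instance selection. 3. The two paths complement each other through information fusion.

Result: The framework achieves strong results on the UCF-Crime and XD-Violence benchmarks.

Insight: The complementarity of the two paths is the key: combining spatio-temporal features with semantic information improves anomaly detection performance.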

Abstract: The goal of video anomaly detection is tantamount to performing spatio-temporal localization of abnormal events in the video. The multiscale temporal dependencies, visual-semantic heterogeneity, and the scarcity of labeled data exhibited by video anomalies collectively present a challenging research problem in computer vision. This study offers a dual-path architecture called the Dual-Branch Adaptive Multiscale Spatiotemporal Framework (DAMS), which is based on multilevel feature decoupling and fusion, enabling efficient anomaly detection modeling by integrating hierarchical feature learning and complementary information. The main processing path of this framework integrates the Adaptive Multiscale Time Pyramid Network (AMTPN) with the Convolutional Block Attention Mechanism (CBAM). AMTPN enables multigrained representation and dynamically weighted reconstruction of temporal features through a three-level cascade structure (time pyramid pooling, adaptive feature fusion, and temporal context enhancement). CBAM maximizes the entropy distribution of feature channels and spatial dimensions through dual attention mapping. Simultaneously, the parallel path driven by CLIP introduces a contrastive language-visual pre-training paradigm. Cross-modal semantic alignment and a multiscale instance selection mechanism provide high-order semantic guidance for spatio-temporal features. This creates a complete inference chain from the underlying spatio-temporal features to high-level semantic concepts. The orthogonal complementarity of the two paths and the information fusion mechanism jointly construct a comprehensive representation and identification capability for anomalous events. Extensive experimental results on the UCF-Crime and XD-Violence benchmarks establish the effectiveness of the DAMS framework.
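The first stage of AMTPN, time pyramid pooling, can be pictured with a small sketch: per-frame feature vectors are average-pooled over progressively finer temporal segments, giving a multigrained representation. This is only an illustration under assumptions. The function name `temporal_pyramid_pool`, the choice of average pooling, and the levels `(1, 2, 4)` are hypothetical, and the adaptive fusion and context enhancement stages are not modeled.

```python
def temporal_pyramid_pool(frames, levels=(1, 2, 4)):
    """Average-pool a sequence of per-frame feature vectors at several scales.

    frames: list of T feature vectors (lists of floats, all the same length).
    levels: number of temporal segments per pyramid level (assumed values).
    Returns one pooled vector per segment, coarsest level first.
    """
    t = len(frames)
    dim = len(frames[0])
    pooled = []
    for n_seg in levels:
        for s in range(n_seg):
            start = (s * t) // n_seg
            # Guarantee a non-empty segment even when T < n_seg.
            end = max(start + 1, ((s + 1) * t) // n_seg)
            seg = frames[start:end]
            pooled.append([sum(f[d] for f in seg) / len(seg) for d in range(dim)])
    return pooled
```

With four frames and levels `(1, 2, 4)` the output holds seven vectors: one clip-level summary, two half-clip summaries, and the four original frames.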

[131] TransPrune: Token Transition Pruning for Efficient Large Vision-Language Model

Ao Li,Yuxiang Duan,Jinghui Zhang,Congbo Ma,Yutong Xie,Gustavo Carneiro,Mohammad Yaqub,Hu Wang

Main category: cs.CV

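TL;DR: The paper proposes TransPrune, a training-free and efficient token pruning method for large vision-language models (LVLMs) that scores token importance using transitions in token representations together with instruction-guided attention, substantially reducing computational cost.

Details Motivation: LVLMs are computationally expensive because of their large number of visual tokens, and existing attention-based criteria suffer from limitations such as positional bias, so a new perspective on token importance is needed.

Contribution: Proposes TransPrune, built on Token Transition Variation (TTV) and Instruction-Guided Attention (IGA), which prunes efficiently without additional training, stays close to the original model's performance, and cuts computation by more than half.

Method: Token Transition Variation (TTV) measures changes in the magnitude and direction of token representations; combined with Instruction-Guided Attention (IGA), it scores token importance for progressive pruning.

Result: Across eight benchmarks, TransPrune performs comparably to the original LVLMs (e.g., LLaVA-v1.5 and LLaVA-Next) while reducing inference TFLOPs by more than half. Used alone, TTV performs comparably to attention-based methods.

Insight: Transitions in token representations (TTV) are an effective token importance signal that does not depend on attention; Instruction-Guided Attention (IGA) further improves pruning quality.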

Abstract: Large Vision-Language Models (LVLMs) have advanced multimodal learning but face high computational costs due to the large number of visual tokens, motivating token pruning to improve inference efficiency. The key challenge lies in identifying which tokens are truly important. Most existing approaches rely on attention-based criteria to estimate token importance. However, they inherently suffer from certain limitations, such as positional bias. In this work, we explore a new perspective on token importance based on token transitions in LVLMs. We observe that the transition of token representations provides a meaningful signal of semantic information. Based on this insight, we propose TransPrune, a training-free and efficient token pruning method. Specifically, TransPrune progressively prunes tokens by assessing their importance through a combination of Token Transition Variation (TTV), which measures changes in both the magnitude and direction of token representations, and Instruction-Guided Attention (IGA), which measures how strongly the instruction attends to image tokens via attention. Extensive experiments demonstrate that TransPrune achieves comparable multimodal performance to original LVLMs, such as LLaVA-v1.5 and LLaVA-Next, across eight benchmarks, while reducing inference TFLOPs by more than half. Moreover, TTV alone can serve as an effective criterion without relying on attention, achieving performance comparable to attention-based methods. The code will be made publicly available upon acceptance of the paper at https://github.com/liaolea/TransPrune.
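To make the idea of a transition-based importance score concrete, here is a minimal sketch that scores each token by how much its representation changes in magnitude and direction between two points in the network, then keeps the highest-scoring tokens. The exact TTV formula is not given in the abstract, so everything specific here is an assumption: the score is taken as the absolute change in L2 norm plus one minus the cosine similarity, tokens with larger transitions are assumed to be more important, the IGA term is omitted, and the names `transition_score` and `prune_tokens` are hypothetical.

```python
import math


def norm(v):
    """L2 norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in v))


def transition_score(h_in, h_out, eps=1e-12):
    """Assumed score: magnitude change plus direction change of one token."""
    magnitude_change = abs(norm(h_out) - norm(h_in))
    dot = sum(a * b for a, b in zip(h_in, h_out))
    cosine = dot / (norm(h_in) * norm(h_out) + eps)
    direction_change = 1.0 - cosine
    return magnitude_change + direction_change


def prune_tokens(before, after, keep):
    """Return the indices (in original order) of the `keep` top-scoring tokens."""
    scores = [transition_score(a, b) for a, b in zip(before, after)]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])
```

Because the score needs only hidden states, a criterion of this kind does not require attention maps, which is consistent with the abstract's remark that TTV can work without relying on attention.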

[132] A Multimodal Architecture for Endpoint Position Prediction in Team-based Multiplayer Games

Jonas Peche,Aliaksei Tsishurou,Alexander Zap,Guenter Wallner

Main category: cs.CV

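TL;DR: The paper proposes a multimodal architecture for predicting players' future positions in team-based multiplayer games, combining a U-Net with multi-head attention to handle heterogeneous input data.

Details Motivation: Predicting player movement in multiplayer games is crucial for tasks such as player-mimicking bot navigation and real-time behavior analytics, but complex environments and team interactions require models that make effective use of heterogeneous data.

Contribution: Proposes a multimodal architecture based on a U-Net and multi-head attention that makes efficient use of image, numerical, and dynamic game data to predict future player positions.

Method: A U-Net generates location probability heatmaps, conditioned on a multimodal feature encoder; multi-head attention is applied across different groups of features.

Result: The architecture lays the foundation for downstream tasks that rely on future position prediction, such as bot behavior generation or anomaly detection.

Insight: Combining multimodal features with attention lets the model better capture player interactions and navigation behavior in complex environments.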

Abstract: Understanding and predicting player movement in multiplayer games is crucial for achieving use cases such as player-mimicking bot navigation, preemptive bot control, strategy recommendation, and real-time player behavior analytics. However, the complex environments allow for a high degree of navigational freedom, and the interactions and team-play between players require models that make effective use of the available heterogeneous input data. This paper presents a multimodal architecture for predicting future player locations on a dynamic time horizon, using a U-Net-based approach for calculating endpoint location probability heatmaps, conditioned using a multimodal feature encoder. The application of a multi-head attention mechanism for different groups of features allows for communication between agents. In doing so, the architecture makes efficient use of the multimodal game state including image inputs, numerical and categorical features, as well as dynamic game data. Consequently, the presented technique lays the foundation for various downstream tasks that rely on future player positions such as the creation of player-predictive bot behavior or player anomaly detection.
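The notion of an endpoint location probability heatmap can be sketched as follows: raw scores over a 2D grid are normalized with a softmax so that they sum to one, and the most likely cell is read off as the predicted endpoint. This is an assumption-laden illustration only. The U-Net and feature encoder that would produce the scores are not modeled, and the names `softmax_heatmap` and `most_likely_endpoint` are hypothetical.

```python
import math


def softmax_heatmap(logits):
    """Turn a 2D grid of raw scores into a probability heatmap summing to 1."""
    peak = max(v for row in logits for v in row)
    # Subtract the maximum before exponentiating for numerical stability.
    exps = [[math.exp(v - peak) for v in row] for row in logits]
    total = sum(sum(row) for row in exps)
    return [[e / total for e in row] for row in exps]


def most_likely_endpoint(heatmap):
    """Return the (row, column) of the highest-probability cell."""
    best = max(
        (p, (r, c)) for r, row in enumerate(heatmap) for c, p in enumerate(row)
    )
    return best[1]
```

Keeping the full heatmap rather than only the most likely cell preserves uncertainty, which matters when several destinations are plausible.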

[133] Automatic camera orientation estimation for a partially calibrated camera above a plane with a line at known planar distance

Gergely Dinya,Anna Gelencsér-Horváth

Main category: cs.CV

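TL;DR: The paper presents a method for estimating the roll and pitch angles of a partially calibrated camera from a single straight line at a known planar distance.

Details Motivation: Full calibration can be impractical in multi-camera systems, so a lightweight alternative is needed.

Contribution: Proposes a method that, given only the camera intrinsics and a known height above the plane, estimates camera orientation by detecting one straight line at a known planar distance.

Method: Uses inverse projection geometry and geometric constraints to estimate the camera's roll and pitch angles from a detected reference line (such as the edge between a floor and a wall).

Result: The method suits partially calibrated scenarios, particularly multi-camera systems and constrained environments.

Insight: Simple geometric constraints combined with known scene information allow fast estimation of camera orientation, reducing the dependence on full calibration.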

Abstract: We present a derivation for estimating the roll and pitch orientation of a partially calibrated camera mounted above a planar surface, using minimal scene information. Specifically, we assume known intrinsic parameters and a fixed height between the camera and the observed plane. By detecting a single straight reference line at a known planar distance – such as the edge between a floor and a wall – we estimate the roll and pitch angles via inverse projection geometry. The method leverages geometric constraints and the camera model, including lens distortion correction. This approach is suitable for scenarios where full calibration is impractical and offers a lightweight alternative for multi-camera systems operating in constrained environments.

[134] Style-Aware Blending and Prototype-Based Cross-Contrast Consistency for Semi-Supervised Medical Image Segmentation

Chaowei Chen,Xiang Zhang,Honglie Guo,Shunfang Wang

Main category: cs.CV

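TL;DR: The paper proposes a semi-supervised medical image segmentation framework that combines style-aware blending with prototype-based cross-contrast consistency, addressing the separated data streams and underused supervisory information of existing methods.

Details Motivation: Existing semi-supervised medical image segmentation methods focus mainly on designing perturbation schemes and overlook the potential and limitations of the framework itself, in particular separated data streams and incomplete use of supervisory information.

Contribution: 1. Proposes a style-aware blending module that breaks the separation between data streams; 2. Introduces a prototype-based contrast strategy that strengthens consistency between weak-to-strong and strong-to-weak predictions.

Method: A style-guided distribution blending module addresses the distribution mismatch between labeled and unlabeled data, and a prototype-based contrast strategy mines supervisory signals while reducing the influence of noise.

Result: Experiments show that the framework outperforms existing methods under a variety of semi-supervised settings.

Insight: Style statistics and prototype-based contrast are key to improving semi-supervised medical image segmentation.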

Abstract: Weak-strong consistency learning strategies are widely employed in semi-supervised medical image segmentation to train models by leveraging limited labeled data and enforcing weak-to-strong consistency. However, existing methods primarily focus on designing and combining various perturbation schemes, overlooking the inherent potential and limitations within the framework itself. In this paper, we first identify two critical deficiencies: (1) separated training data streams, which lead to confirmation bias dominated by the labeled stream; and (2) incomplete utilization of supervisory information, which limits exploration of strong-to-weak consistency. To tackle these challenges, we propose a style-aware blending and prototype-based cross-contrast consistency learning framework. Specifically, inspired by the empirical observation that the distribution mismatch between labeled and unlabeled data can be characterized by statistical moments, we design a style-guided distribution blending module to break the independent training data streams. Meanwhile, considering the potential noise in strong pseudo-labels, we introduce a prototype-based cross-contrast strategy to encourage the model to learn informative supervisory signals from both weak-to-strong and strong-to-weak predictions, while mitigating the adverse effects of noise. Experimental results demonstrate the effectiveness and superiority of our framework across multiple medical segmentation benchmarks under various semi-supervised settings.
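The observation that distribution mismatch can be characterized by statistical moments suggests a simple picture of style blending: normalize one feature set by its own mean and standard deviation, then rescale it with a mixture of its statistics and those of another set. The sketch below illustrates that idea in the spirit of moment-mixing techniques. It is not the paper's module: the function names `moments` and `blend_style`, the 1D feature layout, and the mixing weight `lam` are all assumptions.

```python
import statistics


def moments(values):
    """Mean and population standard deviation of a list of feature values."""
    return statistics.fmean(values), statistics.pstdev(values)


def blend_style(content, style, lam=0.5, eps=1e-6):
    """Re-style `content` with moments mixed between `content` and `style`.

    lam = 1.0 keeps the content statistics; lam = 0.0 adopts the style statistics.
    """
    mu_c, sd_c = moments(content)
    mu_s, sd_s = moments(style)
    mu_mix = lam * mu_c + (1.0 - lam) * mu_s
    sd_mix = lam * sd_c + (1.0 - lam) * sd_s
    return [(v - mu_c) / (sd_c + eps) * sd_mix + mu_mix for v in content]
```

Applying such a blend between labeled and unlabeled features is one way the two streams could share statistics instead of being trained independently.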

[135] Multi-Masked Querying Network for Robust Emotion Recognition from Incomplete Multi-Modal Physiological Signals

Geng-Xin Xu,Xiang Zuo,Ye Li

Main category: cs.CV

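TL;DR: The paper proposes MMQ-Net, a new method for recognizing emotions from incomplete multi-modal physiological signals, which uses multiple querying mechanisms to handle missing data and noise interference.

Details Motivation: Emotion recognition from physiological signals is crucial for mental health assessment, but it is challenged by incomplete signals and interference from motion artifacts.

Contribution: Proposes MMQ-Net, which integrates modality queries, category queries, and interference queries to handle incomplete signals and separate out noise.

Method: A multi-masked querying mechanism uses modality queries to reconstruct missing data, category queries to focus on emotional state features, and interference queries to remove noise.

Result: Experiments show that MMQ-Net outperforms existing methods when data are incomplete.

Insight: Integrating multiple querying mechanisms substantially improves emotion recognition under incomplete and noisy conditions.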

Abstract: Emotion recognition from physiological data is crucial for mental health assessment, yet it faces two significant challenges: incomplete multi-modal signals and interference from body movements and artifacts. This paper presents a novel Multi-Masked Querying Network (MMQ-Net) to address these issues by integrating multiple querying mechanisms into a unified framework. Specifically, it uses modality queries to reconstruct missing data from incomplete signals, category queries to focus on emotional state features, and interference queries to separate relevant information from noise. Extensive experimental results demonstrate the superior emotion recognition performance of MMQ-Net compared to existing approaches, particularly under high levels of data incompleteness.

[136] Implicit Counterfactual Learning for Audio-Visual Segmentation

Mingfeng Zha,Tianyu Li,Guoqing Wang,Peng Wang,Yangyang Wu,Yang Yang,Heng Tao Shen

Main category: cs.CV

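TL;DR: The paper proposes an implicit counterfactual framework (ICF) and multi-granularity implicit text (MIT) to address modality representation discrepancies and knowledge preference in audio-visual segmentation (AVS). Semantic counterfactual (SC) learning and collaborative distribution-aware contrastive learning (CDCL) are used to achieve unbiased and efficient cross-modal understanding.

Details Motivation: Existing audio-visual segmentation methods focus mainly on interaction efficiency and neglect modality representation discrepancies and knowledge preference. The paper aims to resolve these issues and achieve more balanced cross-modal understanding.

Contribution: 1. Proposes the implicit counterfactual framework (ICF) and multi-granularity implicit text (MIT) to reduce modality gaps; 2. Introduces semantic counterfactual (SC) learning to mitigate knowledge preference; 3. Designs collaborative distribution-aware contrastive learning (CDCL) to improve representation alignment.

Method: 1. MIT is used to build a modality-shared space; 2. SC learns orthogonal representations; 3. CDCL applies contrastive learning to align factual-counterfactual and cross-modal representations.

Result: Experiments on three public datasets show that the method achieves state-of-the-art performance.

Insight: Implicit counterfactual learning and multi-granularity text guidance effectively reduce modality gaps and knowledge preference, improving performance on cross-modal tasks.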

Abstract: Audio-visual segmentation (AVS) aims to segment objects in videos based on audio cues. Existing AVS methods are primarily designed to enhance interaction efficiency but pay limited attention to modality representation discrepancies and imbalances. To overcome this, we propose the implicit counterfactual framework (ICF) to achieve unbiased cross-modal understanding. Due to the lack of semantics, heterogeneous representations may lead to erroneous matches, especially in complex scenes with ambiguous visual content or interference from multiple audio sources. We introduce the multi-granularity implicit text (MIT) involving video-, segment- and frame-level as the bridge to establish the modality-shared space, reducing modality gaps and providing prior guidance. Visual content carries more information and typically dominates, thereby marginalizing audio features in the decision-making. To mitigate knowledge preference, we propose the semantic counterfactual (SC) to learn orthogonal representations in the latent space, generating diverse counterfactual samples, thus avoiding biases introduced by complex functional designs and explicit modifications of text structures or attributes. We further formulate the collaborative distribution-aware contrastive learning (CDCL), incorporating factual-counterfactual and inter-modality contrasts to align representations, promoting cohesion and decoupling. Extensive experiments on three public datasets validate that the proposed method achieves state-of-the-art performance.

[137] Regularizing Subspace Redundancy of Low-Rank Adaptation

Yue Zhu,Haiwen Diao,Shang Gao,Jiazuo Yu,Jiawen Zhu,Yunzhi Zhuge,Shuai Hao,Xu Jia,Lu Zhang,Ying Zhang,Huchuan Lu

Main category: cs.CV

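TL;DR: The paper proposes ReSoRA, a method that explicitly models redundancy between projection subspaces and adaptively regularizes the subspace redundancy of low-rank adaptation, improving parameter-efficient transfer learning.

Details Motivation: In existing low-rank adaptation methods (such as LoRA), the projection matrices are unrestricted during training, which leads to highly redundant feature subspaces and weakens transfer learning. Existing remedies lack flexibility and generalize poorly across datasets and architectures.

Contribution: Proposes ReSoRA, which explicitly models subspace redundancy and adaptively constrains the redundancy of feature distributions, significantly improving parameter-efficient transfer learning.

Method: The low-rank submatrices are theoretically decomposed into multiple equivalent subspaces, and de-redundancy constraints are applied systematically across different projections to improve the feature distributions.

Result: Experiments show that ReSoRA significantly improves existing methods across a range of backbones and datasets, with no additional inference cost.

Insight: Explicitly modeling and constraining subspace redundancy is key to improving parameter-efficient transfer learning, and ReSoRA offers a flexible plug-and-play addition to existing methods.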

Abstract: Low-Rank Adaptation (LoRA) and its variants have delivered strong capability in Parameter-Efficient Transfer Learning (PETL) by minimizing trainable parameters and benefiting from reparameterization. However, their projection matrices remain unrestricted during training, causing high representation redundancy and diminishing the effectiveness of feature adaptation in the resulting subspaces. While existing methods mitigate this by manually adjusting the rank or implicitly applying channel-wise masks, they lack flexibility and generalize poorly across various datasets and architectures. Hence, we propose ReSoRA, a method that explicitly models redundancy between mapping subspaces and adaptively Regularizes Subspace redundancy of Low-Rank Adaptation. Specifically, it theoretically decomposes the low-rank submatrices into multiple equivalent subspaces and systematically applies de-redundancy constraints to the feature distributions across different projections. Extensive experiments validate that our proposed method consistently facilitates existing state-of-the-art PETL methods across various backbones and datasets in vision-language retrieval and standard visual classification benchmarks. Besides, as a training supervision, ReSoRA can be seamlessly integrated into existing approaches in a plug-and-play manner, with no additional inference costs. Code is publicly available at: https://github.com/Lucenova/ReSoRA.

[138] Learning to See Inside Opaque Liquid Containers using Speckle Vibrometry

Matan Kichler,Shai Bagon,Mark Sheinin

Main category: cs.CV

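TL;DR: The paper proposes a new method for inferring the liquid level inside opaque containers by sensing tiny vibrations on their surfaces. It combines speckle vibrometry with a transformer-based model to enable, for the first time, remote and contact-free inspection of liquid levels in sealed containers.

Details Motivation: Conventional computer vision systems can extract information only from the visible surfaces of objects and cannot determine the state of liquid inside a closed container. The paper aims to address this and extend computer vision to inferring hidden information.

Contribution: 1. Proposes a novel vibration sensing system based on speckle vibrometry; 2. Develops a transformer-based model that analyzes the vibration data and classifies container type and liquid level; 3. Demonstrates the method's effectiveness, including generalization to unseen container instances and liquid levels.

Method: 1. A speckle-based vibration sensing system captures scene vibrations simultaneously on a 2D grid of points; 2. A dataset of vibration responses is collected for a variety of everyday containers; 3. A transformer-based architecture analyzes the vibration data, and the model is invariant to the vibration source.

Result: Experiments show that the method accurately classifies container type and liquid level, and generalizes to unseen container instances and different liquid levels.

Insight: Vibration sensing lets computer vision go beyond the limits of conventional image information and inspect liquid levels in closed containers without contact, opening new possibilities for industrial inspection and everyday applications.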

Abstract: Computer vision seeks to infer a wide range of information about objects and events. However, vision systems based on conventional imaging are limited to extracting information only from the visible surfaces of scene objects. For instance, a vision system can detect and identify a Coke can in the scene, but it cannot determine whether the can is full or empty. In this paper, we aim to expand the scope of computer vision to include the novel task of inferring the hidden liquid levels of opaque containers by sensing the tiny vibrations on their surfaces. Our method provides a first-of-a-kind way to inspect the fill level of multiple sealed containers remotely, at once, without needing physical manipulation and manual weighing. First, we propose a novel speckle-based vibration sensing system for simultaneously capturing scene vibrations on a 2D grid of points. We use our system to efficiently and remotely capture a dataset of vibration responses for a variety of everyday liquid containers. Then, we develop a transformer-based approach for analyzing the captured vibrations and classifying the container type and its hidden liquid level at the time of measurement. Our architecture is invariant to the vibration source, yielding correct liquid level estimates for controlled and ambient scene sound sources. Moreover, our model generalizes to unseen container instances within known classes (e.g., training on five Coke cans of a six-pack, testing on a sixth) and fluid levels. We demonstrate our method by recovering liquid levels from various everyday containers.

[139] KASportsFormer: Kinematic Anatomy Enhanced Transformer for 3D Human Pose Estimation on Short Sports Scene Video

Zhuoer Yin,Calvin Yeung,Tomohiro Suzuki,Ryota Tanaka,Keisuke Fujii

Main category: cs.CV

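TL;DR: KASportsFormer is a transformer-based 3D human pose estimation framework designed for short sports scenes, which improves motion understanding through a Bone Extractor and a Limb Fuser module.

Details Motivation: Current transformer methods perform poorly in complex motion scenarios such as sports, mainly because of motion blur, occlusions, and domain shifts. They also struggle to capture momentary actions.

Contribution: Proposes KASportsFormer, which combines a kinematic anatomy-informed feature representation with an integration module to improve pose estimation in short sports videos.

Method: The BoneExt and LimbFus modules extract and fuse skeletal motion information, and the features are encoded in a multimodal manner.

Result: The method achieves MPJPE of 58.0mm on SportsPose and 34.3mm on WorldPose, reaching state-of-the-art results.

Insight: Introducing kinematic anatomy features can significantly improve 3D pose estimation in complex sports scenes.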

Abstract: Recent transformer-based approaches have demonstrated impressive performance in solving real-world 3D human pose estimation problems. Although these approaches achieve strong results on benchmark datasets, they tend to fall short in sports scenarios, where human movements are more complicated than daily-life actions and estimation is hindered by motion blur, occlusions, and domain shifts. Moreover, because critical motions in a sports game often finish within moments (e.g., shooting), the ability to focus on momentary actions is becoming a crucial factor in sports analysis, yet current methods appear to struggle with such instantaneous scenarios. To overcome these limitations, we introduce KASportsFormer, a novel transformer-based 3D pose estimation framework for sports that incorporates a kinematic anatomy-informed feature representation and integration module, in which the inherent kinematic motion information is extracted with the Bone Extractor (BoneExt) and Limb Fuser (LimbFus) modules and encoded in a multimodal manner. This improves the capability of comprehending sports poses in short videos. We evaluate our method on two representative sports scene datasets: SportsPose and WorldPose. Experimental results show that our proposed method achieves state-of-the-art results with MPJPE errors of 58.0mm and 34.3mm, respectively. Our code and models are available at: https://github.com/jw0r1n/KASportsFormer
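Two quantities in this entry are easy to make concrete: bone vectors derived from joint positions, and the MPJPE metric (mean per-joint position error, the average Euclidean distance between predicted and ground-truth joints). The sketch below shows both in plain Python. It is illustrative only: the `parents` skeleton encoding and the function names `bone_vectors` and `mpjpe` are assumptions, and the actual BoneExt module presumably does more than take child-minus-parent differences.

```python
import math


def bone_vectors(joints, parents):
    """Child-minus-parent vectors for every non-root joint.

    joints: list of (x, y, z) positions.
    parents: parents[i] is the parent index of joint i, or -1 for the root
             (a hypothetical skeleton encoding used for this sketch).
    """
    return [
        tuple(c - p for c, p in zip(joints[i], joints[parents[i]]))
        for i in range(len(joints))
        if parents[i] >= 0
    ]


def mpjpe(pred, gt):
    """Mean per-joint position error: average Euclidean distance per joint."""
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(gt)
```

If the joint coordinates are in millimetres, `mpjpe` returns millimetres, the unit in which the reported 58.0mm and 34.3mm figures are expressed.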

[140] ATR-UMMIM: A Benchmark Dataset for UAV-Based Multimodal Image Registration under Complex Imaging Conditions

Kangcheng Bin,Chen Chen,Ting Hu,Jiahao Qi,Ping Zhong

Main category: cs.CV

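TL;DR: The paper presents ATR-UMMIM, the first benchmark dataset for UAV-based multimodal image registration, filling a data gap in the field and supporting the development and evaluation of high-quality registration algorithms.

Details Motivation: There is currently no public benchmark dataset for UAV-based multimodal image registration, which limits the development and evaluation of registration algorithms under real-world conditions.

Contribution: Presents ATR-UMMIM, the first benchmark dataset for UAV-based multimodal image registration, containing 7,969 triplets of visible, infrared, and registered visible images, with pixel-level annotations and rich scene attributes.

Method: A semi-automated annotation pipeline produces pixel-level registration ground truth, covering a range of flight altitudes, camera angles, and temporal conditions, and imaging condition attributes are annotated for each triplet.

Result: The dataset supports the evaluation of registration algorithms and provides object detection annotations for 11 categories (77,753 visible and 78,409 infrared bounding boxes) to support downstream tasks.

Insight: ATR-UMMIM provides a standardized evaluation platform for UAV-based multimodal registration and perception, and highlights how strongly real-world imaging conditions affect algorithm robustness.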

Abstract: Multimodal fusion has become a key enabler for UAV-based object detection, as each modality provides complementary cues for robust feature extraction. However, due to significant differences in resolution, field of view, and sensing characteristics across modalities, accurate registration is a prerequisite before fusion. Despite its importance, there is currently no publicly available benchmark specifically designed for multimodal registration in UAV-based aerial scenarios, which severely limits the development and evaluation of advanced registration methods under real-world conditions. To bridge this gap, we present ATR-UMMIM, the first benchmark dataset specifically tailored for multimodal image registration in UAV-based applications. This dataset includes 7,969 triplets of raw visible, infrared, and precisely registered visible images captured across diverse scenarios, covering flight altitudes from 80m to 300m, camera angles from 0° to 75°, and all-day, all-year temporal variations under rich weather and illumination conditions. To ensure high registration quality, we design a semi-automated annotation pipeline to introduce reliable pixel-level ground truth to each triplet. In addition, each triplet is annotated with six imaging condition attributes, enabling benchmarking of registration robustness under real-world deployment settings. To further support downstream tasks, we provide object-level annotations on all registered images, covering 11 object categories with 77,753 visible and 78,409 infrared bounding boxes. We believe ATR-UMMIM will serve as a foundational benchmark for advancing multimodal registration, fusion, and perception in real-world UAV scenarios. The dataset can be downloaded from https://github.com/supercpy/ATR-UMMIM

[141] Learning Only with Images: Visual Reinforcement Learning with Reasoning, Rendering, and Visual Feedback

Yang Chen,Yufan Shen,Wenxuan Huang,Shen Zhou,Qunshu Lin,Xinyu Cai,Zhi Yu,Botian Shi,Yu Qiao

Main category: cs.CV

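TL;DR: The paper proposes RRVF, a new framework that trains MLLMs to perform complex visual reasoning from raw images alone, reducing the reliance on image-text supervision.

Details Motivation: Current gains in MLLM visual reasoning depend on large amounts of image-text supervision, which limits further progress.

Contribution: Proposes the RRVF framework, which builds on the "Asymmetry of Verification" principle and uses reinforcement learning to learn visual reasoning from raw images only.

Method: The framework comprises reasoning, rendering, and visual feedback components, is optimized end to end with the GRPO algorithm, and supports multi-turn self-correction and tool invocation.

Result: On image-to-code generation tasks, RRVF substantially outperforms existing open-source MLLMs and supervised fine-tuning baselines.

Insight: Systems driven by purely visual feedback offer a viable path toward more robust and generalizable reasoning models without explicit supervision.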

Abstract: Multimodal Large Language Models (MLLMs) have exhibited impressive performance across various visual tasks. Subsequent investigations into enhancing their visual reasoning abilities have significantly expanded their performance envelope. However, a critical bottleneck in the advancement of MLLMs toward deep visual reasoning is their heavy reliance on curated image-text supervision. To solve this problem, we introduce a novel framework termed "Reasoning-Rendering-Visual-Feedback" (RRVF), which enables MLLMs to learn complex visual reasoning from only raw images. This framework builds on the "Asymmetry of Verification" principle to train MLLMs, i.e., verifying the rendered output against a source image is easier than generating it. We demonstrate that this relative ease provides an ideal reward signal for optimization via Reinforcement Learning (RL) training, reducing the reliance on the image-text supervision. Guided by the above principle, RRVF implements a closed-loop iterative process encompassing reasoning, rendering, and visual feedback components, enabling the model to perform self-correction through multi-turn interactions and tool invocation, while this pipeline can be optimized by the GRPO algorithm in an end-to-end manner. Extensive experiments on image-to-code generation for data charts and web interfaces show that RRVF substantially outperforms existing open-source MLLMs and surpasses supervised fine-tuning baselines. Our findings demonstrate that systems driven by purely visual feedback present a viable path toward more robust and generalizable reasoning models without requiring explicit supervision. Code will be available at https://github.com/L-O-I/RRVF.
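The reward loop described here can be sketched in two small pieces: a visual reward that compares a rendered output with the source image, and the group-relative advantage commonly used in GRPO, where each sampled response's reward is normalized by the mean and standard deviation of its group. Both pieces are assumptions made for illustration. The abstract does not define the reward, so `visual_reward` uses a simple pixel difference; the use of the population standard deviation is one common choice, and the function names are hypothetical.

```python
import statistics


def visual_reward(rendered, target):
    """Assumed reward: 1 minus mean absolute pixel difference, values in [0, 1]."""
    diffs = [
        abs(a - b)
        for row_r, row_t in zip(rendered, target)
        for a, b in zip(row_r, row_t)
    ]
    return 1.0 - sum(diffs) / len(diffs)


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward by the mean and std of its sampling group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

Responses whose renderings match the source better than the group average receive positive advantages, so no image-text labels are needed to rank them.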

[142] RingMo-Agent: A Unified Remote Sensing Foundation Model for Multi-Platform and Multi-Modal Reasoning

Huiyang Hu,Peijin Wang,Yingchao Feng,Kaiwen Wei,Wenxin Yin,Wenhui Diao,Mengyu Wang,Hanbo Bi,Kaiyue Kang,Tong Ling,Kun Fu,Xian Sun

Main category: cs.CV

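TL;DR: RingMo-Agent is a unified remote sensing foundation model that handles multi-platform and multi-modal data and performs perception and reasoning tasks based on user text instructions.

Details Motivation: Existing remote sensing vision-language methods rely on homogeneous data sources and are limited to tasks such as classification or captioning, so they cannot handle diverse remote sensing data effectively.

Contribution: 1) Builds RS-VL3M, a large-scale remote sensing vision-language dataset; 2) Proposes a modality-adaptive representation learning approach; 3) Unifies task modeling through task-specific tokens.

Method: Separate embedding layers build isolated features for heterogeneous modalities, and a token-based mechanism decodes high-dimensional hidden states.

Result: The model performs well on a variety of remote sensing vision-language tasks and generalizes strongly across platforms and modalities.

Insight: Learning isolated features for heterogeneous modalities and unifying task modeling are key to improving remote sensing model performance.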

Abstract: Remote sensing (RS) images from multiple modalities and platforms exhibit diverse details due to differences in sensor characteristics and imaging perspectives. Existing vision-language research in RS largely relies on relatively homogeneous data sources. Moreover, they still remain limited to conventional visual perception tasks such as classification or captioning. As a result, these methods fail to serve as a unified and standalone framework capable of effectively handling RS imagery from diverse sources in real-world applications. To address these issues, we propose RingMo-Agent, a model designed to handle multi-modal and multi-platform data that performs perception and reasoning tasks based on user textual instructions. Compared with existing models, RingMo-Agent 1) is supported by a large-scale vision-language dataset named RS-VL3M, comprising over 3 million image-text pairs, spanning optical, SAR, and infrared (IR) modalities collected from both satellite and UAV platforms, covering perception and challenging reasoning tasks; 2) learns modality adaptive representations by incorporating separated embedding layers to construct isolated features for heterogeneous modalities and reduce cross-modal interference; 3) unifies task modeling by introducing task-specific tokens and employing a token-based high-dimensional hidden state decoding mechanism designed for long-horizon spatial tasks. Extensive experiments on various RS vision-language tasks demonstrate that RingMo-Agent not only proves effective in both visual understanding and sophisticated analytical tasks, but also exhibits strong generalizability across different platforms and sensing modalities.

[143] An Efficient Machine Learning Framework for Forest Height Estimation from Multi-Polarimetric Multi-Baseline SAR data

Francesca Razzano,Wenyu Yang,Sergio Vitale,Giampaolo Ferraioli,Silvia Liberata Ullo,Gilda Schirinzi

Main category: cs.CV

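TL;DR: The paper proposes FGump, an efficient machine learning framework for estimating forest height from multi-polarimetric multi-baseline SAR data, using gradient boosting to balance accuracy and computational efficiency.

Details Motivation: Forest height estimation is crucial for climate change monitoring and carbon cycle assessment. Traditional model-based SAR methods and recent data-driven ML/DL methods each have limitations, which FGump aims to address.

Contribution: FGump combines gradient boosting with multi-channel SAR processing and uses LiDAR data as ground truth, achieving accurate forest height estimation from a limited feature set while avoiding heavy preprocessing and the need for large datasets.

Method: FGump takes multi-polarimetric multi-baseline SAR data, applies a gradient boosting algorithm with LiDAR ground truth, and uses a regression formulation to produce continuous height estimates, avoiding quantization error.

Result: FGump outperforms existing AI-based and classical methods in both accuracy and computational efficiency, with significantly lower training and inference times.

Insight: The regression formulation is preferable to classification because it yields finer, continuous estimates and avoids quantization artifacts; FGump shows that high performance is possible with small datasets and simple features.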

Abstract: Accurate forest height estimation is crucial for climate change monitoring and carbon cycle assessment. Synthetic Aperture Radar (SAR), particularly in multi-channel configurations, has long supported 3D forest structure reconstruction through model-based techniques. More recently, data-driven approaches using Machine Learning (ML) and Deep Learning (DL) have enabled new opportunities for forest parameter retrieval. This paper introduces FGump, a forest height estimation framework based on gradient boosting that uses multi-channel SAR processing with LiDAR profiles as Ground Truth (GT). Unlike typical ML and DL approaches that require large datasets and complex architectures, FGump ensures a strong balance between accuracy and computational efficiency, using a limited set of hand-designed features and avoiding heavy preprocessing (e.g., calibration and/or quantization). Evaluated under both classification and regression paradigms, the proposed framework demonstrates that the regression formulation enables fine-grained, continuous estimations and avoids quantization artifacts, yielding more precise measurements without rounding. Experimental results confirm that FGump outperforms State-of-the-Art (SOTA) AI-based and classical methods, achieving higher accuracy and significantly lower training and inference times.
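For readers unfamiliar with gradient boosting in its regression form, the following sketch fits an ensemble of decision stumps to the residuals of a squared-error loss, producing continuous predictions rather than class bins. It is a generic textbook-style illustration, not FGump itself: the single 1D feature, the stump learner, the hyperparameters, and the names `fit_stump` and `fit_gradient_boosting` are all assumptions, and the feature must take at least two distinct values.

```python
def fit_stump(xs, residuals):
    """Best single-threshold split on a 1D feature, minimizing squared error."""
    best = None
    for thr in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= thr]
        right = [r for x, r in zip(xs, residuals) if x > thr]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, thr, lm, rm)
    _, thr, lm, rm = best
    return lambda x: lm if x <= thr else rm


def fit_gradient_boosting(xs, ys, n_rounds=50, lr=0.1):
    """Boosted stumps for squared loss; returns a continuous predictor."""
    base = sum(ys) / len(ys)
    stumps = []
    preds = [base] * len(ys)
    for _ in range(n_rounds):
        # For squared loss the negative gradient is simply the residual.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + lr * stump(x) for p, x in zip(preds, xs)]
    return lambda x: base + lr * sum(s(x) for s in stumps)
```

Because the output is a real number, a height such as 17.3 m is not forced into a bin, which is the quantization effect the abstract attributes to the classification formulation.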

[144] SCANet: Split Coordinate Attention Network for Building Footprint Extraction

Chunshi Wang,Bin Zhao,Shuxue Ding

Main category: cs.CV

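TL;DR: SCANet introduces a novel Split Coordinate Attention (SCA) module that uses pooling kernels with two spatial ranges and channel encoding to improve building footprint extraction, reaching state-of-the-art performance on several public datasets.

Details Motivation: Building footprint extraction is vital for applications such as urban planning, but existing conventional and deep learning methods still face challenges, particularly in capturing spatially remote interactions and extracting semantic features.

Contribution: 1. Proposes the Split Coordinate Attention (SCA) module, which captures spatially remote interactions through dual pooling kernels and channel encoding; 2. Builds the SCANet network, which significantly improves building footprint extraction.

Method: The SCA module uses pooling kernels with two spatial ranges to capture remote interactions, encodes each channel along the x and y planes, and applies split operations per feature group to make feature extraction more efficient; SCANet embeds this module in a 2D CNN.

Result: On the WHU and Massachusetts datasets, SCANet reaches IoU of 91.61% and 75.49% respectively, outperforming existing state-of-the-art methods.

Insight: An attention mechanism built on dual pooling kernels and channel encoding effectively improves the modeling of spatially remote interactions in building footprint extraction.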

Abstract: Building footprint extraction holds immense significance in remote sensing image analysis and has great value in urban planning, land use, environmental protection and disaster assessment. Despite the progress made by conventional and deep learning approaches in this field, they continue to encounter significant challenges. This paper introduces a novel plug-and-play attention module, Split Coordinate Attention (SCA), which captures spatially remote interactions by employing pooling kernels with two spatial ranges, encoding each channel along the x and y planes, and separately performing a series of split operations for each feature group, thus enabling more efficient semantic feature extraction. Inserted into a 2D CNN, SCA forms SCANet, which outperforms recent SOTA methods on the public Wuhan University (WHU) Building Dataset and Massachusetts Building Dataset in terms of various metrics. In particular, SCANet achieves the best IoU, 91.61% and 75.49%, on the two datasets. Our code is available at https://github.com/AiEson/SCANet
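Two ingredients mentioned in this entry can be shown in a few lines: the IoU metric used to report results, and the directional pooling that coordinate-style attention relies on, where a feature map is averaged along each spatial axis separately. The sketch is a plain-Python illustration under assumptions. The function names are hypothetical, masks are nested lists of 0/1 values, and the split operations and dual-range kernels of the actual SCA module are not modeled.

```python
def iou(pred, target):
    """Intersection over union of two binary masks (nested lists of 0/1)."""
    pairs = [(p, t) for rp, rt in zip(pred, target) for p, t in zip(rp, rt)]
    inter = sum(1 for p, t in pairs if p and t)
    union = sum(1 for p, t in pairs if p or t)
    # Two empty masks are treated as a perfect match.
    return inter / union if union else 1.0


def pool_along_width(feature):
    """Average each row of an H x W map, giving one value per row (length H)."""
    return [sum(row) / len(row) for row in feature]


def pool_along_height(feature):
    """Average each column of an H x W map, giving one value per column."""
    h = len(feature)
    return [sum(row[c] for row in feature) / h for c in range(len(feature[0]))]
```

Each pooled vector summarizes an entire row or column, which is how an attention module of this family can relate positions that are far apart in the image.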

[145] METEOR: Multi-Encoder Collaborative Token Pruning for Efficient Vision Language Models

Yuchen Liu,Yaoming Wang,Bowen Shi,Xiaopeng Zhang,Wenrui Dai,Chenglin Li,Hongkai Xiong,Qi Tian

Main category: cs.CV

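TL;DR: METEOR is a multi-stage pruning framework that uses a collaborative pruning strategy to remove redundant visual tokens in multi-encoder vision language models, sharply reducing computational overhead while preserving performance.

Details Motivation: Existing single-encoder architectures generalize poorly across diverse multimodal tasks, while multi-encoder fusion methods perform better but at prohibitive computational cost. A balance between performance and efficiency is needed.

Contribution: Proposes the METEOR framework, the first to achieve multi-stage pruning in multi-encoder vision language models, removing redundant tokens in the encoding, fusion, and decoding stages.

Method: A progressive collaborative pruning strategy: 1) in the encoding stage, rank-guided collaborative token assignment; 2) in the fusion stage, reduction of cross-encoder redundancy; 3) in the decoding stage, dynamically adjusted pruning ratios.

Result: Performs well across 11 benchmarks; compared with EAGLE, it removes 76% of visual tokens with only a 0.3% performance drop.

Insight: Multi-stage collaborative pruning effectively balances performance and efficiency in multi-encoder models, offering a new approach to efficient multimodal understanding.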

Abstract: Vision encoders serve as the cornerstone of multimodal understanding. Single-encoder architectures like CLIP exhibit inherent constraints in generalizing across diverse multimodal tasks, while recent multi-encoder fusion methods introduce prohibitive computational overhead to achieve superior performance using complementary visual representations from multiple vision encoders. To address this, we propose a progressive pruning framework, namely Multi-Encoder collaboraTivE tOken pRuning (METEOR), that eliminates redundant visual tokens across the encoding, fusion, and decoding stages for multi-encoder MLLMs. For multi-vision encoding, we discard redundant tokens within each encoder via a rank-guided collaborative token assignment strategy. Subsequently, for multi-vision fusion, we combine the visual features from different encoders while reducing cross-encoder redundancy with cooperative pruning. Finally, we propose an adaptive token pruning method in the LLM decoding stage to further discard irrelevant tokens based on the text prompts, with dynamically adjusted pruning ratios for specific task demands. To the best of our knowledge, this is the first successful attempt to achieve an efficient multi-encoder vision language model with multi-stage pruning strategies. Extensive experiments on 11 benchmarks demonstrate the effectiveness of our proposed approach. Compared with EAGLE, a typical multi-encoder MLLM, METEOR reduces visual tokens by 76% with only a 0.3% performance drop on average. The code is available at https://github.com/YuchenLiu98/METEOR.
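
The common building block behind the pruning stages described above is score-based token selection. The sketch below illustrates only that generic step; it does not reproduce METEOR's rank-guided assignment, cooperative fusion pruning, or text-conditioned ratios. The function name `prune_tokens` and the importance scores are hypothetical. Under this scheme, removing 76% of tokens would correspond to `keep_ratio=0.24`.

```python
import math


def prune_tokens(tokens, scores, keep_ratio):
    """Keep the highest-scoring fraction of tokens, preserving original order.

    tokens:     list of token features (any objects).
    scores:     one importance score per token (assumed to be given).
    keep_ratio: fraction of tokens to keep, in (0, 1].
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    n_keep = max(1, math.ceil(keep_ratio * len(tokens)))
    # Rank token indices from most to least important.
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    # Restore positional order so the kept tokens stay in sequence.
    kept = sorted(ranked[:n_keep])
    return [tokens[i] for i in kept]


tokens = ["t0", "t1", "t2", "t3"]
scores = [0.1, 0.9, 0.3, 0.8]
print(prune_tokens(tokens, scores, keep_ratio=0.5))  # ['t1', 't3']
```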

[146] Compositional Video Synthesis by Temporal Object-Centric Learning

Adil Kaan Akan,Yucel Yemez

Main category: cs.CV

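TL;DR: This paper proposes a compositional video synthesis framework based on temporal object-centric learning. By combining object-centric representations with diffusion models, it produces high-quality, temporally consistent video and supports object-level editing.

Details Motivation: Existing object-centric methods either lack generative capability or ignore explicit object structure in video, so they cannot meet the needs of high-quality video synthesis and editing. This work aims to fill that gap, enabling controllable video generation through temporally consistent object-centric representations.

Contribution: The main contributions are: (1) extending object-centric representations to video by learning pose-invariant object slots; (2) combining them with a pretrained diffusion model for high-quality video generation; (3) supporting object-level edits (such as insertion, deletion, and replacement) while keeping object identities consistent.

Method: The method: (1) learns temporally consistent object-centric slots; (2) feeds the slots to a diffusion model as conditioning; (3) enables interactive video synthesis through object-level editing operations.

Result: Experiments show the method surpasses existing methods in video generation quality and temporal consistency while supporting object-level semantic editing.

Insight: Combining object-centric learning with diffusion models delivers both high-quality generation and precise editing, opening new directions for dynamic scene understanding and content creation.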

Abstract: We present a novel framework for compositional video synthesis that leverages temporally consistent object-centric representations, extending our previous work, SlotAdapt, from images to video. While existing object-centric approaches either lack generative capabilities entirely or treat video sequences holistically, thus neglecting explicit object-level structure, our approach explicitly captures temporal dynamics by learning pose invariant object-centric slots and conditioning them on pretrained diffusion models. This design enables high-quality, pixel-level video synthesis with superior temporal coherence, and offers intuitive compositional editing capabilities such as object insertion, deletion, or replacement, maintaining consistent object identities across frames. Extensive experiments demonstrate that our method sets new benchmarks in video generation quality and temporal consistency, outperforming previous object-centric generative methods. Although our segmentation performance closely matches state-of-the-art methods, our approach uniquely integrates this capability with robust generative performance, significantly advancing interactive and controllable video generation and opening new possibilities for advanced content creation, semantic editing, and dynamic scene understanding.

[147] Not Only Grey Matter: OmniBrain for Robust Multimodal Classification of Alzheimer’s Disease

Ahmed Sharshar,Yasser Ashraf,Tameem Bakr,Salma Hassan,Hosam Elgendy,Mohammad Yaqub,Mohsen Guizani

Main category: cs.CV

TL;DR: OmniBrain is a multimodal framework for Alzheimer’s disease classification, integrating MRI, radiomics, gene expression, and clinical data with cross-attention and modality dropout. It achieves high accuracy, generalizes well, and provides explainability.

Details Motivation: Existing approaches for Alzheimer's diagnosis lack accuracy, generalization, robustness to missing data, and explainability simultaneously, limiting their clinical reliability.

Contribution: OmniBrain is a unified model that integrates multiple data modalities, outperforms existing methods, and enhances explainability for clinical trust.

Method: Uses cross-attention and modality dropout to handle multimodal data (MRI, radiomics, gene expression, clinical) robustly and efficiently.

Result: Achieves 92.2% accuracy on ANMerge dataset and 70.4% on ADNI dataset, surpassing prior unimodal and multimodal approaches.

Insight: The framework’s ability to generalize and explain decisions makes it practical for real-world clinical applications.

Abstract: Alzheimer’s disease affects over 55 million people worldwide, a number projected to more than double by 2050, necessitating rapid, accurate, and scalable diagnostics. However, existing approaches are limited because they cannot achieve clinically acceptable accuracy, generalization across datasets, robustness to missing modalities, and explainability all at the same time. This inability to satisfy all these requirements simultaneously undermines their reliability in clinical settings. We propose OmniBrain, a multimodal framework that integrates brain MRI, radiomics, gene expression, and clinical data using a unified model with cross-attention and modality dropout. OmniBrain achieves $92.2 \pm 2.4\%$ accuracy on the ANMerge dataset and generalizes to the MRI-only ADNI dataset with $70.4 \pm 2.7\%$ accuracy, outperforming unimodal and prior multimodal approaches. Explainability analyses highlight neuropathologically relevant brain regions and genes, enhancing clinical trust. OmniBrain offers a robust, interpretable, and practical solution for real-world Alzheimer’s diagnosis.
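
Modality dropout, mentioned above as the mechanism for robustness to missing modalities, is usually implemented by randomly hiding whole modalities during training so the model learns to cope when one is absent at test time (as with the MRI-only ADNI setting). The following is a minimal sketch of that general idea, not OmniBrain's implementation; the function name, the drop probability, and the feature values are illustrative assumptions.

```python
import random


def modality_dropout(features, p_drop, rng):
    """Randomly hide modalities for one training sample.

    features: dict mapping modality name -> feature object.
    p_drop:   probability of dropping each modality independently.
    rng:      a random.Random instance, passed in for reproducibility.
    Dropped modalities are returned as None; at least one is always kept.
    """
    names = list(features)
    if not names:
        return {}
    kept = {name for name in names if rng.random() >= p_drop}
    if not kept:
        # Never hide everything: fall back to one randomly chosen modality.
        kept = {rng.choice(names)}
    return {name: (features[name] if name in kept else None) for name in names}


# Hypothetical per-patient features for the four modalities named in the abstract.
sample = {"mri": [0.2, 0.7], "radiomics": [1.3], "genes": [0.5, 0.1], "clinical": [72.0]}
masked = modality_dropout(sample, p_drop=0.3, rng=random.Random(0))
print(masked)
```

A downstream fusion layer then needs a defined way to handle `None`, for example by masking that modality out of the cross-attention keys.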

[148] DriveAgent-R1: Advancing VLM-based Autonomous Driving with Hybrid Thinking and Active Perception

Weicheng Zheng,Xiaofei Mao,Nanfei Ye,Pengxiang Li,Kun Zhan,Xianpeng Lang,Hang Zhao

Main category: cs.CV

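TL;DR: DriveAgent-R1 is an autonomous driving agent built on a vision-language model (VLM). Its hybrid-thinking framework and active perception mechanism address the limits of prior approaches in long-horizon decision-making and complex environments, delivering state-of-the-art performance.

Details Motivation: Existing VLMs for autonomous driving suffer from myopic decision-making and passive perception, which limits their reliability in complex environments.

Contribution: Proposes the Hybrid-Thinking framework and the Active Perception mechanism, combined with a three-stage progressive reinforcement learning strategy, to improve the decision-making efficiency and reliability of autonomous driving systems.

Method: Uses a hybrid-thinking framework (text-based and tool-based reasoning) and an active perception mechanism (a vision toolkit), trained with a three-stage reinforcement learning strategy.

Result: Experiments show that DriveAgent-R1 outperforms leading existing models (such as Claude Sonnet 4), and ablation studies confirm the robustness of the approach.

Insight: Balancing efficiency and reliability through active perception and hybrid thinking offers a new path toward safer, more intelligent autonomous driving systems.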

Abstract: Vision-Language Models (VLMs) are advancing autonomous driving, yet their potential is constrained by myopic decision-making and passive perception, limiting reliability in complex environments. We introduce DriveAgent-R1 to tackle these challenges in long-horizon, high-level behavioral decision-making. DriveAgent-R1 features two core innovations: a Hybrid-Thinking framework that adaptively switches between efficient text-based and in-depth tool-based reasoning, and an Active Perception mechanism with a vision toolkit to proactively resolve uncertainties, thereby balancing decision-making efficiency and reliability. The agent is trained using a novel, three-stage progressive reinforcement learning strategy designed to master these hybrid capabilities. Extensive experiments demonstrate that DriveAgent-R1 achieves state-of-the-art performance, outperforming even leading proprietary large multimodal models, such as Claude Sonnet 4. Ablation studies validate our approach and confirm that the agent’s decisions are robustly grounded in actively perceived visual evidence, paving a path toward safer and more intelligent autonomous systems.

[149] Endoscopic Depth Estimation Based on Deep Learning: A Survey

Ke Niu,Zeyun Liu,Xue Feng,Heng Li,Kaize Shi

Main category: cs.CV

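TL;DR: This survey systematically reviews deep learning-based endoscopic depth estimation from three perspectives (data, methods, and applications), analyzes the supervision strategies and network architectures of existing methods, and discusses applications in robot-assisted surgery along with future research directions.

Details Motivation: Endoscopic depth estimation is critical for improving the safety and precision of minimally invasive surgery, yet a comprehensive survey of recent deep learning techniques is still lacking.

Contribution: Fills the survey gap for deep learning in endoscopic depth estimation, systematically organizing the relevant data, methods, and applications and summarizing public datasets and performance evaluation metrics.

Method: A categorized review along three dimensions (data, methods, and applications), with emphasis on supervision strategies and network architectures.

Result: Summarizes the strengths and weaknesses of existing techniques and proposes future directions such as domain adaptation, real-time implementation, and improved model generalization.

Insight: The distinctive challenges of endoscopic scenes (such as illumination changes and tissue deformation) call for more robust deep learning methods, and cross-disciplinary collaboration is key to future progress.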

Abstract: Endoscopic depth estimation is a critical technology for improving the safety and precision of minimally invasive surgery. It has attracted considerable attention from researchers in medical imaging, computer vision, and robotics. Over the past decade, a large number of methods have been developed. Despite the existence of several related surveys, a comprehensive overview focusing on recent deep learning-based techniques is still limited. This paper endeavors to bridge this gap by systematically reviewing the state-of-the-art literature. Specifically, we provide a thorough survey of the field from three key perspectives: data, methods, and applications, covering a range of methods including both monocular and stereo approaches. We describe common performance evaluation metrics and summarize publicly available datasets. Furthermore, this review analyzes the specific challenges of endoscopic scenes and categorizes representative techniques based on their supervision strategies and network architectures. The application of endoscopic depth estimation in the important area of robot-assisted surgery is also reviewed. Finally, we outline potential directions for future research, such as domain adaptation, real-time implementation, and enhanced model generalization, thereby providing a valuable starting point for researchers to engage with and advance the field.

[150] Event-Based De-Snowing for Autonomous Driving

Manasi Muglikar,Nico Messikommer,Marco Cannici,Davide Scaramuzza

Main category: cs.CV

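TL;DR: The paper proposes an event-camera-based de-snowing method that uses the spatio-temporal properties of event data to identify the occlusion signature of snowflakes and an attention module to recover background information, markedly improving image reconstruction quality and the performance of downstream vision tasks.

Details Motivation: Conventional image-based and video-based de-snowing methods are limited by their reliance on spatial information or high frame rates, which leads to hallucination artifacts or alignment problems. Event cameras, with their low latency and high dynamic range, offer a new approach to de-snowing.

Contribution: 1. Proposes an event-camera-based de-snowing method; 2. Designs an attention module that focuses on features occluded by snowflakes; 3. Creates a new dataset, DSEC-Snow, with synchronized image and event data; 4. Outperforms existing methods by 3 dB in PSNR and improves downstream vision tasks.

Method: Uses the spatio-temporal data captured by an event camera, applies an attention module to identify snowflake occlusions, and removes snow by recovering the original intensity of background points.

Result: The method improves PSNR by 3 dB and improves performance on downstream tasks (such as depth estimation and optical flow) by 20%.

Insight: The dynamic sensing properties of event cameras offer a new solution for vision in adverse weather, markedly improving the robustness of image reconstruction and downstream tasks.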

Abstract: Adverse weather conditions, particularly heavy snowfall, pose significant challenges to both human drivers and autonomous vehicles. Traditional image-based de-snowing methods often introduce hallucination artifacts as they rely solely on spatial information, while video-based approaches require high frame rates and suffer from alignment artifacts at lower frame rates. Camera parameters, such as exposure time, also influence the appearance of snowflakes, making the problem difficult to solve and heavily dependent on network generalization. In this paper, we propose to address the challenge of de-snowing by using event cameras, which offer compressed visual information with submillisecond latency, making them ideal for de-snowing images, even in the presence of ego-motion. Our method leverages the fact that snowflake occlusions appear with a very distinctive streak signature in the spatio-temporal representation of event data. We design an attention-based module that focuses on events along these streaks to determine when a background point was occluded and use this information to recover its original intensity. We benchmark our method on DSEC-Snow, a new dataset created using a green-screen technique that overlays pre-recorded snowfall data onto the existing DSEC driving dataset, resulting in precise ground truth and synchronized image and event streams. Our approach outperforms state-of-the-art de-snowing methods by 3 dB in PSNR for image reconstruction. Moreover, we show that off-the-shelf computer vision algorithms can be applied to our reconstructions for tasks such as depth estimation and optical flow, achieving a 20% performance improvement over other de-snowing methods. Our work represents a crucial step towards enhancing the reliability and safety of vision systems in challenging winter conditions, paving the way for more robust, all-weather-capable applications.
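
To put the reported 3 dB PSNR gain in perspective: PSNR is $10 \log_{10}(\mathrm{MAX}^2 / \mathrm{MSE})$, so halving the mean squared error raises it by $10 \log_{10} 2 \approx 3.01$ dB. The sketch below computes PSNR for small grayscale images stored as nested lists; the function name and the 8-bit `max_value` default are assumptions for illustration and are not taken from the paper's evaluation code.

```python
import math


def psnr(reference, reconstruction, max_value=255.0):
    """Peak signal-to-noise ratio (dB) between two equally sized grayscale images."""
    ref = [v for row in reference for v in row]
    rec = [v for row in reconstruction for v in row]
    if len(ref) != len(rec) or not ref:
        raise ValueError("images must be non-empty and the same size")
    mse = sum((a - b) ** 2 for a, b in zip(ref, rec)) / len(ref)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


clean = [[0.0, 0.0]]
noisy_a = [[2.0, 2.0]]  # MSE = 4
noisy_b = [[2.0, 0.0]]  # MSE = 2, half the error of noisy_a
gain = psnr(clean, noisy_b) - psnr(clean, noisy_a)
print(round(gain, 2))  # 3.01
```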

[151] HAMLET-FFD: Hierarchical Adaptive Multi-modal Learning Embeddings Transformation for Face Forgery Detection

Jialei Cui,Jianwei Du,Yanzhe Li,Lei Gao,Hui Jiang,Chenfu Bao

Main category: cs.CV

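TL;DR: HAMLET-FFD is a hierarchical adaptive multi-modal learning framework that tackles cross-domain generalization in face forgery detection through bidirectional cross-modal reasoning, combining visual and textual information to generate image-adaptive prompts and improve detection performance.

Details Motivation: Face forgery techniques are evolving rapidly. Existing methods rely on simple classification objectives, struggle to learn domain-invariant representations, and generalize poorly across domains.

Contribution: 1. Proposes HAMLET-FFD, a hierarchical adaptive multi-modal learning framework; 2. Introduces a knowledge refinement loop that integrates visual and conceptual cues; 3. Adds a bidirectional fusion mechanism that improves semantic alignment.

Method: Built on CLIP with all pretrained parameters frozen, a bidirectional cross-modal fusion mechanism combines visual features and text embeddings to perform hierarchical feature aggregation and prompt generation.

Result: Shows strong cross-domain generalization on multiple benchmarks; visual analyses reveal a clear division of labor among embeddings.

Insight: By combining visual and semantic cues, HAMLET-FFD emulates expert forensic analysis and substantially improves the robustness and generalization of forgery detection.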

Abstract: The rapid evolution of face manipulation techniques poses a critical challenge for face forgery detection: cross-domain generalization. Conventional methods, which rely on simple classification objectives, often fail to learn domain-invariant representations. We propose HAMLET-FFD, a cognitively inspired Hierarchical Adaptive Multi-modal Learning framework that tackles this challenge via bidirectional cross-modal reasoning. Building on contrastive vision-language models such as CLIP, HAMLET-FFD introduces a knowledge refinement loop that iteratively assesses authenticity by integrating visual evidence with conceptual cues, emulating expert forensic analysis. A key innovation is a bidirectional fusion mechanism in which textual authenticity embeddings guide the aggregation of hierarchical visual features, while modulated visual features refine text embeddings to generate image-adaptive prompts. This closed-loop process progressively aligns visual observations with semantic priors to enhance authenticity assessment. By design, HAMLET-FFD freezes all pretrained parameters, serving as an external plugin that preserves CLIP’s original capabilities. Extensive experiments demonstrate its superior generalization to unseen manipulations across multiple benchmarks, and visual analyses reveal a division of labor among embeddings, with distinct representations specializing in fine-grained artifact recognition.

[152] RIS-LAD: A Benchmark and Model for Referring Low-Altitude Drone Image Segmentation

Kai Ye,YingShi Luan,Zhudi Chen,Guangyue Meng,Pingyang Dai,Liujuan Cao

Main category: cs.CV

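TL;DR: This paper presents RIS-LAD, the first fine-grained Referring Image Segmentation (RIS) benchmark designed for low-altitude drone (LAD) scenarios, and proposes the Semantic-Aware Adaptive Reasoning Network (SAARN) to address the new challenges of LAD scenes.

Details Motivation: Existing RIS datasets and methods mainly target high-altitude, static-view imagery and cannot handle LAD characteristics such as diverse viewpoints and high object density. Dedicated datasets and algorithms are needed.

Contribution: 1. Presents RIS-LAD, the first RIS benchmark for LAD scenarios, with 13,871 finely annotated image-text-mask triplets. 2. Proposes the Semantic-Aware Adaptive Reasoning Network (SAARN), which adapts to complex scenes through hierarchical semantic injection and dynamic reasoning.

Method: 1. The Category-Dominated Linguistic Enhancement (CDLE) module aligns visual features with object categories during early encoding. 2. The Adaptive Reasoning Fusion Module (ARFM) dynamically selects multi-scale semantic cues to improve reasoning.

Result: Experiments show that RIS-LAD poses substantial challenges to existing RIS algorithms and that SAARN handles those challenges effectively.

Insight: New challenges in LAD scenes, such as small objects, category drift, and dense objects, require purpose-built datasets and algorithms; hierarchical semantic injection and dynamic reasoning are effective ways to address them.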

Abstract: Referring Image Segmentation (RIS), which aims to segment specific objects based on natural language descriptions, plays an essential role in vision-language understanding. Despite its progress in remote sensing applications, RIS in Low-Altitude Drone (LAD) scenarios remains underexplored. Existing datasets and methods are typically designed for high-altitude and static-view imagery. They struggle to handle the unique characteristics of LAD views, such as diverse viewpoints and high object density. To fill this gap, we present RIS-LAD, the first fine-grained RIS benchmark tailored for LAD scenarios. This dataset comprises 13,871 carefully annotated image-text-mask triplets collected from realistic drone footage, with a focus on small, cluttered, and multi-viewpoint scenes. It highlights new challenges absent in previous benchmarks, such as category drift caused by tiny objects and object drift under crowded same-class objects. To tackle these issues, we propose the Semantic-Aware Adaptive Reasoning Network (SAARN). Rather than uniformly injecting all linguistic features, SAARN decomposes and routes semantic information to different stages of the network. Specifically, the Category-Dominated Linguistic Enhancement (CDLE) aligns visual features with object categories during early encoding, while the Adaptive Reasoning Fusion Module (ARFM) dynamically selects semantic cues across scales to improve reasoning in complex scenes. The experimental evaluation reveals that RIS-LAD presents substantial challenges to state-of-the-art RIS algorithms, and also demonstrates the effectiveness of our proposed model in addressing these challenges. The dataset and code will be publicly released soon at: https://github.com/AHideoKuzeA/RIS-LAD/.

[153] ARC-Hunyuan-Video-7B: Structured Video Comprehension of Real-World Shorts

Yuying Ge,Yixiao Ge,Chen Li,Teng Wang,Junfu Pu,Yizhuo Li,Lu Qiu,Jin Ma,Lisheng Duan,Xinyu Zuo,Jinwen Luo,Weibo Gu,Zexuan Li,Xiaojing Zhang,Yangyu Tao,Han Hu,Di Wang,Ying Shan

Main category: cs.CV

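TL;DR: ARC-Hunyuan-Video-7B is a multimodal model for structured comprehension of real-world short videos, offering multi-granularity timestamped captioning, open-ended question answering, and related capabilities. Trained on high-quality data, it delivers efficient inference and supports downstream applications.

Details Motivation: Real-world short videos dominate the mobile internet, but current large multimodal models lack deep understanding of a video's temporal structure and details. More capable video comprehension models are needed to support search, recommendation, and other applications.

Contribution: 1. Proposes ARC-Hunyuan-Video-7B, which processes multimodal video signals end-to-end; 2. Introduces high-quality automatically annotated data; 3. Performs strongly across multiple tasks with efficient inference.

Method: A compact 7B-parameter model developed through a comprehensive training regimen: pre-training, instruction fine-tuning, cold start, reinforcement learning post-training, and final instruction fine-tuning.

Result: Performs strongly on the ShortVid-Bench benchmark; production deployment measurably improved user engagement and satisfaction; inference is efficient (10 seconds for a one-minute video on an H20 GPU).

Insight: With high-quality data and a comprehensive training regimen, a compact model can achieve deep understanding of complex short videos, providing an efficient solution for real-world applications.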

Abstract: Real-world user-generated short videos, especially those distributed on platforms such as WeChat Channel and TikTok, dominate the mobile internet. However, current large multimodal models lack essential temporally-structured, detailed, and in-depth video comprehension capabilities, which are the cornerstone of effective video search and recommendation, as well as emerging video applications. Understanding real-world shorts is actually challenging due to their complex visual elements, high information density in both visuals and audio, and fast pacing that focuses on emotional expression and viewpoint delivery. This requires advanced reasoning to effectively integrate multimodal information, including visual, audio, and text. In this work, we introduce ARC-Hunyuan-Video, a multimodal model that processes visual, audio, and textual signals from raw video inputs end-to-end for structured comprehension. The model is capable of multi-granularity timestamped video captioning and summarization, open-ended video question answering, temporal video grounding, and video reasoning. Leveraging high-quality data from an automated annotation pipeline, our compact 7B-parameter model is trained through a comprehensive regimen: pre-training, instruction fine-tuning, cold start, reinforcement learning (RL) post-training, and final instruction fine-tuning. Quantitative evaluations on our introduced benchmark ShortVid-Bench and qualitative comparisons demonstrate its strong performance in real-world video comprehension, and it supports zero-shot or fine-tuning with a few samples for diverse downstream applications. The real-world production deployment of our model has yielded tangible and measurable improvements in user engagement and satisfaction, a success supported by its remarkable efficiency, with stress tests indicating an inference time of just 10 seconds for a one-minute video on H20 GPU.

[154] Mask-Free Audio-driven Talking Face Generation for Enhanced Visual Quality and Identity Preservation

Dogucan Yaman,Fevziye Irem Eyiokur,Leonard Bärmann,Hazım Kemal Ekenel,Alexander Waibel

Main category: cs.CV

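TL;DR: This paper proposes a mask-free audio-driven talking face generation method. A two-step landmark-based transformation converts input images to a closed-mouth state, avoiding the information loss of conventional masking and the negative effects of identity reference images, and notably improving visual quality and identity preservation.

Details Motivation: Conventional methods mask the lower half of the face and rely on an identity reference image to generate synchronized lips, which causes information loss, inconsistent identity details, and negative effects on the model. This paper aims to eliminate these problems and achieve higher-quality talking face generation.

Contribution: The main contributions are: 1) a method that needs neither masks nor identity reference images; 2) a two-step landmark-based transformation that produces closed-mouth images; 3) validation of the method on the LRS2 and HDTF datasets.

Method: A two-step landmark-based transformation first converts the input images to a closed-mouth state; the unmasked images and the audio are then fed to a lip adaptation model that generates synchronized lip movements.

Result: Experiments on the LRS2 and HDTF datasets show that the method notably improves visual quality and identity preservation and avoids the limitations of conventional masking.

Insight: Avoiding masks and identity reference images substantially reduces model instability while improving the realism and consistency of the generated results.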

Abstract: Audio-Driven Talking Face Generation aims at generating realistic videos of talking faces, focusing on accurate audio-lip synchronization without deteriorating any identity-related visual details. Recent state-of-the-art methods are based on inpainting, meaning that the lower half of the input face is masked, and the model fills the masked region by generating lips aligned with the given audio. Hence, to preserve identity-related visual details from the lower half, these approaches additionally require an unmasked identity reference image randomly selected from the same video. However, this common masking strategy suffers from (1) information loss in the input faces, significantly affecting the networks’ ability to preserve visual quality and identity details, (2) variation between identity reference and input image degrading reconstruction performance, and (3) the identity reference negatively impacting the model, causing unintended copying of elements unaligned with the audio. To address these issues, we propose a mask-free talking face generation approach while maintaining the 2D-based face editing task. Instead of masking the lower half, we transform the input images to have closed mouths, using a two-step landmark-based approach trained in an unpaired manner. Subsequently, we provide these edited but unmasked faces to a lip adaptation model alongside the audio to generate appropriate lip movements. Thus, our approach needs neither masked input images nor identity reference images. We conduct experiments on the benchmark LRS2 and HDTF datasets and perform various ablation studies to validate our contributions.

[155] Adapting Vehicle Detectors for Aerial Imagery to Unseen Domains with Weak Supervision

Xiao Fang,Minhyek Jeon,Zheyang Qin,Stanislav Panev,Celso de Melo,Shuowen Hu,Shayok Chakraborty,Fernando De la Torre

Main category: cs.CV

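TL;DR: The paper proposes a new method that uses generative AI to synthesize high-quality aerial images and their labels, improving vehicle detector training through data augmentation, and mitigates the cross-domain distribution gap with a multi-stage, multi-modal knowledge transfer framework.

Details Motivation: Vehicle detection in aerial imagery is widely used in areas such as traffic monitoring, but models generalize poorly across geographic regions, mainly because of domain shift caused by environmental conditions, image acquisition parameters, and similar factors.

Contribution: 1. Proposes a generative-AI-based data augmentation method; 2. Develops a multi-stage, multi-modal knowledge transfer framework; 3. Introduces two newly annotated datasets (New Zealand and Utah).

Method: Fine-tuned latent diffusion models (LDMs) generate synthetic images and labels, and weakly supervised adaptation improves the detector's generalization to the target domain.

Result: Performs well across multiple aerial imagery domains, with AP50 improving by 4-23% over supervised learning on source-domain data, 6-10% over weakly supervised adaptation methods, 7-40% over unsupervised domain adaptation methods, and more than 50% over open-set object detectors.

Insight: Generative AI can effectively mitigate domain shift, and the multi-modal knowledge transfer framework offers a new approach to cross-domain detection.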

Abstract: Detecting vehicles in aerial imagery is a critical task with applications in traffic monitoring, urban planning, and defense intelligence. Deep learning methods have provided state-of-the-art (SOTA) results for this application. However, a significant challenge arises when models trained on data from one geographic region fail to generalize effectively to other areas. Variability in factors such as environmental conditions, urban layouts, road networks, vehicle types, and image acquisition parameters (e.g., resolution, lighting, and angle) leads to domain shifts that degrade model performance. This paper proposes a novel method that uses generative AI to synthesize high-quality aerial images and their labels, improving detector training through data augmentation. Our key contribution is the development of a multi-stage, multi-modal knowledge transfer framework utilizing fine-tuned latent diffusion models (LDMs) to mitigate the distribution gap between the source and target environments. Extensive experiments across diverse aerial imagery domains show consistent performance improvements in AP50 over supervised learning on source domain data, weakly supervised adaptation methods, unsupervised domain adaptation methods, and open-set object detectors by 4-23%, 6-10%, 7-40%, and more than 50%, respectively. Furthermore, we introduce two newly annotated aerial datasets from New Zealand and Utah to support further research in this field. Project page is available at: https://humansensinglab.github.io/AGenDA

[156] JWB-DH-V1: Benchmark for Joint Whole-Body Talking Avatar and Speech Generation Version 1

Xinhan Di,Kristin Qi,Pengqian Yu

Main category: cs.CV

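TL;DR: The paper introduces JWB-DH-V1, a benchmark for evaluating multi-modal consistency in the joint generation of whole-body animatable avatars and natural speech.

Details Motivation: Current diffusion-based video generation methods struggle to achieve multi-modal consistency when generating whole-body motion and natural speech, and the field lacks both comprehensive evaluation frameworks and benchmarks for region-specific performance analysis.

Contribution: A large-scale multi-modal dataset and an evaluation protocol for joint audio-video generation.

Method: Builds a dataset of 10,000 unique identities and 2 million video samples and designs an accompanying evaluation protocol.

Result: The evaluation finds that current SOTA models show performance disparities between face/hand-centric and whole-body performance.

Insight: The results suggest that future research should focus on regional consistency in whole-body generation.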

Abstract: Recent advances in diffusion-based video generation have enabled photo-realistic short clips, but current methods still struggle to achieve multi-modal consistency when jointly generating whole-body motion and natural speech. Current approaches lack comprehensive evaluation frameworks that assess both visual and audio quality, and there are insufficient benchmarks for region-specific performance analysis. To address these gaps, we introduce the Joint Whole-Body Talking Avatar and Speech Generation Version I (JWB-DH-V1), comprising a large-scale multi-modal dataset with 10,000 unique identities across 2 million video samples, and an evaluation protocol for assessing joint audio-video generation of whole-body animatable avatars. Our evaluation of SOTA models reveals consistent performance disparities between face/hand-centric and whole-body performance, indicating essential areas for future research. The dataset and evaluation tools are publicly available at https://github.com/deepreasonings/WholeBodyBenchmark.

[157] Security Tensors as a Cross-Modal Bridge: Extending Text-Aligned Safety to Vision in LVLM

Shen Li,Liuyi Yao,Wujia Niu,Lan Zhang,Yaliang Li

Main category: cs.CV

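TL;DR: The paper proposes "security tensors" to address the safety of the visual modality in large visual-language models (LVLMs): trainable input vectors that extend text-based safety mechanisms to the visual modality.

Details Motivation: Existing text-based safety mechanisms do not transfer directly to the visual modality, leaving LVLMs vulnerable to harmful image inputs.

Contribution: Introduces security tensors, which extend textual safety alignment to visual processing without modifying model parameters and, once optimized on a curated dataset, enable reliable rejection of harmful inputs.

Method: Trainable security tensors are optimized on a dataset containing malicious image-text pairs and benign samples, and are applied at inference through either the textual or the visual modality.

Result: Experiments show that security tensors markedly improve an LVLM's ability to reject harmful visual inputs while preserving performance on benign tasks.

Insight: Security tensors activate the language module's textual "safety layers" on visual inputs, effectively extending text-based safety mechanisms to the visual modality.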

Abstract: Large visual-language models (LVLMs) integrate aligned large language models (LLMs) with visual modules to process multimodal inputs. However, the safety mechanisms developed for text-based LLMs do not naturally extend to visual modalities, leaving LVLMs vulnerable to harmful image inputs. To address this cross-modal safety gap, we introduce security tensors - trainable input vectors applied during inference through either the textual or visual modality. These tensors transfer textual safety alignment to visual processing without modifying the model’s parameters. They are optimized using a curated dataset containing (i) malicious image-text pairs requiring rejection, (ii) contrastive benign pairs with text structurally similar to malicious queries, with the purpose of being contrastive examples to guide visual reliance, and (iii) general benign samples preserving model functionality. Experimental results demonstrate that both textual and visual security tensors significantly enhance LVLMs’ ability to reject diverse harmful visual inputs while maintaining near-identical performance on benign tasks. Further internal analysis towards hidden-layer representations reveals that security tensors successfully activate the language module’s textual “safety layers” in visual inputs, thereby effectively extending text-based safety to the visual modality.

[158] Improving Adversarial Robustness Through Adaptive Learning-Driven Multi-Teacher Knowledge Distillation

Hayat Ullah,Syed Muhammad Talha Zaidi,Arslan Munir

Main category: cs.CV

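TL;DR: The paper proposes multi-teacher knowledge distillation with an adaptive learning strategy. Adversarially trained teachers transfer robustness to a CNN student that, despite never seeing adversarial data, resists a range of adversarial attacks.

Details Motivation: Convolutional neural networks (CNNs) perform well in computer vision but are vulnerable to adversarial attacks. Despite progress in adversarial training, a gap remains between model accuracy and robustness.

Contribution: Proposes a multi-teacher adversarial robustness distillation framework with an adaptive learning strategy that dynamically adjusts each teacher's contribution weight to improve the student's adversarial robustness.

Method: 1. Generate perturbed data with different adversarial attacks and adversarially train multiple teacher models; 2. Train the student model on clean data via multi-teacher knowledge distillation; 3. Apply an adaptive learning strategy that assigns weights dynamically according to each teacher's prediction precision.

Result: Experiments on the MNIST-Digits and Fashion-MNIST datasets show that the method substantially improves robustness against various adversarial attacks.

Insight: Multi-teacher distillation combined with an adaptive learning strategy transfers adversarial robustness effectively, so the student can withstand attacks even without exposure to adversarial data.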

Abstract: Convolutional neural networks (CNNs) excel in computer vision but are susceptible to adversarial attacks, crafted perturbations designed to mislead predictions. Despite advances in adversarial training, a gap persists between model accuracy and robustness. To mitigate this issue, in this paper, we present a multi-teacher adversarial robustness distillation using an adaptive learning strategy. Specifically, our proposed method first trains multiple clones of a baseline CNN model using an adversarial training strategy on a pool of perturbed data acquired through different adversarial attacks. Once trained, these adversarially trained models are used as teacher models to supervise the learning of a student model on clean data using multi-teacher knowledge distillation. To ensure an effective robustness distillation, we design an adaptive learning strategy that controls the knowledge contribution of each model by assigning weights as per their prediction precision. Distilling knowledge from adversarially pre-trained teacher models not only enhances the learning capabilities of the student model but also empowers it with the capacity to withstand different adversarial attacks, despite having no exposure to adversarial data. To verify our claims, we extensively evaluated our proposed method on MNIST-Digits and Fashion-MNIST datasets across diverse experimental settings. The obtained results exhibit the efficacy of our multi-teacher adversarial distillation and adaptive learning strategy, enhancing CNNs’ adversarial robustness against various adversarial attacks.
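
The adaptive strategy above weights each teacher by its prediction precision. The abstract does not give the weighting formula, so the sketch below assumes the simplest choice, weights proportional to each teacher's accuracy, and blends the teachers' class probabilities into one soft target for the student. All names and numbers here are hypothetical.

```python
def teacher_weights(accuracies):
    """Turn per-teacher accuracies into weights that sum to 1.

    Assumption: weights are proportional to accuracy; the paper may use
    a different rule.
    """
    total = sum(accuracies)
    if total == 0:
        # No teacher is better than another: fall back to uniform weights.
        return [1.0 / len(accuracies)] * len(accuracies)
    return [acc / total for acc in accuracies]


def blend_soft_targets(teacher_probs, weights):
    """Weighted average of the teachers' class-probability vectors."""
    n_classes = len(teacher_probs[0])
    return [
        sum(w * probs[c] for w, probs in zip(weights, teacher_probs))
        for c in range(n_classes)
    ]


# Three hypothetical teachers, each adversarially trained against a different attack.
accuracies = [0.9, 0.6, 0.5]
teacher_probs = [
    [0.7, 0.2, 0.1],
    [0.5, 0.4, 0.1],
    [0.2, 0.6, 0.2],
]
weights = teacher_weights(accuracies)
soft_target = blend_soft_targets(teacher_probs, weights)
print(weights)
print(soft_target)
```

In a full training loop the student would minimize a divergence (commonly KL) between its own softened predictions and `soft_target`, computed on clean inputs only.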

[159] Learning Transferable Facial Emotion Representations from Large-Scale Semantically Rich Captions

Licai Sun,Xingxun Jiang,Haoyu Chen,Yante Li,Zheng Lian,Biu Liu,Yuan Zong,Wenming Zheng,Jukka M. Leppänen,Guoying Zhao

Main category: cs.CV

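TL;DR: The paper proposes learning transferable facial emotion representations from large-scale, semantically rich natural language captions. It builds the EmoCap100K dataset and the EmoCapCLIP framework to exploit multi-level caption information, substantially improving facial emotion recognition performance.

Details Motivation: Existing facial emotion recognition systems are usually trained on fixed categories or dimensions, which overlooks the richness and diversity of emotion. Natural language captions express emotion more flexibly, but large-scale datasets and effective learning frameworks have been lacking.

Contribution: 1) Builds the EmoCap100K dataset, with over 100,000 samples and rich semantic captions; 2) Proposes the EmoCapCLIP framework, which uses global-local contrastive learning combined with cross-modal guided positive mining to fully exploit multi-level caption information.

Method: EmoCapCLIP uses a joint global-local contrastive learning framework with a cross-modal guided positive mining module, capturing both global affective states and local facial behaviors while accommodating the semantic similarity between closely related expressions.

Result: The method performs strongly on over 20 benchmarks covering five tasks, confirming the value of learning emotion representations from semantically rich captions.

Insight: Natural language captions capture emotional information more fully, and combining global and local feature learning substantially improves the generalization and performance of emotion recognition.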

Abstract: Current facial emotion recognition systems are predominately trained to predict a fixed set of predefined categories or abstract dimensional values. This constrained form of supervision hinders generalization and applicability, as it reduces the rich and nuanced spectrum of emotions into oversimplified labels or scales. In contrast, natural language provides a more flexible, expressive, and interpretable way to represent emotions, offering a much broader source of supervision. Yet, leveraging semantically rich natural language captions as supervisory signals for facial emotion representation learning remains relatively underexplored, primarily due to two key challenges: 1) the lack of large-scale caption datasets with rich emotional semantics, and 2) the absence of effective frameworks tailored to harness such rich supervision. To this end, we introduce EmoCap100K, a large-scale facial emotion caption dataset comprising over 100,000 samples, featuring rich and structured semantic descriptions that capture both global affective states and fine-grained local facial behaviors. Building upon this dataset, we further propose EmoCapCLIP, which incorporates a joint global-local contrastive learning framework enhanced by a cross-modal guided positive mining module. This design facilitates the comprehensive exploitation of multi-level caption information while accommodating semantic similarities between closely related expressions. Extensive evaluations on over 20 benchmarks covering five tasks demonstrate the superior performance of our method, highlighting the promise of learning facial emotion representations from large-scale semantically rich captions. The code and data will be available at https://github.com/sunlicai/EmoCapCLIP.
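
The contrastive learning that EmoCapCLIP builds on can be illustrated with the standard CLIP-style symmetric objective, in which each face embedding should match its own caption embedding and no other in the batch. The sketch below shows only that baseline objective on plain Python lists; it does not implement the global-local branches or the cross-modal guided positive mining, and the temperature value is an assumed default.

```python
import math


def _normalize(vec):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def _diagonal_cross_entropy(logits):
    """Mean cross-entropy where row i's correct class is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)  # subtract the max for numerical stability
        log_sum_exp = peak + math.log(sum(math.exp(v - peak) for v in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric image-to-text and text-to-image contrastive loss."""
    images = [_normalize(v) for v in image_emb]
    texts = [_normalize(v) for v in text_emb]
    # Cosine similarity between every image and every caption, scaled by temperature.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (_diagonal_cross_entropy(logits) + _diagonal_cross_entropy(transposed))


faces = [[1.0, 0.0], [0.0, 1.0]]
aligned = contrastive_loss(faces, [[1.0, 0.0], [0.0, 1.0]])
swapped = contrastive_loss(faces, [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)  # True: matched pairs give the lower loss
```

Treating every off-diagonal pair as a negative is exactly what becomes problematic when two captions describe near-identical expressions, which is the situation the paper's positive mining module is meant to handle.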

[160] Deep Learning for Skeleton Based Human Motion Rehabilitation Assessment: A Benchmark

Ali Ismail-Fawaz,Maxime Devanne,Stefano Berretti,Jonathan Weber,Germain Forestier

Main category: cs.CV

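TL;DR: The paper introduces Rehab-Pile, a unified benchmark archive for rehabilitation assessment, together with a standard evaluation framework for deep learning methods, addressing the lack of standardized benchmarks in rehabilitation motion assessment.

Details Motivation: Rehabilitation motion assessment requires detecting subtle deviations in movement, but the field lacks standardized datasets and evaluation protocols, which limits comparability across studies and slows progress.

Contribution: 1) Aggregates existing datasets into Rehab-Pile; 2) proposes a general benchmarking framework for rehabilitation assessment; 3) benchmarks multiple deep learning architectures extensively.

Method: 1) Dataset aggregation; 2) design of the evaluation framework; 3) evaluation of multiple architectures on classification and regression tasks.

Result: All datasets and code are publicly released, laying a solid foundation for research on automated rehabilitation assessment.

Insight: Standardized benchmarks and open data help advance rehabilitation assessment and support the development of personalized rehabilitation solutions.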

Abstract: Automated assessment of human motion plays a vital role in rehabilitation, enabling objective evaluation of patient performance and progress. Unlike general human activity recognition, rehabilitation motion assessment focuses on analyzing the quality of movement within the same action class, requiring the detection of subtle deviations from ideal motion. Recent advances in deep learning and video-based skeleton extraction have opened new possibilities for accessible, scalable motion assessment using affordable devices such as smartphones or webcams. However, the field lacks standardized benchmarks, consistent evaluation protocols, and reproducible methodologies, limiting progress and comparability across studies. In this work, we address these gaps by (i) aggregating existing rehabilitation datasets into a unified archive called Rehab-Pile, (ii) proposing a general benchmarking framework for evaluating deep learning methods in this domain, and (iii) conducting extensive benchmarking of multiple architectures across classification and regression tasks. All datasets and implementations are released to the community to support transparency and reproducibility. This paper aims to establish a solid foundation for future research in automated rehabilitation assessment and foster the development of reliable, accessible, and personalized rehabilitation solutions. The datasets, source-code and results of this article are all publicly available.

[161] GPT-IMAGE-EDIT-1.5M: A Million-Scale, GPT-Generated Image Dataset

Yuhan Wang,Siwei Yang,Bingchen Zhao,Letian Zhang,Qing Liu,Yuyin Zhou,Cihang Xie

Main category: cs.CV

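TL;DR: The paper presents GPT-IMAGE-EDIT-1.5M, a large-scale public image-editing dataset generated with GPT-4o. It contains more than 1.5 million high-quality triplets (instruction, source image, edited image) intended to support open-source research.

Details Motivation: Large multimodal models such as GPT-4o perform well at instruction-guided image editing, but their proprietary nature and closed training data hold back open-source research. The authors build a public large-scale dataset to address this.

Contribution: The main contribution is the release of GPT-IMAGE-EDIT-1.5M, which uses GPT-4o to unify and refine three popular image-editing datasets (OmniEdit, HQ-Edit, and UltraEdit), improving visual quality and instruction alignment.

Method: 1) Regenerate output images to improve visual quality and instruction alignment; 2) selectively rewrite prompts to improve semantic clarity. The dataset is validated by fine-tuning advanced open-source models on it.

Result: The fine-tuned FluxKontext model scores 7.24 on GEdit-EN, 3.80 on ImgEdit-Full, and 8.78 on Complex-Edit, markedly exceeding previously published open-source methods and narrowing the gap to proprietary models.

Insight: Large, high-quality public datasets can substantially advance instruction-guided image editing and are an important resource for closing the performance gap between open-source and proprietary models.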

Abstract: Recent advancements in large multimodal models like GPT-4o have set a new standard for high-fidelity, instruction-guided image editing. However, the proprietary nature of these models and their training data creates a significant barrier for open-source research. To bridge this gap, we introduce GPT-IMAGE-EDIT-1.5M, a publicly available, large-scale image-editing corpus containing more than 1.5 million high-quality triplets (instruction, source image, edited image). We systematically construct this dataset by leveraging the versatile capabilities of GPT-4o to unify and refine three popular image-editing datasets: OmniEdit, HQ-Edit, and UltraEdit. Specifically, our methodology involves 1) regenerating output images to enhance visual quality and instruction alignment, and 2) selectively rewriting prompts to improve semantic clarity. To validate the efficacy of our dataset, we fine-tune advanced open-source models on GPT-IMAGE-EDIT-1.5M. The empirical results are exciting, e.g., the fine-tuned FluxKontext achieves highly competitive performance across a comprehensive suite of benchmarks, including 7.24 on GEdit-EN, 3.80 on ImgEdit-Full, and 8.78 on Complex-Edit, showing stronger instruction following and higher perceptual quality while maintaining identity. These scores markedly exceed all previously published open-source methods and substantially narrow the gap to leading proprietary models. We hope the full release of GPT-IMAGE-EDIT-1.5M can help to catalyze further open research in instruction-guided image editing.

[162] Reconstructing 4D Spatial Intelligence: A Survey

Yukang Cao,Jiahao Lu,Zhisheng Huang,Zhuowei Shen,Chengfeng Zhao,Fangzhou Hong,Zhaoxi Chen,Xin Li,Wenping Wang,Yuan Liu,Ziwei Liu

Main category: cs.CV

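TL;DR: This survey offers a new perspective that organizes methods for reconstructing 4D spatial intelligence into five progressive levels, filling a gap in existing surveys regarding hierarchical structure.

Details Motivation: Rapid advances in 3D representations and deep learning architectures have moved 4D spatial intelligence reconstruction forward quickly, but existing surveys rarely analyze its hierarchical structure comprehensively.

Contribution: Organizes existing methods into five levels and summarizes the key challenges and future directions at each level.

Method: Organizes existing methods with a hierarchical taxonomy, from low-level 3D attributes up to modeling of physical laws and constraints.

Result: Provides a systematic classification framework and identifies future research directions for each level.

Insight: Reconstruction of 4D spatial intelligence progresses from low-level to high-level capabilities; future work should focus on integrating interaction modeling with physical constraints.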

Abstract: Reconstructing 4D spatial intelligence from visual observations has long been a central yet challenging task in computer vision, with broad real-world applications. These range from entertainment domains like movies, where the focus is often on reconstructing fundamental visual elements, to embodied AI, which emphasizes interaction modeling and physical realism. Fueled by rapid advances in 3D representations and deep learning architectures, the field has evolved quickly, outpacing the scope of previous surveys. Additionally, existing surveys rarely offer a comprehensive analysis of the hierarchical structure of 4D scene reconstruction. To address this gap, we present a new perspective that organizes existing methods into five progressive levels of 4D spatial intelligence: (1) Level 1 – reconstruction of low-level 3D attributes (e.g., depth, pose, and point maps); (2) Level 2 – reconstruction of 3D scene components (e.g., objects, humans, structures); (3) Level 3 – reconstruction of 4D dynamic scenes; (4) Level 4 – modeling of interactions among scene components; and (5) Level 5 – incorporation of physical laws and constraints. We conclude the survey by discussing the key challenges at each level and highlighting promising directions for advancing toward even richer levels of 4D spatial intelligence. To track ongoing developments, we maintain an up-to-date project page: https://github.com/yukangcao/Awesome-4D-Spatial-Intelligence.

cs.IR

[163] Improving the Performance of Sequential Recommendation Systems with an Extended Large Language Model

Sinnyum Choi,Woong Kim

Main category: cs.IR

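TL;DR: The paper proposes replacing Llama2 with Llama3 in the LlamaRec framework to improve LLM-based recommendation. Experiments show notable average performance gains on several datasets.

Details Motivation: Competition in AI is intense, and newer LLMs such as Llama3 offer better language understanding and context-based reasoning, yet existing recommendation research has not fully exploited these advances.

Contribution: Proposes improving recommendation performance through a simple model swap (Llama2 to Llama3) that requires no structural changes to the system and is therefore cost-effective.

Method: Within the LlamaRec framework, Llama2 is replaced with Llama3, using identical input data and fixed random seeds to ensure a fair comparison.

Result: Average performance improves by 38.65%, 8.69%, and 8.19% on the ML-100K, Beauty, and Games datasets, respectively, supporting the effectiveness of the approach.

Insight: Upgrading the base model (for example, to Llama3) can improve recommendation performance substantially at low cost, which suits incremental optimization of existing systems.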

Abstract: Recently, competition in the field of artificial intelligence (AI) has intensified among major technological companies, resulting in the continuous release of new large-language models (LLMs) that exhibit improved language understanding and context-based reasoning capabilities. It is expected that these advances will enable more efficient personalized recommendations in LLM-based recommendation systems through improved quality of training data and architectural design. However, many studies have not considered these recent developments. In this study, it was proposed to improve LLM-based recommendation systems by replacing Llama2 with Llama3 in the LlamaRec framework. To ensure a fair comparison, random seed values were set and identical input data was provided during preprocessing and training. The experimental results show average performance improvements of 38.65%, 8.69%, and 8.19% for the ML-100K, Beauty, and Games datasets, respectively, thus confirming the practicality of this method. Notably, the significant improvements achieved by model replacement indicate that the recommendation quality can be improved cost-effectively without the need to make structural changes to the system. Based on these results, it is our contention that the proposed approach is a viable solution for improving the performance of current recommendation systems.

cs.CR

[164] ConSeg: Contextual Backdoor Attack Against Semantic Segmentation

Bilal Hussain Abbasi,Zirui Gong,Yanjun Zhang,Shang Gao,Antonio Robles-Kelly,Leo Zhang

Main category: cs.CR

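TL;DR: The paper proposes ConSeg, a contextual backdoor attack against semantic segmentation models that exploits the contextual relationship between the target class and the victim class to raise the attack success rate.

Details Motivation: Semantic segmentation models may be vulnerable to backdoor attacks, and attacks succeed more easily when the target class co-occurs with the victim class. This observation motivates a more effective attack.

Contribution: Proposes ConSeg, which uses the contextual relationship between target and victim classes to strengthen backdoor attacks. It improves the attack success rate by 15.55% over existing methods and resists existing defenses.

Method: ConSeg mimics the contextual information of the target class and rebuilds it in the victim region to establish the contextual relationship, making the attack easier. Experiments support its effectiveness.

Result: ConSeg improves the attack success rate by 15.55% over existing methods while remaining resilient against state-of-the-art backdoor defenses.

Insight: The contextual relationship between target and victim classes is a key factor in backdoor attacks; exploiting it can markedly improve attack effectiveness.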

Abstract: Despite significant advancements in computer vision, semantic segmentation models may be susceptible to backdoor attacks. These attacks, involving hidden triggers, aim to cause the models to misclassify instances of the victim class as the target class when triggers are present, posing serious threats to the reliability of these models. To further explore the field of backdoor attacks against semantic segmentation, in this paper, we propose a simple yet effective backdoor attack called Contextual Segmentation Backdoor Attack (ConSeg). ConSeg leverages the contextual information inherent in semantic segmentation models to enhance backdoor performance. Our method is motivated by an intriguing observation, i.e., when the target class is set as the "co-occurring" class of the victim class, the victim class can be more easily "mis-segmented". Building upon this insight, ConSeg mimics the contextual information of the target class and rebuilds it in the victim region to establish the contextual relationship between the target class and the victim class, making the attack easier. Our experiments reveal that ConSeg improves the Attack Success Rate (ASR) by 15.55% compared to existing methods, while exhibiting resilience against state-of-the-art backdoor defenses.
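
For readers unfamiliar with the metric, a pixel-level Attack Success Rate can be sketched as follows. The abstract does not define ASR precisely, so this assumes one common convention: the fraction of ground-truth victim-class pixels that the model labels as the target class when the trigger is present. The function name and the nested-list label maps are illustrative.

```python
def attack_success_rate(true_labels, triggered_preds, victim_class, target_class):
    """Fraction of victim-class pixels predicted as the target class.

    true_labels and triggered_preds are 2D label maps (lists of rows)
    of identical shape; triggered_preds comes from a triggered input.
    """
    victim_pixels = 0
    flipped_pixels = 0
    for true_row, pred_row in zip(true_labels, triggered_preds):
        for true_label, pred_label in zip(true_row, pred_row):
            if true_label == victim_class:
                victim_pixels += 1
                if pred_label == target_class:
                    flipped_pixels += 1
    # Images without any victim-class pixels contribute no evidence.
    return flipped_pixels / victim_pixels if victim_pixels else 0.0
```

Under this convention, pixels outside the victim region do not affect the score, so clean-accuracy metrics have to be reported separately.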

cs.GR

[165] ChoreoMuse: Robust Music-to-Dance Video Generation with Style Transfer and Beat-Adherent Motion

Xuanchen Wang,Heng Wang,Weidong Cai

Main category: cs.GR

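TL;DR: ChoreoMuse is a diffusion-based framework that uses SMPL parameters as a bridge between music and dance video generation. It supports style-controllable, high-quality dance video generation and introduces a new music encoder, MotionTune, to capture motion cues from audio.

Details Motivation: Existing dance generation methods struggle to produce high-quality videos that both follow the musical rhythm and adapt to user-defined choreography styles, which limits their use in real creative work.

Contribution: 1. Proposes the ChoreoMuse framework for style-controllable, high-fidelity dance video generation; 2. introduces the MotionTune encoder so that motion aligns with the musical beat; 3. proposes two new metrics that quantify style alignment.

Method: Uses SMPL parameters as an intermediary and generates video with a diffusion model; encodes motion cues from music with MotionTune; applies style transfer to enable diverse generation.

Result: Experiments show that ChoreoMuse achieves state-of-the-art performance in video quality, beat alignment, dance diversity, and style adherence.

Insight: Using SMPL parameters as an intermediary helps overcome video resolution constraints, and aligning music with motion is key to generating high-quality dance videos.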

Abstract: Modern artistic productions increasingly demand automated choreography generation that adapts to diverse musical styles and individual dancer characteristics. Existing approaches often fail to produce high-quality dance videos that harmonize with both musical rhythm and user-defined choreography styles, limiting their applicability in real-world creative contexts. To address this gap, we introduce ChoreoMuse, a diffusion-based framework that uses SMPL format parameters and their variation version as intermediaries between music and video generation, thereby overcoming the usual constraints imposed by video resolution. Critically, ChoreoMuse supports style-controllable, high-fidelity dance video generation across diverse musical genres and individual dancer characteristics, including the flexibility to handle any reference individual at any resolution. Our method employs a novel music encoder MotionTune to capture motion cues from audio, ensuring that the generated choreography closely follows the beat and expressive qualities of the input music. To quantitatively evaluate how well the generated dances match both musical and choreographic styles, we introduce two new metrics that measure alignment with the intended stylistic cues. Extensive experiments confirm that ChoreoMuse achieves state-of-the-art performance across multiple dimensions, including video quality, beat alignment, dance diversity, and style adherence, demonstrating its potential as a robust solution for a wide range of creative applications. Video results can be found on our project page: https://choreomuse.github.io.

physics.ao-ph

[166] A Machine Learning Framework for Predicting Microphysical Properties of Ice Crystals from Cloud Particle Imagery

Joseph Ko,Jerry Harrington,Kara Sulia,Vanessa Przybylo,Marcus van Lier-Walqui,Kara Lamb

Main category: physics.ao-ph

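TL;DR: The paper proposes a machine learning framework that trains on synthetic ice crystal data to predict 3D microphysical properties of ice crystals (effective density, effective surface area, and number of bullets) from 2D images, and evaluates the models' accuracy.

Details Motivation: The microphysical properties of ice crystals strongly affect cloud radiative properties and climate, but properties such as mass and morphology are hard to measure directly. Existing 2D imagery does not directly reveal 3D properties, so an efficient way to infer them from 2D images is needed.

Contribution: 1. Proposes a new machine-learning framework for predicting 3D microphysical properties of ice crystals from 2D images. 2. Trains models on synthetic data and demonstrates high prediction accuracy. 3. Quantifies how much a second (stereo) view improves predictions.

Method: 1. Generate synthetic ice crystals with 3D modeling software, using geometric parameters estimated from the 2021 ICEBall field campaign. 2. Train machine learning models (such as ResNet-18) to predict effective density, effective surface area, and number of bullets from single-view or stereo-view images. 3. Evaluate the models with metrics such as R² and F1 score.

Result: The best single-view models reach R² values of 0.99 for effective density (ρₑ) and 0.98 for effective surface area (Aₑ); for the number of bullets (N_b), balanced accuracy and F1 score are 0.91. A stereo-view ResNet-18 further reduces error, cutting RMSE by 40% for ρₑ and Aₑ relative to a single-view ResNet-18.

Insight: 1. Synthetic data can effectively support training of machine learning models. 2. A second view improves prediction, but the gain differs by property: it is much larger for density and surface area than for the number of bullets. 3. The framework offers a new tool for studying ice crystal microphysics.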

Abstract: The microphysical properties of ice crystals are important because they significantly alter the radiative properties and spatiotemporal distributions of clouds, which in turn strongly affect Earth’s climate. However, it is challenging to measure key properties of ice crystals, such as mass or morphological features. Here, we present a framework for predicting three-dimensional (3D) microphysical properties of ice crystals from in situ two-dimensional (2D) imagery. First, we computationally generate synthetic ice crystals using 3D modeling software along with geometric parameters estimated from the 2021 Ice Cryo-Encapsulation Balloon (ICEBall) field campaign. Then, we use synthetic crystals to train machine learning (ML) models to predict effective density ($\rho_{e}$), effective surface area ($A_e$), and number of bullets ($N_b$) from synthetic rosette imagery. When tested on unseen synthetic images, we find that our ML models can predict microphysical properties with high accuracy. For $\rho_{e}$ and $A_e$, respectively, our best-performing single view models achieved $R^2$ values of 0.99 and 0.98. For $N_b$, our best single view model achieved a balanced accuracy and F1 score of 0.91. We also quantify the marginal prediction improvements from incorporating a second view. A stereo view ResNet-18 model reduced RMSE by 40% for both $\rho_e$ and $A_e$, relative to a single view ResNet-18 model. For $N_b$, we find that a stereo view ResNet-18 model improved the F1 score by 8%. This work provides a novel ML-driven framework for estimating ice microphysical properties from in situ imagery, which will allow for downstream constraints on microphysical parameterizations, such as the mass-size relationship.
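
The evaluation metrics quoted in the abstract ($R^2$, RMSE, balanced accuracy, and a relative RMSE reduction) follow standard definitions, sketched below in plain Python. The function names and any sample values are illustrative and are not taken from the paper.

```python
import math


def rmse(y_true, y_pred):
    """Root-mean-square error between two equal-length sequences."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def balanced_accuracy(y_true, y_pred):
    """Mean per-class recall, robust to class imbalance."""
    recalls = []
    for cls in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == cls]
        recalls.append(sum(y_pred[i] == cls for i in idx) / len(idx))
    return sum(recalls) / len(recalls)


def relative_reduction(baseline, improved):
    """Fractional reduction of an error metric relative to a baseline."""
    return (baseline - improved) / baseline
```

With these definitions, a stereo-view RMSE that is 60% of the single-view RMSE corresponds to the 40% reduction reported for $\rho_e$ and $A_e$.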

q-fin.TR

[167] MountainLion: A Multi-Modal LLM-Based Agent System for Interpretable and Adaptive Financial Trading

Siyi Wu,Zhaoyang Guan,Leyi Zhao,Xinyuan Song,Xinyu Ying,Hanlin Zhang,Michele Pak,Yangfan He,Yi Xin,Jianhui Wang,Tianyu Shi

Main category: q-fin.TR

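TL;DR: MountainLion is a financial trading agent system built on multi-modal large language models (LLMs). It integrates textual news, candlestick charts, and trading signal charts to generate high-quality financial reports and investment strategies, and supports user interaction and real-time adjustment.

Details Motivation: Cryptocurrency trading requires integrating heterogeneous multi-modal data, but traditional deep learning and reinforcement learning approaches need large training sets and lack interpretability. MountainLion aims to address these problems with an LLM agent system.

Contribution: 1. Proposes MountainLion, a multi-modal, multi-agent financial trading system; 2. integrates text, charts, and other modalities to generate interpretable investment strategies; 3. refines decisions dynamically through user interaction and a reflection module.

Method: 1. Use LLMs to process multi-modal data (textual news, candlestick charts, trading signal charts); 2. generate financial reports and strategies through multi-agent coordination; 3. add a reflection module that analyzes historical signals and outcomes to refine decisions.

Result: Empirical results indicate that MountainLion combines technical price triggers with macroeconomic and capital-flow signals, improving returns and investor confidence and providing a more interpretable and robust investment framework.

Insight: Multi-modal LLM agent systems show promise for financial trading: they can integrate complex data and improve interpretability and practicality through dynamic adjustment and user interaction.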

Abstract: Cryptocurrency trading is a challenging task requiring the integration of heterogeneous data from multiple modalities. Traditional deep learning and reinforcement learning approaches typically demand large training datasets and encode diverse inputs into numerical representations, often at the cost of interpretability. Recent progress in large language model (LLM)-based agents has demonstrated the capacity to process multi-modal data and support complex investment decision-making. Building on these advances, we present MountainLion, a multi-modal, multi-agent system for financial trading that coordinates specialized LLM-based agents to interpret financial data and generate investment strategies. MountainLion processes textual news, candlestick charts, and trading signal charts to produce high-quality financial reports, while also enabling modification of reports and investment recommendations through data-driven user interaction and question answering. A central reflection module analyzes historical trading signals and outcomes to continuously refine decision processes, and the system is capable of real-time report analysis, summarization, and dynamic adjustment of investment strategies. Empirical results confirm that MountainLion systematically enriches technical price triggers with contextual macroeconomic and capital flow signals, providing a more interpretable, robust, and actionable investment framework that improves returns and strengthens investor confidence.

cs.HC

[168] RISEE: A Highly Interactive Naturalistic Driving Trajectories Dataset with Human Subjective Risk Perception and Eye-tracking Information

Xinzheng Wu,Junyi Chen,Peiyi Wang,Shunxiang Chen,Yong Shen

Main category: cs.HC

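TL;DR: The RISEE dataset combines human subjective risk perception with eye-tracking data, addressing the absence of human factors in existing driving datasets. It draws on the complementary strengths of drone-based and simulation-based data collection to provide validation data for autonomous driving decision-making and planning systems that better matches human cognition.

Details Motivation: Existing driving datasets focus mainly on vehicle motion states and trajectories and lack human subjective risk perception and eye-tracking data. In addition, naturalistic driving datasets lack safety-critical scenarios, while simulated datasets lack authenticity.

Contribution: Builds the RISEE dataset, which adds human subjective risk ratings and eye-tracking data to naturalistic driving trajectories, combining drone-based and simulation-based collection to obtain scenario data that is both realistic and safe to gather.

Method: Traffic video is recorded by drone at a highway ramp merging area. Highly interactive scenarios are selected manually and reconstructed in simulation software, and driver first-person-view videos are generated for participants to rate while their eye-tracking data is collected.

Result: 3567 valid subjective risk ratings from 101 participants and 2045 qualified eye-tracking data segments are retained, covering 179 scenarios. The dataset is publicly available.

Insight: RISEE provides a more complete validation resource for decision-making and planning in autonomous driving and underlines the importance of human factors in autonomous driving research.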

Abstract: In the research and development (R&D) and verification and validation (V&V) phases of autonomous driving decision-making and planning systems, it is necessary to integrate human factors to achieve decision-making and evaluation that align with human cognition. However, most existing datasets primarily focus on vehicle motion states and trajectories, neglecting human-related information. In addition, current naturalistic driving datasets lack sufficient safety-critical scenarios while simulated datasets suffer from low authenticity. To address these issues, this paper constructs the Risk-Informed Subjective Evaluation and Eye-tracking (RISEE) dataset, which contains human subjective evaluations and eye-tracking data in addition to regular naturalistic driving trajectories. By leveraging the complementary advantages of drone-based (high realism and extensive scenario coverage) and simulation-based (high safety and reproducibility) data collection methods, we first conduct drone-based traffic video recording at a highway ramp merging area. After that, the manually selected highly interactive scenarios are reconstructed in simulation software, and drivers’ first-person view (FPV) videos are generated, which are then viewed and evaluated by recruited participants. During the video viewing process, participants’ eye-tracking data is collected. After data processing and filtering, 3567 valid subjective risk ratings from 101 participants across 179 scenarios are retained, along with 2045 qualified eye-tracking data segments. The collected data and examples of the generated FPV videos are available on our website.

[169] ChartGen: Scaling Chart Understanding Via Code-Guided Synthetic Chart Generation

Jovana Kondic,Pengyuan Li,Dhiraj Joshi,Zexue He,Shafiq Abedin,Jennifer Sun,Ben Wiesel,Eli Schwartz,Ahmed Nassar,Bo Wu,Assaf Arbelle,Aude Oliva,Dan Gutfreund,Leonid Karlinsky,Rogerio Feris

Main category: cs.HC

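TL;DR: ChartGen is an automated pipeline for code-guided synthetic chart generation. It targets the chart-to-code reconstruction task and is used to build a large-scale multimodal dataset.

Details Motivation: Existing benchmarks focus mostly on chart question answering or summarization and offer little coverage of chart-to-code reconstruction, so an automated generation method is needed to fill this gap.

Contribution: 1. Proposes ChartGen, a fully automated pipeline for code-guided synthetic chart generation; 2. builds a dataset of 222.5K chart image-code pairs covering many chart types and data modalities; 3. provides an evaluation subset of 4.3K pairs and evaluates several open-weight VLMs on it.

Method: 1. Prompt a vision-language model (VLM) to reconstruct each seed chart image into a Python script; 2. iteratively augment the script with a code-oriented LLM; 3. generate a large-scale dataset covering many chart types and data modalities.

Result: The synthetic dataset covers 27 chart types and 11 plotting libraries. Evaluation shows that current VLMs have substantial room for improvement on chart-to-code reconstruction.

Insight: ChartGen is a useful resource for research on chart understanding and vision-conditioned code generation, and it highlights the limitations of current models on this task.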

Abstract: Chart-to-code reconstruction, the task of recovering executable plotting scripts from chart images, provides important insights into a model’s ability to ground data visualizations in precise, machine-readable form. Yet many existing multimodal benchmarks focus primarily on answering questions about charts or summarizing them. To bridge this gap, we present ChartGen, a fully-automated pipeline for code-guided synthetic chart generation. Starting from seed chart images, ChartGen (i) prompts a vision-language model (VLM) to reconstruct each image into a Python script, and (ii) iteratively augments that script with a code-oriented large language model (LLM). Using ChartGen, we create 222.5K unique chart-image code pairs from 13K seed chart images, and present an open-source synthetic chart dataset covering 27 chart types, 11 plotting libraries, and multiple data modalities (image, code, text, CSV, DocTags). From this corpus, we curate a held-out chart-to-code evaluation subset of 4.3K chart image-code pairs, and evaluate six open-weight VLMs (3B - 26B parameters), highlighting substantial room for progress. We release the pipeline, prompts, and the dataset to help accelerate efforts towards robust chart understanding and vision-conditioned code generation: https://github.com/SD122025/ChartGen/

cs.CY

[170] The Carbon Cost of Conversation, Sustainability in the Age of Language Models

Sayed Mahbub Hasan Amiri,Prasun Goswami,Md. Mainul Islam,Mohammad Shakhawat Hossen,Sayed Majhab Hasan Amiri,Naznin Akter

Main category: cs.CY

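TL;DR: The paper examines the environmental costs of large language models (LLMs), including carbon emissions, water consumption, and e-waste, and proposes technical, policy, and cultural measures toward sustainable NLP.

Details Motivation: The success of LLMs in natural language processing comes with a high environmental cost that is largely overlooked. The paper aims to quantify these costs and call on the industry to act.

Contribution: Quantifies the environmental impact of LLMs (such as carbon emissions and water use) and outlines paths to sustainability along technical, policy, and cultural lines.

Method: Quantifies environmental costs through case studies (such as GPT-4 and Mistral 7B) and industry analysis (Google, Microsoft, Amazon), and proposes remedies such as model pruning, carbon taxes, and mandatory emissions reporting.

Result: Training a single LLM can emit carbon dioxide equivalent to the annual emissions of hundreds of cars, and data centre cooling worsens water scarcity in vulnerable regions.

Insight: Technical innovation must take environmental sustainability into account; global cooperation and policy reform are key to reducing the ecological impact of AI.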

Abstract: Large language models (LLMs) like GPT-3 and BERT have revolutionized natural language processing (NLP), yet their environmental costs remain dangerously overlooked. This article critiques the sustainability of LLMs, quantifying their carbon footprint, water usage, and contribution to e-waste through case studies of models such as GPT-4 and energy-efficient alternatives like Mistral 7B. Training a single LLM can emit carbon dioxide equivalent to hundreds of cars driven annually, while data centre cooling exacerbates water scarcity in vulnerable regions. Systemic challenges (corporate greenwashing, redundant model development, and regulatory voids) perpetuate harm, disproportionately burdening marginalized communities in the Global South. However, pathways exist for sustainable NLP: technical innovations (e.g., model pruning, quantum computing), policy reforms (carbon taxes, mandatory emissions reporting), and cultural shifts prioritizing necessity over novelty. By analysing industry leaders (Google, Microsoft) and laggards (Amazon), this work underscores the urgency of ethical accountability and global cooperation. Without immediate action, AI's ecological toll risks outpacing its societal benefits. The article concludes with a call to align technological progress with planetary boundaries, advocating for equitable, transparent, and regenerative AI systems that prioritize both human and environmental well-being.

[171] Rainbow Noise: Stress-Testing Multimodal Harmful-Meme Detectors on LGBTQ Content

Ran Tong,Songtao Wei,Jiaqi Liu,Lanruo Wang

Main category: cs.CY

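TL;DR: The paper presents a robustness benchmark for multimodal harmful-meme detectors on LGBTQ+ content. It tests two state-of-the-art detectors under a range of image and text attacks and proposes a lightweight Text Denoising Adapter (TDA) to improve robustness.

Details Motivation: LGBTQ+ communities are frequent targets of hateful memes, which evade detection through edits to the image or the text. The paper builds a robustness benchmark to expose weaknesses in current multimodal safety models and to propose improvements.

Contribution: 1) Builds the first robustness benchmark for multimodal harmful-meme detectors on LGBTQ+ content; 2) proposes the lightweight Text Denoising Adapter (TDA), which markedly improves robustness; 3) shows how pre-training data and architectural choices affect robustness.

Method: 1) Test models on all combinations of four caption attacks and three image corruptions; 2) add the TDA module to improve MemeBLIP2's text processing; 3) use ablations to analyze how much each model relies on text versus image.

Result: 1) MemeCLIP degrades more gently under attack, while MemeBLIP2 is sensitive to caption edits; 2) with TDA, MemeBLIP2 becomes the most robust model overall; 3) all models lean heavily on text.

Insight: 1) Lightweight adapters can markedly improve robustness; 2) pre-training data and architecture design significantly affect robustness; 3) multimodal models still need improvement to withstand complex attacks.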

Abstract: Hateful memes aimed at LGBTQ+ communities often evade detection by tweaking either the caption, the image, or both. We build the first robustness benchmark for this setting, pairing four realistic caption attacks with three canonical image corruptions and testing all combinations on the PrideMM dataset. Two state-of-the-art detectors, MemeCLIP and MemeBLIP2, serve as case studies, and we introduce a lightweight Text Denoising Adapter (TDA) to enhance the latter’s resilience. Across the grid, MemeCLIP degrades more gently, while MemeBLIP2 is particularly sensitive to the caption edits that disrupt its language processing. However, the addition of the TDA not only remedies this weakness but makes MemeBLIP2 the most robust model overall. Ablations reveal that all systems lean heavily on text, but architectural choices and pre-training data significantly impact robustness. Our benchmark exposes where current multimodal safety models crack and demonstrates that targeted, lightweight modules like the TDA offer a powerful path towards stronger defences.

q-bio.QM

[172] Review of Deep Learning Applications to Structural Proteomics Enabled by Cryogenic Electron Microscopy and Tomography

Brady K. Zhou,Jason J. Hu,Jane K. J. Lee,Z. Hong Zhou,Demetri Terzopoulos

Main category: q-bio.QM

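TL;DR: This review surveys the broad use of deep learning in cryogenic electron microscopy (cryoEM) and tomography (cryoET), where it addresses low signal-to-noise ratios, preferred orientation artifacts, and missing-wedge problems and improves the efficiency of structural proteomics workflows.

Details Motivation: Advances in cryoEM have given structural biology a large volume of high-resolution data, but traditional methods handle noise, preferred orientation, and missing data inefficiently. Deep learning offers automated solutions to these problems.

Contribution: Summarizes key applications of deep learning across the cryoEM and cryoET pipeline, including automated particle picking, preferred orientation correction, denoising, and atomic model building, and shows how AI improves the efficiency and accuracy of structure determination.

Method: The reviewed tools include convolutional neural networks for automated particle picking (such as Topaz and crYOLO), U-Net architectures for missing-wedge correction and denoising (such as IsoNet), and AI-based atomic model building tools (such as ModelAngelo and DeepTracer).

Result: AI methods have achieved near-atomic-resolution reconstructions, resolved datasets that traditional methods could not handle (such as those with severe orientation bias), and been applied successfully to complex systems such as HIV virus-like particles and in situ ribosomal complexes.

Insight: Further progress in deep learning, particularly large language models and vision transformers, may bring broader automation and accessibility to structural biology and deepen understanding of macromolecular structure and function.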

Abstract: The past decade’s “cryoEM revolution” has produced exponential growth in high-resolution structural data through advances in cryogenic electron microscopy (cryoEM) and tomography (cryoET). Deep learning integration into structural proteomics workflows addresses longstanding challenges including low signal-to-noise ratios, preferred orientation artifacts, and missing-wedge problems that historically limited efficiency and scalability. This review examines AI applications across the entire cryoEM pipeline, from automated particle picking using convolutional neural networks (Topaz, crYOLO, CryoSegNet) to computational solutions for preferred orientation bias (spIsoNet, cryoPROS) and advanced denoising algorithms (Topaz-Denoise). In cryoET, tools like IsoNet employ U-Net architectures for simultaneous missing-wedge correction and noise reduction, while TomoNet streamlines subtomogram averaging through AI-driven particle detection. The workflow culminates with automated atomic model building using sophisticated tools like ModelAngelo, DeepTracer, and CryoREAD that translate density maps into interpretable biological structures. These AI-enhanced approaches have achieved near-atomic resolution reconstructions with minimal manual intervention, resolved previously intractable datasets suffering from severe orientation bias, and enabled successful application to diverse biological systems from HIV virus-like particles to in situ ribosomal complexes. As deep learning evolves, particularly with large language models and vision transformers, the future promises sophisticated automation and accessibility in structural biology, potentially revolutionizing our understanding of macromolecular architecture and function.

cs.AI

[173] Leveraging Fine-Tuned Large Language Models for Interpretable Pancreatic Cystic Lesion Feature Extraction and Risk Categorization

Ebrahim Rasromani,Stella K. Kang,Yanqi Xu,Beisong Liu,Garvit Luhadia,Wan Fung Chui,Felicia L. Pasadyn,Yu Chih Hung,Julie Y. An,Edwin Mathieu,Zehui Gu,Carlos Fernandez-Granda,Ammar A. Javed,Greg D. Sacks,Tamas Gonda,Chenchan Huang,Yiqiu Shen

Main category: cs.AI

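TL;DR: The paper fine-tunes open-source large language models (LLMs) with chain-of-thought (CoT) supervision to automatically extract pancreatic cystic lesion (PCL) features from MRI/CT reports and assign risk categories, reaching performance comparable to GPT-4o.

Details Motivation: Manual extraction of PCL features is time-consuming and labor-intensive, which limits large-scale studies. An automated method is needed to improve efficiency and accuracy.

Contribution: Proposes an approach based on fine-tuned LLMs with CoT supervision that automatically extracts PCL features and assigns risk categories, with performance close to GPT-4o.

Method: LLaMA and DeepSeek models are fine-tuned with QLoRA on CoT data generated by GPT-4o. Features are mapped to risk categories following an institutional guideline based on the 2017 ACR White Paper.

Result: After fine-tuning, feature extraction accuracy rises from 80% to 97% for LLaMA and from 79% to 98% for DeepSeek, and risk categorization F1 scores reach 0.95 and 0.94. Agreement with radiologists is high.

Insight: Open-source LLMs fine-tuned with CoT supervision can provide efficient, interpretable feature extraction in medicine and reduce reliance on commercial models such as GPT-4o.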

Abstract: Background: Manual extraction of pancreatic cystic lesion (PCL) features from radiology reports is labor-intensive, limiting large-scale studies needed to advance PCL research. Purpose: To develop and evaluate large language models (LLMs) that automatically extract PCL features from MRI/CT reports and assign risk categories based on guidelines. Materials and Methods: We curated a training dataset of 6,000 abdominal MRI/CT reports (2005-2024) from 5,134 patients that described PCLs. Labels were generated by GPT-4o using chain-of-thought (CoT) prompting to extract PCL and main pancreatic duct features. Two open-source LLMs were fine-tuned using QLoRA on GPT-4o-generated CoT data. Features were mapped to risk categories per institutional guideline based on the 2017 ACR White Paper. Evaluation was performed on 285 held-out human-annotated reports. Model outputs for 100 cases were independently reviewed by three radiologists. Feature extraction was evaluated using exact match accuracy, risk categorization with macro-averaged F1 score, and radiologist-model agreement with Fleiss’ Kappa. Results: CoT fine-tuning improved feature extraction accuracy for LLaMA (80% to 97%) and DeepSeek (79% to 98%), matching GPT-4o (97%). Risk categorization F1 scores also improved (LLaMA: 0.95; DeepSeek: 0.94), closely matching GPT-4o (0.97), with no statistically significant differences. Radiologist inter-reader agreement was high (Fleiss’ Kappa = 0.888) and showed no statistically significant difference with the addition of DeepSeek-FT-CoT (Fleiss’ Kappa = 0.893) or GPT-CoT (Fleiss’ Kappa = 0.897), indicating that both models achieved agreement levels on par with radiologists. Conclusion: Fine-tuned open-source LLMs with CoT supervision enable accurate, interpretable, and efficient phenotyping for large-scale PCL research, achieving performance comparable to GPT-4o.
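
The agreement statistic reported in the abstract, Fleiss' Kappa, can be computed from a table of per-item category counts. The sketch below follows the standard textbook definition and assumes every item is rated by the same number of raters; the function name and the example counts are illustrative and do not reproduce the study's data.

```python
def fleiss_kappa(counts):
    """Fleiss' Kappa for a table of rating counts.

    counts[i][j] is the number of raters who assigned item i to
    category j. Every row must sum to the same number of raters.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_categories = len(counts[0])

    # Per-item agreement: share of rater pairs that agree on the item.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    observed = sum(per_item) / n_items

    # Chance agreement from the overall category proportions.
    proportions = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    expected = sum(p * p for p in proportions)

    return (observed - expected) / (1.0 - expected)
```

To test whether adding a model changes agreement, as the study does, the model can be treated as one more rater and the statistic recomputed on the enlarged table.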

[174] PITA: Preference-Guided Inference-Time Alignment for LLM Post-Training

Sarat Chandra Bobbili,Ujwal Dinesha,Dheeraj Narasimha,Srinivas Shakkottai

Main category: cs.AI

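TL;DR: PITA is a new inference-time alignment framework that guides LLM generation directly from preference feedback, without a pre-trained reward model.

Details Motivation: Existing inference-time alignment methods depend on a pre-trained reward model, and fitting a reward model to human preferences can be unstable. PITA aims to remove this dependency.

Contribution: Proposes the PITA framework, which learns a small preference-based guidance policy that modifies LLM generation at inference time, without fine-tuning the LLM or relying on a pre-trained reward model.

Method: Learns a small preference-based guidance policy through stochastic search and iterative refinement, and uses it to adjust the LLM's token probability distribution.

Result: On tasks including mathematical reasoning and sentiment classification, PITA effectively aligns LLM outputs with user preferences.

Insight: PITA shows that generation can be guided directly by preference feedback, avoiding the instability associated with conventional reward models.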

Abstract: Inference-time alignment enables large language models (LLMs) to generate outputs aligned with end-user preferences without further training. Recent post-training methods achieve this by using small guidance models to modify token generation during inference. These methods typically optimize a reward function KL-regularized by the original LLM taken as the reference policy. A critical limitation, however, is their dependence on a pre-trained reward model, which requires fitting to human preference feedback–a potentially unstable process. In contrast, we introduce PITA, a novel framework that integrates preference feedback directly into the LLM’s token generation, eliminating the need for a reward model. PITA learns a small preference-based guidance policy to modify token probabilities at inference time without LLM fine-tuning, reducing computational cost and bypassing the pre-trained reward model dependency. The problem is framed as identifying an underlying preference distribution, solved through stochastic search and iterative refinement of the preference-based guidance model. We evaluate PITA across diverse tasks, including mathematical reasoning and sentiment classification, demonstrating its effectiveness in aligning LLM outputs with user preferences.
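
The KL-regularized objective mentioned in the abstract has a well-known closed-form optimum: the reference distribution reweighted by exponentiated scores, π(a) ∝ π_ref(a)·exp(s(a)/β). The sketch below applies that reweighting to a single next-token distribution. It is not PITA's algorithm: `guidance_scores` is a hypothetical stand-in for whatever a guidance model outputs, and `beta` is an assumed regularization strength.

```python
import math


def guided_distribution(ref_probs, guidance_scores, beta=1.0):
    """Reweight reference token probabilities by exponentiated scores.

    ref_probs maps token -> probability under the frozen reference LLM.
    guidance_scores maps token -> score (missing tokens score 0.0).
    Larger beta keeps the result closer to the reference distribution.
    """
    logits = {
        tok: math.log(p) + guidance_scores.get(tok, 0.0) / beta
        for tok, p in ref_probs.items()
        if p > 0.0
    }
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(logits.values())
    weights = {tok: math.exp(v - top) for tok, v in logits.items()}
    total = sum(weights.values())
    return {tok: w / total for tok, w in weights.items()}
```

Because only the token probabilities are modified, the reference model's weights stay frozen, which matches the abstract's description of guidance without LLM fine-tuning.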

[175] The Policy Cliff: A Theoretical Analysis of Reward-Policy Maps in Large Language Models

Xingcheng Xu

Main category: cs.AI

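TL;DR: The paper presents a theoretical framework for analyzing the stability of the mapping from reward functions to optimal policies in large language and reasoning models (LLMs/LRMs), and explains where policy brittleness comes from.

Details Motivation: Existing reinforcement learning (RL) methods often produce brittle, unstable policies in LLMs/LRMs, leading to failures such as spurious reasoning, deceptive alignment, and instruction disobedience. These failures lack a unified theoretical explanation.

Contribution: Proposes a mathematical framework that traces policy brittleness to non-unique optimal actions (action degeneracy), extends the analysis to multi-reward RL, and explains the role of entropy regularization.

Method: Analyzes the stability of the reward-policy map theoretically, in particular how non-unique optimal actions affect the policy, proves the effect of entropy regularization, and validates the framework with perturbation experiments in multi-reward RL.

Result: Policy brittleness arises when incomplete or noisy rewards are optimized, especially in the presence of action degeneracy. Entropy regularization restores stability at the cost of increased stochasticity.

Insight: Policy-stability problems admit a unified theoretical explanation, which provides a foundation for designing safer AI systems.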

Abstract: Reinforcement learning (RL) plays a crucial role in shaping the behavior of large language and reasoning models (LLMs/LRMs). However, it often produces brittle and unstable policies, leading to critical failures such as spurious reasoning, deceptive alignment, and instruction disobedience that undermine the trustworthiness and safety of LLMs/LRMs. Currently, these issues lack a unified theoretical explanation and are typically addressed using ad-hoc heuristics. This paper presents a rigorous mathematical framework for analyzing the stability of the mapping from a reward function to the optimal policy. We show that policy brittleness often stems from non-unique optimal actions, a common occurrence when multiple valid traces exist in a reasoning task. This theoretical lens provides a unified explanation for a range of seemingly disparate failures, reframing them as rational outcomes of optimizing rewards that may be incomplete or noisy, especially in the presence of action degeneracy. We extend this analysis from the fundamental single-reward setting to the more realistic multi-reward RL across diverse domains, showing how stability is governed by an “effective reward” aggregation mechanism. We also prove that entropy regularization restores policy stability at the cost of increased stochasticity. Our framework provides a unified explanation for recent empirical findings on deceptive reasoning, instruction-following trade-offs, and RLHF-induced sophistry, and is further validated through perturbation experiments in multi-reward RL. This work advances policy-stability analysis from empirical heuristics towards a principled theory, offering essential insights for designing safer and more trustworthy AI systems.
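
The claim that non-unique optimal actions make the reward-policy map unstable, and that entropy regularization restores stability, can be illustrated on a one-step problem with two near-tied actions. The sketch below is a toy illustration under stated assumptions, not the paper's framework: the reward values, the temperature of 0.5, and all function names are hypothetical. It relies on the standard fact that the entropy-regularized optimum of a one-step problem is a softmax over rewards.

```python
import math

def greedy_policy(rewards):
    """Deterministic optimum: all mass on the first highest-reward action."""
    best = max(range(len(rewards)), key=lambda i: rewards[i])
    return [1.0 if i == best else 0.0 for i in range(len(rewards))]

def softmax_policy(rewards, temperature):
    """Entropy-regularized optimum: pi(a) proportional to exp(r(a) / temperature)."""
    top = max(rewards)
    weights = [math.exp((r - top) / temperature) for r in rewards]
    total = sum(weights)
    return [w / total for w in weights]

def l1_distance(p, q):
    """Sum of absolute differences between two distributions."""
    return sum(abs(a - b) for a, b in zip(p, q))

# Two near-tied actions; a reward perturbation of 0.002 swaps their order.
r_before = [1.000, 0.999]
r_after = [1.000, 1.001]

# The greedy policy jumps from one action to the other.
greedy_shift = l1_distance(greedy_policy(r_before), greedy_policy(r_after))
# The softmax policy moves only slightly, but it is stochastic in both cases.
soft_shift = l1_distance(softmax_policy(r_before, 0.5),
                         softmax_policy(r_after, 0.5))
```

Here `greedy_shift` is the maximum possible distance between two distributions, while `soft_shift` stays on the order of the perturbation itself; the price is that the softmax policy never commits to a single action.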

[176] Security Challenges in AI Agent Deployment: Insights from a Large Scale Public Competition

Andy Zou,Maxwell Lin,Eliot Jones,Micha Nowak,Mateusz Dziemian,Nick Winter,Alexander Grattan,Valent Nathanael,Ayla Croft,Xander Davies,Jai Patel,Robert Kirk,Nate Burnikell,Yarin Gal,Dan Hendrycks,J. Zico Kolter,Matt Fredrikson

Main category: cs.AI

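TL;DR: Through a large-scale public red-teaming competition, the paper exposes security vulnerabilities of AI agents in realistic deployments. Participants elicited a large number of policy violations, and the authors built the Agent Red Teaming (ART) benchmark to support safer agent deployment.

Details Motivation: To assess whether LLM-powered AI agents follow deployment policies in realistic environments, especially under attack, and to expose the security problems of current agents.

Contribution: 1) Ran the largest public red-teaming competition to date; 2) built the ART benchmark, a curated set of high-impact attacks; 3) found limited correlation between agent robustness and model size, capability, or inference-time compute.

Method: 1) Collect attack data through a large-scale red-teaming competition; 2) analyze attack success and the resulting policy violations; 3) build the ART benchmark and evaluate 19 state-of-the-art models on it.

Result: Nearly all agents exhibit policy violations for most behaviors within 10-100 queries, and attacks transfer well across models and tasks. Robustness correlates only weakly with model size or capability.

Insight: Current AI agents have persistent security vulnerabilities. Scaling model size, capability, or inference-time compute is not enough to withstand adversarial attacks, so additional defenses are needed.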

Abstract: Recent advances have enabled LLM-powered AI agents to autonomously execute complex tasks by combining language model reasoning with tools, memory, and web access. But can these systems be trusted to follow deployment policies in realistic environments, especially under attack? To investigate, we ran the largest public red-teaming competition to date, targeting 22 frontier AI agents across 44 realistic deployment scenarios. Participants submitted 1.8 million prompt-injection attacks, with over 60,000 successfully eliciting policy violations such as unauthorized data access, illicit financial actions, and regulatory noncompliance. We use these results to build the Agent Red Teaming (ART) benchmark - a curated set of high-impact attacks - and evaluate it across 19 state-of-the-art models. Nearly all agents exhibit policy violations for most behaviors within 10-100 queries, with high attack transferability across models and tasks. Importantly, we find limited correlation between agent robustness and model size, capability, or inference-time compute, suggesting that additional defenses are needed against adversarial misuse. Our findings highlight critical and persistent vulnerabilities in today’s AI agents. By releasing the ART benchmark and accompanying evaluation framework, we aim to support more rigorous security assessment and drive progress toward safer agent deployment.

[177] A Multi-Agent System for Information Extraction from the Chemical Literature

Yufan Chen,Ching Ting Leung,Bowen Yu,Jianwei Sun,Yong Huang,Linyan Li,Hao Chen,Hanyu Gao

Main category: cs.AI

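TL;DR: The paper proposes a multi-agent system based on a multimodal large language model (MLLM) for automatically extracting information from the chemical literature, substantially outperforming existing methods.

Details Motivation: Chemical research relies on high-quality databases, but information in the chemical literature is multimodal and varies in style, which makes automatic extraction difficult. Accelerating AI-driven chemical research requires a system that can extract complex chemical information efficiently.

Contribution: 1) Developed an MLLM-based multi-agent system for chemical information extraction; 2) improved performance substantially through task decomposition and agent coordination; 3) performed strongly on several sub-tasks, including molecular image recognition and reaction image parsing.

Method: Uses the MLLM's reasoning capability to understand the structure of complex chemical graphics, decompose the extraction task into sub-tasks, and coordinate a set of specialized agents to solve them.

Result: On a benchmark dataset of complex chemical reaction graphics, the system reaches an F1 score of 80.8%, far above the previous best model (F1 score 35.6%), and shows consistent improvements on several sub-tasks.

Insight: Combining MLLM reasoning with multi-agent collaboration can markedly improve automated extraction of multimodal information. The approach may also inform structured-data generation in other fields.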

Abstract: To fully expedite AI-powered chemical research, high-quality chemical databases are the cornerstone. Automatic extraction of chemical information from the literature is essential for constructing reaction databases, but it is currently limited by the multimodality and style variability of chemical information. In this work, we developed a multimodal large language model (MLLM)-based multi-agent system for automatic chemical information extraction. We used the MLLM’s strong reasoning capability to understand the structure of complex chemical graphics, decompose the extraction task into sub-tasks and coordinate a set of specialized agents to solve them. Our system achieved an F1 score of 80.8% on a benchmark dataset of complex chemical reaction graphics from the literature, surpassing the previous state-of-the-art model (F1 score: 35.6%) by a significant margin. Additionally, it demonstrated consistent improvements in key sub-tasks, including molecular image recognition, reaction image parsing, named entity recognition and text-based reaction extraction. This work is a critical step toward automated chemical information extraction into structured datasets, which will be a strong promoter of AI-driven chemical research.

[178] Complementarity-driven Representation Learning for Multi-modal Knowledge Graph Completion

Lijian Li

Main category: cs.AI

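TL;DR: The paper proposes MoCME, a framework that addresses modality imbalance in multi-modal knowledge graph completion through complementarity-guided modality knowledge fusion and entropy-guided negative sampling, and reports state-of-the-art performance.

Details Motivation: Modality distributions in multi-modal knowledge graphs (MKGs) vary across entities, and existing methods do not fully exploit the complementarity of multi-modal data, which degrades entity representations. The author therefore proposes a method that makes better use of this complementarity.

Contribution: Proposes the MoCME framework, comprising a Complementarity-guided Modality Knowledge Fusion (CMKF) module and an Entropy-guided Negative Sampling (EGNS) mechanism, which together improve multi-modal knowledge graph completion.

Method: The CMKF module fuses multi-view and multi-modal embeddings using intra-modal and inter-modal complementarity. EGNS dynamically prioritizes informative and uncertain negative samples.

Result: Experiments on five benchmark datasets show that MoCME outperforms existing methods and achieves state-of-the-art performance.

Insight: Complementarity across modalities matters for knowledge graph completion, and dynamic negative sampling can improve model training.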

Abstract: Multi-modal Knowledge Graph Completion (MMKGC) aims to uncover hidden world knowledge in multimodal knowledge graphs by leveraging both multimodal and structural entity information. However, the inherent imbalance in multimodal knowledge graphs, where modality distributions vary across entities, poses challenges in utilizing additional modality data for robust entity representation. Existing MMKGC methods typically rely on attention or gate-based fusion mechanisms but overlook complementarity contained in multi-modal data. In this paper, we propose a novel framework named Mixture of Complementary Modality Experts (MoCME), which consists of a Complementarity-guided Modality Knowledge Fusion (CMKF) module and an Entropy-guided Negative Sampling (EGNS) mechanism. The CMKF module exploits both intra-modal and inter-modal complementarity to fuse multi-view and multi-modal embeddings, enhancing representations of entities. Additionally, we introduce an Entropy-guided Negative Sampling mechanism to dynamically prioritize informative and uncertain negative samples to enhance training effectiveness and model robustness. Extensive experiments on five benchmark datasets demonstrate that our MoCME achieves state-of-the-art performance, surpassing existing approaches.
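
The abstract says EGNS prioritizes informative and uncertain negative samples but does not define the entropy it uses. As one possible reading, the sketch below weights each candidate negative by the binary entropy of its predicted plausibility, so negatives the model is unsure about are drawn more often. Everything here is an assumption made for illustration: the use of a sigmoid over raw scores, the function names, and the example scores.

```python
import math

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def binary_entropy(p):
    """Entropy (in nats) of a Bernoulli variable with success probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))

def entropy_sampling_weights(negative_scores):
    """Sampling weights proportional to the entropy of sigmoid(score).

    Falls back to uniform weights when every entropy is zero.
    """
    entropies = [binary_entropy(sigmoid(s)) for s in negative_scores]
    total = sum(entropies)
    if total == 0.0:
        return [1.0 / len(negative_scores)] * len(negative_scores)
    return [e / total for e in entropies]

# Hypothetical model scores for three candidate negatives:
# clearly implausible, uncertain, and clearly plausible.
weights = entropy_sampling_weights([-6.0, 0.0, 6.0])
```

Under this reading the middle candidate, whose plausibility is near 0.5, receives most of the sampling mass, while confidently scored candidates are rarely drawn.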

q-fin.PM

[179] Your AI, Not Your View: The Bias of LLMs in Investment Analysis

Hoyoung Lee,Junhyuk Seo,Suhwan Park,Junhyeong Lee,Wonbin Ahn,Chanyeol Choi,Alejandro Lopez-Lira,Yongjae Lee

Main category: q-fin.PM

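TL;DR: The paper proposes an experimental framework for analyzing the confirmation bias of large language models (LLMs) in investment analysis. It reveals a preference for large-cap stocks and contrarian strategies, and a tendency for these preferences to harden into bias.

Details Motivation: In finance, LLMs face knowledge conflicts between pre-trained parametric knowledge and real-time market data, which can lead to unreliable investment recommendations. Little research has examined what investment views LLMs actually hold.

Contribution: Offers the first quantitative analysis of confirmation bias in LLM-based investment analysis, revealing model preferences for large-cap stocks and contrarian strategies and how those preferences persist.

Method: Uses hypothetical scenarios with balanced and imbalanced arguments to extract models' latent preferences and measure their persistence, focusing on sector, size, and momentum.

Result: Most models show a consistent preference for large-cap stocks and contrarian strategies, and cling to their initial judgments despite counter-evidence.

Insight: LLMs carry inherent biases in investment analysis that may affect the reliability of financial services. Further research is needed to improve how such models are deployed.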

Abstract: In finance, Large Language Models (LLMs) face frequent knowledge conflicts due to discrepancies between pre-trained parametric knowledge and real-time market data. These conflicts become particularly problematic when LLMs are deployed in real-world investment services, where misalignment between a model’s embedded preferences and those of the financial institution can lead to unreliable recommendations. Yet little research has examined what investment views LLMs actually hold. We propose an experimental framework to investigate such conflicts, offering the first quantitative analysis of confirmation bias in LLM-based investment analysis. Using hypothetical scenarios with balanced and imbalanced arguments, we extract models’ latent preferences and measure their persistence. Focusing on sector, size, and momentum, our analysis reveals distinct, model-specific tendencies. In particular, we observe a consistent preference for large-cap stocks and contrarian strategies across most models. These preferences often harden into confirmation bias, with models clinging to initial judgments despite counter-evidence.

cs.SE

[180] The Impact of Fine-tuning Large Language Models on Automated Program Repair

Roman Macháček,Anastasiia Grishina,Max Hort,Leon Moonen

Main category: cs.SE

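TL;DR: The paper studies how fine-tuning large language models affects automated program repair (APR). It finds that parameter-efficient fine-tuning (PEFT) outperforms full fine-tuning, which tends to overfit and degrade performance.

Details Motivation: To explore how fine-tuning pre-trained large language models (LLMs) can improve their performance on APR at lower computational cost.

Contribution: An empirical study of how several fine-tuning techniques affect LLMs on APR, showing the advantage of PEFT.

Method: Evaluates six LLMs (CodeGen, CodeT5, StarCoder, DeepSeekCoder, Bloom, and CodeLlama-2) on three APR benchmarks (QuixBugs, Defects4J, and HumanEval-Java), comparing no fine-tuning, full fine-tuning, and PEFT with LoRA and IA3.

Result: Full fine-tuning lowered benchmark performance for various models because of differing data distributions and overfitting, whereas PEFT methods, which restrict the number of trainable parameters, achieved better results.

Insight: PEFT is an effective way to adapt LLMs to a specific task such as APR. It avoids both the high computational cost of full fine-tuning and its performance degradation.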

Abstract: Automated Program Repair (APR) uses various tools and techniques to help developers achieve functional and error-free code faster. In recent years, Large Language Models (LLMs) have gained popularity as components in APR tool chains because of their performance and flexibility. However, training such models requires a significant amount of resources. Fine-tuning techniques have been developed to adapt pre-trained LLMs to specific tasks, such as APR, and enhance their performance at far lower computational costs than training from scratch. In this study, we empirically investigate the impact of various fine-tuning techniques on the performance of LLMs used for APR. Our experiments provide insights into the performance of a selection of state-of-the-art LLMs pre-trained on code. The evaluation is done on three popular APR benchmarks (i.e., QuixBugs, Defects4J and HumanEval-Java) and considers six different LLMs with varying parameter sizes (resp. CodeGen, CodeT5, StarCoder, DeepSeekCoder, Bloom, and CodeLlama-2). We consider three training regimens: no fine-tuning, full fine-tuning, and parameter-efficient fine-tuning (PEFT) using LoRA and IA3. We observe that full fine-tuning techniques decrease the benchmarking performance of various models due to different data distributions and overfitting. By using parameter-efficient fine-tuning methods, we restrict models in the amount of trainable parameters and achieve better results. Keywords: large language models, automated program repair, parameter-efficient fine-tuning, AI4Code, AI4SE, ML4SE.
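
To make concrete what restricting the amount of trainable parameters means for LoRA, the sketch below uses the standard LoRA formulation: a frozen weight W of shape d_out x d_in is adapted as W + (alpha / r) * B A, where only B (d_out x r) and A (r x d_in) are trained. The layer size, rank, and helper names are illustrative assumptions and are not the configurations used in the study.

```python
def lora_trainable_params(d_out, d_in, rank):
    """Parameters in the two LoRA factors B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_out + d_in)

def full_params(d_out, d_in):
    """Parameters updated when the whole weight matrix is fine-tuned."""
    return d_out * d_in

def matmul(x, y):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)]
            for row in x]

def lora_effective_weight(w, b, a, alpha, rank):
    """Frozen weight plus the scaled low-rank update: W + (alpha / rank) * B @ A."""
    delta = matmul(b, a)
    scale = alpha / rank
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]

# Hypothetical 4096 x 4096 projection adapted with rank 8.
trainable = lora_trainable_params(4096, 4096, 8)
fraction = trainable / full_params(4096, 4096)  # well under 1% of the matrix

# Tiny worked example of the update itself (rank 1, alpha 2).
adapted = lora_effective_weight([[1.0, 0.0], [0.0, 1.0]],
                                [[1.0], [2.0]], [[3.0, 4.0]],
                                alpha=2.0, rank=1)
```

For the hypothetical layer above, the two factors hold 65,536 parameters against roughly 16.8 million in the full matrix, which is the kind of reduction the PEFT regimen relies on.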

[181] Enhancing Project-Specific Code Completion by Inferring Internal API Information

Le Deng,Xiaoxue Ren,Chao Ni,Ming Liang,David Lo,Zhongxin Liu

Main category: cs.SE

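TL;DR: The paper proposes enhancing project-specific code completion by inferring internal API information. It extends the representation of each API with usage examples and semantic descriptions, building a knowledge base from which an LLM can generate relevant completions. Experiments show that the method significantly improves completion accuracy over existing methods.

Details Motivation: Existing project-specific code completion methods based on retrieval-augmented generation (RAG) and large language models (LLMs) often struggle to use internal API information, especially when the APIs are not explicitly imported in the file. This limits completion accuracy.

Contribution: 1. Proposes a method that infers internal API information without relying on imports; 2. introduces ProjBench, a benchmark that avoids leaked imports; 3. shows experimentally that the method significantly outperforms existing methods on code and identifier exact match.

Method: Extends API representations by constructing usage examples and semantic descriptions, giving the LLM a knowledge base for generating more accurate completions.

Result: On ProjBench and CrossCodeEval, the method improves code exact match by 22.72% and identifier exact match by 18.31%. Integrated with existing baselines, it boosts code match by 47.80% and identifier match by 35.55%.

Insight: Inferring internal API information is critical to completion accuracy, especially when APIs are not explicitly imported. Building semantic and contextual representations of APIs can significantly improve LLM completions.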

Abstract: Project-specific code completion is a critical task that leverages context from a project to generate accurate code. State-of-the-art methods use retrieval-augmented generation (RAG) with large language models (LLMs) and project information for code completion. However, they often struggle to incorporate internal API information, which is crucial for accuracy, especially when APIs are not explicitly imported in the file. To address this, we propose a method to infer internal API information without relying on imports. Our method extends the representation of APIs by constructing usage examples and semantic descriptions, building a knowledge base for LLMs to generate relevant completions. We also introduce ProjBench, a benchmark that avoids leaked imports and consists of large-scale real-world projects. Experiments on ProjBench and CrossCodeEval show that our approach significantly outperforms existing methods, improving code exact match by 22.72% and identifier exact match by 18.31%. Additionally, integrating our method with existing baselines boosts code match by 47.80% and identifier match by 35.55%.

cs.RO

[182] Digital and Robotic Twinning for Validation of Proximity Operations and Formation Flying

Aviad Golan,Gregory Zin,Zahra Ahmed,Emily Bates,Toby Bell,Pol Francesch Huc,Samuel Y. W. Low,Juergen Bosse,Simone D’Amico

Main category: cs.RO

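TL;DR: The paper presents a unified, end-to-end digital and robotic twinning framework for validating and testing multi-modal GNC systems, and demonstrates its effectiveness and consistency on a low-Earth-orbit mission scenario.

Details Motivation: In spacecraft rendezvous, proximity operations, and formation flying, the guidance, navigation, and control (GNC) system is safety-critical and must meet strict performance requirements. The complexity of the space environment makes such systems hard to validate, so a validation process that bridges simulation and real-world behavior is needed.

Contribution: A unified digital and robotic twinning framework that supports software- and hardware-in-the-loop testing of multi-modal GNC systems, with three testbeds (GRAND, TRON, and OS) for validating different navigation techniques.

Method: A hybrid digital and robotic twinning framework in which GRAND validates RF-based navigation, TRON and OS validate vision-based navigation, and the GNC software stack is characterized on a full-range RPO mission scenario in LEO.

Result: Results from the digital and robotic twins are consistent, which supports the framework as a reliable means of assessing and verifying GNC systems.

Insight: The framework offers a practical way to validate GNC systems for complex space environments, bridging the gap between simulation and real missions.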

Abstract: In spacecraft Rendezvous, Proximity Operations (RPO), and Formation Flying (FF), the Guidance Navigation and Control (GNC) system is safety-critical and must meet strict performance requirements. However, validating such systems is challenging due to the complexity of the space environment, necessitating a verification and validation (V&V) process that bridges simulation and real-world behavior. The key contribution of this paper is a unified, end-to-end digital and robotic twinning framework that enables software- and hardware-in-the-loop testing for multi-modal GNC systems. The robotic twin includes three testbeds at Stanford’s Space Rendezvous Laboratory (SLAB): the GNSS and Radiofrequency Autonomous Navigation Testbed for Distributed Space Systems (GRAND) to validate RF-based navigation techniques, and the Testbed for Rendezvous and Optical Navigation (TRON) and Optical Stimulator (OS) to validate vision-based methods. The test article for this work is an integrated multi-modal GNC software stack for RPO and FF developed at SLAB. This paper introduces the hybrid framework and summarizes calibration and error characterization for the robotic twin. Then, the GNC stack’s performance and robustness is characterized using the integrated digital and robotic twinning pipeline for a full-range RPO mission scenario in Low-Earth Orbit (LEO). The results shown in the paper demonstrate consistency between digital and robotic twins, validating the hybrid twinning pipeline as a reliable framework for realistic assessment and verification of GNC systems.

[183] Humanoid Occupancy: Enabling A Generalized Multimodal Occupancy Perception System on Humanoid Robots

Wei Cui,Haoyu Wang,Wenkang Qin,Yijie Guo,Gang Han,Wen Zhao,Jiahang Cao,Zhang Zhang,Jiaru Zhong,Jingkai Sun,Pihai Sun,Shuai Shi,Botuo Jiang,Jiahao Ma,Jiaxu Wang,Hao Cheng,Zhichao Liu,Yang Wang,Zheng Zhu,Guan Huang,Jian Tang,Qiang Zhang

Main category: cs.RO

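TL;DR: The paper presents Humanoid Occupancy, a generalized multimodal occupancy perception system that combines hardware and software components, data acquisition devices, and a dedicated annotation pipeline to give humanoid robots comprehensive environmental understanding.

Details Motivation: Humanoid robot technology is advancing rapidly, but existing perception modules are mostly tailored to specific scenarios and lack generality. Occupancy-based representation, which provides rich semantic and geometric information, is regarded as particularly suitable for humanoid robots.

Contribution: 1) Proposes a generalized multimodal occupancy perception system; 2) addresses kinematic interference and occlusion, problems specific to humanoid robots; 3) builds the first panoramic occupancy dataset for humanoid robots.

Method: The system uses multi-modal fusion and temporal information integration to generate grid-based occupancy outputs that encode occupancy status and semantic labels. An effective sensor layout strategy handles kinematic interference and occlusion.

Result: Experiments demonstrate the system's effectiveness, giving humanoid robots robust environmental perception.

Insight: The system lays a technical foundation for standardizing universal visual modules and may help humanoid robots be deployed widely in complex scenarios.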

Abstract: Humanoid robot technology is advancing rapidly, with manufacturers introducing diverse heterogeneous visual perception modules tailored to specific scenarios. Among various perception paradigms, occupancy-based representation has become widely recognized as particularly suitable for humanoid robots, as it provides both rich semantic and 3D geometric information essential for comprehensive environmental understanding. In this work, we present Humanoid Occupancy, a generalized multimodal occupancy perception system that integrates hardware and software components, data acquisition devices, and a dedicated annotation pipeline. Our framework employs advanced multi-modal fusion techniques to generate grid-based occupancy outputs encoding both occupancy status and semantic labels, thereby enabling holistic environmental understanding for downstream tasks such as task planning and navigation. To address the unique challenges of humanoid robots, we overcome issues such as kinematic interference and occlusion, and establish an effective sensor layout strategy. Furthermore, we have developed the first panoramic occupancy dataset specifically for humanoid robots, offering a valuable benchmark and resource for future research and development in this domain. The network architecture incorporates multi-modal feature fusion and temporal information integration to ensure robust perception. Overall, Humanoid Occupancy delivers effective environmental perception for humanoid robots and establishes a technical foundation for standardizing universal visual modules, paving the way for the widespread deployment of humanoid robots in complex real-world scenarios.

[184] Methods for the Segmentation of Reticular Structures Using 3D LiDAR Data: A Comparative Evaluation

Francisco J. Soler Mora,Adrián Peidró Vidal,Marc Fabregat-Jaén,Luis Payá Castelló,Óscar Reinoso García

Main category: cs.RO

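TL;DR: The paper compares methods for segmenting reticular structures from 3D LiDAR data, covering analytical algorithms and deep learning models, and shows their potential for autonomous robot navigation.

Details Motivation: Inspecting and maintaining reticular structures such as bridges and pylons is costly and hazardous. Prior research has focused on fault detection or robotic platform design, while autonomous navigation within these structures is less explored.

Contribution: Proposes several methods for segmenting the navigable surfaces of reticular structures and compares analytical algorithms with deep learning models, with the aim of increasing the autonomy of climbing robots.

Method: Two categories of methods: an analytical algorithm based on the eigendecomposition of planar patches in the point cloud, and deep learning models (PointNet, PointNet++, MinkUNet34C, and PointTransformerV3) trained for the same task.

Result: The analytical algorithm is easier to tune and performs comparably to the deep learning models, which are more computationally intensive but excel in segmentation accuracy. PointTransformerV3 reaches an mIoU of about 97%.

Insight: The study exposes the trade-off between computational efficiency and segmentation performance, giving guidance for autonomous infrastructure inspection and maintenance.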

Abstract: Reticular structures form the backbone of major infrastructure like bridges, pylons, and airports, but their inspection and maintenance are costly and hazardous, often requiring human intervention. While prior research has focused on fault detection via images or robotic platform design, the autonomous navigation of robots within these structures is less explored. This study addresses that gap by proposing methods to detect navigable surfaces in truss structures, enhancing the autonomy of climbing robots. The paper introduces several approaches for binary segmentation of navigable surfaces versus background from 3D point clouds of metallic trusses. These methods fall into two categories: analytical algorithms and deep learning models. The analytical approach features a custom algorithm that segments structures by analyzing the eigendecomposition of planar patches in the point cloud. In parallel, advanced deep learning models PointNet, PointNet++, MinkUNet34C, and PointTransformerV3 are trained and evaluated for the same task. Comparative analysis shows that the analytical algorithm offers easier parameter tuning and performance comparable to deep learning models, which, while more computationally intensive, excel in segmentation accuracy. Notably, PointTransformerV3 achieves a Mean Intersection Over Union (mIoU) of about 97%. The study demonstrates the promise of both analytical and deep learning methods for improving autonomous navigation in complex truss environments. The results highlight the trade-offs between computational efficiency and segmentation performance, providing valuable guidance for future research and practical applications in autonomous infrastructure inspection and maintenance.
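
The reported metric, mean Intersection over Union (mIoU), averages the per-class ratio TP / (TP + FP + FN) over the classes. The sketch below computes it for a binary per-point labeling such as navigable surface versus background. The label encoding (1 for navigable, 0 for background), the function name, and the six-point example are assumptions made for illustration and do not come from the paper's data.

```python
def mean_iou(predicted, target, num_classes=2):
    """Mean of per-class IoU = TP / (TP + FP + FN).

    Classes that appear in neither labeling are skipped so they do not
    distort the average.
    """
    if len(predicted) != len(target):
        raise ValueError("predicted and target must have the same length")
    ious = []
    for cls in range(num_classes):
        intersection = sum(1 for p, t in zip(predicted, target)
                           if p == cls and t == cls)
        union = sum(1 for p, t in zip(predicted, target)
                    if p == cls or t == cls)
        if union > 0:
            ious.append(intersection / union)
    if not ious:
        raise ValueError("no labeled points to evaluate")
    return sum(ious) / len(ious)

# Hypothetical labels for six points: 1 = navigable surface, 0 = background.
target = [1, 1, 1, 0, 0, 0]
predicted = [1, 1, 0, 0, 0, 0]
score = mean_iou(predicted, target)
```

In this example one navigable point is mislabeled as background, giving an IoU of 2/3 for the navigable class and 3/4 for the background class.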

[185] LanternNet: A Novel Hub-and-Spoke System to Seek and Suppress Spotted Lanternfly Populations

Vinil Polepalli

Main category: cs.RO

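TL;DR: The paper presents LanternNet, a novel autonomous robotic system for detecting and suppressing spotted lanternfly (SLF) populations. The system uses a hub-and-spoke design that combines computer vision and robotics. It significantly reduced SLF numbers and is cost-effective and scalable.

Details Motivation: The spotted lanternfly poses a serious threat to agriculture and ecosystems, while existing control methods are labor-intensive, environmentally hazardous, and of limited effectiveness. A more efficient, autonomous solution is needed.

Contribution: LanternNet combines a central hub with robotic spokes for precise detection and targeted suppression of SLF. The system showed significant effects in field tests at multiple sites and could be adapted to other invasive species.

Method: A YOLOv8 computer vision model identifies SLF, and three specialized robotic spokes carry out pest neutralization, environmental monitoring, and navigation/mapping.

Result: Field tests over 5 weeks showed significant reductions in SLF populations (p < 0.01) and improved tree health indicators at the majority of test sites. Compared with conventional methods, LanternNet costs less and scales more easily.

Insight: LanternNet demonstrates the potential of robotics and AI for invasive species management and for innovative solutions to environmental problems.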

Abstract: The invasive spotted lanternfly (SLF) poses a significant threat to agriculture and ecosystems, causing widespread damage. Current control methods, such as egg scraping, pesticides, and quarantines, prove labor-intensive, environmentally hazardous, and inadequate for long-term SLF suppression. This research introduces LanternNet, a novel autonomous robotic Hub-and-Spoke system designed for scalable detection and suppression of SLF populations. A central, tree-mimicking hub utilizes a YOLOv8 computer vision model for precise SLF identification. Three specialized robotic spokes perform targeted tasks: pest neutralization, environmental monitoring, and navigation/mapping. Field deployment across multiple infested sites over 5 weeks demonstrated LanternNet’s efficacy. Quantitative analysis revealed significant reductions (p < 0.01, paired t-tests) in SLF populations and corresponding improvements in tree health indicators across the majority of test sites. Compared to conventional methods, LanternNet offers substantial cost advantages and improved scalability. Furthermore, the system’s adaptability for enhanced autonomy and targeting of other invasive species presents significant potential for broader ecological impact. LanternNet demonstrates the transformative potential of integrating robotics and AI for advanced invasive species management and improved environmental outcomes.

cs.LG

[186] Agentic Reinforced Policy Optimization

Guanting Dong,Hangyu Mao,Kai Ma,Licheng Bao,Yifei Chen,Zhongyuan Wang,Zhongxia Chen,Jiazhen Du,Huiyang Wang,Fuzheng Zhang,Guorui Zhou,Yutao Zhu,Ji-Rong Wen,Zhicheng Dou

Main category: cs.LG

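TL;DR: The paper proposes Agentic Reinforced Policy Optimization (ARPO), a reinforcement learning algorithm for multi-turn tool-interaction tasks. It aims to balance an LLM's long-horizon reasoning with its multi-turn tool use, and improves performance through an entropy-based adaptive rollout mechanism and advantage attribution estimation.

Details Motivation: Current RL algorithms do not adequately balance the intrinsic long-horizon reasoning capabilities of large language models (LLMs) with their proficiency in multi-turn tool interactions. The authors propose ARPO to close this gap.

Contribution: Proposes ARPO, which introduces an entropy-based adaptive rollout mechanism and advantage attribution estimation, improving LLM performance on multi-turn tool-interaction tasks while using a smaller tool-use budget.

Method: ARPO uses the entropy-based adaptive mechanism to balance global trajectory sampling against step-level sampling, and uses advantage attribution estimation so that the model internalizes advantage differences in stepwise tool-use interactions.

Result: Across 13 benchmarks, ARPO outperforms trajectory-level RL algorithms while using only half the tool-use budget of existing methods.

Insight: LLM behavior becomes more uncertain immediately after tool interactions, as shown by a rise in token entropy, so adapting exploration at those steps can improve performance.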

Abstract: Large-scale reinforcement learning with verifiable rewards (RLVR) has demonstrated its effectiveness in harnessing the potential of large language models (LLMs) for single-turn reasoning tasks. In realistic reasoning scenarios, LLMs can often utilize external tools to assist in task-solving processes. However, current RL algorithms inadequately balance the models’ intrinsic long-horizon reasoning capabilities and their proficiency in multi-turn tool interactions. To bridge this gap, we propose Agentic Reinforced Policy Optimization (ARPO), a novel agentic RL algorithm tailored for training multi-turn LLM-based agents. Through preliminary experiments, we observe that LLMs tend to exhibit highly uncertain behavior, characterized by an increase in the entropy distribution of generated tokens, immediately following interactions with external tools. Motivated by this observation, ARPO incorporates an entropy-based adaptive rollout mechanism, dynamically balancing global trajectory sampling and step-level sampling, thereby promoting exploration at steps with high uncertainty after tool usage. By integrating an advantage attribution estimation, ARPO enables LLMs to internalize advantage differences in stepwise tool-use interactions. Our experiments across 13 challenging benchmarks in computational reasoning, knowledge reasoning, and deep search domains demonstrate ARPO’s superiority over trajectory-level RL algorithms. Remarkably, ARPO achieves improved performance using only half of the tool-use budget required by existing methods, offering a scalable solution for aligning LLM-based agents with real-time dynamic environments. Our code and datasets are released at https://github.com/dongguanting/ARPO
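
The entropy-based adaptive rollout idea can be sketched as a simple decision rule: measure the entropy of the next-token distributions right after a tool call and spend extra step-level samples only when it has risen noticeably. The code below is a minimal illustration under stated assumptions, not the released ARPO implementation: the threshold value, the comparison against the entropy at the start of the trajectory, and the function names are all hypothetical.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def mean_entropy(distributions):
    """Average entropy over a list of next-token distributions."""
    return sum(token_entropy(d) for d in distributions) / len(distributions)

def should_branch(initial_dists, post_tool_dists, threshold=0.2):
    """Return True when entropy after a tool call exceeds the initial
    entropy by more than `threshold`, signalling that extra step-level
    rollouts are worth sampling from this point."""
    delta = mean_entropy(post_tool_dists) - mean_entropy(initial_dists)
    return delta > threshold

# Hypothetical distributions over a 3-token vocabulary.
confident = [[0.9, 0.05, 0.05]]
uncertain = [[0.4, 0.3, 0.3]]
branch_after_tool = should_branch(confident, uncertain)
```

A rule of this shape concentrates sampling on the uncertain steps that follow tool usage instead of resampling whole trajectories, which is consistent with the reduced tool-use budget the abstract reports.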

[187] EcoTransformer: Attention without Multiplication

Xin Gao,Xingming Xu

Main category: cs.LG

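TL;DR: EcoTransformer proposes a new attention mechanism that replaces dot-product attention with an L1 metric and Laplacian-kernel convolution, avoiding matrix multiplication in the attention score calculation and significantly reducing energy consumption.

Details Motivation: The dot-product attention of the standard Transformer is computationally intensive and energy-hungry. EcoTransformer aims to provide a more efficient alternative.

Contribution: Proposes EcoTransformer, which implements attention with an L1 metric and a Laplacian kernel, making the attention score calculation free of matrix multiplication and significantly lowering energy use.

Method: Measures query-key distances with the L1 metric and builds the output context vector as a convolution of the values with a Laplacian kernel, in place of scaled dot-product attention.

Result: Performs on par with or better than scaled dot-product attention on NLP, bioinformatics, and vision tasks, while consuming significantly less energy.

Insight: The L1 metric and convolution can yield efficient attention computation, suggesting a new direction for low-energy model design.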

Abstract: The Transformer, with its scaled dot-product attention mechanism, has become a foundational architecture in modern AI. However, this mechanism is computationally intensive and incurs substantial energy costs. We propose a new Transformer architecture EcoTransformer, in which the output context vector is constructed as the convolution of the values using a Laplacian kernel, where the distances are measured by the L1 metric between the queries and keys. Compared to dot-product based attention, the new attention score calculation is free of matrix multiplication. It performs on par with, or even surpasses, scaled dot-product attention in NLP, bioinformatics, and vision tasks, while consuming significantly less energy.
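
A Laplacian kernel over L1 distances gives each key a weight of exp(-||q - k||_1 / scale), so the score step needs only subtractions, absolute values, and additions. The sketch below is one plausible reading of the abstract, not the authors' implementation: the `scale` parameter, the normalization over keys, and the function name are assumptions. The final weighted sum of values in this sketch still uses multiplication; only the score calculation avoids it.

```python
import math

def l1_laplacian_attention(queries, keys, values, scale=1.0):
    """For each query, weight the values by exp(-||q - k||_1 / scale),
    normalized over the keys, and return the weighted averages."""
    dim = len(values[0])
    outputs = []
    for q in queries:
        # Score step: L1 distance between the query and every key.
        distances = [sum(abs(qi - ki) for qi, ki in zip(q, k)) for k in keys]
        # Shift by the smallest distance so the largest weight is exactly 1.
        nearest = min(distances)
        weights = [math.exp(-(d - nearest) / scale) for d in distances]
        total = sum(weights)
        outputs.append([sum(w * v[j] for w, v in zip(weights, values)) / total
                        for j in range(dim)])
    return outputs

# Hypothetical 2-dimensional example with two keys.
keys = [[0.0, 0.0], [10.0, 10.0]]
values = [[1.0, 0.0], [0.0, 1.0]]
near_first = l1_laplacian_attention([[0.0, 0.0]], keys, values)[0]
midway = l1_laplacian_attention([[5.0, 5.0]], keys, values)[0]
```

A query that coincides with the first key returns almost exactly the first value, while a query equidistant from both keys returns their average, which mirrors how softmax attention behaves with similarity scores.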

[188] Customize Multi-modal RAI Guardrails with Precedent-based predictions

Cheng-Fu Yang,Thanh Tran,Christos Christodoulopoulos,Weitong Ruan,Rahul Gupta,Kai-Wei Chang

Main category: cs.LG

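TL;DR: The paper proposes a precedent-based approach to multi-modal RAI guardrails. By conditioning on the reasoning processes of similar prior data points rather than on fixed policies, it greatly improves the guardrail's flexibility and adaptability.

Details Motivation: In real-world applications, users need highly customizable multi-modal guardrail policies, but existing methods either generalize poorly to new policies or require extensive retraining.

Contribution: 1. Introduces precedent-based prediction; 2. proposes a critique-revise mechanism for collecting high-quality precedents; 3. designs two strategies that use precedents for prediction.

Method: Conditions the model's judgment on precedents, that is, the reasoning processes of prior data points similar to the input, instead of on fixed policies, and uses the critique-revise mechanism to improve precedent quality.

Result: Experiments show that the approach outperforms previous methods in both few-shot and full-dataset scenarios and generalizes better to novel policies.

Insight: Using precedents instead of fixed policies adapts more flexibly to varied requirements and reduces the retraining needed for new policies.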

Abstract: A multi-modal guardrail must effectively filter image content based on user-defined policies, identifying material that may be hateful, reinforce harmful stereotypes, contain explicit material, or spread misinformation. Deploying such guardrails in real-world applications, however, poses significant challenges. Users often require varied and highly customizable policies and typically cannot provide abundant examples for each custom policy. Consequently, an ideal guardrail should be scalable to the multiple policies and adaptable to evolving user standards with minimal retraining. Existing fine-tuning methods typically condition predictions on pre-defined policies, restricting their generalizability to new policies or necessitating extensive retraining to adapt. Conversely, training-free methods struggle with limited context lengths, making it difficult to incorporate all the policies comprehensively. To overcome these limitations, we propose to condition model’s judgment on “precedents”, which are the reasoning processes of prior data points similar to the given input. By leveraging precedents instead of fixed policies, our approach greatly enhances the flexibility and adaptability of the guardrail. In this paper, we introduce a critique-revise mechanism for collecting high-quality precedents and two strategies that utilize precedents for robust prediction. Experimental results demonstrate that our approach outperforms previous methods across both few-shot and full-dataset scenarios and exhibits superior generalization to novel policies.
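
Conditioning on precedents presupposes some way of finding prior data points similar to the input. The abstract does not say how similarity is measured, so the sketch below assumes the simplest option: each precedent is stored with an embedding, and the closest ones are retrieved by cosine similarity. The dictionary keys, the two-dimensional embeddings, and the example rationales are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def nearest_precedents(query, precedents, k=1):
    """Return the k stored precedents whose embeddings are most similar to `query`.

    Each precedent is a dict with hypothetical keys
    'embedding', 'rationale', and 'verdict'.
    """
    ranked = sorted(precedents,
                    key=lambda p: cosine_similarity(query, p["embedding"]),
                    reverse=True)
    return ranked[:k]

# Hypothetical precedent store with made-up embeddings and rationales.
precedent_store = [
    {"embedding": [1.0, 0.0],
     "rationale": "image shows a weapon in a threatening context",
     "verdict": "block"},
    {"embedding": [0.0, 1.0],
     "rationale": "ordinary landscape photo",
     "verdict": "allow"},
]
match = nearest_precedents([0.9, 0.1], precedent_store, k=1)[0]
```

The retrieved rationale and verdict would then be placed in the guardrail model's context in place of the full policy text, which is how a precedent-based design can sidestep limited context lengths.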

[189] Kimi K2: Open Agentic Intelligence

Kimi Team,Yifan Bai,Yiping Bao,Guanduo Chen,Jiahao Chen,Ningxin Chen,Ruijue Chen,Yanru Chen,Yuankun Chen,Yutian Chen,Zhuofu Chen,Jialei Cui,Hao Ding,Mengnan Dong,Angang Du,Chenzhuang Du,Dikang Du,Yulun Du,Yu Fan,Yichen Feng,Kelin Fu,Bofei Gao,Hongcheng Gao,Peizhong Gao,Tong Gao,Xinran Gu,Longyu Guan,Haiqing Guo,Jianhang Guo,Hao Hu,Xiaoru Hao,Tianhong He,Weiran He,Wenyang He,Chao Hong,Yangyang Hu,Zhenxing Hu,Weixiao Huang,Zhiqi Huang,Zihao Huang,Tao Jiang,Zhejun Jiang,Xinyi Jin,Yongsheng Kang,Guokun Lai,Cheng Li,Fang Li,Haoyang Li,Ming Li,Wentao Li,Yanhao Li,Yiwei Li,Zhaowei Li,Zheming Li,Hongzhan Lin,Xiaohan Lin,Zongyu Lin,Chengyin Liu,Chenyu Liu,Hongzhang Liu,Jingyuan Liu,Junqi Liu,Liang Liu,Shaowei Liu,T. Y. Liu,Tianwei Liu,Weizhou Liu,Yangyang Liu,Yibo Liu,Yiping Liu,Yue Liu,Zhengying Liu,Enzhe Lu,Lijun Lu,Shengling Ma,Xinyu Ma,Yingwei Ma,Shaoguang Mao,Jie Mei,Xin Men,Yibo Miao,Siyuan Pan,Yebo Peng,Ruoyu Qin,Bowen Qu,Zeyu Shang,Lidong Shi,Shengyuan Shi,Feifan Song,Jianlin Su,Zhengyuan Su,Xinjie Sun,Flood Sung,Heyi Tang,Jiawen Tao,Qifeng Teng,Chensi Wang,Dinglu Wang,Feng Wang,Haiming Wang,Jianzhou Wang,Jiaxing Wang,Jinhong Wang,Shengjie Wang,Shuyi Wang,Yao Wang,Yejie Wang,Yiqin Wang,Yuxin Wang,Yuzhi Wang,Zhaoji Wang,Zhengtao Wang,Zhexu Wang,Chu Wei,Qianqian Wei,Wenhao Wu,Xingzhe Wu,Yuxin Wu,Chenjun Xiao,Xiaotong Xie,Weimin Xiong,Boyu Xu,Jing Xu,Jinjing Xu,L. H. Xu,Lin Xu,Suting Xu,Weixin Xu,Xinran Xu,Yangchuan Xu,Ziyao Xu,Junjie Yan,Yuzi Yan,Xiaofei Yang,Ying Yang,Zhen Yang,Zhilin Yang,Zonghan Yang,Haotian Yao,Xingcheng Yao,Wenjie Ye,Zhuorui Ye,Bohong Yin,Longhui Yu,Enming Yuan,Hongbang Yuan,Mengjie Yuan,Haobing Zhan,Dehao Zhang,Hao Zhang,Wanlu Zhang,Xiaobin Zhang,Yangkun Zhang,Yizhi Zhang,Yongting Zhang,Yu Zhang,Yutao Zhang,Yutong Zhang,Zheng Zhang,Haotian Zhao,Yikai Zhao,Huabin Zheng,Shaojie Zheng,Jianren Zhou,Xinyu Zhou,Zaida Zhou,Zhen Zhu,Weiyu Zhuang,Xinxing Zu

Main category: cs.LG

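TL;DR: Kimi K2 is a Mixture-of-Experts (MoE) large language model with 32 billion activated parameters and 1 trillion total parameters. With the MuonClip optimizer and large-scale multi-stage post-training, K2 achieves state-of-the-art performance among open-source non-thinking models, with particular strength in agentic capabilities.

Details Motivation: Open-source large language models currently have limited capability on agentic tasks such as software engineering and interactive tasks. Kimi K2 aims to close this gap through more efficient training techniques and multi-stage post-training.

Contribution: 1. Proposes the Kimi K2 model, which combines an MoE architecture with the MuonClip optimizer to address training instability.
2. Designs a large-scale agentic data synthesis pipeline and a joint reinforcement learning post-training stage that strengthen the model's agentic capabilities.
3. Surpasses most open- and closed-source baselines in non-thinking settings on several benchmarks.

Method: 1. Uses the MuonClip optimizer, which adds a QK-clip technique to Muon, to keep training stable.
2. Pre-trains on 15.5 trillion tokens with zero loss spikes.
3. Post-trains with agentic data synthesis and joint RL, in which the model improves through interactions with real and synthetic environments.

Result: K2 scores 66.1 on Tau2-Bench, 76.5 on ACEBench (En), 65.8 on SWE-Bench Verified, and 47.3 on SWE-Bench Multilingual, and also performs strongly on coding, mathematics, and reasoning tasks.

Insight: 1. The MuonClip optimizer stabilizes large-scale MoE training while retaining Muon's token efficiency.
2. The gains on agentic tasks support the value of multi-stage post-training, especially joint RL.
3. As an open-source model, K2 is competitive with closed-source models in non-thinking settings.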

Abstract: We introduce Kimi K2, a Mixture-of-Experts (MoE) large language model with 32 billion activated parameters and 1 trillion total parameters. We propose the MuonClip optimizer, which improves upon Muon with a novel QK-clip technique to address training instability while enjoying the advanced token efficiency of Muon. Based on MuonClip, K2 was pre-trained on 15.5 trillion tokens with zero loss spike. During post-training, K2 undergoes a multi-stage post-training process, highlighted by a large-scale agentic data synthesis pipeline and a joint reinforcement learning (RL) stage, where the model improves its capabilities through interactions with real and synthetic environments. Kimi K2 achieves state-of-the-art performance among open-source non-thinking models, with strengths in agentic capabilities. Notably, K2 obtains 66.1 on Tau2-Bench, 76.5 on ACEBench (En), 65.8 on SWE-Bench Verified, and 47.3 on SWE-Bench Multilingual – surpassing most open and closed-sourced baselines in non-thinking settings. It also exhibits strong capabilities in coding, mathematics, and reasoning tasks, with a score of 53.7 on LiveCodeBench v6, 49.5 on AIME 2025, 75.1 on GPQA-Diamond, and 27.1 on OJBench, all without extended thinking. These results position Kimi K2 as one of the most capable open-source large language models to date, particularly in software engineering and agentic tasks. We release our base and post-trained model checkpoints to facilitate future research and applications of agentic intelligence.
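
The abstract names a QK-clip technique for training stability without spelling out the mechanism. The sketch below assumes one natural reading: when the largest attention logit exceeds a threshold, the query and key projections are each rescaled by the square root of threshold / max_logit, so their product is capped at the threshold. Because queries and keys are linear in their projection weights, the sketch applies the factor to the vectors directly. The threshold value, the even split between queries and keys, and the function names are assumptions.

```python
import math

def max_attention_logit(queries, keys):
    """Largest raw dot product between any query and any key."""
    return max(sum(qi * ki for qi, ki in zip(q, k))
               for q in queries for k in keys)

def qk_clip_factor(max_logit, threshold):
    """Scale applied to both projections so that the largest logit is at
    most `threshold`; returns 1.0 when no clipping is needed."""
    if max_logit <= threshold:
        return 1.0
    return math.sqrt(threshold / max_logit)

def scale_vectors(vectors, factor):
    """Multiply every component of every vector by `factor`."""
    return [[factor * x for x in vec] for vec in vectors]

# Hypothetical single query and key whose logit (50) exceeds the threshold (25).
queries = [[3.0, 4.0]]
keys = [[6.0, 8.0]]
factor = qk_clip_factor(max_attention_logit(queries, keys), threshold=25.0)
clipped_logit = max_attention_logit(scale_vectors(queries, factor),
                                    scale_vectors(keys, factor))
```

Splitting the correction evenly between the two projections keeps either one from shrinking disproportionately; other splits would cap the logit equally well.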

[190] Dissecting Persona-Driven Reasoning in Language Models via Activation Patching

Ansh Poonia,Maeghal Jain

Main category: cs.LG

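TL;DR: Using activation patching, this paper studies how large language models (LLMs) carry out persona-driven reasoning. Early multi-layer perceptron (MLP) layers process semantic content, and middle multi-head attention (MHA) layers use that information to shape the output. The authors also identify attention heads that attend disproportionately to racial and color-based identities.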

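Details Motivation: To explore the internal mechanisms of persona-driven reasoning in LLMs, understand how different model components encode and process persona-specific information, and thereby expose potential biases in the model's reasoning.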

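Contribution: 1. Uses activation patching to reveal the roles of early MLP layers and middle MHA layers in persona-driven reasoning; 2. Identifies attention heads that attend disproportionately to racial and color-based identities.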

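Method: Activation patching is used to analyze how internal components (such as MLP and MHA layers) behave when processing persona-specific information and how they affect the output.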
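
As a rough illustration of activation patching in general, the sketch below runs a toy two-layer network on a "clean" and a "corrupted" input, then overwrites one hidden unit at a time in the corrupted run with its clean value and measures how much of the output gap is restored. The toy model, weights, and function names are hypothetical and far simpler than the transformer components studied in the paper.

```python
def relu(values):
    return [max(0.0, v) for v in values]


def forward(x, w1, w2, patch=None):
    """Run the toy model; `patch` maps hidden index -> value to overwrite."""
    hidden = relu([sum(w * xi for w, xi in zip(row, x)) for row in w1])
    if patch:
        for idx, value in patch.items():
            hidden[idx] = value
    output = sum(w * h for w, h in zip(w2, hidden))
    return output, hidden


def patching_effects(clean_x, corrupt_x, w1, w2):
    """Fraction of the clean-vs-corrupt output gap restored by each unit."""
    clean_out, clean_hidden = forward(clean_x, w1, w2)
    corrupt_out, _ = forward(corrupt_x, w1, w2)
    gap = clean_out - corrupt_out  # assumed non-zero for this sketch
    effects = []
    for idx in range(len(clean_hidden)):
        patched_out, _ = forward(
            corrupt_x, w1, w2, patch={idx: clean_hidden[idx]}
        )
        effects.append((patched_out - corrupt_out) / gap)
    return effects


w1 = [[1.0, 0.0], [0.0, 1.0]]  # hidden unit i copies input feature i
w2 = [2.0, 0.5]
effects = patching_effects([1.0, 1.0], [0.0, 1.0], w1, w2)
print(effects)
```

A unit whose patched value restores most of the gap is one that carries the information distinguishing the two inputs, which is the kind of evidence used to attribute roles to specific layers or heads.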

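Result: Early MLP layers process not only the syntactic structure of the input but also its semantic content, and middle MHA layers use these representations to shape the model's output. Certain attention heads attend disproportionately to racial and color-based identities.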

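Insight: The study sheds light on the internal mechanisms behind persona-driven reasoning in LLMs and offers a new perspective on model behavior and its potential biases.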

Abstract: Large language models (LLMs) exhibit remarkable versatility in adopting diverse personas. In this study, we examine how assigning a persona influences a model’s reasoning on an objective task. Using activation patching, we take a first step toward understanding how key components of the model encode persona-specific information. Our findings reveal that the early Multi-Layer Perceptron (MLP) layers attend not only to the syntactic structure of the input but also process its semantic content. These layers transform persona tokens into richer representations, which are then used by the middle Multi-Head Attention (MHA) layers to shape the model’s output. Additionally, we identify specific attention heads that disproportionately attend to racial and color-based identities.

[191] LoRA-PAR: A Flexible Dual-System LoRA Partitioning Approach to Efficient LLM Fine-Tuning

Yining Huang,Bin Li,Keke Tang,Meilian Chen

Main category: cs.LG

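TL;DR: LoRA-PAR is a parameter-partitioning method based on dual-system theory. It classifies data, partitions parameters, and applies two-stage fine-tuning, lowering active parameter usage while matching or surpassing existing PEFT methods.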

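Details Motivation: Inspired by the dual-system theory in "Thinking, Fast and Slow", the paper hypothesizes that different subregions of an LLM's parameters may specialize in fast, intuitive responses versus multi-step logical reasoning, and builds a more efficient fine-tuning method on that idea.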

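Contribution: 1. Proposes LoRA-PAR, a dual-system LoRA framework; 2. Reduces active parameter usage through data classification and parameter partitioning; 3. Adopts a two-stage fine-tuning strategy (SFT, then RL) to improve performance.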

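Method: 1. Data classification via multi-model role-playing and voting; 2. Parameter partitioning based on importance scoring; 3. Two-stage fine-tuning, with SFT for System 1 tasks followed by RL for System 2 tasks.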

Result: Experiments show that LoRA-PAR lowers active parameter usage while matching or surpassing SOTA PEFT baselines.

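Insight: Dual-system theory can guide the design of LLM fine-tuning strategies: classifying tasks and partitioning parameters yields more efficient model optimization.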

Abstract: Large-scale generative models like DeepSeek-R1 and OpenAI-O1 benefit substantially from chain-of-thought (CoT) reasoning, yet pushing their performance typically requires vast data, large model sizes, and full-parameter fine-tuning. While parameter-efficient fine-tuning (PEFT) helps reduce cost, most existing approaches primarily address domain adaptation or layer-wise allocation rather than explicitly tailoring data and parameters to different response demands. Inspired by “Thinking, Fast and Slow,” which characterizes two distinct modes of thought, System 1 (fast, intuitive, often automatic) and System 2 (slower, more deliberative and analytic), we draw an analogy that different “subregions” of an LLM’s parameters might similarly specialize for tasks that demand quick, intuitive responses versus those requiring multi-step logical reasoning. Therefore, we propose LoRA-PAR, a dual-system LoRA framework that partitions both data and parameters by System 1 or System 2 demands, using fewer yet more focused parameters for each task. Specifically, we classify task data via multi-model role-playing and voting, and partition parameters based on importance scoring. We then adopt a two-stage fine-tuning strategy: first training on System 1 tasks with supervised fine-tuning (SFT) to enhance knowledge and intuition, then refining on System 2 tasks with reinforcement learning (RL) to reinforce deeper logical deliberation. Extensive experiments show that the two-stage fine-tuning strategy, SFT and RL, lowers active parameter usage while matching or surpassing SOTA PEFT baselines.

[192] GNSP: Gradient Null Space Projection for Preserving Cross-Modal Alignment in VLMs Continual Learning

Tiantian Peng,Yuyang Liu,Shuo Yang,Qiuhe Hong,YongHong Tian

Main category: cs.LG

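TL;DR: The paper proposes Gradient Null Space Projection (GNSP), a continual learning method that preserves cross-modal alignment while vision-language models (VLMs) are continually fine-tuned, avoiding catastrophic forgetting and degradation of embedding alignment.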

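Details Motivation: Existing vision-language models such as CLIP suffer from catastrophic forgetting and degraded cross-modal alignment when fine-tuned on a sequence of tasks, which undermines zero-shot generalization.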

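Contribution: 1. Proposes GNSP, which projects task-specific gradients onto the null space of previously learned knowledge so that new tasks do not interfere with old ones; 2. Introduces knowledge distillation and a modality alignment preservation loss to stabilize the structure of the multimodal embedding space.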

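Method: 1. GNSP projects new-task gradients onto the null space of previous knowledge; 2. This is combined with knowledge distillation and an alignment loss inspired by CLIP pre-training.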
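
The projection step can be illustrated with a minimal sketch, assuming "previous knowledge" is represented by a handful of stored feature vectors from earlier tasks and that the layer is linear. The gradient is stripped of every component lying in the span of those features, so an update along the projected gradient leaves the layer's outputs on old inputs unchanged. Gram-Schmidt stands in here for whatever decomposition the authors actually use; all names are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def orthonormal_basis(vectors, eps=1e-10):
    """Gram-Schmidt basis for the span of the given vectors."""
    basis = []
    for vector in vectors:
        residual = list(vector)
        for unit in basis:
            coeff = dot(residual, unit)
            residual = [r - coeff * u for r, u in zip(residual, unit)]
        norm = dot(residual, residual) ** 0.5
        if norm > eps:  # skip vectors already inside the span
            basis.append([r / norm for r in residual])
    return basis


def project_to_null_space(grad, old_features):
    """Remove from grad every component in the span of old_features."""
    projected = list(grad)
    for unit in orthonormal_basis(old_features):
        coeff = dot(projected, unit)
        projected = [p - coeff * u for p, u in zip(projected, unit)]
    return projected


old_features = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
grad = [1.0, 2.0, 3.0]
safe_grad = project_to_null_space(grad, old_features)

# For a linear unit y = w . x, stepping along safe_grad does not change y
# on any stored old feature, because safe_grad is orthogonal to all of them.
weights = [0.5, -0.2, 0.1]
updated = [w - 0.1 * g for w, g in zip(weights, safe_grad)]
for feature in old_features:
    print(dot(weights, feature), dot(updated, feature))
```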

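Result: Achieves SOTA performance on the 11-task MTIL benchmark on both the Average and Last metrics, while maintaining CLIP's original modality gap and cross-modal retrieval performance.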

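Insight: GNSP's orthogonal projection prevents interference between tasks while retaining the cross-modal generalization of pre-trained CLIP, giving an efficient approach to continual learning for vision-language models.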

Abstract: Contrastive Language-Image Pretraining has demonstrated remarkable zero-shot generalization by aligning visual and textual modalities in a shared embedding space. However, when continuously fine-tuned on diverse tasks, CLIP suffers from catastrophic forgetting and degradation of its embedding alignment, undermining its zero-shot capabilities. In this work, we propose Gradient Null Space Projection (GNSP), an efficient continual learning method that projects task-specific gradients onto the null space of previously learned knowledge. This orthogonal projection mathematically prevents interference with previous tasks without relying on rehearsal or architectural modification. Furthermore, to preserve the inherent generalization property of CLIP, we introduce knowledge distillation and combine it with a modality alignment preservation loss inspired by CLIP pre-training to stabilize the structure of the multimodal embedding space during fine-tuning. On the MTIL benchmark consisting of 11 tasks, our method achieved SOTA performance on both the Average and Last key metrics. More importantly, experiments show that our method successfully maintains the original modality gap and cross-modal retrieval performance of CLIP, confirming its effectiveness in maintaining a robust visual-language space throughout the continual learning process.

[193] Rep-MTL: Unleashing the Power of Representation-level Task Saliency for Multi-Task Learning

Zedong Wang,Siyuan Li,Dan Xu

Main category: cs.LG

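TL;DR: Rep-MTL optimizes multi-task learning through representation-level task saliency, using entropy-based penalization and cross-task alignment to mitigate negative transfer and promote complementary information sharing.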

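Details Motivation: Existing multi-task optimization methods focus mainly on loss scaling and gradient manipulation and overlook the information available in the shared representation space. Rep-MTL explores representation-level task saliency to handle both conflicts and complementarity between tasks more effectively.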

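Contribution: Proposes Rep-MTL, which quantifies task interactions through representation-level task saliency and introduces entropy-based penalization and sample-wise cross-task alignment to mitigate negative transfer and promote complementary sharing.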

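Method: Rep-MTL uses saliency information from the shared representation space, steering it with entropy-based penalization and cross-task alignment to regulate task interactions and improve multi-task performance.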

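Result: On four multi-task learning benchmarks, Rep-MTL achieves competitive performance gains even when paired with the basic equal weighting policy. Power Law exponent analysis indicates that it balances task-specific learning and cross-task sharing well.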

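Insight: Representation-level task saliency captures interactions between tasks and offers an optimization perspective complementary to existing multi-task learning methods.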

Abstract: Despite the promise of Multi-Task Learning in leveraging complementary knowledge across tasks, existing multi-task optimization (MTO) techniques remain fixated on resolving conflicts via optimizer-centric loss scaling and gradient manipulation strategies, yet fail to deliver consistent gains. In this paper, we argue that the shared representation space, where task interactions naturally occur, offers rich information and potential for operations complementary to existing optimizers, especially for facilitating the inter-task complementarity, which is rarely explored in MTO. This intuition leads to Rep-MTL, which exploits the representation-level task saliency to quantify interactions between task-specific optimization and shared representation learning. By steering these saliencies through entropy-based penalization and sample-wise cross-task alignment, Rep-MTL aims to mitigate negative transfer by maintaining the effective training of individual tasks instead of pure conflict-solving, while explicitly promoting complementary information sharing. Experiments are conducted on four challenging MTL benchmarks covering both task-shift and domain-shift scenarios. The results show that Rep-MTL, even paired with the basic equal weighting policy, achieves competitive performance gains with favorable efficiency. Beyond standard performance metrics, Power Law exponent analysis demonstrates Rep-MTL’s efficacy in balancing task-specific learning and cross-task sharing. A project page is available.

eess.IV [Back]

[194] Multi-Attention Stacked Ensemble for Lung Cancer Detection in CT Scans

Uzzal Saha,Surya Prakash

Main category: eess.IV

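TL;DR: The paper proposes a multi-level attention stacked ensemble for binary lung nodule classification (benign vs malignant) in CT scans, combining several pretrained models with attention mechanisms to improve classification performance.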

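Details Motivation: To address challenges in lung nodule classification, particularly class imbalance and cases with high radiologist disagreement, and to provide an automated aid for lung cancer screening.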

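Contribution: 1. Proposes a multi-level attention stacked ensemble that combines model-wise and class-wise attention; 2. Uses dynamic focal loss and several augmentation strategies to improve generalization; 3. Reports state-of-the-art performance on the LIDC-IDRI dataset.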

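Method: 1. EfficientNet V2 S, MobileViT XXS, and DenseNet201 serve as pretrained backbones, each with a custom classification head; 2. A two-stage attention mechanism learns model-wise and class-wise weights; 3. A lightweight meta-learner refines the final prediction.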

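Result: On the LIDC-IDRI dataset the method reaches 98.09 accuracy and 0.9961 AUC, a 35 percent reduction in error rate compared with state-of-the-art methods, with balanced sensitivity (98.73) and specificity (98.96).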

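Insight: Combining attention mechanisms with ensemble learning improves performance, particularly on cases with high radiologist disagreement, and suggests a useful direction for medical image classification.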

Abstract: In this work, we address the challenge of binary lung nodule classification (benign vs malignant) using CT images by proposing a multi-level attention stacked ensemble of deep neural networks. Three pretrained backbones – EfficientNet V2 S, MobileViT XXS, and DenseNet201 – are each adapted with a custom classification head tailored to 96 x 96 pixel inputs. A two-stage attention mechanism learns both model-wise and class-wise importance scores from concatenated logits, and a lightweight meta-learner refines the final prediction. To mitigate class imbalance and improve generalization, we employ dynamic focal loss with empirically calculated class weights, MixUp augmentation during training, and test-time augmentation at inference. Experiments on the LIDC-IDRI dataset demonstrate exceptional performance, achieving 98.09 accuracy and 0.9961 AUC, representing a 35 percent reduction in error rate compared to state-of-the-art methods. The model exhibits balanced performance across sensitivity (98.73) and specificity (98.96), with particularly strong results on challenging cases where radiologist disagreement was high. Statistical significance testing confirms the robustness of these improvements across multiple experimental runs. Our approach can serve as a robust, automated aid for radiologists in lung cancer screening.

[195] SpecBPP: A Self-Supervised Learning Approach for Hyperspectral Representation and Soil Organic Carbon Estimation

Daniel La’ah Ayuba,Jean-Yves Guillemaut,Belen Marti-Cardona,Oscar Mendez Maldonado

Main category: eess.IV

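TL;DR: The paper proposes SpecBPP, a novel self-supervised learning method that learns hyperspectral image representations by predicting the order of spectral bands, and outperforms existing methods on soil organic carbon estimation.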

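Details Motivation: The spectral bands of hyperspectral imagery (HSI) have a distinctive sequential structure, yet self-supervised learning remains underexplored in this domain. The authors exploit spectral continuity to design a new pretext task that improves representation learning.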

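Contribution: Proposes the SpecBPP framework, which uses spectral band order prediction as a self-supervised task; designs a curriculum-based training strategy that progressively increases task difficulty; achieves SOTA performance on soil organic carbon estimation.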

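Method: The pretext task is to predict the correct order of shuffled spectral segments. A curriculum-based training strategy progressively increases permutation complexity, and the model is then fine-tuned on limited labeled data.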
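
A minimal sketch of how such a pretext sample might be built, assuming a spectrum is split into equal-length contiguous segments and the training label is the index of the applied permutation. The function names and the curriculum schedule `(2, 3, 4)` are illustrative assumptions. The number of classes grows factorially with the number of segments, which is why a curriculum that starts with few segments is attractive.

```python
import itertools
import math
import random


def make_permutation_sample(spectrum, num_segments, rng):
    """Shuffle equal-length spectral segments; return data, label, order."""
    seg_len = len(spectrum) // num_segments  # trailing bands are dropped
    segments = [
        spectrum[i * seg_len:(i + 1) * seg_len] for i in range(num_segments)
    ]
    permutations = list(itertools.permutations(range(num_segments)))
    label = rng.randrange(len(permutations))  # class index to predict
    order = permutations[label]
    shuffled = [value for idx in order for value in segments[idx]]
    return shuffled, label, order


def restore_order(shuffled, order):
    """Undo the shuffle given the permutation that was applied."""
    seg_len = len(shuffled) // len(order)
    restored = [None] * len(shuffled)
    for position, original_idx in enumerate(order):
        chunk = shuffled[position * seg_len:(position + 1) * seg_len]
        restored[original_idx * seg_len:(original_idx + 1) * seg_len] = chunk
    return restored


spectrum = list(range(12))  # stand-in for 12 band reflectances
rng = random.Random(0)
for num_segments in (2, 3, 4):  # hypothetical curriculum: easy to hard
    shuffled, label, order = make_permutation_sample(spectrum, num_segments, rng)
    assert restore_order(shuffled, order) == spectrum
    print(num_segments, "segments ->", math.factorial(num_segments), "classes")
```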

Result: On EnMAP satellite data, SpecBPP achieves an $R^2$ of 0.9456, RMSE of 1.1053%, and RPD of 4.19, significantly outperforming baselines such as MAE and JEPA.
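
For readers unfamiliar with the three reported metrics, the sketch below computes them from paired reference and predicted values. It assumes the common definition of RPD as the standard deviation of the reference values divided by the RMSE, using the sample standard deviation; the paper may follow a different convention. The example values are made up for illustration.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return (R^2, RMSE, RPD) for paired reference and predicted values."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot
    rmse = math.sqrt(ss_res / n)
    # Assumed convention: sample standard deviation (n - 1 denominator).
    std_true = math.sqrt(ss_tot / (n - 1))
    rpd = std_true / rmse
    return r2, rmse, rpd


r2, rmse, rpd = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.5, 2.0, 3.0, 3.5])
print(round(r2, 4), round(rmse, 4), round(rpd, 4))
```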

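Insight: Spectral order prediction is a powerful pretext task for learning hyperspectral representations and opens new avenues for scientific representation learning in remote sensing and beyond.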

Abstract: Self-supervised learning has revolutionized representation learning in vision and language, but remains underexplored for hyperspectral imagery (HSI), where the sequential structure of spectral bands offers unique opportunities. In this work, we propose Spectral Band Permutation Prediction (SpecBPP), a novel self-supervised learning framework that leverages the inherent spectral continuity in HSI. Instead of reconstructing masked bands, SpecBPP challenges a model to recover the correct order of shuffled spectral segments, encouraging global spectral understanding. We implement a curriculum-based training strategy that progressively increases permutation difficulty to manage the factorial complexity of the permutation space. Applied to Soil Organic Carbon (SOC) estimation using EnMAP satellite data, our method achieves state-of-the-art results, outperforming both masked autoencoder (MAE) and joint-embedding predictive (JEPA) baselines. Fine-tuned on limited labeled samples, our model yields an $R^2$ of 0.9456, RMSE of 1.1053%, and RPD of 4.19, significantly surpassing traditional and self-supervised benchmarks. Our results demonstrate that spectral order prediction is a powerful pretext task for hyperspectral understanding, opening new avenues for scientific representation learning in remote sensing and beyond.

[196] Taming Domain Shift in Multi-source CT-Scan Classification via Input-Space Standardization

Chia-Ming Lee,Bo-Cheng Qiu,Ting-Yao Chen,Ming-Han Sun,Fang-Ying Lin,Jung-Tse Tsai,I-An Tsai,Yu-Fan Lin,Chih-Chung Hsu

Main category: eess.IV

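TL;DR: The paper studies input-space standardization (the SSFL++ and KDS pipeline) as a way to mitigate domain shift in multi-source CT-scan classification and improve cross-source generalization.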

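Details Motivation: Multi-source CT-scan classification suffers from domain shift that impairs cross-source generalization. Preprocessing pipelines combining SSFL++ and KDS have shown empirical success, but the mechanisms behind their robustness remain underexplored.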

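Contribution: The paper analyzes how SSFL++ and KDS reduce inter-source variance through spatial and temporal standardization, mapping inputs into a consistent target space and thereby improving cross-source generalization.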

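Method: The SSFL++ and KDS pipeline applies spatial and temporal standardization to the input data, reducing inter-source variance and simplifying the learning task for network optimization.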

Result: Experiments show consistent improvements across architectures, and the approach secured first place in a competitive challenge.

Insight: Input-space standardization is a robust and practical way to handle domain shift in multi-institutional medical imaging.

Abstract: Multi-source CT-scan classification suffers from domain shifts that impair cross-source generalization. While preprocessing pipelines combining Spatial-Slice Feature Learning (SSFL++) and Kernel-Density-based Slice Sampling (KDS) have shown empirical success, the mechanisms underlying their domain robustness remain underexplored. This study analyzes how this input-space standardization manages the trade-off between local discriminability and cross-source generalization. The SSFL++ and KDS pipeline performs spatial and temporal standardization to reduce inter-source variance, effectively mapping disparate inputs into a consistent target space. This preemptive alignment mitigates domain shift and simplifies the learning task for network optimization. Experimental validation demonstrates consistent improvements across architectures, proving the benefits stem from the preprocessing itself. The approach’s effectiveness was validated by securing first place in a competitive challenge, supporting input-space standardization as a robust and practical solution for multi-institutional medical imaging.

[197] SkinDualGen: Prompt-Driven Diffusion for Simultaneous Image-Mask Generation in Skin Lesions

Zhaobin Xu

Main category: eess.IV

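TL;DR: The paper adapts the pretrained Stable Diffusion-2.0 model with domain-specific LoRA fine-tuning and joint optimization of multi-objective loss functions to generate high-quality skin lesion images and segmentation masks from text descriptions in a single step, addressing data scarcity and class imbalance in medical imaging.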

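Details Motivation: Medical image analysis is pivotal for early disease diagnosis, but data scarcity and class imbalance significantly hinder the performance of deep learning models.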

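Contribution: Proposes a Stable Diffusion-2.0-based method that generates images and segmentation masks in a single step, improving performance on classification and segmentation tasks.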

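Method: Domain-specific LoRA fine-tuning and joint optimization of multi-objective loss functions enable single-step generation of images and segmentation masks from text descriptions.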

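Result: FID scores indicate that the generated images closely resemble real images in quality. A hybrid dataset of real and synthetic data improves accuracy and F1-score by 8% to 15%, with additional gains in the Dice coefficient and IoU.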
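
The Dice coefficient and IoU mentioned above are standard overlap measures for segmentation masks. A minimal sketch on flattened binary masks follows; treating two empty masks as a perfect match is an assumed convention, and the function name is illustrative.

```python
def dice_and_iou(pred_mask, true_mask):
    """Dice coefficient and IoU for two flattened binary (0/1) masks."""
    intersection = sum(p & t for p, t in zip(pred_mask, true_mask))
    pred_area = sum(pred_mask)
    true_area = sum(true_mask)
    union = pred_area + true_area - intersection
    # Assumed convention: two empty masks count as a perfect match.
    dice = 2 * intersection / (pred_area + true_area) if union else 1.0
    iou = intersection / union if union else 1.0
    return dice, iou


dice, iou = dice_and_iou([1, 1, 0, 0], [1, 0, 1, 0])
print(dice, iou)
```

The two measures are monotonically related (Dice = 2·IoU / (1 + IoU)), so they rank predictions identically on a single mask pair but weight errors differently when averaged over a dataset.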

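Insight: The method offers a scalable solution to data scarcity and class imbalance in medical imaging, contributing to more accurate and reliable diagnosis of rare diseases.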

Abstract: Medical image analysis plays a pivotal role in the early diagnosis of diseases such as skin lesions. However, the scarcity of data and the class imbalance significantly hinder the performance of deep learning models. We propose a novel method that leverages the pretrained Stable Diffusion-2.0 model to generate high-quality synthetic skin lesion images and corresponding segmentation masks. This approach augments training datasets for classification and segmentation tasks. We adapt Stable Diffusion-2.0 through domain-specific Low-Rank Adaptation (LoRA) fine-tuning and joint optimization of multi-objective loss functions, enabling the model to simultaneously generate clinically relevant images and segmentation masks conditioned on textual descriptions in a single step. Experimental results show that the generated images, validated by FID scores, closely resemble real images in quality. A hybrid dataset combining real and synthetic data markedly enhances the performance of classification and segmentation models, achieving substantial improvements in accuracy and F1-score of 8% to 15%, with additional positive gains in other key metrics such as the Dice coefficient and IoU. Our approach offers a scalable solution to address the challenges of medical imaging data, contributing to improved accuracy and reliability in diagnosing rare diseases.

[198] Onboard Hyperspectral Super-Resolution with Deep Pushbroom Neural Network

Davide Piccinini,Diego Valsesia,Enrico Magli

Main category: eess.IV

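TL;DR: Proposes DPSR, a lightweight deep network for real-time spatial super-resolution of hyperspectral images onboard satellites. It matches the pushbroom acquisition of the sensor and greatly reduces memory requirements and computational complexity.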

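Details Motivation: Hyperspectral imagers capture fine spectral signatures at the expense of spatial resolution, and inference onboard satellites calls for lightweight super-resolution methods that run in real time.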

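Contribution: Designs the DPSR network, which matches pushbroom sensors through line-by-line processing with a causal memory mechanism, achieving real-time super-resolution on low-power hardware.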

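Method: The image is processed line by line in the along-track direction, matching pushbroom acquisition, with a causal memory mechanism that exploits previously acquired lines. This limits memory and compute requirements.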
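
The streaming pattern described here can be sketched with a toy stand-in: each incoming line is fused with a fixed-size memory of past lines and then upsampled, so memory use does not grow with image height and no future line is ever needed. The running-average memory and nearest-neighbour upsampling below are placeholders for the learned components of DPSR, and every name and parameter is hypothetical.

```python
def super_resolve_stream(lines, scale=2, decay=0.5):
    """Yield `scale` upsampled rows per input line, using only past lines."""
    memory = None  # fixed-size state summarizing previously seen lines
    for line in lines:
        if memory is None:
            memory = list(line)
        else:
            # Causal update: the state depends only on current and past lines.
            memory = [decay * m + (1.0 - decay) * x for m, x in zip(memory, line)]
        fused = [0.5 * (x + m) for x, m in zip(line, memory)]
        # Nearest-neighbour upsampling across-track, then along-track.
        row = [value for value in fused for _ in range(scale)]
        yield [list(row) for _ in range(scale)]


acquired = [[1.0, 2.0], [3.0, 4.0]]
for output_rows in super_resolve_stream(acquired):
    print(output_rows)
```

Because the function is a generator, each output block is available as soon as its input line arrives, which mirrors the real-time constraint of super-resolving one line in the time it takes to acquire the next.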

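Result: Experiments show that the quality of DPSR's super-resolved images is competitive with, or even exceeds, that of significantly more complex state-of-the-art methods.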

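Insight: DPSR offers a practical lightweight solution for real-time hyperspectral super-resolution onboard satellites, advancing onboard real-time processing.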

Abstract: Hyperspectral imagers on satellites obtain the fine spectral signatures essential for distinguishing one material from another at the expense of limited spatial resolution. Enhancing the latter is thus a desirable preprocessing step in order to further improve the detection capabilities offered by hyperspectral images on downstream tasks. At the same time, there is growing interest in deploying inference methods directly onboard satellites, which calls for lightweight image super-resolution methods that can be run on the payload in real time. In this paper, we present a novel neural network design, called Deep Pushbroom Super-Resolution (DPSR), that matches the pushbroom acquisition of hyperspectral sensors by processing an image line by line in the along-track direction with a causal memory mechanism to exploit previously acquired lines. This design greatly limits memory requirements and computational complexity, achieving onboard real-time performance, i.e., the ability to super-resolve a line in the time it takes to acquire the next one, on low-power hardware. Experiments show that the quality of the super-resolved images is competitive with, or even exceeds, that of state-of-the-art methods that are significantly more complex.