cs.CL

[1] MORABLES: A Benchmark for Assessing Abstract Moral Reasoning in LLMs with Fables

Matteo Marcuzzo,Alessandro Zangari,Andrea Albarelli,Jose Camacho-Collados,Mohammad Taher Pilehvar

Main category: cs.CL

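TL;DR: MORABLES is a benchmark built from fables and short stories for evaluating the abstract moral reasoning of large language models (LLMs). Larger models outperform smaller ones, but they remain susceptible to adversarial manipulation and rely on superficial patterns rather than true moral reasoning.

Motivation: As LLMs excel on standard reading comprehension tasks, research is shifting toward evaluating complex abstract reasoning and deeper comprehension, moral reasoning in particular. The rich narratives of fables and stories provide a fitting framework for this.

Contribution: Introduces the MORABLES benchmark of human-verified multiple-choice questions targeting moral inference; designs adversarial variants to test model robustness; finds that performance is driven mainly by scale rather than reasoning ability.

Method: Selects fables and short stories from historical literature and builds multiple-choice tasks with carefully crafted distractors; designs adversarial variants to expose model weaknesses; evaluates LLMs of different sizes, including reasoning-enhanced models.

Result: Larger models outperform smaller ones but are brittle under adversarial inputs and often rely on superficial patterns rather than deep reasoning. The best models refute their own answers in roughly 20% of cases, depending on how the moral choice is framed, and reasoning-enhanced models fail to bridge this gap.

Insight: Performance on moral reasoning tasks is driven mainly by scale rather than reasoning ability; adversarial testing exposes model brittleness and suggests that current approaches still fall short of genuine abstract reasoning.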

Abstract: As LLMs excel on standard reading comprehension benchmarks, attention is shifting toward evaluating their capacity for complex abstract reasoning and inference. Literature-based benchmarks, with their rich narrative and moral depth, provide a compelling framework for evaluating such deeper comprehension skills. Here, we present MORABLES, a human-verified benchmark built from fables and short stories drawn from historical literature. The main task is structured as multiple-choice questions targeting moral inference, with carefully crafted distractors that challenge models to go beyond shallow, extractive question answering. To further stress-test model robustness, we introduce adversarial variants designed to surface LLM vulnerabilities and shortcuts due to issues such as data contamination. Our findings show that, while larger models outperform smaller ones, they remain susceptible to adversarial manipulation and often rely on superficial patterns rather than true moral reasoning. This brittleness results in significant self-contradiction, with the best models refuting their own answers in roughly 20% of cases depending on the framing of the moral choice. Interestingly, reasoning-enhanced models fail to bridge this gap, suggesting that scale - not reasoning ability - is the primary driver of performance.

[2] SENTRA: Selected-Next-Token Transformer for LLM Text Detection

Mitchell Plyler,Yilun Zhang,Alexander Tuzhilin,Saoud Khalifah,Sen Tian

Main category: cs.CL

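TL;DR: SENTRA is a Transformer-based detector of LLM-generated text. Using selected-next-token-probability sequences and contrastive pre-training, it significantly outperforms popular baselines, especially in the out-of-domain setting.

Motivation: As LLMs become more capable and widespread, misuse of their output is growing, which calls for a general-purpose detector of LLM-generated text that is not declared as such.

Contribution: Proposes SENTRA, a general-purpose, supervised, Transformer-based detector that uses selected-next-token-probability sequences and contrastive pre-training.

Method: SENTRA is a Transformer encoder that takes selected-next-token-probability sequences as input and is contrastively pre-trained on large amounts of unlabeled data.

Result: In experiments on three public datasets covering 24 text domains, SENTRA significantly outperforms popular baselines in the out-of-domain setting.

Insight: Contrastive pre-training combined with token-probability sequences improves out-of-domain detection and offers a new direction for LLM text detection.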

Abstract: LLMs are becoming increasingly capable and widespread. Consequently, the potential and reality of their misuse is also growing. In this work, we address the problem of detecting LLM-generated text that is not explicitly declared as such. We present a novel, general-purpose, and supervised LLM text detector, SElected-Next-Token tRAnsformer (SENTRA). SENTRA is a Transformer-based encoder leveraging selected-next-token-probability sequences and utilizing contrastive pre-training on large amounts of unlabeled data. Our experiments on three popular public datasets across 24 domains of text demonstrate SENTRA is a general-purpose classifier that significantly outperforms popular baselines in the out-of-domain setting.

[3] MORQA: Benchmarking Evaluation Metrics for Medical Open-Ended Question Answering

Wen-wai Yim,Asma Ben Abacha,Zixuan Yu,Robert Doerning,Fei Xia,Meliha Yetisgen

Main category: cs.CL

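TL;DR: MORQA is a new multilingual benchmark for assessing natural language generation (NLG) evaluation metrics on medical open-ended question answering. It compares traditional metrics and LLM-based evaluators against gold-standard answers and expert ratings.

Motivation: Open-ended medical QA places high demands on accuracy, relevance, and domain expertise. Traditional automatic metrics such as BLEU struggle to distinguish between high-quality answers, so more effective evaluation methods are needed.

Contribution: 1) Introduces the MORQA benchmark, covering multilingual medical QA datasets; 2) provides the first comprehensive comparison of traditional metrics and LLM-based evaluators in this setting; 3) analyzes the strengths of LLM-based evaluators and the factors behind them.

Method: Uses 2-4+ gold-standard answers per question authored by medical professionals, together with expert human ratings, to compare how well traditional metrics (BLEU, ROUGE, BERTScore) and LLM-based evaluators (such as GPT-4 and Gemini) correlate with expert judgments on English and Chinese datasets.

Result: LLM-based evaluators significantly outperform traditional metrics in correlating with expert judgments, helped by their sensitivity to semantic nuances and robustness to variability among reference answers.

Insight: Evaluation of medical QA should prioritize agreement with human expert judgment; LLM-based approaches have an advantage in semantic understanding and in handling multiple reference answers.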

Abstract: Evaluating natural language generation (NLG) systems in the medical domain presents unique challenges due to the critical demands for accuracy, relevance, and domain-specific expertise. Traditional automatic evaluation metrics, such as BLEU, ROUGE, and BERTScore, often fall short in distinguishing between high-quality outputs, especially given the open-ended nature of medical question answering (QA) tasks where multiple valid responses may exist. In this work, we introduce MORQA (Medical Open-Response QA), a new multilingual benchmark designed to assess the effectiveness of NLG evaluation metrics across three medical visual and text-based QA datasets in English and Chinese. Unlike prior resources, our datasets feature 2-4+ gold-standard answers authored by medical professionals, along with expert human ratings for three English and Chinese subsets. We benchmark both traditional metrics and large language model (LLM)-based evaluators, such as GPT-4 and Gemini, finding that LLM-based approaches significantly outperform traditional metrics in correlating with expert judgments. We further analyze factors driving this improvement, including LLMs’ sensitivity to semantic nuances and robustness to variability among reference answers. Our results provide the first comprehensive, multilingual qualitative study of NLG evaluation in the medical domain, highlighting the need for human-aligned evaluation methods. All datasets and annotations will be publicly released to support future research.
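
To make "correlating with expert judgments" concrete, here is a minimal sketch of how a metric's agreement with expert ratings could be scored. The abstract does not say which correlation coefficient the authors use; Spearman rank correlation is an assumption chosen for illustration, and the function names are hypothetical.

```python
def average_ranks(values):
    """1-based ranks; tied values share the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the group of entries tied with order[i].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (non-constant)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(metric_scores, expert_ratings):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(metric_scores), average_ranks(expert_ratings))
```

A metric whose scores order the answers the same way the experts do gets a value near 1; one that orders them in reverse gets a value near -1.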

[4] MedFact: Benchmarking the Fact-Checking Capabilities of Large Language Models on Chinese Medical Texts

Jiayi He,Yangmin Huang,Qianyun Du,Xiangying Zhou,Zhiyang He,Jiaxue Hu,Xiaodong Tao,Lixian Lai

Main category: cs.CL

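TL;DR: MedFact is a new benchmark for Chinese medical fact-checking with 2,116 expert-annotated instances spanning 13 medical specialties, 8 error types, and 4 writing styles. A comprehensive evaluation of 20 leading large language models (LLMs) shows that models can often detect that an error exists, but precisely localizing it remains difficult.

Motivation: With LLMs increasingly deployed in healthcare, their factual reliability needs rigorous testing. Existing benchmarks cover narrow domains and fail to capture the complexity of real-world medical information.

Contribution: MedFact fills the resource gap for Chinese medical fact-checking with high-quality, diverse data, and uses a hybrid AI-human framework to ensure both data quality and difficulty.

Method: Builds the dataset with an AI-driven, multi-criteria filtering process refined by iterative expert feedback, then evaluates 20 LLMs on veracity classification and error localization against a human expert baseline.

Result: LLMs perform reasonably on error detection, but even top-performing models fall short of human experts on localization, and models frequently show "over-criticism", flagging correct information as erroneous.

Insight: Advanced reasoning techniques such as multi-agent collaboration and inference-time scaling exacerbate the over-criticism tendency, which points to the need for more factually reliable models in medical applications.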

Abstract: The increasing deployment of Large Language Models (LLMs) in healthcare necessitates a rigorous evaluation of their factual reliability. However, existing benchmarks are often limited by narrow domains of data, failing to capture the complexity of real-world medical information. To address this critical gap, we introduce MedFact, a new and challenging benchmark for Chinese medical fact-checking. MedFact comprises 2,116 expert-annotated instances curated from diverse real-world texts, spanning 13 medical specialties, 8 fine-grained error types, 4 writing styles, and multiple difficulty levels. Its construction employs a hybrid AI-human framework where iterative expert feedback refines an AI-driven, multi-criteria filtering process, ensuring both high data quality and difficulty. We conduct a comprehensive evaluation of 20 leading LLMs, benchmarking their performance on veracity classification and error localization against a human expert baseline. Our results reveal that while models can often determine if a text contains an error, precisely localizing it remains a substantial challenge, with even top-performing models falling short of human performance. Furthermore, our analysis uncovers a frequent "over-criticism" phenomenon, a tendency for models to misidentify correct information as erroneous, which is exacerbated by advanced reasoning techniques such as multi-agent collaboration and inference-time scaling. By highlighting these critical challenges for deploying LLMs in medical applications, MedFact provides a robust resource to drive the development of more factually reliable and medically aware models.

[5] Audited Reasoning Refinement: Fine-Tuning Language Models via LLM-Guided Step-Wise Evaluation and Correction

Sumanta Bhattacharyya,Sara Riaz,Pedram Rooshenas

Main category: cs.CL

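TL;DR: The paper proposes R2tA, a method that trains task-specific small reasoning models on LLM-generated and refined intermediate reasoning traces, addressing the scarcity of labeled data.

Motivation: Training task-specific small models is hard when direct human supervision or high-quality labels are scarce. The intermediate reasoning traces produced by LLMs can be systematically refined into effective supervision signals.

Contribution: Proposes Reason-Refine-then-Align (R2tA), which generates, refines, and aligns LLM reasoning traces to build a high-fidelity dataset, then trains a task-specific model with two-stage alignment: supervised fine-tuning (SFT) followed by direct preference optimization (DPO).

Method: 1. Generate initial reasoning and responses from an open-source base model; 2. refine the reasoning traces to fix hallucinations and inconsistencies; 3. train the model with two-stage alignment (SFT, then DPO).

Result: On the task of evaluating extended entity relationship diagrams (EERDs), using 600 EERD variants with induced mistakes across 11 categories, the evaluation suggests R2tA offers a practical, cost-effective path to scalable LLM adaptation.

Insight: R2tA shows the potential of using LLMs to produce high-quality supervision signals in data-scarce domains, enabling reproducible AI tools for education and other complex tasks.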

Abstract: Training a task-specific small reasoning model is challenging when direct human supervision or high-quality labels are scarce. However, LLMs with reasoning capabilities produce abundant intermediate reasoning traces that can be systematically refined to create effective supervision signals. We propose Reason-Refine-then-Align (R2tA), which turns refined model rationales into supervision for training task-specific reasoning models. Our method generates initial reasoning and responses from an open-source base model on task-specific inputs, then refines these traces, fixing hallucinations and inconsistencies, to form a high-fidelity dataset. We perform a two-stage alignment, supervised fine-tuning (SFT), followed by direct preference optimization (DPO) to calibrate the model’s intermediate reasoning with human-validated conceptual preferences and then condition the final output on that aligned reasoning. As a case study, we apply R2tA to evaluate extended entity relationship diagrams (EERDs) in database system design, a structurally complex task where prompt-only methods miss or hallucinate errors. We curated a dataset of 600 EERD variants (train/test split of 450/150, respectively) with induced mistakes spanning 11 categories. Empirical evaluation suggests R2tA provides a practical, cost-effective path to scalable LLM adaptation in data-scarce domains, enabling reproducible AI tools for education and beyond.
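
For readers unfamiliar with the second alignment stage, the following is a minimal sketch of the standard DPO objective for a single preference pair. It illustrates the published DPO loss in general, not the authors' training code; the `beta` value and the log-probability inputs are illustrative assumptions.

```python
import math


def softplus(x):
    """log(1 + exp(x)), written to stay stable for large |x|."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of sequence log-probabilities.

    beta=0.1 is an assumed value; the abstract gives no hyperparameters.
    """
    # Implicit rewards: how much more the policy favors each response
    # than the frozen reference model does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)) equals softplus(-logits).
    return softplus(-logits)
```

When the policy equals the reference, the loss is log 2; it falls below that as the policy shifts probability toward the preferred response and away from the rejected one.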

[6] FunAudio-ASR Technical Report

Keyu An,Yanni Chen,Chong Deng,Changfeng Gao,Zhifu Gao,Bo Gong,Xiangang Li,Yabin Li,Xiang Lv,Yunjie Ji,Yiheng Jiang,Bin Ma,Haoneng Luo,Chongjia Ni,Zexu Pan,Yiping Peng,Zhendong Peng,Peiyao Wang,Hao Wang,Wen Wang,Wupeng Wang,Biao Tian,Zhentao Tan,Nan Yang,Bin Yuan,Jieping Ye,Jixing Yu,Qinglin Zhang,Kun Zou,Han Zhao,Shengkui Zhao,Jingren Zhou

Main category: cs.CL

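TL;DR: FunAudio-ASR is a large-scale automatic speech recognition (ASR) system based on large language models (LLMs). By combining massive data, large model capacity, LLM integration, and reinforcement learning, it achieves state-of-the-art performance across diverse and complex speech recognition scenarios and is optimized for practical deployment.

Motivation: Existing LLM-based ASR systems perform well on open-source benchmarks but often underperform on real industry evaluation sets, and LLM hallucination can significantly degrade user experience.

Contribution: Presents FunAudio-ASR, which combines data scaling, model size scaling, and deep LLM integration with reinforcement learning, targeting the hallucination problem and the requirements of practical deployment.

Method: Trains with massive data, large model capacity, LLM integration, and reinforcement learning, and optimizes the system for practical needs such as streaming capability, noise robustness, code-switching, and hotword customization.

Result: Achieves state-of-the-art performance on real application datasets, demonstrating effectiveness and robustness in practical settings.

Insight: LLM-based ASR systems need production-oriented optimizations (streaming, noise handling, and so on) to perform well in industrial use; strong scores on open-source benchmarks alone are not enough.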

Abstract: In recent years, automatic speech recognition (ASR) has witnessed transformative advancements driven by three complementary paradigms: data scaling, model size scaling, and deep integration with large language models (LLMs). However, LLMs are prone to hallucination, which can significantly degrade user experience in real-world ASR applications. In this paper, we present FunAudio-ASR, a large-scale, LLM-based ASR system that synergistically combines massive data, large model capacity, LLM integration, and reinforcement learning to achieve state-of-the-art performance across diverse and complex speech recognition scenarios. Moreover, FunAudio-ASR is specifically optimized for practical deployment, with enhancements in streaming capability, noise robustness, code-switching, hotword customization, and satisfying other real-world application requirements. Experimental results show that while most LLM-based ASR systems achieve strong performance on open-source benchmarks, they often underperform on real industry evaluation sets. Thanks to production-oriented optimizations, FunAudio-ASR achieves SOTA performance on real application datasets, demonstrating its effectiveness and robustness in practical settings.

[7] A comparison of pipelines for the translation of a low resource language based on transformers

Chiara Bonfanti,Michele Colombino,Giulia Coucourde,Faeze Memari,Stefano Pinardi,Rosa Meo

Main category: cs.CL

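TL;DR: This work compares three pipelines for training transformer-based neural networks to translate French into Bambara, a low-resource language. The simple transformer performs best, which is consistent with other low-resource translation results.

Motivation: Bambara is a low-resource African language that lacks high-quality machine translation tools. The work compares training pipelines to find a practical solution for low-resource translation.

Contribution: 1) Compares three transformer training pipelines on Bambara translation; 2) introduces the new Yiri dataset for evaluating translation quality; 3) shows the advantage of a simple model in low-resource translation.

Method: 1) Train a simple transformer directly on the translation task; 2) fine-tune decoder-only LLaMA3 (3B-8B) instructor models; 3) use language distillation with a student-teacher network to integrate Bambara into a pre-trained LaBSE model.

Result: On Bayelemagaba, the simple transformer achieves the best scores (10% BLEU, 21% chrF). On the Yiri dataset it reaches 33.81% BLEU and 41% chrF. The LLaMA3-based instructor models perform better on single datasets than on aggregated collections.

Insight: Simple models may be more robust for low-resource translation, while fine-tuned instructor models are better at capturing dataset-specific patterns. Language distillation shows potential for integrating low-resource languages into pre-trained models.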

Abstract: This work compares three pipelines for training transformer-based neural networks to produce machine translators for Bambara, a Mandé language spoken in Africa by about 14,188,850 people. The first pipeline trains a simple transformer to translate sentences from French into Bambara. The second fine-tunes LLaMA3 (3B-8B) instructor models using decoder-only architectures for French-to-Bambara translation. Models from the first two pipelines were trained with different hyperparameter combinations to improve BLEU and chrF scores, evaluated on both test sentences and official Bambara benchmarks. The third pipeline uses language distillation with a student-teacher dual neural network to integrate Bambara into a pre-trained LaBSE model, which provides language-agnostic embeddings. A BERT extension is then applied to LaBSE to generate translations. All pipelines were tested on Dokotoro (medical) and Bayelemagaba (mixed domains). Results show that the first pipeline, although simpler, achieves the best translation accuracy (10% BLEU, 21% chrF on Bayelemagaba), consistent with low-resource translation results. On the Yiri dataset, created for this work, it achieves 33.81% BLEU and 41% chrF. Instructor-based models perform better on single datasets than on aggregated collections, suggesting they capture dataset-specific patterns more effectively.

[8] MAGIC-Enhanced Keyword Prompting for Zero-Shot Audio Captioning with CLIP Models

Vijay Govindarajan,Pratik Patel,Sahil Tripathi,Md Azizul Hoque,Gautam Siddharth Kashyap

Main category: cs.CL

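TL;DR: The paper proposes a zero-shot audio captioning system that uses a pre-trained audio CLIP model to extract features and build a structured prompt, which then guides an LLM to generate the caption, yielding a substantial performance gain.

Motivation: Audio captioning datasets are limited and traditional methods need large amounts of training data. The paper uses pre-trained models for zero-shot captioning to reduce that data dependence.

Contribution: 1. Proposes a zero-shot audio captioning system; 2. uses an audio CLIP model to build structured prompts and to refine decoding; 3. reports a 35% improvement in experiments.

Method: 1. Extract auditory features with a pre-trained audio CLIP model; 2. generate a structured keyword prompt that guides the LLM's caption generation; 3. replace greedy decoding with MAGIC search, which refines token selection through the audio CLIP model.

Result: With MAGIC search and the WavCaps model, the NLG mean score rises from 4.7 to 7.3, reported as a 35% improvement.

Insight: 1. The audio-text matching model and keyword selection strongly influence performance; 2. a single-keyword prompt works best; 3. performance drops by 50% when no keyword list is used.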

Abstract: Automated Audio Captioning (AAC) generates captions for audio clips but faces challenges due to limited datasets compared to image captioning. To overcome this, we propose the zero-shot AAC system that leverages pre-trained models, eliminating the need for extensive training. Our approach uses a pre-trained audio CLIP model to extract auditory features and generate a structured prompt, which guides a Large Language Model (LLM) in caption generation. Unlike traditional greedy decoding, our method refines token selection through the audio CLIP model, ensuring alignment with the audio content. Experimental results demonstrate a 35% improvement in NLG mean score (from 4.7 to 7.3) using MAGIC search with the WavCaps model. The performance is heavily influenced by the audio-text matching model and keyword selection, with optimal results achieved using a single keyword prompt, and a 50% performance drop when no keyword list is used.

[9] EconProver: Towards More Economical Test-Time Scaling for Automated Theorem Proving

Mukai Li,Linfeng Song,Zhenwen Liang,Jiahao Xu,Shansan Gong,Qi Liu,Haitao Mi,Dong Yu

Main category: cs.CL

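TL;DR: EconProver proposes two complementary methods, a dynamic CoT switching mechanism and diverse parallel-scaled reinforcement learning, that cut computational cost while preserving automated theorem proving performance. Experiments show performance comparable to baselines at only 12% of the computational cost.

Motivation: The test-time scaling strategies widely used in automated theorem proving (reflective CoT reasoning and more sampling passes) add significant computational overhead, and existing cost analyses neglect the differences in sampling cost between strategies.

Contribution: 1) A dynamic CoT switching mechanism that reduces unnecessary token consumption; 2) diverse parallel-scaled reinforcement learning with trainable prefixes that improves pass rates under constrained sampling passes; 3) integration of both into a unified EconRL pipeline that greatly reduces computational cost.

Method: 1) Dynamic CoT switching: enable or skip CoT reasoning dynamically; 2) diverse parallel-scaled RL: use trainable prefixes to improve sampling efficiency; 3) combine the two in the EconRL pipeline.

Result: Experiments on miniF2F and ProofNet show performance comparable to baseline methods with only 12% of the computational cost.

Insight: Combining a dynamic reasoning strategy with more efficient sampling makes ATP models markedly more economical and offers a practical route to lightweight deployment.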

Abstract: Large Language Models (LLMs) have recently advanced the field of Automated Theorem Proving (ATP), attaining substantial performance gains through widely adopted test-time scaling strategies, notably reflective Chain-of-Thought (CoT) reasoning and increased sampling passes. However, they both introduce significant computational overhead for inference. Moreover, existing cost analyses typically regulate only the number of sampling passes, while neglecting the substantial disparities in sampling costs introduced by different scaling strategies. In this paper, we systematically compare the efficiency of different test-time scaling strategies for ATP models and demonstrate the inefficiency of the current state-of-the-art (SOTA) open-source approaches. We then investigate approaches to significantly reduce token usage and sample passes while maintaining the original performance. Specifically, we propose two complementary methods that can be integrated into a unified EconRL pipeline for amplified benefits: (1) a dynamic Chain-of-Thought (CoT) switching mechanism designed to mitigate unnecessary token consumption, and (2) Diverse parallel-scaled reinforcement learning (RL) with trainable prefixes to enhance pass rates under constrained sampling passes. Experiments on miniF2F and ProofNet demonstrate that our EconProver achieves comparable performance to baseline methods with only 12% of the computational cost. This work provides actionable insights for deploying lightweight ATP models without sacrificing performance.
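
The point about cost analyses that count only sampling passes can be illustrated with a small accounting sketch. It assumes cost is measured as passes times average generated tokens per pass; the `Strategy` class and every number below are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Strategy:
    name: str
    passes: int             # sampling passes per theorem
    tokens_per_pass: float  # average generated tokens per pass

    def token_cost(self) -> float:
        """Total generated tokens per theorem under this strategy."""
        return self.passes * self.tokens_per_pass


def relative_cost(candidate: Strategy, baseline: Strategy) -> float:
    """Token cost of `candidate` as a fraction of `baseline`."""
    return candidate.token_cost() / baseline.token_cost()


# Hypothetical numbers for illustration only.
long_cot = Strategy("reflective CoT", passes=32, tokens_per_pass=4000)
short = Strategy("CoT skipped", passes=32, tokens_per_pass=1000)

# Same number of passes, yet a quarter of the token cost.
print(relative_cost(short, long_cot))
```

Two strategies with identical pass budgets can therefore differ several-fold in actual compute, which is the disparity the abstract says pass-only analyses miss.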

[10] Positional Encoding via Token-Aware Phase Attention

Yu Wang,Sheng Shen,Rémi Munos,Hongyuan Zhan,Yuandong Tian

Main category: cs.CL

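TL;DR: The paper proposes TAPA, a new positional encoding method that adds a learnable phase function to the attention mechanism. It addresses RoPE's long-context limitation without the post-hoc adjustments that RoPE extensions typically require.

Motivation: RoPE introduces a distance-dependent bias that limits long-context modeling. Existing extension methods usually need adjustments after pretraining, such as rescaling or hyperparameter retuning, which limits their flexibility.

Contribution: Proposes Token-Aware Phase Attention (TAPA), a positional encoding method that incorporates a learnable phase function into attention, preserving long-range token interactions and extending to new lengths.

Method: TAPA introduces a learnable phase function directly into the attention mechanism, avoiding post-hoc adjustment of the pretrained model; extension to longer contexts uses direct, light fine-tuning.

Result: TAPA attains significantly lower perplexity on long contexts than RoPE-family methods and extrapolates to unseen lengths.

Insight: The learnable phase function is the key element: it lets the model adapt to dependencies at different distances, which improves long-context modeling.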

Abstract: We prove under practical assumptions that Rotary Positional Embedding (RoPE) introduces an intrinsic distance-dependent bias in attention scores that limits RoPE’s ability to model long contexts. RoPE extension methods may alleviate this issue, but they typically require post-hoc adjustments after pretraining, such as rescaling or hyperparameter retuning. This paper introduces Token-Aware Phase Attention (TAPA), a new positional encoding method that incorporates a learnable phase function into the attention mechanism. TAPA preserves token interactions over long ranges, extends to longer contexts with direct and light fine-tuning, extrapolates to unseen lengths, and attains significantly lower perplexity on long contexts than the RoPE family of methods.
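
As background for the claim about RoPE, the sketch below applies the standard rotary embedding to a query and a key and checks that the resulting attention score depends only on their relative offset. It is not the paper's proof and does not implement TAPA; the vectors and positions are made-up illustrative values.

```python
import cmath


def rope(vec, pos, base=10000.0):
    """Apply rotary position embedding to a real vector of even length."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = base ** (-i / d)  # rotation frequency of this 2-D pair
        # Treat the pair as a complex number and rotate it by pos * theta.
        z = complex(vec[i], vec[i + 1]) * cmath.exp(1j * pos * theta)
        out.extend([z.real, z.imag])
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [1.0, 0.5, -0.3, 2.0]
k = [0.2, -1.0, 0.7, 0.4]

score_near = dot(rope(q, 10), rope(k, 7))      # offset 3
score_far = dot(rope(q, 103), rope(k, 100))    # offset 3, shifted positions
score_other = dot(rope(q, 10), rope(k, 5))     # offset 5
```

The first two scores agree up to floating-point error because only the offset enters the score, while the third differs. How the score behaves as a function of that offset is what the paper analyzes.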

[11] PAC: Pronunciation-Aware Contextualized Large Language Model-based Automatic Speech Recognition

Li Fu,Yu Xin,Sunlu Zeng,Lu Fan,Youzheng Wu,Xiaodong He

Main category: cs.CL

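TL;DR: The paper proposes PAC, a pronunciation-aware contextualized framework that addresses pronunciation modeling and homophone discrimination in LLM-based automatic speech recognition (ASR). Its two-stage learning approach substantially reduces word error rate (WER) and the biased WER on long-tail words.

Motivation: In LLM-based ASR systems, modeling pronunciation effectively and discriminating homophones are key challenges, especially for raw or long-tail words.

Contribution: 1. Proposes the Pronunciation-Aware Contextualized (PAC) framework; 2. designs a pronunciation-guided context learning method and a pronunciation-discriminative reinforcement learning method.

Method: 1. Adopts a two-stage learning paradigm; 2. the first stage uses pronunciation-guided context learning with interleaved grapheme-phoneme context modeling and grapheme-only distractors; 3. the second stage uses reinforcement learning with perturbed label sampling to strengthen homophone discrimination.

Result: On Librispeech and AISHELL-1, PAC reduces relative WER by 30.2% and 53.8% compared to pre-trained LLM-based ASR models, and reduces biased WER on long-tail words by 31.8% and 60.5% relative to strong baselines.

Insight: Context modeling that combines pronunciation (phoneme) and grapheme information, together with reinforcement learning, is an effective way to improve ASR, particularly for difficult or long-tail words.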

Abstract: This paper presents a Pronunciation-Aware Contextualized (PAC) framework to address two key challenges in Large Language Model (LLM)-based Automatic Speech Recognition (ASR) systems: effective pronunciation modeling and robust homophone discrimination. Both are essential for raw or long-tail word recognition. The proposed approach adopts a two-stage learning paradigm. First, we introduce a pronunciation-guided context learning method. It employs an interleaved grapheme-phoneme context modeling strategy that incorporates grapheme-only distractors, encouraging the model to leverage phonemic cues for accurate recognition. Then, we propose a pronunciation-discriminative reinforcement learning method with perturbed label sampling to further enhance the model's ability to distinguish contextualized homophones. Experimental results on the public English Librispeech and Mandarin AISHELL-1 datasets indicate that PAC: (1) reduces relative Word Error Rate (WER) by 30.2% and 53.8% compared to pre-trained LLM-based ASR models, and (2) achieves 31.8% and 60.5% relative reductions in biased WER for long-tail words compared to strong baselines, respectively.
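
For reference, a minimal sketch of how WER and a relative WER reduction are conventionally computed. Splitting on whitespace is an assumption that suits English; Mandarin benchmarks such as AISHELL-1 are commonly scored per character instead. The function names are hypothetical and this is not the authors' scoring script.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Levenshtein distance over words, keeping one DP row at a time.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion of a reference word
                curr[j - 1] + 1,         # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction of the baseline error rate that was removed."""
    return (baseline_wer - new_wer) / baseline_wer
```

A relative reduction of 0.30 means the new system makes 30% fewer errors than the baseline, whatever the absolute WER values are.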

[12] Don’t Change My View: Ideological Bias Auditing in Large Language Models

Paul Kröger,Emilio Barkett

Main category: cs.CL

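TL;DR: The paper presents a statistical method for detecting ideological bias in large language models (LLMs) that is suitable for auditing black-box systems.

Motivation: LLM outputs may influence public opinion, so it is important to detect whether a model has been intentionally steered toward specific ideological positions.

Contribution: Adapts a previously proposed, model-agnostic statistical method to ideological bias auditing; it needs no access to model internals and therefore works on black-box systems.

Method: Detects potential ideological steering by analyzing distributional shifts in model outputs across prompts that are thematically related to a chosen topic.

Result: A series of experiments validates the method's practical applicability and its potential to support independent post hoc audits of LLM behavior.

Insight: The method provides a tool for detecting ideological steering of LLMs, supporting transparency and accountability.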

Abstract: As large language models (LLMs) become increasingly embedded in products used by millions, their outputs may influence individual beliefs and, cumulatively, shape public opinion. If the behavior of LLMs can be intentionally steered toward specific ideological positions, such as political or religious views, then those who control these systems could gain disproportionate influence over public discourse. Although it remains an open question whether LLMs can reliably be guided toward coherent ideological stances and whether such steering can be effectively prevented, a crucial first step is to develop methods for detecting when such steering attempts occur. In this work, we adapt a previously proposed statistical method to the new context of ideological bias auditing. Our approach carries over the model-agnostic design of the original framework, which does not require access to the internals of the language model. Instead, it identifies potential ideological steering by analyzing distributional shifts in model outputs across prompts that are thematically related to a chosen topic. This design makes the method particularly suitable for auditing proprietary black-box systems. We validate our approach through a series of experiments, demonstrating its practical applicability and its potential to support independent post hoc audits of LLM behavior.
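
The abstract does not name the statistical method it adapts. As a generic illustration of testing for a distributional shift, the sketch below runs a two-sided permutation test on hypothetical scalar stance scores assigned to model outputs. The permutation test is a stand-in technique chosen for this example, not the paper's method, and the scoring step is assumed to exist.

```python
import random


def permutation_test(baseline, audited, n_permutations=10_000, seed=0):
    """Two-sided permutation test on the difference of sample means.

    `baseline` and `audited` hold hypothetical per-output stance scores.
    Returns an estimated p-value for the null of no distributional shift.
    """
    rng = random.Random(seed)
    n = len(baseline)
    observed = abs(sum(audited) / len(audited) - sum(baseline) / n)
    pooled = list(baseline) + list(audited)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        mean_a = sum(pooled[:n]) / n
        mean_b = sum(pooled[n:]) / (len(pooled) - n)
        if abs(mean_b - mean_a) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_permutations + 1)
```

Because it only needs sampled outputs and a scoring function, a test of this kind fits the black-box setting the abstract describes.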

[13] Mitigating Strategy Preference Bias in Emotional Support Conversation via Uncertainty Estimations

Yougen Zhou,Qin Chen,Ningning Zhou,Jie Zhou,Xingjiao Wu,Liang He

Main category: cs.CL

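TL;DR: The paper analyzes why large language models (LLMs) show preference bias in strategy planning for emotional support conversation (ESC) and proposes a reinforcement learning approach, based on knowledge boundaries and a dual reward function, that reduces the bias and improves strategy planning accuracy.

Motivation: In emotional support conversation, LLMs' preference bias toward specific strategies hurts ESC quality, and the root causes of this bias have not been well studied.

Contribution: Reveals the fundamental cause of the strategy planning bias by identifying the knowledge boundaries of LLMs, and proposes reinforcement learning with a dual reward function to mitigate it.

Method: Identifies the knowledge boundaries of LLMs in strategy planning, then optimizes strategy planning with a dual reward that combines accuracy and entropy-based confidence for each region defined by those boundaries.

Result: Experiments on the ESCov and ExTES datasets with multiple LLM backbones show that the approach outperforms the baselines.

Insight: The strategy preference bias of LLMs is closely tied to their knowledge boundaries; an entropy-based confidence term helps balance the diversity and accuracy of strategy selection.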

Abstract: Emotional support conversation (ESC) aims to alleviate distress through empathetic dialogue, yet large language models (LLMs) face persistent challenges in delivering effective ESC due to low accuracy in strategy planning. Moreover, there is a considerable preference bias towards specific strategies. Prior methods using fine-tuned strategy planners have shown potential in reducing such bias, while the underlying causes of the preference bias in LLMs have not been well studied. To address these issues, we first reveal the fundamental causes of the bias by identifying the knowledge boundaries of LLMs in strategy planning. Then, we propose an approach to mitigate the bias by reinforcement learning with a dual reward function, which optimizes strategy planning via both accuracy and entropy-based confidence for each region according to the knowledge boundaries. Experiments on the ESCov and ExTES datasets with multiple LLM backbones show that our approach outperforms the baselines, confirming the effectiveness of our approach.

[14] Chat-Driven Text Generation and Interaction for Person Retrieval

Zequn Xie,Chuxin Wang,Sihang Cai,Yeqiang Wang,Shulei Wang,Tao Jin

Main category: cs.CL

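TL;DR: The paper proposes an annotation-free framework for text-based person retrieval with two modules, Multi-Turn Text Generation (MTG) and Multi-Turn Text Interaction (MTI), that significantly improves retrieval accuracy and robustness.

Motivation: Conventional text-based person search (TBPS) depends on large amounts of manually annotated text descriptions, which are costly and hard to scale. The paper reduces that dependence by generating pseudo-labels through simulated dialogues and refining queries through dynamic interaction.

Contribution: 1. The MTG module, which uses MLLMs in simulated dialogues to generate diverse pseudo-labels; 2. the MTI module, which refines vague or incomplete queries through interactive dialogue; 3. a unified annotation-free framework that improves the scalability and practicality of TBPS.

Method: 1. MTG: generate fine-grained, diverse visual descriptions through simulated dialogues with MLLMs; 2. MTI: refine user queries at inference time through dynamic dialogue, resolving vague, incomplete, or ambiguous descriptions.

Result: Experiments show competitive or superior retrieval results without manual captions, with clear gains in robustness and usability.

Insight: Generating pseudo-labels through simulated dialogue and refining queries through dynamic interaction is an effective annotation-free approach and opens a path to practical TBPS deployment.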

Abstract: Text-based person search (TBPS) enables the retrieval of person images from large-scale databases using natural language descriptions, offering critical value in surveillance applications. However, a major challenge lies in the labor-intensive process of obtaining high-quality textual annotations, which limits scalability and practical deployment. To address this, we introduce two complementary modules: Multi-Turn Text Generation (MTG) and Multi-Turn Text Interaction (MTI). MTG generates rich pseudo-labels through simulated dialogues with MLLMs, producing fine-grained and diverse visual descriptions without manual supervision. MTI refines user queries at inference time through dynamic, dialogue-based reasoning, enabling the system to interpret and resolve vague, incomplete, or ambiguous descriptions - characteristics often seen in real-world search scenarios. Together, MTG and MTI form a unified and annotation-free framework that significantly improves retrieval accuracy, robustness, and usability. Extensive evaluations demonstrate that our method achieves competitive or superior results while eliminating the need for manual captions, paving the way for scalable and practical deployment of TBPS systems.

[15] Towards Inclusive Toxic Content Moderation: Addressing Vulnerabilities to Adversarial Attacks in Toxicity Classifiers Tackling LLM-generated Content

Shaz Furniturewala,Arkaitz Zubiaga

Main category: cs.CL

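TL;DR: The paper addresses the vulnerability of toxicity classifiers to LLM-generated content and adversarial attacks. It proposes a strategy based on mechanistic interpretability that identifies and suppresses vulnerable circuits to improve robustness and fairness.

Motivation: With the widespread use of large language models (LLMs), machine-generated content has surged, and content moderation classifiers trained on human-written text perform poorly on LLM-generated content and under adversarial attack. Current defences are mostly reactive and do not proactively identify and fix the vulnerable components.

Contribution: 1) A new method that uses mechanistic interpretability techniques to identify vulnerable circuits in toxicity classifiers; 2) evidence that suppressing these circuits improves robustness to adversarial attacks; 3) an analysis of how vulnerability differs across demographic groups, informing more inclusive model development.

Method: Uses fine-tuned BERT and RoBERTa classifiers, applies adversarial attack techniques to identify vulnerable circuits, and suppresses those circuits to improve performance. Experiments cover diverse datasets spanning a variety of minority groups.

Result: Models contain distinct attention heads that are either crucial for performance or vulnerable to attack. Suppressing the vulnerable heads improves performance on adversarial input. Different heads are responsible for vulnerability across different demographic groups, which gives direction for fairness improvements.

Insight: Vulnerability is closely tied to specific model components, which suggests targeted design choices for future toxicity classifiers, in particular how to improve robustness while preserving fairness.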

Abstract: The volume of machine-generated content online has grown dramatically due to the widespread use of Large Language Models (LLMs), leading to new challenges for content moderation systems. Conventional content moderation classifiers, which are usually trained on text produced by humans, suffer from misclassifications due to LLM-generated text deviating from their training data and adversarial attacks that aim to avoid detection. Present-day defence tactics are reactive rather than proactive, since they rely on adversarial training or external detection models to identify attacks. In this work, we aim to identify the vulnerable components of toxicity classifiers that contribute to misclassification, proposing a novel strategy based on mechanistic interpretability techniques. Our study focuses on fine-tuned BERT and RoBERTa classifiers, testing on diverse datasets spanning a variety of minority groups. We use adversarial attacking techniques to identify vulnerable circuits. Finally, we suppress these vulnerable circuits, improving performance against adversarial attacks. We also provide demographic-level insights into these vulnerable circuits, exposing fairness and robustness gaps in model training. We find that models have distinct heads that are either crucial for performance or vulnerable to attack and suppressing the vulnerable heads improves performance on adversarial input. We also find that different heads are responsible for vulnerability across different demographic groups, which can inform more inclusive development of toxicity detection models.

[16] Case-Based Decision-Theoretic Decoding with Quality Memories

Hiroyuki Deguchi,Masaaki Nagata

Main category: cs.CL

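TL;DR: The paper proposes case-based decision-theoretic (CBDT) decoding, which estimates expected utility from examples of domain data to complement minimum Bayes risk (MBR) decoding. It outperforms MAP decoding, and its combination with MBR outperforms MBR alone, on multi-domain translation and image captioning tasks.

Motivation: MBR decoding depends on texts sampled from the generation model itself, so it struggles to capture out-of-domain knowledge. CBDT decoding is proposed to address this.

Contribution: Proposes CBDT decoding, which estimates expected utility from domain data examples and, combined with MBR decoding, improves text generation quality across multi-domain tasks.

Method: CBDT decoding estimates expected utility using examples of domain data and is combined with MBR decoding. Experiments cover De–En and Ja↔En translation tasks and image captioning on the MSCOCO and nocaps datasets.

Result: CBDT decoding outperforms MAP decoding, and the combination of MBR and CBDT decoding outperforms MBR decoding alone.

Insight: Domain data examples give a better estimate of expected utility, especially for out-of-domain knowledge, which makes CBDT decoding an effective complement to MBR.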

Abstract: Minimum Bayes risk (MBR) decoding is a decision rule for text generation that selects the hypothesis maximizing the expected utility and robustly generates higher-quality texts than maximum a posteriori (MAP) decoding. However, it depends on sample texts drawn from the text generation model; thus, it is difficult to find a hypothesis that correctly captures out-of-domain knowledge or information. To tackle this issue, we propose case-based decision-theoretic (CBDT) decoding, another method to estimate the expected utility using examples of domain data. CBDT decoding not only generates higher-quality texts than MAP decoding, but the combination of MBR and CBDT decoding also outperformed MBR decoding in seven domain De–En and Ja↔En translation tasks and image captioning tasks on MSCOCO and nocaps datasets.
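
To illustrate the MBR baseline that CBDT decoding builds on, here is a minimal sketch of sample-based MBR: each candidate is scored by its average utility against the other samples, and the best-scoring one is returned. The set-based unigram F1 utility is a toy stand-in for a real metric, and the sketch does not implement CBDT, whose estimator the abstract does not detail.

```python
def unigram_f1(hyp: str, ref: str) -> float:
    """Toy utility: F1 over sets of whitespace tokens (stand-in metric)."""
    h, r = set(hyp.split()), set(ref.split())
    overlap = len(h & r)
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(h), overlap / len(r)
    return 2 * precision * recall / (precision + recall)


def mbr_decode(candidates, utility):
    """Pick the candidate with the highest mean utility against the rest."""
    if len(candidates) < 2:
        raise ValueError("MBR needs at least two sampled candidates")

    def expected_utility(index):
        # The other samples act as pseudo-references for this hypothesis.
        others = [c for j, c in enumerate(candidates) if j != index]
        return sum(utility(candidates[index], ref) for ref in others) / len(others)

    best = max(range(len(candidates)), key=expected_utility)
    return candidates[best]


samples = [
    "the cat sat on the mat",
    "the cat sat on a mat",
    "a dog ran away",
    "the cat is on the mat",
]
best = mbr_decode(samples, unigram_f1)
```

Because every pseudo-reference is drawn from the model itself, the estimate can only reflect what the model already knows, which is the limitation the abstract attributes to MBR on out-of-domain inputs.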

[17] HistoryBankQA: Multilingual Temporal Question Answering on Historical Events

Biswadip Mandal,Anant Khandelwal,Manish Gupta

Main category: cs.CL

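TL;DR: The paper presents HistoryBank, a multilingual database of 10M+ historical events in 10 languages, builds a temporal question answering benchmark covering 6 task types, and evaluates several large language models (including GPT4o and Gemma-2).

Motivation: Existing temporal reasoning datasets are limited in scale, lack multilingual coverage, and focus on contemporary events, so they cannot adequately assess the temporal reasoning of language models.

Contribution: 1. Builds a large-scale multilingual database of historical events (HistoryBank); 2. proposes a temporal QA benchmark covering 6 task types; 3. evaluates a suite of large language models.

Method: Extracts 10M+ historical events from Wikipedia timeline pages and article infoboxes, designs 6 temporal QA tasks, and tests multiple language models.

Result: GPT4o performs best across all answer types and languages; Gemma-2 is the strongest of the small models.

Insight: The work provides a resource for advancing multilingual, temporally aware natural language understanding and shows both the potential and the limits of language models in reasoning about historical events.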

Abstract: Temporal reasoning about historical events is a critical skill for NLP tasks like event extraction, historical entity linking, temporal question answering, timeline summarization, temporal event clustering and temporal natural language inference. Yet efforts on benchmarking temporal reasoning capabilities of large language models (LLMs) are rather limited. Existing temporal reasoning datasets are limited in scale, lack multilingual coverage and focus more on contemporary events. To address these limitations, we present HistoryBank, a multilingual database of 10M+ historical events extracted from Wikipedia timeline pages and article infoboxes. Our database provides unprecedented coverage in both historical depth and linguistic breadth with 10 languages. Additionally, we construct a comprehensive question answering benchmark for temporal reasoning across all languages. This benchmark covers a diverse set of 6 temporal QA reasoning tasks, and we evaluate a suite of popular language models (LLaMA-3-8B, Mistral-7B, Gemma-2-9b, Qwen3-8B, GPT4o) to assess their performance on these tasks. As expected GPT4o performs best across all answer types and languages; Gemma-2 outperforms the other small language models. Our work aims to provide a comprehensive resource for advancing multilingual and temporally-aware natural language understanding of historical events. To facilitate further research, we will make our code and datasets publicly available upon acceptance of this paper.

[18] Contrastive Learning with Enhanced Abstract Representations using Grouped Loss of Abstract Semantic Supervision

Omri Suissa,Muhiim Ali,Shengmai Chen,Yinuo Cai,Shekhar Pradhan

Main category: cs.CL

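TL;DR: The paper proposes a grouped contrastive loss that strengthens the ability of vision-language models (VLMs) to recognize abstract concepts, and introduces the MAGIC dataset.

Details Motivation: Humans can recognize abstract concepts in images, not just objects and their relationships. The authors investigate whether VLMs have this concept abstraction capacity and propose a method to strengthen it.

Contribution: The main contributions are a grouped contrastive loss with an outer and an inner loss term, and the MAGIC dataset, which together improve the model's recognition of abstract concepts.

Method: Using the grouped image-caption dataset (MAGIC), the authors define an outer grouped contrastive loss and an inner grouped loss and train the model contrastively so that it learns abstract concepts implicitly.

Result: Experiments show that the CLEAR GLASS model improves on SOTA models in abstract concept recognition.

Insight: Injecting abstract-concept information implicitly through contrastive learning lets the model develop abstraction capacity as an emergent ability, without being exposed to the higher-level concepts during training.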

Abstract: Humans can recognize an image as an instance of a general concept, beyond simply identifying its objects and their relationships. In this paper, we investigate (1) the extent to which VLMs have this concept abstraction capacity, and (2) strategies for encoding the sort of higher-concept information in images that would enable the resulting VLM (the CLEAR GLASS model) to have this capability to a greater degree. To this end, we introduce a grouped image-caption dataset (MAGIC), which consists of several groups of image captions and for each group a set of associated images and higher-level conceptual labels. We use a novel contrastive loss technique to induce the model to encode in the representation of each image (caption) in a group the information that is common to all members of the image-caption group. Our main contribution is a grouped contrastive loss function based on text-image contrastive groups (outer contrastive loss) as well as an inner loss which measures the distances between image-caption instances in the group. Our training methodology results in the CLEAR GLASS model having the concept abstraction capacity as an emergent capacity because the model is not exposed to the higher-level concepts associated with each group. Instead, the training forces the model to create for each image-caption group a semantic representation that brings it closer to the semantic representation of the higher-level concepts in the latent semantic space. Our experiments show that this training methodology results in a model which shows improvement in abstract concept recognition compared to SOTA models.

[19] ConvergeWriter: Data-Driven Bottom-Up Article Construction

Binquan Ji,Jiaqi Wang,Ruiting Li,Xingchen Han,Yiyang Qi,Shichao Wang,Yifei Lu,Yuantao Han,Feiliang Ren

Main category: cs.CL

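TL;DR: The paper proposes ConvergeWriter, a “bottom-up,” data-driven framework that uses a “Retrieval-First for Knowledge, Clustering for Structure” strategy to address the content fragmentation and factual inaccuracies of existing “top-down” methods when generating long-form, factual documents.

Details Motivation: When large language models (LLMs) generate long-form, factual documents, “top-down” methods often leave the generated content disconnected from the knowledge base. The authors aim to solve this with a data-driven approach that keeps the generated content faithful to the source material.

Contribution: A novel “bottom-up” generation framework built on the “Retrieval-First for Knowledge, Clustering for Structure” strategy, which keeps the generated content strictly constrained by the knowledge base and highly traceable to it.

Method: (1) Iterative retrieval from the knowledge base; (2) an unsupervised clustering algorithm that organizes the retrieved documents into “knowledge clusters”; (3) generation of a hierarchical outline and the final document content from those clusters.

Result: Experiments on 14B and 32B parameter models show performance comparable to or exceeding existing baselines. The authors expect particular advantages in knowledge-constrained scenarios that demand high fidelity and structural coherence.

Insight: Constraining generation with a data-driven process can substantially reduce the risk of hallucination and offers a reliable paradigm for long-document generation in high-stakes, knowledge-intensive domains.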

Abstract: Large Language Models (LLMs) have shown remarkable prowess in text generation, yet producing long-form, factual documents grounded in extensive external knowledge bases remains a significant challenge. Existing “top-down” methods, which first generate a hypothesis or outline and then retrieve evidence, often suffer from a disconnect between the model’s plan and the available knowledge, leading to content fragmentation and factual inaccuracies. To address these limitations, we propose a novel “bottom-up,” data-driven framework that inverts the conventional generation pipeline. Our approach is predicated on a “Retrieval-First for Knowledge, Clustering for Structure” strategy, which first establishes the “knowledge boundaries” of the source corpus before any generative planning occurs. Specifically, we perform exhaustive iterative retrieval from the knowledge base and then employ an unsupervised clustering algorithm to organize the retrieved documents into distinct “knowledge clusters.” These clusters form an objective, data-driven foundation that directly guides the subsequent generation of a hierarchical outline and the final document content. This bottom-up process ensures that the generated text is strictly constrained by and fully traceable to the source material, proactively adapting to the finite scope of the knowledge base and fundamentally mitigating the risk of hallucination. Experimental results on both 14B and 32B parameter models demonstrate that our method achieves performance comparable to or exceeding state-of-the-art baselines, and is expected to demonstrate unique advantages in knowledge-constrained scenarios that demand high fidelity and structural coherence. Our work presents an effective paradigm for generating reliable, structured, long-form documents, paving the way for more robust LLM applications in high-stakes, knowledge-intensive domains.

[20] Data Augmentation for Maltese NLP using Transliterated and Machine Translated Arabic Data

Kurt Micallef,Nizar Habash,Claudia Borg

Main category: cs.CL

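TL;DR: The paper explores how Arabic-language resources can support Maltese natural language processing (NLP) through cross-lingual data augmentation, covering several transliteration schemes and machine translation approaches, and shows that such augmentation can significantly benefit Maltese NLP tasks.

Details Motivation: Maltese is a unique Semitic language that has been deeply influenced by Romance and Germanic languages, particularly Italian and English. Despite its Semitic roots, its orthography is based on the Latin script, which creates a gap between it and its closest linguistic relatives in Arabic. The authors ask whether Arabic resources can support Maltese NLP tasks through data augmentation.

Contribution: (1) Multiple strategies for aligning Arabic textual data with Maltese, including transliteration schemes and machine translation approaches; (2) novel transliteration systems that better represent Maltese orthography; (3) evidence that Arabic-based augmentation significantly benefits Maltese NLP tasks.

Method: The authors align Arabic data with Maltese using several strategies, including transliteration and machine translation (MT), to support data augmentation. They also develop new transliteration systems that better capture Maltese orthography. The augmentations are evaluated on monolingual and multilingual models.

Result: Experiments show that Arabic-based data augmentation can significantly improve performance on Maltese NLP tasks, with positive effects for both monolingual and multilingual models.

Insight: Although Maltese and Arabic differ in their writing systems, suitable conversion and augmentation methods make it possible to use Arabic resources to compensate for the scarcity of Maltese data and improve NLP performance. This suggests a new direction for data augmentation in low-resource languages.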

Abstract: Maltese is a unique Semitic language that has evolved under extensive influence from Romance and Germanic languages, particularly Italian and English. Despite its Semitic roots, its orthography is based on the Latin script, creating a gap between it and its closest linguistic relatives in Arabic. In this paper, we explore whether Arabic-language resources can support Maltese natural language processing (NLP) through cross-lingual augmentation techniques. We investigate multiple strategies for aligning Arabic textual data with Maltese, including various transliteration schemes and machine translation (MT) approaches. As part of this, we also introduce novel transliteration systems that better represent Maltese orthography. We evaluate the impact of these augmentations on monolingual and multilingual models and demonstrate that Arabic-based augmentation can significantly benefit Maltese NLP tasks.

[21] Benchmarking and Improving LVLMs on Event Extraction from Multimedia Documents

Fuyu Xing,Zimu Wang,Wei Wang,Haiyang Zhang

Main category: cs.CL

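TL;DR: The paper presents the first systematic evaluation of large vision-language models (LVLMs), including DeepSeek-VL2 and the Qwen-VL series, on multimedia event extraction (M2E2). It reveals performance differences between few-shot prompting and fine-tuning settings and points to directions for improvement.

Details Motivation: With the rapid growth of multimedia content, multimedia event extraction (M2E2) is increasingly important. Although LVLMs show strong cross-modal capabilities, their use for M2E2 remains underexplored.

Contribution: The first systematic evaluation of LVLMs on the M2E2 task, showing that LoRA fine-tuning substantially improves performance. The study also reveals the synergy LVLMs exhibit when combining modalities and the challenges that remain.

Method: LVLMs are evaluated on text-only, image-only, and cross-media subtasks under two settings, few-shot prompting and LoRA fine-tuning, followed by a detailed error analysis.

Result: (1) Few-shot LVLMs perform notably better on visual tasks but struggle with textual tasks; (2) LoRA fine-tuning substantially improves performance; (3) combining modalities yields superior performance. Semantic precision, localization, and cross-modal grounding remain challenges.

Insight: LVLMs show promise for multimodal tasks, but their performance on textual tasks needs further improvement and semantic alignment in cross-modal tasks remains to be solved.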

Abstract: The proliferation of multimedia content necessitates the development of effective Multimedia Event Extraction (M2E2) systems. Though Large Vision-Language Models (LVLMs) have shown strong cross-modal capabilities, their utility in the M2E2 task remains underexplored. In this paper, we present the first systematic evaluation of representative LVLMs, including DeepSeek-VL2 and the Qwen-VL series, on the M2E2 dataset. Our evaluations cover text-only, image-only, and cross-media subtasks, assessed under both few-shot prompting and fine-tuning settings. Our key findings highlight the following valuable insights: (1) Few-shot LVLMs perform notably better on visual tasks but struggle significantly with textual tasks; (2) Fine-tuning LVLMs with LoRA substantially enhances model performance; and (3) LVLMs exhibit strong synergy when combining modalities, achieving superior performance in cross-modal settings. We further provide a detailed error analysis to reveal persistent challenges in areas such as semantic precision, localization, and cross-modal grounding, which remain critical obstacles for advancing M2E2 capabilities.

[22] The LLM Already Knows: Estimating LLM-Perceived Question Difficulty via Hidden Representations

Yubo Zhu,Dongrui Liu,Zecheng Lin,Wei Tong,Sheng Zhong,Jing Shao

Main category: cs.CL

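TL;DR: The paper proposes a new method that uses the hidden representations of a large language model (LLM) to estimate the difficulty of input questions, avoiding the computational cost and generality problems of earlier methods.

Details Motivation: Existing methods rely on repeated sampling, auxiliary models, or fine-tuning the target model, which is computationally expensive and may compromise generality. The paper aims to estimate question difficulty directly from the LLM's hidden representations.

Contribution: 1. A difficulty estimation method based on LLM hidden representations; 2. a Markov chain model of the generation process with a value function that estimates the output quality associated with a hidden state; 3. validation on both textual and multimodal tasks.

Method: 1. Model the LLM's token-level generation process as a Markov chain; 2. define a value function that estimates output quality from the initial hidden state alone; 3. estimate difficulty efficiently without generating any output tokens.

Result: Experiments show that the method outperforms existing baselines at difficulty estimation and, when used to guide adaptive reasoning strategies (such as Self-Consistency), improves inference efficiency.

Insight: The LLM's hidden representations already encode information about question difficulty, and using it avoids redundant computation and enables efficient inference.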

Abstract: Estimating the difficulty of input questions as perceived by large language models (LLMs) is essential for accurate performance evaluation and adaptive inference. Existing methods typically rely on repeated response sampling, auxiliary models, or fine-tuning the target model itself, which may incur substantial computational costs or compromise generality. In this paper, we propose a novel approach for difficulty estimation that leverages only the hidden representations produced by the target LLM. We model the token-level generation process as a Markov chain and define a value function to estimate the expected output quality given any hidden state. This allows for efficient and accurate difficulty estimation based solely on the initial hidden state, without generating any output tokens. Extensive experiments across both textual and multimodal tasks demonstrate that our method consistently outperforms existing baselines in difficulty estimation. Moreover, we apply our difficulty estimates to guide adaptive reasoning strategies, including Self-Consistency, Best-of-N, and Self-Refine, achieving higher inference efficiency with fewer generated tokens.

[23] Conan-Embedding-v2: Training an LLM from Scratch for Text Embeddings

Shiyu Li,Yang Tang,Ruijie Liu,Shi-Zhe Chen,Xi Chen

Main category: cs.CL

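TL;DR: Conan-embedding-v2 is a 1.4B-parameter LLM trained from scratch and fine-tuned as a text embedder. It targets the data and training gaps between LLMs and embedding models with a cross-lingual retrieval dataset, a soft-masking mechanism, and dynamic hard negative mining, and reaches SOTA performance.

Details Motivation: LLMs perform well on text embedding tasks, but prior work usually fine-tunes existing LLMs (for example with LoRA), which is limited by the data and training gaps between LLMs and embedding models.

Contribution: 1. A 1.4B-parameter LLM trained from scratch; 2. a cross-lingual retrieval dataset; 3. a soft-masking mechanism and a dynamic hard negative mining method.

Method: 1. Add news data and multilingual pairs to LLM pretraining; 2. use soft masking to transition gradually between the causal mask and the bidirectional mask; 3. use dynamic hard negative mining to expose the model to harder negatives throughout training.

Result: SOTA performance on both MTEB and Chinese MTEB (as of May 19, 2025).

Insight: With expanded data and improved training mechanisms, a small LLM can also achieve high performance on embedding tasks.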

Abstract: Large language models (LLMs) have recently demonstrated excellent performance in text embedding tasks. Previous work usually uses LoRA to fine-tune existing LLMs, which are limited by the data and training gap between LLMs and embedding models. In this work, we introduce Conan-embedding-v2, a new 1.4B-parameter LLM trained from scratch and fine-tuned as a text embedder. First, we add news data and multilingual pairs for LLM pretraining to bridge the data gap. Based on this, we propose a cross-lingual retrieval dataset that enables the LLM to better integrate embeddings across different languages. Second, whereas LLMs use a causal mask with token-level loss, embedding models use a bidirectional mask with sentence-level loss. This training gap makes full fine-tuning less effective than LoRA. We introduce a soft-masking mechanism to gradually transition between these two types of masks, enabling the model to learn more comprehensive representations. Based on this, we propose a dynamic hard negative mining method that exposes the model to more difficult negative examples throughout the training process. Being intuitive and effective, with only approximately 1.4B parameters, Conan-embedding-v2 achieves SOTA performance on both the Massive Text Embedding Benchmark (MTEB) and Chinese MTEB (May 19, 2025).
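The gradual transition between a causal and a bidirectional mask can be pictured as a blend whose weight on future positions grows during training. The sketch below is only an illustration under assumptions the abstract does not state: the function name `soft_mask` is hypothetical, the schedule is linear, and the mask is expressed as multiplicative weights. The paper's actual mechanism may differ.

```python
def soft_mask(seq_len: int, step: int, total_steps: int) -> list[list[float]]:
    """Return a seq_len x seq_len matrix of attention weights.

    Positions at or before the query (j <= i) always get weight 1.0.
    Future positions (j > i) get a weight alpha that grows linearly
    from 0.0 (pure causal mask) to 1.0 (pure bidirectional mask).
    The linear schedule is an assumption made for illustration.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp the training progress to [0, 1].
    alpha = min(max(step / total_steps, 0.0), 1.0)
    return [
        [1.0 if j <= i else alpha for j in range(seq_len)]
        for i in range(seq_len)
    ]


if __name__ == "__main__":
    # Start, midpoint, and end of the hypothetical schedule.
    for step in (0, 50, 100):
        print(step, soft_mask(3, step, total_steps=100))
```

At step 0 the matrix is lower-triangular (causal); at the final step every entry is 1.0 (bidirectional).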

[24] All Roads Lead to Rome: Graph-Based Confidence Estimation for Large Language Model Reasoning

Caiqi Zhang,Chang Shu,Ehsan Shareghi,Nigel Collier

Main category: cs.CL

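TL;DR: The paper proposes training-free, graph-based confidence estimation methods tailored to reasoning tasks for large language models (LLMs). Reasoning paths are modeled as directed graphs, and graph properties such as centrality, path convergence, and path weighting are used to improve confidence estimation.

Details Motivation: Existing confidence estimation methods mainly target factual QA tasks and generalize poorly to complex reasoning tasks, so a method better suited to reasoning is needed.

Contribution: Training-free, graph-based confidence estimation methods that use structural properties of the reasoning-path graph (such as centrality and path convergence) to improve the accuracy of confidence estimates.

Method: Model reasoning paths as directed graphs and estimate confidence from graph properties such as centrality, path convergence, and path weighting.

Result: Experiments with two LLMs on three reasoning datasets show improved confidence estimation and better performance on two downstream tasks.

Insight: Modeling reasoning paths through graph properties captures the uncertainty in the reasoning process more effectively and makes confidence estimation more robust.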

Abstract: Confidence estimation is essential for the reliable deployment of large language models (LLMs). Existing methods are primarily designed for factual QA tasks and often fail to generalize to reasoning tasks. To address this gap, we propose a set of training-free, graph-based confidence estimation methods tailored to reasoning tasks. Our approach models reasoning paths as directed graphs and estimates confidence by exploiting graph properties such as centrality, path convergence, and path weighting. Experiments with two LLMs on three reasoning datasets demonstrate improved confidence estimation and enhanced performance on two downstream tasks.

[25] Automated Generation of Research Workflows from Academic Papers: A Full-text Mining Framework

Heng Zhang,Chengzhi Zhang

Main category: cs.CL

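TL;DR: The paper proposes an end-to-end framework that automatically generates structured research workflows by mining full-text academic papers, addressing the limitation that existing methods extract only fragmented procedural components.

Details Motivation: To improve research reproducibility and advance the “AI for Science” paradigm, complete research workflows need to be generated automatically, whereas existing methods usually extract only fragmented procedural information.

Contribution: 1. An end-to-end framework that generates structured research workflows from full-text papers; 2. PU learning with SciBERT to identify workflow-descriptive paragraphs, Flan-T5 to generate workflow phrases, and ChatGPT to categorize them; 3. readable visual flowcharts, together with an analysis of methodological shifts in the NLP field.

Method: 1. Paragraph identification: PU learning with SciBERT; 2. workflow phrase generation: Flan-T5 with prompt learning; 3. workflow categorization: ChatGPT with few-shot learning; 4. generation of visual flowcharts.

Result: Paragraph identification reaches an F1-score of 0.9772; workflow phrase generation reaches ROUGE-1/2/L scores of 0.4543/0.2877/0.4427; classification precision is 0.958.

Insight: Research workflows in NLP have gradually shifted from feature engineering to ablation studies, with growing emphasis on data analysis. The approach offers a technical framework for automated workflow generation and a new perspective for studying how scientific paradigms evolve.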

Abstract: The automated generation of research workflows is essential for improving the reproducibility of research and accelerating the paradigm of “AI for Science”. However, existing methods typically extract merely fragmented procedural components and thus fail to capture complete research workflows. To address this gap, we propose an end-to-end framework that generates comprehensive, structured research workflows by mining full-text academic papers. As a case study in the Natural Language Processing (NLP) domain, our paragraph-centric approach first employs Positive-Unlabeled (PU) Learning with SciBERT to identify workflow-descriptive paragraphs, achieving an F1-score of 0.9772. Subsequently, we utilize Flan-T5 with prompt learning to generate workflow phrases from these paragraphs, yielding ROUGE-1, ROUGE-2, and ROUGE-L scores of 0.4543, 0.2877, and 0.4427, respectively. These phrases are then systematically categorized into data preparation, data processing, and data analysis stages using ChatGPT with few-shot learning, achieving a classification precision of 0.958. By mapping categorized phrases to their locations in the documents, we finally generate readable visual flowcharts of the entire research workflows. This approach facilitates the analysis of workflows derived from an NLP corpus and reveals key methodological shifts over the past two decades, including the increasing emphasis on data analysis and the transition from feature engineering to ablation studies. Our work offers a validated technical framework for automated workflow generation, along with a novel, process-oriented perspective for the empirical investigation of evolving scientific paradigms. Source code and data are available at: https://github.com/ZH-heng/research_workflow.

[26] Investigating ReLoRA: Effects on the Learning Dynamics of Small Language Models

Yuval Weiss,David Demitri Africa,Paula Buttery,Richard Diehl Martinez

Main category: cs.CL

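TL;DR: The paper studies the learning dynamics and performance of ReLoRA in small language models (SLMs) and finds that it generally performs worse than standard training on loss, Paloma perplexity, and BLiMP, with the gap widening for the larger models.

Details Motivation: Parameter-efficient methods such as LoRA work well for fine-tuning large language models (LLMs), but their extension to pretraining (via ReLoRA) is less well understood, especially for SLMs. Because SLMs have lower computational and environmental costs, studying their learning dynamics and performance matters.

Contribution: The first systematic study of ReLoRA in SLMs (11M-66M parameters), evaluating both performance and learning dynamics and finding that ReLoRA underperforms in SLM pretraining, particularly for the larger models.

Method: Ablation experiments compare ReLoRA with standard training on loss, Paloma perplexity, and BLiMP, complemented by an analysis of learning dynamics.

Result: ReLoRA performs worse than standard training in SLMs, and the gap grows for the larger models. The learning-dynamics analysis indicates that ReLoRA reinforces the rank deficiencies found in smaller models.

Insight: Low-rank update strategies such as ReLoRA may not transfer easily to SLM pretraining, which points to the need for more research in the low-compute regime.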

Abstract: Parameter-efficient methods such as LoRA have revolutionised the fine-tuning of LLMs. Still, their extension to pretraining via ReLoRA is less well understood, especially for small language models (SLMs), which offer lower computational and environmental costs. This work is the first systematic study of ReLoRA in SLMs (11M-66M parameters), evaluating both performance and learning dynamics. Through ablation experiments, we find that ReLoRA generally performs worse than standard training on loss, Paloma perplexity and BLiMP, with the gap widening for the larger models. Further analysis of the learning dynamics of the models indicates that ReLoRA reinforces the rank deficiencies found in smaller models. These results indicate that low-rank update strategies may not transfer easily to SLM pretraining, highlighting the need for more research in the low-compute regime.

[27] SitLLM: Large Language Models for Sitting Posture Health Understanding via Pressure Sensor Data

Jian Gao,Fufangchen Zhao,Yiyang Zhang,Danfeng Yan

Main category: cs.CL

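TL;DR: SitLLM is a lightweight multimodal framework that combines pressure sensing with large language models (LLMs) to enable fine-grained sitting posture understanding and personalized health feedback.

Details Motivation: Existing sitting posture monitoring systems recognize posture only at a coarse granularity and lack the semantic expressiveness needed for personalized feedback. SitLLM aims to address this.

Contribution: The SitLLM framework, built from three core modules: a Gaussian-Robust Sensor Embedding Module, a Prompt-Driven Cross-Modal Alignment Module, and a Multi-Context Prompt Module.

Method: 1) Extract robust features by partitioning pressure maps into spatial patches and injecting local noise perturbations; 2) align sensor embeddings with the LLM's semantic space via multi-head cross-attention; 3) fuse contextual information at multiple levels to generate feedback.

Result: The framework enables fine-grained posture understanding and personalized health feedback.

Insight: A multimodal approach that combines pressure sensing with LLMs can improve the semantic expressiveness of sitting posture monitoring.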

Abstract: Poor sitting posture is a critical yet often overlooked factor contributing to long-term musculoskeletal disorders and physiological dysfunctions. Existing sitting posture monitoring systems, although leveraging visual, IMU, or pressure-based modalities, often suffer from coarse-grained recognition and lack the semantic expressiveness necessary for personalized feedback. In this paper, we propose SitLLM, a lightweight multimodal framework that integrates flexible pressure sensing with large language models (LLMs) to enable fine-grained posture understanding and personalized health-oriented response generation. SitLLM comprises three key components: (1) a Gaussian-Robust Sensor Embedding Module that partitions pressure maps into spatial patches and injects local noise perturbations for robust feature extraction; (2) a Prompt-Driven Cross-Modal Alignment Module that reprograms sensor embeddings into the LLM’s semantic space via multi-head cross-attention using the pre-trained vocabulary embeddings; and (3) a Multi-Context Prompt Module that fuses feature-level, structure-level, statistical-level, and semantic-level contextual information to guide instruction comprehension.

[28] Multi-Model Synthetic Training for Mission-Critical Small Language Models

Nolan Platt,Pragyansmita Nayak

Main category: cs.CL

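TL;DR: The paper proposes using large language models (LLMs) to generate synthetic data for training a small model, which sharply reduces the cost of maritime-domain tasks while keeping accuracy high.

Details Motivation: Applying LLMs to specialized fields is constrained by the scarcity and complexity of domain-specific data, and using large models directly for inference is expensive.

Contribution: 1. A method that trains a small model on LLM-generated synthetic data, achieving a 261x cost reduction; 2. a fine-tuned small model (Qwen2.5-7B) that reaches 75% accuracy on maritime tasks; 3. a reproducible framework for domains where manual annotation is infeasible.

Method: 1. Generate synthetic question-answer pairs with multiple models (GPT-4o and o3-mini), which is intended to prevent overfitting; 2. transform 3.2 billion AIS vessel tracking records into 21,543 synthetic question-answer pairs; 3. fine-tune the small Qwen2.5-7B model.

Result: The fine-tuned small model is substantially cheaper than using a larger model for inference and reaches 75% accuracy on maritime tasks.

Insight: Small models trained on synthetic data can reach accuracy similar to expensive large models in specialized domains, offering a practical option where manual annotation is infeasible.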

Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities across many domains, yet their application to specialized fields remains constrained by the scarcity and complexity of domain-specific training data. We present a novel approach that achieves a 261x cost reduction for maritime intelligence by using LLMs as one-time teachers rather than using them directly for inference. Our method transforms 3.2 billion Automatic Identification System (AIS) vessel tracking records into 21,543 synthetic question and answer pairs through multi-model generation (GPT-4o and o3-mini), preventing overfitting and ensuring accurate reasoning. The resulting fine-tuned Qwen2.5-7B model achieves 75% accuracy on maritime tasks, while being substantially cheaper than using a larger model for inference. We show that smaller, cheaper models – when fine tuned properly – can provide similar accuracy compared to larger models that are prohibitively expensive. Our work contributes to the growing field of synthetic dataset generation for specialized AI applications and presents a highly reproducible framework for domains where manual annotation is infeasible. Beyond expanding research in the growing field of specialized small language models, our approach has immediate applications in maritime safety, security operations, and vessel traffic management systems in various industries.

[29] Shaping Explanations: Semantic Reward Modeling with Encoder-Only Transformers for GRPO

Francesco Pappone,Ruggero Marino Lazzaroni,Federico Califano,Niccolò Gentile,Roberto Marras

Main category: cs.CL

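TL;DR: The paper proposes a semantic reward model based on an encoder-only transformer, used within the GRPO framework to produce high-quality explanations that align more closely with expert reasoning.

Details Motivation: Although large language models (LLMs) can generate human-like text, aligning their outputs with complex qualitative goals such as pedagogical soundness remains a challenge. Existing options, such as keyword-based ROUGE or expensive LLM-as-a-judge evaluation, fail to capture semantic quality effectively.

Contribution: A lightweight encoder-only transformer serves as the semantic reward model within GRPO, providing a dense, semantically rich reward signal that steers the policy toward answers structurally and conceptually aligned with expert reasoning.

Method: The encoder-only transformer computes the cosine similarity between a generated explanation and a ground-truth reference as the reward signal, and the policy is optimized with GRPO. Experiments use the Italian medical-school entrance examination task and follow domain-adaptive continued pre-training (CPT) and supervised fine-tuning (SFT).

Result: Compared with a strong SFT baseline, the proposed semantic reward significantly improves the faithfulness and clarity of generated explanations.

Insight: Lightweight encoder models can provide nuanced reward shaping for complex generation tasks, avoiding reliance on expensive LLM-based evaluation or keyword-based metrics.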

Abstract: While Large Language Models (LLMs) excel at generating human-like text, aligning their outputs with complex, qualitative goals like pedagogical soundness remains a significant challenge. Standard reinforcement learning techniques often rely on slow and expensive LLM-as-a-judge evaluations or on brittle, keyword-based metrics like ROUGE, which fail to capture the semantic essence of a high-quality explanation. In this work, we introduce a novel approach to reward shaping within the Group Relative Policy Optimisation (GRPO) framework. Our central contribution is the use of a small, efficient encoder-only transformer as a semantic reward model. This model provides a dense, semantically rich reward signal based on the cosine similarity between a generated explanation and a ground-truth reference, guiding the policy towards explanations that are not just factually correct but also structurally and conceptually aligned with expert reasoning. We apply this method to the task of training a model for the Italian medical-school entrance examinations, following standard domain-adaptive continued pre-training (CPT) and supervised fine-tuning (SFT). Our results demonstrate that GRPO with our proposed semantic reward significantly improves explanation faithfulness and clarity over a strong SFT baseline, showcasing the power of using lightweight encoder models for nuanced reward shaping in complex generation tasks.
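A cosine-similarity reward of the kind described can be sketched in a few lines. The example below is a minimal illustration, not the authors' implementation: the toy vectors stand in for embeddings that a real encoder would produce, the function names are hypothetical, and the group normalization shown is the commonly described GRPO form (reward minus group mean, divided by group standard deviation), which the abstract itself does not spell out.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize rewards within one sampled group: (r - mean) / (std + eps)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    # Toy 3-dimensional "embeddings" (assumed); a real encoder would produce these.
    reference = [1.0, 0.0, 1.0]
    candidates = [[1.0, 0.0, 1.0], [1.0, 1.0, 0.0], [0.0, 1.0, 0.0]]
    rewards = [cosine_similarity(c, reference) for c in candidates]
    print(rewards)
    print(group_relative_advantages(rewards))
```

Candidates whose embeddings point in the same direction as the reference receive the highest reward and therefore a positive advantage within their group.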

[30] Empowering LLMs with Parameterized Skills for Adversarial Long-Horizon Planning

Sijia Cui,Shuai Xu,Aiyao He,Yanna Wang,Bo Xu

Main category: cs.CL

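TL;DR: The paper proposes PLAP, a planning framework that combines language models with parameterized skills to improve LLM planning in adversarial long-horizon environments.

Details Motivation: Existing methods either struggle to generate reliable low-level actions in long-horizon environments or rely heavily on expert experience to translate high-level tasks into action sequences. PLAP aims to address both problems.

Contribution: The PLAP framework, which combines a parameterized skill library, an LLM-powered skill planner, and a skill executor to improve LLM performance in complex environments.

Method: PLAP has three core components: a library of environment-specific parameterized skills, an LLM-based skill planner, and an executor that converts the skills into executable action sequences.

Result: In MicroRTS, GPT-4o-driven PLAP in a zero-shot setting outperforms 80% of baseline agents, and Qwen2-72B-driven PLAP with carefully crafted few-shot examples surpasses the top-tier scripted agent CoacAI.

Insight: A parameterized skill library bridges the gap between low-level action generation and high-level task decomposition, improving LLM planning ability.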

Abstract: Recent advancements in Large Language Models (LLMs) have led to the development of LLM-based AI agents. A key challenge is the creation of agents that can effectively ground themselves in complex, adversarial long-horizon environments. Existing methods mainly focus on (1) using LLMs as policies to interact with the environment through generating low-level feasible actions, and (2) utilizing LLMs to generate high-level tasks or language guides to stimulate action generation. However, the former struggles to generate reliable actions, while the latter relies heavily on expert experience to translate high-level tasks into specific action sequences. To address these challenges, we introduce the Plan with Language, Act with Parameter (PLAP) planning framework that facilitates the grounding of LLM-based agents in long-horizon environments. The PLAP method comprises three key components: (1) a skill library containing environment-specific parameterized skills, (2) a skill planner powered by LLMs, and (3) a skill executor converting the parameterized skills into executable action sequences. We implement PLAP in MicroRTS, a long-horizon real-time strategy game that provides an unfamiliar and challenging environment for LLMs. The experimental results demonstrate the effectiveness of PLAP. In particular, GPT-4o-driven PLAP in a zero-shot setting outperforms 80% of baseline agents, and Qwen2-72B-driven PLAP, with carefully crafted few-shot examples, surpasses the top-tier scripted agent, CoacAI. Additionally, we design comprehensive evaluation metrics and test 6 closed-source and 2 open-source LLMs within the PLAP framework, ultimately releasing an LLM leaderboard ranking long-horizon skill planning ability. Our code is available at https://github.com/AI-Research-TeamX/PLAP.

[31] LLM Hallucination Detection: A Fast Fourier Transform Method Based on Hidden Layer Temporal Signals

Jinxin Li,Gang Tu,ShengYu Cheng,Junjie Hu,Jinting Wang,Rui Chen,Zhilong Zhou,Dongbo Shan

Main category: cs.CL

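TL;DR: The paper proposes HSAD, a hallucination detection method that applies the Fast Fourier Transform (FFT) to temporal signals built from the hidden layers of a large language model (LLM), and reports clear gains over existing methods.

Details Motivation: Hallucination limits the use of LLMs in reliability-sensitive settings. Existing methods, such as factuality checking and static hidden-state analysis, are constrained by external knowledge coverage or fail to capture deviations in reasoning dynamics, so their effectiveness and robustness remain limited.

Contribution: The HSAD framework, which combines temporal modeling of hidden-layer signals with frequency-domain analysis: it extracts spectral features with the FFT and uses the autoregressive nature of LLMs to choose observation points.

Method: 1) Sample hidden-layer activations across layers; 2) apply the FFT to obtain a frequency-domain representation; 3) extract the strongest non-DC component as the feature; 4) select the best observation points based on the autoregressive property.

Result: Across benchmarks including TruthfulQA, HSAD improves on prior state-of-the-art methods by more than 10 percentage points.

Insight: Combining reasoning-process modeling with frequency-domain analysis opens a new direction for LLM hallucination detection and highlights the value of temporal dynamics as a feature.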

Abstract: Hallucination remains a critical barrier for deploying large language models (LLMs) in reliability-sensitive applications. Existing detection methods largely fall into two categories: factuality checking, which is fundamentally constrained by external knowledge coverage, and static hidden-state analysis, which fails to capture deviations in reasoning dynamics. As a result, their effectiveness and robustness remain limited. We propose HSAD (Hidden Signal Analysis-based Detection), a novel hallucination detection framework that models the temporal dynamics of hidden representations during autoregressive generation. HSAD constructs hidden-layer signals by sampling activations across layers, applies Fast Fourier Transform (FFT) to obtain frequency-domain representations, and extracts the strongest non-DC frequency component as spectral features. Furthermore, by leveraging the autoregressive nature of LLMs, HSAD identifies optimal observation points for effective and reliable detection. Across multiple benchmarks, including TruthfulQA, HSAD achieves over 10 percentage points improvement compared to prior state-of-the-art methods. By integrating reasoning-process modeling with frequency-domain analysis, HSAD establishes a new paradigm for robust hallucination detection in LLMs.
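The feature-extraction step, taking the strongest non-DC frequency component of a signal, can be sketched as follows. This is a minimal illustration under assumptions: the input is a plain one-dimensional list standing in for a hidden-layer signal, the function names are hypothetical, and a direct DFT replaces the FFT to keep the example dependency-free (both produce the same spectrum). How HSAD actually builds its signals and features is not specified beyond the abstract.

```python
import cmath
import math


def dft_magnitudes(signal: list[float]) -> list[float]:
    """Magnitudes of the discrete Fourier transform for bins 0 .. n//2.

    A direct O(n^2) DFT is used to stay dependency-free; an FFT routine
    returns the same values faster.
    """
    n = len(signal)
    mags = []
    for k in range(n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n)
            for t, x in enumerate(signal)
        )
        mags.append(abs(coeff))
    return mags


def strongest_non_dc(signal: list[float]) -> tuple[int, float]:
    """Return (bin index, magnitude) of the largest component, skipping bin 0 (DC)."""
    mags = dft_magnitudes(signal)
    if len(mags) < 2:
        raise ValueError("need at least 2 samples")
    k = max(range(1, len(mags)), key=lambda i: mags[i])
    return k, mags[k]


if __name__ == "__main__":
    # Hypothetical signal: constant offset plus an oscillation with period 4.
    toy_signal = [5.0 + v for v in [0.0, 1.0, 0.0, -1.0] * 4]
    print(strongest_non_dc(toy_signal))
```

Bin 0 holds the mean (DC) level of the signal, so it is skipped; otherwise a large constant offset would dominate every oscillatory component.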

[32] The Few-shot Dilemma: Over-prompting Large Language Models

Yongjian Tang,Doruk Tuncel,Christian Koerner,Thomas Runkler

Main category: cs.CL

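TL;DR: The paper examines over-prompting, where excessive examples in the prompt degrade the performance of large language models (LLMs), proposes a few-shot prompting framework, and shows experimentally that choosing the right number of examples matters.

Details Motivation: Conventional wisdom holds that adding relevant few-shot examples improves LLM performance, yet in some LLMs too many examples lower it, a phenomenon called the “few-shot dilemma”. The study aims to quantify and address it.

Contribution: 1. Evidence that over-prompting degrades performance; 2. a few-shot prompting framework that combines TF-IDF selection with stratification; 3. a 1% improvement over the state of the art on software requirement classification.

Method: Three few-shot selection methods (random sampling, semantic embedding, TF-IDF) are applied while the number of examples is gradually increased, and the performance of different LLMs is tracked to identify the optimal number of examples.

Result: Experiments show that too many domain-specific examples reduce performance in certain LLMs, and that tuning the number of examples improves performance on the requirement classification task.

Insight: The quantity and quality of few-shot examples must be balanced carefully, because over-prompting is counterproductive. Careful prompt design can improve how well LLMs work in practice.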

Abstract: Over-prompting, a phenomenon where excessive examples in prompts lead to diminished performance in Large Language Models (LLMs), challenges the conventional wisdom about in-context few-shot learning. To investigate this few-shot dilemma, we outline a prompting framework that leverages three standard few-shot selection methods - random sampling, semantic embedding, and TF-IDF vectors - and evaluate these methods across multiple LLMs, including GPT-4o, GPT-3.5-turbo, DeepSeek-V3, Gemma-3, LLaMA-3.1, LLaMA-3.2, and Mistral. Our experimental results reveal that incorporating excessive domain-specific examples into prompts can paradoxically degrade performance in certain LLMs, which contradicts the prior empirical conclusion that more relevant few-shot examples universally benefit LLMs. Given the trend of LLM-assisted software engineering and requirement analysis, we experiment with two real-world software requirement classification datasets. By gradually increasing the number of TF-IDF-selected and stratified few-shot examples, we identify their optimal quantity for each LLM. This combined approach achieves superior performance with fewer examples, avoiding the over-prompting problem, thus surpassing the state-of-the-art by 1% in classifying functional and non-functional requirements.
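Selecting few-shot examples by TF-IDF similarity with stratification across classes can be sketched as below. This is an illustrative reading, not the paper's pipeline: whitespace tokenization, the smoothed IDF formula, the interpretation of "stratified" as an equal number of examples per label, and all function names are assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


def tfidf_vectors(docs: list[str]) -> list[dict[str, float]]:
    """Compute simple TF-IDF vectors (raw term count times smoothed IDF)."""
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter(term for tokens in tokenized for term in set(tokens))
    n = len(docs)
    idf = {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}
    return [
        {term: tf * idf[term] for term, tf in Counter(tokens).items()}
        for tokens in tokenized
    ]


def cosine(u: dict[str, float], v: dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def select_examples(
    query: str, pool: list[tuple[str, str]], per_label: int
) -> list[tuple[str, str]]:
    """Pick the per_label most query-similar (text, label) pairs from each label."""
    vectors = tfidf_vectors([text for text, _ in pool] + [query])
    query_vec = vectors[-1]
    by_label = defaultdict(list)
    # zip stops at len(pool), so the query vector itself is never scored.
    for (text, label), vec in zip(pool, vectors):
        by_label[label].append((cosine(query_vec, vec), text))
    selected = []
    for label in sorted(by_label):
        ranked = sorted(by_label[label], key=lambda pair: pair[0], reverse=True)
        selected.extend((text, label) for _, text in ranked[:per_label])
    return selected


if __name__ == "__main__":
    # Hypothetical requirement statements: F = functional, NF = non-functional.
    pool = [
        ("the system shall export reports as pdf", "F"),
        ("the user can reset a password", "F"),
        ("the system shall respond within two seconds", "NF"),
        ("the interface must be available in german", "NF"),
    ]
    print(select_examples("the system shall export invoices as pdf", pool, per_label=1))
```

Raising `per_label` step by step and re-evaluating the model at each setting is one way to look for the point where additional examples stop helping.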

[33] Evaluating LLM Alignment on Personality Inference from Real-World Interview Data

Jianfeng Zhu,Julina Maharjan,Xinyu Li,Karin G. Coifman,Ruoming Jin

Main category: cs.CL

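TL;DR: The paper introduces a novel benchmark for evaluating how well large language models (LLMs) infer personality traits from real-world interview data. Experiments show that current LLMs perform poorly on continuous personality assessment: all correlations stay below 0.26, and chain-of-thought prompting brings only minimal gains, which highlights how hard it is to align LLMs with complex human attributes.

Details Motivation: LLMs are increasingly used in settings that require psychological understanding, such as counseling, but their ability to infer personality traits from real conversations has not been studied adequately. Prior work mostly relies on simulated data and does not examine continuous personality assessment.

Contribution: 1) A new benchmark pairing semi-structured interview transcripts with validated continuous personality trait scores; 2) a systematic evaluation of three LLM paradigms (zero-shot and chain-of-thought prompting, LoRA fine-tuning, regression on static embeddings); 3) evidence of the limited alignment between current LLMs and personality traits.

Method: Three approaches are used: 1) zero-shot and chain-of-thought prompting (GPT-4.1 Mini); 2) LoRA-based fine-tuning of RoBERTa and Meta-LLaMA architectures; 3) regression on embeddings from pretrained BERT and OpenAI text embeddings.

Result: All Pearson correlations between predicted and ground-truth personality traits are below 0.26, and chain-of-thought prompting improves only marginally on zero-shot. This suggests that personality inference depends more on latent semantic representation than on explicit reasoning.

Insight: Aligning LLMs with complex human attributes remains challenging; future work should look at trait-specific prompting, context-aware modeling, and alignment-oriented fine-tuning.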

Abstract: Large Language Models (LLMs) are increasingly deployed in roles requiring nuanced psychological understanding, such as emotional support agents, counselors, and decision-making assistants. However, their ability to interpret human personality traits, a critical aspect of such applications, remains unexplored, particularly in ecologically valid conversational settings. While prior work has simulated LLM “personas” using discrete Big Five labels on social media data, the alignment of LLMs with continuous, ground-truth personality assessments derived from natural interactions is largely unexamined. To address this gap, we introduce a novel benchmark comprising semi-structured interview transcripts paired with validated continuous Big Five trait scores. Using this dataset, we systematically evaluate LLM performance across three paradigms: (1) zero-shot and chain-of-thought prompting with GPT-4.1 Mini, (2) LoRA-based fine-tuning applied to both RoBERTa and Meta-LLaMA architectures, and (3) regression using static embeddings from pretrained BERT and OpenAI’s text-embedding-3-small. Our results reveal that all Pearson correlations between model predictions and ground-truth personality traits remain below 0.26, highlighting the limited alignment of current LLMs with validated psychological constructs. Chain-of-thought prompting offers minimal gains over zero-shot, suggesting that personality inference relies more on latent semantic representation than explicit reasoning. These findings underscore the challenges of aligning LLMs with complex human attributes and motivate future work on trait-specific prompting, context-aware modeling, and alignment-oriented fine-tuning.
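The headline metric here is the Pearson correlation between predicted and ground-truth trait scores. As a reminder of what that number measures, the sketch below computes it from first principles; the scores in the demo are made up for illustration and are not from the paper.

```python
import math


def pearson_r(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical trait scores on a 1-5 scale (illustrative only).
    ground_truth = [3.2, 4.1, 2.5, 3.8, 4.6]
    predicted = [3.5, 3.4, 3.6, 3.3, 3.7]
    print(round(pearson_r(ground_truth, predicted), 3))
```

A value near 1 means predictions rise and fall with the true scores; values below 0.26, as reported in the abstract, indicate only a weak linear relationship.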

[34] ChartGaze: Enhancing Chart Understanding in LVLMs with Eye-Tracking Guided Attention Refinement

Ali Salamatian,Amirhossein Abaskohi,Wan-Cyuan Fan,Mir Rayat Imtiaz Hossain,Leonid Sigal,Giuseppe Carenini

Main category: cs.CL

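TL;DR: ChartGaze uses eye-tracking data to improve attention alignment in LVLMs on chart question answering, refining attention to raise both accuracy and interpretability.

Details Motivation: Existing LVLMs underperform on chart question answering because their attention drifts to irrelevant regions of the chart, which is inconsistent with human gaze behavior. The work aims to use human gaze data to improve model attention.

Contribution: 1) Releases ChartGaze, an eye-tracking dataset that records human gaze patterns during chart reasoning; 2) proposes a gaze-guided attention refinement method that improves model performance and alignment.

Method: 1) Compares human and model attention distributions; 2) designs an attention refinement step that aligns the model's image-text attention with human fixations.

Result: The method improves accuracy by up to 2.56 percentage points across multiple models, and attention alignment improves as well.

Insight: Human gaze data can effectively guide model attention, improving both performance and interpretability on chart understanding tasks.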

Abstract: Charts are a crucial visual medium for communicating and representing information. While Large Vision-Language Models (LVLMs) have made progress on chart question answering (CQA), the task remains challenging, particularly when models attend to irrelevant regions of the chart. In this work, we present ChartGaze, a new eye-tracking dataset that captures human gaze patterns during chart reasoning tasks. Through a systematic comparison of human and model attention, we find that LVLMs often diverge from human gaze, leading to reduced interpretability and accuracy. To address this, we propose a gaze-guided attention refinement that aligns image-text attention with human fixations. Our approach improves both answer accuracy and attention alignment, yielding gains of up to 2.56 percentage points across multiple models. These results demonstrate the promise of incorporating human gaze to enhance both the reasoning quality and interpretability of chart-focused LVLMs.

[35] WebResearcher: Unleashing unbounded reasoning capability in Long-Horizon Agents

Zile Qiao,Guoxin Chen,Xuanzhong Chen,Donglei Yu,Wenbiao Yin,Xinyu Wang,Zhen Zhang,Baixuan Li,Huifeng Yin,Kuan Li,Rui Min,Minpeng Liao,Yong Jiang,Pengjun Xie,Fei Huang,Jingren Zhou

Main category: cs.CL

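TL;DR: WebResearcher is a framework that combines an iterative deep-research paradigm (WebResearcher) with a scalable data synthesis engine (WebFrontier). It addresses the context suffocation and noise contamination of mono-contextual approaches, markedly improves tool use, and reaches state-of-the-art performance on 6 benchmarks.

Details Motivation: AI agents that autonomously discover and synthesize external knowledge face context suffocation and noise contamination, which limits their performance on long-horizon reasoning tasks.

Contribution: 1. Proposes the WebResearcher iterative deep-research paradigm, which formulates deep research as a Markov Decision Process; 2. develops the WebFrontier data synthesis engine, which generates high-quality training data and improves tool-use capability; 3. supports parallel multi-agent exploration to scale reasoning.

Method: 1. Iterative deep-research paradigm (WebResearcher): periodically consolidates findings into an evolving report while keeping a focused workspace; 2. WebFrontier: generates training data through tool-augmented complexity escalation; 3. parallel multi-agent exploration.

Result: Reaches state-of-the-art performance on 6 benchmarks, even surpassing frontier proprietary systems; the training data also markedly improves tool use for traditional mono-contextual methods.

Insight: 1. Iterative and parallel strategies can effectively overcome context limits in long-horizon reasoning; 2. high-quality synthetic data is crucial for improving tool-use capability.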

Abstract: Recent advances in deep-research systems have demonstrated the potential for AI agents to autonomously discover and synthesize knowledge from external sources. In this paper, we introduce WebResearcher, a novel framework for building such agents through two key components: (1) WebResearcher, an iterative deep-research paradigm that reformulates deep research as a Markov Decision Process, where agents periodically consolidate findings into evolving reports while maintaining focused workspaces, overcoming the context suffocation and noise contamination that plague existing mono-contextual approaches; and (2) WebFrontier, a scalable data synthesis engine that generates high-quality training data through tool-augmented complexity escalation, enabling systematic creation of research tasks that bridge the gap between passive knowledge recall and active knowledge construction. Notably, we find that the training data from our paradigm significantly enhances tool-use capabilities even for traditional mono-contextual methods. Furthermore, our paradigm naturally scales through parallel thinking, enabling concurrent multi-agent exploration for more comprehensive conclusions. Extensive experiments across 6 challenging benchmarks demonstrate that WebResearcher achieves state-of-the-art performance, even surpassing frontier proprietary systems.

[36] Scaling Agents via Continual Pre-training

Liangcai Su,Zhen Zhang,Guangyu Li,Zhuo Chen,Chenxi Wang,Maojia Song,Xinyu Wang,Kuan Li,Jialong Wu,Xuanzhong Chen,Zile Qiao,Zhongwang Zhang,Huifeng Yin,Shihao Cai,Runnan Fang,Zhengwei Tao,Wenbiao Yin,Chenxiong Qian,Yong Jiang,Pengjun Xie,Fei Huang,Jingren Zhou

Main category: cs.CL

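TL;DR: The paper proposes Agentic CPT, a continual pre-training method for building strong agentic foundation models, and validates its performance gains experimentally.

Details Motivation: Existing large language models underperform on agentic tasks mainly because robust agentic foundation models are lacking. During post-training a model must therefore learn diverse agentic behaviors and align them with expert demonstrations at the same time, which creates optimization tension.

Contribution: Proposes Agentic CPT, the first approach to bring continual pre-training into the training pipeline for deep research agents, and develops a deep research agent model named AgentFounder.

Method: Pre-trains the AgentFounder model with Agentic CPT to build an agentic foundation model, then evaluates it on 10 benchmarks.

Result: AgentFounder-30B achieves state-of-the-art performance while retaining strong tool-use ability, including 39.9% on BrowseComp-en, 43.3% on BrowseComp-zh, and 31.5% Pass@1 on HLE.

Insight: Continual pre-training can effectively resolve the optimization conflicts that arise in agent training, improving both performance and generalization.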

Abstract: Large language models (LLMs) have evolved into agentic systems capable of autonomous tool use and multi-step reasoning for complex problem-solving. However, post-training approaches building upon general-purpose foundation models consistently underperform in agentic tasks, particularly in open-source implementations. We identify the root cause: the absence of robust agentic foundation models forces models during post-training to simultaneously learn diverse agentic behaviors while aligning them to expert demonstrations, thereby creating fundamental optimization tensions. To this end, we are the first to propose incorporating Agentic Continual Pre-training (Agentic CPT) into the deep research agents training pipeline to build powerful agentic foundational models. Based on this approach, we develop a deep research agent model named AgentFounder. We evaluate our AgentFounder-30B on 10 benchmarks and achieve state-of-the-art performance while retaining strong tool-use ability, notably 39.9% on BrowseComp-en, 43.3% on BrowseComp-zh, and 31.5% Pass@1 on HLE.

[37] Towards General Agentic Intelligence via Environment Scaling

Runnan Fang,Shihao Cai,Baixuan Li,Jialong Wu,Guangyu Li,Wenbiao Yin,Xinyu Wang,Xiaobin Wang,Liangcai Su,Zhen Zhang,Shibin Wu,Zhengwei Tao,Yong Jiang,Pengjun Xie,Fei Huang,Jingren Zhou

Main category: cs.CL

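TL;DR: The paper pursues general agentic intelligence through environment scaling: it designs a framework that automatically constructs diverse simulated environments and uses a two-phase fine-tuning strategy to strengthen agents' function-calling ability.

Details Motivation: For large language models to call diverse real-world APIs effectively, agents need precise, robust function-calling ability developed through interaction with environments. Environment diversity is the key factor, but scaling environments in a principled way and training agents efficiently on them are the main challenges.

Contribution: 1. Proposes a scalable framework that automatically builds diverse simulated environments; 2. designs a two-phase fine-tuning strategy (fundamental capabilities, then domain specialization); 3. achieves substantial gains, with the resulting agent (AgentScaler) performing strongly on tau-bench, tau2-Bench, and ACEBench.

Method: 1. Automatically generates heterogeneous environments with the scalable framework; 2. applies two-phase fine-tuning: first building fundamental function-calling ability, then specializing for particular domains.

Result: Experiments show that AgentScaler significantly improves function-calling capability and outperforms baseline models on multiple benchmarks.

Insight: Environment diversity is crucial to an agent's function-calling ability, and systematically scaling environments combined with staged training can efficiently improve an agent's generality.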

Abstract: Advanced agentic intelligence is a prerequisite for deploying Large Language Models in practical, real-world applications. Diverse real-world APIs demand precise, robust function-calling intelligence, which needs agents to develop these capabilities through interaction in varied environments. The breadth of function-calling competence is closely tied to the diversity of environments in which agents are trained. In this work, we scale up environments as a step towards advancing general agentic intelligence. This gives rise to two central challenges: (i) how to scale environments in a principled manner, and (ii) how to effectively train agentic capabilities from experiences derived through interactions with these environments. To address these, we design a scalable framework that automatically constructs heterogeneous environments that are fully simulated, systematically broadening the space of function-calling scenarios. We further adapt a two-phase agent fine-tuning strategy: first endowing agents with fundamental agentic capabilities, then specializing them for domain-specific contexts. Extensive experiments on agentic benchmarks, tau-bench, tau2-Bench, and ACEBench, demonstrate that our trained model, AgentScaler, significantly enhances the function-calling capability of models.

[38] ReSum: Unlocking Long-Horizon Search Intelligence via Context Summarization

Xixi Wu,Kuan Li,Yida Zhao,Liwen Zhang,Litu Ou,Huifeng Yin,Zhongwang Zhang,Yong Jiang,Pengjun Xie,Fei Huang,Minhao Cheng,Shuai Wang,Hong Cheng,Jingren Zhou

Main category: cs.CL

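TL;DR: ReSum is a new paradigm that unlocks long-horizon search intelligence through context summarization, addressing the failure of LLM agents to finish search tasks on complex queries because of context limits.

Details Motivation: LLM-based agents perform well on knowledge-intensive tasks, but for queries involving multiple entities, intertwined relationships, and high uncertainty, the context window becomes the main obstacle.

Contribution: Proposes the ReSum paradigm, which enables indefinite exploration through periodic context summarization while keeping awareness of earlier discoveries, and the ReSum-GRPO training method, which further improves agent performance.

Method: ReSum compresses the interaction history into a compact reasoning state, bypassing context limits; ReSum-GRPO uses segmented trajectory training and advantage broadcasting to adapt agents to summary-conditioned reasoning.

Result: Across three benchmarks, ReSum improves on ReAct by 4.5% on average (absolute), and ReSum-GRPO training adds gains of up to 8.2%. With only 1K training samples, WebResummer-30B reaches 33.3% Pass@1 on BrowseComp-zh and 18.3% on BrowseComp-en.

Insight: Context summarization is an effective way to ease context limits for LLM agents in long-horizon search, and suitable training improves performance further.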

Abstract: Large Language Model (LLM)-based web agents demonstrate strong performance on knowledge-intensive tasks but are hindered by context window limitations in paradigms like ReAct. Complex queries involving multiple entities, intertwined relationships, and high uncertainty demand extensive search cycles that rapidly exhaust context budgets before reaching complete solutions. To overcome this challenge, we introduce ReSum, a novel paradigm that enables indefinite exploration through periodic context summarization. ReSum converts growing interaction histories into compact reasoning states, maintaining awareness of prior discoveries while bypassing context constraints. For paradigm adaptation, we propose ReSum-GRPO, integrating GRPO with segmented trajectory training and advantage broadcasting to familiarize agents with summary-conditioned reasoning. Extensive experiments on web agents of varying scales across three benchmarks demonstrate that ReSum delivers an average absolute improvement of 4.5% over ReAct, with further gains of up to 8.2% following ReSum-GRPO training. Notably, with only 1K training samples, our WebResummer-30B (a ReSum-GRPO-trained version of WebSailor-30B) achieves 33.3% Pass@1 on BrowseComp-zh and 18.3% on BrowseComp-en, surpassing existing open-source web agents.
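
The core idea of periodic context summarization can be sketched as a loop that replaces the growing history with a compact state whenever a size budget is exceeded. This is only an illustrative sketch, not the authors' implementation: `step_fn`, `summarize_fn`, the character-count budget, and the toy helpers are all assumptions introduced here.

```python
def run_with_periodic_summary(question, step_fn, summarize_fn, budget, max_steps=50):
    """Run an agent loop, compressing the history whenever it exceeds `budget`.

    step_fn(question, history) -> (observation, answer_or_None)   [hypothetical]
    summarize_fn(question, history) -> compact reasoning state    [hypothetical]
    """
    history = []       # observations, or a single summary after compression
    n_summaries = 0
    for _ in range(max_steps):
        observation, answer = step_fn(question, history)
        history.append(observation)
        if answer is not None:
            return answer, n_summaries
        # Crude size proxy: total characters (a real agent would count tokens).
        if sum(len(item) for item in history) > budget:
            history = [summarize_fn(question, history)]
            n_summaries += 1
    return None, n_summaries

def make_toy_step(steps_needed):
    """Toy stand-in for an LLM + tool call: answers after a fixed number of steps."""
    state = {"n": 0}
    def step(question, history):
        state["n"] += 1
        observation = f"result {state['n']}: " + "x" * 40
        answer = "done" if state["n"] >= steps_needed else None
        return observation, answer
    return step

def toy_summarize(question, history):
    """Toy stand-in for an LLM summarizer."""
    return f"summary of {len(history)} items"

answer, n_summaries = run_with_periodic_summary(
    "example question", make_toy_step(6), toy_summarize, budget=100
)
print(answer, n_summaries)
```

The design point is that the working context stays bounded regardless of how many search steps the query needs, at the cost of whatever detail the summarizer drops.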

cs.CV [Back]

[39] RU-Net for Automatic Characterization of TRISO Fuel Cross Sections

Lu Cai,Fei Xu,Min Xian,Yalei Tang,Shoukun Sun,John Stempien

Main category: cs.CV

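TL;DR: The paper proposes RU-Net, a convolutional neural network (CNN) model that automatically segments microscopic cross-sectional images of TRISO fuel particles, to speed up analysis and make the results more objective.

Details Motivation: Manually analyzing microscopic images of TRISO fuel particles is time-consuming and subjective, so an automated solution is needed to accelerate data analysis and improve accuracy.

Contribution: 1. Builds an annotated dataset of more than 2,000 microscopic images of TRISO particles; 2. develops the RU-Net model, which achieves the best Intersection over Union (IoU) compared with other CNN architectures (U-Net, ResNet, and Attention U-Net).

Method: Uses several CNN architectures (the newly developed RU-Net plus the existing U-Net, ResNet, and Attention U-Net) to automatically segment microscopic cross-sectional images of TRISO particles.

Result: In preliminary results, RU-Net performs best in terms of IoU; CNN-based segmentation can significantly reduce manual labor and improve the objectivity of the segmentation results.

Insight: CNNs hold considerable potential for microscopic image analysis, and the RU-Net design may offer ideas for similar segmentation tasks.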

Abstract: During irradiation, phenomena such as kernel swelling and buffer densification may impact the performance of tristructural isotropic (TRISO) particle fuel. Post-irradiation microscopy is often used to identify these irradiation-induced morphologic changes. However, each fuel compact generally contains thousands of TRISO particles. Manually performing the work to get statistical information on these phenomena is cumbersome and subjective. To reduce the subjectivity inherent in that process and to accelerate data analysis, we used convolutional neural networks (CNNs) to automatically segment cross-sectional images of microscopic TRISO layers. CNNs are a class of machine-learning algorithms specifically designed for processing structured grid data. They have gained popularity in recent years due to their remarkable performance in various computer vision tasks, including image classification, object detection, and image segmentation. In this research, we generated a large irradiated TRISO layer dataset with more than 2,000 microscopic images of cross-sectional TRISO particles and the corresponding annotated images. Based on these annotated images, we used different CNNs to automatically segment different TRISO layers. These CNNs include RU-Net (developed in this study), as well as three existing architectures: U-Net, Residual Network (ResNet), and Attention U-Net. The preliminary results show that the model based on RU-Net performs best in terms of Intersection over Union (IoU). Using CNN models, we can expedite the analysis of TRISO particle cross sections, significantly reducing the manual labor involved and improving the objectivity of the segmentation results.
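
Intersection over Union (IoU), the metric used to rank the models, compares a predicted label map with its annotation class by class. A minimal sketch on tiny hand-made label maps (the maps and the function name `iou_per_class` are illustrative assumptions, not taken from the paper):

```python
def iou_per_class(pred, target, num_classes):
    """Per-class IoU for two label maps given as equally shaped nested lists."""
    inter = [0] * num_classes
    union = [0] * num_classes
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p == t:
                # Pixel belongs to class p in both maps
                inter[p] += 1
                union[p] += 1
            else:
                # Pixel counts toward the union of both classes involved
                union[p] += 1
                union[t] += 1
    # None marks classes absent from both maps (IoU undefined)
    return [inter[c] / union[c] if union[c] else None for c in range(num_classes)]

# Hypothetical 2x2 label maps with two classes (e.g. 0 = background, 1 = layer)
pred = [[0, 1],
        [1, 1]]
target = [[0, 1],
          [0, 1]]
scores = iou_per_class(pred, target, num_classes=2)
print(scores)  # [0.5, 0.6666666666666666]
```

Averaging the defined per-class values gives a mean IoU, which is one common way to summarize multi-layer segmentation quality in a single number.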

[40] Modular, On-Site Solutions with Lightweight Anomaly Detection for Sustainable Nutrient Management in Agriculture

Abigail R. Cohen,Yuming Sun,Zhihao Qin,Harsh S. Muriki,Zihao Xiao,Yeonju Lee,Matthew Housley,Andrew F. Sharkey,Rhuanito S. Ferrarezi,Jing Li,Lu Gan,Yongsheng Chen

Main category: cs.CV

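TL;DR: The study examines modular solutions with lightweight anomaly detection for agriculture, combining efficient nutrient management with multispectral imaging; it proposes a tiered detection approach and analyzes the trade-off between energy use and accuracy.

Details Motivation: Traditional nutrient management methods are slow and cannot support real-time optimization. Multispectral imaging is fast but computationally expensive, which limits its use in resource-constrained settings.

Contribution: Proposes a tiered anomaly detection pipeline that combines an autoencoder with status estimation modules, enabling efficient nutrient management and low-energy crop growth monitoring.

Method: Uses multispectral imaging data with an autoencoder (AE) for anomaly detection, and compares two status estimation methods: Random Forest (RF) on vegetation index (VI) features and a Vision Transformer (ViT) on raw images.

Result: The AE reaches a 73% net detection rate (T3 samples, 9 days after transplanting) at low energy cost; ViT outperforms RF on phosphorus and calcium estimation but consumes more energy.

Insight: A modular design can balance energy use against accuracy, making edge computing in agriculture a practical option.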

Abstract: Efficient nutrient management is critical for crop growth and sustainable resource consumption (e.g., nitrogen, energy). Current approaches require lengthy analyses, preventing real-time optimization; similarly, imaging facilitates rapid phenotyping but can be computationally intensive, preventing deployment under resource constraints. This study proposes a flexible, tiered pipeline for anomaly detection and status estimation (fresh weight, dry mass, and tissue nutrients), including a comprehensive energy analysis of approaches that span the efficiency-accuracy spectrum. Using a nutrient depletion experiment with three treatments (T1-100%, T2-50%, and T3-25% fertilizer strength) and multispectral imaging (MSI), we developed a hierarchical pipeline using an autoencoder (AE) for early warning. Further, we compared two status estimation modules of different complexity for more detailed analysis: vegetation index (VI) features with machine learning (Random Forest, RF) and raw whole-image deep learning (Vision Transformer, ViT). Results demonstrated high-efficiency anomaly detection (73% net detection of T3 samples 9 days after transplanting) at substantially lower energy than embodied energy in wasted nitrogen. The state estimation modules show trade-offs, with ViT outperforming RF on phosphorus and calcium estimation (R² 0.61 vs. 0.58, 0.48 vs. 0.35) at higher energy cost. With our modular pipeline, this work opens opportunities for edge diagnostics and practical opportunities for agricultural sustainability.

[41] Humor in Pixels: Benchmarking Large Multimodal Models Understanding of Online Comics

Yuriel Ryan,Rui Yang Tan,Kenny Tsu Wei Choo,Roy Ka-Wei Lee

Main category: cs.CV

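TL;DR: PixelHumor is a benchmark dataset for evaluating how well Large Multimodal Models (LMMs) understand multimodal humor and narrative sequences; it reveals the limits of current models in integrating visual and textual cues.

Details Motivation: Understanding humor is central to social intelligence, yet it remains a major challenge for LMMs. Their multimodal humor understanding and narrative ability need systematic evaluation.

Contribution: Introduces the PixelHumor dataset (2,800 annotated multi-panel comics) for evaluating LMMs' multimodal humor understanding and narrative coherence, and exposes model shortcomings in integrating vision and text.

Method: Builds PixelHumor by annotating multi-panel comics, then evaluates mainstream LMMs on tasks such as humor reasoning and panel sequencing.

Result: Top LMMs reach only 61% accuracy on panel sequencing, far below human performance, which highlights current shortcomings in multimodal understanding and narrative coherence.

Insight: PixelHumor offers a rigorous evaluation framework for developing more socially intelligent LMMs and underlines the need to improve visual-textual integration and narrative reasoning.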

Abstract: Understanding humor is a core aspect of social intelligence, yet it remains a significant challenge for Large Multimodal Models (LMMs). We introduce PixelHumor, a benchmark dataset of 2,800 annotated multi-panel comics designed to evaluate LMMs’ ability to interpret multimodal humor and recognize narrative sequences. Experiments with state-of-the-art LMMs reveal substantial gaps: for instance, top models achieve only 61% accuracy in panel sequencing, far below human performance. This underscores critical limitations in current models’ integration of visual and textual cues for coherent narrative and humor understanding. By providing a rigorous framework for evaluating multimodal contextual and narrative reasoning, PixelHumor aims to drive the development of LMMs that better engage in natural, socially aware interactions.

[42] OnlineHOI: Towards Online Human-Object Interaction Generation and Perception

Yihong Ji,Yunze Liu,Yiyao Zhuo,Weijiang Yu,Fei Ma,Joshua Huang,Fei Yu

Main category: cs.CV

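TL;DR: The paper introduces the new tasks of online Human-Object Interaction (HOI) generation and perception, and proposes OnlineHOI, a framework built on Mamba with a memory mechanism, to address the poor performance of existing offline methods in online settings.

Details Motivation: Existing HOI methods mainly target offline settings and cannot cope with the online constraint that only current and historical data are available.

Contribution: 1. Proposes the new tasks of online HOI generation and perception; 2. designs the OnlineHOI framework based on Mamba and a memory mechanism.

Method: Uses the Mamba framework to process streaming data, combined with a memory mechanism that efficiently integrates historical information.

Result: Achieves state-of-the-art results on the Core4D and OAKINK2 online generation tasks and on the HOI4D online perception task.

Insight: Online settings require processing data streams as they arrive, and combining Mamba with a memory mechanism offers a new approach to such tasks.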

Abstract: The perception and generation of Human-Object Interaction (HOI) are crucial for fields such as robotics, AR/VR, and human behavior understanding. However, current approaches model this task in an offline setting, where information at each time step can be drawn from the entire interaction sequence. In contrast, in real-world scenarios, the information available at each time step comes only from the current moment and historical data, i.e., an online setting. We find that offline methods perform poorly in an online context. Based on this observation, we propose two new tasks: Online HOI Generation and Perception. To address these tasks, we introduce the OnlineHOI framework, a network architecture based on the Mamba framework that employs a memory mechanism. By leveraging Mamba’s powerful modeling capabilities for streaming data and the memory mechanism’s efficient integration of historical information, we achieve state-of-the-art results on the Core4D and OAKINK2 online generation tasks, as well as the online HOI4D perception task.

[43] EfficientNet-Based Multi-Class Detection of Real, Deepfake, and Plastic Surgery Faces

Li Kun,Milena Radenkovic

Main category: cs.CV

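TL;DR: The paper proposes an EfficientNet-based multi-class detection method that distinguishes real faces, deepfake-generated faces, and faces altered by plastic surgery, in response to the potential societal harm of deepfake technology.

Details Motivation: Misuse of deepfake technology has serious social consequences, including privacy violations, damage to the reputations of public figures, and threats to national security. An efficient method is therefore needed to detect and distinguish real faces from forged or altered ones.

Contribution: The main contribution is a multi-class detection system built on the EfficientNet framework that identifies face images in three categories: real, deepfake, and plastic surgery.

Method: The paper uses EfficientNet as the backbone and trains it on a multi-class classification task to detect and distinguish the features of real, deepfake, and plastic-surgery faces.

Result: The proposed method is reported to perform well on multi-class face detection, identifying different kinds of forged or altered faces with high accuracy, and offers an effective approach to deepfake detection.

Insight: Combining an efficient deep learning architecture such as EfficientNet with a multi-class classification strategy can markedly improve detection of sophisticated forgeries and points to directions for future work.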

Abstract: Currently, deep learning has been utilised to tackle several difficulties in our everyday lives. It not only exhibits progress in computer vision but also constitutes the foundation for several revolutionary technologies. Nonetheless, similar to all phenomena, the use of deep learning in diverse domains has produced a multifaceted interaction of advantages and disadvantages for human society. Deepfake technology has advanced, significantly impacting social life. However, developments in this technology can affect privacy, the reputations of prominent personalities, and national security via software development. It can produce indistinguishable counterfeit photographs and films, potentially impairing the functionality of facial recognition systems, so presenting a significant risk. The improper application of deepfake technology produces several detrimental effects on society. Face-swapping programs mislead users by altering persons’ appearances or expressions to fulfil particular aims or to appropriate personal information. Deepfake technology permeates daily life through such techniques. Certain individuals endeavour to sabotage election campaigns or subvert prominent political figures by creating deceptive pictures to influence public perception, causing significant harm to a nation’s political and economic structure.

[44] PATIMT-Bench: A Multi-Scenario Benchmark for Position-Aware Text Image Machine Translation in Large Vision-Language Models

Wanru Zhuang,Wenbo Li,Zhibin Lan,Xu Han,Peng Li,Jinsong Su

Main category: cs.CV

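TL;DR: The paper introduces the position-aware text image machine translation (PATIMT) task and builds the PATIMT-Bench benchmark, supporting fine-grained, layout-preserving translation; an adaptive OCR refinement pipeline and fine-tuning on the constructed data improve model performance.

Details Motivation: Conventional TIMT ignores position information and covers only a few scenarios, so it falls short of practical needs. PATIMT aims to address this with finer-grained translation support.

Contribution: 1. Proposes the PATIMT task and its two sub-tasks; 2. builds the PATIMT-Bench benchmark covering 10 diverse scenarios; 3. designs an adaptive OCR refinement pipeline; 4. provides a high-quality, manually annotated test set.

Method: 1. Adaptively selects OCR tools by scenario and refines their results; 2. fine-tunes compact Large Vision-Language Models (LVLMs) for PATIMT; 3. covers both region-specific translation and full-image translation with grounding.

Result: After fine-tuning, compact LVLMs reach state-of-the-art performance on both sub-tasks, and the experiments show that the training data is scalable and generalizes well.

Insight: The practical value of PATIMT lies in its fine-grained, layout-preserving nature; the adaptive OCR pipeline and the diversity of the data are the keys to the performance gains.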

Abstract: Text Image Machine Translation (TIMT) aims to translate texts embedded within an image into another language. Current TIMT studies primarily focus on providing translations for all the text within an image, while neglecting to provide bounding boxes and covering limited scenarios. In this work, we extend traditional TIMT into position-aware TIMT (PATIMT), aiming to support fine-grained and layout-preserving translation, which holds great practical value but remains largely unexplored. This task comprises two key sub-tasks: region-specific translation and full-image translation with grounding. To support existing models on PATIMT and conduct fair evaluation, we construct the PATIMT benchmark (PATIMT-Bench), which consists of 10 diverse real-world scenarios. Specifically, we introduce an Adaptive Image OCR Refinement Pipeline, which adaptively selects appropriate OCR tools based on scenario and refines the results of text-rich images. To ensure evaluation reliability, we further construct a test set, which contains 1,200 high-quality instances manually annotated and reviewed by human experts. After fine-tuning on our data, compact Large Vision-Language Models (LVLMs) achieve state-of-the-art performance on both sub-tasks. Experimental results also highlight the scalability and generalizability of our training data.

[45] Deep learning for 3D point cloud processing – from approaches, tasks to its implications on urban and environmental applications

Zhenxin Zhang,Zhihua Xu,Yuwei Cao,Ningli Xu,Shuye Wang,Shen’ao Cui,Zhen Li,Rongjun Qin

Main category: cs.CV

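TL;DR: This survey examines deep learning for 3D point cloud processing, covering key tasks such as scene completion, registration, semantic segmentation, and modeling, and analyzes its practical value and challenges in urban and environmental applications.

Details Motivation: Point cloud processing matters for fields such as mapping and environmental monitoring, but existing surveys focus on network architectures and overlook practical issues such as very large data volumes, diverse scene content, and non-uniform point density.

Contribution: Provides a meta review of deep learning methods and datasets for key point cloud processing tasks (such as semantic segmentation and registration) and assesses their practical potential in urban and environmental applications.

Method: Systematically reviews the core point cloud processing tasks and the related deep learning algorithms and, against real application scenarios such as automated driving and environmental monitoring, analyzes their applicability and limitations.

Result: Current deep learning methods still need improvement to cope with practical issues such as large data volumes and non-uniform point density, especially in urban and environmental applications.

Insight: The paper highlights the challenges that must be resolved to turn deep learning point cloud processing into practice, including algorithm optimization and multimodal data fusion.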

Abstract: Point cloud processing, as a fundamental task in the field of geomatics and computer vision, has been supporting tasks and applications at different scales from air to ground, including mapping, environmental monitoring, urban/tree structure modeling, automated driving, robotics, disaster response, etc. Due to the rapid development of deep learning, point cloud processing algorithms have nowadays been almost explicitly dominated by learning-based approaches, most of which are yet to be transitioned into real-world practice. Existing surveys primarily focus on the ever-updating network architecture to accommodate unordered point clouds, largely ignoring their practical values in typical point cloud processing applications, in which extra-large data volumes, diverse scene contents, varying point density, and data modality need to be considered. In this paper, we provide a meta review on deep learning approaches and datasets that cover a selection of critical tasks of point cloud processing in use such as scene completion, registration, semantic segmentation, and modeling. By reviewing a broad range of urban and environmental applications these tasks can support, we identify gaps to be closed as these methods are transformed into applications and draw concluding remarks in both the algorithmic and practical aspects of the surveyed methods.

[46] Evaluating Robustness of Vision-Language Models Under Noisy Conditions

Purushoth,Alireza

Main category: cs.CV

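TL;DR: The paper presents a comprehensive evaluation framework that tests the robustness of several state-of-the-art Vision-Language Models (VLMs) under noisy conditions, revealing nuanced trade-offs among model size, dataset characteristics, and noise type.

Details Motivation: VLMs perform well on multimodal tasks, but their robustness under noisy conditions has not been studied enough. The paper aims to fill this gap.

Contribution: 1. Proposes a standardized evaluation framework; 2. characterizes the relationship between noise type and model performance; 3. provides insights on model size and datasets.

Method: Evaluates multiple VLMs under controlled perturbations (lighting variation, motion blur, and compression artifacts), using lexical metrics (BLEU, METEOR, ROUGE, CIDEr) together with neural similarity measures based on sentence embeddings.

Result: The experiments show that (1) the descriptiveness of ground-truth captions significantly affects performance; (2) larger models such as LLaVA are stronger in semantic understanding but do not always outperform smaller models; (3) JPEG compression and motion blur degrade performance the most.

Insight: Robustness depends not only on model size but also on dataset characteristics and noise type; the study offers a benchmark for future work on robust multimodal learning.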

Abstract: Vision-Language Models (VLMs) have attained exceptional success across multimodal tasks such as image captioning and visual question answering. However, their robustness under noisy conditions remains underexplored. In this study, we present a comprehensive evaluation framework to assess the performance of several state-of-the-art VLMs under controlled perturbations, including lighting variation, motion blur, and compression artifacts. We used both lexical-based metrics (BLEU, METEOR, ROUGE, CIDEr) and neural-based similarity measures using sentence embeddings to quantify semantic alignment. Our experiments span diverse datasets, revealing key insights: (1) descriptiveness of ground-truth captions significantly influences model performance; (2) larger models like LLaVA excel in semantic understanding but do not universally outperform smaller models; and (3) certain noise types, such as JPEG compression and motion blur, dramatically degrade performance across models. Our findings highlight the nuanced trade-offs between model size, dataset characteristics, and noise resilience, offering a standardized benchmark for future robust multimodal learning.

[47] Axis-Aligned 3D Stalk Diameter Estimation from RGB-D Imagery

Benjamin Vail,Rahul Harsha Cheppally,Ajay Sharda,Sidharth Rai

Main category: cs.CV

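TL;DR: The paper presents a geometry-aware computer vision pipeline that estimates crop stalk diameter from RGB-D imagery, offering a reliable and scalable solution for high-throughput phenotyping.

Details Motivation: Traditional stalk diameter measurement is labor-intensive and error-prone, and it does not suit high-throughput phenotyping. The paper aims to solve this with computer vision.

Contribution: The main contribution is a robust diameter estimation method that combines deep learning instance segmentation, 3D point cloud reconstruction, and PCA-based axis-aligned slicing.

Method: The pipeline consists of 1) deep learning instance segmentation to locate stalks, 2) 3D point cloud reconstruction from RGB-D data, and 3) PCA-based axis alignment followed by slicing to estimate diameter.

Result: The method mitigates the effects of curvature, occlusion, and image noise, supporting robust stalk diameter estimation.

Insight: Geometry-aware computer vision can markedly improve the efficiency and reliability of agricultural phenotyping.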

Abstract: Accurate, high-throughput phenotyping is a critical component of modern crop breeding programs, especially for improving traits such as mechanical stability, biomass production, and disease resistance. Stalk diameter is a key structural trait, but traditional measurement methods are labor-intensive, error-prone, and unsuitable for scalable phenotyping. In this paper, we present a geometry-aware computer vision pipeline for estimating stalk diameter from RGB-D imagery. Our method integrates deep learning-based instance segmentation, 3D point cloud reconstruction, and axis-aligned slicing via Principal Component Analysis (PCA) to perform robust diameter estimation. By mitigating the effects of curvature, occlusion, and image noise, this approach offers a scalable and reliable solution to support high-throughput phenotyping in breeding and agronomic research.
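
The axis-aligned slicing step can be illustrated with a small standard-library sketch: find the stalk's principal axis, keep the points in a thin slice perpendicular to it, and measure their spread. The abstract does not say how diameter is computed within a slice, so the "twice the mean radial distance" rule below is an assumption, as are the function names and the synthetic cylinder used as input.

```python
import math

def principal_axis(points, iters=100):
    """Mean and dominant covariance eigenvector (first PCA axis) via power iteration."""
    n = len(points)
    mean = [sum(p[i] for p in points) / n for i in range(3)]
    centered = [[p[i] - mean[i] for i in range(3)] for p in points]
    cov = [[sum(c[i] * c[j] for c in centered) / n for j in range(3)] for i in range(3)]
    # Fixed start vector; assumes it is not orthogonal to the dominant axis.
    v = [1.0, 1.0, 1.0]
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(3)) for i in range(3)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return mean, v

def slice_diameter(points, center_t, half_width):
    """Diameter estimate from points within `half_width` of `center_t` along the axis.

    Assumption: points lie on the stalk surface, so diameter is taken as twice
    the mean distance from the axis line through the cloud's mean.
    """
    mean, axis = principal_axis(points)
    radial = []
    for p in points:
        d = [p[i] - mean[i] for i in range(3)]
        t = sum(d[i] * axis[i] for i in range(3))           # position along the axis
        if abs(t - center_t) <= half_width:
            perp = [d[i] - t * axis[i] for i in range(3)]   # off-axis component
            radial.append(math.sqrt(sum(x * x for x in perp)))
    if not radial:
        return None
    return 2.0 * sum(radial) / len(radial)

# Synthetic check: surface points of a cylinder with radius 0.5 along the z-axis.
cylinder = [
    (0.5 * math.cos(2 * math.pi * k / 12), 0.5 * math.sin(2 * math.pi * k / 12), 0.5 * h)
    for h in range(21)
    for k in range(12)
]
diameter = slice_diameter(cylinder, center_t=0.0, half_width=0.6)
print(diameter)
```

Slicing along the PCA axis rather than along image rows is what keeps the measurement perpendicular to the stalk even when the stalk is tilted in the camera frame. A curved stalk would presumably need the axis re-estimated locally per slice, which this sketch does not attempt.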

[48] Explicit Multimodal Graph Modeling for Human-Object Interaction Detection

Wenxuan Ji,Haichao Shi,Xiao-Yu Zhang

Main category: cs.CV

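TL;DR: The paper proposes Multimodal Graph Network Modeling (MGNM), a graph-neural-network-based multimodal approach that explicitly models the relational structure in Human-Object Interaction (HOI) detection, improving performance and achieving state-of-the-art results on the HICO-DET and V-COCO benchmarks.

Details Motivation: Transformer architectures do not explicitly model relational structure in HOI detection, which hampers interaction recognition, whereas Graph Neural Networks (GNNs) are naturally suited to modeling such relationships.

Contribution: Proposes the MGNM framework, which explicitly models the HOI task as a four-stage graph structure, and introduces a multi-level feature interaction mechanism that fuses vision and language features to improve information propagation.

Method: Designs a multimodal graph network that explicitly models HOI relationships through a four-stage graph structure and uses multi-level vision and language features to strengthen information exchange.

Result: Achieves state-of-the-art performance on HICO-DET and V-COCO; with a more advanced object detector, performance improves further while keeping a good balance between rare and non-rare classes.

Insight: Explicitly modeling relational structure is crucial for HOI tasks, and fusing multimodal features effectively improves interaction recognition.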

Abstract: Transformer-based methods have recently become the prevailing approach for Human-Object Interaction (HOI) detection. However, the Transformer architecture does not explicitly model the relational structures inherent in HOI detection, which impedes the recognition of interactions. In contrast, Graph Neural Networks (GNNs) are inherently better suited for this task, as they explicitly model the relationships between human-object pairs. Therefore, in this paper, we propose Multimodal Graph Network Modeling (MGNM) that leverages GNN-based relational structures to enhance HOI detection. Specifically, we design a multimodal graph network framework that explicitly models the HOI task in a four-stage graph structure. Furthermore, we introduce a multi-level feature interaction mechanism within our graph network. This mechanism leverages multi-level vision and language features to enhance information propagation across human-object pairs. Consequently, our proposed MGNM achieves state-of-the-art performance on two widely used benchmarks: HICO-DET and V-COCO. Moreover, when integrated with a more advanced object detector, our method demonstrates a significant performance gain and maintains an effective balance between rare and non-rare classes.

[49] VQT-Light: Lightweight HDR Illumination Map Prediction with Richer Texture

Kunliang Xie

Main category: cs.CV

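TL;DR: VQT-Light is a lightweight framework built on VQVAE and ViT for predicting high dynamic range (HDR) illumination maps; by extracting discrete features and capturing global context, it delivers texture-rich and efficient lighting estimation.

Details Motivation: Existing methods trade off texture recovery in the illumination map against running speed, and struggle to achieve high texture fidelity and real-time performance together.

Contribution: 1. Proposes VQT-Light, a new framework based on VQVAE and ViT; 2. reformulates lighting estimation as a multiclass classification problem; 3. achieves efficient, texture-rich lighting prediction.

Method: 1. Uses VQVAE to extract discrete features of the illumination map, avoiding "posterior collapse"; 2. uses ViT to capture the global context of the input image; 3. models lighting estimation as a multiclass classification task.

Result: VQT-Light runs at 40 FPS and outperforms existing methods on multiple evaluation metrics, predicting illumination with higher texture fidelity.

Insight: Discrete feature representations combined with global modeling can markedly improve both the quality and the efficiency of lighting estimation.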

Abstract: Accurate lighting estimation is a significant yet challenging task in computer vision and graphics. However, existing methods either struggle to restore detailed textures of the illumination map, or face challenges in running speed and texture fidelity. To tackle this problem, we propose a novel framework (VQT-Light) based on VQVAE and ViT architecture. VQT-Light includes two modules: feature extraction and lighting estimation. First, we take advantage of VQVAE to extract discrete features of the illumination map rather than continuous features to avoid “posterior collapse”. Second, we capture global context and dependencies of the input image through ViT rather than CNNs to improve the prediction of illumination outside the field of view. Combining the above two modules, we formulate the lighting estimation as a multiclass classification task, which plays a key role in our pipeline. As a result, our model predicts a light map with richer texture and better fidelity while remaining lightweight and fast. VQT-Light achieves an inference speed of 40FPS and improves multiple evaluation metrics. Qualitative and quantitative experiments demonstrate that the proposed method realizes superior results compared to existing state-of-the-art methods.
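
To see why discrete features turn lighting estimation into classification, consider the quantization step of a VQVAE: each continuous feature vector is replaced by the index of its nearest codebook entry, so a downstream predictor only has to choose among a fixed set of indices. The sketch below shows that lookup only; the tiny 2-D codebook, the vectors, and the name `quantize` are illustrative assumptions rather than details from the paper.

```python
def quantize(vectors, codebook):
    """Map each vector to the index of its nearest codebook entry (squared L2)."""
    indices = []
    for v in vectors:
        dists = [sum((a - b) ** 2 for a, b in zip(v, code)) for code in codebook]
        indices.append(dists.index(min(dists)))
    return indices

# Hypothetical codebook with three entries and three feature vectors
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
features = [[0.9, 0.1], [0.1, 0.8], [0.2, 0.1]]

indices = quantize(features, codebook)
print(indices)  # [1, 2, 0]

# Decoding looks the entries back up; a predictor can be trained to output
# these indices directly, i.e. one class label per spatial position.
reconstructed = [codebook[i] for i in indices]
```

With K codebook entries, predicting the illumination map amounts to a K-way classification at each latent position, which is presumably the sense in which the abstract calls the task multiclass classification.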

[50] Beyond Artificial Misalignment: Detecting and Grounding Semantic-Coordinated Multimodal Manipulations

Jinjie Shen,Yaxiong Wang,Lechao Cheng,Nan Pu,Zhun Zhong

Main category: cs.CV

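TL;DR: The paper presents the first Semantic-Aligned Multimodal Manipulation (SAMM) dataset and a retrieval-augmented detection and grounding framework (RamDG), which significantly improves detection performance.

Details Motivation: Existing datasets artificially break cross-modal alignment, which makes detection too easy and fails to reflect real-world multimodal manipulation; more realistic datasets and methods are needed.

Contribution: 1. Builds the first semantically aligned multimodal manipulation dataset (SAMM); 2. proposes the retrieval-augmented detection and grounding framework (RamDG).

Method: 1. SAMM is generated in two stages: image manipulation, followed by generation of semantically consistent text; 2. RamDG retrieves contextual evidence from external knowledge repositories as auxiliary text and processes it together with the inputs through image forgery grounding and deep manipulation detection modules.

Result: RamDG achieves 2.06% higher detection accuracy on SAMM than existing methods.

Insight: Real-world multimodal manipulation keeps semantics consistent across modalities, so datasets that simply break alignment bias detection. SAMM and RamDG offer a more realistic benchmark and solution.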

Abstract: The detection and grounding of manipulated content in multimodal data has emerged as a critical challenge in media forensics. While existing benchmarks demonstrate technical progress, they suffer from misalignment artifacts that poorly reflect real-world manipulation patterns: practical attacks typically maintain semantic consistency across modalities, whereas current datasets artificially disrupt cross-modal alignment, creating easily detectable anomalies. To bridge this gap, we pioneer the detection of semantically-coordinated manipulations where visual edits are systematically paired with semantically consistent textual descriptions. Our approach begins with constructing the first Semantic-Aligned Multimodal Manipulation (SAMM) dataset, generated through a two-stage pipeline: 1) applying state-of-the-art image manipulations, followed by 2) generation of contextually-plausible textual narratives that reinforce the visual deception. Building on this foundation, we propose a Retrieval-Augmented Manipulation Detection and Grounding (RamDG) framework. RamDG commences by harnessing external knowledge repositories to retrieve contextual evidence, which serves as auxiliary text and is encoded together with the inputs through our image forgery grounding and deep manipulation detection modules to trace all manipulations. Extensive experiments demonstrate our framework significantly outperforms existing methods, achieving 2.06% higher detection accuracy on SAMM compared to state-of-the-art approaches. The dataset and code are publicly available at https://github.com/shen8424/SAMM-RamDG-CAP.

[51] A Comparative Study of YOLOv8 to YOLOv11 Performance in Underwater Vision Tasks

Gordon Hung,Ivan Felipe Rodriguez

Main category: cs.CV

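TL;DR: This paper compares YOLOv8 through YOLOv11 on underwater vision tasks, finds that the lightweight YOLOv10 offers the best speed-accuracy trade-off, and provides a reproducible benchmark.

Details Motivation: Autonomous underwater vehicles (AUVs) rely on computer-vision systems for underwater tasks, but underwater imagery suffers from light attenuation, turbidity, and class imbalance. Results of YOLO detectors on terrestrial benchmarks may not carry over to the underwater domain, so a systematic comparison is needed.

Contribution: 1. The first controlled comparison of YOLOv8 through YOLOv11 on underwater imagery; 2. Evidence that the lightweight YOLOv10 is best suited to embedded AUV deployment; 3. An open, reproducible benchmark and codebase.

Method: Uses two public datasets (Coral Disease and Fish Species), trains the small (s) variants of YOLOv8 through YOLOv11 with identical hyperparameters at four training-set fractions, and evaluates accuracy, speed, and Grad-CAM feature visualizations.

Result: Accuracy saturates after YOLOv9, while inference speed improves markedly; YOLOv10 gives the best speed-accuracy trade-off.

Insight: Successive YOLO releases mainly improve efficiency rather than accuracy, and lightweight models are better suited to resource-constrained underwater applications.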

Abstract: Autonomous underwater vehicles (AUVs) increasingly rely on on-board computer-vision systems for tasks such as habitat mapping, ecological monitoring, and infrastructure inspection. However, underwater imagery is hindered by light attenuation, turbidity, and severe class imbalance, while the computational resources available on AUVs are limited. One-stage detectors from the YOLO family are attractive because they fuse localization and classification in a single, low-latency network; however, their terrestrial benchmarks (COCO, PASCAL-VOC, Open Images) leave open the question of how successive YOLO releases perform in the marine domain. We curate two openly available datasets that span contrasting operating conditions: a Coral Disease set (4,480 images, 18 classes) and a Fish Species set (7,500 images, 20 classes). For each dataset, we create four training regimes (25 %, 50 %, 75 %, 100 % of the images) while keeping balanced validation and test partitions fixed. We train YOLOv8-s, YOLOv9-s, YOLOv10-s, and YOLOv11-s with identical hyperparameters (100 epochs, 640 px input, batch = 16, T4 GPU) and evaluate precision, recall, mAP50, mAP50-95, per-image inference time, and frames-per-second (FPS). Post-hoc Grad-CAM visualizations probe feature utilization and localization faithfulness. Across both datasets, accuracy saturates after YOLOv9, suggesting architectural innovations primarily target efficiency rather than accuracy. Inference speed, however, improves markedly. Our results (i) provide the first controlled comparison of recent YOLO variants on underwater imagery, (ii) show that lightweight YOLOv10 offers the best speed-accuracy trade-off for embedded AUV deployment, and (iii) deliver an open, reproducible benchmark and codebase to accelerate future marine-vision research.
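
The four training regimes described above can be sketched as follows. This is a minimal illustration, not the authors' code: the abstract does not say how the subsets were sampled, so the nested (prefix) sampling, the `make_training_regimes` helper, the seed, and the image ids are all assumptions. The FPS helper assumes single-image inference with no extra pre- or post-processing time.

```python
import random


def make_training_regimes(train_ids, fractions=(0.25, 0.50, 0.75, 1.00), seed=0):
    """Return {fraction: list of ids}. Validation and test splits are not touched.

    Assumption: regimes are nested, i.e. the 25% subset is contained in the
    50% subset, and so on. The abstract does not state this.
    """
    shuffled = list(train_ids)
    random.Random(seed).shuffle(shuffled)  # one fixed shuffle shared by all regimes
    regimes = {}
    for frac in fractions:
        count = round(len(shuffled) * frac)
        regimes[frac] = shuffled[:count]  # a prefix of the same shuffled order
    return regimes


def frames_per_second(inference_ms):
    """Convert per-image inference time in milliseconds to FPS."""
    return 1000.0 / inference_ms


# Hypothetical image ids, only for demonstration.
train_ids = [f"img_{i:04d}" for i in range(400)]
regimes = make_training_regimes(train_ids)
for frac, ids in regimes.items():
    print(f"{int(frac * 100)}% regime: {len(ids)} images")
```

Keeping the regimes nested means a larger regime only ever adds images, which makes accuracy differences between fractions easier to attribute to data volume rather than to a different sample.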

[52] RIS-FUSION: Rethinking Text-Driven Infrared and Visible Image Fusion from the Perspective of Referring Image Segmentation

Siju Ma,Changsiyu Gong,Xiaofeng Fan,Yong Ma,Chengjie Jiang

Main category: cs.CV

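TL;DR: The paper proposes RIS-FUSION, a cascaded framework that unifies text-driven infrared and visible image fusion with referring image segmentation (RIS). Its LangGatedFusion module injects textual features to strengthen semantic alignment, and the paper also introduces the large-scale MM-RIS dataset.

Details Motivation: Existing text-driven infrared and visible image fusion methods lack a goal-aligned task to supervise and evaluate how much the input text contributes to the fusion result. The authors observe that RIS and text-driven fusion share a common objective: highlighting the object the text refers to.

Contribution: 1. Proposes the RIS-FUSION framework, which unifies image fusion and RIS through joint optimization; 2. Designs the LangGatedFusion module, which injects textual features into the fusion backbone; 3. Builds the large-scale MM-RIS dataset.

Method: A cascaded framework in which the LangGatedFusion module injects textual features for multimodal alignment, with fusion and RIS optimized jointly.

Result: Experiments show that RIS-FUSION achieves state-of-the-art performance, outperforming existing methods by over 11% in mIoU.

Insight: Coupling RIS with image fusion improves text-driven multimodal alignment and focus on the referred object.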

Abstract: Text-driven infrared and visible image fusion has gained attention for enabling natural language to guide the fusion process. However, existing methods lack a goal-aligned task to supervise and evaluate how effectively the input text contributes to the fusion outcome. We observe that referring image segmentation (RIS) and text-driven fusion share a common objective: highlighting the object referred to by the text. Motivated by this, we propose RIS-FUSION, a cascaded framework that unifies fusion and RIS through joint optimization. At its core is the LangGatedFusion module, which injects textual features into the fusion backbone to enhance semantic alignment. To support the multimodal referring image segmentation task, we introduce MM-RIS, a large-scale benchmark with 12.5k training and 3.5k testing triplets, each consisting of an infrared-visible image pair, a segmentation mask, and a referring expression. Extensive experiments show that RIS-FUSION achieves state-of-the-art performance, outperforming existing methods by over 11% in mIoU. Code and dataset will be released at https://github.com/SijuMa2003/RIS-FUSION.
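
For readers unfamiliar with the mIoU figure quoted above, the sketch below shows one common definition: the intersection-over-union of each predicted mask against its ground truth, averaged over samples. The abstract does not say which IoU variant the authors report, so the per-sample averaging, the function names, and the convention for two empty masks are assumptions made for illustration.

```python
def mask_iou(pred, target):
    """IoU of two binary masks given as equally sized lists of rows of 0/1."""
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Assumed convention: two empty masks count as a perfect match.
    return inter / union if union else 1.0


def mean_iou(preds, targets):
    """Average the per-sample IoU over a list of (prediction, ground truth) masks."""
    scores = [mask_iou(p, t) for p, t in zip(preds, targets)]
    return sum(scores) / len(scores)


# Two toy 2x2 samples: the first overlaps on one of two pixels, the second matches exactly.
preds = [[[1, 1], [0, 0]], [[0, 1], [0, 1]]]
targets = [[[1, 0], [0, 0]], [[0, 1], [0, 1]]]
print(mean_iou(preds, targets))
```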

[53] Learning by Imagining: Debiased Feature Augmentation for Compositional Zero-Shot Learning

Haozhe Zhang,Chenchen Jing,Mingyu Liu,Qingsheng Wang,Hao Chen

Main category: cs.CV

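TL;DR: This paper proposes Debiased Feature Augmentation (DeFA), which combines a disentangle-and-reconstruct feature augmentation framework with a debiasing strategy to address attribute-object entanglement and long-tailed distributions in compositional zero-shot learning.

Details Motivation: Compositional zero-shot learning (CZSL) is challenging because attributes and objects are entangled and real-world data is long-tailed. DeFA is inspired by neuroscientific findings that imagination and perception share similar neural processes.

Contribution: Proposes DeFA, which uses a disentangle-and-reconstruct framework and a debiasing strategy to synthesize high-fidelity composition features, improving compositional generalization in CZSL.

Method: Combines the disentangle-and-reconstruct feature augmentation framework with a debiasing strategy, and synthesizes composition features from prior knowledge of seen attributes and objects.

Result: On three widely used datasets, DeFA achieves state-of-the-art performance in both closed-world and open-world settings.

Insight: Imagination-driven feature synthesis can mitigate data distribution bias and improve compositional generalization.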

Abstract: Compositional Zero-Shot Learning (CZSL) aims to recognize unseen attribute-object compositions by learning prior knowledge of seen primitives, i.e., attributes and objects. Learning generalizable compositional representations in CZSL remains challenging due to the entangled nature of attributes and objects as well as the prevalence of long-tailed distributions in real-world data. Inspired by neuroscientific findings that imagination and perception share similar neural processes, we propose a novel approach called Debiased Feature Augmentation (DeFA) to address these challenges. The proposed DeFA integrates a disentangle-and-reconstruct framework for feature augmentation with a debiasing strategy. DeFA explicitly leverages the prior knowledge of seen attributes and objects by synthesizing high-fidelity composition features to support compositional generalization. Extensive experiments on three widely used datasets demonstrate that DeFA achieves state-of-the-art performance in both closed-world and open-world settings.

[54] AsyMoE: Leveraging Modal Asymmetry for Enhanced Expert Specialization in Large Vision-Language Models

Heng Zhang,Haichuan Hu,Yaomin Shen,Weihao Yu,Yilei Yuan,Haochen You,Guo Cheng,Zijian Zhang,Lubin Gan,Huihui Wei,Hao Zhang,Jin Huang

Main category: cs.CV

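TL;DR: AsyMoE is a new architecture that exploits modal asymmetry to strengthen expert specialization, addressing the asymmetry between visual and linguistic processing.

Details Motivation: Existing MoE approaches struggle with the asymmetry between the visual and language modalities. Language experts in deeper layers progressively lose contextual grounding and rely on parametric knowledge rather than on the provided visual and linguistic information.

Contribution: Proposes the AsyMoE architecture with three expert groups (intra-modality experts, hyperbolic inter-modality experts, and evidence-priority language experts), which improves model performance.

Method: The three expert groups divide the work: intra-modality experts handle modality-specific features, hyperbolic inter-modality experts handle hierarchical cross-modal interactions, and evidence-priority language experts suppress parametric biases.

Result: AsyMoE improves accuracy by 26.58% and 15.45% over vanilla MoE and modality-specific MoE respectively, with 25.45% fewer activated parameters than dense models.

Insight: Modal asymmetry is a key issue in cross-modal modeling, and specialized experts can balance modality-specific features against cross-modal interactions.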

Abstract: Large Vision-Language Models (LVLMs) have demonstrated impressive performance on multimodal tasks through scaled architectures and extensive training. However, existing Mixture of Experts (MoE) approaches face challenges due to the asymmetry between visual and linguistic processing. Visual information is spatially complete, while language requires maintaining sequential context. As a result, MoE models struggle to balance modality-specific features and cross-modal interactions. Through systematic analysis, we observe that language experts in deeper layers progressively lose contextual grounding and rely more on parametric knowledge rather than utilizing the provided visual and linguistic information. To address this, we propose AsyMoE, a novel architecture that models this asymmetry using three specialized expert groups. We design intra-modality experts for modality-specific processing, hyperbolic inter-modality experts for hierarchical cross-modal interactions, and evidence-priority language experts to suppress parametric biases and maintain contextual grounding. Extensive experiments demonstrate that AsyMoE achieves 26.58% and 15.45% accuracy improvements over vanilla MoE and modality-specific MoE respectively, with 25.45% fewer activated parameters than dense models.

[55] EvoEmpirBench: Dynamic Spatial Reasoning with Agent-ExpVer

Pukun Zhao,Longxiang Wang,Miaowei Wang,Chen Chen,Fanqing Zhou,Haojian Huang

Main category: cs.CV

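TL;DR: The paper introduces two dynamic spatial reasoning benchmarks (maze navigation and match-2 elimination) to evaluate spatial understanding and adaptive planning in partially observable, dynamically changing environments. It also proposes a subjective experience-based memory mechanism for cross-task experience transfer. Experiments expose the limitations of current mainstream models in dynamic spatial reasoning and long-term memory.

Details Motivation: Existing spatial reasoning benchmarks focus on static or globally observable environments and fail to capture the challenges of long-horizon reasoning and memory use under partial observability and dynamic change.

Contribution: 1. Introduces two dynamic spatial benchmark tasks; 2. Proposes a subjective experience-based memory mechanism; 3. Reveals the shortcomings of mainstream models in dynamic spatial reasoning and long-term memory.

Method: Designs tasks in which every action triggers a structural change in the environment, and proposes a memory mechanism that supports cross-task experience transfer.

Result: Experiments show that mainstream models perform poorly on dynamic spatial tasks, which supports the usefulness of both the benchmarks and the memory mechanism.

Insight: Reasoning in dynamic environments requires models to update their cognition and strategy continuously, and current models still have room to improve in long-term memory and adaptive planning.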

Abstract: Most existing spatial reasoning benchmarks focus on static or globally observable environments, failing to capture the challenges of long-horizon reasoning and memory utilization under partial observability and dynamic changes. We introduce two dynamic spatial benchmarks, locally observable maze navigation and match-2 elimination, that systematically evaluate models’ abilities in spatial understanding and adaptive planning when local perception, environment feedback, and global objectives are tightly coupled. Each action triggers structural changes in the environment, requiring continuous update of cognition and strategy. We further propose a subjective experience-based memory mechanism for cross-task experience transfer and validation. Experiments show that our benchmarks reveal key limitations of mainstream models in dynamic spatial reasoning and long-term memory, providing a comprehensive platform for future methodological advances. Our code and data are available at https://anonymous.4open.science/r/EvoEmpirBench-143C/.

[56] SPGen: Spherical Projection as Consistent and Flexible Representation for Single Image 3D Shape Generation

Jingdong Zhang,Weikai Chen,Yuan Liu,Jionghao Wang,Zhengming Yu,Zhuowen Shen,Bo Yang,Wenping Wang,Xin Li

Main category: cs.CV

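TL;DR: SPGen proposes a novel Spherical Projection (SP) representation for generating 3D shapes from a single image. It addresses the limitations of existing methods in view consistency and in representing complex structures, and significantly outperforms baselines in geometric quality and computational efficiency.

Details Motivation: Existing single-view 3D generative models typically rely on multiview diffusion priors but fall short in view consistency and in representing complex internal structure. SPGen aims to provide a more consistent, flexible, and efficient representation.

Contribution: The core contribution is the Spherical Projection (SP) representation: geometry is projected onto a sphere and unwrapped into a multi-layer 2D representation, which provides view consistency, flexibility, and efficiency.

Method: Geometry is projected onto a bounding sphere and unwrapped into multi-layer 2D SP maps. Because the method operates in the image domain, it inherits 2D diffusion priors and supports efficient finetuning.

Result: Experiments show that SPGen significantly outperforms existing baselines in geometric quality and computational efficiency.

Insight: The SP representation offers a new paradigm for 3D shape generation that is particularly suited to complex internal structures and topologies while avoiding multiview inconsistency.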

Abstract: Existing single-view 3D generative models typically adopt multiview diffusion priors to reconstruct object surfaces, yet they remain prone to inter-view inconsistencies and are unable to faithfully represent complex internal structure or nontrivial topologies. In particular, we encode geometry information by projecting it onto a bounding sphere and unwrapping it into a compact and structural multi-layer 2D Spherical Projection (SP) representation. Operating solely in the image domain, SPGen offers three key advantages simultaneously: (1) Consistency. The injective SP mapping encodes surface geometry with a single viewpoint which naturally eliminates view inconsistency and ambiguity; (2) Flexibility. Multi-layer SP maps represent nested internal structures and support direct lifting to watertight or open 3D surfaces; (3) Efficiency. The image-domain formulation allows the direct inheritance of powerful 2D diffusion priors and enables efficient finetuning with limited computational resources. Extensive experiments demonstrate that SPGen significantly outperforms existing baselines in geometric quality and computational efficiency.

[57] Defense-to-Attack: Bypassing Weak Defenses Enables Stronger Jailbreaks in Vision-Language Models

Yunhan Zhao,Xiang Zheng,Xingjun Ma

Main category: cs.CV

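TL;DR: The paper shows that incorporating weak defenses into the attack pipeline can significantly improve the effectiveness and efficiency of jailbreak attacks on Vision-Language Models (VLMs). It proposes Defense2Attack, which strengthens jailbreaks through a visual optimizer, a textual optimizer, and reinforcement fine-tuning.

Details Motivation: Despite their strong capabilities, VLMs are vulnerable to jailbreak attacks, which limits their safety. Existing attacks leave room for improvement in effectiveness and efficiency, in particular in how existing defense mechanisms can be turned around to strengthen an attack.

Contribution: Proposes Defense2Attack, a new method that uses weak defenses to guide jailbreak prompt design and improves attack efficiency and success rate. It comprises a visual optimizer, a textual optimizer, and a suffix generator trained with reinforcement fine-tuning.

Method: Defense2Attack has three core components: 1. a visual optimizer that embeds universal adversarial perturbations with affirmative semantics; 2. a textual optimizer that refines the input using a defense-styled prompt; 3. a suffix generator, trained with reinforcement fine-tuning, that strengthens the jailbreak.

Result: Experiments on four VLMs and four safety benchmarks show that Defense2Attack achieves better jailbreak performance than existing methods in a single attempt.

Insight: Turning defense mechanisms into part of the attack pipeline offers a new way to strengthen jailbreaks and exposes potential weaknesses in model safety design.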

Abstract: Despite their superb capabilities, Vision-Language Models (VLMs) have been shown to be vulnerable to jailbreak attacks. While recent jailbreaks have achieved notable progress, their effectiveness and efficiency can still be improved. In this work, we reveal an interesting phenomenon: incorporating weak defense into the attack pipeline can significantly enhance both the effectiveness and the efficiency of jailbreaks on VLMs. Building on this insight, we propose Defense2Attack, a novel jailbreak method that bypasses the safety guardrails of VLMs by leveraging defensive patterns to guide jailbreak prompt design. Specifically, Defense2Attack consists of three key components: (1) a visual optimizer that embeds universal adversarial perturbations with affirmative and encouraging semantics; (2) a textual optimizer that refines the input using a defense-styled prompt; and (3) a red-team suffix generator that enhances the jailbreak through reinforcement fine-tuning. We empirically evaluate our method on four VLMs and four safety benchmarks. The results demonstrate that Defense2Attack achieves superior jailbreak performance in a single attempt, outperforming state-of-the-art attack methods that often require multiple tries. Our work offers a new perspective on jailbreaking VLMs.

[58] Effective Gaussian Management for High-fidelity Object Reconstruction

Jiateng Liu,Hao Gao,Jiu-Cheng Xie,Chi-Man Pun,Jian Xiong,Haolun Li,Feng Xu

Main category: cs.CV

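TL;DR: This paper proposes an effective Gaussian management approach for high-fidelity object reconstruction. It dynamically activates spherical harmonics (SHs) or normals under the supervision of a surface reconstruction module, which mitigates the gradient conflicts caused by dual supervision. It also proposes a lightweight Gaussian representation that balances representational capacity against parameter count by adaptively adjusting SH orders and applying task-decoupled pruning.

Details Motivation: Existing Gaussian Splatting (GS) methods assign attributes indiscriminately, so dual supervision causes gradient conflicts that hurt reconstruction quality and efficiency. This paper addresses these problems through dynamic management and a lightweight representation.

Contribution: 1. A dynamic densification strategy that activates SHs or normals under surface-reconstruction supervision, mitigating gradient conflicts. 2. A lightweight Gaussian representation that adaptively adjusts SH orders and performs task-decoupled pruning to balance parameters and performance. 3. A model-agnostic approach that can be integrated into other frameworks.

Method: 1. Dynamic densification: SHs or normals are selected dynamically under the supervision of the surface reconstruction module. 2. Lightweight representation: SH orders are adjusted adaptively based on gradient magnitudes, and task-decoupled pruning removes redundant Gaussians.

Result: Experiments show that the approach outperforms existing methods in both reconstruction quality and efficiency, improving performance with significantly fewer parameters.

Insight: Dynamic management and a lightweight representation are effective ways to resolve gradient conflicts and parameter redundancy in Gaussian-based reconstruction, and the approach applies across frameworks.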

Abstract: This paper proposes an effective Gaussian management approach for high-fidelity object reconstruction. Departing from recent Gaussian Splatting (GS) methods that employ indiscriminate attribute assignment, our approach introduces a novel densification strategy that dynamically activates spherical harmonics (SHs) or normals under the supervision of a surface reconstruction module, which effectively mitigates the gradient conflicts caused by dual supervision and achieves superior reconstruction results. To further improve representation efficiency, we develop a lightweight Gaussian representation that adaptively adjusts the SH orders of each Gaussian based on gradient magnitudes and performs task-decoupled pruning to remove Gaussians with minimal impact on a reconstruction task without sacrificing others, which balances the representational capacity with parameter quantity. Notably, our management approach is model-agnostic and can be seamlessly integrated into other frameworks, enhancing performance while reducing model size. Extensive experiments demonstrate that our approach consistently outperforms state-of-the-art approaches in both reconstruction quality and efficiency, achieving superior performance with significantly fewer parameters.

[59] Modelling and analysis of the 8 filters from the “master key filters hypothesis” for depthwise-separable deep networks in relation to idealized receptive fields based on scale-space theory

Tony Lindeberg,Zahra Babaiee,Peyman M. Kiasari

Main category: cs.CV

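TL;DR: By analysing the 8 “master key filters” of depthwise-separable convolutional networks, this paper proposes idealized receptive field models based on discrete scale-space theory and shows that the learned filters can be approximated by the discrete analogue of the Gaussian kernel.

Details Motivation: The paper asks whether the filters learned in depthwise-separable convolutional networks can be approximated by idealized models (such as the Gaussian kernel and difference operators applied to it), which would simplify network design and improve theoretical understanding.

Contribution: (1) Proposes idealized receptive field models based on discrete scale-space theory; (2) shows that the learned filters closely resemble the Gaussian kernel and its difference operators; (3) demonstrates the predictive performance of the idealized models inside the networks.

Method: (1) Extracts the 8 master key filters by clustering; (2) analyses the spatial distribution of the filters via weighted means and variances; (3) models the filters with the Gaussian kernel and difference operators; (4) fits the models by minimizing the $l_1$- or $l_2$-norm.

Result: The learned filters are well approximated by the discrete analogue of the Gaussian kernel, and the idealized models show good predictive performance when used in the networks.

Insight: The study reveals a close link between the filters in depthwise-separable convolutional networks and theoretically idealized receptive fields such as the Gaussian kernel, giving network design new theoretical support.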

Abstract: This paper presents the results of analysing and modelling a set of 8 “master key filters”, which have been extracted by applying a clustering approach to the receptive fields learned in depthwise-separable deep networks based on the ConvNeXt architecture. For this purpose, we first compute spatial spread measures in terms of weighted mean values and weighted variances of the absolute values of the learned filters, which support the working hypotheses that: (i) the learned filters can be modelled by separable filtering operations over the spatial domain, and that (ii) the spatial offsets of those learned filters that are non-centered are rather close to half a grid unit. Then, we model the clustered “master key filters” in terms of difference operators applied to a spatial smoothing operation in terms of the discrete analogue of the Gaussian kernel, and demonstrate that the resulting idealized models of the receptive fields show good qualitative similarity to the learned filters. This modelling is performed in two different ways: (i) using possibly different values of the scale parameters in the coordinate directions for each filter, and (ii) using the same value of the scale parameter in both coordinate directions. Then, we perform the actual model fitting by either (i) requiring spatial spread measures in terms of spatial variances of the absolute values of the receptive fields to be equal, or (ii) minimizing the discrete $l_1$- or $l_2$-norms between the idealized receptive field models and the learned filters. Complementary experimental results then demonstrate that the idealized models of receptive fields have good predictive properties for replacing the learned filters by idealized filters in depthwise-separable deep networks, thus showing that the learned filters in depthwise-separable deep networks can be well approximated by discrete scale-space filters.
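
The spatial spread measures and the discrete Gaussian kernel mentioned above can be illustrated in one dimension. The sketch below is not the authors' code. It uses the standard definition of the discrete analogue of the Gaussian kernel, $T(n; s) = e^{-s} I_n(s)$ with $I_n$ the modified Bessel function of integer order, whose variance equals $s$. The function names, the series truncation, and the choice of a plain first-order difference are assumptions. The first-order difference is included only because it is a simple operator whose output is offset by half a grid unit; the abstract does not state which difference operators the authors use.

```python
import math


def modified_bessel_i(n, s, terms=40):
    """Modified Bessel function I_n(s) for integer n via its power series."""
    n = abs(n)
    half = s / 2.0
    return sum(
        half ** (2 * k + n) / (math.factorial(k) * math.factorial(k + n))
        for k in range(terms)
    )


def discrete_gaussian(s, radius):
    """Samples T(n; s) = exp(-s) * I_n(s) for n = -radius..radius."""
    return [math.exp(-s) * modified_bessel_i(n, s) for n in range(-radius, radius + 1)]


def spread_measures(filter_values):
    """Weighted mean and variance of |filter|, using list indices as positions."""
    weights = [abs(v) for v in filter_values]
    total = sum(weights)
    mean = sum(i * w for i, w in enumerate(weights)) / total
    variance = sum((i - mean) ** 2 * w for i, w in enumerate(weights)) / total
    return mean, variance


def first_difference(values):
    """Non-centered first-order difference: d[i] = values[i + 1] - values[i]."""
    return [b - a for a, b in zip(values, values[1:])]


kernel = discrete_gaussian(s=1.0, radius=10)
mean, variance = spread_measures(kernel)
print(f"kernel: mean index {mean:.3f}, variance {variance:.3f}")

derivative_like = first_difference(kernel)
mean_d, _ = spread_measures(derivative_like)
print(f"first difference: mean index {mean_d:.3f}")
```

With `radius=10` the kernel is centered at index 10, and its weighted variance recovers the scale parameter. The first-difference filter is centered between two grid points, which is one way a half-grid-unit offset of the kind described in hypothesis (ii) can arise.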

[60] What Makes a Good Generated Image? Investigating Human and Multimodal LLM Image Preference Alignment

Rishab Parthasarathy,Jasmine Collins,Cory Stephenson

Main category: cs.CV

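TL;DR: The paper studies how human and multimodal LLM preferences differ in image quality assessment, focusing on aesthetics, lack of artifacts, anatomical accuracy, compositional correctness, object adherence, and style. Using a dataset built from synthetically generated image pairs, it finds that the relationships between these attributes are much weaker for LLMs than for humans.

Details Motivation: Automated evaluation of generative text-to-image models remains challenging. Multimodal LLMs are being used to judge image quality, but how their judgments on individual image attributes differ from human judgments is unclear. The paper investigates these preferences and differences.

Contribution: 1. Curated a dataset of human image-quality preferences; 2. Analysed how inter-attribute correlations differ between humans and LLMs; 3. Showed that LLMs are weaker at judging some attributes, such as anatomical accuracy.

Method: 1. Build a human preference dataset from synthetically generated image pairs; 2. Compute correlations between pairs of image quality attributes for both humans and LLMs; 3. Evaluate LLMs on individual attributes using controlled synthetic datasets.

Result: Humans easily judge each image quality attribute (for example aesthetics or composition), whereas multimodal LLMs find some attributes, such as anatomical accuracy, much harder to judge. Inter-attribute relationships are also much weaker for LLMs than for humans.

Insight: Multimodal LLMs differ markedly from humans in image quality assessment, especially for complex visual attributes such as anatomical accuracy, which points to directions for improving LLM-based image evaluation.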

Abstract: Automated evaluation of generative text-to-image models remains a challenging problem. Recent works have proposed using multimodal LLMs to judge the quality of images, but these works offer little insight into how multimodal LLMs make use of concepts relevant to humans, such as image style or composition, to generate their overall assessment. In this work, we study what attributes of an image–specifically aesthetics, lack of artifacts, anatomical accuracy, compositional correctness, object adherence, and style–are important for both LLMs and humans to make judgments on image quality. We first curate a dataset of human preferences using synthetically generated image pairs. We use inter-task correlation between each pair of image quality attributes to understand which attributes are related in making human judgments. Repeating the same analysis with LLMs, we find that the relationships between image quality attributes are much weaker. Finally, we study individual image quality attributes by generating synthetic datasets with a high degree of control for each axis. Humans are able to easily judge the quality of an image with respect to all of the specific image quality attributes (e.g. high vs. low aesthetic image), however we find that some attributes, such as anatomical accuracy, are much more difficult for multimodal LLMs to learn to judge. Taken together, these findings reveal interesting differences between how humans and multimodal LLMs perceive images.
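
The inter-attribute analysis described above can be sketched as a pairwise correlation over preference labels. This is an illustration under stated assumptions, not the authors' pipeline: the abstract does not name the correlation statistic, so Pearson correlation is an assumption, and the attribute names, the 0/1 encoding (1 means the first image of a pair was preferred on that attribute), and the votes themselves are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equally long numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def attribute_correlations(preferences):
    """Correlation for every unordered pair of attributes.

    `preferences` maps an attribute name to one 0/1 vote per image pair.
    """
    names = sorted(preferences)
    return {
        (a, b): pearson(preferences[a], preferences[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }


# Hypothetical votes over six image pairs, for illustration only.
human_votes = {
    "aesthetics": [1, 0, 1, 1, 0, 1],
    "style": [1, 0, 1, 0, 0, 1],
    "anatomy": [0, 1, 1, 0, 1, 0],
}
for pair, r in attribute_correlations(human_votes).items():
    print(pair, round(r, 3))
```

Running the same function on votes from humans and on votes from a multimodal LLM would give two correlation tables that can be compared attribute pair by attribute pair.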

[61] Recurrent Cross-View Object Geo-Localization

Xiaohan Zhang,Si-Yuan Cao,Xiaokai Bai,Yiming Li,Zhangkai Shen,Zhe Wu,Xiaoxi Hu,Hui-liang Shen

Main category: cs.CV

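TL;DR: The paper proposes ReCOT, a recurrent cross-view object geo-localization Transformer that iteratively refines the predicted location. Combined with SAM-based knowledge distillation and a hierarchical attention mechanism, it improves performance while reducing the parameter count.

Details Motivation: Existing methods treat cross-view object geo-localization as a one-shot detection task, which is vulnerable to feature noise and lacks any error-correction mechanism. A method that refines its prediction step by step is therefore needed.

Contribution: 1. Proposes the ReCOT framework, which reformulates geo-localization as a recurrent process;
2. Introduces learnable tokens and SAM-based knowledge distillation;
3. Designs a Reference Feature Enhancement Module (RFEM).

Method: 1. A recurrent Transformer refines the predicted location step by step;
2. SAM-based distillation provides semantic priors;
3. RFEM uses hierarchical attention to emphasize object-relevant regions in the reference features.

Result: Achieves SOTA performance on standard CVOGL benchmarks while reducing parameters by 60%.

Insight: Recurrent refinement and hierarchical attention improve localization accuracy, while knowledge distillation provides semantic guidance without additional inference cost.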

Abstract: Cross-view object geo-localization (CVOGL) aims to determine the location of a specific object in high-resolution satellite imagery given a query image with a point prompt. Existing approaches treat CVOGL as a one-shot detection task, directly regressing object locations from cross-view information aggregation, but they are vulnerable to feature noise and lack mechanisms for error correction. In this paper, we propose ReCOT, a Recurrent Cross-view Object geo-localization Transformer, which reformulates CVOGL as a recurrent localization task. ReCOT introduces a set of learnable tokens that encode task-specific intent from the query image and prompt embeddings, and iteratively attend to the reference features to refine the predicted location. To enhance this recurrent process, we incorporate two complementary modules: (1) a SAM-based knowledge distillation strategy that transfers segmentation priors from the Segment Anything Model (SAM) to provide clearer semantic guidance without additional inference cost, and (2) a Reference Feature Enhancement Module (RFEM) that introduces a hierarchical attention to emphasize object-relevant regions in the reference features. Extensive experiments on standard CVOGL benchmarks demonstrate that ReCOT achieves state-of-the-art (SOTA) performance while reducing parameters by 60% compared to previous SOTA approaches.

[62] A-TDOM: Active TDOM via On-the-Fly 3DGS

Yiwei Xu,Xiang Wang,Yifei Yu,Wentian Gan,Luca Morelli,Giulio Perda,Xiongwu Xiao,Zongqian Zhan,Xin Wang,Fabio Remondino

Main category: cs.CV

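TL;DR: A-TDOM is a near real-time TDOM generation method based on On-the-Fly 3DGS optimization. By optimizing the 3DGS field as each new image arrives, it addresses the latency and quality problems of traditional methods.

Details Motivation: Traditional TDOM generation relies on a complex offline pipeline with high latency, and its quality degrades under inaccurate camera poses or occlusions, so it cannot meet real-time requirements.

Contribution: Proposes A-TDOM, which combines On-the-Fly 3DGS optimization with orthogonal splatting to generate TDOM in near real-time.

Method: On-the-Fly SfM computes the pose and sparse point cloud of each new image; new Gaussians are then integrated and optimized in previously unseen or coarsely reconstructed regions; orthogonal splatting renders the TDOM.

Result: Experiments show that A-TDOM optimizes the 3DGS field for each new image in seconds while maintaining acceptable rendering quality and geometric accuracy, supporting near real-time use.

Insight: On-the-fly 3DGS optimization suggests a new route to real-time geospatial product generation and may extend to other dynamic scene reconstruction tasks.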

Abstract: True Digital Orthophoto Map (TDOM) serves as a crucial geospatial product in various fields such as urban management, city planning, land surveying, etc. However, traditional TDOM generation methods generally rely on a complex offline photogrammetric pipeline, resulting in delays that hinder real-time applications. Moreover, the quality of TDOM may degrade due to various challenges, such as inaccurate camera poses or Digital Surface Model (DSM) and scene occlusions. To address these challenges, this work introduces A-TDOM, a near real-time TDOM generation method based on On-the-Fly 3DGS optimization. As each image is acquired, its pose and sparse point cloud are computed via On-the-Fly SfM. Then new Gaussians are integrated and optimized into previously unseen or coarsely reconstructed regions. By integrating with orthogonal splatting, A-TDOM can render just after each update of a new 3DGS field. Initial experiments on multiple benchmarks show that the proposed A-TDOM is capable of actively rendering TDOM in near real-time, with 3DGS optimization for each new image in seconds while maintaining acceptable rendering quality and TDOM geometric accuracy.

[63] DyGLNet: Hybrid Global-Local Feature Fusion with Dynamic Upsampling for Medical Image Segmentation

Yican Zhao,Ce Wang,You Hao,Lei Li,Tianli Liao

Main category: cs.CV

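TL;DR: DyGLNet fuses global and local features with a dynamic upsampling mechanism for efficient and accurate medical image segmentation. Its SHDCBlock and DyFusionUp modules provide multi-scale feature modeling and high-fidelity reconstruction while reducing computational overhead.

Details Motivation: Medical image segmentation faces multi-scale lesion variability, ill-defined boundaries, and high computational demands. DyGLNet aims to improve segmentation accuracy and efficiency by fusing global and local features and adapting the upsampling dynamically.

Contribution: 1) Designs SHDCBlock, a hybrid feature extraction module that combines single-head self-attention with multi-scale dilated convolutions; 2) proposes DyFusionUp, a dynamic adaptive upsampling module; 3) adopts a lightweight design that lowers computational complexity.

Method: DyGLNet extracts global and local features with SHDCBlock and uses DyFusionUp to adapt the upsampling for high-fidelity reconstruction. The lightweight design reduces the computational burden.

Result: Outperforms existing methods on seven public datasets, particularly in boundary accuracy and small-object segmentation, with lower computational complexity.

Insight: Combining self-attention with multi-scale convolutions captures the complex features of medical images, dynamic upsampling improves feature reconstruction, and the lightweight design suits clinical use.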

Abstract: Medical image segmentation grapples with challenges including multi-scale lesion variability, ill-defined tissue boundaries, and computationally intensive processing demands. This paper proposes the DyGLNet, which achieves efficient and accurate segmentation by fusing global and local features with a dynamic upsampling mechanism. The model innovatively designs a hybrid feature extraction module (SHDCBlock), combining single-head self-attention and multi-scale dilated convolutions to model local details and global context collaboratively. We further introduce a dynamic adaptive upsampling module (DyFusionUp) to realize high-fidelity reconstruction of feature maps based on learnable offsets. Then, a lightweight design is adopted to reduce computational overhead. Experiments on seven public datasets demonstrate that DyGLNet outperforms existing methods, particularly excelling in boundary accuracy and small-object segmentation. Meanwhile, it exhibits lower computation complexity, enabling an efficient and reliable solution for clinical medical image analysis. The code will be made available soon.

[64] BATR-FST: Bi-Level Adaptive Token Refinement for Few-Shot Transformers

Mohammed Al-Habib,Zuping Zhang,Abdulrahman Noman

Main category: cs.CV

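TL;DR: BATR-FST is a two-stage Bi-Level Adaptive Token Refinement approach that improves Vision Transformers in few-shot learning by refining token representations and inductive bias across a pre-training phase and a meta-fine-tuning phase.

Details Motivation: In few-shot learning, Vision Transformers struggle to refine token-level interactions, have little training data, and lack a strong inductive bias. Existing methods depend on inflexible token matching or basic similarity measures, which limits the integration of global context and local features.

Contribution: Proposes BATR-FST, which comprises a Bi-Level Adaptive Token Refinement module (Token Clustering, Uncertainty-Aware Token Weighting, and a Bi-Level Attention mechanism), Graph Token Propagation, and a Class Separation Penalty, improving few-shot classification performance.

Method: 1) The pre-training phase uses Masked Image Modeling (MIM) to learn transferable patch-level representations; 2) the meta-fine-tuning phase refines token representations with Token Clustering, Uncertainty-Aware Token Weighting, Bi-Level Attention, and Graph Token Propagation.

Result: Experiments on three benchmark few-shot datasets show that BATR-FST achieves superior results in both 1-shot and 5-shot scenarios.

Insight: Bi-level refinement and a balance between global and local features are key to effective few-shot learning; Graph Token Propagation and the Class Separation Penalty strengthen the model's discriminative ability.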

Abstract: Vision Transformers (ViTs) have shown significant promise in computer vision applications. However, their performance in few-shot learning is limited by challenges in refining token-level interactions, struggling with limited training data, and developing a strong inductive bias. Existing methods often depend on inflexible token matching or basic similarity measures, which limit the effective incorporation of global context and localized feature refinement. To address these challenges, we propose Bi-Level Adaptive Token Refinement for Few-Shot Transformers (BATR-FST), a two-stage approach that progressively improves token representations and maintains a robust inductive bias for few-shot classification. During the pre-training phase, Masked Image Modeling (MIM) provides Vision Transformers (ViTs) with transferable patch-level representations by recreating masked image regions, providing a robust basis for subsequent adaptation. In the meta-fine-tuning phase, BATR-FST incorporates a Bi-Level Adaptive Token Refinement module that utilizes Token Clustering to capture localized interactions, Uncertainty-Aware Token Weighting to prioritize dependable features, and a Bi-Level Attention mechanism to balance intra-cluster and inter-cluster relationships, thereby facilitating thorough token refinement. Furthermore, Graph Token Propagation ensures semantic consistency between support and query instances, while a Class Separation Penalty preserves different class borders, enhancing discriminative capability. Extensive experiments on three benchmark few-shot datasets demonstrate that BATR-FST achieves superior results in both 1-shot and 5-shot scenarios and improves the few-shot classification via transformers.

[65] CECT-Mamba: a Hierarchical Contrast-enhanced-aware Model for Pancreatic Tumor Subtyping from Multi-phase CECT

Zhifang Gong,Shuo Gao,Ben Zhao,Yingjing Xu,Yijun Yang,Shenghong Ju,Guangquan Zhou

Main category: cs.CV

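TL;DR: This paper proposes CECT-Mamba, a Mamba-based hierarchical contrast-enhanced-aware model that automatically classifies pancreatic tumor subtypes from multi-phase CECT data. Spatial and temporal sampling sequences and a similarity-guided refinement module improve subtyping accuracy.

Details Motivation: The high heterogeneity and variability of pancreatic tumors make precise subtyping difficult. Existing methods fail to exploit the contextual information across multiple CECT phases, which limits their performance.

Contribution: 1. Proposes, for the first time, an automatic classification method that combines multi-phase CECT data. 2. Designs a dual hierarchical contrast-enhanced-aware Mamba module that explores the spatial and temporal contrast variations of lesions. 3. Introduces a similarity-guided refinement module and a space complementary integrator to improve learning on local tumor regions.

Method: A dual hierarchical contrast-enhanced-aware Mamba module with spatial and temporal sampling sequences is combined with a similarity-guided refinement module and a multi-granularity fusion module to encode and aggregate semantics across scales.

Result: On 270 clinical cases, the model distinguishes pancreatic ductal adenocarcinoma (PDAC) from pancreatic neuroendocrine tumors (PNETs) with 97.4% accuracy and an AUC of 98.6%.

Insight: Mamba's learnability and simplicity, combined with multi-phase CECT data, can substantially improve pancreatic tumor subtyping.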

Abstract: Contrast-enhanced computed tomography (CECT) is the primary imaging technique that provides valuable spatial-temporal information about lesions, enabling the accurate diagnosis and subclassification of pancreatic tumors. However, the high heterogeneity and variability of pancreatic tumors still pose substantial challenges for precise subtyping diagnosis. Previous methods fail to effectively explore the contextual information across multiple CECT phases commonly used in radiologists’ diagnostic workflows, thereby limiting their performance. In this paper, we introduce, for the first time, an automatic way to combine the multi-phase CECT data to discriminate between pancreatic tumor subtypes, among which the key is using Mamba with promising learnability and simplicity to encourage both temporal and spatial modeling from multi-phase CECT. Specifically, we propose a dual hierarchical contrast-enhanced-aware Mamba module incorporating two novel spatial and temporal sampling sequences to explore intra and inter-phase contrast variations of lesions. A similarity-guided refinement module is also imposed into the temporal scanning modeling to emphasize the learning on local tumor regions with more obvious temporal variations. Moreover, we design the space complementary integrator and multi-granularity fusion module to encode and aggregate the semantics across different scales, achieving more efficient learning for subtyping pancreatic tumors. The experimental results on an in-house dataset of 270 clinical cases achieve an accuracy of 97.4% and an AUC of 98.6% in distinguishing between pancreatic ductal adenocarcinoma (PDAC) and pancreatic neuroendocrine tumors (PNETs), demonstrating its potential as a more accurate and efficient tool.

[66] Modeling the Multivariate Relationship with Contextualized Representations for Effective Human-Object Interaction Detection

Zhehao Li,Yucheng Qian,Chong Wang,Yinghao Lu,Zhihao Yang,Jiafei Wu

Main category: cs.CV

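TL;DR: The paper proposes a Contextualized Representation Learning Network that combines affordance-guided reasoning with contextual prompts to improve multivariate relationship modeling in human-object interaction (HOI) detection, and it is especially strong on interactions that involve tools.

Details Motivation: Existing two-stage HOI detection methods model context incompletely and cannot fully capture complex interactions such as tool-dependent ones.

Contribution: 1. Proposes a Contextualized Representation Learning Network that combines affordance-guided reasoning with contextual prompts; 2. Extends the HOI detection framework to multivariate relationships such as <human, tool, object> triplets.

Method: 1. Explicitly models the functional role (affordance) of auxiliary objects through triplet structures; 2. Aligns learnable prompts with visual features through an attention mechanism.

Result: Achieves superior performance on the HICO-Det and V-COCO datasets in most scenarios.

Insight: Modeling functional roles and aligning language with visual content enables more reliable reasoning over complex interactions.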

Abstract: Human-Object Interaction (HOI) detection aims to simultaneously localize human-object pairs and recognize their interactions. While recent two-stage approaches have made significant progress, they still face challenges due to incomplete context modeling. In this work, we introduce a Contextualized Representation Learning Network that integrates both affordance-guided reasoning and contextual prompts with visual cues to better capture complex interactions. We enhance the conventional HOI detection framework by expanding it beyond simple human-object pairs to include multivariate relationships involving auxiliary entities like tools. Specifically, we explicitly model the functional role (affordance) of these auxiliary objects through triplet structures <human, tool, object>. This enables our model to identify tool-dependent interactions such as ‘filling’. Furthermore, the learnable prompt is enriched with instance categories and subsequently integrated with contextual visual features using an attention mechanism. This process aligns language with image content at both global and regional levels. These contextualized representations equip the model with enriched relational cues for more reliable reasoning over complex, context-dependent interactions. Our proposed method demonstrates superior performance on both the HICO-Det and V-COCO datasets in most scenarios. Codes will be released upon acceptance.

[67] Double Helix Diffusion for Cross-Domain Anomaly Image Generation

Linchun Wu,Qin Zou,Xianbiao Qi,Bo Du,Zhongyuan Wang,Qingquan Li

Main category: cs.CV

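TL;DR: The paper proposes Double Helix Diffusion (DH-Diff), a cross-domain framework for anomaly image generation that addresses the structural inconsistency and feature entanglement of existing methods and significantly improves the authenticity and diversity of the generated images.

Details Motivation: Visual anomaly inspection in manufacturing lacks real anomaly samples. Existing synthesis methods produce anomalies that are structurally inconsistent with the background and entangle image and mask features, which limits how well detectors can be trained.

Contribution: A novel cross-domain generative framework, DH-Diff, that simultaneously synthesizes high-fidelity anomaly images and their pixel-level annotation masks while explicitly addressing feature entanglement and structural inconsistency.

Method: A double-helix-inspired architecture cycles through modules for feature separation, connection, and merging. A domain-decoupled attention mechanism and a semantic score map alignment module improve generation quality.

Result: Experiments show that DH-Diff significantly outperforms existing methods in diversity and authenticity and improves downstream anomaly detection performance.

Insight: Domain-decoupled attention and semantic alignment let a generative model mitigate feature entanglement while keeping the image structure authentic.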

Abstract: Visual anomaly inspection is critical in manufacturing, yet hampered by the scarcity of real anomaly samples for training robust detectors. Synthetic data generation presents a viable strategy for data augmentation; however, current methods remain constrained by two principal limitations: 1) the generation of anomalies that are structurally inconsistent with the normal background, and 2) the presence of undesirable feature entanglement between synthesized images and their corresponding annotation masks, which undermines the perceptual realism of the output. This paper introduces Double Helix Diffusion (DH-Diff), a novel cross-domain generative framework designed to simultaneously synthesize high-fidelity anomaly images and their pixel-level annotation masks, explicitly addressing these challenges. DH-Diff employs a unique architecture inspired by a double helix, cycling through distinct modules for feature separation, connection, and merging. Specifically, a domain-decoupled attention mechanism mitigates feature entanglement by enhancing image and annotation features independently, and meanwhile a semantic score map alignment module ensures structural authenticity by coherently integrating anomaly foregrounds. DH-Diff offers flexible control via text prompts and optional graphical guidance. Extensive experiments demonstrate that DH-Diff significantly outperforms state-of-the-art methods in diversity and authenticity, leading to significant improvements in downstream anomaly detection performance.

[68] Superpixel Anything: A general object-based framework for accurate yet regular superpixel segmentation

Julien Walther,Rémi Giraud,Michaël Clément

Main category: cs.CV

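TL;DR: The paper proposes SPAM (SuperPixel Anything Model), a general superpixel segmentation framework that achieves high accuracy while preserving superpixel regularity.

Details Motivation: Traditional superpixel methods rely on low-level features, whereas deep learning methods exploit high-level features but sacrifice superpixel regularity. SPAM aims to balance the two.

Contribution: The SPAM framework, which produces accurate yet regular superpixels, can be combined with any prior high-level segmentation, and resolves uncertainty regions.

Method: A model is trained to extract image features for superpixel generation. At inference, a large-scale pretrained model provides semantic-agnostic segmentation so that superpixels align with object masks.

Result: Experiments show that SPAM outperforms existing methods on segmentation tasks both qualitatively and quantitatively.

Insight: Combining high-level features with regularity constraints can markedly improve the accuracy and practical usefulness of superpixel segmentation.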

Abstract: Superpixels are widely used in computer vision to simplify image representation and reduce computational complexity. While traditional methods rely on low-level features, deep learning-based approaches leverage high-level features but also tend to sacrifice regularity of superpixels to capture complex objects, leading to accurate but less interpretable segmentations. In this work, we introduce SPAM (SuperPixel Anything Model), a versatile framework for segmenting images into accurate yet regular superpixels. We train a model to extract image features for superpixel generation, and at inference, we leverage a large-scale pretrained model for semantic-agnostic segmentation to ensure that superpixels align with object masks. SPAM can handle any prior high-level segmentation, resolving uncertainty regions, and is able to interactively focus on specific objects. Comprehensive experiments demonstrate that SPAM qualitatively and quantitatively outperforms state-of-the-art methods on segmentation tasks, making it a valuable and robust tool for various applications. Code and pre-trained models are available here: https://github.com/waldo-j/spam.

[69] SAGA: Selective Adaptive Gating for Efficient and Expressive Linear Attention

Yuan Cao,Dong Wang

Main category: cs.CV

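TL;DR: The paper proposes SAGA, which improves linear attention with selective adaptive gating. It addresses the feature redundancy and loss of query alignment in conventional linear attention and improves both computational efficiency and model performance.

Details Motivation: The quadratic complexity of softmax attention in Transformers becomes a bottleneck on high-resolution images, while existing linear attention methods compress KV information uniformly, which causes feature redundancy and loss of directional alignment with the query.

Contribution: The SAGA method, which uses input-adaptive gates to selectively modulate KV information and enhance semantic diversity, together with an efficient decomposition for computing the gates.

Method: Learnable gates selectively modulate how information is aggregated into the KV feature map, alleviating the low-rank constraint. A Hadamard-product decomposition computes the gates efficiently without additional memory overhead.

Result: At 1280×1280 resolution, SAGA achieves a 1.76× improvement in throughput and a 2.69× reduction in peak GPU memory compared to PVT-T, and improves ImageNet top-1 accuracy by up to 4.4%.

Insight: Selective gating improves how linear attention aggregates information, yielding both efficiency and expressiveness beyond conventional uniform compression.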

Abstract: While the Transformer architecture excels at modeling long-range dependencies, contributing to its widespread adoption in vision tasks, the quadratic complexity of softmax-based attention mechanisms imposes a major bottleneck, particularly when processing high-resolution images. Linear attention presents a promising alternative by reformulating the attention computation from $(QK)V$ to $Q(KV)$, thereby reducing the complexity from $\mathcal{O}(N^2)$ to $\mathcal{O}(N)$ while preserving the global receptive field. However, most existing methods compress historical key-value (KV) information uniformly, which can lead to feature redundancy and the loss of directional alignment with the query (Q). This uniform compression results in low-rank $KV$ feature maps, contributing to a performance gap compared to softmax attention. To mitigate this limitation, we propose Selective Adaptive GAting for Efficient and Expressive Linear Attention (SAGA), which introduces input-adaptive learnable gates to selectively modulate information aggregation into the $KV$ feature map. These gates enhance semantic diversity and alleviate the low-rank constraint inherent in conventional linear attention. Additionally, we propose an efficient Hadamard-product decomposition method for gate computation, which introduces no additional memory overhead. Experiments demonstrate that SAGA achieves a 1.76$\times$ improvement in throughput and a 2.69$\times$ reduction in peak GPU memory compared to PVT-T at a resolution of $1280 \times 1280$. Moreover, it improves top-1 accuracy by up to 4.4% on the ImageNet dataset, demonstrating both computational efficiency and model effectiveness.
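The reordering from $(QK)V$ to $Q(KV)$ mentioned in the abstract can be illustrated with a minimal NumPy sketch. It shows only the generic kernelized linear attention that SAGA builds on, not SAGA's gating. The feature map `phi` (ELU + 1), the function names, and the toy sizes are assumptions chosen for illustration.

```python
import numpy as np

def phi(x):
    # Assumed positive feature map (ELU + 1); the abstract does not specify one.
    return np.where(x > 0, x + 1.0, np.exp(x))

def quadratic_attention(Q, K, V):
    # Builds the full N x N similarity matrix: O(N^2 d) time, O(N^2) memory.
    A = phi(Q) @ phi(K).T
    return (A @ V) / A.sum(axis=1, keepdims=True)

def linear_attention(Q, K, V):
    # Aggregates keys and values first into a d x d map: O(N d^2) time.
    kv = phi(K).T @ V            # (d, d) KV feature map
    z = phi(K).sum(axis=0)       # (d,) normalizer
    return (phi(Q) @ kv) / (phi(Q) @ z)[:, None]

rng = np.random.default_rng(0)
N, d = 64, 8                     # toy sequence length and head dimension
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))

out_quadratic = quadratic_attention(Q, K, V)
out_linear = linear_attention(Q, K, V)
same_result = bool(np.allclose(out_quadratic, out_linear))
```

Both functions compute the same output because matrix multiplication is associative. The `kv` matrix has rank at most `d`, which is the low-rank constraint that the abstract says the gates are meant to alleviate.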

[70] Data Scaling Laws for Radiology Foundation Models

Maximilian Ilse,Harshita Sharma,Anton Schwaighofer,Sam Bond-Taylor,Fernando Pérez-García,Olesya Melnichenko,Anne-Marie G. Sykes,Kelly K. Horst,Ashish Khandelwal,Maxwell Reynolds,Maria T. Wetscherek,Noel C. F. Codella,Javier Alvarez-Valle,Korfiatis Panagiotis,Valentina Salvatelli

Main category: cs.CV

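TL;DR: The paper studies how radiology foundation models behave as data scales, comparing two vision encoders (MI2 and RAD-DINO) on chest X-ray data. MI2 scales better on finding-related tasks, RAD-DINO is stronger on lines-and-tubes tasks, and the results underline the value of structured supervision and center-specific continual pretraining.

Details Motivation: Medical imaging foundation models are usually trained on comparatively small datasets, which limits understanding of how data scale and pretraining paradigm affect their performance. The paper investigates this relationship.

Contribution: 1) A systematic study of two vision encoders (MI2 and RAD-DINO) on medical imaging data; 2) Evidence that structured supervision and center-specific continual pretraining improve performance; 3) Evidence that, for some tasks, a small amount of in-domain data is enough to surpass open-weights foundation models.

Method: Continually pretrain the two vision encoders (MI2 and RAD-DINO) on up to 3.5M chest X-rays from a single institution, holding compute and evaluation protocols constant, and evaluate on classification, segmentation, and report generation.

Result: MI2 scales more effectively on finding-related tasks, while RAD-DINO is stronger on tube-related tasks. Continually pretraining MI2 with both reports and structured labels using UniCL improves performance further. For some tasks, as few as 30k in-domain samples suffice to surpass open-weights models.

Insight: Medical institutions can obtain significant gains from center-specific continual pretraining and structured supervision on their own in-domain data.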

Abstract: Foundation vision encoders such as CLIP and DINOv2, trained on web-scale data, exhibit strong transfer performance across tasks and datasets. However, medical imaging foundation models remain constrained by smaller datasets, limiting our understanding of how data scale and pretraining paradigms affect performance in this setting. In this work, we systematically study continual pretraining of two vision encoders, MedImageInsight (MI2) and RAD-DINO representing the two major encoder paradigms CLIP and DINOv2, on up to 3.5M chest x-rays from a single institution, holding compute and evaluation protocols constant. We evaluate on classification (radiology findings, lines and tubes), segmentation (lines and tubes), and radiology report generation. While prior work has primarily focused on tasks related to radiology findings, we include lines and tubes tasks to counterbalance this bias and evaluate a model’s ability to extract features that preserve continuity along elongated structures. Our experiments show that MI2 scales more effectively for finding-related tasks, while RAD-DINO is stronger on tube-related tasks. Surprisingly, continually pretraining MI2 with both reports and structured labels using UniCL improves performance, underscoring the value of structured supervision at scale. We further show that for some tasks, as few as 30k in-domain samples are sufficient to surpass open-weights foundation models. These results highlight the utility of center-specific continual pretraining, enabling medical institutions to derive significant performance gains by utilizing in-domain data.

[71] Exploring Metric Fusion for Evaluation of NeRFs

Shreyas Shivakumara,Gabriel Eilertsen,Karljohan Lundin Palmerius

Main category: cs.CV

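TL;DR: The paper studies how to combine two metrics, DISTS and VMAF, to evaluate the quality of NeRF-generated images, using normalization and fusion strategies to improve correlation with subjective scores.

Details Motivation: NeRF outputs exhibit unique artifacts, and no single existing metric performs well across all datasets, so combining metrics based on different perceptual methods may improve evaluation.

Contribution: A metric-fusion approach based on DISTS and VMAF, with an experimental analysis of how well the fused metrics correlate with subjective scores compared with the individual metrics.

Method: Two normalization strategies and two fusion strategies are applied to DISTS and VMAF, and the pipeline is tested on the Synthetic and Outdoor datasets.

Result: The fusion metrics are evaluated on two datasets under three configurations, and the analysis of correlation coefficients is used to demonstrate their robustness and generalizability.

Insight: Metric fusion can compensate for the limitations of individual metrics. Combining metrics grounded in different perceptual methods gives a more complete assessment of NeRF image quality.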

Abstract: Neural Radiance Fields (NeRFs) have demonstrated significant potential in synthesizing novel viewpoints. Evaluating the NeRF-generated outputs, however, remains a challenge due to the unique artifacts they exhibit, and no individual metric performs well across all datasets. We hypothesize that combining two successful metrics, Deep Image Structure and Texture Similarity (DISTS) and Video Multi-Method Assessment Fusion (VMAF), based on different perceptual methods, can overcome the limitations of individual metrics and achieve improved correlation with subjective quality scores. We experiment with two normalization strategies for the individual metrics and two fusion strategies to evaluate their impact on the resulting correlation with the subjective scores. The proposed pipeline is tested on two distinct datasets, Synthetic and Outdoor, and its performance is evaluated across three different configurations. We present a detailed analysis comparing the correlation coefficients of fusion methods and individual scores with subjective scores to demonstrate the robustness and generalizability of the fusion metrics.
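A normalize-then-fuse pipeline of the kind described in the abstract can be sketched as follows. The abstract does not name the two normalization or the two fusion strategies, so min-max and z-score normalization, the arithmetic-mean and weighted fusions, and all scores below are assumptions made up for illustration. DISTS is a distance (lower is better) and VMAF a quality score (higher is better), so DISTS is flipped before fusion.

```python
import numpy as np

def min_max(x):
    # Rescale scores to [0, 1].
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

def z_score(x):
    # Rescale scores to zero mean and unit variance.
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

def fuse_mean(a, b):
    # Assumed fusion strategy 1: arithmetic mean of the two quality scores.
    return 0.5 * (a + b)

def fuse_weighted(a, b, w=0.7):
    # Assumed fusion strategy 2: convex combination with a hypothetical weight w.
    return w * a + (1.0 - w) * b

def pearson(a, b):
    return float(np.corrcoef(a, b)[0, 1])

# Hypothetical per-image scores, not taken from the paper.
dists = np.array([0.10, 0.25, 0.40, 0.55, 0.30])   # lower is better
vmaf = np.array([92.0, 70.0, 55.0, 30.0, 75.0])    # higher is better
mos = np.array([4.6, 3.5, 2.7, 1.8, 3.6])          # subjective scores

# Flip DISTS so that higher means better in both normalizations.
fused_minmax = fuse_mean(1.0 - min_max(dists), min_max(vmaf))
fused_zscore = fuse_weighted(-z_score(dists), z_score(vmaf))

corr_minmax = pearson(fused_minmax, mos)
corr_zscore = pearson(fused_zscore, mos)
```

In practice a rank correlation such as Spearman's would usually be reported alongside Pearson's, since subjective scores are often only ordinal.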

[72] Leveraging Large Language Models to Effectively Generate Visual Data for Canine Musculoskeletal Diagnoses

Martin Thißen,Thi Ngoc Diep Tran,Barbara Esteve Ratsch,Ben Joel Schönbein,Ute Trapp,Beate Egner,Romana Piat,Elke Hergenröther

Main category: cs.CV

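TL;DR: The paper investigates using large language models (LLMs) to generate synthetic visual data for canine musculoskeletal diagnoses, to address the scarcity of real data. Visual annotations are mapped to the textual domain and combined with several prompting techniques. A model trained only on the synthetic data performs well on real data.

Details Motivation: For some tasks, such as diagnosing rare conditions, real data is scarce and costly to collect. The paper explores whether LLM-generated synthetic visual data can supplement training sets and improve AI model performance.

Contribution: 1. A mapping from visual annotations to the textual domain that segments visual documentations of a dog into over 200 labeled regions representing muscles or joints. 2. Generation of synthetic data using guided decoding, chain-of-thought reasoning, and few-shot prompting. 3. Evidence that a model trained solely on synthetic data performs well on real data (F1 score of 88%).

Method: 1. Design the mapping from visual annotations to the textual domain; 2. Generate synthetic data with LLM techniques such as guided decoding, chain-of-thought reasoning, and few-shot prompting; 3. Build a binary classification task to validate the data.

Result: A model trained solely on the synthetic data achieves an F1 score of 88% on 70 real-world documentations. The generated documentations are sensitive to the location and severity of the diagnosis while remaining independent of the dog's sex.

Insight: LLM-generated synthetic data can mitigate data scarcity, especially for rare diseases. Although the methodology is tailored to the medical domain, it can be adapted to other fields.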

Abstract: It is well-established that more data generally improves AI model performance. However, data collection can be challenging for certain tasks due to the rarity of occurrences or high costs. These challenges are evident in our use case, where we apply AI models to a novel approach for visually documenting the musculoskeletal condition of dogs. Here, abnormalities are marked as colored strokes on a body map of a dog. Since these strokes correspond to distinct muscles or joints, they can be mapped to the textual domain in which large language models (LLMs) operate. LLMs have demonstrated impressive capabilities across a wide range of tasks, including medical applications, offering promising potential for generating synthetic training data. In this work, we investigate whether LLMs can effectively generate synthetic visual training data for canine musculoskeletal diagnoses. For this, we developed a mapping that segments visual documentations into over 200 labeled regions representing muscles or joints. Using techniques like guided decoding, chain-of-thought reasoning, and few-shot prompting, we generated 1,000 synthetic visual documentations for patellar luxation (kneecap dislocation) diagnosis, the diagnosis for which we have the most real-world data. Our analysis shows that the generated documentations are sensitive to location and severity of the diagnosis while remaining independent of the dog’s sex. We further generated 1,000 visual documentations for various other diagnoses to create a binary classification dataset. A model trained solely on this synthetic data achieved an F1 score of 88% on 70 real-world documentations. These results demonstrate the potential of LLM-generated synthetic data, which is particularly valuable for addressing data scarcity in rare diseases. While our methodology is tailored to the medical domain, the insights and techniques can be adapted to other fields.

[73] Lego-Edit: A General Image Editing Framework with Model-Level Bricks and MLLM Builder

Qifei Jia,Yu Liu,Yajie Chai,Xintong Yao,Qiming Lu,Yasen Zhang,Runyu Shi,Ying Huang,Guoquan Zhang

Main category: cs.CV

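TL;DR: Lego-Edit is an image editing framework built on a multi-modal large language model (MLLM). Using a model-level toolkit and progressive reinforcement learning, it handles open-domain user instructions and achieves state-of-the-art results on several benchmarks.

Details Motivation: Existing instruction-based image editing methods generalize poorly to the diverse user instructions found outside their training domain, which limits their practical use.

Contribution: 1. A model-level toolkit that supports fine-grained composition of editing actions; 2. A progressive reinforcement learning approach that improves the MLLM's general reasoning over open-domain instructions.

Method: 1. Use an MLLM to organize a suite of model-level editing tools; 2. Train the MLLM with three-stage progressive reinforcement learning, using feedback on unannotated, open-domain instructions.

Result: Lego-Edit achieves state-of-the-art performance on GEdit-Bench and ImgBench and can use newly introduced tools without additional fine-tuning.

Insight: Combining the general capabilities of an MLLM with model-level tools can substantially improve how well an image editing system adapts to open-domain instructions.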

Abstract: Instruction-based image editing has garnered significant attention due to its direct interaction with users. However, real-world user instructions are immensely diverse, and existing methods often fail to generalize effectively to instructions outside their training domain, limiting their practical application. To address this, we propose Lego-Edit, which leverages the generalization capability of Multi-modal Large Language Model (MLLM) to organize a suite of model-level editing tools to tackle this challenge. Lego-Edit incorporates two key designs: (1) a model-level toolkit comprising diverse models efficiently trained on limited data and several image manipulation functions, enabling fine-grained composition of editing actions by the MLLM; and (2) a three-stage progressive reinforcement learning approach that uses feedback on unannotated, open-domain instructions to train the MLLM, equipping it with generalized reasoning capabilities for handling real-world instructions. Experiments demonstrate that Lego-Edit achieves state-of-the-art performance on GEdit-Bench and ImgBench. It exhibits robust reasoning capabilities for open-domain instructions and can utilize newly introduced editing tools without additional fine-tuning. Code is available: https://github.com/xiaomi-research/lego-edit.

[74] Runge-Kutta Approximation and Decoupled Attention for Rectified Flow Inversion and Semantic Editing

Weiming Chen,Zhihan Zhu,Yijia Wang,Zhihai He

Main category: cs.CV

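TL;DR: The paper proposes a high-order inversion method based on a Runge-Kutta solver to improve the inversion accuracy of rectified flow models, and introduces DDTA to decouple multimodal attention for better semantic control, achieving both high fidelity and editability.

Details Motivation: Rectified flow models outperform DDIM-based diffusion models in generation, but in real-world use they suffer from low inversion accuracy and entangled multimodal attention, which limit consistency with the source image and semantic control.

Contribution: A high-order Runge-Kutta inversion method that improves inversion accuracy, and the DDTA mechanism that decouples attention for more precise semantic control.

Method: Apply a Runge-Kutta solver to the inversion process, and introduce Decoupled Diffusion Transformer Attention (DDTA) into multimodal diffusion transformers to disentangle text and image attention.

Result: On image reconstruction and text-guided editing tasks, the method achieves state-of-the-art fidelity and editability.

Insight: Runge-Kutta methods can efficiently improve inversion accuracy, and attention decoupling is a key direction for improving multimodal diffusion models.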

Abstract: Rectified flow (RF) models have recently demonstrated superior generative performance compared to DDIM-based diffusion models. However, in real-world applications, they suffer from two major challenges: (1) low inversion accuracy that hinders the consistency with the source image, and (2) entangled multimodal attention in diffusion transformers, which hinders precise attention control. To address the first challenge, we propose an efficient high-order inversion method for rectified flow models based on the Runge-Kutta solver of differential equations. To tackle the second challenge, we introduce Decoupled Diffusion Transformer Attention (DDTA), a novel mechanism that disentangles text and image attention inside the multimodal diffusion transformers, enabling more precise semantic control. Extensive experiments on image reconstruction and text-guided editing tasks demonstrate that our method achieves state-of-the-art performance in terms of fidelity and editability. Code is available at https://github.com/wmchen/RKSovler_DDTA.
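Why a higher-order solver helps inversion can be seen on a toy ODE. The sketch below is not the paper's method: it compares a first-order Euler solver with the classical fourth-order Runge-Kutta scheme on a hypothetical analytic velocity field, integrating backward in time ("inversion") and then forward again ("reconstruction"). In a real rectified flow model the velocity would come from a trained network; `velocity`, the step count, and the time direction are assumptions for illustration.

```python
import numpy as np

def velocity(x, t):
    # Hypothetical stand-in for a learned velocity field v(x, t).
    return -x + np.sin(t)

def euler_step(f, x, t, h):
    # First-order update.
    return x + h * f(x, t)

def rk4_step(f, x, t, h):
    # Classical fourth-order Runge-Kutta update.
    k1 = f(x, t)
    k2 = f(x + 0.5 * h * k1, t + 0.5 * h)
    k3 = f(x + 0.5 * h * k2, t + 0.5 * h)
    k4 = f(x + h * k3, t + h)
    return x + (h / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4)

def integrate(step, f, x, t0, t1, n):
    # Integrate dx/dt = f(x, t) from t0 to t1 in n steps (h < 0 runs backward).
    h = (t1 - t0) / n
    t = t0
    for _ in range(n):
        x = step(f, x, t, h)
        t += h
    return x

def round_trip_error(step, x_src, n=10):
    # Invert from t=1 to t=0, then regenerate from t=0 to t=1.
    latent = integrate(step, velocity, x_src, 1.0, 0.0, n)
    recon = integrate(step, velocity, latent, 0.0, 1.0, n)
    return float(np.max(np.abs(recon - x_src)))

x_src = np.array([1.0, -0.5, 2.0])   # toy "source image"
euler_err = round_trip_error(euler_step, x_src)
rk4_err = round_trip_error(rk4_step, x_src)
```

With the same number of steps, the RK4 round trip recovers the source far more closely than Euler, at the cost of four velocity evaluations per step instead of one.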

[75] MEJO: MLLM-Engaged Surgical Triplet Recognition via Inter- and Intra-Task Joint Optimization

Yiyi Zhang,Yuchen Yuan,Ying Zheng,Jialun Pei,Jinpeng Li,Zheng Li,Pheng-Ann Heng

Main category: cs.CV

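TL;DR: The paper proposes the MEJO framework, which addresses the long-tailed distribution problem in surgical triplet recognition through joint inter- and intra-task optimization. It uses an MLLM to enrich semantic features and coordinates gradient learning, and performs strongly on the CholecT45 and CholecT50 datasets.

Details Motivation: Surgical triplet recognition (instrument, verb, target, and their combinations) suffers from a long-tailed data distribution. Existing multi-task learning approaches face optimization conflicts both between tasks and within tasks, which calls for a more effective joint optimization strategy.

Contribution: 1. The MEJO framework, which jointly addresses inter- and intra-task optimization conflicts; 2. The S$^2$D learning scheme, which decomposes representations into task-shared and task-specific components; 3. MLLM-enhanced semantic features; 4. The CGL strategy, which coordinates gradient learning to handle class imbalance.

Method: 1. S$^2$D: decompose representations and use an MLLM-powered probabilistic prompt pool to dynamically augment visual features; 2. CGL: dissect and rebalance the positive and negative gradients from head and tail classes.

Result: Experiments on the CholecT45 and CholecT50 datasets demonstrate the superiority of the framework and validate its effectiveness in handling optimization conflicts.

Insight: Jointly resolving inter- and intra-task conflicts, combined with high-level semantic cues from an MLLM, can effectively improve performance on long-tailed datasets.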

Abstract: Surgical triplet recognition, which involves identifying instrument, verb, target, and their combinations, is a complex surgical scene understanding challenge plagued by long-tailed data distribution. The mainstream multi-task learning paradigm benefiting from cross-task collaborative promotion has shown promising performance in identifying triples, but two key challenges remain: 1) inter-task optimization conflicts caused by entangling task-generic and task-specific representations; 2) intra-task optimization conflicts due to class-imbalanced training data. To overcome these difficulties, we propose the MLLM-Engaged Joint Optimization (MEJO) framework that empowers both inter- and intra-task optimization for surgical triplet recognition. For inter-task optimization, we introduce the Shared-Specific-Disentangled (S$^2$D) learning scheme that decomposes representations into task-shared and task-specific components. To enhance task-shared representations, we construct a Multimodal Large Language Model (MLLM) powered probabilistic prompt pool to dynamically augment visual features with expert-level semantic cues. Additionally, comprehensive task-specific cues are modeled via distinct task prompts covering the temporal-spatial dimensions, effectively mitigating inter-task ambiguities. To tackle intra-task optimization conflicts, we develop a Coordinated Gradient Learning (CGL) strategy, which dissects and rebalances the positive-negative gradients originating from head and tail classes for more coordinated learning behaviors. Extensive experiments on the CholecT45 and CholecT50 datasets demonstrate the superiority of our proposed framework, validating its effectiveness in handling optimization conflicts.

[76] Cross-Layer Vision Smoothing: Enhancing Visual Understanding via Sustained Focus on Key Objects in Large Vision-Language Models

Jianfei Zhao,Feng Zhang,Xin Sun,Lingxing Kong,Zhixing Tan,Chong Feng

Main category: cs.CV

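TL;DR: The paper proposes Cross-Layer Vision Smoothing (CLVS), which strengthens the visual understanding of large vision-language models (LVLMs) by sustaining attention on key objects. Experiments confirm its effectiveness on several tasks.

Details Motivation: LVLMs attend to key objects in an image only briefly. The authors hypothesize that sustained focus on these objects improves visual capabilities.

Contribution: The CLVS method, which smooths the attention distribution across layers with a vision memory and improves performance on visual tasks, particularly relation and attribute understanding.

Method: Initialize a vision memory with position-unbiased visual attention in the first layer, smooth the attention distribution in subsequent layers while updating the memory iteratively, and use uncertainty as an indicator for terminating the smoothing process.

Result: Across four benchmarks and three LVLMs, CLVS achieves state-of-the-art performance on visual understanding tasks, with particularly significant gains in relation and attribute understanding.

Insight: Smoothly sustaining attention on key objects across layers effectively improves the visual capabilities of LVLMs. Because visual understanding occurs mainly in the early and middle layers, smoothing can stop once uncertainty indicates that it is complete.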

Abstract: Large Vision-Language Models (LVLMs) can accurately locate key objects in images, yet their attention to these objects tends to be very brief. Motivated by the hypothesis that sustained focus on key objects can improve LVLMs’ visual capabilities, we propose Cross-Layer Vision Smoothing (CLVS). The core idea of CLVS is to incorporate a vision memory that smooths the attention distribution across layers. Specifically, we initialize this vision memory with position-unbiased visual attention in the first layer. In subsequent layers, the model’s visual attention jointly considers the vision memory from previous layers, while the memory is updated iteratively, thereby maintaining smooth attention on key objects. Given that visual understanding primarily occurs in the early and middle layers of the model, we use uncertainty as an indicator of completed visual understanding and terminate the smoothing process accordingly. Experiments on four benchmarks across three LVLMs confirm the effectiveness and generalizability of our method. CLVS achieves state-of-the-art performance on a variety of visual understanding tasks, with particularly significant improvements in relation and attribute understanding.

[77] MSGFusion: Multimodal Scene Graph-Guided Infrared and Visible Image Fusion

Guihui Li,Bowei Dong,Kaizhi Dong,Jiayi Li,Haiyong Zheng

Main category: cs.CV

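TL;DR: MSGFusion is an infrared and visible image fusion framework guided by multimodal scene graphs. By combining structured information from text and vision, it significantly improves the semantic consistency and detail preservation of fused images.

Details Motivation: Current infrared and visible image fusion methods rely mainly on low-level visual cues such as texture and contrast and struggle to capture high-level semantics. Methods that add text guidance use unstructured descriptions that do not explicitly model entities, attributes, and relationships, which limits fusion performance.

Contribution: 1. The MSGFusion framework, which brings structured scene graphs into infrared and visible image fusion; 2. Explicit modeling of entities, attributes, and spatial relations through multimodal coupling; 3. Experimental evidence of clear gains in detail preservation and semantic consistency.

Method: 1. Derive structured scene graphs from text and vision; 2. Use successive modules for scene graph representation, hierarchical aggregation, and graph-driven fusion to refine high-level semantics and low-level details synchronously.

Result: On multiple public benchmarks, MSGFusion outperforms existing methods in detail preservation and structural clarity, and shows better semantic consistency and generalizability on downstream tasks such as low-light object detection, semantic segmentation, and medical image fusion.

Insight: Structured scene graphs can bridge low-level features and high-level semantics, offering a new direction for multimodal image fusion.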

Abstract: Infrared and visible image fusion has garnered considerable attention owing to the strong complementarity of these two modalities in complex, harsh environments. While deep learning-based fusion methods have made remarkable advances in feature extraction, alignment, fusion, and reconstruction, they still depend largely on low-level visual cues, such as texture and contrast, and struggle to capture the high-level semantic information embedded in images. Recent attempts to incorporate text as a source of semantic guidance have relied on unstructured descriptions that neither explicitly model entities, attributes, and relationships nor provide spatial localization, thereby limiting fine-grained fusion performance. To overcome these challenges, we introduce MSGFusion, a multimodal scene graph-guided fusion framework for infrared and visible imagery. By deeply coupling structured scene graphs derived from text and vision, MSGFusion explicitly represents entities, attributes, and spatial relations, and then synchronously refines high-level semantics and low-level details through successive modules for scene graph representation, hierarchical aggregation, and graph-driven fusion. Extensive experiments on multiple public benchmarks show that MSGFusion significantly outperforms state-of-the-art approaches, particularly in detail preservation and structural clarity, and delivers superior semantic consistency and generalizability in downstream tasks such as low-light object detection, semantic segmentation, and medical image fusion.

[78] T-SiamTPN: Temporal Siamese Transformer Pyramid Networks for Robust and Efficient UAV Tracking

Hojat Ardi,Amir Jahanshahi,Ali Diba

Main category: cs.CV

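TL;DR: T-SiamTPN is a temporal-aware Siamese tracking framework. Explicit temporal modeling addresses the weak handling of temporal dependencies in existing trackers and significantly improves performance and robustness.

Details Motivation: Existing Siamese trackers rely mainly on spatial cues and overlook temporal dependencies, which limits robustness in long-term tracking and under occlusion. The linear nature of correlation operations also limits their ability to handle non-linear appearance changes.

Contribution: 1. T-SiamTPN, which extends the SiamTPN architecture with explicit temporal modeling; 2. Temporal feature fusion and attention-based interactions that strengthen temporal consistency and feature representations; 3. A good balance between computational efficiency and performance.

Method: 1. Extend the SiamTPN architecture with a temporal feature fusion module; 2. Use attention-based interactions to enrich the feature representations; 3. Keep the design lightweight so that it suits embedded devices.

Result: The tracker runs at 7.1 FPS on a Jetson Nano, which the authors describe as real time. Compared to the baseline, success rate improves by 13.7% and precision by 14.7%, with performance competitive with state-of-the-art trackers.

Insight: 1. Temporal modeling is important for tracking robustness; 2. A lightweight design can balance performance and efficiency, which suits real-world deployment.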

Abstract: Aerial object tracking remains a challenging task due to scale variations, dynamic backgrounds, clutter, and frequent occlusions. While most existing trackers emphasize spatial cues, they often overlook temporal dependencies, resulting in limited robustness in long-term tracking and under occlusion. Furthermore, correlation-based Siamese trackers are inherently constrained by the linear nature of correlation operations, making them ineffective against complex, non-linear appearance changes. To address these limitations, we introduce T-SiamTPN, a temporal-aware Siamese tracking framework that extends the SiamTPN architecture with explicit temporal modeling. Our approach incorporates temporal feature fusion and attention-based interactions, strengthening temporal consistency and enabling richer feature representations. These enhancements yield significant improvements over the baseline and achieve performance competitive with state-of-the-art trackers. Crucially, despite the added temporal modules, T-SiamTPN preserves computational efficiency. Deployed on the resource-constrained Jetson Nano, the tracker runs in real time at 7.1 FPS, demonstrating its suitability for real-world embedded applications without notable runtime overhead. Experimental results highlight substantial gains: compared to the baseline, T-SiamTPN improves success rate by 13.7% and precision by 14.7%. These findings underscore the importance of temporal modeling in Siamese tracking frameworks and establish T-SiamTPN as a strong and efficient solution for aerial object tracking. Code is available at: https://github.com/to/be/released

[79] MATTER: Multiscale Attention for Registration Error Regression

Shipeng Liu,Ziliang Xiong,Khac-Hoang Ngo,Per-Erik Forssén

Main category: cs.CV

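TL;DR: The paper proposes MATTER, a regression-based method for point cloud registration (PCR) quality validation. Multiscale feature extraction and attention-based aggregation make registration error estimation more accurate and robust.

Details Motivation: Existing PCR quality validation methods treat the problem as classification, which yields only coarse results. The authors use regression for finer-grained quantification and extend feature extraction to improve performance, especially on point clouds with heterogeneous spatial densities.

Contribution: 1. A regression-based PCR quality validation method, in contrast to the existing classification-based methods; 2. Multiscale feature extraction and attention-based aggregation; 3. Validation on diverse datasets.

Method: 1. Extract misalignment-related features at multiple scales; 2. Aggregate the features with an attention mechanism; 3. Train a regression model to predict the registration error.

Result: The method gives accurate and robust registration error estimates on diverse datasets, especially for point clouds with heterogeneous spatial densities. When used to guide a downstream mapping task, it significantly improves mapping quality for a given number of re-registered frames compared with the state-of-the-art classification-based method.

Insight: Regression recovers information that coarse classification discards, and multiscale extraction with attention captures the features of complex point cloud data effectively.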

Abstract: Point cloud registration (PCR) is crucial for many downstream tasks, such as simultaneous localization and mapping (SLAM) and object tracking. This makes detecting and quantifying registration misalignment, i.e., PCR quality validation, an important task. All existing methods treat validation as a classification task, aiming to assign the PCR quality to a few classes. In this work, we instead use regression for PCR validation, allowing for a more fine-grained quantification of the registration quality. We also extend previously used misalignment-related features by using multiscale extraction and attention-based aggregation. This leads to accurate and robust registration error estimation on diverse datasets, especially for point clouds with heterogeneous spatial densities. Furthermore, when used to guide a mapping downstream task, our method significantly improves the mapping quality for a given amount of re-registered frames, compared to the state-of-the-art classification-based method.

[80] 4DRadar-GS: Self-Supervised Dynamic Driving Scene Reconstruction with 4D Radar

Xiao Tang,Guirong Zhuo,Cong Wang,Boyuan Zheng,Minqing Huang,Lianqing Zheng,Long Chen,Shouyi Lu

Main category: cs.CV

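TL;DR: 4DRadar-GS is a self-supervised framework for dynamic driving scene reconstruction that uses 4D radar. By exploiting the radar's velocity and spatial information, it segments dynamic objects and recovers depth scale, improving the accuracy and temporal consistency of dynamic scene reconstruction.

Details Motivation: Existing methods reconstruct dynamic objects poorly because of imprecise motion estimation and weak temporal consistency. Adding 4D radar is a potential way to improve the reconstruction quality of dynamic driving scenes.

Contribution: 1. A 4D radar-assisted Gaussian initialization scheme that uses radar data to segment dynamic objects and recover monocular depth scale; 2. The Velocity-guided PointTrack (VGPT) model, trained jointly with the reconstruction pipeline to track dynamic trajectories and strengthen temporal consistency.

Method: 1. Initialize the Gaussian point representations using the velocity and spatial information from 4D radar; 2. Train the VGPT model jointly with the reconstruction pipeline under scene flow supervision to track fine-grained dynamic trajectories.

Result: 4DRadar-GS achieves state-of-the-art performance in dynamic driving scene 3D reconstruction on the OmniHD-Scenes dataset.

Insight: The velocity information from 4D radar provides key motion cues for reconstructing dynamic objects, and combining it with self-supervised learning can improve the modeling accuracy of dynamic scenes.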

Abstract: 3D reconstruction and novel view synthesis are critical for validating autonomous driving systems and training advanced perception models. Recent self-supervised methods have gained significant attention due to their cost-effectiveness and enhanced generalization in scenarios where annotated bounding boxes are unavailable. However, existing approaches, which often rely on frequency-domain decoupling or optical flow, struggle to accurately reconstruct dynamic objects due to imprecise motion estimation and weak temporal consistency, resulting in incomplete or distorted representations of dynamic scene elements. To address these challenges, we propose 4DRadar-GS, a 4D Radar-augmented self-supervised 3D reconstruction framework tailored for dynamic driving scenes. Specifically, we first present a 4D Radar-assisted Gaussian initialization scheme that leverages 4D Radar’s velocity and spatial information to segment dynamic objects and recover monocular depth scale, generating accurate Gaussian point representations. In addition, we propose a Velocity-guided PointTrack (VGPT) model, which is jointly trained with the reconstruction pipeline under scene flow supervision, to track fine-grained dynamic trajectories and construct temporally consistent representations. Evaluated on the OmniHD-Scenes dataset, 4DRadar-GS achieves state-of-the-art performance in dynamic driving scene 3D reconstruction.

[81] Time-step Mixup for Efficient Spiking Knowledge Transfer from Appearance to Event Domain

Yuqi Xie,Shuhan Ye,Chong Wang,Jiazhen Xu,Le Shen,Yuanbin Qian,Jiangbo Qian

Main category: cs.CV

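TL;DR: The paper proposes Time-step Mixup knowledge transfer (TMKT), which mixes RGB and DVS inputs across time-steps to address the modality gap when training spiking neural networks on event camera data.

Details Motivation: Combining event cameras with spiking neural networks (SNNs) promises energy-efficient visual processing, but scarce event data and the sparsity of DVS outputs make training difficult. Existing transfer methods overlook the distribution gap between the RGB and DVS modalities.

Contribution: The TMKT method, which uses time-step mixing and modality-aware auxiliary learning objectives to achieve smoother cross-modal knowledge transfer and better classification performance.

Method: A time-step mixup strategy exploits the asynchronous nature of SNNs by interpolating RGB and DVS inputs at various time-steps, and modality-aware auxiliary learning objectives support the mixing process.

Result: Experiments on multiple datasets validate the method, which achieves superior performance on spiking image classification.

Insight: Time-step mixup alleviates modality shift during training, and the modality-aware auxiliary objectives strengthen the model's ability to discriminate across modalities.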

Abstract: The integration of event cameras and spiking neural networks holds great promise for energy-efficient visual processing. However, the limited availability of event data and the sparse nature of DVS outputs pose challenges for effective training. Although some prior work has attempted to transfer semantic knowledge from RGB datasets to DVS, they often overlook the significant distribution gap between the two modalities. In this paper, we propose Time-step Mixup knowledge transfer (TMKT), a novel fine-grained mixing strategy that exploits the asynchronous nature of SNNs by interpolating RGB and DVS inputs at various time-steps. To enable label mixing in cross-modal scenarios, we further introduce modality-aware auxiliary learning objectives. These objectives support the time-step mixup process and enhance the model’s ability to discriminate effectively across different modalities. Our approach enables smoother knowledge transfer, alleviates modality shift during training, and achieves superior performance in spiking image classification tasks. Extensive experiments demonstrate the effectiveness of our method across multiple datasets. The code will be released after the double-blind review process.
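One way to picture mixing at the granularity of time-steps is the sketch below. It is an illustration under stated assumptions, not the paper's algorithm: each of the `T` input frames fed to the SNN is taken either from the RGB sequence or from the DVS sequence, and the label is mixed in proportion to the number of RGB time-steps. The function name, the hard per-step selection, the shared tensor shape, and the label rule are all hypothetical, and the modality-aware auxiliary objectives are omitted.

```python
import numpy as np

def time_step_mixup(rgb_seq, dvs_seq, rgb_label, dvs_label, n_classes, rng):
    # rgb_seq and dvs_seq are assumed to share the shape (T, C, H, W).
    T = rgb_seq.shape[0]
    k = int(rng.integers(0, T + 1))              # number of time-steps taken from RGB
    rgb_steps = rng.choice(T, size=k, replace=False)
    use_rgb = np.zeros(T, dtype=bool)
    use_rgb[rgb_steps] = True

    # Select whole frames per time-step from one modality or the other.
    mixed = np.where(use_rgb[:, None, None, None], rgb_seq, dvs_seq)

    # Mix one-hot labels by the fraction of RGB time-steps.
    lam = k / T
    one_hot = np.eye(n_classes)
    label = lam * one_hot[rgb_label] + (1.0 - lam) * one_hot[dvs_label]
    return mixed, label, lam

rng = np.random.default_rng(0)
rgb_seq = np.ones((4, 2, 3, 3))    # toy RGB-derived frames (all ones)
dvs_seq = np.zeros((4, 2, 3, 3))   # toy DVS frames (all zeros)
mixed, label, lam = time_step_mixup(rgb_seq, dvs_seq, 1, 3, n_classes=5, rng=rng)
```

Because the toy RGB frames are all ones and the DVS frames all zeros, the mean of `mixed` equals `lam`, which makes the mixing ratio easy to check.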

[82] MMMS: Multi-Modal Multi-Surface Interactive Segmentation

Robin Schön,Julian Lorenz,Katja Ludwig,Daniel Kienzle,Rainer Lienhart

Main category: cs.CV

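TL;DR: The paper proposes MMMS, a method that interactively creates segmentation masks from user clicks. It focuses on segmenting multiple entangled surfaces in the same image and introduces a new evaluation metric.

Details Motivation: Multiple surfaces present in the same image may be heavily entangled and adjacent, which existing evaluation metrics do not account for. Multi-modal inputs can also facilitate segmentation, and the method is designed to exploit them.

Contribution: 1) The MMMS method, which supports multi-surface segmentation; 2) A novel extended evaluation metric; 3) A network architecture for multi-modal fusion designed for low response time; 4) Evidence of the benefit of multi-modal inputs.

Method: The network takes an RGB image, a number of non-RGB modalities, an erroneous mask, and encoded clicks as input. Interaction-specific information is integrated after image feature extraction and multi-modal fusion, and the network predicts an improved segmentation mask.

Result: With additional modalities, NoC@90 drops by up to 1.28 clicks per surface on average on DeLiVER and by up to 1.19 on MFNet. The RGB-only baseline is competitive, and in some cases superior, in the classical single-mask scenario.

Insight: Multi-modal fusion combined with late integration of interaction information improves segmentation accuracy while keeping response time low.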

Abstract: In this paper, we present a method to interactively create segmentation masks on the basis of user clicks. We pay particular attention to the segmentation of multiple surfaces that are simultaneously present in the same image. Since these surfaces may be heavily entangled and adjacent, we also present a novel extended evaluation metric that accounts for the challenges of this scenario. Additionally, the presented method is able to use multi-modal inputs to facilitate the segmentation task. At the center of this method is a network architecture which takes as input an RGB image, a number of non-RGB modalities, an erroneous mask, and encoded clicks. Based on this input, the network predicts an improved segmentation mask. We design our architecture such that it adheres to two conditions: (1) The RGB backbone is only available as a black-box. (2) To reduce the response time, we want our model to integrate the interaction-specific information after the image feature extraction and the multi-modal fusion. We refer to the overall task as Multi-Modal Multi-Surface interactive segmentation (MMMS). We are able to show the effectiveness of our multi-modal fusion strategy. Using additional modalities, our system reduces the NoC@90 by up to 1.28 clicks per surface on average on DeLiVER and up to 1.19 on MFNet. On top of this, we are able to show that our RGB-only baseline achieves competitive, and in some cases even superior performance when tested in a classical, single-mask interactive segmentation scenario.

[83] SHREC 2025: Protein surface shape retrieval including electrostatic potential

Taher Yacoub,Camille Depenveiller,Atsushi Tatsuma,Tin Barisin,Eugen Rusakov,Udo Gobel,Yuxu Peng,Shiqiang Deng,Yuki Kagaya,Joon Hong Park,Daisuke Kihara,Marco Guerra,Giorgio Palmieri,Andrea Ranieri,Ulderico Fugacci,Silvia Biasotti,Ruiwen He,Halim Benhabiles,Adnane Cabani,Karim Hammoudi,Haotian Li,Hao Huang,Chunyan Li,Alireza Tehrani,Fanwang Meng,Farnaz Heidar-Zadeh,Tuan-Anh Yang,Matthieu Montes

Main category: cs.CV

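TL;DR: The SHREC 2025 track focused on protein surface shape retrieval. Nine teams took part, and 15 methods were evaluated on a dataset of 11,555 protein surfaces. Methods that used the electrostatic potential as complementary information performed best.

Details Motivation: Protein surface shape retrieval matters in bioinformatics. The track explores whether combining shape with molecular surface descriptors such as the electrostatic potential improves retrieval, particularly for classes with limited data.

Contribution: (1) Organized a large-scale evaluation of protein surface shape retrieval; (2) verified the effectiveness of the electrostatic potential as a complementary descriptor; (3) showed that the benefit also holds for classes with limited data.

Method: (1) Built a dataset of 11,555 protein surfaces with calculated electrostatic potential; (2) evaluated a range of retrieval methods, including ones that combine shape with the electrostatic potential; (3) measured performance with Accuracy, Balanced accuracy, F1 score, Precision, and Recall.

Result: Methods that used electrostatic potential information achieved the best retrieval performance, including on classes with limited data.

Insight: The electrostatic potential, as a molecular surface descriptor, improves protein surface shape retrieval and remains useful when data are scarce, which highlights the value of additional surface descriptors.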

Abstract: This SHREC 2025 track dedicated to protein surface shape retrieval involved 9 participating teams. We evaluated the performance in retrieval of 15 proposed methods on a large dataset of 11,555 protein surfaces with calculated electrostatic potential (a key molecular surface descriptor). The performance in retrieval of the proposed methods was evaluated through different metrics (Accuracy, Balanced accuracy, F1 score, Precision and Recall). The best retrieval performance was achieved by the proposed methods that used the electrostatic potential complementary to molecular surface shape. This observation was also valid for classes with limited data which highlights the importance of taking into account additional molecular surface descriptors.
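
The metrics named in the abstract can be sketched as follows. This is a minimal illustration, assuming each query is scored by the class label of its top retrieved surface; that protocol, the macro averaging, and the function name are assumptions, not details taken from the track.

```python
from collections import Counter


def retrieval_metrics(y_true, y_pred):
    """Accuracy, balanced accuracy, and macro precision/recall/F1 from labels."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # class p was predicted but is wrong
            fn[t] += 1  # true class t was missed

    def safe_div(a, b):
        return a / b if b else 0.0

    classes = sorted(set(y_true))  # average over classes present in the ground truth
    precision = {c: safe_div(tp[c], tp[c] + fp[c]) for c in classes}
    recall = {c: safe_div(tp[c], tp[c] + fn[c]) for c in classes}
    f1 = {
        c: safe_div(2 * precision[c] * recall[c], precision[c] + recall[c])
        for c in classes
    }
    n = len(classes)
    return {
        "accuracy": safe_div(sum(tp.values()), len(y_true)),
        # Balanced accuracy is the mean per-class recall, so rare classes
        # count as much as frequent ones.
        "balanced_accuracy": sum(recall.values()) / n,
        "macro_precision": sum(precision.values()) / n,
        "macro_recall": sum(recall.values()) / n,
        "macro_f1": sum(f1.values()) / n,
    }
```

Balanced accuracy is the number to watch for classes with limited data, since plain accuracy can look high while a rare class is mostly missed.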

[84] PANORAMA: The Rise of Omnidirectional Vision in the Embodied AI Era

Xu Zheng,Chenfei Liao,Ziqiao Weng,Kaiyu Lei,Zihao Dongfang,Haocong He,Yuanhuiyi Lyu,Lutao Jiang,Lu Qi,Li Chen,Danda Pani Paudel,Kailun Yang,Linfeng Zhang,Luc Van Gool,Xuming Hu

Main category: cs.CV

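TL;DR: This overview examines the rise of omnidirectional vision in the embodied AI era, covering its importance, recent breakthroughs, and open challenges, and proposes an ideal panoramic system architecture, PANORAMA.

Details Motivation: Traditional pinhole vision offers limited environmental awareness, whereas omnidirectional vision provides holistic perception, so demand is growing in robotics, industrial inspection, and environmental monitoring.

Contribution: Proposes the PANORAMA panoramic system architecture, summarizes recent breakthroughs in omnidirectional generation, perception, and understanding, and identifies challenges and opportunities for future research.

Method: Drawing on insights from academia and industry, the paper proposes the PANORAMA architecture, which consists of four key subsystems, and discusses related datasets and cross-community impacts.

Result: Surveys recent progress in omnidirectional vision and proposes a systematic architecture that gives direction to future research.

Insight: Omnidirectional vision has great potential in the embodied AI era, but its foundational research still needs strengthening; future work must address data, generalization, and system integration.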

Abstract: Omnidirectional vision, using 360-degree vision to understand the environment, has become increasingly critical across domains like robotics, industrial inspection, and environmental monitoring. Compared to traditional pinhole vision, omnidirectional vision provides holistic environmental awareness, significantly enhancing the completeness of scene perception and the reliability of decision-making. However, foundational research in this area has historically lagged behind traditional pinhole vision. This talk presents an emerging trend in the embodied AI era: the rapid development of omnidirectional vision, driven by growing industrial demand and academic interest. We highlight recent breakthroughs in omnidirectional generation, omnidirectional perception, omnidirectional understanding, and related datasets. Drawing on insights from both academia and industry, we propose an ideal panoramic system architecture in the embodied AI era, PANORAMA, which consists of four key subsystems. Moreover, we offer in-depth opinions related to emerging trends and cross-community impacts at the intersection of panoramic vision and embodied AI, along with the future roadmap and open challenges. This overview synthesizes state-of-the-art advancements and outlines challenges and opportunities for future research in building robust, general-purpose omnidirectional AI systems in the embodied AI era.

[85] Dual-Stage Reweighted MoE for Long-Tailed Egocentric Mistake Detection

Boyu Han,Qianqian Xu,Shilong Bao,Zhiyong Yang,Sicong Li,Qingming Huang

Main category: cs.CV

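TL;DR: The paper proposes a Dual-Stage Reweighted Mixture-of-Experts (DR-MoE) framework for long-tailed egocentric mistake detection. It combines feature-level and classification-level expert modules with classifiers trained under different loss functions, improving detection of rare and ambiguous mistake instances.

Details Motivation: Mistake detection in egocentric video has to cope with rare mistake instances and class imbalance. Conventional methods handle these poorly, so a new framework is needed.

Contribution: 1. Proposes the DR-MoE framework, which combines a feature-level expert module (frozen ViViT and LoRA-tuned ViViT) with a classification-level expert module (multi-objective classifiers); 2. trains the classifiers with reweighted cross-entropy, AUC loss, and label-aware loss; 3. improves recognition of rare and ambiguous mistake instances.

Method: 1. Stage one: extract features with a frozen ViViT and a LoRA-tuned ViViT and combine them through a feature-level expert module; 2. Stage two: train three classifiers, optimized with reweighted cross-entropy, AUC loss, and label-aware loss with sharpness-aware minimization, respectively; 3. fuse their predictions through a classification-level expert module.

Result: DR-MoE performs strongly, particularly on rare and ambiguous mistake instances, and the code is publicly available.

Insight: Combining feature-level and classification-level expert modules helps with long-tailed distributions, and training classifiers with multiple objectives further improves robustness and generalization.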

Abstract: In this report, we address the problem of determining whether a user performs an action incorrectly from egocentric video data. To handle the challenges posed by subtle and infrequent mistakes, we propose a Dual-Stage Reweighted Mixture-of-Experts (DR-MoE) framework. In the first stage, features are extracted using a frozen ViViT model and a LoRA-tuned ViViT model, which are combined through a feature-level expert module. In the second stage, three classifiers are trained with different objectives: reweighted cross-entropy to mitigate class imbalance, AUC loss to improve ranking under skewed distributions, and label-aware loss with sharpness-aware minimization to enhance calibration and generalization. Their predictions are fused using a classification-level expert module. The proposed method achieves strong performance, particularly in identifying rare and ambiguous mistake instances. The code is available at https://github.com/boyuh/DR-MoE.
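
One of the three objectives, reweighted cross-entropy, can be sketched as below. The abstract does not say how the class weights are chosen, so inverse class frequency is assumed here, and all function names are hypothetical.

```python
import math


def inverse_frequency_weights(labels, num_classes):
    """Weight each class by n / (num_classes * count); rare classes get larger weights."""
    counts = [0] * num_classes
    for y in labels:
        counts[y] += 1
    n = len(labels)
    return [n / (num_classes * c) if c else 0.0 for c in counts]


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reweighted_cross_entropy(logits_batch, labels, weights):
    """Weighted mean of per-sample negative log-likelihoods."""
    loss, norm = 0.0, 0.0
    for logits, y in zip(logits_batch, labels):
        prob = softmax(logits)[y]
        loss += -weights[y] * math.log(prob)
        norm += weights[y]
    return loss / norm
```

With three correct-action samples and one mistake, the mistake class receives three times the weight of the majority class, so errors on it dominate the loss.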

[86] Brought a Gun to a Knife Fight: Modern VFM Baselines Outgun Specialized Detectors on In-the-Wild AI Image Detection

Yue Zhou,Xinan He,Kaiqing Lin,Bing Fan,Feng Ding,Jinhua Zeng,Bin Li

Main category: cs.CV

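TL;DR: A simple linear classifier on a modern Vision Foundation Model (VFM) clearly outperforms detectors designed specifically for AI-generated images, improving in-the-wild accuracy by over 20%. This suggests that the 'firepower' of an updated model is more effective than the 'craftsmanship' of a static detector, and underlines that test data must be independent of the model's pre-training history.

Details Motivation: Specialized detectors excel on curated benchmarks but perform poorly in real-world scenarios, with especially high false-negative rates. The authors ask whether modern vision foundation models address this real-world problem more effectively.

Contribution: 1. Shows that modern VFMs (e.g., Meta CLIP2) are strong detectors of AI-generated images. 2. Finds that the text-image alignment these models acquire during pre-training is the key advantage. 3. Argues that generalization evaluation requires test data fully independent of the model's pre-training history.

Method: A linear classifier on a modern VFM (e.g., Meta CLIP2) serves as the baseline, trained on the same data as the specialized detectors it is compared with. Probing text-image similarities checks whether the VFM has aligned synthetic images with concepts such as 'AI-generated'.

Result: The VFM baseline is over 20% more accurate than specialized detectors in the wild. However, accuracy drops sharply on a dataset collected after the VFM's pre-training cut-off date.

Insight: 1. The 'firepower' of a modern VFM (what it learned in pre-training) suits this moving target better than the 'craftsmanship' of a static detector. 2. Model evaluation must strictly rule out contamination from pre-training data.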

Abstract: While specialized detectors for AI-generated images excel on curated benchmarks, they fail catastrophically in real-world scenarios, as evidenced by their critically high false-negative rates on ‘in-the-wild’ benchmarks. Instead of crafting another specialized ‘knife’ for this problem, we bring a ‘gun’ to the fight: a simple linear classifier on a modern Vision Foundation Model (VFM). Trained on identical data, this baseline decisively ‘outguns’ bespoke detectors, boosting in-the-wild accuracy by a striking margin of over 20%. Our analysis pinpoints the source of the VFM’s ‘firepower’: First, by probing text-image similarities, we find that recent VLMs (e.g., Perception Encoder, Meta CLIP2) have learned to align synthetic images with forgery-related concepts (e.g., ‘AI-generated’), unlike previous versions. Second, we speculate that this is due to data exposure, as both this alignment and overall accuracy plummet on a novel dataset scraped after the VFM’s pre-training cut-off date, ensuring it was unseen during pre-training. Our findings yield two critical conclusions: 1) For the real-world ‘gunfight’ of AI-generated image detection, the raw ‘firepower’ of an updated VFM is far more effective than the ‘craftsmanship’ of a static detector. 2) True generalization evaluation requires test data to be independent of the model’s entire training history, including pre-training.
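
The baseline idea, a linear classifier on frozen features, can be sketched without any vision model. In the sketch below the feature vectors are hypothetical stand-ins for embeddings that a frozen VFM would produce; the training loop (plain gradient descent on the logistic loss), the hyperparameters, and the function names are assumptions, not the authors' setup.

```python
import math


def train_linear_probe(features, labels, lr=0.5, epochs=200):
    """Fit logistic regression on fixed feature vectors (the backbone is never updated)."""
    dim = len(features[0])
    w, b = [0.0] * dim, 0.0
    n = len(features)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))  # predicted probability of class 1
            err = p - y                     # gradient of the logistic loss w.r.t. z
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(w, b, x):
    """Return 1 (e.g. 'AI-generated') if the linear score is positive, else 0."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
```

Only the weight vector and bias are learned, which is why such a probe mainly measures what the frozen features already encode.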

[87] Dream3DAvatar: Text-Controlled 3D Avatar Reconstruction from a Single Image

Gaofeng Liu,Hengsen Li,Ruoyu Gao,Xuetong Li,Zhiyuan Ma,Tao Fang

Main category: cs.CV

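TL;DR: The paper proposes Dream3DAvatar, an efficient, text-controllable two-stage framework for reconstructing a 3D avatar from a single image, addressing the difficulty of controlling geometry and texture in occluded regions.

Details Motivation: Because monocular input carries limited information, current 3D avatar reconstruction struggles to control the geometry and texture of occluded regions. The authors redesign the reconstruction pipeline as a two-stage framework and add text control to improve quality and controllability.

Contribution: 1. Designs a two-stage framework: multi-view generation followed by 3D reconstruction; 2. introduces a Pose-Adapter and ID-Adapters to enforce multi-view consistency and preserve facial detail; 3. uses BLIP2 to generate high-quality text descriptions, strengthening text-driven control; 4. proposes a feedforward Transformer that efficiently reconstructs a 3D Gaussian Splat representation.

Method: 1. Stage one: a lightweight multi-view generation model, in which the Pose-Adapter injects SMPL-X renderings and skeletal information into SDXL and ID-Adapter-G preserves facial features; BLIP2 generates text descriptions. 2. Stage two: a feedforward Transformer fuses multi-view features and, with ID-Adapter-R recovering high-frequency detail, reconstructs a 3D Gaussian Splat representation.

Result: Experiments show that Dream3DAvatar generates high-quality, animation-ready 3D avatars without post-processing and outperforms existing baselines across multiple evaluation metrics.

Insight: Adapter-enhanced, text-driven multi-view generation helps with occluded regions, and a feedforward Transformer delivers both efficiency and quality in 3D reconstruction.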

Abstract: With the rapid advancement of 3D representation techniques and generative models, substantial progress has been made in reconstructing full-body 3D avatars from a single image. However, this task remains fundamentally ill-posed due to the limited information available from monocular input, making it difficult to control the geometry and texture of occluded regions during generation. To address these challenges, we redesign the reconstruction pipeline and propose Dream3DAvatar, an efficient and text-controllable two-stage framework for 3D avatar generation. In the first stage, we develop a lightweight, adapter-enhanced multi-view generation model. Specifically, we introduce the Pose-Adapter to inject SMPL-X renderings and skeletal information into SDXL, enforcing geometric and pose consistency across views. To preserve facial identity, we incorporate ID-Adapter-G, which injects high-resolution facial features into the generation process. Additionally, we leverage BLIP2 to generate high-quality textual descriptions of the multi-view images, enhancing text-driven controllability in occluded regions. In the second stage, we design a feedforward Transformer model equipped with a multi-view feature fusion module to reconstruct high-fidelity 3D Gaussian Splat representations (3DGS) from the generated images. Furthermore, we introduce ID-Adapter-R, which utilizes a gating mechanism to effectively fuse facial features into the reconstruction process, improving high-frequency detail recovery. Extensive experiments demonstrate that our method can generate realistic, animation-ready 3D avatars without any post-processing and consistently outperforms existing baselines across multiple evaluation metrics.

[88] Perception Before Reasoning: Two-Stage Reinforcement Learning for Visual Reasoning in Vision-Language Models

Yan Chen,Long Li,Teng Xi,Long Zeng,Jingdong Wang

Main category: cs.CV

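TL;DR: The paper proposes a two-stage reinforcement learning framework that yields PeBR-R1, a vision-language model (VLM) with stronger perception and reasoning. Staged training addresses the shortcomings of transplanting RL methods directly from language models, and experiments confirm its effectiveness.

Details Motivation: Applying RL methods that succeed on large language models (LLMs) directly to VLMs is suboptimal, because a VLM must accurately perceive the visual input before it can reason. A two-stage approach designed for VLMs is therefore needed.

Contribution: 1. Proposes a two-stage reinforcement learning framework that strengthens the perception and then the reasoning of VLMs; 2. mitigates the vanishing advantage issue in RL training through dataset-level sampling; 3. demonstrates the superior performance of PeBR-R1 on multiple benchmark datasets.

Method: 1. Stage one improves perception through coarse- and fine-grained visual understanding; 2. stage two focuses on reasoning; 3. dataset-level sampling selects distinct data sources to strengthen specific capabilities.

Result: PeBR-R1 performs strongly on seven benchmark datasets, validating the approach.

Insight: In visual reasoning, perception is the foundation of reasoning, and staged training outperforms directly transplanting methods from LLMs.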

Abstract: Reinforcement learning (RL) has proven highly effective in eliciting the reasoning capabilities of large language models (LLMs). Inspired by this success, recent studies have explored applying similar techniques to vision-language models (VLMs), aiming to enhance their reasoning performance. However, directly transplanting RL methods from LLMs to VLMs is suboptimal, as the tasks faced by VLMs are inherently more complex. Specifically, VLMs must first accurately perceive and understand visual inputs before reasoning can be effectively performed. To address this challenge, we propose a two-stage reinforcement learning framework designed to jointly enhance both the perceptual and reasoning capabilities of VLMs. To mitigate the vanishing advantage issue commonly observed in RL training, we first perform dataset-level sampling to selectively strengthen specific capabilities using distinct data sources. During training, the first stage focuses on improving the model’s visual perception through coarse- and fine-grained visual understanding, while the second stage targets the enhancement of reasoning abilities. After the proposed two-stage reinforcement learning process, we obtain PeBR-R1, a vision-language model with significantly enhanced perceptual and reasoning capabilities. Experimental results on seven benchmark datasets demonstrate the effectiveness of our approach and validate the superior performance of PeBR-R1 across diverse visual reasoning tasks.

[89] HERO: Rethinking Visual Token Early Dropping in High-Resolution Large Vision-Language Models

Xu Li,Yuxuan Liang,Xiaolei Chen,Yi Zheng,Haotian Chen,Bin Li,Xiangyang Xue

Main category: cs.CV

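TL;DR: HERO is an efficient inference framework for High-Resolution Large Vision-Language Models (HR-LVLMs). It allocates the visual token budget dynamically and selectively retains complementary tokens, improving the efficiency-accuracy trade-off.

Details Motivation: HR-LVLMs crop high-resolution images into local tiles and encode them independently. This improves fine-grained visual understanding but multiplies the number of visual tokens, causing substantial computational and memory overhead. The paper targets this efficiency problem.

Contribution: 1. Characterizes how visual tokens are used in HR-LVLMs; 2. proposes the HERO framework, which allocates token budgets dynamically and selectively retains complementary tokens; 3. achieves efficient inference without any training.

Method: The design builds on three findings: tile importance is jointly determined by visual saliency and task relevance; the CLS token in CLIP-based vision encoders shows a two-stage attention pattern across layers, attending to different types of visual tokens at each stage; and the tokens emphasized at different stages encode complementary information at different levels of granularity. HERO then combines content-adaptive token budget allocation with function-aware token selection.

Result: HERO achieves superior efficiency-accuracy trade-offs across multiple benchmarks and model scales, without training.

Insight: The complementarity of visual tokens across stages is key to efficient inference; dynamic budget allocation and selective token retention are an effective route to more efficient HR-LVLMs.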

Abstract: By cropping high-resolution images into local tiles and encoding them independently, High-Resolution Large Vision-Language Models (HR-LVLMs) have demonstrated remarkable fine-grained visual understanding capabilities. However, this divide-and-conquer paradigm significantly increases the number of visual tokens, resulting in substantial computational and memory overhead. To better understand and address this challenge, we empirically investigate visual token utilization in HR-LVLMs and uncover three key findings: (1) the local tiles have varying importance, jointly determined by visual saliency and task relevance; (2) the CLS token in CLIP-based vision encoders exhibits a two-stage attention pattern across layers, with each stage attending to different types of visual tokens; (3) the visual tokens emphasized at different stages encode information at varying levels of granularity, playing complementary roles within LVLMs. Building on these insights, we propose HERO, a High-resolution visual token early dropping framework that integrates content-adaptive token budget allocation with function-aware token selection. By accurately estimating tile-level importance and selectively retaining visual tokens with complementary roles, HERO achieves superior efficiency-accuracy trade-offs across diverse benchmarks and model scales, all in a training-free manner. This study provides both empirical insights and practical solutions toward efficient inference in HR-LVLMs.
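
The budget-allocation idea can be illustrated with a short sketch: split a global token budget across tiles in proportion to a per-tile importance score, then keep the highest-scoring tokens within each tile. How HERO actually estimates tile importance and token scores is not given in the abstract, so both are treated as hypothetical inputs here, and the function names are assumptions.

```python
def allocate_budget(tile_importance, total_budget, tokens_per_tile):
    """Split total_budget across tiles proportionally to importance, capped per tile."""
    total = sum(tile_importance)
    if total <= 0:
        raise ValueError("tile importance scores must sum to a positive value")
    raw = [total_budget * s / total for s in tile_importance]
    budget = [min(int(r), tokens_per_tile) for r in raw]
    leftover = total_budget - sum(budget)
    # Hand out remaining tokens by largest fractional remainder first.
    order = sorted(range(len(raw)), key=lambda i: raw[i] - int(raw[i]), reverse=True)
    while leftover > 0:
        progressed = False
        for i in order:
            if leftover == 0:
                break
            if budget[i] < tokens_per_tile:
                budget[i] += 1
                leftover -= 1
                progressed = True
        if not progressed:  # every tile is already at its cap
            break
    return budget


def select_tokens(token_scores, k):
    """Indices of the k highest-scoring tokens, returned in their original order."""
    top = sorted(range(len(token_scores)), key=lambda i: token_scores[i], reverse=True)[:k]
    return sorted(top)
```

Dropping tokens before they reach the language model is what makes this kind of scheme training-free: only the selection rule changes, not the weights.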

[90] TFANet: Three-Stage Image-Text Feature Alignment Network for Robust Referring Image Segmentation

Qianqi Lu,Yuxiang Xie,Jing Zhang,Shiwei Zou,Yan Chen,Xidao Luan

Main category: cs.CV

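TL;DR: The paper proposes TFANet, a Three-stage Image-Text Feature Alignment Network that uses a hierarchical framework to address multimodal misalignment and language semantic loss in Referring Image Segmentation (RIS).

Details Motivation: Existing RIS methods fall short on multimodal alignment and on preserving language semantics, and in complex scenes they tend to mislocalize targets or segment them incompletely.

Contribution: Proposes TFANet, whose three stages (KPS, KFS, KIS) perform multi-scale bidirectional alignment, cross-modal feature scanning, and language-feature-guided semantic deepening, respectively, to improve alignment accuracy.

Method: 1. KPS: a Multiscale Linear Cross-Attention Module (MLAM); 2. KFS: a Cross-modal Feature Scanning Module (CFSM); 3. KIS: a Word-level Linguistic Feature-guided Semantic Deepening Module (WFDM).

Result: TFANet performs well in complex scenes, localizing and segmenting targets more accurately.

Insight: The hierarchical three-stage design tackles the hard parts of multimodal alignment; in particular, word-level semantic deepening compensates for semantic degradation introduced in earlier stages.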

Abstract: Referring Image Segmentation (RIS) is a task that segments image regions based on language expressions, requiring fine-grained alignment between two modalities. However, existing methods often struggle with multimodal misalignment and language semantic loss, especially in complex scenes containing multiple visually similar objects, where uniquely described targets are frequently mislocalized or incompletely segmented. To tackle these challenges, this paper proposes TFANet, a Three-stage Image-Text Feature Alignment Network that systematically enhances multimodal alignment through a hierarchical framework comprising three stages: Knowledge Plus Stage (KPS), Knowledge Fusion Stage (KFS), and Knowledge Intensification Stage (KIS). In the first stage, we design the Multiscale Linear Cross-Attention Module (MLAM), which facilitates bidirectional semantic exchange between visual features and textual representations across multiple scales. This establishes rich and efficient alignment between image regions and different granularities of linguistic descriptions. Subsequently, the KFS further strengthens feature alignment through the Cross-modal Feature Scanning Module (CFSM), which applies multimodal selective scanning to capture long-range dependencies and construct a unified multimodal representation. This is essential for modeling long-range cross-modal dependencies and enhancing alignment accuracy in complex scenes. Finally, in the KIS, we propose the Word-level Linguistic Feature-guided Semantic Deepening Module (WFDM) to compensate for semantic degradation introduced in earlier stages.

[91] Enhancing Dual Network Based Semi-Supervised Medical Image Segmentation with Uncertainty-Guided Pseudo-Labeling

Yunyao Lu,Yihang Wu,Ahmad Chaddad,Tareef Daqqaq,Reem Kateb

Main category: cs.CV

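TL;DR: The paper proposes a semi-supervised 3D medical image segmentation framework based on a dual-network architecture. A Cross Consistency Enhancement module and a dynamic weighting strategy reduce noisy pseudo-labels, and a contrastive learning mechanism lowers prediction uncertainty.

Details Motivation: Existing semi-supervised segmentation methods suffer from noisy pseudo-labels and insufficient supervision in the feature space, which limits their practical use in medical image segmentation.

Contribution: 1. Proposes a Cross Consistency Enhancement module and a dynamic weighting strategy to reduce pseudo-label noise; 2. designs an uncertainty-aware mechanism (Kullback-Leibler divergence) to adjust the contribution of pseudo-labels; 3. introduces self-supervised contrastive learning to align uncertain voxel features with reliable class prototypes.

Method: 1. Generate pseudo-labels with a dual-network architecture; 2. reduce noise through cross pseudo supervision and entropy-filtered supervision; 3. weight pseudo-labels dynamically according to uncertainty; 4. align features with self-supervised contrastive learning.

Result: The method performs strongly on three datasets, Left Atrial, NIH Pancreas, and BraTS-2019 (e.g., an 89.95% Dice score on Left Atrial with 10% labeled data), outperforming existing methods.

Insight: Combining pseudo-label refinement with feature alignment improves semi-supervised medical image segmentation, especially when labeled data are scarce.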

Abstract: Despite the remarkable performance of supervised medical image segmentation models, relying on a large amount of labeled data is impractical in real-world situations. Semi-supervised learning approaches aim to alleviate this challenge using unlabeled data through pseudo-label generation. Yet, existing semi-supervised segmentation methods still suffer from noisy pseudo-labels and insufficient supervision within the feature space. To solve these challenges, this paper proposes a novel semi-supervised 3D medical image segmentation framework based on a dual-network architecture. Specifically, we investigate a Cross Consistency Enhancement module using both cross pseudo and entropy-filtered supervision to reduce the noisy pseudo-labels, while we design a dynamic weighting strategy to adjust the contributions of pseudo-labels using an uncertainty-aware mechanism (i.e., Kullback-Leibler divergence). In addition, we use a self-supervised contrastive learning mechanism to align uncertain voxel features with reliable class prototypes by effectively differentiating between trustworthy and uncertain predictions, thus reducing prediction uncertainty. Extensive experiments are conducted on three 3D segmentation datasets, Left Atrial, NIH Pancreas and BraTS-2019. The proposed approach consistently exhibits superior performance across various settings (e.g., 89.95% Dice score on left Atrial with 10% labeled data) compared to the state-of-the-art methods. Furthermore, the usefulness of the proposed modules is further validated via ablation experiments.
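
As a rough illustration of uncertainty-aware weighting, the sketch below measures disagreement between the two networks' class probabilities for one voxel with the Kullback-Leibler divergence and maps it to a weight in (0, 1]. The exponential mapping `exp(-KL)` is an assumption chosen for simplicity; the abstract names the KL divergence but not the exact weighting formula.

```python
import math


def kl_divergence(p, q, eps=1e-8):
    """KL(p || q) between two discrete distributions; eps guards against log(0)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def pseudo_label_weight(p, q):
    """Down-weight a pseudo-label when the two networks disagree (assumed form)."""
    return math.exp(-kl_divergence(p, q))


def weighted_pseudo_loss(per_voxel_losses, probs_a, probs_b):
    """Average per-voxel losses, each scaled by its agreement-based weight."""
    weights = [pseudo_label_weight(p, q) for p, q in zip(probs_a, probs_b)]
    return sum(w * l for w, l in zip(weights, per_voxel_losses)) / len(per_voxel_losses)
```

Voxels where the networks agree keep a weight near 1, while strongly contested voxels contribute little to the pseudo-label loss.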

[92] A Synthetic Data Pipeline for Supporting Manufacturing SMEs in Visual Assembly Control

Jonas Werheid,Shengjie He,Aymen Gannouni,Anas Abdelrazeq,Robert H. Schmitt

Main category: cs.CV

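TL;DR: The paper proposes a synthetic-data approach to visual assembly control that uses CAD data and object detection algorithms, giving manufacturing small- and medium-sized enterprises (SMEs) a resource-efficient way to generate data and control quality.

Details Motivation: Assembly quality control is essential in manufacturing, but conventional computer vision methods are costly in data acquisition and annotation, which SMEs in particular can rarely afford. Synthetic data can reduce these costs, yet its practical use in assembly quality control remains limited.

Contribution: Proposes an easily integrable, data-efficient approach to visual assembly control that generates synthetic data from CAD data and pairs it with object detection algorithms, reducing the need for manual data collection and annotation.

Method: Simulated scenes are generated from CAD data and combined with object detection algorithms in a synthetic data pipeline used to train and test an assembly control system.

Result: The pipeline reaches an mAP@0.5:0.95 of up to 99.5% on the simulated training data and up to 93% when transferred to real-world camera-captured test data.

Insight: A synthetic data pipeline lets SMEs implement visual assembly control more efficiently, lowering the required resources and technical barrier.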

Abstract: Quality control of assembly processes is essential in manufacturing to ensure not only the quality of individual components but also their proper integration into the final product. To assist in this matter, automated assembly control using computer vision methods has been widely implemented. However, the costs associated with image acquisition, annotation, and training of computer vision algorithms pose challenges for integration, especially for small- and medium-sized enterprises (SMEs), which often lack the resources for extensive training, data collection, and manual image annotation. Synthetic data offers the potential to reduce manual data collection and labeling. Nevertheless, its practical application in the context of assembly quality remains limited. In this work, we present a novel approach for easily integrable and data-efficient visual assembly control. Our approach leverages simulated scene generation based on computer-aided design (CAD) data and object detection algorithms. The results demonstrate a time-saving pipeline for generating image data in manufacturing environments, achieving a mean Average Precision (mAP@0.5:0.95) of up to 99.5% for correctly identifying instances of synthetic planetary gear system components within our simulated training data, and up to 93% when transferred to real-world camera-captured testing data. This research highlights the effectiveness of synthetic data generation within an adaptable pipeline and underscores its potential to support SMEs in implementing resource-efficient visual assembly control solutions.

[93] Weakly and Self-Supervised Class-Agnostic Motion Prediction for Autonomous Driving

Ruibo Li,Hanyu Shi,Zhe Wang,Guosheng Lin

Main category: cs.CV

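TL;DR: The paper studies weakly and self-supervised class-agnostic motion prediction for autonomous driving from LiDAR point clouds. It reduces annotation needs by using foreground/background or non-ground/ground masks, and proposes a Robust Consistency-aware Chamfer Distance loss to improve performance.

Details Motivation: Autonomous driving requires an accurate understanding of motion in dynamic environments, but conventional methods depend on large amounts of annotation. The goal is a motion prediction approach that needs far less of it.

Contribution: 1. Proposes a weakly supervised paradigm that replaces motion annotations with foreground/background masks; 2. uses non-ground/ground masks to reduce annotation effort further; 3. designs a Robust Consistency-aware Chamfer Distance loss to improve self-supervised learning.

Method: 1. Weak supervision from foreground/background or non-ground/ground masks; 2. a self-supervised loss that incorporates multi-frame information and robust penalty functions; 3. support for settings with very few (0.01%) annotations or none at all.

Result: Experiments show that the proposed weakly and self-supervised models outperform existing self-supervised methods, and the weakly supervised models even rival some supervised ones.

Insight: Foreground/background and non-ground/ground masks are effective supervision signals for motion prediction, cutting annotation effort while keeping performance high.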

Abstract: Understanding motion in dynamic environments is critical for autonomous driving, thereby motivating research on class-agnostic motion prediction. In this work, we investigate weakly and self-supervised class-agnostic motion prediction from LiDAR point clouds. Outdoor scenes typically consist of mobile foregrounds and static backgrounds, allowing motion understanding to be associated with scene parsing. Based on this observation, we propose a novel weakly supervised paradigm that replaces motion annotations with fully or partially annotated (1%, 0.1%) foreground/background masks for supervision. To this end, we develop a weakly supervised approach utilizing foreground/background cues to guide the self-supervised learning of motion prediction models. Since foreground motion generally occurs in non-ground regions, non-ground/ground masks can serve as an alternative to foreground/background masks, further reducing annotation effort. Leveraging non-ground/ground cues, we propose two additional approaches: a weakly supervised method requiring fewer (0.01%) foreground/background annotations, and a self-supervised method without annotations. Furthermore, we design a Robust Consistency-aware Chamfer Distance loss that incorporates multi-frame information and robust penalty functions to suppress outliers in self-supervised learning. Experiments show that our weakly and self-supervised models outperform existing self-supervised counterparts, and our weakly supervised models even rival some supervised ones. This demonstrates that our approaches effectively balance annotation effort and performance.
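
To show why a robust penalty matters in a Chamfer-style loss, the sketch below compares the plain symmetric Chamfer distance with a version that passes each squared nearest-neighbor distance through a bounded penalty. The Geman-McClure function is used as a stand-in, since the abstract does not name the penalty, and the multi-frame consistency term of the paper's loss is not modeled. All names are hypothetical.

```python
def _nearest_sq_dist(point, cloud):
    """Squared distance from point to its nearest neighbor in cloud (brute force)."""
    return min(sum((a - b) ** 2 for a, b in zip(point, q)) for q in cloud)


def chamfer_distance(src, dst, penalty=lambda d2: d2):
    """Symmetric Chamfer distance with an optional per-point penalty on squared distances."""
    forward = sum(penalty(_nearest_sq_dist(p, dst)) for p in src) / len(src)
    backward = sum(penalty(_nearest_sq_dist(q, src)) for q in dst) / len(dst)
    return forward + backward


def geman_mcclure(d2, scale=1.0):
    """Bounded robust penalty: grows like d2 for small errors, saturates below 1."""
    return d2 / (d2 + scale)
```

With the plain penalty a single outlier point dominates the loss, whereas the bounded penalty limits each point's contribution and so suppresses outliers.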

[94] MSDNet: Efficient 4D Radar Super-Resolution via Multi-Stage Distillation

Minqing Huang,Shouyi Lu,Boyuan Zheng,Ziyao Li,Xiao Tang,Guirong Zhuo

Main category: cs.CV

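TL;DR: MSDNet is a multi-stage distillation framework that efficiently transfers dense LiDAR priors to 4D radar features, raising the resolution of 4D radar point clouds while balancing reconstruction quality and computational efficiency.

Details Motivation: Existing 4D radar super-resolution methods suffer from high training cost, high inference latency, and poor generalization, making it hard to balance accuracy and efficiency. MSDNet aims to address these limitations.

Contribution: 1. Proposes a multi-stage distillation framework that refines radar features in two stages; 2. introduces a noise adapter that adaptively aligns the feature noise level; 3. demonstrates high-fidelity reconstruction and low-latency inference on the VoD and in-house datasets.

Method: 1. Stage one performs reconstruction-guided feature distillation through feature reconstruction; 2. stage two performs diffusion-guided feature distillation with a lightweight diffusion network; 3. a noise adapter aligns the feature noise level with a predefined diffusion timestep.

Result: Experiments show that MSDNet achieves high-fidelity reconstruction and low-latency inference in 4D radar point cloud super-resolution and improves performance on downstream tasks.

Insight: Multi-stage distillation with a noise adapter offers a new route to efficient radar data enhancement and may extend to other point cloud super-resolution tasks.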

Abstract: 4D radar super-resolution, which aims to reconstruct sparse and noisy point clouds into dense and geometrically consistent representations, is a foundational problem in autonomous perception. However, existing methods often suffer from high training cost or rely on complex diffusion-based sampling, resulting in high inference latency and poor generalization, making it difficult to balance accuracy and efficiency. To address these limitations, we propose MSDNet, a multi-stage distillation framework that efficiently transfers dense LiDAR priors to 4D radar features to achieve both high reconstruction quality and computational efficiency. The first stage performs reconstruction-guided feature distillation, aligning and densifying the student’s features through feature reconstruction. In the second stage, we propose diffusion-guided feature distillation, which treats the stage-one distilled features as a noisy version of the teacher’s representations and refines them via a lightweight diffusion network. Furthermore, we introduce a noise adapter that adaptively aligns the noise level of the feature with a predefined diffusion timestep, enabling a more precise denoising. Extensive experiments on the VoD and in-house datasets demonstrate that MSDNet achieves both high-fidelity reconstruction and low-latency inference in the task of 4D radar point cloud super-resolution, and consistently improves performance on downstream tasks. The code will be publicly available upon publication.

[95] Enhancing Video Large Language Models with Structured Multi-Video Collaborative Reasoning (early version)

Zhihao He,Tianyao He,Tieyuan Chen,Yun Xu,Huabin Liu,Chaofan Gan,Gui Zou,Weiyao Lin

Main category: cs.CV

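TL;DR: The paper proposes a multi-video collaborative framework that uses structured video representations and a graph fusion module to strengthen the reasoning of video language models, addressing the spatio-temporal incompleteness and redundancy of single videos.

Details Motivation: Current video language models are prone to hallucinations and errors on comprehensive video reasoning because an individual video is spatio-temporally incomplete. Reasoning over multiple related videos is a promising remedy, but video tokens are numerous and redundant, so feeding the raw video data directly can degrade performance.

Contribution: 1) Designs a Video Structuring Module that represents a video's knowledge as a spatio-temporal graph; 2) proposes a Graph Fusion Module that fuses structured knowledge from related videos into augmented graph node tokens; 3) constructs a multi-video structured prompt that integrates graph, visual, and textual tokens as input.

Method: 1) Represent video knowledge as a spatio-temporal graph; 2) extract useful information from multiple videos with the Graph Fusion Module; 3) integrate the multimodal inputs through a structured prompt.

Result: Experiments confirm the effectiveness of the framework and show its potential for improving the reasoning of video language models.

Insight: Structured representations and multi-video collaboration reduce redundant information and improve the accuracy and robustness of video reasoning.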

Abstract: Despite the prosperity of the video language model, the current pursuit of comprehensive video reasoning is thwarted by the inherent spatio-temporal incompleteness within individual videos, resulting in hallucinations and inaccuracies. A promising solution is to augment the reasoning performance with multiple related videos. However, video tokens are numerous and contain redundant information, so directly feeding the relevant video data into a large language model to enhance responses could be counterproductive. To address this challenge, we propose a multi-video collaborative framework for video language models. For efficient and flexible video representation, we establish a Video Structuring Module to represent the video’s knowledge as a spatio-temporal graph. Based on the structured video representation, we design the Graph Fusion Module to fuse the structured knowledge and valuable information from related videos into the augmented graph node tokens. Finally, we construct an elaborate multi-video structured prompt to integrate the graph, visual, and textual tokens as the input to the large language model. Extensive experiments substantiate the effectiveness of our framework, showcasing its potential as a promising avenue for advancing video language models.

[96] WHU-STree: A Multi-modal Benchmark Dataset for Street Tree Inventory

Ruifei Ding,Zhe Chen,Wen Fan,Chen Long,Huijuan Xiao,Yelu Zeng,Zhen Dong,Bisheng Yang

Main category: cs.CV

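TL;DR: WHU-STree is a multi-modal street tree dataset that supports a range of street tree inventory tasks, addressing the small scale, limited annotation, and single modality of existing data.

Details Motivation: Traditional street tree surveys are time-consuming and labor-intensive, and existing datasets acquired with Mobile Mapping Systems (MMS) are small in scale, sparsely annotated, or single-modality, which limits comprehensive analysis. WHU-STree aims to address these limitations.

Contribution: Introduces WHU-STree, a cross-city, multi-modal, richly annotated street tree dataset that supports over 10 tasks, and provides benchmark results.

Method: Synchronized point clouds and high-resolution images are collected with MMS, and 21,007 tree instances are annotated, covering 50 species and 2 morphological parameters.

Result: Experiments demonstrate the potential of multi-modal data fusion and underline cross-domain applicability as a prerequisite for practical algorithm deployment.

Insight: Multi-modal fusion, multi-task collaboration, cross-domain generalization, spatial pattern learning, and multi-modal large language models are important directions for street tree asset management.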

Abstract: Street trees are vital to urban livability, providing ecological and social benefits. Establishing a detailed, accurate, and dynamically updated street tree inventory has become essential for optimizing these multifunctional assets within space-constrained urban environments. Given that traditional field surveys are time-consuming and labor-intensive, automated surveys utilizing Mobile Mapping Systems (MMS) offer a more efficient solution. However, existing MMS-acquired tree datasets are limited by small-scale scene, limited annotation, or single modality, restricting their utility for comprehensive analysis. To address these limitations, we introduce WHU-STree, a cross-city, richly annotated, and multi-modal urban street tree dataset. Collected across two distinct cities, WHU-STree integrates synchronized point clouds and high-resolution images, encompassing 21,007 annotated tree instances across 50 species and 2 morphological parameters. Leveraging the unique characteristics, WHU-STree concurrently supports over 10 tasks related to street tree inventory. We benchmark representative baselines for two key tasks–tree species classification and individual tree segmentation. Extensive experiments and in-depth analysis demonstrate the significant potential of multi-modal data fusion and underscore cross-domain applicability as a critical prerequisite for practical algorithm deployment. In particular, we identify key challenges and outline potential future works for fully exploiting WHU-STree, encompassing multi-modal fusion, multi-task collaboration, cross-domain generalization, spatial pattern learning, and Multi-modal Large Language Model for street tree asset management. The WHU-STree dataset is accessible at: https://github.com/WHU-USI3DV/WHU-STree.

[97] More performant and scalable: Rethinking contrastive vision-language pre-training of radiology in the LLM era

Yingtai Li,Haoran Lai,Xiaoqian Zhou,Shuai Ming,Wenxin Ma,Wei Wei,Shaohua Kevin Zhou

Main category: cs.CV

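TL;DR: The paper explores how large language models (LLMs) can make medical contrastive vision-language pre-training more performant and scalable. It uses LLMs to extract diagnostic labels from radiology reports automatically, creates a low-cost, large-scale "silver-standard" dataset, and shows that models pre-trained on it perform strongly.

Details Motivation: The emergence of LLMs opens new opportunities for medical vision-language pre-training, especially in domains such as medical imaging where labeled data are scarce. The paper uses LLMs to extract high-quality labels automatically, lowering pre-training cost and improving model performance.

Contribution: 1. Shows that LLMs can extract diagnostic labels from radiology reports with high precision (>96% AUC) without complex prompt engineering, producing a low-cost, large-scale "silver-standard" dataset.
2. Shows that a vision encoder trained on the "silver-standard" data performs comparably to one trained on labels extracted by specialized BERT-based models.
3. Shows that supervised pre-training substantially improves contrastive vision-language alignment.

Method: 1. Label radiology reports automatically with an LLM to build a large-scale "silver-standard" dataset.
2. Perform supervised pre-training with a 3D ResNet-18 and vanilla CLIP training.
3. Evaluate on multiple datasets (e.g., CT-RATE and RAD-ChestCT).

Result: 1. Zero-shot diagnosis reaches 83.8% AUC on CT-RATE and 77.3% AUC on RAD-ChestCT.
2. Cross-modal retrieval improves substantially (MAP@50=53.7% for image-image, Recall@100=52.2% for report-image).

Insight: 1. LLMs can produce high-quality labels at low cost, lowering the barrier to medical AI development.
2. Supervised pre-training is important for vision-language alignment.
3. A simple architecture (e.g., 3D ResNet-18) combined with LLM-labeled data can yield a high-performing medical AI system.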

Abstract: The emergence of Large Language Models (LLMs) presents unprecedented opportunities to revolutionize medical contrastive vision-language pre-training. In this paper, we show how LLMs can facilitate large-scale supervised pre-training, thereby advancing vision-language alignment. We begin by demonstrating that modern LLMs can automatically extract diagnostic labels from radiology reports with remarkable precision (>96% AUC in our experiments) without complex prompt engineering, enabling the creation of large-scale “silver-standard” datasets at a minimal cost (~$3 for 50k CT image-report pairs). Further, we find that a vision encoder trained on this “silver-standard” dataset achieves performance comparable to those trained on labels extracted by specialized BERT-based models, thereby democratizing the access to large-scale supervised pre-training. Building on this foundation, we proceed to reveal that supervised pre-training fundamentally improves contrastive vision-language alignment. Our approach achieves state-of-the-art performance using only a 3D ResNet-18 with vanilla CLIP training, including 83.8% AUC for zero-shot diagnosis on CT-RATE, 77.3% AUC on RAD-ChestCT, and substantial improvements in cross-modal retrieval (MAP@50=53.7% for image-image, Recall@100=52.2% for report-image). These results demonstrate the potential of utilizing LLMs to facilitate more performant and scalable medical AI systems. Our code is available at https://github.com/SadVoxel/More-performant-and-scalable.

[98] Road Obstacle Video Segmentation

Shyam Nandan Rai,Shyamgopal Karthik,Mariana-Iuliana Georgescu,Barbara Caputo,Carlo Masone,Zeynep Akata

Main category: cs.CV

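TL;DR: The paper proposes a new approach to road-obstacle video segmentation that incorporates temporal information, and validates it on four evaluation benchmarks.

Details Motivation: Existing road-obstacle segmentation methods mostly operate on single frames and overlook the temporal nature of the problem, which leads to inconsistent predictions between consecutive frames. The paper argues that the task is inherently temporal and that the correlation between consecutive frames should be modeled.

Contribution: 1. Curates and adapts four evaluation benchmarks for road-obstacle video segmentation; 2. evaluates 11 state-of-the-art segmentation methods; 3. proposes two baseline methods built on vision foundation models.

Method: The paper proposes two methods based on vision foundation models, examines the role of temporal information in road-obstacle segmentation, and validates their performance experimentally.

Result: The proposed approach sets a new state of the art on long-range video sequences.

Insight: Road-obstacle segmentation is a temporal task, and temporal modeling improves the continuity and consistency of the segmentation maps.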

Abstract: With the growing deployment of autonomous driving agents, the detection and segmentation of road obstacles have become critical to ensure safe autonomous navigation. However, existing road-obstacle segmentation methods are applied on individual frames, overlooking the temporal nature of the problem, leading to inconsistent prediction maps between consecutive frames. In this work, we demonstrate that the road-obstacle segmentation task is inherently temporal, since the segmentation maps for consecutive frames are strongly correlated. To address this, we curate and adapt four evaluation benchmarks for road-obstacle video segmentation and evaluate 11 state-of-the-art image- and video-based segmentation methods on these benchmarks. Moreover, we introduce two strong baseline methods based on vision foundation models. Our approach establishes a new state-of-the-art in road-obstacle video segmentation for long-range video sequences, providing valuable insights and direction for future research.

[99] Vi-SAFE: A Spatial-Temporal Framework for Efficient Violence Detection in Public Surveillance

Ligang Chang,Shengkai Xu,Liangchang Shen,Binhan Xu,Junqiao Wang,Tianyu Shi,Yanhui Du

Main category: cs.CV

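TL;DR: Vi-SAFE is a spatial-temporal framework for efficient violence detection in public surveillance. It combines an enhanced YOLOv8 with TSN, uses a lightweight structure to cut computational cost, and reaches an accuracy of 0.88 on the RWF-2000 dataset.

Details Motivation: Violence detection in public surveillance is critical for public safety but is complicated by small-scale targets, complex environments, and the need for real-time analysis, so an efficient solution is needed.

Contribution: Proposes the Vi-SAFE framework, which pairs a lightweight YOLOv8 (GhostNetV3 backbone and EMA attention mechanism) with TSN to improve the accuracy and efficiency of violence detection.

Method: 1. Enhanced YOLOv8: GhostNetV3 as the backbone, an EMA attention mechanism, and pruning.
2. TSN for temporal analysis: YOLOv8 extracts human regions and TSN classifies violent behavior.

Result: On the RWF-2000 dataset, Vi-SAFE reaches an accuracy of 0.88, surpassing TSN alone (0.77) and other existing methods, with better computational efficiency as well.

Insight: 1. Combining spatial and temporal components improves violence-detection performance.
2. Lightweight design and attention mechanisms are essential for real-time surveillance systems.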

Abstract: Violence detection in public surveillance is critical for public safety. This study addresses challenges such as small-scale targets, complex environments, and real-time temporal analysis. We propose Vi-SAFE, a spatial-temporal framework that integrates an enhanced YOLOv8 with a Temporal Segment Network (TSN) for video surveillance. The YOLOv8 model is optimized with GhostNetV3 as a lightweight backbone, an exponential moving average (EMA) attention mechanism, and pruning to reduce computational cost while maintaining accuracy. YOLOv8 and TSN are trained separately on pedestrian and violence datasets, where YOLOv8 extracts human regions and TSN performs binary classification of violent behavior. Experiments on the RWF-2000 dataset show that Vi-SAFE achieves an accuracy of 0.88, surpassing TSN alone (0.77) and outperforming existing methods in both accuracy and efficiency, demonstrating its effectiveness for public safety surveillance. Code is available at https://anonymous.4open.science/r/Vi-SAFE-3B42/README.md.

[100] Curriculum Multi-Task Self-Supervision Improves Lightweight Architectures for Onboard Satellite Hyperspectral Image Segmentation

Hugo Carlesso,Josiane Mothe,Radu Tudor Ionescu

Main category: cs.CV

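TL;DR: The paper proposes a curriculum multi-task self-supervised learning framework (CMTSSL) designed for lightweight hyperspectral image (HSI) analysis. By learning spatial and spectral features jointly, it improves the performance of lightweight models.

Details Motivation: Hyperspectral data is high-dimensional and satellite links are slow, so compact, efficient models are needed to support onboard processing and reduce the transmission of redundant data.

Contribution: Proposes the CMTSSL framework, which combines masked image modeling with spatial and spectral jigsaw puzzle tasks and uses a curriculum learning strategy that progressively increases data complexity, so that spectral continuity, spatial structure, and global semantic features are learned jointly.

Method: Masked image modeling plus decoupled spatial and spectral jigsaw tasks, with a curriculum learning strategy that progressively increases data complexity.

Result: Validated on four public datasets, with consistent gains on downstream segmentation tasks, using architectures more than 16,000x lighter than some state-of-the-art models.

Insight: CMTSSL shows the potential of self-supervised learning for lightweight models, particularly for onboard satellite hyperspectral image analysis.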

Abstract: Hyperspectral imaging (HSI) captures detailed spectral signatures across hundreds of contiguous bands per pixel, being indispensable for remote sensing applications such as land-cover classification, change detection, and environmental monitoring. Due to the high dimensionality of HSI data and the slow rate of data transfer in satellite-based systems, compact and efficient models are required to support onboard processing and minimize the transmission of redundant or low-value data, e.g. cloud-covered areas. To this end, we introduce a novel curriculum multi-task self-supervised learning (CMTSSL) framework designed for lightweight architectures for HSI analysis. CMTSSL integrates masked image modeling with decoupled spatial and spectral jigsaw puzzle solving, guided by a curriculum learning strategy that progressively increases data complexity during self-supervision. This enables the encoder to jointly capture fine-grained spectral continuity, spatial structure, and global semantic features. Unlike prior dual-task SSL methods, CMTSSL simultaneously addresses spatial and spectral reasoning within a unified and computationally efficient design, being particularly suitable for training lightweight models for onboard satellite deployment. We validate our approach on four public benchmark datasets, demonstrating consistent gains in downstream segmentation tasks, using architectures that are over 16,000x lighter than some state-of-the-art models. These results highlight the potential of CMTSSL in generalizable representation learning with lightweight architectures for real-world HSI applications. Our code is publicly available at https://github.com/hugocarlesso/CMTSSL.
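As a rough illustration of a spectral jigsaw task with a curriculum, the sketch below shuffles contiguous band groups of one pixel's spectrum and makes the puzzle harder over training. The function names, the linear schedule, and the group counts are assumptions for illustration; the abstract does not specify how CMTSSL schedules difficulty.

```python
import random

def spectral_jigsaw(pixel_spectrum, n_groups, rng):
    """Split a 1-D spectrum into n_groups contiguous chunks and shuffle them.

    Returns the shuffled spectrum and the permutation a model would have to
    predict. Bands left over when the length is not divisible are dropped.
    """
    size = len(pixel_spectrum) // n_groups
    chunks = [pixel_spectrum[i * size:(i + 1) * size] for i in range(n_groups)]
    perm = list(range(n_groups))
    rng.shuffle(perm)
    shuffled = [value for p in perm for value in chunks[p]]
    return shuffled, perm

def groups_for_epoch(epoch, total_epochs, min_groups=2, max_groups=8):
    """Hypothetical linear curriculum: more chunks (harder puzzles) later on."""
    frac = epoch / max(total_epochs - 1, 1)
    return min_groups + round(frac * (max_groups - min_groups))

rng = random.Random(0)
spectrum = list(range(16))  # stand-in for 16 spectral bands of one pixel
shuffled, perm = spectral_jigsaw(spectrum, groups_for_epoch(3, 10), rng)
print(perm, shuffled)
```

A spatial jigsaw would apply the same idea to image patches rather than band groups, which is one way to read the "decoupled" design described in the abstract.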

[101] Intelligent Vacuum Thermoforming Process

Andi Kuswoyo,Christos Margadji,Sebastian W. Pattinson

Main category: cs.CV

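TL;DR: The paper proposes a vision-based quality control system for optimising vacuum thermoforming process parameters. A k-Nearest Neighbour algorithm adjusts the parameters to reduce defects and improve production efficiency.

Details Motivation: Variations in material properties and tooling configurations make consistent quality hard to achieve in vacuum thermoforming, so an efficient way to optimise process parameters is needed.

Contribution: Proposes a vision-based quality control system that predicts and optimises process parameters with minimal data requirements, improving part quality.

Method: Builds a dataset from visual data and image augmentation, then uses a k-Nearest Neighbour algorithm to map low-quality parts to high-quality counterparts and derive parameter adjustments.

Result: The model performed strongly in adjusting heating power, heating time, and vacuum time, reducing defects and improving production efficiency.

Insight: A vision-based approach combined with k-Nearest Neighbour can optimise vacuum thermoforming and suggests a route for optimising similar manufacturing processes.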

Abstract: Ensuring consistent quality in vacuum thermoforming presents challenges due to variations in material properties and tooling configurations. This research introduces a vision-based quality control system to predict and optimise process parameters, thereby enhancing part quality with minimal data requirements. A comprehensive dataset was developed using visual data from vacuum-formed samples subjected to various process parameters, supplemented by image augmentation techniques to improve model training. A k-Nearest Neighbour algorithm was subsequently employed to identify adjustments needed in process parameters by mapping low-quality parts to their high-quality counterparts. The model exhibited strong performance in adjusting heating power, heating time, and vacuum time to reduce defects and improve production efficiency.
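The k-Nearest Neighbour mapping from a low-quality part to high-quality counterparts might look like the sketch below. The feature vectors, parameter names, units, and the averaging rule are all hypothetical; the abstract does not describe how image features are extracted or how neighbours are combined.

```python
import math

def nearest_good_parts(query_features, good_parts, k=3):
    """Return the k high-quality parts closest to the query in feature space."""
    return sorted(
        good_parts,
        key=lambda part: math.dist(query_features, part["features"]),
    )[:k]

def suggest_adjustment(bad_part, good_parts, k=3):
    """Average the neighbours' process parameters and subtract the bad part's own."""
    neighbours = nearest_good_parts(bad_part["features"], good_parts, k)
    suggestion = {}
    for name, value in bad_part["params"].items():
        mean = sum(n["params"][name] for n in neighbours) / len(neighbours)
        suggestion[name] = mean - value
    return suggestion

# Toy data (hypothetical): 2-D image features and three process parameters.
good_parts = [
    {"features": [0.1, 0.2],
     "params": {"heating_power": 80.0, "heating_time": 30.0, "vacuum_time": 5.0}},
    {"features": [0.2, 0.1],
     "params": {"heating_power": 90.0, "heating_time": 40.0, "vacuum_time": 7.0}},
    {"features": [5.0, 5.0],
     "params": {"heating_power": 50.0, "heating_time": 10.0, "vacuum_time": 1.0}},
]
bad_part = {"features": [0.0, 0.0],
            "params": {"heating_power": 70.0, "heating_time": 25.0, "vacuum_time": 4.0}}
print(suggest_adjustment(bad_part, good_parts, k=2))
```

The returned values are signed deltas, so a positive `heating_time` means "heat longer" relative to the settings that produced the defective part.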

[102] ResidualViT for Efficient Temporally Dense Video Encoding

Mattia Soldan,Fabian Caba Heilbron,Bernard Ghanem,Josef Sivic,Bryan Russell

Main category: cs.CV

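TL;DR: Proposes ResidualViT, an architecture that uses residual connections and a token reduction module to handle temporally dense video tasks efficiently. Combined with a lightweight distillation strategy, it lowers computational cost and speeds up inference while closely approximating the original model's accuracy.

Details Motivation: Temporally dense video tasks require frame-level features at high temporal resolution, which is computationally expensive; the redundancy between frames and the resulting inefficiency need to be addressed.

Contribution: 1. Proposes the ResidualViT architecture, which uses residual connections and a token reduction module to cut computation; 2. a lightweight distillation strategy that approximates the features of the original model; 3. evaluation across several tasks and datasets.

Method: Learnable residual connections keep features temporally consistent, a token reduction module selectively discards redundant information, the weights of a pretrained model are reused, and a lightweight distillation strategy is applied.

Result: Computational cost falls by up to 60% and inference is up to 2.5x faster, with accuracy close to that of the original model.

Insight: Temporal redundancy in video can be exploited: residual connections and token reduction yield large efficiency gains.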

Abstract: Several video understanding tasks, such as natural language temporal video grounding, temporal activity localization, and audio description generation, require “temporally dense” reasoning over frames sampled at high temporal resolution. However, computing frame-level features for these tasks is computationally expensive given the temporal resolution requirements. In this paper, we make three contributions to reduce the cost of computing features for temporally dense tasks. First, we introduce a vision transformer (ViT) architecture, dubbed ResidualViT, that leverages the large temporal redundancy in videos to efficiently compute temporally dense frame-level features. Our architecture incorporates (i) learnable residual connections that ensure temporal consistency across consecutive frames and (ii) a token reduction module that enhances processing speed by selectively discarding temporally redundant information while reusing weights of a pretrained foundation model. Second, we propose a lightweight distillation strategy to approximate the frame-level features of the original foundation model. Finally, we evaluate our approach across four tasks and five datasets, in both zero-shot and fully supervised settings, demonstrating significant reductions in computational cost (up to 60%) and improvements in inference speed (up to 2.5x faster), all while closely approximating the accuracy of the original foundation model.

[103] RadGame: An AI-Powered Platform for Radiology Education

Mohammed Baharoon,Siavash Raissi,John S. Jun,Thibault Heintz,Mahmoud Alabbad,Ali Alburkani,Sung Eun Kim,Kent Kleinschmidt,Abdulrahman O. Alhumaydhi,Mohannad Mohammed G. Alghamdi,Jeremy Francis Palacio,Mohammed Bukhaytan,Noah Michael Prudlo,Rithvik Akula,Brady Chrisler,Benjamin Galligos,Mohammed O. Almutairi,Mazeen Mohammed Alanazi,Nasser M. Alrashdi,Joel Jihwan Hwang,Sri Sai Dinesh Jaliparthi,Luke David Nelson,Nathaniel Nguyen,Sathvik Suryadevara,Steven Kim,Mohammed F. Mohammed,Yevgeniy R. Semenov,Kun-Hsing Yu,Abdulrhman Aljouie,Hassan AlOmaish,Adam Rodman,Pranav Rajpurkar

Main category: cs.CV

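TL;DR: RadGame is an AI-powered, gamified platform for radiology education. Automated feedback improves learners' localization and report-writing skills, with larger gains than traditional methods.

Details Motivation: Traditional radiology training lacks immediate, scalable feedback. RadGame aims to address this through AI and gamification.

Contribution: 1. Proposes a radiology education platform that combines gamification with AI; 2. delivers training through public datasets and AI feedback; 3. improves learners' localization and report-writing accuracy.

Method: 1. RadGame Localize: learners draw bounding boxes around abnormalities, which are compared with annotations from public datasets, and visual explanations are generated for missed findings; 2. RadGame Report: learners write reports, and the AI gives structured feedback based on radiology report generation metrics.

Result: With RadGame, localization accuracy improved by 68% (versus 17% with traditional methods) and report-writing accuracy improved by 31% (versus 4% with traditional methods).

Insight: AI-driven gamified education can improve training outcomes and opens a new route for medical education.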

Abstract: We introduce RadGame, an AI-powered gamified platform for radiology education that targets two core skills: localizing findings and generating reports. Traditional radiology training is based on passive exposure to cases or active practice with real-time input from supervising radiologists, limiting opportunities for immediate and scalable feedback. RadGame addresses this gap by combining gamification with large-scale public datasets and automated, AI-driven feedback that provides clear, structured guidance to human learners. In RadGame Localize, players draw bounding boxes around abnormalities, which are automatically compared to radiologist-drawn annotations from public datasets, and visual explanations are generated by vision-language models for user missed findings. In RadGame Report, players compose findings given a chest X-ray, patient age and indication, and receive structured AI feedback based on radiology report generation metrics, highlighting errors and omissions compared to a radiologist’s written ground truth report from public datasets, producing a final performance and style score. In a prospective evaluation, participants using RadGame achieved a 68% improvement in localization accuracy compared to 17% with traditional passive methods and a 31% improvement in report-writing accuracy compared to 4% with traditional methods after seeing the same cases. RadGame highlights the potential of AI-driven gamification to deliver scalable, feedback-rich radiology training and reimagines the application of medical AI resources in education.
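One common way to compare a player's bounding boxes with reference annotations is intersection over union (IoU). The sketch below assumes IoU with a 0.5 threshold; the abstract does not state which overlap measure or threshold RadGame Localize uses, and the function names are illustrative.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])
    # Clamp at zero so disjoint boxes give an empty intersection.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def missed_findings(player_boxes, reference_boxes, threshold=0.5):
    """Reference boxes that no player box overlaps with IoU >= threshold."""
    return [ref for ref in reference_boxes
            if all(iou(p, ref) < threshold for p in player_boxes)]

player = [(0, 0, 2, 2)]
reference = [(0, 0, 2, 2), (10, 10, 12, 12)]
print(missed_findings(player, reference))
```

The boxes returned by `missed_findings` would be the ones for which a feedback step, such as the visual explanations mentioned in the abstract, is triggered.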

[104] Image Realness Assessment and Localization with Multimodal Features

Lovish Kaushik,Agnij Biswas,Somdyuti Paul

Main category: cs.CV

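TL;DR: The paper proposes a framework that assesses the realness of AI-generated images and localizes visually inconsistent regions, using multimodal features to improve realness prediction.

Details Motivation: Assessing realness and identifying local inconsistencies in AI-generated images matters both for their practical use and for improving generative models.

Contribution: Proposes a framework built on multimodal features that assesses overall image realness and localizes inconsistent regions.

Method: Vision-language models trained on large datasets generate textual descriptions of visual inconsistencies, which stand in for human annotations and drive both realness prediction and dense realness map generation.

Result: Experiments show improved realness prediction and dense realness maps that distinguish realistic from unrealistic regions.

Insight: Multimodal features that combine vision and language are more effective for assessing and localizing realness problems in AI-generated images.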

Abstract: A reliable method of quantifying the perceptual realness of AI-generated images and identifying visually inconsistent regions is crucial for practical use of AI-generated images and for improving photorealism of generative AI via realness feedback during training. This paper introduces a framework that accomplishes both overall objective realness assessment and local inconsistency identification of AI-generated images using textual descriptions of visual inconsistencies generated by vision-language models trained on large datasets that serve as reliable substitutes for human annotations. Our results demonstrate that the proposed multimodal approach improves objective realness prediction performance and produces dense realness maps that effectively distinguish between realistic and unrealistic spatial regions.

[105] StyleSculptor: Zero-Shot Style-Controllable 3D Asset Generation with Texture-Geometry Dual Guidance

Zefan Qu,Zhenwei Wang,Haoyuan Wang,Ke Xu,Gerhard Hancke,Rynson W. H. Lau

Main category: cs.CV

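TL;DR: StyleSculptor is a training-free, zero-shot method that generates style-controllable 3D assets under texture-geometry dual guidance. Its core module, Style Disentangled Attention (SD-Attn), dynamically fuses features from the content and style images.

Details Motivation: In practical applications such as games and virtual reality, 3D assets often need to match the style of existing ones. Earlier methods struggle with fine-grained style control, which StyleSculptor aims to provide.

Contribution: 1. Proposes a zero-shot method for style-controllable 3D generation; 2. designs the SD-Attn module for dynamic feature fusion and style disentanglement; 3. introduces the Style Guided Control (SGC) mechanism, which supports texture-only or geometry-only stylization.

Method: 1. The SD-Attn module fuses content and style features through a cross-3D attention mechanism; 2. a style-disentangled feature selection strategy separates style-significant from content-significant channels; 3. the SGC mechanism gives flexible control over style intensity and type.

Result: Experiments show that StyleSculptor outperforms baseline methods in producing high-fidelity 3D assets and supports fine-grained control over texture, geometry, or both.

Insight: Dynamic attention combined with a disentanglement strategy enables effective style control and suggests a new direction for 3D generation.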

Abstract: Creating 3D assets that follow the texture and geometry style of existing ones is often desirable or even inevitable in practical applications like video gaming and virtual reality. While impressive progress has been made in generating 3D objects from text or images, creating style-controllable 3D assets remains a complex and challenging problem. In this work, we propose StyleSculptor, a novel training-free approach for generating style-guided 3D assets from a content image and one or more style images. Unlike previous works, StyleSculptor achieves style-guided 3D generation in a zero-shot manner, enabling fine-grained 3D style control that captures the texture, geometry, or both styles of user-provided style images. At the core of StyleSculptor is a novel Style Disentangled Attention (SD-Attn) module, which establishes a dynamic interaction between the input content image and style image for style-guided 3D asset generation via a cross-3D attention mechanism, enabling stable feature fusion and effective style-guided generation. To alleviate semantic content leakage, we also introduce a style-disentangled feature selection strategy within the SD-Attn module, which leverages the variance of 3D feature patches to disentangle style- and content-significant channels, allowing selective feature injection within the attention framework. With SD-Attn, the network can dynamically compute texture-, geometry-, or both-guided features to steer the 3D generation process. Built upon this, we further propose the Style Guided Control (SGC) mechanism, which enables exclusive geometry- or texture-only stylization, as well as adjustable style intensity control. Extensive experiments demonstrate that StyleSculptor outperforms existing baseline methods in producing high-fidelity 3D assets.

[106] 3D Aware Region Prompted Vision Language Model

An-Chieh Cheng,Yang Fu,Yukang Chen,Zhijian Liu,Xiaolong Li,Subhashree Radhakrishnan,Song Han,Yao Lu,Jan Kautz,Pavlo Molchanov,Hongxu Yin,Xiaolong Wang,Sifei Liu

Main category: cs.CV

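TL;DR: SR-3D is a vision-language model that connects 2D images and 3D data through a shared visual token space, enabling flexible region prompting without multi-frame labeling. It enriches 2D features with 3D positional embeddings and performs strongly on both 2D and 3D benchmarks.

Details Motivation: Existing vision-language models connect 2D and 3D data only weakly, and annotating 3D data requires exhaustive multi-frame labeling. SR-3D aims to solve this and make 3D spatial understanding more efficient.

Contribution: 1. Proposes the SR-3D model, which supports flexible region prompting (2D boxes, segmentation masks, or direct 3D annotation). 2. Enriches 2D features with 3D positional embeddings to improve spatial reasoning across frames.

Method: SR-3D unifies 2D and 3D data in a shared visual token space and enriches 2D visual features with 3D positional embeddings, enabling spatial reasoning across frames, including when the objects of interest do not appear in the same view.

Result: Reaches state-of-the-art performance on 2D and 3D benchmarks and applies to in-the-wild videos without 3D inputs or annotations, where it accurately infers spatial relationships and metric measurements.

Insight: SR-3D shows that 2D and 3D representation spaces can be unified, offering a more efficient route to scene understanding that also supports cross-frame reasoning in real-world settings.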

Abstract: We present Spatial Region 3D (SR-3D) aware vision-language model that connects single-view 2D images and multi-view 3D data through a shared visual token space. SR-3D supports flexible region prompting, allowing users to annotate regions with bounding boxes, segmentation masks on any frame, or directly in 3D, without the need for exhaustive multi-frame labeling. We achieve this by enriching 2D visual features with 3D positional embeddings, which allows the 3D model to draw upon strong 2D priors for more accurate spatial reasoning across frames, even when objects of interest do not co-occur within the same view. Extensive experiments on both general 2D vision language and specialized 3D spatial benchmarks demonstrate that SR-3D achieves state-of-the-art performance, underscoring its effectiveness for unifying 2D and 3D representation space on scene understanding. Moreover, we observe applicability to in-the-wild videos without sensory 3D inputs or ground-truth 3D annotations, where SR-3D accurately infers spatial relationships and metric measurements.

cs.HC [Back]

[107] Textarium: Entangling Annotation, Abstraction and Argument

Philipp Proff,Marian Dörk

Main category: cs.HC

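TL;DR: Textarium is a web-based environment that supports text interpretation by connecting annotation, abstraction, and argumentation. It offers a visual interface for scholarly reading and writing that combines human analysis with lightweight computation, bridging close and distant reading.

Details Motivation: The digital humanities lack effective tools for making text interpretation transparent and shareable, especially in scholarly reading and writing. Textarium aims to fill this gap by giving technical support to complex interpretive work.

Contribution: Textarium's main contribution is a reading-writing approach that combines annotation, abstraction, and argumentation and renders interpretive actions as parameterized visualization states, making the process transparent and shareable.

Method: An interactive interface developed through a speculative design process of co-creative and iterative prototyping. Users can highlight text, group keywords into concepts, and embed observations as anchors in essays, supported by lightweight computational processing.

Result: A web-based tool that supports interpretive reading and writing in scholarly work and makes the process more transparent and shareable.

Insight: Combining visualization with technical support can make text interpretation more transparent, giving digital humanities research new tools and methods.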

Abstract: We present a web-based environment that connects annotation, abstraction, and argumentation during the interpretation of text. As a visual interface for scholarly reading and writing, Textarium combines human analysis with lightweight computational processing to bridge close and distant reading practices. Readers can highlight text, group keywords into concepts, and embed these observations as anchors in essays. The interface renders these interpretive actions as parameterized visualization states. Through a speculative design process of co-creative and iterative prototyping, we developed a reading-writing approach that makes interpretive processes transparent and shareable within digital narratives.

cs.AI [Back]

[108] Small Models, Big Results: Achieving Superior Intent Extraction through Decomposition

Danielle Cohen,Yoni Halpern,Noam Kahlon,Joel Oren,Omri Berkovitch,Sapir Caduri,Ido Dagan,Anatoly Efros

Main category: cs.AI

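TL;DR: The paper proposes a decomposed approach that combines structured interaction summarization with fine-tuned intent extraction. It improves intent understanding in small models for resource-constrained settings, even surpassing the base performance of large models.

Details Motivation: Current MLLMs are capable but cannot run efficiently on-device, which raises privacy, cost, and latency concerns. Small models perform poorly at intent understanding, which limits their use.

Contribution: Proposes a two-stage decomposed approach: 1) structured interaction summarization that captures key information; 2) intent extraction from the summaries. The approach improves the performance of small models.

Method: 1. Structured interaction summarization; 2. a model fine-tuned to extract intent from the aggregated summaries.

Result: Intent understanding in small models improves, even surpassing the base performance of large MLLMs.

Insight: With task decomposition and structured information processing, small models can match or exceed large models in resource-constrained settings.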

Abstract: Understanding user intents from UI interaction trajectories remains a challenging, yet crucial, frontier in intelligent agent development. While massive, datacenter-based, multi-modal large language models (MLLMs) possess greater capacity to handle the complexities of such sequences, smaller models which can run on-device to provide a privacy-preserving, low-cost, and low-latency user experience, struggle with accurate intent inference. We address these limitations by introducing a novel decomposed approach: first, we perform structured interaction summarization, capturing key information from each user action. Second, we perform intent extraction using a fine-tuned model operating on the aggregated summaries. This method improves intent understanding in resource-constrained models, even surpassing the base performance of large MLLMs.

[109] Zero-shot Graph Reasoning via Retrieval Augmented Framework with LLMs

Hanqing Li,Kiran Sheena Jyothi,Henry Liang,Sharika Mahadevan,Diego Klabjan

Main category: cs.AI

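TL;DR: The paper proposes GRRAF, a training-free method that combines retrieval-augmented generation (RAG) with the code-generation capabilities of large language models (LLMs) to solve a range of graph reasoning tasks.

Details Motivation: Existing methods require extensive finetuning or depend on predefined algorithms. GRRAF uses a retrieval-augmented framework to avoid these limitations.

Contribution: GRRAF achieves high-accuracy graph reasoning without training, supports many tasks, and scales to large graphs.

Method: GRRAF stores the target graph in a graph database and prompts the LLM to generate executable code queries that retrieve the needed information, with an error feedback loop and a time-out mechanism.

Result: On the GraphInstruct dataset, GRRAF reaches 100% accuracy on most graph reasoning tasks and scales to large graphs with up to 10,000 nodes.

Insight: GRRAF shows the potential of LLMs for complex graph reasoning while avoiding the training cost of earlier approaches.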

Abstract: We propose a new, training-free method, Graph Reasoning via Retrieval Augmented Framework (GRRAF), that harnesses retrieval-augmented generation (RAG) alongside the code-generation capabilities of large language models (LLMs) to address a wide range of graph reasoning tasks. In GRRAF, the target graph is stored in a graph database, and the LLM is prompted to generate executable code queries that retrieve the necessary information. This approach circumvents the limitations of existing methods that require extensive finetuning or depend on predefined algorithms, and it incorporates an error feedback loop with a time-out mechanism to ensure both correctness and efficiency. Experimental evaluations on the GraphInstruct dataset reveal that GRRAF achieves 100% accuracy on most graph reasoning tasks, including cycle detection, bipartite graph checks, shortest path computation, and maximum flow, while maintaining consistent token costs regardless of graph sizes. Imperfect but still very high performance is observed on subgraph matching. Notably, GRRAF scales effectively to large graphs with up to 10,000 nodes.
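The error feedback loop with a time-out might be structured as in the sketch below. The callables `generate_code` and `run_query`, the budget values, and the two stand-in stubs are hypothetical; they replace a real LLM call and a real graph database and are not GRRAF's actual interfaces.

```python
import time

def answer_with_feedback(question, generate_code, run_query,
                         max_seconds=30.0, max_attempts=5):
    """Ask for a query, run it, and feed any error back until success or time-out."""
    deadline = time.monotonic() + max_seconds
    feedback = None
    for attempt in range(max_attempts):
        if time.monotonic() > deadline:
            break
        code = generate_code(question, feedback)
        try:
            return run_query(code)
        except Exception as err:
            # The error text becomes part of the next prompt.
            feedback = f"Attempt {attempt + 1} failed: {err}"
    raise TimeoutError("no working query within the time or attempt budget")

def fake_llm(question, feedback):
    # Stand-in for an LLM call: the first draft is wrong, the retry is right.
    return "BAD QUERY" if feedback is None else "COUNT NODES"

def fake_db(code):
    # Stand-in for a graph database that understands a single command.
    if code != "COUNT NODES":
        raise ValueError(f"cannot parse {code!r}")
    return 4

print(answer_with_feedback("How many nodes are in the graph?", fake_llm, fake_db))
```

Because the graph stays in the database and only the query and its result pass through the model, a design like this is consistent with the abstract's claim that token cost does not grow with graph size.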

[110] V-Math: An Agentic Approach to the Vietnamese National High School Graduation Mathematics Exams

Duong Q. Nguyen,Quy P. Nguyen,Nguyen Van Nhon,Quang-Thinh Bui,H. Nguyen-Xuan

Main category: cs.AI

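TL;DR: The paper presents V-Math, an autonomous agentic framework that helps Vietnamese high school students prepare for the National High School Graduation Mathematics Exams. The system integrates three AI agents: a specification-matrix-conditioned question generator, a solver/explainer that gives detailed step-by-step solutions, and a personalized tutor that adapts content to student performance.

Details Motivation: The national graduation mathematics exam is critical for students in Vietnam, but traditional preparation lacks personalized, efficient tools, and teachers struggle to produce high-quality questions quickly. V-Math aims to address these problems with AI, making preparation more efficient and equitable.

Contribution: V-Math's main contributions are: (1) a framework integrating three AI agents; (2) support for self-paced student practice and teacher question-bank generation; (3) a preliminary evaluation of the system's ability to generate compliant questions, solve them accurately, and add variety.

Method: The system uses a multi-agent architecture: the question generator produces questions from a specification matrix; the solver/explainer provides step-by-step reasoning; the personalized tutor adapts content to student performance. Teachers can use the system to build diverse question banks.

Result: Preliminary evaluations show that V-Math produces matrix-aligned exams with high solution accuracy and coherent explanations, and increases the variety of practice materials.

Insight: A combination of AI agents can support teaching and exam preparation efficiently while reducing teacher workload. Possible future directions include other subjects and more complex learning scenarios.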

Abstract: This paper develops an autonomous agentic framework called V-Math that aims to assist Vietnamese high school students in preparing for the National High School Graduation Mathematics Exams (NHSGMEs). The salient framework integrates three specialized AI agents: a specification-matrix-conditioned question generator, a solver/explainer for detailed step-by-step reasoning, and a personalized tutor that adapts to student performance. Beyond enabling self-paced student practice, V-Math supports teachers by generating innovative, compliant exam questions and building diverse, high-quality question banks. This reduces manual workload and enriches instructional resources. We describe the system architecture, focusing on practice modes for learners and teacher-oriented features for question generation. Preliminary evaluations demonstrate that V-Math produces matrix-aligned exams with high solution accuracy, delivers coherent explanations, and enhances the variety of practice materials. These results highlight its potential to support scalable, equitable mathematics preparation aligned with national standards while also empowering teachers through AI-assisted exam creation.

[111] HLSMAC: A New StarCraft Multi-Agent Challenge for High-Level Strategic Decision-Making

Xingxing Hong,Yungong Wang,Dexin Jin,Ye Yuan,Ximing Huang,Zijian Wu,Wenxin Li

Main category: cs.AI

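TL;DR: HLSMAC is a new multi-agent reinforcement learning benchmark built on StarCraft II. It focuses on evaluating high-level strategic decision-making, addressing the fact that existing benchmarks such as SMAC mainly test micromanagement.

Details Motivation: Existing multi-agent reinforcement learning benchmarks such as SMAC focus primarily on micromanagement and do not evaluate high-level strategic decision-making, which limits analysis of multi-agent systems in complex strategic environments.

Contribution: Proposes HLSMAC, a new StarCraft II-based MARL benchmark with 12 designed scenarios that test high-level strategic decision-making, together with novel evaluation metrics across multiple dimensions.

Method: Designs 12 scenarios based on the Thirty-Six Stratagems, each corresponding to one stratagem; proposes multi-dimensional metrics, including ability utilization and advancement efficiency; integrates state-of-the-art MARL algorithms and LLM-based agents for experiments.

Result: The experiments show that HLSMAC serves as a robust testbed for evaluating multi-agent high-level strategic decision-making.

Insight: High-level strategic decision-making is central to how multi-agent systems perform in complex environments and calls for dedicated benchmarks and metrics.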

Abstract: Benchmarks are crucial for assessing multi-agent reinforcement learning (MARL) algorithms. While StarCraft II-related environments have driven significant advances in MARL, existing benchmarks like SMAC focus primarily on micromanagement, limiting comprehensive evaluation of high-level strategic intelligence. To address this, we introduce HLSMAC, a new cooperative MARL benchmark with 12 carefully designed StarCraft II scenarios based on classical stratagems from the Thirty-Six Stratagems. Each scenario corresponds to a specific stratagem and is designed to challenge agents with diverse strategic elements, including tactical maneuvering, timing coordination, and deception, thereby opening up avenues for evaluating high-level strategic decision-making capabilities. We also propose novel metrics across multiple dimensions beyond conventional win rate, such as ability utilization and advancement efficiency, to assess agents’ overall performance within the HLSMAC environment. We integrate state-of-the-art MARL algorithms and LLM-based agents with our benchmark and conduct comprehensive experiments. The results demonstrate that HLSMAC serves as a robust testbed for advancing multi-agent strategic decision-making.

[112] Simulating Clinical AI Assistance using Multimodal LLMs: A Case Study in Diabetic Retinopathy

Nadim Barakat,William Lotter

Main category: cs.AI

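TL;DR: The paper studies how multimodal large language models (MLLMs) perform at diabetic retinopathy (DR) detection and how different output formats affect simulated clinical AI assistance. MedGemma showed higher sensitivity than GPT-4o at baseline, while GPT-4o performed strongly when guided by MedGemma's descriptive outputs. Descriptive outputs may improve clinician trust and utility.

Details Motivation: Diabetic retinopathy is a leading cause of blindness worldwide. Current FDA-cleared AI systems primarily provide binary outputs, which may limit clinical trust and utility. The study explores how MLLMs can help assess which output formats best support clinical AI assistance.

Contribution: 1. Evaluates MLLMs for DR detection; 2. simulates the effect of different AI outputs on clinical assistance; 3. shows the potential of MLLMs as scalable simulators, and of open, lightweight models in low-resource settings.

Method: 1. Tests GPT-4o and MedGemma on two datasets (IDRiD and Messidor-2); 2. runs a baseline evaluation, simulated AI assistance with synthetic predictions, and actual AI-to-AI collaboration; 3. analyzes sensitivity, specificity, and AUROC.

Result: MedGemma outperformed GPT-4o at baseline, with higher sensitivity and AUROC, while GPT-4o showed near-perfect specificity but low sensitivity. In the collaboration experiment, GPT-4o reached an AUROC of up to 0.96 when guided by MedGemma's descriptive outputs.

Insight: Descriptive outputs could improve explainability and clinician trust, and open, lightweight models such as MedGemma may be especially valuable in low-resource settings.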

Abstract: Diabetic retinopathy (DR) is a leading cause of blindness worldwide, and AI systems can expand access to fundus photography screening. Current FDA-cleared systems primarily provide binary referral outputs, where this minimal output may limit clinical trust and utility. Yet, determining the most effective output format to enhance clinician-AI performance is an empirical challenge that is difficult to assess at scale. We evaluated multimodal large language models (MLLMs) for DR detection and their ability to simulate clinical AI assistance across different output types. Two models were tested on IDRiD and Messidor-2: GPT-4o, a general-purpose MLLM, and MedGemma, an open-source medical model. Experiments included: (1) baseline evaluation, (2) simulated AI assistance with synthetic predictions, and (3) actual AI-to-AI collaboration where GPT-4o incorporated MedGemma outputs. MedGemma outperformed GPT-4o at baseline, achieving higher sensitivity and AUROC, while GPT-4o showed near-perfect specificity but low sensitivity. Both models adjusted predictions based on simulated AI inputs, but GPT-4o’s performance collapsed with incorrect ones, whereas MedGemma remained more stable. In actual collaboration, GPT-4o achieved strong results when guided by MedGemma’s descriptive outputs, even without direct image access (AUROC up to 0.96). These findings suggest MLLMs may improve DR screening pipelines and serve as scalable simulators for studying clinical AI assistance across varying output configurations. Open, lightweight models such as MedGemma may be especially valuable in low-resource settings, while descriptive outputs could enhance explainability and clinician trust in clinical workflows.
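The trade-off between sensitivity and specificity reported for the two models follows directly from the confusion-matrix definitions. The sketch below uses made-up labels purely to show how a model can have perfect specificity yet low sensitivity; none of the numbers come from the paper.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels, where 1 = referable DR.

    Assumes both classes are present in y_true.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)

# Toy example (hypothetical): a cautious model that rarely flags disease.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0]
print(sensitivity_specificity(y_true, y_pred))  # (0.25, 1.0)
```

A screening model with this profile never raises a false alarm but misses three of the four diseased cases, which is why sensitivity is usually the more important figure for screening.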

cs.RO [Back]

[113] The Better You Learn, The Smarter You Prune: Towards Efficient Vision-language-action Models via Differentiable Token Pruning

Titong Jiang,Xuefeng Jiang,Yuan Ma,Xin Wen,Bailin Li,Kun Zhan,Peng Jia,Yahui Liu,Sheng Sun,Xianpeng Lang

Main category: cs.RO

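TL;DR: LightVLA is a differentiable visual token pruning framework. It evaluates token importance with dynamic queries and uses Gumbel softmax for differentiable selection, improving both the efficiency and the performance of VLA models.

Details Motivation: Deploying VLA models on resource-constrained platforms is bottlenecked by attention computation over large sets of visual tokens, so an efficient, performance-driven pruning method is needed.

Contribution: Proposes the LightVLA framework, which the authors describe as the first to apply adaptive visual token pruning to VLA tasks, with no heuristic hyperparameters and no additional trainable parameters.

Method: Dynamic queries evaluate token importance, and Gumbel softmax makes token selection differentiable, so the model keeps informative tokens and prunes redundant ones.

Result: On the LIBERO benchmark, LightVLA reduces FLOPs by 59.1% and latency by 38.2%, with a 2.9% improvement in task success rate.

Insight: In pursuing optimal performance, LightVLA spontaneously learns to prune tokens from a performance-driven perspective, giving real-time robotic systems an efficient, practical option.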

Abstract: We present LightVLA, a simple yet effective differentiable token pruning framework for vision-language-action (VLA) models. While VLA models have shown impressive capability in executing real-world robotic tasks, their deployment on resource-constrained platforms is often bottlenecked by the heavy attention-based computation over large sets of visual tokens. LightVLA addresses this challenge through adaptive, performance-driven pruning of visual tokens: It generates dynamic queries to evaluate visual token importance, and adopts Gumbel softmax to enable differentiable token selection. Through fine-tuning, LightVLA learns to preserve the most informative visual tokens while pruning tokens which do not contribute to task execution, thereby improving efficiency and performance simultaneously. Notably, LightVLA requires no heuristic magic numbers and introduces no additional trainable parameters, making it compatible with modern inference frameworks. Experimental results demonstrate that LightVLA outperforms different VLA models and existing token pruning methods across diverse tasks on the LIBERO benchmark, achieving higher success rates with substantially reduced computational overhead. Specifically, LightVLA reduces FLOPs and latency by 59.1% and 38.2% respectively, with a 2.9% improvement in task success rate. Meanwhile, we also investigate the learnable query-based token pruning method LightVLA* with additional trainable parameters, which also achieves satisfactory performance. Our work reveals that as VLA pursues optimal performance, LightVLA spontaneously learns to prune tokens from a performance-driven perspective. To the best of our knowledge, LightVLA is the first work to apply adaptive visual token pruning to VLA tasks with the collateral goals of efficiency and performance, marking a significant step toward more efficient, powerful and practical real-time robotic systems.
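To make the Gumbel softmax step concrete, the sketch below samples soft selection weights over visual tokens and keeps the tokens that at least one query votes for. The "one vote per query, keep the union" rule, the score matrix, and the function names are assumptions for illustration; gradients and the straight-through trick used in real training are outside the scope of this plain-Python example.

```python
import math
import random

def gumbel_softmax(logits, tau, rng):
    """Sample soft one-hot weights: softmax((logits + Gumbel noise) / tau)."""
    noisy = []
    for logit in logits:
        u = max(rng.random(), 1e-12)        # guard against log(0)
        gumbel = -math.log(-math.log(u))    # standard Gumbel(0, 1) sample
        noisy.append((logit + gumbel) / tau)
    peak = max(noisy)                       # subtract max for numerical stability
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]

def select_tokens(scores, tau=1.0, rng=None):
    """Each query row votes for one token; the union of votes is kept.

    With rng=None no noise is added, giving a deterministic, inference-style pass.
    """
    kept = set()
    for row in scores:
        weights = gumbel_softmax(row, tau, rng) if rng is not None else row
        kept.add(max(range(len(row)), key=lambda j: weights[j]))
    return sorted(kept)

# Hypothetical query-to-token scores: 3 queries over 4 visual tokens.
scores = [[0.1, 2.0, 0.3, 0.0],
          [0.0, 1.5, 0.2, 0.1],
          [0.0, 0.1, 0.2, 3.0]]
print(select_tokens(scores))                              # noise-free selection
print(select_tokens(scores, tau=0.5, rng=random.Random(0)))  # noisy, training-style
```

Under this rule the number of kept tokens is not fixed in advance; it depends on how many distinct tokens the queries agree on, which is one way pruning can be adaptive without a hand-set keep ratio.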

[114] Neural 3D Object Reconstruction with Small-Scale Unmanned Aerial Vehicles

Àlmos Veres-Vitàlyos,Genis Castillo Gomez-Raya,Filip Lemic,Daniel Johannes Bugelnig,Bernhard Rinner,Sergi Abadal,Xavier Costa-Pérez

Main category: cs.RO

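TL;DR: The paper proposes a system architecture for high-quality 3D reconstruction with small UAVs. A dual-reconstruction pipeline creates a real-time feedback loop between data capture and flight control, improving reconstruction quality.

Details Motivation: Payload and autonomy constraints have limited the use of small UAVs for complex tasks such as high-quality 3D reconstruction. This work aims to overcome that limitation.

Contribution: The main contributions are: 1) a dual-reconstruction pipeline (near-real-time and offline); 2) dynamic trajectory adaptation that improves data capture; 3) high-accuracy 3D reconstruction that combines N3DR with UWB data.

Method: 1) Near-real-time SfM generates a point cloud; 2) the system analyzes model quality on the fly and adapts the UAV's trajectory; 3) offline, N3DR fuses SfM-derived camera poses with UWB location data for the detailed reconstruction.

Result: Experiments in both single- and multi-UAV configurations show that dynamic trajectory adaptation consistently improves reconstruction quality over static flight paths.

Insight: The work shows that small UAVs can achieve high-quality 3D reconstruction in constrained environments, a capability previously limited to much larger platforms.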

Abstract: Small Unmanned Aerial Vehicles (UAVs) exhibit immense potential for navigating indoor and hard-to-reach areas, yet their significant constraints in payload and autonomy have largely prevented their use for complex tasks like high-quality 3-Dimensional (3D) reconstruction. To overcome this challenge, we introduce a novel system architecture that enables fully autonomous, high-fidelity 3D scanning of static objects using UAVs weighing under 100 grams. Our core innovation lies in a dual-reconstruction pipeline that creates a real-time feedback loop between data capture and flight control. A near-real-time (near-RT) process uses Structure from Motion (SfM) to generate an instantaneous pointcloud of the object. The system analyzes the model quality on the fly and dynamically adapts the UAV’s trajectory to intelligently capture new images of poorly covered areas. This ensures comprehensive data acquisition. For the final, detailed output, a non-real-time (non-RT) pipeline employs a Neural Radiance Fields (NeRF)-based Neural 3D Reconstruction (N3DR) approach, fusing SfM-derived camera poses with precise Ultra Wide-Band (UWB) location data to achieve superior accuracy. We implemented and validated this architecture using Crazyflie 2.1 UAVs. Our experiments, conducted in both single- and multi-UAV configurations, conclusively show that dynamic trajectory adaptation consistently improves reconstruction quality over static flight paths. This work demonstrates a scalable and autonomous solution that unlocks the potential of miniaturized UAVs for fine-grained 3D reconstruction in constrained environments, a capability previously limited to much larger platforms.

[115] ActiveVLN: Towards Active Exploration via Multi-Turn RL in Vision-and-Language Navigation

Zekai Zhang,Weiye Zhu,Hewei Pan,Xiangchen Wang,Rongtao Xu,Xing Sun,Feng Zheng

Main category: cs.RO

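TL;DR: ActiveVLN is an active-exploration framework for Vision-and-Language Navigation (VLN) based on multi-turn reinforcement learning (RL). It addresses the reliance of existing methods on imitation learning and expert trajectories, and improves RL efficiency with a dynamic early-stopping (pruning) strategy.

Details Motivation: Existing VLN methods rely mainly on imitation learning (IL), which is costly and lacks active exploration. RL is promising, but prior VLN RL methods depend on expert trajectories for reward shaping and do not achieve open-ended exploration. ActiveVLN aims to enable active exploration through multi-turn RL.

Contribution: 1. The ActiveVLN framework, which enables active exploration through multi-turn RL; 2. a dynamic early-stopping strategy that improves RL efficiency; 3. performance competitive with state-of-the-art methods using a smaller model.

Method: 1. Bootstrap the agent with IL on a small fraction of expert trajectories; 2. in multi-turn RL, iteratively predict and execute actions, collect diverse trajectories, and optimize multiple rollouts with the GRPO objective; 3. prune long-tail or likely failed trajectories through dynamic early stopping.

Result: ActiveVLN achieves larger gains over IL baselines than DAgger-based and prior RL-based post-training methods, and reaches performance competitive with state-of-the-art approaches despite using a smaller model.

Insight: Active exploration with RL can reduce the dependence of VLN on expert trajectories and improve the diversity and efficiency of navigation.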

Abstract: The Vision-and-Language Navigation (VLN) task requires an agent to follow natural language instructions and navigate through complex environments. Existing MLLM-based VLN methods primarily rely on imitation learning (IL) and often use DAgger for post-training to mitigate covariate shift. While effective, these approaches incur substantial data collection and training costs. Reinforcement learning (RL) offers a promising alternative. However, prior VLN RL methods lack dynamic interaction with the environment and depend on expert trajectories for reward shaping, rather than engaging in open-ended active exploration. This restricts the agent’s ability to discover diverse and plausible navigation routes. To address these limitations, we propose ActiveVLN, a VLN framework that explicitly enables active exploration through multi-turn RL. In the first stage, a small fraction of expert trajectories is used for IL to bootstrap the agent. In the second stage, the agent iteratively predicts and executes actions, automatically collects diverse trajectories, and optimizes multiple rollouts via the GRPO objective. To further improve RL efficiency, we introduce a dynamic early-stopping strategy to prune long-tail or likely failed trajectories, along with additional engineering optimizations. Experiments show that ActiveVLN achieves the largest performance gains over IL baselines compared to both DAgger-based and prior RL-based post-training methods, while reaching competitive performance with state-of-the-art approaches despite using a smaller model. Code and data will be released soon.
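
The abstract states that multiple rollouts are optimized "via the GRPO objective" without giving details. As a minimal sketch of the group-relative advantage that GRPO is generally built on, the snippet below normalizes each rollout's reward against the other rollouts of the same episode. The reward values, the function name `group_relative_advantages`, and the use of the population standard deviation are illustrative assumptions, not the authors' implementation.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each rollout's reward against its own group of rollouts."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations vary
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical success rewards for four rollouts of one navigation instruction.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Rollouts that beat their group's average receive a positive advantage and the others a negative one, so no separate value network is needed to form a baseline.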

[116] Unleashing the Power of Discrete-Time State Representation: Ultrafast Target-based IMU-Camera Spatial-Temporal Calibration

Junlin Song,Antoine Richard,Miguel Olivares-Mendez

Main category: cs.RO

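TL;DR: The paper proposes an extremely efficient IMU-camera spatial-temporal calibration method based on discrete-time state representation. It avoids the high computational cost of conventional continuous-time calibration and addresses the weakness of discrete-time states in temporal calibration.

Details Motivation: Visual-inertial fusion is crucial for intelligent applications such as robot navigation and augmented reality, but existing calibration methods usually adopt a continuous-time state representation (e.g., B-splines), which is computationally expensive. As drones, cellphones, and similar devices proliferate, an efficient calibration method can save substantial time across large device fleets.

Contribution: 1. An efficient IMU-camera spatial-temporal calibration method based on discrete-time state representation; 2. a remedy for the weakness of discrete-time states in temporal calibration.

Method: Replace the conventional continuous-time state representation (e.g., B-splines) with a discrete-time one to cut computational cost, while compensating for the shortcomings of discrete-time states in temporal calibration.

Result: The method greatly improves calibration efficiency while maintaining high accuracy, saving considerable time when calibrating devices at scale.

Insight: Discrete-time state representation can be highly efficient for calibration, and targeted optimization can overcome its shortcomings, yielding a practical solution for real-world visual-inertial fusion.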

Abstract: Visual-inertial fusion is crucial for a large number of intelligent and autonomous applications, such as robot navigation and augmented reality. To bootstrap and achieve optimal state estimation, the spatial-temporal displacements between IMU and cameras must be calibrated in advance. Most existing calibration methods adopt continuous-time state representation, more specifically the B-spline. Although these methods achieve precise spatial-temporal calibration, they suffer from the high computational cost caused by continuous-time state representation. To this end, we propose a novel and extremely efficient calibration method that unleashes the power of discrete-time state representation. Moreover, the weakness of discrete-time state representation in temporal calibration is tackled in this paper. With the increasing production of drones, cellphones and other visual-inertial platforms, if one million devices need calibration around the world, saving one minute for the calibration of each device means saving 2083 work days in total. To benefit both the research and industry communities, our code will be open-source.
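
The figure of 2083 work days can be reproduced with a short calculation. The abstract does not say how long a work day is; the sketch below assumes 8 hours, which is the value that matches the stated total.

```python
devices = 1_000_000
minutes_saved_per_device = 1
hours_per_work_day = 8  # assumption: not stated in the abstract

total_hours = devices * minutes_saved_per_device / 60
work_days = total_hours / hours_per_work_day
print(int(work_days))  # 2083
```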

cs.LG [Back]

[117] Similarity-Distance-Magnitude Activations

Allen Schmaltz

Main category: cs.LG

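TL;DR: The paper proposes SDM (Similarity-Distance-Magnitude), a more robust and interpretable activation function. By adding similarity and distance awareness to the standard softmax, it becomes more robust to covariate shift and out-of-distribution inputs and offers interpretability by exemplar.

Details Motivation: The standard softmax is not robust enough to covariate shift and out-of-distribution inputs in high-probability regions, and it lacks interpretability. The author aims to address both by adding similarity and distance awareness.

Contribution: The SDM activation function, which combines similarity, distance, and magnitude awareness to improve robustness and interpretability, and is particularly suited to selective classification.

Method: Extend softmax with similarity and distance-to-training-distribution information to form the SDM activation, and provide exemplar-level explanations through dense matching.

Result: SDM is more robust than softmax under covariate shift and out-of-distribution inputs, and remains preferable for selective classification even when post-hoc calibration methods are applied to softmax.

Insight: Combining similarity and distance information improves both the robustness and the interpretability of the activation function, and supplies richer context for the model's decision boundary.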

Abstract: We introduce a more robust and interpretable formulation of the standard softmax activation function commonly used with neural networks by adding Similarity (i.e., correctly predicted depth-matches into training) awareness and Distance-to-training-distribution awareness to the existing output Magnitude (i.e., decision-boundary) awareness. When used as the final-layer activation with language models, the resulting Similarity-Distance-Magnitude (SDM) activation function is more robust than the softmax function to co-variate shifts and out-of-distribution inputs in high-probability regions, and provides interpretability-by-exemplar via dense matching. Complementing the prediction-conditional estimates, the SDM activation enables a partitioning of the class-wise empirical CDFs to guard against low class-wise recall among selective classifications. These properties make it preferable for selective classification, even when considering post-hoc calibration methods over the softmax.

[118] When Inverse Data Outperforms: Exploring the Pitfalls of Mixed Data in Multi-Stage Fine-Tuning

Mengyi Deng,Xin Li,Tingyu Zhu,Zhicheng Yang,Zhijiang Guo,Wei Wang

Main category: cs.LG

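TL;DR: The paper examines the pitfalls of mixed data in multi-stage fine-tuning. It builds r1k, a high-quality reverse reasoning dataset, and analyzes how SFT and DPO affect alignment under bidirectional reasoning objectives. Naively mixing data weakens the directional distinction, while DPO may suppress less preferred reasoning paths.

Details Motivation: Existing methods focus mainly on unidirectional supervised fine-tuning (SFT) and overlook the interplay between diverse reasoning patterns. The paper investigates the problems that mixed data can cause in multi-stage fine-tuning and their effect on alignment.

Contribution: The paper constructs the high-quality reverse reasoning dataset r1k, shows experimentally that SFT on r1k outperforms SFT on the original s1k dataset, and reveals that mixed data introduce conflicting supervision signals.

Method: Build r1k by inverting the 1,000 forward reasoning examples in s1k, then compare SFT and DPO under bidirectional reasoning objectives.

Result: SFT on r1k improves accuracy by 1.6%–6.8% over s1k, but mixing forward and reverse data weakens the directional distinction. DPO partially recovers this distinction but may also suppress less preferred reasoning paths.

Insight: Mixed reasoning data can introduce conflicting supervision signals, which calls for more robust and direction-aware alignment strategies.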

Abstract: Existing work has shown that o1-level performance can be achieved with limited data distillation, but most existing methods focus on unidirectional supervised fine-tuning (SFT), overlooking the intricate interplay between diverse reasoning patterns. In this paper, we construct r1k, a high-quality reverse reasoning dataset derived by inverting 1,000 forward examples from s1k, and examine how SFT and Direct Preference Optimization (DPO) affect alignment under bidirectional reasoning objectives. SFT on r1k yields a 1.6%–6.8% accuracy improvement over s1k across evaluated benchmarks. However, naively mixing forward and reverse data during SFT weakens the directional distinction. Although DPO can partially recover this distinction, it also suppresses less preferred reasoning paths by shifting the probability mass toward irrelevant outputs. These findings suggest that mixed reasoning data introduce conflicting supervision signals, underscoring the need for robust and direction-aware alignment strategies.

[119] WebSailor-V2: Bridging the Chasm to Proprietary Agents via Synthetic Data and Scalable Reinforcement Learning

Kuan Li,Zhongwang Zhang,Huifeng Yin,Rui Ye,Yida Zhao,Liwen Zhang,Litu Ou,Dingchu Zhang,Xixi Wu,Jialong Wu,Xinyu Wang,Zile Qiao,Zhen Zhang,Yong Jiang,Pengjun Xie,Fei Huang,Jingren Zhou

Main category: cs.LG

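TL;DR: The paper proposes WebSailor-V2, which uses synthetic data and scalable reinforcement learning to improve open-source agents on complex information-seeking tasks and narrow the gap to proprietary agents.

Details Motivation: Proprietary agents such as DeepResearch show superhuman capability on complex information-seeking tasks, while open-source models lack comparable systematic reasoning ability. The paper aims to close this gap with synthetic data and reinforcement learning.

Contribution: WebSailor-V2, comprising a method for generating high-uncertainty tasks, an RFT cold-start strategy, and the efficient reinforcement learning algorithm DUPO, which together substantially improve open-source agent performance.

Method: Generate high-uncertainty tasks through structured sampling and information obfuscation, then train with an RFT cold start followed by the DUPO algorithm.

Result: WebSailor-V2 outperforms all open-source agents on complex information-seeking tasks and reaches performance comparable to proprietary agents.

Insight: Systematic reasoning under uncertainty is key to narrowing the gap between proprietary and open-source agents, and combining synthetic data with reinforcement learning is an effective way to build it.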

Abstract: Transcending human cognitive limitations represents a critical frontier in LLM training. Proprietary agentic systems like DeepResearch have demonstrated superhuman capabilities on extremely complex information-seeking benchmarks such as BrowseComp, a feat previously unattainable. We posit that their success hinges on a sophisticated reasoning pattern absent in open-source models: the ability to systematically reduce extreme uncertainty when navigating vast information landscapes. Based on this insight, we introduce WebSailor, a complete post-training methodology designed to instill this crucial capability. Our approach involves generating novel, high-uncertainty tasks through structured sampling and information obfuscation, RFT cold start, and an efficient agentic RL training algorithm, Duplicating Sampling Policy Optimization (DUPO). With this integrated pipeline, WebSailor significantly outperforms all open-source agents in complex information-seeking tasks, matching proprietary agents’ performance and closing the capability gap.

[120] Flexible Multimodal Neuroimaging Fusion for Alzheimer’s Disease Progression Prediction

Benjamin Burns,Yuan Xue,Douglas W. Scharre,Xia Ning

Main category: cs.LG

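TL;DR: The paper proposes PerM-MoE, which uses an independent router for each modality to make multimodal models more flexible when modalities are missing, applied to Alzheimer's disease progression prediction.

Details Motivation: Alzheimer's disease progression varies widely between patients, and existing multimodal models lose prediction accuracy when modalities are missing, which limits clinical use.

Contribution: PerM-MoE, which improves model flexibility under missing modalities through independent per-modality routers.

Method: A sparse Mixture-of-Experts framework with an independent router for each modality in place of the conventional single router.

Result: On ADNI data, PerM-MoE outperforms the state-of-the-art Flex-MoE in most modality-missingness settings and uses its experts more effectively.

Insight: Independent routers help multimodal models handle missing modalities flexibly and improve robustness.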

Abstract: Alzheimer’s disease (AD) is a progressive neurodegenerative disease with high inter-patient variance in rate of cognitive decline. AD progression prediction aims to forecast patient cognitive decline and benefits from incorporating multiple neuroimaging modalities. However, existing multimodal models fail to make accurate predictions when many modalities are missing during inference, as is often the case in clinical settings. To increase multimodal model flexibility under high modality missingness, we introduce PerM-MoE, a novel sparse mixture-of-experts method that uses independent routers for each modality in place of the conventional, single router. Using T1-weighted MRI, FLAIR, amyloid beta PET, and tau PET neuroimaging data from the Alzheimer’s Disease Neuroimaging Initiative (ADNI), we evaluate PerM-MoE, state-of-the-art Flex-MoE, and unimodal neuroimaging models on predicting two-year change in Clinical Dementia Rating-Sum of Boxes (CDR-SB) scores under varying levels of modality missingness. PerM-MoE outperforms the state of the art in most variations of modality missingness and demonstrates more effective utility of experts than Flex-MoE.
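
To make the per-modality routing idea concrete, here is a minimal sketch under stated assumptions. Each modality has its own router scores over a shared set of experts, top-k gating is applied per modality, and a missing modality is simply skipped. The modality keys, the score values, the choice k=2, and all function names are hypothetical; in the paper the routers would be learned, and their exact form is not given in the abstract.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(scores, k):
    """Return {expert_index: weight} for the k highest-scoring experts."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    weights = softmax([scores[i] for i in top])
    return dict(zip(top, weights))


def per_modality_routing(router_scores, k=2):
    """router_scores maps modality name -> per-expert scores, or None if missing."""
    return {
        modality: route_top_k(scores, k)
        for modality, scores in router_scores.items()
        if scores is not None  # a missing modality is never routed
    }


# Hypothetical router outputs for one patient with 4 experts; tau PET is missing.
scores = {
    "t1_mri": [2.0, 0.1, -1.0, 0.5],
    "flair": [0.0, 1.5, 1.5, -2.0],
    "amyloid_pet": [-0.3, 0.2, 0.9, 3.0],
    "tau_pet": None,
}
routing = per_modality_routing(scores)
```

Because each modality is gated independently, dropping one modality leaves the expert assignments of the remaining modalities unchanged, which is the flexibility the abstract attributes to independent routers.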

[121] InJecteD: Analyzing Trajectories and Drift Dynamics in Denoising Diffusion Probabilistic Models for 2D Point Cloud Generation

Sanyam Jain,Khuram Naveed,Illia Oleksiienko,Alexandros Iosifidis,Ruben Pauwels

Main category: cs.LG

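TL;DR: InJecteD is a framework for analyzing the trajectories and drift dynamics of DDPMs in 2D point cloud generation. By quantifying trajectory properties it increases model transparency and helps practitioners debug and refine generative models.

Details Motivation: Study the trajectories and drift dynamics of DDPMs in 2D point cloud generation to improve model interpretability and support human-AI collaboration.

Contribution: The InJecteD framework, which analyzes trajectory properties with statistical metrics and reveals distinct phases of the denoising process as well as dataset-specific behaviors.

Method: Use a simplified DDPM architecture, quantify trajectory properties such as displacement, velocity, and clustering, and evaluate different embedding and noise-schedule configurations.

Result: Experiments reveal three phases of the denoising process and show that Fourier-based embeddings improve trajectory stability and reconstruction quality.

Insight: Interpretability is essential for debugging and refining generative models, and dataset-specific behaviors offer a new perspective on model design.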

Abstract: This work introduces InJecteD, a framework for interpreting Denoising Diffusion Probabilistic Models (DDPMs) by analyzing sample trajectories during the denoising process of 2D point cloud generation. We apply this framework to three datasets from the Datasaurus Dozen (bullseye, dino, and circle) using a simplified DDPM architecture with customizable input and time embeddings. Our approach quantifies trajectory properties, including displacement, velocity, clustering, and drift field dynamics, using statistical metrics such as Wasserstein distance and cosine similarity. By enhancing model transparency, InJecteD supports human-AI collaboration by enabling practitioners to debug and refine generative models. Experiments reveal distinct denoising phases: initial noise exploration, rapid shape formation, and final refinement, with dataset-specific behaviors, for example, the bullseye's concentric convergence vs. the dino's complex contour formation. We evaluate four model configurations, varying embeddings and noise schedules, demonstrating that Fourier-based embeddings improve trajectory stability and reconstruction quality.
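
As a rough illustration of the kind of trajectory statistics the abstract lists, the sketch below computes net displacement, per-step speed, and the cosine similarity between consecutive steps for one 2D point. The function name, the toy path, and the exact metric definitions are assumptions for illustration and may differ from those used in the paper.

```python
import math


def trajectory_metrics(path):
    """path: list of (x, y) positions of one point across denoising steps."""
    steps = [(x1 - x0, y1 - y0) for (x0, y0), (x1, y1) in zip(path, path[1:])]
    speeds = [math.hypot(dx, dy) for dx, dy in steps]
    displacement = math.hypot(path[-1][0] - path[0][0], path[-1][1] - path[0][1])
    # Cosine similarity between consecutive steps: 1 means the direction is unchanged.
    cosines = []
    for (ax, ay), (bx, by), na, nb in zip(steps, steps[1:], speeds, speeds[1:]):
        if na > 0 and nb > 0:
            cosines.append((ax * bx + ay * by) / (na * nb))
    return displacement, speeds, cosines


# Hypothetical trajectory: two large aligned steps, then a small turn.
path = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0), (6.0, 9.0)]
displacement, speeds, cosines = trajectory_metrics(path)
```

A drop in speed together with a change in step direction, as in the last step of this toy path, is the sort of signature one might associate with a transition from shape formation to refinement.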

[122] iCD: An Implicit Clustering Distillation Method for Structural Information Mining

Xiang Xue,Yatu Ji,Qing-dao-er-ji Ren,Bao Shi,Min Lu,Nier Wu,Xufei Zhuang,Haiteng Xu,Gan-qi-qi-ge Cha

Main category: cs.LG

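TL;DR: iCD is a clustering distillation method that requires no feature alignment. It mines interpretable structural knowledge using Gram matrices over decoupled local logit representations and notably improves fine-grained classification.

Details Motivation: Logit knowledge distillation is simple to use, but its decision-making process lacks interpretability, which limits further application. iCD addresses this by mining structural information.

Contribution: Implicit Clustering Distillation (iCD), which mines and transfers interpretable structural knowledge without ground-truth labels or feature-space alignment and improves model performance.

Method: Decouple local logit representations and use Gram matrices over them so that the student model learns latent semantic structural patterns.

Result: Experiments on benchmark datasets confirm the effectiveness of iCD, especially on fine-grained classification, with a peak improvement of +5.08% over the baseline.

Insight: iCD offers interpretable distillation without complex alignment and shows the value of structural information when training student models.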

Abstract: Logit Knowledge Distillation has gained substantial research interest in recent years due to its simplicity and lack of requirement for intermediate feature alignment; however, it suffers from limited interpretability in its decision-making process. To address this, we propose implicit Clustering Distillation (iCD): a simple and effective method that mines and transfers interpretable structural knowledge from logits, without requiring ground-truth labels or feature-space alignment. iCD leverages Gram matrices over decoupled local logit representations to enable student models to learn latent semantic structural patterns. Extensive experiments on benchmark datasets demonstrate the effectiveness of iCD across diverse teacher-student architectures, with particularly strong performance in fine-grained classification tasks – achieving a peak improvement of +5.08% over the baseline. The code is available at: https://github.com/maomaochongaa/iCD.
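
The abstract describes Gram matrices over decoupled local logit representations but gives no formula. The following sketch shows one plausible reading: split each logit vector into fixed-size chunks, build a sample-by-sample Gram matrix per chunk for teacher and student, and penalize their squared difference. The chunking scheme, the mean-squared loss, and all names are assumptions, not the authors' definition. Note that no ground-truth labels appear anywhere in the computation.

```python
def gram(rows):
    """Gram matrix G[i][j] = <rows[i], rows[j]> over a batch of vectors."""
    return [[sum(a * b for a, b in zip(u, v)) for v in rows] for u in rows]


def split_logits(logits, chunk):
    """Split every logit vector into local chunks (an assumed form of decoupling)."""
    width = len(logits[0])
    return [[row[i:i + chunk] for row in logits] for i in range(0, width, chunk)]


def gram_distillation_loss(teacher_logits, student_logits, chunk=2):
    """Mean squared difference between teacher and student Gram matrices."""
    total, count = 0.0, 0
    for t_part, s_part in zip(split_logits(teacher_logits, chunk),
                              split_logits(student_logits, chunk)):
        for row_t, row_s in zip(gram(t_part), gram(s_part)):
            for a, b in zip(row_t, row_s):
                total += (a - b) ** 2
                count += 1
    return total / count


# Hypothetical logits for a batch of two samples and four classes.
teacher = [[1.0, 0.0, 2.0, 1.0], [0.0, 1.0, 1.0, 2.0]]
student_same = [[1.0, 0.0, 2.0, 1.0], [0.0, 1.0, 1.0, 2.0]]
student_off = [[0.0, 1.0, 2.0, 1.0], [0.0, 1.0, 1.0, 2.0]]

loss_same = gram_distillation_loss(teacher, student_same)
loss_off = gram_distillation_loss(teacher, student_off)
```

The Gram matrix compares samples with each other rather than with labels, so matching it transfers the relational structure of the teacher's outputs to the student.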

[123] Tool-R1: Sample-Efficient Reinforcement Learning for Agentic Tool Use

Yabo Zhang,Yihan Zeng,Qingyun Li,Zhen Hu,Kavin Han,Wangmeng Zuo

Main category: cs.LG

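TL;DR: Tool-R1 is a reinforcement learning framework that helps large language models (LLMs) complete multi-step tool-use tasks by generating executable Python code, improving accuracy and robustness.

Details Motivation: LLMs are strong at language understanding and reasoning but remain limited on real-world tasks that require up-to-date knowledge, precise operations, or specialized tools. Tool use can extend their capabilities, yet existing methods perform poorly on complex multi-step tasks.

Contribution: The Tool-R1 framework, which uses reinforcement learning to let LLMs generate executable code for general, compositional, and multi-step tool use; an outcome-based reward function and a dynamic sample queue that improve training efficiency.

Method: 1. A reinforcement learning framework that generates Python code to carry out tool use; 2. a reward function combining LLM-based answer judgment with code execution success; 3. a dynamic sample queue that caches high-quality trajectories to reduce online sampling overhead.

Result: On the GAIA benchmark, Tool-R1 substantially improves accuracy and robustness, gaining about 10% over strong baselines, with larger improvements on complex multi-step tasks.

Insight: Tool-R1 shows the potential of reinforcement learning for improving LLM tool use and points toward reliable and efficient tool-augmented reasoning in real applications.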

Abstract: Large language models (LLMs) have demonstrated strong capabilities in language understanding and reasoning, yet they remain limited when tackling real-world tasks that require up-to-date knowledge, precise operations, or specialized tool use. To address this, we propose Tool-R1, a reinforcement learning framework that enables LLMs to perform general, compositional, and multi-step tool use by generating executable Python code. Tool-R1 supports integration of user-defined tools and standard libraries, with variable sharing across steps to construct coherent workflows. An outcome-based reward function, combining LLM-based answer judgment and code execution success, guides policy optimization. To improve training efficiency, we maintain a dynamic sample queue to cache and reuse high-quality trajectories, reducing the overhead of costly online sampling. Experiments on the GAIA benchmark show that Tool-R1 substantially improves both accuracy and robustness, achieving about 10% gain over strong baselines, with larger improvements on complex multi-step tasks. These results highlight the potential of Tool-R1 for enabling reliable and efficient tool-augmented reasoning in real-world applications. Our code will be available at https://github.com/YBYBZhang/Tool-R1.
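
The two training components named in the abstract, an outcome-based reward and a dynamic sample queue, can be sketched as follows. This is only an illustration under assumptions: the equal weighting of the two reward signals, the reward threshold, the first-in-first-out eviction, and the names `outcome_reward` and `SampleQueue` are all hypothetical and not taken from the Tool-R1 code.

```python
import random
from collections import deque


def outcome_reward(answer_judged_correct, code_ran_ok):
    """Assumed equal-weight mix of LLM answer judgment and execution success."""
    return 0.5 * float(answer_judged_correct) + 0.5 * float(code_ran_ok)


class SampleQueue:
    """Bounded cache of high-reward trajectories for reuse during training."""

    def __init__(self, capacity, min_reward):
        self.items = deque(maxlen=capacity)  # oldest entries are evicted first
        self.min_reward = min_reward

    def add(self, trajectory, reward):
        if reward >= self.min_reward:
            self.items.append((trajectory, reward))

    def sample(self, k, rng=random):
        return rng.sample(list(self.items), min(k, len(self.items)))


queue = SampleQueue(capacity=2, min_reward=0.5)
queue.add("rollout-a", outcome_reward(True, True))    # kept
queue.add("rollout-b", outcome_reward(False, False))  # rejected: reward too low
queue.add("rollout-c", outcome_reward(False, True))   # kept
queue.add("rollout-d", outcome_reward(True, True))    # kept, evicts rollout-a
batch = queue.sample(2)
```

Reusing cached trajectories trades some freshness of the training data for fewer expensive online rollouts, which is the efficiency argument made in the abstract.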

eess.IV [Back]

[124] Enhancing Radiographic Disease Detection with MetaCheX, a Context-Aware Multimodal Model

Nathan He,Cody Chen

Main category: eess.IV

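TL;DR: MetaCheX integrates chest X-ray images with patient metadata and improves both the accuracy and the fairness of disease detection.

Details Motivation: Existing deep learning models neglect patient metadata, which limits diagnostic accuracy and fairness.

Contribution: MetaCheX, which combines images and metadata to improve diagnostic performance and reduce algorithmic bias.

Method: Combine a CNN image backbone with metadata processed by an MLP, and fuse the two modalities through a shared classifier.

Result: On the CheXpert Plus dataset, MetaCheX outperforms radiograph-only models across multiple CNN architectures, with a higher AUROC.

Insight: Metadata helps improve model generalizability and reduce bias, bringing the model closer to clinical decision-making.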

Abstract: Existing deep learning models for chest radiology often neglect patient metadata, limiting diagnostic accuracy and fairness. To bridge this gap, we introduce MetaCheX, a novel multimodal framework that integrates chest X-ray images with structured patient metadata to replicate clinical decision-making. Our approach combines a convolutional neural network (CNN) backbone with metadata processed by a multilayer perceptron through a shared classifier. Evaluated on the CheXpert Plus dataset, MetaCheX consistently outperformed radiograph-only baseline models across multiple CNN architectures. By integrating metadata, the overall diagnostic accuracy was significantly improved, measured by an increase in AUROC. The results of this study demonstrate that metadata reduces algorithmic bias and enhances model generalizability across diverse patient populations. MetaCheX advances clinical artificial intelligence toward robust, context-aware radiographic disease detection.

[125] DinoAtten3D: Slice-Level Attention Aggregation of DinoV2 for 3D Brain MRI Anomaly Classification

Fazle Rafsani,Jay Shah,Catherine D. Chong,Todd J. Schwedt,Teresa Wu

Main category: eess.IV

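TL;DR: The paper proposes DinoAtten3D, an attention-based method for 3D medical image anomaly classification. It extracts features with the pretrained DINOv2 model and assigns adaptive weights to 2D axial slices through a soft attention mechanism. A composite loss combining supervised contrastive learning and class-variance regularization addresses data scarcity and class imbalance.

Details Motivation: Anomaly detection and classification in medical imaging are critical for early diagnosis, but limited annotated data, class imbalance, and the high cost of expert labeling make the problem challenging.

Contribution: 1) An attention-driven global aggregation framework tailored to 3D medical image anomaly classification; 2) use of the self-supervised DINOv2 model as a feature extractor; 3) a composite loss function combining supervised contrastive learning with class-variance regularization.

Method: 1) Extract features from 2D axial slices with DINOv2; 2) assign adaptive slice-level weights through soft attention; 3) optimize the model with the composite loss.

Result: The framework shows strong anomaly classification performance on the ADNI dataset and an institutional headache cohort despite limited data and significant class imbalance.

Insight: Pretrained 2D foundation models combined with attention-based slice aggregation can make volumetric anomaly detection in medical imaging more robust.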

Abstract: Anomaly detection and classification in medical imaging are critical for early diagnosis but remain challenging due to limited annotated data, class imbalance, and the high cost of expert labeling. Emerging vision foundation models such as DINOv2, pretrained on extensive, unlabeled datasets, offer generalized representations that can potentially alleviate these limitations. In this study, we propose an attention-based global aggregation framework tailored specifically for 3D medical image anomaly classification. Leveraging the self-supervised DINOv2 model as a pretrained feature extractor, our method processes individual 2D axial slices of brain MRIs, assigning adaptive slice-level importance weights through a soft attention mechanism. To further address data scarcity, we employ a composite loss function combining supervised contrastive learning with class-variance regularization, enhancing inter-class separability and intra-class consistency. We validate our framework on the ADNI dataset and an institutional multi-class headache cohort, demonstrating strong anomaly classification performance despite limited data availability and significant class imbalance. Our results highlight the efficacy of utilizing pretrained 2D foundation models combined with attention-based slice aggregation for robust volumetric anomaly detection in medical imaging. Our implementation is publicly available at https://github.com/Rafsani/DinoAtten3D.git.
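
The slice-level soft attention described above can be sketched in a few lines. Each slice embedding gets a scalar score, the scores pass through a softmax, and the volume representation is the weighted sum of slice embeddings. The linear scoring vector, the two-dimensional toy features, and the function names are assumptions; the abstract does not specify the form of the attention module.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(slice_features, scoring_vector):
    """Weight each slice embedding by a softmax over its score, then sum."""
    scores = [sum(f * w for f, w in zip(feat, scoring_vector))
              for feat in slice_features]
    weights = softmax(scores)
    dim = len(slice_features[0])
    pooled = [sum(w * feat[d] for w, feat in zip(weights, slice_features))
              for d in range(dim)]
    return pooled, weights


# Hypothetical 2-D embeddings for three axial slices of one MRI volume.
slices = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
uniform_pooled, uniform_weights = attention_pool(slices, scoring_vector=[0.0, 0.0])
focused_pooled, focused_weights = attention_pool(slices, scoring_vector=[5.0, 5.0])
```

With a zero scoring vector every slice gets the same weight and the result is plain mean pooling; a non-zero scoring vector shifts weight toward the slices it scores highest, here the third one.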

[126] DeepEyeNet: Generating Medical Report for Retinal Images

Jia-Hong Huang

Main category: eess.IV

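TL;DR: The thesis presents DeepEyeNet, an AI-driven system that automatically generates medical reports for retinal images, aimed at easing the shortage of ophthalmologists.

Details Motivation: Retinal diseases are increasingly prevalent while ophthalmologists are in short supply, and manual report writing is slow and error-prone. AI automation can improve diagnostic efficiency and reduce doctors' workloads.

Contribution: 1) A multi-modal deep learning approach that captures interactions between images and textual keywords; 2) improved representation of medical keywords; 3) strategies that overcome the limitations of RNNs in capturing long-range dependencies; 4) techniques that enhance the interpretability of the AI system.

Method: Combine multi-modal deep learning with improved keyword representation, address the limitations of RNNs on long sequences, and apply interpretability techniques to foster clinical acceptance.

Result: The proposed methods achieve state-of-the-art performance across various evaluation metrics.

Insight: Automated medical report generation can improve clinical efficiency, diagnostic accuracy, and patient care, provided that technical limitations and clinical trust are addressed.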

Abstract: The increasing prevalence of retinal diseases poses a significant challenge to the healthcare system, as the demand for ophthalmologists surpasses the available workforce. This imbalance creates a bottleneck in diagnosis and treatment, potentially delaying critical care. Traditional methods of generating medical reports from retinal images rely on manual interpretation, which is time-consuming and prone to errors, further straining ophthalmologists’ limited resources. This thesis investigates the potential of Artificial Intelligence (AI) to automate medical report generation for retinal images. AI can quickly analyze large volumes of image data, identifying subtle patterns essential for accurate diagnosis. By automating this process, AI systems can greatly enhance the efficiency of retinal disease diagnosis, reducing doctors’ workloads and enabling them to focus on more complex cases. The proposed AI-based methods address key challenges in automated report generation: (1) A multi-modal deep learning approach captures interactions between textual keywords and retinal images, resulting in more comprehensive medical reports; (2) Improved methods for medical keyword representation enhance the system’s ability to capture nuances in medical terminology; (3) Strategies to overcome RNN-based models’ limitations, particularly in capturing long-range dependencies within medical descriptions; (4) Techniques to enhance the interpretability of the AI-based report generation system, fostering trust and acceptance in clinical practice. These methods are rigorously evaluated using various metrics and achieve state-of-the-art performance. This thesis demonstrates AI’s potential to revolutionize retinal disease diagnosis by automating medical report generation, ultimately improving clinical efficiency, diagnostic accuracy, and patient care.

[127] MEGAN: Mixture of Experts for Robust Uncertainty Estimation in Endoscopy Videos

Damola Agbelese,Krishna Chaitanya,Pushpak Pati,Chaitanya Parmar,Pooya Mobadersany,Shreyas Fadnavis,Lindsey Surace,Shadi Yarandi,Louis R. Ghanem,Molly Lucas,Tommaso Mansi,Oana Gabriela Cula,Pablo F. Damasceno,Kristopher Standish

Main category: eess.IV

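TL;DR: MEGAN is a Multi-Expert Gating Network that combines several expert models based on Evidential Deep Learning (EDL), improving prediction confidence and calibration for ulcerative colitis (UC) severity estimation in endoscopy videos.

Details Motivation: Uncertainty quantification methods in medical AI usually rely on a single expert's annotations and overlook the inter-rater variability common in healthcare. MEGAN addresses this by combining annotations from multiple experts and multiple modeling strategies.

Contribution: MEGAN, a multi-expert gating network that combines the predictions and uncertainty estimates of multiple EDL models, improving both performance and calibration.

Method: A gating network optimally combines the outputs of multiple EDL experts, each trained with different ground-truth annotations and modeling strategies. Experiments target UC disease severity estimation.

Result: In a UC clinical trial, MEGAN improves F1-score by 3.5% and reduces ECE by 30.5% compared with existing methods, and enables uncertainty-guided sample stratification.

Insight: In medical AI, combining multi-expert annotations and modeling strategies can improve model performance and uncertainty quantification while reducing the annotation burden.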

Abstract: Reliable uncertainty quantification (UQ) is essential in medical AI. Evidential Deep Learning (EDL) offers a computationally efficient way to quantify model uncertainty alongside predictions, unlike traditional methods such as Monte Carlo (MC) Dropout and Deep Ensembles (DE). However, all these methods often rely on a single expert’s annotations as ground truth for model training, overlooking the inter-rater variability in healthcare. To address this issue, we propose MEGAN, a Multi-Expert Gating Network that aggregates uncertainty estimates and predictions from multiple AI experts via EDL models trained with diverse ground truths and modeling strategies. MEGAN’s gating network optimally combines predictions and uncertainties from each EDL model, enhancing overall prediction confidence and calibration. We extensively benchmark MEGAN on endoscopy videos for ulcerative colitis (UC) disease severity estimation, assessed by visual labeling of Mayo Endoscopic Subscore (MES), where inter-rater variability is prevalent. In a large-scale prospective UC clinical trial, MEGAN achieved a 3.5% improvement in F1-score and a 30.5% reduction in Expected Calibration Error (ECE) compared to existing methods. Furthermore, MEGAN facilitated uncertainty-guided sample stratification, reducing the annotation burden and potentially increasing efficiency and consistency in UC trials.
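
Since the headline calibration result is reported as a reduction in Expected Calibration Error, a short sketch of the usual binned ECE may help. It groups predictions into equal-width confidence bins and averages the gap between accuracy and mean confidence, weighted by bin size. The bin count of 10 and the toy predictions are assumptions; the abstract does not state the binning used in the paper.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE = sum over bins of (bin size / N) * |accuracy - mean confidence|."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[index].append((conf, ok))
    n = len(confidences)
    ece = 0.0
    for members in bins:
        if members:
            avg_conf = sum(c for c, _ in members) / len(members)
            accuracy = sum(1 for _, ok in members if ok) / len(members)
            ece += len(members) / n * abs(accuracy - avg_conf)
    return ece


# Hypothetical predictions: two confident and correct, two uncertain with one error.
confidences = [0.95, 0.95, 0.55, 0.55]
correct = [True, True, True, False]
ece = expected_calibration_error(confidences, correct)
```

In this toy case each bin is off by 0.05 and holds half of the samples, so the ECE is 0.05; a lower value means that stated confidence tracks observed accuracy more closely.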

cs.SI [Back]

[128] Podcasts as a Medium for Participation in Collective Action: A Case Study of Black Lives Matter

Theodora Moldovan,Arianna Pera,Davide Vega,Luca Maria Aiello

Main category: cs.SI

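TL;DR: Using the BLM movement as a case study, the paper examines how podcasts express participation in collective action, extends this line of research to audio formats, and analyzes how emotions relate to stages of activism.

Details Motivation: Prior research on collective action has focused mainly on text-based content, while podcasts as an audio medium remain underexplored. This study takes a first step by analyzing spoken expressions of participation and their emotional dimensions in podcast transcripts.

Contribution: 1. Extends collective action research to audio media; 2. provides an emotion analysis of participatory statements in podcast discussions; 3. identifies associations between emotions and stages of activism.

Method: Using the SPoRC corpus, extract podcast transcripts, classify participatory statements with a layered framework (problem-solution, call-to-action, intention, and execution), and detect eight key emotions and how they vary across stages of activism.

Result: Emotional profiles vary by stage, with different positive emotions standing out during calls-to-action, intention, and execution. Negative emotions are negatively associated with collective action, contrary to theoretical expectations.

Insight: Emotional expression in podcasts may depend on the medium, which offers a new perspective on participation in social movements in spoken digital discourse.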

Abstract: We study how participation in collective action is articulated in podcast discussions, using the Black Lives Matter (BLM) movement as a case study. While research on collective action discourse has primarily focused on text-based content, this study takes a first step toward analyzing audio formats by using podcast transcripts. Using the Structured Podcast Research Corpus (SPoRC), we investigated spoken language expressions of participation in collective action, categorized as problem-solution, call-to-action, intention, and execution. We identified podcast episodes discussing racial justice after important BLM-related events in May and June of 2020, and extracted participatory statements using a layered framework adapted from prior work on social media. We examined the emotional dimensions of these statements, detecting eight key emotions and their association with varying stages of activism. We found that emotional profiles vary by stage, with different positive emotions standing out during calls-to-action, intention, and execution. We detected negative associations between collective action and negative emotions, contrary to theoretical expectations. Our work contributes to a better understanding of how activism is expressed in spoken digital discourse and how emotional framing may depend on the format of the discussion.