Table of Contents

cs.CL

[1] The Role of Orthographic Consistency in Multilingual Embedding Models for Text Classification in Arabic-Script Languages

Abdulhady Abas Abdullah,Amir H. Gandomi,Tarik A Rashid,Seyedali Mirjalili,Laith Abualigah,Milena Živković,Hadi Veisi

Main category: cs.CL

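TL;DR: The paper introduces AS-RoBERTa, a family of pre-trained models for Arabic-script languages (Kurdish Sorani, Arabic, Persian, and Urdu). Pre-training on language-specific script features and statistics lets these models outperform general-purpose multilingual models on text classification.

Details Motivation: Multilingual models such as mBERT and XLM-RoBERTa are widely used across NLP tasks, but they perform poorly on languages that share a writing system yet differ in orthographic norms and cultural context, as the Arabic-script languages do. The paper addresses this with language-specific pre-training.

Contribution: The main contribution is the AS-RoBERTa model family, in which each model is pre-trained on the script features of one Arabic-script language, improving classification performance over general-purpose multilingual baselines.

Method: Building on the RoBERTa architecture, the authors pre-train a separate AS-RoBERTa model for each of the four Arabic-script languages and use an ablation study to verify that script-focused pre-training is the key source of the gains.

Result: AS-RoBERTa variants outperform mBERT and XLM-RoBERTa by 2 to 5 percentage points on classification tasks. Error analysis with confusion matrices shows how shared script traits and domain-specific content affect performance.

Insight: The paper underlines the importance of script awareness in pre-training and supports further research on script- and language-specific pre-training strategies.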

Abstract: In natural language processing, multilingual models like mBERT and XLM-RoBERTa promise broad coverage but often struggle with languages that share a script yet differ in orthographic norms and cultural context. This issue is especially notable in Arabic-script languages such as Kurdish Sorani, Arabic, Persian, and Urdu. We introduce the Arabic Script RoBERTa (AS-RoBERTa) family: four RoBERTa-based models, each pre-trained on a large corpus tailored to its specific language. By focusing pre-training on language-specific script features and statistics, our models capture patterns overlooked by general-purpose models. When fine-tuned on classification tasks, AS-RoBERTa variants outperform mBERT and XLM-RoBERTa by 2 to 5 percentage points. An ablation study confirms that script-focused pre-training is central to these gains. Error analysis using confusion matrices shows how shared script traits and domain-specific content affect performance. Our results highlight the value of script-aware specialization for languages using the Arabic script and support further work on pre-training strategies rooted in script and language specificity.

[2] Evaluating Code-Mixing in LLMs Across 18 Languages

Yilun Yang,Yekun Chai

Main category: cs.CL

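TL;DR: The paper comprehensively evaluates large language models (LLMs) on code-mixed data in 18 languages and proposes a method for generating synthetic code-mixed text that combines word substitution with GPT-4 prompting.

Details Motivation: Code-mixing (switching languages within a conversation) poses unique challenges for traditional natural language processing, but existing benchmarks such as LinCE and GLUECoS cover a limited set of language pairs and cannot adequately evaluate the code-mixing capabilities of LLMs.

Contribution: 1) Evaluates LLMs on code-mixed data in 18 languages; 2) proposes a new synthetic code-mixed text generation method that combines word substitution with GPT-4 prompting; 3) reveals consistent underperformance of LLMs on code-mixed data that spans multiple language families.

Method: 1) Evaluate on code-mixed data in 18 languages; 2) generate synthetic code-mixed text through word substitution and GPT-4 prompting; 3) analyze LLM performance and suggest improvements.

Result: LLMs consistently underperform on code-mixed datasets involving multiple language families; the authors suggest that more training data, larger models, and few-shot learning could improve performance.

Insight: The difficulty of code-mixing calls for more comprehensive benchmarks and better synthetic data generation methods, and the current limitations of LLMs point to directions for future improvement.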

Abstract: Code-mixing, the practice of switching between languages within a conversation, presents unique challenges for traditional natural language processing. Existing benchmarks, such as LinCE and GLUECoS, are limited by narrow language pairings and tasks, failing to adequately evaluate the code-mixing capabilities of large language models (LLMs). Despite the significance of code-mixing for multilingual users, research on LLMs in this context remains limited. Additionally, current methods for generating code-mixed data are underdeveloped. In this paper, we conduct a comprehensive evaluation of LLMs’ performance on code-mixed data across 18 languages from seven language families. We also propose a novel approach for generating synthetic code-mixed texts by combining word substitution with GPT-4 prompting. Our analysis reveals consistent underperformance of LLMs on code-mixed datasets involving multiple language families. We suggest that improvements in training data size, model scale, and few-shot learning could enhance their performance.
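
The word-substitution half of the generation method can be illustrated with a small sketch. This is not the authors' pipeline: the `code_mix` function, the toy `lexicon`, and the `ratio` parameter are hypothetical names introduced here, and the GPT-4 prompting step is omitted entirely.

```python
import random


def code_mix(sentence, lexicon, ratio=0.5, seed=0):
    """Replace a fraction of the translatable words with lexicon entries.

    `lexicon` is a hypothetical word-level bilingual dictionary; a real
    system would need alignment and morphology handling.
    """
    rng = random.Random(seed)
    words = sentence.split()
    # Positions whose (lowercased) word has a translation available.
    candidates = [i for i, w in enumerate(words) if w.lower() in lexicon]
    k = round(len(candidates) * ratio)
    for i in rng.sample(candidates, k):
        words[i] = lexicon[words[i].lower()]
    return " ".join(words)


lexicon = {"house": "casa", "dog": "perro", "big": "grande"}
print(code_mix("The big dog sleeps in the house", lexicon, ratio=1.0))
```

With `ratio=1.0` every word found in the lexicon is swapped; lower ratios leave part of the sentence in the original language, which is the simplest way to vary the degree of mixing.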

[3] PrismRAG: Boosting RAG Factuality with Distractor Resilience and Strategized Reasoning

Mohammad Kachuee,Teja Gollapudi,Minseok Kim,Yin Huang,Kai Sun,Xiao Yang,Jiaqi Wang,Nirav Shah,Yue Liu,Aaron Colak,Anuj Kumar,Wen-tau Yih,Xin Luna Dong

Main category: cs.CL

TL;DR: PrismRAG is a fine-tuning framework designed to enhance the factuality of retrieval-augmented generation (RAG) models by improving their resilience to distractors and integrating strategized reasoning.

Details Motivation: RAG models struggle with semi-relevant passages and complex reasoning tasks, often leading to factual inaccuracies. The paper aims to address these limitations.

Contribution: PrismRAG introduces a distractor-aware training method and reasoning-centric habits to improve model factuality, achieving a 5.4% average improvement on RAG QA benchmarks.

Method: The framework fine-tunes models using distractor-aware QA pairs and instills reasoning habits like planning and synthesizing without heavy reliance on human-engineered instructions.

Result: PrismRAG outperforms state-of-the-art solutions, demonstrating significant improvements in factuality across 12 diverse benchmarks.

Insight: Combining distractor resilience with strategized reasoning can effectively enhance the factuality of RAG models, even in complex scenarios.

Abstract: Retrieval-augmented generation (RAG) often falls short when retrieved context includes confusing semi-relevant passages, or when answering questions requires deep contextual understanding and reasoning. We propose an efficient fine-tuning framework, called PrismRAG, that (i) trains the model with distractor-aware QA pairs mixing gold evidence with subtle distractor passages, and (ii) instills reasoning-centric habits that make the LLM plan, rationalize, and synthesize without relying on extensive human-engineered instructions. Evaluated across 12 open-book RAG QA benchmarks spanning diverse application domains and scenarios, PrismRAG improves average factuality by 5.4%, outperforming state-of-the-art solutions.
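
To make the idea of a distractor-aware QA pair concrete, here is a minimal sketch of how one training example might be assembled. The function name `build_example`, the prompt layout, and the field names are assumptions made for illustration; the paper's actual data construction is not described in this digest.

```python
import random


def build_example(question, answer, gold, distractors, n_distractors=2, seed=0):
    """Mix one gold passage with sampled distractors in shuffled order.

    All names and the prompt format are hypothetical, not PrismRAG's.
    """
    rng = random.Random(seed)
    chosen = rng.sample(distractors, min(n_distractors, len(distractors)))
    passages = [gold] + chosen
    rng.shuffle(passages)  # the model must not learn a fixed gold position
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return {
        "prompt": f"{context}\n\nQuestion: {question}",
        "target": answer,
        "gold_index": passages.index(gold) + 1,
    }


example = build_example(
    question="When was the bridge opened?",
    answer="1937",
    gold="The bridge opened to traffic in 1937.",
    distractors=[
        "Construction of the bridge began in 1933.",
        "A nearby tunnel opened in 1957.",
        "The bridge was repainted in 1965.",
    ],
)
print(example["prompt"])
```

Shuffling the passages keeps the gold evidence from always appearing first, so the fine-tuned model has to discriminate on content rather than position.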

[4] MindFlow+: A Self-Evolving Agent for E-Commerce Customer Service

Ming Gong,Xucheng Huang,Ziheng Xu,Vijayan K. Asari

Main category: cs.CL

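TL;DR: MindFlow+ is a self-evolving dialogue agent for e-commerce customer service. It combines large language models (LLMs) with imitation learning and offline reinforcement learning (RL), using tool-augmented demonstration construction and reward-conditioned data modeling to generate more flexible, context-relevant responses.

Details Motivation: Traditional intent-based customer service systems struggle with dynamic, multi-turn interactions. MindFlow+ aims to improve dialogue quality in e-commerce customer service through self-evolving learning.

Contribution: 1. Proposes MindFlow+, which combines LLMs with reinforcement learning to build a domain-specialized dialogue agent. 2. Introduces two mechanisms: tool-augmented demonstration construction and reward-conditioned data modeling. 3. Proposes a new metric, the AI Contribution Ratio, which quantifies AI involvement in dialogue.

Method: 1. Trains the model with imitation learning and offline reinforcement learning. 2. Tool-augmented demonstration construction combines knowledge-enhanced and agentic (ReAct-style) interactions. 3. Reward-conditioned data modeling uses reward signals to align responses with task goals.

Result: In experiments, MindFlow+ outperforms baseline models in contextual relevance, flexibility, and task accuracy, supporting its effectiveness for e-commerce customer service.

Insight: Combining LLMs, tool reasoning, and reward-guided learning makes it possible to build domain-specialized, context-aware dialogue systems that improve customer service quality.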

Abstract: High-quality dialogue is crucial for e-commerce customer service, yet traditional intent-based systems struggle with dynamic, multi-turn interactions. We present MindFlow+, a self-evolving dialogue agent that learns domain-specific behavior by combining large language models (LLMs) with imitation learning and offline reinforcement learning (RL). MindFlow+ introduces two data-centric mechanisms to guide learning: tool-augmented demonstration construction, which exposes the model to knowledge-enhanced and agentic (ReAct-style) interactions for effective tool use; and reward-conditioned data modeling, which aligns responses with task-specific goals using reward signals. To evaluate the model’s role in response generation, we introduce the AI Contribution Ratio, a novel metric quantifying AI involvement in dialogue. Experiments on real-world e-commerce conversations show that MindFlow+ outperforms strong baselines in contextual relevance, flexibility, and task accuracy. These results demonstrate the potential of combining LLMs, tool reasoning, and reward-guided learning to build domain-specialized, context-aware dialogue systems.

[5] Mining Contextualized Visual Associations from Images for Creativity Understanding

Ananya Sahu,Amith Ananthram,Kathleen McKeown

Main category: cs.CL

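TL;DR: The paper proposes a method for mining contextualized visual associations from images, uses it to generate creative captions at different levels of abstraction, and builds a new dataset on MSCOCO. Human evaluation validates caption quality, and zero-shot retrieval experiments show the method's effectiveness.

Details Motivation: Existing vision-language models such as CLIP rely on web-scraped alt-text that is short and mostly literal, so they lack a deeper understanding of creative expression. The paper aims to provide more abstract and creative descriptions by mining contextual associations from images.

Contribution: 1. Proposes a scalable method for mining contextualized visual associations from images; 2. builds a new dataset containing 1.7M creative captions; 3. shows that the method improves zero-shot retrieval, particularly in poetry and metaphor visualization.

Method: The method mines contextual associations for visual elements to generate captions at multiple levels of abstraction. It first identifies the salient visual elements of an image and then uses the mined associations to generate creative captions ranging from concrete to abstract.

Result: Human evaluation shows that the generated captions remain visually grounded while exhibiting recognizable abstraction. A visual encoder fine-tuned on the dataset improves zero-shot retrieval in creative domains.

Insight: Mining the contextual associations of images can provide a shared language for creative expression while improving model capability in abstract domains.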

Abstract: Understanding another person’s creative output requires a shared language of association. However, when training vision-language models such as CLIP, we rely on web-scraped datasets containing short, predominantly literal, alt-text. In this work, we introduce a method for mining contextualized associations for salient visual elements in an image that can scale to any unlabeled dataset. Given an image, we can use these mined associations to generate high quality creative captions at increasing degrees of abstraction. With our method, we produce a new dataset of visual associations and 1.7m creative captions for the images in MSCOCO. Human evaluation confirms that these captions remain visually grounded while exhibiting recognizably increasing abstraction. Moreover, fine-tuning a visual encoder on this dataset yields meaningful improvements in zero-shot image-text retrieval in two creative domains: poetry and metaphor visualization. We release our dataset, our generation code and our models for use by the broader community.

[6] LLaVA-NeuMT: Selective Layer-Neuron Modulation for Efficient Multilingual Multimodal Translation

Jingxuan Wei,Caijun Jia,Qi Chen,Yujun Cai,Linzhuang Sun,Xiangxiang Zhang,Gaowei Wu,Bihui Yu

Main category: cs.CL

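TL;DR: LLaVA-NeuMT is a new multimodal multilingual translation framework. It explicitly models language-specific and language-agnostic representations to reduce multilingual interference, and uses layer selection and neuron-level adaptation to improve translation quality.

Details Motivation: Existing multimodal machine translation (MMT) methods perform well in bilingual settings, but extending them to multilingual translation runs into cross-lingual interference and inefficient parameter sharing.

Contribution: Proposes the LLaVA-NeuMT framework, comprising a layer selection mechanism and a neuron-level adaptation strategy, which reduces multilingual interference and improves translation quality.

Method: A layer selection mechanism identifies the most informative layers for each language pair, and a dynamic neuron selection strategy optimizes parameter sharing.

Result: On the M3-Multi30K and M3-AmbigCaps datasets, the framework surpasses full fine-tuning while fine-tuning only 40% of the parameters, achieving SOTA results.

Insight: The choice of layers and neurons is critical for multimodal multilingual adaptation, and selecting them offers an efficient, scalable approach to cross-lingual adaptation.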

Abstract: Multimodal Machine Translation (MMT) enhances translation quality by incorporating visual context, helping to resolve textual ambiguities. While existing MMT methods perform well in bilingual settings, extending them to multilingual translation remains challenging due to cross-lingual interference and ineffective parameter-sharing strategies. To address this, we propose LLaVA-NeuMT, a novel multimodal multilingual translation framework that explicitly models language-specific and language-agnostic representations to mitigate multilingual interference. Our approach consists of a layer selection mechanism that identifies the most informative layers for different language pairs and a neuron-level adaptation strategy that dynamically selects language-specific and agnostic neurons to improve translation quality while reducing redundancy. We conduct extensive experiments on the M3-Multi30K and M3-AmbigCaps datasets, demonstrating that LLaVA-NeuMT, while fine-tuning only 40% of the model parameters, surpasses full fine-tuning approaches and ultimately achieves SOTA results on both datasets. Our analysis further provides insights into the importance of selected layers and neurons in multimodal multilingual adaptation, offering an efficient and scalable solution to cross-lingual adaptation in multimodal translation.

[7] A Toolbox, Not a Hammer – Multi-TAG: Scaling Math Reasoning with Multi-Tool Aggregation

Bohan Yao,Vikas Yadav

Main category: cs.CL

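TL;DR: The paper proposes Multi-TAG, a framework that strengthens the mathematical reasoning of large language models through multi-tool aggregation. It substantially improves performance on complex math problems and applies to a range of LLMs without fine-tuning.

Details Motivation: Existing tool-augmented methods typically use a single tool at each reasoning step and struggle with complex math problems that require precise multi-step reasoning.

Contribution: 1. Proposes the Multi-TAG framework, in which the LLM concurrently invokes multiple tools at each reasoning step and aggregates their results; 2. requires no fine-tuning and applies to many LLMs; 3. performs strongly on four challenging math benchmarks.

Method: In Multi-TAG, the LLM invokes multiple tools at each reasoning step and aggregates their diverse outputs to verify and refine the reasoning process.

Result: On four benchmarks (MATH500, AIME, AMC, and OlympiadBench), Multi-TAG outperforms existing methods, with average improvements of 6.0% to 7.5%.

Insight: Multi-tool aggregation improves reasoning on complex math problems, and the fine-tuning-free design makes the approach broadly applicable.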

Abstract: Augmenting large language models (LLMs) with external tools is a promising avenue for developing high-performance mathematical reasoning systems. Prior tool-augmented approaches typically finetune an LLM to select and invoke a single tool at each reasoning step and show promising results on simpler math reasoning benchmarks such as GSM8K. However, these approaches struggle with more complex math problems that require precise reasoning over multiple steps. To address this limitation, in this work, we propose Multi-TAG, a Multi-Tool AGgregation-based framework. Instead of relying on a single tool, Multi-TAG guides an LLM to concurrently invoke multiple tools at each reasoning step. It then aggregates their diverse outputs to verify and refine the reasoning process, enhancing solution robustness and accuracy. Notably, Multi-TAG is a finetuning-free, inference-only framework, making it readily applicable to any LLM backbone, including large open-weight models which are computationally expensive to finetune and proprietary frontier models which cannot be finetuned with custom recipes. We evaluate Multi-TAG on four challenging benchmarks: MATH500, AIME, AMC, and OlympiadBench. Across both open-weight and closed-source LLM backbones, Multi-TAG consistently and substantially outperforms state-of-the-art baselines, with average improvements of 6.0% to 7.5%.
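
One simple way to picture aggregation across tools is a majority vote over the candidate answers produced at a step. The sketch below assumes that reading; `aggregate`, `solve_step`, and the toy tools are hypothetical, the tools run sequentially rather than concurrently, and the paper's actual aggregation rule may differ.

```python
from collections import Counter


def aggregate(candidates):
    """Return the most common non-None answer and its share of the votes."""
    counts = Counter(c for c in candidates if c is not None)
    if not counts:
        return None, 0.0
    answer, votes = counts.most_common(1)[0]
    return answer, votes / sum(counts.values())


def solve_step(problem, tools):
    """Call every tool on the same step; a failing tool contributes no vote."""
    outputs = []
    for tool in tools:
        try:
            outputs.append(tool(problem))
        except Exception:
            outputs.append(None)
    return aggregate(outputs)


# Toy stand-ins for, e.g., a code executor, a symbolic solver, and free-form reasoning.
tools = [lambda p: 4, lambda p: 4, lambda p: 5]
print(solve_step("2 + 2", tools))
```

The vote share doubles as a rough agreement signal: a low share suggests the step deserves another attempt or closer verification.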

[8] Arg-LLaDA: Argument Summarization via Large Language Diffusion Models and Sufficiency-Aware Refinement

Hao Li,Yizheng Sun,Viktor Schlegel,Kailai Yang,Riza Batista-Navarro,Goran Nenadic

Main category: cs.CL

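TL;DR: The paper proposes Arg-LLaDA, an iterative generation framework based on large language diffusion models that improves the accuracy and coherence of argument summaries. Using a sufficiency-guided remasking and regeneration strategy, it outperforms existing methods.

Details Motivation: The generation stage of argument summarization is underexplored. Existing methods usually rely on single-pass generation and offer little support for factual correction or structural refinement, so a method that can iteratively improve a summary is needed.

Contribution: The main contribution is the Arg-LLaDA framework, which combines a masking controller with a sufficiency-checking module to refine summaries iteratively, producing more faithful, concise, and coherent output.

Method: A large language diffusion model generates the summary iteratively. A masking controller dynamically adjusts which spans are masked, and a sufficiency-checking module identifies and revises incomplete, redundant, or unsupported spans.

Result: Arg-LLaDA surpasses state-of-the-art baselines on 7 of 10 automatic evaluation metrics across two benchmark datasets, and human evaluation shows substantial gains in coverage, faithfulness, and conciseness.

Insight: Iterative generation and sufficiency checking are key to improving argument summary quality, and a dynamic masking strategy can effectively refine the generation process.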

Abstract: Argument summarization aims to generate concise, structured representations of complex, multi-perspective debates. While recent work has advanced the identification and clustering of argumentative components, the generation stage remains underexplored. Existing approaches typically rely on single-pass generation, offering limited support for factual correction or structural refinement. To address this gap, we introduce Arg-LLaDA, a novel large language diffusion framework that iteratively improves summaries via sufficiency-guided remasking and regeneration. Our method combines a flexible masking controller with a sufficiency-checking module to identify and revise unsupported, redundant, or incomplete spans, yielding more faithful, concise, and coherent outputs. Empirical results on two benchmark datasets demonstrate that Arg-LLaDA surpasses state-of-the-art baselines in 7 out of 10 automatic evaluation metrics. In addition, human evaluations reveal substantial improvements across core dimensions, coverage, faithfulness, and conciseness, validating the effectiveness of our iterative, sufficiency-aware generation strategy.

[9] Jailbreaking Large Language Diffusion Models: Revealing Hidden Safety Flaws in Diffusion-Based Text Generation

Yuanhe Zhang,Fangzhou Xie,Zhenhong Zhou,Zherui Li,Hao Chen,Kun Wang,Yufei Guo

Main category: cs.CL

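TL;DR: The paper exposes safety vulnerabilities in large language diffusion models (LLDMs) and proposes PAD, a parallel decoding jailbreak attack that reaches a 97% success rate. It also shows that LLDMs generate harmful content 2x faster than autoregressive models.

Details Motivation: LLDMs perform well in inference speed and mathematical reasoning, but their fast and precise generation raises concerns about harmful content. Existing jailbreak methods designed for LLMs have limited effect on LLDMs and fail to expose their safety weaknesses.

Contribution: 1. Reveals for the first time that LLDMs are vulnerable to jailbreak attacks; 2. proposes PAD, which achieves a 97% success rate through a Multi-Point Attention Attack; 3. finds that LLDMs generate harmful content 2x faster than LLMs of the same size.

Method: Proposes PAD (PArallel Decoding jailbreak), which uses a Multi-Point Attention Attack to steer the parallel generation process; the attack design is inspired by affirmative response patterns in LLMs.

Result: Experiments on four LLDMs show that PAD reaches a 97% jailbreak success rate and that LLDMs generate harmful content 2x faster than LLMs of the same size.

Insight: The failure of existing attacks on LLDMs stems from architectural differences rather than from safety robustness, so a successful defense against those attacks does not settle the harmful-generation concern. PAD's success exposes this risk and offers insights for the secure deployment of diffusion-based language models.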

Abstract: Large Language Diffusion Models (LLDMs) exhibit comparable performance to LLMs while offering distinct advantages in inference speed and mathematical reasoning tasks. The precise and rapid generation capabilities of LLDMs amplify concerns of harmful generations, while existing jailbreak methodologies designed for Large Language Models (LLMs) show limited effectiveness against LLDMs and fail to expose safety vulnerabilities. Successful defense cannot definitively resolve harmful generation concerns, as it remains unclear whether LLDMs possess safety robustness or existing attacks are incompatible with diffusion-based architectures. To address this, we first reveal the vulnerability of LLDMs to jailbreak and demonstrate that attack failure in LLDMs stems from fundamental architectural differences. We present a PArallel Decoding jailbreak (PAD) for diffusion-based language models. PAD introduces Multi-Point Attention Attack, which guides parallel generative processes toward harmful outputs and is inspired by affirmative response patterns in LLMs. Experimental evaluations across four LLDMs demonstrate that PAD achieves jailbreak attack success rates of 97%, revealing significant safety vulnerabilities. Furthermore, compared to autoregressive LLMs of the same size, LLDMs increase the harmful generation speed by 2x, significantly highlighting risks of uncontrolled misuse. Through comprehensive analysis, we provide an investigation into LLDM architecture, offering critical insights for the secure deployment of diffusion-based language models.

[10] Enhancing Speech Emotion Recognition Leveraging Aligning Timestamps of ASR Transcripts and Speaker Diarization

Hsuan-Yu Wang,Pei-Ying Lee,Berlin Chen

Main category: cs.CL

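TL;DR: The paper studies how aligning the timestamps of ASR transcripts and speaker diarization (SD) outputs improves speech emotion recognition (SER) accuracy. It proposes a multimodal approach combining text and audio, and its experiments confirm the importance of temporal alignment.

Details Motivation: In multimodal emotion recognition systems, misaligned timestamps between ASR transcripts and speaker diarization lead to unreliable results, especially in conversational settings. The paper examines how temporal alignment affects SER accuracy.

Contribution: The main contribution is a timestamp alignment pipeline that combines pre-trained ASR and speaker diarization models to produce accurately labeled speaker segments, together with a multimodal fusion method that improves emotion recognition.

Method: Pre-trained ASR and speaker diarization models are used to align timestamps. Text embeddings from RoBERTa and audio embeddings from Wav2Vec are fused through cross-attention with a gating mechanism.

Result: Experiments on the IEMOCAP dataset show that timestamp alignment improves SER accuracy over baseline methods that lack synchronization.

Insight: Temporal alignment is critical for multimodal emotion recognition; it improves performance and lays a foundation for robust multimodal emotion analysis.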

Abstract: In this paper, we investigate the impact of incorporating timestamp-based alignment between Automatic Speech Recognition (ASR) transcripts and Speaker Diarization (SD) outputs on Speech Emotion Recognition (SER) accuracy. Misalignment between these two modalities often reduces the reliability of multimodal emotion recognition systems, particularly in conversational contexts. To address this issue, we introduce an alignment pipeline utilizing pre-trained ASR and speaker diarization models, systematically synchronizing timestamps to generate accurately labeled speaker segments. Our multimodal approach combines textual embeddings extracted via RoBERTa with audio embeddings from Wav2Vec, leveraging cross-attention fusion enhanced by a gating mechanism. Experimental evaluations on the IEMOCAP benchmark dataset demonstrate that precise timestamp alignment improves SER accuracy, outperforming baseline methods that lack synchronization. The results highlight the critical importance of temporal alignment, demonstrating its effectiveness in enhancing overall emotion recognition accuracy and providing a foundation for robust multimodal emotion analysis.
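
A common way to synchronize ASR word timestamps with diarization output is to give each word the speaker whose segment overlaps it the most. The sketch below assumes that rule; the function names and tuple layouts are hypothetical, and the paper's pipeline may resolve overlaps differently.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(words, segments):
    """Label each ASR word with the best-overlapping diarization speaker.

    words:    list of (text, start, end) from a hypothetical ASR output
    segments: non-empty list of (speaker, start, end) from diarization
    """
    labeled = []
    for text, w_start, w_end in words:
        best = max(segments, key=lambda s: overlap(w_start, w_end, s[1], s[2]))
        has_overlap = overlap(w_start, w_end, best[1], best[2]) > 0
        labeled.append((best[0] if has_overlap else None, text))
    return labeled


words = [("hello", 0.0, 0.4), ("there", 0.5, 0.9), ("hi", 1.1, 1.4)]
segments = [("A", 0.0, 1.0), ("B", 1.0, 2.0)]
print(assign_speakers(words, segments))
```

Words that fall outside every segment are labeled `None` here so that downstream code can decide whether to drop them or attach them to the nearest speaker.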

[11] SpeechIQ: Speech Intelligence Quotient Across Cognitive Levels in Voice Understanding Large Language Models

Zhen Wan,Chao-Han Huck Yang,Yahan Yu,Jinchuan Tian,Sheng Li,Ke Hu,Zhehuai Chen,Shinji Watanabe,Fei Cheng,Chenhui Chu,Sadao Kurohashi

Main category: cs.CL

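TL;DR: The paper proposes the Speech-based Intelligence Quotient (SIQ) for evaluating voice understanding large language models (LLM Voice). SIQ goes beyond traditional voice understanding metrics such as word error rate (WER) and introduces three cognitive levels based on Bloom's Taxonomy.

Details Motivation: Traditional metrics such as WER do not fully capture how large language models perform on voice understanding, especially the cognitive abilities needed for complex tasks. A more comprehensive evaluation framework, grounded in cognitive science, is needed to quantify different cognitive levels.

Contribution: 1. Proposes SIQ, a new voice understanding evaluation framework inspired by cognitive science; 2. evaluates models across three cognitive levels (Remembering, Understanding, and Application); 3. demonstrates SIQ's usefulness for model comparison, annotation error detection, and hallucination detection.

Method: SIQ follows Bloom's Taxonomy and has three cognitive levels: (1) Remembering (e.g., WER); (2) Understanding (e.g., semantic similarity); (3) Application (e.g., QA accuracy). Evaluating at all three levels quantifies the performance of voice understanding models.

Result: SIQ quantifies voice understanding ability, provides unified comparisons between cascaded methods (ASR plus LLM) and end-to-end models, identifies annotation errors in existing benchmarks, and detects hallucinations in model output.

Insight: The study exposes overlooked challenges in multimodal training, such as differences across cognitive levels in voice understanding models, and stresses the value of evaluation frameworks grounded in cognitive science.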

Abstract: We introduce Speech-based Intelligence Quotient (SIQ) as a new form of human cognition-inspired evaluation pipeline for voice understanding large language models, LLM Voice, designed to assess their voice understanding ability. Moving beyond popular voice understanding metrics such as word error rate (WER), SIQ examines LLM Voice across three cognitive levels motivated by Bloom’s Taxonomy: (1) Remembering (i.e., WER for verbatim accuracy); (2) Understanding (i.e., similarity of LLM’s interpretations); and (3) Application (i.e., QA accuracy for simulating downstream tasks). We demonstrate that SIQ not only quantifies voice understanding abilities but also provides unified comparisons between cascaded methods (e.g., ASR LLM) and end-to-end models, identifies annotation errors in existing benchmarks, and detects hallucinations in LLM Voice. Our framework represents a first-of-its-kind intelligence examination that bridges cognitive principles with voice-oriented benchmarks, while exposing overlooked challenges in multi-modal training.
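
The Remembering level relies on word error rate, which is the word-level edit distance between reference and hypothesis divided by the reference length. The sketch below is a standard textbook computation, not code from the paper, and it uses plain whitespace tokenization as a simplifying assumption.

```python
def wer(reference, hypothesis):
    """Word error rate: (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


# One substitution (sat -> sit) and one deletion (the): 2 errors over 6 words.
print(wer("the cat sat on the mat", "the cat sit on mat"))
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.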

[12] Data Augmentation for Spoken Grammatical Error Correction

Penny Karanasou,Mengjie Qian,Stefano Bannò,Mark J. F. Gales,Kate M. Knill

Main category: cs.CL

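TL;DR: The paper proposes a fully automated method for generating audio-text pairs that contain grammatical errors and disfluencies, along with a set of objective metrics for evaluating the generated data, with the aim of enriching corpora for spoken grammatical error correction (SGEC).

Details Motivation: SGEC lacks high-quality annotated datasets, which limits model training and evaluation. Generating realistic errors and audio-text pairs can supply more training data for the task.

Contribution: 1) Proposes a method for automatically generating audio-text pairs with grammatical errors and disfluencies; 2) proposes objective metrics for evaluating the generated data; 3) validates the approach on the S&I Corpus, the first publicly available speech dataset with grammar error annotations.

Method: An automated method generates audio-text pairs while introducing grammatical errors and disfluencies, and metrics are designed to assess the quality and suitability of the data. The goal is to preserve the textual and acoustic characteristics of the original data while adding new types of errors.

Result: The augmented dataset is evaluated both for written grammatical error correction (GEC) and for SGEC. It is intended to enrich the original corpus without altering the language assessment scores of the L2 learners.

Insight: Automated data augmentation can address the data shortage in SGEC, and the diverse error types it generates may help improve model robustness and generalization.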

Abstract: While there exist strong benchmark datasets for grammatical error correction (GEC), high-quality annotated spoken datasets for Spoken GEC (SGEC) are still under-resourced. In this paper, we propose a fully automated method to generate audio-text pairs with grammatical errors and disfluencies. Moreover, we propose a series of objective metrics that can be used to evaluate the generated data and choose the more suitable dataset for SGEC. The goal is to generate an augmented dataset that maintains the textual and acoustic characteristics of the original data while providing new types of errors. This augmented dataset should augment and enrich the original corpus without altering the language assessment scores of the second language (L2) learners. We evaluate the use of the augmented corpus both for written GEC (the text part) and for SGEC (the audio-text pairs). Our experiments are conducted on the S&I Corpus, the first publicly available speech dataset with grammar error annotations.

[13] Detection of Adverse Drug Events in Dutch clinical free text documents using Transformer Models: benchmark study

Rachel M. Murphy,Nishant Mishra,Nicolette F. de Keizer,Dave A. Dongelmans,Kitty J. Jager,Ameen Abu-Hanna,Joanna E. Klopotowska,Iacer Calixto

Main category: cs.CL

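TL;DR: The paper uses several transformer models to detect adverse drug events (ADEs) in Dutch clinical free text, establishes a benchmark, and compares model performance. MedRoBERTa.nl performs best.

Details Motivation: The study addresses the challenge of detecting ADEs in Dutch clinical free text, with the aim of providing reliable tools for future clinical practice.

Contribution: 1) Establishes a benchmark for ADE detection; 2) compares several transformer models; 3) proposes performance measures suited to the clinical task; 4) externally validates the models' ability to generalize.

Method: 1) Uses a Bi-LSTM and four transformer models (BERTje, RobBERT, MedRoBERTa.nl, and NuNER); 2) trains on 102 Dutch ICU clinical notes for named entity recognition (NER) and relation classification (RC); 3) validates the models internally and externally.

Result: MedRoBERTa.nl performs best, with a macro-averaged F1 score of 0.63 using gold standard entities and 0.62 using predicted entities; recall in external validation is 0.67 to 0.74.

Insight: 1) ADE detection needs performance measures suited to the task; 2) MedRoBERTa.nl performs best, although differences between the models on the ADE RC task are small; 3) external validation is essential for assessing generalization.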

Abstract: In this study, we set a benchmark for adverse drug event (ADE) detection in Dutch clinical free text documents using several transformer models, clinical scenarios and fit-for-purpose performance measures. We trained a Bidirectional Long Short-Term Memory (Bi-LSTM) model and four transformer-based Dutch and/or multilingual encoder models (BERTje, RobBERT, MedRoBERTa.nl, and NuNER) for the tasks of named entity recognition (NER) and relation classification (RC) using 102 richly annotated Dutch ICU clinical progress notes. Anonymized free text clinical progress notes of patients admitted to the intensive care unit (ICU) of one academic hospital and discharge letters of patients admitted to Internal Medicine wards of two non-academic hospitals were reused. We evaluated our ADE RC models internally using gold standard (two-step task) and predicted entities (end-to-end task). In addition, all models were externally validated on detecting ADEs at the document level. We report both micro- and macro-averaged F1 scores, given the imbalance of ADEs in the datasets. Although differences for the ADE RC task between the models were small, MedRoBERTa.nl was the best performing model with a macro-averaged F1 score of 0.63 using gold standard and 0.62 using predicted entities. The MedRoBERTa.nl models also performed the best in our external validation and achieved recall of between 0.67 and 0.74 using predicted entities, meaning between 67% and 74% of discharge letters with ADEs were detected. Our benchmark study presents a robust and clinically meaningful approach for evaluating language models for ADE detection in clinical free text documents. Our study highlights the need to use appropriate performance measures fit for the task of ADE detection in clinical free-text documents and envisioned future clinical use.
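
The reason to report both averages under class imbalance can be seen in a small worked example. The sketch below uses invented toy labels, not data from the study: micro-averaging pools all decisions and is dominated by the frequent class, while macro-averaging gives the rare ADE class equal weight.

```python
def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(gold, pred):
    """Return (micro_f1, macro_f1) for single-label classification."""
    labels = sorted(set(gold) | set(pred))
    counts = {}
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        counts[label] = (tp, fp, fn)
    macro = sum(f1(*c) for c in counts.values()) / len(labels)
    # Micro: sum TP, FP, FN over all classes, then compute one F1.
    micro = f1(*(sum(col) for col in zip(*counts.values())))
    return micro, macro


# Toy imbalanced data: 8 "none" relations, 2 "ADE" relations, one ADE missed.
gold = ["none"] * 8 + ["ADE"] * 2
pred = ["none"] * 8 + ["none", "ADE"]
print(micro_macro_f1(gold, pred))
```

On this toy data the micro score is 0.90 while the macro score is about 0.80, because the missed ADE halves recall on the rare class but barely moves the pooled counts.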

[14] Towards Domain Specification of Embedding Models in Medicine

Mohammad Khodadad,Ali Shiraee,Mahdi Astaraki,Hamidreza Mahyar

Main category: cs.CL

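TL;DR: The paper presents the MEDTE model and a comprehensive benchmark suite for medical text. Together they address shortcomings of existing medical embedding models, and MEDTE outperforms existing methods on multiple tasks.

Details Motivation: Medical text embedding models are essential to healthcare applications, but existing models are trained on narrow data with outdated methods, and existing evaluation benchmarks do not cover the diversity of real medical tasks.

Contribution: 1. Proposes MEDTE, which produces robust medical text embeddings through self-supervised contrastive learning over multiple data sources; 2. designs a medical text benchmark of 51 tasks that fills gaps in existing evaluations.

Method: A GTE model is fine-tuned on multi-source medical corpora with self-supervised contrastive learning, and a benchmark modeled on MTEB but tailored to medical text is designed.

Result: Experiments show that MEDTE outperforms existing methods on the benchmark, which spans classification, clustering, pair classification, and retrieval.

Insight: 1. The performance of medical embedding models depends heavily on diverse training data; 2. tailored benchmarks are essential for evaluating medical text embeddings.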

Abstract: Medical text embedding models are foundational to a wide array of healthcare applications, ranging from clinical decision support and biomedical information retrieval to medical question answering, yet they remain hampered by two critical shortcomings. First, most models are trained on a narrow slice of medical and biological data, besides not being up to date in terms of methodology, making them ill-suited to capture the diversity of terminology and semantics encountered in practice. Second, existing evaluations are often inadequate: even widely used benchmarks fail to generalize across the full spectrum of real-world medical tasks. To address these gaps, we leverage MEDTE, a GTE model extensively fine-tuned on diverse medical corpora through self-supervised contrastive learning across multiple data sources, to deliver robust medical text embeddings. Alongside this model, we propose a comprehensive benchmark suite of 51 tasks spanning classification, clustering, pair classification, and retrieval, modeled on the Massive Text Embedding Benchmark (MTEB) but tailored to the nuances of medical text. Our results demonstrate that this combined approach not only establishes a robust evaluation framework but also yields embeddings that consistently outperform state-of-the-art alternatives in different tasks.
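
Contrastive fine-tuning of embedding models is often implemented with an in-batch InfoNCE objective, where each anchor should score its own positive higher than every other text in the batch. The sketch below assumes that objective for illustration; the digest does not state which loss MEDTE uses, and the `temperature` value and function names are hypothetical.

```python
import math


def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def _normalize(v):
    norm = math.sqrt(_dot(v, v))
    return [x / norm for x in v]


def info_nce(anchors, positives, temperature=0.05):
    """Mean in-batch InfoNCE loss; positives[i] is the match for anchors[i]."""
    anchors = [_normalize(a) for a in anchors]
    positives = [_normalize(p) for p in positives]
    loss = 0.0
    for i, a in enumerate(anchors):
        # Cosine similarity to every positive in the batch, scaled by temperature.
        logits = [_dot(a, p) / temperature for p in positives]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_denom - logits[i]
    return loss / len(anchors)


queries = [[1.0, 0.0], [0.0, 1.0]]
print(info_nce(queries, queries))        # matched pairs: loss near zero
print(info_nce(queries, queries[::-1]))  # mismatched pairs: large loss
```

Minimizing this loss pulls each text toward its paired text and pushes it away from the other items in the batch, which is what makes the resulting embeddings useful for retrieval and clustering.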

[15] TokenSmith: Streamlining Data Editing, Search, and Inspection for Large-Scale Language Model Training and Interpretability

Mohammad Aflah Khan,Ameya Godbole,Johnny Tian-Zheng Wei,Ryan Wang,James Flemings,Krishna Gummadi,Willie Neiswanger,Robin Jia

Main category: cs.CL

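TL;DR: TokenSmith is an open-source library that simplifies and improves data editing, search, and inspection in large-scale language model pretraining, giving researchers an efficient set of tools.

Details Motivation: Existing pretraining workflows make it cumbersome and inefficient for researchers to understand the relationship between training data and model behavior. TokenSmith aims to solve this problem.

Contribution: 1. Provides interactive editing, inspection, and analysis of data; 2. supports a range of operations such as searching, exporting, and sampling; 3. enables structured editing of data without changes to training code; 4. has a modular design that integrates easily into existing frameworks.

Method: Through a simple user interface and a modular backend, TokenSmith enables interactive operations on the data of Megatron-style pretraining frameworks, supporting data editing and analysis.

Result: TokenSmith simplifies dataset debugging, validation, and experimentation, and the tool is released as open source with documentation and tutorials.

Insight: TokenSmith's modular design and plug-and-play integration with existing frameworks may broaden access to dataset tooling for language model pretraining.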

Abstract: Understanding the relationship between training data and model behavior during pretraining is crucial, but existing workflows make this process cumbersome, fragmented, and often inaccessible to researchers. We present TokenSmith, an open-source library for interactive editing, inspection, and analysis of datasets used in Megatron-style pretraining frameworks such as GPT-NeoX, Megatron, and NVIDIA NeMo. TokenSmith supports a wide range of operations including searching, viewing, ingesting, exporting, inspecting, and sampling data, all accessible through a simple user interface and a modular backend. It also enables structured editing of pretraining data without requiring changes to training code, simplifying dataset debugging, validation, and experimentation. TokenSmith is designed as a plug-and-play addition to existing large language model pretraining workflows, thereby democratizing access to production-grade dataset tooling. TokenSmith is hosted on GitHub, with accompanying documentation and tutorials. A demonstration video is also available on YouTube.

[16] GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning

Lakshya A Agrawal,Shangyin Tan,Dilara Soylu,Noah Ziems,Rishi Khare,Krista Opsahl-Ong,Arnav Singhvi,Herumb Shandilya,Michael J Ryan,Meng Jiang,Christopher Potts,Koushik Sen,Alexandros G. Dimakis,Ion Stoica,Dan Klein,Matei Zaharia,Omar Khattab

Main category: cs.CL

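TL;DR: GEPA is a prompt optimization method based on natural language reflection. It diagnoses problems, proposes updates, and learns from the Pareto frontier of its own attempts, outperforming reinforcement learning baselines.

Details Motivation: Reinforcement learning methods such as GRPO need large numbers of rollouts to learn new tasks, whereas the interpretable nature of language may offer a richer learning medium.

Contribution: Proposes GEPA (Genetic-Pareto), a prompt optimizer based on natural language reflection that can achieve large performance gains from only a few rollouts.

Method: GEPA samples system-level trajectories, reflects on them in natural language to diagnose problems, proposes and tests prompt updates, and combines complementary lessons from the Pareto frontier of its attempts.

Result: Across four tasks, GEPA outperforms GRPO by 10% on average and by up to 20%, while using up to 35x fewer rollouts; across two LLMs, it outperforms MIPROv2 by over 10%.

Insight: Natural language reflection can guide prompt optimization more efficiently and reduce dependence on large numbers of rollouts, yielding better performance and efficiency than the reinforcement learning baseline.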

Abstract: Large language models (LLMs) are increasingly adapted to downstream tasks via reinforcement learning (RL) methods like Group Relative Policy Optimization (GRPO), which often require thousands of rollouts to learn new tasks. We argue that the interpretable nature of language can often provide a much richer learning medium for LLMs, compared with policy gradients derived from sparse, scalar rewards. To test this, we introduce GEPA (Genetic-Pareto), a prompt optimizer that thoroughly incorporates natural language reflection to learn high-level rules from trial and error. Given any AI system containing one or more LLM prompts, GEPA samples system-level trajectories (e.g., reasoning, tool calls, and tool outputs) and reflects on them in natural language to diagnose problems, propose and test prompt updates, and combine complementary lessons from the Pareto frontier of its own attempts. As a result of GEPA’s design, it can often turn even just a few rollouts into a large quality gain. Across four tasks, GEPA outperforms GRPO by 10% on average and by up to 20%, while using up to 35x fewer rollouts. GEPA also outperforms the leading prompt optimizer, MIPROv2, by over 10% across two LLMs, and demonstrates promising results as an inference-time search strategy for code optimization.
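
The Pareto frontier mentioned above can be illustrated with a generic non-dominated filter over candidate prompts scored on several tasks or examples. This is a textbook definition rather than GEPA's exact selection rule, and the prompt names and scores are invented.

```python
def dominates(a, b):
    """True if score vector a is at least as good as b everywhere and better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(scores):
    """Keep candidates that no other candidate dominates.

    scores: dict mapping a candidate name to a tuple of per-task scores.
    """
    return [
        name
        for name, s in scores.items()
        if not any(dominates(other, s) for m, other in scores.items() if m != name)
    ]


# Hypothetical per-task scores for three candidate prompts.
scores = {
    "prompt_a": (0.9, 0.2),  # strong on task 1
    "prompt_b": (0.3, 0.8),  # strong on task 2
    "prompt_c": (0.2, 0.1),  # worse than prompt_a on both tasks
}
print(pareto_front(scores))  # ['prompt_a', 'prompt_b']
```

Keeping every non-dominated candidate, rather than only the best average, preserves prompts that have learned complementary lessons, which can then be combined in later rounds.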

[17] Conversations Gone Awry, But Then? Evaluating Conversational Forecasting Models

Son Quoc Tran,Tushaar Gangavarapu,Nicholas Chernogor,Jonathan P. Chang,Cristian Danescu-Niculescu-Mizil

Main category: cs.CL

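TL;DR: The paper proposes a uniform evaluation framework for comparing conversational forecasting models of different architectures on the Conversations Gone Awry (CGA) task, and introduces a new metric that measures a model's ability to revise its forecast as the conversation progresses.

Details Motivation: For automated systems to forecast whether a conversation will derail (the CGA task), uniform evaluation standards and benchmarks are needed so that models can be compared directly.

Contribution: 1. Proposes the first uniform evaluation framework for the CGA task; 2. introduces a new metric that quantifies a model's ability to revise its forecast dynamically; 3. provides an up-to-date overview of CGA models in light of recent progress in language modeling.

Method: The authors develop a uniform evaluation framework, including a benchmark and a new metric, to test how well models adjust their forecasts as a conversation evolves.

Result: The framework enables direct and reliable comparisons between CGA models of different architectures and adds a metric for forecast revision.

Insight: Uniform evaluation standards and a metric for dynamic forecasting ability are important for advancing conversational forecasting models.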

Abstract: We often rely on our intuition to anticipate the direction of a conversation. Endowing automated systems with similar foresight can enable them to assist human-human interactions. Recent work on developing models with this predictive capacity has focused on the Conversations Gone Awry (CGA) task: forecasting whether an ongoing conversation will derail. In this work, we revisit this task and introduce the first uniform evaluation framework, creating a benchmark that enables direct and reliable comparisons between different architectures. This allows us to present an up-to-date overview of the current progress in CGA models, in light of recent advancements in language modeling. Our framework also introduces a novel metric that captures a model’s ability to revise its forecast as the conversation progresses.

cs.CV [Back]

[18] Quantum-Cognitive Tunnelling Neural Networks for Military-Civilian Vehicle Classification and Sentiment Analysis

Milan Maksimovic,Anna Bohdanets,Immaculate Motsi-Omoijiade,Guido Governatori,Ivan S. Maksymov

Main category: cs.CV

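TL;DR: This paper proposes neural network models based on quantum tunnelling probability for distinguishing images of military and civilian vehicles and for sentiment analysis, and points to their potential use in battlefield scenarios.

Details Motivation: Use quantum tunnelling probability to model nuances of human perception, improving AI recognition of ambiguous objects and sentiment in battlefield environments.

Contribution: Proposes a novel quantum tunnelling neural network model and applies it to military-civilian vehicle classification and sentiment analysis.

Method: Integrates quantum tunnelling probability into neural networks, training and testing on customised CIFAR-format image data and a military-specific vocabulary.

Result: Suggests that quantum tunnelling models can enhance multimodal AI applications, particularly in human-operated drone warfare scenarios.

Insight: The quantum tunnelling mechanism may give AI capabilities closer to human reasoning, improving its performance in complex battlefield environments.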

Abstract: Prior work has demonstrated that incorporating well-known quantum tunnelling (QT) probability into neural network models effectively captures important nuances of human perception, particularly in the recognition of ambiguous objects and sentiment analysis. In this paper, we employ novel QT-based neural networks and assess their effectiveness in distinguishing customised CIFAR-format images of military and civilian vehicles, as well as sentiment, using a proprietary military-specific vocabulary. We suggest that QT-based models can enhance multimodal AI applications in battlefield scenarios, particularly within human-operated drone warfare contexts, imbuing AI with certain traits of human reasoning.

[19] Livatar-1: Real-Time Talking Heads Generation with Tailored Flow Matching

Haiyang Liu,Xiaolin Hong,Xuancheng Yang,Yudi Ruan,Xiang Lian,Michael Lingelbach,Hongwei Yi,Wei Li

Main category: cs.CV

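TL;DR: Livatar-1 is a real-time, audio-driven talking-head video generation framework that uses flow matching to address limited lip-sync accuracy and long-term pose drift, achieving both high accuracy and high efficiency.

Details Motivation: Existing talking-head generation methods fall short on lip-sync accuracy and long-term pose stability; Livatar-1 aims to resolve these issues and improve generation quality and real-time performance.

Contribution: 1. Proposes a flow-matching-based framework for real-time audio-driven talking-head video generation; 2. Achieves high lip-sync accuracy (8.50 LipSync Confidence on the HDTF dataset) and high efficiency (141 FPS with 0.17 s latency).

Method: Uses flow matching to improve lip sync and pose stability, combined with system optimizations for real-time operation.

Result: Performs strongly on the HDTF dataset (8.50 LipSync Confidence), reaching 141 FPS and 0.17 s latency on a single GPU.

Insight: Flow matching shows promise for real-time talking-head generation, and system optimization is key to high throughput and low latency.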

Abstract: We present Livatar, a real-time audio-driven talking-head video generation framework. Existing baselines suffer from limited lip-sync accuracy and long-term pose drift. We address these limitations with a flow-matching-based framework. Coupled with system optimizations, Livatar achieves competitive lip-sync quality with an 8.50 LipSync Confidence on the HDTF dataset, and reaches a throughput of 141 FPS with an end-to-end latency of 0.17s on a single A10 GPU. This makes high-fidelity avatars accessible to broader applications. Our project is available at https://www.hedra.com/ with examples at https://h-liu1997.github.io/Livatar-1/

[20] Features extraction for image identification using computer vision

Venant Niyonkuru,Sylla Sekou,Jimmy Jackson Sinzinkayo

Main category: cs.CV

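TL;DR: This study examines a range of feature extraction techniques in computer vision, focusing on Vision Transformers (ViTs) and how they compare with other approaches such as GANs, deep feature models, and traditional methods.

Details Motivation: The study aims to compare the performance of different feature extraction techniques, in particular the strengths and weaknesses of ViTs relative to conventional CNNs, in order to advance computer vision.

Contribution: Summarizes the ViT architecture and its advantages, and analyzes the experimental results and practical utility of ViTs and other methods.

Method: Applies several feature extraction methods, including ViTs, GANs, deep feature models, and traditional methods (such as SIFT), and compares their performance experimentally.

Result: Experiments indicate that ViTs outperform conventional CNNs, while also revealing the strengths and weaknesses of the different methods.

Insight: Multi-head self-attention and patch embedding make ViTs stand out in feature extraction, but each method has its own suitable use cases.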

Abstract: This study examines various feature extraction techniques in computer vision, focusing primarily on Vision Transformers (ViTs) alongside other approaches such as Generative Adversarial Networks (GANs), deep feature models, traditional approaches (SIFT, SURF, ORB), and non-contrastive and contrastive feature models. Emphasizing ViTs, the report summarizes their architecture, including patch embedding, positional encoding, and multi-head self-attention mechanisms, with which they outperform conventional convolutional neural networks (CNNs). Experimental results determine the merits and limitations of both methods and their practical applications in advancing computer vision.

[21] Adapt, But Don’t Forget: Fine-Tuning and Contrastive Routing for Lane Detection under Distribution Shift

Mohammed Abdul Hafeez Khan,Parth Ganeriwala,Sarah M. Lehman,Siddhartha Bhattacharyya,Amy Alvarez,Natasha Neogi

Main category: cs.CV

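TL;DR: The paper proposes a fine-tuning and contrastive routing approach that addresses catastrophic forgetting in lane detection models under distribution shift, enabling parameter-efficient adaptation.

Details Motivation: Lane detection models perform well in closed-world settings, but cross-dataset distribution shifts cause catastrophic forgetting and degrade performance, a problem that needs to be solved.

Contribution: Proposes a parameter-efficient adaptation method that uses branch-wise fine-tuning and contrastive routing to handle distribution shift without training a separate model for each distribution.

Method: 1. Train a base model; 2. Create a separate branch for each target distribution and fine-tune only selected components; 3. Use supervised contrastive learning to route each input dynamically to the corresponding branch.

Result: Maintains high F1-scores while using significantly fewer parameters than training a separate model for each distribution.

Insight: Branching and dynamic routing let the model adapt to distribution shift without forgetting, offering a new direction for cross-domain adaptation research.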

Abstract: Lane detection models are often evaluated in a closed-world setting, where training and testing occur on the same dataset. We observe that, even within the same domain, cross-dataset distribution shifts can cause severe catastrophic forgetting during fine-tuning. To address this, we first train a base model on a source distribution and then adapt it to each new target distribution by creating separate branches, fine-tuning only selected components while keeping the original source branch fixed. Based on a component-wise analysis, we identify effective fine-tuning strategies for target distributions that enable parameter-efficient adaptation. At inference time, we propose using a supervised contrastive learning model to identify the input distribution and dynamically route it to the corresponding branch. Our framework achieves near-optimal F1-scores while using significantly fewer parameters than training separate models for each distribution.
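The inference-time routing step can be pictured with a small sketch. It assumes that an embedding of the input is already available (in the paper this would come from the supervised contrastive model) and that each branch is summarized by a centroid embedding; the nearest-centroid rule, the function names, and the toy vectors are illustrative assumptions, not the authors' stated mechanism.

```python
import math
from typing import Dict, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def route(embedding: Sequence[float], centroids: Dict[str, Sequence[float]]) -> str:
    """Pick the branch whose centroid is most similar to the input embedding."""
    return max(centroids, key=lambda name: cosine(embedding, centroids[name]))


# Hypothetical centroids: one for the frozen source branch, one for a target branch.
centroids = {"source": [1.0, 0.0], "target_a": [0.0, 1.0]}
branch = route([0.2, 0.9], centroids)
print(branch)
```

Because only the router and the selected branch run for a given input, the source branch stays untouched, which is the property the abstract relies on to avoid forgetting.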

[22] Part Segmentation of Human Meshes via Multi-View Human Parsing

James Dickens,Kamyar Hamad

Main category: cs.CV

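TL;DR: The paper proposes per-vertex semantic segmentation of human meshes via multi-view human parsing, and develops a pseudo-labeling pipeline and an efficient sampling strategy.

Details Motivation: Recent point cloud deep learning achieves high accuracy in geometric segmentation of unordered point sets, while human parsing focuses on predicting body part and clothing labels from images. The paper aims to combine the two to enable semantic segmentation of large-scale human meshes.

Contribution: 1) Develops a pseudo-labeling pipeline for the Thuman2.1 dataset; 2) Proposes a windowed iterative farthest point sampling strategy based on space-filling curves; 3) Implements purely geometric segmentation with PointTransformer.

Method: 1) Align human meshes to a canonical pose; 2) Segment from multiple viewpoints and backproject to generate pseudo-labels; 3) Downsample the point clouds with windowed iterative FPS and space-filling curve serialization; 4) Perform geometric segmentation with PointTransformer.

Result: Experiments confirm the effectiveness and accuracy of the approach, which parses human meshes semantically without relying on texture information.

Insight: Combining multi-view parsing with geometric segmentation generates pseudo-labels efficiently, and a purely geometric segmentation model performs well on complex human meshes.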

Abstract: Recent advances in point cloud deep learning have led to models that achieve high per-part labeling accuracy on large-scale point clouds, using only the raw geometry of unordered point sets. In parallel, the field of human parsing focuses on predicting body part and clothing/accessory labels from images. This work aims to bridge these two domains by enabling per-vertex semantic segmentation of large-scale human meshes. To achieve this, a pseudo-ground-truth labeling pipeline is developed for the Thuman2.1 dataset: meshes are first aligned to a canonical pose, segmented from multiple viewpoints, and the resulting point-level labels are then backprojected onto the original mesh to produce per-point pseudo-ground-truth annotations. Subsequently, a novel, memory-efficient sampling strategy is introduced to effectively downsample the point clouds: windowed iterative farthest point sampling (FPS) with space-filling curve-based serialization. This is followed by a purely geometric segmentation using PointTransformer, enabling semantic parsing of human meshes without relying on texture information. Experimental results confirm the effectiveness and accuracy of the proposed approach.
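For readers unfamiliar with farthest point sampling, the sketch below shows the plain (non-windowed) algorithm that the paper's strategy builds on: repeatedly pick the point farthest from everything selected so far. It is not the windowed, space-filling-curve variant described in the abstract, whose details are not given here; the function name and toy points are illustrative.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def farthest_point_sampling(points: Sequence[Point], k: int, start: int = 0) -> List[int]:
    """Return indices of k points chosen by greedy farthest point sampling."""
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and len(points)")
    selected = [start]
    # Distance from every point to its nearest already-selected point.
    min_dist = [math.dist(p, points[start]) for p in points]
    while len(selected) < k:
        nxt = max(range(len(points)), key=min_dist.__getitem__)
        selected.append(nxt)
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, points[nxt]))
    return selected


points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (10.0, 0.0, 0.0), (0.5, 0.0, 0.0)]
indices = farthest_point_sampling(points, 3)
print(indices)
```

Plain FPS costs O(n·k) distance evaluations and keeps a distance array over all points, which is one plausible reason to restrict it to windows of a serialized point cloud when meshes are large.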

[23] ShrinkBox: Backdoor Attack on Object Detection to Disrupt Collision Avoidance in Machine Learning-based Advanced Driver Assistance Systems

Muhammad Zaeem Shahzad,Muhammad Abdullah Hanif,Bassem Ouni,Muhammad Shafique

Main category: cs.CV

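TL;DR: The paper proposes ShrinkBox, a backdoor attack that subtly shrinks ground-truth bounding boxes used to train object detectors, disrupting collision avoidance in machine learning-based advanced driver assistance systems (ML-ADAS); the attack has a high success rate and is hard to detect.

Details Motivation: ML-ADAS relies on object detection and distance estimation, and its cost-effectiveness makes it attractive for low- and middle-income countries, but its security has not been sufficiently studied. ShrinkBox exploits a weakness in object detection, highlighting how vulnerable these systems are to adversarial attacks.

Contribution: Proposes ShrinkBox, a new backdoor attack that shrinks bounding boxes instead of manipulating class labels or object presence as existing attacks do. It achieves a high attack success rate (96%) and is stealthy, requiring only a 4% poisoning ratio.

Method: ShrinkBox shrinks bounding boxes in the training data, adopts a relaxed poisoning strategy with low error targets, affects object detectors such as YOLOv9m, and in turn disrupts distance estimation.

Result: On the KITTI dataset, the attack increases the mean absolute error (MAE) of distance estimation by more than 3x on poisoned samples, which may delay or entirely prevent collision warnings.

Insight: The study exposes a security weakness of object detectors in ML-ADAS, in particular the far-reaching effect of bounding-box shrinking on downstream tasks such as distance estimation, and underscores the need for adversarial defenses.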

Abstract: Advanced Driver Assistance Systems (ADAS) significantly enhance road safety by detecting potential collisions and alerting drivers. However, their reliance on expensive sensor technologies such as LiDAR and radar limits accessibility, particularly in low- and middle-income countries. Machine learning-based ADAS (ML-ADAS), leveraging deep neural networks (DNNs) with only standard camera input, offers a cost-effective alternative. Critical to ML-ADAS is the collision avoidance feature, which requires the ability to detect objects and estimate their distances accurately. This is achieved with specialized DNNs like YOLO, which provides real-time object detection, and a lightweight, detection-wise distance estimation approach that relies on key features extracted from the detections like bounding box dimensions and size. However, the robustness of these systems is undermined by security vulnerabilities in object detectors. In this paper, we introduce ShrinkBox, a novel backdoor attack targeting object detection in collision avoidance ML-ADAS. Unlike existing attacks that manipulate object class labels or presence, ShrinkBox subtly shrinks ground truth bounding boxes. This attack remains undetected in dataset inspections and standard benchmarks while severely disrupting downstream distance estimation. We demonstrate that ShrinkBox can be realized in the YOLOv9m object detector at an Attack Success Rate (ASR) of 96%, with only a 4% poisoning ratio in the training instances of the KITTI dataset. Furthermore, given the low error targets introduced in our relaxed poisoning strategy, we find that ShrinkBox increases the Mean Absolute Error (MAE) in downstream distance estimation by more than 3x on poisoned samples, potentially resulting in delays or prevention of collision warnings altogether.

[24] VGS-ATD: Robust Distributed Learning for Multi-Label Medical Image Classification Under Heterogeneous and Imbalanced Conditions

Zehui Zhao,Laith Alzubaidi,Haider A. Alwzwazy,Jinglan Zhang,Yuantong Gu

Main category: cs.CV

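TL;DR: VGS-ATD is a new distributed learning framework that addresses heterogeneity and class imbalance in multi-label medical image classification, with notable gains in privacy, scalability, and efficiency.

Details Motivation: Traditional centralized and decentralized learning methods suffer from privacy risks, inefficiency, and catastrophic forgetting when faced with heterogeneous, imbalanced data and system expansion; a more robust solution is needed.

Contribution: Proposes the VGS-ATD framework, which addresses heterogeneous data, class imbalance, frequent communication, and catastrophic forgetting, and shows superior performance across 30 datasets.

Method: Adopts a distributed learning paradigm in which local nodes train models and share weights, with mechanisms designed to reduce communication overhead and avoid catastrophic forgetting.

Result: VGS-ATD reaches an overall accuracy of 92.7%, outperforming centralized learning (84.9%) and swarm learning (72.99%); accuracy drops by only 1% after expansion, and computational cost falls by up to 50%.

Insight: VGS-ATD offers a practical route to efficient, privacy-preserving distributed learning in dynamic clinical environments, especially for multi-label and heterogeneous data.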

Abstract: In recent years, advanced deep learning architectures have shown strong performance in medical imaging tasks. However, the traditional centralized learning paradigm poses serious privacy risks as all data is collected and trained on a single server. To mitigate this challenge, decentralized approaches such as federated learning and swarm learning have emerged, allowing model training on local nodes while sharing only model weights. While these methods enhance privacy, they struggle with heterogeneous and imbalanced data and suffer from inefficiencies due to frequent communication and the aggregation of weights. More critically, the dynamic and complex nature of clinical environments demands scalable AI systems capable of continuously learning from diverse modalities and multilabels. Yet, both centralized and decentralized models are prone to catastrophic forgetting during system expansion, often requiring full model retraining to incorporate new data. To address these limitations, we propose VGS-ATD, a novel distributed learning framework. To validate VGS-ATD, we evaluate it in experiments spanning 30 datasets and 80 independent labels across distributed nodes. VGS-ATD achieved an overall accuracy of 92.7%, outperforming centralized learning (84.9%) and swarm learning (72.99%), while federated learning failed under these conditions due to high requirements on computational resources. VGS-ATD also demonstrated strong scalability, with only a 1% drop in accuracy on existing nodes after expansion, compared to a 20% drop in centralized learning, highlighting its resilience to catastrophic forgetting. Additionally, it reduced computational costs by up to 50% relative to both centralized and swarm learning, confirming its superior efficiency and scalability.

[25] Fuzzy Theory in Computer Vision: A Review

Adilet Yerkin,Ayan Igali,Elnara Kadyrgali,Maksat Shagyrov,Malika Ziyada,Muragul Muratbekova,Pakizar Shamoi

Main category: cs.CV

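TL;DR: This paper reviews the use of fuzzy theory in computer vision, emphasizing its role in handling uncertainty, noise, and imprecision in image data, and discusses the integration of fuzzy logic with deep learning as well as emerging trends.

Details Motivation: Computer vision tasks routinely face uncertainty in data and environment, which traditional hard-computing methods handle poorly. Because fuzzy theory can model gradual transitions and human-like reasoning, it is a promising solution.

Contribution: Systematically summarizes key fuzzy logic techniques in computer vision (such as fuzzy clustering and fuzzy inference systems) and their applications (such as medical imaging and autonomous systems), and explores the integration of fuzzy logic with deep learning and its role in explainable AI.

Method: As a review, it presents core fuzzy techniques (such as fuzzy clustering and fuzzy rule-based decision-making) and their integration with convolutional neural networks, and identifies the trend toward hybrid fuzzy-deep learning models.

Result: Fuzzy theory performs well in computer vision tasks, especially when handling uncertainty and noise, and its combination with deep learning further improves performance on complex vision tasks.

Insight: Fuzzy logic offers computer vision a flexible and interpretable approach, with broad prospects in explainable AI and hybrid models.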

Abstract: Computer vision applications are omnipresent nowadays. The current paper explores the use of fuzzy logic in computer vision, stressing its role in handling uncertainty, noise, and imprecision in image data. Fuzzy logic is able to model gradual transitions and human-like reasoning and provides a promising approach to computer vision. Fuzzy approaches offer a way to improve object recognition, image segmentation, and feature extraction by providing more adaptable and interpretable solutions compared to traditional methods. We discuss key fuzzy techniques, including fuzzy clustering, fuzzy inference systems, type-2 fuzzy sets, and fuzzy rule-based decision-making. The paper also discusses various applications, including medical imaging, autonomous systems, and industrial inspection. Additionally, we explore the integration of fuzzy logic with deep learning models such as convolutional neural networks (CNNs) to enhance performance in complex vision tasks. Finally, we examine emerging trends such as hybrid fuzzy-deep learning models and explainable AI.

[26] Eyes Will Shut: A Vision-Based Next GPS Location Prediction Model by Reinforcement Learning from Visual Map Feed Back

Ruixing Zhang,Yang Zhang,Tongyu Zhu,Leilei Sun,Weifeng Lv

Main category: cs.CV

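TL;DR: The paper proposes VLMLocPredictor, a method based on reinforcement learning from visual map feedback that uses a vision-language model (VLM) to imitate human map-based reasoning, achieving SOTA performance on next GPS location prediction.

Details Motivation: Humans reason over a visualized map when predicting locations, a capability that existing models lack. Inspired by advances of VLMs in visual perception and reasoning, the authors explore how to let models reason over map images in a human-like way.

Contribution: 1. Proposes Vision-Guided Location Search (VGLS), which evaluates the trajectory reasoning ability of a general-purpose VLM without modifying its parameters; 2. Proposes the two-stage VLMLocPredictor, which combines supervised fine-tuning (SFT) with reinforcement learning from visual map feedback to improve location prediction.

Method: 1. The first stage designs two SFT tasks that help the VLM understand road network and trajectory structures; 2. The second stage introduces reinforcement learning from visual map feedback, letting the model self-improve its predictions through interaction with the environment.

Result: On datasets from four cities, VLMLocPredictor achieves SOTA performance and shows better cross-city generalization than other LLM-based methods.

Insight: VLMs can reason over map images in a human-like way to improve location prediction, and reinforcement learning further strengthens the model's ability to self-improve through interaction.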

Abstract: Next Location Prediction is a fundamental task in the study of human mobility, with wide-ranging applications in transportation planning, urban governance, and epidemic forecasting. In practice, when humans attempt to predict the next location in a trajectory, they often visualize the trajectory on a map and reason based on road connectivity and movement trends. However, the vast majority of existing next-location prediction models do not reason over maps in the way that humans do. Fortunately, the recent development of Vision-Language Models (VLMs) has demonstrated strong capabilities in visual perception and even visual reasoning. This opens up a new possibility: by rendering both the road network and trajectory onto an image and leveraging the reasoning abilities of VLMs, we can enable models to perform trajectory inference in a human-like manner. To explore this idea, we first propose a method called Vision-Guided Location Search (VGLS), which evaluates whether a general-purpose VLM is capable of trajectory-based reasoning without modifying any of its internal parameters. Based on insights from the VGLS results, we further propose our main approach: VLMLocPredictor, which is composed of two stages: In the first stage, we design two Supervised Fine-Tuning (SFT) tasks that help the VLM understand road network and trajectory structures and acquire basic reasoning ability on such visual inputs. In the second stage, we introduce Reinforcement Learning from Visual Map Feedback, enabling the model to self-improve its next-location prediction ability through interaction with the environment. Experiments conducted on datasets from four different cities show that our method achieves state-of-the-art (SOTA) performance and exhibits superior cross-city generalization compared to other LLM-based approaches.

[27] Gen-AI Police Sketches with Stable Diffusion

Nicholas Fidalgo,Aaron Contreras,Katherine Harvey,Johnny Ni

Main category: cs.CV

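TL;DR: This study investigates three pipelines that use multimodal AI to automate and enhance suspect sketching; the baseline Stable Diffusion model performs best.

Details Motivation: Improve the automation and quality of suspect sketching with AI-driven multimodal methods to support police work.

Contribution: Proposes three models: a baseline Stable Diffusion model, a variant combined with CLIP, and a novel variant with a LoRA-fine-tuned CLIP model, and validates them experimentally.

Method: 1) A baseline Stable Diffusion model; 2) the same model integrated with a pre-trained CLIP model; 3) a CLIP model fine-tuned with LoRA and integrated with Stable Diffusion.

Result: The baseline model performs best on SSIM (0.72) and PSNR (25 dB), while iterative refinement improves LPIPS.

Insight: The baseline model is simple yet robust; LoRA fine-tuning helps text-image alignment but needs further optimization.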

Abstract: This project investigates the use of multimodal AI-driven approaches to automate and enhance suspect sketching. Three pipelines were developed and evaluated: (1) baseline image-to-image Stable Diffusion model, (2) same model integrated with a pre-trained CLIP model for text-image alignment, and (3) novel approach incorporating LoRA fine-tuning of the CLIP model, applied to self-attention and cross-attention layers, and integrated with Stable Diffusion. An ablation study confirmed that fine-tuning both self- and cross-attention layers yielded the best alignment between text descriptions and sketches. Performance testing revealed that Model 1 achieved the highest structural similarity (SSIM) of 0.72 and a peak signal-to-noise ratio (PSNR) of 25 dB, outperforming Model 2 and Model 3. Iterative refinement enhanced perceptual similarity (LPIPS), with Model 3 showing improvement over Model 2 but still trailing Model 1. Qualitatively, sketches generated by Model 1 demonstrated the clearest facial features, highlighting its robustness as a baseline despite its simplicity.
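To make the reported PSNR figure concrete, the sketch below computes the standard definition, PSNR = 10 · log10(MAX² / MSE), on flat lists of pixel intensities. The function name, the 8-bit peak value of 255, and the toy pixel values are assumptions for illustration; the abstract does not say how the authors computed the metric.

```python
import math
from typing import Sequence


def psnr(reference: Sequence[float], generated: Sequence[float], max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between two equally sized pixel sequences."""
    if len(reference) != len(generated) or not reference:
        raise ValueError("inputs must be non-empty and the same length")
    mse = sum((a - b) ** 2 for a, b in zip(reference, generated)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Every pixel is off by 10 intensity levels, so MSE = 100.
reference = [100.0, 120.0, 140.0, 160.0]
generated = [110.0, 130.0, 150.0, 170.0]
value = psnr(reference, generated)
print(round(value, 2))
```

Under this definition, a PSNR of 25 dB with a peak of 255 corresponds to a root-mean-square pixel error of roughly 14 intensity levels.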

[28] Advancing Vision-based Human Action Recognition: Exploring Vision-Language CLIP Model for Generalisation in Domain-Independent Tasks

Sanyam Jain,Marsha Mariya Kappan,Vijeta Sharma

Main category: cs.CV

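TL;DR: The paper examines how well the vision-language model CLIP generalizes to video action recognition, analyzes its performance under several masking strategies, and proposes an enhancement that strengthens the model's attention to key features.

Details Motivation: Traditional CNN and RNN models generalize poorly on complex action recognition, whereas CLIP shows promise for cross-domain tasks; however, its performance is unstable when key visual cues are masked.

Contribution: 1. Systematically evaluates CLIP on the UCF-101 dataset; 2. Proposes class-specific noise to strengthen the model's attention to key features.

Method: Analyzes CLIP under three masking strategies (percentage- and shape-based masking, feature-specific masking, and isolation masking), and designs a custom loss function to learn class-specific noise.

Result: CLIP behaves inconsistently when key visual cues are masked; the proposed noise enhancement improves classification accuracy and model confidence.

Insight: Vision-language models hold great promise for cross-domain action recognition but need further refinement to handle complex settings such as healthcare.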

Abstract: Human action recognition plays a critical role in healthcare and medicine, supporting applications such as patient behavior monitoring, fall detection, surgical robot supervision, and procedural skill assessment. While traditional models like CNNs and RNNs have achieved moderate success, they often struggle to generalize across diverse and complex actions. Recent advancements in vision-language models, especially the transformer-based CLIP model, offer promising capabilities for generalizing action recognition from video data. In this work, we evaluate CLIP on the UCF-101 dataset and systematically analyze its performance under three masking strategies: (1) percentage-based and shape-based black masking at 10%, 30%, and 50%, (2) feature-specific masking to suppress bias-inducing elements, and (3) isolation masking that retains only class-specific regions. Our results reveal that CLIP exhibits inconsistent behavior and frequent misclassifications, particularly when essential visual cues are obscured. To overcome these limitations, we propose incorporating class-specific noise, learned via a custom loss function, to reinforce attention to class-defining features. This enhancement improves classification accuracy and model confidence while reducing bias. We conclude with a discussion on the challenges of applying such models in clinical domains and outline directions for future work to improve generalizability across domain-independent healthcare scenarios.

[29] HeartUnloadNet: A Weakly-Supervised Cycle-Consistent Graph Network for Predicting Unloaded Cardiac Geometry from Diastolic States

Siyu Mu,Wei Xuan Chan,Choon Hwai Yap

Main category: cs.CV

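TL;DR: HeartUnloadNet is a deep learning framework that uses weak supervision and a cycle-consistency strategy to predict the unloaded cardiac geometry from the end-diastolic mesh, with substantial gains in computational efficiency and accuracy.

Details Motivation: Existing methods rely on computationally expensive inverse finite element solvers, which cannot meet real-time clinical needs; an efficient and accurate alternative is required.

Contribution: Proposes a novel graph attention network architecture that combines physiological priors with a cycle-consistency strategy, enabling partial self-supervision and reducing the need for large training datasets.

Method: Uses a graph attention network that accepts meshes of arbitrary size together with physiological parameters such as end-diastolic pressure and myocardial stiffness, and applies a cycle-consistency strategy for bidirectional prediction.

Result: Trained and tested on 20,700 simulations, the model reaches sub-millimeter accuracy (DSC 0.986, HD 0.083 cm) with an inference time of just 0.02 seconds, clearly outperforming traditional methods.

Insight: The cycle-consistent design lets the model maintain strong performance with as few as 200 training samples, opening the way to real-time clinical use.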

Abstract: The unloaded cardiac geometry (i.e., the state of the heart devoid of luminal pressure) serves as a valuable zero-stress and zero-strain reference and is critical for personalized biomechanical modeling of cardiac function, to understand both healthy and diseased physiology and to predict the effects of cardiac interventions. However, estimating the unloaded geometry from clinical images remains a challenging task. Traditional approaches rely on inverse finite element (FE) solvers that require iterative optimization and are computationally expensive. In this work, we introduce HeartUnloadNet, a deep learning framework that predicts the unloaded left ventricular (LV) shape directly from the end diastolic (ED) mesh while explicitly incorporating biophysical priors. The network accepts a mesh of arbitrary size along with physiological parameters such as ED pressure, myocardial stiffness scale, and fiber helix orientation, and outputs the corresponding unloaded mesh. It adopts a graph attention architecture and employs a cycle-consistency strategy to enable bidirectional (loading and unloading) prediction, allowing for partial self-supervision that improves accuracy and reduces the need for large training datasets. Trained and tested on 20,700 FE simulations across diverse LV geometries and physiological conditions, HeartUnloadNet achieves sub-millimeter accuracy, with an average DSC of 0.986 and HD of 0.083 cm, while reducing inference time to just 0.02 seconds per case, over 10^5 times faster and significantly more accurate than traditional inverse FE solvers. Ablation studies confirm the effectiveness of the architecture. Notably, the cycle-consistent design enables the model to maintain a DSC of 97% even with as few as 200 training samples. This work thus presents a scalable and accurate surrogate for inverse FE solvers, supporting real-time clinical applications in the future.
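The DSC reported above is the Dice similarity coefficient, DSC = 2|A ∩ B| / (|A| + |B|). The sketch below evaluates it on two sets of occupied voxel indices. How the authors rasterize or compare meshes is not stated in the abstract, so the set-of-voxels representation, the function name, and the toy data are assumptions.

```python
from typing import Hashable, Set


def dice(a: Set[Hashable], b: Set[Hashable]) -> float:
    """Dice similarity coefficient between two sets of occupied voxels."""
    if not a and not b:
        return 1.0  # two empty shapes are treated as identical
    return 2.0 * len(a & b) / (len(a) + len(b))


# Hypothetical voxel occupancy for a predicted and a reference shape.
predicted = {(0, 0, 0), (0, 0, 1), (0, 1, 0)}
reference = {(0, 0, 1), (0, 1, 0), (1, 0, 0)}
score = dice(predicted, reference)
print(round(score, 3))
```

The coefficient ranges from 0 (no overlap) to 1 (identical shapes), so the reported 0.986 indicates near-complete volumetric overlap.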

[30] Towards Scalable Spatial Intelligence via 2D-to-3D Data Lifting

Xingyu Miao,Haoran Duan,Quanhao Qian,Jiuniu Wang,Yang Long,Ling Shao,Deli Zhao,Ran Xu,Gongjie Zhang

Main category: cs.CV

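TL;DR: The paper proposes a scalable pipeline that converts single-view images into comprehensive 3D representations, addressing the scarcity of 3D data, and produces two new spatial datasets.

Details Motivation: Progress in spatial intelligence is limited by the scarcity of large-scale 3D data, whereas 2D images are abundant. The paper aims to close this gap through a new method.

Contribution: 1. Proposes a pipeline that generates 3D representations from single-view images; 2. Releases two new datasets, COCO-3D and Objects365-v2-3D; 3. Shows that the generated data benefits a range of 3D tasks.

Method: Combines depth estimation, camera calibration, and scale calibration to convert 2D images into 3D representations (point clouds, camera poses, depth maps, and pseudo-RGBD).

Result: Experiments show that the generated data can be used for a variety of 3D tasks, from fundamental perception to MLLM-based reasoning.

Insight: Generating 3D data automatically significantly lowers data collection costs and opens a new path for advancing spatial intelligence.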

Abstract: Spatial intelligence is emerging as a transformative frontier in AI, yet it remains constrained by the scarcity of large-scale 3D datasets. Unlike the abundant 2D imagery, acquiring 3D data typically requires specialized sensors and laborious annotation. In this work, we present a scalable pipeline that converts single-view images into comprehensive, scale- and appearance-realistic 3D representations - including point clouds, camera poses, depth maps, and pseudo-RGBD - via integrated depth estimation, camera calibration, and scale calibration. Our method bridges the gap between the vast repository of imagery and the increasing demand for spatial scene understanding. By automatically generating authentic, scale-aware 3D data from images, we significantly reduce data collection costs and open new avenues for advancing spatial intelligence. We release two generated spatial datasets, i.e., COCO-3D and Objects365-v2-3D, and demonstrate through extensive experiments that our generated data can benefit various 3D tasks, ranging from fundamental perception to MLLM-based reasoning. These results validate our pipeline as an effective solution for developing AI systems capable of perceiving, understanding, and interacting with physical environments.
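One way to see how a depth map plus camera calibration yields a point cloud is the standard pinhole back-projection, X = (u − cx)·z / fx, Y = (v − cy)·z / fy, Z = z. The sketch below applies it to a tiny depth map. Whether the paper's pipeline uses exactly this model is an assumption, and the function name, intrinsics, and depth values are made up for illustration.

```python
from typing import List, Sequence, Tuple

Point3D = Tuple[float, float, float]


def backproject(
    depth: Sequence[Sequence[float]], fx: float, fy: float, cx: float, cy: float
) -> List[Point3D]:
    """Lift each valid depth pixel (u, v, z) to a 3D point in the camera frame."""
    points: List[Point3D] = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:
                continue  # skip pixels with missing or invalid depth
            points.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
    return points


# Hypothetical 2x2 metric depth map and pinhole intrinsics.
depth = [[2.0, 0.0], [4.0, 2.0]]
cloud = backproject(depth, fx=1.0, fy=1.0, cx=0.5, cy=0.5)
print(cloud)
```

Because X and Y scale linearly with z, a depth map that is only correct up to scale produces a point cloud of the wrong physical size, which is why scale calibration matters for scale-aware 3D data.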

[31] KuiSCIMA v2.0: Improved Baselines, Calibration, and Cross-Notation Generalization for Historical Chinese Music Notations in Jiang Kui’s Baishidaoren Gequ

Tristan Repolusk,Eduardo Veas

Main category: cs.CV

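TL;DR: The paper improves optical music recognition (OMR) for historical Chinese music notations (such as suzipu and lülüpu), substantially lowering the character error rate (CER), and extends the dataset, promoting cultural diversity in OMR.

Details Motivation: OMR for historical Chinese music notations (such as suzipu and lülüpu) faces data scarcity and class imbalance, which limit their digitization and accessibility.

Contribution: 1. Markedly improves recognition accuracy for suzipu (CER reduced from 10.4% to 7.1%) and lülüpu (CER of 0.9%); 2. Calibrates the model with temperature scaling (ECE < 0.0162); 3. Extends the KuiSCIMA dataset to cover all 109 pieces in several notations.

Method: 1. Develops a character recognition model for scarce, imbalanced data; 2. Applies temperature scaling for calibration; 3. Uses leave-one-edition-out cross-validation to ensure robustness across editions.

Result: The models outperform human transcribers (average human CER 15.9%, best case 7.6%) and remain stable in cross-edition tests.

Insight: The work not only improves recognition for specific notations but also offers techniques transferable to OMR for other underrepresented music traditions, advancing the digitization of culturally diverse music.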

Abstract: Optical Music Recognition (OMR) for historical Chinese musical notations, such as suzipu and lülüpu, presents unique challenges due to high class imbalance and limited training data. This paper introduces significant advancements in OMR for Jiang Kui’s influential collection Baishidaoren Gequ from 1202. In this work, we develop and evaluate a character recognition model for scarce imbalanced data. We improve upon previous baselines by reducing the Character Error Rate (CER) from 10.4% to 7.1% for suzipu, despite working with 77 highly imbalanced classes, and achieve a remarkable CER of 0.9% for lülüpu. Our models outperform human transcribers, with an average human CER of 15.9% and a best-case CER of 7.6%. We employ temperature scaling to achieve a well-calibrated model with an Expected Calibration Error (ECE) below 0.0162. Using a leave-one-edition-out cross-validation approach, we ensure robust performance across five historical editions. Additionally, we extend the KuiSCIMA dataset to include all 109 pieces from Baishidaoren Gequ, encompassing suzipu, lülüpu, and jianzipu notations. Our findings advance the digitization and accessibility of historical Chinese music, promoting cultural diversity in OMR and expanding its applicability to underrepresented music traditions.
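Two quantities in this abstract have simple textbook definitions that a short sketch can make concrete: CER as Levenshtein edit distance divided by reference length, and temperature scaling as dividing logits by a scalar T before the softmax. The function names and example inputs are illustrative assumptions; the paper's exact evaluation code and fitted temperature are not given in the abstract.

```python
import math
from typing import List, Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = cur
    return prev[-1]


def cer(ref: Sequence, hyp: Sequence) -> float:
    """Character error rate: edit distance normalized by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def softmax_with_temperature(logits: Sequence[float], temperature: float) -> List[float]:
    """Softmax of logits / temperature; temperature > 1 softens the distribution."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


error = cer("abcd", "abxd")  # one substitution in four characters
sharp = softmax_with_temperature([2.0, 1.0, 0.0], 1.0)
soft = softmax_with_temperature([2.0, 1.0, 0.0], 2.0)
print(error, max(sharp), max(soft))
```

Temperature scaling leaves the predicted class unchanged and only adjusts confidence, which is why it can reduce calibration error without affecting the CER.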

[32] SAR-TEXT: A Large-Scale SAR Image-Text Dataset Built with SAR-Narrator and Progressive Transfer Learning

Xinjun Cheng,Yiguo He,Junjie Zhu,Chunping Qiu,Jun Wang,Qiangjuan Huang,Ke Yang

Main category: cs.CV

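TL;DR: The paper builds the SAR-Text dataset of more than 130,000 SAR image-text pairs, with descriptions generated by SAR-Narrator through multi-stage progressive transfer learning. The dataset markedly improves semantic understanding of SAR imagery, and its effectiveness is validated on image-text retrieval, image captioning, and visual question answering.

Details Motivation: SAR imagery is increasingly important in remote sensing, but the lack of large-scale, high-quality image-text datasets has held back research on its semantic understanding.

Contribution: 1. Builds the large-scale, high-quality SAR-Text dataset; 2. Proposes the SAR-Narrator framework, which generates text descriptions via progressive transfer learning; 3. Validates the dataset's advantages on multiple tasks.

Method: Uses a multi-stage progressive transfer learning strategy (the SAR-Narrator framework) to generate text descriptions of SAR images and builds SAR-Text from them. The experiments use the SAR-RS-CLIP, SAR-RS-CoCa, and SAR-GPT models for task validation.

Result: In image-text retrieval, SAR-RS-CLIP notably improves recall; in image captioning, SAR-RS-CoCa scores far above the baseline model; in VQA, SAR-GPT performs strongly.

Insight: As a flexible annotation tool, SAR-Narrator can help build larger SAR datasets and provides important support for research on semantic understanding of SAR imagery.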

Abstract: Vision Language Models (VLMs) have achieved remarkable breakthroughs in the field of remote sensing in recent years. Synthetic Aperture Radar (SAR) imagery, with its all-weather capability, is essential in remote sensing, yet the lack of large-scale, high-quality SAR image-text datasets hinders its semantic understanding. In this paper, we construct SAR-Text, a large-scale and high-quality dataset consisting of over 130,000 SAR image-text pairs. To construct the SAR-Text dataset, we design the SAR-Narrator framework, which generates textual descriptions for SAR images through a multi-stage progressive transfer learning strategy. To verify the effectiveness of the SAR-TEXT dataset, we conduct experiments on three typical vision-language tasks: image-text retrieval, image captioning, and visual question answering (VQA). Specifically, we construct three representative models on SAR-TEXT: SAR-RS-CLIP, SAR-RS-CoCa, and SAR-GPT. SAR-RS-CLIP achieves notable improvements in retrieval performance, boosting average recall by 16.43% and 10.54% on the OSdataset-512 and HRSID test sets, respectively. In the captioning task, SAR-RS-CoCa achieves BLEU-4, SPICE, and CIDEr scores exceeding those of the original CoCa model by more than 8x, 4x, and 10x, respectively. In the VQA task, SAR-GPT outperforms baseline and single-stage models on multiple SAR-VQA datasets, demonstrating stronger semantic understanding and reasoning ability, as further confirmed by qualitative results. It is worth noting that, as a flexible captioning tool, SAR-Narrator can be readily adopted by the community to construct larger-scale SAR image-text datasets.

[33] Learning Efficient and Generalizable Human Representation with Human Gaussian Model

Yifan Liu,Shengjun Zhang,Chensheng Dai,Yang Chen,Hao Liu,Chen Li,Yueqi Duan

Main category: cs.CV

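TL;DR: This paper proposes Human Gaussian Graph, which models human representation with a dual-layer graph structure (Gaussian nodes and mesh nodes) and leverages relations across frames to produce an animatable human representation.

Details Motivation: Existing methods mostly rely on per-instance optimization or predict Gaussians independently for each frame, without fully exploiting the relations among Gaussians at different timestamps, which makes them inefficient and poor at generalizing.

Contribution: Proposes Human Gaussian Graph, which uses a dual-layer graph and two node operations (intra-node aggregation and inter-node message passing) for efficient, generalizable human representation modeling.

Method: Builds a dual-layer graph combining Gaussian nodes and SMPL mesh vertex nodes, with an intra-node operation that aggregates the Gaussians connected to one mesh vertex and an inter-node operation that passes messages among neighboring mesh nodes.

Result: Efficiency and generalization are validated on novel view synthesis and novel pose animation.

Insight: Modeling cross-frame relations with a graph structure makes more efficient use of global information and improves the animatability and generalization of the representation.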

Abstract: Modeling animatable human avatars from videos is a long-standing and challenging problem. While conventional methods require per-instance optimization, recent feed-forward methods have been proposed to generate 3D Gaussians with a learnable network. However, these methods predict Gaussians for each frame independently, without fully capturing the relations of Gaussians from different timestamps. To address this, we propose Human Gaussian Graph to model the connection between predicted Gaussians and human SMPL mesh, so that we can leverage information from all frames to recover an animatable human representation. Specifically, the Human Gaussian Graph contains dual layers where Gaussians are the first layer nodes and mesh vertices serve as the second layer nodes. Based on this structure, we further propose the intra-node operation to aggregate various Gaussians connected to one mesh vertex, and inter-node operation to support message passing among mesh node neighbors. Experimental results on novel view synthesis and novel pose animation demonstrate the efficiency and generalization of our method.

[34] Diffusion-FS: Multimodal Free-Space Prediction via Diffusion for Autonomous Driving

Keshav Gupta,Tejas S. Stanley,Pranjal Paul,Arun K. Singh,K. Madhava Krishna

Main category: cs.CV

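TL;DR: The paper proposes Diffusion-FS, a diffusion-based multimodal free-space prediction method that predicts drivable corridors for autonomous driving from monocular camera input, and introduces the ContourDiff architecture to improve predictions.

Details Motivation: Current free-space prediction methods usually treat the entire non-obstacle road region as free space, whereas this paper aims to predict drivable corridors, a finer-grained subset. Existing corridor estimation methods depend on BEV representations, which are hard to obtain, so estimating visual corridors from monocular camera input is more practical.

Contribution: 1. Proposes a self-supervised approach that generates free-space samples from future ego trajectories and monocular images; 2. Introduces ContourDiff, which denoises contour points instead of using binary mask representations, yielding structured and interpretable free-space predictions.

Method: 1. Models the distribution of free-space corridors in the image with a diffusion model; 2. Develops the ContourDiff architecture, which produces predictions by denoising contour points.

Result: Experiments on nuScenes and CARLA show that the method accurately predicts multimodal drivable corridors in the image.

Insight: Framing free-space prediction as a pure monocular image perception task and combining diffusion models with a contour representation improves the structure and interpretability of predictions.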

Abstract: Drivable free-space prediction is a fundamental and crucial problem in autonomous driving. Recent works have addressed the problem by representing the entire non-obstacle road region as the free space. In contrast, our aim is to estimate the driving corridors that are a navigable subset of the entire road region. Unfortunately, existing corridor estimation methods directly assume a BEV-centric representation, which is hard to obtain. We instead frame drivable free-space corridor prediction as a pure image perception task, using only monocular camera input. However, such a formulation poses several challenges, since corresponding data for such free-space corridor segments in the image is not available. Consequently, we develop a novel self-supervised approach for free-space sample generation by leveraging future ego trajectories and front-view camera images, making the process of visual corridor estimation dependent on the ego trajectory. We then employ a diffusion process to model the distribution of such segments in the image. However, the existing binary mask-based representation for a segment poses many limitations. Therefore, we introduce ContourDiff, a specialized diffusion-based architecture that denoises over contour points rather than relying on binary mask representations, enabling structured and interpretable free-space predictions. We evaluate our approach qualitatively and quantitatively on both nuScenes and CARLA, demonstrating its effectiveness in accurately predicting safe multimodal navigable corridors in the image.

[35] Tell Me What You See: An Iterative Deep Learning Framework for Image Captioning

Hitesh Kumar Gupta

Main category: cs.CV

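TL;DR: This paper presents an iterative deep learning framework for image captioning, progressing from a basic CNN-LSTM to Nexus, an advanced model with a dynamic attention mechanism. Experiments confirm the key role of attention, and the final model performs strongly on MS COCO.

Details Motivation: Image captioning requires understanding both visual scenes and linguistic structure. Although large Transformer-based models dominate current practice, this paper uses systematic, iterative development to trace how foundational models evolve and to verify the importance of attention.

Contribution: 1. An iterative development framework that upgrades step by step from a simple model to a more complex one (Nexus). 2. Evidence that upgrading the visual backbone needs to be paired with an attention mechanism. 3. The final model, Nexus, achieves a BLEU-4 score of 31.4 on MS COCO.

Method: Starting from a basic CNN-LSTM encoder-decoder, the work progressively adds attention and a stronger visual backbone (EfficientNetV2B3), ending with Nexus, which uses a dynamic attention mechanism.

Result: Nexus achieves a BLEU-4 score of 31.4 on MS COCO 2017, surpassing several baseline models.

Insight: Upgrading only the visual backbone without attention can degrade performance, because the single-vector bottleneck cannot transmit the richer visual detail. This finding supports the need for attention in modern vision-language tasks.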

Abstract: Image captioning, a task at the confluence of computer vision and natural language processing, requires a sophisticated understanding of both visual scenes and linguistic structure. While modern approaches are dominated by large-scale Transformer architectures, this paper documents a systematic, iterative development of foundational image captioning models, progressing from a simple CNN-LSTM encoder-decoder to a competitive attention-based system. We present a series of five models, beginning with Genesis and concluding with Nexus, an advanced model featuring an EfficientNetV2B3 backbone and a dynamic attention mechanism. Our experiments chart the impact of architectural enhancements and demonstrate a key finding within the classic CNN-LSTM paradigm: merely upgrading the visual backbone without a corresponding attention mechanism can degrade performance, as the single-vector bottleneck cannot transmit the richer visual detail. This insight validates the architectural shift to attention. Trained on the MS COCO 2017 dataset, our final model, Nexus, achieves a BLEU-4 score of 31.4, surpassing several foundational benchmarks and validating our iterative design process. This work provides a clear, replicable blueprint for understanding the core architectural principles that underpin modern vision-language tasks.
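
For readers unfamiliar with the reported metric, the following is a minimal sketch of sentence-level BLEU-4 with a single reference: clipped n-gram precisions for n = 1..4, their geometric mean, and a brevity penalty. It is unsmoothed and returns a value in [0, 1]; published scores such as 31.4 are typically corpus-level and scaled by 100, and the paper's exact evaluation tooling is not stated in the abstract. The function names are illustrative.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu4(candidate, reference):
    """Unsmoothed sentence-level BLEU-4 against one reference, in [0, 1]."""
    log_precision_sum = 0.0
    for n in range(1, 5):
        cand = ngrams(candidate, n)
        ref = ngrams(reference, n)
        total = sum(cand.values())
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
        if total == 0 or clipped == 0:
            return 0.0  # without smoothing, any zero precision gives BLEU = 0
        log_precision_sum += math.log(clipped / total)
    c, r = len(candidate), len(reference)
    # Brevity penalty punishes candidates shorter than the reference.
    brevity_penalty = 1.0 if c > r else math.exp(1.0 - r / c)
    return brevity_penalty * math.exp(log_precision_sum / 4.0)

reference = "a man is riding a horse on the beach".split()
perfect = bleu4(reference, reference)
unrelated = bleu4("two dogs play in snow".split(), reference)
```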

[36] Deepfake Detection Via Facial Feature Extraction and Modeling

Benjamin Carter,Nathan Dilla,Micheal Callahan,Atuhaire Ambala

Main category: cs.CV

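TL;DR: The paper proposes a deepfake detection method based on facial feature extraction and modeling. It detects subtle inconsistencies in facial movements instead of processing raw images directly, and achieves effective detection.

Details Motivation: As deepfake technology advances, forged media is increasingly hard to tell apart from genuine media, so new detection methods are needed. Most existing methods rely on raw image processing; this paper takes a different route and focuses on extracting facial features.

Contribution: The main contribution is a deepfake detection method that relies solely on facial landmarks, avoids complex image processing, and is shown to work across several neural network models.

Method: Facial landmarks are extracted and fed to several neural network models (RNN, ANN, and CNN) to detect inconsistencies in facial movements. Experiments show the method performs well with fewer parameters.

Result: The RNN and ANN models reach accuracies of 96% and 93%, respectively, and the CNN model about 78%, supporting the effectiveness of the method.

Insight: Deepfake detection does not necessarily require complex image processing; extracting and analyzing facial landmarks can give reliable detection results with better computational efficiency.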

Abstract: The rise of deepfake technology brings forth new questions about the authenticity of various forms of media found online today. Videos and images generated by artificial intelligence (AI) have become increasingly difficult to differentiate from genuine media, resulting in the need for new models to detect artificially-generated media. While many models have attempted to solve this, most focus on direct image processing, adapting a convolutional neural network (CNN) or a recurrent neural network (RNN) that directly interacts with the video image data. This paper introduces an approach of using solely facial landmarks for deepfake detection. Using a dataset consisting of both deepfake and genuine videos of human faces, this paper describes an approach for extracting facial landmarks for deepfake detection, focusing on identifying subtle inconsistencies in facial movements instead of raw image processing. Experimental results demonstrated that this feature extraction technique is effective across neural network models: the same facial landmarks were tested on three models, and the performance metrics were promising, indicating potential for real-world applications. The findings discussed in this paper include RNN and artificial neural network (ANN) models with accuracies of 96% and 93%, respectively, with a CNN model hovering around 78%. This research challenges the assumption that raw image processing is necessary to identify deepfake videos by presenting a facial feature extraction approach compatible with various neural network models while requiring fewer parameters.
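
The abstract does not say how landmark sequences are turned into model inputs. As one hypothetical illustration of "inconsistencies in facial movements", the sketch below converts per-frame landmark coordinates into frame-to-frame displacement magnitudes, a compact motion feature that a sequence model could consume. The function name `landmark_motion` and the toy data are assumptions, not the authors' pipeline.

```python
import math

def landmark_motion(frames):
    """Compute per-landmark displacement between consecutive frames.

    frames: list of frames, each a list of (x, y) landmark coordinates,
            with landmarks in the same order in every frame.
    Returns one list of displacement magnitudes per frame transition.
    """
    motion = []
    for prev, curr in zip(frames, frames[1:]):
        motion.append([math.hypot(cx - px, cy - py)
                       for (px, py), (cx, cy) in zip(prev, curr)])
    return motion

# Toy example: two landmarks tracked over two frames (made-up coordinates).
frames = [[(0.0, 0.0), (1.0, 1.0)],
          [(3.0, 4.0), (1.0, 1.0)]]
features = landmark_motion(frames)
```

Working on a few dozen coordinates per frame instead of full images is what keeps the input, and therefore the parameter count, small.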

[37] Flow Stochastic Segmentation Networks

Fabio De Sousa Ribeiro,Omar Todd,Charles Jones,Avinash Kori,Raghav Mehta,Ben Glocker

Main category: cs.CV

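TL;DR: Flow-SSN is a family of generative segmentation models with discrete-time autoregressive and continuous-time flow variants. It addresses the limitations of low-rank parameterisation and efficiently learns high-rank pixel-wise covariances.

Details Motivation: Existing segmentation models are limited by their low-rank parameterisation and cannot efficiently estimate high-rank pixel-wise covariances. Flow-SSN aims to solve this problem.

Contribution: Proposes Flow-SSN, which can estimate arbitrarily high-rank pixel-wise covariances without assuming the rank or storing the distributional parameters, and which is more efficient to sample from than standard diffusion models.

Method: Provides discrete-time autoregressive and continuous-time flow variants, and allocates most of the model capacity to learning the base distribution of the flow, which serves as an expressive prior.

Result: Achieves state-of-the-art results on medical imaging benchmarks.

Insight: Improving the structural design of generative segmentation models can yield more efficient covariance estimation and sampling, which is especially useful when the data have high-rank structure.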

Abstract: We introduce the Flow Stochastic Segmentation Network (Flow-SSN), a generative segmentation model family featuring discrete-time autoregressive and modern continuous-time flow variants. We prove fundamental limitations of the low-rank parameterisation of previous methods and show that Flow-SSNs can estimate arbitrarily high-rank pixel-wise covariances without assuming the rank or storing the distributional parameters. Flow-SSNs are also more efficient to sample from than standard diffusion-based segmentation models, thanks to most of the model capacity being allocated to learning the base distribution of the flow, constituting an expressive prior. We apply Flow-SSNs to challenging medical imaging benchmarks and achieve state-of-the-art results. Code available: https://github.com/biomedia-mira/flow-ssn.

[38] PTCMIL: Multiple Instance Learning via Prompt Token Clustering for Whole Slide Image Analysis

Beidi Zhao,SangMook Kim,Hao Chen,Chen Zhou,Zu-hua Gao,Gang Wang,Xiaoxiao Li

Main category: cs.CV

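TL;DR: PTCMIL is a new ViT-based method that uses prompt token clustering for multiple instance learning in WSI analysis. By dynamically aligning clustering with the downstream task, it improves both performance and interpretability.

Details Motivation: Existing MIL methods cope poorly with the complexity and heterogeneity of WSIs. Clustering-based approaches in particular are computationally expensive and fail to capture task-specific and slide-specific variability.

Contribution: Proposes PTCMIL, which introduces learnable prompt tokens to unify clustering and prediction in an end-to-end framework and dynamically adjusts the clustering to fit the downstream task.

Method: Uses projection-based clustering together with token merging and prototype-based pooling to capture task-relevant patterns efficiently.

Result: Experiments on eight datasets show that PTCMIL outperforms existing methods on classification and survival analysis tasks.

Insight: Dynamic clustering and prototype pooling reduce computational complexity while preserving patch heterogeneity, improving the model's robustness and interpretability.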

Abstract: Multiple Instance Learning (MIL) has advanced WSI analysis but struggles with the complexity and heterogeneity of WSIs. Existing MIL methods face challenges in aggregating diverse patch information into robust WSI representations. While ViTs and clustering-based approaches show promise, they are computationally intensive and fail to capture task-specific and slide-specific variability. To address these limitations, we propose PTCMIL, a novel Prompt Token Clustering-based ViT for MIL aggregation. By introducing learnable prompt tokens into the ViT backbone, PTCMIL unifies clustering and prediction tasks in an end-to-end manner. It dynamically aligns clustering with downstream tasks, using projection-based clustering tailored to each WSI, reducing complexity while preserving patch heterogeneity. Through token merging and prototype-based pooling, PTCMIL efficiently captures task-relevant patterns. Extensive experiments on eight datasets demonstrate its superior performance in classification and survival analysis tasks, outperforming state-of-the-art methods. Systematic ablation studies confirm its robustness and strong interpretability. The code is released at https://github.com/ubc-tea/PTCMIL.

[39] HQ-SMem: Video Segmentation and Tracking Using Memory Efficient Object Embedding With Selective Update and Self-Supervised Distillation Feedback

Elham Soltani Kazemi,Imad Eddine Toubal,Gani Rahmon,Jaired Collins,K. Palaniappan

Main category: cs.CV

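TL;DR: HQ-SMem improves video object segmentation (VOS) through a smart memory mechanism, combining high-quality masks, dynamic memory updates, and self-supervised distillation feedback to improve segmentation and tracking performance.

Details Motivation: Existing VOS models perform poorly with complex object deformations, long sequences, and multi-object dynamics. HQ-SMem aims to address these problems and improve the accuracy and efficiency of segmentation and tracking.

Contribution: 1. Combines SAM-HQ with candidate selection to improve segmentation boundaries; 2. A dynamic smart memory that optimizes long-video processing; 3. A dynamically updated appearance model that reduces the impact of drift and topological change.

Method: 1. Generate high-quality masks with SAM-HQ; 2. Selectively store key frames in a smart memory; 3. Dynamically update the appearance model with self-supervised distillation feedback.

Result: Ranks among the top two on the VOTS and VOTSt 2024 datasets and sets new benchmarks on Long Video Dataset and LVOS.

Insight: The dynamic memory and appearance-update mechanisms markedly improve robustness in complex scenes, and self-supervised feedback further reduces reliance on annotation.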

Abstract: Video Object Segmentation (VOS) is foundational to numerous computer vision applications, including surveillance, autonomous driving, robotics and generative video editing. However, existing VOS models often struggle with precise mask delineation, deformable objects, topologically transforming objects, tracking drift and long video sequences. In this paper, we introduce HQ-SMem, for High Quality video segmentation and tracking using Smart Memory, a novel method that enhances the performance of VOS base models by addressing these limitations. Our approach incorporates three key innovations: (i) leveraging SAM with High-Quality masks (SAM-HQ) alongside appearance-based candidate-selection to refine coarse segmentation masks, resulting in improved object boundaries; (ii) implementing a dynamic smart memory mechanism that selectively stores relevant key frames while discarding redundant ones, thereby optimizing memory usage and processing efficiency for long-term videos; and (iii) dynamically updating the appearance model to effectively handle complex topological object variations and reduce drift throughout the video. These contributions mitigate several limitations of existing VOS models including, coarse segmentations that mix-in background pixels, fixed memory update schedules, brittleness to drift and occlusions, and prompt ambiguity issues associated with SAM. Extensive experiments conducted on multiple public datasets and state-of-the-art base trackers demonstrate that our method consistently ranks among the top two on VOTS and VOTSt 2024 datasets. Moreover, HQ-SMem sets new benchmarks on Long Video Dataset and LVOS, showcasing its effectiveness in challenging scenarios characterized by complex multi-object dynamics over extended temporal durations.
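
Contribution (ii) describes a memory that keeps relevant key frames and discards redundant ones. The abstract does not give the selection rule, so the sketch below shows one hypothetical policy: a frame embedding is stored only if it is not too similar (by cosine similarity) to anything already in memory, and the oldest non-initial entry is evicted when a capacity limit is exceeded. The threshold, capacity, and function names are illustrative assumptions, not HQ-SMem's actual criteria.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0

def update_memory(memory, embedding, threshold=0.95, capacity=8):
    """Return the memory after considering one new frame embedding.

    Hypothetical rule: skip the frame if it is redundant with a stored one;
    otherwise append it, and if over capacity drop the oldest entry after
    the first (the first frame usually carries the initial object mask).
    """
    if any(cosine(embedding, kept) >= threshold for kept in memory):
        return memory  # redundant frame, memory unchanged
    memory = memory + [embedding]
    if len(memory) > capacity:
        memory.pop(1)
    return memory

memory = []
memory = update_memory(memory, [1.0, 0.0])     # stored: memory was empty
memory = update_memory(memory, [0.999, 0.01])  # skipped: nearly identical
memory = update_memory(memory, [0.0, 1.0])     # stored: new appearance
```

Compared with a fixed update schedule (for example, every k-th frame), a similarity-gated rule lets memory growth follow how much the object's appearance actually changes.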

[40] MGHFT: Multi-Granularity Hierarchical Fusion Transformer for Cross-Modal Sticker Emotion Recognition

Jian Chen,Yuxuan Hu,Haifeng Lu,Wei Wang,Min Yang,Chengming Li,Xiping Hu

Main category: cs.CV

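TL;DR: The paper proposes a multi-granularity hierarchical fusion transformer (MGHFT) that uses Multimodal Large Language Models and multi-view descriptions to improve sticker emotion recognition, significantly outperforming existing methods.

Details Motivation: Sticker emotion recognition depends on multi-view information such as background knowledge and stylistic cues, and existing pre-trained visual models fall short when extracting such multimodal features.

Contribution: 1. The MGHFT framework, which uses Multimodal Large Language Models to generate multi-view textual descriptions; 2. A hierarchical fusion strategy and a text-guided attention mechanism that strengthen the fusion of visual and textual features.

Method: 1. Generate multi-view textual descriptions with Multimodal Large Language Models; 2. Extract global and local visual features at multiple stages with a pyramid visual transformer; 3. Fuse textual and visual features through contrastive learning and attention mechanisms.

Result: Significantly outperforms existing methods on two public datasets, improving F1 by 5.4% and accuracy by 4.0% over the best pre-trained visual models.

Insight: Multi-view textual descriptions effectively support visual feature extraction, and the hierarchical fusion strategy and text-guided attention are key to multimodal emotion recognition.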

Abstract: Although pre-trained visual models with text have demonstrated strong capabilities in visual feature extraction, sticker emotion understanding remains challenging due to its reliance on multi-view information, such as background knowledge and stylistic cues. To address this, we propose a novel multi-granularity hierarchical fusion transformer (MGHFT), with a multi-view sticker interpreter based on Multimodal Large Language Models. Specifically, inspired by the human ability to interpret sticker emotions from multiple views, we first use Multimodal Large Language Models to interpret stickers by providing rich textual context via multi-view descriptions. Then, we design a hierarchical fusion strategy to fuse the textual context into visual understanding, which builds upon a pyramid visual transformer to extract both global and local sticker features at multiple stages. Through contrastive learning and attention mechanisms, textual features are injected at different stages of the visual backbone, enhancing the fusion of global- and local-granularity visual semantics with textual guidance. Finally, we introduce a text-guided fusion attention mechanism to effectively integrate the overall multimodal features, enhancing semantic understanding. Extensive experiments on 2 public sticker emotion datasets demonstrate that MGHFT significantly outperforms existing sticker emotion recognition approaches, achieving higher accuracy and more fine-grained emotion recognition. Compared to the best pre-trained visual models, our MGHFT also obtains an obvious improvement, 5.4% on F1 and 4.0% on accuracy. The code is released at https://github.com/cccccj-03/MGHFT_ACMMM2025.

[41] PDT: Point Distribution Transformation with Diffusion Models

Jionghao Wang,Cheng Lin,Yuan Liu,Rui Xu,Zhiyang Dou,Xiao-Xiao Long,Hao-Xiang Guo,Taku Komura,Wenping Wang,Xin Li

Main category: cs.CV

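TL;DR: The paper proposes PDT, a method that uses diffusion models to perform semantic transformation of point distributions. It learns to turn unstructured point clouds into structured target distributions and is shown to be effective on several geometry processing tasks.

Details Motivation: Point representations are important in geometry processing, but how to extract semantics from unstructured point clouds and transform them into meaningful distributions remains under-explored. PDT aims to address this challenge.

Contribution: 1. The PDT framework, described as the first to apply diffusion models to point distribution transformation; 2. A novel architecture and learning strategy that link the source and target distributions through a denoising process; 3. Validation on several structured-output tasks.

Method: Based on diffusion models, the method models the transformation of point distributions as a denoising process. With a dedicated network architecture and learning strategy, the input point cloud is gradually transformed into the target semantic distribution.

Result: Experiments show that PDT transforms point clouds into a range of structured outputs (surface-aligned keypoints, inner sparse joints, continuous feature lines), demonstrating its ability to capture geometric and semantic features.

Insight: Diffusion models can be used not only for generation but also for structured transformation of point clouds, offering a general and powerful tool for 3D geometry processing.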

Abstract: Point-based representations have consistently played a vital role in geometric data structures. Most point cloud learning and processing methods typically leverage the unordered and unconstrained nature to represent the underlying geometry of 3D shapes. However, how to extract meaningful structural information from unstructured point cloud distributions and transform them into semantically meaningful point distributions remains an under-explored problem. We present PDT, a novel framework for point distribution transformation with diffusion models. Given a set of input points, PDT learns to transform the point set from its original geometric distribution into a target distribution that is semantically meaningful. Our method utilizes diffusion models with novel architecture and learning strategy, which effectively correlates the source and the target distribution through a denoising process. Through extensive experiments, we show that our method successfully transforms input point clouds into various forms of structured outputs - ranging from surface-aligned keypoints, and inner sparse joints to continuous feature lines. The results showcase our framework’s ability to capture both geometric and semantic features, offering a powerful tool for various 3D geometry processing tasks where structured point distributions are desired. Code will be available at this link: https://github.com/shanemankiw/PDT.

[42] Structure Matters: Revisiting Boundary Refinement in Video Object Segmentation

Guanyi Qin,Ziyue Wang,Daiyun Shen,Haofeng Liu,Hantao Zhou,Junde Wu,Runze Hu,Yueming Jin

Main category: cs.CV

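TL;DR: The paper proposes OASIS, a new semi-supervised video object segmentation method that improves segmentation accuracy through boundary amendment and a structure refinement module while remaining efficient.

Details Motivation: Existing memory-based semi-supervised video object segmentation methods handle occlusion and object interactions poorly and struggle to meet real-time processing requirements.

Contribution: Proposes OASIS, which includes a lightweight structure refinement module and evidential learning for uncertainty estimation, improving both segmentation accuracy and speed.

Method: Fuses rough edge priors extracted by the Canny filter with stored object features to generate an object-level structure map and highlight boundary features, and introduces uncertainty estimation to handle occluded regions.

Result: Achieves an F value of 91.6 on the DAVIS-17 validation set and a G value of 86.6 on the YouTubeVOS 2019 validation set, while running at 48 FPS on DAVIS.

Insight: By combining the structure refinement module with uncertainty estimation, OASIS performs well under occlusion and high feature similarity, offering an efficient solution for real-time video object segmentation.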

Abstract: Given an object mask, Semi-supervised Video Object Segmentation (SVOS) technique aims to track and segment the object across video frames, serving as a fundamental task in computer vision. Although recent memory-based methods demonstrate potential, they often struggle with scenes involving occlusion, particularly in handling object interactions and high feature similarity. To address these issues and meet the real-time processing requirements of downstream applications, in this paper, we propose a novel bOundary Amendment video object Segmentation method with Inherent Structure refinement, hereby named OASIS. Specifically, a lightweight structure refinement module is proposed to enhance segmentation accuracy. With the fusion of rough edge priors captured by the Canny filter and stored object features, the module can generate an object-level structure map and refine the representations by highlighting boundary features. Evidential learning for uncertainty estimation is introduced to further address challenges in occluded regions. The proposed method, OASIS, maintains an efficient design, yet extensive experiments on challenging benchmarks demonstrate its superior performance and competitive inference speed compared to other state-of-the-art methods, i.e., achieving the F values of 91.6 (vs. 89.7 on DAVIS-17 validation set) and G values of 86.6 (vs. 86.2 on YouTubeVOS 2019 validation set) while maintaining a competitive speed of 48 FPS on DAVIS.

Binxu Li,Yuhui Zhang,Xiaohan Wang,Weixin Liang,Ludwig Schmidt,Serena Yeung-Levy

Main category: cs.CV

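TL;DR: The paper proposes GR-CLIP, a post-hoc calibration method that removes the modality gap between the image and text embedding spaces of CLIP and significantly improves mixed modality search.

Details Motivation: Mixed modality search (retrieval across images, texts, and multimodal documents) is an important real-world task, but contrastive vision-language models such as CLIP show a pronounced modality gap in the embedding space, which limits cross-modal search performance.

Contribution: 1. A systematic analysis of the modality gap in CLIP models on mixed modality search; 2. GR-CLIP, a lightweight post-hoc method that removes the modality gap in the embedding space; 3. MixBench, described as the first benchmark for mixed modality search.

Method: GR-CLIP is a post-hoc calibration method that adjusts CLIP's embedding space so that the image and text embedding distributions are aligned, removing the modality gap and improving cross-modal retrieval.

Result: On MixBench, GR-CLIP improves NDCG@10 by up to 26 percentage points over CLIP and surpasses recent vision-language generative embedding models by 4 percentage points while using 75x less compute.

Insight: The modality gap is a key factor limiting cross-modal search performance, and a simple post-hoc method can improve it substantially, providing an efficient solution for mixed modality search.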

Abstract: Mixed modality search – retrieving information across a heterogeneous corpus composed of images, texts, and multimodal documents – is an important yet underexplored real-world application. In this work, we investigate how contrastive vision-language models, such as CLIP, perform on the mixed modality search task. Our analysis reveals a critical limitation: these models exhibit a pronounced modality gap in the embedding space, where image and text embeddings form distinct clusters, leading to intra-modal ranking bias and inter-modal fusion failure. To address this issue, we propose GR-CLIP, a lightweight post-hoc calibration method that removes the modality gap in CLIP’s embedding space. Evaluated on MixBench – the first benchmark specifically designed for mixed modality search – GR-CLIP improves NDCG@10 by up to 26 percentage points over CLIP, surpasses recent vision-language generative embedding models by 4 percentage points, while using 75x less compute.
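
The abstract does not spell out GR-CLIP's calibration step. To illustrate what a "modality gap" is and how a post-hoc correction can shrink it, the sketch below uses per-modality mean centering on toy 3D vectors: each modality's mean embedding is subtracted from its members, so the two clusters share a common center. This is a generic, hypothetical technique chosen for illustration and should not be read as the paper's method; all names and numbers are made up.

```python
import math

def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def center_modality(vectors):
    """Subtract the modality's own mean embedding from every vector."""
    mu = mean_vector(vectors)
    return [[x - m for x, m in zip(v, mu)] for v in vectors]

def gap(a, b):
    """Modality gap measured as the distance between the two mean embeddings."""
    return math.dist(mean_vector(a), mean_vector(b))

# Toy embeddings: images cluster around +1 on the first axis, texts around -1.
image_emb = [[1.0, 0.2, 0.0], [0.9, 0.0, 0.3], [1.1, 0.1, -0.3]]
text_emb = [[-1.0, 0.2, 0.1], [-0.9, 0.1, -0.1], [-1.1, 0.0, 0.0]]

gap_before = gap(image_emb, text_emb)
gap_after = gap(center_modality(image_emb), center_modality(text_emb))
```

In a retrieval setting the centered vectors would normally be re-normalized before computing cosine similarity. Because the correction needs only two mean vectors, it adds almost no cost at query time.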

[44] GPSMamba: A Global Phase and Spectral Prompt-guided Mamba for Infrared Image Super-Resolution

Yongsong Huang,Tomo Miyazaki,Xiaofeng Liu,Shinichiro Omachi

Main category: cs.CV

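TL;DR: Proposes GPSMamba, a framework that combines semantic-frequency prompts with non-causal supervision to address global context modeling in infrared image super-resolution.

Details Motivation: Infrared image super-resolution is challenged by low contrast and sparse textures. In state-space models such as Mamba, the 1D causal scanning mechanism fragments the global context of 2D images and hinders detail restoration.

Contribution: Proposes the Adaptive Semantic-Frequency State Space Module (ASF-SSM) and the Thermal-Spectral Attention and Phase Consistency Loss, combining semantic-frequency prompts with non-causal supervision to improve global modeling.

Method: The ASF-SSM module injects a fused semantic-frequency prompt into the Mamba block, and the Thermal-Spectral Attention and Phase Consistency Loss provides non-causal supervision.

Result: Experiments show that GPSMamba achieves state-of-the-art performance on infrared image super-resolution.

Insight: Semantic-frequency prompts and non-causal supervision can mitigate the limitations of 1D causal modeling and improve infrared image super-resolution.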

Abstract: Infrared Image Super-Resolution (IRSR) is challenged by the low contrast and sparse textures of infrared data, requiring robust long-range modeling to maintain global coherence. While State-Space Models like Mamba offer proficiency in modeling long-range dependencies for this task, their inherent 1D causal scanning mechanism fragments the global context of 2D images, hindering fine-detail restoration. To address this, we propose Global Phase and Spectral Prompt-guided Mamba (GPSMamba), a framework that synergizes architectural guidance with non-causal supervision. First, our Adaptive Semantic-Frequency State Space Module (ASF-SSM) injects a fused semantic-frequency prompt directly into the Mamba block, integrating non-local context to guide reconstruction. Then, a novel Thermal-Spectral Attention and Phase Consistency Loss provides explicit, non-causal supervision to enforce global structural and spectral fidelity. By combining these two innovations, our work presents a systematic strategy to mitigate the limitations of causal modeling. Extensive experiments demonstrate that GPSMamba achieves state-of-the-art performance, validating our approach as a powerful new paradigm for infrared image restoration. Code is available at https://github.com/yongsongH/GPSMamba.

[45] MedIQA: A Scalable Foundation Model for Prompt-Driven Medical Image Quality Assessment

Siyi Xun,Yue Sun,Jingkun Chen,Zitong Yu,Tong Tong,Xiaohong Liu,Mingxiang Wu,Tao Tan

Main category: cs.CV

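TL;DR: This paper presents MedIQA, described as the first comprehensive foundation model for medical image quality assessment (IQA). It handles images of different dimensions, modalities, anatomical regions, and types, and significantly improves medical IQA on a range of downstream tasks.

Details Motivation: Existing medical IQA methods generalize poorly across modalities and clinical scenarios, while rapid advances in medical imaging call for more precise and automated IQA to ensure diagnostic accuracy.

Contribution: MedIQA is presented as the first comprehensive foundation model for medical IQA. The authors build a large-scale multi-modality dataset and introduce a salient slice assessment module and an automatic prompt strategy to improve performance across downstream tasks.

Method: The model integrates a salient slice assessment module to focus on diagnostically relevant regions, and uses an automatic prompt strategy to align upstream physical-parameter pre-training with downstream expert-annotation fine-tuning.

Result: Experiments show that MedIQA significantly outperforms baselines on multiple downstream tasks and provides a scalable framework for medical IQA.

Insight: By combining multi-modality data with an automatic prompt strategy, MedIQA offers a new approach to medical image quality assessment that can support diagnostic workflows and clinical decision-making.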

Abstract: Rapid advances in medical imaging technology underscore the critical need for precise and automated image quality assessment (IQA) to ensure diagnostic accuracy. Existing medical IQA methods, however, struggle to generalize across diverse modalities and clinical scenarios. In response, we introduce MedIQA, the first comprehensive foundation model for medical IQA, designed to handle variability in image dimensions, modalities, anatomical regions, and types. We developed a large-scale multi-modality dataset with plentiful manually annotated quality scores to support this. Our model integrates a salient slice assessment module to focus feature retrieval on diagnostically relevant regions and employs an automatic prompt strategy that aligns upstream physical parameter pre-training with downstream expert annotation fine-tuning. Extensive experiments demonstrate that MedIQA significantly outperforms baselines in multiple downstream tasks, establishing a scalable framework for medical IQA and advancing diagnostic workflows and clinical decision-making.

[46] A Survey of Multimodal Hallucination Evaluation and Detection

Zhiyuan Chen,Yuecong Min,Jie Zhang,Bei Yan,Jiahao Wang,Xiaozhen Wang,Shiguang Shan

Main category: cs.CV

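TL;DR: This survey covers hallucination in Multi-modal Large Language Models (MLLMs), analyzing hallucination in Image-to-Text (I2T) and Text-to-Image (T2I) tasks from two angles: evaluation benchmarks and detection methods.

Details Motivation: MLLMs often hallucinate when processing visual and textual information, producing content that looks plausible but contradicts the input or common knowledge. To address this, the paper systematically reviews hallucination evaluation benchmarks and detection methods.

Contribution: Proposes a taxonomy of hallucination based on faithfulness and factuality, summarizes how existing evaluation benchmarks are constructed and which metrics they use, and reviews recent advances in hallucination detection.

Method: First proposes a taxonomy of hallucination, then reviews evaluation benchmarks and detection methods for T2I and I2T tasks.

Result: Systematically summarizes the limitations of existing work and points out directions for future research.

Insight: Addressing hallucination requires more comprehensive evaluation benchmarks and more efficient detection methods, and cross-modal consistency is a key direction for future work.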

Abstract: Multi-modal Large Language Models (MLLMs) have emerged as a powerful paradigm for integrating visual and textual information, supporting a wide range of multi-modal tasks. However, these models often suffer from hallucination, producing content that appears plausible but contradicts the input content or established world knowledge. This survey offers an in-depth review of hallucination evaluation benchmarks and detection methods across Image-to-Text (I2T) and Text-to-Image (T2I) generation tasks. Specifically, we first propose a taxonomy of hallucination based on faithfulness and factuality, incorporating the common types of hallucinations observed in practice. Then we provide an overview of existing hallucination evaluation benchmarks for both T2I and I2T tasks, highlighting their construction process, evaluation objectives, and employed metrics. Furthermore, we summarize recent advances in hallucination detection methods, which aim to identify hallucinated content at the instance level and serve as a practical complement to benchmark-based evaluation. Finally, we highlight key limitations in current benchmarks and detection methods, and outline potential directions for future research.

[47] LOTUS: A Leaderboard for Detailed Image Captioning from Quality to Societal Bias and User Preferences

Yusuke Hirota,Boyi Li,Ryo Hachiuma,Yueh-Hua Wu,Boris Ivanovic,Yuta Nakashima,Marco Pavone,Yejin Choi,Yu-Chiang Frank Wang,Chao-Han Huck Yang

Main category: cs.CV

TL;DR: LOTUS is a leaderboard designed to evaluate detailed image captions generated by LVLMs, covering quality, risks, and societal biases while incorporating user preferences.

Details Motivation: Existing evaluations for detailed image captions lack standardized criteria, bias-awareness, and user preference considerations.

Contribution: Introduces LOTUS, a comprehensive leaderboard addressing caption quality, risks (e.g., hallucination), societal biases (e.g., gender bias), and user preferences.

Method: Evaluates LVLMs on multiple dimensions, including alignment, descriptiveness, hallucination risks, and biases, while customizing criteria for diverse user preferences.

Result: Analysis shows no single LVLM excels across all criteria, with correlations between caption detail and bias risks. Optimal model selection depends on user priorities.

Insight: Detailed captions and bias risks are correlated, and preference-oriented evaluations highlight the importance of tailored model selection for different user needs.

Abstract: Large Vision-Language Models (LVLMs) have transformed image captioning, shifting from concise captions to detailed descriptions. We introduce LOTUS, a leaderboard for evaluating detailed captions, addressing three main gaps in existing evaluations: lack of standardized criteria, bias-aware assessments, and user preference considerations. LOTUS comprehensively evaluates various aspects, including caption quality (e.g., alignment, descriptiveness), risks (e.g., hallucination), and societal biases (e.g., gender bias) while enabling preference-oriented evaluations by tailoring criteria to diverse user preferences. Our analysis of recent LVLMs reveals no single model excels across all criteria, while correlations emerge between caption detail and bias risks. Preference-oriented evaluations demonstrate that optimal model selection depends on user priorities.

[48] Probing Multimodal Fusion in the Brain: The Dominance of Audiovisual Streams in Naturalistic Encoding

Hamid Abdollahi,Amir Hossein Mansouri Majoumerd,Amir Hossein Bagheri Baboukani,Amir Abolfazl Suratgar,Mohammad Bagher Menhaj

Main category: cs.CV

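TL;DR: The paper studies brain encoding models under multimodal stimuli and finds that a linear model outperforms more complex models on out-of-distribution data, and that visual and auditory features dominate the encoding over linguistic features.

Details Motivation: The work aims to explore how brain activity is encoded under multimodal stimuli and to test how well models generalize, especially in naturalistic settings.

Contribution: Through ID and OOD testing, the paper shows the trade-off between model complexity and generalization and reveals the dominance of the visual and auditory streams in neural encoding.

Method: Uses X-CLIP (visual) and Whisper (auditory) feature extractors, builds an attention-based model and a linear model, and evaluates them rigorously under several data distributions.

Result: The linear model is more robust on OOD data, outperforming a competitive baseline by 18%. Linguistic features do not improve predictive accuracy, while auditory and visual features dominate.

Insight: The study underlines the importance of OOD testing for building robust neuro-AI models and sheds light on sensory hierarchies in multimodal encoding and on the effect of model architecture.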

Abstract: Predicting brain activity in response to naturalistic, multimodal stimuli is a key challenge in computational neuroscience. While encoding models are becoming more powerful, their ability to generalize to truly novel contexts remains a critical, often untested, question. In this work, we developed brain encoding models using state-of-the-art visual (X-CLIP) and auditory (Whisper) feature extractors and rigorously evaluated them on both in-distribution (ID) and diverse out-of-distribution (OOD) data. Our results reveal a fundamental trade-off between model complexity and generalization: a higher-capacity attention-based model excelled on ID data, but a simpler linear model was more robust, outperforming a competitive baseline by 18% on the OOD set. Intriguingly, we found that linguistic features did not improve predictive accuracy, suggesting that for familiar languages, neural encoding may be dominated by the continuous visual and auditory streams over redundant textual information. Spatially, our approach showed marked performance gains in the auditory cortex, underscoring the benefit of high-fidelity speech representations. Collectively, our findings demonstrate that rigorous OOD testing is essential for building robust neuro-AI models and provides nuanced insights into how model architecture, stimulus characteristics, and sensory hierarchies shape the neural encoding of our rich, multimodal world.

[49] MMBench-GUI: Hierarchical Multi-Platform Evaluation Framework for GUI Agents

Xuehui Wang,Zhenyu Wu,JingJing Xie,Zichen Ding,Bowen Yang,Zehao Li,Zhaoyang Liu,Qingyun Li,Xuan Dong,Zhe Chen,Weiyun Wang,Xiangyu Zhao,Jixuan Chen,Haodong Duan,Tianbao Xie,Chenyu Yang,Shiqian Su,Yue Yu,Yuan Huang,Yiqian Liu,Xiao Zhang,Yanting Zhang,Xiangyu Yue,Weijie Su,Xizhou Zhu,Wei Shen,Jifeng Dai,Wenhai Wang

Main category: cs.CV

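TL;DR: MMBench-GUI is a hierarchical, multi-platform evaluation framework for GUI automation agents. It comprises four evaluation levels and a new Efficiency-Quality Area (EQA) metric, and highlights the importance of visual grounding and task efficiency.

Details Motivation: Existing evaluations of GUI automation agents lack a unified cross-platform, multi-level standard, and task efficiency has not been studied enough.

Contribution: Proposes the MMBench-GUI benchmark, introduces the EQA metric, and shows that visual grounding and efficiency optimization are critical for GUI automation.

Method: Designs a hierarchical evaluation framework (GUI Content Understanding, Element Grounding, Task Automation, Task Collaboration), proposes the EQA metric, and analyzes task efficiency.

Result: Establishes that accurate visual grounding is critical to task success and exposes substantial inefficiency in existing models, which take many redundant steps.

Insight: Modular frameworks (for example, with specialized grounding modules) and efficient task planning (long-context memory, early stopping strategies) are the core challenges and directions for GUI automation.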

Abstract: We introduce MMBench-GUI, a hierarchical benchmark for evaluating GUI automation agents across Windows, macOS, Linux, iOS, Android, and Web platforms. It comprises four levels: GUI Content Understanding, Element Grounding, Task Automation, and Task Collaboration, covering essential skills for GUI agents. In addition, we propose a novel Efficiency-Quality Area (EQA) metric to assess GUI agent execution efficiency in online automation scenarios. Through MMBench-GUI, we identify accurate visual grounding as a critical determinant of overall task success, emphasizing the substantial benefits of modular frameworks that integrate specialized grounding modules. Furthermore, to achieve reliable GUI automation, an agent requires strong task planning and cross-platform generalization abilities, with long-context memory, a broad action space, and long-term reasoning playing a critical role. More importantly, task efficiency remains a critically underexplored dimension, and all models suffer from substantial inefficiencies, with excessive redundant steps even when tasks are ultimately completed. The integration of precise localization, effective planning, and early stopping strategies is indispensable to enable truly efficient and scalable GUI automation. Our benchmark code, evaluation data, and running environment will be publicly available at https://github.com/open-compass/MMBench-GUI.

[50] ScenePainter: Semantically Consistent Perpetual 3D Scene Generation with Concept Relation Alignment

Chong Xia,Shengjun Zhang,Fangfu Liu,Chang Liu,Khodchaphun Hirunyaratsameewong,Yueqi Duan

Main category: cs.CV

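TL;DR: ScenePainter is a new framework that uses a hierarchical graph structure (SceneConceptGraph) to address semantic drift in 3D scene generation, producing long-range, semantically consistent 3D view sequences.

Details Motivation: Existing 3D scene generation methods produce view sequences by successive outpainting, which leads to semantic drift. A method is therefore needed that keeps the generated scene semantically consistent.

Contribution: Proposes the ScenePainter framework, which introduces SceneConceptGraph to align the outpainter's scene prior with the understanding of the current scene, addressing semantic drift, and refines the graph dynamically to increase diversity.

Method: Uses the hierarchical graph structure SceneConceptGraph to build relations among multi-level scene concepts, guides the outpainting module toward consistent view sequences, and dynamically refines the structure.

Result: Experiments show that ScenePainter overcomes semantic drift and generates more consistent and immersive 3D view sequences.

Insight: Modeling the relations among scene concepts with a graph gives explicit control over semantic consistency during generation, and dynamic refinement improves diversity.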

Abstract: Perpetual 3D scene generation aims to produce long-range and coherent 3D view sequences, which is applicable to long-term video synthesis and 3D scene reconstruction. Existing methods follow a “navigate-and-imagine” fashion and rely on outpainting for successive view expansion. However, the generated view sequences suffer from a semantic drift issue derived from the accumulated deviation of the outpainting module. To tackle this challenge, we propose ScenePainter, a new framework for semantically consistent 3D scene generation, which aligns the outpainter’s scene-specific prior with the comprehension of the current scene. To be specific, we introduce a hierarchical graph structure dubbed SceneConceptGraph to construct relations among multi-level scene concepts, which directs the outpainter for consistent novel views and can be dynamically refined to enhance diversity. Extensive experiments demonstrate that our framework overcomes the semantic drift issue and generates more consistent and immersive 3D view sequences. Project Page: https://xiac20.github.io/ScenePainter/.

[51] Negation-Aware Test-Time Adaptation for Vision-Language Models

Haochen Han,Alex Jinpeng Wang,Fangming Liu

Main category: cs.CV

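TL;DR: The paper studies the practical problem of negation understanding in Vision-Language Models (VLMs) and proposes NEAT, a low-resource method that adjusts distribution-related parameters through test-time adaptation to handle negation.

Details Motivation: Although VLMs show strong transferability from large-scale training, they handle negation poorly (for example, identifying conditions that are absent), and existing methods depend on large amounts of data and compute, which limits wide adoption.

Contribution: Proposes NEAT, which adjusts distribution-related parameters through test-time adaptation to address the distribution shift VLMs face with negation, without requiring large amounts of additional data or compute.

Method: NEAT adjusts distribution-related parameters during inference, reducing distribution shift for consistent semantics and eliminating false distributional consistency for unrelated semantics.

Result: Experiments on various negation understanding tasks verify the effectiveness of the method.

Insight: The paper identifies distribution shift as the core obstacle to negation understanding and offers a low-resource solution, opening a new direction for practical use of VLMs.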

Abstract: In this paper, we study a practical but less-touched problem in Vision-Language Models (VLMs), i.e., negation understanding. Specifically, many real-world applications require models to explicitly identify what is false or non-existent, e.g., radiologists may search for images that exclude specific conditions. Despite the impressive transferability of VLMs through large-scale training, they suffer from a critical limitation: they fail to handle negation. To address this challenge, existing methods attribute its root cause to the scarcity of negation training data and propose to fine-tune VLMs on massive data containing explicit negation. Undoubtedly, such data-centric solutions demand substantial data and computational resources, limiting their sustainable widespread adoption. To tackle negation in a low-carbon manner, we empirically observe that the key obstacle lies in the dual-concept shifts between the affirmation and negation distributions. Therefore, we propose a Negation-Aware Test-Time Adaptation (NEAT) method to efficiently adjust distribution-related parameters during inference. In brief, NEAT can reduce distribution shift in consistent semantics while eliminating false distributional consistency in unrelated semantics. Extensive experiments on various negation understanding tasks verify the effectiveness of the proposed method. The code is available at https://github.com/hhc1997/NEAT.

[52] Multi-Task Dense Prediction Fine-Tuning with Mixture of Fine-Grained Experts

Yangyang Xu,Xi Ye,Duo Su

Main category: cs.CV

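TL;DR: This paper proposes a Fine-Grained Mixture of Experts (FGMoE) architecture for multi-task dense prediction that combines three key innovations with fine-tuning and significantly improves performance.

Details Motivation: Multi-task learning (MTL) shows promise for dense prediction but still faces the challenge of balancing shared representations with task-specific specialization.

Contribution: 1. Proposes FGMoE, an architecture based on mixture of experts (MoE); 2. Introduces three expert designs: intra-task experts, shared experts, and a global expert; 3. Combines these with fine-tuning to improve parameter efficiency.

Method: 1. Intra-task experts partition along the intermediate hidden dimension of MLPs; 2. Shared experts consolidate information common within a task; 3. A global expert facilitates cross-task knowledge transfer. In addition, only the decoder parameters are fine-tuned.

Result: On NYUD-v2 and PASCAL-Context, FGMoE significantly outperforms existing MoE-based MTL models while using fewer parameters.

Insight: Fine-grained decomposition of task information and cross-task knowledge sharing are key to improving multi-task dense prediction.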

Abstract: Multi-task learning (MTL) for dense prediction has shown promising results but still faces challenges in balancing shared representations with task-specific specialization. In this paper, we introduce a novel Fine-Grained Mixture of Experts (FGMoE) architecture that explores MoE-based MTL models through a combination of three key innovations and fine-tuning. First, we propose intra-task experts that partition along intermediate hidden dimensions of MLPs, enabling finer decomposition of task information while maintaining parameter efficiency. Second, we introduce shared experts that consolidate common information across different contexts of the same task, reducing redundancy and allowing routing experts to focus on unique aspects. Third, we design a global expert that facilitates adaptive knowledge transfer across tasks based on both input features and task requirements, promoting beneficial information sharing while preventing harmful interference. In addition, we use a fine-tuning approach that improves parameter efficiency by training only the parameters of the decoder. Extensive experimental results show that the proposed FGMoE uses fewer parameters and significantly outperforms current competitive MoE-based MTL models on two dense prediction datasets (i.e., NYUD-v2 and PASCAL-Context) across various metrics.
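
The idea of partitioning an MLP along its hidden dimension can be illustrated with a small sketch. It is not the authors' implementation: it assumes a bias-free two-layer ReLU MLP, equal contiguous chunks, and hypothetical helper names (`split_into_experts`, `moe_forward`), and it leaves out the routing, shared experts, and global expert described in the abstract. It only shows that the slices behave as fine-grained experts whose outputs sum to the original MLP output when every gate is 1.

```python
def relu(v):
    return [max(0.0, x) for x in v]

def matvec(m, v):
    # m is a list of rows; returns m @ v.
    return [sum(w * x for w, x in zip(row, v)) for row in m]

def mlp(w1, w2, x):
    """Bias-free two-layer MLP: w2 @ relu(w1 @ x)."""
    return matvec(w2, relu(matvec(w1, x)))

def split_into_experts(w1, w2, num_experts):
    """Slice the hidden dimension into equal contiguous chunks.

    Expert k keeps a block of rows of w1 (hidden x in) and the
    matching block of columns of w2 (out x hidden).
    """
    hidden = len(w1)
    assert hidden % num_experts == 0, "hidden size must divide evenly"
    size = hidden // num_experts
    experts = []
    for k in range(num_experts):
        lo, hi = k * size, (k + 1) * size
        experts.append((w1[lo:hi], [row[lo:hi] for row in w2]))
    return experts

def moe_forward(experts, gates, x):
    """Gate-weighted sum of expert outputs (gates are given, not learned here)."""
    out = [0.0] * len(experts[0][1])
    for (e_w1, e_w2), g in zip(experts, gates):
        y = mlp(e_w1, e_w2, x)
        out = [o + g * v for o, v in zip(out, y)]
    return out

w1 = [[1.0, -1.0], [0.5, 2.0], [-1.0, 0.0], [2.0, 1.0]]  # hidden=4, in=2
w2 = [[1.0, 2.0, 3.0, 4.0], [0.0, -1.0, 1.0, 0.5]]       # out=2, hidden=4
x = [1.0, 2.0]
experts = split_into_experts(w1, w2, num_experts=2)
full = mlp(w1, w2, x)
mixed = moe_forward(experts, gates=[1.0, 1.0], x=x)
```

Because the activation is applied element-wise, the decomposition is exact; setting a gate to 0 simply drops that slice of hidden units, which is where a router would come in.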

[53] LISA: A Layer-wise Integration and Suppression Approach for Hallucination Mitigation in Multimodal Large Language Models

Zhihui Guo,Xin Man,Hui Xu,Jie Shao

Main category: cs.CV

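TL;DR: LISA is a layer-wise integration and suppression approach that reduces hallucinations in multimodal large language models (MLLMs), improving generation consistency through hierarchical modulation and multi-layer fusion.

Details Motivation: MLLMs perform well on vision-language tasks but still suffer from object hallucination, i.e., describing objects that do not appear in the image. LISA aims to address this with layer-wise suppression and fusion mechanisms.

Contribution: LISA proposes hierarchical modulation and anchor-based routing, which suppress over-amplified activations in deep layers and fuse multi-layer features, significantly reducing hallucinations.

Method: 1. Zone-specific spectral modulation stabilizes attention and suppresses over-amplified activations in deep layers; 2. Anchor-based routing selects and fuses token-level logits from multiple layers for adaptive decoding.

Result: Experiments show that LISA reduces hallucinations by up to 53.6% on CHAIR_I and improves POPE F1 by 4.5%, with good generalization.

Insight: LISA highlights the functional division of labor among layers in MLLMs and offers a general strategy for mitigating hallucinations without additional training.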

Abstract: Multimodal Large Language Models (MLLMs) excel in vision-language tasks such as image captioning but remain prone to object hallucinations, where they describe objects that do not appear in the image. To mitigate this, we propose LISA, a Layer-wise Integration and Suppression Approach that enhances generation consistency through hierarchical modulation and multi-layer fusion. LISA leverages the functional hierarchy within MLLMs, where shallow layers provide visual grounding, middle layers encode semantics, and deep layers tend to amplify spurious signals. First, zone-specific spectral modulation stabilizes attention by suppressing over-amplified activations in deeper layers while preserving alignment cues in earlier layers. Second, token-level logits from selected layers are fused via anchor-based routing, with token-wise anchor selection and soft logit fusion enabling adaptive integration during decoding. LISA is fully plug-and-play and can be seamlessly integrated into existing MLLMs, including Qwen2.5-VL. Experiments on multiple benchmarks show that LISA reduces hallucinations by up to 53.6% in $\mathrm{CHAIR}_I$ and improves POPE F1 by 4.5%, demonstrating strong generalization across models and tasks.

[54] Cross Spatial Temporal Fusion Attention for Remote Sensing Object Detection via Image Feature Matching

Abu Sadat Mohammad Salehin Amit,Xiaoli Zhang,Md Masum Billa Shagar,Zhaojun Liu,Xiongfei Li,Fanlong Meng

Main category: cs.CV

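TL;DR: The paper proposes a Cross Spatial Temporal Fusion (CSTF) attention mechanism that improves feature matching for multimodal remote sensing images by combining scale-invariant keypoints with a classification formulation, substantially improving object detection performance.

Details Motivation: Multimodal remote sensing images differ significantly in geometry and radiometry, so conventional methods that extract features at the fully connected layer struggle to capture cross-modal similarities. The authors propose CSTF to address this.

Contribution: 1. Proposes the CSTF mechanism, which enhances feature representation by integrating scale-invariant keypoints and multi-region information; 2. Reformulates similarity matching as a classification task, using SoftMax and FCN layers to improve matching.

Method: CSTF detects scale-invariant keypoints in the reference and query images, builds correspondence maps, and performs feature matching as a classification task using SoftMax and FCN layers.

Result: CSTF reaches an average mAP of 90.99% on HRSC2016 and 90.86% on DOTA, outperforming existing models, with an inference speed of 12.5 FPS.

Insight: Better cross-modal feature matching directly improves downstream tasks such as object detection, and combining local features with contextual information is especially important for remote sensing images.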

Abstract: Effectively describing features for cross-modal remote sensing image matching remains a challenging task due to the significant geometric and radiometric differences between multimodal images. Existing methods primarily extract features at the fully connected layer but often fail to capture cross-modal similarities effectively. We propose a Cross Spatial Temporal Fusion (CSTF) mechanism that enhances feature representation by integrating scale-invariant keypoints detected independently in both reference and query images. Our approach improves feature matching in two ways: first, by creating correspondence maps that leverage information from multiple image regions simultaneously, and second, by reformulating the similarity matching process as a classification task using SoftMax and Fully Convolutional Network (FCN) layers. This dual approach enables CSTF to maintain sensitivity to distinctive local features while incorporating broader contextual information, resulting in robust matching across diverse remote sensing modalities. To demonstrate the practical utility of improved feature matching, we evaluate CSTF on object detection tasks using the HRSC2016 and DOTA benchmark datasets. Our method achieves state-of-the-art performance with an average mAP of 90.99% on HRSC2016 and 90.86% on DOTA, outperforming existing models. The CSTF model maintains computational efficiency with an inference speed of 12.5 FPS. These results validate that our approach to cross-modal feature matching directly enhances downstream remote sensing applications such as object detection.

[55] Preserving Topological and Geometric Embeddings for Point Cloud Recovery

Kaiyue Zhou,Zelong Tan,Hongxiao Wang,Ya-li Li,Shengjin Wang

Main category: cs.CV

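TL;DR: The paper proposes TopGeoFormer, an end-to-end architecture that preserves both topological and geometric features throughout point cloud sampling and recovery, and validates its advantages experimentally.

Details Motivation: Existing methods struggle to exploit both topological and geometric attributes during point cloud recovery, leading to poor results.

Contribution: 1. Generates topological embeddings through a continuous mapping of relative relationships between neighboring points; 2. Proposes InterTwining Attention to fuse topological and geometric features; 3. Introduces a geometry loss and a topological constraint loss to optimize the embeddings.

Method: 1. Revisits traditional feature extraction techniques to produce topological embeddings; 2. InterTwining Attention combines local awareness with shape context; 3. The geometry loss and the topological constraint loss guide training.

Result: Experiments show that TopGeoFormer significantly outperforms existing methods on sampling and recovery.

Insight: Jointly preserving topological and geometric features is crucial for point cloud recovery, and end-to-end learning yields better results.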

Abstract: Recovering point clouds involves the sequential process of sampling and restoration, yet existing methods struggle to effectively leverage both topological and geometric attributes. To address this, we propose an end-to-end architecture named TopGeoFormer, which maintains these critical features throughout the sampling and restoration phases. First, we revisit traditional feature extraction techniques to yield a topological embedding using a continuous mapping of relative relationships between neighboring points, and integrate it in both phases to preserve the structure of the original space. Second, we propose InterTwining Attention to fully merge topological and geometric embeddings, which queries shape with local awareness in both phases to form a learnable shape context facilitated with point-wise, point-shape-wise, and intra-shape features. Third, we introduce a full geometry loss and a topological constraint loss to optimize the embeddings in both Euclidean and topological spaces. The geometry loss uses inconsistent matching between coarse-to-fine generations and targets for reconstructing better geometric details, and the constraint loss limits embedding variances for better approximation of the topological space. In experiments, we comprehensively analyze the circumstances using the conventional and learning-based sampling/upsampling algorithms. The quantitative and qualitative results demonstrate that our method significantly outperforms existing sampling and recovery methods.

[56] MixA-Q: Revisiting Activation Sparsity for Vision Transformers from a Mixed-Precision Quantization Perspective

Weitian Wang,Rai Shubham,Cecilia De La Parra,Akash Kumar

Main category: cs.CV

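TL;DR: MixA-Q is a mixed-precision activation quantization framework that exploits activation sparsity in window-based vision transformers, assigning lower bit widths to less important windows to improve the trade-off between inference efficiency and model performance.

Details Motivation: Existing quantization methods do not fully exploit activation sparsity in window-based vision transformers; MixA-Q aims to leverage it through mixed-precision quantization.

Contribution: 1) Proposes the MixA-Q framework, which uses activation sparsity to guide mixed-precision quantization; 2) Designs a Two-Branch Swin Block that integrates with QAT and PTQ methods; 3) Demonstrates speedups and improved quantized accuracy experimentally.

Method: MixA-Q separates the window computations within Swin blocks, assigns low-bit precision to less important windows, and processes high- and low-bit activations in a two-branch structure. It supports both quantization-aware training and training-free (post-training) quantization.

Result: On COCO, the PTQ configuration achieves a 1.35x speedup without accuracy loss; the QAT configuration achieves a lossless 1.25x speedup, or a 1.53x speedup with only a 1% mAP drop; for the W4A4 model, quantization degradation is reduced by 24%.

Insight: Combining activation sparsity with mixed-precision quantization can significantly improve the efficiency of window-based transformers while reducing quantization error.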

Abstract: In this paper, we propose MixA-Q, a mixed-precision activation quantization framework that leverages intra-layer activation sparsity (a concept widely explored in activation pruning methods) for efficient inference of quantized window-based vision transformers. For a given uniform-bit quantization configuration, MixA-Q separates the batched window computations within Swin blocks and assigns a lower bit width to the activations of less important windows, improving the trade-off between model performance and efficiency. We introduce a Two-Branch Swin Block that processes activations separately in high- and low-bit precision, enabling seamless integration of our method with most quantization-aware training (QAT) and post-training quantization (PTQ) methods, either directly or with simple modifications. Our experimental evaluations on the COCO dataset demonstrate that MixA-Q achieves a training-free 1.35x computational speedup without accuracy loss in the PTQ configuration. With QAT, MixA-Q achieves a lossless 1.25x speedup and a 1.53x speedup with only a 1% mAP drop by incorporating activation pruning. Notably, by reducing the quantization error in important regions, our sparsity-aware quantization adaptation improves the mAP of the quantized W4A4 model (with both weights and activations in 4-bit precision) by 0.7%, reducing quantization degradation by 24%.
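
A rough sketch of per-window mixed-precision activation quantization follows. It is an illustration under stated assumptions rather than MixA-Q itself: the importance score (mean absolute activation), the symmetric uniform quantizer, the 8/4-bit split, and the names `fake_quantize` and `mixed_precision_windows` are all hypothetical, and the Two-Branch Swin Block is not modeled.

```python
def fake_quantize(values, bits):
    """Symmetric uniform quantization followed by dequantization."""
    qmax = 2 ** (bits - 1) - 1
    peak = max(abs(v) for v in values)
    if peak == 0.0:
        return list(values)
    scale = peak / qmax
    # Round to the nearest integer level and clamp to the representable range.
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]

def mixed_precision_windows(windows, high_bits=8, low_bits=4, low_fraction=0.5):
    """Quantize the least important windows with fewer bits.

    Importance is assumed to be the mean absolute activation of a window;
    the paper's actual criterion is not given in the abstract.
    """
    scores = [sum(abs(v) for v in w) / len(w) for w in windows]
    order = sorted(range(len(windows)), key=lambda i: scores[i])
    low_set = set(order[: int(len(windows) * low_fraction)])
    bits = [low_bits if i in low_set else high_bits for i in range(len(windows))]
    quantized = [fake_quantize(w, b) for w, b in zip(windows, bits)]
    return quantized, bits

windows = [
    [0.01, -0.02, 0.0, 0.01],
    [1.0, -0.5, 0.25, 0.8],
    [0.1, 0.1, -0.1, 0.05],
    [2.0, -1.5, 0.3, 0.9],
]
quantized, bits = mixed_precision_windows(windows)
```

In this toy setup the two low-magnitude windows end up in the low-bit group, so the coarser quantization step is spent where activations carry little signal.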

[57] DASH: 4D Hash Encoding with Self-Supervised Decomposition for Real-Time Dynamic Scene Rendering

Jie Chen,Zhangchi Hu,Peixi Wu,Huyue Zhu,Hebei Li,Xiaoyan Sun

Main category: cs.CV

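TL;DR: DASH is a real-time dynamic scene rendering framework based on 4D hash encoding and self-supervised decomposition; it addresses the feature overlap and poor rendering quality of earlier methods and renders at 264 FPS.

Details Motivation: Dynamic scene reconstruction is a long-standing challenge in 3D vision, and existing methods based on a low-rank assumption are prone to feature overlap and degraded rendering quality. DASH aims to improve both the speed and the quality of dynamic rendering through 4D hash encoding and self-supervised decomposition.

Contribution: 1. Proposes a self-supervised decomposition mechanism that separates dynamic and static components without manual annotations; 2. Designs a multiresolution 4D hash encoder that avoids the low-rank assumption; 3. Introduces a spatio-temporal smoothness regularization strategy to reduce unstable deformation artifacts.

Method: 1. Self-supervised decomposition of dynamic and static parts; 2. Multiresolution 4D hash encoding of dynamic elements; 3. Spatio-temporal smoothness regularization of deformations.

Result: Achieves state-of-the-art dynamic rendering performance on real-world datasets, at a real-time 264 FPS on a single 4090 GPU.

Insight: By combining self-supervised learning with an explicit representation, DASH significantly improves the efficiency and quality of dynamic scene rendering and offers a new direction for real-time dynamic reconstruction.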

Abstract: Dynamic scene reconstruction is a long-standing challenge in 3D vision. Existing plane-based methods in dynamic Gaussian splatting suffer from an unsuitable low-rank assumption, causing feature overlap and poor rendering quality. Although 4D hash encoding provides an explicit representation without low-rank constraints, directly applying it to the entire dynamic scene leads to substantial hash collisions and redundancy. To address these challenges, we present DASH, a real-time dynamic scene rendering framework that employs 4D hash encoding coupled with self-supervised decomposition. Our approach begins with a self-supervised decomposition mechanism that separates dynamic and static components without manual annotations or precomputed masks. Next, we introduce a multiresolution 4D hash encoder for dynamic elements, providing an explicit representation that avoids the low-rank assumption. Finally, we present a spatio-temporal smoothness regularization strategy to mitigate unstable deformation artifacts. Experiments on real-world datasets demonstrate that DASH achieves state-of-the-art dynamic rendering performance, exhibiting enhanced visual quality at real-time speeds of 264 FPS on a single 4090 GPU. Code: https://github.com/chenj02/DASH.
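
To make the multiresolution 4D hash encoding idea concrete, here is a minimal sketch that hashes an (x, y, z, t) grid cell into a fixed-size feature table at several resolutions and concatenates the looked-up features. It is not the DASH encoder: the multiplier constants, table size, level schedule, and the class name `HashEncoder4D` are assumptions for illustration, and the lookup uses the lower grid corner only, with no interpolation and no learning.

```python
import random

# Illustrative per-axis multipliers for an XOR-based spatial hash (assumed values).
MULTIPLIERS = (1, 2654435761, 805459861, 3674653429)

def hash4d(ix, iy, iz, it, table_size):
    """Map integer 4D grid coordinates to a slot in the feature table."""
    h = 0
    for coord, mult in zip((ix, iy, iz, it), MULTIPLIERS):
        h ^= (coord * mult) & 0xFFFFFFFF
    return h % table_size

class HashEncoder4D:
    def __init__(self, levels=4, table_size=2 ** 12, feat_dim=2,
                 base_res=4, growth=2.0, seed=0):
        rng = random.Random(seed)
        self.table_size = table_size
        # Grid resolution grows geometrically from coarse to fine.
        self.resolutions = [int(base_res * growth ** lvl) for lvl in range(levels)]
        # One small randomly initialized feature table per level.
        self.tables = [
            [[rng.uniform(-1e-4, 1e-4) for _ in range(feat_dim)]
             for _ in range(table_size)]
            for _ in range(levels)
        ]

    def encode(self, x, y, z, t):
        """Encode a point with coordinates in [0, 1); lower-corner lookup only."""
        feats = []
        for res, table in zip(self.resolutions, self.tables):
            idx = hash4d(int(x * res), int(y * res), int(z * res), int(t * res),
                         self.table_size)
            feats.extend(table[idx])
        return feats

encoder = HashEncoder4D()
features = encoder.encode(0.1, 0.2, 0.3, 0.4)
```

Because different cells can map to the same slot, a table shared by a whole scene suffers collisions, which is the motivation the abstract gives for encoding only the dynamic elements.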

[58] Patch Pruning Strategy Based on Robust Statistical Measures of Attention Weight Diversity in Vision Transformers

Yuki Igaue,Hiroaki Aizawa

Main category: cs.CV

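TL;DR: The paper proposes a patch pruning strategy based on measures of attention weight diversity across heads (variance, or robust statistics such as the median absolute deviation), improving the computational efficiency of vision transformers while maintaining classification accuracy.

Details Motivation: Multi-head self-attention in vision transformers performs well, but its computational complexity is quadratic in the number of patches. Improving efficiency requires a way to identify and remove redundant patches.

Contribution: 1. Proposes a patch pruning strategy based on attention weight diversity, using variance or median absolute deviation to assess patch importance.
2. Introduces overlapping patch embeddings, which further improve performance at comparable throughput.
3. The method applies during both training and inference, and is particularly suited to fine-tuning pre-trained models.

Method: 1. Measures patch importance by computing the variance or median absolute deviation of attention weights across heads.
2. Removes redundant low-importance patches to reduce computation.
3. Uses overlapping patch embeddings to improve performance.

Result: The method improves throughput while maintaining classification accuracy, particularly when fine-tuning pre-trained models.

Insight: 1. The diversity of attention weights is an effective indicator of patch importance.
2. Robust statistical measures (such as the median absolute deviation) perform similarly to variance.
3. With overlapping patch embeddings, pruning can outperform conventional all-patch approaches at comparable throughput.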

Abstract: Multi-head self-attention is a distinctive feature extraction mechanism of vision transformers that computes pairwise relationships among all input patches, contributing significantly to their high performance. However, it is known to incur a quadratic computational complexity with respect to the number of patches. One promising approach to address this issue is patch pruning, which improves computational efficiency by identifying and removing redundant patches. In this work, we propose a patch pruning strategy that evaluates the importance of each patch based on the variance of attention weights across multiple attention heads. This approach is inspired by the design of multi-head self-attention, which aims to capture diverse attention patterns across different subspaces of feature representations. The proposed method can be easily applied during both training and inference, and achieves improved throughput while maintaining classification accuracy in scenarios such as fine-tuning with pre-trained models. In addition, we also found that using robust statistical measures, such as the median absolute deviation in place of variance, to assess patch importance can similarly lead to strong performance. Furthermore, by introducing overlapping patch embeddings, our method achieves better performance with comparable throughput to conventional approaches that utilize all patches.
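
The scoring step described above can be sketched as follows. This is a hedged illustration, not the authors' code: it assumes each head supplies one attention weight per patch (for example, the attention a class token pays to that patch), assumes that a larger spread across heads means a more important patch, and uses hypothetical function names.

```python
from statistics import median, pvariance

def mad(values):
    """Median absolute deviation, a robust alternative to variance."""
    m = median(values)
    return median([abs(v - m) for v in values])

def patch_scores(attn_per_head, measure="variance"):
    """attn_per_head[h][p] is the attention weight patch p receives in head h."""
    spread = pvariance if measure == "variance" else mad
    num_patches = len(attn_per_head[0])
    return [spread([head[p] for head in attn_per_head]) for p in range(num_patches)]

def keep_indices(scores, keep_ratio):
    """Indices of the highest-scoring patches, returned in original order."""
    k = max(1, int(len(scores) * keep_ratio))
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top)

attn = [
    [0.10, 0.40, 0.25, 0.25],  # head 0
    [0.10, 0.10, 0.25, 0.55],  # head 1
    [0.10, 0.70, 0.15, 0.05],  # head 2
]
kept_var = keep_indices(patch_scores(attn, "variance"), keep_ratio=0.5)
kept_mad = keep_indices(patch_scores(attn, "mad"), keep_ratio=0.5)
```

In this toy input the heads disagree most about patches 1 and 3, so both measures keep those two and drop the patches every head treats alike.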

[59] VisHall3D: Monocular Semantic Scene Completion from Reconstructing the Visible Regions to Hallucinating the Invisible Regions

Haoang Lu,Yuanqi Su,Xiaoning Zhang,Longjun Gao,Yu Xue,Le Wang

Main category: cs.CV

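TL;DR: VisHall3D is a two-stage framework that decomposes monocular semantic scene completion into reconstructing the visible regions and inferring the invisible regions, addressing feature entanglement and geometric inconsistency in existing methods and achieving state-of-the-art performance on two benchmarks.

Details Motivation: Existing monocular semantic scene completion methods suffer from feature entanglement and geometric inconsistency, which limit reconstruction quality. VisHall3D addresses these issues by splitting the task into handling visible and invisible regions.

Contribution: 1. Proposes VisHall3D, a two-stage framework that handles visible and invisible regions in separate stages; 2. Introduces VisFrontierNet, a module that accurately traces the visual frontier; 3. Designs OcclusionMAE, a network that generates plausible geometry for invisible regions.

Method: 1. The first stage uses VisFrontierNet to reconstruct the visible regions while preserving fine detail; 2. The second stage uses OcclusionMAE to generate geometry for the invisible regions with a noise injection mechanism.

Result: On SemanticKITTI and SSCBench-KITTI-360, VisHall3D significantly outperforms previous methods and achieves state-of-the-art performance.

Insight: Decomposing scene completion into visible and invisible region handling effectively mitigates feature entanglement and geometric inconsistency, improving overall reconstruction quality.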

Abstract: This paper introduces VisHall3D, a novel two-stage framework for monocular semantic scene completion that aims to address the issues of feature entanglement and geometric inconsistency prevalent in existing methods. VisHall3D decomposes the scene completion task into two stages: reconstructing the visible regions (vision) and inferring the invisible regions (hallucination). In the first stage, VisFrontierNet, a visibility-aware projection module, is introduced to accurately trace the visual frontier while preserving fine-grained details. In the second stage, OcclusionMAE, a hallucination network, is employed to generate plausible geometries for the invisible regions using a noise injection mechanism. By decoupling scene completion into these two distinct stages, VisHall3D effectively mitigates feature entanglement and geometric inconsistency, leading to significantly improved reconstruction quality. The effectiveness of VisHall3D is validated through extensive experiments on two challenging benchmarks: SemanticKITTI and SSCBench-KITTI-360. VisHall3D achieves state-of-the-art performance, outperforming previous methods by a significant margin and paving the way for more accurate and reliable scene understanding in autonomous driving and other applications.

[60] Querying Autonomous Vehicle Point Clouds: Enhanced by 3D Object Counting with CounterNet

Xiaoyu Zhang,Zhifeng Bao,Hai Dong,Ziwei Wang,Jiajun Liu

Main category: cs.CV

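TL;DR: The paper proposes CounterNet, a heatmap-based network for accurate object counting in 3D point cloud data, to make queries over autonomous vehicle point clouds more reliable.

Details Motivation: Autonomous vehicles generate large volumes of point cloud data, but only part of it is relevant to a given task (e.g., collision detection or traffic analysis). Existing detection models count objects in 3D point clouds inaccurately, which makes query results unreliable.

Contribution: Defines three core query types (RETRIEVAL, COUNT, and AGGREGATION); proposes CounterNet, which improves counting accuracy by detecting object centers; introduces feature map partitioning and dynamic model selection strategies.

Method: CounterNet uses heatmaps to find object centers rather than focusing on precise localization; partitions the feature map into overlapping regions; and dynamically selects the most effective model configuration for each frame.

Result: On three real-world datasets, CounterNet improves counting accuracy by 5% to 20%, yielding more reliable query results.

Insight: Focusing on counting rather than precise localization is more effective for query performance; dynamic model selection and partitioning are important for handling complex scenes.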

Abstract: Autonomous vehicles generate massive volumes of point cloud data, yet only a subset is relevant for specific tasks such as collision detection, traffic analysis, or congestion monitoring. Effectively querying this data is essential to enable targeted analytics. In this work, we formalize point cloud querying by defining three core query types: RETRIEVAL, COUNT, and AGGREGATION, each aligned with distinct analytical scenarios. All these queries rely heavily on accurate object counts to produce meaningful results, making precise object counting a critical component of query execution. Prior work has focused on indexing techniques for 2D video data, assuming detection models provide accurate counting information. However, when applied to 3D point cloud data, state-of-the-art detection models often fail to generate reliable object counts, leading to substantial errors in query results. To address this limitation, we propose CounterNet, a heatmap-based network designed for accurate object counting in large-scale point cloud data. Rather than focusing on accurate object localization, CounterNet detects object presence by finding object centers to improve counting accuracy. We further enhance its performance with a feature map partitioning strategy using overlapping regions, enabling better handling of both small and large objects in complex traffic scenes. To adapt to varying frame characteristics, we introduce a per-frame dynamic model selection strategy that selects the most effective configuration for each input. Evaluations on three real-world autonomous vehicle datasets show that CounterNet improves counting accuracy by 5% to 20% across object categories, resulting in more reliable query outcomes across all supported query types.
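
Counting objects from a center heatmap can be sketched as peak counting. The snippet below is a generic illustration under assumptions, not CounterNet: it treats a cell as an object center if it exceeds a threshold and is strictly greater than its eight neighbors, and the function name `count_peaks` and the threshold value are hypothetical.

```python
def count_peaks(heatmap, threshold=0.5):
    """Count strict local maxima above a threshold in a 2D heatmap.

    Flat plateaus are not counted because the comparison is strict.
    """
    rows, cols = len(heatmap), len(heatmap[0])
    count = 0
    for r in range(rows):
        for c in range(cols):
            value = heatmap[r][c]
            if value < threshold:
                continue
            # Collect the up-to-eight neighbors that lie inside the grid.
            neighbors = [
                heatmap[rr][cc]
                for rr in range(max(0, r - 1), min(rows, r + 2))
                for cc in range(max(0, c - 1), min(cols, c + 2))
                if (rr, cc) != (r, c)
            ]
            if all(value > n for n in neighbors):
                count += 1
    return count

heatmap = [
    [0.1, 0.2, 0.1, 0.0, 0.0],
    [0.2, 0.9, 0.2, 0.0, 0.0],
    [0.1, 0.2, 0.1, 0.3, 0.2],
    [0.0, 0.0, 0.3, 0.8, 0.3],
    [0.0, 0.0, 0.2, 0.3, 0.4],
]
num_objects = count_peaks(heatmap)
```

A COUNT-style query could then be answered by comparing `num_objects` per frame against the requested bound, without needing precise boxes for each object.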

[61] PRE-MAP: Personalized Reinforced Eye-tracking Multimodal LLM for High-Resolution Multi-Attribute Point Prediction

Hanbing Wu,Ping Jiang,Anyang Su,Chenxu Zhao,Tianyu Fu,Minghui Wu,Beiping Tan,Huiying Li

Main category: cs.CV

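TL;DR: The paper presents the PRE-MAP model and the SPA-ADV dataset to address the neglect of subjective cognitive diversity in conventional saliency prediction, using reinforcement learning to optimize a multimodal large language model (MLLM) for personalized prediction of visual attention points.

Details Motivation: Conventional saliency prediction models rely on low-resolution images and ignore the effect of individual cognitive differences on gaze behavior, while MLLMs struggle with output format adherence and precise localization in multi-point prediction tasks.

Contribution: 1) Introduces the SPA-ADV dataset, which captures gaze behaviors from over 4,500 participants; 2) Proposes PRE-MAP, which optimizes an MLLM with reinforcement learning for personalized attention point prediction; 3) Introduces C-GRPO to address the MLLM's format and localization problems.

Method: Combines reinforcement learning with an MLLM, uses C-GRPO to optimize personalized attention point prediction, and guides prediction with multi-attribute user profiles.

Result: Experiments on SPA-ADV and other benchmarks demonstrate the effectiveness of the approach.

Insight: Individual cognitive differences matter for predicting visual attention points, and reinforcement learning can effectively improve MLLM performance on multi-point prediction tasks.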

Abstract: Visual selective attention, driven by individual preferences, regulates human prioritization of visual stimuli by bridging subjective cognitive mechanisms with objective visual elements, thereby steering the semantic interpretation and hierarchical processing of dynamic visual scenes. However, existing models and datasets predominantly neglect the influence of subjective cognitive diversity on fixation behavior. Conventional saliency prediction models, typically employing segmentation approaches, rely on low-resolution imagery to generate saliency heatmaps that are subsequently upscaled to native resolutions, which limits their capacity to capture personalized attention patterns. Furthermore, MLLMs are constrained by factors such as hallucinations, making it very costly to strictly adhere to the expected format in tasks involving multiple point predictions, and achieving precise point positioning is challenging. To address these limitations, we present Subjective Personalized Attention for Advertisement Videos, namely SPA-ADV, a large-scale multimodal dataset capturing gaze behaviors from over 4,500 participants varying in age and gender with 486 videos. Furthermore, we propose PRE-MAP, a novel eye-tracking saliency model that characterizes Personalized visual disparities through Reinforcement learning-optimized Eye-tracking, built upon MLLMs and guided by Multi-Attribute user profiles to predict Points. To ensure MLLMs produce prediction points that are both format-correct and spatially accurate, we introduce Consistency Group Relative Policy Optimization (C-GRPO), inspired by the variability in eye movement points and Multi-Attribute profiles. Extensive experiments on SPA-ADV and other benchmarks demonstrate the effectiveness of our approach. The code and dataset are available at https://github.com/mininglamp-MLLM/PRE-MAP.
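
The abstract does not spell out how C-GRPO differs from standard Group Relative Policy Optimization, so the sketch below covers only the generic group-relative step that the name refers to: several sampled outputs for one prompt are scored, and each reward is normalized by the group's mean and standard deviation. The reward function (zero for a malformed output, otherwise decaying with pixel distance) and all names are hypothetical stand-ins for the paper's format and accuracy terms.

```python
import math
from statistics import mean, pstdev

def point_reward(pred, target, format_ok, scale=100.0):
    """Hypothetical reward: 0 for a malformed output, else decays with distance."""
    if not format_ok:
        return 0.0
    return math.exp(-math.dist(pred, target) / scale)

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one sampled group.

    Uses the population standard deviation; implementations differ on this.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

target = (100.0, 100.0)
rewards = [
    point_reward((100.0, 100.0), target, format_ok=True),   # exact hit
    point_reward((200.0, 100.0), target, format_ok=True),   # 100 px away
    point_reward((0.0, 0.0), target, format_ok=False),      # malformed output
]
advantages = group_relative_advantages(rewards)
```

Outputs that are well-formed and close to the gaze point get positive advantages relative to their group, while malformed ones are pushed down, with no separate value network involved.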

[62] Event-Driven Storytelling with Multiple Lifelike Humans in a 3D Scene

Donggeun Lim,Jinseok Bae,Inwoo Hwang,Seungmin Lee,Hwanhee Lee,Young Min Kim

Main category: cs.CV

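TL;DR: The paper proposes a framework that uses a large language model (LLM) to generate contextual motions of multiple humans in a 3D scene, decomposing the dynamic scene into a sequence of events and using a high-level module to provide scalable context descriptions. It performs well in benchmark tests and user studies.

Details Motivation: Current methods for generating multi-human dynamic scenes struggle with complex contextual interactions, so a solution is needed that can synthesize motion at scale and with diversity.

Contribution: 1. Proposes an LLM-based event generator that decomposes a dynamic scene into a sequence of small events; 2. Develops a scalable high-level module for producing precise motion descriptions; 3. Offers the first benchmark for assessing contextual reasoning in this setting.

Method: 1. An LLM parses the text input into a sequence of events; 2. Each event produces a motion involving the relevant characters and objects; 3. Character positions are sampled using spatial guidance; 4. A high-level module translates events into relative descriptions from which precise coordinates are retrieved.

Result: Benchmark results and user studies show that the framework captures scene context effectively and scales well.

Insight: Combining an event-driven approach with an LLM enables large-scale, diverse dynamic 3D scene generation and opens new possibilities for virtual storytelling.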

Abstract: In this work, we propose a framework that creates a lively virtual dynamic scene with contextual motions of multiple humans. Generating multi-human contextual motion requires holistic reasoning over dynamic relationships among human-human and human-scene interactions. We harness a large language model (LLM) to digest the contextual complexity within textual input and convert the task into tangible subproblems, so that we can generate multi-agent behavior at a scale not considered before. Specifically, our event generator formulates the temporal progression of a dynamic scene into a sequence of small events. Each event calls for a well-defined motion involving relevant characters and objects. Next, we synthesize the motions of characters at positions sampled based on spatial guidance. We employ a high-level module to deliver scalable yet comprehensive context, translating events into relative descriptions that enable the retrieval of precise coordinates. As the first to address this problem at scale and with diversity, we offer a benchmark to assess diverse aspects of contextual reasoning. Benchmark results and user studies show that our framework effectively captures scene context with high scalability. The code and benchmark, along with result videos, are available at our project page: https://rms0329.github.io/Event-Driven-Storytelling/.

[63] BridgeNet: A Unified Multimodal Framework for Bridging 2D and 3D Industrial Anomaly Detection

An Xiang,Zixuan Huang,Xitong Gao,Kejiang Ye,Cheng-zhong Xu

Main category: cs.CV

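TL;DR: BridgeNet is a unified multimodal framework that bridges 2D and 3D industrial anomaly detection. It disentangles depth and appearance information, supports unified anomaly generation, and shares parameters across modalities, significantly improving performance.

Details Motivation: In industrial anomaly detection, 2D information alone is insufficient for identifying 3D depth anomalies, and disparities between modalities cause existing methods to perform poorly. Abnormal samples are also scarce in industrial data, so realistic anomalies need to be simulated.

Contribution: 1. Extracts depth information from 3D point clouds and disentangles it from 2D RGB appearance; 2. Proposes a Multi-Scale Gaussian Anomaly Generator and a Unified Texture Anomaly Generator; 3. All modules share parameters, so multimodal features can be used without complex fusion.

Method: Disentangles depth and appearance information and combines the Multi-Scale Gaussian and Unified Texture anomaly generators for unified anomaly generation. Modules share parameters and use features from both modalities directly.

Result: Outperforms existing methods on the MVTec-3D AD and Eyecandies datasets.

Insight: Disentangling multimodal information and generating anomalies in a unified way effectively improves industrial anomaly detection, especially in 3D scenarios.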

Abstract: Industrial anomaly detection for 2D objects has gained significant attention, and anomaly detection (AD) methods have made considerable progress. However, 2D information alone is insufficient for identifying 3D depth anomalies. Whether they explicitly fuse depth information into RGB images or use point cloud backbone networks to extract depth features, existing approaches struggle to adequately represent 3D information in multimodal scenarios due to the disparities among different modal information. Additionally, due to the scarcity of abnormal samples in industrial data, especially in multimodal scenarios, it is necessary to perform anomaly generation to simulate real-world abnormal samples. Therefore, we propose a novel unified multimodal anomaly detection framework to address these issues. Our contributions consist of 3 key aspects. (1) We extract visible depth information from 3D point cloud data simply and use 2D RGB images to represent appearance, which disentangles depth and appearance to support unified anomaly generation. (2) Benefiting from the flexible input representation, the proposed Multi-Scale Gaussian Anomaly Generator and Unified Texture Anomaly Generator can generate richer anomalies in RGB and depth. (3) All modules share parameters for both RGB and depth data, effectively bridging 2D and 3D anomaly detection. Subsequent modules can directly leverage features from both modalities without complex fusion. Experiments show our method outperforms state-of-the-art (SOTA) methods on the MVTec-3D AD and Eyecandies datasets. Code available at: https://github.com/Xantastic/BridgeNet

[64] OVFact: Measuring and Improving Open-Vocabulary Factuality for Long Caption Models

Monika Wysoczańska,Shyamal Buch,Anurag Arnab,Cordelia Schmid

Main category: cs.CV

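TL;DR: The paper proposes OVFact, a new method for measuring and improving the factuality of long captions from vision-language models. It uses open-vocabulary visual grounding and tool-based verification without relying on human annotations, improves agreement with human judgments, and shows promise for data filtering.

Details Motivation: Large vision-language models (VLMs) struggle to generate captions that are both long and factual, and traditional hallucination and factuality measures are poorly suited to long, diverse captions, especially when human-annotated references are unavailable.

Contribution: 1. Proposes OVFact, which combines open-vocabulary visual grounding with tool-based verification and requires no human annotations. 2. Designs a single metric that captures both descriptiveness and factuality. 3. Shows that training on an OVFact-filtered dataset improves factuality without sacrificing descriptiveness.

Method: Uses open-vocabulary visual grounding together with tool-based verification to build a reference-free factuality metric, and applies it to filter large-scale noisy data.

Result: Models trained on an OVFact-filtered subset (2.5-5x smaller) of a noisy pretraining set improve factual precision across several downstream long-caption benchmarks while preserving descriptiveness.

Insight: A reference-free factuality metric makes it possible to filter noisy pretraining data effectively, improving the factuality of generated captions.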

Abstract: Large vision-language models (VLMs) often struggle to generate long and factual captions. However, traditional measures for hallucination and factuality are not well suited to evaluating longer, more diverse captions, or to settings where ground-truth human-annotated captions are unavailable. We introduce OVFact, a novel method for measuring caption factuality of long captions that leverages open-vocabulary visual grounding and tool-based verification without depending on human annotations. Our method improves agreement with human judgments and captures both caption descriptiveness (recall) and factual precision in the same metric. Furthermore, unlike previous metrics, our reference-free method design enables new applications towards factuality-based data filtering. We observe that models trained on an OVFact-filtered (2.5-5x less) subset of a large-scale, noisy (VLM-generated) pretraining set meaningfully improve factuality precision without sacrificing caption descriptiveness across a range of downstream long caption benchmarks.

[65] SimMLM: A Simple Framework for Multi-modal Learning with Missing Modality

Sijie Li,Chen Chen,Jungong Han

Main category: cs.CV

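TL;DR: SimMLM is a simple multimodal learning framework that supports missing-modality scenarios, achieving high accuracy and robustness through a dynamic mixture of experts and the MoFe ranking loss.

Details Motivation: Conventional multimodal learning relies on complex architectures or modality imputation techniques and copes poorly with missing modalities. SimMLM aims to provide a generic and effective solution.

Contribution: 1. Proposes the Dynamic Mixture of Modality Experts (DMoME) architecture; 2. Designs the MoFe ranking loss, which ensures that task performance does not drop as modalities are added; 3. Validates the approach on medical image segmentation and multimodal classification tasks.

Method: 1. DMoME dynamically adjusts each modality's contribution; 2. The MoFe loss constrains how performance changes as modalities are added or removed; 3. The framework applies to both complete and missing modality settings.

Result: Performs strongly on BraTS 2018, UPMC Food-101, and avMNIST, with better accuracy and robustness than competing methods.

Insight: The dynamic gating mechanism and the MoFe loss are the key innovations, offering a new approach to the missing-modality problem.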

Abstract: In this paper, we propose SimMLM, a simple yet powerful framework for multimodal learning with missing modalities. Unlike existing approaches that rely on sophisticated network architectures or complex data imputation techniques, SimMLM provides a generic and effective solution that can adapt to various missing modality scenarios with improved accuracy and robustness. Specifically, SimMLM consists of a generic Dynamic Mixture of Modality Experts (DMoME) architecture, featuring a dynamic, learnable gating mechanism that automatically adjusts each modality’s contribution in both full and partial modality settings. A key innovation of SimMLM is the proposed More vs. Fewer (MoFe) ranking loss, which ensures that task accuracy improves or remains stable as more modalities are made available. This aligns the model with an intuitive principle: removing one or more modalities should not increase accuracy. We validate SimMLM on multimodal medical image segmentation (BraTS 2018) and multimodal classification (UPMC Food-101, avMNIST) tasks, where it consistently surpasses competitive methods, demonstrating superior accuracy, interpretability, robustness, and reliability across both complete and missing modality scenarios at test time.
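
The principle behind the MoFe ranking loss, that a model given more modalities should do no worse than the same model given fewer, can be written as a pairwise hinge penalty. The form below is an assumption for illustration, since the abstract does not give the exact loss: it compares task losses over every (superset, subset) pair of modality sets and penalizes pairs where the larger set has the higher loss. The function name and the modality labels "a" and "b" are hypothetical.

```python
def mofe_style_penalty(loss_by_subset, margin=0.0):
    """Average hinge penalty over all (more, fewer) modality-set pairs.

    loss_by_subset maps a frozenset of modality names to the task loss
    obtained when only those modalities are available.
    """
    penalty, pairs = 0.0, 0
    for more, loss_more in loss_by_subset.items():
        for fewer, loss_fewer in loss_by_subset.items():
            if fewer < more:  # strict subset: 'more' has extra modalities
                penalty += max(0.0, loss_more - loss_fewer + margin)
                pairs += 1
    return penalty / pairs if pairs else 0.0

violating = {
    frozenset({"a"}): 0.9,
    frozenset({"b"}): 0.7,
    frozenset({"a", "b"}): 0.8,   # worse than using "b" alone
}
consistent = {
    frozenset({"a"}): 0.9,
    frozenset({"b"}): 0.7,
    frozenset({"a", "b"}): 0.6,   # at least as good as every subset
}
penalty_violating = mofe_style_penalty(violating)
penalty_consistent = mofe_style_penalty(consistent)
```

Only the pair where adding a modality hurts contributes to the penalty, so a model that already respects the ordering incurs no extra loss.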

[66] Video Self-Distillation for Single-Image Encoders: A Step Toward Physically Plausible Perception

Marcel Simon,Tae-Ho Kim,Seul-Ki Yeom

Main category: cs.CV

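TL;DR: The paper proposes a video self-distillation method that trains a single-image encoder to predict the next-frame representation, injecting 3D spatial and temporal priors and improving the encoder's geometry awareness for physically plausible perception.

Details Motivation: Existing self-supervised image encoders (such as DINO) are typically trained on static images and ignore the temporal information in videos. The authors aim to use temporal cues in video data to improve the geometry awareness of image encoders in support of Physical AI.

Contribution: Proposes a video self-distillation framework for single-image encoders that injects 3D spatial and temporal priors through a simple objective (predicting the next-frame representation), without optical flow or tracking.

Method: Trains a single-image encoder to predict the representation of the next frame in a video, using temporal information to improve geometry awareness.

Result: After pre-training on a single 2-hour video, mIoU on ADE20K rises from 35.0 (DoRA) to 36.4, a gain of 1.4 points.

Insight: Video self-distillation is a lightweight way to give image encoders geometry awareness, an essential ingredient for physically plausible world models.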

Abstract: Self-supervised image encoders such as DINO have recently gained significant interest for learning robust visual features without labels. However, most SSL methods train on static images and miss the temporal cues inherent in videos. We introduce a video-distilled single-image encoder trained to predict the next-frame representation from the current frame. This simple objective injects 3D spatial and temporal priors without optical flow or tracking. When pre-training on a single 2-hour video, our approach raises the mean Intersection-over-Union (mIoU) on ADE20K from 35.0 (DoRA) to 36.4 while remaining a drop-in replacement for image-only pipelines. Our results highlight video self-distillation as a lightweight route to geometry-aware perception, an essential ingredient for physically plausible world models and Physical AI.

[67] RemoteReasoner: Towards Unifying Geospatial Reasoning Workflow

Liang Yao,Fan Liu,Hongbo Lu,Chuanyi Zhang,Rui Min,Shengxiang Xu,Shimin Di,Pai Peng

Main category: cs.CV

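TL;DR: RemoteReasoner combines a multimodal large language model with reinforcement learning into a flexible remote sensing reasoning workflow that handles complex spatial reasoning tasks autonomously and produces multi-granularity outputs.

Details Motivation: Existing remote sensing methods rely on supervised fine-tuning, which limits the autonomy and flexibility of reasoning and prevents a unified treatment of complex spatial reasoning tasks. RemoteReasoner is proposed to address this.

Contribution: RemoteReasoner, a flexible remote sensing reasoning workflow that combines an MLLM with reinforcement learning to reason autonomously and generate multi-granularity outputs, with strong reported performance on region-level and pixel-level tasks.

Method: An MLLM interprets user instructions and localizes targets, reinforcement learning training provides autonomous reasoning, and task adaptation strategies produce multi-granularity outputs.

Result: Preliminary experiments show strong performance on multi-granularity reasoning tasks and support for new tasks beyond existing pipelines, such as contour extraction.

Insight: Reinforcement learning can give an MLLM autonomous reasoning ability, while task adaptation strategies allow diverse outputs without further fine-tuning or task-specific decoders.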

Abstract: Remote sensing imagery presents vast, inherently unstructured spatial data, demanding sophisticated reasoning to interpret complex user intents and contextual relationships beyond simple recognition tasks. In this paper, we aim to construct an Earth observation workflow to handle complex queries by reasoning about spatial context and user intent. As a reasoning workflow, it should be somewhat autonomous, where predefined ground-truth reasoning paths do not constrain the learning process. Furthermore, its architecture ought to be unified yet flexible, enabling the model to perform diverse reasoning tasks with distinct output formats through a single forward pass. Existing remote sensing approaches fail to address these requirements, as they rely on supervised fine-tuning paradigms that constrain the autonomy of reasoning. To this end, we propose RemoteReasoner, a flexible and robust workflow for remote sensing reasoning tasks. The design of RemoteReasoner integrates a multi-modal large language model (MLLM) for interpreting user instructions and localizing targets, together with task adaptation strategies that enable multi-granularity output generation. In contrast to existing methods, our framework is trained with reinforcement learning (RL) to endow the MLLM with sufficient autonomy for precise reasoning. At the inference stage, our adaptation strategies enable diverse output formats without requiring task-specific decoders or further fine-tuning. Preliminary experiments demonstrate that RemoteReasoner achieves remarkable performance across multi-granularity reasoning tasks, including region-level and pixel-level tasks. Additionally, our framework enables novel capabilities such as contour extraction, a task beyond the reach of existing reasoning pipelines.

[68] ABCD: Automatic Blood Cell Detection via Attention-Guided Improved YOLOX

Ahmed Endris Hasen,Yang Shangming,Chiagoziem C. Ukwuoma,Biniyam Gashaw,Abel Zenebe Yutra

Main category: cs.CV

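TL;DR: The paper proposes ABCD, an automatic blood cell detection method based on an improved YOLOX that adds an attention mechanism and optimized feature fusion to improve detection efficiency and accuracy.

Details Motivation: Manual blood cell examination is time-consuming and error-prone. Deep learning object detectors offer an efficient alternative, but their accuracy and efficiency need further improvement.

Contribution: 1. Adds the CBAM module to YOLOX to enhance feature extraction; 2. uses ASFF to optimize feature fusion; 3. adopts the CIOU loss function to speed up convergence.

Method: Combines the CBAM attention module, ASFF feature fusion, and the CIOU loss function to improve the YOLOX model for blood cell detection.

Result: On the BCCD dataset, ABCD reaches 95.49% mAP@0.5 and 86.89% mAP@0.5-0.9, which are 2.8% and 23.41% higher than the baseline algorithm, respectively, with a 2.9% increase in detection speed.

Insight: Attention mechanisms and optimized feature fusion noticeably improve blood cell detection, and the CIOU loss function speeds up model convergence.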

Abstract: Detection of blood cells in microscopic images has become a major focus of medical image analysis, playing a crucial role in gaining valuable insights into a patient’s health. Manual blood cell checks for disease detection are known to be time-consuming, inefficient, and error-prone. To address these limitations, analyzing blood cells using deep learning-based object detectors can be regarded as a feasible solution. In this study, we propose an automatic blood cell detection method (ABCD) based on an improved version of YOLOX, an object detector, for detecting various types of blood cells, including white blood cells, red blood cells, and platelets. Firstly, we introduce the Convolutional Block Attention Module (CBAM) into the network’s backbone to enhance the efficiency of feature extraction. Furthermore, we introduce the Adaptively Spatial Feature Fusion (ASFF) into the network’s neck, which optimizes the fusion of different features extracted from various stages of the network. Finally, to speed up the model’s convergence, we substitute the Intersection over Union (IOU) loss function with the Complete Intersection over Union (CIOU) loss function. The experimental results demonstrate that the proposed method is more effective than other existing methods on the BCCD dataset. Compared to the baseline algorithm, our method ABCD achieved 95.49% mAP@0.5 and 86.89% mAP@0.5-0.9, which are 2.8% and 23.41% higher, respectively, and increased the detection speed by 2.9%, making it highly efficient for real-time applications.
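To make the last substitution concrete, here is a minimal sketch of the CIoU loss in its commonly published form (IoU term, normalized center distance, and an aspect-ratio consistency term) for a single pair of axis-aligned boxes. The `(x1, y1, x2, y2)` box format and the function name are assumptions for illustration; the ABCD implementation itself is not shown in the abstract.

```python
import math


def ciou_loss(box, gt, eps=1e-9):
    """CIoU loss for two boxes given as (x1, y1, x2, y2) with x2 > x1, y2 > y1."""
    x1, y1, x2, y2 = box
    gx1, gy1, gx2, gy2 = gt
    w, h = x2 - x1, y2 - y1
    gw, gh = gx2 - gx1, gy2 - gy1

    # Plain IoU: overlap area divided by union area.
    inter_w = max(0.0, min(x2, gx2) - max(x1, gx1))
    inter_h = max(0.0, min(y2, gy2) - max(y1, gy1))
    inter = inter_w * inter_h
    union = w * h + gw * gh - inter
    iou = inter / (union + eps)

    # Squared distance between box centers.
    rho2 = ((x1 + x2) / 2 - (gx1 + gx2) / 2) ** 2 + ((y1 + y2) / 2 - (gy1 + gy2) / 2) ** 2
    # Squared diagonal of the smallest box enclosing both.
    c2 = (max(x2, gx2) - min(x1, gx1)) ** 2 + (max(y2, gy2) - min(y1, gy1)) ** 2 + eps

    # Aspect-ratio consistency term and its trade-off weight.
    v = (4 / math.pi ** 2) * (math.atan(gw / (gh + eps)) - math.atan(w / (h + eps))) ** 2
    alpha = v / (1 - iou + v + eps)

    return 1 - iou + rho2 / c2 + alpha * v


same = ciou_loss((0, 0, 2, 2), (0, 0, 2, 2))      # identical boxes
disjoint = ciou_loss((0, 0, 1, 1), (2, 2, 3, 3))  # no overlap
```

Unlike the plain IoU loss, which is flat at 1 for any pair of non-overlapping boxes, the center-distance term still varies with how far apart the boxes are, which is the usual argument for faster convergence.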

[69] Multistream Network for LiDAR and Camera-based 3D Object Detection in Outdoor Scenes

Muhammad Ibrahim,Naveed Akhtar,Haitian Wang,Saeed Anwar,Ajmal Mian

Main category: cs.CV

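TL;DR: This paper proposes a multistream network (MuStD) that fuses LiDAR and RGB data to improve outdoor 3D object detection accuracy, using a three-stream structure to extract modality features for efficient and accurate detection.

Details Motivation: The potential of fusing LiDAR and RGB data for outdoor 3D object detection is not yet fully exploited, and integrating the two modalities effectively remains challenging.

Contribution: The MuStD network, with three streams that process LiDAR and RGB data, and multimodal feature fusion via UV mapping and polar coordinate indexing.

Method: The network comprises three streams, LiDAR-PillarNet, LiDAR-Height Compression, and 3D Multimodal, which extract sparse pillar features, BEV features, and multimodal features, respectively.

Result: State-of-the-art or highly competitive results on the KITTI benchmark while remaining efficient.

Insight: Multimodal fusion needs feature extraction and fusion strategies designed around the task; the three-stream structure captures complementary information.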

Abstract: Fusion of LiDAR and RGB data has the potential to enhance outdoor 3D object detection accuracy. To address real-world challenges in outdoor 3D object detection, fusion of LiDAR and RGB input has started gaining traction. However, effective integration of these modalities for precise object detection remains a largely open problem. To address this, we propose a MultiStream Detection (MuStD) network that meticulously extracts task-relevant information from both data modalities. The network follows a three-stream structure. Its LiDAR-PillarNet stream extracts sparse 2D pillar features from the LiDAR input while the LiDAR-Height Compression stream computes Bird’s-Eye View features. An additional 3D Multimodal stream combines RGB and LiDAR features using UV mapping and polar coordinate indexing. Eventually, the features containing comprehensive spatial, textural and geometric information are carefully fused and fed to a detection head for 3D object detection. Our extensive evaluation on the challenging KITTI Object Detection Benchmark using the public testing server at https://www.cvlibs.net/datasets/kitti/eval_object_detail.php?&result=d162ec699d6992040e34314d19ab7f5c217075e0 establishes the efficacy of our method by achieving new state-of-the-art or highly competitive results in different categories while remaining among the most efficient methods. Our code will be released through the MuStD GitHub repository at https://github.com/IbrahimUWA/MuStD.git

[70] SIDE: Sparse Information Disentanglement for Explainable Artificial Intelligence

Viktar Dubovik,Łukasz Struski,Jacek Tabor,Dawid Rymarczyk

Main category: cs.CV

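TL;DR: SIDE improves the interpretability of prototypical parts through sparsity-enforcing training and pruning, greatly reducing explanation size while maintaining accuracy.

Details Motivation: Deep learning models lack transparency. Prototypical-parts methods are promising but produce complex explanations. SIDE aims to simplify explanations while preserving performance.

Contribution: The SIDE method, which uses sparsity-enforcing training and pruning to reduce explanation size by over 90% and make explanations easier to understand.

Method: Sparsity-enforcing training and pruning, combined with sigmoid activations in place of softmax, so that each class is associated with only a small set of prototypes.

Result: SIDE matches the accuracy of existing methods while reducing explanation size by over 90%.

Insight: Designing for sparsity substantially improves the interpretability of prototype-based methods and applies to large-scale datasets.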

Abstract: Understanding the decisions made by deep neural networks is essential in high-stakes domains such as medical imaging and autonomous driving. Yet, these models often lack transparency, particularly in computer vision. Prototypical-parts-based neural networks have emerged as a promising solution by offering concept-level explanations. However, most are limited to fine-grained classification tasks, with few exceptions such as InfoDisent. InfoDisent extends prototypical models to large-scale datasets like ImageNet, but produces complex explanations. We introduce Sparse Information Disentanglement for Explainability (SIDE), a novel method that improves the interpretability of prototypical parts through a dedicated training and pruning scheme that enforces sparsity. Combined with sigmoid activations in place of softmax, this approach allows SIDE to associate each class with only a small set of relevant prototypes. Extensive experiments show that SIDE matches the accuracy of existing methods while reducing explanation size by over 90%, substantially enhancing the understandability of prototype-based explanations.
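One way to see why sigmoid activations help with sparsity: a sigmoid scores each prototype independently, so a per-prototype threshold can drop weak links without renormalizing the rest, whereas softmax scores always sum to one across prototypes. The sketch below illustrates only that idea; the thresholding rule, the function names, and the example logits are assumptions and not SIDE's actual pruning scheme.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def kept_prototypes(class_logits, threshold=0.5):
    """Indices of prototypes whose independent sigmoid score passes the threshold.

    class_logits: hypothetical class-to-prototype connection logits for one class.
    """
    return [j for j, logit in enumerate(class_logits) if sigmoid(logit) >= threshold]


# Four candidate prototypes for one class; two have clearly negative logits.
kept = kept_prototypes([3.0, -4.0, 0.1, -2.0])
```

Here `kept` contains only the prototypes with non-negative logits, so the explanation for this class would cite two prototypes instead of four.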

[71] EffiComm: Bandwidth Efficient Multi Agent Communication

Melih Yazgan,Allen Xavier Arasan,J. Marius Zöllner

Main category: cs.CV

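TL;DR: EffiComm is a bandwidth-efficient multi-agent communication framework that uses Selective Transmission and Adaptive Grid Reduction to cut transmitted data substantially while maintaining high 3D object detection accuracy.

Details Motivation: In collaborative perception, transmitting raw point clouds or full feature maps overwhelms V2V communication and causes latency and scalability problems, so more efficient data transmission is needed.

Contribution: The EffiComm framework, which combines Selective Transmission and Adaptive Grid Reduction to achieve accurate 3D detection at low bandwidth.

Method: A two-stage reduction: Selective Transmission prunes low-utility regions with a confidence mask, and Adaptive Grid Reduction uses a GNN to assign vehicle-specific keep ratios. The remaining features are fused with a soft-gated MoE attention layer.

Result: On the OPV2V benchmark, EffiComm reaches 0.84 mAP@0.7 while sending about 1.5 MB per frame on average, outperforming previous methods on the accuracy-per-bit curve.

Insight: Adaptive, learned communication strategies matter for scalable V2X perception and improve the trade-off between bandwidth and accuracy.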

Abstract: Collaborative perception allows connected vehicles to exchange sensor information and overcome each vehicle’s blind spots. Yet transmitting raw point clouds or full feature maps overwhelms Vehicle-to-Vehicle (V2V) communications, causing latency and scalability problems. We introduce EffiComm, an end-to-end framework that transmits less than 40% of the data required by prior art while maintaining state-of-the-art 3D object detection accuracy. EffiComm operates on Bird’s-Eye-View (BEV) feature maps from any modality and applies a two-stage reduction pipeline: (1) Selective Transmission (ST) prunes low-utility regions with a confidence mask; (2) Adaptive Grid Reduction (AGR) uses a Graph Neural Network (GNN) to assign vehicle-specific keep ratios according to role and network load. The remaining features are fused with a soft-gated Mixture-of-Experts (MoE) attention layer, offering greater capacity and specialization for effective feature integration. On the OPV2V benchmark, EffiComm reaches 0.84 mAP@0.7 while sending only an average of approximately 1.5 MB per frame, outperforming previous methods on the accuracy-per-bit curve. These results highlight the value of adaptive, learned communication for scalable Vehicle-to-Everything (V2X) perception.
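The first stage of the reduction pipeline can be sketched as a masking step over BEV cells. This is a minimal illustration assuming a flat list of per-cell confidence scores and a top-k rule; in the paper the keep ratio comes from the GNN-based AGR stage, whereas here `keep_ratio` is simply passed in, and the function name is hypothetical.

```python
def selective_mask(confidence, keep_ratio):
    """Boolean mask that keeps the highest-confidence cells.

    confidence: per-cell confidence scores (flattened BEV grid, assumed layout)
    keep_ratio: fraction of cells to transmit, in (0, 1]
    """
    n_keep = max(1, round(len(confidence) * keep_ratio))
    ranked = sorted(range(len(confidence)), key=lambda i: confidence[i], reverse=True)
    keep = set(ranked[:n_keep])
    return [i in keep for i in range(len(confidence))]


# Keep half of four cells: the two most confident ones survive.
mask = selective_mask([0.9, 0.1, 0.5, 0.3], keep_ratio=0.5)
```

Only the features at cells where the mask is true would be sent to neighboring vehicles, which is where the bandwidth saving comes from.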

[72] EA-ViT: Efficient Adaptation for Elastic Vision Transformer

Chen Zhu,Wangbo Zhao,Huiwen Zhang,Samir Khaki,Yuhao Zhou,Weidong Tang,Shuo Wang,Zhihang Yuan,Yuzhang Shang,Xiaojiang Peng,Kai Wang,Dawei Yang

Main category: cs.CV

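TL;DR: EA-ViT is an efficient adaptation framework for Vision Transformers (ViTs) that uses a nested elastic architecture and a lightweight router to produce models of multiple sizes for different resource budgets.

Details Motivation: Deploying ViTs under different resource constraints currently requires retraining multiple size-specific models, which is time-consuming and energy-intensive. EA-ViT aims to generate a range of models from a single adaptation process.

Contribution: 1) A nested elastic architecture that supports flexible adjustment of MLP expansion ratio, number of attention heads, embedding dimension, and network depth; 2) a curriculum-based training strategy and a lightweight router for submodel selection.

Method: Two stages: 1) enhance a pre-trained ViT with the nested elastic architecture, trained with a curriculum that progressively increases elasticity; 2) design a router, initialized with Pareto-optimal configurations from a customized NSGA-II algorithm, and optimize it jointly with the backbone.

Result: Experiments on multiple benchmarks demonstrate the effectiveness and versatility of EA-ViT and its support for flexible deployment.

Insight: Structural elasticity combined with a routing mechanism reduces the cost of deploying ViTs across varied resource settings.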

Abstract: Vision Transformers (ViTs) have emerged as a foundational model in computer vision, excelling in generalization and adaptation to downstream tasks. However, deploying ViTs to support diverse resource constraints typically requires retraining multiple, size-specific ViTs, which is both time-consuming and energy-intensive. To address this issue, we propose an efficient ViT adaptation framework that enables a single adaptation process to generate multiple models of varying sizes for deployment on platforms with various resource constraints. Our approach comprises two stages. In the first stage, we enhance a pre-trained ViT with a nested elastic architecture that enables structural flexibility across MLP expansion ratio, number of attention heads, embedding dimension, and network depth. To preserve pre-trained knowledge and ensure stable adaptation, we adopt a curriculum-based training strategy that progressively increases elasticity. In the second stage, we design a lightweight router to select submodels according to computational budgets and downstream task demands. Initialized with Pareto-optimal configurations derived via a customized NSGA-II algorithm, the router is then jointly optimized with the backbone. Extensive experiments on multiple benchmarks demonstrate the effectiveness and versatility of EA-ViT. The code is available at https://github.com/zcxcf/EA-ViT.

[73] BEV-LLM: Leveraging Multimodal BEV Maps for Scene Captioning in Autonomous Driving

Felix Brandstaetter,Erik Schuetz,Katharina Winter,Fabian Flohr

Main category: cs.CV

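TL;DR: BEV-LLM is a lightweight scene captioning model for autonomous driving that combines LiDAR point clouds and multi-view images and uses a novel absolute positional encoding to generate natural language descriptions, outperforming prior methods on the nuCaption dataset.

Details Motivation: Transparency and interpretability of autonomous driving systems are critical to safety. BEV-LLM uses scene captioning to improve human-AI interaction and system transparency.

Contribution: 1. The BEV-LLM model; 2. a new absolute positional encoding; 3. two new datasets, nuView and GroundView.

Method: Uses BEVFusion to fuse multimodal data (LiDAR and images) and a lightweight 1B-parameter base model to generate scene captions.

Result: Surpasses the state of the art on nuCaption by up to 5% in BLEU scores; the new datasets provide benchmarks for evaluation across diverse scenarios.

Insight: 1. Multimodal fusion improves scene captioning; 2. a lightweight model can achieve competitive performance; 3. the new datasets address gaps in current benchmarks.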

Abstract: Autonomous driving technology has the potential to transform transportation, but its wide adoption depends on the development of interpretable and transparent decision-making systems. Scene captioning, which generates natural language descriptions of the driving environment, plays a crucial role in enhancing transparency, safety, and human-AI interaction. We introduce BEV-LLM, a lightweight model for 3D captioning of autonomous driving scenes. BEV-LLM leverages BEVFusion to combine 3D LiDAR point clouds and multi-view images, incorporating a novel absolute positional encoding for view-specific scene descriptions. Despite using a small 1B parameter base model, BEV-LLM achieves competitive performance on the nuCaption dataset, surpassing state-of-the-art by up to 5% in BLEU scores. Additionally, we release two new datasets - nuView (focused on environmental conditions and viewpoints) and GroundView (focused on object grounding) - to better assess scene captioning across diverse driving scenarios and address gaps in current benchmarks, along with initial benchmarking results demonstrating their effectiveness.

[74] CXR-CML: Improved zero-shot classification of long-tailed multi-label diseases in Chest X-Rays

Rajesh Madhipati,Sheethal Bhat,Lukas Buess,Andreas Maier

Main category: cs.CV

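TL;DR: This paper proposes an improved zero-shot method for long-tailed multi-label disease classification in chest X-rays, using a class-weighting mechanism and GMM clustering to refine the latent space and improve performance on rare classes.

Details Motivation: The long-tailed distribution of findings in chest X-rays (few samples for minority classes) causes existing self-supervised methods to perform poorly on rare classes; in particular, Contrastive Language-Image Pre-training (CLIP) models degrade significantly on long-tailed classes.

Contribution: 1. A class-weighting mechanism aligned directly with the class distribution in the latent space; 2. latent-space refinement combining GMM clustering and a Student t-distribution; 3. an average gain of 7 percentage points in zero-shot AUC on the MIMIC-CXR-JPG dataset.

Method: 1. Cluster the latent space with a GMM; 2. refine the clusters with a Student t-distribution; 3. apply a metric loss on the altered embeddings for stable, adaptive feature clustering.

Result: Across 40 classes in the MIMIC-CXR-JPG dataset, zero-shot AUC improves by 7 percentage points on average over previous SOTA models, with particular emphasis on rarely observed classes.

Insight: Explicitly modeling the latent-space distribution and refining the clustering mitigates the effect of long-tailed distributions on zero-shot classification, particularly for multi-label classification in medical imaging.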

Abstract: Chest radiography (CXR) plays a crucial role in the diagnosis of various diseases. However, the inherent class imbalance in the distribution of clinical findings presents a significant challenge for current self-supervised deep learning models. These models often fail to accurately classify long-tailed classes. Current Vision-Language models such as Contrastive Language Image Pre-training (CLIP) models effectively model the manifold distribution of the latent space, enabling high zero-shot classification accuracies. Although CLIP performs well on most of the primary classes in the dataset, our work reveals that its effectiveness decreases significantly for classes with a long-tailed distribution. Our approach employs a class-weighting mechanism that directly aligns with the distribution of classes within the latent space. This method ensures a substantial improvement in overall classification performance, with particular emphasis on enhancing the recognition and accuracy of rarely observed classes. We accomplish this by applying Gaussian Mixture Model (GMM) clustering to the latent space. The resulting clusters are further refined by a Student t-distribution, followed by a metric loss that utilizes the altered embeddings. Our approach facilitates stable and adaptive clustering of the features. This results in a notable average improvement of 7 percentage points in zero-shot AUC scores across 40 classes in the MIMIC-CXR-JPG dataset over previous SOTA models.

[75] Modality Agnostic Efficient Long Range Encoder

Toufiq Parag,Ahmed Elgammal

Main category: cs.CV

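TL;DR: MAELRE is a modality-agnostic, efficient long-range encoder that combines token merging with attention approximation to address the efficiency and accuracy of long-context processing on a single device.

Details Motivation: Existing ways of extending context length are often modality specific and reach a suboptimal trade-off between accuracy and efficiency. MAELRE aims to be a unified, efficient architecture for long-context encoding across modalities.

Contribution: The MAELRE architecture, which uses progressive token merging and switching between attention mechanisms to reduce the quadratic compute and memory cost of long-context processing while keeping accuracy high.

Method: Combines token merging with attention approximation: tokens are merged progressively, with a lightweight attention approximation used when the token count is large and standard dot-product attention once the sequence is shorter.

Result: On classification tasks spanning text, time series, audio, and vision, MAELRE achieves higher accuracy at lower computational cost than existing long-context models.

Insight: Switching attention mechanisms by sequence length and merging tokens progressively are key to efficient, accurate long-context processing. The architecture offers a unified approach to long-range encoding across modalities.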

Abstract: The long-context capability of recent large transformer models can be surmised to rely on techniques such as attention/model parallelism, as well as hardware-level optimizations. While these strategies allow input lengths to scale to millions of tokens, they do not fundamentally mitigate the quadratic computational and memory complexity of the core attention mechanism. In this paper, we address the challenge of long-context processing on a single device using generic implementations by reducing the quadratic memory footprint and inference cost. Existing approaches to extend the context length for generic single device implementations – such as token merging and modified attentions – are often modality specific and attain a suboptimal tradeoff between accuracy and efficiency. To overcome these limitations, we propose MAELRE (Modality Agnostic Efficient Long Range Encoder), a unified and efficient transformer architecture designed for long-range encoding across diverse modalities. MAELRE integrates token merging with attention approximation, progressively merging tokens at different stages of internal computational blocks. It employs a lightweight attention approximation when the number of tokens is large, and switches to standard dot-product attention as the sequence becomes shorter through successive aggregation. We demonstrate that MAELRE achieves superior accuracy while reducing computational cost compared to existing long-context models on classification tasks spanning multiple modalities, including text, time series, audio, and vision.
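The two mechanisms described above can be sketched in a few lines. This is a minimal illustration, assuming tokens are merged by averaging adjacent pairs and that the switch between attention types is a fixed token-count threshold; the abstract does not specify either rule, and the function names and the `threshold` value are hypothetical.

```python
def merge_adjacent(tokens):
    """Halve the sequence by averaging adjacent token pairs (assumed merge rule).

    tokens: list of equal-length feature vectors (lists of floats).
    A trailing unpaired token is carried over unchanged.
    """
    merged = []
    for i in range(0, len(tokens) - 1, 2):
        merged.append([(a + b) / 2 for a, b in zip(tokens[i], tokens[i + 1])])
    if len(tokens) % 2:
        merged.append(list(tokens[-1]))
    return merged


def choose_attention(num_tokens, threshold=1024):
    """Pick the attention type for a block based on the current sequence length."""
    return "approximate" if num_tokens > threshold else "dot_product"


shorter = merge_adjacent([[0.0, 2.0], [2.0, 4.0], [5.0, 5.0]])
early_block = choose_attention(4096)
late_block = choose_attention(256)
```

Because each merge roughly halves the sequence, the quadratic cost of exact attention drops by about a factor of four per stage, which is why exact attention becomes affordable in later blocks.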

[76] CircuitProbe: Dissecting Spatiotemporal Visual Semantics with Circuit Tracing

Yiming Zhang,Chengzhang Yu,Zhuokai Zhao,Kun Wang,Qiankun Li,Zihan Chen,Yang Liu,Zenghui Ding,Yining Sun

Main category: cs.CV

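TL;DR: The paper proposes CircuitProbe, a circuit-based framework for analyzing how large vision-language models (LVLMs) represent and process spatiotemporal visual semantics. It finds that visual semantics are highly concentrated in specific object tokens and that the middle-to-late layers show functional localization for spatiotemporal semantics.

Details Motivation: Prior work focuses mainly on understanding static images, while how LVLMs represent and process spatiotemporal visual semantics remains poorly understood, which limits robustness and interpretability.

Contribution: A systematic circuit-based framework (CircuitProbe) comprising a visual auditing circuit, a semantic tracing circuit, and an attention flow circuit for dissecting spatiotemporal semantics in LVLMs. It reveals the localized nature of visual semantics and their progressive refinement in the middle-to-late layers.

Method: Analyze the internal mechanisms of LVLMs through three circuits (visual auditing, semantic tracing, and attention flow), focusing on the localized influence of object tokens and the functional localization of intermediate layers.

Result: Removing specific object tokens degrades model performance by up to 92.6%; the middle-to-late layers show functional localization and progressive refinement for spatiotemporal semantics.

Insight: Spatiotemporal visual semantics in LVLMs are concentrated in a few tokens, and the middle-to-late layers progressively specialize in complex semantics, which informs the design of more robust and interpretable models.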

Abstract: The processing mechanisms underlying language and image understanding in large vision-language models (LVLMs) have been extensively studied. However, the internal reasoning mechanisms of LVLMs for spatiotemporal understanding remain poorly understood. In this work, we introduce a systematic, circuit-based framework designed to investigate how spatiotemporal visual semantics are represented and processed within these LVLMs. Specifically, our framework comprises three circuits: visual auditing circuit, semantic tracing circuit, and attention flow circuit. Through the lens of these circuits, we discover that visual semantics are highly localized to specific object tokens: removing these tokens can degrade model performance by up to 92.6%. Furthermore, we identify that interpretable concepts of objects and actions emerge and become progressively refined in the middle-to-late layers of LVLMs. In contrast to existing works that focus solely on objects in one image, we reveal that the middle-to-late layers of LVLMs exhibit specialized functional localization for spatiotemporal semantics. Our findings offer significant mechanistic insights into spatiotemporal semantics analysis of LVLMs, laying a foundation for designing more robust and interpretable models.

[77] GS-Occ3D: Scaling Vision-only Occupancy Reconstruction for Autonomous Driving with Gaussian Splatting

Baijun Ye,Minghui Qin,Saining Zhang,Moonjun Gong,Shaoting Zhu,Zebang Shen,Luan Zhang,Lu Zhang,Hao Zhao,Hang Zhao

Main category: cs.CV

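TL;DR: GS-Occ3D addresses vision-only occupancy reconstruction for autonomous driving with a scalable Gaussian surfel framework that decomposes the scene and optimizes an explicit occupancy representation, overcoming limitations of existing methods.

Details Motivation: Existing occupancy methods for autonomous driving rely mainly on LiDAR-based annotations, which limits scalability and prevents the use of large amounts of crowdsourced data for auto-labeling.

Contribution: The GS-Occ3D framework, a scalable vision-only occupancy reconstruction approach that uses Gaussian surfels and scene decomposition to improve efficiency and geometric completeness.

Method: Represents occupancy with an Octree-based Gaussian Surfel formulation and decomposes scenes into static background, ground, and dynamic objects, each with a tailored modeling strategy.

Result: State-of-the-art geometry reconstruction on the Waymo dataset, with strong zero-shot generalization on Occ3D-nuScenes.

Insight: Vision-only occupancy reconstruction has potential as a new paradigm for autonomous driving perception, especially when scaled with crowdsourced data.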

Abstract: Occupancy is crucial for autonomous driving, providing essential geometric priors for perception and planning. However, existing methods predominantly rely on LiDAR-based occupancy annotations, which limits scalability and prevents leveraging vast amounts of potential crowdsourced data for auto-labeling. To address this, we propose GS-Occ3D, a scalable vision-only framework that directly reconstructs occupancy. Vision-only occupancy reconstruction poses significant challenges due to sparse viewpoints, dynamic scene elements, severe occlusions, and long-horizon motion. Existing vision-based methods primarily rely on mesh representations, which suffer from incomplete geometry and require additional post-processing, limiting scalability. To overcome these issues, GS-Occ3D optimizes an explicit occupancy representation using an Octree-based Gaussian Surfel formulation, ensuring efficiency and scalability. Additionally, we decompose scenes into static background, ground, and dynamic objects, enabling tailored modeling strategies: (1) Ground is explicitly reconstructed as a dominant structural element, significantly improving large-area consistency; (2) Dynamic vehicles are separately modeled to better capture motion-related occupancy patterns. Extensive experiments on the Waymo dataset demonstrate that GS-Occ3D achieves state-of-the-art geometry reconstruction results. By curating vision-only binary occupancy labels from diverse urban scenes, we show their effectiveness for downstream occupancy models on Occ3D-Waymo and superior zero-shot generalization on Occ3D-nuScenes. This highlights the potential of large-scale vision-based occupancy reconstruction as a new paradigm for autonomous driving perception. Project Page: https://gs-occ3d.github.io/

[78] Fast Learning of Non-Cooperative Spacecraft 3D Models through Primitive Initialization

Pol Francesch Huc,Emily Bates,Simone D’Amico

Main category: cs.CV

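TL;DR: The paper proposes a CNN-based initializer for 3D Gaussian Splatting (3DGS) that produces a coarse 3D model and a pose estimate from a single image, substantially reducing the training iterations and input images required.

Details Motivation: Novel view synthesis methods such as NeRF and 3DGS require poses during training and have high computational cost, which prevents their use in space applications.

Contribution: 1) A CNN-based primitive initializer for 3DGS; 2) a pipeline that can train with noisy or implicit pose estimates; 3) an analysis of initialization variants that reduce the training cost of precise 3D models.

Method: From a single input image, a CNN outputs a coarse 3D model, represented as an assembly of primitives, along with the target's pose relative to the camera; the assembly is used to initialize 3DGS.

Result: The pipeline learns high-fidelity 3D models across pose estimation variants, and the training iterations and input images needed drop by at least an order of magnitude.

Insight: With a suitable initialization strategy, novel view synthesis can be used in space applications even when poses are imprecise.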

Abstract: The advent of novel view synthesis techniques such as NeRF and 3D Gaussian Splatting (3DGS) has enabled learning precise 3D models only from posed monocular images. Although these methods are attractive, they have two major limitations that prevent their use in space applications: they require poses during training, and have high computational cost at training and inference. To address these limitations, this work contributes: (1) a Convolutional Neural Network (CNN) based primitive initializer for 3DGS using monocular images; (2) a pipeline capable of training with noisy or implicit pose estimates; and (3) an analysis of initialization variants that reduce the training cost of precise 3D models. A CNN takes a single image as input and outputs a coarse 3D model represented as an assembly of primitives, along with the target’s pose relative to the camera. This assembly of primitives is then used to initialize 3DGS, significantly reducing the number of training iterations and input images needed, by at least an order of magnitude. For additional flexibility, the CNN component has multiple variants with different pose estimation techniques. This work performs a comparison between these variants, evaluating their effectiveness for downstream 3DGS training under noisy or implicit pose estimates. The results demonstrate that even with imperfect pose supervision, the pipeline is able to learn high-fidelity 3D representations, opening the door for the use of novel view synthesis in space applications.

[79] Back to the Features: DINO as a Foundation for Video World Models

Federico Baldassarre,Marc Szafraniec,Basile Terver,Vasil Khalidov,Francisco Massa,Yann LeCun,Patrick Labatut,Maximilian Seitzer,Piotr Bojanowski

Main category: cs.CV

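TL;DR: DINO-world uses the pre-trained DINOv2 image encoder and trains a future predictor on large-scale uncurated video, performing strongly on video prediction tasks and showing an understanding of intuitive physics.

Details Motivation: Existing video world models are limited in generalization and predictive ability, while the DINOv2 image feature space provides strong representations suited to building a generalist video world model.

Contribution: 1. DINO-world, a video prediction model operating in the DINOv2 feature space; 2. better results than previous models on several video prediction benchmarks; 3. action-conditioned fine-tuning that enables planning.

Method: 1. Extract frame features with the pre-trained DINOv2 image encoder; 2. train a future predictor on a large-scale uncurated video dataset; 3. fine-tune on observation-action trajectories for action conditioning and trajectory planning.

Result: DINO-world outperforms previous models on benchmarks such as segmentation and depth forecasting and shows a strong understanding of intuitive physics. After fine-tuning, it can simulate candidate trajectories in latent space for planning.

Insight: The DINOv2 feature space is a strong foundation for video world models, and its use in other downstream tasks merits further study.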

Abstract: We present DINO-world, a powerful generalist video world model trained to predict future frames in the latent space of DINOv2. By leveraging a pre-trained image encoder and training a future predictor on a large-scale uncurated video dataset, DINO-world learns the temporal dynamics of diverse scenes, from driving and indoor scenes to simulated environments. We show that DINO-world outperforms previous models on a variety of video prediction benchmarks, e.g. segmentation and depth forecasting, and demonstrates strong understanding of intuitive physics. Furthermore, we show that it is possible to fine-tune the predictor on observation-action trajectories. The resulting action-conditioned world model can be used for planning by simulating candidate trajectories in latent space.

[80] DINO-SLAM: DINO-informed RGB-D SLAM for Neural Implicit and Explicit Representations

Ziren Gong,Xiaohan Li,Fabio Tosi,Youmin Zhang,Stefano Mattoccia,Jun Wu,Matteo Poggi

Main category: cs.CV

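TL;DR: This paper presents DINO-SLAM, a DINO-informed design strategy that enhances neural implicit (NeRF) and explicit (3DGS) representations in SLAM systems through more comprehensive scene representations. A Scene Structure Encoder (SSE) produces Enhanced DINO (EDINO) features, and two SLAM paradigms that integrate EDINO features are proposed, with strong results on several datasets.

Details Motivation: Current SLAM systems often fail to capture the hierarchical structure and structural relationships of a scene, which limits their performance in complex environments. This paper introduces DINO features to address the problem and improve SLAM systems built on neural implicit and explicit representations.

Contribution: 1. DINO-SLAM, a DINO-informed design strategy that enhances SLAM systems based on neural implicit (NeRF) and explicit (3DGS) representations.
2. A Scene Structure Encoder (SSE) that enriches DINO features into EDINO features to capture hierarchical scene elements and their structural relationships.
3. Two SLAM paradigms that integrate EDINO features, one for the neural implicit framework and one for the explicit framework.

Method: 1. Use the Scene Structure Encoder (SSE) to convert DINO features into Enhanced DINO (EDINO) features that better represent scene hierarchy and relationships.
2. Integrate EDINO features into two SLAM systems, one based on NeRF and one based on 3DGS.
3. Validate experimentally on the Replica, ScanNet, and TUM datasets.

Result: Experiments show that DINO-SLAM outperforms state-of-the-art methods on the Replica, ScanNet, and TUM datasets.

Insight: 1. Introducing DINO features improves the scene representation capability of SLAM systems.
2. EDINO features capture scene hierarchy and structural relationships well, suggesting a new direction for improving SLAM systems based on neural implicit and explicit representations.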

Abstract: This paper presents DINO-SLAM, a DINO-informed design strategy to enhance neural implicit (Neural Radiance Fields, NeRF) and explicit (3D Gaussian Splatting, 3DGS) representations in SLAM systems through more comprehensive scene representations. To this end, we rely on a Scene Structure Encoder (SSE) that enriches DINO features into Enhanced DINO ones (EDINO) to capture hierarchical scene elements and their structural relationships. Building upon it, we propose two foundational paradigms for NeRF and 3DGS SLAM systems integrating EDINO features. Our DINO-informed pipelines achieve superior performance on the Replica, ScanNet, and TUM datasets compared to state-of-the-art methods.

eess.IV [Back]

[81] XAI-Guided Analysis of Residual Networks for Interpretable Pneumonia Detection in Paediatric Chest X-rays

Rayyan Ridwan

Main category: eess.IV

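TL;DR: The paper proposes an interpretable ResNet-based deep learning model for automatic diagnosis of paediatric pneumonia from chest X-rays, using BayesGrad-CAM to improve interpretability.

Details Motivation: Pneumonia is one of the leading causes of death among children worldwide, so fast and accurate diagnostic tools are urgently needed. Interpretable deep learning models can support clinical decision-making.

Contribution: 1. A ResNet model combined with BayesGrad-CAM, which quantifies uncertainty in the explanations. 2. High classification accuracy together with clinically meaningful interpretability.

Method: Uses a ResNet-50 architecture with BayesGrad-CAM to generate uncertainty-aware visual explanations that localize the regions driving the model's decisions.

Result: On a paediatric chest X-ray dataset, the model achieves 95.94% accuracy, 98.91% AUC-ROC, and a Cohen's Kappa of 0.913.

Insight: High performance and interpretability are both required for clinical AI deployment, and BayesGrad-CAM offers a way to achieve both.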

Abstract: Pneumonia remains one of the leading causes of death among children worldwide, underscoring a critical need for fast and accurate diagnostic tools. In this paper, we propose an interpretable deep learning model based on Residual Networks (ResNets) for automatically diagnosing paediatric pneumonia on chest X-rays. We enhance interpretability through Bayesian Gradient-weighted Class Activation Mapping (BayesGrad-CAM), which quantifies uncertainty in visual explanations and highlights the spatial locations responsible for the model's decisions. Our ResNet-50 model, trained on a large paediatric chest X-ray dataset, achieves high classification accuracy (95.94%), AUC-ROC (98.91%), and Cohen’s Kappa (0.913), accompanied by clinically meaningful visual explanations. Our findings demonstrate that high performance and interpretability are not only achievable but critical for clinical AI deployment.

[82] Learned Single-Pixel Fluorescence Microscopy

Serban C. Tudosie,Valerio Gandolfi,Shivaprasad Varakkoth,Andrea Farina,Cosimo D’Andrea,Simon Arridge

Main category: eess.IV

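TL;DR: The paper proposes a self-supervised approach to single-pixel fluorescence microscopy that trains an autoencoder to optimize both measurement and reconstruction, improving image quality and speed.

Details Motivation: Conventional single-pixel fluorescence microscopy relies on linear compressed measurements and total variation minimisation for reconstruction, which is slow and limited in quality. Data-driven learning can improve this process.

Contribution: An autoencoder trained through self-supervision that learns the measurement matrix (encoder) and the reconstruction method (decoder), improving acquisition speed and reconstruction quality.

Method: Train an autoencoder through self-supervision to learn the encoder and decoder, then test it on physically acquired multispectral and intensity data.

Result: Reconstruction time is reduced by two orders of magnitude, image quality improves, and multispectral reconstruction becomes possible.

Insight: Learned single-pixel fluorescence microscopy can provide multispectral imaging at low cost and may advance diagnosis and biological research.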

Abstract: Single-pixel imaging has emerged as a key technique in fluorescence microscopy, where fast acquisition and reconstruction are crucial. In this context, images are reconstructed from linearly compressed measurements. In practice, total variation minimisation is still used to reconstruct the image from noisy measurements of the inner product between orthogonal sampling pattern vectors and the original image data. However, data can be leveraged to learn the measurement vectors and the reconstruction process, thereby enhancing compression, reconstruction quality, and speed. We train an autoencoder through self-supervision to learn an encoder (or measurement matrix) and a decoder. We then test it on physically acquired multispectral and intensity data. During acquisition, the learned encoder becomes part of the physical device. Our approach can enhance single-pixel imaging in fluorescence microscopy by reducing reconstruction time by two orders of magnitude, achieving superior image quality, and enabling multispectral reconstructions. Ultimately, learned single-pixel fluorescence microscopy could advance diagnosis and biological research, providing multispectral imaging at a fraction of the cost.
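The measurement model in the abstract, where each single-pixel reading is the inner product of a sampling pattern with the image, can be written in a few lines. This is a minimal sketch of the noise-free forward model only; the pattern values below are illustrative and are not the learned encoder from the paper.

```python
def measure(patterns, image):
    """Simulate single-pixel measurements y_i = <a_i, x>.

    patterns: list of sampling patterns a_i, each flattened to len(image)
    image:    flattened image x
    Returns one scalar reading per pattern.
    """
    return [sum(a * x for a, x in zip(pattern, image)) for pattern in patterns]


# A 4-pixel image measured with 2 patterns: 2x compression.
patterns = [[1, 1, 1, 1], [1, -1, 1, -1]]
image = [1, 2, 3, 4]
readings = measure(patterns, image)
```

In the learned setting, the rows of `patterns` would be the weights of the trained encoder, and a decoder network would map `readings` back to an image in place of total variation minimisation.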

[83] RealDeal: Enhancing Realism and Details in Brain Image Generation via Image-to-Image Diffusion Models

Shen Zhu,Yinzhu Jin,Tyler Spears,Ifrah Zawar,P. Thomas Fletcher

Main category: eess.IV

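TL;DR: RealDeal is an image-to-image diffusion approach that improves the realism and detail of generated brain images by adding sharp edges, fine textures, and subtle anatomical features that existing generative models lack.

Details Motivation: Latent diffusion models (LDMs) perform well at generating brain MRIs, but latent compression makes the generated images overly smooth, without the fine anatomical structures and scan acquisition noise seen in real images. A method is needed to restore realism and detail.

Contribution: RealDeal, an image-to-image diffusion model designed to enhance the realism and detail of generated brain images. Besides standard quality metrics such as FID and LPIPS, the paper introduces new metrics for the realism of noise distribution, sharpness, and texture.

Method: An image-to-image diffusion model refines LDM-generated images by introducing sharp edges, fine textures, and imaging noise.

Result: Experiments show that RealDeal improves the realism and detail of generated brain images, and the new metrics confirm improvements in noise distribution and texture.

Insight: Enhancing the detail and realism of generated images with an image-to-image diffusion model is feasible, and the new metrics give a more complete assessment of generative model quality.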

Abstract: We propose image-to-image diffusion models that are designed to enhance the realism and details of generated brain images by introducing sharp edges, fine textures, subtle anatomical features, and imaging noise. Generative models have been widely adopted in the biomedical domain, especially in image generation applications. Latent diffusion models (LDMs) achieve state-of-the-art results in generating brain MRIs. However, due to latent compression, generated images from these models are overly smooth, lacking the fine anatomical structures and scan acquisition noise that are typically seen in real images. This work formulates the realism-enhancing and detail-adding process as image-to-image diffusion models, which refine the quality of LDM-generated images. We employ commonly used metrics like FID and LPIPS for image realism assessment. Furthermore, we introduce new metrics to demonstrate the realism of images generated by RealDeal in terms of image noise distribution, sharpness, and texture.

[84] A Self-training Framework for Semi-supervised Pulmonary Vessel Segmentation and Its Application in COPD

Shuiqing Zhao,Meihuan Wang,Jiaxuan Xu,Jie Feng,Wei Qian,Rongchang Chen,Zhenyu Liang,Shouliang Qi,Yanan Wu

Main category: eess.IV

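TL;DR: The paper proposes Semi2, a self-training framework for semi-supervised pulmonary vessel segmentation that uses a teacher-student model to improve segmentation. It is validated on non-enhanced CT scans of COPD patients, where precision reaches 90.3%.

Details Motivation: Accurate segmentation and quantification of pulmonary vessels is essential for analyzing chronic obstructive pulmonary disease (COPD), but small vessels are especially hard to segment. The study uses a semi-supervised method to cope with limited annotated data.

Contribution: 1. Proposes Semi2, a self-training framework based on a teacher-student model for semi-supervised pulmonary vessel segmentation. 2. Improves model performance when high-quality annotations are scarce, through pseudo-label generation and selection. 3. Validates the method on non-enhanced CT scans of 125 COPD patients, raising precision to 90.3%.

Method: 1. High-quality annotations are acquired interactively. 2. A fully supervised model (the teacher) is trained on a small labeled set. 3. The teacher generates pseudo-labels for unlabeled data, and reliable ones are selected by a given strategy. 4. The student model is trained with the reliable pseudo-labels, and the process is repeated until performance is optimal.

Result: On non-enhanced CT scans of 125 COPD patients, Semi2 raises vessel segmentation precision to 90.3%, an improvement of 2.3% over the baseline. The method also supports quantitative analysis of vascular changes in COPD.

Insight: 1. Self-training is effective for small-vessel segmentation in the semi-supervised setting. 2. Reliable pseudo-label generation and selection are critical to model performance. 3. The method is not limited to pulmonary vessels and could extend to other medical image analysis tasks.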

Abstract: Background: Accurate segmentation and quantification of the pulmonary vessels, particularly smaller vessels, from computed tomography (CT) images is fundamental in chronic obstructive pulmonary disease (COPD) patients. Objective: The aim of this study was to segment the pulmonary vasculature using a semi-supervised method. Methods: In this study, a self-training framework is proposed by leveraging a teacher-student model for the segmentation of pulmonary vessels. First, high-quality annotations are acquired interactively on the in-house data. Then, the model is trained in a semi-supervised way. A fully supervised model is trained on a small set of labeled CT images, yielding the teacher model. Following this, the teacher model is used to generate pseudo-labels for the unlabeled CT images, from which reliable ones are selected based on a certain strategy. The training of the student model involves these reliable pseudo-labels. This training process is iteratively repeated until an optimal performance is achieved. Results: Extensive experiments are performed on non-enhanced CT scans of 125 COPD patients. Quantitative and qualitative analyses demonstrate that the proposed method, Semi2, significantly improves the precision of vessel segmentation by 2.3%, achieving a precision of 90.3%. Further, quantitative analysis is conducted on the pulmonary vessels in COPD, providing insights into how the pulmonary vessels differ across disease severity. Conclusion: The proposed method can not only improve the performance of pulmonary vascular segmentation, but can also be applied in COPD analysis. The code will be made available at https://github.com/wuyanan513/semi-supervised-learning-for-vessel-segmentation.
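
The pseudo-label selection step of the self-training loop above can be sketched as follows. The abstract does not specify the selection strategy, so the mean-confidence threshold used here is an assumption, as are the names `select_reliable` and `min_confidence`; scans are represented by toy lists rather than CT volumes.

```python
from typing import Callable, List, Sequence, Tuple

Scan = Sequence[float]   # toy stand-in for a CT volume
ProbMap = List[float]    # per-voxel foreground probabilities from the teacher


def select_reliable(
    teacher: Callable[[Scan], ProbMap],
    unlabeled: Sequence[Scan],
    min_confidence: float = 0.9,  # hypothetical threshold, not from the paper
) -> List[Tuple[Scan, List[int]]]:
    """Return (scan, pseudo_label) pairs whose mean voxel confidence is high enough."""
    selected = []
    for scan in unlabeled:
        probs = teacher(scan)
        # Voxel confidence: distance of the probability from 0.5, rescaled to [0, 1].
        confidence = sum(abs(p - 0.5) * 2 for p in probs) / len(probs)
        if confidence >= min_confidence:
            selected.append((scan, [int(p >= 0.5) for p in probs]))
    return selected


# Toy teacher that simply echoes its input as probabilities.
toy_teacher = lambda scan: list(scan)
reliable = select_reliable(toy_teacher, [[0.99, 0.01, 0.98], [0.6, 0.4, 0.5]])
```

The student would then be trained on the labeled set plus `reliable`, and could replace the teacher in the next round.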

[85] Reconstruct or Generate: Exploring the Spectrum of Generative Modeling for Cardiac MRI

Niklas Bubeck,Yundi Zhang,Suprosanna Shit,Daniel Rueckert,Jiazhen Pan

Main category: eess.IV

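TL;DR: The paper systematically analyzes how generative models differ in performance on cardiac MRI reconstruction and generation tasks, focusing on latent diffusion models and autoregressive models under varying masking ratios.

Details Motivation: In medical imaging, generative models are used both for reconstruction (such as inpainting or super-resolution) and for generation (such as data augmentation or counterfactual analysis), but the two tasks optimize for different goals. The authors study experimentally how these models behave along the reconstruction-generation spectrum.

Contribution: Introduces a generative model zoo and systematically evaluates the reconstruction and generation performance of latent diffusion models and autoregressive models under varying masking ratios.

Method: Using latent diffusion models and autoregressive models on cardiac MRI, the authors set up image inpainting tasks (with varying masking ratios) and unconditional generation tasks and compare performance.

Result: Diffusion models perform better on unconditional generation but tend to hallucinate as the masking ratio increases; autoregressive models remain stable across masking ratios but have lower fidelity.

Insight: The paper exposes the trade-off between diffusion and autoregressive models for medical image generation and reconstruction, offering guidance for model selection in practice.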

Abstract: In medical imaging, generative models are increasingly relied upon for two distinct but equally critical tasks: reconstruction, where the goal is to restore medical imaging (usually inverse problems like inpainting or superresolution), and generation, where synthetic data is created to augment datasets or carry out counterfactual analysis. Despite shared architecture and learning frameworks, they prioritize different goals: generation seeks high perceptual quality and diversity, while reconstruction focuses on data fidelity and faithfulness. In this work, we introduce a “generative model zoo” and systematically analyze how modern latent diffusion models and autoregressive models navigate the reconstruction-generation spectrum. We benchmark a suite of generative models across representative cardiac medical imaging tasks, focusing on image inpainting with varying masking ratios and sampling strategies, as well as unconditional image generation. Our findings show that diffusion models offer superior perceptual quality for unconditional generation but tend to hallucinate as masking ratios increase, whereas autoregressive models maintain stable perceptual performance across masking levels, albeit with generally lower fidelity.
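
To make "inpainting with varying masking ratios" concrete, here is a minimal sketch of drawing a random patch mask at a given ratio. Uniform random patch masking is an assumption; the paper's actual masking and sampling strategies are not described in the abstract, and `random_patch_mask` is a hypothetical helper.

```python
import random
from typing import List, Optional


def random_patch_mask(
    num_patches: int, mask_ratio: float, rng: Optional[random.Random] = None
) -> List[bool]:
    """Return a per-patch mask; True marks a patch that the model must inpaint."""
    if rng is None:
        rng = random.Random()
    num_masked = round(num_patches * mask_ratio)
    masked = set(rng.sample(range(num_patches), num_masked))
    return [i in masked for i in range(num_patches)]


# An 8x8 patch grid with 75% of the patches hidden.
mask = random_patch_mask(64, 0.75, rng=random.Random(0))
```

A ratio of 0 leaves the image intact (pure reconstruction of visible content), while a ratio of 1 hides everything, which approaches unconditional generation.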

[86] RealisVSR: Detail-enhanced Diffusion for Real-World 4K Video Super-Resolution

Weisong Zhao,Jingkai Zhou,Xiangyu Zhu,Weihua Chen,Xiao-Yu Zhang,Zhen Lei,Fan Wang

Main category: eess.IV

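TL;DR: RealisVSR is a diffusion-based video super-resolution method with enhanced high-frequency detail. Through the Consistency Preserved ControlNet (CPC) architecture, the High-Frequency Rectified Diffusion Loss (HR-Loss), and RealisVideo-4K, the first 4K video super-resolution dataset, it addresses the shortcomings of existing methods in temporal consistency, high-frequency detail recovery, and super-resolution evaluation.

Details Motivation: Current video super-resolution (VSR) methods fall short in modeling temporal dynamics, recovering high-frequency detail, and evaluating 4K super-resolution. GAN-based methods in particular tend to over-smooth, and existing datasets are mostly 720P, which cannot adequately assess 4K quality.

Contribution: 1) The CPC architecture improves temporal consistency; 2) HR-Loss improves high-frequency detail recovery; 3) RealisVideo-4K is proposed as the first 4K VSR dataset.

Method: Built on the Wan2.1 video diffusion model, the method uses the CPC architecture to model complex motion and suppress artifacts, and restores texture with HR-Loss, which combines wavelet decomposition and HOG feature constraints.

Result: The method performs strongly on multiple benchmarks (REDS, SPMCS, and others), with a clear advantage in 4K super-resolution scenarios.

Insight: Combining diffusion models with classical tools such as wavelet decomposition recovers high-frequency detail more effectively while reducing the training data needed (only 5-25% of what existing methods require).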

Abstract: Video Super-Resolution (VSR) has achieved significant progress through diffusion models, effectively addressing the over-smoothing issues inherent in GAN-based methods. Despite recent advances, three critical challenges persist in the VSR community: 1) inconsistent modeling of temporal dynamics in foundational models; 2) limited high-frequency detail recovery under complex real-world degradations; and 3) insufficient evaluation of detail enhancement and 4K super-resolution, as current methods primarily rely on 720P datasets with inadequate details. To address these challenges, we propose RealisVSR, a high-frequency detail-enhanced video diffusion model with three core innovations: 1) a Consistency Preserved ControlNet (CPC) architecture integrated with the Wan2.1 video diffusion model to model smooth and complex motions and suppress artifacts; 2) a High-Frequency Rectified Diffusion Loss (HR-Loss) combining wavelet decomposition and HOG feature constraints for texture restoration; 3) RealisVideo-4K, the first public 4K VSR benchmark containing 1,000 high-definition video-text pairs. Leveraging the advanced spatio-temporal guidance of Wan2.1, our method requires only 5-25% of the training data volume compared to existing approaches. Extensive experiments on VSR benchmarks (REDS, SPMCS, UDM10, YouTube-HQ, VideoLQ, RealisVideo-720P) demonstrate our superiority, particularly in ultra-high-resolution scenarios.
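
The idea of penalizing errors in wavelet high-frequency bands can be sketched in one dimension. This is only an illustration under stated assumptions: a single-level Haar transform on a 1D signal with an L1 penalty, without the HOG term and without the diffusion objective; it is not the paper's HR-Loss, and the function names are hypothetical.

```python
import math
from typing import List, Sequence


def haar_detail(signal: Sequence[float]) -> List[float]:
    """Single-level Haar detail (high-frequency) coefficients of an even-length signal."""
    return [
        (signal[i] - signal[i + 1]) / math.sqrt(2)
        for i in range(0, len(signal) - 1, 2)
    ]


def high_frequency_l1(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean absolute difference between the Haar detail coefficients of two signals."""
    d_pred, d_target = haar_detail(pred), haar_detail(target)
    return sum(abs(a - b) for a, b in zip(d_pred, d_target)) / len(d_target)


# An over-smoothed prediction against a target with strong local contrast.
loss = high_frequency_l1([1.0, 1.0, 1.0, 1.0], [0.0, 2.0, 0.0, 2.0])
```

The smooth prediction has the same local mean as the target, so a plain low-frequency comparison would miss the discrepancy that the detail coefficients expose.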

[87] Enhancing Diabetic Retinopathy Classification Accuracy through Dual Attention Mechanism in Deep Learning

Abdul Hannan,Zahid Mahmood,Rizwan Qureshi,Hazrat Ali

Main category: eess.IV

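TL;DR: The paper proposes a dual-attention deep learning method that combines a global attention block (GAB) and a category attention block (CAB) to address dataset imbalance in diabetic retinopathy (DR) classification, achieving competitive results on public datasets.

Details Motivation: Automatic classification of diabetic retinopathy (DR) is important for clinical treatment, but imbalanced data distribution limits the generalization of deep learning models. The authors aim to improve model performance with a dual attention mechanism.

Contribution: Proposes a dual attention mechanism combining GAB and CAB that addresses data imbalance in DR classification, validated on several pre-trained networks.

Method: MobileNetV3-small, EfficientNet-b0, and DenseNet-169 serve as backbones, combined with the GAB and CAB dual attention mechanism, with experiments on two public datasets, APTOS and EYEPACS.

Result: On APTOS, DenseNet-169 reaches 83.2% accuracy, with MobileNetV3-small and EfficientNet-b0 at 82% and 80%. On EYEPACS, EfficientNet-b0 performs best (80%). Other reported metrics include an F1-score of 82.0% and a precision of 82.1%.

Insight: The dual attention mechanism shows promise for handling data imbalance, and lightweight models such as MobileNetV3-small keep performance while using fewer parameters, which suits practical use.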

Abstract: Automatic classification of Diabetic Retinopathy (DR) can assist ophthalmologists in devising personalized treatment plans, making it a critical component of clinical practice. However, imbalanced data distribution in the dataset becomes a bottleneck in the generalization of deep learning models trained for DR classification. In this work, we combine global attention block (GAB) and category attention block (CAB) into the deep learning model, thus effectively overcoming the imbalanced data distribution problem in DR classification. Our proposed approach is based on an attention mechanism-based deep learning model that employs three pre-trained networks, namely, MobileNetV3-small, Efficientnet-b0, and DenseNet-169 as the backbone architecture. We evaluate the proposed method on two publicly available datasets of retinal fundoscopy images for DR. Experimental results show that on the APTOS dataset, the DenseNet-169 yielded 83.20% mean accuracy, followed by the MobileNetV3-small and EfficientNet-b0, which yielded 82% and 80% accuracies, respectively. On the EYEPACS dataset, the EfficientNet-b0 yielded a mean accuracy of 80%, while the DenseNet-169 and MobileNetV3-small yielded 75.43% and 76.68% accuracies, respectively. In addition, we also compute the F1-score of 82.0%, precision of 82.1%, sensitivity of 83.0%, specificity of 95.5%, and a kappa score of 88.2% for the experiments. Moreover, in our work, the MobileNetV3-small has 1.6 million parameters on the APTOS dataset and 0.90 million parameters on the EYEPACS dataset, which is comparatively less than other methods. The proposed approach achieves competitive performance that is at par with recently reported works on DR classification.

[88] SAM2-Aug: Prior knowledge-based Augmentation for Target Volume Auto-Segmentation in Adaptive Radiation Therapy Using Segment Anything Model 2

Guoping Xu,Yan Dai,Hengrui Zhao,Ying Zhang,Jie Deng,Weiguo Lu,You Zhang

Main category: eess.IV

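TL;DR: The paper proposes SAM2-Aug, a prior knowledge-based augmentation method that improves the tumor segmentation accuracy of Segment Anything Model 2 (SAM2) in adaptive radiation therapy (ART). By using prior MR images and increasing prompt diversity, SAM2-Aug performs strongly across multiple datasets.

Details Motivation: Accurate tumor segmentation is essential in adaptive radiation therapy, but current methods are time-consuming and user-dependent. SAM2 shows promise for prompt-driven segmentation, but its tumor segmentation accuracy needs improvement.

Contribution: 1. Proposes two prior knowledge-based augmentation strategies: using prior MR images and annotations as contextual input, and increasing prompt diversity through random bounding box expansion and mask erosion/dilation.
2. Proposes the SAM2-Aug model, which shows strong segmentation performance and generalization across multiple datasets.

Method: 1. Prior MR images and annotations are used as additional input to provide context.
2. Prompt diversity is increased through random bounding box expansion and mask erosion/dilation.
3. The model is fine-tuned on the One-Seq-Liver dataset and evaluated without retraining on the Mix-Seq-Abdomen and Mix-Seq-Brain datasets.

Result: SAM2-Aug outperforms other models (convolutional and Transformer-based, among others) on all datasets, with Dice scores of 0.86 (liver), 0.89 (abdomen), and 0.90 (brain). It also does well on boundary-sensitive metrics.

Insight: Incorporating prior knowledge and increasing prompt diversity markedly improves segmentation accuracy and generalization. SAM2-Aug offers an efficient and reliable solution for tumor segmentation in adaptive radiation therapy.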

Abstract: Purpose: Accurate tumor segmentation is vital for adaptive radiation therapy (ART) but remains time-consuming and user-dependent. Segment Anything Model 2 (SAM2) shows promise for prompt-based segmentation but struggles with tumor accuracy. We propose prior knowledge-based augmentation strategies to enhance SAM2 for ART. Methods: Two strategies were introduced to improve SAM2: (1) using prior MR images and annotations as contextual inputs, and (2) improving prompt robustness via random bounding box expansion and mask erosion/dilation. The resulting model, SAM2-Aug, was fine-tuned and tested on the One-Seq-Liver dataset (115 MRIs from 31 liver cancer patients), and evaluated without retraining on Mix-Seq-Abdomen (88 MRIs, 28 patients) and Mix-Seq-Brain (86 MRIs, 37 patients). Results: SAM2-Aug outperformed convolutional, transformer-based, and prompt-driven models across all datasets, achieving Dice scores of 0.86(liver), 0.89(abdomen), and 0.90(brain). It demonstrated strong generalization across tumor types and imaging sequences, with improved performance in boundary-sensitive metrics. Conclusions: Incorporating prior images and enhancing prompt diversity significantly boosts segmentation accuracy and generalizability. SAM2-Aug offers a robust, efficient solution for tumor segmentation in ART. Code and models will be released at https://github.com/apple1986/SAM2-Aug.
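
The random bounding box expansion mentioned in the Methods can be sketched as below. The sampling scheme (an independent uniform outward shift per side, capped at `max_shift` pixels) is an assumption, since the abstract gives no parameters; `expand_box` is a hypothetical helper, and mask erosion/dilation is not shown.

```python
import random
from typing import Optional, Tuple

Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max), inclusive pixel coordinates


def expand_box(
    box: Box,
    image_size: Tuple[int, int],       # (width, height)
    max_shift: int = 10,               # hypothetical upper bound on the shift per side
    rng: Optional[random.Random] = None,
) -> Box:
    """Push each side of the box outward by a random amount, clipped to the image."""
    if rng is None:
        rng = random.Random()
    width, height = image_size
    x0, y0, x1, y1 = box
    x0 = max(0, x0 - rng.randint(0, max_shift))
    y0 = max(0, y0 - rng.randint(0, max_shift))
    x1 = min(width - 1, x1 + rng.randint(0, max_shift))
    y1 = min(height - 1, y1 + rng.randint(0, max_shift))
    return (x0, y0, x1, y1)


loose_prompt = expand_box((20, 30, 40, 50), (64, 64), rng=random.Random(0))
```

Training with such loosened boxes is one way to make a prompt-driven model tolerate imprecise user prompts at inference time.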

cs.CR [Back]

[89] PurpCode: Reasoning for Safer Code Generation

Jiawei Liu,Nirav Diwan,Zhe Wang,Haoyu Zhai,Xiaona Zhou,Kiet A. Nguyen,Tianjiao Yu,Muntasir Wahed,Yinlin Deng,Hadjer Benkraouda,Yuxiang Wei,Lingming Zhang,Ismini Lourentzou,Gang Wang

Main category: cs.CR

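TL;DR: PurpCode is the first post-training recipe for generating secure code and defending against malicious cyberactivities. It trains the model in two stages, rule learning and reinforcement learning, and yields PurpCode-32B, which shows strong cybersafety and utility.

Details Motivation: Current code generation models can produce security vulnerabilities and can be exploited for malicious cyberactivities. PurpCode aims to train models that generate safer code while lowering the overrefusal rate.

Contribution: 1) Proposes PurpCode, the first cybersafety post-training recipe for code generation; 2) develops PurpCode-32B, which outperforms other frontier models on cybersafety; 3) balances safety and utility through multi-objective reward mechanisms.

Method: PurpCode trains in two stages: 1) rule learning, which explicitly teaches the model to reference cybersafety rules to generate vulnerability-free code; 2) reinforcement learning, which optimizes model safety and preserves utility through diverse, multi-objective reward mechanisms. Internal red-teaming is used to synthesize comprehensive cybersafety data.

Result: PurpCode-32B reaches state-of-the-art cybersafety, lowers overrefusal rates in both general and cybersafety-specific scenarios, and preserves utility on code generation and common security knowledge tasks.

Insight: By combining rule learning and reinforcement learning, PurpCode shows how to balance safety and utility in code generation, providing a reference for training future safety-sensitive models.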

Abstract: We introduce PurpCode, the first post-training recipe for training safe code reasoning models towards generating secure code and defending against malicious cyberactivities. PurpCode trains a reasoning model in two stages: (i) Rule Learning, which explicitly teaches the model to reference cybersafety rules to generate vulnerability-free code and to avoid facilitating malicious cyberactivities; and (ii) Reinforcement Learning, which optimizes model safety and preserves model utility through diverse, multi-objective reward mechanisms. To empower the training pipelines with comprehensive cybersafety data, we conduct internal red-teaming to synthesize comprehensive and high-coverage prompts based on real-world tasks for inducing unsafe cyberactivities in the model. Based on PurpCode, we develop a reasoning-based coding model, namely PurpCode-32B, which demonstrates state-of-the-art cybersafety, outperforming various frontier models. Meanwhile, our alignment method decreases the model overrefusal rates in both general and cybersafety-specific scenarios, while preserving model utility in both code generation and common security knowledge.

cs.LG [Back]

[90] Advancing Event Forecasting through Massive Training of Large Language Models: Challenges, Solutions, and Broader Impacts

Sang-Woo Lee,Sohee Yang,Donghyun Kwak,Noah Y. Siegel

Main category: cs.LG

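TL;DR: The paper analyzes the potential of large language models (LLMs) for event forecasting, proposes key directions for training methods and data acquisition, and discusses the societal impact of technical breakthroughs.

Details Motivation: Recent studies show that advanced LLMs are gradually approaching superforecaster-level performance, and techniques such as reinforcement learning can improve forecasting further. The authors argue that the time is ripe for research on large-scale training of superforecaster-level LLMs.

Contribution: Identifies three core problems in training LLMs for event forecasting, proposes ways to mitigate them, and calls for diverse data to enable large-scale training and evaluation.

Method: For the noisiness-sparsity, knowledge cut-off, and simple reward structure problems, the paper proposes hypothetical event Bayesian networks, the use of poorly-recalled and counterfactual events, and auxiliary reward signals.

Result: With these technical advances, LLMs could provide predictive intelligence to society in broader areas.

Insight: Large-scale training and diverse data are key to improving event forecasting, and technical breakthroughs may drive wider adoption of AI forecasting.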

Abstract: Many recent papers have studied the development of superforecaster-level event forecasting LLMs. While methodological problems with early studies cast doubt on the use of LLMs for event forecasting, recent studies with improved evaluation methods have shown that state-of-the-art LLMs are gradually reaching superforecaster-level performance, and reinforcement learning has also been reported to improve future forecasting. Additionally, the unprecedented success of recent reasoning models and Deep Research-style models suggests that technology capable of greatly improving forecasting performance has been developed. Therefore, based on these positive recent trends, we argue that the time is ripe for research on large-scale training of superforecaster-level event forecasting LLMs. We discuss two key research directions: training methods and data acquisition. For training, we first introduce three difficulties of LLM-based event forecasting training: noisiness-sparsity, knowledge cut-off, and simple reward structure problems. Then, we present related ideas to mitigate these problems: hypothetical event Bayesian networks, utilizing poorly-recalled and counterfactual events, and auxiliary reward signals. For data, we propose aggressive use of market, public, and crawling datasets to enable large-scale training and evaluation. Finally, we explain how these technical advances could enable AI to provide predictive intelligence to society in broader areas. This position paper presents promising specific paths and considerations for getting closer to superforecaster-level AI technology, aiming to call for researchers’ interest in these directions.

cs.RO [Back]

[91] Towards Multimodal Social Conversations with Robots: Using Vision-Language Models

Ruben Janssens,Tony Belpaeme

Main category: cs.RO

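TL;DR: The paper discusses how vision-language models (VLMs) can give social robots multimodal conversational abilities and so improve their social interaction.

Details Motivation: Social robots need multimodal processing to understand and take part in open-domain social conversations. Current language models make no use of visual information, which limits robots' social abilities.

Contribution: Proposes vision-language models (VLMs) as the core of multimodal conversation for social robots, and discusses how to adapt them to this setting and which technical challenges remain.

Method: VLMs are adapted so that they can process a wide range of visual information and meet the needs of social conversation. The paper also briefly discusses evaluation practices.

Result: The paper reports no specific experimental results; it outlines technical challenges and future directions.

Insight: Vision-language models have broad potential in social robotics, but alignment and generalization over multimodal data still need to be solved.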

Abstract: Large language models have given social robots the ability to autonomously engage in open-domain conversations. However, they are still missing a fundamental social skill: making use of the multiple modalities that carry social interactions. While previous work has focused on task-oriented interactions that require referencing the environment or specific phenomena in social interactions such as dialogue breakdowns, we outline the overall needs of a multimodal system for social conversations with robots. We then argue that vision-language models are able to process this wide range of visual information in a sufficiently general manner for autonomous social robots. We describe how to adapt them to this setting, which technical challenges remain, and briefly discuss evaluation practices.

[92] Perpetua: Multi-Hypothesis Persistence Modeling for Semi-Static Environments

Miguel Saavedra-Ruiz,Samer B. Nashed,Charlie Gauthier,Liam Paull

Main category: cs.RO

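TL;DR: Perpetua is a multi-hypothesis persistence modeling method for predicting dynamic features in semi-static environments. It uses a Bayesian framework to predict future states efficiently and outperforms similar methods in experiments.

Details Motivation: When robots are deployed in complex, dynamic environments, conventional modeling methods cannot effectively predict the future state of dynamic features. Perpetua aims to solve this problem.

Contribution: Proposes a multi-hypothesis persistence modeling method that combines prior knowledge with a Bayesian framework to adapt online and predict the future state of dynamic features.

Method: Mixtures of persistence and emergence filters are chained together to model, within a Bayesian framework, the probability that a feature disappears or reappears.

Result: In experiments on simulated and real-world data, Perpetua is more accurate than similar approaches and is robust to missing observations.

Insight: Modeling dynamic features requires both prior knowledge and multi-hypothesis tracking, and a Bayesian framework handles the uncertainty efficiently.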

Abstract: Many robotic systems require extended deployments in complex, dynamic environments. In such deployments, parts of the environment may change between subsequent robot observations. Most robotic mapping or environment modeling algorithms are incapable of representing dynamic features in a way that enables predicting their future state. Instead, they opt to filter certain state observations, either by removing them or some form of weighted averaging. This paper introduces Perpetua, a method for modeling the dynamics of semi-static features. Perpetua is able to: incorporate prior knowledge about the dynamics of the feature if it exists, track multiple hypotheses, and adapt over time to enable predicting of future feature states. Specifically, we chain together mixtures of “persistence” and “emergence” filters to model the probability that features will disappear or reappear in a formal Bayesian framework. The approach is an efficient, scalable, general, and robust method for estimating the states of features in an environment, both in the present as well as at arbitrary future times. Through experiments on simulated and real-world data, we find that Perpetua yields better accuracy than similar approaches while also being online adaptable and robust to missing observations.
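
A single persistence filter, the building block named in the abstract above, can be sketched as a two-step Bayesian recursion. This is a deliberately simplified single-hypothesis version with an assumed constant disappearance rate and assumed detector probabilities; it is not Perpetua's chained mixture of persistence and emergence filters, and all names are hypothetical.

```python
import math


def predict_persistence(p_present: float, dt: float, rate: float) -> float:
    """Decay the belief that a feature still exists after dt time units.

    Assumes an exponential survival model with a constant disappearance rate.
    """
    return p_present * math.exp(-rate * dt)


def update_persistence(
    p_present: float,
    detected: bool,
    p_detect: float = 0.9,        # assumed P(detection | feature present)
    p_false_alarm: float = 0.05,  # assumed P(detection | feature absent)
) -> float:
    """Bayes update of the presence belief from one binary detector observation."""
    like_present = p_detect if detected else 1.0 - p_detect
    like_absent = p_false_alarm if detected else 1.0 - p_false_alarm
    evidence = like_present * p_present + like_absent * (1.0 - p_present)
    return like_present * p_present / evidence


belief = 0.5
after_hit = update_persistence(belief, detected=True)
after_miss = update_persistence(belief, detected=False)
later = predict_persistence(after_hit, dt=2.0, rate=0.1)
```

An emergence filter would play the mirror role, tracking the probability that an absent feature has reappeared.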

cs.GR [Back]

[93] Generating real-time detailed ground visualisations from sparse aerial point clouds

Aidan Murray,Eddie Waite,Caleb Ross,Scarlet Mitchell,Alexander Bradley,Joanna Jamrozy,Kenny Mitchell

Main category: cs.GR

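TL;DR: Proposes a method that automatically generates high-quality ground visualizations from sparse aerial point clouds for real-time 3D rendering, suited to training, simulation, and game applications.

Details Motivation: Conventional workflows rely on teams of artists to model and render by hand, which is costly and makes it hard to reproduce real-world scenes accurately.

Contribution: Proposes an automatic pipeline that generates high-quality ground visualizations from sparse point clouds, reducing cost and improving accuracy.

Method: An automated process amplifies the scanned data and renders high-quality animated 3D in real time.

Result: Produces realistic ground visualizations at high quality and in real time.

Insight: The method shows the potential of automated tools in 3D content generation, replacing part of the manual work and improving efficiency.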

Abstract: Building realistic wide scale outdoor 3D content with sufficient visual quality to observe at walking eye level or from driven vehicles is often carried out by large teams of artists skilled in modelling, texturing, material shading and lighting, which typically leads to both prohibitive costs and reduced accuracy honoring the variety of real world ground truth landscapes. In our proposed method, we define a process to automatically amplify real-world scanned data and render real-time in animated 3D to explore at close range with high quality for training, simulation, video game and visualisation applications.

cs.AI [Back]

[94] OS-MAP: How Far Can Computer-Using Agents Go in Breadth and Depth?

Xuetian Chen,Yinghao Chen,Xinfeng Yuan,Zhuo Peng,Lu Chen,Yuekeng Li,Zhoujia Zhang,Yingqian Huang,Leyan Huang,Jiaqing Liang,Tianbao Xie,Zhiyong Wu,Qiushi Sun,Biqing Qi,Bowen Zhou

Main category: cs.AI

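TL;DR: OS-MAP is a benchmark for everyday computer automation tasks. It evaluates agents using a five-level automation taxonomy and a hierarchy of real user demands, and reveals the weaknesses of current state-of-the-art agents on tasks involving perception, reasoning, and coordination.

Details Motivation: Existing benchmarks do not account for the internal heterogeneity of tasks or for how agent capabilities match actual demands, which hinders targeted capability development and practical deployment.

Contribution: Proposes the OS-MAP benchmark, with 416 tasks across 15 applications, which evaluates agents along two dimensions: automation level and generalization scope.

Method: A two-dimensional evaluation framework, automation level (a five-level taxonomy) and generalization scope (a demand hierarchy), forms a performance-generalization evaluation matrix.

Result: Experiments show that even state-of-the-art agents with VLM backbones struggle with higher-level tasks involving perception, reasoning, and coordination.

Insight: The study exposes the limitations of current computer-using agents and stresses the need for a deeper understanding of their capabilities to guide future research.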

Abstract: Computer-using agents have shown strong potential to boost human productivity and enable new application forms across platforms. While recent advances have led to usable applications, existing benchmarks fail to account for the internal task heterogeneity and the corresponding agent capabilities, as well as their alignment with actual user demands, which hinders both targeted capability development and the reliable transition of research progress into practical deployment. To bridge the gap, we present OS-MAP, a benchmark for daily computer-using automation that organizes its 416 realistic tasks across 15 applications along two key dimensions: a five-level taxonomy of automation and a generalization scope derived from a real-world user demand hierarchy. To enable fine-grained analysis of required capabilities and alignment with real-world scenarios, OS-MAP evaluates agents along two dimensions: automation level across a five-level taxonomy, and generalization scope across a demand hierarchy. This design captures varying levels of required agent autonomy and generalization, forming a performance-generalization evaluation matrix for structured and comprehensive assessment. Experiments show that even state-of-the-art agents with VLM backbones struggle with higher-level tasks involving perception, reasoning, and coordination, highlighting the need for a deeper understanding of current strengths and limitations to drive future progress in computer-using agent research and deployment. All code, environments, baselines, and data are publicly available at https://github.com/OS-Copilot/OS-Map.
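
The performance-generalization evaluation matrix described above amounts to a success rate per (automation level, generalization scope) cell. A minimal aggregation sketch follows; the record layout and the scope labels are hypothetical, not OS-MAP's actual schema.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Record = Tuple[int, str, bool]  # (automation_level, generalization_scope, task_succeeded)


def evaluation_matrix(results: Iterable[Record]) -> Dict[Tuple[int, str], float]:
    """Success rate for every (level, scope) cell that has at least one task."""
    totals = defaultdict(lambda: [0, 0])  # cell -> [successes, attempts]
    for level, scope, succeeded in results:
        cell = totals[(level, scope)]
        cell[0] += int(succeeded)
        cell[1] += 1
    return {key: wins / attempts for key, (wins, attempts) in totals.items()}


# Toy run: scope labels "narrow" and "broad" are placeholders.
matrix = evaluation_matrix([
    (1, "narrow", True),
    (1, "narrow", True),
    (3, "broad", False),
    (3, "broad", True),
])
```

Reading the matrix along one axis shows how success drops as required autonomy rises, and along the other how it changes as tasks generalize further.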

[95] Fine-Grained Traffic Inference from Road to Lane via Spatio-Temporal Graph Node Generation

Shuhao Li,Weidong Yang,Yue Cui,Xiaoxing Liu,Lingkai Meng,Lipeng Ma,Fan Zhang

Main category: cs.AI

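TL;DR: The paper proposes the Fine-grained Road Traffic Inference (FRTI) task, which infers lane-level traffic information from limited road data through spatio-temporal graph node generation. It designs a two-stage framework, RoadDiff, that combines a Road-Lane Correlation Autoencoder-Decoder with a Lane Diffusion Module to infer lane-level traffic states accurately.

Details Motivation: Limits on sensor types and numbers, together with inaccurate tracking algorithms, make lane-level traffic data a key bottleneck for data-driven models. A more energy-efficient and cost-effective solution is needed.

Contribution: Proposes the FRTI task and is the first to frame it as a spatio-temporal graph node generation problem; designs the two-stage RoadDiff framework, comprising the Road-Lane Correlation Autoencoder-Decoder and the Lane Diffusion Module; validates the model on six datasets.

Method: RoadDiff works in two steps: 1) the Road-Lane Correlation Autoencoder-Decoder extracts spatio-temporal dependencies; 2) the Lane Diffusion Module uses distribution relationships to generate fine-grained lane-level data.

Result: On six datasets covering different road conditions, RoadDiff performs well, confirming its effectiveness on the FRTI task.

Insight: Spatio-temporal graph node generation can infer finer-grained traffic information from limited road data, providing new data support for autonomous driving and traffic management.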

Abstract: Fine-grained traffic management and prediction are fundamental to key applications such as autonomous driving, lane change guidance, and traffic signal control. However, obtaining lane-level traffic data has become a critical bottleneck for data-driven models due to limitations in the types and number of sensors and issues with the accuracy of tracking algorithms. To address this, we propose the Fine-grained Road Traffic Inference (FRTI) task, which aims to generate more detailed lane-level traffic information using limited road data, providing a more energy-efficient and cost-effective solution for precise traffic management. This task is abstracted as the first scene of the spatio-temporal graph node generation problem. We designed a two-stage framework, RoadDiff, to solve the FRTI task. This framework leverages the Road-Lane Correlation Autoencoder-Decoder and the Lane Diffusion Module to fully utilize the limited spatio-temporal dependencies and distribution relationships of road data to accurately infer fine-grained lane traffic states. Based on existing research, we designed several baseline models with the potential to solve the FRTI task and conducted extensive experiments on six datasets representing different road conditions to validate the effectiveness of the RoadDiff model in addressing the FRTI task. The relevant datasets and code are available at https://github.com/ShuhaoLii/RoadDiff.

[96] PhysDrive: A Multimodal Remote Physiological Measurement Dataset for In-vehicle Driver Monitoring

Jiyao Wang,Xiao Yang,Qingyong Hu,Jiankai Tang,Can Liu,Dengbo He,Yuntao Wang,Yingcong Chen,Kaishun Wu

Main category: cs.AI

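TL;DR: PhysDrive is the first large-scale multimodal dataset for contactless in-vehicle physiological sensing. It covers a range of driving conditions and sensor data and addresses the limits of existing datasets in scale, diversity, and coverage of real driving scenarios.

Details Motivation: Existing remote physiological measurement (RPM) datasets are small and limited in modality, and they overlook the challenges of real driving, such as dynamic lighting and driver motion. A comprehensive dataset is needed to advance research.

Contribution: Presents PhysDrive, the first multimodal dataset of its kind, with 48 drivers, covering RGB, near-infrared camera, and mmWave radar data, with six synchronized physiological signals as ground truth.

Method: Data are collected synchronously from multiple sensors (cameras and radar) across varied driving conditions (dynamic lighting, vehicle types, and others), and both signal-processing and deep-learning methods are evaluated.

Result: PhysDrive establishes a comprehensive benchmark for multimodal driver monitoring and releases open-source code compatible with mainstream toolboxes.

Insight: The dataset is expected to become a foundational resource for research on multimodal driver monitoring and smart-cockpit systems and to help move the field toward practical use.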

Abstract: Robust and unobtrusive in-vehicle physiological monitoring is crucial for ensuring driving safety and user experience. While remote physiological measurement (RPM) offers a promising non-invasive solution, its translation to real-world driving scenarios is critically constrained by the scarcity of comprehensive datasets. Existing resources are often limited in scale, modality diversity, the breadth of biometric annotations, and the range of captured conditions, thereby omitting inherent real-world challenges in driving. Here, we present PhysDrive, the first large-scale multimodal dataset for contactless in-vehicle physiological sensing with dedicated consideration on various modality settings and driving factors. PhysDrive collects data from 48 drivers, including synchronized RGB, near-infrared camera, and raw mmWave radar data, accompanied with six synchronized ground truths (ECG, BVP, Respiration, HR, RR, and SpO2). It covers a wide spectrum of naturalistic driving conditions, including driver motions, dynamic natural light, vehicle types, and road conditions. We extensively evaluate both signal-processing and deep-learning methods on PhysDrive, establishing a comprehensive benchmark across all modalities, and release full open-source code with compatibility for mainstream public toolboxes. We envision PhysDrive will serve as a foundational resource and accelerate research on multimodal driver monitoring and smart-cockpit systems.

cs.HC [Back]

[97] People Are Highly Cooperative with Large Language Models, Especially When Communication Is Possible or Following Human Interaction

Paweł Niszczota,Tomasz Grzegorczyk,Alexander Pastukhov

Main category: cs.HC

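TL;DR: The paper examines how cooperative behavior differs when people interact with large language models (LLMs) instead of humans. Cooperation with LLMs is lower than with humans but still high, especially when communication is allowed or after prior interaction with humans.

Details Motivation: To study how LLMs affect human cooperation in business settings, particularly where trust and communication matter, and to assess the potential value of LLM applications.

Contribution: Provides experimental evidence on human cooperation with LLMs, especially under the influence of communication and prior human interaction, revealing the potential of LLMs in cooperative tasks.

Method: Using the Prisoner's Dilemma game, the authors ran a repeated game (Experiment 1) and a one-shot game (Experiment 2), comparing cooperation when participants played against a human, a classic bot, or an LLM.

Result: Cooperation rates with LLMs were about 10-15 percentage points lower than with human opponents but remained high. Allowing communication raised the likelihood of cooperation by 88%, and cooperation with LLMs was higher following prior interaction with humans.

Insight: Communication and prior human interaction increase cooperation with LLMs, giving empirical support for the careful use of LLMs in business settings.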

Abstract: Machines driven by large language models (LLMs) have the potential to augment humans across various tasks, a development with profound implications for business settings where effective communication, collaboration, and stakeholder trust are paramount. To explore how interacting with an LLM instead of a human might shift cooperative behavior in such settings, we used the Prisoner’s Dilemma game – a surrogate of several real-world managerial and economic scenarios. In Experiment 1 (N=100), participants engaged in a thirty-round repeated game against a human, a classic bot, and an LLM (GPT, in real-time). In Experiment 2 (N=192), participants played a one-shot game against a human or an LLM, with half of them allowed to communicate with their opponent, enabling LLMs to leverage a key advantage over older-generation machines. Cooperation rates with LLMs – while lower by approximately 10-15 percentage points compared to interactions with human opponents – were nonetheless high. This finding was particularly notable in Experiment 2, where the psychological cost of selfish behavior was reduced. Although allowing communication about cooperation did not close the human-machine behavioral gap, it increased the likelihood of cooperation with both humans and LLMs equally (by 88%), which is particularly surprising for LLMs given their non-human nature and the assumption that people might be less receptive to cooperating with machines compared to human counterparts. Additionally, cooperation with LLMs was higher following prior interaction with humans, suggesting a spillover effect in cooperative behavior. Our findings validate the (careful) use of LLMs by businesses in settings that have a cooperative component.

[98] How good are humans at detecting AI-generated images? Learnings from an experiment

Thomas Roca,Anthony Cintron Roman,Jehú Torres Vega,Marcelo Duarte,Pengce Wang,Kevin White,Amit Misra,Juan Lavista Ferres

Main category: cs.HC

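TL;DR: People have limited ability to tell AI-generated images from real ones: overall accuracy is only 62%, which shows that the task is hard.

Details Motivation: As AI image generation improves, whether people can reliably distinguish real from AI-generated images becomes an important question, especially given the risk of misinformation.

Contribution: Using large-scale experimental data (about 287,000 evaluations from over 12,500 participants), the study quantifies how well people identify AI-generated images and shows how performance varies by image category.

Method: Data were collected through the online game "Real or Not Quiz", in which participants classified a randomized mix of real and AI-generated images.

Result: Overall accuracy was 62%. Participants did best on human portraits and struggled with natural and urban landscapes.

Insight: The study stresses the need for transparency tools, such as watermarks and AI detection tools, to counter the risk of being misled by AI-generated content.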

Abstract: As AI-powered image generation improves, a key question is how well human beings can differentiate between “real” and AI-generated or modified images. Using data collected from the online game “Real or Not Quiz”, this study investigates how effectively people can distinguish AI-generated images from real ones. Participants viewed a randomized set of real and AI-generated images, aiming to identify their authenticity. Analysis of approximately 287,000 image evaluations by over 12,500 global participants revealed an overall success rate of only 62%, indicating a modest ability, slightly above chance. Participants were most accurate with human portraits but struggled significantly with natural and urban landscapes. These results highlight the inherent challenge humans face in distinguishing AI-generated visual content, particularly images without obvious artifacts or stylistic cues. This study stresses the need for transparency tools, such as watermarks and robust AI detection tools, to mitigate the risks of misinformation arising from AI-generated content.

cs.IR [Back]

[99] Distilling a Small Utility-Based Passage Selector to Enhance Retrieval-Augmented Generation

Hengran Zhang,Keping Bi,Jiafeng Guo,Jiaming Zhang,Shuaiqiang Wang,Dawei Yin,Xueqi Cheng

Main category: cs.IR

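TL;DR: The paper proposes distilling the utility judgment capability of large language models (LLMs) into smaller, more efficient models, to address the high computational cost of utility judgments in retrieval-augmented generation (RAG).

Details Motivation: In retrieval-augmented generation (RAG), conventional retrieval focuses on the relevance between query and passage, whereas utility (how useful a passage is for generating an accurate answer) matters more. Because utility judgments with LLMs are computationally expensive, the number of passages that can be evaluated is limited, which hurts complex queries.

Contribution: 1. Proposes a method to distill the utility judgment capability of LLMs into small models; 2. uses dynamic passage selection, avoiding fixed thresholds; 3. trains student models on pseudo-answer generation and utility judgments; 4. will release relevance ranking and utility-based selection annotations for the MS MARCO dataset.

Method: 1. A teacher-student framework with Qwen3-32B as the teacher, distilled into RankQwen1.7B and UtilityQwen1.7B; 2. a sliding window method that dynamically selects useful passages; 3. student models trained to learn utility judgments and pseudo-answer generation.

Result: Experiments show that utility-based selection significantly reduces computational cost while improving answer quality, and that for complex questions it is more effective than conventional relevance ranking.

Insight: Utility-based selection is more flexible and efficient than conventional relevance ranking, particularly for complex queries, and points to a new direction for optimizing RAG systems.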

Abstract: Retrieval-augmented generation (RAG) enhances large language models (LLMs) by incorporating retrieved information. The standard retrieval process prioritizes relevance, focusing on topical alignment between queries and passages. In contrast, in RAG, the emphasis has shifted to utility, which considers the usefulness of passages for generating accurate answers. Despite empirical evidence showing the benefits of utility-based retrieval in RAG, the high computational cost of using LLMs for utility judgments limits the number of passages evaluated. This restriction is problematic for complex queries requiring extensive information. To address this, we propose a method to distill the utility judgment capabilities of LLMs into smaller, more efficient models. Our approach focuses on utility-based selection rather than ranking, enabling dynamic passage selection tailored to specific queries without the need for fixed thresholds. We train student models to learn pseudo-answer generation and utility judgments from teacher LLMs, using a sliding window method that dynamically selects useful passages. Our experiments demonstrate that utility-based selection provides a flexible and cost-effective solution for RAG, significantly reducing computational costs while improving answer quality. We present the distillation results using Qwen3-32B as the teacher model for both relevance ranking and utility-based selection, distilled into RankQwen1.7B and UtilityQwen1.7B. Our findings indicate that for complex questions, utility-based selection is more effective than relevance ranking in enhancing answer generation performance. We will release the relevance ranking and utility-based selection annotations for the MS MARCO dataset, supporting further research in this area.
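
Sliding-window selection without a fixed threshold can be sketched as follows. The abstract does not give the window mechanics, so the forward pass with overlapping windows is an assumption, and `judge` is a hypothetical stand-in for the distilled student model that returns the passages it considers useful within one window.

```python
from typing import Callable, List, Sequence


def select_useful(
    passages: Sequence[str],
    judge: Callable[[Sequence[str]], List[str]],  # hypothetical student-model call
    window: int = 4,
    stride: int = 2,
) -> List[str]:
    """Slide a window over the ranked passages and keep whatever the judge marks useful."""
    selected: List[str] = []
    for start in range(0, len(passages), stride):
        for passage in judge(passages[start:start + window]):
            if passage not in selected:  # overlapping windows may repeat a passage
                selected.append(passage)
    return selected


# Toy judge: a passage counts as useful if it mentions the answer entity.
toy_judge = lambda chunk: [p for p in chunk if "Paris" in p]
ranked = ["Paris is the capital.", "Weather today.", "Paris hosts the Louvre.", "Sports news.", "Stock prices."]
useful = select_useful(ranked, toy_judge)
```

The number of passages returned varies per query, which is the practical difference from cutting a ranked list at a fixed top-k.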

[100] Injecting External Knowledge into the Reasoning Process Enhances Retrieval-Augmented Generation

Minghao Tang,Shiyu Ni,Jiafeng Guo,Keping Bi

Main category: cs.IR

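TL;DR: The paper proposes Passage Injection, a method that explicitly incorporates retrieved external knowledge into the reasoning process of large language models (LLMs) to make them more robust to noise in knowledge-intensive tasks.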

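Details Motivation: Retrieval-augmented generation (RAG) systems perform well on knowledge-intensive tasks, but their effectiveness is often limited by low-quality (noisy) retrieved passages. Making LLMs more robust to such noise is key to improving system reliability.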

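Contribution: The main contribution is Passage Injection, which explicitly injects retrieved passages into the reasoning process and significantly improves the LLM's ability to recognize and resist noise.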

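Method: Passage Injection incorporates the retrieved passages directly into the LLM's reasoning process. The experiments use BM25 as the retriever and validate the method on four reasoning-enhanced LLMs.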

Result: Passage Injection significantly improves overall RAG performance on four factual QA datasets and is more robust under both random noise and counterfactual noise.

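Insight: Explicitly incorporating retrieved passages into the reasoning process is a promising direction for building more robust RAG systems, and it can also make effective use of helpful passages.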

Abstract: Retrieval-augmented generation (RAG) has been widely adopted to augment large language models (LLMs) with external knowledge for knowledge-intensive tasks. However, its effectiveness is often undermined by the presence of noisy (i.e., low-quality) retrieved passages. Enhancing LLMs’ robustness to such noise is critical for improving the reliability of RAG systems. Recent advances have equipped LLMs with strong reasoning and self-reflection capabilities, allowing them to identify and correct errors in their reasoning process. Inspired by this ability, we propose Passage Injection, a simple yet effective method that explicitly incorporates retrieved passages into LLMs’ reasoning process, aiming to enhance the model’s ability to recognize and resist noisy passages. We validate Passage Injection under general RAG settings using BM25 as the retriever. Experiments on four reasoning-enhanced LLMs across four factual QA datasets demonstrate that Passage Injection significantly improves overall RAG performance. Further analysis on two noisy retrieval settings (random noise, where the model is provided irrelevant passages, and counterfactual noise, where it is given misleading passages) shows that Passage Injection consistently improves robustness. Controlled experiments confirm that Passage Injection can also effectively leverage helpful passages. These findings suggest that incorporating passages in LLMs’ reasoning process is a promising direction for building more robust RAG systems. The code is available at https://github.com/mh-tang/Passage-Injection.
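
As a rough illustration of the setup in this entry, the sketch below scores passages with a plain BM25 formula and then places the retrieved passages at the start of the model's reasoning segment instead of in the question itself. Several parts are assumptions rather than details from the paper: the whitespace tokenizer, the `k1` and `b` defaults, the `<think>` marker, and the wording of the injected text are all hypothetical, and `bm25_scores` and `build_injected_prompt` are illustrative names. The authors' actual template is in their repository.

```python
import math
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase whitespace tokenizer (a simplification)."""
    return text.lower().split()


def bm25_scores(
    query: str, docs: List[str], k1: float = 1.5, b: float = 0.75
) -> List[float]:
    """Score every document against the query with the BM25 formula."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    if n_docs == 0:
        return []
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Number of documents that contain each term at least once.
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            n_t = doc_freq[term]
            idf = math.log((n_docs - n_t + 0.5) / (n_t + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def build_injected_prompt(question: str, passages: List[str]) -> str:
    """Put passages inside the reasoning segment (hypothetical template)."""
    numbered = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, 1))
    return (
        f"Question: {question}\n"
        "<think>\n"  # assumed reasoning-start marker, not from the paper
        "I retrieved the following passages; some of them may be noisy:\n"
        f"{numbered}\n"
    )


if __name__ == "__main__":
    corpus = ["the cat sat on the mat", "dogs chase cats", "the stock market fell"]
    scores = bm25_scores("cat mat", corpus)
    top = max(range(len(corpus)), key=scores.__getitem__)
    print(build_injected_prompt("Where did the cat sit?", [corpus[top]]))
```

The prompt ends inside the reasoning segment, so a model that continues from it would start its reasoning with the passages already in view.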

eess.AS [Back]

[101] FD-Bench: A Full-Duplex Benchmarking Pipeline Designed for Full Duplex Spoken Dialogue Systems

Yizhou Peng,Yi-Wen Chao,Dianwen Ng,Yukun Ma,Chongjia Ni,Bin Ma,Eng Siong Chng

Main category: eess.AS

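TL;DR: This work presents FD-Bench, a comprehensive benchmarking pipeline for full-duplex spoken dialogue systems (FDSDS) that fills the gap left by existing benchmarks for FD scenarios. It evaluates three open-source FDSDS under challenging conditions such as simulated user interruptions and noisy environments, and finds that these systems still struggle to handle interruptions.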

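Details Motivation: Full-duplex spoken dialogue systems (FDSDS) enable more natural human-machine interaction, but existing benchmarks lack metrics for FD scenarios, such as the ability to handle user interruptions. This work aims to fill that gap.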

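Contribution: 1. Proposes FD-Bench, the first comprehensive benchmarking pipeline for FDSDS, with support for diverse evaluation metrics.
2. Generates over 40 hours of speech data covering 293 simulated conversations and 1,200 interruptions.
3. Evaluates three open-source FDSDS models and exposes their limitations in handling interruptions and noisy environments.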

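Method: Builds the benchmarking pipeline from large language models (LLMs), text-to-speech (TTS), and automatic speech recognition (ASR), simulates complex scenarios such as user interruptions and noisy environments, and evaluates model performance with a diverse set of novel metrics.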

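Result: All three open-source FDSDS tested (Moshi, Freeze-omni, and VITA-1.5) still face challenges, such as failing to respond to user interruptions, especially under frequent interruptions and noisy conditions.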

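Insight: Evaluating full-duplex dialogue systems requires more comprehensive metrics and more complex scenario simulation; future work should focus on improving robustness to real-time interruptions and noisy environments.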

Abstract: Full-duplex spoken dialogue systems (FDSDS) enable more natural human-machine interactions by allowing real-time user interruptions and backchanneling, compared to traditional SDS that rely on turn-taking. However, existing benchmarks lack metrics for FD scenes, e.g., evaluating model performance during user interruptions. In this paper, we present a comprehensive FD benchmarking pipeline utilizing LLMs, TTS, and ASR to address this gap. It assesses FDSDS’s ability to handle user interruptions, manage delays, and maintain robustness in challenging scenarios with diverse novel metrics. We applied our benchmark to three open-source FDSDS (Moshi, Freeze-omni, and VITA-1.5) using over 40 hours of generated speech, with 293 simulated conversations and 1,200 interruptions. The results show that all models continue to face challenges, such as failing to respond to user interruptions, under frequent disruptions and noisy conditions. Demonstrations, data, and code will be released.

[102] Should Top-Down Clustering Affect Boundaries in Unsupervised Word Discovery?

Simon Malan,Benjamin van Niekerk,Herman Kamper

Main category: eess.AS

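TL;DR: The paper compares the segmentation performance of bottom-up and top-down approaches to unsupervised word discovery and finds that they perform comparably, with the bottom-up approach being faster. It recommends that future work improve clustering techniques and word representations.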

Details Motivation: To determine whether top-down clustering actually improves boundary segmentation in unsupervised speech segmentation.

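Contribution: 1. Compares the performance and efficiency of the two frameworks; 2. Shows when the top-down approach can be beneficial; 3. Identifies clustering as the limiting factor.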

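Method: 1. Bottom-up: predicts boundaries from the dissimilarity between adjacent self-supervised features, then clusters the resulting segments; 2. Top-down: an updated version of the ES-KMeans dynamic programming method that iteratively uses K-means to update its boundaries.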

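Result: On the five-language ZeroSpeech benchmarks, the two approaches perform comparably, with the bottom-up approach nearly five times faster. The clustering step is the key bottleneck.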

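Insight: Top-down information helps in some cases (depending on factors such as the candidate boundaries) but is not necessary. Improving clustering and word representations is the way forward.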

Abstract: We investigate the problem of segmenting unlabeled speech into word-like units and clustering these to create a lexicon. Prior work can be categorized into two frameworks. Bottom-up methods first determine boundaries and then cluster the fixed segmented words into a lexicon. In contrast, top-down methods incorporate information from the clustered words to inform boundary selection. However, it is unclear whether top-down information is necessary to improve segmentation. To explore this, we look at two similar approaches that differ in whether top-down clustering informs boundary selection. Our simple bottom-up strategy predicts word boundaries using the dissimilarity between adjacent self-supervised features, then clusters the resulting segments to construct a lexicon. Our top-down system is an updated version of the ES-KMeans dynamic programming method that iteratively uses K-means to update its boundaries. On the five-language ZeroSpeech benchmarks, both approaches achieve comparable state-of-the-art results, with the bottom-up system being nearly five times faster. Through detailed analyses, we show that the top-down influence of ES-KMeans can be beneficial (depending on factors like the candidate boundaries), but in many cases the simple bottom-up method performs just as well. For both methods, we show that the clustering step is a limiting factor. Therefore, we recommend that future work focus on improved clustering techniques and learning more discriminative word-like representations. Project code repository: https://github.com/s-malan/prom-seg-clus.
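
The bottom-up boundary step described in this entry can be sketched in a few lines. This is a simplified illustration and not the code from the project repository: it assumes cosine distance as the dissimilarity measure and uses plain local-maximum peak picking with a hypothetical `threshold` parameter. The function names are illustrative, and the later clustering of segments into a lexicon is omitted.

```python
import math
from typing import List, Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity; treat zero vectors as maximally distant."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def dissimilarity_curve(frames: Sequence[Sequence[float]]) -> List[float]:
    """Distance between each pair of adjacent feature frames."""
    return [
        cosine_distance(frames[i], frames[i + 1]) for i in range(len(frames) - 1)
    ]


def pick_boundaries(curve: Sequence[float], threshold: float = 0.5) -> List[int]:
    """Place a boundary at each local maximum of the curve above the threshold.

    The returned index i means the boundary falls just before frame i.
    """
    boundaries = []
    for i, value in enumerate(curve):
        left = curve[i - 1] if i > 0 else float("-inf")
        right = curve[i + 1] if i + 1 < len(curve) else float("-inf")
        # Strict on the left so a flat plateau yields a single boundary.
        if value >= threshold and value > left and value >= right:
            boundaries.append(i + 1)
    return boundaries


if __name__ == "__main__":
    # Toy 2-D "features"; real inputs would be self-supervised speech frames.
    toy_frames = [[1, 0], [1, 0], [1, 0], [0, 1], [0, 1], [1, 0]]
    print(pick_boundaries(dissimilarity_curve(toy_frames)))
```

In this bottom-up setting the boundaries are fixed once they are picked; a top-down method such as ES-KMeans would instead revise them using the clustered segments.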

cs.SD [Back]

[103] MLLM-based Speech Recognition: When and How is Multimodality Beneficial?

Yiwen Guan,Viet Anh Trinh,Vivek Voleti,Jacob Whitehill

Main category: cs.SD

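TL;DR: This paper studies the use of multimodal large language models (MLLMs) for automatic speech recognition (ASR), analyzing the role of different input modalities in noisy environments and the conditions under which they help most.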

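Details Motivation: To study how multimodal input can improve ASR accuracy in noisy environments, in particular the complementary relationship among modalities such as speech, text, and images.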

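Contribution: Demonstrates and validates the benefit of multimodal input for ASR under different noise conditions, and shows how factors such as modality synchronization and visual representation quality affect the results.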

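Method: Through experiments on synthetic and real-world data, compares different modality combinations, model architectures (such as Mamba and Transformer), input orders, and loss weights.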

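Result: Multimodal input usually improves ASR accuracy, but the gain depends on the noise level; synchronized modalities are more useful at high noise levels, while higher-quality visual representations consistently improve accuracy.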

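Insight: Tuning the multimodal input (for example, modality selection and loss weighting) matters for ASR performance, especially in noisy environments; future work should focus on more powerful visual encoders.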

Abstract: Recent advances in multi-modal large language models (MLLMs) have opened new possibilities for unified modeling of speech, text, images, and other modalities. Building on our prior work, this paper examines the conditions and model architectures under which multiple input modalities can improve automatic speech recognition (ASR) accuracy in noisy environments. Through experiments on synthetic and real-world data, we find that (1) harnessing more modalities usually improves ASR accuracy, as each modality provides complementary information, but the improvement depends on the amount of auditory noise. (2) Synchronized modalities (e.g., lip movements) are more useful at high noise levels whereas unsynchronized modalities (e.g., image context) are most helpful at moderate noise levels. (3) Higher-quality visual representations consistently improve ASR accuracy, highlighting the importance of developing more powerful visual encoders. (4) Mamba exhibits similar trends regarding the benefits of multimodality as do Transformers. (5) The input order of modalities as well as their weights in the loss function can significantly impact accuracy. These findings both offer practical insights and help to deepen our understanding of multi-modal speech recognition under challenging conditions.