cs.CL

[1] LLMs Can’t Handle Peer Pressure: Crumbling under Multi-Agent Social Interactions

Maojia Song,Tej Deep Pala,Weisheng Jin,Amir Zadeh,Chuan Li,Dorien Herremans,Soujanya Poria

Main category: cs.CL

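TL;DR: The paper examines how large language models (LLMs) handle social interaction in multi-agent systems. It introduces the KAIROS benchmark, studies trust formation, resistance to misinformation, and integration of peer input, and evaluates several mitigation strategies.

Motivation: To analyze how LLMs behave in multi-agent social interactions, particularly with respect to trust, resistance to interference, and group decision-making, as a foundation for improving collective intelligence.

Contribution: Introduces KAIROS, a benchmark for systematically studying LLM behavior in multi-agent interaction, and evaluates GRPO training that combines multi-agent context with outcome-based rewards as a mitigation strategy.

Method: Designs the KAIROS benchmark to simulate interaction with peers of varying reliability; evaluates prompting, supervised fine-tuning, and reinforcement learning (GRPO); and analyzes how trust, peer behavior, and related factors influence decisions.

Result: GRPO with multi-agent context achieves the best overall performance, but reduces robustness to social influence compared with the Base models.

Insight: LLMs remain limited in multi-agent interaction, and their social adaptability and robustness need further improvement.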

Abstract: Large language models (LLMs) are increasingly deployed in multi-agent systems (MAS) as components of collaborative intelligence, where peer interactions dynamically shape individual decision-making. Although prior work has focused on conformity bias, we extend the analysis to examine how LLMs form trust from previous impressions, resist misinformation, and integrate peer input during interaction, key factors for achieving collective intelligence under complex social dynamics. We present KAIROS, a benchmark simulating quiz contests with peer agents of varying reliability, offering fine-grained control over conditions such as expert-novice roles, noisy crowds, and adversarial peers. LLMs receive both historical interactions and current peer responses, allowing systematic investigation into how trust, peer action, and self-confidence influence decisions. As mitigation strategies, we evaluate prompting, supervised fine-tuning, and reinforcement learning with Group Relative Policy Optimisation (GRPO) across multiple models. Our results reveal that GRPO with multi-agent context combined with outcome-based rewards and unconstrained reasoning achieves the best overall performance, but also reduces robustness to social influence compared with the Base models. The code and datasets are available at: https://github.com/declare-lab/KAIROS.
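The abstract pairs GRPO with outcome-based rewards. As a rough illustration only, the sketch below shows the group-relative advantage normalization usually associated with GRPO, assuming a binary outcome reward (1.0 when a sampled quiz answer is correct, 0.0 otherwise). The function name and the reward values are hypothetical and are not taken from the KAIROS code.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Score each sampled response relative to its own group.

    Assumption: `rewards` holds one outcome reward per response sampled
    for the same prompt. Each reward is centered on the group mean and
    scaled by the group standard deviation.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every response gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of four responses: two correct, two incorrect.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Correct responses receive a positive advantage and incorrect ones a negative advantage, and a group in which every response earns the same reward contributes no learning signal under this normalization.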

[2] Language-Specific Layer Matters: Efficient Multilingual Enhancement for Large Vision-Language Models

Yuchun Fan,Yilin Wang,Yongyu Mu,Lei Huang,Bei Li,Xiaocheng Feng,Tong Xiao,Jingbo Zhu

Main category: cs.CL

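TL;DR: The paper proposes PLAST, an efficient multilingual enhancement method that improves the multilingual capabilities of large vision-language models (LVLMs) by precisely fine-tuning language-specific layers, substantially reducing the number of parameters tuned.

Motivation: LVLMs show imbalanced capabilities across languages. The paper analyzes their multilingual working pattern to find an efficient way to improve multilingual understanding.

Contribution: 1. Reveals the relationship between language-specific neuron activations in LVLMs and multilingual understanding ability; 2. Proposes PLAST, which achieves efficient multilingual enhancement by precisely fine-tuning language-specific layers.

Method: PLAST identifies the key layers by monitoring language-specific neuron activations, then fine-tunes those layers with question-translation pairs to achieve multilingual alignment.

Result: Experiments on MM-Bench and MMMB show that PLAST effectively improves the multilingual capabilities of LVLMs while tuning only 14% of the parameters.

Insight: Language-specific engagement with visual information is concentrated in shallow layers, and PLAST generalizes to low-resource and complex visual reasoning tasks.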

Abstract: Large vision-language models (LVLMs) have demonstrated exceptional capabilities in understanding visual information with human languages but also exhibit an imbalance in multilingual capabilities. In this work, we delve into the multilingual working pattern of LVLMs and identify a salient correlation between the multilingual understanding ability of LVLMs and language-specific neuron activations in shallow layers. Building on this insight, we introduce PLAST, a training recipe that achieves efficient multilingual enhancement for LVLMs by Precise LAnguage-Specific layers fine-Tuning. PLAST first identifies layers involved in multilingual understanding by monitoring language-specific neuron activations. These layers are then precisely fine-tuned with question-translation pairs to achieve multilingual alignment. Our empirical results on MM-Bench and MMMB demonstrate that PLAST effectively improves the multilingual capabilities of LVLMs and achieves significant efficiency with only 14% of the parameters tuned. Further analysis reveals that PLAST can be generalized to low-resource and complex visual reasoning tasks, facilitating the language-specific visual information engagement in shallow layers.

[3] Integral Transformer: Denoising Attention, Not Too Much Not Too Little

Ivan Kobyzev,Abbas Ghaddar,Dingtao Hu,Boxing Chen

Main category: cs.CL

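TL;DR: The paper proposes the Integral Transformer, a self-attention mechanism that integrates signals sampled from the logit distribution to reduce attention noise while preserving the key information carried by special tokens, and that outperforms existing methods.

Motivation: Standard softmax self-attention assigns disproportionate weight to semantically uninformative tokens such as special tokens and punctuation (attention noise). Existing methods such as Cog Attention and the Differential Transformer reduce this noise by introducing negative attention scores, but risk discarding useful information.

Contribution: Proposes the Integral Transformer, a self-attention mechanism that integrates signals sampled from the logit distribution, reducing noise while preserving the role of critical special tokens.

Method: Integrates signals sampled from the logit distribution to balance attention allocation without discarding useful information. The model also keeps vanilla self-attention in the lower Transformer layers to improve performance.

Result: On several knowledge and reasoning language benchmarks, the Integral Transformer outperforms the vanilla, Cog, and Differential attention variants and reduces rank collapse in the attention distributions of upper layers.

Insight: Vanilla self-attention in the lower layers helps performance, while the Integral Transformer balances attention distributions in the upper layers, reducing noise without losing information.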

Abstract: Softmax self-attention often assigns disproportionate weight to semantically uninformative tokens such as special tokens and punctuation, a phenomenon known as attention noise. While recent methods like Cog Attention and the Differential Transformer have addressed this by introducing negative attention scores, they risk discarding useful information. In this paper, we propose the Integral Transformer, a novel self-attention mechanism that denoises attention by integrating signals sampled from the logit distribution. Our approach mitigates noise while preserving the contributions of special tokens critical for model performance. Extensive experiments demonstrate that our model outperforms vanilla, Cog, and Differential attention variants on well-established knowledge and reasoning language benchmarks. Moreover, our analysis reveals that employing vanilla self-attention in the lower Transformer layers enhances performance and that the Integral Transformer effectively balances attention distributions and reduces rank collapse in upper layers.

[4] Latent Self-Consistency for Reliable Majority-Set Selection in Short- and Long-Answer Reasoning

Jeong-seok Oh,Jay-yoon Lee

Main category: cs.CL

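TL;DR: LSC selects the most semantically consistent response using learnable token embeddings. It applies to both short-form and long-form reasoning tasks, adds very little computational overhead, and on average outperforms other methods across tasks.

Motivation: LLMs often produce inconsistent outputs on complex or long-form reasoning tasks. Existing methods mitigate this but struggle to handle short-form and long-form tasks at the same time.

Contribution: Proposes LSC, a low-cost consistency-selection method that requires no changes to the model architecture and improves performance on both short-form and long-form reasoning tasks.

Method: Uses learnable token embeddings to select the semantically consistent response, with a lightweight forward generation of summary tokens to keep inference efficient.

Result: Across 6 short-form and 5 long-form benchmarks, LSC on average outperforms the other methods and provides well-calibrated confidence estimates.

Insight: LSC shows that semantic consistency selection is an effective way to address inconsistent model outputs while remaining computationally efficient.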

Abstract: Probabilistic decoding in Large Language Models (LLMs) often yields inconsistent outputs, particularly on complex or long-form questions. Self-Consistency (SC) mitigates this for short-form QA by majority voting over exact strings, whereas Universal Self-Consistency (USC) and Weighted Unigram Consistency Score (WUCS) extend to long-form responses but lose accuracy on short-form benchmarks. We introduce Latent Self-Consistency (LSC), which selects the most semantically consistent response using learnable token embeddings. A lightweight forward generation of summary tokens increases inference time by less than 1% and requires no changes to the model architecture. Across 6 short-form and 5 long-form reasoning benchmarks (e.g., MATH, MMLU, TruthfulQA), LSC on average surpasses SC, USC, and WUCS on both the short-form and the long-form benchmarks, while maintaining negligible computational overhead. These results position LSC as a practical consistency-selection method that works reliably across answer formats. Additionally, LSC provides well-calibrated confidence estimates, maintaining low Expected Calibration Error across both answer formats.
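For context, the Self-Consistency baseline that the abstract describes as majority voting over exact strings can be sketched in a few lines. This is an illustrative sketch of that baseline, not of LSC itself; the whitespace and case normalization and the function name are assumptions added here.

```python
from collections import Counter


def self_consistency_vote(answers):
    """Return the most frequent answer and its vote share.

    Assumption: answers are compared as strings after trimming
    whitespace and lowercasing, which is a hypothetical normalization.
    """
    normalized = [a.strip().lower() for a in answers]
    answer, votes = Counter(normalized).most_common(1)[0]
    return answer, votes / len(normalized)


# Short-form answers agree as strings, so voting works.
print(self_consistency_vote(["42", "42 ", "41"]))

# Long-form paraphrases rarely match exactly, so every answer gets one vote.
print(self_consistency_vote([
    "The capital of France is Paris.",
    "Paris is the capital of France.",
]))
```

The second call illustrates why exact-string voting does not carry over to long-form responses, which is the gap that semantic selection methods such as LSC aim to close.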

[5] How Reliable are LLMs for Reasoning on the Re-ranking task?

Nafis Tanveer Islam,Zhiming Zhao

Main category: cs.CL

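TL;DR: The paper examines how reliable large language models (LLMs) are on the re-ranking task. It analyzes how different training methods affect the models' semantic understanding and investigates whether the models can generate more informed textual reasoning to improve transparency and cope with limited data.

Motivation: As the semantic understanding of LLMs improves, their alignment with human values increases, but transparency decreases. In newly developed systems with limited data, accurate re-ranking remains a significant challenge. The authors analyze the effect of different training methods to explore whether LLMs can provide more reliable explanations and reasoning for re-ranking.

Contribution: (1) Analyzes how different training methods affect the semantic understanding of LLMs; (2) studies the ability of LLMs to generate explanatory information for the re-ranking task; (3) examines LLM reliability in a limited-data setting.

Method: The authors use a relatively small ranking dataset from the environmental and Earth science domain to re-rank retrieved content, and analyze LLM behavior under different training methods to probe semantic understanding and explainable reasoning.

Result: Some training methods exhibit better explainability than others, but not all of them lead to accurate semantic understanding. Some appear to gain only abstract knowledge that optimizes the evaluation, which raises questions about the true reliability of LLMs.

Insight: The reliability of LLMs on re-ranking depends not only on performance but also on semantic understanding and explainability, which underlines the importance of choosing a suitable training method when transparency matters and data are limited.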

Abstract: With the improving semantic understanding capability of Large Language Models (LLMs), they exhibit a greater awareness of and alignment with human values, but this comes at the cost of transparency. Although promising results are achieved via experimental analysis, an in-depth understanding of the LLM’s internal workings is necessary to comprehend the reasoning behind the re-ranking, which provides end users with an explanation that enables them to make an informed decision. Moreover, in newly developed systems with limited user engagement and insufficient ranking data, accurately re-ranking content remains a significant challenge. While various training methods affect the training of LLMs and generate inference, our analysis has found that some training methods exhibit better explainability than others, implying that an accurate semantic understanding has not been learned through all training methods; instead, abstract knowledge has been gained to optimize evaluation, which raises questions about the true reliability of LLMs. Therefore, in this work, we analyze how different training methods affect the semantic understanding of the re-ranking task in LLMs and investigate whether these models can generate more informed textual reasoning to overcome the challenges of transparency of LLMs and limited training data. To analyze the LLMs for re-ranking tasks, we utilize a relatively small ranking dataset from the environmental and Earth science domain to re-rank retrieved content. Furthermore, we also analyze the explainable information to see if the re-ranking can be reasoned using explainability.

[6] Integrating gender inclusivity into large language models via instruction tuning

Alina Wróblewska,Bartosz Żuk

Main category: cs.CL

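TL;DR: The study addresses gender bias in Polish-language large language models through instruction tuning, using the IPIS dataset and a gender-inclusive system prompt to improve model output.

Motivation: For historical and linguistic reasons, masculine forms are used predominantly in Polish, so language models inherit a gender bias. The study aims to mitigate this through technical means.

Contribution: Proposes a systematic approach, based on the IPIS dataset and a gender-inclusive prompt, that lets LLMs generate more gender-balanced output.

Method: Instruction-tunes multilingual and Polish-specific LLMs (such as Llama-8B and Bielik) on the IPIS dataset and designs a gender-inclusive prompt.

Result: The tuned models are reported to reduce gender bias and to generate fairer language output.

Insight: Gender bias in language models can be addressed through instruction tuning and data-driven methods, which offers a reference for technical interventions in other languages.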

Abstract: Imagine a language with masculine, feminine, and neuter grammatical genders, yet, due to historical and political conventions, masculine forms are predominantly used to refer to men, women and mixed-gender groups. This is the reality of contemporary Polish. A social consequence of this unfair linguistic system is that large language models (LLMs) trained on Polish texts inherit and reinforce this masculine bias, generating gender-imbalanced outputs. This study addresses this issue by tuning LLMs using the IPIS dataset, a collection of human-crafted gender-inclusive proofreading in Polish and Polish-to-English translation instructions. Grounded in a theoretical linguistic framework, we design a system prompt with explicit gender-inclusive guidelines for Polish. In our experiments, we IPIS-tune multilingual LLMs (Llama-8B, Mistral-7B and Mistral-Nemo) and Polish-specific LLMs (Bielik and PLLuM). Our approach aims to integrate gender inclusivity as an inherent feature of these models, offering a systematic solution to mitigate gender bias in Polish language generation.

[7] Thinking Before You Speak: A Proactive Test-time Scaling Approach

Cong Li,Wenchang Chai,Hejun Wu,Yan Pan,Pengxu Wei,Liang Lin

Main category: cs.CL

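TL;DR: The paper proposes TBYS, a reasoning framework that proactively generates "insights" to compensate for the weaknesses of LLMs on complex reasoning tasks, and verifies its effectiveness on mathematical datasets.

Motivation: LLMs perform poorly on complex reasoning tasks such as mathematics because human reasoning patterns differ from the patterns present in the training data. Humans tend to think carefully when solving complex problems, but this inner thinking is not reflected in the training data.

Contribution: Proposes the TBYS framework, which proactively generates "insights" to guide reasoning steps, and designs a pipeline for automatically collecting and filtering in-context examples, reducing human labeling and fine-tuning overhead.

Method: Inserts "insights" between consecutive reasoning steps, dynamically generating prompts that guide the reasoning process. The in-context examples are built automatically.

Result: TBYS proves effective on challenging mathematical datasets.

Insight: Proactively generated "insights" better mimic human reasoning and compensate for the shortcomings of LLMs on complex tasks.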

Abstract: Large Language Models (LLMs) often exhibit deficiencies with complex reasoning tasks, such as maths, which we attribute to the discrepancy between human reasoning patterns and those presented in the LLMs’ training data. When dealing with complex problems, humans tend to think carefully before expressing solutions. However, they often do not articulate their inner thoughts, including their intentions and chosen methodologies. Consequently, critical insights essential for bridging reasoning steps may be absent in training data collected from human sources. To bridge this gap, we propose inserting insights between consecutive reasoning steps, which review the status and initiate the next reasoning steps. Unlike prior prompting strategies that rely on a single static prompt or a workflow of static prompts to facilitate reasoning, insights are proactively generated to guide reasoning processes. We implement our idea as a reasoning framework, named Thinking Before You Speak (TBYS), and design a pipeline for automatically collecting and filtering in-context examples for the generation of insights, which alleviates human labeling efforts and fine-tuning overheads. Experiments on challenging mathematical datasets verify the effectiveness of TBYS. Project website: https://gitee.com/jswrt/TBYS

[8] Tailored Teaching with Balanced Difficulty: Elevating Reasoning in Multimodal Chain-of-Thought via Prompt Curriculum

Xinglong Yang,Quan Feng,Zhongying Pan,Xiang Chen,Yu Tian,Wentong Li,Shuofei Qiao,Yuxia Geng,Xingyu Zhao,Sheng-Jun Huang

Main category: cs.CL

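TL;DR: The paper proposes a difficulty-balanced approach to prompt curriculum design that combines model-perceived difficulty with intrinsic sample complexity to improve Multimodal Chain-of-Thought (MCoT) prompting.

Motivation: Existing MCoT prompting methods usually rely on randomly or manually selected examples, ignoring the model's knowledge distribution and the intrinsic complexity of the task, which leads to unstable performance. Inspired by the principle of "tailored teaching", the authors propose a prompt curriculum guided by the model's capabilities.

Contribution: 1. Reframes prompt selection as prompt curriculum design; 2. Proposes a difficulty-balanced sampling strategy that combines model-perceived difficulty with intrinsic sample complexity; 3. Validates the method on multiple benchmarks and MLLMs.

Method: A difficulty-balanced sampling strategy based on two signals: 1. model-perceived difficulty, quantified through prediction disagreement in an active learning setup; 2. intrinsic sample complexity, a model-independent measure of difficulty.

Result: On five challenging benchmarks and multiple MLLMs, the method substantially improves performance and reduces the performance fluctuations caused by random sampling.

Insight: The choice of prompt examples is critical to MCoT performance, and combining model capability with task complexity improves multimodal reasoning more effectively.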

Abstract: The effectiveness of Multimodal Chain-of-Thought (MCoT) prompting is often limited by the use of randomly or manually selected examples. These examples fail to account for both model-specific knowledge distributions and the intrinsic complexity of the tasks, resulting in suboptimal and unstable model performance. To address this, we propose a novel framework inspired by the pedagogical principle of “tailored teaching with balanced difficulty”. We reframe prompt selection as a prompt curriculum design problem: constructing a well-ordered set of training examples that align with the model’s current capabilities. Our approach integrates two complementary signals: (1) model-perceived difficulty, quantified through prediction disagreement in an active learning setup, capturing what the model itself finds challenging; and (2) intrinsic sample complexity, which measures the inherent difficulty of each question-image pair independently of any model. By jointly analyzing these signals, we develop a difficulty-balanced sampling strategy that ensures the selected prompt examples are diverse across both dimensions. Extensive experiments conducted on five challenging benchmarks and multiple popular Multimodal Large Language Models (MLLMs) demonstrate that our method yields substantial and consistent improvements and greatly reduces performance discrepancies caused by random sampling, providing a principled and robust approach for enhancing multimodal reasoning.
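One common way to quantify prediction disagreement is the entropy of the answers obtained from repeated predictions on the same example. The sketch below assumes that reading; the abstract does not specify the exact measure, so the function name and the use of vote entropy are hypothetical.

```python
import math
from collections import Counter


def prediction_disagreement(predictions):
    """Vote entropy (in bits) over repeated predictions for one example.

    Assumption: higher entropy means the model's predictions disagree
    more, which is treated here as higher model-perceived difficulty.
    """
    total = len(predictions)
    counts = Counter(predictions)
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return max(entropy, 0.0)  # clamp the -0.0 produced by unanimous votes


# Hypothetical predictions for two question-image pairs.
easy = ["A", "A", "A", "A"]   # unanimous: no disagreement
hard = ["A", "B", "A", "B"]   # evenly split: maximal for two options
print(prediction_disagreement(easy), prediction_disagreement(hard))
```

A score like this could serve as one axis of a difficulty-balanced sampler, with the model-independent complexity measure as the other axis.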

[9] Knowing or Guessing? Robust Medical Visual Question Answering via Joint Consistency and Contrastive Learning

Songtao Jiang,Yuxi Chen,Sibo Song,Yan Zhang,Yeying Jin,Yang Feng,Jian Wu,Zuozhu Liu

Main category: cs.CL

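TL;DR: Medical visual question answering (VQA) models behave inconsistently across different phrasings of the same question. The paper introduces the RoMed dataset and the Consistency and Contrastive Learning (CCL) method to improve robustness.

Motivation: Current Medical Vision-Language Models (Med-VLMs) give inconsistent answers to semantically equivalent rephrasings of a question, which undermines reliable diagnosis. The paper targets this robustness problem.

Contribution: 1. Constructs the RoMed dataset (144k questions covering multiple levels of perturbation); 2. Proposes CCL (consistency learning plus bias-aware contrastive learning), which improves both performance and answer consistency.

Method: CCL combines knowledge-anchored consistency learning, which aligns the model with medical knowledge, and bias-aware contrastive learning, which mitigates data-specific bias, to refine the model's representations.

Result: CCL achieves SOTA performance on three VQA benchmarks and improves answer consistency by 50% on the RoMed test set.

Insight: 1. Robust medical VQA requires both knowledge alignment and mitigation of data bias; 2. A dataset with multi-level perturbations helps assess a model's real performance.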

Abstract: In high-stakes medical applications, consistent answering across diverse question phrasings is essential for reliable diagnosis. However, we reveal that current Medical Vision-Language Models (Med-VLMs) exhibit concerning fragility in Medical Visual Question Answering, as their answers fluctuate significantly when faced with semantically equivalent rephrasings of medical questions. We attribute this to two limitations: (1) insufficient alignment of medical concepts, leading to divergent reasoning patterns, and (2) hidden biases in training data that prioritize syntactic shortcuts over semantic understanding. To address these challenges, we construct RoMed, a dataset built upon original VQA datasets containing 144k questions with variations spanning word-level, sentence-level, and semantic-level perturbations. When evaluating state-of-the-art (SOTA) models like LLaVA-Med on RoMed, we observe alarming performance drops (e.g., a 40% decline in Recall) compared to original VQA benchmarks, exposing critical robustness gaps. To bridge this gap, we propose Consistency and Contrastive Learning (CCL), which integrates two key components: (1) knowledge-anchored consistency learning, aligning Med-VLMs with medical knowledge rather than shallow feature patterns, and (2) bias-aware contrastive learning, mitigating data-specific priors through discriminative representation refinement. CCL achieves SOTA performance on three popular VQA benchmarks and notably improves answer consistency by 50% on the challenging RoMed test set, demonstrating significantly enhanced robustness. Code will be released.

[10] Attention2Probability: Attention-Driven Terminology Probability Estimation for Robust Speech-to-Text System

Yanfan Du,Jun Zhang,Bin Wang,Jin Qiu,Lu Huang,Yuan Ge,Xiaoqian Liu,Tong Xiao,Jingbo Zhu

Main category: cs.CL

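TL;DR: Attention2Probability is a lightweight, flexible, and accurate attention-driven method for estimating terminology probability, used to improve terminology accuracy in speech-to-text systems.

Motivation: Current speech large language models (SLMs) perform well in general domains but still struggle with domain-specific terms and neologisms.

Contribution: Proposes Attention2Probability, which converts cross-attention weights between speech and terminology into presence probabilities and uses curriculum learning to improve retrieval accuracy; also releases a new speech dataset with terminology to support further research.

Method: Uses the attention mechanism to estimate the probability that a term is present, and applies curriculum learning to improve retrieval.

Result: The method significantly outperforms the VectorDB method on the test set, with maximum recall of 92.57% for Chinese and 86.83% for English at a latency of only 8.71 ms per query. Terminology intervention improves the terminology accuracy of SLMs by 6-17%.

Insight: Current SLMs make limited use of supplied terminology, and combining attention with curriculum learning can effectively improve terminology retrieval.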

Abstract: Recent advances in speech large language models (SLMs) have improved speech recognition and translation in general domains, but accurately generating domain-specific terms or neologisms remains challenging. To address this, we propose Attention2Probability: attention-driven terminology probability estimation for robust speech-to-text system, which is lightweight, flexible, and accurate. Attention2Probability converts cross-attention weights between speech and terminology into presence probabilities, and it further employs curriculum learning to enhance retrieval accuracy. Furthermore, to tackle the lack of data for speech-to-text tasks with terminology intervention, we create and release a new speech dataset with terminology to support future research in this area. Experimental results show that Attention2Probability significantly outperforms the VectorDB method on our test set. Specifically, its maximum recall rates reach 92.57% for Chinese and 86.83% for English. This high recall is achieved with a latency of only 8.71ms per query. Intervening in SLMs’ recognition and translation tasks using Attention2Probability-retrieved terms improves terminology accuracy by 6-17%, while revealing that the current utilization of terminology by SLMs has limitations.

[11] Filtering for Creativity: Adaptive Prompting for Multilingual Riddle Generation in LLMs

Duy Le,Kent Ziti,Evan Girard-Sun,Sean O’Brien,Vasu Sharma,Kevin Zhu

Main category: cs.CL

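TL;DR: The paper proposes Adaptive Originality Filtering (AOF), a prompting framework that improves multilingual riddle generation by filtering redundant generations during prompting while enforcing lexical novelty and cross-lingual fidelity.

Motivation: Existing large language models tend to rely on memorized riddles or shallow paraphrasing when generating multilingual riddles, and lack cultural fluency and creativity.

Contribution: Introduces the AOF framework, which uses cosine-similarity-based rejection and lexical novelty constraints to improve the diversity and cultural relevance of generated riddles.

Method: AOF combines cosine-similarity filtering with lexical novelty scoring, adjusting generations dynamically to avoid redundancy and preserve cross-lingual consistency.

Result: AOF-enhanced GPT-4o reaches a Self-BLEU of 0.177 and a Distinct-2 of 0.915 in Japanese, indicating reduced redundancy and improved lexical diversity.

Insight: Semantic rejection can steer a model toward more culturally grounded and creative output without task-specific fine-tuning.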

Abstract: Multilingual riddle generation challenges large language models (LLMs) to balance cultural fluency with creative abstraction. Standard prompting strategies – zero-shot, few-shot, chain-of-thought – tend to reuse memorized riddles or perform shallow paraphrasing. We introduce Adaptive Originality Filtering (AOF), a prompting framework that filters redundant generations using cosine-based similarity rejection, while enforcing lexical novelty and cross-lingual fidelity. Evaluated across three LLMs and four language pairs, AOF-enhanced GPT-4o achieves 0.177 Self-BLEU and 0.915 Distinct-2 in Japanese, signaling improved lexical diversity and reduced redundancy compared to other prompting methods and language pairs. Our findings show that semantic rejection can guide culturally grounded, creative generation without task-specific fine-tuning.
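The two ideas named in the abstract, cosine-based similarity rejection and the Distinct-2 diversity metric, can be sketched as follows. This is a minimal illustration under stated assumptions: texts are compared as whitespace-tokenized bag-of-words vectors (a real system would more likely use sentence embeddings), the 0.8 threshold is arbitrary, and all function names are hypothetical rather than taken from AOF.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between bag-of-words vectors of two texts."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0


def filter_originals(candidates, known, threshold=0.8):
    """Keep candidates that are not too similar to known or accepted texts."""
    accepted = []
    for cand in candidates:
        references = list(known) + accepted
        if all(cosine_similarity(cand, ref) < threshold for ref in references):
            accepted.append(cand)
    return accepted


def distinct_n(texts, n=2):
    """Distinct-n: unique n-grams divided by total n-grams."""
    ngrams = []
    for text in texts:
        tokens = text.lower().split()
        ngrams.extend(zip(*(tokens[i:] for i in range(n))))
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0


known = ["what has keys but cannot open locks"]
candidates = [
    "what has keys but cannot open locks",            # memorized, rejected
    "i speak without a mouth and hear without ears",  # novel, kept
]
kept = filter_originals(candidates, known)
print(kept, distinct_n(kept))
```

Rejected candidates would normally trigger a new generation attempt; that loop is omitted here to keep the sketch self-contained.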

[12] Beyond Quality: Unlocking Diversity in Ad Headline Generation with Large Language Models

Chang Wang,Siyu Yan,Depeng Yuan,Yuqi Chen,Yanhua Huang,Yuanhang Zheng,Shuhao Li,Yinqi Zhang,Kedi Chen,Mingrui Zhu,Ruiwen Xu

Main category: cs.CL

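TL;DR: The paper proposes DIVER, an LLM-based framework that uses multi-stage multi-objective optimization (SFT and RL) to address the lack of diversity alongside quality in ad headline generation. It performs well on real-world industrial datasets and improves advertiser value and click-through rate.

Motivation: Existing ad headline generation methods focus on quality or click-through rate and overlook diversity, producing homogeneous outputs that fail to engage varied audiences. The paper aims to address this limitation.

Contribution: 1. Proposes the DIVER framework with a semantic- and stylistic-aware data generation pipeline; 2. Designs a multi-stage multi-objective optimization method (SFT and RL) that generates high-quality and diverse headlines within a single forward pass.

Method: 1. Uses a semantic- and stylistic-aware data generation pipeline; 2. Jointly optimizes with supervised fine-tuning (SFT) and reinforcement learning (RL).

Result: On real-world industrial datasets, DIVER effectively balances quality and diversity, improving ADVV by 4.0% and CTR by 1.4%.

Insight: Diversity is a key factor in the effectiveness of ad headline generation, and combining multi-objective optimization with LLMs addresses it effectively.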

Abstract: The generation of ad headlines plays a vital role in modern advertising, where both quality and diversity are essential to engage a broad range of audience segments. Current approaches primarily optimize language models for headline quality or click-through rates (CTR), often overlooking the need for diversity and resulting in homogeneous outputs. To address this limitation, we propose DIVER, a novel framework based on large language models (LLMs) that are jointly optimized for both diversity and quality. We first design a semantic- and stylistic-aware data generation pipeline that automatically produces high-quality training pairs with ad content and multiple diverse headlines. To achieve the goal of generating high-quality and diversified ad headlines within a single forward pass, we propose a multi-stage multi-objective optimization framework with supervised fine-tuning (SFT) and reinforcement learning (RL). Experiments on real-world industrial datasets demonstrate that DIVER effectively balances quality and diversity. Deployed on a large-scale content-sharing platform serving hundreds of millions of users, our framework improves advertiser value (ADVV) and CTR by 4.0% and 1.4%, respectively.

[13] M3HG: Multimodal, Multi-scale, and Multi-type Node Heterogeneous Graph for Emotion Cause Triplet Extraction in Conversations

Qiao Liang,Ying Shen,Tiantian Chen,Lin Zhang

Main category: cs.CL

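TL;DR: The paper proposes M3HG, a model for emotion cause triplet extraction in multimodal conversations, and releases MECAD, the first multimodal, multi-scenario dataset for the task. Experiments show that M3HG outperforms existing methods.

Motivation: Datasets for the MECTEC task are scarce and uniform, and existing methods neither model emotional and causal contexts explicitly nor fuse semantic information at different levels effectively, which limits performance.

Contribution: 1. Releases MECAD, the first multimodal, multi-scenario MECTEC dataset; 2. Proposes M3HG, which explicitly models emotional and causal contexts and fuses semantic information at multiple levels through a multimodal heterogeneous graph.

Method: M3HG uses a multimodal heterogeneous graph to explicitly model emotional and causal contexts and fuses information at both the intra-utterance and inter-utterance levels.

Result: Experiments show that M3HG outperforms existing state-of-the-art methods.

Insight: 1. Multimodal, multi-scenario datasets are important for progress on the task; 2. Explicit modeling of emotional and causal contexts, together with multi-level information fusion, improves model performance.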

Abstract: Emotion Cause Triplet Extraction in Multimodal Conversations (MECTEC) has recently gained significant attention in social media analysis, aiming to extract emotion utterances, cause utterances, and emotion categories simultaneously. However, the scarcity of related datasets, with only one published dataset featuring highly uniform dialogue scenarios, hinders model development in this field. To address this, we introduce MECAD, the first multimodal, multi-scenario MECTEC dataset, comprising 989 conversations from 56 TV series spanning a wide range of dialogue contexts. In addition, existing MECTEC methods fail to explicitly model emotional and causal contexts and neglect the fusion of semantic information at different levels, leading to performance degradation. In this paper, we propose M3HG, a novel model that explicitly captures emotional and causal contexts and effectively fuses contextual information at both inter- and intra-utterance levels via a multimodal heterogeneous graph. Extensive experiments demonstrate the effectiveness of M3HG compared with existing state-of-the-art methods. The code and dataset are available at https://github.com/redifinition/M3HG.

[14] Chronological Passage Assembling in RAG framework for Temporal Question Answering

Byeongjeong Kim,Jeonghyun Park,Joonho Yang,Hwanhee Lee

Main category: cs.CL

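TL;DR: The paper proposes ChronoRAG, a RAG framework designed for narrative texts. It refines dispersed document information into coherent, structured passages and explicitly captures and maintains temporal order to improve question answering.

Motivation: Existing RAG methods are of limited use on narrative texts, because understanding a narrative requires broader context and coherent temporal order rather than isolated passages.

Contribution: Proposes the ChronoRAG framework, which improves narrative question answering through structured passages and explicitly maintained temporal order.

Method: A two-part strategy: 1) refine dispersed information into coherent passages; 2) explicitly capture and maintain temporal order among retrieved passages. The approach is evaluated on the NarrativeQA dataset.

Result: Experiments show substantial improvements on narrative question answering, particularly on questions that require understanding complex sequential relationships.

Insight: Reasoning over temporal order is crucial for narrative QA, and modeling that order explicitly improves performance.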

Abstract: Long-context question answering over narrative tasks is challenging because correct answers often hinge on reconstructing a coherent timeline of events while preserving contextual flow in a limited context window. Retrieval-augmented generation (RAG) indexing methods aim to address this challenge by selectively retrieving only necessary document segments. However, narrative texts possess unique characteristics that limit the effectiveness of these existing approaches. Specifically, understanding narrative texts requires more than isolated segments, as the broader context and sequential relationships between segments are crucial for comprehension. To address these limitations, we propose ChronoRAG, a novel RAG framework specialized for narrative texts. This approach focuses on two essential aspects: refining dispersed document information into coherent and structured passages, and preserving narrative flow by explicitly capturing and maintaining the temporal order among retrieved passages. We empirically demonstrate the effectiveness of ChronoRAG through experiments on the NarrativeQA dataset, showing substantial improvements in tasks requiring both factual identification and comprehension of complex sequential relationships, underscoring that reasoning over temporal order is crucial in resolving narrative QA.
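The second aspect described in the abstract, preserving temporal order among retrieved passages, can be illustrated with a small sketch. It assumes each passage records its position in the source narrative and a retrieval score; the `Passage` type and the function name are hypothetical and do not describe ChronoRAG's implementation.

```python
from dataclasses import dataclass


@dataclass
class Passage:
    position: int   # index of the passage in the original narrative
    score: float    # hypothetical retrieval relevance score
    text: str


def assemble_chronologically(passages, k=3):
    """Select the k most relevant passages, then restore narrative order."""
    top_k = sorted(passages, key=lambda p: p.score, reverse=True)[:k]
    return sorted(top_k, key=lambda p: p.position)


retrieved = [
    Passage(position=40, score=0.91, text="She finally opens the letter."),
    Passage(position=3, score=0.85, text="A letter arrives at the farm."),
    Passage(position=17, score=0.40, text="The harvest fails."),
    Passage(position=25, score=0.88, text="She hides the letter in a drawer."),
]
context = "\n".join(p.text for p in assemble_chronologically(retrieved))
print(context)
```

Ordering by relevance alone would place the ending first; sorting the selected passages by position gives the reader model the events in the order they occurred.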

[15] ThinkDial: An Open Recipe for Controlling Reasoning Effort in Large Language Models

Qianyu He,Siyu Yuan,Xuefeng Li,Mingxuan Wang,Jiangjie Chen

Main category: cs.CL

TL;DR: ThinkDial is an open-source framework that enables controllable reasoning effort in large language models (LLMs) through discrete operational modes (High, Medium, Low), balancing performance and computational cost.

Motivation: LLMs with chain-of-thought reasoning lack practical control over computational effort, hindering deployment. Proprietary systems offer such control, but open-source solutions lag behind.

Contribution: Introduces ThinkDial, the first open-recipe framework for gpt-oss-style reasoning control, achieving seamless switching between three reasoning modes with performance-computation trade-offs.

Method: Uses budget-mode supervised fine-tuning to embed reasoning control and two-phase budget-aware reinforcement learning with adaptive reward shaping.

Result: ThinkDial reduces tokens by 50% (Medium) and 75% (Low) with minimal performance drops (<10% and <15%, respectively) and generalizes well to out-of-distribution tasks.

Insight: Open-source frameworks can achieve proprietary-level reasoning control through integrated training paradigms, enabling practical deployment of LLMs with adaptive computational effort.

Abstract: Large language models (LLMs) with chain-of-thought reasoning have demonstrated remarkable problem-solving capabilities, but controlling their computational effort remains a significant challenge for practical deployment. Recent proprietary systems like OpenAI’s gpt-oss series have introduced discrete operational modes for intuitive reasoning control, but the open-source community has largely failed to achieve such capabilities. In this paper, we introduce ThinkDial, the first open-recipe end-to-end framework that successfully implements gpt-oss-style controllable reasoning through discrete operational modes. Our system enables seamless switching between three distinct reasoning regimes: High mode (full reasoning capability), Medium mode (50 percent token reduction with <10 percent performance degradation), and Low mode (75 percent token reduction with <15 percent performance degradation). We achieve this through an end-to-end training paradigm that integrates budget-mode control throughout the entire pipeline: budget-mode supervised fine-tuning that embeds controllable reasoning capabilities directly into the learning process, and two-phase budget-aware reinforcement learning with adaptive reward shaping. Extensive experiments demonstrate that ThinkDial achieves target compression-performance trade-offs with clear response length reductions while maintaining performance thresholds. The framework also exhibits strong generalization capabilities on out-of-distribution tasks.

[16] Harnessing Rule-Based Reinforcement Learning for Enhanced Grammatical Error Correction

Yilin Li,Xunjian Yin,Yilin Chen,Xiaojun Wan

Main category: cs.CL

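TL;DR: The paper proposes a new framework based on rule-based reinforcement learning that improves LLM performance on grammatical error correction, achieving state-of-the-art results on Chinese datasets compared with traditional methods.

Motivation: Traditional encoder-decoder approaches have had some success, but they do not exploit the reasoning ability of LLMs for grammatical error correction. Existing work mainly uses supervised fine-tuning to generate the corrected sentence directly, which limits what the model can do.

Contribution: Proposes a new Rule-Based RL framework that steers LLM behavior on grammatical error correction, improving performance and reliability.

Method: Applies rule-based reinforcement learning and validates the framework through experiments on Chinese datasets, with a notable gain in recall.

Result: The framework achieves state-of-the-art performance on Chinese grammatical error correction, with a notable increase in recall.

Insight: Using reinforcement learning to steer LLMs offers a more controllable and reliable paradigm for future work on grammatical error correction.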

Abstract: Grammatical error correction is a significant task in NLP. Traditional methods based on encoder-decoder models have achieved certain success, but the application of LLMs in this field is still underexplored. Current research predominantly relies on supervised fine-tuning to train LLMs to directly generate the corrected sentence, which limits the model’s powerful reasoning ability. To address this limitation, we propose a novel framework based on Rule-Based RL. Through experiments on Chinese datasets, our Rule-Based RL framework achieves state-of-the-art performance, with a notable increase in recall. This result clearly highlights the advantages of using RL to steer LLMs, offering a more controllable and reliable paradigm for future development in GEC.

[17] Arrows of Math Reasoning Data Synthesis for Large Language Models: Diversity, Complexity and Correctness

Sirui Chen,Changxin Tian,Binbin Hu,Kunlong Chen,Ziqi Liu,Zhiqiang Zhang,Jun Zhou

Main category: cs.CL

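TL;DR: The paper proposes a program-assisted synthesis framework that systematically generates high-quality mathematical reasoning data to improve the mathematical reasoning of large language models.

Motivation: Conventional methods for producing high-quality mathematical reasoning data face challenges in scalability, cost, and data reliability, so a more efficient and reliable solution is needed.

Contribution: Designs a program-assisted synthesis framework that uses mathematical knowledge systems and domain-specific tools to create executable programs and translates them into natural language problem-solution pairs, targeting diversity, complexity, and correctness.

Method: Integrates mathematical knowledge systems and domain-specific tools to create executable programs, and uses a bilateral validation mechanism to check solution correctness and program-problem consistency.

Result: The framework generates 12.3 million problem-solving triples. Models fine-tuned on this data achieve state-of-the-art performance on several benchmark datasets.

Insight: Programmatic generation combined with strict validation can produce large-scale, high-quality mathematical reasoning data efficiently and reliably, and improves model performance.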

Abstract: Enhancing the mathematical reasoning of large language models (LLMs) demands high-quality training data, yet conventional methods face critical challenges in scalability, cost, and data reliability. To address these limitations, we propose a novel program-assisted synthesis framework that systematically generates a high-quality mathematical corpus with guaranteed diversity, complexity, and correctness. This framework integrates mathematical knowledge systems and domain-specific tools to create executable programs. These programs are then translated into natural language problem-solution pairs and vetted by a bilateral validation mechanism that verifies solution correctness against program outputs and ensures program-problem consistency. We have generated 12.3 million such problem-solving triples. Experiments demonstrate that models fine-tuned on our data significantly improve their inference capabilities, achieving state-of-the-art performance on several benchmark datasets and showcasing the effectiveness of our synthesis approach.

[18] ConfTuner: Training Large Language Models to Express Their Confidence Verbally

Yibo Li,Miao Xiong,Jiaying Wu,Bryan Hooi

Main category: cs.CL

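TL;DR: ConfTuner is a simple, efficient fine-tuning method that uses a new loss function, the tokenized Brier score, to improve how large language models (LLMs) verbalize their confidence, reducing overconfidence and yielding better calibration on reasoning tasks.

Details Motivation: Deploying LLMs in high-stakes domains such as science, law, and healthcare requires accurate expressions of confidence for reliability and trust. Current LLMs are often overconfident, and existing approaches have limited effectiveness and generalizability, so a more effective method is needed.

Contribution: 1. Proposes ConfTuner, an efficient fine-tuning method; 2. Introduces the tokenized Brier score as the loss function and proves theoretically that it is a proper scoring rule; 3. Shows that ConfTuner improves calibration across different tasks and generalizes to black-box models such as GPT-4o.

Method: 1. Designs the ConfTuner fine-tuning framework; 2. Uses the tokenized Brier score as the loss, which incentivizes the model to report its true probability of being correct; 3. Requires no ground-truth confidence scores or proxy confidence estimates.

Result: ConfTuner improves LLM confidence calibration, and the better-calibrated confidence yields downstream gains in self-correction and model cascades; it also generalizes to black-box models.

Insight: Improving verbalized confidence through a theoretically grounded loss function is an effective route to more reliable and trustworthy models, and may advance trustworthy LLM systems in high-stakes domains.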

Abstract: Large Language Models (LLMs) are increasingly deployed in high-stakes domains such as science, law, and healthcare, where accurate expressions of uncertainty are essential for reliability and trust. However, current LLMs are often observed to generate incorrect answers with high confidence, a phenomenon known as “overconfidence”. Recent efforts have focused on calibrating LLMs’ verbalized confidence: i.e., their expressions of confidence in text form, such as “I am 80% confident that…”. Existing approaches either rely on prompt engineering or fine-tuning with heuristically generated uncertainty estimates, both of which have limited effectiveness and generalizability. Motivated by the notion of proper scoring rules for calibration in classical machine learning models, we introduce ConfTuner, a simple and efficient fine-tuning method that introduces minimal overhead and does not require ground-truth confidence scores or proxy confidence estimates. ConfTuner relies on a new loss function, tokenized Brier score, which we theoretically prove to be a proper scoring rule, intuitively meaning that it “correctly incentivizes the model to report its true probability of being correct”. ConfTuner improves calibration across diverse reasoning tasks and generalizes to black-box models such as GPT-4o. Our results further show that better-calibrated confidence enables downstream gains in self-correction and model cascade, advancing the development of trustworthy LLM systems. The code is available at https://github.com/liushiliushi/ConfTuner.
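
To make the notion of a proper scoring rule concrete, the following minimal sketch uses the classical binary Brier score, not the paper's tokenized variant, whose exact form is not given in the abstract. The function names are illustrative. If the model is correct with true probability p and reports confidence q, the expected Brier score is p(1 - q)^2 + (1 - p)q^2, which is minimized at q = p.

```python
def expected_brier(q: float, p: float) -> float:
    """Expected Brier score when reporting confidence q while the true
    probability of being correct is p (lower is better)."""
    return p * (1 - q) ** 2 + (1 - p) * q ** 2


def best_report(p: float, steps: int = 100) -> float:
    """Grid-search the reported confidence that minimizes the expected score."""
    grid = [i / steps for i in range(steps + 1)]
    return min(grid, key=lambda q: expected_brier(q, p))
```

Under this rule an overconfident report (for example q = 1.0 when p = 0.8) incurs a strictly higher expected loss than the honest report, which is the incentive the abstract describes.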

[19] ReflectivePrompt: Reflective evolution in autoprompting algorithms

Viktor N. Zhuravlev,Artur R. Khairullin,Ernest A. Dyagin,Alena N. Sitkina,Nikita I. Kulin

Main category: cs.CL

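TL;DR: ReflectivePrompt is an autoprompting method based on evolutionary algorithms that uses reflective evolution for a more precise and comprehensive prompt search, and reports average improvements over current state-of-the-art approaches.

Details Motivation: With the rapid development of prompt engineering, demand for automatically selecting optimized prompts has grown, while existing methods are limited in the precision and comprehensiveness of their prompt search.

Contribution: Proposes ReflectivePrompt, a method based on reflective evolution (short-term and long-term reflection operations) that improves the quality of prompt optimization.

Method: Applies short-term and long-term reflection operations before crossover and elitist mutation, accumulating knowledge throughout the evolution process and updating it at each epoch based on the current population.

Result: Tested on 33 classification and text generation datasets, the method improves metrics on average relative to current approaches (e.g., 28% on BBH compared to EvoPrompt).

Insight: Reflective evolution can capture and reuse knowledge gained during the evolutionary process, offering a new direction for automatic prompt optimization.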

Abstract: Autoprompting is the process of automatically selecting optimized prompts for language models, which has been gaining popularity with the rapid advancement of prompt engineering, driven by extensive research in the field of large language models (LLMs). This paper presents ReflectivePrompt, a novel autoprompting method based on evolutionary algorithms that employs a reflective evolution approach for a more precise and comprehensive search for optimal prompts. ReflectivePrompt utilizes short-term and long-term reflection operations before crossover and elitist mutation to enhance the quality of the modifications they introduce. This method allows for the accumulation of knowledge obtained throughout the evolution process and updates it at each epoch based on the current population. ReflectivePrompt was tested on 33 datasets for classification and text generation tasks using open-access large language models: t-lite-instruct-0.1 and gemma3-27b-it. The method demonstrates, on average, a significant improvement (e.g., 28% on BBH compared to EvoPrompt) in metrics relative to current state-of-the-art approaches, thereby establishing itself as one of the most effective solutions in evolutionary algorithm-based autoprompting.

[20] Empowering Computing Education Researchers Through LLM-Assisted Content Analysis

Laurie Gale,Sebastian Mateos Nicolajsen

Main category: cs.CL

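TL;DR: The paper proposes a variation of LLM-assisted content analysis (LACA) that helps education researchers analyze large volumes of textual data efficiently, supporting larger-scale and more rigorous computing education research (CER).

Details Motivation: Computing education researchers often lack the resources or capacity to conduct generalisable or rigorous research. The paper addresses this by proposing a method that scales up research without increasing the burden on the researcher.

Contribution: Proposes a variation of LLM-assisted content analysis (LACA) that combines content analysis with large language models, enabling researchers to handle larger volumes of textual data while keeping the analysis reproducible and rigorous.

Method: Integrates content analysis with LLMs and illustrates, on a computing education dataset, how LACA could be applied in a reproducible and rigorous manner.

Result: As a discussion paper, it argues that LACA has potential in CER, enabling more generalisable findings from a wider range of research.

Insight: LLMs can give education researchers an efficient tool for working around resource constraints, helping to advance both the practice and the research quality of the discipline.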

Abstract: Computing education research (CER) is often instigated by practitioners wanting to improve both their own and the wider discipline’s teaching practice. However, the latter is often difficult as many researchers lack the colleagues, resources, or capacity to conduct research that is generalisable or rigorous enough to advance the discipline. As a result, research methods that enable sense-making with larger volumes of qualitative data, while not increasing the burden on the researcher, have significant potential within CER. In this discussion paper, we propose such a method for conducting rigorous analysis on large volumes of textual data, namely a variation of LLM-assisted content analysis (LACA). This method combines content analysis with the use of large language models, empowering researchers to conduct larger-scale research which they would otherwise not be able to perform. Using a computing education dataset, we illustrate how LACA could be applied in a reproducible and rigorous manner. We believe this method has potential in CER, enabling more generalisable findings from a wider range of research. This, together with the development of similar methods, can help to advance both the practice and research quality of the CER discipline.

[21] Affective Polarization across European Parliaments

Bojan Evkoski,Igor Mozetič,Nikola Ljubešić,Petra Kralj Novak

Main category: cs.CL

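TL;DR: The study uses natural language processing to analyze speeches from the parliaments of six European countries, finds consistent affective polarization, and shows that reciprocity is one of its contributing mechanisms.

Details Motivation: Affective polarization, meaning increased negativity and hostility towards opposing groups, has become a prominent feature of political discourse worldwide. The authors examine, in a fully automated manner, whether it is present in European parliaments.

Contribution: Shows that affective polarization is present across six European parliaments, that it does not differ between less active and more active members, and that reciprocity is a contributing mechanism.

Method: Applies natural language processing to parliamentary speech corpora from six countries to estimate sentiment, and identifies affectively polarized interactions by comparing the negativity conveyed in references to opposing groups versus one's own.

Result: Members of all six parliaments exhibit affective polarization. Although activity correlates with negativity, polarization does not differ between less and more active members, and reciprocity contributes to polarization.

Insight: Affective polarization is widespread in European parliaments, and reciprocity plays a notable role in political antagonism, which offers a new angle on negativity in political discourse.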

Abstract: Affective polarization, characterized by increased negativity and hostility towards opposing groups, has become a prominent feature of political discourse worldwide. Our study examines the presence of this type of polarization in a selection of European parliaments in a fully automated manner. Utilizing a comprehensive corpus of parliamentary speeches from the parliaments of six European countries, we employ natural language processing techniques to estimate parliamentarian sentiment. By comparing the levels of negativity conveyed in references to individuals from opposing groups versus one’s own, we discover patterns of affectively polarized interactions. The findings demonstrate the existence of consistent affective polarization across all six European parliaments. Although activity correlates with negativity, there is no observed difference in affective polarization between less active and more active members of parliament. Finally, we show that reciprocity is a contributing mechanism in affective polarization between parliamentarians across all six parliaments.
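
The comparison at the core of the abstract, negativity towards opposing groups versus one's own, can be sketched as a simple gap between two averages. This is an illustrative assumption rather than the authors' measure: the tuple layout, the `polarization_gap` name, and the [0, 1] negativity scale are all hypothetical.

```python
from statistics import mean
from typing import Iterable, Tuple


def polarization_gap(mentions: Iterable[Tuple[str, str, float]]) -> float:
    """Mean negativity towards other groups minus mean negativity towards
    the speaker's own group.

    Each mention is (speaker_group, target_group, negativity), where
    negativity is a hypothetical score in [0, 1] from a sentiment model.
    A positive gap indicates more negativity directed at out-groups.
    """
    mentions = list(mentions)
    out_group = [n for s, t, n in mentions if s != t]
    in_group = [n for s, t, n in mentions if s == t]
    return mean(out_group) - mean(in_group)
```

For example, with out-group scores 0.8 and 0.6 and in-group scores 0.2 and 0.4, the gap is 0.7 - 0.3 = 0.4.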

[22] Interpretable by AI Mother Tongue: Native Symbolic Reasoning in Neural Models

Hung Ming Liu

Main category: cs.CL

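TL;DR: The paper proposes a framework in which neural models develop an "AI Mother Tongue", a native symbolic language that supports intuitive reasoning, compositional symbol chains, and inherent interpretability. Reasoning is embedded in the model's representations: symbols capture semantic patterns, chains trace decision paths, and gated induction mechanisms enable transparent yet flexible reasoning.

Details Motivation: Post-hoc explanation methods cannot provide inherent interpretability and symbolic reasoning in neural models, so the paper embeds symbolic reasoning directly into the model's representations to improve both transparency and flexibility.

Contribution: 1. Proposes the "AI Mother Tongue" framework, which embeds symbolic reasoning in neural model representations; 2. Designs training objectives that enhance symbol purity and decision sparsity; 3. Shows experimentally that the approach achieves competitive accuracy alongside verifiable reasoning traces.

Method: 1. Introduces a symbolic language as the model's native representation; 2. Uses gated induction mechanisms to guide selective focus; 3. Designs complementary training objectives for symbol purity and decision sparsity; 4. Adopts a sequential specialization strategy that first builds broad symbolic competence and then refines intuitive judgments.

Result: Experiments on AI tasks show competitive accuracy together with verifiable reasoning traces, suggesting the approach can serve as a unified mechanism for interpretability, intuition, and symbolic reasoning in neural models.

Insight: Neural models can obtain inherent interpretability through a native symbolic language without relying on post-hoc explanation. Symbolic and intuitive reasoning can work together in one framework, improving transparency and versatility.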

Abstract: We present a framework where neural models develop an AI Mother Tongue, a native symbolic language that simultaneously supports intuitive reasoning, compositional symbol chains, and inherent interpretability. Unlike post-hoc explanation methods, our approach embeds reasoning directly into the model’s representations: symbols capture meaningful semantic patterns, chains trace decision paths, and gated induction mechanisms guide selective focus, yielding transparent yet flexible reasoning. We introduce complementary training objectives to enhance symbol purity and decision sparsity, and employ a sequential specialization strategy to first build broad symbolic competence and then refine intuitive judgments. Experiments on AI tasks demonstrate competitive accuracy alongside verifiable reasoning traces, showing that AI Mother Tongue can serve as a unified mechanism for interpretability, intuition, and symbolic reasoning in neural models.

[23] MovieCORE: COgnitive REasoning in Movies

Gueter Josmy Faure,Min-Hung Chen,Jia-Fong Yeh,Ying Cheng,Hung-Ting Su,Yung-Hao Tang,Shang-Hong Lai,Winston H. Hsu

Main category: cs.CL

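TL;DR: The paper presents MovieCORE, a dataset focused on deeper cognitive understanding of movie content, generates high-quality question-answer pairs with multiple LLM agents, and introduces the ACE module to improve model reasoning.

Details Motivation: Existing video question answering datasets mostly target surface-level comprehension and do not assess deeper cognitive understanding of movies. MovieCORE aims to fill this gap and advance movie understanding in AI systems.

Contribution: 1. Introduces the MovieCORE dataset, which focuses on deeper cognitive questions; 2. Proposes an agentic brainstorming approach that uses multiple LLMs to generate question-answer pairs; 3. Introduces the ACE module to improve model reasoning.

Method: 1. Uses multiple LLM agents (agentic brainstorming) to generate and refine high-quality question-answer pairs; 2. Designs cognitive tests to assess dataset quality; 3. Proposes the ACE module to enhance the reasoning of existing video-language models.

Result: The cognitive tests support the quality of MovieCORE, and ACE improves model reasoning post-training by up to 25%, showing its potential for deeper movie understanding tasks.

Insight: 1. Multi-agent LLM brainstorming can produce high-quality questions efficiently; 2. Deeper cognitive questions expose the limitations of VQA models more effectively; 3. ACE offers a new way to improve model reasoning.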

Abstract: This paper introduces MovieCORE, a novel video question answering (VQA) dataset designed to probe deeper cognitive understanding of movie content. Unlike existing datasets that focus on surface-level comprehension, MovieCORE emphasizes questions that engage System-2 thinking while remaining specific to the video material. We present an innovative agentic brainstorming approach, utilizing multiple large language models (LLMs) as thought agents to generate and refine high-quality question-answer pairs. To evaluate dataset quality, we develop a set of cognitive tests assessing depth, thought-provocation potential, and syntactic complexity. We also propose a comprehensive evaluation scheme for assessing VQA model performance on deeper cognitive tasks. To address the limitations of existing video-language models (VLMs), we introduce an agentic enhancement module, Agentic Choice Enhancement (ACE), which improves model reasoning capabilities post-training by up to 25%. Our work contributes to advancing movie understanding in AI systems and provides valuable insights into the capabilities and limitations of current VQA models when faced with more challenging, nuanced questions about cinematic content. Our project page, dataset and code can be found at https://joslefaure.github.io/assets/html/moviecore.html.

[24] Do LVLMs Know What They Know? A Systematic Study of Knowledge Boundary Perception in LVLMs

Zhikai Ding,Shiyu Ni,Keping Bi

Main category: cs.CL

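TL;DR: The paper systematically studies how well large vision-language models (LVLMs) perceive their own knowledge boundaries by evaluating three confidence signals and proposing improvement methods. It finds that jointly processing visual and textual inputs lowers question-answering performance but improves the perception level.

Details Motivation: LVLMs perform well on visual question answering (VQA) but hallucinate. Understanding how they perceive their knowledge boundaries is key to making them more reliable.

Contribution: 1. Evaluates the reliability of three confidence signals in LVLMs; 2. Proposes three methods to improve perception; 3. Compares LVLMs with their LLM counterparts in performance and perception.

Method: Analyzes probabilistic confidence, answer consistency-based confidence, and verbalized confidence experimentally, adapts established calibration methods from LLMs, and compares LVLMs with LLMs when visual and textual inputs are processed jointly.

Result: LVLMs have a reasonable perception of their knowledge boundaries, with substantial room for improvement. Probabilistic and consistency-based signals are more reliable, while verbalized confidence tends towards overconfidence. Joint processing lowers performance but improves perception.

Insight: Jointly processing visual and textual inputs may benefit a model's perception of its knowledge boundaries, but the trade-off between performance and perception needs further work.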

Abstract: Large vision-language models (LVLMs) demonstrate strong visual question answering (VQA) capabilities but are shown to hallucinate. A reliable model should perceive its knowledge boundaries, knowing what it knows and what it does not. This paper investigates LVLMs’ perception of their knowledge boundaries by evaluating three types of confidence signals: probabilistic confidence, answer consistency-based confidence, and verbalized confidence. Experiments on three LVLMs across three VQA datasets show that, although LVLMs possess a reasonable perception level, there is substantial room for improvement. Among the three confidences, probabilistic and consistency-based signals are more reliable indicators, while verbalized confidence often leads to overconfidence. To enhance LVLMs’ perception, we adapt several established confidence calibration methods from Large Language Models (LLMs) and propose three effective methods. Additionally, we compare LVLMs with their LLM counterparts, finding that jointly processing visual and textual inputs decreases question-answering performance but reduces confidence, resulting in an improved perception level compared to LLMs.
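
Of the three signals, answer consistency-based confidence is the easiest to illustrate. The sketch below assumes a common formulation, the share of sampled answers that agree with the majority answer; the abstract does not state the exact definition used, and the function name and normalization are illustrative choices.

```python
from collections import Counter
from typing import List, Tuple


def consistency_confidence(sampled_answers: List[str]) -> Tuple[str, float]:
    """Return the majority answer and the fraction of samples that agree with it.

    Answers are lower-cased and stripped before comparison; real systems
    would likely need a more careful notion of answer equivalence.
    """
    normalized = [a.strip().lower() for a in sampled_answers]
    answer, count = Counter(normalized).most_common(1)[0]
    return answer, count / len(normalized)
```

Usage: sampling the same question four times and getting "Paris", "paris ", "Lyon", and "Paris" yields the answer "paris" with confidence 0.75.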

[25] Demystifying Scientific Problem-Solving in LLMs by Probing Knowledge and Reasoning

Alan Li,Yixin Liu,Arpan Sarkar,Doug Downey,Arman Cohan

Main category: cs.CL

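TL;DR: The paper introduces the scientific reasoning benchmarks SciReas and SciReas-Pro and the KRUX probing framework, and analyzes the roles of knowledge and reasoning in LLMs. It finds that retrieving relevant knowledge is a bottleneck, that external knowledge improves reasoning models, and that verbalized reasoning helps surface knowledge.

Details Motivation: Scientific problem solving requires deep domain knowledge and complex reasoning, yet there is no widely adopted holistic benchmark, and few approaches systematically separate the roles of knowledge and reasoning.

Contribution: 1. Introduces the SciReas and SciReas-Pro benchmarks; 2. Proposes the KRUX probing framework; 3. Identifies bottlenecks and improvement levers for LLMs in scientific reasoning; 4. Releases SciLit01, a strong 8B baseline.

Method: 1. Builds the benchmark suite; 2. Designs the KRUX probing framework to study knowledge and reasoning; 3. Runs experiments that combine the model's parametric knowledge with external knowledge supplied in context.

Result: Retrieving task-relevant knowledge from model parameters is a bottleneck, external in-context knowledge consistently helps reasoning models, and enhanced verbalized reasoning improves the ability to surface task-relevant knowledge.

Insight: Scientific reasoning tasks benefit from combining parametric and external knowledge, and verbalized reasoning is a key lever for improvement.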

Abstract: Scientific problem solving poses unique challenges for LLMs, requiring both deep domain knowledge and the ability to apply such knowledge through complex reasoning. While automated scientific reasoners hold great promise for assisting human scientists, there is currently no widely adopted holistic benchmark for evaluating scientific reasoning, and few approaches systematically disentangle the distinct roles of knowledge and reasoning in these tasks. To address these gaps, we introduce SciReas, a diverse suite of existing benchmarks for scientific reasoning tasks, and SciReas-Pro, a selective subset that requires more complex reasoning. Our holistic evaluation surfaces insights about scientific reasoning performance that remain hidden when relying on individual benchmarks alone. We then propose KRUX, a probing framework for studying the distinct roles of reasoning and knowledge in scientific tasks. Combining the two, we conduct an in-depth analysis that yields several key findings: (1) Retrieving task-relevant knowledge from model parameters is a critical bottleneck for LLMs in scientific reasoning; (2) Reasoning models consistently benefit from external knowledge added in-context on top of the reasoning enhancement; (3) Enhancing verbalized reasoning improves LLMs’ ability to surface task-relevant knowledge. Finally, we conduct a lightweight analysis, comparing our science-focused data composition with concurrent efforts on long CoT SFT, and release SciLit01, a strong 8B baseline for scientific reasoning.

cs.CV

[26] Towards Training-Free Underwater 3D Object Detection from Sonar Point Clouds: A Comparison of Traditional and Deep Learning Approaches

M. Salman Shaukat,Yannik Käckenmeister,Sebastian Bader,Thomas Kirste

Main category: cs.CV

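TL;DR: The paper studies underwater 3D object detection without real-world training data, comparing traditional template matching with deep learning trained on synthetic data, and finds that the traditional method performs better on real data.

Details Motivation: Underwater 3D object detection suffers from scarce training data and a harsh acoustic environment, and deep learning depends on large amounts of annotated data that are expensive to collect. The paper investigates detection methods that need no real training data.

Contribution: Develops and compares two training-free paradigms: neural networks trained on synthetic data from a physics-based simulation, and template matching based on geometric priors. It also establishes what the authors describe as the first large-scale benchmark for training-free underwater 3D detection.

Method: 1. Train neural networks on synthetic data from a physics-based sonar simulation; 2. Apply model-based template matching that uses geometric priors of the target objects. Both are evaluated on real bathymetry surveys from the Baltic Sea.

Result: The neural networks reach 98% mAP on simulated scenes but drop to 40% mAP on real sonar data. Template matching maintains 83% mAP on real data without any training and is more robust.

Insight: In underwater settings, traditional geometric methods can be more robust than deep learning that relies on synthetic data, which challenges conventional wisdom about deep learning in data-scarce domains.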

Abstract: Underwater 3D object detection remains one of the most challenging frontiers in computer vision, where traditional approaches struggle with the harsh acoustic environment and scarcity of training data. While deep learning has revolutionized terrestrial 3D detection, its application underwater faces a critical bottleneck: obtaining sufficient annotated sonar data is prohibitively expensive and logistically complex, often requiring specialized vessels, expert surveyors, and favorable weather conditions. This work addresses a fundamental question: Can we achieve reliable underwater 3D object detection without real-world training data? We tackle this challenge by developing and comparing two paradigms for training-free detection of artificial structures in multibeam echo-sounder point clouds. Our dual approach combines a physics-based sonar simulation pipeline that generates synthetic training data for state-of-the-art neural networks, with a robust model-based template matching system that leverages geometric priors of target objects. Evaluation on real bathymetry surveys from the Baltic Sea reveals surprising insights: while neural networks trained on synthetic data achieve 98% mean Average Precision (mAP) on simulated scenes, they drop to 40% mAP on real sonar data due to domain shift. Conversely, our template matching approach maintains 83% mAP on real data without requiring any training, demonstrating remarkable robustness to acoustic noise and environmental variations. Our findings challenge conventional wisdom about data-hungry deep learning in underwater domains and establish the first large-scale benchmark for training-free underwater 3D detection. This work opens new possibilities for autonomous underwater vehicle navigation, marine archaeology, and offshore infrastructure monitoring in data-scarce environments where traditional machine learning approaches fail.

[27] MobileDenseAttn: A Dual-Stream Architecture for Accurate and Interpretable Brain Tumor Detection

Shudipta Banik,Muna Das,Trapa Banik,Md. Ehsanul Haque

Main category: cs.CV

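TL;DR: The paper proposes MobileDenseAttn, a dual-stream architecture combining MobileNetV2 and DenseNet201 for accurate and interpretable brain tumor detection, improving feature representation, computing efficiency, and visual explanations.

Details Motivation: Existing brain tumor detection methods generalize poorly, are computationally inefficient, and lack interpretability, which undermines clinical trust. An efficient, accurate, and transparent model is therefore needed.

Contribution: 1. Proposes the MobileDenseAttn dual-stream fusion model; 2. Combines feature-level fusion with GradCAM for interpretability; 3. Achieves high accuracy (98.35% test accuracy) and efficient training on an augmented dataset.

Method: A dual-stream architecture that fuses MobileNetV2 and DenseNet201 at the feature level and generates heatmaps with GradCAM. It is trained on an augmented dataset of 6,020 MRI scans with 5-fold cross-validation.

Result: Training accuracy 99.75%, test accuracy 98.35%, F1 score 0.9835. Compared with VGG19, accuracy increases by 3.67% and training time decreases by 39.3%. The GradCAM heatmaps clearly localize tumor regions.

Insight: The dual-stream architecture combines the strengths of lightweight and high-accuracy feature extractors, and GradCAM improves interpretability, making the model better suited to clinical practice.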

Abstract: Detecting brain tumors in MRI is important for timely diagnosis and treatment; however, manual analysis is often slow and error-prone. Current approaches generalize poorly to heterogeneous tumors, are computationally inefficient, and lack interpretability and transparency, which limits their trustworthiness. To overcome these issues, we introduce MobileDenseAttn, a dual-stream fusion model of MobileNetV2 and DenseNet201 that can improve feature representation, computing efficiency, and visual explanations via GradCAM. Our model uses feature-level fusion and is trained on an augmented dataset of 6,020 MRI scans representing glioma, meningioma, pituitary tumors, and normal samples. Measured under strict 5-fold cross-validation protocols, MobileDenseAttn achieves a training accuracy of 99.75%, a testing accuracy of 98.35%, and a stable F1 score of 0.9835 (95% CI: 0.9743 to 0.9920). Extensive validation shows the stability of the model, and comparative analysis shows an improvement over the baseline models (VGG19, DenseNet201, MobileNetV2), with a +3.67% accuracy increase and a 39.3% decrease in training time compared to VGG19. The GradCAM heatmaps clearly show tumor-affected areas, offering clinically significant localization and improving interpretability. These findings position MobileDenseAttn as an efficient, high-performance, interpretable model with the potential to become a clinically practical tool for identifying brain tumors in real-world settings.

[28] Can VLMs Recall Factual Associations From Visual References?

Dhananjay Ashok,Ashutosh Chaubey,Hirona J. Arai,Jonathan May,Jesse Thomason

Main category: cs.CV

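TL;DR: Through a controlled study, the paper identifies a systematic deficiency in the multimodal grounding of vision language models (VLMs): their ability to recall factual associations from a visual reference is far lower than from a textual reference, and patterns in their internal states predict when responses are unreliable.

Details Motivation: The work explores the performance gap between visual and textual references in VLMs and the mechanism behind this grounding deficiency.

Contribution: 1) Shows that VLMs recall factual associations less well from visual references; 2) Builds high-accuracy probes on internal states to flag unreliable responses; 3) Uses the probes for selective prediction to improve task performance.

Method: Compares VLM performance under visual and textual references in controlled experiments, trains probes on internal-state patterns to detect unreliable responses, and integrates them into a selective prediction framework.

Result: Relying on visual references halves the ability of VLMs to recall factual knowledge. The probes flag unreliable responses with over 92% accuracy. In selective prediction, coverage increases by 7.87% (absolute) while the risk of error falls by 0.9% (absolute).

Insight: VLMs struggle to link their internal knowledge of an entity with its image representation, but internal-state patterns can be used to detect unreliable responses, which points to an interpretability-driven direction for multimodal grounding research.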

Abstract: Through a controlled study, we identify a systematic deficiency in the multimodal grounding of Vision Language Models (VLMs). While VLMs can recall factual associations when provided a textual reference to an entity, their ability to do so is significantly diminished when the reference is visual instead. Forcing VLMs to rely on image representations of an entity halves their ability to recall factual knowledge, suggesting that VLMs struggle to link their internal knowledge of an entity with its image representation. We show that such linking failures are correlated with the expression of distinct patterns in model internal states, and that probes on these internal states achieve over 92% accuracy at flagging cases where the VLM response is unreliable. These probes can be applied, without retraining, to identify when a VLM will fail to correctly answer a question that requires an understanding of multimodal input. When used to facilitate selective prediction on a visual question answering task, the probes increase coverage by 7.87% (absolute) while also reducing the risk of error by 0.9% (absolute). Addressing the systematic, detectable deficiency is an important avenue in language grounding, and we provide informed recommendations for future directions.
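
The coverage and risk figures refer to selective prediction, where the system abstains on inputs it flags as unreliable. A minimal sketch of how the two quantities are usually computed follows; the probe itself is not modeled, and the score list, threshold, and function name are assumptions made for illustration.

```python
from typing import List, Tuple


def selective_metrics(unreliable_scores: List[float],
                      correct: List[bool],
                      threshold: float) -> Tuple[float, float]:
    """Compute (coverage, risk) for a selective predictor.

    unreliable_scores: hypothetical probe outputs, higher means less reliable.
    correct: whether the model's answer was right for each question.
    The model answers only when the score is below the threshold.
    Coverage is the fraction of questions answered; risk is the error
    rate among the answered questions.
    """
    answered = [c for s, c in zip(unreliable_scores, correct) if s < threshold]
    coverage = len(answered) / len(correct)
    risk = 1 - sum(answered) / len(answered) if answered else 0.0
    return coverage, risk
```

A better probe shifts this trade-off: at a matched operating point it lets the model answer more questions (higher coverage) while making fewer mistakes on those it answers (lower risk).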

[29] SERES: Semantic-aware neural reconstruction from sparse views

Bo Xu,Yuhu Guo,Yuchao Wang,Wenting Wang,Yeung Yam,Charlie C. L. Wang,Xinyi Le

Main category: cs.CV

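TL;DR: The paper proposes SERES, a semantic-aware neural reconstruction method that generates high-fidelity 3D models from sparse images. Patch-based semantic logits and a regularization based on geometric primitive masks address the radiance ambiguity caused by sparse input and improve reconstruction accuracy.

Details Motivation: Mismatched features and radiance ambiguity in sparse input severely degrade 3D reconstruction accuracy, and existing methods struggle to recover high-quality 3D models from sparse views.

Contribution: 1. Proposes a semantic-aware neural reconstruction method that enriches neural implicit representations with patch-based semantic logits optimized together with the signed distance field and the radiance field; 2. Introduces a regularization based on geometric primitive masks to mitigate shape ambiguity.

Method: 1. Jointly optimizes semantic logits with the signed distance field and the radiance field; 2. Uses geometric primitive masks as a regularizer; 3. Evaluates on the DTU dataset.

Result: On DTU, the average chamfer distance of sparse reconstruction is reduced by 44% relative to SparseNeuS and by 20% relative to VolRecon. Used as a plugin for dense reconstruction baselines such as NeuS and Neuralangelo, the method reduces average error by 69% and 68% respectively.

Insight: Combining semantic information with geometric regularization improves 3D reconstruction quality from sparse views, which is particularly useful when input is limited in practice.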

Abstract: We propose a semantic-aware neural reconstruction method to generate 3D high-fidelity models from sparse images. To tackle the challenge of severe radiance ambiguity caused by mismatched features in sparse input, we enrich neural implicit representations by adding patch-based semantic logits that are optimized together with the signed distance field and the radiance field. A novel regularization based on the geometric primitive masks is introduced to mitigate shape ambiguity. The performance of our approach has been verified in experimental evaluation. The average chamfer distances of our reconstruction on the DTU dataset can be reduced by 44% for SparseNeuS and 20% for VolRecon. When working as a plugin for those dense reconstruction baselines such as NeuS and Neuralangelo, the average error on the DTU dataset can be reduced by 69% and 68% respectively.
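
For readers unfamiliar with the metric, the sketch below computes a symmetric chamfer distance between two small point sets in plain Python. It assumes one common convention, the average of the two directional mean nearest-neighbor distances; conventions differ (squared distances, sums instead of means), and the DTU evaluation protocol used in the paper may not match this exactly.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def chamfer_distance(points_a: Sequence[Point], points_b: Sequence[Point]) -> float:
    """Symmetric chamfer distance between two 3D point sets.

    Brute-force O(len(a) * len(b)); real evaluations use spatial indexes.
    """
    def one_way(src: Sequence[Point], dst: Sequence[Point]) -> float:
        # Mean distance from each source point to its nearest destination point.
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

    return 0.5 * (one_way(points_a, points_b) + one_way(points_b, points_a))
```

Lower values mean the reconstructed surface lies closer to the reference scan, so the reported 44% and 20% reductions correspond to closer agreement with ground truth.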

[30] Structures Meet Semantics: Multimodal Fusion via Graph Contrastive Learning

Jiangfeng Sun,Sihao He,Zhonghong Ou,Meina Song

Main category: cs.CV

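TL;DR: The paper proposes SSU, a multimodal sentiment analysis framework that uses graph contrastive learning to integrate modality-specific structural dependencies with semantic alignment, improving both performance and interpretability.

Details Motivation: Existing multimodal fusion methods often neglect modality-specific structural dependencies and semantic misalignment, which limits their performance and interpretability. The paper addresses this by unifying structure and semantics.

Contribution: Proposes the SSU framework, which dynamically builds modality-specific graphs, introduces a semantic anchor, and uses multiview contrastive learning to improve the discriminability and consistency of representations.

Method: Builds the text graph from linguistic syntax, handles the acoustic and visual modalities with a lightweight text-guided attention mechanism, and aligns modalities through the semantic anchor.

Result: Achieves state-of-the-art performance on two benchmark datasets (CMU-MOSI and CMU-MOSEI) while significantly reducing computational overhead.

Insight: Unifying structure and semantics in multimodal fusion matters for both performance and interpretability, and semantic anchors combined with contrastive learning help cross-modal alignment.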

Abstract: Multimodal sentiment analysis (MSA) aims to infer emotional states by effectively integrating textual, acoustic, and visual modalities. Despite notable progress, existing multimodal fusion methods often neglect modality-specific structural dependencies and semantic misalignment, limiting their quality, interpretability, and robustness. To address these challenges, we propose a novel framework called the Structural-Semantic Unifier (SSU), which systematically integrates modality-specific structural information and cross-modal semantic grounding for enhanced multimodal representations. Specifically, SSU dynamically constructs modality-specific graphs by leveraging linguistic syntax for text and a lightweight, text-guided attention mechanism for acoustic and visual modalities, thus capturing detailed intra-modal relationships and semantic interactions. We further introduce a semantic anchor, derived from global textual semantics, that serves as a cross-modal alignment hub, effectively harmonizing heterogeneous semantic spaces across modalities. Additionally, we develop a multiview contrastive learning objective that promotes discriminability, semantic consistency, and structural coherence across intra- and inter-modal views. Extensive evaluations on two widely used benchmark datasets, CMU-MOSI and CMU-MOSEI, demonstrate that SSU consistently achieves state-of-the-art performance while significantly reducing computational overhead compared to prior methods. Comprehensive qualitative analyses further validate SSU’s interpretability and its ability to capture nuanced emotional patterns through semantically grounded interactions.

[31] FastAvatar: Instant 3D Gaussian Splatting for Faces from Single Unconstrained Poses

Hao Liang,Zhixuan Ge,Ashish Tiwari,Soumendu Majee,G. M. Dilshan Godaliyadda,Ashok Veeraraghavan,Guha Balakrishnan

Main category: cs.CV

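TL;DR: FastAvatar is a fast, feed-forward framework that generates a 3D Gaussian Splatting model from a single face image in an arbitrary pose in near-instant time, combining high reconstruction quality with fast inference.

Details Motivation: Existing 3D Gaussian Splatting (3DGS) approaches for faces need multi-view data or time-consuming optimization, which makes real-time interaction difficult. FastAvatar targets fast, high-quality single-view reconstruction.

Contribution: 1. Proposes FastAvatar, a framework that quickly generates a 3DGS model from a single image in an arbitrary pose; 2. Designs an encoder-decoder architecture that achieves fast fitting by predicting residuals to a template Gaussian model; 3. Supports real-time identity interpolation and attribute editing.

Method: 1. Builds a 3DGS face template from multi-view captures; 2. Encodes the input image into an identity-specific, pose-invariant latent embedding; 3. Decodes the embedding to predict residuals to the structural and appearance parameters of each Gaussian in the template.

Result: FastAvatar significantly outperforms existing feed-forward face 3DGS methods (e.g., GAGAvatar) in reconstruction quality, runs 1000x faster than per-face optimization methods (e.g., FlashAvatar, GaussianAvatars, and GASP), and supports real-time editing.

Insight: Residual prediction on top of a template balances efficiency and quality, which makes 3DGS more practical for interactive applications.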

Abstract: We present FastAvatar, a pose-invariant, feed-forward framework that can generate a 3D Gaussian Splatting (3DGS) model from a single face image from an arbitrary pose in near-instant time (<10ms). FastAvatar uses a novel encoder-decoder neural network design to achieve both fast fitting and identity preservation regardless of input pose. First, FastAvatar constructs a 3DGS face “template” model from a training dataset of faces with multi-view captures. Second, FastAvatar encodes the input face image into an identity-specific and pose-invariant latent embedding, and decodes this embedding to predict residuals to the structural and appearance parameters of each Gaussian in the template 3DGS model. By only inferring residuals in a feed-forward fashion, model inference is fast and robust. FastAvatar significantly outperforms existing feed-forward face 3DGS methods (e.g., GAGAvatar) in reconstruction quality, and runs 1000x faster than per-face optimization methods (e.g., FlashAvatar, GaussianAvatars and GASP). In addition, FastAvatar’s novel latent space design supports real-time identity interpolation and attribute editing which is not possible with any existing feed-forward 3DGS face generation framework. FastAvatar’s combination of excellent reconstruction quality and speed expands the scope of 3DGS for photorealistic avatar applications in consumer and interactive systems.

[32] Securing Face and Fingerprint Templates in Humanitarian Biometric Systems

Giuseppe Stragapede,Sam Merrick,Vedrana Krivokuća Hahn,Justin Sukaitis,Vincent Graf Narbel

Main category: cs.CV

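TL;DR: The paper presents a mobile biometric system with a biometric template protection (BTP) scheme suited to humanitarian scenarios, identifies PolyProtect as the most suitable method, and evaluates it on real-world face data and on fingerprints.

Details Motivation: In humanitarian and emergency settings, biometrics improve operational efficiency but create risks for data subjects, especially in contexts of vulnerability. A lightweight and secure template protection method is therefore needed.

Contribution: 1. Identifies PolyProtect, a method designed for neural network face embeddings, as the most suitable BTP method for these contexts; 2. Evaluates PolyProtect for the identification scenario and for fingerprint biometrics, which the authors believe is a first; 3. Uses the efficient EdgeFace feature extractor.

Method: Selects PolyProtect through a broad comparative analysis of BTP methods, then evaluates verification and identification accuracy, irreversibility, and unlinkability on a real-world face dataset and on fingerprint data.

Result: The authors report promising experimental results for both faces and fingerprints, consistent with the method's effectiveness and lightweight computational burden.

Insight: PolyProtect's modality independence makes it a candidate for protecting several biometric modalities, particularly in resource-constrained humanitarian settings.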

Abstract: In humanitarian and emergency scenarios, the use of biometrics can dramatically improve the efficiency of operations, but it poses risks for the data subjects, which are exacerbated in contexts of vulnerability. To address this, we present a mobile biometric system implementing a biometric template protection (BTP) scheme suitable for these scenarios. After rigorously formulating the functional, operational, and security and privacy requirements of these contexts, we perform a broad comparative analysis of the BTP landscape. PolyProtect, a method designed to operate on neural network face embeddings, is identified as the most suitable method due to its effectiveness, modularity, and lightweight computational burden. We evaluate PolyProtect in terms of verification and identification accuracy, irreversibility, and unlinkability, when this BTP method is applied to face embeddings extracted using EdgeFace, a novel state-of-the-art efficient feature extractor, on a real-world face dataset from a humanitarian field project in Ethiopia. Moreover, as PolyProtect promises to be modality-independent, we extend its evaluation to fingerprints. To the best of our knowledge, this is the first time that PolyProtect has been evaluated for the identification scenario and for fingerprint biometrics. Our experimental results are promising, and we plan to release our code.

[33] Why Relational Graphs Will Save the Next Generation of Vision Foundation Models?

Fatemeh Ziaeetabar

Main category: cs.CV

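TL;DR: The paper argues that next-generation vision foundation models (FMs) should add explicit relational reasoning through dynamic relational graphs to improve performance, robustness, and efficiency on fine-grained tasks.

Details Motivation: Current vision foundation models fall short on tasks that require explicit reasoning over entities, roles, and spatio-temporal relations, a capability that is essential for fine-grained tasks such as human activity recognition and multimodal medical image analysis.

Contribution: Takes the position that dynamic relational graphs should be integrated into vision foundation models to strengthen relational reasoning, and illustrates it with cross-domain evidence from recent systems.

Method: Augments FMs with lightweight, context-adaptive graph-reasoning modules that build input- and task-dependent relational graphs and reason sparsely over semantic nodes.

Result: The cited evidence shows that such hybrids improve fine-grained semantic fidelity, out-of-distribution robustness, interpretability, and computational efficiency relative to FM-only baselines, with favorable memory and hardware efficiency.

Insight: Combining relational graphs with FMs offers a direction for next-generation vision models, especially in learned dynamic graph construction, multi-level relational reasoning, and cross-modal fusion.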

Abstract: Vision foundation models (FMs) have become the predominant architecture in computer vision, providing highly transferable representations learned from large-scale, multimodal corpora. Nonetheless, they exhibit persistent limitations on tasks that require explicit reasoning over entities, roles, and spatio-temporal relations. Such relational competence is indispensable for fine-grained human activity recognition, egocentric video understanding, and multimodal medical image analysis, where spatial, temporal, and semantic dependencies are decisive for performance. We advance the position that next-generation FMs should incorporate explicit relational interfaces, instantiated as dynamic relational graphs (graphs whose topology and edge semantics are inferred from the input and task context). We illustrate this position with cross-domain evidence from recent systems in human manipulation action recognition and brain tumor segmentation, showing that augmenting FMs with lightweight, context-adaptive graph-reasoning modules improves fine-grained semantic fidelity, out-of-distribution robustness, interpretability, and computational efficiency relative to FM-only baselines. Importantly, by reasoning sparsely over semantic nodes, such hybrids also achieve favorable memory and hardware efficiency, enabling deployment under practical resource constraints. We conclude with a targeted research agenda for FM graph hybrids, prioritizing learned dynamic graph construction, multi-level relational reasoning (e.g., part object scene in activity understanding, or region organ in medical imaging), cross-modal fusion, and evaluation protocols that directly probe relational competence in structured vision tasks.

[34] LPLC: A Dataset for License Plate Legibility Classification

Lucas Wojcik,Gabriel E. Lima,Valfride Nascimento,Eduil Nascimento Jr.,Rayson Laroca,David Menotti

Main category: cs.CV

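TL;DR: The paper introduces LPLC, a dataset for license plate legibility classification intended to improve how ALPR systems handle low-quality plates, and shows through experiments that the task is challenging.

Details Motivation: Automatic License Plate Recognition (ALPR) struggles with low-quality plates, and existing approaches such as super-resolution have not fully solved the problem. A selective pre-processing mechanism is needed to optimize model efficiency and performance.

Contribution: Introduces the LPLC dataset, with 10,210 vehicle images and 12,687 annotated license plates covering a range of scenes and legibility categories, as a benchmark for research.

Method: Uses a fine-grained annotation strategy covering occlusions, four legibility categories, and character labels, and benchmarks the classification task with ViT, ResNet, and YOLO baselines.

Result: All baseline models score below 80% F1, indicating that the task is challenging and needs further research.

Insight: License plate legibility classification is an important step in ALPR; existing methods still need improvement, and the LPLC dataset opens a new direction for research.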

Abstract: Automatic License Plate Recognition (ALPR) faces a major challenge when dealing with illegible license plates (LPs). While reconstruction methods such as super-resolution (SR) have emerged, the core issue of recognizing these low-quality LPs remains unresolved. To optimize model performance and computational efficiency, image pre-processing should be applied selectively to cases that require enhanced legibility. To support research in this area, we introduce a novel dataset comprising 10,210 images of vehicles with 12,687 annotated LPs for legibility classification (the LPLC dataset). The images span a wide range of vehicle types, lighting conditions, and camera/image quality levels. We adopt a fine-grained annotation strategy that includes vehicle- and LP-level occlusions, four legibility categories (perfect, good, poor, and illegible), and character labels for three categories (excluding illegible LPs). As a benchmark, we propose a classification task using three image recognition networks to determine whether an LP image is good enough, requires super-resolution, or is completely unrecoverable. The overall F1 score, which remained below 80% for all three baseline models (ViT, ResNet, and YOLO), together with the analyses of SR and LP recognition methods, highlights the difficulty of the task and reinforces the need for further research. The proposed dataset is publicly available at https://github.com/lmlwojcik/lplc-dataset.

[35] CLARIFY: A Specialist-Generalist Framework for Accurate and Lightweight Dermatological Visual Question Answering

Aranya Saha,Tanvir Ahmed Khan,Ismam Nur Swapnil,Mohammad Ariful Haque

Main category: cs.CV

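TL;DR: CLARIFY is a Specialist-Generalist framework for dermatological visual question answering. By combining a lightweight image classifier with a compressed conversational vision-language model, it substantially improves diagnostic accuracy and computational efficiency.

Details Motivation: General-purpose vision-language models show great potential for medical tasks, but their generality can limit specialized diagnostic accuracy, and their large size makes inference expensive. CLARIFY aims to address both issues with an efficient solution for dermatological VQA.

Contribution: Proposes the CLARIFY framework, which combines the strengths of a Specialist (image classifier) and a Generalist (compressed vision-language model) and uses a knowledge-graph module to ground the natural-language explanations, improving both diagnostic accuracy and computational efficiency.

Method: The Specialist provides fast, accurate diagnostic predictions; the Generalist generates natural-language explanations, with its reasoning guided by the Specialist. A knowledge-graph module improves factuality and reduces errors.

Result: On a dermatology dataset, CLARIFY improves diagnostic accuracy by 18% over the strongest baseline while reducing average VRAM requirements by at least 20% and latency by at least 5%.

Insight: Through its hierarchical design, the Specialist-Generalist framework balances specialized diagnostic accuracy against computational efficiency, offering a lightweight and trustworthy approach for medical AI systems.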

Abstract: Vision-language models (VLMs) have shown significant potential for medical tasks; however, their general-purpose nature can limit specialized diagnostic accuracy, and their large size poses substantial inference costs for real-world clinical deployment. To address these challenges, we introduce CLARIFY, a Specialist-Generalist framework for dermatological visual question answering (VQA). CLARIFY combines two components: (i) a lightweight, domain-trained image classifier (the Specialist) that provides fast and highly accurate diagnostic predictions, and (ii) a powerful yet compressed conversational VLM (the Generalist) that generates natural language explanations to user queries. In our framework, the Specialist’s predictions directly guide the Generalist’s reasoning, focusing it on the correct diagnostic path. This synergy is further enhanced by a knowledge graph-based retrieval module, which grounds the Generalist’s responses in factual dermatological knowledge, ensuring both accuracy and reliability. This hierarchical design not only reduces diagnostic errors but also significantly improves computational efficiency. Experiments on our curated multimodal dermatology dataset demonstrate that CLARIFY achieves an 18% improvement in diagnostic accuracy over the strongest baseline, a fine-tuned, uncompressed single-line VLM, while reducing the average VRAM requirement and latency by at least 20% and 5%, respectively. These results indicate that a Specialist-Generalist system provides a practical and powerful paradigm for building lightweight, trustworthy, and clinically viable AI systems.

[36] VQualA 2025 Challenge on Face Image Quality Assessment: Methods and Results

Sizhuo Ma,Wei-Ting Chen,Qiang Gao,Jian Wang,Chris Wei Zhou,Wei Sun,Weixia Zhang,Linhan Cao,Jun Jia,Xiangyang Zhu,Dandan Zhu,Xiongkuo Min,Guangtao Zhai,Baoying Chen,Xiongwei Xiao,Jishen Zeng,Wei Wu,Tiexuan Lou,Yuchen Tan,Chunyi Song,Zhiwei Xu,MohammadAli Hamidi,Hadi Amirpour,Mingyin Bai,Jiawang Du,Zhenyu Jiang,Zilong Lu,Ziguan Cui,Zongliang Gan,Xinpeng Li,Shiqi Jiang,Chenhui Li,Changbo Wang,Weijun Yuan,Zhan Li,Yihang Chen,Yifan Deng,Ruting Deng,Zhanglu Chen,Boyang Yao,Shuling Zheng,Feng Zhang,Zhiheng Fu,Abhishek Joshi,Aman Agarwal,Rakhil Immidisetti,Ajay Narasimha Mopidevi,Vishwajeet Shukla,Hao Yang,Ruikun Zhang,Liyuan Pan,Kaixin Deng,Hang Ouyang,Fan yang,Zhizun Luo,Zhuohang Shi,Songning Lai,Weilin Ruan,Yutao Yue

Main category: cs.CV

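TL;DR: The VQualA 2025 Challenge focuses on face image quality assessment (FIQA). Participants developed lightweight, efficient models (limited to 0.5 GFLOPs and 5 million parameters) to predict Mean Opinion Scores (MOS) for face images with realistic degradations; the methods and results are reported as part of the ICCV 2025 Workshops.

Details Motivation: Face images are crucial in many applications, but under real-world conditions their quality is often degraded by noise, blur, and compression artifacts, which hurts downstream tasks. The VQualA 2025 Challenge aims to advance practical FIQA methods.

Contribution: The challenge attracted 127 participants and 1519 final submissions, and advanced the development of lightweight FIQA models by capping compute and parameter counts for practicality.

Method: Participants designed lightweight models (within the 0.5 GFLOPs and 5 million parameter limits); submissions were evaluated with correlation metrics on a dataset of in-the-wild face images with realistic degradations.

Result: The challenge showcased the performance of a variety of efficient FIQA methods, providing a technical reference for practical applications.

Insight: Lightweight, efficient design is key to making FIQA practical; the challenge provides data and a benchmarking framework for future research.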

Abstract: Face images play a crucial role in numerous applications; however, real-world conditions frequently introduce degradations such as noise, blur, and compression artifacts, affecting overall image quality and hindering subsequent tasks. To address this challenge, we organized the VQualA 2025 Challenge on Face Image Quality Assessment (FIQA) as part of the ICCV 2025 Workshops. Participants created lightweight and efficient models (limited to 0.5 GFLOPs and 5 million parameters) for the prediction of Mean Opinion Scores (MOS) on face images with arbitrary resolutions and realistic degradations. Submissions underwent comprehensive evaluations through correlation metrics on a dataset of in-the-wild face images. This challenge attracted 127 participants, with 1519 final submissions. This report summarizes the methodologies and findings for advancing the development of practical FIQA approaches.
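
The abstract does not say which correlation metrics were used. As a rough sketch, assuming the two measures common in quality assessment, Pearson linear correlation (PLCC) and Spearman rank correlation (SROCC), predicted scores could be compared with MOS labels as follows. The function names and any sample scores are illustrative and do not come from the challenge.

```python
import math


def pearson(x, y):
    """Pearson linear correlation coefficient (PLCC) between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation (SROCC): Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))
```

SROCC rewards a model for ordering images correctly even when its scores are not linearly related to MOS, whereas PLCC also penalizes a nonlinear scale.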

[37] Context-Aware Zero-Shot Anomaly Detection in Surveillance Using Contrastive and Predictive Spatiotemporal Modeling

Md. Rashid Shahriar Khan,Md. Abrar Hasan,Mohammod Tareq Aziz Justice

Main category: cs.CV

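TL;DR: The paper proposes a context-aware zero-shot anomaly detection framework that combines temporal modeling with semantic understanding to detect abnormal behavior in complex surveillance scenes without having seen any anomaly examples.

Details Motivation: Anomalies in surveillance video are context-dependent and unpredictable, and labeled data is scarce. The study addresses this with a zero-shot approach.

Contribution: Proposes a hybrid architecture combining TimeSformer, DPC, and CLIP that achieves zero-shot anomaly detection by modeling spatiotemporal dynamics and semantic context.

Method: TimeSformer extracts spatiotemporal features, DPC forecasts future representations to identify temporal deviations, and CLIP provides semantic-level anomaly detection. The components are trained jointly with InfoNCE and CPC losses.

Result: Without seeing any anomaly examples, the framework generalizes to new behaviors in complex environments and outperforms traditional methods.

Insight: Combining temporal prediction with semantic context can substantially improve zero-shot anomaly detection, offering a new approach to anomaly detection in complex scenes.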

Abstract: Detecting anomalies in surveillance footage is inherently challenging due to their unpredictable and context-dependent nature. This work introduces a novel context-aware zero-shot anomaly detection framework that identifies abnormal events without exposure to anomaly examples during training. The proposed hybrid architecture combines TimeSformer, DPC, and CLIP to model spatiotemporal dynamics and semantic context. TimeSformer serves as the vision backbone to extract rich spatial-temporal features, while DPC forecasts future representations to identify temporal deviations. Furthermore, a CLIP-based semantic stream enables concept-level anomaly detection through context-specific text prompts. These components are jointly trained using InfoNCE and CPC losses, aligning visual inputs with their temporal and semantic representations. A context-gating mechanism further enhances decision-making by modulating predictions with scene-aware cues or global video features. By integrating predictive modeling with vision-language understanding, the system can generalize to previously unseen behaviors in complex environments. This framework bridges the gap between temporal reasoning and semantic context in zero-shot anomaly detection for surveillance. The code for this research has been made available at https://github.com/NK-II/Context-Aware-ZeroShot-Anomaly-Detection-in-Surveillance.
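
The InfoNCE objective named in the abstract can be sketched for a single query as follows. This is a generic formulation with cosine similarity and a temperature, not the authors' training code; the function name, the temperature value, and the vector representation are assumptions.

```python
import math


def info_nce(query, keys, positive_index, temperature=0.1):
    """InfoNCE loss for one query: negative log-softmax of the positive key's similarity."""

    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    def normalize(v):
        norm = math.sqrt(dot(v, v))
        return [x / norm for x in v]

    q = normalize(query)
    # Cosine similarity to every key, scaled by the temperature.
    logits = [dot(q, normalize(k)) / temperature for k in keys]
    # Numerically stable log-sum-exp over all keys (positive and negatives).
    max_logit = max(logits)
    log_denominator = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    return log_denominator - logits[positive_index]
```

The loss is small when the query is closest to its positive key and grows when a negative key is more similar, which is what pushes a visual input toward its own temporal or semantic representation.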

[38] DoGFlow: Self-Supervised LiDAR Scene Flow via Cross-Modal Doppler Guidance

Ajinkya Khoche,Qingwen Zhang,Yixi Cai,Sina Sharif Mansouri,Patric Jensfelt

Main category: cs.CV

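TL;DR: DoGFlow is a self-supervised LiDAR scene flow estimation method based on cross-modal Doppler guidance. It requires no manual annotation and substantially outperforms current self-supervised methods.

Details Motivation: Accurate 3D scene flow estimation is critical for autonomous driving, but manual annotation is costly, and existing self-supervised methods underperform at long range and in adverse weather.

Contribution: DoGFlow generates motion pseudo-labels from 4D radar Doppler measurements and transfers them across modalities to the LiDAR domain, enabling scene flow estimation without manual annotation.

Method: Uses dynamic-aware association and ambiguity-resolved propagation to transfer radar Doppler pseudo-labels to LiDAR data, then trains the model in a self-supervised manner.

Result: On the MAN TruckScenes dataset, LiDAR backbones reach over 90% of fully supervised performance with only 10% of the ground-truth data.

Insight: Cross-modal transfer between radar and LiDAR can effectively relieve the annotation bottleneck in self-supervised learning.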

Abstract: Accurate 3D scene flow estimation is critical for autonomous systems to navigate dynamic environments safely, but creating the necessary large-scale, manually annotated datasets remains a significant bottleneck for developing robust perception models. Current self-supervised methods struggle to match the performance of fully supervised approaches, especially in challenging long-range and adverse weather scenarios, while supervised methods are not scalable due to their reliance on expensive human labeling. We introduce DoGFlow, a novel self-supervised framework that recovers full 3D object motions for LiDAR scene flow estimation without requiring any manual ground truth annotations. This paper presents our cross-modal label transfer approach, where DoGFlow computes motion pseudo-labels in real-time directly from 4D radar Doppler measurements and transfers them to the LiDAR domain using dynamic-aware association and ambiguity-resolved propagation. On the challenging MAN TruckScenes dataset, DoGFlow substantially outperforms existing self-supervised methods and improves label efficiency by enabling LiDAR backbones to achieve over 90% of fully supervised performance with only 10% of the ground truth data. For more details, please visit https://ajinkyakhoche.github.io/DogFlow/

[39] Wan-S2V: Audio-Driven Cinematic Video Generation

Xin Gao,Li Hu,Siqi Hu,Mingyang Huang,Chaonan Ji,Dechao Meng,Jinwei Qi,Penchong Qiao,Zhen Shen,Yafei Song,Ke Sun,Linrui Tian,Guangyuan Wang,Qi Wang,Zhongjian Wang,Jiayu Xiao,Sheng Xu,Bang Zhang,Peng Zhang,Xindi Zhang,Zhe Zhang,Jingren Zhou,Lian Zhuo

Main category: cs.CV

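TL;DR: The paper proposes Wan-S2V, an audio-driven video generation model that targets the demands of complex film and television production, namely nuanced character interactions, realistic body movements, and dynamic camera work, and substantially improves expressiveness and fidelity.

Details Motivation: Existing audio-driven character animation methods perform well in scenarios involving speech and singing but fall short in complex film and television production. The paper addresses this long-standing challenge.

Contribution: Proposes the Wan-S2V model, which significantly improves the expressiveness and fidelity of film-level character animation and demonstrates versatility in long-form video generation and precise lip-sync editing.

Method: An audio-driven model built upon Wan, benchmarked experimentally against cutting-edge models such as Hunyuan-Avatar and Omnihuman.

Result: Experiments show that Wan-S2V significantly outperforms existing methods at film-level animation generation.

Insight: The model offers a new solution for complex film and television production, with notable gains in expressiveness and versatility.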

Abstract: Current state-of-the-art (SOTA) methods for audio-driven character animation demonstrate promising performance for scenarios primarily involving speech and singing. However, they often fall short in more complex film and television productions, which demand sophisticated elements such as nuanced character interactions, realistic body movements, and dynamic camera work. To address this long-standing challenge of achieving film-level character animation, we propose an audio-driven model, which we refer to as Wan-S2V, built upon Wan. Our model achieves significantly enhanced expressiveness and fidelity in cinematic contexts compared to existing approaches. We conducted extensive experiments, benchmarking our method against cutting-edge models such as Hunyuan-Avatar and Omnihuman. The experimental results consistently demonstrate that our approach significantly outperforms these existing solutions. Additionally, we explore the versatility of our method through its applications in long-form video generation and precise video lip-sync editing.

[40] Decouple, Reorganize, and Fuse: A Multimodal Framework for Cancer Survival Prediction

Huayi Wang,Haochao Ying,Yuyang Xu,Qibo Qiu,Cheng Zhang,Danny Z. Chen,Ying Sun,Jian Wu

Main category: cs.CV

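TL;DR: The paper proposes DeReF, a multimodal framework for cancer survival prediction that addresses the rigidity of fixed fusion schemes and the limited information interaction in MoE-based fusion. A feature reorganization strategy and a dynamic MoE fusion module increase the diversity of feature combinations and improve information interaction.

Details Motivation: Existing multimodal cancer survival prediction methods suffer from fixed fusion schemes and insufficient information interaction in MoE-based fusion, which limits dynamic feature fusion and the ability to capture information.

Contribution: Proposes the DeReF framework, comprising a random feature reorganization strategy and a dynamic MoE fusion module that increase feature-combination diversity and information interaction, and introduces a regional cross-attention network to improve the quality of feature representations.

Method: Adopts a Decoupling-Reorganization-Fusion (DeReF) framework that combines random feature reorganization with a dynamic MoE fusion module and embeds a regional cross-attention network in the modality decoupling module.

Result: Experiments on an in-house liver cancer dataset and three public TCGA datasets confirm the method's effectiveness.

Insight: Combining dynamic feature reorganization with MoE fusion improves generalization and information interaction on multimodal data, and regional cross-attention further improves feature representations.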

Abstract: Cancer survival analysis commonly integrates information across diverse medical modalities to make survival-time predictions. Existing methods primarily focus on extracting different decoupled features of modalities and performing fusion operations such as concatenation, attention, and MoE-based (Mixture-of-Experts) fusion. However, these methods still face two key challenges: i) fixed fusion schemes (concatenation and attention) can lead to model over-reliance on predefined feature combinations, limiting the dynamic fusion of decoupled features; ii) in MoE-based fusion methods, each expert network handles separate decoupled features, which limits information interaction among the decoupled features. To address these challenges, we propose a novel Decoupling-Reorganization-Fusion framework (DeReF), which devises a random feature reorganization strategy between the modality decoupling and dynamic MoE fusion modules. Its advantages are: i) it increases the diversity of feature combinations and granularity, enhancing the generalization ability of the subsequent expert networks; ii) it overcomes the problem of information closure and helps expert networks better capture information among decoupled features. Additionally, we incorporate a regional cross-attention network within the modality decoupling module to improve the representation quality of decoupled features. Extensive experimental results on our in-house Liver Cancer (LC) and three widely used TCGA public datasets confirm the effectiveness of our proposed method. The code will be made publicly available.

[41] ROSE: Remove Objects with Side Effects in Videos

Chenxuan Miao,Yutong Feng,Jianshu Zeng,Zixiang Gao,Hantang Liu,Yunfeng Yan,Donglian Qi,Xi Chen,Bin Wang,Hengshuang Zhao

Main category: cs.CV

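TL;DR: The paper presents ROSE, a framework for removing objects from videos together with their side effects (such as shadows and reflections), trained on synthetic data generated by 3D rendering. ROSE is built on a diffusion transformer, takes the whole video as reference to localize side-effect regions, and adds extra supervision to improve results. Experiments show strong performance on the ROSE-Bench benchmark.

Details Motivation: Existing video object removal methods handle object side effects (such as shadows and reflections) poorly, mainly because paired video data is scarce. The paper generates synthetic data with 3D rendering and designs a systematic framework to address the problem.

Contribution: 1. Proposes the ROSE framework and systematically studies five common side effects of objects; 2. Generates a large-scale synthetic dataset with 3D rendering; 3. Designs a model based on a diffusion transformer with additional supervision on side-effect regions; 4. Builds the ROSE-Bench benchmark for comprehensive evaluation.

Method: 1. Generates synthetic data with a 3D rendering engine; 2. Implements a video inpainting model on a diffusion transformer; 3. Feeds the entire video into the model for reference-based erasing; 4. Explicitly supervises side-effect regions using a differential mask.

Result: ROSE outperforms existing methods on ROSE-Bench and generalizes to real-world videos.

Insight: Synthetic data from 3D rendering offers a way around data scarcity in the video domain, and explicit supervision of side-effect regions effectively improves inpainting quality.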

Abstract: Video object removal has achieved advanced performance due to the recent success of video generative models. However, when addressing the side effects of objects, e.g., their shadows and reflections, existing works struggle to eliminate these effects because of the scarcity of paired video data as supervision. This paper presents ROSE, termed Remove Objects with Side Effects, a framework that systematically studies the object’s effects on the environment, which can be categorized into five common cases: shadows, reflections, light, translucency and mirror. Given the challenges of curating paired videos exhibiting the aforementioned effects, we leverage a 3D rendering engine for synthetic data generation. We carefully construct a fully-automatic pipeline for data preparation, which simulates a large-scale paired dataset with diverse scenes, objects, shooting angles, and camera trajectories. ROSE is implemented as a video inpainting model built on a diffusion transformer. To localize all object-correlated areas, the entire video is fed into the model for reference-based erasing. Moreover, additional supervision is introduced to explicitly predict the areas affected by side effects, which can be revealed through the differential mask between the paired videos. To fully investigate the model performance on removing various side effects, we present a new benchmark, dubbed ROSE-Bench, incorporating both common scenarios and the five special side effects for comprehensive evaluation. Experimental results demonstrate that ROSE achieves superior performance compared to existing video object erasing models and generalizes well to real-world video scenarios. The project page is https://rose2025-inpaint.github.io/.
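
The differential mask mentioned in the abstract can be illustrated with a minimal sketch: given a rendered frame with the object and its paired frame without it, pixels whose values differ mark the object together with its side effects. The grayscale list-of-lists representation, the threshold, and the function names are assumptions; the abstract does not specify how the mask is actually computed.

```python
def differential_mask(frame_with, frame_without, threshold=0.05):
    """Binary mask (1 = pixel changed) for one pair of grayscale frames in [0, 1]."""
    return [
        [1 if abs(a - b) > threshold else 0 for a, b in zip(row_with, row_without)]
        for row_with, row_without in zip(frame_with, frame_without)
    ]


def video_differential_masks(video_with, video_without, threshold=0.05):
    """Apply the per-frame mask to every aligned frame pair of two paired videos."""
    return [
        differential_mask(f_with, f_without, threshold)
        for f_with, f_without in zip(video_with, video_without)
    ]
```

Because the paired videos come from the same rendered scene, any region that changes when the object is removed (a shadow, a reflection) shows up in the mask without manual labeling.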

[42] OwlCap: Harmonizing Motion-Detail for Video Captioning via HMD-270K and Caption Set Equivalence Reward

Chunlin Zhong,Qiuxia Hou,Zhangjun Zhou,Shuang Hao,Haonan Lu,Yanhao Zhang,He Tang,Xiang Bai

Main category: cs.CV

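TL;DR: OwlCap addresses the motion-detail imbalance in video captioning with the HMD-270K dataset and the CSER reward, yielding significant performance gains.

Details Motivation: Existing video captioning methods suffer from motion-detail imbalance, producing captions that are incomplete and inconsistent.

Contribution: 1) Builds the HMD-270K dataset of 270K samples; 2) Proposes the Caption Set Equivalence Reward (CSER), a reward based on Group Relative Policy Optimization (GRPO), to optimize the model.

Method: 1) Constructs HMD-270K through a two-stage pipeline; 2) Trains OwlCap with supervised fine-tuning on HMD-270K followed by GRPO post-training with CSER.

Result: OwlCap improves over baseline models by +4.2 Acc on the detail-focused VDC benchmark and +4.6 F1 on the motion-focused DREAM-1K benchmark.

Insight: Balancing the capture of motion and detail is key to better video captions; improvements to both the dataset and the reward significantly improve model performance.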

Abstract: Video captioning aims to generate comprehensive and coherent descriptions of the video content, contributing to the advancement of both video understanding and generation. However, existing methods often suffer from motion-detail imbalance, as models tend to overemphasize one aspect while neglecting the other. This imbalance results in incomplete captions, which in turn leads to a lack of consistency in video understanding and generation. To address this issue, we propose solutions from two aspects: 1) Data aspect: We constructed the Harmonizing Motion-Detail 270K (HMD-270K) dataset through a two-stage pipeline: Motion-Detail Fusion (MDF) and Fine-Grained Examination (FGE). 2) Optimization aspect: We introduce the Caption Set Equivalence Reward (CSER) based on Group Relative Policy Optimization (GRPO). CSER enhances completeness and accuracy in capturing both motion and details through unit-to-set matching and bidirectional validation. Based on the HMD-270K supervised fine-tuning and GRPO post-training with CSER, we developed OwlCap, a powerful video captioning multi-modal large language model (MLLM) with motion-detail balance. Experimental results demonstrate that OwlCap achieves significant improvements compared to baseline models on two benchmarks: the detail-focused VDC (+4.2 Acc) and the motion-focused DREAM-1K (+4.6 F1). The HMD-270K dataset and OwlCap model will be publicly released to facilitate video captioning research community advancements.
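
GRPO scores each sampled output relative to the other outputs in its group. A minimal sketch of that normalization follows, assuming one scalar reward per sampled caption (any values passed in are placeholders, not CSER scores) and the population standard deviation; implementations differ on the exact normalization, and the function name is illustrative.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption, see lead-in
    return [(r - mean) / (std + eps) for r in rewards]
```

Captions that beat the group average get a positive advantage and the rest a negative one, so no separate value network is needed to provide a baseline.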

[43] Hierarchical Spatio-temporal Segmentation Network for Ejection Fraction Estimation in Echocardiography Videos

Dongfang Wang,Jian Yang,Yizhe Zhang,Tao Zhou

Main category: cs.CV

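TL;DR: The paper proposes a Hierarchical Spatio-temporal Segmentation Network (HSSN) for echocardiography videos that improves ejection fraction (EF) estimation accuracy by combining local detail modeling with global dynamic perception.

Details Motivation: Existing work segments the left ventricular endocardium well in echocardiography videos but performs poorly on EF estimation. A method that captures both local detail and global dynamics is needed to improve EF accuracy.

Contribution: 1. Proposes a hierarchically designed spatio-temporal segmentation network (HSSN) that balances single-frame and multi-frame processing; 2. Designs the Spatio-temporal Cross Scan (STCS) module to integrate long-range context; 3. Mitigates noise-induced bias in EF calculation.

Method: The network is hierarchical: low-level stages use convolutional networks on single frames to preserve detail, while high-level stages use the Mamba architecture to capture spatio-temporal relationships. The STCS module integrates context through skip scanning across frames and positions.

Result: HSSN outperforms existing methods on EF estimation and reduces bias caused by ultrasound image noise and other factors.

Insight: Combining the hierarchical design with the STCS module balances local-detail and global-dynamics modeling, yielding more accurate EF estimates.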

Abstract: Automated segmentation of the left ventricular endocardium in echocardiography videos is a key research area in cardiology. It aims to provide accurate assessment of cardiac structure and function through Ejection Fraction (EF) estimation. Although existing studies have achieved good segmentation performance, their results do not perform well in EF estimation. In this paper, we propose a Hierarchical Spatio-temporal Segmentation Network (HSSN) for echocardiography video, aiming to improve EF estimation accuracy by synergizing local detail modeling with global dynamic perception. The network employs a hierarchical design, with low-level stages using convolutional networks to process single-frame images and preserve details, while high-level stages utilize the Mamba architecture to capture spatio-temporal relationships. The hierarchical design balances single-frame and multi-frame processing, avoiding issues such as local error accumulation when relying solely on single frames or neglecting details when using only multi-frame data. To overcome local spatio-temporal limitations, we propose the Spatio-temporal Cross Scan (STCS) module, which integrates long-range context through skip scanning across frames and positions. This approach helps mitigate EF calculation biases caused by ultrasound image noise and other factors.
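
For reference, ejection fraction is conventionally computed from the left ventricular end-diastolic volume (EDV) and end-systolic volume (ESV) as EF = (EDV - ESV) / EDV × 100. The sketch below applies that definition only; how the volumes are derived from the segmentation masks is not described in the abstract, so they are taken as given inputs, and the function name is illustrative.

```python
def ejection_fraction(edv_ml, esv_ml):
    """Ejection fraction in percent from end-diastolic and end-systolic volumes (mL)."""
    if edv_ml <= 0:
        raise ValueError("end-diastolic volume must be positive")
    return (edv_ml - esv_ml) / edv_ml * 100.0
```

Since EF depends on only two frames' volumes, a small segmentation error at end-diastole or end-systole shifts the estimate directly, which is one way to read the gap between good segmentation scores and poor EF estimates noted in the abstract.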

[44] Feature-Space Planes Searcher: A Universal Domain Adaptation Framework for Interpretability and Computational Efficiency

Zhitong Cheng,Yiran Jiang,Yulong Ge,Yufeng Li,Zhongheng Qin,Rongzhi Lin,Jianwei Ma

Main category: cs.CV

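TL;DR: The paper proposes FPS, a domain adaptation framework that exploits geometric patterns in the feature space of pretrained models to optimize decision boundaries while keeping the feature encoder frozen, enabling efficient and interpretable domain adaptation.

Details Motivation: Current unsupervised domain adaptation methods usually fine-tune the feature extractor, which is inefficient, hard to interpret, and scales poorly to modern architectures. The paper finds that pretrained models exhibit domain-invariant geometric patterns in feature space, which suggests a new way to optimize decision boundaries.

Contribution: The main contribution is the FPS framework, which optimizes decision boundaries using feature-space geometric patterns, avoids the drawbacks of fine-tuning the feature encoder, substantially reduces compute and memory costs, and achieves competitive or better adaptation performance.

Method: FPS keeps the pretrained feature encoder frozen and optimizes only the decision boundaries. By exploiting geometric patterns in feature space (such as intra-class clustering and inter-class separation) together with offline feature extraction, it can optimize over the full dataset in a single computation cycle.

Result: On public benchmarks, FPS matches or outperforms state-of-the-art methods, and it shows versatility across domains including protein structure prediction, remote sensing classification, and earthquake detection.

Insight: The domain-invariant geometric patterns in feature space indicate that domain shift manifests mainly as boundary misalignment rather than feature degradation, which offers a simpler view of domain adaptation.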

Abstract: Domain shift, characterized by degraded model performance during transition from labeled source domains to unlabeled target domains, poses a persistent challenge for deploying deep learning systems. Current unsupervised domain adaptation (UDA) methods predominantly rely on fine-tuning feature extractors - an approach limited by inefficiency, reduced interpretability, and poor scalability to modern architectures. Our analysis reveals that models pretrained on large-scale data exhibit domain-invariant geometric patterns in their feature space, characterized by intra-class clustering and inter-class separation, thereby preserving transferable discriminative structures. These findings indicate that domain shifts primarily manifest as boundary misalignment rather than feature degradation. Unlike fine-tuning entire pre-trained models - which risks introducing unpredictable feature distortions - we propose the Feature-space Planes Searcher (FPS): a novel domain adaptation framework that optimizes decision boundaries by leveraging these geometric patterns while keeping the feature encoder frozen. This streamlined approach enables interpretative analysis of adaptation while substantially reducing memory and computational costs through offline feature extraction, permitting full-dataset optimization in a single computation cycle. Evaluations on public benchmarks demonstrate that FPS achieves competitive or superior performance to state-of-the-art methods. FPS scales efficiently with multimodal large models and shows versatility across diverse domains including protein structure prediction, remote sensing classification, and earthquake detection. We anticipate FPS will provide a simple, effective, and generalizable paradigm for transfer learning, particularly in domain adaptation tasks.

[45] A Novel Deep Hybrid Framework with Ensemble-Based Feature Optimization for Robust Real-Time Human Activity Recognition

Wasi Ullah,Yasir Noman Khalid,Saddam Hussain Khan

Main category: cs.CV

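TL;DR: The paper proposes a deep hybrid framework for robust real-time human activity recognition (HAR) using ensemble-based feature optimization, combining a customized InceptionV3, an LSTM, and a genetic algorithm with Adaptive Dynamic Fitness Sharing and Attention (ADFSA).

Details Motivation: HAR is widely used in smart surveillance, healthcare, and other areas, but existing systems suffer from high computational cost, redundant features, and poor real-time performance, so a lightweight and efficient solution is needed.

Contribution: Proposes a hybrid deep learning framework that combines spatial feature extraction, temporal dependency modeling, and an optimized feature selection strategy, significantly improving HAR accuracy and efficiency.

Method: 1) Extracts spatial features with a customized InceptionV3; 2) Models temporal dynamics with an LSTM; 3) Optimizes feature selection with an ensemble-based genetic algorithm using ADFSA.

Result: Achieves 99.65% recognition accuracy on the UCF-YouTube dataset, reduces the feature set to as few as 7 features, and improves inference time.

Insight: The lightweight design and feature optimization make the framework suitable for edge devices, advancing HAR in real-time settings.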

Abstract: Human Activity Recognition (HAR) plays a pivotal role in various applications, including smart surveillance, healthcare, assistive technologies, sports analytics, etc. However, HAR systems still face critical challenges, including high computational costs, redundant features, and limited scalability in real-time scenarios. An optimized hybrid deep learning framework is introduced that integrates a customized InceptionV3, an LSTM architecture, and a novel ensemble-based feature selection strategy. The proposed framework first extracts spatial descriptors using the customized InceptionV3 model, which captures multilevel contextual patterns, region homogeneity, and fine-grained localization cues. The temporal dependencies across frames are then modeled using LSTMs to effectively encode motion dynamics. Finally, an ensemble-based genetic algorithm with Adaptive Dynamic Fitness Sharing and Attention (ADFSA) is employed to select a compact and optimized feature set by dynamically balancing objectives such as accuracy, redundancy, uniqueness, and complexity reduction. Consequently, the selected feature subsets, which are both diverse and discriminative, enable various lightweight machine learning classifiers to achieve accurate and robust HAR in heterogeneous environments. Experimental results on the robust UCF-YouTube dataset, which presents challenges such as occlusion, cluttered backgrounds, motion dynamics, and poor illumination, demonstrate good performance. The proposed approach achieves 99.65% recognition accuracy, reduces features to as few as 7, and enhances inference time. The lightweight and scalable nature of the HAR system supports real-time deployment on edge devices such as Raspberry Pi, enabling practical applications in intelligent, resource-aware environments, including public safety, assistive technology, and autonomous monitoring systems.

[46] ColorGS: High-fidelity Surgical Scene Reconstruction with Colored Gaussian Splatting

Qun Ji,Peng Li,Mingqiang Wei

Main category: cs.CV

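TL;DR: ColorGS is a framework that achieves high-fidelity surgical scene reconstruction using Colored Gaussian Primitives and an Enhanced Deformation Model, substantially improving color expressiveness and deformation modeling.

Details Motivation: Existing methods are limited in capturing subtle color variations and modeling global deformations in endoscopic videos; 3D Gaussian Splatting reconstructs efficiently but is weak at color and deformation modeling.

Contribution: Proposes Colored Gaussian Primitives and an Enhanced Deformation Model (EDM), which markedly improve color expressiveness and deformation modeling and enable high-fidelity surgical scene reconstruction.

Method: Combines dynamic anchors with learnable color parameters for adaptive color encoding, and captures local and global deformations using time-aware Gaussian basis functions together with learnable time-independent deformations.

Result: On DaVinci surgery videos and benchmark datasets, achieves a PSNR of 39.85 and an SSIM of 97.25% while maintaining real-time rendering efficiency.

Insight: Balancing high fidelity with computational practicality is key for surgical scene reconstruction and matters for intraoperative guidance and AR/VR applications.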

Abstract: High-fidelity reconstruction of deformable tissues from endoscopic videos remains challenging due to the limitations of existing methods in capturing subtle color variations and modeling global deformations. While 3D Gaussian Splatting (3DGS) enables efficient dynamic reconstruction, its fixed per-Gaussian color assignment struggles with intricate textures, and linear deformation modeling fails to model consistent global deformation. To address these issues, we propose ColorGS, a novel framework that integrates spatially adaptive color encoding and enhanced deformation modeling for surgical scene reconstruction. First, we introduce Colored Gaussian Primitives, which employ dynamic anchors with learnable color parameters to adaptively encode spatially varying textures, significantly improving color expressiveness under complex lighting and tissue similarity. Second, we design an Enhanced Deformation Model (EDM) that combines time-aware Gaussian basis functions with learnable time-independent deformations, enabling precise capture of both localized tissue deformations and global motion consistency caused by surgical interactions. Extensive experiments on DaVinci robotic surgery videos and benchmark datasets (EndoNeRF, StereoMIS) demonstrate that ColorGS achieves state-of-the-art performance, attaining a PSNR of 39.85 (1.5 higher than prior 3DGS-based methods) and superior SSIM (97.25%) while maintaining real-time rendering efficiency. Our work advances surgical scene reconstruction by balancing high fidelity with computational practicality, critical for intraoperative guidance and AR/VR applications.
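
PSNR, the headline metric in this abstract, is defined as 10·log10(MAX² / MSE). A minimal sketch on flattened pixel lists, assuming intensities in [0, 1]; the function name is illustrative and the code is not taken from the paper's evaluation pipeline.

```python
import math


def psnr(reference, reconstruction, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    mse = sum((a - b) ** 2 for a, b in zip(reference, reconstruction)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Because the scale is logarithmic, the reported gain of 1.5 dB corresponds to roughly a 29% reduction in mean squared error (10^(-0.15) ≈ 0.71).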

[47] Improving Noise Robust Audio-Visual Speech Recognition via Router-Gated Cross-Modal Feature Fusion

DongHoon Lim,YoungChae Kim,Dong-Hyun Kim,Da-Hee Yang,Joon-Hyuk Chang

Main category: cs.CV

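TL;DR: The paper proposes router-gated cross-modal feature fusion to make audio-visual speech recognition more robust to noise, dynamically adjusting modality weights as audio quality changes.

Details Motivation: Existing audio-visual speech recognition systems struggle to estimate audio reliability and adjust modality reliance dynamically in noisy environments, which degrades performance. The paper proposes a new framework to address this.

Contribution: The main contribution is a router-gated cross-modal feature fusion mechanism that dynamically reweights audio and visual features according to the audio noise level.

Method: An audio-visual feature fusion-based router down-weights unreliable audio tokens and reinforces visual cues through gated cross-attention in each decoder layer.

Result: On LRS3, the method achieves a 16.51-42.67% relative reduction in word error rate compared to AV-HuBERT.

Insight: Router gating and dynamic reweighting are key to the robustness of multimodal systems under noise; reinforcing the visual modality helps most when audio quality degrades.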

Abstract: Robust audio-visual speech recognition (AVSR) in noisy environments remains challenging, as existing systems struggle to estimate audio reliability and dynamically adjust modality reliance. We propose router-gated cross-modal feature fusion, a novel AVSR framework that adaptively reweights audio and visual features based on token-level acoustic corruption scores. Using an audio-visual feature fusion-based router, our method down-weights unreliable audio tokens and reinforces visual cues through gated cross-attention in each decoder layer. This enables the model to pivot toward the visual modality when audio quality deteriorates. Experiments on LRS3 demonstrate that our approach achieves a 16.51-42.67% relative reduction in word error rate compared to AV-HuBERT. Ablation studies confirm that both the router and gating mechanism contribute to improved robustness under real-world acoustic noise.
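
The reported figure is a relative reduction, that is, (baseline WER - new WER) / baseline WER, rather than a difference in percentage points. A small sketch of the arithmetic; the function name is illustrative and no WER values from the paper are assumed.

```python
def relative_wer_reduction(baseline_wer, new_wer):
    """Relative word-error-rate reduction in percent, measured against the baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer
```

For example, going from a hypothetical 10.0% WER to 8.0% is a 2-point absolute drop but a 20% relative reduction, so relative figures look larger when the baseline error is already low.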

[48] Rethinking Human-Object Interaction Evaluation for both Vision-Language Models and HOI-Specific Methods

Qinqian Lei,Bo Wang,Robby T. Tan

Main category: cs.CV

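TL;DR: The paper introduces a new benchmark for directly comparing general-purpose vision-language models (VLMs) with specialized human-object interaction (HOI) detection methods, resolving the mismatch between existing benchmarks and generative VLMs.

Details Motivation: Existing HOI benchmarks such as HICO-DET were not designed with the generative nature of modern VLMs in mind; their exact class-matching evaluation can unfairly penalize plausible but unannotated predictions from both VLMs and HOI methods.

Contribution: 1. Introduces a new benchmark that reformulates HOI detection as a multiple-answer multiple-choice task, avoiding penalties for valid predictions; 2. Provides the first evaluation protocol applicable to both VLMs and HOI methods.

Method: Designs a multiple-choice evaluation protocol in which each question contains only ground-truth positives and a curated set of negatives chosen to reduce ambiguity, allowing fairer evaluation of model predictions.

Result: The new benchmark evaluates VLMs and HOI methods more fairly and offers insight into the current state of HOI understanding.

Insight: Generative VLMs may already be strong at HOI tasks, but flexible evaluation is needed to reflect that multiple answers can be valid; specialized HOI methods still need to improve to compete with VLMs.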

Abstract: Prior human-object interaction (HOI) detection methods have integrated early vision-language models (VLMs) such as CLIP, but only as supporting components within their frameworks. In contrast, recent advances in large, generative VLMs suggest that these models may already possess strong ability to understand images involving HOI. This naturally raises an important question: can general-purpose standalone VLMs effectively solve HOI detection, and how do they compare with specialized HOI methods? Answering this requires a benchmark that can accommodate both paradigms. However, existing HOI benchmarks such as HICO-DET were developed before the emergence of modern VLMs, and their evaluation protocols require exact matches to annotated HOI classes. This is poorly aligned with the generative nature of VLMs, which often yield multiple valid interpretations in ambiguous cases. For example, a static image may capture a person mid-motion with a frisbee, which can plausibly be interpreted as either “throwing” or “catching”. When only “catching” is annotated, the other, though equally plausible for the image, is marked incorrect when exact matching is used. As a result, correct predictions might be penalized, affecting both VLMs and HOI-specific methods. To avoid penalizing valid predictions, we introduce a new benchmark that reformulates HOI detection as a multiple-answer multiple-choice task, where each question includes only ground-truth positive options and a curated set of negatives that are constructed to reduce ambiguity (e.g., when “catching” is annotated, “throwing” is not selected as a negative to avoid penalizing valid predictions). The proposed evaluation protocol is the first of its kind for both VLMs and HOI methods, enabling direct comparison and offering new insight into the current state of progress in HOI understanding.

[49] Beyond the Textual: Generating Coherent Visual Options for MCQs

Wanqiang Wang,Longzhu He,Wei Zheng

Main category: cs.CV

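TL;DR: The paper proposes CmOS, a multimodal framework for generating multiple-choice questions (MCQs) with visual options, addressing two issues that earlier work neglected: visual options and high-quality distractors. By combining Multimodal Chain-of-Thought (MCoT) with Retrieval-Augmented Generation (RAG), it produces options that are plausible both semantically and visually. Experiments show that CmOS outperforms existing methods across multiple subjects and educational levels.

Details Motivation: Existing MCQ generation research focuses mainly on textual options and overlooks the potential of visual options, while manually authoring high-quality distractors is costly and hard to scale.

Contribution: A novel Cross-modal Options Synthesis framework, CmOS, that generates MCQs with visual options and integrates MCoT and RAG to improve the semantic and visual quality of the options.

Method: The framework combines the Multimodal Chain-of-Thought (MCoT) reasoning process with Retrieval-Augmented Generation (RAG) and adds a discrimination module that identifies content suitable for visual options.

Result: Experiments show that CmOS performs well on content discrimination, question generation, and visual option generation, outperforming existing methods.

Insight: Visual options have potential in teaching, and cross-modal generation can improve the quality and diversity of MCQs.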

Abstract: Multiple-choice questions (MCQs) play a crucial role in fostering deep thinking and knowledge integration in education. However, previous research has primarily focused on generating MCQs with textual options and has largely overlooked visual options. Moreover, generating high-quality distractors remains a major challenge due to the high cost and limited scalability of manual authoring. To tackle these problems, we propose Cross-modal Options Synthesis (CmOS), a novel framework for generating educational MCQs with visual options. Our framework integrates a Multimodal Chain-of-Thought (MCoT) reasoning process and Retrieval-Augmented Generation (RAG) to produce a semantically plausible and visually similar answer and distractors. It also includes a discrimination module to identify content suitable for visual options. Experimental results on test tasks demonstrate the superiority of CmOS in content discrimination, question generation, and visual option generation over existing methods across various subjects and educational levels.

[50] Design, Implementation and Evaluation of a Real-Time Remote Photoplethysmography (rPPG) Acquisition System for Non-Invasive Vital Sign Monitoring

Constantino Álvarez Casado,Sasan Sharifipour,Manuel Lage Cañellas,Nhi Nguyen,Le Nguyen,Miguel Bordallo López

Main category: cs.CV

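TL;DR: This paper presents a real-time remote photoplethysmography (rPPG) system for low-power devices that extracts physiological signals such as heart rate (HR), respiratory rate (RR), and oxygen saturation (SpO2) from facial video streams. The system uses a multithreaded architecture and a hybrid programming model (FRP + Actor) to achieve efficient real-time processing on resource-constrained platforms.

Details Motivation: As smart environments and low-power computing devices spread, demand for remote, non-contact physiological monitoring is growing, but real-time deployment on resource-constrained platforms raises scalability and performance challenges.

Contribution: 1. A real-time rPPG system built on the Face2PPG pipeline; 2. A multithreaded architecture combined with a hybrid programming model (FRP + Actor) to optimize performance; 3. Continuous, reliable operation at 30 frames per second.

Method: The system uses a multithreaded architecture to handle video capture, real-time analysis, network communication, and GUI updates, and uses the FRP and Actor models for event-driven processing and task parallelization.

Result: The system is robust under real-time constraints while minimizing computational overhead.

Insight: With a hybrid programming model and adaptive feedback, the system achieves efficient real-time processing on low-power devices, offering a practical solution for modern healthcare and human-computer interaction applications.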

Abstract: The growing integration of smart environments and low-power computing devices, coupled with mass-market sensor technologies, is driving advancements in remote and non-contact physiological monitoring. However, deploying these systems in real-time on resource-constrained platforms introduces significant challenges related to scalability, interoperability, and performance. This paper presents a real-time remote photoplethysmography (rPPG) system optimized for low-power devices, designed to extract physiological signals, such as heart rate (HR), respiratory rate (RR), and oxygen saturation (SpO2), from facial video streams. The system is built on the Face2PPG pipeline, which processes video frames sequentially for rPPG signal extraction and analysis, while leveraging a multithreaded architecture to manage video capture, real-time processing, network communication, and graphical user interface (GUI) updates concurrently. This design ensures continuous, reliable operation at 30 frames per second (fps), with adaptive feedback through a collaborative user interface to guide optimal signal capture conditions. The network interface includes both an HTTP server for continuous video streaming and a RESTful API for on-demand vital sign retrieval. To ensure accurate performance despite the limitations of low-power devices, we use a hybrid programming model combining Functional Reactive Programming (FRP) and the Actor Model, allowing event-driven processing and efficient task parallelization. The system is evaluated under real-time constraints, demonstrating robustness while minimizing computational overhead. Our work addresses key challenges in real-time biosignal monitoring, offering practical solutions for optimizing performance in modern healthcare and human-computer interaction applications.
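As a rough illustration of the final step in such a pipeline, the sketch below estimates heart rate from an already-extracted rPPG signal by locating the strongest spectral peak in a plausible heart-rate band. This is a generic textbook approach, not the Face2PPG implementation; the function name, the 0.7 to 4.0 Hz band, and the frequency step are assumptions chosen for the example.

```python
import math


def estimate_heart_rate_bpm(signal, fps=30.0, lo_hz=0.7, hi_hz=4.0, step_hz=0.01):
    """Return the dominant frequency of `signal` in [lo_hz, hi_hz], in beats per minute.

    signal: list of rPPG samples taken at `fps` frames per second (assumed input).
    The band limits and step size are illustrative assumptions.
    """
    n = len(signal)
    mean = sum(signal) / n
    centered = [s - mean for s in signal]  # remove the DC component
    best_freq, best_power = lo_hz, -1.0
    steps = int(round((hi_hz - lo_hz) / step_hz))
    for k in range(steps + 1):
        freq = lo_hz + k * step_hz
        # Direct evaluation of the Fourier transform at one frequency.
        re = sum(x * math.cos(2 * math.pi * freq * i / fps) for i, x in enumerate(centered))
        im = sum(x * math.sin(2 * math.pi * freq * i / fps) for i, x in enumerate(centered))
        power = re * re + im * im
        if power > best_power:
            best_freq, best_power = freq, power
    return best_freq * 60.0


# Synthetic 10-second pulse at 1.2 Hz sampled at 30 fps (hypothetical test signal).
synthetic = [math.sin(2 * math.pi * 1.2 * i / 30.0) for i in range(300)]
bpm = estimate_heart_rate_bpm(synthetic)
```

A real-time system would typically run this on a sliding window and use an FFT rather than the direct sum shown here, which is kept only for clarity.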

[51] PseudoMapTrainer: Learning Online Mapping without HD Maps

Christian Löwens,Thorben Funke,Jingchao Xie,Alexandru Paul Condurache

Main category: cs.CV

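TL;DR: PseudoMapTrainer is an approach to learning online mapping without high-definition (HD) maps. It generates pseudo-labels from unlabeled sensor data and also handles pseudo-labels that are partially masked.

Details Motivation: Existing online mapping methods depend on HD map annotations that are expensive and have limited geographic coverage, which restricts model generalization.

Contribution: 1) Pseudo-labels generated by reconstructing the road surface from multi-view images and applying a pre-trained 2D segmentation network; 2) A mask-aware assignment algorithm and loss function for partially masked pseudo-labels; 3) Support for semi-supervised pre-training on large-scale unlabeled data.

Method: Pseudo-labels are generated by reconstructing the road surface with Gaussian splatting and a pre-trained segmentation network, and masking is handled with a mask-aware assignment and loss function.

Result: Online mapping models are trained for the first time without any ground-truth HD map annotations, and unlabeled data can be used for pre-training.

Insight: Pseudo-label generation and the handling of partial masking are the keys to learning online mapping from unlabeled data, opening a new direction for map learning.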

Abstract: Online mapping models show remarkable results in predicting vectorized maps from multi-view camera images only. However, all existing approaches still rely on ground-truth high-definition maps during training, which are expensive to obtain and often not geographically diverse enough for reliable generalization. In this work, we propose PseudoMapTrainer, a novel approach to online mapping that uses pseudo-labels generated from unlabeled sensor data. We derive those pseudo-labels by reconstructing the road surface from multi-camera imagery using Gaussian splatting and semantics of a pre-trained 2D segmentation network. In addition, we introduce a mask-aware assignment algorithm and loss function to handle partially masked pseudo-labels, allowing for the first time the training of online mapping models without any ground-truth maps. Furthermore, our pseudo-labels can be effectively used to pre-train an online model in a semi-supervised manner to leverage large-scale unlabeled crowdsourced data. The code is available at github.com/boschresearch/PseudoMapTrainer.

[52] Robust and Label-Efficient Deep Waste Detection

Hassan Abid,Khan Muhammad,Muhammad Haris Khan

Main category: cs.CV

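TL;DR: The paper proposes an ensemble-based semi-supervised learning framework to improve the robustness and label efficiency of waste detection. By optimizing prompts and fine-tuning transformer detectors, it establishes a new baseline on the ZeroWaste dataset (51.6 mAP), and it proposes a soft pseudo-labeling strategy whose use of unlabeled data surpasses fully supervised training.

Details Motivation: Waste sorting is critical for sustainable recycling, yet AI research lags behind commercial systems because of limited datasets and reliance on legacy object detectors. The paper aims to advance AI-driven waste detection by establishing strong baselines and introducing a semi-supervised learning framework.

Contribution: 1. A new baseline on the ZeroWaste dataset. 2. An ensemble-based soft pseudo-labeling strategy that makes semi-supervised learning more robust. 3. High-quality annotations for unlabeled data, making the annotation pipeline more scalable. 4. A systematic evaluation of Open-Vocabulary Object Detection (OVOD) models in real-world waste sorting scenarios.

Method: 1. Benchmark OVOD models and optimize prompts to improve zero-shot accuracy. 2. Fine-tune transformer detectors to improve performance. 3. Propose a soft pseudo-labeling strategy that fuses ensemble predictions using spatial and consensus-aware weighting.

Result: The fine-tuned transformer detectors reach a baseline of 51.6 mAP on ZeroWaste; the soft pseudo-labeling strategy applied to unlabeled data outperforms fully supervised training.

Insight: 1. Prompt optimization is critical for zero-shot detection performance. 2. Ensemble-based semi-supervised learning can substantially reduce the dependence on labeled data. 3. Transformer architectures show strong potential for waste detection.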

Abstract: Effective waste sorting is critical for sustainable recycling, yet AI research in this domain continues to lag behind commercial systems due to limited datasets and reliance on legacy object detectors. In this work, we advance AI-driven waste detection by establishing strong baselines and introducing an ensemble-based semi-supervised learning framework. We first benchmark state-of-the-art Open-Vocabulary Object Detection (OVOD) models on the real-world ZeroWaste dataset, demonstrating that while class-only prompts perform poorly, LLM-optimized prompts significantly enhance zero-shot accuracy. Next, to address domain-specific limitations, we fine-tune modern transformer-based detectors, achieving a new baseline of 51.6 mAP. We then propose a soft pseudo-labeling strategy that fuses ensemble predictions using spatial and consensus-aware weighting, enabling robust semi-supervised training. Applied to the unlabeled ZeroWaste-s subset, our pseudo-annotations achieve performance gains that surpass fully supervised training, underscoring the effectiveness of scalable annotation pipelines. Our work contributes to the research community by establishing rigorous baselines, introducing a robust ensemble-based pseudo-labeling pipeline, generating high-quality annotations for the unlabeled ZeroWaste-s subset, and systematically evaluating OVOD models under real-world waste sorting conditions. Our code is available at: https://github.com/h-abid97/robust-waste-detection.
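The abstract does not spell out the fusion rule, so the following is only a hedged sketch of what ensemble fusion with spatial and consensus-aware weighting could look like. It is in the spirit of weighted box fusion, not the authors' method: the function names, the IoU threshold, and the rule that the soft score is the summed confidence divided by the number of detectors are all assumptions.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def fuse_detections(per_model, iou_thr=0.5):
    """Fuse single-class detections from several detectors (hypothetical scheme).

    per_model: list with one entry per detector; each entry is a list of
               (box, score) pairs.
    Returns a list of (fused_box, soft_score) pseudo-labels.
    """
    n_models = len(per_model)
    flat = sorted((d for dets in per_model for d in dets), key=lambda d: -d[1])
    clusters = []
    for box, score in flat:
        for cluster in clusters:
            # Spatial grouping: join the cluster whose top box overlaps enough.
            if iou(box, cluster[0][0]) >= iou_thr:
                cluster.append((box, score))
                break
        else:
            clusters.append([(box, score)])
    fused = []
    for cluster in clusters:
        total = sum(s for _, s in cluster)
        # Confidence-weighted average of the box coordinates.
        box = tuple(sum(b[k] * s for b, s in cluster) / total for k in range(4))
        # Consensus weighting: boxes found by fewer detectors get a lower soft score.
        fused.append((box, min(1.0, total / n_models)))
    return fused


detector_a = [((0.0, 0.0, 10.0, 10.0), 0.8)]
detector_b = [((0.0, 0.0, 10.0, 10.0), 0.6), ((50.0, 50.0, 60.0, 60.0), 0.9)]
pseudo_labels = fuse_detections([detector_a, detector_b])
```

In this toy case the box seen by both detectors keeps a high soft score, while the box seen by only one detector is down-weighted, which is the intuition behind consensus-aware soft labels.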

[53] Embedding Font Impression Word Tags Based on Co-occurrence

Yugo Kubota,Seiichi Uchida

Main category: cs.CV

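TL;DR: The paper proposes a co-occurrence-based embedding method for font impression tags. It builds a tag co-occurrence graph and applies spectral embedding to obtain tag vectors, outperforming standard word embedding methods such as BERT and CLIP, and it is particularly suited to impression-based font generation and retrieval.

Details Motivation: Font shapes are closely related to the word tags that describe their impressions. Standard word embedding methods (e.g., BERT and CLIP) do not capture this relationship accurately, so a new embedding method is needed to represent font impressions.

Contribution: A spectral embedding method based on a co-occurrence graph that produces tag vectors representing font impressions and better supports impression-guided font generation and retrieval.

Method: Build a graph whose nodes are impression tags and whose edges encode co-occurrence relationships, then apply spectral embedding to obtain the tag vectors.

Result: Experiments show that the method outperforms BERT and CLIP in impression-guided font generation.

Insight: Exploiting tag co-occurrence captures the association between font shapes and their impressions more effectively, giving better support for tasks in font design.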

Abstract: Different font styles (i.e., font shapes) convey distinct impressions, indicating a close relationship between font shapes and word tags describing those impressions. This paper proposes a novel embedding method for impression tags that leverages these shape-impression relationships. For instance, our method assigns similar vectors to impression tags that frequently co-occur in order to represent impressions of fonts, whereas standard word embedding methods (e.g., BERT and CLIP) yield very different vectors. This property is particularly useful for impression-based font generation and font retrieval. Technically, we construct a graph whose nodes represent impression tags and whose edges encode co-occurrence relationships. Then, we apply spectral embedding to obtain the impression vectors for each tag. We compare our method with BERT and CLIP in qualitative and quantitative evaluations, demonstrating that our approach performs better in impression-guided font generation.
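The graph construction and spectral embedding described above can be sketched on toy data as follows. This is a minimal one-dimensional illustration under stated assumptions, not the authors' pipeline: the tag sets are invented, the helper names are hypothetical, and the second eigenvector of the normalized adjacency matrix is approximated with plain power iteration so that no external library is needed.

```python
import math
from itertools import combinations


def cooccurrence_matrix(tag_sets):
    """Build a symmetric tag co-occurrence matrix from per-font tag sets."""
    tags = sorted(set().union(*tag_sets))
    index = {t: i for i, t in enumerate(tags)}
    adj = [[0.0] * len(tags) for _ in tags]
    for tag_set in tag_sets:
        for a, b in combinations(sorted(tag_set), 2):
            adj[index[a]][index[b]] += 1.0
            adj[index[b]][index[a]] += 1.0
    return tags, adj


def spectral_embedding_1d(adj, iters=500):
    """Approximate a 1-D spectral embedding (assumes every tag has degree > 0)."""
    n = len(adj)
    deg = [sum(row) for row in adj]
    # M = (I + D^-1/2 A D^-1/2) / 2 has eigenvalues in [0, 1], so power
    # iteration converges to the algebraically largest remaining eigenvector.
    m = [[(0.5 if i == j else 0.0) + 0.5 * adj[i][j] / math.sqrt(deg[i] * deg[j])
          for j in range(n)] for i in range(n)]
    top = [math.sqrt(d) for d in deg]  # known trivial top eigenvector
    norm = math.sqrt(sum(v * v for v in top))
    top = [v / norm for v in top]
    vec = [float(i + 1) for i in range(n)]  # deterministic start vector
    for _ in range(iters):
        dot = sum(v * t for v, t in zip(vec, top))
        vec = [v - dot * t for v, t in zip(vec, top)]  # deflate the trivial direction
        vec = [sum(m[i][j] * vec[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(v * v for v in vec))
        vec = [v / norm for v in vec]
    return [v / math.sqrt(d) for v, d in zip(vec, deg)]


# Invented example: two groups of impression tags joined by one weak link.
fonts = [
    {"elegant", "formal", "classic"}, {"elegant", "formal"},
    {"formal", "classic"}, {"elegant", "classic"},
    {"playful", "cute", "fun"}, {"playful", "cute"},
    {"cute", "fun"}, {"playful", "fun"},
    {"classic", "fun"},
]
tags, adj = cooccurrence_matrix(fonts)
embedding = dict(zip(tags, spectral_embedding_1d(adj)))
```

Tags that frequently co-occur end up with coordinates of the same sign, while the two groups fall on opposite sides of zero. A practical version would keep several eigenvectors to obtain multi-dimensional tag vectors.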

[54] Deep Pre-trained Time Series Features for Tree Species Classification in the Dutch Forest Inventory

Takayuki Ishikawa,Carmelo Bonannella,Bas J. W. Lerink,Marc Rußwurm

Main category: cs.CV

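TL;DR: The paper investigates using pre-trained deep learning models to extract time-series features for tree species classification in the Dutch National Forest Inventory (NFI), improving classification accuracy by up to 10% over conventional methods.

Details Motivation: National Forest Inventories (NFIs) rely on manual field campaigns that are time-consuming and labor-intensive. Combining remote sensing with machine learning allows more efficient large-scale updates. However, existing approaches rely mainly on Random Forest classifiers and hand-designed features, which cannot fully capture complex seasonal reflectance patterns. Pre-trained deep learning models offer a better alternative.

Contribution: 1) A systematic study of how features from pre-trained deep learning models affect tree species classification; 2) Time-series data extracted through Google Earth Engine from multiple sources, including Sentinel-1, Sentinel-2, ERA5, and SRTM; 3) Evidence of the potential of pre-trained models in data-limited tasks, with classification accuracy clearly above existing methods.

Method: The authors take a publicly available pre-trained remote sensing time-series foundation model, extract deep features, and fine-tune with a small amount of labeled data. The data come from sources such as Sentinel-1 and Sentinel-2, with the time series assembled in Google Earth Engine. Compared with conventional hand-designed harmonic features, the deep features capture complex patterns more effectively.

Result: Experiments show that deep features from the pre-trained model improve accuracy on Dutch tree species classification by up to 10% over the current state of the art, confirming the advantage of deep features in data-limited tasks.

Insight: Pre-trained deep learning models can substantially improve classification accuracy in data-limited tasks and offer an efficient complement for practical applications such as NFIs. Combining multiple satellite data sources further strengthens generalization.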

Abstract: National Forest Inventories (NFIs) serve as the primary source of forest information, providing crucial tree species distribution data. However, maintaining these inventories requires labor-intensive on-site campaigns. Remote sensing approaches, particularly when combined with machine learning, offer opportunities to update NFIs more frequently and at larger scales. While the use of Satellite Image Time Series has proven effective for distinguishing tree species through seasonal canopy reflectance patterns, current approaches rely primarily on Random Forest classifiers with hand-designed features and phenology-based metrics. Using deep features from available pre-trained remote sensing foundation models offers a complementary strategy. These pre-trained models leverage unannotated global data, are meant to be used for general-purpose applications, and can then be efficiently fine-tuned with smaller labeled datasets for specific classification tasks. This work systematically investigates how deep features improve tree species classification accuracy in the Netherlands with little annotated data. Data-wise, we extracted time-series data from Sentinel-1, Sentinel-2, ERA5, and SRTM sources using Google Earth Engine. Our results demonstrate that fine-tuning a publicly available remote sensing time series foundation model outperforms the current state-of-the-art in NFI classification in the Netherlands by a large margin of up to 10% across all datasets. This demonstrates that classic hand-defined harmonic features are too simple for this task and highlights the potential of using deep AI features for data-limited applications like NFI classification. By leveraging openly available satellite data and pre-trained models, this approach significantly improves classification accuracy compared to traditional methods and can effectively complement existing forest inventory processes.

[55] Boosting Micro-Expression Analysis via Prior-Guided Video-Level Regression

Zizheng Guo,Bochao Zou,Yinuo Jia,Xiangyu Li,Huimin Ma

Main category: cs.CV

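TL;DR: The paper proposes a prior-guided video-level regression method for micro-expression analysis that combines temporal interval selection with a synergistic optimization framework and substantially improves performance.

Details Motivation: Existing micro-expression analysis methods rely on fixed-window classification and struggle to capture complex temporal dynamics. Some video-level regression methods are still limited by manually predefined windows, so the problem is only partially solved.

Contribution: 1. A scalable interval selection strategy that accounts for the temporal evolution, duration, and class distribution of micro-expressions; 2. A synergistic optimization framework that exploits complementary information and uses data efficiently.

Method: 1. A prior-guided video-level regression method that uses a temporal selection strategy to locate the onset, apex, and offset phases of micro-expressions; 2. A synergistic optimization framework with shared parameters that jointly optimizes the spotting and recognition tasks.

Result: State-of-the-art performance on multiple benchmark datasets, with an STRS of 0.0562 on CAS(ME)^3 and 0.2000 on SAMMLV.

Insight: The temporal characteristics of micro-expressions are key; combining prior knowledge with multi-task synergistic optimization can markedly improve model performance.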

Abstract: Micro-expressions (MEs) are involuntary, low-intensity, and short-duration facial expressions that often reveal an individual’s genuine thoughts and emotions. Most existing ME analysis methods rely on window-level classification with fixed window sizes and hard decisions, which limits their ability to capture the complex temporal dynamics of MEs. Although recent approaches have adopted video-level regression frameworks to address some of these challenges, interval decoding still depends on manually predefined, window-based methods, leaving the issue only partially mitigated. In this paper, we propose a prior-guided video-level regression method for ME analysis. We introduce a scalable interval selection strategy that comprehensively considers the temporal evolution, duration, and class distribution characteristics of MEs, enabling precise spotting of the onset, apex, and offset phases. In addition, we introduce a synergistic optimization framework, in which the spotting and recognition tasks share parameters except for the classification heads. This fully exploits complementary information, makes more efficient use of limited data, and enhances the model’s capability. Extensive experiments on multiple benchmark datasets demonstrate the state-of-the-art performance of our method, with an STRS of 0.0562 on CAS(ME)$^3$ and 0.2000 on SAMMLV. The code is available at https://github.com/zizheng-guo/BoostingVRME.

[56] Quantitative Outcome-Oriented Assessment of Microsurgical Anastomosis

Luyin Hu,Soheil Gholami,George Dindelegan,Torstein R. Meling,Aude Billard

Main category: cs.CV

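TL;DR: The paper proposes a quantitative, image-processing-based framework for objectively assessing microsurgical anastomosis, reducing the bias of subjective judgment and improving the efficiency and reliability of assessment.

Details Motivation: Assessment of microsurgical anastomosis currently relies on subjective methods that are biased and unreliable, so an objective, quantitative method is needed.

Contribution: A quantitative framework that uses image-processing techniques to assess microsurgical anastomoses objectively, together with geometric modeling and a scoring mechanism.

Method: Errors are modeled geometrically and combined with a detection and scoring mechanism, using image-processing techniques to quantify surgical outcomes.

Result: The geometric metrics effectively replicate expert raters' scores, demonstrating the validity of the method.

Insight: Quantitative methods can markedly improve the objectivity and reliability of microsurgical anastomosis assessment and are applicable to learners at different skill levels.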

Abstract: Microsurgical anastomosis demands exceptional dexterity and visuospatial skills, underscoring the importance of comprehensive training and precise outcome assessment. Currently, methods such as the outcome-oriented anastomosis lapse index are used to evaluate this procedure. However, they often rely on subjective judgment, which can introduce biases that affect the reliability and efficiency of the assessment of competence. Leveraging three datasets from hospitals with participants at various levels, we introduce a quantitative framework that uses image-processing techniques for objective assessment of microsurgical anastomoses. The approach uses geometric modeling of errors along with a detection and scoring mechanism, enhancing the efficiency and reliability of microsurgical proficiency assessment and advancing training protocols. The results show that the geometric metrics effectively replicate expert raters’ scoring for the errors considered in this work.

[57] Harnessing Meta-Learning for Controllable Full-Frame Video Stabilization

Muhammad Kashif Ali,Eun Woo Im,Dongjin Kim,Tae Hyun Kim,Vivek Gupta,Haonan Luo,Tianrui Li

Main category: cs.CV

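TL;DR: The paper proposes a meta-learning-based method for controllable full-frame video stabilization. By rapidly adapting to the low-level visual cues of each input video, it substantially improves stability and visual quality, and it introduces a jerk localization module and a targeted adaptation strategy.

Details Motivation: Existing pixel-level synthesis methods for video stabilization struggle to adapt to the diversity of motion and visual content across videos, which limits generalization. The paper aims to use meta-learning for rapid adaptation that improves stability and visual quality.

Contribution: 1. A meta-learning-based rapid adaptation method that improves full-frame video stabilization models; 2. A jerk localization module and a targeted adaptation strategy that make adaptation more efficient; 3. Experiments on diverse datasets demonstrating the method's effectiveness.

Method: 1. Rapidly adapt to the input video using low-level visual cues; 2. Identify highly dynamic segments with the jerk localization module; 3. Use a targeted adaptation strategy to focus optimization on high-jerk regions.

Result: Experiments show that the method substantially improves a range of full-frame synthesis models in both stability and visual quality and performs well on downstream tasks.

Insight: Combining meta-learning with targeted adaptation can address the generalization problem in video stabilization while retaining the advantages of full-frame output.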

Abstract: Video stabilization remains a fundamental problem in computer vision, and pixel-level synthesis solutions, which synthesize full-frame outputs, add to the complexity of this task. These methods aim to enhance stability while synthesizing full-frame videos, but the inherent diversity in motion profiles and visual content present in each video sequence makes robust generalization with fixed parameters difficult. To address this, we present a novel method that improves pixel-level synthesis video stabilization methods by rapidly adapting models to each input video at test time. The proposed approach takes advantage of low-level visual cues available during inference to improve both the stability and visual quality of the output. Notably, the proposed rapid adaptation achieves significant performance gains even with a single adaptation pass. We further propose a jerk localization module and a targeted adaptation strategy, which focuses the adaptation on high-jerk segments to maximize stability with fewer adaptation steps. The proposed methodology enables modern stabilizers to surpass the longstanding SOTA approaches while maintaining the full-frame nature of the modern methods and offering users control mechanisms akin to classical approaches. Extensive experiments on diverse real-world datasets demonstrate the versatility of the proposed method. Our approach consistently improves the performance of various full-frame synthesis models in both qualitative and quantitative terms, including results on downstream applications.

[58] Toward Robust Medical Fairness: Debiased Dual-Modal Alignment via Text-Guided Attribute-Disentangled Prompt Learning for Vision-Language Models

Yuexuan Xia,Benteng Ma,Jiang He,Zhiyong Wang,Qi Dou,Yong Xia

Main category: cs.CV

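TL;DR: The paper proposes DualFairVL, a multimodal prompt-learning framework that uses text-guided attribute disentanglement to jointly debias and align cross-modal representations, improving the fairness and robustness of medical imaging diagnosis.

Details Motivation: In medical imaging diagnosis, fairness across demographic groups is essential for equitable healthcare. However, existing methods handle the vision and text modalities independently, leaving cross-modal misalignment and fairness gaps.

Contribution: The DualFairVL framework, which separates sensitive and target attributes with a dual-branch architecture and achieves cross-modal disentanglement and alignment through text-guided linear projections and a hypernetwork.

Method: A parallel dual-branch structure separates sensitive and target attributes; orthogonal text anchors guide the cross-attention mechanism; a hypernetwork generates instance-aware visual prompts; prototype-based regularization strengthens alignment.

Result: Experiments on eight medical imaging datasets show that DualFairVL outperforms existing methods in fairness and accuracy with only 3.6M trainable parameters.

Insight: Disentangling and aligning cross-modal representations can markedly improve model fairness and robustness, especially under distribution shift.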

Abstract: Ensuring fairness across demographic groups in medical diagnosis is essential for equitable healthcare, particularly under distribution shifts caused by variations in imaging equipment and clinical practice. Vision-language models (VLMs) exhibit strong generalization, and text prompts encode identity attributes, enabling explicit identification and removal of sensitive directions. However, existing debiasing approaches typically address vision and text modalities independently, leaving residual cross-modal misalignment and fairness gaps. To address this challenge, we propose DualFairVL, a multimodal prompt-learning framework that jointly debiases and aligns cross-modal representations. DualFairVL employs a parallel dual-branch architecture that separates sensitive and target attributes, enabling disentangled yet aligned representations across modalities. Approximately orthogonal text anchors are constructed via linear projections, guiding cross-attention mechanisms to produce fused features. A hypernetwork further disentangles attribute-related information and generates instance-aware visual prompts, which encode dual-modal cues for fairness and robustness. Prototype-based regularization is applied in the visual branch to enforce separation of sensitive features and strengthen alignment with textual anchors. Extensive experiments on eight medical imaging datasets across four modalities show that DualFairVL achieves state-of-the-art fairness and accuracy under both in- and out-of-distribution settings, outperforming full fine-tuning and parameter-efficient baselines with only 3.6M trainable parameters. Code will be released upon publication.

[59] DQEN: Dual Query Enhancement Network for DETR-based HOI Detection

Zhehao Li,Chong Wang,Yi Chen,Yinghao Lu,Jiangbo Qian,Jiong Wang,Jiafei Wu

Main category: cs.CV

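TL;DR: DQEN is a dual query enhancement network that improves the object and interaction queries in DETR-based HOI detection, raising detection performance through object-aware encoding and semantic fusion.

Details Motivation: Existing DETR-based HOI detection models rely on randomly initialized queries, which yield vague representations and limit effectiveness. DQEN aims to improve detection by enhancing the object and interaction queries.

Contribution: 1. A Dual Query Enhancement Network (DQEN) that enhances object and interaction queries;
2. An object-aware encoding scheme and an Interaction Semantic Fusion module;
3. An Auxiliary Prediction Unit that improves the representation of interaction features.

Method: 1. Object queries are enhanced with object-aware encoder features;
2. Interaction queries are initialized from semantic features of HOI candidates obtained with the CLIP model;
3. An Auxiliary Prediction Unit refines the interaction features.

Result: Competitive performance on the HICO-Det and V-COCO datasets.

Insight: Initializing object and interaction queries explicitly and combining them with semantic information can markedly improve DETR-based HOI detection.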

Abstract: Human-Object Interaction (HOI) detection focuses on localizing human-object pairs and recognizing their interactions. Recently, the DETR-based framework has been widely adopted in HOI detection. In DETR-based HOI models, queries with clear meaning are crucial for accurately detecting HOIs. However, prior works have typically relied on randomly initialized queries, leading to vague representations that limit the model’s effectiveness. Meanwhile, humans in the HOI categories are fixed, while objects and their interactions are variable. Therefore, we propose a Dual Query Enhancement Network (DQEN) to enhance object and interaction queries. Specifically, object queries are enhanced with object-aware encoder features, enabling the model to focus more effectively on humans interacting with objects in an object-aware way. On the other hand, we design a novel Interaction Semantic Fusion module to exploit the HOI candidates that are promoted by the CLIP model. Semantic features are extracted to enhance the initialization of interaction queries, thereby improving the model’s ability to understand interactions. Furthermore, we introduce an Auxiliary Prediction Unit aimed at improving the representation of interaction features. Our proposed method achieves competitive performance on both the HICO-Det and the V-COCO datasets. The source code is available at https://github.com/lzzhhh1019/DQEN.

[60] Interpretable Decision-Making for End-to-End Autonomous Driving

Mona Mirzaie,Bodo Rosenhahn

Main category: cs.CV

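TL;DR: The paper proposes a method to make end-to-end autonomous driving decisions more interpretable. Its loss functions produce sparse and localized feature maps that explain the basis of the model's decisions, and the method performs well on the CARLA benchmarks.

Details Motivation: Autonomous driving systems must be trusted before they can be widely deployed, but the deep neural networks in current end-to-end approaches lack interpretability, especially in complex urban scenarios. Improving the interpretability of their decisions is therefore essential.

Contribution: A new loss function design that produces sparse and localized feature maps, improving model interpretability while also achieving better performance on the CARLA benchmarks.

Method: With the proposed loss functions, the model produces sparse and localized feature maps that show clearly which image regions contribute to the predicted control command.

Result: The method performs strongly on the CARLA benchmarks: the monocular, non-ensemble model surpasses the top approaches on the leaderboard, with lower infraction scores and a higher route completion rate.

Insight: Interpretability and performance can reinforce each other: optimizing the locality and sparsity of feature maps makes decisions more interpretable and also makes the driving model safer.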

Abstract: Trustworthy AI is mandatory for the broad deployment of autonomous vehicles. Although end-to-end approaches derive control commands directly from raw data, interpreting these decisions remains challenging, especially in complex urban scenarios. This is mainly attributed to very deep neural networks with non-linear decision boundaries, making it challenging to grasp the logic behind AI-driven decisions. This paper presents a method to enhance interpretability while optimizing control commands in autonomous driving. To address this, we propose loss functions that promote the interpretability of our model by generating sparse and localized feature maps. The feature activations allow us to explain which image regions contribute to the predicted control command. We conduct comprehensive ablation studies on the feature extraction step and validate our method on the CARLA benchmarks. We also demonstrate that our approach improves interpretability, which correlates with reducing infractions, yielding a safer, high-performance driving model. Notably, our monocular, non-ensemble model surpasses the top-performing approaches from the CARLA Leaderboard by achieving lower infraction scores and the highest route completion rate, all while ensuring interpretability.

[61] Event-Enriched Image Analysis Grand Challenge at ACM Multimedia 2025

Thien-Phuc Tran,Minh-Quang Nguyen,Minh-Triet Tran,Tam V. Nguyen,Trong-Le Do,Duy-Nam Ly,Viet-Tham Huynh,Khanh-Duy Le,Mai-Khiem Tran,Trung-Nghia Le

Main category: cs.CV

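TL;DR: The EVENTA Grand Challenge is a multimodal understanding task at ACM Multimedia 2025 that integrates contextual, temporal, and semantic information to address the weakness of traditional image tasks in event-level understanding.

Details Motivation: Traditional image tasks such as captioning and retrieval usually focus on surface-level recognition and overlook the contextual and semantic dimensions that define real-world events. EVENTA aims to fill this gap.

Contribution: 1. The first large-scale benchmark for event-level multimodal understanding; 2. Two challenge tracks: Event-Enriched Image Retrieval and Captioning, and Event-Based Image Retrieval; 3. Progress toward context-aware, narrative-driven multimedia AI.

Method: Built on the OpenEvents V1 dataset, with submissions evaluated through Public and Private Test phases. The task integrates temporal, contextual, and semantic information to capture the multiple dimensions of an event (who, when, where, what, and why).

Result: A total of 45 teams took part, and the top three teams presented their solutions at ACM Multimedia 2025. EVENTA lays a foundation for multimedia AI applications such as journalism, media analysis, and cultural archiving.

Insight: EVENTA underlines the importance of event-level understanding, provides new directions and evaluation standards for future research, and pushes multimodal AI toward richer contextual and narrative capabilities.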

Abstract: The Event-Enriched Image Analysis (EVENTA) Grand Challenge, hosted at ACM Multimedia 2025, introduces the first large-scale benchmark for event-level multimodal understanding. Traditional captioning and retrieval tasks largely focus on surface-level recognition of people, objects, and scenes, often overlooking the contextual and semantic dimensions that define real-world events. EVENTA addresses this gap by integrating contextual, temporal, and semantic information to capture the who, when, where, what, and why behind an image. Built upon the OpenEvents V1 dataset, the challenge features two tracks: Event-Enriched Image Retrieval and Captioning, and Event-Based Image Retrieval. A total of 45 teams from six countries participated, with evaluation conducted through Public and Private Test phases to ensure fairness and reproducibility. The top three teams were invited to present their solutions at ACM Multimedia 2025. EVENTA establishes a foundation for context-aware, narrative-driven multimedia AI, with applications in journalism, media analysis, cultural archiving, and accessibility. Further details about the challenge are available at the official homepage: https://ltnghia.github.io/eventa/eventa-2025.

[62] Preliminary Study on Space Utilization and Emergent Behaviors of Group vs. Single Pedestrians in Real-World Trajectories

Amartaivan Sanjjamts,Morita Hiroshi

Main category: cs.CV

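TL;DR: The paper proposes a preliminary framework for distinguishing pedestrian groups from single pedestrians in real-world trajectory data and analyzes their differences with spatial and behavioral metrics.

Details Motivation: The study aims to understand how groups and single pedestrians differ in space utilization and behavioral patterns, providing a basis for crowd dynamics research.

Contribution: A Transformer-based classification model and a framework of spatial and behavioral metrics that lay the groundwork for later analysis.

Method: Trajectory data segmented into time bins are classified into groups and single pedestrians with a Transformer model, and spatial and behavioral metrics are introduced.

Result: The paper establishes the classification pipeline and dataset structure, supports analysis at different sequence lengths, and provides tools for future research.

Insight: The spatial and behavioral metric framework makes deeper analysis of the differences between groups and single pedestrians possible and supports pedestrian simulation and space design validation.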

Abstract: This study presents an initial framework for distinguishing group and single pedestrians based on real-world trajectory data, with the aim of analyzing their differences in space utilization and emergent behavioral patterns. By segmenting pedestrian trajectories into fixed time bins and applying a Transformer-based pair classification model, we identify cohesive groups and isolate single pedestrians through a structured sequence-based filtering process. To prepare for deeper analysis, we establish a comprehensive metric framework incorporating both spatial and behavioral dimensions. Spatial utilization metrics include convex hull area, smallest enclosing circle radius, and heatmap-based spatial densities to characterize how different pedestrian types occupy and interact with space. Behavioral metrics such as velocity change, motion angle deviation, clearance radius, and trajectory straightness are designed to capture local adaptations and responses during interactions. Furthermore, we introduce a typology of encounter types (single-to-single, single-to-group, and group-to-group) to categorize and later quantify different interaction scenarios. Although this version focuses primarily on the classification pipeline and dataset structuring, it establishes the groundwork for scalable analysis across different sequence lengths (60, 100, and 200 frames). Future versions will incorporate complete quantitative analysis of the proposed metrics and their implications for pedestrian simulation and space design validation in crowd dynamics research.
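Two of the metrics named in the abstract, trajectory straightness and convex hull area, can be sketched in a few lines. The definitions used here are common ones and are assumptions rather than the authors' exact formulas: straightness is taken as net displacement divided by path length, and the hull is computed with the monotone chain algorithm. All function names are hypothetical.

```python
import math


def path_length(points):
    """Total length of a trajectory given as a list of (x, y) tuples."""
    return sum(math.dist(p, q) for p, q in zip(points, points[1:]))


def straightness(points):
    """Net displacement divided by path length (1.0 means perfectly straight)."""
    length = path_length(points)
    return math.dist(points[0], points[-1]) / length if length > 0 else 1.0


def convex_hull(points):
    """Convex hull vertices in counter-clockwise order (monotone chain)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def polygon_area(vertices):
    """Area of a simple polygon via the shoelace formula."""
    n = len(vertices)
    twice_area = sum(
        vertices[i][0] * vertices[(i + 1) % n][1]
        - vertices[(i + 1) % n][0] * vertices[i][1]
        for i in range(n)
    )
    return abs(twice_area) / 2.0


# Hypothetical positions of one pedestrian group over a short time bin.
group_positions = [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0), (1.0, 1.0)]
occupied_area = polygon_area(convex_hull(group_positions))
```

Computed per time bin, the hull area gives a simple measure of how much space a group occupies, while straightness summarizes how much an individual deviates from a direct path.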

[63] The point is the mask: scaling coral reef segmentation with weak supervision

Matteo Contini,Victor Illien,Sylvain Poulain,Serge Bernard,Julien Barde,Sylvain Bonhommeau,Alexis Joly

Main category: cs.CV

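TL;DR: The paper proposes a multi-scale weakly supervised semantic segmentation framework that addresses large-scale coral reef monitoring by transferring fine-scale ecological information from underwater imagery to aerial data. The method combines classification-based supervision, spatial interpolation, and self-distillation to segment coral reefs over large areas with minimal annotation.

Details Motivation: Large-scale coral reef monitoring is essential for assessing ecosystem health and guiding conservation, but existing aerial imagery has limited resolution that makes fine-scale coral morphotypes hard to distinguish, and costly pixel-level annotation limits the scalability of deep learning methods.

Contribution: A multi-scale weakly supervised semantic segmentation framework that transfers fine-scale information from underwater imagery to aerial data, enabling automated large-scale reef monitoring while reducing the need for manual annotation.

Method: The method combines classification-based supervision, spatial interpolation, and self-distillation, using weak supervision to extract fine-scale coral morphotype features from aerial imagery and multi-scale processing to improve segmentation accuracy.

Result: Experiments demonstrate the method's effectiveness: it segments coral morphotypes over large areas and is flexible enough to integrate new classes.

Insight: By combining low-cost data collection with weakly supervised learning, the paper offers a scalable, cost-effective approach to high-resolution reef monitoring and a new tool for ecological conservation.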

Abstract: Monitoring coral reefs at large spatial scales remains an open challenge, essential for assessing ecosystem health and informing conservation efforts. While drone-based aerial imagery offers broad spatial coverage, its limited resolution makes it difficult to reliably distinguish fine-scale classes, such as coral morphotypes. At the same time, obtaining pixel-level annotations over large spatial extents is costly and labor-intensive, limiting the scalability of deep learning-based segmentation methods for aerial imagery. We present a multi-scale weakly supervised semantic segmentation framework that addresses this challenge by transferring fine-scale ecological information from underwater imagery to aerial data. Our method enables large-scale coral reef mapping from drone imagery with minimal manual annotation, combining classification-based supervision, spatial interpolation and self-distillation techniques. We demonstrate the efficacy of the approach, enabling large-area segmentation of coral morphotypes and demonstrating flexibility for integrating new classes. This study presents a scalable, cost-effective methodology for high-resolution reef monitoring, combining low-cost data collection, weakly supervised deep learning and multi-scale remote sensing.

[64] Enhancing compact convolutional transformers with super attention

Simpenzwe Honore Leandre,Natenaile Asmamaw Shiferaw,Dillip Rout

Main category: cs.CV

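TL;DR: The paper proposes a vision model that combines token mixing, sequence pooling, and convolutional tokenizers to achieve efficient inference and state-of-the-art performance on fixed context-length tasks.

Details Motivation: Existing attention mechanisms such as SDPA are inefficient on short-context tasks and depend on additional techniques (data augmentation, positional embeddings, and so on). The paper aims to design a more efficient and more stable model.

Contribution: 1. An efficient vision model architecture; 2. A significant performance gain on CIFAR100 with a smaller model; 3. High training stability without reliance on additional techniques.

Method: The model uses token mixing, sequence pooling, and convolutional tokenizers together with a super attention mechanism to improve efficiency on short-context tasks.

Result: On CIFAR100, top-1 and top-5 validation accuracy improve from 36.50% to 46.29% and from 66.33% to 76.31%, respectively, with a smaller and more efficient model.

Insight: On short-context tasks, super attention is more efficient than conventional attention, and the model design can simplify the training pipeline.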

Abstract: In this paper, we propose a vision model that adopts token mixing, sequence-pooling, and convolutional tokenizers to achieve state-of-the-art performance and efficient inference in fixed context-length tasks. In the CIFAR100 benchmark, our model significantly improves the baseline top-1 and top-5 validation accuracy from 36.50% to 46.29% and from 66.33% to 76.31%, respectively, while being more efficient than Scaled Dot Product Attention (SDPA) transformers when the context length is less than the embedding dimension, and at only 60% of the size. In addition, the architecture demonstrates high training stability and does not rely on techniques such as data augmentation like mixup, positional embeddings, or learning rate scheduling. We make our code available on GitHub.
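Sequence pooling, as commonly described for compact convolutional transformers, replaces a class token with an attention-weighted average over all output tokens. The sketch below shows that operation in plain Python. It is a simplified illustration under assumptions: the learned projection is reduced to one weight vector and a bias, the names are hypothetical, and it is not taken from this paper's code.

```python
import math


def sequence_pool(tokens, weights, bias=0.0):
    """Pool a sequence of token vectors into a single vector.

    tokens:  list of equal-length lists, one per token
    weights: projection vector mapping each token to a scalar score
             (stands in for a learned linear layer; an assumption here)
    """
    scores = [sum(w * x for w, x in zip(weights, tok)) + bias for tok in tokens]
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    attn = [e / total for e in exps]  # softmax over the sequence dimension
    dim = len(tokens[0])
    return [sum(a * tok[d] for a, tok in zip(attn, tokens)) for d in range(dim)]


# With a zero projection every token gets the same weight, so pooling
# reduces to a plain average of the tokens.
uniform = sequence_pool([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
```

In a trained model the projection would be learned, so informative tokens receive larger weights than background tokens before the pooled vector is passed to the classifier.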

[65] Can we make NeRF-based visual localization privacy-preserving?

Maxime Pietrantoni,Martin Humenberger,Torsten Sattler,Gabriela Csurka

Main category: cs.CV

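TL;DR: The paper examines the risk that NeRFs (neural radiance fields) leak private information in visual localization, proposes a new protocol for assessing the privacy preservation of NeRF-based representations, and develops a privacy-preserving variant named ppNeSF. ppNeSF is trained with self-supervised segmentation labels instead of RGB images, protecting privacy while still enabling accurate visual localization.

Details Motivation: The widespread use of NeRFs in visual localization creates a risk of privacy leakage, because they implicitly store fine scene details. The paper aims to address this so that NeRFs deployed in cloud services keep their high performance while protecting privacy.

Contribution: 1. Proposes a new protocol for assessing the privacy preservation of NeRFs. 2. Designs ppNeSF, which is trained with segmentation supervision to protect privacy while maintaining high visual localization performance.

Method: 1. Uses the new protocol to show that the details NeRFs store in their geometry representations can lead to privacy leakage. 2. Trains ppNeSF with segmentation labels learned in a self-supervised manner, instead of using RGB data directly.

Result: ppNeSF achieves state-of-the-art visual localization performance while protecting privacy, validating its effectiveness.

Insight: The privacy problem of NeRFs is not limited to the color prediction head; the geometry representation is also a potential source of sensitive information. Segmentation labels can serve as a privacy-preserving alternative.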

Abstract: Visual localization (VL) is the task of estimating the camera pose in a known scene. VL methods can be distinguished, among other criteria, by how they represent the scene, e.g., explicitly through a (sparse) point cloud or a collection of images, or implicitly through the weights of a neural network. Recently, NeRF-based methods have become popular for VL. While NeRFs offer high-quality novel view synthesis, they inadvertently encode fine scene details, raising privacy concerns when deployed in cloud-based localization services, as sensitive information could be recovered. In this paper, we tackle this challenge on two ends. We first propose a new protocol to assess privacy-preservation of NeRF-based representations. We show that NeRFs trained with photometric losses store fine-grained details in their geometry representations, making them vulnerable to privacy attacks, even if the head that predicts colors is removed. Second, we propose ppNeSF (Privacy-Preserving Neural Segmentation Field), a NeRF variant trained with segmentation supervision instead of RGB images. These segmentation labels are learned in a self-supervised manner, ensuring they are coarse enough to obscure identifiable scene details while remaining discriminative in 3D. The segmentation space of ppNeSF can be used for accurate visual localization, yielding state-of-the-art results.

[66] Enhancing Document VQA Models via Retrieval-Augmented Generation

Eric López,Artemis Llabrés,Ernest Valveny

Main category: cs.CV

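TL;DR: The paper improves Document VQA performance with retrieval-augmented generation (RAG), evaluating text-based and visual retrieval variants that substantially raise accuracy on several benchmarks.

Details Motivation: Existing Document VQA systems consume large amounts of memory on multi-page documents; RAG offers an efficient, lightweight alternative.

Contribution: 1) Systematically evaluates the role of RAG in Document VQA; 2) Evaluates text-based and visual retrieval variants, each of which improves performance; 3) Shows that retrieval and reranking are the main sources of improvement.

Method: 1) Text-based retrieval over OCR tokens; 2) Visual retrieval that requires no OCR; 3) Combines retrieval with generation models to improve evidence selection.

Result: On MP-DocVQA, DUDE, and InfographicVQA, the text-based variant improves the baseline by up to 22.5 ANLS and the visual variant by 5.0 ANLS.

Insight: Careful evidence selection, rather than layout-guided chunking, is the key to better performance, and it holds across model sizes and benchmarks.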

Abstract: Document Visual Question Answering (Document VQA) must cope with documents that span dozens of pages, yet leading systems still concatenate every page or rely on very large vision-language models, both of which are memory-hungry. Retrieval-Augmented Generation (RAG) offers an attractive alternative, first retrieving a concise set of relevant segments before generating answers from this selected evidence. In this paper, we systematically evaluate the impact of incorporating RAG into Document VQA through different retrieval variants - text-based retrieval using OCR tokens and purely visual retrieval without OCR - across multiple models and benchmarks. Evaluated on the multi-page datasets MP-DocVQA, DUDE, and InfographicVQA, the text-centric variant improves the “concatenate-all-pages” baseline by up to +22.5 ANLS, while the visual variant achieves +5.0 ANLS improvement without requiring any text extraction. An ablation confirms that retrieval and reranking components drive most of the gain, whereas the layout-guided chunking strategy - proposed in several recent works to leverage page structure - fails to help on these datasets. Our experiments demonstrate that careful evidence selection consistently boosts accuracy across multiple model sizes and multi-page benchmarks, underscoring its practical value for real-world Document VQA.
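The ANLS gains quoted in the abstract are easier to interpret with the metric written out. The following is a minimal sketch of Average Normalized Levenshtein Similarity as it is commonly defined for document VQA, assuming a 0.5 threshold and case-insensitive matching. The function names are illustrative, and the paper's own evaluation script may differ in normalization details.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def anls(predictions, references, threshold=0.5):
    """predictions: list of strings; references: list of lists of accepted answers."""
    total = 0.0
    for pred, golds in zip(predictions, references):
        p = pred.strip().lower()
        best = 0.0
        for gold in golds:
            g = gold.strip().lower()
            longest = max(len(p), len(g))
            nl = levenshtein(p, g) / longest if longest else 0.0
            # Answers at or beyond the threshold distance score zero.
            if nl < threshold:
                best = max(best, 1.0 - nl)
        total += best
    return total / len(predictions) if predictions else 0.0
```

This sketch returns a value in [0, 1]; the figures in the abstract appear to be reported on a 0 to 100 scale.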

[67] Ask Me Again Differently: GRAS for Measuring Bias in Vision Language Models on Gender, Race, Age, and Skin Tone

Shaivi Malik,Hasnat Md Abdullah,Sriparna Saha,Amit Sheth

Main category: cs.CV

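TL;DR: GRAS is a benchmark for evaluating bias in vision language models (VLMs) across gender, race, age, and skin tone, accompanied by an interpretable GRAS Bias Score. Across five state-of-the-art VLMs, bias levels are concerning: even the least biased model scores only 2 out of 100. The study also finds that evaluating bias with visual question answering (VQA) requires considering multiple formulations of a question.

Details Motivation: As VLMs become common in real-world applications, understanding their demographic biases is critical. A benchmark with comprehensive coverage for measuring these biases has been lacking, so the authors propose GRAS.

Contribution: 1) Proposes the GRAS benchmark, covering diversity in gender, race, age, and skin tone; 2) Designs the interpretable GRAS Bias Score; 3) Reveals significant bias by evaluating five VLMs; 4) Offers a methodological insight: bias evaluation with VQA must consider multiple formulations of a question.

Method: 1) Builds the GRAS dataset with diverse samples across gender, race, age, and skin tone; 2) Defines the GRAS Bias Score to quantify VLM bias across demographic attributes; 3) Evaluates the bias of five VLMs through VQA tasks.

Result: The five VLMs show concerning bias levels, with the least biased model attaining only 2 out of 100. Different formulations of a question also change the measured bias.

Insight: A single question formulation may not give the full picture when evaluating VLM bias; multiple formulations are needed for a comprehensive evaluation.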

Abstract: As Vision Language Models (VLMs) become integral to real-world applications, understanding their demographic biases is critical. We introduce GRAS, a benchmark for uncovering demographic biases in VLMs across gender, race, age, and skin tone, offering the most diverse coverage to date. We further propose the GRAS Bias Score, an interpretable metric for quantifying bias. We benchmark five state-of-the-art VLMs and reveal concerning bias levels, with the least biased model attaining a GRAS Bias Score of only 2 out of 100. Our findings also reveal a methodological insight: evaluating bias in VLMs with visual question answering (VQA) requires considering multiple formulations of a question. Our code, data, and evaluation results are publicly available.

[68] RoofSeg: An edge-aware transformer-based network for end-to-end roof plane segmentation

Siyuan You,Guozheng Xu,Pengwei Zhou,Qiwen Jin,Jian Yao,Li Li

Main category: cs.CV

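TL;DR: RoofSeg is an edge-aware transformer-based network for end-to-end roof plane segmentation from LiDAR point clouds. It addresses the low feature discriminability near edges and the insufficient use of planar geometric characteristics in current deep learning approaches.

Details Motivation: Most existing roof plane segmentation methods rely on manually designed or learned features followed by geometric clustering strategies. They are not end-to-end, have low feature discriminability near edges, and underuse planar geometric characteristics.

Contribution: Proposes RoofSeg, an end-to-end transformer network that combines an Edge-Aware Mask Module (EAMM), an adaptively weighted mask loss, and a new plane geometric loss to improve segmentation accuracy and edge discriminability.

Method: Uses a transformer encoder-decoder framework with learnable plane queries and the EAMM, together with an adaptively weighted mask loss and a plane geometric loss.

Result: RoofSeg is reported to outperform existing methods in end-to-end segmentation and in edge-region accuracy, improving roof plane segmentation performance.

Insight: Combining geometric priors with the global modeling ability of transformers can substantially improve point cloud segmentation, especially in edge regions.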

Abstract: Roof plane segmentation is one of the key procedures for reconstructing three-dimensional (3D) building models at levels of detail (LoD) 2 and 3 from airborne light detection and ranging (LiDAR) point clouds. The majority of current approaches for roof plane segmentation rely on manually designed or learned features followed by specifically designed geometric clustering strategies. Because learned features are more powerful than manually designed ones, deep learning-based approaches usually perform better than traditional approaches. However, current deep learning-based approaches have three unsolved problems. First, most of them are not truly end-to-end, so the plane segmentation results may not be optimal. Second, the point feature discriminability near the edges is relatively low, leading to inaccurate planar edges. Third, the planar geometric characteristics are not sufficiently considered to constrain the network training. To solve these issues, a novel edge-aware transformer-based network, named RoofSeg, is developed for segmenting roof planes from LiDAR point clouds in a truly end-to-end manner. In RoofSeg, we leverage a transformer encoder-decoder-based framework to hierarchically predict the plane instance masks with the use of a set of learnable plane queries. To further improve the segmentation accuracy of edge regions, we also design an Edge-Aware Mask Module (EAMM) that incorporates the planar geometric prior of edges to enhance their discriminability for plane instance mask refinement. In addition, we propose an adaptive weighting strategy in the mask loss to reduce the influence of misclassified points, and a new plane geometric loss to constrain the network training.

[69] MicroDetect-Net (MDN): Leveraging Deep Learning to Detect Microplastics in Clam Blood, a Step Towards Human Blood Analysis

Riju Marwah,Riya Arora,Navneet Yadav,Himank Arora

Main category: cs.CV

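TL;DR: MicroDetect-Net (MDN) is a deep learning model that combines fluorescence microscopy with Nile Red dye staining to detect microplastics in clam blood, laying groundwork for future human blood analysis.

Details Motivation: Microplastic pollution is growing and poses a potential threat to human health. Efficient, accurate detection methods are lacking, which motivates an automated deep learning approach.

Contribution: Proposes the MDN model, which detects microplastics in blood with high accuracy and offers a new tool for microplastic pollution research.

Method: Combines fluorescence microscopy, Nile Red staining, and a convolutional neural network, covering dataset preparation, fluorescence imaging, and segmentation.

Result: On 276 Nile Red-stained images, MDN reaches 92% accuracy, an IoU of 87.4%, and an F1 score of 92.1%.

Insight: Combining deep learning with conventional staining techniques can substantially improve microplastic detection efficiency and could be extended to human blood samples.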

Abstract: With plastic production exceeding 368 million tons yearly, microplastic pollution has grown to an extent where air, water, soil, and living organisms have all tested positive for microplastic presence. These particles, which are smaller than 5 millimeters in size, are no less harmful to humans than to the environment. Toxicity research on microplastics has shown that exposure may cause liver infection, intestinal injuries, and gut flora imbalance, leading to numerous potential health hazards. This paper presents a new model, MicroDetect-Net (MDN), which applies fluorescence microscopy with Nile Red dye staining and deep learning to scan blood samples for microplastics. Although clam blood has certain limitations in replicating real human blood, this study opens avenues for applying the approach to human samples, which are more consistent for preliminary data collection. The MDN model integrates dataset preparation, fluorescence imaging, and segmentation using a convolutional neural network to localize and count microplastic fragments. The combination of convolutional networks and Nile Red dye for segmentation produced strong detection performance. MDN was evaluated on a dataset of 276 Nile Red-stained fluorescent blood images and achieved an accuracy of 92 percent. Robust performance was observed with an Intersection over Union of 87.4 percent, F1 score of 92.1 percent, Precision of 90.6 percent, and Recall of 93.7 percent. These metrics demonstrate the effectiveness of MDN in the detection of microplastics.
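The reported precision, recall, and F1 score can be checked against each other, since F1 is the harmonic mean of precision and recall. The short sketch below does that arithmetic with the figures from the abstract; the helper name `f1_score` is illustrative and is not part of any MDN code.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both given as fractions)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Figures quoted in the abstract: precision 90.6 percent, recall 93.7 percent.
f1 = f1_score(0.906, 0.937)
print(f"F1 = {100 * f1:.1f} percent")  # prints "F1 = 92.1 percent"
```

The result agrees with the stated F1 score of 92.1 percent, so the three figures are internally consistent.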

[70] ProPy: Building Interactive Prompt Pyramids upon CLIP for Partially Relevant Video Retrieval

Yi Pan,Yujia Zhang,Michael Kampffmeyer,Xiaoguang Zhao

Main category: cs.CV

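TL;DR: ProPy builds on CLIP with an interactive Prompt Pyramid and an Ancestor-Descendant Interaction Mechanism, substantially improving partially relevant video retrieval (PRVR) and reaching SOTA on three public datasets.

Details Motivation: PRVR is a practical yet challenging task. Existing methods mainly process unimodal features, while powerful pretrained vision-language models such as CLIP remain underexplored in this field.

Contribution: 1. Proposes the Prompt Pyramid structure, which captures semantics through event prompts at multiple granularities. 2. Designs the Ancestor-Descendant Interaction Mechanism for dynamic semantic interaction among events.

Method: 1. Adapts the CLIP architecture specifically for PRVR. 2. Improves retrieval through multi-granularity event prompts and the dynamic interaction mechanism.

Result: Achieves SOTA performance on three public datasets, outperforming previous models by significant margins.

Insight: Multi-granularity semantic capture and dynamic interaction are key to improving PRVR performance.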

Abstract: Partially Relevant Video Retrieval (PRVR) is a practical yet challenging task that involves retrieving videos based on queries relevant to only specific segments. While existing works follow the paradigm of developing models to process unimodal features, powerful pretrained vision-language models like CLIP remain underexplored in this field. To bridge this gap, we propose ProPy, a model with systematic architectural adaption of CLIP specifically designed for PRVR. Drawing insights from the semantic relevance of multi-granularity events, ProPy introduces two key innovations: (1) A Prompt Pyramid structure that organizes event prompts to capture semantics at multiple granularity levels, and (2) An Ancestor-Descendant Interaction Mechanism built on the pyramid that enables dynamic semantic interaction among events. With these designs, ProPy achieves SOTA performance on three public datasets, outperforming previous models by significant margins. Code is available at https://github.com/BUAAPY/ProPy.

[71] GReAT: leveraging geometric artery data to improve wall shear stress assessment

Julian Suk,Jolanda J. Wentzel,Patryk Rygiel,Joost Daemen,Daniel Rueckert,Jelmer M. Wolterink

Main category: cs.CV

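TL;DR: GReAT uses large-scale geometric artery data with self-supervised pre-training to improve wall shear stress assessment, addressing the scarcity of medical image data.

Details Motivation: In cardiovascular health, machine learning on patient-specific medical images could bypass time-intensive fluid simulation, but sufficiently large datasets are lacking.

Contribution: Proposes learning representations through self-supervised pre-training on a dataset of geometric artery models to improve wall shear stress assessment in coronary arteries.

Method: Uses the heat kernel signature, a quantity obtained via Laplacian eigenvectors that captures the essence of vessel shape, as the self-supervised target.

Result: A model pre-trained on a large geometric artery dataset (8449 shapes) improves segmentation of wall shear stress regions on small-scale clinical trial data (49 patients).

Insight: Representations learned from geometric data can strengthen the analysis of small clinical datasets, offering a new data-driven approach for cardiovascular health.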

Abstract: Leveraging big data for patient care is promising in many medical fields such as cardiovascular health. For example, hemodynamic biomarkers like wall shear stress could be assessed from patient-specific medical images via machine learning algorithms, bypassing the need for time-intensive computational fluid simulation. However, it is extremely challenging to amass large-enough datasets to effectively train such models. We could address this data scarcity by means of self-supervised pre-training and foundation models given large datasets of geometric artery models. In the context of coronary arteries, leveraging learned representations to improve hemodynamic biomarker assessment has not yet been well studied. In this work, we address this gap by investigating whether a large dataset (8449 shapes) consisting of geometric models of 3D blood vessels can benefit wall shear stress assessment in coronary artery models from a small-scale clinical trial (49 patients). We create a self-supervised target for the 3D blood vessels by computing the heat kernel signature, a quantity obtained via Laplacian eigenvectors, which captures the very essence of the shapes. We show how geometric representations learned from this dataset can boost segmentation of coronary arteries into regions of low, mid and high (time-averaged) wall shear stress even when trained on limited data.
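For readers unfamiliar with the heat kernel signature, its standard definition from the shape analysis literature is given below. Here \lambda_i and \phi_i denote the eigenvalues and eigenfunctions of the Laplace-Beltrami operator on the shape, x is a surface point, and t is a diffusion time. How the authors discretize the Laplacian, truncate the sum, or choose the time scales is not stated in the abstract, so those details are left open.

```latex
\mathrm{HKS}(x, t) \;=\; \sum_{i \ge 0} e^{-\lambda_i t}\, \phi_i(x)^2
```

Small values of t emphasize local geometry around x, while large values capture more global structure, which is why a set of time scales is typically evaluated per point.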

[72] No Label Left Behind: A Unified Surface Defect Detection Model for all Supervision Regimes

Blaž Rolih,Matic Fučka,Danijel Skočaj

Main category: cs.CV

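TL;DR: SuperSimpleNet is an efficient, adaptable discriminative model that trains in all four supervision scenarios (unsupervised, weakly supervised, mixed supervision, and fully supervised). It combines synthetic anomaly generation, an enhanced classification head, and an improved learning procedure to deliver high performance and fast inference.

Details Motivation: Industrial surface defect detection needs high-performing, efficient, and adaptable models, but existing methods are usually limited to specific supervision scenarios and struggle with the diverse data annotations found in real-world manufacturing.

Contribution: Proposes SuperSimpleNet, the first model able to fully leverage all available data annotations, unifying the four supervision paradigms with strong performance and speed.

Method: Builds on SimpleNet, adding synthetic anomaly generation, an enhanced classification head, and an improved learning procedure to support efficient training across supervision scenarios.

Result: Performs strongly on four benchmark datasets with an inference time below 10 ms, outperforming existing methods.

Insight: By unifying supervision paradigms, SuperSimpleNet narrows the gap between academic research and industrial application and offers a practical solution to real manufacturing challenges.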

Abstract: Surface defect detection is a critical task across numerous industries, aimed at efficiently identifying and localising imperfections or irregularities on manufactured components. While numerous methods have been proposed, many fail to meet industrial demands for high performance, efficiency, and adaptability. Existing approaches are often constrained to specific supervision scenarios and struggle to adapt to the diverse data annotations encountered in real-world manufacturing processes, such as unsupervised, weakly supervised, mixed supervision, and fully supervised settings. To address these challenges, we propose SuperSimpleNet, a highly efficient and adaptable discriminative model built on the foundation of SimpleNet. SuperSimpleNet incorporates a novel synthetic anomaly generation process, an enhanced classification head, and an improved learning procedure, enabling efficient training in all four supervision scenarios, making it the first model capable of fully leveraging all available data annotations. SuperSimpleNet sets a new standard for performance across all scenarios, as demonstrated by its results on four challenging benchmark datasets. Beyond accuracy, it is very fast, achieving an inference time below 10 ms. With its ability to unify diverse supervision paradigms while maintaining outstanding speed and reliability, SuperSimpleNet represents a promising step forward in addressing real-world manufacturing challenges and bridging the gap between academic research and industrial applications. Code: https://github.com/blaz-r/SuperSimpleNet

[73] VibES: Induced Vibration for Persistent Event-Based Sensing

Vincenzo Polizzi,Stephen Yang,Quentin Clark,Jonathan Kelly,Igor Gilitschenski,David B. Lindell

Main category: cs.CV

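TL;DR: This paper proposes VibES, a lightweight approach that sustains event generation in event cameras by inducing periodic vibration, addressing the lack of events in static or low-motion scenes. Combined with a motion-compensation pipeline, it provides clean, motion-corrected events for downstream tasks.

Details Motivation: Event cameras generate no events in static or low-motion scenes, which limits their applicability. Prior approaches require complex hardware or additional optical components; this work aims for a lightweight solution.

Contribution: 1. Proposes a lightweight way to induce events with a simple vibration mechanism (a rotating unbalanced mass); 2. Develops a motion-compensation pipeline that yields clean event data; 3. Validates the approach on a hardware prototype and real-world datasets.

Method: 1. Induces periodic vibration with a rotating unbalanced mass; 2. Removes the injected motion with a motion-compensation algorithm; 3. Evaluates the resulting events on downstream tasks such as image reconstruction and edge detection.

Result: Experiments show that the method reliably recovers motion parameters and outperforms event-based sensing without motion induction on image reconstruction and edge detection.

Insight: Lightweight mechanical vibration is an effective way to extend the applicability of event cameras without complex hardware.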

Abstract: Event cameras are a bio-inspired class of sensors that asynchronously measure per-pixel intensity changes. Under fixed illumination conditions in static or low-motion scenes, rigidly mounted event cameras are unable to generate any events, becoming unsuitable for most computer vision tasks. To address this limitation, recent work has investigated motion-induced event stimulation that often requires complex hardware or additional optical components. In contrast, we introduce a lightweight approach to sustain persistent event generation by employing a simple rotating unbalanced mass to induce periodic vibrational motion. This is combined with a motion-compensation pipeline that removes the injected motion and yields clean, motion-corrected events for downstream perception tasks. We demonstrate our approach with a hardware prototype and evaluate it on real-world captured datasets. Our method reliably recovers motion parameters and improves both image reconstruction and edge detection over event-based sensing without motion induction.

[74] Few-Shot Connectivity-Aware Text Line Segmentation in Historical Documents

Rafael Sterzinger,Tingyu Lin,Robert Sablatnig

Main category: cs.CV

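TL;DR: This paper proposes a text line segmentation method for historical documents based on a lightweight UNet++ and a topology-aware loss, achieving strong segmentation with few annotated samples.

Details Motivation: Text line segmentation in historical documents usually needs large annotated datasets, but annotation is costly and requires expert knowledge. The paper therefore explores few-shot learning to reduce data requirements.

Contribution: 1. Uses a lightweight UNet++ architecture; 2. Pairs it with a connectivity-aware loss that reduces structural errors such as line fragmentation and unintended line merges; 3. Significantly improves performance with few annotations.

Method: 1. A lightweight UNet++ model; 2. A connectivity-aware loss; 3. Training on small patches extracted from only three annotated pages per manuscript.

Result: On the U-DIADS-TL dataset, Recognition Accuracy increases by 200% and Line Intersection over Union by 75%; the F-Measure is on par with the DIVA-HisDB competition winner.

Insight: A lightweight architecture and a topology-aware loss handle few-shot historical document segmentation effectively, suggesting a direction for other tasks with scarce annotations.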

Abstract: A foundational task for the digital analysis of documents is text line segmentation. However, automating this process with deep learning models is challenging because it requires large, annotated datasets that are often unavailable for historical documents. Additionally, the annotation process is a labor- and cost-intensive task that requires expert knowledge, which makes few-shot learning a promising direction for reducing data requirements. In this work, we demonstrate that small and simple architectures, coupled with a topology-aware loss function, are more accurate and data-efficient than more complex alternatives. We pair a lightweight UNet++ with a connectivity-aware loss, initially developed for neuron morphology, which explicitly penalizes structural errors like line fragmentation and unintended line merges. To increase our limited data, we train on small patches extracted from a mere three annotated pages per manuscript. Our methodology significantly improves upon the current state-of-the-art on the U-DIADS-TL dataset, with a 200% increase in Recognition Accuracy and a 75% increase in Line Intersection over Union. Our method also achieves an F-Measure score on par with or even exceeding that of the competition winner of the DIVA-HisDB baseline detection task, all while requiring only three annotated pages, exemplifying the efficacy of our approach. Our implementation is publicly available at: https://github.com/RafaelSterzinger/acpr_few_shot_hist.

[75] Dual Enhancement on 3D Vision-Language Perception for Monocular 3D Visual Grounding

Yuzhen Li,Min Liu,Yuan Bian,Xueping Wang,Zhaoyang Li,Gen Li,Yaonan Wang

Main category: cs.CV

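TL;DR: To address language models' weak handling of measurement units in monocular 3D visual grounding, this paper proposes two methods to enhance 3D perception, 3D-text Enhancement (3DTE) and Text-Guided Geometry Enhancement (TGE), which substantially improve performance.

Details Motivation: Pre-trained language models pay too little attention to measurement units (such as meters or centimeters) in 3D visual grounding, which degrades performance, so the model's 3D perception of text and geometry features needs to be enhanced.

Contribution: 1. Proposes 3DTE, which improves comprehension of the mapping between units by augmenting the diversity of distance descriptors in text queries; 2. Designs the TGE module, which projects text features into a geometrically consistent space to enhance 3D-text information.

Method: 3DTE improves unit comprehension through data augmentation; TGE aligns text features with geometry features and uses the enhanced text features to guide the attention of geometry features.

Result: Achieves new SOTA results on the Mono3DRefer dataset, with an accuracy gain of 11.94% in the "Far" scenario.

Insight: Weak handling of measurement units is a key factor limiting 3D visual grounding; strengthening the interplay between text and geometry features significantly improves performance.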

Abstract: Monocular 3D visual grounding is a novel task that aims to locate 3D objects in RGB images using text descriptions with explicit geometry information. Despite the inclusion of geometry details in the text, we observe that the text embeddings are sensitive to the magnitude of numerical values but largely ignore the associated measurement units. For example, simply converting a length expressed in “meters” into the equivalent value in “decimeters” or “centimeters” leads to severe performance degradation, even though the physical length remains the same. This observation indicates the weak 3D comprehension of the pre-trained language model, which generates misleading text features that hinder 3D perception. Therefore, we propose to enhance the model's 3D perception of text embeddings and geometry features with two simple and effective methods. Firstly, we introduce a pre-processing method named 3D-text Enhancement (3DTE), which enhances the comprehension of mapping relationships between different units by augmenting the diversity of distance descriptors in text queries. Next, we propose a Text-Guided Geometry Enhancement (TGE) module to further enhance the 3D-text information by projecting the basic text features into a geometrically consistent space. These 3D-enhanced text features are then leveraged to precisely guide the attention of geometry features. We evaluate the proposed method through extensive comparisons and ablation studies on the Mono3DRefer dataset. Experimental results demonstrate substantial improvements over previous methods, achieving new state-of-the-art results with a notable accuracy gain of 11.94% in the “Far” scenario. Our code will be made publicly available.
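The unit-equivalence idea behind 3DTE can be illustrated with a small text augmentation. The sketch below rewrites distances given in meters into an equivalent value in another unit, so the same physical length appears under several descriptors. It is a hypothetical illustration only: the function `augment_units`, the regular expression, and the unit table are assumptions, and the paper's actual pre-processing is not described at this level of detail in the abstract.

```python
import random
import re

# Multipliers that convert a value in meters into the target unit.
UNIT_FACTORS = {"meters": 1, "decimeters": 10, "centimeters": 100}

# Matches a number followed by "m", "meter", or "meters" (assumed query format).
_METERS = re.compile(r"(\d+(?:\.\d+)?)\s*(?:meters?|m)\b")

def augment_units(query, unit=None, rng=random):
    """Rewrite every distance in meters as the same length in `unit`."""
    unit = unit or rng.choice(list(UNIT_FACTORS))
    factor = UNIT_FACTORS[unit]

    def convert(match):
        value = float(match.group(1)) * factor
        return f"{value:g} {unit}"  # :g drops trailing zeros, e.g. 1250.0 -> 1250

    return _METERS.sub(convert, query)

# The same physical distance expressed with a different descriptor.
augmented = augment_units("the car about 12.5 meters away", unit="centimeters")
# augmented == "the car about 1250 centimeters away"
```

Sampling a different unit per training query is one plausible way to expose a text encoder to equivalent descriptions of the same length.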

[76] Beyond flattening: a geometrically principled positional encoding for vision transformers with Weierstrass elliptic functions

Zhihang Xin,Xitong Hu,Rui Wang

Main category: cs.CV

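TL;DR: The paper proposes a positional encoding based on Weierstrass elliptic functions (WEF-PE) to address the disruption of spatial structure caused by learnable one-dimensional positional embeddings in conventional Vision Transformers, using the double periodicity of elliptic functions to better capture the translational invariance of visual data.

Details Motivation: Conventional Vision Transformers destroy the two-dimensional spatial structure of images by flattening them, and existing positional encodings lack geometric constraints and cannot exploit spatial proximity priors effectively. The paper seeks a mathematically principled positional encoding that better fits the properties of visual data.

Contribution: Proposes WEF-PE, a positional encoding based on Weierstrass elliptic functions that represents two-dimensional coordinates in the complex domain and uses the non-linear geometric nature of elliptic functions and their algebraic addition formula to encode spatial distance relationships naturally and to derive relative positional information directly.

Method: Uses the double periodicity of Weierstrass elliptic functions and a complex domain representation to handle two-dimensional coordinates directly; encodes absolute and relative positional information through the non-linear properties and addition formula of elliptic functions.

Result: Reaches 63.78% accuracy on CIFAR-100 when training ViT-Tiny from scratch and 93.28% on CIFAR-100 when fine-tuning ViT-Base, with consistent improvements on VTAB-1k benchmark tasks. Theoretical analysis and attention visualization support its geometric inductive bias and semantic focus.

Insight: The double periodicity of elliptic functions aligns well with the translational invariance of visual data, which gives their use in positional encoding a mathematical basis; their non-linear geometric nature models spatial distance relationships more naturally.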

Abstract: Vision Transformers have demonstrated remarkable success in computer vision tasks, yet their reliance on learnable one-dimensional positional embeddings fundamentally disrupts the inherent two-dimensional spatial structure of images through patch flattening procedures. Traditional positional encoding approaches lack geometric constraints and fail to establish monotonic correspondence between Euclidean spatial distances and sequential index distances, thereby limiting the model’s capacity to leverage spatial proximity priors effectively. We propose Weierstrass Elliptic Function Positional Encoding (WEF-PE), a mathematically principled approach that directly addresses two-dimensional coordinates through natural complex domain representation, where the doubly periodic properties of elliptic functions align remarkably with translational invariance patterns commonly observed in visual data. Our method exploits the non-linear geometric nature of elliptic functions to encode spatial distance relationships naturally, while the algebraic addition formula enables direct derivation of relative positional information between arbitrary patch pairs from their absolute encodings. Comprehensive experiments demonstrate that WEF-PE achieves superior performance across diverse scenarios, including 63.78% accuracy on CIFAR-100 from-scratch training with ViT-Tiny architecture, 93.28% on CIFAR-100 fine-tuning with ViT-Base, and consistent improvements on VTAB-1k benchmark tasks. Theoretical analysis confirms the distance-decay property through rigorous mathematical proof, while attention visualization reveals enhanced geometric inductive bias and more coherent semantic focus compared to conventional approaches. The source code implementing the methods described in this paper is publicly available on GitHub.
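The two properties the abstract relies on, double periodicity and the addition formula, are standard facts about the Weierstrass function \wp for a lattice \Lambda and can be stated compactly. The mapping of a patch at grid position (x, y) to a complex number z = x + iy is an assumption made here for illustration; the abstract does not give the exact scaling or lattice the authors use.

```latex
% Definition on a lattice \Lambda \subset \mathbb{C}
\wp(z) = \frac{1}{z^{2}} + \sum_{\omega \in \Lambda \setminus \{0\}}
         \left( \frac{1}{(z-\omega)^{2}} - \frac{1}{\omega^{2}} \right)

% Double periodicity
\wp(z + \omega) = \wp(z) \qquad \text{for all } \omega \in \Lambda

% Addition formula, valid for z \not\equiv \pm w \pmod{\Lambda}
\wp(z + w) = \frac{1}{4}
             \left( \frac{\wp'(z) - \wp'(w)}{\wp(z) - \wp(w)} \right)^{2}
             - \wp(z) - \wp(w)
```

Because \wp is even and \wp' is odd, applying the addition formula to z and -w expresses \wp(z - w) in terms of the values at z and w alone, which is the sense in which relative information can be derived from absolute encodings.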

[77] SoccerNet 2025 Challenges Results

Silvio Giancola,Anthony Cioppa,Marc Gutiérrez-Pérez,Jan Held,Carlos Hinojosa,Victor Joos,Arnaud Leduc,Floriane Magera,Karen Sanchez,Vladimir Somers,Artur Xarles,Antonio Agudo,Alexandre Alahi,Olivier Barnich,Albert Clapés,Christophe De Vleeschouwer,Sergio Escalera,Bernard Ghanem,Thomas B. Moeslund,Marc Van Droogenbroeck,Tomoki Abe,Saad Alotaibi,Faisal Altawijri,Steven Araujo,Xiang Bai,Xiaoyang Bi,Jiawang Cao,Vanyi Chao,Kamil Czarnogórski,Fabian Deuser,Mingyang Du,Tianrui Feng,Patrick Frenzel,Mirco Fuchs,Jorge García,Konrad Habel,Takaya Hashiguchi,Sadao Hirose,Xinting Hu,Yewon Hwang,Ririko Inoue,Riku Itsuji,Kazuto Iwai,Hongwei Ji,Yangguang Ji,Licheng Jiao,Yuto Kageyama,Yuta Kamikawa,Yuuki Kanasugi,Hyungjung Kim,Jinwook Kim,Takuya Kurihara,Bozheng Li,Lingling Li,Xian Li,Youxing Lian,Dingkang Liang,Hongkai Lin,Jiadong Lin,Jian Liu,Liang Liu,Shuaikun Liu,Zhaohong Liu,Yi Lu,Federico Méndez,Huadong Ma,Wenping Ma,Jacek Maksymiuk,Henry Mantilla,Ismail Mathkour,Daniel Matthes,Ayaha Motomochi,Amrulloh Robbani Muhammad,Haruto Nakayama,Joohyung Oh,Yin May Oo,Marcelo Ortega,Norbert Oswald,Rintaro Otsubo,Fabian Perez,Mengshi Qi,Cristian Rey,Abel Reyes-Angulo,Oliver Rose,Hoover Rueda-Chacón,Hideo Saito,Jose Sarmiento,Kanta Sawafuji,Atom Scott,Xi Shen,Pragyan Shrestha,Jae-Young Sim,Long Sun,Yuyang Sun,Tomohiro Suzuki,Licheng Tang,Masato Tonouchi,Ikuma Uchida,Henry O. Velesaca,Tiancheng Wang,Rio Watanabe,Jay Wu,Yongliang Wu,Shunzo Yamagishi,Di Yang,Xu Yang,Yuxin Yang,Hao Ye,Xinyu Ye,Calvin Yeung,Xuanlong Yu,Chao Zhang,Dingyuan Zhang,Kexing Zhang,Zhe Zhao,Xin Zhou,Wenbo Zhu,Julian Ziegler

Main category: cs.CV

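TL;DR: The SoccerNet 2025 Challenges are the fifth edition of the open computer vision benchmark for football video understanding. They comprise four tasks (Team Ball Action Spotting, Monocular Depth Estimation, Multi-View Foul Recognition, and Game State Reconstruction) and provide large-scale annotated datasets and unified evaluation protocols.

Details Motivation: To advance computer vision research in football video analysis and to provide an open, reproducible benchmarking platform.

Contribution: Provides four challenging tasks, large-scale annotated datasets, unified evaluation protocols, and strong baselines, driving community progress.

Method: Participants apply deep learning methods suited to each task (action spotting, depth estimation, multi-view analysis, and game state reconstruction).

Result: Reports the top-performing solutions and community progress for each task, showing the current state of football video understanding.

Insight: The SoccerNet Challenges have fostered interdisciplinary research between computer vision and sports analytics and show the value of open benchmarks.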

Abstract: The SoccerNet 2025 Challenges mark the fifth annual edition of the SoccerNet open benchmarking effort, dedicated to advancing computer vision research in football video understanding. This year’s challenges span four vision-based tasks: (1) Team Ball Action Spotting, focused on detecting ball-related actions in football broadcasts and assigning actions to teams; (2) Monocular Depth Estimation, targeting the recovery of scene geometry from single-camera broadcast clips through relative depth estimation for each pixel; (3) Multi-View Foul Recognition, requiring the analysis of multiple synchronized camera views to classify fouls and their severity; and (4) Game State Reconstruction, aimed at localizing and identifying all players from a broadcast video to reconstruct the game state on a 2D top-view of the field. Across all tasks, participants were provided with large-scale annotated datasets, unified evaluation protocols, and strong baselines as starting points. This report presents the results of each challenge, highlights the top-performing solutions, and provides insights into the progress made by the community. The SoccerNet Challenges continue to serve as a driving force for reproducible, open research at the intersection of computer vision, artificial intelligence, and sports. Detailed information about the tasks, challenges, and leaderboards can be found at https://www.soccer-net.org, with baselines and development kits available at https://github.com/SoccerNet.

[78] All-in-One Slider for Attribute Manipulation in Diffusion Models

Weixin Ye,Hongguang Zhu,Wei Wang,Yahui Liu,Mengyu Wang

Main category: cs.CV

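TL;DR: This paper proposes the All-in-One Slider, a lightweight module for unified manipulation of multiple attributes in diffusion models, addressing the parameter redundancy and limited flexibility of conventional One-for-One approaches.

Details Motivation: Text-to-image (T2I) diffusion models have made significant progress in generating high-quality images, but progressively manipulating the attributes of generated images to meet user expectations remains challenging. Conventional approaches train an independent slider module for each attribute, which causes parameter redundancy and limits flexibility.

Contribution: 1. Proposes the All-in-One Slider module, which decomposes the text embedding space into sparse, semantically meaningful attribute directions to support unified multi-attribute manipulation. 2. Supports zero-shot manipulation of unseen attributes and composition of multiple attributes, improving scalability and flexibility. 3. Can be integrated with an inversion framework to manipulate attributes of real images.

Method: 1. Designs a lightweight module that decomposes the text embedding space into sparse semantic attribute directions. 2. Trains it to act as a general-purpose slider with continuous, fine-grained attribute control. 3. Recombines the learned directions to support zero-shot manipulation and attribute composition.

Result: Experiments show accurate and scalable attribute manipulation with notable improvements over previous methods. The method also supports attribute manipulation on real images, broadening its applicability.

Insight: Sparse direction decomposition and zero-shot support offer a new approach to attribute manipulation in diffusion models and show the potential of lightweight modules in complex tasks.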

Abstract: Text-to-image (T2I) diffusion models have made significant strides in generating high-quality images. However, progressively manipulating certain attributes of generated images to meet the desired user expectations remains challenging, particularly for content with rich details, such as human faces. Some studies have attempted to address this by training slider modules. However, they follow a One-for-One manner, where an independent slider is trained for each attribute, requiring additional training whenever a new attribute is introduced. This not only results in parameter redundancy accumulated by sliders but also restricts the flexibility of practical applications and the scalability of attribute manipulation. To address this issue, we introduce the All-in-One Slider, a lightweight module that decomposes the text embedding space into sparse, semantically meaningful attribute directions. Once trained, it functions as a general-purpose slider, enabling interpretable and fine-grained continuous control over various attributes. Moreover, by recombining the learned directions, the All-in-One Slider supports zero-shot manipulation of unseen attributes (e.g., races and celebrities) and the composition of multiple attributes. Extensive experiments demonstrate that our method enables accurate and scalable attribute manipulation, achieving notable improvements compared to previous methods. Furthermore, our method can be extended to integrate with the inversion framework to perform attribute manipulation on real images, broadening its applicability to various real-world scenarios. The code and trained model will be released at: https://github.com/ywxsuperstar/KSAE-FaceSteer.
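The idea of steering generation along attribute directions can be made concrete with a toy example. The sketch below shifts a text embedding along named direction vectors by user-chosen strengths and composes several attributes by summing the shifts. It is a generic illustration under stated assumptions: `apply_slider`, the direction names, and the three-dimensional vectors are hypothetical, and the way the All-in-One Slider actually learns and applies its sparse directions is not specified in the abstract.

```python
def apply_slider(embedding, directions, strengths):
    """Shift an embedding along attribute directions.

    embedding:  list of floats (a text embedding, flattened for simplicity)
    directions: dict mapping attribute name -> direction vector (same length)
    strengths:  dict mapping attribute name -> signed slider value
    """
    shifted = list(embedding)
    for name, alpha in strengths.items():
        for i, component in enumerate(directions[name]):
            shifted[i] += alpha * component
    return shifted

# Toy directions; real ones would be learned and far higher-dimensional.
directions = {"age": [0.0, 1.0, 0.0], "smile": [0.0, 0.0, 1.0]}
edited = apply_slider([1.0, 0.0, 0.0], directions, {"age": 0.5, "smile": -2.0})
# edited == [1.0, 0.5, -2.0]
```

Varying a single strength continuously while holding the others at zero corresponds to moving one slider, and setting several at once corresponds to attribute composition.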

[79] LSD-3D: Large-Scale 3D Driving Scene Generation with Geometry Grounding

Julian Ost,Andrea Ramazzina,Amogh Joshi,Maximilian Bömer,Mario Bijelic,Felix Heide

Main category: cs.CV

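TL;DR: LSD-3D proposes a method for generating large-scale 3D driving scenes that combines the generation of a proxy geometry and environment representation with score distillation from 2D image priors, achieving high-quality geometric consistency and controllability.

Details Motivation: Among existing methods, neural reconstruction is limited to static environments with limited scene control, while diffusion-based generation lacks geometry grounding and causality. LSD-3D aims to bridge this gap.

Contribution: Proposes a method that directly generates large-scale 3D driving scenes with accurate geometry and causal novel view synthesis, combining proxy geometry generation with distillation from 2D image priors.

Method: Generates a proxy geometry and environment representation and combines it with score distillation from 2D image priors to produce high-quality, controllable 3D scenes.

Result: Generates complex driving scenes with consistent geometry and high-fidelity texture and structure, and can be conditioned on map layouts.

Insight: Combining geometry generation with distillation from 2D priors yields high controllability and geometric consistency in 3D scene generation, offering a new route to large-scale scene data.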

Abstract: Large-scale scene data is essential for training and testing in robot learning. Neural reconstruction methods have promised the capability of reconstructing large physically-grounded outdoor scenes from captured sensor data. However, these methods have baked-in static environments and only allow for limited scene control: they are functionally constrained in scene and trajectory diversity by the captures from which they are reconstructed. In contrast, generating driving data with recent image or video diffusion models offers control, but at the cost of geometry grounding and causality. In this work, we aim to bridge this gap and present a method that directly generates large-scale 3D driving scenes with accurate geometry, allowing for causal novel view synthesis with object permanence and explicit 3D geometry estimation. The proposed method combines the generation of a proxy geometry and environment representation with score distillation from learned 2D image priors. We find that this approach allows for high controllability, enabling prompt-guided geometry and high-fidelity texture and structure that can be conditioned on map layouts, producing realistic and geometrically consistent 3D generations of complex driving scenes.

[80] OmniHuman-1.5: Instilling an Active Mind in Avatars via Cognitive Simulation

Jianwen Jiang,Weihong Zeng,Zerong Zheng,Jiaqi Yang,Chao Liang,Wang Liao,Han Liang,Yuan Zhang,Mingyuan Gao

Main category: cs.CV

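TL;DR: OmniHuman-1.5 is a framework that uses cognitive simulation to give avatars an active mind, generating animations that are semantically coherent and expressive.

Details Motivation: Existing video avatar models animate from low-level cues such as audio rhythm and lack high-level semantic understanding of emotion, intent, or context.

Contribution: 1. Uses Multimodal Large Language Models to synthesize structured textual conditions; 2. Introduces a Multimodal DiT architecture with a Pseudo Last Frame design for effective fusion of multimodal inputs.

Method: A Multimodal Large Language Model provides high-level semantic guidance, and a specially designed Multimodal DiT architecture generates the animation.

Result: The model achieves leading performance in lip-sync accuracy, video quality, motion naturalness, and semantic consistency, and extends to complex scenarios involving multi-person and non-human subjects.

Insight: Combining multimodal semantic guidance with motion generation can markedly improve the expressiveness and semantic coherence of avatar animation.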

Abstract: Existing video avatar models can produce fluid human animations, yet they struggle to move beyond mere physical likeness to capture a character’s authentic essence. Their motions typically synchronize with low-level cues like audio rhythm, lacking a deeper semantic understanding of emotion, intent, or context. To bridge this gap, we propose a framework designed to generate character animations that are not only physically plausible but also semantically coherent and expressive. Our model, OmniHuman-1.5, is built upon two key technical contributions. First, we leverage Multimodal Large Language Models to synthesize a structured textual representation of conditions that provides high-level semantic guidance. This guidance steers our motion generator beyond simplistic rhythmic synchronization, enabling the production of actions that are contextually and emotionally resonant. Second, to ensure the effective fusion of these multimodal inputs and mitigate inter-modality conflicts, we introduce a specialized Multimodal DiT architecture with a novel Pseudo Last Frame design. The synergy of these components allows our model to accurately interpret the joint semantics of audio, images, and text, thereby generating motions that are deeply coherent with the character, scene, and linguistic content. Extensive experiments demonstrate that our model achieves leading performance across a comprehensive set of metrics, including lip-sync accuracy, video quality, motion naturalness and semantic consistency with textual prompts. Furthermore, our approach shows remarkable extensibility to complex scenarios, such as those involving multi-person and non-human subjects. Homepage: https://omnihuman-lab.github.io/v1_5/

[81] Autoregressive Universal Video Segmentation Model

Miran Heo,Sukjun Hwang,Min-Hung Chen,Yu-Chiang Frank Wang,Albert Gu,Seon Joo Kim,Ryo Hachiuma

Main category: cs.CV

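TL;DR: AUSM is a unified video segmentation model that casts prompted and unprompted video segmentation as autoregressive mask prediction, enabling efficient parallel training and strong performance.

Details Motivation: Video segmentation is fragmented across task-specific models and pipelines, and unprompted video segmentation in particular lacks a unified solution.

Contribution: Proposes AUSM, which unifies prompted and unprompted video segmentation and builds on state-space models for efficient autoregressive mask prediction.

Method: Autoregressive mask prediction with a fixed-size spatial state that handles video streams of arbitrary length, with all components designed for parallel training across frames.

Result: Outperforms prior universal streaming video segmentation methods on several standard benchmarks and trains up to 2.5x faster on 16-frame sequences.

Insight: Framing video segmentation as sequence prediction, analogous to language modeling, is an effective and scalable unifying framework for complex streaming video settings.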

Abstract: Recent video foundation models such as SAM2 excel at prompted video segmentation by treating masks as a general-purpose primitive. However, many real-world settings require unprompted segmentation that aims to detect and track all objects in a video without external cues, leaving today’s landscape fragmented across task-specific models and pipelines. We recast streaming video segmentation as sequential mask prediction, analogous to language modeling, and introduce the Autoregressive Universal Segmentation Model (AUSM), a single architecture that unifies both prompted and unprompted video segmentation. Built on recent state-space models, AUSM maintains a fixed-size spatial state and scales to video streams of arbitrary length. Furthermore, all components of AUSM are designed for parallel training across frames, yielding substantial speedups over iterative training. On standard benchmarks (DAVIS17, YouTube-VOS 2018 & 2019, MOSE, YouTube-VIS 2019 & 2021, and OVIS) AUSM outperforms prior universal streaming video segmentation methods and achieves up to 2.5x faster training on 16-frame sequences.

[82] Articulate3D: Zero-Shot Text-Driven 3D Object Posing

Oishi Deb,Anjun Hu,Ashkan Khakzar,Philip Torr,Christian Rupprecht

Main category: cs.CV

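TL;DR: Articulate3D is a training-free, zero-shot method that poses 3D assets through language control, using an image generator and multi-view pose optimisation.

Details Motivation: Despite advances in vision and language models, posing 3D objects through language control remains challenging. This method addresses that problem.

Contribution: Proposes the training-free Articulate3D method, which combines a self-attention rewiring mechanism (RSActrl) with keypoint matching for language-driven 3D posing.

Method: 1. Modify an image generator to create target images; 2. Decouple structure from pose with the self-attention rewiring mechanism (RSActrl); 3. Use keypoints for multi-view pose optimisation.

Result: Experiments show the method works across diverse 3D objects and free-form text prompts; in a user study it was preferred over existing methods more than 85% of the time.

Insight: Differentiable rendering is an unreliable signal for pose optimisation, and keypoint matching works better; the self-attention mechanism helps preserve structural consistency.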

Abstract: We propose a training-free method, Articulate3D, to pose a 3D asset through language control. Despite advances in vision and language models, this task remains surprisingly challenging. To achieve this goal, we decompose the problem into two steps. We modify a powerful image generator to create target images conditioned on the input image and a text instruction. We then align the mesh to the target images through a multi-view pose optimisation step. In detail, we introduce a self-attention rewiring mechanism (RSActrl) that decouples the source structure from pose within an image generative model, allowing it to maintain a consistent structure across varying poses. We observed that differentiable rendering is an unreliable signal for articulation optimisation; instead, we use keypoints to establish correspondences between input and target images. The effectiveness of Articulate3D is demonstrated across a diverse range of 3D objects and free-form text prompts, successfully manipulating poses while maintaining the original identity of the mesh. Quantitative evaluations and a comparative user study, in which our method was preferred over 85% of the time, confirm its superiority over existing approaches. Project page: https://odeb1.github.io/articulate3d_page_deb/

eess.IV [Back]

[83] Analise de Desaprendizado de Maquina em Modelos de Classificacao de Imagens Medicas

Andreza M. C. Falcao,Filipe R. Cordeiro

Main category: eess.IV

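TL;DR: The paper studies machine unlearning with SalUn in medical image classification models. Experiments show performance close to a fully retrained model, and the impact of data augmentation is analysed.

Details Motivation: Despite recent advances, machine unlearning has not been explored for medical image classification, where the privacy and sensitivity of medical data make it an important research direction.

Contribution: Applies the SalUn unlearning technique to medical image classification, validates it on several datasets, and studies how data augmentation affects unlearning quality.

Method: Runs unlearning experiments with SalUn on the PathMNIST, OrganAMNIST, and BloodMNIST datasets and compares the results against fully retrained models.

Result: SalUn performs close to full retraining, indicating an efficient solution for medical applications.

Insight: Data augmentation can further improve unlearning quality, which may open additional technical routes for privacy protection.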

Abstract: Machine unlearning aims to remove private or sensitive data from a pre-trained model while preserving the model’s robustness. Despite recent advances, this technique has not been explored in medical image classification. This work evaluates the SalUn unlearning model by conducting experiments on the PathMNIST, OrganAMNIST, and BloodMNIST datasets. We also analyse the impact of data augmentation on the quality of unlearning. Results show that SalUn achieves performance close to full retraining, indicating an efficient solution for use in medical applications.

[84] A Closer Look at Edema Area Segmentation in SD-OCT Images Using Adversarial Framework

Yuhui Tao,Yizhe Zhang,Qiang Chen

Main category: eess.IV

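TL;DR: The paper enhances an adversarial framework with a retinal-layer-structure-guided post-processing step and a test-time adaptation strategy to improve weakly-supervised segmentation of edema areas in SD-OCT images.

Details Motivation: Anomaly-detection-based weakly-supervised methods still lag behind fully-supervised ones on edema area segmentation, while retinal layer structure is strongly correlated with edema areas. The paper exploits this correlation to improve segmentation.

Contribution: 1) A novel retinal-layer-structure-guided post-processing step; 2) a test-time adaptation strategy; 3) experiments that validate the method and narrow the gap between weakly-supervised and fully-supervised models.

Method: 1) Combine the adversarial framework with retinal layer information, reframing edema area segmentation as confirming intersection points between the edema contour and retinal layers; 2) use test-time adaptation to address discrepancies between training and test sets.

Result: Experiments on two public datasets show that the method improves the accuracy and robustness of edema area segmentation.

Insight: Bringing in domain knowledge (retinal layer structure) and a dynamic adaptation strategy (TTA) can effectively improve weakly-supervised models for medical image segmentation.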

Abstract: The development of artificial intelligence models for macular edema (ME) analysis always relies on expert-annotated pixel-level image datasets which are expensive to collect prospectively. While anomaly-detection-based weakly-supervised methods have shown promise in edema area (EA) segmentation task, their performance still lags behind fully-supervised approaches. In this paper, we leverage the strong correlation between EA and retinal layers in spectral-domain optical coherence tomography (SD-OCT) images, along with the update characteristics of weakly-supervised learning, to enhance an off-the-shelf adversarial framework for EA segmentation with a novel layer-structure-guided post-processing step and a test-time-adaptation (TTA) strategy. By incorporating additional retinal layer information, our framework reframes the dense EA prediction task as one of confirming intersection points between the EA contour and retinal layers, resulting in predictions that better align with the shape prior of EA. Besides, the TTA framework further helps address discrepancies in the manifestations and presentations of EA between training and test sets. Extensive experiments on two publicly available datasets demonstrate that these two proposed ingredients can improve the accuracy and robustness of EA segmentation, bridging the gap between weakly-supervised and fully-supervised models.

[85] Understanding Benefits and Pitfalls of Current Methods for the Segmentation of Undersampled MRI Data

Jan Nikolas Morshuis,Matthias Hein,Christian F. Baumgartner

Main category: eess.IV

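TL;DR: The paper provides the first unified benchmark for the segmentation of undersampled MRI data, comparing 7 approaches with a focus on one-stage (joint reconstruction and segmentation) versus two-stage (reconstruct, then segment) methods, and finds that simple two-stage methods perform best.

Details Motivation: MRI acquisition is slow and costly, and undersampling is used to accelerate it, but most methods have not been compared directly and unified evaluation standards are missing. The study fills this gap and looks for the best segmentation strategy.

Contribution: 1. The first unified benchmark for segmenting undersampled MRI data; 2. A comparison of 7 approaches, covering one-stage and two-stage methods; 3. The finding that simple two-stage methods outperform complex specialized ones.

Method: Seven approaches are tested on two MRI datasets that include multi-coil k-space data and human-annotated segmentation ground truth, with a focus on the performance difference between one-stage and two-stage methods.

Result: Simple two-stage methods that consider data consistency achieve the best segmentation scores, surpassing complex methods developed specifically for this task.

Insight: The study highlights the importance of data consistency when segmenting undersampled MRI data and offers practical guidance for future method design.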

Abstract: MR imaging is a valuable diagnostic tool that allows non-invasive visualization of patient anatomy and pathology with high soft-tissue contrast. However, MRI acquisition is typically time-consuming, leading to patient discomfort and increased costs to the healthcare system. Recent years have seen substantial research effort into the development of methods that allow for accelerated MRI acquisition while still obtaining a reconstruction that appears similar to the fully-sampled MR image. However, for many applications a perfectly reconstructed MR image may not be necessary, particularly when the primary goal is a downstream task such as segmentation. This has led to growing interest in methods that aim to perform segmentation directly on accelerated MRI data. Despite recent advances, existing methods have largely been developed in isolation, without direct comparison to one another, often using separate or private datasets, and lacking unified evaluation standards. To date, no high-quality, comprehensive comparison of these methods exists, and the optimal strategy for segmenting accelerated MR data remains unknown. This paper provides the first unified benchmark for the segmentation of undersampled MRI data comparing 7 approaches. A particular focus is placed on comparing one-stage approaches, which combine reconstruction and segmentation into a unified model, with two-stage approaches, which utilize established MRI reconstruction methods followed by a segmentation network. We test these methods on two MRI datasets that include multi-coil k-space data as well as a human-annotated segmentation ground-truth. We find that simple two-stage methods that consider data-consistency lead to the best segmentation scores, surpassing complex specialized methods that are developed specifically for this task.

[86] RDDM: Practicing RAW Domain Diffusion Model for Real-world Image Restoration

Yan Chen,Yi Wen,Wei Li,Junchao Liu,Yong Guo,Jie Hu,Xinghao Chen

Main category: eess.IV

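TL;DR: The paper proposes RDDM, which restores images directly in the RAW domain to overcome the limitations of the sRGB domain; a RAW-domain VAE and a differentiable Post Tone Processing module yield higher-fidelity results.

Details Motivation: Existing sRGB-domain diffusion models face a trade-off between high fidelity and realistic generation, and neglect the availability of RAW data. RDDM processes images directly in the RAW domain, avoiding the problems of the conventional two-stage pipeline.

Contribution: 1. The RAW domain diffusion model RDDM, which restores images directly from sensor data; 2. A RAW-domain VAE (RVAE) for learning latent representations; 3. A differentiable Post Tone Processing (PTP) module for joint optimization; 4. A scalable degradation pipeline for synthesizing training data.

Method: 1. Learn latent representations with the RAW-domain VAE; 2. Use the PTP module to optimize jointly in RAW and sRGB space; 3. Use a configurable multi-bayer (CMB) LoRA module to handle diverse RAW patterns.

Result: Experiments show RDDM outperforms existing sRGB diffusion methods, producing higher-fidelity images with fewer artifacts.

Insight: Working directly in the RAW domain makes fuller use of sensor data and avoids the losses of the sRGB domain, offering a new direction for image restoration.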

Abstract: We present the RAW domain diffusion model (RDDM), an end-to-end diffusion model that restores photo-realistic images directly from the sensor RAW data. While recent sRGB-domain diffusion methods achieve impressive results, they are caught in a dilemma between high fidelity and realistic generation. These models process lossy sRGB inputs and neglect the accessibility of sensor RAW images in many scenarios, e.g., image and video capture on edge devices, resulting in sub-optimal performance. RDDM bypasses this limitation by directly restoring images in the RAW domain, replacing the conventional two-stage image signal processing (ISP) + IR pipeline. However, a simple adaptation of pre-trained diffusion models to the RAW domain confronts out-of-distribution (OOD) issues. To this end, we propose: (1) a RAW-domain VAE (RVAE) learning optimal latent representations, (2) a differentiable Post Tone Processing (PTP) module enabling joint RAW and sRGB space optimization. To compensate for the deficiency in the dataset, we develop a scalable degradation pipeline synthesizing RAW LQ-HQ pairs from existing sRGB datasets for large-scale training. Furthermore, we devise a configurable multi-bayer (CMB) LoRA module handling diverse RAW patterns such as RGGB, BGGR, etc. Extensive experiments demonstrate RDDM’s superiority over state-of-the-art sRGB diffusion methods, yielding higher fidelity results with fewer artifacts.

cs.RO [Back]

[87] Enhancing Video-Based Robot Failure Detection Using Task Knowledge

Santosh Thoduka,Sebastian Houben,Juergen Gall,Paul G. Plöger

Main category: cs.RO

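TL;DR: The paper proposes a video-based robot failure detection approach that combines task knowledge with spatio-temporal information and improves failure detection performance.

Details Motivation: Robust robotic task execution depends on reliable failure detection, but existing methods perform poorly across varied real-world scenarios.

Contribution: 1. A failure detection approach that uses spatio-temporal knowledge of the robot's actions and task-relevant objects; 2. A data augmentation method that improves performance by applying variable frame rates.

Method: Use spatio-temporal information about robot actions and task-relevant objects, together with data augmentation (variable frame rates), to improve the failure detection model.

Result: On the ARMBench dataset, the F1 score improves from 77.9 to 80.0 at no additional computational cost, and to 81.4 with test-time augmentation.

Insight: Spatio-temporal information is important for failure detection, and suitable heuristics deserve further investigation.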

Abstract: Robust robotic task execution hinges on the reliable detection of execution failures in order to trigger safe operation modes, recovery strategies, or task replanning. However, many failure detection methods struggle to provide meaningful performance when applied to a variety of real-world scenarios. In this paper, we propose a video-based failure detection approach that uses spatio-temporal knowledge in the form of the actions the robot performs and task-relevant objects within the field of view. Both pieces of information are available in most robotic scenarios and can thus be readily obtained. We demonstrate the effectiveness of our approach on three datasets that we amend, in part, with additional annotations of the aforementioned task-relevant knowledge. In light of the results, we also propose a data augmentation method that improves performance by applying variable frame rates to different parts of the video. We observe an improvement from 77.9 to 80.0 in F1 score on the ARMBench dataset without additional computational expense and an additional increase to 81.4 with test-time augmentation. The results emphasize the importance of spatio-temporal information during failure detection and suggest further investigation of suitable heuristics in future implementations. Code and annotations are available.
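
The variable-frame-rate augmentation mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify how segments or rates are chosen, so the function name, the equal-length segments, and the stride set below are all assumptions made for illustration, not the authors' implementation.

```python
import random


def variable_frame_rate_indices(num_frames, num_segments=3, strides=(1, 2, 4), seed=None):
    """Pick frame indices so that each temporal segment uses its own stride.

    Hypothetical helper: segment boundaries and strides are illustrative choices.
    """
    rng = random.Random(seed)
    # Split [0, num_frames) into roughly equal consecutive segments.
    bounds = [round(i * num_frames / num_segments) for i in range(num_segments + 1)]
    indices = []
    for start, end in zip(bounds[:-1], bounds[1:]):
        stride = rng.choice(strides)  # a different effective frame rate per segment
        indices.extend(range(start, end, stride))
    return indices


# Example: subsample a 30-frame clip; the result is an increasing list of frame indices.
sampled = variable_frame_rate_indices(30, seed=0)
```

Applying the returned indices to a clip keeps temporal order while speeding up some parts of the video relative to others.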

[88] ZeST: an LLM-based Zero-Shot Traversability Navigation for Unknown Environments

Shreya Gummadi,Mateus V. Gasparino,Gianluca Capezzuto,Marcelo Becker,Girish Chowdhary

Main category: cs.RO

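TL;DR: ZeST uses the visual reasoning capabilities of Large Language Models (LLMs) for zero-shot traversability navigation in unknown environments, avoiding the risks of conventional data collection and offering a safe, efficient navigation solution.

Details Motivation: Conventional methods for building traversability prediction datasets require placing robots in potentially hazardous environments. ZeST uses LLMs for real-time navigation without exposing robots to danger.

Contribution: Proposes ZeST, which uses the zero-shot capabilities of LLMs to create real-time traversability maps, avoiding the risks of conventional data collection and improving the efficiency and safety of navigation system development.

Method: Use the visual reasoning capabilities of LLMs to generate traversability maps directly in unknown environments, without task-specific training or exposing the robot to hazardous environments.

Result: Experiments in indoor and unstructured outdoor environments show that ZeST navigates more safely than other state-of-the-art methods and consistently reaches the goal.

Insight: The visual reasoning capabilities of LLMs can address traversability in robot navigation efficiently, pointing to a new direction for autonomous navigation systems.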

Abstract: The advancement of robotics and autonomous navigation systems hinges on the ability to accurately predict terrain traversability. Traditional methods for generating datasets to train these prediction models often involve putting robots into potentially hazardous environments, posing risks to equipment and safety. To solve this problem, we present ZeST, a novel approach leveraging visual reasoning capabilities of Large Language Models (LLMs) to create a traversability map in real-time without exposing robots to danger. Our approach not only performs zero-shot traversability and mitigates the risks associated with real-world data collection but also accelerates the development of advanced navigation systems, offering a cost-effective and scalable solution. To support our findings, we present navigation results in both controlled indoor and unstructured outdoor environments. As shown in the experiments, our method provides safer navigation when compared to other state-of-the-art methods, consistently reaching the final goal.

[89] MemoryVLA: Perceptual-Cognitive Memory in Vision-Language-Action Models for Robotic Manipulation

Hao Shi,Bin Xie,Yingfei Liu,Lin Sun,Fengrong Liu,Tiancai Wang,Erjin Zhou,Haoqiang Fan,Xiangyu Zhang,Gao Huang

Main category: cs.RO

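TL;DR: MemoryVLA is a vision-language-action framework with perceptual-cognitive memory that addresses long-horizon, temporally dependent tasks in robotic manipulation and outperforms existing methods.

Details Motivation: Existing VLA models overlook temporal context and struggle with long-horizon, temporally dependent tasks, whereas humans rely on working memory and the hippocampal system for short- and long-term memory. This inspired the design of MemoryVLA.

Contribution: A Cognition-Memory-Action framework that combines working memory with a Perceptual-Cognitive Memory Bank to model temporal context and manage memory efficiently.

Method: A pretrained VLM encodes observations into perceptual and cognitive tokens that form working memory, and a memory bank is built from them; working memory retrieves decision-relevant entries from the bank and fuses them with current tokens, and a memory-conditioned diffusion action expert outputs the actions.

Result: Strong results on simulation and real-world tasks: a +14.6 gain on Bridge, an 84.0% success rate on real-world tasks, and a +26 improvement on long-horizon tasks.

Insight: Mimicking human memory mechanisms can improve a robot's ability to handle complex tasks, especially in long-horizon, temporally dependent settings.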

Abstract: Temporal context is essential for robotic manipulation because such tasks are inherently non-Markovian, yet mainstream VLA models typically overlook it and struggle with long-horizon, temporally dependent tasks. Cognitive science suggests that humans rely on working memory to buffer short-lived representations for immediate control, while the hippocampal system preserves verbatim episodic details and semantic gist of past experience for long-term memory. Inspired by these mechanisms, we propose MemoryVLA, a Cognition-Memory-Action framework for long-horizon robotic manipulation. A pretrained VLM encodes the observation into perceptual and cognitive tokens that form working memory, while a Perceptual-Cognitive Memory Bank stores low-level details and high-level semantics consolidated from it. Working memory retrieves decision-relevant entries from the bank, adaptively fuses them with current tokens, and updates the bank by merging redundancies. Using these tokens, a memory-conditioned diffusion action expert yields temporally aware action sequences. We evaluate MemoryVLA on 150+ simulation and real-world tasks across three robots. On SimplerEnv-Bridge, Fractal, and LIBERO-5 suites, it achieves 71.9%, 72.7%, and 96.5% success rates, respectively, all outperforming state-of-the-art baselines CogACT and pi-0, with a notable +14.6 gain on Bridge. On 12 real-world tasks spanning general skills and long-horizon temporal dependencies, MemoryVLA achieves 84.0% success rate, with long-horizon tasks showing a +26 improvement over state-of-the-art baseline. Project Page: https://shihao1895.github.io/MemoryVLA
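
To make the retrieve-and-merge behaviour described in the abstract concrete, here is a toy sketch of a fixed-capacity memory bank. The abstract does not give the actual retrieval or consolidation rules, so the class name, cosine-similarity retrieval, and averaging of the most similar adjacent pair are assumptions chosen for illustration only.

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyMemoryBank:
    """Hypothetical fixed-capacity bank; not the MemoryVLA implementation."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = []  # stored in temporal order

    def retrieve(self, query, top_n=2):
        """Return the top_n stored entries most similar to the query."""
        ranked = sorted(self.entries, key=lambda e: cosine(query, e), reverse=True)
        return ranked[:top_n]

    def add(self, token):
        """Append a token; if over capacity, merge the most redundant adjacent pair."""
        self.entries.append(list(token))
        if len(self.entries) > self.capacity:
            i = max(
                range(len(self.entries) - 1),
                key=lambda j: cosine(self.entries[j], self.entries[j + 1]),
            )
            merged = [(x + y) / 2 for x, y in zip(self.entries[i], self.entries[i + 1])]
            self.entries[i:i + 2] = [merged]


bank = ToyMemoryBank(capacity=2)
for token in ([1.0, 0.0], [0.9, 0.1], [0.0, 1.0]):
    bank.add(token)
best = bank.retrieve([0.0, 1.0], top_n=1)
```

In this toy run the first two tokens are nearly parallel, so they are averaged into one entry when the third token pushes the bank over capacity.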

cs.CR [Back]

[90] A Systematic Approach to Predict the Impact of Cybersecurity Vulnerabilities Using LLMs

Anders Mølmen Høst,Pierre Lison,Leon Moonen

Main category: cs.CR

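TL;DR: The paper proposes TRIAGE, which uses Large Language Models (LLMs) to map CVEs to techniques in the ATT&CK knowledge base, combining rule-based reasoning with data-driven inference to make vulnerability impact prediction more efficient and accurate.

Details Motivation: Vulnerability databases such as the NVD give detailed CVE descriptions but often lack information on real-world impact, such as the TTPs adversaries may use. Manual mapping is time-consuming, so automated support is needed.

Contribution: TRIAGE combines two LLM modules, one using rule-based reasoning and one using in-context learning for data-driven inference, improving the efficiency and recall of mapping vulnerabilities to ATT&CK techniques. GPT-4o-mini outperforms Llama3.3-70B.

Method: 1) An LLM prompted with instructions based on MITRE's CVE Mapping Methodology predicts an initial list of techniques; 2) a second LLM module maps CVEs to techniques via in-context learning; 3) the two results are combined into a hybrid approach.

Result: In-context learning outperforms the individual mapping methods, and the hybrid approach improves recall of exploitation techniques. GPT-4o-mini performs better than Llama3.3-70B.

Insight: LLMs can be used to predict vulnerability impact automatically, and combining rule-based and data-driven methods can improve the efficiency and accuracy of the mapping task.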

Abstract: Vulnerability databases, such as the National Vulnerability Database (NVD), offer detailed descriptions of Common Vulnerabilities and Exposures (CVEs), but often lack information on their real-world impact, such as the tactics, techniques, and procedures (TTPs) that adversaries may use to exploit the vulnerability. However, manually linking CVEs to their corresponding TTPs is a challenging and time-consuming task, and the high volume of new vulnerabilities published annually makes automated support desirable. This paper introduces TRIAGE, a two-pronged automated approach that uses Large Language Models (LLMs) to map CVEs to relevant techniques from the ATT&CK knowledge base. We first prompt an LLM with instructions based on MITRE’s CVE Mapping Methodology to predict an initial list of techniques. This list is then combined with the results from a second LLM-based module that uses in-context learning to map a CVE to relevant techniques. This hybrid approach strategically combines rule-based reasoning with data-driven inference. Our evaluation reveals that in-context learning outperforms the individual mapping methods, and the hybrid approach improves recall of exploitation techniques. We also find that GPT-4o-mini performs better than Llama3.3-70B on this task. Overall, our results show that LLMs can be used to automatically predict the impact of cybersecurity vulnerabilities and TRIAGE makes the process of mapping CVEs to ATT&CK more efficient. Keywords: vulnerability impact, CVE, ATT&CK techniques, large language models, automated mapping.
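
A small sketch can show why combining two prediction lists tends to raise recall. The abstract only says the two lists are combined, so the order-preserving union below is an assumption, and the predictions and gold labels are invented for illustration (the technique IDs are real ATT&CK identifiers used as placeholders).

```python
def combine_predictions(rule_based, in_context):
    """Order-preserving union of two technique lists (assumed combination rule)."""
    seen = set()
    combined = []
    for technique in list(rule_based) + list(in_context):
        if technique not in seen:
            seen.add(technique)
            combined.append(technique)
    return combined


def recall(predicted, gold):
    """Fraction of gold techniques that appear among the predictions."""
    gold = set(gold)
    if not gold:
        return 0.0
    return len(gold & set(predicted)) / len(gold)


# Made-up outputs of the two modules for one hypothetical CVE.
rule_based = ["T1190", "T1059"]
in_context = ["T1059", "T1068"]
gold = ["T1190", "T1068"]

hybrid = combine_predictions(rule_based, in_context)
scores = (recall(rule_based, gold), recall(in_context, gold), recall(hybrid, gold))
```

A union can never lower recall relative to either input list, although it may lower precision, which is the trade-off a hybrid approach has to manage.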

[91] The Double-edged Sword of LLM-based Data Reconstruction: Understanding and Mitigating Contextual Vulnerability in Word-level Differential Privacy Text Sanitization

Stephen Meisenbacher,Alexandra Klymenko,Andreea-Elena Bodea,Florian Matthes

Main category: cs.CR

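TL;DR: The paper examines the double-edged role of LLM-based data reconstruction in differentially private text sanitization: it can exploit contextual vulnerability to attack privacy, yet adversarial thinking can also turn it into a tool for strengthening privacy protection.

Details Motivation: Differential Privacy (DP) text sanitization gives theoretical privacy guarantees but in practice leaves contextual vulnerabilities that LLMs can exploit. The paper studies how LLMs exploit them and proposes possible mitigations.

Contribution: 1. A systematic study of how LLMs exploit the contextual vulnerability of DP text sanitization, expanding on previous work; 2. Identification of the double-edged effect of LLM-based data reconstruction attacks (attacking privacy and enhancing it); 3. Recommendations for using LLM reconstruction as a post-processing step to increase privacy protection.

Method: Experiments test a range of DP sanitization mechanisms at various privacy levels, use LLMs to recover the semantics of sanitized texts, and analyse the effect on privacy and utility.

Result: LLMs can exploit contextual vulnerability to infer the original semantics, but they can also be used to improve the quality and privacy of sanitized texts.

Insight: LLMs are a double-edged sword for privacy protection and their capabilities should be used with care; adversarial thinking, such as LLM post-processing, may be a promising direction for privacy protection.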

Abstract: Differentially private text sanitization refers to the process of privatizing texts under the framework of Differential Privacy (DP), providing provable privacy guarantees while also empirically defending against adversaries seeking to harm privacy. Despite their simplicity, DP text sanitization methods operating at the word level exhibit a number of shortcomings, among them the tendency to leave contextual clues from the original texts due to randomization during sanitization; this we refer to as contextual vulnerability. Given the powerful contextual understanding and inference capabilities of Large Language Models (LLMs), we explore to what extent LLMs can be leveraged to exploit the contextual vulnerability of DP-sanitized texts. We expand on previous work not only in the use of advanced LLMs, but also in testing a broader range of sanitization mechanisms at various privacy levels. Our experiments uncover a double-edged sword effect of LLM-based data reconstruction attacks on privacy and utility: while LLMs can indeed infer original semantics and sometimes degrade empirical privacy protections, they can also be used for good, to improve the quality and privacy of DP-sanitized texts. Based on our findings, we propose recommendations for using LLM data reconstruction as a post-processing step, serving to increase privacy protection by thinking adversarially.

[92] Hidden Tail: Adversarial Image Causing Stealthy Resource Consumption in Vision-Language Models

Rui Zhang,Zihan Wang,Tianli Yang,Hongwei Li,Wenbo Jiang,Qingchuan Zhao,Yang Liu,Guowen Xu

Main category: cs.CR

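TL;DR: The paper proposes Hidden Tail, a stealthy resource consumption attack on Vision-Language Models (VLMs) that crafts adversarial images inducing the model to emit special tokens (rather than irrelevant content), prolonging inference while remaining stealthy.

Details Motivation: The high inference cost of VLMs makes them vulnerable to resource consumption attacks, but existing attacks generate irrelevant content and so lack stealth. The paper addresses this trade-off between stealthiness and effectiveness.

Contribution: Hidden Tail optimizes adversarial images with a dynamically weighted composite loss so that the model generates maximum-length outputs without affecting the visible output, improving both stealthiness and effectiveness.

Method: A composite loss balances semantic preservation, special token induction, and suppression of the EOS token, and adversarial images are optimized with a dynamic weighting strategy.

Result: Hidden Tail increases output length by up to 19.2x and reaches the maximum token limit while preserving stealthiness, outperforming existing attacks.

Insight: The work underlines the urgency of making VLMs robust to efficiency-oriented adversarial threats and shows the potential danger of stealthy resource consumption attacks.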

Abstract: Vision-Language Models (VLMs) are increasingly deployed in real-world applications, but their high inference cost makes them vulnerable to resource consumption attacks. Prior attacks attempt to extend VLM output sequences by optimizing adversarial images, thereby increasing inference costs. However, these extended outputs often introduce irrelevant abnormal content, compromising attack stealthiness. This trade-off between effectiveness and stealthiness poses a major limitation for existing attacks. To address this challenge, we propose Hidden Tail, a stealthy resource consumption attack that crafts prompt-agnostic adversarial images, inducing VLMs to generate maximum-length outputs by appending special tokens invisible to users. Our method employs a composite loss function that balances semantic preservation, repetitive special token induction, and suppression of the end-of-sequence (EOS) token, optimized via a dynamic weighting strategy. Extensive experiments show that Hidden Tail outperforms existing attacks, increasing output length by up to 19.2x and reaching the maximum token limit, while preserving attack stealthiness. These results highlight the urgent need to improve the robustness of VLMs against efficiency-oriented adversarial threats. Our code is available at https://github.com/zhangrui4041/Hidden_Tail.

q-bio.NC [Back]

[93] Time Series Analysis of Spiking Neural Systems via Transfer Entropy and Directed Persistent Homology

Dylan Peek,Siddharth Pritam,Matthew P. Skerritt,Stephan Chalup

Main category: q-bio.NC

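TL;DR: The paper proposes a topological framework that combines Transfer Entropy (TE) with directed Persistent Homology (PH) to analyse neural time series and characterize information flow in spiking neural systems.

Details Motivation: The goal is a general method that captures directed information flow in neural systems and maps it onto global organizational patterns, applicable to both artificial and biological neural networks.

Contribution: Combines TE and PH into a new framework that quantifies dynamic interactions and topological complexity in neural systems, validated experimentally.

Method: TE produces weighted, directed graphs that represent directional influence between neurons; PH then analyses the topology of these graphs, revealing complexity across scales and dimensions.

Result: Across synthetic spiking networks, image-classification networks, and mouse cortical recordings, the method distinguishes task complexity, stimulus structure, and behavioral regime, and higher-dimensional features become more prominent in complex or noisy conditions.

Insight: Higher-dimensional topological features reflect interaction patterns beyond pairwise connectivity, offering a new view of the global organization of neural systems.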

Abstract: We present a topological framework for analysing neural time series that integrates Transfer Entropy (TE) with directed Persistent Homology (PH) to characterize information flow in spiking neural systems. TE quantifies directional influence between neurons, producing weighted, directed graphs that reflect dynamic interactions. These graphs are then analyzed using PH, enabling assessment of topological complexity across multiple structural scales and dimensions. We apply this TE+PH pipeline to synthetic spiking networks trained on logic gate tasks, image-classification networks exposed to structured and perturbed inputs, and mouse cortical recordings annotated with behavioral events. Across all settings, the resulting topological signatures reveal distinctions in task complexity, stimulus structure, and behavioral regime. Higher-dimensional features become more prominent in complex or noisy conditions, reflecting interaction patterns that extend beyond pairwise connectivity. Our findings offer a principled approach to mapping directed information flow onto global organizational patterns in both artificial and biological neural systems. The framework is generalizable and interpretable, making it well suited for neural systems with time-resolved and binary spiking data.
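
The first half of the TE+PH pipeline, turning binary spike trains into a weighted directed graph, can be sketched as follows. This is a plug-in transfer entropy estimate with history length 1, which is an assumption: the abstract does not state the estimator or history length used, and the function names and toy data are illustrative. The persistent homology step is not shown.

```python
import random
from collections import Counter
from math import log2


def transfer_entropy(source, target):
    """Plug-in estimate of TE(source -> target) in bits, history length 1."""
    n = len(target) - 1
    triples = Counter()   # counts of (y_next, y_now, x_now)
    pairs_yx = Counter()  # counts of (y_now, x_now)
    pairs_yy = Counter()  # counts of (y_next, y_now)
    singles = Counter()   # counts of y_now
    for t in range(n):
        y_next, y_now, x_now = target[t + 1], target[t], source[t]
        triples[(y_next, y_now, x_now)] += 1
        pairs_yx[(y_now, x_now)] += 1
        pairs_yy[(y_next, y_now)] += 1
        singles[y_now] += 1
    te = 0.0
    for (y_next, y_now, x_now), count in triples.items():
        p_joint = count / n
        p_given_both = count / pairs_yx[(y_now, x_now)]
        p_given_self = pairs_yy[(y_next, y_now)] / singles[y_now]
        te += p_joint * log2(p_given_both / p_given_self)
    return te


def te_graph(trains):
    """Weighted directed graph {(i, j): TE(i -> j)} over all ordered pairs."""
    return {
        (i, j): transfer_entropy(trains[i], trains[j])
        for i in range(len(trains))
        for j in range(len(trains))
        if i != j
    }


# Toy data: neuron 1 copies neuron 0 with a one-step delay.
rng = random.Random(0)
driver = [rng.randint(0, 1) for _ in range(2000)]
follower = [0] + driver[:-1]
graph = te_graph([driver, follower])
```

In this toy setting the edge from the driver to the follower carries close to one bit, while the reverse edge is close to zero, which is the asymmetry a directed graph is meant to capture.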

physics.optics [Back]

[94] Designing across domains with declarative thinking: Insights from the 96-Eyes ptychographic imager project

Antony C Chan

Main category: physics.optics

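TL;DR: Using the 96-Eyes project (a 96-camera parallel multi-modal imager for high-throughput drug discovery) as a case study, the article discusses the use and benefits of declarative problem formulation languages (5GL) in cross-domain imaging system design.

Details Motivation: In interdisciplinary and cross-functional collaboration, imperative 3rd-generation languages (3GL) can lead to inconsistent designs and poor communication, whereas 5GL can improve transparency and traceability through machine-readable problem statements.

Contribution: Proposes using 5GL (5th-generation language) to formalize complex project requirements, such as hardware constraints and life sciences needs, into machine-readable problem statements, improving cross-domain collaboration.

Method: Through the 96-Eyes project, the article shows how the requirements of multiple teams, covering hardware constraints, algorithms, hardware-accelerated compute, and life sciences, can be turned into declarative, machine-readable problem statements, and illustrates 5GL with real-world code examples.

Result: 5GL enhances design transparency, ensures traceability, and reduces costly misalignment across teams.

Insight: Declarative problem formulation can facilitate innovation, especially in concurrent R&D workflows, while imperative languages suit sequential, phase-driven R&D environments. Programming paradigms implicitly shape research workflows and domain hierarchies.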

Abstract: This article presents a practitioner’s reflection on applying declarative, 5th generation, problem formulation language (5GL) to de novo imaging system design, informed by experiences across interdisciplinary research in academia and cross-functional product development within the private sector. Using the 96-Eyes project, a 96-camera parallel multi-modal imager for high-throughput drug discovery, as a representative case, I illustrate how project requirements, ranging from hardware constraints to life sciences needs, can be formalized into machine-readable problem statements to preserve mission-critical input from diverse domain stakeholders. This declarative approach enhances transparency, ensures design traceability, and minimizes costly misalignment across optical, algorithmic, hardware-accelerated compute, and life sciences teams. Alongside the technical discussion of 5GL with real-world code examples, I reflect on the practical barriers to adopting 5GL in environments where imperative, 3rd-generation languages (3GL) remain the default medium for inter-team collaboration. Rather than offering a one-size-fits-all solution, these lessons learned highlight how programming paradigms implicitly shape research workflows through existing domain hierarchies. The discussion aims to invite further explorations into how declarative problem formulations can facilitate innovation in settings where concurrent R&D workflows are gaining traction, as opposed to environments where sequential, phase-driven workflows remain the norm.

cs.LG [Back]

[95] Optimal Sparsity of Mixture-of-Experts Language Models for Reasoning Tasks

Taishi Nakamura,Satoki Ishikawa,Masaki Kawamura,Takumi Okamoto,Daisuke Nohara,Jun Suzuki,Rio Yokota

Main category: cs.LG

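TL;DR: The paper studies how sparsity in MoE language models affects memorization and reasoning tasks. Reasoning performance saturates and can even regress as sparsity increases, whereas memorization keeps improving with total parameters.

Details Motivation: Existing work on large language models (LLMs) focuses on dense models, and the sparsity dimension of MoE models is underexplored, especially how it affects different capabilities (memorization vs. reasoning).

Contribution: A systematic analysis of how MoE sparsity affects memorization and reasoning tasks, characterizing the relationship between sparsity and task performance and revealing the saturation of reasoning performance.

Method: Train families of MoE Transformers that vary sparsity (total parameters, active parameters, top-$k$ routing) under a fixed compute budget, recording pre-training loss, downstream task loss, and accuracy.

Result: Memorization performance improves with total parameters, while reasoning performance saturates and can even regress as sparsity increases; changing top-$k$ alone has little effect.

Insight: Gains on reasoning tasks may be limited by model sparsity; classic hyperparameters such as learning rate affect the generalization gap in the same direction as sparsity.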

Abstract: Empirical scaling laws have driven the evolution of large language models (LLMs), yet their coefficients shift whenever the model architecture or data pipeline changes. Mixture-of-Experts (MoE) models, now standard in state-of-the-art systems, introduce a new sparsity dimension that current dense-model frontiers overlook. We investigate how MoE sparsity influences two distinct capability regimes: memorization and reasoning. We train families of MoE Transformers that systematically vary total parameters, active parameters, and top-$k$ routing while holding the compute budget fixed. For every model we record pre-training loss, downstream task loss, and task accuracy, allowing us to separate the train-test generalization gap from the loss-accuracy gap. Memorization benchmarks improve monotonically with total parameters, mirroring training loss. By contrast, reasoning performance saturates and can even regress despite continued gains in both total parameters and training loss. Altering top-$k$ alone has little effect when active parameters are constant, and classic hyperparameters such as learning rate and initialization modulate the generalization gap in the same direction as sparsity. Neither post-training reinforcement learning (GRPO) nor extra test-time compute rescues the reasoning deficit of overly sparse models. Our model checkpoints, code and logs are open-source at https://github.com/rioyokotalab/optimal-sparsity.
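
For readers unfamiliar with the quantities being varied, the sketch below shows top-$k$ routing for a single token and how total and active parameter counts differ in an MoE layer. Renormalizing a softmax over only the selected experts is one common variant and is assumed here; the function names and all numbers are hypothetical and not taken from the paper.

```python
from math import exp


def top_k_route(router_logits, k):
    """Return {expert_index: weight} for the k highest-scoring experts of one token."""
    top = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)[:k]
    peak = max(router_logits[i] for i in top)  # subtract the max for numerical stability
    unnormalized = {i: exp(router_logits[i] - peak) for i in top}
    total = sum(unnormalized.values())
    return {i: w / total for i, w in unnormalized.items()}


def parameter_counts(shared_params, params_per_expert, num_experts, k):
    """Total parameters versus parameters actually used per token."""
    total = shared_params + num_experts * params_per_expert
    active = shared_params + k * params_per_expert
    return total, active


# One token routed to 2 of 4 experts.
weights = top_k_route([2.0, 0.5, 1.0, -1.0], k=2)

# Two hypothetical configurations with different top-k but equal total and active size.
config_a = parameter_counts(shared_params=10, params_per_expert=4, num_experts=16, k=2)
config_b = parameter_counts(shared_params=10, params_per_expert=2, num_experts=32, k=4)
```

The two configurations illustrate the kind of comparison the abstract describes, where top-$k$ changes while active parameters are held constant.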

cs.GR [Back]

[96] SemLayoutDiff: Semantic Layout Generation with Diffusion Model for Indoor Scene Synthesis

Xiaohao Sun,Divyam Goel,Angel X. Chang

Main category: cs.GR

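TL;DR: SemLayoutDiff is a diffusion-based semantic layout generation method for synthesizing diverse 3D indoor scenes. It combines a semantic map with object attributes and uses explicit conditioning to respect architectural constraints.

Details Motivation: Existing indoor scene generation methods struggle to model architectural constraints (such as doors and windows) explicitly, so generated scenes can be impractical or incoherent. SemLayoutDiff aims to solve this with a diffusion model and explicit conditioning on a semantic layout.

Contribution: 1. Proposes a scene layout representation that combines a semantic map with object attributes; 2. designs a framework based on a categorical diffusion model that can condition scene generation explicitly on room masks; 3. introduces a cross-attention-based network that predicts furniture placements consistent with the semantic layout.

Method: 1. First generate a semantic map; 2. use the diffusion model to generate the scene conditioned explicitly on room masks; 3. predict furniture positions with a cross-attention network; 4. account for architectural elements such as doors and windows to keep arrangements practical.

Result: Experiments on the 3D-FRONT dataset show that SemLayoutDiff produces scenes that outperform existing methods in spatial coherence, realism, and diversity.

Insight: Combining diffusion models with explicit conditioning handles architectural constraints better and yields more practical indoor scenes.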

Abstract: We present SemLayoutDiff, a unified model for synthesizing diverse 3D indoor scenes across multiple room types. The model introduces a scene layout representation combining a top-down semantic map and attributes for each object. Unlike prior approaches, which cannot condition on architectural constraints, SemLayoutDiff employs a categorical diffusion model capable of conditioning scene synthesis explicitly on room masks. It first generates a coherent semantic map, followed by a cross-attention-based network to predict furniture placements that respect the synthesized layout. Our method also accounts for architectural elements such as doors and windows, ensuring that generated furniture arrangements remain practical and unobstructed. Experiments on the 3D-FRONT dataset show that SemLayoutDiff produces spatially coherent, realistic, and varied scenes, outperforming previous methods.

[97] PanoHair: Detailed Hair Strand Synthesis on Volumetric Heads

Shashikant Verma,Shanmuganathan Raman

Main category: cs.GR

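TL;DR: PanoHair proposes a new approach that synthesizes high-fidelity hair strand geometry through knowledge distillation from a pre-trained generative model, substantially improving generation speed and diversity.

Details Motivation: Existing methods require complex multi-view data acquisition and long processing times, which limits efficiency and applicability. PanoHair aims to simplify this pipeline and generate high-fidelity hair strands quickly.

Contribution: 1. Estimates head geometry via knowledge distillation from a pre-trained generative model; 2. generates semantic segmentation masks and 3D orientation maps; 3. supports diverse hairstyle generation and fast processing (under 5 seconds).

Method: 1. Extract head geometry as a signed distance field (SDF) via knowledge distillation from a pre-trained model; 2. predict semantic masks and 3D orientations for the hair region; 3. generate diverse hairstyles through latent space manipulation.

Result: Experiments show that PanoHair generates a clean manifold mesh in under 5 seconds and improves markedly on existing methods in both visual quality and efficiency.

Insight: Combining knowledge distillation with generative models gives an efficient and flexible solution for hair synthesis, and latent space manipulation makes diverse generation possible.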

Abstract: Achieving realistic hair strand synthesis is essential for creating lifelike digital humans, but producing high-fidelity hair strand geometry remains a significant challenge. Existing methods require a complex setup for data acquisition, involving multi-view images captured in constrained studio environments. Additionally, these methods have longer hair volume estimation and strand synthesis times, which hinder efficiency. We introduce PanoHair, a model that estimates head geometry as signed distance fields using knowledge distillation from a pre-trained generative teacher model for head synthesis. Our approach enables the prediction of semantic segmentation masks and 3D orientations specifically for the hair region of the estimated geometry. Our method is generative and can generate diverse hairstyles with latent space manipulations. For real images, our approach involves an inversion process to infer latent codes and produces visually appealing hair strands, offering a streamlined alternative to complex multi-view data acquisition setups. Given the latent code, PanoHair generates a clean manifold mesh for the hair region in under 5 seconds, along with semantic and orientation maps, marking a significant improvement over existing methods, as demonstrated in our experiments.

cs.HC [Back]

[98] Impact of Target and Tool Visualization on Depth Perception and Usability in Optical See-Through AR

Yue Yang,Xue Xie,Xinkai Wang,Hui Zhang,Chiming Yu,Xiaoxian Xiong,Lifeng Zhu,Yuanyi Zheng,Jue Cen,Bruce Daniel,Fred Baik

Main category: cs.HC

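TL;DR: The paper studies how target and tool visualization affect depth perception and system usability in optical see-through augmented reality (OST-AR). Opaque target rendering and real-time tool occlusion significantly improve accuracy and user experience.

Details Motivation: Optical see-through AR (e.g., HoloLens 2) shows promise for near-field tasks such as surgery, but depth perception and tool occlusion remain unsolved problems.

Contribution: Examines how target transparency and tool visualization modes affect depth perception and usability, and offers recommendations for design.

Method: Two experiments (a depth matching task and a simulated surgical task) compare different transparency levels and tool visualization modes.

Result: Opaque target rendering and real-time tool occlusion significantly improve depth perception and task accuracy, whereas highly transparent targets impair them.

Insight: Correct occlusion cues and target opacity are critical for depth perception in OST-AR. Designs should prioritize tool tracking and occlusion handling.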

Abstract: Optical see-through augmented reality (OST-AR) systems like Microsoft HoloLens 2 hold promise for arm’s distance guidance (e.g., surgery), but depth perception of the hologram and occlusion of real instruments remain challenging. We present an evaluation of how visualizing the target object with different transparencies and visualizing a tracked tool (virtual proxy vs. real tool vs. no tool tracking) affects depth perception and system usability. Ten participants performed two experiments on HoloLens 2. In Experiment 1, we compared high-transparency vs. low-transparency target rendering in a depth matching task at arm’s length. In Experiment 2, participants performed a simulated surgical pinpoint task on a frontal bone target under six visualization conditions ($2 \times 3$: two target transparencies and three tool visualization modes: virtual tool hologram, real tool, or no tool tracking). We collected data on depth matching error, target localization error, system usability, task workload, and qualitative feedback. Results show that a more opaque target yields significantly lower depth estimation error than a highly transparent target at arm’s distance. Moreover, showing the real tool (occluding the virtual target) led to the highest accuracy and usability with the lowest workload, while not tracking the tool yielded the worst performance and user ratings. However, making the target highly transparent, while allowing the real tool to remain visible, slightly impaired depth cues and did not improve usability. Our findings underscore that correct occlusion cues, rendering virtual content opaque and occluding it with real tools in real time, are critical for depth perception and precision in OST-AR. Designers of arm-distance AR systems should prioritize robust tool tracking and occlusion handling; if unavailable, cautiously use transparency to balance depth perception and tool visibility.
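The $2 \times 3$ design of Experiment 2 can be enumerated directly. This is a minimal sketch of the condition grid only; the label strings and the `conditions` helper are illustrative names, and nothing here reflects the study's actual trial ordering or counterbalancing.

```python
from itertools import product

# Factor levels as described in the abstract; the label strings are illustrative.
TRANSPARENCIES = ("high transparency", "low transparency")
TOOL_MODES = ("virtual tool hologram", "real tool", "no tool tracking")


def conditions():
    """Return every (target transparency, tool mode) pair of the 2 x 3 design."""
    return list(product(TRANSPARENCIES, TOOL_MODES))


for index, (transparency, tool_mode) in enumerate(conditions(), start=1):
    print(index, transparency, "|", tool_mode)
```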

quant-ph [Back]

[99] Quantum-Circuit-Based Visual Fractal Image Generation in Qiskit and Analytics

Hillol Biswas

Main category: quant-ph

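TL;DR: The paper explores generating Julia set fractal images with quantum circuits, drawing on superposition, randomness, and entanglement, and suggests a new research direction for quantum generative art.

Details Motivation: Fractal phenomena in nature resemble the interference patterns of quantum systems. The paper uses quantum computing to explore whether fractal images can be generated this way, opening a new direction for quantum generative art.

Contribution: Proposes a quantum-circuit-based method for generating Julia set fractal images and shows how superposition, randomness, and entanglement influence the resulting fractal patterns.

Method: Builds quantum circuits in Qiskit, uses the superposition and entanglement of quantum states to generate fractal patterns, and analyzes how these quantum phenomena affect image complexity.

Result: Shows that quantum circuits can generate complex Julia set fractal images, providing a new technical route for quantum generative art.

Insight: Properties of quantum computing (such as superposition and entanglement) can introduce greater complexity and randomness into fractal image generation, offering fresh ideas at the intersection of art and science.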

Abstract: Nature is often described as quantum, and fractals likewise appear in many observable entities and phenomena at both micro and macro scales. Fractals show self-similarity across scales; zooming in reveals structures that resemble the whole. In quantum systems, the probability density or wavefunction may exhibit recurring interference patterns at various energy or length scales. Fractals are produced by basic iterative rules (such as Mandelbrot or Julia sets), yet they provide limitless complexity. Despite its simplicity, the Schrödinger equation in quantum mechanics produces incredibly intricate patterns of interference and entanglement, particularly in chaotic quantum systems. Quantum computing is rooted in the principles of quantum-mechanical phenomena; what outcomes can be expected when it is applied to fractal image generation? The paper outlines the generation of a Julia set dataset using an approach coupled with building a quantum circuit, highlighting superposition, randomness, and entanglement as foundational elements for manipulating the patterns of the generated dataset. As quantum computing finds many application areas, the possibility of using quantum circuits for Julia fractal image generation suggests a unique direction for future research, in which it can be applied to quantum generative art across various ecosystems with a customised approach, such as producing an exciting landscape based on a quantum art theme.
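For readers unfamiliar with the iterative rule behind Julia sets, the following is a minimal classical sketch of the escape-time iteration $z \leftarrow z^2 + c$. It does not reproduce the paper's quantum-circuit approach; the function name, grid bounds, and default parameters are assumptions chosen for illustration.

```python
def julia_escape_counts(c, width=40, height=20, max_iter=50, bound=2.0):
    """Escape-time counts for the Julia set of z -> z*z + c.

    Samples the square [-1.5, 1.5] x [-1.5, 1.5] on a width x height grid and
    returns, per point, how many iterations ran before |z| exceeded `bound`
    (capped at max_iter for points that never escaped).
    """
    rows = []
    for j in range(height):
        y = -1.5 + 3.0 * j / (height - 1)
        row = []
        for i in range(width):
            x = -1.5 + 3.0 * i / (width - 1)
            z = complex(x, y)
            n = 0
            while n < max_iter and abs(z) <= bound:
                z = z * z + c
                n += 1
            row.append(n)
        rows.append(row)
    return rows


# Render a coarse ASCII picture for one commonly used parameter value.
for row in julia_escape_counts(complex(-0.8, 0.156)):
    print("".join("#" if n == 50 else "." for n in row))
```

Varying `c` changes the shape of the set, which is the knob a generative pipeline would turn to produce a dataset of distinct images.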

cs.AI [Back]

[100] RLMR: Reinforcement Learning with Mixed Rewards for Creative Writing

Jianxing Liao,Tian Zhang,Xiao Feng,Yusong Zhang,Rui Yang,Haorui Wang,Bosi Wen,Ziying Wang,Runzhi Shi

Main category: cs.AI

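TL;DR: The paper proposes RLMR, a reinforcement learning method that uses a dynamically mixed reward system to balance subjective writing quality against objective constraint following in creative writing, enabling multi-dimensional optimization.

Details Motivation: Creative writing requires balancing subjective writing quality (such as literariness and emotional expression) with objective constraint following (such as format requirements and word limits). Existing reinforcement learning methods struggle to optimize both at once.

Contribution: Proposes RLMR, which the authors describe as the first work to combine subjective preferences with objective verification in online reinforcement learning, achieving multi-dimensional optimization by adjusting reward weights dynamically.

Method: Uses a dynamically mixed reward system in which a writing reward model scores subjective quality and a constraint verification model scores objective constraint following, with the reward weight adjusted dynamically during training.

Result: Both automated and manual evaluations show consistent gains, in instruction following (IFEval from 83.36% to 86.65%) and in writing quality (72.75% win rate on WriteEval).

Insight: Dynamic reward weighting is the key innovation. The weight adapts to writing quality so that samples violating constraints are penalized, which balances subjective and objective requirements more effectively during training.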

Abstract: Large language models are extensively utilized in creative writing applications. Creative writing requires a balance between subjective writing quality (e.g., literariness and emotional expression) and objective constraint following (e.g., format requirements and word limits). Existing reinforcement learning methods struggle to balance these two aspects: single-reward strategies fail to improve both abilities simultaneously, while fixed-weight mixed-reward methods lack the ability to adapt to different writing scenarios. To address this problem, we propose Reinforcement Learning with Mixed Rewards (RLMR), utilizing a dynamically mixed reward system from a writing reward model evaluating subjective writing quality and a constraint verification model assessing objective constraint following. The constraint-following reward weight is adjusted dynamically according to the writing quality within sampled groups, ensuring that samples violating constraints receive a negative advantage in GRPO and are thus penalized during training, which is the key innovation of the proposed method. We conduct automated and manual evaluations across diverse model families from 8B to 72B parameters. Additionally, we construct a real-world writing benchmark named WriteEval for comprehensive evaluation. Results illustrate that our method achieves consistent improvements in both instruction following (IFEval from 83.36% to 86.65%) and writing quality (72.75% win rate in manual expert pairwise evaluations on WriteEval). To the best of our knowledge, RLMR is the first work to combine subjective preferences with objective verification in online RL training, providing an effective solution for multi-dimensional creative writing optimization.
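One way to see how a dynamically chosen penalty weight can guarantee a negative group-relative advantage for constraint-violating samples is sketched below. This is not the paper's formula, which the abstract does not give. The weight rule is derived here from the requirement that every violator's mixed reward falls below the group mean, and all names (`mixed_group_advantages`, `margin`) are hypothetical.

```python
from statistics import mean, pstdev


def mixed_group_advantages(quality, violates, margin=0.1, eps=1e-8):
    """Group-normalized advantages with a dynamic constraint penalty (a sketch).

    quality[i]  : writing-quality reward of sample i (hypothetical scale)
    violates[i] : True if sample i breaks an objective constraint
    The penalty weight w is chosen per group so that q_i - w lies below the
    mean mixed reward for every violator, i.e. w > (q_i - mean(q)) / (1 - p),
    where p is the fraction of violators in the group.
    """
    n = len(quality)
    p = sum(violates) / n
    if 0 < p < 1:
        best_violator = max(q for q, v in zip(quality, violates) if v)
        weight = max(0.0, (best_violator - mean(quality)) / (1 - p)) + margin
    else:
        # All or no samples violate: no penalty can separate the group.
        weight = 0.0
    rewards = [q - weight * v for q, v in zip(quality, violates)]
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# The best-written sample violates a constraint, yet ends with a negative advantage.
advantages = mixed_group_advantages([0.9, 0.5, 0.4, 0.6], [True, False, False, False])
print([round(a, 3) for a in advantages])
```

A fixed penalty weight would not give this guarantee, since a high-quality violator could still sit above the group mean. That is the motivation for tying the weight to the quality scores of the sampled group.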

[101] Beyond Benchmark: LLMs Evaluation with an Anthropomorphic and Value-oriented Roadmap

Jun Wang,Ninglun Gu,Kailai Zhang,Zijiao Zhang,Yelun Bao,Jin Yang,Xu Yin,Liwei Liu,Yihuan Liu,Pengyong Li,Gary G. Yen,Junchi Yan

Main category: cs.AI

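TL;DR: The paper proposes a new evaluation paradigm that views LLM evaluation through the lens of human intelligence, dividing it into three dimensions (IQ, EQ, and PQ), and designs a Value-oriented Evaluation (VQ) framework to bridge the gap between benchmarks and real-world use.

Details Motivation: Current LLM evaluation frameworks are fragmented and prioritize technical metrics while neglecting holistic assessment for deployment, so benchmark performance is disconnected from real-world utility.

Contribution: Proposes an anthropomorphic evaluation paradigm based on human intelligence (a three-dimensional IQ, EQ, PQ taxonomy) and a Value-oriented Evaluation (VQ) framework that assesses economic viability, social impact, ethical alignment, and environmental sustainability.

Method: Adopts a modular architecture integrating six components and, by analyzing more than 200 benchmarks, identifies key challenges such as dynamic assessment needs and interpretability gaps.

Result: Provides actionable guidance for developing LLMs that are technically proficient, contextually relevant, and ethically sound, and maintains a repository of open-source evaluation resources.

Insight: LLM evaluation needs to go beyond technical metrics and attend to multi-dimensional real-world value and ethical impact. Dynamic assessment and interpretability are directions for future research.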

Abstract: For Large Language Models (LLMs), a disconnect persists between benchmark performance and real-world utility. Current evaluation frameworks remain fragmented, prioritizing technical metrics while neglecting holistic assessment for deployment. This survey introduces an anthropomorphic evaluation paradigm through the lens of human intelligence, proposing a novel three-dimensional taxonomy: Intelligence Quotient (IQ)-General Intelligence for foundational capacity, Emotional Quotient (EQ)-Alignment Ability for value-based interactions, and Professional Quotient (PQ)-Professional Expertise for specialized proficiency. For practical value, we pioneer a Value-oriented Evaluation (VQ) framework assessing economic viability, social impact, ethical alignment, and environmental sustainability. Our modular architecture integrates six components with an implementation roadmap. Through analysis of 200+ benchmarks, we identify key challenges including dynamic assessment needs and interpretability gaps. It provides actionable guidance for developing LLMs that are technically proficient, contextually relevant, and ethically sound. We maintain a curated repository of open-source evaluation resources at: https://github.com/onejune2018/Awesome-LLM-Eval.

[102] CAC-CoT: Connector-Aware Compact Chain-of-Thought for Efficient Reasoning Data Synthesis Across Dual-System Cognitive Tasks

Sunguk Choi,Yonghoon Kwon,Heondeuk Lee

Main category: cs.AI

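TL;DR: The paper proposes CAC-CoT, a method that restricts reasoning to a small, fixed set of connector phrases to synthesize compact chain-of-thought data efficiently across dual-system cognitive tasks.

Details Motivation: Long chain-of-thought (CoT) prompting helps LLMs solve complex problems, but overly long reasoning traces degrade performance on fast, intuitive "System-1" tasks. A more efficient, compact reasoning method is therefore needed.

Contribution: Proposes Connector-Aware Compact CoT (CAC-CoT), which restricts reasoning with a fixed set of connector phrases, substantially shortening reasoning traces without loss of accuracy.

Method: Uses a small, fixed set of connector phrases to steer the model toward concise, well-structured explanations, and synthesizes high-quality training data with Gemini-2.0-Flash.

Result: Achieves approximately 85% on GSM8K and approximately 40% on GPQA (System-2 tasks) while retaining approximately 90% on S1-Bench (System-1 tasks). Reasoning traces average about 300 tokens, roughly one-third the length of baseline traces.

Insight: Compact reasoning traces can improve efficiency substantially without sacrificing accuracy, especially in dual-system settings that must handle both "System-1" and "System-2" tasks.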

Abstract: Long chain-of-thought (CoT) prompting helps Large Language Models (LLMs) solve difficult problems, but very long traces often slow or even degrade performance on fast, intuitive “System-1” tasks. We introduce Connector-Aware Compact CoT (CAC-CoT), a method that deliberately restricts reasoning to a small, fixed set of connector phrases, steering the model toward concise and well-structured explanations. Despite its simplicity, our synthetic method with Gemini-2.0-Flash yields high-quality training data. CAC-CoT achieves approximately 85% on GSM8K and approximately 40% on GPQA (System-2) while retaining approximately 90% on S1-Bench (System-1). Its reasoning traces average approximately 300 tokens (ART), about one-third the length of baseline traces, delivering higher efficiency without loss of accuracy.

[103] Answering the Unanswerable Is to Err Knowingly: Analyzing and Mitigating Abstention Failures in Large Reasoning Models

Yi Liu,Xiangyu Liu,Zequn Sun,Wei Hu

Main category: cs.AI

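TL;DR: Large reasoning models (LRMs) perform well on complex reasoning tasks but often fail to abstain appropriately when faced with unanswerable questions. This study analyzes the phenomenon and proposes a lightweight two-stage method to mitigate it.

Details Motivation: When faced with unanswerable questions (such as math problems lacking sufficient conditions), LRMs often fail to abstain correctly, which undermines their trustworthiness. This study aims to analyze and resolve the issue.

Contribution: 1. Systematically analyzes the response behavior of LRMs on unanswerable questions; 2. finds that LRMs have the cognitive capability to recognize flaws in such questions, but their responses are misaligned with this internal cognition; 3. proposes a lightweight two-stage method (cognitive monitoring plus inference-time intervention) that significantly raises the abstention rate.

Method: A two-stage approach: 1. a cognitive monitoring stage that identifies whether the question is answerable; 2. an inference-time intervention stage that adjusts the model's response according to the monitoring result.

Result: Experiments show that the method significantly improves the abstention rate while maintaining overall reasoning performance.

Insight: The response behavior of LRMs is misaligned with their internal cognition, and a lightweight intervention can correct that behavior. This suggests a new route to more trustworthy models.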

Abstract: Large reasoning models (LRMs) have shown remarkable progress on complex reasoning tasks. However, some questions posed to LRMs are inherently unanswerable, such as math problems lacking sufficient conditions. We find that LRMs continually fail to provide appropriate abstentions when confronted with these unanswerable questions. In this paper, we systematically analyze, investigate, and resolve this issue for trustworthy AI. We first conduct a detailed analysis of the distinct response behaviors of LRMs when facing unanswerable questions. Then, we show that LRMs possess sufficient cognitive capabilities to recognize the flaws in these questions. However, they fail to exhibit appropriate abstention behavior, revealing a misalignment between their internal cognition and external response. Finally, to resolve this issue, we propose a lightweight, two-stage method that combines cognitive monitoring with inference-time intervention. Experimental results demonstrate that our method significantly improves the abstention rate while maintaining the overall reasoning performance.

[104] Building Self-Evolving Agents via Experience-Driven Lifelong Learning: A Framework and Benchmark

Yuxuan Cai,Yipeng Hao,Jie Zhou,Hang Yan,Zhikai Lei,Rui Zhen,Zhenhua Han,Yutao Yang,Junsong Li,Qianjun Pan,Tianyu Huai,Qin Chen,Xin Li,Kai Chen,Bo Zhang,Xipeng Qiu,Liang He

Main category: cs.AI

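TL;DR: The paper proposes Experience-driven Lifelong Learning (ELL), a framework for building agents that keep evolving through interaction with dynamic environments, and introduces StuLife, a benchmark dataset that simulates a student's college journey.

Details Motivation: As AI advances toward general intelligence, research focus is shifting from systems optimized for static tasks to open-ended agents that learn continuously. The paper aims to advance agents' capacity for self-evolution through an experience-driven lifelong learning framework.

Contribution: 1. Proposes the ELL framework, which builds self-evolving agents on four core principles (experience exploration, long-term memory, skill learning, and knowledge internalization); 2. introduces the StuLife benchmark dataset, which simulates a student's college journey to evaluate lifelong learning capabilities.

Method: The ELL framework combines continuous interaction, a memory system, skill abstraction, and knowledge internalization. StuLife simulates a dynamic environment across three phases and ten sub-scenarios, supporting evaluation of agents' proactive learning and memory.

Result: StuLife provides a comprehensive platform for evaluating lifelong learning capabilities, including memory retention, skill transfer, and self-motivated behavior. The paper also explores the role of context engineering in advancing AGI.

Insight: Through experience-driven learning and interaction with dynamic environments, agents can gradually internalize knowledge and develop intuitive capabilities, further advancing progress toward general intelligence.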

Abstract: As AI advances toward general intelligence, the focus is shifting from systems optimized for static tasks to creating open-ended agents that learn continuously. In this paper, we introduce Experience-driven Lifelong Learning (ELL), a framework for building self-evolving agents capable of continuous growth through real-world interaction. The framework is built on four core principles: (1) Experience Exploration: Agents learn through continuous, self-motivated interaction with dynamic environments, navigating interdependent tasks and generating rich experiential trajectories. (2) Long-term Memory: Agents preserve and structure historical knowledge, including personal experiences, domain expertise, and commonsense reasoning, into a persistent memory system. (3) Skill Learning: Agents autonomously improve by abstracting recurring patterns from experience into reusable skills, which are actively refined and validated for application in new tasks. (4) Knowledge Internalization: Agents internalize explicit and discrete experiences into implicit and intuitive capabilities as “second nature”. We also introduce StuLife, a benchmark dataset for ELL that simulates a student’s holistic college journey, from enrollment to academic and personal development, across three core phases and ten detailed sub-scenarios. StuLife is designed around three key paradigm shifts: From Passive to Proactive, From Context to Memory, and From Imitation to Learning. In this dynamic environment, agents must acquire and distill practical skills and maintain persistent memory to make decisions based on evolving state variables. StuLife provides a comprehensive platform for evaluating lifelong learning capabilities, including memory retention, skill transfer, and self-motivated behavior. Beyond evaluating SOTA LLMs on the StuLife benchmark, we also explore the role of context engineering in advancing AGI.

[105] StepWiser: Stepwise Generative Judges for Wiser Reasoning

Wei Xiong,Wenting Zhao,Weizhe Yuan,Olga Golovneva,Tong Zhang,Jason Weston,Sainbayar Sukhbaatar

Main category: cs.AI

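TL;DR: The paper proposes StepWiser, a generative judge that uses meta-reasoning to supervise the logical validity of intermediate steps in multi-step reasoning. It outperforms existing methods and improves model performance at both training and inference time.

Details Motivation: Existing methods supervise the intermediate steps of multi-step reasoning poorly. Classifier-style reward models provide no explanations and rely on static datasets, which limits generalization.

Contribution: 1. Reframes reward modeling from a classification task to a reasoning task; 2. proposes the generative judge StepWiser, which delivers its verdict after meta-reasoning; 3. trains it with reinforcement learning, improving judgment accuracy and model performance.

Method: StepWiser generates thinking tokens through meta-reasoning before outputting a final verdict. It is trained by reinforcement learning using the relative outcomes of rollouts.

Result: Experiments show that StepWiser outperforms existing methods in judgment accuracy on intermediate steps, in improving the policy model at training time, and in inference-time search.

Insight: Treating reward modeling as a reasoning task lets a generative approach provide more transparent supervision signals while improving model performance and generalization.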

Abstract: As models increasingly leverage multi-step reasoning strategies to solve complex problems, supervising the logical validity of these intermediate steps has become a critical research challenge. Process reward models address this by providing step-by-step feedback, but current approaches have two major drawbacks: they typically function as classifiers without providing explanations, and their reliance on supervised fine-tuning with static datasets limits generalization. Inspired by recent advances, we reframe stepwise reward modeling from a classification task to a reasoning task itself. We thus propose a generative judge that reasons about the policy model’s reasoning steps (i.e., meta-reasons), outputting thinking tokens before delivering a final verdict. Our model, StepWiser, is trained by reinforcement learning using relative outcomes of rollouts. We show that it (i) provides better judgment accuracy on intermediate steps than existing methods; (ii) can be used to improve the policy model at training time; and (iii) improves inference-time search.
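The phrase "relative outcomes of rollouts" can be read in several ways, and the abstract does not give the exact rule. One plausible reading, sketched here purely as an assumption, is that a step is labelled good when the estimated success rate of rollouts continued after it does not fall relative to the estimate before it. The function name and threshold are hypothetical.

```python
def label_steps(success_after_step, success_at_start, threshold=0.0):
    """Label each reasoning step by the change in rollout success rate.

    success_after_step[i] : fraction of rollouts, continued from the prefix
                            ending at step i, that reach a correct answer
    success_at_start      : the same estimate before any step is taken
    A step is labelled True (good) when the success rate changes by at least
    `threshold` relative to the previous prefix. This rule is an assumption
    made for illustration, not the paper's definition.
    """
    labels = []
    previous = success_at_start
    for current in success_after_step:
        labels.append(current - previous >= threshold)
        previous = current
    return labels


# The third step sharply lowers the chance of success and is flagged as bad.
print(label_steps([0.5, 0.5, 0.1, 0.4], success_at_start=0.5))
```

Labels of this kind could then serve as the reward signal when training a judge to predict step quality, though how StepWiser constructs its signal in detail is left to the paper.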

[106] Stabilizing Open-Set Test-Time Adaptation via Primary-Auxiliary Filtering and Knowledge-Integrated Prediction

Byung-Joon Lee,Jin-Seop Lee,Jee-Hyong Lee

Main category: cs.AI

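TL;DR: The paper proposes two new methods, Primary-Auxiliary Filtering (PAF) and Knowledge-Integrated Prediction (KIP), to address instability and error accumulation in open-set test-time adaptation (OSTTA).

Details Motivation: Real-world test data often exhibit domain shift, and open-set data further degrade closed-set accuracy. Existing methods rely on the source model to filter open-set data, which works poorly, while the adapting model is unstable on noisy test data and accumulates errors.

Contribution: Proposes PAF and KIP. PAF improves the accuracy of open-set filtering with a primary-auxiliary filtering mechanism, and KIP calibrates predictions by integrating the complementary knowledge of the adapting model, the EMA model, and the source model.

Method: PAF uses an auxiliary filter to validate the data filtered by the primary filter, reducing instability and error accumulation. KIP integrates the knowledge of the adapting model, the EMA model, and the source model to make predictions more robust.

Result: Experiments show that the method outperforms existing methods on a range of closed-set and open-set datasets, improving both closed-set accuracy and open-set discrimination.

Insight: The adapting model is unstable on noisy test data, but combining it with knowledge from other models can effectively improve the stability and accuracy of open-set test-time adaptation.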

Abstract: Deep neural networks demonstrate strong performance under aligned training-test distributions. However, real-world test data often exhibit domain shifts. Test-Time Adaptation (TTA) addresses this challenge by adapting the model to test data during inference. While most TTA studies assume that the training and test data share the same class set (closed-set TTA), real-world scenarios often involve open-set data (open-set TTA), which can degrade closed-set accuracy. A recent study showed that identifying open-set data during adaptation and maximizing its entropy is an effective solution. However, the previous method relies on the source model for filtering, resulting in suboptimal filtering accuracy on domain-shifted test data. In contrast, we found that the adapting model, which learns domain knowledge from noisy test streams, tends to be unstable and leads to error accumulation when used for filtering. To address this problem, we propose Primary-Auxiliary Filtering (PAF), which employs an auxiliary filter to validate data filtered by the primary filter. Furthermore, we propose Knowledge-Integrated Prediction (KIP), which calibrates the outputs of the adapting model, EMA model, and source model to integrate their complementary knowledge for OSTTA. We validate our approach across diverse closed-set and open-set datasets. Our method enhances both closed-set accuracy and open-set discrimination over existing methods. The code is available at https://github.com/powerpowe/PAF-KIP-OSTTA.
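The two ideas in the abstract, a second filter that validates the first and a prediction that pools three models, can be illustrated with a small sketch. The entropy threshold, the rule that both filters must agree, and the plain averaging of probabilities are stand-ins assumed for this example. They are not the calibration or filtering rules of the paper, and every function name is hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def is_open_set(primary_probs, auxiliary_probs, threshold):
    """Flag a sample as open-set only when both filters find it uncertain.

    Assumed rule: the primary filter's decision (high entropy) counts only if
    the auxiliary filter confirms it.
    """
    return entropy(primary_probs) > threshold and entropy(auxiliary_probs) > threshold


def integrated_prediction(adapting_probs, ema_probs, source_probs):
    """Pool three models' class probabilities by simple averaging (a stand-in)."""
    triples = zip(adapting_probs, ema_probs, source_probs)
    return [sum(triple) / 3 for triple in triples]


uniform = [0.25, 0.25, 0.25, 0.25]
peaked = [0.97, 0.01, 0.01, 0.01]
print(is_open_set(uniform, uniform, threshold=1.0))
print(is_open_set(uniform, peaked, threshold=1.0))
print(integrated_prediction(peaked, peaked, uniform))
```

Requiring agreement makes the filter more conservative: an unstable primary model that wrongly flags a confident closed-set sample is overruled by the auxiliary model.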

eess.SP [Back]

[107] EMind: A Foundation Model for Multi-task Electromagnetic Signals Understanding

Luqing Luo,Wenjin Gui,Yunfei Liu,Ziyue Zhang,Yunxi Zhang,Fengxiang Wang,Zonghao Guo,Zizhi Ma,Xinzhu Liu,Hanxiang He,Jinhai Li,Xin Qiu,Wupeng Xie,Yangang Sun

Main category: eess.SP

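TL;DR: EMind is the first multi-task foundation model for electromagnetic signals. It tackles their high heterogeneity, strong background noise, and complex time-frequency structure, and achieves cross-task generalization and efficient transfer through large-scale pretraining and a unified dataset.

Details Motivation: Electromagnetic signals differ greatly from text and images, so existing general models cannot be applied directly. Task diversity leads to weak cross-task generalization, and the lack of large, high-quality datasets has held back multi-task learning frameworks.

Contribution: 1. Builds the first unified, large-scale standardized dataset of electromagnetic signals; 2. designs a length-adaptive multi-signal packing method and a hardware-aware training strategy; 3. achieves strong generalization across multiple tasks.

Method: Exploits the physical properties of electromagnetic signals to devise a length-adaptive multi-signal packing method and a hardware-aware training strategy, and combines them with large-scale pretraining to build a unified framework.

Result: Experiments show that EMind performs strongly on multiple downstream tasks, moving from task-specific models to a unified framework.

Insight: Data preprocessing and training strategies driven by physical properties can substantially improve the generalization and efficiency of electromagnetic signal models.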

Abstract: Deep understanding of electromagnetic signals is fundamental to dynamic spectrum management, intelligent transportation, autonomous driving and unmanned vehicle perception. The field faces challenges because electromagnetic signals differ greatly from text and images, showing high heterogeneity, strong background noise and complex joint time-frequency structure, which prevents existing general models from direct use. Electromagnetic communication and sensing tasks are diverse, current methods lack cross-task generalization and transfer efficiency, and the scarcity of large, high-quality datasets blocks the creation of a truly general multitask learning framework. To overcome these issues, we introduce EMind, an electromagnetic signals foundation model that bridges large-scale pretraining and the unique nature of this modality. We build the first unified and largest standardized electromagnetic signal dataset covering multiple signal types and tasks. By exploiting the physical properties of electromagnetic signals, we devise a length-adaptive multi-signal packing method and a hardware-aware training strategy that enable efficient use and representation learning from heterogeneous multi-source signals. Experiments show that EMind achieves strong performance and broad generalization across many downstream tasks, moving decisively from task-specific models to a unified framework for electromagnetic intelligence. The code is available at: https://github.com/GabrielleTse/EMind.