Table of Contents

cs.CL

[1] PhysicsEval: Inference-Time Techniques to Improve the Reasoning Proficiency of Large Language Models on Physics Problems

Oshayer Siddique,J. M Areeb Uzair Alam,Md Jobayer Rahman Rafy,Syed Rifat Raiyan,Hasan Mahmud,Md Kamrul Hasan

Main category: cs.CL

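TL;DR: This paper introduces PhysicsEval, a new evaluation benchmark for physics problems, and studies how a multi-agent framework and inference-time techniques can improve the reasoning of large language models (LLMs) on physics problems, with significant gains on problems the models initially handle poorly.

Motivation: Physics problem solving is an important domain of natural language reasoning, yet the performance of frontier LLMs on such problems has not been systematically evaluated. Improving their physics reasoning matters both for advancing technology and for deepening scientific understanding.

Contribution: 1) Proposes PhysicsEval, a new evaluation benchmark for physics problems; 2) significantly improves LLM performance on physics problems through a multi-agent framework and inference-time techniques; 3) releases code and data publicly to support further research.

Method: Uses a multi-agent framework in which smaller LLM agents verify proposed solutions stage by stage, combined with several inference-time techniques such as cumulative solution verification. The performance gains of the different techniques are compared.

Result: The multi-agent framework significantly improves LLM performance on physics problems that the models initially perform poorly on. The PhysicsEval benchmark contains 19,609 problems with their correct solutions.

Insight: Multi-agent collaboration and inference-time optimization are effective ways to improve LLM performance in complex domains such as physics, and the new benchmark is a useful resource for future research.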

Abstract: The discipline of physics stands as a cornerstone of human intellect, driving the evolution of technology and deepening our understanding of the fundamental principles of the cosmos. Contemporary literature includes some works centered on the task of solving physics problems - a crucial domain of natural language reasoning. In this paper, we evaluate the performance of frontier LLMs in solving physics problems, both mathematical and descriptive. We also employ a plethora of inference-time techniques and agentic frameworks to improve the performance of the models. This includes the verification of proposed solutions in a cumulative fashion by other, smaller LLM agents, and we perform a comparative analysis of the performance that the techniques entail. There are significant improvements when the multi-agent framework is applied to problems that the models initially perform poorly on. Furthermore, we introduce a new evaluation benchmark for physics problems, PhysicsEval, consisting of 19,609 problems sourced from various physics textbooks and their corresponding correct solutions scraped from physics forums and educational websites. Our code and data are publicly available at https://github.com/areebuzair/PhysicsEval.

[2] Semiotic Complexity and Its Epistemological Implications for Modeling Culture

Zachary K. Stine,James E. Deitrick

Main category: cs.CL

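TL;DR: The paper calls for greater theorizing of methods in the computational humanities, introduces the concept of semiotic complexity, identifies the translation errors that arise in current modeling practice when semiotic complexity is ignored, and offers recommendations for improvement.

Motivation: The computational humanities need more theorizing of methods to gain epistemological and interpretive clarity and so help the field mature. The paper aims to expose the translation errors that arise in modeling practice when semiotic complexity is ignored.

Contribution: Introduces semiotic complexity, the degree to which the meaning of a text varies across interpretive lenses, and argues that current modeling practices (especially evaluation methods) are flawed because they treat semiotically complex data as semiotically simple.

Method: Frames modeling work as translation from a cultural, linguistic domain into a computational, mathematical domain, and analyzes the problems caused by the lack of theorizing about this translation process.

Result: Identifies the neglect of semiotic complexity as a problem and offers recommendations for improving modeling practice.

Insight: Semiotic complexity is a key, overlooked dimension of modeling practice; ignoring it leads to translation errors and weak interpretive transparency. Researchers need to treat it with more care.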

Abstract: Greater theorizing of methods in the computational humanities is needed for epistemological and interpretive clarity, and therefore the maturation of the field. In this paper, we frame such modeling work as engaging in translation work from a cultural, linguistic domain into a computational, mathematical domain, and back again. Translators benefit from articulating the theory of their translation process, and so do computational humanists in their work – to ensure internal consistency, avoid subtle yet consequential translation errors, and facilitate interpretive transparency. Our contribution in this paper is to lay out a particularly consequential dimension of the lack of theorizing and the sorts of translation errors that emerge in our modeling practices as a result. Along these lines we introduce the idea of semiotic complexity as the degree to which the meaning of some text may vary across interpretive lenses, and make the case that dominant modeling practices – especially around evaluation – commit a translation error by treating semiotically complex data as semiotically simple when it seems epistemologically convenient by conferring superficial clarity. We then lay out several recommendations for researchers to better account for these epistemological issues in their own work.

[3] Tabular Data Understanding with LLMs: A Survey of Recent Advances and Challenges

Xiaofeng Wu,Alan Ritter,Wei Xu

Main category: cs.CL

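TL;DR: This survey reviews recent advances and challenges in how large language models (LLMs) and multimodal large language models (MLLMs) handle tabular data, and presents a taxonomy of tabular input representations together with an overview of table understanding tasks.

Motivation: Tables are two-dimensional and highly varied, ranging from well-structured database tables to complex spreadsheets, so universal approaches are hard to apply and specialized methods are needed.

Contribution: Proposes a taxonomy of tabular input representations, summarizes table understanding tasks, and identifies key research gaps: the need for tasks that demand more sophisticated reasoning, the difficulty of handling complex table structures and multi-table scenarios, and the limited generalization of current models.

Method: Surveys existing methods and tasks, proposes a taxonomy that distinguishes the ways tabular input is represented, and analyzes the challenges of the different tasks.

Result: Current models perform poorly on complex table structures, large-scale tables, and multi-table scenarios, and generalize only to a limited extent across formats.

Insight: The survey exposes current limitations of table understanding, in particular the need for stronger reasoning and better handling of complex structures, and points to directions for future research.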

Abstract: Tables have gained significant attention in large language models (LLMs) and multimodal large language models (MLLMs) due to their complex and flexible structure. Unlike linear text inputs, tables are two-dimensional, encompassing formats that range from well-structured database tables to complex, multi-layered spreadsheets, each with different purposes. This diversity in format and purpose has led to the development of specialized methods and tasks, instead of universal approaches, making navigation of table understanding tasks challenging. To address these challenges, this paper introduces key concepts through a taxonomy of tabular input representations and an introduction of table understanding tasks. We highlight several critical gaps in the field that indicate the need for further research: (1) the predominance of retrieval-focused tasks that require minimal reasoning beyond mathematical and logical operations; (2) significant challenges faced by models when processing complex table structures, large-scale tables, long contexts, or multi-table scenarios; and (3) the limited generalization of models across different tabular representations and formats.

[4] Semantic Compression for Word and Sentence Embeddings using Discrete Wavelet Transform

Rana Aref Salama,Abdou Youssef,Mona Diab

Main category: cs.CL

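TL;DR: The paper applies the Discrete Wavelet Transform (DWT) to word and sentence embeddings and shows that it can reduce embedding dimensionality while preserving semantic information. Experiments show that DWT reduces dimensionality by 50-93% with almost no change in performance on semantic similarity tasks and better accuracy on most downstream tasks.

Motivation: Wavelet transforms are widely used in signal and image processing, but their potential in natural language processing (NLP) is largely unexplored. The authors use DWT to analyze embedding representations at multiple resolutions and to compress them while maintaining their semantic quality.

Contribution: 1) Applies DWT to word and sentence embeddings for efficient compression; 2) demonstrates DWT's ability to analyze embeddings at multiple resolutions; 3) validates DWT experimentally on semantic similarity and downstream tasks.

Method: Uses DWT for multi-resolution analysis and compression of word and sentence embeddings, and evaluates the compressed embeddings on semantic similarity tasks and downstream tasks.

Result: DWT reduces embedding dimensionality by 50-93% with almost no change in performance on semantic similarity tasks, while achieving better accuracy on most downstream tasks.

Insight: Wavelet transforms can extract and compress the semantic information in embeddings, giving NLP applications an efficient data representation with lower compute and storage costs.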

Abstract: Wavelet transforms, a powerful mathematical tool, have been widely used in different domains, including Signal and Image processing, to unravel intricate patterns, enhance data representation, and extract meaningful features from data. Tangible results from their application suggest that Wavelet transforms can be applied to NLP capturing a variety of linguistic and semantic properties. In this paper, we empirically leverage the application of Discrete Wavelet Transforms (DWT) to word and sentence embeddings. We aim to showcase the capabilities of DWT in analyzing embedding representations at different levels of resolution and compressing them while maintaining their overall quality. We assess the effectiveness of DWT embeddings on semantic similarity tasks to show how DWT can be used to consolidate important semantic information in an embedding vector. We show the efficacy of the proposed paradigm using different embedding models, including large language models, on downstream tasks. Our results show that DWT can reduce the dimensionality of embeddings by 50-93% with almost no change in performance for semantic similarity tasks, while achieving superior accuracy in most downstream tasks. Our findings pave the way for applying DWT to improve NLP applications.
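
To make the compression idea concrete, the sketch below applies a one-level Haar DWT to a vector and keeps only the approximation coefficients, which halves the dimensionality at each level. The Haar wavelet, the choice to discard detail coefficients, and the toy vectors are assumptions made here for illustration; the abstract does not say which wavelet family or coefficient-selection rule the authors use.

```python
import math

def haar_dwt(vec):
    """One level of the orthonormal Haar DWT: returns (approximation, detail)."""
    if len(vec) % 2 != 0:
        raise ValueError("vector length must be even")
    s = math.sqrt(2.0)
    approx = [(vec[i] + vec[i + 1]) / s for i in range(0, len(vec), 2)]
    detail = [(vec[i] - vec[i + 1]) / s for i in range(0, len(vec), 2)]
    return approx, detail

def compress(vec, levels=1):
    """Keep only the approximation coefficients after `levels` Haar steps."""
    out = list(vec)
    for _ in range(levels):
        out, _detail = haar_dwt(out)
    return out

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 8-dimensional "embeddings" (made-up values, not data from the paper).
u = [0.9, 1.0, 0.2, 0.1, -0.5, -0.4, 0.3, 0.3]
v = [1.0, 0.8, 0.1, 0.2, -0.6, -0.5, 0.2, 0.4]

u_half, v_half = compress(u, 1), compress(v, 1)        # 8 -> 4 dims (50% smaller)
u_quarter, v_quarter = compress(u, 2), compress(v, 2)  # 8 -> 2 dims (75% smaller)

print("original similarity:", cosine(u, v))
print("after 1 level:      ", cosine(u_half, v_half))
print("after 2 levels:     ", cosine(u_quarter, v_quarter))
```

Comparing the printed similarities gives a rough sense of how much pairwise structure survives each level; a real evaluation would use trained embeddings and a semantic similarity benchmark.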

[5] Integrating clinical reasoning into large language model-based diagnosis through etiology-aware attention steering

Peixian Li,Yu Tian,Ruiqi Tu,Chengkai Wu,Jingjing Ren,Jingsong Li

Main category: cs.CL

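TL;DR: The authors propose an etiology-aware attention steering framework that integrates structured clinical reasoning to improve the accuracy and reliability of large language models (LLMs) in complex clinical diagnosis.

Motivation: LLMs perform well at medical text understanding and generation, but their diagnostic reliability in complex clinical scenarios is limited, so their clinical reasoning needs to improve.

Contribution: Proposes the Etiology-Aware Attention Steering Framework, comprising Clinical Reasoning Scaffolding (CRS), an Etiology-Aware Head Identification algorithm, and Reasoning-Guided Parameter-Efficient Fine-tuning.

Method: Builds the CRS, identifies the key attention heads, and introduces a reasoning-guided loss function that steers the model's attention toward critical information.

Result: On the Consistent Diagnosis Cohort, average diagnostic accuracy improves by 15.65% and the average Reasoning Focus Score by 31.6%; external validation further confirms the framework's effectiveness.

Insight: Aligning model attention with structured clinical reasoning can yield more reliable and interpretable AI diagnostic systems.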

Abstract: Objective: Large Language Models (LLMs) demonstrate significant capabilities in medical text understanding and generation. However, their diagnostic reliability in complex clinical scenarios remains limited. This study aims to enhance LLMs’ diagnostic accuracy and clinical reasoning ability. Method: We propose an Etiology-Aware Attention Steering Framework to integrate structured clinical reasoning into LLM-based diagnosis. Specifically, we first construct Clinical Reasoning Scaffolding (CRS) based on authoritative clinical guidelines for three representative acute abdominal emergencies: acute appendicitis, acute pancreatitis, and acute cholecystitis. Next, we develop the Etiology-Aware Head Identification algorithm to pinpoint attention heads crucial for the model’s etiology reasoning. To ensure reliable clinical reasoning alignment, we introduce the Reasoning-Guided Parameter-Efficient Fine-tuning that embeds etiological reasoning cues into input representations and steers the selected Etiology-Aware Heads toward critical information through a Reasoning-Guided Loss function. Result: On the Consistent Diagnosis Cohort, our framework improves average diagnostic accuracy by 15.65% and boosts the average Reasoning Focus Score by 31.6% over baselines. External validation on the Discrepant Diagnosis Cohort further confirms its effectiveness in enhancing diagnostic accuracy. Further assessments via Reasoning Attention Frequency indicate that our models exhibit enhanced reliability when faced with real-world complex scenarios. Conclusion: This study presents a practical and effective approach to enhance clinical reasoning in LLM-based diagnosis. By aligning model attention with structured CRS, the proposed framework offers a promising paradigm for building more interpretable and reliable AI diagnostic systems in complex clinical settings.

[6] Systematic Evaluation of Optimization Techniques for Long-Context Language Models

Ammar Ahmed,Sheng Di,Franck Cappello,Zirui Liu,Jingoo Han,Ali Anwar

Main category: cs.CL

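TL;DR: The paper systematically evaluates optimization techniques for long-context language models, including pruning, quantization, and token dropping, analyzes their effect on memory usage, latency, throughput, and text generation quality, and shows that combining optimizations can hurt larger models.

Motivation: Large language models (LLMs) face heavy resource demands and limited context windows in long-context scenarios, and the effectiveness of existing optimization techniques has not been systematically evaluated.

Contribution: 1) Systematically benchmarks pruning, quantization, and token dropping; 2) analyzes how these techniques behave in combination and their negative effect on larger models; 3) shows that the F1 score can mask precision-recall trade-offs in question answering tasks.

Method: Studies two LLM architectures that support long context, evaluates performance metrics for individual and combined optimizations, and extends the analysis to a 70-billion-parameter model.

Result: Combining optimization techniques can degrade larger models because approximation errors compound, and relying on F1 alone hides precision-recall trade-offs.

Insight: Optimization techniques should be combined with care, especially on larger models, and system-level profiling should be paired with task-specific evaluation.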

Abstract: Large language models (LLMs) excel across diverse natural language processing tasks but face resource demands and limited context windows. Although techniques like pruning, quantization, and token dropping can mitigate these issues, their efficacy in long-context scenarios and system evaluation remains underexplored. This paper systematically benchmarks these optimizations, characterizing memory usage, latency, and throughput, and studies how these methods impact the quality of text generation. We first analyze individual optimization methods for two LLM architectures supporting long context and then systematically evaluate combinations of these techniques to assess how this deeper analysis impacts performance metrics. We subsequently study the scalability of individual optimization methods on a larger 70-billion-parameter model. Our novel insights reveal that naively combining inference optimization algorithms can adversely affect larger models due to compounded approximation errors, as compared to their smaller counterparts. Experiments show that relying solely on F1 obscures these effects by hiding precision-recall trade-offs in question answering tasks. By integrating system-level profiling with task-specific insights, this study helps LLM practitioners and researchers explore and balance efficiency, accuracy, and scalability across tasks and hardware configurations.
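
The point about F1 hiding precision-recall trade-offs can be seen in a toy example. The sketch below computes token-overlap precision, recall, and F1, a common scoring scheme for extractive question answering that is assumed here rather than taken from the paper, for a terse answer and a verbose answer to the same made-up question.

```python
from collections import Counter

def token_prf(prediction, reference):
    """Token-overlap precision, recall, and F1 for whitespace-tokenised strings."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Made-up reference and predictions, for illustration only.
reference = "marie curie won two nobel prizes"
terse = "marie curie won"
verbose = "i think marie curie won two nobel prizes in physics and chemistry"

terse_scores = token_prf(terse, reference)      # high precision, low recall
verbose_scores = token_prf(verbose, reference)  # low precision, high recall

print("terse   (P, R, F1):", terse_scores)
print("verbose (P, R, F1):", verbose_scores)
```

Both answers receive the same F1 even though one is truncated and the other is padded with extra tokens, so reporting precision and recall alongside F1 distinguishes the two failure modes.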

[7] Improving Multimodal Contrastive Learning of Sentence Embeddings with Object-Phrase Alignment

Kaiyan Zhao,Zhongtao Miao,Yoshimasa Tsuruoka

Main category: cs.CL

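TL;DR: MCSEO improves multimodal sentence embedding models through fine-grained object-phrase alignment, using existing segmentation and object detection models to optimize a contrastive learning objective, and significantly outperforms traditional image-caption alignment methods.

Motivation: The image-caption pairs used to train multimodal sentence embedding models often contain noisy or redundant information that hurts model performance. MCSEO addresses this with more precise object-phrase alignment.

Contribution: Proposes MCSEO, which adds object-phrase alignment to the contrastive learning objective and improves multimodal sentence embeddings.

Method: Uses segmentation and object detection models to extract object-phrase pairs, and designs a contrastive learning objective tailored to object-phrase correspondence.

Result: MCSEO outperforms the baselines on multiple STS tasks, confirming the importance of fine-grained alignment in multimodal representation learning.

Insight: Fine-grained object-phrase alignment effectively improves multimodal models; handling noisy and redundant information is key.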

Abstract: Multimodal sentence embedding models typically leverage image-caption pairs in addition to textual data during training. However, such pairs often contain noise, including redundant or irrelevant information on either the image or caption side. To mitigate this issue, we propose MCSEO, a method that enhances multimodal sentence embeddings by incorporating fine-grained object-phrase alignment alongside traditional image-caption alignment. Specifically, MCSEO utilizes existing segmentation and object detection models to extract accurate object-phrase pairs, which are then used to optimize a contrastive learning objective tailored to object-phrase correspondence. Experimental results on semantic textual similarity (STS) tasks across different backbone models demonstrate that MCSEO consistently outperforms strong baselines, highlighting the significance of precise object-phrase alignment in multimodal representation learning.
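
A contrastive objective over object-phrase pairs can be sketched with the standard InfoNCE loss, where each phrase embedding should score highest against its own object embedding and the other objects in the batch act as negatives. InfoNCE, the temperature value, and the toy vectors are assumptions chosen here; the abstract does not give the exact form of MCSEO's objective.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def info_nce(anchors, positives, temperature=0.05):
    """Mean InfoNCE loss: positives[i] matches anchors[i]; the rest are negatives."""
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        # Log-sum-exp with the max subtracted for numerical stability.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(z - m) for z in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)

# Toy 2-D embeddings (hypothetical values): two phrases and two detected objects.
phrase_vecs = [[1.0, 0.0], [0.0, 1.0]]
aligned_objects = [[0.9, 0.1], [0.1, 0.9]]     # object i matches phrase i
misaligned_objects = [[0.1, 0.9], [0.9, 0.1]]  # pairs swapped

loss_aligned = info_nce(phrase_vecs, aligned_objects)
loss_misaligned = info_nce(phrase_vecs, misaligned_objects)
print("aligned loss:   ", loss_aligned)
print("misaligned loss:", loss_misaligned)
```

The loss is small when each phrase is closest to its own object and large when the pairs are swapped, which is the signal a trainer would minimize.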

[8] PilotRL: Training Language Model Agents via Global Planning-Guided Progressive Reinforcement Learning

Keer Lu,Chong Chen,Bin Cui,Huang Leng,Wentao Zhang

Main category: cs.CL

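TL;DR: PilotRL is a new training framework for LLM agents based on global planning-guided progressive reinforcement learning. It addresses long-term planning and planner-executor coordination in complex tasks and delivers significant performance gains.

Motivation: LLM agents built on the ReAct paradigm are limited in complex tasks because they lack long-term planning and coordination between planning and execution, and supervised fine-tuning leaves them with weak generalization. PilotRL addresses these issues with global planning and progressive reinforcement learning.

Contribution: 1) Proposes AdaPlan, an adaptive global plan-based agent paradigm; 2) designs the PilotRL training framework, which optimizes planning and execution together; 3) surpasses GPT-4o and other models in experiments.

Method: 1) Global plans guide the agent's task execution; 2) progressive reinforcement learning optimizes plan following and then plan quality in stages; 3) planning and execution coordination are jointly optimized.

Result: LLaMA3.1-8B-Instruct + PilotRL surpasses GPT-4o by 3.60% and improves on GPT-4o-mini by 55.78%.

Insight: Combining global planning with progressive reinforcement learning substantially improves LLM agents on complex tasks and highlights the value of jointly optimizing planning and execution.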

Abstract: Large Language Models (LLMs) have shown remarkable advancements in tackling agent-oriented tasks. Despite their potential, existing work faces challenges when deploying LLMs in agent-based environments. The widely adopted agent paradigm ReAct centers on integrating single-step reasoning with immediate action execution, which limits its effectiveness in complex tasks requiring long-term strategic planning. Furthermore, the coordination between the planner and executor during problem-solving is also a critical factor to consider in agent design. Additionally, current approaches predominantly rely on supervised fine-tuning, which often leads models to memorize established task completion trajectories, thereby restricting their generalization ability when confronted with novel problem contexts. To address these challenges, we introduce an adaptive global plan-based agent paradigm AdaPlan, aiming to synergize high-level explicit guidance with execution to support effective long-horizon decision-making. Based on the proposed paradigm, we further put forward PilotRL, a global planning-guided training framework for LLM agents driven by progressive reinforcement learning. We first develop the model’s ability to follow explicit guidance from global plans when addressing agent tasks. Subsequently, based on this foundation, we focus on optimizing the quality of generated plans. Finally, we conduct joint optimization of the model’s planning and execution coordination. Experiments indicate that PilotRL could achieve state-of-the-art performance, with LLaMA3.1-8B-Instruct + PilotRL surpassing closed-source GPT-4o by 3.60%, while showing a more substantial gain of 55.78% compared to GPT-4o-mini at a comparable parameter scale.

[9] Lucy: edgerunning agentic web search on mobile with machine generated task vectors

Alan Dao,Dinh Bach Vu,Alex Nguyen,Norapat Buppodom

Main category: cs.CL

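TL;DR: The paper proposes treating the internal reasoning of a small language model (SLM) as a dynamic task vector machine and optimizing it with RLVR, enabling efficient web search on mobile devices. Lucy, a 1.7B-parameter model, reaches 78.3% accuracy on the SimpleQA benchmark, on par with much larger models.

Motivation: SLMs perform poorly on knowledge-intensive tasks because of their limited capacity, and existing approaches treat reasoning as a fixed or heuristic process that does not make full use of the model's potential.

Contribution: Proposes the dynamic task vector machine paradigm, which views the model's reasoning as task vectors that are constructed and refined on the fly, and optimizes it with RLVR.

Method: The model marks its internal reasoning with tags and treats that reasoning as a dynamic task vector machine, which is optimized with RLVR. Lucy combines this mechanism with MCP integration.

Result: Lucy reaches 78.3% accuracy on SimpleQA, on par with the much larger DeepSeek-V3, showing that small models with structured, self-constructed task reasoning can rival large ones.

Insight: The dynamic task vector machine paradigm can substantially improve small models and offers an efficient option for resource-constrained environments such as mobile devices.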

Abstract: Small language models (SLMs) are inherently limited in knowledge-intensive tasks due to their constrained capacity. While test-time computation offers a path to enhanced performance, most approaches treat reasoning as a fixed or heuristic process. In this work, we propose a new paradigm: viewing the model’s internal reasoning, delimited by opening and closing tags, as a dynamic task vector machine. Rather than treating the content inside these tags as a mere trace of thought, we interpret the generation process itself as a mechanism through which the model constructs and refines its own task vectors on the fly. We developed a method to optimize this dynamic task vector machine through RLVR and successfully trained an agentic web-search model. We present Lucy, a 1.7B-parameter SLM that leverages this dynamic reasoning mechanism with MCP integration to achieve 78.3% accuracy on the SimpleQA benchmark, performing on par with much larger models such as DeepSeek-V3. This demonstrates that small models can rival large ones when equipped with structured, self-constructed task reasoning.

[10] SA-GCS: Semantic-Aware Gaussian Curriculum Scheduling for UAV Vision-Language Navigation

Hengxing Cai,Jinhan Dong,Yijie Rao,Jingcheng Deng,Jingjun Tan,Qien Chen,Haidong Wang,Zhen Wang,Shiyu Huang,Agachai Sumalee,Renxin Zhong

Main category: cs.CL

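TL;DR: The paper proposes SA-GCS, a framework that combines curriculum learning with reinforcement learning and dynamically adjusts the difficulty distribution of training samples, significantly improving training efficiency and performance on UAV vision-language navigation.

Motivation: Existing reinforcement learning methods for UAV vision-language navigation use training data inefficiently, converge slowly, and pay too little attention to variation in sample difficulty, which limits further performance gains.

Contribution: Proposes the SA-GCS framework, consisting of a Semantic-Aware Difficulty Estimator (SA-DE) and a Gaussian Curriculum Scheduler (GCS), which dynamically adjusts the difficulty distribution of training samples for a smooth progression from easy to hard.

Method: SA-GCS uses SA-DE to quantify the complexity of training samples and GCS to dynamically adjust the sampling distribution, integrating curriculum learning into reinforcement learning.

Result: On the CityNav benchmark, SA-GCS outperforms the baselines on all metrics, converges faster and more stably, and generalizes well.

Insight: Dynamically adjusting the difficulty distribution of training samples is key to more efficient reinforcement learning, and semantic information further improves the accuracy of task understanding.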

Abstract: Unmanned Aerial Vehicle (UAV) Vision-Language Navigation (VLN) aims to enable agents to accurately localize targets and plan flight paths in complex environments based on natural language instructions, with broad applications in intelligent inspection, disaster rescue, and urban monitoring. Recent progress in Vision-Language Models (VLMs) has provided strong semantic understanding for this task, while reinforcement learning (RL) has emerged as a promising post-training strategy to further improve generalization. However, existing RL methods often suffer from inefficient use of training data, slow convergence, and insufficient consideration of the difficulty variation among training samples, which limits further performance improvement. To address these challenges, we propose Semantic-Aware Gaussian Curriculum Scheduling (SA-GCS), a novel training framework that systematically integrates Curriculum Learning (CL) into RL. SA-GCS employs a Semantic-Aware Difficulty Estimator (SA-DE) to quantify the complexity of training samples and a Gaussian Curriculum Scheduler (GCS) to dynamically adjust the sampling distribution, enabling a smooth progression from easy to challenging tasks. This design significantly improves training efficiency, accelerates convergence, and enhances overall model performance. Extensive experiments on the CityNav benchmark demonstrate that SA-GCS consistently outperforms strong baselines across all metrics, achieves faster and more stable convergence, and generalizes well across models of different scales, highlighting its robustness and scalability. The implementation of our approach is publicly available.
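
One way to picture a Gaussian curriculum scheduler is as a sampling distribution over per-sample difficulty scores whose centre moves from easy to hard as training proceeds. The sketch below is an illustration under assumptions made here: difficulty scores in [0, 1], a linear schedule for the Gaussian mean, and a fixed width `sigma`. The paper's actual SA-DE scores and GCS schedule are not described in the abstract.

```python
import math
import random

def gaussian_weights(difficulties, mu, sigma):
    """Normalised Gaussian sampling weights centred at difficulty `mu`."""
    raw = [math.exp(-((d - mu) ** 2) / (2 * sigma ** 2)) for d in difficulties]
    total = sum(raw)
    return [w / total for w in raw]

def curriculum_mean(step, total_steps, lo=0.0, hi=1.0):
    """Assumed linear schedule: move the Gaussian centre from easy to hard."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return lo + frac * (hi - lo)

# Hypothetical difficulty scores for five training samples.
difficulties = [0.1, 0.3, 0.5, 0.7, 0.9]
rng = random.Random(0)  # seeded so the sampled batches are reproducible

schedule = {}
for step in (0, 50, 100):
    mu = curriculum_mean(step, total_steps=100)
    probs = gaussian_weights(difficulties, mu, sigma=0.2)
    batch = rng.choices(range(len(difficulties)), weights=probs, k=4)
    schedule[step] = (probs, batch)
    print(step, [round(p, 3) for p in probs], batch)
```

Early steps concentrate probability on easy samples and late steps on hard ones, while the Gaussian tails keep some mass on neighbouring difficulty levels so the transition stays smooth.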

[11] Combining Discrete Wavelet and Cosine Transforms for Efficient Sentence Embedding

Rana Salama,Abdou Youssef,Mona Diab

Main category: cs.CL

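TL;DR: The paper proposes a non-parameterized model that combines the Discrete Wavelet Transform (DWT) and the Discrete Cosine Transform (DCT) to produce sentence embeddings efficiently, with results comparable to, and on some tasks better than, the original embeddings.

Motivation: The success of wavelet transforms in image and signal processing motivates the authors to bring them to natural language processing, combining DWT and DCT to extract and compress the key information in a sentence.

Contribution: Proposes a non-parameterized sentence embedding model combining DWT and DCT that retains rich local word-feature information in a fixed-size vector.

Method: 1) Uses DWT to reduce dimensionality while keeping the important information in word vectors; 2) combines it with DCT to compress a sentence into a fixed-size vector.

Result: The model performs well on downstream tasks and outperforms the original embeddings on some of them.

Insight: Wavelet transforms show promise for NLP, especially where efficient information compression is needed, and a non-parameterized model avoids training cost.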

Abstract: Wavelets have emerged as a cutting edge technology in a number of fields. Concrete results of their application in Image and Signal processing suggest that wavelets can be effectively applied to Natural Language Processing (NLP) tasks that capture a variety of linguistic properties. In this paper, we leverage the power of applying Discrete Wavelet Transforms (DWT) to word and sentence embeddings. We first evaluate, intrinsically and extrinsically, how wavelets can effectively be used to consolidate important information in a word vector while reducing its dimensionality. We further combine DWT with Discrete Cosine Transform (DCT) to propose a non-parameterized model that compresses a sentence with a dense amount of information in a fixed size vector based on locally varying word features. We show the efficacy of the proposed paradigm on downstream application models, yielding comparable and even superior (in some tasks) results to original embeddings.
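
The DCT stage can be illustrated as follows: treat each embedding dimension as a signal over word positions, take its DCT-II, and keep the first `k` coefficients, so sentences of any length map to a vector of size `k * dim`. This sketch covers only the DCT step; in a combined pipeline each word vector could first be shortened with a DWT. The unnormalised DCT-II, the zero-padding rule for short sentences, and the toy vectors are assumptions made here, not details from the abstract.

```python
import math

def dct2(signal):
    """Unnormalised DCT-II of a 1-D sequence."""
    n = len(signal)
    return [
        sum(x * math.cos(math.pi * (t + 0.5) * k / n) for t, x in enumerate(signal))
        for k in range(n)
    ]

def sentence_vector(word_vectors, k):
    """First k DCT coefficients per embedding dimension, concatenated."""
    n_words = len(word_vectors)
    dim = len(word_vectors[0])
    out = []
    for j in range(dim):
        column = [w[j] for w in word_vectors]  # dimension j across word positions
        coeffs = dct2(column)
        coeffs += [0.0] * max(0, k - n_words)  # pad sentences shorter than k
        out.extend(coeffs[:k])
    return out

# Toy 2-D word vectors (hypothetical values) for sentences of different lengths.
short_sentence = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
long_sentence = [[0.5, 0.1], [0.2, 0.9], [0.7, 0.3], [0.4, 0.4], [0.6, 0.8]]

short_vec = sentence_vector(short_sentence, k=2)
long_vec = sentence_vector(long_sentence, k=2)
print(len(short_vec), len(long_vec))
```

The zeroth coefficient of each dimension is the sum of that dimension over the sentence, so `k = 1` reduces to summed word vectors, and larger `k` adds progressively finer information about word order.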

[12] ReaGAN: Node-as-Agent-Reasoning Graph Agentic Network

Minghao Guo,Xi Zhu,Jingyuan Huang,Kai Mei,Yongfeng Zhang

Main category: cs.CL

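TL;DR: ReaGAN is an agent-based graph learning method that gives each node autonomous decision-making (such as adaptive message propagation and global semantic retrieval), addressing the limitations of traditional GNNs in handling imbalanced node informativeness and capturing global semantic relationships.

Motivation: Traditional graph neural networks (GNNs) propagate node information through predefined aggregation mechanisms, but they cannot handle imbalance in node informativeness and they ignore global semantic relationships. ReaGAN addresses these issues with an agent framework and retrieval-augmented generation (RAG).

Contribution: 1) Proposes ReaGAN, an agent-based framework that gives nodes autonomous decision-making; 2) integrates RAG so that nodes can access global semantic information; 3) achieves competitive few-shot performance without fine-tuning a pretrained large model.

Method: 1) Models each node as an agent that plans its actions (such as message propagation) from its internal memory; 2) uses RAG to retrieve globally relevant semantic content and strengthen relationship modeling between nodes; 3) uses a frozen large language model (LLM) as the backbone.

Result: ReaGAN is competitive in few-shot settings, supporting the potential of agentic planning and local-global retrieval in graph learning.

Insight: 1) Treating nodes as agents gives graph learning dynamic decision-making; 2) RAG compensates for the weak global-relationship modeling of traditional GNNs; 3) using a frozen LLM shows that pretrained models transfer to graph tasks.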

Abstract: Graph Neural Networks (GNNs) have achieved remarkable success in graph-based learning by propagating information among neighbor nodes via predefined aggregation mechanisms. However, such fixed schemes often suffer from two key limitations. First, they cannot handle the imbalance in node informativeness – some nodes are rich in information, while others remain sparse. Second, predefined message passing primarily leverages local structural similarity while ignoring global semantic relationships across the graph, limiting the model’s ability to capture distant but relevant information. We propose Retrieval-augmented Graph Agentic Network (ReaGAN), an agent-based framework that empowers each node with autonomous, node-level decision-making. Each node acts as an agent that independently plans its next action based on its internal memory, enabling node-level planning and adaptive message propagation. Additionally, retrieval-augmented generation (RAG) allows nodes to access semantically relevant content and build global relationships in the graph. ReaGAN achieves competitive performance under few-shot in-context settings using a frozen LLM backbone without fine-tuning, showcasing the potential of agentic planning and local-global retrieval in graph learning.

[13] EFlat-LoRA: Efficiently Seeking Flat Minima for Better Generalization in Fine-Tuning Large Language Models and Beyond

Jiaxin Deng,Qingcheng Zhu,Junbiao Pang,Linlin Yang,Zhongqian Fu,Baochang Zhang

Main category: cs.CL

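TL;DR: EFlat-LoRA improves the generalization of low-rank adaptation (LoRA) by seeking flat minima, and outperforms both standard LoRA and full fine-tuning.

Motivation: Prior work has not fully explored the relationship between LoRA's expressive ability and its generalization ability, in particular the role flat minima play in generalization.

Contribution: Proposes Flat-LoRA and its efficient version EFlat-LoRA, and shows theoretically that perturbations in the full parameter space can be transferred to the low-rank subspace, improving LoRA's generalization.

Method: Through theoretical analysis and experiments, transfers full-space perturbations to the low-rank subspace, which avoids interference between perturbations across multiple matrices and enables an efficient search for flat minima.

Result: On GLUE with RoBERTa-large, EFlat-LoRA outperforms LoRA by 1.0% and full fine-tuning by 0.5% on average; on Qwen-VL-Chat it improves SQA and VizWiz by 1.5% and 1.0%, respectively.

Insight: Generalization is closely tied to flatness, and EFlat-LoRA opens a new optimization direction for LoRA.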

Abstract: Little research explores the correlation between the expressive ability and generalization ability of the low-rank adaptation (LoRA). Sharpness-Aware Minimization (SAM) improves model generalization for both Convolutional Neural Networks (CNNs) and Transformers by encouraging convergence to locally flat minima. However, the connection between sharpness and generalization has not been fully explored for LoRA due to the lack of tools to either empirically seek flat minima or develop theoretical methods. In this work, we propose Flat-LoRA and its efficient version, EFlat-LoRA, to seek flat minima for LoRA. Concretely, we theoretically demonstrate that perturbations in the full parameter space can be transferred to the low-rank subspace. This approach eliminates the potential interference introduced by perturbations across multiple matrices in the low-rank subspace. Our extensive experiments on large language models and vision-language models demonstrate that EFlat-LoRA achieves optimization efficiency comparable to that of LoRA while simultaneously attaining comparable or even better performance. For example, on the GLUE dataset with RoBERTa-large, EFlat-LoRA outperforms LoRA and full fine-tuning by 1.0% and 0.5% on average, respectively. On vision-language models, e.g., Qwen-VL-Chat, it shows performance improvements of 1.5% and 1.0% on the SQA and VizWiz datasets, respectively. These empirical results also verify that the generalization of LoRA is closely related to sharpness, which is omitted by previous methods.
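
For orientation, the two standard ingredients the abstract builds on can be written down as below: the SAM min-max objective over a perturbation ball, and the LoRA parametrization of a weight update. These are the textbook formulations, not the paper's derivation; the symbols $\rho$, $\epsilon$, $W_0$, $B$, $A$, $d$, $k$, and $r$ are notation chosen here.

```latex
% SAM: minimise the worst-case loss within an L2 ball of radius rho
\min_{W} \; \max_{\|\epsilon\|_2 \le \rho} \; L(W + \epsilon)

% LoRA: the frozen weight W_0 plus a trainable rank-r update
W = W_0 + BA, \qquad
B \in \mathbb{R}^{d \times r}, \quad
A \in \mathbb{R}^{r \times k}, \quad
r \ll \min(d, k)
```

The abstract's claim concerns how a perturbation $\epsilon$ defined on the full matrix $W$ can be expressed in terms of the low-rank factors rather than perturbing $B$ and $A$ independently.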

[14] SynAdapt: Learning Adaptive Reasoning in Large Language Models via Synthetic Continuous Chain-of-Thought

Jianwei Wang,Ziming Wu,Fuming Lai,Shaobing Lian,Ziqian Zeng

Main category: cs.CL

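TL;DR: The paper proposes SynAdapt, an efficient reasoning framework that uses synthetic continuous chain-of-thought (CCoT) to teach large models adaptive reasoning, and adds a difficulty classifier to improve performance on hard questions.

Motivation: Existing CCoT methods suffer from indirect fine-tuning, limited alignment, or inconsistent targets, and cannot guide model reasoning efficiently.

Contribution: 1) Proposes synthetic CCoT as an alignment target that guides the model's learning directly; 2) introduces a difficulty classifier so that hard questions are handled adaptively.

Method: 1) Generates synthetic CCoT as the alignment target; 2) designs a difficulty classifier that uses both question context and CCoT to identify hard questions; 3) adaptively prompts the model to re-think hard questions.

Result: Across multiple benchmarks, SynAdapt achieves the best accuracy-efficiency trade-off.

Insight: Synthetic CCoT can substantially improve reasoning efficiency, and the adaptive mechanism further strengthens the model on complex questions.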

Abstract: While Chain-of-Thought (CoT) reasoning improves model performance, it incurs significant time costs due to the generation of discrete CoT tokens (DCoT). Continuous CoT (CCoT) offers a more efficient alternative, but existing CCoT methods are hampered by indirect fine-tuning, limited alignment, or inconsistent targets. To overcome these limitations, we propose SynAdapt, an innovative efficient reasoning framework. Specifically, SynAdapt generates the synthetic CCoT to serve as a precise and effective alignment target for LLMs. This synthetic CCoT explicitly guides the LLM to learn CCoT and derive accurate answers directly. Furthermore, relying solely on CCoT is insufficient for solving hard questions. To address this, SynAdapt integrates a difficulty classifier that leverages both question context and CCoT to identify hard questions. CCoT can effectively help identify hard questions after some brief reasoning. We then adaptively prompt the LLM to re-think these hard questions for improved performance. Extensive experimental results across various benchmarks from different difficulty levels strongly demonstrate the effectiveness of our method, achieving the best accuracy-efficiency trade-off.

[15] Medical Reasoning in the Era of LLMs: A Systematic Review of Enhancement Techniques and Applications

Wenxuan Wang,Zizhan Ma,Meidan Ding,Shiyi Zheng,Shengyuan Liu,Jie Liu,Jiaming Ji,Wenting Chen,Xiang Li,Linlin Shen,Yixuan Yuan

Main category: cs.CL

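TL;DR: This paper presents the first systematic review of enhancement techniques and applications for large language models (LLMs) in medical reasoning, proposes a taxonomy of training-time and test-time techniques, and analyzes their application and challenges across data modalities and clinical practice.

Motivation: Systematic, transparent, and verifiable reasoning is central to medical practice, yet current LLMs fall short in this respect, which has driven the development of LLMs designed specifically for medical reasoning.

Contribution: 1) Provides the first systematic review of LLM enhancement techniques for medical reasoning; 2) proposes a taxonomy of training-time techniques (such as fine-tuning and reinforcement learning) and test-time techniques (such as prompt engineering and multi-agent systems); 3) analyzes how these techniques are applied across data modalities and in clinical practice.

Method: 1) Builds a taxonomy of reasoning enhancement techniques from a systematic analysis of 60 studies (2022-2025); 2) assesses the techniques across data modalities such as text, image, and code, and in clinical settings such as diagnosis and education.

Result: The review identifies the limitations of current techniques, such as the faithfulness-plausibility gap, and outlines future directions, including better multimodal reasoning and stronger sociotechnical responsibility.

Insight: Progress in medical reasoning LLMs depends on multimodal capability, reasoning quality, and interpretability, and on balancing technology with social and ethical concerns.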

Abstract: The proliferation of Large Language Models (LLMs) in medicine has enabled impressive capabilities, yet a critical gap remains in their ability to perform systematic, transparent, and verifiable reasoning, a cornerstone of clinical practice. This has catalyzed a shift from single-step answer generation to the development of LLMs explicitly designed for medical reasoning. This paper provides the first systematic review of this emerging field. We propose a taxonomy of reasoning enhancement techniques, categorized into training-time strategies (e.g., supervised fine-tuning, reinforcement learning) and test-time mechanisms (e.g., prompt engineering, multi-agent systems). We analyze how these techniques are applied across different data modalities (text, image, code) and in key clinical applications such as diagnosis, education, and treatment planning. Furthermore, we survey the evolution of evaluation benchmarks from simple accuracy metrics to sophisticated assessments of reasoning quality and visual interpretability. Based on an analysis of 60 seminal studies from 2022-2025, we conclude by identifying critical challenges, including the faithfulness-plausibility gap and the need for native multimodal reasoning, and outlining future directions toward building efficient, robust, and sociotechnically responsible medical AI.

Shubham Kumar Nigam,Balaramamahanthi Deepak Patnaik,Shivam Mishra,Ajay Varghese Thomas,Noel Shallum,Kripabandhu Ghosh,Arnab Bhattacharya

Main category: cs.CL

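TL;DR: NyayaRAG is a Retrieval-Augmented Generation (RAG) framework for legal judgment prediction under the Indian common law system. By combining case facts, relevant statutes, and similar prior cases, it improves predictive accuracy and explanation quality.

Motivation: Existing legal judgment prediction methods for the Indian common law system overlook the importance of statutory provisions and precedents. NyayaRAG incorporates structured legal knowledge and simulates realistic courtroom scenarios to improve predictive accuracy and interpretability.

Contribution: Proposes NyayaRAG, a domain-specific RAG approach tailored to the Indian legal system that combines case facts, statutes, and precedents.

Method: Uses a RAG framework that retrieves relevant statutes and similar prior cases and combines them with case facts to generate predictions and explanations. Evaluation uses standard metrics and LLM-based evaluators such as G-Eval.

Result: Experiments show that adding structured legal knowledge significantly improves both the accuracy of legal judgment prediction and the quality of the generated explanations.

Insight: Under a common law system, incorporating statutes and precedents is key to better legal prediction and explanation, and RAG frameworks are well suited to this domain.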

Abstract: Legal Judgment Prediction (LJP) has emerged as a key area in AI for law, aiming to automate judicial outcome forecasting and enhance interpretability in legal reasoning. While previous approaches in the Indian context have relied on internal case content such as facts, issues, and reasoning, they often overlook a core element of common law systems, which is reliance on statutory provisions and judicial precedents. In this work, we propose NyayaRAG, a Retrieval-Augmented Generation (RAG) framework that simulates realistic courtroom scenarios by providing models with factual case descriptions, relevant legal statutes, and semantically retrieved prior cases. NyayaRAG evaluates the effectiveness of these combined inputs in predicting court decisions and generating legal explanations using a domain-specific pipeline tailored to the Indian legal system. We assess performance across various input configurations using both standard lexical and semantic metrics as well as LLM-based evaluators such as G-Eval. Our results show that augmenting factual inputs with structured legal knowledge significantly improves both predictive accuracy and explanation quality.
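
The retrieval-then-prompt flow described in the abstract can be sketched as follows: rank prior cases by cosine similarity to the query case, keep the top `k`, and assemble a prompt from facts, statutes, and retrieved cases. All names here (`retrieve`, `build_prompt`), the toy vectors, the placeholder texts, and the prompt layout are hypothetical illustrations, not NyayaRAG's actual pipeline; in practice the vectors would come from a sentence encoder.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve(query_vec, corpus, k=2):
    """Return the k corpus entries most similar to the query vector."""
    ranked = sorted(corpus, key=lambda item: cosine(query_vec, item["vec"]), reverse=True)
    return ranked[:k]

def build_prompt(facts, statutes, precedents):
    """Assemble facts, statutes, and retrieved prior cases into one prompt string."""
    parts = ["Facts:", facts, "", "Relevant statutes:"]
    parts += [f"- {s}" for s in statutes]
    parts += ["", "Similar prior cases:"]
    parts += [f"- {p['text']}" for p in precedents]
    parts += ["", "Predict the decision and explain the reasoning."]
    return "\n".join(parts)

# Placeholder corpus with toy 2-D embeddings (illustrative only).
corpus = [
    {"text": "prior case A (placeholder summary)", "vec": [1.0, 0.0]},
    {"text": "prior case B (placeholder summary)", "vec": [0.7, 0.7]},
    {"text": "prior case C (placeholder summary)", "vec": [0.0, 1.0]},
]
query_vec = [1.0, 0.1]

top_cases = retrieve(query_vec, corpus, k=2)
prompt = build_prompt(
    facts="placeholder factual description of the current case",
    statutes=["placeholder statute text"],
    precedents=top_cases,
)
print(prompt)
```

The resulting string would be passed to a language model; varying which sections are included mirrors the idea of comparing input configurations.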

[17] Dynamically Adaptive Reasoning via LLM-Guided MCTS for Efficient and Context-Aware KGQA

Yingxu Wang,Shiqi Fan,Mengzhu Wang,Siwei Liu

Main category: cs.CL

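TL;DR: DAMR is a dynamically adaptive reasoning framework based on LLM-guided MCTS that combines symbolic search with adaptive path evaluation, significantly improving the efficiency and context awareness of KGQA.

Motivation: Existing KGQA methods are limited whether they use static path extraction or dynamic path generation: the former adapts poorly and lacks contextual refinement, while the latter is computationally expensive and evaluates paths inaccurately.

Contribution: 1) Proposes the DAMR framework, which combines symbolic search with adaptive path evaluation; 2) introduces a lightweight Transformer-based scorer for context-aware path evaluation; 3) designs a dynamic pseudo-path refinement mechanism to ease the scarcity of high-quality supervision.

Method: 1) Uses MCTS as the backbone, with an LLM planner that selects relevant relations; 2) uses a Transformer-based scorer that jointly encodes the question and the relation sequence; 3) generates training signals through dynamic pseudo-path refinement.

Result: DAMR significantly outperforms existing methods on multiple KGQA benchmarks.

Insight: A dynamically adaptive reasoning framework can combine symbolic search with machine learning effectively, improving multi-hop reasoning accuracy while reducing computational cost.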

Abstract: Knowledge Graph Question Answering (KGQA) aims to interpret natural language queries and perform structured reasoning over knowledge graphs by leveraging their relational and semantic structures to retrieve accurate answers. Recent KGQA methods primarily follow either a retrieve-then-reason paradigm, relying on GNNs or heuristic rules for static path extraction, or dynamic path generation strategies that use large language models (LLMs) with prompting to jointly perform retrieval and reasoning. However, the former suffers from limited adaptability due to static path extraction and lack of contextual refinement, while the latter incurs high computational costs and struggles with accurate path evaluation due to reliance on fixed scoring functions and extensive LLM calls. To address these issues, this paper proposes Dynamically Adaptive MCTS-based Reasoning (DAMR), a novel framework that integrates symbolic search with adaptive path evaluation for efficient and context-aware KGQA. DAMR employs a Monte Carlo Tree Search (MCTS) backbone guided by an LLM-based planner, which selects the top-$k$ relevant relations at each step to reduce the search space. To improve path evaluation accuracy, we introduce a lightweight Transformer-based scorer that performs context-aware plausibility estimation by jointly encoding the question and relation sequence through cross-attention, enabling the model to capture fine-grained semantic shifts during multi-hop reasoning. Furthermore, to alleviate the scarcity of high-quality supervision, DAMR incorporates a dynamic pseudo-path refinement mechanism that periodically generates training signals from partial paths explored during search, allowing the scorer to continuously adapt to the evolving distribution of reasoning trajectories. Extensive experiments on multiple KGQA benchmarks show that DAMR significantly outperforms state-of-the-art methods.
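
As a rough illustration of the search step described in this entry, the sketch below pairs a top-k relation filter with the standard UCT selection rule used in many MCTS implementations. The planner scores, the helper names `top_k_relations` and `uct_select`, and the exploration constant are assumptions made for illustration; the abstract does not give DAMR's actual selection formula or scorer.

```python
import math

def top_k_relations(scores, k):
    """Keep the k relations with the highest planner score (hypothetical scores)."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [rel for rel, _ in ranked[:k]]

def uct_select(children, total_visits, c=1.4):
    """Pick the child that maximizes the standard UCT score.

    children maps relation -> (visit_count, cumulative_value).
    Unvisited children are returned immediately so they get explored first.
    """
    best, best_score = None, -math.inf
    for rel, (visits, value) in children.items():
        if visits == 0:
            return rel
        score = value / visits + c * math.sqrt(math.log(total_visits) / visits)
        if score > best_score:
            best, best_score = rel, score
    return best

# Hypothetical planner output for one expansion step.
planner_scores = {"born_in": 0.9, "spouse": 0.2, "capital_of": 0.7, "genre": 0.1}
candidates = top_k_relations(planner_scores, k=2)

# Hypothetical visit statistics for the surviving candidates.
stats = {"born_in": (10, 6.0), "capital_of": (2, 1.5)}
choice = uct_select({r: stats[r] for r in candidates}, total_visits=12)
```

Pruning to the top-k relations before selection is what keeps the branching factor small; the UCT term then trades off the average value of a relation against how rarely it has been tried.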

[18] Out-of-Context Abduction: LLMs Make Inferences About Procedural Data Leveraging Declarative Facts in Earlier Training Data

Sohaib Imran,Rob Lamb,Peter M. Atkinson

Main category: cs.CL

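TL;DR: The study shows that GPT-4o can reason out of context from information in its training data (names and behavior descriptions of fictitious chatbots), inferring a chatbot's name from previously unseen dialogue and imitating its behavior. This has implications for situational awareness in large language models and for AI safety.

Details Motivation: Investigates whether large language models (LLMs) can perform out-of-context abduction, that is, infer the most plausible explanation for an observation using relevant facts present in their training data.

Contribution: Proposes and validates that LLMs such as GPT-4o can use declarative facts in their training data to infer the name of a specific chatbot and imitate its behavior in tasks they have not seen.

Method: Designs experiments in which LLMs are trained on the names and behavior descriptions of fictitious chatbots but not on examples of dialogue with them. The models are then tested on whether they can infer a chatbot's name from responses characteristic of it and imitate its behavior.

Result: GPT-4o correctly infers at least one chatbot's name after observing responses characteristic of that chatbot. Prior training on a chatbot's behavior descriptions also lets it display more characteristic behavior when iteratively trained to do so.

Insight: LLMs can reason with out-of-context information from their training data, which matters for research on situational awareness and AI safety.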

Abstract: Large language models (LLMs) are trained on large corpora, yet it is unclear whether they can reason about the information present within their training data. We design experiments to study out-of-context abduction in LLMs, the ability to infer the most plausible explanations for observations using relevant facts present in training data. We train treatment LLMs on names and behavior descriptions of fictitious chatbots, but not on examples of dialogue with the chatbots. We find that OpenAI’s GPT 4o LLM can correctly infer at least one chatbot’s name after observing example responses characteristic of that chatbot. We also find that previously training GPT 4o on descriptions of a chatbot’s behavior allows it to display behaviors more characteristic of the chatbot when iteratively trained to display such behaviors. Our results have implications for situational awareness in LLMs and, therefore, for AI safety.

[19] Agentic large language models improve retrieval-based radiology question answering

Sebastian Wind,Jeta Sopa,Daniel Truhn,Mahshad Lotfinia,Tri-Thien Nguyen,Keno Bressem,Lisa Adams,Mirabela Rusu,Harald Köstler,Gerhard Wellein,Andreas Maier,Soroosh Tayebi Arasteh

Main category: cs.CL

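TL;DR: The paper proposes an agentic large language model (LLM) retrieval framework to improve the accuracy and factuality of radiology question answering (QA). By iteratively retrieving and dynamically synthesizing evidence, the framework markedly improves the performance of mid-sized and small-scale models.

Details Motivation: Traditional retrieval-augmented generation (RAG) systems for radiology QA are limited to single-step retrieval and cannot adequately support complex clinical reasoning. A more flexible approach is needed to improve retrieval and answer quality.

Contribution: 1. Proposes an agentic RAG framework that lets LLMs autonomously decompose radiology questions, iteratively retrieve clinical evidence, and dynamically synthesize it. 2. Evaluates 24 LLMs of different scales and training paradigms, demonstrating the method's effectiveness at improving diagnostic accuracy and reducing hallucinations.

Method: 1. Uses an LLM agent to decompose radiology questions. 2. Iteratively retrieves relevant clinical evidence from Radiopaedia. 3. Dynamically synthesizes evidence-based answers. 4. Evaluates performance on 104 expert-curated radiology questions.

Result: Agentic RAG significantly improved mean diagnostic accuracy over zero-shot prompting (73% vs. 64%) and conventional online RAG (73% vs. 68%), with the largest gains in mid-sized and small-scale models (e.g., Mistral Large improved from 72% to 81%). It also reduced hallucinations (mean 9.4%) and retrieved clinically relevant context in 46% of cases.

Insight: 1. Agentic retrieval helps mid-sized and smaller LLMs the most. 2. Clinical fine-tuning and retrieval augmentation are complementary. 3. The approach offers a new direction for improving factuality and accuracy in clinical QA.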

Abstract: Clinical decision-making in radiology increasingly benefits from artificial intelligence (AI), particularly through large language models (LLMs). However, traditional retrieval-augmented generation (RAG) systems for radiology question answering (QA) typically rely on single-step retrieval, limiting their ability to handle complex clinical reasoning tasks. Here we propose an agentic RAG framework enabling LLMs to autonomously decompose radiology questions, iteratively retrieve targeted clinical evidence from Radiopaedia, and dynamically synthesize evidence-based responses. We evaluated 24 LLMs spanning diverse architectures, parameter scales (0.5B to >670B), and training paradigms (general-purpose, reasoning-optimized, clinically fine-tuned), using 104 expert-curated radiology questions from previously established RSNA-RadioQA and ExtendedQA datasets. Agentic retrieval significantly improved mean diagnostic accuracy over zero-shot prompting (73% vs. 64%; P<0.001) and conventional online RAG (73% vs. 68%; P<0.001). The greatest gains occurred in mid-sized models (e.g., Mistral Large improved from 72% to 81%) and small-scale models (e.g., Qwen 2.5-7B improved from 55% to 71%), while very large models (>200B parameters) demonstrated minimal changes (<2% improvement). Additionally, agentic retrieval reduced hallucinations (mean 9.4%) and retrieved clinically relevant context in 46% of cases, substantially aiding factual grounding. Even clinically fine-tuned models exhibited meaningful improvements (e.g., MedGemma-27B improved from 71% to 81%), indicating complementary roles of retrieval and fine-tuning. These results highlight the potential of agentic frameworks to enhance factuality and diagnostic accuracy in radiology QA, particularly among mid-sized LLMs, warranting future studies to validate their clinical utility.
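
The decompose, retrieve, and synthesize steps described in this entry can be sketched as a short control loop. This is a minimal illustration under stated assumptions, not the authors' pipeline: `decompose`, `retrieve`, and `synthesize` are hypothetical stand-ins (a string split and a keyword match) for what the paper does with LLM calls and Radiopaedia search, and the toy corpus is invented.

```python
def tokens(text):
    """Lowercased words longer than four characters, punctuation stripped."""
    return {w.strip(".,?").lower() for w in text.split() if len(w.strip(".,?")) > 4}

def decompose(question):
    """Stand-in for the LLM step that splits a question into sub-queries."""
    return [part.strip() for part in question.split(" and ") if part.strip()]

def retrieve(query, corpus):
    """Toy keyword retriever: passages sharing at least one token with the query."""
    return [p for p in corpus if tokens(query) & tokens(p)]

def synthesize(question, evidence):
    """Stand-in for the LLM step that writes an evidence-grounded answer."""
    return {"question": question, "evidence": evidence}

def agentic_rag(question, corpus, max_steps=4):
    """Retrieve once per sub-query and accumulate de-duplicated evidence."""
    evidence = []
    for sub_query in decompose(question)[:max_steps]:
        for passage in retrieve(sub_query, corpus):
            if passage not in evidence:
                evidence.append(passage)
    return synthesize(question, evidence)

corpus = [
    "Pneumothorax shows absent lung markings.",
    "Osteosarcoma shows a sunburst periosteal reaction.",
    "Gallstones are often echogenic on ultrasound.",
]
result = agentic_rag("pneumothorax findings and osteosarcoma findings", corpus)
```

The contrast with single-step RAG is that each sub-query gets its own retrieval call, so evidence for the second half of a compound question is not crowded out by the first.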

[20] MMBERT: Scaled Mixture-of-Experts Multimodal BERT for Robust Chinese Hate Speech Detection under Cloaking Perturbations

Qiyao Xue,Yuchen Dou,Ryan Shi,Xiang Lorraine Li,Wei Gao

Main category: cs.CL

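TL;DR: MMBERT is a BERT-based multimodal framework that integrates textual, speech, and visual modalities. Using a Mixture-of-Experts (MoE) architecture and a progressive three-stage training scheme, it improves the robustness of Chinese hate speech detection against adversarial perturbations.

Details Motivation: Hate speech detection on Chinese social networks poses distinct challenges, especially because of cloaking techniques designed to evade detection. Most existing work focuses on English data and pays little attention to multimodal strategies for Chinese.

Contribution: Proposes the MMBERT framework, which combines an MoE architecture with multimodal data, and develops a progressive three-stage training paradigm that significantly improves detection performance.

Method: Uses an MoE architecture to integrate textual, speech, and visual modalities, with modality-specific experts, a shared self-attention mechanism, and a router-based expert allocation strategy. Three-stage training improves stability.

Result: On several Chinese hate speech datasets, MMBERT significantly surpasses fine-tuned BERT-based encoder models, fine-tuned LLMs, and LLMs using in-context learning.

Insight: Combining multimodal data with an MoE architecture effectively improves robustness to adversarial perturbations, and progressive training is key to stably integrating MoE into BERT-based models.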

Abstract: Hate speech detection on Chinese social networks presents distinct challenges, particularly due to the widespread use of cloaking techniques designed to evade conventional text-based detection systems. Although large language models (LLMs) have recently improved hate speech detection capabilities, the majority of existing work has concentrated on English datasets, with limited attention given to multimodal strategies in the Chinese context. In this study, we propose MMBERT, a novel BERT-based multimodal framework that integrates textual, speech, and visual modalities through a Mixture-of-Experts (MoE) architecture. To address the instability associated with directly integrating MoE into BERT-based models, we develop a progressive three-stage training paradigm. MMBERT incorporates modality-specific experts, a shared self-attention mechanism, and a router-based expert allocation strategy to enhance robustness against adversarial perturbations. Empirical results in several Chinese hate speech datasets show that MMBERT significantly surpasses fine-tuned BERT-based encoder models, fine-tuned LLMs, and LLMs utilizing in-context learning approaches.

[21] Do They Understand Them? An Updated Evaluation on Nonbinary Pronoun Handling in Large Language Models

Xushuo Tang,Yi Ding,Zhengyi Yang,Yin Chen,Yongrui Gu,Wenke Yang,Mingchen Ju,Xin Cao,Yongfei Liu,Wenjie Zhang

Main category: cs.CL

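TL;DR: MISGENDERED+ evaluates how recent LLMs (GPT-4o, Claude 4, and others) handle nonbinary pronouns. Accuracy on binary and gender-neutral pronouns has improved, but handling of neopronouns remains inconsistent.

Details Motivation: LLMs are widely deployed in sensitive contexts where fairness and inclusivity are critical. Earlier work such as MISGENDERED was limited to outdated models and narrow evaluations, so an updated benchmark is needed.

Contribution: Introduces MISGENDERED+, an updated benchmark that broadens the analysis of LLMs' pronoun handling to cover zero-shot, few-shot, and gender identity inference tasks.

Method: Benchmarks five representative LLMs (GPT-4o, Claude 4, DeepSeek-V3, Qwen Turbo, and Qwen2.5) on the new benchmark, measuring accuracy on binary, gender-neutral, and neopronouns.

Result: Recent LLMs improve on binary and gender-neutral pronouns, but accuracy on neopronouns and on reverse inference tasks remains clearly inadequate.

Insight: The study exposes persistent gaps in identity-sensitive reasoning in LLMs and points to directions for future inclusive AI research.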

Abstract: Large language models (LLMs) are increasingly deployed in sensitive contexts where fairness and inclusivity are critical. Pronoun usage, especially concerning gender-neutral and neopronouns, remains a key challenge for responsible AI. Prior work, such as the MISGENDERED benchmark, revealed significant limitations in earlier LLMs’ handling of inclusive pronouns, but was constrained to outdated models and limited evaluations. In this study, we introduce MISGENDERED+, an extended and updated benchmark for evaluating LLMs’ pronoun fidelity. We benchmark five representative LLMs, GPT-4o, Claude 4, DeepSeek-V3, Qwen Turbo, and Qwen2.5, across zero-shot, few-shot, and gender identity inference. Our results show notable improvements compared with previous studies, especially in binary and gender-neutral pronoun accuracy. However, accuracy on neopronouns and reverse inference tasks remains inconsistent, underscoring persistent gaps in identity-sensitive reasoning. We discuss implications, model-specific observations, and avenues for future inclusive AI research.

cs.CV [Back]

[22] A Quality-Guided Mixture of Score-Fusion Experts Framework for Human Recognition

Jie Zhu,Yiyang Su,Minchul Kim,Anil Jain,Xiaoming Liu

Main category: cs.CV

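TL;DR: The paper proposes Quality-guided Mixture of score-fusion Experts (QME), a framework for improving whole-body biometric recognition. By introducing a pseudo-quality loss and a score triplet loss, it addresses the fact that conventional score fusion ignores differences in score distributions across modalities, and it achieves state-of-the-art results on multiple datasets.

Details Motivation: Conventional whole-body biometric recognition typically fuses modalities with simple strategies such as weighted score averaging, which ignore differences in each modality's score distribution and therefore yield limited gains. A more flexible fusion framework that adapts to variations in modality scores is needed.

Contribution: 1. Proposes the QME framework, which performs score fusion with a learnable Mixture of Experts (MoE). 2. Designs a pseudo-quality loss for quality estimation and a score triplet loss for improving metric performance. 3. Validates the method on multiple datasets, where it outperforms baseline methods.

Method: 1. Uses an MoE for score fusion, with each expert handling scores from a specific modality. 2. Introduces a modality-specific Quality Estimator (QE) trained with the pseudo-quality loss to assess data quality. 3. Uses the score triplet loss to optimize metric learning.

Result: Experiments on multiple whole-body biometric datasets show that QME significantly improves recognition performance and outperforms baselines across a range of metrics.

Insight: 1. Differences in score distributions across modalities are a key factor in fusion performance. 2. Combining quality estimation with a triplet loss improves the robustness of multimodal recognition. 3. QME is general and applies to both multimodal and multi-model settings.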

Abstract: Whole-body biometric recognition is a challenging multimodal task that integrates various biometric modalities, including face, gait, and body. This integration is essential for overcoming the limitations of unimodal systems. Traditionally, whole-body recognition involves deploying different models to process multiple modalities, achieving the final outcome by score-fusion (e.g., weighted averaging of similarity matrices from each model). However, these conventional methods may overlook the variations in score distributions of individual modalities, making it challenging to improve final performance. In this work, we present Quality-guided Mixture of score-fusion Experts (QME), a novel framework designed for improving whole-body biometric recognition performance through a learnable score-fusion strategy using a Mixture of Experts (MoE). We introduce a novel pseudo-quality loss for quality estimation with a modality-specific Quality Estimator (QE), and a score triplet loss to improve the metric performance. Extensive experiments on multiple whole-body biometric datasets demonstrate the effectiveness of our proposed approach, achieving state-of-the-art results across various metrics compared to baseline methods. Our method is effective in both multimodal and multi-model settings, addressing key challenges such as model misalignment in the similarity score domain and variability in data quality.
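
To make the contrast in this entry concrete, the sketch below compares the conventional baseline (a fixed weighted average of per-modality similarity scores) with weights driven by per-sample quality estimates. It is not the QME model, which learns its fusion with a Mixture of Experts; the function name `fuse`, the scores, and the quality values are all hypothetical.

```python
def fuse(scores, weights):
    """Weighted average of per-modality similarity scores.

    scores and weights are dicts keyed by modality name; weights are
    normalized here so they need not sum to one.
    """
    total = sum(weights[m] for m in scores)
    return sum(weights[m] * s for m, s in scores.items()) / total

# Similarity of one probe to one gallery identity, per modality (made-up values).
scores = {"face": 0.2, "gait": 0.8, "body": 0.6}

# Baseline: the same fixed weights for every sample.
fixed = fuse(scores, {"face": 1.0, "gait": 1.0, "body": 1.0})

# Quality-guided: suppose the face crop is blurry, so its estimated quality is low.
guided = fuse(scores, {"face": 0.1, "gait": 0.9, "body": 0.5})
```

With fixed weights the unreliable face score drags the fused score down; letting the weights depend on estimated quality reduces its influence for this particular sample only.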

[23] Punching Bag vs. Punching Person: Motion Transferability in Videos

Raiyaan Abdullah,Jared Claypoole,Michael Cogswell,Ajay Divakaran,Yogesh Rawat

Main category: cs.CV

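TL;DR: The paper examines whether action recognition models can transfer high-level motion concepts. It introduces a motion transferability framework with three datasets and shows where models struggle to recognize actions in novel contexts.

Details Motivation: To find out whether action recognition models can effectively transfer high-level motion concepts across diverse contexts, especially when faced with unseen variations of an action (e.g., "punching person" vs. "punching bag").

Contribution: 1) Proposes a motion transferability framework and three datasets for evaluating models; 2) Shows why performance drops in novel contexts; 3) Analyzes how model size and multimodality affect transferability.

Method: 1) Builds the synthetic dataset Syn-TA and two datasets adapted from natural videos (Kinetics400-TA and Something-Something-v2-TA); 2) Evaluates 13 state-of-the-art models in novel contexts; 3) Analyzes how performance varies with temporal reasoning demands and with reliance on object and background cues.

Result: 1) Multimodal models struggle more with fine-grained unknown actions than with coarse ones; 2) The bias-free synthetic Syn-TA is as challenging as the real-world datasets; 3) Larger models improve transferability when spatial cues dominate but struggle with intensive temporal reasoning, and reliance on object and background cues hinders generalization.

Insight: Disentangling coarse and fine motions can improve recognition, particularly on temporally challenging datasets. The study provides an important benchmark for motion transferability in action recognition.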

Abstract: Action recognition models demonstrate strong generalization, but can they effectively transfer high-level motion concepts across diverse contexts, even within similar distributions? For example, can a model recognize the broad action “punching” when presented with an unseen variation such as “punching person”? To explore this, we introduce a motion transferability framework with three datasets: (1) Syn-TA, a synthetic dataset with 3D object motions; (2) Kinetics400-TA; and (3) Something-Something-v2-TA, both adapted from natural video datasets. We evaluate 13 state-of-the-art models on these benchmarks and observe a significant drop in performance when recognizing high-level actions in novel contexts. Our analysis reveals: 1) Multimodal models struggle more with fine-grained unknown actions than with coarse ones; 2) The bias-free Syn-TA proves as challenging as real-world datasets, with models showing greater performance drops in controlled settings; 3) Larger models improve transferability when spatial cues dominate but struggle with intensive temporal reasoning, while reliance on object and background cues hinders generalization. We further explore how disentangling coarse and fine motions can improve recognition in temporally challenging datasets. We believe this study establishes a crucial benchmark for assessing motion transferability in action recognition. Datasets and relevant code: https://github.com/raiyaan-abdullah/Motion-Transfer.

[24] The Monado SLAM Dataset for Egocentric Visual-Inertial Tracking

Mateo de Mayo,Daniel Cremers,Taihú Pire

Main category: cs.CV

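TL;DR: The Monado SLAM dataset targets the hard cases of visual-inertial tracking with head-mounted sensors. It fills gaps in existing datasets, such as high-intensity motion and dynamic occlusions, to advance VIO/SLAM research.

Details Motivation: Head-mounted sensors face many challenges in visual-inertial tracking (high-intensity motion, dynamic occlusions, low-textured areas, and more). Existing datasets cover these scenarios poorly, which limits algorithm development and evaluation.

Contribution: Presents the Monado SLAM dataset, which covers a range of challenging scenarios from real head-mounted use, fills gaps left by existing datasets, and is released under the permissive CC BY 4.0 license.

Method: Captures real sequences from multiple virtual reality headsets, covering difficult conditions such as dynamic occlusions, high-intensity motion, and sensor saturation.

Result: The dataset gives the research community real-world data for evaluating and improving VIO/SLAM algorithms.

Insight: Real-world use of head-mounted sensors is complex and varied. Existing datasets and algorithms fall short in these scenarios, so more data aimed at real problems is needed.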

Abstract: Humanoid robots and mixed reality headsets benefit from the use of head-mounted sensors for tracking. While advancements in visual-inertial odometry (VIO) and simultaneous localization and mapping (SLAM) have produced new and high-quality state-of-the-art tracking systems, we show that these are still unable to gracefully handle many of the challenging settings presented in the head-mounted use cases. Common scenarios like high-intensity motions, dynamic occlusions, long tracking sessions, low-textured areas, adverse lighting conditions, saturation of sensors, to name a few, continue to be covered poorly by existing datasets in the literature. In this way, systems may inadvertently overlook these essential real-world issues. To address this, we present the Monado SLAM dataset, a set of real sequences taken from multiple virtual reality headsets. We release the dataset under a permissive CC BY 4.0 license, to drive advancements in VIO/SLAM research and development.

[25] World Consistency Score: A Unified Metric for Video Generation Quality

Akshat Rakheja,Aarsh Ashdhir,Aryan Bhattacharjee,Vanshika Sharma

Main category: cs.CV

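TL;DR: The paper proposes World Consistency Score (WCS), a new evaluation metric for video generation. It integrates four submetrics (object permanence, relation stability, causal compliance, and flicker penalty) that measure a video's temporal and physical coherence and combines them into a single score through a learned weighted formula.

Details Motivation: Existing video generation metrics focus on visual fidelity or prompt alignment and neglect the internal world consistency of a video. WCS aims to fill this gap with a more comprehensive evaluation framework.

Contribution: 1. Proposes WCS, a unified and interpretable metric for video generation; 2. Formalizes the four submetrics and how they are computed; 3. Describes how the combination weights are learned from human preference data so that WCS aligns with human judgments.

Method: 1. Defines four submetrics (object permanence, relation stability, causal compliance, flicker penalty); 2. Computes them with open-source tools (trackers, action recognizers, CLIP embeddings, optical flow); 3. Trains weights on human preference data and combines the submetrics into the WCS score.

Result: The abstract reports no empirical results. It outlines a validation blueprint: testing WCS's correlation with human evaluations on benchmarks such as VBench-2.0, EvalCrafter, and LOVE, running sensitivity analyses, and comparing against established metrics (FVD, CLIPScore, VBench, FVMD).

Insight: WCS offers a more comprehensive and interpretable tool for evaluating video generation models, aimed at how well they maintain a coherent "world" over time.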

Abstract: We introduce World Consistency Score (WCS), a novel unified evaluation metric for generative video models that emphasizes internal world consistency of the generated videos. WCS integrates four interpretable sub-components - object permanence, relation stability, causal compliance, and flicker penalty - each measuring a distinct aspect of temporal and physical coherence in a video. These submetrics are combined via a learned weighted formula to produce a single consistency score that aligns with human judgments. We detail the motivation for WCS in the context of existing video evaluation metrics, formalize each submetric and how it is computed with open-source tools (trackers, action recognizers, CLIP embeddings, optical flow), and describe how the weights of the WCS combination are trained using human preference data. We also outline an experimental validation blueprint: using benchmarks like VBench-2.0, EvalCrafter, and LOVE to test WCS’s correlation with human evaluations, performing sensitivity analyses, and comparing WCS against established metrics (FVD, CLIPScore, VBench, FVMD). The proposed WCS offers a comprehensive and interpretable framework for evaluating video generation models on their ability to maintain a coherent “world” over time, addressing gaps left by prior metrics focused only on visual fidelity or prompt alignment.
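
The "learned weighted formula" in this entry can be pictured as a weighted sum of the three positive submetrics minus a weighted flicker penalty. The sketch below is one plausible reading, with assumptions stated plainly: the linear form, the sign convention, the function name, and the example weights are all hypothetical, and in the paper the weights are trained on human preference data rather than set by hand.

```python
def world_consistency_score(sub, weights):
    """Combine four submetrics into one score (assumed linear form).

    sub and weights are dicts with keys object_permanence, relation_stability,
    causal_compliance, and flicker_penalty. The first three reward consistency;
    the flicker penalty is subtracted.
    """
    positive = ("object_permanence", "relation_stability", "causal_compliance")
    score = sum(weights[k] * sub[k] for k in positive)
    return score - weights["flicker_penalty"] * sub["flicker_penalty"]

# Hand-picked weights standing in for learned ones.
weights = {
    "object_permanence": 0.3,
    "relation_stability": 0.3,
    "causal_compliance": 0.3,
    "flicker_penalty": 0.1,
}
# Submetric values for one generated video (made-up values in [0, 1]).
video = {
    "object_permanence": 0.9,
    "relation_stability": 0.8,
    "causal_compliance": 0.7,
    "flicker_penalty": 0.2,
}
wcs = world_consistency_score(video, weights)
```

Keeping the submetrics separate until the final sum is what makes the score interpretable: a low WCS can be traced back to the component that caused it.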

[26] GeoExplorer: Active Geo-localization with Curiosity-Driven Exploration

Li Mi,Manon Bechaz,Zeming Chen,Antoine Bosselut,Devis Tuia

Main category: cs.CV

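TL;DR: GeoExplorer is an Active Geo-localization (AGL) agent built on curiosity-driven exploration. Intrinsic rewards make its exploration more robust and improve generalization, particularly for unseen targets and environments.

Details Motivation: Existing AGL methods rely on distance-based rewards and perform poorly when distance estimation is difficult or when targets and environments are unseen. GeoExplorer addresses this by introducing curiosity-driven exploration.

Contribution: Proposes a curiosity-driven intrinsic reward mechanism that makes exploration more diverse and robust and improves AGL performance in unfamiliar settings.

Method: GeoExplorer explores using a goal-agnostic, curiosity-driven reward grounded in environment modeling and is trained within a reinforcement learning framework.

Result: Experiments on four AGL benchmarks demonstrate GeoExplorer's effectiveness and generalization ability, especially for unfamiliar targets and environments.

Insight: Curiosity-driven exploration reduces dependence on extrinsic rewards such as distance, which improves learning efficiency and adaptability to complex environments.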

Abstract: Active Geo-localization (AGL) is the task of localizing a goal, represented in various modalities (e.g., aerial images, ground-level images, or text), within a predefined search area. Current methods approach AGL as a goal-reaching reinforcement learning (RL) problem with a distance-based reward. They localize the goal by implicitly learning to minimize the relative distance from it. However, when distance estimation becomes challenging or when encountering unseen targets and environments, the agent exhibits reduced robustness and generalization ability due to the less reliable exploration strategy learned during training. In this paper, we propose GeoExplorer, an AGL agent that incorporates curiosity-driven exploration through intrinsic rewards. Unlike distance-based rewards, our curiosity-driven reward is goal-agnostic, enabling robust, diverse, and contextually relevant exploration based on effective environment modeling. These capabilities have been proven through extensive experiments across four AGL benchmarks, demonstrating the effectiveness and generalization ability of GeoExplorer in diverse settings, particularly in localizing unfamiliar targets and environments.
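
One common way to implement a goal-agnostic curiosity reward is to use the prediction error of a forward model: the agent is rewarded where its model of the environment is still wrong. The sketch below shows that generic formulation. It is an assumption made for illustration, since the abstract does not give GeoExplorer's exact reward; the function name and feature vectors are hypothetical.

```python
def intrinsic_reward(predicted_next, observed_next, scale=1.0):
    """Curiosity bonus: mean squared error between the forward model's
    predicted next-state features and the features actually observed.

    No goal location appears in the computation, so the reward is goal-agnostic.
    """
    if len(predicted_next) != len(observed_next):
        raise ValueError("feature vectors must have the same length")
    err = sum((p - o) ** 2 for p, o in zip(predicted_next, observed_next))
    return scale * err / len(observed_next)

# A well-modeled transition earns almost no bonus.
familiar = intrinsic_reward([0.5, 0.5], [0.5, 0.6])

# A surprising transition earns a larger bonus.
novel = intrinsic_reward([0.5, 0.5], [1.0, 0.0])
```

A distance-based reward needs a usable estimate of how far the goal is; a bonus of this kind depends only on what the agent has already learned about the environment.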

[27] On the Risk of Misleading Reports: Diagnosing Textual Biases in Multimodal Clinical AI

David Restrepo,Ira Ktena,Maria Vakalopoulou,Stergios Christodoulidis,Enzo Ferrante

Main category: cs.CV

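TL;DR: The paper proposes Selective Modality Shifting (SMS), a method for quantifying how much multimodal clinical AI models rely on text versus images. It finds that models over-rely on text and neglect visual information in classification tasks.

Details Motivation: Clinical decision-making relies on the integrated analysis of medical images and their reports. Vision-Language Models (VLMs) can show a strong bias toward text on multimodal tasks and overlook critical visual cues, a problem that matters especially in clinical AI.

Contribution: Introduces SMS to quantify a model's reliance on the text and image modalities, and shows experimentally that a range of VLMs over-rely on text in medical tasks.

Method: Applies swap perturbations (SMS) that exchange images or text between samples with opposing labels, measures changes in performance and calibration before and after perturbation, and adds a qualitative attention-based analysis.

Result: Models depend markedly on text in medical tasks, even when complementary visual information is present. The attention analysis further confirms this.

Insight: Multimodal medical models should be designed and evaluated so that they genuinely integrate visual and textual information rather than over-relying on a single modality.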

Abstract: Clinical decision-making relies on the integrated analysis of medical images and the associated clinical reports. While Vision-Language Models (VLMs) can offer a unified framework for such tasks, they can exhibit strong biases toward one modality, frequently overlooking critical visual cues in favor of textual information. In this work, we introduce Selective Modality Shifting (SMS), a perturbation-based approach to quantify a model’s reliance on each modality in binary classification tasks. By systematically swapping images or text between samples with opposing labels, we expose modality-specific biases. We assess six open-source VLMs (four generalist models and two fine-tuned for medical data) on two medical imaging datasets with distinct modalities: MIMIC-CXR (chest X-ray) and FairVLMed (scanning laser ophthalmoscopy). By assessing model performance and the calibration of every model in both unperturbed and perturbed settings, we reveal a marked dependency on text input, which persists despite the presence of complementary visual information. We also perform a qualitative attention-based analysis which further confirms that image content is often overshadowed by text details. Our findings highlight the importance of designing and evaluating multimodal medical models that genuinely integrate visual and textual cues, rather than relying on single-modality signals.
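
The swap perturbation described in this entry can be sketched in a few lines: each sample receives one modality from a sample with the opposite label, and the change in accuracy shows which modality the model actually uses. The helper names, the toy dataset, and `text_only_model` (a deliberately text-biased stand-in for a VLM) are hypothetical; the paper's donor-selection procedure and calibration analysis are not reproduced here.

```python
def swap_modality(samples, modality):
    """Replace `modality` in each sample with that of an opposite-label sample."""
    pos = [s for s in samples if s["label"] == 1]
    neg = [s for s in samples if s["label"] == 0]
    swapped = []
    for i, s in enumerate(samples):
        donors = neg if s["label"] == 1 else pos
        donor = donors[i % len(donors)]
        swapped.append({**s, modality: donor[modality]})
    return swapped

def accuracy(model, samples):
    """Fraction of samples whose prediction matches the original label."""
    hits = sum(model(s["image"], s["text"]) == s["label"] for s in samples)
    return hits / len(samples)

def text_only_model(image, text):
    """Toy text-biased classifier: ignores the image entirely."""
    return 1 if "effusion" in text else 0

samples = [
    {"image": "img_pos_0", "text": "effusion present", "label": 1},
    {"image": "img_neg_0", "text": "lungs clear", "label": 0},
    {"image": "img_pos_1", "text": "large effusion", "label": 1},
    {"image": "img_neg_1", "text": "clear fields", "label": 0},
]

base = accuracy(text_only_model, samples)
after_text_swap = accuracy(text_only_model, swap_modality(samples, "text"))
after_image_swap = accuracy(text_only_model, swap_modality(samples, "image"))
```

For this toy model, swapping text collapses accuracy while swapping images changes nothing, which is the signature of a model that relies on text alone.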

[28] SAM-PTx: Text-Guided Fine-Tuning of SAM with Parameter-Efficient, Parallel-Text Adapters

Shayan Jalilian,Abdul Bais

Main category: cs.CV

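TL;DR: SAM-PTx is a parameter-efficient, text-prompted fine-tuning method. A lightweight Parallel-Text adapter injects frozen CLIP text embeddings into SAM's image encoder, enabling semantics-guided segmentation.

Details Motivation: SAM performs well with spatial prompts such as points and boxes, but the potential of semantic text prompts remains underexplored. The paper explores using text prompts to strengthen SAM's segmentation.

Contribution: 1. Proposes the parameter-efficient Parallel-Text adapter, which modifies only the MLP-parallel branch of each transformer block in SAM and preserves the attention pathway for spatial reasoning; 2. To the authors' knowledge, is the first work to use text prompts for segmentation on the COD10K dataset.

Method: Uses frozen CLIP text embeddings as class-level semantic guidance and injects them through a lightweight adapter into the MLP-parallel branch of SAM's image encoder, keeping most of the original architecture frozen.

Result: On COD10K and on low-data subsets of COCO and ADE20K, adding text prompts improves segmentation performance over purely spatial prompt baselines.

Insight: Integrating semantic conditioning into SAM's architecture allows efficient adaptation at low computational cost and offers a scalable path forward.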

Abstract: The Segment Anything Model (SAM) has demonstrated impressive generalization in prompt-based segmentation. Yet, the potential of semantic text prompts remains underexplored compared to traditional spatial prompts like points and boxes. This paper introduces SAM-PTx, a parameter-efficient approach for adapting SAM using frozen CLIP-derived text embeddings as class-level semantic guidance. Specifically, we propose a lightweight adapter design called Parallel-Text that injects text embeddings into SAM’s image encoder, enabling semantics-guided segmentation while keeping most of the original architecture frozen. Our adapter modifies only the MLP-parallel branch of each transformer block, preserving the attention pathway for spatial reasoning. Through supervised experiments and ablations on the COD10K dataset as well as low-data subsets of COCO and ADE20K, we show that incorporating fixed text embeddings as input improves segmentation performance over purely spatial prompt baselines. To our knowledge, this is the first work to use text prompts for segmentation on the COD10K dataset. These results suggest that integrating semantic conditioning into SAM’s architecture offers a practical and scalable path for efficient adaptation with minimal computational complexity.

[29] Object-Centric Cropping for Visual Few-Shot Classification

Aymane Abdali,Bartosz Boguslawski,Lucas Drumetz,Vincent Gripon

Main category: cs.CV

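TL;DR: To improve few-shot classification, the paper proposes object-centric cropping. Using information about an object's local position in the image markedly improves classification, and much of the gain can be obtained with the Segment Anything Model or with unsupervised foreground object extraction.

Details Motivation: In few-shot image classification, multiple objects or complex backgrounds degrade performance, so a method is needed to isolate the object's key region and improve accuracy.

Contribution: Introduces object-centric cropping, which uses local object position information to markedly improve few-shot classification, and proposes extracting the object region with the Segment Anything Model or with unsupervised methods.

Method: Crops are guided by local object position, obtained either from the Segment Anything Model (which needs only one pixel of the object to be pointed out) or from unsupervised foreground object extraction, so that the object region is emphasized and background interference is reduced.

Result: On established benchmarks, the method markedly improves few-shot classification, underlining the importance of extracting the object region.

Insight: Even in few-shot classification, simply extracting the object region and reducing background interference yields clear gains, and semi-supervised or unsupervised extraction further reduces annotation requirements.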

Abstract: In the domain of Few-Shot Image Classification, operating with as little as one example per class, the presence of image ambiguities stemming from multiple objects or complex backgrounds can significantly deteriorate performance. Our research demonstrates that incorporating additional information about the local positioning of an object within its image markedly enhances classification across established benchmarks. More importantly, we show that a significant fraction of the improvement can be achieved through the use of the Segment Anything Model, requiring only a pixel of the object of interest to be pointed out, or by employing fully unsupervised foreground object extraction methods.
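
Once a foreground mask is available (for example from a point-prompted segmentation model), the cropping step itself is simple: take the bounding box of the mask, optionally pad it, and slice the image. The sketch below shows that step on nested lists. The helper names and the `margin` parameter are hypothetical, and how the paper feeds crops to the classifier is not covered by the abstract.

```python
def mask_bbox(mask, margin=0):
    """Bounding box (top, left, bottom, right) of nonzero mask pixels.

    bottom and right are exclusive. Returns None for an empty mask.
    The box is padded by `margin` pixels and clipped to the image bounds.
    """
    rows = [r for r, row in enumerate(mask) if any(row)]
    if not rows:
        return None
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    top = max(rows[0] - margin, 0)
    left = max(cols[0] - margin, 0)
    bottom = min(rows[-1] + 1 + margin, len(mask))
    right = min(cols[-1] + 1 + margin, len(mask[0]))
    return top, left, bottom, right

def crop(image, box):
    """Slice a 2D image (list of rows) to the given box."""
    top, left, bottom, right = box
    return [row[left:right] for row in image[top:bottom]]

# 4x5 toy image whose pixel value encodes its position (row * 10 + column).
image = [[r * 10 + c for c in range(5)] for r in range(4)]
mask = [
    [0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0],
]
box = mask_bbox(mask)
patch = crop(image, box)
```

A small margin is often useful in practice so that the crop keeps some context around the object, but the right amount is task-dependent.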

[30] Guided Depth Map Super-Resolution via Multi-Scale Fusion U-shaped Mamba Network

Chenggang Guo,Hao Xu,XianMing Wan

Main category: cs.CV

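TL;DR: Proposes a multi-scale fusion U-shaped Mamba network (MSF-UM) for guided depth map super-resolution. By combining Mamba's efficient state-space modeling with a multi-scale U-shaped structure, it improves global context modeling and computational efficiency.

Details Motivation: Conventional convolutional neural networks are limited in modeling long-range dependencies. Transformers can model global dependencies, but their computational complexity and memory consumption are too high for high-resolution depth maps. A more efficient approach to depth map super-resolution is needed.

Contribution: 1. Proposes the MSF-UM model, which brings Mamba's state-space modeling into a multi-scale U-shaped fusion structure; 2. Designs a structure that combines residual dense channel attention blocks with Mamba state space modules; 3. Adopts a multi-scale cross-modal fusion strategy that makes full use of high-frequency texture information from the color image.

Method: 1. A multi-scale U-shaped fusion structure; 2. Residual dense channel attention blocks combined with Mamba state space modules; 3. A cross-modal fusion strategy in which the color image guides depth map super-resolution.

Result: Experiments on multiple public datasets validate the model. MSF-UM achieves better reconstruction accuracy than mainstream methods with significantly fewer parameters and generalizes especially well on large-scale depth map super-resolution.

Insight: Mamba's state space model is well suited to modeling long-range dependencies. Combined with multi-scale fusion and cross-modal information, it provides an efficient and accurate solution for depth map super-resolution.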

Abstract: Depth map super-resolution technology aims to improve the spatial resolution of low-resolution depth maps and effectively restore high-frequency detail information. Traditional convolutional neural networks have limitations in dealing with long-range dependencies and are unable to fully model the global contextual information in depth maps. Although transformers can model global dependencies, their computational complexity and memory consumption are quadratic, which significantly limits their ability to process high-resolution depth maps. In this paper, we propose a multi-scale fusion U-shaped Mamba (MSF-UM) model, a novel guided depth map super-resolution framework. The core innovation of this model is to integrate Mamba’s efficient state-space modeling capabilities into a multi-scale U-shaped fusion structure guided by a color image. We design a structure that combines the residual dense channel attention block with the Mamba state space module, pairing the local feature extraction capability of the convolutional layer with the modeling advantage of the state space model for long-distance dependencies. At the same time, the model adopts a multi-scale cross-modal fusion strategy to make full use of the high-frequency texture information from the color image to guide the super-resolution process of the depth map. Compared with existing mainstream methods, the proposed MSF-UM significantly reduces the number of model parameters while achieving better reconstruction accuracy. Extensive experiments on multiple publicly available datasets validate the effectiveness of the model, especially showing excellent generalization ability in the task of large-scale depth map super-resolution.

[31] Instruction-Grounded Visual Projectors for Continual Learning of Generative Vision-Language Models

Hyundong Jin,Hyung Jin Chang,Eunwoo Kim

Main category: cs.CV

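TL;DR: The paper proposes instruction-grounded visual projectors, a framework for continual learning of generative vision-language models (VLMs) that addresses the neglect of language instructions in existing methods. A mixture of experts with an expert recommendation strategy improves adaptation to new tasks and reduces interference. Experiments show the method generates instruction-following responses better than existing methods.

Details Motivation: When existing continual learning methods update the visual projector, the model may come to prioritize visual inputs over language instructions, particularly when tasks repeat the same types of instruction. The paper aims to ensure that visual information is translated in a way that is grounded on the language instruction.

Contribution: 1. Proposes an instruction-grounded visual projector framework that uses a mixture of experts to translate visual information in a task-adaptive way. 2. Designs an expert recommendation strategy and an expert pruning method to reduce interference and improve efficiency.

Method: 1. Introduces a mixture of visual projectors, each expert specializing in translation for a given instruction context. 2. Reuses experts for similar tasks through the expert recommendation strategy. 3. Applies expert pruning to avoid cumulative interference from earlier tasks.

Result: In experiments on diverse vision-language tasks, the method's responses follow instructions more closely, and it outperforms existing continual learning approaches.

Insight: Grounding the translation of visual information on language instructions makes adaptation to new tasks more effective while avoiding interference with earlier tasks. Expert recommendation and pruning make continual learning efficient.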

Abstract: Continual learning enables pre-trained generative vision-language models (VLMs) to incorporate knowledge from new tasks without retraining on data from previous ones. Recent methods update a visual projector to translate visual information for new tasks, connecting pre-trained vision encoders with large language models. However, such adjustments may cause the models to prioritize visual inputs over language instructions, particularly when learning tasks with repetitive types of textual instructions. To address the neglect of language instructions, we propose a novel framework that grounds the translation of visual information on instructions for language models. We introduce a mixture of visual projectors, each serving as a specialized visual-to-language translation expert based on the given instruction context to adapt to new tasks. To avoid using experts for irrelevant instruction contexts, we propose an expert recommendation strategy that reuses experts for tasks similar to those previously learned. Additionally, we introduce expert pruning to alleviate interference from the use of experts that were cumulatively activated in previous tasks. Extensive experiments on diverse vision-language tasks demonstrate that our method outperforms existing continual learning approaches by generating instruction-following responses.

[32] Multimodal Referring Segmentation: A Survey

Henghui Ding,Song Tang,Shuting He,Chang Liu,Zuxuan Wu,Yu-Gang Jiang

Main category: cs.CV

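TL;DR: This is a survey of multimodal referring segmentation. It covers the field's background, a unified meta architecture, mainstream methods, Generalized Referring Expression (GREx) methods, and related tasks and applications, and provides performance comparisons on standard benchmarks.

Details Motivation: Multimodal referring segmentation plays an important role in applications that need accurate object perception based on user instructions. It has drawn growing attention thanks to advances in convolutional neural networks, transformers, and large language models. The paper aims to organize research progress in the field systematically.

Contribution: 1. Provides a comprehensive survey of multimodal referring segmentation; 2. Summarizes a unified meta architecture; 3. Reviews mainstream methods for images, videos, and 3D scenes; 4. Discusses GREx methods for handling real-world complexity; 5. Provides performance comparisons on standard benchmarks.

Method: 1. Presents a unified meta architecture; 2. Categorizes and reviews representative methods across images, videos, and 3D scenes; 3. Introduces GREx methods that generalize referring expressions; 4. Compares performance extensively on standard datasets.

Result: The survey presents recent results and performance comparisons in multimodal referring segmentation and serves as a reference for future research.

Insight: 1. A unified meta architecture helps guide method design; 2. GREx methods offer a new way to handle real-world complexity; 3. Methods for images, videos, and 3D scenes each have distinct characteristics and need scene-specific optimization.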

Abstract: Multimodal referring segmentation aims to segment target objects in visual scenes, such as images, videos, and 3D scenes, based on referring expressions in text or audio format. This task plays a crucial role in practical applications requiring accurate object perception based on user instructions. Over the past decade, it has gained significant attention in the multimodal community, driven by advances in convolutional neural networks, transformers, and large language models, all of which have substantially improved multimodal perception capabilities. This paper provides a comprehensive survey of multimodal referring segmentation. We begin by introducing this field’s background, including problem definitions and commonly used datasets. Next, we summarize a unified meta architecture for referring segmentation and review representative methods across three primary visual scenes, including images, videos, and 3D scenes. We further discuss Generalized Referring Expression (GREx) methods to address the challenges of real-world complexity, along with related tasks and practical applications. Extensive performance comparisons on standard benchmarks are also provided. We continually track related works at https://github.com/henghuiding/Awesome-Multimodal-Referring-Segmentation.

[33] Towards Robust Semantic Correspondence: A Benchmark and Insights

Wenyue Chong

Main category: cs.CV

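TL;DR: The paper introduces a new benchmark dataset for evaluating the robustness of semantic correspondence in challenging scenarios, and uses extensive experiments to expose the limitations of existing methods and to assess strategies for improving them.

Details Motivation: Semantic correspondence is a fundamental computer vision task, but its robustness in challenging scenarios is understudied. The paper aims to fill this gap.

Contribution: 1. Builds a benchmark dataset covering 14 challenging scenarios; 2. Shows that existing semantic correspondence methods lack robustness; 3. Analyzes the strengths and limitations of large-scale vision models.

Method: Builds a benchmark of diverse challenging scenarios, systematically evaluates existing semantic correspondence methods on it, and analyzes the effects of model fusion and data augmentation.

Result: 1. All methods degrade noticeably in challenging scenarios; 2. Large-scale vision models improve robustness, but fine-tuning them lowers relative robustness; 3. DINO outperforms Stable Diffusion in relative robustness, and fusing the two improves absolute robustness.

Insight: General-purpose data augmentation does little to improve the robustness of semantic correspondence; task-specific designs are needed.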

Abstract: Semantic correspondence aims to identify semantically meaningful relationships between different images and is a fundamental challenge in computer vision. It forms the foundation for numerous tasks such as 3D reconstruction, object tracking, and image editing. With the progress of large-scale vision models, semantic correspondence has achieved remarkable performance in controlled and high-quality conditions. However, the robustness of semantic correspondence in challenging scenarios is much less investigated. In this work, we establish a novel benchmark for evaluating semantic correspondence in adverse conditions. The benchmark dataset comprises 14 distinct challenging scenarios that reflect commonly encountered imaging issues, including geometric distortion, image blurring, digital artifacts, and environmental occlusion. Through extensive evaluations, we provide several key insights into the robustness of semantic correspondence approaches: (1) All existing methods suffer from noticeable performance drops under adverse conditions; (2) Using large-scale vision models can enhance overall robustness, but fine-tuning on these models leads to a decline in relative robustness; (3) The DINO model outperforms Stable Diffusion in relative robustness, and their fusion achieves better absolute robustness. Moreover, we evaluate common robustness enhancement strategies for semantic correspondence and find that general data augmentations are ineffective, highlighting the need for task-specific designs. These results are consistent across both our dataset and real-world benchmarks.

[34] Privacy-Preserving Driver Drowsiness Detection with Spatial Self-Attention and Federated Learning

Tran Viet Khoa,Do Hai Son,Mohammad Abu Alsheikh,Yibeltal F Alem,Dinh Thai Hoang

Main category: cs.CV

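TL;DR: The paper proposes a driver drowsiness detection framework that combines a spatial self-attention mechanism with federated learning, detecting drowsiness effectively from heterogeneous, decentralized data while preserving user privacy.

Details Motivation: Driver drowsiness is one of the main causes of road accidents, but existing methods are not effective enough on decentralized, heterogeneous data and raise privacy concerns.

Contribution: 1. Designs a new framework that integrates Spatial Self-Attention (SSA) with an LSTM; 2. Proposes Gradient Similarity Comparison (GSC) to support federated learning; 3. Develops a tool that processes video data automatically.

Method: 1. Uses SSA to extract key facial features; 2. Combines it with an LSTM for better temporal modeling; 3. Uses GSC to improve model aggregation in federated learning.

Result: Reaches 89.9% detection accuracy in federated learning settings, outperforming existing methods.

Insight: On decentralized, heterogeneous data, combining SSA and GSC can markedly improve model performance and privacy protection.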

Abstract: Driver drowsiness is one of the main causes of road accidents and is recognized as a leading contributor to traffic-related fatalities. However, detecting drowsiness accurately remains a challenging task, especially in real-world settings where facial data from different individuals is decentralized and highly diverse. In this paper, we propose a novel framework for drowsiness detection that is designed to work effectively with heterogeneous and decentralized data. Our approach develops a new Spatial Self-Attention (SSA) mechanism integrated with a Long Short-Term Memory (LSTM) network to better extract key facial features and improve detection performance. To support federated learning, we employ a Gradient Similarity Comparison (GSC) that selects the most relevant trained models from different operators before aggregation. This improves the accuracy and robustness of the global model while preserving user privacy. We also develop a customized tool that automatically processes video data by extracting frames, detecting and cropping faces, and applying data augmentation techniques such as rotation, flipping, brightness adjustment, and zooming. Experimental results show that our framework achieves a detection accuracy of 89.9% in the federated learning settings, outperforming existing methods under various deployment scenarios. The results demonstrate the effectiveness of our approach in handling real-world data variability and highlight its potential for deployment in intelligent transportation systems to enhance road safety through early and reliable drowsiness detection.
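The abstract says GSC selects the most relevant trained models before aggregation but does not give the selection rule. The sketch below is one plausible reading, not the authors' algorithm: it assumes each operator sends a flat update vector, scores each update by cosine similarity to the mean update, and averages only those above a threshold. The function names, the mean-update reference, and the threshold are all illustrative assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0

def select_and_aggregate(updates, threshold=0.0):
    """Keep updates that point in a similar direction to the mean update, then average them.

    Assumption: the reference direction is the plain mean of all updates.
    """
    n, dim = len(updates), len(updates[0])
    reference = [sum(u[i] for u in updates) / n for i in range(dim)]
    kept = [u for u in updates if cosine(u, reference) > threshold]
    if not kept:  # fall back to plain averaging if nothing passes the filter
        kept = updates
    aggregated = [sum(u[i] for u in kept) / len(kept) for i in range(dim)]
    return aggregated, len(kept)

# Three hypothetical operators; the third update points the opposite way.
updates = [[1.0, 0.0], [0.8, 0.2], [-1.0, 0.1]]
global_update, n_kept = select_and_aggregate(updates)
```

In this toy run the third update is filtered out and the global update is the average of the first two.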

[35] TITAN-Guide: Taming Inference-Time AligNment for Guided Text-to-Video Diffusion Models

Christian Simon,Masato Ishii,Akio Hayakawa,Zhi Zhong,Shusuke Takahashi,Takashi Shibuya,Yuki Mitsufuji

Main category: cs.CV

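TL;DR: TITAN-Guide is an efficient training-free method that optimizes diffusion latents with forward gradient descent. It addresses the heavy memory use and weak control of existing guidance frameworks and significantly improves the performance of text-to-video diffusion models.

Details Motivation: Existing conditional diffusion models usually need heavy supervised fine-tuning for task control. Training-free guidance avoids fine-tuning but either uses a lot of memory or gives sub-optimal control, which limits its use in compute-intensive tasks such as text-to-video diffusion.

Contribution: 1. Proposes TITAN-Guide, an efficient training-free guidance method;
2. Optimizes latents with forward gradient descent, reducing memory use;
3. Demonstrates better performance than existing methods on text-to-video diffusion models.

Method: 1. Optimizes diffusion latents with forward gradient descent, avoiding backpropagation;
2. Studies several options for directional directives in guidance tasks to improve control;
3. Manages memory efficiently to accommodate large diffusion models.

Result: Experiments show that TITAN-Guide outperforms existing methods in both memory management and performance, significantly improving guidance for text-to-video diffusion models.

Insight: Optimizing latents with forward gradient descent gives a training-free way to control diffusion models efficiently, especially for compute-intensive tasks.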

Abstract: Recently developed conditional diffusion models still require heavy supervised fine-tuning to perform control on a category of tasks. Training-free conditioning via guidance with off-the-shelf models is a favorable alternative to avoid further fine-tuning on the base model. However, the existing training-free guidance frameworks either have heavy memory requirements or offer sub-optimal control due to rough estimation. These shortcomings limit the applicability to control diffusion models that require intense computation, such as Text-to-Video (T2V) diffusion models. In this work, we propose Taming Inference Time Alignment for Guided Text-to-Video Diffusion Model, so-called TITAN-Guide, which overcomes memory space issues and provides better control in the guidance process than its counterparts. In particular, we develop an efficient method for optimizing diffusion latents without backpropagation from a discriminative guiding model. Specifically, we study forward gradient descents for guided diffusion tasks with various options on directional directives. In our experiments, we demonstrate the effectiveness of our approach in efficiently managing memory during latent optimization, while previous methods fall short. Our proposed approach not only minimizes memory requirements but also significantly enhances T2V performance across a range of diffusion guidance benchmarks. Code, models, and demo are available at https://titanguide.github.io.
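To make "forward gradient descent" concrete, here is a minimal sketch of the general technique, not TITAN-Guide itself. A forward gradient picks a random direction `v`, measures the directional derivative of the loss along `v`, and uses `(directional derivative) * v` as a gradient estimate, so no backward pass is stored. The quadratic `loss` is a hypothetical stand-in for a guidance objective, and a central finite difference stands in for a forward-mode Jacobian-vector product.

```python
import random

def loss(x):
    # Hypothetical guidance loss: squared distance of the latent to a target.
    target = [1.0, -2.0, 0.5]
    return sum((a - t) ** 2 for a, t in zip(x, target))

def directional_derivative(f, x, v, eps=1e-5):
    # Central finite difference along v; a real system would use a forward-mode JVP.
    xp = [a + eps * b for a, b in zip(x, v)]
    xm = [a - eps * b for a, b in zip(x, v)]
    return (f(xp) - f(xm)) / (2 * eps)

def forward_gradient_step(f, x, lr, rng):
    # Sample a Gaussian direction and step along it, scaled by the directional derivative.
    v = [rng.gauss(0.0, 1.0) for _ in x]
    d = directional_derivative(f, x, v)
    return [a - lr * d * b for a, b in zip(x, v)]

rng = random.Random(0)
x = [0.0, 0.0, 0.0]
initial_loss = loss(x)
for _ in range(500):
    x = forward_gradient_step(loss, x, lr=0.05, rng=rng)
final_loss = loss(x)
```

Each step needs only forward evaluations of the loss, which is what makes this family of methods attractive when memory is the bottleneck.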

[36] AniMer+: Unified Pose and Shape Estimation Across Mammalia and Aves via Family-Aware Transformer

Jin Lyu,Liang An,Li Lin,Pujin Cheng,Yebin Liu,Xiaoying Tang

Main category: cs.CV

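TL;DR: AniMer+ is a unified method for estimating the pose and shape of mammals and birds. Its family-aware Vision Transformer (ViT) with a Mixture-of-Experts (MoE) design addresses the scarcity of multi-species data and limited network capacity.

Details Motivation: Existing multi-species animal pose and shape estimation methods are limited by network capacity and scarce data, especially for birds. AniMer+ aims to address this with a unified model and synthetic data.

Contribution: 1. Proposes a family-aware ViT with an MoE design that separates taxa-specific and shared features. 2. Introduces a diffusion-based pipeline that generates the large-scale synthetic datasets CtrlAni3D and CtrlAVES3D, filling the gap in bird data.

Method: 1. Uses the family-aware ViT to partition the network into taxa-specific and shared components. 2. Uses a diffusion model to generate synthetic data that supplements the training set.

Result: Trained on 41.3k mammalian and 12.4k avian images, AniMer+ outperforms existing methods on multiple benchmarks, including Animal Kingdom.

Insight: Synthetic data is key to addressing the lack of 3D annotations for scarce taxa such as birds, and the family-aware design effectively improves generalization.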

Abstract: In the era of foundation models, achieving a unified understanding of different dynamic objects through a single network has the potential to empower stronger spatial intelligence. Moreover, accurate estimation of animal pose and shape across diverse species is essential for quantitative analysis in biological research. However, this topic remains underexplored due to the limited network capacity of previous methods and the scarcity of comprehensive multi-species datasets. To address these limitations, we introduce AniMer+, an extended version of our scalable AniMer framework. In this paper, we focus on a unified approach for reconstructing mammals (mammalia) and birds (aves). A key innovation of AniMer+ is its high-capacity, family-aware Vision Transformer (ViT) incorporating a Mixture-of-Experts (MoE) design. Its architecture partitions network layers into taxa-specific components (for mammalia and aves) and taxa-shared components, enabling efficient learning of both distinct and common anatomical features within a single model. To overcome the critical shortage of 3D training data, especially for birds, we introduce a diffusion-based conditional image generation pipeline. This pipeline produces two large-scale synthetic datasets: CtrlAni3D for quadrupeds and CtrlAVES3D for birds. Notably, CtrlAVES3D is the first large-scale, 3D-annotated dataset for birds, which is crucial for resolving single-view depth ambiguities. Trained on an aggregated collection of 41.3k mammalian and 12.4k avian images (combining real and synthetic data), our method demonstrates superior performance over existing approaches across a wide range of benchmarks, including the challenging out-of-domain Animal Kingdom dataset. Ablation studies confirm the effectiveness of both our novel network architecture and the generated synthetic datasets in enhancing real-world application performance.

[37] Controllable Pedestrian Video Editing for Multi-View Driving Scenarios via Motion Sequence

Danzhen Fu,Jiagao Hu,Daiguo Zhou,Fei Wang,Zepeng Wang,Wenhua Liao

Main category: cs.CV

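TL;DR: The paper proposes a controllable pedestrian video editing framework for multi-view driving scenarios. By combining video inpainting with human motion control, it supports pedestrian insertion, replacement, and removal, and experiments show high visual realism and cross-view consistency.

Details Motivation: Pedestrian detection models in autonomous driving lack robustness because dangerous scenarios are underrepresented in training data. The paper aims to make these models more robust by generating diverse pedestrian video data.

Contribution: Proposes a controllable multi-view pedestrian video editing framework that supports flexible editing (insertion, replacement, and removal) and maintains spatial consistency across views.

Method: The method consists of: 1) detecting and expanding pedestrian regions across views; 2) stitching the regions into a unified canvas; 3) guiding pedestrian editing with pose sequence control conditions.

Result: Experiments show that the framework generates high-quality, visually realistic, and spatiotemporally coherent pedestrian videos suitable for data augmentation and scenario simulation in autonomous driving.

Insight: The method offers a flexible and efficient solution for autonomous driving data augmentation and shows the potential of multi-view editing in complex scenes.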

Abstract: Pedestrian detection models in autonomous driving systems often lack robustness due to insufficient representation of dangerous pedestrian scenarios in training datasets. To address this limitation, we present a novel framework for controllable pedestrian video editing in multi-view driving scenarios by integrating video inpainting and human motion control techniques. Our approach begins by identifying pedestrian regions of interest across multiple camera views, expanding detection bounding boxes with a fixed ratio, and resizing and stitching these regions into a unified canvas while preserving cross-view spatial relationships. A binary mask is then applied to designate the editable area, within which pedestrian editing is guided by pose sequence control conditions. This enables flexible editing functionalities, including pedestrian insertion, replacement, and removal. Extensive experiments demonstrate that our framework achieves high-quality pedestrian editing with strong visual realism, spatiotemporal coherence, and cross-view consistency. These results establish the proposed method as a robust and versatile solution for multi-view pedestrian video generation, with broad potential for applications in data augmentation and scenario simulation in autonomous driving.

[38] Exploring Fourier Prior and Event Collaboration for Low-Light Image Enhancement

Chunyan She,Fujun Han,Chengyu Fang,Shukai Duan,Lidan Wang

Main category: cs.CV

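TL;DR: The paper proposes an event-camera-based low-light image enhancement method that decouples the task into two stages, visibility restoration and structure refinement, and combines a Fourier prior with dynamic alignment to improve performance significantly.

Details Motivation: Existing methods feed frame and event data directly into a single model and do not fully exploit the strengths of each modality. After analyzing the role of each sensing modality, the paper processes them in stages to improve performance.

Contribution: 1. Decouples enhancement into visibility restoration and structure refinement; 2. Designs an amplitude-phase entanglement network in Fourier space; 3. Proposes a fusion strategy with dynamic alignment; 4. Introduces a contrastive loss to make the model more discriminative.

Method: 1. Visibility restoration stage: designs the network around the amplitude-phase relationship in Fourier space; 2. Structure refinement stage: fuses frame and event data with dynamic alignment; 3. Generates negative samples by spatial-frequency interpolation and optimizes a contrastive loss.

Result: Experiments show that the method outperforms state-of-the-art models on low-light image enhancement.

Insight: 1. The high dynamic range and temporal resolution of event cameras are key to better low-light enhancement; 2. Staged processing and multimodal fusion improve performance significantly.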

Abstract: The event camera, benefiting from its high dynamic range and low latency, provides performance gain for low-light image enhancement. Unlike frame-based cameras, it records intensity changes with extremely high temporal resolution, capturing sufficient structure information. Currently, existing event-based methods feed a frame and events directly into a single model without fully exploiting modality-specific advantages, which limits their performance. Therefore, by analyzing the role of each sensing modality, the enhancement pipeline is decoupled into two stages: visibility restoration and structure refinement. In the first stage, we design a visibility restoration network with amplitude-phase entanglement by rethinking the relationship between amplitude and phase components in Fourier space. In the second stage, a fusion strategy with dynamic alignment is proposed to mitigate the spatial mismatch caused by the temporal resolution discrepancy between two sensing modalities, aiming to refine the structure information of the image enhanced by the visibility restoration network. In addition, we utilize spatial-frequency interpolation to simulate negative samples with diverse illumination, noise and artifact degradations, thereby developing a contrastive loss that encourages the model to learn discriminative representations. Experiments demonstrate that the proposed method outperforms state-of-the-art models.
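As background for the amplitude and phase components mentioned in the abstract, the sketch below shows only the basic Fourier decomposition and recombination with NumPy, not the paper's amplitude-phase entanglement network. The toy assumption is that the low-light image is a uniformly dimmed copy of the normal one, in which case the dimming lives entirely in the amplitude. Real low-light degradation is not this simple, and the paper explicitly reconsiders how the two components relate.

```python
import numpy as np

def split_amplitude_phase(img):
    """Return the amplitude and phase of the 2D Fourier spectrum of an image."""
    spec = np.fft.fft2(img)
    return np.abs(spec), np.angle(spec)

def merge_amplitude_phase(amp, phase):
    """Rebuild a real-valued image from an amplitude and a phase spectrum."""
    return np.real(np.fft.ifft2(amp * np.exp(1j * phase)))

rng = np.random.default_rng(0)
normal = rng.random((8, 8))
dark = 0.2 * normal  # toy low-light image: uniform dimming only

amp_dark, phase_dark = split_amplitude_phase(dark)
amp_normal, _ = split_amplitude_phase(normal)

# Pair the normal-light amplitude with the dark image's phase.
restored = merge_amplitude_phase(amp_normal, phase_dark)
```

Under this toy degradation, swapping in the normal-light amplitude recovers the original image, which is why amplitude is a natural target for a visibility-restoration stage.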

[39] DocTron-Formula: Generalized Formula Recognition in Complex and Structured Scenarios

Yufeng Zhong,Zhixiong Zeng,Lei Chen,Longrong Yang,Liming Zheng,Jing Huang,Siqi Yang,Lin Ma

Main category: cs.CV

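TL;DR: DocTron-Formula is a unified framework built on general vision-language models for recognizing mathematical formulas in complex and structured scenarios. It reaches state-of-the-art performance with straightforward supervised fine-tuning.

Details Motivation: OCR for mathematical formulas is essential to the intelligent analysis of scientific literature, but both task-specific and general vision-language models struggle with the structural diversity and complexity of mathematical content.

Contribution: Proposes the DocTron-Formula framework, which needs no specialized architecture, and introduces CSFormula, a large-scale dataset of multidisciplinary, structurally complex formulas.

Method: Builds on a general vision-language model and applies supervised fine-tuning to handle complex mathematical formulas.

Result: Achieves state-of-the-art performance across a variety of styles, scientific domains, and complex layouts, surpassing specialized models in accuracy and robustness.

Insight: With straightforward fine-tuning, general vision-language models can handle the automated understanding of complex scientific documents, suggesting a new paradigm for the field.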

Abstract: Optical Character Recognition (OCR) for mathematical formulas is essential for the intelligent analysis of scientific literature. However, both task-specific and general vision-language models often struggle to handle the structural diversity, complexity, and real-world variability inherent in mathematical content. In this work, we present DocTron-Formula, a unified framework built upon general vision-language models, thereby eliminating the need for specialized architectures. Furthermore, we introduce CSFormula, a large-scale and challenging dataset that encompasses multidisciplinary and structurally complex formulas at the line, paragraph, and page levels. Through straightforward supervised fine-tuning, our approach achieves state-of-the-art performance across a variety of styles, scientific domains, and complex layouts. Experimental results demonstrate that our method not only surpasses specialized models in terms of accuracy and robustness, but also establishes a new paradigm for the automated understanding of complex scientific documents.

[40] GV-VAD: Exploring Video Generation for Weakly-Supervised Video Anomaly Detection

Suhang Cai,Xiaohao Peng,Chong Wang,Xiaojie Cai,Jiangbo Qian

Main category: cs.CV

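TL;DR: GV-VAD is a weakly-supervised video anomaly detection framework built on generated videos. It uses text-conditioned video generation models to produce synthetic videos cheaply for training data augmentation, and it applies a synthetic sample loss scaling strategy during training. It outperforms existing methods on the UCF-Crime dataset.

Details Motivation: Real anomalies are rare, unpredictable, and costly to annotate, which limits the data scale and generalization of existing video anomaly detection models. GV-VAD addresses this by generating synthetic videos.

Contribution: 1) Proposes GV-VAD, a weakly-supervised anomaly detection framework based on generated videos. 2) Uses text-conditioned video generation models to produce semantically controllable synthetic videos at low cost for training data augmentation. 3) Designs a synthetic sample loss scaling strategy to control how much synthetic data influences training.

Method: 1) Generates synthetic anomaly videos with text-conditioned video generation models. 2) Trains the model with weak supervision on generated and real data together. 3) Applies the synthetic sample loss scaling strategy to adjust the contribution of synthetic data to training.

Result: On the UCF-Crime dataset, GV-VAD outperforms state-of-the-art methods, showing that generated videos are effective for anomaly detection.

Insight: Augmenting training data with synthetic videos is a low-cost and effective solution, especially under weak supervision, where the controllability of generated data can improve model performance significantly.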

Abstract: Video anomaly detection (VAD) plays a critical role in public safety applications such as intelligent surveillance. However, the rarity, unpredictability, and high annotation cost of real-world anomalies make it difficult to scale VAD datasets, which limits the performance and generalization ability of existing models. To address this challenge, we propose a generative video-enhanced weakly-supervised video anomaly detection (GV-VAD) framework that leverages text-conditioned video generation models to produce semantically controllable and physically plausible synthetic videos. These virtual videos are used to augment training data at low cost. In addition, a synthetic sample loss scaling strategy is utilized to control the influence of generated synthetic samples for efficient training. The experiments show that the proposed framework outperforms state-of-the-art methods on the UCF-Crime dataset. The code is available at https://github.com/Sumutan/GV-VAD.git.
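The abstract does not spell out the synthetic sample loss scaling strategy. One simple way to picture the idea, offered here as an assumption and not as the paper's formula, is to multiply the per-sample loss of generated videos by a factor below 1 so they influence training less than real videos. The function name and the constant `synthetic_scale` are hypothetical.

```python
def scaled_batch_loss(losses, is_synthetic, synthetic_scale=0.5):
    """Average per-sample losses, down-weighting samples flagged as synthetic.

    Assumption: a single constant scale; the paper's strategy may differ.
    """
    weights = [synthetic_scale if flag else 1.0 for flag in is_synthetic]
    return sum(w * l for w, l in zip(weights, losses)) / len(losses)

# Two real and two generated samples in one batch.
losses = [0.8, 0.4, 1.2, 0.6]
is_synthetic = [False, False, True, True]
batch_loss = scaled_batch_loss(losses, is_synthetic, synthetic_scale=0.5)
```

With `synthetic_scale=1.0` this reduces to the ordinary mean loss, so the scale acts as a single knob for how much generated data is trusted.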

[41] Steering Guidance for Personalized Text-to-Image Diffusion Models

Sunghyun Park,Seokeon Choi,Hyoungwoo Park,Sungrack Yun

Main category: cs.CV

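TL;DR: The paper proposes personalization guidance, a new guidance method that uses an unlearned weak model together with dynamic weight interpolation to balance text alignment and target distribution fidelity, addressing the limitations of existing methods for personalized text-to-image diffusion models.

Details Motivation: Few-shot fine-tuning of personalized text-to-image diffusion models involves a trade-off between aligning with the target distribution and preserving the original model's knowledge, and existing sampling guidance methods such as CFG and AG cannot balance the two effectively.

Contribution: Proposes a simple yet effective personalization guidance method that uses an unlearned weak model and dynamic weight interpolation to steer outputs explicitly toward a balanced latent space, without additional computational overhead.

Method: Implements personalization guidance with an unlearned weak model (conditioned on a null text prompt) and dynamic weight interpolation between the pre-trained and fine-tuned models, balancing text alignment and target distribution fidelity.

Result: Experiments show that the method improves both text alignment and target distribution fidelity and integrates seamlessly with various fine-tuning strategies.

Insight: Combining dynamic weight interpolation with an unlearned weak model can resolve the trade-off in personalized diffusion models and offers a new approach to guidance.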

Abstract: Personalizing text-to-image diffusion models is crucial for adapting the pre-trained models to specific target concepts, enabling diverse image generation. However, fine-tuning with few images introduces an inherent trade-off between aligning with the target distribution (e.g., subject fidelity) and preserving the broad knowledge of the original model (e.g., text editability). Existing sampling guidance methods, such as classifier-free guidance (CFG) and autoguidance (AG), fail to effectively guide the output toward a well-balanced space: CFG restricts the adaptation to the target distribution, while AG compromises text alignment. To address these limitations, we propose personalization guidance, a simple yet effective method leveraging an unlearned weak model conditioned on a null text prompt. Moreover, our method dynamically controls the extent of unlearning in a weak model through weight interpolation between pre-trained and fine-tuned models during inference. Unlike existing guidance methods, which depend solely on guidance scales, our method explicitly steers the outputs toward a balanced latent space without additional computational overhead. Experimental results demonstrate that our proposed guidance can improve text alignment and target distribution fidelity, integrating seamlessly with various fine-tuning strategies.
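The two ingredients named in the abstract, weight interpolation between pre-trained and fine-tuned models and guidance away from a weak model, can be sketched in a few lines. This is a minimal illustration under assumptions: weights are plain lists keyed by name, `omega` is a hypothetical interpolation coefficient, and the guidance step uses the generic extrapolation form `weak + scale * (strong - weak)` shared by CFG-style methods. The paper's exact formulation and schedule for `omega` are not given in the abstract.

```python
def interpolate_weights(pretrained, finetuned, omega):
    """Blend two weight dicts: omega=0 gives the pre-trained model, omega=1 the fine-tuned one."""
    return {
        name: [(1 - omega) * p + omega * f for p, f in zip(pretrained[name], finetuned[name])]
        for name in pretrained
    }

def guided_prediction(pred_strong, pred_weak, scale):
    """Extrapolate from the weak model's prediction toward the strong model's prediction."""
    return [w + scale * (s - w) for s, w in zip(pred_strong, pred_weak)]

pretrained = {"layer.weight": [0.0, 1.0]}
finetuned = {"layer.weight": [1.0, 3.0]}
weak_weights = interpolate_weights(pretrained, finetuned, omega=0.25)

# Hypothetical noise predictions from the fine-tuned (strong) and weak models.
guided = guided_prediction(pred_strong=[1.0, 2.0], pred_weak=[0.5, 1.0], scale=2.0)
```

Because the weak model is derived from weights that already exist, building it needs no extra training, only the interpolation above.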

[42] Spectral Sensitivity Estimation with an Uncalibrated Diffraction Grating

Lilika Makabe,Hiroaki Santo,Fumio Okura,Michael S. Brown,Yasuyuki Matsushita

Main category: cs.CV

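TL;DR: The paper proposes a practical calibration method that estimates camera spectral sensitivity with an uncalibrated diffraction grating, avoiding the specialized narrow-band filters or reference targets with known spectral reflectance that conventional methods require.

Details Motivation: Accurate calibration of camera spectral sensitivity is crucial for many computer vision tasks. Conventional methods depend on specialized equipment or targets with known reflectance, which limits their applicability. The paper aims to provide a more practical and convenient solution.

Contribution: Proposes a camera spectral sensitivity calibration method that needs only an uncalibrated diffraction grating and estimates both the camera spectral sensitivity and the grating parameters in closed form.

Method: Captures images of the direct illumination and of its diffraction pattern through the grating, then estimates the camera spectral sensitivity and the grating parameters jointly in closed form.

Result: Experiments on synthetic and real data show that the method outperforms conventional reference-target-based methods, confirming its effectiveness and practicality.

Insight: A common, readily available uncalibrated diffraction grating is enough for high-quality camera spectral sensitivity calibration, giving related tasks a more flexible and economical option.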

Abstract: This paper introduces a practical and accurate calibration method for camera spectral sensitivity using a diffraction grating. Accurate calibration of camera spectral sensitivity is crucial for various computer vision tasks, including color correction, illumination estimation, and material analysis. Unlike existing approaches that require specialized narrow-band filters or reference targets with known spectral reflectances, our method only requires an uncalibrated diffraction grating sheet, readily available off-the-shelf. By capturing images of the direct illumination and its diffracted pattern through the grating sheet, our method estimates both the camera spectral sensitivity and the diffraction grating parameters in a closed-form manner. Experiments on synthetic and real-world data demonstrate that our method outperforms conventional reference target-based methods, underscoring its effectiveness and practicality.

[43] Analyze-Prompt-Reason: A Collaborative Agent-Based Framework for Multi-Image Vision-Language Reasoning

Angelos Vlachos,Giorgos Filandrianos,Maria Lymperaiou,Nikolaos Spanos,Ilias Mitsouras,Vasileios Karampinis,Athanasios Voulodimos

Main category: cs.CV

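TL;DR: The paper proposes Analyze-Prompt-Reason, a collaborative dual-agent framework for multi-image vision-language reasoning. A PromptEngineer generates task-specific prompts and a VisionReasoner performs the final inference, which lets the framework generalize across datasets and tasks.

Details Motivation: To handle complex reasoning across datasets and task formats in multi-image, multimodal tasks, the paper proposes a training-free, modular framework.

Contribution: 1. Proposes a collaborative framework built on two agents, PromptEngineer and VisionReasoner. 2. The framework is fully automated, modular, and training-free. 3. Validates its effectiveness on multiple visual reasoning tasks.

Method: Uses a dual-agent system: PromptEngineer generates task-specific prompts, and VisionReasoner (a large vision-language model) performs the inference. The system supports classification, question answering, and free-form generation.

Result: Performs strongly on the 18 datasets of the MIRAGE Challenge, with near-ceiling results on tasks such as TQA (99.13%), DocVQA (96.87%), and MMCoQA (75.28 ROUGE-L).

Insight: Large vision-language models (LVLMs) can reason effectively over multiple images when given informative prompts, and design choices such as model selection, shot count, and input length have a clear effect on performance.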

Abstract: We present a Collaborative Agent-Based Framework for Multi-Image Reasoning. Our approach tackles the challenge of interleaved multimodal reasoning across diverse datasets and task formats by employing a dual-agent system: a language-based PromptEngineer, which generates context-aware, task-specific prompts, and a VisionReasoner, a large vision-language model (LVLM) responsible for final inference. The framework is fully automated, modular, and training-free, enabling generalization across classification, question answering, and free-form generation tasks involving one or multiple input images. We evaluate our method on 18 diverse datasets from the 2025 MIRAGE Challenge (Track A), covering a broad spectrum of visual reasoning tasks including document QA, visual comparison, dialogue-based understanding, and scene-level inference. Our results demonstrate that LVLMs can effectively reason over multiple images when guided by informative prompts. Notably, Claude 3.7 achieves near-ceiling performance on challenging tasks such as TQA (99.13% accuracy), DocVQA (96.87%), and MMCoQA (75.28 ROUGE-L). We also explore how design choices, such as model selection, shot count, and input length, influence the reasoning performance of different LVLMs.

[44] Stable at Any Speed: Speed-Driven Multi-Object Tracking with Learnable Kalman Filtering

Yan Gong,Mengjun Chen,Hao Liu,Gao Yongsheng,Lei Yang,Naibang Wang,Ziying Song,Haoqun Ma

Main category: cs.CV

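TL;DR: The paper proposes a Speed-Guided Learnable Kalman Filter (SG-LKF) that adapts dynamically to the changes in observation noise and reference frame induced by ego-vehicle speed, significantly improving the stability and accuracy of multi-object tracking (MOT) in highly dynamic scenarios.

Details Motivation: Conventional MOT methods typically rely on static coordinate transformations and ignore the changes in observation noise and reference frame caused by ego-vehicle speed, so tracking performance degrades in highly dynamic scenarios.

Contribution: 1. Proposes SG-LKF, which adapts dynamically to speed-induced uncertainty; 2. Introduces MotionScaleNet (MSNet) to predict key parameters; 3. Proposes a self-supervised trajectory consistency loss to improve association.

Method: 1. Uses SG-LKF to adjust uncertainty modeling dynamically; 2. MSNet predicts the parameters with a decoupled token-mixing and channel-mixing MLP; 3. Adds the self-supervised loss to improve trajectory continuity.

Result: Achieves leading performance on KITTI 2D/3D MOT and nuScenes 3D MOT, for example 79.59% HOTA on KITTI 2D MOT.

Insight: Ego-vehicle speed is a key factor in MOT performance, and adapting dynamically to noise and reference-frame changes can improve tracking stability significantly.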

Abstract: Multi-object tracking (MOT) enables autonomous vehicles to continuously perceive dynamic objects, supplying essential temporal cues for prediction, behavior understanding, and safe planning. However, conventional tracking-by-detection methods typically rely on static coordinate transformations based on ego-vehicle poses, disregarding ego-vehicle speed-induced variations in observation noise and reference frame changes, which degrades tracking stability and accuracy in dynamic, high-speed scenarios. In this paper, we investigate the critical role of ego-vehicle speed in MOT and propose a Speed-Guided Learnable Kalman Filter (SG-LKF) that dynamically adapts uncertainty modeling to ego-vehicle speed, significantly improving stability and accuracy in highly dynamic scenarios. Central to SG-LKF is MotionScaleNet (MSNet), a decoupled token-mixing and channel-mixing MLP that adaptively predicts key parameters of SG-LKF. To enhance inter-frame association and trajectory continuity, we introduce a self-supervised trajectory consistency loss jointly optimized with semantic and positional constraints. Extensive experiments show that SG-LKF ranks first among all vision-based methods on KITTI 2D MOT with 79.59% HOTA, delivers strong results on KITTI 3D MOT with 82.03% HOTA, and outperforms SimpleTrack by 2.2% AMOTA on nuScenes 3D MOT.
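The core idea of tying Kalman-filter uncertainty to ego-vehicle speed can be illustrated with a scalar filter. This is a hand-written toy, not SG-LKF: in the paper the key parameters are predicted by MSNet, whereas here a hypothetical linear rule `base_r * (1 + gain * speed)` inflates the measurement noise as speed grows.

```python
def kalman_update(x, p, z, r):
    """Scalar Kalman measurement update: state x, variance p, measurement z, noise r."""
    k = p / (p + r)  # Kalman gain
    return x + k * (z - x), (1 - k) * p

def speed_scaled_noise(base_r, ego_speed, gain=0.1):
    """Hypothetical rule: measurement noise grows linearly with ego-vehicle speed."""
    return base_r * (1.0 + gain * ego_speed)

x, p, z = 0.0, 1.0, 1.0  # prior state, prior variance, new detection
slow = kalman_update(x, p, z, speed_scaled_noise(1.0, ego_speed=0.0))
fast = kalman_update(x, p, z, speed_scaled_noise(1.0, ego_speed=30.0))
```

At higher speed the filter trusts the detection less, so the updated state stays closer to the prediction; a learned module would replace the fixed linear rule.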

[45] CoST: Efficient Collaborative Perception From Unified Spatiotemporal Perspective

Zongheng Tang,Yi Liu,Yifan Sun,Yulu Gao,Jinyu Chen,Runsheng Xu,Si Liu

Main category: cs.CV

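TL;DR: The paper proposes CoST, which performs efficient collaborative perception from a unified spatio-temporal perspective, improving both transmission efficiency and perception performance.

Details Motivation: Existing collaborative perception methods usually split multi-agent fusion and multi-time fusion into two steps, which leads to inefficient transmission and insufficient feature fusion. The paper addresses this with a unified spatio-temporal space.

Contribution: Proposes CoST, which unifies multi-agent and multi-time fusion in a single spatio-temporal space, achieving efficient transmission and better perception while remaining compatible with most existing methods.

Method: Uses a Spatio-temporal Transformer to aggregate observations from multiple agents and multiple times into one spatio-temporal space, avoiding repeated transmission.

Result: CoST improves both efficiency and accuracy and reduces the required transmission bandwidth.

Insight: A unified spatio-temporal fusion perspective captures information more holistically and suits challenging cases such as occlusion and small sensing range.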

Abstract: Collaborative perception shares information among different agents and helps solve problems that individual agents may face, e.g., occlusions and small sensing range. Prior methods usually separate the multi-agent fusion and multi-time fusion into two consecutive steps. In contrast, this paper proposes an efficient collaborative perception that aggregates the observations from different agents (space) and different times into a unified spatio-temporal space simultaneously. The unified spatio-temporal space brings two benefits, i.e., efficient feature transmission and superior feature fusion. 1) Efficient feature transmission: each static object yields a single observation in the spatio-temporal space and thus needs to be transmitted only once (whereas prior methods re-transmit all the object features multiple times). 2) Superior feature fusion: merging the multi-agent and multi-time fusion into a unified spatio-temporal aggregation enables a more holistic perspective, thereby enhancing perception performance in challenging scenarios. Consequently, our Collaborative perception with Spatio-temporal Transformer (CoST) gains improvement in both efficiency and accuracy. Notably, CoST is not tied to any specific method and is compatible with a majority of previous methods, enhancing their accuracy while reducing the transmission bandwidth.

[46] Honey Classification using Hyperspectral Imaging and Machine Learning

Mokhtar A. Al-Awadhi,Ratnadeep R. Deshmukh

Main category: cs.CV

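TL;DR: The paper proposes a machine learning method for classifying the botanical origin of honey from hyperspectral imaging data. Using LDA feature extraction with SVM/KNN classifiers, it reaches a classification accuracy of up to 95.13%.

Details Motivation: Classifying the botanical origin of honey is an important practical problem, and traditional methods are inefficient and limited in accuracy. Hyperspectral imaging offers a non-destructive way to inspect honey but needs efficient classification methods for its high-dimensional data.

Contribution: The main contributions are: (1) a machine learning framework for classifying honey botanical origins; (2) a class transformation method that increases class separability; (3) the combination of LDA dimensionality reduction with SVM/KNN classifiers, which yields high classification accuracy.

Method: The method has three steps: (1) preprocess the data with a class transformation to increase class separability; (2) apply LDA for feature extraction and dimensionality reduction; (3) classify the extracted features with SVM and KNN models.

Result: On a standard hyperspectral imaging dataset, the system reaches 95.13% accuracy for image-based classification and 92.80% for instance-based classification, outperforming existing methods.

Insight: With effective preprocessing and feature extraction, machine learning can handle hyperspectral data well and achieve high accuracy on honey classification.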

Abstract: In this paper, we propose a machine learning-based method for automatically classifying honey botanical origins. Dataset preparation, feature extraction, and classification are the three main steps of the proposed method. We use a class transformation method in the dataset preparation phase to maximize the separability across classes. The feature extraction phase employs the Linear Discriminant Analysis (LDA) technique for extracting relevant features and reducing the number of dimensions. In the classification phase, we use Support Vector Machines (SVM) and K-Nearest Neighbors (KNN) models to classify the extracted features of honey samples into their botanical origins. We evaluate our system using a standard honey hyperspectral imaging (HSI) dataset. Experimental findings demonstrate that the proposed system produces state-of-the-art results on this dataset, achieving the highest classification accuracy of 95.13% for hyperspectral image-based classification and 92.80% for hyperspectral instance-based classification.
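The feature extraction and classification stages (LDA followed by SVM or KNN) map directly onto a scikit-learn pipeline. The sketch below is illustrative only: it uses synthetic blobs from `make_blobs` as a stand-in for honey spectra, picks arbitrary hyperparameters, and omits the paper's class transformation step, whose details are not given in the abstract.

```python
from sklearn.datasets import make_blobs
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Synthetic stand-in for spectra: 3 classes, 50 "bands" per sample.
X, y = make_blobs(n_samples=300, centers=3, n_features=50, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

def evaluate(classifier):
    """Fit LDA (reduced to 2 dimensions) plus a classifier; return held-out accuracy."""
    model = make_pipeline(LinearDiscriminantAnalysis(n_components=2), classifier)
    model.fit(X_train, y_train)
    return model.score(X_test, y_test)

svm_accuracy = evaluate(SVC(kernel="rbf"))
knn_accuracy = evaluate(KNeighborsClassifier(n_neighbors=5))
```

LDA can produce at most one fewer component than the number of classes, which is why three classes are reduced to two dimensions here.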

[47] SparseRecon: Neural Implicit Surface Reconstruction from Sparse Views with Feature and Depth Consistencies

Liang Han,Xu Zhang,Haichuan Song,Kanle Shi,Yu-Shen Liu,Zhizhong Han

Main category: cs.CV

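TL;DR: SparseRecon reconstructs neural implicit surfaces from sparse views, using feature consistency and depth consistency constraints to overcome the limitations of existing methods.

Details Motivation: Existing sparse-view reconstruction methods, whether generalization-based or overfitting-based, perform poorly on unseen views or with limited geometric cues, so a more effective solution is needed.

Contribution: Proposes SparseRecon, which combines a volume rendering-based feature consistency loss with an uncertainty-guided depth constraint to improve sparse-view reconstruction quality.

Method: 1. Introduces a cross-view feature consistency loss to constrain the neural implicit field; 2. Uses an uncertainty-guided depth constraint to back up feature consistency in occluded and low-feature regions.

Result: Experiments show that SparseRecon outperforms existing methods and produces high-quality geometry, especially when views overlap little.

Insight: Combining feature consistency with depth constraints effectively addresses ambiguity and recovers geometric detail in sparse-view reconstruction.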

Abstract: Surface reconstruction from sparse views aims to reconstruct a 3D shape or scene from few RGB images. The latest methods are either generalization-based or overfitting-based. However, the generalization-based methods do not generalize well on views that were unseen during training, while the reconstruction quality of overfitting-based methods is still constrained by scarce geometry clues. To address this issue, we propose SparseRecon, a novel neural implicit reconstruction method for sparse views with volume rendering-based feature consistency and uncertainty-guided depth constraint. Firstly, we introduce a feature consistency loss across views to constrain the neural implicit field. This design alleviates the ambiguity caused by insufficient consistency information of views and ensures completeness and smoothness in the reconstruction results. Secondly, we employ an uncertainty-guided depth constraint to back up the feature consistency loss in areas with occlusion and insignificant features, which recovers geometry details for better reconstruction quality. Experimental results demonstrate that our method outperforms the state-of-the-art methods and can produce high-quality geometry with sparse-view input, especially in scenarios with small overlapping views. Project page: https://hanl2010.github.io/SparseRecon/.

[48] Representation Shift: Unifying Token Compression with FlashAttention

Joonmyung Choi,Sanghyeok Lee,Byungoh Ko,Eunseo Kim,Jihyung Kil,Hyunwoo J. Kim

Main category: cs.CV

TL;DR: The paper proposes a metric named Representation Shift.

Details Motivation:

Contribution:

Method:

Result:

Insight:

Abstract: Transformers have demonstrated remarkable success across vision, language, and video. Yet, increasing task complexity has led to larger models and more tokens, raising the quadratic cost of self-attention and the overhead of GPU memory access. To reduce the computation cost of self-attention, prior work has proposed token compression techniques that drop redundant or less informative tokens. Meanwhile, fused attention kernels such as FlashAttention have been developed to alleviate memory overhead by avoiding attention map construction and its associated I/O to HBM. This, however, makes it incompatible with most training-free token compression methods, which rely on attention maps to determine token importance. Here, we propose Representation Shift, a training-free, model-agnostic metric that measures the degree of change in each token’s representation. This seamlessly integrates token compression with FlashAttention, without attention maps or retraining. Our method further generalizes beyond Transformers to CNNs and state space models. Extensive experiments show that Representation Shift enables effective token compression compatible with FlashAttention, yielding significant speedups of up to 5.5% and 4.4% in video-text retrieval and video QA, respectively. Code is available at https://github.com/mlvlab/Representation-Shift.
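The abstract describes Representation Shift as a measure of how much each token's representation changes, which can be computed without an attention map. The sketch below is an assumption-laden illustration, not the released implementation: it takes the shift to be the L2 distance between a token's input and output at one layer, and it assumes that tokens with larger shifts are the ones to keep. Both choices are hypothetical here.

```python
import numpy as np

def representation_shift(tokens_in, tokens_out):
    """Per-token L2 distance between a layer's input and output representations."""
    return np.linalg.norm(tokens_out - tokens_in, axis=-1)

def keep_top_tokens(tokens, scores, keep):
    """Keep the `keep` highest-scoring tokens, preserving their original order."""
    idx = np.sort(np.argsort(scores)[-keep:])
    return tokens[idx], idx

# Four toy tokens with 2-dimensional features, before and after a layer.
tokens_in = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 2.0]])
tokens_out = np.array([[1.0, 0.1], [3.0, 1.0], [1.0, 1.0], [2.0, 4.0]])

scores = representation_shift(tokens_in, tokens_out)
kept, kept_idx = keep_top_tokens(tokens_out, scores, keep=2)
```

Because the score depends only on a layer's inputs and outputs, it can be computed alongside a fused attention kernel that never materializes the attention matrix.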

[49] Bidirectional Action Sequence Learning for Long-term Action Anticipation with Large Language Models

Yuji Sato,Yasunori Ishii,Takayoshi Yamashita

Main category: cs.CV

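TL;DR: The paper proposes BiAnt, a bidirectional action sequence learning method that uses a large language model to improve long-term action anticipation.

Details Motivation: Conventional methods predict future actions with a unidirectional encoder-decoder structure and struggle to capture semantically distinct sub-actions, which limits performance.

Contribution: Proposes BiAnt, which combines forward and backward prediction and uses a large language model to improve the accuracy of long-term action anticipation and the extraction of semantics.

Method: BiAnt pairs forward prediction with backward prediction using a large language model, extracting richer semantic information from observed actions to improve prediction.

Result: Experiments on the Ego4D dataset show that BiAnt outperforms baseline methods on the edit distance metric.

Insight: Bidirectional prediction captures the semantics of action sequences better, improving the accuracy and robustness of long-term action anticipation.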

Abstract: Video-based long-term action anticipation is crucial for early risk detection in areas such as automated driving and robotics. Conventional approaches extract features from past actions using encoders and predict future events with decoders, which limits performance due to their unidirectional nature. These methods struggle to capture semantically distinct sub-actions within a scene. The proposed method, BiAnt, addresses this limitation by combining forward prediction with backward prediction using a large language model. Experimental results on Ego4D demonstrate that BiAnt improves performance in terms of edit distance compared to baseline methods.
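Since results are reported in terms of edit distance, a small worked example may help. The sketch below computes the plain Levenshtein distance between a predicted and an actual action sequence. The benchmark's exact variant and normalization are not stated in the abstract, so treat this as the textbook definition only; the action labels are made up.

```python
def edit_distance(pred, target):
    """Levenshtein distance between two sequences (insert, delete, substitute cost 1)."""
    prev = list(range(len(target) + 1))
    for i, p in enumerate(pred, start=1):
        curr = [i]
        for j, t in enumerate(target, start=1):
            cost = 0 if p == t else 1
            curr.append(min(prev[j] + 1,          # delete p
                            curr[j - 1] + 1,      # insert t
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]

predicted = ["take knife", "cut onion", "wash hands"]
actual = ["take knife", "peel onion", "cut onion", "wash hands"]
distance = edit_distance(predicted, actual)
normalized = distance / len(actual)
```

Here the prediction misses a single action, so one insertion turns it into the actual sequence; lower values mean better anticipation.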

[50] $MV_{Hybrid}$: Improving Spatial Transcriptomics Prediction with Hybrid State Space-Vision Transformer Backbone in Pathology Vision Foundation Models

Won June Cho,Hongjun Yoon,Daeky Jeong,Hyeongyeol Lim,Yosep Chong

Main category: cs.CV

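TL;DR: The paper proposes $MV_{Hybrid}$, a hybrid backbone that combines state space models (SSMs) with a Vision Transformer (ViT) to improve spatial transcriptomics prediction in pathology vision foundation models (VFMs). In leave-one-study-out (LOSO) evaluation it achieves higher correlation than ViT and degrades less relative to the random-split setting.

Details Motivation: Current ViT-based pathology VFMs perform below clinical standards when predicting spatial gene expression. The authors hypothesize that architectural innovations incorporating SSMs can better capture low-frequency, subtle morphological patterns.

Contribution: The main contribution is $MV_{Hybrid}$, a hybrid architecture that combines the strengths of SSMs and ViT. Experiments show that it matches or outperforms ViT across several tasks and is notably more robust in cross-study (LOSO) validation.

Method: 1) Initialize the state space models with negative real eigenvalues to strengthen their low-frequency bias; 2) pretrain $MV_{Hybrid}$ and five other backbone architectures on identical colorectal cancer datasets using DINOv2; 3) evaluate all models under random-split and LOSO settings.

Result: In LOSO evaluation, $MV_{Hybrid}$ achieves 57% higher correlation than the best-performing ViT and shows 43% smaller performance degradation compared to random split. It also performs equal to or better than ViT in classification, patch retrieval, and survival prediction.

Insight: State space models capture low-frequency morphological features effectively, and combining them with ViT markedly improves prediction. The robustness of the hybrid architecture makes it better suited to clinical use.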

Abstract: Spatial transcriptomics reveals gene expression patterns within tissue context, enabling precision oncology applications such as treatment response prediction, but its high cost and technical complexity limit clinical adoption. Predicting spatial gene expression (biomarkers) from routine histopathology images offers a practical alternative, yet current vision foundation models (VFMs) in pathology based on Vision Transformer (ViT) backbones perform below clinical standards. Given that VFMs are already trained on millions of diverse whole slide images, we hypothesize that architectural innovations beyond ViTs may better capture the low-frequency, subtle morphological patterns correlating with molecular phenotypes. By demonstrating that state space models initialized with negative real eigenvalues exhibit strong low-frequency bias, we introduce $MV_{Hybrid}$, a hybrid backbone architecture combining state space models (SSMs) with ViT. We compare five other different backbone architectures for pathology VFMs, all pretrained on identical colorectal cancer datasets using the DINOv2 self-supervised learning method. We evaluate all pretrained models using both random split and leave-one-study-out (LOSO) settings of the same biomarker dataset. In LOSO evaluation, $MV_{Hybrid}$ achieves 57% higher correlation than the best-performing ViT and shows 43% smaller performance degradation compared to random split in gene expression prediction, demonstrating superior performance and robustness, respectively. Furthermore, $MV_{Hybrid}$ shows equal or better downstream performance in classification, patch retrieval, and survival prediction tasks compared to that of ViT, showing its promise as a next-generation pathology VFM backbone. Our code is publicly available at: https://github.com/deepnoid-ai/MVHybrid.
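
The low-frequency claim can be illustrated with a textbook one-dimensional case, which is not the paper's own analysis. For the scalar continuous-time system $h'(t) = \lambda h(t) + u(t)$ with output $y = h$, the frequency response is $H(j\omega) = 1/(j\omega - \lambda)$. When $\lambda$ is real and negative, the gain $1/\sqrt{\omega^2 + \lambda^2}$ falls monotonically with frequency, so the state acts as a low-pass filter. The values of `lam` and `freqs` below are arbitrary.

```python
def gain(omega, lam):
    """|H(j*omega)| for the scalar system h' = lam*h + u, y = h."""
    return abs(1.0 / (1j * omega - lam))

# Negative real eigenvalue: the gain is largest at omega = 0 and decays.
lam = -1.0
freqs = [0.0, 0.5, 1.0, 2.0, 8.0]
gains = [gain(w, lam) for w in freqs]

# For contrast, a complex eigenvalue moves the peak away from zero frequency.
lam_complex = complex(-0.1, 3.0)
peak_gain = gain(3.0, lam_complex)
dc_gain = gain(0.0, lam_complex)
```

The contrast case shows why the initialization matters: an eigenvalue with an imaginary part resonates near that frequency instead of favoring the slowest-varying components.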

[51] Cued-Agent: A Collaborative Multi-Agent System for Automatic Cued Speech Recognition

Guanjie Huang,Danny H. K. Tsang,Shan Yang,Guangzhi Lei,Li Liu

Main category: cs.CV

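TL;DR: The paper proposes Cued-Agent, the first collaborative multi-agent system for Automatic Cued Speech Recognition (ACSR). Four specialized sub-agents handle the multimodal fusion of hand and lip movements, and the system performs well with limited data.

Details Motivation: The temporal asynchrony between hand and lip movements in Cued Speech calls for complex multimodal fusion mechanisms, but limited data leaves existing methods insufficiently trained and their performance suboptimal. Multi-agent systems have shown promise on complex tasks with limited data, so they are introduced here to improve ACSR.

Contribution: 1. Proposes Cued-Agent, the first collaborative multi-agent system for ACSR; 2. achieves multimodal fusion through four sub-agents (hand recognition, lip recognition, hand prompt decoding, and self-correcting phoneme-to-word conversion); 3. expands the existing Mandarin Cued Speech dataset.

Method: 1. A hand recognition agent based on a multimodal large language model; 2. a lip recognition agent based on a pretrained Transformer; 3. a decoding agent that dynamically integrates hand prompts with lip features during inference, without training; 4. a self-correction agent that converts phoneme sequences into natural-language sentences end to end through semantic refinement.

Result: Cued-Agent outperforms existing methods in both normal and hearing-impaired scenarios, and the experiments support its effectiveness and robustness.

Insight: A multi-agent system can handle multimodal asynchrony effectively, and its modular division of labor improves performance when data is limited.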

Abstract: Cued Speech (CS) is a visual communication system that combines lip-reading with hand coding to facilitate communication for individuals with hearing impairments. Automatic CS Recognition (ACSR) aims to convert CS hand gestures and lip movements into text via AI-driven methods. Traditionally, the temporal asynchrony between hand and lip movements requires the design of complex modules to facilitate effective multimodal fusion. However, constrained by limited data availability, current methods demonstrate insufficient capacity for adequately training these fusion mechanisms, resulting in suboptimal performance. Recently, multi-agent systems have shown promising capabilities in handling complex tasks with limited data availability. To this end, we propose the first collaborative multi-agent system for ACSR, named Cued-Agent. It integrates four specialized sub-agents: a Multimodal Large Language Model-based Hand Recognition agent that employs keyframe screening and CS expert prompt strategies to decode hand movements, a pretrained Transformer-based Lip Recognition agent that extracts lip features from the input video, a Hand Prompt Decoding agent that dynamically integrates hand prompts with lip features during inference in a training-free manner, and a Self-Correction Phoneme-to-Word agent that enables post-process and end-to-end conversion from phoneme sequences to natural language sentences for the first time through semantic refinement. To support this study, we expand the existing Mandarin CS dataset by collecting data from eight hearing-impaired cuers, establishing a mixed dataset of fourteen subjects. Extensive experiments demonstrate that our Cued-Agent performs superbly in both normal and hearing-impaired scenarios compared with state-of-the-art methods. The implementation is available at https://github.com/DennisHgj/Cued-Agent.

[52] Decouple before Align: Visual Disentanglement Enhances Prompt Tuning

Fei Zhang,Tianfei Zhou,Jiangchao Yao,Ya Zhang,Ivor W. Tsang,Yanfeng Wang

Main category: cs.CV

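TL;DR: The paper proposes DAPT, a framework that decouples the visual modality into foreground and background representations and aligns each with text. This addresses the information asymmetry problem in prompt tuning and improves model performance.

Details Motivation: Conventional prompt tuning (PT) for vision-language models suffers from information asymmetry: the visual modality usually conveys more context than the textual modality, which biases the model's attention toward context regions.

Contribution: 1. Proposes the DAPT framework, based on the decouple-before-align concept; 2. introduces a method for decoupling foreground and background visual representations; 3. designs a visual pull-push regularization that encourages unbiased attention on the region of interest.

Method: 1. Decouple the visual modality into foreground and background using coarse-and-fine visual segmentation cues; 2. align the decoupled representations with text separately, pairing the foreground with the original foreground texts and the background with hand-crafted background classes; 3. apply a visual pull-push regularization tailored to the foreground-background patterns.

Result: DAPT shows superior performance across multiple benchmarks in few-shot learning, base-to-novel generalization, and data-efficient learning.

Insight: Decoupling the visual modality and aligning each part separately can substantially reduce information asymmetry and make the attention of vision-language models more accurate.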

Abstract: Prompt tuning (PT), as an emerging resource-efficient fine-tuning paradigm, has showcased remarkable effectiveness in improving the task-specific transferability of vision-language models. This paper delves into a previously overlooked information asymmetry issue in PT, where the visual modality mostly conveys more context than the object-oriented textual modality. Correspondingly, coarsely aligning these two modalities can result in biased attention, driving the model to focus merely on the context area. To address this, we propose DAPT, an effective PT framework based on an intuitive decouple-before-align concept. First, we propose to explicitly decouple the visual modality into foreground and background representations by exploiting coarse-and-fine visual segmenting cues, and then both of these decoupled patterns are aligned with the original foreground texts and the hand-crafted background classes, thereby symmetrically strengthening the modal alignment. To further enhance the visual concentration, we propose a visual pull-push regularization tailored for the foreground-background patterns, directing the original visual representation towards unbiased attention on the region-of-interest object. We demonstrate the power of architecture-free DAPT through few-shot learning, base-to-novel generalization, and data-efficient learning, all of which yield superior performance across prevailing benchmarks. Our code will be released at https://github.com/Ferenas/DAPT.

[53] Video Forgery Detection with Optical Flow Residuals and Spatial-Temporal Consistency

Xi Xue,Kunio Suzuki,Nabarun Goswami,Takuya Shintate

Main category: cs.CV

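TL;DR: The paper proposes a video forgery detection framework based on optical flow residuals and spatial-temporal consistency. A dual-branch structure analyzes RGB frames and flow residuals separately to detect forgery traces in AI-generated videos.

Details Motivation: As diffusion-based video generation advances rapidly, synthetic content is becoming increasingly realistic, and existing video forgery detection methods struggle to capture fine-grained temporal inconsistencies.

Contribution: Proposes a dual-branch detection framework that combines RGB features with optical flow residuals to capture both appearance and motion anomalies in forged videos.

Method: A dual-branch architecture in which one branch analyzes RGB frames for appearance-level artifacts while the other processes optical flow residuals to reveal motion inconsistencies caused by imperfect temporal synthesis.

Result: Experiments on text-to-video and image-to-video tasks across ten different generative models demonstrate the robustness and generalization ability of the method.

Insight: Combining appearance and motion features allows more comprehensive detection of forged videos, particularly for highly realistic AI-generated content.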

Abstract: The rapid advancement of diffusion-based video generation models has led to increasingly realistic synthetic content, presenting new challenges for video forgery detection. Existing methods often struggle to capture fine-grained temporal inconsistencies, particularly in AI-generated videos with high visual fidelity and coherent motion. In this work, we propose a detection framework that leverages spatial-temporal consistency by combining RGB appearance features with optical flow residuals. The model adopts a dual-branch architecture, where one branch analyzes RGB frames to detect appearance-level artifacts, while the other processes flow residuals to reveal subtle motion anomalies caused by imperfect temporal synthesis. By integrating these complementary features, the proposed method effectively detects a wide range of forged videos. Extensive experiments on text-to-video and image-to-video tasks across ten diverse generative models demonstrate the robustness and strong generalization ability of the proposed approach.

[54] iSafetyBench: A video-language benchmark for safety in industrial environment

Raiyaan Abdullah,Yogesh Singh Rawat,Shruti Vyas

Main category: cs.CV

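TL;DR: iSafetyBench is a video-language benchmark for safety in industrial environments, designed to evaluate models in both routine and hazardous scenarios. It contains 1,100 video clips of industrial scenes annotated with multi-label action tags. Eight state-of-the-art models tested under zero-shot conditions show significant performance gaps in recognizing hazardous activities and in multi-label scenarios.

Details Motivation: The capabilities of vision-language models (VLMs) in high-stakes industrial domains remain largely unexplored. To fill this gap, the paper introduces iSafetyBench to evaluate model performance on safety in industrial environments.

Contribution: iSafetyBench is the first video-language benchmark targeting safety in industrial environments. It provides 1,100 video clips from real-world industrial settings, annotated with multi-label action tags, together with a multiple-choice evaluation framework.

Method: Uses open-vocabulary, multi-label action annotations, designs multiple-choice questions for both single-label and multi-label evaluation, and tests eight state-of-the-art video-language models under zero-shot conditions.

Result: Although existing models perform strongly on other video benchmarks, they struggle on iSafetyBench, especially in recognizing hazardous activities and in multi-label scenarios, revealing significant performance gaps.

Insight: Industrial environments need more robust, safety-aware multimodal models, and iSafetyBench provides a testbed and data to drive progress in this direction.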

Abstract: Recent advances in vision-language models (VLMs) have enabled impressive generalization across diverse video understanding tasks under zero-shot settings. However, their capabilities in high-stakes industrial domains-where recognizing both routine operations and safety-critical anomalies is essential-remain largely underexplored. To address this gap, we introduce iSafetyBench, a new video-language benchmark specifically designed to evaluate model performance in industrial environments across both normal and hazardous scenarios. iSafetyBench comprises 1,100 video clips sourced from real-world industrial settings, annotated with open-vocabulary, multi-label action tags spanning 98 routine and 67 hazardous action categories. Each clip is paired with multiple-choice questions for both single-label and multi-label evaluation, enabling fine-grained assessment of VLMs in both standard and safety-critical contexts. We evaluate eight state-of-the-art video-language models under zero-shot conditions. Despite their strong performance on existing video benchmarks, these models struggle with iSafetyBench-particularly in recognizing hazardous activities and in multi-label scenarios. Our results reveal significant performance gaps, underscoring the need for more robust, safety-aware multimodal models for industrial applications. iSafetyBench provides a first-of-its-kind testbed to drive progress in this direction. The dataset is available at: https://github.com/raiyaan-abdullah/iSafety-Bench.

[55] Sari Sandbox: A Virtual Retail Store Environment for Embodied AI Agents

Janika Deborah Gajo,Gerarld Paul Merales,Jerome Escarcha,Brenden Ashley Molina,Gian Nartea,Emmanuel G. Maminta,Juan Carlos Roldan,Rowel O. Atienza

Main category: cs.CV

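TL;DR: The paper presents Sari Sandbox, a high-fidelity, photorealistic 3D retail store simulation for evaluating embodied AI agents on shopping tasks, along with the benchmark dataset SariBench.

Details Motivation: Simulation environments for training embodied agents in retail settings are lacking. Sari Sandbox fills this gap with a photorealistic 3D environment and varied task settings that support evaluation of an agent's ability to navigate, inspect, and manipulate items.

Contribution: 1. Develops Sari Sandbox, a high-fidelity retail store simulation environment; 2. provides SariBench, a benchmark dataset of annotated human demonstrations across varied tasks; 3. supports interaction through VR for humans and through a VLM-powered embodied agent.

Method: Builds the retail store environment with 3D modeling and API control, featuring over 250 interactive items. VR and a VLM support human and agent interaction respectively, and human demonstrations serve as the basis for benchmarking.

Result: The paper reports performance baselines for embodied agents in Sari Sandbox, compares them against human performance, and offers recommendations for improving the environment's realism and scalability.

Insight: Photorealistic retail simulation and annotated data covering varied tasks are key to improving how well embodied agents can be applied in real retail settings.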

Abstract: We present Sari Sandbox, a high-fidelity, photorealistic 3D retail store simulation for benchmarking embodied agents against human performance in shopping tasks. Addressing a gap in retail-specific sim environments for embodied agent training, Sari Sandbox features over 250 interactive grocery items across three store configurations, controlled via an API. It supports both virtual reality (VR) for human interaction and a vision language model (VLM)-powered embodied agent. We also introduce SariBench, a dataset of annotated human demonstrations across varied task difficulties. Our sandbox enables embodied agents to navigate, inspect, and manipulate retail items, providing baselines against human performance. We conclude with benchmarks, performance analysis, and recommendations for enhancing realism and scalability. The source code can be accessed via https://github.com/upeee/sari-sandbox-env.

[56] PMR: Physical Model-Driven Multi-Stage Restoration of Turbulent Dynamic Videos

Tao Wu,Jingyuan Ye,Ying Fu

Main category: cs.CV

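TL;DR: The paper proposes PMR, a physical model-driven multi-stage framework for restoring turbulent dynamic videos. It quantifies video dynamic intensity with a Dynamic Efficiency Index (DEI) and restores videos through three stages: de-tilting, dynamic region enhancement, and de-blurring.

Details Motivation: Existing methods struggle to restore edge details and eliminate mixed distortions under strong turbulence and complex dynamics. The paper aims to address this problem.

Contribution: 1. Introduces the Dynamic Efficiency Index (DEI) to quantify video dynamic intensity; 2. proposes the physical model-driven three-stage restoration framework (PMR).

Method: The PMR framework consists of three stages: de-tilting (geometric stabilization), motion segmentation enhancement (dynamic region refinement), and de-blurring (quality restoration). It uses lightweight backbones and stage-wise joint training.

Result: Experiments show that PMR effectively suppresses motion trailing artifacts, restores edge details, and generalizes well to scenes with high turbulence and complex dynamics.

Insight: Physical model-driven, stage-wise processing combined with a lightweight design can efficiently address mixed distortions in turbulent video restoration.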

Abstract: Geometric distortions and blurring caused by atmospheric turbulence degrade the quality of long-range dynamic scene videos. Existing methods struggle with restoring edge details and eliminating mixed distortions, especially under conditions of strong turbulence and complex dynamics. To address these challenges, we introduce a Dynamic Efficiency Index ($DEI$), which combines turbulence intensity, optical flow, and proportions of dynamic regions to accurately quantify video dynamic intensity under varying turbulence conditions and provide a high-dynamic turbulence training dataset. Additionally, we propose a Physical Model-Driven Multi-Stage Video Restoration ($PMR$) framework that consists of three stages: \textbf{de-tilting} for geometric stabilization, \textbf{motion segmentation enhancement} for dynamic region refinement, and \textbf{de-blurring} for quality restoration. $PMR$ employs lightweight backbones and stage-wise joint training to ensure both efficiency and high restoration quality. Experimental results demonstrate that the proposed method effectively suppresses motion trailing artifacts, restores edge details and exhibits strong generalization capability, especially in real-world scenarios characterized by high-turbulence and complex dynamics. We will make the code and datasets openly available.

[57] Sortblock: Similarity-Aware Feature Reuse for Diffusion Model

Hanqi Chen,Xu Zhang,Xiaoliu Guan,Lielin Jiang,Guanzhong Wang,Zeyu Chen,Yi Liu

Main category: cs.CV

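TL;DR: Sortblock is a training-free inference acceleration framework that dynamically caches similar features and adaptively skips redundant computation, substantially speeding up inference for Diffusion Transformers (DiTs) while preserving generation quality.

Details Motivation: Diffusion Transformers (DiTs) have strong generative capabilities, but their sequential denoising process causes high inference latency, which limits real-time use. Existing methods typically reuse intermediate features at fixed timesteps or layers, overlooking how the semantic focus evolves across denoising stages and Transformer blocks.

Contribution: Proposes Sortblock, a training-free inference acceleration framework that dynamically caches block-wise features, adaptively skips redundant computation by ranking residual changes, and uses a lightweight linear prediction to reduce accumulated error.

Method: Dynamically caches block-wise features based on their similarity across adjacent timesteps, ranks the evolution of residuals to determine a recomputation ratio adaptively, and selectively skips redundant computation. A linear prediction mechanism corrects the error accumulated in skipped blocks.

Result: Experiments across various tasks and DiT architectures show that Sortblock achieves over 2$\times$ inference speedup with minimal loss in generation quality, offering an efficient and general solution for accelerating diffusion models.

Insight: The evolving semantics of the denoising process can be exploited dynamically through feature similarity, and skipping redundant computation improves efficiency substantially with almost no effect on generation quality.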

Abstract: Diffusion Transformers (DiTs) have demonstrated remarkable generative capabilities, particularly benefiting from Transformer architectures that enhance visual and artistic fidelity. However, their inherently sequential denoising process results in high inference latency, limiting their deployment in real-time scenarios. Existing training-free acceleration approaches typically reuse intermediate features at fixed timesteps or layers, overlooking the evolving semantic focus across denoising stages and Transformer blocks. To address this, we propose Sortblock, a training-free inference acceleration framework that dynamically caches block-wise features based on their similarity across adjacent timesteps. By ranking the evolution of residuals, Sortblock adaptively determines a recomputation ratio, selectively skipping redundant computations while preserving generation quality. Furthermore, we incorporate a lightweight linear prediction mechanism to reduce accumulated errors in skipped blocks. Extensive experiments across various tasks and DiT architectures demonstrate that Sortblock achieves over 2$\times$ inference speedup with minimal degradation in output quality, offering an effective and generalizable solution for accelerating diffusion-based generative models.
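
The rank-and-skip idea can be sketched as follows, under several assumptions that are not taken from the paper: each block has a scalar change score for its residual between adjacent timesteps, the recomputation ratio is passed in as a fixed number (the paper determines it adaptively), and a skipped block's feature is linearly extrapolated from its two most recent cached values. All names are hypothetical.

```python
import math

def select_blocks_to_recompute(changes, ratio):
    """Pick the blocks whose residuals changed most; the rest are skipped."""
    n = max(1, math.ceil(len(changes) * ratio))
    ranked = sorted(range(len(changes)), key=lambda i: changes[i], reverse=True)
    return sorted(ranked[:n])

def linear_predict(prev2, prev1):
    """Extrapolate a skipped block's feature from its two previous cached values."""
    return [2.0 * b - a for a, b in zip(prev2, prev1)]

# Made-up change scores for four Transformer blocks.
changes = [0.02, 0.40, 0.05, 0.31]
recompute = select_blocks_to_recompute(changes, ratio=0.5)
skipped = [i for i in range(len(changes)) if i not in recompute]

# Cached features of one skipped block at timesteps t-2 and t-1.
predicted = linear_predict([1.0, 2.0], [1.5, 2.5])
```

Extrapolating rather than copying the last cached value is one simple way to limit drift when a block is skipped for several consecutive steps.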

[58] DC-AE 1.5: Accelerating Diffusion Model Convergence with Structured Latent Space

Junyu Chen,Dongyun Zou,Wenkun He,Junsong Chen,Enze Xie,Song Han,Han Cai

Main category: cs.CV

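TL;DR: DC-AE 1.5 is a new family of deep compression autoencoders that addresses the slow convergence of diffusion models through a structured latent space and augmented diffusion training, outperforming its predecessor in both generation quality and speed.

Details Motivation: In high-resolution diffusion models, increasing the autoencoder's number of latent channels improves reconstruction quality but slows diffusion model convergence, which lowers the achievable generation quality.

Contribution: 1. Introduces Structured Latent Space, a training-based approach that imposes a channel-wise structure on the latent space; 2. proposes Augmented Diffusion Training, which adds training objectives to accelerate convergence.

Method: 1. Structures the latent space so that the front latent channels capture object structures and the latter channels capture image details; 2. adds diffusion training objectives on the object latent channels.

Result: On ImageNet 512x512, DC-AE-1.5-f64c128 delivers better image generation quality than DC-AE-f32c32 while being 4x faster.

Insight: Structuring the latent channels and optimizing the training strategy can substantially accelerate diffusion model convergence without sacrificing reconstruction quality.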

Abstract: We present DC-AE 1.5, a new family of deep compression autoencoders for high-resolution diffusion models. Increasing the autoencoder’s latent channel number is a highly effective approach for improving its reconstruction quality. However, it results in slow convergence for diffusion models, leading to poorer generation quality despite better reconstruction quality. This issue limits the quality upper bound of latent diffusion models and hinders the employment of autoencoders with higher spatial compression ratios. We introduce two key innovations to address this challenge: i) Structured Latent Space, a training-based approach to impose a desired channel-wise structure on the latent space with front latent channels capturing object structures and latter latent channels capturing image details; ii) Augmented Diffusion Training, an augmented diffusion training strategy with additional diffusion training objectives on object latent channels to accelerate convergence. With these techniques, DC-AE 1.5 delivers faster convergence and better diffusion scaling results than DC-AE. On ImageNet 512x512, DC-AE-1.5-f64c128 delivers better image generation quality than DC-AE-f32c32 while being 4x faster. Code: https://github.com/dc-ai-projects/DC-Gen.

[59] IN2OUT: Fine-Tuning Video Inpainting Model for Video Outpainting Using Hierarchical Discriminator

Sangwoo Youn,Minji Lee,Nokap Tony Park,Yeonggyoo Jeon,Taeyoung Na

Main category: cs.CV

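TL;DR: The paper proposes fine-tuning a video inpainting model for video outpainting with a hierarchical discriminator, addressing the limitations of existing methods that only generate the background.

Details Motivation: Video outpainting must extend the borders while staying consistent with the existing content, and directly applying or fine-tuning video inpainting models works poorly. The study finds that discriminator design is the key to better results.

Contribution: 1. Proposes a hierarchical discriminator that separates global and local objectives; 2. designs a specialized outpainting loss function that combines the discriminator's local and global features; 3. outperforms existing methods both quantitatively and qualitatively.

Method: Optimizes global and local objectives separately through the hierarchical discriminator and fine-tunes the generator with the new outpainting loss function.

Result: On video outpainting, the proposed method surpasses existing methods in both visual quality and global coherence.

Insight: A hierarchical discriminator design improves the quality of generated content, and a loss function that combines global and local features helps produce more coherent video extensions.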

Abstract: Video outpainting presents a unique challenge of extending the borders while maintaining consistency with the given content. In this paper, we suggest the use of video inpainting models that excel in object flow learning and reconstruction in outpainting rather than solely generating the background as in existing methods. However, directly applying or fine-tuning inpainting models to outpainting has been shown to be ineffective, often leading to blurry results. Our extensive experiments on discriminator designs reveal that a critical component missing in the outpainting fine-tuning process is a discriminator capable of effectively assessing the perceptual quality of the extended areas. To tackle this limitation, we differentiate the objectives of adversarial training into global and local goals and introduce a hierarchical discriminator that meets both objectives. Additionally, we develop a specialized outpainting loss function that leverages both local and global features of the discriminator. Fine-tuning on this adversarial loss function enhances the generator’s ability to produce both visually appealing and globally coherent outpainted scenes. Our proposed method outperforms state-of-the-art methods both quantitatively and qualitatively. Supplementary materials including the demo video and the code are available in SigPort.

[60] UIS-Mamba: Exploring Mamba for Underwater Instance Segmentation via Dynamic Tree Scan and Hidden State Weaken

Runmin Cong,Zongji Yu,Hao Fang,Haoyan Sun,Sam Kwong

Main category: cs.CV

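TL;DR: UIS-Mamba is the first underwater instance segmentation model based on Mamba, a state space model. Its Dynamic Tree Scan (DTS) and Hidden State Weaken (HSW) modules address blurred instance boundaries and interference from complex backgrounds in underwater scenes, improving performance.

Details Motivation: Underwater instance segmentation is challenging because of the particular properties of underwater environments, such as color distortion and blurred boundaries. The conventional fixed-patch scanning mechanism struggles to maintain continuity within an instance, and the hidden state of complex backgrounds can interfere with the understanding of instance objects.

Contribution: 1. Proposes UIS-Mamba, the first Mamba-based underwater instance segmentation model;
2. Designs the Dynamic Tree Scan (DTS) module, which maintains the continuity of instance features by dynamically adjusting patch offset and scale;
3. Proposes the Hidden State Weaken (HSW) module, which suppresses background interference through an Ncut-based mechanism and focuses the information flow on the instances.

Method: 1. DTS module: dynamically offsets and scales patches to guide the construction of a minimum spanning tree and provide dynamic local receptive fields;
2. HSW module: uses an Ncut-based mechanism to weaken the hidden state of complex backgrounds and strengthen instance information.

Result: Achieves state-of-the-art performance on the UIIS and USIS10K datasets while keeping the parameter count and computational complexity low.

Insight: With linear complexity and a global receptive field, Mamba shows promise for underwater tasks, and the dynamic adjustment and state weakening designs cope effectively with the challenges of underwater environments.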

Abstract: Underwater Instance Segmentation (UIS) tasks are crucial for underwater complex scene detection. Mamba, as an emerging state space model with inherently linear complexity and global receptive fields, is highly suitable for processing image segmentation tasks with long sequence features. However, due to the particularity of underwater scenes, there are many challenges in applying Mamba to UIS. The existing fixed-patch scanning mechanism cannot maintain the internal continuity of scanned instances in the presence of severe underwater color distortion and blurred instance boundaries, and the hidden state of the complex underwater background can also inhibit the understanding of instance objects. In this work, we propose the first Mamba-based underwater instance segmentation model, UIS-Mamba, and design two innovative modules, Dynamic Tree Scan (DTS) and Hidden State Weaken (HSW), to migrate Mamba to the underwater task. The DTS module maintains the continuity of the internal features of the instance objects by allowing the patches to dynamically offset and scale, thereby guiding the minimum spanning tree and providing dynamic local receptive fields. The HSW module suppresses the interference of complex backgrounds and effectively focuses the information flow of state propagation on the instances themselves through the Ncut-based hidden state weakening mechanism. Experimental results show that UIS-Mamba achieves state-of-the-art performance on both UIIS and USIS10K datasets, while maintaining a low number of parameters and computational complexity. Code is available at https://github.com/Maricalce/UIS-Mamba.

[61] Contact-Aware Amodal Completion for Human-Object Interaction via Multi-Regional Inpainting

Seunggeun Chi,Enna Sachdeva,Pin-Hao Huang,Kwonjoon Lee

Main category: cs.CV

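TL;DR: The paper proposes a new multi-regional inpainting method that incorporates physical prior knowledge for amodal completion of occluded human-object interactions (HOI), improving the accuracy and realism of the generated completions.

Details Motivation: Existing methods struggle to generate plausible amodal completions in dynamic scenes, especially in human-object interaction, because they have a limited understanding of the interaction. The authors address this by combining physical prior knowledge with multi-regional inpainting.

Contribution: 1. Proposes an amodal completion method based on multi-regional inpainting, designed specifically for HOI scenarios; 2. defines primary and secondary regions using physical constraints from human topology and contact information; 3. applies customized denoising strategies within a diffusion model, improving the accuracy and realism of the completions.

Method: 1. Define primary and secondary regions from physical constraints; 2. apply customized denoising strategies in multi-regional inpainting; 3. treat the regions differently within the diffusion model.

Result: Experiments show that the method significantly outperforms existing methods in HOI scenarios and remains robust without ground-truth contact annotations.

Insight: Physical prior knowledge is essential for amodal completion, especially in dynamic interaction scenes. A multi-regional strategy handles occlusion at a finer granularity and improves generation quality.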

Abstract: Amodal completion, which is the process of inferring the full appearance of objects despite partial occlusions, is crucial for understanding complex human-object interactions (HOI) in computer vision and robotics. Existing methods, such as those that use pre-trained diffusion models, often struggle to generate plausible completions in dynamic scenarios because they have a limited understanding of HOI. To solve this problem, we’ve developed a new approach that uses physical prior knowledge along with a specialized multi-regional inpainting technique designed for HOI. By incorporating physical constraints from human topology and contact information, we define two distinct regions: the primary region, where occluded object parts are most likely to be, and the secondary region, where occlusions are less probable. Our multi-regional inpainting method uses customized denoising strategies across these regions within a diffusion model. This improves the accuracy and realism of the generated completions in both their shape and visual detail. Our experimental results show that our approach significantly outperforms existing methods in HOI scenarios, moving machine perception closer to a more human-like understanding of dynamic environments. We also show that our pipeline is robust even without ground-truth contact annotations, which broadens its applicability to tasks like 3D reconstruction and novel view/pose synthesis.

[62] Reducing the gap between general purpose data and aerial images in concentrated solar power plants

M. A. Pérez-Cutiño,J. Valverde,J. Capitán,J. M. Díaz-Báñez

Main category: cs.CV

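TL;DR: The paper proposes AerialCSP, a virtual dataset that simulates aerial imagery of concentrated solar power (CSP) plants, to address the poor fit between general-purpose datasets and the complex scenes found in these plants. Experiments show that the dataset reduces the need for real annotations and improves fault detection.

Details Motivation: Aerial images of CSP plants contain highly reflective surfaces and domain-specific elements, and general-purpose datasets lack this kind of scene, so models trained on them are hard to apply directly. To reduce costly and time-consuming labeling, the authors propose a synthetic dataset.

Contribution: 1. Introduces the AerialCSP dataset, which provides realistic synthetic aerial data of CSP plants with annotations; 2. benchmarks multiple models on AerialCSP to establish a baseline; 3. shows that the dataset significantly reduces the need for real annotations and improves fault detection.

Method: Generates the realistic synthetic dataset AerialCSP to simulate aerial imagery of real CSP plants and uses it to pretrain models, reducing reliance on real annotated data.

Result: Models pretrained on AerialCSP perform significantly better on real-world fault detection, particularly for rare and small defects.

Insight: Synthetic data can substitute efficiently for real data where annotation is costly, offering a new approach to domain-specific computer vision tasks.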

Abstract: In the context of Concentrated Solar Power (CSP) plants, aerial images captured by drones present a unique set of challenges. Unlike urban or natural landscapes commonly found in existing datasets, solar fields contain highly reflective surfaces, and domain-specific elements that are uncommon in traditional computer vision benchmarks. As a result, machine learning models trained on generic datasets struggle to generalize to this setting without extensive retraining and large volumes of annotated data. However, collecting and labeling such data is costly and time-consuming, making it impractical for rapid deployment in industrial applications. To address this issue, we propose a novel approach: the creation of AerialCSP, a virtual dataset that simulates aerial imagery of CSP plants. By generating synthetic data that closely mimic real-world conditions, our objective is to facilitate pretraining of models before deployment, significantly reducing the need for extensive manual labeling. Our main contributions are threefold: (1) we introduce AerialCSP, a high-quality synthetic dataset for aerial inspection of CSP plants, providing annotated data for object detection and image segmentation; (2) we benchmark multiple models on AerialCSP, establishing a baseline for CSP-related vision tasks; and (3) we demonstrate that pretraining on AerialCSP significantly improves real-world fault detection, particularly for rare and small defects, reducing the need for extensive manual labeling. AerialCSP is made publicly available at https://mpcutino.github.io/aerialcsp/.

[63] AutoDebias: Automated Framework for Debiasing Text-to-Image Models

Hongyi Cai,Mohammad Mahdinur Rahman,Mingkang Dong,Jie Li,Muxin Pu,Zhili Fang,Yinan Peng,Hanjun Luo,Yang Liu

Main category: cs.CV

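TL;DR: AutoDebias is an automated framework that identifies and mitigates social biases in text-to-image (T2I) models without prior knowledge of the specific bias types. It uses vision-language models to detect biases and generate inclusive prompts, then applies a CLIP-guided training process that sharply reduces biased outputs while preserving image quality and diversity.

Details Motivation: Existing debiasing methods work for known or simple biases but struggle with subtle or overlapping ones. T2I models often exhibit gender or racial bias even when those attributes are not mentioned, so an automated solution that does not require knowing the bias types in advance is needed.

Contribution: Proposes the AutoDebias framework, which uses vision-language models to detect and mitigate complex biases in T2I models automatically. It is evaluated on more than 25 bias scenarios and substantially reduces biased outputs.

Method: 1. Detect biased visual patterns with vision-language models; 2. generate inclusive alternative prompts as fairness guides; 3. use a CLIP-guided training process to optimize the model's outputs.

Result: Bias detection accuracy reaches 91.6%, and biased outputs fall from 90% to negligible levels while the model's image quality and diversity are preserved.

Insight: AutoDebias demonstrates the potential of automated debiasing that requires no prior knowledge of bias types, particularly for subtle or overlapping biases.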

Abstract: Text-to-Image (T2I) models generate high-quality images from text prompts but often exhibit unintended social biases, such as gender or racial stereotypes, even when these attributes are not mentioned. Existing debiasing methods work well for simple or well-known cases but struggle with subtle or overlapping biases. We propose AutoDebias, a framework that automatically identifies and mitigates harmful biases in T2I models without prior knowledge of specific bias types. Specifically, AutoDebias leverages vision-language models to detect biased visual patterns and constructs fairness guides by generating inclusive alternative prompts that reflect balanced representations. These guides drive a CLIP-guided training process that promotes fairer outputs while preserving the original model’s image quality and diversity. Unlike existing methods, AutoDebias effectively addresses both subtle stereotypes and multiple interacting biases. We evaluate the framework on a benchmark covering over 25 bias scenarios, including challenging cases where multiple biases occur simultaneously. AutoDebias detects harmful patterns with 91.6% accuracy and reduces biased outputs from 90% to negligible levels, while preserving the visual fidelity of the original model.

[64] CLIPTime: Time-Aware Multimodal Representation Learning from Images and Text

Anju Rani,Daniel Ortiz-Arroyo,Petar Durdevic

Main category: cs.CV

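TL;DR: CLIPTime is a multimodal, multitask framework that predicts the developmental stage and timestamp of fungal growth from image and text inputs, extending CLIP to capture temporal dynamics.

Details Motivation: Vision-language models such as CLIP perform well at joint visual-textual reasoning, but their ability to capture temporal progression is limited, which hinders the understanding of biological growth dynamics.

Contribution: Proposes the CLIPTime framework, which predicts both developmental stage and timestamp; introduces a synthetic fungal growth dataset; proposes custom evaluation metrics, including temporal accuracy and regression error.

Method: Built on the CLIP architecture, the model learns joint visual-textual embeddings through multitask learning and supports both classification and regression (discrete stages and continuous timestamps).

Result: Experiments show that CLIPTime models biological progression effectively and produces interpretable, temporally grounded predictions.

Insight: CLIPTime shows the potential of vision-language models for real-world biological monitoring, especially where an understanding of temporal dynamics is required.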

Abstract: Understanding the temporal dynamics of biological growth is critical across diverse fields such as microbiology, agriculture, and biodegradation research. Although vision-language models like Contrastive Language Image Pretraining (CLIP) have shown strong capabilities in joint visual-textual reasoning, their effectiveness in capturing temporal progression remains limited. To address this, we propose CLIPTime, a multimodal, multitask framework designed to predict both the developmental stage and the corresponding timestamp of fungal growth from image and text inputs. Built upon the CLIP architecture, our model learns joint visual-textual embeddings and enables time-aware inference without requiring explicit temporal input during testing. To facilitate training and evaluation, we introduce a synthetic fungal growth dataset annotated with aligned timestamps and categorical stage labels. CLIPTime jointly performs classification and regression, predicting discrete growth stages alongside continuous timestamps. We also propose custom evaluation metrics, including temporal accuracy and regression error, to assess the precision of time-aware predictions. Experimental results demonstrate that CLIPTime effectively models biological progression and produces interpretable, temporally grounded outputs, highlighting the potential of vision-language models in real-world biological monitoring applications.
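
Joint classification and regression of this kind is commonly trained with a weighted sum of a cross-entropy term and a squared-error term. The sketch below shows that generic form; the abstract does not state the loss the authors use, so the combination, the `reg_weight` parameter, and all numbers are assumptions.

```python
import math

def cross_entropy(logits, label):
    """Negative log-likelihood of `label` under softmax(logits), computed stably."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]

def multitask_loss(stage_logits, stage_label, t_pred, t_true, reg_weight=1.0):
    """Stage classification loss plus weighted squared error on the timestamp."""
    cls_loss = cross_entropy(stage_logits, stage_label)
    reg_loss = (t_pred - t_true) ** 2
    return cls_loss + reg_weight * reg_loss

# One toy sample: three growth stages, timestamp measured in arbitrary units.
loss = multitask_loss([2.0, 0.5, -1.0], 0, t_pred=10.0, t_true=12.0, reg_weight=0.1)
```

The weight trades off the two objectives: a larger `reg_weight` pushes the shared embedding toward precise timestamps at the possible expense of stage accuracy.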

[65] Semantic and Temporal Integration in Latent Diffusion Space for High-Fidelity Video Super-Resolution

Yiwen Wang,Xinning Chai,Yuhong Zhang,Zhengxue Cheng,Jun Zhao,Rong Xie,Li Song

Main category: cs.CV

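TL;DR: The paper proposes SeTe-VSR, a new method that combines semantic and spatio-temporal guidance in the latent diffusion space to address the challenges of high-fidelity alignment and temporal consistency in video super-resolution.

Details Motivation: Current video super-resolution (VSR) models still fall short in detail recovery and temporal consistency. In particular, insufficient control over the generation process leads to poor alignment with the low-resolution input.

Contribution: The main contribution is SeTe-VSR, a new method that combines semantic and spatio-temporal guidance to achieve high-fidelity video super-resolution in the latent diffusion space.

Method: SeTe-VSR incorporates high-level semantic information and integrates spatial and temporal information to balance detail recovery against temporal consistency, improving the visual quality and realism of the generated video.

Result: Experiments show that SeTe-VSR outperforms existing methods in both detail recovery and perceptual quality, and performs especially well on complex video super-resolution tasks.

Insight: Integrating semantic and temporal information into a diffusion model can substantially improve the fidelity and coherence of video super-resolution, suggesting a direction for future work.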

Abstract: Recent advancements in video super-resolution (VSR) models have demonstrated impressive results in enhancing low-resolution videos. However, due to limitations in adequately controlling the generation process, achieving high-fidelity alignment with the low-resolution input while maintaining temporal consistency across frames remains a significant challenge. In this work, we propose Semantic and Temporal Guided Video Super-Resolution (SeTe-VSR), a novel approach that incorporates both semantic and spatio-temporal guidance in the latent diffusion space to address these challenges. By incorporating high-level semantic information and integrating spatial and temporal information, our approach achieves a seamless balance between recovering intricate details and ensuring temporal coherence. Our method not only preserves highly realistic visual content but also significantly enhances fidelity. Extensive experiments demonstrate that SeTe-VSR outperforms existing methods in terms of detail recovery and perceptual quality, highlighting its effectiveness for complex video super-resolution tasks.

[66] HyPCV-Former: Hyperbolic Spatio-Temporal Transformer for 3D Point Cloud Video Anomaly Detection

Jiaping Cao,Kangkang Zhou,Juan Du

Main category: cs.CV

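TL;DR: HyPCV-Former is a hyperbolic spatio-temporal transformer for anomaly detection in 3D point cloud videos. It embeds point cloud features into Lorentzian hyperbolic space and models temporal dynamics with a hyperbolic attention mechanism, improving anomaly detection performance.

Details Motivation: Existing video anomaly detection methods rely mainly on Euclidean representations in the RGB or depth domains, which are limited in capturing hierarchical event structure and spatio-temporal continuity. The paper proposes hyperbolic representations to address this.

Contribution: 1. Proposes HyPCV-Former, a hyperbolic spatio-temporal transformer for 3D point cloud video anomaly detection; 2. Designs a hyperbolic multi-head self-attention (HMHA) mechanism, with feature transformations and anomaly scoring performed directly in Lorentzian hyperbolic space; 3. Reports performance gains on the TIMo and DAD datasets.

Method: 1. Extract per-frame spatial features with a point cloud extractor; 2. Embed the features into Lorentzian hyperbolic space; 3. Model temporal dependencies with HMHA, which uses Lorentzian inner products and a curvature-aware softmax; 4. Perform anomaly scoring directly in hyperbolic space.

Result: A 7% improvement on the TIMo dataset and a 5.6% gain on the DAD dataset compared to benchmarks.

Insight: Hyperbolic space is better suited to data with hierarchical structure, and the model operates directly in full Lorentzian space rather than through a tangent-space approximation.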

Abstract: Video anomaly detection is a fundamental task in video surveillance, with broad applications in public safety and intelligent monitoring systems. Although previous methods leverage Euclidean representations in RGB or depth domains, such embeddings are inherently limited in capturing hierarchical event structures and spatio-temporal continuity. To address these limitations, we propose HyPCV-Former, a novel hyperbolic spatio-temporal transformer for anomaly detection in 3D point cloud videos. Our approach first extracts per-frame spatial features from point cloud sequences via a point cloud extractor, and then embeds them into Lorentzian hyperbolic space, which better captures the latent hierarchical structure of events. To model temporal dynamics, we introduce a hyperbolic multi-head self-attention (HMHA) mechanism that leverages Lorentzian inner products and curvature-aware softmax to learn temporal dependencies under non-Euclidean geometry. Our method performs all feature transformations and anomaly scoring directly within full Lorentzian space rather than via tangent space approximation. Extensive experiments demonstrate that HyPCV-Former achieves state-of-the-art performance across multiple anomaly categories, with a 7% improvement on the TIMo dataset and a 5.6% gain on the DAD dataset compared to benchmarks. The code will be released upon paper acceptance.
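
To make the Lorentzian inner product mentioned in the abstract concrete, the following is a minimal sketch of attention weights computed on the hyperboloid. It is not the authors' HMHA: the function names, the lifting of Euclidean features by solving for the time coordinate, and the use of negative squared Lorentzian distance as the attention score are all assumptions made for illustration.

```python
import math

def lorentz_inner(x, y):
    """Lorentzian inner product: -x0*y0 + sum_i xi*yi."""
    return -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))

def lift_to_hyperboloid(v, c=1.0):
    """Place a Euclidean vector on the hyperboloid <x, x>_L = -1/c
    by solving for the time coordinate x0 (illustrative choice)."""
    x0 = math.sqrt(1.0 / c + sum(a * a for a in v))
    return [x0] + list(v)

def lorentz_attention_weights(query, keys, c=1.0, temperature=1.0):
    """Softmax over negative squared Lorentzian distances.
    For points on the hyperboloid: ||x - y||_L^2 = -2/c - 2 <x, y>_L."""
    scores = []
    for k in keys:
        sq_dist = -2.0 / c - 2.0 * lorentz_inner(query, k)
        scores.append(-sq_dist / temperature)
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Toy per-frame features (hypothetical values)
frames = [lift_to_hyperboloid(v) for v in ([0.3, 0.4], [0.31, 0.39], [2.0, -1.5])]
weights = lorentz_attention_weights(frames[0], frames)
```

In this toy setup the frame closest to the query on the hyperboloid receives the largest weight; the curvature parameter `c` and the temperature control how sharply the weights concentrate.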

[67] Fine-grained Spatiotemporal Grounding on Egocentric Videos

Shuo Liang,Yiwu Zhong,Zi-Yuan Hu,Yeyao Tao,Liwei Wang

Main category: cs.CV

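TL;DR: The paper studies fine-grained spatiotemporal grounding in egocentric videos. It analyzes the challenges specific to egocentric video and introduces the EgoMask benchmark; fine-tuning on the accompanying EgoMask-Train set significantly improves existing models.

Details Motivation: Existing spatiotemporal grounding research focuses mainly on exocentric videos. The egocentric setting remains underexplored despite its growing importance in applications such as augmented reality and robotics.

Contribution: 1) EgoMask, the first pixel-level benchmark for fine-grained spatiotemporal grounding in egocentric videos; 2) an automatic annotation pipeline; 3) EgoMask-Train, a large-scale training dataset.

Method: The authors analyze how egocentric videos differ from exocentric ones (shorter object durations, sparser trajectories, smaller object sizes, larger positional shifts), build EgoMask with an automatic annotation pipeline, and fine-tune models on EgoMask-Train.

Result: State-of-the-art grounding models perform poorly on EgoMask, but fine-tuning on EgoMask-Train yields significant improvements while preserving performance on exocentric datasets.

Insight: The distinct properties of egocentric video call for dedicated benchmarks and training sets, which matters for future AR and robotics applications.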

Abstract: Spatiotemporal video grounding aims to localize target entities in videos based on textual queries. While existing research has made significant progress in exocentric videos, the egocentric setting remains relatively underexplored, despite its growing importance in applications such as augmented reality and robotics. In this work, we conduct a systematic analysis of the discrepancies between egocentric and exocentric videos, revealing key challenges such as shorter object durations, sparser trajectories, smaller object sizes, and larger positional shifts. To address these challenges, we introduce EgoMask, the first pixel-level benchmark for fine-grained spatiotemporal grounding in egocentric videos. It is constructed by our proposed automatic annotation pipeline, which annotates referring expressions and object masks across short-, medium-, and long-term videos. Additionally, we create EgoMask-Train, a large-scale training dataset to facilitate model development. Experiments demonstrate that the state-of-the-art spatiotemporal grounding models perform poorly on our benchmark EgoMask, but fine-tuning on EgoMask-Train yields significant improvements, while preserving performance on exocentric datasets. Our work thus provides essential resources and insights for advancing egocentric video understanding. Our code is available at https://github.com/LaVi-Lab/EgoMask .

[68] LAMIC: Layout-Aware Multi-Image Composition via Scalability of Multimodal Diffusion Transformer

Yuzhuo Chen,Zehua Ma,Jianhua Wang,Kai Kang,Shunyu Yao,Weiming Zhang

Main category: cs.CV

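TL;DR: LAMIC is a training-free, layout-aware multi-image composition framework. It extends single-reference diffusion models to multi-reference scenarios, introduces two attention mechanisms and three new evaluation metrics, and achieves state-of-the-art performance on most major metrics for multi-image composition.

Details Motivation: Generating coherent and consistent images from multiple references while respecting the spatial layout remains an open challenge. LAMIC addresses this by extending the capabilities of single-reference diffusion models.

Contribution: 1) The LAMIC framework, which for the first time extends single-reference diffusion models to multi-reference scenarios in a training-free manner; 2) two attention mechanisms, Group Isolation Attention (GIA) and Region-Modulated Attention (RMA); 3) three new evaluation metrics (IN-R, FI-R, and BG-S).

Method: Built on the MMDiT model, LAMIC uses GIA to enhance entity disentanglement and RMA to enable layout-aware generation, achieving multi-image composition without any training.

Result: LAMIC outperforms existing multi-reference baselines on ID-S, BG-S, IN-R, and AVG scores across all settings, and achieves the best DPG in complex composition tasks.

Insight: By inheriting the strengths of single-reference models and extending them to multi-image scenarios, LAMIC shows strong zero-shot generalization and offers a training-free paradigm for controllable multi-image composition.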

Abstract: In controllable image synthesis, generating coherent and consistent images from multiple references with spatial layout awareness remains an open challenge. We present LAMIC, a Layout-Aware Multi-Image Composition framework that, for the first time, extends single-reference diffusion models to multi-reference scenarios in a training-free manner. Built upon the MMDiT model, LAMIC introduces two plug-and-play attention mechanisms: 1) Group Isolation Attention (GIA) to enhance entity disentanglement; and 2) Region-Modulated Attention (RMA) to enable layout-aware generation. To comprehensively evaluate model capabilities, we further introduce three metrics: 1) Inclusion Ratio (IN-R) and Fill Ratio (FI-R) for assessing layout control; and 2) Background Similarity (BG-S) for measuring background consistency. Extensive experiments show that LAMIC achieves state-of-the-art performance across most major metrics: it consistently outperforms existing multi-reference baselines in ID-S, BG-S, IN-R and AVG scores across all settings, and achieves the best DPG in complex composition tasks. These results demonstrate LAMIC’s superior abilities in identity keeping, background preservation, layout control, and prompt-following, all achieved without any training or fine-tuning, showcasing strong zero-shot generalization ability. By inheriting the strengths of advanced single-reference models and enabling seamless extension to multi-image scenarios, LAMIC establishes a new training-free paradigm for controllable multi-image composition. As foundation models continue to evolve, LAMIC’s performance is expected to scale accordingly. Our implementation is available at: https://github.com/Suchenl/LAMIC.

[69] SAMSA 2.0: Prompting Segment Anything with Spectral Angles for Hyperspectral Interactive Medical Image Segmentation

Alfie Roddan,Tobias Czempiel,Chi Xu,Daniel S. Elson,Stamatia Giannarou

Main category: cs.CV

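TL;DR: SAMSA 2.0 combines spectral angle prompting with spatial cues to improve interactive hyperspectral medical image segmentation, achieving better performance in low-data and noisy scenarios without retraining.

Details Motivation: Hyperspectral medical image segmentation faces data scarcity and noise. Prior approaches rely on RGB-only information or earlier spectral fusion schemes, which limits performance.

Contribution: Proposes an early fusion of spectral information through spectral angle prompting, improving segmentation accuracy (up to +3.8% Dice over RGB-only models) and generalization.

Method: Guides the Segment Anything Model (SAM) with spectral similarity alongside spatial cues for interactive segmentation.

Result: Across diverse spectral datasets, the method outperforms RGB-only models (up to +3.8% Dice) and prior spectral fusion methods (up to +3.1%), and remains robust under low-data and noisy conditions.

Insight: Early fusion of spectral information can improve medical image segmentation accuracy, particularly when data is scarce and noisy.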

Abstract: We present SAMSA 2.0, an interactive segmentation framework for hyperspectral medical imaging that introduces spectral angle prompting to guide the Segment Anything Model (SAM) using spectral similarity alongside spatial cues. This early fusion of spectral information enables more accurate and robust segmentation across diverse spectral datasets. Without retraining, SAMSA 2.0 achieves up to +3.8% higher Dice scores compared to RGB-only models and up to +3.1% over prior spectral fusion methods. Our approach enhances few-shot and zero-shot performance, demonstrating strong generalization in challenging low-data and noisy scenarios common in clinical imaging.
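
The spectral similarity underlying the prompt can be illustrated with the standard spectral angle between two spectra, i.e. the arccosine of their normalized dot product. The sketch below computes a per-pixel angle map against the spectrum at a clicked pixel. How SAMSA 2.0 feeds such a map into SAM is not described in the abstract, so the function names and the toy cube are illustrative assumptions only.

```python
import math

def spectral_angle(a, b):
    """Angle (radians) between two spectra; invariant to intensity scaling."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return math.pi / 2  # treat an all-zero spectrum as maximally dissimilar
    cos = max(-1.0, min(1.0, dot / (na * nb)))  # clamp against rounding error
    return math.acos(cos)

def spectral_angle_map(cube, click_row, click_col):
    """cube[r][c] is the spectrum at pixel (r, c); returns an H x W angle map
    relative to the spectrum at the clicked pixel."""
    ref = cube[click_row][click_col]
    return [[spectral_angle(px, ref) for px in row] for row in cube]

# Toy 1 x 3 hyperspectral image with 3 bands (hypothetical values)
cube = [[[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [3.0, 0.0, -1.0]]]
angles = spectral_angle_map(cube, 0, 0)
```

Because the angle ignores overall brightness, the second pixel (a scaled copy of the clicked spectrum) maps to an angle of approximately zero, while the third, orthogonal spectrum maps to π/2.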

[70] LesiOnTime – Joint Temporal and Clinical Modeling for Small Breast Lesion Segmentation in Longitudinal DCE-MRI

Mohammed Kamran,Maria Bernathova,Raoul Varga,Christian Singer,Zsuzsanna Bago-Horvath,Thomas Helbich,Georg Langs,Philipp Seeböck

Main category: cs.CV

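TL;DR: LesiOnTime is a novel 3D segmentation approach that mimics the clinical diagnostic workflow by jointly leveraging longitudinal imaging and BI-RADS scores, improving segmentation accuracy for small breast lesions.

Details Motivation: Current deep learning methods for lesion segmentation in breast DCE-MRI primarily target large lesions and neglect longitudinal and clinical information, which is essential for early cancer detection.

Contribution: 1. A Temporal Prior Attention (TPA) block that dynamically integrates information from previous and current scans; 2. A BI-RADS Consistency Regularization (BCR) loss that embeds domain knowledge into the training process.

Method: 1. The TPA block fuses information across timepoints; 2. The BCR loss enforces latent space alignment for scans with similar radiological assessments.

Result: On an in-house longitudinal DCE-MRI dataset, LesiOnTime outperforms state-of-the-art single-timepoint and longitudinal baselines by 5% in Dice, and ablations show that TPA and BCR contribute complementary gains.

Insight: Incorporating temporal and clinical context improves the reliability of early lesion segmentation in real-world breast cancer screening.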

Abstract: Accurate segmentation of small lesions in Breast Dynamic Contrast-Enhanced MRI (DCE-MRI) is critical for early cancer detection, especially in high-risk patients. While recent deep learning methods have advanced lesion segmentation, they primarily target large lesions and neglect valuable longitudinal and clinical information routinely used by radiologists. In real-world screening, detecting subtle or emerging lesions requires radiologists to compare across timepoints and consider previous radiology assessments, such as the BI-RADS score. We propose LesiOnTime, a novel 3D segmentation approach that mimics clinical diagnostic workflows by jointly leveraging longitudinal imaging and BI-RADS scores. The key components are: (1) a Temporal Prior Attention (TPA) block that dynamically integrates information from previous and current scans; and (2) a BI-RADS Consistency Regularization (BCR) loss that enforces latent space alignment for scans with similar radiological assessments, thus embedding domain knowledge into the training process. Evaluated on a curated in-house longitudinal dataset of high-risk patients with DCE-MRI, our approach outperforms state-of-the-art single-timepoint and longitudinal baselines by 5% in terms of Dice. Ablation studies demonstrate that both TPA and BCR contribute complementary performance gains. These results highlight the importance of incorporating temporal and clinical context for reliable early lesion segmentation in real-world breast cancer screening. Our code is publicly available at https://github.com/cirmuw/LesiOnTime
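
Since the headline result is reported in terms of Dice, a short reminder of how the Dice coefficient is computed for binary masks may help: Dice = 2|A ∩ B| / (|A| + |B|). The sketch below works on flattened 0/1 masks. The convention of returning 1.0 when both masks are empty is an assumption, and evaluation protocols differ on that point.

```python
def dice_score(pred, target):
    """Dice coefficient between two flattened binary masks (sequences of 0/1)."""
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * intersection / total

# A small lesion: one voxel of ground truth, two voxels predicted
pred = [1, 1, 0, 0]
target = [1, 0, 0, 0]
score = dice_score(pred, target)
```

For small lesions a single extra or missing voxel shifts the score substantially, as here, where one false-positive voxel already lowers Dice to two thirds.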

[71] Leveraging Convolutional and Graph Networks for an Unsupervised Remote Sensing Labelling Tool

Tulsi Patel,Mark W. Jones,Thomas Redfern

Main category: cs.CV

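TL;DR: The paper proposes an unsupervised method based on convolutional and graph networks for labelling remote sensing imagery, removing the reliance of earlier tools on pre-labelled data.

Details Motivation: Previous remote sensing labelling tools rely on pre-labelled data for training, which is time-consuming and costly. The paper aims for an unsupervised approach that uses convolutional and graph networks to encode a more robust feature representation, reducing outliers and enabling finer-grained labelling.

Contribution: 1) An unsupervised labelling pipeline combining convolutional and graph networks; 2) a more robust feature space built from image segmentation and local neighbourhood information; 3) a rotationally invariant semantic relationship at the image level.

Method: 1) Segment the image into homogeneous regions of pixels grouped by colour and spatial similarity; 2) use graph neural networks to aggregate information from surrounding segments into a robust feature representation; 3) form rotationally invariant image-level semantic relationships within the encoding space.

Result: The method labels geographical areas of similar context and content in Sentinel-2 imagery, reduces outliers, and supports labelling at a granular level.

Insight: Combining convolutional and graph networks can make feature representations more robust without supervision, offering a new route to labelling remote sensing imagery.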

Abstract: Machine learning for remote sensing imaging relies on up-to-date and accurate labels for model training and testing. Labelling remote sensing imagery is time and cost intensive, requiring expert analysis. Previous labelling tools rely on pre-labelled data for training in order to label new unseen data. In this work, we define an unsupervised pipeline for finding and labelling geographical areas of similar context and content within Sentinel-2 satellite imagery. Our approach removes limitations of previous methods by utilising segmentation with convolutional and graph neural networks to encode a more robust feature space for image comparison. Unlike previous approaches we segment the image into homogeneous regions of pixels that are grouped based on colour and spatial similarity. Graph neural networks are used to aggregate information about the surrounding segments enabling the feature representation to encode the local neighbourhood whilst preserving its own local information. This reduces outliers in the labelling tool, allows users to label at a granular level, and allows a rotationally invariant semantic relationship at the image level to be formed within the encoding space.

[72] Context-based Motion Retrieval using Open Vocabulary Methods for Autonomous Driving

Stefan Englmeier,Max A. Büttner,Katharina Winter,Fabian B. Flohr

Main category: cs.CV

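TL;DR: The paper proposes a multimodal embedding approach that combines SMPL motion sequences with video frames to retrieve rare human behavior scenarios in autonomous driving through text queries, outperforming existing models by up to 27.5% on the WayMoCo dataset.

Details Motivation: Autonomous driving systems must operate reliably in scenarios involving complex behavior by Vulnerable Road Users (VRUs), but such rare edge cases are hard to retrieve from large-scale datasets.

Contribution: 1) A multimodal embedding framework combining SMPL motion and video frames; 2) the WayMoCo dataset, with automatically labeled motion and scene context descriptions.

Method: SMPL motion sequences and video frames are encoded into a shared embedding space aligned with natural language, enabling context-aware motion retrieval from text queries.

Result: On the WayMoCo dataset, the method improves motion-context retrieval accuracy over state-of-the-art models by up to 27.5%.

Insight: Multimodal alignment and natural-language queries make it possible to retrieve rare human behavior scenarios in driving data more efficiently, supporting targeted system evaluation.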

Abstract: Autonomous driving systems must operate reliably in safety-critical scenarios, particularly those involving unusual or complex behavior by Vulnerable Road Users (VRUs). Identifying these edge cases in driving datasets is essential for robust evaluation and generalization, but retrieving such rare human behavior scenarios within the long tail of large-scale datasets is challenging. To support targeted evaluation of autonomous driving systems in diverse, human-centered scenarios, we propose a novel context-aware motion retrieval framework. Our method combines Skinned Multi-Person Linear (SMPL)-based motion sequences and corresponding video frames before encoding them into a shared multimodal embedding space aligned with natural language. Our approach enables the scalable retrieval of human behavior and their context through text queries. This work also introduces our dataset WayMoCo, an extension of the Waymo Open Dataset. It contains automatically labeled motion and scene context descriptions derived from generated pseudo-ground-truth SMPL sequences and corresponding image data. Our approach outperforms state-of-the-art models by up to 27.5% accuracy in motion-context retrieval, when evaluated on the WayMoCo dataset.
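
Once motion, video, and text live in a shared embedding space, retrieval reduces to a nearest-neighbour search. The sketch below ranks clips by cosine similarity to a text-query embedding. The encoders themselves are out of scope here, so the two-dimensional embeddings and clip names are hypothetical placeholders, not WayMoCo data.

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query_emb, clip_embs, top_k=3):
    """Return the ids of the top_k clips most similar to the query embedding."""
    scored = sorted(
        ((cosine(query_emb, emb), clip_id) for clip_id, emb in clip_embs.items()),
        reverse=True,
    )
    return [clip_id for _, clip_id in scored[:top_k]]

# Hypothetical embeddings; a real system would obtain these from trained encoders
clips = {"crossing": [1.0, 0.0], "waiting": [0.0, 1.0], "running": [0.7, 0.7]}
query = [0.9, 0.1]  # stands in for the embedding of a text query
ranked = retrieve(query, clips, top_k=2)
```

At dataset scale the exhaustive sort would normally be replaced by an approximate nearest-neighbour index, but the ranking criterion stays the same.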

[73] EPANet: Efficient Path Aggregation Network for Underwater Fish Detection

Jinsong Yang,Zeyuan Hu,Yichen Li

Main category: cs.CV

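TL;DR: EPANet is an efficient path aggregation network for underwater fish detection that uses complementary feature integration to achieve accurate and lightweight detection.

Details Motivation: Underwater fish detection faces low object resolution, background interference, and high visual similarity between targets and surroundings. Existing methods often pay for accuracy with increased model complexity.

Contribution: 1. An efficient path aggregation feature pyramid network (EPA-FPN) and a multi-scale diverse-division short path bottleneck (MS-DDSP bottleneck). 2. Cross-layer fusion paths and long-range skip connections that improve feature complementarity.

Method: 1. EPA-FPN introduces long-range skip connections across disparate scales to improve semantic-spatial complementarity. 2. The MS-DDSP bottleneck uses finer-grained feature division and diverse convolutional operations to increase local feature diversity and representation capacity.

Result: On benchmark datasets, EPANet outperforms existing methods in detection accuracy and inference speed while maintaining comparable or lower parameter complexity.

Insight: A lightweight design can balance performance and efficiency through efficient cross-layer feature fusion and multi-scale feature division.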

Abstract: Underwater fish detection (UFD) remains a challenging task in computer vision due to low object resolution, significant background interference, and high visual similarity between targets and surroundings. Existing approaches primarily focus on local feature enhancement or incorporate complex attention mechanisms to highlight small objects, often at the cost of increased model complexity and reduced efficiency. To address these limitations, we propose an efficient path aggregation network (EPANet), which leverages complementary feature integration to achieve accurate and lightweight UFD. EPANet consists of two key components: an efficient path aggregation feature pyramid network (EPA-FPN) and a multi-scale diverse-division short path bottleneck (MS-DDSP bottleneck). The EPA-FPN introduces long-range skip connections across disparate scales to improve semantic-spatial complementarity, while cross-layer fusion paths are adopted to enhance feature integration efficiency. The MS-DDSP bottleneck extends the conventional bottleneck structure by introducing finer-grained feature division and diverse convolutional operations, thereby increasing local feature diversity and representation capacity. Extensive experiments on benchmark UFD datasets demonstrate that EPANet outperforms state-of-the-art methods in terms of detection accuracy and inference speed, while maintaining comparable or even lower parameter complexity.

[74] Video Color Grading via Look-Up Table Generation

Seunghyun Shin,Dongmin Shin,Jisu Shin,Hae-Gon Jeon,Joon-Young Lee

Main category: cs.CV

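TL;DR: The paper proposes a reference-based video color grading framework that generates a look-up table (LUT) with a diffusion model to align color attributes, and supports user preference adjustments through text prompts.

Details Motivation: Video color grading usually requires specialized skills, which keeps it out of reach for non-professionals. The authors aim to reduce this complexity with an automated tool.

Contribution: 1) A diffusion-based LUT generation framework that preserves structural details; 2) support for adjusting low-level features (such as contrast) through text prompts; 3) publicly released code and an evaluation that includes user studies.

Method: A diffusion model generates a LUT that aligns the color attributes of the input video with the reference scenes, while text prompts adjust low-level features.

Result: Experiments and user studies show that the method is effective for color grading and offers fast inference.

Insight: A LUT is an efficient tool for video color grading, and combining it with diffusion models and text prompts adds flexibility.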

Abstract: Different from color correction and transfer, color grading involves adjusting colors for artistic or storytelling purposes in a video, which is used to establish a specific look or mood. However, due to the complexity of the process and the need for specialized editing skills, video color grading remains primarily the domain of professional colorists. In this paper, we present a reference-based video color grading framework. Our key idea is explicitly generating a look-up table (LUT) for color attribute alignment between reference scenes and input video via a diffusion model. As a training objective, we enforce that high-level features of the reference scenes like look, mood, and emotion should be similar to that of the input video. Our LUT-based approach allows for color grading without any loss of structural details in the whole video frames as well as achieving fast inference. We further build a pipeline to incorporate a user-preference via text prompts for low-level feature enhancement such as contrast and brightness, etc. Experimental results, including extensive user studies, demonstrate the effectiveness of our approach for video color grading. Codes are publicly available at https://github.com/seunghyuns98/VideoColorGrading.
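
To illustrate why a LUT-based approach leaves structural details untouched, the sketch below applies a 3D LUT to a frame: each output pixel depends only on the input pixel's color, never on its neighbours. The nearest-grid-point lookup, the nested-list LUT layout, and the helper names are assumptions for illustration; production pipelines typically use trilinear or tetrahedral interpolation, and the paper's LUT generation via diffusion is not reproduced here.

```python
def identity_lut(size):
    """Build a size^3 LUT that maps every color to itself.
    lut[r][g][b] holds the output (R, G, B) for grid point (r, g, b)."""
    step = 1.0 / (size - 1)
    return [[[(r * step, g * step, b * step) for b in range(size)]
             for g in range(size)]
            for r in range(size)]

def apply_lut(pixel, lut):
    """Look up one RGB pixel (channels in [0, 1]) at the nearest LUT grid point."""
    size = len(lut)
    idx = [min(size - 1, max(0, round(c * (size - 1)))) for c in pixel]
    return lut[idx[0]][idx[1]][idx[2]]

def grade_frame(frame, lut):
    """Apply the same LUT independently to every pixel of a frame."""
    return [[apply_lut(px, lut) for px in row] for row in frame]

# A 1 x 2 toy frame graded with an identity LUT stays unchanged
frame = [[(0.0, 0.0, 0.0), (1.0, 0.0, 1.0)]]
graded = grade_frame(frame, identity_lut(2))
```

Applying one LUT to all frames of a clip is also cheap, which is consistent with the fast inference noted in the abstract.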

[75] Your other Left! Vision-Language Models Fail to Identify Relative Positions in Medical Images

Daniel Wolf,Heiko Hillenhagen,Billurvan Taskin,Alex Bäuerle,Meinrad Beer,Michael Götz,Timo Ropinski

Main category: cs.CV

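TL;DR: The paper evaluates how well state-of-the-art vision-language models (VLMs) identify relative positions in medical images and finds that all tested models fail. Visual prompts such as markers bring only moderate improvements. The models also rely on prior anatomical knowledge rather than image content, which leads to incorrect conclusions. The authors introduce the MIRP benchmark dataset to support future research.

Details Motivation: Clinical decision-making depends on an accurate understanding of the relative positions of anatomical structures and anomalies, so the ability of VLMs to determine relative positions in medical images is a fundamental prerequisite for clinical use. This capability remains underexplored.

Contribution: 1. A systematic evaluation of several state-of-the-art VLMs on relative position identification in medical images, showing that they all fail; 2. An investigation of whether visual prompts help; 3. The MIRP benchmark dataset.

Method: The authors test GPT-4o, Llama3.2, Pixtral, and JanusPro on medical images and add visual prompts (alphanumeric or colored markers placed on anatomical structures) to measure their effect on performance.

Result: All VLMs perform poorly on relative position identification in medical images. Visual prompts provide moderate improvements, but results remain significantly lower than on natural images.

Insight: In medical imaging, VLMs rely more on prior anatomical knowledge than on actual image content, which often leads to incorrect conclusions. MIRP offers a way to evaluate this capability systematically.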

Abstract: Clinical decision-making relies heavily on understanding relative positions of anatomical structures and anomalies. Therefore, for Vision-Language Models (VLMs) to be applicable in clinical practice, the ability to accurately determine relative positions on medical images is a fundamental prerequisite. Despite its importance, this capability remains highly underexplored. To address this gap, we evaluate the ability of state-of-the-art VLMs, GPT-4o, Llama3.2, Pixtral, and JanusPro, and find that all models fail at this fundamental task. Inspired by successful approaches in computer vision, we investigate whether visual prompts, such as alphanumeric or colored markers placed on anatomical structures, can enhance performance. While these markers provide moderate improvements, results remain significantly lower on medical images compared to observations made on natural images. Our evaluations suggest that, in medical imaging, VLMs rely more on prior anatomical knowledge than on actual image content for answering relative position questions, often leading to incorrect conclusions. To facilitate further research in this area, we introduce the MIRP (Medical Imaging Relative Positioning) benchmark dataset, designed to systematically evaluate the capability to identify relative positions in medical images.

[76] HiPrune: Training-Free Visual Token Pruning via Hierarchical Attention in Vision-Language Models

Jizhihui Liu,Feiyi Du,Guangdao Zhu,Niu Lian,Jun Li,Bin Chen

Main category: cs.CV

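TL;DR: HiPrune is a training-free visual token pruning method that exploits hierarchical attention in vision-language models to prune efficiently, substantially reducing computation while preserving task accuracy.

Details Motivation: Vision-language models encode images into long sequences of visual tokens, which causes heavy computational overhead and limits inference efficiency. Existing methods often rely on special tokens or require task-specific training, which hinders scalability across architectures.

Contribution: HiPrune is a training-free, model-agnostic token pruning framework that uses the hierarchical attention structure within vision encoders to select three types of informative tokens (anchor, buffer, and register).

Method: Based on the observation that middle layers attend to object-centric regions while deep layers capture global context, HiPrune selects anchor tokens with high attention in object-centric layers, buffer tokens adjacent to anchors for spatial continuity, and register tokens with strong attention in deep layers for global summarization, all without retraining.

Result: On LLaVA-1.5, LLaVA-NeXT, and Qwen2.5-VL, HiPrune preserves up to 99.3% of task accuracy with 33.3% of the tokens and 99.5% with only 11.1%, while reducing inference FLOPs and latency by up to 9×.

Insight: Hierarchical attention naturally separates visual features at different granularities, which offers a route to model-agnostic, efficient token pruning.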

Abstract: Vision-Language Models (VLMs) encode images into lengthy sequences of visual tokens, leading to excessive computational overhead and limited inference efficiency. While prior efforts prune or merge tokens to address this issue, they often rely on special tokens (e.g., CLS) or require task-specific training, hindering scalability across architectures. In this paper, we propose HiPrune, a training-free and model-agnostic token Pruning framework that exploits the Hierarchical attention structure within vision encoders. We identify that middle layers attend to object-centric regions, while deep layers capture global contextual features. Based on this observation, HiPrune selects three types of informative tokens: (1) Anchor tokens with high attention in object-centric layers, (2) Buffer tokens adjacent to anchors for spatial continuity, and (3) Register tokens with strong attention in deep layers for global summarization. Our method requires no retraining and integrates seamlessly with any ViT-based VLM. Extensive experiments on LLaVA-1.5, LLaVA-NeXT, and Qwen2.5-VL demonstrate that HiPrune achieves state-of-the-art pruning performance, preserving up to 99.3% task accuracy with only 33.3% tokens, and maintaining 99.5% accuracy with just 11.1% tokens. Meanwhile, it reduces inference FLOPs and latency by up to 9$\times$, showcasing strong generalization across models and tasks. Code is available at https://github.com/Danielement321/HiPrune.
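
The three token types can be sketched as a simple selection routine over per-token attention scores from two encoder layers. This is not the released HiPrune code: the 4-neighbourhood definition of "adjacent", the way the budget is split between anchors and registers, and all function names are assumptions made for illustration.

```python
def top_k(scores, k, exclude=()):
    """Indices of the k highest scores, skipping any index in `exclude`."""
    candidates = [i for i in range(len(scores)) if i not in exclude]
    return sorted(candidates, key=lambda i: -scores[i])[:k]

def select_tokens(mid_attn, deep_attn, grid_w, n_anchor, n_register):
    """Pick tokens to keep on an image patch grid flattened row by row.
    mid_attn / deep_attn: per-token attention from an object-centric (middle)
    layer and a deep layer, respectively (assumed to be precomputed)."""
    grid_h = len(mid_attn) // grid_w
    anchors = top_k(mid_attn, n_anchor)          # anchor tokens
    keep = set(anchors)
    for i in anchors:                            # buffer tokens: 4-neighbours
        r, c = divmod(i, grid_w)
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            rr, cc = r + dr, c + dc
            if 0 <= rr < grid_h and 0 <= cc < grid_w:
                keep.add(rr * grid_w + cc)
    keep.update(top_k(deep_attn, n_register, exclude=keep))  # register tokens
    return sorted(keep)

# Toy 3 x 3 patch grid (hypothetical attention values)
mid = [0.1, 0.2, 0.1, 0.2, 0.9, 0.2, 0.1, 0.2, 0.1]
deep = [0.8, 0.1, 0.2, 0.1, 0.3, 0.1, 0.05, 0.1, 0.15]
kept = select_tokens(mid, deep, grid_w=3, n_anchor=1, n_register=1)
```

In the toy grid the central patch becomes the anchor, its four neighbours become buffers, and the top-left patch is added as a register because it dominates the deep-layer attention.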

[77] Training-Free Class Purification for Open-Vocabulary Semantic Segmentation

Qi Chen,Lingxiao Yang,Yun Chen,Nailong Zhao,Jianhuang Lai,Jie Shao,Xiaohua Xie

Main category: cs.CV

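TL;DR: The paper proposes FreeCP, a training-free class purification framework that addresses class redundancy and visual-language ambiguity in open-vocabulary semantic segmentation, improving segmentation performance.

Details Motivation: Existing training-free methods for open-vocabulary semantic segmentation overlook the challenges of class redundancy and visual-language ambiguity, which lead to suboptimal class activation maps. FreeCP addresses these through class purification.

Contribution: A novel training-free class purification framework, FreeCP, focused on purifying semantic categories and rectifying errors caused by redundancy and ambiguity. FreeCP works as a plug-and-play module alongside other methods.

Method: FreeCP purifies the semantic categories and rectifies errors caused by redundancy and ambiguity, producing class representations that are then used for the final segmentation predictions.

Result: Experiments on eight benchmarks show that FreeCP significantly boosts the segmentation performance of other OVSS methods.

Insight: Addressing class redundancy and visual-language ambiguity lets training-free methods perform well in open-vocabulary semantic segmentation.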

Abstract: Fine-tuning pre-trained vision-language models has emerged as a powerful approach for enhancing open-vocabulary semantic segmentation (OVSS). However, the substantial computational and resource demands associated with training on large datasets have prompted interest in training-free methods for OVSS. Existing training-free approaches primarily focus on modifying model architectures and generating prototypes to improve segmentation performance. However, they often neglect the challenges posed by class redundancy, where multiple categories are not present in the current test image, and visual-language ambiguity, where semantic similarities among categories create confusion in class activation. These issues can lead to suboptimal class activation maps and affinity-refined activation maps. Motivated by these observations, we propose FreeCP, a novel training-free class purification framework designed to address these challenges. FreeCP focuses on purifying semantic categories and rectifying errors caused by redundancy and ambiguity. The purified class representations are then leveraged to produce final segmentation predictions. We conduct extensive experiments across eight benchmarks to validate FreeCP’s effectiveness. Results demonstrate that FreeCP, as a plug-and-play module, significantly boosts segmentation performance when combined with other OVSS methods.

[78] Weakly Supervised Virus Capsid Detection with Image-Level Annotations in Electron Microscopy Images

Hannah Kniesel,Leon Sick,Tristan Payer,Tim Bergner,Kavitha Shaga Devan,Clarissa Read,Paul Walther,Timo Ropinski

Main category: cs.CV

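TL;DR: The paper proposes a weakly supervised virus capsid detection method based on image-level annotations. A pre-trained model is used to generate pseudo-labels for training a detector, reducing annotation cost.

Details Motivation: Reduce the cost of manual annotation, in particular the time demanded of domain experts, by using weak supervision for virus detection.

Contribution: A weakly supervised detection method that needs no bounding box annotations; pseudo-labels distilled from a pre-trained model substantially lower the annotation effort.

Method: A model pre-trained to predict the presence or absence of a virus in an image is distilled through an optimization approach with a shrinking receptive field to obtain pseudo-labels, which are then used to train an object detection model.

Result: The pseudo-labels outperform other existing weak labeling methods, and even ground truth labels when annotation time is limited.

Insight: Weak supervision for object detection not only lowers cost but can also maintain or improve performance when annotation resources are constrained.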

Abstract: Current state-of-the-art methods for object detection rely on annotated bounding boxes of large data sets for training. However, obtaining such annotations is expensive and can require up to hundreds of hours of manual labor. This poses a challenge, especially since such annotations can only be provided by experts, as they require knowledge about the scientific domain. To tackle this challenge, we propose a domain-specific weakly supervised object detection algorithm that only relies on image-level annotations, which are significantly easier to acquire. Our method distills the knowledge of a pre-trained model, on the task of predicting the presence or absence of a virus in an image, to obtain a set of pseudo-labels that can be used to later train a state-of-the-art object detection model. To do so, we use an optimization approach with a shrinking receptive field to extract virus particles directly without specific network architectures. Through a set of extensive studies, we show how the proposed pseudo-labels are easier to obtain, and, more importantly, are able to outperform other existing weak labeling methods, and even ground truth labels, in cases where the time to obtain the annotation is limited.

[79] CoProU-VO: Combining Projected Uncertainty for End-to-End Unsupervised Monocular Visual Odometry

Jingchao Xie,Oussema Dhaouadi,Weirong Chen,Johannes Meier,Jacques Kaiser,Daniel Cremers

Main category: cs.CV

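TL;DR: CoProU-VO is an end-to-end unsupervised monocular visual odometry method that propagates and combines uncertainty across frames to filter out dynamic objects and occlusions, improving pose estimation in dynamic scenes.

Details Motivation: Unsupervised visual odometry methods struggle in dynamic scenes, largely because traditional uncertainty modeling considers only single-frame information and ignores how uncertainty carries across frames. Dynamic objects and occluded regions are then not filtered effectively, which degrades pose estimation.

Contribution: 1. CoProU-VO, an end-to-end unsupervised monocular visual odometry method that combines uncertainty across frames. 2. A probabilistic formulation that combines target frame uncertainty with projected reference frame uncertainty, improving robustness in dynamic scenes. 3. A vision transformer based architecture that simultaneously learns depth, uncertainty estimation, and camera poses.

Method: 1. Combined Projected Uncertainty (CoProU) merges the target frame's uncertainty with the uncertainty projected from the reference frame. 2. A vision transformer backbone jointly learns depth, uncertainty, and camera pose end to end. 3. The probabilistic model propagates uncertainty across frames to filter dynamic regions and occlusions.

Result: On KITTI and nuScenes, CoProU-VO significantly improves over previous unsupervised monocular end-to-end two-frame methods and performs strongly in challenging highway scenes. Ablation studies validate cross-frame uncertainty propagation.

Insight: Propagating uncertainty across frames is key to unsupervised visual odometry in dynamic scenes, where single-frame uncertainty modeling is limited. Combining a probabilistic model with vision transformers enables more robust end-to-end learning.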

Abstract: Visual Odometry (VO) is fundamental to autonomous navigation, robotics, and augmented reality, with unsupervised approaches eliminating the need for expensive ground-truth labels. However, these methods struggle when dynamic objects violate the static scene assumption, leading to erroneous pose estimations. We tackle this problem by uncertainty modeling, which is a commonly used technique that creates robust masks to filter out dynamic objects and occlusions without requiring explicit motion segmentation. Traditional uncertainty modeling considers only single-frame information, overlooking the uncertainties across consecutive frames. Our key insight is that uncertainty must be propagated and combined across temporal frames to effectively identify unreliable regions, particularly in dynamic scenes. To address this challenge, we introduce Combined Projected Uncertainty VO (CoProU-VO), a novel end-to-end approach that combines target frame uncertainty with projected reference frame uncertainty using a principled probabilistic formulation. Built upon vision transformer backbones, our model simultaneously learns depth, uncertainty estimation, and camera poses. Consequently, experiments on the KITTI and nuScenes datasets demonstrate significant improvements over previous unsupervised monocular end-to-end two-frame-based methods and exhibit strong performance in challenging highway scenes where other approaches often fail. Additionally, comprehensive ablation studies validate the effectiveness of cross-frame uncertainty propagation.
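
One simple way to combine two per-pixel uncertainties is to treat them as standard deviations of independent Gaussian errors, so that the variances add, and to use the combined variance to down-weight the photometric residual. The sketch below follows that assumption. The abstract does not spell out the paper's actual formulation, so both the combination rule and the loss form here should be read as illustrative, not as CoProU itself.

```python
import math

def combined_sigma(sigma_target, sigma_ref_projected):
    """Combine two independent Gaussian uncertainties (variances add)."""
    return math.sqrt(sigma_target ** 2 + sigma_ref_projected ** 2)

def uncertainty_weighted_loss(residuals, sigma_target, sigma_ref_projected):
    """Mean Gaussian negative log-likelihood (up to constants) per pixel:
    r^2 / sigma^2 + log(sigma^2), with sigma^2 the combined variance."""
    total = 0.0
    for r, st, sr in zip(residuals, sigma_target, sigma_ref_projected):
        var = st * st + sr * sr
        total += r * r / var + math.log(var)
    return total / len(residuals)

# Two pixels with the same residual; the second is flagged uncertain
# by the projected reference-frame uncertainty (hypothetical values)
residuals = [0.5, 0.5]
loss_static = uncertainty_weighted_loss(residuals[:1], [1.0], [0.1])
loss_dynamic = uncertainty_weighted_loss(residuals[1:], [1.0], [3.0])
```

The log term keeps the network from inflating uncertainty everywhere: a larger variance reduces the residual term but is penalized by `log(var)`.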

[80] A Novel Modeling Framework and Data Product for Extended VIIRS-like Artificial Nighttime Light Image Reconstruction (1986-2024)

Yihe Tian,Kwan Man Cheng,Zhengbo Zhang,Tao Zhang,Suju Li,Dongmei Yan,Bing Xu

Main category: cs.CV

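TL;DR: The paper proposes a novel two-stage modeling framework (construction and refinement) for reconstructing VIIRS-like nighttime light data from 1986 to 2024, addressing the underestimation of light intensity and the structural omissions of existing methods.

Details Motivation: Nighttime light data from the NPP-VIIRS sensor begins in 2012, which restricts long-term time-series studies, and current extension methods underestimate light intensity and omit structure.

Contribution: A two-stage framework comprising a Hierarchical Fusion Decoder (HFD) and a Dual Feature Refiner (DFR), and the resulting EVAL product for China, which extends the record back to 1986.

Method: The construction stage uses the HFD to enhance the fidelity of the initial reconstruction; the refinement stage uses the DFR, guided by high-resolution impervious surface masks, to enhance fine-grained structural details.

Result: EVAL significantly outperforms existing products, raising R² from 0.68 to 0.80 and lowering RMSE from 1.27 to 0.99, and shows good temporal consistency and a high correlation with socioeconomic parameters.

Insight: A two-stage framework improves the quality of nighttime light reconstruction, and the EVAL dataset provides a reliable resource for long-term analysis.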

Abstract: Artificial Night-Time Light (NTL) remote sensing is a vital proxy for quantifying the intensity and spatial distribution of human activities. Although the NPP-VIIRS sensor provides high-quality NTL observations, its temporal coverage, which begins in 2012, restricts long-term time-series studies that extend to earlier periods. Despite the progress in extending VIIRS-like NTL time-series, current methods still suffer from two significant shortcomings: the underestimation of light intensity and the structural omission. To overcome these limitations, we propose a novel reconstruction framework consisting of a two-stage process: construction and refinement. The construction stage features a Hierarchical Fusion Decoder (HFD) designed to enhance the fidelity of the initial reconstruction. The refinement stage employs a Dual Feature Refiner (DFR), which leverages high-resolution impervious surface masks to guide and enhance fine-grained structural details. Based on this framework, we developed the Extended VIIRS-like Artificial Nighttime Light (EVAL) product for China, extending the standard data record backwards by 26 years to begin in 1986. Quantitative evaluation shows that EVAL significantly outperforms existing state-of-the-art products, boosting the $\text{R}^2$ from 0.68 to 0.80 while lowering the RMSE from 1.27 to 0.99. Furthermore, EVAL exhibits excellent temporal consistency and maintains a high correlation with socioeconomic parameters, confirming its reliability for long-term analysis. The resulting EVAL dataset provides a valuable new resource for the research community and is publicly available at https://doi.org/10.11888/HumanNat.tpdc.302930.
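
For readers less familiar with the two reported metrics, the sketch below computes the coefficient of determination (R²) and the root mean squared error (RMSE) from paired reference and reconstructed values. The sample numbers are made up for illustration and are unrelated to the EVAL evaluation.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between reference and predicted values."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Made-up radiance values for three pixels
reference = [1.0, 2.0, 3.0]
reconstructed = [1.0, 2.0, 4.0]
error = rmse(reference, reconstructed)
fit = r_squared(reference, reconstructed)
```

A higher R² and a lower RMSE both indicate that the reconstruction tracks the reference more closely, which is the direction of the improvement reported in the abstract.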

[81] GeoMoE: Divide-and-Conquer Motion Field Modeling with Mixture-of-Experts for Two-View Geometry

Jiajun Le,Jiayi Ma

Main category: cs.CV

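TL;DR: GeoMoE is a Mixture-of-Experts (MoE) based approach that decomposes the motion field into heterogeneous sub-fields and models each one separately to improve motion field estimation in two-view geometry.

Details Motivation: Motion fields in real-world scenes often contain diverse and heterogeneous motion patterns, but existing methods lack targeted modeling strategies, so their estimates diverge from the true underlying structure and distribution.

Contribution: 1) A Probabilistic Prior-Guided Decomposition strategy that performs a structure-aware decomposition of the motion field; 2) an MoE-Enhanced Bi-Path Rectifier that models each sub-field with a customized expert.

Method: 1) Inlier probability signals guide the decomposition of the motion field into heterogeneous sub-fields; 2) the bi-path rectifier enhances each sub-field along spatial-context and channel-semantic paths and routes it to a dedicated expert.

Result: GeoMoE outperforms prior state-of-the-art methods in relative pose and homography estimation and shows strong generalization.

Insight: MoE can model heterogeneous motion patterns effectively, and a divide-and-conquer strategy improves the accuracy and robustness of motion field modeling.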

Abstract: Recent progress in two-view geometry increasingly emphasizes enforcing smoothness and global consistency priors when estimating motion fields between pairs of images. However, in complex real-world scenes, characterized by extreme viewpoint and scale changes as well as pronounced depth discontinuities, the motion field often exhibits diverse and heterogeneous motion patterns. Most existing methods lack targeted modeling strategies and fail to explicitly account for this variability, resulting in estimated motion fields that diverge from their true underlying structure and distribution. We observe that Mixture-of-Experts (MoE) can assign dedicated experts to motion sub-fields, enabling a divide-and-conquer strategy for heterogeneous motion patterns. Building on this insight, we re-architect motion field modeling in two-view geometry with GeoMoE, a streamlined framework. Specifically, we first devise a Probabilistic Prior-Guided Decomposition strategy that exploits inlier probability signals to perform a structure-aware decomposition of the motion field into heterogeneous sub-fields, sharply curbing outlier-induced bias. Next, we introduce an MoE-Enhanced Bi-Path Rectifier that enhances each sub-field along spatial-context and channel-semantic paths and routes it to a customized expert for targeted modeling, thereby decoupling heterogeneous motion regimes, suppressing cross-sub-field interference and representational entanglement, and yielding fine-grained motion-field rectification. With this minimalist design, GeoMoE outperforms prior state-of-the-art methods in relative pose and homography estimation and shows strong generalization. The source code and pre-trained models are available at https://github.com/JiajunLe/GeoMoE.

[82] Backdoor Attacks on Deep Learning Face Detection

Quentin Le Roux,Yannick Teglia,Teddy Furon,Philippe Loubet-Moundi

Main category: cs.CV

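TL;DR: This paper presents backdoor attacks on deep learning face detectors, including a new attack based on landmark shifts, and proposes corresponding mitigations.

Details Motivation: Face recognition systems that operate in unconstrained environments must cope with varying lighting and poses, so they depend on a face detection module. That module's regression tasks (bounding boxes and landmark coordinates) can be attacked, which makes studying their security and defenses important.

Contribution: 1. Shows the effectiveness of Object Generation Attacks on face detection, dubbed Face Generation Attacks; 2. demonstrates for the first time a Landmark Shift Attack that targets the coordinate regression task; 3. offers mitigations against these vulnerabilities.

Method: The paper demonstrates Face Generation Attacks experimentally and introduces a Landmark Shift Attack that backdoors the coordinate regression task of face detectors.

Result: Experiments confirm that the attacks are effective, exposing potential security risks for face detection systems in practice.

Insight: The study reveals that deep learning models are vulnerable in regression tasks and points toward defenses against similar backdoor attacks.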

Abstract: Face Recognition Systems that operate in unconstrained environments capture images under varying conditions, such as inconsistent lighting or diverse face poses. These challenges require including a Face Detection module that regresses bounding boxes and landmark coordinates for proper Face Alignment. This paper shows the effectiveness of Object Generation Attacks on Face Detection, dubbed Face Generation Attacks, and demonstrates for the first time a Landmark Shift Attack that backdoors the coordinate regression task performed by face detectors. We then offer mitigations against these vulnerabilities.

[83] Minimum Data, Maximum Impact: 20 annotated samples for explainable lung nodule classification

Luisa Gallée,Catharina Silvia Lisson,Christoph Gerhard Lisson,Daniela Drees,Felix Weig,Daniel Vogele,Meinrad Beer,Michael Götz

Main category: cs.CV

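TL;DR: This paper synthesizes attribute-annotated data with a generative model; only 20 annotated samples are needed to improve an explainable model for lung nodule classification.

Details Motivation: In medical image analysis, the scarcity of large-scale annotated datasets limits the development and adoption of explainable classification models. Synthesizing data annotated with pathology-related attributes can ease this problem.

Contribution: Proposes an attribute-conditioned generative model that needs only a small number of real annotated samples to produce synthetic data, which in turn improves the performance of the explainable model.

Method: Uses a Diffusion Model with attribute conditioning to generate lung nodule images annotated with pathology-related attributes, and adds the synthetic data to model training.

Result: Compared with training on the small real dataset alone, synthetic data increases attribute prediction accuracy by 13.4% and target prediction accuracy by 1.8%.

Insight: Synthetic data can compensate for the shortage of annotated medical images and broaden the applicability of explainable AI models in clinical settings.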

Abstract: Classification models that provide human-interpretable explanations enhance clinicians’ trust and usability in medical image diagnosis. One research focus is the integration and prediction of pathology-related visual attributes used by radiologists alongside the diagnosis, aligning AI decision-making with clinical reasoning. Radiologists use attributes like shape and texture as established diagnostic criteria and mirroring these in AI decision-making both enhances transparency and enables explicit validation of model outputs. However, the adoption of such models is limited by the scarcity of large-scale medical image datasets annotated with these attributes. To address this challenge, we propose synthesizing attribute-annotated data using a generative model. We enhance the Diffusion Model with attribute conditioning and train it using only 20 attribute-labeled lung nodule samples from the LIDC-IDRI dataset. Incorporating its generated images into the training of an explainable model boosts performance, increasing attribute prediction accuracy by 13.4% and target prediction accuracy by 1.8% compared to training with only the small real attribute-annotated dataset. This work highlights the potential of synthetic data to overcome dataset limitations, enhancing the applicability of explainable models in medical image analysis.

[84] Can Large Pretrained Depth Estimation Models Help With Image Dehazing?

Hongfei Zhang,Kun Zhou,Ruizheng Wu,Jiangbo Lu

Main category: cs.CV

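TL;DR: This paper studies how well pretrained depth estimation models generalize to image dehazing and proposes a plug-and-play RGB-D fusion module that improves dehazing quality and applicability.

Details Motivation: Image dehazing is challenging because haze varies spatially. Existing methods struggle to adapt across scenarios with different accuracy and efficiency requirements, so the paper explores the potential of pretrained depth representations for dehazing.

Contribution: 1) Shows that pretrained depth features stay consistent across varying haze levels; 2) designs an RGB-D fusion module that fits diverse dehazing architectures.

Method: Proposes a plug-and-play RGB-D fusion module that uses pretrained depth features to improve dehazing. The pretrained model does not need to be retrained.

Result: Experiments across multiple benchmarks validate the effectiveness and broad applicability of the approach.

Insight: Pretrained depth features are robust across haze levels in the dehazing task, which suggests a new direction for cross-task generalization.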

Abstract: Image dehazing remains a challenging problem due to the spatially varying nature of haze in real-world scenes. While existing methods have demonstrated the promise of large-scale pretrained models for image dehazing, their architecture-specific designs hinder adaptability across diverse scenarios with different accuracy and efficiency requirements. In this work, we systematically investigate the generalization capability of pretrained depth representations, learned from millions of diverse images, for image dehazing. Our empirical analysis reveals that the learned deep depth features maintain remarkable consistency across varying haze levels. Building on this insight, we propose a plug-and-play RGB-D fusion module that seamlessly integrates with diverse dehazing architectures. Extensive experiments across multiple benchmarks validate both the effectiveness and broad applicability of our approach.

[85] D3: Training-Free AI-Generated Video Detection Using Second-Order Features

Chende Zheng,Ruiqi suo,Chenhao Lin,Zhengyu Zhao,Le Yang,Shuai Liu,Minghui Yang,Cong Wang,Chao Shen

Main category: cs.CV

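TL;DR: This paper proposes D3, a training-free method for detecting AI-generated video that separates real from generated videos by the difference in their second-order temporal features, and reports strong accuracy and efficiency.

Details Motivation: As video generation techniques such as Sora advance, high-fidelity AI-generated videos have become easy to produce, raising public concern over the spread of synthetic content. Existing detection methods explore temporal artifacts insufficiently.

Contribution: 1. Establishes a theoretical framework through second-order dynamical analysis under Newtonian mechanics; 2. extends Second-order Central Difference features for temporal artifact detection; 3. reveals a divergence in second-order feature distributions between real and AI-generated videos; 4. proposes D3, a training-free detection method.

Method: Captures temporal artifacts with second-order central difference features and uses the distribution gap in those features to build the training-free detector D3.

Result: D3 is validated on 4 open-source datasets (40 subsets in total); for example, on GenVideo it outperforms the previous best method by 10.39% (absolute) mean Average Precision. Additional experiments show low time cost and robustness to post-processing operations.

Insight: Second-order dynamical features offer a new angle for separating real from AI-generated video, and a training-free detector can be both efficient and accurate.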

Abstract: The evolution of video generation techniques, such as Sora, has made it increasingly easy to produce high-fidelity AI-generated videos, raising public concern over the dissemination of synthetic content. However, existing detection methodologies remain limited by their insufficient exploration of temporal artifacts in synthetic videos. To bridge this gap, we establish a theoretical framework through second-order dynamical analysis under Newtonian mechanics, subsequently extending the Second-order Central Difference features tailored for temporal artifact detection. Building on this theoretical foundation, we reveal a fundamental divergence in second-order feature distributions between real and AI-generated videos. Concretely, we propose Detection by Difference of Differences (D3), a novel training-free detection method that leverages the above second-order temporal discrepancies. We validate the superiority of D3 on 4 open-source datasets (GenVideo, VideoPhy, EvalCrafter, VidProM), 40 subsets in total. For example, on GenVideo, D3 outperforms the previous best method by 10.39% (absolute) mean Average Precision. Additional experiments on time cost and post-processing operations demonstrate D3’s exceptional computational efficiency and strong robustness. Our code is available at https://github.com/Zig-HS/D3.
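To make the "difference of differences" idea concrete, here is a minimal sketch of a second-order central difference over per-frame feature vectors, f[t+1] - 2*f[t] + f[t-1]. The function names, the toy features, and the mean-absolute-value summary are assumptions for illustration; the digest does not describe which features D3 extracts or how it turns them into a decision.

```python
def second_order_central_difference(frames):
    """Return f[t+1] - 2*f[t] + f[t-1] for every interior frame t.

    `frames` is a list of equal-length feature vectors, one per frame
    (a hypothetical stand-in for whatever features a detector extracts).
    """
    if len(frames) < 3:
        raise ValueError("need at least three frames")
    result = []
    for t in range(1, len(frames) - 1):
        prev, cur, nxt = frames[t - 1], frames[t], frames[t + 1]
        result.append([n - 2 * c + p for p, c, n in zip(prev, cur, nxt)])
    return result


def mean_abs(rows):
    """Average absolute value over all entries (an illustrative summary only)."""
    flat = [abs(v) for row in rows for v in row]
    return sum(flat) / len(flat)


# Features that change linearly over time have zero second-order difference.
linear = [[float(t), 2.0 * t] for t in range(6)]
# Adding an alternating offset to the first feature produces non-zero values.
jittery = [[float(t) + (0.5 if t % 2 else -0.5), 2.0 * t] for t in range(6)]

linear_score = mean_abs(second_order_central_difference(linear))
jittery_score = mean_abs(second_order_central_difference(jittery))
```

The sketch only shows that second-order differences cancel steady motion and expose frame-to-frame irregularity; it makes no claim about which kind of video scores higher in practice.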

[86] MIHBench: Benchmarking and Mitigating Multi-Image Hallucinations in Multimodal Large Language Models

Jiale Li,Mingrui Wu,Zixiang Jin,Hao Chen,Jiayi Ji,Xiaoshuai Sun,Liujuan Cao,Rongrong Ji

Main category: cs.CV

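TL;DR: This paper presents MIHBench, a benchmark for object-related hallucinations of Multimodal Large Language Models (MLLMs) in multi-image settings, as part of what the authors describe as the first systematic study of the problem. A Dynamic Attention Balancing mechanism reduces the hallucinations.

Details Motivation: Existing studies focus mainly on hallucination in single-image settings, while hallucination in multi-image settings has not been studied systematically. The paper aims to fill this gap.

Contribution: 1. Proposes MIHBench, a benchmark for multi-image hallucination; 2. studies hallucinations in object existence, quantity reasoning, and cross-view identity consistency across multiple images; 3. proposes a Dynamic Attention Balancing mechanism to reduce hallucinations.

Method: The Dynamic Attention Balancing mechanism adjusts the attention distribution across images while preserving the overall visual attention proportion, to improve semantic understanding and reasoning stability.

Result: Experiments show that the method reduces hallucinations in multi-image settings and improves semantic integration and reasoning stability.

Insight: Multi-image hallucination is associated with the number of input images, single-image hallucination tendencies, the ratio of same-object images, and the position of negative samples.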

Abstract: Despite growing interest in hallucination in Multimodal Large Language Models, existing studies primarily focus on single-image settings, leaving hallucination in multi-image scenarios largely unexplored. To address this gap, we conduct the first systematic study of hallucinations in multi-image MLLMs and propose MIHBench, a benchmark specifically tailored for evaluating object-related hallucinations across multiple images. MIHBench comprises three core tasks: Multi-Image Object Existence Hallucination, Multi-Image Object Count Hallucination, and Object Identity Consistency Hallucination, targeting semantic understanding across object existence, quantity reasoning, and cross-view identity consistency. Through extensive evaluation, we identify key factors associated with the occurrence of multi-image hallucinations, including: a progressive relationship between the number of image inputs and the likelihood of hallucination occurrences; a strong correlation between single-image hallucination tendencies and those observed in multi-image contexts; and the influence of same-object image ratios and the positional placement of negative samples within image sequences on the occurrence of object identity consistency hallucination. To address these challenges, we propose a Dynamic Attention Balancing mechanism that adjusts inter-image attention distributions while preserving the overall visual attention proportion. Experiments across multiple state-of-the-art MLLMs demonstrate that our method effectively reduces hallucination occurrences and enhances semantic integration and reasoning stability in multi-image scenarios.
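The constraint stated for Dynamic Attention Balancing, changing how attention is split across images without changing the total attention given to visual tokens, can be illustrated with a small sketch. The interpolation rule and the `strength` parameter below are assumptions chosen for simplicity; the digest does not give the paper's actual update rule.

```python
def rebalance_image_attention(image_attention, strength=0.5):
    """Pull per-image attention masses toward their mean.

    `image_attention[i]` is the attention mass assigned to image i.
    The interpolation toward the mean is a hypothetical rule; it is used
    here only because it keeps the summed visual attention unchanged.
    """
    if not image_attention:
        raise ValueError("need at least one image")
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be in [0, 1]")
    mean = sum(image_attention) / len(image_attention)
    return [(1.0 - strength) * a + strength * mean for a in image_attention]


# Three images share 0.40 of the attention; the rest goes to text tokens.
before = [0.30, 0.06, 0.04]
after = rebalance_image_attention(before, strength=0.5)
```

After the call, the images still share the same total attention, but the gap between the most and least attended image is halved.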

[87] YOLO-Count: Differentiable Object Counting for Text-to-Image Generation

Guanning Zeng,Xiang Zhang,Zirui Wang,Haiyang Xu,Zeyuan Chen,Bingnan Li,Zhuowen Tu

Main category: cs.CV

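TL;DR: YOLO-Count is a differentiable open-vocabulary object counting model that uses a 'cardinality' map and a hybrid supervision scheme to enable precise quantity control in text-to-image generation.

Details Motivation: The paper addresses open-vocabulary object counting and applies it to quantity control in text-to-image generation, a gap left by existing methods.

Contribution: Proposes the 'cardinality' map as a regression target, designs a hybrid strong-weak supervision scheme, and builds a fully differentiable architecture that supports gradient-based optimization and guidance for generative models.

Method: Uses representation alignment and the hybrid supervision scheme, with the cardinality map accounting for variations in object size and spatial distribution, to obtain a differentiable counter.

Result: Experiments show that YOLO-Count achieves state-of-the-art counting accuracy and effective quantity control for text-to-image generation.

Insight: Combining the cardinality map with hybrid supervision suggests a new way to jointly optimize open-vocabulary counting and generative models.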

Abstract: We propose YOLO-Count, a differentiable open-vocabulary object counting model that tackles both general counting challenges and enables precise quantity control for text-to-image (T2I) generation. A core contribution is the ‘cardinality’ map, a novel regression target that accounts for variations in object size and spatial distribution. Leveraging representation alignment and a hybrid strong-weak supervision scheme, YOLO-Count bridges the gap between open-vocabulary counting and T2I generation control. Its fully differentiable architecture facilitates gradient-based optimization, enabling accurate object count estimation and fine-grained guidance for generative models. Extensive experiments demonstrate that YOLO-Count achieves state-of-the-art counting accuracy while providing robust and effective quantity control for T2I systems.

[88] GECO: Geometrically Consistent Embedding with Lightspeed Inference

Regine Hartwig,Dominik Muhle,Riccardo Marin,Daniel Cremers

Main category: cs.CV

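TL;DR: GECO proposes a training framework based on optimal transport that learns geometrically consistent features, addressing the lack of 3D geometric awareness in self-supervised vision foundation models while delivering fast inference and better accuracy.

Details Motivation: Current self-supervised vision foundation models capture semantic correspondences but lack awareness of the underlying 3D geometry. GECO aims to close this gap.

Contribution: 1. Proposes a training framework based on optimal transport that learns geometrically consistent features; 2. uses a lightweight architecture that runs at 30 fps; 3. achieves state-of-the-art performance on PFPascal, APK, and CUB; 4. introduces new geometry-aware evaluation metrics.

Method: An optimal-transport-based training framework that produces geometrically consistent features, paired with a lightweight network architecture.

Result: Improves PCK by 6.0%, 6.2%, and 4.1% on PFPascal, APK, and CUB, respectively, and runs at 30 fps, 98.2% faster than prior methods.

Insight: PCK alone is insufficient to assess geometric quality; more geometry-aware metrics are needed.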

Abstract: Recent advances in feature learning have shown that self-supervised vision foundation models can capture semantic correspondences but often lack awareness of underlying 3D geometry. GECO addresses this gap by producing geometrically coherent features that semantically distinguish parts based on geometry (e.g., left/right eyes, front/back legs). We propose a training framework based on optimal transport, enabling supervision beyond keypoints, even under occlusions and disocclusions. With a lightweight architecture, GECO runs at 30 fps, 98.2% faster than prior methods, while achieving state-of-the-art performance on PFPascal, APK, and CUB, improving PCK by 6.0%, 6.2%, and 4.1%, respectively. Finally, we show that PCK alone is insufficient to capture geometric quality and introduce new metrics and insights for more geometry-aware feature learning. Link to project page: https://reginehartwig.github.io/publications/geco/
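For readers unfamiliar with PCK (Percentage of Correct Keypoints), the following is a minimal sketch of the common definition: a predicted keypoint counts as correct when it lies within `alpha` times the larger side of the object's bounding box from the ground truth. The function name and the sample coordinates are hypothetical, and GECO's exact evaluation protocol is not given in this digest.

```python
import math


def pck(predicted, ground_truth, bbox_size, alpha=0.1):
    """Fraction of predicted keypoints within alpha * max(bbox side) of the target.

    `predicted` and `ground_truth` are equal-length lists of (x, y) pairs;
    `bbox_size` is (width, height) of the object's bounding box.
    """
    if len(predicted) != len(ground_truth) or not ground_truth:
        raise ValueError("keypoint lists must be non-empty and equal in length")
    threshold = alpha * max(bbox_size)
    hits = sum(
        1
        for (px, py), (gx, gy) in zip(predicted, ground_truth)
        if math.hypot(px - gx, py - gy) <= threshold
    )
    return hits / len(ground_truth)


# Hypothetical keypoints: the first two land within 10 px, the third is 15 px off.
pred = [(10.0, 10.0), (50.0, 52.0), (95.0, 40.0)]
gt = [(10.0, 10.0), (50.0, 50.0), (80.0, 40.0)]
score = pck(pred, gt, bbox_size=(100.0, 80.0), alpha=0.1)
```

Because the score depends only on distance to the annotated point, PCK on its own is a purely distance-based view of match quality, which is one reason a paper might pair it with additional metrics.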

[89] Is It Really You? Exploring Biometric Verification Scenarios in Photorealistic Talking-Head Avatar Videos

Laura Pedrouzo-Rodriguez,Pedro Delgado-DeRobles,Luis F. Gomez,Ruben Tolosana,Ruben Vera-Rodriguez,Aythami Morales,Julian Fierrez

Main category: cs.CV

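TL;DR: This paper explores whether facial motion patterns can serve as behavioral biometrics for identity verification in photorealistic talking-head avatar videos, and proposes a lightweight spatio-temporal Graph Convolutional Network architecture.

Details Motivation: As photorealistic talking-head avatars spread to virtual meetings and similar settings, security risks such as impersonation are growing. Checks based on appearance and voice are hard to rely on, so new biometric verification methods are needed.

Contribution: 1. Introduces a new dataset of genuine and impostor talking-head avatar videos; 2. designs a lightweight spatio-temporal Graph Convolutional Network architecture that models dynamic facial gestures from facial landmarks alone; 3. shows experimentally that facial motion patterns support identity verification, with AUC values approaching 80%.

Method: Generates realistic avatar videos with GAGAvatar and proposes a model based on a spatio-temporal Graph Convolutional Network (GCN) with temporal attention pooling that relies only on facial landmarks for identity verification.

Result: Experiments show that verification based only on facial motion patterns reaches AUC values approaching 80%, supporting its feasibility.

Insight: Dynamic facial behavior can complement other biometric traits, especially when the avatar's visual appearance has been replicated precisely.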

Abstract: Photorealistic talking-head avatars are becoming increasingly common in virtual meetings, gaming, and social platforms. These avatars allow for more immersive communication, but they also introduce serious security risks. One emerging threat is impersonation: an attacker can steal a user’s avatar, preserving their appearance and voice, making it nearly impossible to detect its fraudulent usage by sight or sound alone. In this paper, we explore the challenge of biometric verification in such avatar-mediated scenarios. Our main question is whether an individual’s facial motion patterns can serve as reliable behavioral biometrics to verify their identity when the avatar’s visual appearance is a facsimile of its owner. To answer this question, we introduce a new dataset of realistic avatar videos created using a state-of-the-art one-shot avatar generation model, GAGAvatar, with genuine and impostor avatar videos. We also propose a lightweight, explainable spatio-temporal Graph Convolutional Network architecture with temporal attention pooling that uses only facial landmarks to model dynamic facial gestures. Experimental results demonstrate that facial motion cues enable meaningful identity verification with AUC values approaching 80%. The proposed benchmark and biometric system are available for the research community in order to bring attention to the urgent need for more advanced behavioral biometric defenses in avatar-based communication systems.

[90] Sample-Aware Test-Time Adaptation for Medical Image-to-Image Translation

Irene Iele,Francesco Di Feola,Valerio Guarrasi,Paolo Soda

Main category: cs.CV

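TL;DR: This paper proposes a novel Test-Time Adaptation (TTA) framework for medical image-to-image translation that dynamically adjusts the translation process to out-of-distribution test samples while preserving performance on in-distribution samples.

Details Motivation: Medical image-to-image translation degrades on out-of-distribution samples, and existing methods adapt all samples uniformly without telling apart those that need adjustment. The paper targets this problem.

Contribution: Proposes a sample-aware TTA framework with a Reconstruction Module that quantifies domain shift and a Dynamic Adaptation Block that selectively adjusts features, improving robustness to out-of-distribution samples.

Method: Uses the Reconstruction Module to assess domain shift and the Dynamic Adaptation Block to selectively modify the internal features of a pretrained translation model, yielding per-sample adaptation.

Result: On low-dose CT denoising and T1 to T2 MRI translation, the method outperforms both the baseline model without TTA and prior TTA methods.

Insight: Dynamic, sample-specific adaptation is more effective than uniform adjustment and offers a practical path for medical image translation in real-world use.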

Abstract: Image-to-image translation has emerged as a powerful technique in medical imaging, enabling tasks such as image denoising and cross-modality conversion. However, it suffers from limitations in handling out-of-distribution samples without causing performance degradation. To address this limitation, we propose a novel Test-Time Adaptation (TTA) framework that dynamically adjusts the translation process based on the characteristics of each test sample. Our method introduces a Reconstruction Module to quantify the domain shift and a Dynamic Adaptation Block that selectively modifies the internal features of a pretrained translation model to mitigate the shift without compromising the performance on in-distribution samples that do not require adaptation. We evaluate our approach on two medical image-to-image translation tasks: low-dose CT denoising and T1 to T2 MRI translation, showing consistent improvements over both the baseline translation model without TTA and prior TTA methods. Our analysis highlights the limitations of the state-of-the-art that uniformly apply the adaptation to both out-of-distribution and in-distribution samples, demonstrating that dynamic, sample-specific adjustment offers a promising path to improve model resilience in real-world scenarios. The code is available at: https://github.com/cosbidev/Sample-Aware_TTA.

[91] Cross-Dataset Semantic Segmentation Performance Analysis: Unifying NIST Point Cloud City Datasets for 3D Deep Learning

Alexander Nikitas Dimopoulos,Joseph Grasso

Main category: cs.CV

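TL;DR: The paper analyzes differently labeled collections of NIST's Point Cloud City dataset to examine performance differences of 3D deep learning in semantic segmentation. Large objects are segmented well, small safety-critical features are recognized poorly, and the authors call for standardized annotation protocols and better labeling techniques.

Details Motivation: The study addresses inconsistent semantic segmentation performance across heterogeneously labeled point-cloud datasets in public safety applications, with a focus on lidar scans used in pre-incident planning systems.

Contribution: Identifies the challenges of unifying differently labeled point-cloud datasets and evaluates performance differences with the KPConv architecture and IoU metrics. It also points out the limits of current point-cloud methods in detecting small objects.

Method: Uses the KPConv architecture with a graded schema, evaluates semantic segmentation through IoU metrics, and analyzes class imbalance and geometric distinctiveness in the datasets.

Result: Geometrically large objects (e.g. stairs, windows) are segmented better, while smaller safety-critical features have lower recognition rates. Performance is affected by class imbalance and the limited geometric distinction of small objects.

Insight: Reliable point-cloud semantic segmentation for public safety requires standardized annotation protocols and improved labeling techniques to handle data heterogeneity and small-object detection.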

Abstract: This study analyzes semantic segmentation performance across heterogeneously labeled point-cloud datasets relevant to public safety applications, including pre-incident planning systems derived from lidar scans. Using NIST’s Point Cloud City dataset (Enfield and Memphis collections), we investigate challenges in unifying differently labeled 3D data. Our methodology employs a graded schema with the KPConv architecture, evaluating performance through IoU metrics on safety-relevant features. Results indicate performance variability: geometrically large objects (e.g. stairs, windows) achieve higher segmentation performance, suggesting potential for navigational context, while smaller safety-critical features exhibit lower recognition rates. Performance is impacted by class imbalance and the limited geometric distinction of smaller objects in typical lidar scans, indicating limitations in detecting certain safety-relevant features using current point-cloud methods. Key identified challenges include insufficient labeled data, difficulties in unifying class labels across datasets, and the need for standardization. Potential directions include automated labeling and multi-dataset learning strategies. We conclude that reliable point-cloud semantic segmentation for public safety necessitates standardized annotation protocols and improved labeling techniques to address data heterogeneity and the detection of small, safety-critical elements.
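The interaction between IoU and class imbalance mentioned above can be seen in a small sketch of per-class Intersection over Union on point labels. The label values and the toy scene are hypothetical and are not taken from the Point Cloud City data.

```python
def per_class_iou(predicted, ground_truth, num_classes):
    """Per-class IoU over point labels; None where a class never appears."""
    if len(predicted) != len(ground_truth):
        raise ValueError("label lists must have equal length")
    ious = []
    for c in range(num_classes):
        pairs = list(zip(predicted, ground_truth))
        intersection = sum(1 for p, g in pairs if p == c and g == c)
        union = sum(1 for p, g in pairs if p == c or g == c)
        ious.append(intersection / union if union else None)
    return ious


# Toy scene: class 0 is a large structure (8 points), class 1 a small feature (2 points).
gt_labels = [0] * 8 + [1] * 2
# One of the two small-feature points is mislabeled as the large class.
pred_labels = [0] * 8 + [0, 1]

ious = per_class_iou(pred_labels, gt_labels, num_classes=2)
accuracy = sum(p == g for p, g in zip(pred_labels, gt_labels)) / len(gt_labels)
```

Here overall point accuracy is 0.9, yet the small class scores an IoU of only 0.5, which is why per-class IoU is more informative than accuracy when small, rare classes matter.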

eess.IV [Back]

[92] CADS: A Comprehensive Anatomical Dataset and Segmentation for Whole-Body Anatomy in Computed Tomography

Murong Xu,Tamaz Amiranashvili,Fernando Navarro,Maksym Fritsak,Ibrahim Ethem Hamamci,Suprosanna Shit,Bastian Wittmann,Sezgin Er,Sebastian M. Christ,Ezequiel de la Rosa,Julian Deseoe,Robert Graf,Hendrik Möller,Anjany Sekuboyina,Jan C. Peeken,Sven Becker,Giulia Baldini,Johannes Haubold,Felix Nensa,René Hosch,Nikhil Mirajkar,Saad Khalid,Stefan Zachow,Marc-André Weber,Georg Langs,Jakob Wasserthal,Mehmet Kemal Ozdemir,Andrey Fedorov,Ron Kikinis,Stephanie Tanadini-Lang,Jan S. Kirschke,Stephanie E. Combs,Bjoern Menze

Main category: eess.IV

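TL;DR: CADS is an open-source framework that integrates and standardizes heterogeneous data sources to build a large-scale whole-body CT segmentation dataset, and develops an automated segmentation model on top of it, improving performance and clinical utility.

Details Motivation: Existing AI segmentation models usually target individual structures and lack comprehensiveness and compatibility, while the training data behind existing whole-body approaches falls short in heterogeneity and anatomical coverage.

Contribution: 1. Builds a large-scale dataset of 22,022 CT volumes annotated for 167 anatomical structures; 2. develops the CADS-model on this dataset for fully automated CT segmentation; 3. demonstrates advantages through evaluation on 18 public datasets and a real-world hospital cohort.

Method: Systematically integrates and standardizes heterogeneous data sources into a large-scale dataset, then develops the CADS-model with established architectures for whole-body CT segmentation.

Result: CADS exceeds existing collections in scale and anatomical coverage (18 times more scans and 60% more distinct anatomical targets) and shows direct utility for clinical interventions.

Insight: Data integration and standardization are key to improving AI model performance and clinical applicability; openly sharing datasets and tools can advance AI in radiology.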

Abstract: Accurate delineation of anatomical structures in volumetric CT scans is crucial for diagnosis and treatment planning. While AI has advanced automated segmentation, current approaches typically target individual structures, creating a fragmented landscape of incompatible models with varying performance and disparate evaluation protocols. Foundational segmentation models address these limitations by providing a holistic anatomical view through a single model. Yet, robust clinical deployment demands comprehensive training data, which is lacking in existing whole-body approaches, both in terms of data heterogeneity and, more importantly, anatomical coverage. In this work, rather than pursuing incremental optimizations in model architecture, we present CADS, an open-source framework that prioritizes the systematic integration, standardization, and labeling of heterogeneous data sources for whole-body CT segmentation. At its core is a large-scale dataset of 22,022 CT volumes with complete annotations for 167 anatomical structures, representing a significant advancement in both scale and coverage, with 18 times more scans than existing collections and 60% more distinct anatomical targets. Building on this diverse dataset, we develop the CADS-model using established architectures for accessible and automated full-body CT segmentation. Through comprehensive evaluation across 18 public datasets and an independent real-world hospital cohort, we demonstrate advantages over SoTA approaches. Notably, thorough testing of the model’s performance in segmentation tasks from radiation oncology validates its direct utility for clinical interventions. By making our large-scale dataset, our segmentation models, and our clinical software tool publicly available, we aim to advance robust AI solutions in radiology and make comprehensive anatomical analysis accessible to clinicians and researchers alike.

[93] Weakly Supervised Intracranial Aneurysm Detection and Segmentation in MR angiography via Multi-task UNet with Vesselness Prior

Erin Rainville,Amirhossein Rasoulian,Hassan Rivaz,Yiming Xiao

Main category: eess.IV

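TL;DR: This paper proposes a weakly supervised 3D multi-task UNet that incorporates vesselness priors to jointly detect and segment intracranial aneurysms, improving performance over prior techniques.

Details Motivation: Intracranial aneurysms (IAs) are small and low in contrast, which makes detection and morphological analysis difficult, and the lack of large public datasets with voxel-wise annotations limits the development of deep learning models.

Contribution: Proposes a novel weakly supervised 3D multi-task UNet that integrates vesselness priors and jointly performs aneurysm detection and segmentation through an attention block and an auxiliary branch.

Method: Applies Frangi's vesselness filter to derive cerebrovascular priors used for both the network input and an attention block; segmentation comes from the decoder and detection from an auxiliary branch.

Result: Trained and tested on the Lausanne dataset and externally validated on ADAM, the method reports Dice = 0.614 and 95%HD = 1.38 mm for segmentation, and a false positive rate of 1.47 with 92.9% sensitivity for detection.

Insight: Combining vesselness priors with multi-task learning can substantially improve small-target detection and segmentation under weak supervision.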

Abstract: Intracranial aneurysms (IAs) are abnormal dilations of cerebral blood vessels that, if ruptured, can lead to life-threatening consequences. However, their small size and soft contrast in radiological scans often make it difficult to perform accurate and efficient detection and morphological analyses, which are critical in the clinical care of the disorder. Furthermore, the lack of large public datasets with voxel-wise expert annotations poses challenges for developing deep learning algorithms to address the issues. Therefore, we propose a novel weakly supervised 3D multi-task UNet that integrates vesselness priors to jointly perform aneurysm detection and segmentation in time-of-flight MR angiography (TOF-MRA). Specifically, to robustly guide IA detection and segmentation, we employ the popular Frangi’s vesselness filter to derive soft cerebrovascular priors for both network input and an attention block to conduct segmentation from the decoder and detection from an auxiliary branch. We train our model on the Lausanne dataset with coarse ground truth segmentation, and evaluate it on the test set with refined labels from the same database. To further assess our model’s generalizability, we also validate it externally on the ADAM dataset. Our results demonstrate the superior performance of the proposed technique over the SOTA techniques for aneurysm segmentation (Dice = 0.614, 95%HD = 1.38 mm) and detection (false positive rate = 1.47, sensitivity = 92.9%).
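The Dice score reported above measures overlap between a predicted and a reference mask as 2|A∩B| / (|A| + |B|). A minimal sketch on flattened binary masks follows; the masks are made up, and returning 1.0 when both masks are empty is a convention assumed here rather than something stated in the paper.

```python
def dice(predicted, reference):
    """Dice coefficient for two flattened binary masks (lists of 0/1)."""
    if len(predicted) != len(reference):
        raise ValueError("masks must have equal length")
    intersection = sum(1 for p, r in zip(predicted, reference) if p and r)
    total = sum(1 for p in predicted if p) + sum(1 for r in reference if r)
    if total == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * intersection / total


# Hypothetical masks: three voxels each, two of them shared.
pred_mask = [1, 1, 1, 0, 0, 0]
ref_mask = [0, 1, 1, 1, 0, 0]
score = dice(pred_mask, ref_mask)
```

For small targets such as aneurysms, a shift of one or two voxels changes the score noticeably, as this six-voxel example shows.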

[94] FMPlug: Plug-In Foundation Flow-Matching Priors for Inverse Problems

Yuxiang Wan,Ryan Devera,Wenjie Zhang,Ju Sun

Main category: eess.IV

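TL;DR: FMPlug is a novel plug-in framework that enhances foundation flow-matching (FM) priors for solving ill-posed inverse problems. With a time-adaptive warm-up strategy and sharp Gaussianity regularization, it outperforms existing methods that use foundation FM priors.

Details Motivation: Traditional approaches rely on domain-specific or untrained priors and are hard to adapt to a wide range of tasks. FMPlug addresses this by exploiting the similarity between observed and desired objects and the Gaussianity of generative flows.

Contribution: Proposes the FMPlug framework, which combines a time-adaptive warm-up strategy with sharp Gaussianity regularization to unlock the potential of domain-agnostic foundation models.

Method: Builds on foundation FM priors and adds two core mechanisms, one exploiting the similarity between observation and target and one regularizing Gaussianity, to improve inverse problem solving.

Result: On image super-resolution and Gaussian deblurring, FMPlug beats existing methods that use foundation FM priors by significant margins.

Insight: Domain-agnostic foundation models can be adapted to specific tasks through simple mechanisms without complex tuning, which offers a new direction for solving inverse problems.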

Abstract: We present FMPlug, a novel plug-in framework that enhances foundation flow-matching (FM) priors for solving ill-posed inverse problems. Unlike traditional approaches that rely on domain-specific or untrained priors, FMPlug smartly leverages two simple but powerful insights: the similarity between observed and desired objects and the Gaussianity of generative flows. By introducing a time-adaptive warm-up strategy and sharp Gaussianity regularization, FMPlug unlocks the true potential of domain-agnostic foundation models. Our method beats state-of-the-art methods that use foundation FM priors by significant margins, on image super-resolution and Gaussian deblurring.

[95] AI-Driven Collaborative Satellite Object Detection for Space Sustainability

Peng Hu,Wenxuan Zhang

Main category: eess.IV

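TL;DR: This paper proposes an AI-based framework for collaborative space object detection across multiple satellites, motivated by the collision risk from growing satellite density in low-Earth orbit. With a high-fidelity dataset and a distance-aware viewpoint selection strategy, the method achieves competitive detection accuracy compared to single-satellite approaches while keeping size, weight, and power (SWaP) requirements low.

Details Motivation: Growing satellite density in low-Earth orbit raises collision risk, and traditional ground-based tracking systems are limited by latency and coverage. Onboard, vision-based detection is needed to support space sustainability.

Contribution: 1) Proposes a framework in which multiple satellites collaboratively execute deep learning detection tasks; 2) constructs a high-fidelity dataset simulating imaging for clustered satellite formations; 3) introduces a distance-aware viewpoint selection strategy to optimize detection performance.

Method: A satellite cluster collaboratively executes deep learning-based space object detection, supported by the simulated dataset and the distance-aware viewpoint selection strategy, to optimize detection accuracy within SWaP limits.

Result: Experiments show that the method achieves competitive detection accuracy compared to single-satellite and existing approaches while maintaining a low SWaP footprint.

Insight: Distributed, AI-enabled in-orbit systems can enhance space situational awareness and support long-term space sustainability.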

Abstract: The growing density of satellites in low-Earth orbit (LEO) presents serious challenges to space sustainability, primarily due to the increased risk of in-orbit collisions. Traditional ground-based tracking systems are constrained by latency and coverage limitations, underscoring the need for onboard, vision-based space object detection (SOD) capabilities. In this paper, we propose a novel satellite clustering framework that enables the collaborative execution of deep learning (DL)-based SOD tasks across multiple satellites. To support this approach, we construct a high-fidelity dataset simulating imaging scenarios for clustered satellite formations. A distance-aware viewpoint selection strategy is introduced to optimize detection performance, and recent DL models are used for evaluation. Experimental results show that the clustering-based method achieves competitive detection accuracy compared to single-satellite and existing approaches, while maintaining a low size, weight, and power (SWaP) footprint. These findings underscore the potential of distributed, AI-enabled in-orbit systems to enhance space situational awareness and contribute to long-term space sustainability.

q-bio.NC [Back]

[96] The Repeated-Stimulus Confound in Electroencephalography

Jack A. Kilgallen,Barak A. Pearlmutter,Jeffrey Mark Siskind

Main category: q-bio.NC

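TL;DR: This paper identifies a common problem in electroencephalography (EEG) decoding studies: the confound caused by repeated stimuli, where training and evaluating on responses to the same stimuli overestimates model performance.

Details Motivation: Deep learning is now widely used in neural-decoding research, and because the models are data-hungry, many studies present the same stimuli repeatedly to increase the number of trials. This practice can inflate reported model performance and can undermine the validity of study conclusions.

Contribution: Names and defines the repeated-stimulus confound and quantifies experimentally how much it inflates model performance (4.46% to 7.42%). It also analyzes how the confound could be used in pseudoscientific settings.

Method: The authors identify a susceptible dataset and 16 affected publications, run experiments with models from those studies to estimate the overestimation caused by the confound, and test how the same methodology behaves in pseudoscientific scenarios.

Result: The experiments indicate that the confound inflated decoding accuracies by 4.46% to 7.42%, and that each 1% increase in accuracy under the confound raises the overestimation by 0.26%. The same methodology could also be used to support pseudoscientific claims.

Insight: The repeated-stimulus confound is easy to overlook but far-reaching; researchers need to design evaluations carefully so that repeated data does not bias conclusions.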

Abstract: In neural-decoding studies, recordings of participants’ responses to stimuli are used to train models. In recent years, there has been an explosion of publications detailing applications of innovations from deep-learning research to neural-decoding studies. The data-hungry models used in these experiments have resulted in a demand for increasingly large datasets. Consequently, in some studies, the same stimuli are presented multiple times to each participant to increase the number of trials available for use in model training. However, when a decoding model is trained and subsequently evaluated on responses to the same stimuli, stimulus identity becomes a confounder for accuracy. We term this the repeated-stimulus confound. We identify a susceptible dataset, and 16 publications which report model performance based on evaluation procedures affected by the confound. We conducted experiments using models from the affected studies to investigate the likely extent to which results in the literature have been misreported. Our findings suggest that the decoding accuracies of these models were overestimated by 4.46% to 7.42%. Our analysis also indicates that for each 1% increase in accuracy under the confound, the magnitude of the overestimation increases by 0.26%. The confound not only results in optimistic estimates of decoding performance, but undermines the validity of several claims made within the affected publications. We conducted further experiments to investigate the implications of the confound in alternative contexts. We found that the same methodology used within the affected studies could also be used to justify an array of pseudoscientific claims, such as the existence of extrasensory perception.
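One way to keep stimulus identity out of the evaluation is to split by stimulus rather than by trial, so that no stimulus contributes responses to both the training and the test set. The sketch below assumes each trial is a dict with a `stimulus_id` field, which is a hypothetical layout and not the format of any dataset named in the paper.

```python
import random


def split_by_stimulus(trials, test_fraction=0.25, seed=0):
    """Split trials so that train and test share no stimulus.

    `trials` is a list of dicts carrying a 'stimulus_id' key (assumed layout).
    Whole stimuli, with all their repetitions, go to one side or the other.
    """
    stimuli = sorted({trial["stimulus_id"] for trial in trials})
    rng = random.Random(seed)
    rng.shuffle(stimuli)
    n_test = max(1, round(len(stimuli) * test_fraction))
    test_ids = set(stimuli[:n_test])
    train = [t for t in trials if t["stimulus_id"] not in test_ids]
    test = [t for t in trials if t["stimulus_id"] in test_ids]
    return train, test


# Toy recording session: 8 stimuli, each presented 5 times.
trials = [
    {"stimulus_id": s, "repetition": r}
    for s in range(8)
    for r in range(5)
]
train, test = split_by_stimulus(trials)
train_ids = {t["stimulus_id"] for t in train}
test_ids = {t["stimulus_id"] for t in test}
```

A random split over individual trials would, by contrast, almost always place repetitions of the same stimulus on both sides, which is the situation the abstract describes as confounded.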

cs.SD [Back]

[97] AudioGen-Omni: A Unified Multimodal Diffusion Transformer for Video-Synchronized Audio, Speech, and Song Generation

Le Wang,Jun Wang,Feng Deng,Chen Zhang,Kun Gai,Di Zhang

Main category: cs.SD

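TL;DR: AudioGen-Omni is a unified model based on multimodal diffusion transformers (MMDit) that generates high-fidelity audio, speech, and songs synchronized with the input video. A joint training paradigm and a multimodal fusion mechanism yield semantically rich, acoustically diverse audio.

Details Motivation: Existing audio generation models usually focus on a single task (such as speech or music) and lack joint training over multimodal inputs. AudioGen-Omni aims to address this with a unified framework for efficient, general-purpose multimodal audio generation.

Contribution: 1. Proposes a unified joint training paradigm that integrates video-text-audio data; 2. designs a lyrics-transcription encoder and the PAAPI mechanism for precise cross-modal alignment; 3. improves audio quality, semantic alignment, and lip-sync accuracy.

Method: AudioGen-Omni uses the MMDit framework with a lyrics-transcription encoder and an AdaLN-based joint attention mechanism, enhanced with PAAPI for cross-modal alignment. Unfreezing all modalities and masking missing inputs enables flexible conditional generation.

Result: Experiments show that AudioGen-Omni achieves state-of-the-art results on Text-to-Audio/Speech/Song tasks, with an inference time of 1.91 seconds for 8 seconds of audio.

Insight: Unified joint training with multimodal fusion is an effective way to improve the generality of audio generation. The PAAPI mechanism offers a new approach to cross-modal alignment that may carry over to other multimodal tasks.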

Abstract: We present AudioGen-Omni - a unified approach based on multimodal diffusion transformers (MMDit), capable of generating high-fidelity audio, speech, and songs coherently synchronized with the input video. AudioGen-Omni introduces a novel joint training paradigm that seamlessly integrates large-scale video-text-audio corpora, enabling a model capable of generating semantically rich, acoustically diverse audio conditioned on multimodal inputs and adaptable to a wide range of audio generation tasks. AudioGen-Omni employs a unified lyrics-transcription encoder that encodes graphemes and phonemes from both sung and spoken inputs into dense frame-level representations. Dense frame-level representations are fused using an AdaLN-based joint attention mechanism enhanced with phase-aligned anisotropic positional infusion (PAAPI), wherein RoPE is selectively applied to temporally structured modalities to ensure precise and robust cross-modal alignment. By unfreezing all modalities and masking missing inputs, AudioGen-Omni mitigates the semantic constraints of text-frozen paradigms, enabling effective cross-modal conditioning. This joint training approach enhances audio quality, semantic alignment, and lip-sync accuracy, while also achieving state-of-the-art results on Text-to-Audio/Speech/Song tasks. With an inference time of 1.91 seconds for 8 seconds of audio, it offers substantial improvements in both efficiency and generality.
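Since PAAPI applies RoPE only to temporally structured modalities, a short sketch of plain rotary position embedding may help. It rotates consecutive feature pairs by an angle proportional to the position, which makes query-key dot products depend only on the relative offset. This is the standard RoPE formulation with the usual base of 10000; how AudioGen-Omni selects modalities or combines RoPE with its other components is not described in this digest.

```python
import math


def rope(vector, position, base=10000.0):
    """Apply rotary position embedding to an even-length feature vector.

    Pair (vector[i], vector[i+1]) is rotated by position * base**(-i/d),
    with i stepping over even indices and d the vector length.
    """
    d = len(vector)
    if d % 2:
        raise ValueError("feature dimension must be even")
    rotated = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        x, y = vector[i], vector[i + 1]
        rotated.extend([x * cos_t - y * sin_t, x * sin_t + y * cos_t])
    return rotated


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Hypothetical query and key features for two frame positions.
query = [1.0, 2.0, 3.0, 4.0]
key = [0.5, -1.0, 2.0, 0.0]

# Both pairs of positions are two steps apart, so the scores should match.
score_near = dot(rope(query, 3), rope(key, 1))
score_far = dot(rope(query, 7), rope(key, 5))
```

In a selective scheme, inputs without a time axis would simply skip the `rope` call; that choice is stated here as an assumption about how "selectively applied" could be realized.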

cs.AI [Back]

[98] RL-PLUS: Countering Capability Boundary Collapse of LLMs in Reinforcement Learning with Hybrid-policy Optimization

Yihong Dong,Xue Jiang,Yongding Tao,Huanyu Liu,Kechi Zhang,Lili Mou,Rongyu Cao,Yingwei Ma,Jue Chen,Binhua Li,Zhi Jin,Fei Huang,Yongbin Li,Ge Li

Main category: cs.AI

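TL;DR: RL-PLUS combines internal exploitation (Thinking) with external data (Learning) to overcome inherent limitations of RLVR, surpassing the capability boundaries of base LLMs and avoiding capability boundary collapse.

Details Motivation: Existing RLVR methods are inherently on-policy and face sparse rewards, so they struggle to break through the capability boundary of the base LLM and can cause capability boundary collapse. RL-PLUS aims to solve these problems through hybrid-policy optimization.

Contribution: RL-PLUS proposes a hybrid-policy optimization approach that combines Multiple Importance Sampling with an Exploration-Based Advantage Function, strengthening LLM reasoning and surpassing the boundaries of base models.

Method: RL-PLUS has two core components: 1) Multiple Importance Sampling to address the distributional mismatch from external data; 2) an Exploration-Based Advantage Function to guide the model toward high-value, unexplored reasoning paths.

Result: Experiments show that RL-PLUS performs strongly on six math reasoning benchmarks and six out-of-distribution reasoning tasks, with average relative improvements ranging from 21.1% to 69.2%. Pass@k curves indicate that it effectively resolves capability boundary collapse.

Insight: Hybrid-policy optimization, which combines internal exploitation with external data, is key to pushing past the capability boundaries of LLMs; handling distributional mismatch and sparse rewards effectively is the main technical means to that end.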

Abstract: Reinforcement Learning with Verifiable Reward (RLVR) has significantly advanced the complex reasoning abilities of Large Language Models (LLMs). However, it struggles to break through the inherent capability boundaries of the base LLM, because it is an inherently on-policy strategy operating over the LLM’s immense action space with sparse reward. Further, RLVR can lead to capability boundary collapse, narrowing the LLM’s problem-solving scope. To address this problem, we propose RL-PLUS, a novel approach that synergizes internal exploitation (i.e., Thinking) with external data (i.e., Learning) to achieve stronger reasoning capabilities and surpass the boundaries of base models. RL-PLUS integrates two core components: Multiple Importance Sampling to address distributional mismatch from external data, and an Exploration-Based Advantage Function to guide the model towards high-value, unexplored reasoning paths. We provide both theoretical analysis and extensive experiments to demonstrate the superiority and generalizability of our approach. The results show that RL-PLUS achieves state-of-the-art performance compared with existing RLVR methods on six math reasoning benchmarks and exhibits superior performance on six out-of-distribution reasoning tasks. It also achieves consistent and significant gains across diverse model families, with average relative improvements ranging from 21.1% to 69.2%. Moreover, Pass@k curves across multiple benchmarks indicate that RL-PLUS effectively resolves the capability boundary collapse problem.
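
The abstract relies on Pass@k curves but does not say how they are computed. As a minimal sketch, assuming the commonly used unbiased estimator from code-generation evaluation (n samples per problem, c of them correct), Pass@k for one problem could be computed as below. The function name `pass_at_k` is hypothetical and is not taken from the RL-PLUS code.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k samples is correct.

    n: total samples drawn for the problem
    c: number of those samples that are correct
    k: budget of samples considered (1 <= k <= n)

    Assumption: this is the standard unbiased estimator
    1 - C(n - c, k) / C(n, k); the paper may compute Pass@k differently.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every size-k subset has a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Plotting this value against k, averaged over problems, gives a Pass@k curve; a model whose curve falls below the base model's at large k would be showing the narrowing that the abstract calls capability boundary collapse.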

[99] MetaAgent: Toward Self-Evolving Agent via Tool Meta-Learning

Hongjin Qian,Zheng Liu

Main category: cs.AI

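TL;DR: MetaAgent proposes a self-evolving agent paradigm that improves continually through tool meta-learning, without changing model parameters or requiring additional training.

Details Motivation: Conventional agent systems fall short on knowledge discovery tasks; MetaAgent addresses this through adaptive tool use and self-reflection.

Contribution: Proposes the MetaAgent framework, which comprises tool routing, self-reflection, knowledge base construction, and a meta tool learning mechanism, enabling the agent to evolve on its own.

Method: MetaAgent generates natural language requests to call external tools and combines self-reflection with distilled experience to refine its task execution strategy dynamically.

Result: On benchmarks including GAIA, WebWalkerQA, and BrowseCamp, MetaAgent outperforms baselines and matches or exceeds end-to-end trained agents.

Insight: Data-driven meta tool learning is an effective route to agent self-evolution, improving performance without any model adjustment.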

Abstract: In this work, we propose MetaAgent, an agentic paradigm inspired by the principle of learning-by-doing, where expertise is developed through hands-on practice and continual self-improvement. MetaAgent starts with a minimal workflow, equipped only with basic reasoning and adaptive help-seeking abilities. When a knowledge gap is encountered, MetaAgent generates natural language help requests, which are routed to the most suitable external tool by a dedicated tool router. As MetaAgent solves tasks, it continually conducts self-reflection and answer verification, distilling actionable experience into concise texts that are dynamically incorporated into future task contexts. In addition, MetaAgent autonomously builds in-house tools and a persistent knowledge base by organizing its tool-use history, further enhancing its ability to retrieve and integrate relevant information. We term this continual, data-driven process meta tool learning, through which MetaAgent incrementally refines its reasoning and tool-use strategies, without changing model parameters or requiring further post-training. Evaluated on challenging knowledge discovery benchmarks, including GAIA, WebWalkerQA, and BrowseCamp, MetaAgent consistently outperforms workflow-based baselines and matches or exceeds end-to-end trained agents, demonstrating the promise of self-evolving agentic systems for robust, general-purpose knowledge discovery. Our source code is available at https://github.com/qhjqhj00/MetaAgent.

[100] R1-ACT: Efficient Reasoning Model Safety Alignment by Activating Safety Knowledge

Yeonjun In,Wonjoong Kim,Sangwu Park,Chanyoung Park

Main category: cs.AI

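TL;DR: R1-ACT is an efficient post-training method that explicitly triggers safety knowledge during a structured reasoning process, substantially improving the safety of large reasoning models while preserving reasoning performance.

Details Motivation: Large reasoning models (LRMs) perform well on complex tasks, but studies show that they frequently comply with harmful instructions, raising safety concerns. This paper aims to address that problem.

Contribution: Proposes R1-ACT, which explicitly activates the safety knowledge a model already has through a structured reasoning process, substantially improving safety without harming reasoning performance.

Method: A post-training method (R1-ACT) that uses a small amount of training data (1,000 examples) and a short training run (90 minutes) to explicitly trigger safety knowledge within structured reasoning.

Result: R1-ACT delivers strong safety improvements while preserving reasoning performance across multiple LRM backbones and sizes, outperforming prior alignment methods.

Insight: Models already possess sufficient safety knowledge; the key is activating it during reasoning. R1-ACT does so through structured reasoning.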

Abstract: Although large reasoning models (LRMs) have demonstrated impressive capabilities on complex tasks, recent studies reveal that these models frequently fulfill harmful user instructions, raising significant safety concerns. In this paper, we investigate the underlying cause of LRM safety risks and find that models already possess sufficient safety knowledge but fail to activate it during reasoning. Based on this insight, we propose R1-Act, a simple and efficient post-training method that explicitly triggers safety knowledge through a structured reasoning process. R1-Act achieves strong safety improvements while preserving reasoning performance, outperforming prior alignment methods. Notably, it requires only 1,000 training examples and 90 minutes of training on a single RTX A6000 GPU. Extensive experiments across multiple LRM backbones and sizes demonstrate the robustness, scalability, and practical efficiency of our approach.

[101] Cognitive Kernel-Pro: A Framework for Deep Research Agents and Agent Foundation Models Training

Tianqing Fang,Zhisong Zhang,Xiaoyang Wang,Rui Wang,Can Qin,Yuxuan Wan,Jun-Yu Ma,Ce Zhang,Jiaqi Chen,Xiyun Li,Hongming Zhang,Haitao Mi,Dong Yu

Main category: cs.AI

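TL;DR: The paper introduces Cognitive Kernel-Pro, a fully open-source and free multi-module agent framework intended to advance the development and evaluation of advanced AI agents, addressing the reliance of existing systems on closed-source and paid APIs.

Details Motivation: Most existing agent systems are closed-source or depend on paid tools, which limits the accessibility and reproducibility of research. The team developed Cognitive Kernel-Pro to democratize AI agent development.

Contribution: 1. An open-source, free agent framework; 2. A study of how to curate high-quality training data; 3. Test-time reflection and voting strategies to improve performance.

Method: 1. Build a multi-module open-source framework; 2. Curate high-quality training data across four key domains (web, file, code, and general reasoning); 3. Introduce reflection and voting mechanisms.

Result: Achieves the best performance among open-source, free agents on the GAIA benchmark; the 8B-parameter open-source model surpasses systems such as WebDancer and WebSailor.

Insight: Open-source and free tools can advance AI agent research, and high-quality, diverse training data combined with test-time strategies can markedly improve performance.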

Abstract: General AI Agents are increasingly recognized as foundational frameworks for the next generation of artificial intelligence, enabling complex reasoning, web interaction, coding, and autonomous research capabilities. However, current agent systems are either closed-source or heavily reliant on a variety of paid APIs and proprietary tools, limiting accessibility and reproducibility for the research community. In this work, we present Cognitive Kernel-Pro, a fully open-source and (to the maximum extent) free multi-module agent framework designed to democratize the development and evaluation of advanced AI agents. Within Cognitive Kernel-Pro, we systematically investigate the curation of high-quality training data for Agent Foundation Models, focusing on the construction of queries, trajectories, and verifiable answers across four key domains: web, file, code, and general reasoning. Furthermore, we explore novel strategies for agent test-time reflection and voting to enhance agent robustness and performance. We evaluate Cognitive Kernel-Pro on GAIA, achieving state-of-the-art results among open-source and free agents. Notably, our 8B-parameter open-source model surpasses previous leading systems such as WebDancer and WebSailor, establishing a new performance standard for accessible, high-capability AI agents. Code is available at https://github.com/Tencent/CognitiveKernel-Pro
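
The abstract mentions test-time voting without describing the procedure. One simple reading, offered only as an illustrative sketch, is a majority vote over the final answers of several independent agent runs. The function `majority_vote` and its normalization rule (trim whitespace, lowercase) are assumptions and are not taken from the Cognitive Kernel-Pro code.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_vote(answers: Sequence[str]) -> Optional[str]:
    """Return the most frequent answer among several agent runs.

    Answers are normalized by trimming whitespace and lowercasing
    (an assumption made for this sketch). Ties are broken in favor
    of the answer that appeared first. Returns None if no run
    produced a non-empty answer.
    """
    normalized = [a.strip().lower() for a in answers if a and a.strip()]
    if not normalized:
        return None
    counts = Counter(normalized)
    best = max(counts.values())
    for answer in normalized:
        if counts[answer] == best:
            return answer
    return None
```

For example, three runs that return "Paris", "paris " and "Lyon" would be resolved to "paris". A real system would likely need a task-specific notion of answer equivalence rather than string normalization.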

[102] CoRGI: Verified Chain-of-Thought Reasoning with Visual Grounding

Shixin Yi,Lin Shang

Main category: cs.AI

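TL;DR: CoRGI adds a visual verification mechanism to improve multi-step reasoning in vision-language models, reducing hallucinations and making explanations more factual and helpful.

Details Motivation: CoT prompting in vision-language models lacks explicit verification against the visual content, so the generated explanations may contain hallucinations. A method that verifies reasoning steps is therefore needed.

Contribution: Proposes the CoRGI framework, which performs visual verification in three stages (generate a reasoning chain, extract visual evidence, and generate an answer that combines the two), improving the robustness and factuality of reasoning.

Method: A modular design that includes a visual evidence verification module (VEVM). It generates a textual reasoning chain, verifies it against visual evidence, and synthesizes the answer in separate stages, without end-to-end retraining.

Result: On the VCR benchmark, CoRGI improves the performance of Qwen-2.5VL and LLaVA-1.6; ablation studies confirm the contribution of each step in the verification module.

Insight: Visual verification is essential to the robustness of multimodal reasoning; grounding intermediate reasoning steps in visual evidence makes explanations more trustworthy.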

Abstract: Chain-of-Thought (CoT) prompting has shown promise in improving reasoning in vision-language models (VLMs), but it often produces explanations that are linguistically fluent yet lack grounding in visual content. We observe that such hallucinations arise in part from the absence of an explicit verification mechanism during multi-step reasoning. To address this, we propose CoRGI (Chain of Reasoning with Grounded Insights), a modular framework that introduces visual verification into the reasoning process. CoRGI follows a three-stage pipeline: it first generates a textual reasoning chain, then extracts supporting visual evidence for each reasoning step via a dedicated module (VEVM), and finally synthesizes the textual rationale with visual evidence to generate a grounded, verified answer. The framework can be integrated with existing VLMs without end-to-end retraining. We evaluate CoRGI on the VCR benchmark and find that it improves reasoning performance on two representative open-source VLM backbones, Qwen-2.5VL and LLaVA-1.6. Ablation studies confirm the contribution of each step in the verification module, and human evaluations suggest that CoRGI leads to more factual and helpful explanations. We also examine alternative designs for the visual verification step and discuss potential limitations of post-hoc verification frameworks. These findings highlight the importance of grounding intermediate reasoning steps in visual evidence to enhance the robustness of multimodal reasoning.

cs.LG [Back]

[103] Towards Higher Effective Rank in Parameter-efficient Fine-tuning using Khatri–Rao Product

Paul Albert,Frederic Z. Zhang,Hemanth Saratchandran,Anton van den Hengel,Ehsan Abbasnejad

Main category: cs.LG

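TL;DR: The paper proposes KRAdapter, a new parameter-efficient fine-tuning (PEFT) method that uses the Khatri-Rao product to produce weight updates with high effective rank. It addresses the limitations of low-rank adaptation (LoRA) when approximating matrices with flat spectra or high-frequency components, and shows performance gains on vision-language and large language models.

Details Motivation: Low-rank adaptation (LoRA) has been notably successful for parameter-efficient fine-tuning of large pre-trained models, but it is less effective on matrices with high effective rank or flat spectra. The paper sets out to compare full-rank and low-rank PEFT methods quantitatively and to propose a more adaptable alternative.

Contribution: 1) A quantitative comparison of full-rank and low-rank PEFT methods on a synthetic matrix approximation benchmark; 2) KRAdapter, which uses the Khatri-Rao product to produce weight updates with high effective rank; 3) Evidence of KRAdapter's efficiency and performance gains on vision-language and large language models.

Method: KRAdapter constructs the weight update matrix with the Khatri-Rao product, which tends to yield matrix products with high effective rank. It retains LoRA's memory and compute efficiency while improving the approximation of matrices with flat spectra or high-frequency components.

Result: KRAdapter shows performance gains on vision-language models up to 1B parameters and large language models up to 8B parameters, particularly on unseen common-sense reasoning tasks, while keeping memory and compute costs comparable to LoRA.

Insight: Weight updates with high effective rank matter in parameter-efficient fine-tuning, especially for complex tasks or high-frequency data. KRAdapter achieves this by construction, offering a more robust option for fine-tuning large models.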

Abstract: Parameter-efficient fine-tuning (PEFT) has become a standard approach for adapting large pre-trained models. Amongst PEFT methods, low-rank adaptation (LoRA) has achieved notable success. However, recent studies have highlighted its limitations compared with full-rank alternatives, particularly when applied to multimodal and large language models. In this work, we present a quantitative comparison of full-rank and low-rank PEFT methods using a synthetic matrix approximation benchmark with controlled spectral properties. Our results confirm that LoRA struggles to approximate matrices with relatively flat spectra or high-frequency components, both signs of high effective rank. To this end, we introduce KRAdapter, a novel PEFT algorithm that leverages the Khatri-Rao product to produce weight updates, which, by construction, tend to have a high effective rank. We demonstrate performance gains with KRAdapter on vision-language models up to 1B parameters and on large language models up to 8B parameters, particularly on unseen common-sense reasoning tasks. In addition, KRAdapter maintains the memory and compute efficiency of LoRA, making it a practical and robust alternative for fine-tuning billion-scale parameter models.
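
To make the rank argument concrete, the following is a small sketch under stated assumptions. It uses the standard column-wise Khatri-Rao product and compares the algebraic rank of a toy Khatri-Rao update with a LoRA-style product that has the same number of trainable values (16). The abstract does not specify KRAdapter's exact factor shapes or initialization, and algebraic rank is only a crude stand-in for the spectral notion of effective rank; `khatri_rao`, `matmul` and `matrix_rank` are helper names introduced for illustration.

```python
from fractions import Fraction


def khatri_rao(U, V):
    """Column-wise Kronecker product: U is (a x n), V is (b x n) -> (a*b x n)."""
    n = len(U[0])
    assert len(V[0]) == n, "factors must have the same number of columns"
    return [[U[i][j] * V[k][j] for j in range(n)]
            for i in range(len(U)) for k in range(len(V))]


def matmul(A, B):
    """Plain matrix product of nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def matrix_rank(M):
    """Exact algebraic rank via Gaussian elimination over the rationals."""
    rows = [[Fraction(x) for x in row] for row in M]
    rank = 0
    for col in range(len(rows[0])):
        pivot = next((r for r in range(rank, len(rows)) if rows[r][col] != 0), None)
        if pivot is None:
            continue
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        for r in range(rank + 1, len(rows)):
            factor = rows[r][col] / rows[rank][col]
            rows[r] = [x - factor * y for x, y in zip(rows[r], rows[rank])]
        rank += 1
    return rank


# Two 2x4 factors (16 values in total) give a 4x4 update.
U = [[1, 1, 0, 0], [0, 0, 1, 1]]
V = [[1, 0, 1, 0], [0, 1, 0, 1]]
kr_update = khatri_rao(U, V)

# A LoRA-style update with the same budget: B (4x2) times A (2x4).
B = [[1, 0], [0, 1], [1, 1], [2, 1]]
A = [[1, 2, 0, 1], [0, 1, 1, 3]]
lora_update = matmul(B, A)
```

With these toy values the Khatri-Rao update is the 4x4 identity and therefore has full rank, whereas the LoRA-style product cannot exceed rank 2 because its inner dimension is 2.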

[104] Stress-Aware Resilient Neural Training

Ashkan Shakarami,Yousef Yeganeh,Azade Farshad,Lorenzo Nicole,Stefano Ghidoni,Nassir Navab

Main category: cs.LG

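TL;DR: The paper proposes Stress-Aware Learning, a resilient neural training paradigm that dynamically adjusts optimization behavior under both stable and uncertain training dynamics, inspired by structural fatigue in materials science.

Details Motivation: To address the optimization stagnation that deep neural networks can encounter during training, and to escape sharp minima in favor of flatter solutions that generalize better.

Contribution: The main contribution is the Plastic Deformation Optimizer, an adaptive noise-injection mechanism driven by an internal stress signal, which improves model robustness and generalization.

Method: The method monitors stagnation in training loss and accuracy, adjusts optimization behavior dynamically, and draws on the concepts of elastic (temporary) and plastic (permanent) deformation to improve training.

Result: Experiments across six architectures, four optimizers, and seven vision benchmarks show improved robustness and generalization with minimal computational overhead.

Insight: The paper brings the concept of structural fatigue from materials science into neural network optimization, offering a new approach to resilient training.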

Abstract: This paper introduces Stress-Aware Learning, a resilient neural training paradigm in which deep neural networks dynamically adjust their optimization behavior - whether under stable training regimes or in settings with uncertain dynamics - based on the concept of Temporary (Elastic) and Permanent (Plastic) Deformation, inspired by structural fatigue in materials science. To instantiate this concept, we propose Plastic Deformation Optimizer, a stress-aware mechanism that injects adaptive noise into model parameters whenever an internal stress signal - reflecting stagnation in training loss and accuracy - indicates persistent optimization difficulty. This enables the model to escape sharp minima and converge toward flatter, more generalizable regions of the loss landscape. Experiments across six architectures, four optimizers, and seven vision benchmarks demonstrate improved robustness and generalization with minimal computational overhead. The code and 3D visuals will be available on GitHub: https://github.com/Stress-Aware-Learning/SAL.
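
The trigger described above (inject noise when a stress signal indicates persistent stagnation) can be sketched roughly as follows. This is an illustrative approximation and not the authors' Plastic Deformation Optimizer: the class `StressMonitor`, the `patience` and `min_delta` thresholds, and the Gaussian perturbation are all assumptions, and for brevity the sketch tracks only the training loss, whereas the abstract says the signal reflects both loss and accuracy.

```python
import random


class StressMonitor:
    """Counts consecutive steps without meaningful loss improvement."""

    def __init__(self, patience: int = 3, min_delta: float = 1e-3):
        self.patience = patience    # stalled steps tolerated before triggering
        self.min_delta = min_delta  # smallest decrease that counts as progress
        self.best = float("inf")
        self.stalled = 0

    def update(self, loss: float) -> bool:
        """Record a loss value; return True when noise should be injected."""
        if loss < self.best - self.min_delta:
            self.best = loss
            self.stalled = 0
        else:
            self.stalled += 1
        if self.stalled >= self.patience:
            self.stalled = 0  # reset after triggering a perturbation
            return True
        return False


def perturb(params, scale: float, rng: random.Random):
    """Return a copy of params with zero-mean Gaussian noise added."""
    return [p + rng.gauss(0.0, scale) for p in params]


# Usage: perturb the parameters only when the monitor reports stagnation.
monitor = StressMonitor(patience=2)
rng = random.Random(0)
params = [0.5, -1.2, 3.0]
for loss in [1.0, 0.5, 0.5, 0.5, 0.4]:
    if monitor.update(loss):
        params = perturb(params, scale=0.01, rng=rng)
```

In a real training loop the perturbation would be applied to the model's parameter tensors, and the noise scale would presumably adapt to the stress level, as the abstract's phrase "adaptive noise" suggests.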

cs.SE [Back]

[105] Benchmarking LLMs for Unit Test Generation from Real-World Functions

Dong Huang,Jie M. Zhang,Mark Harman,Qianru Zhang,Mingzhe Du,See-Kiong Ng

Main category: cs.SE

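TL;DR: The paper proposes the ULT benchmark for evaluating how well large language models generate unit tests for real-world Python functions, addressing the problems of data contamination and structurally simple functions.

Details Motivation: Existing benchmarks suffer from data contamination and structurally simple functions, so they cannot faithfully measure LLM capabilities; a more realistic benchmark is needed.

Contribution: Proposes the ULT and PLT benchmarks, which provide more realistic test generation tasks and make it possible to separate memorization from reasoning.

Method: A multi-stage curation process builds a set of high-complexity functions, ensuring data quality and mitigating contamination.

Result: On ULT, LLM test generation performance is markedly lower than on existing benchmarks, with clear drops in accuracy and the other metrics.

Insight: Realism and complexity are essential when evaluating LLMs; existing benchmarks may overestimate model capabilities.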

Abstract: Recently, large language models (LLMs) have shown great promise in automating unit test generation, significantly reducing the manual effort required by developers. To effectively evaluate the capabilities of LLMs in this domain, it is crucial to have a well-designed benchmark that accurately reflects real-world scenarios and mitigates common pitfalls. Existing LLM test generation benchmarks are limited by two critical drawbacks: data contamination and structurally simple function code. As a result, we often cannot rely on the validity of scientific conclusions drawn from empirical studies using these limited benchmarks. The empirical evidence presented may be biased due to contamination and may fail to generalize beyond toy programs due to structural simplicity. To address these problems, we introduce ULT (UnLeakedTestbench), a new benchmark specifically designed for function-level unit test generation from real-world Python functions. ULT is constructed through a multi-stage curation process that ensures high cyclomatic complexity and mitigates test case contamination. With 3,909 carefully selected function-level tasks, ULT provides a more realistic and challenging evaluation of LLMs’ test generation capabilities. We also provide PLT (PreLeakedTestbench), a pair benchmark of ULT with leaked tests designed to enable a controlled analysis of memorization versus reasoning in test generation. Our evaluation results demonstrate that ULT is significantly more challenging. For example, test cases generated by LLMs only achieve 41.32%, 45.10%, 30.22%, and 40.21% for accuracy, statement coverage, branch coverage, and mutation score on average for all LLMs, respectively. These results are substantially lower than the corresponding metrics on TestEval (91.79%, 92.18%, 82.04%, and 49.69%) and PLT (47.07%, 55.13%, 40.07%, and 50.80%).
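
The abstract says ULT selects functions with high cyclomatic complexity but does not name the tooling. As a rough sketch, McCabe-style complexity of a Python snippet can be approximated with the standard `ast` module by counting decision points. The function `cyclomatic_complexity` and the exact set of node types counted are assumptions made for illustration, not ULT's actual curation code.

```python
import ast

# Node types treated as a single decision point each (an approximation).
_DECISION_NODES = (
    ast.If, ast.For, ast.While, ast.AsyncFor,
    ast.ExceptHandler, ast.IfExp, ast.comprehension,
)


def cyclomatic_complexity(source: str) -> int:
    """Approximate McCabe complexity: 1 + number of decision points."""
    tree = ast.parse(source)
    complexity = 1
    for node in ast.walk(tree):
        if isinstance(node, _DECISION_NODES):
            complexity += 1
        elif isinstance(node, ast.BoolOp):
            # `a and b and c` adds two extra branches.
            complexity += len(node.values) - 1
    return complexity


example = (
    "def f(x):\n"
    "    if x > 0 and x < 10:\n"
    "        return 1\n"
    "    for i in range(x):\n"
    "        pass\n"
    "    return 0\n"
)
score = cyclomatic_complexity(example)  # 1 + if + and + for
```

A curation pipeline could keep only functions whose score exceeds some threshold; the threshold used by ULT is not given in the abstract.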

cs.RO [Back]

[106] UAV-ON: A Benchmark for Open-World Object Goal Navigation with Aerial Agents

Jianqiang Xiao,Yuexuan Sun,Yixin Shao,Boxi Gan,Rongqiang Liu,Yanjing Wu,Weili Gua,Xiang Deng

Main category: cs.RO

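TL;DR: The paper proposes UAV-ON, a benchmark for object goal navigation by aerial agents in open-world environments, addressing the reliance of existing Vision-and-Language Navigation (VLN) on detailed language instructions.

Details Motivation: Existing Vision-and-Language Navigation depends on sequential linguistic instructions, which limits agent scalability and autonomy. UAV-ON aims to advance aerial agent research through semantic goal navigation.

Contribution: Proposes the UAV-ON benchmark, which comprises 14 high-fidelity environments and 1270 annotated target objects and supports long-horizon navigation driven by semantic goals.

Method: Proposes a modular policy, Aerial ObjectNav Agent (AOA), which integrates instruction semantics with egocentric observations for goal-directed exploration.

Result: Experiments show that all baselines struggle on this task, highlighting the combined difficulty of aerial navigation and semantic goal grounding.

Insight: UAV-ON offers a new platform for studying semantic-goal-driven aerial navigation in complex open-world settings, advancing UAV autonomy.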

Abstract: Aerial navigation is a fundamental yet underexplored capability in embodied intelligence, enabling agents to operate in large-scale, unstructured environments where traditional navigation paradigms fall short. However, most existing research follows the Vision-and-Language Navigation (VLN) paradigm, which heavily depends on sequential linguistic instructions, limiting its scalability and autonomy. To address this gap, we introduce UAV-ON, a benchmark for large-scale Object Goal Navigation (ObjectNav) by aerial agents in open-world environments, where agents operate based on high-level semantic goals without relying on detailed instructional guidance as in VLN. UAV-ON comprises 14 high-fidelity Unreal Engine environments with diverse semantic regions and complex spatial layouts, covering urban, natural, and mixed-use settings. It defines 1270 annotated target objects, each characterized by an instance-level instruction that encodes category, physical footprint, and visual descriptors, allowing grounded reasoning. These instructions serve as semantic goals, introducing realistic ambiguity and complex reasoning challenges for aerial agents. To evaluate the benchmark, we implement several baseline methods, including Aerial ObjectNav Agent (AOA), a modular policy that integrates instruction semantics with egocentric observations for long-horizon, goal-directed exploration. Empirical results show that all baselines struggle in this setting, highlighting the compounded challenges of aerial navigation and semantic goal grounding. UAV-ON aims to advance research on scalable UAV autonomy driven by semantic goal descriptions in complex real-world environments.

[107] Omni-Scan: Creating Visually-Accurate Digital Twin Object Models Using a Bimanual Robot with Handover and Gaussian Splat Merging

Tianshuang Qiu,Zehan Ma,Karim El-Refai,Hiya Shah,Chung Min Kim,Justin Kerr,Ken Goldberg

Main category: cs.RO

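TL;DR: Omni-Scan is a bimanual robot pipeline for creating high-quality 3D Gaussian Splat models. It alternates grasps and rotates the object to cover occluded regions, uses several vision models to remove the background and gripper, and produces an omni-directional object model.

Details Motivation: Conventional 3D object scanning methods, such as multi-camera arrays or laser scanners, have restricted workspaces; a more flexible and efficient way to produce high-quality 3D models is needed.

Contribution: Proposes the Omni-Scan pipeline, in which a bimanual robot grasps the object with each gripper in turn, several vision models (such as DepthAnything and Segment Anything) remove background and gripper interference, and a modified 3D Gaussian Splat training pipeline produces a 360-degree, occlusion-free object model.

Method: A bimanual robot re-grasps the object to expose occluded regions; DepthAnything, Segment Anything, and RAFT optical flow models segment the object and remove the background; the 3D Gaussian Splat training pipeline is modified to support concatenated datasets with gripper occlusion.

Result: In defect inspection on 12 industrial and household objects, Omni-Scan reaches an average accuracy of 83%.

Insight: The alternating-grasp strategy of a bimanual robot handles occlusion effectively; combined with multi-model segmentation and the training changes, it yields high-quality 3D object models suited to a range of applications.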

Abstract: 3D Gaussian Splats (3DGSs) are 3D object models derived from multi-view images. Such “digital twins” are useful for simulations, virtual reality, marketing, robot policy fine-tuning, and part inspection. 3D object scanning usually requires multi-camera arrays, precise laser scanners, or robot wrist-mounted cameras, which have restricted workspaces. We propose Omni-Scan, a pipeline for producing high-quality 3D Gaussian Splat models using a bi-manual robot that grasps an object with one gripper and rotates the object with respect to a stationary camera. The object is then re-grasped by a second gripper to expose surfaces that were occluded by the first gripper. We present the Omni-Scan robot pipeline using DepthAnything, Segment Anything, as well as RAFT optical flow models to identify and isolate objects held by a robot gripper while removing the gripper and the background. We then modify the 3DGS training pipeline to support concatenated datasets with gripper occlusion, producing an omni-directional (360 degree view) model of the object. We apply Omni-Scan to part defect inspection, finding that it can identify visual or geometric defects in 12 different industrial and household objects with an average accuracy of 83%. Interactive videos of Omni-Scan 3DGS models can be found at https://berkeleyautomation.github.io/omni-scan/

cs.GR [Back]

[108] Occlusion-robust Stylization for Drawing-based 3D Animation

Sunjae Yoon,Gwanhyeong Koo,Younghwan Lee,Ji Woo Hong,Chang D. Yoo

Main category: cs.GR

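TL;DR: The paper proposes an Occlusion-robust Stylization Framework (OSF) for drawing-based 3D animation, which addresses the style degradation caused by occlusions under dynamic motion and improves both style consistency and efficiency.

Details Motivation: Existing stylization methods for 3D animation have a pose gap between training and inference (the occlusion conditions differ), which degrades style under dynamic motion, for example through contour flickering or stroke blurring.

Contribution: Proposes OSF, which uses optical flow to provide occlusion-robust edge guidance, and replaces the previous two-stage method with a single-stage one, improving inference speed and memory efficiency.

Method: OSF derives occlusion-robust edge guidance from optical flow and uses it as the input prior in place of conventional edge detection, performing stylization in a single-stage network.

Result: OSF keeps the style consistent under occlusion, with 2.4x faster inference and 2.1x less memory.

Insight: Optical flow can serve as effective edge guidance in occluded scenes, and a single-stage design improves efficiency while maintaining quality.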

Abstract: 3D animation aims to generate a 3D animated video from an input image and a target 3D motion sequence. Recent advances in image-to-3D models enable the creation of animations directly from users' hand drawings. In contrast to conventional 3D animation, drawing-based 3D animation must preserve the artist’s unique style properties, such as rough contours and distinct stroke patterns. However, recent methods still exhibit quality deterioration in style properties, especially under occlusions caused by overlapping body parts, leading to contour flickering and stroke blurring. This occurs due to a “stylization pose gap” between training and inference in stylization networks designed to preserve drawing styles in drawing-based 3D animation systems. The stylization pose gap refers to the fact that the target poses used to train the stylization network are always occlusion-free, while the target poses encountered at inference include diverse occlusions under dynamic motions. To this end, we propose Occlusion-robust Stylization Framework (OSF) for drawing-based 3D animation. We found that while an object’s edges can be an effective input prior for guiding stylization, they become notably inaccurate when occlusions occur at inference. Thus, our proposed OSF uses optical flow to provide occlusion-robust edge guidance for the stylization network, ensuring consistent stylization even under occlusions. Furthermore, OSF operates in a single run instead of the previous two-stage method, achieving 2.4x faster inference and 2.1x less memory.

[109] SpA2V: Harnessing Spatial Auditory Cues for Audio-driven Spatially-aware Video Generation

Kien T. Pham,Yingqing He,Yazhou Xing,Qifeng Chen,Long Chen

Main category: cs.GR

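TL;DR: SpA2V is a framework that uses spatial auditory cues to generate videos that are semantically and spatially aligned with the input audio, by splitting generation into audio-guided video planning and layout-grounded video generation.

Details Motivation: Existing methods focus mainly on the semantic information in audio and overlook spatial attributes such as location and direction, which humans perceive naturally. SpA2V aims to close this gap.

Contribution: Proposes SpA2V, the first audio-driven video generation framework that explicitly exploits spatial auditory cues, and achieves high-quality output with a two-stage approach.

Method: 1) Audio-guided video planning: an MLLM constructs Video Scene Layouts (VSLs); 2) Layout-grounded video generation: the VSLs are fed as conditional guidance into pre-trained diffusion models.

Result: Experiments show that SpA2V generates realistic videos that are semantically and spatially aligned with the input audio.

Insight: Spatial auditory cues, such as loudness and frequency, are important for the spatial accuracy of generated video, and the two-stage approach effectively bridges the audio and video modalities.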

Abstract: Audio-driven video generation aims to synthesize realistic videos that align with input audio recordings, akin to the human ability to visualize scenes from auditory input. However, existing approaches predominantly focus on exploring semantic information, such as the classes of sounding sources present in the audio, limiting their ability to generate videos with accurate content and spatial composition. In contrast, we humans can not only naturally identify the semantic categories of sounding sources but also determine their deeply encoded spatial attributes, including locations and movement directions. This useful information can be elucidated by considering specific spatial indicators derived from the inherent physical properties of sound, such as loudness or frequency. As prior methods largely ignore this factor, we present SpA2V, the first framework that explicitly exploits these spatial auditory cues from audio to generate videos with high semantic and spatial correspondence. SpA2V decomposes the generation process into two stages: 1) Audio-guided Video Planning: We meticulously adapt a state-of-the-art MLLM for a novel task of harnessing spatial and semantic cues from input audio to construct Video Scene Layouts (VSLs). This serves as an intermediate representation to bridge the gap between the audio and video modalities. 2) Layout-grounded Video Generation: We develop an efficient and effective approach to seamlessly integrate VSLs as conditional guidance into pre-trained diffusion models, enabling VSL-grounded video generation in a training-free manner. Extensive experiments demonstrate that SpA2V excels in generating realistic videos with semantic and spatial alignment to the input audio.