cs.CL

[1] DeepWriter: A Fact-Grounded Multimodal Writing Assistant Based On Offline Knowledge Base

Song Mao,Lejun Cheng,Pinlong Cai,Guohang Yan,Ding Wang,Botian Shi

Main category: cs.CL

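TL;DR: DeepWriter is a fact-grounded multimodal writing assistant built on an offline knowledge base. Through task decomposition, outline generation, multimodal retrieval, and section-by-section composition, it addresses LLM hallucination and knowledge gaps in specialized domains and significantly improves the factual accuracy and quality of generated documents.

Motivation: Existing LLMs used as writing assistants in specialized domains (such as finance, medicine, and law) are often limited by a lack of deep domain knowledge and by hallucination. Conventional retrieval-augmented generation (RAG) suffers from inconsistency across retrieval steps, while online search-based methods lose quality because web content is unreliable.

Contribution: 1. Proposes DeepWriter, a multimodal writing assistant based on an offline knowledge base; 2. Designs a novel pipeline comprising task decomposition, outline generation, multimodal retrieval, and section-by-section composition; 3. Proposes a hierarchical knowledge representation to improve retrieval efficiency; 4. Shows experimentally that, on financial report generation, DeepWriter outperforms existing baselines in factual accuracy and content quality.

Method: DeepWriter follows this pipeline: 1. task decomposition; 2. outline generation; 3. multimodal retrieval from a structured knowledge base; 4. section-by-section composition with reflection. A hierarchical knowledge representation is also introduced to optimize retrieval.

Result: In financial report generation experiments, documents produced by DeepWriter outperform existing baselines in both factual accuracy and content quality, which supports the effectiveness of the approach.

Insight: Combining an offline knowledge base with multimodal retrieval can effectively reduce hallucination and improve the factual accuracy of writing assistants in specialized domains; hierarchical knowledge representation is key to retrieval efficiency.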

Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in various applications. However, their use as writing assistants in specialized domains like finance, medicine, and law is often hampered by a lack of deep domain-specific knowledge and a tendency to hallucinate. Existing solutions, such as Retrieval-Augmented Generation (RAG), can suffer from inconsistency across multiple retrieval steps, while online search-based methods often degrade quality due to unreliable web content. To address these challenges, we introduce DeepWriter, a customizable, multimodal, long-form writing assistant that operates on a curated, offline knowledge base. DeepWriter leverages a novel pipeline that involves task decomposition, outline generation, multimodal retrieval, and section-by-section composition with reflection. By deeply mining information from a structured corpus and incorporating both textual and visual elements, DeepWriter generates coherent, factually grounded, and professional-grade documents. We also propose a hierarchical knowledge representation to enhance retrieval efficiency and accuracy. Our experiments on financial report generation demonstrate that DeepWriter produces high-quality, verifiable articles that surpass existing baselines in factual accuracy and generated content quality.

[2] Open-Source LLMs Collaboration Beats Closed-Source LLMs: A Scalable Multi-Agent System

Shengji Tang,Jianjian Cao,Weihao Lin,Jiale Hong,Bo Zhang,Shuyue Hu,Lei Bai,Tao Chen,Wanli Ouyang,Peng Ye

Main category: cs.CL

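TL;DR: The paper proposes the SMACS framework, which surpasses closed-source LLMs on multiple tasks through collaboration among multiple open-source LLMs.

Motivation: To examine whether collaboration among multiple open-source LLMs can match or exceed the performance of closed-source LLMs, highlighting the potential of open-source collectives.

Contribution: Proposes the SMACS framework, comprising Retrieval-based Prior Selection (RPS) and Exploration-Exploitation-Driven Posterior Enhancement (EPE), which effectively integrates open-source LLMs and improves performance.

Method: RPS assigns a performance score to each LLM and selects the Top-k LLMs; EPE generates diverse responses and then selects the high-quality answer.

Result: On 8 mainstream benchmarks, SMACS integrates 15 open-source LLMs, outperforms closed-source LLMs (such as Claude-3.7-Sonnet and GPT-4.1), and pushes the upper bound of intelligence.

Insight: Open-source collectives have great collaborative potential; multi-agent systems can significantly improve AI performance and offer a new direction for future intelligence integration.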

Abstract: This paper aims to demonstrate the potential and strengths of open-source collectives. It leads to a promising question: Can we harness multiple open-source LLMs to match or even beat the closed-source LLMs? To answer this, we propose SMACS, a scalable multi-agent collaboration system (MACS) framework with high performance. Specifically, for continuous integration of new LLMs and generalization to diverse questions, we first propose a Retrieval-based Prior Selection (RPS), which assigns a proxy performance score to each LLM to select the Top-k LLMs at the instance level for any given question. Then, we propose an Exploration-Exploitation-Driven Posterior Enhancement (EPE), encouraging the generation of diverse responses through prior dropping and selecting the high-quality response via a hybrid posterior score. Experiments on eight mainstream benchmarks validate the effectiveness of our SMACS: by integrating fifteen open-source LLMs, SMACS outperforms leading closed-source LLMs in 2025, e.g., Claude-3.7-Sonnet (+12.73%), GPT-4.1 (+5.36%), and GPT-o3-mini (+5.28%) across multiple tasks. Remarkably, it even exceeds the average of best results of different datasets from both open-source LLMs (+2.86%) and closed-source LLMs (+2.04%), pushing the upper bound of intelligence. Code will be released at https://github.com/magent4aci/SMACS.
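
The instance-level selection in RPS can be pictured as a nearest-neighbour lookup over previously answered questions. The following is only a minimal sketch under stated assumptions: the question bank, the bag-of-words similarity, the averaging rule, and the names `proxy_scores` and `select_top_k` are all hypothetical, since the abstract does not specify the retriever or how the proxy score is computed.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; a real system would likely use a dense encoder.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def proxy_scores(question, bank, n_neighbors=2):
    """Average each model's recorded correctness over the most similar bank questions."""
    q = embed(question)
    ranked = sorted(bank, key=lambda item: cosine(q, embed(item["question"])), reverse=True)
    neighbors = ranked[:n_neighbors]
    models = neighbors[0]["correct"].keys()
    return {m: sum(n["correct"][m] for n in neighbors) / len(neighbors) for m in models}

def select_top_k(question, bank, k=2, n_neighbors=2):
    """Pick the k models with the highest proxy score (ties broken by name)."""
    scores = proxy_scores(question, bank, n_neighbors)
    return sorted(scores, key=lambda m: (-scores[m], m))[:k]

# Hypothetical bank: 1 means the model answered that question correctly.
BANK = [
    {"question": "solve the integral of x squared",
     "correct": {"model_a": 1, "model_b": 0, "model_c": 1}},
    {"question": "compute the integral of sin x",
     "correct": {"model_a": 1, "model_b": 0, "model_c": 0}},
    {"question": "translate this sentence to french",
     "correct": {"model_a": 0, "model_b": 1, "model_c": 1}},
]

print(select_top_k("what is the integral of x cubed", BANK, k=2))
```

Because scores are computed per question, adding a new model only requires recording its correctness on the bank, which is one plausible reading of the "continuous integration of new LLMs" goal.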

[3] CCL-XCoT: An Efficient Cross-Lingual Knowledge Transfer Method for Mitigating Hallucination Generation

Weihua Zheng,Roy Ka-Wei Lee,Zhengyuan Liu,Kui Wu,AiTi Aw,Bowei Zou

Main category: cs.CL

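TL;DR: The paper proposes CCL-XCoT, a two-stage fine-tuning framework that uses curriculum-based contrastive learning and a cross-lingual chain-of-thought prompting strategy to significantly reduce hallucination by multilingual LLMs in low-resource languages.

Motivation: Multilingual LLMs are prone to hallucination in low-resource languages, which undermines the accuracy and reliability of their output, especially in domain-specific tasks. The paper aims to address this through cross-lingual knowledge transfer.

Contribution: 1. Proposes the CCL-XCoT framework, combining curriculum-based contrastive learning with cross-lingual chain-of-thought prompting; 2. Strengthens cross-lingual semantic alignment during continued pre-training; 3. Introduces a cross-lingual step-by-step reasoning strategy during instruction fine-tuning.

Method: 1. Stage one: curriculum-based contrastive learning plus next-token prediction to strengthen semantic alignment; 2. Stage two: a cross-lingual chain-of-thought (XCoT) strategy that reasons in a high-resource language first, then generates the answer in the low-resource language.

Result: Experiments show that CCL-XCoT reduces the hallucination rate by up to 62% and substantially improves cross-lingual transfer of factual knowledge, without relying on external retrieval or multi-model ensembles.

Insight: With a staged design and cross-lingual reasoning, the model makes better use of knowledge from high-resource languages and so hallucinates less in low-resource languages. The approach may generalize to other generation tasks.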

Abstract: Multilingual Large Language Models (MLLMs) demonstrate strong generalization across languages, yet they remain prone to hallucinations, especially in low-resource languages, due to training data imbalances. These hallucinations, which include inaccurate or fabricated outputs, are particularly problematic in domain-specific generation tasks (Chataigner et al., 2024). To address this challenge, we propose CCL-XCoT (Curriculum-based Contrastive Learning-based Cross-lingual Chain-of-Thought), a two-stage fine-tuning framework for mitigating hallucination in MLLMs. Our approach first enhances cross-lingual semantic alignment through curriculum-based contrastive learning combined with next-token prediction during continued pre-training. Building on this foundation, we then introduce a cross-lingual Chain-of-Thought (XCoT) prompting strategy during instruction fine-tuning, which guides the model to reason in a high-resource language before generating answers in the target low-resource language. Experimental results show that CCL-XCoT reduces hallucination rates by up to 62% and substantially improves factual knowledge transfer across language pairs, without relying on external retrieval or multi-model ensembles.
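
To make the contrastive alignment idea concrete, here is a minimal sketch of an in-batch InfoNCE loss over toy sentence embeddings, where `src[i]` and `tgt[i]` stand for a translation pair. This is an assumption for illustration: the abstract does not state the exact loss, the temperature, or how the curriculum orders examples, so none of this should be read as the authors' formulation.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]

def info_nce(src, tgt, temperature=0.1):
    """Mean InfoNCE loss; src[i] and tgt[i] are assumed to be a translation pair."""
    src = [normalize(v) for v in src]
    tgt = [normalize(v) for v in tgt]
    total = 0.0
    for i, s in enumerate(src):
        # Similarity of sentence i to every target sentence in the batch.
        logits = [dot(s, t) / temperature for t in tgt]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        # Negative log-probability of picking the true translation.
        total += log_denominator - logits[i]
    return total / len(src)

# Toy 2-d embeddings: aligned pairs give a low loss, swapped pairs a high one.
high_resource = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]
print(info_nce(high_resource, aligned) < info_nce(high_resource, swapped))
```

Minimizing such a loss pulls each sentence toward its translation and away from the other sentences in the batch, which is the general sense in which contrastive training improves cross-lingual alignment.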

[4] HuggingGraph: Understanding the Supply Chain of LLM Ecosystem

Mohammad Shahedur Rahman,Peng Gao,Yuede Ji

Main category: cs.CL

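TL;DR: The paper studies the supply-chain relationships between models and datasets in the LLM ecosystem. It builds a large, sparse, and dynamic directed heterogeneous graph that reveals the pivotal role of datasets in training and the strong interdependence between models and datasets.

Motivation: Developing, training, and deploying LLMs requires extensive computational resources and datasets, and many LLMs are built from base models and external datasets, so they may inherit vulnerabilities or biases. Understanding the origin and development of these components is therefore essential for detecting potential risks, improving model fairness, and ensuring compliance.

Contribution: 1. Designs a method to systematically collect LLM supply chain data; 2. Builds a directed heterogeneous graph with 397,376 nodes and 453,469 edges; 3. Reveals the structural characteristics and dynamics of the LLM supply chain graph through analysis.

Method: 1. Systematically collect LLM supply chain data; 2. Build a directed heterogeneous graph; 3. Analyze the graph, including its degree distribution, connectivity patterns, and changes over time.

Result: The LLM supply chain graph has a power-law degree distribution and a densely connected core with a fragmented periphery; datasets play a pivotal role in training; models and datasets are strongly interdependent; and the graph structure is updated daily.

Insight: The dynamics and complexity of the LLM supply chain open new research directions for risk detection and compliance management; the role of datasets and the dependencies among models merit further study.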

Abstract: Large language models (LLMs) leverage deep learning to process and predict sequences of words from context, enabling them to perform various NLP tasks, such as translation, summarization, question answering, and content generation. However, the growing size and complexity of developing, training, and deploying advanced LLMs require extensive computational resources and large datasets. This creates a barrier for users. As a result, platforms that host models and datasets are widely used. For example, Hugging Face, one of the most popular platforms, hosted 1.8 million models and 450K datasets by June 2025, with no sign of slowing down. Since many LLMs are built from base models, pre-trained models, and external datasets, they can inherit vulnerabilities, biases, or malicious components from earlier models or datasets. Therefore, it is critical to understand the origin and development of these components to better detect potential risks, improve model fairness, and ensure compliance. Motivated by this, our project aims to study the relationships between models and datasets, which are core components of the LLM supply chain. First, we design a method to systematically collect LLM supply chain data. Using this data, we build a directed heterogeneous graph to model the relationships between models and datasets, resulting in a structure with 397,376 nodes and 453,469 edges. We then perform various analyses and uncover several findings, such as: (i) the LLM supply chain graph is large, sparse, and follows a power-law degree distribution; (ii) it features a densely connected core and a fragmented periphery; (iii) datasets play pivotal roles in training; (iv) strong interdependence exists between models and datasets; and (v) the graph is dynamic, with daily updates reflecting the ecosystem’s ongoing evolution.
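
A small sketch can show what a directed heterogeneous graph of models and datasets looks like and how a degree distribution or an inheritance query is computed on it. The node names and the relation labels (`finetuned_into`, `trained`) below are hypothetical placeholders, not the schema used in the paper.

```python
from collections import Counter, defaultdict

# Each edge is (source, relation, target); node ids carry a type prefix.
EDGES = [
    ("model:base", "finetuned_into", "model:chat"),
    ("model:base", "finetuned_into", "model:code"),
    ("dataset:web", "trained", "model:base"),
    ("dataset:dialog", "trained", "model:chat"),
    ("dataset:web", "trained", "model:code"),
]

def node_type(node: str) -> str:
    """Return 'model' or 'dataset' from the id prefix."""
    return node.split(":", 1)[0]

def degrees(edges):
    """Out-degree and in-degree of every node."""
    out_deg, in_deg = Counter(), Counter()
    for src, _, dst in edges:
        out_deg[src] += 1
        in_deg[dst] += 1
    return out_deg, in_deg

def degree_histogram(edges):
    """Map total degree -> number of nodes with that degree."""
    out_deg, in_deg = degrees(edges)
    nodes = set(out_deg) | set(in_deg)
    return Counter(out_deg[n] + in_deg[n] for n in nodes)

def downstream(edges, start):
    """All nodes reachable from `start`, i.e. everything that may inherit from it."""
    adjacency = defaultdict(list)
    for src, _, dst in edges:
        adjacency[src].append(dst)
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for nxt in adjacency[node]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

print(degree_histogram(EDGES))
print(sorted(downstream(EDGES, "dataset:web")))
```

On the real graph, plotting the histogram on log-log axes is the usual first check for a power-law shape, and a reachability query like `downstream` is one way to ask which models could inherit a problem from a given dataset.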

[5] In-Depth and In-Breadth: Pre-training Multimodal Language Models Customized for Comprehensive Chart Understanding

Wan-Cyuan Fan,Yen-Chun Chen,Mengchen Liu,Alexander Jacobson,Lu Yuan,Leonid Sigal

Main category: cs.CL

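TL;DR: ChartScope is a multimodal language model customized for many chart types. With an efficient data generation pipeline and a Dual-Path training strategy, it significantly improves understanding of chart content and the underlying data.

Motivation: Existing chart-understanding methods have two major limitations: they rely on paired data from only a few chart types, which limits generalization; and they lack pre-training targeted at chart-data alignment, which hampers understanding of the underlying data.

Contribution: 1. Proposes ChartScope, which supports in-depth understanding of diverse charts; 2. Designs an efficient data generation pipeline and a Dual-Path training strategy; 3. Establishes the ChartDQA benchmark, covering question answering at multiple levels and evaluation of underlying data understanding.

Method: 1. Synthesize paired data for many chart types with the data generation pipeline; 2. Apply the Dual-Path training strategy, which attends to chart details while preserving reasoning ability; 3. Incorporate reasoning tasks over the underlying data.

Result: Experiments show that ChartScope significantly improves comprehension across a wide range of chart types.

Insight: With synthetic data and Dual-Path training, the model can understand not only the surface information in a chart but also its underlying data, which suggests a new approach to customizing multimodal models for specific domains.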

Abstract: Recent methods for customizing Large Vision Language Models (LVLMs) for domain-specific tasks have shown promising results in scientific chart comprehension. However, existing approaches face two major limitations: First, they rely on paired data from only a few chart types, limiting generalization to a wide range of chart types. Second, they lack targeted pre-training for chart-data alignment, which hampers the model’s understanding of underlying data. In this paper, we introduce ChartScope, an LVLM optimized for in-depth chart comprehension across diverse chart types. We propose an efficient data generation pipeline that synthesizes paired data for a wide range of chart types, along with a novel Dual-Path training strategy that enables the model to succinctly capture essential data details while preserving robust reasoning capabilities by incorporating reasoning over the underlying data. Lastly, we establish ChartDQA, a new benchmark for evaluating not only question-answering at different levels but also underlying data understanding. Experimental results demonstrate that ChartScope significantly enhances comprehension on a wide range of chart types. The code and data are available at https://davidhalladay.github.io/chartscope_demo.

[6] How LLMs Comprehend Temporal Meaning in Narratives: A Case Study in Cognitive Evaluation of LLMs

Karin de Langis,Jong Inn Park,Andreas Schramm,Bin Hu,Khanh Chi Le,Michael Mensink,Ahn Thu Tong,Dongyeop Kang

Main category: cs.CL

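TL;DR: Using an expert-in-the-loop approach, the paper studies how large language models (LLMs) process temporal meaning in narratives. It finds that LLMs over-rely on prototypicality, make inconsistent judgments, and are weak at causal reasoning, indicating significant differences from human cognition.

Motivation: To examine whether LLMs show human-like cognitive abilities when interpreting temporal meaning, or whether they merely rely on pattern recognition.

Contribution: (1) Reveals the limitations of LLMs in processing temporal meaning; (2) Develops a standardized experimental framework for assessing the cognitive capabilities of LLMs.

Method: Uses an Expert-in-the-Loop probing pipeline with targeted experiments to assess the semantic representations and pragmatic inferences of LLMs.

Result: LLMs over-rely on prototypicality, produce inconsistent aspectual judgments, and struggle with causal reasoning derived from aspect, which points to shortcomings in narrative understanding.

Insight: LLMs process temporal meaning in a fundamentally different way from humans, which suggests that the narrative understanding of current LLMs needs further improvement.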

Abstract: Large language models (LLMs) exhibit increasingly sophisticated linguistic capabilities, yet the extent to which these behaviors reflect human-like cognition versus advanced pattern recognition remains an open question. In this study, we investigate how LLMs process the temporal meaning of linguistic aspect in narratives that were previously used in human studies. Using an Expert-in-the-Loop probing pipeline, we conduct a series of targeted experiments to assess whether LLMs construct semantic representations and pragmatic inferences in a human-like manner. Our findings show that LLMs over-rely on prototypicality, produce inconsistent aspectual judgments, and struggle with causal reasoning derived from aspect, raising concerns about their ability to fully comprehend narratives. These results suggest that LLMs process aspect fundamentally differently from humans and lack robust narrative understanding. Beyond these empirical findings, we develop a standardized experimental framework for the reliable assessment of LLMs’ cognitive and linguistic capabilities.

[7] X-Intelligence 3.0: Training and Evaluating Reasoning LLM for Semiconductor Display

Xiaolin Yan,Yangxing Liu,Jiazhang Zheng,Chi Liu,Mingyu Du,Caisheng Chen,Haoyang Liu,Ming Ding,Yuan Li,Qiuping Liao,Linfeng Li,Zhili Mei,Siyu Wan,Li Li,Ruyi Zhong,Jiangling Yu,Xule Liu,Huihui Hu,Jiameng Yue,Ruohui Cheng,Qi Yang,Liangqing Wu,Ke Zhu,Chi Zhang,Chufei Jing,Yifan Zhou,Yan Liang,Dongdong Li,Zhaohui Wang,Bin Zhao,Mingzhou Wu,Mingzhong Zhou,Peng Du,Zuomin Liao,Chao Dai,Pengfei Liang,Xiaoguang Zhu,Yu Zhang,Yu Gu,Kun Pan,Yuan Wu,Yanqing Guan,Shaojing Wu,Zikang Feng,Xianze Ma,Peishan Cheng,Wenjuan Jiang,Jing Ba,Huihao Yu,Zeping Hu,Yuan Xu,Zhiwei Liu,He Wang,Zhenguo Lin,Ming Liu,Yanhong Meng

Main category: cs.CL

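TL;DR: X-Intelligence 3.0 is a high-performance reasoning LLM developed for the semiconductor display industry. It improves reasoning through a domain knowledge base and reinforcement learning and, at 32B parameters, surpasses the SOTA model DeepSeek-R1-671B.

Motivation: Current large language models (LLMs) perform well on general reasoning tasks but fall short in the specialized domain of the semiconductor display industry, where they lack domain-specific knowledge and training.

Contribution: Presents X-Intelligence 3.0, the first high-performance reasoning model tailored to the semiconductor display industry, which achieves efficient reasoning through fine-tuning on a domain knowledge base, reinforcement learning, and an automated evaluation framework.

Method: Applies supervised fine-tuning and reinforcement learning with a domain knowledge base, integrates a retrieval-augmented generation (RAG) mechanism, and develops an automated evaluation framework that simulates expert-level assessment.

Result: At 32B parameters, X-Intelligence 3.0 outperforms the 671B-parameter DeepSeek-R1-671B across multiple evaluations.

Insight: Domain-specific knowledge bases and training methods can significantly improve LLM performance on specialized tasks; with careful design, a smaller model can outperform a much larger one.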

Abstract: Large language models (LLMs) have recently achieved significant advances in reasoning and demonstrated their advantages in solving challenging problems. Yet, their effectiveness in the semiconductor display industry remains limited due to a lack of domain-specific training and expertise. To bridge this gap, we present X-Intelligence 3.0, the first high-performance reasoning model specifically developed for the semiconductor display industry. This model is designed to deliver expert-level understanding and reasoning for the industry’s complex challenges. Leveraging a carefully curated industry knowledge base, the model undergoes supervised fine-tuning and reinforcement learning to enhance its reasoning and comprehension capabilities. To further accelerate development, we implemented an automated evaluation framework that simulates expert-level assessments. We also integrated a domain-specific retrieval-augmented generation (RAG) mechanism, resulting in notable performance gains on benchmark datasets. Despite its relatively compact size of 32 billion parameters, X-Intelligence 3.0 outperforms SOTA DeepSeek-R1-671B across multiple evaluations. This demonstrates its exceptional efficiency and establishes it as a powerful solution to the longstanding reasoning challenges faced by the semiconductor display industry.

[8] XL-DURel: Finetuning Sentence Transformers for Ordinal Word-in-Context Classification

Sachin Yadav,Dominik Schlechtweg

Main category: cs.CL

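TL;DR: The paper proposes XL-DURel, a multilingual Sentence Transformer model for ordinal Word-in-Context classification. Optimized with a ranking objective based on angular distance in complex space, it outperforms previous models on both ordinal and binary tasks.

Motivation: To explore how Sentence Transformer models can be optimized for ordinal Word-in-Context classification, and to unify the modeling of ordinal and binary tasks.

Contribution: Proposes the XL-DURel model, shows that optimizing for the ordinal task improves performance on the binary task, and proposes a ranking objective based on angular distance in complex space.

Method: Fine-tunes a Sentence Transformer model with several loss functions, in particular a ranking objective based on angular distance in complex space.

Result: The model outperforms previous methods on ordinal and binary tasks, confirming that optimizing for the ordinal task benefits the binary task.

Insight: Ordinal Word-in-Context classification can be seen as a generalization of the binary task; optimizing for the ordinal task unifies the different task formulations and improves overall performance.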

Abstract: We propose XL-DURel, a finetuned, multilingual Sentence Transformer model optimized for ordinal Word-in-Context classification. We test several loss functions for regression and ranking tasks, managing to outperform previous models on ordinal and binary data with a ranking objective based on angular distance in complex space. We further show that binary WiC can be treated as a special case of ordinal WiC and that optimizing models for the general ordinal task improves performance on the more specific binary task. This paves the way for a unified treatment of WiC modeling across different task formulations.

[9] Exploring Human-AI Complementarity in CPS Diagnosis Using Unimodal and Multimodal BERT Models

Kester Wong,Sahan Bulathwela,Mutlu Cukurova

Main category: cs.CL

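TL;DR: The paper examines performance differences between unimodal and multimodal BERT models on collaborative problem solving (CPS) diagnosis, and how human-AI complementarity can improve diagnosis.

Motivation: Prior comparisons of the multimodal BERT model (AudiBERT) with the unimodal BERT model left the statistical significance of the gains unclear and gave little guidance on applying human-AI complementarity, which motivated further investigation.

Contribution: 1. Confirms statistically significant improvements by AudiBERT on sparse classes and in the social-cognitive dimension; 2. Uses correlation analysis to relate training data volume to model performance; 3. Proposes a structured framework for human-AI complementarity.

Method: 1. Classify CPS indicators with the unimodal BERT and multimodal AudiBERT models; 2. Use correlation analysis to assess how data volume and human inter-rater agreement affect model performance; 3. Propose a structured approach that incorporates model explainability.

Result: 1. AudiBERT improves significantly in the social-cognitive dimension but not in the affective dimension; 2. Larger training data is significantly associated with higher recall; 3. The precision of the BERT model is associated with human inter-rater agreement.

Insight: 1. Multimodal data can improve classification in specific dimensions; 2. Human-AI complementarity requires support from model explainability; 3. Data volume and annotation agreement are key factors.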

Abstract: Detecting collaborative problem solving (CPS) indicators from dialogue using machine learning techniques is a significant challenge for the field of AI in Education. Recent studies have explored the use of Bidirectional Encoder Representations from Transformers (BERT) models on transcription data to reliably detect meaningful CPS indicators. A notable advancement involved the multimodal BERT variant, AudiBERT, which integrates speech and acoustic-prosodic audio features to enhance CPS diagnosis. Although initial results demonstrated multimodal improvements, the statistical significance of these enhancements remained unclear, and there was insufficient guidance on leveraging human-AI complementarity for CPS diagnosis tasks. This workshop paper extends the previous research by highlighting that the AudiBERT model not only improved the classification of classes that were sparse in the dataset, but it also had statistically significant class-wise improvements over the BERT model for classifications in the social-cognitive dimension. However, similar significant class-wise improvements over the BERT model were not observed for classifications in the affective dimension. A correlation analysis highlighted that larger training data was significantly associated with higher recall performance for both the AudiBERT and BERT models. Additionally, the precision of the BERT model was significantly associated with high inter-rater agreement among human coders. When employing the BERT model to diagnose indicators within these subskills that were well-detected by the AudiBERT model, the performance across all indicators was inconsistent. We conclude the paper by outlining a structured approach towards achieving human-AI complementarity for CPS diagnosis, highlighting the crucial inclusion of model explainability to support human agency and engagement in the reflective coding process.

[10] Explainable Collaborative Problem Solving Diagnosis with BERT using SHAP and its Implications for Teacher Adoption

Kester Wong,Sahan Bulathwela,Mutlu Cukurova

Main category: cs.CL

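TL;DR: The paper studies how SHAP can improve the explainability of BERT-based collaborative problem solving (CPS) diagnosis models. It finds that well-performing classifications are not necessarily well-founded and proposes human-AI complementarity as a research direction.

Motivation: BERT models for CPS classification have been studied extensively in AI in Education, but little is known about how individual words contribute to the model's classification decisions. Better explainability helps end users such as teachers understand and use the model, which builds trust and supports wider adoption in education.

Contribution: 1. Uses SHAP to analyze how individual words contribute to CPS classification in a BERT model. 2. Shows that well-performing classifications are not necessarily reasonable, and identifies words that influence classification, including words with no semantic relevance. 3. Proposes that future research focus on ensemble model architectures and human-AI complementarity.

Method: Uses SHAP (SHapley Additive exPlanations) to analyze word-level contributions of a BERT model on CPS transcription data, and examines experimentally how explainable the classification results are.

Result: Certain words, including semantically irrelevant ones, significantly affect classification, and well-performing classifications are not necessarily reasonable. Model transparency may not directly help teachers improve their practice, but it can keep them from over-relying on model diagnoses.

Insight: 1. Model explainability concerns not only performance but also whether the reasoning is sound. 2. Models alone struggle with fine-grained classification of CPS subskills and need to be combined with human reasoning. Future research should explore ensemble models and human-AI collaboration.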

Abstract: The use of Bidirectional Encoder Representations from Transformers (BERT) model and its variants for classifying collaborative problem solving (CPS) has been extensively explored within the AI in Education community. However, limited attention has been given to understanding how individual tokenised words in the dataset contribute to the model’s classification decisions. Enhancing the explainability of BERT-based CPS diagnostics is essential to better inform end users such as teachers, thereby fostering greater trust and facilitating wider adoption in education. This study undertook a preliminary step towards model transparency and explainability by using SHapley Additive exPlanations (SHAP) to examine how different tokenised words in transcription data contributed to a BERT model’s classification of CPS processes. The findings suggested that well-performing classifications did not necessarily equate to a reasonable explanation for the classification decisions. Particular tokenised words were used frequently to affect classifications. The analysis also identified a spurious word, which contributed positively to the classification but was not semantically meaningful to the class. While such model transparency is unlikely to be useful to an end user to improve their practice, it can help them not to overrely on LLM diagnostics and ignore their human expertise. We conclude the workshop paper by noting that the extent to which the model appropriately uses the tokens for its classification is associated with the number of classes involved. It calls for an investigation into the exploration of ensemble model architectures and the involvement of human-AI complementarity for CPS diagnosis, since considerable human reasoning is still required for fine-grained discrimination of CPS subskills.
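
The quantity SHAP attributes to each token is a Shapley value: the token's average marginal effect on the model score over all subsets of the other tokens. The sketch below computes it exactly for a three-token toy example. The scoring function `toy_score` and its weights are invented stand-ins for a classifier's score on one CPS class, not a BERT model, and the SHAP library itself relies on approximations rather than this brute-force enumeration.

```python
from itertools import combinations
from math import factorial

def shapley_values(tokens, value):
    """Exact Shapley value of each (unique) token for a set function `value`."""
    n = len(tokens)
    result = {}
    for tok in tokens:
        others = [t for t in tokens if t != tok]
        phi = 0.0
        for size in range(n):
            # Weight of coalitions of this size in the Shapley average.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                coalition = frozenset(subset)
                # Marginal contribution of `tok` when added to this coalition.
                phi += weight * (value(coalition | {tok}) - value(coalition))
        result[tok] = phi
    return result

# Hypothetical per-token weights; "um" plays the role of a spurious filler word.
WEIGHTS = {"agree": 0.6, "um": 0.3, "plan": 0.1}

def toy_score(present):
    """Toy class score: additive weights plus one interaction term."""
    score = sum(WEIGHTS[t] for t in present)
    if {"agree", "plan"} <= present:
        score += 0.2  # interaction shared equally between the two tokens
    return score

phi = shapley_values(list(WEIGHTS), toy_score)
for token, contribution in phi.items():
    print(token, round(contribution, 3))
```

In this toy setup the filler word still receives a positive attribution, which mirrors the kind of spurious-but-influential token the abstract describes; the attributions also sum to the full score, a property that makes them easy to sanity-check.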

[11] Retrieval-Augmented Clinical Benchmarking for Contextual Model Testing in Kenyan Primary Care: A Methodology Paper

Fred Mutisya,Shikoh Gitau,Christine Syovata,Diana Oigara,Ibrahim Matende,Muna Aden,Munira Ali,Ryan Nyotu,Diana Marion,Job Nyangena,Nasubo Ongoma,Keith Mbae,Elizabeth Wamicha,Eric Mibuari,Jean Philbert Nsengemana,Talkmore Chidede

Main category: cs.CL

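TL;DR: The paper presents a methodology for testing LLMs in Kenyan primary care: it uses retrieval-augmented generation (RAG) grounded in local guidelines to create a benchmark dataset and introduces new evaluation metrics.

Motivation: To explore the effectiveness of LLMs in African primary care, fill a gap in existing research, and ensure that AI models meet local clinical standards and cultural needs.

Contribution: 1) Proposes a RAG approach grounded in Kenya's national guidelines; 2) Creates a benchmark dataset in English and Swahili; 3) Introduces new evaluation metrics that test clinical reasoning, safety, and adaptability.

Method: 1) Digitize and index Kenyan clinical guidelines; 2) Generate clinical scenarios and multiple-choice questions with Gemini Flash 2.0 Lite; 3) Refine the dataset together with local physicians; 4) Design evaluation metrics such as rare case detection (Needle in the Haystack) and stepwise logic (Decision Points).

Result: LLM performance drops significantly on localized scenarios and is lower than on US-based benchmarks; the dataset and evaluation framework nonetheless support safe deployment of medical AI in Africa.

Insight: Localized data and guidelines are essential for medical AI in low-resource settings, and LLMs show clear limitations on African medical content.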

Abstract: Large Language Models (LLMs) hold promise for improving healthcare access in low-resource settings, but their effectiveness in African primary care remains underexplored. We present a methodology for creating a benchmark dataset and evaluation framework focused on Kenyan Level 2 and 3 clinical care. Our approach uses retrieval-augmented generation (RAG) to ground clinical questions in Kenya’s national guidelines, ensuring alignment with local standards. These guidelines were digitized, chunked, and indexed for semantic retrieval. Gemini Flash 2.0 Lite was then prompted with guideline excerpts to generate realistic clinical scenarios, multiple-choice questions, and rationale-based answers in English and Swahili. Kenyan physicians co-created and refined the dataset, and a blinded expert review process ensured clinical accuracy, clarity, and cultural appropriateness. The resulting Alama Health QA dataset includes thousands of regulator-aligned question-answer pairs across common outpatient conditions. Beyond accuracy, we introduce evaluation metrics that test clinical reasoning, safety, and adaptability, such as rare case detection (Needle in the Haystack), stepwise logic (Decision Points), and contextual adaptability. Initial results reveal significant performance gaps when LLMs are applied to localized scenarios, consistent with findings that LLM accuracy is lower on African medical content than on US-based benchmarks. This work offers a replicable model for guideline-driven, dynamic benchmarking to support safe AI deployment in African health systems.
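
The "digitized, chunked, and indexed" step can be sketched as overlapping word windows plus a similarity search. This is a minimal illustration under stated assumptions: the chunk size, the overlap, and the bag-of-words cosine similarity are placeholders (the paper's semantic retrieval presumably uses learned embeddings), and `DOC` is made-up filler text, not guideline content.

```python
import math
from collections import Counter

def chunk(text, size=8, overlap=2):
    """Split text into word windows of `size` words sharing `overlap` words."""
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reaches the end of the text
    return chunks

def vectorize(text):
    # Toy bag-of-words vector standing in for a semantic embedding.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, chunks, top_k=1):
    """Return the top_k chunks most similar to the query."""
    q = vectorize(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, vectorize(c)), reverse=True)
    return ranked[:top_k]

# Placeholder text only; it does not reproduce any real guideline.
DOC = ("alpha section describes triage steps for fever cases "
       "beta section describes referral rules for injury cases")

pieces = chunk(DOC)
print(pieces)
print(retrieve("injury referral rules", pieces))
```

The retrieved excerpt would then be placed in the generation prompt, so that each generated question can be traced back to a specific guideline passage.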

[12] MiroMind-M1: An Open-Source Advancement in Mathematical Reasoning via Context-Aware Multi-Stage Policy Optimization

Xingxuan Li,Yao Xiao,Dianwen Ng,Hai Ye,Yue Deng,Xiang Lin,Bin Wang,Zhanfeng Mo,Chong Zhang,Yueyi Zhang,Zonglin Yang,Ruilin Li,Lei Lei,Shihao Xu,Han Zhao,Weiling Chen,Feng Ji,Lidong Bing

Main category: cs.CL

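TL;DR: The paper introduces the MiroMind-M1 series, fully open-source LLMs focused on mathematical reasoning. With two-stage training (SFT and RLVR) and the Context-Aware Multi-Stage Policy Optimization algorithm, the models achieve state-of-the-art or competitive performance on the AIME and MATH benchmarks, and all resources are released to promote transparency and reproducibility.

Motivation: Existing open-source models for mathematical reasoning lack sufficient transparency and reproducibility, in particular because key resources and training details are missing. MiroMind-M1 aims to fill this gap through a fully open-source release and to support community research.

Contribution: 1. Proposes the MiroMind-M1 series of open-source models, in 7B and 32B versions; 2. Introduces the Context-Aware Multi-Stage Policy Optimization algorithm, which improves the efficiency and robustness of RLVR training; 3. Releases all datasets (719K SFT and 62K RLVR problems), models, and training details.

Method: 1. Two-stage training: the SFT stage uses CoT trajectories for 719K math problems, and the RLVR stage introduces 62K challenging problems; 2. The Context-Aware Multi-Stage Policy Optimization algorithm combines length-progressive training with an adaptive repetition penalty.

Result: On the AIME24, AIME25, and MATH benchmarks, the MiroMind-M1 models match or exceed existing open-source models while being more token-efficient.

Insight: Fully open-source projects can significantly improve transparency and community collaboration, and the context-aware training strategy offers a new approach to optimizing multi-step reasoning tasks.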

Abstract: Large language models have recently evolved from fluent text generation to advanced reasoning across diverse domains, giving rise to reasoning language models. Among these domains, mathematical reasoning serves as a representative benchmark as it requires precise multi-step logic and abstract reasoning, which can be generalized to other tasks. While closed-source RLMs such as GPT-o3 demonstrate impressive reasoning capabilities, their proprietary nature limits transparency and reproducibility. Although many open-source projects aim to close this gap, most of them lack sufficient openness by omitting critical resources such as datasets and detailed training configurations, which hinders reproducibility. To contribute toward greater transparency in RLM development, we introduce the MiroMind-M1 series, a set of fully open-source RLMs built on the Qwen-2.5 backbone that match or exceed the performance of existing open-source RLMs. Specifically, our models are trained in two stages: SFT on a carefully curated corpus of 719K math-reasoning problems with verified CoT trajectories, followed by RLVR on 62K challenging and verifiable problems. To enhance the robustness and efficiency of the RLVR process, we introduce Context-Aware Multi-Stage Policy Optimization, an algorithm that integrates length-progressive training with an adaptive repetition penalty to encourage context-aware RL training. Our model achieves state-of-the-art or competitive performance and superior token efficiency among Qwen-2.5-based open-source 7B and 32B models on the AIME24, AIME25, and MATH benchmarks. To facilitate reproducibility, we release the complete stack: models (MiroMind-M1-SFT-7B, MiroMind-M1-RL-7B, MiroMind-M1-RL-32B); datasets (MiroMind-M1-SFT-719K, MiroMind-M1-RL-62K); and all training and evaluation configurations. We hope these resources will support further research and foster community advancement.
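
The abstract names two ingredients, length-progressive training and a repetition penalty, without giving formulas. The sketch below shows one way such ingredients could be combined in a reward function; every detail here (the trigram repetition measure, the `penalty_weight`, the zero reward for over-length rollouts, and the `STAGE_MAX_LEN` budgets) is an assumption for illustration and should not be read as the paper's algorithm.

```python
def repetition_ratio(tokens, n=3):
    """Fraction of n-grams that duplicate an earlier n-gram in the sequence."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return 1.0 - len(set(ngrams)) / len(ngrams)

def shaped_reward(correct, tokens, max_len, penalty_weight=0.5):
    """Verifiable reward minus a penalty that grows with repetition."""
    if len(tokens) > max_len:
        return 0.0  # assumption: rollouts over the stage budget earn nothing
    base = 1.0 if correct else 0.0
    return base - penalty_weight * repetition_ratio(tokens)

# Hypothetical length budget per training stage (length-progressive schedule).
STAGE_MAX_LEN = [8, 16, 32]

looping = "a b c a b c a b c".split()
clean = "a b c d e f g h i".split()
for stage, budget in enumerate(STAGE_MAX_LEN):
    print(stage, shaped_reward(True, looping, budget), shaped_reward(True, clean, budget))
```

The intent of such a design is that early stages reward short, correct solutions, later stages allow longer reasoning, and degenerate loops are discouraged at every stage.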

[13] Mind the Gap: A Review of Arabic Post-Training Datasets and Their Limitations

Mohammed Alkhowaiter,Norah Alshahrani,Saied Alshahrani,Reem I. Masoud,Alaa Alzahrani,Deema Alnuhait,Emad A. Alghamdi,Khalid Almubarak

Main category: cs.CL

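TL;DR: The paper reviews publicly available Arabic post-training datasets on the Hugging Face Hub, reveals their shortcomings in task diversity, documentation quality, and community adoption, and offers recommendations for improvement.

Motivation: Arabic post-training datasets have significant shortcomings in diversity, documentation quality, and practical use, which limits the development and application of Arabic LLMs.

Contribution: Systematically evaluates Arabic post-training datasets and gives concrete recommendations for future development.

Method: Evaluates the datasets along four dimensions (capabilities, steerability, alignment, and robustness), with a focus on their usage and quality.

Result: Arabic datasets show significant gaps in task diversity, documentation completeness, and community adoption.

Insight: The shortcomings of Arabic post-training datasets may hinder progress on Arabic LLMs; resource development and community collaboration need to be strengthened.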

Abstract: Post-training has emerged as a crucial technique for aligning pre-trained Large Language Models (LLMs) with human instructions, significantly enhancing their performance across a wide range of tasks. Central to this process is the quality and diversity of post-training datasets. This paper presents a review of publicly available Arabic post-training datasets on the Hugging Face Hub, organized along four key dimensions: (1) LLM Capabilities (e.g., Question Answering, Translation, Reasoning, Summarization, Dialogue, Code Generation, and Function Calling); (2) Steerability (e.g., persona and system prompts); (3) Alignment (e.g., cultural, safety, ethics, and fairness), and (4) Robustness. Each dataset is rigorously evaluated based on popularity, practical adoption, recency and maintenance, documentation and annotation quality, licensing transparency, and scientific contribution. Our review revealed critical gaps in the development of Arabic post-training datasets, including limited task diversity, inconsistent or missing documentation and annotation, and low adoption across the community. Finally, the paper discusses the implications of these gaps on the progress of Arabic LLMs and applications while providing concrete recommendations for future efforts in post-training dataset development.

[14] Rethinking Suicidal Ideation Detection: A Trustworthy Annotation Framework and Cross-Lingual Model Evaluation

Amina Dzafic,Merve Kavut,Ulya Bayram

Main category: cs.CL

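TL;DR: The paper examines language coverage and annotation reliability in suicidal ideation detection. It presents a new Turkish dataset and a resource-efficient annotation framework, and uses cross-lingual model evaluation to expose the limitations of existing methods.

Motivation: Suicidal ideation detection is critical for real-time suicide prevention but faces two challenges: limited language coverage and unreliable annotation. Most datasets are in English, and both non-English data and data with high-quality annotation are scarce.

Contribution: 1. Builds a Turkish suicidal ideation dataset from social media; 2. Proposes a resource-efficient annotation framework involving three human annotators and two large language models; 3. Reveals problems with label reliability and model consistency through cross-lingual, cross-dataset model evaluation.

Method: 1. Build the Turkish dataset from social media data; 2. Design the annotation framework around human annotators and large language models; 3. Apply transfer learning with eight pre-trained sentiment and emotion classifiers to evaluate label reliability and model performance.

Result: Existing models perform poorly under zero-shot transfer learning, which underlines the need for more rigorous annotation and language-inclusive approaches. The study also shows the importance of transparency in datasets and models.

Insight: The paper exposes weaknesses in data annotation and model evaluation in mental health NLP and calls for broader language coverage and more reliable annotation practices. It also shows the potential of large language models in annotation.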

Abstract: Suicidal ideation detection is critical for real-time suicide prevention, yet its progress faces two under-explored challenges: limited language coverage and unreliable annotation practices. Most available datasets are in English, but even among these, high-quality, human-annotated data remains scarce. As a result, many studies rely on available pre-labeled datasets without examining their annotation process or label reliability. The lack of datasets in other languages further limits the global realization of suicide prevention via artificial intelligence (AI). In this study, we address one of these gaps by constructing a novel Turkish suicidal ideation corpus derived from social media posts and introducing a resource-efficient annotation framework involving three human annotators and two large language models (LLMs). We then address the remaining gaps by performing a bidirectional evaluation of label reliability and model consistency across this dataset and three popular English suicidal ideation detection datasets, using transfer learning through eight pre-trained sentiment and emotion classifiers. These transformers help assess annotation consistency and benchmark model performance against manually labeled data. Our findings underscore the need for more rigorous, language-inclusive approaches to annotation and evaluation in mental health natural language processing (NLP) while demonstrating the questionable performance of popular models with zero-shot transfer learning. We advocate for transparency in model training and dataset construction in mental health NLP, prioritizing data and model reliability.
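
An annotation setup with three human annotators and two LLMs needs some rule for combining five labels per post and some measure of how much the annotators agree. The abstract does not say which rule or agreement statistic the authors use, so the majority vote and mean pairwise agreement below are assumptions chosen for simplicity, and the labels in `ANNOTATIONS` are invented.

```python
from collections import Counter
from itertools import combinations

def majority_label(votes):
    """Return (label, share of votes); label is None when the top two tie."""
    counts = Counter(votes).most_common()
    share = counts[0][1] / len(votes)
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None, share
    return counts[0][0], share

def pairwise_agreement(annotations):
    """Mean fraction of items on which each pair of annotators agrees."""
    pairs = list(combinations(annotations, 2))
    total = 0.0
    for a, b in pairs:
        matches = sum(x == y for x, y in zip(annotations[a], annotations[b]))
        total += matches / len(annotations[a])
    return total / len(pairs)

# Invented binary labels for four posts (1 = ideation present).
ANNOTATIONS = {
    "human_1": [1, 0, 1, 0],
    "human_2": [1, 0, 0, 0],
    "human_3": [1, 1, 1, 0],
    "llm_1":   [1, 0, 1, 0],
    "llm_2":   [1, 0, 0, 1],
}

n_items = len(ANNOTATIONS["human_1"])
for i in range(n_items):
    votes = [labels[i] for labels in ANNOTATIONS.values()]
    print(i, majority_label(votes))
print(pairwise_agreement(ANNOTATIONS))
```

Items with a low vote share are natural candidates for a second look by a human, which is one way a mixed human and LLM setup can stay resource-efficient; chance-corrected statistics such as Fleiss' kappa would be a stricter alternative to raw agreement.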

[15] Disparities in Peer Review Tone and the Role of Reviewer Anonymity

Maria Sahakyan,Bedoor AlShebli

Main category: cs.CL

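TL;DR: Through a linguistic analysis of more than 80,000 peer reviews, the paper reveals hidden biases in reviewing, including differences in tone, sentiment, and supportive language that correlate with authors' gender, race, and institutional affiliation. The authors also find that reviewer anonymity significantly affects review language.

Motivation: Peer review is regarded as the gatekeeper of scientific integrity, yet research shows it is subject to bias. By analyzing differences in review language in depth, the paper examines how anonymity affects fairness and addresses the lack of attention to linguistic bias in existing work.

Contribution: Provides one of the most comprehensive large-scale linguistic analyses of hidden bias in peer review to date, reveals how review language relates to author background, and examines the effect of anonymity on fairness.

Method: Applies natural language processing and large-scale statistical modeling to more than 80,000 peer reviews from two major journals, comparing the language of anonymous and signed reviews.

Result: The tone, sentiment, and supportive content of reviews vary with authors' gender, race, and institutional affiliation; anonymity significantly changes the linguistic style of reviews.

Insight: The study exposes linguistic bias in peer review and challenges the conventional assumption that anonymity necessarily promotes fairness, offering a new perspective for reforming academic publishing.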

Abstract: The peer review process is often regarded as the gatekeeper of scientific integrity, yet increasing evidence suggests that it is not immune to bias. Although structural inequities in peer review have been widely debated, much less attention has been paid to the subtle ways in which language itself may reinforce disparities. This study undertakes one of the most comprehensive linguistic analyses of peer review to date, examining more than 80,000 reviews in two major journals. Using natural language processing and large-scale statistical modeling, it uncovers how review tone, sentiment, and supportive language vary across author demographics, including gender, race, and institutional affiliation. Using a data set that includes both anonymous and signed reviews, this research also reveals how the disclosure of reviewer identity shapes the language of evaluation. The findings not only expose hidden biases in peer feedback, but also challenge conventional assumptions about anonymity’s role in fairness. As academic publishing grapples with reform, these insights raise critical questions about how review policies shape career trajectories and scientific progress.

[16] On the robustness of modeling grounded word learning through a child’s egocentric input

Wai Keen Vong,Brenden M. Lake

Main category: cs.CL

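TL;DR: The paper examines the robustness of multimodal neural networks as models of children's word learning. Using automated speech transcription on a large corpus of child-centered audiovisual recordings, it shows that models trained on different children's input can learn word-referent mappings.

Details Motivation: The study asks how machine learning models can learn language from the limited input children receive, narrowing the gap between data-hungry large models and child language acquisition, and tests whether multimodal neural networks learn robustly across different children's input.

Contribution: 1) Automated speech transcription of the entire SAYCam dataset, over 500 hours of video across three children; 2) evidence that multimodal neural networks robustly learn word-referent mappings from each child's input; 3) a characterization of the individual differences that emerge when models are trained on each child's experience.

Method: 1) Automatically transcribe the speech in SAYCam; 2) build multimodal vision-and-language datasets for training and evaluation; 3) test a range of neural network configurations on each child's input.

Result: Networks trained on automatically transcribed data from any of the children acquire and generalize word-referent mappings across multiple architectures. How the models learn also differs from child to child.

Insight: Individual differences in children's experience affect what models learn, which opens a new angle for future work, and the results support multimodal neural networks as a tool for simulating grounded word learning.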

Abstract: What insights can machine learning bring to understanding human language acquisition? Large language and multimodal models have achieved remarkable capabilities, but their reliance on massive training datasets creates a fundamental mismatch with children, who succeed in acquiring language from comparatively limited input. To help bridge this gap, researchers have increasingly trained neural networks using data similar in quantity and quality to children’s input. Taking this approach to the limit, Vong et al. (2024) showed that a multimodal neural network trained on 61 hours of visual and linguistic input extracted from just one child’s developmental experience could acquire word-referent mappings. However, whether this approach’s success reflects the idiosyncrasies of a single child’s experience, or whether it would show consistent and robust learning patterns across multiple children’s experiences was not explored. In this article, we applied automated speech transcription methods to the entirety of the SAYCam dataset, consisting of over 500 hours of video data spread across all three children. Using these automated transcriptions, we generated multi-modal vision-and-language datasets for both training and evaluation, and explored a range of neural network configurations to examine the robustness of simulated word learning. Our findings demonstrate that networks trained on automatically transcribed data from each child can acquire and generalize word-referent mappings across multiple network architectures. These results validate the robustness of multimodal neural networks for grounded word learning, while highlighting the individual differences that emerge in how models learn when trained on each child’s developmental experiences.

[17] GRACE: Generative Recommendation via Journey-Aware Sparse Attention on Chain-of-Thought Tokenization

Luyi Ma,Wanjia Zhang,Kai Zhao,Abhishek Kulkarni,Lalitesh Morishetti,Anjana Ganesh,Ashish Ranjan,Aashika Padmanabhan,Jianpeng Xu,Jason Cho,Praveen Kanumala,Kaushiki Nag,Sumit Dutta,Kamiya Motwani,Malay Patel,Evren Korpeoglu,Sushant Kumar,Kannan Achan

Main category: cs.CL

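TL;DR: GRACE is a generative recommendation framework that introduces Chain-of-Thought tokenization and a journey-aware sparse attention mechanism to address computational cost and multi-scale modeling in generative recommendation, with substantial gains in recommendation quality.

Details Motivation: Generative models show promise for multi-behavior recommendation but face three challenges: (1) no explicit information for token reasoning, (2) high computational cost, and (3) limited multi-scale modeling over user history.

Contribution: 1. A hybrid Chain-of-Thought (CoT) tokenization method that combines user-item interactions with knowledge-graph attributes. 2. A Journey-Aware Sparse Attention (JSA) mechanism that improves computational efficiency. 3. Significant gains over existing methods on multi-behavior recommendation.

Method: 1. CoT tokenization encodes user behavior together with product attributes (e.g., category, brand, price), yielding interpretable token sequences. 2. JSA selectively attends to compressed, intra-, inter-, and current-context segments of the tokenized sequence.

Result: On two real-world datasets, GRACE improves over the state-of-the-art baseline by up to 106.9% in HR@10 and 106.7% in NDCG@10 on the Home domain and by 22.1% in HR@10 on the Electronics domain. It also reduces attention computation by up to 48% on long sequences.

Insight: 1. Incorporating knowledge-graph attributes makes tokenization more interpretable. 2. Sparse attention can cut computational cost without sacrificing performance.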

Abstract: Generative models have recently demonstrated strong potential in multi-behavior recommendation systems, leveraging the expressive power of transformers and tokenization to generate personalized item sequences. However, their adoption is hindered by (1) the lack of explicit information for token reasoning, (2) high computational costs due to quadratic attention complexity and dense sequence representations after tokenization, and (3) limited multi-scale modeling over user history. In this work, we propose GRACE (Generative Recommendation via journey-aware sparse Attention on Chain-of-thought tokEnization), a novel generative framework for multi-behavior sequential recommendation. GRACE introduces a hybrid Chain-of-Thought (CoT) tokenization method that encodes user-item interactions with explicit attributes from product knowledge graphs (e.g., category, brand, price) over semantic tokenization, enabling interpretable and behavior-aligned generation. To address the inefficiency of standard attention, we design a Journey-Aware Sparse Attention (JSA) mechanism, which selectively attends to compressed, intra-, inter-, and current-context segments in the tokenized sequence. Experiments on two real-world datasets show that GRACE significantly outperforms state-of-the-art baselines, achieving up to +106.9% HR@10 and +106.7% NDCG@10 improvement over the state-of-the-art baseline on the Home domain, and +22.1% HR@10 on the Electronics domain. GRACE also reduces attention computation by up to 48% with long sequences.
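
The headline numbers above are relative gains in HR@10 and NDCG@10. As a reading aid, here is a minimal sketch of how these two metrics are commonly computed in next-item evaluation. It assumes one held-out target item per user, which is a common protocol but not confirmed by the abstract; the function names are hypothetical and do not come from the GRACE code.

```python
import math


def hit_rate_at_k(ranked_items, target, k=10):
    """Return 1.0 if the held-out target appears in the top-k list, else 0.0."""
    return 1.0 if target in ranked_items[:k] else 0.0


def ndcg_at_k(ranked_items, target, k=10):
    """NDCG@k with a single relevant item.

    With one relevant item the ideal DCG is 1, so NDCG reduces to
    1 / log2(rank + 1) when the target is found at 1-based `rank`.
    """
    for rank, item in enumerate(ranked_items[:k], start=1):
        if item == target:
            return 1.0 / math.log2(rank + 1)
    return 0.0


def relative_gain_percent(new_score, baseline_score):
    """Relative improvement over a baseline, in percent."""
    return 100.0 * (new_score - baseline_score) / baseline_score
```

Averaging `hit_rate_at_k` and `ndcg_at_k` over all test users gives the dataset-level scores. A reported gain of +106.9% then means the new score is roughly 2.07 times the baseline score, not 106.9 percentage points higher.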

[18] FastLongSpeech: Enhancing Large Speech-Language Models for Efficient Long-Speech Processing

Shoutao Guo,Shaolei Zhang,Qingkai Fang,Zhengrui Ma,Min Zhang,Yang Feng

Main category: cs.CL

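TL;DR: FastLongSpeech is a new framework that makes Large Speech-Language Models (LSLMs) efficient at processing long speech. It needs no dedicated long-speech training data and handles long sequences through iterative fusion and dynamic compression training.

Details Motivation: Existing LSLMs focus mainly on short-speech tasks or on speech generation. Long-speech processing remains underexplored because training data is scarce and long sequences are computationally expensive.

Contribution: The FastLongSpeech framework, which combines an iterative fusion strategy with dynamic compression training so that LSLMs can handle long-speech tasks efficiently without dedicated long-speech data.

Method: Iterative fusion compresses long speech sequences to a manageable length. Dynamic compression training exposes the model to short-speech sequences at varying compression ratios so that it adapts to long-speech inputs.

Result: Experiments show strong performance on both long-speech and short-speech tasks, together with a large improvement in inference efficiency.

Insight: Dynamic compression training and iterative fusion can relieve the data and compute bottlenecks of long-speech processing and broaden the range of applications for LSLMs.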

Abstract: The rapid advancement of Large Language Models (LLMs) has spurred significant progress in Large Speech-Language Models (LSLMs), enhancing their capabilities in both speech understanding and generation. While existing LSLMs often concentrate on augmenting speech generation or tackling a diverse array of short-speech tasks, the efficient processing of long-form speech remains a critical yet underexplored challenge. This gap is primarily attributed to the scarcity of long-speech training datasets and the high computational costs associated with long sequences. To address these limitations, we introduce FastLongSpeech, a novel framework designed to extend LSLM capabilities for efficient long-speech processing without necessitating dedicated long-speech training data. FastLongSpeech incorporates an iterative fusion strategy that can compress excessively long-speech sequences into manageable lengths. To adapt LSLMs for long-speech inputs, it introduces a dynamic compression training approach, which exposes the model to short-speech sequences at varying compression ratios, thereby transferring the capabilities of LSLMs to long-speech tasks. To assess the long-speech capabilities of LSLMs, we develop a long-speech understanding benchmark called LongSpeech-Eval. Experiments show that our method exhibits strong performance in both long-speech and short-speech tasks, while greatly improving inference efficiency.

[19] Doc2Chart: Intent-Driven Zero-Shot Chart Generation from Documents

Akriti Jain,Pritika Ramu,Aparna Garimella,Apoorv Saxena

Main category: cs.CL

TL;DR: This paper proposes a zero-shot method for generating charts from documents according to user intent. A two-stage framework (information extraction, then chart generation) addresses the limitations of existing methods and beats baselines on data accuracy and chart type selection.

Details Motivation: Existing LLM-based methods require the user to pre-select the relevant content, so they cannot easily generate a chart directly from a long document based on an intent. The paper targets this more realistic scenario.

Contribution: Introduces the task of intent-driven zero-shot chart generation and an unsupervised two-stage framework (information extraction and chart generation); proposes an attribution-based metric for evaluating data accuracy; builds a dataset of 1,242 samples.

Method: A two-stage framework: 1) an LLM decomposes the intent, then iteratively validates and refines the extracted data; 2) a heuristic-guided module selects the chart type before code generation.

Result: Outperforms the best baselines by up to 9 points in chart data accuracy and up to 17 points in chart type selection.

Insight: Generating charts directly from documents benefits from combining intent decomposition with iterative validation. Visual decoding metrics are insufficient; evaluating data accuracy calls for a structured textual representation of the chart.

Abstract: Large Language Models (LLMs) have demonstrated strong capabilities in transforming text descriptions or tables to data visualizations via instruction-tuning methods. However, it is not straightforward to apply these methods directly for a more real-world use case of visualizing data from long documents based on user-given intents, as opposed to the user pre-selecting the relevant content manually. We introduce the task of intent-based chart generation from documents: given a user-specified intent and document(s), the goal is to generate a chart adhering to the intent and grounded on the document(s) in a zero-shot setting. We propose an unsupervised, two-stage framework in which an LLM first extracts relevant information from the document(s) by decomposing the intent and iteratively validates and refines this data. Next, a heuristic-guided module selects an appropriate chart type before final code generation. To assess the data accuracy of the generated charts, we propose an attribution-based metric that uses a structured textual representation of charts, instead of relying on visual decoding metrics that often fail to capture the chart data effectively. To validate our approach, we curate a dataset comprising 1,242 <intent, document, charts> tuples from two domains, finance and science, in contrast to the existing datasets that are largely limited to parallel text descriptions/tables and their corresponding charts. We compare our approach with baselines using single-shot chart generation using LLMs and query-based retrieval methods; our method outperforms the best baselines by up to 9 points in chart data accuracy and up to 17 points in chart type.

[20] Beyond Isolated Capabilities: Bridging Long CoT Reasoning and Long-Context Understanding

Yifei Wang

Main category: cs.CL

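TL;DR: The paper studies how reasoning distillation affects the long-context understanding of smaller language models. It finds that distillation clearly helps information extraction and integration in multi-document question answering and mitigates the "lost in the middle" problem of long-context models.

Details Motivation: The impact of large-scale reasoning distillation on other key abilities, such as in-context retrieval and reasoning, is unexplored. This matters especially for Retrieval-Augmented Generation (RAG) systems, where acquiring and using contextual information efficiently is essential.

Contribution: Experimental evidence that reasoning distillation significantly improves long-context understanding; an analysis showing that distillation strengthens long-context awareness by promoting more detailed and explicit reasoning; effective mitigation of the "lost in the middle" problem.

Method: Experiments use a series of open-source models distilled from DeepSeek-R1 and focus on multi-document question answering to measure the effect of reasoning distillation on long-context understanding.

Result: Reasoning distillation significantly improves long-context understanding, particularly information extraction and integration, and mitigates the "lost in the middle" problem.

Insight: Reasoning distillation improves more than reasoning ability: the more detailed reasoning process it induces also sharpens a model's awareness of long contexts, which suggests a new direction for designing long-context models.

Abstract: Reasoning distillation has emerged as an effective approach to enhance the reasoning capabilities of smaller language models. However, the impact of large-scale reasoning distillation on other critical abilities, particularly in-context retrieval and reasoning, remains unexplored. This gap in understanding is particularly significant given the increasing importance of Retrieval-Augmented Generation (RAG) systems, where efficient acquisition and utilization of contextual information are paramount for generating reliable responses. Motivated by the need to understand how the extended long-CoT process influences long-context comprehension, we conduct a comprehensive investigation using a series of open-source models distilled from DeepSeek-R1, renowned for its exceptional reasoning capabilities. Our study focuses on evaluating these models’ performance in extracting and integrating relevant information from extended contexts through multi-document question answering tasks. Through rigorous experimentation, we demonstrate that distilled reasoning patterns significantly improve long-context understanding. Our analysis reveals that distillation fosters greater long-context awareness by promoting more detailed and explicit reasoning processes during context analysis and information parsing. This advancement effectively mitigates the persistent “lost in the middle” issue that has hindered long-context models.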

[21] MEKiT: Multi-source Heterogeneous Knowledge Injection Method via Instruction Tuning for Emotion-Cause Pair Extraction

Shiyi Mu,Yongkang Liu,Shi Feng,Xiaocui Yang,Daling Wang,Yifei Zhang

Main category: cs.CL

TL;DR: MEKiT is a multi-source heterogeneous knowledge injection method based on instruction tuning for Emotion-Cause Pair Extraction (ECPE). By integrating internal emotional knowledge and external causal knowledge, it substantially improves the performance of large language models (LLMs) on the task.

Details Motivation: LLMs perform poorly on ECPE, which requires reasoning ability, mainly because they lack auxiliary knowledge. This limits their ability to perceive emotions and reason about causes.

Contribution: Proposes MEKiT, which uses instruction tuning to integrate internal emotional knowledge and external causal knowledge, substantially improving LLM performance on ECPE.

Method: 1. Introduce internal emotional knowledge by incorporating it into instruction templates; 2. introduce external causal knowledge by mixing data during instruction tuning.

Result: Experiments show that MEKiT has an absolute performance advantage over the compared baselines on ECPE.

Insight: Combining multi-source heterogeneous knowledge with instruction tuning can compensate for the knowledge gaps of LLMs on complex reasoning tasks.

Abstract: Although large language models (LLMs) excel in text comprehension and generation, on the Emotion-Cause Pair Extraction (ECPE) task, which requires reasoning ability, they often underperform smaller language models. The main reason is the lack of auxiliary knowledge, which limits LLMs’ ability to effectively perceive emotions and reason about causes. To address this issue, we propose a novel Multi-source hEterogeneous Knowledge injection meThod, MEKiT, which integrates heterogeneous internal emotional knowledge and external causal knowledge. Specifically, for these two distinct aspects and structures of knowledge, we apply the approaches of incorporating instruction templates and mixing data for instruction-tuning, which respectively help LLMs identify emotions more comprehensively and reason about causes more accurately. Experimental results demonstrate that MEKiT provides a more effective and adaptable solution for the ECPE task, exhibiting an absolute performance advantage over compared baselines and dramatically improving the performance of LLMs on the ECPE task.

[22] Sparse Autoencoder-guided Supervised Finetuning to Mitigate Unexpected Code-Switching in LLMs

Boyi Deng,Yu Wan,Baosong Yang,Fei Huang,Wenjie Wang,Fuli Feng

Main category: cs.CL

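TL;DR: The paper proposes Sparse Autoencoder-guided Supervised Finetuning (SASFT). It analyzes the mechanism behind language mixing and regulates pre-activation values during training, which substantially reduces unexpected code-switching in LLMs.

Details Motivation: Large language models (LLMs) have strong multilingual capabilities but suffer from unexpected code-switching (language mixing), which makes responses hard to read. Existing work lacks a mechanistic analysis and has limited effectiveness.

Contribution: Analyzes the mechanism of code-switching with sparse autoencoders and proposes SASFT, which reduces unexpected code-switching by more than 50% while maintaining or improving multilingual performance.

Method: Sparse autoencoders are used to show that features of the language being switched to exhibit excessive pre-activation values. SASFT then teaches the model during training to keep these pre-activation values at appropriate levels.

Result: On five models across three languages, SASFT reduces unexpected code-switching by more than 50% compared with standard supervised fine-tuning, eliminating it completely in four cases, and maintains or improves performance on six multilingual benchmarks.

Insight: Code-switching is associated with abnormal pre-activation values of specific language features; mechanistic analysis followed by a targeted intervention can address language mixing effectively.

Abstract: Large Language Models (LLMs) have impressive multilingual capabilities, but they suffer from unexpected code-switching, also known as language mixing, which involves switching to unexpected languages in the model response. This problem leads to poor readability and degrades the usability of model responses. However, existing work on this issue lacks a mechanistic analysis and shows limited effectiveness. In this paper, we first provide an in-depth analysis of unexpected code-switching using sparse autoencoders and find that when LLMs switch to a language, the features of that language exhibit excessive pre-activation values. Based on our findings, we propose Sparse Autoencoder-guided Supervised Finetuning (SASFT), which teaches LLMs to maintain appropriate pre-activation values of specific language features during training. Experiments on five models across three languages demonstrate that SASFT consistently reduces unexpected code-switching by more than 50% compared to standard supervised fine-tuning, with complete elimination in four cases. Moreover, SASFT maintains or even improves the models’ performance on six multilingual benchmarks, showing its effectiveness in addressing code-switching while preserving multilingual capabilities.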

[23] MUR: Momentum Uncertainty guided Reasoning for Large Language Models

Hang Yan,Fangzhi Xu,Rongman Xu,Yifei Li,Jian Zhang,Haoran Luo,Xiaobao Wu,Luu Anh Tuan,Haiteng Zhao,Qika Lin,Jun Liu

Main category: cs.CL

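TL;DR: MUR uses momentum uncertainty to guide the reasoning of large language models, cutting redundant computation while improving accuracy.

Details Motivation: Large language models perform well on reasoning tasks, but their reasoning efficiency can still be improved. Test-Time Scaling (TTS) can lead to overthinking, which wastes compute.

Contribution: Proposes MUR, which allocates the reasoning budget dynamically, and introduces gamma-control, a mechanism for tuning reasoning cost with a single hyperparameter.

Method: Tracks and aggregates stepwise uncertainty over time as a momentum uncertainty, and uses it to allocate the thinking budget to critical reasoning steps.

Result: Across four benchmarks, MUR reduces computation by over 50% on average while improving accuracy by 0.62-3.37%.

Insight: Dynamic budget allocation lets MUR balance reasoning efficiency against quality, improving performance without additional training.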

Abstract: Large Language Models (LLMs) have achieved impressive performance on reasoning-intensive tasks, yet optimizing their reasoning efficiency remains an open challenge. While Test-Time Scaling (TTS) improves reasoning quality, it often leads to overthinking, wasting tokens on redundant computations. This work investigates how to efficiently and adaptively guide LLM test-time scaling without additional training. Inspired by the concept of momentum in physics, we propose Momentum Uncertainty-guided Reasoning (MUR), which dynamically allocates thinking budgets to critical reasoning steps by tracking and aggregating stepwise uncertainty over time. To support flexible inference-time control, we introduce gamma-control, a simple mechanism that tunes the reasoning budget via a single hyperparameter. We provide in-depth theoretical proof to support the superiority of MUR in terms of stability and biases. MUR is comprehensively evaluated against various TTS methods across four challenging benchmarks (MATH-500, AIME24, AIME25, and GPQA-diamond) using different sizes of recent Qwen3 models (1.7B, 4B, and 8B). Results demonstrate that MUR reduces computation by over 50% on average while improving accuracy by 0.62-3.37%.
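
The abstract describes tracking stepwise uncertainty over time with a momentum term and tuning the budget with one hyperparameter, but it does not give the update rule. The sketch below is therefore only an illustration of the general idea, not the authors' algorithm: it assumes an exponential moving average as the momentum, and a hypothetical trigger that flags a step for extra thinking when its uncertainty, scaled by `gamma`, exceeds the running momentum. The names `alpha`, `gamma`, and `flag_uncertain_steps` are assumptions made for this example.

```python
def flag_uncertain_steps(step_uncertainties, alpha=0.9, gamma=0.9):
    """Flag reasoning steps that look unusually uncertain.

    ASSUMPTION: momentum is an exponential moving average of past step
    uncertainties, and a step is flagged when gamma * u_t > momentum.
    This rule is illustrative and is not taken from the MUR paper.
    """
    flags = []
    momentum = None
    for u in step_uncertainties:
        if momentum is None:
            # No history yet: never flag the first step, just initialize.
            flags.append(False)
            momentum = u
            continue
        # Compare the current step against the history before updating it.
        flags.append(gamma * u > momentum)
        momentum = alpha * momentum + (1.0 - alpha) * u
    return flags
```

Under this hypothetical rule, a smaller `gamma` flags fewer steps and so spends less extra compute, which is one way a single knob could trade accuracy against cost.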

[24] RefCritic: Training Long Chain-of-Thought Critic Models with Refinement Feedback

Qiaoyu Tang,Hao Xiang,Le Yu,Bowen Yu,Hongyu Lin,Yaojie Lu,Xianpei Han,Le Sun,Junyang Lin

Main category: cs.CL

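TL;DR: The paper proposes RefCritic, a long chain-of-thought critic module trained with reinforcement learning. Dual rule-based rewards lead it to produce high-quality critiques with actionable feedback, outperforming existing supervised fine-tuning approaches.

Details Motivation: Large Language Models (LLMs) are advancing rapidly, but critic modules built with supervised fine-tuning do not genuinely improve critique ability: their critiques are superficial and lack sufficient reflection and verification.

Contribution: 1) RefCritic, a critic module based on reinforcement learning; 2) dual rule-based rewards: instance-level correctness of solution judgments, and the refinement accuracy of the policy model when guided by the critiques; 3) consistent gains across multiple benchmarks.

Method: 1) Train the critic module with reinforcement learning; 2) reward both correct solution judgments and refinement accuracy; 3) evaluate under majority voting, where RefCritic filters the policy model's solutions.

Result: RefCritic shows consistent advantages across benchmarks, e.g., gains of 6.8% and 7.2% on AIME25 for the two base models, and it outperforms step-level supervised approaches on ProcessBench.

Insight: Reward design in reinforcement learning is central to producing high-quality critiques; under majority voting, policy models filtered by RefCritic scale better as the number of votes grows.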

Abstract: With the rapid advancement of Large Language Models (LLMs), developing effective critic modules for precise guidance has become crucial yet challenging. In this paper, we initially demonstrate that supervised fine-tuning for building critic modules (which is widely adopted in current solutions) fails to genuinely enhance models’ critique abilities, producing superficial critiques with insufficient reflections and verifications. To unlock the unprecedented critique capabilities, we propose RefCritic, a long-chain-of-thought critic module based on reinforcement learning with dual rule-based rewards: (1) instance-level correctness of solution judgments and (2) refinement accuracies of the policy model based on critiques, aiming to generate high-quality evaluations with actionable feedback that effectively guides model refinement. We evaluate RefCritic on Qwen2.5-14B-Instruct and DeepSeek-R1-Distill-Qwen-14B across five benchmarks. On critique and refinement settings, RefCritic demonstrates consistent advantages across all benchmarks, e.g., 6.8% and 7.2% gains on AIME25 for the respective base models. Notably, under majority voting, policy models filtered by RefCritic show superior scaling with increased voting numbers. Moreover, despite training on solution-level supervision, RefCritic outperforms step-level supervised approaches on ProcessBench, a benchmark to identify erroneous steps in mathematical reasoning.

[25] WebShaper: Agentically Data Synthesizing via Information-Seeking Formalization

Zhengwei Tao,Jialong Wu,Wenbiao Yin,Junkai Zhang,Baixuan Li,Haiyang Shen,Kuan Li,Liwen Zhang,Xinyu Wang,Yong Jiang,Pengjun Xie,Fei Huang,Jingren Zhou

Main category: cs.CL

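TL;DR: The paper proposes WebShaper, a framework that formalizes information-seeking (IS) tasks with set theory and uses the formalization to synthesize training data. This addresses the inconsistencies in reasoning structure and between question and answer found in existing data-driven approaches, and yields strong results on several benchmarks.

Details Motivation: LLM-powered information-seeking (IS) agents are held back by a shortage of high-quality training data, and the conventional information-driven paradigm can produce a mismatch between information structure and reasoning structure.

Contribution: WebShaper, a formalization-driven IS data synthesis framework. Compositions of Knowledge Projection (KP) operations give precise control over reasoning structure, and a multi-step expansion process synthesizes a high-quality dataset.

Method: Formalize IS tasks with set theory, control reasoning structure through compositions of KP operations, synthesize data with a multi-step expansion process, and train a model on the synthesized dataset.

Result: WebShaper achieves state-of-the-art performance among open-source IS agents on the GAIA and WebWalkerQA benchmarks.

Insight: A formal approach can markedly raise the quality of synthesized data, in particular through Knowledge Projection operations and multi-step expansion, which offers a new route to training data for IS tasks.

Abstract: The advent of Large Language Model (LLM)-powered agents has revolutionized artificial intelligence by enabling solutions to complex, open-ended tasks through web-based information-seeking (IS) capabilities. The scarcity of high-quality training data has limited the development of IS agents. Existing approaches typically adopt an information-driven paradigm that first collects web data and then generates questions based on the retrieval. However, this may lead to inconsistency between information structure and reasoning structure, and between question and answer. To mitigate this, we propose WebShaper, a formalization-driven IS data synthesis framework, to construct a dataset. WebShaper systematically formalizes IS tasks through set theory. Central to the formalization is the concept of Knowledge Projections (KP), which enables precise control over reasoning structure by KP operation compositions. During synthesis, we begin by creating seed tasks, then use a multi-step expansion process. At each step, an agentic Expander makes the current formal question more complex, using retrieval and validation tools based on our formalization. We train our model on the synthesized dataset. Experiment results demonstrate that WebShaper achieves state-of-the-art performance among open-sourced IS agents on GAIA and WebWalkerQA benchmarks.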

[26] Evaluation of Coding Schemes for Transformer-based Gene Sequence Modeling

Chenlei Gong,Yuanhe Tian,Lei Mao,Yan Song

Main category: cs.CL

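TL;DR: The paper systematically evaluates encoding schemes for Transformer models of gene sequences: k-mer segmentation versus BPE subword tokenization, together with three positional encoding methods. BPE gives higher and more stable performance, while RoPE and ALiBi each have advantages on different tasks.

Details Motivation: DNA sequences are often treated as a special kind of language, but encoding schemes such as k-mer segmentation and BPE have not been evaluated systematically. The study aims to give practical guidance for designing gene sequence Transformer models.

Contribution: Compares k-mer segmentation with BPE subword tokenization, assesses three positional encoding methods, and provides experimental evidence for choosing model depth.

Method: k-mer segmentation with k=1,3,4,5,6, a 4,096-token BPE vocabulary, and three positional encodings (sinusoidal, ALiBi, RoPE), trained and evaluated with 3-, 6-, 12-, and 24-layer Transformer encoders.

Result: BPE performs better and more stably. RoPE is good at capturing periodic motifs, and ALiBi does well on tasks driven by local dependencies. For depth, gains are significant up to 12 layers; 24 layers bring only marginal improvement or slight overfitting.

Insight: BPE compresses frequent motifs into variable-length tokens, which shortens sequences and improves generalization; different positional encodings suit different task requirements.

Abstract: Currently, many studies view DNA sequences as a special type of language and utilize Transformers to model them. These studies use fixed-length k-mer segmentation and BPE subword tokenization but lack a systematic evaluation to determine which is superior. We compare k-mer segmentation with k=1,3,4,5,6, a 4,096-token BPE vocabulary, and three positional encoding methods: sinusoidal, ALiBi, and RoPE. Each configuration is trained from scratch in 3, 6, 12, and 24-layer Transformer encoders and evaluated on the GUE benchmark dataset. In general, BPE delivers higher and more stable performance across tasks by compressing frequent motifs into variable-length tokens, reducing sequence length, and improving model generalization. RoPE excels at capturing periodic motifs and extrapolating to long sequences, while ALiBi also performs well on tasks driven by local dependencies. In terms of depth, we observe significant gains when increasing layers from 3 to 12, with only marginal improvements or slight overfitting at 24 layers. This study provides practical guidance for designing tokenization and positional encoding in DNA Transformer models.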
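
To make the fixed-length k-mer segmentation concrete, here is a minimal sketch. The abstract does not say whether the k-mers overlap, so the `stride` parameter is an assumption added for illustration: `stride=k` gives non-overlapping segments and `stride=1` gives a sliding window. The function name is hypothetical, and trailing bases that do not fill a whole k-mer are simply dropped in this sketch.

```python
def kmer_tokenize(sequence, k, stride=None):
    """Split a DNA string into fixed-length k-mers.

    stride=None means non-overlapping segments (stride = k).
    ASSUMPTION: leftover bases shorter than k are discarded.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    step = k if stride is None else stride
    if step <= 0:
        raise ValueError("stride must be positive")
    return [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, step)]


non_overlapping = kmer_tokenize("ACGTAC", 3)         # ['ACG', 'TAC']
sliding = kmer_tokenize("ACGTAC", 3, stride=1)       # ['ACG', 'CGT', 'GTA', 'TAC']
single_bases = kmer_tokenize("ACGT", 1)              # ['A', 'C', 'G', 'T']
```

Every token here has the same length, so the vocabulary has at most 4^k entries for the four bases. BPE, by contrast, learns variable-length tokens from data, which is how it can merge frequent motifs and shorten the sequence.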

[27] A Penalty Goes a Long Way: Measuring Lexical Diversity in Synthetic Texts Under Prompt-Influenced Length Variations

Vijeta Deshpande,Ishita Dasgupta,Uttaran Bhattacharya,Somdeb Sarkhel,Saayan Mitra,Anna Rumshisky

Main category: cs.CL

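TL;DR: The paper proposes PATTR, a new lexical diversity metric that corrects the measurement bias caused by length variation in synthetic text, and shows experimentally that it outperforms existing metrics.

Details Motivation: Synthetic text generated by large language models is widely used to further train and improve models, and diversity is key to its usefulness. However, how prompt engineering for diversity affects text length, and how that in turn biases diversity measurements, is not well understood.

Contribution: Proposes the Penalty-Adjusted Type-Token Ratio (PATTR), a diversity metric robust to length variation, and validates it in large-scale experiments.

Method: Generates a synthetic corpus of over 20 million words with models from the LLaMA, OLMo, and Phi families, proposes PATTR, and compares it with the existing metrics MATTR and CR.

Result: When filtering for the most lexically diverse responses, PATTR outperforms MATTR and CR, giving on-par or better diversity with closer adherence to the target length.

Insight: Length variation introduces bias, and existing diversity metrics tend to favor shorter texts. PATTR mitigates this by explicitly taking the target length into account.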

Abstract: Synthetic text generated by Large Language Models (LLMs) is increasingly used for further training and improvement of LLMs. Diversity is crucial for the effectiveness of synthetic data, and researchers rely on prompt engineering to improve diversity. However, the impact of prompt variations on response text length, and, more importantly, the consequential effect on lexical diversity measurements, remain underexplored. In this work, we propose Penalty-Adjusted Type-Token Ratio (PATTR), a diversity metric robust to length variations. We generate a large synthetic corpus of over 20M words using seven models from the LLaMA, OLMo, and Phi families, focusing on a creative writing task of video script generation, where diversity is crucial. We evaluate per-response lexical diversity using PATTR and compare it against existing metrics of Moving-Average TTR (MATTR) and Compression Ratio (CR). Our analysis highlights how text length variations introduce biases favoring shorter responses. Unlike existing metrics, PATTR explicitly considers the task-specific target response length ($L_T$) to effectively mitigate length biases. We further demonstrate the utility of PATTR in filtering the top-10/100/1,000 most lexically diverse responses, showing that it consistently outperforms MATTR and CR by yielding on par or better diversity with high adherence to $L_T$.
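
The length bias discussed above is easy to see with the plain type-token ratio (TTR), and Moving-Average TTR (MATTR) is one of the baselines the paper compares against. The sketch below implements those two standard metrics only. The abstract does not give the PATTR formula, so PATTR is deliberately not implemented here; the window size of 50 is an arbitrary default chosen for this example.

```python
def ttr(tokens):
    """Type-token ratio: number of distinct tokens divided by total tokens."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def mattr(tokens, window=50):
    """Moving-Average TTR: mean TTR over all sliding windows of fixed size.

    Texts shorter than the window fall back to plain TTR.
    """
    if not tokens:
        return 0.0
    if len(tokens) <= window:
        return ttr(tokens)
    ratios = [ttr(tokens[i:i + window]) for i in range(len(tokens) - window + 1)]
    return sum(ratios) / len(ratios)


short_text = ["the", "cat", "sat"]
long_text = short_text * 4           # same vocabulary, four times longer

short_ttr = ttr(short_text)          # 1.0
long_ttr = ttr(long_text)            # 0.25
```

The two texts use exactly the same words, yet plain TTR scores the shorter one four times higher. This is the kind of bias toward short responses that a length-aware metric has to correct.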

[28] ChiMed 2.0: Advancing Chinese Medical Dataset in Facilitating Large Language Modeling

Yuanhe Tian,Junjie Liu,Zhizhou Kou,Yuxiang Li,Yan Song

Main category: cs.CL

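TL;DR: ChiMed 2.0 is a high-quality Chinese medical dataset that extends ChiMed with data for pre-training, supervised fine-tuning (SFT), and reinforcement learning from human feedback (RLHF). Experiments confirm its usefulness for training Chinese medical LLMs.

Details Motivation: Existing Chinese medical datasets are small, narrow in domain coverage, and designed only for fine-tuning. The lack of support for pre-training and RLHF limits research on and applications of Chinese medical LLMs.

Contribution: Proposes the ChiMed 2.0 dataset, which covers both traditional Chinese medicine classics and modern medical data and provides data for pre-training, SFT, and RLHF, filling a gap in Chinese medical data resources.

Method: The dataset is built from data collected from Chinese medical online platforms and data generated by LLMs. Its effectiveness is then validated with further pre-training, SFT, and RLHF experiments.

Result: Experiments show performance gains across models of different scales, validating the dataset's effectiveness and applicability.

Insight: High-quality, diverse domain datasets are essential for training large models, and ChiMed 2.0 is an important resource for research and applications in Chinese medicine.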

Abstract: Building high-quality data resources is crucial for advancing artificial intelligence research and applications in specific domains, particularly in the Chinese medical domain. Existing Chinese medical datasets are limited in size and narrow in domain coverage, falling short of the diverse corpora required for effective pre-training. Moreover, most datasets are designed solely for LLM fine-tuning and do not support pre-training and reinforcement learning from human feedback (RLHF). In this paper, we propose a Chinese medical dataset named ChiMed 2.0, which extends our previous work ChiMed, and covers data collected from Chinese medical online platforms and generated by LLMs. ChiMed 2.0 contains 204.4M Chinese characters covering both traditional Chinese medicine classics and modern general medical data, where there are 164.8K documents for pre-training, 351.6K question-answering pairs for supervised fine-tuning (SFT), and 41.7K preference data tuples for RLHF. To validate the effectiveness of our approach for training a Chinese medical LLM, we conduct further pre-training, SFT, and RLHF experiments on representative general domain LLMs and evaluate their performance on medical benchmark datasets. The results show performance gains across different model scales, validating the dataset’s effectiveness and applicability.

[29] Reasoning Models are Test Exploiters: Rethinking Multiple-Choice

Narun Raman,Taylor Lundy,Kevin Leyton-Brown

Main category: cs.CL

TL;DR: The paper asks whether state-of-the-art reasoning models overperform on multiple-choice question answering (MCQA) by exploiting information in the answer options. MCQA remains a valid proxy when chain-of-thought reasoning happens only before the options are shown, but it becomes distorted once models can reason over the options. The authors recommend more robust benchmark designs.

Details Motivation: The paper tests whether MCQA is still a valid way to assess the reasoning ability of modern large language models (LLMs), particularly when models can exploit information implicit in the options.

Contribution: A systematic evaluation of 15 question-answering benchmarks and 25 LLMs, showing that advanced models can overperform on MCQA by using the options, plus practical guidelines for better benchmark design.

Method: Models are evaluated under 5 ways of presenting questions (for example, whether options are provided and when chain-of-thought reasoning is allowed), comparing free-text and multiple-choice performance.

Result: Models allowed to reason after seeing the options perform significantly better than in the free-text setting, so MCQA can overestimate their ability. When chain-of-thought reasoning is allowed only before the options are presented, MCQA remains a good proxy.

Insight: Modern LLMs can effectively "cheat" by exploiting the options, so MCQA may not reflect their genuine reasoning ability; benchmarks need to be more robust and bias-resistant.

Abstract: When evaluating Large Language Models (LLMs) in question-answering domains, it is common to ask the model to choose among a fixed set of choices (so-called multiple-choice question-answering, or MCQA). Although downstream tasks of interest typically do not provide systems with explicit options among which to choose, this approach is nevertheless widely used because it makes automatic grading straightforward and has tended to produce challenging benchmarks that correlate sufficiently well with downstream performance. This paper investigates the extent to which this trend continues to hold for state-of-the-art reasoning models, describing a systematic evaluation of $15$ different question-answering benchmarks (e.g., MMLU, HLE) and $25$ different LLMs (including small models such as Qwen 7B and relatively large models such as Llama 70B). For each model-benchmark pair, we considered $5$ ways of presenting the model with questions, including variations on whether multiple choices were offered to the model at all; whether “none of the above” sometimes replaced the right answer; and whether the model was permitted to perform chain-of-thought reasoning before and/or after the choices were presented. MCQA remained a good proxy for the downstream performance of models as long as they were allowed to perform chain-of-thought reasoning only before being presented with the options among which they had to select. On the other hand, large models that were able to perform reasoning after being given a set of options tended to significantly outperform their free-text performance due to exploiting the information in the options. We conclude that MCQA is no longer a good proxy for assessing downstream performance of state-of-the-art models, and offer practical guidelines for designing more robust, bias-resistant benchmarks that better reflect LLMs’ genuine reasoning capabilities.

[30] LionGuard 2: Building Lightweight, Data-Efficient & Localised Multilingual Content Moderators

Leanne Tan,Gabriel Chua,Ziyu Ge,Roy Ka-Wei Lee

Main category: cs.CL

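TL;DR: LionGuard 2 is a lightweight multilingual content moderation system designed for Singapore's language environment (English, Chinese, Malay, and partial Tamil). Built on pre-trained OpenAI embeddings and a multi-head ordinal classifier, it outperforms several commercial and open-source systems across 17 benchmarks.

Details Motivation: Existing moderation systems support multiple languages but often neglect localisation and low-resource variants, which leaves safety gaps in real deployments. Large models are capable but demand considerable data and compute.

Contribution: LionGuard 2 is a lightweight, data-efficient, localised multilingual moderation system. It shows that high-quality local data and multilingual embeddings can deliver strong performance without fine-tuning large models.

Method: The system is built on pre-trained OpenAI embeddings and a multi-head ordinal classifier, tailored to Singapore's language environment (English, Chinese, Malay, and partial Tamil).

Result: It outperforms several commercial and open-source systems across 17 benchmarks, including Singapore-specific and public English datasets, and is deployed within the Singapore Government.

Insight: High-quality local data and robust multilingual embeddings can achieve strong moderation performance without fine-tuning large models, which suggests a practical approach to content safety in low-resource language settings.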

Abstract: Modern moderation systems increasingly support multiple languages, but often fail to address localisation and low-resource variants - creating safety gaps in real-world deployments. Small models offer a potential alternative to large LLMs, yet still demand considerable data and compute. We present LionGuard 2, a lightweight, multilingual moderation classifier tailored to the Singapore context, supporting English, Chinese, Malay, and partial Tamil. Built on pre-trained OpenAI embeddings and a multi-head ordinal classifier, LionGuard 2 outperforms several commercial and open-source systems across 17 benchmarks, including both Singapore-specific and public English datasets. The system is actively deployed within the Singapore Government, demonstrating practical efficacy at scale. Our findings show that high-quality local data and robust multilingual embeddings can achieve strong moderation performance, without fine-tuning large models. We release our model weights and part of our training data to support future work on LLM safety.

[31] Probing Information Distribution in Transformer Architectures through Entropy Analysis

Amedeo Buonanno,Alessandro Rivetti,Francesco A. N. Palmieri,Giovanni Di Gennaro,Gianmarco Romano

Main category: cs.CL

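TL;DR: The paper uses entropy analysis to probe how information is distributed in Transformer architectures and illustrates its potential for understanding model behavior and internal representations.

Details Motivation: The study aims to quantify, through entropy analysis, how Transformer models manage and transform information, in support of interpretability and evaluation frameworks.

Contribution: Proposes an entropy-based method for probing information distribution in Transformer architectures and demonstrates it on a GPT-based model as a case study.

Method: Quantifies token-level uncertainty with entropy and examines entropy patterns across different stages of processing to investigate how information flows and is managed.

Result: The case study indicates that entropy analysis can reveal aspects of a Transformer model's internal information distribution and behavior, offering a new angle for understanding such models.

Insight: Entropy analysis is a promising tool for understanding how Transformer models manage information and can support interpretability research.

Abstract: This work explores entropy analysis as a tool for probing information distribution within Transformer-based architectures. By quantifying token-level uncertainty and examining entropy patterns across different stages of processing, we aim to investigate how information is managed and transformed within these models. As a case study, we apply the methodology to a GPT-based large language model, illustrating its potential to reveal insights into model behavior and internal representations. This approach may also contribute to the development of interpretability and evaluation frameworks for Transformer-based models.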
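
Token-level uncertainty is typically measured as the Shannon entropy of a next-token probability distribution. The sketch below shows that calculation from raw logits. It assumes the distribution is obtained with a softmax and reports entropy in bits; the abstract does not specify which distributions or units the authors use, and the function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def token_entropy_bits(logits):
    """Shannon entropy (in bits) of the distribution implied by the logits."""
    probs = softmax(logits)
    return -sum(p * math.log2(p) for p in probs if p > 0.0)


uniform_entropy = token_entropy_bits([0.0, 0.0, 0.0, 0.0])   # 2 bits for 4 equally likely tokens
confident_entropy = token_entropy_bits([100.0, 0.0])         # close to 0 bits
```

Entropy is highest when the model spreads probability evenly over the vocabulary and approaches zero when it is nearly certain of one token. Computing it at each position, or at each processing stage, gives the kind of entropy profile the abstract refers to.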

[32] STITCH: Simultaneous Thinking and Talking with Chunked Reasoning for Spoken Language Models

Cheng-Han Chiang,Xiaofei Wang,Linjie Li,Chung-Ching Lin,Kevin Lin,Shujie Liu,Zhendong Wang,Zhengyuan Yang,Hung-yi Lee,Lijuan Wang

Main category: cs.CL

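TL;DR: STITCH is a generation method that alternates between unspoken reasoning chunks and spoken response chunks, letting a spoken language model think while it talks. It improves reasoning without adding latency over baselines that cannot reason.

Details Motivation: Current spoken language models cannot carry out an internal, unspoken thinking process before answering, whereas humans reason internally and can therefore communicate clearly and concisely. Integrating unspoken reasoning into spoken language models is highly desirable.

Contribution: Proposes STITCH, which interleaves unspoken reasoning chunks with spoken response chunks so that thinking and talking happen simultaneously. It avoids the latency of generating a complete chain of thought before speaking and outperforms baselines by 15% on math reasoning datasets.

Method: STITCH generates in chunks and uses the audio playback time of each spoken response chunk to generate the next unspoken reasoning chunk, so reasoning proceeds in real time without delaying speech.

Result: STITCH outperforms baselines by 15% on math reasoning datasets, performs on par with them on non-reasoning datasets, and matches their latency.

Insight: Chunked generation shows how thinking and speaking can be combined without added latency, offering a new route to stronger reasoning in spoken language models.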

Abstract: Spoken Language Models (SLMs) are designed to take speech inputs and produce spoken responses. However, current SLMs lack the ability to perform an internal, unspoken thinking process before responding. In contrast, humans typically engage in complex mental reasoning internally, enabling them to communicate ideas clearly and concisely. Thus, integrating an unspoken thought process into SLMs is highly desirable. While naively generating a complete chain-of-thought (CoT) reasoning before starting to talk can enable thinking for SLMs, this induces additional latency for the speech response, as the CoT reasoning can be arbitrarily long. To solve this issue, we propose Stitch, a novel generation method that alternates between the generation of unspoken reasoning chunks and spoken response chunks. Since the audio duration of a chunk of spoken response is much longer than the time to generate the tokens in a chunk of spoken response, we use the remaining free time to generate the unspoken reasoning tokens. When a chunk of audio is played to the user, the model continues to generate the next unspoken reasoning chunk, achieving simultaneous thinking and talking. Remarkably, Stitch matches the latency of baselines that cannot generate unspoken CoT by design while outperforming those baselines by 15% on math reasoning datasets; Stitch also performs equally well on non-reasoning datasets as those baseline models. Some animations and demonstrations are on the project page: https://d223302.github.io/STITCH.
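
The timing argument in the abstract can be made concrete with a small sketch: while one audio chunk plays, the model has a token budget, part of which goes to the next spoken chunk and the rest to unspoken reasoning. All numbers, function names, and the reasoning-first ordering below are assumptions for illustration, not values or design details confirmed by the abstract.

```python
def reasoning_budget(audio_seconds, speech_tokens, tokens_per_second):
    """Tokens left for unspoken reasoning while one audio chunk is playing.

    audio_seconds: playback time of the current spoken chunk (hypothetical)
    speech_tokens: tokens needed to produce the next spoken chunk (hypothetical)
    tokens_per_second: generation speed of the model (hypothetical)
    """
    generated_during_playback = int(audio_seconds * tokens_per_second)
    return max(0, generated_during_playback - speech_tokens)

def interleave(reasoning_chunks, speech_chunks):
    """Alternate unspoken and spoken chunks; reasoning-first order is an assumption."""
    schedule = []
    for i in range(max(len(reasoning_chunks), len(speech_chunks))):
        if i < len(reasoning_chunks):
            schedule.append(("think", reasoning_chunks[i]))
        if i < len(speech_chunks):
            schedule.append(("speak", speech_chunks[i]))
    return schedule
```

Under these made-up numbers, a 2-second audio chunk at 80 tokens per second leaves 134 tokens for reasoning after 26 speech tokens are produced, which is the "free time" the method exploits.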

[33] AlgoSimBench: Identifying Algorithmically Similar Problems for Competitive Programming

Jierui Li,Raymond Mooney

Main category: cs.CL

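TL;DR: The paper introduces AlgoSimBench, a new benchmark for evaluating how well large language models (LLMs) identify algorithmically similar problems (ASPs), and proposes attempted solution matching (ASM), a method that improves problem similarity detection.

Details Motivation: Although LLMs solve complex competitive programming problems well, it is unclear whether this ability generalizes to related domains that are seen less often during training.

Contribution: (1) Introduces the AlgoSimBench benchmark; (2) develops ASM, which improves the accuracy of problem similarity detection; (3) evaluates code embedding models and retrieval methods on similar problem identification.

Method: Uses ASM, which matches problems through attempted solutions, and combines it with BM25, a keyword-prioritized method, to identify similar problems more accurately.

Result: The best model (o3-mini) reaches 65.9% accuracy on the MCQ task; ASM yields an absolute improvement of 6.7 to 11.7 percentage points across models; combining ASM with BM25 reaches up to 52.2% accuracy.

Insight: Models struggle to identify algorithmically similar problems, but ASM and summarizing problems to remove narrative elements improve performance substantially.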

Abstract: Recent progress in LLMs, such as reasoning models, has demonstrated strong abilities to solve complex competitive programming problems, often rivaling top human competitors. However, it remains underexplored whether these abilities generalize to relevant domains that are less seen during training. To address this, we introduce AlgoSimBench, a new benchmark designed to assess LLMs’ ability to identify algorithmically similar problems (ASPs), that is, problems that can be solved using similar algorithmic approaches. AlgoSimBench consists of 1317 problems, annotated with 231 distinct fine-grained algorithm tags, from which we curate 402 multiple-choice questions (MCQs), where each question presents one algorithmically similar problem alongside three textually similar but algorithmically dissimilar distractors. Our evaluation reveals that LLMs struggle to identify ASPs, with the best-performing model (o3-mini) achieving only 65.9% accuracy on the MCQ task. To address this challenge, we propose attempted solution matching (ASM), a novel method for improving problem similarity detection. On our MCQ task, ASM yields an absolute accuracy improvement of 6.7% to 11.7% across different models. We also evaluated code embedding models and retrieval methods on similar problem identification. While the adversarial selection of problems degrades the performance to be less than random, we found that simply summarizing the problem to remove narrative elements eliminates the effect, and combining ASM with a keyword-prioritized method, BM25, can yield up to 52.2% accuracy. Code and data are available at github.com

[34] Step-level Verifier-guided Hybrid Test-Time Scaling for Large Language Models

Kaiyan Chang,Yonghao Shi,Chenglong Wang,Hang Zhou,Chi Hu,Xiaoqian Liu,Yingfeng Luo,Yuan Ge,Tong Xiao,Jingbo Zhu

Main category: cs.CL

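TL;DR: This paper proposes a training-free test-time scaling (TTS) approach that combines conditional step-level self-refinement with classical parallel scaling methods to improve the reasoning performance of large language models.

Details Motivation: Training-based TTS methods such as reinforcement learning are popular, but the extra computation required for training adds to the burden of test-time scaling. Training-free TTS methods for reasoning are therefore worth studying.

Contribution: (1) Designs Conditional Step-level Self-refinement, a fine-grained sequential scaling method guided by process verification; (2) proposes the Hybrid Test-Time Scaling paradigm, which combines several training-free TTS methods at the step level.

Method: Combines conditional step-level self-refinement (fine-grained sequential scaling) with classical parallel scaling methods to form a hybrid test-time scaling strategy.

Result: Experiments on five instruction-tuned LLMs across scales (3B-14B) and families show that the hybrid strategy has considerable potential for expanding the reasoning performance boundary.

Insight: Fine-grained, training-free TTS combined with parallel scaling can elicit a model's reasoning ability more efficiently while avoiding the compute overhead of training.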

Abstract: Test-Time Scaling (TTS) is a promising approach to progressively elicit the model’s intelligence during inference. Recently, training-based TTS methods, such as continued reinforcement learning (RL), have further surged in popularity, while training-free TTS methods are gradually fading from prominence. However, the additional computation overhead of training amplifies the burden on test-time scaling. In this paper, we focus on training-free TTS methods for reasoning. We first design Conditional Step-level Self-refinement, a fine-grained sequential scaling method guided by process verification. On top of its effectiveness, we further combine it with other classical parallel scaling methods at the step level, to introduce a novel inference paradigm called Hybrid Test-Time Scaling. Extensive experiments on five instruction-tuned LLMs across different scales (3B-14B) and families demonstrate that hybrid strategy incorporating various training-free TTS methods at a fine granularity has considerable potential for expanding the reasoning performance boundaries of LLMs.
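
One way to picture a step-level hybrid of parallel and sequential scaling is sketched below: several candidate steps are sampled in parallel, a verifier picks the best one, and refinement is triggered only when the best score falls below a threshold. The callables `propose`, `verify`, and `refine`, and the parameters `n`, `threshold`, and `max_refine`, are hypothetical placeholders; the paper's actual procedure may differ.

```python
def hybrid_step(prefix, propose, verify, refine, n=4, threshold=0.5, max_refine=2):
    """Choose the next reasoning step for `prefix` (illustrative sketch).

    propose(prefix, i) -> candidate step        (parallel scaling: n samples)
    verify(prefix, step) -> score in [0, 1]     (process verification)
    refine(prefix, step) -> revised step        (sequential self-refinement)
    """
    # Parallel part: sample n candidates and keep the best-scoring one.
    candidates = [propose(prefix, i) for i in range(n)]
    best = max(candidates, key=lambda step: verify(prefix, step))

    # Conditional sequential part: refine only while the verifier is unsatisfied.
    attempts = 0
    while verify(prefix, best) < threshold and attempts < max_refine:
        best = refine(prefix, best)
        attempts += 1
    return best
```

Making refinement conditional keeps the extra sequential compute for steps the verifier flags, rather than spending it uniformly on every step.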

[35] Smart Eyes for Silent Threats: VLMs and In-Context Learning for THz Imaging

Nicolas Poggi,Shashank Agnihotri,Margret Keuper

Main category: cs.CL

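TL;DR: The paper proposes a flexible approach that requires no fine-tuning, using vision-language models (VLMs) with in-context learning (ICL) to improve terahertz (THz) image classification in low-data settings while improving interpretability.

Details Motivation: THz imaging is promising for applications such as security screening and material classification, but image classification is hampered by limited annotations, low resolution, and visual ambiguity. An effective method that does not need large amounts of labeled data is required.

Contribution: Applies ICL-enhanced VLMs to THz imaging for the first time and proposes a modality-aligned prompting framework, showing improved classification and interpretability in zero-shot and one-shot settings.

Method: Uses a modality-aligned prompting framework to adapt two open-weight VLMs to the THz domain and evaluates them in zero-shot and one-shot experiments.

Result: Experiments show that ICL improves classification and interpretability in low-data regimes.

Insight: Combining VLMs with ICL offers resource-constrained scientific domains an effective approach that does not depend on large annotated datasets.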

Abstract: Terahertz (THz) imaging enables non-invasive analysis for applications such as security screening and material classification, but effective image classification remains challenging due to limited annotations, low resolution, and visual ambiguity. We introduce In-Context Learning (ICL) with Vision-Language Models (VLMs) as a flexible, interpretable alternative that requires no fine-tuning. Using a modality-aligned prompting framework, we adapt two open-weight VLMs to the THz domain and evaluate them under zero-shot and one-shot settings. Our results show that ICL improves classification and interpretability in low-data regimes. This is the first application of ICL-enhanced VLMs to THz imaging, offering a promising direction for resource-constrained scientific domains. Code: https://github.com/Nicolas-Poggi/Project_THz_Classification/tree/main

[36] Learning to Extract Rational Evidence via Reinforcement Learning for Retrieval-Augmented Generation

Xinping Zhao,Shouzheng Huang,Yan Zhong,Xinshuo Hu,Baotian Hu,Min Zhang

Main category: cs.CL

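TL;DR: LEAR uses a reinforcement learning framework to reason explicitly and then extract rational evidence, improving the accuracy of retrieval-augmented generation and mitigating the retrieval noise that affects earlier methods.

Details Motivation: Retrieval-augmented generation (RAG) improves the accuracy of large language models (LLMs), but retrieval noise degrades generation quality. Previous methods extract evidence without an explicit reasoning process, so they risk filtering out key clues and generalize poorly.

Contribution: Proposes LEAR, which first reasons explicitly to identify potential cues and then extracts evidence so that key cues are not omitted; applies knowledge token masks for disentanglement; and designs three verifiable reward functions (answer, length, and format) for reinforcement learning.

Method: Frames evidence reasoning and evidence extraction as one unified response for end-to-end training; uses knowledge token masks to disentangle reasoning-based and extraction-based answers; and updates the model with a policy optimization algorithm driven by the three reward types.

Result: Experiments on three benchmark datasets show that LEAR provides compact, high-quality evidence, improves downstream task accuracy, and is suitable for online RAG systems.

Insight: Combining explicit reasoning with reinforcement learning reduces the effect of retrieval noise and improves the precision and generalization of evidence extraction, pointing to one way of optimizing RAG systems.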

Abstract: Retrieval-Augmented Generation (RAG) effectively improves the accuracy of Large Language Models (LLMs). However, retrieval noises significantly impact the quality of LLMs’ generation, necessitating the development of denoising mechanisms. Previous methods extract evidence straightforwardly without explicit thinking, which risks filtering out key clues and struggles with generalization. To this end, we propose LEAR, which learns to extract rational evidence by (1) explicitly reasoning to identify potential cues within retrieval contents first, and then (2) consciously extracting to avoid omitting any key cues helpful for answering questions. Specifically, we frame evidence reasoning and evidence extraction into one unified response for end-to-end training; apply knowledge token masks for disentanglement to derive reasoning-based and extraction-based answers; and devise three types of verifiable reward functions, including answer, length, and format, to update the model via the policy optimization algorithm. Extensive experiments on three benchmark datasets show the effectiveness of LEAR, providing compact and high-quality evidence, improving the accuracy of downstream tasks, and promoting effective application in online RAG systems.

[37] Leveraging Context for Multimodal Fallacy Classification in Political Debates

Alessio Pittiglio

Main category: cs.CL

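TL;DR: This paper presents an approach to logical fallacy classification in multimodal argument mining, using pretrained Transformer models and multimodal context, with moderate results on political debate data.

Details Motivation: The aim is to improve the accuracy of logical fallacy classification in multimodal argument mining, particularly in the complex setting of political debates.

Contribution: Proposes ways to use pretrained Transformer models together with multimodal context, and shows the potential of multimodal models for logical fallacy classification.

Method: Builds on pretrained Transformer models, combines modalities such as text and audio, and explores several ways of leveraging context.

Result: In the fallacy classification subtask, the macro F1-scores are 0.4444 (text), 0.3559 (audio), and 0.4403 (multimodal); the multimodal model performs close to the text-only model.

Insight: Because the multimodal model only matches the text-only model, multimodal fusion has room for improvement and may need more sophisticated context modeling.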

Abstract: In this paper, we present our submission to the MM-ArgFallacy2025 shared task, which aims to advance research in multimodal argument mining, focusing on logical fallacies in political debates. Our approach uses pretrained Transformer-based models and proposes several ways to leverage context. In the fallacy classification subtask, our models achieved macro F1-scores of 0.4444 (text), 0.3559 (audio), and 0.4403 (multimodal). Our multimodal model showed performance comparable to the text-only model, suggesting potential for improvements.
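
Since the reported scores are macro F1, a minimal sketch of that metric may help when reading them: F1 is computed per class and then averaged with equal weight, so rare fallacy classes count as much as frequent ones. The label names in the usage note are hypothetical and are not the shared task's categories.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the denominator is 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

For example, with true labels `["a", "a", "b", "c"]` and predictions `["a", "b", "b", "c"]`, the per-class F1 values are 2/3, 2/3, and 1, giving a macro F1 of 7/9.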

[38] P3: Prompts Promote Prompting

Xinyu Zhang,Yuanquan Hu,Fangchao Liu,Zhicheng Dou

Main category: cs.CL

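TL;DR: P3 is a self-improvement framework that optimizes system and user prompts together to improve large language model performance; experiments show it outperforms existing methods on a range of tasks.

Details Motivation: Prompt optimization for large language models usually targets either the system prompt or the user prompt alone. Because the two are interdependent, such one-sided optimization often gives suboptimal results. P3 addresses this by optimizing both at once.

Contribution: Proposes a self-improvement framework that jointly optimizes system and user prompts, and uses the offline-optimized prompts to further improve online prompting.

Method: P3 optimizes system and user prompts concurrently through an iterative process and applies the offline results online through query-dependent prompt optimization.

Result: On general tasks (e.g., Arena-hard and Alpaca-eval) and reasoning tasks (e.g., GSM8K and GPQA), P3 outperforms existing automatic prompt optimization methods.

Insight: A holistic strategy that optimizes system and user prompts together improves large language model performance across diverse domains.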

Abstract: Current large language model (LLM) applications often employ multi-component prompts, comprising both system and user prompts, to guide model behaviors. While recent advancements have demonstrated the efficacy of automatically optimizing either the system or user prompt to boost performance, such unilateral approaches often yield suboptimal outcomes due to the interdependent nature of these components. In this work, we introduce P3, a novel self-improvement framework that concurrently optimizes both system and user prompts through an iterative process. The offline optimized prompts are further leveraged to promote online prompting by performing query-dependent prompt optimization. Extensive experiments on general tasks (e.g., Arena-hard and Alpaca-eval) and reasoning tasks (e.g., GSM8K and GPQA) demonstrate that P3 achieves superior performance in the realm of automatic prompt optimization. Our results highlight the effectiveness of a holistic optimization strategy in enhancing LLM performance across diverse domains.

[39] CoLD: Counterfactually-Guided Length Debiasing for Process Reward Models

Congmin Zheng,Jiachen Zhu,Jianghao Lin,Xinyi Dai,Yong Yu,Weinan Zhang,Mengyue Yang

Main category: cs.CL

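TL;DR: The paper proposes CoLD, a framework grounded in counterfactual reasoning and causal graph analysis that addresses length bias in Process Reward Models through a length-penalty adjustment, a learned bias estimator, and a joint training strategy. Experiments show that it reduces the correlation between reward and length.

Details Motivation: Existing Process Reward Models (PRMs) have a length bias: they tend to assign higher scores to longer reasoning steps even when semantic content and logical validity are unchanged. This undermines the reliability of reward predictions and leads to overly verbose outputs.

Contribution: Proposes CoLD, which mitigates length bias in PRMs through three components: an explicit length-penalty adjustment, a learned bias estimator, and a joint training strategy.

Method: Grounded in counterfactual reasoning and causal graph analysis, the method uses the explicit length penalty, the learned bias estimator, and joint training to make reward predictions length-invariant.

Result: Experiments on MATH500 and GSM-Plus show that CoLD consistently reduces reward-length correlation, improves step selection accuracy, and encourages more concise, logically valid reasoning.

Insight: Length bias is pervasive in PRMs; counterfactual reasoning and causal analysis can mitigate it and improve the robustness and reliability of these models.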

Abstract: Process Reward Models (PRMs) play a central role in evaluating and guiding multi-step reasoning in large language models (LLMs), especially for mathematical problem solving. However, we identify a pervasive length bias in existing PRMs: they tend to assign higher scores to longer reasoning steps, even when the semantic content and logical validity are unchanged. This bias undermines the reliability of reward predictions and leads to overly verbose outputs during inference. To address this issue, we propose CoLD (Counterfactually-Guided Length Debiasing), a unified framework that mitigates length bias through three components: an explicit length-penalty adjustment, a learned bias estimator trained to capture spurious length-related signals, and a joint training strategy that enforces length-invariance in reward predictions. Our approach is grounded in counterfactual reasoning and informed by causal graph analysis. Extensive experiments on MATH500 and GSM-Plus show that CoLD consistently reduces reward-length correlation, improves accuracy in step selection, and encourages more concise, logically valid reasoning. These results demonstrate the effectiveness and practicality of CoLD in improving the fidelity and robustness of PRMs.
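
The reward-length correlation discussed above can be measured with a plain Pearson coefficient, and the simplest form of a length penalty subtracts a term proportional to step length. The linear penalty `r - lam * length` and the toy numbers in the usage note are assumptions for illustration; the abstract does not specify the functional form CoLD uses.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def length_penalized(rewards, lengths, lam):
    """Subtract an assumed linear length penalty from each raw reward."""
    return [r - lam * length for r, length in zip(rewards, lengths)]
```

With made-up step lengths `[10, 20, 30, 40]` and raw rewards `[0.60, 0.72, 0.78, 0.92]`, the raw correlation is about 0.99; after a penalty with `lam = 0.01` it drops to about 0.13, which is the kind of reduction the reported metric tracks.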

[40] Is Large Language Model Performance on Reasoning Tasks Impacted by Different Ways Questions Are Asked?

Seok Hwan Song,Mohna Chakraborty,Qi Li,Wallapak Tavanapong

Main category: cs.CL

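TL;DR: This study examines how different question types affect large language model (LLM) performance on reasoning tasks and finds that question format has a significant effect on accuracy.

Details Motivation: LLMs have been evaluated with many question types, but no prior study has systematically analyzed how question type affects accuracy on reasoning tasks.

Contribution: Fills this gap and shows experimentally how question format, the number of options, and wording affect model performance.

Method: Evaluates five LLMs on three question types using quantitative and deductive reasoning tasks, measuring accuracy both in the reasoning steps and in the final answer.

Result: (1) Performance differs significantly across question types; (2) reasoning accuracy does not necessarily correlate with final answer selection accuracy; (3) the number of options and the choice of words influence performance.

Insight: Differences in question format can mask an LLM's actual reasoning ability, so evaluations should account for the effect of question design.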

Abstract: Large Language Models (LLMs) have been evaluated using diverse question types, e.g., multiple-choice, true/false, and short/long answers. This study answers an unexplored question about the impact of different question types on LLM accuracy on reasoning tasks. We investigate the performance of five LLMs on three different types of questions using quantitative and deductive reasoning tasks. The performance metrics include accuracy in the reasoning steps and choosing the final answer. Key Findings: (1) Significant differences exist in LLM performance across different question types. (2) Reasoning accuracy does not necessarily correlate with the final selection accuracy. (3) The number of options and the choice of words influence LLM performance.

[41] Chinchunmei at SemEval-2025 Task 11: Boosting the Large Language Model’s Capability of Emotion Perception using Contrastive Learning

Tian Li,Yujian Sun,Huizhi Liang

Main category: cs.CL

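TL;DR: The paper explores two contrastive learning approaches, sample-based and generation-based, for improving the emotion perception of large language models, and reports competitive results at SemEval-2025 Task 11.

Details Motivation: The work targets the challenges that diverse emotional expressions and background variations pose for emotion detection, in a multilingual setting spanning more than 28 languages.

Contribution: Systematically studies sample-based contrastive learning (Contrastive Reasoning Calibration, CRC) and generation-based contrastive learning (DPO, SimPO) for improving the emotion perception of LLaMa3-Instruct-8B.

Method: 1. Sample-based contrastive learning (CRC): trains the model by comparing two samples. 2. Generation-based contrastive learning (DPO, SimPO): trains the model to distinguish correct from incorrect generations.

Result: At SemEval-2025 Task 11, the system ranks 9th in Track A and 6th in Track B for English, and is among the top-tier systems for other languages.

Insight: Contrastive learning can improve large language model performance on emotion detection; the summary suggests generation-based contrastive methods are particularly suited to emotion intensity prediction in multilingual settings.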

Abstract: The SemEval-2025 Task 11, Bridging the Gap in Text-Based Emotion Detection, introduces an emotion recognition challenge spanning over 28 languages. This competition encourages researchers to explore more advanced approaches to address the challenges posed by the diversity of emotional expressions and background variations. It features two tracks: multi-label classification (Track A) and emotion intensity prediction (Track B), covering six emotion categories: anger, fear, joy, sadness, surprise, and disgust. In our work, we systematically explore the benefits of two contrastive learning approaches: sample-based (Contrastive Reasoning Calibration) and generation-based (DPO, SimPO) contrastive learning. The sample-based contrastive approach trains the model by comparing two samples to generate more reliable predictions. The generation-based contrastive approach trains the model to differentiate between correct and incorrect generations, refining its prediction. All models are fine-tuned from LLaMa3-Instruct-8B. Our system achieves 9th place in Track A and 6th place in Track B for English, while ranking among the top-tier performing systems for other languages.
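
For readers unfamiliar with the generation-based objectives named above, the sketch below writes out the standard published forms of the DPO and SimPO losses for a single preference pair, taking sequence log-probabilities as inputs. The default values of `beta` and `gamma` are placeholders, and how the authors build preference pairs for emotion detection is not described in the abstract.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one pair; arguments are summed sequence log-probabilities."""
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    return -log_sigmoid(beta * margin)

def simpo_loss(policy_chosen, policy_rejected, len_chosen, len_rejected,
               beta=2.0, gamma=0.5):
    """SimPO loss: reference-free, length-normalized, with target margin gamma."""
    margin = beta * (policy_chosen / len_chosen - policy_rejected / len_rejected) - gamma
    return -log_sigmoid(margin)
```

When the policy equals the reference model, the DPO loss is log 2; it decreases as the policy assigns relatively more probability to the chosen (correct) generation than to the rejected one.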

[42] BEnchmarking LLMs for Ophthalmology (BELO) for Ophthalmological Knowledge and Reasoning

Sahana Srinivasan,Xuguang Ai,Thaddaeus Wai Soon Lo,Aidan Gilson,Minjie Zou,Ke Zou,Hyunjae Kim,Mingjia Yang,Krithi Pushpanathan,Samantha Yew,Wan Ting Loke,Jocelyn Goh,Yibing Chen,Yiming Kong,Emily Yuelei Fu,Michelle Ongyong Hui,Kristen Nwanyanwu,Amisha Dave,Kelvin Zhenghao Li,Chen-Hsin Sun,Mark Chia,Gabriel Dawei Yang,Wendy Meihua Wong,David Ziyou Chen,Dianbo Liu,Maxwell Singer,Fares Antaki,Lucian V Del Priore,Jost Jonas,Ron Adelman,Qingyu Chen,Yih-Chung Tham

Main category: cs.CL

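TL;DR: BELO is a standardized, comprehensive benchmark for evaluating the clinical accuracy and reasoning quality of large language models (LLMs) in ophthalmology, curated and validated by a team of experts.

Details Motivation: Existing benchmarks for LLMs in ophthalmology are limited in scope and disproportionately prioritize accuracy. BELO aims to provide a more comprehensive evaluation tool.

Contribution: BELO aggregates high-quality multiple-choice questions (MCQs) from several medical datasets through multiple rounds of expert review, and adds text-generation metrics and human evaluation.

Method: Ophthalmology-related MCQs are selected with keyword matching and a fine-tuned PubMedBERT model and then reviewed by experts over multiple rounds. Evaluation covers accuracy, macro-F1, and several text-generation metrics.

Result: BELO contains 900 high-quality questions; six LLMs were evaluated on it, showing differences in their performance on ophthalmology.

Insight: BELO provides a reliable tool for fair and reproducible comparison of future LLMs in ophthalmology and promotes transparent evaluation.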

Abstract: Current benchmarks evaluating large language models (LLMs) in ophthalmology are limited in scope and disproportionately prioritise accuracy. We introduce BELO (BEnchmarking LLMs for Ophthalmology), a standardized and comprehensive evaluation benchmark developed through multiple rounds of expert checking by 13 ophthalmologists. BELO assesses ophthalmology-related clinical accuracy and reasoning quality. Using keyword matching and a fine-tuned PubMedBERT model, we curated ophthalmology-specific multiple-choice-questions (MCQs) from diverse medical datasets (BCSC, MedMCQA, MedQA, BioASQ, and PubMedQA). The dataset underwent multiple rounds of expert checking. Duplicate and substandard questions were systematically removed. Ten ophthalmologists refined the explanations of each MCQ’s correct answer. This was further adjudicated by three senior ophthalmologists. To illustrate BELO’s utility, we evaluated six LLMs (OpenAI o1, o3-mini, GPT-4o, DeepSeek-R1, Llama-3-8B, and Gemini 1.5 Pro) using accuracy, macro-F1, and five text-generation metrics (ROUGE-L, BERTScore, BARTScore, METEOR, and AlignScore). In a further evaluation involving human experts, two ophthalmologists qualitatively reviewed 50 randomly selected outputs for accuracy, comprehensiveness, and completeness. BELO consists of 900 high-quality, expert-reviewed questions aggregated from five sources: BCSC (260), BioASQ (10), MedMCQA (572), MedQA (40), and PubMedQA (18). A public leaderboard has been established to promote transparent evaluation and reporting. Importantly, the BELO dataset will remain a hold-out, evaluation-only benchmark to ensure fair and reproducible comparisons of future models.

[43] Understanding Large Language Models’ Ability on Interdisciplinary Research

Yuanhao Shen,Daniel Xavier de Sousa,Ricardo Marçal,Ali Asad,Hongyu Guo,Xiaodan Zhu

Main category: cs.CL

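TL;DR: The paper introduces IDRBench, the first benchmark dedicated to evaluating how well large language models generate valuable research ideas in interdisciplinary research, and finds that current models still struggle to produce high-quality interdisciplinary ideas.

Details Motivation: Large language models show promise in scientific discovery, but the lack of a dedicated benchmark for their interdisciplinary research ability limits a full understanding of their strengths and limitations.

Contribution: Develops IDRBench, which includes an expert-annotated dataset and a suite of tasks for systematically evaluating large language models in interdisciplinary research.

Method: IDRBench builds its dataset from ArXiv papers annotated along clearly defined dimensions, designs progressive tasks (IDR Paper Identification, IDR Idea Integration, and IDR Idea Recommendation), and evaluates 10 models.

Result: Experiments show that although the models exhibit some awareness of interdisciplinary research, they still struggle to produce high-quality interdisciplinary research ideas.

Insight: The findings point toward developing next-generation large language models that excel at interdisciplinary research and may spark new research directions.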

Abstract: Recent advancements in Large Language Models (LLMs) have revealed their impressive ability to perform multi-step, logic-driven reasoning across complex domains, positioning them as powerful tools and collaborators in scientific discovery while challenging the long-held view that inspiration-driven ideation is uniquely human. However, the lack of a dedicated benchmark that evaluates LLMs’ ability to develop ideas in Interdisciplinary Research (IDR) settings poses a critical barrier to fully understanding their strengths and limitations. To address this gap, we introduce IDRBench – a pioneering benchmark featuring an expert annotated dataset and a suite of tasks tailored to evaluate LLMs’ capabilities in proposing valuable research ideas from different scientific domains for interdisciplinary research. This benchmark aims to provide a systematic framework for assessing LLM performance in complex, cross-domain scientific research. Our dataset consists of scientific publications sourced from the ArXiv platform covering six distinct disciplines, and is annotated by domain experts with diverse academic backgrounds. To ensure high-quality annotations, we emphasize clearly defined dimensions that characterize authentic interdisciplinary research. The design of evaluation tasks in IDRBench follows a progressive, real-world perspective, reflecting the natural stages of interdisciplinary research development, including 1) IDR Paper Identification, 2) IDR Idea Integration, and 3) IDR Idea Recommendation. Using IDRBench, we construct baselines across 10 LLMs and observe that despite fostering some level of IDR awareness, LLMs still struggle to produce quality IDR ideas. These findings could not only spark new research directions, but also help to develop next-generation LLMs that excel in interdisciplinary research.

[44] Interaction as Intelligence: Deep Research With Human-AI Partnership

Lyumanshan Ye,Xiaojie Cai,Xinkai Wang,Junfei Wang,Xiangkun Hu,Jiadi Su,Yang Nan,Sihan Wang,Bohan Zhang,Xiaoze Fan,Jinbin Luo,Yuxiang Zheng,Tianze Xu,Dayuan Fu,Yunze Wu,Pengrui Lu,Zengzhi Wang,Yiwei Qin,Zhen Huang,Yan Ma,Zhulin Hu,Haoyang Zou,Tiantian Mi,Yixin Ye,Ethan Chern,Pengfei Liu

Main category: cs.CL

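TL;DR: This paper treats interaction as a core dimension of intelligence and presents Deep Cognition, a system for human-AI partnership that improves the effectiveness and transparency of research tasks.

Details Motivation: Current deep research systems follow an "input-wait-output" paradigm, which leads to error cascades and inflexibility. The authors argue that interaction itself is a fundamental component of intelligence and that the human-AI relationship needs to be reconceived.

Contribution: 1. Proposes the idea of "interaction as intelligence"; 2. designs the Deep Cognition system, which enables transparent and controllable interaction; 3. outperforms the strongest baseline on six key metrics.

Method: 1. Transparent, controllable, and interruptible interaction; 2. fine-grained bidirectional dialogue; 3. shared cognitive context.

Result: The system improves on six metrics, including transparency and fine-grained interaction, by 8.8% to 29.2%, and improves performance on challenging research problems by 31.8 to 50.0 percentage points.

Insight: Interaction is not merely a tool but a core part of intelligence. Human-AI collaboration can substantially improve the flexibility and effectiveness of research tasks.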

Abstract: This paper introduces the “Interaction as Intelligence” research series, presenting a reconceptualization of human-AI relationships in deep research tasks. Traditional approaches treat interaction merely as an interface for accessing AI capabilities, a conduit between human intent and machine output. We propose that interaction itself constitutes a fundamental dimension of intelligence. As AI systems engage in extended thinking processes for research tasks, meaningful interaction transitions from an optional enhancement to an essential component of effective intelligence. Current deep research systems adopt an “input-wait-output” paradigm where users initiate queries and receive results after black-box processing. This approach leads to error cascade effects, inflexible research boundaries that prevent question refinement during investigation, and missed opportunities for expertise integration. To address these limitations, we introduce Deep Cognition, a system that transforms the human role from giving instructions to cognitive oversight, a mode of engagement where humans guide AI thinking processes through strategic intervention at critical junctures. Deep Cognition implements three key innovations: (1) transparent, controllable, and interruptible interaction that reveals AI reasoning and enables intervention at any point; (2) fine-grained bidirectional dialogue; and (3) shared cognitive context where the system observes and adapts to user behaviors without explicit instruction. User evaluation demonstrates that this cognitive oversight paradigm outperforms the strongest baseline across six key metrics: Transparency (+20.0%), Fine-Grained Interaction (+29.2%), Real-Time Intervention (+18.5%), Ease of Collaboration (+27.7%), Results-Worth-Effort (+8.8%), and Interruptibility (+20.7%). Evaluations on challenging research problems show improvements of 31.8 to 50.0 percentage points over deep research systems.

[45] Supernova: Achieving More with Less in Transformer Architectures

Andrei-Valentin Tanase,Elena Pelican

Main category: cs.CL

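TL;DR: Supernova shows how careful architectural design and a new tokenizer can approach the performance of larger models while using fewer parameters and less compute.

Details Motivation: Large Transformer models usually rely on scaling to improve performance, which brings high computational cost. This paper explores how efficient architecture design and high-quality tokenization can deliver performance more efficiently.

Contribution: (1) Designs Supernova, a decoder-only Transformer with only 650M parameters; (2) introduces a custom byte-level BPE tokenizer with a 128,000-token vocabulary; (3) shows that architectural efficiency and tokenization quality can compensate for a reduced parameter count.

Method: Combines several efficient techniques: RoPE (Rotary Positional Embeddings), GQA (Grouped Query Attention with a 3:1 compression ratio), RMSNorm (a computationally efficient normalization), and SwiGLU activation functions.

Result: Supernova requires only 100B training tokens, an order of magnitude fewer than competing models, while reaching 90% of the performance of 1B-parameter models with 53% fewer parameters.

Insight: The paper challenges the prevailing scaling paradigm, showing that efficient architecture design and tokenization quality can reduce parameter requirements substantially while largely preserving performance.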

Abstract: We present Supernova, a 650M-parameter decoder-only transformer that demonstrates how careful architectural design and tokenization innovation can achieve the performance of larger models while maintaining computational efficiency. Our architecture combines Rotary Positional Embeddings (RoPE), Grouped Query Attention (GQA) with a 3:1 compression ratio, RMSNorm for computational efficiency, and SwiGLU activation functions. A critical innovation is our custom 128,000-vocabulary byte-level BPE tokenizer, which achieves state-of-the-art compression performance. Through detailed analysis, we show that Supernova achieves 90% of the performance of 1B-parameter models while using 53% fewer parameters and requiring only 100B training tokens, an order of magnitude less than competing models. Our findings challenge the prevailing scaling paradigm, demonstrating that architectural efficiency and tokenization quality can compensate for reduced parameter counts.
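
Three of the building blocks named above have compact standard definitions, sketched here on plain Python lists. The linear projections around SwiGLU are omitted, and reading "3:1 compression ratio" as three query heads sharing one key/value head is an assumption, as is the head count in the usage note.

```python
import math

def rms_norm(x, gain, eps=1e-6):
    """RMSNorm: scale by the root mean square, with no mean subtraction."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for v, g in zip(x, gain)]

def silu(v):
    """SiLU (swish) activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))

def swiglu_gate(gate, value):
    """Element-wise SwiGLU gating, SiLU(gate) * value (projections omitted)."""
    return [silu(g) * v for g, v in zip(gate, value)]

def kv_head_for(query_head, group_size=3):
    """Grouped Query Attention: index of the KV head shared by this query head."""
    return query_head // group_size
```

With a hypothetical 12 query heads and a group size of 3, only 4 key/value heads are stored, which is where the memory and compute saving of GQA comes from.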

[46] Stabilizing Knowledge, Promoting Reasoning: Dual-Token Constraints for RLVR

Jiakang Wang,Runze Liu,Fuzheng Zhang,Xiu Li,Guorui Zhou

Main category: cs.CL

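TL;DR: The paper proposes Archer, an RLVR method that uses dual-token constraints with synchronous updates to treat knowledge tokens and reasoning tokens differently, preserving factual knowledge while promoting reasoning. It significantly outperforms previous RLVR methods.

Details Motivation: Existing RLVR methods apply a uniform training signal to all tokens and do not distinguish the roles of knowledge tokens (low entropy) and reasoning tokens (high entropy). Recent attempts to separate them through gradient masking or asynchronous updates may break semantic dependencies or hinder learning.

Contribution: Proposes Archer, an entropy-aware RLVR method that applies dual-token constraints with synchronous updates, giving knowledge tokens and reasoning tokens different training signals.

Method: Archer applies weaker KL regularization and higher clipping thresholds to reasoning tokens to encourage exploration, and stronger constraints to knowledge tokens to keep factual knowledge stable.

Result: On mathematical reasoning and code generation benchmarks, Archer significantly outperforms previous RLVR methods and reaches or exceeds state-of-the-art performance among models of comparable size.

Insight: Distinguishing knowledge tokens from reasoning tokens and constraining each differently is an effective way to keep knowledge stable while improving reasoning.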

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has become an effective post-training method for improving the reasoning abilities of Large Language Models (LLMs), mainly by shaping higher-order behaviors such as reflection and planning. However, previous RLVR algorithms often apply uniform training signals to all tokens, without considering the different roles of low-entropy knowledge-related tokens and high-entropy reasoning-related tokens. Some recent methods try to separate these token types by gradient masking or asynchronous updates, but these approaches may break semantic dependencies in the model output and hinder effective learning. In this work, we propose Archer, an entropy-aware RLVR approach with dual-token constraints and synchronous updates. Specifically, our method applies weaker KL regularization and higher clipping thresholds to reasoning tokens to encourage exploration, while using stronger constraints on knowledge tokens to maintain factual knowledge. Experimental results on several mathematical reasoning and code generation benchmarks show that our approach significantly outperforms previous RLVR methods, reaching or exceeding state-of-the-art performance among models of comparable size. The code is available at https://github.com/wizard-III/ArcherCodeR.
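
The idea of token-dependent constraints can be sketched with a PPO-style clipped objective whose clip range and KL weight depend on each token's entropy. Every threshold and coefficient below is a made-up placeholder, and the objective is a generic illustration rather than the authors' exact loss.

```python
def clipped_objective(ratio, advantage, eps):
    """Standard PPO clipped surrogate for one token."""
    clipped_ratio = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped_ratio * advantage)

def dual_token_objective(ratios, advantages, entropies, kls,
                         entropy_threshold=1.0,
                         eps_knowledge=0.1, eps_reasoning=0.3,
                         kl_knowledge=0.01, kl_reasoning=0.0):
    """Mean per-token objective with entropy-dependent constraints (hypothetical values).

    High-entropy tokens are treated as reasoning tokens: wider clip range, weaker KL.
    Low-entropy tokens are treated as knowledge tokens: tighter clip range, stronger KL.
    All tokens are updated in the same step (no masking).
    """
    total = 0.0
    for ratio, adv, entropy, kl in zip(ratios, advantages, entropies, kls):
        is_reasoning = entropy >= entropy_threshold
        eps = eps_reasoning if is_reasoning else eps_knowledge
        kl_weight = kl_reasoning if is_reasoning else kl_knowledge
        total += clipped_objective(ratio, adv, eps) - kl_weight * kl
    return total / len(ratios)
```

For the same probability ratio of 1.5 and positive advantage, a high-entropy token contributes 1.3 under the wider clip range while a low-entropy token is capped at 1.1, so exploration is encouraged only where the model is uncertain.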

[47] The Impact of Language Mixing on Bilingual LLM Reasoning

Yihao Li,Jiayi Xin,Miranda Muqing Miao,Qi Long,Lyle Ungar

Main category: cs.CL

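TL;DR: The study shows that language mixing (for example, alternating Chinese and English) during reasoning in bilingual large language models (LLMs) can improve reasoning, and that forcing monolingual decoding lowers accuracy. Reinforcement learning with verifiable rewards (RLVR) is the training stage that gives rise to the behavior, and a lightweight probe can predict whether a language switch helps or harms reasoning.

Details Motivation: To examine what language mixing actually does in the reasoning of bilingual LLMs and whether it benefits reasoning ability.

Contribution: 1. Finds that language mixing can improve reasoning; 2. identifies RLVR as the critical training stage that leads to language mixing; 3. designs a lightweight probe that predicts the effect of a language switch on reasoning and improves accuracy when used to guide decoding.

Method: Analyzes language switching in Chinese-English bilingual reasoning models, compares unconstrained decoding with enforced monolingual decoding, and trains a probe to predict the utility of a language switch and guide decoding.

Result: Enforcing monolingual decoding reduces accuracy on math reasoning tasks by 5.6 percentage points, while probe-guided switching increases accuracy by up to 6.25 percentage points.

Insight: Language mixing is not merely a byproduct of multilingual training; it may be a strategic reasoning behavior.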

Abstract: Proficient multilingual speakers often intentionally switch languages in the middle of a conversation. Similarly, recent reasoning-focused bilingual large language models (LLMs) with strong capabilities in both languages exhibit language mixing, that is, alternating languages within their chain of thought. Discouraging this behavior in DeepSeek-R1 was found to degrade accuracy, suggesting that language mixing may benefit reasoning. In this work, we study language switching in Chinese-English bilingual reasoning models. We identify reinforcement learning with verifiable rewards (RLVR) as the critical training stage that leads to language mixing. We demonstrate that language mixing can enhance reasoning: enforcing monolingual decoding reduces accuracy by 5.6 percentage points on math reasoning tasks. Additionally, a lightweight probe can be trained to predict whether a potential language switch would benefit or harm reasoning, and when used to guide decoding, increases accuracy by up to 6.25 percentage points. Our findings suggest that language mixing is not merely a byproduct of multilingual training, but is a strategic reasoning behavior.

cs.CV [Back]

[48] Comparative Analysis of Algorithms for the Fitting of Tessellations to 3D Image Data

Andreas Alpers,Orkun Furat,Christian Jung,Matthias Neumann,Claudia Redenbach,Aigerim Saken,Volker Schmidt

Main category: cs.CV

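TL;DR: This paper compares several algorithms for fitting tessellation models to 3D image data, including linear and nonlinear programming, stochastic optimization, and gradient descent, with a focus on the trade-off between quality of fit and model complexity.

Details Motivation: Fitting tessellation models to 3D image data of materials such as polycrystals and foams is increasingly in demand, but a systematic comparison of the available algorithms has been lacking.

Contribution: Systematically compares optimization algorithms for fitting Voronoi, Laguerre, and GBPD models, and provides guidance for choosing a method based on data characteristics.

Method: Uses linear and nonlinear programming, stochastic optimization via the cross-entropy method, and gradient descent, and compares how well each fits the tessellation models.

Result: The experiments show trade-offs between model complexity, the complexity of the optimization routine, and quality of fit.

Insight: Choosing an algorithm that suits the data characteristics and application needs is key; more complex models call for more efficient optimization methods.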

Abstract: This paper presents a comparative analysis of algorithmic strategies for fitting tessellation models to 3D image data of materials such as polycrystals and foams. In this steadily advancing field, we review and assess optimization-based methods – including linear and nonlinear programming, stochastic optimization via the cross-entropy method, and gradient descent – for generating Voronoi, Laguerre, and generalized balanced power diagrams (GBPDs) that approximate voxel-based grain structures. The quality of fit is evaluated on real-world datasets using discrepancy measures that quantify differences in grain volume, surface area, and topology. Our results highlight trade-offs between model complexity, the complexity of the optimization routines involved, and the quality of approximation, providing guidance for selecting appropriate methods based on data characteristics and application needs.
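
To make the model families concrete, the sketch below assigns a voxel to a cell of a Laguerre (power) diagram, where each generator has a seed point and a weight and the voxel goes to the generator with the smallest power distance. A Voronoi tessellation is the special case with equal weights. The seed coordinates and weights in the usage note are invented for illustration, and GBPDs, which additionally use a per-cell anisotropy matrix, are not covered here.

```python
def laguerre_label(point, seeds, weights):
    """Index of the Laguerre cell containing `point`.

    The power distance to generator i is ||point - seed_i||^2 - weight_i.
    With all weights equal, this reduces to a Voronoi assignment.
    """
    def power_distance(i):
        squared = sum((p - s) ** 2 for p, s in zip(point, seeds[i]))
        return squared - weights[i]
    return min(range(len(seeds)), key=power_distance)

def mismatch_fraction(labels_a, labels_b):
    """Fraction of voxels whose cell labels differ (a simple discrepancy measure)."""
    differing = sum(a != b for a, b in zip(labels_a, labels_b))
    return differing / len(labels_a)
```

With seeds at (0, 0, 0) and (4, 0, 0), the voxel at (1.5, 0, 0) belongs to the first cell under equal weights but moves to the second cell once that generator's weight is raised to 8, which is how the extra parameters let a Laguerre model fit grains of different sizes.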

[49] Semantic Segmentation based Scene Understanding in Autonomous Vehicles

Ehsan Rassekh

Main category: cs.CV

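TL;DR: The paper studies semantic-segmentation-based scene understanding for autonomous driving, proposes several efficient models, and evaluates them on the BDD100k dataset. The results show that the choice of backbone has a significant effect on model performance.

Details Motivation: Autonomous driving requires an accurate understanding of the environment, and semantic segmentation is a key technique for achieving it. The study aims to improve the accuracy and efficiency of scene understanding through better models and backbones.

Contribution: 1. Proposes several efficient semantic segmentation models; 2. Studies how different backbones affect model performance; 3. Validates the models on the BDD100k dataset.

Method: Uses deep learning to design several semantic segmentation models and compares the performance of different backbones (e.g., ResNet, VGG) experimentally.

Result: The experiments show that choosing an appropriate backbone noticeably improves accuracy, mean IoU, and loss.

Insight: The choice of backbone is critical to the performance of a semantic segmentation model; a well-chosen design can substantially improve scene understanding in autonomous driving.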

Abstract: In recent years, artificial intelligence (AI) has become a prominent keyword because of its promise in solving complex tasks. Human expertise in specific areas may no longer be required, because machines have achieved successful results using AI and can make the right decisions in critical situations. This is made possible by deep learning (DL), one of the most popular AI technologies. One important and effective application of DL is the development of self-driving cars. In this work, we propose several efficient models to investigate scene understanding through semantic segmentation. We use the BDD100k dataset to investigate these models. Another contribution of this work is the use of several backbones as encoders for the models. The results show that choosing an appropriate backbone has a large effect on the performance of a semantic segmentation model. Better semantic segmentation performance allows a better understanding of the scene and the environment around the agent. Finally, we analyze and evaluate the proposed models in terms of accuracy, mean IoU, and loss function, and the results show that these metrics are improved.
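
Since the abstract reports accuracy and mean IoU, a small sketch of how these two metrics are commonly computed may help. It assumes flattened per-pixel class labels and skips classes that appear in neither prediction nor ground truth; the paper's exact evaluation protocol is not stated in the abstract, so treat this as one common convention.

```python
# Sketch only: pixel accuracy and mean IoU over flattened label lists.

def pixel_accuracy(pred, target):
    """Fraction of pixels whose predicted class matches the ground truth."""
    correct = sum(1 for p, t in zip(pred, target) if p == t)
    return correct / len(target)


def mean_iou(pred, target, num_classes):
    """Average of per-class intersection-over-union.

    Classes absent from both pred and target are ignored (an assumption;
    some benchmarks handle this case differently).
    """
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
print(pixel_accuracy(pred, target), mean_iou(pred, target, num_classes=3))
```

In this toy case the per-class IoUs are 1/3, 2/3, and 1/2, so the mean IoU is 0.5 while pixel accuracy is 4/6. The gap shows why the two metrics are usually reported together.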

[50] CLIPTTA: Robust Contrastive Vision-Language Test-Time Adaptation

Marc Lafon,Gustavo Adolfo Vargas Hakim,Clément Rambour,Christian Desrosier,Nicolas Thome

Main category: cs.CV

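TL;DR: CLIPTTA is a gradient-based test-time adaptation (TTA) method for vision-language models. It uses a contrastive loss aligned with CLIP's pre-training objective and substantially improves performance under distribution shift.

Details Motivation: Existing TTA methods based on entropy minimization are misaligned with the contrastive training objective of vision-language models, which limits adaptation performance and causes failure modes such as pseudo-label drift and class collapse.

Contribution: 1. Proposes CLIPTTA, a gradient-based TTA method aligned with CLIP's pre-training objective; 2. Provides a theoretical analysis of its gradients, showing how the design mitigates the risk of collapse; 3. Extends the method to the open-set setting with an Outlier Contrastive Exposure (OCE) loss that improves OOD detection.

Method: CLIPTTA uses a soft contrastive loss as its objective, with a batch-aware design that mitigates collapse. An OCE loss handles OOD samples in the open-set setting.

Result: Across 75 datasets, CLIPTTA consistently outperforms entropy-based objectives and is highly competitive with state-of-the-art TTA methods, outperforming them on many datasets and remaining more stable across diverse shifts.

Insight: A contrastive loss is better suited to test-time adaptation of vision-language models, a batch-aware design helps avoid collapse, and OOD detection is a key challenge in the open-set setting.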

Abstract: Vision-language models (VLMs) like CLIP exhibit strong zero-shot capabilities but often fail to generalize under distribution shifts. Test-time adaptation (TTA) allows models to update at inference time without labeled data, typically via entropy minimization. However, this objective is fundamentally misaligned with the contrastive image-text training of VLMs, limiting adaptation performance and introducing failure modes such as pseudo-label drift and class collapse. We propose CLIPTTA, a new gradient-based TTA method for vision-language models that leverages a soft contrastive loss aligned with CLIP’s pre-training objective. We provide a theoretical analysis of CLIPTTA’s gradients, showing how its batch-aware design mitigates the risk of collapse. We further extend CLIPTTA to the open-set setting, where both in-distribution (ID) and out-of-distribution (OOD) samples are encountered, using an Outlier Contrastive Exposure (OCE) loss to improve OOD detection. Evaluated on 75 datasets spanning diverse distribution shifts, CLIPTTA consistently outperforms entropy-based objectives and is highly competitive with state-of-the-art TTA methods, outperforming them on a large number of datasets and exhibiting more stable performance across diverse shifts.
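
For context on the baseline objective that the abstract criticizes, here is a minimal sketch of the entropy-minimization loss computed from per-class logits. It does not implement CLIPTTA's soft contrastive loss, whose exact form is not given in the abstract, and the logit values are made up for illustration.

```python
import math

# Sketch only: the entropy-minimization objective used by many TTA baselines.


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def batch_entropy_loss(batch_logits):
    """Mean prediction entropy over a batch; TTA baselines minimize this."""
    return sum(entropy(softmax(l)) for l in batch_logits) / len(batch_logits)


# A batch with uncertain predictions has high entropy ...
uncertain = [[1.0, 1.0, 1.0], [0.5, 0.5, 0.5]]
# ... while a batch that assigns every sample to class 0 has almost none.
collapsed = [[10.0, 0.0, 0.0], [10.0, 0.0, 0.0]]
print(batch_entropy_loss(uncertain), batch_entropy_loss(collapsed))
```

The collapsed batch scores nearly zero loss regardless of whether class 0 is correct, because each sample is scored independently of the others. This illustrates the class-collapse failure mode named in the abstract.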

[51] Hallucination Score: Towards Mitigating Hallucinations in Generative Image Super-Resolution

Weiming Ren,Raghav Goyal,Zhiming Hu,Tristan Ty Aumentado-Armstrong,Iqbal Mohomed,Alex Levinshtein

Main category: cs.CV

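TL;DR: The paper proposes a "Hallucination Score" (HS) that uses a multimodal large language model (MLLM) to detect and quantify hallucinations in generative super-resolution (GSR), and uses deep feature distances as reward functions to mitigate them.

Details Motivation: GSR models excel at perceptual image quality but produce hallucinations, that is, generated details that do not match the low-resolution or ground-truth image. This limits practical deployment, and effective ways to measure and mitigate the problem are still lacking.

Contribution: 1. Proposes the Hallucination Score (HS) to measure hallucinations in GSR models; 2. Shows that certain deep feature distances correlate strongly with HS; 3. Uses these deep feature distances as differentiable reward functions to mitigate hallucinations.

Method: 1. Constructs an MLLM prompt that assesses hallucinated visual elements and produces the HS; 2. Analyzes the relationship between deep feature distances and HS; 3. Aligns GSR models with reward functions based on deep feature distances.

Result: HS agrees closely with human evaluations and complements existing super-resolution metrics; the experiments indicate that deep feature distances can mitigate hallucinations.

Insight: Hallucination is a key weakness of GSR models; combining MLLMs with deep feature distances offers a new approach to evaluation and optimization.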

Abstract: Generative super-resolution (GSR) currently sets the state-of-the-art in terms of perceptual image quality, overcoming the “regression-to-the-mean” blur of prior non-generative models. However, from a human perspective, such models do not fully conform to the optimal balance between quality and fidelity. Instead, a different class of artifacts, in which generated details fail to perceptually match the low resolution image (LRI) or ground-truth image (GTI), is a critical but understudied issue in GSR, limiting its practical deployment. In this work, we focus on measuring, analyzing, and mitigating these artifacts (i.e., “hallucinations”). We observe that hallucinations are not well-characterized with existing image metrics or quality models, as they are orthogonal to both exact fidelity and no-reference quality. Instead, we take advantage of a multimodal large language model (MLLM) by constructing a prompt that assesses hallucinatory visual elements and generates a “Hallucination Score” (HS). We find that our HS is closely aligned with human evaluations, and also provides complementary insights to prior image metrics used for super-resolution (SR) models. In addition, we find certain deep feature distances have strong correlations with HS. We therefore propose to align the GSR models by using such features as differentiable reward functions to mitigate hallucinations.

[52] DUSTrack: Semi-automated point tracking in ultrasound videos

Praneeth Namburi,Roger Pallarès-López,Jessica Rosendorf,Duarte Folgado,Brian W. Anthony

Main category: cs.CV

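TL;DR: DUSTrack is a semi-automated point-tracking toolkit that combines deep learning with optical flow to track tissue motion in B-mode ultrasound videos accurately and robustly.

Details Motivation: Tracking tissue motion in ultrasound video is difficult because of speckle noise, low edge contrast, and out-of-plane movement; a general and accurate method is needed to quantify tissue dynamics.

Contribution: Proposes DUSTrack, which combines deep learning with optical flow to provide high-quality point tracking across diverse anatomical structures and motion patterns. The toolkit also includes a graphical user interface and an optical-flow-based filtering technique.

Method: Combines deep learning with optical flow, uses a graphical user interface to generate high-quality training data, and applies optical-flow-based filtering to reduce frame-to-frame noise.

Result: DUSTrack is more accurate than zero-shot point trackers and performs on par with specialized methods; it is applicable to motion analysis of tissues such as the heart and muscle.

Insight: DUSTrack offers a general and powerful open-source framework for quantifying tissue motion in clinical and biomechanical research.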

Abstract: Ultrasound technology enables safe, non-invasive imaging of dynamic tissue behavior, making it a valuable tool in medicine, biomechanics, and sports science. However, accurately tracking tissue motion in B-mode ultrasound remains challenging due to speckle noise, low edge contrast, and out-of-plane movement. These challenges complicate the task of tracking anatomical landmarks over time, which is essential for quantifying tissue dynamics in many clinical and research applications. This manuscript introduces DUSTrack (Deep learning and optical flow-based toolkit for UltraSound Tracking), a semi-automated framework for tracking arbitrary points in B-mode ultrasound videos. We combine deep learning with optical flow to deliver high-quality and robust tracking across diverse anatomical structures and motion patterns. The toolkit includes a graphical user interface that streamlines the generation of high-quality training data and supports iterative model refinement. It also implements a novel optical-flow-based filtering technique that reduces high-frequency frame-to-frame noise while preserving rapid tissue motion. DUSTrack demonstrates superior accuracy compared to contemporary zero-shot point trackers and performs on par with specialized methods, establishing its potential as a general and foundational tool for clinical and biomechanical research. We demonstrate DUSTrack’s versatility through three use cases: cardiac wall motion tracking in echocardiograms, muscle deformation analysis during reaching tasks, and fascicle tracking during ankle plantarflexion. As an open-source solution, DUSTrack offers a powerful, flexible framework for point tracking to quantify tissue motion from ultrasound videos. DUSTrack is available at https://github.com/praneethnamburi/DUSTrack.

[53] CRAFT: A Neuro-Symbolic Framework for Visual Functional Affordance Grounding

Zhou Chen,Joe Lin,Sathyanarayanan N. Aakur

Main category: cs.CV

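TL;DR: CRAFT is a neuro-symbolic framework for interpretable functional affordance grounding. It combines commonsense priors with visual evidence and uses an energy-based reasoning loop to improve accuracy and interpretability.

Details Motivation: Current affordance grounding methods lack transparency and interpretability. CRAFT aims to fill this gap and provide more robust and trustworthy scene understanding.

Contribution: Proposes the CRAFT framework, which integrates symbolic commonsense priors (from ConceptNet and language models) with visual evidence (from CLIP) and reaches transparent decisions through an energy-based reasoning loop.

Method: Combines ConceptNet and CLIP in an energy-based reasoning loop that iteratively refines affordance predictions, grounding symbolic and perceptual structures.

Result: In multi-object, label-free settings, CRAFT improves both accuracy and interpretability, providing a more reliable approach to scene understanding.

Insight: Neuro-symbolic approaches can improve both transparency and task performance, offering a new direction for trustworthy AI.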

Abstract: We introduce CRAFT, a neuro-symbolic framework for interpretable affordance grounding, which identifies the objects in a scene that enable a given action (e.g., “cut”). CRAFT integrates structured commonsense priors from ConceptNet and language models with visual evidence from CLIP, using an energy-based reasoning loop to refine predictions iteratively. This process yields transparent, goal-driven decisions to ground symbolic and perceptual structures. Experiments in multi-object, label-free settings demonstrate that CRAFT enhances accuracy while improving interpretability, providing a step toward robust and trustworthy scene understanding.

[54] Adaptive 3D Gaussian Splatting Video Streaming

Han Gong,Qiyue Li,Zhi Liu,Hao Zhou,Peng Yuan Zhou,Zhu Li,Jie Li

Main category: cs.CV

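TL;DR: This paper proposes a 3DGS video streaming framework based on the Gaussian deformation field. Hybrid saliency tiling and differentiated quality modeling provide efficient compression and bandwidth adaptation, and experiments show that it outperforms existing methods.

Details Motivation: 3D Gaussian splatting (3DGS) video has a very large data volume and is complex to compress and transmit, so conventional methods struggle to stream it. This paper aims to provide an efficient compression and transmission solution.

Contribution: 1. Designs a 3DGS video construction method based on the Gaussian deformation field; 2. Proposes hybrid saliency tiling and differentiated quality modeling; 3. Builds a complete 3DGS video streaming system and validates its performance.

Method: Constructs video with the Gaussian deformation field and compresses it with hybrid saliency tiling and differentiated quality modeling, adapting to bandwidth fluctuations while maintaining transmission quality.

Result: Experiments show that the method outperforms existing techniques in video quality, compression effectiveness, and transmission rate.

Insight: 3DGS video streaming needs to combine data compression with bandwidth adaptation; hybrid saliency tiling and differentiated quality modeling are key techniques for efficient transmission.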

Abstract: The advent of 3D Gaussian splatting (3DGS) has significantly enhanced the quality of volumetric video representation. Meanwhile, in contrast to conventional volumetric video, 3DGS video poses significant challenges for streaming due to its substantially larger data volume and the heightened complexity involved in compression and transmission. To address these issues, we introduce an innovative framework for 3DGS volumetric video streaming. Specifically, we design a 3DGS video construction method based on the Gaussian deformation field. By employing hybrid saliency tiling and differentiated quality modeling of 3DGS video, we achieve efficient data compression and adaptation to bandwidth fluctuations while ensuring high transmission quality. Then we build a complete 3DGS video streaming system and validate the transmission performance. Through experimental evaluation, our method demonstrated superiority over existing approaches in various aspects, including video quality, compression effectiveness, and transmission rate.

[55] IRGPT: Understanding Real-world Infrared Image with Bi-cross-modal Curriculum on Large-scale Benchmark

Zhe Cao,Jin Zhang,Ruiheng Zhang

Main category: cs.CV

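TL;DR: The paper proposes IRGPT, the first multimodal large language model for real-world infrared images, and builds the large-scale InfraRed-Text Dataset (IR-TD). A bi-cross-modal curriculum transfer learning strategy transfers knowledge from the visible to the infrared domain, achieving state-of-the-art performance on 9 tasks.

Details Motivation: Existing methods rely on synthetic infrared images generated from visible images, which cannot capture the unique characteristics of the infrared modality, and aligned text data is scarce.

Contribution: 1. Builds IR-TD, the first large-scale dataset of real infrared image-text pairs; 2. Proposes IRGPT, the first multimodal large language model applied to infrared image understanding; 3. Designs a bi-cross-modal curriculum transfer learning strategy.

Method: 1. Builds IR-TD from texts whose initial drafts come from LLM-generated descriptions of visible images and rule-based descriptions of annotations; 2. Designs a bi-cross-modal curriculum transfer learning strategy that uses infrared-visible and infrared-text difficulty scores to guide knowledge transfer.

Result: On a benchmark of 9 tasks, IRGPT achieves state-of-the-art performance, even compared with larger-scale models.

Insight: 1. Pairing real infrared images with aligned text significantly improves model performance; 2. Curriculum transfer learning effectively addresses the challenge of cross-modal knowledge transfer.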

Abstract: Real-world infrared imagery presents unique challenges for vision-language models due to the scarcity of aligned text data and domain-specific characteristics. Although existing methods have advanced the field, their reliance on synthetic infrared images generated through style transfer from visible images limits their ability to capture the unique characteristics of the infrared modality. To address this, we propose IRGPT, the first multi-modal large language model for real-world infrared images, built upon a large-scale InfraRed-Text Dataset (IR-TD) comprising over 260K authentic image-text pairs. The proposed IR-TD dataset contains real infrared images paired with meticulously handcrafted texts, where the initial drafts originated from two complementary processes: (1) LLM-generated descriptions of visible images, and (2) rule-based descriptions of annotations. Furthermore, we introduce a bi-cross-modal curriculum transfer learning strategy that systematically transfers knowledge from visible to infrared domains by considering the difficulty scores of both infrared-visible and infrared-text. Evaluated on a benchmark of 9 tasks (e.g., recognition, grounding), IRGPT achieves state-of-the-art performance even compared with larger-scale models.

[56] GPI-Net: Gestalt-Guided Parallel Interaction Network via Orthogonal Geometric Consistency for Robust Point Cloud Registration

Weikang Gu,Mingyue Han,Li Xue,Heng Dong,Changcai Yang,Riqing Chen,Lifang Wei

Main category: cs.CV

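TL;DR: GPI-Net is a point cloud registration method guided by Gestalt principles. It uses orthogonal geometric consistency and a parallel interaction network to improve the fusion of local and global features.

Details Motivation: Identifying high-quality correspondences is essential to point cloud registration, but fusing local and global features is very challenging because of feature redundancy and complex spatial relationships. Gestalt principles offer advantages in analyzing local and global relationships.

Contribution: 1. Proposes GPI-Net, a Gestalt-guided parallel interaction network; 2. Introduces an orthogonal integration strategy to reduce redundancy; 3. Designs a Gestalt Feature Attention (GFA) block and a Dual-path Multi-Granularity parallel interaction aggregation (DMG) block.

Method: 1. Uses orthogonal geometric consistency to improve feature fusion; 2. The GFA block combines self-attention and cross-attention; 3. The DMG block promotes information exchange across granularities.

Result: Outperforms existing methods on several challenging tasks.

Insight: Gestalt principles can effectively guide feature fusion in point cloud registration; the orthogonal strategy and multi-granularity interaction improve performance.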

Abstract: The accurate identification of high-quality correspondences is a prerequisite task in feature-based point cloud registration. However, it is extremely challenging to handle the fusion of local and global features due to feature redundancy and complex spatial relationships. Given that Gestalt principles provide key advantages in analyzing local and global relationships, we propose a novel Gestalt-guided Parallel Interaction Network via orthogonal geometric consistency (GPI-Net) in this paper. It utilizes Gestalt principles to facilitate complementary communication between local and global information. Specifically, we introduce an orthogonal integration strategy to optimally reduce redundant information and generate a more compact global structure for high-quality correspondences. To capture geometric features in correspondences, we leverage a Gestalt Feature Attention (GFA) block through a hybrid utilization of self-attention and cross-attention mechanisms. Furthermore, to facilitate the integration of local detail information into the global structure, we design an innovative Dual-path Multi-Granularity parallel interaction aggregation (DMG) block to promote information exchange across different granularities. Extensive experiments on various challenging tasks demonstrate the superior performance of our proposed GPI-Net in comparison to existing methods. The code will be released at https://github.com/gwk/GPI-Net.

[57] Adaptive 3D Gaussian Splatting Video Streaming: Visual Saliency-Aware Tiling and Meta-Learning-Based Bitrate Adaptation

Han Gong,Qiyue Li,Jie Li,Zhi Liu

Main category: cs.CV

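TL;DR: For 3D Gaussian splatting (3DGS) video streaming, the paper proposes a visual-saliency-aware tiling method and a meta-learning-based bitrate adaptation algorithm, addressing the key problems of tiling, quality assessment, and bitrate adaptation.

Details Motivation: 3DGS video streaming is still at an early stage. Tiling strategies are immature, quality assessment is incomplete, and bitrate adaptation is inadequate, so solutions are needed to improve the user experience.

Contribution: Proposes a saliency-guided adaptive tiling technique, a quality assessment framework that jointly evaluates 3D and 2D quality, and a meta-learning-based adaptive bitrate algorithm.

Method: 1. Saliency analysis that combines spatial and temporal features to guide tiling;
2. A joint 3D and 2D quality assessment framework;
3. A meta-learning-based adaptive bitrate algorithm.

Result: Experiments show that the proposed methods outperform existing techniques on multiple metrics.

Insight: Saliency analysis and meta-learning enable more efficient tiling and bitrate adaptation for 3D video streaming and suggest directions for future research.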

Abstract: 3D Gaussian splatting video (3DGS) streaming has recently emerged as a research hotspot in both academia and industry, owing to its impressive ability to deliver immersive 3D video experiences. However, research in this area is still in its early stages, and several fundamental challenges, such as tiling, quality assessment, and bitrate adaptation, require further investigation. In this paper, we tackle these challenges by proposing a comprehensive set of solutions. Specifically, we propose an adaptive 3DGS tiling technique guided by saliency analysis, which integrates both spatial and temporal features. Each tile is encoded into versions possessing dedicated deformation fields and multiple quality levels for adaptive selection. We also introduce a novel quality assessment framework for 3DGS video that jointly evaluates spatial-domain degradation in 3DGS representations during streaming and the quality of the resulting 2D rendered images. Additionally, we develop a meta-learning-based adaptive bitrate algorithm specifically tailored for 3DGS video streaming, achieving optimal performance across varying network conditions. Extensive experiments demonstrate that our proposed approaches significantly outperform state-of-the-art methods.

[58] GEMINUS: Dual-aware Global and Scene-Adaptive Mixture-of-Experts for End-to-End Autonomous Driving

Chi Wan,Yixin Cui,Jiatong Du,Shuo Yang,Yulong Bai,Yanjun Huang

Main category: cs.CV

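TL;DR: The paper proposes GEMINUS, a Mixture-of-Experts end-to-end autonomous driving framework. A Global Expert and a Scene-Adaptive Experts Group work together through a Dual-aware Router to make adaptive and robust decisions in complex traffic scenarios.

Details Motivation: Conventional single-mode planning methods struggle to acquire diversified driving skills across diverse traffic scenarios, so a solution is needed that handles complex and diverse environments adaptively and robustly.

Contribution: Introduces an architecture with a Global Expert and a Scene-Adaptive Experts Group, together with a Dual-aware Router that dynamically activates experts, improving end-to-end driving performance across diverse scenarios.

Method: A Mixture-of-Experts architecture consisting of a Global Expert (trained on the overall dataset), Scene-Adaptive Experts (trained on the corresponding scene subsets), and a Dual-aware Router (which uses scenario-level features and routing uncertainty to activate expert modules dynamically).

Result: GEMINUS achieves state-of-the-art Driving Score and Success Rate on the Bench2Drive closed-loop benchmark, even with only monocular vision input. Ablations show gains over the single-expert baseline of 7.67% in Driving Score and 22.06% in Success Rate.

Insight: Through cooperation among experts and dynamic routing, GEMINUS shows the potential of multi-expert architectures in complex driving scenarios and offers a new direction for end-to-end autonomous driving.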

Abstract: End-to-end autonomous driving requires adaptive and robust handling of complex and diverse traffic environments. However, prevalent single-mode planning methods attempt to learn an overall policy while struggling to acquire diversified driving skills to handle diverse scenarios. Therefore, this paper proposes GEMINUS, a Mixture-of-Experts end-to-end autonomous driving framework featuring a Global Expert, a Scene-Adaptive Experts Group, and equipped with a Dual-aware Router. Specifically, the Global Expert is trained on the overall dataset, possessing robust performance. The Scene-Adaptive Experts are trained on corresponding scene subsets, achieving adaptive performance. The Dual-aware Router simultaneously considers scenario-level features and routing uncertainty to dynamically activate expert modules. Through the effective coupling of the Global Expert and the Scene-Adaptive Experts Group via the Dual-aware Router, GEMINUS achieves adaptive and robust performance in diverse scenarios. GEMINUS outperforms existing methods in the Bench2Drive closed-loop benchmark and achieves state-of-the-art performance in Driving Score and Success Rate, even with only monocular vision input. Furthermore, ablation studies demonstrate significant improvements over the original single-expert baseline: 7.67% in Driving Score, 22.06% in Success Rate, and 19.41% in MultiAbility-Mean. The code will be available at https://github.com/newbrains1/GEMINUS.
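
The abstract says the Dual-aware Router considers both scenario-level features and routing uncertainty, but it does not give the routing rule. The sketch below shows one plausible reading under stated assumptions: fall back to the global expert when the normalized entropy of the scene-expert distribution is high, otherwise pick the most likely scene expert. The function `route`, the threshold of 0.8, and the logit values are all hypothetical and are not taken from the paper.

```python
import math

# Sketch only: a hypothetical uncertainty-gated router, NOT the paper's rule.


def route(scene_logits, uncertainty_threshold=0.8):
    """Choose an expert from scene-expert logits (at least two experts).

    Returns "global" when the routing distribution is too uncertain
    (normalized entropy above the threshold), else "scene_<index>".
    """
    m = max(scene_logits)
    exps = [math.exp(z - m) for z in scene_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    ent = -sum(p * math.log(p) for p in probs if p > 0.0)
    normalized = ent / math.log(len(probs))  # 0 = certain, 1 = uniform
    if normalized > uncertainty_threshold:
        return "global"
    return "scene_%d" % probs.index(max(probs))


confident_choice = route([5.0, 0.0, 0.0])    # router is sure about expert 0
uncertain_choice = route([0.1, 0.0, 0.05])   # nearly uniform distribution
print(confident_choice, uncertain_choice)
```

Gating on uncertainty in this way keeps the robust generalist in charge whenever the scene classification is ambiguous, which matches the abstract's description of the Global Expert as providing robust performance.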

[59] VisGuard: Securing Visualization Dissemination through Tamper-Resistant Data Retrieval

Huayuan Ye,Juntong Chen,Shenzhuo Zhang,Yipeng Zhang,Changbo Wang,Chenhui Li

Main category: cs.CV

TL;DR: VisGuard is a tamper-resistant framework for embedding and retrieving metadata in visualization images, addressing the fragility of existing methods to common image tampering.

Details Motivation: Existing methods for embedding metadata in visualization images are often fragile to tampering (e.g., cropping, editing), limiting their practical use in online distribution.

Contribution: Proposes VisGuard, a robust framework for embedding metadata links in visualization images that can withstand substantial tampering while remaining recoverable.

Method: Utilizes techniques like repetitive data tiling, invertible information broadcasting, and an anchor-based scheme for crop localization to enhance robustness.

Result: Demonstrates superior performance in data retrieval accuracy, embedding capacity, and resistance to tampering and steganalysis, enabling applications like interactive chart reconstruction and copyright protection.

Insight: VisGuard bridges the gap between visualization dissemination and secure metadata retrieval, offering practical solutions for tamper-resistant data embedding.

Abstract: The dissemination of visualizations is primarily in the form of raster images, which often results in the loss of critical information such as source code, interactive features, and metadata. While previous methods have proposed embedding metadata into images to facilitate Visualization Image Data Retrieval (VIDR), most existing methods lack practicability since they are fragile to common image tampering during online distribution such as cropping and editing. To address this issue, we propose VisGuard, a tamper-resistant VIDR framework that reliably embeds a metadata link into visualization images. The embedded data link remains recoverable even after substantial tampering with the images. We propose several techniques to enhance robustness, including repetitive data tiling, invertible information broadcasting, and an anchor-based scheme for crop localization. VisGuard enables various applications, including interactive chart reconstruction, tampering detection, and copyright protection. We conduct comprehensive experiments that show VisGuard’s superior performance in data retrieval accuracy, embedding capacity, and security against tampering and steganalysis, demonstrating VisGuard’s competence in facilitating and safeguarding visualization dissemination and information conveyance.

[60] DFQ-ViT: Data-Free Quantization for Vision Transformers without Fine-tuning

Yujia Tong,Jingling Yuan,Tian Zhang,Jianquan Liu,Chuang Hu

Main category: cs.CV

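TL;DR: DFQ-ViT is a data-free quantization method that needs no fine-tuning. It improves the quality of synthetic samples and calibrates intermediate-layer activation distributions, substantially improving the quantization performance of ViTs.

Details Motivation: Existing data-free quantization methods perform poorly on ViTs, mainly because synthetic sample quality is limited and the intermediate-layer activation distributions of the quantized and full-precision models differ, which degrades the quantized model.

Contribution: 1. Proposes the DFQ-ViT pipeline, which improves data quality by synthesizing samples in order of increasing difficulty; 2. Introduces an activation correction matrix to align intermediate-layer activation distributions; 3. Achieves strong quantization performance without fine-tuning.

Method: 1. Synthesizes samples in order of increasing difficulty to improve data quality; 2. Applies an activation correction matrix during calibration and inference to align the intermediate-layer activations of the quantized model with those of the full-precision model.

Result: DFQ-ViT outperforms existing DFQ methods and is on par with models quantized using real data; for example, DeiT-T with 3-bit weight quantization performs 4.29% better than the previous state of the art.

Insight: Better synthetic data generation and activation alignment enable efficient quantization without fine-tuning, which suits deployment on resource-constrained edge devices.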

Abstract: Data-Free Quantization (DFQ) enables the quantization of Vision Transformers (ViTs) without requiring access to data, allowing for the deployment of ViTs on devices with limited resources. In DFQ, the quantization model must be calibrated using synthetic samples, making the quality of these synthetic samples crucial. Existing methods fail to fully capture and balance the global and local features within the samples, resulting in limited synthetic data quality. Moreover, we have found that during inference, there is a significant difference in the distributions of intermediate layer activations between the quantized and full-precision models. These issues lead to a severe performance degradation of the quantized model. To address these problems, we propose a pipeline for Data-Free Quantization for Vision Transformers (DFQ-ViT). Specifically, we synthesize samples in order of increasing difficulty, effectively enhancing the quality of synthetic data. During the calibration and inference stage, we introduce the activation correction matrix for the quantized model to align the intermediate layer activations with those of the full-precision model. Extensive experiments demonstrate that DFQ-ViT achieves remarkable superiority over existing DFQ methods and its performance is on par with models quantized through real data. For example, the performance of DeiT-T with 3-bit weights quantization is 4.29% higher than the state-of-the-art. Our method eliminates the need for fine-tuning, which not only reduces computational overhead but also lowers the deployment barriers for edge devices. This characteristic aligns with the principles of Green Learning by improving energy efficiency and facilitating real-world applications in resource-constrained environments.
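
To illustrate what 3-bit weight quantization means in practice, the following sketch applies generic symmetric uniform quantization to a small list of weights. The abstract does not specify the quantizer used by DFQ-ViT, so this scheme, the function names, and the weight values are assumptions chosen for illustration.

```python
# Sketch only: generic symmetric uniform quantization, not DFQ-ViT's quantizer.


def quantize_symmetric(weights, bits=3):
    """Map floats to signed integer codes in [-2**(bits-1), 2**(bits-1) - 1].

    The scale is chosen so the largest-magnitude weight maps to the largest
    positive code. Returns (codes, scale).
    """
    qmax = 2 ** (bits - 1) - 1          # 3 for 3-bit quantization
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0.0:
        return [0] * len(weights), 1.0
    scale = max_abs / qmax
    codes = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate float weights from integer codes."""
    return [c * scale for c in codes]


weights = [0.9, -0.3, 0.05, -0.6]
codes, scale = quantize_symmetric(weights, bits=3)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, scale, max_error)
```

With only eight representable levels, small weights such as 0.05 round to zero. Errors of this kind shift the intermediate activations, which is the mismatch the abstract's activation correction matrix is meant to address.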

[61] Efficient Whole Slide Pathology VQA via Token Compression

Weimin Lyu,Qingqiao Hu,Kehan Qi,Zhan Shi,Wentao Huang,Saumya Gupta,Chao Chen

Main category: cs.CV

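TL;DR: TCP-LLaVA is a visual question answering (VQA) model for whole-slide images (WSIs) in pathology. It introduces trainable compression tokens that greatly reduce computational requirements while improving performance.

Details Motivation: The enormous size of WSIs (up to 10,000 x 10,000 pixels) creates long-context and high-compute challenges for multimodal large language models (MLLMs). Existing methods either lack generative capabilities or consume excessive resources, so an efficient solution is needed.

Contribution: Proposes TCP-LLaVA, the first MLLM architecture to perform WSI VQA via token compression, significantly reducing input length and computational cost.

Method: Introduces trainable compression tokens that aggregate visual and textual information through a modality compression module (inspired by the [CLS] token mechanism in BERT); only the compressed tokens are passed to the language model for answer generation.

Result: In experiments on ten TCGA tumor subtypes, TCP-LLaVA outperforms existing MLLM baselines in VQA accuracy while substantially reducing training resource consumption.

Insight: Token compression makes it possible to process very large images efficiently and offers a template for MLLM applications on other high-resolution imagery.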

Abstract: Whole-slide images (WSIs) in pathology can reach up to 10,000 x 10,000 pixels, posing significant challenges for multimodal large language models (MLLMs) due to long context length and high computational demands. Previous methods typically focus on patch-level analysis or slide-level classification using CLIP-based models with multi-instance learning, but they lack the generative capabilities needed for visual question answering (VQA). More recent MLLM-based approaches address VQA by feeding thousands of patch tokens directly into the language model, which leads to excessive resource consumption. To address these limitations, we propose Token Compression Pathology LLaVA (TCP-LLaVA), the first MLLM architecture to perform WSI VQA via token compression. TCP-LLaVA introduces a set of trainable compression tokens that aggregate visual and textual information through a modality compression module, inspired by the [CLS] token mechanism in BERT. Only the compressed tokens are forwarded to the LLM for answer generation, significantly reducing input length and computational cost. Experiments on ten TCGA tumor subtypes show that TCP-LLaVA outperforms existing MLLM baselines in VQA accuracy while reducing training resource consumption by a substantial margin.
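
A quick back-of-the-envelope calculation shows why a slide of this size yields "thousands of patch tokens". The patch size of 224 pixels, the one-token-per-patch assumption, and the budget of 64 compression tokens are hypothetical values chosen for illustration; the abstract does not state the paper's actual settings.

```python
import math

# Sketch only: patch size and compression-token count are assumed values.


def patch_count(width, height, patch):
    """Number of non-overlapping patches needed to tile a width x height image."""
    return math.ceil(width / patch) * math.ceil(height / patch)


PATCH = 224                # hypothetical patch side length in pixels
COMPRESSION_TOKENS = 64    # hypothetical number of trainable compression tokens

tokens_before = patch_count(10_000, 10_000, PATCH)   # one token per patch
reduction = tokens_before / COMPRESSION_TOKENS
print(tokens_before, round(reduction, 1))
```

Because self-attention cost grows roughly quadratically with sequence length, shrinking the visual input from a couple of thousand tokens to a few dozen saves far more compute than the linear ratio alone suggests.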

[62] Motion Segmentation and Egomotion Estimation from Event-Based Normal Flow

Zhiyuan Hua,Dehao Yuan,Cornelia Fermüller

Main category: cs.CV

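TL;DR: The paper proposes a framework for motion segmentation and egomotion estimation based on event-based normal flow, designed for neuromorphic vision sensors. It uses geometric constraints and iterative optimization to achieve accurate segmentation and motion estimation.

Details Motivation: Conventional motion segmentation and egomotion estimation methods rely on optical flow or explicit depth estimation, which are computationally expensive and poorly suited to the high-temporal-resolution data from neuromorphic sensors. The paper aims for a more efficient solution that exploits the sparsity and high temporal resolution of event data.

Contribution: 1) Proposes a motion segmentation and egomotion estimation framework based on event-based normal flow; 2) Incorporates geometric constraints between normal flow, scene structure, and inertial measurements; 3) Achieves accurate segmentation and motion estimation without computing full optical flow.

Method: 1) Over-segmentation of the event data; 2) Isolation of independently moving objects via residual analysis; 3) Refinement of the segmentation using hierarchical clustering informed by motion similarity and temporal consistency.

Result: Experiments on the EVIMO2v2 dataset show that the method achieves accurate motion segmentation and translational motion estimation, with particular advantages at object boundaries.

Insight: The paper demonstrates the potential of event data for motion analysis and offers an efficient, scalable approach for real-time robotics and navigation.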

Abstract: This paper introduces a robust framework for motion segmentation and egomotion estimation using event-based normal flow, tailored specifically for neuromorphic vision sensors. In contrast to traditional methods that rely heavily on optical flow or explicit depth estimation, our approach exploits the sparse, high-temporal-resolution event data and incorporates geometric constraints between normal flow, scene structure, and inertial measurements. The proposed optimization-based pipeline iteratively performs event over-segmentation, isolates independently moving objects via residual analysis, and refines segmentations using hierarchical clustering informed by motion similarity and temporal consistency. Experimental results on the EVIMO2v2 dataset validate that our method achieves accurate segmentation and translational motion estimation without requiring full optical flow computation. This approach demonstrates significant advantages at object boundaries and offers considerable potential for scalable, real-time robotic and navigation applications.
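
For readers unfamiliar with normal flow, the sketch below computes it from the classical brightness-constancy constraint I_x u + I_y v + I_t = 0, which determines only the flow component along the image gradient. This is the standard image-based definition; event-based estimators, including whatever the paper uses, obtain the same quantity from event streams by different means, and the gradient values here are made up.

```python
# Sketch only: classical image-based normal flow, not the paper's estimator.


def normal_flow(ix, iy, it, eps=1e-12):
    """Flow component along the image gradient at one pixel.

    ix, iy: spatial brightness derivatives; it: temporal derivative.
    Returns (0.0, 0.0) where the gradient is too weak to constrain motion.
    """
    grad_sq = ix * ix + iy * iy
    if grad_sq < eps:
        return (0.0, 0.0)
    s = -it / grad_sq
    return (s * ix, s * iy)


# A purely horizontal gradient with true flow (u, v) = (2, 1):
# brightness constancy gives it = -(ix * u + iy * v) = -2.
flow = normal_flow(1.0, 0.0, -2.0)
print(flow)
```

The vertical component of the true motion is lost because the gradient carries no information about it (the aperture problem). Methods built on normal flow therefore lean on additional geometric constraints, such as the scene structure and inertial measurements mentioned in the abstract, instead of recovering full optical flow.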

[63] Advances in Feed-Forward 3D Reconstruction and View Synthesis: A Survey

Jiahui Zhang,Yuelei Li,Anpei Chen,Muyu Xu,Kunhao Liu,Jianyuan Wang,Xiao-Xiao Long,Hanxue Liang,Zexiang Xu,Hao Su,Christian Theobalt,Christian Rupprecht,Andrea Vedaldi,Hanspeter Pfister,Shijian Lu,Fangneng Zhan

Main category: cs.CV

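TL;DR: This survey reviews progress in feed-forward 3D reconstruction and view synthesis, focusing on fast and generalizable deep learning methods, and categorizes them by underlying representation (such as point clouds, 3D Gaussian Splatting, and Neural Radiance Fields).

Details Motivation: Traditional 3D reconstruction and view synthesis rely on computationally intensive iterative optimization, which limits real-world use. Fast, generalizable feed-forward methods driven by deep learning have transformed the field and merit a systematic summary.

Contribution: Provides a comprehensive review of feed-forward 3D reconstruction and view synthesis, proposes a taxonomy based on representation architecture, and summarizes key tasks and future research directions.

Method: Classifies and analyzes existing work by representation, including point clouds, 3D Gaussian Splatting (3DGS), and Neural Radiance Fields (NeRF), to summarize progress in feed-forward methods.

Result: The survey covers tasks such as pose-free reconstruction and dynamic 3D reconstruction, and discusses applications in digital humans, SLAM, robotics, and other areas.

Insight: Feed-forward methods use deep learning to greatly improve the efficiency and generalizability of 3D reconstruction and view synthesis; future work may focus on better representation architectures and broader applications.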

Abstract: 3D reconstruction and view synthesis are foundational problems in computer vision, graphics, and immersive technologies such as augmented reality (AR), virtual reality (VR), and digital twins. Traditional methods rely on computationally intensive iterative optimization in a complex chain, limiting their applicability in real-world scenarios. Recent advances in feed-forward approaches, driven by deep learning, have revolutionized this field by enabling fast and generalizable 3D reconstruction and view synthesis. This survey offers a comprehensive review of feed-forward techniques for 3D reconstruction and view synthesis, with a taxonomy according to the underlying representation architectures including point cloud, 3D Gaussian Splatting (3DGS), Neural Radiance Fields (NeRF), etc. We examine key tasks such as pose-free reconstruction, dynamic 3D reconstruction, and 3D-aware image and video synthesis, highlighting their applications in digital humans, SLAM, robotics, and beyond. In addition, we review commonly used datasets with detailed statistics, along with evaluation protocols for various downstream tasks. We conclude by discussing open research challenges and promising directions for future work, emphasizing the potential of feed-forward approaches to advance the state of the art in 3D vision.

[64] DCHM: Depth-Consistent Human Modeling for Multiview Detection

Jiahao Ma,Tianyu Wang,Miaomiao Liu,David Ahmedt-Aristizabal,Chuong Nguyen

Main category: cs.CV

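TL;DR: This paper proposes Depth-Consistent Human Modeling (DCHM), which uses superpixel-wise Gaussian Splatting to achieve multiview depth consistency, significantly reducing noise and improving the accuracy of multiview pedestrian detection.

Details Motivation: Existing human modeling methods for multiview pedestrian detection are noisy and imprecise, and those that reduce noise rely on costly 3D annotations and generalize poorly across diverse scenes.

Contribution: Proposes the DCHM framework, which achieves multiview depth consistency in sparse-view, large-scale, crowded scenes and produces precise point clouds for pedestrian localization. To the authors' knowledge, it is also the first to reconstruct pedestrians and perform multiview segmentation in this setting.

Method: A pipeline with superpixel-wise Gaussian Splatting that performs consistent depth estimation and multiview fusion in global coordinates.

Result: Experiments show that DCHM significantly reduces noise and outperforms existing methods.

Insight: Depth-consistent modeling enables accurate pedestrian detection and reconstruction across scenes without relying on human-labeled annotations.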

Abstract: Multiview pedestrian detection typically involves two stages: human modeling and pedestrian localization. Human modeling represents pedestrians in 3D space by fusing multiview information, making its quality crucial for detection accuracy. However, existing methods often introduce noise and have low precision. While some approaches reduce noise by fitting on costly multiview 3D annotations, they often struggle to generalize across diverse scenes. To eliminate reliance on human-labeled annotations and accurately model humans, we propose Depth-Consistent Human Modeling (DCHM), a framework designed for consistent depth estimation and multiview fusion in global coordinates. Specifically, our proposed pipeline with superpixel-wise Gaussian Splatting achieves multiview depth consistency in sparse-view, large-scale, and crowded scenarios, producing precise point clouds for pedestrian localization. Extensive validations demonstrate that our method significantly reduces noise during human modeling, outperforming previous state-of-the-art baselines. Additionally, to our knowledge, DCHM is the first to reconstruct pedestrians and perform multiview segmentation in such a challenging setting. Code is available on the project page: https://jiahao-ma.github.io/DCHM/.

[65] ArtiMuse: Fine-Grained Image Aesthetics Assessment with Joint Scoring and Expert-Level Understanding

Shuo Cao,Nan Ma,Jiayang Li,Xiaohui Li,Lihao Shao,Kaiwen Zhu,Yu Zhou,Yuandong Pu,Jiarui Wu,Jiaquan Wang,Bo Qu,Wenhai Wang,Yu Qiao,Dajuin Yao,Yihao Liu

Main category: cs.CV

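TL;DR: ArtiMuse is an image aesthetics assessment method based on a multimodal large language model, with joint scoring and expert-level understanding capabilities. The authors also release ArtiMuse-10K, the first expert-annotated image aesthetics dataset.

Details Motivation: With the growth of educational applications, artistic creation, and AIGC technologies, demand for image aesthetics assessment is increasing, but existing methods suffer from modality bias and lack fine-grained attribute decomposition.

Contribution: 1. Proposes ArtiMuse, an image aesthetics assessment model that supports joint scoring and expert-level understanding; 2. Releases ArtiMuse-10K, the first expert-annotated fine-grained aesthetics dataset.

Method: Builds on a multimodal large language model (MLLM), combining quantitative scoring with fine-grained aesthetic understanding based on 8-dimensional attribute analysis.

Result: The model and dataset will be made public to advance the field.

Insight: MLLMs show stronger capabilities in aesthetics assessment but need to overcome modality bias; combining fine-grained attributes with scoring is a promising direction.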

Abstract: The rapid advancement of educational applications, artistic creation, and AI-generated content (AIGC) technologies has substantially increased practical requirements for comprehensive Image Aesthetics Assessment (IAA), particularly demanding methods capable of delivering both quantitative scoring and professional understanding. Multimodal Large Language Model (MLLM)-based IAA methods demonstrate stronger perceptual and generalization capabilities compared to traditional approaches, yet they suffer from modality bias (score-only or text-only) and lack fine-grained attribute decomposition, thereby failing to support further aesthetic assessment. In this paper, we present: (1) ArtiMuse, an innovative MLLM-based IAA model with Joint Scoring and Expert-Level Understanding capabilities; (2) ArtiMuse-10K, the first expert-curated image aesthetic dataset comprising 10,000 images spanning 5 main categories and 15 subcategories, each annotated by professional experts with an 8-dimensional attribute analysis and a holistic score. Both the model and dataset will be made public to advance the field.

[66] Real Time Captioning of Sign Language Gestures in Video Meetings

Sharanya Mukherjee,Md Hishaam Akhtar,Kannadasan R

Main category: cs.CV

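TL;DR: This paper proposes a browser extension that translates sign language gestures into subtitles in real time during video meetings, helping people with hearing impairments communicate with others.

Details Motivation: People with hearing impairments prefer signing to typing in video meetings, but most people are unfamiliar with the finer details of sign language, so a real-time translation tool is needed to remove the communication barrier.

Contribution: The main contribution is a browser extension that translates sign language into subtitles in real time during video meetings, intended to improve communication efficiency for people with hearing impairments.

Method: Uses a large-scale dataset of more than 2,000 word-level sign language videos performed by over 100 signers, combined with computer vision techniques for sign recognition and subtitle generation.

Result: The browser extension provides real-time sign language translation, giving people with hearing impairments a convenient communication tool in video meetings.

Insight: Combining computer vision with a large-scale dataset is key to real-time sign language translation; such tools have significant value in remote communication settings.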

Abstract: Communicating with someone who has a hearing impairment has always been a difficult task. One of the most established ways to do so is through sign languages. However, few people are familiar with the finer intricacies of sign language. Sign language recognition using computer vision aims to eliminate the communication barrier between people who are deaf or mute and those who are not, so that they can communicate properly with each other. The recent pandemic shook the whole world and transformed the way we communicate. Video meetings have become essential for everyone, including people with a hearing disability. Recent studies have found that people with hearing disabilities prefer signing over typing during video calls. In this paper, we propose a browser extension that automatically translates sign language into subtitles for everyone else in the video call. A large-scale dataset containing more than 2000 word-level ASL videos, performed by over 100 signers, will be used.

[67] Multimodal AI for Gastrointestinal Diagnostics: Tackling VQA in MEDVQA-GI 2025

Sujata Gaihre,Amir Thapa Magar,Prasuna Pokharel,Laxmi Tiwari

Main category: cs.CV

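TL;DR: This paper presents a solution to the visual question answering task of the MEDVQA-GI 2025 challenge. It adopts the Florence multimodal foundation model and improves generalization through domain-specific data augmentation; the results show the potential of large multimodal models in medical VQA.

Details Motivation: To address visual question answering for gastrointestinal endoscopy and advance the use of multimodal AI in medical diagnosis.

Contribution: 1. Adopts the Florence multimodal foundation model as the backbone of the VQA pipeline; 2. Introduces domain-specific data augmentation to improve generalization; 3. Validates the model on the KASVIR dataset, providing a baseline for future research.

Method: 1. Uses Florence as the backbone, pairing a vision encoder with a text encoder; 2. Applies domain-specific augmentations that preserve medical features; 3. Fine-tunes on the KASVIR dataset.

Result: Experiments show that the fine-tuned Florence yields accurate responses on the official challenge metrics, supporting the potential of large multimodal models in medical VQA.

Insight: Large multimodal models hold broad promise for visual question answering in medicine, but further work is needed on explainability, robustness, and clinical integration.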

Abstract: This paper describes our approach to Subtask 1 of the ImageCLEFmed MEDVQA 2025 Challenge, which targets visual question answering (VQA) for gastrointestinal endoscopy. We adopt the Florence model, a large-scale multimodal foundation model, as the backbone of our VQA pipeline, pairing a powerful vision encoder with a text encoder to interpret endoscopic images and produce clinically relevant answers. To improve generalization, we apply domain-specific augmentations that preserve medical features while increasing training diversity. Experiments on the KASVIR dataset show that fine-tuning Florence yields accurate responses on the official challenge metrics. Our results highlight the potential of large multimodal models in medical VQA and provide a strong baseline for future work on explainability, robustness, and clinical integration. The code is publicly available at: https://github.com/TiwariLaxuu/VQA-Florence.git

[68] Descrip3D: Enhancing Large Language Model-based 3D Scene Understanding with Object-Level Text Descriptions

Jintang Xue,Ganning Zhao,Jie-En Yao,Hong-En Chen,Yue Hu,Meida Chen,Suya You,C. -C. Jay Kuo

Main category: cs.CV

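TL;DR: Descrip3D attaches a textual description to each object in a 3D scene, strengthening a large language model's understanding of inter-object relationships and improving performance across multiple 3D scene understanding tasks.

Details Motivation: Existing 3D scene-language models fall short in understanding relationships between objects; visual embeddings alone cannot fully capture the semantic and spatial relationships of objects, so a more effective approach is needed.

Contribution: Proposes the Descrip3D framework, which explicitly encodes inter-object relationships in natural language and combines embedding fusion with prompt-level injection to enable unified reasoning without task-specific heads or additional supervision.

Method: Uses dual-level integration (embedding fusion and prompt-level injection) to incorporate object-level text descriptions into the model, strengthening the modeling of each object's intrinsic attributes and contextual relationships.

Result: On five benchmark datasets (ScanRefer, Multi3DRefer, ScanQA, SQA3D, and Scan2Cap), Descrip3D consistently outperforms baseline models, supporting the effectiveness of language-guided relational representation for understanding complex indoor scenes.

Insight: Text descriptions supply complementary semantic information for 3D scene understanding, and combining language with vision across modalities improves the model's reasoning ability.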

Abstract: Understanding 3D scenes goes beyond simply recognizing objects; it requires reasoning about the spatial and semantic relationships between them. Current 3D scene-language models often struggle with this relational understanding, particularly when visual embeddings alone do not adequately convey the roles and interactions of objects. In this paper, we introduce Descrip3D, a novel and powerful framework that explicitly encodes the relationships between objects using natural language. Unlike previous methods that rely only on 2D and 3D embeddings, Descrip3D enhances each object with a textual description that captures both its intrinsic attributes and contextual relationships. These relational cues are incorporated into the model through a dual-level integration: embedding fusion and prompt-level injection. This allows for unified reasoning across various tasks such as grounding, captioning, and question answering, all without the need for task-specific heads or additional supervision. When evaluated on five benchmark datasets, including ScanRefer, Multi3DRefer, ScanQA, SQA3D, and Scan2Cap, Descrip3D consistently outperforms strong baseline models, demonstrating the effectiveness of language-guided relational representation for understanding complex indoor scenes.

[69] LEAD: Exploring Logit Space Evolution for Model Selection

Zixuan Hu,Xiaotong Li,Shixiang Tang,Jun Liu,Yichun Hu,Ling-Yu Duan

Main category: cs.CV

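TL;DR: This paper proposes LEAD, a method that selects the pre-trained model best suited to a downstream task by modeling the nonlinear optimization dynamics in logit space.

Details Motivation: The proliferation of pre-trained models makes choosing the most suitable one for a downstream task complex and time-consuming. Existing methods typically rely on linear transformations in feature space and cannot accurately capture the nonlinearity of the optimization process.

Contribution: LEAD proposes a nonlinear differential-equation model in logit space, together with a class-aware decomposition of the optimization dynamics, to predict model transferability quickly.

Method: LEAD models the nonlinear evolution of the logits with an ordinary differential equation (ODE) and proposes a class-aware decomposition method to handle differing dynamics across classes, bridging the optimization gap in a single step.

Result: Experiments on 24 supervised and self-supervised pre-trained models across 10 downstream datasets show strong performance and broad adaptability, even in low-data scenarios.

Insight: Nonlinear modeling in logit space reflects optimization dynamics more accurately and offers an efficient, practical solution to the model selection problem.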

Abstract: The remarkable success of the pretrain-then-finetune paradigm has led to a proliferation of available pre-trained models for vision tasks. This surge presents a significant challenge in efficiently choosing the most suitable pre-trained models for downstream tasks. The critical aspect of this challenge lies in effectively predicting model transferability by considering the underlying fine-tuning dynamics. Existing methods often model fine-tuning dynamics in feature space with linear transformations, which do not precisely align with the fine-tuning objective and fail to grasp the essential nonlinearity of optimization. To this end, we present LEAD, a finetuning-aligned approach based on the network output of logits. LEAD proposes a theoretical framework to model the optimization process and derives an ordinary differential equation (ODE) to depict the nonlinear evolution toward the final logit state. Additionally, we design a class-aware decomposition method to consider the varying evolution dynamics across classes and further ensure practical applicability. Integrating the closely aligned optimization objective and nonlinear modeling capabilities derived from the differential equation, our method offers a concise solution to effectively bridge the optimization gap in a single step, bypassing the lengthy fine-tuning process. Comprehensive experiments on 24 supervised and self-supervised pre-trained models across 10 downstream datasets demonstrate impressive performance and showcase its broad adaptability even in low-data scenarios.
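To make the idea of nonlinear logit evolution concrete, the sketch below Euler-integrates a simple gradient-flow ODE on the logits of a single sample under cross-entropy loss, dz/dt = -(softmax(z) - y). This is a textbook simplification chosen for illustration and is not the ODE derived in LEAD; the function names, step size, step count, and initial logits are all assumptions.

```python
import math


def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def evolve_logits(z0, label, step=0.1, n_steps=200):
    """Euler-integrate dz/dt = -(softmax(z) - onehot(label)).

    Assumption: gradient flow acts directly on the logits. This is a
    simplification for illustration, not the dynamics derived in the paper.
    """
    z = list(z0)
    for _ in range(n_steps):
        p = softmax(z)
        for k in range(len(z)):
            target = 1.0 if k == label else 0.0
            z[k] -= step * (p[k] - target)
    return z


initial = [0.5, 1.5, -0.2]  # hypothetical initial logits for 3 classes
final = evolve_logits(initial, label=0)
print("p(true class) before:", softmax(initial)[0])
print("p(true class) after: ", softmax(final)[0])
```

Because the update depends on softmax(z), the trajectory is nonlinear in the logits: the true-class logit moves quickly at first and slows as the predicted probability approaches one, which a linear transformation of features cannot reproduce.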

[70] Exp-Graph: How Connections Learn Facial Attributes in Graph-based Expression Recognition

Nandani Sharma,Dinesh Singh

Main category: cs.CV

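TL;DR: Exp-Graph is a graph-based framework that uses facial landmarks and a vision transformer to capture the structural relationships among facial attributes, improving the accuracy of expression recognition.

Details Motivation: Facial expression recognition is crucial for human-computer interaction, but the structure of facial attributes varies with expression. Existing methods underuse this structural information, so a framework that models the relationships among facial attributes is needed.

Contribution: Proposes Exp-Graph, which uses facial landmarks as graph vertices, derives edges from landmark proximity and the local appearance similarity encoded by a vision transformer, and applies graph convolutional networks to capture structural dependencies, improving expression recognition performance.

Method: 1. Build the graph vertices from facial landmarks; 2. Determine edges from proximity and local appearance similarity (encoded by a vision transformer); 3. Use graph convolutional networks to learn the global and local dependencies among facial attributes.

Result: Accuracies of 98.09%, 79.01%, and 56.39% on the Oulu-CASIA, eNTERFACE05, and AFEW datasets, respectively, showing strong generalization.

Insight: Combining graph structure with a vision transformer effectively captures the semantic relationships among facial attributes and suits expression recognition in both laboratory and real-world settings.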

Abstract: Facial expression recognition is crucial for human-computer interaction applications such as face animation, video surveillance, affective computing, and medical analysis. Since the structure of facial attributes varies with facial expressions, incorporating structural information into facial attributes is essential for facial expression recognition. In this paper, we propose Exp-Graph, a novel framework designed to represent the structural relationships among facial attributes using graph-based modeling for facial expression recognition. For the facial attribute graph representation, facial landmarks are used as the graph's vertices, while the edges are determined based on the proximity of the facial landmarks and the similarity of the local appearance of the facial attributes encoded using the vision transformer. Additionally, graph convolutional networks are utilized to capture and integrate these structural dependencies into the encoding of facial attributes, thereby enhancing the accuracy of expression recognition. Thus, Exp-Graph learns highly expressive semantic representations from the facial attribute graphs. In addition, the vision transformer and graph convolutional blocks help the framework exploit the local and global dependencies among the facial attributes that are essential for the recognition of facial expressions. We conducted comprehensive evaluations of the proposed Exp-Graph model on three benchmark datasets: Oulu-CASIA, eNTERFACE05, and AFEW. The model achieved recognition accuracies of 98.09%, 79.01%, and 56.39%, respectively. These results indicate that Exp-Graph maintains strong generalization capabilities across both controlled laboratory settings and real-world, unconstrained environments, underscoring its effectiveness for practical facial expression recognition applications.
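The graph construction described above can be sketched as follows. The abstract does not say how proximity and appearance similarity are combined, so this sketch assumes an edge exists only when two landmarks are both close and similar; the thresholds, the toy landmark coordinates, and the two-dimensional feature vectors standing in for vision transformer embeddings are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def build_adjacency(points, feats, max_dist=0.3, min_sim=0.8):
    """Adjacency matrix over landmarks (assumed rule: close AND similar).

    points: landmark coordinates, normalized to [0, 1].
    feats:  per-landmark appearance features (stand-ins for ViT embeddings).
    """
    n = len(points)
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        adj[i][i] = 1  # self-loop, as is common before graph convolution
        for j in range(i + 1, n):
            close = math.dist(points[i], points[j]) <= max_dist
            similar = cosine(feats[i], feats[j]) >= min_sim
            if close and similar:
                adj[i][j] = adj[j][i] = 1
    return adj


# Hypothetical landmarks: two eyes, nose tip, mouth centre.
points = [(0.30, 0.40), (0.70, 0.40), (0.50, 0.60), (0.45, 0.80)]
feats = [[1.0, 0.1], [0.9, 0.2], [0.8, 0.3], [0.1, 1.0]]
adjacency = build_adjacency(points, feats)
for row in adjacency:
    print(row)
```

The resulting matrix is what a graph convolutional layer would consume alongside the per-landmark features; a weighted variant that uses the similarity value itself as the edge weight would be an equally plausible reading of the description.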

[71] Depthwise-Dilated Convolutional Adapters for Medical Object Tracking and Segmentation Using the Segment Anything Model 2

Guoping Xu,Christopher Kabat,You Zhang

Main category: cs.CV

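TL;DR: This paper proposes DD-SAM2, an efficient adaptation framework that uses a Depthwise-Dilated Adapter (DD-Adapter) to enhance multi-scale feature extraction in the Segment Anything Model 2 (SAM2), making it suitable for fine-tuning on medical videos with limited training data.

Details Motivation: Most existing medical image segmentation methods are modality-specific and adapt poorly to dynamic medical video. SAM2 and its variants offer a generalizable solution, but fine-tuning the full model is costly and risks catastrophic forgetting.

Contribution: Proposes the DD-SAM2 framework, an initial attempt at systematically exploring adapter-based SAM2 fine-tuning for medical video segmentation and tracking, with improved performance.

Method: Introduces the Depthwise-Dilated Adapter (DD-Adapter) into SAM2 to enhance multi-scale feature extraction, enabling efficient fine-tuning with minimal parameter overhead while fully exploiting SAM2's streaming memory mechanism.

Result: Dice scores of 0.93 and 0.97 on the TrackRad2025 and EchoNet-Dynamic datasets, respectively.

Insight: Adapter-based design can address the high cost and catastrophic forgetting of fine-tuning on medical video tasks while maintaining strong performance.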

Abstract: Recent advances in medical image segmentation have been driven by deep learning; however, most existing methods remain limited by modality-specific designs and exhibit poor adaptability to dynamic medical imaging scenarios. The Segment Anything Model 2 (SAM2) and its related variants, which introduce a streaming memory mechanism for real-time video segmentation, present new opportunities for prompt-based, generalizable solutions. Nevertheless, adapting these models to medical video scenarios typically requires large-scale datasets for retraining or transfer learning, leading to high computational costs and the risk of catastrophic forgetting. To address these challenges, we propose DD-SAM2, an efficient adaptation framework for SAM2 that incorporates a Depthwise-Dilated Adapter (DD-Adapter) to enhance multi-scale feature extraction with minimal parameter overhead. This design enables effective fine-tuning of SAM2 on medical videos with limited training data. Unlike existing adapter-based methods focused solely on static images, DD-SAM2 fully exploits SAM2’s streaming memory for medical video object tracking and segmentation. Comprehensive evaluations on TrackRad2025 (tumor segmentation) and EchoNet-Dynamic (left ventricle tracking) datasets demonstrate superior performance, achieving Dice scores of 0.93 and 0.97, respectively. To the best of our knowledge, this work provides an initial attempt at systematically exploring adapter-based SAM2 fine-tuning for medical video segmentation and tracking. Code, datasets, and models will be publicly available at https://github.com/apple1986/DD-SAM2.
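The claim of multi-scale features at minimal parameter overhead follows from two general properties of convolutions, shown in the arithmetic sketch below: a depthwise convolution holds far fewer weights than a standard one, and dilation enlarges the receptive field without adding any. The channel count, kernel size, and dilation rates are hypothetical and are not taken from the DD-Adapter design.

```python
def conv_params(channels_in, channels_out, kernel, groups=1):
    """Weight count of a 2D convolution (bias omitted)."""
    return (channels_in // groups) * channels_out * kernel * kernel


def effective_kernel(kernel, dilation):
    """Spatial extent covered by a dilated kernel along one axis."""
    return kernel + (kernel - 1) * (dilation - 1)


channels = 256  # hypothetical feature width
standard = conv_params(channels, channels, kernel=3)
depthwise = conv_params(channels, channels, kernel=3, groups=channels)
extents = [effective_kernel(3, d) for d in (1, 2, 3)]

print("standard 3x3 weights: ", standard)
print("depthwise 3x3 weights:", depthwise)
print("effective kernel sizes for dilation 1, 2, 3:", extents)
```

Running parallel depthwise branches at different dilation rates is one common way to obtain multi-scale context while keeping the number of trainable weights small relative to the frozen backbone.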

[72] BusterX++: Towards Unified Cross-Modal AI-Generated Content Detection and Explanation with MLLM

Haiquan Wen,Tianxiao Li,Zhenglin Huang,Yiwei He,Guangliang Cheng

Main category: cs.CV

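TL;DR: BusterX++ is a new cross-modal detection framework that combines an MLLM with reinforcement learning to detect and explain synthetic media. It improves performance through multi-stage training and hybrid reasoning. The paper also introduces GenBuster++, a cross-modal benchmark.

Details Motivation: Rapid progress in generative AI for image and video synthesis has sharply increased the misinformation risk posed by fake content. Existing single-modality detectors cannot cope with synthetic content that combines multiple media formats, so a cross-modal solution is needed.

Contribution: 1. Proposes BusterX++, a framework designed specifically for cross-modal detection and explanation of synthetic media; 2. Introduces a reinforcement learning post-training strategy that eliminates cold-start; 3. Presents the GenBuster++ benchmark.

Method: The framework combines Multi-stage Training, Thinking Reward, and Hybrid Reasoning, using an MLLM and reinforcement learning for cross-modal detection and explanation.

Result: Experiments show that BusterX++ performs well on cross-modal synthetic media detection and generalizes well.

Insight: Cross-modal detection is an important direction for countering synthetic fake content, and reinforcement learning post-training can substantially improve MLLM performance.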

Abstract: Recent advances in generative AI have dramatically improved image and video synthesis capabilities, significantly increasing the risk of misinformation through sophisticated fake content. In response, detection methods have evolved from traditional approaches to multimodal large language models (MLLMs), offering enhanced transparency and interpretability in identifying synthetic media. However, current detection systems remain fundamentally limited by their single-modality design. These approaches analyze images or videos separately, making them ineffective against synthetic content that combines multiple media formats. To address these challenges, we introduce BusterX++, a novel framework designed specifically for cross-modal detection and explanation of synthetic media. Our approach incorporates an advanced reinforcement learning (RL) post-training strategy that eliminates cold-start. Through Multi-stage Training, Thinking Reward, and Hybrid Reasoning, BusterX++ achieves stable and substantial performance improvements. To enable comprehensive evaluation, we also present GenBuster++, a cross-modal benchmark leveraging state-of-the-art image and video generation techniques. This benchmark comprises 4,000 images and video clips, meticulously curated by human experts using a novel filtering methodology to ensure high quality, diversity, and real-world applicability. Extensive experiments demonstrate the effectiveness and generalizability of our approach.

[73] AI-Powered Precision in Sport Taekwondo: Enhancing Fairness, Speed, and Trust in Competition (FST.ai)

Keivan Shariatmadar,Ahmad Osman

Main category: cs.CV

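TL;DR: The paper proposes FST.ai, an AI framework for improving fairness, speed, and trust in Sport Taekwondo, focusing on real-time head kick detection and scoring.

Details Motivation: Traditional manual officiating suffers from latency, subjectivity, and inconsistent enforcement, which undermines fairness and athlete trust.

Contribution: FST.ai combines computer vision, deep learning, and edge inference to deliver a fast, consistent, and transparent action detection and scoring system.

Method: A framework based on pose estimation, motion classification, and impact analysis that can be adapted to many sports.

Result: The system reduces decision time from minutes to seconds while improving consistency and transparency.

Insight: The framework is not limited to Taekwondo and can be extended to other sports, reflecting its sport-agnostic potential and scalability.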

Abstract: The integration of Artificial Intelligence (AI) into sports officiating represents a paradigm shift in how decisions are made in competitive environments. Traditional manual systems, even when supported by Instant Video Replay (IVR), often suffer from latency, subjectivity, and inconsistent enforcement, undermining fairness and athlete trust. This paper introduces FST.ai, a novel AI-powered framework designed to enhance officiating in Sport Taekwondo, particularly focusing on the complex task of real-time head kick detection and scoring. Leveraging computer vision, deep learning, and edge inference, the system automates the identification and classification of key actions, significantly reducing decision time from minutes to seconds while improving consistency and transparency. Importantly, the methodology is not limited to Taekwondo. The underlying framework – based on pose estimation, motion classification, and impact analysis – can be adapted to a wide range of sports requiring action detection, such as judo, karate, fencing, or even team sports like football and basketball, where foul recognition or performance tracking is critical. By addressing one of Taekwondo’s most challenging scenarios – head kick scoring – we demonstrate the robustness, scalability, and sport-agnostic potential of FST.ai to transform officiating standards across multiple disciplines.

[74] Artificial Intelligence in the Food Industry: Food Waste Estimation based on Computer Vision, a Brief Case Study in a University Dining Hall

Shayan Rokhva,Babak Teimourpour

Main category: cs.CV

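TL;DR: The paper proposes a low-cost computer vision framework for food waste estimation that segments RGB images to quantify plate-level food waste for five Iranian dishes in a university dining hall.

Details Motivation: Quantifying food waste in institutional dining settings is essential for data-driven sustainability strategies, yet efficient automated solutions are lacking.

Contribution: Proposes a computer vision framework based on semantic segmentation that can estimate food waste in real time; designs a capped dynamic inverse-frequency loss and a custom Distributional Pixel Agreement (DPA) metric tailored to the task.

Method: Trains four fully supervised models (U-Net, U-Net++, and their lightweight variants) with a capped dynamic inverse-frequency loss and the AdamW optimizer, then evaluates them with Pixel Accuracy, Dice, IoU, and DPA.

Result: All models perform well, and for each food type at least one model approaches or surpasses 90% DPA; the lightweight models reach real-time throughput on an NVIDIA T4 GPU; segmentation is better for dry, rigid foods.

Insight: Reliance on 2D imaging and limited food variety are the main limitations, but the framework offers a scalable, contactless solution for large-scale food waste monitoring and could be extended to more complex scenarios in future work.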

Abstract: Quantifying post-consumer food waste in institutional dining settings is essential for supporting data-driven sustainability strategies. This study presents a cost-effective computer vision framework that estimates plate-level food waste by utilizing semantic segmentation of RGB images taken before and after meal consumption across five Iranian dishes. Four fully supervised models (U-Net, U-Net++, and their lightweight variants) were trained using a capped dynamic inverse-frequency loss and AdamW optimizer, then evaluated through a comprehensive set of metrics, including Pixel Accuracy, Dice, IoU, and a custom-defined Distributional Pixel Agreement (DPA) metric tailored to the task. All models achieved satisfying performance, and for each food type, at least one model approached or surpassed 90% DPA, demonstrating strong alignment in pixel-wise proportion estimates. Lighter models with reduced parameter counts offered faster inference, achieving real-time throughput on an NVIDIA T4 GPU. Further analysis showed superior segmentation performance for dry and more rigid components (e.g., rice and fries), while more complex, fragmented, or viscous dishes, such as stews, showed reduced performance, specifically post-consumption. Despite limitations such as reliance on 2D imaging, constrained food variety, and manual data collection, the proposed framework is pioneering and represents a scalable, contactless solution for continuous monitoring of food consumption. This research lays foundational groundwork for automated, real-time waste tracking systems in large-scale food service environments and offers actionable insights and outlines feasible future directions for dining hall management and policymakers aiming to reduce institutional food waste.
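The abstract names a capped dynamic inverse-frequency loss without giving its formula. One plausible reading, sketched below under that assumption, is to weight each class by the inverse of its pixel frequency in the current batch and clip the weight at a fixed cap so that very rare classes do not dominate the loss. The normalization, the cap value, and the class names and pixel counts are all hypothetical.

```python
def capped_inverse_frequency_weights(pixel_counts, cap=10.0):
    """Per-class loss weights: total / (n_classes * count), clipped at `cap`.

    Assumption: "dynamic" means these weights are recomputed from the pixel
    counts of each batch; the paper's exact formula is not given here.
    """
    total = sum(pixel_counts.values())
    n_classes = len(pixel_counts)
    weights = {}
    for name, count in pixel_counts.items():
        if count == 0:
            weights[name] = cap  # class absent from the batch: use the cap
        else:
            weights[name] = min(total / (n_classes * count), cap)
    return weights


# Hypothetical pixel counts for one batch of plate images.
counts = {"background": 9000, "rice": 900, "stew": 100}
weights = capped_inverse_frequency_weights(counts)
for name, weight in weights.items():
    print(name, round(weight, 3))
```

Without the cap, the rarest class here would receive a weight above 33, roughly ninety times the background weight; clipping keeps the imbalance correction while limiting how strongly a handful of pixels can steer the gradient.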

[75] Gene-DML: Dual-Pathway Multi-Level Discrimination for Gene Expression Prediction from Histopathology Images

Yaxuan Song,Jianan Fan,Hang Chang,Weidong Cai

Main category: cs.CV

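TL;DR: Gene-DML is a framework that strengthens the alignment between histopathology images and gene expression profiles through dual-pathway multi-level discrimination, improving gene expression prediction.

Details Motivation: Existing methods underuse the cross-modal alignment between histopathology images and gene expression profiles across multiple representational levels, which limits prediction performance. Gene-DML aims to close this gap.

Contribution: Proposes Gene-DML, a dual-pathway multi-level discrimination framework that uses multi-scale instance-level discrimination and cross-level instance-group discrimination to strengthen the alignment between morphological and transcriptional modalities.

Method: Designs a multi-scale instance-level discrimination pathway and a cross-level instance-group discrimination pathway, jointly modeling fine-grained and structural-level discrimination.

Result: On public spatial transcriptomics datasets, Gene-DML achieves state-of-the-art gene expression prediction performance.

Insight: Multi-level, cross-modal alignment is key to better gene expression prediction, and the dual-pathway design improves the model's generalization.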

Abstract: Accurately predicting gene expression from histopathology images offers a scalable and non-invasive approach to molecular profiling, with significant implications for precision medicine and computational pathology. However, existing methods often underutilize the cross-modal representation alignment between histopathology images and gene expression profiles across multiple representational levels, thereby limiting their prediction performance. To address this, we propose Gene-DML, a unified framework that structures latent space through Dual-pathway Multi-Level discrimination to enhance correspondence between morphological and transcriptional modalities. The multi-scale instance-level discrimination pathway aligns hierarchical histopathology representations extracted at local, neighbor, and global levels with gene expression profiles, capturing scale-aware morphological-transcriptional relationships. In parallel, the cross-level instance-group discrimination pathway enforces structural consistency between individual (image/gene) instances and modality-crossed (gene/image, respectively) groups, strengthening the alignment across modalities. By jointly modelling fine-grained and structural-level discrimination, Gene-DML is able to learn robust cross-modal representations, enhancing both predictive accuracy and generalization across diverse biological contexts. Extensive experiments on public spatial transcriptomics datasets demonstrate that Gene-DML achieves state-of-the-art performance in gene expression prediction. The code and checkpoints will be released soon.

[76] Docopilot: Improving Multimodal Models for Document-Level Understanding

Yuchen Duan,Zhe Chen,Yusong Hu,Weiyun Wang,Shenglong Ye,Botian Shi,Lewei Lu,Qibin Hou,Tong Lu,Hongsheng Li,Jifeng Dai,Wenhai Wang

Main category: cs.CV

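TL;DR: The paper presents Docopilot, a native multimodal model that improves the understanding of complex documents using the high-quality Doc-750K dataset, avoiding the drawbacks of conventional retrieval-augmented generation (RAG) methods.

Details Motivation: Existing multimodal large language models perform poorly on complex, multi-page document comprehension, largely because high-quality document-level datasets are lacking; conventional RAG methods also suffer from fragmented retrieval contexts and multi-stage error accumulation.

Contribution: 1. Releases Doc-750K, a high-quality document-level dataset; 2. Proposes Docopilot, a native multimodal model that handles document-level dependencies without relying on RAG, improving the coherence and accuracy of document understanding.

Method: Builds the Docopilot model on the Doc-750K dataset; the model learns document-level dependencies directly, avoiding the multi-stage processing of conventional RAG methods.

Result: Experiments show that Docopilot achieves superior coherence, accuracy, and efficiency in document understanding tasks and multi-turn interactions, setting a new baseline for document-level multimodal understanding.

Insight: High-quality datasets and a native multimodal design are key to complex document understanding; future work could further explore dynamic modeling of document structure.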

Abstract: Despite significant progress in multimodal large language models (MLLMs), their performance on complex, multi-page document comprehension remains inadequate, largely due to the lack of high-quality, document-level datasets. While current retrieval-augmented generation (RAG) methods offer partial solutions, they suffer from issues, such as fragmented retrieval contexts, multi-stage error accumulation, and extra time costs of retrieval. In this work, we present a high-quality document-level dataset, Doc-750K, designed to support in-depth understanding of multimodal documents. This dataset includes diverse document structures, extensive cross-page dependencies, and real question-answer pairs derived from the original documents. Building on the dataset, we develop a native multimodal model, Docopilot, which can accurately handle document-level dependencies without relying on RAG. Experiments demonstrate that Docopilot achieves superior coherence, accuracy, and efficiency in document understanding tasks and multi-turn interactions, setting a new baseline for document-level multimodal understanding. Data, code, and models are released at https://github.com/OpenGVLab/Docopilot

[77] WSI-Agents: A Collaborative Multi-Agent System for Multi-Modal Whole Slide Image Analysis

Xinheng Lyu,Yuci Liang,Wenting Chen,Meidan Ding,Jiaqi Yang,Guolin Huang,Daokun Zhang,Xiangjian He,Linlin Shen

Main category: cs.CV

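TL;DR: WSI-Agents is a novel collaborative multi-agent system for multi-modal whole slide image analysis that improves task-specific accuracy and multi-task versatility through task allocation, verification, and summary modules.

Details Motivation: Current multimodal large language models underperform task-specific models on pathology tasks, and the potential of collaborative multi-agent systems in pathology remains underexplored.

Contribution: Proposes WSI-Agents, which integrates specialized agents with task allocation and verification mechanisms to improve multi-modal whole slide image analysis.

Method: 1. A task allocation module assigns tasks to expert agents using a model zoo of MLLMs; 2. A verification mechanism ensures accuracy through internal consistency checks and external validation against pathology knowledge bases; 3. A summary module produces the final report with visual interpretation maps.

Result: Across diverse tasks, WSI-Agents outperforms current WSI MLLMs and medical agent frameworks.

Insight: Collaborative multi-agent systems can combine the accuracy of task-specific models with the flexibility of general-purpose models, which makes them well suited to complex medical image analysis.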

Abstract: Whole slide images (WSIs) are vital in digital pathology, enabling gigapixel tissue analysis across various pathological tasks. While recent advancements in multi-modal large language models (MLLMs) allow multi-task WSI analysis through natural language, they often underperform compared to task-specific models. Collaborative multi-agent systems have emerged as a promising solution to balance versatility and accuracy in healthcare, yet their potential remains underexplored in pathology-specific domains. To address these issues, we propose WSI-Agents, a novel collaborative multi-agent system for multi-modal WSI analysis. WSI-Agents integrates specialized functional agents with robust task allocation and verification mechanisms to enhance both task-specific accuracy and multi-task versatility through three components: (1) a task allocation module assigning tasks to expert agents using a model zoo of patch and WSI level MLLMs, (2) a verification mechanism ensuring accuracy through internal consistency checks and external validation using pathology knowledge bases and domain-specific models, and (3) a summary module synthesizing the final summary with visual interpretation maps. Extensive experiments on multi-modal WSI benchmarks show WSI-Agents’s superiority to current WSI MLLMs and medical agent frameworks across diverse tasks.

[78] From Semantics, Scene to Instance-awareness: Distilling Foundation Model for Open-vocabulary Situation Recognition

Chen Cai,Tianyi Liu,Jianjun Gao,Wenyang Liu,Kejun Wu,Ruoyu Wang,Yi Wang,Soo Chin Liew

Main category: cs.CV

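TL;DR: The paper proposes MIPD, a new framework that distills knowledge from a multimodal large language model (MLLM) to strengthen the generalization and zero-shot abilities of a small GSR model, addressing the task of Open-vocabulary Grounded Situation Recognition (Ov-GSR).

Details Motivation: Existing MLLMs show strong zero-shot abilities but are resource-intensive, while conventional GSR models generalize poorly and fail to recognize unseen and rare situations. The authors therefore use knowledge distillation to improve GSR models.

Contribution: Proposes the Multimodal Interactive Prompt Distillation (MIPD) framework, which distills enriched multimodal knowledge from an MLLM to improve the generalization and zero-shot abilities of Ov-GSR models.

Method: 1. Uses the LLM-based Judgmental Rationales Generator (JRG) to generate positive and negative rationales rich in contextual semantics; 2. Introduces scene-aware and instance-perception prompts and aligns them with visual information through the NMPA module; 3. Distills the knowledge into the student Ov-GSR model.

Result: Achieves superior performance on the Ov-SWiG dataset, including on unseen and rare situations, and further shows improved generalization on the HICO-DET dataset.

Insight: Multimodal knowledge distillation can improve the generalization of small models while reducing compute requirements, offering a new approach to complex situation recognition on edge devices.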

Abstract: Recent Multimodal Large Language Models (MLLMs) exhibit strong zero-shot abilities but struggle with complex Grounded Situation Recognition (GSR) and are resource-intensive for edge device deployment. Meanwhile, conventional GSR models often lack generalization ability, falling short in recognizing unseen and rare situations. In this paper, we exploit transferring knowledge from a teacher MLLM to a small GSR model to enhance its generalization and zero-shot abilities, thereby introducing the task of Open-vocabulary Grounded Situation Recognition (Ov-GSR). To achieve this, we propose Multimodal Interactive Prompt Distillation (MIPD), a novel framework that distills enriched multimodal knowledge from the foundation model, enabling the student Ov-GSR model to recognize unseen situations and be better aware of rare situations. Specifically, the MIPD framework first leverages the LLM-based Judgmental Rationales Generator (JRG) to construct positive and negative glimpse and gaze rationales enriched with contextual semantic information. The proposed scene-aware and instance-perception prompts are then introduced to align rationales with visual information from the MLLM teacher via the Negative-Guided Multimodal Prompting Alignment (NMPA) module, effectively capturing holistic and perceptual multimodal knowledge. Finally, the aligned multimodal knowledge is distilled into the student Ov-GSR model, providing a stronger foundation for generalization that enhances situation understanding, bridges the gap between seen and unseen scenarios, and mitigates prediction bias in rare cases. We evaluate MIPD on the refined Ov-SWiG dataset, achieving superior performance on seen, rare, and unseen situations, and further demonstrate improved unseen detection on the HICO-DET dataset.

[79] MultiRetNet: A Multimodal Vision Model and Deferral System for Staging Diabetic Retinopathy

Jeannie She,Katie Spivakovsky

Main category: cs.CV

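TL;DR: MultiRetNet is a multimodal model that combines retinal imaging, socioeconomic factors, and comorbidity data for diabetic retinopathy (DR) staging, integrated with a clinical deferral system for human-in-the-loop use.

Details Motivation: Diabetic retinopathy is a leading cause of preventable blindness worldwide. Lower-income populations, who have limited access to screening, are more likely to progress to advanced stages before diagnosis, and comorbid conditions further accelerate disease progression.

Contribution: Proposes MultiRetNet, a multimodal pipeline that combines several data types to improve DR staging accuracy; develops a clinical deferral system that uses contrastive learning to identify out-of-distribution samples needing clinician review.

Method: Experiments with three multimodal fusion methods and selects fusion through a fully connected layer; synthesizes adversarial, low-quality images and trains the deferral system with contrastive learning.

Result: The system maintains diagnostic accuracy on low-quality images and integrates critical health data, which can improve early detection, particularly in underserved populations.

Insight: Combining multimodal data with a clinical deferral system can improve DR staging accuracy while promoting healthcare equity and reducing healthcare costs.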

Abstract: Diabetic retinopathy (DR) is a leading cause of preventable blindness, affecting over 100 million people worldwide. In the United States, individuals from lower-income communities face a higher risk of progressing to advanced stages before diagnosis, largely due to limited access to screening. Comorbid conditions further accelerate disease progression. We propose MultiRetNet, a novel pipeline combining retinal imaging, socioeconomic factors, and comorbidity profiles to improve DR staging accuracy, integrated with a clinical deferral system for a clinical human-in-the-loop implementation. We experiment with three multimodal fusion methods and identify fusion through a fully connected layer as the most versatile methodology. We synthesize adversarial, low-quality images and use contrastive learning to train the deferral system, guiding the model to identify out-of-distribution samples that warrant clinician review. By maintaining diagnostic accuracy on suboptimal images and integrating critical health data, our system can improve early detection, particularly in underserved populations where advanced DR is often first identified. This approach may reduce healthcare costs, increase early detection rates, and address disparities in access to care, promoting healthcare equity.
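Fusion through a fully connected layer can be illustrated with a minimal sketch: image features and tabular features are concatenated into one vector, and a single linear layer maps that vector to the outputs. The feature values, weights, and the two-output layout below are hypothetical and only show the mechanics; they are not the MultiRetNet architecture, and the deferral system is not modeled.

```python
def fully_connected(x, weights, bias):
    """Linear layer: one output per weight row, y_i = w_i . x + b_i."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def fuse(image_feats, tabular_feats, weights, bias):
    """Concatenate both modalities, then apply one fully connected layer."""
    return fully_connected(image_feats + tabular_feats, weights, bias)


# Hypothetical inputs: two image features and one tabular feature
# (for example, an encoded comorbidity indicator).
image_feats = [0.2, 0.8]
tabular_feats = [1.0]
weights = [[1.0, 0.0, 0.5],   # output 0 mixes image feature 0 with the tabular feature
           [0.0, 1.0, -0.5]]  # output 1 mixes image feature 1 with the tabular feature
bias = [0.0, 0.1]

fused = fuse(image_feats, tabular_feats, weights, bias)
print(fused)
```

Because every output unit sees all concatenated inputs, the layer can learn interactions between retinal appearance and patient context, which is one reason this kind of fusion tends to be flexible across input combinations.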

[80] InterAct-Video: Reasoning-Rich Video QA for Urban Traffic

Joseph Raj Vishal,Rutuja Patil,Manas Srinivas Gowda,Katha Naik,Yezhou Yang,Bharatesh Chakravarthi

Main category: cs.CV

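TL;DR: InterAct VideoQA is a video question answering dataset designed for traffic monitoring tasks, comprising 8 hours of real-world traffic footage and over 25,000 question-answer pairs, and aimed at improving models' spatiotemporal reasoning in complex traffic scenes.

Details Motivation: Existing VideoQA models perform poorly in complex traffic scenes and cannot handle multiple concurrent events and spatiotemporal dependencies effectively.

Contribution: Introduces the InterAct VideoQA dataset, which covers rich traffic attributes and complex scenarios, and uses it to benchmark and improve existing VideoQA models.

Method: Collects real-world traffic footage, segments it into 10-second clips annotated with QA pairs, and evaluates and fine-tunes state-of-the-art VideoQA models.

Result: Experiments show that existing models fall short in complex scenarios but improve notably after fine-tuning, supporting the importance of domain-specific datasets.

Insight: Domain-specific datasets are essential for improving the reasoning ability of VideoQA models in complex scenes.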

Abstract: Traffic monitoring is crucial for urban mobility, road safety, and intelligent transportation systems (ITS). Deep learning has advanced video-based traffic monitoring through video question answering (VideoQA) models, enabling structured insight extraction from traffic videos. However, existing VideoQA models struggle with the complexity of real-world traffic scenes, where multiple concurrent events unfold across spatiotemporal dimensions. To address these challenges, this paper introduces InterAct VideoQA, a curated dataset designed to benchmark and enhance VideoQA models for traffic monitoring tasks. The InterAct VideoQA dataset comprises 8 hours of real-world traffic footage collected from diverse intersections, segmented into 10-second video clips, with over 25,000 question-answer (QA) pairs covering spatiotemporal dynamics, vehicle interactions, incident detection, and other critical traffic attributes. State-of-the-art VideoQA models are evaluated on InterAct VideoQA, exposing challenges in reasoning over fine-grained spatiotemporal dependencies within complex traffic scenarios. Additionally, fine-tuning these models on InterAct VideoQA yields notable performance improvements, demonstrating the necessity of domain-specific datasets for VideoQA. InterAct VideoQA is publicly available as a benchmark dataset to facilitate future research in real-world deployable VideoQA models for intelligent transportation systems. GitHub Repo: https://github.com/joe-rabbit/InterAct_VideoQA

[81] LeAdQA: LLM-Driven Context-Aware Temporal Grounding for Video Question Answering

Xinxin Dong,Baoyun Peng,Haokai Ma,Yufei Wang,Zixuan Dong,Fei Hu,Xiaodong Wang

Main category: cs.CV

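TL;DR: LeAdQA is an LLM-driven method that combines causal-aware query refinement with fine-grained visual grounding to address sparse key moments and complex reasoning in video question answering, achieving SOTA performance.

Details Motivation: Current VideoQA methods are limited by task-agnostic sampling and heuristic retrieval, which either bury key events in irrelevant content or miss the causal-temporal structure.

Contribution: Proposes LeAdQA, which uses LLM-driven causal-aware query refinement and an adaptive fusion mechanism to achieve precise visual grounding and complex reasoning.

Method: 1. Use an LLM to reformulate question-option pairs; 2. retrieve key segments with a temporal grounding model; 3. adaptively fuse visual-textual cues; 4. generate the answer with an MLLM.

Result: Achieves SOTA on NExT-QA, IntentQA, and NExT-GQA while remaining computationally efficient.

Insight: Combining causal awareness with temporal grounding is key to complex video question answering, and LLM-based refinement makes the queries markedly more precise.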

Abstract: Video Question Answering (VideoQA) requires identifying sparse critical moments in long videos and reasoning about their causal relationships to answer semantically complex questions. While recent advances in multimodal learning have improved alignment and fusion, current approaches remain limited by two prevalent but fundamentally flawed strategies: (1) task-agnostic sampling indiscriminately processes all frames, overwhelming key events with irrelevant content; and (2) heuristic retrieval captures superficial patterns but misses causal-temporal structures needed for complex reasoning. To address these challenges, we introduce LeAdQA, an innovative approach that bridges these gaps through synergizing causal-aware query refinement with fine-grained visual grounding. Our method first leverages LLMs to reformulate question-option pairs, resolving causal ambiguities and sharpening temporal focus. These refined queries subsequently direct a temporal grounding model to precisely retrieve the most salient segments, complemented by an adaptive fusion mechanism dynamically integrating the evidence to maximize relevance. The integrated visual-textual cues are then processed by an MLLM to generate accurate, contextually-grounded answers. Experiments on NExT-QA, IntentQA, and NExT-GQA demonstrate that our method’s precise visual grounding substantially enhances the understanding of video-question relationships, achieving state-of-the-art (SOTA) performance on complex reasoning tasks while maintaining computational efficiency.

[82] FOCUS: Fused Observation of Channels for Unveiling Spectra

Xi Xiao,Aristeidis Tsaris,Anika Tabassum,John Lagergren,Larry M. York,Tianyang Wang,Xiao Wang

Main category: cs.CV

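TL;DR: FOCUS is a framework for interpreting ViTs in hyperspectral imaging (HSI). Using class-specific spectral prompts and a [SINK] token, it generates 3D saliency maps efficiently without gradient backpropagation.

Details Motivation: The high dimensionality of hyperspectral data makes ViT interpretability difficult; existing methods struggle to capture meaningful spectral cues and are computationally expensive.

Contribution: 1. Proposes FOCUS, the first framework for reliable and efficient spatial-spectral interpretability of ViTs in HSI. 2. Designs class-specific spectral prompts and a [SINK] token that markedly improve the quality and efficiency of saliency maps.

Method: 1. Class-specific spectral prompts steer attention toward semantically meaningful bands. 2. A [SINK] token trained with an attraction loss absorbs noisy attention. 3. A single forward pass yields 3D saliency maps and band importance curves.

Result: FOCUS improves band-level IoU by 15%, reduces attention collapse by over 40%, and produces saliency results consistent with expert annotations, with less than 1% parameter overhead.

Insight: The lightweight design of FOCUS addresses ViT interpretability in HSI and bridges black-box modeling and trustworthy decision-making.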

Abstract: Hyperspectral imaging (HSI) captures hundreds of narrow, contiguous wavelength bands, making it a powerful tool in biology, agriculture, and environmental monitoring. However, interpreting Vision Transformers (ViTs) in this setting remains largely unexplored due to two key challenges: (1) existing saliency methods struggle to capture meaningful spectral cues, often collapsing attention onto the class token, and (2) full-spectrum ViTs are computationally prohibitive for interpretability, given the high-dimensional nature of HSI data. We present FOCUS, the first framework that enables reliable and efficient spatial-spectral interpretability for frozen ViTs. FOCUS introduces two core components: class-specific spectral prompts that guide attention toward semantically meaningful wavelength groups, and a learnable [SINK] token trained with an attraction loss to absorb noisy or redundant attention. Together, these designs make it possible to generate stable and interpretable 3D saliency maps and spectral importance curves in a single forward pass, without any gradient backpropagation or backbone modification. FOCUS improves band-level IoU by 15 percent, reduces attention collapse by over 40 percent, and produces saliency results that align closely with expert annotations. With less than 1 percent parameter overhead, our method makes high-resolution ViT interpretability practical for real-world hyperspectral applications, bridging a long-standing gap between black-box modeling and trustworthy HSI decision-making.

[83] A Novel Downsampling Strategy Based on Information Complementarity for Medical Image Segmentation

Wenbo Yue,Chang Li,Guoping Xu

Main category: cs.CV

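TL;DR: The paper proposes Hybrid Pooling Downsampling (HPD), a downsampling strategy based on information complementarity that replaces traditional methods with MinMaxPooling and improves performance on medical image segmentation.

Details Motivation: Traditional downsampling methods (such as max pooling and strided convolution) can lose key spatial information and hurt pixel-level prediction accuracy. The authors therefore aim to develop a downsampling method that better preserves image detail.

Contribution: Proposes Hybrid Pooling Downsampling (HPD), which combines complementary maximum and minimum information to preserve light-dark contrast and detail features, thereby improving segmentation accuracy.

Method: MinMaxPooling serves as the core downsampling operation and draws on the maximum-value information of local regions to improve feature extraction. The module is integrated into several CNN architectures and validated on the ACDC and Synapse datasets.

Result: HPD outperforms traditional methods on segmentation, raising the Dice similarity coefficient (DSC) by 0.5% on average.

Insight: A downsampling strategy based on information complementarity reduces the loss of spatial information and offers an efficient solution for medical image segmentation.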

Abstract: In convolutional neural networks (CNNs), downsampling operations are crucial to model performance. Although traditional downsampling methods (such as max pooling and strided convolution) perform well in feature aggregation, receptive field expansion, and computational reduction, they may lose key spatial information in semantic segmentation tasks, thereby reducing pixel-wise prediction accuracy. To this end, this study proposes a downsampling method based on information complementarity: Hybrid Pooling Downsampling (HPD). Its core is to replace the traditional method with MinMaxPooling, which retains the light-dark contrast and detail features of the image by extracting the maximum value information of the local area. Experiments on various CNN architectures on the ACDC and Synapse datasets show that HPD outperforms traditional methods in segmentation performance, increasing the Dice similarity coefficient (DSC) by 0.5% on average. The results show that the HPD module provides an efficient solution for semantic segmentation tasks.
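To make the idea of min-max pooling concrete, here is a minimal pure-Python sketch. It is one plausible reading of "MinMaxPooling", not the authors' implementation: the abstract does not say how the two responses are fused inside HPD, so the hypothetical function `min_max_pool2d` simply returns the max map and the min map side by side.

```python
def min_max_pool2d(image, k=2):
    """Pool non-overlapping k x k windows, keeping both the max and the min.

    Returns (max_map, min_map). How HPD merges the two maps is not
    specified in the abstract; returning both is an assumption.
    """
    height, width = len(image), len(image[0])
    max_map, min_map = [], []
    for r in range(0, height - k + 1, k):
        max_row, min_row = [], []
        for c in range(0, width - k + 1, k):
            window = [image[r + i][c + j] for i in range(k) for j in range(k)]
            max_row.append(max(window))   # bright detail, as in max pooling
            min_row.append(min(window))   # dark detail that max pooling discards
        max_map.append(max_row)
        min_map.append(min_row)
    return max_map, min_map


image = [
    [1, 2, 5, 6],
    [3, 4, 7, 8],
    [9, 8, 1, 0],
    [7, 6, 3, 2],
]
max_map, min_map = min_max_pool2d(image)
print(max_map)  # [[4, 8], [9, 3]]
print(min_map)  # [[1, 5], [6, 0]]
```

Keeping the minimum alongside the maximum preserves the local contrast range (max minus min) that plain max pooling throws away, which is the kind of light-dark information the abstract refers to.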

[84] Distilling Parallel Gradients for Fast ODE Solvers of Diffusion Models

Beier Zhu,Ruoyu Wang,Tong Zhao,Hanwang Zhang,Chi Zhang

Main category: cs.CV

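TL;DR: The paper proposes Ensemble Parallel Direction (EPD), a method that uses parallel gradient evaluations to reduce the sampling latency of diffusion models while maintaining high-quality generation.

Details Motivation: Diffusion models are known for high-quality generation, but their sequential denoising process leads to high sampling latency. Existing acceleration methods often sacrifice image quality under a low-latency budget, so a method is needed that speeds up sampling without losing quality.

Contribution: 1. Proposes EPD, which uses parallel gradient evaluations to reduce truncation error; 2. the gradient computations are fully parallelizable, keeping latency low; 3. a small set of learnable parameters is optimized by distillation, so training overhead is small; 4. EPD works as a plugin for existing ODE samplers.

Method: EPD mitigates truncation error by introducing multiple parallel gradient evaluations in each ODE step. The gradient computations are independent and can be fully parallelized. The learnable parameters are optimized by distillation to reduce training cost.

Result: Strong results on several image synthesis benchmarks; for example, at 5 NFE EPD reaches an FID of 4.47 on CIFAR-10, clearly ahead of existing learning-based solvers.

Insight: Parallel gradient evaluation is an effective way to accelerate diffusion sampling while maintaining generation quality, and distillation-based optimization keeps the training overhead minimal.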

Abstract: Diffusion models (DMs) have achieved state-of-the-art generative performance but suffer from high sampling latency due to their sequential denoising nature. Existing solver-based acceleration methods often face image quality degradation under a low-latency budget. In this paper, we propose the Ensemble Parallel Direction solver (dubbed EPD), a novel ODE solver that mitigates truncation errors by incorporating multiple parallel gradient evaluations in each ODE step. Importantly, since the additional gradient computations are independent, they can be fully parallelized, preserving low-latency sampling. Our method optimizes a small set of learnable parameters in a distillation fashion, ensuring minimal training overhead. In addition, our method can serve as a plugin to improve existing ODE samplers. Extensive experiments on various image synthesis benchmarks demonstrate the effectiveness of EPD in achieving high-quality and low-latency sampling. For example, at the same latency level of 5 NFE, EPD achieves an FID of 4.47 on CIFAR-10, 7.97 on FFHQ, 8.17 on ImageNet, and 8.26 on LSUN Bedroom, surpassing existing learning-based solvers by a significant margin. Code is available at https://github.com/BeierZhu/EPD.
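The general idea of combining several independent gradient evaluations within one ODE step can be illustrated on a toy scalar ODE. The sketch below is not the EPD solver: the intermediate positions `taus` and the mixing `weights` are hand-picked here, whereas the abstract says EPD learns a small set of parameters by distillation, and `ensemble_step` is a hypothetical name chosen for illustration.

```python
import math


def euler_step(f, x, t, t_next):
    """Plain first-order Euler step, used as the baseline."""
    return x + (t_next - t) * f(x, t)


def ensemble_step(f, x, t, t_next, taus, weights):
    """One step that mixes gradients taken at several intermediate times.

    Every intermediate evaluation depends only on (x, d0), so the calls in
    the loop are mutually independent and could run in parallel.
    """
    h = t_next - t
    d0 = f(x, t)                       # shared gradient at the start point
    grads = []
    for tau in taus:
        x_mid = x + tau * h * d0       # Euler prediction to the intermediate time
        grads.append(f(x_mid, t + tau * h))
    direction = sum(w * g for w, g in zip(weights, grads))
    return x + h * direction


def solve(step, f, x0, times, **kwargs):
    """Integrate from times[0] to times[-1] with the given step function."""
    x = x0
    for t, t_next in zip(times, times[1:]):
        x = step(f, x, t, t_next, **kwargs)
    return x


def decay(x, t):
    """Toy ODE dx/dt = -x with exact solution exp(-t)."""
    return -x


times = [i / 5 for i in range(6)]      # 5 steps from t=0 to t=1
exact = math.exp(-1.0)
x_euler = solve(euler_step, decay, 1.0, times)
x_ens = solve(ensemble_step, decay, 1.0, times, taus=[0.25, 0.75], weights=[0.5, 0.5])
print(abs(x_euler - exact), abs(x_ens - exact))
```

On this toy problem the ensemble step has a smaller error than Euler at the same number of sequential steps, which mirrors the abstract's argument that extra evaluations reduce truncation error without adding latency when they run in parallel.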

[85] An Evaluation of DUSt3R/MASt3R/VGGT 3D Reconstruction on Photogrammetric Aerial Blocks

Xinyi Wu,Steven Landgraf,Markus Ulrich,Rongjun Qin

Main category: cs.CV

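TL;DR: The paper evaluates three Transformer-based 3D reconstruction models, DUSt3R, MASt3R, and VGGT, on photogrammetric aerial blocks. On sparse image sets (fewer than 10 images) they markedly improve completeness, but they cannot fully replace traditional SfM and MVS.

Details Motivation: Although state-of-the-art 3D reconstruction models handle sparse, unordered image sets well, their potential on aerial blocks has not been explored. The paper fills this gap by assessing how these models actually perform in aerial scenarios.

Contribution: Presents the first comprehensive evaluation of DUSt3R, MASt3R, and VGGT on aerial blocks, exposing their strengths and limitations on sparse image sets.

Method: Runs the pre-trained DUSt3R, MASt3R, and VGGT models on aerial blocks from the UseGeo dataset and tests dense 3D reconstruction and camera pose estimation.

Result: On sparse image sets (fewer than 10 images) the models improve completeness by up to +50% over COLMAP, but they remain limited on high-resolution images and large image sets. VGGT performs best in computational efficiency and pose estimation reliability.

Insight: Transformer-based methods show promise in low-resolution and sparse scenarios, but they cannot fully replace traditional SfM and MVS and are better suited as complementary tools. High-resolution imagery and geometrically complex scenes still call for traditional methods.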

Abstract: State-of-the-art 3D computer vision algorithms continue to advance in handling sparse, unordered image sets. Recently developed foundational models for 3D reconstruction, such as Dense and Unconstrained Stereo 3D Reconstruction (DUSt3R), Matching and Stereo 3D Reconstruction (MASt3R), and Visual Geometry Grounded Transformer (VGGT), have attracted attention due to their ability to handle very sparse image overlaps. Evaluating DUSt3R/MASt3R/VGGT on typical aerial images matters, as these models may handle extremely low image overlaps, stereo occlusions, and textureless regions. For redundant collections, they can accelerate 3D reconstruction by using extremely sparsified image sets. Despite tests on various computer vision benchmarks, their potential on photogrammetric aerial blocks remains unexplored. This paper conducts a comprehensive evaluation of the pre-trained DUSt3R/MASt3R/VGGT models on the aerial blocks of the UseGeo dataset for pose estimation and dense 3D reconstruction. Results show these methods can accurately reconstruct dense point clouds from very sparse image sets (fewer than 10 images, up to 518 pixels resolution), with completeness gains up to +50% over COLMAP. VGGT also demonstrates higher computational efficiency, scalability, and more reliable camera pose estimation. However, all exhibit limitations with high-resolution images and large sets, as pose reliability declines with more images and geometric complexity. These findings suggest transformer-based methods cannot fully replace traditional SfM and MVS, but offer promise as complementary approaches, especially in challenging, low-resolution, and sparse scenarios.

[86] Exploring Scalable Unified Modeling for General Low-Level Vision

Xiangyu Chen,Kaiwen Zhu,Yuandong Pu,Shuo Cao,Xiaohui Li,Wenlong Zhang,Yihao Liu,Yu Qiao,Jiantao Zhou,Chao Dong

Main category: cs.CV

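TL;DR: The paper proposes VPIP, a visual-prompt-based framework for unified low-level vision modeling. An end-to-end image processing architecture combined with task-specific visual representations lets one model handle many low-level vision tasks. The resulting model, GenLV, performs well on a large-scale task benchmark and shows good scalability and generalization.

Details Motivation: Low-level vision tasks (image restoration, enhancement, stylization, and so on) differ widely in task formulation and output domain, which makes unified modeling hard for traditional methods. The paper explores a unified framework that adapts flexibly to many such tasks.

Contribution: 1. Proposes the visual-prompt-based VPIP framework for unified multi-task modeling; 2. develops the unified model GenLV and validates it on more than 100 tasks; 3. verifies the model's adaptability in zero-shot generalization, few-shot transfer, and task-specific fine-tuning.

Method: The VPIP framework has three core components: an end-to-end image processing backbone, a prompt encoder, and a prompt interaction module. Visual prompts (input-target image pairs) guide the model through different tasks. Scalability is explored by scaling GenLV along model capacity and task diversity.

Result: GenLV performs well across many tasks. Increasing the number of training tasks improves generalization, especially for tasks with limited data. The model adapts well in zero-shot, few-shot, and fine-tuning scenarios.

Insight: Joint training learns transferable representations, which matters for unified modeling of low-level vision tasks. The scalability of VPIP points to its potential as a basis for general low-level vision modeling.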

Abstract: Low-level vision involves a wide spectrum of tasks, including image restoration, enhancement, stylization, and feature extraction, which differ significantly in both task formulation and output domains. To address the challenge of unified modeling across such diverse tasks, we propose a Visual task Prompt-based Image Processing (VPIP) framework that leverages input-target image pairs as visual prompts to guide the model in performing a variety of low-level vision tasks. The framework comprises an end-to-end image processing backbone, a prompt encoder, and a prompt interaction module, enabling flexible integration with various architectures and effective utilization of task-specific visual representations. Based on this design, we develop a unified low-level vision model, GenLV, and evaluate its performance across multiple representative tasks. To explore the scalability of this approach, we extend the framework along two dimensions: model capacity and task diversity. We construct a large-scale benchmark consisting of over 100 low-level vision tasks and train multiple versions of the model with varying scales. Experimental results show that the proposed method achieves considerable performance across a wide range of tasks. Notably, increasing the number of training tasks enhances generalization, particularly for tasks with limited data, indicating the model’s ability to learn transferable representations through joint training. Further evaluations in zero-shot generalization, few-shot transfer, and task-specific fine-tuning scenarios demonstrate the model’s strong adaptability, confirming the effectiveness, scalability, and potential of the proposed framework as a unified foundation for general low-level vision modeling.

[87] Seeing Through Deepfakes: A Human-Inspired Framework for Multi-Face Detection

Juan Hu,Shaojing Fan,Terence Sim

Main category: cs.CV

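TL;DR: The paper proposes HICOM, a human-cognition-inspired framework for detecting multi-face deepfake videos. By studying the cues people rely on to spot deepfakes in social settings, it improves detection accuracy and generalization in multi-face scenarios.

Details Motivation: Existing deepfake detectors work well on single faces but struggle in multi-face social scenes because they ignore contextual cues. Inspired by human cognition, the paper targets detection in multi-face deepfake videos.

Contribution: 1. Quantifies four key detection cues through human studies; 2. proposes the HICOM framework, which improves detection in multi-face scenarios; 3. incorporates an LLM to make detection results more interpretable.

Method: HICOM is built on four human cognitive cues (scene-motion coherence, inter-face appearance compatibility, interpersonal gaze alignment, and face-body consistency) and uses an LLM to provide human-readable explanations.

Result: On benchmark datasets HICOM improves average accuracy by 3.3% in in-dataset detection and by 2.8% under real-world perturbations, and it outperforms existing methods by 5.8% on unseen datasets.

Insight: Human cognitive cues improve both the generalization and the interpretability of deepfake detection, and future defenses could draw more on human factors.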

Abstract: Multi-face deepfake videos are becoming increasingly prevalent, often appearing in natural social settings that challenge existing detection methods. Most current approaches excel at single-face detection but struggle in multi-face scenarios, due to a lack of awareness of crucial contextual cues. In this work, we develop a novel approach that leverages human cognition to analyze and defend against multi-face deepfake videos. Through a series of human studies, we systematically examine how people detect deepfake faces in social settings. Our quantitative analysis reveals four key cues humans rely on: scene-motion coherence, inter-face appearance compatibility, interpersonal gaze alignment, and face-body consistency. Guided by these insights, we introduce HICOM, a novel framework designed to detect every fake face in multi-face scenarios. Extensive experiments on benchmark datasets show that HICOM improves average accuracy by 3.3% in in-dataset detection and 2.8% under real-world perturbations. Moreover, it outperforms existing methods by 5.8% on unseen datasets, demonstrating the generalization of human-inspired cues. HICOM further enhances interpretability by incorporating an LLM to provide human-readable explanations, making detection results more transparent and convincing. Our work sheds light on involving human factors to enhance defense against deepfakes.

[88] Light Future: Multimodal Action Frame Prediction via InstructPix2Pix

Zesen Zhong,Duomin Zhang,Yijia Li

Main category: cs.CV

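TL;DR: The paper proposes a lightweight multimodal action frame prediction method built on InstructPix2Pix. Given a single image and a text instruction, it predicts the visual frame 10 seconds into the future, with much lower computational cost and latency.

Details Motivation: Predicting future motion trajectories is critical in robotics, autonomous driving, and related fields, but conventional video prediction models are computationally expensive and need multiple input frames. The paper aims for an efficient, lightweight alternative.

Contribution: Applies InstructPix2Pix to future visual frame prediction in robotic tasks for the first time, enabling multimodal prediction from a single image plus text while reducing compute requirements and latency.

Method: Fine-tunes InstructPix2Pix to accept visual and textual inputs for multimodal future frame prediction. The only inputs are a single image and a text instruction.

Result: Experiments on the RoboTWin dataset show better SSIM and PSNR than existing baselines, along with faster inference and lower GPU demands.

Insight: The lightweight design is particularly suited to applications that prioritize motion trajectory precision over visual fidelity, such as robot control and motion analysis.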

Abstract: Predicting future motion trajectories is a critical capability across domains such as robotics, autonomous systems, and human activity forecasting, enabling safer and more intelligent decision-making. This paper proposes a novel, efficient, and lightweight approach for robot action prediction, offering significantly reduced computational cost and inference latency compared to conventional video prediction models. Importantly, it pioneers the adaptation of the InstructPix2Pix model for forecasting future visual frames in robotic tasks, extending its utility beyond static image editing. We implement a deep learning-based visual prediction framework that forecasts what a robot will observe 100 frames (10 seconds) into the future, given a current image and a textual instruction. We repurpose and fine-tune the InstructPix2Pix model to accept both visual and textual inputs, enabling multimodal future frame prediction. Experiments on the RoboTWin dataset (generated based on real-world scenarios) demonstrate that our method achieves superior SSIM and PSNR compared to state-of-the-art baselines in robot action prediction tasks. Unlike conventional video prediction models that require multiple input frames, heavy computation, and slow inference latency, our approach only needs a single image and a text prompt as input. This lightweight design enables faster inference, reduced GPU demands, and flexible multimodal control, particularly valuable for applications like robotics and sports motion trajectory analytics, where motion trajectory precision is prioritized over visual fidelity.
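PSNR, one of the two metrics the abstract reports, follows directly from the mean squared error between a predicted frame and the ground-truth frame. The sketch below is a generic pure-Python version of the standard formula for 8-bit grayscale frames stored as nested lists; it is not taken from the paper's evaluation code, and the tiny frames are made-up illustration data.

```python
import math


def psnr(pred, target, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized frames."""
    flat_pred = [p for row in pred for p in row]
    flat_target = [t for row in target for t in row]
    if len(flat_pred) != len(flat_target):
        raise ValueError("frames must have the same number of pixels")
    mse = sum((p - t) ** 2 for p, t in zip(flat_pred, flat_target)) / len(flat_pred)
    if mse == 0:
        return math.inf                # identical frames
    return 10.0 * math.log10(max_value ** 2 / mse)


predicted = [[100, 100], [100, 100]]   # hypothetical predicted future frame
observed = [[110, 90], [100, 100]]     # hypothetical ground-truth frame
score = psnr(predicted, observed)      # MSE is 50, so roughly 31.1 dB
print(round(score, 2))
```

Higher PSNR means the predicted frame is closer to the real one pixel by pixel. SSIM, the other reported metric, compares local structure instead and is usually taken from an image-processing library rather than written by hand.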

[89] SegQuant: A Semantics-Aware and Generalizable Quantization Framework for Diffusion Models

Jiaji Zhang,Ruichao Sun,Hailiang Zhao,Jiaju Wu,Peng Chen,Hao Li,Xinkui Zhao,Kingsum Chow,Gang Xiong,Lin Ye,Shuiguang Deng

Main category: cs.CV

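TL;DR: SegQuant is a quantization framework for deploying diffusion models in resource-constrained or latency-sensitive environments. By combining a segment-aware quantization strategy with a dual-scale quantization scheme, it improves cross-model versatility while preserving the visual fidelity of generated outputs.

Details Motivation: Diffusion models excel at generation but are computationally expensive, which limits their use in resource-constrained environments. Existing quantization methods usually rely on architecture-specific heuristics and generalize poorly.

Contribution: Proposes SegQuant, a unified quantization framework that combines a segment-aware strategy (SegLinear) with a dual-scale scheme (DualScale) to achieve cross-model versatility and strong performance.

Method: SegQuant uses a segment-aware, graph-based quantization strategy (SegLinear) to capture structural semantics and a dual-scale quantization scheme (DualScale) to preserve polarity-asymmetric activations.

Result: The framework performs strongly across a range of diffusion models while remaining seamlessly compatible with mainstream deployment tools.

Insight: Semantics-aware quantization combined with dual-scale handling lets SegQuant preserve visual fidelity while improving the generality and deployability of quantized models.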

Abstract: Diffusion models have demonstrated exceptional generative capabilities but are computationally intensive, posing significant challenges for deployment in resource-constrained or latency-sensitive environments. Quantization offers an effective means to reduce model size and computational cost, with post-training quantization (PTQ) being particularly appealing due to its compatibility with pre-trained models without requiring retraining or training data. However, existing PTQ methods for diffusion models often rely on architecture-specific heuristics that limit their generalizability and hinder integration with industrial deployment pipelines. To address these limitations, we propose SegQuant, a unified quantization framework that adaptively combines complementary techniques to enhance cross-model versatility. SegQuant consists of a segment-aware, graph-based quantization strategy (SegLinear) that captures structural semantics and spatial heterogeneity, along with a dual-scale quantization scheme (DualScale) that preserves polarity-asymmetric activations, which is crucial for maintaining visual fidelity in generated outputs. SegQuant is broadly applicable beyond Transformer-based diffusion models, achieving strong performance while ensuring seamless compatibility with mainstream deployment tools.

[90] FinChart-Bench: Benchmarking Financial Chart Comprehension in Vision-Language Models

Dong Shu,Haoyang Yuan,Yuchen Wang,Yanguang Liu,Huopu Zhang,Haiyan Zhao,Mengnan Du

Main category: cs.CV

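TL;DR: FinChart-Bench is the first benchmark focused on financial chart comprehension. It contains 1,200 financial charts and 7,016 questions, evaluates 25 state-of-the-art vision-language models, and exposes the limitations of current models.

Details Motivation: Financial charts have complex temporal structures and domain-specific terminology, yet the abilities of large vision-language models (LVLMs) in this area remain underexplored, so a dedicated benchmark is needed.

Contribution: Introduces FinChart-Bench, the first benchmark dedicated to financial charts, and identifies key issues through an evaluation of 25 LVLMs.

Method: Collects 1,200 financial chart images, annotates 7,016 questions (True/False, Multiple Choice, and Question Answering), and runs a comprehensive evaluation of 25 LVLMs.

Result: The gap between open-source and closed-source models is narrowing; upgraded models within a family sometimes perform worse; many models struggle with instruction following; spatial reasoning is weak; and current models are not reliable enough to serve as automated evaluators.

Insight: Current LVLMs still fall well short on financial chart comprehension and need further improvement, especially in instruction following and spatial reasoning.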

Abstract: Large vision-language models (LVLMs) have made significant progress in chart understanding. However, financial charts, characterized by complex temporal structures and domain-specific terminology, remain notably underexplored. We introduce FinChart-Bench, the first benchmark specifically focused on real-world financial charts. FinChart-Bench comprises 1,200 financial chart images collected from 2015 to 2024, each annotated with True/False (TF), Multiple Choice (MC), and Question Answering (QA) questions, totaling 7,016 questions. We conduct a comprehensive evaluation of 25 state-of-the-art LVLMs on FinChart-Bench. Our evaluation reveals critical insights: (1) the performance gap between open-source and closed-source models is narrowing, (2) performance degradation occurs in upgraded models within families, (3) many models struggle with instruction following, (4) both advanced models show significant limitations in spatial reasoning abilities, and (5) current LVLMs are not reliable enough to serve as automated evaluators. These findings highlight important limitations in current LVLM capabilities for financial chart understanding. The FinChart-Bench dataset is available at https://huggingface.co/datasets/Tizzzzy/FinChart-Bench.

[91] Training Self-Supervised Depth Completion Using Sparse Measurements and a Single Image

Rizhao Fan,Zhigen Li,Heping Li,Ning An

Main category: cs.CV

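TL;DR: Proposes a self-supervised depth completion method that trains with only sparse depth measurements and a single image, improving performance through new loss functions and segmentation maps.

Details Motivation: Existing depth completion methods depend on dense depth labels or multi-frame imagery, which is costly and unsuitable for static or single-frame scenarios. The paper addresses this limitation.

Contribution: 1. Proposes a self-supervised depth completion paradigm that needs no dense labels or additional images; 2. designs new loss functions based on the characteristics of depth distribution; 3. uses segmentation maps from vision foundation models to enhance depth estimation.

Method: 1. Train with sparse depth measurements and a single image; 2. design loss functions that propagate depth information from observed points to unobserved regions; 3. refine depth estimation with segmentation maps.

Result: Extensive experiments demonstrate the effectiveness of the method.

Insight: Segmentation maps from foundation models make it possible to infer depth in unobserved regions more effectively within a self-supervised framework.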

Abstract: Depth completion is an important vision task, and many efforts have been made to enhance the quality of depth maps from sparse depth measurements. Despite significant advances, training these models to recover dense depth from sparse measurements remains a challenging problem. Supervised learning methods rely on dense depth labels to predict unobserved regions, while self-supervised approaches require image sequences to enforce geometric constraints and photometric consistency between frames. However, acquiring dense annotations is costly, and multi-frame dependencies limit the applicability of self-supervised methods in static or single-frame scenarios. To address these challenges, we propose a novel self-supervised depth completion paradigm that requires only sparse depth measurements and their corresponding image for training. Unlike existing methods, our approach eliminates the need for dense depth labels or additional images captured from neighboring viewpoints. By leveraging the characteristics of depth distribution, we design novel loss functions that effectively propagate depth information from observed points to unobserved regions. Additionally, we incorporate segmentation maps generated by vision foundation models to further enhance depth estimation. Extensive experiments demonstrate the effectiveness of our proposed method.

[92] Grounding Degradations in Natural Language for All-In-One Video Restoration

Muhammad Kamran Janjua,Amirhosein Ghasemabadi,Kunlin Zhang,Mohammad Salameh,Chao Gao,Di Niu

Main category: cs.CV

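TL;DR: The paper proposes an all-in-one video restoration framework grounded in natural language. It uses foundation models to tie the degradation semantics of video frames to natural language, providing interpretable and flexible guidance. The method assumes no degradation knowledge at train or test time and is validated on standardized benchmarks.

Details Motivation: Existing video restoration methods usually assume the degradation type is known, but in practice it may be unknown or vary in complex ways. The paper aims for a general video restoration framework that needs no degradation knowledge.

Contribution: 1. Proposes an all-in-one video restoration framework grounded in natural language; 2. designs training and inference procedures that require no degradation knowledge; 3. proposes new standardized benchmarks.

Method: Foundation models ground the degradation-aware semantic context of video frames in natural language, and this context guides restoration. During training the model learns an approximation of the grounded knowledge, so the foundation model is not needed at inference.

Result: State-of-the-art performance on the multi-degradation benchmarks (three-task, 3D, and four-task, 4D) and on the time-varying composite degradation benchmarks.

Insight: Natural language is an effective carrier of degradation information and offers interpretable guidance for video restoration, and standardized benchmarks are important for advancing the field.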

Abstract: In this work, we propose an all-in-one video restoration framework that grounds degradation-aware semantic context of video frames in natural language via foundation models, offering interpretable and flexible guidance. Unlike prior art, our method assumes no degradation knowledge at train or test time and learns an approximation to the grounded knowledge such that the foundation model can be safely disentangled during inference, adding no extra cost. Further, we call for standardization of benchmarks in all-in-one video restoration, and propose two benchmarks in the multi-degradation setting, three-task (3D) and four-task (4D), and two time-varying composite degradation benchmarks, one of the latter being our proposed dataset with varying snow intensity, simulating how weather degradations affect videos naturally. We compare our method with prior works and report state-of-the-art performance on all benchmarks.

[93] Hybrid-supervised Hypergraph-enhanced Transformer for Micro-gesture Based Emotion Recognition

Zhaoqiang Xia,Hexiang Huang,Haoyu Chen,Xiaoyi Feng,Guoying Zhao

Main category: cs.CV

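TL;DR: The paper proposes a hybrid-supervised, hypergraph-enhanced Transformer framework for micro-gesture-based emotion recognition. It combines hypergraph-enhanced self-attention with multiscale temporal convolution modules and reaches the best performance through joint self-supervised and supervised training.

Details Motivation: Micro-gestures are unconsciously performed body movements that convey a person's emotional state, but they have not been studied sufficiently in affective computing. The paper aims to recognize emotions more accurately by modeling the subtle motion of micro-gestures.

Contribution: 1. Proposes a hybrid-supervised, hypergraph-enhanced Transformer framework; 2. designs a hypergraph-enhanced self-attention module and a multiscale temporal convolution module; 3. trains the model jointly on a self-supervised task (reconstruction) and a supervised task (emotion recognition).

Method: 1. Model the relationships between skeleton joints with a hypergraph structure; 2. combine a Transformer encoder and a decoder (with upsampling) for spatiotemporal modeling of subtle motion; 3. train the end-to-end framework jointly with self-supervised and supervised objectives.

Result: Achieves the best performance on the iMiGUE and SMG datasets, surpassing existing methods.

Insight: The hypergraph structure captures the subtle local motion of micro-gestures, while hybrid supervision combines the strengths of self-supervised and supervised tasks to improve emotion recognition accuracy.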

Abstract: Micro-gestures are unconsciously performed body gestures that can convey the emotion states of humans and start to attract more research attention in the fields of human behavior understanding and affective computing as an emerging topic. However, the modeling of human emotion based on micro-gestures has not been explored sufficiently. In this work, we propose to recognize the emotion states based on the micro-gestures by reconstructing the behavior patterns with a hypergraph-enhanced Transformer in a hybrid-supervised framework. In the framework, hypergraph Transformer based encoder and decoder are separately designed by stacking the hypergraph-enhanced self-attention and multiscale temporal convolution modules. Especially, to better capture the subtle motion of micro-gestures, we construct a decoder with additional upsampling operations for a reconstruction task in a self-supervised learning manner. We further propose a hypergraph-enhanced self-attention module where the hyperedges between skeleton joints are gradually updated to present the relationships of body joints for modeling the subtle local motion. Lastly, for exploiting the relationship between the emotion states and local motion of micro-gestures, an emotion recognition head from the output of encoder is designed with a shallow architecture and learned in a supervised way. The end-to-end framework is jointly trained in a one-stage way by comprehensively utilizing self-reconstruction and supervision information. The proposed method is evaluated on two publicly available datasets, namely iMiGUE and SMG, and achieves the best performance under multiple metrics, which is superior to the existing methods.

[94] Region-aware Depth Scale Adaptation with Sparse Measurements

Rizhao Fan,Tianfang Ma,Zhigen Li,Ning An,Jian Cheng

Main category: cs.CV

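TL;DR: The paper proposes a non-learning-based method that uses sparse depth measurements to convert the relative depth output of foundation models into metric depth, avoiding extra training or fine-tuning and preserving the model's generalization ability.

Details Motivation: Foundation models excel at zero-shot monocular depth estimation, but their outputs are usually in relative rather than metric scale, which hinders direct deployment in real applications. Existing scale adaptation methods are costly and can harm generalization.

Contribution: Proposes a non-learning-based scale adaptation method that uses sparse depth measurements to convert a foundation model's relative depth predictions into metric depth without additional training or fine-tuning.

Method: Uses sparse depth measurements as a reference and applies region-aware depth scale adaptation to convert relative depth into metric depth.

Result: Experiments show that the method converts relative depth into metric depth effectively, without adding computational cost or sacrificing generalization.

Insight: Scale adaptation from sparse measurements is efficient and flexible, applies to a wide range of scenes, and retains the strong generalization of the foundation model.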

Abstract: In recent years, the emergence of foundation models for depth prediction has led to remarkable progress, particularly in zero-shot monocular depth estimation. These models generate impressive depth predictions; however, their outputs are often in relative scale rather than metric scale. This limitation poses challenges for direct deployment in real-world applications. To address this, several scale adaptation methods have been proposed to enable foundation models to produce metric depth. However, these methods are typically costly, as they require additional training on new domains and datasets. Moreover, fine-tuning these models often compromises their original generalization capabilities, limiting their adaptability across diverse scenes. In this paper, we introduce a non-learning-based approach that leverages sparse depth measurements to adapt the relative-scale predictions of foundation models into metric-scale depth. Our method requires neither retraining nor fine-tuning, thereby preserving the strong generalization ability of the original foundation models while enabling them to produce metric depth. Experimental results demonstrate the effectiveness of our approach, highlighting its potential to bridge the gap between relative and metric depth without incurring additional computational costs or sacrificing generalization ability.
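A common training-free way to map relative depth to metric depth is a least-squares fit of a single scale and shift at the pixels where sparse measurements exist. The sketch below shows only that global baseline, under the assumption that metric depth is an affine function of the relative prediction. It is not the paper's region-aware procedure, whose details the abstract does not give; a region-aware variant would presumably repeat such a fit per region. `fit_scale_shift` and `to_metric` are hypothetical names.

```python
def fit_scale_shift(relative, metric):
    """Closed-form least squares for metric ~= scale * relative + shift.

    `relative` holds the model's predictions at the sparse sample locations,
    `metric` the measured depths (for example from LiDAR) at the same pixels.
    """
    n = len(relative)
    if n < 2 or n != len(metric):
        raise ValueError("need at least two paired samples")
    mean_r = sum(relative) / n
    mean_m = sum(metric) / n
    var_r = sum((r - mean_r) ** 2 for r in relative)
    if var_r == 0:
        raise ValueError("relative depths are constant; scale is undetermined")
    cov = sum((r - mean_r) * (m - mean_m) for r, m in zip(relative, metric))
    scale = cov / var_r
    shift = mean_m - scale * mean_r
    return scale, shift


def to_metric(relative_map, scale, shift):
    """Apply the fitted affine map to every pixel of a dense relative depth map."""
    return [[scale * value + shift for value in row] for row in relative_map]


# Made-up sparse samples: relative predictions and matching metric depths in metres.
sparse_relative = [0.1, 0.5, 0.9]
sparse_metric = [1.4, 3.0, 4.6]
scale, shift = fit_scale_shift(sparse_relative, sparse_metric)
dense_metric = to_metric([[0.0, 0.25], [0.5, 1.0]], scale, shift)
print(round(scale, 6), round(shift, 6))
```

Because the fit is closed-form, it needs no gradient steps or retraining, which matches the non-learning spirit described in the abstract. Note that some foundation models predict relative inverse depth, in which case the fit would be applied in inverse-depth space.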

[95] BeatFormer: Efficient motion-robust remote heart rate estimation through unsupervised spectral zoomed attention filters

Joaquim Comas,Federico Sukno

Main category: cs.CV

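TL;DR: BeatFormer is an efficient remote heart rate estimation method that combines the strengths of deep learning and handcrafted approaches. It can be trained without supervised labels and performs well under motion.

Details Motivation: Among current heart rate estimation methods, deep learning depends on large amounts of data, while handcrafted methods struggle in complex conditions. BeatFormer aims to combine the strengths of both into an efficient and robust solution.

Contribution: Proposes BeatFormer, a lightweight spectral attention model, together with Spectral Contrastive Learning (SCL), which requires no supervised labels.

Method: Combines zoomed orthonormal complex attention with frequency-domain energy measurement and introduces SCL for unsupervised training.

Result: Robustness is validated on the PURE, UBFC-rPPG, and MMPD datasets, with particularly strong results in cross-dataset evaluations under motion.

Insight: Spectral attention and unsupervised learning can markedly improve the performance and generalization of heart rate estimation.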

Abstract: Remote photoplethysmography (rPPG) captures cardiac signals from facial videos and is gaining attention for its diverse applications. While deep learning has advanced rPPG estimation, it relies on large, diverse datasets for effective generalization. In contrast, handcrafted methods utilize physiological priors for better generalization in unseen scenarios like motion while maintaining computational efficiency. However, their linear assumptions limit performance in complex conditions, where deep learning provides superior pulsatile information extraction. This highlights the need for hybrid approaches that combine the strengths of both methods. To address this, we present BeatFormer, a lightweight spectral attention model for rPPG estimation, which integrates zoomed orthonormal complex attention and frequency-domain energy measurement, enabling a highly efficient model. Additionally, we introduce Spectral Contrastive Learning (SCL), which allows BeatFormer to be trained without any PPG or HR labels. We validate BeatFormer on the PURE, UBFC-rPPG, and MMPD datasets, demonstrating its robustness and performance, particularly in cross-dataset evaluations under motion scenarios.
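The frequency-domain energy idea behind spectral heart rate estimation can be shown with a small pure-Python example: evaluate the signal's spectral energy only inside a plausible heart-rate band, on a fine frequency grid, and pick the peak. This illustrates the general principle only. It is not BeatFormer's zoomed orthonormal complex attention, and the band limits, grid step, and function name `estimate_bpm` are assumptions made for the sketch.

```python
import math


def estimate_bpm(signal, fps, lo_hz=0.7, hi_hz=3.0, step_hz=0.01):
    """Return the heart rate (beats per minute) with the highest spectral energy.

    The DFT is evaluated only between lo_hz and hi_hz (42 to 180 bpm here),
    which "zooms" the spectrum onto the physiologically plausible band.
    """
    n = len(signal)
    mean = sum(signal) / n
    centered = [s - mean for s in signal]          # remove the DC component
    best_freq, best_energy = lo_hz, -1.0
    steps = int(round((hi_hz - lo_hz) / step_hz))
    for i in range(steps + 1):
        freq = lo_hz + i * step_hz
        omega = 2.0 * math.pi * freq / fps
        re = sum(x * math.cos(omega * k) for k, x in enumerate(centered))
        im = sum(x * math.sin(omega * k) for k, x in enumerate(centered))
        energy = re * re + im * im
        if energy > best_energy:
            best_freq, best_energy = freq, energy
    return 60.0 * best_freq


# Synthetic 10-second pulse trace at 30 fps with a 1.2 Hz (72 bpm) rhythm.
fps = 30
trace = [math.sin(2.0 * math.pi * 1.2 * k / fps) for k in range(300)]
bpm = estimate_bpm(trace, fps)
print(round(bpm, 1))
```

Restricting the search to the heart-rate band keeps the computation small and ignores out-of-band motion or illumination components, which is one reason spectral approaches can be both efficient and comparatively robust.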

[96] TriCLIP-3D: A Unified Parameter-Efficient Framework for Tri-Modal 3D Visual Grounding based on CLIP

Fan Li,Zanyi Wang,Zeyi Huang,Guang Dai,Jingdong Wang,Mengmeng Wang

Main category: cs.CV

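TL;DR: TriCLIP-3D is a unified tri-modal (RGB images, text, and point clouds) framework for 3D visual grounding. Built on the pre-trained CLIP model, it uses adapter-based fine-tuning and a geometry-aware feature fusion module to simplify the architecture and improve performance.

Details Motivation: Existing 3D visual grounding methods usually design separate encoders for each modality, leading to complex models that are inefficient to train. The authors want a single 2D pre-trained multimodal network to handle all modalities, simplifying the architecture and improving performance.

Contribution: Proposes a CLIP-based tri-modal 3D visual grounding framework. With adapter-based fine-tuning and the Geometric-Aware 2D-3D Feature Recovery and Fusion (GARF) module, it achieves unified feature extraction and fusion across modalities, reduces the number of trainable parameters, and improves task performance.

Method: Uses pre-trained CLIP as the base and adapts it to the tri-modal setting with adapter-based fine-tuning; designs the GARF module to fuse geometric multi-scale features from point clouds and images; introduces a multi-modal decoder for deep cross-modal understanding.

Result: Compared with the baseline, trainable parameters drop by about 58%, while performance improves by 6.52% on 3D detection and 6.25% on 3D visual grounding.

Insight: A unified pre-trained multimodal model with lightweight adapter fine-tuning can greatly simplify the architecture for complex tasks while improving performance.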

Abstract: 3D visual grounding allows an embodied agent to understand visual information in real-world 3D environments based on human instructions, which is crucial for embodied intelligence. Existing 3D visual grounding methods typically rely on separate encoders for different modalities (e.g., RGB images, text, and 3D point clouds), resulting in large and complex models that are inefficient to train. While some approaches use pre-trained 2D multi-modal models like CLIP for 3D tasks, they still struggle with aligning point cloud data to 2D encoders. As a result, these methods continue to depend on 3D encoders for feature extraction, further increasing model complexity and training inefficiency. In this paper, we propose a unified 2D pre-trained multi-modal network to process all three modalities (RGB images, text, and point clouds), significantly simplifying the architecture. By leveraging a 2D CLIP bi-modal model with adapter-based fine-tuning, this framework effectively adapts to the tri-modal setting, improving both adaptability and performance across modalities. Our Geometric-Aware 2D-3D Feature Recovery and Fusion (GARF) module is designed to fuse geometric multi-scale features from point clouds and images. We then integrate textual features for final modality fusion and introduce a multi-modal decoder to facilitate deep cross-modal understanding. Together, our method achieves unified feature extraction and fusion across the three modalities, enabling an end-to-end 3D visual grounding model. Compared to the baseline, our method reduces the number of trainable parameters by approximately 58%, while achieving a 6.52% improvement in the 3D detection task and a 6.25% improvement in the 3D visual grounding task.

[97] Semantic-Aware Representation Learning for Multi-label Image Classification

Ren-Dong Xie,Zhi-Fen He,Bo Li,Bin Liu,Jin-Yan Hu

Main category: cs.CV

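TL;DR: The paper proposes Semantic-Aware Representation Learning (SARL) for multi-label image classification, improving classification performance through label semantic-related feature learning and an optimal transport-based attention mechanism.

Details Motivation: The image representations learned by existing multi-label classification methods may contain noise and may not locate objects precisely, so a more accurate semantic-aware representation is needed.

Contribution: 1. A label semantic-related feature learning module; 2. An optimal transport-based attention mechanism; 3. A regional score aggregation strategy.

Method: 1. Extract semantic-related features; 2. Use the optimal transport-based attention mechanism to obtain semantically aligned image representations; 3. Aggregate regional scores for multi-label prediction.

Result: Experiments on PASCAL VOC 2007 and MS-COCO demonstrate the superiority of SARL over existing methods.

Insight: Semantic alignment and regional aggregation can improve the accuracy of multi-label classification.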

Abstract: Multi-label image classification, an important research area in computer vision, focuses on identifying multiple labels or concepts within an image. Existing approaches often employ attention mechanisms or graph convolutional networks (GCNs) to learn image representation. However, this representation may contain noise and may not locate objects precisely. Therefore, this paper proposes a Semantic-Aware Representation Learning (SARL) for multi-label image classification. First, a label semantic-related feature learning module is utilized to extract semantic-related features. Then, an optimal transport-based attention mechanism is designed to obtain semantically aligned image representation. Finally, a regional score aggregation strategy is used for multi-label prediction. Experimental results on two benchmark datasets, PASCAL VOC 2007 and MS-COCO, demonstrate the superiority of SARL over existing methods.
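
The optimal transport-based attention step can be illustrated with entropy-regularized optimal transport solved by Sinkhorn iterations. This is a minimal sketch and not the SARL implementation: the cost matrix, the uniform marginals, the regularization strength `eps`, and the function name `sinkhorn_attention` are all assumptions made for illustration.

```python
import math

def sinkhorn_attention(cost, eps=0.5, n_iters=200):
    """Return a transport plan between labels (rows) and image regions (columns).

    cost[i][j] is a hypothetical dissimilarity between label i and region j.
    Uniform marginals are assumed: every label sends mass 1/n_labels and
    every region receives mass 1/n_regions.
    """
    n, m = len(cost), len(cost[0])
    # Gibbs kernel of the entropy-regularized problem.
    kernel = [[math.exp(-c / eps) for c in row] for row in cost]
    row_mass, col_mass = [1.0 / n] * n, [1.0 / m] * m
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [row_mass[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [col_mass[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]

# Two labels, three regions: low cost means the region matches the label well.
plan = sinkhorn_attention([[0.1, 0.9, 0.8],
                           [0.7, 0.2, 0.9]])
```

Each row of `plan` can be read as unnormalized attention of one label over the regions. Unlike a per-row softmax, the column constraint also forces every region to be explained by some label.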

[98] Stereo-GS: Multi-View Stereo Vision Model for Generalizable 3D Gaussian Splatting Reconstruction

Xiufeng Huang,Ka Chun Cheung,Runmin Cong,Simon See,Renjie Wan

Main category: cs.CV

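TL;DR: The paper proposes Stereo-GS, which decouples geometry and appearance prediction to achieve efficient and generalizable 3D Gaussian Splatting reconstruction.

Details Motivation: Existing generalizable 3D Gaussian Splatting reconstruction methods typically depend on substantial computational resources and data-driven priors, which makes training inefficient and limits generalization.

Contribution: 1. A disentangled framework that predicts geometry and appearance separately; 2. Feature extraction with a stereo vision backbone, fused via global attention; 3. GS-maps as a representation of the 3DGS object; 4. Pose-free reconstruction that requires no camera parameters.

Method: 1. Extract features from local image pairs with a stereo vision backbone; 2. Fuse the features through global attention blocks; 3. Use dedicated point and Gaussian prediction heads to generate multi-view point-maps for geometry and Gaussian features for appearance; 4. Enhance the resulting GS-maps with a refinement network.

Result: The method reduces resource demands while maintaining high-quality reconstruction, and its pose-free design improves robustness.

Insight: Decoupling geometry and appearance prediction improves efficiency and generalization, global attention helps fuse features, and the pose-free design increases practicality.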

Abstract: Generalizable 3D Gaussian Splatting reconstruction showcases advanced Image-to-3D content creation but requires substantial computational resources and large datasets, posing challenges to training models from scratch. Current methods usually entangle the prediction of 3D Gaussian geometry and appearance, which rely heavily on data-driven priors and result in slow regression speeds. To address this, we propose Stereo-GS, a disentangled framework for efficient 3D Gaussian prediction. Our method extracts features from local image pairs using a stereo vision backbone and fuses them via global attention blocks. Dedicated point and Gaussian prediction heads generate multi-view point-maps for geometry and Gaussian features for appearance, combined as GS-maps to represent the 3DGS object. A refinement network enhances these GS-maps for high-quality reconstruction. Unlike existing methods that depend on camera parameters, our approach achieves pose-free 3D reconstruction, improving robustness and practicality. By reducing resource demands while maintaining high-quality outputs, Stereo-GS provides an efficient, scalable solution for real-world 3D content generation.

[99] 3-Dimensional CryoEM Pose Estimation and Shift Correction Pipeline

Kaishva Chintan Shah,Virajith Boddapati,Karthik S. Gurumoorthy,Sandip Kaledhonkar,Ajit Rajwade

Main category: cs.CV

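TL;DR: The paper presents a robust method for 3D pose estimation and shift correction in cryo-EM that combines multi-dimensional scaling (MDS) with a robust optimization framework, improving reconstruction accuracy under low signal-to-noise ratio (SNR) conditions.

Details Motivation: In cryo-EM, the very low SNR makes pose estimation and shift correction key challenges that directly affect the accuracy of 3D reconstruction. Existing methods are sensitive to noise and enforce geometric constraints only approximately, which degrades reconstruction quality.

Contribution: 1. A robust joint optimization framework for pose estimation based on an ℓ₁-norm objective, with unit-norm and orthogonality constraints enforced exactly so that the estimated rotation matrices are valid; 2. An iterative shift correction algorithm that estimates consistent in-plane translations through a global least-squares formulation; 3. Improved reconstruction accuracy under low-SNR conditions.

Method: 1. Estimate rotation matrices using MDS and common-line geometry; 2. Optimize rotation axes and in-plane vectors with an ℓ₁-norm objective and projected coordinate descent; 3. Correct shift errors with an iterative algorithm built on a global least-squares formulation.

Result: Compared with prior methods, the pipeline achieves better Euler angle accuracy and better reconstruction fidelity as measured by the Fourier Shell Correlation (FSC).

Insight: Exact geometric constraints combined with robust optimization address the compounding errors caused by low SNR, offering a reliable route to high-quality 3D reconstruction in cryo-EM.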

Abstract: Accurate pose estimation and shift correction are key challenges in cryo-EM due to the very low SNR, which directly impacts the fidelity of 3D reconstructions. We present an approach for pose estimation in cryo-EM that leverages multi-dimensional scaling (MDS) techniques in a robust manner to estimate the 3D rotation matrix of each particle from pairs of dihedral angles. We express the rotation matrix in the form of an axis of rotation and a unit vector in the plane perpendicular to the axis. The technique leverages the concept of common lines in 3D reconstruction from projections. However, common line estimation is ridden with large errors due to the very low SNR of cryo-EM projection images. To address this challenge, we introduce two complementary components: (i) a robust joint optimization framework for pose estimation based on an $\ell_1$-norm objective or a similar robust norm, which simultaneously estimates rotation axes and in-plane vectors while exactly enforcing unit norm and orthogonality constraints via projected coordinate descent; and (ii) an iterative shift correction algorithm that estimates consistent in-plane translations through a global least-squares formulation. While prior approaches have leveraged such embeddings and common-line geometry for orientation recovery, existing formulations typically rely on $\ell_2$-based objectives that are sensitive to noise, and enforce geometric constraints only approximately. These choices, combined with a sequential pipeline structure, can lead to compounding errors and suboptimal reconstructions in low-SNR regimes. Our pipeline consistently outperforms prior methods in both Euler angle accuracy and reconstruction fidelity, as measured by the Fourier Shell Correlation (FSC).
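
The abstract parameterizes each rotation by an axis and a unit vector in the plane perpendicular to it, with unit-norm and orthogonality constraints enforced exactly. The sketch below shows one way such a projection and the assembly of a rotation matrix could look. It does not reproduce the paper's $\ell_1$ objective or its coordinate-descent updates, and the function names and the column ordering of the matrix are assumptions.

```python
import math

def normalize(v):
    """Scale a 3-vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross(a, b):
    """Cross product of two 3-vectors."""
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def project_pair(axis, in_plane):
    """Project (axis, in_plane) onto the set of orthonormal pairs."""
    a = normalize(axis)
    dot = sum(x * y for x, y in zip(a, in_plane))
    # Remove the component along the axis, then renormalize.
    b = normalize([y - dot * x for x, y in zip(a, in_plane)])
    return a, b

def rotation_from_pair(a, b):
    """Build a rotation matrix with columns (b, a x b, a); assumed ordering."""
    c = cross(a, b)
    return [[b[i], c[i], a[i]] for i in range(3)]

a, b = project_pair([0.0, 0.0, 2.0], [1.0, 0.0, 1.0])
R = rotation_from_pair(a, b)
```

Because `b` is orthogonal to `a` after the projection and `a x b` completes a right-handed frame, the resulting matrix is orthonormal with determinant +1 by construction.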

[100] Open-set Cross Modal Generalization via Multimodal Unified Representation

Hai Huang,Yan Xia,Shulei Wang,Hanting Wang,Minghui Fang,Shengpeng Ji,Sashuai Zhou,Tao Jin,Zhou Zhao

Main category: cs.CV

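TL;DR: The paper proposes a new task, Open-set Cross Modal Generalization (OSCMG), and addresses it with MICU, a method built on two key components, FCMI and CUJP, that improves cross-modal generalization under open-set conditions.

Details Motivation: Conventional Cross Modal Generalization (CMG) is evaluated only in closed-set settings and ignores the unknown classes that are common in real-world applications. The OSCMG task targets multimodal unified representation under open-set conditions.

Contribution: 1. Proposes the OSCMG task, filling a gap in cross-modal generalization research for open-set environments. 2. Proposes MICU, which combines FCMI and CUJP to strengthen generalization to unknown classes.

Method: 1. FCMI (Fine-Coarse Masked multimodal InfoNCE): strengthens multimodal alignment through masked contrastive learning at both the holistic semantic and temporal levels. 2. CUJP (Cross modal Unified Jigsaw Puzzles): combines modality-agnostic feature selection with self-supervised learning to enhance feature diversity and model uncertainty.

Result: MICU performs well on both CMG and the newly proposed OSCMG task, supporting its effectiveness under open-set conditions.

Insight: Cross-modal generalization under open-set conditions is an important challenge in real-world applications; by combining masked contrastive learning with a self-supervised jigsaw puzzle task, MICU suggests a direction for future open-set multimodal research.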

Abstract: This paper extends Cross Modal Generalization (CMG) to open-set environments by proposing the more challenging Open-set Cross Modal Generalization (OSCMG) task. This task evaluates multimodal unified representations in open-set conditions, addressing the limitations of prior closed-set cross-modal evaluations. OSCMG requires not only cross-modal knowledge transfer but also robust generalization to unseen classes within new modalities, a scenario frequently encountered in real-world applications. Existing multimodal unified representation work lacks consideration for open-set environments. To tackle this, we propose MICU, comprising two key components: Fine-Coarse Masked multimodal InfoNCE (FCMI) and Cross modal Unified Jigsaw Puzzles (CUJP). FCMI enhances multimodal alignment by applying contrastive learning at both holistic semantic and temporal levels, incorporating masking to enhance generalization. CUJP enhances feature diversity and model uncertainty by integrating modality-agnostic feature selection with self-supervised learning, thereby strengthening the model’s ability to handle unknown categories in open-set tasks. Extensive experiments on CMG and the newly proposed OSCMG validate the effectiveness of our approach. The code is available at https://github.com/haihuangcode/CMG.
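
FCMI builds on the InfoNCE contrastive objective. The following is a minimal sketch of plain InfoNCE between paired embeddings from two modalities. It omits the masking and the fine-coarse levels described in the abstract, and the function name `info_nce` and the temperature value are assumptions.

```python
import math

def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss; anchors[i] and positives[i] form the positive pair.

    All other positives in the batch act as negatives for anchor i.
    """
    def unit(v):
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]

    a = [unit(v) for v in anchors]
    p = [unit(v) for v in positives]
    total = 0.0
    for i, ai in enumerate(a):
        # Cosine similarities scaled by the temperature.
        logits = [sum(x * y for x, y in zip(ai, pj)) / temperature for pj in p]
        peak = max(logits)
        # Numerically stable log-sum-exp over the batch.
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(a)

audio = [[1.0, 0.0], [0.0, 1.0]]   # hypothetical embeddings from one modality
video = [[0.9, 0.1], [0.2, 0.8]]   # hypothetical embeddings from another
loss = info_nce(audio, video)
```

The loss is small when each anchor is closest to its own partner and grows when the pairs are mismatched, which is the property the alignment objective relies on.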

[101] Polymorph: Energy-Efficient Multi-Label Classification for Video Streams on Embedded Devices

Saeid Ghafouri,Mohsen Fayyaz,Xiangchen Li,Deepu John,Bo Ji,Dimitrios Nikolopoulos,Hans Vandierendonck

Main category: cs.CV

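TL;DR: Polymorph is a real-time multi-label video classification framework for embedded devices. It exploits structural properties of video streams (label sparsity, temporal continuity, and label co-occurrence) to dynamically activate lightweight low-rank adapters, reducing energy consumption while improving performance.

Details Motivation: Multi-label video classification on embedded devices is constrained by limited compute and energy budgets, but structural properties of video streams, such as label sparsity and temporal continuity, offer an opportunity to make inference more efficient.

Contribution: Proposes the Polymorph framework, which dynamically selects and composes lightweight Low Rank Adapters (LoRA) to reduce compute and energy consumption while improving classification performance.

Method: Context-aware LoRA adapters are dynamically activated and composed. Each adapter specializes in a subset of classes derived from label co-occurrence patterns, with no full-model switching or weight merging.

Result: On the TAO dataset, Polymorph lowers energy consumption by 40% and improves mean average precision (mAP) by 9 points over strong baselines.

Insight: The inherent structural properties of video streams can be exploited to improve both performance and efficiency, particularly on resource-constrained embedded devices.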

Abstract: Real-time multi-label video classification on embedded devices is constrained by limited compute and energy budgets. Yet, video streams exhibit structural properties such as label sparsity, temporal continuity, and label co-occurrence that can be leveraged for more efficient inference. We introduce Polymorph, a context-aware framework that activates a minimal set of lightweight Low Rank Adapters (LoRA) per frame. Each adapter specializes in a subset of classes derived from co-occurrence patterns and is implemented as a LoRA weight over a shared backbone. At runtime, Polymorph dynamically selects and composes only the adapters needed to cover the active labels, avoiding full-model switching and weight merging. This modular strategy improves scalability while reducing latency and energy overhead. Polymorph achieves 40% lower energy consumption and improves mAP by 9 points over strong baselines on the TAO dataset. Polymorph is open source at https://github.com/inference-serving/polymorph/.
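
The runtime idea of selecting only the adapters needed to cover the active labels and applying them over a shared backbone without merging weights can be sketched as follows. The abstract does not describe the selection policy, so the greedy set cover used here is an assumption, as are the adapter names, label sets, and toy matrices.

```python
def select_adapters(active_labels, adapter_labels):
    """Greedy set cover: pick adapters until every active label is covered."""
    remaining = set(active_labels)
    chosen = []
    while remaining:
        best = max(adapter_labels, key=lambda name: len(remaining & adapter_labels[name]))
        gain = remaining & adapter_labels[best]
        if not gain:
            raise ValueError(f"no adapter covers labels: {sorted(remaining)}")
        chosen.append(best)
        remaining -= gain
    return chosen

def matvec(matrix, x):
    """Multiply a row-major matrix by a vector."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in matrix]

def forward(W, adapters, chosen, x):
    """Compute W x + sum_i B_i (A_i x) over the chosen adapters, leaving W untouched."""
    y = matvec(W, x)
    for name in chosen:
        A, B = adapters[name]          # A: rank x d_in, B: d_out x rank
        delta = matvec(B, matvec(A, x))
        y = [yi + di for yi, di in zip(y, delta)]
    return y

# Hypothetical adapters, each specializing in a co-occurring label subset.
adapter_labels = {
    "vehicles": {"car", "bus"},
    "animals": {"dog", "cat"},
    "street": {"car", "person"},
}
adapters = {
    "vehicles": ([[1.0, 0.0]], [[0.0], [2.0]]),
    "animals": ([[0.0, 1.0]], [[1.0], [0.0]]),
    "street": ([[1.0, 1.0]], [[0.5], [0.5]]),
}
W = [[1.0, 0.0], [0.0, 1.0]]           # shared backbone weight (toy identity)
chosen = select_adapters({"car", "bus", "dog"}, adapter_labels)
y = forward(W, adapters, chosen, [3.0, 4.0])
```

Keeping the low-rank updates separate from `W` means that switching the active set between frames only changes which small matrices are applied, which is the property the abstract credits for avoiding full-model switching.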

[102] Decision PCR: Decision version of the Point Cloud Registration task

Yaojie Zhang,Tianlun Huang,Weijun Wang,Wei Feng

Main category: cs.CV

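TL;DR: The paper proposes a data-driven approach to the Decision version of the point cloud registration (PCR) task, using a deep learning framework to assess registration quality and significantly improving the performance of existing methods.

Details Motivation: Traditional evaluation metrics, such as Maximum Inlier Count, become ineffective under extremely low inlier ratios. The paper therefore revisits the registration result evaluation problem and identifies the Decision version of the PCR task as the fundamental problem.

Contribution: 1. To the authors' knowledge, the first comprehensive study of the Decision PCR task through a deep learning framework; 2. A dataset built on 3DMatch and a deep learning classifier trained on it; 3. Integration of the classifier into standard PCR pipelines, with significant performance gains.

Method: 1. Construct the dataset; 2. Train a deep learning classifier to assess registration quality; 3. Integrate the classifier into existing PCR methods.

Result: Combined with GeoTransformer, the approach achieves a state-of-the-art registration recall of 86.97% on the 3DLoMatch benchmark and shows strong generalization on the unseen outdoor ETH dataset.

Insight: The Decision version of the PCR task is key to low-overlap point cloud registration, and data-driven evaluation can overcome the limitations of traditional metrics.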

Abstract: Low-overlap point cloud registration (PCR) remains a significant challenge in 3D vision. Traditional evaluation metrics, such as Maximum Inlier Count, become ineffective under extremely low inlier ratios. In this paper, we revisit the registration result evaluation problem and identify the Decision version of the PCR task as the fundamental problem. To address this Decision PCR task, we propose a data-driven approach. First, we construct a corresponding dataset based on the 3DMatch dataset. Then, a deep learning-based classifier is trained to reliably assess registration quality, overcoming the limitations of traditional metrics. To our knowledge, this is the first comprehensive study to address this task through a deep learning framework. We incorporate this classifier into standard PCR pipelines. When integrated with our approach, existing state-of-the-art PCR methods exhibit significantly enhanced registration performance. For example, combining our framework with GeoTransformer achieves a new SOTA registration recall of 86.97% on the challenging 3DLoMatch benchmark. Our method also demonstrates strong generalization capabilities on the unseen outdoor ETH dataset.

[103] Hierarchical Cross-modal Prompt Learning for Vision-Language Models

Hao Zheng,Shunzhi Yang,Zhuoxin He,Jinfeng Yang,Zhenhua Huang

Main category: cs.CV

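TL;DR: HiCroPL is a hierarchical cross-modal prompt learning framework that uses bidirectional knowledge flow to address modality isolation and hierarchical semantic decay, improving the generalization of vision-language models.

Details Motivation: Pre-trained vision-language models such as CLIP generalize well, but prompt learning methods for adapting them to downstream tasks still suffer from modality isolation and hierarchical semantic decay.

Contribution: Proposes the HiCroPL framework, which improves downstream task performance through bidirectional knowledge flow and multi-scale semantic fusion.

Method: A hierarchical knowledge mapper lets the text and vision modalities refine each other: in early layers, text prompts inject semantics into visual prompts; in later layers, visual prompts flow back to refine text prompts.

Result: Achieves state-of-the-art results on 11 benchmarks, with significant improvements over existing methods.

Insight: Multi-scale semantic fusion and a lightweight layer-specific knowledge proxy are key to efficient cross-modal interaction.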

Abstract: Pre-trained Vision-Language Models (VLMs) such as CLIP have shown excellent generalization abilities. However, adapting these large-scale models to downstream tasks while preserving their generalization capabilities remains challenging. Although prompt learning methods have shown promise, they suffer from two fundamental bottlenecks that limit generalization: (a) modality isolation, and (b) hierarchical semantic decay. To address these limitations, we propose HiCroPL, a Hierarchical Cross-modal Prompt Learning framework that establishes bidirectional knowledge flow between text and vision modalities, enabling them to refine their semantics mutually. HiCroPL routes knowledge flows by leveraging the complementary strengths of text and vision. In early layers, text prompts inject relatively clear semantics into visual prompts through a hierarchical knowledge mapper, enhancing the representation of low-level visual semantics. In later layers, visual prompts encoding specific task-relevant objects flow back to refine text prompts, enabling deeper alignment. Crucially, our hierarchical knowledge mapper allows representations at multi-scales to be fused, ensuring that deeper representations retain transferable shallow semantics thereby enhancing generalization. We further introduce a lightweight layer-specific knowledge proxy to enable efficient cross-modal interactions. Extensive evaluations across four tasks demonstrate HiCroPL’s superior performance, achieving state-of-the-art results on 11 benchmarks with significant improvements. Code is available at: https://github.com/zzeoZheng/HiCroPL.

[104] Language Integration in Fine-Tuning Multimodal Large Language Models for Image-Based Regression

Roy H. Jennings,Genady Paikin,Roy Shaul,Evgeny Soloveichik

Main category: cs.CV

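TL;DR: The paper proposes RvTC, which replaces preset-vocabulary classification with a flexible bin-based approach and combines it with semantically rich prompts, significantly improving the performance of multimodal large language models on image-based regression.

Details Motivation: Existing multimodal large language models (MLLMs) use preset vocabularies and generic prompts for image-based regression, fail to exploit semantic information, and perform on par with image-only models. The authors aim to make better use of the cross-modal understanding of MLLMs.

Contribution: Proposes Regression via Transformer-Based Classification (RvTC), which uses binning to avoid manual vocabulary crafting, and shows that prompts carrying image-specific semantics significantly improve performance.

Method: RvTC replaces preset-vocabulary classification with a flexible bin-based approach and uses prompts that contain semantic information about the specific image rather than generic task descriptions.

Result: RvTC achieves state-of-the-art performance on datasets including AVA and AGIQA-3k. On AVA in particular, semantic prompts raise the correlation from 0.83 to 0.90.

Insight: Semantically rich prompts are essential in multimodal regression tasks: they help the model exploit cross-modal information beyond mere statistical biases.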

Abstract: Multimodal Large Language Models (MLLMs) show promise for image-based regression tasks, but current approaches face key limitations. Recent methods fine-tune MLLMs using preset output vocabularies and generic task-level prompts (e.g., “How would you rate this image?”), assuming this mimics human rating behavior. Our analysis reveals these approaches provide no benefit over image-only training. Models using preset vocabularies and generic prompts perform equivalently to image-only models, failing to leverage semantic understanding from textual input. We propose Regression via Transformer-Based Classification (RvTC), which replaces vocabulary-constrained classification with a flexible bin-based approach. Unlike approaches that address discretization errors through complex distributional modeling, RvTC eliminates manual vocabulary crafting through straightforward bin increase, achieving state-of-the-art performance on four image assessment datasets using only images. More importantly, we demonstrate that data-specific prompts dramatically improve performance. Unlike generic task descriptions, prompts containing semantic information about specific images enable MLLMs to leverage cross-modal understanding. On the AVA dataset, adding challenge titles to prompts improves correlations from 0.83 to 0.90, a new state-of-the-art. We demonstrate through empirical evidence from the AVA and AGIQA-3k datasets that MLLMs benefit from semantic prompt information surpassing mere statistical biases. This underscores the importance of incorporating meaningful textual context in multimodal regression tasks.
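
The bin-based reformulation of regression as classification can be illustrated with a small sketch. The score range, the bin counts, and the decoding by expected bin center are assumptions; the abstract only states that RvTC uses a flexible bin-based approach and increases the number of bins.

```python
def score_to_bin(score, lo, hi, n_bins):
    """Map a continuous score in [lo, hi] to a class index in [0, n_bins - 1]."""
    width = (hi - lo) / n_bins
    return min(int((score - lo) / width), n_bins - 1)

def bin_center(index, lo, hi, n_bins):
    """Return the midpoint of bin `index`."""
    width = (hi - lo) / n_bins
    return lo + (index + 0.5) * width

def decode(probs, lo, hi):
    """Turn predicted class probabilities back into a score (expected bin center)."""
    n_bins = len(probs)
    return sum(p * bin_center(i, lo, hi, n_bins) for i, p in enumerate(probs))

# A hypothetical rating of 5.43 on a 1-to-10 scale, with coarse and fine bins.
coarse = bin_center(score_to_bin(5.43, 1.0, 10.0, 9), 1.0, 10.0, 9)
fine = bin_center(score_to_bin(5.43, 1.0, 10.0, 90), 1.0, 10.0, 90)
```

Increasing the number of bins shrinks the worst-case discretization error, which is half a bin width, without any hand-crafted output vocabulary.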

[105] Axis-Aligned Document Dewarping

Chaoyun Wang,I-Chao Shen,Takeo Igarashi,Nanning Zheng,Caigui Jiang

Main category: cs.CV

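TL;DR: The paper proposes a document dewarping method based on axis-aligned geometric constraints. Introducing the axis-alignment prior in both training and inference improves dewarping results, and the method performs strongly on the newly proposed AAD metric.

Details Motivation: Existing learning-based document dewarping methods rely heavily on annotated data and do not fully use the geometric properties of physical documents. This paper exploits the axis-aligned geometry of documents to improve dewarping.

Contribution: 1. An axis-aligned geometric constraint for the training phase; 2. An axis alignment preprocessing strategy for the inference phase; 3. A new evaluation metric, Axis-Aligned Distortion (AAD); 4. State-of-the-art results on multiple benchmarks.

Method: 1. Add an axis-aligned geometric constraint during training so that feature lines become axis-aligned; 2. Apply preprocessing at inference time to reduce the dewarping difficulty; 3. Measure dewarping quality with the AAD metric.

Result: Achieves state-of-the-art results on multiple benchmarks, with improvements of 18.2% to 34.5% on the AAD metric.

Insight: The axis-aligned nature of documents is central to dewarping, and geometric constraints effectively improve model performance. The new AAD metric is more robust and better aligned with human visual perception.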

Abstract: Document dewarping is crucial for many applications. However, existing learning-based methods primarily rely on supervised regression with annotated data, without leveraging the inherent geometric properties of physical documents in the dewarping process. Our key insight is that a well-dewarped document is characterized by transforming distorted feature lines into axis-aligned ones. This property aligns with the inherent axis-aligned nature of the discrete grid geometry in planar documents. In the training phase, we propose an axis-aligned geometric constraint to enhance document dewarping. In the inference phase, we propose an axis alignment preprocessing strategy to reduce the dewarping difficulty. In the evaluation phase, we introduce a new metric, Axis-Aligned Distortion (AAD), that not only incorporates geometric meaning and aligns with human visual perception but also demonstrates greater robustness. As a result, our method achieves SOTA results on multiple existing benchmarks and achieves improvements of 18.2% to 34.5% on the AAD metric.

[106] FastSmoothSAM: A Fast Smooth Method For Segment Anything Model

Jiasheng Xu,Yewang Chen

Main category: cs.CV

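TL;DR: The paper proposes FastSmoothSAM, a refinement method based on B-Spline curve fitting that smooths the jagged edges produced by FastSAM, improving the visual quality and accuracy of segmentation while maintaining real-time performance.

Details Motivation: FastSAM achieves real-time segmentation, but the edges it generates are often jagged and deviate from the true object shapes. Improving edge quality and segmentation accuracy requires an efficient refinement method that does not compromise geometric information.

Contribution: Proposes a four-stage B-Spline curve fitting refinement that smooths jagged edges and improves visual quality and analytical accuracy while preserving the real-time processing capability of FastSAM.

Method: Uses B-Spline curve fitting in a four-stage refinement process, including two rounds of curve fitting, to smooth jagged edges while retaining critical geometric information.

Result: Experiments show that the method improves the edge quality of FastSAM while maintaining real-time performance, making it suitable for scenarios such as industrial automation, medical imaging, and autonomous driving.

Insight: The flexibility and shape control of B-Splines make them well suited to fixing jagged edges; combined with the real-time performance of FastSAM, they offer a more accurate segmentation solution for practical applications.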

Abstract: Accurately identifying and representing object edges is a challenging task in computer vision and image processing. The Segment Anything Model (SAM) has significantly influenced the field of image segmentation, but suffers from high memory consumption and long inference times, limiting its efficiency in real-time applications. To address these limitations, Fast Segment Anything (FastSAM) was proposed, achieving real-time segmentation. However, FastSAM often generates jagged edges that deviate from the true object shapes. Therefore, this paper introduces a novel refinement approach using B-Spline curve fitting techniques to enhance the edge quality in FastSAM. Leveraging the robust shape control and flexible geometric construction of B-Splines, a four-stage refining process involving two rounds of curve fitting is employed to effectively smooth jagged edges. This approach significantly improves the visual quality and analytical accuracy of object edges without compromising critical geometric information. The proposed method improves the practical utility of FastSAM by improving segmentation accuracy while maintaining real-time processing capabilities. This advancement unlocks greater potential for FastSAM technology in various real-world scenarios, such as industrial automation, medical imaging, and autonomous systems, where precise and efficient edge recognition is crucial.
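
The smoothing effect of B-Splines on a jagged contour can be illustrated by evaluating a closed uniform cubic B-Spline whose control points are the contour vertices. This is a simplification and not the paper's four-stage pipeline: the curve approximates the vertices rather than being fitted to them, and the function name and sampling density are assumptions.

```python
def closed_cubic_bspline(control, samples_per_segment=4):
    """Sample a closed uniform cubic B-Spline defined by 2D control points."""
    n = len(control)
    curve = []
    for i in range(n):
        # Four consecutive control points, wrapping around the closed contour.
        p0, p1, p2, p3 = (control[(i + k - 1) % n] for k in range(4))
        for s in range(samples_per_segment):
            t = s / samples_per_segment
            # Uniform cubic B-Spline basis functions; they sum to 1 for any t.
            b0 = (1 - t) ** 3 / 6
            b1 = (3 * t ** 3 - 6 * t ** 2 + 4) / 6
            b2 = (-3 * t ** 3 + 3 * t ** 2 + 3 * t + 1) / 6
            b3 = t ** 3 / 6
            curve.append(tuple(b0 * a + b1 * b + b2 * c + b3 * d
                               for a, b, c, d in zip(p0, p1, p2, p3)))
    return curve

# A staircase-like (jagged) closed contour, as a mask boundary might produce.
jagged = [(0, 0), (2, 0), (2, 1), (4, 1), (4, 3), (2, 3), (2, 4), (0, 4)]
smooth = closed_cubic_bspline(jagged)
```

Because every sampled point is a convex combination of four neighboring vertices, the curve stays inside the contour's convex hull and rounds off the right-angle steps. A fitting-based approach such as the one in the paper would additionally control how closely the curve follows the original boundary.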

[107] Towards Video Thinking Test: A Holistic Benchmark for Advanced Video Reasoning and Understanding

Yuanhan Zhang,Yunice Chew,Yuhao Dong,Aria Leo,Bo Hu,Ziwei Liu

Main category: cs.CV

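TL;DR: The paper introduces Video-TT, a new video understanding benchmark that evaluates video large language models on complex visual narratives and adversarial questions, revealing a significant gap between these models and human intelligence.

Details Motivation: Existing benchmarks do not adequately reflect the gap between video large language models and human intelligence in terms of correctness and robustness in video understanding. Video-TT is proposed to fill this gap.

Contribution: Proposes the Video-TT benchmark, comprising 1,000 short videos with open-ended and adversarial questions, for systematically evaluating the visual narrative understanding and robustness of video large language models.

Method: Collects YouTube Shorts videos and designs one open-ended question and four adversarial questions per video, building an evaluation framework that compares models with humans on video understanding.

Result: The evaluation shows that video large language models lag significantly behind human performance in both correctness and robustness.

Insight: Video-TT exposes the limitations of current video large language models in complex visual reasoning and under adversarial questioning, pointing to directions for future work.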

Abstract: Human intelligence requires correctness and robustness, with the former being foundational for the latter. In video understanding, correctness ensures the accurate interpretation of visual content, and robustness maintains consistent performance in challenging conditions. Despite advances in video large language models (video LLMs), existing benchmarks inadequately reflect the gap between these models and human intelligence in maintaining correctness and robustness in video interpretation. We introduce the Video Thinking Test (Video-TT), to assess if video LLMs can interpret real-world videos as effectively as humans. Video-TT reflects genuine gaps in understanding complex visual narratives, and evaluates robustness against natural adversarial questions. Video-TT comprises 1,000 YouTube Shorts videos, each with one open-ended question and four adversarial questions that probe visual and narrative complexity. Our evaluation shows a significant gap between video LLMs and human performance.

[108] OpenBreastUS: Benchmarking Neural Operators for Wave Imaging Using Breast Ultrasound Computed Tomography

Zhijun Zeng,Youjia Zheng,Hao Hu,Zeyuan Dong,Yihang Zheng,Xinliang Liu,Jinzhuo Wang,Zuoqiang Shi,Linfeng Zhang,Yubing Li,He Sun

Main category: cs.CV

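TL;DR: OpenBreastUS is a large-scale wave equation dataset for benchmarking neural operators in breast ultrasound computed tomography. It bridges the gap between theory and practical application and enables the first demonstration of efficient in vivo breast imaging with neural operator solvers.

Details Motivation: Traditional numerical solvers for wave equations are computationally intensive and often unstable, which makes quasi-real-time imaging difficult, while existing datasets are too simplified to validate neural operators in realistic imaging.

Contribution: Presents OpenBreastUS, a large-scale dataset of 8,000 anatomically realistic breast phantoms and over 16 million frequency-domain simulations, and demonstrates neural operators in in vivo imaging for the first time.

Method: Generates frequency-domain wave simulation data using real ultrasound CT configurations and uses it to evaluate neural operators on forward simulation and inverse imaging tasks.

Result: OpenBreastUS provides a platform for developing efficient neural PDE solvers and enables the first neural operator-based in vivo breast imaging.

Insight: The study supports the feasibility of neural operators in medical imaging and suggests new directions, particularly for validating performance in complex, realistic scenarios.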

Abstract: Accurate and efficient simulation of wave equations is crucial in computational wave imaging applications, such as ultrasound computed tomography (USCT), which reconstructs tissue material properties from observed scattered waves. Traditional numerical solvers for wave equations are computationally intensive and often unstable, limiting their practical applications for quasi-real-time image reconstruction. Neural operators offer an innovative approach by accelerating PDE solving using neural networks; however, their effectiveness in realistic imaging is limited because existing datasets oversimplify real-world complexity. In this paper, we present OpenBreastUS, a large-scale wave equation dataset designed to bridge the gap between theoretical equations and practical imaging applications. OpenBreastUS includes 8,000 anatomically realistic human breast phantoms and over 16 million frequency-domain wave simulations using real USCT configurations. It enables a comprehensive benchmarking of popular neural operators for both forward simulation and inverse imaging tasks, allowing analysis of their performance, scalability, and generalization capabilities. By offering a realistic and extensive dataset, OpenBreastUS not only serves as a platform for developing innovative neural PDE solvers but also facilitates their deployment in real-world medical imaging problems. For the first time, we demonstrate efficient in vivo imaging of the human breast using neural operator solvers.

[109] EBA-AI: Ethics-Guided Bias-Aware AI for Efficient Underwater Image Enhancement and Coral Reef Monitoring

Lyes Saad Saoud,Irfan Hussain

Main category: cs.CV

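TL;DR: EBA-AI is an ethics-guided, bias-aware AI framework for efficient underwater image enhancement and coral reef monitoring that uses CLIP embeddings to detect bias and optimizes computational efficiency.

Details Motivation: Underwater image enhancement is vital for marine conservation, but existing AI models suffer from dataset bias, high computational cost, and a lack of transparency, which can lead to misinterpretation.

Contribution: Proposes the EBA-AI framework, which combines CLIP embeddings for bias detection, adaptive processing for computational efficiency, and uncertainty estimation and explainability techniques to increase trust.

Method: Uses CLIP embeddings to detect and mitigate dataset bias, and adaptive processing to reduce GPU usage while maintaining enhancement quality.

Result: Experiments on LSUI400, Oceanex, and UIEB100 show a controlled PSNR drop of 1.0 dB, while the computational savings make real-time, large-scale monitoring feasible.

Insight: The framework balances fairness, efficiency, and interpretability, supporting sustainable marine conservation.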

Abstract: Underwater image enhancement is vital for marine conservation, particularly coral reef monitoring. However, AI-based enhancement models often face dataset bias, high computational costs, and lack of transparency, leading to potential misinterpretations. This paper introduces EBA-AI, an ethics-guided bias-aware AI framework to address these challenges. EBA-AI leverages CLIP embeddings to detect and mitigate dataset bias, ensuring balanced representation across varied underwater environments. It also integrates adaptive processing to optimize energy efficiency, significantly reducing GPU usage while maintaining competitive enhancement quality. Experiments on LSUI400, Oceanex, and UIEB100 show that while PSNR drops by a controlled 1.0 dB, computational savings enable real-time feasibility for large-scale marine monitoring. Additionally, uncertainty estimation and explainability techniques enhance trust in AI-driven environmental decisions. Comparisons with CycleGAN, FunIEGAN, RAUNENet, WaterNet, UGAN, PUGAN, and UTUIE validate EBA-AI’s effectiveness in balancing efficiency, fairness, and interpretability in underwater image processing. By addressing key limitations of AI-driven enhancement, this work contributes to sustainable, bias-aware, and computationally efficient marine conservation efforts. For interactive visualizations, animations, source code, and access to the preprint, visit: https://lyessaadsaoud.github.io/EBA-AI/

[110] StableAnimator++: Overcoming Pose Misalignment and Face Distortion for Human Image Animation

Shuyuan Tu,Zhen Xing,Xintong Han,Zhi-Qi Cheng,Qi Dai,Chong Luo,Zuxuan Wu,Yu-Gang Jiang

Main category: cs.CV

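TL;DR: StableAnimator++ uses learnable pose alignment and ID-preserving techniques to address pose misalignment and face distortion in human image animation, yielding a high-quality video diffusion framework.

Details Motivation: Current diffusion-based human image animation methods struggle to maintain identity consistency when the reference image and the driving video differ substantially, for example in body size or position.

Contribution: 1. Proposes StableAnimator++, the first ID-preserving video diffusion framework with learnable pose alignment. 2. Designs a learnable pose alignment module that predicts similarity transformation matrices under SVD guidance. 3. Introduces a distribution-aware ID Adapter that reduces interference from temporal layers. 4. Proposes an HJB-based face optimization that improves the facial fidelity of generated videos.

Method: 1. Use learnable layers to predict similarity transformation matrices that align the poses. 2. Extract image and face embeddings with off-the-shelf encoders and refine the face embeddings. 3. Apply the distribution-aware ID Adapter. 4. Integrate the HJB face optimization into the denoising process at inference time.

Result: On benchmarks, StableAnimator++ performs well both qualitatively and quantitatively, improving identity consistency and facial fidelity.

Insight: Combining pose alignment with ID preservation can substantially improve the quality of human image animation, especially when pose differences are large.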

Abstract: Current diffusion models for human image animation often struggle to maintain identity (ID) consistency, especially when the reference image and driving video differ significantly in body size or position. We introduce StableAnimator++, the first ID-preserving video diffusion framework with learnable pose alignment, capable of generating high-quality videos conditioned on a reference image and a pose sequence without any post-processing. Building upon a video diffusion model, StableAnimator++ contains carefully designed modules for both training and inference, striving for identity consistency. In particular, StableAnimator++ first uses learnable layers to predict the similarity transformation matrices between the reference image and the driven poses via injecting guidance from Singular Value Decomposition (SVD). These matrices align the driven poses with the reference image, mitigating misalignment to a great extent. StableAnimator++ then computes image and face embeddings using off-the-shelf encoders, refining the face embeddings via a global content-aware Face Encoder. To further maintain ID, we introduce a distribution-aware ID Adapter that counteracts interference caused by temporal layers while preserving ID via distribution alignment. During the inference stage, we propose a novel Hamilton-Jacobi-Bellman (HJB) based face optimization integrated into the denoising process, guiding the diffusion trajectory for enhanced facial fidelity. Experiments on benchmarks show the effectiveness of StableAnimator++ both qualitatively and quantitatively.
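
A similarity transformation (scale, rotation, translation) between two sets of corresponding keypoints has a classical closed-form solution based on SVD. The sketch below shows that closed form (the Umeyama method) as background for the SVD guidance mentioned in the abstract. It is not the learnable alignment module of StableAnimator++, and the keypoints and the function name are assumptions.

```python
import numpy as np

def similarity_from_points(src, dst):
    """Least-squares scale, rotation, translation with dst ~ scale * R @ src + t."""
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    mu_src, mu_dst = src.mean(axis=0), dst.mean(axis=0)
    xs, xd = src - mu_src, dst - mu_dst
    # Cross-covariance between the centered point sets.
    cov = xd.T @ xs / len(src)
    U, S, Vt = np.linalg.svd(cov)
    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    d = np.ones(len(S))
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        d[-1] = -1.0
    R = U @ np.diag(d) @ Vt
    var_src = (xs ** 2).sum() / len(src)
    scale = (S * d).sum() / var_src
    t = mu_dst - scale * R @ mu_src
    return scale, R, t

# Hypothetical driving-pose keypoints and their positions in the reference frame.
pose = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 2.0], [0.0, 1.5]])
angle = np.deg2rad(30.0)
R_true = np.array([[np.cos(angle), -np.sin(angle)],
                   [np.sin(angle), np.cos(angle)]])
reference = 2.0 * pose @ R_true.T + np.array([5.0, -3.0])
scale, R, t = similarity_from_points(pose, reference)
aligned = scale * pose @ R.T + t
```

With exact correspondences the recovered transform reproduces the reference keypoints. With noisy keypoints it returns the least-squares fit, which is the kind of alignment that reduces size and position mismatch between a reference image and a driving pose sequence.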

[111] Aesthetics is Cheap, Show me the Text: An Empirical Evaluation of State-of-the-Art Generative Models for OCR

Peirong Zhang,Haowei Xu,Jiaxin Zhang,Guitao Xu,Xuhan Zheng,Zhenhua Yang,Junle Liu,Yuyi Zhang,Lianwen Jin

Main category: cs.CV

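TL;DR: The paper evaluates current state-of-the-art generative models on text image generation and editing, reveals their limitations on OCR tasks, and argues that photorealistic text image generation and editing should become foundational skills of general-domain generative models.

Details Motivation: Generative models have recently made remarkable progress in image generation, but their performance on text image generation and editing has not been comprehensively evaluated. The paper aims to fill this gap.

Contribution: 1. Broadens OCR tasks into OCR generative tasks; 2. Evaluates 33 representative tasks organized into categories; 3. Reveals the weaknesses of current generative models in text image generation and editing.

Method: Selects six models spanning closed-source and open-source domains and evaluates them on 33 tasks in five categories (document, handwritten text, scene text, artistic text, and complex & layout-rich text), using tailored, high-quality image inputs and prompts.

Result: Current generative models show limitations in photorealistic text generation and editing, particularly on complex-layout and artistic text tasks.

Insight: Photorealistic text generation and editing should be a foundational skill of general-domain generative models rather than being delegated to specialized solutions. The findings give the community directions for improvement.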

Abstract: Text images are a unique and crucial information medium that integrates visual aesthetics and linguistic semantics in modern e-society. Due to their subtlety and complexity, the generation of text images represents a challenging and evolving frontier in the image generation field. The recent surge of specialized image generators (e.g., Flux-series) and unified generative models (e.g., GPT-4o), which demonstrate exceptional fidelity, raises a natural question: can they master the intricacies of text image generation and editing? Motivated by this, we assess current state-of-the-art generative models’ capabilities in terms of text image generation and editing. We incorporate various typical optical character recognition (OCR) tasks into our evaluation and broaden the concept of text-based generation tasks into OCR generative tasks. We select 33 representative tasks and categorize them into five categories: document, handwritten text, scene text, artistic text, and complex & layout-rich text. For comprehensive evaluation, we examine six models across both closed-source and open-source domains, using tailored, high-quality image inputs and prompts. Through this evaluation, we draw crucial observations and identify the weaknesses of current generative models for OCR tasks. We argue that photorealistic text image generation and editing should be internalized as foundational skills in general-domain generative models, rather than being delegated to specialized solutions, and we hope this empirical analysis can provide valuable insights for the community to achieve this goal. This evaluation is online and will be continuously updated at our GitHub repository.

[112] Visual Place Recognition for Large-Scale UAV Applications

Ioannis Tsampikos Papapetros,Ioannis Kansizoglou,Antonios Gasteratos

Main category: cs.CV

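TL;DR: The paper highlights the importance of Visual Place Recognition (vPR) for UAV navigation. To address the small scale and limited diversity of existing datasets and the rotational ambiguity of aerial imagery, it introduces the large-scale dataset LASED and a steerable CNN approach, improving the generalization and robustness of vPR.

Details Motivation: Aerial vPR is held back by existing datasets that are small and lack diversity, and by rotational ambiguity in UAV imagery, both of which limit model generalization and performance.

Contribution: 1. Introduces LASED, a large-scale dataset of approximately one million images with extensive geographic and temporal diversity; 2. Proposes steerable CNNs to handle rotational ambiguity.

Method: Handles rotational ambiguity with steerable CNNs, leveraging their rotational equivariance to produce robust, orientation-invariant feature representations.

Result: Models trained on LASED achieve significantly higher recall than those trained on smaller, less diverse datasets, and steerable CNNs improve recall by 12% on average over the best-performing non-steerable network.

Insight: Combining a structured, large-scale dataset with rotation-equivariant networks significantly improves the performance and generalization of aerial vPR.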

Abstract: Visual Place Recognition (vPR) plays a crucial role in Unmanned Aerial Vehicle (UAV) navigation, enabling robust localization across diverse environments. Despite significant advancements, aerial vPR faces unique challenges due to the limited availability of large-scale, high-altitude datasets, which limits model generalization, along with the inherent rotational ambiguity in UAV imagery. To address these challenges, we introduce LASED, a large-scale aerial dataset with approximately one million images, systematically sampled from 170,000 unique locations throughout Estonia over a decade, offering extensive geographic and temporal diversity. Its structured design ensures clear place separation significantly enhancing model training for aerial scenarios. Furthermore, we propose the integration of steerable Convolutional Neural Networks (CNNs) to explicitly handle rotational variance, leveraging their inherent rotational equivariance to produce robust, orientation-invariant feature representations. Our extensive benchmarking demonstrates that models trained on LASED achieve significantly higher recall compared to those trained on smaller, less diverse datasets, highlighting the benefits of extensive geographic coverage and temporal diversity. Moreover, steerable CNNs effectively address rotational ambiguity inherent in aerial imagery, consistently outperforming conventional convolutional architectures, achieving on average 12% recall improvement over the best-performing non-steerable network. By combining structured, large-scale datasets with rotation-equivariant neural networks, our approach significantly enhances model robustness and generalization for aerial vPR.

[113] BleedOrigin: Dynamic Bleeding Source Localization in Endoscopic Submucosal Dissection via Dual-Stage Detection and Tracking

Mengya Xu,Rulin Zhou,An Wang,Chaoyang Lyu,Zhen Li,Ning Zhong,Hongliang Ren

Main category: cs.CV

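TL;DR: This paper introduces the BleedOrigin-Bench dataset and the BleedOrigin-Net framework for localizing and tracking bleeding sources during Endoscopic Submucosal Dissection (ESD), targeting the frequent visual obstructions and dynamic scene changes that current AI methods handle poorly.

Details Motivation: Real-time localization and continuous monitoring of the bleeding source during ESD are essential for effective hemostasis, but current AI methods focus mainly on bleeding region segmentation and lack accurate bleeding source detection and temporal tracking. The lack of specialized datasets further hinders the development of AI-assisted guidance systems.

Contribution: 1. Presents BleedOrigin-Bench, the first comprehensive ESD bleeding source dataset, with 1,771 expert-annotated bleeding sources across 106,222 frames from 44 procedures, supplemented with 39,755 pseudo-labeled frames. 2. Develops BleedOrigin-Net, a dual-stage detection-tracking framework that covers the complete workflow from bleeding onset detection to continuous spatial tracking.

Method: BleedOrigin-Net is a dual-stage framework that pairs detection with tracking to handle the dynamics of bleeding source localization. It is compared against YOLOv11/v12, multimodal large language models, and point tracking methods.

Result: BleedOrigin-Net achieves 96.85% frame-level accuracy (±≤8 frames) for bleeding onset detection, 70.24% pixel-level accuracy (≤100 px) for initial source detection, and 96.11% pixel-level accuracy (≤100 px) for point tracking.

Insight: 1. The dynamic nature of the ESD environment calls for dedicated datasets and algorithms. 2. A dual-stage detection-tracking framework can combine spatial and temporal information to make bleeding source localization more robust.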

Abstract: Intraoperative bleeding during Endoscopic Submucosal Dissection (ESD) poses significant risks, demanding precise, real-time localization and continuous monitoring of the bleeding source for effective hemostatic intervention. In particular, endoscopists have to repeatedly flush to clear blood, allowing only milliseconds to identify bleeding sources, an inefficient process that prolongs operations and elevates patient risks. However, current Artificial Intelligence (AI) methods primarily focus on bleeding region segmentation, overlooking the critical need for accurate bleeding source detection and temporal tracking in the challenging ESD environment, which is marked by frequent visual obstructions and dynamic scene changes. This gap is widened by the lack of specialized datasets, hindering the development of robust AI-assisted guidance systems. To address these challenges, we introduce BleedOrigin-Bench, the first comprehensive ESD bleeding source dataset, featuring 1,771 expert-annotated bleeding sources across 106,222 frames from 44 procedures, supplemented with 39,755 pseudo-labeled frames. This benchmark covers 8 anatomical sites and 6 challenging clinical scenarios. We also present BleedOrigin-Net, a novel dual-stage detection-tracking framework for the bleeding source localization in ESD procedures, addressing the complete workflow from bleeding onset detection to continuous spatial tracking. We compare with widely-used object detection models (YOLOv11/v12), multimodal large language models, and point tracking methods. Extensive evaluation demonstrates state-of-the-art performance, achieving 96.85% frame-level accuracy ($\pm\leq8$ frames) for bleeding onset detection, 70.24% pixel-level accuracy ($\leq100$ px) for initial source detection, and 96.11% pixel-level accuracy ($\leq100$ px) for point tracking.
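
The tolerance-based metrics quoted above can be read as simple hit rates. The sketch below is one plausible reading, not the benchmark's official evaluation code: it assumes a frame-level hit means the predicted onset frame is within 8 frames of the annotation, and a pixel-level hit means the predicted point lies within 100 px (Euclidean) of the annotated source. The function names are hypothetical.

```python
import math


def onset_accuracy(pred_frames, true_frames, tol=8):
    """Fraction of onsets predicted within +/- tol frames (assumed definition)."""
    hits = sum(abs(p - t) <= tol for p, t in zip(pred_frames, true_frames))
    return hits / len(true_frames)


def point_accuracy(pred_pts, true_pts, tol_px=100.0):
    """Fraction of points within tol_px pixels of the annotation (assumed definition)."""
    hits = sum(math.dist(p, t) <= tol_px for p, t in zip(pred_pts, true_pts))
    return hits / len(true_pts)
```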

[114] LoopNet: A Multitasking Few-Shot Learning Approach for Loop Closure in Large Scale SLAM

Mohammad-Maher Nakshbandi,Ziad Sharawy,Sorin Grigorescu

Main category: cs.CV

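TL;DR: The paper proposes LoopNet, a multitasking few-shot learning approach to loop closure detection in large-scale SLAM. It adapts the classical ResNet architecture and adds online retraining to improve loop closure detection accuracy and real-time performance.

Details Motivation: Addresses the accuracy and real-time challenges of loop closure detection in SLAM systems, in particular running deep learning models efficiently on embedded hardware.

Contribution: 1) Proposes LoopNet, a framework based on a multitasking ResNet; 2) Combines online retraining with few-shot learning; 3) Introduces a new loop closure benchmarking dataset, LoopDB.

Method: A multitasking ResNet architecture is combined with DISK descriptors and online retraining, where the retraining is designed as a few-shot learning procedure. The network outputs both an index into the queried visual dataset and a measure of prediction quality.

Result: LoopNet is reported to perform better under varying conditions than handcrafted-feature and traditional deep learning methods.

Insight: Combining multitask learning with online retraining can improve loop closure detection in SLAM systems, particularly on resource-constrained embedded devices.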

Abstract: One of the main challenges in the Simultaneous Localization and Mapping (SLAM) loop closure problem is the recognition of previously visited places. In this work, we tackle the two main problems of real-time SLAM systems: 1) loop closure detection accuracy and 2) real-time computation constraints on the embedded hardware. Our LoopNet method is based on a multitasking variant of the classical ResNet architecture, adapted for online retraining on a dynamic visual dataset and optimized for embedded devices. The online retraining is designed using a few-shot learning approach. The architecture provides both an index into the queried visual dataset, and a measurement of the prediction quality. Moreover, by leveraging DISK (DIStinctive Keypoints) descriptors, LoopNet surpasses the limitations of handcrafted features and traditional deep learning methods, offering better performance under varying conditions. Code is available at https://github.com/RovisLab/LoopNet. Additionally, we introduce a new loop closure benchmarking dataset, coined LoopDB, which is available at https://github.com/RovisLab/LoopDB.

[115] Enhancing Visual Planning with Auxiliary Tasks and Multi-token Prediction

Ce Zhang,Yale Song,Ruta Desai,Michael Louis Iuzzolino,Joseph Tighe,Gedas Bertasius,Satwik Kottur

Main category: cs.CV

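TL;DR: The paper proposes VideoPlan, which uses Auxiliary Task Augmentation and Multi-token Prediction to address data scarcity and action-space modeling in visual planning, and achieves state-of-the-art results on the COIN and CrossTask datasets.

Details Motivation: Visual planning suffers from scarce procedural annotations and from a next-token prediction objective that captures the structured action space inefficiently; long-horizon planning remains challenging for existing methods.

Contribution: 1. Proposes Auxiliary Task Augmentation to mitigate data scarcity; 2. Introduces Multi-token Prediction to better model the action space.

Method: The model is trained on auxiliary tasks (e.g., goal prediction), and traditional next-token prediction is extended with multiple heads that predict multiple future tokens during training, to better capture the structured action space.

Result: On COIN and CrossTask, VideoPlan surpasses prior methods by 7.3% and 3.4%, respectively, when predicting 3 future actions; on the Ego4D Long-term Action Anticipation task it is on par with state-of-the-art approaches.

Insight: Auxiliary tasks and multi-token prediction effectively improve visual planning, and the method extends to Ego4D Long-term Action Anticipation without using specialized egocentric features.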

Abstract: Visual Planning for Assistance (VPA) aims to predict a sequence of user actions required to achieve a specified goal based on a video showing the user’s progress. Although recent advances in multimodal large language models (MLLMs) have shown promising results in video understanding, long-horizon visual planning remains a challenging problem. We identify two challenges in training large MLLMs for video-based planning tasks: (1) scarcity of procedural annotations, limiting the model’s ability to learn procedural task dynamics effectively, and (2) inefficiency of next-token prediction objective to explicitly capture the structured action space for visual planning when compared to free-form, natural language. To tackle data scarcity, we introduce Auxiliary Task Augmentation. We design and train our model on auxiliary tasks relevant to long-horizon video-based planning (e.g., goal prediction) to augment the model’s planning ability. To more explicitly model the structured action space unique to visual planning tasks, we leverage Multi-token Prediction, extending traditional next-token prediction by using multiple heads to predict multiple future tokens during training. Our approach, VideoPlan, achieves state-of-the-art VPA performance on the COIN and CrossTask datasets, surpassing prior methods by 7.3% and 3.4%, respectively, when predicting 3 future actions. We further extend our method to the challenging Ego4D Long-term Action Anticipation task, and show that it is on par with the state-of-the-art approaches despite not using specialized egocentric features. Code will be made available.
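
As a rough sketch of what a multi-token prediction objective can look like, the code below averages cross-entropy losses over several heads, where head k (counting from 0) is trained to predict the token k+1 positions ahead. This is a generic illustration under stated assumptions, not VideoPlan's implementation: the equal weighting of heads, the handling of sequence ends, and the names `cross_entropy` and `multi_token_loss` are all hypothetical.

```python
import math


def cross_entropy(logits, target):
    """Negative log-softmax of the target class, computed stably."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def multi_token_loss(head_logits, tokens, t):
    """Average loss over heads; head k predicts tokens[t + 1 + k].

    Heads whose target falls past the end of the sequence are skipped.
    """
    losses = []
    for k, logits in enumerate(head_logits):
        idx = t + 1 + k
        if idx < len(tokens):
            losses.append(cross_entropy(logits, tokens[idx]))
    if not losses:
        raise ValueError("no future tokens available at position t")
    return sum(losses) / len(losses)
```

With a single head this reduces to ordinary next-token prediction, which is why the extra heads can be seen as additional training signal rather than a different model.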

[116] Event-based Graph Representation with Spatial and Motion Vectors for Asynchronous Object Detection

Aayush Atul Verma,Arpitsinh Vaghela,Bharatesh Chakravarthi,Kaustav Chanda,Yezhou Yang

Main category: cs.CV

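TL;DR: The paper proposes a novel spatiotemporal multigraph representation for asynchronous object detection with event-based sensors; modeling a spatial graph and a temporal graph separately improves detection accuracy and computational efficiency.

Details Motivation: Event sensor data is sparse and asynchronous. Converting it to dense tensors gives up these advantages, while existing graph representations model spatiotemporal dynamics suboptimally, so a better representation is needed.

Contribution: 1. Proposes decoupled spatial and temporal graph representations; 2. Uses B-spline basis functions to model global spatial structure and motion vector-based attention to model local dynamic changes; 3. Improves detection accuracy and computational efficiency.

Method: 1. Build a spatial graph (B-spline basis functions) and a temporal graph (motion vector-based attention); 2. Use efficient 2D kernels in place of computationally expensive 3D kernels.

Result: On the Gen1 automotive and eTraM datasets, detection accuracy improves by over 6% compared to previous graph-based works, with a 5x speedup, a reduced parameter count, and no increase in computational cost.

Insight: Modeling spatial and temporal dynamics separately is an effective way to process event sensor data, balancing accuracy and efficiency.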

Abstract: Event-based sensors offer high temporal resolution and low latency by generating sparse, asynchronous data. However, converting this irregular data into dense tensors for use in standard neural networks diminishes these inherent advantages, motivating research into graph representations. While such methods preserve sparsity and support asynchronous inference, their performance on downstream tasks remains limited due to suboptimal modeling of spatiotemporal dynamics. In this work, we propose a novel spatiotemporal multigraph representation to better capture spatial structure and temporal changes. Our approach constructs two decoupled graphs: a spatial graph leveraging B-spline basis functions to model global structure, and a temporal graph utilizing motion vector-based attention for local dynamic changes. This design enables the use of efficient 2D kernels in place of computationally expensive 3D kernels. We evaluate our method on the Gen1 automotive and eTraM datasets for event-based object detection, achieving over a 6% improvement in detection accuracy compared to previous graph-based works, with a 5x speedup, reduced parameter count, and no increase in computational cost. These results highlight the effectiveness of structured graph modeling for asynchronous vision. Project page: eventbasedvision.github.io/eGSMV.

[117] MeshMamba: State Space Models for Articulated 3D Mesh Generation and Reconstruction

Yusuke Yoshiyasu,Leyuan Sun,Ryusuke Sagawa

Main category: cs.CV

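TL;DR: MeshMamba uses Mamba State Space Models together with a vertex serialization technique to process large 3D meshes efficiently, enabling the generation and reconstruction of body mesh models with more than 10,000 vertices.

Details Motivation: Prior methods are inefficient and hard to scale on high-resolution 3D meshes. MeshMamba aims to improve efficiency with a new sequence model while capturing details such as clothing and hand geometry.

Contribution: 1) Proposes the MeshMamba framework; 2) Designs the generative model MambaDiff3D and the reconstruction model Mamba-HMR; 3) Extends non-parametric human mesh recovery to the whole-body setting.

Method: 1) Serialize mesh vertices into orderings that Mamba processes easily; 2) Build the generation and reconstruction models on Mamba-SSMs; 3) Sort vertices by body part annotations or by the 3D vertex locations of a template mesh.

Result: MambaDiff3D generates dense 3D human meshes in clothes and with grasping hands, and outperforms previous approaches in 3D human shape generation; Mamba-HMR extends non-parametric recovery to the whole-body setting with face and hands, with competitive performance in (near) real-time.

Insight: Combining serialization with Mamba-SSMs offers an efficient solution for high-resolution 3D mesh processing and pushes the performance of both generation and reconstruction.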

Abstract: In this paper, we introduce MeshMamba, a neural network model for learning 3D articulated mesh models by employing the recently proposed Mamba State Space Models (Mamba-SSMs). MeshMamba is efficient and scalable in handling a large number of input tokens, enabling the generation and reconstruction of body mesh models with more than 10,000 vertices, capturing clothing and hand geometries. The key to effectively learning MeshMamba is the serialization technique of mesh vertices into orderings that are easily processed by Mamba. This is achieved by sorting the vertices based on body part annotations or the 3D vertex locations of a template mesh, such that the ordering respects the structure of articulated shapes. Based on MeshMamba, we design 1) MambaDiff3D, a denoising diffusion model for generating 3D articulated meshes and 2) Mamba-HMR, a 3D human mesh recovery model that reconstructs a human body shape and pose from a single image. Experimental results showed that MambaDiff3D can generate dense 3D human meshes in clothes, with grasping hands, etc., and outperforms previous approaches in the 3D human shape generation task. Additionally, Mamba-HMR extends the capabilities of previous non-parametric human mesh recovery approaches, which were limited to handling body-only poses using around 500 vertex tokens, to the whole-body setting with face and hands, while achieving competitive performance in (near) real-time.
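
The serialization idea, turning an unordered vertex set into a sequence whose order respects body structure, can be sketched as a sort. The code below is a minimal illustration and not the authors' procedure: the choice of sorting by part label first and then by (y, x, z) template coordinates is an assumption, and `serialize_vertices` is a hypothetical name.

```python
def serialize_vertices(template_xyz, part_ids=None):
    """Return vertex indices in a fixed, structure-aware order.

    template_xyz: list of (x, y, z) positions on a template mesh.
    part_ids: optional list of integer body-part labels, one per vertex.
    Assumed ordering: by part label (if given), then by y, x, z.
    """
    def key(i):
        x, y, z = template_xyz[i]
        if part_ids is None:
            return (y, x, z)
        return (part_ids[i], y, x, z)

    return sorted(range(len(template_xyz)), key=key)
```

Because the ordering is computed once on the template, every mesh with the same topology maps to a token sequence with the same vertex order.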

[118] Improving Joint Embedding Predictive Architecture with Diffusion Noise

Yuping Qiu,Rui Zhu,Ying-cong Chen

Main category: cs.CV

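TL;DR: The paper proposes N-JEPA, which introduces diffusion noise into self-supervised learning by connecting masked image modeling (MIM) with diffusion models, in order to strengthen the model's representation capacity.

Details Motivation: Self-supervised learning (SSL) excels at discriminative tasks, while generative models do better at image generation and detail enhancement. The authors aim to combine SSL with the core principle of diffusion models, diffusion noise, to further improve SSL representations.

Contribution: 1. Proposes N-JEPA, which combines diffusion noise with masked image modeling; 2. Uses a multi-level noise schedule to enhance model robustness.

Method: 1. Treat diffusion noise as a particular state of mask; 2. Incorporate diffusion noise through the position embedding of masked tokens; 3. Use a multi-level noise schedule as a series of feature augmentations.

Result: Experiments confirm the effectiveness of N-JEPA on downstream classification tasks.

Insight: Diffusion noise and masked image modeling are closely related, and combining them can improve the model's semantic understanding.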

Abstract: Self-supervised learning has become an incredibly successful method for feature learning, widely applied to many downstream tasks. It has proven especially effective for discriminative tasks, surpassing the trending generative models. However, generative models perform better in image generation and detail enhancement. Thus, it is natural for us to find a connection between SSL and generative models to further enhance the representation capacity of SSL. As generative models can create new samples by approximating the data distribution, such modeling should also lead to a semantic understanding of the raw visual data, which is necessary for recognition tasks. This enlightens us to combine the core principle of the diffusion model: diffusion noise, with SSL to learn a competitive recognition model. Specifically, diffusion noise can be viewed as a particular state of mask that reveals a close relationship between masked image modeling (MIM) and diffusion models. In this paper, we propose N-JEPA (Noise-based JEPA) to incorporate diffusion noise into MIM by the position embedding of masked tokens. The multi-level noise schedule is a series of feature augmentations to further enhance the robustness of our model. We perform a comprehensive study to confirm its effectiveness in the classification of downstream tasks. Code will be released publicly soon.
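
The remark that diffusion noise is "a particular state of mask" is easier to see with the standard DDPM forward process, x_t = sqrt(alpha_bar) * x_0 + sqrt(1 - alpha_bar) * eps. The sketch below applies it to a plain feature vector. It illustrates only that textbook formula: how N-JEPA actually injects noise (through the position embedding of masked tokens) is not reproduced here, and `add_diffusion_noise` is a hypothetical name.

```python
import math
import random


def add_diffusion_noise(x0, alpha_bar, rng):
    """Standard forward-diffusion step on a list of floats.

    alpha_bar = 1.0 leaves x0 untouched (an unmasked token);
    alpha_bar = 0.0 replaces it with pure Gaussian noise (a fully masked token).
    Returns the noised vector and the noise that was drawn.
    """
    if not 0.0 <= alpha_bar <= 1.0:
        raise ValueError("alpha_bar must lie in [0, 1]")
    a = math.sqrt(alpha_bar)
    b = math.sqrt(1.0 - alpha_bar)
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    return [a * x + b * e for x, e in zip(x0, eps)], eps
```

Intermediate values of `alpha_bar` interpolate between the two extremes, which is one way to picture a multi-level noise schedule as a graded form of masking.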

[119] Hierarchical Part-based Generative Model for Realistic 3D Blood Vessel

Siqi Chen,Guoqing Zhang,Jiahao Lai,Bingzhi Shen,Sihong Zhang,Caixia Dong,Xuejin Chen,Yang Li

Main category: cs.CV

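TL;DR: The paper proposes a hierarchical part-based generative model that generates 3D vessel networks by separating global topology from local geometry, and outperforms existing methods on real-world datasets.

Details Motivation: The complex geometry and topology of 3D blood vessels are hard to represent accurately; existing methods struggle to model their branching patterns, curvature, and irregular shapes.

Contribution: Proposes the first part-based generative approach for 3D vessel modeling, separating the global binary tree-like topology from local geometric details for more accurate 3D vessel generation.

Method: Three stages: 1) key graph generation to model the overall hierarchical structure; 2) vessel segment generation conditioned on geometric properties; 3) hierarchical assembly that integrates the local segments into a complete vessel according to the global key graph.

Result: Validated on real-world datasets, the framework outperforms existing methods in modeling complex vascular networks and sets a new benchmark for vascular data generation.

Insight: A hierarchical part-based approach models complex vascular networks more effectively; separating global from local structure is the key.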

Abstract: Advancements in 3D vision have increased the impact of blood vessel modeling on medical applications. However, accurately representing the complex geometry and topology of blood vessels remains a challenge due to their intricate branching patterns, curvatures, and irregular shapes. In this study, we propose a hierarchical part-based framework for 3D vessel generation that separates the global binary tree-like topology from local geometric details. Our approach proceeds in three stages: (1) key graph generation to model the overall hierarchical structure, (2) vessel segment generation conditioned on geometric properties, and (3) hierarchical vessel assembly by integrating the local segments according to the global key graph. We validate our framework on real-world datasets, demonstrating superior performance over existing methods in modeling complex vascular networks. This work marks the first successful application of a part-based generative approach for 3D vessel modeling, setting a new benchmark for vascular data generation. The code is available at: https://github.com/CybercatChen/PartVessel.git.

[120] Mammo-SAE: Interpreting Breast Cancer Concept Learning with Sparse Autoencoders

Krishna Kanth Nakka

Main category: cs.CV

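TL;DR: The paper proposes Mammo-SAE, a sparse autoencoder (SAE)-based method for interpreting concept learning in the breast imaging foundation model Mammo-CLIP. Analyzing the SAE's latent features reveals how they relate to clinically relevant breast concepts and sheds light on the model's decision-making.

Details Motivation: In high-stakes domains such as medical imaging, interpretability is essential for clinical adoption. The study uses SAEs to gain insight into the internal workings of a foundation model, in particular how it learns breast cancer-related concepts.

Contribution: 1. Introduces SAE-based interpretability analysis to a breast imaging foundation model. 2. Identifies latent features associated with clinically relevant breast concepts such as mass and suspicious calcification. 3. Uncovers confounding factors in the model's decision-making and the latent neurons it relies on in downstream tasks.

Method: 1. Train a patch-level Mammo-SAE on Mammo-CLIP. 2. Use activation analysis to link latent neurons to clinical concepts. 3. Study the role of latent neurons during downstream finetuning.

Result: The top activated class-level latent neurons in the SAE latent space often align with ground truth regions, and several confounding factors that influence the model's decisions are uncovered. The study also identifies which latent neurons the model relies on during downstream finetuning.

Insight: SAE latent representations can make breast imaging foundation models more interpretable, helping to explain the internal workings of complex models and their role in medical decision-making.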

Abstract: Interpretability is critical in high-stakes domains such as medical imaging, where understanding model decisions is essential for clinical adoption. In this work, we introduce Sparse Autoencoder (SAE)-based interpretability to breast imaging by analyzing Mammo-CLIP, a vision–language foundation model pretrained on large-scale mammogram image–report pairs. We train a patch-level Mammo-SAE on Mammo-CLIP to identify and probe latent features associated with clinically relevant breast concepts such as mass and suspicious calcification. Our findings reveal that top activated class level latent neurons in the SAE latent space often tend to align with ground truth regions, and also uncover several confounding factors influencing the model’s decision-making process. Additionally, we analyze which latent neurons the model relies on during downstream finetuning for improving the breast concept prediction. This study highlights the promise of interpretable SAE latent representations in providing deeper insight into the internal workings of foundation models at every layer for breast imaging.
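
For readers unfamiliar with sparse autoencoders, the sketch below shows the generic recipe: an overcomplete ReLU encoder, a linear decoder, and a loss that adds an L1 sparsity penalty to the reconstruction error. It is a textbook formulation, not the Mammo-SAE implementation; the function names, the L1 coefficient, and the tiny weight matrices in the usage note are all made up for illustration.

```python
def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Encode with ReLU into sparse latents z, then decode linearly."""
    pre = [p + b for p, b in zip(matvec(W_enc, x), b_enc)]
    z = [max(0.0, p) for p in pre]
    x_hat = [r + b for r, b in zip(matvec(W_dec, z), b_dec)]
    return z, x_hat


def sae_loss(x, z, x_hat, l1_coeff=1e-3):
    """Mean squared reconstruction error plus an L1 penalty on the latents."""
    mse = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
    return mse + l1_coeff * sum(abs(v) for v in z)
```

For example, with a 2-dimensional input `[1.0, -2.0]` and four latents whose encoder rows are the positive and negative coordinate axes, only two latents fire, and each active latent can be inspected on its own. That per-latent readability is what makes SAEs attractive for probing concepts such as mass or calcification.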

[121] Cross-Domain Few-Shot Learning with Coalescent Projections and Latent Space Reservation

Naeem Paeedeh,Mahardhika Pratama,Wolfgang Mayer,Jimmy Cao,Ryszard Kowlczyk

Main category: cs.CV

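TL;DR: The paper proposes CPLSR, which combines Coalescent Projection and Latent Space Reservation to address overfitting in cross-domain few-shot learning, and shows its effectiveness under extreme domain shift.

Details Motivation: In cross-domain few-shot learning, a DINO-pretrained model with a prototypical classifier already outperforms the latest SOTA methods, but updating too many transformer parameters leads to overfitting because labeled samples are scarce. The authors propose a new method to overcome this.

Contribution: 1. Proposes Coalescent Projection (CP) as a successor to soft prompts; 2. Combines pseudo-class generation with Self-Supervised Transformations (SSTs) to improve generalization to new domains.

Method: 1. Use CP to limit the number of updated parameters; 2. Generate pseudo-classes with SSTs, relying solely on the base domain to prepare the network for unseen domains.

Result: The method demonstrates its effectiveness in comprehensive experiments on the extreme domain shift scenario of the BSCD-FSL benchmark.

Insight: CP and pseudo-class generation mitigate overfitting in cross-domain few-shot learning and improve generalization.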

Abstract: Despite the progress in Cross-Domain Few-Shot Learning (CD-FSL), a model pre-trained with DINO combined with a prototypical classifier outperforms the latest SOTA methods. A crucial limitation that needs to be overcome is that updating too many parameters of the transformers leads to overfitting due to the scarcity of labeled samples. To address this challenge, we propose a new concept, Coalescent Projection (CP), as an effective successor to soft prompts. Additionally, we propose a novel pseudo-class generation method combined with Self-Supervised Transformations (SSTs) that relies solely on the base domain to prepare the network for encountering unseen samples from different domains. The proposed method exhibits its effectiveness in comprehensive experiments on the extreme domain shift scenario of the BSCD-FSL benchmark. Our code is published at https://github.com/Naeem-Paeedeh/CPLSR.
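
The prototypical classifier mentioned as the strong baseline is simple enough to sketch: each class prototype is the mean of its support embeddings, and a query takes the label of the nearest prototype. The code below is a generic version with Euclidean distance, which is an assumption (cosine similarity is also common), and it does not include Coalescent Projection or pseudo-class generation. Function names are hypothetical.

```python
import math
from collections import defaultdict


def prototypes(support_feats, support_labels):
    """Mean embedding per class, computed from the few labeled support samples."""
    groups = defaultdict(list)
    for feat, label in zip(support_feats, support_labels):
        groups[label].append(feat)
    return {
        label: [sum(col) / len(feats) for col in zip(*feats)]
        for label, feats in groups.items()
    }


def classify(query, protos):
    """Label of the prototype closest to the query embedding."""
    return min(protos, key=lambda label: math.dist(query, protos[label]))
```

Because the classifier itself has no trainable parameters, any overfitting in this setup comes from how the backbone is adapted, which is the problem the abstract targets.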

[122] FreeCus: Free Lunch Subject-driven Customization in Diffusion Transformers

Yanbing Zhang,Zhe Wang,Qin Zhou,Mengping Yang

Main category: cs.CV

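TL;DR: FreeCus is a training-free framework that activates the zero-shot ability of diffusion transformers (DiT) through a new attention sharing mechanism and an upgraded variant of DiT's dynamic shifting, enabling subject-driven customization that is compatible with existing inpainting pipelines and control modules.

Details Motivation: Existing subject-driven methods depend on training (per-subject optimization via trainable text embeddings, or specialized encoders), which constrains practical use. FreeCus aims to exploit the zero-shot potential of diffusion transformers with a training-free solution.

Contribution: 1) Proposes an attention sharing mechanism; 2) Upgrades DiT's dynamic shifting to improve fine-grained feature extraction; 3) Integrates Multimodal Large Language Models (MLLMs) to enrich cross-modal semantic representations.

Method: Attention sharing, an upgraded variant of DiT's dynamic shifting, and MLLM integration together activate DiT's zero-shot ability for subject-driven synthesis.

Result: In zero-shot subject-driven synthesis, FreeCus achieves state-of-the-art or comparable results relative to approaches that require additional training, and it is compatible with existing inpainting pipelines and control modules.

Insight: Diffusion transformers have zero-shot potential that the right mechanisms can unlock for subject-driven customization without training, while remaining compatible with existing tools.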

Abstract: In light of recent breakthroughs in text-to-image (T2I) generation, particularly with diffusion transformers (DiT), subject-driven technologies are increasingly being employed for high-fidelity customized production that preserves subject identity from reference inputs, enabling thrilling design workflows and engaging entertainment. Existing alternatives typically require either per-subject optimization via trainable text embeddings or training specialized encoders for subject feature extraction on large-scale datasets. Such dependencies on training procedures fundamentally constrain their practical applications. More importantly, current methodologies fail to fully leverage the inherent zero-shot potential of modern diffusion transformers (e.g., the Flux series) for authentic subject-driven synthesis. To bridge this gap, we propose FreeCus, a genuinely training-free framework that activates DiT’s capabilities through three key innovations: 1) We introduce a pivotal attention sharing mechanism that captures the subject’s layout integrity while preserving crucial editing flexibility. 2) Through a straightforward analysis of DiT’s dynamic shifting, we propose an upgraded variant that significantly improves fine-grained feature extraction. 3) We further integrate advanced Multimodal Large Language Models (MLLMs) to enrich cross-modal semantic representations. Extensive experiments reflect that our method successfully unlocks DiT’s zero-shot ability for consistent subject synthesis across diverse contexts, achieving state-of-the-art or comparable results compared to approaches that require additional training. Notably, our framework demonstrates seamless compatibility with existing inpainting pipelines and control modules, facilitating more compelling experiences. Our code is available at: https://github.com/Monalissaa/FreeCus.

[123] MinCD-PnP: Learning 2D-3D Correspondences with Approximate Blind PnP

Pei An,Jiaqi Yang,Muyao Peng,You Yang,Qiong Liu,Xiaolin Wu,Liangliang Nan

Main category: cs.CV

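TL;DR: The paper proposes MinCD-PnP, a 2D-3D correspondence learning approach that uses an approximate blind PnP to overcome the sensitivity of differential PnP to noise and outliers, together with a lightweight multi-task module, MinCD-Net, which improves registration in cross-scene and cross-dataset settings.

Details Motivation: In image-to-point-cloud registration, differential PnP is sensitive to noise and outliers in the predicted correspondences, which hinders correspondence learning. Inspired by the robustness of blind PnP, the authors propose an approximate blind PnP approach.

Contribution: 1. Proposes MinCD-PnP, which simplifies blind PnP to minimizing the Chamfer distance between learned 2D and 3D keypoints; 2. Designs MinCD-Net, a lightweight multi-task learning module that improves registration performance.

Method: MinCD-PnP reduces blind PnP to a Chamfer distance minimization task, solved by the MinCD-Net module, which can be integrated into existing I2P registration architectures.

Result: On 7-Scenes, RGBD-V2, ScanNet, and self-collected datasets, MinCD-Net outperforms state-of-the-art methods and achieves a higher inlier ratio (IR) and registration recall (RR) in both cross-scene and cross-dataset settings.

Insight: Approximate blind PnP offers a new way to deal with noise and outliers in 2D-3D correspondence learning, and the Chamfer distance simplification reduces the computational cost of blind PnP.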

Abstract: Image-to-point-cloud (I2P) registration is a fundamental problem in computer vision, focusing on establishing 2D-3D correspondences between an image and a point cloud. The differential perspective-n-point (PnP) has been widely used to supervise I2P registration networks by enforcing the projective constraints on 2D-3D correspondences. However, differential PnP is highly sensitive to noise and outliers in the predicted correspondences. This issue hinders the effectiveness of correspondence learning. Inspired by the robustness of blind PnP against noise and outliers in correspondences, we propose an approximated blind PnP based correspondence learning approach. To mitigate the high computational cost of blind PnP, we simplify blind PnP to an amenable task of minimizing Chamfer distance between learned 2D and 3D keypoints, called MinCD-PnP. To effectively solve MinCD-PnP, we design a lightweight multi-task learning module, named as MinCD-Net, which can be easily integrated into the existing I2P registration architectures. Extensive experiments on 7-Scenes, RGBD-V2, ScanNet, and self-collected datasets demonstrate that MinCD-Net outperforms state-of-the-art methods and achieves a higher inlier ratio (IR) and registration recall (RR) in both cross-scene and cross-dataset settings.
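
The core quantity here, a Chamfer distance between 2D keypoints and projected 3D keypoints, needs no known correspondences, which is what makes the formulation "blind". The sketch below evaluates it for a given pose under a pinhole camera model. It is an illustration under assumptions, not the paper's loss: the symmetric, mean-based Chamfer form, the intrinsics parameters, and the function names are all hypothetical.

```python
import math


def project(points_3d, R, t, fx, fy, cx, cy):
    """Pinhole projection of 3D points after applying rotation R and translation t."""
    out = []
    for p in points_3d:
        x, y, z = (sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3))
        out.append((fx * x / z + cx, fy * y / z + cy))
    return out


def chamfer(a, b):
    """Symmetric Chamfer distance between two 2D point sets (mean nearest-neighbor)."""
    d_ab = sum(min(math.dist(p, q) for q in b) for p in a) / len(a)
    d_ba = sum(min(math.dist(q, p) for p in a) for q in b) / len(b)
    return d_ab + d_ba
```

Under the correct pose the projected 3D keypoints coincide with the 2D keypoints regardless of their order, so the distance is zero; a wrong pose raises it, which gives a supervision signal without matching points one by one.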

[124] Conditional Video Generation for High-Efficiency Video Compression

Fangqiu Yi,Jingyu Xu,Jiawei Shao,Chi Zhang,Xuelong Li

Main category: cs.CV

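TL;DR: The paper proposes a video compression framework based on conditional diffusion models. Its generative approach optimizes perceptual quality and significantly outperforms traditional and neural codecs.

Details Motivation: Traditional video compression methods largely ignore the characteristics of human visual perception, so perceptual quality drops at high compression ratios. The paper uses the conditional generation ability of diffusion models to optimize the perceptual quality of compressed video.

Contribution: 1) Reframes video compression as a conditional generation task; 2) Proposes multi-granular conditioning, compact representations, and multi-condition training; 3) Significantly improves compression performance on perceptual quality metrics such as FVD and LPIPS.

Method: Three key modules: 1) multi-granular conditioning that captures static scene structure and dynamic spatio-temporal cues; 2) compact representations designed for efficient transmission; 3) multi-condition training with modality dropout and role-aware embeddings to enhance robustness.

Result: The framework significantly outperforms traditional and neural codecs on perceptual quality metrics such as FVD and LPIPS, especially under high compression ratios.

Insight: Conditional diffusion models can synthesize high-quality video from sparse signals, and multi-granular conditioning with multi-condition training further improves robustness and perceptual quality.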

Abstract: Perceptual studies demonstrate that conditional diffusion models excel at reconstructing video content aligned with human visual perception. Building on this insight, we propose a video compression framework that leverages conditional diffusion models for perceptually optimized reconstruction. Specifically, we reframe video compression as a conditional generation task, where a generative model synthesizes video from sparse, yet informative signals. Our approach introduces three key modules: (1) Multi-granular conditioning that captures both static scene structure and dynamic spatio-temporal cues; (2) Compact representations designed for efficient transmission without sacrificing semantic richness; (3) Multi-condition training with modality dropout and role-aware embeddings, which prevent over-reliance on any single modality and enhance robustness. Extensive experiments show that our method significantly outperforms both traditional and neural codecs on perceptual quality metrics such as Fréchet Video Distance (FVD) and LPIPS, especially under high compression ratios.
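
Modality dropout, named above as a way to prevent over-reliance on any single conditioning signal, can be sketched in a few lines. This is a generic version, not the paper's training code: the independent per-modality drop probability, the rule that at least one modality is always kept, and the modality names in the usage note are assumptions.

```python
import random


def modality_dropout(conditions, p_drop, rng):
    """Randomly drop conditioning modalities during training.

    conditions: dict mapping a modality name to its conditioning signal.
    Each modality is dropped independently with probability p_drop;
    if everything would be dropped, one modality is kept at random (assumed rule).
    """
    kept = {k: v for k, v in conditions.items() if rng.random() >= p_drop}
    if not kept:
        name = rng.choice(sorted(conditions))
        kept[name] = conditions[name]
    return kept
```

In a training loop this would be called once per sample, for instance on a dict with hypothetical keys such as `"keyframe"`, `"motion"`, and `"text"`, so that the generator learns to reconstruct video from varying subsets of its conditions.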

[125] In-context Learning of Vision Language Models for Detection of Physical and Digital Attacks against Face Recognition Systems

Lazaro Janier Gonzalez-Soler,Maciej Salwowski,Christoph Busch

Main category: cs.CV

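TL;DR: The paper studies how the in-context learning of Vision Language Models (VLMs) can be used to detect physical and digital attacks on face recognition systems, proposing a lightweight approach that does not need large amounts of training data.

Details Motivation: Traditional deep learning models for detecting attacks on face recognition need large amounts of training data and adapt poorly to new attack types or changing environmental conditions. The generality and in-context learning ability of VLMs offer a possible way around these problems.

Contribution: Establishes the first systematic framework for quantitatively evaluating VLMs in security-critical scenarios through in-context learning, and shows that open-source models reach competitive detection performance, outperforming some traditional CNNs, without resource-intensive training.

Method: Applies in-context learning with VLMs, using a small number of examples for the detection task and relying on the general capabilities of open-source models instead of conventional training.

Result: On freely available databases, the approach is competitive for physical and digital attack detection and outperforms some traditional CNNs; the results support the framework as a promising tool for improving generalisation.

Insight: In-context learning offers a lightweight route to attack detection in security settings; the generality of VLMs reduces dependence on large labeled datasets while coping with varied attack scenarios.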

Abstract: Recent advances in biometric systems have significantly improved the detection and prevention of fraudulent activities. However, as detection methods improve, attack techniques become increasingly sophisticated. Attacks on face recognition systems can be broadly divided into physical and digital approaches. Traditionally, deep learning models have been the primary defence against such attacks. While these models perform exceptionally well in scenarios for which they have been trained, they often struggle to adapt to different types of attacks or varying environmental conditions. These subsystems require substantial amounts of training data to achieve reliable performance, yet biometric data collection faces significant challenges, including privacy concerns and the logistical difficulties of capturing diverse attack scenarios under controlled conditions. This work investigates the application of Vision Language Models (VLM) and proposes an in-context learning framework for detecting physical presentation attacks and digital morphing attacks in biometric systems. Focusing on open-source models, the first systematic framework for the quantitative evaluation of VLMs in security-critical scenarios through in-context learning techniques is established. The experimental evaluation conducted on freely available databases demonstrates that the proposed subsystem achieves competitive performance for physical and digital attack detection, outperforming some of the traditional CNNs without resource-intensive training. The experimental results validate the proposed framework as a promising tool for improving generalisation in attack detection.

[126] Minutiae-Anchored Local Dense Representation for Fingerprint Matching

Zhiyu Pan,Xiongjun Guan,Yongjie Duan,Jianjiang Feng,Jie Zhou

Main category: cs.CV

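TL;DR: The paper proposes DMD, a minutiae-anchored local dense representation for fingerprint matching that captures both ridge texture and minutiae features in a spatially structured manner. Experiments show strong accuracy on a range of fingerprint datasets, with high efficiency and good generalization.

Details Motivation: Fingerprint matching under diverse capture conditions remains challenging; it needs a representation that captures fine-grained texture while also exploiting minutiae features, to improve robustness and accuracy.

Contribution: Proposes the DMD representation: minutiae-anchored local dense descriptors that combine spatial structure with semantic features and improve the accuracy and efficiency of fingerprint matching.

Method: Descriptors are extracted from local patches centered and oriented on each detected minutia, forming a three-dimensional tensor; foreground segmentation masks restrict matching to valid regions.

Result: Achieves state-of-the-art accuracy across multiple fingerprint benchmarks while remaining computationally efficient, making it suitable for large-scale fingerprint recognition.

Insight: By combining spatial structure with semantic features, DMD enables robust and accurate fingerprint matching under diverse conditions.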

Abstract: Fingerprint matching under diverse capture conditions remains a fundamental challenge in biometric recognition. To achieve robust and accurate performance in such scenarios, we propose DMD, a minutiae-anchored local dense representation which captures both fine-grained ridge textures and discriminative minutiae features in a spatially structured manner. Specifically, descriptors are extracted from local patches centered and oriented on each detected minutia, forming a three-dimensional tensor, where two dimensions represent spatial locations on the fingerprint plane and the third encodes semantic features. This representation explicitly captures abstract features of local image patches, enabling a multi-level, fine-grained description that aggregates information from multiple minutiae and their surrounding ridge structures. Furthermore, thanks to its strong spatial correspondence with the patch image, DMD allows for the use of foreground segmentation masks to identify valid descriptor regions. During matching, comparisons are then restricted to overlapping foreground areas, improving efficiency and robustness. Extensive experiments on rolled, plain, partial, contactless, and latent fingerprint datasets demonstrate the effectiveness and generalizability of the proposed method. It achieves state-of-the-art accuracy across multiple benchmarks while maintaining high computational efficiency, showing strong potential for large-scale fingerprint recognition. Corresponding code is available at https://github.com/Yu-Yy/DMD.
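
The matching rule described above, comparing descriptors only where both foreground masks are valid, can be sketched as a masked similarity. The code below is a simplified illustration, not the DMD matcher: it assumes the two descriptor tensors are already spatially aligned, stores them as nested lists of shape H x W x C, and averages cosine similarity over the overlap. Function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)


def masked_similarity(desc_a, mask_a, desc_b, mask_b):
    """Mean cosine similarity over cells where both foreground masks are set."""
    scores = []
    for row_a, row_b, m_a, m_b in zip(desc_a, desc_b, mask_a, mask_b):
        for feat_a, feat_b, v_a, v_b in zip(row_a, row_b, m_a, m_b):
            if v_a and v_b:
                scores.append(cosine(feat_a, feat_b))
    return sum(scores) / len(scores) if scores else 0.0
```

Skipping background cells keeps unreliable regions, such as the missing area of a partial or latent print, from diluting the score.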

[127] BenchDepth: Are We on the Right Way to Evaluate Depth Foundation Models?

Zhenyu Li,Haotong Lin,Jiashi Feng,Peter Wonka,Bingyi Kang

Main category: cs.CV

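TL;DR: The paper proposes BenchDepth, a new benchmark that evaluates depth foundation models (DFMs) through downstream proxy tasks, avoiding the biases of traditional alignment-based metrics.

Details Motivation: Evaluation protocols for depth foundation models are inconsistent, and traditional alignment-based metrics introduce biases that make fair comparison difficult.

Contribution: Proposes the BenchDepth benchmark, which evaluates the practical utility of DFMs through five downstream proxy tasks and bypasses alignment procedures.

Method: Five downstream proxy tasks are selected (depth completion, stereo matching, monocular feed-forward 3D scene reconstruction, SLAM, and vision-language spatial understanding) to assess DFMs by their usefulness in real-world applications.

Result: Eight state-of-the-art DFMs are benchmarked, with an in-depth analysis of the key findings.

Insight: Evaluating a model's practical utility through proxy tasks is fairer than traditional alignment-based metrics and offers a new direction for evaluating depth estimation.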

Abstract: Depth estimation is a fundamental task in computer vision with diverse applications. Recent advancements in deep learning have led to powerful depth foundation models (DFMs), yet their evaluation remains challenging due to inconsistencies in existing protocols. Traditional benchmarks rely on alignment-based metrics that introduce biases, favor certain depth representations, and complicate fair comparisons. In this work, we propose BenchDepth, a new benchmark that evaluates DFMs through five carefully selected downstream proxy tasks: depth completion, stereo matching, monocular feed-forward 3D scene reconstruction, SLAM, and vision-language spatial understanding. Unlike conventional evaluation protocols, our approach assesses DFMs based on their practical utility in real-world applications, bypassing problematic alignment procedures. We benchmark eight state-of-the-art DFMs and provide an in-depth analysis of key findings and observations. We hope our work sparks further discussion in the community on best practices for depth model evaluation and paves the way for future research and advancements in depth estimation.
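
For context on what an "alignment-based metric" involves, the sketch below shows a common form of it: a least-squares scale and shift fitted between prediction and ground truth before an error such as absolute relative error is computed. This illustrates the kind of step BenchDepth sets out to bypass; it is not code from the benchmark, and the function names are hypothetical.

```python
def align_scale_shift(pred, gt):
    """Least-squares s, t minimizing sum((s * pred + t - gt) ** 2)."""
    n = len(pred)
    mean_p = sum(pred) / n
    mean_g = sum(gt) / n
    var = sum((p - mean_p) ** 2 for p in pred)
    if var == 0:
        raise ValueError("prediction is constant; scale is undefined")
    cov = sum((p - mean_p) * (g - mean_g) for p, g in zip(pred, gt))
    s = cov / var
    return s, mean_g - s * mean_p


def abs_rel(pred, gt):
    """Mean absolute relative error; gt values must be positive."""
    return sum(abs(p - g) / g for p, g in zip(pred, gt)) / len(gt)
```

Because the fit uses the ground truth itself, the reported error depends on which quantity is aligned (depth, inverse depth, and so on), which is one way such metrics can favor certain depth representations.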

[128] ExDD: Explicit Dual Distribution Learning for Surface Defect Detection via Diffusion Synthesis

Muhammad Aqeel,Federico Leonardi,Francesco Setti

Main category: cs.CV

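TL;DR: ExDD is an explicit dual distribution learning framework that synthesizes defect data with diffusion models, addressing data scarcity and the uniform-outlier assumption in industrial defect detection and improving detection performance.

Details Motivation: Existing one-class anomaly detection paradigms assume uniform outlier distributions and struggle with data scarcity in real manufacturing environments. ExDD overcomes these limits by explicitly modeling dual feature distributions.

Contribution: Proposes the ExDD framework, which uses parallel memory banks to model the distributions of both normal and anomalous patterns explicitly, and uses diffusion models to generate synthetic defects that preserve industrial context.

Method: Latent diffusion models with domain-specific textual conditioning generate synthetic defect data, and a neighborhood-aware ratio scoring mechanism fuses complementary distance metrics.

Result: On the KSDD2 dataset, ExDD achieves 94.2% I-AUROC and 97.7% P-AUROC, with optimal augmentation at 100 synthetic samples.

Insight: Explicitly modeling dual distributions and generating synthetic data with diffusion models can address data scarcity and the uniform-outlier assumption in industrial defect detection.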

Abstract: Industrial defect detection systems face critical limitations when confined to one-class anomaly detection paradigms, which assume uniform outlier distributions and struggle with data scarcity in real-world manufacturing environments. We present ExDD (Explicit Dual Distribution), a novel framework that transcends these limitations by explicitly modeling dual feature distributions. Our approach leverages parallel memory banks that capture the distinct statistical properties of both normality and anomalous patterns, addressing the fundamental flaw of uniform outlier assumptions. To overcome data scarcity, we employ latent diffusion models with domain-specific textual conditioning, generating in-distribution synthetic defects that preserve industrial context. Our neighborhood-aware ratio scoring mechanism elegantly fuses complementary distance metrics, amplifying signals in regions exhibiting both deviation from normality and similarity to known defect patterns. Experimental validation on KSDD2 demonstrates superior performance (94.2% I-AUROC, 97.7% P-AUROC), with optimal augmentation at 100 synthetic samples.
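
One way to picture a ratio score over two memory banks is shown below: the distance of a feature to its nearest normal neighbors divided by its distance to its nearest known-defect neighbors. The score grows when a feature is both far from normality and close to defect patterns. The abstract does not give the exact formula, so this form, the k-nearest-neighbor averaging, the `eps` stabilizer, and the function names are all assumptions.

```python
import math


def knn_mean_dist(x, bank, k):
    """Mean Euclidean distance from x to its k nearest entries in a memory bank."""
    dists = sorted(math.dist(x, m) for m in bank)
    nearest = dists[:k]
    return sum(nearest) / len(nearest)


def ratio_score(x, normal_bank, defect_bank, k=1, eps=1e-6):
    """Assumed anomaly score: distance to normal / distance to known defects."""
    d_normal = knn_mean_dist(x, normal_bank, k)
    d_defect = knn_mean_dist(x, defect_bank, k)
    return d_normal / (d_defect + eps)
```

In this picture the defect bank would be filled from the diffusion-generated synthetic samples, which is how the two halves of the method connect.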

[129] DAViD: Data-efficient and Accurate Vision Models from Synthetic Data

Fatemeh Saleh,Sadegh Aliakbarian,Charlie Hewitt,Lohit Petikam,Xiao-Xian,Antonio Criminisi,Thomas J. Cashman,Tadas Baltrušaitis

Main category: cs.CV

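TL;DR: DAViD trains efficient, accurate vision models on high-fidelity synthetic data, avoiding the need for very large datasets while maintaining accuracy.

Details Motivation: Human-centric computer vision traditionally requires huge datasets and compute resources. DAViD aims to address this with synthetic data, providing an efficient training approach with controllable data provenance.

Contribution: 1. Trains models on high-fidelity synthetic data, reducing data requirements. 2. Controls data diversity through procedural generation to address model unfairness. 3. Validates model efficiency and accuracy on multiple dense prediction tasks.

Method: High-quality synthetic data is generated procedurally, with perfect labels and fine detail, and used to train lightweight models. Control over data diversity is used to improve model robustness.

Result: Experiments show that DAViD models match the accuracy of large models on depth estimation, surface normal estimation, and foreground segmentation while substantially reducing training and inference cost.

Insight: Synthetic data can effectively replace real data, especially where data privacy and efficiency matter. Procedural generation provides controllable data diversity, which helps address fairness.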

Abstract: The state of the art in human-centric computer vision achieves high accuracy and robustness across a diverse range of tasks. The most effective models in this domain have billions of parameters, thus requiring extremely large datasets, expensive training regimes, and compute-intensive inference. In this paper, we demonstrate that it is possible to train models on much smaller but high-fidelity synthetic datasets, with no loss in accuracy and higher efficiency. Using synthetic training data provides us with excellent levels of detail and perfect labels, while providing strong guarantees for data provenance, usage rights, and user consent. Procedural data synthesis also provides us with explicit control on data diversity, that we can use to address unfairness in the models we train. Extensive quantitative assessment on real input images demonstrates accuracy of our models on three dense prediction tasks: depth estimation, surface normal estimation, and soft foreground segmentation. Our models require only a fraction of the cost of training and inference when compared with foundational models of similar accuracy. Our human-centric synthetic dataset and trained models are available at https://aka.ms/DAViD.

[130] Rethinking Occlusion in FER: A Semantic-Aware Perspective and Go Beyond

Huiyu Zhai,Xingxing Yang,Yalan Ye,Chenyang Li,Bin Fan,Changze Li

Main category: cs.CV

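TL;DR: The paper proposes ORSANet, which improves facial expression recognition under occlusion through multi-modal semantic guidance, a multi-scale cross-interaction module, and a dynamic adversarial repulsion loss, and builds Occlu-FER, the first occlusion-oriented dataset.

Details Motivation: Existing FER models perform poorly when the face is partially occluded: they fail to extract effective features, which leads to inaccurate classification. The authors address this with semantic guidance and multi-modal fusion.

Contribution: 1. Introduces multi-modal semantic guidance (semantic segmentation maps and facial landmarks) to disambiguate occlusion; 2. Designs a Multi-scale Cross-interaction Module (MCM) to fuse multi-modal features; 3. Proposes the Dynamic Adversarial Repulsion Enhancement Loss (DARELoss) to better distinguish similar expressions; 4. Builds Occlu-FER, the first occlusion-oriented FER dataset.

Method: 1. Uses semantic segmentation maps and facial landmarks as priors to enhance feature representations; 2. Adaptively fuses multi-scale features with the MCM; 3. Uses DARELoss to dynamically adjust the margins of ambiguous classes.

Result: Achieves SOTA performance on public benchmarks and on Occlu-FER.

Insight: Combining semantic and geometric priors effectively mitigates occlusion, and a dynamic loss function further improves robustness to inter-class ambiguity.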

Abstract: Facial expression recognition (FER) is a challenging task due to pervasive occlusion and dataset biases. Especially when facial information is partially occluded, existing FER models struggle to extract effective facial features, leading to inaccurate classifications. In response, we present ORSANet, which introduces the following three key contributions: First, we introduce auxiliary multi-modal semantic guidance to disambiguate facial occlusion and learn high-level semantic knowledge, which is two-fold: 1) we introduce semantic segmentation maps as dense semantics prior to generate semantics-enhanced facial representations; 2) we introduce facial landmarks as sparse geometric prior to mitigate intrinsic noises in FER, such as identity and gender biases. Second, to facilitate the effective incorporation of these two multi-modal priors, we customize a Multi-scale Cross-interaction Module (MCM) to adaptively fuse the landmark feature and semantics-enhanced representations within different scales. Third, we design a Dynamic Adversarial Repulsion Enhancement Loss (DARELoss) that dynamically adjusts the margins of ambiguous classes, further enhancing the model’s ability to distinguish similar expressions. We further construct the first occlusion-oriented FER dataset to facilitate specialized robustness analysis on various real-world occlusion conditions, dubbed Occlu-FER. Extensive experiments on both public benchmarks and Occlu-FER demonstrate that our proposed ORSANet achieves SOTA recognition performance. Code is publicly available at https://github.com/Wenyuzhy/ORSANet-master.

[131] SurgX: Neuron-Concept Association for Explainable Surgical Phase Recognition

Ka Young Kim,Hyeon Bae Kim,Seong Tae Kim

Main category: cs.CV

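TL;DR: SurgX is a new concept-based explanation framework that associates neurons with relevant concepts in surgical videos, improving the interpretability of surgical phase recognition models and addressing the opacity of existing deep learning methods.

Details Motivation: Surgical phase recognition is central to surgical workflow analysis, but existing deep learning methods lack interpretability, which hinders trust in and debugging of the models. SurgX aims to improve interpretability through concept association.

Contribution: 1. Proposes the SurgX framework, which associates neurons with concepts in surgical videos; 2. Designs methods for selecting representative sequences, constructing a concept set, and identifying key neurons; 3. Validates the framework on two surgical phase recognition models.

Method: 1. Select representative example sequences for each neuron; 2. Construct a concept set tailored to the surgical video dataset; 3. Associate neurons with concepts and identify the neurons crucial for predictions; 4. Validate the framework's explanatory power experimentally.

Result: Experiments on two surgical phase recognition models show that SurgX can explain model predictions, improving interpretability.

Insight: Concept association is an effective route to model transparency in complex vision tasks such as surgical phase recognition.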

Abstract: Surgical phase recognition plays a crucial role in surgical workflow analysis, enabling various applications such as surgical monitoring, skill assessment, and workflow optimization. Despite significant advancements in deep learning-based surgical phase recognition, these models remain inherently opaque, making it difficult to understand how they make decisions. This lack of interpretability hinders trust and makes it challenging to debug the model. To address this challenge, we propose SurgX, a novel concept-based explanation framework that enhances the interpretability of surgical phase recognition models by associating neurons with relevant concepts. In this paper, we introduce the process of selecting representative example sequences for neurons, constructing a concept set tailored to the surgical video dataset, associating neurons with concepts and identifying neurons crucial for predictions. Through extensive experiments on two surgical phase recognition models, we validate our method and analyze the explanation for prediction. This highlights the potential of our method in explaining surgical phase recognition. The code is available at https://github.com/ailab-kyunghee/SurgX

[132] EgoPrune: Efficient Token Pruning for Egomotion Video Reasoning in Embodied Agent

Jiaao Li,Kaiyuan Li,Chen Gao,Yong Li,Xinlei Chen

Main category: cs.CV

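TL;DR: EgoPrune is an efficient token pruning method designed for first-person (egomotion) video. It combines keyframe selection, perspective-aware redundancy filtering, and a token selector that balances diversity and relevance, substantially improving computational efficiency and inference speed.

Details Motivation: Egomotion videos are dynamic and redundant, which challenges existing token pruning methods designed for third-person video. An efficient, training-free pruning method is needed for real-world deployment.

Contribution: Proposes EgoPrune, an efficient token pruning method for egomotion video with three core components: keyframe selection, Perspective-Aware Redundancy Filtering (PARF), and an MMR-based token selector.

Method: EgoPrune combines keyframe sampling (adapted from EmbodiedR), PARF (which aligns tokens via perspective transformations and filters redundant ones), and an MMR-based selector (which balances visual-text relevance and intra-frame diversity), with no additional training.

Result: On multiple benchmarks, EgoPrune outperforms existing training-free methods, significantly reduces computational overhead and latency, and is validated on an edge device.

Insight: By exploiting the spatiotemporal continuity and motion constraints of egomotion video, EgoPrune shows the potential of token pruning for dynamic visual input and offers a new direction for efficient inference in embodied AI.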

Abstract: Egomotion videos are first-person recordings where the view changes continuously due to the agent’s movement. As they serve as the primary visual input for embodied AI agents, making egomotion video reasoning more efficient is therefore essential for real-world deployment. Recent advances in vision-language models have enabled strong multimodal reasoning capabilities, but their computational cost remains prohibitive for long, redundant video inputs. Existing token pruning methods, typically designed for third-person videos, fail to leverage the spatiotemporal continuity and motion constraints inherent in egomotion settings. To address this, we propose EgoPrune, a training-free token pruning method tailored for egomotion video reasoning. EgoPrune comprises three components: a keyframe selector adapted from EmbodiedR for temporally efficient sampling; Perspective-Aware Redundancy Filtering (PARF), which aligns visual tokens using perspective transformations and removes redundant tokens; and a Maximal Marginal Relevance (MMR)-based token selector that jointly considers visual-text relevance and intra-frame diversity. Experiments on two egomotion video benchmarks show that EgoPrune consistently outperforms prior training-free methods across various pruning ratios while significantly reducing FLOPs, memory usage, and latency. Moreover, we deploy EgoPrune on an embodied agent equipped with a Jetson Orin NX 16GB edge device, demonstrating its real-world efficiency and suitability for on-device egomotion video reasoning.
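
The MMR-based token selector mentioned in the EgoPrune abstract builds on Maximal Marginal Relevance, a standard greedy selection rule. The sketch below shows generic MMR over toy vectors. It is not EgoPrune's implementation: the cosine similarity, the trade-off weight `lam`, and the function names are assumptions chosen for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def mmr_select(tokens, query, budget, lam=0.5):
    """Greedily pick `budget` token indices by Maximal Marginal Relevance:
    reward similarity to the query, penalize similarity to tokens already kept."""
    relevance = [cosine(t, query) for t in tokens]
    selected = []
    candidates = list(range(len(tokens)))
    while candidates and len(selected) < budget:
        def mmr(i):
            redundancy = max((cosine(tokens[i], tokens[j]) for j in selected),
                             default=0.0)
            return lam * relevance[i] - (1.0 - lam) * redundancy
        best = max(candidates, key=mmr)
        selected.append(best)
        candidates.remove(best)
    return selected

# Toy visual tokens: 0 and 1 are near-duplicates, 3 points away from the query.
tokens = [(1.0, 0.0), (0.99, 0.05), (0.6, 0.8), (-1.0, 0.0)]
query = (1.0, 0.2)  # stand-in for a text embedding
kept = mmr_select(tokens, query, budget=2, lam=0.5)
```

In this toy case the selector keeps the most relevant token first, then skips its near-duplicate in favor of a token that is still relevant but adds diversity.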

[133] One Last Attention for Your Vision-Language Model

Liang Chen,Ghazi Shazan Ahmad,Tianjun Yao,Lingqiao Liu,Zhiqiang Shen

Main category: cs.CV

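TL;DR: The paper proposes Rational Adaptation (RAda), a simple and effective fine-tuning method that dynamically calibrates the contribution of the fused representation in a vision-language model, improving downstream performance over the baseline.

Details Motivation: Vision-language models such as CLIP perform well zero-shot, but their downstream potential depends on effective fine-tuning. Existing methods mostly refine single-modality representations and neglect the role of the fused representation in the decision process.

Contribution: Proposes RAda, which learns a mask through a lightweight attention layer to dynamically adjust the contribution of each element in the fused representation, thereby refining cross-modal interaction.

Method: RAda attaches a lightweight attention layer at the end of the pretrained model and learns a mask that calibrates the contributions of the fused representation, without modifying intermediate features.

Result: Experiments show that RAda improves on the baseline in different settings (updating or freezing the pretrained encoders, and test-time training with only unlabeled test data) and performs comparably to current state-of-the-art methods in most settings.

Insight: Refining the fused representation matters for downstream performance of vision-language models, and RAda offers an efficient way to do so during fine-tuning.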

Abstract: Pretrained vision-language models (VLMs), such as CLIP, achieve remarkable zero-shot performance, yet their downstream potential hinges on effective fine-tuning. Most adaptation methods typically focus on refining representation from separate modalities (text or vision) but neglect the critical role of their fused representations in the decision-making process, i.e., the rational matrix that drives the final prediction. To bridge the gap, we propose a simple yet effective Rational Adaptation (RAda) to explicitly exploit the final fused representation during fine-tuning. RAda employs a learned mask, obtained from a lightweight attention layer attached at the end of a VLM, to dynamically calibrate the contribution of each element in the rational matrix, enabling targeted adjustments to the final cross-modal interactions without incurring costly modifications to intermediate features. Experiments in different settings (i.e., updating or freezing pretrained encoders in adaptation, and test-time training that can only access the unlabeled test data) show that RAda serves as a versatile fine-tuning technique, improving the baseline with minimal code and performing comparably against current state-of-the-art methods in most settings. Code is available at https://github.com/khufia/RAda/tree/main.
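
The masking step in the RAda abstract can be pictured with a toy example. The abstract does not define the rational matrix, so this sketch assumes it is the element-wise product of the image feature with each class's text feature, so that summing a row gives the usual dot-product logit. The mask values are made up by hand here, whereas RAda learns them with an attention layer. All names are hypothetical.

```python
def rational_matrix(image_feat, text_feats):
    """Assumed definition: row k is the element-wise product of the image
    feature and the class-k text feature."""
    return [[i * t for i, t in zip(image_feat, text)] for text in text_feats]

def masked_logits(rational, mask):
    """Re-weight every element by the mask, then sum each row into a logit."""
    return [sum(m * r for m, r in zip(mask_row, row))
            for mask_row, row in zip(mask, rational)]

image_feat = [0.5, 0.5, 0.7]
text_feats = [[0.6, 0.0, 0.8],   # class 0
              [0.0, 1.0, 0.0]]   # class 1

R = rational_matrix(image_feat, text_feats)

# An all-ones mask recovers the plain dot-product logits.
ones = [[1.0] * 3 for _ in text_feats]
baseline = masked_logits(R, ones)

# A hypothetical mask that down-weights the second feature dimension.
mask = [[1.0, 0.2, 1.0], [1.0, 0.2, 1.0]]
calibrated = masked_logits(R, mask)
```

The design point this illustrates is that the mask acts only on the final fused quantities, so the encoders' intermediate features are untouched.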

[134] Quantifying and Narrowing the Unknown: Interactive Text-to-Video Retrieval via Uncertainty Minimization

Bingqing Zhang,Zhuo Cao,Heming Du,Yang Li,Xue Li,Jiajun Liu,Sen Wang

Main category: cs.CV

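TL;DR: The paper proposes UMIVR, an interactive text-to-video retrieval framework that quantifies text ambiguity, mapping uncertainty, and frame uncertainty, and uses them to generate targeted clarifying questions that reduce retrieval ambiguity.

Details Motivation: Current text-to-video retrieval systems carry inherent uncertainty from ambiguous text queries, indistinct text-video mappings, and low-quality video frames, and existing interactive methods do not quantify these uncertainties.

Contribution: Proposes the UMIVR framework, which explicitly quantifies the three uncertainties with training-free measures (TAS, MUS, and TQFS) and generates targeted questions to refine the query iteratively.

Method: Uses a semantic entropy-based Text Ambiguity Score (TAS), a Jensen-Shannon divergence-based Mapping Uncertainty Score (MUS), and a Temporal Quality-based Frame Sampler (TQFS) to quantify uncertainty, then reduces it step by step through interactive clarifying questions.

Result: On the MSR-VTT-1k dataset, Recall@1 reaches 69.2% after 10 interactive rounds, validating the method.

Insight: Explicitly quantifying uncertainty and dynamically generating clarifying questions is key to improving interactive text-to-video retrieval.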

Abstract: Despite recent advances, text-to-video retrieval (TVR) is still hindered by multiple inherent uncertainties, such as ambiguous textual queries, indistinct text-video mappings, and low-quality video frames. Although interactive systems have emerged to address these challenges by refining user intent through clarifying questions, current methods typically rely on heuristic or ad-hoc strategies without explicitly quantifying these uncertainties, limiting their effectiveness. Motivated by this gap, we propose UMIVR, an Uncertainty-Minimizing Interactive Text-to-Video Retrieval framework that explicitly quantifies three critical uncertainties (text ambiguity, mapping uncertainty, and frame uncertainty) via principled, training-free metrics: semantic entropy-based Text Ambiguity Score (TAS), Jensen-Shannon divergence-based Mapping Uncertainty Score (MUS), and a Temporal Quality-based Frame Sampler (TQFS). By adaptively generating targeted clarifying questions guided by these uncertainty measures, UMIVR iteratively refines user queries, significantly reducing retrieval ambiguity. Extensive experiments on multiple benchmarks validate UMIVR's effectiveness, achieving notable gains in Recall@1 (69.2% after 10 interactive rounds) on the MSR-VTT-1k dataset, thereby establishing an uncertainty-minimizing foundation for interactive TVR.
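
The two information-theoretic quantities named in the UMIVR abstract, entropy and Jensen-Shannon divergence, are standard and can be computed as follows. How UMIVR turns them into TAS and MUS (which distributions are compared, any normalization) is not stated in the abstract, so the toy distributions below are purely illustrative.

```python
import math

def entropy(p):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(x * math.log2(x) for x in p if x > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: H(M) - (H(P) + H(Q)) / 2,
    where M is the average of P and Q. Ranges from 0 to 1."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return entropy(m) - (entropy(p) + entropy(q)) / 2

# A query whose candidate meanings are equally likely is maximally ambiguous.
ambiguous = entropy([0.25, 0.25, 0.25, 0.25])
certain = entropy([1.0, 0.0, 0.0, 0.0])

# Identical distributions have zero divergence; disjoint ones reach the maximum.
same = js_divergence([0.5, 0.5], [0.5, 0.5])
disjoint = js_divergence([1.0, 0.0], [0.0, 1.0])
```

Both measures are training-free in the sense that they need only probability distributions as input, which fits the abstract's description of the metrics.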

[135] Procedure Learning via Regularized Gromov-Wasserstein Optimal Transport

Syed Ahmed Mahmood,Ali Shah Ali,Umer Ahmed,Fawad Javed Fateh,M. Zeeshan Zia,Quoc-Huy Tran

Main category: cs.CV

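TL;DR: The paper proposes a self-supervised procedure learning framework that combines regularized Gromov-Wasserstein optimal transport with a contrastive regularization term to discover key steps in videos and determine their order.

Details Motivation: Self-supervised procedure learning must discover key steps and their order from unlabeled videos, but existing methods suffer from order variations, background or redundant frames, and repeated actions.

Contribution: Proposes a method that combines fused Gromov-Wasserstein optimal transport with contrastive regularization, avoiding degenerate solutions and improving key-step discovery and ordering.

Method: Uses a fused Gromov-Wasserstein optimal transport formulation with a structural prior, together with a contrastive regularization term that prevents trivial solutions in the embedding space.

Result: Outperforms previous methods, including OPEL, on the EgoProceL, ProceL, and CrossTask benchmarks.

Insight: Combining a structural prior with contrastive regularization handles temporal alignment while avoiding collapse of the embedding space, offering a new direction for self-supervised procedure learning.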

Abstract: We study the problem of self-supervised procedure learning, which discovers key steps and establishes their order from a set of unlabeled procedural videos. Previous procedure learning methods typically learn frame-to-frame correspondences between videos before determining key steps and their order. However, their performance often suffers from order variations, background/redundant frames, and repeated actions. To overcome these challenges, we propose a self-supervised procedure learning framework, which utilizes a fused Gromov-Wasserstein optimal transport formulation with a structural prior for computing frame-to-frame mapping between videos. However, optimizing exclusively for the above temporal alignment term may lead to degenerate solutions, where all frames are mapped to a small cluster in the embedding space and hence every video is associated with only one key step. To address that limitation, we further integrate a contrastive regularization term, which maps different frames to different points in the embedding space, avoiding the collapse to trivial solutions. Finally, we conduct extensive experiments on large-scale egocentric (i.e., EgoProceL) and third-person (i.e., ProceL and CrossTask) benchmarks to demonstrate superior performance by our approach against previous methods, including OPEL which relies on a traditional Kantorovich optimal transport formulation with an optimality prior.

[136] HOLa: Zero-Shot HOI Detection with Low-Rank Decomposed VLM Feature Adaptation

Qinqian Lei,Bo Wang,Robby T. Tan

Main category: cs.CV

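TL;DR: HOLa is a new method for zero-shot human-object interaction (HOI) detection that applies low-rank decomposition to VLM text features and adds LLM-derived regularization, improving generalization to unseen classes and action distinction.

Details Motivation: Existing zero-shot HOI detection methods struggle either to distinguish different actions involving the same object or to generalize to unseen classes. HOLa targets both problems.

Contribution: HOLa's core contribution is a low-rank decomposition of VLM features into class-shared basis features and adaptable weights, combined with LLM-derived regularization, to improve generalization and action distinction.

Method: Decomposes VLM text features via low-rank factorization into shared basis features and weights, introduces human-object tokens to enrich visual interaction representations, and guides weight adaptation with LLM-derived regularization.

Result: On zero-shot HOI detection on HICO-DET, HOLa reaches an unseen-class mAP of 27.91 in the unseen-verb setting, a new state of the art.

Insight: Low-rank decomposition and LLM-derived regularization are key to zero-shot HOI performance, and shared basis features help capture information common across classes.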

Abstract: Zero-shot human-object interaction (HOI) detection remains a challenging task, particularly in generalizing to unseen actions. Existing methods address this challenge by tapping Vision-Language Models (VLMs) to access knowledge beyond the training data. However, they either struggle to distinguish actions involving the same object or demonstrate limited generalization to unseen classes. In this paper, we introduce HOLa (Zero-Shot HOI Detection with Low-Rank Decomposed VLM Feature Adaptation), a novel approach that both enhances generalization to unseen classes and improves action distinction. In training, HOLa decomposes VLM text features for given HOI classes via low-rank factorization, producing class-shared basis features and adaptable weights. These features and weights form a compact HOI representation that preserves shared information across classes, enhancing generalization to unseen classes. Subsequently, we refine action distinction by adapting weights for each HOI class and introducing human-object tokens to enrich visual interaction representations. To further distinguish unseen actions, we guide the weight adaptation with LLM-derived action regularization. Experimental results show that our method sets a new state-of-the-art across zero-shot HOI settings on HICO-DET, achieving an unseen-class mAP of 27.91 in the unseen-verb setting. Our code is available at https://github.com/ChelsieLei/HOLa.

[137] DynImg: Key Frames with Visual Prompts are Good Representation for Multi-Modal Video Understanding

Xiaoyi Bao,Chenwei Xie,Hao Tang,Tingyu Weng,Xiaofeng Wang,Yun Zheng,Xingang Wang

Main category: cs.CV

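TL;DR: The paper proposes DynImg, a video representation that uses non-key frames as temporal prompts to highlight spatial regions containing fast-moving objects, combined with a 4D video rotary position embedding that preserves spatio-temporal order, improving multi-modal video understanding.

Details Motivation: Multi-modal large language models (MLLMs) are increasingly used for video understanding, but integrating spatial and temporal information effectively remains a key problem. Traditional methods process the two separately, so spatial features of fast-moving objects are under-extracted, which harms spatio-temporal interaction and understanding accuracy.

Contribution: 1. Proposes DynImg, a dynamic-image representation that uses temporal prompts to highlight spatial regions with fast-moving objects. 2. Introduces a 4D video rotary position embedding that preserves spatial and temporal adjacency and helps the MLLM understand spatio-temporal order. 3. Outperforms existing methods by about 2% on multiple video understanding benchmarks.

Method: 1. Non-key frames serve as temporal prompts that guide the model to attend to fine-grained spatial features of fast-moving objects. 2. A 4D video rotary position embedding keeps the spatio-temporal order of DynImg correct.

Result: Experiments show DynImg outperforms existing methods by about 2% across multiple video understanding benchmarks, validating the temporal prompts.

Insight: Combining dynamic images with temporal prompts addresses the under-extraction of spatial features for fast-moving objects and offers a new way to integrate spatio-temporal information in multi-modal video understanding.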

Abstract: In recent years, the introduction of Multi-modal Large Language Models (MLLMs) into video understanding tasks has become increasingly prevalent. However, how to effectively integrate temporal information remains a critical research focus. Traditional approaches treat spatial and temporal information separately. Due to issues like motion blur, it is challenging to accurately represent the spatial information of rapidly moving objects. This can lead to temporally important regions being underemphasized during spatial feature extraction, which in turn hinders accurate spatio-temporal interaction and video understanding. To address this limitation, we propose an innovative video representation method called Dynamic-Image (DynImg). Specifically, we introduce a set of non-key frames as temporal prompts to highlight the spatial areas containing fast-moving objects. During the process of visual feature extraction, these prompts guide the model to pay additional attention to the fine-grained spatial features corresponding to these regions. Moreover, to maintain the correct sequence for DynImg, we employ a corresponding 4D video Rotary Position Embedding. This retains both the temporal and spatial adjacency of DynImg, helping MLLM understand the spatio-temporal order within this combined format. Experimental evaluations reveal that DynImg surpasses the state-of-the-art methods by approximately 2% across multiple video understanding benchmarks, proving the effectiveness of our temporal prompts in enhancing video comprehension.

[138] SegDT: A Diffusion Transformer-Based Segmentation Model for Medical Imaging

Salah Eddine Bekhouche,Gaby Maroun,Fadi Dornaika,Abdenour Hadid

Main category: cs.CV

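TL;DR: SegDT is a medical image segmentation model based on a diffusion transformer (DiT), focused on skin lesion segmentation. It incorporates Rectified Flow to improve generation quality, runs fast on low-cost hardware, and reaches SOTA performance.

Details Motivation: Medical image segmentation, such as skin lesion segmentation, is essential for diagnosis and treatment planning, but existing methods leave room for improvement in combining high accuracy with low-cost hardware.

Contribution: Proposes SegDT, which combines a diffusion transformer with Rectified Flow to improve generation quality and inference speed and is suited to low-cost hardware.

Method: Builds on a diffusion transformer (DiT) and incorporates Rectified Flow to reduce inference steps and improve generation quality while keeping the flexibility of diffusion models.

Result: Evaluated on three benchmark datasets, SegDT achieves SOTA results with fast inference, making it suitable for real medical settings.

Insight: By combining a diffusion transformer with Rectified Flow, SegDT balances efficiency and accuracy in medical image segmentation and points toward deployment on low-cost hardware.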

Abstract: Medical image segmentation is crucial for many healthcare tasks, including disease diagnosis and treatment planning. One key area is the segmentation of skin lesions, which is vital for diagnosing skin cancer and monitoring patients. In this context, this paper introduces SegDT, a new segmentation model based on diffusion transformer (DiT). SegDT is designed to work on low-cost hardware and incorporates Rectified Flow, which improves the generation quality at reduced inference steps and maintains the flexibility of standard diffusion models. Our method is evaluated on three benchmarking datasets and compared against several existing works, achieving state-of-the-art results while maintaining fast inference speeds. This makes the proposed model appealing for real-world medical applications. This work advances the performance and capabilities of deep learning models in medical image analysis, enabling faster, more accurate diagnostic tools for healthcare professionals. The code is made publicly available at https://github.com/Bekhouche/SegDT.

[139] Being-H0: Vision-Language-Action Pretraining from Large-Scale Human Videos

Hao Luo,Yicheng Feng,Wanpeng Zhang,Sipeng Zheng,Ye Wang,Haoqi Yuan,Jiazheng Liu,Chaoyi Xu,Qin Jin,Zongqing Lu

Main category: cs.CV

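TL;DR: The paper proposes Being-H0, a Vision-Language-Action (VLA) model trained on large-scale human videos. Through physical instruction tuning and part-level motion tokenization, it addresses the weaknesses of existing models on complex manipulation tasks and generalization.

Details Motivation: Existing VLAs perform poorly on complex manipulation tasks that require high dexterity and generalize poorly to new scenarios and tasks, mainly because they rely on synthetic data with sim-to-real gaps or on teleoperated demonstrations that lack scale and diversity. The dexterity of human hands and the abundance of web data offer a way forward.

Contribution: 1. Proposes Being-H0, a VLA model trained on human videos; 2. Designs physical instruction tuning, which combines large-scale pretraining, physical space alignment, and adaptation to robotic tasks; 3. Proposes part-level motion tokenization with millimeter-level reconstruction accuracy; 4. Builds a large-scale dataset with millions of instances.

Method: 1. Large-scale VLA pretraining on videos of human hands; 2. Physical space alignment for 3D reasoning; 3. Part-level motion tokenization to learn precise hand trajectories; 4. A data curation pipeline that integrates motion capture, VR, and RGB-only video data.

Result: Experiments show that Being-H0 performs well on hand motion generation and instruction following and scales well with model and data size. Physical instruction tuning further improves its performance on real-world robotic manipulation.

Insight: Human hand video data can relieve the data bottleneck in robotic manipulation, and physical instruction tuning is key to improving performance on real tasks.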

Abstract: We introduce Being-H0, a dexterous Vision-Language-Action model (VLA) trained on large-scale human videos. Existing VLAs struggle with complex manipulation tasks requiring high dexterity and generalize poorly to novel scenarios and tasks, primarily due to their reliance on synthetic data with significant sim-to-real gaps or teleoperated demonstrations lacking scale and diversity. To address this data bottleneck, we propose leveraging human hands as a foundation manipulator, capitalizing on the rich dexterity and scalability present in web data. Our approach centers on physical instruction tuning, a novel training paradigm that combines large-scale VLA pretraining from human videos, physical space alignment for 3D reasoning, and post-training adaptation for robotic tasks. Additionally, we introduce a part-level motion tokenization method which achieves millimeter-level reconstruction accuracy to model precise hand trajectories for action learning. To support our proposed paradigm, we further develop a comprehensive data curation pipeline that integrates heterogeneous sources – including motion capture, VR, and RGB-only videos – into a large-scale dataset with millions of motion-based instructional instances. We empirically show the excellence of Being-H0 in hand motion generation and instruction following, and it also scales well with model and data sizes. Importantly, we observe the expected gains of Being-H0 in real-world robotic manipulation as physical instruction tuning is applied. More details are available at https://beingbeyond.github.io/Being-H0.

[140] CylinderPlane: Nested Cylinder Representation for 3D-aware Image Generation

Ru Jia,Xiaozhuang Ma,Jianji Wang,Nanning Zheng

Main category: cs.CV

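TL;DR: The paper proposes CylinderPlane, a new implicit representation based on the cylindrical coordinate system that addresses the multi-view consistency limits of the Tri-plane representation and uses nested cylinders to improve detail and consistency in 360° image generation.

Details Motivation: The Tri-plane representation suffers from feature ambiguity and multi-face artifacts when generating 360° views because it cannot distinguish features in symmetric regions. The authors propose a cylindrical-coordinate representation to solve this.

Contribution: 1. Proposes the CylinderPlane representation based on cylindrical coordinates, resolving feature ambiguity; 2. Introduces a nested cylinder structure to handle complex geometry and multiple resolutions; 3. The representation integrates into existing neural rendering frameworks.

Method: 1. Uses the cylindrical coordinate system to explicitly separate features at different angles; 2. Captures multi-scale features with nested cylinders; 3. Remains compatible with existing implicit rendering methods.

Result: Experiments on a synthetic dataset and unstructured in-the-wild images show that CylinderPlane outperforms existing methods in generation quality and multi-view consistency.

Insight: Cylindrical coordinates are better suited to 360° view generation, and the nested structure improves fine-detail learning and adaptability.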

Abstract: While the Tri-plane representation has advanced the development of 3D-aware image generative models, problems rooted in its inherent structure, such as multi-face artifacts caused by sharing the same features in symmetric regions, limit its ability to generate 360° view images. In this paper, we propose CylinderPlane, a novel implicit representation based on the cylindrical coordinate system, to eliminate the feature ambiguity issue and ensure multi-view consistency over 360°. Unlike the inevitable feature entanglement in the Cartesian coordinate-based Tri-plane representation, the cylindrical coordinate system explicitly separates features at different angles, allowing our cylindrical representation to achieve high-quality, artifact-free 360° image synthesis. We further introduce the nested cylinder representation, which composites multiple cylinders at different scales, thereby making the model more adaptable to complex geometry and varying resolutions. The combination of cylinders with different resolutions can effectively capture more critical locations and multi-scale features, greatly facilitating fine detail learning and robustness to different resolutions. Moreover, our representation is agnostic to implicit rendering methods and can be easily integrated into any neural rendering pipeline. Extensive experiments on both a synthetic dataset and unstructured in-the-wild images demonstrate that our proposed representation achieves superior performance over previous methods.
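
The feature-ambiguity argument in the CylinderPlane abstract can be checked with a few lines of coordinate arithmetic. In the sketch below, a point on the front of an object and its mirror image on the back project to the same location on the xy-plane of a Tri-plane, while their cylindrical angles differ. Treating y as the cylinder axis and measuring the angle with `atan2(z, x)` are assumptions made for illustration; the paper's exact parameterization is not given in the abstract.

```python
import math

def to_cylindrical(x, y, z):
    """Map a Cartesian point to (radius, angle, height), taking y as the
    cylinder axis (an assumption for this sketch)."""
    radius = math.hypot(x, z)
    angle = math.atan2(z, x)  # radians, in (-pi, pi]
    return radius, angle, y

front = (0.3, 0.5, 0.4)   # point facing the camera
back = (0.3, 0.5, -0.4)   # its mirror image behind the object

# Both points land on the same xy-plane location, so a plane feature
# sampled there cannot tell them apart.
same_xy_projection = front[:2] == back[:2]

r_f, theta_f, h_f = to_cylindrical(*front)
r_b, theta_b, h_b = to_cylindrical(*back)
```

The two points share radius and height but have angles of opposite sign, so features indexed by angle remain distinct for front and back views.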

[141] A Survey on Efficiency Optimization Techniques for DNN-based Video Analytics: Process Systems, Algorithms, and Applications

Shanjiang Tang,Rui Huang,Hsinyu Luo,Chunjiang Wang,Ce Yu,Yusen Li,Hao Fu,Chao Sun,Jian Xiao

Main category: cs.CV

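TL;DR: This survey systematically reviews and categorizes techniques for improving the efficiency of DNNs in video analytics, covering hardware support, data processing, and operational deployment, and fills a gap left by existing surveys that focus mainly on accuracy.

Details Motivation: With the explosive growth of video data, demand for video analytics keeps rising, while DNN efficiency remains an open problem. The survey aims to cover efficiency optimization, which existing surveys address only partially.

Contribution: Provides a comprehensive survey of DNN efficiency optimization techniques, organizes existing methods from multiple perspectives, and analyzes open problems and challenges in performance optimization.

Method: Organizes existing methods bottom-up, covering hardware support, data processing, operational deployment, and other levels.

Result: Summarizes a range of techniques and frameworks for improving DNN efficiency and identifies current problems and future directions.

Insight: Efficiency optimization for video analytics needs a multi-level approach: hardware and system optimization matter as much as algorithmic improvements.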

Abstract: The explosive growth of video data in recent years has brought higher demands for video analytics, where accuracy and efficiency remain the two primary concerns. Deep neural networks (DNNs) have been widely adopted to ensure accuracy; however, improving their efficiency in video analytics remains an open challenge. Unlike existing surveys, which summarize DNN-based video analytics mainly from the perspective of accuracy optimization, this survey aims to provide a thorough review of optimization techniques that improve the efficiency of DNNs in video analytics. We organize existing methods in a bottom-up manner, covering multiple perspectives such as hardware support, data processing, and operational deployment. Finally, based on the optimization framework and existing works, we analyze and discuss the problems and challenges in the performance optimization of DNN-based video analytics.

[142] Uncovering Critical Features for Deepfake Detection through the Lottery Ticket Hypothesis

Lisan Al Amin,Md. Ismail Hossain,Thanh Thi Nguyen,Tasnim Jahan,Mahbubul Islam,Faisal Quader

Main category: cs.CV

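TL;DR: The paper studies the Lottery Ticket Hypothesis (LTH) for deepfake detection. By identifying key features and pruning networks efficiently, it finds subnetworks that keep most of their performance even at substantial sparsity. The proposed iterative magnitude pruning outperforms one-shot pruning, and Grad-CAM visualization shows the importance of critical facial regions.

Details Motivation: Advances in deepfake technology threaten information integrity and social trust. Existing detection methods are effective, but their models are large and their mechanisms opaque, which makes deployment in resource-limited environments difficult.

Contribution: 1. Shows that deepfake detection networks contain winning tickets; 2. Proposes an LTH-based iterative magnitude pruning method that consistently outperforms one-shot pruning; 3. Analyzes key features with Grad-CAM and demonstrates that subnetworks transfer across datasets.

Method: 1. Uses the MesoNet, CNN-5, and ResNet-18 architectures; 2. Runs experiments on the OpenForensic and FaceForensics++ datasets; 3. Applies iterative magnitude pruning together with Grad-CAM visualization.

Result: MesoNet retains 56.2% accuracy at 80% sparsity (baseline 62.6%) with only 3,000 parameters; the LTH-based method consistently outperforms one-shot pruning; subnetworks transfer across datasets.

Insight: 1. Deepfake detection relies on a small number of key features; 2. Sparse pruning enables efficient deployment; 3. Critical facial regions are essential for detection, and transferability makes practical use plausible.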

Abstract: Recent advances in deepfake technology have created increasingly convincing synthetic media that poses significant challenges to information integrity and social trust. While current detection methods show promise, their underlying mechanisms remain poorly understood, and the large sizes of their models make them challenging to deploy in resource-limited environments. This study investigates the application of the Lottery Ticket Hypothesis (LTH) to deepfake detection, aiming to identify the key features crucial for recognizing deepfakes. We examine how neural networks can be efficiently pruned while maintaining high detection accuracy. Through extensive experiments with MesoNet, CNN-5, and ResNet-18 architectures on the OpenForensic and FaceForensics++ datasets, we find that deepfake detection networks contain winning tickets, i.e., subnetworks, that preserve performance even at substantial sparsity levels. Our results indicate that MesoNet retains 56.2% accuracy at 80% sparsity on the OpenForensic dataset, with only 3,000 parameters, which is about 90% of its baseline accuracy (62.6%). The results also show that our proposed LTH-based iterative magnitude pruning approach consistently outperforms one-shot pruning methods. Using Grad-CAM visualization, we analyze how pruned networks maintain their focus on critical facial regions for deepfake detection. Additionally, we demonstrate the transferability of winning tickets across datasets, suggesting potential for efficient, deployable deepfake detection systems.
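
Iterative magnitude pruning, the procedure named in the abstract above, can be sketched generically. The loop follows the usual lottery-ticket recipe: train, remove the smallest-magnitude surviving weights, rewind the survivors to their initial values, and repeat. The authors' exact schedule and rewinding policy are not given here, so the per-round prune fraction, the flat weight list, and the `toy_train` stand-in (which returns the weights unchanged) are assumptions for illustration only.

```python
def magnitude_mask(weights, mask, prune_fraction):
    """Zero out the smallest-magnitude `prune_fraction` of still-active weights."""
    active = sorted((abs(w), i)
                    for i, (w, m) in enumerate(zip(weights, mask)) if m)
    n_prune = int(len(active) * prune_fraction)
    new_mask = list(mask)
    for _, i in active[:n_prune]:
        new_mask[i] = 0
    return new_mask

def iterative_magnitude_pruning(init_weights, train, rounds, prune_fraction):
    """Train, prune, rewind surviving weights to their initial values, repeat."""
    mask = [1] * len(init_weights)
    for _ in range(rounds):
        start = [w * m for w, m in zip(init_weights, mask)]  # rewind to init
        trained = train(start, mask)
        mask = magnitude_mask(trained, mask, prune_fraction)
    return mask

def toy_train(weights, mask):
    """Stand-in for real training: returns the masked weights unchanged."""
    return [w * m for w, m in zip(weights, mask)]

init = [0.9, -0.1, 0.4, -0.7, 0.05, 0.3, -0.2, 0.6, -0.8, 0.5]
mask = iterative_magnitude_pruning(init, toy_train, rounds=2, prune_fraction=0.5)
sparsity = 1 - sum(mask) / len(mask)
```

One-shot pruning would instead call `magnitude_mask` once with the full target fraction. The iterative variant lets each round's training inform which weights survive the next round.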

[143] Extracting Visual Facts from Intermediate Layers for Mitigating Hallucinations in Multimodal Large Language Models

Haoran Zhou,Zihan Zhang,Hao Chen

Main category: cs.CV

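TL;DR: The paper proposes EVA, a training-free method that dynamically selects the intermediate layers carrying the most significant visual factual information, extracts visual factual knowledge from them, and integrates it into the final layer to reduce hallucinations in multimodal large language models (MLLMs).

Details Motivation: Although MLLMs have made significant progress in combining visual and language understanding, they still suffer from object hallucination, producing plausible but factually incorrect outputs. Prior work found that prior knowledge significantly suppresses visual information in deep layers, but how this suppression happens at the intermediate-layer stage remains unclear.

Contribution: Proposes EVA, which extracts visual factual knowledge from intermediate layers and adjusts the output distribution accordingly, significantly reducing hallucinations. EVA is training-free, model-agnostic, and compatible with various classic decoding strategies.

Method: EVA dynamically selects the intermediate layer with the most significant visual factual information, contrasts that layer's output distributions for the original input and a pure-text input, extracts visual factual knowledge, and incorporates it into the final-layer output logits.

Result: On multiple benchmarks, EVA significantly reduces hallucination rates, validating its effectiveness.

Insight: Visual factual knowledge and the difference between prior and original probability distributions follow similar trends across intermediate layers, which provides the basis for dynamically adjusting the output distribution.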

Abstract: Multimodal Large Language Models (MLLMs) have made significant strides by combining visual recognition and language understanding to generate content that is both coherent and contextually accurate. However, MLLMs continue to struggle with object hallucinations, where models produce seemingly plausible but factually incorrect outputs, including objects that do not exist in the image. Recent work has revealed that the prior knowledge in MLLMs significantly suppresses visual information in deep layers, causing hallucinatory outputs. However, how these priors suppress visual information at the intermediate layer stage in MLLMs remains unclear. We observe that visual factual knowledge and the differences between intermediate-layer prior/original probability distributions show similar evolutionary trends in intermediate layers. Motivated by this, we introduce Decoding by Extracting Visual Facts (EVA), a simple, training-free method that dynamically selects intermediate layers with the most significant visual factual information. By contrasting the output distributions of the selected layer derived from the original input and pure-text input, EVA extracts visual factual knowledge and proportionally incorporates it into the final layer to correct the output logits. Importantly, EVA is model-agnostic, seamlessly integrates with various classic decoding strategies, and is applicable across different MLLMs. We validate EVA on widely-used benchmarks, and the results show that it significantly reduces hallucination rates compared to baseline methods, underscoring its effectiveness in mitigating hallucinations.
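
The contrast-and-correct idea in the EVA abstract can be sketched on toy numbers. The abstract does not give EVA's layer-selection criterion or correction formula, so this sketch assumes both: it picks the layer where the image-conditioned and text-only distributions differ most (by Jensen-Shannon divergence) and adds their log-probability gap, scaled by a hypothetical weight `alpha`, to the final logits. Function names and all values are illustrative.

```python
import math

def softmax(logits):
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log) between two distributions."""
    def kl(a, b):
        return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return (kl(p, m) + kl(q, m)) / 2

def eva_like_correction(final_logits, orig_layers, text_layers, alpha=1.0):
    """Assumed procedure: choose the layer where image-conditioned and
    text-only distributions differ most, then add their log-probability
    gap to the final logits."""
    pairs = [(softmax(o), softmax(t)) for o, t in zip(orig_layers, text_layers)]
    best = max(range(len(pairs)), key=lambda i: js_divergence(*pairs[i]))
    p_orig, p_text = pairs[best]
    corrected = [f + alpha * (math.log(po) - math.log(pt))
                 for f, po, pt in zip(final_logits, p_orig, p_text)]
    return best, corrected

# Toy vocabulary: index 0 = "cat" (in the image), index 1 = "dog" (language prior).
final_logits = [1.0, 1.2]
orig_layers = [[0.0, 0.0], [2.0, 0.0]]   # per-layer logits with the image
text_layers = [[0.0, 0.0], [0.0, 1.0]]   # per-layer logits with text only
layer, corrected = eva_like_correction(final_logits, orig_layers, text_layers)
```

In this toy case the uncorrected final logits favor the prior-driven token, and the correction from the selected layer shifts the preference to the token supported by the image.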

[144] HW-MLVQA: Elucidating Multilingual Handwritten Document Understanding with a Comprehensive VQA Benchmark

Aniket Pal,Ajoy Mondal,Minesh Mathew,C. V. Jawahar

Main category: cs.CV

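TL;DR: The paper presents HW-MLVQA, a VQA benchmark for multilingual handwritten document understanding with 1,600 handwritten pages and 2,400 question-answer pairs, intended to advance multi-modal models on complex handwritten documents.

Details Motivation: Current multilingual visual question answering (MLVQA) models fall short on diverse handwritten documents, and no dedicated benchmark exists. The paper aims to fill this gap.

Contribution: Introduces the HW-MLVQA benchmark, comprising handwritten document data with question-answer pairs and an evaluation framework spanning text, image, and combined image-and-text modalities.

Method: Builds a dataset of 1,600 handwritten pages and 2,400 question-answer pairs, designs evaluation across text, image, and combined modalities, and also assesses OCR models.

Result: HW-MLVQA provides a comprehensive evaluation benchmark for multilingual handwritten document understanding, supporting further research in the area.

Insight: The diversity and complexity of handwritten documents pose new challenges for multi-modal models, and HW-MLVQA offers a tool for future research.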

Abstract: The proliferation of MultiLingual Visual Question Answering (MLVQA) benchmarks augments the capabilities of large language models (LLMs) and multi-modal LLMs, thereby enabling them to adeptly capture the intricate linguistic subtleties and visual complexities inherent across diverse languages. Despite this potential, current MLVQA models struggle to fully utilize their capabilities when dealing with the extensive variety of handwritten documents. This article delineates HW-MLVQA, an avant-garde VQA benchmark meticulously crafted to mitigate the dearth of authentic multilingual handwritten document comprehension. HW-MLVQA encompasses an extensive collection of 1,600 handwritten pages complemented by 2,400 question-answer pairs. Furthermore, it provides a robust benchmark evaluation framework spanning three distinct modalities: text, image, and an integrated image & text modality. To simulate authentic real-world contexts devoid of ground truth textual transcriptions, we facilitate a rigorous assessment of proprietary and open-source OCR models. The benchmark aspires to facilitate pivotal advancements in multilingual handwritten document interpretation, fostering innovation and scholarly inquiry within this specialized domain.

[145] Visual-Language Model Knowledge Distillation Method for Image Quality Assessment

Yongkang Hou,Jiarun Song

Main category: cs.CV

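TL;DR: The paper proposes an image quality assessment (IQA) method based on vision-language model knowledge distillation. It uses CLIP's IQA knowledge to guide the training of a student model with architectural advantages, significantly reducing model complexity while improving performance.

Details Motivation: Existing CLIP-based multimodal IQA methods carry an excessive parameter count and identify local distortion features poorly. A method is needed that inherits CLIP's generalization ability while improving model efficiency.

Contribution: 1) Quality-graded prompt templates that guide CLIP to output quality scores; 2) fine-tuning of CLIP to strengthen it on IQA tasks; 3) a modality-adaptive knowledge distillation strategy that transfers CLIP's knowledge to the student model.

Method: 1) Design quality-graded prompt templates; 2) fine-tune CLIP; 3) apply a modality-adaptive knowledge distillation strategy in which the CLIP teacher guides the student model.

Result: Experiments on multiple IQA datasets show that the method significantly reduces model complexity while outperforming existing IQA methods, indicating potential for practical deployment.

Insight: Combining knowledge distillation with a vision-language model can balance generalization and efficiency, offering a new direction for IQA.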

Abstract: Image Quality Assessment (IQA) is a core task in computer vision. Multimodal methods based on vision-language models, such as CLIP, have demonstrated exceptional generalization capabilities in IQA tasks. To address the issues of excessive parameter burden and insufficient ability to identify local distorted features in CLIP for IQA, this study proposes a visual-language model knowledge distillation method aimed at guiding the training of models with architectural advantages using CLIP’s IQA knowledge. First, quality-graded prompt templates were designed to guide CLIP to output quality scores. Then, CLIP is fine-tuned to enhance its capabilities in IQA tasks. Finally, a modality-adaptive knowledge distillation strategy is proposed to achieve guidance from the CLIP teacher model to the student model. Our experiments were conducted on multiple IQA datasets, and the results show that the proposed method significantly reduces model complexity while outperforming existing IQA methods, demonstrating strong potential for practical deployment.

[146] LINR-PCGC: Lossless Implicit Neural Representations for Point Cloud Geometry Compression

Wenjie Huang,Qi Yang,Shuting Xia,He Huang,Zhu Li,Yiling Xu

Main category: cs.CV

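TL;DR: LINR-PCGC is a lossless point cloud geometry compression method based on implicit neural representations (INR). By cutting encoding time and shrinking the decoder, it improves compression efficiency and practicality.

Details Motivation: Current AI-based point cloud compression methods depend on specific data distributions. INR methods address this, but existing INR methods support only lossy compression because of limits on encoding time and decoder size. An efficient lossless method is therefore needed.

Contribution: The first INR-based lossless point cloud geometry compression method (LINR-PCGC), with a coding framework that operates at the level of a group of point clouds and a lightweight coding network, which together reduce encoding time and decoder size.

Method: A group-of-point-clouds coding framework is combined with a lightweight coding network built on multiscale sparse convolution (SparseConv), consisting of scale context extraction, child node prediction, and model compression modules.

Result: On the MVUB dataset, LINR-PCGC reduces the bitstream by approximately 21.21% compared to G-PCC TMC13v23 and by 21.95% compared to SparsePCGC.

Insight: With a suitable network design and coding strategy, INR methods can perform lossless compression efficiently and outperform both traditional and AI-based methods in the reported experiments.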

Abstract: Existing AI-based point cloud compression methods struggle with dependence on specific training data distributions, which limits their real-world deployment. Implicit Neural Representation (INR) methods solve the above problem by encoding overfitted network parameters to the bitstream, resulting in more distribution-agnostic results. However, due to the limitation of encoding time and decoder size, current INR-based methods only consider lossy geometry compression. In this paper, we propose the first INR-based lossless point cloud geometry compression method, called Lossless Implicit Neural Representations for Point Cloud Geometry Compression (LINR-PCGC). To accelerate encoding, we design a coding framework that operates at the level of a group of point clouds, with an effective network initialization strategy, which can reduce encoding time by around 60%. A lightweight coding network based on multiscale SparseConv, consisting of scale context extraction, child node prediction, and model compression modules, is proposed to realize fast inference and compact decoder size. Experimental results show that our method consistently outperforms traditional and AI-based methods: for example, with the convergence time on the MVUB dataset, our method reduces the bitstream by approximately 21.21% compared to G-PCC TMC13v23 and 21.95% compared to SparsePCGC. Our project can be seen on https://huangwenjie2023.github.io/LINR-PCGC/.

[147] DWTGS: Rethinking Frequency Regularization for Sparse-view 3D Gaussian Splatting

Hung Nguyen,Runfa Li,An Le,Truong Nguyen

Main category: cs.CV

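TL;DR: The paper proposes DWTGS, a framework that improves frequency regularization for sparse-view 3D Gaussian Splatting through wavelet-space losses. It focuses on low-frequency supervision and high-frequency sparsity, which improves generalization.

Details Motivation: Sparse-view 3D Gaussian Splatting (3DGS) tends to overfit the high-frequency details of the training views, which degrades novel-view quality. Existing frequency regularization based on Fourier transforms is hard to tune and biased towards detrimental high-frequency learning.

Contribution: 1. The DWTGS framework, which uses wavelet-space losses to provide additional spatial supervision; 2. supervision of only the low-frequency subbands, with sparsity on the high-frequency subband encouraged in a self-supervised manner.

Method: 1. Decompose images into multi-scale frequency subbands with the discrete wavelet transform (DWT); 2. supervise the low-frequency LL subbands at multiple DWT levels; 3. impose a self-supervised sparsity constraint on the high-frequency HH subband.

Result: Experiments show that DWTGS outperforms Fourier-based counterparts across benchmarks, with better generalization and fewer high-frequency artifacts.

Insight: Low-frequency supervision suppresses high-frequency overfitting and improves generalization; wavelet-space losses suit frequency regularization better than Fourier-based ones.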

Abstract: Sparse-view 3D Gaussian Splatting (3DGS) presents significant challenges in reconstructing high-quality novel views, as it often overfits to the widely-varying high-frequency (HF) details of the sparse training views. While frequency regularization can be a promising approach, its typical reliance on Fourier transforms causes difficult parameter tuning and biases towards detrimental HF learning. We propose DWTGS, a framework that rethinks frequency regularization by leveraging wavelet-space losses that provide additional spatial supervision. Specifically, we supervise only the low-frequency (LF) LL subbands at multiple DWT levels, while enforcing sparsity on the HF HH subband in a self-supervised manner. Experiments across benchmarks show that DWTGS consistently outperforms Fourier-based counterparts, as this LF-centric strategy improves generalization and reduces HF hallucinations.
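
A wavelet-space loss of the kind the abstract describes can be sketched with a hand-written Haar transform. This is a minimal illustration under stated assumptions, not the DWTGS code: the Haar wavelet, the L1 penalties, the `hh_weight` value, and applying the HH sparsity term at every level are all choices made here for the example.

```python
import numpy as np

def haar_dwt2(img):
    """One-level 2-D Haar DWT of an (H, W) array with even H and W.

    Returns the LL, LH, HL, HH subbands (subband naming conventions vary).
    """
    a, b = img[0::2, 0::2], img[0::2, 1::2]
    c, d = img[1::2, 0::2], img[1::2, 1::2]
    ll = (a + b + c + d) / 2.0
    lh = (a - b + c - d) / 2.0
    hl = (a + b - c - d) / 2.0
    hh = (a - b - c + d) / 2.0
    return ll, lh, hl, hh

def dwt_loss(render, target, levels=2, hh_weight=0.1):
    """L1 supervision on LL at each level plus an L1 sparsity term on the render's HH."""
    loss = 0.0
    r, t = render, target
    for _ in range(levels):
        r_ll, _, _, r_hh = haar_dwt2(r)
        t_ll, _, _, _ = haar_dwt2(t)
        loss += np.abs(r_ll - t_ll).mean()        # low-frequency supervision
        loss += hh_weight * np.abs(r_hh).mean()   # self-supervised: no target needed
        r, t = r_ll, t_ll                         # recurse on the LL subband
    return float(loss)
```

The sparsity term uses only the rendered image, which is what makes it self-supervised: high-frequency energy is discouraged in the render rather than matched to the training view.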

[148] A Practical Investigation of Spatially-Controlled Image Generation with Transformers

Guoxuan Xia,Harleen Hanspal,Petru-Daniel Tudosiu,Shifeng Zhang,Sarah Parisot

Main category: cs.CV

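TL;DR: Through controlled comparative experiments, the paper studies transformer-based spatially-controlled image generation, establishes control token prefilling as a general baseline, and examines the strengths and weaknesses of sampling-time enhancements and adapter-based approaches.

Details Motivation: Spatially-controlled image generation has advanced rapidly, but detailed and fair scientific comparison is lacking. The factors behind performance differences are hard to disentangle, and the motivations and nuances of some approaches have been lost.

Contribution: 1) Control token prefilling as a general baseline; 2) a study of how sampling-time enhancements (classifier-free guidance extended to control, and softmax truncation) affect control-generation consistency; 3) a re-evaluation of the advantages and limitations of adapter-based approaches.

Method: Controlled experiments on ImageNet across diffusion-based/flow-based and autoregressive models, analysing control token prefilling, sampling-time enhancements, and adapter-based approaches.

Result: Control token prefilling performs well; sampling-time enhancements have a strong impact on control-generation consistency; adapters mitigate forgetting when trained on limited data but underperform full training on control-generation consistency.

Insight: Sampling-time enhancements are key to control-generation consistency; adapters are effective with limited data but trade away some control-generation consistency.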

Abstract: Enabling image generation models to be spatially controlled is an important area of research, empowering users to better generate images according to their own fine-grained specifications via e.g. edge maps, poses. Although this task has seen impressive improvements in recent times, a focus on rapidly producing stronger models has come at the cost of detailed and fair scientific comparison. Differing training data, model architectures and generation paradigms make it difficult to disentangle the factors contributing to performance. Meanwhile, the motivations and nuances of certain approaches become lost in the literature. In this work, we aim to provide clear takeaways across generation paradigms for practitioners wishing to develop transformer-based systems for spatially-controlled generation, clarifying the literature and addressing knowledge gaps. We perform controlled experiments on ImageNet across diffusion-based/flow-based and autoregressive (AR) models. First, we establish control token prefilling as a simple, general and performant baseline approach for transformers. We then investigate previously underexplored sampling time enhancements, showing that extending classifier-free guidance to control, as well as softmax truncation, have a strong impact on control-generation consistency. Finally, we re-clarify the motivation of adapter-based approaches, demonstrating that they mitigate “forgetting” and maintain generation quality when trained on limited downstream data, but underperform full training in terms of generation-control consistency. Code will be released upon publication.

[149] TokensGen: Harnessing Condensed Tokens for Long Video Generation

Wenqi Ouyang,Zeqi Xiao,Danni Yang,Yifan Zhou,Shuai Yang,Lei Yang,Jianlou Si,Xingang Pan

Main category: cs.CV

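TL;DR: TokensGen is a two-stage framework that generates long videos from condensed tokens. It decomposes the task into semantic control, long-term consistency, and smooth transitions, and builds on pre-trained short video models to improve the coherence and efficiency of long video generation.

Details Motivation: Diffusion-based long video generation faces memory bottlenecks and long-term inconsistency. A method is needed that preserves visual quality while scaling efficiently.

Contribution: The TokensGen framework, which combines condensed tokens, two-stage training (To2V and T2To), and an adaptive FIFO-Diffusion strategy to improve the coherence and computational efficiency of long video generation.

Method: 1. A Video Tokenizer condenses short clips into semantically rich tokens. 2. To2V (a short video diffusion model) and T2To (a video token diffusion transformer) are trained to control inner-clip semantics and global consistency, respectively. 3. At inference, an adaptive FIFO-Diffusion strategy connects adjacent clips smoothly.

Result: Experiments show that the method significantly improves long-term temporal and content coherence without prohibitive computational overhead.

Insight: By combining a modular design with pre-trained short video models, TokensGen offers a scalable solution for long video generation, with applications in cinematic production and simulation.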

Abstract: Generating consistent long videos is a complex challenge: while diffusion-based generative models generate visually impressive short clips, extending them to longer durations often leads to memory bottlenecks and long-term inconsistency. In this paper, we propose TokensGen, a novel two-stage framework that leverages condensed tokens to address these issues. Our method decomposes long video generation into three core tasks: (1) inner-clip semantic control, (2) long-term consistency control, and (3) inter-clip smooth transition. First, we train To2V (Token-to-Video), a short video diffusion model guided by text and video tokens, with a Video Tokenizer that condenses short clips into semantically rich tokens. Second, we introduce T2To (Text-to-Token), a video token diffusion transformer that generates all tokens at once, ensuring global consistency across clips. Finally, during inference, an adaptive FIFO-Diffusion strategy seamlessly connects adjacent clips, reducing boundary artifacts and enhancing smooth transitions. Experimental results demonstrate that our approach significantly enhances long-term temporal and content coherence without incurring prohibitive computational overhead. By leveraging condensed tokens and pre-trained short video models, our method provides a scalable, modular solution for long video generation, opening new possibilities for storytelling, cinematic production, and immersive simulations. Please see our project page at https://vicky0522.github.io/tokensgen-webpage/ .

[150] Learning from Heterogeneity: Generalizing Dynamic Facial Expression Recognition via Distributionally Robust Optimization

Feng-Qi Cui,Anyang Tong,Jinyang Huang,Jie Zhang,Dan Guo,Zhi Liu,Meng Wang

Main category: cs.CV

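TL;DR: The paper proposes HDF, a new framework for dynamic facial expression recognition. It uses a Time-Frequency Distributional Attention Module (DAM) and a Distribution-aware Scaling Module (DSM) to address sample heterogeneity and optimization imbalance, improving recognition accuracy and robustness.

Details Motivation: Dynamic facial expression recognition (DFER) is critical to affective computing and human-computer interaction, but existing methods degrade under the sample heterogeneity caused by multi-source data and individual expression variability.

Contribution: 1. The Heterogeneity-aware Distributional Framework (HDF); 2. the DAM module, which strengthens time-frequency modeling; 3. the DSM module, which dynamically balances classification and contrastive losses.

Method: 1. DAM captures temporal consistency and frequency robustness through a dual-branch attention design; 2. DSM adaptively balances the losses based on gradient sensitivity and information bottleneck principles.

Result: On the DFEW and FERV39k datasets, HDF improves weighted average recall (WAR) and unweighted average recall (UAR) and performs well in diverse and imbalanced scenarios.

Insight: Greater robustness to sample heterogeneity, combined with dynamic loss balancing, gives HDF stronger generalization on DFER.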

Abstract: Dynamic Facial Expression Recognition (DFER) plays a critical role in affective computing and human-computer interaction. Although existing methods achieve comparable performance, they inevitably suffer from performance degradation under sample heterogeneity caused by multi-source data and individual expression variability. To address these challenges, we propose a novel framework, called Heterogeneity-aware Distributional Framework (HDF), and design two plug-and-play modules to enhance time-frequency modeling and mitigate optimization imbalance caused by hard samples. Specifically, the Time-Frequency Distributional Attention Module (DAM) captures both temporal consistency and frequency robustness through a dual-branch attention design, improving tolerance to sequence inconsistency and visual style shifts. Then, based on gradient sensitivity and information bottleneck principles, an adaptive optimization module Distribution-aware Scaling Module (DSM) is introduced to dynamically balance classification and contrastive losses, enabling more stable and discriminative representation learning. Extensive experiments on two widely used datasets, DFEW and FERV39k, demonstrate that HDF significantly improves both recognition accuracy and robustness. Our method achieves superior weighted average recall (WAR) and unweighted average recall (UAR) while maintaining strong generalization across diverse and imbalanced scenarios. Codes are released at https://github.com/QIcita/HDF_DFER.
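
The two metrics named in the abstract have standard definitions, shown in the sketch below: UAR is the plain mean of per-class recalls, while WAR weights each class recall by its share of samples, which makes it equal to overall accuracy. The function name and the toy labels are illustrative only.

```python
import numpy as np

def war_uar(y_true, y_pred):
    """Return (weighted average recall, unweighted average recall)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    classes = np.unique(y_true)
    # Per-class recall: fraction of samples of class c that were predicted as c.
    recalls = np.array([(y_pred[y_true == c] == c).mean() for c in classes])
    counts = np.array([(y_true == c).sum() for c in classes])
    war = float((recalls * counts).sum() / counts.sum())  # equals overall accuracy
    uar = float(recalls.mean())                           # every class counts equally
    return war, uar

# Imbalanced toy case: class 0 has 4 samples, class 1 has 2.
war, uar = war_uar([0, 0, 0, 0, 1, 1], [0, 0, 0, 0, 1, 0])
print(round(war, 4), round(uar, 4))  # 0.8333 0.75
```

On imbalanced data such as the toy case, UAR falls below WAR when the minority class is recognized less reliably, which is why DFER papers typically report both.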

[151] Label tree semantic losses for rich multi-class medical image segmentation

Junwen Wang,Oscar MacCormac,William Rochford,Aaron Kujawa,Jonathan Shapey,Tom Vercauteren

Main category: cs.CV

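TL;DR: The paper proposes two tree-based semantic loss functions that exploit the label hierarchy to improve medical image segmentation, particularly when classes are numerous and semantically close.

Details Motivation: Current medical image segmentation methods penalise all errors equally and do not exploit inter-class semantics in the label space, which limits performance on tasks with many subtly different classes.

Contribution: Two tree-based semantic loss functions, integrated with a training approach for sparse annotations to extend their applicability and improve segmentation performance.

Method: The labels are organised hierarchically, and two tree-based semantic losses are designed to capture inter-class semantic relationships. The losses are combined with a method for training with sparse annotations to improve performance when annotations are limited.

Result: The method reaches state-of-the-art performance on whole brain parcellation (WBP) and on neurosurgical hyperspectral imaging (HSI).

Insight: Semantic information in the label hierarchy can guide segmentation effectively, especially when classes are numerous and semantically similar.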

Abstract: Rich and accurate medical image segmentation is poised to underpin the next generation of AI-defined clinical practice by delineating critical anatomy for pre-operative planning, guiding real-time intra-operative navigation, and supporting precise post-operative assessment. However, commonly used learning methods for medical and surgical imaging segmentation tasks penalise all errors equivalently and thus fail to exploit any inter-class semantics in the label space. This becomes particularly problematic as the cardinality and richness of labels increase to include subtly different classes. In this work, we propose two tree-based semantic loss functions which take advantage of a hierarchical organisation of the labels. We further incorporate our losses in a recently proposed approach for training with sparse, background-free annotations to extend the applicability of our proposed losses. Extensive experiments are reported on two medical and surgical image segmentation tasks, namely head MRI for whole brain parcellation (WBP) with full supervision and neurosurgical hyperspectral imaging (HSI) for scene understanding with sparse annotations. Results demonstrate that our proposed method reaches state-of-the-art performance in both cases.

[152] Exploring Superposition and Interference in State-of-the-Art Low-Parameter Vision Models

Lilian Hollard,Lucas Mohimont,Nathalie Gaveau,Luiz-Angelo Steffenel

Main category: cs.CV

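TL;DR: The paper studies low-parameter deep learning models for computer vision, focusing on bottleneck architectures and their behavior with superlinear activation functions. It proposes a new architecture, NoDepth Bottleneck, which limits interference in feature maps to improve the accuracy and scaling of small networks.

Details Motivation: The performance of low-parameter vision models is limited by interference in feature maps, a phenomenon associated with superlinear activation functions and with neurons encoding multiple features at once (superposition). The study aims to improve the efficiency and scaling of small networks by reducing interference.

Contribution: 1) An investigation of interference in bottleneck architectures; 2) identification of key design elements that reduce interference; 3) a proof-of-concept architecture, NoDepth Bottleneck, which improves the performance of small networks.

Method: The authors analyse the behavior of bottleneck architectures and superlinear activation functions experimentally, identify sources of interference, and design NoDepth Bottleneck from the resulting mechanistic insights.

Result: Experiments on ImageNet show that NoDepth Bottleneck scales well and achieves good accuracy in very small networks (under 1.5M parameters).

Insight: In low-parameter networks, interference is a key performance bottleneck; architectural choices that reduce it can improve efficiency and accuracy.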

Abstract: The paper investigates the performance of state-of-the-art low-parameter deep neural networks for computer vision, focusing on bottleneck architectures and their behavior using superlinear activation functions. We address interference in feature maps, a phenomenon associated with superposition, where neurons simultaneously encode multiple characteristics. Our research suggests that limiting interference can enhance scaling and accuracy in very low-scaled networks (under 1.5M parameters). We identify key design elements that reduce interference by examining various bottleneck architectures, leading to a more efficient neural network. Consequently, we propose a proof-of-concept architecture named NoDepth Bottleneck built on mechanistic insights from our experiments, demonstrating robust scaling accuracy on the ImageNet dataset. These findings contribute to more efficient and scalable neural networks for the low-parameter range and advance the understanding of bottlenecks in computer vision. https://caiac.pubpub.org/pub/3dh6rsel

[153] ConformalSAM: Unlocking the Potential of Foundational Segmentation Models in Semi-Supervised Semantic Segmentation with Conformal Prediction

Danhui Chen,Ziquan Liu,Chuxi Yang,Dan Wang,Yan Yan,Yi Xu,Xiangyang Ji

Main category: cs.CV

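TL;DR: ConformalSAM is a semi-supervised semantic segmentation framework. It uses masks generated by a foundational segmentation model (SEEM) as initial annotations and applies conformal prediction for uncertainty calibration to filter out low-confidence labels, making effective use of unlabeled data. It outperforms other semi-supervised methods on three standard benchmarks.

Details Motivation: Pixel-level vision tasks such as semantic segmentation need large amounts of high-quality annotated data, which is costly to obtain. Semi-supervised semantic segmentation (SSSS) eases this by using unlabeled data. The paper asks whether a foundational segmentation model can act as an annotator for unlabeled data to address label scarcity.

Contribution: The ConformalSAM framework, which 1) calibrates the foundation model (SEEM) with conformal prediction and selects high-confidence pixel labels, and 2) adds a self-reliance training strategy to avoid overfitting to the masks generated by the foundation model.

Method: 1) Generate predicted masks for unlabeled data with SEEM; 2) calibrate model uncertainty with conformal prediction and filter out low-confidence labels; 3) apply the self-reliance training strategy in the later training stage.

Result: On three standard SSSS benchmarks, ConformalSAM outperforms recent methods and can also be used as a plug-in to boost their performance.

Insight: With uncertainty calibration, the capability of a foundational segmentation model can be used reliably in semi-supervised learning; the self-reliance training strategy mitigates overfitting to the initial pseudo-labels.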

Abstract: Pixel-level vision tasks, such as semantic segmentation, require extensive and high-quality annotated data, which is costly to obtain. Semi-supervised semantic segmentation (SSSS) has emerged as a solution to alleviate the labeling burden by leveraging both labeled and unlabeled data through self-training techniques. Meanwhile, the advent of foundational segmentation models pre-trained on massive data has shown the potential to generalize across domains effectively. This work explores whether a foundational segmentation model can address label scarcity in the pixel-level vision task as an annotator for unlabeled images. Specifically, we investigate the efficacy of using SEEM, a Segment Anything Model (SAM) variant fine-tuned for textual input, to generate predictive masks for unlabeled data. To address the shortcomings of using SEEM-generated masks as supervision, we propose ConformalSAM, a novel SSSS framework which first calibrates the foundation model using the target domain's labeled data and then filters out unreliable pixel labels of unlabeled data so that only high-confidence labels are used as supervision. By leveraging conformal prediction (CP) to adapt foundation models to target data through uncertainty calibration, ConformalSAM reliably exploits the strong capability of the foundational segmentation model, which benefits early-stage learning, while a subsequent self-reliance training strategy mitigates overfitting to SEEM-generated masks in the later training stage. Our experiments demonstrate that, on three standard benchmarks of SSSS, ConformalSAM achieves superior performance compared to recent SSSS methods and helps boost the performance of those methods as a plug-in.
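
The calibrate-then-filter idea can be illustrated with standard split conformal prediction. The sketch below is not ConformalSAM's exact procedure: the nonconformity score (one minus the probability of the true class), the rule of keeping a pixel only when its prediction set contains exactly one class, and the `ignore_index` value of 255 are assumptions made for illustration.

```python
import numpy as np

def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    """Split-conformal threshold from labeled calibration pixels.

    cal_probs: (n, C) class probabilities; cal_labels: (n,) true class ids.
    """
    n = len(cal_labels)
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]  # nonconformity scores
    k = int(np.ceil((n + 1) * (1 - alpha)))             # finite-sample rank
    if k > n:
        return 1.0  # too few calibration points: every class stays in the set
    return float(np.sort(scores)[k - 1])

def filter_pseudo_labels(probs, qhat, ignore_index=255):
    """Keep a pixel's argmax label only if its prediction set is a singleton."""
    in_set = (1.0 - probs) <= qhat           # (N, C) prediction-set membership
    labels = probs.argmax(axis=1)
    confident = in_set.sum(axis=1) == 1      # assumption: singleton sets only
    return np.where(confident, labels, ignore_index)
```

Pixels mapped to `ignore_index` would simply be excluded from the segmentation loss, so only calibrated high-confidence pseudo-labels act as supervision.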

[154] True Multimodal In-Context Learning Needs Attention to the Visual Context

Shuo Chen,Jianzhe Liu,Zhen Han,Yan Xia,Daniel Cremers,Philip Torr,Volker Tresp,Jindong Gu

Main category: cs.CV

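TL;DR: The paper shows that current multimodal large language models (MLLMs) over-rely on text and neglect visual content in multimodal in-context learning (MICL). It proposes Dynamic Attention Reallocation (DARA) to address this and introduces a new dataset, TrueMICL, for evaluating MICL more reliably.

Details Motivation: Although MLLMs have improved on multimodal tasks, they tend to rely on textual patterns and neglect visual information, so that multimodal learning remains effectively unimodal. This limitation is concealed on tasks that do not require visual understanding, which leaves open how to genuinely improve and evaluate MICL ability.

Contribution: 1) Dynamic Attention Reallocation (DARA), which steers the model towards visual content; 2) the TrueMICL dataset, which explicitly requires integrating visual information to complete the task; 3) experiments validating the approach.

Method: DARA is a fine-tuning strategy that rebalances attention across visual and textual tokens so that the model attends more to the visual context. TrueMICL is designed specifically to evaluate true multimodal in-context learning ability.

Result: Experiments show that DARA substantially improves multimodal in-context learning, especially on tasks that require integrating visual information, as measured on TrueMICL.

Insight: Underuse of visual information is a bottleneck for current MLLMs in multimodal learning; the paper offers an efficient remedy and an evaluation tool for future work.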

Abstract: Multimodal Large Language Models (MLLMs), built on powerful language backbones, have enabled Multimodal In-Context Learning (MICL): adapting to new tasks from a few multimodal demonstrations consisting of images, questions, and answers. Despite showing noticeable improvement on standard vision-language datasets, current MLLMs struggle to leverage visual information in the demonstrations. Specifically, they tend to neglect visual cues and over-rely on textual patterns, leading to mere text imitation rather than genuine multimodal adaptation. This behavior makes MICL still unimodal and largely restricts its practical utility. More importantly, this limitation is often concealed by the improved performance on tasks that do not require understanding the visual context. As a result, how to effectively enhance MICL ability and reliably evaluate the MICL performance remains underexplored. To address these issues, we first introduce Dynamic Attention Reallocation (DARA), an efficient fine-tuning strategy that encourages models to attend to the visual context by rebalancing attention across visual and textual tokens. In addition, we present TrueMICL, an MICL-dedicated dataset with both support and test sets that explicitly requires the integration of multimodal information, particularly visual content, for correct task completion. Extensive experiments demonstrate the effectiveness of our holistic solution, showcasing substantial improvements in the true multimodal in-context learning capabilities. Code and datasets are available at https://chenxshuo.github.io/true-micl-colm .

[155] Can Your Model Separate Yolks with a Water Bottle? Benchmarking Physical Commonsense Understanding in Video Generation Models

Enes Sanli,Baris Sarper Tezcan,Aykut Erdem,Erkut Erdem

Main category: cs.CV

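TL;DR: The paper presents PhysVidBench, a benchmark for evaluating the physical commonsense understanding of text-to-video (T2V) generation models, with emphasis on the plausibility of tool use, material properties, and interactions.

Details Motivation: Current T2V models lack physical commonsense and produce videos that violate intuitive expectations about causality and object behavior. The paper aims to fill this gap with a systematic evaluation method.

Contribution: PhysVidBench, a benchmark of 383 carefully curated prompts, together with an indirect three-stage evaluation pipeline that avoids the hallucination issues of direct video-based evaluation.

Method: Physical commonsense is assessed indirectly by (1) formulating physics questions from the prompt, (2) captioning the generated video with a vision-language model, and (3) having a language model answer the questions using only the caption.

Result: The approach provides a structured, interpretable framework for assessing physical commonsense in T2V models.

Insight: Physical commonsense remains a significant challenge for T2V generation, and indirect evaluation can sidestep the weaknesses of direct evaluation.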

Abstract: Recent progress in text-to-video (T2V) generation has enabled the synthesis of visually compelling and temporally coherent videos from natural language. However, these models often fall short in basic physical commonsense, producing outputs that violate intuitive expectations around causality, object behavior, and tool use. Addressing this gap, we present PhysVidBench, a benchmark designed to evaluate the physical reasoning capabilities of T2V systems. The benchmark includes 383 carefully curated prompts, emphasizing tool use, material properties, and procedural interactions, domains where physical plausibility is crucial. For each prompt, we generate videos using diverse state-of-the-art models and adopt a three-stage evaluation pipeline: (1) formulate grounded physics questions from the prompt, (2) caption the generated video with a vision-language model, and (3) task a language model to answer several physics-involved questions using only the caption. This indirect strategy circumvents common hallucination issues in direct video-based evaluation. By highlighting affordances and tool-mediated actions, areas overlooked in current T2V evaluations, PhysVidBench provides a structured, interpretable framework for assessing physical commonsense in generative video models.
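
The three-stage pipeline can be outlined as a small function that takes the three models as callables. Everything here is hypothetical scaffolding: the function and parameter names, the assumption that questions are yes/no with "yes" meaning physically plausible, and the fraction-of-yes score are illustrative and not taken from PhysVidBench.

```python
from typing import Any, Callable, List

def evaluate_prompt(
    prompt: str,
    video: Any,
    make_questions: Callable[[str], List[str]],   # stage 1: prompt -> physics questions
    caption_video: Callable[[Any], str],          # stage 2: video -> caption (a VLM)
    answer_question: Callable[[str, str], bool],  # stage 3: (caption, question) -> yes/no
) -> float:
    """Return the fraction of physics questions answered 'yes' from the caption alone."""
    questions = make_questions(prompt)
    caption = caption_video(video)
    # The answering model never sees the video, only its caption.
    answers = [answer_question(caption, q) for q in questions]
    return sum(answers) / len(answers) if answers else 0.0

# Toy stand-ins for the three models, for illustration only.
score = evaluate_prompt(
    "separate an egg yolk with a water bottle",
    video=None,
    make_questions=lambda p: ["Is the bottle squeezed?", "Does the yolk enter the bottle?"],
    caption_video=lambda v: "A hand squeezes a bottle over an egg; the yolk stays on the plate.",
    answer_question=lambda cap, q: "squeeze" in q.lower() and "squeezes" in cap,
)
print(score)  # 0.5
```

Keeping the answering model blind to the video is the point of the design: it can only confirm physical events that the captioner actually described.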

[156] SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction

Zhixiong Zhang,Shuangrui Ding,Xiaoyi Dong,Songxin He,Jianfan Lin,Junsong Tang,Yuhang Zang,Yuhang Cao,Dahua Lin,Jiaqi Wang

Main category: cs.CV

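TL;DR: SeC is a concept-driven video object segmentation framework that progressively builds high-level object representations in place of conventional appearance matching, improving segmentation in complex scenes.

Details Motivation: Existing video object segmentation methods rely on appearance matching and struggle with drastic visual variations and complex scenes. The authors propose using conceptual understanding to close this gap.

Contribution: 1. The concept-driven segmentation framework SeC; 2. a method for constructing high-level semantic representations with LVLMs; 3. a new evaluation benchmark, SeCVOS.

Method: SeC uses Large Vision-Language Models (LVLMs) to build a semantic representation of the target object and, during inference, adaptively balances semantic reasoning with feature matching.

Result: SeC achieves an 11.8-point improvement over SAM 2.1 on the SeCVOS benchmark, setting a new state of the art.

Insight: Concept-driven segmentation improves adaptation to complex, dynamic scenes and addresses the limitations of appearance matching.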

Abstract: Video Object Segmentation (VOS) is a core task in computer vision, requiring models to track and segment target objects across video frames. Despite notable advances with recent efforts, current techniques still lag behind human capabilities in handling drastic visual variations, occlusions, and complex scene changes. This limitation arises from their reliance on appearance matching, neglecting the human-like conceptual understanding of objects that enables robust identification across temporal dynamics. Motivated by this gap, we propose Segment Concept (SeC), a concept-driven segmentation framework that shifts from conventional feature matching to the progressive construction and utilization of high-level, object-centric representations. SeC employs Large Vision-Language Models (LVLMs) to integrate visual cues across diverse frames, constructing robust conceptual priors. During inference, SeC forms a comprehensive semantic representation of the target based on processed frames, realizing robust segmentation of follow-up frames. Furthermore, SeC adaptively balances LVLM-based semantic reasoning with enhanced feature matching, dynamically adjusting computational efforts based on scene complexity. To rigorously assess VOS methods in scenarios demanding high-level conceptual reasoning and robust semantic understanding, we introduce the Semantic Complex Scenarios Video Object Segmentation benchmark (SeCVOS). SeCVOS comprises 160 manually annotated multi-scenario videos designed to challenge models with substantial appearance variations and dynamic scene transformations. In particular, SeC achieves an 11.8-point improvement over SAM 2.1 on SeCVOS, establishing a new state-of-the-art in concept-aware video object segmentation.

cs.CR [Back]

[157] ExCyTIn-Bench: Evaluating LLM agents on Cyber Threat Investigation

Yiran Wu,Mauricio Velazco,Andrew Zhao,Manuel Raúl Meléndez Luján,Srisuma Movva,Yogesh K Roy,Quang Nguyen,Roberto Rodriguez,Qingyun Wu,Michael Albada,Julia Kiseleva,Anand Mudgerikar

Main category: cs.CR

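TL;DR: ExCyTIn-Bench is the first benchmark for evaluating LLM agents on cyber threat investigation, testing agents with security questions derived from investigation graphs.

Details Motivation: Security analysts must sift through large numbers of heterogeneous alerts and security logs, follow multi-hop chains of evidence, and compile incident reports. LLM agents offer a promising way to automate this complex task.

Contribution: ExCyTIn-Bench, the first benchmark for cyber threat investigation, comprising a dataset of simulated real-world attacks and a question generation framework that yields automatic, explainable ground truth and is readily extensible.

Method: The dataset is built from a controlled Azure tenant and covers 8 simulated attacks, 57 log tables, and 589 automatically generated questions. Threat investigation graphs are built using expert-crafted detection logic, and questions and answers are generated from paired nodes on the graph.

Result: The average reward across evaluated models is 0.249 and the best is 0.368, which indicates that the task is difficult and leaves substantial headroom for future research.

Insight: Anchoring questions to explicit nodes and edges provides explainability, and the reusable, extensible pipeline lays a foundation for training agents with reinforcement learning.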

Abstract: We present ExCyTIn-Bench, the first benchmark to Evaluate an LLM agent x on the task of Cyber Threat Investigation through security questions derived from investigation graphs. Real-world security analysts must sift through a large number of heterogeneous alert signals and security logs, follow multi-hop chains of evidence, and compile an incident report. With the development of LLMs, building LLM-based agents for automatic threat investigation is a promising direction. To assist the development and evaluation of LLM agents, we construct a dataset from a controlled Azure tenant that covers 8 simulated real-world multi-step attacks, 57 log tables from Microsoft Sentinel and related services, and 589 automatically generated questions. We leverage security logs extracted with expert-crafted detection logic to build threat investigation graphs, and then generate questions with LLMs using paired nodes on the graph, taking the start node as background context and the end node as answer. Anchoring each question to these explicit nodes and edges not only provides automatic, explainable ground truth answers but also makes the pipeline reusable and readily extensible to new logs. This also enables the automatic generation of procedural tasks with verifiable rewards, which can be naturally extended to training agents via reinforcement learning. Our comprehensive experiments with different models confirm the difficulty of the task: with the base setting, the average reward across all evaluated models is 0.249, and the best achieved is 0.368, leaving substantial headroom for future research. Code and data are coming soon!

[158] Breaking the Illusion of Security via Interpretation: Interpretable Vision Transformer Systems under Attack

Eldor Abdukhamidov,Mohammed Abuhamad,Simon S. Woo,Hyoungshick Kim,Tamer Abuhmed

Main category: cs.CR

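TL;DR: The paper studies the vulnerability of Vision Transformer (ViT) systems to adversarial attacks when they are coupled with interpretation models, and proposes an attack, AdViT, that deceives both the ViT model and its interpretation model. Experiments show that AdViT achieves a high attack success rate in both white-box and black-box scenarios and that its adversarial examples come with high confidence and seemingly accurate interpretations, which makes them hard to detect.

Details Motivation: Existing research focuses on generating minimal adversarial perturbations that deceive ViT models and ignores the effect of those perturbations on model interpretations, even though interpretation models are often used to detect adversarial examples. The paper aims to show that ViT systems can still be attacked when combined with interpretation models.

Contribution: The AdViT attack, which misleads both a ViT model and its coupled interpretation model, with experimental validation. The study also shows that adversarial examples can yield seemingly accurate interpretations, which makes defense harder.

Method: AdViT generates adversarial examples that are optimized to mislead both the ViT model and the interpretation model. Extensive experiments cover various ViT models and two interpreters, in white-box and black-box attack scenarios.

Result: AdViT achieves a 100% attack success rate in both scenarios, with up to 98% misclassification confidence in the white-box setting and up to 76% in the black-box setting. The interpretations of the adversarial examples appear accurate, which makes them harder to detect.

Insight: Coupling ViT systems with interpretation models does not make them immune to adversarial attacks; adversarial examples can even produce realistic-looking interpretations, which poses new challenges for defense research.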

Abstract: Vision transformer (ViT) models, when coupled with interpretation models, are regarded as secure and challenging to deceive, making them well-suited for security-critical domains such as medical applications, autonomous vehicles, drones, and robotics. However, successful attacks on these systems can lead to severe consequences. Recent research on threats targeting ViT models primarily focuses on generating the smallest adversarial perturbations that can deceive the models with high confidence, without considering their impact on model interpretations. Nevertheless, the use of interpretation models can effectively assist in detecting adversarial examples. This study investigates the vulnerability of transformer models to adversarial attacks, even when combined with interpretation models. We propose an attack called “AdViT” that generates adversarial examples capable of misleading both a given transformer model and its coupled interpretation model. Through extensive experiments on various transformer models and two transformer-based interpreters, we demonstrate that AdViT achieves a 100% attack success rate in both white-box and black-box scenarios. In white-box scenarios, it reaches up to 98% misclassification confidence, while in black-box scenarios, it reaches up to 76% misclassification confidence. Remarkably, AdViT consistently generates accurate interpretations in both scenarios, making the adversarial examples more difficult to detect.

eess.IV [Back]

[159] Exposing and Mitigating Calibration Biases and Demographic Unfairness in MLLM Few-Shot In-Context Learning for Medical Image Classification

Xing Shen,Justin Szeto,Mingyang Li,Hengguan Huang,Tal Arbel

Main category: eess.IV

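TL;DR: The paper studies calibration biases and demographic unfairness of multimodal large language models (MLLMs) in medical image classification and proposes CALIN, an inference-time calibration method that mitigates these biases. CALIN estimates calibration matrices with a bi-level procedure and applies them to the predicted confidence scores; experiments on several medical datasets demonstrate its effectiveness.

Details Motivation: MLLMs show great potential in medical image analysis, but the calibration errors and demographic biases of their predictions have not been studied in depth. The authors undertake this study to support safe deployment in clinical practice.

Contribution: 1. The first investigation of calibration biases and demographic unfairness of MLLMs in medical image classification. 2. CALIN, which estimates and applies bi-level calibration matrices to reduce bias and improve prediction accuracy.

Method: CALIN estimates calibration matrices with a bi-level procedure (from the population level to the subgroup level) and applies them at inference time to calibrate the confidence scores.

Result: Experiments on three datasets (PAPILA, HAM10000, and MIMIC-CXR) show that CALIN improves prediction accuracy and calibration fairness with a minimal fairness-utility trade-off.

Insight: The study exposes potential bias in MLLMs in the medical domain and offers a practical remedy that can inform future fairness research.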

Abstract: Multimodal large language models (MLLMs) have enormous potential to perform few-shot in-context learning in the context of medical image analysis. However, safe deployment of these models into real-world clinical practice requires an in-depth analysis of the accuracies of their predictions, and their associated calibration errors, particularly across different demographic subgroups. In this work, we present the first investigation into the calibration biases and demographic unfairness of MLLMs’ predictions and confidence scores in few-shot in-context learning for medical image classification. We introduce CALIN, an inference-time calibration method designed to mitigate the associated biases. Specifically, CALIN estimates the amount of calibration needed, represented by calibration matrices, using a bi-level procedure: progressing from the population level to the subgroup level prior to inference. It then applies this estimation to calibrate the predicted confidence scores during inference. Experimental results on three medical imaging datasets: PAPILA for fundus image classification, HAM10000 for skin cancer classification, and MIMIC-CXR for chest X-ray classification demonstrate CALIN’s effectiveness at ensuring fair confidence calibration in its prediction, while improving its overall prediction accuracies and exhibiting minimum fairness-utility trade-off. Our codebase can be found at https://github.com/xingbpshen/medical-calibration-fairness-mllm.
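
Calibration fairness across subgroups is commonly quantified with expected calibration error (ECE) computed per subgroup. The sketch below shows that measurement only; it does not reproduce CALIN's calibration matrices. The equal-width binning, the bin count, and the use of the max-min gap as a fairness summary are assumptions for illustration.

```python
import numpy as np

def ece(conf, correct, n_bins=10):
    """Expected calibration error with equal-width confidence bins on [0, 1]."""
    conf = np.asarray(conf, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Bin b holds confidences in (edges[b], edges[b + 1]]; bin 0 also holds 0.0.
    bin_ids = np.digitize(conf, edges[1:-1], right=True)
    total = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            # Weight each bin's |accuracy - confidence| gap by its share of samples.
            total += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return float(total)

def subgroup_ece_gap(conf, correct, groups, n_bins=10):
    """Per-subgroup ECE and the gap between the worst and best calibrated subgroup."""
    conf, correct, groups = map(np.asarray, (conf, correct, groups))
    per_group = {
        str(g): ece(conf[groups == g], correct[groups == g], n_bins)
        for g in np.unique(groups)
    }
    return per_group, max(per_group.values()) - min(per_group.values())
```

A large gap means the model's confidence is trustworthy for some demographic subgroups and misleading for others, even if the pooled ECE looks acceptable.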

[160] NuSeC: A Dataset for Nuclei Segmentation in Breast Cancer Histopathology Images

Refik Samet,Nooshin Nemati,Emrah Hancer,Serpil Sak,Bilge Ayca Kirmizi

Main category: eess.IV

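TL;DR: NuSeC is a dataset for nuclei segmentation in breast cancer histopathology images. It contains 100 images from 25 patients, split into 75 training images and 25 test images.

Details Motivation: To enable standardized evaluation of nuclei segmentation in breast cancer histopathology images, the authors built the NuSeC dataset, which is intended to support consistent comparison of future methods.

Contribution: NuSeC gives the nuclei segmentation field a standardized benchmark, and its data split is documented in detail.

Method: Four images are selected from each of 25 patients, for 100 images in total. One image per patient is randomly chosen for the test set and the rest form the training set, so that future comparisons are consistent.

Result: The training set contains 75 images (about 30,000 nuclei structures) and the test set contains 25 images (about 6,000 nuclei structures).

Insight: NuSeC's partitioning scheme keeps the data distribution balanced across patients and gives future work a reproducible experimental basis.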

Abstract: The NuSeC dataset was created by selecting 4 images of 1024×1024 pixels from the slides of each of 25 patients. The dataset therefore contains 100 images in total. To allow a consistent comparative analysis between the methods that researchers develop on NuSeC in the future, we divide the dataset into a training set (75%) and a testing set (25%). Specifically, one image is randomly selected from the 4 images of each patient to build the testing set, and the remaining images are reserved for the training set. The training set includes 75 images with around 30000 nuclei structures, and the testing set includes 25 images with around 6000 nuclei structures.
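
The per-patient split described above can be sketched as follows. The file names, the seed, and the helper name `split_by_patient` are hypothetical; the sketch only reproduces the stated procedure of holding out one random image per patient.

```python
import random

def split_by_patient(images_by_patient, seed=0):
    """Hold out one randomly chosen image per patient for testing."""
    rng = random.Random(seed)
    train, test = [], []
    for _patient, images in sorted(images_by_patient.items()):
        held_out = rng.choice(images)
        test.append(held_out)
        train.extend(img for img in images if img != held_out)
    return train, test

# Hypothetical file names: 25 patients with 4 images each.
dataset = {
    f"patient_{p:02d}": [f"patient_{p:02d}_img_{i}.png" for i in range(4)]
    for p in range(25)
}
train_set, test_set = split_by_patient(dataset)
print(len(train_set), len(test_set))  # 75 25
```

Splitting within each patient, rather than across patients, means every patient appears in both sets, which is what the abstract describes.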

[161] Classification of Histopathology Slides with Persistence Homology Convolutions

Shrunal Pothagoni,Benjamin Schweinhart

Main category: eess.IV

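TL;DR: This paper proposes Persistent Homology Convolutions, a new method that captures local topological information for histopathology image classification and improves model performance.

Details Motivation: In histopathology, topological information such as cell shape is important for diagnosis, but conventional CNNs may lose it. Global topological summaries carry no information about the locality of features, so a method is needed that captures the locality and translation invariance of topological features.

Contribution: Proposes Persistent Homology Convolutions, which generate data based on local persistent homology and address the lack of local topological information in earlier methods.

Method: A modified convolution operator (Persistent Homology Convolutions) generates local topological features, which are combined with CNNs for image classification. The experiments compare several representations of histopathology slides.

Result: Models trained with persistent homology convolutions outperform conventionally trained models and are less sensitive to hyperparameters.

Insight: Local topological information matters for histopathology image classification, and persistent homology convolutions capture it effectively.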

Abstract: Convolutional neural networks (CNNs) are a standard tool for computer vision tasks such as image classification. However, typical model architectures may result in the loss of topological information. In specific domains such as histopathology, topology is an important descriptor that can be used to distinguish between disease-indicating tissue by analyzing the shape characteristics of cells. Current literature suggests that reintroducing topological information using persistent homology can improve medical diagnostics; however, previous methods utilize global topological summaries which do not contain information about the locality of topological features. To address this gap, we present a novel method that generates local persistent homology-based data using a modified version of the convolution operator called Persistent Homology Convolutions. This method captures information about the locality and translation invariance of topological features. We perform a comparative study using various representations of histopathology slides and find that models trained with persistent homology convolutions outperform conventionally trained models and are less sensitive to hyperparameters. These results indicate that persistent homology convolutions extract meaningful geometric information from the histopathology slides.

[162] Performance Analysis of Post-Training Quantization for CNN-based Conjunctival Pallor Anemia Detection

Sebastian A. Cruz Romero,Wilfredo E. Lugo Beauchamp

Main category: eess.IV

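TL;DR: This paper explores deep learning models for detecting anemia from conjunctival pallor and evaluates a MobileNet-based model on the CP-AnemiC dataset. The model is quantized after training for deployment on edge devices, and FP16 quantization is found to retain high accuracy.

Details Motivation: Anemia is a global health problem, and existing detection methods are costly and depend on expert knowledge. Deep learning can offer a solution for low-resource settings and is well suited to mobile healthcare applications.

Contribution: Proposes a MobileNet-based anemia detection method; evaluates how post-training quantization at different bit-widths affects model performance; points to optimization directions for deployment on mobile devices.

Method: A MobileNet backbone is fine-tuned with data augmentation and cross-validation; the impact of FP32, FP16, INT8, and INT4 quantization on performance is then measured.

Result: The model reaches an accuracy of 0.9313 before quantization. FP16 quantization causes only a small drop (accuracy 0.9250), whereas INT8 and INT4 quantization lead to significant degradation.

Insight: FP16 quantization is a viable option for mobile deployment, balancing performance and resource use; more aggressive quantization can significantly reduce diagnostic accuracy.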

Abstract: Anemia is a widespread global health issue, particularly among young children in low-resource settings. Traditional methods for anemia detection often require expensive equipment and expert knowledge, creating barriers to early and accurate diagnosis. To address these challenges, we explore the use of deep learning models for detecting anemia through conjunctival pallor, focusing on the CP-AnemiC dataset, which includes 710 images from children aged 6-59 months. The dataset is annotated with hemoglobin levels, gender, age and other demographic data, enabling the development of machine learning models for accurate anemia detection. We use the MobileNet architecture as a backbone, known for its efficiency in mobile and embedded vision applications, and fine-tune our model end-to-end using data augmentation techniques and a cross-validation strategy. Our model implementation achieved an accuracy of 0.9313, a precision of 0.9374, and an F1 score of 0.9773 demonstrating strong performance on the dataset. To optimize the model for deployment on edge devices, we performed post-training quantization, evaluating the impact of different bit-widths (FP32, FP16, INT8, and INT4) on model performance. Preliminary results suggest that while FP16 quantization maintains high accuracy (0.9250), precision (0.9370), and F1 Score (0.9377), more aggressive quantization (INT8 and INT4) leads to significant performance degradation. Overall, our study supports further exploration of quantization schemes and hardware optimizations to assess trade-offs between model size, inference time, and diagnostic accuracy in mobile healthcare applications.
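
Why lower bit-widths lose more information can be seen in a minimal sketch of symmetric uniform quantization. This is a generic textbook scheme applied to made-up weights, not the paper's pipeline or toolchain, and it covers only the integer formats.

```python
def quantize_roundtrip(values, bits):
    """Symmetric uniform quantization to signed `bits`-bit integers and back."""
    qmax = 2 ** (bits - 1) - 1  # 127 for INT8, 7 for INT4
    scale = (max(abs(v) for v in values) / qmax) or 1.0  # avoid a zero scale
    quantized = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return [q * scale for q in quantized]

def mean_abs_error(a, b):
    """Average absolute difference between two equally long sequences."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

# Hypothetical weights, not taken from the paper's model.
weights = [0.013, -0.42, 0.27, 0.9, -0.051, 0.33, -0.76, 0.08]
err_int8 = mean_abs_error(weights, quantize_roundtrip(weights, 8))
err_int4 = mean_abs_error(weights, quantize_roundtrip(weights, 4))
print(err_int8 < err_int4)  # True
```

With 4 bits there are only 15 representable levels, so the rounding step is about 18 times coarser than with 8 bits; whether that translates into a drop in diagnostic accuracy depends on the model and has to be measured, as the paper does.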

[163] Personalized 3D Myocardial Infarct Geometry Reconstruction from Cine MRI with Explicit Cardiac Motion Modeling

Yilin Lyu,Fan Yang,Xiaoyue Liu,Zichen Jiang,Joshua Dillon,Debbie Zhao,Martyn Nash,Charlene Mauger,Alistair Young,Ching-Hui Sia,Mark YY Chan,Lei Li

Main category: eess.IV

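TL;DR: The paper proposes a contrast-free method for reconstructing 3D myocardial infarct geometry, achieving high-fidelity reconstruction from standard 2D cine MRI through explicit cardiac motion modeling.

Details Motivation: Late gadolinium enhancement (LGE) MRI is the clinical standard, but it requires contrast agents and relies on sparsely sampled 2D slices, which limits spatial resolution and accuracy. The study aims to overcome these limitations.

Contribution: A framework that automatically reconstructs 3D myocardial infarct geometry from 2D cine MRI using cardiac motion modeling, avoiding contrast agents.

Method: Two steps: 1) reconstruct a 4D biventricular mesh from multi-view cine MRI with the biv-me model; 2) use the proposed CMotion2Infarct-Net to localize infarct regions from motion patterns in the dynamic geometry.

Result: Evaluated on 205 cine MRI scans from 126 MI patients, the method shows reasonable agreement with manual delineation.

Insight: Contrast-free modeling of dynamic geometry can feasibly reconstruct 3D infarct regions, opening a path toward digital twins for myocardial infarction.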

Abstract: Accurate representation of myocardial infarct geometry is crucial for patient-specific cardiac modeling in MI patients. While late gadolinium enhancement (LGE) MRI is the clinical gold standard for infarct detection, it requires contrast agents, introducing side effects and patient discomfort. Moreover, infarct reconstruction from LGE often relies on sparsely sampled 2D slices, limiting spatial resolution and accuracy. In this work, we propose a novel framework for automatically reconstructing high-fidelity 3D myocardial infarct geometry from 2D clinically standard cine MRI, eliminating the need for contrast agents. Specifically, we first reconstruct the 4D biventricular mesh from multi-view cine MRIs via an automatic deep shape fitting model, biv-me. Then, we design an infarction reconstruction model, CMotion2Infarct-Net, to explicitly utilize the motion patterns within this dynamic geometry to localize infarct regions. Evaluated on 205 cine MRI scans from 126 MI patients, our method shows reasonable agreement with manual delineation. This study demonstrates the feasibility of contrast-free, cardiac motion-driven 3D infarct reconstruction, paving the way for efficient digital twin of MI.

[164] EndoControlMag: Robust Endoscopic Vascular Motion Magnification with Periodic Reference Resetting and Hierarchical Tissue-aware Dual-Mask Control

An Wang,Rulin Zhou,Mengya Xu,Yiru Ye,Longfei Gou,Yiting Chang,Hao Chen,Chwee Ming Lim,Jiankun Wang,Hongliang Ren

Main category: eess.IV

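TL;DR: EndoControlMag is a training-free, Lagrangian-based framework that uses periodic reference resetting and hierarchical tissue-aware dual-mode mask control to make endoscopic vascular motion magnification substantially more robust.

Details Motivation: Visualizing subtle vascular motion during endoscopic surgery is crucial for surgical precision and decision-making, but it is very challenging because surgical scenes are complex and dynamic.

Contribution: EndoControlMag has two key modules, Periodic Reference Resetting (PRR) and Hierarchical Tissue-aware Magnification (HTM), which address error accumulation and tissue deformation.

Method: PRR divides the video into short overlapping clips and dynamically updates the reference frame. HTM combines a pretrained visual tracking model with a dual-mode mask dilation strategy; the two modes handle complex tissue deformation and unreliable optical flow, respectively.

Result: On the EndoVMM24 dataset, EndoControlMag significantly outperforms existing methods in magnification accuracy and visual quality while remaining robust.

Insight: The dual-mode strategy (motion-based and distance-based) adapts to different surgical scenarios and offers a new technical approach to endoscopic motion magnification.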

Abstract: Visualizing subtle vascular motions in endoscopic surgery is crucial for surgical precision and decision-making, yet remains challenging due to the complex and dynamic nature of surgical scenes. To address this, we introduce EndoControlMag, a training-free, Lagrangian-based framework with mask-conditioned vascular motion magnification tailored to endoscopic environments. Our approach features two key modules: a Periodic Reference Resetting (PRR) scheme that divides videos into short overlapping clips with dynamically updated reference frames to prevent error accumulation while maintaining temporal coherence, and a Hierarchical Tissue-aware Magnification (HTM) framework with dual-mode mask dilation. HTM first tracks vessel cores using a pretrained visual tracking model to maintain accurate localization despite occlusions and view changes. It then applies one of two adaptive softening strategies to surrounding tissues: motion-based softening that modulates magnification strength proportional to observed tissue displacement, or distance-based exponential decay that simulates biomechanical force attenuation. This dual-mode approach accommodates diverse surgical scenarios: motion-based softening excels with complex tissue deformations, while distance-based softening provides stability during unreliable optical flow conditions. We evaluate EndoControlMag on our EndoVMM24 dataset spanning four different surgery types and various challenging scenarios, including occlusions, instrument disturbance, view changes, and vessel deformations. Quantitative metrics, visual assessments, and expert surgeon evaluations demonstrate that EndoControlMag significantly outperforms existing methods in both magnification accuracy and visual quality while maintaining robustness across challenging surgical conditions. The code, dataset, and video results are available at https://szupc.github.io/EndoControlMag/.
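
Two ideas from this abstract lend themselves to a small sketch: splitting a video into short overlapping clips (PRR) and weighting magnification by an exponential decay in distance from the vessel core. The function names, clip length, overlap, and decay rate are all hypothetical; the released code may organize this differently.

```python
import math

def overlapping_clips(num_frames, clip_len, overlap):
    """Return (start, end) frame ranges; each clip restarts from a fresh reference."""
    step = clip_len - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than clip_len")
    clips = []
    start = 0
    while start < num_frames:
        end = min(start + clip_len, num_frames)
        clips.append((start, end))
        if end == num_frames:
            break
        start += step
    return clips

def distance_softening(distance_px, decay_rate):
    """Magnification weight that decays exponentially away from the vessel core."""
    return math.exp(-decay_rate * distance_px)

print(overlapping_clips(num_frames=10, clip_len=4, overlap=1))
# [(0, 4), (3, 7), (6, 10)]
```

Because each clip is short, tracking error can only accumulate over a few frames before the reference is reset, and the shared frames between neighboring clips help keep the output temporally coherent.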

[165] Latent Space Synergy: Text-Guided Data Augmentation for Direct Diffusion Biomedical Segmentation

Muhammad Aqeel,Maham Nazir,Zanxi Ruan,Francesco Setti

Main category: eess.IV

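TL;DR: SynDiff is a framework that combines text-guided synthetic data generation with efficient diffusion-based segmentation to address data scarcity in medical image segmentation. It generates clinically realistic synthetic polyps through text-conditioned inpainting and significantly improves segmentation performance.

Details Motivation: Medical image segmentation, especially polyp detection, suffers from data scarcity, and annotation requires specialized expertise. Traditional diffusion methods rely on slow iterative denoising and cannot meet real-time clinical needs. The authors therefore combine text guidance with direct latent estimation.

Contribution: 1. Proposes the SynDiff framework, which combines text-guided synthetic data generation with efficient diffusion-based segmentation; 2. Introduces direct latent estimation, enabling single-step inference and a large computational speedup; 3. Achieves high segmentation accuracy on CVC-ClinicDB while remaining real-time capable.

Method: 1. A latent diffusion model generates clinically realistic synthetic polyps through text-conditioned inpainting; 2. Direct latent estimation replaces iterative denoising and enables single-step inference; 3. The synthetic data augments training to improve the robustness of the segmentation model.

Result: On CVC-ClinicDB, SynDiff achieves 96.0% Dice and 92.9% IoU while maintaining real-time capability suitable for clinical deployment.

Insight: 1. Text-guided synthetic augmentation can alleviate data scarcity in medical imaging; 2. Direct latent estimation greatly improves the inference efficiency of diffusion models; 3. Controlled synthetic augmentation improves robustness without introducing distribution shift.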

Abstract: Medical image segmentation suffers from data scarcity, particularly in polyp detection where annotation requires specialized expertise. We present SynDiff, a framework combining text-guided synthetic data generation with efficient diffusion-based segmentation. Our approach employs latent diffusion models to generate clinically realistic synthetic polyps through text-conditioned inpainting, augmenting limited training data with semantically diverse samples. Unlike traditional diffusion methods requiring iterative denoising, we introduce direct latent estimation enabling single-step inference with T× computational speedup. On CVC-ClinicDB, SynDiff achieves 96.0% Dice and 92.9% IoU while maintaining real-time capability suitable for clinical deployment. The framework demonstrates that controlled synthetic augmentation improves segmentation robustness without distribution shift. SynDiff bridges the gap between data-hungry deep learning models and clinical constraints, offering an efficient solution for deployment in resource-limited medical settings.

[166] A Steel Surface Defect Detection Method Based on Lightweight Convolution Optimization

Cong Chen,Ming Chen,Hoileong Lee,Yan Li,Jiyang Yu

Main category: eess.IV

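TL;DR: This paper proposes a steel surface defect detection method based on lightweight convolution optimization. It combines YOLOv9s with the C3Ghost module, the SCConv module, and the CARAFE upsampling operator to improve detection accuracy and model performance.

Details Motivation: Steel surface defect detection struggles with multi-scale defects. Traditional methods have insufficient accuracy and high miss-detection rates, particularly in complex environments and for small defects.

Contribution: An improved YOLOv9s framework in which the C3Ghost module reduces redundant computation, the SCConv module optimizes feature representation, and the CARAFE upsampling operator improves restoration of detail.

Method: YOLOv9s is combined with the SCConv module, the C3Ghost module, and the CARAFE upsampling operator to optimize feature extraction and upsampling, improving detection efficiency and accuracy.

Result: Experiments show that the method achieves higher accuracy and robustness than other methods on steel surface defect detection.

Insight: Lightweight, content-aware modules can improve the efficiency and quality of defect detection without sacrificing performance.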

Abstract: Surface defect detection of steel, especially the recognition of multi-scale defects, has always been a major challenge in industrial manufacturing. Steel surfaces have defects of various sizes and shapes, which limits the accuracy of traditional image processing and detection methods in complex environments. In addition, traditional defect detection methods suffer from insufficient accuracy and high miss-detection rates when dealing with small target defects. To address these issues, this study proposes a detection framework based on deep learning, specifically YOLOv9s, combined with the C3Ghost module, SCConv module, and CARAFE upsampling operator, to improve detection accuracy and model performance. First, the SCConv module is used to reduce feature redundancy and optimize feature representation by reconstructing the spatial and channel dimensions. Second, the C3Ghost module is introduced to enhance the model’s feature extraction ability by reducing redundant computations and parameter volume, thereby improving model efficiency. Finally, the CARAFE upsampling operator, which can more finely reorganize feature maps in a content-aware manner, optimizes the upsampling process and ensures detailed restoration of high-resolution defect regions. Experimental results demonstrate that the proposed model achieves higher accuracy and robustness in steel surface defect detection tasks compared to other methods, effectively addressing defect detection problems.

cs.AI [Back]

[167] Inverse Scaling in Test-Time Compute

Aryo Pradipta Gema,Alexander Hägele,Runjin Chen,Andy Arditi,Jacob Goldman-Wetzler,Kit Fraser-Taliente,Henry Sleight,Linda Petrini,Julian Michael,Beatrice Alex,Pasquale Minervini,Yanda Chen,Joe Benton,Ethan Perez

Main category: cs.AI

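TL;DR: The paper finds that, on certain tasks, extending the reasoning length of large reasoning models lowers performance, revealing an inverse scaling relationship between test-time compute and accuracy, and identifies five failure modes.

Details Motivation: The study examines how test-time compute (reasoning length) affects the performance of Large Reasoning Models (LRMs), especially on tasks where inverse scaling can arise.

Contribution: Constructs evaluation tasks in four categories, shows that longer reasoning can degrade performance, and summarizes five failure modes, exposing potential risks of scaling test-time compute.

Method: Four task categories are designed (simple counting with distractors, regression with spurious features, deduction with constraint tracking, and advanced AI risks); model performance is measured at different reasoning lengths and the failure modes are analyzed.

Result: Longer reasoning degrades performance on tasks involving distractors, spurious features, and complex deduction, and may amplify concerning behaviors.

Insight: Test-time compute scaling remains promising for improving capabilities, but it may also reinforce problematic reasoning patterns, so models should be evaluated across diverse reasoning lengths.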

Abstract: We construct evaluation tasks where extending the reasoning length of Large Reasoning Models (LRMs) deteriorates performance, exhibiting an inverse scaling relationship between test-time compute and accuracy. Our evaluation tasks span four categories: simple counting tasks with distractors, regression tasks with spurious features, deduction tasks with constraint tracking, and advanced AI risks. We identify five distinct failure modes when models reason for longer: 1) Claude models become increasingly distracted by irrelevant information; 2) OpenAI o-series models resist distractors but overfit to problem framings; 3) models shift from reasonable priors to spurious correlations; 4) all models show difficulties in maintaining focus on complex deductive tasks; and 5) extended reasoning may amplify concerning behaviors, with Claude Sonnet 4 showing increased expressions of self-preservation. These findings suggest that while test-time compute scaling remains promising for improving model capabilities, it may inadvertently reinforce problematic reasoning patterns. Our results demonstrate the importance of evaluating models across diverse reasoning lengths to identify and address these failure modes in LRMs.

[168] Routine: A Structural Planning Framework for LLM Agent System in Enterprise

Guancheng Zeng,Xueyi Chen,Jiawang Hu,Shaohua Qi,Yaxuan Mao,Zhantao Wang,Yifan Nie,Shuang Li,Qiuyang Feng,Pengxu Qiu,Yujia Wang,Wenqiang Han,Linyan Huang,Gang Li,Jingjing Mo,Haowen Hu

Main category: cs.AI

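TL;DR: This paper proposes Routine, a framework that uses structured planning and explicit instructions to address unstable execution of LLM agent systems in enterprise environments, significantly improving accuracy on tool-calling tasks.

Details Motivation: Agent systems in enterprise environments often lack domain-specific process knowledge, which leads to unstable execution and disorganized plans. Routine addresses these problems with structured, explicit instructions.

Contribution: 1. Proposes the Routine framework, which improves execution stability through multi-step planning and parameter passing. 2. Builds a Routine-following training dataset and further improves model performance through distillation. 3. Shows experimentally that Routine significantly improves LLM tool-calling accuracy in enterprise scenarios.

Method: 1. Design a structured multi-step planning framework with explicit instructions and a parameter-passing mechanism. 2. Build scenario-specific datasets and improve model performance through distillation and fine-tuning.

Result: Routine raises GPT-4o's tool-calling accuracy from 41.1% to 96.3% and Qwen3-14B's from 32.6% to 83.3%. Fine-tuning Qwen3-14B on the Routine-following dataset raises its accuracy to 88.2%, and fine-tuning on the Routine-distilled dataset raises it to 95.5%.

Insight: Structured planning and distillation of domain-specific knowledge are key to improving the execution stability and adaptability of LLMs in enterprise environments.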

Abstract: The deployment of agent systems in an enterprise environment is often hindered by several challenges: common models lack domain-specific process knowledge, leading to disorganized plans, missing key tools, and poor execution stability. To address this, this paper introduces Routine, a multi-step agent planning framework designed with a clear structure, explicit instructions, and seamless parameter passing to guide the agent’s execution module in performing multi-step tool-calling tasks with high stability. In evaluations conducted within a real-world enterprise scenario, Routine significantly increases the execution accuracy in model tool calls, increasing the performance of GPT-4o from 41.1% to 96.3%, and Qwen3-14B from 32.6% to 83.3%. We further constructed a Routine-following training dataset and fine-tuned Qwen3-14B, resulting in an accuracy increase to 88.2% on scenario-specific evaluations, indicating improved adherence to execution plans. In addition, we employed Routine-based distillation to create a scenario-specific, multi-step tool-calling dataset. Fine-tuning on this distilled dataset raised the model’s accuracy to 95.5%, approaching GPT-4o’s performance. These results highlight Routine’s effectiveness in distilling domain-specific tool-usage patterns and enhancing model adaptability to new scenarios. Our experimental results demonstrate that Routine provides a practical and accessible approach to building stable agent workflows, accelerating the deployment and adoption of agent systems in enterprise environments, and advancing the technical vision of AI for Process.
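
A structured multi-step plan with explicit parameter passing might look like the sketch below. The `Step` class, the `$name` reference syntax, and the two tools are hypothetical illustrations; the paper's actual Routine format is not given in the abstract.

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    tool: str                                   # name of the tool to call
    inputs: dict = field(default_factory=dict)  # literals or "$name" references
    output: str = ""                            # variable that stores the result

def run_routine(steps, tools):
    """Execute steps in order, resolving "$name" references from earlier outputs."""
    memory = {}
    for step in steps:
        args = {
            key: memory[val[1:]] if isinstance(val, str) and val.startswith("$") else val
            for key, val in step.inputs.items()
        }
        result = tools[step.tool](**args)
        if step.output:
            memory[step.output] = result
    return memory

# Hypothetical tools and routine, for illustration only.
tools = {
    "lookup_employee": lambda name: {"id": 42, "name": name},
    "get_leave_balance": lambda employee: 10 if employee["id"] == 42 else 0,
}
routine = [
    Step("lookup_employee", {"name": "Ada"}, "employee"),
    Step("get_leave_balance", {"employee": "$employee"}, "balance"),
]
print(run_routine(routine, tools)["balance"])  # 10
```

Naming each step's output and referring to it explicitly in later steps means the executing model never has to guess which earlier result a tool argument should come from, which is one plausible reading of "seamless parameter passing".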

[169] Towards physician-centered oversight of conversational diagnostic AI

Elahe Vedadi,David Barrett,Natalie Harris,Ellery Wulczyn,Shashir Reddy,Roma Ruparel,Mike Schaekermann,Tim Strother,Ryutaro Tanno,Yash Sharma,Jihyeon Lee,Cían Hughes,Dylan Slack,Anil Palepu,Jan Freyberg,Khaled Saab,Valentin Liévin,Wei-Hung Weng,Tao Tu,Yun Liu,Nenad Tomasev,Kavita Kulkarni,S. Sara Mahdavi,Kelvin Guu,Joëlle Barral,Dale R. Webster,James Manyika,Avinatan Hassidim,Katherine Chou,Yossi Matias,Pushmeet Kohli,Adam Rodman,Vivek Natarajan,Alan Karthikesalingam,David Stutz

Main category: cs.AI

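TL;DR: The paper proposes g-AMIE, a multi-agent system framework for physician-centered asynchronous oversight of conversational diagnostic AI, improving the quality and efficiency of diagnosis.

Details Motivation: Conversational AI shows promise for medical diagnosis, but in practice patient safety must be assured and physicians must oversee diagnoses and treatment plans. The paper designs a framework in which the AI system operates efficiently under physician oversight.

Contribution: Proposes the g-AMIE multi-agent framework for asynchronous oversight of AI in medical diagnosis and shows experimentally that it outperforms NP/PA and PCP control groups working under the same guardrails.

Method: g-AMIE takes the patient history within guardrails, abstaining from individualized medical advice, and then conveys its assessment to the overseeing physician through a clinician cockpit interface. The physician provides oversight and retains accountability for the clinical decision.

Result: Across 60 virtual scenarios, g-AMIE outperformed both the NP/PA group and the PCP group in intake, case summaries, and proposed diagnoses and management plans. Overseeing g-AMIE was also more time-efficient than standalone PCP consultations in prior work.

Insight: Asynchronous oversight is a feasible paradigm for deploying diagnostic AI in real-world care, improving efficiency and quality while preserving patient safety.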

Abstract: Recent work has demonstrated the promise of conversational AI systems for diagnostic dialogue. However, real-world assurance of patient safety means that providing individual diagnoses and treatment plans is considered a regulated activity by licensed professionals. Furthermore, physicians commonly oversee other team members in such activities, including nurse practitioners (NPs) or physician assistants/associates (PAs). Inspired by this, we propose a framework for effective, asynchronous oversight of the Articulate Medical Intelligence Explorer (AMIE) AI system. We propose guardrailed-AMIE (g-AMIE), a multi-agent system that performs history taking within guardrails, abstaining from individualized medical advice. Afterwards, g-AMIE conveys assessments to an overseeing primary care physician (PCP) in a clinician cockpit interface. The PCP provides oversight and retains accountability of the clinical decision. This effectively decouples oversight from intake and can thus happen asynchronously. In a randomized, blinded virtual Objective Structured Clinical Examination (OSCE) of text consultations with asynchronous oversight, we compared g-AMIE to NPs/PAs or a group of PCPs under the same guardrails. Across 60 scenarios, g-AMIE outperformed both groups in performing high-quality intake, summarizing cases, and proposing diagnoses and management plans for the overseeing PCP to review. This resulted in higher quality composite decisions. PCP oversight of g-AMIE was also more time-efficient than standalone PCP consultations in prior work. While our study does not replicate existing clinical practices and likely underestimates clinicians’ capabilities, our results demonstrate the promise of asynchronous oversight as a feasible paradigm for diagnostic AI systems to operate under expert human oversight for enhancing real-world care.

[170] LAPO: Internalizing Reasoning Efficiency via Length-Adaptive Policy Optimization

Xingyu Wu,Yuchen Yan,Shangke Lyu,Linjuan Wu,Yiwen Qiu,Yongliang Shen,Weiming Lu,Jian Shao,Jun Xiao,Yueting Zhuang

Main category: cs.AI

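TL;DR: LAPO uses two-stage reinforcement learning to turn reasoning-length control into an intrinsic model capability, substantially reducing token usage while improving accuracy.

Details Motivation: Large reasoning models reach high performance through long chain-of-thought reasoning but are often inefficient because they generate excessive tokens. Reasoning-length control should move from an external constraint to an intrinsic capability.

Contribution: Proposes the LAPO framework, which internalizes reasoning-length control through two-stage reinforcement learning, reducing token usage by up to 40.9% while improving accuracy by 2.3%.

Method: Two-stage reinforcement learning: 1) learn the statistical distribution of successful solution lengths; 2) embed that distribution in the reasoning context as meta-cognitive guidance.

Result: On mathematical reasoning benchmarks, LAPO substantially reduces token usage, and the trained models learn to allocate computation according to problem complexity.

Insight: By internalizing reasoning-length control, models can reason efficiently without sacrificing quality.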

Abstract: Large reasoning models have achieved remarkable performance through extended chain-of-thought sequences, yet this computational freedom leads to excessive token generation even for simple problems. We present Length-Adaptive Policy Optimization (LAPO), a novel framework that transforms reasoning length control from an external constraint into an intrinsic model capability. Unlike existing approaches that impose rigid limits or rely on post-hoc interventions, LAPO enables models to internalize an understanding of appropriate reasoning depth through a two-stage reinforcement learning process. In the first stage, models learn natural reasoning patterns by discovering the statistical distribution of successful solution lengths. The second stage leverages these patterns as meta-cognitive guidance, embedding them directly within the model’s reasoning context to ensure inference-time flexibility. Experiments on mathematical reasoning benchmarks demonstrate that LAPO reduces token usage by up to 40.9% while improving accuracy by 2.3%. Our analysis reveals that models trained with LAPO develop emergent abilities to allocate computational resources based on problem complexity, achieving efficient reasoning without sacrificing quality.
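
The two stages can be pictured with a toy sketch: first summarize the lengths of rollouts that solved a problem, then place that summary in the prompt as guidance. The statistics chosen, the hint wording, and the function names are hypothetical; the paper's reward design and prompt format are not described in the abstract.

```python
import statistics

def successful_length_stats(rollouts):
    """Summarize token lengths of rollouts that reached a correct answer."""
    lengths = sorted(r["tokens"] for r in rollouts if r["correct"])
    if not lengths:
        return None
    return {"median": statistics.median(lengths), "min": lengths[0], "max": lengths[-1]}

def with_length_guidance(question, stats):
    """Append a length hint to the prompt (the hint wording is hypothetical)."""
    if stats is None:
        return question
    return f"{question}\n[Hint: a solution typically needs about {int(stats['median'])} tokens.]"

rollouts = [
    {"tokens": 120, "correct": True},
    {"tokens": 900, "correct": False},
    {"tokens": 200, "correct": True},
    {"tokens": 160, "correct": True},
]
stats = successful_length_stats(rollouts)
print(with_length_guidance("What is 17 * 23?", stats))
```

Only correct rollouts contribute to the statistics, so the long failed attempt (900 tokens) does not inflate the suggested length.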

[171] Hierarchical Budget Policy Optimization for Adaptive Reasoning

Shangke Lyu,Linjuan Wu,Yuchen Yan,Xingyu Wu,Hao Li,Yongliang Shen,Peisheng Jiang,Weiming Lu,Jun Xiao,Yueting Zhuang

Main category: cs.AI

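TL;DR: This paper proposes Hierarchical Budget Policy Optimization (HBPO), a reinforcement learning framework that uses hierarchical budget exploration and differentiated rewards so that models adapt their reasoning depth to problem complexity, substantially reducing token usage while improving performance.

Details Motivation: Large reasoning models perform very well with extensive chain-of-thought generation, but applying a uniform reasoning strategy regardless of problem complexity is computationally inefficient. Existing methods struggle to balance efficiency and performance.

Contribution: 1. Proposes the HBPO framework, which addresses exploration space collapse in efficiency-oriented training; 2. Introduces hierarchical budget exploration and differentiated reward mechanisms for efficient resource allocation; 3. Reduces average token usage by up to 60.6% while improving accuracy by 3.14% across four reasoning benchmarks.

Method: 1. A reinforcement learning framework, HBPO; 2. Hierarchical budget exploration, which partitions rollout samples into subgroups with distinct token budgets; 3. Differentiated reward mechanisms that provide budget-aware incentives aligned with problem complexity.

Result: Experiments show that HBPO reduces token usage while improving accuracy, and that models automatically adjust reasoning depth to problem complexity.

Insight: Reasoning efficiency and capability are not inherently in conflict; appropriately structured hierarchical training can optimize both at once, without external constraints or discrete mode selection.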

Abstract: Large reasoning models achieve remarkable performance through extensive chain-of-thought generation, yet exhibit significant computational inefficiency by applying uniform reasoning strategies regardless of problem complexity. We present Hierarchical Budget Policy Optimization (HBPO), a reinforcement learning framework that enables models to learn problem-specific reasoning depths without sacrificing capability. HBPO addresses the fundamental challenge of exploration space collapse in efficiency-oriented training, where penalties on long output length systematically bias models away from necessary long reasoning paths. Through hierarchical budget exploration, our approach partitions rollout samples into multiple subgroups with distinct token budgets, aiming to enable efficient resource allocation while preventing degradation of capability. We introduce differentiated reward mechanisms that create budget-aware incentives aligned with the complexity of the problem, allowing models to discover natural correspondences between task requirements and computational effort. Extensive experiments demonstrate that HBPO reduces average token usage by up to 60.6% while improving accuracy by 3.14% across four reasoning benchmarks. Unlike existing methods that impose external constraints or rely on discrete mode selection, HBPO exhibits emergent adaptive behavior where models automatically adjust reasoning depth based on problem complexity. Our results suggest that reasoning efficiency and capability are not inherently conflicting, and can be simultaneously optimized through appropriately structured hierarchical training that preserves exploration diversity.
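
A rough sketch of the two mechanisms named above: assigning rollouts to subgroups with different token budgets, and scoring each rollout relative to its own budget. The budget values and the linear penalty are stand-ins chosen for illustration; they are not the reward function defined in the paper.

```python
def assign_budgets(num_rollouts, budgets):
    """Split rollout indices into equally sized subgroups, one per token budget."""
    return [budgets[i * len(budgets) // num_rollouts] for i in range(num_rollouts)]

def budget_aware_reward(correct, tokens_used, budget, penalty=0.5):
    """Reward correctness, penalizing only the tokens spent beyond the budget.

    Assumption: a linear overshoot penalty, used here as a simple stand-in.
    """
    if not correct:
        return 0.0
    overshoot = max(0, tokens_used - budget) / budget
    return max(0.0, 1.0 - penalty * overshoot)

print(assign_budgets(8, [512, 1024, 2048, 2560]))
# [512, 512, 1024, 1024, 2048, 2048, 2560, 2560]
```

Because some subgroups always have a generous budget, long reasoning paths keep being sampled and rewarded, which is the intuition behind avoiding exploration space collapse.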

[172] InsightX Agent: An LMM-based Agentic Framework with Integrated Tools for Reliable X-ray NDT Analysis

Jiale Liu,Huan Wang,Yue Zhang,Xiaoyu Luo,Jiaxiang Hu,Zhiliang Liu,Min Xie

Main category: cs.AI

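TL;DR: This paper proposes InsightX Agent, an agentic framework based on a Large Multimodal Model (LMM) that integrates tools to deliver reliable, interpretable, and interactive X-ray non-destructive testing (NDT) analysis.

Details Motivation: Existing deep learning methods for industrial X-ray inspection lack interactivity, interpretability, and the capacity for self-assessment, which limits their reliability and operator trust. InsightX Agent aims to address these shortcomings.

Contribution: 1) Proposes a new LMM-based agentic framework that integrates tools to enable active reasoning; 2) develops the SDMSD detector and the EGR tool to improve detection performance and interpretability.

Method: The LMM acts as the central orchestrator, coordinating the Sparse Deformable Multi-Scale Detector (SDMSD) and the Evidence-Grounded Reflection (EGR) tool. Multi-scale detection and a chain-of-thought-inspired review process improve the quality of the analysis.

Result: On the GDXray+ dataset, the framework achieves an F1-score of 96.35% and markedly improves the interpretability and trustworthiness of its results.

Insight: Agentic frameworks that combine active reasoning with tool integration can advance the reliability and interpretability of industrial inspection.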

Abstract: Non-destructive testing (NDT), particularly X-ray inspection, is vital for industrial quality assurance, yet existing deep-learning-based approaches often lack interactivity, interpretability, and the capacity for critical self-assessment, limiting their reliability and operator trust. To address these shortcomings, this paper proposes InsightX Agent, a novel LMM-based agentic framework designed to deliver reliable, interpretable, and interactive X-ray NDT analysis. Unlike typical sequential pipelines, InsightX Agent positions a Large Multimodal Model (LMM) as a central orchestrator, coordinating between the Sparse Deformable Multi-Scale Detector (SDMSD) and the Evidence-Grounded Reflection (EGR) tool. The SDMSD generates dense defect region proposals for multi-scale feature maps and sparsifies them through Non-Maximum Suppression (NMS), optimizing detection of small, dense targets in X-ray images while maintaining computational efficiency. The EGR tool guides the LMM agent through a chain-of-thought-inspired review process, incorporating context assessment, individual defect analysis, false positive elimination, confidence recalibration and quality assurance to validate and refine the SDMSD’s initial proposals. By strategically employing and intelligently using tools, InsightX Agent moves beyond passive data processing to active reasoning, enhancing diagnostic reliability and providing interpretations that integrate diverse information sources. Experimental evaluations on the GDXray+ dataset demonstrate that InsightX Agent not only achieves a high object detection F1-score of 96.35% but also offers significantly improved interpretability and trustworthiness in its analyses, highlighting the transformative potential of agentic LLM frameworks for industrial inspection tasks.
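
Non-Maximum Suppression, which the abstract says SDMSD uses to sparsify dense proposals, is a standard algorithm. The sketch below is the generic greedy version with made-up boxes and an assumed IoU threshold of 0.5, not the detector's own implementation.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: visit boxes by descending score, keep those not overlapping a kept box."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep

# Hypothetical proposals: two near-duplicates and one separate defect.
boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]
```

The second box overlaps the first with an IoU of about 0.82, so it is suppressed, while the distant third box survives.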

[173] Chart-R1: Chain-of-Thought Supervision and Reinforcement for Advanced Chart Reasoner

Lei Chen,Xuanle Zhao,Zhixiong Zeng,Jing Huang,Yufeng Zhong,Lin Ma

Main category: cs.AI

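TL;DR: Chart-R1 is a vision-language model fine-tuned with reinforcement learning for complex chart reasoning. It achieves strong performance through chain-of-thought supervision and numerically sensitive reinforcement fine-tuning.

Details Motivation: Existing R1-Style methods focus mainly on mathematical reasoning and code intelligence and have not been validated on more general multimodal data. Charts are an important multimodal data type, and their complex reasoning demands make them a natural research target.

Contribution: 1. Proposes a programmatic data synthesis technique that generates high-quality step-by-step chart reasoning data; 2. Designs a two-stage training strategy (Chart-COT and Chart-RFT) to improve chart reasoning; 3. Demonstrates Chart-R1's advantages on open-source benchmarks and a self-built dataset.

Method: 1. Generate step-by-step chart reasoning data covering single- and multi-subchart cases; 2. Chart-COT: decompose reasoning tasks through chain-of-thought supervision; 3. Chart-RFT: apply numerically sensitive reinforcement fine-tuning.

Result: Chart-R1 significantly outperforms other chart-domain methods on chart reasoning tasks and is comparable to large-scale models such as GPT-4o and Claude-3.5.

Insight: Chain-of-thought supervision and numerically sensitive reinforcement learning are key strategies for improving multimodal chart reasoning.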

Abstract: Recently, inspired by OpenAI-o1/o3 and Deepseek-R1, the R1-Style method based on reinforcement learning fine-tuning has received widespread attention from the community. Previous R1-Style methods mainly focus on mathematical reasoning and code intelligence. It is of great research significance to verify their advantages on more general multimodal data. Chart is an important multimodal data type with rich information, which brings important research challenges in complex reasoning. In this work, we introduce Chart-R1, a chart-domain vision-language model with reinforcement learning fine-tuning to enable complex chart reasoning. To support Chart-R1, we first propose a novel programmatic data synthesis technology to generate high-quality step-by-step chart reasoning data covering single- and multi-subcharts, which makes up for the lack of reasoning data in the chart domain. Then we develop a two-stage training strategy: Chart-COT with step-by-step chain-of-thought supervision, and Chart-RFT with numerically sensitive reinforcement fine-tuning. Chart-COT aims to decompose complex chart reasoning tasks into fine-grained, understandable subtasks through step-by-step supervision, which lays a good foundation for improving the reasoning level of reinforcement learning. Chart-RFT utilizes the typical group relative policy optimization strategy, in which a relatively soft reward is adopted for numerical response to emphasize the numerical sensitivity in the chart domain. We conduct extensive experiments on open-source benchmarks and a self-built chart reasoning dataset (i.e., ChartRQA). Experimental results show that Chart-R1 has significant advantages compared to chart-domain methods, even comparable to open/closed source large-scale models (e.g., GPT-4o, Claude-3.5).
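
One way to picture a "relatively soft reward" for numerical answers is a score that falls off with relative error instead of requiring an exact match. The linear decay, the 5% tolerance, and the function name below are assumptions made for illustration; the abstract does not give the actual reward formula.

```python
def soft_numeric_reward(predicted, target, tolerance=0.05):
    """Reward that decays linearly with relative error and reaches 0 at `tolerance`."""
    try:
        pred, gold = float(predicted), float(target)
    except (TypeError, ValueError):
        # Non-numeric answers fall back to exact string matching.
        return 1.0 if str(predicted).strip() == str(target).strip() else 0.0
    if gold == 0:
        return 1.0 if pred == 0 else 0.0
    relative_error = abs(pred - gold) / abs(gold)
    return max(0.0, 1.0 - relative_error / tolerance)

print(soft_numeric_reward("100", "100"))  # 1.0
print(soft_numeric_reward("110", "100"))  # 0.0
```

A value read slightly imprecisely from a chart (for example 102 against a target of 100) still earns partial credit under this scheme, whereas a hard exact-match reward would give it nothing.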

cs.PL [Back]

[174] Hear Your Code Fail, Voice-Assisted Debugging for Python

Sayed Mahbub Hasan Amiri,Md. Mainul Islam,Mohammad Shakhawat Hossen,Sayed Majhab Hasan Amiri,Mohammad Shawkat Ali Mamun,Sk. Humaun Kabir,Naznin Akter

Main category: cs.PL

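TL;DR: This paper presents a voice-assisted debugging plugin for Python that converts runtime errors into audible diagnostics, reducing cognitive load and speeding up debugging.

Details Motivation: Traditional debugging tools such as stack traces depend heavily on vision and are cognitively demanding. The paper uses multimodal (auditory and visual) feedback to lower cognitive load, speed up debugging, and better support visually impaired developers and multitasking workflows.

Contribution: 1. Develops a voice-assisted debugging plugin for Python 3.7+ environments; 2. Reduces cognitive load by 37% through auditory and visual feedback; 3. Achieves 78% faster error identification; 4. Reports 45% faster debugging skill acquisition among novice programmers in pilot studies.

Method: A global exception hook architecture is combined with pyttsx3 text-to-speech conversion and a Tkinter-based GUI to provide multimodal error feedback. The system also displays interactive tracebacks with documentation deep links.

Result: The evaluation reports significantly reduced cognitive load (p<0.01) and faster debugging, with compatibility across Windows, macOS, and Linux and low resource usage (CPU overhead under 18%).

Insight: Auditory feedback offers a new paradigm for debugging, especially for education and accessibility. Planned future work includes GPT-based repair suggestions and real-time multilingual translation.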

Abstract: This research introduces an innovative voice-assisted debugging plugin for Python that transforms silent runtime errors into actionable audible diagnostics. By implementing a global exception hook architecture with pyttsx3 text-to-speech conversion and Tkinter-based GUI visualization, the solution delivers multimodal error feedback through parallel auditory and visual channels. Empirical evaluation demonstrates 37% reduced cognitive load (p<0.01, n=50) compared to traditional stack-trace debugging, while enabling 78% faster error identification through vocalized exception classification and contextualization. The system achieves sub-1.2 second voice latency with under 18% CPU overhead during exception handling, vocalizing error types and consequences while displaying interactive tracebacks with documentation deep links. Compatibility has been validated across Python 3.7+ environments on Windows, macOS, and Linux platforms. Needing only two lines of integration code, the plugin significantly boosts accessibility for visually impaired developers and supports multitasking workflows through hands-free error diagnosis. Educational applications show particular promise, with pilot studies indicating 45% faster debugging skill acquisition among novice programmers. Future development will incorporate GPT-based repair suggestions and real-time multilingual translation to further advance auditory debugging paradigms. The solution represents a fundamental shift toward human-centric error diagnostics, bridging critical gaps in programming accessibility while establishing new standards for cognitive efficiency in software development workflows.
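
The global exception hook idea can be sketched with the standard `sys.excepthook` mechanism and pyttsx3's basic `init`, `say`, and `runAndWait` calls. The helper names `make_spoken_hook`, `pyttsx3_speak`, and `install` are hypothetical and are not the plugin's API; the Tkinter GUI part is omitted.

```python
import sys
import traceback

def make_spoken_hook(speak):
    """Build an excepthook that prints the traceback and vocalizes a summary."""
    def hook(exc_type, exc_value, exc_tb):
        traceback.print_exception(exc_type, exc_value, exc_tb)
        speak(f"{exc_type.__name__}: {exc_value}")
    return hook

def pyttsx3_speak(text):
    """Speak text with pyttsx3; fall back to stderr if it is not installed."""
    try:
        import pyttsx3
    except ImportError:
        print(f"[voice] {text}", file=sys.stderr)
        return
    engine = pyttsx3.init()
    engine.say(text)
    engine.runAndWait()

def install():
    """Route all uncaught exceptions through the spoken hook."""
    sys.excepthook = make_spoken_hook(pyttsx3_speak)
```

Calling `install()` once at program start is enough for every uncaught exception to be both printed and read aloud. Passing the speech function in as a parameter keeps the hook testable without audio hardware.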

cs.LG [Back]

[175] It’s Not That Simple. An Analysis of Simple Test-Time Scaling

Guojun Wu

Main category: cs.LG

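TL;DR: The paper analyzes simple test-time scaling and finds that the scaling behavior comes mainly from scaling down by enforcing a maximum length, whereas scaling up by appending “Wait” leads to inconsistencies. The key distinction is between the natural scaling ability of o1-like models and the restrictive nature of simple test-time scaling.

Details Motivation: The work examines how effective simple test-time scaling really is and how it differs from the natural scaling ability of o1-like models, exposing the limitations of the simple method.

Contribution: The main contribution is a clear account of the limitations of simple test-time scaling, in particular the contrast between scaling down by enforcing a maximum length and natural scaling ability.

Method: The study analyzes simple test-time scaling, focusing on the effects of scaling down and scaling up, and compares them with the natural scaling behavior of o1-like models.

Result: Scaling down accounts for the observed scaling behavior, but scaling up leads to inconsistencies and cannot improve performance naturally the way o1-like models do.

Insight: Simple test-time scaling cannot fully replicate the natural scaling ability of o1-like models; the goal of test-time scaling should be to unlock higher performance, not merely to imitate scaling behavior.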

Abstract: Prior work proposed simple test-time scaling, a method for replicating this scaling behavior with models distilled from o1-like models by manually controlling test-time compute: either scaling down by enforcing a maximum length or scaling up by iteratively appending “Wait” when the model is about to terminate its generation. This paper presents an analysis of simple test-time scaling and finds that the scaling behavior is largely attributed to scaling down by enforcing a maximum length. In contrast, fine-tuning on long CoT data distilled from o1-like models has no significant impact on scaling behavior, and scaling up by appending “Wait” leads to inconsistencies, as the model may oscillate between solutions. A key distinction exists between scaling down by enforcing a maximum length and scaling up test-time compute in o1-like models, such as DeepSeek-R1. These models are typically allowed to utilize as much compute as needed, with the only constraint being the model’s maximum supported length. By learning to naturally scale up test-time compute during reinforcement learning, o1-like models surpass their peak performance when scaling up. In contrast, simple test-time scaling progressively imposes a lower upper limit on model performance as it scales down. While replicating the test-time scaling behavior of o1 models can be straightforward by scaling down, it is crucial to recognize that the goal of scaling test-time compute is to unlock higher performance – beyond what the model could originally achieve – rather than merely reproducing the appearance of scaling behavior.
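The two manual controls described above can be sketched as a decoding loop. This is a toy illustration under stated assumptions: `step` is a hypothetical callable that returns the next token for the current sequence, and `max_tokens`, `min_tokens`, and the `<eos>` marker are names chosen for this sketch rather than taken from the paper.

```python
def simple_test_time_scaling(step, prompt, max_tokens, min_tokens=0,
                             wait="Wait", eos="<eos>"):
    """Decode with a hard length cap and optional forced continuation."""
    tokens = list(prompt)
    generated = 0
    # Scaling down: the loop never emits more than max_tokens tokens.
    while generated < max_tokens:
        tok = step(tokens)
        if tok == eos:
            if generated >= min_tokens:
                break
            # Scaling up: suppress termination and append "Wait" instead.
            tok = wait
        tokens.append(tok)
        generated += 1
    return tokens[len(prompt):]
```

With a toy model that stops after three tokens, `max_tokens=2` truncates the output, while `min_tokens=5` pads it with repeated "Wait" tokens; nothing in the second control guarantees that the extra tokens improve the answer.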

[176] GCC-Spam: Spam Detection via GAN, Contrastive Learning, and Character Similarity Networks

Zixin Xu,Zhijie Wang,Zhiyuan Pan

Main category: cs.LG

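TL;DR: GCC-Spam is a new spam-text detection framework that combines a GAN, contrastive learning, and a character similarity network to address adversarial attacks and data scarcity; experiments show that it outperforms baseline methods.

Details Motivation: The rapid growth of spam text on the Internet increases the risk of information leakage and social instability, and existing methods struggle with adversarial attacks and the scarcity of labeled data.

Contribution: 1. A character similarity network captures orthographic and phonetic features to counter obfuscation attacks; 2. Contrastive learning optimizes the latent-space distance between spam and normal texts; 3. A GAN generates pseudo-samples to alleviate data scarcity and improve model robustness.

Method: Combines the character similarity network, contrastive learning, and the GAN so that the modules work together to improve the accuracy and robustness of spam-text detection.

Result: On real-world datasets, GCC-Spam outperforms baseline methods in detection rate and data efficiency.

Insight: Multimodal feature fusion and semi-supervised learning strategies show promise for spam-text detection in adversarial settings.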

Abstract: The exponential growth of spam text on the Internet necessitates robust detection mechanisms to mitigate risks such as information leakage and social instability. This work addresses two principal challenges: adversarial strategies employed by spammers and the scarcity of labeled data. We propose a novel spam-text detection framework GCC-Spam, which integrates three core innovations. First, a character similarity network captures orthographic and phonetic features to counter character-obfuscation attacks and furthermore produces sentence embeddings for downstream classification. Second, contrastive learning enhances discriminability by optimizing the latent-space distance between spam and normal texts. Third, a Generative Adversarial Network (GAN) generates realistic pseudo-spam samples to alleviate data scarcity while improving model robustness and classification accuracy. Extensive experiments on real-world datasets demonstrate that our model outperforms baseline approaches, achieving higher detection rates with significantly fewer labeled examples.

[177] The Invisible Leash: Why RLVR May Not Escape Its Origin

Fang Wu,Weihao Xuan,Ximing Lu,Zaid Harchaoui,Yejin Choi

Main category: cs.LG

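TL;DR: The paper examines potential limitations of Reinforcement Learning with Verifiable Rewards (RLVR), arguing that it may merely amplify high-reward outputs the base model already knows rather than truly expand the reasoning boundary.

Details Motivation: RLVR is regarded as an effective way to improve AI capabilities on complex logical tasks, but it remains unclear whether it truly expands a model's reasoning ability. The paper aims to reveal RLVR's potential limits through theoretical and empirical study.

Contribution: 1. Shows that RLVR is constrained by the base model's support and cannot produce solutions with zero initial probability; 2. Identifies an entropy-reward tradeoff: RLVR improves precision but may suppress exploration; 3. Verifies experimentally that RLVR, while raising pass rates, may overlook correct but underrepresented solutions.

Method: Combines theoretical analysis with empirical experiments to study RLVR's support constraint and its effect on exploration, verifying the limitations by quantifying the entropy-reward tradeoff.

Result: Experiments show that although RLVR improves pass@1, the shrinkage of support generally outweighs its expansion, and RLVR may miss correct answers that the base model could originally find.

Insight: RLVR may not truly extend reasoning ability; breaking its limits may require explicit exploration mechanisms or hybrid strategies.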

Abstract: Recent advances in large reasoning models highlight Reinforcement Learning with Verifiable Rewards (RLVR) as a promising method for enhancing AI’s capabilities, particularly in solving complex logical tasks. However, it remains unclear whether RLVR truly expands a model’s reasoning boundary or merely amplifies high-reward outputs that the base model already knows for improved precision. This study presents a theoretical and empirical investigation that provides fresh insights into the potential limits of RLVR. First, we offer a new theoretical perspective that RLVR is constrained by the base model’s support (it is unable to sample solutions with zero initial probability) and operates as a conservative reweighting mechanism that may restrict the discovery of entirely original solutions. We also identify an entropy-reward tradeoff: while RLVR reliably enhances precision, it may progressively narrow exploration and potentially overlook correct yet underrepresented solutions. Extensive empirical experiments validate that while RLVR consistently improves pass@1, the shrinkage of empirical support generally outweighs the expansion of empirical support under larger sampling budgets, failing to recover correct answers that were previously accessible to the base model. Interestingly, we also observe that while RLVR sometimes increases token-level entropy, resulting in greater uncertainty at each generation step, answer-level entropy declines, indicating that these seemingly more uncertain paths ultimately converge onto a smaller set of distinct answers. Taken together, these findings reveal potential limits of RLVR in extending reasoning horizons. Breaking this invisible leash may require future algorithmic innovations such as explicit exploration mechanisms or hybrid strategies that seed probability mass into underrepresented solution regions.
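The support constraint can be illustrated numerically. The sketch below assumes an idealized update in which the tuned policy is the base distribution reweighted by exp(reward / beta), which is the closed-form optimum of KL-regularized reward maximization; it is not the paper's training procedure, and the four-answer distribution is invented for illustration.

```python
import math


def tilt(base, reward, beta=1.0):
    """Reweight a base distribution by exp(reward / beta) and renormalize."""
    weights = {y: p * math.exp(reward[y] / beta) for y, p in base.items()}
    z = sum(weights.values())
    return {y: w / z for y, w in weights.items()}


def entropy(dist):
    """Shannon entropy in nats, skipping zero-probability outcomes."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)


# Hypothetical answer distribution: D is correct but has zero base probability.
base = {"A": 0.70, "B": 0.05, "C": 0.25, "D": 0.0}
reward = {"A": 1.0, "B": 1.0, "C": 0.0, "D": 1.0}
tuned = tilt(base, reward, beta=0.25)
```

Because the update multiplies the base probability, answer D stays at exactly zero no matter how large its reward is, while mass concentrates on answers the base model already produced and the answer-level entropy drops.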

[178] Long-Short Distance Graph Neural Networks and Improved Curriculum Learning for Emotion Recognition in Conversation

Xinran Li,Xiujuan Xu,Jiaqi Qiao

Main category: cs.LG

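TL;DR: The paper proposes a novel multimodal approach to emotion recognition in conversation: the Long-Short Distance Graph Neural Network (LSDGNN) together with Improved Curriculum Learning (ICL). Long-distance and short-distance graph neural networks extract multimodal features, and a Differential Regularizer and a BiAffine Module strengthen feature interaction. The improved curriculum learning method addresses data imbalance. Experimental results surpass existing benchmarks.

Details Motivation: Emotion Recognition in Conversation (ERC) is a challenging task, and existing methods struggle to fully capture the multimodal features of distant and nearby utterances and their interaction. Data imbalance also hurts model performance.

Contribution: 1. Proposes LSDGNN, which separately extracts multimodal features of distant and nearby utterances and strengthens feature interaction with a Differential Regularizer and a BiAffine Module. 2. Proposes ICL, which uses a “weighted emotional shift” metric and a difficulty measurer to learn easy samples first, addressing data imbalance.

Method: 1. LSDGNN builds long-distance and short-distance graph neural networks on a Directed Acyclic Graph (DAG). 2. A Differential Regularizer and a BiAffine Module optimize feature extraction and interaction. 3. ICL orders training dynamically from easy to hard samples, using emotion-similarity computation and a difficulty measurer.

Result: Experiments on the IEMOCAP and MELD datasets show that the proposed model outperforms existing benchmark methods.

Insight: 1. Combining long- and short-distance graph neural networks better captures emotional dynamics in conversation. 2. The improved curriculum learning, by adjusting the training order dynamically, effectively mitigates data imbalance.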

Abstract: Emotion Recognition in Conversation (ERC) is a practical and challenging task. This paper proposes a novel multimodal approach, the Long-Short Distance Graph Neural Network (LSDGNN). Based on the Directed Acyclic Graph (DAG), it constructs a long-distance graph neural network and a short-distance graph neural network to obtain multimodal features of distant and nearby utterances, respectively. To ensure that long- and short-distance features are as distinct as possible in representation while enabling mutual influence between the two modules, we employ a Differential Regularizer and incorporate a BiAffine Module to facilitate feature interaction. In addition, we propose an Improved Curriculum Learning (ICL) to address the challenge of data imbalance. By computing the similarity between different emotions to emphasize the shifts in similar emotions, we design a “weighted emotional shift” metric and develop a difficulty measurer, enabling a training process that prioritizes learning easy samples before harder ones. Experimental results on the IEMOCAP and MELD datasets demonstrate that our model outperforms existing benchmarks.

[179] Off-Policy Corrected Reward Modeling for Reinforcement Learning from Human Feedback

Johannes Ackermann,Takashi Ishida,Masashi Sugiyama

Main category: cs.LG

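TL;DR: The paper proposes Off-Policy Corrected Reward Modeling (OCRM), which corrects the reward model (RM) through importance weighting to address overoptimization caused by distribution shift in RLHF, yielding a better final policy.

Details Motivation: During RLHF training, the responses generated by the language model drift away from the distribution the RM was trained on, making the RM inaccurate and causing overoptimization. The paper aims to resolve the inconsistency introduced by this distribution shift.

Contribution: Proposes OCRM, which iteratively corrects the RM through importance weighting, improving RM accuracy without additional labeled data and thus improving the final policy.

Method: Applies importance weighting to correct the RM off-policy, iteratively updating the RM parameters to mitigate the effect of distribution shift.

Result: Experiments on summarization and chatbot tasks show that OCRM significantly outperforms standard RLHF methods.

Insight: Distribution shift is a core cause of overoptimization in RLHF, and off-policy correction can effectively improve the robustness of the reward model.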

Abstract: Reinforcement Learning from Human Feedback (RLHF) allows us to train models, such as language models (LMs), to follow complex human preferences. In RLHF for LMs, we first train an LM using supervised fine-tuning, sample pairs of responses, obtain human feedback, and use the resulting data to train a reward model (RM). RL methods are then used to train the LM to maximize the reward given by the RM. As training progresses, the responses generated by the LM no longer resemble the responses seen by the RM during training, leading to the RM becoming inaccurate. The score given by the RM keeps increasing, but the learned behavior no longer matches the human preferences. This issue is known as overoptimization. We investigate overoptimization from the point of view of distribution shift and show that the shift results in an inconsistent estimate of the RM parameters, leading to an inconsistent estimate of the policy gradient. We propose Off-Policy Corrected Reward Modeling (OCRM), which iteratively off-policy corrects the RM using importance weighting, without requiring new labels or samples. This results in a more accurate RM, which empirically leads to an improved final policy. We validate our approach in experiments with summarization and chatbot datasets and show that it performs significantly better than standard RLHF methods and baselines. Our implementation is available at https://github.com/JohannesAck/OffPolicyCorrectedRewardModeling
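To make the importance-weighting idea concrete, here is a hedged sketch of a reward-model loss in which each preference pair is weighted by the ratio of current-policy to data-collection-policy probability. The pairwise logistic (Bradley-Terry) loss and the self-normalized weighting are assumptions chosen for illustration; the abstract does not specify the exact estimator, and the function names are hypothetical.

```python
import math


def importance_weight(logp_current, logp_behavior):
    """w = pi_current(y|x) / pi_behavior(y|x), from sequence log-probabilities."""
    return math.exp(logp_current - logp_behavior)


def weighted_bt_loss(pairs, weights):
    """Self-normalized importance-weighted pairwise logistic loss.

    pairs: list of (score_chosen, score_rejected) reward-model outputs.
    weights: one importance weight per pair.
    """
    total = sum(weights)
    loss = 0.0
    for (r_chosen, r_rejected), w in zip(pairs, weights):
        # -log sigmoid(r_chosen - r_rejected) = log(1 + exp(-(difference)))
        loss += w * math.log1p(math.exp(-(r_chosen - r_rejected)))
    return loss / total
```

Pairs whose responses the current policy is more likely to generate receive larger weights, so the reward model is fitted more closely where the policy now operates, using only the labels already collected.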

[180] Data Mixing Agent: Learning to Re-weight Domains for Continual Pre-training

Kailai Yang,Xiao Liu,Lei Ji,Hao Li,Yeyun Gong,Peng Cheng,Mao Yang

Main category: cs.LG

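TL;DR: The paper proposes Data Mixing Agent, a model-based, end-to-end framework that uses reinforcement learning to re-weight source- and target-domain data automatically for balanced model performance. Experiments demonstrate its effectiveness on math reasoning and code generation tasks.

Details Motivation: Continual pre-training can improve large language models on specific tasks but risks catastrophic forgetting. Traditional approaches rely on manually designed data mixing strategies, which lack generality and automation.

Contribution: Proposes the first model-based framework for adjusting data mixing weights, which learns generalizable heuristics automatically through reinforcement learning and significantly improves balanced performance.

Method: Trains Data Mixing Agent with reinforcement learning, learning generalizable heuristics from large numbers of data mixing trajectories and their evaluation feedback.

Result: Outperforms baselines on math reasoning and generalizes well to unseen source domains, target models, and domain spaces. The code generation task further confirms its adaptability.

Insight: The automatically learned heuristics agree with human intuition and achieve better performance with less source-domain data, showing potential for cross-domain tasks.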

Abstract: Continual pre-training on small-scale task-specific data is an effective method for improving large language models in new target fields, yet it risks catastrophic forgetting of their original capabilities. A common solution is to re-weight training data mixtures from source and target fields on a domain space to achieve balanced performance. Previous domain reweighting strategies rely on manual designation with certain heuristics based on human intuition or empirical results. In this work, we prove that more general heuristics can be parameterized by proposing Data Mixing Agent, the first model-based, end-to-end framework that learns to re-weight domains. The agent learns generalizable heuristics through reinforcement learning on large quantities of data mixing trajectories with corresponding feedback from an evaluation environment. Experiments in continual pre-training on math reasoning show that Data Mixing Agent outperforms strong baselines in achieving balanced performance across source and target field benchmarks. Furthermore, it generalizes well across unseen source fields, target models, and domain spaces without retraining. Direct application to the code generation field also indicates its adaptability across target domains. Further analysis showcases the agents’ well-aligned heuristics with human intuitions and their efficiency in achieving superior model performance with less source-field data.

[181] Small LLMs Do Not Learn a Generalizable Theory of Mind via Reinforcement Learning

Sneheel Sarangi,Hanan Salam

Main category: cs.LG

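TL;DR: Small LLMs do not acquire a generalizable Theory of Mind through reinforcement learning; their gains are confined to statistical patterns in the training data and do not transfer to different tasks.

Details Motivation: Investigates whether small LLMs can acquire a generalizable and robust Theory of Mind (ToM) capability through reinforcement learning (RL).

Contribution: Shows that small LLMs do not acquire a generic ToM capability through RL; the gains amount to narrow overfitting to statistical patterns in the training data.

Method: Trains small LLMs using RL with verifiable rewards (RLVR), then evaluates them systematically and tests generalization across multiple ToM datasets.

Result: Performance improves on the training data but is unchanged or degraded on unseen ToM tasks, indicating that the models did not learn a true ToM capability.

Insight: Small LLMs may be unable to achieve complex social intelligence through RL; more advanced methods or larger models may be needed.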

Abstract: Recent advancements in large language models (LLMs) have demonstrated emergent capabilities in complex reasoning, largely spurred by rule-based Reinforcement Learning (RL) techniques applied during post-training. This has raised the question of whether similar methods can instill more nuanced, human-like social intelligence, such as a Theory of Mind (ToM), in LLMs. This paper investigates whether small-scale LLMs can acquire a robust and generalizable ToM capability through RL with verifiable rewards (RLVR). We conduct a systematic evaluation by training models on various combinations of prominent ToM datasets (HiToM, ExploreToM, FANToM) and testing for generalization on held-out datasets (e.g., OpenToM). Our findings indicate that small LLMs struggle to develop a generic ToM capability. While performance on in-distribution tasks improves, this capability fails to transfer to unseen ToM tasks with different characteristics. Furthermore, we demonstrate that prolonged RL training leads to models “hacking” the statistical patterns of the training datasets, resulting in significant performance gains on in-domain data but no change, or even degradation, in performance on out-of-distribution tasks. This suggests the learned behavior is a form of narrow overfitting rather than the acquisition of a true, abstract ToM capability.

[182] GUI-G$^2$: Gaussian Reward Modeling for GUI Grounding

Fei Tang,Zhangxuan Gu,Zhengxi Lu,Xuyang Liu,Shuheng Shen,Changhua Meng,Wen Wang,Wenqi Zhang,Yongliang Shen,Weiming Lu,Jun Xiao,Yueting Zhuang

Main category: cs.LG

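TL;DR: The paper proposes GUI-G$^2$, which improves graphical user interface (GUI) grounding through Gaussian reward modeling, turning sparse binary rewards into a continuous optimization problem and substantially improving performance.

Details Motivation: Existing reinforcement learning methods use binary rewards that ignore the continuous nature of GUI interaction, producing sparse and inefficient signals. Inspired by human clicking behavior, the authors model GUI elements as Gaussian distributions to represent spatial interaction more naturally.

Contribution: 1. Proposes the GUI-G$^2$ framework, which models the spatial distribution of GUI elements with Gaussian rewards; 2. Designs two mechanisms, Gaussian point rewards and coverage rewards; 3. Introduces an adaptive variance mechanism to handle elements of different sizes.

Method: 1. Models GUI elements as Gaussian distributions centered on element centroids; 2. Optimizes grounding with point rewards and coverage rewards; 3. Uses the adaptive variance mechanism to adjust the spread of each distribution according to element size.

Result: Across several benchmarks, GUI-G$^2$ outperforms the state-of-the-art method UI-TARS-72B, with the largest improvement of 24.7% on ScreenSpot-Pro, and is more robust to interface variations and unseen layouts.

Insight: Continuous Gaussian modeling provides richer gradient signals, substantially improving GUI grounding and offering a new paradigm for spatial reasoning tasks.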

Abstract: Graphical User Interface (GUI) grounding maps natural language instructions to precise interface locations for autonomous interaction. Current reinforcement learning approaches use binary rewards that treat elements as hit-or-miss targets, creating sparse signals that ignore the continuous nature of spatial interactions. Motivated by human clicking behavior that naturally forms Gaussian distributions centered on target elements, we introduce GUI Gaussian Grounding Rewards (GUI-G$^2$), a principled reward framework that models GUI elements as continuous Gaussian distributions across the interface plane. GUI-G$^2$ incorporates two synergistic mechanisms: Gaussian point rewards model precise localization through exponentially decaying distributions centered on element centroids, while coverage rewards assess spatial alignment by measuring the overlap between predicted Gaussian distributions and target regions. To handle diverse element scales, we develop an adaptive variance mechanism that calibrates reward distributions based on element dimensions. This framework transforms GUI grounding from sparse binary classification to dense continuous optimization, where Gaussian distributions generate rich gradient signals that guide models toward optimal interaction positions. Extensive experiments across ScreenSpot, ScreenSpot-v2, and ScreenSpot-Pro benchmarks demonstrate that GUI-G$^2$ substantially outperforms the state-of-the-art method UI-TARS-72B, with the most significant improvement of 24.7% on ScreenSpot-Pro. Our analysis reveals that continuous modeling provides superior robustness to interface variations and enhanced generalization to unseen layouts, establishing a new paradigm for spatial reasoning in GUI interaction tasks.
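A Gaussian point reward with size-dependent variance might look like the following sketch. The exact reward formula is not given in the abstract, so the axis-aligned Gaussian, the scaling factor `alpha`, and the box format `(x1, y1, x2, y2)` are assumptions made for illustration; the coverage reward is omitted.

```python
import math


def gaussian_point_reward(point, box, alpha=0.5):
    """Reward that decays smoothly with distance from the element centroid.

    Standard deviations scale with the element's width and height
    (a stand-in for the adaptive variance mechanism).
    """
    x, y = point
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    sx, sy = alpha * (x2 - x1), alpha * (y2 - y1)
    return math.exp(-0.5 * (((x - cx) / sx) ** 2 + ((y - cy) / sy) ** 2))


def binary_reward(point, box):
    """Hit-or-miss baseline: 1.0 inside the box, 0.0 outside."""
    x, y = point
    x1, y1, x2, y2 = box
    return 1.0 if x1 <= x <= x2 and y1 <= y <= y2 else 0.0
```

A click just outside the box earns zero under the binary reward but a small positive Gaussian reward, so the learner still receives a signal about which direction to move.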

[183] Generative Distribution Distillation

Jiequan Cui,Beier Zhu,Qingshan Xu,Xiaogang Xu,Pengguang Chen,Xiaojuan Qi,Bei Yu,Hanwang Zhang,Richang Hong

Main category: cs.LG

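TL;DR: The paper proposes a new knowledge distillation framework, Generative Distribution Distillation (GenDD), which casts knowledge distillation as a conditional generative problem. Split Tokenization and Distribution Contraction address the challenges of high-dimensional optimization and missing label supervision, yielding significant gains in both unsupervised and supervised settings.

Details Motivation: Knowledge distillation (KD) usually relies on matching the outputs of teacher and student models, but conventional methods are of limited effectiveness on high-dimensional data and in multi-task learning. The paper aims to reformulate KD in generative form to improve its efficiency and generalization.

Contribution: 1. Proposes the GenDD framework, casting KD as a conditional generative problem; 2. Introduces the Split Tokenization strategy for stable unsupervised KD; 3. Develops the Distribution Contraction technique to integrate label supervision, improving the classification ability of the generative model.

Method: 1. Split Tokenization: splits high-dimensional inputs into low-dimensional tokens to ease optimization; 2. Distribution Contraction: folds label supervision into the generative objective, and is shown theoretically to act as a gradient-level surrogate for multi-task learning.

Result: On the ImageNet validation set, unsupervised GenDD surpasses the KL baseline by 16.29%; with label supervision, ResNet-50 reaches 82.28% top-1 accuracy in 600 training epochs, a new state of the art.

Insight: Reformulating KD within a generative framework allows unsupervised and supervised signals to be combined more flexibly, offering a new approach to efficient KD optimization and multi-task learning.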

Abstract: In this paper, we formulate knowledge distillation (KD) as a conditional generative problem and propose the Generative Distribution Distillation (GenDD) framework. A naive GenDD baseline encounters two major challenges: the curse of high-dimensional optimization and the lack of semantic supervision from labels. To address these issues, we introduce a Split Tokenization strategy, achieving stable and effective unsupervised KD. Additionally, we develop the Distribution Contraction technique to integrate label supervision into the reconstruction objective. Our theoretical proof demonstrates that GenDD with Distribution Contraction serves as a gradient-level surrogate for multi-task learning, realizing efficient supervised training without explicit classification loss on multi-step sampling image representations. To evaluate the effectiveness of our method, we conduct experiments on balanced, imbalanced, and unlabeled data. Experimental results show that GenDD performs competitively in the unsupervised setting, significantly surpassing the KL baseline by 16.29% on the ImageNet validation set. With label supervision, our ResNet-50 achieves 82.28% top-1 accuracy on ImageNet in 600 epochs of training, establishing a new state-of-the-art.

[184] The Origin of Self-Attention: From Pairwise Affinity Matrices to Transformers

Giorgio Roffo

Main category: cs.LG

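TL;DR: The paper traces the origins of self-attention, viewing it as an instance of a more general computational principle, namely controlling information flow through pairwise affinity matrices, and points out its connection to Infinite Feature Selection (Inf-FS).

Details Motivation: Self-attention is central to modern deep learning architectures such as Transformers, but its conceptual roots reach into broader fields. Through the notion of an affinity matrix (A), the paper seeks to unify related methods in computer vision, natural language processing, and graph learning and to reveal their shared mathematical foundation.

Contribution: The main contribution is to situate self-attention within the affinity-matrix-based computational paradigm, identify its link to Inf-FS, and clarify how the two differ in defining and applying the affinity matrix.

Method: Compares Inf-FS, which computes feature relevance through multi-hop propagation, with the single-hop, dynamically computed affinity matrix of self-attention, and proposes a shared framework of reasoning over pairwise relationships.

Result: Self-attention can be seen as a special case of Inf-FS; the key difference lies in how the affinity matrix is defined (Inf-FS uses domain knowledge or learning, whereas self-attention computes it dynamically).

Insight: By linking self-attention to the broader paradigm of affinity-based computation, the paper unifies models and tasks across several machine learning fields and highlights their common mathematical foundation.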

Abstract: The self-attention mechanism, now central to deep learning architectures such as Transformers, is a modern instance of a more general computational principle: learning and using pairwise affinity matrices to control how information flows through a model. This paper traces the conceptual origins of self-attention across multiple domains, including computer vision, natural language processing, and graph learning, through their shared reliance on an affinity matrix, denoted as A. We highlight Infinite Feature Selection (Inf-FS) as a foundational approach that generalizes the idea of affinity-based weighting. Unlike the fixed dot-product structure used in Transformers, Inf-FS defines A either through domain knowledge or by learning, and computes feature relevance through multi-hop propagation over the affinity graph. From this perspective, self-attention can be seen as a special case of Inf-FS: it uses a single-hop affinity computation where A is dynamically built from token similarities. We argue that the underlying structure, reasoning over pairwise relationships, is preserved across both approaches, and the key differences lie in how the affinity matrix is defined and applied. By situating self-attention within the broader paradigm of affinity-based computation, we unify several strands of machine learning research and highlight a common mathematical foundation that underpins diverse models and tasks.
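The contrast between single-hop and multi-hop use of an affinity matrix can be sketched in plain Python. The attention part follows the standard scaled dot-product form; the multi-hop part is a truncated power series over a scaled affinity matrix, written here as an assumption about how Inf-FS-style relevance can be approximated rather than as the reference implementation. All function names are illustrative.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax_rows(m):
    """Apply a numerically stable softmax to each row."""
    out = []
    for row in m:
        mx = max(row)
        e = [math.exp(v - mx) for v in row]
        s = sum(e)
        out.append([v / s for v in e])
    return out


def self_attention(q, k, v):
    """Single hop: build A from token similarities, then mix values once."""
    d = len(q[0])
    scores = [[sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k] for qi in q]
    a = softmax_rows(scores)  # affinity matrix A, computed dynamically
    return a, matmul(a, v)


def multi_hop_relevance(a, alpha=0.1, hops=50):
    """Multi hop: row sums of (alpha*A) + (alpha*A)^2 + ... up to `hops` terms."""
    n = len(a)
    step = [[alpha * v for v in row] for row in a]
    power = step
    total = [[0.0] * n for _ in range(n)]
    for _ in range(hops):
        total = [[t + p for t, p in zip(tr, pr)] for tr, pr in zip(total, power)]
        power = matmul(power, step)
    return [sum(row) for row in total]
```

Both functions operate on the same object, a matrix of pairwise affinities; they differ only in how that matrix is obtained and in how many propagation steps are taken. The series converges only when `alpha` times the spectral radius of A is below 1.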

[185] CXR-TFT: Multi-Modal Temporal Fusion Transformer for Predicting Chest X-ray Trajectories

Mehak Arora,Ayman Ali,Kaiyuan Wu,Carolyn Davis,Takashi Shimazui,Mahmoud Alwakeel,Victor Moas,Philip Yang,Annette Esper,Rishikesan Kamaleswaran

Main category: cs.LG

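TL;DR: CXR-TFT is a multimodal temporal fusion framework that integrates sparse chest X-ray (CXR) imaging and radiology reports with high-frequency clinical data to predict CXR trajectories in critically ill patients. Its Transformer-based model performs well at forecasting abnormal findings.

Details Motivation: CXRs of ICU patients are acquired irregularly, and existing tools support only cross-sectional analysis and cannot capture temporal dynamics. A method is needed that integrates multimodal data and predicts CXR trajectories to support early intervention.

Contribution: Proposes CXR-TFT, the first framework to model multimodal data (CXR images, reports, and clinical measurements) with a temporal fusion Transformer, forecasting CXR abnormalities up to 12 hours in advance.

Method: 1. Extracts latent CXR embeddings with a vision encoder and aligns them in time with clinical data through interpolation; 2. Trains a Transformer to predict future CXR embeddings from previous embeddings and clinical measurements.

Result: In a retrospective study of 20,000 ICU patients, CXR-TFT forecast abnormal CXR findings with high accuracy up to 12 hours ahead, which is of clinical value for conditions such as acute respiratory distress syndrome.

Insight: Multimodal modeling that integrates temporally sparse imaging with high-frequency clinical data is key; CXR-TFT's predictive ability could improve the management of time-sensitive conditions and support a ‘whole patient’ approach to care.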

Abstract: In intensive care units (ICUs), patients with complex clinical conditions require vigilant monitoring and prompt interventions. Chest X-rays (CXRs) are a vital diagnostic tool, providing insights into clinical trajectories, but their irregular acquisition limits their utility. Existing tools for CXR interpretation are constrained by cross-sectional analysis, failing to capture temporal dynamics. To address this, we introduce CXR-TFT, a novel multi-modal framework that integrates temporally sparse CXR imaging and radiology reports with high-frequency clinical data, such as vital signs, laboratory values, and respiratory flow sheets, to predict the trajectory of CXR findings in critically ill patients. CXR-TFT leverages latent embeddings from a vision encoder that are temporally aligned with hourly clinical data through interpolation. A transformer model is then trained to predict CXR embeddings at each hour, conditioned on previous embeddings and clinical measurements. In a retrospective study of 20,000 ICU patients, CXR-TFT demonstrated high accuracy in forecasting abnormal CXR findings up to 12 hours before they became radiographically evident. This predictive capability in clinical data holds significant potential for enhancing the management of time-sensitive conditions like acute respiratory distress syndrome, where early intervention is crucial and diagnoses are often delayed. By providing distinctive temporal resolution in prognostic CXR analysis, CXR-TFT offers actionable ‘whole patient’ insights that can directly improve clinical outcomes.

physics.app-ph [Back]

[186] What do Large Language Models know about materials?

Adrian Ehrenhofer,Thomas Wallmersperger,Gianaurelio Cuniberti

Main category: physics.app-ph

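TL;DR: The paper examines what Large Language Models (LLMs) know about materials science. Using the Periodic Table of Elements as an example, it analyzes the role of vocabulary and tokenization for the uniqueness of material fingerprints and evaluates how well different open models generate correct information.

Details Motivation: LLMs are increasingly applied in mechanical engineering and materials science, yet internet content is mostly non-scientific. The paper evaluates the models' intrinsic knowledge of materials to determine their suitability for the Processing-Structure-Property-Performance (PSPP) chain of materials science and engineering.

Contribution: Proposes a material knowledge benchmark for assessing the ability of LLMs to generate correct information about materials, and indicates for which steps in the PSPP chain LLMs are applicable and where specialized models are required.

Method: Analyzes the effect of vocabulary and tokenization on material fingerprints for the Periodic Table of Elements and evaluates how well several open LLMs generate correct material information.

Result: LLM knowledge of materials depends on vocabulary and tokenization strategy, and accuracy differs across models.

Insight: The potential of LLMs in materials science depends on their ability to generate factually correct information; specialized models may be needed for some PSPP steps.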

Abstract: Large Language Models (LLMs) are increasingly applied in the fields of mechanical engineering and materials science. As models that establish connections through the interface of language, LLMs can be applied for step-wise reasoning through the Processing-Structure-Property-Performance (PSPP) chain of materials science and engineering. Current LLMs are built to adequately represent a dataset, which is, for the most part, the accessible internet. However, the internet mostly contains non-scientific content. If LLMs are to be applied for engineering purposes, it is valuable to investigate models for their intrinsic knowledge – here: the capacity to generate correct information about materials. In the current work, for the example of the Periodic Table of Elements, we highlight the role of vocabulary and tokenization for the uniqueness of material fingerprints, and the capability of different state-of-the-art open models to generate factually correct output. This leads to a material knowledge benchmark that supports an informed choice of the steps in the PSPP chain for which LLMs are applicable and those where specialized models are required.

cs.NE [Back]

[187] APTx Neuron: A Unified Trainable Neuron Architecture Integrating Activation and Computation

Ravin Kumar

Main category: cs.NE

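TL;DR: The paper proposes APTx Neuron, a new neuron architecture that unifies non-linear activation and linear transformation in a single trainable expression, simplifying network structure and improving computational efficiency.

Details Motivation: In conventional neuron designs, non-linear activation and linear transformation are separate, which makes networks more complex and less efficient. APTx Neuron aims to address this.

Contribution: Proposes APTx Neuron, a unified neuron architecture that integrates activation and computation into a single trainable expression, reducing network complexity.

Method: APTx Neuron is derived from the APTx activation function and has the form $y = \sum_{i=1}^{n} ((\alpha_i + \tanh(\beta_i x_i)) \cdot \gamma_i x_i) + \delta$, where all parameters are trainable. It is validated on MNIST.

Result: On the MNIST dataset, the model reaches 96.69% test accuracy in just 20 epochs with approximately 332K trainable parameters, showing efficiency and expressiveness.

Insight: APTx Neuron offers a new paradigm for neuron design and may foster more efficient neural network architectures.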

Abstract: We propose the APTx Neuron, a novel, unified neural computation unit that integrates non-linear activation and linear transformation into a single trainable expression. The APTx Neuron is derived from the APTx activation function, thereby eliminating the need for separate activation layers and making the architecture both computationally efficient and elegant. The proposed neuron follows the functional form $y = \sum_{i=1}^{n} ((\alpha_i + \tanh(\beta_i x_i)) \cdot \gamma_i x_i) + \delta$, where all parameters $\alpha_i$, $\beta_i$, $\gamma_i$, and $\delta$ are trainable. We validate our APTx Neuron-based architecture on the MNIST dataset, achieving up to 96.69% test accuracy in just 20 epochs using approximately 332K trainable parameters. The results highlight the superior expressiveness and computational efficiency of the APTx Neuron compared to traditional neurons, pointing toward a new paradigm in unified neuron design and the architectures built upon it.
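The functional form above translates directly into code. The sketch below evaluates a single neuron's forward pass in plain Python; the function name and the list-based parameter layout are choices made here for illustration, and no training loop is shown.

```python
import math


def aptx_neuron(x, alpha, beta, gamma, delta):
    """Forward pass: y = sum_i (alpha_i + tanh(beta_i * x_i)) * gamma_i * x_i + delta."""
    return sum(
        (a + math.tanh(b * xi)) * g * xi
        for xi, a, b, g in zip(x, alpha, beta, gamma)
    ) + delta
```

Setting every `beta_i` to zero removes the tanh term, and the expression reduces to an ordinary linear neuron with weights `alpha_i * gamma_i` and bias `delta`, which shows how the unified form contains the conventional one as a special case.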

cs.DC [Back]

[188] Towards a Proactive Autoscaling Framework for Data Stream Processing at the Edge using GRU and Transfer Learning

Eugene Armah,Linda Amoako Bannning

Main category: cs.DC

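TL;DR: The paper proposes a proactive autoscaling framework for data stream processing at the edge that combines a GRU with transfer learning to forecast load and adjust resource allocation dynamically. The forecaster outperforms CNN, ARIMA, and Prophet and trains faster than reinforcement learning models.

Details Motivation: Edge computing and data stream processing (DSP) face rapid workload fluctuations. Traditional reactive methods (such as threshold policies) typically lag behind, while reinforcement learning requires extensive simulation. Existing predictive models lose accuracy under online distribution and concept drift.

Contribution: 1. Uses a GRU neural network to forecast the upstream load; 2. Handles the disparities between offline and online domains with a transfer learning framework that combines the DTW algorithm and joint distribution adaptation; 3. Designs a lightweight autoscaling module that adjusts operator parallelism dynamically.

Method: 1. Forecasts load with a GRU using real-world and synthetic DSP datasets; 2. Integrates the predictive model into an online system using DTW and joint distribution adaptation; 3. Scales operator parallelism dynamically based on the predicted load and edge resource constraints.

Result: The GRU model achieves a SMAPE of up to 1.3% on a real-world dataset, outperforming CNN, ARIMA, and Prophet, with lower training time than reinforcement learning models.

Insight: Combining predictive models with transfer learning can significantly improve proactive scaling for edge stream processing while reducing computational overhead. The GRU performs well for load forecasting, and the lightweight design suits edge environments.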

Abstract: Processing data at high speeds is becoming increasingly critical as digital economies generate enormous volumes of data. The current paradigms for timely data processing are edge computing and data stream processing (DSP). Edge computing places resources closer to where data is generated, while stream processing analyzes the unbounded high-speed data in motion. However, edge stream processing faces rapid workload fluctuations, complicating resource provisioning. Inadequate resource allocation leads to bottlenecks, whereas excess allocation results in wastage. Existing reactive methods, such as threshold-based policies and queuing theory, scale only after performance degrades, potentially violating SLAs. Although reinforcement learning (RL) offers a proactive approach through agents that learn optimal runtime adaptation policies, it requires extensive simulation. Furthermore, predictive machine learning models face online distribution and concept drift that reduce their accuracy. We propose a three-step solution to the proactive edge stream processing autoscaling problem. Firstly, a GRU neural network forecasts the upstream load using real-world and synthetic DSP datasets. Secondly, a transfer learning framework integrates the predictive model into an online stream processing system using the DTW algorithm and joint distribution adaptation to handle the disparities between offline and online domains. Finally, a horizontal autoscaling module dynamically adjusts the degree of operator parallelism, based on predicted load while considering edge resource constraints. The lightweight GRU model for load predictions recorded up to 1.3% SMAPE value on a real-world data set. It outperformed CNN, ARIMA, and Prophet on the SMAPE and RMSE evaluation metrics, with lower training time than the computationally intensive RL models.
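Two pieces of this pipeline lend themselves to a short sketch: the SMAPE metric and the final step that turns a predicted load into a degree of parallelism. SMAPE has several variants, and the one below (mean of absolute error over the average of absolute actual and forecast values, in percent) is an assumption, as are the hypothetical parameters `capacity_per_instance` and `max_instances`; the paper's scaling rule may differ.

```python
import math


def smape(actual, forecast):
    """Symmetric mean absolute percentage error, in percent."""
    terms = []
    for a, f in zip(actual, forecast):
        denom = (abs(a) + abs(f)) / 2
        # Define the term as 0 when both values are 0 to avoid dividing by zero.
        terms.append(0.0 if denom == 0 else abs(f - a) / denom)
    return 100 * sum(terms) / len(terms)


def target_parallelism(predicted_load, capacity_per_instance, max_instances):
    """Operator instances needed for the predicted load, within the edge budget."""
    needed = math.ceil(predicted_load / capacity_per_instance)
    return max(1, min(needed, max_instances))
```

For example, a forecast of 950 events per second with 200 events per second per instance calls for five instances, while a forecast of 5000 is clamped to the assumed limit of eight instances available at the edge.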

cs.IR [Back]

Van-Hoang Le,Duc-Vu Nguyen,Kiet Van Nguyen,Ngan Luu-Thuy Nguyen

Main category: cs.IR

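TL;DR: The paper proposes a two-stage framework (retrieval and re-ranking) to improve Vietnamese legal document retrieval, fine-tuning a Bi-Encoder and a Cross-Encoder and using semi-hard negative sampling to improve performance.

Details Motivation: In specialized domains such as law, Large Language Models (LLMs) face challenges of precision and domain knowledge. The paper aims to improve legal document retrieval through efficient data processing and negative sampling.

Contribution: 1. Proposes a two-stage framework (Bi-Encoder retrieval + Cross-Encoder re-ranking); 2. Introduces the Exist@m evaluation metric; 3. Uses semi-hard negative sampling to reduce training bias.

Method: 1. Fine-tunes a Bi-Encoder for rapid candidate generation; 2. Applies a Cross-Encoder for precise re-ranking; 3. Optimizes training with semi-hard negatives.

Result: Achieved a top-three position in the SoICT Hackathon 2024, showing that a lightweight, single-pass retrieval framework is competitive in performance and parameter efficiency.

Insight: Optimized data processing, tailored loss functions, and balanced negative sampling are key to building legal retrieval systems.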

Abstract: Large Language Models (LLMs) face significant challenges in specialized domains like law, where precision and domain-specific knowledge are critical. This paper presents a streamlined two-stage framework consisting of Retrieval and Re-ranking to enhance legal document retrieval efficiency and accuracy. Our approach employs a fine-tuned Bi-Encoder for rapid candidate retrieval, followed by a Cross-Encoder for precise re-ranking, both optimized through strategic negative example mining. Key innovations include the introduction of the Exist@m metric to evaluate retrieval effectiveness and the use of semi-hard negatives to mitigate training bias, which significantly improved re-ranking performance. Evaluated on the SoICT Hackathon 2024 for Legal Document Retrieval, our team, 4Huiter, achieved a top-three position. While top-performing teams employed ensemble models and iterative self-training on large bge-m3 architectures, our lightweight, single-pass approach offered a competitive alternative with far fewer parameters. The framework demonstrates that optimized data processing, tailored loss functions, and balanced negative sampling are pivotal for building robust retrieval-augmented systems in legal contexts.
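The two-stage flow can be sketched with pluggable scoring functions standing in for the fine-tuned models. Everything here is hypothetical: `embed` plays the role of the Bi-Encoder, `cross_score` the role of the Cross-Encoder, and `exist_at_m` encodes one plausible reading of Exist@m (whether any relevant document appears among the top m results), since the abstract does not define the metric.

```python
def retrieve_then_rerank(query, corpus, embed, cross_score, k=10, top_n=3):
    """Stage 1: dot-product retrieval of k candidates. Stage 2: pairwise re-ranking."""
    q = embed(query)

    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    # Stage 1 (Bi-Encoder role): query and documents are embedded independently.
    candidates = sorted(corpus, key=lambda d: dot(q, embed(d)), reverse=True)[:k]
    # Stage 2 (Cross-Encoder role): each (query, document) pair is scored jointly.
    return sorted(candidates, key=lambda d: cross_score(query, d), reverse=True)[:top_n]


def exist_at_m(ranked, relevant, m):
    """Assumed definition: True if any relevant document is in the top m."""
    return any(doc in relevant for doc in ranked[:m])
```

The first stage keeps the expensive pairwise scorer away from most of the corpus, which is why a retrieval miss cannot be repaired by re-ranking; a metric of the Exist@m kind checks exactly that.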

Ninglu Shao,Jinshan Wang,Chenxu Wang,Qingbiao Li,Xiaoxue Zang,Han Li

Main category: cs.IR

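TL;DR: The paper proposes the GREAT framework, which uses trie-based query generation to address item-to-query (I2Q) recommendation on short-video platforms, and releases a large-scale dataset, KuaiRS.

Details Motivation: Short-video platforms have become a primary channel for obtaining information, but academic research and public datasets are lacking. Existing methods rely on embeddings to compute similarity and lack deep semantic interaction.

Contribution: 1. Provides the first systematic examination of the challenges of video-related search; 2. Releases the large-scale dataset KuaiRS; 3. Proposes the LLM-based GREAT framework, which improves recommendation through trie-guided query generation.

Method: 1. Builds a trie from high-quality queries; 2. Enhances the LLM's generation capability during training; 3. Guides generation with the trie at inference time; 4. Refines relevance and literal quality with a post-processing module.

Result: Offline and online experiments confirm the effectiveness of GREAT.

Insight: A trie can effectively guide an LLM to generate high-quality queries, improving semantic interaction.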

Abstract: Currently, short video platforms have become the primary place for individuals to share experiences and obtain information. To better meet users’ needs for acquiring information while browsing short videos, some apps have introduced a search entry at the bottom of videos, accompanied by recommended relevant queries. This scenario is known as query recommendation in video-related search, where the core task is item-to-query (I2Q) recommendation. As this scenario has only emerged in recent years, there is a notable scarcity of academic research and publicly available datasets in this domain. To address this gap, we systematically examine the challenges associated with this scenario for the first time. Subsequently, we release a large-scale dataset derived from real-world data pertaining to query recommendation in video-related search on the Kuaishou app (KuaiRS). Presently, existing methods rely on embeddings to calculate similarity for matching short videos with queries, lacking deep interaction between the semantic content and the query. In this paper, we introduce a novel LLM-based framework named GREAT, which guides query generation with a trie to address I2Q recommendation in related search. Specifically, we initially gather high-quality queries with high exposure and click-through rate to construct a query-based trie. During training, we enhance the LLM’s capability to generate high-quality queries using the query-based trie. In the inference phase, the query-based trie serves as a guide for the token generation. Finally, we further refine the relevance and literal quality between items and queries via a post-processing module. Extensive offline and online experiments demonstrate the effectiveness of our proposed method.
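Trie-guided generation at inference time can be sketched as constrained greedy decoding. This is a simplified illustration: tokens are whitespace-separated words, `score` is a hypothetical stand-in for the LLM's next-token preference, and the `<end>` marker is a convention chosen for this sketch.

```python
def build_trie(queries):
    """Build a nested-dict trie over whitespace tokens; '<end>' marks a complete query."""
    root = {}
    for q in queries:
        node = root
        for tok in q.split():
            node = node.setdefault(tok, {})
        node["<end>"] = {}
    return root


def trie_guided_decode(trie, score):
    """Greedy decoding in which only tokens allowed by the trie may be emitted.

    Assumes the trie was built from at least one query.
    """
    node, out = trie, []
    while True:
        allowed = list(node)  # children of the current prefix
        tok = max(allowed, key=lambda t: score(out, t))
        if tok == "<end>":
            return " ".join(out)
        out.append(tok)
        node = node[tok]
```

Because every step is restricted to the children of the current trie node, the decoded string is always one of the stored high-quality queries, whatever the scoring function prefers.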

[191] LOVO: Efficient Complex Object Query in Large-Scale Video Datasets

Yuxin Liu,Yuezhang Peng,Hefeng Zhou,Hongze Liu,Xinyu Lu,Jiong Lou,Chentao Wu,Wei Zhao,Jie Li

Main category: cs.IR

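TL;DR: LOVO is a system for efficiently handling complex object queries over large-scale video datasets; it achieves low latency and high accuracy through visual embeddings and an inverted multi-index structure.

Details Motivation: As camera deployments become widespread, the volume of video data is surging, yet existing methods handle complex object queries poorly and suffer from high latency, so an efficient and flexible system is needed.

Contribution: 1. Proposes the LOVO system, which uses pre-trained visual encoders to generate compact embeddings and supports complex queries. 2. Designs an inverted multi-index structure to improve query efficiency. 3. Introduces cross-modal reranking to refine results.

Method: 1. Extract key-frame embeddings with pre-trained visual encoders. 2. Build an inverted multi-index structure that organizes the embeddings and bounding boxes. 3. Combine approximate nearest-neighbor search with cross-modal reranking.

Result: On real-world datasets, LOVO achieves up to 85x lower search latency than existing methods, near-optimal query accuracy, and significantly lower index construction cost.

Insight: LOVO's novelty lies in combining visual embeddings with efficient indexing, which suits complex queries in dynamic environments and sets a new benchmark for video analysis.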

Abstract: The widespread deployment of cameras has led to an exponential increase in video data, creating vast opportunities for applications such as traffic management and crime surveillance. However, querying specific objects from large-scale video datasets presents challenges, including (1) processing massive and continuously growing data volumes, (2) supporting complex query requirements, and (3) ensuring low-latency execution. Existing video analysis methods either adapt poorly to unseen object classes or suffer from high query latency. In this paper, we present LOVO, a novel system designed to efficiently handle compLex Object queries in large-scale VideO datasets. Agnostic to user queries, LOVO performs one-time feature extraction using pre-trained visual encoders, generating compact visual embeddings for key frames to build an efficient index. These visual embeddings, along with associated bounding boxes, are organized in an inverted multi-index structure within a vector database, which supports queries for any objects. During the query phase, LOVO transforms object queries to query embeddings and conducts fast approximate nearest-neighbor searches on the visual embeddings. Finally, a cross-modal rerank is performed to refine the results by fusing visual features with detailed textual features. Evaluation on real-world video datasets demonstrates that LOVO outperforms existing methods in handling complex queries, with near-optimal query accuracy and up to 85x lower search latency, while significantly reducing index construction costs. This system advances the state of the art in object querying for video analysis, setting a new benchmark for complex object queries with a novel, scalable, and efficient approach that excels in dynamic environments.
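
To make the idea of index-based approximate nearest-neighbor search concrete, here is a minimal sketch of a plain inverted-file (IVF) index. This is a deliberately simplified stand-in, not LOVO's inverted multi-index or its vector database: the fixed centroids, the function names, and the `nprobe` parameter are assumptions for illustration only.

```python
from collections import defaultdict


def dist2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_ivf(vectors, centroids):
    """Assign every vector id to the cell of its nearest centroid."""
    cells = defaultdict(list)
    for idx, v in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda j: dist2(v, centroids[j]))
        cells[nearest].append(idx)
    return cells


def ivf_search(query, vectors, centroids, cells, nprobe=1, k=2):
    """Scan only the nprobe cells closest to the query, then rank exactly."""
    probe = sorted(range(len(centroids)), key=lambda j: dist2(query, centroids[j]))[:nprobe]
    candidates = [i for c in probe for i in cells.get(c, [])]
    return sorted(candidates, key=lambda i: dist2(query, vectors[i]))[:k]


vectors = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.2, 5.0)]
centroids = [(0.0, 0.0), (5.0, 5.0)]  # assumed to come from an offline clustering step
cells = build_ivf(vectors, centroids)
hits = ivf_search((4.9, 5.1), vectors, centroids, cells, nprobe=1, k=2)
```

The search is approximate because vectors in unprobed cells are never compared with the query; raising `nprobe` trades latency for recall.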

[192] U-MARVEL: Unveiling Key Factors for Universal Multimodal Retrieval via Embedding Learning with MLLMs

Xiaojie Li,Chu Li,Shi-Zhe Chen,Xi Chen

Main category: cs.IR

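TL;DR: The paper systematically analyzes the key factors behind multimodal large language models (MLLMs) in multimodal retrieval and proposes U-MARVEL, a unified embedding learning framework that substantially improves performance and generalizes strongly across multiple tasks.

Details Motivation: Although MLLM-based multimodal retrieval methods have made notable progress, the mechanisms behind them remain unclear, which may lead to suboptimal performance and limited generalization. The paper aims to uncover these key factors and design a better universal framework.

Contribution: 1. Systematically analyzes the key factors for MLLMs in multimodal retrieval; 2. proposes the unified embedding learning framework U-MARVEL; 3. significantly outperforms existing methods on the M-BEIR benchmark.

Method: Implements a general MLLM-based embedding learning pipeline and studies the details of embedding generation and training strategies, including progressive transition, hard negative mining, and re-ranker distillation.

Result: In supervised settings, U-MARVEL leads existing methods on the M-BEIR benchmark by a large margin, and it also performs strongly on zero-shot tasks.

Insight: Often-overlooked factors (such as the embedding generation strategy) have a substantial impact on performance, which demonstrates the potential of a universal framework for multimodal retrieval tasks.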

Abstract: Universal multimodal retrieval (UMR), which aims to address complex retrieval tasks where both queries and candidates span diverse modalities, has been significantly advanced by the emergence of MLLMs. While state-of-the-art MLLM-based methods in the literature predominantly adopt contrastive learning principles, they often differ in their specific training recipes. Despite their success, the mechanisms underlying their retrieval capabilities remain largely unexplored, potentially resulting in suboptimal performance and limited generalization ability. To address these issues, we present a comprehensive study aimed at uncovering the key factors that drive effective embedding learning for UMR using MLLMs. We begin by implementing a general MLLM-based embedding learning pipeline, and systematically analyze the primary contributors to high-performing universal retrieval systems. Based on this, we explore various details of embedding generation and training strategies, including progressive transition, hard negative mining, and re-ranker distillation. Notably, our findings reveal that often-overlooked factors can have a substantial impact on model performance. Building on these discoveries, we introduce a unified framework termed U-MARVEL (Universal MultimodAl RetrieVal via Embedding Learning), which outperforms state-of-the-art competitors on the M-BEIR benchmark by a large margin in supervised settings, and also exhibits strong zero-shot performance on several tasks such as composed image retrieval and text-to-video retrieval. These results underscore the generalization potential of our framework across various embedding-based retrieval tasks. Code is available at https://github.com/chaxjli/U-MARVEL
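
The contrastive objective and hard negative mining mentioned above can be sketched generically. The following is a minimal InfoNCE loss with similarity-based negative selection; it is a textbook formulation, not U-MARVEL's training recipe, and the temperature value and function names are assumptions.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def info_nce(query, positive, negatives, temperature=0.05):
    """InfoNCE loss: negative log-softmax of the positive among all candidates."""
    q = normalize(query)
    logits = [dot(q, normalize(c)) / temperature for c in [positive] + list(negatives)]
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[0]


def mine_hard_negatives(query, pool, k):
    """Pick the k pool items most similar to the query (the hardest negatives)."""
    q = normalize(query)
    return sorted(pool, key=lambda c: dot(q, normalize(c)), reverse=True)[:k]


query, positive = [1.0, 0.0], [0.9, 0.1]
pool = [[0.8, 0.6], [0.0, 1.0], [-1.0, 0.0]]
hard = mine_hard_negatives(query, pool, k=1)
loss_hard = info_nce(query, positive, hard)
loss_easy = info_nce(query, positive, [[-1.0, 0.0]])
```

Hard negatives yield a larger loss than easy ones for the same positive, which is why mining them gives the model a stronger training signal.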

cs.CE

[193] Self-Supervised Distillation of Legacy Rule-Based Methods for Enhanced EEG-Based Decision-Making

Yipeng Zhang,Yuanyi Ding,Chenda Duan,Atsuro Daida,Hiroki Nariai,Vwani Roychowdhury

Main category: cs.CE

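TL;DR: The paper proposes SS2LD, a self-supervised distillation framework that derives weak supervision from the outputs of legacy rule-based detectors and uses a variational autoencoder (VAE) to learn latent representations, efficiently identifying pathological high-frequency oscillations (HFOs) despite scarce labeled data.

Details Motivation: Legacy HFO detectors have low precision and require manual review, while supervised learning is limited by scarce and inconsistent labels. The paper aims to use a self-supervised approach that exploits the reliable signal in legacy detectors to make HFO identification more efficient and accurate.

Contribution: Proposes the SS2LD framework, which combines a VAE with clustering to generate weak supervision and uses legacy detector outputs to refine HFO classification, enabling label-efficient and scalable identification of pathological HFOs.

Method: 1. Pre-train a VAE on the morphology of HFO events to learn latent representations. 2. Cluster these representations to derive weak supervision. 3. Train a classifier on real and VAE-augmented data to refine detection boundaries.

Result: On large multi-institutional interictal iEEG datasets, SS2LD outperforms existing methods, demonstrating its clinical effectiveness.

Insight: Although legacy detectors lack precision, they reliably capture clinically relevant signals; when labels are scarce, self-supervised learning of latent representations can improve task performance.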

Abstract: High-frequency oscillations (HFOs) in intracranial Electroencephalography (iEEG) are critical biomarkers for localizing the epileptogenic zone in epilepsy treatment. However, traditional rule-based detectors for HFOs suffer from unsatisfactory precision, producing false positives that require time-consuming manual review. Supervised machine learning approaches have been used to classify the detection results, yet they typically depend on labeled datasets, which are difficult to acquire due to the need for specialized expertise. Moreover, accurate labeling of HFOs is challenging due to low inter-rater reliability and inconsistent annotation practices across institutions. The lack of a clear consensus on what constitutes a pathological HFO further challenges supervised refinement approaches. To address this, we leverage the insight that legacy detectors reliably capture clinically relevant signals despite their relatively high false positive rates. We thus propose the Self-Supervised to Label Discovery (SS2LD) framework to refine the large set of candidate events generated by legacy detectors into a precise set of pathological HFOs. SS2LD employs a variational autoencoder (VAE) for morphological pre-training to learn meaningful latent representation of the detected events. These representations are clustered to derive weak supervision for pathological events. A classifier then uses this supervision to refine detection boundaries, trained on real and VAE-augmented data. Evaluated on large multi-institutional interictal iEEG datasets, SS2LD outperforms state-of-the-art methods. SS2LD offers a scalable, label-efficient, and clinically effective strategy to identify pathological HFOs using legacy detectors.

cs.ET

[194] Design of an Edge-based Portable EHR System for Anemia Screening in Remote Health Applications

Sebastian A. Cruz Romero,Misael J. Mercado Hernandez,Samir Y. Ali Rivera,Jorge A. Santiago Fernandez,Wilfredo E. Lugo Beauchamp

Main category: cs.ET

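TL;DR: The paper designs a portable, edge-computing-based electronic health record (EHR) system optimized for resource-limited remote healthcare settings. It supports offline operation, secure data management, and modular diagnostic integration, and is validated with an anemia screening use case.

Details Motivation: Resource-limited remote healthcare settings face poor interoperability, a lack of offline support, and dependence on costly infrastructure. Existing digital health solutions often neglect these constraints, which limits their use in remote regions.

Contribution: Proposes a portable, edge-enabled EHR platform that supports offline operation, encrypted storage, and modular diagnostic integration, and evaluates it through an anemia screening use case.

Method: The system runs on embedded devices and combines AES-256 encrypted local storage with optional cloud synchronization. The anemia screening module uses a Random Forest model, and a YOLOv8n-based nail bed detector is quantized to INT8 to speed up detection.

Result: The Random Forest model, trained on 250 patient cases, achieves a test RMSE of 1.969 g/dL and an MAE of 1.490 g/dL. INT8 quantization of the YOLOv8n detector reduces inference latency from 46.96 ms to 21.50 ms.

Insight: With its low-cost, modular, and privacy-compliant design, the system addresses key barriers to digital health in remote regions and demonstrates the scalability of portable health information systems.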

Abstract: The design of medical systems for remote, resource-limited environments faces persistent challenges due to poor interoperability, lack of offline support, and dependency on costly infrastructure. Many existing digital health solutions neglect these constraints, limiting their effectiveness for frontline health workers in underserved regions. This paper presents a portable, edge-enabled Electronic Health Record platform optimized for offline-first operation, secure patient data management, and modular diagnostic integration. Running on small-form factor embedded devices, it provides AES-256 encrypted local storage with optional cloud synchronization for interoperability. As a use case, we integrated a non-invasive anemia screening module leveraging fingernail pallor analysis. Trained on 250 patient cases (27% anemia prevalence) with KDE-balanced data, the Random Forest model achieved a test RMSE of 1.969 g/dL and MAE of 1.490 g/dL. A severity-based model reached 79.2% sensitivity. To optimize performance, a YOLOv8n-based nail bed detector was quantized to INT8, reducing inference latency from 46.96 ms to 21.50 ms while maintaining mAP@0.5 at 0.995. The system emphasizes low-cost deployment, modularity, and data privacy compliance (HIPAA/GDPR), addressing critical barriers to digital health adoption in disconnected settings. Our work demonstrates a scalable approach to enhance portable health information systems and support frontline healthcare in underserved regions.
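
For readers less familiar with the reported metrics, the sketch below shows how RMSE and MAE are computed and what the quoted latencies imply in relative terms. The hemoglobin values are made-up toy numbers, not data from the paper; only the two latencies (46.96 ms and 21.50 ms) come from the abstract.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def latency_reduction(before_ms, after_ms):
    """Fraction of latency removed, e.g. 0.5 means half as long."""
    return (before_ms - after_ms) / before_ms


def speedup(before_ms, after_ms):
    """How many times faster the optimized model runs."""
    return before_ms / after_ms


# Toy hemoglobin values in g/dL (illustrative only).
toy_true = [10.0, 12.0, 14.0]
toy_pred = [11.0, 12.0, 12.0]
toy_rmse = rmse(toy_true, toy_pred)
toy_mae = mae(toy_true, toy_pred)

# Latencies reported in the abstract for the FP and INT8 detectors.
reduction = latency_reduction(46.96, 21.50)
factor = speedup(46.96, 21.50)
```

With the reported latencies, the reduction works out to roughly 54% and the speedup to roughly 2.2x. RMSE is never smaller than MAE, and the gap between the two grows when a few errors are much larger than the rest.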

q-bio.NC

[195] Dissociating model architectures from inference computations

Noor Sajid,Johan Medrano

Main category: q-bio.NC

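TL;DR: The paper examines how model architectures can be dissociated from inference computations, studying how autoregressive and deep temporal models differ in non-Markovian sequence modelling and how the two are related.

Details Motivation: Existing sequence modelling approaches typically couple the model architecture tightly with the computations performed at inference, which limits flexibility and computational efficiency. The paper explores whether the two can be dissociated and what that offers.

Contribution: Shows that autoregressive models can mimic deep temporal computations through structured context access, using fewer computations while maintaining predictive capacity.

Method: Studies how an autoregressive model (a transformer trained on next-token prediction) can mimic deep temporal computations by inducing hierarchical temporal factorisation during iterative inference.

Result: Experiments show that this approach reduces computation while maintaining predictive capacity, supporting the claim that model architecture and inference computation can be dissociated.

Insight: The processes for constructing and refining predictions need not be bound to a fixed model architecture, which offers a new perspective for flexible and efficient sequence modelling.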

Abstract: Parr et al. (2025) examine how auto-regressive and deep temporal models differ in their treatment of non-Markovian sequence modelling. Building on this, we highlight the need for dissociating model architectures, i.e., how the predictive distribution factorises, from the computations invoked at inference. We demonstrate that autoregressive models can mimic deep temporal computations by structuring context access during iterative inference. Using a transformer trained on next-token prediction, we show that inducing hierarchical temporal factorisation during iterative inference maintains predictive capacity while instantiating fewer computations. This emphasises that processes for constructing and refining predictions are not necessarily bound to their underlying model architectures.

cs.GR

[196] Towards Geometric and Textural Consistency 3D Scene Generation via Single Image-guided Model Generation and Layout Optimization

Xiang Tang,Ruotong Li,Xiaopeng Fan

Main category: cs.GR

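TL;DR: Proposes a single-image-guided three-stage framework for generating 3D scenes with geometric and textural consistency, achieving high-quality results through image inpainting, camera parameter estimation, and layout optimization.

Details Motivation: Current methods for generating 3D scenes from a single image fall short in both generation quality and scene coherence; this paper aims to address these problems.

Contribution: 1) Proposes a three-stage framework for high-quality 3D scene generation; 2) improves geometric accuracy through image inpainting and pseudo-stereo viewpoints; 3) ensures scene consistency through layout optimization.

Method: 1) Image instance segmentation and inpainting; 2) camera parameter estimation from a pseudo-stereo viewpoint, together with model selection; 3) parameterized layout optimization that minimizes the Chamfer distance between point clouds.

Result: Experiments on multi-object scene image sets show that the method outperforms existing methods in geometric accuracy, texture fidelity, and scene layout synthesis.

Insight: Handling geometric and textural detail in stages, combined with layout optimization, effectively improves the quality of single-image-guided 3D scene generation.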

Abstract: In recent years, 3D generation has made great strides in both academia and industry. However, generating 3D scenes from a single RGB image remains a significant challenge, as current approaches often struggle to ensure both object generation quality and scene coherence in multi-object scenarios. To overcome these limitations, we propose a novel three-stage framework for 3D scene generation with explicit geometric representations and high-quality textural details via single image-guided model generation and spatial layout optimization. Our method begins with an image instance segmentation and inpainting phase, which recovers missing details of occluded objects in the input images, thereby achieving complete generation of foreground 3D assets. Subsequently, our approach captures the spatial geometry of the reference image by constructing a pseudo-stereo viewpoint for camera parameter estimation and scene depth inference, while employing a model selection strategy to ensure optimal alignment between the 3D assets generated in the previous step and the input. Finally, through model parameterization and minimization of the Chamfer distance between point clouds in 3D and 2D space, our approach optimizes layout parameters to produce an explicit 3D scene representation that maintains precise alignment with the input guidance image. Extensive experiments on multi-object scene image sets demonstrate that our approach not only outperforms state-of-the-art methods in terms of geometric accuracy and texture fidelity of individual generated 3D models, but also has significant advantages in scene layout synthesis.
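
The Chamfer distance used as the layout objective can be sketched as follows. This is one common definition (symmetric, mean of squared nearest-neighbor distances); whether the paper uses squared distances or this exact normalization is not stated in the abstract, so treat those choices as assumptions.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def chamfer_distance(cloud_a, cloud_b):
    """Symmetric Chamfer distance between two non-empty point clouds.

    For each point, find its nearest neighbor in the other cloud, average
    the squared distances, and sum the two directions. Brute force O(N*M).
    """
    def one_way(src, dst):
        return sum(min(sq_dist(p, q) for q in dst) for p in src) / len(src)

    return one_way(cloud_a, cloud_b) + one_way(cloud_b, cloud_a)


square = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
shifted = [(x + 0.5, y) for x, y in square]
d_same = chamfer_distance(square, square)
d_shift = chamfer_distance(square, shifted)
```

Because the measure needs no point-to-point correspondence, it works for both 3D clouds and their 2D projections; a layout optimizer would adjust object pose parameters to drive this value down.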

[197] Blended Point Cloud Diffusion for Localized Text-guided Shape Editing

Etai Sella,Noam Atia,Ron Mokady,Hadar Averbuch-Elor

Main category: cs.GR

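TL;DR: The paper proposes a diffusion-based method for editing shapes represented as point clouds. By combining a partial conditional shape with a coordinate blending algorithm, it enables fine-grained, natural-language-guided 3D shape editing while preserving global coherence and local detail.

Details Motivation: Natural language provides an intuitive interface for localized, fine-grained editing of 3D shapes, but existing methods struggle to preserve global coherence when modifying local regions. The paper addresses this with diffusion models and coordinate blending.

Contribution: 1. Proposes an inpainting-based framework for localized text-guided point cloud editing; 2. introduces structural guidance in the form of a partial conditional shape, so that unedited regions preserve the shape's identity; 3. designs an inference-time coordinate blending algorithm that balances local editing against reconstruction of the full shape.

Method: 1. Use a pre-trained 3D diffusion model for localized edits; 2. provide structural guidance through a partial conditional shape; 3. apply coordinate blending during inference, progressively blending the original shape with the edited result.

Result: Experiments show that the method outperforms alternative techniques on metrics for both fidelity to the original shape and adherence to the text description.

Insight: Combining diffusion models with coordinate blending enables high-quality localized 3D shape editing without relying on computationally expensive inversion.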

Abstract: Natural language offers a highly intuitive interface for enabling localized fine-grained edits of 3D shapes. However, prior works face challenges in preserving global coherence while locally modifying the input 3D shape. In this work, we introduce an inpainting-based framework for editing shapes represented as point clouds. Our approach leverages foundation 3D diffusion models for achieving localized shape edits, adding structural guidance in the form of a partial conditional shape, ensuring that other regions correctly preserve the shape’s identity. Furthermore, to encourage identity preservation also within the local edited region, we propose an inference-time coordinate blending algorithm which balances reconstruction of the full shape with inpainting at a progression of noise levels during the inference process. Our coordinate blending algorithm seamlessly blends the original shape with its edited version, enabling a fine-grained editing of 3D shapes, all while circumventing the need for computationally expensive and often inaccurate inversion. Extensive experiments show that our method outperforms alternative techniques across a wide range of metrics that evaluate both fidelity to the original shape and also adherence to the textual description.

[198] Gaussian Splatting with Discretized SDF for Relightable Assets

Zuo-Liang Zhu,Jian Yang,Beibei Wang

Main category: cs.GR

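TL;DR: The paper proposes a discretized SDF that encodes the SDF within each Gaussian, which avoids costly ray marching and improves the quality and efficiency of inverse rendering.

Details Motivation: 3D Gaussian splatting performs well in novel view synthesis, but in inverse rendering the discrete nature of Gaussian primitives makes it hard to apply geometry constraints. Prior work improves the geometry representation by introducing an SDF, at the cost of more memory and more complex training.

Contribution: The main contribution is a discretized SDF representation encoded within the Gaussians; an SDF-to-opacity transformation enables efficient rendering via splatting and avoids the cost of ray marching.

Method: The method comprises a discretized SDF representation, an SDF-to-opacity transformation, and a projection-based consistency loss that regularizes the discrete samples to be consistent with the underlying SDF.

Result: Experiments show that the method outperforms existing Gaussian-based inverse rendering methods without extra memory overhead.

Insight: The discretized SDF balances expressiveness and computational cost, providing a lightweight way to impose geometry constraints on Gaussian primitives.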

Abstract: 3D Gaussian splatting (3DGS) has shown its detailed expressive ability and highly efficient rendering speed in the novel view synthesis (NVS) task. The application to inverse rendering still faces several challenges, as the discrete nature of Gaussian primitives makes it difficult to apply geometry constraints. Recent works introduce the signed distance field (SDF) as an extra continuous representation to regularize the geometry defined by Gaussian primitives. This improves the decomposition quality, at the cost of increasing memory usage and complicating training. Unlike these works, we introduce a discretized SDF to represent the continuous SDF in a discrete manner by encoding it within each Gaussian using a sampled value. This approach allows us to link the SDF with the Gaussian opacity through an SDF-to-opacity transformation, enabling rendering the SDF via splatting and avoiding the computational cost of ray marching. The key challenge is to regularize the discrete samples to be consistent with the underlying SDF, as the discrete representation can hardly apply gradient-based constraints (e.g., the Eikonal loss). For this, we project Gaussians onto the zero-level set of the SDF and enforce alignment with the surface from splatting, namely a projection-based consistency loss. Thanks to the discretized SDF, our method achieves higher relighting quality, while requiring no extra memory beyond GS and avoiding complex manually designed optimization. The experiments reveal that our method outperforms existing Gaussian-based inverse rendering methods. Our code is available at https://github.com/NK-CS-ZZL/DiscretizedSDF.

cs.MM

[199] Prompt-aware of Frame Sampling for Efficient Text-Video Retrieval

Deyu Zhang,Tingting Long,Jinrui Zhang,Ligeng Chen,Ju Ren,Yaoxue Zhang

Main category: cs.MM

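TL;DR: ProCLIP is an efficient text-video retrieval framework that uses dynamic prompt-aware frame sampling and a two-stage candidate pruning strategy to significantly cut computational overhead while maintaining accuracy.

Details Motivation: Existing methods struggle to balance accuracy and computational efficiency in text-video retrieval: uniform frame sampling is computationally expensive, while salient-frame sampling biases results because its frame selection is query-agnostic.

Contribution: 1. Proposes the ProCLIP framework, which uses prompt-aware frame sampling to dynamically select semantically relevant frames; 2. adopts a two-stage candidate pruning strategy to improve retrieval efficiency.

Method: 1. A dynamic prompt-aware frame sampling strategy; 2. two-stage candidate pruning (coarse filtering followed by fine-grained re-ranking).

Result: Achieves R@1=49.0 on the MSR-VTT dataset with a 75.3% latency reduction versus baselines.

Insight: Prompt-aware dynamic frame sampling addresses the problem of query-agnostic frame selection, improving efficiency while preserving accuracy.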

Abstract: Enabling efficient text-video retrieval on edge-end devices is critical for real-world applications. Yet, existing methods face a critical challenge in balancing accuracy and computational efficiency: uniform frame sampling methods ensure content coverage but incur prohibitive computational costs, while salient-frame sampling methods reduce overhead but suffer from query-agnostic frame selection that biases retrieval results. To address this, we propose ProCLIP, a user-centric framework that achieves state-of-the-art accuracy with significantly improved efficiency. We design a prompt-aware frame sampling strategy that dynamically guides lightweight feature extractors using textual prompts to select semantically relevant frames, overcoming the limitations of existing salient-frame sampling methods which rely on static, query-agnostic selection criteria. Moreover, we adopt a two-stage candidate pruning strategy that combines rapid coarse filtering via a lightweight module with CLIP-powered fine-grained re-ranking, enhancing retrieval efficiency while preserving accuracy. Experiments across benchmarks show ProCLIP achieves 75.3% latency reduction versus baselines while maintaining competitive accuracy, i.e., R@1=49.0 in MSR-VTT dataset. Code is available at https://github.com/tiffylong/ProCLIP.
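
The two-stage candidate pruning pattern can be sketched generically: a cheap scorer shortlists candidates, and an expensive scorer re-ranks only the shortlist. This is not ProCLIP's code; the scorer functions, score tables, and `keep`/`top` parameters are hypothetical placeholders for the lightweight module and the CLIP-based re-ranker.

```python
def two_stage_retrieve(query, candidates, coarse_score, fine_score, keep=3, top=1):
    """Shortlist with a cheap scorer, then re-rank the shortlist with a costly one."""
    shortlist = sorted(candidates, key=lambda c: coarse_score(query, c), reverse=True)[:keep]
    return sorted(shortlist, key=lambda c: fine_score(query, c), reverse=True)[:top]


# Toy score tables standing in for real models (illustrative values).
COARSE = {"v1": 0.9, "v2": 0.8, "v3": 0.1, "v4": 0.7, "v5": 0.2}
FINE = {"v1": 0.3, "v2": 0.95, "v3": 0.99, "v4": 0.5, "v5": 0.0}
fine_calls = []


def cheap(query, video):
    return COARSE[video]


def expensive(query, video):
    fine_calls.append(video)  # record how often the costly model runs
    return FINE[video]


best = two_stage_retrieve("a dog surfing", list(COARSE), cheap, expensive, keep=3, top=1)
```

Here the expensive scorer runs on three of five candidates instead of all five. The toy data also shows the trade-off: "v3" has the highest fine score but is pruned by the coarse stage, so the shortlist size bounds the achievable accuracy.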

cs.SD

[200] A2TTS: TTS for Low Resource Indian Languages

Ayush Singh Bhadoriya,Abhishek Nikunj Shinde,Isha Pandey,Ganesh Ramakrishnan

Main category: cs.SD

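TL;DR: The paper presents a diffusion-based, speaker-conditioned text-to-speech (TTS) system for low-resource Indian languages. It supports unseen speakers and multiple languages, and improves speech quality through speaker embeddings and a cross-attention-based duration prediction mechanism.

Details Motivation: To address the challenges of TTS for low-resource Indian languages, in particular supporting unseen speakers and improving the naturalness and timing consistency of the speech.

Contribution: 1. Proposes a TTS system built on a diffusion architecture; 2. introduces a speaker encoder and a cross-attention-based duration prediction mechanism to improve speech quality; 3. uses classifier-free guidance (CFG) to improve zero-shot generation.

Method: Uses a diffusion model (DDPM) conditioned on embeddings extracted by a speaker encoder, predicts durations with a cross-attention mechanism, and applies CFG to improve zero-shot generation.

Result: Language-specific models were trained on the IndicSUPERB dataset for several Indian languages; the generated speech more closely resembles the target speaker and has more accurate timing.

Insight: Combining diffusion models with speaker conditioning effectively supports TTS for low-resource languages and improves generation quality for unseen speakers.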

Abstract: We present a speaker-conditioned text-to-speech (TTS) system aimed at addressing challenges in generating speech for unseen speakers and supporting diverse Indian languages. Our method leverages a diffusion-based TTS architecture, where a speaker encoder extracts embeddings from short reference audio samples to condition the DDPM decoder for multispeaker generation. To further enhance prosody and naturalness, we employ a cross-attention-based duration prediction mechanism that utilizes reference audio, enabling more accurate and speaker-consistent timing. This results in speech that closely resembles the target speaker while improving duration modeling and overall expressiveness. Additionally, to improve zero-shot generation, we employ classifier-free guidance, allowing the system to generate speech that more closely matches unknown speakers. Using this approach, we trained language-specific speaker-conditioned models on the IndicSUPERB dataset for multiple Indian languages such as Bengali, Gujarati, Hindi, Marathi, Malayalam, Punjabi and Tamil.
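
Classifier-free guidance combines a conditional and an unconditional noise prediction at each denoising step. The sketch below uses one common parameterization, eps = eps_uncond + scale * (eps_cond - eps_uncond); the abstract does not say which convention or guidance scale A2TTS uses, so both are assumptions here.

```python
def cfg_combine(eps_cond, eps_uncond, scale):
    """Blend conditional and unconditional noise predictions.

    scale = 0 ignores the condition, scale = 1 recovers the conditional
    prediction, and scale > 1 pushes further toward the condition.
    """
    return [u + scale * (c - u) for c, u in zip(eps_cond, eps_uncond)]


# Toy per-dimension noise predictions from a hypothetical denoiser.
eps_speaker = [1.0, 2.0]   # conditioned on the speaker embedding
eps_null = [0.0, 0.0]      # condition dropped (null embedding)

unguided = cfg_combine(eps_speaker, eps_null, 0.0)
plain = cfg_combine(eps_speaker, eps_null, 1.0)
guided = cfg_combine(eps_speaker, eps_null, 2.0)
```

Training for this scheme typically drops the condition at random so that one network can produce both predictions; a larger scale strengthens adherence to the speaker condition, usually at some cost in diversity.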

cs.RO

[201] Uncertainty-aware Probabilistic 3D Human Motion Forecasting via Invertible Networks

Yue Ma,Kanglei Zhou,Fuyang Yu,Frederick W. B. Li,Xiaohui Liang

Main category: cs.RO

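TL;DR: ProbHMI is a probabilistic 3D human motion forecasting method based on invertible networks; it models dynamics in a disentangled latent space, enabling effective uncertainty quantification.

Details Motivation: Current human motion forecasting methods fall short on uncertainty quantification, and explicitly modeling uncertainty matters especially in safety-critical settings such as human-robot collaboration.

Contribution: Proposes ProbHMI, which uses invertible networks to parameterize poses and explicitly models the distribution of dynamics in a disentangled latent space, enabling direct uncertainty quantification.

Method: Invertible networks map poses to a disentangled latent space, where a forecasting module explicitly predicts future latent distributions, enabling probabilistic dynamics modeling and uncertainty quantification.

Result: On benchmarks, ProbHMI performs strongly in both deterministic and diverse prediction, and its uncertainty calibration is validated.

Insight: Explicitly modeling the distribution of dynamics in latent space is an effective route to quantifying uncertainty, and invertible networks offer a new way to achieve it.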

Abstract: 3D human motion forecasting aims to enable autonomous applications. Estimating uncertainty for each prediction (i.e., confidence based on probability density or quantile) is essential for safety-critical contexts like human-robot collaboration to minimize risks. However, existing diverse motion forecasting approaches struggle with uncertainty quantification due to implicit probabilistic representations hindering uncertainty modeling. We propose ProbHMI, which introduces invertible networks to parameterize poses in a disentangled latent space, enabling probabilistic dynamics modeling. A forecasting module then explicitly predicts future latent distributions, allowing effective uncertainty quantification. Evaluated on benchmarks, ProbHMI achieves strong performance for both deterministic and diverse prediction while validating uncertainty calibration, critical for risk-aware decision making.
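
One standard building block of invertible networks is the affine coupling layer, sketched below. The abstract does not describe ProbHMI's architecture, so this is a generic illustration: the toy scale and shift functions and the even split of the input are assumptions.

```python
import math


def scale_fn(h):
    """Toy scale network (any function of the untouched half would do)."""
    return [0.5 * math.tanh(v) for v in h]


def shift_fn(h):
    """Toy shift network."""
    return [2.0 * v for v in h]


def coupling_forward(x):
    """Map x to y and return log|det J|. Assumes an even-length input."""
    half = len(x) // 2
    x1, x2 = x[:half], x[half:]
    s, t = scale_fn(x1), shift_fn(x1)
    y2 = [b * math.exp(si) + ti for b, si, ti in zip(x2, s, t)]
    return x1 + y2, sum(s)  # the first half passes through unchanged


def coupling_inverse(y):
    """Exactly undo coupling_forward without inverting scale_fn or shift_fn."""
    half = len(y) // 2
    y1, y2 = y[:half], y[half:]
    s, t = scale_fn(y1), shift_fn(y1)
    x2 = [(b - ti) * math.exp(-si) for b, si, ti in zip(y2, s, t)]
    return y1 + x2


pose = [0.3, -1.2, 0.7, 2.0]  # stand-in for a flattened pose vector
latent, log_det = coupling_forward(pose)
recovered = coupling_inverse(latent)
```

Exact invertibility plus a cheap log-determinant is what lets such models evaluate probability densities through the change-of-variables formula, which is the property that makes explicit uncertainty estimates possible.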

[202] Low-Latency Event-Based Velocimetry for Quadrotor Control in a Narrow Pipe

Leonard Bauersfeld,Davide Scaramuzza

Main category: cs.RO

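TL;DR: The paper presents a hover control method for quadrotors in narrow pipes based on real-time flow field measurements. It combines event-based smoke velocimetry and high-temporal-resolution flow estimation with a reinforcement-learning-based controller to compensate for transient aerodynamic disturbances.

Details Motivation: In narrow pipes or tunnels, quadrotor flight is destabilized by self-induced aerodynamic disturbances. Existing methods either rely on constant motion or offer limited hover stability, so a closed-loop control system that exploits real-time flow field measurements is needed.

Contribution: Presents the first closed-loop hover control system for quadrotors that uses real-time flow field measurements, develops a low-latency event-based smoke velocimetry method, and designs a disturbance estimator based on a recurrent convolutional neural network.

Method: Event-based smoke velocimetry estimates the local airflow; a recurrent convolutional neural network infers force and torque disturbances in real time; a controller trained via reinforcement learning integrates the estimated disturbances.

Result: Experiments show that flow-feedback control is particularly effective during lateral translation in the pipe cross-section, counteracting transient aerodynamic effects and preventing collisions with the pipe wall.

Insight: The work opens new directions for flight in aerodynamically complex environments and sheds light on the characteristic flow structures that arise during flight in narrow circular pipes, advancing research at the intersection of robotics and fluid dynamics.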

Abstract: Autonomous quadrotor flight in confined spaces such as pipes and tunnels presents significant challenges due to unsteady, self-induced aerodynamic disturbances. Very recent advances have enabled flight in such conditions, but they either rely on constant motion through the pipe to mitigate airflow recirculation effects or suffer from limited stability during hovering. In this work, we present the first closed-loop control system for quadrotors for hovering in narrow pipes that leverages real-time flow field measurements. We develop a low-latency, event-based smoke velocimetry method that estimates local airflow at high temporal resolution. This flow information is used by a disturbance estimator based on a recurrent convolutional neural network, which infers force and torque disturbances in real time. The estimated disturbances are integrated into a learning-based controller trained via reinforcement learning. The flow-feedback control proves particularly effective during lateral translation maneuvers in the pipe cross-section. There, the real-time disturbance information enables the controller to effectively counteract transient aerodynamic effects, thereby preventing collisions with the pipe wall. To the best of our knowledge, this work represents the first demonstration of an aerial robot with closed-loop control informed by real-time flow field measurements. This opens new directions for research on flight in aerodynamically complex environments. In addition, our work also sheds light on the characteristic flow structures that emerge during flight in narrow, circular pipes, providing new insights at the intersection of robotics and fluid dynamics.

[203] GR-3 Technical Report

Chilam Cheang,Sijin Chen,Zhongren Cui,Yingdong Hu,Liqun Huang,Tao Kong,Hang Li,Yifeng Li,Yuxiao Liu,Xiao Ma,Hao Niu,Wenxuan Ou,Wanli Peng,Zeyu Ren,Haixin Shi,Jiawen Tian,Hongtao Wu,Xin Xiao,Yuyang Xiao,Jiafeng Xu,Yichu Yang

Main category: cs.RO

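TL;DR: GR-3 is a large-scale vision-language-action (VLA) model that generalizes exceptionally well to novel objects, environments, and abstract instructions, and adapts quickly to new settings from a small amount of human trajectory data. It also excels at long-horizon and dexterous tasks, enabled by a multi-faceted training recipe.

Details Motivation: The goal is to develop a generalist robot policy that generalizes across tasks and environments and assists humans in daily life.

Contribution: Introduces the GR-3 model and demonstrates its strengths in generalization, efficient fine-tuning, and dexterous tasks; also introduces the ByteMini robot, which can accomplish a wide range of tasks when integrated with GR-3.

Method: A multi-faceted training recipe: co-training with web-scale vision-language data, fine-tuning on human trajectory data collected via VR devices, and imitation learning on robot trajectory data.

Result: GR-3 surpasses the baseline method $\pi_0$ on a wide variety of tasks, showing robust and reliable performance.

Insight: Combining a vision-language-action model with purpose-built robot hardware can give generalist robots greater capability and flexibility.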

Abstract: We report our recent progress towards building generalist robot policies, the development of GR-3. GR-3 is a large-scale vision-language-action (VLA) model. It showcases exceptional capabilities in generalizing to novel objects, environments, and instructions involving abstract concepts. Furthermore, it can be efficiently fine-tuned with minimal human trajectory data, enabling rapid and cost-effective adaptation to new settings. GR-3 also excels in handling long-horizon and dexterous tasks, including those requiring bi-manual manipulation and mobile movement, showcasing robust and reliable performance. These capabilities are achieved through a multi-faceted training recipe that includes co-training with web-scale vision-language data, efficient fine-tuning from human trajectory data collected via VR devices, and effective imitation learning with robot trajectory data. In addition, we introduce ByteMini, a versatile bi-manual mobile robot designed with exceptional flexibility and reliability, capable of accomplishing a wide range of tasks when integrated with GR-3. Through extensive real-world experiments, we show GR-3 surpasses the state-of-the-art baseline method, $\pi_0$, on a wide variety of challenging tasks. We hope GR-3 can serve as a step towards building generalist robots capable of assisting humans in daily life.

[204] Look, Focus, Act: Efficient and Robust Robot Learning via Human Gaze and Foveated Vision Transformers

Ian Chuang,Andrew Lee,Dechen Gao,Jinyu Zou,Iman Soltani

Main category: cs.RO

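TL;DR: The paper explores how human-like active vision (gaze) can improve the efficiency and performance of robot learning systems, proposes a framework based on foveated Vision Transformers, and shows that it reduces computational overhead and improves task performance.

Details Motivation: Human vision uses active gaze to process task-relevant regions efficiently, whereas conventional robot vision systems process images passively. The paper aims to bring the active nature of human vision into robot policies to improve efficiency and performance.

Contribution: 1. Proposes a framework that combines human gaze data with robot demonstrations; 2. designs foveated Vision Transformers to reduce computational overhead; 3. explores two approaches to gaze imitation and prediction and validates their benefits.

Method: 1. Collect human gaze data and robot action data on the AV-ALOHA platform; 2. introduce foveated Vision Transformers, which reduce computation through a human-inspired foveated patch tokenization; 3. compare two gaze integration approaches: two-stage prediction and end-to-end joint prediction.

Result: Experiments show that the foveated approach drastically reduces computational overhead while improving performance on high-precision tasks and robustness to unseen distractors.

Insight: The active nature of human vision provides a useful inductive bias for robot vision systems, and human-inspired visual design shows promise for both efficiency and performance.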

Abstract: Human vision is a highly active process driven by gaze, which directs attention and fixation to task-relevant regions and dramatically reduces visual processing. In contrast, robot learning systems typically rely on passive, uniform processing of raw camera images. In this work, we explore how incorporating human-like active gaze into robotic policies can enhance both efficiency and performance. We build on recent advances in foveated image processing and apply them to an Active Vision robot system that emulates both human head movement and eye tracking. Extending prior work on the AV-ALOHA robot simulation platform, we introduce a framework for simultaneously collecting eye-tracking data and robot demonstrations from a human operator, as well as a simulation benchmark and dataset for training robot policies that incorporate human gaze. Given the widespread use of Vision Transformers (ViTs) in robot learning, we integrate gaze information into ViTs using a foveated patch tokenization scheme inspired by recent work in image segmentation. Compared to uniform patch tokenization, this significantly reduces the number of tokens, and thus computation, without sacrificing visual fidelity near regions of interest. We also explore two approaches to gaze imitation and prediction from human data. The first is a two-stage model that predicts gaze to guide foveation and action; the second integrates gaze into the action space, allowing the policy to jointly predict gaze and actions end-to-end. Our results show that our method for foveated robot vision not only drastically reduces computational overhead, but also improves performance for high precision tasks and robustness to unseen distractors. Together, these findings suggest that human-inspired visual processing offers a useful inductive bias for robotic vision systems. https://ian-chuang.github.io/gaze-av-aloha/
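
The token savings from foveation can be illustrated with a toy layout: fine patches in the image cell that contains the gaze point and coarse patches everywhere else. This two-level grid is an assumption made for illustration and is not the paper's tokenization scheme; the patch sizes and function name are hypothetical.

```python
def foveated_patches(width, height, gaze, coarse=64, fine=16):
    """Return (x, y, size) patches: fine near the gaze, coarse elsewhere.

    Assumes width and height are multiples of `coarse`, and `coarse` is a
    multiple of `fine`. Only the coarse cell containing the gaze point is
    subdivided into fine patches.
    """
    gx, gy = gaze
    patches = []
    for y0 in range(0, height, coarse):
        for x0 in range(0, width, coarse):
            if x0 <= gx < x0 + coarse and y0 <= gy < y0 + coarse:
                for fy in range(y0, y0 + coarse, fine):
                    for fx in range(x0, x0 + coarse, fine):
                        patches.append((fx, fy, fine))
            else:
                patches.append((x0, y0, coarse))
    return patches


patches = foveated_patches(256, 256, gaze=(130, 130))
uniform_tokens = (256 // 16) * (256 // 16)   # every patch at fine resolution
foveated_tokens = len(patches)
covered_area = sum(size * size for _, _, size in patches)
```

In this toy setting the foveated layout covers the full image with far fewer tokens than uniform fine patches while keeping full resolution around the gaze point. Since self-attention cost grows quadratically with token count, the compute saving is larger than the token ratio alone suggests.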