Table of Contents
- cs.CL [Total: 20]
- cs.CV [Total: 63]
- cs.RO [Total: 3]
- cs.GR [Total: 1]
- cs.AI [Total: 1]
- cs.IR [Total: 1]
- cs.NE [Total: 1]
- cs.SD [Total: 1]
- astro-ph.IM [Total: 1]
- eess.IV [Total: 7]
- q-bio.NC [Total: 1]
- cs.SE [Total: 2]
- eess.SP [Total: 1]
- cs.CR [Total: 1]
- cs.LG [Total: 2]
- cs.SC [Total: 1]
cs.CL
[1] MapIQ: Benchmarking Multimodal Large Language Models for Map Question Answering
Varun Srivastava,Fan Lei,Srija Mukhopadhyay,Vivek Gupta,Ross Maciejewski
Main category: cs.CL
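TL;DR: MapIQ is a new benchmark dataset for evaluating multimodal large language models (MLLMs) on map question answering (Map-VQA). It covers three map types and six themes, and its experiments analyze how robust and how sensitive the models are to changes in map design.
Details
Motivation: Existing Map-VQA research is largely limited to choropleth maps and covers a narrow range of themes and visual analytical tasks, so a more comprehensive benchmark is needed to assess MLLMs on map question answering.
Contribution: Introduces the MapIQ dataset of 14,706 question-answer pairs spanning three map types and six themes, and evaluates multiple MLLMs together with their response to map design changes.
Method: Builds a diverse map dataset (choropleth maps, cartograms, and proportional symbol maps) with six visual analytical tasks, compares MLLM performance against one another and against a human baseline, and analyzes how map design changes affect the models.
Result: The experiments reveal performance differences among MLLMs on map question answering, their sensitivity to map design changes, and the extent to which they rely on internal geographic knowledge.
Insight: The study points to directions for improving Map-VQA performance, such as optimizing map design to reduce models' dependence on specific visual features.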
Abstract: Recent advancements in multimodal large language models (MLLMs) have driven researchers to explore how well these models read data visualizations, e.g., bar charts, scatter plots. More recently, attention has shifted to visual question answering with maps (Map-VQA). However, Map-VQA research has primarily focused on choropleth maps, which cover only a limited range of thematic categories and visual analytical tasks. To address these gaps, we introduce MapIQ, a benchmark dataset comprising 14,706 question-answer pairs across three map types: choropleth maps, cartograms, and proportional symbol maps spanning topics from six distinct themes (e.g., housing, crime). We evaluate multiple MLLMs using six visual analytical tasks, comparing their performance against one another and a human baseline. An additional experiment examining the impact of map design changes (e.g., altered color schemes, modified legend designs, and removal of map elements) provides insights into the robustness and sensitivity of MLLMs, their reliance on internal geographic knowledge, and potential avenues for improving Map-VQA performance.
[2] Partitioner Guided Modal Learning Framework
Guimin Hu,Yi Xin,Lijie Hu,Zhihong Zhu,Hasti Seifi
Main category: cs.CL
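TL;DR: The paper proposes a partitioner-guided modal learning framework (PgM) that uses a modal partitioner, a uni-modal learner, a paired-modal learner, and a uni-paired modal decoder to learn uni-modal and paired-modal features systematically, and demonstrates its effectiveness and transferability across several tasks.
Details
Motivation: Multimodal learning benefits from information in multiple modalities, but existing methods do not adequately separate the learning of uni-modal and paired-modal features. The paper aims to learn both kinds of features more thoroughly through a partitioner-guided framework that adapts flexibly to different downstream tasks.
Contribution: (1) Proposes the PgM framework, which learns uni-modal and paired-modal features systematically; (2) supports flexible adjustment of the feature distribution to suit different tasks; (3) demonstrates PgM's effectiveness and transferability across several tasks.
Method: PgM has four main components: a modal partitioner (which splits the learned representation into uni-modal and paired-modal features), a uni-modal learner, a paired-modal learner, and a uni-paired modal decoder. The framework supports flexible adjustment of the partition and different learning rates across modalities and partitions.
Result: PgM performs well on four multimodal tasks and transfers to existing models. Visualization reveals how the contributions of uni-modal and paired-modal features differ.
Insight: Partitioned learning captures multimodal features more systematically. The distribution and contribution of uni-modal and paired-modal features vary by task and modality, so flexibility is key to improving multimodal learning performance.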
Abstract: Multimodal learning benefits from information in multiple modalities, and each learned modal representation can be divided into uni-modal features, which can be learned from uni-modal training, and paired-modal features, which can be learned from cross-modal interaction. Building on this perspective, we propose a partitioner-guided modal learning framework, PgM, which consists of the modal partitioner, uni-modal learner, paired-modal learner, and uni-paired modal decoder. The modal partitioner segments the learned modal representation into uni-modal and paired-modal features. The modal learner incorporates two dedicated components for uni-modal and paired-modal learning. The uni-paired modal decoder reconstructs the modal representation based on uni-modal and paired-modal features. PgM offers three key benefits: 1) thorough learning of uni-modal and paired-modal features, 2) flexible distribution adjustment for uni-modal and paired-modal representations to suit diverse downstream tasks, and 3) different learning rates across modalities and partitions. Extensive experiments demonstrate the effectiveness of PgM across four multimodal tasks and further highlight its transferability to existing models. Additionally, we visualize the distribution of uni-modal and paired-modal features across modalities and tasks, offering insights into their respective contributions.
[3] ExpliCIT-QA: Explainable Code-Based Image Table Question Answering
Maximiliano Hormazábal Lagos,Álvaro Bueno Sáez,Pedro Alonso Doval,Jorge Alcalde Vesteiro,Héctor Cerezo-Costas
Main category: cs.CL
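TL;DR: ExpliCIT-QA is an explainable table-image question answering system built on the MRT approach. Its modular design provides transparency, and it uses language-based reasoning and code generation to improve explainability.
Details
Motivation: Addresses the lack of explainability in existing end-to-end TableVQA systems, especially in sensitive domains such as finance and healthcare where results must be auditable.
Contribution: Proposes the modular system ExpliCIT-QA, which achieves transparent and explainable table-image question answering through Chain-of-Thought reasoning, natural language explanation, and code generation.
Method: The system consists of five modules: multimodal table understanding, language-based reasoning, automatic code generation, code execution, and natural language explanation.
Result: On the TableVQA-Bench benchmark, the system shows improved interpretability and transparency compared with existing baselines.
Insight: A modular design that exposes intermediate results for inspection helps close the explainability gap in TableVQA systems and suits domains that require auditing.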
Abstract: We present ExpliCIT-QA, a system that extends our previous MRT approach for tabular question answering into a multimodal pipeline capable of handling complex table images and providing explainable answers. ExpliCIT-QA follows a modular design, consisting of: (1) Multimodal Table Understanding, which uses a Chain-of-Thought approach to extract and transform content from table images; (2) Language-based Reasoning, where a step-by-step explanation in natural language is generated to solve the problem; (3) Automatic Code Generation, where Python/Pandas scripts are created based on the reasoning steps, with feedback for handling errors; (4) Code Execution to compute the final answer; and (5) Natural Language Explanation that describes how the answer was computed. The system is built for transparency and auditability: all intermediate outputs, parsed tables, reasoning steps, generated code, and final answers are available for inspection. This strategy works towards closing the explainability gap in end-to-end TableVQA systems. We evaluated ExpliCIT-QA on the TableVQA-Bench benchmark, comparing it with existing baselines. We demonstrated improvements in interpretability and transparency, which open the door for applications in sensitive domains like finance and healthcare where auditing results are critical.
[4] CRABS: A syntactic-semantic pincer strategy for bounding LLM interpretation of Python notebooks
Meng Li,Timothy M. McPhillips,Dingmin Wang,Shin-Rong Tsai,Bertram Ludäscher
Main category: cs.CL
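TL;DR: The paper proposes CRABS, a strategy that combines shallow syntactic analysis with an LLM to address the errors that hallucinations and long-context challenges cause when LLMs interpret Python notebooks. By capturing and parsing a notebook's syntactic structure, the method supports cell-by-cell zero-shot resolution by the LLM and identifies the notebook's information flows and cell execution dependencies with high accuracy.
Details
Motivation: Python notebooks are widely used in data science and machine learning, but re-executing them is often impractical because of data and software dependency issues. Pre-trained LLMs understand code well in general, yet on realistic notebooks they still suffer from hallucinations and weak long-context understanding, so a more reliable way to recover a notebook's information flow and execution dependencies is needed.
Contribution: Proposes the CRABS strategy, which combines shallow syntactic analysis with an LLM to cover the LLM's weaknesses in notebook understanding. It captures the syntactic structure, uses the LLM to resolve the remaining ambiguities, and produces an accurate information flow graph and cell execution dependency graph.
Method: CRABS has two parts: (1) shallow syntactic parsing and AST analysis produce lower and upper estimates that bound the inter-cell I/O sets; (2) an LLM resolves the remaining ambiguities through cell-by-cell zero-shot learning. The output is the notebook's complete information flow graph and dependency graph.
Result: On 50 Kaggle notebooks, CRABS achieves average F1 scores of 98% for cell-to-cell information flows and 99% for transitive cell execution dependencies. The LLM correctly resolved 98% of the remaining ambiguities (1397 of 1425).
Insight: Pairing syntactic analysis with an LLM markedly improves the accuracy of notebook understanding. Shallow syntactic analysis supplies the bounding constraints, while the LLM fills in the fine-grained semantic gaps, giving a reliable tool for reusing and extending notebooks.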
Abstract: Recognizing the information flows and operations comprising data science and machine learning Python notebooks is critical for evaluating, reusing, and adapting notebooks for new tasks. Investigating a notebook via re-execution often is impractical due to the challenges of resolving data and software dependencies. While Large Language Models (LLMs) pre-trained on large codebases have demonstrated effectiveness in understanding code without running it, we observe that they fail to understand some realistic notebooks due to hallucinations and long-context challenges. To address these issues, we propose a notebook understanding task yielding an information flow graph and corresponding cell execution dependency graph for a notebook, and demonstrate the effectiveness of a pincer strategy that uses limited syntactic analysis to assist full comprehension of the notebook using an LLM. Our Capture and Resolve Assisted Bounding Strategy (CRABS) employs shallow syntactic parsing and analysis of the abstract syntax tree (AST) to capture the correct interpretation of a notebook between lower and upper estimates of the inter-cell I/O sets, then uses an LLM to resolve remaining ambiguities via cell-by-cell zero-shot learning, thereby identifying the true data inputs and outputs of each cell. We evaluate and demonstrate the effectiveness of our approach using an annotated dataset of 50 representative, highly up-voted Kaggle notebooks that together represent 3454 actual cell inputs and outputs. The LLM correctly resolves 1397 of 1425 (98%) ambiguities left by analyzing the syntactic structure of these notebooks. Across 50 notebooks, CRABS achieves average F1 scores of 98% identifying cell-to-cell information flows and 99% identifying transitive cell execution dependencies.
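The "capture" half of the pincer can be illustrated with Python's standard `ast` module. The sketch below is not the authors' implementation: the function name `cell_io_bounds`, the returned keys, and the rule "a name used as the base of an attribute or subscript may have been mutated" are all assumptions made for illustration. It derives, for one cell, a lower and an upper estimate of the names flowing in and out; whatever lies between the two estimates is the kind of ambiguity the abstract says is handed to the LLM.

```python
import ast
import builtins

BUILTIN_NAMES = set(dir(builtins))

def cell_io_bounds(source):
    """Bound one cell's inter-cell inputs and outputs using syntax alone.

    Hypothetical simplification: scoping inside functions is ignored.
    """
    loaded, stored, maybe_mutated = set(), set(), set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Store):
                stored.add(node.id)
            else:
                loaded.add(node.id)
        elif isinstance(node, ast.alias):
            # "import pandas as pd" binds the name "pd"
            stored.add((node.asname or node.name).split(".")[0])
        elif isinstance(node, (ast.Attribute, ast.Subscript)):
            # df.dropna(inplace=True) or df["x"] = 1 may change df in place
            if isinstance(node.value, ast.Name):
                maybe_mutated.add(node.value.id)
    loaded -= BUILTIN_NAMES
    return {
        "inputs_lower": loaded - stored,   # must come from another cell
        "inputs_upper": loaded,            # might come from another cell
        "outputs_lower": stored,           # definitely (re)bound here
        "outputs_upper": stored | (maybe_mutated - BUILTIN_NAMES),
    }

cells = [
    "import pandas as pd\ndf = pd.read_csv('train.csv')",
    "df.dropna(inplace=True)\nn_rows = len(df)",
]
bounds = [cell_io_bounds(c) for c in cells]
ambiguous = bounds[1]["outputs_upper"] - bounds[1]["outputs_lower"]
print(ambiguous)  # {'df'}
```

In the second cell, syntax alone cannot tell whether `df.dropna(inplace=True)` changes `df`, so `df` sits between the lower and upper output estimates. Only the source strings are parsed, so pandas does not need to be installed to run the sketch.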
[5] AI Wizards at CheckThat! 2025: Enhancing Transformer-Based Embeddings with Sentiment for Subjectivity Detection in News Articles
Matteo Fasulo,Luca Babboni,Luca Tedeschini
Main category: cs.CL
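TL;DR: AI Wizards proposes enhancing transformer-based classifiers with sentiment scores for subjectivity detection in news articles. The approach performs well in multilingual and zero-shot settings and ranked first for Greek.
Details
Motivation: Subjectivity detection in news articles is an important task, but existing methods generalize poorly in multilingual and zero-shot settings. Sentiment information may help separate subjective from objective sentences.
Contribution: Proposes integrating sentiment scores with transformer sentence representations, which markedly improves subjectivity detection, especially in multilingual and zero-shot settings.
Method: Uses mDeBERTaV3-base, ModernBERT-base (English), and Llama3.2-1B, with sentiment scores from an auxiliary model added as extra features. Decision threshold calibration on the development set addresses class imbalance.
Result: Sentiment features significantly boost performance, especially the subjective F1 score. The system ranked first for Greek with a Macro F1 of 0.51.
Insight: Sentiment information can effectively strengthen subjectivity detection, particularly in multilingual and zero-shot settings.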
Abstract: This paper presents AI Wizards’ participation in the CLEF 2025 CheckThat! Lab Task 1: Subjectivity Detection in News Articles, classifying sentences as subjective/objective in monolingual, multilingual, and zero-shot settings. Training/development datasets were provided for Arabic, German, English, Italian, and Bulgarian; final evaluation included additional unseen languages (e.g., Greek, Romanian, Polish, Ukrainian) to assess generalization. Our primary strategy enhanced transformer-based classifiers by integrating sentiment scores, derived from an auxiliary model, with sentence representations, aiming to improve upon standard fine-tuning. We explored this sentiment-augmented architecture with mDeBERTaV3-base, ModernBERT-base (English), and Llama3.2-1B. To address class imbalance, prevalent across languages, we employed decision threshold calibration optimized on the development set. Our experiments show sentiment feature integration significantly boosts performance, especially subjective F1 score. This framework led to high rankings, notably 1st for Greek (Macro F1 = 0.51).
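Decision threshold calibration can be sketched in a few lines. The abstract does not say which metric the threshold was optimized for or how the search was run, so the sketch below assumes a grid search that maximizes macro F1 on the development set; the function names and the toy probabilities are hypothetical.

```python
def macro_f1(labels, preds):
    """Unweighted mean of the per-class F1 scores for classes 0 and 1."""
    scores = []
    for cls in (0, 1):
        tp = sum(1 for y, p in zip(labels, preds) if y == cls and p == cls)
        fp = sum(1 for y, p in zip(labels, preds) if y != cls and p == cls)
        fn = sum(1 for y, p in zip(labels, preds) if y == cls and p != cls)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

def calibrate_threshold(probs, labels):
    """Pick the threshold on P(subjective) that maximizes dev-set macro F1."""
    best_threshold, best_score = 0.5, -1.0
    for step in range(1, 100):
        threshold = step / 100
        preds = [int(p >= threshold) for p in probs]
        score = macro_f1(labels, preds)
        if score > best_score:  # strict: keep the lowest best threshold
            best_threshold, best_score = threshold, score
    return best_threshold, best_score

# Toy imbalanced dev set: the minority (subjective) class gets low scores.
dev_probs = [0.05, 0.10, 0.15, 0.20, 0.22, 0.30, 0.35, 0.40]
dev_labels = [0, 0, 0, 0, 0, 0, 1, 1]

default_preds = [int(p >= 0.5) for p in dev_probs]
default_score = macro_f1(dev_labels, default_preds)
threshold, calibrated_score = calibrate_threshold(dev_probs, dev_labels)
print(threshold, calibrated_score, round(default_score, 3))
```

On this toy data the default threshold of 0.5 never predicts the minority class, while a lower calibrated threshold recovers it, which is the effect calibration is meant to have under class imbalance.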
[6] DualReward: A Dynamic Reinforcement Learning Framework for Cloze Tests Distractor Generation
Tianyou Huang,Xinglu Chen,Jingshen Zhang,Xinying Qiu,Ruiying Niu
Main category: cs.CL
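TL;DR: DualReward is a new reinforcement learning framework for generating distractors in cloze tests. It improves distractor quality by dynamically adjusting the intensity of the reward signal.
Details
Motivation: Conventional cloze distractor generation relies mostly on supervised learning or static generative models and lacks dynamic optimization of distractor diversity and quality. DualReward aims to address this through a dynamic reward mechanism in reinforcement learning.
Contribution: (1) Proposes the DualReward framework, which uses a dual reward structure with adaptive scaling. (2) Dynamically adjusts reward signal intensity to differentiate human-created distractors from model-generated candidates. (3) Validates the method on multiple datasets, with the strongest gains on cross-domain data.
Method: (1) A reinforcement learning framework. (2) A dual reward structure that treats human-created gold-standard distractors and model-generated candidates separately. (3) An adaptive reward scaling mechanism that adjusts reward intensity according to model performance and confidence.
Result: Outperforms baselines on both CLOTH-F and MCQ. Gains are modest but consistent on the homogeneous CLOTH-F data and more substantial on the cross-domain MCQ data (3.48-3.86% in P@1).
Insight: The dynamic reward mechanism helps most on diverse data, suggesting potential for handling varied question types and domains. The framework is flexible and suited to practical use.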
Abstract: This paper introduces DualReward, a novel reinforcement learning framework for automatic distractor generation in cloze tests. Unlike conventional approaches that rely primarily on supervised learning or static generative models, our method employs a dual reward structure with adaptive scaling that differentiates between human-created gold standard distractors and model-generated candidates. The framework dynamically adjusts reward signal intensity based on model performance and confidence. We evaluate our approach on both passage-level (CLOTH-F) and sentence-level (MCQ) cloze test datasets, demonstrating consistent improvements over state-of-the-art baselines. Experimental results show that our adaptive reward scaling mechanism provides modest but consistent benefits on homogeneous datasets (CLOTH-F) and more substantial improvements (3.48-3.86% in P@1) on diverse, cross-domain data (MCQ), suggesting its particular effectiveness for handling varied question types and domains. Our work offers a flexible framework that effectively balances learning from reliable human examples while exploring novel, high-quality distractors for automated test generation.
[7] A Survey of Deep Learning for Geometry Problem Solving
Jianzhe Ma,Wenxuan Wang,Qin Jin
Main category: cs.CL
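TL;DR: This paper surveys deep learning for geometry problem solving. It summarizes the tasks, reviews deep learning methods, analyzes evaluation metrics, and discusses current challenges and future directions, with the aim of providing a comprehensive reference for the field.
Details
Motivation: Geometry problem solving is a key area of mathematical reasoning, with applications in education and in assessing the abilities of artificial intelligence. With the development of deep learning, and especially the rise of multimodal large language models, studying how deep learning can solve geometry problems has become particularly important.
Contribution: The main contributions are (1) a comprehensive summary of the tasks in geometry problem solving; (2) a systematic review of the related deep learning methods; (3) a detailed analysis of evaluation metrics and methods; and (4) a discussion of current challenges and future research directions.
Method: A systematic survey of existing applications of deep learning to geometry problem solving, covering task definitions, method categories, and evaluation criteria.
Result: The paper provides a continuously updated list of papers on GitHub (https://github.com/majianz/dl4gps) as a practical reference for researchers.
Insight: The rise of multimodal large language models opens new possibilities for geometry problem solving, but combining the rigor of geometric reasoning with the flexibility of deep learning remains a focus for future research.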
Abstract: Geometry problem solving is a key area of mathematical reasoning, which is widely involved in many important fields such as education, mathematical ability assessment of artificial intelligence, and multimodal ability assessment. In recent years, the rapid development of deep learning technology, especially the rise of multimodal large language models, has triggered a widespread research boom. This paper provides a survey of the applications of deep learning in geometry problem solving, including (i) a comprehensive summary of the relevant tasks in geometry problem solving; (ii) a thorough review of related deep learning methods; (iii) a detailed analysis of evaluation metrics and methods; and (iv) a critical discussion of the current challenges and future directions that can be explored. Our goal is to provide a comprehensive and practical reference of deep learning for geometry problem solving to promote further developments in this field. We create a continuously updated list of papers on GitHub: https://github.com/majianz/dl4gps.
[8] POLYCHARTQA: Benchmarking Large Vision-Language Models with Multilingual Chart Question Answering
Yichen Xu,Liangyu Chen,Liang Zhang,Wenxuan Wang,Qin Jin
Main category: cs.CL
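TL;DR: PolyChartQA is a large-scale multilingual chart question answering benchmark covering 22,606 charts and 26,151 question-answer pairs across 10 languages, intended to advance globally inclusive vision-language models.
Details
Motivation: Existing chart understanding benchmarks are predominantly English-centric, which limits their accessibility and applicability to global audiences.
Contribution: Presents PolyChartQA, the first large-scale multilingual chart question answering benchmark, covering 10 languages, and designs a decoupled generation pipeline that ensures quality and consistency.
Method: Uses a decoupled chart generation pipeline that separates chart data from rendering code, with LLM-based translation and rigorous quality control.
Result: Experiments show a significant performance gap between English and other languages in existing vision-language models, especially for low-resource languages with non-Latin scripts.
Insight: PolyChartQA provides a tool for systematic evaluation of multilingual chart understanding and exposes the limitations of current models on non-English languages.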
Abstract: Charts are a universally adopted medium for interpreting and communicating data. However, existing chart understanding benchmarks are predominantly English-centric, limiting their accessibility and applicability to global audiences. In this paper, we present PolyChartQA, the first large-scale multilingual chart question answering benchmark covering 22,606 charts and 26,151 question-answering pairs across 10 diverse languages. PolyChartQA is built using a decoupled pipeline that separates chart data from rendering code, allowing multilingual charts to be flexibly generated by simply translating the data and reusing the code. We leverage state-of-the-art LLM-based translation and enforce rigorous quality control in the pipeline to ensure the linguistic and semantic consistency of the generated multilingual charts. PolyChartQA facilitates systematic evaluation of multilingual chart understanding. Experiments on both open- and closed-source large vision-language models reveal a significant performance gap between English and other languages, especially low-resource ones with non-Latin scripts. This benchmark lays a foundation for advancing globally inclusive vision-language models.
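The decoupling idea, translating only the chart data and reusing one piece of rendering code, can be sketched as follows. Everything here is a stand-in: the chart schema, the function names, the text-based renderer (used instead of a plotting library so the sketch has no dependencies), and the small lookup table that plays the role of the LLM-based translation step are all assumptions, not the PolyChartQA pipeline itself.

```python
def translate_chart_data(chart, translate):
    """Translate human-readable strings only; numeric values stay untouched."""
    return {
        "title": translate(chart["title"]),
        "categories": [translate(name) for name in chart["categories"]],
        "values": list(chart["values"]),
    }

def render_bar_chart(chart):
    """Language-agnostic rendering code, reused for every language."""
    lines = [chart["title"]]
    for name, value in zip(chart["categories"], chart["values"]):
        lines.append(f"{name}: {'#' * value} {value}")
    return "\n".join(lines)

# Hypothetical lookup table standing in for LLM translation plus quality checks.
TOY_SPANISH = {
    "Fruit sales": "Ventas de fruta",
    "Apples": "Manzanas",
    "Pears": "Peras",
}

def to_spanish(text):
    return TOY_SPANISH.get(text, text)

english_chart = {
    "title": "Fruit sales",
    "categories": ["Apples", "Pears"],
    "values": [3, 5],
}
spanish_chart = translate_chart_data(english_chart, to_spanish)

print(render_bar_chart(english_chart))
print(render_bar_chart(spanish_chart))
```

Because the renderer never sees language-specific logic, the two outputs differ only in their labels, and the underlying values, and therefore the answers to numeric questions, stay aligned across languages.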
[9] The benefits of query-based KGQA systems for complex and temporal questions in LLM era
Artem Alekseev,Mikhail Chaichuk,Miron Butko,Alexander Panchenko,Elena Tutubalina,Oleg Somov
Main category: cs.CL
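TL;DR: The paper argues that in the era of large language models (LLMs), query-based knowledge graph question answering (KGQA) systems still have advantages on complex and multi-hop questions. It proposes a multi-stage query generation framework that improves performance on multi-hop and temporal questions.
Details
Motivation: Large language models perform well at question answering but still struggle with multi-hop reasoning and temporal questions. Query-based KGQA offers a modular alternative that generates executable queries instead of direct answers.
Contribution: Proposes a multi-stage query generation framework that improves multi-hop and temporal question answering over WikiData, and introduces a novel entity linking and predicate matching method based on Chain-of-Thought (CoT) reasoning.
Method: A multi-stage query generation framework combined with CoT-based entity linking and predicate matching, used to generate executable queries that answer complex questions.
Result: Experiments show that the framework improves performance on multi-hop and temporal questions, demonstrating the potential of query-based KGQA with small language models.
Insight: A query-based, multi-stage KGQA framework is an effective way to handle complex and temporal questions, and is especially suited to resource-constrained small language models.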
Abstract: Large language models excel in question-answering (QA) yet still struggle with multi-hop reasoning and temporal questions. Query-based knowledge graph QA (KGQA) offers a modular alternative by generating executable queries instead of direct answers. We explore a multi-stage query-based framework for WikiData QA, proposing a multi-stage approach that enhances performance on challenging multi-hop and temporal benchmarks. Through generalization and rejection studies, we evaluate robustness across multi-hop and temporal QA datasets. Additionally, we introduce a novel entity linking and predicate matching method using CoT reasoning. Our results demonstrate the potential of a query-based multi-stage KGQA framework for improving multi-hop and temporal QA with small language models. Code and data: https://github.com/ar2max/NLDB-KGQA-System
[10] Improving Data and Parameter Efficiency of Neural Language Models Using Representation Analysis
Josip Jukić
Main category: cs.CL
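TL;DR: The thesis studies how representation analysis and new optimization techniques can improve the data and parameter efficiency of neural language models. It proposes approaches based on representation smoothness, including training-stabilization strategies that use Jacobian and Hessian matrices, and methods that combine active learning with parameter-efficient fine-tuning. Experiments show that these methods substantially outperform traditional approaches in performance and efficiency.
Details
Motivation: To address the data and parameter efficiency challenges of neural language models and improve their robustness and generalization.
Contribution: (1) Proposes regularization strategies based on representation smoothness; (2) combines active learning with parameter-efficient fine-tuning to improve efficiency; (3) uses in-context learning to make weak supervision more effective.
Method: (1) Representation smoothness analysis; (2) regularization using Jacobian and Hessian matrices; (3) the combination of active learning and parameter-efficient fine-tuning; (4) weak supervision enhanced by in-context learning.
Result: Experiments show that the proposed methods substantially outperform traditional methods in performance, stability, and efficiency, especially in low-resource settings.
Insight: Representation smoothness and in-context learning are key techniques for improving model efficiency, particularly in resource-limited or dynamic data environments.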
Abstract: This thesis addresses challenges related to data and parameter efficiency in neural language models, with a focus on representation analysis and the introduction of new optimization techniques. The first part examines the properties and dynamics of language representations within neural models, emphasizing their significance in enhancing robustness and generalization. It proposes innovative approaches based on representation smoothness, including regularization strategies that utilize Jacobian and Hessian matrices to stabilize training and mitigate sensitivity to input perturbations. The second part focuses on methods to significantly enhance data and parameter efficiency by integrating active learning strategies with parameter-efficient fine-tuning, guided by insights from representation smoothness analysis. It presents smoothness-informed early-stopping techniques designed to eliminate the need for labeled validation sets and proposes innovative combinations of active learning and parameter-efficient fine-tuning to reduce labeling efforts and computational resources. Extensive experimental evaluations across various NLP tasks demonstrate that these combined approaches substantially outperform traditional methods in terms of performance, stability, and efficiency. The third part explores weak supervision techniques enhanced by in-context learning to effectively utilize unlabeled data, further reducing dependence on extensive labeling. It shows that using in-context learning as a mechanism for weak supervision enables models to better generalize from limited labeled data by leveraging unlabeled examples more effectively during training. Comprehensive empirical evaluations confirm significant gains in model accuracy, adaptability, and robustness, especially in low-resource settings and dynamic data environments.
[11] Evaluating the Ability of Large Language Models to Reason about Cardinal Directions, Revisited
Anthony G Cohn,Robert E Blackwell
Main category: cs.CL
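TL;DR: The paper investigates the ability of 28 large language models (LLMs) to reason about cardinal directions (CDs), testing their correctness on a template-generated benchmark. It finds that even the newer Large Reasoning Models cannot reliably determine the correct CD for all questions.
Details
Motivation: To study how well large language models reason about cardinal directions, fill a gap in this area, and check the models' reliability.
Contribution: (1) Proposes a template-generated benchmark; (2) evaluates 28 LLMs on CD reasoning; (3) extends earlier work.
Method: Uses a template-generated benchmark with several degrees of variation (such as the agent's means of locomotion and the grammatical person) to test 28 LLMs extensively.
Result: Even the newer Large Reasoning Models cannot reliably determine the correct CD for all questions.
Insight: Large language models have limited ability to reason about cardinal directions and need further improvement.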
Abstract: We investigate the abilities of 28 large language models (LLMs) to reason about cardinal directions (CDs) using a benchmark generated from a set of templates, extensively testing an LLM’s ability to determine the correct CD given a particular scenario. The templates allow for a number of degrees of variation, such as the means of locomotion of the agent involved and whether the scenario is set in the first, second, or third person. Even the newer Large Reasoning Models are unable to reliably determine the correct CD for all questions. This paper summarises and extends earlier work presented at COSIT-24.
[12] Findings of MEGA: Maths Explanation with LLMs using the Socratic Method for Active Learning
Tosin Adewumi,Foteini Simistira Liwicki,Marcus Liwicki,Viktor Gardelli,Lama Alkhaled,Hamam Mokayed
Main category: cs.CL
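TL;DR: MEGA combines the Socratic method, Chain of Thought reasoning, simplified gamification, and formative feedback, driven by large language models, to improve university students' Maths learning. The results show that MEGA is preferred over the traditional step-by-step method.
Details
Motivation: Some students avoid Maths-related subjects because they struggle with Maths, and traditional teaching methods are often ineffective. The study aims to improve the Maths learning experience through the MEGA method.
Contribution: Proposes the MEGA method, which combines several teaching strategies, and shows experimentally that it is preferred over the traditional step-by-step method for Maths learning.
Method: Combines the Socratic method, Chain of Thought reasoning, simplified gamification, and formative feedback, and evaluates GPT4o and Claude 3.5 Sonnet on samples from the GSM8K and MATH datasets.
Result: Students more often rated MEGA as better for learning on both GSM8K and MATH, with a wider gap on the more difficult MATH dataset (47.5% vs 26.67%).
Insight: MEGA is particularly suited to explaining difficult Maths problems. Combining multiple strategies improves the learning experience and offers a new approach for Maths education.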
Abstract: This paper presents an intervention study on the effects of the combined methods of (1) the Socratic method, (2) Chain of Thought (CoT) reasoning, (3) simplified gamification and (4) formative feedback on university students’ Maths learning driven by large language models (LLMs). We call our approach Mathematics Explanations through Games by AI LLMs (MEGA). Some students struggle with Maths and as a result avoid Maths-related disciplines or subjects despite the importance of Maths across many fields, including signal processing. Oftentimes, students’ Maths difficulties stem from suboptimal pedagogy. We compared the MEGA method to the traditional step-by-step (CoT) method to ascertain which is better by using a within-group design after randomly assigning questions for the participants, who are university students. Samples (n=60) were randomly drawn from each of the two test sets of the Grade School Math 8K (GSM8K) and Mathematics Aptitude Test of Heuristics (MATH) datasets, based on the error margin of 11%, the confidence level of 90%, and a manageable number of samples for the student evaluators. These samples were used to evaluate two capable LLMs at length (Generative Pretrained Transformer 4o (GPT4o) and Claude 3.5 Sonnet) out of the initial six that were tested for capability. The results showed that students more often agreed that the MEGA method is better for learning on both datasets. The gap is larger still on the more difficult MATH dataset (47.5% for MEGA compared to 26.67% for CoT), indicating that MEGA is better at explaining difficult Maths problems.
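The stated sampling parameters (n=60, an 11% error margin, 90% confidence) can be sanity-checked with the standard sample-size formula for a proportion. The abstract does not say which formula the authors used, so this sketch assumes the usual normal approximation with the worst-case proportion p = 0.5 and no finite-population correction; both function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def margin_of_error(n, confidence=0.90, p=0.5):
    """Half-width of the normal-approximation interval for a proportion."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # two-sided critical value
    return z * sqrt(p * (1 - p) / n)

def required_sample_size(margin, confidence=0.90, p=0.5):
    """Smallest (real-valued) n whose margin of error equals `margin`."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return z ** 2 * p * (1 - p) / margin ** 2

print(round(margin_of_error(60), 3))          # 0.106
print(round(required_sample_size(0.11), 1))   # 55.9
```

Under these assumptions an 11% margin at 90% confidence needs about 56 samples, and 60 samples give a margin of roughly 10.6%, so the reported n=60 is consistent with the stated parameters.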
[13] Iterative Augmentation with Summarization Refinement (IASR) Evaluation for Unstructured Survey data Modeling and Analysis
Payal Bhattad,Sai Manoj Pudukotai Dinakarrao,Anju Gupta
Main category: cs.CL
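TL;DR: The paper proposes an evaluation framework for text augmentation based on large language models (LLMs), comprising Scalability Analysis and Iterative Augmentation with Summarization Refinement (IASR), aimed at balancing semantic consistency and diversity during augmentation.
Details
Motivation: Existing text augmentation techniques lack mechanisms for preserving semantics, which leads to redundancy and instability, so a structured evaluation framework is needed.
Contribution: Proposes two evaluation components, Scalability Analysis and IASR, and finds that GPT-3.5 Turbo achieves the best balance of semantic fidelity, diversity, and generation efficiency.
Method: Scalability Analysis measures semantic consistency as augmentation volume increases, and IASR evaluates semantic drift across recursive paraphrasing cycles. The approach is applied to a topic modeling task using BERTopic with GPT-enhanced few-shot labeling.
Result: In the BERTopic task with GPT-enhanced labeling, topic granularity increases by 400% and topic overlaps are completely eliminated.
Insight: A structured evaluation framework supports the use of LLM-based augmentation in practical NLP pipelines, especially for balancing semantic consistency against diversity.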
Abstract: Text data augmentation is a widely used strategy for mitigating data sparsity in natural language processing (NLP), particularly in low-resource settings where limited samples hinder effective semantic modeling. While augmentation can improve input diversity and downstream interpretability, existing techniques often lack mechanisms to ensure semantic preservation during large-scale or iterative generation, leading to redundancy and instability. This work introduces a principled evaluation framework for large language model (LLM) based text augmentation, comprising two components: (1) Scalability Analysis, which measures semantic consistency as augmentation volume increases, and (2) Iterative Augmentation with Summarization Refinement (IASR), which evaluates semantic drift across recursive paraphrasing cycles. Empirical evaluations across state-of-the-art LLMs show that GPT-3.5 Turbo achieved the best balance of semantic fidelity, diversity, and generation efficiency. Applied to a real-world topic modeling task using BERTopic with GPT-enhanced few-shot labeling, the proposed approach results in a 400% increase in topic granularity and complete elimination of topic overlaps. These findings validate the utility of the proposed frameworks for structured evaluation of LLM-based augmentation in practical NLP pipelines.
[14] Overview of the Sensemaking Task at the ELOQUENT 2025 Lab: LLMs as Teachers, Students and Evaluators
Pavel Šindelář,Ondřej Bojar
Main category: cs.CL
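TL;DR: The paper describes the Sensemaking task at the ELOQUENT 2025 lab, which evaluates generative language models in three steps: question generation, question answering, and answer evaluation. The experiments use multilingual materials, with participating systems acting as Teachers, Students, and Evaluators, and they expose the limitations of current methods.
Details
Motivation: To develop easily testable high-level criteria for evaluating generative language models, particularly their ability to understand a text and produce content grounded in it.
Contribution: Presents the framework of the Sensemaking task, with its Teacher, Student, and Evaluator roles, and reports experiments showing how LLMs perform in these roles and where they fall short.
Method: Designs the task around a three-step process inspired by classroom exams (question generation, answering, and evaluation), tests it on multilingual materials, and compares automatic evaluation with manual evaluation.
Result: Question generation proved hard to evaluate. In question answering, LLMs performed acceptably overall, but restricting their answers to the given input texts remains problematic. In evaluation, LLM-based judges wrongly rated garbled question-answer pairs and answers to mixed-up questions as acceptable.
Insight: Current LLMs show clear weaknesses in text understanding and evaluation. Assessing generated questions and keeping answers strictly grounded in the input both need further work.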
Abstract: ELOQUENT is a set of shared tasks that aims to create easily testable high-level criteria for evaluating generative language models. Sensemaking is one such shared task. In Sensemaking, we try to assess how well generative models “make sense out of a given text” in three steps inspired by exams in a classroom setting: (1) Teacher systems should prepare a set of questions, (2) Student systems should answer these questions, and (3) Evaluator systems should score these answers, all adhering rather strictly to a given set of input materials. We report on the 2025 edition of Sensemaking, where we had 7 sources of test materials (fact-checking analyses of statements, textbooks, transcribed recordings of a lecture, and educational videos) spanning English, German, Ukrainian, and Czech languages. This year, 4 teams participated, providing us with 2 Teacher submissions, 2 Student submissions, and 2 Evaluator submissions. We added baselines for Teacher and Student using commercial large language model systems. We devised a fully automatic evaluation procedure, which we compare to a minimalistic manual evaluation. We were able to make some interesting observations. For the first task, the creation of questions, better evaluation strategies will still have to be devised because it is difficult to discern the quality of the various candidate question sets. In the second task, question answering, the LLMs examined overall perform acceptably, but restricting their answers to the given input texts remains problematic. In the third task, evaluation of question answers, our adversarial tests reveal that systems using the LLM-as-a-Judge paradigm erroneously rate both garbled question-answer pairs and answers to mixed-up questions as acceptable.
[15] Toward a Behavioural Translation Style Space: Simulating the Temporal Dynamics of Affect, Behaviour, and Cognition in Human Translation Production
Michael Carl,Takanori Mizowaki,Aishvarya Ray,Masaru Yamada,Devi Sri Bandaru,Xinyue Ren
Main category: cs.CL
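TL;DR: The paper proposes a Behavioural Translation Style Space (BTSS) that describes possible behavioural translation patterns, and uses a computational translation agent to simulate the temporal dynamics of affect, automatized behaviour, and cognition during translation.
Details
Motivation: To study the higher-order cognitive and affective processes behind translation behaviour, using gaze and keystroke data to reveal the hidden mental processing structure and so better understand the dynamics of translation.
Contribution: Proposes the BTSS as a hierarchical structure with multiple embedded processing layers spanning behaviour, cognition, and affect, providing a theoretical basis for building a computational translation agent.
Method: Analyzes gaze and keystroke data to identify behavioural patterns, organizes them into the BTSS, and uses a computational agent to simulate the dynamics of the translation process.
Result: The BTSS captures the complex dynamics of translation behaviour and offers a new framework for simulating human translation.
Insight: Translation behaviour is not only a matter of physical actions but is driven by cognition and affect. The BTSS provides a systematic description of this complex process.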
Abstract: The paper introduces a Behavioural Translation Style Space (BTSS) that describes possible behavioural translation patterns. The suggested BTSS is organized as a hierarchical structure that entails various embedded processing layers. We posit that observable translation behaviour - i.e., eye and finger movements - is fundamental when executing the physical act of translation but it is caused and shaped by higher-order cognitive processes and affective translation states. We analyse records of keystrokes and gaze data as indicators of the hidden mental processing structure and organize the behavioural patterns as a multi-layered embedded BTSS. The BTSS serves as the basis for a computational translation agent to simulate the temporal dynamics of affect, automatized behaviour and cognition during human translation production.
[16] Towards few-shot isolated word reading assessment
Reuben Smit,Retief Louw,Herman Kamper
Main category: cs.CL
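TL;DR: The paper proposes a few-shot method for isolated word reading assessment based on self-supervised learning (SSL) models. It performs reasonably on adult data but degrades substantially on child speech.
Details
Motivation: To assess isolated word reading in low-resource settings without relying on automatic speech recognition (ASR), and in particular to test how well such an approach works on child speech.
Contribution: Proposes an SSL-based few-shot classification system and analyzes its limitations on child speech.
Method: Encodes input speech and reference templates with intermediate layers of SSL models, and investigates design options such as discretising SSL features and barycentre averaging of the templates.
Result: The system performs reasonably on adult data, but performance drops substantially for child speech input, even with child templates.
Insight: The paper exposes the limitations of SSL representations for processing child speech in a few-shot classification system and highlights the need for further improvement.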
Abstract: We explore an ASR-free method for isolated word reading assessment in low-resource settings. Our few-shot approach compares input child speech to a small set of adult-provided reference templates. Inputs and templates are encoded using intermediate layers from large self-supervised learned (SSL) models. Using an Afrikaans child speech benchmark, we investigate design options such as discretising SSL features and barycentre averaging of the templates. Idealised experiments show reasonable performance for adults, but a substantial drop for child speech input, even with child templates. Despite the success of employing SSL representations in low-resource speech tasks, our work highlights the limitations of SSL representations for processing child data when used in a few-shot classification system.
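The template-comparison step can be sketched as nearest-template classification. The abstract does not name the sequence distance, so the sketch assumes dynamic time warping (DTW), a common partner for barycentre averaging; the one-dimensional toy frames stand in for SSL feature vectors, and the words and function names are hypothetical.

```python
def dtw_distance(seq_a, seq_b):
    """Dynamic time warping cost between two sequences of feature vectors."""
    inf = float("inf")
    rows, cols = len(seq_a), len(seq_b)
    cost = [[inf] * (cols + 1) for _ in range(rows + 1)]
    cost[0][0] = 0.0
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            # Euclidean distance between the two frames being aligned
            frame = sum((x - y) ** 2 for x, y in zip(seq_a[i - 1], seq_b[j - 1])) ** 0.5
            cost[i][j] = frame + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[rows][cols]

def classify(utterance, templates):
    """Return the word whose closest reference template has the lowest DTW cost."""
    return min(
        templates,
        key=lambda word: min(dtw_distance(utterance, t) for t in templates[word]),
    )

# Toy 1-D "features"; real inputs would be frames from an SSL model layer.
templates = {
    "kat": [[(0.0,), (1.0,), (0.0,)]],
    "hond": [[(1.0,), (0.0,), (1.0,)]],
}
# A slower, slightly noisy rendition of the first template.
utterance = [(0.0,), (0.0,), (0.9,), (1.0,), (0.1,)]

predicted = classify(utterance, templates)
print(predicted)  # kat
```

DTW absorbs the difference in speaking rate between the utterance and the template, which is why the five-frame input still matches the three-frame reference. A reading-assessment system would additionally need a rejection threshold to flag mispronounced words, which this sketch omits.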
[17] Web-Browsing LLMs Can Access Social Media Profiles and Infer User Demographics
Meysam Alizadeh,Fabrizio Gilardi,Zeynab Samei,Mohsen Mosleh
Main category: cs.CL
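TL;DR: The study shows that LLMs with web browsing capabilities can infer users' demographic attributes from their social media usernames, with a risk of gender and political bias. It recommends restricting this capability in public-facing applications while preserving access for research.
Details
Motivation: Traditional LLMs rely on static data, whereas web-browsing LLMs can retrieve information in real time. The study explores whether such LLMs can infer user attributes from social media usernames.
Contribution: Provides the first evaluation showing that LLMs can access social media data from usernames and predict demographic attributes, and reveals how they parse content and where biases may arise.
Method: Uses a synthetic dataset of 48 X (Twitter) accounts and a survey dataset of 1,384 international participants to evaluate how well LLMs analyze social media content.
Result: LLMs predict user demographics with reasonable accuracy, but may introduce gender and political biases against accounts with minimal activity.
Insight: The capability is useful for computational social science but open to misuse, so access should be restricted in public-facing applications and preserved for research.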
Abstract: Large language models (LLMs) have traditionally relied on static training data, limiting their knowledge to fixed snapshots. Recent advancements, however, have equipped LLMs with web browsing capabilities, enabling real-time information retrieval and multi-step reasoning over live web content. While prior studies have demonstrated LLMs' ability to access and analyze websites, their capacity to directly retrieve and analyze social media data remains unexplored. Here, we evaluate whether web-browsing LLMs can infer demographic attributes of social media users given only their usernames. Using a synthetic dataset of 48 X (Twitter) accounts and a survey dataset of 1,384 international participants, we show that these models can access social media content and predict user demographics with reasonable accuracy. Analysis of the synthetic dataset further reveals how LLMs parse and interpret social media profiles, which may introduce gender and political biases against accounts with minimal activity. While this capability holds promise for computational social science in the post-API era, it also raises risks of misuse, particularly in information operations and targeted advertising, underscoring the need for safeguards. We recommend that LLM providers restrict this capability in public-facing applications, while preserving controlled access for verified research purposes.
[18] Probing for Arithmetic Errors in Language Models
Yucheng Sun,Alessandro Stolfo,Mrinmaya Sachan
Main category: cs.CL
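TL;DR: The paper studies how to detect arithmetic errors from a language model's internal activations. Simple probes can decode the correct answer, and lightweight error detectors guide selective re-prompting that improves task accuracy.
Details
Motivation: To explore whether a language model's internal activations can be used to detect arithmetic errors, providing a lightweight route to model self-correction.
Contribution: 1. Shows that simple probes decode both the model's output and the correct answer from hidden states; 2. Trains error detectors that predict model correctness; 3. Shows that the probes generalize well to a more complex task; 4. Improves task accuracy through selective re-prompting.
Method: 1. Apply probes to hidden states on 3-digit addition; 2. Train lightweight error detectors; 3. Extend the analysis to addition-only GSM8K problems; 4. Use the probes to guide selective re-prompting.
Result: The error detectors exceed 90% accuracy, the probes behave consistently on the more complex task, and selective re-prompting improves task accuracy.
Insight: Arithmetic errors can be anticipated from internal activations alone, and probes offer a viable path toward lightweight self-correction.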
Abstract: We investigate whether internal activations in language models can be used to detect arithmetic errors. Starting with a controlled setting of 3-digit addition, we show that simple probes can accurately decode both the model’s predicted output and the correct answer from hidden states, regardless of whether the model’s output is correct. Building on this, we train lightweight error detectors that predict model correctness with over 90% accuracy. We then extend our analysis to structured chain-of-thought traces on addition-only GSM8K problems and find that probes trained on simple arithmetic generalize well to this more complex setting, revealing consistent internal representations. Finally, we demonstrate that these probes can guide selective re-prompting of erroneous reasoning steps, improving task accuracy with minimal disruption to correct outputs. Our findings suggest that arithmetic errors can be anticipated from internal activations alone, and that simple probes offer a viable path toward lightweight model self-correction.
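A lightweight error detector of the kind described above can be pictured as a logistic-regression probe over hidden states. The sketch below is only an illustration under stated assumptions: the "activations" are synthetic Gaussian vectors rather than real model states, and the function names, dimensionality, and training settings are hypothetical, not taken from the paper.

```python
import numpy as np

def train_linear_probe(acts, labels, lr=0.1, steps=500):
    """Fit a logistic-regression probe: p(correct) = sigmoid(acts @ w + b)."""
    n, d = acts.shape
    w, b = np.zeros(d), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(acts @ w + b)))
        grad = p - labels                  # gradient of cross-entropy w.r.t. the logit
        w -= lr * (acts.T @ grad) / n
        b -= lr * grad.mean()
    return w, b

def probe_predict(acts, w, b):
    """Return 1 where the probe predicts a correct output, else 0."""
    return (acts @ w + b > 0).astype(int)

# Synthetic stand-in for hidden states (assumption): correctness shifts
# the activations along one fixed direction, plus unit Gaussian noise.
rng = np.random.default_rng(0)
direction = rng.normal(size=16)
labels = rng.integers(0, 2, size=400)
acts = rng.normal(size=(400, 16)) + np.outer(2 * labels - 1, direction)

w, b = train_linear_probe(acts[:300], labels[:300].astype(float))
accuracy = (probe_predict(acts[300:], w, b) == labels[300:]).mean()
```

With real models, `acts` would be hidden states collected at a chosen layer and token position, and `labels` would mark whether the model's answer was correct.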
[19] Advancing Retrieval-Augmented Generation for Structured Enterprise and Internal Data
Chandana Cheerla
Main category: cs.CL
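TL;DR: The paper proposes an improved retrieval-augmented generation (RAG) framework designed for structured enterprise data. It uses a hybrid retrieval strategy with metadata filtering and reports notable gains in retrieval precision and generation quality.
Details
Motivation: Enterprises rely on proprietary data such as HR records and tabular documents for decision-making, but existing LLMs and conventional RAG struggle with heterogeneous structured data.
Contribution: Proposes an advanced RAG framework that combines hybrid retrieval, metadata filtering, and semantic chunking to improve retrieval and generation on enterprise data.
Method: Hybrid retrieval combines dense embeddings (all-mpnet-base-v2) with BM25, supported by metadata-aware filtering with SpaCy NER and cross-encoder reranking, while preserving table structure.
Result: Precision@5 improves by 15 percent (90 versus 75) and Recall@5 by 13 percent (87 versus 74), and Faithfulness, Completeness, and Relevance scores are higher on a 5-point Likert scale.
Insight: Semantic chunking of structured data and use of metadata are key to RAG quality; future work may extend to multimodal data and agent-based retrieval.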
Abstract: Organizations increasingly rely on proprietary enterprise data, including HR records, structured reports, and tabular documents, for critical decision-making. While Large Language Models (LLMs) have strong generative capabilities, they are limited by static pretraining, short context windows, and challenges in processing heterogeneous data formats. Conventional Retrieval-Augmented Generation (RAG) frameworks address some of these gaps but often struggle with structured and semi-structured data. This work proposes an advanced RAG framework that combines hybrid retrieval strategies using dense embeddings (all-mpnet-base-v2) and BM25, enhanced by metadata-aware filtering with SpaCy NER and cross-encoder reranking. The framework applies semantic chunking to maintain textual coherence and retains tabular data structures to preserve row-column integrity. Quantized indexing optimizes retrieval efficiency, while human-in-the-loop feedback and conversation memory improve adaptability. Experiments on enterprise datasets show notable improvements: Precision@5 increased by 15 percent (90 versus 75), Recall@5 by 13 percent (87 versus 74), and Mean Reciprocal Rank by 16 percent (0.85 versus 0.69). Qualitative evaluations show higher scores in Faithfulness (4.6 versus 3.0), Completeness (4.2 versus 2.5), and Relevance (4.5 versus 3.2) on a 5-point Likert scale. These results demonstrate the framework’s effectiveness in delivering accurate, comprehensive, and contextually relevant responses for enterprise tasks. Future work includes extending to multimodal data and integrating agent-based retrieval. The source code will be released at https://github.com/CheerlaChandana/Enterprise-Chatbot
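The retrieval metrics quoted above (Precision@5, Recall@5, Mean Reciprocal Rank) have standard definitions, sketched here in plain Python. The document identifiers and the two example queries are invented for illustration and have nothing to do with the paper's enterprise datasets.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(doc in relevant for doc in ranked[:k]) / k

def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant documents that appear in the top k."""
    return sum(doc in relevant for doc in ranked[:k]) / len(relevant)

def mean_reciprocal_rank(runs):
    """Average of 1/rank of the first relevant document over all queries."""
    total = 0.0
    for ranked, relevant in runs:
        for rank, doc in enumerate(ranked, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break
    return total / len(runs)

# Hypothetical retrieval results for two queries.
ranked = ["d3", "d1", "d7", "d2", "d9"]
relevant = {"d1", "d2", "d4"}
runs = [(ranked, relevant), (["d4", "d8"], {"d4"})]
```

For the first query, two of the five retrieved documents are relevant and the first relevant one sits at rank 2, so the query contributes 1/2 to the reciprocal-rank average.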
[20] Can We Predict Alignment Before Models Finish Thinking? Towards Monitoring Misaligned Reasoning Models
Yik Siu Chan,Zheng-Xin Yong,Stephen H. Bach
Main category: cs.CL
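TL;DR: The paper studies whether chains-of-thought (CoTs) can be used to predict the alignment of a language model's final output. A linear probe on CoT activations outperforms text-based methods and can predict unsafe outputs before reasoning completes.
Details
Motivation: Open-weights reasoning language models generate long CoTs before the final response. This improves performance but adds alignment risk, since harmful content can appear in the CoT or in the final output. The authors therefore ask whether CoTs can predict the alignment of the final response.
Contribution: 1) A linear probe trained on CoT activations that significantly outperforms text-based monitoring methods; 2) evidence that the probe predicts safety before reasoning completes, across model sizes and safety benchmarks.
Method: 1) Compare humans, large language models, and text classifiers as monitors of CoT text; 2) train a linear probe on CoT activations (model latents) to predict alignment. Experiments cover multiple model sizes and safety benchmarks.
Result: 1) The linear probe is the best predictor of final-response safety and beats the text-based methods; 2) the probe reaches high accuracy on early CoT segments, supporting real-time monitoring and early intervention.
Insight: CoT text can be unfaithful and can mislead monitors, whereas model latents (activations) give a more reliable predictive signal, and lightweight probes could enable efficient real-time safety monitoring.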
Abstract: Open-weights reasoning language models generate long chains-of-thought (CoTs) before producing a final response, which improves performance but introduces additional alignment risks, with harmful content often appearing in both the CoTs and the final outputs. In this work, we investigate if we can use CoTs to predict final response misalignment. We evaluate a range of monitoring approaches, including humans, highly-capable large language models, and text classifiers, using either CoT text or activations. First, we find that a simple linear probe trained on CoT activations can significantly outperform all text-based methods in predicting whether a final response will be safe or unsafe. CoT texts are often unfaithful and can mislead humans and classifiers, while model latents (i.e., CoT activations) offer a more reliable predictive signal. Second, the probe makes accurate predictions before reasoning completes, achieving strong performance even when applied to early CoT segments. These findings generalize across model sizes, families, and safety benchmarks, suggesting that lightweight probes could enable real-time safety monitoring and early intervention during generation.
cs.CV
[21] An Memory-Efficient Framework for Deformable Transformer with Neural Architecture Search
Wendong Mao,Mingfan Zhao,Jianfeng Guan,Qiwei Dong,Zhongfeng Wang
Main category: cs.CV
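TL;DR: The paper proposes a hardware-friendly optimization framework for the Deformable Attention Transformer (DAT). Using neural architecture search (NAS) and a new slicing strategy, it automatically divides input features into uniform patches during inference, avoiding memory conflicts while preserving accuracy. FPGA validation shows a large reduction in DRAM accesses.
Details
Motivation: DAT performs well on computer vision tasks, but its data-dependent sampling mechanism produces irregular memory access patterns that are hard to deploy efficiently on hardware. Existing methods either incur high hardware overhead or sacrifice accuracy.
Contribution: 1. A NAS-based slicing strategy that automatically optimizes how input features are divided, balancing hardware cost against accuracy; 2. An FPGA-based verification system that confirms the method's efficiency on edge devices.
Method: NAS explores the optimal slice configuration within a hardware-friendly inference framework. Hardware cost and inference accuracy are optimized jointly, without modifying the model architecture.
Result: On ImageNet-1K, accuracy drops by only 0.2%; on a Xilinx FPGA, DRAM access times fall to 18% of those of existing DAT acceleration methods.
Insight: Splitting the input intelligently and co-designing with the hardware can substantially improve Transformer deployment efficiency on edge devices with almost no loss of accuracy.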
Abstract: Deformable Attention Transformers (DAT) have shown remarkable performance in computer vision tasks by adaptively focusing on informative image regions. However, their data-dependent sampling mechanism introduces irregular memory access patterns, posing significant challenges for efficient hardware deployment. Existing acceleration methods either incur high hardware overhead or compromise model accuracy. To address these issues, this paper proposes a hardware-friendly optimization framework for DAT. First, a neural architecture search (NAS)-based method with a new slicing strategy is proposed to automatically divide the input feature into uniform patches during the inference process, avoiding memory conflicts without modifying model architecture. The method explores the optimal slice configuration by jointly optimizing hardware cost and inference accuracy. Second, an FPGA-based verification system is designed to test the performance of this framework on edge-side hardware. Algorithm experiments on the ImageNet-1K dataset demonstrate that our hardware-friendly framework incurs only a 0.2% accuracy drop compared to the baseline DAT. Hardware experiments on Xilinx FPGA show the proposed method reduces DRAM access times to 18% of those of existing DAT acceleration methods.
[22] Reprogramming Vision Foundation Models for Spatio-Temporal Forecasting
Changlu Chen,Yanbin Liu,Chaoxi Niu,Ling Chen,Tianqing Zhu
Main category: cs.CV
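TL;DR: Proposes ST-VFM, a framework that reprograms Vision Foundation Models (VFMs) for spatio-temporal forecasting. It addresses the limitations of VFMs in spatio-temporal modeling and outperforms baselines on multiple datasets.
Details
Motivation: Existing foundation models such as large language models have had limited success in time-series forecasting, in particular because they struggle to model spatio-temporal correlations. VFMs have strong spatial priors but lack temporal modeling capacity, and there is a modality gap between visual and spatio-temporal data.
Contribution: 1. ST-VFM, a framework that systematically reprograms VFMs for spatio-temporal forecasting; 2. A dual-branch architecture that strengthens temporal modeling with auxiliary ST flow inputs; 3. Pre-VFM and post-VFM reprogramming stages that handle temporal embedding and modality alignment, respectively; 4. Validation on multiple datasets.
Method: 1. Dual-branch architecture combining raw ST inputs with auxiliary ST flow inputs; 2. Pre-VFM reprogramming: a Temporal-Aware Token Adapter embeds temporal context; 3. Post-VFM reprogramming: a Bilateral Cross-Prompt Coordination module lets the branches interact dynamically.
Result: On ten spatio-temporal datasets, ST-VFM outperforms state-of-the-art baselines and is robust across VFM backbones (e.g., DINO, CLIP, DEIT).
Insight: Reprogramming VFMs exploits their strong spatial priors, while auxiliary flow inputs and dynamic branch interaction compensate for their lack of temporal modeling.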
Abstract: Foundation models have achieved remarkable success in natural language processing and computer vision, demonstrating strong capabilities in modeling complex patterns. While recent efforts have explored adapting large language models (LLMs) for time-series forecasting, LLMs primarily capture one-dimensional sequential dependencies and struggle to model the richer spatio-temporal (ST) correlations essential for accurate ST forecasting. In this paper, we present ST-VFM, a novel framework that systematically reprograms Vision Foundation Models (VFMs) for general-purpose spatio-temporal forecasting. While VFMs offer powerful spatial priors, two key challenges arise when applying them to ST tasks: (1) the lack of inherent temporal modeling capacity and (2) the modality gap between visual and ST data. To address these, ST-VFM adopts a dual-branch architecture that integrates raw ST inputs with auxiliary ST flow inputs, where the flow encodes lightweight temporal difference signals interpretable as dynamic spatial cues. To effectively process these dual-branch inputs, ST-VFM introduces two dedicated reprogramming stages. The pre-VFM reprogramming stage applies a Temporal-Aware Token Adapter to embed temporal context and align both branches into VFM-compatible feature spaces. The post-VFM reprogramming stage introduces a Bilateral Cross-Prompt Coordination module, enabling dynamic interaction between branches through prompt-based conditioning, thus enriching joint representation learning without modifying the frozen VFM backbone. Extensive experiments on ten spatio-temporal datasets show that ST-VFM outperforms state-of-the-art baselines, demonstrating effectiveness and robustness across VFM backbones (e.g., DINO, CLIP, DEIT) and ablation studies, establishing it as a strong general framework for spatio-temporal forecasting.
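The abstract says only that the auxiliary flow input "encodes lightweight temporal difference signals". One simple reading, assumed here purely for illustration, is a first-order difference between consecutive frames. The function name and the zero-padded first frame are hypothetical choices, not the paper's definition.

```python
import numpy as np

def temporal_difference_flow(x):
    """First-order temporal difference of a (T, H, W) sequence.

    The first frame has no predecessor, so its flow is set to zero
    (an assumption made to keep the output the same shape as the input).
    """
    flow = np.zeros_like(x)
    flow[1:] = x[1:] - x[:-1]
    return flow

# Toy sequence of three 2x2 grids, each frame 4 larger than the previous one.
x = np.arange(12, dtype=float).reshape(3, 2, 2)
flow = temporal_difference_flow(x)
```

In a dual-branch setup, `x` and `flow` would be fed to separate branches; how ST-VFM actually builds and consumes the flow is not specified in the abstract.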
[23] Expert Operational GANS: Towards Real-Color Underwater Image Restoration
Ozer Can Devecioglu,Serkan Kiranyaz,Mehmet Yamac,Moncef Gabbouj
Main category: cs.CV
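TL;DR: The paper proposes xOp-GAN, a GAN with multiple expert generator networks, each handling images in a different quality range. The discriminator's perceptual confidence score selects the best restored image, which significantly improves underwater image restoration.
Details
Motivation: Underwater image restoration is hard because complex light propagation, scattering, and depth-dependent attenuation produce a wide range of deformation artifacts. A single generator network cannot cover every quality range, which motivates multiple expert generators.
Contribution: Proposes xOp-GAN, the first GAN model with multiple generators in which the discriminator is used during inference of a regression task to select the best restored image.
Method: xOp-GAN contains several expert generators, each trained on images of a particular quality range. At restoration time the discriminator picks the best result by perceptual confidence score.
Result: On the LSUI dataset, xOp-GAN reaches PSNR up to 25.16 dB, surpassing single-regressor models by a large margin with reduced complexity.
Insight: Combining expert generators with discriminator-based selection handles the diverse degradations of underwater images more finely and suggests a new approach to image restoration in heterogeneous domains.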
Abstract: The wide range of deformation artifacts that arise from complex light propagation, scattering, and depth-dependent attenuation makes underwater image restoration a challenging problem. Like other single deep regressor networks, conventional GAN-based restoration methods struggle to perform well across this heterogeneous domain, since a single generator network is typically insufficient to capture the full range of visual degradations. In order to overcome this limitation, we propose xOp-GAN, a novel GAN model with several expert generator networks, each trained solely on a particular subset with a certain image quality. Thus, each generator can learn to maximize its restoration performance for a particular quality range. Once an xOp-GAN is trained, each generator can restore the input image and the best restored image can then be selected by the discriminator based on its perceptual confidence score. As a result, xOp-GAN is the first GAN model with multiple generators where the discriminator is used during the inference of the regression task. Experimental results on the benchmark Large Scale Underwater Image (LSUI) dataset demonstrate that xOp-GAN achieves PSNR levels up to 25.16 dB, surpassing all single-regressor models by a large margin, even with reduced complexity.
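The inference procedure described above (every expert restores the input, then the discriminator picks the candidate it scores highest) reduces to an argmax over discriminator scores. The sketch below uses plain callables as stand-ins; the function name, the toy scalar "images", and the scoring rule are all hypothetical and not the authors' code.

```python
from typing import Callable, Sequence, Tuple, TypeVar

T = TypeVar("T")

def restore_with_experts(
    image: T,
    generators: Sequence[Callable[[T], T]],
    discriminator: Callable[[T], float],
) -> Tuple[T, int]:
    """Run every expert generator and keep the output the discriminator rates best."""
    candidates = [g(image) for g in generators]
    scores = [discriminator(c) for c in candidates]
    best = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best], best

# Toy stand-ins (assumption): an "image" is a single float and the
# discriminator prefers values close to 1.0.
generators = [lambda x: x + 0.1, lambda x: x + 0.5, lambda x: x + 0.9]
discriminator = lambda y: -abs(y - 1.0)
restored, expert_index = restore_with_experts(0.5, generators, discriminator)
```

In a real system the generators and discriminator would be trained networks operating on image tensors; the selection logic stays the same.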
[24] Data-Driven Meta-Analysis and Public-Dataset Evaluation for Sensor-Based Gait Age Estimation
Varun Velankar
Main category: cs.CV
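TL;DR: Through a data-driven meta-analysis and evaluation on public datasets, the paper systematically studies sensor-based gait age estimation and offers practical guidelines for reducing error.
Details
Motivation: Gait-based age estimation has important applications in healthcare, security, and human-computer interaction, but existing work lacks large-scale systematic evaluation and performance baselines.
Contribution: 1) A meta-analysis of 59 studies that provides performance baselines; 2) quantified correlations between gait features and age on a large-scale dataset; 3) interpretable visualizations (Grad-CAM) showing where the model attends; 4) a comparison of machine learning models in which deep networks perform best.
Method: 1) Meta-analyze existing studies; 2) analyze the relationship between gait features and age on the OU-ISIR dataset; 3) fine-tune a ResNet34 model and apply Grad-CAM for interpretability; 4) compare several machine learning models on the VersatileGait dataset.
Result: Multi-sensor fusion models have the lowest error (3.4 years). Deep networks reach up to 96% accuracy on VersatileGait while processing each sample in under 0.1 seconds.
Insight: The knee and pelvic regions are key to gait age estimation, and the paper gives practical guidance for reducing error below three years in real-world scenarios.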
Abstract: Estimating a person’s age from their gait has important applications in healthcare, security and human-computer interaction. In this work, we review fifty-nine studies involving over seventy-five thousand subjects recorded with video, wearable and radar sensors. We observe that convolutional neural networks produce an average error of about 4.2 years, inertial-sensor models about 4.5 years and multi-sensor fusion as low as 3.4 years, with notable differences between lab and real-world data. We then analyse sixty-three thousand eight hundred forty-six gait cycles from the OU-ISIR Large-Population dataset to quantify correlations between age and five key metrics: stride length, walking speed, step cadence, step-time variability and joint-angle entropy, with correlation coefficients of at least 0.27. Next, we fine-tune a ResNet34 model and apply Grad-CAM to reveal that the network attends to the knee and pelvic regions, consistent with known age-related gait changes. Finally, on a one hundred thousand sample subset of the VersatileGait database, we compare support vector machines, decision trees, random forests, multilayer perceptrons and convolutional neural networks, finding that deep networks achieve up to 96 percent accuracy while processing each sample in under 0.1 seconds. By combining a broad meta-analysis with new large-scale experiments and interpretable visualizations, we establish solid performance baselines and practical guidelines for reducing gait-age error below three years in real-world scenarios.
[25] What cat is that? A re-id model for feral cats
Victor Caquilpan
Main category: cs.CV
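TL;DR: The paper explores re-identification (re-ID) of feral cats with PPGNet-Cat, a modified PPGNet model, to help monitor their ecological impact, and reports strong performance.
Details
Motivation: Feral cats pose a serious threat to Australian wildlife, so efficient monitoring is needed; re-ID applied to camera-trap images can support it.
Contribution: Proposes PPGNet-Cat, which adapts PPGNet (originally used for Amur tiger re-ID) to feral cats, and explores contrastive learning approaches such as ArcFace loss.
Method: Modifies the PPGNet architecture to suit the characteristics of feral cat images and experiments with contrastive learning (e.g., ArcFace loss) for re-ID.
Result: PPGNet-Cat achieves a mean Average Precision (mAP) of 0.86 and a rank-1 accuracy of 0.95, making it competitive for feral cat re-ID.
Insight: With suitable modifications and contrastive learning, an existing re-ID model can be transferred to a new species such as feral cats.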
Abstract: Feral cats exert a substantial and detrimental impact on Australian wildlife, placing them among the most dangerous invasive species worldwide. Therefore, closely monitoring these cats is essential to minimising their effects. In this context, Re-Identification (re-ID) can enhance monitoring of these animals using images captured by camera traps. This project explores different CV approaches to create a re-ID model able to identify individual feral cats in the wild. The main approach consists of modifying a part-pose guided network (PPGNet) model, initially used in the re-ID of Amur tigers, to be applicable to feral cats. This adaptation, PPGNet-Cat, incorporates specific modifications to suit the characteristics of feral cat images. Additionally, various experiments were conducted, particularly exploring contrastive learning approaches such as ArcFace loss. The main results indicate that PPGNet-Cat excels in identifying feral cats, achieving high performance with a mean Average Precision (mAP) of 0.86 and a rank-1 accuracy of 0.95. These outcomes establish PPGNet-Cat as a competitive model within the realm of re-ID.
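The two reported metrics, rank-1 accuracy and mean Average Precision, can be computed from a query-to-gallery distance matrix. The sketch below follows the common textbook definitions and makes simplifying assumptions: no camera-based filtering of gallery images, and a tiny invented distance matrix. The abstract does not describe the paper's own evaluation protocol.

```python
import numpy as np

def reid_metrics(dist, query_ids, gallery_ids):
    """Return (rank-1 accuracy, mAP) for a (num_query, num_gallery) distance matrix."""
    rank1_hits, average_precisions = [], []
    for q in range(dist.shape[0]):
        order = np.argsort(dist[q])                       # closest gallery items first
        matches = (gallery_ids[order] == query_ids[q]).astype(float)
        if matches.sum() == 0:
            continue                                      # identity absent from gallery
        rank1_hits.append(matches[0])
        precision = np.cumsum(matches) / (np.arange(len(matches)) + 1)
        average_precisions.append((precision * matches).sum() / matches.sum())
    return float(np.mean(rank1_hits)), float(np.mean(average_precisions))

# Toy example: two queries, three gallery images (identities 0, 1, 0).
query_ids = np.array([0, 1])
gallery_ids = np.array([0, 1, 0])
dist = np.array([[0.1, 0.5, 0.9],
                 [0.2, 0.6, 0.4]])
rank1, mean_ap = reid_metrics(dist, query_ids, gallery_ids)
```

Here the first query finds its identity at rank 1, while the second finds it only at rank 3, so rank-1 accuracy is 0.5.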
[26] SketchDNN: Joint Continuous-Discrete Diffusion for CAD Sketch Generation
Sathvik Chereddy,John Femiani
Main category: cs.CV
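TL;DR: SketchDNN is a generative model that synthesizes CAD sketches with a joint continuous-discrete diffusion process. Its core innovation, Gaussian-Softmax diffusion, significantly improves generation quality.
Details
Motivation: CAD sketch generation is challenging because continuous parameters and discrete class labels are heterogeneous and primitives are permutation-invariant, which calls for a unified modeling approach.
Contribution: Proposes Gaussian-Softmax diffusion, which handles continuous and discrete variables in one process and lowers both FID and NLL.
Method: Logits are perturbed with Gaussian noise and projected onto the probability simplex via a softmax transformation, which allows blended class labels for discrete variables.
Result: On the SketchGraphs dataset, FID falls from 16.04 to 7.80 and NLL from 84.8 to 81.33, a new state of the art.
Insight: A joint continuous-discrete diffusion process addresses both the heterogeneity of primitive parameterizations and permutation invariance in CAD sketches.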
Abstract: We present SketchDNN, a generative model for synthesizing CAD sketches that jointly models both continuous parameters and discrete class labels through a unified continuous-discrete diffusion process. Our core innovation is Gaussian-Softmax diffusion, where logits perturbed with Gaussian noise are projected onto the probability simplex via a softmax transformation, facilitating blended class labels for discrete variables. This formulation addresses two key challenges, namely, the heterogeneity of primitive parameterizations and the permutation invariance of primitives in CAD sketches. Our approach significantly improves generation quality, reducing Fréchet Inception Distance (FID) from 16.04 to 7.80 and negative log-likelihood (NLL) from 84.8 to 81.33, establishing a new state-of-the-art in CAD sketch generation on the SketchGraphs dataset.
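The core operation described above, adding Gaussian noise to logits and mapping the result onto the probability simplex with a softmax, can be sketched in a few lines. This shows a single noising step only. The noise scale, the way logits are derived from class labels, and the function name are assumptions, since the abstract gives no schedule or parameterization.

```python
import numpy as np

def gaussian_softmax_perturb(logits, sigma, rng):
    """Add Gaussian noise to logits, then project onto the simplex via softmax."""
    noisy = logits + sigma * rng.normal(size=logits.shape)
    noisy = noisy - noisy.max(axis=-1, keepdims=True)    # for numerical stability
    probs = np.exp(noisy)
    return probs / probs.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
# Hypothetical logits for two primitives over three classes.
logits = np.array([[4.0, 0.0, 0.0],
                   [0.0, 4.0, 0.0]])
blended = gaussian_softmax_perturb(logits, sigma=1.0, rng=rng)   # soft class labels
clean = gaussian_softmax_perturb(logits, sigma=0.0, rng=rng)     # no noise applied
```

Each row of `blended` is a valid probability vector, which is what allows class labels to be "blended" rather than hard one-hot values during diffusion.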
[27] Interpretable Prediction of Lymph Node Metastasis in Rectal Cancer MRI Using Variational Autoencoders
Benjamin Keel,Aaron Quyn,David Jayne,Maryam Mohsin,Samuel D. Relton
Main category: cs.CV
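TL;DR: The paper uses a Variational Autoencoder (VAE) as the feature encoder, in place of a conventional pre-trained CNN, to improve the accuracy and interpretability of lymph node metastasis prediction from rectal cancer MRI. The model reaches an AUC of 0.86 on an in-house dataset.
Details
Motivation: Existing radiological criteria based on lymph node morphology have limited diagnostic accuracy, and pre-trained CNNs lack interpretability. A VAE encodes visual features directly through image reconstruction, producing a structured latent space that is easier to interpret.
Contribution: Proposes the VAE-MLP model as a replacement for CNN-based prediction of lymph node metastasis; it achieves state-of-the-art performance on the MRI dataset with a more interpretable latent space.
Method: A VAE serves as the feature encoder and a multilayer perceptron (MLP) makes the prediction. The model is trained on MRI data from 168 patients who did not undergo neo-adjuvant treatment, with the post-operative pathological N stage as ground truth.
Result: Cross-validated metrics are AUC 0.86, sensitivity 0.79, and specificity 0.85, which the authors report as state of the art on this dataset.
Insight: The VAE latent space can be more interpretable than a CNN's, which helps expose key features in medical images and supports transparent clinical decision-making.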
Abstract: Effective treatment for rectal cancer relies on accurate lymph node metastasis (LNM) staging. However, radiological criteria based on lymph node (LN) size, shape and texture morphology have limited diagnostic accuracy. In this work, we investigate applying a Variational Autoencoder (VAE) as a feature encoder model to replace the large pre-trained Convolutional Neural Network (CNN) used in existing approaches. The motivation for using a VAE is that the generative model aims to reconstruct the images, so it directly encodes visual features and meaningful patterns across the data. This leads to a disentangled and structured latent space which can be more interpretable than a CNN. Models are deployed on an in-house MRI dataset with 168 patients who did not undergo neo-adjuvant treatment. The post-operative pathological N stage was used as the ground truth to evaluate model predictions. Our proposed model ‘VAE-MLP’ achieved state-of-the-art performance on the MRI dataset, with cross-validated metrics of AUC 0.86 +/- 0.05, Sensitivity 0.79 +/- 0.06, and Specificity 0.85 +/- 0.05. Code is available at: https://github.com/benkeel/Lymph_Node_Classification_MIUA.
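For readers unfamiliar with how a VAE produces the latent features that a downstream MLP would consume, the two standard ingredients are the reparameterization trick and the KL term that regularizes the latent space. The sketch below shows these generic pieces only. The latent size and the encoder outputs are invented, and the paper's actual architecture is not described in the abstract.

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I)."""
    eps = rng.normal(size=mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var, axis=-1)

rng = np.random.default_rng(0)
# Hypothetical encoder outputs for two inputs with a 2-dimensional latent space.
mu = np.array([[0.0, 0.0], [1.0, 0.0]])
log_var = np.zeros((2, 2))
z = reparameterize(mu, log_var, rng)
kl = kl_to_standard_normal(mu, log_var)
```

In a VAE-plus-classifier pipeline, a vector such as `mu` (or a sample `z`) would be passed to the classifier head as the feature representation.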
[28] Posture-Driven Action Intent Inference for Playing style and Fatigue Assessment
Abhishek Jaiswal,Nisheeth Srivastava
Main category: cs.CV
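TL;DR: The paper proposes a posture-based method for inferring action intent, aimed at assessing athletes' playing style and fatigue, and validates it with experiments in cricket.
Details
Motivation: Posture-based inference of mental state has potential for diagnosing fatigue, preventing injury, and enhancing performance, but the sensitivity of human subject data makes such tools hard to validate. Sports settings offer a viable alternative source of data.
Contribution: 1. A posture-based method for recognizing action intent; 2. Validation in cricket (F1 score above 75%, AUC-ROC above 80%); 3. Use of weak supervision to confirm that posture carries a strong signal for intent inference.
Method: Motion analysis identifies action intent from activity videos; a model trained on cricket data discriminates aggressive from defensive shot intent.
Result: The method achieves over 75% F1 and over 80% AUC-ROC on intent classification, showing that posture carries a strong signal.
Insight: Posture leaks intent information even when the data are noisy. Weak supervision offers a potential way around labelling limitations and may generalize to sports analytics and other areas of behavior analysis.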
Abstract: Posture-based mental state inference has significant potential in diagnosing fatigue, preventing injury, and enhancing performance across various domains. Such tools must be research-validated with large datasets before being translated into practice. Unfortunately, such vision diagnosis faces serious challenges due to the sensitivity of human subject data. To address this, we identify sports settings as a viable alternative for accumulating data from human subjects experiencing diverse emotional states. We test our hypothesis in the game of cricket and present a posture-based solution to identify human intent from activity videos. Our method achieves over 75% F1 score and over 80% AUC-ROC in discriminating aggressive and defensive shot intent through motion analysis. These findings indicate that posture leaks out strong signals for intent inference, even with inherent noise in the data pipeline. Furthermore, we utilize existing data statistics as weak supervision to validate our findings, offering a potential solution for overcoming data labelling limitations. This research contributes to generalizable techniques for sports analytics and also opens possibilities for applying human behavior analysis across various fields.
[29] VISTA: Monocular Segmentation-Based Mapping for Appearance and View-Invariant Global Localization
Hannah Shafferman,Annika Thomas,Jouko Kinnari,Michael Ricard,Jose Nino,Jonathan How
Main category: cs.CV
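TL;DR: VISTA is a monocular global localization framework based on segmentation and tracking. It localizes consistently across viewpoint and seasonal changes without domain-specific training, outperforms baselines, and keeps a very small memory footprint.
Details
Motivation: Global localization is critical for autonomous navigation, but traditional methods perform poorly in unstructured environments because of viewpoint variation, seasonal change, and similar effects. VISTA targets these challenges.
Contribution: Proposes the VISTA framework, which 1) uses a front-end segmentation and tracking pipeline; 2) matches submaps by exploiting geometric consistency; 3) localizes across viewpoints and seasons without domain-specific training.
Method: 1) An object-based segmentation and tracking front end; 2) a submap correspondence search based on geometric consistency; 3) a compact object-based map representation.
Result: On seasonal and oblique-angle aerial datasets, VISTA improves recall by up to 69% over baseline methods, and its map is only 0.6% the size of the most memory-conservative baseline.
Insight: Segmentation combined with geometric consistency copes with viewpoint and appearance change, and the compact object map makes real-time use on resource-constrained platforms feasible.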
Abstract: Global localization is critical for autonomous navigation, particularly in scenarios where an agent must localize within a map generated in a different session or by another agent, as agents often have no prior knowledge about the correlation between reference frames. However, this task remains challenging in unstructured environments due to appearance changes induced by viewpoint variation, seasonal changes, spatial aliasing, and occlusions – known failure modes for traditional place recognition methods. To address these challenges, we propose VISTA (View-Invariant Segmentation-Based Tracking for Frame Alignment), a novel open-set, monocular global localization framework that combines: 1) a front-end, object-based, segmentation and tracking pipeline, followed by 2) a submap correspondence search, which exploits geometric consistencies between environment maps to align vehicle reference frames. VISTA enables consistent localization across diverse camera viewpoints and seasonal changes, without requiring any domain-specific training or finetuning. We evaluate VISTA on seasonal and oblique-angle aerial datasets, achieving up to a 69% improvement in recall over baseline methods. Furthermore, we maintain a compact object-based map that is only 0.6% the size of the most memory-conservative baseline, making our approach capable of real-time implementation on resource-constrained platforms.
[30] Seeing the Signs: A Survey of Edge-Deployable OCR Models for Billboard Visibility Analysis
Maciej Szankin,Vidhyananth Venkatasamy,Lihang Ying
Main category: cs.CV
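TL;DR: The paper systematically benchmarks multimodal vision-language models (VLMs) against a lightweight CNN-based OCR model on billboard text recognition. VLMs are better at holistic scene understanding, but the CNN model is more efficient on cropped text.
Details
Motivation: Verifying the visibility of text in outdoor advertising remains challenging, and traditional OCR struggles with complex outdoor scenes. Multimodal VLMs may offer a better end-to-end solution.
Contribution: (1) A comprehensive evaluation of representative VLMs and a lightweight CNN-based OCR model on billboard text recognition; (2) public release of a weather-augmented benchmark and evaluation code.
Method: A comparative experiment evaluates VLMs including Qwen 2.5 VL 3B, InternVL3, and SmolVLM2 against PaddleOCRv4 on the ICDAR 2015 and SVT datasets, with synthetic weather distortions simulating real-world conditions.
Result: VLMs do better at holistic scene understanding, but the lightweight CNN pipeline remains competitive on cropped text at a much lower computational cost.
Insight: Lightweight CNN models are a practical option for edge deployment, while multimodal VLMs show promise for scene understanding.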
Abstract: Outdoor advertisements remain a critical medium for modern marketing, yet accurately verifying billboard text visibility under real-world conditions is still challenging. Traditional Optical Character Recognition (OCR) pipelines excel at cropped text recognition but often struggle with complex outdoor scenes, varying fonts, and weather-induced visual noise. Recently, multimodal Vision-Language Models (VLMs) have emerged as promising alternatives, offering end-to-end scene understanding with no explicit detection step. This work systematically benchmarks representative VLMs (including Qwen 2.5 VL 3B, InternVL3, and SmolVLM2) against a compact CNN-based OCR baseline (PaddleOCRv4) across two public datasets (ICDAR 2015 and SVT), augmented with synthetic weather distortions to simulate realistic degradation. Our results reveal that while selected VLMs excel at holistic scene reasoning, lightweight CNN pipelines still achieve competitive accuracy for cropped text at a fraction of the computational cost, an important consideration for edge deployment. To foster future research, we release our weather-augmented benchmark and evaluation code publicly.
[31] Beyond Task-Specific Reasoning: A Unified Conditional Generative Framework for Abstract Visual Reasoning
Fan Shi,Bin Li,Xiangyang Xue
Main category: cs.CV
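TL;DR: The paper proposes UCGS, a unified generative framework that solves multiple abstract visual reasoning tasks through multi-task training and shows zero-shot reasoning ability.
Details
Motivation: Existing abstract visual reasoning (AVR) methods are usually designed for specific tasks and generalize poorly to new ones, which raises computation and design costs. The paper aims for a unified framework that avoids task-specific retraining and architecture changes.
Contribution: Proposes the Unified Conditional Generative Solver (UCGS), reformulates several AVR tasks as estimating the predictability of target images, and shows that one generative model can solve multiple tasks.
Method: AVR tasks are recast as an image predictability problem, and a single conditional generative model is trained to handle them jointly, with support for zero-shot reasoning.
Result: After a single round of multi-task training, UCGS demonstrates abstract reasoning across various AVR tasks and performs zero-shot reasoning on unseen tasks at test time.
Insight: A generative framework can unify multiple AVR tasks without task-specific design, and its zero-shot ability points to a new route to generalization.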
Abstract: Abstract visual reasoning (AVR) enables humans to quickly discover and generalize abstract rules to new scenarios. Designing intelligent systems with human-like AVR abilities has been a long-standing topic in the artificial intelligence community. Deep AVR solvers have recently achieved remarkable success in various AVR tasks. However, they usually use task-specific designs or parameters in different tasks. In such a paradigm, solving new tasks often means retraining the model, and sometimes retuning the model architectures, which increases the cost of solving AVR problems. In contrast to task-specific approaches, this paper proposes a novel Unified Conditional Generative Solver (UCGS), aiming to address multiple AVR tasks in a unified framework. First, we prove that some well-known AVR tasks can be reformulated as the problem of estimating the predictability of target images in problem panels. Then, we illustrate that, under the proposed framework, training one conditional generative model can solve various AVR tasks. The experiments show that with a single round of multi-task training, UCGS demonstrates abstract reasoning ability across various AVR tasks. Especially, UCGS exhibits the ability of zero-shot reasoning, enabling it to perform abstract reasoning on problems from unseen AVR tasks in the testing phase.
[32] CorrMoE: Mixture of Experts with De-stylization Learning for Cross-Scene and Cross-Domain Correspondence Pruning
Peiwen Xia,Tangfei Liao,Wei Zhu,Danhuai Zhao,Jianjun Ke,Kaihao Zhang,Tong Lu,Tao Wang
Main category: cs.CV
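TL;DR: CorrMoE is a new correspondence pruning framework that improves robustness on cross-scene and cross-domain tasks through de-stylization and an adaptive mixture of experts.
Details
Motivation: Existing methods perform poorly on cross-domain and cross-scene correspondence pruning, mainly because they overlook domain shift and scene diversity.
Contribution: 1) A De-stylization Dual Branch that mitigates domain shift; 2) a Bi-Fusion Mixture of Experts module that adaptively integrates multi-perspective features through dynamic routing.
Method: 1) The de-stylization branch performs style mixing on implicit and explicit graph features; 2) the Bi-Fusion Mixture of Experts module fuses features using linear-complexity attention and dynamic expert routing.
Result: On multiple benchmark datasets, CorrMoE outperforms existing methods in accuracy and generalization.
Insight: De-stylization and dynamic expert fusion improve cross-domain and cross-scene performance and offer a new direction for correspondence pruning.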
Abstract: Establishing reliable correspondences between image pairs is a fundamental task in computer vision, underpinning applications such as 3D reconstruction and visual localization. Although recent methods have made progress in pruning outliers from dense correspondence sets, they often hypothesize consistent visual domains and overlook the challenges posed by diverse scene structures. In this paper, we propose CorrMoE, a novel correspondence pruning framework that enhances robustness under cross-domain and cross-scene variations. To address domain shift, we introduce a De-stylization Dual Branch, performing style mixing on both implicit and explicit graph features to mitigate the adverse influence of domain-specific representations. For scene diversity, we design a Bi-Fusion Mixture of Experts module that adaptively integrates multi-perspective features through linear-complexity attention and dynamic expert routing. Extensive experiments on benchmark datasets demonstrate that CorrMoE achieves superior accuracy and generalization compared to state-of-the-art methods. The code and pre-trained models are available at https://github.com/peiwenxia/CorrMoE.
[33] ProtoConNet: Prototypical Augmentation and Alignment for Open-Set Few-Shot Image Classification
Kexuan Shi,Zhuang Qi,Jingjing Zhu,Lei Meng,Yaochen Zhang,Haibei Huang,Xiangxu Meng
Main category: cs.CV
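TL;DR: ProtoConNet is a prototypical augmentation and alignment method that improves open-set few-shot image classification by integrating contextual information. It has three core modules: CDS, CSR, and PA.
Details
Motivation: Existing few-shot image classification methods mostly rely on the visual information in a single image and overlook contextual information, which limits generalization. ProtoConNet addresses this by incorporating background information.
Contribution: 1. A CDS module that mines diverse data patterns; 2. A CSR module that builds a context dictionary to improve robustness; 3. A PA module that narrows the gap between image representations and prototypes.
Method: 1. The CDS module selects data by clustering; 2. The CSR module integrates contextual information; 3. The PA module optimizes prototype alignment.
Result: Experiments on two datasets show that ProtoConNet outperforms existing methods in representation learning and open-set sample recognition.
Insight: Contextual information matters for few-shot classification, and prototype alignment helps separate known from unknown classes.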
Abstract: Open-set few-shot image classification aims to train models using a small amount of labeled data, enabling them to achieve good generalization when confronted with unknown environments. Existing methods mainly use visual information from a single image to learn class representations to distinguish known from unknown categories. However, these methods often overlook the benefits of integrating rich contextual information. To address this issue, this paper proposes a prototypical augmentation and alignment method, termed ProtoConNet, which incorporates background information from different samples to enhance the diversity of the feature space, breaking the spurious associations between context and image subjects in few-shot scenarios. Specifically, it consists of three main modules: the clustering-based data selection (CDS) module mines diverse data patterns while preserving core features; the contextual-enhanced semantic refinement (CSR) module builds a context dictionary to integrate into image representations, which boosts the model’s robustness in various scenarios; and the prototypical alignment (PA) module reduces the gap between image representations and class prototypes, amplifying feature distances for known and unknown classes. Experimental results from two datasets verified that ProtoConNet enhances the effectiveness of representation learning in few-shot scenarios and identifies open-set samples, making it superior to existing methods.
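To make the notions of class prototypes and open-set rejection concrete, the sketch below shows a generic baseline: each class prototype is the mean of its few support features, and a query is rejected as unknown when it lies too far from every prototype. This is not ProtoConNet's CDS, CSR, or PA module. The distance threshold, the label `-1` for unknown, and the toy features are assumptions for illustration.

```python
import numpy as np

def class_prototypes(features, labels):
    """Mean feature vector per class; returns (class ids, prototype matrix)."""
    classes = np.unique(labels)
    protos = np.stack([features[labels == c].mean(axis=0) for c in classes])
    return classes, protos

def open_set_predict(query, classes, protos, threshold):
    """Nearest-prototype label, or -1 if the nearest prototype is beyond threshold."""
    distances = np.linalg.norm(protos - query, axis=1)
    nearest = int(np.argmin(distances))
    return int(classes[nearest]) if distances[nearest] <= threshold else -1

# Toy two-shot support set with two known classes.
features = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [10.0, 12.0]])
labels = np.array([0, 0, 1, 1])
classes, protos = class_prototypes(features, labels)

known = open_set_predict(np.array([0.0, 1.5]), classes, protos, threshold=2.0)
unknown = open_set_predict(np.array([5.0, 5.0]), classes, protos, threshold=2.0)
```

Pulling image representations closer to their class prototypes, as the PA module is described as doing, would make a distance-based decision of this kind more reliable.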
[34] From Coarse to Nuanced: Cross-Modal Alignment of Fine-Grained Linguistic Cues and Visual Salient Regions for Dynamic Emotion Recognition
Yu Liu,Leyuan Qu,Hanlei Shi,Di Gao,Yuhua Zheng,Taihao Li
Main category: cs.CV
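TL;DR: The paper proposes GRACE, which combines dynamic motion modeling, semantic text refinement, and cross-modal alignment, using a coarse-to-fine text enhancement module and a motion-difference weighting mechanism. It significantly improves dynamic emotion recognition and reaches SOTA on several benchmark datasets.
Details
Motivation: Existing methods underuse the subtle emotional cues in text and lack an effective mechanism for filtering out facial dynamics unrelated to emotion, which limits recognition performance.
Contribution: Proposes the GRACE framework, which combines Coarse-to-fine Affective Text Enhancement (CATE), motion-difference weighting, and cross-modal alignment via entropy-regularized optimal transport to localize emotionally salient features precisely.
Method: 1. The CATE module generates fine-grained emotional text descriptions; 2. Motion-difference weighting highlights expression-relevant facial motion; 3. Entropy-regularized optimal transport aligns the two modalities.
Result: Strong results on three benchmark datasets, especially with ambiguous or imbalanced emotion classes, with SOTA UAR and WAR.
Insight: Combining text refinement with dynamic motion modeling captures subtle emotional cues and makes cross-modal emotion recognition more robust.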
Abstract: Dynamic Facial Expression Recognition (DFER) aims to identify human emotions from temporally evolving facial movements and plays a critical role in affective computing. While recent vision-language approaches have introduced semantic textual descriptions to guide expression recognition, existing methods still face two key limitations: they often underutilize the subtle emotional cues embedded in generated text, and they have yet to incorporate sufficiently effective mechanisms for filtering out facial dynamics that are irrelevant to emotional expression. To address these gaps, we propose GRACE (Granular Representation Alignment for Cross-modal Emotion recognition), which integrates dynamic motion modeling, semantic text refinement, and token-level cross-modal alignment to facilitate the precise localization of emotionally salient spatiotemporal features. Our method constructs emotion-aware textual descriptions via a Coarse-to-fine Affective Text Enhancement (CATE) module and highlights expression-relevant facial motion through a motion-difference weighting mechanism. These refined semantic and visual signals are aligned at the token level using entropy-regularized optimal transport. Experiments on three benchmark datasets demonstrate that our method significantly improves recognition performance, particularly in challenging settings with ambiguous or imbalanced emotion classes, establishing new state-of-the-art (SOTA) results in terms of both UAR and WAR.
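Entropy-regularized optimal transport, the tool named above for token-level alignment, is commonly solved with Sinkhorn iterations. The sketch below is a generic Sinkhorn solver under stated assumptions: the cost matrix between two text tokens and three visual tokens is invented, the marginals are uniform, and the regularization strength `eps` is arbitrary. None of these values come from the paper.

```python
import numpy as np

def sinkhorn(cost, a, b, eps=0.5, iters=200):
    """Entropy-regularized OT plan between histograms a and b for a given cost matrix."""
    kernel = np.exp(-cost / eps)
    u = np.ones_like(a)
    for _ in range(iters):
        v = b / (kernel.T @ u)      # rescale to match column marginals
        u = a / (kernel @ v)        # rescale to match row marginals
    return u[:, None] * kernel * v[None, :]

# Hypothetical costs (e.g. 1 - cosine similarity) between 2 text and 3 visual tokens.
cost = np.array([[0.1, 0.9, 0.8],
                 [0.9, 0.2, 0.3]])
a = np.full(2, 1 / 2)               # uniform mass over text tokens
b = np.full(3, 1 / 3)               # uniform mass over visual tokens
plan = sinkhorn(cost, a, b)
```

Each entry `plan[i, j]` is the amount of mass moved from text token `i` to visual token `j`; low-cost pairs receive more mass, giving a soft alignment between the two token sets.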
[35] Spatial Frequency Modulation for Semantic Segmentation
Linwei Chen,Ying Fu,Lin Gu,Dezhi Zheng,Jifeng Dai
Main category: cs.CV
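TL;DR: The paper proposes Spatial Frequency Modulation (SFM) to address the aliasing of high-frequency information caused by downsampling in semantic segmentation. Using adaptive resampling (ARS) and multi-scale adaptive upsampling (MSAU), SFM preserves high-frequency detail and shows broad applicability across multiple tasks.
Details
Motivation: In semantic segmentation, high-frequency information such as texture detail is critical to accuracy, but downsampling layers such as strided convolution can alias or distort it. The paper aims to solve this problem.
Contribution: Proposes SFM, comprising adaptive resampling (ARS) and multi-scale adaptive upsampling (MSAU), which modulates high-frequency information to avoid aliasing, and validates it across multiple tasks.
Method: 1. ARS modulates high-frequency features to a lower frequency; 2. MSAU performs non-uniform upsampling to recover the high-frequency information; 3. the modular design is compatible with multiple architectures (e.g., CNNs and Transformers).
Result: SFM alleviates aliasing, retains high-frequency detail, and performs well on semantic segmentation, image classification, adversarial robustness, and other tasks.
Insight: The frequency modulation mechanism highlights how much high-frequency information matters under downsampling and offers an extensible solution applicable to a range of vision tasks.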
Abstract: High spatial frequency information, including fine details like textures, significantly contributes to the accuracy of semantic segmentation. However, according to the Nyquist-Shannon Sampling Theorem, high-frequency components are vulnerable to aliasing or distortion when propagating through downsampling layers such as strided-convolution. Here, we propose a novel Spatial Frequency Modulation (SFM) that modulates high-frequency features to a lower frequency before downsampling and then demodulates them back during upsampling. Specifically, we implement modulation through adaptive resampling (ARS) and design a lightweight add-on that can densely sample the high-frequency areas to scale up the signal, thereby lowering its frequency in accordance with the Frequency Scaling Property. We also propose Multi-Scale Adaptive Upsampling (MSAU) to demodulate the modulated feature and recover high-frequency information through non-uniform upsampling. This module further improves segmentation by explicitly exploiting information interaction between densely and sparsely resampled areas at multiple scales. Both modules can seamlessly integrate with various architectures, extending from convolutional neural networks to transformers. Feature visualization and analysis confirm that our method effectively alleviates aliasing while successfully retaining details after demodulation. Finally, we validate the broad applicability and effectiveness of SFM by extending it to image classification, adversarial robustness, instance segmentation, and panoptic segmentation tasks. The code is available at https://github.com/Linwei-Chen/SFM.
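Example (illustrative sketch, not the authors' code): the two signal-processing facts the abstract relies on can be checked numerically with a 1D sine wave. A tone above the Nyquist limit aliases to a lower frequency when sampled, and stretching the signal in space lowers its frequency so that it fits under the limit. The frequencies, sampling rate, and helper names below are assumptions chosen for the demonstration; the ARS module itself is not reproduced.

```python
import math

def sample_sine(freq_hz, rate_hz, n_samples):
    """Sample sin(2*pi*f*t) at t = n / rate_hz."""
    return [math.sin(2 * math.pi * freq_hz * n / rate_hz) for n in range(n_samples)]

def is_alias_free(freq_hz, rate_hz):
    """Nyquist criterion: the frequency must lie below half the sampling rate."""
    return freq_hz < rate_hz / 2

def stretched_frequency(freq_hz, stretch):
    """Frequency scaling property: x(t / stretch) has frequency f / stretch."""
    return freq_hz / stretch

rate_hz = 8.0
high = sample_sine(7.0, rate_hz, 16)
low = sample_sine(1.0, rate_hz, 16)
# At 8 samples/s a 7 Hz tone yields the same samples as an inverted 1 Hz tone.
aliased_to_1hz = all(abs(h + l) < 1e-9 for h, l in zip(high, low))

# Stretching the signal 4x lowers 7 Hz to 1.75 Hz, which 8 samples/s can represent.
modulated_hz = stretched_frequency(7.0, 4.0)
```

In this picture, densely sampling a high-frequency region plays the role of the stretch: the content occupies more samples, so its frequency in the resampled grid drops before downsampling is applied.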
[36] SEPose: A Synthetic Event-based Human Pose Estimation Dataset for Pedestrian Monitoring
Kaustav Chanda,Aayush Atul Verma,Arpitsinh Vaghela,Yezhou Yang,Bharatesh Chakravarthi
Main category: cs.CV
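TL;DR: The paper presents SEPose, a synthetic event-based human pose estimation dataset for fixed-viewpoint pedestrian monitoring, filling a gap in existing data.
Details
Motivation: Event-based sensors perform well in pedestrian monitoring, but data from real scenarios is scarce. The authors aim to address this with synthetic data.
Contribution: Presents the SEPose dataset, with nearly 350K annotated pedestrian poses covering diverse scenes and conditions, supporting sim-to-real generalization.
Method: Generates synthetic data with dynamic vision sensors in the CARLA simulator and validates it with models such as RVT and YOLOv8.
Result: Experiments show that models trained on SEPose generalize to real event data, demonstrating the dataset's practical value.
Insight: Synthetic data can compensate for scarce real data, especially for pedestrian pose estimation in complex scenes.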
Abstract: Event-based sensors have emerged as a promising solution for addressing challenging conditions in pedestrian and traffic monitoring systems. Their low-latency and high dynamic range allow for improved response time in safety-critical situations caused by distracted walking or other unusual movements. However, the availability of data covering such scenarios remains limited. To address this gap, we present SEPose – a comprehensive synthetic event-based human pose estimation dataset for fixed pedestrian perception generated using dynamic vision sensors in the CARLA simulator. With nearly 350K annotated pedestrians with body pose keypoints from the perspective of fixed traffic cameras, SEPose is a comprehensive synthetic multi-person pose estimation dataset that spans busy and light crowds and traffic across diverse lighting and weather conditions in 4-way intersections in urban, suburban, and rural environments. We train existing state-of-the-art models such as RVT and YOLOv8 on our dataset and evaluate them on real event-based data to demonstrate the sim-to-real generalization capabilities of the proposed dataset.
[37] Dark-EvGS: Event Camera as an Eye for Radiance Field in the Dark
Jingqian Wu,Peiqi Duan,Zongqiang Wang,Changwei Wang,Boxin Shi,Edmund Y. Lam
Main category: cs.CV
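TL;DR: Dark-EvGS is an event-camera-based 3D Gaussian Splatting framework that reconstructs radiance fields in low light and generates bright frames from multiple viewpoints. It introduces triplet-level supervision and a color tone matching block to handle noisy events and low-quality frames, and performs well in experiments.
Details
Motivation: Conventional cameras struggle to capture clear multi-view images in low light because of limited dynamic range and motion blur. The high dynamic range and high speed of event cameras offer a way to address this.
Contribution: 1. Proposes Dark-EvGS, the first event-assisted 3D Gaussian Splatting framework; 2. introduces triplet-level supervision and a color tone matching block; 3. builds the first real-captured event-guided dataset.
Method: Combines the high dynamic range and high speed of event cameras with 3D Gaussian Splatting, uses triplet-level supervision to learn both holistic and fine-grained knowledge, and uses the color tone matching block to keep the colors of rendered frames consistent.
Result: Experiments show that Dark-EvGS outperforms existing methods in low light, achieving high-quality radiance field reconstruction and frame rendering.
Insight: Combining event cameras with 3D Gaussian Splatting opens a new route to multi-view imaging in low light; future work could further improve noise suppression and real-time performance.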
Abstract: In low-light environments, conventional cameras often struggle to capture clear multi-view images of objects due to dynamic range limitations and motion blur caused by long exposure. Event cameras, with their high dynamic range and high-speed properties, have the potential to mitigate these issues. Additionally, 3D Gaussian Splatting (GS) enables radiance field reconstruction, facilitating bright frame synthesis from multiple viewpoints in low-light conditions. However, naively using an event-assisted 3D GS approach still faces challenges because, in low light, events are noisy, frames lack quality, and the color tone may be inconsistent. To address these issues, we propose Dark-EvGS, the first event-assisted 3D GS framework that enables the reconstruction of bright frames from arbitrary viewpoints along the camera trajectory. Triplet-level supervision is proposed to gain holistic knowledge, granular details, and sharp scene rendering. The color tone matching block is proposed to guarantee the color consistency of the rendered frames. Furthermore, we introduce the first real-captured dataset for the event-guided bright frame synthesis task via 3D GS-based radiance field reconstruction. Experiments demonstrate that our method achieves better results than existing methods, conquering radiance field reconstruction under challenging low-light conditions. The code and sample data are included in the supplementary material.
[38] Hyperphantasia: A Benchmark for Evaluating the Mental Visualization Capabilities of Multimodal LLMs
Mohammad Shahab Sepehri,Berk Tinaz,Zalan Fabian,Mahdi Soltanolkotabi
Main category: cs.CV
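TL;DR: The paper introduces Hyperphantasia, a benchmark for evaluating the "mental visualization" ability of multimodal large language models, and finds that current models lag far behind humans.
Details
Motivation: Mental visualization is a core cognitive ability, but current benchmarks for multimodal large language models focus mainly on passive visual perception and do not test the ability to actively construct visual patterns.
Contribution: Introduces the Hyperphantasia benchmark, consisting of four procedurally generated tasks of increasing difficulty, for systematically evaluating mental visualization in models.
Method: Designs four procedurally generated tasks at three difficulty levels, evaluates model performance on them, and explores the potential of reinforcement learning to improve visual simulation.
Result: Current multimodal large language models perform significantly below humans on mental visualization tasks; some models show only partial competence in recognizing visual patterns.
Insight: Mental visualization remains an open challenge for multimodal models and may require further work on reinforcement learning or other methods.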
Abstract: Mental visualization, the ability to construct and manipulate visual representations internally, is a core component of human cognition and plays a vital role in tasks involving reasoning, prediction, and abstraction. Despite the rapid progress of Multimodal Large Language Models (MLLMs), current benchmarks primarily assess passive visual perception, offering limited insight into the more active capability of internally constructing visual patterns to support problem solving. Yet mental visualization is a critical cognitive skill in humans, supporting abilities such as spatial navigation, predicting physical trajectories, and solving complex visual problems through imaginative simulation. To bridge this gap, we introduce Hyperphantasia, a synthetic benchmark designed to evaluate the mental visualization abilities of MLLMs through four carefully constructed puzzles. Each task is procedurally generated and presented at three difficulty levels, enabling controlled analysis of model performance across increasing complexity. Our comprehensive evaluation of state-of-the-art models reveals a substantial gap between the performance of humans and MLLMs. Additionally, we explore the potential of reinforcement learning to improve visual simulation capabilities. Our findings suggest that while some models exhibit partial competence in recognizing visual patterns, robust mental visualization remains an open challenge for current MLLMs.
[39] RaDL: Relation-aware Disentangled Learning for Multi-Instance Text-to-Image Generation
Geon Park,Seon Bin Kim,Gunho Jung,Seong-Whan Lee
Main category: cs.CV
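TL;DR: The paper proposes RaDL, a relation-aware disentangled learning framework that addresses relationship discrepancy and attribute leakage in multi-instance text-to-image generation, significantly improving positional accuracy and inter-instance relationships in the generated images.
Details
Motivation: Existing methods for multi-instance image generation struggle with relationship discrepancy between instances and attribute leakage, which degrades the results. RaDL aims to solve these problems.
Contribution: Proposes the RaDL framework, which enhances instance attributes through learnable parameters and generates relation-aware image features via Relation Attention.
Method: Uses a Relation Attention mechanism and learnable parameters, with action verbs extracted from the global prompt, to disentangle instance attributes and model relationships.
Result: On benchmarks such as COCO-Position, COCO-MIG, and DrawBench, RaDL outperforms existing methods, with the largest gains in positional accuracy and the handling of inter-instance relationships.
Insight: By combining relation awareness with disentangled learning, RaDL offers a more complete solution for multi-instance text-to-image generation and underlines the importance of relationships between instances.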
Abstract: With recent advancements in text-to-image (T2I) models, effectively generating multiple instances within a single image prompt has become a crucial challenge. Existing methods, while successful in generating positions of individual instances, often struggle to account for relationship discrepancy and multiple attributes leakage. To address these limitations, this paper proposes the relation-aware disentangled learning (RaDL) framework. RaDL enhances instance-specific attributes through learnable parameters and generates relation-aware image features via Relation Attention, utilizing action verbs extracted from the global prompt. Through extensive evaluations on benchmarks such as COCO-Position, COCO-MIG, and DrawBench, we demonstrate that RaDL outperforms existing methods, showing significant improvements in positional accuracy, multiple attributes consideration, and the relationships between instances. Our results present RaDL as the solution for generating images that consider both the relationships and multiple attributes of each instance within the multi-instance image.
[40] Prototypical Progressive Alignment and Reweighting for Generalizable Semantic Segmentation
Yuhang Zhang,Zhengyu Zhang,Muxin Liao,Shishun Tian,Wenbin Zou,Lu Zhang,Chen Xu
Main category: cs.CV
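TL;DR: The paper proposes PPAR, a framework that uses progressive prototypical alignment and reweighting, together with the strong generalization of CLIP, to improve how semantic segmentation generalizes to unseen target domains, achieving SOTA results.
Details
Motivation: Existing methods for generalizable semantic segmentation generalize poorly because they use coarse prototypical alignment, overfit to source data, and ignore differences in how hard features are to adapt.
Contribution: 1. Proposes two prototypes, OTP and VTP, as the basis for alignment; 2. designs a progressive alignment strategy; 3. introduces a prototypical reweighting mechanism to reduce negative transfer.
Method: Generates the OTP and VTP prototypes with CLIP, then applies a progressive alignment strategy and a reliability-based reweighting mechanism.
Result: Achieves SOTA performance on multiple benchmarks, validating the method.
Insight: Progressive alignment and reweighting substantially improve generalization to unseen domains, and using CLIP makes the prototypes more stable.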
Abstract: Generalizable semantic segmentation aims to perform well on unseen target domains, a critical challenge due to real-world applications requiring high generalizability. Class-wise prototypes, representing class centroids, serve as domain-invariant cues that benefit generalization due to their stability and semantic consistency. However, this approach faces three challenges. First, existing methods often adopt coarse prototypical alignment strategies, which may hinder performance. Second, naive prototypes computed by averaging source batch features are prone to overfitting and may be negatively affected by unrelated source data. Third, most methods treat all source samples equally, ignoring the fact that different features have varying adaptation difficulties. To address these limitations, we propose a novel framework for generalizable semantic segmentation: Prototypical Progressive Alignment and Reweighting (PPAR), leveraging the strong generalization ability of the CLIP model. Specifically, we define two prototypes: the Original Text Prototype (OTP) and Visual Text Prototype (VTP), generated via CLIP to serve as a solid base for alignment. We then introduce a progressive alignment strategy that aligns features in an easy-to-difficult manner, reducing domain gaps gradually. Furthermore, we propose a prototypical reweighting mechanism that estimates the reliability of source data and adjusts its contribution, mitigating the effect of irrelevant or harmful features (i.e., reducing negative transfer). We also provide a theoretical analysis showing the alignment between our method and domain generalization theory. Extensive experiments across multiple benchmarks demonstrate that PPAR achieves state-of-the-art performance, validating its effectiveness.
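Example (illustrative sketch, not the authors' code): the abstract contrasts naive prototypes, computed by averaging source features, with a reweighting scheme that down-weights unreliable source samples. The snippet below shows one simple way such a scheme could look: cosine similarity to a fixed prototype, turned into weights with a softmax. The feature values, the stand-in prototype, the temperature, and the function names are all assumptions; the abstract does not give PPAR's actual weighting formula.

```python
import math

def centroid(vectors):
    """Naive class prototype: the mean of the feature vectors."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

def reliability_weights(features, prototype, temperature=0.1):
    """Softmax over cosine similarity to the prototype (an assumed scoring rule)."""
    sims = [cosine(f, prototype) for f in features]
    peak = max(sims)  # subtract the max for numerical stability
    exps = [math.exp((s - peak) / temperature) for s in sims]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up 2D features for one class; the last one is an outlier.
features = [[1.0, 0.1], [0.9, 0.2], [-0.2, 1.0]]
prototype = [1.0, 0.0]  # stand-in for a CLIP-derived prototype
weights = reliability_weights(features, prototype)
naive_prototype = centroid(features)
```

The outlier pulls `naive_prototype` away from the other two features, while its reliability weight is close to zero, which is the intuition behind reducing negative transfer.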
[41] Language-Guided Contrastive Audio-Visual Masked Autoencoder with Automatically Generated Audio-Visual-Text Triplets from Videos
Yuchi Ishikawa,Shota Nakada,Hokuto Munakata,Kazuhiro Saito,Tatsuya Komatsu,Yoshimitsu Aoki
Main category: cs.CV
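TL;DR: LG-CAV-MAE is a contrastive audio-visual masked autoencoder augmented with a text encoder. It learns across modalities from automatically generated audio-visual-text triplets and significantly improves task performance.
Details
Motivation: To improve audio-visual representation learning by adding a text modality and automatically generated audio-visual-text triplets, reducing reliance on manual annotation.
Contribution: 1. Proposes LG-CAV-MAE, a contrastive audio-visual masked autoencoder combined with a text encoder; 2. automatically generates high-quality audio-visual-text triplets.
Method: 1. Enhances a contrastive audio-visual masked autoencoder with a pretrained text encoder; 2. generates audio-visual-text triplets automatically using an image captioning model and CLAP-based filtering.
Result: Up to a 5.6% improvement in recall@10 on audio-visual retrieval tasks and a 3.2% improvement on the classification task.
Insight: Combining automatically generated multimodal triplets with text-guided contrastive learning significantly improves model performance.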
Abstract: In this paper, we propose Language-Guided Contrastive Audio-Visual Masked Autoencoders (LG-CAV-MAE) to improve audio-visual representation learning. LG-CAV-MAE integrates a pretrained text encoder into contrastive audio-visual masked autoencoders, enabling the model to learn across audio, visual and text modalities. To train LG-CAV-MAE, we introduce an automatic method to generate audio-visual-text triplets from unlabeled videos. We first generate frame-level captions using an image captioning model and then apply CLAP-based filtering to ensure strong alignment between audio and captions. This approach yields high-quality audio-visual-text triplets without requiring manual annotations. We evaluate LG-CAV-MAE on audio-visual retrieval tasks, as well as an audio-visual classification task. Our method significantly outperforms existing approaches, achieving up to a 5.6% improvement in recall@10 for retrieval tasks and a 3.2% improvement for the classification task.
[42] Watch, Listen, Understand, Mislead: Tri-modal Adversarial Attacks on Short Videos for Content Appropriateness Evaluation
Sahid Hossain Mustakim,S M Jishanul Islam,Ummay Maria Muna,Montasir Chowdhury,Mohammed Jawwadul Islam,Sadia Ahmmed,Tashfia Sikder,Syed Tasdid Azam Dhrubo,Swakkhar Shatabda
Main category: cs.CV
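TL;DR: The paper proposes a tri-modal adversarial attack framework for multimodal large language models (MLLMs), evaluating model safety on short-video content and exposing weaknesses in visual, auditory, and semantic reasoning.
Details
Motivation: Current safety evaluations for content moderation rely mostly on unimodal attacks and overlook the risk of combined multimodal attacks, so the robustness of MLLMs in tri-modal settings needs a comprehensive evaluation.
Contribution: 1. Presents the SVMA dataset of diverse short videos with synthetic adversarial attack samples; 2. designs the ChimeraBreak attack strategy, which challenges the visual, auditory, and semantic modalities simultaneously; 3. reveals significant vulnerabilities, with high attack success rates, and classification biases in MLLMs.
Method: Builds the SVMA dataset from human-guided synthetic adversarial samples and proposes the ChimeraBreak strategy for joint tri-modal attacks. An LLM-as-a-judge setup evaluates attack effectiveness.
Result: Experiments show that MLLMs suffer high Attack Success Rates (ASR) under joint attacks and are biased toward misclassifying benign or policy-violating content.
Insight: Exposes weaknesses of MLLMs in multimodal safety evaluation and provides key insights for developing safer models.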
Abstract: Multimodal Large Language Models (MLLMs) are increasingly used for content moderation, yet their robustness in short-form video contexts remains underexplored. Current safety evaluations often rely on unimodal attacks, failing to address combined attack vulnerabilities. In this paper, we introduce a comprehensive framework for evaluating the tri-modal safety of MLLMs. First, we present the Short-Video Multimodal Adversarial (SVMA) dataset, comprising diverse short-form videos with human-guided synthetic adversarial attacks. Second, we propose ChimeraBreak, a novel tri-modal attack strategy that simultaneously challenges visual, auditory, and semantic reasoning pathways. Extensive experiments on state-of-the-art MLLMs reveal significant vulnerabilities with high Attack Success Rates (ASR). Our findings uncover distinct failure modes, showing model biases toward misclassifying benign or policy-violating content. We assess results using LLM-as-a-judge, demonstrating attack reasoning efficacy. Our dataset and findings provide crucial insights for developing more robust and safe MLLMs.
[43] GS-Bias: Global-Spatial Bias Learner for Single-Image Test-Time Adaptation of Vision-Language Models
Zhaohong Huang,Yuxin Zhang,Jingjing Xie,Fei Chao,Rongrong Ji
Main category: cs.CV
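TL;DR: GS-Bias is an efficient test-time adaptation method that learns global and spatial biases to improve the zero-shot generalization of vision-language models while substantially reducing computational overhead.
Details
Motivation: Existing test-time adaptation methods fail to balance performance and efficiency: they either tune text prompts, which is too costly, or rely on handcrafted visual feature enhancement, whose benefits are unstable.
Contribution: Proposes a global and spatial bias learner that adds learned biases directly to the output logits, avoiding full backpropagation and significantly improving both efficiency and performance.
Method: The global bias captures the global semantic features of a test image by learning consistency across augmented views, and the spatial bias learns the semantic coherence between image regions.
Result: Achieves SOTA performance on 15 benchmark datasets, for example improving on TPT by 2.23% in cross-dataset generalization and 2.72% in domain generalization while using only 6.5% of TPT's memory on ImageNet.
Insight: Lightweight bias learning applied directly to the output logits avoids the computational bottleneck of earlier methods while still capturing semantic features.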
Abstract: Recent advances in test-time adaptation (TTA) for Vision-Language Models (VLMs) have garnered increasing attention, particularly through the use of multiple augmented views of a single image to boost zero-shot generalization. Unfortunately, existing methods fail to strike a satisfactory balance between performance and efficiency, either due to excessive overhead of tuning text prompts or unstable benefits from handcrafted, training-free visual feature enhancement. In this paper, we present Global-Spatial Bias Learner (GS-Bias), an efficient and effective TTA paradigm that incorporates two learnable biases during TTA, unfolded as the global bias and spatial bias. Particularly, the global bias captures the global semantic features of a test image by learning consistency across augmented views, while spatial bias learns the semantic coherence between regions in the image’s spatial visual representation. It is worth highlighting that these two sets of biases are directly added to the logits output by the pretrained VLMs, which circumvents the full backpropagation through the VLM that hinders the efficiency of existing TTA methods. This endows GS-Bias with extremely high efficiency while achieving state-of-the-art performance on 15 benchmark datasets. For example, it achieves a 2.23% improvement over TPT in cross-dataset generalization and a 2.72% improvement in domain generalization, while requiring only 6.5% of TPT’s memory usage on ImageNet.
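Example (illustrative sketch, not the authors' code): the key efficiency claim is that the learned biases are added to the logits, so the frozen model never needs a backward pass. The snippet below learns a single shared bias vector on top of fixed per-view logits by gradient descent. The objective used here, the mean prediction entropy over augmented views, is an assumption standing in for the paper's consistency objective, and the logits are made up.

```python
import math

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

def mean_entropy(view_logits, bias):
    """Average prediction entropy over views after adding the bias to each."""
    return sum(entropy(softmax([z + b for z, b in zip(row, bias)]))
               for row in view_logits) / len(view_logits)

def learn_bias(view_logits, steps=50, lr=0.5):
    """Gradient descent on a shared bias; the frozen logits are never recomputed."""
    n_classes = len(view_logits[0])
    bias = [0.0] * n_classes
    for _ in range(steps):
        grad = [0.0] * n_classes
        for row in view_logits:
            probs = softmax([z + b for z, b in zip(row, bias)])
            h = entropy(probs)
            for k, p in enumerate(probs):
                if p > 0:
                    # d(entropy)/d(logit_k) = -p_k * (log p_k + H)
                    grad[k] += -p * (math.log(p) + h) / len(view_logits)
        bias = [b - lr * g for b, g in zip(bias, grad)]
    return bias

# Made-up logits for 3 augmented views of one image over 3 classes.
view_logits = [[2.0, 1.0, 0.5], [1.8, 1.2, 0.3], [2.2, 0.9, 0.7]]
bias = learn_bias(view_logits)
entropy_before = mean_entropy(view_logits, [0.0, 0.0, 0.0])
entropy_after = mean_entropy(view_logits, bias)
```

The only trainable state is `bias`, a vector with one entry per class, which is why this style of adaptation is cheap compared with tuning prompts through the whole model.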
[44] EC-Diff: Fast and High-Quality Edge-Cloud Collaborative Inference for Diffusion Models
Jiajian Xie,Shengyu Zhang,Zhou Zhao,Fan Wu,Fei Wu
Main category: cs.CV
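TL;DR: EC-Diff is an edge-cloud collaborative inference framework that uses gradient-based noise estimation and a K-step noise approximation strategy to improve both the output quality and the inference speed of diffusion models.
Details
Motivation: Diffusion models excel at image and video synthesis, but model size and latency hurt the user experience. Current edge-cloud collaborative frameworks suffer from either long inference times or semantic ambiguity.
Contribution: 1) Proposes the EC-Diff framework; 2) designs a K-step noise approximation strategy; 3) proposes a two-stage greedy search algorithm to optimize parameters.
Method: 1) Gradient-based noise estimation accelerates cloud inference; 2) the K-step noise approximation strategy reduces how often cloud inference runs; 3) a two-stage greedy search algorithm determines the optimal handoff point.
Result: Significantly better generation quality than edge inference, with up to an average 2x speedup over cloud inference.
Insight: Dynamically adjusting the division of labor between the cloud and edge models can improve inference speed and generation quality at the same time.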
Abstract: Diffusion Models have shown remarkable proficiency in image and video synthesis. As growing model size and latency limit user experience, a hybrid edge-cloud collaborative framework was recently proposed to realize fast inference and high-quality generation, where the cloud model initiates high-quality semantic planning and the edge model expedites later-stage refinement. However, excessive cloud denoising prolongs inference time, while insufficient steps cause semantic ambiguity, leading to inconsistency in edge model output. To address these challenges, we propose EC-Diff, which accelerates cloud inference through gradient-based noise estimation while identifying the optimal point for cloud-edge handoff to maintain generation quality. Specifically, we design a K-step noise approximation strategy to reduce cloud inference frequency by using noise gradients between steps and applying cloud inference periodically to adjust errors. Then we design a two-stage greedy search algorithm to efficiently find the optimal parameters for noise approximation and edge model switching. Extensive experiments demonstrate that our method significantly enhances generation quality compared to edge inference, while achieving up to an average $2\times$ speedup in inference compared to cloud inference. Video samples and source code are available at https://ec-diff.github.io/.
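Example (illustrative sketch, not the authors' code): one plausible reading of the K-step noise approximation is to call the expensive cloud model only every K steps, extrapolate the noise estimate in between from the change observed across the last two cloud calls, and let each new cloud call reset the accumulated error. The scalar noise value, the linear extrapolation, and the names `k_step_noise` and `toy_cloud_noise` are assumptions for illustration; real noise predictions are tensors and the paper's exact update rule is not given in the abstract.

```python
def k_step_noise(cloud_noise, n_steps, k):
    """Query cloud_noise every k steps and extrapolate linearly in between.

    Returns (estimates, number_of_cloud_calls).
    """
    estimates = []
    calls = 0
    anchor_t, anchor_val, slope = 0, 0.0, 0.0
    for t in range(n_steps):
        if t % k == 0:
            fresh = cloud_noise(t)  # expensive call, corrects any drift
            calls += 1
            if t > 0:
                # Per-step change between the last two cloud estimates.
                slope = (fresh - anchor_val) / (t - anchor_t)
            anchor_t, anchor_val = t, fresh
            estimates.append(fresh)
        else:
            # Cheap step: extrapolate from the latest cloud estimate.
            estimates.append(anchor_val + slope * (t - anchor_t))
    return estimates, calls

def toy_cloud_noise(t):
    """Hypothetical stand-in for the cloud model's noise prediction at step t."""
    return 1.0 - 0.01 * t

estimates, calls = k_step_noise(toy_cloud_noise, n_steps=12, k=4)
```

With 12 steps and K = 4 the cloud model runs 3 times instead of 12. Until a second cloud estimate is available the slope is zero, so the first cheap steps simply reuse the initial estimate.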
[45] Unsupervised Part Discovery via Descriptor-Based Masked Image Restoration with Optimized Constraints
Jiahao Xia,Yike Wu,Wenjian Huang,Jianguo Zhang,Jian Zhang
Main category: cs.CV
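TL;DR: The paper proposes MPAE, an unsupervised part discovery method based on descriptor-based masked image restoration with optimized constraints, which robustly discovers parts that closely match object shapes even in complex scenes.
Details
Motivation: Part-level features are crucial for image understanding, but the lack of fine-grained labels has limited research on them. Existing unsupervised methods are not robust across categories and scenarios, which restricts their range of application.
Contribution: Proposes the MPAE framework, which uses masked restoration and optimized constraints to discover parts robustly without labels, and proposes looser yet more effective constraints that support part discovery across scenarios and categories.
Method: MPAE first learns part descriptors and a feature map, and produces patch features from a masked version of the image. It then fills the masked regions with the learned descriptors according to their similarity to the local features, and restores the masked patches so that they align with the part shapes.
Result: Experiments show that MPAE robustly discovers meaningful parts across many categories and scenarios, and supports occlusion handling and the exploration of part similarity across categories.
Insight: Combining masked restoration with descriptor learning enables more precise part discovery without supervision and offers a new approach to image understanding in complex scenes.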
Abstract: Part-level features are crucial for image understanding, but few studies focus on them because of the lack of fine-grained labels. Although unsupervised part discovery can eliminate the reliance on labels, most existing methods cannot maintain robustness across various categories and scenarios, which restricts their application range. To overcome this limitation, we present a more effective paradigm for unsupervised part discovery, named Masked Part Autoencoder (MPAE). It first learns part descriptors as well as a feature map from the inputs and produces patch features from a masked version of the original images. Then, the masked regions are filled with the learned part descriptors based on the similarity between the local features and descriptors. By restoring these masked patches using the part descriptors, they become better aligned with their part shapes, guided by appearance features from unmasked patches. Finally, MPAE robustly discovers meaningful parts that closely match the actual object shapes, even in complex scenarios. Moreover, several looser yet more effective constraints are proposed to enable MPAE to identify the presence of parts across various scenarios and categories in an unsupervised manner. This provides the foundation for addressing challenges posed by occlusion and for exploring part similarity across multiple categories. Extensive experiments demonstrate that our method robustly discovers meaningful parts across various categories and scenarios. The code is available at https://github.com/Jiahao-UTS/MPAE.
[46] Frequency-Dynamic Attention Modulation for Dense Prediction
Linwei Chen,Lin Gu,Ying Fu
Main category: cs.CV
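TL;DR: The paper proposes FDAM, which modulates the frequency response of ViTs to overcome the loss of detail caused by their low-pass filtering, improving performance on a range of vision tasks.
Details
Motivation: The attention mechanism in Vision Transformers (ViTs) makes each layer act as a low-pass filter, and stacking layers attenuates frequency content so that critical details and textures are lost. A way to dynamically modulate the frequency response is therefore needed.
Contribution: 1. Proposes FDAM, a circuit-theory-inspired method that dynamically modulates the frequency response of ViTs through Attention Inversion (AttInv) and Frequency Dynamic Scaling (FreqScale). 2. Validates performance gains across multiple models (e.g., SegFormer, DeiT) and tasks (e.g., semantic segmentation, object detection). 3. Achieves SOTA results in single-scale settings for remote sensing detection.
Method: 1. AttInv generates complementary high-pass filtering by inverting the low-pass filter in the attention matrix. 2. FreqScale weights different frequency components for fine-grained adjustment. 3. The two techniques are combined to dynamically shape the frequency response of ViTs.
Result: Improves performance across multiple models and tasks, avoids representation collapse, and reaches SOTA in remote sensing detection.
Insight: A circuit-theory-inspired dynamic frequency modulation method can counter the low-frequency dominance of ViTs and improve their performance on dense prediction tasks.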
Abstract: Vision Transformers (ViTs) have significantly advanced computer vision, demonstrating strong performance across various tasks. However, the attention mechanism in ViTs makes each layer function as a low-pass filter, and the stacked-layer architecture in existing transformers suffers from frequency vanishing. This leads to the loss of critical details and textures. We propose a novel, circuit-theory-inspired strategy called Frequency-Dynamic Attention Modulation (FDAM), which can be easily plugged into ViTs. FDAM directly modulates the overall frequency response of ViTs and consists of two techniques: Attention Inversion (AttInv) and Frequency Dynamic Scaling (FreqScale). Since circuit theory uses low-pass filters as fundamental elements, we introduce AttInv, a method that generates complementary high-pass filtering by inverting the low-pass filter in the attention matrix and dynamically combines the two. We further design FreqScale to weight different frequency components for fine-grained adjustments to the target response function. Through feature similarity analysis and effective rank evaluation, we demonstrate that our approach avoids representation collapse, leading to consistent performance improvements across various models, including SegFormer, DeiT, and MaskDINO. These improvements are evident in tasks such as semantic segmentation, object detection, and instance segmentation. Additionally, we apply our method to remote sensing detection, achieving state-of-the-art results in single-scale settings. The code is available at https://github.com/Linwei-Chen/FDAM.
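Example (illustrative sketch, not the authors' code): a row-stochastic attention matrix A averages tokens and so behaves like a low-pass filter. A common way to obtain the complementary high-pass filter is I - A, which removes whatever the averaging keeps. The snippet below assumes that reading of "inverting" and uses fixed gains where the paper combines the two paths dynamically; the uniform attention matrix and token values are made up.

```python
def apply_filter(matrix, signal):
    """Matrix-vector product: each output token is a weighted sum of inputs."""
    return [sum(w * x for w, x in zip(row, signal)) for row in matrix]

def invert_attention(attn):
    """Complementary high-pass filter: identity minus the low-pass attention."""
    n = len(attn)
    return [[(1.0 if i == j else 0.0) - attn[i][j] for j in range(n)]
            for i in range(n)]

def combine(attn, signal, low_gain, high_gain):
    """Weighted sum of the low-pass and high-pass responses."""
    low = apply_filter(attn, signal)
    high = apply_filter(invert_attention(attn), signal)
    return [low_gain * l + high_gain * h for l, h in zip(low, high)]

n = 4
uniform_attn = [[1.0 / n] * n for _ in range(n)]  # every token attends equally
tokens = [1.0, 2.0, 3.0, 6.0]
low = apply_filter(uniform_attn, tokens)    # every token becomes the mean
high = apply_filter(invert_attention(uniform_attn), tokens)  # deviation from mean
boosted = combine(uniform_attn, tokens, low_gain=1.0, high_gain=2.0)
flat = apply_filter(invert_attention(uniform_attn), [5.0] * n)
```

With both gains set to 1 the input is recovered exactly, and raising `high_gain` amplifies the detail that plain attention would smooth away. A constant signal passes through the high-pass path as all zeros.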
[47] Dual form Complementary Masking for Domain-Adaptive Image Segmentation
Jiawen Wang,Yinda Chen,Xiaoyu Liu,Che Liu,Dong Liu,Jianqing Gao,Zhiwei Xiong
Main category: cs.CV
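TL;DR: The paper proposes MaskTwins, a framework that treats masked reconstruction as sparse signal reconstruction and uses dual-form complementary masking to improve the extraction of domain-invariant features for cross-domain image segmentation, achieving end-to-end domain generalization without separate pre-training.
Details
Motivation: Prior work treats masked image modeling (MIM) merely as a form of deformation of the input image and neglects its theory and potential. This paper re-analyzes masked reconstruction from the perspective of sparse signal reconstruction and explores its role in feature extraction and representation learning.
Contribution: 1. Proves theoretically that dual-form complementary masks are better at extracting domain-agnostic features; 2. proposes the MaskTwins framework, which integrates masked reconstruction directly into the training pipeline for end-to-end domain generalization.
Method: Building on sparse signal reconstruction theory, designs a complementary masking strategy (MaskTwins) that learns structural patterns shared across domains by enforcing consistency between predictions for complementarily masked images.
Result: Outperforms baseline methods on natural and biological image segmentation, confirming the advantage of MaskTwins in extracting domain-invariant features.
Insight: Masked reconstruction is more than data augmentation: a theory-driven complementary masking strategy can significantly improve robustness to domain shift.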
Abstract: Recent works have correlated Masked Image Modeling (MIM) with consistency regularization in Unsupervised Domain Adaptation (UDA). However, they merely treat masking as a special form of deformation on the input images and neglect the theoretical analysis, which leads to a superficial understanding of masked reconstruction and insufficient exploitation of its potential in enhancing feature extraction and representation learning. In this paper, we reframe masked reconstruction as a sparse signal reconstruction problem and theoretically prove that the dual form of complementary masks possesses superior capabilities in extracting domain-agnostic image features. Based on this compelling insight, we propose MaskTwins, a simple yet effective UDA framework that integrates masked reconstruction directly into the main training pipeline. MaskTwins uncovers intrinsic structural patterns that persist across disparate domains by enforcing consistency between predictions of images masked in complementary ways, enabling domain generalization in an end-to-end manner. Extensive experiments verify the superiority of MaskTwins over baseline methods in natural and biological image segmentation. These results demonstrate the significant advantages of MaskTwins in extracting domain-invariant features without the need for separate pre-training, offering a new paradigm for domain-adaptive segmentation.
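Example (illustrative sketch, not the authors' code): complementary masking means that every patch hidden in one view is visible in the other, so the two views together cover the image exactly once. The snippet below builds such a pair of masks over a flat list of patch values and defines a simple mean-squared consistency loss that a segmentation model's two predictions could be fed into. The patch values, the 50% keep ratio, and the function names are assumptions; no model is included.

```python
import random

def complementary_masks(n_patches, keep_ratio=0.5, seed=0):
    """Return a random binary mask and its complement (1 = patch kept)."""
    rng = random.Random(seed)
    kept = set(rng.sample(range(n_patches), int(n_patches * keep_ratio)))
    mask = [1 if i in kept else 0 for i in range(n_patches)]
    complement = [1 - m for m in mask]
    return mask, complement

def apply_mask(patches, mask):
    """Zero out the patches whose mask entry is 0."""
    return [p * m for p, m in zip(patches, mask)]

def consistency_loss(pred_a, pred_b):
    """Mean squared difference between two predictions."""
    return sum((a - b) ** 2 for a, b in zip(pred_a, pred_b)) / len(pred_a)

# Made-up patch intensities standing in for an image split into 8 patches.
patches = [0.2, 0.5, 0.1, 0.9, 0.4, 0.7, 0.3, 0.8]
mask, complement = complementary_masks(len(patches))
view_a = apply_mask(patches, mask)
view_b = apply_mask(patches, complement)
```

In training, `view_a` and `view_b` would each be passed through the segmentation network and `consistency_loss` applied to the two outputs, which pushes the model to rely on structure that survives either masking.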
[48] Deep Neural Encoder-Decoder Model to Relate fMRI Brain Activity with Naturalistic Stimuli
Florian David,Michael Chan,Elenor Morgenroth,Patrik Vuilleumier,Dimitri Van De Ville
Main category: cs.CV
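TL;DR: The paper proposes an end-to-end deep neural encoder-decoder model that uses fMRI data to encode and decode brain activity under naturalistic stimuli. Temporal convolutional layers bridge the gap in temporal resolution between natural movie stimuli and fMRI acquisition, and the model predicts voxel activity in the visual cortex and reconstructs the corresponding visual input from neural activity. Saliency map analysis identifies the middle occipital area, the fusiform area, and the calcarine as the key regions for visual decoding.
Details
Motivation: To use deep learning to explore patterns of brain activity under naturalistic stimuli such as films, especially the response of the visual cortex. By reconstructing visual input with the model, the authors hope to better understand the neural basis of visual processing.
Contribution: 1) Proposes an end-to-end deep neural encoder-decoder model with temporal convolutional layers; 2) addresses the mismatch in temporal resolution between naturalistic stimuli and fMRI acquisition; 3) identifies the key brain regions for visual decoding through saliency maps.
Method: A deep neural network with encoder and decoder modules, using temporal convolutional layers to process temporally correlated input from consecutive film frames. The model predicts voxel activity in the visual cortex and reconstructs the corresponding visual input from fMRI data. Saliency maps are used to analyze the key brain regions.
Result: The model predicts voxel activity in the visual cortex and reconstructs visual features such as edges, faces, and contrasts. Saliency maps indicate that the middle occipital area (shape perception), the fusiform area (complex recognition), and the calcarine (basic visual features) are the key regions for decoding.
Insight: The model's decoding ability is highly consistent with known functions of the visual cortex, such as edge detection and face recognition, suggesting that deep learning models can serve as a proxy for studying the neural mechanisms of vision.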
Abstract: We propose an end-to-end deep neural encoder-decoder model to encode and decode brain activity in response to naturalistic stimuli using functional magnetic resonance imaging (fMRI) data. Leveraging temporally correlated input from consecutive film frames, we employ temporal convolutional layers in our architecture, which effectively bridge the temporal resolution gap between natural movie stimuli and fMRI acquisitions. Our model predicts activity of voxels in and around the visual cortex and performs reconstruction of corresponding visual inputs from neural activity. Finally, we investigate brain regions contributing to visual decoding through saliency maps. We find that the most contributing regions are the middle occipital area, the fusiform area, and the calcarine, respectively employed in shape perception, complex recognition (in particular face perception), and basic visual features such as edges and contrasts. That these functions are strongly solicited is in line with the decoder’s capability to reconstruct edges, faces, and contrasts. All in all, this suggests the possibility of probing our understanding of visual processing in films using the behaviour of deep learning models, such as the one proposed in this paper, as a proxy.
[49] SS-DC: Spatial-Spectral Decoupling and Coupling Across Visible-Infrared Gap for Domain Adaptive Object Detection
Xiwei Zhang,Chunjin Yang,Yiming Xiao,Runtong Zhang,Fanman Meng
Main category: cs.CV
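TL;DR: The paper proposes SS-DC, a framework based on a decoupling-coupling strategy for unsupervised domain adaptive object detection (UDAOD) from the visible to the infrared domain (RGB-IR), which improves performance by decoupling and then coupling spectral and spatial features.
Details
Motivation: Existing UDAOD methods treat the visible domain as a single unified domain and ignore the differences among its subdomains (e.g., daytime, nighttime, fog). The paper argues that decoupling the domain-invariant (DI) and domain-specific (DS) features across these subdomains helps cross-domain adaptation.
Contribution: 1. Proposes the Spectral Adaptive Idempotent Decoupling (SAID) module, which decouples DI and DS features through spectral decomposition; 2. designs a filter bank-based spectral processing paradigm and a self-distillation-driven decoupling loss; 3. proposes a new spatial-spectral coupling method that achieves joint coupling through DI feature pyramids.
Method: 1. Uses the SAID module for spectral decoupling to extract DI and DS features; 2. introduces the filter bank and the self-distillation loss to improve the decoupling; 3. couples through spatial and spectral DI feature pyramids and brings in the DS features to reduce domain bias.
Result: Significantly outperforms the baseline and other UDAOD methods on multiple RGB-IR datasets, including a new experimental protocol based on the FLIR-ADAS dataset.
Insight: Decoupling domain-invariant and domain-specific features and combining spatial and spectral information improves cross-domain object detection, especially in complex scenes with multiple subdomains.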
Abstract: Unsupervised domain adaptive object detection (UDAOD) from the visible domain to the infrared (RGB-IR) domain is challenging. Existing methods regard the RGB domain as a unified domain and neglect the multiple subdomains within it, such as daytime, nighttime, and foggy scenes. We argue that decoupling the domain-invariant (DI) and domain-specific (DS) features across these multiple subdomains is beneficial for RGB-IR domain adaptation. To this end, this paper proposes a new SS-DC framework based on a decoupling-coupling strategy. In terms of decoupling, we design a Spectral Adaptive Idempotent Decoupling (SAID) module in the aspect of spectral decomposition. Due to the style and content information being highly embedded in different frequency bands, this module can decouple DI and DS components more accurately and interpretably. A novel filter bank-based spectral processing paradigm and a self-distillation-driven decoupling loss are proposed to improve the spectral domain decoupling. In terms of coupling, a new spatial-spectral coupling method is proposed, which realizes joint coupling through spatial and spectral DI feature pyramids. Meanwhile, this paper introduces DS from decoupling to reduce the domain bias. Extensive experiments demonstrate that our method can significantly improve the baseline performance and outperform existing UDAOD methods on multiple RGB-IR datasets, including a new experimental protocol proposed in this paper based on the FLIR-ADAS dataset.
[50] Dataset Ownership Verification for Pre-trained Masked Models
Yuechen Xie,Jie Song,Yicheng Shan,Xiaoyan Zhang,Yuanyu Wan,Shengxuming Zhang,Jiarui Duan,Mingli Song
Main category: cs.CV
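TL;DR: The paper proposes DOV4MM, a method for dataset ownership verification of masked models, filling a gap left by existing techniques.
Details
Motivation: High-quality open-source datasets are vital to the progress of deep learning, but they are at risk of being exploited. Existing verification techniques mainly target supervised and contrastive pre-trained models and cannot be applied directly to masked models.
Contribution: Proposes DOV4MM, the first dataset ownership verification method for masked models, which uses differences in the difficulty of reconstructing masked information to verify whether a model was pre-trained on the target dataset.
Method: Builds the verification scheme on the observation that, after a masked model is pre-trained on the target dataset, the difficulty of reconstructing masked information in the embedding space differs markedly from that of models not pre-trained on it.
Result: Experiments on ImageNet-1K and WikiText-103 show that DOV4MM rejects the null hypothesis with a p-value well below 0.05 and outperforms existing methods.
Insight: Pre-training a masked model leaves a distinctive, verifiable trace in the embedding space, offering a new approach to protecting dataset ownership.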
Abstract: High-quality open-source datasets have emerged as a pivotal catalyst driving the swift advancement of deep learning, while facing the looming threat of potential exploitation. Protecting these datasets is of paramount importance for the interests of their owners. The verification of dataset ownership has evolved into a crucial approach in this domain; however, existing verification techniques are predominantly tailored to supervised models and contrastive pre-trained models, rendering them ill-suited for direct application to the increasingly prevalent masked models. In this work, we introduce the inaugural methodology addressing this critical, yet unresolved challenge, termed Dataset Ownership Verification for Masked Modeling (DOV4MM). The central objective is to ascertain whether a suspicious black-box model has been pre-trained on a particular unlabeled dataset, thereby assisting dataset owners in safeguarding their rights. DOV4MM is grounded in our empirical observation that when a model is pre-trained on the target dataset, the difficulty of reconstructing masked information within the embedding space exhibits a marked contrast to models not pre-trained on that dataset. We validated the efficacy of DOV4MM through ten masked image models on ImageNet-1K and four masked language models on WikiText-103. The results demonstrate that DOV4MM rejects the null hypothesis, with a $p$-value considerably below 0.05, surpassing all prior approaches. Code is available at https://github.com/xieyc99/DOV4MM.
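Example (illustrative sketch, not the authors' code): the verification comes down to a hypothesis test on reconstruction-difficulty scores, rejecting the null hypothesis when the p-value falls below 0.05. The abstract does not say which test DOV4MM uses, so the snippet below substitutes a one-sided permutation test, which needs only the standard library. The scores are made up, and the direction (a model pre-trained on the dataset finds reconstruction easier) is an assumption.

```python
import random

def permutation_p_value(suspect, reference, n_perm=10000, seed=0):
    """One-sided test: is the suspect's mean score lower than the reference's?"""
    rng = random.Random(seed)
    observed = sum(reference) / len(reference) - sum(suspect) / len(suspect)
    pooled = list(suspect) + list(reference)
    n = len(suspect)
    hits = 0
    for _ in range(n_perm):
        # Under the null hypothesis the group labels are exchangeable.
        rng.shuffle(pooled)
        diff = sum(pooled[n:]) / (len(pooled) - n) - sum(pooled[:n]) / n
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)

# Made-up reconstruction-difficulty scores (lower = easier to reconstruct).
suspect_scores = [0.21, 0.19, 0.23, 0.20, 0.22, 0.18, 0.24, 0.20]
independent_scores = [0.35, 0.33, 0.37, 0.36, 0.34, 0.38, 0.32, 0.36]
p_value = permutation_p_value(suspect_scores, independent_scores)
p_same = permutation_p_value(independent_scores, independent_scores)
```

A small `p_value` would support the claim that the suspect model saw the dataset. Comparing a score set against itself, as in `p_same`, gives a large p-value, so no claim would be made.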
[51] MVAR: MultiVariate AutoRegressive Air Pollutants Forecasting Model
Xu Fan,Zhihao Wang,Yuetan Lin,Yan Zhang,Yang Xiang,Hao Li
Main category: cs.CV
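TL;DR: The paper proposes MVAR, a multivariate autoregressive air pollutant forecasting model that reduces reliance on long-time-window inputs, improves data utilization efficiency, achieves 120-hour long-term sequential forecasting, and couples meteorological data to improve the learning of spatial responses.
Details
Motivation: Existing studies focus mostly on single-pollutant forecasting and neglect both the interactions among pollutants and their diverse spatial responses, so they fall short of practical multivariate forecasting needs.
Contribution: 1. Proposes the MVAR model for long-term multivariate air pollutant forecasting; 2. Designs a multivariate autoregressive training paradigm; 3. Develops a Meteorological Coupled Spatial Transformer block; 4. Builds a standardized dataset covering 6 major pollutants.
Method: Uses a multivariate autoregressive model with a Meteorological Coupled Spatial Transformer to learn pollutant interactions and spatial responses, and shortens the input time window to improve efficiency.
Result: Experiments show that MVAR outperforms existing methods, validating its architecture.
Insight: Combining multivariate interactions with meteorological data is key to more accurate air pollutant forecasting, and the standardized dataset gives later work a common basis.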
Abstract: Air pollutants pose a significant threat to the environment and human health, so accurate forecasting of pollutant concentrations is essential for pollution warnings and policy-making. Existing studies predominantly focus on single-pollutant forecasting, neglecting the interactions among different pollutants and their diverse spatial responses. To address the practical needs of forecasting multivariate air pollutants, we propose the MultiVariate AutoRegressive air pollutants forecasting model (MVAR), which reduces the dependency on long-time-window inputs and boosts data utilization efficiency. We also design the Multivariate Autoregressive Training Paradigm, enabling MVAR to achieve 120-hour long-term sequential forecasting. Additionally, MVAR includes a Meteorological Coupled Spatial Transformer block, enabling the flexible coupling of AI-based meteorological forecasts while learning the interactions among pollutants and their diverse spatial responses. To address the lack of standardized datasets in air pollutant forecasting, we construct a comprehensive dataset covering 6 major pollutants across 75 cities in North China from 2018 to 2023, including ERA5 reanalysis data and FuXi-2.0 forecast data. Experimental results demonstrate that the proposed model outperforms state-of-the-art methods and validate the effectiveness of the proposed architecture.
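Autoregressive forecasting of the kind the abstract describes can be pictured as a rollout loop: each predicted state is fed back as the next input, together with that hour's meteorological forecast. The sketch below is a minimal illustration under assumptions. `rollout` and `toy_step` are hypothetical names, and the toy dynamics stand in for the trained network, which the abstract does not specify.

```python
def rollout(step_fn, initial_state, met_forecasts):
    """Feed each prediction back as the next input; one output per step."""
    state = list(initial_state)
    trajectory = []
    for met in met_forecasts:
        state = step_fn(state, met)
        trajectory.append(state)
    return trajectory


def toy_step(state, met):
    # Placeholder dynamics (an assumption, not MVAR): every pollutant decays
    # by 1% per hour plus a wind-dependent dispersion term.
    wind = met["wind"]
    return [0.99 * c - 0.01 * wind * c for c in state]


# Six hypothetical pollutant concentrations and 120 hourly forecasts.
pollutants = [35.0, 60.0, 12.0, 20.0, 0.8, 90.0]
met = [{"wind": 1.0} for _ in range(120)]
forecast = rollout(toy_step, pollutants, met)
print(len(forecast), "hourly steps for", len(forecast[0]), "pollutants")
```

Because only the latest state is carried forward, the loop needs no long input window, which is the property the abstract highlights.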
[52] 3D-MoRe: Unified Modal-Contextual Reasoning for Embodied Question Answering
Rongtao Xu,Han Gao,Mingming Yu,Dong An,Shunpeng Chen,Changwei Wang,Li Guo,Xiaodan Liang,Shibiao Xu
Main category: cs.CV
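TL;DR: The paper proposes 3D-MoRe, a framework that uses foundation models to generate large-scale 3D-language datasets and significantly improves question answering and dense captioning in 3D scenes.
Details
Motivation: 3D scene tasks such as question answering and dense captioning need more diverse and scalable data. The paper aims to improve task performance by combining multi-modal data with high-level reasoning.
Contribution: 1. Proposes the 3D-MoRe framework, which integrates multi-modal embedding, cross-modal interaction, and a language model decoder; 2. Generates a large, high-quality dataset (62,000 QA pairs and 73,000 object descriptions); 3. Significantly outperforms the current state of the art on ScanQA and ScanRefer.
Method: Processes 3D scenes and natural language instructions with multi-modal embedding, cross-modal interaction, and a language model decoder, and uses data augmentation and semantic filtering to ensure data quality.
Result: CIDEr improves by 2.15% on ScanQA, and CIDEr@0.5 improves by 1.84% on ScanRefer.
Insight: Fusing multi-modal context with high-level reasoning improves performance on 3D scene tasks, and the generated large-scale dataset may help advance the field.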
Abstract: With the growing need for diverse and scalable data in indoor scene tasks, such as question answering and dense captioning, we propose 3D-MoRe, a novel paradigm designed to generate large-scale 3D-language datasets by leveraging the strengths of foundational models. The framework integrates key components, including multi-modal embedding, cross-modal interaction, and a language model decoder, to process natural language instructions and 3D scene data. This approach facilitates enhanced reasoning and response generation in complex 3D environments. Using the ScanNet 3D scene dataset, along with text annotations from ScanQA and ScanRefer, 3D-MoRe generates 62,000 question-answer (QA) pairs and 73,000 object descriptions across 1,513 scenes. We also employ various data augmentation techniques and implement semantic filtering to ensure high-quality data. Experiments on ScanQA demonstrate that 3D-MoRe significantly outperforms state-of-the-art baselines, with the CIDEr score improving by 2.15%. Similarly, on ScanRefer, our approach achieves a notable increase in CIDEr@0.5 of 1.84%, highlighting its effectiveness in both tasks. Our code and generated datasets will be publicly released to benefit the community, and both can be accessed at https://3D-MoRe.github.io.
[53] Intra-view and Inter-view Correlation Guided Multi-view Novel Class Discovery
Xinhang Wan,Jiyuan Liu,Qian Qu,Suyuan Liu,Chuyu Zhang,Fangdi Wang,Xinwang Liu,En Zhu,Kunlun He
Main category: cs.CV
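TL;DR: The paper proposes IICMVNCD, the first novel class discovery (NCD) method for multi-view data. Guided by intra-view and inter-view correlations, it addresses two limitations of existing methods: reliance on pseudo-labels and neglect of multi-view data.
Details
Motivation: Existing NCD methods handle only single-view data, and their reliance on pseudo-labels makes performance unstable. Multi-view data, such as multi-omics data, is increasingly common in practice and calls for a more robust method.
Contribution: 1. Explores NCD on multi-view data for the first time; 2. Proposes IICMVNCD, a framework guided by intra-view distributional consistency and inter-view relationships.
Method: 1. Within each view, uses matrix factorization to capture distributional consistency; 2. Across views, uses the view relationships of known classes to adjust view weights dynamically and guide the clustering of novel classes.
Result: Experiments validate the effectiveness of IICMVNCD and show superior performance on multi-view data.
Insight: Transferring inter-view relationship information and adjusting view weights dynamically are key to better multi-view NCD performance.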
Abstract: In this paper, we address the problem of novel class discovery (NCD), which aims to cluster novel classes by leveraging knowledge from disjoint known classes. While recent advances have made significant progress in this area, existing NCD methods face two major limitations. First, they primarily focus on single-view data (e.g., images), overlooking the increasingly common multi-view data, such as multi-omics datasets used in disease diagnosis. Second, their reliance on pseudo-labels to supervise novel class clustering often results in unstable performance, as pseudo-label quality is highly sensitive to factors such as data noise and feature dimensionality. To address these challenges, we propose a novel framework named Intra-view and Inter-view Correlation Guided Multi-view Novel Class Discovery (IICMVNCD), which is the first attempt to explore NCD in the multi-view setting. Specifically, at the intra-view level, leveraging the distributional similarity between known and novel classes, we employ matrix factorization to decompose features into view-specific shared base matrices and factor matrices. The base matrices capture distributional consistency between the two datasets, while the factor matrices model pairwise relationships between samples. At the inter-view level, we utilize view relationships among known classes to guide the clustering of novel classes. This includes generating predicted labels through the weighted fusion of factor matrices and dynamically adjusting view weights of known classes based on the supervision loss, which are then transferred to novel class learning. Experimental results validate the effectiveness of our proposed approach.
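The inter-view step in the abstract (weighted fusion of factor matrices, with view weights driven by the supervision loss on known classes) can be illustrated with a small sketch. The weighting rule below, a softmax over negative losses, is an assumption chosen for illustration, since the abstract does not give the update rule. All names and numbers are hypothetical.

```python
import math


def view_weights(losses):
    """Lower supervision loss -> larger view weight (softmax of -loss)."""
    exps = [math.exp(-loss) for loss in losses]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(factor_matrices, weights):
    """Weighted element-wise sum of per-view factor matrices."""
    n_rows = len(factor_matrices[0])
    n_cols = len(factor_matrices[0][0])
    return [
        [sum(w * m[i][j] for w, m in zip(weights, factor_matrices))
         for j in range(n_cols)]
        for i in range(n_rows)
    ]


def predict(fused):
    """Assign each sample to the cluster with the largest fused score."""
    return [max(range(len(row)), key=row.__getitem__) for row in fused]


# Two views, three samples, two clusters (made-up values).
view_a = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.3]]
view_b = [[0.8, 0.2], [0.3, 0.7], [0.1, 0.9]]
weights = view_weights([0.2, 1.5])  # view_a fits the known classes better
labels = predict(fuse([view_a, view_b], weights))
print(weights, labels)
```

In this toy case the two views disagree on the third sample, and the view with the lower supervision loss decides the outcome.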
[54] InstructFLIP: Exploring Unified Vision-Language Model for Face Anti-spoofing
Kun-Hsiang Lin,Yu-Wen Tseng,Kang-Yang Huang,Jhih-Ciang Wu,Wen-Huang Cheng
Main category: cs.CV
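TL;DR: InstructFLIP is an instruction-tuned framework built on vision-language models (VLMs) that improves generalization in face anti-spoofing. It decouples instructions into content and style components and significantly reduces cross-domain training redundancy.
Details
Motivation: Current face anti-spoofing (FAS) research focuses on cross-domain generalization but faces two challenges: limited semantic understanding of attack types and training redundancy across domains. The paper combines VLMs with a meta-domain strategy to address both.
Contribution: Proposes the InstructFLIP framework, which introduces VLMs to strengthen semantic understanding in FAS and uses instruction tuning to achieve cross-domain generalization while training on a single domain.
Method: Trains a unified model with a meta-domain strategy and decouples instructions into content (core spoofing semantics) and style (environment and camera characteristics).
Result: Experiments show that InstructFLIP outperforms existing SOTA models in accuracy and substantially reduces cross-domain training redundancy.
Insight: Decoupling instructions offers a new approach to FAS, and textual guidance can improve both semantic understanding and generalization.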
Abstract: Face anti-spoofing (FAS) aims to construct a robust system that can withstand diverse attacks. While recent efforts have concentrated mainly on cross-domain generalization, two significant challenges persist: limited semantic understanding of attack types and training redundancy across domains. We address the first by integrating vision-language models (VLMs) to enhance the perception of visual input. For the second challenge, we employ a meta-domain strategy to learn a unified model that generalizes well across multiple domains. Our proposed InstructFLIP is a novel instruction-tuned framework that leverages VLMs to enhance generalization via textual guidance trained solely on a single domain. At its core, InstructFLIP explicitly decouples instructions into content and style components, where content-based instructions focus on the essential semantics of spoofing, and style-based instructions consider variations related to the environment and camera characteristics. Extensive experiments demonstrate the effectiveness of InstructFLIP by outperforming SOTA models in accuracy and substantially reducing training redundancy across diverse domains in FAS. Project website is available at https://kunkunlin1221.github.io/InstructFLIP.
[55] MS-DETR: Towards Effective Video Moment Retrieval and Highlight Detection by Joint Motion-Semantic Learning
Hongxu Ma,Guanshuo Wang,Fufu Yu,Qiong Jia,Shouhong Ding
Main category: cs.CV
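TL;DR: MS-DETR is a joint motion-semantic learning framework for video moment retrieval (MR) and highlight detection (HD). It models disentangled intra-modal correlations within motion and semantics, exploits task-wise correlations across the two dimensions, and significantly improves performance.
Details
Motivation: Existing DETR-based frameworks for MR/HD do not fully exploit the relationship between motion and semantics in video, and the datasets suffer from sparsity.
Contribution: 1) Proposes MS-DETR, which learns motion and semantic features jointly; 2) Enriches the data with generation strategies and introduces contrastive denoising learning.
Method: 1) The encoder models disentangled intra-modal correlations within motion and semantics; 2) The decoder exploits task-wise correlations across the two dimensions; 3) Data enrichment and contrastive denoising learning.
Result: Outperforms existing SOTA models on four benchmarks.
Insight: In video tasks, learning motion and semantics jointly and addressing data sparsity are both crucial.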
Abstract: Video Moment Retrieval (MR) and Highlight Detection (HD) aim to pinpoint specific moments and assess clip-wise relevance based on the text query. While DETR-based joint frameworks have made significant strides, there remains untapped potential in harnessing the intricate relationships between temporal motion and spatial semantics within video content. In this paper, we propose the Motion-Semantics DETR (MS-DETR), a framework that captures rich motion-semantics features through unified learning for MR/HD tasks. The encoder first explicitly models disentangled intra-modal correlations within motion and semantics dimensions, guided by the given text queries. Subsequently, the decoder utilizes the task-wise correlation across temporal motion and spatial semantics dimensions to enable precise query-guided localization for MR and refined highlight boundary delineation for HD. Furthermore, we observe the inherent sparsity dilemma within the motion and semantics dimensions of MR/HD datasets. To address this issue, we enrich the corpus from both dimensions by generation strategies and propose contrastive denoising learning to ensure the above components learn robustly and effectively. Extensive experiments on four MR/HD benchmarks demonstrate that our method outperforms existing state-of-the-art models by a margin. Our code is available at https://github.com/snailma0229/MS-DETR.git.
[56] Foresight in Motion: Reinforcing Trajectory Prediction with Reward Heuristics
Muleilan Pei,Shaoshuai Shi,Xuesong Chen,Xu Liu,Shaojie Shen
Main category: cs.CV
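TL;DR: The paper proposes a trajectory prediction method grounded in inverse reinforcement learning that combines behavior intentions with reward heuristics and significantly improves prediction accuracy and confidence.
Details
Motivation: Motion forecasting is a critical but challenging task in autonomous driving systems. Conventional methods predict trajectories directly and overlook the role of behavior intentions. The paper rethinks the task from a planning perspective and proposes a strategy that combines intention reasoning with reward heuristics.
Contribution: 1. Proposes an intention reasoner based on Inverse Reinforcement Learning (IRL) that produces a compact reward distribution to guide trajectory prediction; 2. Develops a hierarchical DETR-like decoder combined with bidirectional selective state space models to generate accurate future trajectories and their probabilities.
Method: 1. Encodes traffic agents and scene elements in a vectorized representation; 2. Derives a reward distribution through query-centric IRL; 3. Uses the reward heuristic to reason about multiple intention hypotheses; 4. Generates trajectories with the DETR-like decoder and state space models.
Result: On the Argoverse and nuScenes datasets, the method significantly improves prediction confidence and is highly competitive with state-of-the-art methods.
Insight: Rethinking motion forecasting from a planning perspective, with intention reasoning and reward heuristics, can significantly improve prediction performance.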
Abstract: Motion forecasting for on-road traffic agents presents both a significant challenge and a critical necessity for ensuring safety in autonomous driving systems. In contrast to most existing data-driven approaches that directly predict future trajectories, we rethink this task from a planning perspective, advocating a “First Reasoning, Then Forecasting” strategy that explicitly incorporates behavior intentions as spatial guidance for trajectory prediction. To achieve this, we introduce an interpretable, reward-driven intention reasoner grounded in a novel query-centric Inverse Reinforcement Learning (IRL) scheme. Our method first encodes traffic agents and scene elements into a unified vectorized representation, then aggregates contextual features through a query-centric paradigm. This enables the derivation of a reward distribution, a compact yet informative representation of the target agent’s behavior within the given scene context via IRL. Guided by this reward heuristic, we perform policy rollouts to reason about multiple plausible intentions, providing valuable priors for subsequent trajectory generation. Finally, we develop a hierarchical DETR-like decoder integrated with bidirectional selective state space models to produce accurate future trajectories along with their associated probabilities. Extensive experiments on the large-scale Argoverse and nuScenes motion forecasting datasets demonstrate that our approach significantly enhances trajectory prediction confidence, achieving highly competitive performance relative to state-of-the-art methods.
[57] YOLOv8-SMOT: An Efficient and Robust Framework for Real-Time Small Object Tracking via Slice-Assisted Training and Adaptive Association
Xiang Yu,Xinyao Liu,Guang Liang
Main category: cs.CV
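TL;DR: YOLOv8-SMOT is an efficient and robust framework for real-time small object tracking. Slice-assisted training and adaptive association address the scarce features, complex motion, and occlusion that make small objects hard to track, and the method won the MVA 2025 challenge.
Details
Motivation: Tracking small, agile objects such as birds from a UAV perspective is highly challenging. The main difficulties are scarce appearance features, complex motion, and the identity ambiguity caused by frequent occlusion.
Contribution: 1. Proposes SliceTrain, a slice-based training framework that improves small object detection; 2. Designs a robust tracker that does not depend on appearance information and combines motion direction maintenance with an adaptive similarity metric.
Method: 1. Uses the SliceTrain framework, which combines deterministic full-coverage slicing with slice-level stochastic augmentation; 2. Integrates an EMA mechanism and an adaptive similarity metric into OC-SORT.
Result: Achieves SOTA performance on the SMOT4SB test set with an SO-HOTA of 55.205, validating the framework.
Insight: SliceTrain and the motion direction maintenance mechanism are the key innovations for handling insufficient learning and identity ambiguity in small object tracking.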
Abstract: Tracking small, agile multi-objects (SMOT), such as birds, from an Unmanned Aerial Vehicle (UAV) perspective is a highly challenging computer vision task. The difficulty stems from three main sources: the extreme scarcity of target appearance features, the complex motion entanglement caused by the combined dynamics of the camera and the targets themselves, and the frequent occlusions and identity ambiguity arising from dense flocking behavior. This paper details our championship-winning solution in the MVA 2025 “Finding Birds” Small Multi-Object Tracking Challenge (SMOT4SB), which adopts the tracking-by-detection paradigm with targeted innovations at both the detection and association levels. On the detection side, we propose a systematic training enhancement framework named SliceTrain. This framework, through the synergy of ‘deterministic full-coverage slicing’ and ‘slice-level stochastic augmentation’, effectively addresses the problem of insufficient learning for small objects in high-resolution image training. On the tracking side, we designed a robust tracker that is completely independent of appearance information. By integrating a motion direction maintenance (EMA) mechanism and an adaptive similarity metric combining bounding box expansion and distance penalty into the OC-SORT framework, our tracker can stably handle irregular motion and maintain target identities. Our method achieves state-of-the-art performance on the SMOT4SB public test set, reaching an SO-HOTA score of 55.205, which fully validates the effectiveness and advancement of our framework in solving complex real-world SMOT problems. The source code will be made available at https://github.com/Salvatore-Love/YOLOv8-SMOT.
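The idea of deterministic full-coverage slicing can be sketched as a fixed tiling of the high-resolution frame in which the last tile along each axis is snapped to the border, so no pixel is left out. This is a minimal illustration, not the SliceTrain implementation. The tile size, stride, and function names are assumptions, and coverage holds only when the stride does not exceed the tile size.

```python
def slice_starts(length, tile, stride):
    """Start offsets along one axis; the last tile is snapped to the border."""
    if length <= tile:
        return [0]
    starts = list(range(0, length - tile, stride))
    starts.append(length - tile)
    return starts


def full_coverage_slices(width, height, tile=1280, stride=1024):
    """Return (x1, y1, x2, y2) boxes that together cover the whole image."""
    return [
        (x, y, min(x + tile, width), min(y + tile, height))
        for y in slice_starts(height, tile, stride)
        for x in slice_starts(width, tile, stride)
    ]


# Hypothetical 4K frame: the same tiles are produced on every call.
tiles = full_coverage_slices(3840, 2160)
print(len(tiles), "tiles, first:", tiles[0], "last:", tiles[-1])
```

Slice-level stochastic augmentation would then be applied to each tile independently during training. That part is omitted here.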
[58] Out-of-distribution data supervision towards biomedical semantic segmentation
Yiquan Gao,Duohui Xu
Main category: cs.CV
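TL;DR: The paper proposes Med-OoD, a data-centric framework that introduces Out-of-Distribution (OoD) data supervision to reduce misclassification in biomedical image segmentation, without external data, feature regularization, or additional annotations. It can be applied directly to existing segmentation networks, improves performance considerably, and shows the potential of training a segmentation network on OoD data alone.
Details
Motivation: Biomedical segmentation networks trained on limited and imperfect datasets tend to confuse foreground and background. The strong results of OoD data on other visual tasks prompted the authors to explore its use in segmentation.
Contribution: Proposes the Med-OoD framework, which improves segmentation through OoD data supervision without external data, feature regularization, or additional annotations, and demonstrates that a segmentation network can be trained using OoD data only.
Method: Med-OoD adds OoD data supervision to existing segmentation networks without changing their architectures. Experiments verify that it prevents misclassification and improves performance.
Result: Achieves considerable performance improvements on the Lizard dataset and reaches 76.1% mIoU when trained only on OoD data.
Insight: OoD data may play an important role in biomedical image segmentation, which challenges the conventional learning paradigm that depends on annotated data.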
Abstract: Biomedical segmentation networks easily suffer from unexpected misclassification between foreground and background objects when learning on limited and imperfect medical datasets. Inspired by the strong power of Out-of-Distribution (OoD) data on other visual tasks, we propose a data-centric framework, Med-OoD, to address this issue by introducing OoD data supervision into fully-supervised biomedical segmentation without requiring any of the following: (i) external data sources, (ii) feature regularization objectives, (iii) additional annotations. Our method can be seamlessly integrated into segmentation networks without any modification to their architectures. Extensive experiments show that Med-OoD largely prevents various segmentation networks from misclassifying pixels on medical images and achieves considerable performance improvements on the Lizard dataset. We also present an emerging learning paradigm of training a medical segmentation network completely using OoD data devoid of foreground class labels, which surprisingly yields 76.1% mIoU as the test result. We hope this learning paradigm will attract people to rethink the roles of OoD data. Code is made available at https://github.com/StudioYG/Med-OoD.
[59] Non-Adaptive Adversarial Face Generation
Sunpill Kim,Seunghun Paik,Chanwoo Hwang,Minsu Kim,Jae Hong Seo
Main category: cs.CV
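TL;DR: The paper proposes a new non-adaptive method for generating adversarial faces. By exploiting the structure of the FRS feature space, it needs only a small number of queries to produce faces that look clearly different yet are recognized as the target identity, without relying on transferability or open-source surrogate models.
Details
Motivation: Face recognition systems (FRSs) face serious security and privacy risks from adversarial attacks, especially in identity verification. Existing methods usually rely on iterative optimization or transfer-based attacks. The paper aims for a more efficient generation method that requires no adaptive queries.
Contribution: 1. Proposes a non-adaptive adversarial face generation method based on the structure of the FRS feature space; 2. Uses the properties of attributed subspheres to reach a high success rate with only a few queries; 3. Can generate adversarial faces with specific high-level attributes.
Method: Analysis of the FRS feature space shows that faces sharing an attribute (e.g., gender or race) concentrate within a subsphere. Adversarial faces are then sampled directly from such subspheres, without iterative optimization or frequent queries.
Result: Against AWS's CompareFaces API, a single non-adaptive query containing 100 face images achieves a success rate above 93%, significantly better than existing methods.
Insight: Structural properties of the FRS feature space, such as attributed subspheres, open a new direction for adversarial attack research and expose potential weaknesses in existing systems.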
Abstract: Adversarial attacks on face recognition systems (FRSs) pose serious security and privacy threats, especially when these systems are used for identity verification. In this paper, we propose a novel method for generating adversarial faces: synthetic facial images that are visually distinct yet recognized as a target identity by the FRS. Unlike iterative optimization-based approaches (e.g., gradient descent or other iterative solvers), our method leverages the structural characteristics of the FRS feature space. We find that individuals sharing the same attribute (e.g., gender or race) form an attributed subsphere. By utilizing such subspheres, our method achieves both non-adaptiveness and a remarkably small number of queries. This eliminates the need for relying on transferability and open-source surrogate models, which have been a typical strategy when repeated adaptive queries to commercial FRSs are impossible. Despite requiring only a single non-adaptive query consisting of 100 face images, our method achieves a high success rate of over 93% against AWS’s CompareFaces API at its default threshold. Furthermore, unlike many existing attacks that perturb a given image, our method can deliberately produce adversarial faces that impersonate the target identity while exhibiting high-level attributes chosen by the adversary.
[60] LidarPainter: One-Step Away From Any Lidar View To Novel Guidance
Yuzhou Ji,Ke Ma,Hong Cai,Anchun Zhang,Lizhuang Ma,Xin Tan
Main category: cs.CV
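TL;DR: LidarPainter is a one-step diffusion model that recovers consistent driving views in real time from sparse LiDAR conditions and artifact-corrupted renderings, supporting high-fidelity lane shifts and stylized generation.
Details
Motivation: Dynamic driving scene reconstruction matters for digital twin systems and autonomous driving simulation, but existing methods degrade the background and vehicle models when the view deviates from the input trajectory, and they have problems with speed, consistency, and resource efficiency.
Contribution: Proposes LidarPainter, a fast and efficient diffusion model that recovers high-quality driving views in real time and supports stylized generation.
Method: Uses a one-step diffusion model to generate consistent driving views directly from sparse LiDAR conditions and artifact-corrupted renderings, with stylized generation driven by text prompts such as “foggy” or “night”.
Result: Experiments show that LidarPainter outperforms existing methods in speed, quality, and resource efficiency (7x faster than StreetCrafter with only one fifth of the GPU memory) and supports stylized generation.
Insight: A one-step diffusion model makes driving scene reconstruction both efficient and high-quality, offering a new option for digital twins and autonomous driving simulation.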
Abstract: Dynamic driving scene reconstruction is of great importance in fields like digital twin systems and autonomous driving simulation. However, unacceptable degradation occurs when the view deviates from the input trajectory, leading to corrupted background and vehicle models. Existing methods for improving reconstruction quality on novel trajectories are subject to various limitations, including inconsistency, deformation, and time consumption. This paper proposes LidarPainter, a one-step diffusion model that recovers consistent driving views from sparse LiDAR conditions and artifact-corrupted renderings in real time, enabling high-fidelity lane shifts in driving scene reconstruction. Extensive experiments show that LidarPainter outperforms state-of-the-art methods in speed, quality and resource efficiency; specifically, it is 7x faster than StreetCrafter while requiring only one fifth of the GPU memory. LidarPainter also supports stylized generation using text prompts such as “foggy” and “night”, allowing for a diverse expansion of the existing asset library.
[61] Open-Vocabulary Indoor Object Grounding with 3D Hierarchical Scene Graph
Sergey Linok,Gleb Naumov
Main category: cs.CV
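TL;DR: The paper proposes OVIGo-3DHSG, which grounds indoor objects in open-vocabulary settings using a 3D hierarchical scene graph and uses a large language model to improve spatial reasoning.
Details
Motivation: Existing indoor scene understanding methods struggle with complex spatial relations and open-vocabulary queries, so a multi-level representation that combines geometric and semantic information is needed.
Contribution: Proposes an open-vocabulary object grounding method based on a 3D hierarchical scene graph and combines it with a large language model for multistep spatial reasoning.
Method: Builds the hierarchical scene graph from RGB-D data and open-vocabulary foundation models, handles complex queries with a large language model, and uses inter-layer and intra-layer connections to strengthen spatial understanding.
Result: Shows efficient scene comprehension and robust object grounding on multi-floor Habitat Matterport 3D scenes.
Insight: A hierarchical scene graph combined with a language model can markedly improve complex spatial tasks and suits applications that need precise spatial reasoning.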
Abstract: We propose OVIGo-3DHSG, a method for Open-Vocabulary Indoor Grounding of objects using a 3D Hierarchical Scene Graph. OVIGo-3DHSG represents an extensive indoor environment as a Hierarchical Scene Graph derived from sequences of RGB-D frames, utilizing a set of open-vocabulary foundation models and sensor data processing. The hierarchical representation explicitly models spatial relations across floors, rooms, locations, and objects. To effectively address complex queries involving spatial reference to other objects, we integrate the hierarchical scene graph with a Large Language Model for multistep reasoning. This integration leverages inter-layer (e.g., room-to-object) and intra-layer (e.g., object-to-object) connections, enhancing spatial contextual understanding. We investigate the semantic and geometric accuracy of the hierarchical representation on Habitat Matterport 3D Semantic multi-floor scenes. Our approach demonstrates efficient scene comprehension and robust object grounding compared to existing methods. Overall, OVIGo-3DHSG demonstrates strong potential for applications requiring spatial reasoning and understanding of indoor environments. Related materials can be found at https://github.com/linukc/OVIGo-3DHSG.
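The layered representation described above (floors, rooms, locations, objects, with inter-layer and intra-layer connections) can be pictured as a small tree with extra relation edges. The sketch below is a hypothetical data structure, not the authors' code. The `Node` class, the `building` root, and the hand-written `ground` query are all assumptions, and the query stands in for the LLM-driven multistep reasoning, which is not modeled here.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class Node:
    name: str
    layer: str  # "floor", "room", "location", or "object"
    children: list = field(default_factory=list)   # inter-layer edges
    relations: list = field(default_factory=list)  # intra-layer (relation, Node)

    def add_child(self, child):
        self.children.append(child)
        return child


def walk(node):
    """Depth-first traversal over inter-layer edges."""
    yield node
    for child in node.children:
        yield from walk(child)


def ground(root, room_name, label, relation, anchor_label):
    """Find objects `label` in `room_name` that are `relation` to an anchor."""
    hits = []
    for room in walk(root):
        if room.layer != "room" or room.name != room_name:
            continue
        for obj in walk(room):
            if obj.layer != "object" or obj.name != label:
                continue
            if any(rel == relation and other.name == anchor_label
                   for rel, other in obj.relations):
                hits.append(obj)
    return hits


building = Node("building", "root")  # hypothetical root above the floors
floor = building.add_child(Node("floor_1", "floor"))
kitchen = floor.add_child(Node("kitchen", "room"))
office = floor.add_child(Node("office", "room"))
counter = kitchen.add_child(Node("counter", "location"))
desk = office.add_child(Node("desk", "location"))
kettle = counter.add_child(Node("kettle", "object"))
kitchen_mug = counter.add_child(Node("mug", "object"))
office_mug = desk.add_child(Node("mug", "object"))
kitchen_mug.relations.append(("near", kettle))

# "The mug near the kettle in the kitchen"
matches = ground(building, "kitchen", "mug", "near", "kettle")
print([m.name for m in matches])
```

The room filter uses the inter-layer edges and the `near` check uses an intra-layer edge, which together separate the two mugs.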
[62] Block-based Symmetric Pruning and Fusion for Efficient Vision Transformers
Yi-Kuan Hsieh,Jun-Wei Hsieh,Xin Li,Yu-Ming Chang,Yu-Chee Tseng
Main category: cs.CV
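TL;DR: The paper proposes BSPF-ViT, which improves the computational efficiency of ViT through symmetric pruning and fusion, raising accuracy while lowering computational cost.
Details
Motivation: The high computational complexity of Vision Transformers limits their practical use, and existing pruning methods ignore interactions between tokens, which costs accuracy.
Contribution: Proposes Block-based Symmetric Pruning and Fusion (BSPF-ViT), which optimizes the pruning of Q/K tokens jointly, preserves key information, and improves computational efficiency.
Method: Decides which tokens to retain by evaluating each token together with its neighbors, then applies similarity fusion. Because the attention matrix is symmetric, only its upper triangular part is pruned, which speeds up computation.
Result: Performs well across several ViT models: ImageNet classification accuracy rises by 1.3% on DeiT-T and 2.0% on DeiT-S, computational overhead falls by 50%, and speed improves by 40%.
Insight: Symmetric pruning and fusion can improve accuracy and efficiency at the same time, which suggests a new route to lightweight ViT design.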
Abstract: Vision Transformer (ViT) has achieved impressive results across various vision tasks, yet its high computational cost limits practical applications. Recent methods have aimed to reduce ViT’s $O(n^2)$ complexity by pruning unimportant tokens. However, these techniques often sacrifice accuracy by independently pruning query (Q) and key (K) tokens, leading to performance degradation due to overlooked token interactions. To address this limitation, we introduce a novel Block-based Symmetric Pruning and Fusion method for efficient ViT (BSPF-ViT) that optimizes the pruning of Q/K tokens jointly. Unlike previous methods that consider only a single direction, our approach evaluates each token and its neighbors to decide which tokens to retain by taking token interaction into account. The retained tokens are compressed through a similarity fusion step, preserving key information while reducing computational costs. The shared weights of Q/K tokens create a symmetric attention matrix, allowing pruning only the upper triangular part for speed up. BSPF-ViT consistently outperforms state-of-the-art ViT methods at all pruning levels, increasing ImageNet classification accuracy by 1.3% on DeiT-T and 2.0% on DeiT-S, while reducing computational overhead by 50%. It achieves 40% speedup with improved accuracy across various ViTs.
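The symmetry the abstract relies on follows from weight sharing: if queries and keys come from the same projection, then Q = K and the pre-softmax score matrix Q Kᵀ equals its own transpose, so only the upper triangle needs to be computed or inspected. The sketch below illustrates just that observation on already-projected token vectors. It is not the BSPF-ViT pruning procedure, and the function name is hypothetical.

```python
def symmetric_scores(tokens):
    """Pre-softmax scores when Q and K share weights (Q == K).

    Only the n * (n + 1) / 2 entries on or above the diagonal are computed;
    the lower triangle is filled in by mirroring.
    """
    n = len(tokens)
    scores = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            s = sum(a * b for a, b in zip(tokens[i], tokens[j]))
            scores[i][j] = s
            scores[j][i] = s
    return scores


# Three hypothetical projected tokens of dimension 2.
tokens = [[1.0, 0.0], [0.5, 0.5], [0.0, 2.0]]
scores = symmetric_scores(tokens)
for row in scores:
    print(row)
```

Note that a row-wise softmax applied afterwards generally breaks the symmetry, so the saving applies to the similarity scores themselves.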
[63] AD-GS: Object-Aware B-Spline Gaussian Splatting for Self-Supervised Autonomous Driving
Jiawei Xu,Kai Deng,Zexin Fan,Shenlong Wang,Jin Xie,Jian Yang
Main category: cs.CV
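TL;DR: AD-GS is a self-supervised framework for high-quality rendering of driving scenes. It models dynamic objects by combining B-spline curves with global trigonometric functions, segments scenes without annotations, and improves rendering quality.
Details
Motivation: Current high-quality dynamic scene rendering methods depend on costly annotations, while self-supervised methods fail to capture dynamic motion accurately or decompose scenes properly, which causes rendering artifacts. AD-GS aims to solve this problem.
Contribution: 1. Combines B-spline curves with trigonometric functions for flexible yet precise dynamic object modeling; 2. Decomposes scenes automatically through pseudo 2D segmentation, without annotations; 3. Introduces visibility reasoning and physically rigid regularization to improve robustness.
Method: 1. Models dynamic objects with B-splines and trigonometric functions; 2. Uses pseudo 2D segmentation to split scenes into objects and background; 3. Represents objects with dynamic Gaussians and bidirectional temporal visibility masks; 4. Adds visibility reasoning and physical regularization.
Result: AD-GS significantly outperforms other annotation-free methods and is competitive with annotation-dependent methods.
Insight: The new motion model and self-supervised segmentation provide an efficient, low-cost solution for dynamic scene rendering.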
Abstract: Modeling and rendering dynamic urban driving scenes is crucial for self-driving simulation. Current high-quality methods typically rely on costly manual object tracklet annotations, while self-supervised approaches fail to capture dynamic object motions accurately and decompose scenes properly, resulting in rendering artifacts. We introduce AD-GS, a novel self-supervised framework for high-quality free-viewpoint rendering of driving scenes from a single log. At its core is a novel learnable motion model that integrates locality-aware B-spline curves with global-aware trigonometric functions, enabling flexible yet precise dynamic object modeling. Rather than requiring comprehensive semantic labeling, AD-GS automatically segments scenes into objects and background with the simplified pseudo 2D segmentation, representing objects using dynamic Gaussians and bidirectional temporal visibility masks. Further, our model incorporates visibility reasoning and physically rigid regularization to enhance robustness. Extensive evaluations demonstrate that our annotation-free model significantly outperforms current state-of-the-art annotation-free methods and is competitive with annotation-dependent approaches.
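One way to picture a motion model that pairs locality-aware B-splines with global trigonometric functions is a one-dimensional offset curve: a uniform cubic B-spline, where each control point affects only nearby times, plus a few sine and cosine terms that span the whole sequence. The sketch below is an assumption-laden illustration in scalar form. The parameterization, the `harmonics` tuples, and the function names are hypothetical and are not taken from AD-GS.

```python
import math


def cubic_bspline(control, t):
    """Uniform cubic B-spline over scalar control points, t in [0, 1].

    Requires at least four control points.
    """
    n_seg = len(control) - 3
    u = min(max(t, 0.0), 1.0) * n_seg
    k = min(int(u), n_seg - 1)  # segment index
    s = u - k                   # local parameter in [0, 1]
    b0 = (1 - s) ** 3 / 6
    b1 = (3 * s**3 - 6 * s**2 + 4) / 6
    b2 = (-3 * s**3 + 3 * s**2 + 3 * s + 1) / 6
    b3 = s**3 / 6
    p = control[k:k + 4]
    return b0 * p[0] + b1 * p[1] + b2 * p[2] + b3 * p[3]


def motion(t, control, harmonics):
    """Local B-spline term plus global sine/cosine terms.

    `harmonics` is a list of (frequency, sin_coeff, cos_coeff) tuples.
    """
    trig = sum(
        a * math.sin(2 * math.pi * f * t) + b * math.cos(2 * math.pi * f * t)
        for f, a, b in harmonics
    )
    return cubic_bspline(control, t) + trig


# Hypothetical learnable parameters for one coordinate of one object.
control = [0.0, 1.0, 4.0, 2.0, 3.0, 5.0, 1.0]
harmonics = [(1.0, 0.3, 0.0), (2.0, 0.0, 0.1)]
print([round(motion(i / 10, control, harmonics), 3) for i in range(11)])
```

In a learned model both the control points and the harmonic coefficients would be optimized. Here they are fixed numbers for illustration.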
[64] Fine-Grained Image Recognition from Scratch with Teacher-Guided Data Augmentation
Edwin Arkel Rios,Fernando Mikael,Oswin Gosal,Femiloye Oyerinde,Hao-Chun Liang,Bo-Cheng Lai,Min-Chun Hu
Main category: cs.CV
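TL;DR: The paper proposes TGDA, a framework that uses teacher-guided data augmentation and knowledge distillation to train high-performance fine-grained image recognition models from scratch, removing the dependence on pretrained models.
Details
Motivation: Existing fine-grained image recognition methods depend on large-scale pretrained models, which limits their use in resource-constrained environments and hinders the development of task-specific architectures. The paper explores whether training from scratch is feasible.
Contribution: 1. Proposes the TGDA framework, which combines data augmentation with knowledge distillation guided by a teacher model; 2. Designs task-specific architectures (LRNets and ViTFS); 3. Matches or surpasses pretrained models on several benchmarks.
Method: TGDA trains models from scratch through knowledge distillation from a fine-grained-aware teacher model combined with data augmentation. Task-specific architectures such as LRNets and ViTFS further optimize performance.
Result: TGDA matches or surpasses pretrained counterparts with both low-resolution and high-resolution inputs. LRNets improve accuracy by up to 23% with up to 20.6x fewer parameters, and ViTFS-T matches ViT B-16 with 15.3x fewer trainable parameters.
Insight: Training fine-grained image recognition systems from scratch is feasible. TGDA opens the way to task-specific and hardware-aware architecture design and reduces dependence on pretrained models.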
Abstract: Fine-grained image recognition (FGIR) aims to distinguish visually similar sub-categories within a broader class, such as identifying bird species. While most existing FGIR methods rely on backbones pretrained on large-scale datasets like ImageNet, this dependence limits adaptability to resource-constrained environments and hinders the development of task-specific architectures tailored to the unique challenges of FGIR. In this work, we challenge the conventional reliance on pretrained models by demonstrating that high-performance FGIR systems can be trained entirely from scratch. We introduce a novel training framework, TGDA, that integrates data-aware augmentation with weak supervision via a fine-grained-aware teacher model, implemented through knowledge distillation. This framework unlocks the design of task-specific and hardware-aware architectures, including LRNets for low-resolution FGIR and ViTFS, a family of Vision Transformers optimized for efficient inference. Extensive experiments across three FGIR benchmarks over diverse settings involving low-resolution and high-resolution inputs show that our method consistently matches or surpasses state-of-the-art pretrained counterparts. In particular, in the low-resolution setting, LRNets trained with TGDA improve accuracy by up to 23% over prior methods while requiring up to 20.6x fewer parameters, lower FLOPs, and significantly less training data. Similarly, ViTFS-T can match the performance of a ViT B-16 pretrained on ImageNet-21k while using 15.3x fewer trainable parameters and requiring orders of magnitude less data. These results highlight TGDA’s potential as an adaptable alternative to pretraining, paving the way for more efficient fine-grained vision systems.
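The abstract says the teacher's supervision is implemented through knowledge distillation but does not give the loss. For orientation, the sketch below shows the widely used temperature-scaled formulation: KL divergence between softened teacher and student distributions, scaled by T². Whether TGDA uses exactly this form is an assumption, and the logits are made-up values.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with temperature scaling."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=4.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl


# Hypothetical logits over three bird species.
teacher = [4.0, 3.5, -1.0]   # teacher sees the first two as similar
student = [5.0, 0.0, 0.0]    # student is overconfident in the first
print(round(distillation_loss(student, teacher), 4))
```

The softened teacher distribution carries information about which sub-categories resemble each other, which is the kind of fine-grained signal a hard label alone would not provide.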
[65] Hybrid Ensemble Approaches: Optimal Deep Feature Fusion and Hyperparameter-Tuned Classifier Ensembling for Enhanced Brain Tumor Classification
Zahid Ullah,Dragan Pamucar,Jihie Kim
Main category: cs.CV
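TL;DR: The paper proposes a double ensembling framework that combines ensembled pre-trained deep learning models with ensembled machine learning classifiers, using feature fusion and hyperparameter tuning to improve brain tumor classification accuracy.
Details
Motivation: Conventional MRI diagnosis relies on expert assessment, which is affected by fatigue, limited expertise, and insufficient image detail and can lead to misdiagnosis or missed tumors. The paper aims to improve diagnostic precision through an automated combination of deep learning and machine learning.
Contribution: Proposes a double ensembling framework comprising: 1) ensembled pre-trained deep learning models for feature extraction; 2) ensembled, hyperparameter-tuned machine learning classifiers; 3) feature fusion and classifier fusion; 4) validation on several public datasets.
Method: The method includes: 1) data preprocessing and augmentation; 2) deep feature extraction with several pre-trained deep convolutional neural networks and vision transformers; 3) hyperparameter tuning of the machine learning classifiers; 4) feature fusion combined with classifier ensembling.
Result: Feature fusion and classifier fusion improve upon the state of the art, and hyperparameter tuning further strengthens the ensemble. An ablation study shows how each component contributes to classification accuracy.
Insight: Pairing deep feature extraction with tuned machine learning classifiers has clear advantages for medical image classification, and hyperparameter tuning is key to the gain. Feature fusion and ensembling help with small-sample or complex-background cases.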
Abstract: Magnetic Resonance Imaging (MRI) is widely recognized as the most reliable tool for detecting tumors due to its capability to produce detailed images that reveal their presence. However, the accuracy of diagnosis can be compromised when human specialists evaluate these images. Factors such as fatigue, limited expertise, and insufficient image detail can lead to errors. For example, small tumors might go unnoticed, or overlap with healthy brain regions could result in misidentification. To address these challenges and enhance diagnostic precision, this study proposes a novel double ensembling framework, consisting of ensembled pre-trained deep learning (DL) models for feature extraction and ensembled fine-tuned hyperparameter machine learning (ML) models to efficiently classify brain tumors. Specifically, our method includes extensive preprocessing and augmentation, transfer learning concepts by utilizing various pre-trained deep convolutional neural networks and vision transformer networks to extract deep features from brain MRI, and fine-tune hyperparameters of ML classifiers. Our experiments utilized three different publicly available Kaggle MRI brain tumor datasets to evaluate the pre-trained DL feature extractor models, ML classifiers, and the effectiveness of an ensemble of deep features along with an ensemble of ML classifiers for brain tumor classification. Our results indicate that the proposed feature fusion and classifier fusion improve upon the state of the art, with hyperparameter fine-tuning providing a significant enhancement over the ensemble method. Additionally, we present an ablation study to illustrate how each component contributes to accurate brain tumor classification.
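The two ensembling stages can be illustrated schematically. The abstract does not say how features are fused or how classifier outputs are combined, so the sketch below assumes the simplest options: concatenation for feature fusion and majority voting for classifier fusion. All names, feature values, and predictions are hypothetical.

```python
from collections import Counter


def fuse_features(*feature_vectors):
    """Feature fusion by concatenating per-extractor feature vectors."""
    fused = []
    for vector in feature_vectors:
        fused.extend(vector)
    return fused


def majority_vote(predictions):
    """Classifier fusion: the label predicted by the most classifiers."""
    return Counter(predictions).most_common(1)[0][0]


# Hypothetical deep features for one MRI slice from two extractors.
cnn_features = [0.12, 0.80, 0.33]
vit_features = [0.55, 0.07]
fused = fuse_features(cnn_features, vit_features)

# Hypothetical predictions from three tuned classifiers on the fused vector.
votes = ["glioma", "glioma", "meningioma"]
print(len(fused), majority_vote(votes))
```

In practice each classifier would be trained and tuned on the fused vectors. Weighted or probability-averaged voting are common alternatives to the plain majority rule shown here.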
[66] Revealing the Ancient Beauty: Digital Reconstruction of Temple Tiles using Computer Vision
Arkaprabha Basu
Main category: cs.CV
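TL;DR: The paper proposes three computer vision techniques for the digital reconstruction of Indian monuments: Fractal Convolution, Self-Sensitive Tile Filling (SSTF), and the MosaicSlice data augmentation method. It also uses super resolution to improve image quality, balancing efficiency and aesthetics in cultural heritage protection.
Details
Motivation: Demand for modern digital methods in cultural heritage protection is growing, and Indian monuments, with their distinctive architecture and aesthetic value, call for specialized techniques. The study therefore proposes new methods based on computer vision.
Contribution: 1. Proposes the Fractal Convolution method for segmenting and revealing fine architectural patterns in monuments. 2. Develops the Self-Sensitive Tile Filling (SSTF) method, designed for restoring the Bankura Terracotta Temples. 3. Designs MosaicSlice, a new data augmentation technique, and uses super resolution to improve image quality.
Method: 1. Fractal Convolution: a segmentation method based on image processing that extracts architectural details. 2. SSTF: a tile-filling technique used together with MosaicSlice data augmentation. 3. Super resolution: upscales images without significant loss of quality.
Result: The study achieves highly detailed reconstruction of monument tiles while preserving authenticity, and automation lowers cost, yielding an efficient and aesthetically strong solution.
Insight: Computer vision can support efficient protection and restoration of cultural heritage while balancing tradition and innovation, and it suggests new directions for multidisciplinary collaboration.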
Abstract: Modern digitised approaches have dramatically changed the preservation and restoration of cultural treasures, integrating computer scientists into multidisciplinary projects with ease. Machine learning, deep learning, and computer vision techniques have revolutionised developing sectors like 3D reconstruction, picture inpainting, IoT-based methods, genetic algorithms, and image processing. We suggest three cutting-edge techniques in recognition of the special qualities of Indian monuments, which are famous for their architectural skill and aesthetic appeal. The first is the Fractal Convolution methodology, a segmentation method based on image processing that successfully reveals subtle architectural patterns within these irreplaceable cultural buildings. The second is a Self-Sensitive Tile Filling (SSTF) method created especially for West Bengal’s Bankura Terracotta Temples, and the third is a new data augmentation method called MosaicSlice. Furthermore, we explore a Super Resolution strategy to upscale the images without losing a significant amount of quality. Our methods allow for seamless region-filling and highly detailed tiles while maintaining authenticity, using a novel data augmentation strategy and introducing automation at affordable cost. By providing effective solutions that preserve the delicate balance between tradition and innovation, this study advances the field and supports efficiency and aesthetic quality in cultural heritage protection.
[67] MGFFD-VLM: Multi-Granularity Prompt Learning for Face Forgery Detection with VLM
Tao Chen,Jingyi Zhang,Decheng Liu,Chunlei Peng
Main category: cs.CV
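TL;DR: The paper proposes MGFFD-VLM, a framework that uses multi-granularity prompt learning and an attribute-driven hybrid LoRA strategy to improve visual large language models (VLMs) on deepfake detection while also improving interpretability.
Details
Motivation: Existing VLM-based deepfake detection methods do not fully exploit face quality-related attributes and lack effective training strategies.
Contribution: 1. Extends the VQA dataset into DD-VQA+, with more attributes and more diverse samples; 2. Proposes the MGFFD-VLM framework, which integrates multi-granularity prompt learning and a forgery-aware training strategy; 3. Designs multiple forgery-related auxiliary losses to improve performance.
Method: 1. Enhances the VLM with an Attribute-Driven Hybrid LoRA Strategy; 2. Applies multi-granularity prompt learning; 3. Converts classification and forgery segmentation results into prompts; 4. Introduces a forgery-aware training strategy and auxiliary losses.
Result: Experiments show that MGFFD-VLM outperforms existing methods in text-based forgery judgment and analysis, with higher accuracy.
Insight: Combining multi-granularity prompts with an attribute-driven strategy improves both the performance and the interpretability of VLMs for deepfake detection.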
Abstract: Recent studies have utilized visual large language models (VLMs) to answer not only “Is this face a forgery?” but also “Why is the face a forgery?” These studies introduced forgery-related attributes, such as forgery location and type, to construct deepfake VQA datasets and train VLMs, achieving high accuracy while providing human-understandable explanatory text descriptions. However, these methods still have limitations. For example, they do not fully leverage face quality-related attributes, which are often abnormal in forged faces, and they lack effective training strategies for forgery-aware VLMs. In this paper, we extend the VQA dataset to create DD-VQA+, which features a richer set of attributes and a more diverse range of samples. Furthermore, we introduce a novel forgery detection framework, MGFFD-VLM, which integrates an Attribute-Driven Hybrid LoRA Strategy to enhance the capabilities of Visual Large Language Models (VLMs). Additionally, our framework incorporates Multi-Granularity Prompt Learning and a Forgery-Aware Training Strategy. By transforming classification and forgery segmentation results into prompts, our method not only improves forgery classification but also enhances interpretability. To further boost detection performance, we design multiple forgery-related auxiliary losses. Experimental results demonstrate that our approach surpasses existing methods in both text-based forgery judgment and analysis, achieving superior accuracy.
[68] Generate to Ground: Multimodal Text Conditioning Boosts Phrase Grounding in Medical Vision-Language Models
Felix Nützel,Mischa Dombrowski,Bernhard Kainz
Main category: cs.CV
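TL;DR: The paper applies generative diffusion models with multimodal text conditioning to phrase grounding in medical images and introduces a new post-processing technique (BBM) that further improves performance.
Details
Motivation: Current discriminative, self-supervised contrastive methods perform only moderately on phrase grounding in medical images, and the potential of generative diffusion models remains underexplored.
Contribution: 1. Shows that generative diffusion models achieve superior zero-shot phrase grounding; 2. Proposes Bimodal Bias Merging (BBM), a post-processing technique that further improves localization accuracy; 3. Shows experimentally that a domain-specific language model (such as CXR-BERT) substantially outperforms domain-agnostic ones.
Method: 1. Extracts cross-attention maps from the text-to-image diffusion model; 2. Fine-tunes the diffusion model with a frozen, domain-specific language model (CXR-BERT); 3. Applies BBM, which aligns text and image biases to identify regions of high certainty.
Result: The method achieves mIoU scores double those of current discriminative methods.
Insight: Generative models hold large potential for phrase grounding in medical images; combining a domain-specific language model with post-processing improves performance substantially and offers a more robust and interpretable option for clinical use.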
Abstract: Phrase grounding, i.e., mapping natural language phrases to specific image regions, holds significant potential for disease localization in medical imaging through clinical reports. While current state-of-the-art methods rely on discriminative, self-supervised contrastive models, we demonstrate that generative text-to-image diffusion models, leveraging cross-attention maps, can achieve superior zero-shot phrase grounding performance. Contrary to prior assumptions, we show that fine-tuning diffusion models with a frozen, domain-specific language model, such as CXR-BERT, substantially outperforms domain-agnostic counterparts. This setup achieves remarkable improvements, with mIoU scores doubling those of current discriminative methods. These findings highlight the underexplored potential of generative models for phrase grounding tasks. To further enhance performance, we introduce Bimodal Bias Merging (BBM), a novel post-processing technique that aligns text and image biases to identify regions of high certainty. BBM refines cross-attention maps, achieving even greater localization accuracy. Our results establish generative approaches as a more effective paradigm for phrase grounding in the medical imaging domain, paving the way for more robust and interpretable applications in clinical practice. The source code and model weights are available at https://github.com/Felix-012/generate_to_ground.
[69] Calisthenics Skills Temporal Video Segmentation
Antonio Finocchiaro,Giovanni Maria Farinella,Antonino Furnari
Main category: cs.CV
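TL;DR: The paper poses the problem of temporal video segmentation of static calisthenics skills and builds an annotated dataset, providing an initial basis for developing automated tools.
Details
Motivation: Static calisthenics skills are evaluated by their difficulty and hold duration, but no automated tools exist to segment and assess these skills from video. The paper aims to fill this gap and to support athlete training and competition judging.
Contribution: An annotated dataset of static calisthenics skills and a baseline approach to the skill temporal segmentation problem.
Method: The authors build a video dataset annotated with the temporal extent of each skill and report a baseline approach to the temporal segmentation problem.
Result: The results show that the problem is feasible, with room for improvement.
Insight: No previous work has specifically addressed temporal segmentation of calisthenics skills; future work could combine more advanced video understanding and pose analysis methods to improve performance.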
Abstract: Calisthenics is a fast-growing bodyweight discipline that consists of different categories, one of which is focused on skills. Skills in calisthenics encompass both static and dynamic elements performed by athletes. The evaluation of static skills is based on their difficulty level and the duration of the hold. Automated tools able to recognize isometric skills from a video by segmenting them to estimate their duration would be desirable to assist athletes in their training and judges during competitions. Although the video understanding literature on action recognition through body pose analysis is rich, no previous work has specifically addressed the problem of calisthenics skill temporal video segmentation. This study aims to provide an initial step towards the implementation of automated tools within the field of Calisthenics. To advance knowledge in this context, we propose a dataset of video footage of static calisthenics skills performed by athletes. Each video is annotated with a temporal segmentation which determines the extent of each skill. We hence report the results of a baseline approach to address the problem of skill temporal segmentation on the proposed dataset. The results highlight the feasibility of the proposed problem, while there is still room for improvement.
[70] Site-Level Fine-Tuning with Progressive Layer Freezing: Towards Robust Prediction of Bronchopulmonary Dysplasia from Day-1 Chest Radiographs in Extremely Preterm Infants
Sybelle Goedicke-Fritz,Michelle Bous,Annika Engel,Matthias Flotho,Pascal Hirsch,Hannah Wittig,Dino Milanovic,Dominik Mohr,Mathias Kaspar,Sogand Nemat,Dorothea Kerner,Arno Bücker,Andreas Keller,Sascha Meyer,Michael Zemlin,Philipp Flotho
Main category: cs.CV
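TL;DR: The paper proposes a deep learning approach using progressive layer freezing and linear probing to predict bronchopulmonary dysplasia (BPD) from chest radiographs taken within 24 hours of birth in extremely preterm infants. Built on in-domain pretraining, the method performs well and is practical for clinical use.
Details
Motivation: BPD is a serious chronic lung disease of preterm infants, and early prediction is crucial to avoid unnecessary treatment risks. Because routine imaging grades (such as IRDS) have limited predictive value, the authors explore a non-invasive deep learning approach.
Contribution: 1. Proposes a deep learning model combining progressive layer freezing, linear probing, and CutMix augmentation. 2. Demonstrates the importance of in-domain pretraining (on adult chest radiographs) for BPD prediction. 3. The model reaches an AUROC of 0.78, significantly better than an ImageNet-initialized model.
Method: Fine-tunes a ResNet-50 pretrained on adult chest radiographs, using progressive layer freezing with discriminative learning rates to prevent overfitting, together with CutMix augmentation and linear probing.
Result: For moderate/severe BPD prediction, the model achieves an AUROC of 0.78 ± 0.10, balanced accuracy of 0.69 ± 0.10, and an F1-score of 0.67 ± 0.11, outperforming both the ImageNet-initialized model and routine IRDS grades.
Insight: In-domain pretraining matters for medical imaging tasks; combining progressive layer freezing with linear probing improves performance while keeping computation low, which suits site-level deployment and federated learning.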
Abstract: Bronchopulmonary dysplasia (BPD) is a chronic lung disease affecting 35% of extremely low birth weight infants. Defined by oxygen dependence at 36 weeks postmenstrual age, it causes lifelong respiratory complications. However, preventive interventions carry severe risks, including neurodevelopmental impairment, ventilator-induced lung injury, and systemic complications. Therefore, early BPD prognosis and prediction of BPD outcome is crucial to avoid unnecessary toxicity in low risk infants. Admission radiographs of extremely preterm infants are routinely acquired within 24h of life and could serve as a non-invasive prognostic tool. In this work, we developed and investigated a deep learning approach using chest X-rays from 163 extremely low-birth-weight infants ($\leq$32 weeks gestation, 401-999g) obtained within 24 hours of birth. We fine-tuned a ResNet-50 pretrained specifically on adult chest radiographs, employing progressive layer freezing with discriminative learning rates to prevent overfitting and evaluated a CutMix augmentation and linear probing. For moderate/severe BPD outcome prediction, our best performing model with progressive freezing, linear probing and CutMix achieved an AUROC of 0.78 $\pm$ 0.10, balanced accuracy of 0.69 $\pm$ 0.10, and an F1-score of 0.67 $\pm$ 0.11. In-domain pre-training significantly outperformed ImageNet initialization (p = 0.031) which confirms domain-specific pretraining to be important for BPD outcome prediction. Routine IRDS grades showed limited prognostic value (AUROC 0.57 $\pm$ 0.11), confirming the need of learned markers. Our approach demonstrates that domain-specific pretraining enables accurate BPD prediction from routine day-1 radiographs. Through progressive freezing and linear probing, the method remains computationally feasible for site-level implementation and future federated learning deployments.
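The CutMix augmentation named in this entry can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: images are assumed to be 2D lists of grayscale values, labels are assumed to be scalar soft labels, and the function name `cutmix` and its parameters are hypothetical.

```python
import math
import random


def cutmix(img_a, img_b, label_a, label_b, lam, rng=random):
    """Paste a random rectangle of img_b into a copy of img_a and mix the labels.

    Assumption: both images are equally sized 2D lists; lam in [0, 1] is the
    target share of img_a that should remain visible.
    """
    h, w = len(img_a), len(img_a[0])
    # Side lengths chosen so the rectangle covers about (1 - lam) of the image.
    cut_h = int(h * math.sqrt(1.0 - lam))
    cut_w = int(w * math.sqrt(1.0 - lam))
    # Random rectangle centre, clipped to the image bounds.
    cy, cx = rng.randrange(h), rng.randrange(w)
    y1, y2 = max(cy - cut_h // 2, 0), min(cy + cut_h // 2, h)
    x1, x2 = max(cx - cut_w // 2, 0), min(cx + cut_w // 2, w)
    mixed = [row[:] for row in img_a]  # copy so img_a is left untouched
    for y in range(y1, y2):
        mixed[y][x1:x2] = img_b[y][x1:x2]
    # Recompute the mixing ratio from the rectangle that was actually pasted.
    lam_adj = 1.0 - (y2 - y1) * (x2 - x1) / (h * w)
    mixed_label = lam_adj * label_a + (1.0 - lam_adj) * label_b
    return mixed, mixed_label
```

Recomputing the ratio after clipping keeps the soft label consistent with the pixels that were actually replaced, which matters when the rectangle falls near an image border.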
[71] Efficient Calisthenics Skills Classification through Foreground Instance Selection and Depth Estimation
Antonio Finocchiaro,Giovanni Maria Farinella,Antonino Furnari
Main category: cs.CV
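TL;DR: The paper proposes an efficient calisthenics skill classification method that replaces computationally expensive pose estimation with foreground instance selection and depth estimation, improving both speed and accuracy.
Details
Motivation: Traditional pose-estimation-based classification is computationally expensive and complex, which limits real-time and mobile use, so a more efficient alternative is needed.
Contribution: 1. A direct classification approach based on depth estimation and foreground instance selection; 2. Substantially lower computational cost and inference time; 3. A modular design that allows components to be swapped.
Method: Uses Depth Anything V2 for depth estimation and YOLOv10 for athlete localization, extracting the foreground (the athlete) directly from images rather than relying on pose estimation.
Result: Inference is 38.3x faster than skeleton-based methods with RGB image patches, and classification accuracy is higher with depth patches (0.837 vs. 0.815).
Insight: Skipping pose estimation and working directly on foreground and depth information improves both efficiency and accuracy, which suits real-time and mobile applications.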
Abstract: Calisthenics skill classification is the computer vision task of inferring the skill performed by an athlete from images, enabling automatic performance assessment and personalized analytics. Traditional methods for calisthenics skill recognition are based on pose estimation methods to determine the position of skeletal data from images, which is later fed to a classification algorithm to infer the performed skill. Despite the progress in human pose estimation algorithms, they still involve high computational costs, long inference times, and complex setups, which limit the applicability of such approaches in real-time applications or mobile devices. This work proposes a direct approach to calisthenics skill recognition, which leverages depth estimation and athlete patch retrieval to avoid the computationally expensive human pose estimation module. Using Depth Anything V2 for depth estimation and YOLOv10 for athlete localization, we segment the subject from the background rather than relying on traditional pose estimation techniques. This strategy increases efficiency, reduces inference time, and improves classification accuracy. Our approach significantly outperforms skeleton-based methods, achieving 38.3x faster inference with RGB image patches and improved classification accuracy with depth patches (0.837 vs. 0.815). Beyond these performance gains, the modular design of our pipeline allows for flexible replacement of components, enabling future enhancements and adaptation to real-world applications.
[72] Unsupervised Monocular 3D Keypoint Discovery from Multi-View Diffusion Priors
Subin Jeon,In Cho,Junyoung Hong,Seon Joo Kim
Main category: cs.CV
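TL;DR: KeyDiff3D is an unsupervised monocular 3D keypoint estimation framework that uses the geometric priors of a pretrained multi-view diffusion model to predict accurate 3D keypoints from a single image, without manual annotations or calibrated multi-view data.
Details
Motivation: Existing methods rely on manual annotations or calibrated multi-view images, both expensive to collect, which limits the use of 3D keypoint estimation. KeyDiff3D aims to estimate 3D keypoints without supervision from a collection of single-view images only.
Contribution: 1. Proposes an unsupervised monocular 3D keypoint estimation framework; 2. Uses the geometric priors of a multi-view diffusion model as a supervision signal; 3. Enables manipulation of 3D objects generated by the diffusion model.
Method: 1. Uses a pretrained multi-view diffusion model to generate multi-view images from a single image as a supervision signal; 2. Uses the diffusion model as a 2D multi-view feature extractor and builds 3D feature volumes from its intermediate representations; 3. Predicts keypoints from the resulting explicit 3D features.
Result: Experiments on Human3.6M, Stanford Dogs, and several in-the-wild and out-of-domain datasets show accuracy and generalization, as well as the ability to manipulate 3D objects generated by the diffusion model.
Insight: The implicit 3D priors of diffusion models can be converted into explicit 3D features, which suggests a new direction for unsupervised 3D vision tasks.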
Abstract: This paper introduces KeyDiff3D, a framework for unsupervised monocular 3D keypoints estimation that accurately predicts 3D keypoints from a single image. While previous methods rely on manual annotations or calibrated multi-view images, both of which are expensive to collect, our method enables monocular 3D keypoints estimation using only a collection of single-view images. To achieve this, we leverage powerful geometric priors embedded in a pretrained multi-view diffusion model. In our framework, this model generates multi-view images from a single image, serving as a supervision signal to provide 3D geometric cues to our model. We also use the diffusion model as a powerful 2D multi-view feature extractor and construct 3D feature volumes from its intermediate representations. This transforms implicit 3D priors learned by the diffusion model into explicit 3D features. Beyond accurate keypoints estimation, we further introduce a pipeline that enables manipulation of 3D objects generated by the diffusion model. Experimental results on diverse aspects and datasets, including Human3.6M, Stanford Dogs, and several in-the-wild and out-of-domain datasets, highlight the effectiveness of our method in terms of accuracy, generalization, and its ability to enable manipulation of 3D objects generated by the diffusion model from a single image.
[73] Cluster Contrast for Unsupervised Visual Representation Learning
Nikolaos Giakoumoglou,Tania Stathaki
Main category: cs.CV
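TL;DR: The paper proposes Cluster Contrast (CueCo), an unsupervised visual representation learning method that combines the strengths of contrastive learning and clustering to scatter and align feature representations at the same time.
Details
Motivation: Existing unsupervised representation learning methods do not sufficiently scatter and align features in the feature space, which limits performance. CueCo addresses this by combining contrastive learning with clustering.
Contribution: Proposes CueCo, which combines contrastive and clustering objectives to optimize feature scattering and alignment jointly.
Method: CueCo uses two neural networks, a query and a key, with the key network updated through a slow-moving average. A contrastive loss pushes dissimilar features apart (inter-class separation), and a clustering objective pulls features of the same cluster together (intra-class compactness).
Result: With linear evaluation on a ResNet-18 backbone, CueCo achieves top-1 accuracy of 91.40% on CIFAR-10, 68.56% on CIFAR-100, and 78.65% on ImageNet-100.
Insight: By combining contrastive learning with a clustering objective, CueCo highlights the importance of both scattering and aligning features in unsupervised representation learning.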
Abstract: We introduce Cluster Contrast (CueCo), a novel approach to unsupervised visual representation learning that effectively combines the strengths of contrastive learning and clustering methods. Inspired by recent advancements, CueCo is designed to simultaneously scatter and align feature representations within the feature space. This method utilizes two neural networks, a query and a key, where the key network is updated through a slow-moving average of the query outputs. CueCo employs a contrastive loss to push dissimilar features apart, enhancing inter-class separation, and a clustering objective to pull together features of the same cluster, promoting intra-class compactness. Our method achieves 91.40% top-1 classification accuracy on CIFAR-10, 68.56% on CIFAR-100, and 78.65% on ImageNet-100 using linear evaluation with a ResNet-18 backbone. By integrating contrastive learning with clustering, CueCo sets a new direction for advancing unsupervised visual representation learning.
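The three ingredients this entry describes (a slow-moving-average key network, a contrastive term, and a cluster-pull term) can be sketched as follows. This is an illustrative sketch only: the abstract does not give the loss formulas, so the InfoNCE form, the squared-distance cluster term, the momentum `m`, the temperature `tau`, and the weight `weight` are all assumptions, and features are plain Python lists.

```python
import math


def momentum_update(key_params, query_params, m=0.99):
    """Slow-moving average: the key drifts toward the query (MoCo-style, assumed)."""
    return [m * k + (1.0 - m) * q for k, q in zip(key_params, query_params)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def info_nce(query, positive, negatives, tau=0.2):
    """Contrastive term: low when the query matches its positive, not the negatives."""
    logits = [dot(query, positive) / tau] + [dot(query, n) / tau for n in negatives]
    top = max(logits)
    # Numerically stable log-sum-exp over all candidates.
    log_sum = top + math.log(sum(math.exp(z - top) for z in logits))
    return log_sum - logits[0]


def cluster_pull(feature, centroid):
    """Clustering term (assumed form): squared distance to the assigned centroid."""
    return sum((f - c) ** 2 for f, c in zip(feature, centroid))


def combined_loss(query, positive, negatives, centroid, weight=1.0):
    """Hypothetical weighted sum of the scatter and align terms."""
    return info_nce(query, positive, negatives) + weight * cluster_pull(query, centroid)
```

The contrastive term drives inter-class separation while the cluster term drives intra-class compactness; how the two are actually weighted in CueCo is not stated in the abstract.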
[74] OD-VIRAT: A Large-Scale Benchmark for Object Detection in Realistic Surveillance Environments
Hayat Ullah,Abbas Khan,Arslan Munir,Hari Kalva
Main category: cs.CV
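TL;DR: The paper presents two large-scale object detection benchmarks for surveillance scenes, OD-VIRAT Large and OD-VIRAT Tiny, for evaluating detectors in complex environments, and benchmarks several state-of-the-art architectures, including RETMDET and YOLOX.
Details
Motivation: Developing robust detectors for complex surveillance scenes (occlusion, small objects, complex backgrounds) requires diverse and challenging datasets for evaluating model performance.
Contribution: 1) Two large-scale object detection benchmarks for surveillance imagery; 2) To the authors' knowledge, the first evaluation of these recent state-of-the-art detection architectures on realistic surveillance imagery under challenging conditions.
Method: Builds richly annotated datasets (bounding boxes and categories) from video sequences covering 10 surveillance scenes, then evaluates RETMDET, YOLOX, RetinaNet, DETR, and Deformable-DETR on them.
Result: OD-VIRAT Large contains 8.7 million annotated instances in 599,996 images, and OD-VIRAT Tiny contains 288,901 annotated instances in 19,860 images; the paper reports how the selected models perform on both.
Insight: Object detection in complex surveillance scenes remains challenging, and performance on small and occluded objects in particular needs further improvement.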
Abstract: Realistic human surveillance datasets are crucial for training and evaluating computer vision models under real-world conditions, facilitating the development of robust algorithms for human and human-interacting object detection in complex environments. These datasets need to offer diverse and challenging data to enable a comprehensive assessment of model performance and the creation of more reliable surveillance systems for public safety. To this end, we present two visual object detection benchmarks named OD-VIRAT Large and OD-VIRAT Tiny, aiming at advancing visual understanding tasks in surveillance imagery. The video sequences in both benchmarks cover 10 different scenes of human surveillance recorded from significant height and distance. The proposed benchmarks offer rich annotations of bounding boxes and categories, where OD-VIRAT Large has 8.7 million annotated instances in 599,996 images and OD-VIRAT Tiny has 288,901 annotated instances in 19,860 images. This work also focuses on benchmarking state-of-the-art object detection architectures, including RETMDET, YOLOX, RetinaNet, DETR, and Deformable-DETR on this object detection-specific variant of VIRAT dataset. To the best of our knowledge, it is the first work to examine the performance of these recently published state-of-the-art object detection architectures on realistic surveillance imagery under challenging conditions such as complex backgrounds, occluded objects, and small-scale objects. The proposed benchmarking and experimental settings will help in providing insights concerning the performance of selected object detection models and set the base for developing more efficient and robust object detection architectures.
[75] AutoVDC: Automated Vision Data Cleaning Using Vision-Language Models
Santosh Vasa,Aditi Ramadwar,Jnana Rama Krishna Darabattula,Md Zafar Anwar,Stanislaw Antol,Andrei Vatavu,Thomas Monninger,Sihao Ding
Main category: cs.CV
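TL;DR: AutoVDC is a framework that uses vision-language models (VLMs) to automatically detect erroneous annotations in vision datasets, aiming to improve data quality for autonomous driving and reduce the cost of manual review.
Details
Motivation: Training autonomous driving systems depends on precisely annotated data, but human annotations are imperfect and manual review is expensive, so automated tools for improving data quality are needed.
Contribution: Proposes the AutoVDC framework, which applies VLMs to vision data cleaning and identifies erroneous annotations automatically.
Method: Uses VLMs to flag erroneous annotations. Experiments use the KITTI and nuImages datasets with intentionally injected annotation errors, and compare detection rates across different VLMs and with VLM fine-tuning.
Result: AutoVDC performs well in the error detection and data cleaning experiments, indicating its potential to improve the reliability of large-scale datasets.
Insight: VLMs are an efficient and scalable option for data cleaning, fine-tuning can improve their detection performance further, and the approach offers a new direction for data management in autonomous driving.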
Abstract: Training of autonomous driving systems requires extensive datasets with precise annotations to attain robust performance. Human annotations suffer from imperfections, and multiple iterations are often needed to produce high-quality datasets. However, manually reviewing large datasets is laborious and expensive. In this paper, we introduce AutoVDC (Automated Vision Data Cleaning) framework and investigate the utilization of Vision-Language Models (VLMs) to automatically identify erroneous annotations in vision datasets, thereby enabling users to eliminate these errors and enhance data quality. We validate our approach using the KITTI and nuImages datasets, which contain object detection benchmarks for autonomous driving. To test the effectiveness of AutoVDC, we create dataset variants with intentionally injected erroneous annotations and observe the error detection rate of our approach. Additionally, we compare the detection rates using different VLMs and explore the impact of VLM fine-tuning on our pipeline. The results demonstrate our method’s high performance in error detection and data cleaning experiments, indicating its potential to significantly improve the reliability and accuracy of large-scale production datasets in autonomous driving.
[76] InterpIoU: Rethinking Bounding Box Regression with Interpolation-Based IoU Optimization
Haoyuan Liu,Hiroshi Watanabe
Main category: cs.CV
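TL;DR: The paper proposes InterpIoU, a new bounding box regression loss that optimizes IoU through interpolated boxes, addressing the poor small-object performance and box enlargement caused by geometric penalties in existing losses.
Details
Motivation: Existing IoU-based regression losses often add handcrafted geometric penalties to handle IoU's non-differentiability when boxes do not overlap, but these penalties are sensitive to box shape, size, and distribution, which tends to hurt small-object detection and cause box enlargement.
Contribution: Proposes the InterpIoU loss, which replaces handcrafted geometric penalties with the IoU between interpolated boxes and the target, providing meaningful gradients and avoiding box enlargement; also introduces Dynamic InterpIoU, which adjusts the interpolation coefficients dynamically to adapt to different object distributions.
Method: Uses interpolated boxes to bridge the gap between predicted and ground-truth boxes instead of handcrafted penalty terms; adjusts the interpolation coefficients dynamically based on IoU values.
Result: On COCO, VisDrone, and PASCAL VOC, InterpIoU and Dynamic InterpIoU outperform existing IoU-based losses, most notably on small objects.
Insight: IoU itself is an ideal regression target, and handcrafted geometric penalties are unnecessary and suboptimal; interpolation handles the non-overlapping case more naturally and avoids the side effects of penalties misaligned with the IoU objective.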
Abstract: Bounding box regression (BBR) is fundamental to object detection, where the regression loss is crucial for accurate localization. Existing IoU-based losses often incorporate handcrafted geometric penalties to address IoU’s non-differentiability in non-overlapping cases and enhance BBR performance. However, these penalties are sensitive to box shape, size, and distribution, often leading to suboptimal optimization for small objects and undesired behaviors such as bounding box enlargement due to misalignment with the IoU objective. To address these limitations, we propose InterpIoU, a novel loss function that replaces handcrafted geometric penalties with a term based on the IoU between interpolated boxes and the target. By using interpolated boxes to bridge the gap between predictions and ground truth, InterpIoU provides meaningful gradients in non-overlapping cases and inherently avoids the box enlargement issue caused by misaligned penalties. Simulation results further show that IoU itself serves as an ideal regression target, while existing geometric penalties are both unnecessary and suboptimal. Building on InterpIoU, we introduce Dynamic InterpIoU, which dynamically adjusts interpolation coefficients based on IoU values, enhancing adaptability to scenarios with diverse object distributions. Experiments on COCO, VisDrone, and PASCAL VOC show that our methods consistently outperform state-of-the-art IoU-based losses across various detection frameworks, with particularly notable improvements in small object detection, confirming their effectiveness.
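The idea of scoring an interpolated box against the target can be illustrated with a small sketch. The exact InterpIoU formula is not given in the abstract, so the linear interpolation, the fixed coefficient `alpha`, and the way the two terms are summed are assumptions made for illustration; boxes are `(x1, y1, x2, y2)` tuples.

```python
def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def interpolate_box(pred, target, alpha):
    """Box lying between prediction (alpha = 0) and target (alpha = 1)."""
    return tuple((1.0 - alpha) * p + alpha * t for p, t in zip(pred, target))


def interp_iou_loss(pred, target, alpha=0.75):
    """Assumed form: plain IoU loss plus an IoU loss on the interpolated box."""
    interp = interpolate_box(pred, target, alpha)
    return (1.0 - iou(pred, target)) + (1.0 - iou(interp, target))
```

When the prediction does not overlap the target, the plain IoU term is flat at 1, but the interpolated box can still overlap the target, so the second term keeps changing as the prediction moves closer. That is the property the abstract attributes to InterpIoU.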
[77] DVFL-Net: A Lightweight Distilled Video Focal Modulation Network for Spatio-Temporal Action Recognition
Hayat Ullah,Muhammad Ali Shafique,Abbas Khan,Arslan Munir
Main category: cs.CV
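TL;DR: The paper proposes DVFL-Net, a lightweight video focal modulation network that uses knowledge distillation and spatio-temporal feature modulation for efficient spatio-temporal action recognition while preserving high accuracy.
Details
Motivation: Transformer models perform well on spatio-temporal action recognition but are computationally expensive, especially on dense video data. The paper aims for a lightweight network that keeps high performance and deploys efficiently on-device.
Contribution: DVFL-Net uses knowledge distillation and spatio-temporal focal modulation to cut computational cost substantially while preserving recognition performance; experiments on multiple datasets show a balance between performance and efficiency.
Method: DVFL-Net distills spatio-temporal knowledge from a large pre-trained teacher (Video-FocalNet Base) into a compact student (VFL-Net) using forward KL divergence, and reduces computation through spatial-temporal feature modulation.
Result: Experiments on UCF50, UCF101, HMDB51, SSV2, and Kinetics-400 show a favorable balance of memory usage, computation (GFLOPs), and accuracy, making the model suitable for real-time use.
Insight: Combining spatio-temporal focal modulation with knowledge distillation is an effective way to improve lightweight models, and forward KL divergence plays a key role in the knowledge transfer.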
Abstract: The landscape of video recognition has evolved significantly, shifting from traditional Convolutional Neural Networks (CNNs) to Transformer-based architectures for improved accuracy. While 3D CNNs have been effective at capturing spatiotemporal dynamics, recent Transformer models leverage self-attention to model long-range spatial and temporal dependencies. Despite achieving state-of-the-art performance on major benchmarks, Transformers remain computationally expensive, particularly with dense video data. To address this, we propose a lightweight Video Focal Modulation Network, DVFL-Net, which distills spatiotemporal knowledge from a large pre-trained teacher into a compact nano student model, enabling efficient on-device deployment. DVFL-Net utilizes knowledge distillation and spatial-temporal feature modulation to significantly reduce computation while preserving high recognition performance. We employ forward Kullback-Leibler (KL) divergence alongside spatio-temporal focal modulation to effectively transfer both local and global context from the Video-FocalNet Base (teacher) to the proposed VFL-Net (student). We evaluate DVFL-Net on UCF50, UCF101, HMDB51, SSV2, and Kinetics-400, benchmarking it against recent state-of-the-art methods in Human Action Recognition (HAR). Additionally, we conduct a detailed ablation study analyzing the impact of forward KL divergence. The results confirm the superiority of DVFL-Net in achieving an optimal balance between performance and efficiency, demonstrating lower memory usage, reduced GFLOPs, and strong accuracy, making it a practical solution for real-time HAR applications.
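The forward KL divergence used for distillation in this entry can be written out directly. The sketch below assumes class logits as plain Python lists and a temperature parameter, which is a common distillation choice but is not mentioned in the abstract; the function names are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    top = max(scaled)
    exps = [math.exp(z - top) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def forward_kl(teacher_logits, student_logits, temperature=1.0):
    """KL(teacher || student): the expectation is taken under the teacher."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q) if pi > 0)
```

Because the expectation is taken under the teacher distribution, the student is penalized wherever it assigns low probability to classes the teacher considers likely, which encourages it to cover all of the teacher's modes.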
[78] Describe Anything Model for Visual Question Answering on Text-rich Images
Yen-Linh Vu,Dinh-Thang Duong,Truong-Binh Duong,Anh-Khoi Nguyen,Thanh-Huy Nguyen,Le Thien Phuc Nguyen,Jianhua Xing,Xingjian Li,Tianyang Wang,Ulas Bagci,Min Xu
Main category: cs.CV
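TL;DR: DAM-QA uses the region-aware capability of the Describe Anything Model (DAM) and aggregates answers from multiple image regions to improve VQA on text-rich images, consistently outperforming the baseline DAM.
Details
Motivation: Existing vision-language models underperform on VQA over text-rich images. The region-aware DAM can generate detailed descriptions of specific image areas, which makes it a candidate for text-related VQA.
Contribution: Proposes the DAM-QA framework, which applies DAM's region-aware capability to text-rich VQA, together with a mechanism that aggregates answers from multiple regions.
Method: DAM-QA collects answers from multiple regional views of the image and aggregates them into a final answer, improving fine-grained reasoning over text in the image.
Result: On six VQA benchmarks, DAM-QA consistently outperforms the baseline DAM, with a gain of 7+ points on DocVQA, and achieves the best overall performance among region-aware models with fewer parameters.
Insight: Region-aware models have considerable potential for text-rich VQA, and efficient strategies for using and integrating regional information are the key.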
Abstract: Recent progress has been made in region-aware vision-language modeling, particularly with the emergence of the Describe Anything Model (DAM). DAM is capable of generating detailed descriptions of any specific image areas or objects without the need for additional localized image-text alignment supervision. We hypothesize that such region-level descriptive capability is beneficial for the task of Visual Question Answering (VQA), especially in challenging scenarios involving images with dense text. In such settings, the fine-grained extraction of textual information is crucial to producing correct answers. Motivated by this, we introduce DAM-QA, a framework with a tailored evaluation protocol, developed to investigate and harness the region-aware capabilities from DAM for the text-rich VQA problem that requires reasoning over text-based information within images. DAM-QA incorporates a mechanism that aggregates answers from multiple regional views of image content, enabling more effective identification of evidence that may be tied to text-related elements. Experiments on six VQA benchmarks show that our approach consistently outperforms the baseline DAM, with a notable 7+ point gain on DocVQA. DAM-QA also achieves the best overall performance among region-aware models with fewer parameters, significantly narrowing the gap with strong generalist VLMs. These results highlight the potential of DAM-like models for text-rich and broader VQA tasks when paired with efficient usage and integration strategies. Our code is publicly available at https://github.com/Linvyl/DAM-QA.git.
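One simple way to aggregate answers from several regional views is a majority vote. The abstract does not specify DAM-QA's aggregation rule, so the voting scheme, the normalization, and the `abstain` token below are assumptions chosen purely for illustration.

```python
from collections import Counter


def aggregate_answers(region_answers, abstain="unanswerable"):
    """Majority vote over per-region answers, ignoring empty and abstaining views.

    Assumption: each regional view returns a short answer string, and views
    that find no evidence return the abstain token.
    """
    cleaned = [a.strip().lower() for a in region_answers]
    votes = Counter(a for a in cleaned if a and a != abstain)
    if not votes:
        return abstain
    # most_common keeps first-seen order among ties, so earlier views win ties.
    return votes.most_common(1)[0][0]
```

Dropping abstaining views before voting reflects the intuition that evidence tied to text may be visible in only a few regions, so the majority of views should not be allowed to outvote the ones that actually see it.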
[79] Vision-based Perception for Autonomous Vehicles in Obstacle Avoidance Scenarios
Van-Hoang-Anh Phan,Chi-Tam Nguyen,Doan-Trung Au,Thanh-Danh Phan,Minh-Thien Duong,My-Ha Le
Main category: cs.CV
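TL;DR: The paper proposes an efficient vision-based obstacle avoidance system that combines YOLOv11 object detection with monocular depth estimation models (such as Depth Anything V2) and uses a Frenet-Pure Pursuit planning strategy for safe autonomous navigation.
Details
Motivation: Autonomous vehicles need accurate perception and motion planning to navigate complex environments safely, and existing vision-based perception and avoidance methods still have efficiency and robustness problems.
Contribution: An efficient camera-only obstacle avoidance pipeline that combines recent vision perception models with a real-time planning strategy, validated in real-world scenarios.
Method: Uses YOLOv11 for object detection and Depth Anything V2 for monocular depth estimation to estimate object distances, with a Frenet-Pure Pursuit-based planning strategy for avoidance.
Result: The system is evaluated in diverse scenarios on a university campus and handles various obstacles effectively.
Insight: Combining monocular depth estimation with object detection improves a vehicle's perception of its environment, with a trade-off between efficiency and robustness.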
Abstract: Obstacle avoidance is essential for ensuring the safety of autonomous vehicles. Accurate perception and motion planning are crucial to enabling vehicles to navigate complex environments while avoiding collisions. In this paper, we propose an efficient obstacle avoidance pipeline that leverages a camera-only perception module and a Frenet-Pure Pursuit-based planning strategy. By integrating advancements in computer vision, the system utilizes YOLOv11 for object detection and state-of-the-art monocular depth estimation models, such as Depth Anything V2, to estimate object distances. A comparative analysis of these models provides valuable insights into their accuracy, efficiency, and robustness in real-world conditions. The system is evaluated in diverse scenarios on a university campus, demonstrating its effectiveness in handling various obstacles and enhancing autonomous navigation. The video presenting the results of the obstacle avoidance experiments is available at: https://www.youtube.com/watch?v=FoXiO5S_tA8
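The Pure Pursuit half of the planning strategy reduces to one steering formula under a kinematic bicycle model. The sketch below covers only that textbook step and omits the Frenet-frame path generation entirely; the vehicle-frame convention (x forward, y left), the function name, and the parameters are assumptions, not details from the paper.

```python
import math


def pure_pursuit_steering(target_xy, wheelbase):
    """Steering angle (radians) toward a lookahead point in the vehicle frame.

    Assumption: x points forward and y points left, so a positive angle
    steers left. Uses delta = atan(2 * L * sin(alpha) / l_d).
    """
    x, y = target_xy
    lookahead = math.hypot(x, y)  # l_d: distance to the lookahead point
    if lookahead == 0.0:
        return 0.0
    alpha = math.atan2(y, x)  # heading error to the lookahead point
    return math.atan2(2.0 * wheelbase * math.sin(alpha), lookahead)
```

In a pipeline like the one described, the lookahead point would come from a planned path that already routes around obstacles located via detection and depth estimation; the steering law itself does not know about obstacles.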
[80] Mitigating Object Hallucinations via Sentence-Level Early Intervention
Shangpin Peng,Senqiao Yang,Li Jiang,Zhuotao Tian
Main category: cs.CV
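TL;DR: The paper proposes SENTINEL, a framework that reduces hallucinations in multimodal large language models through sentence-level early intervention. It builds preference pairs without human annotation and trains with a context-aware preference loss (C-DPO); experiments show hallucinations reduced by over 90%.
Details
Motivation: Multimodal large language models (MLLMs) are strong at cross-modal understanding but still hallucinate, generating content that contradicts the visual input. Existing mitigation methods are either costly or introduce a distribution mismatch between training data and model outputs. The authors observe that hallucinations mainly emerge early in generation and then propagate.
Contribution: 1. Proposes the SENTINEL framework, which reduces hallucinations through sentence-level early intervention; 2. Builds high-quality preference pairs without human annotation; 3. Proposes a context-aware preference loss (C-DPO) that strengthens the model's ability to discriminate.
Method: 1. Samples model outputs and validates object existence by cross-checking with two open-vocabulary detectors; 2. Classifies sentences as hallucinated or non-hallucinated; 3. Iteratively builds context-aware preference data; 4. Trains the model with the C-DPO loss.
Result: SENTINEL reduces hallucinations by over 90% compared to the original model and outperforms the previous state-of-the-art method on both hallucination benchmarks and general capability benchmarks.
Insight: Hallucinations mainly originate in the early stages of generation, so sentence-level intervention can stop them from propagating, and annotation-free preference learning is an efficient way to mitigate them.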
Abstract: Multimodal large language models (MLLMs) have revolutionized cross-modal understanding but continue to struggle with hallucinations - fabricated content contradicting visual inputs. Existing hallucination mitigation methods either incur prohibitive computational costs or introduce distribution mismatches between training data and model outputs. We identify a critical insight: hallucinations predominantly emerge at the early stages of text generation and propagate through subsequent outputs. To address this, we propose SENTINEL (Sentence-level Early iNtervention Through IN-domain prEference Learning), a framework that eliminates dependency on human annotations. Specifically, we first bootstrap high-quality in-domain preference pairs by iteratively sampling model outputs, validating object existence through cross-checking with two open-vocabulary detectors, and classifying sentences into hallucinated/non-hallucinated categories. Subsequently, we use context-coherent positive samples and hallucinated negative samples to build context-aware preference data iteratively. Finally, we train models using a context-aware preference loss (C-DPO) that emphasizes discriminative learning at the sentence level where hallucinations initially manifest. Experimental results show that SENTINEL can reduce hallucinations by over 90% compared to the original model and outperforms the previous state-of-the-art method on both hallucination benchmarks and general capabilities benchmarks, demonstrating its superiority and generalization ability. The models, datasets, and code are available at https://github.com/pspdada/SENTINEL.
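A preference loss over a (non-hallucinated, hallucinated) sentence pair can be sketched with the standard DPO objective. The abstract does not define C-DPO, so this is the generic DPO form applied to sentence-level log-probabilities that are assumed to be conditioned on the same preceding context; the function name, arguments, and `beta` are hypothetical.

```python
import math


def dpo_sentence_loss(policy_pos, policy_neg, ref_pos, ref_neg, beta=0.1):
    """Standard DPO loss on one preference pair (not the paper's exact C-DPO).

    Each argument is the summed log-probability of a sentence given the shared
    context, under the trained policy or the frozen reference model.
    """
    margin = beta * ((policy_pos - ref_pos) - (policy_neg - ref_neg))
    # -log(sigmoid(margin)) computed in a numerically stable way.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

The loss falls as the policy raises the likelihood of the non-hallucinated sentence relative to the hallucinated one, measured against the reference model, so the gradient concentrates on the sentence where the two continuations diverge.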
[81] SpatialTrackerV2: 3D Point Tracking Made Easy
Yuxi Xiao,Jianyuan Wang,Nan Xue,Nikita Karaev,Yuri Makarov,Bingyi Kang,Xing Zhu,Hujun Bao,Yujun Shen,Xiaowei Zhou
Main category: cs.CV
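TL;DR: SpatialTrackerV2 is a feed-forward 3D point tracking method for monocular videos that learns geometry and motion jointly and clearly outperforms existing methods.
Details
Motivation: Existing 3D tracking methods mostly rely on modular pipelines built from off-the-shelf components, which limits performance and the range of data they can train on.
Contribution: An end-to-end, fully differentiable, unified architecture that tightly couples point tracking, monocular depth, and camera pose estimation, improving performance and efficiency.
Method: Decomposes world-space 3D motion into scene geometry, camera ego-motion, and pixel-wise object motion, and learns them jointly.
Result: Outperforms existing 3D tracking methods by 30% and matches the accuracy of leading dynamic 3D reconstruction approaches while running 50x faster.
Insight: Learning geometry and motion jointly improves generalization and efficiency and allows training on heterogeneous data.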
Abstract: We present SpatialTrackerV2, a feed-forward 3D point tracking method for monocular videos. Going beyond modular pipelines built on off-the-shelf components for 3D tracking, our approach unifies the intrinsic connections between point tracking, monocular depth, and camera pose estimation into a high-performing and feedforward 3D point tracker. It decomposes world-space 3D motion into scene geometry, camera ego-motion, and pixel-wise object motion, with a fully differentiable and end-to-end architecture, allowing scalable training across a wide range of datasets, including synthetic sequences, posed RGB-D videos, and unlabeled in-the-wild footage. By learning geometry and motion jointly from such heterogeneous data, SpatialTrackerV2 outperforms existing 3D tracking methods by 30%, and matches the accuracy of leading dynamic 3D reconstruction approaches while running 50$\times$ faster.
[82] MMHU: A Massive-Scale Multimodal Benchmark for Human Behavior Understanding
Renjie Li,Ruijie Ye,Mingyang Wu,Hao Frank Yang,Zhiwen Fan,Hezhen Hu,Zhengzhong Tu
Main category: cs.CV
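TL;DR: The paper proposes MMHU, a large-scale multimodal benchmark for human behavior understanding in autonomous driving, with rich annotations and multi-task evaluation.
Details
Motivation: Existing datasets do not cover human behavior comprehensively, especially in autonomous driving, and there is no unified benchmark for evaluating multiple aspects of human behavior understanding.
Contribution: Proposes the MMHU dataset, with 57k human motion clips and 1.73M frames and a range of annotations (motion, trajectories, intention, and more), and develops a human-in-the-loop annotation pipeline.
Method: Collects data from driving datasets such as Waymo, in-the-wild YouTube videos, and self-collected footage; generates behavior captions through human-in-the-loop annotation; and designs a multi-task benchmark.
Result: MMHU provides a thorough dataset analysis and multi-task evaluation (motion prediction, motion generation, behavior question answering), offering a broad evaluation suite for the community.
Insight: Combining multimodal data with rich annotations is essential for understanding complex human behavior, especially in autonomous driving.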
Abstract: Humans are integral components of the transportation ecosystem, and understanding their behaviors is crucial to facilitating the development of safe driving systems. Although recent progress has explored various aspects of human behavior (such as motion, trajectories, and intention), a comprehensive benchmark for evaluating human behavior understanding in autonomous driving remains unavailable. In this work, we propose MMHU, a large-scale benchmark for human behavior analysis featuring rich annotations, such as human motion and trajectories, text description for human motions, human intention, and critical behavior labels relevant to driving safety. Our dataset encompasses 57k human motion clips and 1.73M frames gathered from diverse sources, including established driving datasets such as Waymo, in-the-wild videos from YouTube, and self-collected data. A human-in-the-loop annotation pipeline is developed to generate rich behavior captions. We provide a thorough dataset analysis and benchmark multiple tasks, ranging from motion prediction to motion generation and human behavior question answering, thereby offering a broad evaluation suite. Project page: https://MMHU-Benchmark.github.io.
[83] PhysX: Physical-Grounded 3D Asset Generation
Ziang Cao,Zhaoxi Chen,Liang Pan,Ziwei Liu
Main category: cs.CV
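TL;DR: The paper proposes PhysX, a physical-grounded 3D asset generation approach that addresses existing methods' neglect of physical properties by building the PhysXNet dataset and the PhysXGen model.
Details
Motivation: Existing 3D generation methods focus on geometry and texture and neglect physical properties, which limits their use in areas such as simulation and embodied AI. Contribution: 1) Presents PhysXNet, the first physics-grounded 3D dataset, systematically annotated across five physical dimensions, including absolute scale, material, and function; 2) designs the PhysXGen model, whose dual-branch architecture explicitly models the correlations between 3D structure and physical properties and generates 3D assets with plausible physical predictions.
Method: 1) A human-in-the-loop annotation pipeline built on vision-language models efficiently creates the physics-annotated dataset PhysXNet; 2) PhysXGen uses a dual-branch architecture to inject physical knowledge into a pre-trained 3D structural space.
Result: Experiments validate PhysXGen's superior performance in physical prediction and geometry quality and show its generalization capability.
Insight: Physical properties are essential to the realism and usefulness of 3D generation; combining human-in-the-loop annotation with a dual-branch architecture is key to physics-driven generation.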
Abstract: 3D modeling is moving from virtual to physical. Existing 3D generation primarily emphasizes geometries and textures while neglecting physical-grounded modeling. Consequently, despite the rapid development of 3D generative models, the synthesized 3D assets often overlook rich and important physical properties, hampering their real-world application in physical domains like simulation and embodied AI. As an initial attempt to address this challenge, we propose PhysX, an end-to-end paradigm for physical-grounded 3D asset generation. 1) To bridge the critical gap in physics-annotated 3D datasets, we present PhysXNet, the first physics-grounded 3D dataset systematically annotated across five foundational dimensions: absolute scale, material, affordance, kinematics, and function description. In particular, we devise a scalable human-in-the-loop annotation pipeline based on vision-language models, which enables efficient creation of physics-first assets from raw 3D assets. 2) Furthermore, we propose PhysXGen, a feed-forward framework for physics-grounded image-to-3D asset generation, injecting physical knowledge into the pre-trained 3D structural space. Specifically, PhysXGen employs a dual-branch architecture to explicitly model the latent correlations between 3D structures and physical properties, thereby producing 3D assets with plausible physical predictions while preserving the native geometry quality. Extensive experiments validate the superior performance and promising generalization capability of our framework. All the code, data, and models will be released to facilitate future research in generative physical AI.
cs.RO [Back]
[84] Towards Autonomous Riding: A Review of Perception, Planning, and Control in Intelligent Two-Wheelers
Mohammed Hassanin,Mohammad Abu Alsheikh,Carlos C. N. Kuhn,Damith Herath,Dinh Thai Hoang,Ibrahim Radwan
Main category: cs.RO
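TL;DR: This review analyzes the three core components of autonomous riding (AR) systems for two-wheelers (perception, planning, and control), compares them with autonomous driving (AD) technologies, identifies gaps in current research, and proposes future research directions.
Details
Motivation: The spread of micromobility vehicles such as e-scooters and e-bikes has driven demand for autonomous two-wheeler technology. However, the instability of two-wheeled platforms, their limited size and power, and unpredictable environments pose unique challenges that need to be studied and addressed. Contribution: 1. Systematically summarizes the core components of autonomous riding (perception, planning, control); 2. identifies key gaps in current research (e.g., the lack of comprehensive perception systems); 3. proposes future research directions through comparison with AD technologies (e.g., multimodal sensor techniques and edge deep learning architectures).
Method: A literature review that draws on AD research to analyze the three core components of AR systems and their challenges.
Result: The review clarifies the differences between AR and AD technologies and identifies key open problems and promising directions in AR research.
Insight: Autonomous riding research needs more industry and government support, along with a focus on multimodal sensor techniques for lightweight platforms and stronger edge computing capability.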
Abstract: The rapid adoption of micromobility solutions, particularly two-wheeled vehicles like e-scooters and e-bikes, has created an urgent need for reliable autonomous riding (AR) technologies. While autonomous driving (AD) systems have matured significantly, AR presents unique challenges due to the inherent instability of two-wheeled platforms, limited size, limited power, and unpredictable environments, which pose very serious concerns about road users’ safety. This review provides a comprehensive analysis of AR systems by systematically examining their core components, perception, planning, and control, through the lens of AD technologies. We identify critical gaps in current AR research, including a lack of comprehensive perception systems for various AR tasks, limited industry and government support for such developments, and insufficient attention from the research community. The review analyses the gaps of AR from the perspective of AD to highlight promising research directions, such as multimodal sensor techniques for lightweight platforms and edge deep learning architectures. By synthesising insights from AD research with the specific requirements of AR, this review aims to accelerate the development of safe, efficient, and scalable autonomous riding systems for future urban mobility.
[85] A Multi-Level Similarity Approach for Single-View Object Grasping: Matching, Planning, and Fine-Tuning
Hao Chen,Takuya Kiyokawa,Zhengtao Hu,Weiwei Wan,Kensuke Harada
Main category: cs.RO
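TL;DR: The paper proposes a multi-level similarity approach to single-view object grasping that addresses the robustness of grasping unknown objects through similarity matching, planning, and fine-tuning.
Details
Motivation: Traditional learning frameworks are sensitive to sensing noise and environmental changes and cannot achieve highly generalized grasping. The authors therefore abandon the traditional approach and explore a new perspective, similarity matching. Contribution: 1) Proposes a multi-level similarity matching framework that integrates semantic, geometric, and dimensional features; 2) introduces a new point cloud geometric descriptor, C-FPFH, which improves matching accuracy between partial point clouds and complete models; 3) combines large language models, a semi-oriented bounding box, and a point cloud registration method based on plane detection.
Method: Three steps: 1) match visual features against object models in a database by similarity; 2) plan imitative grasps using the pre-existing grasping knowledge of the candidate models; 3) optimize grasp quality through local fine-tuning.
Result: The method achieves robust grasping of unknown objects from a single view and generalizes better than traditional learning frameworks.
Insight: Similarity matching and transfer of existing knowledge can markedly improve the robustness of unknown-object grasping, especially under partial observation.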
Abstract: Grasping unknown objects from a single view has remained a challenging topic in robotics due to the uncertainty of partial observation. Recent advances in large-scale models have led to benchmark solutions such as GraspNet-1Billion. However, such learning-based approaches still face a critical limitation in performance robustness because of their sensitivity to sensing noise and environmental changes. To address this bottleneck in achieving highly generalized grasping, we abandon the traditional learning framework and introduce a new perspective: similarity matching, where similar known objects are utilized to guide the grasping of unknown target objects. We propose a method that robustly achieves unknown-object grasping from a single viewpoint through three key steps: 1) Leverage the visual features of the observed object to perform similarity matching with an existing database containing various object models, identifying potential candidates with high similarity; 2) Use the candidate models with pre-existing grasping knowledge to plan imitative grasps for the unknown target object; 3) Optimize the grasp quality through a local fine-tuning process. To address the uncertainty caused by partial and noisy observation, we propose a multi-level similarity matching framework that integrates semantic, geometric, and dimensional features for comprehensive evaluation. In particular, we introduce a novel point cloud geometric descriptor, the C-FPFH descriptor, which facilitates accurate similarity assessment between partial point clouds of observed objects and complete point clouds of database models. In addition, we incorporate large language models, introduce the semi-oriented bounding box, and develop a novel point cloud registration approach based on plane detection to enhance matching accuracy under single-view conditions. Videos are available at https://youtu.be/qQDIELMhQmk.
[86] EgoVLA: Learning Vision-Language-Action Models from Egocentric Human Videos
Ruihan Yang,Qinxi Yu,Yecheng Wu,Rui Yan,Borui Li,An-Chieh Cheng,Xueyan Zou,Yunhao Fang,Hongxu Yin,Sifei Liu,Song Han,Yao Lu,Xiaolong Wang
Main category: cs.RO
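TL;DR: The paper proposes EgoVLA, a method for training vision-language-action (VLA) models on egocentric human videos. Human actions are converted to robot actions through inverse kinematics and retargeting, and the model is fine-tuned on a few robot demonstrations, which significantly improves performance on robotic manipulation tasks.
Details
Motivation: Robot imitation learning needs large amounts of real data, but hardware limits data scale. Human videos are large in scale and rich in scenes and tasks, which motivates training VLA models on egocentric human videos. Contribution: 1) Proposes EgoVLA, a method for training VLA models on human videos; 2) proposes the Isaac Humanoid Manipulation Benchmark, a simulation benchmark for evaluating models on diverse bimanual manipulation tasks.
Method: 1) Train a VLA model on egocentric human videos to predict human wrist and hand actions; 2) convert human actions to robot actions via inverse kinematics and retargeting; 3) fine-tune the model with a few robot demonstrations.
Result: Evaluated on the Isaac Humanoid Manipulation Benchmark, EgoVLA significantly outperforms the baselines, confirming the importance of human data.
Insight: Human videos provide data at scale and cover richer scenes and tasks, offering a new approach to robot manipulation learning.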
Abstract: Real robot data collection for imitation learning has led to significant advancements in robotic manipulation. However, the requirement for robot hardware in the process fundamentally constrains the scale of the data. In this paper, we explore training Vision-Language-Action (VLA) models using egocentric human videos. The benefit of using human videos is not only for their scale but more importantly for the richness of scenes and tasks. With a VLA trained on human video that predicts human wrist and hand actions, we can perform Inverse Kinematics and retargeting to convert the human actions to robot actions. We fine-tune the model using a few robot manipulation demonstrations to obtain the robot policy, namely EgoVLA. We propose a simulation benchmark called Isaac Humanoid Manipulation Benchmark, where we design diverse bimanual manipulation tasks with demonstrations. We fine-tune and evaluate EgoVLA with Isaac Humanoid Manipulation Benchmark and show significant improvements over baselines and ablate the importance of human data. Videos can be found on our website: https://rchalyang.github.io/EgoVLA
cs.GR [Back]
[87] MOSPA: Human Motion Generation Driven by Spatial Audio
Shuyang Xu,Zhiyang Dou,Mingyi Shi,Liang Pan,Leo Ho,Jingbo Wang,Yuan Liu,Cheng Lin,Yuexin Ma,Wenping Wang,Taku Komura
Main category: cs.GR
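TL;DR: The paper "MOSPA: Human Motion Generation Driven by Spatial Audio" introduces the first comprehensive Spatial Audio-Driven Human Motion (SAM) dataset and develops MOSPA, a diffusion-based generative framework that models human motion in response to spatial audio with high quality. The method achieves state-of-the-art performance in experiments.
Details
Motivation: Current human motion generation research mainly maps modalities such as speech and music to motion, overlooking how the spatial features of spatial audio signals affect human motion. Filling this gap and enabling high-quality motion generation from spatial audio is the paper's core motivation. Contribution: 1. Introduces SAM, the first dataset with high-quality spatial audio and motion data; 2. develops MOSPA, a diffusion-based generative framework that captures the relationship between body motion and spatial audio through an effective fusion mechanism; 3. achieves state-of-the-art performance in benchmark experiments.
Method: A simple yet effective diffusion-based generative framework (MOSPA) that fuses spatial audio features with body motion and, once trained, generates motion for diverse spatial audio inputs.
Result: MOSPA generates diverse, realistic motion and achieves state-of-the-art performance in benchmark experiments.
Insight: The spatial features of spatial audio strongly influence human motion generation, and a diffusion model with a fusion mechanism can model this relationship well. The planned release of the dataset and model should also advance research in this area.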
Abstract: Enabling virtual humans to dynamically and realistically respond to diverse auditory stimuli remains a key challenge in character animation, demanding the integration of perceptual modeling and motion synthesis. Despite its significance, this task remains largely unexplored. Most previous works have primarily focused on mapping modalities like speech, audio, and music to generate human motion. As of yet, these models typically overlook the impact of spatial features encoded in spatial audio signals on human motion. To bridge this gap and enable high-quality modeling of human movements in response to spatial audio, we introduce the first comprehensive Spatial Audio-Driven Human Motion (SAM) dataset, which contains diverse and high-quality spatial audio and motion data. For benchmarking, we develop a simple yet effective diffusion-based generative framework for human MOtion generation driven by SPatial Audio, termed MOSPA, which faithfully captures the relationship between body motion and spatial audio through an effective fusion mechanism. Once trained, MOSPA could generate diverse realistic human motions conditioned on varying spatial audio inputs. We perform a thorough investigation of the proposed dataset and conduct extensive experiments for benchmarking, where our method achieves state-of-the-art performance on this task. Our model and dataset will be open-sourced upon acceptance. Please refer to our supplementary video for more details.
cs.AI [Back]
[88] Let’s Think in Two Steps: Mitigating Agreement Bias in MLLMs with Self-Grounded Verification
Moises Andrade,Joonhyuk Cha,Brandon Ho,Vriksha Srihari,Karmesh Yadav,Zsolt Kira
Main category: cs.AI
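TL;DR: The paper proposes Self-Grounded Verification (SGV), a lightweight two-step reasoning method that addresses "agreement bias" in multimodal large language models (MLLMs) used as verifiers. The method significantly improves verification accuracy and failure detection rates.
Details
Motivation: Verifiers, which assign rewards to agent behavior, have driven AI progress in domains such as math and board games. In domains without clear-cut success criteria (e.g., computer use), designing verifiers remains challenging. MLLMs are a promising solution given their world knowledge, human-preference alignment, and reasoning skills, but they exhibit agreement bias in verification tasks. Contribution: The main contribution is Self-Grounded Verification (SGV), a two-step method (first retrieve priors about the task, then evaluate the candidate trajectory against those priors) that mitigates agreement bias in MLLMs and significantly improves verification performance.
Method: SGV has two steps: 1) unconditionally generate priors about task completion; 2) conditioned on those priors, reason over and evaluate the candidate trajectory. The method uses the MLLM's own sampling mechanisms to make better use of its knowledge and reasoning.
Result: SGV improves the accuracy and failure detection rates of MLLM verifiers by up to 20 percentage points and enables real-time supervision across several tasks (a GUI specialist in OSWorld, a diffusion policy in robomimic, and a ReAct agent in VisualWebArena), surpassing the previous state of the art by 48%.
Insight: The paper exposes agreement bias in MLLM verifiers and shows that a simple two-step method mitigates it substantially, suggesting that with careful design an MLLM's own generation mechanisms can better serve verification of complex tasks.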
Abstract: Verifiers – functions assigning rewards to agent behavior – have been key for AI progress in domains like math and board games. However, extending these gains to domains without clear-cut success criteria (e.g., computer use) remains a challenge: while humans can recognize suitable outcomes, translating this intuition into scalable rules is non-trivial. Multimodal Large Language Models (MLLMs) emerge as a promising solution, given their world knowledge, human-preference alignment, and reasoning skills. We evaluate MLLMs as verifiers of agent trajectories across web navigation, computer use, and robotic manipulation, and identify a critical limitation: agreement bias, a strong tendency for MLLMs to favor information in their context window, often generating chains of thought to rationalize flawed behavior. This bias is pervasive across models, resilient to test-time scaling, and can impact several methods using MLLMs as evaluators (e.g., data filtering). Notably, it occurs despite MLLMs showing strong, human-aligned priors on desired behavior. To address this, we propose Self-Grounded Verification (SGV), a lightweight method that enables more effective use of MLLMs’ knowledge and reasoning by harnessing their own sampling mechanisms via unconditional and conditional generation. SGV operates in two steps: first, the MLLM is elicited to retrieve broad priors about task completion, independent of the data under evaluation. Then, conditioned on self-generated priors, it reasons over and evaluates a candidate trajectory. Enhanced with SGV, MLLM verifiers show gains of up to 20 points in accuracy and failure detection rates, and can perform real-time supervision of heterogeneous agents, boosting task completion of a GUI specialist in OSWorld, a diffusion policy in robomimic, and a ReAct agent in VisualWebArena – setting a new state of the art on the benchmark, surpassing the previous best by 48%.
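Editor's sketch: the two-step structure described in the abstract can be illustrated as below. This is a minimal sketch, not the authors' implementation. The `generate` callable stands in for any text-in/text-out model call, and the prompt wording, function name, and SUCCESS/FAILURE labels are all hypothetical.

```python
from typing import Callable


def self_grounded_verify(
    generate: Callable[[str], str],  # hypothetical model call: prompt -> text
    task: str,
    trajectory: str,
) -> str:
    """Two-step verification in the spirit of SGV (illustrative prompts only)."""
    # Step 1: elicit priors about task completion WITHOUT showing the
    # trajectory, so the priors cannot be biased by the data under evaluation.
    priors = generate(
        f"Task: {task}\nDescribe what a successful completion looks like."
    )
    # Step 2: evaluate the candidate trajectory conditioned on those priors.
    verdict = generate(
        f"Task: {task}\nExpected behavior: {priors}\n"
        f"Trajectory: {trajectory}\n"
        "Does the trajectory meet the expected behavior? "
        "Answer SUCCESS or FAILURE."
    )
    return verdict
```

The point of the split is that the first call never sees the trajectory, which is what the abstract means by priors retrieved "independent of the data under evaluation".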
cs.IR [Back]
[89] Developing Visual Augmented Q&A System using Scalable Vision Embedding Retrieval & Late Interaction Re-ranker
Rachna Saxena,Abhijeet Kumar,Suresh Shanmugam
Main category: cs.IR
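TL;DR: The paper proposes a visual augmented Q&A system that combines vision embedding retrieval with a late interaction re-ranker, addressing efficiency and quality issues in multimodal retrieval.
Details
Motivation: Text-only language models cannot handle visual elements such as infographics, and multimodal large language models (MLLMs) become inefficient when searching large document collections. Contribution: Proposes a scalable vision embedding retrieval approach that combines hybrid search with late interaction re-ranking to improve retrieval efficiency and quality.
Method: A multi-step implementation comprising hybrid search (metadata plus embeddings) and late interaction re-ranking, with an MLLM generating the answers.
Result: Experiments show that the system significantly speeds up retrieval without degrading performance quality, making it suitable for production use.
Insight: Combining hybrid search with late interaction re-ranking is an efficient solution for multimodal retrieval.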
Abstract: Traditional information extraction systems struggle with text-only language models because these models do not consider infographics (visual elements of information) such as tables, charts, and images, which are often used to convey complex information to readers. Multimodal LLMs (MLLMs) face the needle-in-a-haystack problem, i.e., either a longer context length or a substantial number of documents as the search space. Late interaction mechanisms over visual language models have shown state-of-the-art performance in retrieval-based vision augmented Q&A tasks. Yet a few challenges remain in using them for RAG-based multimodal Q&A. First, many popular and widely adopted vector databases do not support native multi-vector retrieval. Second, late interaction requires computation which inflates the space footprint and can hinder enterprise adoption. Lastly, the current state of the late interaction mechanism does not leverage approximate neighbor search indexing methods for large speed-ups in the retrieval process. This paper explores a pragmatic approach to make the vision retrieval process scalable and efficient without compromising on performance quality. We propose a multi-step custom implementation utilizing widely adopted hybrid search (metadata & embedding) and a state-of-the-art late interaction re-ranker to retrieve the best matching pages. Finally, MLLMs are prompted as readers to generate answers from the contextualized best matching pages. Through experiments, we observe that the proposed design is scalable (significant speed-up) and stable (without degrading performance quality), and hence can be used in production systems at enterprises.
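Editor's sketch: the abstract does not spell out how the late interaction re-ranker scores pages. A common formulation is the MaxSim score popularized by ColBERT-style models: each query token vector is matched to its most similar page vector, and the maxima are summed. The NumPy sketch below assumes that formulation; the function names and the dictionary of candidates are illustrative, not the paper's code.

```python
import numpy as np


def maxsim_score(query_vecs: np.ndarray, page_vecs: np.ndarray) -> float:
    """MaxSim late-interaction score (assumed formulation).

    query_vecs: (n_query_tokens, dim); page_vecs: (n_page_vectors, dim).
    """
    sims = query_vecs @ page_vecs.T        # all pairwise dot products
    return float(sims.max(axis=1).sum())   # best page match per query token


def rerank(query_vecs: np.ndarray, candidates: dict) -> list:
    """Re-rank first-stage candidates (page_id -> multi-vector embedding)."""
    scored = [(pid, maxsim_score(query_vecs, vecs))
              for pid, vecs in candidates.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

In a pipeline like the one described, `candidates` would hold only the pages returned by the cheaper hybrid search, so the multi-vector comparison runs on a small shortlist rather than the whole corpus.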
cs.NE [Back]
[90] Simulated Language Acquisition in a Biologically Realistic Model of the Brain
Daniel Mitropolsky,Christos Papadimitriou
Main category: cs.NE
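TL;DR: The paper proposes a biologically inspired brain model that achieves basic language acquisition through a mathematical formalization of six principles of neuroscience.
Details
Motivation: Despite great progress in neuroscience, there is still no clear account of how neuronal activity gives rise to high-level cognitive phenomena such as language. The paper aims to fill this gap. Contribution: 1. Proposes a simple but biologically plausible mathematical formalization of a brain model; 2. achieves learning of semantics and syntax from scratch.
Method: A simulated neuromorphic system is built from six principles of neuroscience: excitatory neurons, brain areas, random synapses, Hebbian plasticity, local inhibition, and inter-area inhibition.
Result: From a modest number of grounded sentences, the system learns word semantics, syntactic roles, and the word order of the language, and can even generate novel sentences.
Insight: This biologically inspired model offers a new way to uncover the neural mechanisms behind high-level cognitive phenomena such as language.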
Abstract: Despite tremendous progress in neuroscience, we do not have a compelling narrative for the precise way whereby the spiking of neurons in our brain results in high-level cognitive phenomena such as planning and language. We introduce a simple mathematical formulation of six basic and broadly accepted principles of neuroscience: excitatory neurons, brain areas, random synapses, Hebbian plasticity, local inhibition, and inter-area inhibition. We implement a simulated neuromorphic system based on this formalism, which is capable of basic language acquisition: Starting from a tabula rasa, the system learns, in any language, the semantics of words, their syntactic role (verb versus noun), and the word order of the language, including the ability to generate novel sentences, through the exposure to a modest number of grounded sentences in the same language. We discuss several possible extensions and implications of this result.
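Editor's sketch: the abstract names Hebbian plasticity and local inhibition but gives no formulas. One simple way to formalize them, assumed here and not confirmed by the abstract, is k-winners-take-all for inhibition and a multiplicative weight boost for synapses between co-active neurons. The parameter names `k` and `beta` are illustrative.

```python
import numpy as np


def k_cap(inputs: np.ndarray, k: int) -> np.ndarray:
    """Local inhibition as k-winners-take-all (assumed form):
    return the indices of the k neurons with the largest input."""
    return np.argsort(inputs)[-k:]


def hebbian_update(weights: np.ndarray, pre_active, post_active,
                   beta: float) -> np.ndarray:
    """Hebbian plasticity (assumed multiplicative form).

    weights has shape (n_post, n_pre). Synapses from an active presynaptic
    neuron to an active postsynaptic neuron are scaled by (1 + beta).
    """
    updated = weights.copy()
    updated[np.ix_(post_active, pre_active)] *= 1.0 + beta
    return updated
```

A simulation step under these assumptions would compute each neuron's summed input, keep the `k_cap` winners as the firing set, and then strengthen the synapses that contributed to them.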
cs.SD [Back]
[91] Stereo Sound Event Localization and Detection with Onscreen/offscreen Classification
Kazuki Shimada,Archontis Politis,Iran R. Roman,Parthasaarathy Sudarsanam,David Diaz-Guerra,Ruchi Pandey,Kengo Uchida,Yuichiro Koyama,Naoya Takahashi,Takashi Shibuya,Shusuke Takahashi,Tuomas Virtanen,Yuki Mitsufuji
Main category: cs.SD
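TL;DR: The paper presents the objective, dataset, baseline, and metrics of Task 3 of the DCASE2025 Challenge, which focuses on sound event localization and detection (SELD) with stereo audio and adds an onscreen/offscreen event classification sub-task.
Details
Motivation: Previous editions used four-channel audio (FOA and microphone array); this year's task moves to the more common stereo audio setting, which better reflects the limited field of view (FOV) found in practical applications. Contribution: 1. Proposes a SELD task based on stereo audio, broadening the application scenarios; 2. introduces an onscreen/offscreen event classification sub-task suited to limited FOV; 3. releases the DCASE2025 Task3 Stereo SELD Dataset.
Method: The baseline system takes stereo audio and video frames as input and, beyond the usual SELD task, integrates an onscreen/offscreen classification module.
Result: The baseline system performs reasonably well on stereo audio data.
Insight: The limitations of stereo audio (such as angular ambiguity) lead the task to focus on azimuth and distance estimation, while onscreen/offscreen classification offers a new angle on limited-FOV scenarios.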
Abstract: This paper presents the objective, dataset, baseline, and metrics of Task 3 of the DCASE2025 Challenge on sound event localization and detection (SELD). In previous editions, the challenge used four-channel audio formats of first-order Ambisonics (FOA) and microphone array. In contrast, this year’s challenge investigates SELD with stereo audio data (termed stereo SELD). This change shifts the focus from more specialized 360° audio and audiovisual scene analysis to more commonplace audio and media scenarios with limited field-of-view (FOV). Due to inherent angular ambiguities in stereo audio data, the task focuses on direction-of-arrival (DOA) estimation in the azimuth plane (left-right axis) along with distance estimation. The challenge remains divided into two tracks: audio-only and audiovisual, with the audiovisual track introducing a new sub-task of onscreen/offscreen event classification necessitated by the limited FOV. This challenge introduces the DCASE2025 Task3 Stereo SELD Dataset, whose stereo audio and perspective video clips are sampled and converted from the STARSS23 recordings. The baseline system is designed to process stereo audio and corresponding video frames as inputs. In addition to the typical SELD event classification and localization, it integrates onscreen/offscreen classification for the audiovisual track. The evaluation metrics have been modified to introduce an onscreen/offscreen accuracy metric, which assesses the models’ ability to identify which sound sources are onscreen. In the experimental evaluation, the baseline system performs reasonably well with the stereo audio data.
astro-ph.IM [Back]
[92] Image-Based Multi-Survey Classification of Light Curves with a Pre-Trained Vision Transformer
Daniel Moreno-Cartagena,Guillermo Cabrera-Vives,Alejandra M. Muñoz Arancibia,Pavlos Protopapas,Francisco Förster,Márcio Catelan,A. Bayo,Pablo A. Estévez,P. Sánchez-Sáez,Franz E. Bauer,M. Pavez-Herrera,L. Hernández-García,Gonzalo Rojas
Main category: astro-ph.IM
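TL;DR: The paper explores a pre-trained vision Transformer (Swin Transformer V2) for photometric classification on multi-survey data (ZTF and ATLAS) and finds that an architecture that processes the surveys jointly performs best.
Details
Motivation: The aim is to develop a scalable classifier for light curves from different surveys and to address the integration of multi-survey data. Contribution: The main contribution is a multi-survey architecture that processes the surveys jointly, shown to outperform single-survey processing, with an analysis of the importance of cross-survey interactions.
Method: Builds on the pre-trained Swin Transformer V2 and proposes a multi-survey architecture that jointly processes ZTF and ATLAS light curves.
Result: Experiments show that the jointly processed multi-survey architecture classifies better than single-survey processing, confirming the importance of modeling cross-survey interactions.
Insight: Modeling survey-specific characteristics and cross-survey interactions is key to better classification, and offers guidance for scalable classifiers in future time-domain astronomy.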
Abstract: We explore the use of Swin Transformer V2, a pre-trained vision Transformer, for photometric classification in a multi-survey setting by leveraging light curves from the Zwicky Transient Facility (ZTF) and the Asteroid Terrestrial-impact Last Alert System (ATLAS). We evaluate different strategies for integrating data from these surveys and find that a multi-survey architecture which processes them jointly achieves the best performance. These results highlight the importance of modeling survey-specific characteristics and cross-survey interactions, and provide guidance for building scalable classifiers for future time-domain astronomy.
eess.IV [Back]
[93] CompressedVQA-HDR: Generalized Full-reference and No-reference Quality Assessment Models for Compressed High Dynamic Range Videos
Wei Sun,Linhan Cao,Kang Fu,Dandan Zhu,Jun Jia,Menghan Hu,Xiongkuo Min,Guangtao Zhai
Main category: eess.IV
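TL;DR: CompressedVQA-HDR is a framework for quality assessment of compressed high dynamic range (HDR) videos. It uses the Swin Transformer and SigLip 2 as backbones for the full-reference (FR) and no-reference (NR) models, respectively, and improves performance through pre-training and mixed-dataset training strategies.
Details
Motivation: Existing compressed video quality assessment methods do not handle the diversity of HDR content well, so a more general framework is needed. Contribution: 1. Proposes the CompressedVQA-HDR framework, which uses the Swin Transformer and SigLip 2 for the FR and NR models, respectively; 2. addresses the shortage of HDR training data through pre-training and mixed-dataset training; 3. demonstrates state-of-the-art performance in experiments.
Method: 1. The FR model extracts features with the Swin Transformer and computes structural and textural similarities; 2. the NR model extracts global mean features with SigLip 2; 3. the models are trained with pre-training and a mixed-dataset training strategy.
Result: Experiments show that the models outperform existing methods, and the FR model won first place in the FR track of the grand challenge at IEEE ICME 2025.
Insight: Combining pre-training with mixed-dataset training can substantially improve the generalization of HDR video quality assessment.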
Abstract: Video compression is a standard procedure applied to all videos to minimize storage and transmission demands while preserving visual quality as much as possible. Therefore, evaluating the visual quality of compressed videos is crucial for guiding the practical usage and further development of video compression algorithms. Although numerous compressed video quality assessment (VQA) methods have been proposed, they often lack the generalization capability needed to handle the increasing diversity of video types, particularly high dynamic range (HDR) content. In this paper, we introduce CompressedVQA-HDR, an effective VQA framework designed to address the challenges of HDR video quality assessment. Specifically, we adopt the Swin Transformer and SigLip 2 as the backbone networks for the proposed full-reference (FR) and no-reference (NR) VQA models, respectively. For the FR model, we compute deep structural and textural similarities between reference and distorted frames using intermediate-layer features extracted from the Swin Transformer as its quality-aware feature representation. For the NR model, we extract the global mean of the final-layer feature maps from SigLip 2 as its quality-aware representation. To mitigate the issue of limited HDR training data, we pre-train the FR model on a large-scale standard dynamic range (SDR) VQA dataset and fine-tune it on the HDRSDR-VQA dataset. For the NR model, we employ an iterative mixed-dataset training strategy across multiple compressed VQA datasets, followed by fine-tuning on the HDRSDR-VQA dataset. Experimental results show that our models achieve state-of-the-art performance compared to existing FR and NR VQA models. Moreover, CompressedVQA-HDR-FR won first place in the FR track of the Generalizable HDR & SDR Video Quality Measurement Grand Challenge at IEEE ICME 2025. The code is available at https://github.com/sunwei925/CompressedVQA-HDR.
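Editor's sketch: the NR model's "global mean of the final-layer feature maps" amounts to averaging each feature channel over all spatial positions. The NumPy sketch below assumes a (channels, height, width) layout, which the abstract does not specify; the function name is illustrative.

```python
import numpy as np


def global_mean_feature(feature_map: np.ndarray) -> np.ndarray:
    """Average each channel over its spatial positions.

    feature_map: (channels, height, width), an assumed layout.
    Returns a (channels,) vector usable as a frame-level representation.
    """
    return feature_map.mean(axis=(1, 2))
```

For a backbone that outputs a token sequence of shape (n_tokens, dim) instead, the equivalent operation would be a mean over the token axis.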
[94] Identifying Signatures of Image Phenotypes to Track Treatment Response in Liver Disease
Matthias Perkonigg,Nina Bastati,Ahmed Ba-Ssalamah,Peter Mesenbrink,Alexander Goehler,Miljen Martic,Xiaofei Zhou,Michael Trauner,Georg Langs
Main category: eess.IV
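TL;DR: The paper proposes an unsupervised deep clustering network for identifying image patterns in liver magnetic resonance images that are associated with disease progression and treatment response. By building a tissue vocabulary, the method quantifies treatment response, and it is validated on patients with non-alcoholic steatohepatitis.
Details
Motivation: In diffuse liver disease, quantifying image patterns is critical for guiding individual treatment and developing new therapies. Existing approaches often rely on invasive biopsy, whereas this study aims to extract useful information from non-invasive imaging data through unsupervised learning. Contribution: Proposes an unsupervised deep clustering network that encodes and clusters medical image patches to build a tissue vocabulary in a low-dimensional latent space, and shows that the vocabulary can quantify treatment response and predict biopsy-derived features.
Method: A deep clustering network encodes and clusters liver MR image patches without supervision to produce a tissue vocabulary, which is validated on a clinical trial cohort.
Result: The method identifies specific liver tissue change pathways associated with treatment and separates treatment groups better than established non-imaging measures. The vocabulary also predicts biopsy-derived features from non-invasive imaging data.
Insight: Unsupervised learning can automatically extract meaningful patterns from medical images, providing a new tool for disease management and treatment monitoring and reducing reliance on invasive biopsy.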
Abstract: Quantifiable image patterns associated with disease progression and treatment response are critical tools for guiding individual treatment, and for developing novel therapies. Here, we show that unsupervised machine learning can identify a pattern vocabulary of liver tissue in magnetic resonance images that quantifies treatment response in diffuse liver disease. Deep clustering networks simultaneously encode and cluster patches of medical images into a low-dimensional latent space to establish a tissue vocabulary. The resulting tissue types capture differential tissue change and its location in the liver associated with treatment response. We demonstrate the utility of the vocabulary on a randomized controlled trial cohort of non-alcoholic steatohepatitis patients. First, we use the vocabulary to compare longitudinal liver change in a placebo and a treatment cohort. Results show that the method identifies specific liver tissue change pathways associated with treatment, and enables a better separation between treatment groups than established non-imaging measures. Moreover, we show that the vocabulary can predict biopsy derived features from non-invasive imaging data. We validate the method on a separate replication cohort to demonstrate the applicability of the proposed method.
[95] Benchmarking and Explaining Deep Learning Cortical Lesion MRI Segmentation in Multiple Sclerosis
Nataliia Molchanova,Alessandro Cagol,Mario Ocampo-Pineda,Po-Jui Lu,Matthias Weigel,Xinjie Chen,Erin Beck,Charidimos Tsagkas,Daniel Reich,Colin Vanden Bulcke,Anna Stolting,Serena Borrelli,Pietro Maggi,Adrien Depeursinge,Cristina Granziera,Henning Mueller,Pedro M. Gordaliza,Meritxell Bach Cuadra
Main category: eess.IV
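TL;DR: The paper presents a multi-centric benchmark for evaluating deep learning detection and segmentation of cortical lesions (CLs) in multiple sclerosis (MS) MRI, along with adapted methods and publicly available models.
Details
Motivation: Cortical lesions have important diagnostic and prognostic value in MS, but their clinical use is limited by their subtle MRI appearance, the difficulty of expert annotation, and the lack of standardized automated methods. Contribution: Proposes a multi-centric, multi-protocol benchmark for CL detection and segmentation, adapts the nnU-Net framework to improve performance, and analyzes model features and errors to better understand AI decision-making.
Method: Uses the self-configuring nnU-Net framework with adaptations for CL detection, and tests generalization through out-of-distribution evaluation.
Result: The model reaches F1-scores of 0.64 in-domain and 0.5 out-of-domain, showing strong lesion detection capability.
Insight: The paper analyzes how data variability, lesion ambiguity, and protocol differences affect model performance and offers recommendations for addressing these barriers to clinical adoption.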
Abstract: Cortical lesions (CLs) have emerged as valuable biomarkers in multiple sclerosis (MS), offering high diagnostic specificity and prognostic relevance. However, their routine clinical integration remains limited due to subtle magnetic resonance imaging (MRI) appearance, challenges in expert annotation, and a lack of standardized automated methods. We propose a comprehensive multi-centric benchmark of CL detection and segmentation in MRI. A total of 656 MRI scans, including clinical trial and research data from four institutions, were acquired at 3T and 7T using MP2RAGE and MPRAGE sequences with expert-consensus annotations. We rely on the self-configuring nnU-Net framework, designed for medical imaging segmentation, and propose adaptations tailored to improve CL detection. We evaluated model generalization through out-of-distribution testing, demonstrating strong lesion detection capabilities with F1-scores of 0.64 in-domain and 0.5 out-of-domain. We also analyze internal model features and model errors for a better understanding of AI decision-making. Our study examines how data variability, lesion ambiguity, and protocol differences impact model performance, offering future recommendations to address these barriers to clinical adoption. To reinforce reproducibility, the implementation and models will be publicly accessible and ready to use at https://github.com/Medical-Image-Analysis-Laboratory/ and https://doi.org/10.5281/zenodo.15911797.
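Editor's sketch: the F1-score reported for lesion detection is the harmonic mean of precision and recall. The sketch below computes it from detection counts; how a predicted lesion is matched to a ground-truth lesion is not stated in the abstract, so the counts are taken as given, and the function name is illustrative.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = 2*TP / (2*TP + FP + FN), the harmonic mean of precision and recall.

    tp: detected lesions that match a ground-truth lesion
    fp: detections with no matching ground-truth lesion
    fn: ground-truth lesions that were missed
    """
    denominator = 2 * tp + fp + fn
    if denominator == 0:
        return 0.0  # convention chosen here for the empty case
    return 2 * tp / denominator
```

With hypothetical counts of 8 true positives, 2 false positives, and 2 missed lesions, the formula gives 16 / 20 = 0.8; these numbers are illustrative and not taken from the paper.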
[96] 3D Wavelet Latent Diffusion Model for Whole-Body MR-to-CT Modality Translation
Jiaxu Zheng,Meiman He,Xuhui Tang,Xiong Wang,Tuoyu Cao,Tianyi Zeng,Lichi Zhang,Chenyu You
Main category: eess.IV
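TL;DR: The paper proposes a novel 3D Wavelet Latent Diffusion Model (3D-WLDM) for synthesizing computed tomography (CT) images from magnetic resonance (MR) images, addressing the poor spatial alignment and insufficient image quality of existing methods.
Details
Motivation: MR imaging is essential in clinical diagnostics, but applications such as hybrid PET/MR imaging and MR-only radiation therapy require CT synthesized from MR to estimate radiation attenuation. Existing methods suffer from spatial alignment and image quality problems that reduce reliability in clinical tasks. Contribution: 1) Proposes the 3D-WLDM model, which performs modality translation in a latent space; 2) enhances fine-scale features across image and latent spaces with a Wavelet Residual Module; 3) disentangles structural and modality-specific features to preserve anatomical integrity; 4) introduces a Dual Skip Connection Attention mechanism to improve high-resolution CT generation.
Method: 1) Add a Wavelet Residual Module to the encoder-decoder architecture; 2) perform modality translation in the latent space with a diffusion model; 3) use a Dual Skip Connection Attention mechanism to improve high-resolution image generation; 4) disentangle structural and modality-specific features to prevent warping.
Result: 3D-WLDM generates high-resolution CT images with better bony structure and soft-tissue contrast, with markedly improved spatial alignment and image quality.
Insight: Modality translation in a latent space, combined with wavelet analysis and diffusion models, can address key challenges in MR-to-CT synthesis and provide a more reliable basis for clinical tasks.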
Abstract: Magnetic Resonance (MR) imaging plays an essential role in contemporary clinical diagnostics. It is increasingly integrated into advanced therapeutic workflows, such as hybrid Positron Emission Tomography/Magnetic Resonance (PET/MR) imaging and MR-only radiation therapy. These integrated approaches are critically dependent on accurate estimation of radiation attenuation, which is typically facilitated by synthesizing Computed Tomography (CT) images from MR scans to generate attenuation maps. However, existing MR-to-CT synthesis methods for whole-body imaging often suffer from poor spatial alignment between the generated CT and input MR images, and insufficient image quality for reliable use in downstream clinical tasks. In this paper, we present a novel 3D Wavelet Latent Diffusion Model (3D-WLDM) that addresses these limitations by performing modality translation in a learned latent space. By incorporating a Wavelet Residual Module into the encoder-decoder architecture, we enhance the capture and reconstruction of fine-scale features across image and latent spaces. To preserve anatomical integrity during the diffusion process, we disentangle structural and modality-specific characteristics and anchor the structural component to prevent warping. We also introduce a Dual Skip Connection Attention mechanism within the diffusion model, enabling the generation of high-resolution CT images with improved representation of bony structures and soft-tissue contrast.
[97] Predicting Pulmonary Hypertension in Newborns: A Multi-view VAE Approach
Lucas Erlacher,Samuel Ruipérez-Campillo,Holger Michel,Sven Wellmann,Thomas M. Sutter,Ece Ozkan,Julia E. Vogt
Main category: eess.IV
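TL;DR: The paper proposes a multi-view variational autoencoder (VAE) approach for predicting pulmonary hypertension (PH) in newborns. Using multi-view echocardiographic videos, it makes feature extraction more robust and shows better generalization and classification accuracy than single-view and supervised learning approaches.
Details
Motivation: Diagnosis of PH in newborns usually relies on operator-dependent echocardiography, which makes assessment subjective. Existing automated methods mostly target adults and use single-view data, so they generalize poorly. Multi-view echocardiography promises better performance, but existing models struggle to generalize. Contribution: The main contributions are: 1) the first application of a multi-view VAE to PH prediction in newborns; 2) evidence that multi-view learning improves generalization and classification accuracy; 3) a more reliable approach to non-invasive diagnosis of PH in newborns.
Method: A multi-view VAE extracts complex latent features from echocardiographic videos. The method combines multi-view data, uses the VAE framework to obtain robust feature representations, and is compared against single-view and supervised learning approaches.
Result: The multi-view VAE achieves better classification accuracy and generalization than single-view and supervised learning approaches, confirming the effectiveness of multi-view learning for PH assessment.
Insight: Multi-view data capture more complete pathological features, and the VAE's latent representation further strengthens robustness, offering a new approach to automated PH diagnosis in newborns.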
Abstract: Pulmonary hypertension (PH) in newborns is a critical condition characterized by elevated pressure in the pulmonary arteries, leading to right ventricular strain and heart failure. While right heart catheterization (RHC) is the diagnostic gold standard, echocardiography is preferred due to its non-invasive nature, safety, and accessibility. However, its accuracy highly depends on the operator, making PH assessment subjective. While automated detection methods have been explored, most models focus on adults and rely on single-view echocardiographic frames, limiting their performance in diagnosing PH in newborns. While multi-view echocardiography has shown promise in improving PH assessment, existing models struggle with generalizability. In this work, we employ a multi-view variational autoencoder (VAE) for PH prediction using echocardiographic videos. By leveraging the VAE framework, our model captures complex latent representations, improving feature extraction and robustness. We compare its performance against single-view and supervised learning approaches. Our results show improved generalization and classification accuracy, highlighting the effectiveness of multi-view learning for robust PH assessment in newborns.
[98] Are Vision Foundation Models Ready for Out-of-the-Box Medical Image Registration?
Hanxue Gu,Yaqian Chen,Nicholas Konz,Qihang Li,Maciej A. Mazurowski
Main category: eess.IV
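TL;DR: This paper evaluates foundation model-based medical image registration algorithms on breast MRI and finds that some models (e.g., SAM) outperform traditional methods on global alignment but perform poorly on fine-grained tissue alignment.
Details
Motivation: To investigate whether foundation models (e.g., DINO-v2, SAM) can deliver zero-shot performance in medical image registration, particularly for complex, deformable anatomy such as the breast in MRI.
Contribution: 1. A comprehensive evaluation of five pre-trained foundation models for breast MRI registration; 2. An analysis of their strengths and weaknesses in global alignment and fine-grained structure registration; 3. Public release of the code.
Method: Five pre-trained encoders (DINO-v2, SAM, MedSAM, SSLSAM, MedCLIP) are tested on four breast MRI registration tasks that cover variation across time, sequences, modalities, and disease status.
Result: SAM outperforms traditional methods on global alignment but performs poorly on fine-grained fibroglandular tissue alignment; medical-specific pre-training (e.g., MedSAM) does not improve performance and may even reduce it.
Insight: Foundation models show considerable potential for medical image registration, but further work is needed on capturing fine-grained structures, and domain-specific training must be designed with care.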
Abstract: Foundation models, pre-trained on large image datasets and capable of capturing rich feature representations, have recently shown potential for zero-shot image registration. However, their performance has mostly been tested in the context of rigid or less complex structures, such as the brain or abdominal organs, and it remains unclear whether these models can handle more challenging, deformable anatomy. Breast MRI registration is particularly difficult due to significant anatomical variation between patients, deformation caused by patient positioning, and the thin and complex internal structure of fibroglandular tissue, where accurate alignment is crucial. Whether foundation model-based registration algorithms can address this level of complexity remains an open question. In this study, we provide a comprehensive evaluation of foundation model-based registration algorithms for breast MRI. We assess five pre-trained encoders, including DINO-v2, SAM, MedSAM, SSLSAM, and MedCLIP, across four key breast registration tasks that capture variations in different years and dates, sequences, modalities, and patient disease status (lesion versus no lesion). Our results show that foundation model-based algorithms such as SAM outperform traditional registration baselines for overall breast alignment, especially under large domain shifts, but struggle with capturing fine details of fibroglandular tissue. Interestingly, additional pre-training or fine-tuning on medical or breast-specific images, as in MedSAM and SSLSAM, does not improve registration performance and may even decrease it in some cases. Further work is needed to understand how domain-specific training influences registration and to explore targeted strategies that improve both global alignment and fine structure accuracy. We also publicly release our code on GitHub: https://github.com/mazurowski-lab/Foundation-based-reg
[99] Unit-Based Histopathology Tissue Segmentation via Multi-Level Feature Representation
Ashkan Shakarami,Azade Farshad,Yousef Yeganeh,Lorenzo Nicole,Peter Schuffler,Stefano Ghidoni,Nassir Navab
Main category: eess.IV
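TL;DR: The paper proposes UTS, a unit-based tissue segmentation framework that uses a Multi-Level Vision Transformer (L-ViT) to classify 32×32 tiles, substantially reducing annotation cost and improving computational efficiency.
Details
Motivation: Conventional tissue segmentation methods require pixel-level annotation and are computationally inefficient; the authors aim to address both problems with tile-level classification.
Contribution: 1. The UTS framework, which uses tiles as the segmentation unit; 2. The L-ViT model, which captures local and global information through multi-level feature representations.
Method: A 32×32 tile classification strategy combined with multi-level feature extraction by L-ViT, supporting tasks such as tumor-stroma quantification and surgical margin assessment.
Result: Evaluated on 386,371 tiles from 459 H&E-stained regions, UTS outperforms U-Net variants and transformer-based baselines.
Insight: Tile-level classification reduces annotation requirements while maintaining accuracy, and multi-level feature fusion helps improve segmentation performance.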
Abstract: We propose UTS, a unit-based tissue segmentation framework for histopathology that classifies each fixed-size 32 × 32 tile, rather than each pixel, as the segmentation unit. This approach reduces annotation effort and improves computational efficiency without compromising accuracy. To implement this approach, we introduce a Multi-Level Vision Transformer (L-ViT), which leverages multi-level feature representations to capture both fine-grained morphology and global tissue context. Trained to segment breast tissue into three categories (infiltrating tumor, non-neoplastic stroma, and fat), UTS supports clinically relevant tasks such as tumor-stroma quantification and surgical margin assessment. Evaluated on 386,371 tiles from 459 H&E-stained regions, it outperforms U-Net variants and transformer-based baselines. Code and dataset will be available on GitHub.
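The tile-as-unit idea in the abstract can be sketched in a few lines. This is only an illustration under stated assumptions: `classify_tile` is a hypothetical stand-in for the L-ViT classifier (here a toy intensity threshold), the image is a plain nested list of grayscale values, and `tumor_stroma_ratio` is one plausible way to quantify tumor versus stroma, not necessarily the paper's definition.

```python
from collections import Counter

TILE = 32  # tile edge length stated in the abstract


def segment_by_tiles(image, classify_tile, tile=TILE):
    """Label every full tile x tile block of a 2D image (list of rows)."""
    rows, cols = len(image) // tile, len(image[0]) // tile
    label_map = []
    for r in range(rows):
        row_labels = []
        for c in range(cols):
            y0, x0 = r * tile, c * tile
            patch = [line[x0:x0 + tile] for line in image[y0:y0 + tile]]
            row_labels.append(classify_tile(patch))
        label_map.append(row_labels)
    return label_map


def tumor_stroma_ratio(label_map):
    """Fraction of tumor tiles among tumor + stroma tiles (assumed definition)."""
    counts = Counter(label for row in label_map for label in row)
    total = counts["tumor"] + counts["stroma"]
    return counts["tumor"] / total if total else None


def toy_classifier(patch):
    """Hypothetical placeholder for L-ViT: thresholds the mean intensity."""
    mean = sum(sum(line) for line in patch) / (len(patch) * len(patch[0]))
    if mean > 150:
        return "tumor"
    return "stroma" if mean > 50 else "fat"


# Synthetic 64 x 96 image with three vertical bands of constant intensity.
image = [[200 if x < 32 else 100 if x < 64 else 10 for x in range(96)]
         for _ in range(64)]
label_map = segment_by_tiles(image, toy_classifier)
ratio = tumor_stroma_ratio(label_map)
```

With one label per 32 × 32 tile, each annotation covers 1,024 pixels, which is where the reduction in annotation effort comes from.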
q-bio.NC [Back]
[100] Spontaneous Spatial Cognition Emerges during Egocentric Video Viewing through Non-invasive BCI
Weichen Dai,Yuxuan Huang,Li Zhu,Dongjun Liu,Yu Zhang,Qibin Zhao,Andrzej Cichocki,Fabio Babiloni,Ke Li,Jianyu Qiu,Gangyong Jia,Wanzeng Kong,Qing Wu
Main category: q-bio.NC
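TL;DR: Using non-invasive brain-computer interface (BCI) decoding, the paper demonstrates for the first time that spontaneous, fine-grained 6D pose (3D position and orientation) can be decoded during passive viewing of egocentric video. The finding challenges the traditional distinction between active and passive spatial cognition.
Details
Motivation: Although the encoding of position and orientation by hippocampal neurons is well studied, the large-scale neural dynamics that support spatial representation during naturalistic, passive experience remain unclear. The paper explores this question using EEG.
Contribution: 1. The first decoding of spontaneous 6D pose with a non-invasive BCI; 2. Evidence that the temporal frequency of visual input affects decoding performance; 3. Identification of the specific EEG channels involved in pose encoding through gradient-based backpropagation.
Method: An EEG-based BCI decodes 6D pose during passive viewing of egocentric video. Gradient-based backpropagation analysis identifies the EEG channels associated with position and orientation. Visual input is presented at 100 ms per frame to optimize performance.
Result: EEG can decode continuous 6D pose, with the best decoding performance at 100 ms per frame. The study reveals a distributed yet complementary neural encoding scheme.
Insight: The spatial cognition system operates spontaneously and continuously even under passive conditions, which suggests that the boundary between active and passive cognition may be blurrier than traditionally assumed.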
Abstract: Humans possess a remarkable capacity for spatial cognition, allowing for self-localization even in novel or unfamiliar environments. While hippocampal neurons encoding position and orientation are well documented, the large-scale neural dynamics supporting spatial representation, particularly during naturalistic, passive experience, remain poorly understood. Here, we demonstrate for the first time that non-invasive brain-computer interfaces (BCIs) based on electroencephalography (EEG) can decode spontaneous, fine-grained egocentric 6D pose, comprising three-dimensional position and orientation, during passive viewing of egocentric video. Despite EEG’s limited spatial resolution and high signal noise, we find that spatially coherent visual input (i.e., continuous and structured motion) reliably evokes decodable spatial representations, aligning with participants’ subjective sense of spatial engagement. Decoding performance further improves when visual input is presented at a rate of 100 ms per image, suggesting alignment with intrinsic neural temporal dynamics. Using gradient-based backpropagation through a neural decoding model, we identify distinct EEG channels contributing to position-specific and orientation-specific components, revealing a distributed yet complementary neural encoding scheme. These findings indicate that the brain’s spatial systems operate spontaneously and continuously, even under passive conditions, challenging traditional distinctions between active and passive spatial cognition. Our results offer a non-invasive window into the automatic construction of egocentric spatial maps and advance our understanding of how the human mind transforms everyday sensory experience into structured internal representations.
cs.SE [Back]
[101] MetaLint: Generalizable Idiomatic Code Quality Analysis through Instruction-Following and Easy-to-Hard Generalization
Atharva Naik,Lawanya Baghel,Dhakshin Govindarajan,Darsh Agrawal,Daniel Fried,Carolyn Rose
Main category: cs.SE
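TL;DR: MetaLint is an instruction-following framework that supports code quality analysis through instruction tuning on synthetic data; it adapts to novel or complex code patterns without retraining and outperforms existing methods.
Details
Motivation: In code quality analysis, existing large language models are limited by static training data and cannot flexibly adapt to evolving best practices.
Contribution: The MetaLint framework improves the adaptability and generalization of code quality analysis through instruction following and easy-to-hard generalization.
Method: Instruction tuning on synthetic, linter-generated data supports easy-to-hard learning, which lets the model adapt to novel or complex code patterns.
Result: The model performs well on detecting unseen PEP idioms, reaching a 70.37% F-score, and at 4B parameters it is comparable to larger models.
Insight: With instruction tuning and data synthesis, a model can adapt to new coding practices without updated training data, offering a more flexible approach to code quality analysis.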
Abstract: Large Language Models, though successful in code generation, struggle with code quality analysis because they are limited by static training data and can’t easily adapt to evolving best practices. We introduce MetaLint, a new instruction-following framework that formulates code quality analysis as the task of detecting and fixing problematic semantic code fragments or code idioms based on high-level specifications. Unlike conventional approaches that train models on static, rule-based data, MetaLint employs instruction tuning on synthetic linter-generated data to support easy-to-hard generalization, enabling models to adapt to novel or complex code patterns without retraining. To evaluate this, we construct a benchmark of challenging idioms inspired by real-world coding standards such as Python Enhancement Proposals (PEPs) and assess whether MetaLint-trained models reason adaptively or simply memorize. Our results show that MetaLint improves generalization to unseen PEP idioms, achieving a 70.37% F-score on idiom detection with the highest recall (70.43%) among all evaluated models. It also achieves 26.73% on localization, competitive for its 4B parameter size and comparable to larger state-of-the-art models like o3-mini, highlighting its potential for future-proof code quality analysis.
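To make "linter-generated data" concrete, the sketch below shows how a small rule-based check could flag one idiom and emit the line numbers that a detection or localization label might be built from. The choice of idiom (comparing to `None` with `==` or `!=`, which PEP 8 advises against) and the function name `find_none_equality` are illustrative assumptions; the abstract does not say which linters or idiom specifications MetaLint uses.

```python
import ast
from typing import List


def find_none_equality(source: str) -> List[int]:
    """Return sorted line numbers where a value is compared to None with == or !=."""
    hits = set()
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Compare):
            continue
        # A chained comparison a == b == c has operands [a, b, c] and two operators.
        operands = [node.left, *node.comparators]
        for i, op in enumerate(node.ops):
            if not isinstance(op, (ast.Eq, ast.NotEq)):
                continue
            pair = (operands[i], operands[i + 1])
            if any(isinstance(x, ast.Constant) and x.value is None for x in pair):
                hits.add(node.lineno)
    return sorted(hits)


snippet = (
    "x = 1\n"
    "if x == None:\n"      # flagged: equality test against None
    "    pass\n"
    "if x is None:\n"      # not flagged: identity test is the recommended form
    "    pass\n"
)
flagged_lines = find_none_equality(snippet)
```

A pair such as (idiom description, `snippet`, `flagged_lines`) is the kind of synthetic example an instruction-tuned model could be trained on, under the assumptions above.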
[102] MERA Code: A Unified Framework for Evaluating Code Generation Across Tasks
Artem Chervyakov,Alexander Kharitonov,Pavel Zadorozhny,Adamenko Pavel,Rodion Levichev,Dmitrii Vorobev,Dmitrii Salikhov,Aidar Valeev,Alena Pestova,Maria Dziuba,Ilseyar Alimova,Artem Zavgorodnev,Aleksandr Medvedev,Stanislav Moiseev,Elena Bruches,Daniil Grebenkin,Roman Derunets,Vikulov Vladimir,Anton Emelyanov,Dmitrii Babaev,Vladimir V. Ivanov,Valentin Malykh,Alena Fenogenova
Main category: cs.SE
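TL;DR: The paper proposes MERA Code, a benchmark framework for evaluating code generation large language models (LLMs) that covers 8 programming languages and 11 tasks, addressing the lack of attention to code quality in existing evaluations.
Details
Motivation: Existing LLM evaluations focus mainly on natural language tasks and overlook code quality and performance in real production environments, which leaves the assessment of models' true capabilities and risks incomplete.
Contribution: 1. The MERA Code benchmark, dedicated to evaluating code generation models; 2. 11 tasks spanning 8 programming languages, together with an open-source codebase and an evaluation platform; 3. An analysis of model performance in a non-English (Russian) setting.
Method: The authors build a task taxonomy that specifies the practical coding skills required and design a comprehensive evaluation framework comprising an open-source codebase, a scoring system, and a leaderboard platform.
Result: Open LLMs and frontier API models are evaluated, revealing their limitations on practical coding tasks in non-English settings.
Insight: MERA Code provides a standardized evaluation tool for future research, helps model developers address coding-task challenges in non-English settings, and advances the field of code generation.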
Abstract: Advancements in LLMs have enhanced task automation in software engineering; however, current evaluations primarily focus on natural language tasks, overlooking code quality. Most benchmarks prioritize high-level reasoning over executable code and real-world performance, leaving gaps in understanding true capabilities and risks associated with these models in production. To address this issue, we propose MERA Code, a new addition to the MERA benchmark family, specifically focused on evaluating code for the latest code generation LLMs in Russian. This benchmark includes 11 evaluation tasks that span 8 programming languages. Our proposed evaluation methodology features a taxonomy that outlines the practical coding skills necessary for models to complete these tasks. The benchmark comprises an open-source codebase for users to conduct MERA assessments, a scoring system compatible with various programming environments, and a platform featuring a leaderboard and submission system. We evaluate open LLMs and frontier API models, analyzing their limitations in terms of practical coding tasks in non-English languages. We are publicly releasing MERA to guide future research, anticipate groundbreaking features in model development, and standardize evaluation procedures.
eess.SP [Back]
[103] DoRF: Doppler Radiance Fields for Robust Human Activity Recognition Using Wi-Fi
Navid Hasanzadeh,Shahrokh Valaee
Main category: eess.SP
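TL;DR: Inspired by NeRF, the paper proposes DoRF (Doppler Radiance Fields), a new method based on Doppler velocity projections from Wi-Fi CSI that learns a 3D latent motion representation to improve the robustness and generalization of human activity recognition (HAR) under environmental changes.
Details
Motivation: Although Doppler velocity projections from Wi-Fi CSI give HAR some robustness, generalization is still insufficient for practical deployment. Inspired by NeRF, the paper addresses this with a 3D latent motion representation.
Contribution: The DoRF method, which reconstructs a 3D latent motion representation from Doppler velocity projections of Wi-Fi CSI and builds a uniform Doppler radiance field, noticeably improving the generalization and environmental adaptability of HAR.
Method: 1. Extract one-dimensional Doppler velocity projections from Wi-Fi CSI; 2. Learn a 3D latent motion representation; 3. Construct the Doppler radiance field (DoRF).
Result: Experiments show that DoRF noticeably improves the generalization accuracy of Wi-Fi-based HAR, indicating potential for practical applications.
Insight: A 3D latent representation combined with a Doppler radiance field captures the global characteristics of motion better and thereby reduces the impact of environmental changes.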
Abstract: Wi-Fi Channel State Information (CSI) has gained increasing interest for remote sensing applications. Recent studies show that Doppler velocity projections extracted from CSI can enable human activity recognition (HAR) that is robust to environmental changes and generalizes to new users. However, despite these advances, generalizability still remains insufficient for practical deployment. Inspired by neural radiance fields (NeRF), which learn a volumetric representation of a 3D scene from 2D images, this work proposes a novel approach to reconstruct an informative 3D latent motion representation from one-dimensional Doppler velocity projections extracted from Wi-Fi CSI. The resulting latent representation is then used to construct a uniform Doppler radiance field (DoRF) of the motion, providing a comprehensive view of the performed activity and improving the robustness to environmental variability. The results show that the proposed approach noticeably enhances the generalization accuracy of Wi-Fi-based HAR, highlighting the strong potential of DoRFs for practical sensing applications.
cs.CR [Back]
[104] Effective Fine-Tuning of Vision Transformers with Low-Rank Adaptation for Privacy-Preserving Image Classification
Haiwei Lin,Shoko Imaizumi,Hitoshi Kiya
Main category: cs.CR
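TL;DR: The paper proposes a low-rank adaptation method for training privacy-preserving ViT models: pre-trained weights are frozen, trainable low-rank decomposition matrices are injected, and the patch embedding layer is left unfrozen, which reduces the number of trainable parameters while maintaining high accuracy.
Details
Motivation: Freezing the patch embedding layer, as conventional low-rank adaptation does for ViTs, can cost performance. The paper addresses this while balancing parameter efficiency and privacy protection.
Contribution: An improved low-rank adaptation method that unfreezes the patch embedding layer and injects trainable low-rank matrices, substantially reducing the parameter count while preserving model performance.
Method: Trainable low-rank decomposition matrices are injected into each layer of the ViT, and the patch embedding layer is not frozen; the low-rank decomposition reduces the number of trained parameters.
Result: The method reduces the number of trainable parameters while maintaining accuracy close to that of full fine-tuning.
Insight: Unfreezing the patch embedding layer may be key to improving low-rank adaptation in ViTs, which suggests a new route to efficient privacy-preserving training.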
Abstract: We propose a low-rank adaptation method for training privacy-preserving vision transformer (ViT) models that efficiently freezes pre-trained ViT model weights. In the proposed method, trainable rank decomposition matrices are injected into each layer of the ViT architecture, and moreover, the patch embedding layer is not frozen, unlike in conventional low-rank adaptation methods. The proposed method allows us not only to reduce the number of trainable parameters but also to maintain almost the same accuracy as that of full fine-tuning.
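A minimal sketch of the standard low-rank adaptation update may help here. It shows a frozen weight matrix `W` plus a trainable low-rank product `B @ A`, and the resulting parameter count. The rank (8), the scaling factor `alpha`, and the tiny dimensions in the demo are illustrative assumptions; 768 is the hidden size of ViT-Base. The paper's additional step of leaving the patch embedding layer trainable is not modeled.

```python
import random


def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def lora_forward(W, A, B, x, alpha, rank):
    """Compute W x + (alpha / rank) * B (A x); only A and B would be trained."""
    base = matvec(W, x)                  # frozen pre-trained path
    update = matvec(B, matvec(A, x))     # low-rank trainable path
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]


def lora_param_count(d_in, d_out, rank):
    """Trainable parameters: A is rank x d_in, B is d_out x rank."""
    return rank * d_in + d_out * rank


rng = random.Random(0)
d, r = 6, 2  # toy sizes so the demo stays readable
W = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(d)]
A = [[rng.gauss(0, 0.01) for _ in range(d)] for _ in range(r)]
B = [[0.0] * r for _ in range(d)]  # zero init: the adapted layer starts equal to W
x = [rng.gauss(0, 1) for _ in range(d)]

y_adapted = lora_forward(W, A, B, x, alpha=4, rank=r)
y_frozen = matvec(W, x)

full = 768 * 768                       # one dense 768 x 768 projection
low_rank = lora_param_count(768, 768, 8)
```

For a single 768 × 768 projection, rank 8 trains 12,288 parameters instead of 589,824, roughly 2% of the dense layer.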
cs.LG [Back]
[105] MNIST-Gen: A Modular MNIST-Style Dataset Generation Using Hierarchical Semantics, Reinforcement Learning, and Category Theory
Pouya Shaeri,Arash Karimi,Ariane Middel
Main category: cs.LG
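TL;DR: The paper proposes MNIST-Gen, an automated, modular framework for generating customized MNIST-style datasets that combines hierarchical semantic categorization, reinforcement learning, and category theory to make dataset generation more efficient and flexible.
Details
Motivation: Standard datasets such as MNIST are limited to generic classes and do not meet the needs of domain-specific tasks. Creating a custom dataset by hand is time-consuming and complex, so an automated and flexible solution is needed.
Contribution: The MNIST-Gen framework, which combines CLIP-based semantic understanding, reinforcement learning, and human feedback for intelligent categorization; its category-theory-inspired design improves modularity and extensibility.
Method: Hierarchical semantic categorization is combined with reinforcement learning and human feedback for intelligent categorization; each data transformation stage is modeled as a category-theoretic morphism.
Result: Two newly generated datasets (Tree-MNIST and Food-MNIST) demonstrate the framework's utility, with 85% automatic categorization accuracy and 80% time savings compared to manual approaches.
Insight: Combining semantic understanding with reinforcement learning and human feedback enables efficient generation of customized datasets; the category-theoretic design improves the framework's extensibility.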
Abstract: Neural networks are often benchmarked using standard datasets such as MNIST, FashionMNIST, or other variants of MNIST, which, while accessible, are limited to generic classes such as digits or clothing items. For researchers working on domain-specific tasks, such as classifying trees, food items, or other real-world objects, these datasets are insufficient and irrelevant. Additionally, creating and publishing a custom dataset can be time-consuming, legally constrained, or beyond the scope of individual projects. We present MNIST-Gen, an automated, modular, and adaptive framework for generating MNIST-style image datasets tailored to user-specified categories using hierarchical semantic categorization. The system combines CLIP-based semantic understanding with reinforcement learning and human feedback to achieve intelligent categorization with minimal manual intervention. Our hierarchical approach supports complex category structures with semantic characteristics, enabling fine-grained subcategorization and multiple processing modes: individual review for maximum control, smart batch processing for large datasets, and fast batch processing for rapid creation. Inspired by category theory, MNIST-Gen models each data transformation stage as a composable morphism, enhancing clarity, modularity, and extensibility. As proof of concept, we generate and benchmark two novel datasets, Tree-MNIST and Food-MNIST, demonstrating MNIST-Gen’s utility for producing task-specific evaluation data while achieving 85% automatic categorization accuracy and 80% time savings compared to manual approaches.
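The "composable morphism" idea can be illustrated with ordinary function composition: each stage is a function from one image representation to another, and a pipeline is their composite. The stage names (`to_grayscale`, `resize_nearest`, `normalize`) and the helper `compose` below are hypothetical, not MNIST-Gen's API; the 28 × 28 grayscale target follows the MNIST format, and the grayscale weights are the ITU-R BT.601 luma coefficients.

```python
from functools import reduce


def compose(*stages):
    """Left-to-right composition: compose(f, g)(x) == g(f(x))."""
    return lambda x: reduce(lambda acc, stage: stage(acc), stages, x)


def to_grayscale(rgb):
    """Map rows of (r, g, b) pixels to rows of luma values."""
    return [[0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row] for row in rgb]


def resize_nearest(size):
    """Build a stage that resamples an image to size x size by nearest neighbor."""
    def stage(img):
        h, w = len(img), len(img[0])
        return [[img[y * h // size][x * w // size] for x in range(size)]
                for y in range(size)]
    return stage


def normalize(img):
    """Scale 0..255 intensities into 0..1."""
    return [[v / 255.0 for v in row] for row in img]


to_mnist_style = compose(to_grayscale, resize_nearest(28), normalize)

# Synthetic 56 x 64 RGB image standing in for a collected photo.
rgb = [[(x * 4 % 256, y * 4 % 256, 128) for x in range(64)] for y in range(56)]
sample = to_mnist_style(rgb)
```

Because composition is associative, stages can be regrouped or swapped out (for example, a different resize stage) without touching the rest of the pipeline, which is the modularity benefit the abstract attributes to this design.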
[106] RegCL: Continual Adaptation of Segment Anything Model via Model Merging
Yuan-Chen Shu,Zhiwei Lin,Yongtao Wang
Main category: cs.LG
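TL;DR: RegCL achieves continual adaptation of the Segment Anything Model (SAM) through model merging, addressing the performance degradation that conventional adapter methods suffer in cross-domain use.
Details
Motivation: SAM's performance is limited in specific domains, and conventional adapter methods suffer performance degradation and catastrophic forgetting when applied across multiple domains.
Contribution: The RegCL framework, which continually integrates multi-domain knowledge through model merging while preserving parameter efficiency and dynamic adaptability.
Method: A model merging algorithm is brought into the continual learning paradigm, merging LoRA modules from different domains under the guidance of weight optimization.
Result: Experiments show that RegCL performs well on multiple downstream datasets, supporting its effectiveness in dynamic scenarios.
Insight: RegCL removes the need to store historical data while keeping the model size constant, which makes it suitable for multi-task scenarios.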
Abstract: To address the performance limitations of the Segment Anything Model (SAM) in specific domains, existing works primarily adopt adapter-based one-step adaptation paradigms. However, some of these methods are developed specifically for particular domains, and using them on other domains may lead to performance degradation. This issue of catastrophic forgetting severely limits the model’s scalability. To address this issue, this paper proposes RegCL, a novel non-replay continual learning (CL) framework designed for efficient multi-domain knowledge integration through model merging. Specifically, RegCL incorporates the model merging algorithm into the continual learning paradigm by merging the parameters of SAM’s adaptation modules (e.g., LoRA modules) trained on different domains. The merging process is guided by weight optimization, which minimizes prediction discrepancies between the merged model and each of the domain-specific models. RegCL effectively consolidates multi-domain knowledge while maintaining parameter efficiency, i.e., the model size remains constant regardless of the number of tasks, and no historical data storage is required. Experimental results demonstrate that RegCL achieves favorable continual learning performance across multiple downstream datasets, validating its effectiveness in dynamic scenarios.
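The abstract describes merging that "minimizes prediction discrepancies between the merged model and each of the domain-specific models" without giving a formula. One established closed-form instance of that objective for a linear layer is regression-mean merging (RegMean); treating it as RegCL's merge rule is an assumption made here for illustration. The notation is also assumed: W_i is the layer weight adapted to domain i, and X_i stacks the layer's input features on domain i.

```latex
% Assumed notation: K domains, W_i adapted weights, X_i input features of the layer.
\begin{align}
W^{\star} &= \arg\min_{W} \sum_{i=1}^{K} \bigl\lVert X_i W - X_i W_i \bigr\rVert_F^{2} \\
0 &= \sum_{i=1}^{K} X_i^{\top} X_i \,\bigl(W^{\star} - W_i\bigr)
   \qquad \text{(set the gradient to zero)} \\
W^{\star} &= \Bigl(\sum_{i=1}^{K} X_i^{\top} X_i\Bigr)^{-1} \sum_{i=1}^{K} X_i^{\top} X_i \, W_i
\end{align}
```

Under this rule only the Gram matrices X_i^T X_i need to be kept per domain, not the data itself, and the merged layer has the same shape as each W_i, which is consistent with the constant model size and non-replay properties stated in the abstract.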
cs.SC [Back]
[107] FactorHD: A Hyperdimensional Computing Model for Multi-Object Multi-Class Representation and Factorization
Yifei Zhou,Xuchu Huang,Chenyu Ni,Min Zhou,Zheyu Yan,Xunzhao Yin,Cheng Zhuo
Main category: cs.SC
TL;DR: FactorHD是一种新颖的HDC模型,专注于高效表示和分解复杂的类-子类关系,显著提升了计算效率和精度。
Details
Motivation: 现有的HDC模型在表示复杂的类-子类关系时面临挑战,尤其在多对象多类的场景下难以高效分解,这是神经符号AI系统的关键任务。Contribution: 提出了FactorHD模型,通过符号编码方法和高效因子分解算法,解决了HDC模型在类-子类关系表示和分解中的问题。
Method: FactorHD采用了一种符号编码方法,嵌入额外的记忆条款以保留更多信息,并结合选择性消除冗余类的分解算法。
Result: 在10^9规模下,FactorHD比现有HDC模型快5667倍;与ResNet-18集成时,在Cifar-10数据集上实现了92.48%的分解准确率。
Insight: FactorHD通过引入记忆条款和高效分解算法,克服了HDC模型中的‘叠加灾难’和‘问题2’,为神经符号AI提供了更高效的工具。
Abstract: Neuro-symbolic artificial intelligence (neuro-symbolic AI) excels in logical analysis and reasoning. Hyperdimensional Computing (HDC), a promising brain-inspired computational model, is integral to neuro-symbolic AI. Various HDC models have been proposed to represent class-instance and class-class relations, but when representing the more complex class-subclass relation, where multiple objects are associated with different levels of classes and subclasses, they face challenges for factorization, a crucial task for neuro-symbolic AI systems. In this article, we propose FactorHD, a novel HDC model capable of representing and factorizing the complex class-subclass relation efficiently. FactorHD features a symbolic encoding method that embeds an extra memorization clause, preserving more information for multiple objects. In addition, it employs an efficient factorization algorithm that selectively eliminates redundant classes by identifying the memorization clause of the target class. This model significantly enhances computing efficiency and accuracy in representing and factorizing multiple objects with class-subclass relation, overcoming limitations of existing HDC models such as “superposition catastrophe” and “the problem of 2”. Evaluations show that FactorHD achieves approximately 5667x speedup at a representation size of 10^9 compared to existing HDC models. When integrated with the ResNet-18 neural network, FactorHD achieves 92.48% factorization accuracy on the CIFAR-10 dataset.
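For readers unfamiliar with HDC, the sketch below shows the generic building blocks the abstract presupposes: random bipolar hypervectors, binding by element-wise multiplication, bundling by addition, and recovery of a factor by unbinding followed by a nearest-neighbor lookup. It is a textbook-style illustration, not FactorHD's encoding: the memorization clause and the redundancy-eliminating factorization algorithm are not modeled, and the dimensionality and the class and subclass names are arbitrary assumptions.

```python
import random

D = 2000  # hypervector dimensionality (illustrative choice)


def random_hv(rng):
    """Draw a random bipolar hypervector with entries in {-1, +1}."""
    return [rng.choice((-1, 1)) for _ in range(D)]


def bind(a, b):
    """Bind two hypervectors by element-wise multiplication (self-inverse)."""
    return [x * y for x, y in zip(a, b)]


def bundle(*vectors):
    """Superpose hypervectors by element-wise addition."""
    return [sum(vals) for vals in zip(*vectors)]


def similarity(a, b):
    """Normalized dot product; near 0 for unrelated random hypervectors."""
    return sum(x * y for x, y in zip(a, b)) / D


def cleanup(query, codebook):
    """Return the name of the codebook entry most similar to the query."""
    return max(codebook, key=lambda name: similarity(query, codebook[name]))


rng = random.Random(0)
roles = {"class": random_hv(rng), "subclass": random_hv(rng)}
classes = {name: random_hv(rng) for name in ("animal", "vehicle", "plant")}
subclasses = {name: random_hv(rng) for name in ("dog", "cat", "truck", "oak")}

# Encode one object as a bundle of role-filler bindings.
obj = bundle(bind(roles["class"], classes["animal"]),
             bind(roles["subclass"], subclasses["dog"]))

# Unbinding with a role leaves the filler plus noise; cleanup removes the noise.
decoded_class = cleanup(bind(obj, roles["class"]), classes)
decoded_subclass = cleanup(bind(obj, roles["subclass"]), subclasses)
```

When several such objects are bundled into one vector, the role-filler terms of different objects mix and it becomes ambiguous which class goes with which subclass; this is the kind of limitation ("superposition catastrophe") the abstract says FactorHD is designed to overcome.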