cs.CL

[1] Automated Evidence Extraction and Scoring for Corporate Climate Policy Engagement: A Multilingual RAG Approach

Imene Kolli,Ario Saeid Vaghefi,Chiara Colesanti Senni,Shantam Raj,Markus Leippold

Main category: cs.CL

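TL;DR: The paper proposes an AI-assisted framework based on Retrieval-Augmented Generation (RAG) that automates the extraction and classification of climate policy evidence from multilingual corporate documents, in order to accelerate the monitoring of corporate climate policy engagement.

Motivation: A significant portion of InfluenceMap's monitoring work is still manual, which makes it time- and labor-intensive and prone to human error. The authors aim to address this with AI-assisted automation.

Contribution: A RAG framework that combines layout-aware parsing, the Nomic embedding model, and few-shot prompting; this combination gave the best performance for evidence extraction and classification.

Method: Retrieval-augmented generation with layout-aware parsing and a multilingual embedding model, using few-shot prompting for evidence extraction and classification.

Result: The evaluation shows that the framework effectively accelerates evidence extraction from multilingual documents, but human involvement is still required to ensure the accuracy of the analysis.

Insight: Automated tools can speed up evidence extraction, but nuanced analysis still requires expert input; the technology should augment expert judgment rather than replace it.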

Abstract: InfluenceMap’s LobbyMap Platform monitors the climate policy engagement of over 500 companies and 250 industry associations, assessing each entity’s support or opposition to science-based policy pathways for achieving the Paris Agreement’s goal of limiting global warming to 1.5°C. Although InfluenceMap has made progress with automating key elements of the analytical workflow, a significant portion of the assessment remains manual, making it time- and labor-intensive and susceptible to human error. We propose an AI-assisted framework to accelerate the monitoring of corporate climate policy engagement by leveraging Retrieval-Augmented Generation to automate the most time-intensive extraction of relevant evidence from large-scale textual data. Our evaluation shows that a combination of layout-aware parsing, the Nomic embedding model, and few-shot prompting strategies yields the best performance in extracting and classifying evidence from multilingual corporate documents. We conclude that while the automated RAG system effectively accelerates evidence extraction, the nuanced nature of the analysis necessitates a human-in-the-loop approach where the technology augments, rather than replaces, expert judgment to ensure accuracy.
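
The retrieve-then-prompt pattern described in this abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: a toy bag-of-words cosine similarity stands in for the Nomic embedding model, and the function names (`embed`, `retrieve`, `build_prompt`), the labels, and the prompt layout are hypothetical rather than the authors' implementation.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy bag-of-words vector (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, chunks, k=2):
    """Return the k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


def build_prompt(query, evidence, examples):
    """Assemble a few-shot prompt: labeled examples first, then retrieved passages."""
    shots = "\n".join(f"Evidence: {e}\nLabel: {l}" for e, l in examples)
    context = "\n".join(f"- {c}" for c in evidence)
    return f"{shots}\n\nQuery: {query}\nRetrieved passages:\n{context}\nLabel:"


# Hypothetical document chunks and few-shot examples.
chunks = [
    "We support a carbon tax aligned with the Paris Agreement.",
    "Our cafeteria menu changes weekly.",
    "The company opposes stricter emissions standards.",
]
examples = [("We back a 2035 coal phase-out.", "supportive")]
query = "position on carbon tax policy"
prompt = build_prompt(query, retrieve(query, chunks, k=1), examples)
```

The resulting `prompt` would then be sent to an LLM for classification; in line with the paper's conclusion, the model output would still be reviewed by an analyst.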

[2] BRoverbs – Measuring how much LLMs understand Portuguese proverbs

Thales Sales Almeida,Giovana Kerche Bonás,João Guilherme Alves Santos

Main category: cs.CL

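TL;DR: The paper introduces BRoverbs, a dataset for evaluating how well large language models (LLMs) understand Brazilian Portuguese proverbs.

Motivation: Existing evaluations for Portuguese are limited and often rely on translated datasets that cannot fully capture linguistic nuances and cultural references. The authors use Brazilian proverbs to fill this gap.

Contribution: BRoverbs, an evaluation dataset focused on Brazilian Portuguese proverbs, which offers a new way to assess the regional language understanding of LLMs.

Method: The dataset is built from Brazilian proverbs, which carry cultural wisdom, figurative expressions, and complex syntactic structures, and is used to test how well LLMs understand regional language.

Result: BRoverbs provides a new benchmark for Portuguese-language LLMs and contributes to regionally informed evaluation. The dataset is publicly available on Hugging Face.

Insight: Proverbs are a core carrier of language and culture, so they are an effective probe of the regional language ability of LLMs and a reference point for similar evaluations in other languages.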

Abstract: Large Language Models (LLMs) exhibit significant performance variations depending on the linguistic and cultural context in which they are applied. This disparity signals the necessity of mature evaluation frameworks that can assess their capabilities in specific regional settings. In the case of Portuguese, existing evaluations remain limited, often relying on translated datasets that may not fully capture linguistic nuances or cultural references. Meanwhile, native Portuguese-language datasets predominantly focus on structured national exams or sentiment analysis of social media interactions, leaving gaps in evaluating broader linguistic understanding. To address this limitation, we introduce BRoverbs, a dataset specifically designed to assess LLM performance through Brazilian proverbs. Proverbs serve as a rich linguistic resource, encapsulating cultural wisdom, figurative expressions, and complex syntactic structures that challenge the model comprehension of regional expressions. BRoverbs aims to provide a new evaluation tool for Portuguese-language LLMs, contributing to advancing regionally informed benchmarking. The benchmark is available at https://huggingface.co/datasets/Tropic-AI/BRoverbs.

[3] Can Vision-Language Models Solve Visual Math Equations?

Monjoy Narayan Choudhury,Junling Wang,Yifan Hou,Mrinmaya Sachan

Main category: cs.CL

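TL;DR: The paper studies how Vision-Language Models (VLMs) solve visual math equations and finds that coefficient counting is the main bottleneck, with multi-step visual reasoning introducing additional errors.

Motivation: VLMs perform well at visual understanding and language-based reasoning, but they are limited on tasks that combine perception with symbolic computation. The paper examines this limitation through visual equation solving.

Contribution: 1. Decomposes visual equation solving into coefficient counting and variable recognition and finds that counting is the primary bottleneck; 2. Shows that composing recognition and reasoning compounds errors in multi-step visual reasoning; 3. Shows that symbolic reasoning itself becomes a limiting factor for more complex equations.

Method: Splits the visual equation task into two subtasks, coefficient counting and variable recognition, analyzes VLM performance on each, and examines the effect of multi-step reasoning and symbolic computation.

Result: VLMs perform well on textual equations but fail on visually grounded ones, with counting and multi-step reasoning as the main sources of error. As equation complexity grows, symbolic reasoning also becomes a limiting factor.

Insight: The limitations of current VLMs in visual mathematical reasoning stem mainly from weak counting and multi-step reasoning; future work needs to improve both to achieve better combined visual and symbolic ability.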

Abstract: Despite strong performance in visual understanding and language-based reasoning, Vision-Language Models (VLMs) struggle with tasks requiring integrated perception and symbolic computation. We study this limitation through visual equation solving, where mathematical equations are embedded in images, variables are represented by object icons, and coefficients must be inferred by counting. While VLMs perform well on textual equations, they fail on visually grounded counterparts. To understand this gap, we decompose the task into coefficient counting and variable recognition, and find that counting is the primary bottleneck, even when recognition is accurate. We also observe that composing recognition and reasoning introduces additional errors, highlighting challenges in multi-step visual reasoning. Finally, as equation complexity increases, symbolic reasoning itself becomes a limiting factor. These findings reveal key weaknesses in current VLMs and point toward future improvements in visually grounded mathematical reasoning.

[4] MR-UIE: Multi-Perspective Reasoning with Reinforcement Learning for Universal Information Extraction

Zhongqiu Li,Shiquan Wang,Ruiyu Fang,Mengjiao Bao,Zhenhe Wu,Shuangyong Song,Yongxiang Li,Zhongjiang He

Main category: cs.CL

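TL;DR: The paper proposes MR-UIE, a method that combines multi-perspective reasoning with reinforcement learning to improve the generalization and accuracy of large language models on universal information extraction.

Motivation: Large language models underperform on universal information extraction, especially in structured-output scenarios that involve complex schema descriptions and multi-step reasoning. Existing approaches such as in-context learning and instruction tuning still have clear limitations, so a new method is needed to improve reasoning and generalization.

Contribution: MR-UIE, a method that integrates multi-perspective reasoning with reinforcement learning and turns LLMs from passive extractors into active reasoners, improving their performance on complex information extraction tasks.

Method: MR-UIE improves the model through a reinforcement learning framework and a multi-perspective reasoning strategy that covers both what to extract and how to reason. Its effectiveness is validated on multiple information extraction benchmarks.

Result: Experiments show that MR-UIE consistently raises extraction accuracy across domains and surpasses state-of-the-art methods on several datasets. Adding multi-perspective reasoning also improves generalization on complex tasks.

Insight: The paper underscores the critical role of reasoning in complex information extraction and shows that combining multi-perspective reasoning with reinforcement learning improves both generalization and task performance.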

Abstract: Large language models (LLMs) demonstrate robust capabilities across diverse research domains. However, their performance in universal information extraction (UIE) remains insufficient, especially when tackling structured output scenarios that involve complex schema descriptions and require multi-step reasoning. While existing approaches enhance the performance of LLMs through in-context learning and instruction tuning, significant limitations nonetheless persist. To enhance the model’s generalization ability, we propose integrating reinforcement learning (RL) with multi-perspective reasoning for information extraction (IE) tasks. Our work transitions LLMs from passive extractors to active reasoners, enabling them to understand not only what to extract but also how to reason. Experiments conducted on multiple IE benchmarks demonstrate that MR-UIE consistently elevates extraction accuracy across domains and surpasses state-of-the-art methods on several datasets. Furthermore, incorporating multi-perspective reasoning into RL notably enhances generalization in complex IE tasks, underscoring the critical role of reasoning in challenging scenarios.

[5] TigerCoder: A Novel Suite of LLMs for Code Generation in Bangla

Nishat Raihan,Antonios Anastasopoulos,Marcos Zampieri

Main category: cs.CL

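TL;DR: The paper introduces TigerCoder, the first dedicated family of large language models for code generation in Bangla, and reports substantial gains from high-quality datasets and a dedicated benchmark.

Motivation: Although Bangla is the 5th most spoken language, it is underrepresented in LLMs for code generation, mainly because high-quality data is scarce.

Contribution: 1. Comprehensive Bangla code instruction datasets. 2. The MBPP-Bangla evaluation benchmark. 3. The TigerCoder model family, with significant performance gains.

Method: Develops the TigerCoder family by adapting models to high-quality Bangla code instruction data and evaluating them on a dedicated benchmark.

Result: TigerCoder achieves roughly 11-18% gains at Pass@1 over existing multilingual and general-purpose Bangla LLMs.

Insight: Curated, high-quality datasets can overcome the limitations of smaller models for low-resource languages.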

Abstract: Despite being the 5th most spoken language, Bangla remains underrepresented in Large Language Models (LLMs), particularly for code generation. This primarily stems from the scarcity of high-quality data to pre-train and/or finetune such models. Hence, we introduce the first dedicated family of Code LLMs for Bangla (1B & 9B). We offer three major contributions: (1) comprehensive Bangla code instruction datasets for programming domain adaptation; (2) MBPP-Bangla, an evaluation benchmark for Bangla code generation; and (3) the TigerCoder family of Code LLMs, achieving significant ~11-18% performance gains at Pass@1 over existing multilingual and general-purpose Bangla LLMs. Our findings show that curated, high-quality datasets can overcome limitations of smaller models for low-resource languages. We open-source all resources to advance further Bangla LLM research.
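
For readers unfamiliar with the Pass@1 metric quoted above, the sketch below shows the widely used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), where n samples are generated per problem and c of them pass the unit tests. The abstract does not state how Pass@1 was computed for TigerCoder, so treat this as an assumption about the usual convention; the function names and the example counts are hypothetical.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one problem: n samples drawn, c of them correct."""
    if n - c < k:
        # Every size-k subset must contain at least one correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results, k):
    """Average pass@k over a benchmark; results is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical per-problem counts: (samples generated, samples passing the tests).
benchmark = [(10, 3), (10, 0), (10, 10)]
score = mean_pass_at_k(benchmark, k=1)
```

For k = 1 the estimator reduces to c / n per problem, so Pass@1 is simply the average fraction of passing samples.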

[6] Compass-v3: Scaling Domain-Specific LLMs for Multilingual E-Commerce in Southeast Asia

Sophia Maria

Main category: cs.CL

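TL;DR: Compass-v3 is a vertical-domain Mixture-of-Experts model for multilingual Southeast Asian e-commerce. It uses hardware-efficient optimizations and a tailored training strategy, and it outperforms existing models on multilingual and e-commerce tasks.

Motivation: General-purpose LLMs fall short in specialized domains such as e-commerce, where data are noisy, multilingual, and highly dynamic. To address this, the authors built Compass-v3, a dedicated model for Southeast Asian e-commerce.

Contribution: 1. Compass-v3, a 245B-parameter MoE model for multilingual Southeast Asian e-commerce; 2. Hardware-efficient optimizations and a mixed-training strategy; 3. OTPO, a method that improves instruction alignment.

Method: 1. Uses fewer but larger experts, combined with hardware-efficient optimizations such as intra-node expert parallelism and a customized memcpy operator; 2. Trains on 12T tokens of curated multilingual corpora mixed with large-scale synthetic e-commerce instructions; 3. Proposes OTPO to capture token-level distinctions during preference alignment.

Result: Compass-v3 outperforms DeepSeek-V3.1, the GPT-4 series, and Qwen3-235B on e-commerce tasks, shows strong multilingual capability, and now accounts for over 70% of total LLM usage on Shopee's platform.

Insight: 1. A dedicated vertical-domain model can gain substantially from hardware-efficient optimizations and mixed training; 2. OTPO noticeably improves instruction adherence; 3. Multilingual capability remains competitive in low-resource languages.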

Abstract: Large language models (LLMs) excel in general-domain applications, yet their performance often degrades in specialized tasks requiring domain-specific knowledge. E-commerce is particularly challenging, as its data are noisy, heterogeneous, multilingual, and highly dynamic. We present Compass-v3, a vertical-domain Mixture-of-Experts (MoE) model with 245B total parameters and 71B active per token, designed for Southeast Asian e-commerce. Compass-v3 adopts fewer but larger experts, combined with hardware-efficient optimizations, such as intra-node expert parallelism and a customized memcpy operator, to maximize GPU utilization. The model is trained on 12T tokens of curated multilingual corpora and large-scale synthetic e-commerce instructions using a mixed-training strategy. To enhance alignment, we propose Optimal-Transport Direct Preference Optimization (OTPO), which captures token-level distinctions and improves instruction adherence in commerce-specific scenarios. Extensive evaluations demonstrate that Compass-v3 delivers state-of-the-art e-commerce performance, surpassing DeepSeek-V3.1, GPT-4 series, and Qwen3-235B. Moreover, Compass-v3 demonstrates strong multilingual capability across low-resource Southeast Asian languages (Indonesian, Thai, Filipino, Vietnamese, Malay, Tagalog) and Portuguese while sustaining competitive performance on general benchmarks. It has already been widely applied in Shopee’s industrial-scale e-commerce platform and is gradually replacing OpenAI’s traffic, now accounting for over 70% of total LLM usage, highlighting its dual strengths in specialized commerce expertise and broad linguistic competence.

[7] Target-oriented Multimodal Sentiment Classification with Counterfactual-enhanced Debiasing

Zhiyue Liu,Fanrong Ma,Xin Ling

Main category: cs.CL

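TL;DR: The paper proposes a counterfactual-enhanced debiasing framework for target-oriented multimodal sentiment classification that reduces spurious correlations between text features and labels and thereby improves classification accuracy.

Motivation: Existing target-oriented multimodal sentiment classification methods over-rely on textual content and ignore dataset biases, in particular word-level contextual biases. This creates spurious correlations between text features and output labels and hurts classification.

Contribution: 1) A counterfactual data augmentation strategy that generates detail-matched image-text samples to steer the model toward sentiment-related content; 2) An adaptive debiasing contrastive learning mechanism that learns robust features from counterfactual data and weakens the influence of biased words.

Method: 1) The counterfactual data augmentation strategy generates samples by minimally altering sentiment-related causal features; 2) The adaptive debiasing contrastive learning mechanism uses contrastive learning to reduce interference from biased words.

Result: Experiments on several benchmark datasets show that the method outperforms state-of-the-art baselines.

Insight: 1) Counterfactual data augmentation effectively reduces spurious correlations; 2) Adaptive debiasing contrastive learning performs well on multimodal tasks.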

Abstract: Target-oriented multimodal sentiment classification seeks to predict sentiment polarity for specific targets from image-text pairs. While existing works achieve competitive performance, they often over-rely on textual content and fail to consider dataset biases, in particular word-level contextual biases. This leads to spurious correlations between text features and output labels, impairing classification accuracy. In this paper, we introduce a novel counterfactual-enhanced debiasing framework to reduce such spurious correlations. Our framework incorporates a counterfactual data augmentation strategy that minimally alters sentiment-related causal features, generating detail-matched image-text samples to guide the model’s attention toward content tied to sentiment. Furthermore, for learning robust features from counterfactual data and prompting model decisions, we introduce an adaptive debiasing contrastive learning mechanism, which effectively mitigates the influence of biased words. Experimental results on several benchmark datasets show that our proposed method outperforms state-of-the-art baselines.

[8] EchoX: Towards Mitigating Acoustic-Semantic Gap via Echo Training for Speech-to-Speech LLMs

Yuhao Zhang,Yuhao Du,Zhanchen Dai,Xiangnan Ma,Kaiqi Kou,Benyou Wang,Haizhou Li

Main category: cs.CL

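TL;DR: EchoX uses Echo training to bridge the acoustic-semantic gap in speech-to-speech large language models (SLLMs), integrating acoustic and semantic learning to preserve reasoning ability.

Motivation: Current SLLMs show degraded knowledge and reasoning capabilities. The authors hypothesize that this is because existing training paradigms fail to bridge the acoustic-semantic gap in the feature representation space.

Contribution: EchoX, a method that leverages semantic representations and dynamically generated speech training targets to bridge the acoustic-semantic gap while preserving reasoning ability.

Method: Integrates acoustic and semantic learning and dynamically generates speech training targets to improve semantic understanding and reasoning.

Result: With about 6,000 hours of training data, EchoX performs strongly on multiple knowledge-based question-answering benchmarks.

Insight: Integrating acoustic and semantic learning is essential for the reasoning ability of speech LLMs, and dynamically generating training targets is an effective way to do it.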

Abstract: Speech-to-speech large language models (SLLMs) are attracting increasing attention. Derived from text-based large language models (LLMs), SLLMs often exhibit degradation in knowledge and reasoning capabilities. We hypothesize that this limitation arises because current training paradigms for SLLMs fail to bridge the acoustic-semantic gap in the feature representation space. To address this issue, we propose EchoX, which leverages semantic representations and dynamically generates speech training targets. This approach integrates both acoustic and semantic learning, enabling EchoX to preserve strong reasoning abilities as a speech LLM. Experimental results demonstrate that EchoX, with about six thousand hours of training data, achieves advanced performance on multiple knowledge-based question-answering benchmarks. The project is available at https://github.com/FreedomIntelligence/EchoX.

[9] Efficient Trie-based Biasing using K-step Prediction for Rare Word Recognition

Chin Yuen Kwok,Jia Qi Yip

Main category: cs.CL

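TL;DR: The paper proposes an efficient Trie-based biasing method that uses K-step prediction to improve rare word recognition, avoiding the computational overhead of the conventional approach and substantially lowering the word error rate.

Motivation: Current Trie-based biasing gives "bonus scores" to partial hypotheses during decoding. If the rare word is not ultimately recognized, the system has to revoke those bonuses, which is computationally expensive for models with large decoders.

Contribution: An adaptation method that uses multi-step prediction to avoid the bonus revocation step, improving both the efficiency and the accuracy of rare word recognition.

Method: Fine-tunes an ASR model (such as Whisper) to predict multiple steps at once, so it can better estimate whether a partial hypothesis will lead to the full rare word.

Result: After fine-tuning with only 10 hours of synthetic data, the word error rate on the NSC Part 2 test set drops from 30.86% to 12.19%.

Insight: Multi-step prediction can considerably reduce the computational burden of the conventional approach while improving rare word recognition.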

Abstract: Contextual biasing improves rare word recognition of ASR models by prioritizing the output of rare words during decoding. A common approach is Trie-based biasing, which gives “bonus scores” to partial hypotheses (e.g. “Bon”) that may lead to the generation of the rare word (e.g. “Bonham”). If the full word (“Bonham”) isn’t ultimately recognized, the system revokes those earlier bonuses. This revocation is limited to beam search and is computationally expensive, particularly for models with large decoders. To overcome these limitations, we propose adapting ASR models to look ahead and predict multiple steps at once. This avoids the revocation step entirely by better estimating whether a partial hypothesis will lead to the generation of the full rare word. By fine-tuning Whisper with only 10 hours of synthetic data, our method reduces the word error rate on the NSC Part 2 test set from 30.86% to 12.19%.
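
To make the bonus-and-revocation mechanism concrete, here is a minimal sketch of conventional Trie-based biasing, i.e., the baseline behavior the abstract describes, not the proposed K-step method. It is a simplification under stated assumptions: decoding steps are single characters rather than subword tokens, the bonus value is arbitrary, and the names `TrieNode`, `build_trie`, and `bonus_for` are hypothetical.

```python
class TrieNode:
    """One node of a character-level prefix tree over the rare-word list."""

    def __init__(self):
        self.children = {}
        self.is_word = False


def build_trie(words):
    """Insert every rare word into a trie and return its root."""
    root = TrieNode()
    for word in words:
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True
    return root


def bonus_for(root, partial, bonus_per_step=1.0):
    """Bonus currently held by a partial hypothesis.

    While the hypothesis follows a trie path it keeps one bonus per step.
    As soon as it leaves the trie, all earlier bonuses are revoked (0.0).
    """
    node = root
    for ch in partial:
        node = node.children.get(ch)
        if node is None:
            return 0.0
    return bonus_per_step * len(partial)


root = build_trie(["Bonham"])
on_path = bonus_for(root, "Bon")    # still a prefix of "Bonham"
revoked = bonus_for(root, "Bond")   # diverged, so the bonus is withdrawn
```

In beam search this check runs for every hypothesis at every step, which is where the revocation cost mentioned in the abstract comes from.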

[10] Improving Synthetic Data Training for Contextual Biasing Models with a Keyword-Aware Cost Function

Chin Yuen Kwok,Jia Qi Yip,Eng Siong Chng

Main category: cs.CL

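TL;DR: The paper proposes a keyword-aware loss function that combines a masked cross-entropy term with a binary classification term, improving the training of TCPGen-based contextual biasing models on synthetic data and substantially lowering the word error rate.

Motivation: Contextual biasing models trained on synthetic data tend to overfit to artifacts in the synthetic audio, which hurts rare word recognition. The authors propose a new loss design to mitigate this.

Contribution: 1. A keyword-aware loss function with a masked cross-entropy term and a binary classification term; 2. An enhanced TCPGen-based contextual biasing approach; 3. Substantially better model performance when training on synthetic data.

Method: 1. Uses a masked cross-entropy loss to optimize biased word prediction; 2. Adds a binary classification task to detect biased word positions; 3. Combines the two complementary terms to support the decoding of biased words at inference time.

Result: After adapting Whisper to 10 hours of synthetic data, the word error rate on the NSC Part 2 test set drops from 29.71% to 11.81%.

Insight: Jointly optimizing biased word prediction and position detection mitigates overfitting when training on synthetic data and improves rare word recognition.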

Abstract: Rare word recognition can be improved by adapting ASR models to synthetic data that includes these words. Further improvements can be achieved through contextual biasing, which trains and adds a biasing module into the model architecture to prioritize rare words. While training the module on synthetic rare word data is more effective than using non-rare-word data, it can lead to overfitting due to artifacts in the synthetic audio. To address this, we enhance the TCPGen-based contextual biasing approach and propose a keyword-aware loss function that additionally focuses on biased words when training biasing modules. This loss includes a masked cross-entropy term for biased word prediction and a binary classification term for detecting biased word positions. These two terms complementarily support the decoding of biased words during inference. By adapting Whisper to 10 hours of synthetic data, our method reduced the word error rate on the NSC Part 2 test set from 29.71% to 11.81%.
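
The two loss terms named in this abstract can be sketched in a few lines. This is an illustration under assumptions, not the authors' formulation: the abstract does not give the exact weighting or how the terms combine with the base ASR loss, so the weights `alpha` and `beta`, the function names, and the per-position input format are hypothetical.

```python
import math


def masked_cross_entropy(p_correct, is_biased):
    """Mean negative log-likelihood of the reference token, over biased-word positions only."""
    losses = [-math.log(p) for p, m in zip(p_correct, is_biased) if m]
    return sum(losses) / len(losses) if losses else 0.0


def binary_cross_entropy(p_biased, is_biased, eps=1e-12):
    """Mean BCE of a detector that predicts whether each position belongs to a biased word."""
    terms = [
        -(m * math.log(p + eps) + (1 - m) * math.log(1 - p + eps))
        for p, m in zip(p_biased, is_biased)
    ]
    return sum(terms) / len(terms)


def keyword_aware_loss(p_correct, p_biased, is_biased, alpha=1.0, beta=1.0):
    """Weighted sum of the two terms (alpha and beta are hypothetical weights)."""
    return (alpha * masked_cross_entropy(p_correct, is_biased)
            + beta * binary_cross_entropy(p_biased, is_biased))


# Hypothetical 3-token utterance in which only the middle token is a biased (rare) word.
is_biased = [0, 1, 0]
p_correct = [0.9, 0.5, 0.8]   # model probability of the reference token at each position
p_biased = [0.1, 0.9, 0.1]    # detector probability that the position is a biased word
loss = keyword_aware_loss(p_correct, p_biased, is_biased)
```

In a real training setup these terms would be computed on batched tensors in a deep learning framework; plain Python lists are used here only to keep the arithmetic visible.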

[11] From scratch to silver: Creating trustworthy training data for patent-SDG classification using Large Language Models

Grazia Sveva Ascione,Nicolò Tamagnone

Main category: cs.CL

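TL;DR: The paper proposes a method based on weak supervision and large language models (LLMs) for classifying patents by UN Sustainable Development Goal (SDG). It builds a silver-standard multi-label dataset to address the limited scalability and generalizability of existing methods.

Motivation: Classifying patents by SDG is crucial for tracking how innovation addresses global challenges, but without a large labeled dataset, existing methods (keyword searches, transfer learning, and citation-based heuristics) fall short. A scalable and generalizable solution is needed.

Contribution: 1. Frames patent-to-SDG classification as a weak supervision problem; 2. Develops a composite labeling function (LF) that uses LLMs to extract structured concepts from patents and SDG papers; 3. Proposes a rank-based retrieval approach; 4. Builds a silver-standard multi-label dataset.

Method: 1. Uses citations from patents to SDG-tagged scientific publications (NPL citations) as a noisy initial signal; 2. Extracts structured concepts from patents and SDG papers with LLMs; 3. Computes cross-domain similarity scores and combines them with a rank-based retrieval approach; 4. Calibrates the LF with a custom positive-only loss.

Result: 1. In internal validation, the method outperforms several baselines, including transformer-based models and a zero-shot LLM; 2. In external validation, the proposed labels show greater thematic, cognitive, and organizational coherence in patent citation, co-inventor, and co-applicant graphs.

Insight: Weak supervision and semantic alignment can improve SDG classification at scale even without a large labeled dataset.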

Abstract: Classifying patents by their relevance to the UN Sustainable Development Goals (SDGs) is crucial for tracking how innovation addresses global challenges. However, the absence of a large, labeled dataset limits the use of supervised learning. Existing methods, such as keyword searches, transfer learning, and citation-based heuristics, lack scalability and generalizability. This paper frames patent-to-SDG classification as a weak supervision problem, using citations from patents to SDG-tagged scientific publications (NPL citations) as a noisy initial signal. To address its sparsity and noise, we develop a composite labeling function (LF) that uses large language models (LLMs) to extract structured concepts, namely functions, solutions, and applications, from patents and SDG papers based on a patent ontology. Cross-domain similarity scores are computed and combined using a rank-based retrieval approach. The LF is calibrated via a custom positive-only loss that aligns with known NPL-SDG links without penalizing discovery of new SDG associations. The result is a silver-standard, soft multi-label dataset mapping patents to SDGs, enabling the training of effective multi-label regression models. We validate our approach through two complementary strategies: (1) internal validation against held-out NPL-based labels, where our method outperforms several baselines, including transformer-based models and a zero-shot LLM; and (2) external validation using network modularity in patent citation, co-inventor, and co-applicant graphs, where our labels reveal greater thematic, cognitive, and organizational coherence than traditional technological classifications. These results show that weak supervision and semantic alignment can enhance SDG classification at scale.

[12] MetaRAG: Metamorphic Testing for Hallucination Detection in RAG Systems

Channdeth Sok,David Luz,Yacine Haddam

Main category: cs.CL

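TL;DR: MetaRAG is a metamorphic testing framework for detecting hallucinations in RAG systems. It runs in real time without supervision and is suited to high-stakes and proprietary domains.

Motivation: LLMs are widely deployed in enterprise applications, but hallucinations (confident but factually incorrect information) limit their reliability. Existing methods such as SelfCheckGPT and MetaQA target standalone LLMs and do not address the distinctive requirement of RAG systems that responses be consistent with the retrieved evidence.

Contribution: MetaRAG, a hallucination detection framework focused on RAG systems. It needs neither ground-truth references nor access to model internals, supports real-time, unsupervised, black-box testing, and localizes unsupported claims.

Method: The framework has four stages: 1) decompose the answer into atomic factoids; 2) generate mutations using synonym and antonym substitutions; 3) verify each variant against the retrieved context; 4) aggregate inconsistencies into a hallucination score.

Result: Experiments on a proprietary enterprise dataset illustrate the effectiveness of MetaRAG for detecting hallucinations and supporting trustworthy deployment of RAG-based conversational agents.

Insight: By localizing unsupported claims and enabling safeguards for identity-sensitive queries, MetaRAG shows practical value for identity-aware AI.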

Abstract: Large Language Models (LLMs) are increasingly deployed in enterprise applications, yet their reliability remains limited by hallucinations, i.e., confident but factually incorrect information. Existing detection approaches, such as SelfCheckGPT and MetaQA, primarily target standalone LLMs and do not address the unique challenges of Retrieval-Augmented Generation (RAG) systems, where responses must be consistent with retrieved evidence. We therefore present MetaRAG, a metamorphic testing framework for hallucination detection in Retrieval-Augmented Generation (RAG) systems. MetaRAG operates in a real-time, unsupervised, black-box setting, requiring neither ground-truth references nor access to model internals, making it suitable for proprietary and high-stakes domains. The framework proceeds in four stages: (1) decompose answers into atomic factoids, (2) generate controlled mutations of each factoid using synonym and antonym substitutions, (3) verify each variant against the retrieved context (synonyms are expected to be entailed and antonyms contradicted), and (4) aggregate penalties for inconsistencies into a response-level hallucination score. Crucially for identity-aware AI, MetaRAG localizes unsupported claims at the factoid span where they occur (e.g., pregnancy-specific precautions, LGBTQ+ refugee rights, or labor eligibility), allowing users to see flagged spans and enabling system designers to configure thresholds and guardrails for identity-sensitive queries. Experiments on a proprietary enterprise dataset illustrate the effectiveness of MetaRAG for detecting hallucinations and enabling trustworthy deployment of RAG-based conversational agents. We also outline a topic-based deployment design that translates MetaRAG’s span-level scores into identity-aware safeguards; this design is discussed but not evaluated in our experiments.
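
Stages (3) and (4) above can be sketched as follows. This is a minimal illustration under assumptions: the abstract does not specify the aggregation formula, so taking the worst factoid penalty as the response-level score is a choice made here for simplicity; the `verify` callable stands in for an entailment model or LLM judge; and all names and the example data are hypothetical.

```python
def factoid_penalty(synonyms, antonyms, context, verify):
    """Fraction of variants whose verdict deviates from the expected one.

    Synonym variants are expected to be entailed by the context;
    antonym variants are expected to be contradicted by it.
    """
    checks = [(v, "entailed") for v in synonyms] + [(v, "contradicted") for v in antonyms]
    if not checks:
        return 0.0
    misses = sum(1 for variant, expected in checks if verify(variant, context) != expected)
    return misses / len(checks)


def hallucination_score(factoids, context, verify):
    """Response-level score (worst factoid penalty, an assumed rule) plus per-factoid penalties."""
    penalties = [
        factoid_penalty(f["synonyms"], f["antonyms"], context, verify) for f in factoids
    ]
    return max(penalties, default=0.0), penalties


# Hypothetical verdict table standing in for an entailment model.
verdicts = {
    "covers full-time staff": "entailed",
    "excludes full-time staff": "contradicted",
    "covers contractors": "neutral",
    "excludes contractors": "neutral",
}


def toy_verify(variant, context):
    """Look up a precomputed verdict; a real system would query a model with the context."""
    return verdicts[variant]


context = "The policy covers full-time employees."
factoids = [
    {"synonyms": ["covers full-time staff"], "antonyms": ["excludes full-time staff"]},
    {"synonyms": ["covers contractors"], "antonyms": ["excludes contractors"]},
]
score, per_factoid = hallucination_score(factoids, context, toy_verify)
```

Keeping the per-factoid penalties alongside the overall score mirrors the span-level localization the abstract describes: the second factoid, which the context neither supports nor contradicts, is the one that gets flagged.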

[13] Modelling Analogies and Analogical Reasoning: Connecting Cognitive Science Theory and NLP Research

Molly R Petersen,Claire E Stevenson,Lonneke van der Plas

Main category: cs.CL

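TL;DR: The paper connects cognitive science theory on analogical reasoning with NLP research and discusses how a cognitive perspective can improve relational understanding in NLP.

Motivation: Analogical reasoning is central to human cognition, but NLP research generally does not examine its underlying processes through a cognitive lens. The paper aims to fill this gap.

Contribution: Links theories of analogical reasoning from cognitive science to NLP research and shows their potential for improving relational understanding in NLP.

Method: Summarizes key theory from the cognitive science literature, relates it to current NLP research, and discusses its relevance to major NLP challenges.

Result: Shows that cognitive theories of analogical reasoning can guide NLP research beyond entity-level similarity toward relational understanding.

Insight: A cognitive science perspective can help NLP model the relations in text better, instead of relying heavily on entity-level similarity.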

Abstract: Analogical reasoning is an essential aspect of human cognition. In this paper, we summarize key theory about the processes underlying analogical reasoning from the cognitive science literature and relate it to current research in natural language processing. While these processes can be easily linked to concepts in NLP, they are generally not viewed through a cognitive lens. Furthermore, we show how these notions are relevant for several major challenges in NLP research, not directly related to analogy solving. This may guide researchers to better optimize relational understanding in text, as opposed to relying heavily on entity-level similarity.

[14] Hierarchical Bracketing Encodings Work for Dependency Graphs

Ana Ezquerro,Carlos Gómez-Rodríguez,David Vilares

Main category: cs.CL

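TL;DR: The paper revisits hierarchical bracketing encodings for dependency graph parsing. The encoding reduces the label space while preserving structural information and achieves competitive results on a multilingual benchmark.

Motivation: Dependency graph parsing plays an important role in natural language processing, and existing graph linearizations struggle to represent graphs efficiently, especially with reentrancies, cycles, and empty nodes. The paper addresses this with hierarchical bracketing encodings.

Contribution: A hierarchical bracketing encoding that represents dependency graphs as sequences, enabling linear-time parsing while preserving complex structure such as reentrancies, cycles, and empty nodes.

Method: The hierarchical bracketing encoding converts a dependency graph into a sequence in which nodes and edges are expressed through bracket labels. This substantially reduces the label space while keeping the structural information of the graph.

Result: On a multilingual and multi-formalism benchmark, the method shows competitive results and consistent improvements over other methods in exact match accuracy.

Insight: Hierarchical bracketing encodings give dependency graph parsing an efficient and compact representation that simplifies parsing without losing structural information, making them suitable for complex graph structures.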

Abstract: We revisit hierarchical bracketing encodings from a practical perspective in the context of dependency graph parsing. The approach encodes graphs as sequences, enabling linear-time parsing with $n$ tagging actions, and still representing reentrancies, cycles, and empty nodes. Compared to existing graph linearizations, this representation substantially reduces the label space while preserving structural information. We evaluate it on a multilingual and multi-formalism benchmark, showing competitive results and consistent improvements over other methods in exact match accuracy.

[15] Mitigating Language Barriers in Education: Developing Multilingual Digital Learning Materials with Machine Translation

Lucie Poláková,Martin Popel,Věra Kloudová,Michal Novák,Mariia Anisimova,Jiří Balhar

Main category: cs.CL

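TL;DR: The EdUKate project uses machine translation and related technologies to develop multilingual learning materials that address language barriers in Czech primary and secondary education, with a focus on a Czech-Ukrainian machine translation system.

Motivation: To reduce the language barriers that non-Czech-speaking students face in the Czech education system and to make education more inclusive through multilingual learning materials.

Contribution: A Czech-Ukrainian machine translation system tailored to the educational domain, with handling of formatted content (such as XML and PDF) and technical terminology.

Method: Combines digital education, linguistics, and translation studies to develop multilingual content for multimodal interactive exercises, and surveys teachers to assess their needs.

Result: The project outputs are freely available to students, educators, and researchers, including the translation of up to 9,000 exercises and the deployment of the machine translation system.

Insight: Machine translation for education needs to be optimized for specific language pairs and content formats to improve translation quality and usefulness.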

Abstract: The EdUKate project combines digital education, linguistics, translation studies, and machine translation to develop multilingual learning materials for Czech primary and secondary schools. Launched through collaboration between a major Czech academic institution and the country’s largest educational publisher, the project is aimed at translating up to 9,000 multimodal interactive exercises from Czech into Ukrainian, English, and German for an educational web portal. It emphasizes the development and evaluation of a direct Czech-Ukrainian machine translation system tailored to the educational domain, with special attention to processing formatted content such as XML and PDF and handling technical and scientific terminology. We present findings from an initial survey of Czech teachers regarding the needs of non-Czech-speaking students and describe the system’s evaluation and implementation on the web portal. All resulting applications are freely available to students, educators, and researchers.

[16] All for One: LLMs Solve Mental Math at the Last Token With Information Transferred From Other Tokens

Siddarth Mamidanna,Daking Rai,Ziyu Yao,Yilun Zhou

Main category: cs.CL

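TL;DR: By inhibiting token-specific computation in the initial layers, restricting the routes of information transfer, and forcing all computation to happen at the last token, the paper shows that on mental math tasks LLMs compute mainly at the last token, which receives information from other tokens through a few specific middle layers.

Motivation: To investigate the inner workings of LLMs on mental math tasks and clarify how information transfer between tokens lets them complete the computation.

Contribution: Proposes Context-Aware Mean Ablation (CAMA) and Attention-Based Peeking (ABP) and identifies a subgraph called All-for-One (AF1) that achieves high accuracy on a variety of mental math tasks, in which only the last token computes after receiving information from other tokens in specific middle layers.

Method: Proceeds in three steps: inhibit token-specific computation in the initial layers, restrict the routes of information transfer, and force all computation to happen at the last token. CAMA and ABP are used to identify the AF1 subgraph.

Result: Experiments show that the AF1 subgraph is sufficient and necessary for high performance across models and arithmetic expressions, transfers across models, and is robust to input style.

Insight: LLM computation on mental math tasks is concentrated at the last token, and information transfer in specific middle layers is key, which offers a new view of the internal mechanisms of LLMs.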

Abstract: Large language models (LLMs) demonstrate proficiency across numerous computational tasks, yet their inner workings remain unclear. In theory, the combination of causal self-attention and multilayer perceptron layers allows every token to access and compute information based on all preceding tokens. In practice, to what extent are such operations present? In this paper, on mental math tasks (i.e., direct math calculation via next-token prediction without explicit reasoning), we investigate this question in three steps: inhibiting input-specific token computations in the initial layers, restricting the routes of information transfer across token positions in the next few layers, and forcing all computation to happen at the last token in the remaining layers. With two proposed techniques, Context-Aware Mean Ablation (CAMA) and Attention-Based Peeking (ABP), we identify an All-for-One subgraph (AF1) with high accuracy on a wide variety of mental math tasks, where meaningful computation occurs very late (in terms of layer depth) and only at the last token, which receives information of other tokens in a few specific middle layers. Experiments on a variety of models and arithmetic expressions show that this subgraph is sufficient and necessary for high model performance, transfers across different models, and works on a variety of input styles. Ablations on different CAMA and ABP alternatives reveal their unique advantages over other methods, which may be of independent interest.

[17] CDE: Curiosity-Driven Exploration for Efficient Reinforcement Learning in Large Language Models

Runpeng Dai,Linfeng Song,Haolin Liu,Zhenwen Liang,Dian Yu,Haitao Mi,Zhaopeng Tu,Rui Liu,Tong Zheng,Hongtu Zhu,Dong Yu

Main category: cs.CL

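TL;DR: The paper proposes Curiosity-Driven Exploration (CDE), a framework that addresses inefficient exploration when training large language models (LLMs) with reinforcement learning. CDE uses generation perplexity and the variance of value estimates as intrinsic rewards, which improves diversity and robustness and yields measurable gains in experiments.

Motivation: Current reinforcement learning methods for LLMs, such as RLVR, explore poorly, which leads to premature convergence and entropy collapse and limits gains in reasoning ability. The authors propose a curiosity-driven exploration mechanism to address this.

Contribution: 1. The CDE framework, which uses the model's generation perplexity and the variance of its value estimates as intrinsic rewards to improve exploration.
2. A theoretical analysis showing that CDE penalizes overconfident errors and promotes diversity among correct responses.
3. Experiments showing that CDE improves on standard RLVR (GRPO/PPO) by about 3 points on AIME benchmarks.

Method: 1. Defines curiosity signals from generation perplexity (actor) and the variance of value estimates (critic).
2. Adds these signals to the RLVR framework as an exploration bonus.
3. Uses a multi-head architecture to compute the variance of value estimates, making exploration more robust.

Result: CDE improves on GRPO/PPO by about 3 points on AIME benchmarks. The analysis also identifies a calibration collapse mechanism within RLVR that sheds light on common LLM failure modes.

Insight: 1. Curiosity signals such as perplexity and value variance can improve exploration efficiency and response diversity.
2. Calibration collapse in RLVR may be one of the key bottlenecks for further performance gains.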

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) is a powerful paradigm for enhancing the reasoning ability of Large Language Models (LLMs). Yet current RLVR methods often explore poorly, leading to premature convergence and entropy collapse. To address this challenge, we introduce Curiosity-Driven Exploration (CDE), a framework that leverages the model’s own intrinsic sense of curiosity to guide exploration. We formalize curiosity with signals from both the actor and the critic: for the actor, we use perplexity over its generated response, and for the critic, we use the variance of value estimates from a multi-head architecture. Both signals serve as an exploration bonus within the RLVR framework to guide the model. Our theoretical analysis shows that the actor-wise bonus inherently penalizes overconfident errors and promotes diversity among correct responses; moreover, we connect the critic-wise bonus to the well-established count-based exploration bonus in RL. Empirically, our method achieves an approximate +3 point improvement over standard RLVR using GRPO/PPO on AIME benchmarks. Further analysis identifies a calibration collapse mechanism within RLVR, shedding light on common LLM failure modes.
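
The two curiosity signals in this abstract are simple to compute, as the sketch below shows: perplexity of the generated response for the actor, and the variance of value estimates across critic heads. How the paper scales, clips, or schedules these bonuses is not stated in the abstract, so the additive combination, the weights `w_actor` and `w_critic`, and the function names are assumptions for illustration.

```python
import math
from statistics import pvariance


def perplexity(token_logprobs):
    """Perplexity of a generated response from its per-token log-probabilities."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def critic_variance(head_values):
    """Population variance of the value estimates produced by the critic heads."""
    return pvariance(head_values)


def shaped_reward(task_reward, token_logprobs, head_values, w_actor=0.01, w_critic=0.01):
    """Verifiable task reward plus two exploration bonuses (weights are hypothetical)."""
    return (task_reward
            + w_actor * perplexity(token_logprobs)
            + w_critic * critic_variance(head_values))


# Hypothetical rollout: 4 tokens each sampled with probability 0.5, and 2 critic heads.
logprobs = [math.log(0.5)] * 4
heads = [1.0, 3.0]
reward = shaped_reward(1.0, logprobs, heads)
```

Higher perplexity means the actor found its own response surprising, and higher variance means the critic heads disagree; both are treated here as signs that the state is under-explored.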

cs.CV

[18] Recurrence Meets Transformers for Universal Multimodal Retrieval

Davide Caffagni,Sara Sarto,Marcella Cornia,Lorenzo Baraldi,Rita Cucchiara

Main category: cs.CV

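TL;DR: The paper proposes ReT-2, a unified retrieval model that supports multimodal queries. It combines a recurrent Transformer with LSTM-inspired gating to dynamically integrate information across layers and modalities and achieves state-of-the-art performance on multimodal retrieval tasks.

Motivation: Existing retrieval methods are limited to single-modality queries or documents, while increasingly complex multimodal retrieval tasks are emerging. The paper proposes a unified model that supports multimodal queries.

Contribution: ReT-2, a model that supports multimodal queries and retrieval and dynamically integrates information through multi-layer representations and a recurrent Transformer architecture, with clear performance gains.

Method: Combines multi-layer representations with a recurrent Transformer architecture and adds LSTM-inspired gating mechanisms to dynamically integrate information across layers and modalities.

Result: ReT-2 performs strongly on the M2KR and M-BEIR benchmarks with faster inference and lower memory usage, and it also improves downstream performance when used in retrieval-augmented generation pipelines.

Insight: Dynamically integrating information across layers and modalities is key to better multimodal retrieval, and the recurrent Transformer architecture performs well on such tasks.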

Abstract: With the rapid advancement of multimodal retrieval and its application in LLMs and multimodal LLMs, increasingly complex retrieval tasks have emerged. Existing methods predominantly rely on task-specific fine-tuning of vision-language models and are limited to single-modality queries or documents. In this paper, we propose ReT-2, a unified retrieval model that supports multimodal queries, composed of both images and text, and searches across multimodal document collections where text and images coexist. ReT-2 leverages multi-layer representations and a recurrent Transformer architecture with LSTM-inspired gating mechanisms to dynamically integrate information across layers and modalities, capturing fine-grained visual and textual details. We evaluate ReT-2 on the challenging M2KR and M-BEIR benchmarks across different retrieval configurations. Results demonstrate that ReT-2 consistently achieves state-of-the-art performance across diverse settings, while offering faster inference and reduced memory usage compared to prior approaches. When integrated into retrieval-augmented generation pipelines, ReT-2 also improves downstream performance on Encyclopedic-VQA and InfoSeek datasets. Our source code and trained models are publicly available at: https://github.com/aimagelab/ReT-2

[19] Diffusion-Based Action Recognition Generalizes to Untrained Domains

Rogerio Guimaraes,Frank Xiao,Pietro Perona,Markus Marks

Main category: cs.CV

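TL;DR: The paper proposes extracting features with a Vision Diffusion Model (VDM) and aggregating them with a transformer, achieving human-like generalization in action recognition, particularly on untrained domains that differ in animal species, viewpoint, and recording context.

Details Motivation: Humans recognize the same action despite differences in context and viewpoint, whereas current deep learning models struggle with this kind of generalization. The paper aims to improve generalization by using features extracted from diffusion models.

Contribution: (1) a method that extracts features with a VDM and aggregates them with a transformer; (2) conditioning the model on earlier timesteps of the diffusion process to emphasize semantic information over pixel-level detail; (3) new state-of-the-art results on action recognition across species, viewpoints, and recording contexts.

Method: Two steps: (1) generate features with the VDM, conditioned on earlier diffusion timesteps to highlight semantic information; (2) aggregate these features with a transformer.

Result: The method achieves the best performance on all three generalization benchmarks (cross-species, cross-viewpoint, and cross-context), improving generalization to untrained domains.

Insight: Features generated by diffusion models capture high-level semantic information well, and transformer-based aggregation further improves performance on cross-domain tasks.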

Abstract: Humans can recognize the same actions despite large context and viewpoint variations, such as differences between species (walking in spiders vs. horses), viewpoints (egocentric vs. third-person), and contexts (real life vs. movies). Current deep learning models struggle with such generalization. We propose using features generated by a Vision Diffusion Model (VDM), aggregated via a transformer, to achieve human-like action recognition across these challenging conditions. We find that generalization is enhanced by the use of a model conditioned on earlier timesteps of the diffusion process to highlight semantic information over pixel level details in the extracted features. We experimentally explore the generalization properties of our approach in classifying actions across animal species, across different viewing angles, and different recording contexts. Our model sets a new state-of-the-art across all three generalization benchmarks, bringing machine action recognition closer to human-like robustness. Project page: https://www.vision.caltech.edu/actiondiff/ Code: https://github.com/frankyaoxiao/ActionDiff

[20] PromptGuard: An Orchestrated Prompting Framework for Principled Synthetic Text Generation for Vulnerable Populations using LLMs with Enhanced Safety, Fairness, and Controllability

Tung Vu,Lam Nguyen,Quynh Dao

Main category: cs.CV

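TL;DR: This paper proposes PromptGuard, a modular prompting framework whose VulnGuard Prompt technique aims to proactively prevent LLMs from generating harmful text for vulnerable populations. It combines multi-objective optimization with ethical reasoning and reports a 25-30% analytical harm reduction.

Details Motivation: Large language models (LLMs) deployed in real-world applications may generate harmful, biased, or misleading information for vulnerable populations (e.g., LGBTQ+ individuals and single parents). Existing approaches mostly filter outputs after the fact and cannot prevent harm at the source.

Contribution: 1. VulnGuard Prompt, a technique that combines contrastive learning and ethical reasoning to proactively prevent harmful generation; 2. PromptGuard, a six-module framework for end-to-end harm control; 3. formal proofs demonstrating a 25-30% analytical harm reduction.

Method: 1. Build VulnGuard Prompt through contrastive learning on GitHub-sourced datasets; 2. combine few-shot learning, ethical chain-of-thought reasoning, and role prompting; 3. apply multi-objective optimization with Pareto optimality.

Result: Theoretical analysis indicates a 25-30% reduction in harmful outputs, supported by a theoretical validation framework built on GitHub-sourced datasets that is intended as a foundation for later empirical research.

Insight: 1. Preventing harmful generation at the source is more effective than post-hoc filtering; 2. combining ethical reasoning with data-driven methods can improve model safety; 3. the modular design suits practical deployment.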

Abstract: The proliferation of Large Language Models (LLMs) in real-world applications poses unprecedented risks of generating harmful, biased, or misleading information to vulnerable populations including LGBTQ+ individuals, single parents, and marginalized communities. While existing safety approaches rely on post-hoc filtering or generic alignment techniques, they fail to proactively prevent harmful outputs at the generation source. This paper introduces PromptGuard, a novel modular prompting framework with our breakthrough contribution: VulnGuard Prompt, a hybrid technique that prevents harmful information generation using real-world data-driven contrastive learning. VulnGuard integrates few-shot examples from curated GitHub repositories, ethical chain-of-thought reasoning, and adaptive role-prompting to create population-specific protective barriers. Our framework employs theoretical multi-objective optimization with formal proofs demonstrating 25-30% analytical harm reduction through entropy bounds and Pareto optimality. PromptGuard orchestrates six core modules: Input Classification, VulnGuard Prompting, Ethical Principles Integration, External Tool Interaction, Output Validation, and User-System Interaction, creating an intelligent expert system for real-time harm prevention. We provide comprehensive mathematical formalization including convergence proofs, vulnerability analysis using information theory, and theoretical validation framework using GitHub-sourced datasets, establishing mathematical foundations for systematic empirical research.

[21] Similarity-based Outlier Detection for Noisy Object Re-Identification Using Beta Mixtures

Waqar Ahmad,Evan Murphy,Vladimir A. Krylov

Main category: cs.CV

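TL;DR: The paper proposes Beta-SOD, a similarity-based outlier detection method built on a Beta mixture model, which uses statistical modeling to make object re-identification (Re-ID) more robust to label noise.

Details Motivation: Object Re-ID is highly sensitive to label noise, which significantly degrades performance. The paper addresses this problem.

Contribution: 1. Beta-SOD, which detects outliers with a Beta mixture model; 2. an identifiability result for mixtures of two Beta distributions; 3. a combination of several loss functions that jointly optimize feature-level similarity learning.

Method: A Siamese network trained with binary cross-entropy, contrastive, and cosine embedding losses, complemented by a two-component Beta mixture model over the cosine similarities of embedding pairs.

Result: Outperforms state-of-the-art methods on CUHK03, Market-1501, and VeRi-776 across noise levels of 10-30%.

Insight: Statistical modeling handles label noise effectively and offers a more robust approach to Re-ID.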

Abstract: Object re-identification (Re-ID) methods are highly sensitive to label noise, which typically leads to significant performance degradation. We address this challenge by reframing Re-ID as a supervised image similarity task and adopting a Siamese network architecture trained to capture discriminative pairwise relationships. Central to our approach is a novel statistical outlier detection (OD) framework, termed Beta-SOD (Beta mixture Similarity-based Outlier Detection), which models the distribution of cosine similarities between embedding pairs using a two-component Beta distribution mixture model. We establish a novel identifiability result for mixtures of two Beta distributions, ensuring that our learning task is well-posed. The proposed OD step complements the Re-ID architecture, which combines binary cross-entropy, contrastive, and cosine embedding losses that jointly optimize feature-level similarity learning. We demonstrate the effectiveness of Beta-SOD in de-noising and Re-ID tasks for person Re-ID, on the CUHK03 and Market-1501 datasets, and vehicle Re-ID, on the VeRi-776 dataset. Our method shows superior performance compared to the state-of-the-art methods across various noise levels (10-30%), demonstrating both robustness and broad applicability in noisy Re-ID scenarios. The implementation of Beta-SOD is available at: https://github.com/waqar3411/Beta-SOD
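To make the two-component Beta mixture concrete, the sketch below fits one to similarity scores with a simple EM loop. It is an illustration under stated assumptions rather than the Beta-SOD implementation: the M-step uses weighted method-of-moments updates instead of exact maximum likelihood, similarities are assumed to be already rescaled from [-1, 1] to (0, 1), and all function names and the initial parameters are hypothetical.

```python
import math

def beta_pdf(x, a, b):
    """Density of Beta(a, b) at x in (0, 1), computed in log space for stability."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x))

def moments_to_params(mean, var):
    """Method-of-moments estimate of (a, b) from a mean and variance."""
    var = max(min(var, mean * (1 - mean) * 0.999), 1e-6)  # keep a, b positive
    common = mean * (1 - mean) / var - 1
    return mean * common, (1 - mean) * common

def responsibilities(x, weights, params):
    """Posterior probability of each mixture component given similarity x."""
    p = [w * beta_pdf(x, a, b) for w, (a, b) in zip(weights, params)]
    total = sum(p) or 1e-300
    return [pi / total for pi in p]

def fit_beta_mixture(xs, iters=50, eps=1e-4):
    """EM with moment-matching M-step; component 0 starts low, component 1 high."""
    xs = [min(max(x, eps), 1 - eps) for x in xs]
    weights, params = [0.5, 0.5], [(2.0, 5.0), (5.0, 2.0)]
    for _ in range(iters):
        resp = [responsibilities(x, weights, params) for x in xs]
        for k in range(2):
            rk = [r[k] for r in resp]
            nk = sum(rk)
            if nk < 1e-8:
                continue  # component collapsed; keep its previous parameters
            mean = sum(r * x for r, x in zip(rk, xs)) / nk
            var = sum(r * (x - mean) ** 2 for r, x in zip(rk, xs)) / nk
            params[k] = moments_to_params(mean, var)
            weights[k] = nk / len(xs)
    return weights, params
```

One plausible use, again an assumption here, is to flag a pair labeled as the same identity as a likely label error when its similarity has a high posterior under the low-similarity component.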

[22] Discovering Divergent Representations between Text-to-Image Models

Lisa Dunlap,Joseph E. Gonzalez,Trevor Darrell,Fabian Caba Heilbron,Josef Sivic,Bryan Russell

Main category: cs.CV

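TL;DR: The paper studies when and how the visual representations learned by two text-to-image models diverge. It proposes CompCon, an evolutionary search algorithm that discovers visual attributes more prevalent in one model's outputs than the other's and uncovers the prompt concepts that trigger these differences.

Details Motivation: To examine how different generative models diverge in their visual representations, in order to better understand model behavior and potential biases.

Contribution: The CompCon algorithm for discovering divergent visual representations between models, and the ID2 dataset for evaluating it.

Method: The evolutionary search algorithm CompCon discovers differing visual attributes and the prompt concepts associated with them; an automated data generation pipeline produces ID2 for evaluation.

Result: CompCon finds divergent representations between popular models, for example PixArt depicting prompts that mention loneliness with wet streets.

Insight: Differences in visual representations between models can be tied to specific prompt concepts such as emotions, revealing potential biases in the models.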

Abstract: In this paper, we investigate when and how visual representations learned by two different generative models diverge. Given two text-to-image models, our goal is to discover visual attributes that appear in images generated by one model but not the other, along with the types of prompts that trigger these attribute differences. For example, “flames” might appear in one model’s outputs when given prompts expressing strong emotions, while the other model does not produce this attribute given the same prompts. We introduce CompCon (Comparing Concepts), an evolutionary search algorithm that discovers visual attributes more prevalent in one model’s output than the other, and uncovers the prompt concepts linked to these visual differences. To evaluate CompCon’s ability to find diverging representations, we create an automated data generation pipeline to produce ID2, a dataset of 60 input-dependent differences, and compare our approach to several LLM- and VLM-powered baselines. Finally, we use CompCon to compare popular text-to-image models, finding divergent representations such as how PixArt depicts prompts mentioning loneliness with wet streets and Stable Diffusion 3.5 depicts African American people in media professions. Code at: https://github.com/adobe-research/CompCon

[23] An U-Net-Based Deep Neural Network for Cloud Shadow and Sun-Glint Correction of Unmanned Aerial System (UAS) Imagery

Yibin Wang,Wondimagegn Beshah,Padmanava Dash,Haifeng Wang

Main category: cs.CV

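TL;DR: The paper proposes a U-Net-based deep learning method to correct cloud shadows and sun glint in unmanned aerial system (UAS) imagery, with the goal of improving water quality parameter estimation.

Details Motivation: UAS imagery acquired under clouds is often affected by cloud shadows, and images over water are additionally affected by sun glint. Both degrade image quality and hamper the estimation of water quality parameters, so an effective method is needed to identify and correct the affected regions.

Contribution: A U-Net-based deep learning model that identifies and recovers regions affected by cloud shadows and sun glint, improving image quality.

Method: Extract pixel-level data from the images, train a U-Net model, select the best training settings based on several evaluation metrics, and use the resulting model for image correction.

Result: The selected model recovers the cloud-shadow and sun-glint regions in the images.

Insight: Deep learning can handle complex interference in UAS imagery and offers a new tool for remote sensing applications.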

Abstract: The use of unmanned aerial systems (UASs) has increased tremendously in the current decade. They have significantly advanced remote sensing with the capability to deploy and image the terrain at the spatial, spectral, temporal, and radiometric resolutions required for various remote sensing applications. One of the major advantages of UAS imagery is that images can be acquired in cloudy conditions by flying the UAS under the clouds. A limitation of the technology is that the imagery is often sullied by cloud shadows. Images taken over water are additionally affected by sun glint. These two effects pose serious issues for estimating water quality parameters from the UAS images. This study proposes a novel machine learning approach first to identify and extract regions with cloud shadows and sun glint and separate such regions from non-obstructed clear sky regions and sun-glint unaffected regions. The data were extracted from the images at pixel level to train a U-Net based deep learning model, and the best settings for model training were identified based on the various evaluation metrics from test cases. Using this evaluation, a high-quality image correction model was determined, which was used to recover the cloud shadow and sun glint areas in the images.

[24] CoSwin: Convolution Enhanced Hierarchical Shifted Window Attention For Small-Scale Vision

Puskal Khadka,Rodrigue Rizk,Longwei Wang,KC Santosh

Main category: cs.CV

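TL;DR: CoSwin proposes a convolution-enhanced hierarchical shifted window attention architecture for small-scale vision tasks, with consistent gains over the baseline Swin Transformer.

Details Motivation: Vision Transformers excel at modeling global context but lack local feature extraction ability on small datasets. CoSwin addresses this by combining the local feature learning of convolutions with the global attention of Transformers.

Contribution: CoSwin integrates a learnable local feature enhancement module into each attention block, fusing local and global features.

Method: CoSwin combines hierarchical shifted window attention with convolutional modules to capture fine-grained spatial details and global semantic structure at the same time.

Result: Experiments show that CoSwin outperforms existing convolutional and Transformer-based models on multiple image classification benchmarks (CIFAR-10, CIFAR-100, MNIST, SVHN, and Tiny ImageNet).

Insight: Fusing local and global features improves the generalization and robustness of Transformers on small-scale vision tasks.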

Abstract: Vision Transformers (ViTs) have achieved impressive results in computer vision by leveraging self-attention to model long-range dependencies. However, their emphasis on global context often comes at the expense of local feature extraction in small datasets, particularly due to the lack of key inductive biases such as locality and translation equivariance. To mitigate this, we propose CoSwin, a novel feature-fusion architecture that augments the hierarchical shifted window attention with localized convolutional feature learning. Specifically, CoSwin integrates a learnable local feature enhancement module into each attention block, enabling the model to simultaneously capture fine-grained spatial details and global semantic structure. We evaluate CoSwin on multiple image classification benchmarks including CIFAR-10, CIFAR-100, MNIST, SVHN, and Tiny ImageNet. Our experimental results show consistent performance gains over state-of-the-art convolutional and transformer-based models. Notably, CoSwin achieves improvements of 2.17% on CIFAR-10, 4.92% on CIFAR-100, 0.10% on MNIST, 0.26% on SVHN, and 4.47% on Tiny ImageNet over the baseline Swin Transformer. These improvements underscore the effectiveness of local-global feature fusion in enhancing the generalization and robustness of transformers for small-scale vision. Code and pretrained weights available at https://github.com/puskal-khadka/coswin

[25] iMatcher: Improve matching in point cloud registration via local-to-global geometric consistency learning

Karim Slimani,Catherine Achard,Brahim Tamadazte

Main category: cs.CV

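TL;DR: iMatcher is a fully differentiable feature matching framework for point cloud registration. It improves matching through local-to-global geometric consistency learning and achieves state-of-the-art results on several datasets.

Details Motivation: Feature matching in point cloud registration often suffers from insufficient local and global geometric consistency, and traditional methods struggle to account for both. iMatcher aims to improve matching accuracy and robustness by learning local and global geometric consistency.

Contribution: 1. iMatcher, a fully differentiable framework that learns local and global geometric consistency to produce a confidence matrix. 2. A local graph embedding module and a bilateral repositioning step based on nearest neighbor search. 3. Validation on multiple datasets (KITTI, KITTI-360, 3DMatch, and others) with state-of-the-art registration performance.

Method: 1. Initialize the score matrix with a local graph embedding module. 2. Refine the matrix through bilateral source-to-target and target-to-source matching. 3. Stack the paired features and predict point-wise matching probabilities through global geometric consistency learning.

Result: State-of-the-art inlier ratios of 95%-97% on KITTI, 94%-97% on KITTI-360, and up to 81.1% on 3DMatch.

Insight: Combining local and global geometric consistency is key to point cloud registration performance; the learned features are more robust and apply across diverse settings.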

Abstract: This paper presents iMatcher, a fully differentiable framework for feature matching in point cloud registration. The proposed method leverages learned features to predict a geometrically consistent confidence matrix, incorporating both local and global consistency. First, a local graph embedding module leads to an initialization of the score matrix. A subsequent repositioning step refines this matrix by considering bilateral source-to-target and target-to-source matching via nearest neighbor search in 3D space. The paired point features are then stacked together to be refined through global geometric consistency learning to predict a point-wise matching probability. Extensive experiments on real-world outdoor (KITTI, KITTI-360) and indoor (3DMatch) datasets, as well as on 6-DoF pose estimation (TUD-L) and partial-to-partial matching (MVP-RG), demonstrate that iMatcher significantly improves rigid registration performance. The method achieves state-of-the-art inlier ratios, scoring 95% - 97% on KITTI, 94% - 97% on KITTI-360, and up to 81.1% on 3DMatch, highlighting its robustness across diverse settings.
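The bilateral source-to-target and target-to-source check can be illustrated with a mutual nearest neighbor filter in 3D. This sketch only shows the consistency idea on raw coordinates; in the paper the step refines a learned score matrix, which is not reproduced here, and the function names are hypothetical.

```python
def nearest_index(point, candidates):
    """Index of the candidate closest to `point` (squared Euclidean distance)."""
    return min(
        range(len(candidates)),
        key=lambda j: sum((a - b) ** 2 for a, b in zip(point, candidates[j])),
    )

def mutual_nearest_neighbors(source, target):
    """Keep pairs (i, j) where j is i's nearest target AND i is j's nearest source."""
    src_to_tgt = [nearest_index(p, target) for p in source]
    tgt_to_src = [nearest_index(q, source) for q in target]
    return [(i, j) for i, j in enumerate(src_to_tgt) if tgt_to_src[j] == i]

# Usage: the third source point has no consistent partner and is dropped.
source = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (5.0, 5.0, 5.0)]
target = [(0.1, 0.0, 0.0), (1.1, 0.0, 0.0)]
pairs = mutual_nearest_neighbors(source, target)
```

The brute-force search is quadratic in the number of points; a real pipeline would likely use a spatial index such as a k-d tree.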

[26] COCO-Urdu: A Large-Scale Urdu Image-Caption Dataset with Multimodal Quality Estimation

Umair Hassan

Main category: cs.CV

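TL;DR: The paper presents COCO-Urdu, a large-scale Urdu image-caption dataset derived from MS COCO. It aims to fill the gap for Urdu in multimodal research and uses a multimodal quality estimation framework to safeguard translation quality and semantic consistency.

Details Motivation: Urdu, spoken by over 250 million people, has long been neglected in multimodal and vision-language research. The lack of high-quality Urdu datasets limits the development of Urdu-capable systems and reinforces the bias of multilingual models toward high-resource languages. The author aims to reduce language bias in multimodal research with COCO-Urdu.

Contribution: 1) COCO-Urdu, to the author's knowledge the largest publicly available Urdu image-caption dataset; 2) a multimodal quality estimation framework that combines translation quality, visual grounding, and semantic consistency; 3) iterative refinement of low-scoring captions with open-source LLMs.

Method: The dataset contains 59,000 images and 319,000 Urdu captions selected from MS COCO by stratified sampling. Captions are translated with SeamlessM4T v2, then validated and refined with a hybrid framework (COMET-Kiwi, CLIP-based similarity, and BERTScore with back-translation).

Result: COCO-Urdu reports consistently strong results on BLEU, SacreBLEU, and chrF.

Insight: A multimodal quality estimation framework can raise translation quality for low-resource languages and reduce reliance on manual annotation. Releasing the data and tools provides a foundation for inclusive vision-language systems.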

Abstract: Urdu, spoken by over 250 million people, remains critically under-served in multimodal and vision-language research. The absence of large-scale, high-quality datasets has limited the development of Urdu-capable systems and reinforced biases in multilingual vision-language models trained primarily on high-resource languages. To address this gap, we present COCO-Urdu, a large-scale image-caption dataset derived from MS COCO, containing 59,000 images and 319,000 Urdu captions selected through stratified sampling to preserve the original distribution. Captions were translated using SeamlessM4T v2 and validated with a hybrid multimodal quality estimation framework that integrates COMET-Kiwi for translation quality, CLIP-based similarity for visual grounding, and BERTScore with back-translation for semantic consistency; low-scoring captions were iteratively refined using open-source large language models. We further benchmark COCO-Urdu on BLEU, SacreBLEU, and chrF, reporting consistently strong results. To the best of our knowledge, COCO-Urdu is the largest publicly available Urdu captioning dataset. By releasing both the dataset and the quality estimation pipeline, we aim to reduce language bias in multimodal research and establish a foundation for inclusive vision-language systems.
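A hybrid quality score of this kind might be combined as a weighted average, with low-scoring captions sent back for refinement. The abstract does not say how the three signals are combined, so the weights, the 0.7 threshold, the assumption that every score is already scaled to [0, 1], and the function names below are all hypothetical.

```python
def hybrid_score(translation_quality, visual_grounding, semantic_consistency,
                 weights=(0.4, 0.3, 0.3)):
    """Weighted average of three quality signals, each assumed to lie in [0, 1].

    translation_quality:  e.g. a COMET-Kiwi style score
    visual_grounding:     e.g. a CLIP image-text similarity
    semantic_consistency: e.g. BERTScore against a back-translation
    The weights are illustrative, not taken from the paper.
    """
    w_t, w_v, w_s = weights
    total = w_t + w_v + w_s
    return (w_t * translation_quality
            + w_v * visual_grounding
            + w_s * semantic_consistency) / total

def select_for_refinement(captions, threshold=0.7):
    """Return ids of captions whose hybrid score falls below the threshold.

    `captions` maps a caption id to a (translation, grounding, consistency) tuple.
    """
    return [cid for cid, scores in captions.items()
            if hybrid_score(*scores) < threshold]
```

Captions returned by `select_for_refinement` would be the candidates for another rewriting pass by an LLM, followed by rescoring.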

[27] VoxelFormer: Parameter-Efficient Multi-Subject Visual Decoding from fMRI

Chenqian Le,Yilin Zhao,Nikasadat Emami,Kushagra Yadav,Xujin “Chris” Liu,Xupeng Chen,Yao Wang

Main category: cs.CV

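TL;DR: VoxelFormer is a lightweight Transformer architecture for multi-subject visual decoding from fMRI. It gains parameter efficiency from a Token Merging Transformer and a Q-Former, and performs competitively on the 7T Natural Scenes Dataset.

Details Motivation: Most existing fMRI visual decoding methods are trained per subject, which limits scalability and practical deployment. VoxelFormer addresses this by enabling multi-subject training.

Contribution: 1. VoxelFormer, which supports multi-subject fMRI visual decoding; 2. a Token Merging Transformer (ToMer) and a Q-Former that improve parameter efficiency; 3. evaluation on the 7T Natural Scenes Dataset.

Method: Combines ToMer for efficient voxel compression with a Q-Former that produces fixed-size neural representations aligned with the CLIP image embedding space.

Result: Competitive retrieval performance on subjects included during training, with significantly fewer parameters than existing methods.

Insight: Token merging and query-based Transformers are promising strategies for parameter-efficient neural decoding.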

Abstract: Recent advances in fMRI-based visual decoding have enabled compelling reconstructions of perceived images. However, most approaches rely on subject-specific training, limiting scalability and practical deployment. We introduce VoxelFormer, a lightweight transformer architecture that enables multi-subject training for visual decoding from fMRI. VoxelFormer integrates a Token Merging Transformer (ToMer) for efficient voxel compression and a query-driven Q-Former that produces fixed-size neural representations aligned with the CLIP image embedding space. Evaluated on the 7T Natural Scenes Dataset, VoxelFormer achieves competitive retrieval performance on subjects included during training with significantly fewer parameters than existing methods. These results highlight token merging and query-based transformers as promising strategies for parameter-efficient neural decoding.

[28] Integrating Anatomical Priors into a Causal Diffusion Model

Binxu Li,Wei Peng,Mingjie Li,Ehsan Adeli,Kilian M. Pohl

Main category: cs.CV

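TL;DR: The paper proposes PCGM, a method that integrates anatomical priors into a causal diffusion model to generate high-quality, anatomically plausible brain MRIs, addressing the difficulty existing methods have in preserving medically relevant local details.

Details Motivation: Existing counterfactual models fail to preserve subtle anatomical details when generating brain MRIs, which limits the usefulness of the generated images in medical research. The paper aims to improve anatomical plausibility and medical value by introducing anatomical priors.

Contribution: 1. The PCGM framework, which explicitly integrates anatomical priors through a probabilistic graph module and spatial masks.
2. A novel counterfactual denoising UNet architecture, combined with a 3D diffusion decoder to generate high-quality MRIs.
3. The first demonstration that generated counterfactual images replicate the subtle effects of a real disease on cortical brain regions.

Method: 1. A probabilistic graph module captures anatomical constraints and produces spatial binary masks marking regions where subtle variations occur.
2. A 3D ControlNet encodes the masks and constrains the output of the denoising UNet.
3. A 3D diffusion decoder generates high-quality MRIs.

Result: Experiments on multiple datasets show that PCGM generates brain MRIs of higher quality than several baseline approaches and replicates the subtle morphological effects of a real disease on the cortex.

Insight: Explicitly introducing anatomical priors lets generative models better preserve medically relevant local details, providing a more practical tool for brain MRI research.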

Abstract: 3D brain MRI studies often examine subtle morphometric differences between cohorts that are hard to detect visually. Given the high cost of MRI acquisition, these studies could greatly benefit from image synthesis, particularly counterfactual image generation, as seen in other domains, such as computer vision. However, counterfactual models struggle to produce anatomically plausible MRIs due to the lack of explicit inductive biases to preserve fine-grained anatomical details. This shortcoming arises because the models are trained to optimize for the overall appearance of the images (e.g., via cross-entropy) rather than to preserve subtle, yet medically relevant, local variations across subjects. To preserve subtle variations, we propose to explicitly integrate anatomical constraints at the voxel level as a prior into a generative diffusion framework. Called Probabilistic Causal Graph Model (PCGM), the approach captures anatomical constraints via a probabilistic graph module and translates those constraints into spatial binary masks of regions where subtle variations occur. The masks (encoded by a 3D extension of ControlNet) constrain a novel counterfactual denoising UNet, whose encodings are then transferred into high-quality brain MRIs via our 3D diffusion decoder. Extensive experiments on multiple datasets demonstrate that PCGM generates structural brain MRIs of higher quality than several baseline approaches. Furthermore, we show for the first time that brain measurements extracted from counterfactuals (generated by PCGM) replicate the subtle effects of a disease on cortical brain regions previously reported in the neuroscience literature. This achievement is an important milestone in the use of synthetic MRIs in studies investigating subtle morphological differences.

[29] Enhancing 3D Medical Image Understanding with Pretraining Aided by 2D Multimodal Large Language Models

Qiuhui Chen,Xuancheng Yao,Huping Ye,Yi Hong

Main category: cs.CV

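TL;DR: The paper proposes Med3DInsight, a pretraining framework that integrates 3D image encoders with 2D multimodal large language models (MLLMs) through a plane-slice-aware transformer module. It deepens the semantic understanding of 3D medical images without human annotations and performs strongly on segmentation and classification.

Details Motivation: Existing self-supervised learning methods for 3D medical images lack deep semantic comprehension, while 2D MLLMs make it possible to enhance image understanding through text descriptions. The paper aims to use these advances to improve 3D medical image understanding.

Contribution: 1. The Med3DInsight framework, which combines 3D image encoders with 2D MLLMs; 2. a plane-slice-aware transformer module and an alignment method based on partial optimal transport; 3. multimodal 3D medical representation learning without human annotations.

Method: 1. Integrate 3D image encoders with 2D MLLMs; 2. use a plane-slice-aware transformer module to capture cross-modal information; 3. apply partial optimal transport alignment, which is more tolerant of noise in LLM-generated content.

Result: On segmentation and classification tasks across several public datasets (CT and MRI modalities), Med3DInsight outperforms existing self-supervised learning methods and achieves state-of-the-art performance.

Insight: 1. 2D MLLMs can effectively assist 3D medical image understanding; 2. multimodal representation learning is possible without human annotations; 3. the framework can be integrated into existing 3D medical image understanding networks.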

Abstract: Understanding 3D medical image volumes is critical in the medical field, yet existing 3D medical convolution and transformer-based self-supervised learning (SSL) methods often lack deep semantic comprehension. Recent advancements in multimodal large language models (MLLMs) provide a promising approach to enhance image understanding through text descriptions. To leverage these 2D MLLMs for improved 3D medical image understanding, we propose Med3DInsight, a novel pretraining framework that integrates 3D image encoders with 2D MLLMs via a specially designed plane-slice-aware transformer module. Additionally, our model employs a partial optimal transport based alignment, demonstrating greater tolerance to noise in LLM-generated content. Med3DInsight introduces a new paradigm for scalable multimodal 3D medical representation learning without requiring human annotations. Extensive experiments demonstrate our state-of-the-art performance on two downstream tasks, i.e., segmentation and classification, across various public datasets with CT and MRI modalities, outperforming current SSL methods. Med3DInsight can be seamlessly integrated into existing 3D medical image understanding networks, potentially enhancing their performance. Our source code, generated datasets, and pre-trained models will be available at https://github.com/Qybc/Med3DInsight.

[30] Improvement of Human-Object Interaction Action Recognition Using Scene Information and Multi-Task Learning Approach

Hesham M. Shehata,Mohammad Abdolrahmani

Main category: cs.CV

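TL;DR: The paper proposes a method that combines scene information with multi-task learning to improve human-object interaction action recognition.

Details Motivation: Graph convolutional neural networks (GCNs) perform well in human action recognition, but they lack an effective representation of scene information and struggle to recognize interactions between people and fixed objects.

Contribution: A method that combines scene information about fixed objects with multi-task learning, improving recognition accuracy for both interaction and non-interaction actions.

Method: A multi-task learning framework that incorporates interaction area information, validated on a dataset built from real data collected in public environments.

Result: The method reaches 99.25% accuracy on the studied interaction and non-interaction actions, 2.75% higher than the base model that uses only human skeleton poses.

Insight: Scene information and multi-task learning compensate for the weaknesses of skeleton-only action recognition, especially when people interact with fixed objects.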

Abstract: Recent graph convolutional neural networks (GCNs) have shown high performance in the field of human action recognition by using human skeleton poses. However, they fail to detect human-object interaction cases successfully due to the lack of effective representation of the scene information and appropriate learning architectures. In this context, we propose a methodology to improve human action recognition performance by considering fixed object information in the environment and following a multi-task learning approach. In order to evaluate the proposed method, we collected real data from public environments and prepared our data set, which includes interaction classes of hands-on fixed objects (e.g., ATM ticketing machines, check-in/out machines, etc.) and non-interaction classes of walking and standing. The multi-task learning approach, along with interaction area information, succeeds in recognizing the studied interaction and non-interaction actions with an accuracy of 99.25%, outperforming the accuracy of the base model using only human skeleton poses by 2.75%.

[31] SQAP-VLA: A Synergistic Quantization-Aware Pruning Framework for High-Performance Vision-Language-Action Models

Hengyu Fang,Yijiang Liu,Yuan Du,Li Du,Huanrui Yang

Main category: cs.CV

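TL;DR: SQAP-VLA is a training-free inference acceleration framework that combines quantization and token pruning. It is the first to co-design the two, resolving the incompatibility between them observed in existing VLA models.

Details Motivation: VLA models are hard to deploy in practice because of their computational and memory costs, and existing methods cannot apply effective quantization and token pruning at the same time.

Contribution: SQAP-VLA, the first structured, training-free VLA inference acceleration framework, which co-designs the quantization and token pruning pipeline.

Method: Proposes quantization-aware token pruning criteria and improves the quantizer design to make pruning more effective.

Result: On standard VLA models, SQAP-VLA achieves a 1.93× speedup and up to a 4.5% improvement in average success rate over the original model.

Insight: Co-designing quantization and pruning is key to improving VLA efficiency, and the framework can be deployed without additional training.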

Abstract: Vision-Language-Action (VLA) models exhibit unprecedented capabilities for embodied intelligence. However, their extensive computational and memory costs hinder their practical deployment. Existing VLA compression and acceleration approaches conduct quantization or token pruning in an ad-hoc manner but fail to enable both for a holistic efficiency improvement due to an observed incompatibility. This work introduces SQAP-VLA, the first structured, training-free VLA inference acceleration framework that simultaneously enables state-of-the-art quantization and token pruning. We overcome the incompatibility by co-designing the quantization and token pruning pipeline, where we propose new quantization-aware token pruning criteria that work on an aggressively quantized model while improving the quantizer design to enhance pruning effectiveness. When applied to standard VLA models, SQAP-VLA yields significant gains in computational efficiency and inference speed while successfully preserving core model performance, achieving a 1.93× speedup and up to a 4.5% average success rate enhancement compared to the original model.
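A toy example helps show why the two techniques interact: if token importance scores are computed by a quantized model, the pruning rule sees rounded scores, not the full-precision ones. The sketch below uses symmetric uniform fake-quantization and a plain top-k rule. Neither is claimed to be SQAP-VLA's quantizer or pruning criterion; the function names, bit width, and keep ratio are assumptions for illustration.

```python
import math

def quantize(values, bits=4):
    """Symmetric uniform fake-quantization: snap to a signed grid, then dequantize."""
    qmax = 2 ** (bits - 1) - 1
    scale = (max(abs(v) for v in values) / qmax) or 1.0  # avoid a zero scale
    return [round(v / scale) * scale for v in values]

def prune_tokens(token_scores, keep_ratio=0.5):
    """Keep the highest-scoring tokens; return their indices in original order."""
    keep = max(1, math.ceil(len(token_scores) * keep_ratio))
    ranked = sorted(range(len(token_scores)),
                    key=lambda i: token_scores[i], reverse=True)
    return sorted(ranked[:keep])

# Usage: rank tokens by their quantized importance scores.
scores = [0.9, 0.1, 0.5, 0.05]
kept = prune_tokens(quantize(scores, bits=4), keep_ratio=0.5)
```

With only a few bits, small scores collapse onto the same grid point (here 0.05 becomes 0), so a criterion designed for full-precision scores can lose the ability to separate tokens; that loss of resolution is the kind of incompatibility a co-design would need to address.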

[32] FPI-Det: a face–phone Interaction Dataset for phone-use detection and understanding

Jianqin Gao,Tianqi Wang,Yu Zhang,Yishu Zhang,Chenyuan Wang,Allan Dong,Zihao Wang

Main category: cs.CV

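TL;DR: This paper introduces FPI-Det, a dataset for detecting and understanding phone use. It fills the gap left by generic benchmarks, which lack fine-grained face-phone interaction data, and provides baseline detection results and analysis.

Details Motivation: The spread of mobile devices creates a need to detect phone use in settings such as safety monitoring and the workplace, but existing datasets do not adequately capture fine-grained face-phone interactions.

Contribution: The FPI-Det dataset, containing 22,879 images with synchronized annotations for faces and phones across diverse scenarios, together with an evaluation of baseline detectors.

Method: Uses YOLO and DETR detectors as baselines and analyzes their performance across object sizes, occlusion levels, and environments.

Result: Provides a performance analysis of the detectors as a baseline for future research.

Insight: FPI-Det fills a gap in fine-grained interaction datasets and provides a useful resource for research on phone-use behavior.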

Abstract: The widespread use of mobile devices has created new challenges for vision systems in safety monitoring, workplace productivity assessment, and attention management. Detecting whether a person is using a phone requires not only object recognition but also an understanding of behavioral context, which involves reasoning about the relationship between faces, hands, and devices under diverse conditions. Existing generic benchmarks do not fully capture such fine-grained human–device interactions. To address this gap, we introduce FPI-Det, containing 22,879 images with synchronized annotations for faces and phones across workplace, education, transportation, and public scenarios. The dataset features extreme scale variation, frequent occlusions, and varied capture conditions. We evaluate representative YOLO and DETR detectors, providing baseline results and an analysis of performance across object sizes, occlusion levels, and environments. Source code and dataset are available at https://github.com/KvCgRv/FPI-Det.

[33] Zero-shot Hierarchical Plant Segmentation via Foundation Segmentation Models and Text-to-image Attention

Junhao Xing,Ryohei Miyakawa,Yang Yang,Xinpeng Liu,Risa Shinoda,Hiroaki Santo,Yosuke Toda,Fumio Okura

Main category: cs.CV

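TL;DR: Proposes ZeroPlantSeg, a zero-shot hierarchical plant segmentation method that combines a foundation segmentation model with a vision-language model to segment individual plants from top-view images without additional training.

Details Motivation: Foundation segmentation models can perform zero-shot leaf instance segmentation, but the more complex hierarchical segmentation task still requires annotated datasets. The paper proposes a training-free zero-shot method to address this.

Contribution: ZeroPlantSeg, which combines a foundation segmentation model with a vision-language model for zero-shot hierarchical plant segmentation. It surpasses existing zero-shot methods and achieves better cross-domain performance than supervised methods.

Method: A foundation segmentation model extracts leaf instances, and a vision-language model reasons about plant structure to segment individual plants.

Result: The method's advantages are shown on datasets covering multiple plant species, growth stages, and shooting environments.

Insight: Combining the capabilities of several models enables zero-shot segmentation for complex tasks without additional annotated data.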

Abstract: Foundation segmentation models achieve reasonable leaf instance extraction from top-view crop images without training (i.e., zero-shot). However, segmenting entire plant individuals, each consisting of multiple overlapping leaves, remains challenging. This problem is referred to as a hierarchical segmentation task, typically requiring annotated training datasets, which are often species-specific and require notable human labor. To address this, we introduce ZeroPlantSeg, a zero-shot segmentation method for rosette-shaped plant individuals from top-view images. We integrate a foundation segmentation model, extracting leaf instances, and a vision-language model, reasoning about plants’ structures to extract plant individuals without additional training. Evaluations on datasets with multiple plant species, growth stages, and shooting environments demonstrate that our method surpasses existing zero-shot methods and achieves better cross-domain performance than supervised methods. Implementations are available at https://github.com/JunhaoXing/ZeroPlantSeg.

[34] Gradient-Attention Guided Dual-Masking Synergetic Framework for Robust Text-based Person Retrieval

Tianlu Zheng,Yifan Zhang,Xiang An,Ziyong Feng,Kaicheng Yang,Qichuan Ding

Main category: cs.CV

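TL;DR: This paper proposes GA-DMS, a CLIP-based text-to-person retrieval framework. Through efficient data construction and architectural improvements it addresses data scarcity and noisy text, achieving state-of-the-art performance.

Details Motivation: CLIP models face data scarcity and noisy text in person retrieval, and need improvement to strengthen fine-grained matching.

Contribution: 1. WebPerson, a dataset of 5M high-quality person-centric image-text pairs; 2. the GA-DMS framework, which improves cross-modal alignment through a gradient-attention guided dual-masking mechanism.

Method: 1. Use MLLMs to automatically filter and caption web-sourced images, yielding a noise-resistant dataset; 2. GA-DMS combines gradient-attention guided masking with masked token prediction objectives to improve fine-grained learning.

Result: GA-DMS achieves state-of-the-art performance on multiple benchmarks.

Insight: Gradient-attention guidance and masked token prediction improve robustness to noise and fine-grained semantic representation.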

Abstract: Although Contrastive Language-Image Pre-training (CLIP) exhibits strong performance across diverse vision tasks, its application to person representation learning faces two critical challenges: (i) the scarcity of large-scale annotated vision-language data focused on person-centric images, and (ii) the inherent limitations of global contrastive learning, which struggles to maintain discriminative local features crucial for fine-grained matching while remaining vulnerable to noisy text tokens. This work advances CLIP for person representation learning through synergistic improvements in data curation and model architecture. First, we develop a noise-resistant data construction pipeline that leverages the in-context learning capabilities of MLLMs to automatically filter and caption web-sourced images. This yields WebPerson, a large-scale dataset of 5M high-quality person-centric image-text pairs. Second, we introduce the GA-DMS (Gradient-Attention Guided Dual-Masking Synergetic) framework, which improves cross-modal alignment by adaptively masking noisy textual tokens based on the gradient-attention similarity score. Additionally, we incorporate masked token prediction objectives that compel the model to predict informative text tokens, enhancing fine-grained semantic representation learning. Extensive experiments show that GA-DMS achieves state-of-the-art performance across multiple benchmarks.

[35] ALL-PET: A Low-resource and Low-shot PET Foundation Model in the Projection Domain

Bin Huang,Kang Chen,Bingxuan Li,Huafeng Liu,Qiegen Liu

Main category: cs.CV

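TL;DR: ALL-PET is a low-resource, low-shot positron emission tomography (PET) foundation model that operates directly in the projection domain. It uses a latent diffusion model (LDM) with new data augmentation and attention mechanisms to perform PET imaging tasks with high quality.

Details Motivation: Building large-scale foundation models for PET imaging is hindered by scarce labeled data and limited computational resources. The authors aim for a solution that performs PET imaging tasks efficiently under low-resource, low-shot conditions.

Contribution: 1. A Radon mask augmentation strategy (RMAS) and a dynamic multi-mask (DMM) mechanism that increase data diversity; 2. positive/negative mask constraints that embed geometric consistency and reduce the parameter burden; 3. transparent medical attention (TMA), which strengthens the focus on lesion-related regions.

Method: A latent diffusion model (LDM) combined with RMAS and DMM to generate diverse training samples, with TMA deriving lesion attention maps from coarse segmentation and projecting them into sinogram space for physically consistent guidance.

Result: ALL-PET generates high-quality projection data using only 500 samples, with performance comparable to models trained on larger datasets and memory use under 24GB.

Insight: Geometry-driven, physically consistent attention enables PET imaging models to generalize under low-resource conditions, offering an efficient approach to medical image analysis.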

Abstract: Building large-scale foundation models for PET imaging is hindered by limited access to labeled data and insufficient computational resources. To overcome data scarcity and efficiency limitations, we propose ALL-PET, a low-resource, low-shot PET foundation model operating directly in the projection domain. ALL-PET leverages a latent diffusion model (LDM) with three key innovations. First, we design a Radon mask augmentation strategy (RMAS) that generates over 200,000 structurally diverse training samples by projecting randomized image-domain masks into sinogram space, significantly improving generalization with minimal data. This is extended by a dynamic multi-mask (DMM) mechanism that varies mask quantity and distribution, enhancing data diversity without added model complexity. Second, we implement positive/negative mask constraints to embed strict geometric consistency, reducing parameter burden while preserving generation quality. Third, we introduce transparent medical attention (TMA), a parameter-free, geometry-driven mechanism that enhances lesion-related regions in raw projection data. Lesion-focused attention maps are derived from coarse segmentation, covering both hypermetabolic and hypometabolic areas, and projected into sinogram space for physically consistent guidance. The system supports clinician-defined ROI adjustments, ensuring flexible, interpretable, and task-adaptive emphasis aligned with PET acquisition physics. Experimental results show ALL-PET achieves high-quality sinogram generation using only 500 samples, with performance comparable to models trained on larger datasets. ALL-PET generalizes across tasks including low-dose reconstruction, attenuation correction, delayed-frame prediction, and tracer separation, operating efficiently with memory use under 24GB.
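Projecting an image-domain mask into sinogram space can be sketched with a crude parallel-beam projection that bins each foreground pixel onto a detector axis. This is only an illustration of the idea behind RMAS, not the paper's procedure: the disc-shaped mask, the nearest-bin projection (no interpolation), and every function name and parameter are assumptions.

```python
import math
import random

def random_disc_mask(size, rng):
    """Binary mask with one randomly placed disc (a stand-in for a randomized mask)."""
    cy = rng.uniform(0.25, 0.75) * size
    cx = rng.uniform(0.25, 0.75) * size
    radius = rng.uniform(0.05, 0.2) * size
    return [[1 if math.hypot(y - cy, x - cx) <= radius else 0
             for x in range(size)] for y in range(size)]

def project_mask(mask, angles_deg):
    """Rough parallel-beam projection: one sinogram row per angle.

    Each foreground pixel is added to the detector bin nearest to its
    signed distance along the detector axis for that angle.
    """
    h, w = len(mask), len(mask[0])
    cy, cx = (h - 1) / 2, (w - 1) / 2
    n_bins = math.ceil(math.hypot(h, w)) + 1
    half = (n_bins - 1) / 2
    sinogram = []
    for angle in angles_deg:
        c, s = math.cos(math.radians(angle)), math.sin(math.radians(angle))
        row = [0.0] * n_bins
        for y in range(h):
            for x in range(w):
                if mask[y][x]:
                    t = (x - cx) * c + (y - cy) * s
                    row[math.floor(t + half + 0.5)] += mask[y][x]
        sinogram.append(row)
    return sinogram

# Usage: a seeded random mask projected at a few angles.
mask = random_disc_mask(16, random.Random(0))
sinogram = project_mask(mask, [0, 45, 90, 135])
```

Because every foreground pixel lands in exactly one bin, each sinogram row sums to the mask area, which gives a quick sanity check on the projection.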

[36] Noise-Robust Topology Estimation of 2D Image Data via Neural Networks and Persistent Homology

Dylan Peek,Matthew P. Skerritt,Stephan Chalup

Main category: cs.CV

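TL;DR: This paper studies the noise robustness of neural networks that predict Betti numbers in 2D binary images and compares them with a persistent homology (PH) approach, finding that the neural networks perform better under noise, likely because they can learn contextual and geometric priors from data.

Details Motivation: Persistent homology and neural networks are two contrasting approaches to inferring topological structure. The authors examine how robust neural networks are when predicting topology under noise and compare them with a conventional PH-based method.

Contribution: Proposes predicting topological structure with a neural network and shows experimentally that it outperforms the conventional PH approach under noise.

Method: Trains a supervised neural network to predict Betti numbers on synthetic and real-world datasets, and compares it against a PH pipeline based on cubical complexes and the SEDT.

Result: Experiments show that the neural network performs better under noise, possibly because it can learn contextual and geometric priors.

Insight: Neural networks offer a promising alternative for topology estimation under noise, although the area is still emerging.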

Abstract: Persistent Homology (PH) and Artificial Neural Networks (ANNs) offer contrasting approaches to inferring topological structure from data. In this study, we examine the noise robustness of a supervised neural network trained to predict Betti numbers in 2D binary images. We compare an ANN approach against a PH pipeline based on cubical complexes and the Signed Euclidean Distance Transform (SEDT), which is a widely adopted strategy for noise-robust topological analysis. Using one synthetic and two real-world datasets, we show that ANNs can outperform this PH approach under noise, likely due to their capacity to learn contextual and geometric priors from training data. Though still emerging, the use of ANNs for topology estimation offers a compelling alternative to PH under structural noise.
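To make the prediction target concrete, the following is a minimal sketch of what the Betti numbers of a noise-free 2D binary image are: b0 counts connected foreground components and b1 counts enclosed holes. It is neither the paper's ANN nor its PH/SEDT pipeline. The function names are hypothetical, and the connectivity convention (8-connected foreground, 4-connected background) is an assumption chosen to match the usual pixels-as-squares cubical reading.

```python
from collections import deque

N4 = [(-1, 0), (1, 0), (0, -1), (0, 1)]
N8 = N4 + [(-1, -1), (-1, 1), (1, -1), (1, 1)]

def count_components(grid, value, neighbors):
    """Count connected components of cells equal to `value` using BFS."""
    rows, cols = len(grid), len(grid[0])
    seen = [[False] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != value or seen[r][c]:
                continue
            count += 1
            seen[r][c] = True
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for dy, dx in neighbors:
                    ny, nx = y + dy, x + dx
                    inside = 0 <= ny < rows and 0 <= nx < cols
                    if inside and grid[ny][nx] == value and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
    return count

def betti_numbers(image):
    """Return (b0, b1) for a non-empty 2D binary image given as rows of 0/1."""
    # Pad with background so the outer background is a single component.
    width = len(image[0]) + 2
    padded = [[0] * width] + [[0] + list(row) + [0] for row in image] + [[0] * width]
    b0 = count_components(padded, 1, N8)      # foreground components (assumed 8-connectivity)
    b1 = count_components(padded, 0, N4) - 1  # background components minus the outer one
    return b0, b1

ring = [[1, 1, 1],
        [1, 0, 1],
        [1, 1, 1]]
two_blobs = [[1, 0, 0],
             [0, 0, 0],
             [0, 0, 1]]
print(betti_numbers(ring), betti_numbers(two_blobs))
```

Flipping a single pixel in `ring` (for example, setting the center to 1) changes b1, which is the kind of sensitivity to noise that motivates both the SEDT-based PH pipeline and the learned predictor compared in the paper.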

[37] Objectness Similarity: Capturing Object-Level Fidelity in 3D Scene Evaluation

Yuiko Uchida,Ren Togo,Keisuke Maeda,Takahiro Ogawa,Miki Haseyama

Main category: cs.CV

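TL;DR: This paper proposes Objectness SIMilarity (OSIM), an evaluation metric for 3D scenes that focuses on "objects". It uses an object detection model to quantify the "objectness" of each object in a scene and aligns more closely with human perception.

Details Motivation: Existing 3D scene evaluation metrics focus on overall image quality and disagree with human perception. Drawing on psychological research, the paper hypothesizes that humans attend to individual objects when recognizing 3D scenes, and proposes OSIM to make evaluation more accurate.

Contribution: 1. Proposes OSIM, a new 3D scene evaluation metric that emphasizes object-level fidelity; 2. Validates through a user study that OSIM agrees with human perception; 3. Re-evaluates recent 3D reconstruction and generation models to clarify progress in the field.

Method: 1. Extracts object features with an object detection model; 2. Designs the OSIM metric to quantify the "objectness" of each object in the scene; 3. Validates the metric with a user study.

Result: OSIM agrees with human perception better than existing metrics; recent 3D reconstruction and generation models are re-evaluated under a standardized experimental setup.

Insight: Object-level fidelity is central to how humans understand 3D scenes; existing metrics need to move from overall quality assessment toward finer-grained, object-level evaluation.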

Abstract: This paper presents Objectness SIMilarity (OSIM), a novel evaluation metric for 3D scenes that explicitly focuses on “objects,” which are fundamental units of human visual perception. Existing metrics assess overall image quality, leading to discrepancies with human perception. Inspired by neuropsychological insights, we hypothesize that human recognition of 3D scenes fundamentally involves attention to individual objects. OSIM enables object-centric evaluations by leveraging an object detection model and its feature representations to quantify the “objectness” of each object in the scene. Our user study demonstrates that OSIM aligns more closely with human perception compared to existing metrics. We also analyze the characteristics of OSIM using various approaches. Moreover, we re-evaluate recent 3D reconstruction and generation models under a standardized experimental setup to clarify advancements in this field. The code is available at https://github.com/Objectness-Similarity/OSIM.

[38] Video Understanding by Design: How Datasets Shape Architectures and Insights

Lei Wang,Piotr Koniusz,Yongsheng Gao

Main category: cs.CV

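TL;DR: This paper revisits the development of video understanding from a dataset-driven perspective, showing how datasets steer the evolution of model architectures through motion complexity, temporal span, hierarchical composition, and multimodal richness.

Details Motivation: Existing video understanding surveys mostly classify work by task or model family, overlooking the structural influence of datasets on architectural evolution. The paper aims to fill this gap.

Contribution: 1. Is the first to systematically analyze the evolution of video understanding models from a dataset perspective; 2. Proposes a dataset-driven framework that treats motion complexity, temporal span, and related properties as inductive biases guiding model design; 3. Provides practical guidance for aligning model design with datasets.

Method: The paper reviews the design of milestone models (two-stream networks, 3D CNNs, sequential models, Transformers, and multimodal foundation models) and analyzes how each responds to dataset characteristics.

Result: The paper provides a unified framework that ties together datasets, inductive biases, and model architectures, offering a roadmap for the future of video understanding.

Insight: Datasets are not only benchmarks of model performance; their characteristics (such as temporal dynamics or hierarchical structure) directly shape architectural design. Future research should put more emphasis on co-designing datasets and models.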

Abstract: Video understanding has advanced rapidly, fueled by increasingly complex datasets and powerful architectures. Yet existing surveys largely classify models by task or family, overlooking the structural pressures through which datasets guide architectural evolution. This survey is the first to adopt a dataset-driven perspective, showing how motion complexity, temporal span, hierarchical composition, and multimodal richness impose inductive biases that models should encode. We reinterpret milestones, from two-stream and 3D CNNs to sequential, transformer, and multimodal foundation models, as concrete responses to these dataset-driven pressures. Building on this synthesis, we offer practical guidance for aligning model design with dataset invariances while balancing scalability and task demands. By unifying datasets, inductive biases, and architectures into a coherent framework, this survey provides both a comprehensive retrospective and a prescriptive roadmap for advancing general-purpose video understanding.

[39] OCELOT 2023: Cell Detection from Cell-Tissue Interaction Challenge

JaeWoong Shin,Jeongun Ryu,Aaron Valero Puche,Jinhee Lee,Biagio Brattoli,Wonkyung Jung,Soo Ick Cho,Kyunghyun Paeng,Chan-Young Ock,Donggeun Yoo,Zhaoyang Li,Wangkai Li,Huayu Mai,Joshua Millward,Zhen He,Aiden Nibali,Lydia Anette Schoenpflug,Viktor Hendrik Koelzer,Xu Shuoyu,Ji Zheng,Hu Bin,Yu-Wen Lo,Ching-Hui Yang,Sérgio Pereira

Main category: cs.CV

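TL;DR: The OCELOT 2023 challenge uses a dataset with multi-scale, overlapping cell and tissue annotations to test whether understanding cell-tissue relationships matters for cell detection, and to advance research in this area. Participating models improved performance substantially by integrating multi-scale semantics, showing the potential of the approach.

Details Motivation: Existing deep learning-based cell detection models struggle to replicate how pathologists observe at multiple scales, and the lack of datasets with multi-scale interaction annotations is the main bottleneck. The OCELOT 2023 challenge collects and organizes multi-scale annotated data to test whether cell-tissue relationships are key to better performance.

Contribution: 1. Releases the first dataset with multi-scale overlapping cell and tissue annotations; 2. Confirms that integrating cell-tissue relationships substantially improves cell detection performance; 3. Presents the participants' innovative strategies, pointing to directions for future research.

Method: The challenge provides a dataset of 673 multi-scale annotated pairs; participants designed models that integrate cell-tissue relationships and improve detection through multi-scale semantic learning.

Result: The best entry improved the F1-score on the test set by 7.99 over the cell-only baseline, demonstrating the effectiveness of integrating multi-scale semantics.

Insight: Cell detection needs to incorporate tissue context, and multi-scale semantic learning is key to reaching human-level performance.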

Abstract: Pathologists routinely alternate between different magnifications when examining Whole-Slide Images, allowing them to evaluate both broad tissue morphology and intricate cellular details to form comprehensive diagnoses. However, existing deep learning-based cell detection models struggle to replicate these behaviors and learn the interdependent semantics between structures at different magnifications. A key barrier in the field is the lack of datasets with multi-scale overlapping cell and tissue annotations. The OCELOT 2023 challenge was initiated to gather insights from the community to validate the hypothesis that understanding cell and tissue (cell-tissue) interactions is crucial for achieving human-level performance, and to accelerate the research in this field. The challenge dataset includes overlapping cell detection and tissue segmentation annotations from six organs, comprising 673 pairs sourced from 306 The Cancer Genome Atlas (TCGA) Whole-Slide Images with hematoxylin and eosin staining, divided into training, validation, and test subsets. Participants presented models that significantly enhanced the understanding of cell-tissue relationships. Top entries achieved up to a 7.99 increase in F1-score on the test set compared to the baseline cell-only model that did not incorporate cell-tissue relationships. This is a substantial improvement in performance over traditional cell-only detection methods, demonstrating the need for incorporating multi-scale semantics into the models. This paper provides a comparative analysis of the methods used by participants, highlighting innovative strategies implemented in the OCELOT 2023 challenge.

[40] RT-DETR++ for UAV Object Detection

Yuan Shufang

Main category: cs.CV

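TL;DR: RT-DETR++ improves the encoder of RT-DETR by introducing a channel-gated attention mechanism and CSP-PAC feature fusion, substantially improving detection of small and densely packed objects in UAV imagery while remaining real-time.

Details Motivation: Object detection in UAV imagery faces densely packed small objects, large scale variations, and occlusion; existing methods fall short in both performance and efficiency.

Contribution: 1. Proposes a channel-gated attention-based upsampling/downsampling (AU/AD) mechanism to reduce errors during feature propagation;
2. Designs the CSP-PAC feature fusion technique to integrate multi-scale information.

Method: 1. Encoder improvement: a dual-path design for the channel-gated attention mechanism;
2. Feature fusion: parallel dilated (hollow) convolutions that process local and contextual information.

Result: RT-DETR++ performs strongly on small and densely packed objects while maintaining real-time detection speed.

Insight: With an optimized encoder and feature fusion design, detection performance can be improved without increasing computational complexity.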

Abstract: Object detection in unmanned aerial vehicle (UAV) imagery presents significant challenges. Issues such as densely packed small objects, scale variations, and occlusion are commonplace. This paper introduces RT-DETR++, which enhances the encoder component of the RT-DETR model. Our improvements focus on two key aspects. First, we introduce a channel-gated attention-based upsampling/downsampling (AU/AD) mechanism. This dual-path system minimizes errors and preserves details during feature layer propagation. Second, we incorporate CSP-PAC during feature fusion. This technique employs parallel hollow convolutions to process local and contextual information within the same layer, facilitating the integration of multi-scale features. Evaluation demonstrates that our novel neck design achieves superior performance in detecting small and densely packed objects. The model maintains sufficient speed for real-time detection without increasing computational complexity. This study provides an effective approach for feature encoding design in real-time detection systems.
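As a rough illustration of the parallel "hollow" convolutions mentioned above, the sketch below runs one kernel at several dilation rates in parallel over a 1D signal and sums the branches. It assumes that "hollow" means atrous (dilated) convolution; the function names, the 1D setting, the odd kernel length, and the summation fusion are all illustrative assumptions, not the CSP-PAC design itself.

```python
def dilated_conv1d(signal, kernel, dilation):
    """Same-size 1D cross-correlation with zero padding; kernel taps are `dilation` apart.

    Assumes an odd-length kernel so the output stays centered on the input.
    """
    half = (len(kernel) // 2) * dilation
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = i - half + k * dilation
            if 0 <= j < len(signal):  # positions outside the signal count as zeros
                acc += w * signal[j]
        out.append(acc)
    return out

def parallel_dilated_block(signal, kernel, dilations=(1, 2, 4)):
    """Apply the kernel at several dilation rates in parallel and sum the branches."""
    branches = [dilated_conv1d(signal, kernel, d) for d in dilations]
    return [sum(values) for values in zip(*branches)]

def receptive_field(kernel_size, dilation):
    """Input span covered by one output position of a dilated kernel."""
    return dilation * (kernel_size - 1) + 1

impulse = [0, 0, 0, 0, 1, 0, 0, 0, 0]
print(dilated_conv1d(impulse, [1, 1, 1], dilation=2))
print(parallel_dilated_block(impulse, [1, 1, 1], dilations=(1, 2)))
print([receptive_field(3, d) for d in (1, 2, 4)])
```

The small-dilation branch responds to immediate neighbors (local detail), while larger dilations widen the receptive field at the same parameter count, which is one way a single layer can mix local and contextual information.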

[41] A Knowledge Noise Mitigation Framework for Knowledge-based Visual Question Answering

Zhiyue Liu,Sihang Liu,Jinyuan Liu,Xinru Zhang

Main category: cs.CV

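TL;DR: Proposes a training-free framework that mitigates the interference of knowledge redundancy in KB-VQA, improving answer generation by increasing knowledge relevance and reducing redundancy.

Details Motivation: Existing KB-VQA methods inject retrieved knowledge directly into the model and ignore the noise introduced by knowledge redundancy, which hurts answer accuracy.

Contribution: 1. Proposes a low-noise query generation method to improve the relevance of knowledge retrieval; 2. Uses large models to extract useful knowledge segments and reduce redundancy; 3. Introduces a selective knowledge integration strategy that injects knowledge only when the model is not confident.

Method: 1. Generates low-noise queries from image-question pairs; 2. Uses large models to select knowledge segments that benefit the answer; 3. Decides dynamically whether to bring in knowledge, reducing the impact of redundancy.

Result: Experiments show that the method outperforms state-of-the-art approaches and uses critical knowledge more accurately when answering questions.

Insight: By reducing knowledge noise and integrating knowledge selectively, KB-VQA can use external knowledge more efficiently and improve answer quality.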

Abstract: Knowledge-based visual question answering (KB-VQA) requires a model to understand images and utilize external knowledge to provide accurate answers. Existing approaches often directly augment models with retrieved information from knowledge sources while ignoring substantial knowledge redundancy, which introduces noise into the answering process. To address this, we propose a training-free framework with knowledge focusing for KB-VQA that mitigates the impact of noise by enhancing knowledge relevance and reducing redundancy. First, for knowledge retrieval, our framework extracts the essential parts of the image-question pairs, creating low-noise queries that enhance the retrieval of highly relevant knowledge. Considering that redundancy still persists in the retrieved knowledge, we then prompt large models to identify and extract answer-beneficial segments from knowledge. In addition, we introduce a selective knowledge integration strategy, allowing the model to incorporate knowledge only when it lacks confidence in answering the question, thereby mitigating the influence of redundant information. Our framework enables the acquisition of accurate and critical knowledge, and extensive experiments demonstrate that it outperforms state-of-the-art methods.
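The selective knowledge integration step can be pictured as a confidence gate around retrieval. The sketch below is only an illustration under stated assumptions: `answer_fn`, `retrieve_fn`, the toy stand-ins, and the 0.7 threshold are all hypothetical, and the abstract does not say how confidence is actually measured.

```python
def answer_with_selective_knowledge(question, answer_fn, retrieve_fn, threshold=0.7):
    """Answer directly when confident; otherwise retrieve knowledge and answer again.

    answer_fn(question, knowledge) -> (answer, confidence in [0, 1])   # hypothetical interface
    retrieve_fn(question) -> list of knowledge strings                 # hypothetical interface
    Returns (answer, used_knowledge).
    """
    answer, confidence = answer_fn(question, None)
    if confidence >= threshold:
        return answer, False  # confident enough: no external knowledge is injected
    knowledge = retrieve_fn(question)
    answer, _ = answer_fn(question, knowledge)
    return answer, True

def toy_answer(question, knowledge):
    """Toy stand-in for a vision-language model call (illustration only)."""
    if knowledge:
        return "Eiffel Tower", 0.95
    if "color" in question:
        return "red", 0.9
    return "unknown", 0.3

def toy_retrieve(question):
    """Toy stand-in for a knowledge retriever (illustration only)."""
    return ["The Eiffel Tower is a landmark in Paris."]

print(answer_with_selective_knowledge("What color is the car?", toy_answer, toy_retrieve))
print(answer_with_selective_knowledge("Which landmark is shown?", toy_answer, toy_retrieve))
```

Gating on confidence means that questions the model can already answer never see retrieved text, so redundant passages cannot perturb them.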

[42] CWSSNet: Hyperspectral Image Classification Enhanced by Wavelet Domain Convolution

Yulin Tong,Fengzong Zhang,Haiqin Cheng

Main category: cs.CV

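TL;DR: CWSSNet is a hyperspectral image classification framework that combines 3D spectral-spatial features with wavelet convolution. It improves classification performance through a multiscale convolutional attention module and multi-band wavelet decomposition, and remains robust under small-sample training.

Details Motivation: Hyperspectral images have many bands, high dimensionality, and spectral mixing, which leads to feature redundancy; traditional methods hit a performance bottleneck that needs to be overcome.

Contribution: Proposes the CWSSNet framework, which integrates 3D spectral-spatial features and wavelet convolution and introduces a multiscale attention module and multi-band decomposition in the wavelet domain.

Method: Uses hyperspectral data from the ZY1F satellite, combining 3D convolution with multiscale wavelet-domain convolution to improve feature extraction and classification.

Result: Achieves 74.50% mIoU, 82.73% mAcc, and 84.94% mF1, with the highest IoU on water bodies, vegetation, and bare land.

Insight: Wavelet-domain convolution and a multiscale attention module can substantially improve hyperspectral image classification, and the model remains robust with small training sets.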

Abstract: Hyperspectral remote sensing technology has significant application value in fields such as forestry ecology and precision agriculture, while also putting forward higher requirements for fine ground object classification. However, although hyperspectral images are rich in spectral information and can improve recognition accuracy, they tend to cause prominent feature redundancy due to their numerous bands, high dimensionality, and spectral mixing characteristics. To address this, this study used hyperspectral images from the ZY1F satellite as a data source and selected Yugan County, Shangrao City, Jiangxi Province as the research area to perform ground object classification research. A classification framework named CWSSNet was proposed, which integrates 3D spectral-spatial features and wavelet convolution. This framework integrates multimodal information using a multiscale convolutional attention module and breaks through the classification performance bottleneck of traditional methods by introducing multi-band decomposition and convolution operations in the wavelet domain. The experiments showed that CWSSNet achieved 74.50%, 82.73%, and 84.94% in mean Intersection over Union (mIoU), mean Accuracy (mAcc), and mean F1-score (mF1) respectively in Yugan County. It also obtained the highest Intersection over Union (IoU) in the classification of water bodies, vegetation, and bare land, demonstrating good robustness. Additionally, when the training set proportion was 70%, the increase in training time was limited, and the classification effect was close to the optimal level, indicating that the model maintains reliable performance under small-sample training conditions.
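For readers unfamiliar with the three reported metrics, the sketch below computes mIoU, mAcc, and mF1 from a confusion matrix. It assumes the common convention that mAcc is the mean of per-class accuracy (recall); the abstract does not state the exact definition used, and the function name and toy matrix are hypothetical.

```python
def segmentation_metrics(confusion):
    """Compute mIoU, mAcc, and mF1 from a square confusion matrix.

    confusion[i][j] = number of samples with true class i predicted as class j.
    mAcc is taken here as the mean per-class recall (an assumed convention).
    """
    n = len(confusion)
    ious, accs, f1s = [], [], []
    for c in range(n):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp                        # true class c, predicted otherwise
        fp = sum(confusion[r][c] for r in range(n)) - tp   # predicted c, true class otherwise
        ious.append(tp / (tp + fp + fn) if tp + fp + fn else 0.0)
        accs.append(tp / (tp + fn) if tp + fn else 0.0)
        f1s.append(2 * tp / (2 * tp + fp + fn) if 2 * tp + fp + fn else 0.0)
    return {"mIoU": sum(ious) / n, "mAcc": sum(accs) / n, "mF1": sum(f1s) / n}

toy_confusion = [[8, 2],
                 [1, 9]]
print(segmentation_metrics(toy_confusion))
```

Because per-class IoU never exceeds per-class F1 (IoU = F1 / (2 - F1)), mIoU is always the lowest of the two overlap scores, which is consistent with the ordering of the figures reported above.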

[43] Bridging the Gap Between Ideal and Real-world Evaluation: Benchmarking AI-Generated Image Detection in Challenging Scenarios

Chunxiao Li,Xiaoxiao Wang,Meiling Li,Boming Miao,Peng Sun,Yunjian Zhang,Xiangyang Ji,Yao Zhu

Main category: cs.CV

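TL;DR: This study introduces the Real-World Robustness Dataset (RRDataset) for evaluating AI-generated image detection models in complex real-world scenarios. The dataset covers seven major scenarios and tests model robustness after internet transmission and re-digitization. The experiments expose the limitations of current methods under real-world conditions and stress the need to draw on human adaptability when developing more robust detection algorithms.

Details Motivation: With the rapid progress of generative models, highly realistic image synthesis poses new challenges to digital security and media credibility. The evaluation of existing AI-generated image detection methods in complex real-world scenarios remains a research gap, which the authors try to fill.

Contribution: 1) Proposes RRDataset, which covers seven major scenarios and addresses dataset gaps from a content perspective; 2) Tests the robustness of detection models after internet transmission and re-digitization; 3) Benchmarks 17 detectors and 10 vision-language models and, together with a large-scale human study, reveals the limitations of current methods.

Method: 1) Builds RRDataset, covering seven major scenario categories; 2) Tests model performance after internet transmission and re-digitization; 3) Compares human few-shot learning ability with AI model performance in a human study.

Result: The benchmark shows that existing AI detection methods perform poorly under real-world conditions, with clear degradation after internet transmission and re-digitization. The human study suggests that people adapt better to complex scenarios after seeing only a few examples.

Insight: Current AI-generated image detection models have clear limitations in complex real-world scenarios. Future research should draw on human adaptability to build more robust algorithms, especially for detection after transmission and image-processing distortions.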

Abstract: With the rapid advancement of generative models, highly realistic image synthesis has posed new challenges to digital security and media credibility. Although AI-generated image detection methods have partially addressed these concerns, a substantial research gap remains in evaluating their performance under complex real-world conditions. This paper introduces the Real-World Robustness Dataset (RRDataset) for comprehensive evaluation of detection models across three dimensions: 1) Scenario Generalization: RRDataset encompasses high-quality images from seven major scenarios (War and Conflict, Disasters and Accidents, Political and Social Events, Medical and Public Health, Culture and Religion, Labor and Production, and everyday life), addressing existing dataset gaps from a content perspective. 2) Internet Transmission Robustness: examining detector performance on images that have undergone multiple rounds of sharing across various social media platforms. 3) Re-digitization Robustness: assessing model effectiveness on images altered through four distinct re-digitization methods. We benchmarked 17 detectors and 10 vision-language models (VLMs) on RRDataset and conducted a large-scale human study involving 192 participants to investigate human few-shot learning capabilities in detecting AI-generated images. The benchmarking results reveal the limitations of current AI detection methods under real-world conditions and underscore the importance of drawing on human adaptability to develop more robust detection algorithms.

[44] VQualA 2025 Challenge on Visual Quality Comparison for Large Multimodal Models: Methods and Results

Hanwei Zhu,Haoning Wu,Zicheng Zhang,Lingyu Zhu,Yixuan Li,Peilin Chen,Shiqi Wang,Chris Wei Zhou,Linhan Cao,Wei Sun,Xiangyang Zhu,Weixia Zhang,Yucheng Zhu,Jing Liu,Dandan Zhu,Guangtao Zhai,Xiongkuo Min,Zhichao Zhang,Xinyue Li,Shubo Xu,Anh Dao,Yifan Li,Hongyuan Yu,Jiaojiao Yi,Yiding Tian,Yupeng Wu,Feiran Sun,Lijuan Liao,Song Jiang

Main category: cs.CV

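TL;DR: This paper summarizes the VQualA challenge held at the ICCV 2025 Workshop on Visual Quality Assessment, which aims to evaluate and improve the ability of large multimodal models (LMMs) to compare visual quality differences. The competition introduces a new benchmark of thousands of coarse-to-fine grained visual quality comparison tasks and attracted around 100 participants, with five models demonstrating the potential of instruction-tuned LMMs for quality assessment.

Details Motivation: Existing large multimodal models still need to improve at open-ended, detailed reasoning about visual quality differences, so the VQualA challenge was designed to push the technology forward.

Contribution: Proposes a new benchmark covering visual quality comparison tasks at multiple granularities and designs holistic evaluation protocols (such as 2AFC and MCQs), demonstrating the potential of instruction-tuned LMMs for quality assessment.

Method: The challenge introduces tasks ranging from single images and image pairs to multi-image groups, requires models to give accurate quality judgments, and evaluates them with 2AFC and multiple-choice questions.

Result: Five models demonstrated the emerging capabilities of instruction-tuned LMMs in visual quality assessment, marking a significant step toward open-domain visual quality reasoning.

Insight: The competition shows that instruction-tuned LMMs have potential on quality assessment tasks and points the way for future research on interpretable, human-aligned quality evaluation systems.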

Abstract: This paper presents a summary of the VQualA 2025 Challenge on Visual Quality Comparison for Large Multimodal Models (LMMs), hosted as part of the ICCV 2025 Workshop on Visual Quality Assessment. The challenge aims to evaluate and enhance the ability of state-of-the-art LMMs to perform open-ended and detailed reasoning about visual quality differences across multiple images. To this end, the competition introduces a novel benchmark comprising thousands of coarse-to-fine grained visual quality comparison tasks, spanning single images, pairs, and multi-image groups. Each task requires models to provide accurate quality judgments. The competition emphasizes holistic evaluation protocols, including 2AFC-based binary preference and multi-choice questions (MCQs). Around 100 participants submitted entries, with five models demonstrating the emerging capabilities of instruction-tuned LMMs on quality assessment. This challenge marks a significant step toward open-domain visual quality reasoning and comparison and serves as a catalyst for future research on interpretable and human-aligned quality evaluation systems.

[45] Medverse: A Universal Model for Full-Resolution 3D Medical Image Segmentation, Transformation and Enhancement

Jiesi Hu,Jianfeng Cao,Yanwu Yang,Chenfei Ye,Yixuan Zhang,Hanyang Peng,Ting Ma

Main category: cs.CV

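TL;DR: Medverse is a universal in-context learning model for segmentation, transformation, and enhancement of 3D medical images, addressing the limitations of existing methods in high-fidelity prediction and global anatomical understanding.

Details Motivation: Current in-context learning models for medical image analysis cannot achieve high-fidelity predictions and global anatomical understanding at the same time, and there is no unified model across tasks and anatomical regions, which limits their potential in medical imaging.

Contribution: Proposes Medverse, a universal model trained on 22 datasets that supports multi-organ, multi-modality, and multi-task 3D medical image analysis; introduces a next-scale autoregressive framework and a blockwise cross-attention module that improve prediction accuracy and computational efficiency.

Method: Uses a next-scale autoregressive in-context learning framework that progressively refines predictions from coarse to fine; designs a blockwise cross-attention module that enables long-range interactions while preserving computational efficiency.

Result: Medverse substantially outperforms existing methods on previously unseen datasets and establishes a new paradigm for in-context learning.

Insight: Medverse's results suggest that a unified in-context learning model can work across tasks and modalities, opening new possibilities for medical image analysis.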

Abstract: In-context learning (ICL) offers a promising paradigm for universal medical image analysis, enabling models to perform diverse image processing tasks without retraining. However, current ICL models for medical imaging remain limited in two critical aspects: they cannot simultaneously achieve high-fidelity predictions and global anatomical understanding, and there is no unified model trained across diverse medical imaging tasks (e.g., segmentation and enhancement) and anatomical regions. As a result, the full potential of ICL in medical imaging remains underexplored. Thus, we present Medverse, a universal ICL model for 3D medical imaging, trained on 22 datasets covering diverse tasks in universal image segmentation, transformation, and enhancement across multiple organs, imaging modalities, and clinical centers. Medverse employs a next-scale autoregressive in-context learning framework that progressively refines predictions from coarse to fine, generating consistent, full-resolution volumetric outputs and enabling multi-scale anatomical awareness. We further propose a blockwise cross-attention module that facilitates long-range interactions between context and target inputs while preserving computational efficiency through spatial sparsity. Medverse is extensively evaluated on a broad collection of held-out datasets covering previously unseen clinical centers, organs, species, and imaging modalities. Results demonstrate that Medverse substantially outperforms existing ICL baselines and establishes a novel paradigm for in-context learning. Code and model weights will be made publicly available. Our model is publicly available at https://github.com/jiesihu/Medverse.

[46] CoAtNeXt: An Attention-Enhanced ConvNeXtV2-Transformer Hybrid Model for Gastric Tissue Classification

Mustafa Yurdakul,Sakir Tasdemir

Main category: cs.CV

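TL;DR: CoAtNeXt is a hybrid model that combines ConvNeXtV2 and Transformer components for gastric tissue image classification, outperforming conventional CNN and ViT models.

Details Motivation: Early diagnosis of gastric diseases is crucial to prevent fatal outcomes, but the current reliance on manual histopathological examination is labor-intensive and subject to variability between pathologists, so automated methods are needed.

Contribution: Proposes the CoAtNeXt model, which integrates ConvNeXtV2 blocks and the CBAM attention mechanism to improve classification performance and surpass existing models.

Method: Builds on the CoAtNet architecture, replacing its MBConv layers with ConvNeXtV2 blocks and adding a CBAM module to strengthen local feature extraction.

Result: Performs strongly on two datasets, reaching 96.47% accuracy on HMU-GC-HE-30K and 98.29% on GasHisSDB, and outperforming all compared models.

Insight: CoAtNeXt shows potential for histopathological classification and could help improve diagnostic accuracy and reduce pathologists' workload.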

Abstract: Background and objective: Early diagnosis of gastric diseases is crucial to prevent fatal outcomes. Although histopathologic examination remains the diagnostic gold standard, it is performed entirely manually, making evaluations labor-intensive and prone to variability among pathologists. Critical findings may be missed, and lack of standard procedures reduces consistency. These limitations highlight the need for automated, reliable, and efficient methods for gastric tissue analysis. Methods: In this study, a novel hybrid model named CoAtNeXt was proposed for the classification of gastric tissue images. The model is built upon the CoAtNet architecture by replacing its MBConv layers with enhanced ConvNeXtV2 blocks. Additionally, the Convolutional Block Attention Module (CBAM) is integrated to improve local feature extraction through channel and spatial attention mechanisms. The architecture was scaled to achieve a balance between computational efficiency and classification performance. CoAtNeXt was evaluated on two publicly available datasets, HMU-GC-HE-30K for eight-class classification and GasHisSDB for binary classification, and was compared against ten Convolutional Neural Networks (CNNs) and ten Vision Transformer (ViT) models. Results: CoAtNeXt achieved 96.47% accuracy, 96.60% precision, 96.47% recall, 96.45% F1 score, and 99.89% AUC on HMU-GC-HE-30K. On GasHisSDB, it reached 98.29% accuracy, 98.07% precision, 98.41% recall, 98.23% F1 score, and 99.90% AUC. It outperformed all CNN and ViT models tested and surpassed previous studies in the literature. Conclusion: Experimental results show that CoAtNeXt is a robust architecture for histopathological classification of gastric tissue images, performing well on both binary and multiclass tasks. This highlights its potential to assist pathologists by enhancing diagnostic accuracy and reducing workload.

[47] Towards Better Dental AI: A Multimodal Benchmark and Instruction Dataset for Panoramic X-ray Analysis

Jing Hao,Yuxuan Fan,Yanpeng Sun,Kaixin Guo,Lizhuo Lin,Jinrong Yang,Qi Yong H. Ai,Lun M. Wong,Hao Tang,Kuo Feng Hung

Main category: cs.CV

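TL;DR: This paper introduces MMOral, the first multimodal instruction dataset and benchmark for panoramic X-ray analysis, and builds the OralGPT model on top of it, substantially improving model performance in dentistry.

Details Motivation: Although large vision-language models (LVLMs) perform well on general-purpose medical tasks, their effectiveness in specialized domains such as dentistry remains underexplored. Panoramic X-rays contain dense anatomical structures and subtle pathological cues that existing models struggle to interpret accurately.

Contribution: 1. Proposes MMOral, the first large-scale multimodal instruction dataset and benchmark, covering diverse task types; 2. Develops OralGPT, whose performance improves substantially through supervised fine-tuning; 3. Releases the dataset, model, and evaluation suite.

Method: 1. Builds the MMOral dataset, with 20,563 annotated images and 1.3 million instruction instances; 2. Designs the MMOral-Bench evaluation suite; 3. Develops OralGPT by supervised fine-tuning of Qwen2.5-VL-7B.

Result: Existing LVLMs perform poorly on MMOral-Bench (best accuracy 41.45%), while OralGPT shows a 24.73% improvement after a single epoch of fine-tuning.

Insight: 1. Specialized domains need purpose-built datasets and benchmarks; 2. Supervised fine-tuning is highly effective at improving model performance; 3. Open resources are essential for advancing dental AI.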

Abstract: Recent advances in large vision-language models (LVLMs) have demonstrated strong performance on general-purpose medical tasks. However, their effectiveness in specialized domains such as dentistry remains underexplored. In particular, panoramic X-rays, a widely used imaging modality in oral radiology, pose interpretative challenges due to dense anatomical structures and subtle pathological cues, which are not captured by existing medical benchmarks or instruction datasets. To this end, we introduce MMOral, the first large-scale multimodal instruction dataset and benchmark tailored for panoramic X-ray interpretation. MMOral consists of 20,563 annotated images paired with 1.3 million instruction-following instances across diverse task types, including attribute extraction, report generation, visual question answering, and image-grounded dialogue. In addition, we present MMOral-Bench, a comprehensive evaluation suite covering five key diagnostic dimensions in dentistry. We evaluate 64 LVLMs on MMOral-Bench and find that even the best-performing model, i.e., GPT-4o, only achieves 41.45% accuracy, revealing significant limitations of current models in this domain. To promote the progress of this specific domain, we also propose OralGPT, which conducts supervised fine-tuning (SFT) upon Qwen2.5-VL-7B with our meticulously curated MMOral instruction dataset. Remarkably, a single epoch of SFT yields substantial performance enhancements for LVLMs, e.g., OralGPT demonstrates a 24.73% improvement. Both MMOral and OralGPT hold significant potential as a critical foundation for intelligent dentistry and enable more clinically impactful multimodal AI systems in the dental field. The dataset, model, benchmark, and evaluation suite are available at https://github.com/isbrycee/OralGPT.

[48] DATE: Dynamic Absolute Time Enhancement for Long Video Understanding

Chao Yuan,Yang Yang,Yehui Yang,Zach Cheng

Main category: cs.CV

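TL;DR: DATE (Dynamic Absolute Time Enhancement) combines a timestamp injection mechanism with a temporal-aware similarity sampling strategy to substantially improve the temporal awareness of multimodal large language models in long video understanding.

Details Motivation: Long video understanding is a fundamental challenge for multimodal large language models; conventional methods rely on uniform frame sampling and implicit position encodings, which cause critical information loss and degraded temporal comprehension.

Contribution: Introduces the Timestamp Injection Mechanism (TIM) and the Temporal-Aware Similarity Sampling (TASS) strategy, builds a continuous temporal reference system, and improves video sampling with a two-stage algorithm.

Method: Interleaves video frame embeddings with textual timestamp tokens to strengthen temporal awareness, and reformulates video sampling as a vision-language retrieval task that uses descriptive captions and a similarity-driven greedy strategy.

Result: On hour-long video benchmarks, both the 7B and 72B models perform strongly, and the 7B model even exceeds many 72B models on some benchmarks.

Insight: Explicit temporal modeling and semantically guided sampling are crucial for temporal reasoning in long video understanding; with an efficient method, smaller models can surpass larger ones.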

Abstract: Long video understanding remains a fundamental challenge for multimodal large language models (MLLMs), particularly in tasks requiring precise temporal reasoning and event localization. Existing approaches typically adopt uniform frame sampling and rely on implicit position encodings to model temporal order. However, these methods struggle with long-range dependencies, leading to critical information loss and degraded temporal comprehension. In this paper, we propose Dynamic Absolute Time Enhancement (DATE) that enhances temporal awareness in MLLMs through the Timestamp Injection Mechanism (TIM) and a semantically guided Temporal-Aware Similarity Sampling (TASS) strategy. Specifically, we interleave video frame embeddings with textual timestamp tokens to construct a continuous temporal reference system. We further reformulate the video sampling problem as a vision-language retrieval task and introduce a two-stage algorithm to ensure both semantic relevance and temporal coverage: enriching each query into a descriptive caption to better align with the vision feature, and sampling key event with a similarity-driven temporally regularized greedy strategy. Our method achieves remarkable improvements w.r.t. absolute time understanding and key event localization, resulting in state-of-the-art performance among 7B and 72B models on hour-long video benchmarks. Particularly, our 7B model even exceeds many 72B models on some benchmarks.
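The two ideas in the abstract, greedy selection by query-frame similarity with temporal regularization and timestamp tokens interleaved with frames, can be sketched as follows. This is an illustration under assumptions, not the paper's TASS or TIM: the minimum-gap rule stands in for whatever temporal regularizer the authors use, and the `<t=...s>` token format and all function names are hypothetical.

```python
def greedy_temporal_sampling(scores, timestamps, k, min_gap):
    """Pick up to k frames in descending similarity order.

    A frame is skipped if it lies within `min_gap` seconds of an already chosen
    frame (an assumed form of temporal regularization). Returns frame indices
    in chronological order.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    chosen = []
    for i in order:
        if len(chosen) == k:
            break
        if all(abs(timestamps[i] - timestamps[j]) >= min_gap for j in chosen):
            chosen.append(i)
    return sorted(chosen)

def interleave_timestamps(frame_tokens, timestamps):
    """Put a textual timestamp token before each frame token (hypothetical token format)."""
    sequence = []
    for token, t in zip(frame_tokens, timestamps):
        sequence.append(f"<t={t:.1f}s>")
        sequence.append(token)
    return sequence

scores = [0.90, 0.85, 0.20, 0.80, 0.10]   # similarity of each frame to the query caption
timestamps = [0.0, 1.0, 2.0, 10.0, 20.0]  # seconds
picked = greedy_temporal_sampling(scores, timestamps, k=2, min_gap=5.0)
print(picked)
print(interleave_timestamps([f"frame{i}" for i in picked], [timestamps[i] for i in picked]))
```

Without the gap constraint, the two highest-scoring frames (0 s and 1 s) would both be chosen and the event at 10 s would be missed; the regularizer trades a little similarity for temporal coverage.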

[49] Unified Start, Personalized End: Progressive Pruning for Efficient 3D Medical Image Segmentation

Linhao Li,Yiwen Ye,Ziyang Chen,Yong Xia

Main category: cs.CV

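TL;DR: PSP-Seg is a progressive pruning framework for efficient, dynamic 3D medical image segmentation; by pruning redundant modules step by step, it substantially reduces resource consumption while maintaining performance.

Details Motivation: 3D medical image segmentation is usually resource-intensive, and existing efficient models are mostly static designs that adapt poorly to diverse tasks and struggle to balance performance with efficiency.

Contribution: Proposes the PSP-Seg framework, which achieves dynamic and efficient 3D segmentation through progressive pruning and a functional decoupling loss.

Method: PSP-Seg starts from a redundant model and iteratively prunes redundant modules by combining block-wise pruning with a functional decoupling loss.

Result: PSP-Seg-S performs on par with nnU-Net while reducing GPU memory usage by 42-45%, training time by 29-48%, and parameter count by 83-87%.

Insight: Dynamic pruning can substantially improve resource efficiency in medical image segmentation while maintaining high performance.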

Abstract: 3D medical image segmentation often faces heavy resource and time consumption, limiting its scalability and rapid deployment in clinical environments. Existing efficient segmentation models are typically static and manually designed prior to training, which restricts their adaptability across diverse tasks and makes it difficult to balance performance with resource efficiency. In this paper, we propose PSP-Seg, a progressive pruning framework that enables dynamic and efficient 3D segmentation. PSP-Seg begins with a redundant model and iteratively prunes redundant modules through a combination of block-wise pruning and a functional decoupling loss. We evaluate PSP-Seg on five public datasets, benchmarking it against seven state-of-the-art models and six efficient segmentation models. Results demonstrate that the lightweight variant, PSP-Seg-S, achieves performance on par with nnU-Net while reducing GPU memory usage by 42-45%, training time by 29-48%, and parameter number by 83-87% across all datasets. These findings underscore PSP-Seg’s potential as a cost-effective yet high-performing alternative for widespread clinical application.

[50] Visual Programmability: A Guide for Code-as-Thought in Chart Understanding

Bohao Tang,Yan Ma,Fei Zhang,Jiadi Su,Ethan Chern,Zhulin Hu,Zhixin Wang,Pengfei Liu,Ya Zhang

Main category: cs.CV

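TL;DR: This paper proposes Code-as-Thought (CaT), an approach that improves the reasoning of vision-language models on chart understanding, and uses reinforcement learning to select the best reasoning pathway dynamically.

Details Motivation: Existing chart understanding methods either depend on external tools or adopt a single reasoning strategy (such as text-based chain-of-thought), which limits flexibility and accuracy. To address this, the paper explores code as a symbolic representation of visual information and proposes a scheme for selecting the reasoning pathway dynamically.

Contribution: 1. Proposes the Code-as-Thought (CaT) approach, which represents chart information in a verifiable, symbolic format. 2. Introduces the concept of Visual Programmability, which lets the model choose between code-based reasoning and direct visual analysis. 3. Designs a dual-reward reinforcement learning system that improves both accuracy and pathway selection.

Method: 1. Adaptive framework: the model learns to choose between the CaT pathway and a direct visual reasoning pathway. 2. Dual-reward reinforcement learning: the model is trained with a data-accuracy reward combined with a decision reward. 3. Verifiable symbolic representation: expressing reasoning as code keeps the intermediate steps verifiable.

Result: The method performs strongly across multiple chart-understanding benchmarks, confirming the effectiveness of dynamically selecting the reasoning pathway.

Insight: Vision-language models can learn not only to reason but also to choose the best reasoning pathway dynamically, which improves task adaptability and offers a new approach to complex visual reasoning tasks.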

Abstract: Chart understanding presents a critical test to the reasoning capabilities of Vision-Language Models (VLMs). Prior approaches face critical limitations: some rely on external tools, making them brittle and constrained by a predefined toolkit, while others fine-tune specialist models that often adopt a single reasoning strategy, such as text-based chain-of-thought (CoT). The intermediate steps of text-based reasoning are difficult to verify, which complicates the use of reinforcement-learning signals that reward factual accuracy. To address this, we propose a Code-as-Thought (CaT) approach to represent the visual information of a chart in a verifiable, symbolic format. Our key insight is that this strategy must be adaptive: a fixed, code-only implementation consistently fails on complex charts where symbolic representation is unsuitable. This finding leads us to introduce Visual Programmability: a learnable property that determines if a chart-question pair is better solved with code or direct visual analysis. We implement this concept in an adaptive framework where a VLM learns to choose between the CaT pathway and a direct visual reasoning pathway. The selection policy of the model is trained with reinforcement learning using a novel dual-reward system. This system combines a data-accuracy reward to ground the model in facts and prevent numerical hallucination, with a decision reward that teaches the model when to use each strategy, preventing it from defaulting to a single reasoning mode. Experiments demonstrate strong and robust performance across diverse chart-understanding benchmarks. Our work shows that VLMs can be taught not only to reason but also how to reason, dynamically selecting the optimal reasoning pathway for each task.
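The dual-reward idea can be illustrated with a toy scoring function: one term rewards numerically accurate data extracted on the code pathway, and the other rewards choosing the pathway that suits the chart. Everything below is a hypothetical sketch. The abstract does not give the reward formulas, so the relative tolerance, the weights, the `code_suitable` label, and the function names are assumptions made for illustration.

```python
def data_accuracy_reward(predicted, reference, rel_tol=0.05):
    """Fraction of extracted values within a relative tolerance of the reference values."""
    if not reference or len(predicted) != len(reference):
        return 0.0
    hits = sum(
        1 for p, r in zip(predicted, reference)
        if abs(p - r) <= rel_tol * max(abs(r), 1e-9)
    )
    return hits / len(reference)

def decision_reward(chose_code, code_suitable):
    """1.0 when the chosen pathway matches the (assumed) suitability label, else 0.0."""
    return 1.0 if chose_code == code_suitable else 0.0

def dual_reward(predicted, reference, chose_code, code_suitable,
                w_data=1.0, w_decision=0.5):
    """Weighted sum of the two rewards; the data term applies only on the code pathway."""
    data = data_accuracy_reward(predicted, reference) if chose_code else 0.0
    return w_data * data + w_decision * decision_reward(chose_code, code_suitable)

# Code pathway on a programmable chart, with both values extracted accurately.
print(dual_reward([10.0, 20.5], [10.0, 20.0], chose_code=True, code_suitable=True))
# Direct visual pathway on a chart that is unsuitable for code.
print(dual_reward([], [], chose_code=False, code_suitable=False))
# Code pathway forced onto an unsuitable chart, with one value wrong.
print(dual_reward([10.0, 30.0], [10.0, 20.0], chose_code=True, code_suitable=False))
```

Keeping the decision term separate from the data term is what discourages a policy from collapsing to a single mode: always choosing code earns no decision reward on charts where code is unsuitable.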

[51] Modality-Agnostic Input Channels Enable Segmentation of Brain lesions in Multimodal MRI with Sequences Unavailable During Training

Anthony P. Addison,Felix Wagner,Wentian Xu,Natalie Voets,Konstantinos Kamnitsas

Main category: cs.CV

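TL;DR: This paper proposes a modality-agnostic input channel that lets a U-net architecture handle mixtures of MRI modalities not seen during training, making brain lesion segmentation more flexible.

Details Motivation: Existing brain MRI segmentation models are usually restricted to the fixed modalities seen in training and cannot handle new modalities or modality mixtures at inference. The paper aims to develop a more flexible model that can work with any available MRI modality, including ones unseen during training.

Contribution: The main contribution is a modality-agnostic input channel or pathway, trained with an image augmentation scheme, that can process MRI modalities unseen during training while preserving segmentation ability on the modalities already seen.

Method: The method builds on the U-net architecture, adds a modality-agnostic input channel, and uses augmentation to synthesize artificial MRI modalities. The augmentation alters the contrast between pathological and healthy tissue while keeping the anatomy realistic.

Result: Experiments use 8 MRI databases and 5 types of pathology; the results show that the method processes trained modalities effectively while improving segmentation on unseen modalities.

Insight: Combining a modality-agnostic design with synthetic augmentation offers a practical and flexible way to handle multimodal medical images.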

Abstract: Segmentation models are important tools for the detection and analysis of lesions in brain MRI. Depending on the type of brain pathology that is imaged, MRI scanners can acquire multiple, different image modalities (contrasts). Most segmentation models for multimodal brain MRI are restricted to fixed modalities and cannot effectively process new ones at inference. Some models generalize to unseen modalities but may lose discriminative modality-specific information. This work aims to develop a model that can perform inference on data that contain image modalities unseen during training, previously seen modalities, and heterogeneous combinations of both, thus allowing a user to utilize any available imaging modalities. We demonstrate this is possible with a simple, thus practical alteration to the U-net architecture, by integrating a modality-agnostic input channel or pathway, alongside modality-specific input channels. To train this modality-agnostic component, we develop an image augmentation scheme that synthesizes artificial MRI modalities. Augmentations differentially alter the appearance of pathological and healthy brain tissue to create artificial contrasts between them while maintaining realistic anatomical integrity. We evaluate the method using 8 MRI databases that include 5 types of pathologies (stroke, tumours, traumatic brain injury, multiple sclerosis and white matter hyperintensities) and 8 modalities (T1, T1+contrast, T2, PD, SWI, DWI, ADC and FLAIR). The results demonstrate that the approach preserves the ability to effectively process MRI modalities encountered during training, while being able to process new, unseen modalities to improve its segmentation. Project code: https://github.com/Anthony-P-Addison/AGN-MOD-SEG

[52] Can Multimodal LLMs See Materials Clearly? A Multimodal Benchmark on Materials Characterization

Zhengzhao Lai,Youbin Zheng,Zhenyang Cai,Haonan Lyu,Jinpu Yang,Hongqing Liang,Yan Hu,Benyou Wang

Main category: cs.CV

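TL;DR: The paper introduces the MatCha benchmark for evaluating how well multimodal large language models (MLLMs) understand materials characterization images. It finds that existing models perform poorly on expert-level tasks and that prompting methods do little to offset these limitations.

Details Motivation: Materials characterization is central to materials science, yet the ability of current MLLMs to understand real-world characterization images remains underexplored. The authors propose the MatCha benchmark to fill this gap.

Contribution: MatCha, the first benchmark for materials characterization image understanding. It comprises 1,500 questions covering four key stages of materials research and 21 tasks, and it exposes the performance gap of existing MLLMs on expert-level tasks.

Method: The authors design MatCha around practical materials science tasks and evaluate state-of-the-art MLLMs under few-shot and chain-of-thought prompting.

Result: Experiments show a significant gap between existing MLLMs and human experts on materials characterization image understanding, especially on tasks requiring higher-level expertise and sophisticated visual perception.

Insight: Existing MLLMs adapt poorly to real-world materials characterization scenarios, and prompting methods such as few-shot and chain-of-thought bring little improvement. MatCha may help advance research on new material discovery and autonomous scientific agents.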

Abstract: Materials characterization is fundamental to acquiring materials information, revealing the processing-microstructure-property relationships that guide material design and optimization. While multimodal large language models (MLLMs) have recently shown promise in generative and predictive tasks within materials science, their capacity to understand real-world characterization imaging data remains underexplored. To bridge this gap, we present MatCha, the first benchmark for materials characterization image understanding, comprising 1,500 questions that demand expert-level domain expertise. MatCha encompasses four key stages of materials research comprising 21 distinct tasks, each designed to reflect authentic challenges faced by materials scientists. Our evaluation of state-of-the-art MLLMs on MatCha reveals a significant performance gap compared to human experts. These models exhibit degradation when addressing questions requiring higher-level expertise and sophisticated visual perception. Simple few-shot and chain-of-thought prompting struggle to alleviate these limitations. These findings highlight that existing MLLMs still exhibit limited adaptability to real-world materials characterization scenarios. We hope MatCha will facilitate future research in areas such as new material discovery and autonomous scientific agents. MatCha is available at https://github.com/FreedomIntelligence/MatCha.

[53] You Share Beliefs, I Adapt: Progressive Heterogeneous Collaborative Perception

Hao Si,Ehsan Javanmardi,Manabu Tsukada

Main category: cs.CV

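TL;DR: PHCP is a new collaborative perception framework that addresses the practical challenges of heterogeneous vehicle collaboration by dynamically aligning features at inference time, without labeled data or joint training.

Details Motivation: In real-world scenarios, perception models on different vehicles are usually heterogeneous. Existing methods require joint training or pre-stored models and cannot be applied directly at inference. PHCP aims to enable efficient collaboration at inference time without these steps.

Contribution: The PHCP framework, which formulates heterogeneous collaborative perception as a few-shot unsupervised domain adaptation task and dynamically aligns features at inference by self-training an adapter.

Method: PHCP uses few-shot unsupervised domain adaptation, self-training an adapter during inference without labeled data or joint training.

Result: On the OPV2V dataset, PHCP performs strongly in heterogeneous scenarios, achieving performance comparable to SOTA methods trained on the entire dataset while using only a small amount of unlabeled data.

Insight: PHCP recasts collaborative perception as a dynamic adaptation problem at inference time, offering a feasible route to real-time heterogeneous collaboration.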

Abstract: Collaborative perception enables vehicles to overcome individual perception limitations by sharing information, allowing them to see further and through occlusions. In real-world scenarios, models on different vehicles are often heterogeneous due to manufacturer variations. Existing methods for heterogeneous collaborative perception address this challenge by fine-tuning adapters or the entire network to bridge the domain gap. However, these methods are impractical in real-world applications, as each new collaborator must undergo joint training with the ego vehicle on a dataset before inference, or the ego vehicle stores models for all potential collaborators in advance. Therefore, we pose a new question: Can we tackle this challenge directly during inference, eliminating the need for joint training? To answer this, we introduce Progressive Heterogeneous Collaborative Perception (PHCP), a novel framework that formulates the problem as few-shot unsupervised domain adaptation. Unlike previous work, PHCP dynamically aligns features by self-training an adapter during inference, eliminating the need for labeled data and joint training. Extensive experiments on the OPV2V dataset demonstrate that PHCP achieves strong performance across diverse heterogeneous scenarios. Notably, PHCP achieves performance comparable to SOTA methods trained on the entire dataset while using only a small amount of unlabeled data.

[54] Image Recognition with Vision and Language Embeddings of VLMs

Illia Volkov,Nikita Kisel,Klara Janouskova,Jiri Matas

Main category: cs.CV

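TL;DR: The paper comprehensively evaluates the vision-only inference capabilities of vision-language models (VLMs) and introduces a learning-free fusion method based on per-class precision that exploits the complementarity of language and vision to improve classification performance.

Details Motivation: VLMs excel at zero-shot classification, but their purely visual inference capabilities remain under-explored. The paper aims to fill this gap and to explore strategies for combining language and vision.

Contribution: 1. A comprehensive evaluation of vision-only and language-guided classification across a range of VLMs (such as SigLIP 2 and RADIOv2.5); 2. A learning-free fusion method based on per-class precision that combines the complementary strengths of language and vision to improve classification performance.

Method: The paper evaluates VLMs on the ImageNet-1k dataset, analyses the effect of prompt design, class diversity, the number of neighbours in k-NN, and reference set size, and proposes a fusion method based on per-class precision.

Result: Experiments show that language and vision are complementary for classification: some classes favour textual prompts while others are better handled by visual similarity. The proposed fusion method improves classification performance.

Insight: Vision and language embeddings each have their own strengths in image classification, and combining them can improve performance. The fusion method needs no additional training, which makes it practical.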

Abstract: Vision-language models (VLMs) have enabled strong zero-shot classification through image-text alignment. Yet, their purely visual inference capabilities remain under-explored. In this work, we conduct a comprehensive evaluation of both language-guided and vision-only image classification with a diverse set of dual-encoder VLMs, including both well-established and recent models such as SigLIP 2 and RADIOv2.5. The performance is compared in a standard setup on the ImageNet-1k validation set and its label-corrected variant. The key factors affecting accuracy are analysed, including prompt design, class diversity, the number of neighbours in k-NN, and reference set size. We show that language and vision offer complementary strengths, with some classes favouring textual prompts and others better handled by visual similarity. To exploit this complementarity, we introduce a simple, learning-free fusion method based on per-class precision that improves classification performance. The code is available at: https://github.com/gonikisgo/bmvc2025-vlm-image-recognition.
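The abstract does not spell out the fusion rule, so the following is only a sketch of one way a learning-free fusion based on per-class precision could work. It assumes per-class precision is estimated on a held-out labelled set for each branch (text prompts and k-NN), and that at test time the prediction from the branch with the higher precision for its predicted class is kept. The function names and toy data are hypothetical.

```python
from collections import defaultdict

def per_class_precision(preds, labels):
    """Precision per predicted class: correct predictions of c / all predictions of c."""
    predicted = defaultdict(int)
    correct = defaultdict(int)
    for p, y in zip(preds, labels):
        predicted[p] += 1
        if p == y:
            correct[p] += 1
    return {c: correct[c] / predicted[c] for c in predicted}

def fuse(text_pred, knn_pred, text_prec, knn_prec):
    """Assumed rule: keep the branch whose predicted class is more precise (ties go to text)."""
    if text_pred == knn_pred:
        return text_pred
    if knn_prec.get(knn_pred, 0.0) > text_prec.get(text_pred, 0.0):
        return knn_pred
    return text_pred

# Toy held-out set: the text branch over-predicts "cat", the k-NN branch is perfect.
held_labels = ["cat", "dog", "cat", "dog"]
held_text = ["cat", "cat", "cat", "dog"]
held_knn = ["cat", "dog", "cat", "dog"]
text_prec = per_class_precision(held_text, held_labels)
knn_prec = per_class_precision(held_knn, held_labels)

# The branches disagree; "cat" from text has precision 2/3, "dog" from k-NN has 1.0.
fused = fuse("cat", "dog", text_prec, knn_prec)
```

No parameters are trained here: the only quantities used are precision tables computed once from held-out predictions, which is what makes a rule of this kind learning-free.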

[55] Fine-Grained Customized Fashion Design with Image-into-Prompt benchmark and dataset from LMM

Hui Li,Yi You,Qiqi Chen,Bingfeng Zhang,George Q. Huang

Main category: cs.CV

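TL;DR: The paper proposes BUG, a workflow for fine-grained customized fashion design built on a large multimodal model (LMM). Using image-into-prompt conversion, it automatically generates and customizes clothing designs, addressing the design difficulties caused by the uncertainty of text input.

Details Motivation: Current generative AI models can easily turn ideas into designs, but for end users without professional background knowledge, fine-grained customization is still limited by the uncertainty of text input. The paper aims to use an LMM to lower the barrier to fashion design and improve the user experience.

Contribution: 1) The BUG workflow, which uses an LMM for automatic image-into-prompt conversion; 2) A new FashionEdit dataset that simulates the real-world clothing design workflow; 3) Released code and dataset to support further research.

Method: An LMM automatically generates design prompts, which are combined with image input to achieve fine-grained customization. The FashionEdit dataset is used to evaluate generation similarity, user satisfaction, and design quality.

Result: Experiments show that the BUG workflow improves design controllability and user satisfaction and lowers the barrier to design.

Insight: LMMs have considerable potential in fashion design, and the image-into-prompt mechanism offers a reference for fine-grained generation tasks in other domains.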

Abstract: Generative AI evolves the execution of complex workflows in industry, where the large multimodal model empowers fashion design in the garment industry. Current generation AI models magically transform brainstorming into fancy designs easily, but the fine-grained customization still suffers from text uncertainty without professional background knowledge from end-users. Thus, we propose the Better Understanding Generation (BUG) workflow with LMM to automatically create and fine-grain customize the cloth designs from chat with image-into-prompt. Our framework unleashes users’ creative potential beyond words and also lowers the barriers of clothing design/editing without further human involvement. To prove the effectiveness of our model, we propose a new FashionEdit dataset that simulates the real-world clothing design workflow, evaluated from generation similarity, user satisfaction, and quality. The code and dataset: https://github.com/detectiveli/FashionEdit.

[56] Exploring Pre-training Across Domains for Few-Shot Surgical Skill Assessment

Dimitrios Anastasiou,Razvan Caramalau,Nazir Sirajudeen,Matthew Boal,Philip Edwards,Justin Collins,John Kelly,Ashwin Sridhar,Maxine Tran,Faiz Mumtaz,Nevil Pavithran,Nader Francis,Danail Stoyanov,Evangelos B. Mazomenos

Main category: cs.CV

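TL;DR: The paper studies how self-supervised pre-training strategies affect few-shot surgical skill assessment (SSA). It finds that small but domain-relevant datasets outperform large but domain-mismatched ones, and shows that incorporating procedure-specific data into pre-training significantly improves performance.

Details Motivation: Surgical skill annotations are scarce and time-consuming to produce, which motivates few-shot learning (FSL) as an alternative. However, FSL depends on effective pre-training, and pre-training remains largely unexplored in SSA.

Contribution: 1. Formulates SSA as a few-shot learning problem; 2. Studies how different self-supervised pre-training strategies affect few-shot SSA performance; 3. Finds that small but domain-relevant pre-training datasets outperform large but domain-mismatched ones; 4. Shows that a pre-training strategy incorporating procedure-specific data can significantly improve performance.

Method: 1. Annotates a publicly available robotic surgery dataset with OSATS scores; 2. Evaluates different pre-training sources across three few-shot settings; 3. Quantifies domain similarity and analyses how the domain gap and procedure-specific data influence the transferability of pre-training.

Result: Accuracies of 60.16%, 66.03%, and 73.65% in the 1-shot, 2-shot, and 5-shot settings, respectively. Incorporating procedure-specific data into pre-training yields an average gain of 1.22% in accuracy and 2.28% in F1-score.

Insight: Domain relevance matters more than dataset scale. Adding procedure-specific data can significantly improve performance, but only when the domains are well matched.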

Abstract: Automated surgical skill assessment (SSA) is a central task in surgical computer vision. Developing robust SSA models is challenging due to the scarcity of skill annotations, which are time-consuming to produce and require expert consensus. Few-shot learning (FSL) offers a scalable alternative enabling model development with minimal supervision, though its success critically depends on effective pre-training. While widely studied for several surgical downstream tasks, pre-training has remained largely unexplored in SSA. In this work, we formulate SSA as a few-shot task and investigate how self-supervised pre-training strategies affect downstream few-shot SSA performance. We annotate a publicly available robotic surgery dataset with Objective Structured Assessment of Technical Skill (OSATS) scores, and evaluate various pre-training sources across three few-shot settings. We quantify domain similarity and analyze how domain gap and the inclusion of procedure-specific data into pre-training influence transferability. Our results show that small but domain-relevant datasets can outperform large scale, less aligned ones, achieving accuracies of 60.16%, 66.03%, and 73.65% in the 1-, 2-, and 5-shot settings, respectively. Moreover, incorporating procedure-specific data into pre-training with a domain-relevant external dataset significantly boosts downstream performance, with an average gain of +1.22% in accuracy and +2.28% in F1-score; however, applying the same strategy with less similar but large-scale sources can instead lead to performance degradation. Code and models are available at https://github.com/anastadimi/ssa-fsl.

[57] Classification of Driver Behaviour Using External Observation Techniques for Autonomous Vehicles

Ian Nell,Shane Gilroy

Main category: cs.CV

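TL;DR: The paper proposes a driver behaviour classification system based on external observation techniques. It uses computer vision to detect distracted and impaired driving, works for non-connected vehicles, and shows good reliability and adaptability.

Details Motivation: Road traffic accidents are caused largely by human error, such as distracted and impaired driving. Existing systems mostly rely on inter-vehicular communication and cannot cover non-connected vehicles. The paper aims to develop a vision-based solution that does not depend on vehicle communication.

Contribution: 1. A driver behaviour classification system based on external observation; 2. A combination of YOLO object detection and custom lane estimation algorithms to detect and analyse unsafe driving behaviour; 3. Applicability to non-connected vehicles.

Method: Computer vision methods, including real-time object tracking, lateral displacement analysis, and lane position monitoring, combined with the YOLO model and custom lane estimation algorithms.

Result: Experiments on diverse video datasets show that the system is reliable and adaptable across different road and environmental conditions.

Insight: Vision-based methods offer a non-intrusive solution for driver behaviour analysis that is particularly suited to non-connected vehicles and can help improve road safety.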

Abstract: Road traffic accidents remain a significant global concern, with human error, particularly distracted and impaired driving, among the leading causes. This study introduces a novel driver behavior classification system that uses external observation techniques to detect indicators of distraction and impairment. The proposed framework employs advanced computer vision methodologies, including real-time object tracking, lateral displacement analysis, and lane position monitoring. The system identifies unsafe driving behaviors such as excessive lateral movement and erratic trajectory patterns by implementing the YOLO object detection model and custom lane estimation algorithms. Unlike systems reliant on inter-vehicular communication, this vision-based approach enables behavioral analysis of non-connected vehicles. Experimental evaluations on diverse video datasets demonstrate the framework’s reliability and adaptability across varying road and environmental conditions.

[58] FLUX-Reason-6M & PRISM-Bench: A Million-Scale Text-to-Image Reasoning Dataset and Comprehensive Benchmark

Rongyao Fang,Aldrich Yu,Chengqi Duan,Linjiang Huang,Shuai Bai,Yuxuan Cai,Kun Wang,Si Liu,Xihui Liu,Hongsheng Li

Main category: cs.CV

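TL;DR: The paper introduces FLUX-Reason-6M, a large-scale reasoning-oriented text-to-image dataset, and PRISM-Bench, a comprehensive evaluation benchmark, to address the shortfall of open-source models in complex reasoning.

Details Motivation: Open-source text-to-image (T2I) models lag behind closed-source systems because large-scale reasoning-oriented datasets and comprehensive evaluation standards are lacking. The authors hope that a large-scale dataset and benchmark will advance the open-source community.

Contribution: 1. The FLUX-Reason-6M dataset (6 million high-quality images and 20 million bilingual descriptions); 2. The PRISM-Bench evaluation standard (7 evaluation tracks); 3. An explicit Generation Chain-of-Thought (GCoT) and a Long Text challenge.

Method: 1. FLUX-Reason-6M organizes data by six key characteristics (Imagination, Entity, Text rendering, Style, Affection, Composition) and uses GCoT to describe image generation steps in detail; 2. PRISM-Bench uses advanced vision-language models and carefully designed prompts for multi-dimensional evaluation.

Result: An evaluation of 19 leading models reveals performance gaps and directions for improvement. The dataset and benchmark provide new resources for the community.

Insight: The work addresses the gap in reasoning ability of open-source T2I models, in particular through GCoT and the Long Text challenge, which target a model's understanding of complex prompts.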

Abstract: The advancement of open-source text-to-image (T2I) models has been hindered by the absence of large-scale, reasoning-focused datasets and comprehensive evaluation benchmarks, resulting in a performance gap compared to leading closed-source systems. To address this challenge, we introduce FLUX-Reason-6M and PRISM-Bench (Precise and Robust Image Synthesis Measurement Benchmark). FLUX-Reason-6M is a massive dataset consisting of 6 million high-quality FLUX-generated images and 20 million bilingual (English and Chinese) descriptions specifically designed to teach complex reasoning. The images are organized according to six key characteristics: Imagination, Entity, Text rendering, Style, Affection, and Composition, and we design an explicit Generation Chain-of-Thought (GCoT) to provide detailed breakdowns of image generation steps. The whole data curation takes 15,000 A100 GPU days, providing the community with a resource previously unattainable outside of large industrial labs. PRISM-Bench offers a novel evaluation standard with seven distinct tracks, including a formidable Long Text challenge using GCoT. Through carefully designed prompts, it utilizes advanced vision-language models for nuanced human-aligned assessment of prompt-image alignment and image aesthetics. Our extensive evaluation of 19 leading models on PRISM-Bench reveals critical performance gaps and highlights specific areas requiring improvement. Our dataset, benchmark, and evaluation code are released to catalyze the next wave of reasoning-oriented T2I generation. Project page: https://flux-reason-6m.github.io/

[59] Plug-and-play Diffusion Models for Image Compressive Sensing with Data Consistency Projection

Xiaodong Wang,Ping Wang,Zhangyuan Li,Xin Yuan

Main category: cs.CV

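TL;DR: The paper decouples the diffusion process into three stages (denoising, data consistency, and sampling) and proposes a unified framework combining PnP methods with DDIM to solve inverse problems in single-pixel imaging.

Details Motivation: To study the connection between PnP methods and diffusion models (DDIM in particular) for ill-posed inverse problems such as single-pixel imaging, with the aim of integrating learned priors with physical forward models.

Contribution: 1. A decoupling of the diffusion process into three interpretable stages; 2. A hybrid data-consistency module that improves reconstruction quality by linearly combining multiple PnP-style fidelity terms.

Method: 1. Decouple the diffusion process into denoising, data consistency, and sampling stages; 2. Introduce a hybrid data-consistency module that combines the physical model with learned priors.

Result: The method achieves better reconstruction quality on single-pixel imaging tasks.

Insight: With decoupling and a unified framework, diffusion models can combine domain knowledge with data-driven methods more effectively when solving inverse problems.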

Abstract: We explore the connection between Plug-and-Play (PnP) methods and Denoising Diffusion Implicit Models (DDIM) for solving ill-posed inverse problems, with a focus on single-pixel imaging. We begin by identifying key distinctions between PnP and diffusion models, particularly in their denoising mechanisms and sampling procedures. By decoupling the diffusion process into three interpretable stages: denoising, data consistency enforcement, and sampling, we provide a unified framework that integrates learned priors with physical forward models in a principled manner. Building upon this insight, we propose a hybrid data-consistency module that linearly combines multiple PnP-style fidelity terms. This hybrid correction is applied directly to the denoised estimate, improving measurement consistency without disrupting the diffusion sampling trajectory. Experimental results on single-pixel imaging tasks demonstrate that our method achieves better reconstruction quality.
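To make the idea of linearly combining fidelity terms concrete, here is a minimal sketch under strong simplifying assumptions. The abstract does not say which fidelity terms are combined; this sketch assumes two common ones, a gradient step on the squared residual and an exact projection onto the measurement constraint, for a single measurement y = a·x (one illumination pattern `a`). The function names, the weight `lam`, and the step size `eta` are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(ui * vi for ui, vi in zip(u, v))

def gradient_step(x, a, y, eta):
    """One gradient-descent step on 0.5 * (a.x - y)^2."""
    residual = dot(a, x) - y
    return [xi - eta * ai * residual for xi, ai in zip(x, a)]

def projection_step(x, a, y):
    """Orthogonal projection of x onto the hyperplane {z : a.z = y}."""
    scale = (dot(a, x) - y) / dot(a, a)
    return [xi - ai * scale for xi, ai in zip(x, a)]

def hybrid_consistency(x_denoised, a, y, lam, eta):
    """Assumed hybrid correction: lam * projection + (1 - lam) * gradient step."""
    grad = gradient_step(x_denoised, a, y, eta)
    proj = projection_step(x_denoised, a, y)
    return [lam * p + (1.0 - lam) * g for p, g in zip(proj, grad)]

# Toy example: a 3-pixel "image" observed through one pattern.
x_denoised = [0.5, 0.5, 0.5]   # stand-in for the denoiser output
a = [1.0, 0.0, 1.0]            # measurement pattern
y = 2.0                        # measured value
x_corrected = hybrid_consistency(x_denoised, a, y, lam=0.5, eta=0.1)
```

In this toy setting the correction only touches the pixels the pattern observes and reduces the measurement residual from 1.0 to 0.4; setting `lam=1.0` would enforce the measurement exactly, while smaller values stay closer to the denoised estimate.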

[60] A Fully Automatic Framework for Intracranial Pressure Grading: Integrating Keyframe Identification, ONSD Measurement and Clinical Data

Pengxu Wen,Tingting Yu,Ziwei Nie,Cheng Jiang,Zhenyu Yin,Mingyang He,Bo Liao,Xiaoping Yang

Main category: cs.CV

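TL;DR: The paper proposes a fully automatic framework for intracranial pressure grading that combines keyframe identification, ONSD measurement, and clinical data, significantly improving accuracy and reliability.

Details Motivation: Current invasive methods for monitoring intracranial pressure (ICP) carry risks, while non-invasive methods based on optic nerve sheath diameter (ONSD) lack reliability because of inconsistent operation and subjectivity.

Contribution: A fully automatic two-stage framework that combines keyframe identification and ONSD measurement with clinical data to achieve accurate, objective ICP grading.

Method: The framework consists of a fundus ultrasound video processing stage (keyframe identification and ONSD measurement) and an ICP grading stage (fusing ONSD with clinical data).

Result: Validation accuracy is 0.845±0.071 and independent test accuracy is 0.786, significantly better than the conventional threshold-based method (0.637±0.111 and 0.429).

Insight: By reducing operator variability and combining multi-source data, the study provides a reliable tool for non-invasive ICP assessment that may improve the management of acute neurological conditions.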

Abstract: Intracranial pressure (ICP) elevation poses severe threats to cerebral function, thus necessitating monitoring for timely intervention. While lumbar puncture is the gold standard for ICP measurement, its invasiveness and associated risks drive the need for non-invasive alternatives. Optic nerve sheath diameter (ONSD) has emerged as a promising biomarker, as elevated ICP directly correlates with increased ONSD. However, current clinical practices for ONSD measurement suffer from inconsistency in manual operation, subjectivity in optimal view selection, and variability in thresholding, limiting their reliability. To address these challenges, we introduce a fully automatic two-stage framework for ICP grading, integrating keyframe identification, ONSD measurement and clinical data. Specifically, the fundus ultrasound video processing stage performs frame-level anatomical segmentation, rule-based keyframe identification guided by an international consensus statement, and precise ONSD measurement. The intracranial pressure grading stage then fuses ONSD metrics with clinical features to enable the prediction of ICP grades, thereby demonstrating an innovative blend of interpretable ultrasound analysis and multi-source data integration for objective clinical evaluation. Experimental results demonstrate that our method achieves a validation accuracy of $0.845 \pm 0.071$ (with standard deviation from five-fold cross-validation) and an independent test accuracy of 0.786, significantly outperforming conventional threshold-based method ($0.637 \pm 0.111$ validation accuracy, $0.429$ test accuracy). Through effectively reducing operator variability and integrating multi-source information, our framework establishes a reliable non-invasive approach for clinical ICP evaluation, holding promise for improving patient management in acute neurological conditions.

[61] Decoupling Clinical and Class-Agnostic Features for Reliable Few-Shot Adaptation under Shift

Umaima Rahman,Raza Imam,Mohammad Yaqub,Dwarikanath Mahapatra

Main category: cs.CV

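TL;DR: DRiFt improves the reliability and generalization of medical vision-language models (VLMs) under distribution shift by decoupling clinical features from task-agnostic features.

Details Motivation: Medical VLMs are unreliable under distribution shift and tend to learn task-agnostic spurious correlations, which limits their use in real clinical settings.

Contribution: The DRiFt framework, which uses parameter-efficient tuning (LoRA) and learnable prompt tokens to separate clinically relevant features from task-agnostic noise, improving generalization and robustness.

Method: LoRA and prompt tokens decouple the features, and high-quality medical image-text pairs are generated to improve cross-modal alignment.

Result: In-distribution Top-1 accuracy improves by 11.4% and Macro-F1 by 3.3%, with stable performance on unseen datasets.

Insight: Feature decoupling and alignment significantly improve generalization and reduce unpredictable behaviour under domain shift, pointing the way toward safer medical VLMs.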

Abstract: Medical vision-language models (VLMs) offer promise for clinical decision support, yet their reliability under distribution shifts remains a major concern for safe deployment. These models often learn task-agnostic correlations due to variability in imaging protocols and free-text reports, limiting their generalizability and increasing the risk of failure in real-world settings. We propose DRiFt, a structured feature decoupling framework that explicitly separates clinically relevant signals from task-agnostic noise using parameter-efficient tuning (LoRA) and learnable prompt tokens. To enhance cross-modal alignment and reduce uncertainty, we curate high-quality, clinically grounded image-text pairs by generating captions for a diverse medical dataset. Our approach improves in-distribution performance by +11.4% Top-1 accuracy and +3.3% Macro-F1 over prior prompt-based methods, while maintaining strong robustness across unseen datasets. Ablation studies reveal that disentangling task-relevant features and careful alignment significantly enhance model generalization and reduce unpredictable behavior under domain shift. These insights contribute toward building safer, more trustworthy VLMs for clinical use. The code is available at https://github.com/rumaima/DRiFt.

[62] FS-Diff: Semantic guidance and clarity-aware simultaneous multimodal image fusion and super-resolution

Yuchan Jie,Yushen Xu,Xiaosong Li,Fuqiang Zhou,Jianming Lv,Huafeng Li

Main category: cs.CV

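TL;DR: FS-Diff is a semantic-guided, clarity-aware method for joint multimodal image fusion and super-resolution. It unifies the two tasks as a conditional generation problem and uses a modified U-Net to produce high-quality results.

Details Motivation: In real-world applications such as military reconnaissance, target and background structures in multimodal images are easily corrupted and carry weak semantic information. Existing methods perform poorly, so a method is needed that improves resolution and semantic information at the same time.

Contribution: 1. The FS-Diff framework, which unifies image fusion and super-resolution as a conditional generation problem; 2. A clarity sensing mechanism and a bidirectional feature Mamba; 3. The AVMS benchmark dataset.

Method: 1. Initialize the fused result as Gaussian noise; 2. Extract global features with the bidirectional feature Mamba; 3. Use a modified U-Net to denoise at multiple noise levels and generate high-resolution fused results.

Result: On several public datasets and on AVMS, FS-Diff outperforms existing methods on both fusion and super-resolution, recovering more detail and semantic information.

Insight: Through semantic guidance and clarity awareness, FS-Diff provides an end-to-end solution for multimodal image processing that is particularly suited to low-resolution, high-noise scenarios.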

Abstract: As an influential information fusion and low-level vision technique, image fusion integrates complementary information from source images to yield an informative fused image. A few attempts have been made in recent years to jointly realize image fusion and super-resolution. However, in real-world applications such as military reconnaissance and long-range detection missions, the target and background structures in multimodal images are easily corrupted, with low resolution and weak semantic information, which leads to suboptimal results in current fusion techniques. In response, we propose FS-Diff, a semantic guidance and clarity-aware joint image fusion and super-resolution method. FS-Diff unifies image fusion and super-resolution as a conditional generation problem. It leverages semantic guidance from the proposed clarity sensing mechanism for adaptive low-resolution perception and cross-modal feature extraction. Specifically, we initialize the desired fused result as pure Gaussian noise and introduce the bidirectional feature Mamba to extract the global features of the multimodal images. Moreover, utilizing the source images and semantics as conditions, we implement a random iterative denoising process via a modified U-Net network. This network is trained for denoising at multiple noise levels to produce high-resolution fusion results with cross-modal features and abundant semantic information. We also construct a powerful aerial view multiscene (AVMS) benchmark covering 600 pairs of images. Extensive joint image fusion and super-resolution experiments on six public and our AVMS datasets demonstrated that FS-Diff outperforms the state-of-the-art methods at multiple magnifications and can recover richer details and semantics in the fused images. The code is available at https://github.com/XylonXu01/FS-Diff.

[63] FlexiD-Fuse: Flexible number of inputs multi-modal medical image fusion based on diffusion model

Yushen Xu,Xiaosong Li,Yuchun Wang,Xiaoqi Cheng,Huafeng Li,Haishu Tan

Main category: cs.CV

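TL;DR: FlexiD-Fuse is a diffusion-based multi-modal medical image fusion method that supports a flexible number of input modalities. It overcomes the restriction of existing methods to a fixed number of inputs and outperforms them.

Details Motivation: Existing medical image fusion methods can only handle a fixed number of input modalities (such as two-modal or tri-modal) and cannot adapt directly to varying input counts, which limits clinical application. The study aims to solve this problem.

Contribution: FlexiD-Fuse, a diffusion-based network that supports a flexible number of input modalities and generates high-quality fused images through maximum likelihood estimation and hierarchical Bayesian modeling.

Method: The diffusion fusion problem is recast as a maximum likelihood estimation problem based on the diffusion process and hierarchical Bayesian modeling. The EM algorithm is embedded into the diffusion sampling process, supporting an arbitrary number of input images.

Result: Experiments on Harvard datasets and several tasks show that FlexiD-Fuse achieves the best performance in medical image fusion with a flexible number of inputs and outperforms other SOTA methods on the extension tasks.

Insight: By combining diffusion models with Bayesian modeling, FlexiD-Fuse shows the potential to handle a varying number of inputs in multi-modal image fusion, offering a more flexible solution for clinical use.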

Abstract: Different modalities of medical images provide unique physiological and anatomical information for diseases. Multi-modal medical image fusion integrates useful information from different complementary medical images with different modalities, producing a fused image that comprehensively and objectively reflects lesion characteristics to assist doctors in clinical diagnosis. However, existing fusion methods can only handle a fixed number of modality inputs, such as accepting only two-modal or tri-modal inputs, and cannot directly process varying input quantities, which hinders their application in clinical settings. To tackle this issue, we introduce FlexiD-Fuse, a diffusion-based image fusion network designed to accommodate flexible quantities of input modalities. It can process two-modal and tri-modal medical image fusion end-to-end under the same weights. FlexiD-Fuse transforms the diffusion fusion problem, which supports only fixed-condition inputs, into a maximum likelihood estimation problem based on the diffusion process and hierarchical Bayesian modeling. By incorporating the Expectation-Maximization algorithm into the diffusion sampling iteration process, FlexiD-Fuse can generate high-quality fused images with cross-modal information from source images, independently of the number of input images. We compared our method with the latest two-modal and tri-modal medical image fusion methods, tested them on Harvard datasets, and evaluated them using nine popular metrics. The experimental results show that our method achieves the best performance in medical image fusion with varying inputs. Meanwhile, we conducted extensive extension experiments on infrared-visible, multi-exposure, and multi-focus image fusion tasks with arbitrary numbers of inputs, and compared them with the respective SOTA methods. The results of the extension experiments consistently demonstrate the effectiveness and superiority of our method.

[64] OpenFake: An Open Dataset and Platform Toward Large-Scale Deepfake Detection

Victor Livernoche,Akshatha Arodi,Andreea Musulan,Zachary Yang,Adam Salvail,Gaétan Marceau Caron,Jean-François Godbout,Reihaneh Rabbany

Main category: cs.CV

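TL;DR: The paper presents OpenFake, a dataset and platform intended to advance deepfake detection through a large-scale, high-quality synthetic image dataset and a crowdsourced adversarial platform.

Details Motivation: Deepfake technology is advancing rapidly and intensifying the spread of misinformation, especially in politically sensitive contexts. Existing detection datasets are often small, rely on outdated generation methods, or lack diversity, so they offer limited support for modern detection techniques.

Contribution: (1) A deepfake detection dataset designed for politically sensitive content, containing 3 million real images with descriptive captions and 963k high-quality synthetic images; (2) A crowdsourced adversarial platform that incentivizes the community to keep generating challenging synthetic images so that detection methods remain robust.

Method: (1) Analyse social media posts to identify the modalities through which deepfakes spread; (2) Assess the quality of modern generative models with a human perception study; (3) Build a large-scale dataset and keep updating adversarial samples through the crowdsourced platform.

Result: Synthetic images produced by modern proprietary models are markedly harder to distinguish from real images, which demonstrates the need for the dataset and platform.

Insight: The paper highlights the continuing improvement in deepfake realism and the importance of community-driven adversarial approaches for countering misinformation threats over the long term.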

Abstract: Deepfakes, synthetic media created using advanced AI techniques, have intensified the spread of misinformation, particularly in politically sensitive contexts. Existing deepfake detection datasets are often limited, relying on outdated generation methods, low realism, or single-face imagery, restricting the effectiveness for general synthetic image detection. By analyzing social media posts, we identify multiple modalities through which deepfakes propagate misinformation. Furthermore, our human perception study demonstrates that recently developed proprietary models produce synthetic images increasingly indistinguishable from real ones, complicating accurate identification by the general public. Consequently, we present a comprehensive, politically-focused dataset specifically crafted for benchmarking detection against modern generative models. This dataset contains three million real images paired with descriptive captions, which are used for generating 963k corresponding high-quality synthetic images from a mix of proprietary and open-source models. Recognizing the continual evolution of generative techniques, we introduce an innovative crowdsourced adversarial platform, where participants are incentivized to generate and submit challenging synthetic images. This ongoing community-driven initiative ensures that deepfake detection methods remain robust and adaptive, proactively safeguarding public discourse from sophisticated misinformation threats.

[65] Region-Wise Correspondence Prediction between Manga Line Art Images

Yingxuan Li,Jiafeng Mao,Qianru Qiu,Yusuke Matsui

Main category: cs.CV

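TL;DR: The paper proposes a Transformer-based framework for predicting region-wise correspondence between unlabeled manga line art images, supporting downstream applications such as colorization and in-between frame generation.

Details Motivation: Region-wise correspondence in manga processing has received little study, especially without annotations or segmentation, yet the task is crucial for applications such as automatic colorization and in-between frame generation.

Contribution: 1. A novel and practical task: predicting region-wise correspondence between unlabeled manga line art images; 2. A Transformer framework that addresses the task by learning patch-level similarities across images; 3. An automatic annotation pipeline and benchmark datasets.

Method: 1. Divide each line art image into patches; 2. Learn patch-level similarities with a Transformer-based framework; 3. Convert patch-level predictions into region-level correspondences with edge-aware clustering and a region matching algorithm.

Result: Tested on multiple datasets, the method reaches a patch-level accuracy of 96.34% and produces consistent region-level correspondences.

Insight: Region correspondence between unlabeled manga line art images can be obtained effectively through patch-level learning and region matching, offering a new approach for manga processing tasks.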

Abstract: Understanding region-wise correspondence between manga line art images is a fundamental task in manga processing, enabling downstream applications such as automatic line art colorization and in-between frame generation. However, this task remains largely unexplored, especially in realistic scenarios without pre-existing segmentation or annotations. In this paper, we introduce a novel and practical task: predicting region-wise correspondence between raw manga line art images without any pre-existing labels or masks. To tackle this problem, we divide each line art image into a set of patches and propose a Transformer-based framework that learns patch-level similarities within and across images. We then apply edge-aware clustering and a region matching algorithm to convert patch-level predictions into coherent region-level correspondences. To support training and evaluation, we develop an automatic annotation pipeline and manually refine a subset of the data to construct benchmark datasets. Experiments on multiple datasets demonstrate that our method achieves high patch-level accuracy (e.g., 96.34%) and generates consistent region-level correspondences, highlighting its potential for real-world manga applications.
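The final step described above, turning patch-level predictions into region-level correspondences, can be sketched as a simple majority vote. This is a simplified stand-in rather than the paper's region matching algorithm: it assumes each patch has already been assigned to a region (for example by the clustering step) and that the model outputs one matched patch in image B for each patch in image A. All names and the toy data are hypothetical.

```python
from collections import Counter, defaultdict

def match_regions(patch_matches, region_of_a, region_of_b):
    """Map each region of image A to the region of image B that most of its patches match.

    patch_matches: list of (patch_id_in_A, patch_id_in_B) pairs predicted by a model.
    region_of_a / region_of_b: dicts from patch id to region id in each image.
    """
    votes = defaultdict(Counter)
    for patch_a, patch_b in patch_matches:
        votes[region_of_a[patch_a]][region_of_b[patch_b]] += 1
    # Keep the most-voted target region for every source region.
    return {region: counts.most_common(1)[0][0] for region, counts in votes.items()}

# Toy example: image A has regions A1 (3 patches) and A2 (1 patch).
region_of_a = {0: "A1", 1: "A1", 2: "A1", 3: "A2"}
region_of_b = {0: "B1", 1: "B1", 2: "B2", 3: "B2"}
patch_matches = [(0, 0), (1, 1), (2, 2), (3, 3)]  # patch 2 is a noisy match

correspondence = match_regions(patch_matches, region_of_a, region_of_b)
```

Voting makes the region-level result tolerant of a few wrong patch matches: region A1 still maps to B1 even though one of its three patches was matched into B2.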

[66] DualTrack: Sensorless 3D Ultrasound needs Local and Global Context

Paul F. R. Wilson,Matteo Ronchetti,Rüdiger Göbl,Viktoria Markova,Sebastian Rosenzweig,Raphael Prevost,Parvin Mousavi,Oliver Zettinig

Main category: cs.CV

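TL;DR: The paper proposes DualTrack, a dual-encoder architecture that processes the local and global features of ultrasound images separately to improve 3D reconstruction accuracy.

Details Motivation: Traditional 3D ultrasound systems are costly and complex. Deep-learning-based sensorless 3D ultrasound needs to capture both local features (such as speckle patterns) and global features (such as anatomical structures), but existing methods do not adequately decouple the two.

Contribution: DualTrack, which extracts features with separate local and global encoders and then estimates the 3D probe trajectory through a lightweight fusion module, achieving more accurate 3D reconstruction.

Method: The local encoder uses dense spatiotemporal convolutions to capture fine-grained features. The global encoder combines a 2D CNN or foundation model with temporal attention layers to extract high-level features. A fusion module then integrates the two.

Result: On a public benchmark, DualTrack achieves an average reconstruction error below 5 mm, outperforming existing methods.

Insight: Decoupling local and global feature extraction can significantly improve sensorless 3D ultrasound reconstruction accuracy, indicating that the two are complementary in medical image analysis.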

Abstract: Three-dimensional ultrasound (US) offers many clinical advantages over conventional 2D imaging, yet its widespread adoption is limited by the cost and complexity of traditional 3D systems. Sensorless 3D US, which uses deep learning to estimate a 3D probe trajectory from a sequence of 2D US images, is a promising alternative. Local features, such as speckle patterns, can help predict frame-to-frame motion, while global features, such as coarse shapes and anatomical structures, can situate the scan relative to anatomy and help predict its general shape. In prior approaches, global features are either ignored or tightly coupled with local feature extraction, restricting the ability to robustly model these two complementary aspects. We propose DualTrack, a novel dual-encoder architecture that leverages decoupled local and global encoders specialized for their respective scales of feature extraction. The local encoder uses dense spatiotemporal convolutions to capture fine-grained features, while the global encoder utilizes an image backbone (e.g., a 2D CNN or foundation model) and temporal attention layers to embed high-level anatomical features and long-range dependencies. A lightweight fusion module then combines these features to estimate the trajectory. Experimental results on a large public benchmark show that DualTrack achieves state-of-the-art accuracy and globally consistent 3D reconstructions, outperforming previous methods and yielding an average reconstruction error below 5 mm.

[67] Improving Video Diffusion Transformer Training by Multi-Feature Fusion and Alignment from Self-Supervised Vision Encoders

Dohun Lee,Hyeonho Jeong,Jiwook Kim,Duygu Ceylan,Jong Chul Ye

Main category: cs.CV

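TL;DR: The paper proposes Align4Gen, a method that improves video diffusion Transformer training through multi-feature fusion and alignment with self-supervised vision encoders. The method significantly improves video generation quality.

Details Motivation: Existing video diffusion models have advanced through architectural innovations and new training objectives, but improving their feature representation power has received less attention. The paper aims to improve video generation quality by aligning with features from pre-trained vision encoders.

Contribution: The Align4Gen method, which improves video diffusion model training through multi-feature fusion and alignment, and a new metric for analysing the discriminability and temporal consistency of vision encoders.

Method: Based on an analysis of vision encoders, Align4Gen introduces a multi-feature fusion and alignment mechanism into video diffusion model training that draws on features from pre-trained vision encoders.

Result: On unconditional and class-conditional video generation tasks, Align4Gen significantly improves the quality of generated videos, as confirmed by various metrics.

Insight: Aligning with features from pre-trained vision encoders is an effective route to improving video diffusion models, and multi-feature fusion further strengthens generation.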

Abstract: Video diffusion models have advanced rapidly in recent years as a result of a series of architectural innovations (e.g., diffusion transformers) and the use of novel training objectives (e.g., flow matching). In contrast, less attention has been paid to improving the feature representation power of such models. In this work, we show that training video diffusion models can benefit from aligning the intermediate features of the video generator with feature representations of pre-trained vision encoders. We propose a new metric and conduct an in-depth analysis of various vision encoders to evaluate their discriminability and temporal consistency, thereby assessing their suitability for video feature alignment. Based on the analysis, we present Align4Gen, which provides a novel multi-feature fusion and alignment method integrated into video diffusion model training. We evaluate Align4Gen for both unconditional and class-conditional video generation tasks and show that it results in improved video generation as quantified by various metrics. Full video results are available on our project page: https://align4gen.github.io/align4gen/

[68] Invisible Attributes, Visible Biases: Exploring Demographic Shortcuts in MRI-based Alzheimer’s Disease Classification

Akshit Achara,Esther Puyol Anton,Alexander Hammers,Andrew P. King

Main category: cs.CV

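TL;DR: This paper examines shortcut learning and demographic bias in deep learning for MRI-based Alzheimer's disease (AD) classification. It reveals race- and sex-related distribution shifts and drops in model performance, and uses feature attribution analysis to lay a foundation for fairer diagnostic tools.

Details Motivation: Deep learning methods for MRI-assisted AD diagnosis may suffer from shortcut learning, which can lead to performance bias by race and sex and undermine fairness.

Contribution: 1) Shows that DL models can identify race and sex from MRI; 2) shows that training set imbalance causes drops in model performance; 3) pinpoints the sources of bias through feature attribution analysis.

Method: Uses multiple DL models (ResNet, SwinTransformer) and multiple datasets to analyze how well sex and race can be classified from MRI, and probes shortcut learning and bias through training set imbalance experiments and feature attribution analysis.

Result: Experiments demonstrate race- and sex-related shortcut learning and performance bias in AD classification, manifested as model reliance on features from certain brain regions.

Insight: Implicit demographic differences in MRI data can lead to model bias; future work needs to design fairer diagnostic tools to avoid discrimination against minority groups.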

Abstract: Magnetic resonance imaging (MRI) is the gold standard for brain imaging. Deep learning (DL) algorithms have been proposed to aid in the diagnosis of diseases such as Alzheimer’s disease (AD) from MRI scans. However, DL algorithms can suffer from shortcut learning, in which spurious features, not directly related to the output label, are used for prediction. When these features are related to protected attributes, they can lead to performance bias against underrepresented protected groups, such as those defined by race and sex. In this work, we explore the potential for shortcut learning and demographic bias in DL based AD diagnosis from MRI. We first investigate if DL algorithms can identify race or sex from 3D brain MRI scans to establish the presence or otherwise of race and sex based distributional shifts. Next, we investigate whether training set imbalance by race or sex can cause a drop in model performance, indicating shortcut learning and bias. Finally, we conduct a quantitative and qualitative analysis of feature attributions in different brain regions for both the protected attribute and AD classification tasks. Through these experiments, and using multiple datasets and DL models (ResNet and SwinTransformer), we demonstrate the existence of both race and sex based shortcut learning and bias in DL based AD classification. Our work lays the foundation for fairer DL diagnostic tools in brain MRI. The code is provided at https://github.com/acharaakshit/ShortMR

[69] PeftCD: Leveraging Vision Foundation Models with Parameter-Efficient Fine-Tuning for Remote Sensing Change Detection

Sijun Dong,Yuxuan Hu,LiBo Wang,Geng Chen,Xiaoliang Meng

Main category: cs.CV

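TL;DR: PeftCD is a remote sensing change detection framework built on Vision Foundation Models (VFMs) and Parameter-Efficient Fine-Tuning (PEFT). It integrates LoRA and Adapter modules for efficient task adaptation, uses SAM2 and DINOv3 as backbones, and achieves SOTA performance on multiple datasets.

Details Motivation: Addresses pseudo changes, scarce labeled samples, and difficult cross-domain generalization in multi-temporal, multi-source remote sensing imagery.

Contribution: Proposes PeftCD, an efficient fine-tuning framework that integrates LoRA and Adapter modules with SAM2 and DINOv3 backbones and achieves the best performance on multiple datasets.

Method: Uses a weight-sharing Siamese encoder with integrated LoRA and Adapter modules, plus a lightweight decoder that keeps the focus on the backbone's feature representations.

Result: Achieves SOTA performance on SYSU-CD, WHUCD, and other datasets, with precise boundary delineation and strong suppression of pseudo changes.

Insight: PeftCD shows an optimal balance of accuracy, efficiency, and generalization, offering a template for applying large-scale VFMs to remote sensing change detection.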

Abstract: To tackle the prevalence of pseudo changes, the scarcity of labeled samples, and the difficulty of cross-domain generalization in multi-temporal and multi-source remote sensing imagery, we propose PeftCD, a change detection framework built upon Vision Foundation Models (VFMs) with Parameter-Efficient Fine-Tuning (PEFT). At its core, PeftCD employs a weight-sharing Siamese encoder derived from a VFM, into which LoRA and Adapter modules are seamlessly integrated. This design enables highly efficient task adaptation by training only a minimal set of additional parameters. To fully unlock the potential of VFMs, we investigate two leading backbones: the Segment Anything Model v2 (SAM2), renowned for its strong segmentation priors, and DINOv3, a state-of-the-art self-supervised representation learner. The framework is complemented by a deliberately lightweight decoder, ensuring the focus remains on the powerful feature representations from the backbones. Extensive experiments demonstrate that PeftCD achieves state-of-the-art performance across multiple public datasets, including SYSU-CD (IoU 73.81%), WHUCD (92.05%), MSRSCD (64.07%), MLCD (76.89%), CDD (97.01%), S2Looking (52.25%) and LEVIR-CD (85.62%), with notably precise boundary delineation and strong suppression of pseudo-changes. In summary, PeftCD presents an optimal balance of accuracy, efficiency, and generalization. It offers a powerful and scalable paradigm for adapting large-scale VFMs to real-world remote sensing change detection applications. The code and pretrained models will be released at https://github.com/dyzy41/PeftCD.
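
The idea of "training only a minimal set of additional parameters" can be illustrated with a generic low-rank adapter. The sketch below is not PeftCD's implementation: the function names, shapes, and the `scale` parameter are assumptions chosen for illustration, and it uses plain Python lists rather than a deep learning framework.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_forward(x, w_frozen, lora_a, lora_b, scale=1.0):
    """Compute y = x W + scale * (x A) B.

    W (d_in x d_out) stays frozen; only A (d_in x r) and B (r x d_out)
    would be trained. All names here are hypothetical.
    """
    base = matmul(x, w_frozen)
    update = matmul(matmul(x, lora_a), lora_b)
    return [[b + scale * u for b, u in zip(b_row, u_row)]
            for b_row, u_row in zip(base, update)]


def trainable_fraction(d_in, d_out, rank):
    """Share of parameters that are trainable relative to the frozen weight."""
    return (d_in * rank + rank * d_out) / (d_in * d_out)
```

With B initialised to zeros the adapted layer reproduces the frozen layer exactly, which is the usual reason for that initialisation. For a 1024 x 1024 weight and rank 8, `trainable_fraction(1024, 1024, 8)` gives 0.015625, i.e. under 2% of the frozen parameter count.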

[70] Visual Grounding from Event Cameras

Lingdong Kong,Dongyue Lu,Ao Liang,Rong Li,Yuhao Dong,Tianshuai Hu,Lai Xing Ng,Wei Tsang Ooi,Benoit R. Cottereau

Main category: cs.CV

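TL;DR: The paper introduces Talk2Event, the first large-scale benchmark for language-driven object grounding on event camera data, aiming to fill the multimodal perception gap between event cameras and natural language understanding.

Details Motivation: Event cameras offer microsecond precision and robustness to motion blur in highly dynamic scenes, but their combination with natural language understanding has been little studied, which limits progress in multimodal perception.

Contribution: Proposes the Talk2Event benchmark, comprising 5,567 scenes, 13,458 annotated objects, and more than 30,000 validated referring expressions, supporting interpretable and compositional object grounding.

Method: Designs referring expressions around structured attributes (appearance, status, relation to the viewer, relation to surrounding objects) to capture spatial, temporal, and relational cues in dynamic environments.

Result: Talk2Event provides a foundation for contextual reasoning in dynamic environments and supports research on multimodal and temporally-aware perception.

Insight: Combining event cameras with language understanding could advance fields such as robotics and human-AI interaction.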

Abstract: Event cameras capture changes in brightness with microsecond precision and remain reliable under motion blur and challenging illumination, offering clear advantages for modeling highly dynamic scenes. Yet, their integration with natural language understanding has received little attention, leaving a gap in multimodal perception. To address this, we introduce Talk2Event, the first large-scale benchmark for language-driven object grounding using event data. Built on real-world driving scenarios, Talk2Event comprises 5,567 scenes, 13,458 annotated objects, and more than 30,000 carefully validated referring expressions. Each expression is enriched with four structured attributes – appearance, status, relation to the viewer, and relation to surrounding objects – that explicitly capture spatial, temporal, and relational cues. This attribute-centric design supports interpretable and compositional grounding, enabling analysis that moves beyond simple object recognition to contextual reasoning in dynamic environments. We envision Talk2Event as a foundation for advancing multimodal and temporally-aware perception, with applications spanning robotics, human-AI interaction, and so on.

[71] Kling-Avatar: Grounding Multimodal Instructions for Cascaded Long-Duration Avatar Animation Synthesis

Yikang Ding,Jiwen Liu,Wenyuan Zhang,Zekun Wang,Wentao Hu,Liyuan Cui,Mingming Lao,Yingchao Shao,Hui Liu,Xiaohan Li,Ming Chen,Xiaoqiang Liu,Yu-Shen Liu,Pengfei Wan

Main category: cs.CV

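TL;DR: Kling-Avatar proposes a novel cascaded framework that unifies multimodal instruction understanding with photorealistic portrait generation, addressing the limits of existing methods in narrative coherence and character expressiveness.

Details Motivation: Existing audio-driven avatar video generation methods treat instruction conditioning merely as low-level tracking and do not model the communicative purpose the instructions convey, resulting in weak narrative coherence and expressiveness.

Contribution: Proposes a two-stage cascaded framework that combines a multimodal large language model (MLLM) director with high-fidelity portrait generation, enabling semantically driven long video generation.

Method: Uses a two-stage pipeline: in the first stage, an MLLM director produces a blueprint video that governs high-level semantics (such as motion and emotion); in the second stage, multiple sub-clips are generated in parallel from blueprint keyframes.

Result: Experiments show that Kling-Avatar generates vivid, fluent long videos (up to 1080p at 48 fps) and performs better in lip synchronization, emotional expressiveness, and instruction controllability, among other aspects.

Insight: With its global-to-local framework and parallel generation strategy, Kling-Avatar captures the high-level intent of instructions efficiently while preserving details, making it suitable for real-world applications such as digital human livestreaming and vlogging.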

Abstract: Recent advances in audio-driven avatar video generation have significantly enhanced audio-visual realism. However, existing methods treat instruction conditioning merely as low-level tracking driven by acoustic or visual cues, without modeling the communicative purpose conveyed by the instructions. This limitation compromises their narrative coherence and character expressiveness. To bridge this gap, we introduce Kling-Avatar, a novel cascaded framework that unifies multimodal instruction understanding with photorealistic portrait generation. Our approach adopts a two-stage pipeline. In the first stage, we design a multimodal large language model (MLLM) director that produces a blueprint video conditioned on diverse instruction signals, thereby governing high-level semantics such as character motion and emotions. In the second stage, guided by blueprint keyframes, we generate multiple sub-clips in parallel using a first-last frame strategy. This global-to-local framework preserves fine-grained details while faithfully encoding the high-level intent behind multimodal instructions. Our parallel architecture also enables fast and stable generation of long-duration videos, making it suitable for real-world applications such as digital human livestreaming and vlogging. To comprehensively evaluate our method, we construct a benchmark of 375 curated samples covering diverse instructions and challenging scenarios. Extensive experiments demonstrate that Kling-Avatar is capable of generating vivid, fluent, long-duration videos at up to 1080p and 48 fps, achieving superior performance in lip synchronization accuracy, emotion and dynamic expressiveness, instruction controllability, identity preservation, and cross-domain generalization. These results establish Kling-Avatar as a new benchmark for semantically grounded, high-fidelity audio-driven avatar synthesis.

[72] Measuring Epistemic Humility in Multimodal Large Language Models

Bingkui Tong,Jiaer Xia,Sifeng Shang,Kaiyang Zhou

Main category: cs.CV

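TL;DR: The paper presents HumbleBench, a benchmark designed to evaluate the ability of multimodal large language models (MLLMs) to recognize and reject incorrect answers, filling the gap left by existing tests that overlook "epistemic humility".

Details Motivation: Existing MLLM benchmarks mainly test whether a model can pick the correct answer among candidates, but neglect its ability to recognize and reject incorrect answers, which matters especially in safety-critical applications.

Contribution: The main contributions are: (1) HumbleBench, a benchmark for evaluating how well MLLMs reject incorrect answers; (2) a question set covering three hallucination types, built from fine-grained scene graph annotations; (3) extensive experiments analyzing the behavior of existing MLLMs.

Method: The method (1) extracts ground-truth entities and relations from a panoptic scene graph dataset; (2) uses GPT-4-Turbo to generate multiple-choice questions, followed by manual filtering; (3) includes a "None of the above" option in each question to evaluate the model's ability to reject incorrect answers.

Result: Experiments evaluate a range of state-of-the-art MLLMs; the results show that existing models perform poorly at identifying wrong answers, confirming the importance of HumbleBench.

Insight: The study exposes the shortcomings of existing MLLMs in "epistemic humility", stresses the need to test false-option rejection in real applications, and points to directions for future research.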

Abstract: Hallucinations in multimodal large language models (MLLMs) – where the model generates content inconsistent with the input image – pose significant risks in real-world applications, from misinformation in visual question answering to unsafe errors in decision-making. Existing benchmarks primarily test recognition accuracy, i.e., evaluating whether models can select the correct answer among distractors. This overlooks an equally critical capability for trustworthy AI: recognizing when none of the provided options are correct, a behavior reflecting epistemic humility. We present HumbleBench, a new hallucination benchmark designed to evaluate MLLMs’ ability to reject plausible but incorrect answers across three hallucination types: object, relation, and attribute. Built from a panoptic scene graph dataset, we leverage fine-grained scene graph annotations to extract ground-truth entities and relations, and prompt GPT-4-Turbo to generate multiple-choice questions, followed by a rigorous manual filtering process. Each question includes a “None of the above” option, requiring models not only to recognize correct visual information but also to identify when no provided answer is valid. We evaluate a variety of state-of-the-art MLLMs – including both general-purpose and specialized reasoning models – on HumbleBench and share valuable findings and insights with the community. By incorporating explicit false-option rejection, HumbleBench fills a key gap in current evaluation suites, providing a more realistic measure of MLLM reliability in safety-critical settings. Our code and dataset are released publicly and can be accessed at https://github.com/maifoundations/HumbleBench.
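
To make the "None of the above" evaluation concrete, the following sketch scores a list of multiple-choice predictions and separately reports how often the model picks the rejection option when it is the correct answer. The option label `"E"`, the metric names, and the data layout are assumptions for illustration and are not taken from the HumbleBench code.

```python
NONE_OPTION = "E"  # hypothetical label for the "None of the above" choice


def score(predictions, answers):
    """Return overall accuracy and recall on questions whose answer is NONE_OPTION."""
    total = len(answers)
    correct = sum(p == a for p, a in zip(predictions, answers))
    # Indices of questions where no listed option is valid.
    none_idx = [i for i, a in enumerate(answers) if a == NONE_OPTION]
    none_correct = sum(predictions[i] == NONE_OPTION for i in none_idx)
    return {
        "accuracy": correct / total if total else 0.0,
        "none_recall": none_correct / len(none_idx) if none_idx else 0.0,
    }
```

Reporting the second number separately matters because a model that never selects the rejection option can still reach a respectable overall accuracy while failing every question that tests false-option rejection.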

[73] Can Understanding and Generation Truly Benefit Together – or Just Coexist?

Zhiyuan Yan,Kaiqing Lin,Zongjian Li,Junyan Ye,Hui Han,Zhendong Wang,Hao Liu,Bin Lin,Hao Li,Xue Xu,Xinyan Xiao,Jingdong Wang,Haifeng Wang,Li Yuan

Main category: cs.CV

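TL;DR: Viewing the problem through an auto-encoder lens, this paper proposes UAE, a unified multimodal learning framework in which bidirectional information flow between image-to-text (I2T) and text-to-image (T2I) yields mutual gains for understanding and generation.

Details Motivation: Explores whether understanding (I2T) and generation (T2I) can truly benefit each other rather than merely coexist, using a unified objective (reconstruction fidelity) to enforce bidirectional information flow.

Contribution: 1) Proposes the UAE framework, which unifies understanding and generation as a single auto-encoder; 2) designs the Unified-GRPO training strategy, which optimizes the encoder and decoder in stages; 3) introduces the Unified-Bench benchmark for assessing how unified multimodal models are.

Method: 1) Pre-trains the decoder to capture fine-grained semantics and spatial relationships; 2) applies three-stage RL training: cold start, generation for understanding, and understanding for generation; 3) uses reconstruction fidelity as the unified objective.

Result: During RL training, the encoder produces richer captions and the decoder's understanding improves, substantially raising reconstruction fidelity and yielding bidirectional gains.

Insight: Understanding and generation can be optimized jointly within a unified framework, with RL-driven iterative training as the key; the mutual reinforcement of caption quality and generation ability is an unexpected finding.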

Abstract: In this paper, we introduce an insightful paradigm through the Auto-Encoder lens: understanding as the encoder (I2T) that compresses images into text, and generation as the decoder (T2I) that reconstructs images from that text. Using reconstruction fidelity as the unified training objective, we enforce the coherent bidirectional information flow between the understanding and generation processes, bringing mutual gains. To implement this, we propose UAE, a novel framework for unified multimodal learning. We begin by pre-training the decoder with large-scale long-context image captions to capture fine-grained semantic and complex spatial relationships. We then propose Unified-GRPO via reinforcement learning (RL), which covers three stages: (1) A cold-start phase to gently initialize both encoder and decoder with a semantic reconstruction loss; (2) Generation for Understanding, where the encoder is trained to generate informative captions that maximize the decoder’s reconstruction quality, enhancing its visual understanding; (3) Understanding for Generation, where the decoder is refined to reconstruct from these captions, forcing it to leverage every detail and improving its long-context instruction following and generation fidelity. For evaluation, we introduce Unified-Bench, the first benchmark tailored to assess the degree of unification of the UMMs. A surprising “aha moment” arises within the multimodal learning domain: as RL progresses, the encoder autonomously produces more descriptive captions, while the decoder simultaneously demonstrates a profound ability to understand these intricate descriptions, resulting in reconstructions of striking fidelity.

[74] Locality in Image Diffusion Models Emerges from Data Statistics

Artem Lukoianov,Chenyang Yuan,Justin Solomon,Vincent Sitzmann

Main category: cs.CV

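TL;DR: The paper investigates the origin of locality in image diffusion models, arguing that locality stems from the statistical properties of the image data rather than from the inductive bias of convolutional neural networks. The authors show theoretically and experimentally that an optimal parametric linear denoiser exhibits locality similar to that of deep neural denoisers, and use this insight to design an analytical denoiser that better matches deep diffusion models.

Details Motivation: The diffusion training objective has a closed-form optimal solution (the optimal denoiser), but using it directly only reproduces training set images and cannot generate new ones. Prior work attributed this gap to the shift equivariance and locality inductive biases of convolutional neural networks; this paper questions that hypothesis.

Contribution: The main contributions are: 1) evidence that locality in deep diffusion models stems from statistical properties of the image data; 2) theoretical and experimental verification that locality is directly tied to pixel correlations; 3) an analytical denoiser that more closely matches deep diffusion models.

Method: The authors first analyze the optimal parametric linear denoiser and show that it exhibits locality similar to that of deep neural denoisers. They then verify, theoretically and experimentally, the relationship between locality and pixel correlations in natural image data. Finally, they use these insights to improve the design of the analytical denoiser.

Result: Experiments show that the proposed analytical denoiser predicts the scores of a deep diffusion model more accurately than the prior expert-crafted method, supporting the view that locality arises from data statistics.

Insight: The paper shows that the root source of locality in image diffusion models is the data rather than the model architecture, offering a new perspective on diffusion model behavior and motivating analytical methods that track real models more closely.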

Abstract: Among generative models, diffusion models are uniquely intriguing due to the existence of a closed-form optimal minimizer of their training objective, often referred to as the optimal denoiser. However, diffusion using this optimal denoiser merely reproduces images in the training set and hence fails to capture the behavior of deep diffusion models. Recent work has attempted to characterize this gap between the optimal denoiser and deep diffusion models, proposing analytical, training-free models that can generate images that resemble those generated by a trained UNet. The best-performing method hypothesizes that shift equivariance and locality inductive biases of convolutional neural networks are the cause of the performance gap, hence incorporating these assumptions into its analytical model. In this work, we present evidence that the locality in deep diffusion models emerges as a statistical property of the image dataset, not due to the inductive bias of convolutional neural networks. Specifically, we demonstrate that an optimal parametric linear denoiser exhibits similar locality properties to the deep neural denoisers. We further show, both theoretically and experimentally, that this locality arises directly from the pixel correlations present in natural image datasets. Finally, we use these insights to craft an analytical denoiser that better matches scores predicted by a deep diffusion model than the prior expert-crafted alternative.
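
As a rough illustration of why a linear denoiser can inherit locality from data statistics, consider the textbook linear minimum-mean-square-error estimator. The setup is an assumption not stated in the abstract: a noise model $x_t = x_0 + \sigma\varepsilon$ with $\varepsilon \sim \mathcal{N}(0, I)$, data mean $\mu$, and pixel covariance $\Sigma$. The paper's own parametrization and noise schedule may differ.

```latex
\hat{x}_0(x_t) = \mu + W_\sigma\,(x_t - \mu),
\qquad
W_\sigma = \operatorname{Cov}(x_0, x_t)\,\operatorname{Cov}(x_t)^{-1}
         = \Sigma\,(\Sigma + \sigma^2 I)^{-1}.
```

Row $i$ of $W_\sigma$ holds the weight each noisy pixel receives when estimating pixel $i$. Because $W_\sigma$ depends only on $\Sigma$ and $\sigma$, any spatial concentration of those weights is determined by the pixel correlations in the dataset and not by a network architecture, which is consistent with the claim in the abstract.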

[75] SpatialVID: A Large-Scale Video Dataset with Spatial Annotations

Jiahao Wang,Yufeng Yuan,Rujie Zheng,Youtian Lin,Jian Gao,Lin-Zhuo Chen,Yajie Bao,Yi Zhang,Chang Zeng,Yanxi Zhou,Xiaoxiao Long,Hao Zhu,Zhaoxiang Zhang,Xun Cao,Yao Yao

Main category: cs.CV

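TL;DR: SpatialVID is a large-scale video dataset with rich spatial annotations, including camera poses and depth maps, intended to address the limited scale and diversity of existing data.

Details Motivation: The scalability and real-world fidelity of current spatial intelligence models are limited by the scarcity of high-quality training data. Existing datasets fall short in scale, diversity, and annotation richness.

Contribution: Presents the SpatialVID dataset: more than 21,000 hours of raw video, processed into 7,089 hours of dynamic content that covers diverse scenes and camera movements and carries dense 3D annotations.

Method: Raw video is processed through a hierarchical filtering pipeline into 2.7 million clips; a subsequent annotation pipeline enriches these clips with detailed spatial and semantic information.

Result: SpatialVID's data statistics show a richness and diversity that directly support better model generalization and performance.

Insight: SpatialVID fills a gap in high-quality, large-scale video data and is an important resource for video and 3D vision research.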

Abstract: Significant progress has been made in spatial intelligence, spanning both spatial reconstruction and world exploration. However, the scalability and real-world fidelity of current models remain severely constrained by the scarcity of large-scale, high-quality training data. While several datasets provide camera pose information, they are typically limited in scale, diversity, and annotation richness, particularly for real-world dynamic scenes with ground-truth camera motion. To this end, we collect SpatialVID, a dataset consisting of a large corpus of in-the-wild videos with diverse scenes, camera movements and dense 3D annotations such as per-frame camera poses, depth, and motion instructions. Specifically, we collect more than 21,000 hours of raw video, and process it into 2.7 million clips through a hierarchical filtering pipeline, totaling 7,089 hours of dynamic content. A subsequent annotation pipeline enriches these clips with detailed spatial and semantic information, including camera poses, depth maps, dynamic masks, structured captions, and serialized motion instructions. Analysis of SpatialVID’s data statistics reveals a richness and diversity that directly foster improved model generalization and performance, establishing it as a key asset for the video and 3D vision research community.

cs.GR [Back]

[76] CameraVDP: Perceptual Display Assessment with Uncertainty Estimation via Camera and Visual Difference Prediction

Yancheng Cai,Robert Wanat,Rafal Mantiuk

Main category: cs.GR

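TL;DR: CameraVDP combines a camera-based reconstruction pipeline with visual difference prediction for perceptual display assessment and uncertainty estimation, addressing the inability of traditional display measurement methods to capture spatially varying and high-frequency distortions.

Details Motivation: Traditional display measurement methods cannot capture high-frequency and pixel-level distortions, while cameras, though they offer sufficient spatial resolution, introduce optical, sampling, and photometric distortions. In addition, a model of the visual system is needed to assess whether distortions are visible.

Contribution: Proposes the CameraVDP framework, which combines a camera reconstruction pipeline with a Visual Difference Predictor (VDP), accounts for both the inaccuracy of camera measurements and visual difference prediction, and supports perceptual assessment of display defects.

Method: Combines a camera reconstruction pipeline (HDR image stacking, MTF inversion, vignetting correction, geometric undistortion, homography transformation, and color correction) with a VDP to achieve precise display measurement and to model the visibility of distortions.

Result: CameraVDP is validated on defective pixel detection, color fringing awareness, and display non-uniformity evaluation; its uncertainty analysis framework provides a theoretical upper bound on defect detection performance and confidence intervals for VDP quality scores.

Insight: CameraVDP not only addresses the limits of traditional methods but also brings in a visual perception model, so that display assessment comes closer to what the human visual system actually perceives.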

Abstract: Accurate measurement of images produced by electronic displays is critical for the evaluation of both traditional and computational displays. Traditional display measurement methods based on sparse radiometric sampling and fitting a model are inadequate for capturing spatially varying display artifacts, as they fail to capture high-frequency and pixel-level distortions. While cameras offer sufficient spatial resolution, they introduce optical, sampling, and photometric distortions. Furthermore, the physical measurement must be combined with a model of a visual system to assess whether the distortions are going to be visible. To enable perceptual assessment of displays, we propose a combination of a camera-based reconstruction pipeline with a visual difference predictor, which accounts for both the inaccuracy of camera measurements and visual difference prediction. The reconstruction pipeline combines HDR image stacking, MTF inversion, vignetting correction, geometric undistortion, homography transformation, and color correction, enabling cameras to function as precise display measurement instruments. By incorporating a Visual Difference Predictor (VDP), our system models the visibility of various stimuli under different viewing conditions for the human visual system. We validate the proposed CameraVDP framework through three applications: defective pixel detection, color fringing awareness, and display non-uniformity evaluation. Our uncertainty analysis framework enables the estimation of the theoretical upper bound for defect pixel detection performance and provides confidence intervals for VDP quality scores.

cs.LG [Back]

[77] Harnessing Uncertainty: Entropy-Modulated Policy Gradients for Long-Horizon LLM Agents

Jiawei Wang,Jiacai Liu,Yuqian Fu,Yingru Li,Xintao Wang,Yuan Lin,Yu Yue,Lin Zhang,Yang Wang,Ke Wang

Main category: cs.LG

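TL;DR: To address sparse rewards and credit assignment for LLM agents in long-horizon tasks, the paper proposes Entropy-Modulated Policy Gradients (EMPG), which re-calibrates the learning signal to improve efficiency and stability.

Details Motivation: In long-horizon tasks, LLM-based agents face sparse rewards and credit assignment challenges. Previous methods guide learning with dense reward signals but overlook the coupling between policy gradient magnitude and entropy.

Contribution: Proposes the EMPG framework, which re-calibrates the learning signal by modulating policy gradients with entropy, and introduces a future clarity bonus that encourages agents to find more predictable solution paths.

Method: EMPG re-calibrates the learning signal using step-wise uncertainty and the final task outcome: it amplifies updates for confident correct actions, penalizes confident errors, and attenuates updates from uncertain steps.

Result: On three tasks, WebShop, ALFWorld, and Deep Search, EMPG significantly outperforms baseline methods.

Insight: The coupling between policy gradient magnitude and entropy is a major cause of inefficient and unstable learning in LLM agents; entropy modulation can substantially improve learning dynamics.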

Abstract: In long-horizon tasks, recent agents based on Large Language Models (LLMs) face a significant challenge: sparse, outcome-based rewards make it difficult to assign credit to intermediate steps. Previous methods mainly focus on creating dense reward signals to guide learning, either through traditional reinforcement learning techniques like inverse reinforcement learning or by using Process Reward Models for step-by-step feedback. In this paper, we identify a fundamental problem in the learning dynamics of LLMs: the magnitude of policy gradients is inherently coupled with the entropy, which leads to inefficient small updates for confident correct actions and potentially destabilizing large updates for uncertain ones. To resolve this, we propose Entropy-Modulated Policy Gradients (EMPG), a framework that re-calibrates the learning signal based on step-wise uncertainty and the final task outcome. EMPG amplifies updates for confident correct actions, penalizes confident errors, and attenuates updates from uncertain steps to stabilize exploration. We further introduce a bonus term for future clarity that encourages agents to find more predictable solution paths. Through comprehensive experiments on three challenging agent tasks, WebShop, ALFWorld, and Deep Search, we demonstrate that EMPG achieves substantial performance gains and significantly outperforms strong policy gradient baselines. Project page is at https://empgseed-seed.github.io/
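
The re-calibration described in the abstract can be sketched as a per-step reweighting of a trajectory-level advantage. The exponential weighting, the constant `k`, and the mean normalisation below are assumptions made for illustration; they are not the formula from the EMPG paper, and the future clarity bonus is omitted.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def modulated_advantages(step_entropies, outcome_advantage, k=1.0):
    """Spread a trajectory-level advantage over steps, weighted by confidence.

    Low-entropy (confident) steps get a weight above 1, high-entropy steps a
    weight below 1. The weights are normalised to mean 1, so the average
    per-step advantage still equals outcome_advantage. Hypothetical scheme.
    """
    if not step_entropies:
        return []
    weights = [math.exp(-k * h) for h in step_entropies]
    mean_w = sum(weights) / len(weights)
    return [outcome_advantage * w / mean_w for w in weights]
```

Under this scheme a confident step in a successful trajectory receives an amplified positive signal, a confident step in a failed trajectory receives an amplified negative signal, and uncertain steps are attenuated in both cases, matching the three behaviours listed in the abstract.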

[78] Adaptive Pareto-Optimal Token Merging for Edge Transformer Models in Semantic Communication

Omar Erak,Omar Alhussein,Hatem Abou-Zeid,Mehdi Bennis

Main category: cs.LG

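TL;DR: The paper proposes a training-free framework that uses adaptive token merging in pretrained vision transformers to reduce inference time and transmission resource usage. It balances accuracy against computational cost through multi-objective optimization and Bayesian optimization, and adapts flexibly to dynamic application requirements. Experiments show that it remains competitive in accuracy while lowering computational complexity.

Details Motivation: Large-scale transformer models perform well in semantic communication systems, but their high computational demands limit practical deployment in resource-constrained 6G networks, so an efficient approach is needed.

Contribution: 1. Proposes a training-free adaptive token merging framework; 2. formulates the selection of token merging proportions as a multi-objective optimization problem; 3. uses Gaussian process-based Bayesian optimization to construct a Pareto frontier, enabling flexible configuration.

Method: Uses Bayesian optimization and multi-objective optimization to choose per-layer token merging proportions, balancing computational cost against model accuracy.

Result: Experiments show that the method reduces floating-point operations while maintaining competitive accuracy, and that it adapts to different signal-to-noise ratio conditions.

Insight: Dynamically adjusting the merging policy copes effectively with fluctuations in channel quality, offering an efficient and flexible way to deploy transformers in edge intelligence systems.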

Abstract: Large-scale transformer models have emerged as a powerful tool for semantic communication systems, enabling edge devices to extract rich representations for robust inference across noisy wireless channels. However, their substantial computational demands remain a major barrier to practical deployment in resource-constrained 6G networks. In this paper, we present a training-free framework for adaptive token merging in pretrained vision transformers to jointly reduce inference time and transmission resource usage. We formulate the selection of per-layer merging proportions as a multi-objective optimization problem to balance accuracy and computational cost. We employ Gaussian process-based Bayesian optimization to construct a Pareto frontier of optimal configurations, enabling flexible runtime adaptation to dynamic application requirements and channel conditions. Extensive experiments demonstrate that our method consistently outperforms other baselines and achieves significant reductions in floating-point operations while maintaining competitive accuracy across a wide range of signal-to-noise ratio (SNR) conditions. Additional results highlight the effectiveness of adaptive policies that adjust merging aggressiveness in response to channel quality, providing a practical mechanism to trade off latency and semantic fidelity on demand. These findings establish a scalable and efficient approach for deploying transformer-based semantic communication in future edge intelligence systems.
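
The Pareto frontier mentioned above can be illustrated with a small non-dominated filter over candidate configurations. This sketch assumes the candidates have already been evaluated as `(cost, accuracy)` pairs; it does not reproduce the paper's Gaussian process-based Bayesian optimization, and the function names and the budget-based selection rule are hypothetical.

```python
def pareto_front(configs):
    """Return the non-dominated (cost, accuracy) pairs, sorted by cost.

    A configuration is kept only if no cheaper-or-equal configuration
    reaches at least the same accuracy.
    """
    front = []
    best_acc = float("-inf")
    # Sort by ascending cost; on ties, try the more accurate one first.
    for cost, acc in sorted(configs, key=lambda c: (c[0], -c[1])):
        if acc > best_acc:
            front.append((cost, acc))
            best_acc = acc
    return front


def pick_config(front, cost_budget):
    """Pick the most accurate frontier point that fits the current budget."""
    feasible = [c for c in front if c[0] <= cost_budget]
    return max(feasible, key=lambda c: c[1]) if feasible else None
```

At runtime, a controller could lower `cost_budget` when the channel or latency target tightens and look up the corresponding frontier point, which is one simple way to realise the on-demand trade-off the abstract describes.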

[79] Graph Alignment via Dual-Pass Spectral Encoding and Latent Space Communication

Maysam Behmanesh,Erkan Turan,Maks Ovsjanikov

Main category: cs.LG

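TL;DR: The paper proposes a new graph alignment framework that uses dual-pass spectral encoding and latent space communication to enhance node distinctiveness and maintain geometric consistency, outperforming existing unsupervised baselines.

Details Motivation: Existing unsupervised graph alignment methods yield unreliable node correspondences because of oversmoothing in GNN embeddings and misaligned latent spaces, and need improvement.

Contribution: 1. A dual-pass encoder that combines low-pass and high-pass spectral filters to produce highly discriminative embeddings; 2. a geometry-aware functional map module that enforces geometric consistency across latent spaces.

Method: 1. Applies dual-pass spectral encoding (low-pass and high-pass filtering) to generate structure-aware, highly discriminative embeddings; 2. uses a functional map module to learn bijective, isometric transformations that align the latent spaces.

Result: Outperforms existing unsupervised baselines on graph benchmarks and generalizes to cross-modal (vision-language) alignment, with strong robustness to structural noise and heterogeneity.

Insight: Combining spectral filtering with geometric consistency constraints effectively addresses embedding oversmoothing and latent space misalignment, and the framework generalizes across domains.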

Abstract: Graph alignment, the problem of identifying corresponding nodes across multiple graphs, is fundamental to numerous applications. Most existing unsupervised methods embed node features into latent representations to enable cross-graph comparison without ground-truth correspondences. However, these methods suffer from two critical limitations: the degradation of node distinctiveness due to oversmoothing in GNN-based embeddings, and the misalignment of latent spaces across graphs caused by structural noise, feature heterogeneity, and training instability, ultimately leading to unreliable node correspondences. We propose a novel graph alignment framework that simultaneously enhances node distinctiveness and enforces geometric consistency across latent spaces. Our approach introduces a dual-pass encoder that combines low-pass and high-pass spectral filters to generate embeddings that are both structure-aware and highly discriminative. To address latent space misalignment, we incorporate a geometry-aware functional map module that learns bijective and isometric transformations between graph embeddings, ensuring consistent geometric relationships across different representations. Extensive experiments on graph benchmarks demonstrate that our method consistently outperforms existing unsupervised alignment baselines, exhibiting superior robustness to structural inconsistencies and challenging alignment scenarios. Additionally, comprehensive evaluation on vision-language benchmarks using diverse pretrained models shows that our framework effectively generalizes beyond graph domains, enabling unsupervised alignment of vision and language representations.
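
The combination of low-pass and high-pass spectral filters can be sketched with one propagation step on a small graph. The filters here are assumptions: the low-pass branch is the common symmetric-normalised adjacency with self-loops, and the high-pass branch is its complement. The abstract does not specify the paper's actual filters, and the function names are hypothetical.

```python
def normalized_adjacency(adj):
    """Return D^{-1/2} (A + I) D^{-1/2} for a dense adjacency matrix."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / (deg[i] * deg[j]) ** 0.5 for j in range(n)]
            for i in range(n)]


def dual_pass_features(adj, feats):
    """Concatenate low-pass (S X) and high-pass (X - S X) node features."""
    s = normalized_adjacency(adj)
    n, d = len(adj), len(feats[0])
    low = [[sum(s[i][k] * feats[k][c] for k in range(n)) for c in range(d)]
           for i in range(n)]
    high = [[x - l for x, l in zip(x_row, l_row)]
            for x_row, l_row in zip(feats, low)]
    return [l_row + h_row for l_row, h_row in zip(low, high)]
```

On a two-node graph with features `[[1.0], [-1.0]]`, the low-pass branch maps both nodes to `0.0`, a miniature version of oversmoothing, while the high-pass branch keeps them apart as `1.0` and `-1.0`. Concatenating both branches preserves the smoothed structural signal without losing node distinctiveness.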

cs.AI [Back]

[80] Automated Unity Game Template Generation from GDDs via NLP and Multi-Modal LLMs

Amna Hassan

Main category: cs.AI

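TL;DR: The paper proposes a framework that uses natural language processing and multi-modal large language models to automatically convert Game Design Documents (GDDs) into functional Unity game templates.

Details Motivation: Moving from a game design document to actual development is usually complex and time-consuming, and the lack of automation tools hurts development efficiency. The framework aims to fill this gap, using AI to simplify the transition from design to implementation.

Contribution: The main contribution is an end-to-end system that parses GDDs and generates Unity-compatible C# code, with strong code generation performance (4.8/5.0 average score) and high adherence to the design documents.

Method: Combines a LLaMA-3 model fine-tuned for Unity code generation with a custom Unity integration package; the system parses GDDs and extracts structured game specifications.

Result: Evaluation shows that the approach outperforms baseline models on compilation success, GDD adherence, best practices adoption, and code modularity, and that it applies across multiple game genres.

Insight: Multi-modal LLMs have the potential to markedly improve efficiency from design to implementation and to become valuable tools in the game development workflow.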

Abstract: This paper presents a novel framework for automated game template generation by transforming Game Design Documents (GDDs) into functional Unity game prototypes using Natural Language Processing (NLP) and multi-modal Large Language Models (LLMs). We introduce an end-to-end system that parses GDDs, extracts structured game specifications, and synthesizes Unity-compatible C# code that implements the core mechanics, systems, and architecture defined in the design documentation. Our approach combines a fine-tuned LLaMA-3 model specialized for Unity code generation with a custom Unity integration package that streamlines the implementation process. Evaluation results demonstrate significant improvements over baseline models, with our fine-tuned model achieving superior performance (4.8/5.0 average score) compared to state-of-the-art LLMs across compilation success, GDD adherence, best practices adoption, and code modularity metrics. The generated templates demonstrate high adherence to GDD specifications across multiple game genres. Our system effectively addresses critical gaps in AI-assisted game development, positioning LLMs as valuable tools in streamlining the transition from game design to implementation.

[81] Tree-OPO: Off-policy Monte Carlo Tree-Guided Advantage Optimization for Multistep Reasoning

Bingning Huang,Tu Nguyen,Matthieu Zimmer

Main category: cs.AI

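TL;DR: The paper proposes Tree-OPO, a new method combining MCTS and RL that uses MCTS-generated trajectories to improve policy optimization in preference-based reinforcement learning, and offers heuristic and statistical remedies for the issues that arise.

Details Motivation: Work on multistep reasoning with large language models (LLMs) has shown that MCTS can generate high-quality intermediate trajectories. How to use these trajectories for policy optimization, especially in preference-consistent learning, remains an open question.

Contribution: 1. Proposes Tree-OPO, which combines MCTS-generated trajectories with GRPO to improve policy learning; 2. introduces a tree-structured setting for advantage estimation; 3. analyzes advantage saturation and reward signal collapse and proposes mitigations.

Method: Uses staged GRPO training in which completions are derived from partially revealed MCTS rollouts, designs a tree-structured advantage estimation framework, and mitigates the resulting issues with heuristic and statistical methods.

Result: Initial results indicate that tree-structured advantage estimation can stabilize updates and better reflect compositional reasoning quality, but advantage saturation and reward signal collapse remain challenges.

Insight: MCTS-generated trajectories can enrich RL policy optimization, but learning under tree-structured rewards remains challenging and calls for more theoretical study.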

Abstract: Recent advances in reasoning with large language models (LLMs) have shown the effectiveness of Monte Carlo Tree Search (MCTS) for generating high-quality intermediate trajectories, particularly in math and symbolic domains. Inspired by this, we explore how MCTS-derived trajectories, traditionally used for training value or reward models, can be repurposed to improve policy optimization in preference-based reinforcement learning (RL). Specifically, we focus on Group Relative Policy Optimization (GRPO), a recent algorithm that enables preference-consistent policy learning without value networks. We propose a staged GRPO training paradigm where completions are derived from partially revealed MCTS rollouts, introducing a novel tree-structured setting for advantage estimation. This leads to a rich class of prefix-conditioned reward signals, which we analyze theoretically and empirically. Our initial results indicate that while structured advantage estimation can stabilize updates and better reflect compositional reasoning quality, challenges such as advantage saturation and reward signal collapse remain. We propose heuristic and statistical solutions to mitigate these issues and discuss open challenges for learning under staged or tree-like reward structures.
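
For readers unfamiliar with GRPO, the group-relative advantage it relies on can be sketched in a few lines: each completion's reward is standardised against the other completions sampled for the same prompt, so no value network is needed. This is the generic flat-group form, not the prefix-conditioned, tree-structured estimator studied in the paper; the use of the population standard deviation and the `eps` value are assumptions.

```python
def group_relative_advantages(rewards, eps=1e-8):
    """Standardise rewards within one group of completions for the same prompt."""
    n = len(rewards)
    mean = sum(rewards) / n
    # Population standard deviation of the group (an assumption of this sketch).
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]
```

When every completion in a group earns the same reward, all advantages are zero and the group contributes no gradient. This gives one simple picture of how a reward signal can collapse, although the paper's analysis of advantage saturation and reward signal collapse concerns the tree-structured setting and may involve other mechanisms.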

[82] Mind Meets Space: Rethinking Agentic Spatial Intelligence from a Neuroscience-inspired Perspective

Bui Duc Manh,Soumyaratna Debnath,Zetong Zhang,Shriram Damodaran,Arvind Kumar,Yueyi Zhang,Lu Mi,Erik Cambria,Lin Wang

Main category: cs.AI

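TL;DR: This paper proposes a neuroscience-inspired computational framework aimed at improving the spatial reasoning of agentic AI systems and narrowing the gap between current AI and human spatial intelligence.

Details Motivation: Current agentic AI systems have limited spatial reasoning, whereas human spatial intelligence, rooted in multisensory perception and cognitive maps, supports flexible decision-making in unstructured environments. It is therefore necessary to rethink AI spatial intelligence from a neuroscience perspective.

Contribution: 1. Proposes a neuroscience-grounded computational framework with six core modules: bio-inspired multimodal sensing, multi-sensory integration, egocentric-allocentric conversion, an artificial cognitive map, spatial memory, and spatial reasoning. 2. Uses the framework to analyze existing methods and identify gaps in current research.

Method: 1. Draws on spatial neural models from computational neuroscience. 2. Designs six core computation modules. 3. Conducts a framework-guided analysis and evaluation of existing methods.

Result: 1. Analyzes the limitations of existing methods. 2. Outlines future research directions, in particular the potential to generalize spatial reasoning across dynamic and unstructured environments.

Insight: A neuroscience perspective offers a structured pathway for AI spatial reasoning, with broad application prospects in robotics and virtual systems.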

Abstract: Recent advances in agentic AI have led to systems capable of autonomous task execution and language-based reasoning, yet their spatial reasoning abilities remain limited and underexplored, largely constrained to symbolic and sequential processing. In contrast, human spatial intelligence, rooted in integrated multisensory perception, spatial memory, and cognitive maps, enables flexible, context-aware decision-making in unstructured environments. Therefore, bridging this gap is critical for advancing Agentic Spatial Intelligence toward better interaction with the physical 3D world. To this end, we first start from scrutinizing the spatial neural models as studied in computational neuroscience, and accordingly introduce a novel computational framework grounded in neuroscience principles. This framework maps core biological functions to six essential computation modules: bio-inspired multimodal sensing, multi-sensory integration, egocentric-allocentric conversion, an artificial cognitive map, spatial memory, and spatial reasoning. Together, these modules form a perspective landscape for agentic spatial reasoning capability across both virtual and physical environments. On top, we conduct a framework-guided analysis of recent methods, evaluating their relevance to each module and identifying critical gaps that hinder the development of more neuroscience-grounded spatial reasoning modules. We further examine emerging benchmarks and datasets and explore potential application domains ranging from virtual to embodied systems, such as robotics. Finally, we outline potential research directions, emphasizing the promising roadmap that can generalize spatial reasoning across dynamic or unstructured environments. We hope this work will benefit the research community with a neuroscience-grounded perspective and a structured pathway. Our project page can be found at Github.
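To make the egocentric-allocentric conversion module more concrete, here is a minimal 2D sketch of the underlying coordinate change: a point seen in the agent's own frame is rotated and translated into a fixed world frame, and back. The pose convention (x, y, heading in radians, with the egocentric x-axis pointing forward) and the function names are assumptions for illustration, not part of the paper's framework.

```python
import math

def ego_to_allo(pose, point):
    """Map a point from the agent's frame into the world frame.

    pose is (x, y, theta): world position and heading in radians (assumed
    convention); point is (forward, left) in the agent's frame.
    """
    x, y, theta = pose
    ex, ey = point
    c, s = math.cos(theta), math.sin(theta)
    return (x + c * ex - s * ey, y + s * ex + c * ey)

def allo_to_ego(pose, point):
    """Inverse mapping: express a world-frame point in the agent's frame."""
    x, y, theta = pose
    dx, dy = point[0] - x, point[1] - y
    c, s = math.cos(theta), math.sin(theta)
    return (c * dx + s * dy, -s * dx + c * dy)

pose = (2.0, 1.0, math.pi / 2)   # agent at (2, 1), facing the +y direction
landmark_ego = (1.0, 0.0)        # one unit straight ahead of the agent
landmark_allo = ego_to_allo(pose, landmark_ego)
round_trip = allo_to_ego(pose, landmark_allo)
```

Here the landmark one unit ahead of the agent lands at approximately (2, 2) in the world frame, and converting back recovers the egocentric coordinates up to floating-point error.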

cs.CY [Back]

[83] A vibe coding learning design to enhance EFL students’ talking to, through, and about AI

David James Woo,Kai Guo,Yangyang Yu

Main category: cs.CY

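TL;DR: This innovative practice article reports a pilot study of vibe coding (developing software applications with AI through natural language) in English as a Foreign Language (EFL) teaching. The study develops a human-AI meta-languaging framework with three dimensions: talking to AI (prompt engineering), talking through AI (negotiating authorship), and talking about AI (mental models of AI). Case studies show that differences in students’ vibe coding outcomes are associated with their prompt engineering approaches and AI mental models.

Details Motivation: The study explores how AI techniques such as vibe coding can enrich EFL students’ language learning experience, and seeks to reveal the challenges students face when collaborating with AI and the reasons behind them.

Contribution: (1) Proposes a human-AI meta-languaging framework for analyzing EFL students’ interactions with AI; (2) uses case studies to show how vibe coding outcomes vary in actual teaching; (3) offers concrete recommendations for vibe coding instruction, such as prompt engineering training and the development of AI mental models.

Method: A four-hour workshop was designed using backward design principles, in which two students designed applications addressing EFL writing problems. Data were collected from worksheets, video recordings, think-aloud protocols, screen recordings, and AI-generated images.

Result: One student successfully built an application that functioned as intended, while the other ran into technical difficulties and ended up with a large gap between the intended design and the actual functionality. The difference is attributed mainly to the students’ differing prompt engineering approaches and AI mental models.

Insight: Effective vibe coding instruction requires explicit meta-languaging scaffolding, including structured prompt engineering training, critical discussion of authorship, and vocabulary for articulating AI mental models.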

Abstract: This innovative practice article reports on the piloting of vibe coding (using natural language to create software applications with AI) for English as a Foreign Language (EFL) education. We developed a human-AI meta-languaging framework with three dimensions: talking to AI (prompt engineering), talking through AI (negotiating authorship), and talking about AI (mental models of AI). Using backward design principles, we created a four-hour workshop where two students designed applications addressing authentic EFL writing challenges. We adopted a case study methodology, collecting data from worksheets and video recordings, think-aloud protocols, screen recordings, and AI-generated images. Contrasting cases showed one student successfully vibe coding a functional application cohering to her intended design, while another encountered technical difficulties with major gaps between intended design and actual functionality. Analysis reveals differences in students’ prompt engineering approaches, suggesting different AI mental models and tensions in attributing authorship. We argue that AI functions as a beneficial languaging machine, and that differences in how students talk to, through, and about AI explain vibe coding outcome variations. Findings indicate that effective vibe coding instruction requires explicit meta-languaging scaffolding, teaching structured prompt engineering, facilitating critical authorship discussions, and developing vocabulary for articulating AI mental models.

eess.SP [Back]

[84] Ultrafast Deep Learning-Based Scatter Estimation in Cone-Beam Computed Tomography

Harshit Agrawal,Ari Hietanen,Simo Särkkä

Main category: eess.SP

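TL;DR: This paper proposes a fast deep learning-based scatter estimation method. By tuning the network structure and input size across multiple resolutions, it substantially reduces computational cost and memory requirements while maintaining accuracy.

Details Motivation: Scatter artifacts severely degrade image quality in cone-beam computed tomography (CBCT). Existing deep learning methods are effective, but their large networks make them hard to deploy on mobile CBCT systems or edge devices.

Contribution: Through multi-resolution analysis and optimization, the paper arrives at an efficient, lightweight deep learning method that cuts computational complexity (a 78-fold reduction in FLOPs) and memory requirements while maintaining comparable performance.

Method: 1) Analyzes the reconstruction error of the scatter signal at six resolutions, comparing four interpolation methods; 2) trains and evaluates a deep learning model at five resolutions, quantifying the reductions in FLOPs, inference time, and GPU memory.

Result: The optimized method is comparable to the baseline in MAPE and MSE while reducing FLOPs 78-fold and cutting inference time and GPU memory by factors of 16 and 12, respectively. Scatter correction results are robust on both simulated and real data.

Insight: The paper highlights the important role of downsampling in deep learning-based scatter estimation and shows that lightweight networks are feasible in resource-constrained environments.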

Abstract: Purpose: Scatter artifacts drastically degrade the image quality of cone-beam computed tomography (CBCT) scans. Although deep learning-based methods show promise in estimating scatter from CBCT measurements, their deployment in mobile CBCT systems or edge devices is still limited due to the large memory footprint of the networks. This study addresses the issue by applying networks at varying resolutions and suggesting an optimal one, based on speed and accuracy. Methods: First, the reconstruction error in down-up sampling of CBCT scatter signal was examined at six resolutions by comparing four interpolation methods. Next, a recent state-of-the-art method was trained across five image resolutions and evaluated for the reductions in floating-point operations (FLOPs), inference times, and GPU memory requirements. Results: Reducing the input size and network parameters achieved a 78-fold reduction in FLOPs compared to the baseline method, while maintaining comparable performance in terms of mean-absolute-percentage-error (MAPE) and mean-square-error (MSE). Specifically, the MAPE decreased to 3.85% compared to 4.42%, and the MSE decreased to $1.34 \times 10^{-2}$ compared to $2.01 \times 10^{-2}$. Inference time and GPU memory usage were reduced by factors of 16 and 12, respectively. Further experiments comparing scatter-corrected reconstructions on a large, simulated dataset and real CBCT scans from water and Sedentex CT phantoms clearly demonstrated the robustness of our method. Conclusion: This study highlights the underappreciated role of downsampling in deep learning-based scatter estimation. The substantial reduction in FLOPs and GPU memory requirements achieved by our method enables scatter correction in resource-constrained environments, such as mobile CBCT and edge devices.
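The down-up sampling check described in the Methods can be sketched in one dimension: downsample a smooth signal, upsample it back, and score the round trip with MAPE and MSE. Everything concrete here is an assumption for illustration: the synthetic sine-based signal, average pooling for downsampling, and nearest-neighbour upsampling (the paper compares four interpolation methods that the abstract does not name).

```python
import math

def downsample(signal, factor):
    """Average-pool a 1D signal by an integer factor."""
    if len(signal) % factor != 0:
        raise ValueError("signal length must be a multiple of factor")
    return [sum(signal[i:i + factor]) / factor
            for i in range(0, len(signal), factor)]

def upsample_nearest(signal, factor):
    """Repeat each sample `factor` times (nearest-neighbour upsampling)."""
    return [value for value in signal for _ in range(factor)]

def mape(reference, estimate):
    """Mean absolute percentage error; reference values must be non-zero."""
    return 100.0 * sum(abs(r - e) / abs(r)
                       for r, e in zip(reference, estimate)) / len(reference)

def mse(reference, estimate):
    """Mean squared error between two equal-length sequences."""
    return sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)

def down_up_error(signal, factor):
    """Return (MAPE, MSE) of the signal after a down-up sampling round trip."""
    restored = upsample_nearest(downsample(signal, factor), factor)
    return mape(signal, restored), mse(signal, restored)

# Synthetic smooth stand-in for a scatter profile (not real CBCT data).
scatter = [10.0 + math.sin(i / 8.0) for i in range(64)]
mape_4, mse_4 = down_up_error(scatter, 4)
mape_8, mse_8 = down_up_error(scatter, 8)
```

Because this stand-in signal varies slowly, the round-trip error stays small at factor 4 and grows at factor 8, which is the kind of trade-off a resolution sweep is meant to expose.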

eess.IV [Back]

[85] Dynamic Structural Recovery Parameters Enhance Prediction of Visual Outcomes After Macular Hole Surgery

Yinzheng Zhao,Zhihao Zhao,Rundong Jiang,Louisa Sackewitz,Quanmin Liang,Mathias Maier,Daniel Zapp,Peter Charbel Issa,Mohammad Ali Nasseri

Main category: eess.IV

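TL;DR: This paper proposes a multimodal deep learning model that incorporates dynamic structural parameters to predict visual recovery after macular hole surgery, improving prediction accuracy.

Details Motivation: Existing approaches to predicting visual recovery after macular hole surgery do not make full use of dynamic structural parameters, which limits prediction accuracy. The paper aims to fill this gap.

Contribution: 1. Introduces novel dynamic structural parameters; 2. Develops a multimodal deep learning model that combines clinical variables and OCT images, improving predictive performance.

Method: 1. Uses a publicly available longitudinal OCT dataset; 2. Designs a stage-specific segmentation model to extract structural and dynamic features; 3. Builds a multimodal deep learning model.

Result: Dynamic parameters improved logistic regression AUC, and the multimodal deep learning model outperformed logistic regression at every stage (AUC gains of up to 0.12).

Insight: Combining dynamic structural parameters with multimodal data improves predictive performance and offers a more precise tool for clinical decision-making.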

Abstract: Purpose: To introduce novel dynamic structural parameters and evaluate their integration within a multimodal deep learning (DL) framework for predicting postoperative visual recovery in idiopathic full-thickness macular hole (iFTMH) patients. Methods: We utilized a publicly available longitudinal OCT dataset at five stages (preoperative, 2 weeks, 3 months, 6 months, and 12 months). A stage specific segmentation model delineated related structures, and an automated pipeline extracted quantitative, composite, qualitative, and dynamic features. Binary logistic regression models, constructed with and without dynamic parameters, assessed their incremental predictive value for best-corrected visual acuity (BCVA). A multimodal DL model combining clinical variables, OCT-derived features, and raw OCT images was developed and benchmarked against regression models. Results: The segmentation model achieved high accuracy across all timepoints (mean Dice > 0.89). Univariate and multivariate analyses identified base diameter, ellipsoid zone integrity, and macular hole area as significant BCVA predictors (P < 0.05). Incorporating dynamic recovery rates consistently improved logistic regression AUC, especially at the 3-month follow-up. The multimodal DL model outperformed logistic regression, yielding higher AUCs and overall accuracy at each stage. The difference is as high as 0.12, demonstrating the complementary value of raw image volume and dynamic parameters. Conclusions: Integrating dynamic parameters into the multimodal DL model significantly enhances the accuracy of predictions. This fully automated process therefore represents a promising clinical decision support tool for personalized postoperative management in macular hole surgery.
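Since the comparison between models rests on AUC, a small sketch of how AUC can be computed from labels and predicted scores may help. It uses the pairwise (Mann-Whitney) definition: the fraction of positive-negative pairs in which the positive case receives the higher score. The function name and the toy labels and scores are made up for illustration and are not study data.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    labels are 0/1 outcomes; ties between a positive and a negative
    score count as half a win.
    """
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Toy example (hypothetical values): 1 = good visual outcome, 0 = poor outcome.
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.4, 0.5, 0.2, 0.7, 0.6]
auc = roc_auc(labels, scores)
```

In this toy case, seven of the nine positive-negative pairs are ranked correctly, so the AUC is 7/9. An AUC difference of 0.12 between two models therefore means about 12 percentage points more correctly ordered pairs.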

[86] Virtual staining for 3D X-ray histology of bone implants

Sarah C. Irvine,Christian Lucas,Diana Krüger,Bianca Guedert,Julian Moosmann,Berit Zeller-Plumhoff

Main category: eess.IV

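TL;DR: This paper proposes a deep learning-based method that extends virtual staining to 3D X-ray imaging. Cross-modality image translation produces virtually stained bone-implant samples, improving the biochemical specificity and interpretability of tissue images without physical sectioning or chemical staining.

Details Motivation: Conventional 2D histology requires physical sectioning and chemical staining. 3D X-ray histology offers non-invasive volumetric imaging, but its greyscale contrast limits biochemical specificity. Virtual staining could address this, but it has so far been applied mainly to optical images.

Contribution: The main contribution is applying virtual staining to 3D X-ray imaging for the first time, with a modified CycleGAN network that generates high-quality artificially stained slices from limited paired data and outperforms existing methods on several metrics.

Method: 1) Uses co-registered image pairs of synchrotron-radiation micro-CT scans and toluidine blue-stained histology; 2) modifies the CycleGAN network with pixelwise supervision and greyscale consistency terms; 3) applies patch-based training with data augmentation.

Result: The method outperforms the Pix2Pix and standard CycleGAN baselines on SSIM, PSNR, and LPIPS, and can generate virtually stained 3D datasets with high-resolution structural detail.

Insight: Although the method reproduces features such as new bone formation, its depiction of implant degradation layers remains variable, indicating a need for more training data and further refinement.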

Abstract: Three-dimensional X-ray histology techniques offer a non-invasive alternative to conventional 2D histology, enabling volumetric imaging of biological tissues without the need for physical sectioning or chemical staining. However, the inherent greyscale image contrast of X-ray tomography limits its biochemical specificity compared to traditional histological stains. Within digital pathology, deep learning-based virtual staining has demonstrated utility in simulating stained appearances from label-free optical images. In this study, we extend virtual staining to the X-ray domain by applying cross-modality image translation to generate artificially stained slices from synchrotron-radiation-based micro-CT scans. Using over 50 co-registered image pairs of micro-CT and toluidine blue-stained histology from bone-implant samples, we trained a modified CycleGAN network tailored for limited paired data. Whole slide histology images were downsampled to match the voxel size of the CT data, with on-the-fly data augmentation for patch-based training. The model incorporates pixelwise supervision and greyscale consistency terms, producing histologically realistic colour outputs while preserving high-resolution structural detail. Our method outperformed Pix2Pix and standard CycleGAN baselines across SSIM, PSNR, and LPIPS metrics. Once trained, the model can be applied to full CT volumes to generate virtually stained 3D datasets, enhancing interpretability without additional sample preparation. While features such as new bone formation were able to be reproduced, some variability in the depiction of implant degradation layers highlights the need for further training data and refinement. This work introduces virtual staining to 3D X-ray imaging and offers a scalable route for chemically informative, label-free tissue characterisation in biomedical research.
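Of the three metrics used for comparison, PSNR is the simplest to write down, so a short sketch is given here. It computes PSNR between a reference image and a generated one, both flattened to lists of 8-bit intensities. The function name and pixel values are assumptions for illustration; SSIM and LPIPS need considerably more machinery and are not shown.

```python
import math

def psnr(reference, generated, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    if len(reference) != len(generated):
        raise ValueError("images must have the same number of pixels")
    mse = sum((r - g) ** 2 for r, g in zip(reference, generated)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)

# Hypothetical 4-pixel patches: a stained reference and a virtual stain.
reference_patch = [0, 128, 255, 64]
generated_patch = [0, 130, 250, 64]
score = psnr(reference_patch, generated_patch)
```

Higher PSNR means the generated patch is closer to the reference in a pixelwise sense, which is why it is usually reported alongside perceptual metrics rather than on its own.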

[87] In-Loop Filtering Using Learned Look-Up Tables for Video Coding

Zhuoyuan Li,Jiacheng Li,Yao Li,Jialin Li,Li Li,Dong Liu,Feng Wu

Main category: eess.IV

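TL;DR: This paper proposes LUT-ILF++, a look-up table (LUT)-based in-loop filtering (ILF) framework. By caching the outputs of a deep neural network in LUTs, it substantially reduces computational complexity and storage requirements while retaining good coding gains.

Details Motivation: Neural network-based in-loop filters can substantially improve video coding quality, but their high computational complexity and hardware demands limit practical use, which motivates a more practical alternative.

Contribution: 1. Proposes a universal LUT-ILF++ framework in which multiple kinds of filtering LUTs cooperate and customized indexing mechanisms improve filtering;
2. Introduces a cross-component indexing mechanism that supports joint filtering of different color components;
3. Designs a LUT compaction scheme that further reduces storage overhead.

Method: A DNN with a restricted reference range is trained, all possible inputs are traversed, and its outputs are cached in LUTs. Filtering is then performed by indexing and interpolation, avoiding heavy DNN inference. Customized indexing and LUT compaction improve performance and storage efficiency.

Result: Implemented in the VVC reference software, the framework achieves average bitrate reductions of 0.82%/2.97%/1.63% under the AI configuration and 0.85%/4.11%/2.06% under the RA configuration, while substantially reducing time complexity and storage cost.

Insight: Combining the light weight of LUTs with the filtering capability of DNNs shows a viable way to balance performance and complexity in video coding and offers a new direction for practical deployment.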

Abstract: In-loop filtering (ILF) is a key technology in video coding standards to reduce artifacts and enhance visual quality. Recently, neural network-based ILF schemes have achieved remarkable coding gains, emerging as a powerful candidate for next-generation video coding standards. However, the use of deep neural networks (DNN) brings significant computational and time complexity or high demands for dedicated hardware, making it challenging for general use. To address this limitation, we study a practical ILF solution by adopting look-up tables (LUTs). After training a DNN with a restricted reference range for ILF, all possible inputs are traversed, and the output values of the DNN are cached into LUTs. During the coding process, the filtering process is performed by simply retrieving the filtered pixel through locating the input pixels and interpolating between the cached values, instead of relying on heavy inference computations. In this paper, we propose a universal LUT-based ILF framework, termed LUT-ILF++. First, we introduce the cooperation of multiple kinds of filtering LUTs and propose a series of customized indexing mechanisms to enable better filtering reference perception with limited storage consumption. Second, we propose the cross-component indexing mechanism to enable the filtering of different color components jointly. Third, in order to make our solution practical for coding uses, we propose the LUT compaction scheme to enable the LUT pruning, achieving a lower storage cost of the entire solution. The proposed framework is implemented in the VVC reference software. Experimental results show that the proposed framework achieves on average 0.82%/2.97%/1.63% and 0.85%/4.11%/2.06% bitrate reduction for common test sequences, under the AI and RA configurations, respectively. Compared to DNN-based solutions, our proposed solution has much lower time complexity and storage cost.
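The cache-then-interpolate idea can be sketched with a one-input toy: sample a function on a coarse grid, store the outputs in a table, and answer queries by linear interpolation between neighbouring entries. The `stand_in_filter` function below is an assumed placeholder for a trained DNN, and the single-pixel index, step size, and names are illustrative simplifications, not the indexing mechanisms of LUT-ILF++.

```python
def stand_in_filter(value):
    """Assumed placeholder for a trained filtering network (not the real DNN)."""
    return 255.0 * (value / 255.0) ** 0.9

def build_lut(fn, step=16, levels=256):
    """Cache fn at inputs 0, step, 2*step, ..., levels (inclusive)."""
    return [fn(v) for v in range(0, levels + 1, step)]

def lut_filter(lut, pixel, step=16):
    """Filter one 8-bit pixel by interpolating between two cached outputs."""
    index, remainder = divmod(pixel, step)
    fraction = remainder / step
    return (1.0 - fraction) * lut[index] + fraction * lut[index + 1]

lut = build_lut(stand_in_filter)
on_grid = lut_filter(lut, 32)    # falls exactly on a cached sample
off_grid = lut_filter(lut, 40)   # interpolated between samples 32 and 48
```

With a step of 16 the table holds only 17 entries instead of 256, trading a small interpolation error for storage. A real filter indexes on several neighbouring pixels, which is why table size grows quickly and compaction matters.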

cs.RO [Back]

[88] OmniEVA: Embodied Versatile Planner via Task-Adaptive 3D-Grounded and Embodiment-aware Reasoning

Yuecheng Liu,Dafeng Chi,Shiguang Wu,Zhanguang Zhang,Yuzheng Zhuang,Bowen Yang,He Zhu,Lingfeng Zhang,Pengwei Xie,David Gamaliel Arcos Bravo,Yingxue Zhang,Jianye Hao,Xingyue Quan

Main category: cs.RO

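TL;DR: OmniEVA is a versatile planner built on a multimodal large language model. Through task-adaptive 3D grounding and embodiment-aware reasoning, it addresses the geometric adaptability gap and the embodiment constraint gap in existing systems.

Details Motivation: Multimodal large language models (MLLMs) applied to embodied agents face two main problems: poor adaptability of spatial information (reliance on 2D inputs alone or on hard-coded 3D geometry injection) and neglect of embodiment constraints (the robot’s physical limits). OmniEVA aims to address both and improve the adaptability and feasibility of robot task planning.

Contribution: 1. Proposes a task-adaptive 3D grounding mechanism that uses a gated router to selectively regulate 3D fusion; 2. Designs an embodiment-aware reasoning framework that incorporates both task goals and embodiment constraints into the reasoning process.

Method: 1. Task-adaptive 3D grounding: a gated router dynamically regulates the fusion of 3D information; 2. Embodiment-aware reasoning: task goals and the robot’s physical constraints are modeled jointly.

Result: OmniEVA achieves state-of-the-art performance on general embodied reasoning and adapts well across a wide range of downstream scenarios.

Insight: Dynamically regulating 3D information and accounting for physical constraints improves task planning, suggesting that adaptivity is key to the generalization of embodied agents.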

Abstract: Recent advances in multimodal large language models (MLLMs) have opened new opportunities for embodied intelligence, enabling multimodal understanding, reasoning, and interaction, as well as continuous spatial decision-making. Nevertheless, current MLLM-based embodied systems face two critical limitations. First, Geometric Adaptability Gap: models trained solely on 2D inputs or with hard-coded 3D geometry injection suffer from either insufficient spatial information or restricted 2D generalization, leading to poor adaptability across tasks with diverse spatial demands. Second, Embodiment Constraint Gap: prior work often neglects the physical constraints and capacities of real robots, resulting in task plans that are theoretically valid but practically infeasible. To address these gaps, we introduce OmniEVA – an embodied versatile planner that enables advanced embodied reasoning and task planning through two pivotal innovations: (1) a Task-Adaptive 3D Grounding mechanism, which introduces a gated router to perform explicit selective regulation of 3D fusion based on contextual requirements, enabling context-aware 3D grounding for diverse embodied tasks; and (2) an Embodiment-Aware Reasoning framework that jointly incorporates task goals and embodiment constraints into the reasoning loop, resulting in planning decisions that are both goal-directed and executable. Extensive experimental results demonstrate that OmniEVA not only achieves state-of-the-art general embodied reasoning performance, but also exhibits a strong ability across a wide range of downstream scenarios. Evaluations on a suite of proposed embodied benchmarks, including both primitive and composite tasks, confirm its robust and versatile planning capabilities. Project page: https://omnieva.github.io
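One generic way to picture a gated router that regulates 3D fusion is a scalar gate, computed from a context vector, that scales how much of a 3D feature is added to a 2D feature. The sketch below shows only that general pattern. The sigmoid gate, the additive fusion, the function names, and all numbers are assumptions; the abstract does not specify OmniEVA's actual router design.

```python
import math

def sigmoid(x):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def gated_fusion(feat_2d, feat_3d, context, gate_weights, gate_bias=0.0):
    """Add 3D features to 2D features, scaled by a context-dependent gate.

    Hypothetical illustration: the gate is a sigmoid over a linear
    function of the context vector.
    """
    logit = sum(w * c for w, c in zip(gate_weights, context)) + gate_bias
    gate = sigmoid(logit)
    fused = [a + gate * b for a, b in zip(feat_2d, feat_3d)]
    return fused, gate

# Made-up features and context for two cases.
half_fused, half_gate = gated_fusion(
    [1.0, 2.0], [10.0, 10.0], context=[0.0, 0.0], gate_weights=[0.3, -0.2])
closed_fused, closed_gate = gated_fusion(
    [1.0, 2.0], [10.0, 10.0], context=[0.0, 0.0], gate_weights=[0.3, -0.2],
    gate_bias=-50.0)
```

With a strongly negative bias the gate closes and the output is effectively the 2D feature alone, the behaviour one would want for tasks that do not need 3D geometry.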

[89] SimpleVLA-RL: Scaling VLA Training via Reinforcement Learning

Haozhan Li,Yuxin Zuo,Jiale Yu,Yuhao Zhang,Zhaohui Yang,Kaiyan Zhang,Xuekai Zhu,Yuchen Zhang,Tianxing Chen,Ganqu Cui,Dehui Wang,Dingxiang Luo,Yuchen Fan,Youbang Sun,Jia Zeng,Jiangmiao Pang,Shanghang Zhang,Yu Wang,Yao Mu,Bowen Zhou,Ning Ding

Main category: cs.RO

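TL;DR: SimpleVLA-RL is a reinforcement learning framework for improving the long-horizon action planning of Vision-Language-Action (VLA) models, reducing dependence on large-scale data and improving generalization.

Details Motivation: Current VLA models rely on costly large-scale human-operated trajectory data for supervised fine-tuning (SFT) and generalize poorly under distribution shift. Inspired by Large Reasoning Models (LRMs), whose reasoning improves with reinforcement learning, the paper explores whether RL can likewise improve action planning in VLA models.

Contribution: 1. Proposes SimpleVLA-RL, an efficient RL framework tailored to VLA models; 2. Introduces exploration-enhancing strategies that clearly outperform SFT and baseline models.

Method: 1. Builds on veRL with VLA-specific trajectory sampling; 2. Implements scalable parallelization, multi-environment rendering, and optimized loss computation.

Result: Achieves SoTA performance on LIBERO and outperforms $\pi_0$ on RoboTwin 1.0&2.0. The authors also identify a new phenomenon during RL training, which they call “pushcut”.

Insight: Reinforcement learning not only reduces VLA models’ dependence on large-scale data but also uncovers action patterns not seen earlier in training, and it improves performance on real-world tasks.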

Abstract: Vision-Language-Action (VLA) models have recently emerged as a powerful paradigm for robotic manipulation. Despite substantial progress enabled by large-scale pretraining and supervised fine-tuning (SFT), these models face two fundamental challenges: (i) the scarcity and high cost of large-scale human-operated robotic trajectories required for SFT scaling, and (ii) limited generalization to tasks involving distribution shift. Recent breakthroughs in Large Reasoning Models (LRMs) demonstrate that reinforcement learning (RL) can dramatically enhance step-by-step reasoning capabilities, raising a natural question: Can RL similarly improve the long-horizon step-by-step action planning of VLA? In this work, we introduce SimpleVLA-RL, an efficient RL framework tailored for VLA models. Building upon veRL, we introduce VLA-specific trajectory sampling, scalable parallelization, multi-environment rendering, and optimized loss computation. When applied to OpenVLA-OFT, SimpleVLA-RL achieves SoTA performance on LIBERO and even outperforms $\pi_0$ on RoboTwin 1.0&2.0 with the exploration-enhancing strategies we introduce. SimpleVLA-RL not only reduces dependence on large-scale data and enables robust generalization, but also remarkably surpasses SFT in real-world tasks. Moreover, we identify a novel phenomenon “pushcut” during RL training, wherein the policy discovers previously unseen patterns beyond those seen in the previous training process. Github: https://github.com/PRIME-RL/SimpleVLA-RL

[90] ObjectReact: Learning Object-Relative Control for Visual Navigation

Sourav Garg,Dustin Craggs,Vineeth Bhat,Lachlan Mares,Stefan Podgorski,Madhava Krishna,Feras Dayoub,Ian Reid

Main category: cs.RO

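TL;DR: The paper proposes a visual navigation method based on object-relative control, replacing the conventional image-relative approach with object-level representations to improve generalization and cross-scene adaptability.

Details Motivation: Conventional visual navigation relies on image-relative control, whose performance is limited by changes in viewpoint and robot pose. Object-level representations are more stable and generalize better across scenes.

Contribution: 1. Proposes a new paradigm of object-relative control for visual navigation; 2. Designs a topometric map representation in the form of a relative 3D scene graph; 3. Trains a local controller, ObjectReact, that needs no explicit RGB input.

Method: 1. Uses the relative 3D scene graph for global path planning; 2. Trains the local controller ObjectReact to predict control from a WayObject Costmap; 3. Uses object-level representations to generalize across scenes.

Result: Experiments show that object-relative control outperforms the conventional image-relative approach under viewpoint changes and on tasks such as reverse-direction navigation, and that a model trained only in simulation generalizes well to real indoor environments.

Insight: Object-level representations give a more stable basis for navigation control, reduce dependence on viewpoint and robot pose, and suit cross-scene deployment.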

Abstract: Visual navigation using only a single camera and a topological map has recently become an appealing alternative to methods that require additional sensors and 3D maps. This is typically achieved through an “image-relative” approach to estimating control from a given pair of current observation and subgoal image. However, image-level representations of the world have limitations because images are strictly tied to the agent’s pose and embodiment. In contrast, objects, being a property of the map, offer an embodiment- and trajectory-invariant world representation. In this work, we present a new paradigm of learning “object-relative” control that exhibits several desirable characteristics: a) new routes can be traversed without strictly requiring to imitate prior experience, b) the control prediction problem can be decoupled from solving the image matching problem, and c) high invariance can be achieved in cross-embodiment deployment for variations across both training-testing and mapping-execution settings. We propose a topometric map representation in the form of a “relative” 3D scene graph, which is used to obtain more informative object-level global path planning costs. We train a local controller, dubbed “ObjectReact”, conditioned directly on a high-level “WayObject Costmap” representation that eliminates the need for an explicit RGB input. We demonstrate the advantages of learning object-relative control over its image-relative counterpart across sensor height variations and multiple navigation tasks that challenge the underlying spatial understanding capability, e.g., navigating a map trajectory in the reverse direction. We further show that our sim-only policy is able to generalize well to real-world indoor environments. Code and supplementary material are accessible via project page: https://object-react.github.io/

[91] Dexplore: Scalable Neural Control for Dexterous Manipulation from Reference-Scoped Exploration

Sirui Xu,Yu-Wei Chao,Liuyu Bian,Arsalan Mousavian,Yu-Xiong Wang,Liang-Yan Gui,Wei Yang

Main category: cs.RO

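TL;DR: Dexplore proposes a unified single-loop optimization that learns robot control policies directly from motion-capture data, avoiding the error accumulation of the conventional three-stage workflow.

Details Motivation: Existing methods for learning robot control from motion-capture data follow a three-stage workflow (retargeting, tracking, and residual correction) that tends to compound errors and underuse the data. Dexplore addresses these problems with a unified single-loop optimization.

Contribution: 1. Proposes Dexplore, which unifies retargeting and tracking in a single-loop optimization; 2. Treats motion-capture data as soft guidance rather than ground truth; 3. Trains policies with adaptive spatial scopes to improve robustness; 4. Distills the scaled policy into a vision-based generative controller that supports generalization and real-world deployment.

Method: Dexplore uses reinforcement learning to jointly optimize retargeting and tracking, derives adaptive spatial scopes from raw trajectories, and trains the policy to stay within those scopes while accomplishing the task. The resulting policy is then distilled into a vision-based conditional generative controller.

Result: Dexplore preserves demonstration intent while allowing robot-specific strategies to emerge, improves robustness to noise, and scales to large demonstration corpora. The resulting vision-based controller supports generalization across objects and real-world deployment.

Insight: Treating demonstrations as soft guidance rather than ground truth, combined with training under adaptive scopes, is an effective way to use imperfect demonstrations while avoiding the error accumulation of multi-stage pipelines.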

Abstract: Hand-object motion-capture (MoCap) repositories offer large-scale, contact-rich demonstrations and hold promise for scaling dexterous robotic manipulation. Yet demonstration inaccuracies and embodiment gaps between human and robot hands limit the straightforward use of these data. Existing methods adopt a three-stage workflow, including retargeting, tracking, and residual correction, which often leaves demonstrations underused and compounds errors across stages. We introduce Dexplore, a unified single-loop optimization that jointly performs retargeting and tracking to learn robot control policies directly from MoCap at scale. Rather than treating demonstrations as ground truth, we use them as soft guidance. From raw trajectories, we derive adaptive spatial scopes, and train with reinforcement learning to keep the policy in-scope while minimizing control effort and accomplishing the task. This unified formulation preserves demonstration intent, enables robot-specific strategies to emerge, improves robustness to noise, and scales to large demonstration corpora. We distill the scaled tracking policy into a vision-based, skill-conditioned generative controller that encodes diverse manipulation skills in a rich latent representation, supporting generalization across objects and real-world deployment. Taken together, these contributions position Dexplore as a principled bridge that transforms imperfect demonstrations into effective training signals for dexterous manipulation.